The Syracuse University Special Collections Research Center has also been so kind as to provide the XML source files for their finding aids for use in the ArchivesZ project. I loaded 572 finding aids and no errors were generated during the parsing of the XML files.
My scripts extracted 6632 unique ‘tags’ from the subjects assigned to the finding aids. As part of parsing and loading the data for the visualizations, the script divides compound subjects into tags. For example, the subjects assigned to Syracuse University finding aids include these values (the number shown is the number of finding aids to which that subject is assigned):
- Art — American — 20th century (1)
- Art — Cartoonists (68)
- Art — Cartoonists. (3)
- Art — Exhibitions. (1)
- Art — Illustrators (36)
- Art — Illustrators. (1)
- Art — Painters (77)
- Art — Philosophy. (1)
- Art — Sculpture (33)
There are also subjects whose components are separated by commas, such as these (the number listed is the total number of finding aids assigned that subject):
- Art, American (33)
- Art, American. (46)
- Art, American, 20th century (28)
- Art, American, 20th century. (31)
- Art, Cuban, 20th century (1)
- Art, Modern (1)
- Art, French, 20th century. (1)
The goal is to capture the core ideas – to capture the overlap in subject matter among diverse collections. All of the collections with any of these subjects are about Art. With the current script, the tag Art is associated with 179 collections from Syracuse University. You can see from this tiny subset of subjects that other themes would be revealed when these subjects were decomposed more completely – and this just scratches the surface.
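To make the decomposition concrete, here is a minimal Python sketch of splitting subjects on the double-dash pattern and collecting the collections that share each tag. It is not the actual ArchivesZ script: the function names, the collection ids, and the sample data are all made up for illustration, and it assumes the separator arrives as a plain `--`.

```python
from collections import defaultdict

def split_subject(subject):
    """Split a compound subject on the double-dash separator into tags."""
    return [part.strip() for part in subject.split("--") if part.strip()]

def tags_by_collection(subjects_by_collection):
    """Map each tag to the set of collection ids whose subjects contain it."""
    index = defaultdict(set)
    for collection_id, subjects in subjects_by_collection.items():
        for subject in subjects:
            for tag in split_subject(subject):
                index[tag].add(collection_id)
    return index

# Hypothetical sample data: three collections and their assigned subjects.
sample = {
    "coll-1": ["Art -- Cartoonists"],
    "coll-2": ["Art -- Painters", "Art -- Sculpture"],
    "coll-3": ["Art -- Painters"],
}

index = tags_by_collection(sample)
print(len(index["Art"]))       # 3
print(len(index["Painters"]))  # 2
```

Using a set per tag means a collection is counted once for a tag even when several of its subjects contain that tag.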
Out of the 6676 subjects, 5658 subjects are assigned to single collections. Out of the 6632 tags the current script extracted from those subjects, 5594 tags are assigned to single collections. Not much improvement with the current state of the script.
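To put those counts side by side, this short Python snippet works out the share of single-collection values before and after splitting. The numbers are the ones quoted above; the variable names are just for illustration.

```python
# Counts quoted above for the Syracuse University data.
subjects_total, subjects_single = 6676, 5658
tags_total, tags_single = 6632, 5594

subject_share = subjects_single / subjects_total
tag_share = tags_single / tags_total

print(f"subjects assigned to one collection: {subject_share:.1%}")  # 84.8%
print(f"tags assigned to one collection:     {tag_share:.1%}")      # 84.3%
```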
While the script currently does a good job with the Library of Congress double-dash separation pattern, the Syracuse University data has shown me a number of other standard patterns that need to be handled, as the small sampling of art-related subjects above shows. The easy change is removing periods and stripping spaces from the end of subject values. The harder change will be to implement smart separation of subjects into tags based on commas. The code would need to break up only <subject> values while leaving <persname> and <corpname> alone. I will also need to examine <geogname> values from across various institutions to decide whether it is better to break them up or leave them be.
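Here is a rough Python sketch of what those two changes might look like together. It is an assumption about one possible approach, not the current script: `clean`, `to_tags`, and `KEEP_WHOLE` are hypothetical names, the personal name in the usage lines is invented, and the comma handling is a naive split rather than the "smart" separation I ultimately want. Because <geogname> is still undecided, the set of elements to leave whole is a parameter.

```python
import re

# Elements whose values are left whole (per the text above).
# Whether geogname belongs here is still an open question.
KEEP_WHOLE = frozenset({"persname", "corpname"})

def clean(value):
    """Strip surrounding whitespace and any trailing periods."""
    return value.strip().rstrip(". ")

def to_tags(element_name, value, keep_whole=KEEP_WHOLE):
    """Turn one EAD element value into a list of tags."""
    if element_name in keep_whole:
        # Names contain commas that are part of the name, so do not split.
        return [value.strip()]
    # Split on the double dash or on commas, then tidy each piece.
    parts = re.split(r"\s*--\s*|\s*,\s*", value)
    return [clean(part) for part in parts if clean(part)]

print(to_tags("subject", "Art, American, 20th century."))
# ['Art', 'American', '20th century']
print(to_tags("subject", "Art -- Cartoonists. "))
# ['Art', 'Cartoonists']
print(to_tags("persname", "Example, Pat, 1900-1980"))
# ['Example, Pat, 1900-1980']
```

With this in place, "Art, American" and "Art, American." collapse to the same pair of tags instead of counting as two different subjects.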
Other than these subject issues, there are a few other script modifications that I will need to make based on scenarios the data in the Syracuse finding aids has shown me:
- Syracuse University uses an entity to populate the repository values – the current script does not handle this at all.
- Single-item collections must be assigned a size of .25 linear feet.
- ‘Linear ft’ must be added as another recognized abbreviation for linear feet.
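The two size-related items could be sketched in Python along these lines. This is only an illustration under assumptions: the list of unit spellings other than "linear ft" is my guess at what a parser might accept, the "1 item" pattern is a hypothetical way of spotting a single-item collection, and `extent_in_linear_feet` is an invented name. The single-item size is a parameter so the default of .25 can be changed.

```python
import re

# Unit spellings treated as linear feet. "linear ft" is the one named above;
# the other spellings are assumptions for this sketch.
LINEAR_FEET_UNITS = r"(?:linear\s+feet|linear\s+foot|linear\s+ft\.?|l\.f\.)"
EXTENT_RE = re.compile(r"(\d*\.?\d+)\s*" + LINEAR_FEET_UNITS, re.IGNORECASE)

# Hypothetical pattern for an extent that describes a single item.
SINGLE_ITEM_RE = re.compile(r"^\s*1\s+item\b", re.IGNORECASE)

def extent_in_linear_feet(text, single_item_size=0.25):
    """Return the extent in linear feet, or None if it cannot be read."""
    match = EXTENT_RE.search(text)
    if match:
        return float(match.group(1))
    if SINGLE_ITEM_RE.match(text):
        return single_item_size
    return None

print(extent_in_linear_feet("2.5 linear ft."))   # 2.5
print(extent_in_linear_feet("12 linear feet"))   # 12.0
print(extent_in_linear_feet("1 item"))           # 0.25
print(extent_in_linear_feet("3 boxes"))          # None
```

Returning `None` for extents the pattern does not recognize keeps unparsed values visible instead of silently treating them as zero.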
All these issues are being added to my master ‘to do’ list for updating the EAD parsing script. Onward to the next data set.
Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via my contact form.
Image Credit: Syracuse University image above from Syracuse University Special Collections Research Center home page.
What’s your thinking in assigning .25 l.f. to a single-item collection? Seems like way too much.
You are probably right, but at the moment the smallest unit I had been using was 1/4 of a linear foot. I guess I could go with .1 or .01 linear feet for a single item. Sound better?
Glad you managed to wade through our data (and happy there were no XML errors!) Amazing that so many (5658) subjects are assigned to single collections! I had no idea we had so many unique subject headings. When you parse them at the double-dash I wonder if that will change, since the different pieces of an LCSH heading will be able to mix and match?
Re. the size of the small collections: we use 0.1 linear ft.; they’re not actually physically that big, most of them, but there is a minimum amount of time and effort that goes into processing and EAD creation, no matter how small the collection, and we figured that 0.1 reflected that fairly accurately.
FYI, the art-related subjects that you noted are all @source=local and are used to create a list of the core subject areas in which we collect (see the drop-down subject list here or the full subject list here). You could probably omit them without distorting the data, since they’re also reflected more formally in the @source=lcsh subject tags.
Correction: Should have said, “Some of the art-related subjects you mention, as well as a few other common subject headings, are @source=local…” Too hasty first thing in the morning!