ArchivesZ Data Challenges: Princeton University

I received a zip file of 1,771 EAD-encoded finding aids from the kind EAD enthusiasts at the Seeley G. Mudd Manuscript Library. These finding aids came from five divisions within Princeton’s Library.

So onward to the data issues and what they mean for my ever-growing ‘script fix to-do list’.

Repository Names

As we saw with the Oregon State University finding aids, the finding aids from Princeton University had a wide range of different values for repository names. In the list below we spot some issues. Some end in periods, some do not. One has extra space (probably a carriage return) in the middle. One does not include Princeton in the repository name. Once we have many repositories’ finding aids in ArchivesZ, a repository name of ‘Engineering Library’ does not tell the user enough about where those collections can be found.

Here is the list of repository titles my script extracted:

Princeton University Library. Department of Rare Books and Special Collections.
Engineering Library
Princeton University Library
Princeton University Library. Department of Rare Books and Special Collections.
Princeton University Library.

My script can handle the extra period and the extra spaces, but the non-specific name would need to ultimately be fixed on the source side.
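As a minimal sketch of the period and whitespace cleanup described above, assuming a Python implementation (the function name normalize_repository_name is hypothetical and not taken from the ArchivesZ script):

```python
def normalize_repository_name(raw: str) -> str:
    """Collapse internal whitespace and drop one trailing period."""
    # str.split() with no arguments splits on any run of whitespace,
    # including carriage returns, newlines and tabs.
    name = " ".join(raw.split())
    # Drop a single final period so both spellings compare equal.
    if name.endswith("."):
        name = name[:-1].rstrip()
    return name


names = [
    "Princeton University Library. Department of Rare Books\n   and Special Collections.",
    "Princeton University Library.",
    "Princeton University Library",
]
# The last two inputs collapse to the same value.
print(sorted({normalize_repository_name(n) for n in names}))
```

A name such as ‘Engineering Library’ passes through unchanged, which is why that case still has to be fixed at the source.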

Collection Size

The current script assumes that there is only one extent value specified to express the size of the collection. Princeton’s finding aids showed me examples of multiple extent values. For example, the Christina Georgina Rossetti Collection has both a collection level size of 0.4 linear feet (1 archival box) as well as a 2nd extent specification corresponding to a specific folder with the value of (1 poem, 3 drawings, 1 photo, 1 incomplete article). The script must be modified to only consider the collection level size.
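One way to restrict the lookup to the collection-level size is to anchor the path at the top-level did element instead of searching for any extent in the file. The sketch below assumes Python's standard xml.etree.ElementTree, an EAD file without an XML namespace, and an invented sample document. Only the two extent strings come from the Rossetti finding aid; the surrounding structure is a guess.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample; the real finding aid will differ.
SAMPLE = """
<ead>
  <archdesc level="collection">
    <did>
      <physdesc><extent>0.4 linear feet (1 archival box)</extent></physdesc>
    </did>
    <dsc>
      <c01 level="file">
        <did>
          <physdesc><extent>(1 poem, 3 drawings, 1 photo, 1 incomplete article)</extent></physdesc>
        </did>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def collection_extent(root):
    """Return the extent recorded directly under archdesc/did, or None."""
    # The path starts at archdesc/did, so extents nested inside
    # dsc components (c01, c02, ...) are never matched.
    node = root.find("archdesc/did/physdesc/extent")
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())


root = ET.fromstring(SAMPLE.strip())
print(collection_extent(root))
```

Files that declare the EAD namespace would need namespace-qualified paths, which this sketch leaves out.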

Complicated Titles

The current script logic apparently does not handle what I would call ‘complicated collection titles’. For example, I ended up with “Edward Livingston Papers, ” as the title for a collection with a full title of Edward Livingston Papers, 1683-1877 (bulk 1764-1836). This is the way that this title is encoded:

<unittitle encodinganalog="245$a" label="Title and dates: ">Edward Livingston Papers, <unitdate encodinganalog="245$f" normal="1683/1877" type="inclusive">1683-1877</unitdate> (bulk <unitdate encodinganalog="245$g" normal="1764/1836" type="bulk">1764-1836</unitdate>)</unittitle>
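A truncated title like this is what results when only the text before the first child element is read. That may or may not be what the current script does, but the sketch below shows the effect and one possible fix, assuming Python's xml.etree.ElementTree (full_title is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET

UNITTITLE = (
    '<unittitle encodinganalog="245$a" label="Title and dates: ">'
    'Edward Livingston Papers, '
    '<unitdate encodinganalog="245$f" normal="1683/1877" type="inclusive">1683-1877</unitdate>'
    ' (bulk '
    '<unitdate encodinganalog="245$g" normal="1764/1836" type="bulk">1764-1836</unitdate>'
    ')</unittitle>'
)


def full_title(unittitle):
    """Join the text of the element and all of its descendants."""
    # itertext() yields the element's own text, each child's text,
    # and the "tail" text that follows each child, in document order.
    return " ".join("".join(unittitle.itertext()).split())


element = ET.fromstring(UNITTITLE)
print(repr(element.text))   # 'Edward Livingston Papers, '
print(full_title(element))  # Edward Livingston Papers, 1683-1877 (bulk 1764-1836)
```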

Too Many Tags

The Engineering Library’s Department of Mechanical and Aerospace Engineering Technical Reports: Finding Aid has 522 tags assigned to it! Almost all of these are the names of the authors of the individual reports. This scenario goes on the list of reasons why I might choose not to include persname subjects (at least for this version). The other option for handling this situation is to use only the subjects assigned at the collection level and ignore subjects assigned at lower unit/container levels. Without the author tags, this single collection ends up with this nice, reasonable list of tags:

Fluid mechanics
Mechanical engineering
Combustion
Aerospace engineering
Propulsion systems
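Both options can be combined in one pass: read only the controlaccess blocks that sit directly under archdesc, and skip persname entries. The following is a sketch under assumptions, not the ArchivesZ code. It uses Python's xml.etree.ElementTree, assumes no XML namespace, and runs against an invented sample with placeholder author names.

```python
import xml.etree.ElementTree as ET

# EAD 2002 access-point elements that can appear inside controlaccess.
ACCESS_POINT_TAGS = {
    "subject", "geogname", "corpname", "persname", "famname",
    "genreform", "occupation", "function", "title", "name",
}

# Invented sample; the author names are placeholders.
SAMPLE = """
<ead>
  <archdesc level="collection">
    <controlaccess>
      <subject>Fluid mechanics</subject>
      <subject>Combustion</subject>
      <persname>Placeholder, Author A.</persname>
    </controlaccess>
    <dsc>
      <c01 level="item">
        <controlaccess><persname>Placeholder, Author B.</persname></controlaccess>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def collection_level_tags(root, skip=("persname",)):
    """Collect access points directly under archdesc, minus skipped tags."""
    tags = []
    # Only controlaccess elements that are direct children of archdesc
    # are visited, so component-level entries under dsc are ignored.
    for block in root.findall("archdesc/controlaccess"):
        for node in block.iter():
            if node.tag in ACCESS_POINT_TAGS and node.tag not in skip:
                text = " ".join("".join(node.itertext()).split())
                if text:
                    tags.append(text)
    return tags


print(collection_level_tags(ET.fromstring(SAMPLE.strip())))
```

Passing skip=() keeps collection-level personal names while still ignoring the per-report authors.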

Year Challenges

I found two different issues related to year ranges:

Women in Argentina, VI, 1989-2001: Finding Aid: The current script does not properly extract the inclusive dates, which are encoded within the titleproper tags; it assumes instead that they will be encoded using a unitdate tag.
An assortment of finding aids include subjects which have year spans as part of the subject. When these subjects are decomposed into tags, we end up with tags like ‘1850-1950’. Since we have the time period communicated via the inclusive dates, I will likely just drop these portions of the subjects rather than create a tag for each unique year span.
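Both year issues come down to recognizing a four-digit year span in free text. The sketch below is one hedged way to do that in Python with a regular expression. The pattern and function names are assumptions, and it only recognizes spans written with a plain hyphen.

```python
import re

# Two four-digit years separated by a hyphen, e.g. "1989-2001".
YEAR_SPAN = re.compile(r"\b(\d{4})\s*-\s*(\d{4})\b")


def years_from_title(title):
    """Fallback: pull (start, end) out of a title when no unitdate exists."""
    match = YEAR_SPAN.search(title)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def drop_year_span_tags(tags):
    """Remove tags that consist of nothing but a year span."""
    return [t for t in tags if not YEAR_SPAN.fullmatch(t.strip())]


print(years_from_title("Women in Argentina, VI, 1989-2001: Finding Aid"))
print(drop_year_span_tags(["Placeholder topic", "1850-1950"]))
```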

General Code Fixes

It is reassuring at this point to spot the same issues with data from multiple repositories. Here are data and code logic issues that I have seen elsewhere that are revalidated by Princeton’s finding aids:

Need to strip \n and \t (newline and tab) characters
Need to break subjects up based on commas
Need to drop final periods from repository names, subjects and titles
Need to handle sizes designated in volumes, as in “793 volumes”. I need to pick an approach for translating from volumes to linear feet
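The fixes in this list could be sketched in Python roughly as follows. The helper names are hypothetical. Because no volumes-to-linear-feet approach has been chosen yet, the extent parser only separates the number from the unit and leaves any conversion to the caller.

```python
import re

EXTENT = re.compile(r"(\d+(?:\.\d+)?)\s+(volumes?|linear\s+feet|linear\s+foot)", re.IGNORECASE)


def clean(text):
    """Collapse newlines, tabs and repeated spaces; drop final periods."""
    return " ".join(text.split()).rstrip(".").rstrip()


def split_subject(subject):
    """Break a subject string into tags on commas, skipping empty pieces."""
    pieces = (clean(piece) for piece in subject.split(","))
    return [piece for piece in pieces if piece]


def parse_extent(extent):
    """Return (number, unit) for the first recognized size, or None."""
    match = EXTENT.search(extent)
    if match is None:
        return None
    unit = " ".join(match.group(2).lower().split())
    return float(match.group(1)), unit


print(split_subject("Fluid mechanics,\n\tCombustion."))
print(parse_extent("793 volumes"))
print(parse_extent("0.4 linear feet (1 archival box)"))
```

Dropping every final period also trims abbreviations that legitimately end in one, so a real version would probably need an exception list.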

The script to-do list is still getting longer, but I am not done cycling through new institutions’ XML files to find new issues. Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via my contact form.

Image Credit: Top image from the Seeley G. Mudd Manuscript Library homepage.