Gina Strack of the Utah State Archives and Records Service provided me with access to the XML of 1,196 EAD-encoded finding aids. These EAD 2.0 XML files are the product of a grant-funded project, completed last year, to migrate from EAD 1.0 finding aids. Their website includes a detailed account of the EAD Project.
These finding aids have helped me identify three types of ArchivesZ data challenges:
- strange characters
- broad composite subjects
- determination of accurate collection size
Strange and mysterious characters!
These finding aids use a special character in the place of the standard Library of Congress double dash which normally appears between subsections of the subject heading.
An example subject from the Utah Government XML looks like this:
Women—Suffrage—Utah.
Viewing the same subject in a pure text editor (such as vi):
Women—Suffrage—Utah.
By the time it gets into my database and is pulled out via a query in MySQL Query Browser it looks like this:
Women?¢‚Ǩ‚ÄùSuffrage?¢‚Ǩ‚ÄùUtah.
Rather than just stripping out all instances of —, my plan is to replace them with the standard Library of Congress double dash. This will ensure that the existing code that breaks the subjects down to tags will still work.
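The replacement described above can be sketched roughly as follows. This is a minimal illustration, not the actual extraction script: the function names are hypothetical, and it assumes the separator arrives as an em dash (U+2014) or en dash (U+2013) and that a trailing period can be dropped before splitting.

```python
import re

# Assumption: the separators seen in the finding aids are the em dash
# (U+2014) or the en dash (U+2013). Other characters may need adding.
DASH_VARIANTS = re.compile("[\u2013\u2014]")


def normalize_subject(subject: str) -> str:
    """Replace dash variants with the Library of Congress double dash."""
    return DASH_VARIANTS.sub("--", subject)


def subject_to_tags(subject: str) -> list[str]:
    """Split a normalized subject into tags on the double dash."""
    # Assumption: a trailing period is punctuation, not part of the last tag.
    normalized = normalize_subject(subject).rstrip(".")
    return [part.strip() for part in normalized.split("--") if part.strip()]


if __name__ == "__main__":
    print(subject_to_tags("Women\u2014Suffrage\u2014Utah."))
```

Normalizing first, rather than splitting directly on the special character, keeps any code that already expects the double dash unchanged.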
Composite Subjects
When I say “composite subject” what I mean is a subject that includes multiple very disparate terms. Rather than the Library of Congress style subjects, all aspects of which relate to the collection in question, these composite subjects cover multiple subjects which are grouped together for convenience.
This is a list of some of the most popular subjects for the Utah Gov collections:
- Politics, Government, and Law
- Business, Industry, Labor, and Commerce
- Science, Technology, and Health
- Arts, Humanities, and Social Sciences
These subjects throw a monkey wrench into my theories about decomposing subjects based on commas. The collections to which these subjects are assigned likely fit in only one of the component themes. For example, the “Inventory of Publications from Department of Technology Services, 1993-2008” is assigned the subject “Science, Technology, and Health”. If I divide this subject into 3 separate tags, the Science and Health tags would be quite misleading.
So that leaves me a bit trapped. If I want to divide subjects such as “Art, Cuban, 20th century”, as I discuss in my Syracuse University post, then I end up also dividing these umbrella subjects which separate such very divergent terms with commas.
This issue goes on my list of reasons to add a repository configuration file for use by the data extraction script.
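One way such a configuration might look is sketched below. Everything here is an assumption for illustration: the dictionary layout, the `utah_gov` key, and the `keep_whole` setting are made up, and the real script may end up handling this quite differently.

```python
# Hypothetical per-repository configuration. The keys and structure are
# illustrative only; they are not taken from the real extraction script.
REPOSITORY_CONFIG = {
    "utah_gov": {
        "split_on_commas": True,
        # Umbrella subjects that should never be broken apart on commas.
        "keep_whole": {
            "Politics, Government, and Law",
            "Business, Industry, Labor, and Commerce",
            "Science, Technology, and Health",
            "Arts, Humanities, and Social Sciences",
        },
    },
}


def decompose_subject(subject: str, repository: str) -> list[str]:
    """Split a subject on commas unless the repository config says not to."""
    config = REPOSITORY_CONFIG.get(repository, {})
    if not config.get("split_on_commas", True):
        return [subject]
    if subject in config.get("keep_whole", set()):
        return [subject]
    return [part.strip() for part in subject.split(",") if part.strip()]


if __name__ == "__main__":
    print(decompose_subject("Science, Technology, and Health", "utah_gov"))
    print(decompose_subject("Art, Cuban, 20th century", "utah_gov"))
```

With this approach the comma logic stays the same for every repository, and only the list of exceptions changes.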
Accurate Collection Size
In my quest to convert all sizes to linear feet, sizes such as these are challenging:
- 0.20 cubic foot and 1 microfilm reel
- 0.35 cubic foot and 2 microfilm reels
I also have situations where sizes are specified in multiple sections of the finding aid. The Inventory of ALERT Foundation records from Governor Bangerter, 1986-1991 has a collection level size of “0.50 cubic foot and 2 microfilm reels”, but further down in this finding aid I see this:
series: ALERT Foundation records
- box 1, folder 1: Documentary: “Letters from our Children,” Motion picture film reel, 16mm
- box 1, folder 2: Documentary: “Letters from our Children,” VHS videocassette
- box 1, folder 3: Documentary: “Letters from our Children,” VHS videocassette
- box 1, folder 4: Documentary: “Letters from our Children,” VHS videocassette
When they said 2 microfilm reels – do they really mean a 16mm motion picture film reel and a VHS videocassette? Is there 1 VHS videocassette or 3? How sizes are specified in a specific repository’s finding aids is another possible candidate for a repository level configuration script.
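As a first step, a compound size statement can at least be broken into quantity and unit pairs. The sketch below is an assumption about how that parsing could work: the function name and the alias table are hypothetical, and it deliberately stops short of converting anything to linear feet, since no conversion factor for cubic feet or microfilm reels is established here.

```python
import re

# Matches a number followed by a unit name, e.g. "0.35 cubic foot".
EXTENT_PART = re.compile(r"(\d+(?:\.\d+)?)\s+([A-Za-z ]+)")

# Hypothetical alias table mapping plural forms to one canonical unit name.
UNIT_ALIASES = {
    "cubic feet": "cubic foot",
    "microfilm reels": "microfilm reel",
}


def parse_extent(extent: str) -> list[tuple[float, str]]:
    """Split a statement like '0.20 cubic foot and 1 microfilm reel'
    into (quantity, unit) pairs."""
    parts = []
    for chunk in extent.split(" and "):
        match = EXTENT_PART.fullmatch(chunk.strip())
        if match is None:
            raise ValueError(f"Unrecognized extent: {chunk!r}")
        unit = match.group(2).strip().lower()
        parts.append((float(match.group(1)), UNIT_ALIASES.get(unit, unit)))
    return parts


if __name__ == "__main__":
    print(parse_extent("0.35 cubic foot and 2 microfilm reels"))
```

Raising an error on anything unrecognized makes it easy to collect the odd cases per repository instead of silently guessing at them.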
Tagging Statistics
Finally, here are a few tag stats:
- Only 31 tags (1.5% of all Utah Government tags) are associated with 10 or more collections
- 1404 tags (71.5%) are assigned to only a single collection
- 107 collections have been assigned only 1 tag
- 10 collections have no subjects
Of course these statistics are based on the current incarnation of the data extraction script. After I modify the script, there will be a greater number of tags and (hopefully) more overlap of tags across multiple collections. These types of statistics should help me gauge how well my data extraction logic is working.
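For anyone curious how numbers like these can be produced, here is a minimal sketch. It assumes the extracted data is available as a mapping from collection identifier to a list of tags; that structure and the function name are hypothetical, and the real figures come from database queries rather than this code.

```python
from collections import Counter


def tag_statistics(collection_tags: dict[str, list[str]],
                   popular_threshold: int = 10) -> dict[str, int]:
    """Summarize how tags are spread across collections."""
    # Count each tag once per collection, even if it was extracted twice.
    counts = Counter(
        tag for tags in collection_tags.values() for tag in set(tags)
    )
    return {
        "total_tags": len(counts),
        "popular_tags": sum(1 for n in counts.values() if n >= popular_threshold),
        "single_use_tags": sum(1 for n in counts.values() if n == 1),
        "collections_with_one_tag": sum(
            1 for tags in collection_tags.values() if len(set(tags)) == 1
        ),
        "collections_without_tags": sum(
            1 for tags in collection_tags.values() if not tags
        ),
    }


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = {"a": ["Utah", "Women"], "b": ["Utah"], "c": []}
    print(tag_statistics(sample, popular_threshold=2))
```

Running the same function before and after a change to the extraction script gives a quick way to compare the two versions.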
You may just want to go ahead and change some of the other Unicode characters, which correspond to the special Microsoft characters, which might trip you up later. The 82xx range look like common ones:
Or you could rewrite them in a format that MySQL could display, since they are valid Unicode. But then they won’t match the LOC.
-p
Right – the trick is identifying all the special 82xx characters that might be used to separate the units within the subject so I can divide them up into tags. Thanks for that link!
This is a very interesting analysis, thanks! One or two items that might clarify some of these issues. The composite subjects came as a requirement of the grant project and are drawn from a “browse list” compiled by the Utah Manuscripts Association (see this PDF for full list). We had some difficulty finding appropriate terms for our records and so used the most broad terms an awful lot. In the EAD, they are distinguished from LCSH by the source attribute (i.e. “UMABroad”). They are not meant to be displayed (altrender=”nodisplay”) and it might be best to leave them out of any extraction if you are able.
And if it helps untangle some unique EAD usage, the complete best practices as we adopted them are available here.
Gina,
Thanks for this. I had noticed the altrender=”nodisplay”. Is that a convention you created or a standard one that others may follow? I can certainly screen them out when parsing your files for my final data ingestion.
Jeanne
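Screening those subjects out during parsing might look something like the sketch below. It is only an illustration: the sample XML is made up, it assumes the `subject` elements carry no XML namespace, and the `lcsh` source value on the first subject is a guess rather than something taken from the Utah files.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration. Assumes no XML namespace on the elements.
SAMPLE_EAD = """<ead><archdesc><controlaccess>
<subject source="lcsh">Women--Suffrage--Utah.</subject>
<subject source="UMABroad" altrender="nodisplay">Politics, Government, and Law</subject>
</controlaccess></archdesc></ead>"""


def displayable_subjects(ead_xml: str) -> list[str]:
    """Return subject text, skipping browse-list subjects."""
    root = ET.fromstring(ead_xml)
    subjects = []
    for subject in root.iter("subject"):
        # Skip anything flagged as not for display or drawn from the
        # UMA browse list, per the attributes described above.
        if subject.get("altrender") == "nodisplay":
            continue
        if subject.get("source") == "UMABroad":
            continue
        subjects.append((subject.text or "").strip())
    return subjects


if __name__ == "__main__":
    print(displayable_subjects(SAMPLE_EAD))
```

Checking both attributes is a belt-and-braces choice; either test alone would be enough if the encoding is consistent.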
Our encoding best practices were modeled after the Northwest Digital Archive, including the browse subjects or categories. See how they are implemented at their search page. Since they are created for a specific, almost local, purpose, I think that is reason enough to exclude them here.
However, it does bring up how important it is to consider “downstream” users when creating interoperable description and metadata, the kind of larger issue we have already faced for Open Archives Initiative (OAI) harvesting of digital collections.
Very interesting stuff so far.
At East Carolina University, we’re also using a similar technique to divide our LC Subject Headings. In our case, though, we’re using the UTF-8 em dash character code of “—”
Additionally, we’re also using the altrender attribute in those subjects for a very localized reason (which I doubt you’d encounter elsewhere). Basically, I wanted to be able to re-generate a MARC record pretty easily if need be, so I cheated and crammed something into the altrender attributes for our subjects (primarily because EAD doesn’t really allow you to break up those LCSH). And so, we’ve got some ugly encoding that looks like this:
United States—History—Civil War, 1861-1865—Campaigns
But it works fine for display. Plus, it’ll also help us to break up our subject headings so that we can eventually use them for the same sort of browsing purposes that we’re using on our Digital Collections homepage [see the “subject cloud” in the bottom-right corner].