Mark Shelstad, head of Archives and Special Collections at the University of Texas at San Antonio, sent me a link to the TARO (Texas Archival Resources Online) page for UTSA’s Archives and Special Collections finding aids in XML format.
With the current scripts, these are the fun tag stats:
- 1,684 total tags extracted
- 75% (1,266 tags) are associated with only one finding aid
- 3% (51 tags) are associated with 10 or more finding aids
Collection Size
235 out of the 253 collections ended up with a collection size of 0.
Consider the encoding of the collection size in the Guide to the Women’s Overseas Service League Records, 1910-2007:
<physdesc label="Extent:" encodinganalog="300$a"> 77 linear feet (approximately 44,000 items) </physdesc>
Contrast this with one of the examples where the size of the collection was extracted properly by the current script:
<physdesc label="Extent:" encodinganalog="300$a"> <extent>8.4 linear feet</extent> (14 boxes) </physdesc>
Sometimes it feels like a game of Where’s Waldo. In this case we are simply missing the set of <extent> tags from the first example. Off I went to the EAD tag descriptions to find the guidelines for use of the <physdesc> tag, where I found this overview of the tag:
A wrapper element for bundling information about the appearance or construction of the described materials, such as their dimensions, a count of their quantity or statement about the space they occupy, and terms describing their genre, form, or function, as well as any other aspects of their appearance, such as color, substance, style, and technique or method of creation. The information may be presented as plain text, or it may be divided into the <dimension>, <extent>, <genreform>, and <physfacet> subelements.
Bad news for my script logic – both versions are valid! This is a great example of how valid encoding can still present challenges. While in this example it seems just as easy to parse the version with the <extent> tags as without, it will only be through examination of a much broader sample of data that we can determine how much of a problem we have on our hands with this scenario of size data included in the <physdesc> tags without enclosing <extent> or <dimension> tags.
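As a rough illustration of handling both encodings, here is a minimal Python sketch. It is not the actual ArchivesZ extraction script; the function name `linear_feet` and the regular expression are assumptions made for this example, and it only recognizes sizes given in linear feet. It reads all of the text inside <physdesc>, so it does not matter whether the size is wrapped in <extent> tags.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: a number followed by "linear feet" or "linear foot".
# Other units (boxes, items, cubic feet) are deliberately not handled here.
LINEAR_FEET = re.compile(r"(\d+(?:\.\d+)?)\s+linear\s+(?:feet|foot)", re.IGNORECASE)


def linear_feet(physdesc_xml):
    """Return the size in linear feet found in a <physdesc> fragment, or None."""
    physdesc = ET.fromstring(physdesc_xml)
    # itertext() walks the element's own text plus the text of any children,
    # so a size inside <extent> and a size in plain text are both seen.
    text = "".join(physdesc.itertext())
    match = LINEAR_FEET.search(text)
    return float(match.group(1)) if match else None
```

Run against the two UTSA examples above, this sketch should find a size for both the version with <extent> tags and the version without. Whether a single pattern like this holds up across a broader sample is exactly the open question.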
Inclusive Dates
Twenty of the UTSA collections came through with no years. When I examined the data, I found an assortment of <unitdate> formats that my current script could not parse properly, including the examples below:
- 1917-1980 (bulk 1920-1945)
- 1876-1903, 1914-1919, 1940-2002
- 1940s, 1970s-1990s
Another encoding approach that could not be parsed was the one used for the finding aid of the Church Women United of San Antonio Records. In this case the <unitdate> tag is within the <unittitle> tag as seen here:
<unittitle label="Title:" encodinganalog="245"> Church Women United of San Antonio Records, <unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate> </unittitle>
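One possible way to cope with this nesting is sketched below in Python. This is an assumption-laden example rather than the current script's logic: the function name `split_title_and_date` is hypothetical, and it only looks for a <unitdate> that is a direct child of <unittitle>.

```python
import xml.etree.ElementTree as ET


def split_title_and_date(unittitle_xml):
    """Return (title, date) from a <unittitle> that may contain a <unitdate>."""
    unittitle = ET.fromstring(unittitle_xml)
    parts = [unittitle.text or ""]
    date_text = None
    for child in unittitle:
        if child.tag == "unitdate":
            # Keep the date separately instead of leaving it in the title.
            date_text = " ".join("".join(child.itertext()).split())
        else:
            # Other inline tags (such as <emph>) contribute their text.
            parts.append("".join(child.itertext()))
        # Text following the child's closing tag still belongs to the title.
        parts.append(child.tail or "")
    # Collapse whitespace and drop the comma left behind by the removed date.
    title = " ".join("".join(parts).split()).rstrip(",").strip()
    return title, date_text
```

For the Church Women United example, this separates the title text from the 1961-2005 date string, which could then be handed to whatever logic parses <unitdate> values found outside the title.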
Among the finding aids for which I did extract a range of inclusive date years, I also found issues with values like 1950s-1990s. The current script interpreted this to represent 1950 through 1990, but I believe it would be more properly translated as representing 1950 through 1999.
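The date formats listed above, including the decade case, could be handled along these lines. This is a minimal Python sketch under stated assumptions, not the existing script: the name `inclusive_years` is hypothetical, parenthetical notes such as "(bulk 1920-1945)" are simply dropped, and a decade like "1990s" is assumed to run through its ninth year.

```python
import re

# A four-digit year, optionally followed by "s" to mark a decade (e.g. "1950s").
YEAR_OR_DECADE = re.compile(r"\b(\d{4})(s)?\b")


def inclusive_years(unitdate_text):
    """Return (first_year, last_year) covered by a date string, or None."""
    # Drop parenthetical notes such as "(bulk 1920-1945)".
    text = re.sub(r"\([^)]*\)", "", unitdate_text)
    starts, ends = [], []
    for year, decade in YEAR_OR_DECADE.findall(text):
        first = int(year)
        starts.append(first)
        # Assumption: "1990s" ends in 1999, not 1990.
        ends.append(first + 9 if decade else first)
    if not starts:
        return None
    return min(starts), max(ends)
```

With this approach, a value like 1950s-1990s yields 1950 through 1999, and a list of disjoint ranges collapses to its earliest and latest years. Collapsing disjoint ranges loses the gaps between them, which may or may not be acceptable for the visualization.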
General Code Fixes
The University of Texas at San Antonio’s finding aids have provided additional examples of the following data and encoding issues already identified in earlier data sets:
- Inconsistent repository titles (26 different variations of “The University of Texas at San Antonio Library”)
- Titles with embedded and tagged dates
- Carriage return and tab characters that need to be removed
- Emphasis within a title or abstract added via a tag (such as <emph render="italic">Storyletters</emph> seen in A Guide to the Storyletters Records, 1991-2000) which interrupts extraction of text at that point
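Two of the issues above, stray carriage returns and tabs and inline tags that cut off text extraction, can be addressed together. The Python sketch below is an illustration only, with a hypothetical function name `flat_text`. It gathers the text from an element and everything nested inside it, then collapses every run of whitespace into a single space.

```python
import xml.etree.ElementTree as ET


def flat_text(element_xml):
    """Return all text inside an element, ignoring inline tags and extra whitespace."""
    element = ET.fromstring(element_xml)
    # itertext() yields text before, inside, and after inline tags like <emph>.
    combined = "".join(element.itertext())
    # split() with no arguments breaks on any whitespace, including tabs,
    # carriage returns, and newlines; joining with " " normalizes it.
    return " ".join(combined.split())
```

Applied to a title containing an <emph> tag, this returns the full title rather than stopping at the point where the tag begins.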
Next Steps
This is the last data set I am analyzing before tackling actual updates to the ArchivesZ data extraction script. My next step is to review and prioritize my long to-do list for updates to this script. Most of what I have found in my examination of the data sets are ways in which my script was not smart enough to handle valid variations in encoding and the tabs, carriage returns, formatting tags and special characters found throughout everyone’s XML. Yes, there are some cases in which the data itself is less than optimal (such as non-standardized repository titles) or the values challenging (so many ways to describe the size of a collection!), but overall I am optimistic about how much more I can improve the extraction script before I have to resort to hand correcting records in the database.
Thanks to everyone for your patience with these data analysis posts. Onward to programming!
Regarding your 1266 uniquely-used tags, do you mean not just the EAD elements (of which there are of course far fewer than 1266) but also elements in combination with the attributes, encodinganalogs, etc that are used to modify them? If so, it doesn’t seem quite so odd. Have you looked at occurrences of elements on their own?
I took all the subjects (persnames, corpnames, geonames and subjects) associated with the 253 collections and divided them into their component parts, which I call tags. So the subject ‘United States — Maryland — Agriculture’ can be decomposed into 3 tags: ‘United States’, ‘Maryland’ and ‘Agriculture’.
This process generated a total of 1,684 tags, 1,266 of which were each associated with only one of the 253 collections. Only 3% of the tags were associated with 10 or more of the UTSA finding aids. Does that clarify my tag stats?