Amanda Ross, project archivist for the Forest History Society, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address:
- Titles with embedded tags or punctuation. Generally the script drops anything after it hits either, so rather than a title like “William E. Towell Papers, 1941 – 1988”, my database ended up with only “William E Towell Papers,” based on this encoding: <titleproper>Inventory of the William E. Towell Papers, <date normal="1941/1988">1941 – 1988</date></titleproper>
- Need to handle a conversion factor for a size of “1 folder” (as found in the Inventory of the Biltmore Forest School Images, 1890 – 1988)
- My script chokes on the Inclusive Year format “1910 and 1931 – 1937” (as found in the Inventory of the Alfred Cunningham Papers, 1910 and 1931 – 1937)
- The presence of an <lb/> tag within the <extent> tag, used to force a line break, prevents my script from extracting any size information at all (as found in the Inventory of the DeWitt Nelson Papers, 1940 – 1976)
- Within the <abstract> tag, my script drops everything after an <emph render="doublequote"> tag (making for a very short abstract in the case of the Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 – 1947).
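Several of the items above (embedded <date>, <lb/> and <emph> tags) come down to mixed content: text and child elements interleaved inside one tag. The following is a minimal sketch, not the actual ArchivesZ extraction script, of one way to flatten such an element with Ruby's REXML library, plus a small parser for inclusive-year strings. The helper names raw_text, flatten_text and parse_inclusive_years are hypothetical, and the <extent> sample is made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: gather the text of an element and all of its
# descendants in document order, treating <lb/> as a single space.
def raw_text(element)
  element.children.map do |node|
    case node
    when REXML::Text
      node.value
    when REXML::Element
      node.name == 'lb' ? ' ' : raw_text(node)
    else
      '' # comments and processing instructions contribute nothing
    end
  end.join
end

# Hypothetical helper: flattened text with runs of whitespace collapsed.
def flatten_text(element)
  raw_text(element).gsub(/\s+/, ' ').strip
end

# Hypothetical helper: split an inclusive-years string on "and" or commas,
# then read each piece as a single year or a start/end pair.
def parse_inclusive_years(text)
  text.split(/\s*(?:,|\band\b)\s*/).map do |piece|
    years = piece.scan(/\d{4}/).map(&:to_i)
    next if years.empty?
    [years.first, years.last]
  end.compact
end

title_xml = '<titleproper>Inventory of the William E. Towell Papers, ' \
            '<date normal="1941/1988">1941 – 1988</date></titleproper>'
puts flatten_text(REXML::Document.new(title_xml).root)

# Made-up extent value, only to show the <lb/> handling.
extent_xml = '<extent>3 linear feet<lb/>(6 boxes)</extent>'
puts flatten_text(REXML::Document.new(extent_xml).root)

# Prints [[1910, 1910], [1931, 1937]]
p parse_inclusive_years('1910 and 1931 – 1937')
```

Walking the child nodes rather than reading only the first text node is what keeps the text that follows an embedded tag. A single year is returned as a pair with the same start and end, which is an assumption about how the database would want to store it.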
The most dramatic issue, seen across all the finding aids in this set, is that no subject data was extracted at all. My working theory for the moment is that this is due to the use of <list> and <item> tags as shown here:
<controlaccess>
  <head>Subject Headings</head>
  <list type="simple">
    <item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
    <item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item>
    <item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>
This is in contrast with this example of encoding from Syracuse University:
<controlaccess>
  <head>Subject and Genre Headings</head>
  <subject encodinganalog="650" source="local">Adult education</subject>
  <persname encodinganalog="600" source="lcnaf">Adolphson, L. H.</persname>
  <persname encodinganalog="600" source="lcnaf">Bradford, Leland Powers, 1905-</persname>
Or this sample from Oregon State University:
<controlaccess id="a12">
  <controlaccess>
    <persname encodinganalog="600" source="local" rules="aacr2" role="subject">Aitken, Frances Alva, 1889-1970.</persname>
  </controlaccess>
  <controlaccess>
    <corpname encodinganalog="610" source="local" role="subject" rules="aacr2">Oregon Agricultural College. Class of 1910.</corpname>
    <corpname source="lcnaf" encodinganalog="610" role="subject">Oregon Agricultural College--Students.</corpname>
  </controlaccess>
  <controlaccess>
    <geogname source="lcsh" role="subject" encodinganalog="651">Corvallis (Or.)</geogname>
  </controlaccess>
  <controlaccess>
    <subject encodinganalog="650" source="lcsh">Student activities--Oregon--Corvallis.</subject>
  </controlaccess>
The current version of the data extraction script handles both the Syracuse and OSU examples.
Amanda pointed me to the NCEAD Best Practice Guidelines for EAD 2002. In Appendix G: How Do I Encode…, the second question is “What if I have multi-part scope notes, biographical notes or subject headings?” followed by exactly the <list> and <item> tag usage found in the Forest History Society finding aids. This format clearly should be handled.
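If the working theory is right, one possible fix is to look for heading elements at any depth inside <controlaccess> rather than only among its direct children. The sketch below uses Ruby's REXML library and is not the actual ArchivesZ script. The collect_headings helper and the HEADING_TAGS list are hypothetical, and the Forest History Society excerpt has had its closing tags added so that it parses.

```ruby
require 'rexml/document'

# Assumption: the EAD elements to treat as subject headings.
HEADING_TAGS = %w[subject persname corpname famname geogname genreform].freeze

# Hypothetical helper: walk the tree once and keep every heading element
# that sits anywhere below a <controlaccess>, however deeply it is wrapped
# in <list>/<item> or in nested <controlaccess> tags.
def collect_headings(element, inside = false, found = [])
  element.elements.each do |child|
    child_inside = inside || child.name == 'controlaccess'
    if child_inside && HEADING_TAGS.include?(child.name)
      found << { type: child.name, text: child.text.to_s.strip }
    end
    collect_headings(child, child_inside, found)
  end
  found
end

fhs_xml = <<~XML
  <controlaccess>
    <head>Subject Headings</head>
    <list type="simple">
      <item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
      <item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item>
      <item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>
    </list>
  </controlaccess>
XML

collect_headings(REXML::Document.new(fhs_xml)).each do |heading|
  puts "#{heading[:type]}: #{heading[:text]}"
end
```

Because each element is visited exactly once, the nested <controlaccess> style from Oregon State and the flat style from Syracuse should come out of the same routine without duplicates. Names that appear outside <controlaccess>, such as a <persname> in <origination>, are skipped.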
So, no fun tag stats for this run – but I hope to fix my Ruby script so that the Forest History Society finding aids can be incorporated into the data set I use for testing version 2 of ArchivesZ. My Ruby script to-do list is getting quite long!