context | Spellbound Blog

SAA2007: Content Aggregation, Shareable Metadata and Access (Session 607)

November 9, 2007 1 Comment

Focusing on the challenges of sharing metadata to support content aggregation and access, SAA2007 Session 607’s official title was The Dynamics in the Aggregate: Shareable Metadata and Next-Generation Access Systems. Bill Landis, the Head of Arrangement, Description, & Metadata Coordinator at Yale University Library’s Manuscripts and Archives division, began the session by stressing that while it is hard to predict the future it seems obvious that there will be an increase in the aggregation of content. Google is one type of aggregator. Many institutions are using the standards of the Open Archives Initiative (OAI) to both publish and to harvest data. This session considered shareable metadata and how it can support or hinder content aggregation and access. A pointer was give to the Best Practices for OAI Data Provider Implementations and Shareable Metadata joint initiative of the Digital Library Federation and the National Science Digital Library.

Introduction to Shareable Metadata and Interoperability

The first speaker, Sarah Shreeves, started the panel off with her presentation titled The Dynamics of Sharing: Introduction to Shareable Metadata and Interoperability (follow the link to view the full set of slides). Sarah is not an archivist, but she has extensive experience with metadata aggregation.

She began with the assumption that “we” (libraries/archives/museums/cultural organizations) cannot afford to think about our collections only in the context of our local community. There is no way to know where your metadata is going to end up – either grouped with other things or pulled out of your collection into single atomized items.

Why share content? It benefits our users, supports one-stop searching, brings together distributed collections and supports mashups . Sharing helps us and increases our exposure. We have to do this – we cannot assume that our users will come in through the front door. Lorcan Dempsey uses the phrase In The Flow to mean how to get your content “out” into the world where users will find it.

Keys to Shareability or Interoperability:

You need the technical side (Z39.50, OAI PMH, RSS …etc)
Organization commitment of resources (people, training, time, priority)
Standards.. lots and lots of standards

There are two main ways to share metadata. The first is known as federated search. In this model a user searches from a single central location. That query is sent to distributed database and the answers are sent back for the central query source to assemble the results. Z39.50 and Search/Retrieval via URL ( SRU) are examples of technology used to perform federated searches.

The second way of sharing metadata is known as the metadata aggregation model. In this scenario, metadata is pulled from many places into a single location. This is what search engines, union catalogs, OAI PMH, RSS and Atom do. It provides an opportunity to massage and normalize the data. Once users find what they are looking for – they are often redirected to the original source of the item.

A major challenge of the metadata aggregation model is “the ability to perform a search over diverse sets of metadata records and obtain meaningful results” Priscilla Caplan (in Metadata Fundamentals for All Librarians). This is hard because we are not used to what metadata looks like outside our local context. Sarah then showed lots of different examples so the audience could see how different the metadata is.

Metadata is not monolithic. It can be a view projected from a single information object. It is possible to create multiple views appropriate for different uses. Each view will affect the granularity of description, choice of vocabularies, and choice of formats.

You can customize the format of your metadata depending on the context of how the metadata will be consumed. This might sound scary, hard and overwhelming – but Sarah is confident that we can do this in smart ways. She believes that we should be able to lobby for the features we need to support different views.

Sarah’s list of attributes of ‘shareable metadata’:

is quality metadata
promotes search interoperability
is human understandable outside of its local context
must be useful outside its local context – an aggregator can actually build services based on the data in the records provided – example was geographic data that can be used to put the items on a map
preferably is machine processable – Subject clustering – machine created – but still needs lots of human intervention to make it work
provides enough contextual information – the Theodore Roosevelt collection didn’t have a Roosevelt subject term because the title of the collection was assumed to be enough. She also mentioned a map that didn’t include the fact that it was a map in it’s metadata
is consistent across a collection – ie, same date field, same controlled vocabulary.. this is within a single collection
is coherent
is true to its content but also its audience – different views for different perspectives
conforms to standards – descriptive, technical, etc

There are some safe assumptions you can make. Users often get to your data through shared records – not through your front door. Users either don’t know about your collection or won’t remember. Shared records can lead users to local environments where the full context is available. Users are often entering through deep links that may bypass the introductory information that provides the larger context for a collection.

Implementing Shareable Metadata Practices

Jenn Riley, of the Inquiring Librarian Blog, gave the 2nd presentation: Implementing Shareable Metadata Practices in a Diverse University Environment. Jenn has a grand vision of what we are trying to achieve with all these efforts to share metadata. We needs lots of different ways to discover the data.. lots of environments.

We need machine-readable descriptive metadata, definitions of properties of shareable metadata in various communities (this is the focus of this session), and protocols and systems that use them for sharing that make it automatic. We also need online delivery of content too, but that is a big challenge and out of scope for this session.

Archives and digital libraries face different challenges in implementing standard practices related to shareable metadata. Archives are unique, making the notion of a single workflow model not possible. They are not a ‘homogeneous body’. Archives need to figure out how to support the expanding view of the mission to meet the needs of online users and make more services available. They need to find resources to provide appropriate description as well as technical implementation – and need time and money and skills in order to do this. On the other hand, digital library practice assumes content is digitized– that there will be ‘stuff’ at the end. Metadata-only workflows are not common. Digital libraries usually assume item-level description, but this is often not the case, and concepts of provenance and original order are largely foreign.

Communities need to agree on key definitions to bridge the gulf between digital libraries and archives. Digital libraries need to understand that Encoded Archival Description (EAD) is not a metadata format, EAD is a markup language.

The good news is that aggregations are not out to replace archives specific discovery systems. We don’t have to give up the local robust environment, we can and need to do both.

Key shareable metadata principles for archives:

Context: need enough context so the user can figure out if the record is useful for them. At the same time – too much repeated info can cause issues too.
Content: what is the appropriate granularity for shared records from archives — this choice needs to be done per usage and per audience.

Possible strategies include the creation of collection-level records only, creation of an aggregator that understands multi-level descriptions, the design of multi-level descriptions carefully for future item/file-level view, linking to digital objects from the lowest level of description in the finding aid and description at the item level.

Jenn then discussed the experiences at Indiana University’s Digital Library Program :

They have a new EAD finding aid website
the new system is more faithful to encoding with less ‘helpful’ fixed presentation
mutual learning process about archival descriptive practices
many decisions made about when encoding should be changed when systems should be changed
results of this process: RE-ENGINEERING! New template, report card, better previewing capability — new template for EAD that supports new data we didn’t have before… report card built on schema-tron and encoder can preview how their encoding is really working and preview what the final product finding aid will look like
some EAD files link to digital objects
soon there will be item-level OAI records (Dublin Core and MODS) for digitized items linked from finding aids
central Digital Library repository that allows EAD as the *master* metadata format
new workflow that permits links from any level of a multi-level description in EAD

The more you put stuff online – the more you attract the sort of attention that gets you more money to put more stuff online. Jenn suggests lobbying of software vendors for a better support of EAD.. don’t settle for Dublin core. We need to discuss with our user communities about the need for an archives-specific aggregators and consider the multi-level description.

Libraries and archives are learning from one another. The item centric view can be too narrow – but it can help with re-engineering. More structure in finding aids can be a good thing. Archives can show libraries why expertise in descriptive practice is still necessary — maybe those who are running out of things to catalog on the library side can spend some time describing over on the archives side?

Archival Frameworks for Shareable Metadata

Kelcy Shepherd, Digital Interfaces Librarian at the University of Massachusetts Amherst , gave the final presentation of the session: “Archival Standards and Tools: A Framework for Shareable Metadata”.

The first framework Kelcy addressed was Describing Archives: A Content Standard (DACS). What about DACS is applicable to sharing metadata? It is compatible for use with controlled vocabularies. It can make sure that our access points will work well with access points from other metadata communities. Since DACS is output agnostic, you can create the data and then use that data to generate different views or formats. A single set of DACS based data can produce printed finding aids, EAD finding aids, MARC 21 or MODS records.

In order to produce each of these different views from a single original format, you must a crosswalk. A crosswalk maps individual elements from one data format to corresponding elements of another. Unfortunately, crosswalks come with their own challenges:

granularity
missing elements
single element on one side that would need to be split into multiple elements on the other side

You need expertise in both standards addressed by a each crosswalk in order to do this well.

Next Kelcy discussed Encoded Archival Description (EAD). EAD is a data structure standard, machine readable format for encoding archival descriptions. It allows archivists to share the data across institutions. If you want to re-purpose a finding aids metadata, the data needs to be in a machine readable format. EAD gives you this. You can convert an EAD encoded finding aid into a Metadata Object Description Schema (MODS) document using an XSLT stylesheet and a crosswalk. The stylesheet may take a lot of work (especially for use across many finding aids), but there is a big payoff. Once the work is done a single stylesheet can be used across many many finding aids.

The Archivists’ Toolkit was cited as an example of a tool that can let you output multiple formats from a single set of data. It can produce EAD, MARC, MODS and Dublin Core records.

Tools can support efforts – but it all comes back to quality archival description. The best tool in the world will never make bad content into good content. If data is inconsistent – you have to manually go back and clean it up. I particularly liked Kelcy’s point about ensuring that your data doesn’t need the screen labels you to make sense. If you don’t consider this, when you export that data into a new format or view the data can loose it’s meaning.

Her concluding point was that if you don’t have the tech skills or support, work on your content.. use DACS… get your data in order and it will pay off later.

Questions and Answers

Question: How does this work when you are trying to share your metadata with communities that use different controlled vocabularies – thinking about the single EAD that generates MODS and MARC .etc etc…

Answer: Aggregators often they don’t use subject headings. This is nearly impossible to do in OAI – people use lots of different controlled vocabularies.. and sometimes no controlled vocabulary at all. There are experiments being done with subject clustering. Algorithms are used to cluster like things together – but it still requires human intervention to make sure the clusters make sense.

On the other hand – if you are using a standard vocabulary, there is work being done to map from one standard to another. An example of this is the OCLC Metadata switch project .

Question : What about social taggging?

Jenn : We are in no position to turn down metadata.

Sarah: DSpace has a concept of community. There is a way to let a community organically build their own controlled vocabulary as they go – new contributions are provided choices of terms that have been used before.

Bill talked about the article about Michaelson where they gave the same finding aids to 40 archivists to use LCSH for picking subject headings. The result was 0 consistency! Every single archivist picked different subject headings.

Jordan: PennTags is an example of an effort to combine social tagging with traditional classification. It shows tagging not as competition but as another way to get user generated descriptive information. It is an example of a way to ‘get into the flow’.

Sarah: Google will now use OAI PMH as a site map for indexing, but it throws away the metadata.

Jenn: Dlib – representing digital collections on wikipedia article.

Bill: PennTags is acting as an aggregation system to pull siloed information together.

Question: In some cases EAD data is flattened down for all items so that each item has all the context data and only one field is different on each? Is this an indication that the mapping have been better?

Answer: It is a problem – can be a problem…ultimately it is all about use and audience.

My Thoughts

I came away from this session with my head whirling with ideas. I was so pleased to hear people talk about concrete examples. We need more examples of challenges and real world benefits to further efforts to aggregate, publish and share archival content and it’s metadata. None of this is easy, but each project will give us new lessons and add to the growing set of best practices.

I truly believe that the sooner we tackle these thorny problems, the sooner we will start seeing the impact in improved access to archival records. The sooner we deal with it, the less we will be adding data that will have to be fixed later.

For anyone who has been following my blog – you will already know about my ArchivesZ project from last spring. One of the big struggles we had was figuring out how to make the subject term metadata ‘useful’ for aggregation and visualization. Another example of the challenges and benefits to shareable metadata is the SAA presentation about Publisher’s Bindings Online .

I had one last sentence in my notes from this session – an idea for a Facebook application that would let you feature your favorite archival image or record. This would be an amazing example of getting archival records ‘in the flow’ and showing up in surprising new places where no-one is ‘looking’ for records. Hey – maybe I should prod the Footnote people with this idea. It might be right up their alley!

As is the case with all my session summaries from SAA2007, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Book Review: Dreaming in Code (a book about why software is hard)

May 24, 2007 1 Comment

Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software
(or “A book about why software is hard”) by Scott Rosenberg

Before I dive into my review of this book – I have to come clean. I must admit that I have lived and breathed the world of software development for years. I have, in fact, dreamt in code. That is NOT to say that I was programming in my dream, rather that the logic of the dream itself was rooted in the logic of the programming language I was learning at the time (they didn’t call it Oracle Bootcamp for nothing).

With that out of the way I can say that I loved this book. This book was so good that I somehow managed to read it cover to cover while taking two graduate school courses and working full time. Looking back, I am not sure when I managed to fit in all 416 pages of it (ok, there are some appendices and such at the end that I merely skimmed).

Rosenberg reports on the creation of an open source software tool named Chandler. He got permission to report on the project much as an embedded journalist does for a military unit. He went to meetings. He interviewed team members. He documented the ups and downs and real-world challenges of building a complex software tool based on a vision.

If you have even a shred of interest in the software systems that are generating records that archivists will need to preserve in the future – read this book. It is well written – and it might just scare you. If there is that much chaos in the creation of these software systems (and such frequent failure in the process), what does that mean for the archivist charged with the preservation of the data locked up inside these systems?

I have written about some of this before (see Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges), but it stands repeating: If you think preserving records originating from standardized packages of off-the-shelf software is hard, then please consider that really understanding the meaning of all the data (and business rules surrounding its creation) in custom built software systems is harder still by a factor of 10 (or a 100).

It is interesting for me to feel so pessimistic about finding (or rebuilding) appropriate contextual information for electronic records. I am usually such an optimist. I suspect it is a case of knowing too much for my own good. I also think that so many attempts at preservation of archival electronic records are in their earliest stages – perhaps in that phase in which you think you have all the pieces of the puzzle. I am sure there are others who have gotten further down the path only to discover that their map to the data does not bear any resemblance to the actual records they find themselves in charge of describing and arranging. I know that in some cases everything is fine. The records being accessioned are well documented and thoroughly understood.

My fear is that in many cases we won’t know that we don’t have all the pieces we need to decipher the data until many years down the road leads me to an even darker place. While I may sound alarmist, I don’t think I am overstating the situation. This comes from my first hand experience in working with large custom built databases. Often (back in my life as a software consultant) I would be assigned to fix or add on to a program I had not written myself. This often feels like trying to crawl into someone else’s brain.

Imagine being told you must finish a 20 page paper tonight – but you don’t get to start from scratch and you have no access to the original author. You are provided a theoretically almost complete 18 page paper and piles of books with scraps of paper stuck in them. The citations are only partly done. The original assignment leaves room for original ideas – so you must discern the topic chosen by the original author by reading the paper itself. You decide that writing from scratch is foolish – but are then faced with figuring out what the person who originally was writing this was trying to say. You find 1/2 finished sentences here and there. It seems clear they meant to add entire paragraphs in some sections. The final thorn in your side is being forced to write in a voice that matches that of the original author – one that is likely odd sounding and awkward for you. About halfway through the evening you start wishing you had started from scratch – but now it is too late to start over, you just have to get it done.

So back to the archivist tasked with ensuring that future generations can make use of the electronic records in their care. The challenges are great. This sort of thing is hard even when you have the people who wrote the code sitting next to you available to answer questions and a working program with which to experiment. It just makes my head hurt to imagine piecing together the meaning of data in custom built databases long after the working software and programmers are well beyond reach.

Does this sound interesting or scary or relevant to your world? Dreaming in Code is really a great read. The people are interesting. The issues are interesting. The author does a good job of explaining the inner workings of the software world by following one real world example and grounding it in the landscape of the history of software creation. And he manages to include great analogies to explain things to those looking in curiously from outside of the software world. I hope you enjoy it as much as I did.

Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges

February 17, 2007 3 Comments

My most recent Archival Access class had a great guest speaker from the Journalism department. Professor Ira Chinoy is currently teaching a course on Computer-Assisted Reporting. In the first half of the session, he spoke about ways that archival records can fuel and support reporting. He encouraged the class to brainstorm about what might make archival records newsworthy. How do old records that have been stashed away for so long become news? It took a bit of time, but we got into the swing of it and came up with a decent list. He then went through his own list and gave examples of published news stories that fit each of the scenarios.

In the second half of class he moved on to address issues related to the freedom of information and struggling to gain access to born digital public records. Journalists are usually early in the food chain of those vying for access to and understanding of federal, state and local databases. They have many hurdles. They must learn what databases are being kept and figure out which ones are worth pursuing. Professor Chinoy relayed a number of stories about the energy and perseverance required to convince government officials to give access to the data they have collected. The rules vary from state to state (see the Maryland Public Information Act as an example) and journalists often must quote chapter and verse to prove that officials are breaking the law if they do not hand over the information. There are officials who deny that the software they use will even permit extractions of the data – or that there is no way to edit the records to remove confidential information. Some journalists find themselves hunting down the vendors of proprietary software to find out how to perform the extract they need. They then go back to the officials with that information in the hopes of proving that it can be done. I love this article linked to in Prof. Chinoy’s syllabus: The Top 38 Excuses Government Agencies Give for Not Being Able to Fulfill Your Data Request (And Suggestions on What You Should Say or Do).

After all that work – just getting your hands on the magic file of data is not enough. The data is of no use without the decoder ring of documentation and context.

I spent most of the 1990s designing and building custom databases, many for federal government agencies. There are an almost inconceivable number of person hours that go into the creation of most of these systems. Stakeholders from all over the organization destined to use the system participate in meetings and design reviews. Huge design documents are created and frequently updated … and adjustments to the logic are often made even after the system goes live (to fix bugs or add enhancements). The systems I am describing are built using complex relational databases with hundreds of tables. It is uncommon for any one person to really understand everything in it – even if they are on the IT team for the full development life cycle.

Sometimes you get lucky and the project includes people with amazing technical writing skills, but usually those talented people are aimed at writing documentation for users of the system. Those documents may or may not explain the business processes and context related to the data. They will rarely expose the relationship between a user’s actions on a screen and the data as it is stored in the underlying tables. Some decisions are only documented in the application code itself and that is not likely to be preserved along with the data.

Teams charged with the support of these systems and their users often create their own documents and databases to explain certain confusing aspects of the system and to track bugs and their fixes. A good analogy here would be to the internal files that archivists often maintain about a collection – the notes that are not shared with the researchers but instead help the archivists who work with the collection remember such things as where frequently requested documents are or what restrictions must be applied to certain documents.

So where does that leave those who are playing detective to understand the records in these systems? Trying to figure out what the data in the tables mean based on the understanding of end-users can be a fool’s errand – and that is if you even have access to actual users of the system in the first place. I don’t think there is any easy answer given the realities of how many unique systems of managing data are being used throughout the public sector.

Archivists often find themselves struggling with the same problems. They have to fight to acquire and then understand the records being stored in databases. I suspect they have even less chance of interacting with actual users of the original system that created the records – though I recall discussions in my appraisal class last term about all the benefits of working with the producers of records long before they are earmarked to head to the archives. Unfortunately, it appeared that this was often the exception rather than the rule – even if it is the preferred scenario.

The overly ambitious and optimistic part had the idea that what ‘we’ really need is a database that lists common commercial off-the-shelf (COTS) packages used by public agencies – along with information on how to extract and redact data from these packages. For those agencies using custom systems, we could include any information on what company or contractors did the work – that sort of thing can only help later. Or how about just a list of which agencies use what software? Does something like this exist? The records of what technology is purchased are public record – right? Definitely an interesting idea (for when I have all that spare time I dream about). I wonder if I set up a wiki for people to populate with this information if people would share what they already know.

I would like to imagine a future world in which all this stuff is online and you can login and download any public record you like at any time. You can get a taste of where we are on the path to achieving this dream on the archives side of things by exploring a single series of electronic records published on the US National Archives site. For example, look at the search screen for World War II Army Enlistment Records. It includes links to sample data, record group info and an FAQ. Once you make it to viewing a record – every field includes a link to explain the value. But even this extensive detail would not be enough for someone to just pick up these records and understand them – you still need to understand about World War II and Army enlistment. You still need the context of the events and this is where the FAQ comes in. Look at the information they provide – and then take a moment to imagine what it would take for a journalist to recreate a similar level of detailed information for new database records being created in a public agency today (especially when those records are guarded by officials who are leery about permitting access to the records in the first place).

This isn’t a new problem that has appeared with born digital records. Archivists and journalists have always sought the context of the information with which they are working. The new challenge is in the added obstacles that a cryptic database system can add on top of the already existing challenges of decrypting the meaning of the records.

Archivists and Journalists care about a lot of the same issues related to born digital records. How do we acquire the records people will care about? How do we understand what they mean in the context of why and how they were created? How do we enable access to the information? Where do we get the resources, time and information to support important work like this?

It is interesting for me find a new angle from which to examine rapid software development. I have spent so much of my time creating software based on the needs of a specific user community. Usually those who are paying for the software get to call the shots on the features that will be included. Certain industries do have detailed regulations designed to promote access by external observers (I am thinking of applications related to medical/pharmaceutical research and perhaps HAZMAT data) but they are definitely exceptions.

Many people are worrying about how we will make sure that the medium upon which we record our born digital records remains viable. I know that others are pondering how to make sure we have software that can actually read the data such that it isn’t just mysterious 1s and 0s. What I am addressing here is another aspect of preservation – the preservation of context. I know this too is being worried about by others, but while I suspect we can eventually come up with best practices for the IT folks to follow to ensure we can still access the data itself – it will ultimately be up to the many individuals carrying on their daily business in offices around the world to ensure that we can understand the information in the records. I suppose that isn’t new either – just another reason for journalists and archivists to make their voices heard while the people who can explain the relationships between the born digital records and the business processes that created them are still around to answer questions.

OBR: Optical Braille Recognition

January 9, 2007 1 Comment

In the interest of talking about new topics – I opened my little moleskine notebook and found a note to myself wondering if it is possible to scan Braille with the equivalent of OCR.

Enter Optical Braille Recognition or OBR. Created by a company called Neovision, this software will permit anyone with a scanner and a Windows platform computer to ‘read’ Braille documents.

Why was this in my notebook? I was thinking about unusual records that must be out in the world and wondering about how to improve access to the information within them. So if there are Braille records out there – how does the sighted person who can’t read Braille get at that information? Here is an answer. Not only does the OBR permit reading of Braille documents – but it would permit recreation of these same documents in Braille from any computer that has the right technology.

Reading through the Wikipedia Braille entry, I learned a few things that would throw a monkey wrench into some of this. For example – “because the six-dot Braille cell only offers 64 possible combinations, many Braille characters have different meanings based on their context”. The page on Braille code lists links to an assortment of different Braille codes which translate the different combinations of dots into different characters depending on the language of the text. On top of the different Braille codes used to translate Braille into specific letters or characters – there is another layer to Braille transcription. Grade 2 Braille uses a specific set of contractions and shorthand – and is used for official publications and things like menus, while Grade 3 Braille is used in the creation of personal letters.

It all goes back to context (of course!). If you have a set of Braille documents with no information on them giving you details of what sort of documents they are – you have a document that is effectively written in code. Is it music written in Braille Music notation? Is it a document in Hiranga using the Japanese Code? Is this a personal letter using Grade 3 Braille shorthand? You get the idea.

I suspect that one might even want to include a copy of both the Braille Code and the Braille transcription rules that go with a set of documents as a key to their translation in the future. If there are frequently used records – they could perhaps include the transcription (both literal transcription and a ‘translation’ of all the used Braille contractions) to improve access of analog records.

In a quick search for collections including braille manuscripts it should come as no surprise that the Helen Keller Archives does have “braille correspondence”. I also came across the finding aids for the Harvard Law School Examinations in Braille (1950-1985) and The Donald G. Morgan Papers (the papers of a blind professor at Mount Holyoke College).

I wonder how many other collections have Braille records or manuscripts. Has anyone reading this ever seen or processed a collection including Braille records?

Google Earth, Historical Maps and Ideas on Context

December 15, 2006

I have maps on the brain right now (just finished a 15 page paper about appraising geospatial electronic records) and a while ago LifeHacker gave me an excuse to install the latest version of Google Earth on my computer.

From LifeHacker’s Post: “To view these new old maps, you’ll need the latest version of Google Earth (use the program’s check-for-updates feature if you’re not sure you have it). In the layers section, select All Layers, then look for Featured Content > Rumsey Historical Maps.”

For the biggest bang – try the ‘Lewis and Clark 1814’ layer. I never really understood how large an area they traversed. This is such a great example of giving an archival record context. In this case it isn’t the context of who, when or why the record was created – it gives the user context that connects the record to the rest of the world. For maps this is easy to imagine (especially with Google Earth’s snazzy new Historical Maps), but what if we thought of ways to not only frame archival records within the context of their creation – but ways also to connect them with the current world.

When reading a newspaper article about a current event, the reader often will wonder about things that happened in the past that are related to what they are reading. One of the glories of the web is having terms hotlinked to other pages that give you more information about that term.

As more and more archival records are digitized (or are just born digital in the first place), the archival community will have the opportunity to find better and better ways to encourage the rest of the world to use said records in a quest to better understand current events. I see this as a major way for archives to earn the respect and understanding of the average person.

To some degree there is nothing now to stop a top notch newspaper (pick your favorite with a good online presence) from adding a “Learn more about this issue in history” sort of sidebar that links to a select list of relevant archival records. Many articles do this already to some degree simply by including images related to the story.

Take the recent Washington Post article Quarantined about a woman diagnosed with tuberculosis in 1954. If you click the link to the photo gallery and go to the 4th photo in the set you will see an image from the Library of Congress. While it is attributed (in teeny tiny type) there is no easy way to go and explore other images like this one. Ah – but by doing a search over on the LOC.gov site on the words tuberculosis poster kiss, I find my way back to this same poster shown to the left (Library of Congress, Prints & Photographs Division, WPA Poster Collection, [LC-USZC2-5369 ]).

I am trying to imagine what the next leap beyond the sort of usage demonstrated by the Washington Post article would look like. Could we come up with a centralized virtual archive with the ability to do the following?

permit registration of specific records – and allow assignment of meta data (such as date, location/GIS info, topic, and perhaps some keywords)
build an interface for those who want to fish for related records to feature on their sites
ensure that when a link to a record is shown, that the proper credits and links to the home institution are included

I guess what I am imagining is the equivalent of Google Ads for archival records. Sort of funny when I think of it that way. In fact – it makes me wonder if access to such a repository of well organized archival records available for use by online publishers could create some sort of revenue stream for those who carefully populate the metadata for records such that the online publishers can quickly find what they need. Would the keepers of the records provide the images themselves or thumbnails of the images with a link to the primary copy deep in their repository’s website? Talk about an interesting engine for fueling outreach and bringing in new web traffic.

I know that there are a million issues I am not thinking of (or thinking of and carefully ignoring) – but the best wild ideas pay no attention to the fine details when they are in the brainstorming stage. Let’s just leave it with the fact that I really like the idea of increasing exposure of the general public to primary sources. If we can both encourage and make it easier for purveyors of online content to use archival records to enhance their websites (especially with deep links back to the websites of online repositories) I think it can only serve to increase the profile of archives and their repositories.

129th anniversary of Thomas Edison’s Invention of the Phonograph

November 21, 2006 2 Comments

Phonograph Patent Drawing by T.A. Edison. May 18, 1880. RG 241.Patent #227,679

In honor of today’s 129th anniversary of Thomas Edison’s announcement of his invention of the phonograph, I thought I would share an idea that came to me this past summer. I had the pleasure of taking a course on Visual and Sound Materials taught by Tom Connors, the curator of the National Public Broadcasting Archives. This course explored the history of audio recording, photography, film and broadcasting technology.

When explaining the details of the first phonographs, Prof. Connors mentioned that certain sounds recorded better. Recordings of horns and the pitch of tenor singers were reproduced most accurately – or at least played back with the best sound. We also talked about the change in access to music brought about eventually by the availability of records at the corner store. The most popular recordings were (not surprisingly) of music with lots of horns or the recordings of individual singers like Enrico Caruso. So my question is how might music have evolved differently if different music had sounded better when reproduced by the phonograph? Would Caruso have been replaced at the top of the heap by someone else with a different vocal range? Would Jazz music evolved differently? Would there have been other types of music altogether if string instruments or wind instruments reproduced as well as the bright sounding horns?

In our class we also discussed the impact of the introduction of long playing records. Suddenly you could have 30 minutes of music at a time – with no need to have anyone playing the piano or hovering over the phonograph to change the disk. This led to the movement of music into the background of daily life – in contrast with the earlier focus on playing live music for entertainment in people’s homes. It also paved the way for people to experience music alone – you no longer needed to be in the same room as the musicians. No longer was music exclusively something shared and witnessed in a group. In my opinion this was the start of the long path that led to the possibility of having your own personal ‘sound track’ via first the walkman and now the digital audio player such as the iPod.

These ideas are still about archives and research. From my point of view it is just another example of how a different kind of context can impact our understanding of history. There are so many ways in which little events can impact the big picture. Edison wasn’t pursuing a dream of access to music (though that was included on his list of possible uses for the phonograph) – he was more interested in dictation, audio books for the blind and recording the last words of the soon to be dearly departed.

I love having the ability to examine the original ideas and intentions of an inventor and it came as no surprise to me that some of the most interesting resources out there for learning more about Edison and his invention of the phonograph traced back to both the Library of Congress and the U.S. National Archives and Records Administration. The LOC’s American Memory project page for The Motion Pictures and Sound Recordings of the Edison Companies gives a wide range of access to both background information and the option to listen to early Edison recordings. NARA’s page for the digital image above (originally found in Wikipedia) can be found online via NARA’s Archival Research Catalog (ARC) by searching for ‘Edison Phonograph’.

Hurrah for the invention of the phonograph and for all the archives that keep information for us to use in exploring ideas! Listen for horns and tenor voices in the next song you hear – and noticed if you are listening alone or with a group.

A final question: how can providing easy access to more big picture historical context help users to understand how the records they examine fit into the complicated real world of long ago?

Archival Context and Description – Taking It to the Next Level

November 14, 2006

In her post Describe, display, explain…, Jill Hurst-Wahl of the fabulous Digitization 101 talks about being more ambitious and committed to describing things well. This resonated very strongly for me, for so much of what I believe about enhancing access to and understanding of archival records online is tied up with describing them well. I don’t mean this solely in the traditional use of the term Description as it is used in the archival arena – but more in terms of the Five Ws.

They are not so very different really. The best finding aids and item level meta data can give you the same type of background information that would make any journalist proud. Who created these records? What are they? Why were they created? Where were they created? When? How?

This gives me a great chance to point people again to the Library of Congress American Memory Browse Collections page. I fell in love with it back when I was doing research for the paper that became my SAA 2006 Poster, but I couldn’t put my finger on how to explain why I liked it so much. I think it goes back to that basic journalistic approach to thinking about things. Am I interested in who created the records and why? Browse by topic. Am I interested in where the records were created? Browse by place. Am I interested in when the records were created? Browse by time period. You get the idea. (In a perfect world, there would be an advanced search option that would let me specify more than one of these interests at a time – but that is a topic for another day.)

So much of this goes back to some of the basics of web design. If you care about this sort of thing and you haven’t yet read Steve Krug’s Don’t Make Me Think – do yourself a favor and get your hands on a copy. It is an easy, quick and fun read – and it will leave you wishing that you could force it on the designers of every frustrating website you have ever been forced to muddle through when trying to get things done.

What would it be like if users of online archival collections didn’t need to learn new terminology in order to get the most out of the records? What if you never had to hunt to find the background history for a record because it was so obvious where to click? While I love the idea of thorough item level descriptions – I understand why that is not a reasonable expectation for most digitized archival collections. What I want is the context for any record an easy and obvious click away.

Of course if you want the dreamy version, then I would encourage folksonomy tagging and end user descriptions of digitized items. I know this would not work many places – but I suspect some experimentation with this kind of model with the right sort of collections would take the “Do you know who is in this photo” sorts of appeals to the next level.

I remember asking in my first Archives course if the archivists take information from the users of the records and add that information back into the finding aids. The answer was of the “maybe… sometimes” variety – followed up with an explanation of how archivists often have private files about a collection that are not shown to patrons. The thought was that perhaps an archivist might add a note there. My thinking was that since one of the biggest challenges with archival records is that it is possible that any single researcher is looking at a record for the first time since it was created (at least in any intent sort of way) – wouldn’t you want to harness that attention and use it to help others know what that first person found there?

This is already being done at the University of Michigan with the Polar Bear Expedition Digital Collections. They permit entry of comments at the item level and support search of the text in those comments. Every time someone adds a comment they are extending the description of the item. Take a look this comment on the Silver Parrish diary item (this was listed as the most recent comment just know when I looked at the site).

So thank you Jill for making me think about description again – from a new point of view. The more we do to describe everything fully, make it easy to find that information and let our users add more information to the mix – the more dynamic, usable and alive archival records will become.

Interesting Interface for exploring E-mail: TrampolineSystem SONAR Platform

October 26, 2006 1 Comment

Conversations and articles about the problem of archiving and accessing e-mail are often accompanied by the wringing of hands or the shrugging of shoulders. It has often seemed to me that figuring out how to archive and facilitate access to e-mail is a challenge that most people would rather ignore because it seems so difficult (and because there are plenty other things that need work too).

“In October 2003 the US Federal Energy Regulatory Commission placed 200,000 of Enron‘s internal emails from 1999-2002 into the public domain as part of its ongoing investigations.” So says TrampolineSystems on their facinating website that lets you explore those 200,000 public domain e-mails using their SONAR platform (that stands for Social Networks and Relevance). I would highly recommend taking a look and browsing around the Enron e-mails.

It appears that SONAR somehow tags the emails without human intervention – though they do not state this specifically one way or the other. The implication from the SONAR PR page is that you plug in the platform – and you instantly have this new access to your information. It is my impression that this works for either a fixed collection of e-mails (as is the case with the Enron emails) – or for an active live e-mail collection that is changing over time.

I like the social network Visualizer and the way it shows you how people are related to one another as represented by their e-mail correspondence. I like the theme and people tag clouds. I like the ease with which I can search for and read emails. I like how clearly they specify what you searched on at the top of your e-mail result list – and how many e-mails, people and themes the list represents.

On the other hand, there are a number of things I wish I could do. I wish that it was clear to me what order the emails are listed in when I do a search on a term. I searched on the word ‘pager’ – and received 2012 emails in no obvious order (most likely relevance – but that is not at all clear). I would like to be able to re-sort the results (by date for example). I would like to be able to add together multiple tags and people to get a scoped list of emails between two people on a specific set of theme.

Just as in traditional archival collections – there is some non-unique information in the mix. I found a generic Hotwire promotional email while looking at the theme The Insider (4th hit on the list). While I suppose spam and legitimate e-mail ads (ie, ones you asked for) are interesting – perhaps software considering e-mail to retain permanently could block some of these somehow.

I like clicking on things in the Visualizer and seeing the social networks hidden within the e-mails – but that gets old quickly unless you are looking for something very specific. I found myself wanting more context. Who are these people? What are their jobs? How are they ‘officially’ related in the corporate hierarchy? How do these e-mails compare with a timeline of events? What about the content of attachments (they don’t seem to be part of this interface)? All of this information could be linked into this interface in such a way as to improve an outsider’s understanding of this amazing landscape of 200,000 e-mails.

All in all I think it is an excellent starting point and I applaud them for trying to find an answer to the email question rather than just ignoring the problem.

(Thanks to Boing Boing for the pointer to this site.)

SAA 2007 Session Proposal: Preserving Context and Original Order in a Digital World

September 28, 2006 1 Comment

Abby Adams, Assistant Access and Outreach Archivist of the Richard B. Russell Library for Political Research and Studies, University of Georgia, and I are putting together a proposal for a session at SAA 2007 in Chicago. She and I found each other via my poster at SAA 2006: Communicating Context in Online Collections. We have been pondering many of the same questions related to the effective communication of context and original order in online digitized collections.

Our proposal is for a traditional 3 presentation panel with the title “Preserving Context and Original Order in a Digital World”. All we need now is a 3rd presenter, the endorsement of an SAA section or roundtable and (of course) the approval of the session selection committee. (And some plane tickets!)

This is the current version of our description for the proposal (mostly composed by Abby) :

Now that digitization projects have become more common in archival repositories, user and archivists alike have uncovered problems when it comes to understanding the context of online materials. However, there are various ways to provide more contextual information, thus enhancing the use of digital archives. But, archivists must confront the obstacles surrounding this task by developing best practices and incorporating new software into their digitization projects. In order to simplify the problem, we should return to our traditional archival principles and draw connections to collection arrangement and description in a digital environment. Join three archivists to explore how to improve on “analog” techniques in the communication of context. When done right, the digitization of a collection will not only retain all the same opportunities for communicating context that we are familiar with, it may revolutionize the way that archivists and users interact and understand our records.

The short take on what we want to cover in our session’s presentations is:

What should archivists be doing to not loose context and original order information in the transition from analog records to digitized records?
What can digitization give us the ability to do that we couldn’t do in the analog world?
What tools and standards are out there today to help archivists do both of the above? What information should archivists be capturing to permit them to take advantage of the opportunities to communicate context and original order that these tools and standards offer?

Abby’s part of the session, titled “Where’s the Context? Enhancing Access to Digital Archives”, will examine the need for preserving context and original order when digitizing archival materials – focusing on how it enhances online use and access to archives. How can new systems retain the existing ability to communicate context and original order when moving from “analog” to “digital”?

My portion, “Communicating Context: The Power of Digital Interfaces”, will discuss what archivists can do in the digital world they cannot do (or at least not easily) with analog records to communicate context and original order. I will focus on various innovative methods to do this including the use of GIS, hot-linking for ease of navigation, the ability to ‘collect’ digital surrogates for examination and more. I plan to include a combination of exciting new interfaces doing great things alongside new ideas of what could be done. Keep your fingers crossed for us that there is internet access in the session rooms in Chicago.

We have a vision of a third speaker whose talk would consider what the leading standards and software tools are permitting people to do today. How can archivists leverage the existing and evolving standards (EAD, EAC, TEI and other DTD s) to capture and communicate context and original order in the digital world? In addition, it would provide a high level review of common software packages (Archon , Archivists’ Toolkit, ContentDM , and others) and how they address original order and context. Finally we have a notion of a checklist of what to capture when digitizing to take advantage of what these tools and standards can provide for you.

Are you our mystery 3rd panelist that we are having so much trouble finding? Your first tip is that you have already mapped out 5 powerpoint slides in your head and started scribbling a rough draft of the “Archivists’ Digitization Checklist for Preserving Context” on a scrap of paper near your computer.

Maybe you know someone who would be a great person to pitch this to? Or you have advice for us concerning who to pass our proposal along to in the great hunt for that elusive session endorsement?

The deadline looms large (October 9)! Please contact us either via email (jeanne AT spellboundblog DOT com and adamsabi AT uga DOT edu) or in the comments of this post.

SAA2006 Session 305: Extended Archival Description Part III – EAD and TEI

August 20, 2006

Amanda Wilson of the Ohio State University Libraries delivered the final presentation of SAA2006 session 305 (Extended Archival Description: Context and Specificity for Digital Objects), Dynamic Duo: Enhancing Access through Dual Description with EAD and TEI. She described a proof of concept project designed to explore if EAD and TEI can be used to support a humanities professor who has students learning how to digitize and add markup.

She provided the following list of example sites:

Walt Whitman Archive
LEADERS Project ‘Linking EAD to Electronically Retrievable Sources’ – transcriptions, original images and metadata
Barren Lands Digital Collection – University of Toronto, using EAD to enhance item level descriptions.

The professor’s goals for this year’s project are to create a home page that includes a collection description, document the scholarly process and follow markup rules. Amanda got a big cheer for saying she was “not sure if you can keep scholarly process in EAD – but hey, I’ll try anything”.

For each item being digitized the professor wanted to include all of the following:

Markup
Reading View
Diplomatic View – summary of all marks
XML source (aka TEI file – AACR2 bibliographic record could be created from this data)

How can she replicate the process the class was going through to support these goals? First, she picked the software created by DLXS that was already being used on site.

During the course of her research, she had to come up with methods to do all of the following:

Retrieve metadata and images
Convert image to TIFF
Figure out how to connect from EAD to the TEI information
Load TEI and EAD files and images into DLXS

The solution would need to support the community and the work they have already done. Amanda’s vision was to permit addition of item level description to the EAD with no additional editing AND load the TEI with little or no modification. It was a challenge to massage the TEI to validate against the Document Type Definition (DTD). She also wanted to link back to the original site created by the professor’s students in order to retain extra information that had no place in the EAD.

At about this point I was wishing (for the umpteenth time at SAA) that there was internet in order for online demos.

There were definitely challenges. For example, DLXS is aimed at eBooks, while TEI has additional fields such as those required to support properties related to “hand” (as in who wrote the scanned content). This makes it hard to change the TEI format to fit into DLXS. Using TEI in DLXS does permit searching for individual items, but they may need to do additional massaging to get the data to line up with the fields expected.

The conclusion of the presentation was that it can be done. It is possible to integrate the professor’s transcriptions using EAD and TEI within DLXS, but there will need to be more discussion with the faculty member about their requirements. The ultimate aim is for a federated search that is integrated into the institution’s central search. Those who are working on similar projects may be interested in moving to using a standard, but only up to the point at which they start loosing the data that is important to them for their research focus.

Category: context