open source | Spellbound Blog

Chapter 10: Open Source, Version Control and Software Sustainability by Ildikó Vancsa

January 29, 2019

Chapter 10 of Partners for Preservation is ‘Open Source, Version Control and Software Sustainability’ by Ildikó Vancsa. The third chapter of Part III: Data and Programming, and the final chapter of the book, it shifts the lens on programming to talk about the elements of communication and coordination that are required to sustain open source software projects.

When the Pacific Telegraph Route (shown above) was finished in 1861, it connected the new state of California to the East Coast. It put the Pony Express out of business. The first week it was in operation, it cost a dollar a word. Almost 110 years later, 1969 saw the first digital transmission over ARPANET (the precursor to the Internet).

Vancsa explains early in the chapter:

We cannot really discuss open source without mentioning the effort that people need to put into communicating with each other. Members of a community must be able to follow and track back the information that has been exchanged, no matter what avenue of communication is used.

I love envisioning the long evolution from the telegraph crossing the continent to the Internet stretching around the world. With each leap forward in technology and communication, we have made it easier to collaborate across space and time. Archives, at their heart, are dedicated to this kind of collaboration. Our two fields can learn from and support one another in so many ways.

Bio:

Ildikó Vancsa started her journey with virtualization during her university years and has been in connection with this technology in different ways since then. She started her career at a small research and development company in Budapest, where she focused on areas like system management, business process modeling and optimization. Ildikó got involved with OpenStack when she started to work on the cloud project at Ericsson in 2013. She was a member of the Ceilometer and Aodh project core teams. She is now working for the OpenStack Foundation and she drives network functions virtualization (NFV) related feature development activities in projects like Nova and Cinder. Beyond code and documentation contributions, she is also very passionate about on-boarding and training activities.

Image source: Route of the first transcontinental telegraph, 1862.
https://commons.wikimedia.org/wiki/File:Pacific_Telegraph_Route_-_map,_1862.jpg

Chapter 5: The Internet of Things: the risks and impacts of ubiquitous computing by Éireann Leverett

January 6, 2019 2 Comments

Chapter 5 of Partners for Preservation is ‘The Internet of Things: the risks and impacts of ubiquitous computing’ by Éireann Leverett. This is one of the chapters that evolved a bit from my original idea – shifting from being primarily about proprietary hardware to focusing on the Internet of Things (IoT) and the cascade of social and technical fallout that needs to be considered.

Leverett gives this most basic definition of IoT in his chapter:

At its core, the Internet of Things is ‘ubiquitous computing’, tiny computers everywhere – outdoors, at work in the countryside, at use in the city, floating on the sea, or in the sky – for all kinds of real world purposes.

In 2013, I attended a session at The Memory of the World in the Digital Age: Digitization and Preservation conference on the preservation of scientific data. I was particularly taken with The Global Sea Level Observing System (GLOSS) — almost 300 tide gauge stations around the world making up a web of sea level observation sensors. The UNESCO Intergovernmental Oceanographic Commission (IOC) established this network, but cannot add to or maintain it themselves. The success of GLOSS “depends on the voluntary participation of countries and national bodies”. It is a great example of what a network of sensors deployed en masse by multiple parties can do – especially when trying to achieve more than a single individual or organization can on its own.

Much of IoT is not implemented for the greater good, but rather to further commercial aims. This chapter gives a good overview of the basics of IoT and considers a broad array of issues related to it including privacy, proprietary technology, and big data. It is also the perfect chapter to begin Part II: The physical world: objects, art, and architecture – shifting to a topic in which the physical world outside of the computer demands consideration.

Bio:

Éireann Leverett once found 10,000 vulnerable industrial systems on the internet.

He then worked with Computer Emergency Response Teams around the world for cyber risk reduction.

He likes teaching the basics and learning the obscure.

He continually studies computer science, cryptography, networks, information theory, economics, and magic history.

He is a regular speaker at computer security conferences such as FIRST, BlackHat, Defcon, Brucon, Hack.lu, RSA, and CCC; and also at insurance and risk conferences such as Society of Information Risk Analysts, Onshore Energy Conference, International Association of Engineering Insurers, International Risk Governance Council, and the Reinsurance Association of America. He has been featured by the BBC, The Washington Post, The Chicago Tribune, The Register, The Christian Science Monitor, Popular Mechanics, and Wired magazine.

He is a former penetration tester from IOActive, and was part of a multidisciplinary team that built the first cyber risk models for insurance with Cambridge University Centre for Risk Studies and RMS.

Image credit: Zan Zig performing with rabbit and roses, including hat trick and levitation, Strobridge Litho. Co., c1899.

NOTE: I chose the magician in the image above for two reasons:

because IoT can seem like magic
because the author of this chapter is a fan of magic and magic history

Chapter 4: Link Rot, Reference Rot and the Thorny Problems of Legal Citation by Ellie Margolis

December 29, 2018

The fourth chapter in Partners for Preservation is ‘Link Rot, Reference Rot and the Thorny Problems of Legal Citation’ by Ellie Margolis. Links that no longer work and pages that have been updated since they were referenced are an issue that everyone online has struggled with. In this chapter, Margolis gives us insight into why these challenges are particularly pernicious for those working in the legal sphere.

This passage touches on the heart of the problem.

Fundamentally, link and reference rot call into question the very foundation on which legal analysis is built. The problem is particularly acute in judicial opinions because the common law concept of stare decisis means that subsequent readers must be able to trace how the law develops from one case to the next. When a source becomes unavailable due to link rot, it is as though a part of the opinion disappears. Without the ability to locate and assess the sources the court relied on, the very validity of the court’s decision could be called into question. If precedent is not built on a foundation of permanently accessible sources, it loses its authority.

While working on this blog post, I found a WordPress Plugin called Broken Link Checker. It does exactly what you expect – scans through all your blog posts to check for broken URLs. In my 201 published blog posts (consisting of just shy of 150,000 words), I have 3002 unique URLs. The plugin checked them all and found 766 broken links! Interestingly, the plugin updates the styling of all broken links to show them with strikethroughs – see the strikethrough in the link text of the last link in the image below:

For each of the broken URLs it finds, you can click on “Edit Link”. You then have the option of updating it manually or using a suggested link to a Wayback Machine archived page – assuming it can find one.

It is no secret that link rot is a widespread issue. Back in 2013, the Internet Archive announced an initiative to fix broken links on the Internet – including the creation of the Broken Link Checker plugin I found. Three years later, on the Wikipedia blog, they announced that over a million broken outbound links on English Wikipedia had been fixed. Fast forward to October of 2018 and an Internet Archive blog post announced that More than 9 million broken links on Wikipedia are now rescued.

I particularly love this example because it combines proactive work and repair work. This quote from the 2018 blog post explains the approach:

For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 wikipedia sites as soon as those links are added or changed at the rate of about 20 million URLs/week.

And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a ‘404’, or ‘Page Not Found’). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with.
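The check-then-repair loop described in that quote can be sketched in a few lines of Python. This is only my own minimal illustration, not how IABot or the Broken Link Checker plugin is actually implemented. It assumes the Wayback Machine availability endpoint at archive.org/wayback/available and its JSON response shape; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine availability lookup.
WAYBACK_API = "https://archive.org/wayback/available"


def is_broken(url, timeout=10):
    """Return True if the URL cannot be fetched or answers with an HTTP error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status >= 400
    except (OSError, ValueError):
        # HTTPError and URLError are OSError subclasses; ValueError covers malformed URLs.
        return True


def wayback_query_url(url, timestamp=None):
    """Build the lookup URL asking for the archived copy closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(payload):
    """Pull the archived URL out of a decoded availability response, if any."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def suggest_replacement(url):
    """Return an archived URL for a broken link, or None if the link works or no copy exists."""
    if not is_broken(url):
        return None
    with urlopen(wayback_query_url(url), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

Calling `suggest_replacement` on each outbound link in a post would give a list of candidate archived copies to review by hand, which is roughly the choice the plugin's “Edit Link” option offers.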

There are no silver bullets here – just the need for consistent attention to the problem. The examples of issues being faced by the law community, and their various approaches to prevent or work around them, can only help us all move forward toward a more stable web of internet links.

Bio:
Ellie Margolis is a Professor of Law at Temple University, Beasley School of Law, where she teaches Legal Research and Writing, Appellate Advocacy, and other litigation skills courses. Her work focuses on the effect of technology on legal research and legal writing. She has written numerous law review articles, essays and textbook contributions. Her scholarship is widely cited in legal writing textbooks, law review articles, and appellate briefs.

Image credit: Image from page 235 of “American spiders and their spinningwork. A natural history of the orbweaving spiders of the United States, with special regard to their industry and habits” (1889)

ArchivesZ Needs You!

July 7, 2010 1 Comment

I got a kind email today asking “Whither ArchivesZ?”. My reply was: “it is sleeping” (projects do need their rest) and “I just started a new job” (I am now a Metadata and Taxonomy Consultant at The World Bank) and “I need to find enthusiastic people to help me”. That final point brings me to this post.

I find myself in the odd position of having finished my Master’s Degree and not wanting to sign on for the long haul of a PhD. So I have a big project that was born in academia, initially as a joint class project and more recently as independent research with a grant-funded programmer, but I am no longer in academia.

What happens to projects like ArchivesZ? Is there an evolutionary path towards it being a collaborative project among dispersed enthusiastic individuals? Or am I more likely to succeed by recruiting current graduate students at my former (and still nearby) institution? I have discussed this one-on-one with a number of individuals, but I haven’t thrown open the gates for those who follow me here online.

For those of you who have been waiting patiently, the ArchivesZ version 2 prototype is available online. I can’t promise it will stay online for long – it is definitely brittle for reasons I haven’t totally identified. A few things to be aware of:

when you load the main page, you should see tags listed at the bottom – if you don’t at all, then drop me an email via my contact form and I will try and get Tomcat and Solr back up. If you have a small screen – you may need to view your browser full screen to get to all the parts of the UI.
I know there are lots of bugs of various sizes. Some paths through the app work – some don’t. Some screens are just placeholders. Feel free to poke around and try things – you can’t break it for anyone else!

I think there are a few key challenges to building what I would think of as the first ‘full’ version of ArchivesZ – listed here in no particular order:

In the process of creating version 2, I was too ambitious. The current version of ArchivesZ has lots of issues, some usability – some bugs (see prototype above!)
Wherever a collaborative workspace of ArchivesZ were going to live, it would need large data sets. I did a lot of work on data from eleven institutions in the spring of 2009, so there is a lot of data available – but it is still a challenge.
A lot of my future ideas for ArchivesZ are trapped in my head. The good news is that I am honestly open to others’ ideas for where to take it in the future.
How do we build a community around the creation of ArchivesZ?

I still feel that there is a lot to be gained by building a centralized visualization tool/service through which researchers and archivists could explore and discover archival materials. I even think there is promise to a freestanding tool that supports exploration of materials within a single institution. I can’t build it alone. This is a good thing – it will be much better in the end with the input, energy and knowledge of others. I am good at ideas and good at playing the devil’s advocate. I have lots of strength on the data side of things and visualization has been a passion of mine for years. I need smart people with new ideas, strong tech skills (or a desire to learn) and people who can figure out how to organize the herd of cats I hope to recruit.

So – what can you do to help ArchivesZ? Do you have mad ActionScript 3 skills? Do you want to dig into the scary little Ruby script that populates the database? Maybe you prefer to organize and coordinate? Have you always wanted to figure out how a project like this could grow from a happy (or awkward?) prototype into a real service that people depend on?

Do you have a vision for how to tackle this as a project? Open source? Grant funded? Something else clever?

Know any graduate students looking for good research topics? There are juicy bits here for those interested in data, classification, visualization and cross-repository search.

I will be at SAA in DC in August chairing a panel on search engine optimization of archival websites. If there is even just one of you out there who is interested, I would cheerfully organize an ArchivesZ summit of some sort in which I could show folks the good, bad and ugly of the prototype as it stands. Let me know in the comments below.

Won’t be at SAA but want to help? Chime in here too. I am happy to set up some shared desktop tours of whatever you would like to see.

PS: Yes, I do have all the version 2 code – and what is online at the Google Code ArchivesZ page is not up to date. Updating the ArchivesZ website and uploading the current code is on my to do list!

Topic Modeling, Auto-Classification and Archival Description

April 27, 2010 4 Comments

In an example of Twitter serendipity, @silverasm’s (Aditi Muralidharan) tweet pointed me to @historying’s blog post about Topic Modeling. In this post Cameron Blevins explains the results of using the topic modeling feature of UMass Amherst’s MAchine Learning for LanguagE Toolkit (MALLET) on the text of Martha Ballard’s Diary.

I have spent a lot of time thinking about how to generate thematic overviews of groups of archival collections. My information visualization project, ArchivesZ, aims to provide ways of understanding aggregated archival description data, either from a single institution or across institutional boundaries. Now I find myself wondering if text mining with a tool like MALLET might generate smart topic groupings more elegantly than fighting with the wide range of non-standardized collection subjects.

Topic Modeling with MALLET

To get a sense of what MALLET generates, see the excerpt below from Blevins’s post:

With some tinkering, MALLET generated a list of thirty topics comprised of twenty words each, which I then labeled with a descriptive title. Below is a quick sample of what the program “thinks” are some of the topics in the diary:

MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient

CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt

DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn

GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds

He goes on to explain that “MALLET also allows us to track those topics across the text.” What if, instead of text mining a diary, we pumped the descriptions of every archival collection from a single institution into MALLET? Of course we would need a good list of stop words including such common terms as archives, history, sources and records. But I wonder how the topics MALLET suggests would compare to the official subjects associated with each collection. Could this give us a broad overview of the topics covered by a specific repository and give us a new way to build paths to the collections based on topic?
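To make the stop word idea concrete, here is a minimal Python sketch of the preprocessing I have in mind before any topic modeling happens. It does not call MALLET at all. The word lists are assumptions: the archival terms are the four named above, and the general list is a tiny stand-in for a real stop word list.

```python
import re

# Assumption: a tiny stand-in for a full general-purpose stop word list.
GENERAL_STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "from", "on", "with", "by"}

# Domain terms so common in archival description that they carry no topical signal.
ARCHIVAL_STOP_WORDS = {"archives", "history", "sources", "records"}

ALL_STOP_WORDS = GENERAL_STOP_WORDS | ARCHIVAL_STOP_WORDS


def tokenize(description, stop_words=ALL_STOP_WORDS):
    """Lowercase a collection description, split it into words and drop stop words."""
    words = re.findall(r"[a-z]+", description.lower())
    return [word for word in words if word not in stop_words]
```

Each collection description would become one "document" of filtered tokens, ready to hand to whatever topic modeling tool is being tested.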

Auto-Classification Using Castanet

Text miner Aditi Muralidharan also posted recently on this theme in Castanet: automatically generating a browsing structure for a collection and explains:

Castanet automatically carves a sub-structure from the hierarchical concept dictionary, WordNet (http://wordnet.princeton.edu), and matches items in the collection to one or many appropriate places within that hierarchy. Then, after some automated trimming and flattening, the result is a hierarchical browsing system.

I have heard of Castanet before via the Flamenco Search Interface Project. Apparently Muralidharan did a project using Castanet last summer to create a category system for Flickr Commons images based on the images’ tags which is then rendered using a Flamenco interface. I include a partial screen-shot below to give you a taste of what the navigation of images feels like a few levels down in the hierarchy. I love the classification of ‘Group Action’ then filtered by a sub-classification of ‘Commerce’. The first images shown are of ‘horse trading’ – with additional headings and images beneath them as well as additional filter options on the left.

What If?

What if we pulled all the English language archival descriptions from around the world as our original data set? If we used this data for topic modeling, our subject clusters would be cross-institutional. Maybe we could map the local institution assigned subjects to the topic model generated topics for each collection and get a sort of automated crosswalk for finding related collections. If we used the local institution assigned subjects from the archival descriptions for Castanet-style auto-classification, maybe we could generate a way to hierarchically browse collections topically.
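One very rough way to picture the crosswalk is to count how often each locally assigned subject shows up on collections with a given dominant topic, then keep the most frequent pairing. The Python below is only a sketch of that counting step. The input structure (a list of dicts with `subjects` and `topic` keys) is my own assumption, not anything MALLET produces directly.

```python
from collections import Counter, defaultdict


def build_crosswalk(collections):
    """Map each local subject to the generated topic it most often co-occurs with."""
    counts = defaultdict(Counter)
    for collection in collections:
        for subject in collection["subjects"]:
            counts[subject][collection["topic"]] += 1
    return {
        subject: topic_counts.most_common(1)[0][0]
        for subject, topic_counts in counts.items()
    }


# Made-up sample data, purely for illustration.
sample_collections = [
    {"subjects": ["Agriculture", "Farms"], "topic": "GARDENING"},
    {"subjects": ["Agriculture"], "topic": "GARDENING"},
    {"subjects": ["Agriculture", "Clergy"], "topic": "CHURCH"},
]
crosswalk = build_crosswalk(sample_collections)
```

A real version would need to cope with ties and with collections that span several topics, but even this simple count would show which local subject headings cluster under the same generated topic across institutions.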

Both MALLET and Flamenco are open source (I am not sure of the status of Castanet) and, as I discovered working on ArchivesZ, many institutions will share their archival description data for a good cause. So – is this a good cause? I need to tease these ideas out a bit more, but what do you all think of it at first blush? Feasible? Interesting? Worthwhile experiments?

Image Credits: MALLET logo from MALLET homepage. Images in screen shot from Flickr Commons with no known copyright.

THATCamp 2008: Day 1 Dork Short Lightning Talks

June 14, 2008 2 Comments

During lunch on the first day of THATCamp people volunteered to give lightning talks they called ‘Dork Shorts’. As we ate our lunch, a steady stream of folks paraded up to the podium and gave an elevator pitch length demo. These are the projects about which I managed to type URLs and some other info into my laptop. If you are looking for examples of inspirational and innovative work at the intersection of technology and the humanities – these are a great place to start!

World Digital Library (Library of Congress)
PicLens + Firefox + any search results page from the New York Public Library Digital Gallery = a 3D experience of ALL the photos at one time. PicLens uses the RSS feed to retrieve the full set of images along with their captions and will work with any RSS feed of images – such as RSS image feeds from Flickr or Smugmug.
HistoryWired (National Museum of American History): A new spin on a treemap visualization built on top of museum metadata. One box is displayed per item and the box size is based on popularity. The rest of its innovations are just easier to experience than describe.
The Object of History (National Museum of American History + CHNM)
Omeka (CHNM)
Eminent Domain (NYPL Online Exhibition): built on Omeka
American Social History Online (Digital Library Federation): Zotero enabled. They are on the hunt for more MODS records. Built on Ruby on Rails (RoR) and will be put out as open source software within a couple of months.
Typographia (David Rieder, NC State University)

Have more links to projects I missed? Please add them in the comments below.

Image credit: Lightning by thenss (Christopher Cacho) via flickr

THATCamp 2008: Crowdsourced Transcription and Collaborative Annotation

June 5, 2008 13 Comments

The THATCamp session officially titled ‘Crowdsourcing’ on the schedule was actually aimed at discussing the intersection of crowdsourced transcription and collaborative annotation. The group was small – just six of us – and Ben Brumfield got us going by giving us an overview of transcription software and projects:

The FamilySearch Indexing Project is an LDS church project put out by the FamilySearch Labs. Their goals: “Volunteers extract family history information from digital images of historical documents to create searchable indexes that assist everyone in finding their ancestors.”
The Manuscript Transcription Assistant is based at Worcester Polytechnic Institute (WPI) and is described as “a tool to assist transcribers in creating transcriptions, and incorporate meta-data about each image and transcription that can then be used to search through an electronic library of transcriptions”. I found mention in the FAQ of the desire to create a community so that “transcribers will be able to collaborate their work by rating the quality of other user’s transcriptions. By ranking the transcriptions, specific versions of transcriptions will emerge as an authority for that manuscript.” Unfortunately, a lot of the links on that site are broken and my attempt to register gave me an error. It is not clear to me that this project is actually still active.
Soldier Studies is a website dedicated to posting transcriptions of Civil War letters and diaries. This is not a tool for transcribing, but is clearly a repository specifically targeting transcriptions (see their Mission Statement for more information).
Oh No Robot is a comics transcription and search tool. It provides a page to find comics needing transcription and a great page to explain how transcription works on their site.

After examining what was out there, Ben concluded that what he wanted didn’t exist – so he started to build it himself. He gave us a demo of his “very beta” software. His goal is to build a web based tool to support collaborative manuscript transcription and annotation by individuals without a strong technical background. In its current (and private beta) state the software supports transcription, an innovative approach to linking individual words or phrases to collection defined subjects and some basic community tools to let his virtual team discuss transcription issues. Ben is working hard on the software – if you are interested in his project, definitely get in touch with him.

Travis Brown showed us his creation: eComma. eComma aims to “enable groups of students, scholars, or general readers to build collaborative commentaries on a text and to search, display, and share those commentaries online”. He showed us how users could tag or add comments on individual words or phrases of a loaded text. Take a look at the eComma page for Sonnet 18 by William Shakespeare. The words highlighted in blue are those which are tagged or have comments associated with them. If you highlight ‘the eye of heaven’ in line 5 you will see that it is tagged as a metaphor. Travis reported that he will have 2 other programmers working on eComma with him this summer and has his eye on improving some interface issues and adding a few more features.

We also talked about ways to display transcription. Elena Razlogova guided us over to the DoHistory website. There she showed us the Magic Lens interface. This interface displays the transcription of a handwritten diary page via a lens style overlay that you can move with your mouse. This reminded me of the Gilder Lehrman Battle Lines: Letters from America’s Wars interface that I found when doing research for my Communicating Context in Online Collections Poster. If you haven’t seen it before – go examine the page showing the transcription of Nathanael Greene’s letter to Catherine Greene dated July 17, 1778 (turn down your speaker if a reader’s voice will disturb those around you).

While on the DoHistory site I also found the Try Your Hand At Transcribing page. This page shows the challenge of transcribing handwritten documents by giving you the chance to try it yourself and then lets you check your transcription with the click of a button.

We talked a bit about the technology behind eComma (forgive me Travis for not having enough details in my notes to explain your current architecture here) and the challenges inherent in wanting to annotate overlapping sets of words. Though he isn’t using it in the current implementation of eComma, Travis mentioned the Layered Markup Annotation Language (LMNL) which the tutorial page explains as:

…LMNL documents contain character data which is marked up using named and occasionally overlapping ranges. Ranges can have annotations, which can themselves be annotated and can have structured content. To support authoring, especially collaborative authoring, markup is namespaced and divided into layers, which might reflect different views on the text.

I can definitely see how LMNL might be an interesting framework for building transcription and annotation software.
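To show what “overlapping ranges” means in practice, here is a small Python sketch that stores annotations as character offsets kept apart from the text itself. It is neither LMNL syntax nor a description of eComma’s architecture, just an illustration of why overlap is awkward for strictly nested markup. The `Range` class and helper names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    name: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    annotations: dict = field(default_factory=dict)


def mark(text, phrase, name, **annotations):
    """Create a named range over the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Range(name, start, start + len(phrase), annotations)


def ranges_at(ranges, offset):
    """Return every range covering the character at `offset`."""
    return [r for r in ranges if r.start <= offset < r.end]


def overlaps(a, b):
    """True when two ranges share characters but neither contains the other."""
    shares = a.start < b.end and b.start < a.end
    nested = (a.start <= b.start and b.end <= a.end) or (
        b.start <= a.start and a.end <= b.end
    )
    return shares and not nested


# Line 5 of Sonnet 18, with two ranges that cross each other.
LINE = "Sometime too hot the eye of heaven shines,"
metaphor = mark(LINE, "the eye of heaven", "metaphor", note="the sun")
clause = mark(LINE, "heaven shines", "clause")
```

Here the word ‘heaven’ belongs to both ranges, and neither range sits inside the other, so the pair could not be written as properly nested XML elements without splitting one of them.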

Krissy O’Hare brought up the challenges of transcribing audio and video that she has faced working on oral history projects at Concordia University. This led to Travis (I think?) mentioning the Texas German Dialect Project (TGDP) and the CMU Sphinx Group Speech Recognition Engine. TGDP has an online archive of recorded interviews along with their transcriptions and translations. CMU Sphinx’s introduction explains that their software tools are targeted at expert users wanting to build speech-using applications.

This was a great session. The small group gave everyone a chance to contribute and take over the keyboard in order to show off their favorite sites. It was immediately after the Text Mining session, so our minds were already full of all the great things one could do with text once it is transcribed.

I am excited to watch the evolution of group transcription and annotation software. If you know of other transcription or annotation tools or projects – please post them to the comments.

Image credit: Free pencils by zone41 via flickr

As is the case with all my session summaries from THATCamp 2008, please accept my apologies in advance for any cases in which I misquote, oversimplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

MIT’s SIMILE Project: Innovations in Metadata Interaction and Analysis

January 13, 2008 2 Comments

Well-formed Data’s post on Exhibit led me to explore what was available from MIT’s Semantic Interoperability of Metadata and Information in unLike Environments (SIMILE) project. I took a little time to examine some of the SIMILE project tools with an eye to how they could impact interaction with archival records and metadata, as well as how they might support the work of archivists. All the tools appear to be available via an open source BSD license.

Babel

Babel converts files from one format to another. I did a test to see if it would convert one of the Library of Congress EAD Finding Aids from XML to some other format – but it gave me an error (‘unqualified attribute ‘repositoryencoding’ not allowed’). I love the idea that I could just point this at an EAD finding aid and get something useful out the other side – but apparently that is a bit on the wishful thinking side – at least for the moment.

Exhibit 2.0

Exhibit 2.0 is described on the Exhibit homepage as follows:

Exhibit is a three-tier web application framework written in Javascript, which you can include like you would include Google Maps. If you just want to show a few hundred records of data on maps, timelines, scatter plots, interactive tables, etc., why bother learning SQL, ASP, PHP, CGI, or whatever when you can just use Exhibit? To use Exhibit, you write: a simple data file, and an HTML file in which you specify how the data should be shown. Data + Presentation. That’s all there is to publishing, as it should be.

Sounds fabulous, doesn’t it? I wish I had a week to play with this tool. They have a whole slew of examples, but I think the two I list below do a fine job of showing what you can create (not to mention being fairly thematic for those of you paying attention to the US Presidential Primaries news coverage):

Gadget

Gadget is an XML inspector designed to create useful summaries of vast pools of XML data. I didn’t download and play with this one – but it sounds like something that might be very interesting to pump a big pile of EAD XML format finding aids into to see what could be discovered from an aggregate point of view.

Longwell & RDFizers

Longwell is a faceted browser for RDF formatted data, while RDFizers is actually a directory of tools which convert other data formats into the RDF format. It doesn’t exist now, but if there were an RDFizer that went from EAD to RDF then Longwell would become more interesting to archivists.

That said, they already do have both a MARC/MODS RDFizer and an OAI-PMH RDFizer. I suspect that many archivists could put their hands on archival data in one of these two formats – which makes experimenting with Longwell more plausible in the near term.
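For a sense of what a bare-bones EAD to RDF conversion might involve, here is a hedged Python sketch. It is not one of the SIMILE RDFizers. It assumes a namespace-free EAD 2002 document, reads only three elements from the top-level `did`, and the mapping to Dublin Core terms, the subject URI and the sample finding aid are all my own invention for illustration.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"

# Assumed mapping from EAD <did> children to Dublin Core properties.
FIELD_MAP = {
    "unittitle": DCTERMS + "title",
    "unitid": DCTERMS + "identifier",
    "unitdate": DCTERMS + "date",
}


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def ead_to_ntriples(ead_xml, subject_uri):
    """Turn the collection-level <did> of an EAD document into N-Triples lines."""
    root = ET.fromstring(ead_xml)
    did = root.find("./archdesc/did")
    triples = []
    if did is None:
        return triples
    for tag, predicate in FIELD_MAP.items():
        for element in did.findall(tag):
            # Collapse whitespace, including text inside nested elements.
            text = " ".join("".join(element.itertext()).split())
            if text:
                triples.append(f'<{subject_uri}> <{predicate}> "{escape_literal(text)}" .')
    return triples


# A made-up, heavily trimmed finding aid.
SAMPLE_EAD = """<ead><archdesc level="collection"><did>
<unittitle>Garden Club records</unittitle>
<unitid>MS 42</unitid>
<unitdate>1901-1950</unitdate>
</did></archdesc></ead>"""

triples = ead_to_ntriples(SAMPLE_EAD, "http://example.org/collections/ms42")
```

A real converter would have to handle the full hierarchy of components, controlled access terms and the namespaced schema version of EAD, which is where most of the work would be.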

Final Thoughts

There are lots of other tools that are part of the SIMILE project (screen scrapers and timeplotters and more), but the ones listed above most ignited my imagination. Surely there are geek archivists even now rolling up their sleeves to figure out how to leverage free open source tools like these, both to improve access to records and increase understanding of what we have and how well it is (or isn’t) documented.

I hope to find time to play with each of these over the next few months – but I would love to know if anyone else out there has already tried any of these tools. Have suggestions for likely datasets? Have knowledge of existing archive related applications using these tools? Please post your comments below or drop me a line via my contact form!

Image Credit: The Simile Project logo displayed above is from MIT’s Simile Project website.

Digital Preservation via Emulation – Dioscuri and the Prevention of Digital Black Holes

December 25, 2007 2 Comments

Available Online posted about the open source emulator project Dioscuri back in late September. In the course of researching Thoughts on Digital Preservation, Validation and Community I learned a bit about the Microsoft Virtual PC software. Virtual PC permits users to run multiple operating systems on the same physical computer and can therefore facilitate access to old software that won’t run on your current operating system. That emulator approach pales in comparison with what the folks over at Dioscuri are planning and building.

On the Digital Preservation page of the Dioscuri website I found this paragraph on their goals:

To prevent a digital black hole, the Koninklijke Bibliotheek (KB), National Library of the Netherlands, and the Nationaal Archief of the Netherlands started a joint project to research and develop a solution. Both institutions have a large amount of traditional documents and are very familiar with preservation over the long term. However, the amount of digital material (publications, archival records, etc.) is increasing with a rapid pace. To manage them is already a challenge. But as cultural heritage organisations, more has to be done to keep those documents safe for hundreds of years at least.

They are nothing if not ambitious… they go on to state:

Although many people recognise the importance of having a digital preservation strategy based on emulation, it has never been taken into practice. Of course, many emulators already exist and showed the usefulness and advantages it offer. But none of them have been designed to be digital preservation proof. For this reason the National Library and Nationaal Archief of the Netherlands started a joint project on emulation.

The aim of the emulation project is to develop a new preservation strategy based on emulation.

Dioscuri is part of Planets (Preservation and Long-term Access via NETworked Services) – run by the Planets consortium and coordinated by the British Library. The Dioscuri team has created an open source emulator that can be ported to any hardware that can run a Java Virtual Machine (JVM). Individual hardware components are implemented via separate modules. These modules should make it possible to mimic many different hardware configurations without creating separate programs for every possible combination.
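The modular idea described above can be sketched in a few lines of Python. This is only an illustration of the design principle (swap components, keep the emulator), not Dioscuri's actual code, which is written in Java. Every class and module name here is hypothetical.

```python
# Hypothetical sketch of a modular emulator: each hardware component is a
# separate module, and a machine is just a particular combination of modules.
from abc import ABC, abstractmethod


class HardwareModule(ABC):
    """Common interface that every emulated component implements."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def reset(self) -> str:
        """Return a short status message after resetting the component."""


class Cpu(HardwareModule):
    def reset(self) -> str:
        return f"{self.name}: registers cleared"


class Memory(HardwareModule):
    def __init__(self, name: str, size_kb: int) -> None:
        super().__init__(name)
        self.size_kb = size_kb

    def reset(self) -> str:
        return f"{self.name}: {self.size_kb} KB zeroed"


class Emulator:
    """Knows nothing about specific hardware, only the module interface."""

    def __init__(self, modules) -> None:
        self.modules = list(modules)

    def power_on(self) -> list:
        return [module.reset() for module in self.modules]


# Two different hardware configurations built from the same parts,
# without writing a separate emulator program for each combination.
old_pc = Emulator([Cpu("cpu-a"), Memory("ram", 640)])
bigger_pc = Emulator([Cpu("cpu-a"), Memory("ram", 4096)])

if __name__ == "__main__":
    print(old_pc.power_on())
    print(bigger_pc.power_on())
```

The point of the design is that adding a new configuration means supplying a different list of modules, while the `Emulator` class itself stays unchanged.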

You can get a taste of the big thinking that is going into this work by reviewing the program overview and slide presentations from the first Emulation Expert Meeting (EEM) on digital preservation that took place on October 20th, 2006.

In the presentation given by Geoffrey Brown from Indiana University titled Virtualizing the CIC Floppy Disk Project: An Experiment in Preservation Using Emulation I found the following simple answer to the question ‘Why not just migrate?’:

Loss of information — e.g. word edits
Loss of fidelity — e.g. WordPerfect to Word isn’t very good
Loss of authenticity — users of migrated document need access to original to verify authenticity
Not always possible — closed proprietary formats
Not always feasible — costs may be too high
Emulation may be necessary to enable migration

After reading through Emulation at the German National Library, presented by Tobias Steinke, I found my way to the kopal website. With their great tagline ‘Data into the future’, they state their goal is “…to develop a technological and organizational solution to ensure the long-term availability of electronic publications.” The real gem for me on that site is what they call the kopal demonstrator. This is a well thought out Flash application that explains the kopal project’s ‘procedures for archiving and accessing materials’ within the OAIS Reference Model framework. But it is more than that – if you are looking for a great way to get your (or someone else’s) head around digital archiving, software and related processes – definitely take a look. They even include a full Glossary.

I liked what I saw in Defining a preservation policy for a multimedia and software heritage collection, a pragmatic attempt from the Bibliothèque nationale de France, a presentation by Grégory Miura, but felt like I was missing some of the guts by just looking at the slides. I was pleased to discover what appears to be a related paper on the same topic presented at IFLA 2006 in Seoul titled: Pushing the boundaries of traditional heritage policy: Maintaining long-term access to multimedia content by introducing emulation and contextualization instead of accepting inevitable loss. Hurrah for NOT ‘accepting inevitable loss’.

Vincent Joguin’s presentation, Emulating emulators for long-term digital objects preservation: the need for a universal machine, discussed a virtual machine project named Olonys. If I understood the slides correctly, the idea behind Olonys is to create a “portable and efficient virtual processor”. This would provide an environment in which to run programs such as emulators, but isolate the programs running within it from the disparities between the original hardware and the actual current hardware. Another benefit to this approach is that only the virtual processor need be ported to new platforms rather than each individual program or emulator.
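As a rough illustration of why a small virtual processor helps with portability, here is a toy stack machine in Python. It has nothing to do with the real Olonys design, which I only know from the slides. The instruction names and the sample program are invented for this sketch. The program is plain data, so only the `run` function would ever need to be rewritten for a new platform.

```python
# Hypothetical toy virtual processor: programs are lists of instructions,
# and only this interpreter depends on the host platform.


def run(program):
    """Execute a list of (opcode, *args) tuples and return the emitted values."""
    stack = []
    output = []
    pc = 0  # program counter
    while pc < len(program):
        op, *args = program[pc]
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "emit":
            output.append(stack.pop())
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown instruction: {op}")
        pc += 1
    return output


# (2 + 3) * 4, expressed as portable instructions rather than host code.
PROGRAM = [
    ("push", 2),
    ("push", 3),
    ("add",),
    ("push", 4),
    ("mul",),
    ("emit",),
    ("halt",),
]

if __name__ == "__main__":
    print(run(PROGRAM))
```

An emulator written as a program for such a virtual processor would, in principle, keep working on future hardware as long as someone ports the small interpreter.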

Hilde van Wijngaarden presented an Introduction to Planets at EEM. I also found another introductory level presentation that was given by Jeffrey van der Hoeven at wePreserve in September of 2007 titled Dioscuri: emulation for digital preservation.

The wePreserve site is a gold mine for presentations on these topics. They bill themselves as “the window on the synergistic activities of DigitalPreservationEurope (DPE), Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval (CASPAR), and Preservation and Long-term Access through NETworked Services (PLANETS).” If you have time and curiosity on the subject of digital preservation, take a glance down their home page and click through to view some of the presentations.

On the site of The International Journal of Digital Curation there is a nice ten page paper that explains the most recent results of the Dioscuri project. Emulation for Digital Preservation in Practice: The Results was published in December 2007. I like being able to see slides from presentations (as linked to above), but without the notes or audio to go with them I am often left staring at really nice diagrams wondering what the author’s main point was. The paper is thorough and provides lots of great links to other reading, background and related projects.

There is a lot to dig into here. It is enough to make me wish I had a month (maybe a year?) to spend just following up on this topic alone. I found my struggle to interpret many of the PowerPoint slide decks that have no notes or audio very ironic. Here I was hunting for information about the preservation of born digital records and I kept finding that the records of the research provided didn’t give me the full picture. With no context beyond the text and images on the slides themselves, I was left to my own interpretation of their intended message. While I know that these presentations are not meant to be the official records of this research, I think that the effort obviously put into collecting and posting them makes it clear that others are as anxious as I am to see this information.

The best digital preservation model in the world will only preserve what we choose to save. I know the famous claim on the web is that ‘content is king’ – but I would hazard to suggest that in the cultural heritage community ‘context is king’.

What does this have to do with Dioscuri and emulators? Just that as we solve the technical problems related to preservation and access, I believe that we will circle back around to realize that digital records need the same careful attention to appraisal, selection and preservation of context as ‘traditional’ records. I would like to believe that the huge hurdles we now face on the technical and process side of things will fade over time due to the immense efforts of dedicated and brilliant individuals. The next big hurdle is the same old hurdle – making sure the records we fight to preserve have enough context that they will mean anything to those in the future. We could end up with just as severe a ‘digital black hole’ due to poorly selected or poorly documented records as we could due to records that are trapped in a format we can no longer access. We need both sides of the coin to succeed in digital preservation.

Did I mention the part about ‘Hurray for open source emulator projects with ambitious goals for digital preservation’? Right. I just wanted to be clear about that.

Image Credit: The image included at the top of this post was taken from a screen shot of Dioscuri itself, the original version of which may be seen here.

The MemoryArchive Affiliate Program: A Wiki Engine for Collecting Memoirs

November 14, 2007

A Beautiful WWW posted A Review of MemoryArchive.org. MemoryArchive, founded by historian Marshall Poe, is a new MediaWiki based website aimed at collecting first person accounts that they term ‘memoirs’. In sharp contrast with the communal authorship approach of most wikis, MemoryArchive locks down edits of each entry after a format review.

What sorts of memoirs are they looking for? In their FAQ they say they want “pretty much anything you remember that someone else might conceivably find interesting, now or in 500 years”.

I spent some time exploring. I read a very moving memorial titled Death by Aids The Goodbye Party, 1992, by Jay Blotcher (ed note: Jay emailed me with the correct title for this memoir). I wandered through some 9/11 memories. Eventually something dawned on me. Maybe it is the fact that I am spending most of my days lately thinking deep thoughts about metadata and classification — or maybe my archives course work is to blame — whatever the reason, I realized that I wanted more information about the storytellers. Right now it appears that each memoir includes Who, What, When and Where data – to whatever degree the contributors choose to furnish such information. Categories are also available and seem to be frequently employed.

But I want to know more about the individuals who are telling the stories. I appreciate that some posts will be made more powerful through anonymity, but in those cases where an individual is willing to share additional biographic information it would be great to have an easy place for that information to be captured.

I think the most interesting aspect of the Memory Archive to the archives community is the Memory Archive Affiliate Program. The theory behind this program is to support the collection and archiving of personal histories online. It is described as being of interest to the following types of organizations:

historical societies (urban, state, or national)
institutions interested in recording their own history (a club, society, or military unit)
educational institutions teaching history (high school or college)
public history projects (oral history gathering, or document collection)

This is a powerful idea. Any time you can accumulate a critical mass of a single type of information on the web (in this case, memoirs) you have the chance of becoming a destination. There is also the added benefit of enabling smaller organizations to launch online memoir collection initiatives without needing to worry about the technology, costs and people-power that would usually be required.

There does need to be an easy way for the Memory Archive Affiliates to download these born digital memoirs for offline use and preservation purposes. This could be accomplished by an ‘export’ or ‘format for printing’ button on each memoir page, or perhaps some form of bulk download for all memoirs collected for a single affiliate’s project. I will say that the default print format isn’t bad. It seems to already do some special reformatting (such as displaying URL links in their entirety). I would still want more metadata, though perhaps the definition of attributes to be collected could be customized per project.

I am curious to see the overall quality of the memoirs a year from now. I suspect that memoirs collected in association with a topically focused program may be more compelling than the average ‘man-on-the-net’ first person narratives. That isn’t to say that there is no value in the memories of someone who feels compelled to share their story – but a collection created around a theme would have the additional power of that common thread. The affiliate program memoirs would also be more likely to come with some contextual background explaining the source and origin of the solicited accounts. I am a fan of the existing thematic memory sites, such as The April 16 Archive and the Hurricane Digital Memory Bank. I love that the Omeka software used to create these two example sites is open source and free. Unfortunately, I don’t think the average small historical society or public history project is likely to have the resources to build and support a site like this even with free software. I think that a program like the Memory Archive Affiliate Program (or something like it) could bridge the gap for these smaller organizations and make the creation of online memoir collection projects a reality.

Category: open source