digital preservation | Spellbound Blog

Book Review: Digital Preservation

January 18, 2007

In my quest for information about archiving geospatial data last term, I got my hands on a copy of Marilyn Deegan and Simon Tanner’s Digital Preservation (part of the Digital Futures Series). This excellent volume consists of nine chapters each written by different authors who are leaders in their respective fields (shown in order of their respective chapters):

David Holdsworth: known for his work on the CEDARS and CAMiLEON projects
Robin Wendler: metadata analyst at the Harvard University Library Office for Information Studies
Julien Masanès: co-founder of the European Archive
Elisa Mason: maintains the Forced Migration Current Awareness Blog
Brian F. Lavoie: a research scientist at OCLC
Stephen Chapman: Preservation Librarian for Digital Initiatives in the Weissman Preservation Center, Harvard University Library
Peter McKinney: research officer for the espida project at the University of Glasgow
Jasmine Kelly: a former research assistant at the Centre for Computing in the Humanities, King’s College London

This fabulous band of writers and researchers were led by Marilyn Deegan and Simon Tanner, both based out of the King’s College London.

Published in 2006, this is one of the most comprehensive and up to date books I found on the subject. The book starts out with two chapters addressing the basic issues related to digital preservation. Subsequent chapters present information about all kinds of metadata, web archiving, the costs of digital preservation and an overview of European approaches. The final chapter presents an extensive series of case studies – complete with URLs to give you plently of information online to explore.

This book gave me a great foundation from which to explore the details of various geospatial data and GIS archiving efforts. For those faced with the challenge of planning for digital preservation, the two chapters on costs should be very useful. So many articles talk about how it will be so expensive to ensure proper digital preservation, but don’t give people in the field any practical advice in planning for the costs – this book is different. The exploration of existing approaches being used at major institutions throughout Europe give a good sense of evolving standards and best practices.

If you are looking for a way to get a handle on the issues involved in digital preservation – this a great starting point. The final chapter on case studies alone could keep you busy for a month as you explore all the websites of projects from around the world. While the book has a decidedly European focus, the concepts are applicable the world over. If you are responsible for ensuring that digital records (either digitized or born digital) are protected and preserved – this book explains the basics and explores various strategies. They don’t oversimplify things – but take the time to explain things well. They are honest about those questions that aren’t answered yet… and they point to as many resources, standards and examples as they can. While Digital Preservation cannot provide a formula for everyone to follow, it can help you start asking the right questions and begin to understand the possibilities.

The Edges of the GIS Electronic Record

January 2, 2007 3 Comments

I spent a good chunk of the end of my fall semester writing a paper ultimately titled “Digital Geospatial Records: Challenges of Selection and Appraisal”. I learned a lot – especially with the help of archivists out there on the cutting edge who are trying to find answers to these problems. I plan on a number of posts with various ideas from my paper.

To start off, I want to consider the topic of defining the electronic record in the context of GIS. One of the things I found most interesting in my research was the fact that defining exactly what a single electronic record consists of is perhaps one of the most challenging steps.

If we start with the SAA’s glossary definition of the term ‘record’ we find the statement that “A record has fixed content, structure, and context.” The notes go on to explain:

Fixity is the quality of content being stable and resisting change. To preserve memory effectively, record content must be consistent over time. Records made on mutable media, such as electronic records, must be managed so that it is possible to demonstrate that the content has not degraded or been altered. A record may be fixed without being static. A computer program may allow a user to analyze and view data many different ways. A database itself may be considered a record if the underlying data is fixed and the same analysis and resulting view remain the same over time.

This idea presents some major challenges when you consider data that does not seem ‘fixed’. In the fast moving and collaborative world of the internet, Geographic Information Systems are changing over time – but the changes themselves are important. We no longer live in a world in which the way you access a GIS is via a CD which has a specific static version of the map data you are considering.

One of the InterPARES 2 case studies I researched for my paper was the Preservation of the City of Vancouver GIS database (aka VanMap). Via a series of emails exchanged with the very helpful Evelyn McLellan (who is working on the case study) I learned that the InterPARES 2 researchers concluded that the entire VanMap system is a single record. This decision was based on the requirement of ‘archival bond’ to be present in order for a record to exist. I have included my two favorite definitions of archival bond from the InterPARES 2 dictionary below:

archival bond
n., The network of relationships that each record has with the records belonging in the same aggregation (file, series, fonds). [Archives]

n., The originary, necessary and determined web of relationships that each record has at the moment at which it is made or received with the records that belong in the same aggregation. It is an incremental relationship which begins when a record is first connected to another in the course of action (e.g., a letter requesting information is linked by an archival bond to the draft or copy of the record replying to it, and filed with it. The one gives meaning to the other). [Archives]

I especially appreciate the second definition above because it’s example gives me a better sense of what is meant by ‘archival bond’ – though I need to do more reading on this to get a better grasp of it’s importance.

Given the usage of VanMap by public officials and others, you can imagine that the state of the data at any specific time is crucial to determining the information used for making key decisions. Since a map may be created on the fly using multiple GIS layers but never saved or printed – it is only the knowledge that someone looked at the information at a particular time that would permit those down the road to look through the eyes of the decision makers of the past. Members of the VanMap team are now working with the Sustainable Archives & Library Technologies (SALT) lab at the San Diego Supercomputer Center (SDSC) to use data grid technology to permit capturing the changes to VanMap data over time. My understanding is that a proof of concept has been completed that shows how data from a specific date can be reconstructed.

In contrast with this approach we can consider what is being done to preserve GIS data by the Archivist of Maine in the Maine GeoArchives. In his presentation titled “Managing GIS in the Digital Archives” delivered at the 2006: Joint Annual Meeting of NAGARA, COSA, and SAA on August 3, 2006, Jim Henderson explained their approach of appraising individual layers to determine if they should be accessioned in the archive. If it is determined that the layer should be preserved, then issues of frequency of data capture are addressed. They have chosen a pragmatic approach and are currently putting these practices to the test in the real world in an ambitious attempt to prevent data loss as quickly as is feasible.

My background is as a database designer and developer in the software industry. In my database life, a record is usually a row in a database table – but when designing a database using Entity-Relationship Modeling (and I will admit I am of the “Crow’s Feet” notation school and still get a smile on my face when I see the cover of the CASE*Method: Entity Relationship Modelling book) I have spent a lot of time translating what would have been a single ‘paper record’ into the combination of rows from many tables.

The current system I am working on includes information concerning legal contracts. Each of these exists as a single paper document outside the computers – but in our system we distribute information that is needed to ‘rebuild’ the contract into many different tables. One for contact information – one for standard clauses added to all the contracts of this type – another set of tables for defining financial formulas associated with the contract. If I then put on my archivist hat and I didn’t just choose to keep the paper agreement, I would of course draw my line around all these different records needed to rebuild the full contract. I see that there is a similar definition listed as the second definition on the InterPARES 2 Terminology Dictionary for the term ‘Record‘:

n., In data processing, a grouping of interrelated data elements forming the basic unit of a file. A Glossary of Archival and Records Terminology (The Society of American Archivists)

Just in this brief survey we can see three very different possible views on where to draw a line around what constitutes a single Geographic Information System electronic record. Is it the entire database, a single GIS layer or some set of data elements which create a logical record? Is it worthwhile trying to contrast the definition of a GIS record with the definition of a record when considering analog paper maps? I think the answer to all of these questions is ‘sometimes’.

What is especially interesting about coming up with standard approaches to archiving GIS data is that I don’t believe there is one answer. Saying ‘GIS data’ is about as precise as saying ‘database record’ or ‘entity’ – it could mean anything. There might be a best answer for collaborative online atlases.. and another best answer for state government managed geographic information library.. and yet another best answer for corporations dependent on GIS data for doing their business.

I suspect that it will be via thorough analysis of the information stored in a GIS system, how it is/was created, how often it changes and how it was used that will determine the right approach for archiving these born digital records. There are many archivists (and IT folks and map librarians and records managers) around the world who have a strong sense of panic over the imminent loss of geospatial data. As a result, people from many fields are trying different approaches to stem the loss. It will be interesting to consider these varying approaches (and their varying levels of success) over the next few years. We can only hope that a few best practices will rise to the top quickly enough that we can ensure access to vital geospatial records in the future.

SAA 2006: Research Library Group Roundtable – Internet Archiving

August 29, 2006

Late in the afternoon on Thursday August 3rd I attended the Research Library Group Roundtable at SAA 2006. It was an opportunity for RLG to share information with the archival community about their latest products and services. This session included presentations on the Internet Archive , Archive-It and the Web Archives Workbench.

After some brief business related to the SAA 2007 program committee and the rapid election of Brian Stevens from NYU Archives as the new chair of the group, Anne Van Camp spoke about the period of transition as RLG merges with OCLC. In the interest of the blending of cultures – she told a bar joke (as all OCLC meetings apparently begin). She explained that RLG products and services will be integrated into the OCLC product line. RLG programs will continue as RLG becomes the research arm for the joined interest areas of libraries, archives and museums. This has not existed before and they believe it will be a great chance to explore things in ways that RLG hasn’t had the opportunity to do in the past.

The initiatives on their agenda:

archival gateways: convened 2 meetings recently. The first to see if there is a way to be interactive with international archive databases and the second to bring regional archives together to see how they can work together.
web archiving: started looking at it from a service point of view, but also some community issues that have to be worked out around web archiving. Looking at big problems that will need community involvement – issues like metadata and selection.
standards: continuing to support EAD, pursuing rigorous agenda regarding EAC
OCLC has a whole group of people who works on registries (where you put information about organizations). RLG has talked about building a registry on top of Archive Grid of US archives.

In her introduction, Merrilee (frequent poster on hangingtogether.org ) highlighted that there are lots of questions about the intellectual side of web archiving (vs the technical challenges) such as:

what to archive?
what metadata data and description is appropriate for it?
what would end users of web archives need? How would they use a web archive?
what about collaborative collection development? It is expensive to archive the web – how does an institution say “I am archiving this corner of the web – this deep – this often”. This information should be publicly available for others doing research and others archiving the web.

She pointed out that RLG is happy about their work with Internet Archive – they are doing work to make the technical side easier but they understand that there is a lot for the archival community to sort out.

Next up was Kristine Hanna of the Internet Archive giving her presentation ‘Archiving and Preserving the Web’. The Internet Archive has been working with RLG this year and they need information from the users in the RLG community. They are looking into how they are going to work with OCLC and have applied for an NDIIP grant.

The Internet Archive (IA), founded by Brewster Kahle in 1996, is built on open source principles and dedicated to Open Source software.

What do they collect in the archive? Over 2 billion pages a month in 21 languages. It is free and the largest archive on the web including 55 billion pages from 55 million sites and supporting 60,000 unique users per day.

Why try to collect it all? They don’t feel comfortable making the choices about appraisal. And at risk websites and collections are disappearing all the time. The average lifespan of a web page is 100 days. They did a case study of crawling websites associated with the Nigerian election – 6 months after the election 70% of the crawled sites were gone, but they live on in the archive.

How do they collect? They use these components and tools:

Heritrix – web crawler
Wayback Machine – access tools for rendering and viewing files
Nutch – search engine
Arc File – archival file format used for preservation

How do they preserve it? They keep multiple copies at different digital repositories (CA, Alexandria (Egypt), France, Amsterdam) using over 1300 server machines.

IA also does targeted archiving for partners. Institutions that want to create specific online collections or curated domain crawls can work with IA. These archives start at 100+ million documents and are based on crawls run by IA crawl engineers. The Library of Congress has arranged for an assortment of targeted archives including archives of US National Elections 2000, September 11 and the War in Iraq (not accessible yet – marked March 2003 – Ongoing). Australia arranged for archiving of the entire .au domain. Also see Purpose, Pragmatism and Perspective – Preserving Australian Web Resources at the National Library of Australia by Paul Koerbin of the National Library of Australia and published in February of 2006.

What’s Next for Internet Archive?

collaboration and partnerships
OCA – open content alliance
Multiple copies around the world

Next, Dan Avery of IA gave a 9 minute version of his 35 minute presentation on Archive-It. Archive-It is a web based annual subscription service provided by IA to permit the capture of up to 10 million pages. Kristine gave some examples of those using Archive-It during her presentation:

Indiana University – web sites
North Carolina State Archives – Government Agencies, Occupational Licensing Boards and commissions.
Library of Virginia – Jamestown 2007 commemoration and Governor Mark Warner’s last year in office. When Mark Warner was listed by the New York Times as a possible presidential candidate, this archive got lots of hits. (This brings up interesting questions of watching content that is being purposefully preserved to get an idea of what some expect for the future. Don’t be surprised by a post on this idea all by itself later. Need to think about it some more!)

He highlighted the different elements and techniques used in Archive-It: crawling, web user interface, storage, playback, text indexing and integration.

Crawling/Browsing:
- Heritrix :
  - open source java
  - Archival-quality (they preserve exactly what they get back from the server)
  - Highly configurable
- Wayback Machine :
  - lets you surf the web as it was
  - in Archive-It – each customer has their own wayback machine
  - not open source yet.. that is a work in progress
The user interface is a web application:
- collects all the info they need to do the crawling the customer requests
- schedule (monthly, daily, weekly, quarterly… etc)
- seed URLS (the starting point for archive web crawls)
- crawl parameters
NutchWAX
- extension of Nutch which is built on Lucene
- full text search plus link analysis
- can search by date instead of relevance – useful for individual archives

While there are public collections in Archive-It, logging in gives you access to personal sites: shows the total documents archived (and more), lets you check your list of active collections and set up a new collection (includes unique collection identifier). He showed some screen shots of the interface and examples (this was the first time there wasn’t a network available for his presentation – he was amused that his paranoia that forced him to always bring screen captures finally paid off!).

It was interesting seeing this presentation back to back with the general Internet Archive overview. There are lots of overlap in tools and approaches between them – but Archive-It definitely has it’s own unique requirements. It puts the tools for managing larges scale web crawling in the hands of archivists (or more likely information managers of some sort) – rather than the technical staff of IA.

The final presentation of the roundtable was by Judy Cobb – a Product Manager fromOCLC. She gave an overview of the Web Archives Workbench. (I hunted for a good link to this – but the best I came up with was acknowledgments document and the login page .)The inspiration for the creation of Workbench was the challenge of collecting from web. The Internet is a big place. It is hard to define the scope of what to archive.

Workbench is a discovery tool that will permit its users to investigate what domains should be included when crawling a website for archiving. It will ask you which domains should be included. For example, you can tell it not to crawl Adobe.com just because there is a link to it to let people download acrobat.

Workbench will let you set metadata data for your collection based on the domains you said were in scope. It will then let you appraise and rank the entities/domains being harvested, leaving you with a list of organizations or entities in scope and ranked by importance. Next it will translate a site map of what is going to be crawled, define parts of the map as series and put the harvested content and related metadata into a repository. Other configuration options permit setting how frequently you harvest various series, choosing to only get new content and requesting notification if the sitemap changes.

Workbench is currently in beta and is still under development. The 3rd phase will add the support for Richard Pierce-Moses’s Arizona Model for Web Preservation and Access. The focus of the Arizona Model is curation, not technology. It strives to find a solution somewhere between manual harvesting and bulk harvesting that is based on standard archival theories. Workbench will be open source and funded by LOC.

I wasn’t sure what to expect from the roundtable – but I was VERY glad that I attended. The group was very enthusiastic – cramming in everything they could manage to share with those in the room. The Internet Archive, Archive-It and the Web Archives Workbench represent the front of the pack of software tools intended to support archiving the web. It was easy to see that if the Workbench is integrated in with Archive-It, that it should permit archivists to start paying more attention to the identification of what should be archived rather than figuring out how to do the actual archiving.

SAA 2006 Session 103: “X” Marks the Spot: Archiving GIS Databases – Part III

August 4, 2006 3 Comments

With the famous Hitchhiker’s Guide to the Galaxy quote of “Don’t Panic!”, James Henderson of the Maine State Archives gave an overview of how they have approached archiving GIS data in his presentation “Managing GIS in the Digital Archives” (the third presentation of the ‘X Marks the Spot’ panel). His basic point is that there is no time to wait for the perfect alignment of resources and research – GIS data is being lost every day, so they had to do what they could as soon as possible to stop the loss.

Goals: preserve permanently valuable state of main official records that are in digital form – both born digital as well as those digitized for access.. and provide continuing digital access to these records

A billion dollars has been spent creating the records over 15 years, but nothing is being done to preserve it. GIS data is overwritten or deleted by agencies as information in live systems is updated with information such as new road names.

At Camp Pitt in 1999 they created a digital records management plan – but it took a long time to get to the point that they were given the money, time and opportunity to put it into action.

Overall Strategy for archiving digital records:

Born Digital: GIS & Email
Digitized Analog: Media (paper, film, analog tape) For access: researchers, agencies, Archives staff

The state being sued caused enough panic at the state level to make the people ‘in charge’ see that email needed to preserved and organized and accessible.

Some points:

what is everyone doing across the state?
Keep both native format (whatever folks have already done) – and an archival format in XML
Digitize from microfilm (send out to be done)
Create another ‘access format’

GeoArchives (special case of the general approaches diagramed above)

stop the loss (road name change.. etc)
create a prototype for others to use
a model for others to critique, improve and apply

Scope: fairly limited

preservation: data (layers, images) in GeoLibrary (forced in by legislation – agencies MUST offer data to GeoLibrary)
access: use existing geolibrary
compare layer status (boundaries, roads) at any historical time
Overly different layers (boundaries 2005, roads 2010).

GeoArchives diagram based on NARA ERA diagram
Fit into the ERA diagram very well

Project team – true collaboration. Pulled people from GeoLibrary who were enthusiastic and supportive of central IT GIs changes.

Used a survey to find out what data people wanted.

Created crosswalks with Dublin Core, MARC 21 and FGDC

Functional Requirements – there is a lot of related information – who created this data? Where did it come from? Link them to the related layers.

Appraise the data layers – at the data layer level (rather than digging in to keep some data in a layer and not other data)

Has about 100 layers – so hand appraisal is do-able (though automation would be nice and might be required after next ‘gift’).

Current plan is to embed archival records in systems holding critical operational records so that the archival records will be migrated along with the other layers. Export to XML for now.

Challenges:

communications with IT to keep the process going
documentation of applications
documentation of servers
security?
Metadata for layers must be complete and consistent with the GeoArchives manual

For more information – see ~~http://www.maine.gov/sos/arc/GeoArchives/geosearch.html~~

~~UPDATE: This link appears to not work. I will update it with a working link once I find one!~~

http://www.maine.gov/sos/arc/GeoArchives/geoarch.html (Finally got around to finding the right fix for the link!)

Thoughts on Archiving Web Sites

July 26, 2006 1 Comment

Shortly after my last post, a thread surfaced on the Archives Listserv asking the best way to crawl and record the top few layers of a website. This led to many posts suggesting all sorts of software geared toward this purpose. This post shares some of my thinking on the subject.

Adobe Acrobat can capture a website and convert it into a PDF. As pointed out in the thread above, that would loose the original source HTML – yet there are more issues than that alone. It would also loose any interaction other than links to other pages. It is not clear to me what would happen to a video or flash interface on a site being ‘captured’ by Acrobat. Quoting a lesson for Acrobat7 titled Working with the Web : “Acrobat can download HTML pages, JPEG, PNG, SWF, and GIF graphics (including the last frame of animated GIFs), text files, image maps and form fields. HTML pages can include tables, linkes, frames, background colors, text colors, and forms. Cascading Stylesheets are supported. HTML links are turned into Web links, and HTML forms are turned into PDF forms.”

I looked at a few website HTML capture programs such as Heritrix, Teleport Pro, HTTrack Web and the related ProxyTrack. I hope to take the time to compare each of these options and discover what it does when confronted with something more complicated than HTML, images or cascading style sheets. It also got me thinking about HTML and versions of browsers. It think it safe to say that most people who browse the internet with any regularity have had the experience of viewing a page that just didn’t look right. Not looking right might be anything from strange alignment or odd fonts all the way to a page that is completely illegible. If you are a bit of a geek (like me) you might have gotten clever and tried another browser to see if it looked any better. Sometimes it does – sometimes it doesn’t. Some sites make you install something special (flash or some other type of plugin or local program).

Where does this leave us when archiving websites? A website is much more than just it’s text. If the text were all we worried about I am sure you could crawl and record (or screen scrape) just the text and links and call it a day being fairly confident that text stored as a plain ASCII file (with some special notation for links) would continue to be readable even if browsers disappeared from the world. While keeping the words is useful, it also looses a lot of the intended meaning. Have you read full text journal articles online that don’t have the images? I have – and I hate it. I am a very visually oriented person. It doesn’t help me to know there WAS a diagram after the 3rd paragraph if I can’t actually see it. Keeping all the information on a webpage is clearly important. The full range of content (all the audio, video, images and text on a page) is important to viewing the information in its original context.

Archivists who work with non-print media records that require equipment for access are already in the practice of saving old machines hoping to ensure access to their film, video and audio records. I know there are recommendations for retaining older computers and software to ensure access to data ‘trapped’ in ‘dead’ programs (I will define a dead program here as one which is no longer sold, supported or upgraded – often one that is only guaranteed to run on a dead operating system). My fear is for the websites that ran beautifully on specific old browsers. Are we keeping copies of old browsers? Will the old browsers even run on newer operating systems? The internet and its content is constantly changing – even just keeping the HTML may not be enough. What about those plugins – what about the streaming video or audio. Do the crawlers pull and store that data as well?

One of the most interesting things about reading old newspapers can be the ads. What was being advertised at the time? How much was the sale price for laundry detergent in 1948? With the internet customizing itself to individuals or simply generating random ads how would that sort of snapshot of products and prices be captured? I wonder if there is a place for advertising statistics as archival records. What google ads were most popular on a specific day? Google already has interesting graphs to show the correspondence between specific keyword searches and news stories that google perceives as related to the event. The Internet Archive (IA) could be another interesting source for statistical analysis of advertising for those sites that permit crawling.

What about customization? Only I (or someone looking over my shoulder) can see my MyYahoo page. And it changes each time I view it. It is a conglomeration of the latest travel discounts, my favorite comics, what is on my favorite TV and cable channels tonight, the headlines of the newspapers/blogs I follow and a snapshot of my stock portfolio. Take even a corporate portal inside an intranet. Often a slightly less moving target – but still customizable to the individual. Is there a practical way to archive these customized pages – even if only for a specific user of interest? Would it be worthwhile to be archiving the personalized portal pages of an ‘important’ or ‘interesting’ person on a daily basis – such that their ‘view’ of the world via a customized portal could be examined by researchers later?

A wealth of information can be found on the website for the Joint Workshop on Future-proofing Institutional Websites from January 2006. The one thing most of these presentations agree upon is that ‘future-proofing’ is something that institutions should think about at the time of website design and creation. Standards for creating future-proof websites directs website creators to use and validate against open standards. Preservation Strategies for institutional website content shows insight into NARA‘s approach for archiving US government sites, the results of which can be viewed at http://www.webharvest.gov/. A summary of the issues they found can be read in the tidy 11 page web harvesting survey.

I definitely have more work ahead of me to read through all the information available from the International Internet Preservation Consortium and the National Library of Australia’s Preserving Access to Digital Information (PADI). More posts on this topic as I have time to read through their rich resources.

All around, a lot to think about. Interesting challenges for researchers in the future. The choices archivists face today often will depend on the type of site they are archiving. Best practices are evolving both for ‘future-proofing’ sites and for harvesting sites for archiving. Unfortunately, not everyone building a website that may be worth archiving is particularly concerned with validating their sites against open standards. Institutions that KNOW that they want to archive their sites are definitely a step ahead. They can make choices in their design and development to ensure success in archiving at a later date. It is the wild west fringe of the internet that are likely to present the greatest challenge for archivists and researchers.

Category: digital preservation

Book Review: Digital Preservation

The Edges of the GIS Electronic Record

SAA 2006: Research Library Group Roundtable – Internet Archiving

SAA 2006 Session 103: “X” Marks the Spot: Archiving GIS Databases – Part III

Thoughts on Archiving Web Sites