Spellbound Blog

Session 305: Extended Archival Description Part I – Archives of American Art

August 8, 2006

Session 305 included perspectives from three digital collections that are trying to use EAD and metadata to solve real-world problems of navigation and access. This post addresses the presentation by the first speaker, Barbara Aikens from the Archives of American Art at the Smithsonian.

The Archives of American Art (AAA) has over 4,500 collections focusing on the history of American art. They received a 3.6 million dollar grant from the Terra Foundation to fund their 5 year project. They had already been using EAD as their standard for online finding aids since 2004. They had also already looked into digitizing their microfilmed holdings, and they believe that the history of microfilming at AAA made the transition to scanning entire collections at the item level easier than it might otherwise have been. So far they have digitized 11 full collections (45 linear feet).

Their organization of the digitized files was based on collection code, box and folder. Basing their template on the EAD Cookbook, AAA used NoteTab Pro to create their XML EAD finding aid. I wonder how they might be able to take advantage of the open source software tools being developed such as Archon and the Archivists’ Toolkit (if you are interested in these packages, keep your eye open for my future post looking at them each in detail). There was some mention of re-purposing DCDs, but I was not clear about what they were describing.

The resulting online finding aid lets you read all the information you would expect to find in a finding aid (see an example), as well as permitting you to drill down into each series or container to view a list of folders. Finally the folder view provides thumbnails on the left and a big image on the right. Note that this item level folder view includes very basic folder metadata and a link back to that folder’s corresponding series page. There is no metadata for any of the images of individual items. This approach for organizing and viewing digitized collections is workable for large collections. The context is well communicated and the user’s experience is much like that of going through a collection while physically visiting an archive. First you use the finding aid to locate collections of interest. Next you examine the Series and/or Container descriptions to locate the types of information for which you are looking. Finally, you can drill down to folders with enticing names to see if you can find what you need.
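To make the series, box and folder drill-down described above a little more concrete, here is a minimal sketch in Python. The XML fragment is a made-up, heavily trimmed sample that uses EAD 2002 element names (`c01`, `c02`, `did`, `unittitle`, `container`). It is not taken from an AAA finding aid, and the function name `list_folders` is my own invention.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAD-style fragment: one series with two folders.
SAMPLE = """
<dsc>
  <c01 level="series">
    <did><unittitle>Series 1: Correspondence</unittitle></did>
    <c02 level="file">
      <did>
        <container type="box">1</container>
        <container type="folder">1</container>
        <unittitle>General letters, A-C</unittitle>
      </did>
    </c02>
    <c02 level="file">
      <did>
        <container type="box">1</container>
        <container type="folder">2</container>
        <unittitle>General letters, D-F</unittitle>
      </did>
    </c02>
  </c01>
</dsc>
"""


def list_folders(xml_text):
    """Return (series title, box, folder, folder title) tuples in document order."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for series in root.findall("c01"):
        series_title = series.findtext("did/unittitle")
        for item in series.findall("c02"):
            did = item.find("did")
            # Collect the container numbers keyed by their type attribute.
            containers = {c.get("type"): c.text for c in did.findall("container")}
            rows.append((series_title, containers.get("box"),
                         containers.get("folder"), did.findtext("unittitle")))
    return rows


for row in list_folders(SAMPLE):
    print(row)
```

Each tuple carries the same context a visitor sees on the way down: which series, which box, which folder. A real finding aid has deeper nesting and many more elements than this sample.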

As an experiment, I tested the ‘Search within Collections/Finding Aids’ option by searching for “Downtown Gallery” and for gallery artist files to see if I was given a link to the new Downtown Gallery Records finding aid. My search for “Downtown Gallery” instead directed me to what appears to be a MARC record in the Smithsonian Archives, Manuscripts and Photographs catalog. Two versions of the finding aid are linked to from this record – with no indication as to how they are different (it turned out one was an old version – the other the new one which includes links to the digitized content). A bit more experimentation showed me that the new online collection finding aids are not integrated into the search. I will have to remember to try this sort of searching in a few months to see what the search experience is like.

What I was hoping for (in a perfect world) would be highlighting of the search terms and deep linking from the search results directly to the series and folder description pages. I wonder what side effects there will be for the accuracy of search results given that the series/folder detail description page does not include all the other text from the main finding aid (i.e. New Finding Aid vs New Finding Aid Series Level Page). Oddly enough – the old version of the finding aid for this same collection includes the folder level descriptions on the SAME page (with HTML anchors permitting linking from the side bar Table of Contents to the correct location on the page). So a search for terms that appear in the historical background along with the name of an artist only listed at the folder level WOULD return results (in standard text searching) for the old finding aid but not for the new one. Once the new finding aids are integrated into the search results – it would be very helpful to have an option to only return finding aids that include digitized collections.
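The search side effect described above can be shown with a small Python sketch. It assumes a plain "every term must appear on the same page" text search, which may not be how the Smithsonian search actually works. The page text and the artist name are made up for illustration.

```python
def matching_pages(pages, terms):
    """Return names of pages that contain every search term (case-insensitive)."""
    wanted = [t.lower() for t in terms]
    return [name for name, text in pages.items()
            if all(t in text.lower() for t in wanted)]


# Made-up text standing in for finding aid content.
background = "Historical note: the gallery opened in Greenwich Village."
folders = "Box 1, Folder 3: Artist file, Jane Example"

# Old layout: background and folder list share one page.
old_layout = {"finding_aid": background + " " + folders}
# New layout: the folder list lives on its own series page.
new_layout = {"finding_aid": background, "series_1": folders}

terms = ["Greenwich Village", "Jane Example"]
print(matching_pages(old_layout, terms))  # ['finding_aid']
print(matching_pages(new_layout, terms))  # []
```

One way around this would be to index each series page together with the text of its parent finding aid, though that is only one option among several.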

While exploring the folder level view, I assumed that the order of the images in the folders is the original order in the analog folder. If so, then that is a fabulous and elegant way of communicating the original order of the records to the user of the digital interface. If NOT – then it is quite misleading because a user could easily assume, as I did, that the order in which they are displayed in the folder view is the original order.

Overall, this is exciting work – and shows how well the EAD can function as a framework for the item level digitization of documents. It also points to some interesting questions about how to handle search within this type of framework.

UPDATE: See the comment below for the clarification that the new finding aids based on the work described in this presentation are NOT online yet – but should be at the end of the month (posted: 08/09/2006).

Session 510: Digital History and Digital Collections (aka, a fan letter for Roy and Dan)

August 6, 2006

There were lots of interesting ideas in the talks given by Dan Cohen and Roy Rosenzweig during their SAA session Archives Seminar: Possibilities and Problems of Digital History and Digital Collections (session 510).

Two big ideas were discussed: the first about historians and their relationship to internet archiving and the second about using the internet to create collections around significant events. These are not the same thing.

In his article Scarcity or Abundance? Preserving the Past in a Digital Era, Roy talks extensively about the dual challenges of losing information as it disappears from the net before being archived and the future challenge to historians faced with a nearly complete historical record. This assumes we get the internet archiving thing right in the first place. It assumes those in power let the multitude of voices be heard. It assumes corporately sponsored sites providing free services for posting content survive, are archived and do the right thing when it comes to preventing censorship.

The Who Built America CD-ROM, released in 1993 and bundled with Apple computers for K-12 educational use, covered the history of America from 1876 to 1914. It came under fire in the Wall Street Journal for including discussions of homosexuality, birth control and abortion. Fast forward to now when schools use filtering software to prevent ‘inappropriate’ material from being viewed by students – in much the same way that Google China filters search results. Roy shared with us the contrast of the search results from Google Images for ‘Tiananmen square’ vs the search results from Google Images China for ‘Tiananmen square’. Something so simple makes you appreciate the freedoms we often forget here in the US.

It makes me look again at the DOPA (Deleting Online Predators Act) legislation recently passed by the House of Representatives. In the ALA’s analysis of DOPA, they point out all the basics as to why DOPA is a rotten idea. Cool Cat Teacher Blog has a great point by point analysis of What’s Wrong with DOPA. There are many more rants about this all over the net – and I don’t feel the need to add my voice to that throng – but I can’t get it out of my head that DOPA’s being signed into law would be a huge step BACK for freedom of speech and learning and internet innovation in the USA. How crazy is it that at the same time that we are fighting to get enough funding for our archivists, librarians and teachers – we should also have to fight initiatives such as this that would not only make their jobs harder but also siphon away some of those precious resources in order to enforce DOPA?

In the category of good things for historians and educators is the great progress of open source projects of all sorts. When I say Open Source I don’t just mean software – but also the collection and communication of knowledge and experience in many forms. Wikipedia and YouTube are not just fun experiments – but sources of real information. I can only imagine the sorts of insights a researcher might glean from the specific clips of TV shows selected and arranged as music videos by TV show fans (to see what I am talking about, take a look at some of the videos returned from a search on gilmore girls music video – or the name of your favorite pop TV characters). I would even venture to say that YouTube has found a way to provide a method of responding to TV, perhaps starting down a path away from TV as the ultimate passive one way experience.

Roy talked about ‘Open Sources’ being the ultimate goal – and gave a final plug to fight to increase budgets of institutions that are funding important projects.

Dan’s part of the session addressed that second big idea I listed – using the internet to document major events. He presented an overview of the work of ECHO: Exploring and Collecting History Online. ECHO had been in existence for a year at the time of 9/11 and used 9/11 as a test case for their research to that point. The Hurricane Digital Memory Bank is another project launched by ECHO to document stories of Katrina, Rita and Wilma.

He told us the story behind the creation of the 9/11 digital archive – how they decided they had to do something quickly to collect the experiences of people surrounding the events of September 11th, 2001. They weren’t quite sure what they were doing – if they were making the best choices – but they just went for it. They keep everything. There was no ‘appraisal’ phase to creating this ‘digital archive’. He actually made a point a few minutes into his talk to say he would stop using the word archive, and use the term collection instead, in the interest of not having tomatoes thrown at him by his archivist audience.

The lack of appraisal prompted a question at the end of the session about where that leaves archivists who believe that appraisal is part of the foundation of archival practice. The answer was that we have the space – so why not keep it all? Dan gave an example of a colleague who had written extensively based on research done using World War II rumors they found in the Library of Congress. These easily could have been discarded as not important – but you never know how information you keep can be used later. He told a story about how they noticed that some people are using the 9/11 digital archive as a place to research teen slang because it has such a deep collection of teen narratives submitted to be part of the archive.

This reminded me of a story that Prof. Bruce Ambacher told us during his Archival Principles, Practices and Programs course at UMD. During the design phase for the new National Archives building in College Park, MD, the Electronic Records division was approached to find out how much room they needed for future records. Their answer was none. They believed that the speed at which the space required to store digital data was shrinking was faster than the rate of growth of new records coming into the archive. One of the driving forces behind the strong arguments for the need for appraisal in US archives was the sheer bulk of records that could not possibly be kept. While I know that I am oversimplifying the arguments for and against appraisal (Jenkinson vs Schellenberg, etc) – at the same time it is interesting to take a fresh look at this in the light of removing the challenges of storage.

Dan also addressed some interesting questions about the needs of ‘digital scholarship’. They got zip codes from 60% of the submissions for the 9/11 archive – they hope to increase the accuracy and completeness of GIS information in the hurricane archive by using Google Maps’ new feature to permit pinpointing latitude and longitude based on an address or intersection. He showed us some interesting analysis made possible by pulling slices of data out of the 9/11 archive and placing it as layers on a Google Map. In the world of mashups, one can see this as an interesting and exciting new avenue for research. I will update this post with links to his promised details to come on his website about how to do this sort of analysis with Google Maps. There will soon be a researcher’s interface of some kind available at the 9/11 archive (I believe in sync with the 5 year anniversary of September 11).

Near the end of the session a woman took a moment to thank them for taking the initiative to create the 9/11 archive. She pointed out that much of what is in archives across the US today is the result of individuals choosing to save and collect things they believed to be important. The woman who had originally asked about the place of appraisal in a ‘keep everything digital world’ was clapping and nodding and saying ‘she’s right!’ as the full room applauded.

So – keep it all. Snatch it up before it disappears (there were fun stats like the fact that most blogs remain active for 3 months, most email addresses last about 2 years and inactive Yahoo Groups are deleted after 6 months). There is likely a place for ‘curatorial views’ of the information created by those who evaluate the contents of the archive – but why assume that something isn’t important? I would imagine that as computers become faster and programming becomes smarter – if we keep as much as we can now, we can perhaps automate sorting it out later with expert systems that follow very detailed rules for creating more organized views of the information for researchers.

This panel had so many interesting themes that crossed over into other panels throughout the conference. The Maine Archivist talking about ‘stopping the bleeding’ of digital data loss in his talk about the Maine GeoArchives. The panel on blogging (that I will write more about in a future post). The RLG Roundtable with presentations from people over at InternetArchive and their talks about archiving everything (ALSO deserves its own future post).

I feel guilty for not managing to touch on everything they spoke about – it really was one of the best sessions I attended at the conference. I think that having voices from outside the archival profession represented is both a good reality check and great for the cross-pollination of ideas. Roy and Dan have recently published a book titled Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web – definitely on my ‘to be read’ list.

Overall Conference Impressions

August 5, 2006

I went to many sessions at the 2006 Joint Annual Meeting of NAGARA, COSA, and SAA and will add more presentation posts over the course of the next two weeks. I have 37 pages of notes in MS Word – though there is lots of white space throughout as I made bullet lists and started new pages for new presentations as I went. And some of my notes are on paper (darn that laptop battery). My first three pages of notes translated into the 3 posts I have put up so far summarizing and commenting on sessions – so I suspect it will take me a while to work my way through them. Combine that with all the ideas generated in conversations with fabulous people or that occurred to me during presentations and I have no fear about running out of ideas for posts here anytime soon.

I presented my poster “Communicating Context in Online Collections” throughout the morning on Friday. I enjoyed speaking with everyone who stopped by to get the long version of what my ideas on my poster were all about. Another plan I have is to post a version of my poster along with a full list of links to the websites I used as examples on my poster – look for it before the end of August.

My past experiences with conferences are from the technical world – I have been to and presented at more than one Oracle Open World conference. These are huge monstrous affairs which take over large city convention centers. While my first few minutes at this conference were a slightly overwhelming throng of people I didn’t know, I rapidly found people I knew and met many new people.

Being used to high tech conferences I was surprised by the lack of internet access which, while slightly frustrating for attendees, was quite mysterious in the context of presenters. No live demos of project websites or of the software many were discussing. Everyone worked around it (most had come prepared with screen shots of what they wanted to show) – it just seemed very strange.

There are some poster related things I would put on my wishlist to change for next year (speaking as a student who has never attended an SAA conference before):

opportunity to assemble my poster during non-session time
please take into account that most posters seem to be arranged in ‘landscape’ layout rather than ‘portrait’ and provide enough space for them all
more room for presenters to stand in front of their posters (there were great challenges this year with the placement of a buffet brunch table 2 feet in front of a long row of posters precisely during one of the main assigned poster presentation times)
either clear indication of when to pick up posters (again, not during session time) – or someone to take the posters to safety so they don’t end up in a pile at the back of the exhibit hall as they did this year

A big thank you to everyone I met at the conference. You made my first experience in the ‘greater archival universe’ (aka, beyond the University of Maryland) a good one. More SAA2006 posts and supporting information related to my poster coming soon.

SAA 2006 Session 103: “X” Marks the Spot: Archiving GIS Databases – Part III

August 4, 2006

With the famous Hitchhiker’s Guide to the Galaxy quote of “Don’t Panic!”, James Henderson of the Maine State Archives gave an overview of how they have approached archiving GIS data in his presentation “Managing GIS in the Digital Archives” (the third presentation of the ‘X Marks the Spot’ panel). His basic point is that there is no time to wait for the perfect alignment of resources and research – GIS data is being lost every day, so they had to do what they could as soon as possible to stop the loss.

Goals: preserve permanently valuable State of Maine official records that are in digital form – both born digital and those digitized for access – and provide continuing digital access to these records.

A billion dollars has been spent creating the records over 15 years, but nothing is being done to preserve it. GIS data is overwritten or deleted by agencies as information in live systems is updated with information such as new road names.

At Camp Pitt in 1999 they created a digital records management plan – but it took a long time to get to the point that they were given the money, time and opportunity to put it into action.

Overall Strategy for archiving digital records:

Born Digital: GIS & Email
Digitized Analog: Media (paper, film, analog tape) For access: researchers, agencies, Archives staff

The state being sued caused enough panic at the state level to make the people ‘in charge’ see that email needed to be preserved, organized and accessible.

Some points:

what is everyone doing across the state?
Keep both native format (whatever folks have already done) – and an archival format in XML
Digitize from microfilm (send out to be done)
Create another ‘access format’

GeoArchives (special case of the general approaches diagrammed above)

stop the loss (road name change.. etc)
create a prototype for others to use
a model for others to critique, improve and apply

Scope: fairly limited

preservation: data (layers, images) in GeoLibrary (forced in by legislation – agencies MUST offer data to GeoLibrary)
access: use existing geolibrary
compare layer status (boundaries, roads) at any historical time
Overlay different layers (boundaries 2005, roads 2010).

GeoArchives diagram based on NARA ERA diagram
Fit into the ERA diagram very well

Project team – true collaboration. Pulled people from GeoLibrary who were enthusiastic and supportive of central IT GIS changes.

Used a survey to find out what data people wanted.

Created crosswalks with Dublin Core, MARC 21 and FGDC
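For readers who have not worked with a crosswalk before, here is a minimal Python sketch of the idea: a table that says where each element in one scheme lives in the others. The four mappings below are common pairings between Dublin Core, MARC 21 and FGDC (CSDGM) that I am using for illustration. They are not the Maine GeoArchives crosswalk, and the names `CROSSWALK` and `translate` are my own.

```python
# Illustrative crosswalk only, keyed by Dublin Core element name.
# A real crosswalk covers many more elements and has rules for repeats and qualifiers.
CROSSWALK = {
    "title":       {"marc21": "245 $a", "fgdc": "idinfo/citation/citeinfo/title"},
    "creator":     {"marc21": "100 $a", "fgdc": "idinfo/citation/citeinfo/origin"},
    "description": {"marc21": "520 $a", "fgdc": "idinfo/descript/abstract"},
    "date":        {"marc21": "260 $c", "fgdc": "idinfo/citation/citeinfo/pubdate"},
}


def translate(dc_record, target):
    """Re-key a Dublin Core style dict using the target scheme's field names.

    Returns (mapped, unmapped) so that nothing is silently dropped.
    """
    mapped, unmapped = {}, {}
    for element, value in dc_record.items():
        fields = CROSSWALK.get(element)
        if fields and target in fields:
            mapped[fields[target]] = value
        else:
            unmapped[element] = value
    return mapped, unmapped


# Hypothetical layer description.
layer = {"title": "Road centerlines", "date": "2005", "rights": "Public"}
print(translate(layer, "fgdc"))
print(translate(layer, "marc21"))
```

Returning the unmapped elements separately is a deliberate choice here: an element with no home in the target scheme is exactly the kind of thing an archivist would want to review by hand.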

Functional Requirements – there is a lot of related information – who created this data? Where did it come from? Link them to the related layers.

Appraise the data layers – at the data layer level (rather than digging in to keep some data in a layer and not other data)

Has about 100 layers – so hand appraisal is do-able (though automation would be nice and might be required after next ‘gift’).

Current plan is to embed archival records in systems holding critical operational records so that the archival records will be migrated along with the other layers. Export to XML for now.

Challenges:

communications with IT to keep the process going
documentation of applications
documentation of servers
security?
Metadata for layers must be complete and consistent with the GeoArchives manual

For more information – see ~~http://www.maine.gov/sos/arc/GeoArchives/geosearch.html~~

~~UPDATE: This link appears to not work. I will update it with a working link once I find one!~~

http://www.maine.gov/sos/arc/GeoArchives/geoarch.html (Finally got around to finding the right fix for the link!)

SAA 2006 Session 103: “X” Marks the Spot: Archiving GIS Databases – Part II

August 4, 2006

Richard Marciano of the SALT interdisciplinary lab (Sustainable Archives & Library Technologies) at the San Diego Supercomputer Center delivered a presentation titled “Research Issues Related to Preservation of Geospatial Electronic Records” – the 2nd topic in the ‘X’ Marks the Spot session.

He focuses on research issues related to preservation of geospatial electronic records. While not an archivist, he is a member of SAA. As a person coming to archival studies with a strong background in software development, I took great comfort in his discussion of there being a great future for IT and archivists to work together on topics such as this.

Richard gave us a great overview of the most recent work being done in this field, along with a snapshot of the latest up and coming projects on the horizon. If I had to pick one main point to emphasize, it would be that IT can provide the infrastructure to automate much of what is now being done by hand – but there is a long way to go to achieve this dream and it will require extensive collaboration between Archivists (with the experience of how things should be done) and the IT community (with the technical expertise to build the systems needed). His presentation was definitely more organized than my laundry list below – please do not take my notes below as an indication of the flow of his talk.

NHPRC Electronic Records/GIS projects:

CIESIN www.ciesin.columbia.edu/ger at Columbia University
Maine GeoArchives www.maine.gov/geoarch/index.htm Maine State Archives (see Part III of the Session 103 posts for details on the Maine GeoArchives)
eLegacy (State of California & SDSC) – California’s geospatial records archival appraisal, accessioning and preservation. Starting in 2006
InterPARES Van MAP (2005) –presentation of the City of Vancouver GIS Database

More IT related projects:

Archivists’ Workbench (2000) www.sdsc.edu/NHPRCS Methodologies for the long-term preservation of and access to software-dependent electronic records. Includes tools for GIS
ICAP (2003) www.sdsc.edu/ICAP change management
PAT (2004) www.sdsc.edu/PAT persistent archives testbed and the Michigan precinct voting records, spatial data ingestion

SDSC has a goal of infrastructure independence – they want to keep data and move it easily over time. Their current preferred approach uses Data Grids (see American Archivist Journal, volume 69 – Number 1: Building Preservation Environments with Data Grid Technology by Reagan W. Moore) which depend on the dual goals of data virtualization and trust virtualization. He recommended the SAA Electronic Records Section on Friday from 12 to 2 for good related presentations.

CIESIN www.ciesin.columbia.edu/ger at Columbia University
Common types of data loss:

loss of non-archived data
historical versions of data

North Carolina Geospatial Data Archiving Project (www.lib.ncsu.edu/ncgdap) Steve Morris – Instead of solving problems, it actually adds further complications. Complex databases can be difficult to manage over time due to complex data models, challenges of proprietary database models… has MANY levels of individual datasets or data layers.

e-Legacy – working from the California State Archives
July 2006 – July 2008
The staff is a mix of California State Archives staff and members of SDSC. They are using data grid technology to build a distributed community grid. Distributed storage permits addition of storage arbitrarily and in multiple locations.
Infrastructure is being deployed across multiple offices and the SDSC.

InterPARES VanMAP (University of British Columbia)
A big city centralized enterprise GIS system
Question of case study: What are the records? Where are the records? What do they look like – from the point of view of the city users?
What infrastructure would you need to do a historical query – to see what the city would look like on a specific date in the past? Current enterprise systems are meant to be a snapshot of the present with nothing in place to support storage of past records.

How did they approach this? They got representative data sets and put all the historical data layers into a ‘dark archive’ repository. Then they built a proof of concept: put in a date request, and the correct layers are brought back from the archive system and rendered on the fly to show the closest version of the historical map possible.
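The date request described above boils down to picking, for each layer, the right archived version. Here is a minimal Python sketch of one way to do that. I am assuming that "closest version" means the most recent snapshot taken on or before the requested date, which may not be the rule the VanMAP proof of concept used. The archive contents and the name `layers_as_of` are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical dark archive: for each layer, snapshots sorted by effective date.
ARCHIVE = {
    "roads": [(date(2001, 1, 1), "roads_2001"), (date(2004, 6, 1), "roads_2004")],
    "boundaries": [(date(2003, 3, 15), "boundaries_2003")],
}


def layers_as_of(archive, when):
    """Pick, per layer, the latest snapshot whose effective date is <= `when`.

    Layers with no snapshot that early are left out of the result.
    """
    result = {}
    for layer, snapshots in archive.items():
        dates = [effective for effective, _ in snapshots]
        # Index of the first snapshot dated after `when`.
        i = bisect_right(dates, when)
        if i:
            result[layer] = snapshots[i - 1][1]
    return result


print(layers_as_of(ARCHIVE, date(2002, 5, 1)))  # {'roads': 'roads_2001'}
print(layers_as_of(ARCHIVE, date(2005, 1, 1)))
```

The first query shows the catch: in 2002 there is no boundaries snapshot yet, so the rendered map would be missing that layer unless the system falls back to a later version.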

There is a list of 30 or so questions that is part of evaluating the system.

ICAP: preserving and using temporal and multi-versions of records
Keep track of versions of records. Being aware of a timeline of records and being able to ask significant historical questions of those records.

Took multiple time slices – and automatically create an XML database using the records from the time slices of data. XML database and spatial querying

PAT Testbed
Creating a joint consortium model for managing records across state boundaries. Distributed framework with local ‘Grid Block’ at each location. Local Storage Resources manage and populate their local resources.
Goal: how do we automate archival processes

Michigan Department of Community – preserving and accessing Michigan Historical voting records. Created a MySQL database with the records. Did automatic scrubbing and validation of records based on rules. Due to the use of GIS it permits viewing maps with data shown – red/blue voting statistics by county. Viewer permits looking at maps by election year.
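As a rough illustration of what rule-based scrubbing and validation might look like, here is a minimal Python sketch. The field names (`county`, `year`, `votes`) and the three rules are my own assumptions for illustration. I do not know what the actual PAT rules or record layout were.

```python
# Each rule is (message reported on failure, check that returns True when the record passes).
RULES = [
    ("county is missing", lambda r: bool(r["county"])),
    ("year must be an integer", lambda r: isinstance(r["year"], int)),
    ("votes must be a non-negative integer",
     lambda r: isinstance(r["votes"], int) and r["votes"] >= 0),
]


def scrub(raw):
    """Clean a raw record whose values are all strings.

    Trims whitespace, normalizes the county name, and converts year and
    votes to integers when they are plain digits (commas are removed first).
    """
    record = {key: value.strip() for key, value in raw.items()}
    record["county"] = " ".join(record["county"].split()).title()
    for field in ("year", "votes"):
        text = record[field].replace(",", "")
        record[field] = int(text) if text.isdigit() else text
    return record


def validate(record):
    """Return the messages of every rule the record fails (empty list means valid)."""
    return [message for message, passes in RULES if not passes(record)]


good = scrub({"county": "  wayne ", "year": "1960", "votes": "12,345"})
bad = scrub({"county": "", "year": "19x0", "votes": "-5"})
print(good, validate(good))
print(bad, validate(bad))
```

Keeping the rules in a plain list means new checks can be added without touching the scrubbing or validation functions.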

In response to a question, he talked about an aspect of the PAT project that looked at 401 Certification permits (related to water). They digitized all the historical records within a watershed and delivered them back to the state agency. The aim is to integrate all the government processes so that the agency can ask good questions about the permits and the related locations (upstream or downstream).

SAA 2006 Session 103: “X” Marks the Spot: Archiving GIS Databases – Part I

August 4, 2006

‘X’ Marks the Spot was a fantastic first session for me at the SAA conference. I have had a fascination with GIS (Geographic Information Systems) for a long time. I love the layers of information. I love the fact that you can represent information in a way that often makes you realize new things just from seeing it on a map.

Since my write-up of each panelist is fairly long, I will put each in a separate post.

Helen Wong Smith, from the Kamehameha Schools, started off the panel discussing her work on the Land Legacy Database in her presentation titled “Wahi Kupuna: Digitized Cultural Resources Database with GIS Access”.

Kamehameha Schools (KS) was founded by the will of Princess Bernice Pauahi Bishop. With approximately 360,000 acres, KS is the largest private landowner in the state of Hawaii. With over $7 billion in assets the K-12 schools subsidize a significant portion of the cost to educate every student (parents pay only 10% of the cost).

KS generates income from residential, commercial and resort leases. In addition to generating income – a lot of the land has a strong cultural connection. Helen was charged with empowering the land management staff to apply 5 values every time there is any type of land transaction: Economic, Educational, Cultural, Environmental and Community. They realized that they had to know about the lands they own. For example, if they take a parcel back from a long lease and they are going to re-lease it, they need to know about the land. Does it have archaeological sites? Is it a special place to the Hawai’ian people?

Requirements for the GIS enabled system:

Find the information
Keep it all in one place
Ability to export and import from other standard-based databases (MARC, Dublin Core, Open Archives Initiative)
Some information is private – not to be shared with public
GIS info
Digitize all text and images
Identify by Tax map keys (TMK)
Identify by ‘traditional place name’
Identify by ‘common names’ – surfer invented names (her favorite examples are ‘suicides’ and ‘leftovers’)

The final system would enforce the following security:

Lowest – material from public repositories, e.g. the Hawaii State Archives
Medium – material for which we’ve acquired the usage rights for limited use
Highest – leases and archaeological reports
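The three security tiers above could be enforced with a simple clearance check. This Python sketch assumes each record is tagged with exactly one level and that a user may see everything at or below their own level. The record titles and the names `Level` and `visible_records` are hypothetical, and I do not know how the Land Legacy Database actually implements this.

```python
from enum import IntEnum


class Level(IntEnum):
    LOWEST = 1   # material from public repositories
    MEDIUM = 2   # material with usage rights acquired for limited use
    HIGHEST = 3  # leases and archaeological reports


def visible_records(records, clearance):
    """Return the records whose level is at or below the user's clearance."""
    return [r for r in records if r["level"] <= clearance]


# Hypothetical records for illustration.
records = [
    {"title": "State Archives land claim", "level": Level.LOWEST},
    {"title": "Licensed photograph", "level": Level.MEDIUM},
    {"title": "Archaeological survey report", "level": Level.HIGHEST},
]

for user, clearance in [("public", Level.LOWEST), ("land manager", Level.HIGHEST)]:
    titles = [r["title"] for r in visible_records(records, clearance)]
    print(user, titles)
```

Using an ordered enum keeps the comparison readable and makes it hard to grant access by accident through a typo in a level name.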

Currently the Land Legacy Database is only available within the firewall – but eventually the lowest level of security will be made public.
They already had a web GIS portal and needed this new system to hook up to the Web GIS as well. It also needed to collect and disseminate data, images, audio/visual clips and references in all formats. In addition, the land managers needed an easy way to access information from the field, such as lease agreements or archaeological reports (native burials? Location & who they were).

Helen selected Greenstone – open source software (from New Zealand) for the following reasons:

open source
multilingual (deals with glottals and other issues with spelling in the Hawaiian language)
GNU General Public License
Software for building and distributing digital library collections
New way of organizing information
Publish it on the internet and CD-ROM
many ways of access including by Search, Titles and Genres
support for audio and video clips (Example – Felix E Grant Collection).

The project started with 60,000 TIF records (can be viewed as JPEGS) – pre-scanned and indexed by another person. Each of these ‘Claim documents’ includes a testimony and a register. It is crucial to reproduce the original primary resources to prevent confusion, such as can occur between place names and people names.

Helen showed an example from another Greenstone database of newspaper articles published in a new Hawaiian journal. It was displayed in 3 columns, one each for:

original hawaiian language newspaper as published
the text including the diacriticals
English translation

OCR would be a major challenge with these documents – so it isn’t being used.

Helen worked with programmers in New Zealand to do the customizations needed (such as GIS integration) after losing the services of the IT department. She has been told that she made more progress working with the folks from New Zealand than she would have with IT!

The screen shots were fun – they showed examples of how the Land Legacy Database data uses GIS to display layers on maps of Hawaii including outlines of TMKs or areas with ‘traditional names’. One can access the Land Legacy Database by clicking on a location on the map and selecting Land Legacy Database to get to records.

The Land Legacy Database was envisioned as a tool to compile diverse resources regarding the Schools’ lands to support decision making, e.g. about the location and destruction of cultural sites. Its evolution includes:

inclusion of internal and external records including reports conducted for and by the Schools in the past 121 years
a platform providing access to staff, faculty and students across the islands
sharing server space with the Education Division

Helen is only supposed to spend 20% of her time on this project! Her progress is amazing.

SAA2006: Joint Annual Meeting of NAGARA, COSA, and SAA

August 1, 2006

I will have my laptop with me at the SAA meeting in downtown DC later this week. My plan is to write my thoughts on my laptop as I go through the sessions over the course of the day and then post in the evenings after I get back home to the land of internet access.

I also will be sitting next to my Poster on Friday morning from 9-10am. If you want to stop by and say hello, that will be the easiest time and place to find me. My poster’s title is “Communicating Context in Online Collections” and I plan to upload a version of it to a page of this blog after the conference is over (along with links to all my sources).

Thoughts on Archiving Web Sites

July 26, 2006 1 Comment

Shortly after my last post, a thread surfaced on the Archives Listserv asking the best way to crawl and record the top few layers of a website. This led to many posts suggesting all sorts of software geared toward this purpose. This post shares some of my thinking on the subject.
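To make "the top few layers" concrete, here is a minimal sketch in Python of a depth-limited crawl. It is not how any of the tools suggested on the listserv work internally. The names `crawl`, `fetch`, `max_depth` and `LinkCollector` are all my own assumptions for illustration, and `fetch` stands in for whatever actually downloads a page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl that stays on the starting host.

    fetch is a hypothetical callable: it takes a URL and returns HTML text.
    Returns a dict mapping each visited URL to the HTML that was recorded.
    """
    host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        pages[url] = html
        if depth == max_depth:
            continue  # record this layer, but do not follow its links
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages
```

Passing `fetch` in as a parameter means the same loop can be tried against a small in-memory dictionary of pages before it is ever pointed at a live site. Note that this records only the HTML, which is exactly the limitation the rest of this post worries about.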

Adobe Acrobat can capture a website and convert it into a PDF. As pointed out in the thread above, that would lose the original source HTML – yet there are more issues than that alone. It would also lose any interaction other than links to other pages. It is not clear to me what would happen to a video or flash interface on a site being ‘captured’ by Acrobat. Quoting a lesson for Acrobat7 titled Working with the Web : “Acrobat can download HTML pages, JPEG, PNG, SWF, and GIF graphics (including the last frame of animated GIFs), text files, image maps and form fields. HTML pages can include tables, links, frames, background colors, text colors, and forms. Cascading Stylesheets are supported. HTML links are turned into Web links, and HTML forms are turned into PDF forms.”

I looked at a few website HTML capture programs such as Heritrix, Teleport Pro, HTTrack Web and the related ProxyTrack. I hope to take the time to compare each of these options and discover what it does when confronted with something more complicated than HTML, images or cascading style sheets. It also got me thinking about HTML and versions of browsers. I think it is safe to say that most people who browse the internet with any regularity have had the experience of viewing a page that just didn’t look right. Not looking right might be anything from strange alignment or odd fonts all the way to a page that is completely illegible. If you are a bit of a geek (like me) you might have gotten clever and tried another browser to see if it looked any better. Sometimes it does – sometimes it doesn’t. Some sites make you install something special (flash or some other type of plugin or local program).

Where does this leave us when archiving websites? A website is much more than just its text. If the text were all we worried about I am sure you could crawl and record (or screen scrape) just the text and links and call it a day being fairly confident that text stored as a plain ASCII file (with some special notation for links) would continue to be readable even if browsers disappeared from the world. While keeping the words is useful, it also loses a lot of the intended meaning. Have you read full text journal articles online that don’t have the images? I have – and I hate it. I am a very visually oriented person. It doesn’t help me to know there WAS a diagram after the 3rd paragraph if I can’t actually see it. Keeping all the information on a webpage is clearly important. The full range of content (all the audio, video, images and text on a page) is important to viewing the information in its original context.
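As a rough illustration of the "just the text and links" approach, here is a small Python sketch that flattens a page to plain text. The `[-> url]` and `[image: ...]` notations are made up for this example (there is no standard behind them), and the class and function names are my own. The sketch mostly serves to show how much gets thrown away.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Reduces HTML to plain text, marking each link as 'label [-> url]'."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.href_stack = []
        self.skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a":
            self.href_stack.append(attributes.get("href"))
        elif tag == "img":
            # The picture itself is lost; only its alt text and source survive.
            alt = attributes.get("alt") or ""
            src = attributes.get("src") or ""
            self.parts.append("[image: %s <%s>]" % (alt, src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append("[-> %s]" % href)

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(" ".join(data.split()))


def html_to_text(html):
    """Return the visible text of html with links and images noted inline."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Run against a page with a diagram, the output tells you that there WAS an image and where it once lived, which is precisely the frustrating situation described above.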

Archivists who work with non-print media records that require equipment for access are already in the practice of saving old machines hoping to ensure access to their film, video and audio records. I know there are recommendations for retaining older computers and software to ensure access to data ‘trapped’ in ‘dead’ programs (I will define a dead program here as one which is no longer sold, supported or upgraded – often one that is only guaranteed to run on a dead operating system). My fear is for the websites that ran beautifully on specific old browsers. Are we keeping copies of old browsers? Will the old browsers even run on newer operating systems? The internet and its content are constantly changing – even just keeping the HTML may not be enough. What about those plugins? What about the streaming video or audio? Do the crawlers pull and store that data as well?

One of the most interesting things about reading old newspapers can be the ads. What was being advertised at the time? How much was the sale price for laundry detergent in 1948? With the internet customizing itself to individuals or simply generating random ads how would that sort of snapshot of products and prices be captured? I wonder if there is a place for advertising statistics as archival records. What Google ads were most popular on a specific day? Google already has interesting graphs to show the correspondence between specific keyword searches and news stories that Google perceives as related to the event. The Internet Archive (IA) could be another interesting source for statistical analysis of advertising for those sites that permit crawling.

What about customization? Only I (or someone looking over my shoulder) can see my MyYahoo page. And it changes each time I view it. It is a conglomeration of the latest travel discounts, my favorite comics, what is on my favorite TV and cable channels tonight, the headlines of the newspapers/blogs I follow and a snapshot of my stock portfolio. Take even a corporate portal inside an intranet. Often a slightly less moving target – but still customizable to the individual. Is there a practical way to archive these customized pages – even if only for a specific user of interest? Would it be worthwhile to be archiving the personalized portal pages of an ‘important’ or ‘interesting’ person on a daily basis – such that their ‘view’ of the world via a customized portal could be examined by researchers later?

A wealth of information can be found on the website for the Joint Workshop on Future-proofing Institutional Websites from January 2006. The one thing most of these presentations agree upon is that ‘future-proofing’ is something that institutions should think about at the time of website design and creation. Standards for creating future-proof websites directs website creators to use and validate against open standards. Preservation Strategies for institutional website content shows insight into NARA’s approach for archiving US government sites, the results of which can be viewed at http://www.webharvest.gov/. A summary of the issues they found can be read in the tidy 11 page web harvesting survey.

I definitely have more work ahead of me to read through all the information available from the International Internet Preservation Consortium and the National Library of Australia’s Preserving Access to Digital Information (PADI). More posts on this topic as I have time to read through their rich resources.

All around, a lot to think about. Interesting challenges for researchers in the future. The choices archivists face today often will depend on the type of site they are archiving. Best practices are evolving both for ‘future-proofing’ sites and for harvesting sites for archiving. Unfortunately, not everyone building a website that may be worth archiving is particularly concerned with validating their sites against open standards. Institutions that KNOW that they want to archive their sites are definitely a step ahead. They can make choices in their design and development to ensure success in archiving at a later date. It is the wild west fringe of the internet that is likely to present the greatest challenge for archivists and researchers.

Paper Calendars, Palm Pilots and Google Calendar

July 20, 2006 8 Comments

In my intro archives class (LBSC 605 Archival Principles, Practices, and Programs), one of the first ideas that made a light bulb go on over my head related to the theory that archivists want to retain the original order of records. For example, if someone chose to put a series of 10 letters together in a file – then they should be kept that way. A researcher may be able to glean more information from these letters when he/she sees them grouped that way – organized as the person who originally used them organized them.

Our professor went on to explain that seeing what the person who used the records saw was crucial to understanding the original purpose and usage of those records. That took my mind quickly to the world of calendars. Years ago, a CEO of some important organization would have a calendar or datebook of some sort – likely managed by an assistant. Ink or pencil was used to write on paper. Perhaps fresh daily schedules would be typed.

Fast forward to now and the universe of the Palm Pilot and other such handy-dandy hand held and totally customizable devices. If you have one (or have seen those of a friend) you know that how I choose to look at my schedule may be radically different from the way you choose to see your schedule. Mine might have my to-do list shown on the bottom half of the screen. Yours might have little colored icons to show you when you have a conference call. The archivist asked to preserve a born digital calendar will have a lot of hard choices to make.

These days I actually use Google Calendar more often than my Palm. While it has more of a fixed layout (for the moment) – I have the option of including many external calendars (see examples at iCalShare). Right now I have listings of when new movies come out as well as the concert schedule for summer 2006 for the Wolf Trap National Park for the Performing Arts. In the old style paper calendar, a researcher would be able to see related events that the user of the calendar cared about because they would be written down right there. If someone wanted to include my Google calendar in an archive someday (or that of someone much more important!), I suspect they would be left with JUST the records I had added myself into my calendar. When I choose to display the Wolf Trap summer schedule, Google calendar asks me to wait while it loads – presumably from an externally published iCalendar or other public Google calendar source.

This has many implications for the archivist tasked with preserving the records in that Palm Pilot or Google calendar (or any of a laundry list of scheduling applications). This post can do nothing other than list interesting questions at this stage (both ‘this stage’ of my archival education as well as ‘this stage’ of consideration of born digital records in the archival field).

How important is it to preserve the appearance of the interface used by the digital calendar user?
Might printing or screen capturing a statistical sample (an entire month? an entire year?) help researchers in the future understand HOW the record creator in question interacted with their calendar – what sorts of information they were likely to use in making choices in their scheduling?
Could there be a place for preserving publicly shared calendars (like the ones you can choose to access on Google Calendar or Apple’s iCal) such that they would be available to researchers later? What organization would most likely be capable of taking this sort of task on?
Could emulators be used to permit easy access to centrally stored born digital calendars? At least one PalmOS Emulator already exists, created mainly for use by those developing software for hardware that runs the Palm operating system. It mimics how the tested software would run in the real world. Should archivists be keeping copies of this sort of software as they look to the future of retaining the best access possible to these sorts of records?
How can the standard iCalendar format be leveraged by archivists working to preserve born digital calendars?
To what degree are the schedules of people whose records will be of interest to archivists someday moving out of private offices (and even out of personally owned computers and handheld devices) and into the centralized storage of web applications such as Google Calendar?
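One reason the iCalendar question above seems promising is that the format is plain text, so events can be read without the application that created them. Here is a minimal Python sketch of that idea. It is not a complete RFC 5545 parser: it ignores property parameters, escaping rules and recurrence, and the function names and sample data are invented for illustration.

```python
def unfold(text):
    """Join folded lines: a line starting with a space or tab continues the previous one."""
    lines = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        elif raw:
            lines.append(raw)
    return lines


def read_events(text):
    """Return one dict of property name -> raw value for each VEVENT in text."""
    events, current, nested = [], None, 0
    for line in unfold(text):
        name, _, value = line.partition(":")
        name = name.split(";")[0].upper()  # drop parameters such as ;TZID=...
        if current is None:
            if name == "BEGIN" and value.upper() == "VEVENT":
                current = {}
        elif name == "BEGIN":
            nested += 1  # a component inside the event, e.g. a VALARM
        elif name == "END" and nested:
            nested -= 1
        elif name == "END":
            events.append(current)
            current = None
        elif not nested:
            current[name] = value
    return events


# Made-up sample calendar; the second SUMMARY line is a folded continuation.
SAMPLE = "\r\n".join([
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "BEGIN:VEVENT",
    "DTSTART;TZID=America/New_York:20060720T090000",
    "SUMMARY:Team meeting: quarterly",
    "  planning",
    "BEGIN:VALARM",
    "ACTION:DISPLAY",
    "END:VALARM",
    "END:VEVENT",
    "END:VCALENDAR",
])

for event in read_events(SAMPLE):
    print(event["DTSTART"], event["SUMMARY"])
```

What a sketch like this recovers is the record content only. Nothing about how the calendar looked to its owner, or which external calendars were layered on top of it, survives in these fields.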

I know that this is just a tiny bite of the kinds of issues being grappled with by Archivists around the world as they begin to accept born digital records into archives. Each type of application (scheduling vs accounting vs business systems) will pose similar issues to those described above – along with special challenges unique to each type. Perhaps if each of the most common classes of applications (such as scheduling) is tackled one by one by a designated team we can save individual archivists the pain of reinventing the wheel. Is this already happening?

Introduction

July 19, 2006 2 Comments

My name is Jeanne. I am a graduate student in an Archives program pursuing my MLS (aka, Master of Library Science). I have enjoyed all my classes to date (3) and I love the ideas that those classes have generated. Sometimes I leave class with just as many personal ideas scrawled in the margins of my notebook as class notes written on the main page. I am especially intrigued by the ways in which concepts from different fields intersect. How do ideas from my current field of software development and database design illuminate new issues, questions and concepts in the realm of archival studies?

I am particularly interested in topics related to audio and visual archival materials, digitization, description, meta-data, and retention of context in digitized collections.

So, here we are – you reading and I writing. I hope to make you think about things in a way you may not have before. I hope if you have been down the mental road I am taking and you have noticed something that I have missed, you might take a moment to point it out to me.

Please – ask questions and let me know your thoughts.