
Category: transcription

Harnessing The Power of We: Transcription, Acquisition and Tagging

In honor of Blog Action Day 2012 and its theme of ‘The Power of We’, I would like to highlight a number of successful crowdsourced projects focused on the transcription, acquisition and tagging of archival materials. Nothing I can think of embodies ‘the power of we’ more clearly than the work being done by many hands from across the Internet.

Transcription

  • Old Weather Records: “Old Weather volunteers explore, mark, and transcribe historic ship’s logs from the 19th and early 20th centuries. We need your help because this task is impossible for computers, due to diverse and idiosyncratic handwriting that only human beings can read and understand effectively. By participating in Old Weather you’ll be helping advance research in multiple fields. Data about past weather and sea-ice conditions are vital for climate scientists, while historians value knowing about the course of a voyage and the events that transpired. Since many of these logs haven’t been examined since they were originally filled in by a mariner long ago you might even discover something surprising.”
  • From The Page: “FromThePage is free software that allows volunteers to transcribe handwritten documents on-line.” A number of different projects are using this software, including the San Diego Museum of Natural History’s project to transcribe the field notes of herpetologist Laurence M. Klauber and Southwestern University’s project to transcribe the Mexican War Diary of Zenas Matthews.
  • National Archives Transcription: as part of the National Archives Citizen Archivist program, individuals have the opportunity to transcribe a variety of records. As described on the transcription home page: “letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files”.

Acquisition:

  • Archive Team: The ArchiveTeam describes itself as “a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage.” Here is an example of the information gathered, shared and collaborated on by the ArchiveTeam focused on saving content from Friendster. The rescued data is (whenever possible) uploaded to the Internet Archive and can be found here:

    Springing into action, Archive Team began mirroring Friendster accounts, downloading all relevant data and archiving it, focusing on the first 2-3 years of Friendster’s existence (for historical purposes and study) as well as samples scattered throughout the site’s history – in all, roughly 20 million of the 112 million accounts of Friendster were mirrored before the site rebooted.

Tagging:

  • National Archives Tagging: another part of the Citizen Archivist project encourages tagging of a variety of records, including images of the Titanic, architectural drawings of lighthouses and the Petition Against the Annexation of Hawaii from 1898.
  • Flickr Commons: throughout the Flickr Commons, archives and other cultural heritage institutions encourage tagging of images.

These are just a taste of the crowdsourced efforts currently being experimented with across the internet. Did I miss your favorite? Please add it below!

Blog Action Day 2009: IEDRO and Climate Change

In honor of Blog Action Day 2009’s theme of Climate Change, I am revisiting the subject of a post I wrote back in the summer of 2007: International Environmental Data Rescue Organization (IEDRO). This non-profit’s goal is to rescue and digitize at-risk weather and climate data from around the world. In the past two years, IEDRO has been hard at work. Their website has gotten a great face-lift, but even more exciting is to see how much progress they have made!

  • Weather balloon observations received from Lilongwe, Malawi (Africa) from 1968-1991: all the red on these charts represents data rescued by IEDRO — an increase from only 30% of the data available to over 90%.
  • Data rescue statistics from around the world

They do this work for many reasons – to improve understanding of weather patterns to prevent starvation and the spread of disease, to ensure that structures are built to properly withstand likely extremes of weather in the future and to help understand climate change. Since the theme for the day is climate change, I thought I would include a few excerpts from their detailed page on climate change:

“IEDRO’s mandate is to gather as much historic environmental data as possible and provide for its digitization so that researchers, educators and operational professionals can use those data to study climate change and global warming. We believe, as do most scientists, that the greater the amount of data available for study, the greater the accuracy of the final result.

If we do not fully understand the causes of climate change through a lack of detailed historic data evaluation, there is no opportunity for us to understand how humankind can either assist our environment to return to “normal” or at least mitigate its effects. Data is needed from every part of the globe to determine the extent of climate change on regional and local levels as well as globally. Without these data, we continue to guess at its causes in the dark and hope that adverse climate change will simply not happen.”

So, what does this data rescue look like? Take a quick tour through their process – from organizing papers and photographing each page to transcribing all the data and finally uploading it to NOAA’s central database. These data rescue efforts span the globe and take the dedicated effort of many volunteers along the way. If you would like to volunteer to help, take a look at the IEDRO listings on VolunteerMatch.

Sunshine Week 2009: Archives, Records and Other Online Government Information

Sunshine Week 2009 is a national initiative spearheaded by journalists to “open a dialogue about the importance of open government and freedom of information”. The Electronic Frontier Foundation (EFF) chose to mark Sunshine Week this year by announcing the release of their new tool for searching EFF’s FOIA documents. Learn more about EFF’s efforts to make open government a reality in this EFF call to action.

The Sunshine Week blog announced the release of a 2009 Survey Of State Government Information Online. The survey results explain:

Using a standardized worksheet surveyors rated each section on its usability, looking at factors such as whether the information was clearly linked, if full reports or only summaries were available, whether viewing and/or downloading was free, and whether the data were current. The categories for the survey were selected for generally serving the overall public good — the kind of information people need for their own health and well-being and that of the community.

See the worksheet for details on the categories selected for inclusion in the survey and the results for lots of interesting tidbits about exactly which states provide access (or not) to various public information online. A few very randomly selected highlights:

  • Maryland: Nursing home information, mhcc.maryland.gov/consumerinfo/nhguide, got high marks for facilitating online search and for allowing users to “compare data in a variety of ways.”
  • Iowa: The state auditor’s office reportedly offers online more than 5,000 full reports of all its audits dating back to 2001. The audits are easily accessible from tabs on the main Web page, www.auditor.iowa.gov.
  • Colorado: Bridge inspection reports in Colorado are considered public, but they are not published online. Anyone who wants to see the reports is advised to file an FOI request.

All of this made me recall my blog post about the parallel goals of journalists and archivists when considering digital public records and databases. I wanted to celebrate Sunshine Week by looking for other online sources of government information. My first stop was the website of the Council of State Archivists (CoSA). They had a couple of great resources including:

A bit further afield we find GovernmentDocs.org advertised as a “community government document reviewer system”. On their about page we read:

With the GovernmentDocs.org system, citizen reviewers can engage in the government accountability process like never before. Registered users can review and comment on documents, adding their insights and expertise to the work of the national nonprofit organizations which are partnering on this project. This new information then becomes instantly searchable. The text of each document is searchable, as well, thanks to a powerful Optical Character Recognition (OCR) functionality.

GovernmentDocs.org adds a powerful layer to government transparency and accountability by indexing documents in a user-friendly manner that is remarkably easy to share. Every page of every document has its own unique url, allowing you and other users to link to that page on blogs, send emails about the documents to friends, and expose the information to a wider audience.

Here is an example GovernmentDocs page taken from a request submitted by CREW (Citizens for Responsibility and Ethics in Washington) regarding the Endangered Species Act. Each GovernmentDocs page has a unique URL, full text transcription of the page and supports comments and reviews. The possibility of building up a community around these records is very real. I am curious to see how many citizen reviewers and comments are associated with these documents a year from now.

Please help celebrate Sunshine Week by exploring all these amazing resources!

Library of Congress Inauguration 2009 Audio and Video Project

President Taft and his wife lead the inaugural parade, 1909 (Library of Congress: Prints and Photographs Division)

Amazing how much can change in 100 years. The stereograph above, from March of 1909, shows African Americans driving the carriage that carried President and Mrs. Taft from the Capitol to lead the inauguration parade to the White House. On January 20th of 2009, Barack Obama will be the guest of honor. The American Folklife Center’s Inauguration 2009 Sermons and Orations Project aims to collect recordings, transcriptions and ephemera of speeches addressing the significance of the inauguration of Barack Obama as the first African American president.

It is expected that such sermons and orations will be delivered at churches, synagogues, mosques and other places of worship, as well as before humanist congregations and other secular gatherings. The American Folklife Center is seeking as wide a representation of orations as possible.

The Inauguration 2009 project is modeled after prior Library of Congress collection projects. Two great examples of these earlier projects are:

If you want to organize a local recording, here are the basics:

  • Recordings must be made between Friday, January 16th and Sunday, January 25th, 2009 and postmarked by February 27, 2009.
  • The project website provides the required Participant Release Form for speakers, photographers and those making the recordings.
  • The project is accepting audio recordings, video recordings, and written texts of sermons (see their detailed specifications page for information about accepted formats). Also accepted will be accompanying ephemera such as photographs and printed programs.
  • If you are sending materials to the Library of Congress, they encourage you to use FedEx, UPS, or DHL because of the danger of damage due to security screening done to USPS packages.

If you want to get a taste of other recordings held by the Library of Congress, you can spend some time browsing the fantastic list of Collections in the Archive of Folk Culture Containing Sermons and Orations provided on the project site.

So spread the word. Honor the Library of Congress’s goals by helping this collection include the perspectives of as many communities as possible. Your local religious or secular leader could have their point of view preserved as part of a snapshot of our country’s response to the Inauguration of 2009. While they hope for audio and video recordings, they are also accepting text transcriptions – so this doesn’t have to be a high tech endeavor. That said, perhaps this is the inspiration you have been waiting for to learn how to make an audio or video recording!

reCAPTCHA: crowdsourcing transcription comes to life

With a tag-line like ‘Stop Spam, Read Books’ – how can you not love reCAPTCHA? You might have already read about it on Boing Boing, NetworkWorld.com or digitizationblog – but I just couldn’t let it go by without talking about it.

Haven’t heard about reCAPTCHA yet? OK, have you ever filled out an online form that made you look at an image and type the letters or numbers that you see? These ‘verify you are a human’ challenges are used everywhere, from online concert ticket sites that don’t want scalpers to get too many of the tickets to blogs that are trying to prevent spam. What reCAPTCHA has done is harness this user effort to assist in the transcription of hard-to-OCR text from digitized books in the Internet Archive. Their website has a great explanation of what they are doing – and they include the great graphic below to show why human intervention is needed.

Why we need reCAPTCHA

reCAPTCHA shows two words for each challenge – one that it knows the transcription of and a second that needs human verification. Slowly but surely all the words OCR doesn’t understand get transcribed and made available for indexing and search.
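The two-word idea described above can be sketched in a few lines of Python. This is only an illustration of the principle, not reCAPTCHA's actual implementation: the class name, the method names and the number of matching answers required (`VOTES_NEEDED`) are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers are required before a
# transcription is accepted. This number is an assumption, not reCAPTCHA's.
VOTES_NEEDED = 3


class TwoWordChallenge:
    """Toy model of a challenge pairing a known word with an unknown one."""

    def __init__(self, votes_needed=VOTES_NEEDED):
        self.votes_needed = votes_needed
        self.votes = defaultdict(Counter)  # unknown word id -> answer counts
        self.accepted = {}                 # unknown word id -> accepted text

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the user passed the challenge.

        The user is judged only on the word whose transcription is already
        known. If that word is right, the answer for the unknown word is
        counted as one vote toward its transcription.
        """
        if typed_known.strip().lower() != known_answer.strip().lower():
            return False  # failed the control word, so ignore the other one

        guess = typed_unknown.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= self.votes_needed:
            self.accepted[unknown_id] = guess
        return True
```

In this sketch a single user can never force a bad transcription through: a word is accepted only after several people who passed the control word agree on the same answer.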

I have posted before about ideas for transcription using the power of many hands and eyes (see Archival Transcriptions: for the public, by the public) – but my ideas were more along the lines of what the genealogists are doing on sites like USGenWeb. It is so exciting to me that a version of this is out there – and I LOVE their take on it. Rather than find people who want to do transcription, they have taken an action lots of folks are already used to performing and given it more purpose. The statistics behind this are powerful. Apparently 60 million of these challenges are entered every DAY.

Want to try it? Leave a comment on this post (or any post in my blog) and you will get to see and use reCAPTCHA. I can also testify that the installation of this on a WordPress blog is well documented, fast and easy.

Archival Transcriptions: for the public, by the public

There is a recent thread on the archives listserv that talks about transcriptions – specifically for small projects or those that have little financial support. There is even a case in which there is no easy OCR answer due to the state of the digitized microfilm records.

One of the suggestions was to use some combination of human effort to read the documents – either into a program that would transcribe them, or to another human who would do the typing. It made me wonder what it would look like to make a place online where people who wanted to could volunteer their transcription time. In the case where the records are already digitized and viewable, this seems like an interesting approach.

Something like this already exists for the genealogy world over at the USGenWeb Archives Project. They have a long list of different projects listed here. Though the interface is a bit confusing, the spirit of the effort is clear – many hands make light work. Precious genealogical resources can be digitized, transcribed and added to this archive to support the research of many by anyone – anywhere in the world.

Of course in the case of transcribing archival records there are challenges to be overcome. How do you validate what is transcribed? How do you provide guidance and training for people working from anywhere in the world? If I have figured out that a particular shape is a capital S in a specific set of documents, that could help me (or an OCR program) as I progress through the documents, but if I only see one page from a series – I will have to puzzle through that one page without the support of my past experience. Perhaps that would encourage people to keep helping with a specific set of records? Maybe you give people a few sample pages with validated transcriptions to practice with? And many records won’t be that hard to read – easy for a human’s eye but still a challenge for an OCR program.
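One possible answer to the validation question is to have two volunteers transcribe the same page independently and then flag only the places where they disagree. Here is a rough Python sketch of that idea using the standard library's `difflib`. The function name and the sample sentences are invented for illustration, and a real project would need to decide who resolves the flagged words.

```python
from difflib import SequenceMatcher


def disagreements(first, second):
    """List the word runs where two transcriptions of one page differ.

    Each item is a pair: (words from the first transcription,
    words from the second transcription). An empty list means the
    two volunteers agreed on every word.
    """
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two hypothetical volunteers reading the same handwritten line.
for pair in disagreements("Sailed at dawn from Salem",
                          "Sailed at dawn from Salern"):
    print(pair)
```

Only the disputed words would then need a third pair of eyes, which keeps the reviewing work small compared to the transcribing work.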

The optimist in me hopes that it could be a tempting task for those who want to volunteer but don’t have time to come in during the normal working day. Transcribing digitized records can be done in the middle of the night in your pajamas from anywhere in the world. Talk about increasing your pool of possible volunteers! I would think that it could even be an interesting project for high school and college students – a chance to work with primary sources. With careful design, I can even imagine providing an option to select from a preordained set of subjects or tags (or, in a folksonomy-friendly environment, the option to add any tags that the transcriber deems appropriate) – though that may be another topic worthy of its own exploration independent of transcription.

The initial investment for a project like this would come from building a framework to support a distributed group of volunteers. You would need an easy way to serve up a record or group of records to a volunteer and prevent duplication of effort – but this is an old problem with good solutions from the configuration management world of software development and other collaboration work environments.
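To make the "serve up a record and prevent duplication of effort" idea concrete, here is a minimal Python sketch of a check-out queue, loosely in the spirit of the check-out/check-in model from configuration management. Everything here (the `PageQueue` class, the one-hour lease, the page ids) is a hypothetical illustration, and a real site would keep this state in a database rather than in memory.

```python
import time


class PageQueue:
    """Hands each page to one volunteer at a time.

    A page that is checked out but never submitted goes back into the
    pool once its lease expires, so abandoned pages are not lost.
    """

    def __init__(self, page_ids, lease_seconds=3600):
        self.lease_seconds = lease_seconds  # assumed lease length
        self.pending = list(page_ids)       # pages nobody is working on
        self.checked_out = {}               # page id -> (volunteer, expiry)
        self.done = {}                      # page id -> transcription text

    def check_out(self, volunteer, now=None):
        """Give the volunteer the next free page, or None if none is free."""
        now = time.time() if now is None else now
        # Return pages with expired leases to the pool.
        for page_id, (_, expires) in list(self.checked_out.items()):
            if expires <= now:
                del self.checked_out[page_id]
                self.pending.append(page_id)
        if not self.pending:
            return None
        page_id = self.pending.pop(0)
        self.checked_out[page_id] = (volunteer, now + self.lease_seconds)
        return page_id

    def submit(self, volunteer, page_id, text):
        """Accept a transcription only from the page's current holder."""
        holder = self.checked_out.get(page_id)
        if holder is None or holder[0] != volunteer:
            return False
        del self.checked_out[page_id]
        self.done[page_id] = text
        return True
```

The lease is the design choice that matters most for volunteers: nobody has to remember to "release" a page they gave up on, because the queue reclaims it on its own.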

It makes a nice picture in my mind – a slow, but steady, team effort to transcribe collections like the Colorado River Bed Case (2,125 pages of digitized microfilm at the University of Utah’s J. Willard Marriott Library) – mostly done from people’s homes on their personal computers in the middle of the night. A central website for managing digitized archival transcriptions could give the research community the ability to vote on the next collection that warrants attention. Admit it – you would type a page or two yourself, wouldn’t you?