In an example of Twitter serendipity, @silverasm’s (Aditi Muralidharan) tweet pointed me to @historying’s blog post about Topic Modeling. In this post, Cameron Blevins explains the results of using the topic modeling feature of UMass Amherst’s MAchine Learning for LanguagE Toolkit (MALLET) on the text of Martha Ballard’s Diary.
I have spent a lot of time thinking about how to generate thematic overviews of groups of archival collections. My information visualization project, ArchivesZ, aims to provide ways of understanding aggregated archival description data, either from a single institution or across institutional boundaries. Now I find myself wondering if text mining with a tool like MALLET might generate smart topic groupings more elegantly than fighting with the wide range of non-standardized collection subjects.
Topic Modeling with MALLET
To get a sense of what MALLET generates, see the excerpt below from Blevins’s post:
With some tinkering, MALLET generated a list of thirty topics comprised of twenty words each, which I then labeled with a descriptive title. Below is a quick sample of what the program “thinks” are some of the topics in the diary:
- MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient
- CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt
- DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn
- GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
He goes on to explain that “MALLET also allows us to track those topics across the text.” What if, instead of text mining a diary, we pumped the descriptions of every archival collection from a single institution into MALLET? Of course, we would need a good list of stop words, including such common terms as archives, history, sources and records. But I wonder how the topics MALLET suggests would compare to the official subjects associated with each collection. Could this give us a broad overview of the topics covered by a specific repository and give us a new way to build paths to the collections based on topic?
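To make the stop word idea concrete, here is a minimal sketch in plain Python of the kind of filtering that would happen before any topic modeling. It is not MALLET itself, which has its own stop word handling at import time. The stop word lists are deliberately tiny, and the collection identifiers and descriptions are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stop lists: a few general English words plus the
# archives-specific terms suggested in the post.
GENERAL_STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on", "to", "from", "for", "with"}
ARCHIVAL_STOP_WORDS = {"archives", "history", "sources", "records"}
STOP_WORDS = GENERAL_STOP_WORDS | ARCHIVAL_STOP_WORDS


def tokenize(description):
    """Lowercase a description, then drop stop words and one-letter tokens."""
    words = re.findall(r"[a-z]+", description.lower())
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]


# Invented sample descriptions, for illustration only.
descriptions = {
    "coll-001": "Records of the county midwifery association and birth registers",
    "coll-002": "Sources for the history of church meetings and sermons",
}

# One token list per collection, ready to hand to a topic modeling tool.
tokens_by_collection = {cid: tokenize(text) for cid, text in descriptions.items()}

# Overall term frequencies, useful for spotting further stop word candidates.
term_counts = Counter(t for toks in tokens_by_collection.values() for t in toks)
```

Looking at the most frequent terms in `term_counts` across a real set of descriptions would be one way to find more words that belong on the archives-specific stop list.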
Auto-Classification Using Castanet
Text miner Aditi Muralidharan also posted recently on this theme in “Castanet: automatically generating a browsing structure for a collection”, where she explains:
Castanet automatically carves a sub-structure from the hierarchical concept dictionary, WordNet (http://wordnet.princeton.edu), and matches items in the collection to one or many appropriate places within that hierarchy. Then, after some automated trimming and flattening, the result is a hierarchical browsing system.
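The carve-then-flatten idea in that description can be illustrated with a toy sketch. This is not Castanet's code and does not use WordNet: the hypernym paths below are hand-written stand-ins for dictionary lookups, the item names are invented, and the only "trimming" shown is skipping any category that has exactly one child and no items of its own.

```python
# Hand-written hypernym paths standing in for WordNet lookups (hypothetical data).
HYPERNYM_PATHS = {
    "horse trading": ["act", "group action", "commerce", "horse trading"],
    "auction": ["act", "group action", "commerce", "auction"],
    "parade": ["act", "group action", "ceremony", "parade"],
}


def build_tree(items_by_term):
    """Merge each term's hypernym path into one nested dict of categories."""
    tree = {}
    for term, items in items_by_term.items():
        node = tree
        for label in HYPERNYM_PATHS[term]:
            node = node.setdefault(label, {"_items": []})
        node["_items"].extend(items)
    return tree


def flatten(node):
    """Skip categories that have exactly one child and no items of their own."""
    result = {"_items": list(node.get("_items", []))}
    for label, child in node.items():
        if label == "_items":
            continue
        flat = flatten(child)
        sub_labels = [k for k in flat if k != "_items"]
        if len(sub_labels) == 1 and not flat["_items"]:
            # Promote the only child in place of its parent category.
            result[sub_labels[0]] = flat[sub_labels[0]]
        else:
            result[label] = flat
    return result


# Invented items tagged with terms, for illustration only.
items_by_term = {
    "horse trading": ["img1", "img2"],
    "auction": ["img3"],
    "parade": ["img4"],
}
browse_tree = flatten(build_tree(items_by_term))
```

In this toy data the single-child categories "act" and "ceremony" disappear, leaving "group action" at the top with "commerce" and "parade" beneath it.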
I have heard of Castanet before via the Flamenco Search Interface Project. Apparently Muralidharan did a project using Castanet last summer to create a category system for Flickr Commons images based on the images’ tags, which is then rendered using a Flamenco interface. I include a partial screen shot below to give you a taste of what the navigation of images feels like a few levels down in the hierarchy. I love the classification of ‘Group Action’ then filtered by a sub-classification of ‘Commerce’. The first images shown are of ‘horse trading’ – with additional headings and images beneath them as well as additional filter options on the left.
What If?
What if we pulled all the English language archival descriptions from around the world as our original data set? If we used this data for topic modeling, our subject clusters would be cross-institutional. Maybe we could map the subjects assigned by each local institution to the topics generated by the topic model for each collection and get a sort of automated crosswalk for finding related collections. If we used the locally assigned subjects from the archival descriptions for Castanet-style auto-classification, maybe we could generate a way to hierarchically browse collections topically.
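As a rough sketch of what such a crosswalk might look like, the Python below counts how often each locally assigned subject co-occurs with each topic. Everything here is invented for illustration: the collection identifiers, the subjects, and the simplification that each collection has a single dominant topic (a real topic model would give each collection a mix of topics).

```python
from collections import Counter, defaultdict

# Invented data: (collection id, locally assigned subjects, dominant topic label).
collections = [
    ("inst-a/001", ["Obstetrics", "Women"], "MIDWIFERY"),
    ("inst-b/014", ["Midwives"], "MIDWIFERY"),
    ("inst-b/020", ["Agriculture"], "GARDENING"),
    ("inst-a/007", ["Women", "Horticulture"], "GARDENING"),
]


def build_crosswalk(rows):
    """Count how often each local subject co-occurs with each topic."""
    crosswalk = defaultdict(Counter)
    for _cid, subjects, topic in rows:
        for subject in subjects:
            crosswalk[subject][topic] += 1
    return crosswalk


def related_collections(rows, collection_id):
    """Return the other collections that share this collection's dominant topic."""
    topic_of = {cid: topic for cid, _subjects, topic in rows}
    target = topic_of[collection_id]
    return [cid for cid in topic_of if cid != collection_id and topic_of[cid] == target]


crosswalk = build_crosswalk(collections)
```

Here `related_collections(collections, "inst-a/001")` links a collection at one institution to `inst-b/014` at another, even though the two share no locally assigned subject.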
Both MALLET and Flamenco are open source (I am not sure of the status of Castanet) and, as I discovered working on ArchivesZ, many institutions will share their archival description data for a good cause. So – is this a good cause? I need to tease these ideas out a bit more, but what do you all think of it at first blush? Feasible? Interesting? Worthwhile experiments?
Image Credits: MALLET logo from MALLET homepage. Images in screen shot from Flickr Commons with no known copyright.
Castanet isn’t any “source” right now. Like most academic algorithms, anyone is welcome to implement it, although there is a link to some code on the flamenco.berkeley.edu website.
– Aditi
Thanks for the clarification, Aditi!
I did find this page about Castanet, including a link to this paper: Automating Creation of Hierarchical Faceted Metadata (from April 2007).
I certainly think this idea has merit, although it may end up revealing more about the particular word choices repositories and archivists make in describing their collections. In my (admittedly limited) experience, archivists are more inclined to express the “aboutness” of a collection in terms of the collecting priorities of the repository for which they work. Still, it would be fascinating to see what various repositories think they have…
I think this is a great idea. This can be called “collection-level description”. Here’s a description of SCONE, another project making this happen at the multi-collection level in Scotland: http://blogs.talis.com/panlibus/archives/2008/10/scotlands-information.php See especially Heaney: “1.1 The information landscape can be seen as a contour map in which there are mountains, hillocks, valleys, plains and plateaux. A large general collection of information – say a research library – can be seen as a plateau, raised above the surrounding plain. A specialized collection of particular importance is like a sharp peak. Upon a plateau there might be undulations representing strengths and weaknesses.
1.2 The scholar surveying this landscape is looking for the high points. …
1.3 The landscape is, however, multidimensional. Where one scholar may see a peak another may see a trough. The task is to devise mapping conventions which enable scholars to read the map of the landscape fruitfully, at the appropriate level of generality or specificity.”
SCONE’s about page is a rich source of related readings.