Posted by Richard M. Davis on June 24th, 2009
JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary – scholarly, actual, political, or personal – which merit keeping no less than famous and not-so-famous journals and diaries of the past.
Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.
Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?
So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.
This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily setup by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.
ArchivePress has its own website and blog, that will build a cumulative picture of its results and the issues it encounters over the next 6 months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.