ArchivePress: When One Size Doesn’t Fit All

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs containing valuable information or commentary – scholarly, factual, political, or personal – which merit keeping no less than the famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only readily available tools are specialised ones – such as Web Curator Tool and PANDAS – that use crawlers like Heritrix and HTTrack to copy websites, harvesting the HTML framework and following hyperlinks to gather embedded or linked content. The typical result is a collection of ARC/WARC files (formats designed specifically to encapsulate the results of web crawls) containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow exactly the same kind of selection as a live blog: picking out posts by author, date, category or tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation than endless, often duplicate, copies of just one particular view of that data?
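To make that concrete, here is a minimal sketch of the kind of selection an archive of structured blog data would permit. It assumes a hypothetical SQLite table of archived posts (with author, published-date, title and tags columns); it is purely illustrative, not ArchivePress code or schema.

```python
# Illustration only: the kinds of selection a live blog offers, run against a
# hypothetical 'posts' table in an archive database (schema assumed for the example).
import sqlite3

db = sqlite3.connect("blog_archive.db")

# All of one author's posts, newest first (e.g. "show me all Brian's posts").
by_author = db.execute(
    "SELECT title, published FROM posts WHERE author = ? ORDER BY published DESC",
    ("Brian",),
).fetchall()

# All posts from a given month (e.g. June 2009), assuming ISO-style date strings.
by_month = db.execute(
    "SELECT title, author FROM posts WHERE published LIKE ?",
    ("2009-06%",),
).fetchall()

# Posts carrying a given tag (naive substring match, assuming tags are stored
# as comma-separated text).
by_tag = db.execute(
    "SELECT title FROM posts WHERE tags LIKE ?",
    ("%preservation%",),
).fetchall()

db.close()
```

None of this is possible against a folder of crawled HTML snapshots without re-parsing every page; against structured data it is a one-line query.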

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
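As a rough illustration of the feed-reading approach (not ArchivePress itself), the sketch below polls a blog's newsfeed with the Python feedparser library and files each post in a local SQLite store, keyed on the feed GUID so that repeated polls accumulate new posts without duplicating earlier captures. The feed URL and table layout are assumptions made for the example.

```python
# Rough sketch only: poll a blog's newsfeed and accumulate posts in a local
# store, keyed on GUID so repeated polls never duplicate earlier captures.
# The feed URL and schema are assumptions for illustration, not ArchivePress code.
import sqlite3

import feedparser  # third-party library: pip install feedparser

FEED_URL = "http://example.org/blog/feed/"  # hypothetical WordPress feed URL

db = sqlite3.connect("blog_archive.db")
db.execute(
    """
    CREATE TABLE IF NOT EXISTS posts (
        guid      TEXT PRIMARY KEY,  -- stable identifier supplied by the feed
        title     TEXT,
        author    TEXT,
        published TEXT,
        link      TEXT,
        content   TEXT
    )
    """
)

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    db.execute(
        "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?, ?, ?)",
        (
            entry.get("id") or entry.get("link"),  # GUID, falling back to permalink
            entry.get("title", ""),
            entry.get("author", ""),
            entry.get("published", ""),
            entry.get("link", ""),
            entry.get("summary", ""),  # full text would need a full-content feed
        ),
    )
db.commit()
db.close()
```

Run on a schedule close to the blog's own publication rhythm, a harvester along these lines captures posts as they appear, rather than as periodic whole-site snapshots.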

In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects to school and college blogs.

ArchivePress has its own website and blog, which will build a cumulative picture of its results and the issues it encounters over the next six months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

This entry was posted in Challenges, Events, missinglinks09, Software, Web 2.0.

About Richard M. Davis

I work on Preservation Projects and Services in ULCC's Digital Archives Department, where my responsibilities include managing and developing our e-Repositories service. I've also worked as project manager, researcher and developer on several web-based preservation and archives projects, including the National Digital Archive of Datasets (NDAD), Significant Properties of E-learning Objects (SPeLOs), Social Networking Extensions for Eprints (SNEEP), and JISC-PoWR (Preservation of Web Resources). I also edit our excellent departmental blog. I have worked with database systems since 1986 and Web systems since 1996, when I developed a pilot intranet for the Forensic Science Laboratory in London, before joining ULCC in 1997. Since 2006, I have been studying part-time, on-line, for Edinburgh University's MSc in E-learning.

11 thoughts on “ArchivePress: When One Size Doesn’t Fit All”

  1. Pingback: ResourceShelf » Blog Archive » UK: A New Project to Learn What it Takes to Archive Blog Content

  2. Owen Stephens

    I’m a little bit sceptical about this. What is the actual worth of ‘preserving’ the database? The set of HTML pages that make up the blog is the publication – if you want to preserve the publication, then surely this is what you should preserve? We might need to get cleverer about how we do this, but I’m not sure that the solution is what you describe here.

    If you want to preserve the database as an artifact (which I’m not sure about as a general approach, but can see you might want to do this in some circumstances) then the only way to do this is to preserve the database – you can’t ‘recreate’ it from the feeds.

    The more I think about this, I’m afraid the more problems I can see. What happens if I make a change to an old post? I can say that it is very standard practice for the web page representing that post to be updated (otherwise what’s the point) – but I don’t think you would see a consistent behaviour in the feeds from all packages – does an updated post from 3 years ago appear in the blog’s RSS feed?

    OK – so answering some of these questions is the point of the project – fair enough, and I don’t want to be overly negative about this – but I just can’t see the justification for doing this over harvesting the HTML. Could you post some more information on why you think creating a database structure (which may or may not bear some resemblance to the original database structure) for the content is important for archiving purposes?

  3. Richard M. Davis

    Hi Owen

    I think I’m clear on the idea that harvesting the HTML is not enough: what I believe we have in blogs are data, and data objects, and so the idea is to treat them as such. If I ask the version of this blog in the UK Web Archive to ‘show me all Brian’s posts’, or ‘show me all the posts from June 2009’, it’s dumb. With a database (relational or XML) I should be able to do that quite easily (as I can by querying the backend database to this blog, or by wrapping that query in a web script).

    Over an aggregated accumulation of blogs from an institution, I think that if we can avoid losing the flexibility that the implicit data structure of blogs gives us, we should: all sorts of semantic, text-mining possibilities might await. To settle for a screen-scraping solution is to lose that: we might as well just PDF everything.

    We’ll be posting more thoughts on the data structure on the ArchivePress blog, but I don’t see it as ‘creating a database structure’ – all blog systems share common assumptions about data and metadata, in order to map to the key elements of RSS/Atom (or is it the other way round?!) and that’s kind of what we’re focusing on.

    If you like you can blame Brian 🙂 who told me a while back that a significant proportion of his readers consumed his blog via feed readers, not the web pages themselves; and Chris Rusbridge’s suggestion that “blogs represent an area where the content is primary and design secondary”: our premise lurks somewhere between those two ideas.

    The versioning aspect I’m also interested in, as it relates to the archival issues of authenticity and reliability: blog posts are increasingly cited and yet highly susceptible to change or deletion. On balance, I’d prefer to easily cite blog posts in a stable archive rather than in the wild, and I hope the project might help us get closer to that outcome.

    Maureen will probably have more and better ideas. I’m looking forward to discussing it more over the next 6 months.

  4. Owen Stephens

    Thanks Richard. You’ll be glad to know I’ve challenged Brian on this as well – so I do blame him 🙂

    The use cases you describe here definitely help me understand more of what you see as the advantages – but would fall outside what I would think of as ‘preservation’ – perhaps it is agreement on what it means to ‘preserve’ a website that needs unpicking.

    I think there are some distinctions to be made between how you capture the content, how you store it, and how you can manipulate it. My instinct is that crawling HTML to capture the content is a more reliable way of getting the full picture for any particular blog at any particular time – and of course, if you capture the full content of the site, you get the RSS feeds as well. You could then decide to store this in a structured format if you wanted to do the kind of queries you describe.

    I know Maureen promised a blog post on the project, so I’m looking forward to reading that as well.

  5. Richard M. Davis

    Hi Owen… Note we’re not talking about websites as a whole, just blogs, which I think are a sufficiently distinct phenomenon to merit special attention.

    Our premise is that blog posts are atomic events, data packets, that /might/ be more effectively captured cumulatively. Crawling is a very intensive process, typically at arbitrary intervals, often unrelated to the life-cycle of information in the target site, and frequently results in both unnecessary duplication and unwanted gaps. I’d be interested to know what the UK Web Archive is doing with any newsfeeds it harvests – but I might argue that capturing (A) HTML and (B) Newsfeeds is unnecessary duplication, if you can capture the underlying data and the queries/templates that render (A) and (B). (To say nothing of (C) the WAP view, and probably some others too). I admit this is a data-centric view – no doubt influenced by my database background, and time in a dataset archive.

    As for the preservation, whether crawling or A-Pressing, that’s to some extent about what one does next, whether one has a relational database or a bunch of (W)ARC files. My guess is, also, that your average IT department will find it easier to understand the former than the latter.

    FYI, Ed Pinsent, our resident archivist and Man From UKWAC, has just offered some of his preliminary thoughts relating to the crawling approach.

  6. Pingback: Latest bloging software news - JISC-PoWR » Blog Archive » ArchivePress: When One Size Doesn’t Fit All | Best Firepow Review

  7. Pingback: ArchivePress » Blog Archive » Our first month

  8. Pingback: JISC-PoWR » Blog Archive » ArchivePress Blog Preservation Survey

  9. Pingback: Our first month « ArchivePress + APrints

  10. Pingback: Thoughts about blog data and metadata | BlogForever

  11. Pingback: BlogForever: Thoughts about blog data and metadata | ulcc da blog
