Category Archives: Technologies

ArchivePress: When One Size Doesn’t Fit All

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary – scholarly, factual, political, or personal – which merit keeping no less than the famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to try, the only readily available tools are specialised ones – like Web Curator Tool and PANDAS – that use crawlers such as Heritrix and HTTrack to copy websites by harvesting the HTML framework and following hyperlinks to gather further embedded or linked content. The result is typically a set of ARC/WARC files (formats designed specifically to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.
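As a concrete illustration of what a crawl-based archive actually gives you, here is a minimal sketch – assuming the Python warcio library and a local file called crawl.warc.gz, both my own illustrative choices rather than anything these tools mandate – that walks the response records a crawler like Heritrix produces:

```python
# Minimal sketch: list the captured URLs in a (hypothetical) crawl file.
# Assumes the warcio library; ArchiveIterator also reads older ARC files.
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            ctype = (record.http_headers.get_header("Content-Type")
                     if record.http_headers else None)
            print(url, ctype)
```

Each record is one snapshot of one URL at one moment – useful, but a long way from the structured data sitting underneath a blog.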

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
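To give a flavour of what feed-reading aggregation involves – this is an illustrative Python sketch, not ArchivePress code; the feed URLs and the in-memory store are placeholders – the core loop is simply: poll each feed, skip entries already seen, keep the rest.

```python
# Illustrative sketch of feed-based capture (not ArchivePress itself).
# Assumes the feedparser library; the URLs below are hypothetical.
import feedparser

feeds = [
    "http://example.org/blog/feed",           # posts feed
    "http://example.org/blog/comments/feed",  # comments feed
]
archive = {}  # entry id -> captured entry (a real system would use a database)

for feed_url in feeds:
    parsed = feedparser.parse(feed_url)
    for entry in parsed.entries:
        key = entry.get("id") or entry.get("link")
        if key in archive:
            continue  # already captured on an earlier poll
        archive[key] = {
            "title": entry.get("title"),
            "author": entry.get("author"),
            "published": entry.get("published"),
            "content": entry.get("summary"),
        }

print(len(archive), "entries captured")
```

Run on a schedule, that is essentially all the harvesting side amounts to; the interesting work lies in the storage, management and access wrapped around it.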

In some respects there seems to be nothing terribly new here for anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects to school and college blogs.

ArchivePress has its own website and blog, which will build a cumulative picture of its results and the issues it encounters over the next six months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

Set a blog to catch a blog…

Much discussion of blog preservation focuses on how to preserve the blogness of blogs: how can we make a web archive store, manage and deliver preserved blogs in a way that is faithful to the original?


Since it is blogging applications that provide this structure and behaviour (usually from simple database tables of Posts, Comments, Users, etc.), perhaps we should consider making blogging software behave more like an archive. How difficult would that be? Do we need to hire a developer?

One interesting thing about WordPress is the number of uses its simple blog model has been put to. Under the hood it is based on a remarkably simple database schema of about 10 tables and a suite of PHP scripts, functions and libraries that provide the interface to that data. Its huge user-base has contributed a wide variety of themes and additional functions. It can be turned into a Twitter-like microblog (P2 and Prologue) or a fully-fledged social network (WordPress MU, BuddyPress).
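To show just how approachable that underlying data is, here is a hedged sketch of querying published posts straight out of the wp_posts table – the pymysql library, the credentials and the default wp_ table prefix are all assumptions for illustration:

```python
# Hedged sketch: read recent published posts directly from WordPress's
# database. Assumes pymysql, a reachable MySQL server and the default
# "wp_" table prefix; the credentials are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="wp", password="secret",
                       database="wordpress")
try:
    with conn.cursor() as cur:
        # wp_posts holds posts, pages and revisions; wp_comments, wp_users,
        # wp_terms and a handful of others make up the rest of the schema.
        cur.execute(
            "SELECT post_date, post_title FROM wp_posts "
            "WHERE post_status = 'publish' AND post_type = 'post' "
            "ORDER BY post_date DESC LIMIT 5"
        )
        for post_date, post_title in cur.fetchall():
            print(post_date, post_title)
finally:
    conn.close()
```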

Another possibility, exploited by a third-party plugin, is using WordPress as an aggregating blog, collecting posts automatically via RSS from other blogs: this seems like a promising basis for starting to develop an archive of blogs, in a blog.

The plugin in question is called FeedWordPress. It uses the Links feature of WordPress as the basis of a list of feeds, which it checks regularly, importing any new content it finds as Posts within WordPress.

I installed FeedWordPress a while ago on ULCC’s DA Blog, and set it up to import all of the ULCC-contributed posts to JISC-PoWR, i.e. those by Ed Pinsent, Kevin Ashley and myself. I did this because I felt that these contributions warrant being part of ULCC’s institutional record of its activities, and that DA Blog was the best place to address this, as things stand.

JISC-PoWR also runs on WordPress, so I knew that, thanks to WordPress’s REST-like interface and Cool URIs, it is easy not only to select an individual author’s posts (/author/kevinashley) but also the RSS feed thereof (/author/kevinashley/feed). This, for each of the three author accounts, was all I needed to start setting up FeedWordPress in DA Blog to take an automatic copy each time any of us contributed to JISC-PoWR. The “author” on the original post has been mapped to an author in DA Blog, so posts are automatically (and correctly) attributed. The import also preserves, in custom fields, a considerable amount of contextual information about the posts in their original location.

In many cases, I’ve kept the imported posts private in DA Blog. “Introductory” posts for the JISC-PoWR project blog, for example: as editor of DA Blog, I didn’t feel we needed to trouble our readers there with them; nevertheless they are stored in the blog database, as part of “the record” of our activities.

This is, admittedly, a very small-scale test of this approach, but the kind of system I’ve described is unquestionably a rudimentary blog archive that can be set up relatively easily using WordPress and FeedWordPress – no coding necessary. Content is then searchable, sortable and exportable (SQL, RSS, etc.). (Note, by the way, what happens when you use the Search box on the JISC-PoWR blog copy in UKWAC: this won’t happen with this approach!)

For organisations with many staff blogging on diverse public platforms this would be one approach to ensuring that these activities are recorded and preserved. UKOLN, for example, manages its own blog farm, while Brian and Marieke have blogs at WordPress.com (as well as contributing to this one), and Paul Walk appears to manage his own blog and web space. This kind of arrangement is not uncommon, nor is the problem of how an institution gets a grasp on material in all these different locations (it’s been at the heart of many JISC-PoWR workshop discussions). A single, central, self-hosted, aggregating blog, automatically harvesting the news feeds of all these blogs, might be a low-cost, quick-start approach to securing data in The Cloud, and safeguarding the corporate memory.

There are more issues to address. What of comments or embedded images? Can it handle Twitter tweets as well as blog posts? Does it scale? What of look-and-feel, individual themes, etc.? Now we start needing some more robust tests and decisions, maybe even a developer or two to build a dedicated suite of “ArchivePress” plugins. But thanks to the power and openness of WordPress, and the endless creativity of its many users, we have a promising and viable short-term solution, and a compelling place to start further exploration.

LIWA – Living Web Archives

The PoWR project identified a number of technical challenges which made certain types of content – particularly that with a Web 2.0 flavour – especially difficult to manage and preserve in an effective way. My attention has recently been drawn to an EU-funded project which hopes to overcome a number of these technical problems, as well as others that apply to large-scale archiving, such as the problem of spam content.

LIWA – Living Web Archives – began in early 2008, but as with many EU projects, its startup phase involved a lot of internal activity without much of a public face. As a result we didn’t pick up on its work in the JISC-PoWR handbook, but I’m sure we’ll rectify this omission in any future revisions.

To pick one example of LIWA’s areas of interest, it intends to develop tools which make it easier to take a temporal view of web archives and to maintain temporal consistency. Temporal consistency – or rather its absence – will be familiar to anyone who has spent time exploring sites in the Internet Archive, where different pages, or even portions of the same page (such as images) will have been archived on different days. This can lead to occasional surprises when navigating through archived content, with links taking one to pages that don’t have the expected content.

LIWA’s partners include Hanzo, a UK-based web archive services company that we covered briefly in the handbook; I hope we can explore their potential value to UK HE in the future.

Twitter Groups and Twitter Problems

We’ve written about Twitter on the JISC PoWR site before, mainly when considering the preservation of Web 2.0 material. Now Twitter could become a useful tool in helping you communicate about Web resource preservation.

The Archivists and Records Managers Twitter Group is up and running. You can register at http://twittgroups.com/group/archives.

I’m sure there will be lots of interesting posts.

The preservation of Twitter posts (tweets) has again been discussed in the blogosphere. Maureen Pennock commented in her post entitled ‘Making retrospective sense of cross media communications: a new archival challenge’ that the increasing number of communication mechanisms presents a big problem for archivists.

She points out that “Some of our conversations are cross-media; they may start on Twitter, but they move to Facebook and then the blog. Capturing only one of those accounts means that only part of our conversation is captured. Okay, so you’re probably not interested in capturing our interactions in your archives. But you probably are interested in capturing interactions from important people (back to Stephen Fry and Obama again) and you will thus face the same issues.”

She then says “We all know the problems we’ve got in capturing and archiving emails. What of Twitter? How do you get Tweets out of the system and integrate them into a collection? What of Facebook data? And YouTube?”

It seems the Twitter challenge is becoming more real as it becomes increasingly mainstream.


Yet more on iPres2008

The web-archiving strand at the iPRES Conference was an opportunity for a number of National Libraries to describe their initiatives, their approaches to web capture and preservation, and the things that were (for them) the biggest hurdles. It’s interesting to me that librarians – as opposed to archivists – seem to have been first off the mark with web-archiving, although in the UK The National Archives have at least two parallel initiatives underway (UKWAC and European Archive) to capture government websites deemed to be records.

From the PoWR point of view, one thing I found useful was the summary statement from Iceland which suggests that the community is now starting to agree on three main approaches to selection – bulk/domain harvesting, thematic, and event-based. Sometimes I wonder if these approaches are too library-influenced (looking at websites primarily as publications, rather than as records, whereas in PoWR we are suggesting that HFE websites contain and function as a mixture of both), but it helped me sharpen up my thinking and I fed this idea back into the PoWR Handbook’s chapter on Selection.

Since they were all representing libraries, most of the presenters tended to focus on legal deposit (and attendant permission problems) as the biggest hurdle to gathering websites – and the one that eats up the most in terms of resources. Denmark have not resolved this completely, and although they have gathered lots of material from the web, they don’t make it widely available – only allowing controlled access for research or statistical purposes. France, on the other hand, have had legal deposit laws in place since 2006. Australia have found this such a problem (despite the strenuous efforts they have made to get the law changed) that it pretty much forced the decision to go the curatorial route. All permissions are negotiated, but the collections are shaped to a certain extent by community input.

With my archivist hat on, I sat up when we were told quite categorically that registration and cataloguing of web resources would not work, and that no-one should ever even attempt it. The BL admitted they were ‘not really collecting metadata’. If this is true, I immediately started to wonder why both PANDAS and Web Curator Tool (which I have used) have conventional Dublin Core metadata elements built into their workflow. Don’t we anticipate cataloguing archived web resources in some way? I almost asked a question about this, until I reflected on the use of NutchWax and full-text indexing (which is probably acceptable until we can come up with some form of semantic tagging or automated metadata extraction for web resources).

Then I found myself reaching for the mic to ask a question about what I call ‘endangered resources’. We have already blogged about this on dablog, when my colleague Joanne Anthony raised a question about a web resource owned by a smallish institution which suddenly found itself with its funding removed. Was there anything such an institution could do, I asked the panel, to preserve its website? And what were the pro-active steps being taken by these National Libraries to rescue or identify resources at risk? I know that UKWAC, for example, offers a public submission service on its website, although it is not very prominent or visible, nor is it quite clear what happens to requests for archiving once the form has been filled in. I received some interesting replies, including an amusing anecdote from France which suggests that their archival collections have been accessed by red-faced politicians who have accidentally deleted their own blogs. However, I still wasn’t quite sure what national initiatives exist to address what I perceive as a significant gap in the preservation of unrecognised (and therefore uncaptured) resources.

More on iPres2008

Brian and Marieke have already written about iPres2008 and PoWR, and I have written, and will write more, about it from a general perspective on DABlog. But we thought it would be worth saying a bit more about what this conference, which looked at the complete picture of digital preservation, had to say that is relevant to PoWR’s work on web preservation in UK Universities.

There was an entire session devoted to various web archiving initiatives on the second day, which at first sight one might think is of particular relevance (almost as much as Brian’s presentation, one might think). I wasn’t at this session – it was one of those running in parallel tracks, and I was speaking in the other track – but Ed Pinsent was, and will be writing at more length about it soon. But even without attending, I’m aware that many of the projects, operating as they do within their national domains in Australia or elsewhere, won’t have much role in helping save UK University web content (unless we move our domains to .edu.au – there’s a thought). Even when the BL realises its long-term aim of harvesting across the entire UK web domain, it will still be selective in some ways about what it captures – about depth and frequency of harvests, and about the type of content. You won’t be able to depend on those institutions to capture what you want to be captured. So if these initiatives aren’t going to meet all our needs, do we need to do it ourselves? The PoWR project thinks not, but that is one of the options institutions will need to examine. The work the IIPC is doing to develop harvesting and access tools will be of interest to those few institutions that feel able to operate these tools themselves – not something to be undertaken lightly.

Yet there was much of relevance at iPres2008. One recurring theme, picked up at the outset by Lynne Brindley and in Steve Knight’s closing remarks, was that ‘digital preservation’ is not the term to be using in discussions with our institutions and the world, echoing remarks on the DCC blog which Brian later picked up on here. Steve prefers the phrase ‘permanent access’, which is indeed outcome-focussed. However, we’ve also said in PoWR that preservation isn’t always forever, so I would prefer something a little more all-embracing – ‘long-lived access’ might fit.

The sessions covering things like significant properties also touched on issues that PoWR is concerned with. When we decide to preserve something, what is it that we’re really trying to keep? Most forms of preservation change the original object in some way, just as long-life milk isn’t the same as pasteurised, and neither is quite as tasty as fresh milk (or so I’ve been told). This is clearly still a very difficult problem, and one that (to my mind) demonstrates that the digital preservation community hasn’t even developed a clear problem statement, much less a fully worked-out solution. So, in the meantime, we need to be pragmatic and do what seems best at the time. Always a good plan.

PDF web page capture

During the last JISC PoWR workshop yesterday in Manchester (of which more anon) I made brief mention of a tool from Adobe which allows web pages, or entire sites, to be captured to a PDF file. I mentioned this primarily to illustrate one of the three points at which web capture can take place (behind the server; from the HTTP transaction; or browser-side) but it generated considerable interest, and I promised to blog about the product since I could not remember what it was called.

It turns out that it’s not a separate product, nor a plug-in, but a built-in part of Adobe Acrobat. It was first available as a free add-on for Acrobat 4 in 1998 or 1999, and I think it was then that I first saw it demonstrated at the PRO (as it then was) – hence my misunderstanding. Tools like this have their place, but (like all web preservation technologies) they also have their drawbacks. PDF’s print-oriented format isn’t a good match for some sites, much as some sites don’t look good when you try to print them. (In fact, I believe that Acrobat Web Capture effectively uses the browser’s print engine combined with a PDF-writer pseudo-printer to do its work, so there will be a close correlation.) But we’ll be covering this tool, along with others, in the handbook.
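Purely as an illustration of browser-side capture to PDF – using the Python pdfkit library rather than Acrobat; pdfkit drives the wkhtmltopdf renderer, both must be installed separately, and the URL is a placeholder – the whole operation can be as short as:

```python
# Illustration of browser-side capture to PDF (not the Acrobat tool discussed
# above). Assumes pdfkit and the wkhtmltopdf binary; the URL is a placeholder.
import pdfkit

pdfkit.from_url("http://example.org/", "snapshot.pdf")
```

The same print-oriented caveats apply: a page that prints badly will convert badly.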

How sticky is your wiki?

Wetpaint wiki is just one of the many enticing, powerful, quick-fix web apps that have sprung up around Web 2.0 and Social Networking. You’ll have your own favourites no doubt: I won’t start listing them here. Wikis have grown up a lot since the first WikiWikiWeb, and now are at the online heart of many educational projects at all levels, from classroom, to research and publishing.

We’ve been using Wetpaint’s wiki feature as a collaborative space for our workshop feedback, and this suits us fine: once we have collated all the input for our project outputs, in a few weeks it’ll probably be no loss to us to delete the wiki, or just set it adrift among all the other jettisoned flotsam in cyberspace.

But what’s often given less serious consideration, in the excitement of using a third-party provider of wikis, blogs, Ning, etc., to get your collaborative hypertext project off the ground so quickly and easily – and without having to go cap or cheque in-hand to whoever guards your web space – is this key preservation issue: what happens when you want to get your painstakingly intricate web of hyperlinked pages out?

There are many good reasons why you might want to do this: you might want to migrate to another wiki system or CMS, as the shape and nature of your content evolves; or put it on a permanent, persistent footing by moving it into your own domain; you might simply want to back it up or take a snapshot; or you might want to pull out information for publication in a different form. When you had one or two pages, it might have seemed trivial; but what if you now have hundreds?


Unfortunately, just as exporting the information is often a secondary consideration for wiki content creators, so it also is for the wiki farm systems. The Wetpaint Wiki discussion boards indicate that an export feature was a long time in coming (and its absence quite a blocker to adoption by a number of serious would-be users). And what was eventually provided leaves a lot to be desired.

Wetpaint’s backup option “lets” you download your wiki content as a set of HTML files. Well, not really HTML files: text files with some embedded HTML-like markup. (Which version? Not declared.) Don’t expect to open these files locally in your browser and carry on surfing your wiki hypertext (even links between wiki pages need fixing). The export doesn’t include comment threads or old versions. Restoring it to your online wiki is not possible. But, for what it’s worth, you have at least salvaged some sort of raw content that might be transformed into something like the wiki it came from, if hit with a bit of Perl script or similar.
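To give a flavour of what that transformation involves, here is a hedged sketch – in Python rather than Perl, and the export layout and link pattern are my own assumptions, since a real Wetpaint export would need inspecting first – that renames the exported files and rewrites inter-page links so they resolve locally:

```python
# Hedged sketch: turn an unzipped (hypothetical) Wetpaint export into locally
# browsable HTML by renaming files and rewriting inter-page links.
import os
import re

EXPORT_DIR = "wetpaint-backup"                      # hypothetical export folder
SITE_PREFIX = "http://example.wetpaint.com/page/"   # hypothetical page URL prefix

for name in os.listdir(EXPORT_DIR):
    path = os.path.join(EXPORT_DIR, name)
    if not os.path.isfile(path):
        continue
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Point links at local .html copies instead of the live wiki.
    text = re.sub(re.escape(SITE_PREFIX) + r"([\w-]+)", r"\1.html", text)
    out = os.path.splitext(path)[0] + ".html"
    with open(out, "w", encoding="utf-8") as f:
        f.write(text)
```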

I checked out Wikidot – another impressively-specced, free “wiki farm”. Wikidot’s backup option will deliver you a zip file containing each wiki page as a separate text file, containing your wiki markup as entered, as well as all uploaded file attachments. However, according to Wikidot support:

you can not restore from it automatically, it does not include all page revisions, only current (latest), it does not include forum discussion or page comments.

To reconstruct your wiki locally, you’ll, again, need some scripting, including using the Wikidot code libraries to reconvert its non-standard wiki-markup into standard HTML.

A third approach can be seen with a self-hosted copy of Mediawiki. Here you can select one or more pages by name, and have them exported as an XML file, which also contains revisions and assorted other metadata. Within the XML framework, the page text is stored as original wiki markup, raising the same conversion issues as with Wikidot. However, the XML file can be imported fairly easily into a different or blank instance of Mediawiki, recreating both hypertext and functionality more or less instantly.
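As a rough sketch of how approachable that XML is – the file name is a placeholder, and the namespace handling is deliberately loose because it varies between MediaWiki versions – listing pages and their revision counts takes only a few lines:

```python
# Sketch: summarise a MediaWiki Special:Export dump (hypothetical file name).
import xml.etree.ElementTree as ET

def local(tag):
    return tag.rsplit("}", 1)[-1]   # drop the XML namespace prefix

root = ET.parse("mediawiki-export.xml").getroot()
for page in root:
    if local(page.tag) != "page":
        continue   # skip <siteinfo> and other non-page elements
    title = next((c.text for c in page if local(c.tag) == "title"), "?")
    revisions = [c for c in page if local(c.tag) == "revision"]
    print(title, "-", len(revisions), "revision(s)")
```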

In contrast to all these approaches, if you set a spidering engine like HTTrack or Wget to work “remotely harvesting” the site, you would get a working local copy of your wiki looking pretty much as it does on the web. This might be an attractive option if you simply want to preserve a record of what you created, a snapshot of how it looked on a certain date; or just in case a day should come when Wetpaint.com Inc., and the rest, no longer exist.
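A sketch of that kind of remote harvest – run here through Python’s subprocess for consistency with the other examples, and pointed at a placeholder URL – using standard wget options:

```python
# Sketch: mirror a (hypothetical) wiki with wget for a browsable snapshot.
# --mirror turns on recursion and time-stamping; --convert-links rewrites
# links for local browsing; --page-requisites pulls in images and stylesheets.
import subprocess

subprocess.run([
    "wget",
    "--mirror",
    "--convert-links",
    "--adjust-extension",
    "--page-requisites",
    "--no-parent",
    "http://example.wetpaint.com/",
], check=True)
```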

However, this will only result in something like a preservation copy – not a backup that can be easily restored to the wiki, and further edited – in the event, say, that the wiki is hacked/cracked, or otherwise disfigured. For that kind of security, it may be enough to depend on regular backups of the underlying database, files and scripts: but you still ought to reassure yourself about exactly what backup regime your host is operating, and whether they can restore from it in a timely fashion. (Notwithstanding the versioning features of most wikis, using them to roll back a raft of abusive changes across a whole site is not usually a quick, easy or particularly enjoyable task.)

All this suggests some basic questions that one needs to ask when setting up a wiki for a project:

  • How long do we need it for?
  • Will it need preserving at intervals, or at a completion date?
  • Is it more important to preserve its text content, or its complete look?
  • Should we back it up? If so, what should we back up?
  • Does the wiki provide backup features? If so, what does it back up (e.g. attachments, discussions, revisions)?
  • Once “backed up”, how easily can it be restored?
  • Will the links still work in our preservation or backup copy?
  • If the backup includes raw wiki markup, do you have the capabilities to re-render this as HTML?

And questions like these are no less relevant when considering your uses of blogs and other social software: I hope we’ll be able to look at them more closely in another post.