Author Archives: Richard M. Davis

About Richard M. Davis

I work on Preservation Projects and Services in ULCC's Digital Archives Department, where my responsibilities include managing and developing our e-Repositories service. I've also worked as project manager, researcher and developer on several web-based preservation and archives projects, including the National Digital Archive of Datasets (NDAD), Significant Properties of E-learning Objects (SPeLOs), Social Networking Extensions for Eprints (SNEEP), and JISC-PoWR (Preservation of Web Resources). I also edit our excellent departmental blog. I have worked with database systems since 1986 and Web systems since 1996, when I developed a pilot intranet for the Forensic Science Laboratory in London, before joining ULCC in 1997. Since 2006, I have been studying part-time, on-line, for Edinburgh University's MSc in E-learning.

ArchivePress: When One Size Doesn’t Fit All

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs containing valuable information and commentary – scholarly, factual, political or personal – which merit keeping no less than the famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised ones – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (formats specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to make exactly the same kinds of selection as on a live blog: posts by author, date, category or tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation than endless, often duplicate copies of just one particular view of that data?

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can easily be set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
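To make the idea of feed-driven capture more concrete, here is a rough sketch in Python – not ArchivePress’s actual code, which will take the form of WordPress plugins – that reads a newsfeed with the third-party feedparser library and files any posts it has not seen before in a small SQLite table. The feed URL and table layout are placeholders for illustration only.

    import sqlite3
    import feedparser  # third-party: pip install feedparser

    FEED_URL = "https://example.ac.uk/blog/feed"   # placeholder feed address

    db = sqlite3.connect("blog_archive.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            guid      TEXT PRIMARY KEY,   -- feed item id, used to avoid duplicates
            title     TEXT,
            author    TEXT,
            published TEXT,
            link      TEXT,
            content   TEXT
        )
    """)

    def harvest(feed_url):
        """Fetch a feed and store any posts not already in the archive."""
        feed = feedparser.parse(feed_url)
        new = 0
        for entry in feed.entries:
            guid = entry.get("id") or entry.get("link")
            if db.execute("SELECT 1 FROM posts WHERE guid = ?", (guid,)).fetchone():
                continue   # already archived on an earlier run
            # Prefer full content where the feed provides it, else the summary.
            if "content" in entry and entry.content:
                body = entry.content[0].get("value", "")
            else:
                body = entry.get("summary", "")
            db.execute(
                "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?)",
                (guid, entry.get("title"), entry.get("author"),
                 entry.get("published"), entry.get("link"), body),
            )
            new += 1
        db.commit()
        return new

    print(harvest(FEED_URL), "new post(s) archived")

Run on a schedule against a list of feeds, even something this simple starts to behave like an automatically updated collection of posts; the real work is in the management, context and authenticity requirements around it.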

In some respects there seems to be nothing terribly new here for anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects to school and college blogs.

ArchivePress has its own website and blog, which will build a cumulative picture of its results and the issues it encounters over the next six months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at the British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

Set a blog to catch a blog…

Much discussion of blog preservation focuses on how to preserve the blogness of blogs: how can we make a web archive store, manage and deliver preserved blogs in a way that is faithful to the original?


Since it is blogging applications that provide this structure and behaviour (usually from simple database tables of Posts, Comments, Users, etc.), perhaps we should consider making blogging software behave more like an archive. How difficult would that be? Do we need to hire a developer?

One interesting thing about WordPress is the number of uses its simple blog model has been put to. Under the hood it is based on a remarkably simple database schema of about 10 tables, plus a suite of PHP scripts, functions and libraries that provide the interface to that data. Its huge user base has contributed a wide variety of themes and additional functions. It can be turned into a Twitter-like microblog (P2 and Prologue) or a fully-fledged social network (WordPress MU, BuddyPress).

Another possibility, exploited by a third-party plugin, is using WordPress as an aggregating blog, collecting posts automatically via RSS from other blogs: this seems like a promising basis for starting to develop an archive of blogs, in a blog.

The plugin in question is called FeedWordPress. It uses the Links feature of WordPress as the basis of a list of feeds, which it checks regularly, importing any new content it finds as Posts within WordPress.

I installed FeedWordPress a while ago on ULCC’s DA Blog, and set it up to import all of the ULCC-contributed posts to JISC-PoWR, i.e. those by Ed Pinsent, Kevin Ashley and myself. I did this because I felt that these contributions warrant being part of ULCC’s institutional record of its activities, and that DA Blog was the best place to address this, as things stand.

JISC-PoWR also runs on WordPress, so I knew that, thanks to WordPress’s REST-like interface and Cool URIs, it is easy to select not only an individual author’s posts (/author/kevinashley) but also the RSS feed thereof (/author/kevinashley/feed). This, for each of the three author accounts, was all I needed to set up FeedWordPress in DA Blog to take an automatic copy each time any of us contributed to JISC-PoWR. The “author” on the original post has been mapped to an author in DA Blog, so posts are automatically (and correctly) attributed. The import also preserves, in custom fields, a considerable amount of contextual information about the posts in their original location.
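Purely by way of illustration – this is not FeedWordPress’s own code, and the base URL, the author slugs other than kevinashley, and the local account names are assumptions – the same per-author feeds could be harvested with a few lines of Python (again using the third-party feedparser library), keeping the original context alongside each copied post much as FeedWordPress does in custom fields:

    import feedparser  # third-party: pip install feedparser

    BLOG = "https://jiscpowr.jiscinvolve.org"   # base URL of the source blog (assumed)

    # Map each remote author slug to the local account the copy should be filed under
    # (only 'kevinashley' comes from the post above; the rest are made-up examples).
    AUTHORS = {
        "kevinashley":  "kevin.ashley",
        "edpinsent":    "ed.pinsent",
        "richarddavis": "richard.davis",
    }

    for slug, local_author in AUTHORS.items():
        feed = feedparser.parse(f"{BLOG}/author/{slug}/feed")
        for entry in feed.entries:
            post = {
                "local_author": local_author,        # correct attribution in the archive
                "title":        entry.get("title"),
                "content":      entry.get("summary", ""),
                # Contextual information kept alongside the post, much as
                # FeedWordPress records it in custom fields:
                "original_url":    entry.get("link"),
                "original_author": entry.get("author"),
                "original_date":   entry.get("published"),
                "source_feed":     f"{BLOG}/author/{slug}/feed",
            }
            print("fetched:", post["title"], "by", post["original_author"])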

In many cases, I’ve kept the imported post private in DA Blog. “Introductory” posts for the JISC-PoWR project blog, for example: as editor of DA Blog, I didn’t feel we needed to trouble our readers there with them; nevertheless they are stored in the blog database, as part of “the record” of our activities.

This is, admittedly, a very small-scale test of the approach, but the kind of system I’ve described is unquestionably a rudimentary blog archive, one that can be set up relatively easily using WordPress and FeedWordPress – no coding necessary. Content is then searchable, sortable and exportable (SQL, RSS, etc.). (Note, by the way, what happens when you use the Search box on the JISC-PoWR blog copy in UKWAC: that won’t happen with this approach!)
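As a small illustration of that searchability – assuming posts have been aggregated into a SQLite table like the one sketched in the ArchivePress post above, and with a made-up author name and search term – a couple of queries are all it takes:

    import sqlite3

    db = sqlite3.connect("blog_archive.db")   # the store from the earlier sketch

    # All archived posts by one author (naively sorted on the stored date string):
    for title, published in db.execute(
        "SELECT title, published FROM posts WHERE author = ? ORDER BY published DESC",
        ("Kevin Ashley",),
    ):
        print(published, title)

    # A crude full-text search across the archived content:
    hits = db.execute(
        "SELECT title, link FROM posts WHERE content LIKE ?",
        ("%preservation%",),
    ).fetchall()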

For organisations with many staff blogging on diverse public platforms, this would be one approach to ensuring that these activities are recorded and preserved. UKOLN, for example, manages its own blog farm, while Brian and Marieke have blogs at WordPress.com (as well as contributing to this one), and Paul Walk appears to manage his own blog and web space. This kind of arrangement is not uncommon, nor is the problem of how an institution gets a grasp on material in all these different locations (it has been at the heart of many JISC-PoWR workshop discussions). A single, central, self-hosted, aggregating blog, automatically harvesting the news feeds of all these blogs, might be a low-cost, quick-start approach to securing data in The Cloud, and safeguarding the corporate memory.

There are more issues to address. What of comments or embedded images? Can it handle Twitter tweets as well as blog posts? Does it scale? What of look-and-feel, individual themes, etc.? Now we start needing some more robust tests and decisions, maybe even a developer or two to build a dedicated suite of “ArchivePress” plugins. But thanks to the power and openness of WordPress, and the endless creativity of its many users, we have a promising and viable short-term solution, and a compelling place to start further exploration.

How sticky is your wiki?

Wetpaint wiki is just one of the many enticing, powerful, quick-fix web apps that have sprung up around Web 2.0 and Social Networking. You’ll have your own favourites, no doubt: I won’t start listing them here. Wikis have grown up a lot since the first WikiWikiWeb, and are now at the online heart of many educational projects at all levels, from the classroom to research and publishing.

We’ve been using Wetpaint’s wiki feature as a collaborative space for our workshop feedback, and this suits us fine: once we have collated all the input for our project outputs, in a few weeks it’ll probably be no loss to us to delete the wiki, or just set it adrift among all the other jettisoned flotsam in cyberspace.

But in the excitement of using a third-party provider of wikis, blogs, Ning and the like to get your collaborative hypertext project off the ground quickly and easily – and without having to go cap (or cheque) in hand to whoever guards your web space – one key preservation issue is often given less serious consideration: what happens when you want to get your painstakingly intricate web of hyperlinked pages out?

There are many good reasons why you might want to do this: you might want to migrate to another wiki system or CMS, as the shape and nature of your content evolves; or put it on a permanent, persistent footing by moving it into your own domain; you might simply want to back it up or take a snapshot; or you might want to pull out information for publication in a different form. When you had one or two pages, it might have seemed trivial; but what if you now have hundreds?


Unfortunately, just as exporting the information is often a secondary consideration for wiki content creators, so it also is for the wiki-farm systems. The Wetpaint wiki discussion boards indicate that an export feature was a long time coming (and that its absence was quite a blocker to adoption for a number of serious would-be users). What was eventually provided leaves a lot to be desired.

Wetpaint’s backup option “lets” you download your wiki content as a set of HTML files. Well, not really HTML files: text files with some embedded HTML-like markup. (Which version? Not declared.) Don’t expect to open these files locally in your browser and carry on surfing your wiki hypertext (even links between wiki pages need fixing). The export doesn’t include comment threads or old versions, and restoring it to your online wiki is not possible. But, for what it’s worth, you have at least salvaged some sort of raw content, which might be transformed into something like the wiki it came from if hit with a bit of Perl script or similar.
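As an example of the kind of script that might help – Python here rather than Perl, and assuming nothing about the export beyond one text file per page (the directory name and .txt extension are guesses) – the sketch below wraps each file in a minimal HTML page and builds an index, so the salvaged content is at least browsable locally. Fixing the links between pages would still need inspection of the actual export.

    import html
    from pathlib import Path

    EXPORT_DIR = Path("wetpaint-backup")   # unpacked export: one text file per page (assumed)
    OUT_DIR = Path("wiki-html")
    OUT_DIR.mkdir(exist_ok=True)

    pages = sorted(EXPORT_DIR.glob("*.txt"))   # file extension assumed for illustration

    for page in pages:
        body = page.read_text(encoding="utf-8", errors="replace")
        (OUT_DIR / f"{page.stem}.html").write_text(
            "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>"
            f"<title>{html.escape(page.stem)}</title></head>\n"
            f"<body>\n<h1>{html.escape(page.stem)}</h1>\n{body}\n</body></html>\n",
            encoding="utf-8",
        )

    # A simple index page so the salvaged pages can be browsed.
    links = "\n".join(
        f"<li><a href='{p.stem}.html'>{html.escape(p.stem)}</a></li>" for p in pages
    )
    (OUT_DIR / "index.html").write_text(
        f"<html><body><h1>Salvaged wiki</h1><ul>\n{links}\n</ul></body></html>\n",
        encoding="utf-8",
    )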

I checked out Wikidot – another impressively specced, free “wiki farm”. Wikidot’s backup option will deliver a zip file containing each wiki page as a separate text file of your wiki markup as entered, along with all uploaded file attachments. However, according to Wikidot support:

you can not restore from it automatically, it does not include all page revisions, only current (latest), it does not include forum discussion or page comments.

To reconstruct your wiki locally, you’ll, again, need some scripting, including using the Wikidot code libraries to reconvert its non-standard wiki-markup into standard HTML.

A third approach can be seen with a self-hosted copy of MediaWiki. Here you can select one or more pages by name and have them exported as an XML file, which also contains revisions and assorted other metadata. Within the XML framework, the page text is stored as original wiki markup, raising the same conversion issues as with Wikidot. However, the XML file can be imported fairly easily into a different or blank instance of MediaWiki, recreating both hypertext and functionality more or less instantly.
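For example, individual pages can be pulled from a MediaWiki instance via its Special:Export page with a short script; the wiki address and page titles below are placeholders, and in practice you would check the export options (full history, templates) offered by your own wiki. This is a sketch, not a full backup routine.

    from urllib.parse import quote
    from urllib.request import urlopen

    WIKI = "https://wiki.example.org/wiki"      # placeholder base URL
    PAGES = ["Main_Page", "Project_plan"]       # placeholder page titles

    for title in PAGES:
        # Special:Export/<Page> returns the page as MediaWiki's export XML,
        # with the wiki markup and assorted metadata inside it.
        url = f"{WIKI}/Special:Export/{quote(title)}"
        with urlopen(url) as response:
            xml = response.read()
        with open(f"{title}.xml", "wb") as out:
            out.write(xml)

    # The saved XML can later be loaded into another MediaWiki instance via
    # Special:Import, or with the maintenance/importDump.php script on the server.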

In contrast to all these approaches, if you set a spidering engine like HTTrack or Wget to work “remotely harvesting” the site, you would get a working local copy of your wiki looking pretty much as it does on the web. This might be an attractive option if you simply want to preserve a record of what you created, a snapshot of how it looked on a certain date; or just in case a day should come when Wetpaint.com Inc., and the rest, no longer exist.

However, this will only result in something like a preservation copy – not a backup that can easily be restored to the wiki and further edited in the event that, say, the wiki is hacked, cracked or otherwise disfigured. For that kind of security it may be enough to depend on regular backups of the underlying database, files and scripts: but you still ought to reassure yourself exactly what backup regime your host is operating, and whether backups can be restored in a timely fashion. (Notwithstanding the versioning features of most wikis, using them to roll back a raft of abusive changes across a whole site is not usually a quick, easy or particularly enjoyable task.)

All this suggests some basic questions that one needs to ask when setting up a wiki for a project:

  • How long do we need it for?
  • Will it need preserving at intervals, or at a completion date?
  • Is it more important to preserve its text content, or its complete look?
  • Should we back it up? If so, what should we back up?
  • Does the wiki provide backup features? If so, what does it back up (e.g. attachments, discussions, revisions)?
  • Once “backed up”, how easily can it be restored?
  • Will the links still work in our preservation or backup copy?
  • If the backup includes raw wiki markup, do we have the capability to re-render this as HTML?

And questions like these are no less relevant when considering your uses of blogs and other social software: I hope we’ll be able to look at them more closely in another post.

Web Continuity Project at The National Archives

Ed and I were pleased to come across an interesting document, recently received from The National Archives, describing their Web Continuity Project. This is the latest of the many digital preservation initiatives undertaken by TNA/PRO, which began with EROS and NDAD in the mid-1990s and led to the UK Government Web Archive and other recent initiatives (many in conjunction with the BL and JISC).

The Web Continuity Project arises from a request by Jack Straw, as leader of the House of Commons in 2007, that government departments ensure continued access to online documents. Further research revealed that:

  • Government departments are increasingly citing URLs in answer to Parliamentary Questions
  • 60% of links in Hansard to UK government websites for the period 1997 to 2006 are now broken
  • Departments vary considerably: for one, every link works; for another, every link is broken. (TNA’s own website is not immune!)


Digital preservation in a nutshell (Part I)

One of the goals of PoWR is to make current trends in digital preservation meaningful and relevant to information professionals with the day-to-day responsibility for looking after web resources. Anyone coming to the field of digital preservation for the first time can find it a daunting area, with very distinct terminology and concepts. Some of these are drawn from time-honoured approaches to managing things like government records or institutional archives, while others have been developed exclusively in the digital domain. It is an emerging and evolving field that can take some time to get your head round, so we thought it would be a good idea to offer a series of brief primers.

Starting, naturally, with digital preservation: this is defined as a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary” (Digital Preservation Coalition, 2002).

Introduction: Richard Davis

Hi, I’m Richard, an IT specialist since 1985. I’ve worked at ULCC since 1997 on a variety of web-based digital archives and repository systems. Recent projects include Linnean Online (an Eprints-powered archive of Linnaeus’s specimens), SAS-Space (a DSpace IR for the School of Advanced Study) and NDAD (a ten-year-old dataset archive for the National Archives). Among the JISC repositories/preservation projects I’ve been involved with are the Significant Properties studies, PRIMO (for the Institute of Musical Research) and SNEEP (Social Networking Extensions for Eprints). I’m also studying, part-time, on the MSc E-learning course at Edinburgh University, and I’m particularly interested in how blogs, wikis, social networks and the like are being used in education and research (Library 2.0, Classroom 2.0, etc.).