JISC-PoWR

Preservation of Web Resources: a JISC-sponsored project

Archive for the 'Technologies' Category

Yet more on iPres2008

Posted by Ed Pinsent on 14th October 2008

The web-archiving strand at the iPRES Conference was an opportunity for a number of National Libraries to describe their initiatives, their approaches to web capture and preservation, and the things that were (for them) the biggest hurdles. It’s interesting to me that librarians - as opposed to archivists - seem to have been first off the mark with web-archiving, although in the UK The National Archives have at least two parallel initiatives underway (UKWAC and European Archive) to capture government websites deemed to be records.

From the PoWR point of view, one thing I found useful was the summary statement from Iceland which suggests that the community is now starting to agree on three main approaches to selection - bulk/domain harvesting, thematic, and event-based. Sometimes I wonder if these approaches are too library-influenced (looking at websites primarily as publications, rather than as records, whereas in PoWR we are suggesting that HFE websites contain and function as a mixture of both), but it helped me sharpen up my thinking and I fed this idea back into the PoWR Handbook’s chapter on Selection.

Since they were all representing libraries, most of the presenters tended to focus on legal deposit (and attendant permission problems) as the biggest hurdle to gathering websites - and the one that eats up the most in terms of resources. Denmark have not resolved this completely, and although they have gathered lots of material from the web, they don’t make it widely available - only allowing controlled access for research or statistical purposes. France, on the other hand, have had legal deposit laws in place since 2006. Australia have found this such a problem (despite the strenuous efforts they have made to get the law changed) that it pretty much forced the decision to go the curatorial route. All permissions are negotiated, but the collections are shaped to a certain extent by community input.

With my archivist hat on, I sat up when we were told quite categorically that registration and cataloguing of web resources would not work, and that no-one should ever even attempt it. The BL admitted they were ‘not really collecting metadata’. If this is true, I wondered, why do both PANDAS and Web Curator Tool (which I have used) have conventional Dublin Core metadata elements built into their workflow? Don’t we anticipate cataloguing archived web resources in some way? I almost asked a question about this, until I reflected on the use of NutchWAX and full-text indexing (which is probably acceptable until we can come up with some form of semantic tagging or automated metadata extraction for web resources).

Then I found myself reaching for the mic to ask a question about what I call ‘endangered resources’. We have already blogged about this on dablog, when my colleague Joanne Anthony raised a question about a web resource owned by a smallish institution which suddenly found itself with its funding removed. Was there anything such an institution could do, I asked the panel, to preserve its website? And what were the pro-active steps being taken by these National Libraries to rescue or identify resources at risk? I know that UKWAC, for example, offers a public submission service on its website, although it is not very prominent or visible, nor is it quite clear what happens to requests for archiving once the form has been filled in. I received some interesting replies, including the amusing anecdote from France which suggests that their archival collections have been accessed by red-faced politicians who have accidentally deleted their own blogs. However, I still wasn’t quite sure what national initiatives exist to address what I perceive as a significant gap in the preservation of unrecognised (and therefore uncaptured) resources.

Posted in Selection, Digital preservation, Legal, Technologies, Events

More on iPres2008

Posted by Kevin Ashley on 8th October 2008

Brian and Marieke have already written about iPres2008 and PoWR, and I have written and will write more about it from a general perspective on DABlog. But we thought it would be worth saying a bit more about what this conference, which looks at the complete picture of digital preservation, had to say that is relevant to PoWR’s work on web preservation in UK universities.

There was an entire session devoted to various web archiving initiatives on the second day, which at first sight one might think is of particular relevance (almost as much as Brian’s presentation). I wasn’t at this session - it was one of those running in parallel tracks, and I was speaking in the other track - but Ed Pinsent was and will be writing at more length about it soon. But even without attending, I’m aware that many of the projects, operating as they do within their national domains in Australia or elsewhere, won’t have much role in helping save UK University web content (unless we move our domains to .edu.au - there’s a thought). Even when the BL realises its long-term aim of harvesting across the entire UK web domain, it will still be selective in some ways about what it captures - about depth and frequency of harvests, and about the type of content. You won’t be able to depend on those institutions to capture what you want to be captured. So if these initiatives aren’t going to meet all our needs, do we need to do it ourselves? The PoWR project thinks not, but that is one of the options institutions will need to examine. The work the IIPC is doing to develop harvesting and access tools will be of interest to those few institutions that feel able to operate these tools themselves - not something to be undertaken lightly.

Yet there was much of relevance at iPres2008. One recurring theme, picked up at the outset by Lynne Brindley and in Steve Knight’s closing remarks, was that ‘digital preservation’ is not the term to be using in discussions with our institutions and the world, echoing remarks on the DCC blog which Brian later picked up on here. Steve prefers the phrase ‘permanent access’, which is indeed outcome-focussed. However, we’ve also said in PoWR that preservation isn’t always forever, so I would prefer something a little more all-embracing - ‘long-lived access’ might fit.

The sessions covering things like significant properties also touched on issues that PoWR is concerned with. When we decide to preserve something, what is it that we’re really trying to keep? Most forms of preservation change the original object in some way, just as long-life milk isn’t the same as pasteurised, and neither is quite as tasty as fresh milk (or so I’ve been told). This is clearly still a very difficult problem, and one that (to my mind) demonstrates that the digital preservation community hasn’t even developed a clear problem statement, much less a fully worked-out solution. So, in the meantime, we need to be pragmatic and do what seems best at the time. Always a good plan.

Posted in Selection, Digital preservation, Technologies, Preservation, Events

PDF web page capture

Posted by Kevin Ashley on 13th September 2008

During the last JISC-PoWR workshop yesterday in Manchester (of which more anon) I made brief mention of a tool from Adobe which allows web pages, or entire sites, to be captured to a PDF file. I mentioned this primarily to illustrate one of the three points at which web capture can take place (behind the server; from the HTTP transaction; or browser-side) but it generated considerable interest, and I promised to blog about the product since I could not remember what it was called.

It turns out that it’s not a separate product, nor a plug-in, but a built-in part of Adobe Acrobat. It was first available as a free add-on for Acrobat 4 in 1998 or 1999, and I think it was then that I first saw this demonstrated at the PRO (as it then was) - hence my misunderstanding. Tools like this have their place, but (like all web preservation technologies) they also have their drawbacks. PDF’s print-oriented format isn’t a good match to some sites, much as some sites don’t look good when you try to print them. (In fact, I believe that Acrobat Web Capture effectively uses the browser’s print engine combined with the PDF writer pseudo-printer to do its work, so there will be a close correlation.) But we’ll be covering this tool, along with others, in the handbook.

Posted in Software, Workshops

How sticky is your wiki?

Posted by Richard M. Davis on 13th July 2008

Wetpaint wiki is just one of the many enticing, powerful, quick-fix web apps that have sprung up around Web 2.0 and Social Networking. You’ll have your own favourites no doubt: I won’t start listing them here. Wikis have grown up a lot since the first WikiWikiWeb, and now are at the online heart of many educational projects at all levels, from classroom, to research and publishing.

We’ve been using Wetpaint’s wiki feature as a collaborative space for our workshop feedback, and this suits us fine: once we have collated all the input for our project outputs, in a few weeks it’ll probably be no loss to us to delete the wiki, or just set it adrift among all the other jettisoned flotsam in cyberspace.

But what’s often given less serious consideration, in the excitement of using a third-party provider of wikis, blogs, Ning, etc., to get your collaborative hypertext project off the ground so quickly and easily - and without having to go cap or cheque in hand to whoever guards your web space - is this key preservation issue: what happens when you want to get your painstakingly intricate web of hyperlinked pages out?

There are many good reasons why you might want to do this: you might want to migrate to another wiki system or CMS, as the shape and nature of your content evolves; or put it on a permanent, persistent footing by moving it into your own domain; you might simply want to back it up or take a snapshot; or you might want to pull out information for publication in a different form. When you had one or two pages, it might have seemed trivial; but what if you now have hundreds?

Old Style Wiki

Unfortunately, just as exporting the information is often a secondary consideration for wiki content creators, so it also is for the wiki farm systems. The Wetpaint Wiki discussion boards indicate that an export feature was a long time in coming (and its absence quite a blocker to adoption by a number of serious would-be users). And what was eventually provided leaves a lot to be desired.

Wetpaint’s backup option “lets” you download your wiki content as a set of HTML files. Well, not really HTML files: text files with some embedded HTML-like markup. (Which version? Not declared.) Don’t expect to open these files locally in your browser and carry on surfing your wiki hypertext (even links between wiki pages need fixing). The export doesn’t include comment threads or old versions. Restoring it to your online wiki is not possible. But, for what it’s worth, you have at least salvaged some sort of raw content, that might be transformed into something like the wiki it came from, if hit with a bit of Perl script or similar.

I checked out Wikidot - another impressively-specced, free “wiki farm”. Wikidot’s backup option will deliver you a zip file containing each wiki page as a separate text file, containing your wiki markup as entered, as well as all uploaded file attachments. However, according to Wikidot support:

you can not restore from it automatically, it does not include all page revisions, only current (latest), it does not include forum discussion or page comments.

To reconstruct your wiki locally, you’ll, again, need some scripting, including using the Wikidot code libraries to reconvert its non-standard wiki-markup into standard HTML.
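To give a feel for what that scripting involves, here is a toy Python sketch that re-renders one line of wiki markup as HTML. The three constructs it handles (`**bold**`, `//italic//` and `[[[page|label]]]` links) are an assumed, simplified subset that I believe resembles Wikidot's syntax; the function name and the `.html` link targets are made up for illustration, and a real conversion should rely on the wiki's own libraries.

```python
import html
import re


def render_line(line: str) -> str:
    """Convert one line of simplified wiki markup to an HTML fragment."""
    # Escape <, > and & first so raw text cannot inject markup.
    text = html.escape(line, quote=False)

    def link(match: re.Match) -> str:
        target = match.group(1).strip()
        label = (match.group(2) or target).strip()
        # Assumption: each wiki page becomes a local file "<page-name>.html".
        return f'<a href="{target}.html">{label}</a>'

    # [[[page-name|label]]] or [[[page-name]]]
    text = re.sub(r"\[\[\[([^\]|]+)(?:\|([^\]]+))?\]\]\]", link, text)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    # Naive: this would also mangle bare URLs containing "//".
    text = re.sub(r"//(.+?)//", r"<em>\1</em>", text)
    return text


if __name__ == "__main__":
    print(render_line("See [[[start|the **home** page]]] for //details//"))
```

Even this much shows where the effort goes: every extra construct (headings, tables, includes) needs its own rule, and the rules interact.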

A third approach can be seen with a self-hosted copy of Mediawiki. Here you can select one or more pages by name, and have them exported as an XML file, which also contains revisions and assorted other metadata. Within the XML framework, the page text is stored as original wiki markup, raising the same conversion issues as with Wikidot. However, the XML file can be imported fairly easily into a different or blank instance of Mediawiki, recreating both hypertext and functionality more or less instantly.
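Because the export is plain XML, it is also easy to inspect with a short script before relying on it. The following Python sketch counts revisions per page; the sample document is hand-written and trimmed to the few elements used here, and the namespace shown is only one of several versions a real export might declare, so the code ignores namespaces altogether.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def summarise_export(xml_text: str) -> dict[str, int]:
    """Map each page title to the number of revisions in the export."""
    root = ET.fromstring(xml_text)
    summary: dict[str, int] = {}
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title = ""
        revisions = 0
        for child in page:
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "revision":
                revisions += 1
        summary[title] = revisions
    return summary


# Hand-written sample for illustration; a real export carries more metadata.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Main Page</title>
    <revision><text>First draft</text></revision>
    <revision><text>'''Second''' draft</text></revision>
  </page>
  <page>
    <title>Workshop notes</title>
    <revision><text>* item</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for title, count in summarise_export(SAMPLE).items():
        print(f"{title}: {count} revision(s)")
```

A check like this is a quick way to confirm that an export really does contain the revision history you expect, rather than only the latest version of each page.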

In contrast to all these approaches, if you set a spidering engine like HTTrack or Wget to work “remotely harvesting” the site, you would get a working local copy of your wiki looking pretty much as it does on the web. This might be an attractive option if you simply want to preserve a record of what you created, a snapshot of how it looked on a certain date; or just in case a day should come when Wetpaint.com Inc., and the rest, no longer exist.
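As a rough illustration of such a harvest, the Python sketch below assembles a Wget command line and, optionally, runs it. It assumes Wget is installed and on the PATH; the function names, the URL and the output directory are placeholders, and the option set is one reasonable choice rather than a recommended recipe.

```python
import subprocess


def build_wget_command(url: str, dest_dir: str) -> list[str]:
    """Return a Wget invocation that mirrors a site for offline browsing."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so the copy browses offline
        "--page-requisites",   # fetch images, CSS and scripts each page needs
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # stay below the starting URL
        "--wait=1",            # pause between requests to be polite
        f"--directory-prefix={dest_dir}",
        url,
    ]


def harvest(url: str, dest_dir: str) -> None:
    """Run the mirror; raises CalledProcessError if Wget reports failure."""
    subprocess.run(build_wget_command(url, dest_dir), check=True)


if __name__ == "__main__":
    # Placeholder address: substitute the wiki you actually want to capture.
    print(" ".join(build_wget_command("https://wiki.example.org/", "snapshot")))
```

Running the harvest on a schedule and keeping each dated output directory gives a simple series of snapshots, subject to whatever the host's terms of use allow.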

However, this will only result in something like a preservation copy - not a backup that can be easily restored to the wiki, and further edited - in the event, say, the wiki is hacked/cracked, or otherwise disfigured. For that kind of security, it may be enough to depend on regular backups of the underlying database, files and scripts: but you still ought to reassure yourself about exactly what backup regime your host is operating, and whether they can restore those backups in a timely fashion. (Notwithstanding the versioning features of most wikis, using them to roll back a raft of abusive changes across a whole site is not usually a quick, easy or particularly enjoyable task.)

All this suggests some basic questions that one needs to ask when setting up a wiki for a project:

  • How long do we need it for?
  • Will it need preserving at intervals, or at a completion date?
  • Is it more important to preserve its text content, or its complete look?
  • Should we back it up? If so, what should we back up?
  • Does the wiki provide backup features? If so, what does it back up (e.g. attachments, discussions, revisions)?
  • Once “backed up”, how easily can it be restored?
  • Will the links still work in our preservation or backup copy?
  • If the backup includes raw wiki markup, do you have the capabilities to re-render this as HTML?
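The question about links, at least, can be answered mechanically once you have a local copy. Here is a minimal Python sketch, assuming the copy is a directory of `.html` files; the class and function names are invented for this example, and it only checks `<a href>` targets, not images or stylesheets.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def broken_local_links(root: Path) -> list[tuple[str, str]]:
    """Return (page, href) pairs whose local target is missing under root."""
    broken = []
    for page in sorted(root.rglob("*.html")):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for href in collector.links:
            parts = urlsplit(href)
            # Skip external links, mailto: and pure in-page anchors.
            if parts.scheme or parts.netloc or not parts.path:
                continue
            if parts.path.startswith("/"):
                # Site-absolute link: resolve against the top of the copy.
                target = root / unquote(parts.path).lstrip("/")
            else:
                target = page.parent / unquote(parts.path)
            if not target.exists():
                broken.append((str(page.relative_to(root)), href))
    return broken


if __name__ == "__main__":
    for page, href in broken_local_links(Path("snapshot")):
        print(f"{page}: broken link to {href}")
```

An empty report does not prove the copy is complete, but a long one is a clear sign that the export or harvest needs its links fixing before it can be browsed.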

And questions like these are no less relevant when considering your uses of blogs and other social software: I hope we’ll be able to look at them more closely in another post.

Posted in Technologies, Web 2.0