Author Archives: Kevin Ashley

“Why study the web?” – Monday 8th March, Royal Society

My attention has just been drawn to this event by a blog post by Aleks Krotoski. The panel session, which will be streamed live and available for later download, will discuss ways in which the web can be studied at postgraduate level. Many of the examples focus on contemporary issues – the web as it is now – but this looks to be an ideal opportunity to highlight the research potential of web archives, and the services that those archives need to provide to enable research to be carried out. (JISC are commissioning work in this area.) More details are available at ECS Southampton. Worth a visit if you are nearby; I wish we had been able to give more warning!

Web archiving in the wider world

When a topic is being discussed in the correspondence pages of national newspapers, it’s a sign that it’s no longer the concern of a few specialists. That’s certainly been true of web archiving for some time, as a recent example shows. Malcolm Birdling wrote a letter published in the Guardian on January 1, 2010 bemoaning the fact that some government agencies – in particular the UK Border Agency – actively prevent sites such as the Internet Archive from capturing their contents. This has important consequences for citizens, particularly when such sites are used to publish regulations and guidance which change frequently. (I have anecdotal evidence that the UK Inland Revenue lost an appeal brought by a taxpayer over a very similar issue.)

[Image: WAGN website – capture from Internet Archive (detail)]

Mr Birdling’s letter brought a rapid response from David Thomas of the UK National Archives, who was keen to reassure readers that central government websites were being archived, even without the legislation which prompted Mr Birdling’s original letter. (That story refers to the changes to Legal Deposit regulations which would permit the British Library and other UK copyright libraries to capture UK content without the permission of rights owners.)

But earlier examples of non-specialist concern with preserving web content exist. One of my favourite examples comes from the Usenet group uk.railway, whose contributors include a fair number of rail enthusiasts (“trainspotters”, if you’re feeling unkind). Privatisation of the UK railway network means that we have a plethora of train operating companies, or TOCs, each of which operates its own website, much as the great companies of old such as LNER might have done if the web had existed then. The difference is that now these companies come and go every few years when the government puts operating contracts out for re-tender. Railway ephemera such as promotional leaflets and timetables are a key part of the print collections at places such as the National Railway Museum. “What happens to TOC web sites when franchises change?” wondered one poster to uk.railway back in 2007. The Internet Archive has certainly captured some material, but it isn’t the same as a collection controlled by an institution such as the NRM. I wasn’t able to give a very positive answer to their question. I don’t believe the National Railway Museum are yet able to capture websites as part of their collection, and it’s not clear that any of the members of UKWAC see TOC sites as falling within their collecting policy.

And herein lies a lesson. Rail enthusiasts are incredibly effective at preserving railway heritage, both through their own efforts and through influencing others. They include many people with an enviable range of technical abilities. They secured special legislation to ensure the preservation of railway heritage after privatisation. Not content with simply preserving heritage, some of them set about recreating it by building an entirely new steam locomotive. But their combined efforts have not yet (so far as I know) ensured that past railway websites have been preserved. If they can’t manage it without institutional help, what hope is there for the rest of us?

LIWA – Living Web Archives

The PoWR project identified a number of technical challenges which made certain types of content – particularly that with a Web 2.0 flavour – difficult to manage and preserve effectively. My attention has recently been drawn to an EU-funded project which hopes to overcome a number of these technical problems, as well as others that apply to large-scale archiving, such as the problem of spam content.

LIWA – Living Web Archives – began in early 2008, but as with many EU projects, its startup phase involved a lot of internal activity without much of a public face. As a result we didn’t pick up on its work in the JISC-PoWR handbook, but I’m sure we’ll rectify this omission in any future revisions.

To pick one example of LIWA’s areas of interest, it intends to develop tools which make it easier to take a temporal view of web archives and to maintain temporal consistency. Temporal consistency – or rather its absence – will be familiar to anyone who has spent time exploring sites in the Internet Archive, where different pages, or even portions of the same page (such as images) will have been archived on different days. This can lead to occasional surprises when navigating through archived content, with links taking one to pages that don’t have the expected content.
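The kind of inconsistency described above is easy to see in capture metadata. The sketch below is purely illustrative (the function names, the TOC-style URLs and the sample timestamps are mine; in real use the rows would come from a source such as the Wayback Machine’s CDX index rather than being hard-coded): it compares the capture dates of a page and one of its embedded images.

```python
from datetime import datetime

def parse_cdx_timestamp(ts):
    """CDX-style timestamps are 14-digit strings: YYYYMMDDhhmmss."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

def capture_gap_days(page_ts, resource_ts):
    """Days between the capture of a page and one of its embedded resources."""
    delta = parse_cdx_timestamp(page_ts) - parse_cdx_timestamp(resource_ts)
    return abs(delta).days  # abs() before .days, so order doesn't matter

# Hypothetical sample rows (timestamp, original URL) for a page and its logo:
page = ("20070115083000", "http://www.example-toc.co.uk/")
image = ("20070302114500", "http://www.example-toc.co.uk/logo.gif")

gap = capture_gap_days(page[0], image[0])
print(f"Page and image captures are {gap} days apart")
```

A gap of weeks between a page and its images is exactly the kind of temporal inconsistency that produces the navigation surprises mentioned above.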

LIWA’s partners include Hanzo, a UK-based web archive services company that we covered briefly in the handbook; I hope we can explore their potential value to UK HE in the future.

Legal scholarship recognises long-term value of blogs

A recent post on the digital-preservation list indicates that at least one scholarly community has recognised the long-term scholarly value of online resources such as blogs, and the potential damage to future scholarship that might result from their loss. It draws attention to a symposium taking place at Georgetown University next year. The email says that the symposium:

…will build upon the fundamental assumption that blogs are an integral part of today’s legal scholarship.

and goes on to say:

This symposium will bring together academic bloggers, librarians, and experts in digital preservation …. Symposium participants will collectively develop innovative practices to ensure that valuable scholarship is not easily lost.

Join the conversation now by tagging items you think are relevant to this symposium with the tag FTLS2009.

It’s interesting to observe that this is an example of a community acting to preserve information of interest that is likely to be scattered over many institutions and none. (I suspect a fair amount of blogging in this area is done by practitioners who aren’t at an academic institution.) One of the concerns we identified in PoWR was that much material of this type was unlikely to be preserved as a result of institutional interests, unless one institution tried to bring materials like this into the remit of its special collections (and some have done this.)

The conference web site goes on to say:

This unique symposium will seek answers to the questions:

1. How can quality academic scholarship reliably be discovered?
2. How can future researchers be assured of perpetual access to the information currently available in blogs?
3. How can any researcher be confident that documents posted to blogs are genuine?

The symposium will include a working group break-out session to create a uniform standard for preservation of blogs, a document to be shared by bloggers and librarians alike.

That last goal of a uniform standard for blog preservation looks like a tall order, and it will be interesting to see what emerges from this group, and what its wider relevance might be. But it’s a clear demonstration of the value of web material to some research communities, and of their willingness to do something about it if their institutions can’t, or won’t, help them.

More on iPres2008

Brian and Marieke have already written about iPres2008 and PoWR, and I have written and will write more about it from a general perspective on DABlog. But we thought it would be worth saying a bit more about what this conference, which looked at the complete picture of digital preservation, had to say that is relevant to PoWR’s work on web preservation in UK universities.

There was an entire session devoted to various web archiving initiatives on the second day, which at first sight might seem of particular relevance (almost as much as Brian’s presentation, one might think.) I wasn’t at this session – it was one of those running in parallel tracks, and I was speaking in the other track – but Ed Pinsent was, and will be writing at more length about it soon. Even without attending, I’m aware that many of the projects, operating as they do within their national domains in Australia or elsewhere, won’t have much role in helping save UK University web content (unless we move our domains to – there’s a thought.) Even when the BL realises its long-term aim of harvesting across the entire UK web domain, it will still be selective in some ways about what it captures – about depth and frequency of harvests, and about the type of content. You won’t be able to depend on those institutions to capture what you want to be captured. So if these initiatives aren’t going to meet all our needs, do we need to do it ourselves? The PoWR project thinks not, but that is one of the options institutions will need to examine. The work the IIPC is doing to develop harvesting and access tools will be of interest to those few institutions that feel able to operate these tools themselves – not something to be undertaken lightly.

Yet there was much of relevance at iPres2008. One recurring theme, picked up at the outset by Lynne Brindley and in Steve Knight’s closing remarks, was that ‘digital preservation’ is not the term to be using in discussions with our institutions and the world, echoing remarks on the DCC blog which Brian later picked up on here. Steve prefers the phrase ‘permanent access’, which is indeed outcome-focussed. However, we’ve also said in PoWR that preservation isn’t always forever, so I would prefer something a little more all-embracing – ‘long-lived access’ might fit.

The sessions covering things like significant properties also touched on issues that PoWR is concerned with. When we decide to preserve something, what is it that we’re really trying to keep? Most forms of preservation change the original object in some way, just as long-life milk isn’t the same as pasteurised, and neither is quite as tasty as fresh milk (or so I’ve been told.) This is clearly still a very difficult problem, and one that (to my mind) demonstrates that the digital preservation community hasn’t even developed a clear problem statement, much less a fully worked-out solution. So, in the meantime, we need to be pragmatic and do what seems best at the time. Always a good plan.

PDF web page capture

During the last JISC PoWR workshop yesterday in Manchester (of which more anon) I made brief mention of a tool from Adobe which allows web pages, or entire sites, to be captured to a PDF file. I mentioned this primarily to illustrate one of the three points at which web capture can take place (behind the server; from the HTTP transaction; or browser-side) but it generated considerable interest, and I promised to blog about the product since I could not remember what it was called.

It turns out that it’s not a separate product, nor a plug-in, but a built-in part of Adobe Acrobat. It was first available as a free add-on for Acrobat 4 in 1998 or 1999, and I think it was then that I first saw it demonstrated at the PRO (as it then was) – hence my misunderstanding. Tools like this have their place, but (like all web preservation technologies) they also have their drawbacks. PDF’s print-oriented format isn’t a good match for some sites, much as some sites don’t look good when you try to print them. (In fact, I believe that Acrobat Web Capture effectively uses the browser’s print engine combined with a PDF-writer pseudo-printer to do its work, so there will be a close correlation.) But we’ll be covering this tool, along with others, in the handbook.

What can PoWR do for you?

Web preservation is a big topic and we’re not even pretending to deal with all of it. The aspect that we care about – that JISC believes the community is looking for help with – is fairly well-defined. We want to help institutions make effective decisions about preserving web resources, and help them implement those decisions in a way that is cost-effective and non-disruptive.

Making effective decisions
At its simplest level, this means deciding what to keep and what not to keep. There may be many drivers for these decisions – institutional policy, legal requirements and research interests are just a few. The decisions need to relate not just to what is to be kept, but why and who for. That’s because those requirements may have a bearing on how you choose to go about the job, or whose responsibility it is to carry it out. Not everything needs to be kept, and even when it does, it may not be your institution’s responsibility to keep it.

Implementing those decisions
Carrying out your decisions – keeping things, throwing things away, or ensuring that other people keep things – can be the trickiest part of the process. You may know you want to preserve the prospectus for past years, but can you be sure that your CMS, or the Internet Archive, or some local use of web-harvesting tools is going to do this job effectively for you? You may be told that some part of your web infrastructure would be easier to preserve if you avoided the use of certain features, or used a different authoring system. Is that true, and if it is, what are the negative consequences of such decisions?
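To make the “local use of web-harvesting tools” option concrete, here is a minimal sketch of what such a harvest might look like using wget, a widely available mirroring tool. The prospectus URL and directory name are hypothetical, and the flags shown are just one reasonable starting point, not a complete preservation recipe:

```python
import shlex

def wget_snapshot_command(url, dest_dir):
    """Build a wget invocation that mirrors a set of pages, along with the
    images and stylesheets needed to render them, into a local directory."""
    return [
        "wget",
        "--mirror",            # recursive download, keeping server timestamps
        "--page-requisites",   # also fetch images/CSS each page needs to render
        "--convert-links",     # rewrite links so the local copy is browsable offline
        "--directory-prefix", dest_dir,
        url,
    ]

cmd = wget_snapshot_command("http://www.example.ac.uk/prospectus/2008/",
                            "prospectus-snapshots")
print(shlex.join(cmd))
```

Even this simple approach raises the questions above: a wget mirror captures what the server sends at one moment, but not content behind forms or scripts – which is one reason institutions need to check that whatever tool they choose really does the job they intend.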

The handbook which will be one of the project’s outputs will attempt to answer these questions in a way that makes sense to everyone who might be involved in the process. We want to make it easier to take decisions about preservation, and to know what tools, systems or working methods can be employed to help you implement them.

The workshops are the primary mechanism we’re using to test whether the handbook makes sense to the people it’s aimed at, and whether it tackles the problems that people are actually facing.

Introduction: Kevin Ashley

[Image: Kevin Ashley mugshot]

Hello. I’m Kevin Ashley, manager of the Digital Archives Department (DAD) at ULCC since its establishment in 1997 (the department, not ULCC.) During that time, DAD has set up and run the NDAD service for The National Archives, preserved digital material for the British Library (before handing it back to them to put in their shiny new Digital Object Management system), collaborated with Cornell and the DPC to produce the Digital Preservation Training Programme in the UK, and many other activities.

I’m currently chair of JISC’s Repositories and Preservation Advisory Group, and ULCC’s representative on the DPC board. My proudest achievement is the creation (with my quondam colleague Martin Powell) of a founder member in the Useless Web Pages Hall of Fame: the ULCC web telephone dialler – often imitated but never, IMHO, bettered. Unfortunately, both the dialler itself and the Hall of Fame are no longer with us on the web, and those links both depend on the Internet Archive’s Wayback Machine. For that reason, and many others, I’m particularly interested in the success of PoWR.