Web Resource Preservation: No One Ever Said It Would Be Easy….

If it were, we’d all be at it!

Any records manager or archivist will probably be able to give you half a dozen reasons why digital preservation is very important. Some might well give you half a dozen more for why the preservation of Web resources in particular, which now play such a huge part in our daily lives, is very, very important.

Unfortunately, this critical activity isn’t easy. In fact, the very nature of the Web means that the preservation and archiving of Web resources is a very complex task. A few of the major issues include:

  • The transient and dynamic nature of the Web – The Web is growing at a rapid rate. The average Web resource’s lifespan is short and pages are often removed. Publishing on the Web is easy, and content may be changed often and not necessarily in an orderly way. Metadata is very much an afterthought. Web 2.0 content (comprising data mash-ups, blog entries, comments etc.) is even more dynamic.
  • Selection issues – Of the billions of resources out there, which should we preserve, and which instantiation of each?
  • The technologies involved – The Web is dependent on technology: it uses various file formats and follows many protocols, most of which evolve quickly. The look and feel of a Web page may be determined by a number of different elements such as the code, the HTTP protocol, the user, the browser and the server. Which of these need to be preserved? Web resources are usually held on just one server, so are at greater risk of removal, yet for some resources countless copies are made. Again, which do we preserve? Web sites are held together by hypertext links, so archiving software that crawls a site may omit parts of it, for example pages excluded by a robots.txt file or pages that are not actually linked to. Whole areas of the Web are held in problematic content management systems (CMS) or behind authentication systems, and Web 2.0 applications use layered APIs, which use data in many different ways.
  • Organisational issues – How is your institution using its Web site? Is it a publication or is it a record? Is the content being managed? Who is responsible and who has ownership?
  • The legal issues – There are many IPR and data protection issues with Web content. Who owns the photos on Flickr, the comments on a blog or the details on a social networking site?
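
To make the robots.txt point above a little more concrete, here is a minimal sketch of how a polite archiving crawler might decide which pages it is allowed to fetch. It uses Python’s standard urllib.robotparser module. The robots.txt rules, the example.ac.uk URLs, the "example-archiver" user agent and the archivable() helper are all made up for illustration; real crawlers will differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary institutional site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /intranet/
Disallow: /cgi-bin/
"""


def archivable(urls, robots_txt, agent="example-archiver"):
    """Map each URL to True if robots.txt lets `agent` fetch it, else False."""
    parser = RobotFileParser()
    # Parse the rules from a string rather than fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}


# Hypothetical pages a crawl might discover by following links.
urls = [
    "http://www.example.ac.uk/",
    "http://www.example.ac.uk/prospectus/2008.html",
    "http://www.example.ac.uk/intranet/minutes.html",
]

result = archivable(urls, ROBOTS_TXT)
for url, allowed in result.items():
    print("archive" if allowed else "skip   ", url)
```

Anything under /intranet/ is skipped in this sketch, which is exactly how part of a site can quietly drop out of an archive. Pages that nothing links to would never reach the list of URLs in the first place.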

There is no easy answer! However, despite the difficulties of Web preservation, some institutions may be addressing some of these issues already. We are keen to hear of examples of any approaches being taken.

5 thoughts on “Web Resource Preservation: No One Ever Said It Would Be Easy….”

  1. Richard Davis

    Hi Marieke. I think this is a useful breakdown: I particularly hope we’ll be able to get a meaningful picture of how Web Resources are instantiated in our target institutions. From the big corporate rebranding exercise, to editing a web page, to flinging up a blog or wiki like this one: there are many ways these things can happen, many of them beyond the immediate control of a central Information or Records Manager. What are the particular issues for different organisational structures? Is there any difference embodied in the various web platforms (Windows/IIS, Linux/Apache) and CMS apps in use – none of which is likely to have been chosen for its preservation or RM features?

    Also, given that every Information Manager who does not already have web content under RM-like control has to “start here” – looking both forward and back, forging better contacts with web teams, registrars, etc. – I hope at the workshops we can explore very practical examples of how they might do that.

  2. Peter Barnes

    Hi Marieke

    Plenty of food for thought … I’m looking forward to the event at Senate House, it will be interesting to see how (if?) things have moved forward since the foundation of the Digital Curation Centre. Speaking of which, how does this project fit in with the DCC?

    I think that, for complex web production technologies with a host of dependencies, the only way forward is to take snapshots of published content as HTML with mirroring tools such as HTTrack.

    How to take an effective automatic snapshot of a site that has complex user interaction is the real toughie – tools such as web application security scanners are beginning to become a bit cleverer at tackling interaction [is this/will this filter through to mirroring tools?] but it strikes me that crawlers and information retrieval systems with a degree of AI will be required to really crack the problem.

  3. Marieke Guy

    Hi Peter,

    It will be good to see you at our first workshop.

    In response to your question about how this project fits in with the Digital Curation Centre (DCC):

    The short answer is that while JISC-PoWR would hope to complement the DCC’s work, the two are actually focussed on quite different areas of work.

    There are two main differences. 1) While the DCC covers all digital data (but concentrates on scholarly and scientific materials), JISC-PoWR is focussed specifically on Web Resources (of any type). 2) The DCC is interested in maintaining and adding value to data (of which preservation is a part) and covers a wide range of topics (such as preservation of data, email preservation etc.), while JISC-PoWR is just looking at encouraging preservation by sharing practice and transferring knowledge between the UK HE/FE Web Manager community and the Digital Library and Digital Preservation communities.

    Also JISC-PoWR is a short-term project while the DCC is a much larger project, with partners from The University of Edinburgh, The National e-Science Centre, Humanities Advanced Technology and Information Institute, Science and Technology Facilities Council (STFC) and UKOLN. As UKOLN is a member of DCC and the JISC-PoWR team we hope to be able to work together.

    The different ‘preservation’ terms are defined in Digital preservation in a nutshell (Part I).



  4. Richard M. Davis

    Hi Peter

    As Brian’s said elsewhere, we want to try and bring a user-focused approach to issues that have been covered in various forms and forums by DCC and DPC, not to mention Adrian Brown’s book. I’m approaching this as a great opportunity to take our own kind of snapshot of the state of things, and see how much of the material generated by events like the DCC’s 2006 Workshop has trickled down to practitioners. Part of the purpose of the workshops is to scope the PoWR Handbook, which we would like to make as practical and accessible as possible.
