JISC PoWR

Preservation of Web Resources: a JISC-funded project [Archived Blog]

Releasing the Herds of Cows – The Missing links Workshop

Posted by Marieke Guy on July 22nd, 2009

Yesterday many of those most committed to, and interested in, Web preservation attended the Missing links: the enduring web conference held at the British Library. The event, which had over 100 attendees (from academia, galleries, museums, libraries, archives, government, research groups and beyond), was an attempt to ‘develop and strengthen the links between content creators, tools developers, preservation services and users in order to secure an enduring web’.

If anyone can successfully preserve the Web, it is this lot!

Session 1: Setting the scene

Keynote: Adrian Brown, Assistant Clerk of the Records, Parliamentary Archives

The opening keynote was given by Adrian Brown from the Parliamentary Archives, who did an adept job of setting the scene and outlining the challenges ahead. His talk centred around the key jigsaw pieces of selection, capture, storage, access and preservation. He pointed out that the Web used to be primarily a publisher of fixed content, and it is the new types of content (discussions on wikis, Twitter, blogs etc.) that offer the biggest challenge. Parliament themselves are using many new forms of information, and although permission is not always an issue, capturing content on other sites often is. You may own the content but someone else often owns the presentation of that content. Adrian also emphasised the need for us to co-ordinate selection and be as explicit about what we are not going to capture as what we are going to capture (the unpublishing button on Facebook being a classic example of how difficult it can be to go back!). Another major challenge is that of temporal cohesion – the rate of capture of content is often far slower than the rate of change of content.

He concluded with a Peter Murray-Rust quote that stresses the need for us to “Create and release herds of cows, not preserve hamburgers in a deepfreeze.”

Web Archive and Citation Repository in One: DACHS: Hanno Lecher, Librarian, Sinological Library, Leiden

Hanno Lecher talked about the citation problems he has encountered while working on the digital archive for Chinese studies, a resource that captures and archives relevant resources as primary sources for later research. His work has indicated the possible need for citation repositories. Some options might include verifying URL references, evaluating the reliability of online resources, use of PURLs, avoidance of obscure URLs and even the extreme of banning online resources as references. As such approaches are not always practical (or appropriate), he suggested the use of applications like Snagit, Zotero or WebCite.

The future of researching the past of the Internet: Eric T. Meyer, Research Fellow, Oxford Internet Institute, Oxford University

Eric Meyer works primarily on social behaviour and the Internet, and as part of the Transatlantic Digitisation Project spent some time building an archive collection of World War 1 and 2 Internet resources. The process gave the team insight into the feasibility of creating such an archive. Lessons were also learnt on the need for involvement of domain experts (webometrics, SNA, eResearch, etc.) and the desire to move from snapshots to more continuous data. Eric concluded by stressing that what researchers really want is a global way to access archives, not a national one.

After the morning coffee break the workshop moved from the general to the specific.

Session 2: Creation, capture & collection

An overview of Web Archiving Tools by Helen Hockx-Yu, Web Archiving Programme Manager, The British Library

Helen Hockx-Yu provided a very practical overview of current tools, covering Heritrix, NetArchiveSuite, Web Curator Tool and PANDAS. She also discussed archival formats such as ARC and WARC, the latter being highly desirable from a long term archival standpoint. Helen concluded with a brief discussion of the limitations and challenges harvesters present, from issues with rendering and dealing with ‘bad’ content to reliance on open source tools that are still very much evolving.

Context and content: Delivering Coordinated UK Web Archive to User Communities: Cathy Smith, Collections Strategy Manager, The National Archives

Cathy Smith and the UKWAC team have recently been working on a study entitled Delivering coordinated UK web archives, which looks at what audiences web archives should anticipate, what this means for selection, ingest and preservation, and what researchers want from archives. The conclusions include the notion of a single view of an archive, though this does not necessarily mean a single repository. Cathy explained that “they just want the car to work – they don’t want to see what’s under the bonnet”. In response to the study the NA have recently begun automated whole domain harvesting (shallow, as well as selective harvesting) and they are looking at allowing all the archives to be searched together as one collection.

Capture and Continuity: Broken links and the UK Central Government Web Presence: Amanda Spencer and Tom Storrar from the National Archives.

Amanda Spencer and Tom Storrar were set the task of sorting out the problem of broken links on government Web sites. 60% of URLs quoted in Hansard between 1997 and 2006 are broken, leaving the public unable to find information and with little faith in the authority of government Web sites. The Web Continuity project has begun to create an archive of all Central Government Web space. Some things they are trying out include XML site maps, using redirects instead of 404 messages and capturing all government Web sites before they are closed.
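To make the "redirects instead of 404 messages" idea concrete, here is a minimal sketch of the decision a Web server could make for each request. It is an illustration only, not the Web Continuity project's implementation: the archive address, the function name and the choice of a 302 status code are all assumptions made for the example.

```python
from typing import Optional, Set, Tuple

# Hypothetical archive address, invented for this example.
ARCHIVE_PREFIX = "https://archive.example/snapshot/"


def respond(url: str, live_urls: Set[str], archived_urls: Set[str]) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect target) for a requested URL."""
    if url in live_urls:
        # The page still exists on the live site: serve it as normal.
        return 200, None
    if url in archived_urls:
        # The page has gone but a capture exists: redirect to the archived
        # copy. (302 is an assumption; a site might prefer 301.)
        return 302, ARCHIVE_PREFIX + url
    # Only pages that were never captured end in a "not found" message.
    return 404, None


live = {"http://dept.example/"}
archived = {"http://dept.example/", "http://dept.example/report-2001"}
result = respond("http://dept.example/report-2001", live, archived)
```

The sketch also shows why capturing sites before they close matters: the redirect can only help if the requested page is already in the archive.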

Discussion Session

In the questions and discussion session the panel were asked what advice they would give to Web managers on how they can make archiving of their sites more effective. The suggestions tackled content (following accessibility guidelines, W3C standards, making code valid, adding in hard links to files next to streaming content, having transcripts of audio files) and communication (enabling a dialogue between Web archivists and Web creation people). It was noted that there is a fine line between encouraging good practice and stifling innovation and that at times communication with content creators is a luxury. Responsible harvesting also means explaining why you are doing it to those involved.

The discussion then moved to other possible points of capture such as using Google Cache or asking ISPs.

Session 3: Issues and approaches to long term preservation of web archives

After lunch the programme moved on from the specific problems of now to problems of the future.

Diamonds in the Rough: Capturing and Preserving Online Content from Blogs: Richard Davis, Project Manager, University of London Computer Centre (ULCC)

Richard Davis, a member of the JISC PoWR team, started off by emphasising the importance of blogs today. This has led him and colleague Maureen Pennock to consider the need for their preservation, particularly at institutional level, resulting in the ArchivePress project. Richard pointed out that traditional Web archiving tools can be difficult to configure and use: a ‘hammer to crack a nut’. A comment by Chris Rusbridge that “blogs represent an area where the content is primary and design secondary” suggested that using blog feeds might be key. The feed could be a point at which to gather blog content for archiving. The test bed for ArchivePress will include blogs from the DCC, Lincoln University and UKOLN. In the future ArchivePress could possibly be used to harvest Twitter.
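As a rough illustration of why the feed is an attractive capture point, the sketch below pulls the title, link, date and content of each post out of an RSS 2.0 feed using only the Python standard library. It is not how ArchivePress itself is built; the function name and the sample feed are invented for the example, and a real harvester would also need to fetch the feed, handle Atom and deal with truncated entries.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def extract_posts(feed_xml: str) -> List[Dict[str, str]]:
    """Collect the core fields of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "content": item.findtext("description", default=""),
        })
    return posts


# Invented sample feed, standing in for a real blog's RSS output.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example/first</link>
      <pubDate>Wed, 22 Jul 2009 10:00:00 GMT</pubDate>
      <description>Text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example/second</link>
      <pubDate>Thu, 23 Jul 2009 10:00:00 GMT</pubDate>
      <description>Text of the second post.</description>
    </item>
  </channel>
</rss>"""

posts = extract_posts(SAMPLE_FEED)
```

Everything gathered here is content rather than design, which fits the observation that for blogs the content is primary.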

Beyond Harvest: Long Term Preservation of the UK Web Archive: Maureen Pennock, Web Archive Preservation Project Manager, The British Library

Maureen Pennock started her talk by giving some general background to the UK Web Archive. Now that the archive has been established, the next task is preserving it, an ongoing activity. The main undertakings are documenting system dependencies, consideration of containers and metadata standards (following a review, WARC is the preferred format for preservation, but they are also using a combination of features from METS and PREMIS), preservation workflow and defining the preservation strategy. The future involves the consideration of many new areas: for example, should we preserve computer viruses?

From Web Page to Living Web Archive: Thomas Risse, Senior researcher, L3S Research Center

Thomas Risse introduced the LiWA (Living Web Archives) project, an EU FP7 research project. LiWA has been looking at building tools to address the specific challenges of the current generation Web, e.g. link extraction from dynamic pages.

Emulating access to the web 1.0: Jeffrey van der Hoeven, Koninklijke Bibliotheek, The Netherlands

Jeffrey van der Hoeven started off with a whirlwind history of the Web browser, from the WorldWideWeb browser in 1991, which was more like a document viewer, through Mosaic and Netscape, to today’s Firefox and Google Chrome. Today’s browsers do many things, from rendering Web sites, bookmarking, caching and authentication to providing RSS feeds. They have changed from viewers to all round workplaces – Google Wave is an extreme example of this. The browser is currently the central application of the computer and there is a need for a sustainable solution to render Web sites. At the Koninklijke Bibliotheek they have recently been working on the Dioscuri project, a durable emulator. They have also been working with the Planets project using Global Remote Access to Emulation (GRATE) and on the Keeping Emulation Environments Portable (KEEP) project.

After another coffee break we moved on to the final discussions.

Session 4: Discussion and next steps

What we want with web-archives; will we win? Kevin Ashley, Head of Digital Archives, University of London Computer Centre (ULCC)

Kevin Ashley, also part of the JISC PoWR team, kicked off his talk by introducing us to the future of the Web, Web 8.0. Although Kevin was just joking with us, he explained that when the next major change in the Web happens, we will need a much better metaphor than ‘Web 3.0’: it is the future that we need to think about. We need to be asking the key question of what we want from Web archives, e.g. what the potential use cases are. Preserving the Web is not just about content. Sometimes it will be useful to look at Web content in aggregate, e.g. for visualisation, linguistic analysis, etc. Right now many academics are doing their own Web archiving, of content, presentation or both. However, we need to take not just a document centred view of data but also consider the web of data and data about the web. A lot of future interest will not be in Web content, but in the context of use.

The day concluded with a round table discussion with many speakers from the day. One of the topics discussed was the permissions challenge (primarily with regard to UKWAC and similar initiatives). Other countries and initiatives have favoured a more ‘just do it’ approach.

The day left most people enthused and positive about the future for Web archiving.

A conference report will follow on the DPC Web site and will be sent out via the dpc-discussion and digital-preservation JISCmail lists.

All presentations are now available in PDF format from the DPC Web site.

WAC09 was used as the Twitter tag for the event. You can see the tweets at http://hashtags.org/tag/wac09/messages.

The event was sponsored by the Digital Preservation Coalition (DPC), the Joint Information Systems Committee (JISC) and the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland).
