Author Archives: Marieke Guy

About Marieke Guy

I am a research officer in the Community and Outreach Team at UKOLN. Much of my work involves exploring Web 2.0 technologies and their relevance to the communities we work with.

Goodbye from the JISC PoWR blog

From today we don’t intend to provide any more significant posts on the JISC PoWR blog and will be closing comments. The blog will remain here as a resource for you to use but it is now officially frozen.

An Archived blog page is now available giving further information on the archiving of the blog. It includes blog statistics for future reference.

The JISC PoWR team would like to say thank you to all our readers. Most of the team members are involved in new digital/Web preservation work so this won’t be the last you hear from us!

Cessation of posts to the JISC PoWR blog

Following the successful completion of the JISC PoWR project we continued to publish occasional posts on this blog related to the preservation of Web sites. We have also recently published a new handbook on the preservation of Web resources, which we announced on this blog.

It is now therefore timely to officially announce that we do not intend to publish any new posts on the blog after a couple of posts which provide a summary of how this blog was used. A week or so after the final posts have been published we will switch off comments on the blog, so that we no longer have to spend time checking for spam comments.

The blog itself, and all posts and comments, will remain available for the indefinite future – by which we mean that we will seek to provide access for a period of at least 3 years from now.

The summary posts we intend to provide will contain details about the blog such as:

  • Number of posts and comments
  • Details of contributors
  • Details of blog theme and plugins used
  • Details of type and version of software used

If you have any suggestions for any other information it would be useful to provide and record please do let us know.

We intend to use the closing of the blog as a case study which will be documented as part of the JISC Beginner’s Guide to Digital Preservation. The Beginner’s Guide will eventually be available online but the process of creating the guide is being documented in the JISC Beginner’s Guide to Digital Preservation blog.

A Guide to Web Preservation

The JISC PoWR team is pleased to announce the launch of A Guide to Web Preservation.

This Guide uses similar content to PoWR: The Preservation of Web Resources Handbook but in a way which provides a practical guide to web preservation, particularly for web and records managers. The chapters are set out in a logical sequence and answer the questions which might be raised when web preservation is being seriously considered by an institution. These are:

  • What is preservation?
  • What are web resources?
  • Why do I have to preserve them?
  • What is a web preservation programme?
  • How do I decide what to preserve?
  • How do I capture them?
  • Who should be involved?
  • What approaches should I take?
  • What policies need to be developed?

Each chapter concludes with a set of actions and one chapter lists the tasks which must be carried out, and the timings of these tasks, if an institution is to develop and maintain a web preservation programme. In addition points made in the Guide are illustrated with a number of case studies.

The guide was edited by Susan Farrell who has used her knowledge and expertise in the management of large-scale institutional Web services in writing the document.

The Guide can be downloaded (in PDF format) from the JISC PoWR Web site. The Guide is also hosted on the JISCPress service, which provides a commenting and annotation capability. It has also been published on the Lulu.com print-on-demand service, where it can be bought for £2.82 plus postage and packing.

If you want to discuss the Guide on Twitter you should use the #jiscpowr tag.

Making any Upgrades to your Blog Sir?

This blog is hosted by JISC Involve who provide blogs for the JISC community.

Until recently JISC Involve was running on an old version of WordPress (1.2.5). Earlier this month the JISC Digital Communications Team upgraded their server to the latest version of WordPress (2.9.2) and then migrated all of JISC Involve’s blogs over to the new installation.

Although all blog posts, comments, attachments, user accounts, permissions and customisations were supposed to move over easily, JISC Involve users were encouraged to back up the content of drafts etc. ‘just in case’.

Unfortunately there were some technical problems migrating the content and as a consequence the original theme was lost and URLs now redirect.

Luckily the JISC PoWR team were able to locate the original theme and reinstall it.

However the process has made them aware of the need to record details of the technical components and architecture of the blog. This information can be critical in a migration process and when ‘closing down’ a blog.

The JISC PoWR team will ensure that such information is routinely recorded.

Is there any other information that is important for preservation or migration purposes?
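One lightweight way to capture this kind of information is a simple structured record kept alongside the archived content. The sketch below is purely illustrative: every field name and value is an assumption for the sake of example, not a record of this blog’s actual configuration.

```python
import json

# A minimal, hypothetical record of a blog's technical components --
# the sort of information worth capturing before a migration or
# shutdown. All field names and values here are invented examples.
blog_record = {
    "platform": "WordPress",
    "software_version": "2.9.2",
    "theme": "example-theme",                    # placeholder theme name
    "plugins": ["akismet", "example-stats"],     # placeholder plugin list
    "url": "http://example.org/blog/",
    "post_count": 123,
    "comment_count": 456,
    "contributors": ["A. Author", "B. Blogger"],
    "hosting": "JISC Involve",
}

# Serialising the record as JSON keeps it human-readable and easy to
# store next to the archived blog content itself.
print(json.dumps(blog_record, indent=2))
```

A plain-text or JSON file like this survives platform changes far better than information buried in a database or an administrator’s memory.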

JISC Beginner’s Guide to Digital Preservation

Members of UKOLN who were involved in the JISC PoWR project have recently begun work on a new project creating a straightforward and pragmatic guide to digital preservation for those working on JISC projects. The project will create the JISC Beginner’s Guide to Digital Preservation.

It will look at reasons why JISC projects might want to preserve their deliverables, introduce mainstream terminology and processes, and offer clear-cut solutions. The guide will also offer lists of references and resources, a checklist of issues users will need to think about and a number of case studies against which they will be able to benchmark themselves.

A number of the discussions initiated on the JISC PoWR blog (such as preservation of Web 2.0 services including blogs and wikis) will be taken forward on the new project.

A project blog has recently been launched at http://blogs.ukoln.ac.uk/jisc-bgdp/

The Library of Congress Twitter Archive

Two weeks ago the Library of Congress announced that they will be archiving all public tweets since Twitter began. The tweets have been given to the library as a ‘gift’ from Twitter.

The announcement was fittingly made on Twitter.

Yesterday the Library of Congress blog published a list of FAQs about the approach they will be taking.

The FAQ explains:

  • Why is it important to preserve the Twitter archive?
    It sees Twitter as part of the historical record of communication, news reporting, and social trends – all of which complement the Library’s existing cultural heritage collections.
  • What is in the Archive?
    Public information. Not private account information or deleted tweets.
  • What does the Library plan to do with the archive?
    Its aims are preserving access to the archive for the long term and making data available to researchers.

Blue Ribbon Task Force Publishes Sustainable Economics for a Digital Planet

Universities grappling with complex decisions on which of their burgeoning digital resources they should preserve – and the inherent financial, technical and legal issues that surround such work – may welcome a report that offers a “supply-and-demand” perspective on how individuals and institutions might manage their digital collections.

The Blue Ribbon Task Force on Sustainable Digital Preservation and Access (BRTF-SDPA), a new international initiative funded by JISC and other organisations, has recently released its report entitled Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information. Its report examines the complicated and diverse issues from an economic standpoint. It identifies the problems intrinsic to all preserved digital materials, and proposes domain-specific actions that address the challenges to sustainability. The report focuses its inquiry on materials of long-term public interest in content domains with diverse preservation profiles, namely scholarly discourse, commercially owned cultural content and collectively produced Web content.

JISC is organising a free one-day symposium in London on 6 May 2010 where the Blue Ribbon task force will present its final report and invite responses from the BBC, the Natural History Museum, the British Library, European Bioinformatics Institute and the European Commission. Further information is available.

Kevin Ashley new DCC Director

Earlier this week the Digital Curation Centre announced the appointment of their new Director, who will succeed Chris Rusbridge upon his retirement in April 2010. The role has been taken on by JISC PoWR’s very own Kevin Ashley.

Kevin has been Head of Digital Archives at the University of London Computer Centre (ULCC) since 1997, during which time his multi-disciplinary group has provided services related to the preservation and reusability of digital resources on behalf of other organisations, as well as conducting research, development and training.

The group has operated the National Digital Archive of Datasets for The National Archives of the UK for over twelve years, delivering customised digital repository services to a range of organisations.

As a member of the JISC’s Infrastructure and Resources Committee, the Advisory Council for ERPANET, plus several advisory boards for data and archives projects and services, Kevin has contributed widely to the research information community.

Kevin has been an active member of the JISC PoWR project and written many blog posts sharing his expertise.

The DCC, which has just begun its third phase of work, makes the following comment on its Web site (A new phase, a new perspective, a new Director):

As a firm and trusted proponent of the DCC we look forward to his energetic leadership in this new phase of our evolution.

At JISC PoWR we offer Kevin our congratulations and wish him all the best in his new role.

Official Launch of the UK Web Archive

The British Library has officially launched the UK Web Archive, offering access in perpetuity to thousands of UK websites for generations of researchers.

The site was unveiled earlier this week by the Minister for Culture and Tourism, the Rt Hon Margaret Hodge MBE MP, and the Chief Executive of the British Library, Dame Lynne Brindley. The project demonstrates the importance and value of the nation’s digital memory.

Websites included in the UK Web Archive include:

  • The Credit Crunch – initiated in July 2008, this collection contains records of high-street victims of the recession – including Woolworths and Zavvi.
  • Antony Gormley’s ‘One & Other’ Trafalgar Square Fourth Plinth Project – involving 2,400 participants and streamed live by Sky Arts over the web to an audience of millions, this site will no longer exist online from March 2010.
  • 2010 General Election – work has started to preserve the websites of MPs such as Derek Wyatt, who will be retiring at the next election, creating a permanent record of his time as a Member of Parliament.

This important research resource has been developed in partnership with the National Library of Wales, JISC and the Wellcome Library, as well as technology partners such as IBM.

British Library Chief Executive, Dame Lynne Brindley said:

Since 2004 the British Library has led the UK Web Archive in its mission to archive a record of the major cultural and social issues being discussed online. Throughout the project the Library has worked directly with copyright holders to capture and preserve over 6,000 carefully selected websites, helping to avoid the creation of a ‘digital black hole’ in the nation’s memory.

“Limited by the existing legal position, at the current rate it will be feasible to collect just 1% of all free UK websites by 2011. We hope the current DCMS consultation will enact the 2003 Legal Deposit Libraries Act and extend the provision of legal deposit through regulation to cover freely available UK websites, providing regular snapshots of the free UK web domain for the benefit of future research.”

Further details are available from the British Library.

Findings available from the KRDS2 Survey

The findings from the Keeping Research Data Safe 2 (KRDS2) survey of digital preservation cost information are now available on the KRDS2 project Web page.

KRDS2

The Keeping Research Data Safe 2 project commenced on 31 March 2009 and will complete in December 2009. The project will identify and analyse sources of long-lived data and develop longitudinal data on associated preservation costs and benefits. It is believed that these outcomes will be critical to developing preservation costing tools and cost benefit analyses for justifying and sustaining major investments in repositories and data curation.

The Survey

The survey was carried out between September and November 2009 to identify key research data collections with information on preservation costs and related issues. 13 survey responses were received: 11 of these were from UK-based collections, and 2 were from mainland Europe. The responses covered a broad area of research, including the arts and humanities, social sciences, and physical and biological sciences, as well as research data archives and cultural heritage collections.

The survey questionnaire sought to identify cost information available for the main KRDS2 activities in the Pre-Archive and Archive phases. Availability of information for some activities is very high (archival storage cost information is available in 100% of the responses). Other, more infrequent activities such as disposal (and perhaps also preservation planning) are less well represented. Knowledge of acquisition costs is also relatively low (46%).

Further information is available from the KRDS2 project Web page.



Bookings open for 5th International Digital Curation Conference

5th International Digital Curation Conference

“Moving to Multi-Scale Science: Managing Complexity and Diversity” | 2-4 December 2009

The IDCC is an established annual event reaching out to individuals, organisations and institutions across all disciplines and domains involved in curating data for e-science and e-research.

The DCC will be hosting a workshop programme on 2 December followed by a Pre-Conference Drinks Reception at the Natural History Museum. The main conference will open on 3 December with a keynote speech from Professor Douglas Kell, Chief Executive of the Biotechnology & Biological Sciences Research Council (BBSRC). Other key speakers will include: Professor Ed Seidel, National Science Foundation; Cliff Lynch, Coalition for Networked Information; Timo Hannay, Nature Publishing Group. The first day of the conference will incorporate an interactive afternoon for posters and demos, followed by a Symposium entitled "Citizen Science: Data Challenges" led by Richard Cable, BBC Lab UK.

The second day will be made up of peer-reviewed papers in themed sessions covering Disciplinary and Institutional Challenges, Practitioner Experience, Metadata, Software Preservation & Managing Risk.

Places are limited so please register now.

Registration to close on 20 November 2009

The digital media collection +100 years

As part of the JISC ITT Workshops & Seminars: Achievements & Challenges in Digitisation & e-Content strand, JISC Digital Media has hosted two free seminars focussing on key topics for individuals involved with digital media. Today I attended the second of these, entitled The digital media collection +100 years.

Obsolescence, deterioration of physical storage media or withdrawal of institutional support: just what will prove to be the greatest threat to the materials we digitise today? This seminar projects one hundred years into the future and attempts to predict the future ‘preservability’ of what we digitise today. This seminar will examine changing user demands and inevitable developments in technology.

Panel Session

After a brief opening from Dave Kilbey of JISC Digital Media the scene setting introduction was given by Dr William Kilbride, Executive Director of the Digital Preservation Coalition.

The Preservation Landscape

As well as the more conventional look at the key issues (the volumes of data available, the complexities and complicated requirements of this data teamed with rising public expectations), William gave a really interesting talk on the path of literacy. He demonstrated through the Stroop interference test how, once we can read and write, we tend to process textual information quicker than image information. The result is that literate cultures tend to be hegemonic through discursive power. His point was that the consequences of our work are not inevitable or neutral: digitisation is a social practice that can be used for good and for ill. After this slight aside William ran us through some of the main challenges, which include obsolescence of technologies, correct configuration of hardware, software and operators, and the need for a constantly managed service. He ended with a few ‘answers’ from a survey of recent JISC digitisation projects. When asked how long their resources were to be available, answers varied from “perpetuity” to “forever or three years”. He concluded that digital preservation is possible, but our legacy will be what we make of it and cannot be taken for granted.

The Camera Raw format and preservation

Nigel Goldsmith, a photographer working for JISC Digital Media, gave a quick run through of the possibilities of using the camera raw format. Raw offers the photographer greater control over the processing of their images; however, this flexibility comes at a price. Raw is a proprietary format which requires specialist applications to view. Nigel’s suggestion was to archive raw but to keep it alongside another format, possibly TIFF or JPEG 2000.

Preservation Metadata Initiatives and Standards

After coffee Getaneh Alemu from the Humanities Computing Department at the University of Portsmouth gave us a whirlwind tour of state-of-the-art metadata standards and how metadata can help ensure the integrity, identity and authenticity of digital documents. His overview included a look at the OAIS, NLA PANDORA, CEDARS, NEDLIB, LMER, PREMIS and METS metadata initiatives and standards. He concluded that at the moment preservation metadata formats tend to have element naming issues that descriptive metadata initiatives don’t tend to have.

The challenges of archiving computer games and other multipart digital interactives

After lunch Tom Woolley from the National Media Museum talked about some of the digital media preservation issues they are tackling on-site at the museum. The museum is involved in a number of initiatives that aim to let visitors ‘have a go’ at old games and old internet environments. The tricky dilemma is giving users a taster of old games in a cost-effective way; actually using original kit (like ZX Spectrums) would have a heavy cost attached. The key is often emulation. The museum also tries to capture the context of games by capturing fan information, discussion forums, FAQs etc. Tom was followed by James Newman from Bath Spa University, who works with Tom on the National Video Game Archive.

James talked about one of the biggest challenges of video game archiving: supersession. Within the gaming world there is a tendency to be always looking for the ‘next big game’ which has resulted in an environment where games creators don’t value old games. Although there is a niche market for retro games, gaming is an area where the experience is almost completely associated with the technology, making archiving very difficult.

The importance of collaboration

Simon Tanner, director of King’s Digital Consultancy Services, focused on institutional preservation and the importance of collaboration in sustainability. He started off by saying that one of the biggest challenges is that we may run out of the minerals to make microchips. He later returned to the environmental theme by saying that he currently saw digital preservation as sitting in the same space as climate change: people view it as potentially a terrible thing (the loss of digital objects) but currently it does not impact on individuals, so it remains low on the priority list. Simon pointed out that sustainability of resources was becoming a mandate, but remains an unfunded mandate. The way to deal with this was through the ecology of collaboration – within your institution and outside it.

A Poisoned Chalice? Accepting Responsibility for Sustainable Access

Neil Grindley

The day concluded with a talk from Neil Grindley, JISC Programme Manager for Digital Preservation. Neil pointed out that ensuring that an organisation’s digital assets are safe, secure and accessible for the long term should (in theory) be an interesting, responsible and useful role for anyone in an organisation to accept. The critical importance of digital assets, the ubiquity of digital methods and the need for people in all walks of life to have effective means to refer to persistent sources of data reinforce this notion. How is it then that long-term asset management, information lifecycle management, data curation, digital preservation (call it what you will) is often regarded as a peripheral specialist activity that is difficult to resource, complex to carry out, and delivers benefits that are, at best, simply an insurance policy rather than an activity that adds value to an organisation? Neil’s presentation examined the importance of defining clear roles for those involved with digital preservation and considered the importance of associating this professional activity with strategic and tactical frameworks. He advocated the allocation of responsibility and internal preservation policies. JISC has spent £6 million in the digital preservation arena between 2005 and 2009, yet there is still work to be done. He concluded by pointing out the need for human judgement when deciding what to keep, and predicted that in the future digital preservation will be integrated with administration departments, have better tools and will borrow more terms from the cultural heritage area.

After Neil’s talk there was a panel session and time for questions, unfortunately I had to leave to make the difficult drive home through rush hour traffic!

The day was an interesting one, although the talks were a real mixed bag they all offered constructive steps forward to make today’s digital media collection something that we may be able to access and use 100 years on.

Survey: How successful has Records Management been?

As part of his dissertation at Aberystwyth University Andrew Brown is undertaking a research project which aims to determine how successful Records Management has been in the UK by asking Records Managers for their perceptions of Records Management in their organisation and the profession as a whole. He is attempting to quantify this ‘success’ and would be very grateful if records managers could take the time to complete the survey, which will take approximately 10-15 minutes.

It is hoped that this study will generate some stimulating debate on this matter and lead to a greater understanding of the current and future state of the Records Management profession in the UK where digital and Web preservation may be key.

Please access the survey at the following link.

The survey closes at midnight on 5th September.

iPres 2009 Programme

The programme for the sixth International Conference on Preservation of Digital Objects (iPres 2009) has recently been released and registration is now open.

This year’s event will be hosted by California Digital Library (CDL) at Mission Bay Conference Center in San Francisco on October 5th and 6th, 2009.

UK presentations include Maureen Pennock on ArchivePress, David Giaretta on significant properties in OAIS and Adam Farquar on (Planets) metadata.

What’s the average lifespan of a Web page?

…or is it easier to ask how long is a piece of string?

The statistic much bandied about (for Web pages, not pieces of string!) is 44 days, believed to originate in an article by Brewster Kahle (of Internet Archive fame) published in 1997 and titled Preserving the Internet. Brewster’s original quote is specifically about URLs: “…estimates put the average lifetime for a URL at 44 days.”

Whether this figure still stands today is a matter currently being discussed on the CURATORS@LIST.NETPRESERVE.ORG list after a query from Abigail Grotke of the Library of Congress.

Abbie offered up the 44 day statistic and pointed out that on the Digital Preservation Web site they have a graphic that discusses Web volatility, stating “44% of the sites available on the internet in 1998 had vanished one year later”.

The other figure often cited is 75 days, from Michael Day’s report Collecting and preserving the world wide web:

The dynamic nature of the Web means that pages and whole sites are continually evolving, meaning that pages are frequently changed or deleted. Alexa Internet once estimated that Web pages disappear after an average time of 75 days. (Lawrence et al., 2001, p. 30).

Another figure sometimes suggested is 100 days; this seems to come from Rick Weiss’s article for The Washington Post, Washington, DC, 24 November 2003, On the Web, Research Work Proves Ephemeral – no longer available.

So what is the average lifespan of a Web page today? Is it getting shorter or longer? The Internet Archive now gives 44–75 days as its ballpark figure. I’d have to hazard a guess that with the rise in use of Web 2.0 technologies the Web is actually getting more transient by the day.

Is this OK?

Maybe, if it’s just a tweet you sent to a friend; however, if it’s something more substantial that’s disappearing then it’s a real worry.
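For what it’s worth, lifespan figures like those above are averages over crawl observations. A minimal sketch of how such an average might be computed from first-seen and last-seen dates follows; all URLs and dates are invented for illustration.

```python
from datetime import date

# Hypothetical crawl data: for each URL, the dates it was first and
# last observed alive. These entries are invented for illustration.
observations = {
    "http://example.org/a": (date(1997, 1, 1), date(1997, 2, 14)),
    "http://example.org/b": (date(1997, 3, 1), date(1997, 5, 15)),
    "http://example.org/c": (date(1997, 6, 1), date(1997, 7, 15)),
}

def average_lifespan_days(obs):
    """Mean number of days between first and last successful fetch."""
    spans = [(last - first).days for first, last in obs.values()]
    return sum(spans) / len(spans)

# With the sample data, the result falls in the oft-quoted 44-75 day range.
print(average_lifespan_days(observations))
```

Real studies are of course more careful (crawl intervals bound what “last seen” can mean), but the basic arithmetic is no more than this.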

Releasing the Herds of Cows – The Missing links Workshop

Yesterday many of those most committed to, and interested in, Web preservation attended the Missing links: the enduring web conference held at the British Library. The event, which had over 100 attendees (from academia, galleries, museums, libraries, archives, government, research groups and beyond), was an attempt to ‘develop and strengthen the links between content creators, tools developers, preservation services and users in order to secure an enduring web’.

If anyone can successfully preserve the Web it was this lot!

Session 1: Setting the scene

Keynote: Adrian Brown, Assistant Clerk of the Records, Parliamentary Archives

The opening keynote was given by Adrian Brown from the Parliamentary Archives, who did an adept job of setting the scene and outlining the challenges ahead. His talk centred around the key jigsaw pieces of selection, capture, storage, access and preservation. He pointed out that the Web used to be primarily a publisher of fixed content, and it is the new types of content (discussions on wikis, Twitter, blogs etc.) that offer the biggest challenge. Parliament themselves are using many new forms of information, and although permission is not always an issue, capturing content on other sites often is: you may own the content but someone else often owns the presentation of that content. Adrian also emphasised the need for us to co-ordinate selection and be as explicit about what we are not going to capture as about what we are going to capture (the unpublishing button on Facebook being a classic example of how difficult it can be to go back!). Another major challenge is that of temporal cohesion – the rate of capture of content is often far slower than the rate of change of content.

He concluded with a Peter Murray-Rust quote that stresses the need for us to “Create and release herds of cows, not preserve hamburgers in a deepfreeze.”

Web Archive and Citation Repository in One: DACHS: Hanno Lecher, Librarian, Sinological Library, Leiden

Hanno Lecher talked about the citation problems he has encountered while working on the Digital Archive for Chinese Studies, a resource that captures and archives relevant resources as primary sources for later research. His work has indicated the possible need for citation repositories. Some options might include verifying URL references, evaluating the reliability of online resources, use of PURLs, avoidance of obscure URLs and even the extreme step of banning online resources as references. As such approaches are not always practical (or appropriate) he suggested the use of applications like Snagit, Zotero or WebCite.

The future of researching the past of the Internet: Eric T. Meyer, Research Fellow, Oxford Internet Institute, Oxford University

Eric Meyer works primarily on social behaviour and the Internet, and as part of the Transatlantic Digitisation Project spent some time building an archive collection of World War 1 and 2 Internet resources. The process gave the team insight into the feasibility of creating such an archive. Lessons were also learnt on the need for involvement of domain experts (webometrics, SNA, eResearch, etc.) and the desire to move from snapshots to more continuous data. Eric concluded by stressing that what researchers really want is a global way to access archives, not a national one.

After the morning coffee break the workshop moved from the general to the specific.

Session 2: Creation, capture & collection

An overview of Web Archiving Tools by Helen Hockx-Yu, Web Archiving Programme Manager, The British Library

Helen Hockx-Yu provided a very practical overview of current tools, covering Heritrix, NetArchiveSuite, Web Curator Tool and PANDAS. She also discussed archival formats such as ARC and WARC, which are highly desirable from a long-term archival standpoint. Helen concluded with a brief discussion of the limitations and challenges harvesters present, from issues with rendering and dealing with ‘bad’ content to reliance on open source tools that are still very much evolving.
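As a rough illustration of the WARC container format mentioned in the talk, the sketch below assembles one minimal record by hand. It is a simplified sketch of the WARC 1.0 layout (header lines, a blank line, then the payload), not a substitute for the writers used by real harvesters such as Heritrix; only a few of the headers are shown, and the URL and payload are invented.

```python
from datetime import datetime, timezone
import uuid

def warc_resource_record(url: str, payload: bytes) -> bytes:
    """Build a single minimal WARC 'resource' record by hand.

    Illustrative only: real WARC writers handle many more header
    fields, record types and digests than this sketch does.
    """
    headers = [
        b"WARC/1.0",
        b"WARC-Type: resource",
        b"WARC-Target-URI: " + url.encode(),
        b"WARC-Date: "
        + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ").encode(),
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode() + b">",
        b"Content-Length: " + str(len(payload)).encode(),
    ]
    # Header block, blank line, payload, then the record separator.
    return b"\r\n".join(headers) + b"\r\n\r\n" + payload + b"\r\n\r\n"

record = warc_resource_record("http://example.org/", b"<html>hello</html>")
print(record.decode())
```

Many such records concatenated into one file make up a WARC archive, which is part of what makes the format attractive for long-term storage.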

Context and content: Delivering Coordinated UK Web Archive to User Communities: Cathy Smith, Collections Strategy Manager, The National Archives

Cathy Smith and the UKWAC team have recently been working on a study entitled Delivering coordinated UK web archives, looking at what audiences web archives should anticipate, what this means for selection, ingest and preservation, and what researchers want from archives. The conclusions include the notion of a single view onto an archive, though this does not necessarily mean a single repository. Cathy explained that users “just want the car to work – they don’t want to see what’s under the bonnet”. In response to the study the National Archives have recently begun automated whole-domain harvesting – shallow as well as selective harvesting – and they are looking at allowing all archives to be searched together as a collection.

Capture and Continuity: Broken links and the UK Central Government Web Presence: Amanda Spencer and Tom Storrar from the National Archives.

Amanda Spencer and Tom Storrar were set the task of sorting out the problem of broken links on government Web sites. 60% of URLs quoted in Hansard between 1997 and 2006 are broken, leaving the public unable to find information and with little faith in the authority of government Web sites. The Web Continuity project has begun to create an archive of all Central Government Web space. Some of the things they are trying out include XML sitemaps, using redirects instead of 404 messages and capturing all government Web sites before they are closed.
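The ‘redirects instead of 404 messages’ idea can be sketched in a few lines: when a requested page is no longer live, the server answers with a redirect into the web archive rather than an error page. The archive URL pattern and the list of live pages below are hypothetical, invented purely for illustration, not the National Archives’ real scheme.

```python
# Hypothetical archive URL prefix -- a stand-in for whatever scheme a
# real web-continuity service would use.
ARCHIVE_PREFIX = "http://webarchive.example.gov.uk/20100101/"

# Paths this (imaginary) site still serves directly.
live_pages = {"/home", "/contact"}

def resolve(path: str):
    """Return an (HTTP status, location) pair for a requested path.

    Live pages are served normally; anything else gets a redirect into
    the archive instead of a 404 response.
    """
    if path in live_pages:
        return 200, path
    return 301, ARCHIVE_PREFIX + path.lstrip("/")

print(resolve("/home"))        # served normally
print(resolve("/old-report"))  # redirected to the archived copy
```

The visitor following an old Hansard link would then land on an archived snapshot rather than an error page.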

Discussion Session

In the questions and discussion session the panel were asked what advice they would give to Web managers on how they can allow archiving of their sites to be more effective. The suggestions tackled content (following accessibility guidelines and W3C standards, making code valid, adding hard links to files next to streaming content, having transcripts of audio files) and communication (enabling a dialogue between Web archivists and Web creation people). It was noted that there is a fine line between encouraging good practice and stifling innovation, and that at times communication with content creators is a luxury. Responsible harvesting also means explaining why you are doing it to those involved.

The discussion then moved to other possible points of capture such as using Google Cache or asking ISP providers.

Session 3: Issues and approaches to long term preservation of web archives

After lunch the programme moved on from the specific problems of now to problems of the future.

Diamonds in the Rough: Capturing and Preserving Online Content from Blogs: Richard Davis, Project Manager, University of London Computing Centre (ULCC)

Richard Davis, a member of the JISC PoWR team, started off by emphasising the importance of blogs today. This has led him and colleague Maureen Pennock to consider the need for their preservation, particularly at institutional level, resulting in the ArchivePress project. Richard pointed out that traditional Web archiving tools can be difficult to configure and use: a ‘hammer to crack a nut’. A comment by Chris Rusbridge that “blogs represent an area where the content is primary and design secondary” suggested that using blog feeds might be key: the feed could be a point at which to gather blog content for archiving. The test bed for ArchivePress will include blogs from the DCC, Lincoln University and UKOLN. In the future ArchivePress could possibly be used to harvest Twitter.
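The feed-as-capture-point idea can be illustrated with a short sketch: rather than crawling a blog's rendered pages, read its feed and keep the structured post content. This is not ArchivePress's actual code (which was built on WordPress), just a minimal demonstration of the approach using the standard library.

```python
# Sketch of feed-based blog capture: parse an RSS 2.0 feed and pull
# out the structured post content, skipping page design entirely.
import xml.etree.ElementTree as ET

def posts_from_rss(feed_xml):
    """Extract (title, link, description) tuples from an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title"), item.findtext("link"),
         item.findtext("description"))
        for item in root.iter("item")
    ]

sample = """<rss version="2.0"><channel><title>Demo blog</title>
<item><title>First post</title><link>http://example.org/1</link>
<description>Hello world</description></item>
</channel></rss>"""

for title, link, body in posts_from_rss(sample):
    print(title, link)  # → First post http://example.org/1
```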

Beyond Harvest: Long Term Preservation of the UK Web Archive: Maureen Pennock, Web Archive Preservation Project Manager, The British Library

Maureen Pennock started her talk by giving some general background to the UK Web Archive. Now the archive has been established, the next task is preserving it, an ongoing activity. The main undertakings are documenting system dependencies, considering container and metadata standards (following a review, WARC is the preferred format for preservation, used alongside a combination of features from METS and PREMIS), preservation workflow, and defining the preservation strategy. The future involves many new questions, for example: should we preserve computer viruses?

From Web Page to Living Web Archive: Thomas Risse, Senior researcher, L3S Research Center

Thomas Risse introduced the LiWA (Living Web Archives) project, an EU FP7 research project. LiWA has been looking at building tools to address the specific challenges of the current generation Web, e.g. link extraction from dynamic pages.

Emulating access to the web 1.0: Jeffrey van der Hoeven, Koninklijke Bibliotheek, The Netherlands

Jeffrey van der Hoeven started off with a whirlwind history of the Web browser: from the WorldWideWeb browser in 1991, which was more like a document viewer, through Mosaic and Netscape, to today’s Firefox and Google Chrome. Today’s browsers do many things, from rendering Web sites and bookmarking to caching, authentication and providing RSS feeds. They have changed from viewers into all-round workplaces – Google Wave is an extreme example of this. The browser is currently the central application of the computer, and there is a need for a sustainable solution for rendering Web sites. At the Koninklijke Bibliotheek they have recently been working on the Dioscuri project, a durable emulator. They have also been working with the Planets project using Global Remote Access to Emulation (GRATE) and on the Keep Emulation Environments Portable (KEEP) project.

After another coffee break we moved on to the final discussions.

Session 4: Discussion and next steps

What we want with web-archives; will we win? Kevin Ashley, Head of Digital Archives, University of London Computing Centre (ULCC)

Kevin Ashley, also part of the JISC PoWR team, kicked off his talk by introducing us to the future of the Web, Web 8.0. Although Kevin was joking, his point stood: when the next major change in the Web happens we will need a much better metaphor than ‘Web 3.0’, and it is the future that we need to think about. We need to be asking the key question of what we want from Web archives, e.g. potential use cases. Preserving the Web is not just about content. Sometimes it will be useful to look at Web content in aggregate, e.g. for visualisation or linguistic analysis. Right now many academics are doing their own Web archiving – of content, presentation, or both. However, we should not take just a document-centred view of data but also consider the web of data and data about the web. A lot of future interest will lie not in Web content itself, but in the context of its use.

The day concluded with a round table discussion with many speakers from the day. Among the topics discussed was the permissions challenge (primarily with regard to UKWAC and similar initiatives); other countries and initiatives favour a more ‘just do it’ approach.

The day left most people enthused and positive about the future for Web archiving.

A conference report will follow on the DPC Web site and will be sent out via the dpc-discussion and digital-preservation JISCmail lists.

All presentations are now available in PDF Format from the DPC Web site.

WAC09 was used as the Twitter tag for the event. You can see the tweets at http://hashtags.org/tag/wac09/messages.

The event was sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland).

Posters for Missing Links

Today was the Missing Links: The enduring Web conference held at the British Library Conference Centre, London. A trip report on the event will follow.

The JISC PoWR team presented two posters, one provided a general overview of the project and the other discussed Preservation Policies and Approaches for Use of Social Web Services.

The content of the second poster has been mentioned in a number of previous blog posts:

Although the poster was submitted to the British Library it was unfortunately mislaid, so here is the actual content.

Preservation Policies and Approaches for Use of Social Web Services

Click on the image for a larger version.

Preserving Tweets: The CILIP2 Experience

At the  CILIP 2.0 open meeting in London (29th April 2009) delegates and remote participants were encouraged to tweet, using the #cilip2 tag, on issues relating to the role of Web 2.0 for CILIP members. These tweets were displayed on a screen using the Twitterfall client. Twitterfall is a way of viewing the latest ‘tweets’ of upcoming trends and custom searches on Twitter. Updates fall from the top of the page in near-realtime. The client is also a great way for remote participants to keep up-to-date with event posts.

Use of the tag was very popular and a number of blog posts have been written about its success including CILIP: More Popular Than Swine Flu!, Twittering Librarians, Twitter – better than a conference,  and CILIP 2.0 – Open Session.

Why?

Some might argue that the significant use of the tag could turn out to be a defining moment in CILIP’s movement towards engagement with Web 2.0. Those with an eye for preservation might argue that if this is the case we will want to preserve the tweets that embody this ‘crucial moment’.

How?

Luckily help was on hand in the form of Dave Pattern, systems manager at the University of Huddersfield Library. After the event Dave provided a good example of rapid software development (or in his words, “a quick and dirty hack”!) when he quickly wrote a piece of software to harvest the tweets and make a transcript of them available. Dave used the Twitter search API to fetch the results (in JSON format) of the #cilip2 tag every 60 seconds. He then had to dedupe the results against the previous searches. Once he’d got all the tweets he wrote a quick Perl script to process them and generate some HTML.
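The core of that approach can be sketched as follows (Dave's original was Perl; this is a Python illustration, not his code). The batches below are hard-coded stand-ins for the JSON that the now-retired search.twitter.com API returned.

```python
# Sketch of the harvest-and-transcript approach: poll the search API
# for a tag, de-duplicate each batch against earlier fetches by tweet
# id, then render the accumulated tweets as an HTML transcript.

def merge_batch(seen, batch):
    """Keep only tweets whose id has not appeared in earlier searches."""
    fresh = [t for t in batch if t["id"] not in seen]
    seen.update(t["id"] for t in fresh)
    return fresh

def to_html(tweets):
    """Render a simple HTML transcript annotated with posting times."""
    rows = "\n".join(
        f"<li><b>{t['from_user']}</b> ({t['created_at']}): {t['text']}</li>"
        for t in tweets
    )
    return f"<ul>\n{rows}\n</ul>"

seen, transcript = set(), []
batch1 = [{"id": 1, "from_user": "a", "created_at": "12:00",
           "text": "hello #cilip2"}]
batch2 = [{"id": 1, "from_user": "a", "created_at": "12:00",
           "text": "hello #cilip2"},
          {"id": 2, "from_user": "b", "created_at": "12:01",
           "text": "hi all #cilip2"}]
for batch in (batch1, batch2):   # in practice, one fetch every 60 seconds
    transcript.extend(merge_batch(seen, batch))
print(to_html(transcript))
```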

He also provided a Wordle visualisation of the content of the tweets.

CILIP Tweets

This now means that the CILIP community continues to have access to:

  • A record of the discussions, annotated with the time of posting.
  • A Wordle cloud which provides a visual summary of the topics.

Software such as this could potentially be used by others to preserve tweets. For those interested Dave is happy to share the code.

It does, however, raise a number of questions: Should you keep a record of Twitter posts? If so, how would you go about doing it? Where should the responsibility lie? Should it be something event organisers need to consider? Is there a role for commercial companies to play, or will this type of preservation just fall at the feet of interested participants (those keen on mashup creation)?

Is there scope for some sort of tool to be developed in this area? Should funding councils in specific areas be funding the creation of a more managed service? Should the tools be more generalised?

Some suggestions are already provided in the Handbook and on the JISC PoWR blog:

This blog post provides the background for one of two case studies which will be displayed on a poster on Preservation Policies and Approaches for Use of Social Web Services to be presented at the Missing links: the enduring web workshop to be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.

Missing links: the enduring web

The JISC PoWR team will be involved in the forthcoming Workshop on missing links: the enduring web.  The workshop is sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and organised by the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland). It will be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.

Richard Davis, ULCC, will be giving a presentation on Diamonds in the Rough: Capturing and Preserving Online Content from Blogs. Other members of the team will be presenting posters on the JISC-PoWR Project and on Preservation Policies and Approaches for Use of Web 2.0 Services.

In the next few posts we’ll describe in more detail what we’ll be covering. Online registration is still open and closes on Friday 10th July 2009. We hope to see you there…

Archiving the US Election 2004 Web sites

The Library of Congress has recently made its US Election 2004 Web Archive available from the Library of Congress Web Archives site. The Election 2004 Web Archive is a selective collection of approximately 2,000 Web sites associated with the United States Presidential, Congressional, and gubernatorial elections. It is part of a continuing effort by the Library’s Web Archiving Project Minerva to evaluate, select, collect, catalogue, provide access to, and preserve digital materials for future generations of researchers.

The archived material includes blogs (such as blogs for Bush). Currently permission is necessary for offsite access for researchers. All archived Web sites are available to researchers onsite at the Library of Congress.

Metadata

At the Library of Congress they are currently providing metadata for individual Web sites through brief records using the MODS schema. There is a MARC collection-level record for each collection, with a link to an entry/overview page that leads to search and browse functions, and MODS metadata for each individual Web site that was collected.
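To make the two-level approach concrete, a brief MODS record for an archived site might look like the one generated below. The field choices are an illustrative minimum only; the Library of Congress's actual Web-archive profiles are considerably richer.

```python
# Build an illustrative, minimal MODS record for an archived Web site
# using the standard library. The fields shown (titleInfo/title and
# location/url) are a sketch, not the Library of Congress's profile.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def brief_mods_record(title, url):
    ET.register_namespace("mods", MODS_NS)
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    location = ET.SubElement(mods, f"{{{MODS_NS}}}location")
    ET.SubElement(location, f"{{{MODS_NS}}}url").text = url
    return ET.tostring(mods, encoding="unicode")

record = brief_mods_record("Blogs for Bush", "http://www.blogsforbush.com/")
print(record)
```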

An overview of their metadata approach (at the collection and item levels) is available. They are also in the process of developing more formal descriptive metadata profiles for their digital content and have developed one for the Library of Congress Web archives.

For a list of publicly available Library of Congress Web archives and access to each, see the Library of Congress Web Archives site.

More information on activities at the Library of Congress is given in a PowerPoint presentation delivered at the Digital Library Federation 2008 Fall Forum.