JISC-PoWR

Preservation of Web Resources: a JISC-sponsored project

Findings available from the KRDS2 Survey

Posted by Marieke Guy on February 3rd, 2010

The findings from the Keeping Research Data Safe 2 (KRDS2) survey of digital preservation cost information are now available on the KRDS2 project Web page.

KRDS2

The Keeping Research Data Safe 2 project commenced on 31 March 2009 and will complete in December 2009. The project will identify and analyse sources of long-lived data and develop longitudinal data on associated preservation costs and benefits. It is believed that these outcomes will be critical to developing preservation costing tools and cost benefit analyses for justifying and sustaining major investments in repositories and data curation.

The Survey

The survey was carried out between September and November 2009 to identify key research data collections with information on preservation costs and related issues. Thirteen survey responses were received: 11 of these were from UK-based collections, and 2 were from mainland Europe. The responses covered a broad range of research areas, including the arts and humanities, social sciences, and physical and biological sciences, as well as research data archives or cultural heritage collections.

The survey questionnaire sought to identify cost information available for the main KRDS2 activities in the Pre-Archive and Archive phases. Availability of cost information is very high for some activities (archival storage cost information is available in 100% of the responses). Other, more infrequent activities such as disposal (and perhaps also preservation planning) are less well represented. Knowledge of acquisition costs is also relatively low (46%).

Further information is available from the KRDS2 project Web page.

Posted in Preservation | No Comments »

Web archiving in the wider world

Posted by Kevin Ashley on January 12th, 2010

When a topic is being discussed in the correspondence pages of national newspapers, it’s a sign that it’s no longer the concern of a few specialists. That’s certainly been true of web archiving for some time, as a recent example shows. Malcolm Birdling wrote a letter published in the Guardian on January 1, 2010 bemoaning the fact that some government agencies - in particular the UK Border Agency - actively prevent sites such as the Internet Archive from capturing their contents. This has important consequences for citizens, particularly when such sites are used to publish regulations and guidance which change frequently. (I have anecdotal evidence that the UK Inland Revenue lost an appeal brought by a taxpayer over a very similar issue.)
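The letter does not say how agencies keep crawlers out. One common mechanism is a robots.txt exclusion, and the sketch below assumes that is what is meant. It uses Python's standard `urllib.robotparser` to test whether a given crawler may fetch a page. The robots.txt content, the `agency.example` host and the helper name `may_capture` are hypothetical; `ia_archiver` is the user-agent name historically associated with Internet Archive captures, used here as an assumption rather than a statement about any particular agency's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shuts out one archiving crawler, allows everyone else.
EXAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""


def may_capture(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://agency.example/guidance/current-rules"  # hypothetical URL
    for agent in ("ia_archiver", "SomeOtherBot"):
        verdict = "allowed" if may_capture(EXAMPLE_ROBOTS, agent, page) else "blocked"
        print(agent, verdict)
```

Running a check like this against a site's real robots.txt is a quick way to see whether an absence from a public web archive is deliberate or accidental.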

Mr Birdling’s letter brought a rapid response from David Thomas of the UK National Archives, who was keen to reassure readers that central government websites were being archived, even without the legislation which prompted Mr Birdling’s original letter. (That story refers to the changes to Legal Deposit regulations which would permit the British Library and other UK copyright libraries to capture UK content without the permission of rights owners.)

But earlier examples of non-specialist concern with preserving web content exist. One of my favourite examples comes from the Usenet group uk.railway, whose contributors include a fair number of rail enthusiasts (“trainspotters” if you’re feeling unkind). Privatisation of the UK railway network means that we have a plethora of train operating companies, or TOCs, each of which operates its own web site, much as the great companies of old such as LNER might have done if the web had existed then. The difference is that now these companies come and go every few years when the government puts operating contracts out for re-tender. Railway ephemera such as promotional leaflets and timetables are a key part of the print collections at places such as the National Railway Museum. “What happens to TOC web sites when franchises change?” wondered one poster to uk.railway back in 2007. The Internet Archive has certainly captured some material, but it isn’t the same as a collection controlled by an institution such as the NRM. I wasn’t able to give a very positive answer to their question. I don’t believe the National Railway Museum are yet able to capture websites as part of their collection, and it’s not clear that any of the members of UKWAC see TOC sites as falling within their collecting policy.

And herein lies a lesson. Rail enthusiasts are incredibly effective at preserving railway heritage, both through their own efforts and through influencing others. They include many people with an enviable range of technical abilities. They secured special legislation to ensure the preservation of railway heritage after privatisation. Not content with simply preserving heritage, some of them set about recreating it by building an entirely new steam locomotive. But their combined efforts have not yet (so far as I know) ensured that past railway web sites have been preserved. If they can’t manage it without institutional help, what hope is there for the rest of us?

Posted in Future, Selection, Policies | No Comments »

Bookings open for 5th International Digital Curation Conference

Posted by Marieke Guy on November 6th, 2009

DCC

5th International Digital Curation Conference

“Moving to Multi-Scale Science: Managing Complexity and Diversity” | 2-4 December 2009

The IDCC is an established annual event reaching out to individuals, organisations and institutions across all disciplines and domains involved in curating data for e-science and e-research.

The DCC will be hosting a workshop programme on 2 December followed by a Pre-Conference Drinks Reception at the Natural History Museum. The main conference will open on 3 December with a keynote speech from Professor Douglas Kell, Chief Executive of the Biotechnology & Biological Sciences Research Council (BBSRC). Other key speakers will include: Professor Ed Seidel, National Science Foundation; Cliff Lynch, Coalition for Networked Information; Timo Hannay, Nature Publishing Group. The first day of the conference will incorporate an interactive afternoon for posters and demos, followed by a Symposium entitled “Citizen Science: Data Challenges” led by Richard Cable, BBC Lab UK.

The second day will be made up of peer-reviewed papers in themed sessions covering Disciplinary and Institutional Challenges, Practitioner Experience, Metadata, Software Preservation & Managing Risk.

Places are limited so please register now.

Registration to close on 20 November 2009

Posted in Digital preservation, Events | No Comments »

The Demise of Geocities - But a Renewed Interest in Web Site Archeology

Posted by Brian Kelly on October 26th, 2009

An article published today on the Guardian Technology Web site entitled “Geocities: dead but not lost” describes how Geocities, which was founded in 1994 and was at one stage the third most-browsed site on the web, is now dead.

We discussed Yahoo’s announcement that the Geocities service was to be shut down some time ago in a post entitled “‘Seething With Anger’ at the Demise of Geocities”. What I find interesting in the article is the information that “… there’s the real effort, by the Archive Team, who have been trying to archive as many Geocities pages and sites as they could”.

I’d not come across the Archive Team wiki before. They describe themselves as a “project composed of volunteers, currently coordinated by Jason Scott” which invites:

  • Writers, who can create clear essays and instructions for archivists and concerned parties.
  • People with Lots of Hosted Disk Space who have a proper hosted webserver and fat pipe, who are willing (when asked) to consider hosting mirrored dead sites or archives.
  • People who love setting up torrents who can do the same as the mirror folks, but do so hosting torrents.
  • OCD-rich individuals who want to download things who will respond to our alerts and call outs and download entire sites or diagnose ways to get at obfuscated data.

The wiki home page informs us that “This website is intended to be an offloading point and information depot for a number of archiving projects, all related to saving websites or data that is in danger of being lost. Besides serving as a hub for team-based pulling down and mirroring of data, this site will provide advice on managing your own data and rescuing it from the brink of destruction.”

Hmm. I wonder how effective a volunteer organisation is likely to be? My initial thoughts were fairly sceptical, but other volunteer-led initiatives, such as Wikipedia, do seem to be successful. What are your thoughts?

Posted in Web 1.0 | 2 Comments »

The digital media collection +100 years

Posted by Marieke Guy on September 16th, 2009

As part of the JISC ITT Workshops & Seminars: Achievements & Challenges in Digitisation & e-Content strand, JISC Digital Media have hosted two free seminars focussing on key topics for individuals involved with digital media. Today I attended the second of these, entitled The digital media collection +100 years.

Obsolescence, deterioration of physical storage media or withdrawal of institutional support: just what will prove to be the greatest threat to the materials we digitise today? This seminar projects one hundred years into the future and attempts to predict the future ‘preservability’ of what we digitise today. This seminar will examine changing user demands and inevitable developments in technology.

Panel Session

After a brief opening from Dave Kilbey of JISC Digital Media the scene setting introduction was given by Dr William Kilbride, Executive Director of the Digital Preservation Coalition.

The Preservation Landscape

As well as the more conventional look at the key issues (the volumes of data available, the complexities and complicated requirements of this data teamed with rising public expectations) William gave a really interesting talk on the path of literacy. He demonstrated through the Stroop interference test how, once we can read and write, we tend to process written information more quickly than image information. The result is that literate cultures tend to be hegemonic through discursive power. His point was that the consequences of our work are not inevitable or neutral: digitisation is a social practice that can be used for good and for ill. After this slight aside William ran us through some of the main challenges, which include obsolescence of technologies, correct configuration of hardware, software and operators, and the need for a constantly managed service. He ended with a few ‘answers’ from a survey of recent JISC digitisation projects. When asked how long their resources were to be available, answers varied from “perpetuity” to “forever or three years”. He concluded that digital preservation is possible but our legacy will be what we make of it and cannot be taken for granted.

The Camera Raw format and preservation

Nigel Goldsmith, a photographer working for JISC Digital Media, gave a quick run through of the possibilities of using Raw camera format. Raw offers the photographer greater control over the processing of their images; however, this flexibility comes at a price. Raw is a proprietary format which requires specialist applications to view. Nigel’s suggestion was to archive Raw but to keep it alongside another format, possibly TIFF or JPEG 2000.

Preservation Metadata Initiatives and Standards

After coffee Getaneh Alemu from the Humanities Computing Department, the University of Portsmouth gave us a whirlwind tour of state-of-the-art metadata standards and how metadata can help ensure the integrity, identity and authenticity of digital documents. His overview included a look at OAIS, NLA PANDORA, CEDARS, NEDLIB, LMER, PREMIS, and METS metadata initiatives and standards. He concluded that at the moment preservation metadata formats tend to have element naming issues that descriptive metadata initiatives don’t tend to have.

The challenges of archiving computer games and other multipart digital interactives

After lunch Tom Woolley from the National Media Museum talked about some of the digital media preservation issues they are tackling on-site at the museum. The museum is involved in a number of initiatives that aim to let visitors ‘have a go’ at old games and old internet environments. The tricky dilemma is giving users a taster of old games in a cost-effective way: actually using original kit (like ZX Spectrums) would have a heavy cost attached. The key is often emulation. The museum also try to capture the context of games by capturing fan information, discussion forums, FAQs etc. Tom was followed by James Newman from Bath Spa University, who works with Tom on the National Video Game Archive.

James talked about one of the biggest challenges of video game archiving: supersession. Within the gaming world there is a tendency to be always looking for the ‘next big game’ which has resulted in an environment where games creators don’t value old games. Although there is a niche market for retro games, gaming is an area where the experience is almost completely associated with the technology, making archiving very difficult.

The importance of collaboration

Simon Tanner, director of King’s Digital Consultancy Services, focused on institutional preservation and the importance of collaboration in sustainability. He started off by saying that one of the biggest challenges is that we may run out of the minerals to make microchips. He later played on the climate issue again by saying that he currently saw digital preservation as sitting in the same space as climate change: people viewed it as potentially a terrible thing (the loss of digital objects) but currently it does not impact on individuals, so it remains low on the priority list. Simon pointed out that sustainability of resources was becoming a mandate but remains an unfunded mandate. The way to deal with this was through the ecology of collaboration - within your institution and outside.

A Poisoned Chalice? Accepting Responsibility for Sustainable Access


The day concluded with a talk from Neil Grindley, JISC Programme Manager for Digital Preservation. Neil pointed out that ensuring that an organisation’s digital assets are safe, secure and accessible for the long term should (in theory) be an interesting, responsible and useful role for anyone in an organisation to accept. The critical importance of digital assets, the ubiquity of digital methods and the need for people in all walks of life to have effective means to refer to persistent sources of data reinforce this notion. How is it then that long-term asset management, information lifecycle management, data curation, digital preservation (call it what you will) is often regarded as a peripheral specialist activity that is difficult to resource, complex to carry out, and delivers benefits that are, at best, simply an insurance policy rather than an activity that adds value to an organisation? Neil’s presentation examined the importance of defining clear roles for those involved with digital preservation and considered the importance of associating this professional activity with strategic and tactical frameworks. He advocated the need for allocation of responsibility and internal preservation policies. JISC has spent 6 million in the digital preservation arena between 2005 and 2009, yet there is still work to be done. He concluded by pointing out the need for human judgement when deciding what to keep and predicted that in the future digital preservation will be integrated with administration departments, have better tools and will take more terms from the cultural heritage area.

After Neil’s talk there was a panel session and time for questions; unfortunately, I had to leave to make the difficult drive home through rush hour traffic!

The day was an interesting one. Although the talks were a real mixed bag, they all offered constructive steps forward to make today’s digital media collection something that we may be able to access and use 100 years on.

Posted in Events | 1 Comment »

Why you can sometimes leave it to the University

Posted by Ed Pinsent on September 8th, 2009

“Does anyone have any positive experiences to share?”, asks Brian in a recent post. Well, I have - except it’s not in the UK. Harvard University Library in the USA have recently put Harvard WAX (the Web Archive Collection Service) live, after a pilot project which began in July 2006.

Harvard WAX includes themed collections on Women’s Voices and Constitutional Revision in Japan, but of particular interest to us in PoWR is their A-Sites collection: the semi-annual captures of selected Harvard websites. “The Harvard University Archives is charged with collecting and preserving the historical records of the University,” state the curators, recognising their formal archival function in this regard. “Much of the information collected for centuries in paper form now resides on University web sites.”

Helen Hockx-Yu of the British Library met with the WAX team in May 2009. “I was impressed with many of the features of the system,” she said, “not just the user and web curator interfaces but also some of the architectural decisions. WAX is a service offered by the Library to all Harvard departments and colleges. In exchange for a fee, the Departments use the system to build their collections. The academics may not be involved with the actual crawling of websites, but spend time QAing and curating the websites, and can to some extent decide how the archive targets appear in the Access Tool. The QAed sites are submitted directly into Harvard’s institutional repository.”

It is very encouraging to read of this participatory dimension to the project, indicating how success depends on the active involvement of the creators of the resources. Already 48 Harvard websites have been put into the collection, representing Departments, Committees, Schools, Libraries, Museums, and educational programmes.

The delivery of the resources has many good features also; there’s an unobtrusive header element which lets the user know they’re looking at an archived instance (instead of the live website). There’s a link explaining why the site was added to the collection, and contextual information about the wider collection. Another useful link allows researchers, scholars and other users to cite the resource; it’s good to see this automated feature integrated directly within the site. The Terms of Use page addresses a lot of current concerns about republishing web resources, and strikes just the right balance between protecting the interests of Harvard and providing a service to its users. Like a good OAIS-compliant repository, they are perfectly clear about who their designated user community are.

Best of all, they provide a working full-text search engine for the entire collection, something that many other web archive collections have been struggling to achieve.

The collection is tightly scoped, and takes account of ongoing developments for born-digital materials: “Collection managers, working in the online environment, must continue to acquire the content that they have always collected physically. With blogs supplanting diaries, e-mail supplanting traditional correspondence, and HTML materials supplanting many forms of print collateral, collection managers have grown increasingly concerned about potential gaps in the documentation of our cultural heritage.” The project has clear ownership (it is supported by the University Library’s central infrastructure), and it built its way up from a pilot project in less than three years. Their success was partially due to having a clear brief from the outset, and partially due to collaboration with three University partners. What Harvard have done chimes in with many of the recommendations and suggestions made in the PoWR Handbook, particularly Chapters 5 (Selection), 16 (Responsibility for preservation of web resources) and 19 (How can you effect change?).

There are many aspects of this project which UK Institutions could observe, and perhaps learn something from. It shows that it is both possible and practical to embed website collection and preservation within an Institution.

Posted in Selection, Policies, Records management, Preservation, Resources | 1 Comment »

Survey: How successful has Records Management been?

Posted by Marieke Guy on September 4th, 2009

As part of his dissertation at Aberystwyth University, Andrew Brown is undertaking a research project which aims to determine how successful Records Management has been in the UK by asking Records Managers for their perceptions of Records Management in their organisation and the profession as a whole. He is attempting to quantify this ‘success’ and would be very grateful if records managers could take the time to complete the survey, which will take approximately 10-15 minutes.

It is hoped that this study will generate some stimulating debate on this matter and lead to a greater understanding of the current and future state of the Records Management profession in the UK where digital and Web preservation may be key.

Please access the survey at the following link.

The survey closes at midnight on 5th September.

Posted in Records management | No Comments »

iPres 2009 Programme

Posted by Marieke Guy on August 24th, 2009

The programme for the sixth International Conference on Preservation of Digital Objects (iPres 2009) has recently been released and registration is now open.

This year’s event will be hosted by California Digital Library (CDL) at Mission Bay Conference Center in San Francisco on October 5th and 6th, 2009.

UK presentations include Maureen Pennock on ArchivePress, David Giaretta on significant properties in OAIS and Adam Farquhar on (Planets) metadata.

Posted in Events | 1 Comment »

“Why you never should leave it to the University”

Posted by Brian Kelly on August 19th, 2009

A blog post from Richard Gatarski begins with the blunt announcement:

A year ago my academic web site disappeared. And those who made it go away probably ignored that such a thing could happen.

The article goes on to describe how last year Richard “found out that the School of Business had redesigned their web site. And in the process they just ignored my research. About ten years worth of virtually daily updates were gone That included most of the manuscripts for my published work. The same thing happened to lecture notes, powerpoint slides, course documentations, useful links, etc. It had all disappeared from the Web!”.

Richard did have some good news to report: “Courtesy of the Internet Archive you can still find most of my academic stuff on the Web through their Wayback machine.” He did, however, wonder why he had to rely on the Internet Archive (“a 501(c)(3) non-profit that was founded to build an Internet library”) - after all, wouldn’t you expect your institutional library to provide this service?

Richard’s losses of his digital resources have continued - a blog he set up at Stockholm University was deleted after he left the institution - although, again, a copy is archived on the Internet Archive.
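For readers who want to check whether their own pages have been captured in the same way, here is a minimal sketch. It assumes the Internet Archive's public availability lookup at `https://archive.org/wayback/available` and the JSON field names `archived_snapshots`, `closest`, `available` and `url`; these come from my recollection of that service, not from Richard's post, so verify them against the current documentation before relying on them. The sample response and the `example.org` address are made up for illustration, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Internet Archive availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the lookup URL for page_url, optionally near a YYYYMMDD-style timestamp."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived copy's URL from a JSON response, or None if none is listed."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Made-up response in the assumed shape, so the sketch runs without network access.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/2008/http://example.org/staff/home",
            "timestamp": "2008",
            "status": "200",
        }
    }
})

if __name__ == "__main__":
    print(availability_query("example.org/staff/home", "2008"))
    print(closest_snapshot(SAMPLE_RESPONSE))
    print(closest_snapshot('{"archived_snapshots": {}}'))
```

To query the live service, fetch the string returned by `availability_query` with any HTTP client and pass the response body to `closest_snapshot`.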

Richard’s experiences have left him disillusioned with the attitudes towards the digital preservation of scholarly resources. He concludes by recommending that academics take responsibility themselves for preserving their resources:

Meanwhile, for those of you who publish stuff on the Web while working with an organisation, including universities. Try to put your content where you control it. Most likely you will move between work places, temporary assignments, and soforth. If you want your stuff to be preserved, it is your responsability to make sure it is.

But how easy will this be for the typical academic? Richard doubts whether “the issues I bring forward today are heavily discussed among university chancellors, political leaders, educational policy makers, and scientific philosophers.” But surely we need to ensure that this debate takes place. And, in today’s economic climate, that debate needs to include discussions of the costs of digital preservation (disk storage may be cheap but management of content is not).

Richard’s tale is based on his experiences as an academic in Sweden. Is the situation different in the UK, I wonder? Judging by Stuart Smith’s lament that “Mummy I lost my MP3!”, which I summarised in a post on “Disappearing Resources On Institutional Web Sites” in December 2008, it would seem that we have similar experiences in the UK higher education sector. Does anyone have any positive experiences to share?

Posted in Digital preservation | 2 Comments »

What’s the average lifespan of a Web page?

Posted by Marieke Guy on August 12th, 2009

…or is it easier to ask how long is a piece of string?

The statistic much bandied about (for Web pages not pieces of string!) is 44 days, believed to originate in an article by Brewster Kahle (of Internet Archive fame) published in 1997 and titled Preserving the Internet. Brewster’s original quote is specifically about URLs: “…estimates put the average lifetime for a URL at 44 days.”

Whether this figure still stands today is a matter currently being discussed on the CURATORS@LIST.NETPRESERVE.ORG list after a query from Abigail Grotke of the Library of Congress.

Abbie offered up the 44 day statistic and pointed out that on the Digital Preservation Web site they have a graphic that discusses Web volatility stating “44% of the sites available on the internet in 1998 had vanished one year later”.

The other figure often cited is 75 days, from Michael Day’s report Collecting and preserving the world wide web.

The dynamic nature of the Web means that pages and whole sites are continually evolving, meaning that pages are frequently changed or deleted. Alexa Internet once estimated that Web pages disappear after an average time of 75 days. (Lawrence, et al.,2001, p. 30).

Another figure sometimes suggested is 100 days; this seems to come from Rick Weiss’s article for The Washington Post, Washington, DC, 24 November 2003, On the Web, Research Work Proves Ephemeral - no longer available.

So what is the average lifespan of a Web page today? Is it getting shorter or longer? The Internet Archive now gives 44-75 days as its ball park figure. I’d have to hazard a guess that with the rise in use of Web 2.0 technologies the Web is actually getting more transient by the day.

Is this OK?

Maybe if it’s just a tweet you sent your friend; however, if it’s something more substantial that’s disappearing then it’s a real worry.

Posted in Web 1.0, Digital preservation, Web 2.0 | 3 Comments »