Category Archives: Digital preservation

Blue Ribbon Task Force Publishes Sustainable Economics for a Digital Planet

Universities grappling with complex decisions on which of their burgeoning digital resources they should preserve – and the inherent financial, technical and legal issues that surround such work – may welcome a report that offers a “supply-and-demand” perspective on how individuals and institutions might manage their digital collections.

The Blue Ribbon Task Force on Sustainable Digital Preservation and Access (BRTF-SDPA), a new international initiative funded by JISC and other organisations, has recently released its report entitled Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information. The report examines these complicated and diverse issues from an economic standpoint. It identifies the problems intrinsic to all preserved digital materials, and proposes domain-specific actions that address the challenges to sustainability. The report focuses its inquiry on materials of long-term public interest in content domains with diverse preservation profiles, namely scholarly discourse, commercially owned cultural content and collectively produced Web content.

JISC is organising a free one-day symposium in London on 6 May 2010 where the Blue Ribbon Task Force will present its final report and invite responses from the BBC, the Natural History Museum, the British Library, the European Bioinformatics Institute and the European Commission. Further information is available.

Official Launch of the UK Web Archive

The British Library has officially launched the UK Web Archive, offering access in perpetuity to thousands of UK websites for generations of researchers.

The site was unveiled earlier this week by the Minister for Culture and Tourism, the Rt Hon Margaret Hodge MBE MP, and the Chief Executive of the British Library, Dame Lynne Brindley. The project demonstrates the importance and value of the nation’s digital memory.

Websites in the UK Web Archive include:

  • The Credit Crunch – initiated in July 2008, this collection contains records of high-street victims of the recession – including Woolworths and Zavvi.
  • Antony Gormley’s ‘One & Other’ Trafalgar Square Fourth Plinth Project – involving 2,400 participants and streamed live by Sky Arts over the web to an audience of millions, this site will no longer exist online from March 2010.
  • 2010 General Election – work has started to preserve the websites of MPs such as Derek Wyatt, who will be retiring at the next election, creating a permanent record of his time as a Member of Parliament.

This important research resource has been developed in partnership with the National Library of Wales, JISC and the Wellcome Library, as well as technology partners such as IBM.

British Library Chief Executive, Dame Lynne Brindley said:

Since 2004 the British Library has led the UK Web Archive in its mission to archive a record of the major cultural and social issues being discussed online. Throughout the project the Library has worked directly with copyright holders to capture and preserve over 6,000 carefully selected websites, helping to avoid the creation of a ‘digital black hole’ in the nation’s memory.

Limited by the existing legal position, at the current rate it will be feasible to collect just 1% of all free UK websites by 2011. We hope the current DCMS consultation will enact the 2003 Legal Deposit Libraries Act and extend the provision of legal deposit through regulation to cover freely available UK websites, providing regular snapshots of the free UK web domain for the benefit of future research.

Further details are available from the British Library.

Bookings open for 5th International Digital Curation Conference

DCC

5th International Digital Curation Conference

“Moving to Multi-Scale Science: Managing Complexity and Diversity” | 2-4 December 2009

The IDCC is an established annual event reaching out to individuals, organisations and institutions across all disciplines and domains involved in curating data for e-science and e-research.

The DCC will be hosting a workshop programme on 2 December followed by a Pre-Conference Drinks Reception at the Natural History Museum. The main conference will open on 3 December with a keynote speech from Professor Douglas Kell, Chief Executive of the Biotechnology & Biological Sciences Research Council (BBSRC). Other key speakers will include Professor Ed Seidel, National Science Foundation; Cliff Lynch, Coalition for Networked Information; and Timo Hannay, Nature Publishing Group. The first day of the conference will incorporate an interactive afternoon for posters and demos, followed by a Symposium entitled “Citizen Science: Data Challenges” led by Richard Cable, BBC Lab UK.

The second day will be made up of peer-reviewed papers in themed sessions covering Disciplinary and Institutional Challenges, Practitioner Experience, Metadata, Software Preservation & Managing Risk.

Places are limited so please register now.

Registration to close on 20 November 2009

“Why you never should leave it to the University”

A blog post from Richard Gatarski begins with the blunt announcement:

A year ago my academic web site disappeared. And those who made it go away probably ignored that such a thing could happen.

The article goes on to describe how last year Richard “found out that the School of Business had redesigned their web site. And in the process they just ignored my research. About ten years worth of virtually daily updates were gone. That included most of the manuscripts for my published work. The same thing happened to lecture notes, powerpoint slides, course documentations, useful links, etc. It had all disappeared from the Web!”

Richard did have some good news to report: “Courtesy of the Internet Archive you can still find most of my academic stuff on the Web through their Wayback machine.” However, Richard did wonder why he had to rely on the Internet Archive (“a 501(c)(3) non-profit that was founded to build an Internet library”) – after all, wouldn’t you expect your institutional library to provide this service?

Richard’s losses of his digital resources have continued – a blog he set up at Stockholm University was deleted after he left the institution – although, again, a copy is archived on the Internet Archive.

Richard’s experiences have left him disillusioned with the attitudes towards the digital preservation of scholarly resources. He concludes by recommending that academics take responsibility themselves for preserving their resources:

Meanwhile, for those of you who publish stuff on the Web while working with an organisation, including universities. Try to put your content where you control it. Most likely you will move between work places, temporary assignments, and soforth. If you want your stuff to be preserved, it is your responsability to make sure it is.

But how easy will this be for the typical academic? Richard doubts whether “the issues I bring forward today are heavily discussed among university chancellors, political leaders, educational policy makers, and scientific philosophers.”  But surely we need to ensure that this debate takes place. And, in today’s economic climate, that debate needs to include discussions of the costs of digital preservation (disk storage may be cheap but management of content is not).

Richard’s tale is based on his experiences as an academic in Sweden. Is the situation different in the UK, I wonder? Judging by Stuart Smith’s lament that “Mummy I lost my MP3!”, which I summarised in a post on “Disappearing Resources On Institutional Web Sites” in December 2008, it would seem that we have similar experiences in the UK higher education sector. Does anyone have any positive experiences to share?

What’s the average lifespan of a Web page?

…or is it easier to ask how long is a piece of string?

The statistic much bandied about (for Web pages, not pieces of string!) is 44 days, believed to originate in an article by Brewster Kahle (of Internet Archive fame) published in 1997 and titled Preserving the Internet. Brewster’s original quote is specifically about URLs: “…estimates put the average lifetime for a URL at 44 days.”

Whether this figure still stands today is a matter currently being discussed on the CURATORS@LIST.NETPRESERVE.ORG list after a query from Abigail Grotke of the Library of Congress.

Abbie offered up the 44-day statistic and pointed out that on the Digital Preservation Web site they have a graphic that discusses Web volatility, stating “44% of the sites available on the internet in 1998 had vanished one year later”.

The other figure often cited is 75 days, from Michael Day’s report Collecting and preserving the world wide web.

The dynamic nature of the Web means that pages and whole sites are continually evolving, meaning that pages are frequently changed or deleted. Alexa Internet once estimated that Web pages disappear after an average time of 75 days. (Lawrence, et al., 2001, p. 30).

Another figure sometimes suggested is 100 days. This seems to come from Rick Weiss’s article for The Washington Post, Washington, DC, 24 November 2003, On the Web, Research Work Proves Ephemeral, which is no longer available.

So what is the average lifespan of a Web page today? Is it getting shorter or longer? The Internet Archive now gives 44-75 days as its ballpark figure. I’d have to hazard a guess that with the rise in use of Web 2.0 technologies the Web is actually getting more transient by the day.

Is this OK?

Maybe, if it’s just a tweet you sent your friend. However, if it’s something more substantial that’s disappearing then it’s a real worry.

Missing links: the enduring web

The JISC PoWR team will be involved in the forthcoming Workshop on missing links: the enduring web.  The workshop is sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and organised by the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland). It will be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.

Richard Davis, ULCC, will be giving a presentation on Diamonds in the Rough: Capturing and Preserving Online Content from Blogs. Other members of the team will be presenting posters on the JISC-PoWR Project and on Preservation Policies and Approaches for Use of Web 2.0 Services.

In the next few posts we’ll describe in more detail what we’ll be covering. Online registration is still open and closes on Friday 10th July 2009. We hope to see you there…

Archiving the US Election 2004 Web sites

The Library of Congress have recently made their US Election 2004 Web Archive available from the Library of Congress Web Archives site. The Election 2004 Web Archive is a selective collection of approximately 2,000 Web sites associated with the United States Presidential, Congressional, and gubernatorial elections. It is part of a continuing effort by the Library’s Web Archiving Project Minerva to evaluate, select, collect, catalogue, provide access to, and preserve digital materials for future generations of researchers.

The archived material includes blogs (such as blogs for Bush). Currently permission is necessary for offsite access for researchers. All archived Web sites are available to researchers onsite at the Library of Congress.

Metadata

At the Library of Congress they are currently providing metadata for individual Web sites through brief records using the MODS schema. There is a MARC collection level record (for the collection itself) with a link to an entry/overview page for each collection that links to search and browse functions with MODS metadata for each individual Web site that was collected.

An overview of their metadata approach (at the collection and item levels) is available. They are also in the process of developing more formal descriptive metadata profiles for their digital content and have developed one for the Library of Congress Web archives.
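To give a feel for what a brief item-level record might contain, here is a minimal sketch in Python that builds a MODS-style record for one archived site. It is not the Library of Congress's actual profile: the choice of fields, the helper name `brief_mods_record`, and the sample title, URL and date are all assumptions made for illustration. Only the MODS namespace and the element names (`titleInfo`, `originInfo`/`dateCaptured`, `location`/`url`, `relatedItem`) come from the MODS schema itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the MODS version 3 schema.
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Return the namespace-qualified form of a MODS element name."""
    return f"{{{MODS_NS}}}{tag}"


def brief_mods_record(title, url, captured, collection):
    """Build a brief record for one archived Web site (hypothetical field selection)."""
    mods = ET.Element(q("mods"))

    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = title

    # dateCaptured records when the site was harvested.
    origin = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin, q("dateCaptured")).text = captured

    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = url

    # Link the item back to the collection it belongs to.
    host = ET.SubElement(mods, q("relatedItem"), {"type": "host"})
    host_title = ET.SubElement(host, q("titleInfo"))
    ET.SubElement(host_title, q("title")).text = collection

    return mods


# Sample values are invented for illustration only.
record = brief_mods_record(
    title="Example campaign site",
    url="http://www.example.org/",
    captured="2004-11-02",
    collection="Election 2004 Web Archive",
)
print(ET.tostring(record, encoding="unicode"))
```

The collection-level MARC record described above would sit alongside many such item records, one per harvested site.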

For a list of publicly available Library of Congress Web archives and access to each, see the Library of Congress Web Archives site.

More information on activities at the Library of Congress is given in a PowerPoint presentation delivered at the Digital Library Federation 2008 Fall Forum.

Archiving a wiki

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library Life and Life2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out if you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led to me making a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is because UKWAC users, the target audience for the archived copy – or the designated user community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.

This selection aspect led to Maureen Pennock’s reply, which is a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.
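Once the policy decision is made, the implementation can be quite small. As a rough sketch only, the selection decision described above (keep the finished pages, skip edit history, discussion pages and generated pages) could be written as a URL filter in Python. The hostname, the function name `should_capture` and the exact exclusion lists are assumptions for illustration. They rely on the usual MediaWiki URL conventions (`action=history`, `oldid=`, `Talk:` and `Special:` prefixes) and are not the filter syntax of Web Curator Tool.

```python
from urllib.parse import urlparse, parse_qs, unquote

# Assumed exclusion lists, based on common MediaWiki URL conventions.
EXCLUDED_ACTIONS = {"history", "edit", "raw", "delete", "protect"}
EXCLUDED_PARAMS = {"oldid", "diff"}
EXCLUDED_NAMESPACES = ("Special:", "Talk:", "User_talk:")


def should_capture(url):
    """Return True if the URL looks like finished wiki content worth gathering."""
    parts = urlparse(url)
    query = parse_qs(parts.query)

    # Skip edit forms, history listings and similar generated views.
    if query.get("action", ["view"])[0] in EXCLUDED_ACTIONS:
        return False

    # Skip old revisions and diffs between revisions.
    if any(param in query for param in EXCLUDED_PARAMS):
        return False

    # The page title may be in the query string or in the path.
    if "title" in query:
        candidates = [query["title"][0]]
    else:
        candidates = [unquote(segment) for segment in parts.path.split("/")]

    # Skip discussion pages and special (automatically generated) pages.
    return not any(name.startswith(EXCLUDED_NAMESPACES) for name in candidates)


# Hypothetical URLs for illustration.
for url in [
    "http://wiki.example.org/index.php/Main_Page",
    "http://wiki.example.org/index.php?title=Main_Page&action=history",
    "http://wiki.example.org/index.php/Talk:Main_Page",
]:
    print(url, should_capture(url))
```

If the change history later turned out to matter to the designated user community, only the exclusion lists would need to change, which is the sense in which the technology follows the policy.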

LIWA – Living Web Archives

The PoWR project identified a number of technical challenges which made certain types of content – especially that with a Web 2.0 flavour – particularly difficult to manage and preserve in an effective way. My attention has recently been drawn to an EU-funded project which hopes to overcome a number of these technical problems, as well as others that are applicable to large-scale archiving such as the problem of spam content.

LIWA – Living Web Archives – began in early 2008, but as with many EU projects, its startup phase involved a lot of internal activity without much of a public face. As a result we didn’t pick up on its work in the JISC-PoWR handbook, but I’m sure we’ll rectify this omission in any future revisions.

To pick one example of LIWA’s areas of interest, it intends to develop tools which make it easier to take a temporal view of web archives and to maintain temporal consistency. Temporal consistency – or rather its absence – will be familiar to anyone who has spent time exploring sites in the Internet Archive, where different pages, or even portions of the same page (such as images) will have been archived on different days. This can lead to occasional surprises when navigating through archived content, with links taking one to pages that don’t have the expected content.
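To make the idea of temporal consistency concrete, here is a small Python sketch that measures how far the capture dates of a page's embedded resources drift from the capture date of the page itself. It is not LIWA's method, only an illustration. It assumes 14-digit capture timestamps of the form YYYYMMDDhhmmss, as used in Wayback Machine URLs, and the function names, the one-day tolerance and the sample data are all invented.

```python
from datetime import datetime, timedelta


def parse_ts(ts):
    """Parse a 14-digit capture timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def capture_spread(page_ts, resource_ts):
    """Largest gap between the page capture and any embedded resource capture."""
    page = parse_ts(page_ts)
    gaps = (abs(parse_ts(ts) - page) for ts in resource_ts.values())
    return max(gaps, default=timedelta(0))


def inconsistent_resources(page_ts, resource_ts, tolerance=timedelta(days=1)):
    """Names of resources captured more than `tolerance` away from the page."""
    page = parse_ts(page_ts)
    return sorted(
        name
        for name, ts in resource_ts.items()
        if abs(parse_ts(ts) - page) > tolerance
    )


# Invented example: a page captured in October 2000 whose logo was
# only captured the following March.
page = "20001018120000"
resources = {
    "style.css": "20001018120500",
    "logo.gif": "20010302093000",
}
print(capture_spread(page, resources))
print(inconsistent_resources(page, resources))
```

A page whose spread is measured in months is exactly the kind that produces the surprises described above when browsing an archive.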

LIWA’s partners include Hanzo, a UK-based web archive services company that we covered briefly in the handbook; I hope we can explore their potential value to UK HE in the future.

Considerations for the Preservation of Blogs

DigitalPreservationEurope (DPE) fosters collaboration and synergies between many existing national digital preservation initiatives across the European Research Area. As part of their work they publish concise overviews of key digital preservation and curation issues. Earlier this month they published a briefing paper on Considerations for the Preservation of Blogs (PDF).

The preamble sets the context for the paper:

Blogs, it seems, are everywhere these days, but what about the next day (and the next and the next …). Opinions vary on whether or not blogs merit preservation beyond the actions of a blog’s respective authors. This briefing paper does not contribute to that dialogue. Rather, it provides an overview of issues to be considered by organizations planning blog preservation programs. Blogs are the product of a network of players, including blog authors, service providers, and readers. Discussed here are some key attributes of blogs, and the characteristics and behaviors of these players, which may impact preservation activities.

During the JISC PoWR project we recognised that despite blogs initially being commonly characterised as ephemeral (as commented on in the DPE paper) their increasing importance and role in both the research context and in our cultural history is becoming apparent, and like other Web resources their preservation is a matter that needs to be addressed, somehow.

The PoWR blog has a number of interesting posts on the preservation of blogs.

There is also a section on preservation of blogs in the JISC PoWR handbook.

The Fetish of the Digital

Happy New Year to all our readers.

We are lucky enough to start 2009 with a guest blog post from Dr James Currall, Director of Information Strategy, IT Services & HATII Senior Research Fellow, University of Glasgow.

James has been involved with the highly successful Glasgow MPhil (now MSc) course in Information Management and Preservation since its inception, in which he teaches about the transition from storage of information on physical to digital media, information security, the role of numbers as information and a variety of other topics including risk and information management as an investment. In this latter context he was the Project Director of the espida project which developed a sustainable business-focussed model for digital preservation. He gave a plenary talk on Web preservation at last year’s Institutional Web Management Workshop (IWMW 2008) entitled The Tangled Web is but a Fleeting Dream … but then again … which was very well received and is available to watch on Google Video.

And I’ll pass you over to James…


A few weeks back, I was involved in a discussion about the skills required by people involved in Digital Curation and much of that discussion was based around the DigCCurr Project which has a long list of skills, some of which are specific to Digital Curation, but many of which are rather of a more general nature. And this set me on a dangerous course – thinking ….What exactly is this ‘profession’ of digital curator that DigCCurr amongst others are trying to define?

Let us rewind to say the second half of the 16th century and let us suppose that you were charged by Mr Shakespeare’s publishers with curating ‘The Scottish Play’. What would you have done? What exactly is this ‘information object’? Is it the fonts, the layout, the pagination, the language, the story, the stage directions or what? In spite of the absence of the profession of ‘paper curator’ we have inherited a rich heritage. Along the way, many items will have been lost – it was always thus and, in spite of the optimistic techno-determinism of some, it always will be EVEN IN THE DIGITAL AGE. I would argue that this is all good and necessary and whilst I would mourn the passing of Algol, Reverse Polish Notation, amplifiers based on thermionic valves or chunky discrete solid state components, vinyl records, reel to reel tape and other really splendid ideas that were IMHO much better than the ‘mass market equivalents’ that replaced them, we have to discard much of our baggage as we move on.

So what is this preservation activity all about? Is it not about the preservation and curation of information not of digits? During a session with my MSc students, we visited the Way Back Machine and had a look at the University of Glasgow Web site (you wondered when I would get on to the web didn’t you?). The page that we selected at random was from 18th October 2000. As a web page it is rather uninteresting, when I looked at it today there was no style sheet, the graphics were all missing and it was generally rather uninspiring, but …. what is interesting is the headline news story ‘Funeral of the First Minister, Donald Dewar’. For those of you furth of Scotland, Donald was a leading light in the establishment of devolution for Scotland and the first First Minister of the devolved administration in Scotland. He was a graduate of the University of Glasgow and his premature passing at the age of 63 was tragic. The news story is about ‘administrative’ details of his funeral and the passage of his cortege past the University – details of importance in relation to the history of the University and perhaps of Scotland. It is the information contained in the web pages that is of interest and importance, whilst the layout of the pages and such ‘technical’ details are of passing interest as the ‘container’ for that information.

So with 2008 now ended let us bury the idea that the digital needs its own ghetto, that we need to prepend everything with ‘digital’, be it: curation, preservation, art, culture, revolution, etc. Digital artifacts are the currently ‘fashionable’ containers for information and whilst the term continues, the technologies underneath are radically different at every turn and often require as much conversion one to another as a paper to magnetic disc conversion. It is not the containers that are important but what they contain. The Eastern concept of ‘Pointing at the Moon’ has something to say here.

If we come to regard preservation/curation as a finger pointing to the moon; we might come to mistake the finger for the moon and never see beyond it to the moon itself.

This short clip of Bruce Lee in ‘Enter the Dragon’ (1974) captures something of this in a different context.

I am also reminded of the auditors in Terry Pratchett’s ‘Thief of Time’ who take a great painting and break it down into flakes of paint which they put in little piles of each colour and then spend time looking to see where the art has gone!  These auditors are described in the Wikipedia article for DiscWorld thus:

The Auditors, cosmic bureaucrats who prefer a universe where electrons spin, rocks float in space and imagination is dead, represent the perils of handing yourself over to a completely materialist and deterministic vision of reality, devoid of the myths and stories that make us human.

From http://en.wikipedia.org/wiki/Discworld#Elves_and_Auditors

In 2009 we need to see digital preservation and curation as ‘last year’s model’, of course we need to understand the importance of custody, metadata and identifiers, but above all we need to understand the centrality of the information in the artifacts that we are seeking to curate and preserve.  This piece is recognisably ‘Currall’ not because of a digital signature, not because it is on his web site and not because the owners of the JISC PoWR say it is – it is ‘Currall’ because of its recognisably iconoclast position, poor grammar and tortured logic – that is what needs to be preserved!

Information is the thing (even if that is hard and technology is relatively easy) – lose sight of that and the game is a bogey.

PS if you are interested in a rather more rigorous treatment of this topic you might like to access “Authenticity: a red herring?” (doi:10.1016/j.jal.2008.09.004)

Legal scholarship recognises long-term value of blogs

A recent post on the digital-preservation list indicates that at least one scholarly community has recognised the long-term scholarly value of online resources such as blogs, and the potential damage to future scholarship that might result from their loss. It draws attention to a symposium taking place at Georgetown University next year. The email says that the symposium:

…will build upon the fundamental assumption that blogs are an integral part of today’s legal scholarship.

and goes on to say:

This symposium will bring together academic bloggers, librarians, and experts in digital preservation …. Symposium participants will collectively develop innovative practices to ensure that valuable scholarship is not easily lost.

Join the conversation now by tagging items you think are relevant to this symposium with the del.icio.us tag FTLS2009.

It’s interesting to observe that this is an example of a community acting to preserve information of interest that is likely to be scattered over many institutions and none. (I suspect a fair amount of blogging in this area is done by practitioners who aren’t at an academic institution.) One of the concerns we identified in PoWR was that much material of this type was unlikely to be preserved as a result of institutional interests, unless one institution tried to bring materials like this into the remit of its special collections (and some have done this.)

The conference web site goes on to say:

This unique symposium will seek answers to the questions:

1. How can quality academic scholarship reliably be discovered?
2. How can future researchers be assured of perpetual access to the information currently available in blogs?
3. How can any researcher be confident that documents posted to blogs are genuine?

The symposium will include a working group break-out session to create a uniform standard for preservation of blogs, a document to be shared by bloggers and librarians alike.

That last goal of a uniform standard for blog preservation looks like a tall order and it will be interesting to see what emerges from this group, and what its wider relevance might be. But it’s a clear demonstration of the value of web material to some research communities, and their willingness to do something about it if their institutions can’t, or won’t, help them.

JISC Study on Digital Preservation Policies

JISC have announced the publication of a study on Digital Preservation Policies which can be downloaded in PDF format from the JISC Web site.

This study aims to provide an outline model for digital preservation policies and to analyse the role that digital preservation can play in supporting and delivering key strategies for Higher and Further Education Institutions. Although focussing on the UK Higher and Further Education sectors, the study draws widely on policy and implementations from other sectors and countries and will be of interest to those wishing to develop policy and justify investment in digital preservation within a wide range of institutions.

The study concludes “that for institutions digital preservation must be seen as ‘a means to an end’ rather than an end in itself: any digital preservation policy must be framed in terms of the key business drivers and strategies of the institution.”

Two tools have been created in the study:

1) a model/framework for digital preservation policy and implementation clauses based on examination of existing digital preservation policies;

2) a series of mappings of digital preservation to other key institutional strategies in UK universities and colleges including Research, Teaching and Learning, Information, Libraries, and Records Management.

These tools are definitely worth taking a look at if you are embarking on a Web preservation strategy.

Yet more on iPres2008

The web-archiving strand at the iPRES Conference was an opportunity for a number of National Libraries to describe their initiatives, their approaches to web capture and preservation, and the things that were (for them) the biggest hurdles. It’s interesting to me that librarians – as opposed to archivists – seem to have been first off the mark with web-archiving, although in the UK The National Archives have at least two parallel initiatives underway (UKWAC and European Archive) to capture government websites deemed to be records.

From the PoWR point of view, one thing I found useful was the summary statement from Iceland which suggests that the community is now starting to agree on three main approaches to selection – bulk/domain harvesting, thematic, and event-based. Sometimes I wonder if these approaches are too library-influenced (looking at websites primarily as publications, rather than as records, whereas in PoWR we are suggesting that HFE websites contain and function as a mixture of both), but it helped me sharpen up my thinking and I fed this idea back into the PoWR Handbook’s chapter on Selection.

Since they were all representing libraries, most of the presenters tended to focus on legal deposit (and attendant permission problems) as the biggest hurdle to gathering websites – and the one that eats up the most in terms of resources. Denmark have not resolved this completely, and although they have gathered lots of material from the web, they don’t make it widely available – only allowing controlled access for research or statistical purposes. France, on the other hand, have had legal deposit laws in place since 2006. Australia have found this such a problem (despite the strenuous efforts they have made to get the law changed) that it pretty much forced the decision to go the curatorial route. All permissions are negotiated, but the collections are shaped to a certain extent by community input.

With my archivist hat on, I sat up when we were told quite categorically that registration and cataloguing of web resources would not work, and that no-one should ever even attempt it. The BL admitted they were ‘not really collecting metadata’. If this is true, I immediately started to wonder why both PANDAS and Web Curator Tool (which I have used) have conventional Dublin Core metadata elements built into their workflow. Don’t we anticipate cataloguing archived web resources in some way? I almost asked a question about this, until I reflected on the use of NutchWax and full-text indexing (which is probably acceptable until we can come up with some form of semantic tagging or automated metadata extraction for web resources).

Then I found myself reaching for the mic to ask a question about what I call ‘endangered resources’. We have already blogged about this on dablog, when my colleague Joanne Anthony raised a question about a web resource owned by a smallish institution which suddenly found itself with its funding removed. Was there anything such an Institution could do, I asked the panel, to preserve its website? And what were the pro-active steps being taken by these National Libraries to rescue or identify resources at risk? I know that UKWAC, for example, offers a public submission service on its website, although it is not very prominent or visible, nor is it quite clear what happens to requests for archiving once the form has been filled in. I received some interesting replies, including the amusing anecdote from France which suggests that their archival collections have been accessed by red-faced politicians who have accidentally deleted their own blogs. However, I still wasn’t quite sure what national initiatives exist to address what I perceive as a significant gap in the preservation of unrecognised (and therefore uncaptured) resources.

More on iPres2008

Brian and Marieke have already written about iPres2008 and PoWR, and I have written and will write more about it from a general perspective on DABlog. But we thought it would be worth saying a bit more about what this conference, which is looking at the complete picture of digital preservation, had to say that is relevant to PoWR’s work on web preservation in UK Universities.

There was an entire session devoted to various web archiving initiatives on the second day, which at first sight one might think is of particular relevance (almost as much as Brian’s presentation). I wasn’t at this session – it was one of those running in parallel tracks, and I was speaking in the other track – but Ed Pinsent was and will be writing at more length about it soon. But even without attending, I’m aware that many of the projects, operating as they do within their national domains in Australia or elsewhere, won’t have much role in helping save UK University web content (unless we move our domains to .edu.au – there’s a thought). Even when the BL realises its long-term aim of harvesting across the entire UK web domain, it will still be selective in some ways about what it captures – about depth and frequency of harvests, and about the type of content. You won’t be able to depend on those institutions to capture what you want to be captured. So if these initiatives aren’t going to meet all our needs, do we need to do it ourselves? The PoWR project thinks not, but that is one of the options institutions will need to examine. The work the IIPC is doing to develop harvesting and access tools will be of interest to those few institutions that feel able to operate these tools themselves – not something to be undertaken lightly.

Yet there was much of relevance at iPres2008. One recurring theme, picked up at the outset by Lynne Brindley and in Steve Knight’s closing remarks, was that ‘digital preservation’ is not the term to be using in discussions with our institutions and the world, echoing remarks on the DCC blog which Brian later picked up on here. Steve prefers the phrase ‘permanent access’, which is indeed outcome-focussed. However, we’ve also said in PoWR that preservation isn’t always forever, so I would prefer something a little more all-embracing – ‘long-lived access’ might fit.

The sessions covering things like significant properties also touched on issues that PoWR is concerned with. When we decide to preserve something, what is it that we’re really trying to keep? Most forms of preservation change the original object in some way, just as long-life milk isn’t the same as pasteurised, and neither is quite as tasty as fresh milk (or so I’ve been told). This is clearly still a very difficult problem, and one that (to my mind) demonstrates that the digital preservation community hasn’t even developed a clear problem statement, much less a fully worked-out solution. So, in the meantime, we need to be pragmatic and do what seems best at the time. Always a good plan.

iPres 2008

Brian Kelly will be presenting a paper on “Preservation of Web Resources: The JISC PoWR Project” authored by the JISC PoWR team at the fifth International Conference on Preservation of Digital Objects (iPres 2008) this coming Monday (29th September 2008). The conference will be held at the British Library from 29 – 30th September 2008 and brings together researchers and practitioners from around the world to explore the latest trends, innovations, thinking, and practice in digital preservation.

The slides and accompanying paper are available from the UKOLN Web site.

PoWR Handbook: early draft

handbookdraft_10092008.pdf

Attached is an early draft of the PoWR Handbook. This release is timed to coincide with our third and final workshop in Manchester.  We hope to gather feedback from that workshop, and from any comments received via this blog, to feed into the final version of the Handbook.

You can also browse an Issuu.com version of the Report here.

Comments can also be sent direct to Ed Pinsent, if you would prefer not to comment here.

Preservation Experts Suggest That The Term “Digital Preservation” Is Harmful

A recent post entitled ‘“Digital Preservation” term considered harmful?’ on the Digital Curation blog begins with the words:

Over the past few weeks I have become acutely aware that the term “digital preservation” may be becoming a problem.

Not quite what one might expect from Chris Rusbridge, director of the Digital Curation Centre (DCC)! And James Currall, who recently gave a plenary talk on Web site preservation issues at UKOLN’s IWMW 2008 event, appears to have been responsible for such heresy with his view that:

The digital preservation community has become very good at talking to itself and convincing ‘paid-up’ members of the value of preserving digital information, but the language used and the way that the discourse is constructed is unlikely to make much impact on either decision-makers or the creators of the digital information (academics, administrators, etc.).

But I have to say that I think that these views reflect the experiences we have had in the JISC PoWR project. Indeed Alison Wildish was quite open about this in her presentation at the first JISC PoWR workshop.

While we have to use “digital preservation” in appropriate contexts, including technical and other in-house discussions, and digital curation is appropriate in other contexts, terms that reflect the outcomes are more persuasive. The outcome of successful digital preservation is that digital resources remain accessible and usable over the long term.

and concludes by arguing that:

… outcome-related phrases like “long term accessibility” or “usability over time” are better than the process-oriented phrase “digital preservation”.

Amen to that! This reflects my views on the need to take a user-focussed approach to Web site development. Long term accessibility and usability simply mean that we need to think about the users in the future and not just those we have today. And perhaps that’s the approach we have to take in order to ‘sell’ preservation to somewhat sceptical Web developers.

Should our slogan be “Web preservation is dead! Long live long term accessibility! Long live usability over time!” I wonder?

Web Continuity Project at The National Archives

Ed and I were pleased to come across an interesting document, recently received from The National Archives, describing their Web Continuity Project. This is the latest of the many digital preservation initiatives undertaken by TNA/PRO, which began with EROS and NDAD in the mid 1990s and led to the UK Government Web Archive and other recent digital preservation initiatives (many in conjunction with BL and the JISC).

The Web Continuity Project arises from a request by Jack Straw, as leader of the House of Commons in 2007, that government departments ensure continued access to online documents. Further research revealed that:

  • Government departments are increasingly citing URLs in answer to Parliamentary Questions
  • 60% of links in Hansard to UK government websites for the period 1997 to 2006 are now broken
  • Departments vary considerably: for one, every link works; for another every link is broken. (TNA’s own website is not immune!)


Digital preservation in a nutshell, part II

As Richard noted in Part I, digital preservation is a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” But what sort of digital materials might be in scope for the PoWR project?

We think it extremely likely that institutional web resources are going to include digital materials such as “records created during the day-to-day business of an organisation” and “born-digital materials created for a specific purpose”.

What we want is to “maintain access to these digital materials beyond the limits of media failure or technological change”. This leads us to consider the longevity of certain file formats, the changes undergone by proprietary software, technological obsolescence, and the migration or emulation strategies we’ll use to overcome these problems.
