Web archiving in the wider world

When a topic is being discussed in the correspondence pages of national newspapers, it’s a sign that it’s no longer the concern of a few specialists. That’s certainly been true of web archiving for some time, as a recent example shows. Malcolm Birdling wrote a letter published in the Guardian on January 1, 2010 bemoaning the fact that some government agencies – in particular the UK Border Agency – actively prevent sites such as the Internet Archive from capturing their contents. This has important consequences for citizens, particularly when such sites are used to publish regulations and guidance which change frequently. (I have anecdotal evidence that the UK Inland Revenue lost an appeal brought by a taxpayer over a very similar issue.)

[Image: WAGN website - capture from Internet Archive (detail)]

Mr Birdling’s letter brought a rapid response from David Thomas of the UK National Archives, who was keen to reassure readers that central government websites were being archived, even without the legislation which prompted Mr Birdling’s original letter. (That story refers to the changes to Legal Deposit regulations which would permit the British Library and other UK copyright libraries to capture UK content without the permission of rights owners.)

But earlier examples of non-specialist concern with preserving web content exist. One of my favourite examples comes from the Usenet group uk.railway, whose contributors include a fair number of rail enthusiasts (“trainspotters” if you’re feeling unkind). Privatisation of the UK railway network means that we have a plethora of train operating companies, or TOCs, each of which operates its own web site, much as the great companies of old such as LNER might have done if the web had existed then. The difference is that now these companies come and go every few years when the government puts operating contracts out for re-tender. Railway ephemera such as promotional leaflets and timetables are a key part of the print collections at places such as the National Railway Museum. “What happens to TOC web sites when franchises change?” wondered one poster to uk.railway back in 2007. The Internet Archive has certainly captured some material, but it isn’t the same as a collection controlled by an institution such as the NRM. I wasn’t able to give a very positive answer to their question. I don’t believe the National Railway Museum are yet able to capture websites as part of their collection, and it’s not clear that any of the members of UKWAC see TOC sites as falling within their collecting policy.

And herein lies a lesson. Rail enthusiasts are incredibly effective at preserving railway heritage, both through their own efforts and through influencing others. They include many people with an enviable range of technical abilities. They saw to it that special legislation was passed to ensure the preservation of railway heritage after privatisation. Not content with simply preserving heritage, some of them set about recreating it by building an entirely new steam locomotive. But their combined efforts have not yet (so far as I know) ensured that past railway web sites have been preserved. If they can’t manage it without institutional help, what hope is there for the rest of us?

Why you can sometimes leave it to the University

“Does anyone have any positive experiences to share?”, asks Brian in a recent post. Well, I have – except it’s not in the UK. Harvard University Library in the USA have recently put Harvard WAX (the Web Archive Collection Service) live, after a pilot project which began in July 2006.

Harvard WAX includes themed collections on Women’s Voices and Constitutional Revision in Japan, but of particular interest to us in PoWR is their A-Sites collection: the semi-annual captures of selected Harvard websites. “The Harvard University Archives is charged with collecting and preserving the historical records of the University,” state the curators, recognising their formal archival function in this regard. “Much of the information collected for centuries in paper form now resides on University web sites.”

Helen Hockx-Yu of the British Library met with the WAX team in May 2009. “I was impressed with many of the features of the system,” she said, “not just the user and web curator interfaces but also some of the architectural decisions. WAX is a service offered by the Library to all Harvard departments and colleges. In exchange for a fee, the Departments use the system to build their collections. The academics may not be involved with the actual crawling of websites, but spend time QAing and curating the websites, and can to some extent decide how the archive targets appear in the Access Tool. The QAed sites are submitted directly into Harvard’s institutional repository.”

It is very encouraging to read of this participatory dimension to the project, indicating how success depends on the active involvement of the creators of the resources. Already 48 Harvard websites have been put into the collection, representing Departments, Committees, Schools, Libraries, Museums, and educational programmes.

The delivery of the resources has many good features also; there’s an unobtrusive header element which lets the user know they’re looking at an archived instance (instead of the live website). There’s a link explaining why the site was added to the collection, and contextual information about the wider collection. Another useful link allows researchers, scholars and other users to cite the resource; it’s good to see this automated feature integrated directly within the site. The Terms of Use page addresses a lot of current concerns about republishing web resources, and strikes just the right balance between protecting the interests of Harvard and providing a service to its users. Like a good OAIS-compliant repository, they are perfectly clear about who their designated user community are.

Best of all, they provide a working full-text search engine for the entire collection, something that many other web archive collections have been struggling to achieve.

The collection is tightly scoped, and takes account of ongoing developments for born-digital materials: “Collection managers, working in the online environment, must continue to acquire the content that they have always collected physically. With blogs supplanting diaries, e-mail supplanting traditional correspondence, and HTML materials supplanting many forms of print collateral, collection managers have grown increasingly concerned about potential gaps in the documentation of our cultural heritage.” The project has clear ownership (it is supported by the University Library’s central infrastructure), and it built its way up from a pilot project in less than three years. Their success was partly due to having a clear brief from the outset, and partly to collaboration with three University partners. What Harvard have done chimes in with many of the recommendations and suggestions made in the PoWR Handbook, particularly Chapters 5 (Selection), 16 (Responsibility for preservation of web resources) and 19 (How can you effect change?).

There are many aspects of this project which UK Institutions could observe, and perhaps learn something from. It shows that it is both possible and practical to embed website collection and preservation within an Institution.

Wiki Management

This contribution to a thread about management of wikis, posted by the Records Management Section at the University of Edinburgh, was submitted to the Archive listserv recently:

Below is an outline of the ‘wiki’ situation at the University of Edinburgh:

At Edinburgh University our main effort to date has been making sure that wikis are retention scheduled, and considering what the ideal retention period for a wiki should be. As part of setting up any new wiki space the University records details such as space owner and proposed use, but due to the wide variety of uses it is difficult to specify a generic retention period. There is the option for the space owner to delete a wiki space; however the most likely scenario is that a space atrophies over time, the owner stops engaging, and it is therefore then up to the University to be proactive in identifying and pruning out dead spaces.

At present the service policy talks about a default retention period of 1 year, which is primarily to make space owners aware that if not used their space may be deleted. If we have anything that requires long term migration we would look into outward migration; either to a new system or to an archive.

I found it very encouraging to see this pro-active and practical-minded approach to the management of wikis. In many ways Edinburgh’s RM approach vindicates a lot of the RM advice which we have recommended in the PoWR Handbook; as we say early on, we must manage resources in order to preserve them. It is also encouraging that in Edinburgh’s case at least the wiki problem is considered primarily in terms of information and staff management, and not exclusively in terms of the technological solutions that might be applied.

In particular:

1) Edinburgh: “Make sure wikis are retention scheduled”.

  • PoWR: “Deciding which aspects of your web resources to capture can be informed to a large extent by your Institutional drivers, and the agreed policies for retention and preservation.”  (p 22)

2) Edinburgh: “Consider the ideal retention period for a wiki”.

  • PoWR: “The attraction of bringing a website in line with an established retention and disposal programme is that it will work to defined business rules and retention schedules to enable the efficient destruction of materials, and also enable the protection and maintenance of records that need to be kept for business reasons.”  (p 93)

3) Edinburgh: “Make space owners aware that if not used their space may be deleted”.

  • PoWR: “Quite often in an academic context these applications rely on the individual to create and manage their own resources. A likely scenario is that the academic, staff member or student creates and manages his or her own external accounts in Flickr, Slideshare or; but they are not Institutional accounts. It is thus possible with Web 2.0 application for academics to conduct a significant amount of Institutional business outside of any known Institution network. The Institution either doesn’t know this activity is taking place, or ownership of the resources is not recognised officially. In such a scenario, it is likely the resources are at risk.”  (p 42)

4) Edinburgh: “The service policy talks about a default retention period.” This approach seems to incorporate rules as part of setting up any new wiki space, starting to manage the resource at the very beginning of the record’s lifecycle.

  • PoWR: “If we can apply a lifecycle model to web resources, they will be created, managed, stored and disposed of in a more efficient and consistent way; it can assist with the process of identifying what should and should not be retained, and why; and that in turn will help with making preservation decisions.”  (p 34)

5) Edinburgh: “If we have anything that requires long term migration we would look into outward migration; either to a new system or to an archive.”

  • PoWR: “Migration of resources is a form of preservation. Migration means moving resources from one operating system to another, or from one storage/management system to another. This may raise questions about emulation and performance. Can the resource be successfully extracted from its old system, and behave in an acceptable way in the new system?”  (p 33)
  • “The usual aim of archival appraisal has been to identify and select records for permanent preservation. Quite often appraisal has taken place at the very end of the lifecycle process (although records managers intervene where possible at the beginning of the process, enabling records of importance to be identified early).”  (p 36)
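The default retention rule that Edinburgh describes lends itself to a very small sketch. The following is only an illustration under stated assumptions: the field names (`name`, `owner`, `proposed_use`, `last_edited`) are hypothetical and are not Edinburgh's actual records, and "not used" is assumed to mean "not edited within the retention period". It simply lists the spaces that someone at the University might want to review.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: the 1 year default retention period mentioned in the service policy.
DEFAULT_RETENTION = timedelta(days=365)


@dataclass
class WikiSpace:
    # Hypothetical fields, loosely based on the details recorded at set-up
    # time (space owner and proposed use); not a real schema.
    name: str
    owner: str
    proposed_use: str
    last_edited: date


def spaces_due_for_review(spaces, today, retention=DEFAULT_RETENTION):
    """Return the spaces whose last edit is older than the retention period.

    This only flags candidates; deciding whether to delete, migrate or
    archive a space remains a human decision.
    """
    return [space for space in spaces if today - space.last_edited > retention]


if __name__ == "__main__":
    spaces = [
        WikiSpace("course-notes", "a.lecturer", "teaching", date(2008, 1, 1)),
        WikiSpace("project-x", "a.researcher", "project", date(2009, 6, 1)),
    ]
    for space in spaces_due_for_review(spaces, today=date(2009, 9, 1)):
        print(f"{space.name}: contact {space.owner} about retention")
```

Keeping the flagging step separate from any deletion reflects the point made above that this is primarily an information and staff management problem rather than a purely technical one.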

When Funding Bodies Shut Down

An email sent to the MLANORTHEAST-NEWS JISCMail list provides details of the implications of the closure of the MLA North East regional Agency on the Web services it has set up or commissioned.

The message states :

MLA North East Websites after 12th December, 2008

MLA North East over recent years has set up several websites which we have managed on behalf of the sector. This brief note is to inform you of the arrangements made for each of the sites.

  • A holding page will refer visitors to the MLA council site. All other content will be taken down at 4.00pm on Friday 12th December, 2008.
  • Currently a portal to our sector’s on-line stores selling local studies material and other ephemera. The content will be taken down at 4.00pm on Friday 12th December, 2008. The domain name is now owned by One NorthEast.
  • A portal to North East archives services, providing links to catalogues and paid-for professional support in researching archives. This will continue under the auspices of the North East Regional Archives Council [NERAC]. Contact Liz Rees.
  • A portal to learning resources to teachers, pupils, parents and carers providing venue guides, information regarding learning visits and links to some of the sector’s regional on-line learning resources. This will continue under the auspices of Tyne & Wear Museums.
  • Basic skills resources developed by primary teachers, working alongside learning professionals from six archives in the North East region and designed for use in schools. These resources offer a fresh and engaging approach to teaching basic skills. This will continue under the auspices of Durham University.
  • DiSCS provides an online directory of information technology (IT) and digital services suppliers to work with the cultural and heritage sector. This site has transferred and is managed by The Collections Trust.
  • The regional local studies site for archives and record offices, libraries, museums, archaeology services, the region’s universities and commercial organisations. Additionally, community groups have created one hundred local history projects. This will continue. The domain name is now owned by Newcastle City Council. The site is managed by Newcastle City Library Services. Contact Kath Cassidy.
  • The site for oral history in the North East of England. Support for individuals and organisations undertaking oral history projects, to provide focus and support and a forum for the sharing of ideas and experience. This site is closed.

I think this demonstrates some good practice in what organisations which have set up or commissioned Web sites should do if they are forced to close due to changes in Government funding and policies (as is the case with the MLA Regional Agencies).

We can see that for each Web site we are given its address and a brief summary of its purpose, details of when the site ceases operation, contact details and, in a couple of cases, details of how the service is being continued by other organisations.

I know the implications of the demise of our organisation for the Web services we are providing aren’t something that we like to think about. But in a personal capacity, once we reach a certain age and become aware of our responsibilities to others, we do tend to make plans for what happens after we die, perhaps by making a will. So shouldn’t our organisations be making similar plans in case the organisation ceases to exist? And at a time of the credit crunch this is even more important than it used to be.

Library Partnership Preserves End-of-Term Government Web Sites

The news that a Library Partnership Preserves End-of-Term Government Web Sites was announced in August 2008 (and it’s about the end of George W Bush’s term of office). However, I think it’s worth drawing attention to the article for those with an interest in the preservation of Web sites. One thing that caught my eye was the comment that:

the Internet Archive will undertake a comprehensive crawl of the .gov domain.

The article concluded with a summary of the role of the Internet Archive:

The Internet Archive is a high-tech nonprofit, founded in 1996 by Brewster Kahle as an “Internet library” to provide universal and permanent access to digital information for educators, researchers, historians, and the general public. The Internet Archive captures, stores and provides access to born-digital and digitized content, and leads the development of Heritrix, the open-source archival web crawler, used to facilitate the collection of web data for this project.

What role might the Internet Archive have in the UK, I wonder?

Unpublishing the Web

There was an interesting editorial by Siobhan Butterworth in Monday’s Guardian about ‘unpublishing’ – removal of content once placed on the Internet.

Siobhan explains:

Judging from the numbers of emails I get from people asking for material to be removed from the Guardian’s electronic archive, it seems that some people still don’t fully understand the implications of speaking to or even writing for a news organisation in the web age.

She goes on to argue that:

The web makes a lie of the old cliché that today’s newspaper pages are tomorrow’s fish and chip wrapping. Nowadays, as I’ve said before, the things you say about yourself in a newspaper are more like tattoos – they can be extremely difficult to get rid of.

It seems a good rule to set yourself when publishing content (or allowing content to be published about you) on the Web (and the same rule could apply to all emails sent) is: Are you happy for the whole world to see this?

The concepts that what you publish can be seen by all and that nothing truly disappears from the Web have slowly begun to embed themselves in our consciousness. This has been fuelled by a number of horror stories about employers accessing the Facebook (and Flickr and other social networking sites…) accounts of prospective employees. A New York Magazine article from February this year quoted a teenager as saying “If I don’t delete it, I’m still gonna be there. My generation is going to have all this history; we can document anything so easily.” Many people do realise that the off-hand comments and inappropriate photos we blog or publish can come back to haunt us.

While in some ways this might seem to be the flip side of what JISC PoWR is about, deletion is very much part of a preservation strategy.

So it pays to remember that:

  • stuff can disappear, and quite often it is the really good stuff we wish we’d held on to.


  • stuff that we wish would come out in the wash can stain for good.

So maybe we need to give some thought to how (and whether) things should be ‘unpublished’. What do people think?

Auricle: The Case Of The Disappearing E-learning Blog

The Auricle E-Learning Blog

The e-learning team at the University of Bath was one of the early adopters of blog technologies to provide a forum for reflecting on e-learning in a Web context. The blog was set up by Derek Morrison when he was head of the e-learning unit. Derek had an interest in exploring the potential of new technologies, with one example of this being the series of podcast interviews he recorded and made available on the blog back in 2005. This included an interview with John Dale about the innovative blogging service developed at the University of Warwick (the first large scale student blogging service in the UK) and, perhaps not as noteworthy, an interview with me on my reflections on the WWW 2005 conference.

The name of the e-learning team’s blog was Auricle, which has the advantage of being a very Google-friendly name, and a Google search for “Auricle Bath” finds links to the blog itself and various pages which refer to the blog. Unfortunately it seems that the blog no longer exists – following a link to the blog’s home page gets a 404 error message:

The web address was not found. It may have moved, or it may no longer be available.

How unfortunate – all that potentially valuable historical content giving views on the potential of the Web (including technologies such as blogs and podcasts) to enhance the quality of students’ learning experiences is now no longer available. And how should the University of Bath feel about this loss of its intellectual endeavours and of the role that the University had in being one of the early adopters of blogs by an e-learning team?

Why Did The Blog Disappear?

The URL for the Auricle blog provides an indication of some of the reasons for the disappearance of the blog: dacs refers to the Division of Access and Continuing Studies and cdntl to the Centre for the Development of New Technologies in Learning – but neither of these departments still exists. Following staff departures and organisational changes, support for learning at the University is now provided by the Learning and Teaching Enhancement Office (LTEO), with the e-Learning Team having responsibility for managing and supporting e-learning developments.

In addition to these organisational changes, the pMachine part of the blog’s domain name refers to the pMachine blog engine and morriblog clearly refers to  Derek Morrison, who left the University a number of years ago to support the HE Academy’s Pathfinder programme.

It is perhaps not surprising that, following such changes and the influx of a large number of new staff in the e-Learning Team, the Auricle blog got lost somewhere along the way!

Can We Retrieve Any Of The Resources?

Is it possible to retrieve any of the blog posts and related resources? Is it possible to obtain any details about the blog, such as when it was launched, the number of posts published during its lifetime, how popular it was and, perhaps, the impact that the blog may have had?

Since the blog was public, as opposed to a blog which was restricted to members of the University of Bath, the contents of the blog have been indexed by Google. And using a combination of search terms, such as “Auricle Bath”, it is also possible to discover Web resources which cite the Auricle blog. This helped me to find a blog post on Stephen Downes’s blog on The Weblog as the Model for a New Type of Virtual Learning Environment? in which Stephen (a high profile Canadian e-learning guru)  clearly acknowledged the importance of Derek Morrison’s views on the potential of the blog as providing “the basis for a distributed, not centralised, information and learning object system“:

The author of Auricle nails it. “In the weblog, however, the announcements, articles, stories are the raison d’etre’ so much so that, not satisfied to present articles from one source, the weblog has the temerity, due to the adoption of the RSS standard, to receive syndicated stories from other sources and, in turn, offer it’s own portfolio of articles for use by others. For example, a blog supporting a programme or module could be the vehicle by which faculty post date and time-stamped short articles relevant to the course but which also link to related, but distributed, learning resources which are presented via RSS feeds. Such feeds can be static or dynamic so that updated RSS formatted information will be reflected in whatever application is displaying it, e.g. a la Auricle’s RSS Dispenser. Here then is the basis for a distributed, not centralised, information and learning object system.” (My emphasis) Derek Morrison, Auricle, February 27, 2004.

And the date of Stephen’s post (27 February 2004) indicates that the Auricle blog was available in early 2004.

With some further use of Google I discover that the Auricle podcast resources are still available on the University of Bath Web site – and I’m pleased that the MP3 file of my interview has not been lost. The RSS file also contains the publication dates, which show that the podcasts were published during 2005. We seem to have unearthed some further information about the Auricle blog.

Rediscovering The Blog!

It required a Google search for “Auricle Morrison” for me to discover that the Auricle blog is alive and well! It is now hosted at (much better than the original URI!). And as well as providing access to the original posts (although with a new look-and-feel, as the blog is now based on the WordPress blog software) the blog is still active, with Derek using the blog to support his Pathfinder work at the HE Academy. As Chris Rusbridge pointed out in a post on ‘“Digital Preservation” term considered harmful?’ on the Digital Curation blog, phrases like “long term accessibility” or “usability over time” are better than the process-oriented phrase “digital preservation”. And here’s an example of how the Auricle blog has been preserved by continuing to be used and accessible to its user community.

The Lessons

What are the implications of this case study for the wider community? And what lessons can be learnt?

We should be aware of the dangers of associating services with departmental names and specific technologies. This has been well documented, including in Tim Berners-Lee’s article on “Cool URIs Don’t Change!” – although this is clearly easier to say than to implement in practice.
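As a rough illustration of that first lesson, here is a minimal sketch which flags the parts of a URL path that tie a service to a department, a product or a person. The watch list is an assumption: it is maintained by hand and is seeded here with the four names discussed in this post. The URL in the usage example is invented for illustration and is not the blog's real original address.

```python
from urllib.parse import urlparse

# Assumption: a hand-maintained watch list of names that are likely to change.
# The entries are the names discussed in this post; the labels are mine.
FRAGILE_NAMES = {
    "dacs": "department name",
    "cdntl": "department name",
    "pmachine": "software product",
    "morriblog": "personal name",
}


def fragile_segments(url, watch_list=FRAGILE_NAMES):
    """Return (segment, reason) pairs for path segments found in the watch list.

    File extensions are ignored and matching is case-insensitive, so a
    segment such as 'morriblog.php' is matched against 'morriblog'.
    """
    found = []
    for segment in urlparse(url).path.split("/"):
        stem = segment.split(".")[0].lower()
        if stem in watch_list:
            found.append((segment, watch_list[stem]))
    return found


if __name__ == "__main__":
    # Hypothetical URL, invented purely to exercise the function.
    example = "http://www.example.ac.uk/dacs/cdntl/pMachine/morriblog.php"
    for segment, reason in fragile_segments(example):
        print(f"{segment}: {reason}")
```

A check like this cannot decide what a good URI looks like; it can only draw attention to names which, on past experience, tend not to outlive a reorganisation.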

I feel there is also a need for departments to audit their networked services and to document their policies regarding the sustainability of such services.  And such documented policies should be examined when departments change their names or there are significant changes in personnel.

And this case study provides an interesting example of a service which has been driven by an individual – Derek Morrison. As Derek clearly felt ownership of the Auricle blog, he was motivated to migrate the content of the blog to a new platform and, at a later date, to continue to contribute to the blog, although not as frequently as previously. This probably saves the e-Learning Team at Bath from having to retrieve backup copies of the blog posts and provide an archived copy of the resource. But who owns the blog? And what would have happened if there had been an ownership dispute over the blog and the name of the blog? These are questions which will be relevant to many academics and support staff who make use of blogs to support their professional activities – including myself and my UK Web Focus blog. But as the contents of my blog have a Creative Commons licence I would hope that this will minimise any legal barriers to the contents of my blog being migrated to a new environment should circumstances change.

Are There Three Key Aspects To Web Site Preservation?

In response to my post on “Don’t Web Managers Care About Preservation?” Kevin Ashley described how he “see[s] a distinction being made between preserving an experience and preserving the information which the experience makes available. Both are valid preservation approaches and both achieve different ends.”

Kevin is correct – these distinctions are very real. And different sectors may well have differing views as to the importance of preserving the underlying data or the user experience – this has surfaced at recent repository events, with some groups arguing that PDF provides a satisfactory means of preserving the user experience whilst others feel that it is more important to preserve the data which was used to create the PDFs.

But rather than revisiting such arguments in this blog I would like to reflect on a comment made by Chris Rusbridge in response to the same post mentioned above. Chris described how:

this grump came about partly because a number of organisations which are supposed to have a commitment to long-term access to information managed to destroy access through re-launches. Richard, I do like continuity, and also long-term accessibility (gets both angles!) rather than preservation…

Persistent URIs are not about technical solutions, they are about commitment. We must make sure we never break URIs!

We should note that Chris isn’t engaging with the argument of whether it’s the experience or the information which he wants to be preserved – rather it’s the means of access he wants to remain in place.

And this, I feel, is one of the most challenging aspects of Web site preservation – preserving the access mechanisms for the end user. This, then, is very different from preserving that valuable historical parchment which might be moved from public view, sent off to a company for renovation and then sent on tour as part of a travelling exhibition. In this case the resource may be being curated, but access for the end user is not available – or even expected.

In the case of Web resources a failure by an organisation to manage digital assets may result in the organisation losing valuable information. But what if the Web resources are simply migrated to an alternative location? Or the resources are embedded in other aspects of the organisation’s work? In such cases the organisation will argue that it hasn’t lost anything.  Rather it is the end user who may feel aggrieved – as Chris has clearly described.
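One way of keeping the end user happy when resources are migrated is to keep the old addresses working by redirecting them. The following is only a sketch of the idea, assuming a hand-maintained mapping from old paths to new ones; the paths shown are hypothetical, and in practice the mapping would normally live in the Web server's own configuration.

```python
# Hypothetical mapping from old paths to their new locations.
REDIRECTS = {
    "/old-dept/reports/2004.html": "/publications/reports/2004.html",
}


def resolve(path, live_paths, redirects=REDIRECTS):
    """Return an (HTTP status, location) pair for a requested path.

    200 means the resource is still at the requested address, 301 means it
    has moved permanently and the old address still leads to it, and 404
    means the link has been broken.
    """
    if path in live_paths:
        return 200, path
    if path in redirects:
        return 301, redirects[path]
    return 404, None


if __name__ == "__main__":
    live = {"/publications/reports/2004.html"}
    print(resolve("/old-dept/reports/2004.html", live))
    print(resolve("/never-existed.html", live))
```

As Chris says, the technique is the easy part. The commitment lies in keeping the mapping up to date each time a site is relaunched.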

So perhaps we have three key aspects to Web site preservation – preservation of the experience, the information and the access. Or, if you feel that access for end users is part of the experience, we might argue the need to preserve the experience and/or information to support the needs of the organisation and the needs of the user community.

Preservation and Innovation

In a recent comment on this blog Kevin Ashley makes the point that having an interest in the preservation of Web resources doesn’t mean that one is anti-innovation. As Kevin points out, “I see a distinction being made between preserving an experience and preserving the information which the experience makes available. Both are valid preservation approaches and both achieve different ends.”

There’s a real difficulty, though, in applying either of these preservation approaches in an environment of rapid technological development. And within higher education we are likely to see examples of such innovation, whether this is scientific researchers involved in new ways of visualising scientific data or teaching staff who wish to ensure students gain experience in the use of Social Web technologies.

How are such tensions to be addressed? Should, for example, use of immersive environments such as Second Life be banned until preservation techniques have been developed which will ensure that such complex environments can be preserved? Such a draconian approach is alien to the educational sector’s IT development culture (although such approaches are taken in other areas such as biological and medical research). And as I’ve described in a post on “Is Second Life Accessible?”, innovative technologies such as Second Life can bring substantial benefits to the user community – in this case a user with cerebral palsy who feels that Second Life provides a really useful tool for people who are unable to get around, who have problems of mobility in real life “because you can have friends without having to go out and physically find them”.

The tensions between preservation and innovation perhaps reflect similar tensions between accessibility and innovation, with differing opinions being held by the various interested parties. In the case of Second Life (where we are seeing virtual worlds being continually assembled, developed and then redeveloped) there does seem to be an awareness of the need to preserve such virtual worlds, with the Maryland Institute for Technology in the Humanities having received funding from the Library of Congress’s National Digital Information Infrastructure and Preservation Program (NDIIPP) for a two-year project on Preserving Virtual Worlds. And yet the $590,000 funding for this project, which will not, of course, guarantee that a solution to the problem will be available at the end of the funding, indicates that the preservation of immersive worlds will not be an easy undertaking.

Returning to Kevin’s comment that there is a “distinction [to be] made between preserving an experience and preserving the information which the experience makes available. Both are valid preservation approaches and both achieve different ends”, perhaps it is important to focus on these distinctions when we are seeking to preserve our innovative services. Might the video clip of the Second Life experience be the appropriate solution for the pioneers of this technology until the research programmes have devised ways of preserving the much richer and reusable environment? And might not this be an approach which can also be taken for our innovative Web services?

Web Continuity Project at The National Archives

Ed and I were pleased to come across an interesting document, recently received from The National Archives, describing their Web Continuity Project. This is the latest of the many digital preservation initiatives undertaken by TNA/PRO, that began with EROS and NDAD in the mid 1990s, leading to the UK Government Web Archive and other recent digital preservation initiatives (many in conjunction with BL and the JISC).

The Web Continuity Project arises from a request by Jack Straw, as leader of the House of Commons in 2007, that government departments ensure continued access to online documents. Further research revealed that:

  • Government departments are increasingly citing URLs in answer to Parliamentary Questions
  • 60% of links in Hansard to UK government websites for the period 1997 to 2006 are now broken
  • Departments vary considerably: for one, every link works; for another every link is broken. (TNA’s own website is not immune!)
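Figures like these come down to simple arithmetic once each cited link has been checked. As a hedged sketch (this is not TNA's method, and the department labels in the usage example are made up), suppose each check has already produced an HTTP status code, or `None` where the request failed altogether. The share of broken links, overall and per department, could then be worked out like this:

```python
from collections import defaultdict


def is_broken(status):
    """Treat failed requests (None) and 4xx/5xx responses as broken links."""
    return status is None or status >= 400


def broken_share(statuses):
    """Return the fraction of broken links among the given status codes."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    return sum(is_broken(status) for status in statuses) / len(statuses)


def broken_share_by_department(checks):
    """Group (department, status) pairs and return the broken share for each."""
    grouped = defaultdict(list)
    for department, status in checks:
        grouped[department].append(status)
    return {dept: broken_share(codes) for dept, codes in grouped.items()}


if __name__ == "__main__":
    # Made-up results, purely to show the shape of the input.
    checks = [
        ("Dept A", 200), ("Dept A", 200),
        ("Dept B", 404), ("Dept B", None),
        ("Dept C", 200), ("Dept C", 410),
    ]
    print(broken_share(status for _, status in checks))
    print(broken_share_by_department(checks))
```

Note that a redirect (a 3xx status) is counted as working here, on the assumption that the reader still ends up at the document.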
