Author Archives: Ed Pinsent

Why you can sometimes leave it to the University

“Does anyone have any positive experiences to share?”, asks Brian in a recent post. Well, I have – except it’s not in the UK. Harvard University Library in the USA have recently put Harvard WAX (the Web Archive Collection Service) live, after a pilot project which began in July 2006.

Harvard WAX includes themed collections on Women’s Voices and Constitutional Revision in Japan, but of particular interest to us in PoWR is their A-Sites collection: the semi-annual captures of selected Harvard websites. “The Harvard University Archives is charged with collecting and preserving the historical records of the University,” state the curators, recognising their formal archival function in this regard. “Much of the information collected for centuries in paper form now resides on University web sites.”

Helen Hockx-Yu of the British Library met with the WAX team in May 2009. “I was impressed with many of the features of the system,” she said, “not just the user and web curator interfaces but also some of the architectural decisions. WAX is a service offered by the Library to all Harvard departments and colleges. In exchange for a fee, the Departments use the system to build their collections. The academics may not be involved with the actual crawling of websites, but spend time QAing and curating the websites, and can to some extent decide how the archive targets appear in the Access Tool. The QAed sites are submitted directly into Harvard’s institutional repository.”

It is very encouraging to read of this participatory dimension to the project, indicating how success depends on the active involvement of the creators of the resources. Already 48 Harvard websites have been put into the collection, representing Departments, Committees, Schools, Libraries, Museums, and educational programmes.

The delivery of the resources has many good features also; there’s an unobtrusive header element which lets the user know they’re looking at an archived instance (instead of the live website). There’s a link explaining why the site was added to the collection, and contextual information about the wider collection. Another useful link allows researchers, scholars and other users to cite the resource; it’s good to see this automated feature integrated directly within the site. The Terms of Use page addresses a lot of current concerns about republishing web resources, and strikes just the right balance between protecting the interests of Harvard and providing a service to its users. Like a good OAIS-compliant repository, they are perfectly clear about who their designated user community are.

Best of all, they provide a working full-text search engine for the entire collection, something that many other web archive collections have been struggling to achieve.

The collection is tightly scoped, and takes account of ongoing developments for born-digital materials: “Collection managers, working in the online environment, must continue to acquire the content that they have always collected physically. With blogs supplanting diaries, e-mail supplanting traditional correspondence, and HTML materials supplanting many forms of print collateral, collection managers have grown increasingly concerned about potential gaps in the documentation of our cultural heritage.” The project has clear ownership (it is supported by the University Library’s central infrastructure), and it grew from a pilot project to a live service in less than three years. Their success was due partly to having a clear brief from the outset, and partly to collaboration with three University partners. What Harvard have done chimes with many of the recommendations and suggestions made in the PoWR Handbook, particularly Chapters 5 (Selection), 16 (Responsibility for preservation of web resources) and 19 (How can you effect change?).

There are many aspects of this project which UK Institutions could observe, and perhaps learn something from. It shows that it is both possible and practical to embed website collection and preservation within an Institution.

Permission to capture Twitter

This blog has been occasionally concerned with issues surrounding the capture of Tweets, in their capacity as web-based resources which may or may not have value to the Institution as record or archive material. See Brian Kelly’s post of July 2008. The discussions have been based around (a) is it worth doing? and (b) is it technically possible to capture Tweets, and how? This post is concerned with a third issue, (c) do we have permission to do it, since Twitter is a third-party hosting service? The same issue, in fact, applies to many Web 2.0 applications which involve members of staff working in the cloud or generally conducting University business outside of the network.

The PANDORA Librarian at the State Library of Queensland has just asked the question about permission. “Do we have a blanket permission for Twitter? That is, if we find a Twitter site we want to gather, and we get permission as well from, say, a politician or whoever has content on that site, can we go ahead and gather?”

The National Library of Australia, who have been involved in a library-centric web archiving programme for many years, replied with the following advice:

“We received advice from Twitter, when asking to archive the Prime Minister’s twitter page, which gives a clear indication that they have no problem with archiving,” said the NLA’s Manager of Web Archiving. “I have also been in contact with another Twitter staffer who also gave permission. So I think it is safe to assume that we may archive selectively from Twitter, depending upon receiving the permission of the Twitteree. However in the case of the PM, we didn’t seek that explicit permission as we have blanket permission, from his Party and the PM’s dept.”

And advice from Twitter itself confirms this:

“Archiving the Prime Minister’s twitter updates is fine, since he owns the copyright to the messages that he posted. According to our Terms of Service:

Copyright (What’s Yours is Yours)

1. We claim no intellectual property rights over the material you provide to the Twitter service. Your profile and materials uploaded remain yours. You can remove your profile at any time by deleting your account. This will also remove any text and images you have stored in the system.

2. We encourage users to contribute their creations to the public domain or consider progressive licensing terms.

This is potentially a useful little development. It may send archivists in the direction of checking the terms of service of other third-party hosting organisations, such as Google Docs or SlideShare, to establish what’s permissible and what isn’t. If copyright and ownership issues are sidestepped or aren’t even mentioned, this could be a cause for concern.

In the meantime, presumably we must also give credit to our Australian colleague for devising the neologism “Twitteree”.

Archiving a wiki

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library LIFE and LIFE2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out if you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led to me making a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is because UKWAC users, the target audience for the archived copy – or the designated user community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.

This selection aspect led to Maureen Pennock’s reply, which is a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.
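As a footnote on implementation: once the policy decisions are made, a selection decision like this usually ends up expressed as crawl scoping rules. A minimal sketch in Python follows; the URL patterns are typical of MediaWiki installations but are my own illustration, and would need verifying against the wiki actually being crawled.

```python
import re

# Patterns typical of MediaWiki's auto-generated pages (illustrative only;
# check them against the real wiki before excluding anything).
EXCLUDE_PATTERNS = [
    r"[?&]action=(history|edit|raw|delete)",  # edit history and edit forms
    r"[?&]oldid=\d+",                         # old page revisions
    r"/Special:",                             # login, indexes, admin pages
    r"/Talk:",                                # discussion pages
]

def in_scope(url: str) -> bool:
    """Return True if the URL should be included in the archive gather."""
    return not any(re.search(pattern, url) for pattern in EXCLUDE_PATTERNS)

print(in_scope("http://wiki.example.org/index.php/Main_Page"))               # True
print(in_scope("http://wiki.example.org/index.php?title=X&action=history"))  # False
```

In Web Curator Tool or Heritrix the equivalent decisions are entered as exclusion filters in the crawl profile rather than as code, but the policy being implemented is the same.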

Wiki Management

This contribution to a thread about the management of wikis, posted by the Records Management section at the University of Edinburgh, was submitted to the Archive listserv recently:

Below is an outline of the ‘wiki’ situation at the University of Edinburgh:

At Edinburgh University our main effort to date has been making sure that wikis are retention scheduled, and considering what the ideal retention period for a wiki should be. As part of setting up any new wiki space the University records details such as space owner and proposed use, but due to the wide variety of uses it is difficult to specify a generic retention period. There is the option for the space owner to delete a wiki space; however the most likely scenario is that a space atrophies over time, the owner stops engaging, and it is therefore up to the University to be proactive in identifying and pruning out dead spaces.

At present the service policy talks about a default retention period of 1 year, which is primarily to make space owners aware that if not used their space may be deleted. If we have anything that requires long term migration we would look into outward migration; either to a new system or to an archive.

I found it very encouraging to see this pro-active and practical-minded approach to the management of wikis. In many ways Edinburgh’s RM approach vindicates a lot of the RM advice which we have recommended in the PoWR Handbook; as we say early on, we must manage resources in order to preserve them. It is also encouraging that in Edinburgh’s case at least the wiki problem is considered primarily in terms of information and staff management, and not exclusively in terms of the technological solutions that might be applied.

In particular:

1) Edinburgh: “Make sure wikis are retention scheduled”.

  • PoWR: “Deciding which aspects of your web resources to capture can be informed to a large extent by your Institutional drivers, and the agreed policies for retention and preservation.”  (p 22)

2) Edinburgh: “Consider the ideal retention period for a wiki”.

  • PoWR: “The attraction of bringing a website in line with an established retention and disposal programme is that it will work to defined business rules and retention schedules to enable the efficient destruction of materials, and also enable the protection and maintenance of records that need to be kept for business reasons.”  (p 93)

3) Edinburgh: “Make space owners aware that if not used their space may be deleted”.

  • PoWR: “Quite often in an academic context these applications rely on the individual to create and manage their own resources. A likely scenario is that the academic, staff member or student creates and manages his or her own external accounts in Flickr, Slideshare or; but they are not Institutional accounts. It is thus possible with Web 2.0 application for academics to conduct a significant amount of Institutional business outside of any known Institution network. The Institution either doesn’t know this activity is taking place, or ownership of the resources is not recognised officially. In such a scenario, it is likely the resources are at risk.”  (p 42)

4) Edinburgh: “The service policy talks about a default retention period.” This approach seems to incorporate rules as part of setting up any new wiki space, starting to manage the resource at the very beginning of the record’s lifecycle.

  • PoWR: “If we can apply a lifecycle model to web resources, they will be created, managed, stored and disposed of in a more efficient and consistent way; it can assist with the process of identifying what should and should not be retained, and why; and that in turn will help with making preservation decisions.” (p 34)

5) Edinburgh: “If we have anything that requires long term migration we would look into outward migration; either to a new system or to an archive.”

  • PoWR: “Migration of resources is a form of preservation. Migration means moving resources from one operating system to another, or from one storage/management system to another. This may raise questions about emulation and performance. Can the resource be successfully extracted from its old system, and behave in an acceptable way in the new system?”  (p 33)
  • “The usual aim of archival appraisal has been to identify and select records for permanent preservation. Quite often appraisal has taken place at the very end of the lifecycle process (although records managers intervene where possible at the beginning of the process, enabling records of importance to be identified early).”  (p 36)
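Edinburgh’s dormant-space rule lends itself to automation. Below is a minimal sketch; the field names are invented for illustration, and a real implementation would read space ownership and last-edit dates from the wiki platform itself.

```python
from datetime import date, timedelta

RETENTION = timedelta(days=365)  # Edinburgh's default 1-year retention period

# Illustrative records; in practice these would come from the wiki platform.
spaces = [
    {"owner": "dept-a", "last_edited": date(2008, 3, 1)},
    {"owner": "dept-b", "last_edited": date(2009, 5, 20)},
]

def dormant(space: dict, today: date) -> bool:
    """A space is dormant if it has not been edited within the retention period."""
    return today - space["last_edited"] > RETENTION

today = date(2009, 6, 1)
for space in spaces:
    if dormant(space, today):
        # Make the owner aware, per policy, before any deletion or migration.
        print(f"Notify {space['owner']}: space unused, candidate for review")
```

The point of the sketch is that the hard part is the policy (the retention period and what “unused” means), not the code; once those are agreed, identifying dead spaces is routine.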

Handbook v 1.0 released

Version 1.0 of the PoWR Handbook is published and released today. The Handbook has been one of the main deliverables of the JISC PoWR project, but there is no doubt that this blog itself has constituted a rich and varied source of discussion and information. (As such perhaps the blog itself should be nominated for archiving). Indeed, some of the blog discussions here have been refitted and turned into case studies and scenarios in the Handbook, as they touch on many important issues.

Version 1.0 is not substantially different from Version 0.3, released last month, but we have taken some recent feedback comments into account, added a useful and comprehensive Index to the Handbook, and made corrections arising from the proof-reading stage.

In the spirit of showing our commitment to the management of Web 2.0-type resources, we are also making the Handbook available on issuu and Scribd.

Yet more on iPres2008

The web-archiving strand at the iPRES Conference was an opportunity for a number of National Libraries to describe their initiatives, their approaches to web capture and preservation, and the things that were (for them) the biggest hurdles. It’s interesting to me that librarians – as opposed to archivists – seem to have been first off the mark with web-archiving, although in the UK The National Archives have at least two parallel initiatives underway (UKWAC and European Archive) to capture government websites deemed to be records.

From the PoWR point of view, one thing I found useful was the summary statement from Iceland which suggests that the community is now starting to agree on three main approaches to selection – bulk/domain harvesting, thematic, and event-based. Sometimes I wonder if these approaches are too library-influenced (looking at websites primarily as publications, rather than as records, whereas in PoWR we are suggesting that HFE websites contain and function as a mixture of both), but it helped me sharpen up my thinking and I fed this idea back into the PoWR Handbook’s chapter on Selection.

Since they were all representing libraries, most of the presenters tended to focus on legal deposit (and attendant permission problems) as the biggest hurdle to gathering websites – and the one that eats up the most in terms of resources. Denmark have not resolved this completely, and although they have gathered lots of material from the web, they don’t make it widely available – only allowing controlled access for research or statistical purposes. France, on the other hand, have had legal deposit laws in place since 2006. Australia have found this such a problem (despite the strenuous efforts they have made to get the law changed) that it pretty much forced the decision to go the curatorial route. All permissions are negotiated, but the collections are shaped to a certain extent by community input.

With my archivist hat on, I sat up when we were told quite categorically that registration and cataloguing of web resources would not work, and that no-one should ever even attempt it. The BL admitted they were ‘not really collecting metadata’. If this is true, I immediately started to wonder why both PANDAS and Web Curator Tool (which I have used) have conventional Dublin Core metadata elements built into their workflow. Don’t we anticipate cataloguing archived web resources in some way? I almost asked a question about this, until I reflected on the use of NutchWax and full-text indexing (which is probably acceptable until we can come up with some form of semantic tagging or automated metadata extraction for web resources).
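For illustration, the sort of Dublin Core description those tools prompt for is very lightweight. The sketch below uses the standard DC 1.1 element names, but the values (and the choice of elements) are entirely my own invention, not the actual WCT or PANDAS schema:

```python
# A minimal Dublin Core description for one archived website instance.
# Element names follow DC 1.1; the values here are invented examples.
record = {
    "dc:title": "DigiRep wiki (archived instance)",
    "dc:creator": "UKOLN Repositories Research Team",
    "dc:date": "2009-05-01",          # date of capture, not of creation
    "dc:type": "InteractiveResource",
    "dc:format": "application/warc",  # container format of the gather
    "dc:identifier": "http://www.ukoln.ac.uk/repositories/digirep/",
    "dc:rights": "Archived with the owner's permission",
}

# Serialise as simple flat XML elements, OAI-style.
for element, value in record.items():
    print(f"<{element}>{value}</{element}>")
```

Even this much, recorded at the point of capture, would answer the basic archival questions of what was gathered, when, and from where; full-text indexing answers a different question (what the pages say), which is why the two are complements rather than alternatives.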

Then I found myself reaching for the mic to ask a question about what I call ‘endangered resources’. We have already blogged about this on dablog, when my colleague Joanne Anthony raised a question about a web resource owned by a smallish institution which suddenly found itself with its funding removed. Was there anything such an Institution could do, I asked the panel, to preserve its website? And what were the pro-active steps being taken by these National Libraries to rescue or identify resources at risk? I know that UKWAC, for example, offers a public submission service on its website, although it is not very prominent or visible, nor is it quite clear what happens to requests for archiving once the form has been filled in. I received some interesting replies, including the amusing anecdote from France which suggests that their archival collections have been accessed by red-faced politicians who have accidentally deleted their own blogs. However, I still wasn’t quite sure what national initiatives exist to address what I perceive as a significant gap in the preservation of unrecognised (and therefore uncaptured) resources.

PoWR Handbook: early draft


Attached is an early draft of the PoWR Handbook. This release is timed to coincide with our third and final workshop in Manchester.  We hope to gather feedback from that workshop, and from any comments received via this blog, to feed into the final version of the Handbook.

You can also browse a version of the Report here.

Comments can also be sent direct to Ed Pinsent, if you would prefer not to comment here.

Before and after

Many good ideas are already coming out of the first PoWR workshop. One thing I personally found illuminating was one of the breakout sessions which I facilitated, called ‘The history of the Institution’s home page’.

The scenario required an institution to provide ‘examples of how the web site has developed since it was launched’. (Brian Kelly has already broached this topic here). My colleagues discussed this conundrum with great vigour, but most had to admit they drew a blank when it came to producing even a basic screenshot from five years ago. Instead, there was a lot of ‘folk memory’ and anecdotal evidence, also sometimes called ‘tacit knowledge’. It was clear that everyone was aware that their web site had changed (and sometimes even improved) quite dramatically in the last 5-10 years. It’s just that no-one could lay their hands on any evidence of the changes. Imagine it like one of those ‘before and after’ reveals which you might get on Changing Rooms on TV. The problem with web sites is that we can’t always lay hands on the ‘before’ picture.
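One place a ‘before’ picture can sometimes be recovered is the Internet Archive’s Wayback Machine, whose public CDX index can be queried for the capture dates it holds for a given page. The sketch below just builds such a query (the hostname is invented, and coverage of any particular institutional site is not guaranteed):

```python
from urllib.parse import urlencode

def cdx_query_url(site: str, year_from: str, year_to: str) -> str:
    """Build a query against the Internet Archive's CDX index, which lists
    the captures the Wayback Machine holds for a given URL."""
    params = urlencode({
        "url": site,
        "from": year_from,
        "to": year_to,
        "output": "json",
        "fl": "timestamp,original",  # just the capture date and source URL
    })
    return "http://web.archive.org/cdx/search/cdx?" + params

# e.g. list captures of an (invented) institutional home page from five years ago:
print(cdx_query_url("www.example.ac.uk", "2003", "2004"))
```

Fetching that URL returns the list of snapshots, each of which can then be viewed in the Wayback Machine; it is no substitute for an institution keeping its own record, but it is often the only ‘before’ evidence available.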

Some drivers for changes included:

  • Corporate or institutional rebranding
  • Move to a Content Management System
  • Content provider change
  • External consultancy

And the following sorts of web site elements were subject to change:

  • Design, branding, colours, logos
  • Accessibility
  • Language
  • Content – obviously(!) – but do we know what content was added and what was thrown away?
  • Navigation – clicks and links became more consistent across the site
  • More pages – a site which used to be one ‘main page’ for the entire institution is now much larger and more complex, and every Department in the University now has a suite of pages
  • More interactive elements, including automated forms
  • Site became more searchable
  • More media, video and audio content was embedded
  • Distinction between internal and external content became more pronounced

It’s also interesting that one of our colleagues thought that the CMS also added some constraints to ways of working: what once was easy is now much more difficult. I’m sure this is a trade-off we find with most new ways of working, and it isn’t meant to imply that a CMS is always evil.

Kevin Ashley heard this summary with some interest and jokingly suggested that perhaps some form of oral history was the relevant preservation solution here. Yet if we have no other evidence of a web site’s history, who knows – it may yet turn out to be the last resort.

Digital preservation in a nutshell, part II

As Richard noted in Part I, digital preservation is a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” But what sort of digital materials might be in scope for the PoWR project?

We think it extremely likely that institutional web resources are going to include digital materials such as “records created during the day-to-day business of an organisation” and “born-digital materials created for a specific purpose”.

What we want is to “maintain access to these digital materials beyond the limits of media failure or technological change”. This leads us to consider the longevity of certain file formats, the changes undergone by proprietary software, technological obsolescence, and the migration or emulation strategies we’ll use to overcome these problems.


Records management: are websites records?

I’ve been asked to contribute a post or two on Records Management (RM) and preservation, as it relates to this project. There was certainly a paragraph or two in the original JISC Invitation To Tender that made me think it was possible to apply aspects of Records Management to this task:

“A formal records management approach may be the appropriate response to handling certain kinds of web material and this assumes active management of the material throughout its whole life-cycle, an approach which is intrinsic to all preservation-related approaches.”

We think it’s quite likely this line of thinking may apply to University and HFE websites, if the institutional website is being used as a place to conduct business, or is developing into an area where corporate records are being created and stored. A webmaster or a records manager could start to ask a few questions about the organisation’s website.

Introduction: Ed Pinsent

Ed Pinsent has been doing website archiving and collection management since 2004, funded by the JISC to capture snapshot impressions of JISC project websites which are otherwise in danger of disappearing. The snapshots are currently stored and managed by the UK Web Archiving Consortium. Ed comes from a background of traditional archival curation (understanding of the importance of preservation, context, structured information, user needs); and records management (pro-active maintenance and protection of important assets and resources). Currently a member of the Digital Preservation team at the University of London Computer Centre.