Preservation of Web Resources: a JISC-funded project [Archived Blog]

Archive for the 'Records management' Category

Why you can sometimes leave it to the University

Posted by Ed Pinsent on 8th September 2009

“Does anyone have any positive experiences to share?”, asks Brian in a recent post. Well, I have – except it’s not in the UK. Harvard University Library in the USA have recently put Harvard WAX (the Web Archive Collection Service) live, after a pilot project which began in July 2006.

Harvard WAX includes themed collections on Women’s Voices and Constitutional Revision in Japan, but of particular interest to us in PoWR is their A-Sites collection: the semi-annual captures of selected Harvard websites. “The Harvard University Archives is charged with collecting and preserving the historical records of the University,” state the curators, recognising their formal archival function in this regard. “Much of the information collected for centuries in paper form now resides on University web sites.”

Helen Hockx-Yu of the British Library met with the WAX team in May 2009. “I was impressed with many of the features of the system,” she said, “not just the user and web curator interfaces but also some of the architectural decisions. WAX is a service offered by the Library to all Harvard departments and colleges. In exchange for a fee, the Departments use the system to build their collections. The academics may not be involved with the actual crawling of websites, but spend time QAing and curating the websites, and can to some extent decide how the archive targets appear in the Access Tool. The QAed sites are submitted directly into Harvard’s institutional repository.”

It is very encouraging to read of this participatory dimension to the project, indicating how success depends on the active involvement of the creators of the resources. Already 48 Harvard websites have been put into the collection, representing Departments, Committees, Schools, Libraries, Museums, and educational programmes.

The delivery of the resources has many good features also; there’s an unobtrusive header element which lets the user know they’re looking at an archived instance (instead of the live website). There’s a link explaining why the site was added to the collection, and contextual information about the wider collection. Another useful link allows researchers, scholars and other users to cite the resource; it’s good to see this automated feature integrated directly within the site. The Terms of Use page addresses a lot of current concerns about republishing web resources, and strikes just the right balance between protecting the interests of Harvard and providing a service to its users. Like a good OAIS-compliant repository, they are perfectly clear about who their designated user community are.

Best of all, they provide a working full-text search engine for the entire collection, something that many other web archive collections have been struggling to achieve.

The collection is tightly scoped, and takes account of ongoing developments for born-digital materials: “Collection managers, working in the online environment, must continue to acquire the content that they have always collected physically. With blogs supplanting diaries, e-mail supplanting traditional correspondence, and HTML materials supplanting many forms of print collateral, collection managers have grown increasingly concerned about potential gaps in the documentation of our cultural heritage.” The project has clear ownership (it is supported by the University Library’s central infrastructure), and it built its way up from a pilot project in less than three years. Their success was partially due to having a clear brief from the outset, and through collaboration with three University partners. What Harvard have done chimes in with many of the recommendations and suggestions made in the PoWR Handbook, particularly Chapters 5 (Selection), 16 (Responsibility for preservation of web resources) and 19 (How can you effect change?)

There are many aspects of this project which UK Institutions could observe, and perhaps learn something from. It shows that it is both possible and practical to embed website collection and preservation within an Institution.

Posted in Policies, Preservation, Records management, Resources, Selection | 1 Comment »

Survey: How successful has Records Management been?

Posted by Marieke Guy on 4th September 2009

As part of his dissertation at Aberystwyth University, Andrew Brown is undertaking a research project which aims to determine how successful Records Management has been in the UK by asking Records Managers for their perceptions of Records Management in their organisation and the profession as a whole. He is attempting to quantify this ‘success’ and would be very grateful if records managers could take the time to complete the survey, which will take approximately 10-15 minutes.

It is hoped that this study will generate some stimulating debate on this matter and lead to a greater understanding of the current and future state of the Records Management profession in the UK where digital and Web preservation may be key.

Please access the survey at the following link.

The survey closes at midnight on 5th September.

Posted in Records management | No Comments »

Archiving a wiki

Posted by Ed Pinsent on 25th March 2009

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library Life and Life2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out if you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led to me making a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is because UKWAC users, the target audience for the archived copy – or the designated user community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.
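To make the selection decision above concrete, here is a minimal sketch of a URL scope filter for a MediaWiki gather. It is not Web Curator Tool's own configuration syntax, and the exclusion lists, the function name `in_scope` and the example hostname are assumptions made for illustration; a real wiki may use different namespaces or URL layouts.

```python
from urllib.parse import urlparse, parse_qs, unquote

# Hypothetical exclusion rules for illustration only.
EXCLUDED_ACTIONS = {"history", "edit", "raw", "delete", "protect"}
EXCLUDED_TITLE_PREFIXES = ("Talk:", "User_talk:", "Special:")
EXCLUDED_PARAMS = {"oldid", "diff"}  # parameters that point at old revisions


def in_scope(url: str) -> bool:
    """Return True if the URL looks like finished content worth gathering."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)

    # The page title is either in ?title=... or somewhere in the path.
    if "title" in query:
        segments = [query["title"][0]]
    else:
        segments = unquote(parsed.path).split("/")

    # Skip discussion pages and automatically generated special pages.
    if any(seg.startswith(EXCLUDED_TITLE_PREFIXES) for seg in segments):
        return False
    # Skip edit forms, history listings and similar non-content views.
    if query.get("action", ["view"])[0] in EXCLUDED_ACTIONS:
        return False
    # Skip links to individual old revisions and diffs.
    if any(param in query for param in EXCLUDED_PARAMS):
        return False
    return True
```

Keeping the rules in three short lists means the policy decision (what to exclude) stays visible and easy to revise, separate from the mechanics of applying it.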

This selection aspect led to Maureen Pennock’s reply, which is a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.

Posted in Challenges, Digital preservation, Records management, Selection, Web 2.0 | No Comments »

Wiki Management

Posted by Ed Pinsent on 16th January 2009

This contribution to a thread about management of wikis, posted by the Records management section at the University of Edinburgh, was submitted to the Archive listserv recently:

Below is an outline of the ‘wiki’ situation at the University of Edinburgh:

At Edinburgh University our main effort to date has been making sure that wikis are retention scheduled, and considering what the ideal retention period for a wiki should be. As part of setting up any new wiki space the University records details such as space owner and proposed use, but due to the wide variety of uses it is difficult to specify a generic retention period. There is the option for the space owner to delete a wiki space; however the most likely scenario is that a space atrophies over time, the owner stops engaging, and it is therefore then up to the University to be proactive in identifying and pruning out dead spaces.

At present the service policy talks about a default retention period of 1 year, which is primarily to make space owners aware that if not used their space may be deleted. If we have anything that requires long term migration we would look into outward migration; either to a new system or to an archive.
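As an illustration of how a default retention period could be used to find dead spaces, here is a minimal sketch. It assumes each space records an owner and a last-edited date; the `WikiSpace` class, the function name and the sample data are hypothetical and do not describe Edinburgh's actual system.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# The one-year default mentioned in the service policy.
DEFAULT_RETENTION = timedelta(days=365)


@dataclass
class WikiSpace:
    name: str
    owner: str
    last_edited: date


def spaces_due_for_review(spaces, today, retention=DEFAULT_RETENTION):
    """Return spaces with no edits inside the retention period, oldest first."""
    cutoff = today - retention
    stale = [s for s in spaces if s.last_edited < cutoff]
    return sorted(stale, key=lambda s: s.last_edited)


# Made-up sample data for illustration.
spaces = [
    WikiSpace("project-alpha", "a.smith", date(2008, 12, 1)),
    WikiSpace("old-committee", "b.jones", date(2007, 6, 30)),
]
for space in spaces_due_for_review(spaces, today=date(2009, 1, 16)):
    print(space.name, space.owner, space.last_edited)
```

The function only flags candidates for review; whether a flagged space is deleted, kept or migrated remains a decision for the space owner and the records manager.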

I found it very encouraging to see this pro-active and practical-minded approach to the management of wikis. In many ways Edinburgh’s RM approach vindicates a lot of the RM advice which we have recommended in the PoWR Handbook; as we say early on, we must manage resources in order to preserve them. It is also encouraging that in Edinburgh’s case at least the wiki problem is considered primarily in terms of information and staff management, and not exclusively in terms of the technological solutions that might be applied.

In particular:

1) Edinburgh: “Make sure wikis are retention scheduled”.

  • PoWR: “Deciding which aspects of your web resources to capture can be informed to a large extent by your Institutional drivers, and the agreed policies for retention and preservation.”  (p 22)

2) Edinburgh: “Consider the ideal retention period for a wiki”.

  • PoWR: “The attraction of bringing a website in line with an established retention and disposal programme is that it will work to defined business rules and retention schedules to enable the efficient destruction of materials, and also enable the protection and maintenance of records that need to be kept for business reasons.”  (p 93)

3) Edinburgh: “Make space owners aware that if not used their space may be deleted”.

  • PoWR: “Quite often in an academic context these applications rely on the individual to create and manage their own resources. A likely scenario is that the academic, staff member or student creates and manages his or her own external accounts in Flickr, Slideshare or WordPress.com; but they are not Institutional accounts. It is thus possible with Web 2.0 application for academics to conduct a significant amount of Institutional business outside of any known Institution network. The Institution either doesn’t know this activity is taking place, or ownership of the resources is not recognised officially. In such a scenario, it is likely the resources are at risk.”  (p 42)

4) Edinburgh: “The service policy talks about a default retention period.” This approach seems to incorporate rules as part of setting up any new wiki space, starting to manage the resource at the very beginning of the record’s lifecycle.

  • PoWR: “If we can apply a lifecycle model to web resources, they will be created, managed, stored and disposed of in a more efficient and consistent way; it can assist with the process of identifying what should and should not be retained, and why; and that in turn will help with making preservation decisions.” (p 34)

5) Edinburgh: “If we have anything that requires long term migration we would look into outward migration; either to a new system or to an archive.”

  • PoWR: “Migration of resources is a form of preservation. Migration means moving resources from one operating system to another, or from one storage/management system to another. This may raise questions about emulation and performance. Can the resource be successfully extracted from its old system, and behave in an acceptable way in the new system?”  (p 33)
  • “The usual aim of archival appraisal has been to identify and select records for permanent preservation. Quite often appraisal has taken place at the very end of the lifecycle process (although records managers intervene where possible at the beginning of the process, enabling records of importance to be identified early).”  (p 36)

Posted in Case studies, Policies, Preservation, Records management, Selection | No Comments »

Managing the Crowd: Rethinking records management for the Web 2.0 world

Posted by Marieke Guy on 19th November 2008

My review of the Steve Bailey text Managing the Crowd: Rethinking records management for the Web 2.0 world has now been published in the latest Ariadne magazine.

This text has been mentioned at PoWR workshops, on the PoWR blog and on the JISC Information Environment Team blog. I can honestly say that it has had quite an impact on my thinking with regard to preservation and Web 2.0 resources; other members of the PoWR team may agree.

As I say in the conclusion:

This book offers up much food for thought. Bailey wants to wake up and shake his community. He wants to make them see that all is not well in the records management world and that if they don’t start moving with the times then they will be pushed out of the way. He contends there is a very real possibility that records management as we know it will cease to exist; it will be outsourced.

Go on, have a read.

Posted in Future, Records management | No Comments »

Heritage Records and the Changing Filter through which we View our World

Posted by Marieke Guy on 11th August 2008

At both of the JISC-PoWR workshops delegates have been keen for the project team to spell out the reasons why institutions might want to preserve Web resources. These ‘drivers’ then give fuel to their case for the funds needed to archive the institutional Web site.

The idea of ‘heritage records’ is one that is often mentioned. Using Web sites as a ‘cultural snap shot’ has the potential to be a highly useful activity.

In his interesting and functional text Managing the Crowd: Rethinking Records Management for the Web 2.0 World Steve Bailey puts forward the point that deciding what will be important in the future is a tricky business. As he explains in the section on appraisal, retention and destruction: “The passage of time inevitably changes the filter through which we view our world and assess its priorities.”

Steve gives the example of the current plethora of Web sites that offer what we might call ‘quack’ remedies for medical problems. These sites may not seem to be of great interest right now but they may be invaluable to future historians who wish to demonstrate the distrust of the medical profession exhibited in 21st century western culture.

James Currall, in his plenary talk at the recent Institutional Web Management Workshop, used the example of blog posts made by soldiers out in Iraq and Afghanistan to demonstrate the irony of modern technology; these highly informative records could easily be lost while the diaries of World War II soldiers remain accessible.

Preservation mistakes have been made aplenty in the past. The destruction of many of the BBC’s flagship programmes in the 1970s has been well documented, and in 2001 the BBC launched a treasure hunt campaign to locate recordings of pre-1980 television or radio programmes. Ironically the Web site is no longer being updated, though it is still hosted on the BBC server.

So who can know what the future will bring? Which Web resources will we wish we had kept? Which student blog writer will go on to be a future prime minister or an infamous criminal? What bit of the terabytes is the most important?

As Steve Bailey points out, there is no crystal ball. It always has been, and always will be, very difficult to predict what resources may prove to be valuable to future generations.

Although this offers little recompense for those making these choices, it does at least argue the case that we do need to preserve and we need to do so soon.

Posted in Challenges, Preservation, Records management, Web 1.0 | 2 Comments »

When do we Fixity?

Posted by Marieke Guy on 14th July 2008

Records Management has a concept of record declaration. This is the point when we “draw a metaphorical line in the sand and fix the content of a record” (see the JISC InfoKit on Records Management, which also uses the term ‘fixity’ in this context).

Most electronic records management systems (ERMS) provide users with the ability to perform this declaration automatically. When they do so, the digital content they have created (e-mail, document or whatever) becomes ‘fixed’. UK Government have called this creating ‘locked down and secure’ records, a necessary step for ensuring their authenticity and reliability.
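One common way to make ‘fixed’ content verifiable is to store a checksum at the moment of declaration. The sketch below is only an illustration of that idea, not a description of how any particular ERMS works; the `DeclaredRecord` class and the function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: the declaration itself cannot be altered
class DeclaredRecord:
    identifier: str
    declared_at: datetime
    sha256: str


def declare(identifier: str, content: bytes) -> DeclaredRecord:
    """Fix the content by recording when it was declared and its SHA-256 digest."""
    digest = hashlib.sha256(content).hexdigest()
    return DeclaredRecord(identifier, datetime.now(timezone.utc), digest)


def is_unchanged(record: DeclaredRecord, content: bytes) -> bool:
    """Check whether content still matches what was declared."""
    return hashlib.sha256(content).hexdigest() == record.sha256


# Made-up example: a post is declared, then edited afterwards.
record = declare("blog-post-1", b"Original text of the post")
print(is_unchanged(record, b"Original text of the post"))
print(is_unchanged(record, b"Original text of the post, edited later"))
```

A checksum only shows that something has changed since declaration; it does not say what changed or why, so it complements rather than replaces a policy on when records are declared.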

But ERM systems seem to work best with static documents; authors of reports, for example, understand that a good time to declare their report as a record is when the final approved version has been accepted. Yet one of the distinctive features of Web 2.0 content is that the information is very fluid, and often there is no obvious point at which to draw this line and fix content.

One example might be blog posts. These can receive comments from the moment they are posted and well into the future. Not only this, but many bloggers go back and edit previous posts and delete comments. This matter was recently discussed on Brian Kelly’s UK Web Focus blog. Phil Wilson asked:

“Brian, is there any reason you never modify or update your posts when you’ve made an error, and instead make users plough through the comments to see if anything you’ve said is wrong?” (UK Web Focus Blog)

Brian’s response was that he sometimes fixes typos and layout issues but is:

“reluctant to change the meaning of a published post, even (or perhaps especially) if I make mistakes. In part I don’t want to undermine the authority of any comments or the integrity of any threaded discussions.”

Brian is open about this in his blog policy stating that only in exceptional circumstances will postings and comments be deleted.

Concerns about censorship and bloggers deleting posts/comments were also recently made in responses to What is fair play in the blogo/commentosphere? on Nature’s Blog.

Assuming that blog posts are to be included within a records management programme or a preservation programme, the issues described above might cause problems for those attempting to preserve authentic and reliable Web resources.

One approach is to be explicit in your Web Resource Preservation strategy about when you freeze Web resources for preservation, and the implications of doing so.

Another approach might involve an agreed institutional policy such as Brian has, but with an additional form of wording that is explicit about the status of blog posts as records, including when and how they should be declared as records, and whose responsibility it is to do so. Should selected blog posts be declared as records by their owners into the ERMS? Or will they all be harvested by an automated capture programme, and if so, how frequently?

Any thoughts?

Posted in Challenges, Records management, Web 2.0 | 1 Comment »

Seeing Eye to Eye: Web Managers and Records Managers

Posted by Marieke Guy on 25th June 2008

The technological and cultural changes brought about by the advancement of the Web have, on numerous occasions, required co-ordinated interdisciplinary work. One of the intended aims of the JISC-PoWR project is to help to bring together the differing perspectives of information professionals such as records managers and Web managers in the context of the preservation of Web resources – and there are probably at least four sets of expertise involved: Web content creation (as perceived by Web authors), Web content management from a technical perspective (as perceived by those who choose or configure the underlying software), records and/or information management, and digital preservation. So there’s the bringing together of intellectual perspectives (What content needs to be preserved? How long for? Who is responsible?) and there are the technical perspectives, assuming that the above questions come up with anything that needs preserving (How do we do it? Are site-level tools more appropriate than national services? Does CMS X make preservation easier or harder than CMS Y? Is a more accessible site also a more preservable one? Are there configuration choices that affect preservation without (significantly) affecting other aspects of management?)

Within the JISC-PoWR team there have been a number of interesting discussions that have highlighted how differently the different players see Web preservation. To quote Ed Pinsent:

“The fundamental thing here is bringing together two sets of information professionals from differing backgrounds who, in many cases, don’t tend to speak to each other. Many records managers and archivists are, quite simply, afraid of IT and are content to let it remain a mystery. Conversely, it is quite possible to work in an IT career path in any organisation (not just HE/FE) and never be troubled by retention or preservation issues of any sort.”

The cliched view might regard Web managers as concerning themselves primarily with the day to day running of an organisation’s Web site, with preservation as an afterthought, and records managers as focussing mainly on the preservation of resources and failing to understand some of the technical challenges presented. And although this may be a superficial description of the complexities of the ways in which institutions go about the management of their digital resources, perhaps, like many cliches, there could be an element of truth in such views.

Read the rest of this entry »

Posted in Challenges, Preservation, Records management, Web 1.0 | 2 Comments »

Digital preservation in a nutshell, part II

Posted by Ed Pinsent on 10th June 2008

As Richard noted in Part I, digital preservation is a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” But what sort of digital materials might be in scope for the PoWR project?

We think it extremely likely that institutional web resources are going to include digital materials such as “records created during the day-to-day business of an organisation” and “born-digital materials created for a specific purpose”.

What we want is to “maintain access to these digital materials beyond the limits of media failure or technological change”. This leads us to consider the longevity of certain file formats, the changes undergone by proprietary software, technological obsolescence, and the migration or emulation strategies we’ll use to overcome these problems.

Read the rest of this entry »

Posted in Digital preservation, Records management | No Comments »

Records management: are websites records?

Posted by Ed Pinsent on 14th May 2008

I’ve been asked to contribute a post or two on Records Management (RM) and preservation, as it relates to this project. There was certainly a paragraph or two in the original JISC Invitation To Tender that made me think it was possible to apply aspects of Records Management to this task:

“A formal records management approach may be the appropriate response to handling certain kinds of web material and this assumes active management of the material throughout its whole life-cycle, an approach which is intrinsic to all preservation-related approaches.”

We think it’s quite likely this line of thinking may apply to University and HFE websites, if the institutional website is being used as a place to conduct business; or is developing into an area where corporate records are being created and stored. A webmaster or a records manager could start to ask a few questions about the organisation’s website:

Read the rest of this entry »

Posted in Records management | No Comments »