Category Archives: Challenges

ArchivePress: When One Size Doesn’t Fit All

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary – scholarly, actual, political, or personal – which merit keeping no less than famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
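To make the feed-reading idea concrete, here is a minimal sketch in Python rather than in WordPress itself. It is not ArchivePress code: the sample feed, the `blog.example.org` URLs and the function names `parse_feed`, `harvest` and `select` are all assumptions made for illustration. It reads an RSS 2.0 feed into structured records, adds only posts it has not seen before, and then selects posts by author or category, much as you would on a live blog.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Dublin Core element that many blog feeds use for the author name.
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"

# Hypothetical feed content; a real harvester would fetch this over HTTP.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/first-post</link>
      <guid>http://blog.example.org/?p=1</guid>
      <pubDate>Mon, 01 Jun 2009 09:30:00 +0000</pubDate>
      <dc:creator>alice</dc:creator>
      <category>preservation</category>
      <category>blogs</category>
      <description>Hello world</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/second-post</link>
      <guid>http://blog.example.org/?p=2</guid>
      <pubDate>Tue, 02 Jun 2009 14:00:00 +0000</pubDate>
      <dc:creator>bob</dc:creator>
      <category>wikis</category>
      <description>More thoughts</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Turn each <item> in an RSS 2.0 feed into a dictionary of post fields."""
    posts = []
    for item in ET.fromstring(xml_text).iter("item"):
        posts.append({
            "guid": item.findtext("guid"),
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "author": item.findtext(DC_CREATOR),
            "published": parsedate_to_datetime(item.findtext("pubDate")),
            "categories": [c.text for c in item.findall("category")],
            "body": item.findtext("description"),
        })
    return posts


def harvest(archive, xml_text):
    """Add posts not yet in the archive (keyed by guid); return how many were new."""
    added = 0
    for post in parse_feed(xml_text):
        if post["guid"] not in archive:
            archive[post["guid"]] = post
            added += 1
    return added


def select(archive, author=None, category=None):
    """Return archived posts matching the given author and/or category, oldest first."""
    matches = [
        p for p in archive.values()
        if (author is None or p["author"] == author)
        and (category is None or category in p["categories"])
    ]
    return sorted(matches, key=lambda p: p["published"])
```

Because posts are keyed by their feed identifier, running `harvest` repeatedly over the same feed adds nothing new, which is the contrast with crawl snapshots that store duplicate copies of one view of the data. Note that this sketch ignores later edits to a post and does not collect comments.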

In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.

ArchivePress has its own website and blog, which will build a cumulative picture of its results and the issues it encounters over the next 6 months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

Archiving a wiki

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library Life and Life2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out if you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led to me making a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is because UKWAC users, the target audience for the archived copy – or the designated user community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.
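A selection decision like this usually ends up as a set of URL exclusion rules for the crawler. The sketch below, in Python, shows the general shape of such a rule set; it is not Web Curator Tool's configuration syntax, and the host `wiki.example.org`, the `/wiki/` article path and the function names are assumptions for illustration. It relies on common MediaWiki URL conventions: an `action=` parameter for history and edit views, `oldid=` and `diff=` for old revisions, and `Talk:` and `Special:` namespace prefixes for discussion and generated pages.

```python
from urllib.parse import urlsplit, parse_qs, unquote

# Views we chose not to capture: edit history, edit forms and raw wikitext.
EXCLUDED_ACTIONS = {"history", "edit", "raw"}


def page_title(url):
    """Extract the page title from either ?title=... or a /wiki/Title style path."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if "title" in query:
        return query["title"][0]
    # Assumes the wiki serves articles under /wiki/ (a common, not universal, setup).
    return unquote(parts.path.split("/wiki/", 1)[-1])


def should_capture(url):
    """Return True if the URL looks like finished content rather than wiki machinery."""
    query = parse_qs(urlsplit(url).query)
    if query.get("action", ["view"])[0] in EXCLUDED_ACTIONS:
        return False
    if "oldid" in query or "diff" in query:
        return False  # old revisions and comparisons between revisions
    title = page_title(url)
    if ":" in title:
        namespace = title.split(":", 1)[0]
        # Special: pages are generated; Talk: and *_talk: pages are discussion.
        if namespace in ("Special", "Talk") or namespace.endswith("_talk"):
            return False
    return True
```

For example, `should_capture("http://wiki.example.org/wiki/Main_Page")` is true, while the same page with `&action=history` appended to its `index.php?title=Main_Page` form is rejected. If the policy decision later changes, say to keep the edit history after all, only the rule set needs to change.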

This selection aspect led to Maureen Pennock’s reply, which is a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.

Future in Bits

The BBC News Web site has published an interesting article entitled Future in Bits, asking how the ever-changing Web can be archived, bearing in mind the dilemma posed by the malleable nature of digital information.

The article draws attention to the fact that no UK-based commercial online newspapers are currently being archived.

David Stuart, a research fellow in Web 2.0 Technologies at the University of Wolverhampton is quoted as saying:

“The lack of an exhaustive archive of the UK web space not only risks the loss of information on web pages that are changed or taken down,” he said. “It also undermines the value of pages that link to them; the value of the web comes as much from the hyperlinks between pages as the contents of the web pages. This is especially true in the blogosphere, where so much of the content created by the public is built upon the foundations of traditional news stories.”

Jessie Owen, digital continuity project manager at the National Archives explains that the key to archiving is preparation.

This is something the JISC PoWR handbook can offer help with.

Unpublishing the Web

There was an interesting editorial by Siobhan Butterworth in Monday’s Guardian about ‘unpublishing’ – removal of content once placed on the Internet.

Siobhan explains:

Judging from the numbers of emails I get from people asking for material to be removed from the Guardian’s electronic archive, it seems that some people still don’t fully understand the implications of speaking to or even writing for a news organisation in the web age.

She goes on to argue that:

The web makes a lie of the old cliché that today’s newspaper pages are tomorrow’s fish and chip wrapping. Nowadays, as I’ve said before, the things you say about yourself in a newspaper are more like tattoos – they can be extremely difficult to get rid of.

It seems a good rule to set yourself when publishing content (or allowing content to be published about you) on the Web (and the same rule could apply to all emails sent) is: Are you happy for the whole world to see this?

The concepts that what you publish can be seen by all and that nothing truly disappears from the Web have slowly begun to embed themselves in our consciousness. This has been fuelled by a number of horror stories about employers accessing the Facebook (and Flickr and other social networking sites…) accounts of prospective employees. A New York Magazine article from February this year quoted a teenager as saying “If I don’t delete it, I’m still gonna be there. My generation is going to have all this history; we can document anything so easily.” Many people do realise that the off-hand comments and inappropriate photos we blog or publish can come back to haunt us.

While in some ways this might seem to be the flip side of what JISC PoWR is about, deletion is very much part of a preservation strategy.

So it pays to remember that:

  • stuff can disappear, and quite often it is the really good stuff we wish we’d held on to.
  • stuff that we wish would come out in the wash can stain for good.

So maybe we need to give some thought to how (and whether) things should be ‘unpublished’. What do people think?

Heritage Records and the Changing Filter through which we View our World

At both of the JISC-PoWR workshops delegates have been keen for the project team to spell out the reasons why institutions might want to preserve Web resources. These ‘drivers’ then give fuel to their case for the funds needed to archive the institutional Web site.

The idea of ‘heritage records’ is one that is often mentioned. Using Web sites as a ‘cultural snap shot’ has the potential to be a highly useful activity.

In his interesting and functional text Managing the Crowd: Rethinking Records Management for the Web 2.0 World Steve Bailey puts forward the point that deciding what will be important in the future is a tricky business. As he explains in the section on appraisal, retention and destruction: “The passage of time inevitably changes the filter through which we view our world and assess its priorities.”

Steve gives the example of the current plethora of Web sites that offer what we might call ‘quack’ remedies for medical problems. These sites may not seem to be of great interest right now but they may be invaluable to future historians who wish to demonstrate the distrust of the medical profession exhibited in 21st century western culture.

James Curral, in his plenary talk at the recent Institutional Web Management Workshop, used the example of blog posts made by soldiers out in Iraq and Afghanistan to demonstrate the irony of modern technology; these highly informative records could easily be lost while the diaries of World War II soldiers remain accessible.

Preservation mistakes have been made aplenty in the past. The destruction of much of the BBC’s flagship programmes in the 1970s has been well documented, and in 2001 the BBC launched a treasure hunt campaign to locate recordings of pre-1980 television or radio programmes. Ironically the Web site is no longer being updated, though it is still hosted on the BBC server.

So who can know what the future will bring? Which Web resources will we wish we had kept? Which student blog writer will go on to be a future prime minister or an infamous criminal? What bit of the terabytes is the most important?

As Steve Bailey points out, there is no crystal ball. It always has been, and always will be, very difficult to predict what resources may prove to be valuable to future generations.

Although this offers little recompense for those making these choices, it does at least argue the case that we do need to preserve and we need to do so soon.

When do we Fixity?

Records Management has a concept of record declaration. This is the point when we “draw a metaphorical line in the sand and fix the content of a record” (see the JISC InfoKit on Records Management, which also uses the term ‘fixity’ in this context).

Most electronic records management systems (ERMS) provide users with the ability to perform this declaration automatically. When they do so, the digital content they have created (e-mail, document or whatever) becomes ‘fixed’. UK Government have called this creating ‘locked down and secure’ records, a necessary step for ensuring their authenticity and reliability.

But ERM systems seem to work best with static documents; authors of reports, for example, understand that a good time to declare their report as a record is when the final approved version has been accepted. Yet one of the distinctive features of Web 2.0 content is that the information is very fluid, and often there is no obvious point at which to draw this line and fix content.

One example might be blog posts. These can receive comments from the moment they are posted and well into the future. Not only this, but many bloggers go back and edit previous posts and delete comments. This matter was recently discussed on Brian Kelly’s UK Web Focus blog. Phil Wilson asked:

“Brian, is there any reason you never modify or update your posts when you’ve made an error, and instead make users plough through the comments to see if anything you’ve said is wrong?” (UK Web Focus Blog)

Brian’s response was that he sometimes fixes typos and layout issues but is:

“reluctant to change the meaning of a published post, even (or perhaps especially) if I make mistakes. In part I don’t want to undermine the authority of any comments or the integrity of any threaded discussions.”

Brian is open about this in his blog policy stating that only in exceptional circumstances will postings and comments be deleted.

Concerns about censorship and bloggers deleting posts/comments were also recently made in responses to What is fair play in the blogo/commentosphere? on Nature’s Blog.

Assuming that blog posts are to be included within a records management programme or a preservation programme, the issues described above might cause problems for those attempting to preserve authentic and reliable Web resources.

One approach is to be explicit in your Web Resource Preservation strategy about when you freeze Web resources for preservation, and the implications of doing so.
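As a rough illustration of what freezing a blog post might record, the Python sketch below captures the content together with who declared it and when, plus a SHA-256 checksum so that later changes to the live post can be detected. The posts above do not prescribe any mechanism; the checksum is a common digital preservation technique assumed here, and the function names, field names and example URL are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def _digest(content):
    """SHA-256 checksum of the text, as a hexadecimal string."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def declare_record(identifier, content, declared_by):
    """Fix a copy of the content at this moment, with who declared it and when."""
    return {
        "identifier": identifier,
        "declared_by": declared_by,
        "declared_at": datetime.now(timezone.utc).isoformat(),
        "sha256": _digest(content),
        "content": content,
    }


def is_unchanged(record, current_content):
    """Compare the live content with the checksum taken at declaration."""
    return _digest(current_content) == record["sha256"]


# Usage: declare a post, then check it after the author has edited it.
record = declare_record(
    "http://blog.example.org/?p=42",          # hypothetical post identifier
    "Original text of the post.",
    declared_by="records.manager",
)
print(is_unchanged(record, "Original text of the post."))        # True
print(is_unchanged(record, "Original text, quietly corrected."))  # False
```

A check like this only tells you that the live post has drifted from the declared record; whether to re-declare it, and whether comments count as part of the record, remain policy questions.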

Another approach might involve an agreed institutional policy such as Brian has, but with an additional form of wording that is explicit about the status of blog posts as records, including when and how they should be declared as records, and whose responsibility it is to do so. Should selected blog posts be declared as records by their owners into the ERMS? Or will they all be harvested by an automated capture programme, and if so, how frequently?

Any thoughts?

Getting Institutional Buy-in For Web Site Preservation

One of the risks we identified when we wrote the bid for the JISC PoWR project was that those involved in providing institutional Web services would not be interested in issues related to preservation. Surely not, you may feel if you’re a records manager. And if you are involved in providing institutional Web services you may be reluctant to confess to being less than fully committed to an area which does seem worthy. But, to be honest, Web managers may not have a particularly strong interest in this topic. And if this is the case, it will be difficult to persuade them of the need to invest resources in this area and to gain the necessary commitment from senior managers and policy makers. Without these issues being addressed it seems to me that we’re unlikely to make any significant changes to institutional approaches to Web site preservation.

So I was very pleased to read Alison Wildish’s blog post entitled “Web Preservation: should we make the time?”. In this post Alison (head of Web Services at the University of Bath) described the case study which she and Lizzie Richard (Archivist, Records Manager and FOI Coordinator at the University of Bath) presented at the first JISC PoWR workshop. Alison described how:

Neither of us felt web preservation was something we had expertise in nor the time (and for me the inclination) to fully explore this. Web preservation was something we could see as being useful (in the future) but I think we both felt it wasn’t a priority.

The good news is that the discussions Alison and Lizzie had after I introduced them to each other and invited them to participate in the JISC PoWR project have helped them to further their understanding of Web site preservation:

Simply discussing preservation (from both sides of the fence) taught us a lot. We discovered the risks involved in simply side-lining it; the potential gap in University history and the benefits of embedding preservation into our digital strategy.

And now that Alison and Lizzie are more aware of the need to have a policy on Web site preservation, they are in a position to start working on one:

So is it something we should make time for? Yes I believe it is.

The JISC PoWR project is starting to deliver its goals of engaging the key stakeholders, making them better aware of the challenges in preserving Web sites but also willing to address those challenges 🙂

And I’m pleased to say that Alison has made the slides used at the workshop available on Slideshare – well worth viewing, especially if you are a records manager who is “a paper person [and] have enough trouble trying to preserve hard copy records without having to worry about the web … [who] can see the value in theory, but in practice it’s too huge [and] guess it might be a good idea, but no one much cares what I think. I am interested though…” or a Web person who has the view that “In all honesty it isn’t interesting to me… We struggle to keep the site current – never mind thinking about preserving the old stuff. I am future watching… need to know what to bring in, not how to keep hold of the past. Why is it something I should think about now? I’m not really that interested”.

Before and after

Many good ideas are already coming out of the first PoWR workshop. One thing I personally found illuminating was one of the breakout sessions which I facilitated, called ‘The history of the Institution’s home page’.

The scenario required an institution to provide ‘examples of how the web site has developed since it was launched’. (Brian Kelly has already broached this topic here). My colleagues discussed this conundrum with great vigour, but most had to admit they drew a blank when it came to producing even a basic screenshot from five years ago. Instead, there was a lot of ‘folk memory’ and anecdotal evidence, also sometimes called ‘tacit knowledge’. It was clear that everyone was aware that their web site had changed (and sometimes even improved) quite dramatically in the last 5-10 years. It’s just that no-one could lay their hands on any evidence of the changes. Imagine it like one of those ‘before and after’ reveals which you might get on Changing Rooms on TV. The problem with web sites is that we can’t always lay hands on the ‘before’ picture.

Some drivers for changes included:

  • Corporate or institutional rebranding
  • Move to a Content Management System
  • Content provider change
  • External consultancy

And the following sorts of web site elements were subject to change:

  • Design, branding, colours, logos
  • Accessibility
  • Language
  • Content – obviously(!) – but do we know what content was added and what was thrown away?
  • Navigation – clicks and links became more consistent across the site
  • More pages – a site which used to be one ‘main page’ for the entire institution is now much larger and more complex, and every Department in the University now has a suite of pages
  • More interactive elements, including automated forms
  • Site became more searchable
  • More media, video and audio content was embedded
  • Distinction between internal and external content became more pronounced

It’s also interesting that one of our colleagues thought that the CMS also added some constraints to working; what once was easy is now much more difficult. I’m sure this is a trade-off we find with most new ways of working and it isn’t meant to imply that a CMS is always evil.

Kevin Ashley heard this summary with some interest and jokingly suggested that perhaps some form of oral history was the relevant preservation solution here. Yet if we have no other evidence of a web site’s history, who knows – it may yet turn out to be the last resort.

What can PoWR do for you?

Web preservation is a big topic and we’re not even pretending to deal with all of it. The aspect that we care about – that JISC believes the community is looking for help with – is fairly well-defined. We want to help institutions make effective decisions about preserving web resources, and help them implement those decisions in a way that is cost-effective and non-disruptive.

Making effective decisions
At its simplest level, this means deciding what to keep and what not to keep. There may be many drivers for these decisions – institutional policy, legal requirements and research interests are just a few. The decisions need to relate not just to what is to be kept, but why and who for. That’s because those requirements may have a bearing on how you choose to go about the job, or whose responsibility it is to carry it out. Not everything needs to be kept, and even when it does, it may not be your institution’s responsibility to keep it.

Implementing those decisions
Carrying out your decisions – keeping things, throwing things away, or ensuring that other people keep things – can be the trickiest part of the process. You may know you want to preserve the prospectus for past years, but can you be sure that your CMS, or the Internet Archive, or some local use of web-harvesting tools is going to do this job effectively for you? You may be being told that some part of your web infrastructure would be easier to preserve if you avoided the use of certain features, or used a different authoring system. Is that true, and if it is, what are the negative consequences of such decisions?

The handbook, which will be one of the project’s outputs, will attempt to answer these questions in a way that makes sense to everyone who might be involved in the process. We want to help to make it easier to take decisions about preservation and to know what tools, systems or working methods can be employed to help you implement them.

The workshops are the primary mechanism we’re using to test whether the handbook makes sense to the people it’s aimed at, and whether it tackles the problems that people are actually facing.

Seeing Eye to Eye: Web Managers and Records Managers

The technological and cultural changes brought about by the advancement of the Web have, on numerous occasions, required co-ordinated interdisciplinary work. One of the intended aims of the JISC-PoWR project is to help to bring together the differing perspectives of information professionals such as records managers and Web managers in the context of the preservation of Web resources – and there are probably at least four sets of expertise involved: Web content creation (as perceived by Web authors), Web content management from a technical perspective (as perceived by those who choose or configure the underlying software), records and/or information management, and digital preservation. So there’s the bringing together of intellectual perspectives (What content needs to be preserved? How long for? Who is responsible?) and there are the technical perspectives, assuming that the above questions come up with anything that needs preserving (How do we do it? Are site-level tools more appropriate than national services? Does CMS X make preservation easier or harder than CMS Y? Is a more accessible site also a more preservable one? Are there configuration choices that affect preservation without (significantly) affecting other aspects of management?)

Within the JISC-PoWR team there have been a number of interesting discussions that have highlighted how differently the different players see Web preservation. To quote Ed Pinsent:

“The fundamental thing here is bringing together two sets of information professionals from differing backgrounds who, in many cases, don’t tend to speak to each other. Many records managers and archivists are, quite simply, afraid of IT and are content to let it remain a mystery. Conversely, it is quite possible to work in an IT career path in any organisation (not just HE/FE) and never be troubled by retention or preservation issues of any sort. “

The cliched view might regard Web managers as concerning themselves primarily with the day to day running of an organisation’s Web site, with preservation as an afterthought, and records managers as focussing mainly on the preservation of resources and failing to understand some of the technical challenges presented. And although this may be a superficial description of the complexities of the ways in which institutions go about the management of their digital resources, perhaps, like many cliches, there could be an element of truth in such views.
