JISC-PoWR

Preservation of Web Resources: a JISC-sponsored project

Archive for the 'Web 2.0' Category

What’s the average lifespan of a Web page?

Posted by Marieke Guy on 12th August 2009

…or is it easier to ask how long is a piece of string?

The statistic much bandied about (for Web pages, not pieces of string!) is 44 days, believed to originate in an article by Brewster Kahle (of Internet Archive fame) published in 1997 and titled Preserving the Internet. Brewster’s original quote is specifically about URLs: “…estimates put the average lifetime for a URL at 44 days.”

Whether this figure still stands today is a matter currently being discussed on the CURATORS@LIST.NETPRESERVE.ORG list after a query from Abigail Grotke of the Library of Congress.

Abbie offered up the 44 day statistic and pointed out that on the Digital Preservation Web site they have a graphic that discusses Web volatility, stating “44% of the sites available on the internet in 1998 had vanished one year later”.

The other figure often cited is 75 days, from Michael Day’s report Collecting and preserving the world wide web.

The dynamic nature of the Web means that pages and whole sites are continually evolving, meaning that pages are frequently changed or deleted. Alexa Internet once estimated that Web pages disappear after an average time of 75 days. (Lawrence, et al.,2001, p. 30).

Another figure sometimes suggested is 100 days; this seems to come from Rick Weiss’s article for The Washington Post (Washington, DC, 24 November 2003), On the Web, Research Work Proves Ephemeral, which is no longer available.

So what is the average lifespan of a Web page today? Is it getting shorter or longer? The Internet Archive now gives 44-75 days as its ballpark figure. I’d have to hazard a guess that with the rise in use of Web 2.0 technologies the Web is actually getting more transient by the day.

Is this OK?

Maybe, if it’s just a tweet you sent your friend; however, if it’s something more substantial that’s disappearing then it’s a real worry.

Posted in Web 1.0, Digital preservation, Web 2.0 | 4 Comments »

An Archive Of IWMW 2009 Tweets

Posted by Brian Kelly on 7th August 2009

In a recent blog post entitled Tools For Preserving Twitter Posts I described some of the Twitter preservation tools we were planning to use to keep a record of the tweets related to UKOLN’s recent IWMW 2009 event.

Twitter proved very popular during this annual event for institutional Web managers, with over 1,500 Twitter posts (tweets) being published during the last week of July. Further statistical information is provided in a post on Evidence on Use of Twitter for Live Blogging.

We suggested that a two-character code (P1 to P8) could be used to identify each plenary session and that using this as a hashtag in conjunction with the event’s hashtag (#iwmw2009) would enable the tweets about a particular talk to be easily identified and, in theory, this data migrated to a managed environment.
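The pairing of an event hashtag with a session code can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool we used: the function names and the sample tweets are invented for the example, and it assumes the tweets are already available as plain text strings.

```python
import re

EVENT_TAG = "iwmw2009"                           # event hashtag, without the '#'
SESSION_CODES = {f"P{n}" for n in range(1, 9)}   # plenary codes P1..P8

HASHTAG = re.compile(r"#(\w+)")


def session_codes(text):
    """Return the plenary codes tagged in a tweet, e.g. {'P1'}.

    An empty set is returned when the event hashtag is missing, so a
    stray '#P1' from an unrelated conversation is ignored.
    """
    tags = {tag.lower() for tag in HASHTAG.findall(text)}
    if EVENT_TAG not in tags:
        return set()
    return {code for code in SESSION_CODES if code.lower() in tags}


def group_by_session(tweets):
    """Group tweet texts under each plenary code they mention."""
    grouped = {}
    for text in tweets:
        for code in session_codes(text):
            grouped.setdefault(code, []).append(text)
    return grouped


# Hypothetical sample data, made up for this sketch.
sample_tweets = [
    "Great opening plenary #iwmw2009 #P1",
    "Coffee queue is long #iwmw2009",
    "Unrelated chatter #P1",
]
```

Calling `group_by_session(sample_tweets)` would file only the first tweet under `P1`; the second has no session code and the third lacks the event hashtag.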

As an example you can search for the tweets related to:

We have recently used The Archivist desktop application to create a local copy of the tweets for the plenary  talks at the conference, and these have been made available on the IWMW 2009 Web site from the individual pages for the plenary talks (e.g. see the page for Derek Law’s opening plenary talk). The pages also contain a summary of the number of Twitter posts which were found using the tool.

One reason for wishing to do this is to provide an answer to the speaker who may ask “I Wonder What They Thought About My Session?“.

Posted in Web 2.0 | 2 Comments »

Preservation and Google Wave

Posted by Brian Kelly on 3rd August 2009

A number of scientists have written enthusiastic blog posts about the potential of Google Wave including Peter Murray-Rust, Cameron Neylon and several others. A post entitled Google Wave: possibilities for librarians on the Rambling Librarian blog provides a useful summary of Google Wave and how it aims to provide a response to the question “What might email be like if it was invented today?”

The Rambling Librarian post also picks up on the important “implication … that digital preservation will be even more critical. Imagine all the collaborative efforts gone when the server crashes. Or power fails.”

Absolutely! And let’s ensure that the digital preservation aspects are being considered right at the start of any development activities rather than being ignored by those focussing on the new possibilities which this technology can provide.

Hmm, I wonder if there are any funding possibilities available for exploring the preservation aspects of Google Wave?

Posted in Web 2.0 | No Comments »

Preservation Policies for Content Hosted on Third Party Blogs

Posted by Brian Kelly on 15th July 2009

We know that a variety of externally-hosted Web 2.0 services are being used to support institutional aims. But what about the associated risks of loss of important resources?  One approach to such risk would be to ban use of such services. But this is to ignore the benefits that such services provide and is likely to alienate users of such services if it were possible to implement such a policy.

The approach taken by the JISC PoWR project (and described in more detail in the JISC PoWR handbook - see particularly chapter 13) has been to recognise that there are legitimate reasons to make use of such services and to look at ways in which content hosted on such services can be managed and curated in the long term.

The need to do this is of relevance to UKOLN which provides a number of blogs on externally-hosted services including the UK Web Focus blog, provided by Brian Kelly, and the Ramblings of a Remote Worker blog, provided by Marieke Guy.

The first stage is to define and publicise a policy covering the long-term access to the content of these two blogs, including what will happen if either of the authors leaves UKOLN.

Policies posted on the UK Web Focus and Ramblings of a Remote Worker blogs state that:

  • A rich copy of the contents of the blog will be made available to UKOLN (my host organisation) if I leave. Note that this may not include the full content if there are complications concerning third party content (e.g. guest blog posts, embedded objects, etc.), technical difficulties, etc.
  • Since the blog reflects personal views I reserve the rights to continue providing the blog if I leave UKOLN. If this happens I will remove any UKOLN branding from the blog.

These two simple statements can help, we feel, in ensuring that the content can be managed if the blog authors leave (or if they fall ill, go crazy or die!). The statements seek to avoid uncertainties regarding what can be done with the content.  The second statement also clarifies that if the authors were to leave, they may wish to continue using the blog.

It may be argued that since both blogs make their content available under a Creative Commons licence this already grants the host institution, along with anyone else, the rights to preserve the content. This may be true, but there is no harm in making this explicit, we feel.

Would it not be legitimate for organisations to expect their employees to make similar statements which clarify the ownership of content hosted on Web 2.0 services and created as a normal course of one’s duties?


Note: This blog post has been written to support a poster which will be provided for the Missing links: the enduring web conference. The poster, which has been produced by Marieke Guy and Brian Kelly, UKOLN, is entitled “Preservation Policies and Approaches for Use of Web 2.0 Services“. A series of blog posts published on this blog provide more detailed information of the content summarised in the poster.

Posted in missinglinks09mg, missinglinks09, Web 2.0 | No Comments »

Tools For Preserving Twitter Posts

Posted by Brian Kelly on 14th July 2009

I recently described some use cases in a post entitled Some Use Cases For Preserving Twitter Posts, including preservation of an organisation’s digital memory and preservation of data for subsequent data mining. The post, however, failed to include perhaps the most obvious example: preservation of Twitter posts (tweets) related to an event.

In response to that post a number of solutions for preserving tweets were suggested including FriendFeed, the WordPress Lifestream plugin and What the Hashtag. In addition following a tweet I posted I received details of the Tweetdoc service.

With this year’s Institutional Web Management Workshop (IWMW 2009) rapidly approaching it is timely to decide the tools we’ll be using to preserve the discussions associated with this event. We began keeping a record of the amplification of the IWMW event back in 2005 when an IRC channel was provided for use by the small numbers of participants who had a laptop and WiFi connectivity at the event. The IWMW 2005 event took place during the 7/7 bombings in London and a record of the awareness of what was happening can be gleaned from the archive of the IRC discussions.

This year we will once again be making use of Twitter and will be capturing tweets which contain the event hashtag #iwmw2009. The tools we are currently intending to use are What the Hashtag and the Tweetdoc service.

[Image: Use of wthashtags service to view Twitter posts about #iwmw2009]

Use of What the Hashtag to view tweets containing the #iwmw2009 tag is illustrated.

As can be seen the output can be accessed as an RSS feed. In addition the data can be viewed as an HTML resource, and a date range can also be supplied.

We intend to capture the tweets about the IWMW 2009 event after the event is over, and store the data on the UKOLN Web site, in order to avoid dependencies on the What the Hashtag service itself.

We will also explore other services, such as Tweetdoc - although in this case as the data is only available as a PDF resource, it is not well-suited to provide data for analysis by other services.

Are there any other services we should be looking at? And what functionality might be desirable for a more generic service for preserving tweets? Any thoughts?


Note: This blog post has been written to support a poster which will be provided for the Missing links: the enduring web conference. The poster, which has been produced by Marieke Guy and Brian Kelly, UKOLN, is entitled “Preservation Policies and Approaches for Use of Web 2.0 Services“. A series of blog posts published on this blog provide more detailed information of the content summarised in the poster.

Posted in missinglinks09mg, missinglinks09, Web 2.0 | 8 Comments »

Some Use Cases For Preserving Twitter Posts

Posted by Brian Kelly on 26th June 2009

I’ve come across two schools of thought regarding the preservation of Twitter posts (tweets). From Twitter fans, it’s a really useful form of informal conversation, with a particular strength being the way in which messages quickly disappear. It’s like having conversations in your favourite bar - and we wouldn’t want such conversations preserved, so why would we wish to preserve tweets? Sceptics of Twitter might regard Twitter as inane babble in a bar and would agree that it’s pointless to preserve it.

However this week I have come across a couple of examples which illustrate why it might be useful to preserve tweets.

Preservation of Twitter data for research purposes

I recently described how Twitterers Subvert[ed] Daily Mail’s Racist Poll. For me that provided a fascinating example of how Twitter can be used by activists to support a viral campaign (in this case to undermine the racist sentiments in the question “Should the NHS allow gipsies to jump the queue?“). But why did this poll succeed in attracting a large turnout (with the poll being undermined with around 94% voting yes before the Daily Mail removed the poll) whereas a subsequent poll on Should immigrants be forced to respect British culture? failed to arouse much interest?

Since many of the tweets and retweets provided a link to “http://bit.ly/w4b6Q” we can look at the statistics provided by the bit.ly short URL service, which tells us that there were 48,796 clicks on the link.

[Image: Statistics for clicks to Daily Mail poll]

As shown in the histogram, and as might be expected, the vast majority of clicks to the poll took place on 19 June 2009, the day the poll was the hot topic.

But who created the interest in the Twitterverse originally? Was it, as I speculated in my post, the UK-based psychologist community? Or were there a number of starting points? How was the message communicated? Who were the ‘influencers’ (to use the terminology from a recent blog post by Martin Weller) whose tweets were felt to contain links worth following? Who were the ‘hubs’ who were responsible for communicating the message to a wide audience? And was Twitter actually the main driver or were there other viral mechanisms in operation as Josh suggested, who went on to add “we must keep perspective and not neglect other areas or over-celebrate Twitter as a comms tool!“.

In a recent post by Andy Powell on “Influence, connections and outputs” Andy suggested that we want “some sensible measure of scholarly impact on the social Web” and that we need to understand “what do we want to measure; what we can measure and how can bring these two things close enough together to create something useful”. Although I’m not suggesting that the Twitter responses to the Daily Mail poll have any scholarly impact, I do think that this could provide us with an opportunity to understand how ideas (whether scholarly, political or simply marking) are communicated and transmitted by Twitter. But if we wish to make use of this particular example, we will need to preserve the Twitter stream - and we will need to preserve the structured data (the Twitterer ID, the time, the IDs of those being retweeted and the numbers of followers) in order to carry out any rich analysis.

Preservation of an organisation’s memory

As recently described on the This is Bath blog the Bus station Twitter project ends - today. The bathcsc Twitter account and accompanying blog, were set up to keep users of the First bus service in Bath informed of news - including delays to bus services. Sadly First’s head office in Bath  have announced that the company “is having to make changes to its business structure. As part of making the company more cost efficient, the customer services function for Bath Bus Station will now be handled by a specialist, area-wide centre based at Exeter“.

This news has not gone down well with many of the fans of this innovative use of social networking tools (and I should add that, as a daily bus user, I personally found the Twitter service very useful) as can be gauged from the comments on the blog to the announcement of the demise of the service and on the Bus station Twitter project ends article.

In the article a spokeswoman said that the company was hoping to learn lessons from the nine-month experiment:

The Bath Twitter trial attracted an enormous amount of interest, both inside and outside the organisation. We are now looking at the lessons we have learnt, and benefits associated with the system, with a view to developing a national social media approach to customer services for the future.

But is the company in a position to preserve the tweets? Will the company be in a position to appreciate the ways in which Twitter can be used effectively if this record is lost? The danger, of course, is that a new media marketing agency will sometime in the future advise on how Twitter could be used - and fail to learn from the successes of the initial trial, perhaps even making as big a mistake as Habitat have done recently.

What To Do? When To Do It?

Are these illustrations of when preservation of tweets is reasonable? If so, can you think of other examples which we need to consider?

And if there is a case to be made for preserving tweets, what tools should be used? I am aware of the Print Your Twitter service, but this just creates a PDF file of tweets matching the search criteria.  To summarise what I think is needed:

  •  Ability to create a structured repository of tweets from a Twitter ID
  •  Ability to create a structured repository of tweets matching a search term

Anyone know of software which can be used to do this? I’m aware that one could simply use Twitter to search for “http://bit.ly/w4b6Q” or “@bathcsc” but this doesn’t provide a persistent and reusable data store.
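To make the two requirements above a little more concrete, here is a minimal sketch of what a "structured repository" might look like, using Python's built-in sqlite3 module. It is not an existing tool: the table layout, function names and sample rows are assumptions made for illustration, and it presumes the tweets have already been obtained from somewhere. The columns simply mirror the structured data mentioned earlier (Twitterer ID, time, who was retweeted, follower numbers).

```python
import sqlite3

# Hypothetical schema for this sketch; columns follow the fields discussed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id   TEXT PRIMARY KEY,
    author     TEXT NOT NULL,     -- the Twitterer ID
    posted_at  TEXT NOT NULL,     -- ISO 8601 timestamp, so text ordering is time ordering
    text       TEXT NOT NULL,
    retweet_of TEXT,              -- ID of the Twitterer being retweeted, if any
    followers  INTEGER            -- author's follower count when captured
)
"""


def open_store(path=":memory:"):
    """Open (or create) the repository; the default is a throwaway in-memory store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_tweet(conn, tweet):
    """Store one tweet given as a dict; a tweet_id already present is skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO tweets VALUES "
        "(:tweet_id, :author, :posted_at, :text, :retweet_of, :followers)",
        tweet,
    )
    conn.commit()


def by_author(conn, author):
    """IDs of all stored tweets from one Twitter ID, oldest first."""
    rows = conn.execute(
        "SELECT tweet_id FROM tweets WHERE author = ? ORDER BY posted_at",
        (author,),
    )
    return [row[0] for row in rows]


def matching(conn, term):
    """IDs of all stored tweets whose text contains the search term exactly."""
    rows = conn.execute(
        "SELECT tweet_id FROM tweets WHERE instr(text, ?) > 0 ORDER BY posted_at",
        (term,),
    )
    return [row[0] for row in rows]


# Hypothetical sample rows, invented for the example.
store = open_store()
add_tweet(store, {"tweet_id": "1", "author": "bathcsc",
                  "posted_at": "2009-06-19T08:00:00", "text": "Delays on route 18",
                  "retweet_of": None, "followers": 900})
add_tweet(store, {"tweet_id": "2", "author": "someone",
                  "posted_at": "2009-06-19T09:30:00",
                  "text": "Vote now http://bit.ly/w4b6Q",
                  "retweet_of": "someone_else", "followers": 150})
```

Unlike a PDF, a store of this kind can be queried again later, for example `by_author(store, "bathcsc")` or `matching(store, "http://bit.ly/w4b6Q")`.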

Posted in missinglinks09mg, missinglinks09, Web 2.0 | 10 Comments »

ArchivePress: When One Size Doesn’t Fit All

Posted by Richard M. Davis on 24th June 2009

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary - scholarly, actual, political, or personal - which merit keeping no less than famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools - like Web Curator Tool and PANDAS - that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static - as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software - WordPress, of course - to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
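The general idea of harvesting structured posts from a newsfeed, rather than crawling rendered pages, can be illustrated with a short Python sketch. This is not ArchivePress code (which is built on WordPress); the sample feed, the function names and the example.org addresses are all made up, and it assumes an RSS 2.0 feed that uses the Dublin Core `dc:creator` element for the author, as many blog feeds do.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"   # Dublin Core namespace, used for dc:creator

# Hypothetical feed content, invented for the example.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/first</link>
      <pubDate>Wed, 24 Jun 2009 10:00:00 +0000</pubDate>
      <dc:creator>alice</dc:creator>
      <category>Web 2.0</category>
      <category>Software</category>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/second</link>
      <pubDate>Thu, 25 Jun 2009 09:00:00 +0000</pubDate>
      <dc:creator>bob</dc:creator>
      <category>Events</category>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Turn each <item> of an RSS 2.0 feed into a plain dict of its fields."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "author": item.findtext(DC + "creator", default=""),
            "categories": [c.text or "" for c in item.findall("category")],
        })
    return posts


def select(posts, author=None, category=None):
    """Select archived posts by author and/or category, as on a live blog."""
    return [
        post for post in posts
        if (author is None or post["author"] == author)
        and (category is None or category in post["categories"])
    ]


posts = parse_feed(SAMPLE_FEED)
```

Because each post is kept as data rather than as a rendered page, `select(posts, category="Web 2.0")` or `select(posts, author="bob")` can be answered directly from the archive.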

In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders - except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.

ArchivePress has its own website and blog, which will build a cumulative picture of its results and the issues it encounters over the next 6 months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) - and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

Posted in missinglinks09, Challenges, Software, Web 2.0, Events | 8 Comments »

Digital Preservation and Nuclear Disaster: An Animation

Posted by Brian Kelly on 12th May 2009

DigitalPreservationEurope (DPE), an organisation which is “committed to making digital preservation materials available to the widest possible audience and to breaking down barriers to access”, has released the first in a series of short animations introducing and explaining digital preservation problems and solutions for the general public.  

Not for everyone, I suspect, but I’m pleased to see a diversity of approaches being taken to explaining digital preservation concepts. And making it available on YouTube means that the animation can be easily used in a wide variety of contexts, such as being embedded in this blog post. What do you think? 

Posted in Web 2.0 | 1 Comment »

Permission to capture Twitter

Posted by Ed Pinsent on 1st May 2009

This blog has been occasionally concerned with issues surrounding the capture of Tweets, in their capacity as web-based resources which may or may not have value to the Institution as record or archive material. See Brian Kelly’s post of July 2008. The discussions have been based around (a) is it worth doing? and (b) is it technically possible to capture Tweets, and how? This post is concerned with a third issue, (c), do we have permission to do it, since twitter.com is a third-party hosting service? The same issue, in fact, applies to many Web 2.0 applications which involve members of staff working in the cloud or generally conducting University business outside of the network.

The PANDORA Librarian at the State Library of Queensland has just asked the question about permission. “Do we have a blanket permission for Twitter http://twitter.com/? That is, if we find a Twitter site we want to gather, and we get permission as well from say a politician or whoever has content on that site, can we go ahead and gather?”

The National Library of Australia, who have been involved in a library-centric web archiving programme for many years, replied with the following advice:

“We received advice from Twitter, when asking to archive the Prime Minister’s twitter page, which gives a clear indication that they have no problem with archiving,” said the NLA’s Manager of Web Archiving. “I have also been in contact with another Twitter staffer who also gave permission. So I think it is safe to assume that we may archive selectively from Twitter, depending upon receiving the permission of the Twitteree. However in the case of the PM, we didn’t seek that explicit permission as we have blanket permission, from his Party and the PM’s dept.”

And that advice from Twitter.com confirms this:

“Archiving the Prime Minister’s twitter updates is fine, since he owns the copyright to the messages that he posted. According to our Terms of Service:

Copyright (What’s Yours is Yours)

1. We claim no intellectual property rights over the material you provide to the Twitter service. Your profile and materials uploaded remain yours. You can remove your profile at any time by deleting your account. This will also remove any text and images you have stored in the system.

2. We encourage users to contribute their creations to the public domain or consider progressive licensing terms.”

This is potentially a useful little development. It may send archivists in the direction of checking the terms of service of other third-party hosting organisations, such as Google Docs or SlideShare, to establish what’s permissible and what isn’t. If copyright and ownership issues are sidestepped or aren’t even mentioned, this could be a cause for concern.

In the meantime, presumably we must also give credit to our Australian colleague for devising the neologism “Twitteree”.

Posted in Web 2.0 | 3 Comments »

Archiving a wiki

Posted by Ed Pinsent on 25th March 2009

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library Life and Life2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out if you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led to me making a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is that UKWAC users, the target audience for the archived copy - or the designated user community, as OAIS calls it - probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.
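A selection decision like this usually ends up expressed as URL exclusion rules for the crawler. The Python sketch below shows the kind of rule involved; it is not the Web Curator Tool configuration used for DigiRep. The patterns assume common MediaWiki URL conventions (`action=history`, `action=edit`, `oldid=`, and `Talk:` or `Special:` page-name prefixes), the function names are invented, and wiki.example.org is a placeholder; a real wiki's URL layout would need checking first.

```python
from urllib.parse import parse_qs, unquote, urlparse

# Assumed MediaWiki conventions; adjust after inspecting the actual wiki.
EXCLUDED_ACTIONS = {"history", "edit", "raw", "delete"}
EXCLUDED_PREFIXES = ("Talk:", "User_talk:", "Special:")


def page_title(url):
    """Best-effort page title, from ?title=... or the last path segment."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    if "title" in query:
        return query["title"][0]
    return unquote(parts.path.rsplit("/", 1)[-1])


def should_capture(url):
    """True for finished content pages; False for history, edit forms,
    old revisions, discussion pages and auto-generated special pages."""
    query = parse_qs(urlparse(url).query)
    if query.get("action", ["view"])[0] in EXCLUDED_ACTIONS:
        return False
    if "oldid" in query or "diff" in query:
        return False
    return not page_title(url).startswith(EXCLUDED_PREFIXES)


# Hypothetical URLs, invented for the example.
candidates = [
    "http://wiki.example.org/index.php/Main_Page",
    "http://wiki.example.org/index.php?title=Main_Page&action=history",
    "http://wiki.example.org/index.php/Talk:Main_Page",
    "http://wiki.example.org/index.php?title=Special:RecentChanges",
]
to_gather = [url for url in candidates if should_capture(url)]
```

With these sample URLs only the first would be gathered. If the policy later changed to keep the change history, dropping `"history"` and the `oldid`/`diff` test from the rules would be enough to reflect that decision.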

This selection aspect led to Maureen Pennock’s reply, which is a very valid point - there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.

Posted in Challenges, Selection, Digital preservation, Records management, Web 2.0 | No Comments »