“Why you never should leave it to the University”

A blog post from Richard Gatarski begins with the blunt announcement:

A year ago my academic web site disappeared. And those who made it go away probably ignored that such a thing could happen.

The article goes on to describe how last year Richard “found out that the School of Business had redesigned their web site. And in the process they just ignored my research. About ten years worth of virtually daily updates were gone. That included most of the manuscripts for my published work. The same thing happened to lecture notes, powerpoint slides, course documentations, useful links, etc. It had all disappeared from the Web!”.

Richard did have some good news to report: “Courtesy of the Internet Archive you can still find most of my academic stuff on the Web through their Wayback machine.” Although he did wonder why he had to rely on the Internet Archive (“a 501(c)(3) non-profit that was founded to build an Internet library”) – after all, wouldn’t you expect your institutional library to provide this service?

Richard’s losses of his digital resources have continued – a blog he set up at Stockholm University was deleted after he left the institution – although, again a copy is archived on the Internet Archive.

Richard’s experiences have left him disillusioned with the attitudes towards the digital preservation of scholarly resources. He concludes by recommending that academics take responsibility themselves for preserving their resources:

Meanwhile, for those of you who publish stuff on the Web while working with an organisation, including universities. Try to put your content where you control it. Most likely you will move between work places, temporary assignments, and soforth. If you want your stuff to be preserved, it is your responsability to make sure it is.

But how easy will this be for the typical academic? Richard doubts whether “the issues I bring forward today are heavily discussed among university chancellors, political leaders, educational policy makers, and scientific philosophers.”  But surely we need to ensure that this debate takes place. And, in today’s economic climate, that debate needs to include discussions of the costs of digital preservation (disk storage may be cheap but management of content is not).

Richard’s tale is based on his experiences as an academic in Sweden. Is the situation different in the UK, I wonder? Judging by Stuart Smith’s lament that “Mummy I lost my MP3!”, which I summarised in a post on “Disappearing Resources On Institutional Web Sites” in December 2008, it would seem that we have similar experiences in the UK higher education sector. Does anyone have any positive experiences to share?

What’s the average lifespan of a Web page?

…or is it easier to ask how long is a piece of string?

The statistic much bandied about (for Web pages not pieces of string!) is 44 days, believed to originate in an article by Brewster Kahle (of Internet Archive fame) published in 1997 and titled Preserving the Internet. Brewster’s original quote is specifically about URLs: “…estimates put the average lifetime for a URL at 44 days.”

Whether this figure still stands today is a matter currently being discussed on the CURATORS@LIST.NETPRESERVE.ORG list after a query from Abigail Grotke of the Library of Congress.

Abbie offered up the 44 day statistic and pointed out that on the Digital Preservation Web site they have a graphic that discusses Web volatility stating “44% of the sites available on the internet in 1998 had vanished one year later”.

The other figure often cited is 75 days, from Michael Day’s report Collecting and preserving the world wide web.

The dynamic nature of the Web means that pages and whole sites are continually evolving, meaning that pages are frequently changed or deleted. Alexa Internet once estimated that Web pages disappear after an average time of 75 days. (Lawrence, et al., 2001, p. 30).

Another figure sometimes suggested is 100 days; this seems to come from Rick Weiss’s article for The Washington Post, Washington, DC, 24 November 2003, On the Web, Research Work Proves Ephemeral – no longer available.

So what is the average lifespan of a Web page today? Is it getting shorter or longer? The Internet Archive now gives 44–75 days as its ballpark figure. I’d hazard a guess that with the rise in use of Web 2.0 technologies the Web is actually getting more transient by the day.
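For anyone wanting to get a feel for the scale of the problem on their own site, a crude check is simply to probe a sample of old URLs and see how many still respond. The following is a minimal, purely illustrative Python sketch (the sample URLs are placeholders); a proper study would of course need a much larger sample and more care over redirects and soft 404s.

```python
# A minimal, illustrative sketch (not a rigorous study): probe a sample of
# URLs and report what fraction still respond, as one crude way of gauging
# link rot. The URLs below are placeholders, not a real sample.
import urllib.request
import urllib.error

SAMPLE_URLS = [
    "http://example.org/old-page.html",    # hypothetical
    "http://example.com/2009/report.pdf",  # hypothetical
]

def is_alive(url, timeout=10):
    """Return True if the URL still answers with a successful HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.getcode() < 400
    except (urllib.error.URLError, ValueError):
        return False

if __name__ == "__main__":
    alive = sum(is_alive(u) for u in SAMPLE_URLS)
    print(f"{alive}/{len(SAMPLE_URLS)} sample URLs still respond")
```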

Is this OK?

Maybe, if it’s just a tweet you sent to a friend; however, if it’s something more substantial that’s disappearing then it’s a real worry.

An Archive Of IWMW 2009 Tweets

In a recent blog post entitled Tools For Preserving Twitter Posts I described some of the Twitter preservation tools we were planning to use to keep a record of the tweets related to UKOLN’s recent IWMW 2009 event.

Twitter proved very popular during this annual event for institutional Web managers, with over 1,500 Twitter posts (tweets) being published during the last week of July. Further statistical information is provided in a post on Evidence on Use of Twitter for Live Blogging.

We suggested that a two-character code (P1 to P8) could be used to identify each plenary session and that using this as a hashtag in conjunction with the event’s hashtag (#iwmw2009) would enable the tweets about a particular talk to be easily identified and, in theory, this data migrated to a managed environment.

As an example, you can search for the tweets related to a particular plenary talk using its session hashtag.

We have recently used The Archivist desktop application to create a local copy of the tweets for the plenary talks at the conference, and these have been made available on the IWMW 2009 Web site from the individual pages for the plenary talks (e.g. see the page for Derek Law’s opening plenary talk). The pages also contain a summary of the number of Twitter posts which were found using the tool.

One reason for wishing to do this is to provide an answer to the speaker who may ask “I Wonder What They Thought About My Session?“.
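For illustration only, here is a rough Python sketch of how the tweets for a single plenary session might be pulled out of a local export such as the one produced by The Archivist; the CSV file name and column names are assumptions rather than the tool’s actual output format.

```python
# Illustrative only: given a local export of event tweets, pick out the
# tweets for one plenary session by combining the event hashtag with the
# two-character session code (P1..P8). Field names are assumed, not the
# actual export format of The Archivist.
import csv

def tweets_for_session(csv_path, session_code, event_tag="#iwmw2009"):
    """Return tweets whose text contains both the event tag and the session code."""
    matches = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            text = row.get("text", "").lower()
            if event_tag.lower() in text and f"#{session_code}".lower() in text:
                matches.append(row)
    return matches

if __name__ == "__main__":
    for tweet in tweets_for_session("iwmw2009_tweets.csv", "P1"):  # hypothetical file
        print(tweet.get("from_user", "?"), tweet.get("text", ""))
```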

Preservation and Google Wave

A number of scientists have written enthusiastic blog posts about the potential of Google Wave including Peter Murray-Rust, Cameron Neylon and several others. A post entitled Google Wave: possibilities for librarians on the Rambling Librarian blog provides a useful summary of Google Wave and how it aims to provide a response to the question “What might email be like if it was invented today?”

The Rambling Librarian post also picks up on the important “implication … that digital preservation will be even more critical. Imagine all the collaborative efforts gone when the server crashes. Or power fails.”

Absolutely! And let’s ensure that the digital preservation aspects are being considered right at the start of any development activities rather than being ignored by those focussing on the new possibilities which this technology can provide.

Hmm, I wonder if there are any funding possibilities available for exploring the preservation aspects of Google Wave?

Releasing the Herds of Cows – The Missing links Workshop

Yesterday many of those most committed to, and interested in, Web preservation attended the Missing links: the enduring web conference held at the British Library. The event, which had over 100 attendees (from academia, galleries, museums, libraries, archives, government, research groups and beyond), was an attempt to ‘develop and strengthen the links between content creators, tools developers, preservation services and users’ in order to ‘secure an enduring web’.

If anyone can successfully preserve the Web it was this lot!

Session 1: Setting the scene

Keynote: Adrian Brown, Assistant Clerk of the Records, Parliamentary Archives

The opening keynote was given by Adrian Brown from the Parliamentary Archives, who did an adept job of setting the scene and the challenges ahead. His talk centred around the key jigsaw pieces of selection, capture, storage, access and preservation. He pointed out that the Web used to be primarily a publisher of fixed content and it is the new types of content (discussions on wikis, Twitter, blogs etc.) that offer the biggest challenge. Parliament themselves are using many new forms of information, and although permission is not always an issue, capturing content on other sites often is. You may own the content but someone else often owns the presentation of that content. Adrian also emphasised the need for us to co-ordinate selection and be as explicit about what we are not going to capture as about what we are going to capture (the unpublishing button on Facebook being a classic example of how difficult it can be to go back!). Another major challenge is that of temporal cohesion – the rate of capture of content is often far slower than the rate of change of content.

He concluded with a Peter Murray-Rust quote that stresses the need for us to “Create and release herds of cows, not preserve hamburgers in a deepfreeze.”

Web Archive and Citation Repository in One: DACHS: Hanno Lecher, Librarian, Sinological Library, Leiden

Hanno Lecher talked about the citation problems he has encountered while working on the Digital Archive for Chinese Studies (DACHS), a resource that captures and archives relevant resources as primary sources for later research. His work has indicated the possible need for citation repositories. Some options might include verifying URL references, evaluating the reliability of online resources, use of PURLs, avoidance of obscure URLs and even the extreme step of banning online resources as references. As such approaches are not always practical (or appropriate), he suggested the use of applications like Snagit, Zotero or WebCite.

The future of researching the past of the Internet: Eric T. Meyer, Research Fellow, Oxford Internet Institute, Oxford University

Eric Meyer works primarily on social behaviour and the Internet and, as part of the Transatlantic Digitisation Project, spent some time building an archive collection of World War 1 and 2 Internet resources. The process gave the team insight into the feasibility of creating such an archive. Lessons were also learnt on the need for involvement of domain experts (webometrics, SNA, eResearch, etc.) and the desire to move from snapshots to more continuous data. Eric concluded by stressing that what researchers really want is a global way to access archives, not a national one.

After the morning coffee break the workshop moved from the general to the specific.

Session 2: Creation, capture & collection

An overview of Web Archiving Tools by Helen Hockx-Yu, Web Archiving Programme Manager, The British Library

Helen Hockx-Yu provided a very practical overview of current tools, covering Heritrix, NetArchiveSuite, Web Curator Tool and PANDAS. She also discussed archival formats such as ARC and WARC, the latter being highly desirable from a long-term archival standpoint. Helen concluded with a brief discussion of the limitations and challenges harvesters present, from issues with rendering and dealing with ‘bad’ content to reliance on open source tools that are still very much evolving.
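As a small illustration of why container formats such as WARC matter, the sketch below uses the open-source warcio Python library (a more recent tool, shown here purely as an example rather than one of the tools Helen covered) to list the URLs captured in a WARC file.

```python
# A small sketch of working with WARC files: iterate over the records in a
# crawl file and print the target URI and capture date of each response.
from warcio.archiveiterator import ArchiveIterator

def list_captured_urls(warc_path):
    """Print the target URI and capture date of each response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                date = record.rec_headers.get_header("WARC-Date")
                print(date, uri)

if __name__ == "__main__":
    list_captured_urls("example-crawl.warc.gz")  # hypothetical file name
```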

Context and content: Delivering Coordinated UK Web Archive to User Communities: Cathy Smith, Collections Strategy Manager, The National Archives

Cathy Smith and the UKWAC team have recently been working on a study entitled Delivering coordinated UK web archives, looking at what audiences web archives should anticipate, what this means for selection, ingest and preservation, and what researchers want from archives. The conclusions include the notion of a single view of an archive, though this does not necessarily mean a single repository. Cathy explained that “they just want the car to work – they don’t want to see what’s under the bonnet”. In response to the study the National Archives have recently begun automated whole-domain harvesting – shallow as well as selective harvesting – and they are looking at allowing all the archives to be searched together as a single collection.

Capture and Continuity: Broken links and the UK Central Government Web Presence: Amanda Spencer and Tom Storrar from the National Archives.

Amanda Spencer and Tom Storrar were set the task of sorting out the problem of broken links on government Web sites. 60% of URLs quoted in Hansard between 1997 and 2006 are broken, leaving the public unable to find information and with little faith in the authority of government Web sites. The Web Continuity project has begun to create an archive of all Central Government Web space. Some things they are trying out include XML site maps, using redirects instead of 404 messages and capturing all government Web sites before they are closed.
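To illustrate the ‘redirects instead of 404 messages’ idea, here is a minimal Python (Flask) sketch – not the Web Continuity project’s actual implementation – in which requests for pages that no longer exist are redirected to an archived copy; the archive host and URL pattern are hypothetical.

```python
# A minimal sketch of "redirect instead of 404": if a requested page no
# longer exists on the live site, send the visitor to an archived copy
# rather than an error page. The archive base URL below is hypothetical.
from flask import Flask, redirect

app = Flask(__name__)

LIVE_PAGES = {"/": "Home page", "/contact": "Contact details"}
ARCHIVE_BASE = "https://webarchive.example.gov.uk/site"  # hypothetical archive host

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def serve(path):
    page = "/" + path if path else "/"
    if page in LIVE_PAGES:
        return LIVE_PAGES[page]
    # Page has gone: redirect to the archived copy instead of returning 404.
    return redirect(f"{ARCHIVE_BASE}{page}", code=301)

if __name__ == "__main__":
    app.run()
```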

Discussion Session

In the questions and discussion session the panel were asked what advice they would give to Web managers on how they can allow archiving of their sites to be more effective. The suggestions tackled content (following accessibility guidelines, W3C standards, making code valid, adding in hard links to files next to streaming content, having transcripts of audio files) and communication (enabling a dialogue between Web archivists and Web creation people). It was noted that there is a fine line between encouraging good practice and stifling innovation and that at times communication with content creators is a luxury. Responsible harvesting also means explaining why you are doing it to those involved.

The discussion then moved to other possible points of capture, such as using Google Cache or asking ISPs.

Session 3: Issues and approaches to long term preservation of web archives

After lunch the programme moved on from the specific problems of now to problems of the future.

Diamonds in the Rough: Capturing and Preserving Online Content from Blogs: Richard Davis, Project Manager, University of London Computing Centre (ULCC)

Richard Davis, a member of the JISC PoWR team, started off by emphasising the importance of blogs today. This has led him and his colleague Maureen Pennock to consider the need for their preservation, particularly at institutional level, resulting in the ArchivePress project. Richard pointed out that traditional Web archiving tools can be difficult to configure and use: a ‘hammer to crack a nut’. A comment by Chris Rusbridge that “blogs represent an area where the content is primary and design secondary” suggested that using blog feeds might be key. The feed could be a point at which to gather blog content for archiving. The test bed for ArchivePress will include blogs from the DCC, Lincoln University and UKOLN. In the future ArchivePress could possibly be used to harvest Twitter.

Beyond Harvest: Long Term Preservation of the UK Web Archive: Maureen Pennock, Web Archive Preservation Project Manager, The British Library

Maureen Pennock started her talk by giving some general background to the UK Web Archive. Now that the archive has been established, the next task is preserving it – an ongoing activity. The main undertakings are documenting system dependencies, consideration of containers and metadata standards (following a review, WARC is the preferred format for preservation, but they are also using a combination of features from METS and PREMIS), preservation workflow and defining the preservation strategy. The future involves the consideration of many new areas, for example whether we should preserve computer viruses.

From Web Page to Living Web Archive: Thomas Risse, Senior researcher, L3S Research Center

Thomas Risse introduced the LiWA (Living Web Archives) project, an EU FP7 research project. LiWA has been looking at building tools to address the specific challenges of the current generation Web, e.g. link extraction from dynamic pages.

Emulating access to the web 1.0: Jeffrey van der Hoeven, Koninklijke Bibliotheek, The Netherlands

Jeffrey van der Hoeven started off with a whirlwind history of the Web browser: from the WorldWideWeb browser in 1991, which was more like a document viewer, through Mosaic and Netscape, to today’s Firefox and Google Chrome. Today’s browsers do many things, from rendering Web sites, bookmarking and caching to authentication and providing RSS feeds. They have changed from viewers to all-round workplaces – Google Wave is an extreme example of this. The browser is currently the central application of the computer and there is a need for a sustainable solution to render Web sites. At the Koninklijke Bibliotheek they have recently been working on the Dioscuri project, a durable emulator. They have also been working with the Planets project using Global Remote Access to Emulation (GRATE) and on the Keep Emulation Environments Portable (KEEP) project.

After another coffee break we moved on to the final discussions.

Session 4: Discussion and next steps

What we want with web-archives; will we win? Kevin Ashley, Head of Digital Archives, University of London Computer Centre (ULCC)

Kevin Ashley, also part of the JISC PoWR team, kicked off his talk by introducing us to the future of the Web, Web 8.0. Although Kevin was just joking with us, he explained that when the next major change in the Web happens we will need a much better metaphor than ‘Web 3.0’: it is the future that we need to think about. We need to be asking the key question of what we want from Web archives, e.g. potential use cases. Preserving the Web is not just about content. Sometimes it will be useful to look at Web content in aggregate, e.g. for visualisation, linguistic analysis, etc. Right now many academics are doing their own Web archiving – content, presentation or both. However we need to take not just a document-centred view of data but also consider the web of data and data about the web. A lot of future interest will not be in Web content, but in the context of its use.

The day concluded with a round table discussion with many speakers from the day. Among the topics discussed was the permissions challenge (primarily with regard to UKWAC and similar initiatives); other countries and initiatives favour a more ‘just do it’ approach.

The day left most people enthused and positive about the future for Web archiving.

A conference report will follow on the DPC Web site and will be sent out via the dpc-discussion and digital-preservation JISCmail lists.

All presentations are now available in PDF Format from the DPC Web site.

WAC09 was used as the Twitter tag for the event. You can see the tweets at http://hashtags.org/tag/wac09/messages.

The event was sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland).

Posters for Missing Links

Today was the Missing Links: The enduring Web conference held at the British Library Conference Centre, London. A trip report on the event will follow.

The JISC PoWR team presented two posters, one provided a general overview of the project and the other discussed Preservation Policies and Approaches for Use of Social Web Services.

The content of the second poster has been mentioned in a number of previous blog posts.

Although the poster was submitted to the British Library it was unfortunately mislaid, so here is the actual content.

Preservation Policies and Approaches for Use of Social Web Services


Preservation Policies for Content Hosted on Third Party Blogs

We know that a variety of externally-hosted Web 2.0 services are being used to support institutional aims. But what about the associated risks of loss of important resources?  One approach to such risk would be to ban use of such services. But this is to ignore the benefits that such services provide and is likely to alienate users of such services if it were possible to implement such a policy.

The approach taken by the JISC PoWR project (and described in more detail in the JISC PoWR handbook – see particularly chapter 13) has been to recognise that there are legitimate reasons to make use of such services and to look at ways in which content hosted on such services can be managed and curated in the long term.

The need to do this is of relevance to UKOLN, which provides a number of blogs on externally-hosted services, including the UK Web Focus blog, provided by Brian Kelly, and the Ramblings of a Remote Worker blog, provided by Marieke Guy.

The first stage is to define and publicise a policy covering the long-term access to the content of these two blogs, including what will happen if either of the authors leaves UKOLN.

Policies posted on the UK Web Focus and Ramblings of a Remote Worker blogs state that:

  • A rich copy of the contents of the blog will be made available to UKOLN (my host organisation) if I leave. Note that this may not include the full content if there are complications concerning third party content (e.g. guest blog posts, embedded objects, etc.), technical difficulties, etc.
  • Since the blog reflects personal views I reserve the rights to continue providing the blog if I leave UKOLN. If this happens I will remove any UKOLN branding from the blog.

These two simple statements can help, we feel, in ensuring that the content can be managed if the blog authors leave (or if they fall ill, go crazy or die!). The statements seek to avoid uncertainties regarding what can be done with the content.  The second statement also clarifies that if the authors were to leave, they may wish to continue using the blog.

It may be argued that since both blogs make their content available under a Creative Commons licence this already grants the host institution, along with anyone else, the rights to preserve the content. This may be true, but there is no harm in making this explicit, we feel.

Would it not be legitimate for organisations to expect their employees to make similar statements which clarify the ownership of content hosted on Web 2.0 services and created as a normal course of one’s duties?


Note: This blog post has been written to support a poster which will be provided for the Missing links: the enduring web conference. The poster, which has been produced by Marieke Guy and Brian Kelly, UKOLN, is entitled “Preservation Policies and Approaches for Use of Web 2.0 Services“. A series of blog posts published on this blog provide more detailed information of the content summarised in the poster.

Tools For Preserving Twitter Posts

I recently described Some Use Cases For Preserving Twitter Posts, including preservation of an organisation’s digital memory and preservation of data for subsequent data mining. The post, however, failed to include perhaps the most obvious example: preservation of Twitter posts (tweets) related to an event.

In response to that post a number of solutions for preserving tweets were suggested, including FriendFeed, the WordPress Lifestream plugin and What the Hashtag. In addition, following a tweet I posted, I received details of the Tweetdoc service.

With this year’s Institutional Web Management Workshop (IWMW 2009) rapidly approaching it is timely to decide on the tools we’ll be using to preserve the discussions associated with this event. We began keeping a record of the amplification of the IWMW event back in 2005, when an IRC channel was provided for use by the small number of participants who had a laptop and WiFi connectivity at the event. The IWMW 2005 event took place during the 7/7 bombings in London and a record of the awareness of what was happening can be gleaned from the archive of the IRC discussions.

This year we will once again be making use of Twitter and will be capturing tweets which contain the event hashtag #iwmw2009. The tools we are currently intending to use are What the Hashtag and the Tweetdoc service.

Use of the What the Hashtag (wthashtag) service to view tweets containing the #iwmw2009 tag is illustrated.

As can be seen, the output can be accessed as an RSS feed. In addition the data can be viewed as an HTML resource, and a date range can also be supplied.

We intend to capture the tweets about the IWMW 2009 event after the event is over, and store the data on the UKOLN Web site, in order to avoid dependencies on the What the Hashtag service itself.
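As a simple illustration of that capture step, the sketch below fetches the RSS output for the event hashtag and writes the raw XML to a local file; the feed URL shown is an assumed pattern for the What the Hashtag service rather than a documented endpoint.

```python
# A minimal sketch of the "capture the feed, keep our own copy" step:
# fetch the RSS output for the event hashtag and store it locally, so we
# no longer depend on the third-party service. The feed URL is assumed.
import urllib.request
from datetime import date

FEED_URL = "http://wthashtag.com/feed/iwmw2009"  # assumed URL pattern
OUTPUT_FILE = f"iwmw2009-tweets-{date.today().isoformat()}.xml"

def capture_feed(url, path):
    """Download the feed once and write the raw XML to a local file."""
    with urllib.request.urlopen(url, timeout=30) as response:
        data = response.read()
    with open(path, "wb") as f:
        f.write(data)
    print(f"Saved {len(data)} bytes to {path}")

if __name__ == "__main__":
    capture_feed(FEED_URL, OUTPUT_FILE)
```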

We will also explore other services, such as Tweetdoc – although in this case, as the data is only available as a PDF resource, it is not well-suited to providing data for analysis by other services.

Are there any other services we should be looking at? And what functionality might be desirable for a more generic service for preserving tweets? Any thoughts?


Note: This blog post has been written to support a poster which will be provided for the Missing links: the enduring web conference. The poster, which has been produced by Marieke Guy and Brian Kelly, UKOLN, is entitled “Preservation Policies and Approaches for Use of Web 2.0 Services“. A series of blog posts published on this blog provide more detailed information of the content summarised in the poster.

Some Use Cases For Preserving Twitter Posts

I’ve come across two schools of thought regarding the preservation of Twitter posts (tweets). For Twitter fans, it’s a really useful form of informal conversation, with a particular strength being the way in which messages quickly disappear. It’s like having conversations in your favourite bar – and we wouldn’t want such conversations preserved, so why would we wish to preserve tweets? Sceptics might regard Twitter as inane babble in a bar and would agree that it’s pointless to preserve it.

However this week I have come across a couple of examples which illustrate why it might be useful to preserve tweets.

Preservation of Twitter data for research purposes

I recently described how Twitterers Subvert[ed] Daily Mail’s Racist Poll. For me that provided a fascinating example of how Twitter can be used by activists to support a viral campaign (in this case to undermine the racist sentiments in the question “Should the NHS allow gipsies to jump the queue?”). But why did this poll succeed in attracting a large turnout (with the poll being undermined by around 94% voting yes before the Daily Mail removed it) whereas a subsequent poll on Should immigrants be forced to respect British culture? failed to arouse much interest?

Since many of the tweets and retweets provided a link to “http://bit.ly/w4b6Q” we can look at the statistics provided by the bit.ly short URL service, which tell us that 48,796 people clicked on the link.

As might be expected, the histogram of click statistics for the Daily Mail poll shows that the vast majority of clicks took place on 19 June 2009, the day the poll was the hot topic.

But who created the interest in the Twitterverse originally? Was it, as I speculated in my post, the UK-based psychologist community? Or were there a number of starting points? How was the message communicated? Who were the ‘influencers’ (to use the terminology from a recent blog post by Martin Weller) whose tweets were felt to contain links worth following? Who were the ‘hubs’ responsible for communicating the message to a wide audience? And was Twitter actually the main driver, or were there other viral mechanisms in operation, as suggested by Josh, who went on to add “we must keep perspective and not neglect other areas or over-celebrate Twitter as a comms tool!”.

In a recent post on “Influence, connections and outputs” Andy Powell suggested that we want “some sensible measure of scholarly impact on the social Web” and that we need to understand “what do we want to measure; what we can measure and how we can bring these two things close enough together to create something useful”. Although I’m not suggesting that the Twitter responses to the Daily Mail poll have any scholarly impact, I do think that this could provide us with an opportunity to understand how ideas (whether scholarly, political or simply marketing) are communicated and transmitted by Twitter. But if we wish to make use of this particular example, we will need to preserve the Twitter stream – and we will need to preserve the structured data (the Twitterer ID, the time, the IDs of those being retweeted and the numbers of followers) in order to carry out any rich analysis.
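By way of illustration, the structured record needed for that kind of analysis might look something like the following Python sketch; the field names are illustrative rather than any published schema.

```python
# A sketch of the kind of structured record that would need to be preserved
# to support analysis of how a message spread; field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class PreservedTweet:
    tweet_id: str
    author_id: str                 # the Twitterer ID
    posted_at: datetime            # the time of posting
    text: str
    follower_count: int            # the author's reach at the time
    retweet_of: List[str] = field(default_factory=list)  # IDs of those being retweeted
    links: List[str] = field(default_factory=list)       # e.g. "http://bit.ly/w4b6Q"
```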

Preservation of an organisation’s memory

As recently described on the This is Bath blog, the Bus station Twitter project ends – today. The bathcsc Twitter account and accompanying blog were set up to keep users of the First bus service in Bath informed of news – including delays to bus services. Sadly First’s head office in Bath has announced that the company “is having to make changes to its business structure. As part of making the company more cost efficient, the customer services function for Bath Bus Station will now be handled by a specialist, area-wide centre based at Exeter”.

This news has not gone down well with many of the fans of this innovative use of social networking tools (and I should add that, as a daily bus user, I personally found the Twitter service very useful), as can be gauged from the comments on the blog in response to the announcement of the demise of the service and on the Bus station Twitter project ends article.

In the article a spokeswoman said that the company was hoping to learn lessons from the nine-month experiment:

The Bath Twitter trial attracted an enormous amount of interest, both inside and outside the organisation. We are now looking at the lessons we have learnt, and benefits associated with the system, with a view to developing a national social media approach to customer services for the future.

But is the company in a position to preserve the tweets? Will the company be in a position to appreciate the ways in which Twitter can be used effectively if this record is lost? The danger, of course, is that a new media marketing agency will at some point in the future advise on how Twitter could be used – and fail to learn from the successes of the initial trial, perhaps even making as big a mistake as Habitat have done recently.

What To Do? When To Do It?

Are these illustrations of when the preservation of tweets is reasonable? If so, can you think of other examples which we need to consider?

And if there is a case to be made for preserving tweets, what tools should be used? I am aware of the Print Your Twitter service, but this just creates a PDF file of tweets matching the search criteria.  To summarise what I think is needed:

  •  Ability to create a structured repository of tweets from a Twitter ID
  •  Ability to create a structured repository of tweets matching a search term

Anyone know of software which can be used to do this? I’m aware that one could simply use Twitter to search for “http://bit.ly/w4b6Q” or “@bathcsc” but this doesn’t provide a persistent and reusable data store.
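As a rough sketch of what such a ‘persistent and reusable data store’ could look like, something as simple as an SQLite database covering both requirements might do; the schema below is purely illustrative and not an existing tool.

```python
# A rough sketch of a "structured repository" of tweets: a small SQLite
# store that can hold tweets collected either for a given Twitter ID or
# for a search term. Schema and field names are illustrative only.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id    TEXT PRIMARY KEY,
    author      TEXT NOT NULL,
    posted_at   TEXT NOT NULL,
    text        TEXT NOT NULL,
    collection  TEXT NOT NULL   -- e.g. 'user:bathcsc' or 'search:http://bit.ly/w4b6Q'
);
"""

def store_tweets(db_path, collection, tweets):
    """Insert tweets (dicts with id/author/posted_at/text) into the named collection."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?)",
        [(t["id"], t["author"], t["posted_at"], t["text"], collection) for t in tweets],
    )
    conn.commit()
    conn.close()
```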

ArchivePress: When One Size Doesn’t Fit All

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary – scholarly, factual, political, or personal – which merit keeping no less than famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
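The underlying idea can be illustrated with a few lines of Python using the open-source feedparser library – this is not ArchivePress itself, just a toy sketch of feed-based harvesting, and the feed URL is hypothetical.

```python
# Not ArchivePress itself, just a toy illustration of the underlying idea:
# read a blog's newsfeed and keep the structured post data (title, author,
# date, content) rather than a crawled snapshot of the rendered pages.
import feedparser  # open-source feed-parsing library

def harvest_posts(feed_url):
    """Return a list of structured post records taken from a blog feed."""
    feed = feedparser.parse(feed_url)
    posts = []
    for entry in feed.entries:
        posts.append({
            "title": entry.get("title", ""),
            "author": entry.get("author", ""),
            "published": entry.get("published", ""),
            "link": entry.get("link", ""),
            "content": entry.get("summary", ""),
        })
    return posts

if __name__ == "__main__":
    for post in harvest_posts("https://blog.example.ac.uk/feed/"):  # hypothetical feed
        print(post["published"], post["title"])
```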

In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.

ArchivePress has its own website and blog, which will build a cumulative picture of its results and the issues it encounters over the next 6 months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at the British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

Preserving Tweets: The CILIP2 Experience

At the  CILIP 2.0 open meeting in London (29th April 2009) delegates and remote participants were encouraged to tweet, using the #cilip2 tag, on issues relating to the role of Web 2.0 for CILIP members. These tweets were displayed on a screen using the Twitterfall client. Twitterfall is a way of viewing the latest ‘tweets’ of upcoming trends and custom searches on Twitter. Updates fall from the top of the page in near-realtime. The client is also a great way for remote participants to keep up-to-date with event posts.

Use of the tag was very popular and a number of blog posts have been written about its success including CILIP: More Popular Than Swine Flu!, Twittering Librarians, Twitter – better than a conference,  and CILIP 2.0 – Open Session.

Why?

Some might argue that the significant use of the tag could turn out to be a defining moment in CILIP’s movement towards engagement with Web 2.0. Those with an eye for preservation might argue that if this is the case we will want to preserve the tweets that embody this ‘crucial moment’.

How?

Luckily help was on hand in the form of Dave Pattern, systems manager at the University of Huddersfield Library. After the event Dave provided a good example of rapid software development (or in his words “a quick and dirty hack”!) when he quickly wrote a piece of software to harvest the tweets and make available a transcript of them. Dave used the Twitter search API to fetch the results (in JSON format) of the #cilip2 tag every 60 seconds. He then had to dedupe the results against those from the previous searches. Once he’d got all the tweets he wrote a quick Perl script to process them and generate some HTML.
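Dave’s code was written in Perl and isn’t reproduced here, but the workflow he describes might look roughly like the following Python sketch; note that the JSON search endpoint shown is the one Twitter offered at the time and has long since been retired.

```python
# A loose sketch (not Dave's actual code) of the workflow described above:
# poll the Twitter search API for a hashtag every 60 seconds, keep only
# tweets not seen before, then write a simple HTML transcript.
# The search endpoint is the historical, now-retired one.
import json
import time
import urllib.parse
import urllib.request

SEARCH_URL = "http://search.twitter.com/search.json"  # historical endpoint
seen_ids = set()
transcript = []

def poll(tag="#cilip2"):
    """Fetch the latest search results for the tag and add unseen tweets."""
    url = SEARCH_URL + "?" + urllib.parse.urlencode({"q": tag, "rpp": 100})
    with urllib.request.urlopen(url, timeout=30) as response:
        results = json.load(response).get("results", [])
    for tweet in results:
        if tweet["id"] not in seen_ids:  # dedupe against earlier polls
            seen_ids.add(tweet["id"])
            transcript.append(tweet)

def write_html(path="cilip2-transcript.html"):
    """Generate a minimal HTML transcript of the harvested tweets."""
    rows = [f"<p><b>{t['from_user']}</b> ({t['created_at']}): {t['text']}</p>"
            for t in sorted(transcript, key=lambda t: t["id"])]
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><body>" + "\n".join(rows) + "</body></html>")

if __name__ == "__main__":
    for _ in range(10):   # e.g. ten one-minute polls
        poll()
        time.sleep(60)
    write_html()
```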

He also provided a Wordle visualisation of the content of the tweets.

CILIP Tweets

This now means that the CILIP community continue to have access to:

  • A record of the discussions, annotated with the time of posting.
  • A Wordle cloud which provides a visual summary of the topics.

Software such as this could potentially be used by others to preserve tweets. For those interested Dave is happy to share the code.

It does, however, raise a number of questions: Should you keep a record of Twitter posts? If so, how would you go about doing it? Where should the responsibility lie? Should it be something event organisers need to consider? Is there a role for commercial companies to play or will this type of preservation just fall at the feet of interested participants (those keen on mashup creation)?

Is there scope for some sort of tool to be developed in this area? Should funding councils in specific areas be funding the creation of a more managed service? Should the tools be more generalised?

Some suggestions are already provided in the Handbook and on the JISC PoWR blog.

This blog post provides the background for one of two case studies which will be displayed on a poster on Preservation Policies and Approaches for Use of Social Web Services to be presented at the Missing links: the enduring web workshop to be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.

Missing links: the enduring web

The JISC PoWR team will be involved in the forthcoming Workshop on missing links: the enduring web.  The workshop is sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and organised by the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland). It will be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.

Richard Davis, ULCC, will be giving a presentation on Diamonds in the Rough: Capturing and Preserving Online Content from Blogs. Other members of the team will be presenting posters on the JISC-PoWR Project and on Preservation Policies and Approaches for Use of Web 2.0 Services.

In the next few posts we’ll describe in more detail what we’ll be covering. Online registration is still open and closes on Friday 10th July 2009. We hope to see you there…

New Study – Web Archives: Now and in the Future

A news item on The National Archives Web site has recently announced a new study on “Web Archives: Now and in the Future“. This study, which is funded by the JISC and will take place in collaboration with the UK Web Archiving Consortium, will look into how archived Web sites are collected and made available to users.

The study aims to:

  • Investigate how UK Web archives are delivered to users now, and how they might be delivered in the future
  • Define the long-term historical and research value of online content in the UK
  • Look at different organisations that collect Web archives, and their interests

The study will run until late July 2009, and the results will be published on The National Archives and UK Web Archiving Consortium Web sites in August 2009.

We’ll publish details on the availability of the study once it is published.

Digital Preservation and Nuclear Disaster: An Animation

DigitalPreservationEurope (DPE), an organisation which is “committed to making digital preservation materials available to the widest possible audience and to breaking down barriers to access”, has released the first in a series of short animations introducing and explaining digital preservation problems and solutions for the general public.

Not for everyone, I suspect, but I’m pleased to see a diversity of approaches being taken to explaining digital preservation concepts. And making it available on YouTube means that the animation can be easily used in a wide variety of contexts, such as being embedded in this blog post. What do you think?

“Seething With Anger” at the Demise of Geocities

A blog post entitled “The Death and Life of Geocities” has been published recently on the Adactio blog by Jeremy Keith, a Web developer living and working in Brighton, England. In the post Jeremy describes how he is “seething with anger” but then goes on to add that “I hope I can tap into that anger to do something productive“. The reason for the anger is his concern that “Yahoo are planning to destroy their Geocities property. All those URLs, all that content, all those memories will be lost …like tears in the rain“.

Although in an update to his post Jeremy does admit that “no data has been destroyed yet; no links have rotted” and that his “toys-from-pram-throwage may yet prove to be completely unfounded”, Jeremy is right to raise concerns regarding the recent announcement that “Yahoo [is] to shut down GeoCities”.

Some people, as illustrated by JR Raphael’s article in PC World entitled “So Long, GeoCities: We Forgot You Still Existed”, are not losing any sleep over GeoCities’ demise, whilst others, such as the Online Lunchpail blog, feel that “the demise of GeoCities … proves my point that the U.S. government never should have approved the takeover of GeoCities by Yahoo!”.

From my perspective I feel that the concerns raised by Jeremy Keith (who, it should be pointed out, is a professional Web developer) will become more widely appreciated as ordinary Web users, who might have used the first generation of public-facing Web-hosting services such as GeoCities for their initial simple Web development activities, realise that there may be sentimental attachments to one’s early work – just as I regret having lost my scrap book from primary school (I remember writing “When I grow up I want to be a Beatle, sing ‘She loves you, yer, yer, yer’ and earn £100 a week”). And what of the social historians – have we lost our cultural memories of the initial take-up of the Web outside of the universities and business sector?

In a blog post by Jason Scott on the ASCII “weblog of computer history, punditry and trivia”, Jason describes the efforts being made to preserve content published on GeoCities. But Jason admits that

I can’t do this alone. I’m going to be pulling data from these twitching, blood-in-mouth websites for weeks, in the background. I could use help, even if we end up being redundant. More is better. We’re in #archiveteam on EFnet. Stop by. Bring bandwidth and disks. Help me save Geocities. Not because we love it. We hate it. But if you only save the things you love, your archive is a very poor reflection indeed.”

What is to be done? Should the digital preservation of the general public’s digital heritage (as opposed to an institutional digital heritage) be left to volunteers? Or will future generations regard us as having failed in our responsibilities, as previous generations failed to preserve the built environment and left us with the soulless shopping centres and high-rise buildings which were developed during the 1960s?

Permission to capture Twitter

This blog has been occasionally concerned with issues surrounding the capture of Tweets, in their capacity as web-based resources which may or may not have value to the Institution as record or archive material. See Brian Kelly’s post of July 2008. The discussions have been based around (a) is it worth doing? And (b) is it technically possible to capture Tweets, and how? This post is concerned with a third issue, (c): do we have permission to do it, since twitter.com is a third-party hosting service? The same issue, in fact, applies to many Web 2.0 applications which involve members of staff working in the cloud or generally conducting University business outside of the network.

The PANDORA Librarian at the State Library of Queensland has just asked the question about permission. “Do we have a blanket permission for Twitter http://twitter.com/? That is, if we find a Twitter site we want to gather, and we get permission as well from say a politician or whoever has content on that site, can we go ahead and gather?”

The National Library of Australia, who have been involved in a library-centric web archiving programme for many years, replied with the following advice:

“We received advice from Twitter, when asking to archive the Prime Minister’s twitter page, which gives a clear indication that they have no problem with archiving,” said the NLA’s Manager of Web Archiving. “I have also been in contact with another Twitter staffer who also gave permission. So I think it is safe to assume that we may archive selectively from Twitter, depending upon receiving the permission of the Twitteree. However in the case of the PM, we didn’t seek that explicit permission as we have blanket permission, from his Party and the PM’s dept.”

And that advice from Twitter.com confirms this:

“Archiving the Prime Minister’s twitter updates is fine, since he owns the copyright to the messages that he posted. According to our Terms of Service:

Copyright (What’s Yours is Yours)

1. We claim no intellectual property rights over the material you provide to the Twitter service. Your profile and materials uploaded remain yours. You can remove your profile at any time by deleting your account. This will also remove any text and images you have stored in the system.

2. We encourage users to contribute their creations to the public domain or consider progressive licensing terms.

This is potentially a useful little development. It may send archivists in the direction of checking the terms of service of other third-party hosting organisations, such as Google Docs or SlideShare, to establish what’s permissible and what isn’t. If copyright and ownership issues are sidestepped or aren’t even mentioned, this could be a cause for concern.

In the meantime, presumably we must also give credit to our Australian colleague for devising the neologism “Twitteree”.

Archiving the US Election 2004 Web sites

The Library of Congress have recently made their US Election 2004 Web Archive available from the Library of Congress Web Archives site. The Election 2004 Web Archive is a selective collection of approximately 2,000 Web sites associated with the United States Presidential, Congressional, and gubernatorial elections. It is part of a continuing effort by the Library’s Web Archiving Project Minerva to evaluate, select, collect, catalogue, provide access to, and preserve digital materials for future generations of researchers.

The archived material includes blogs (such as blogs for Bush). Currently permission is necessary for offsite access for researchers. All archived Web sites are available to researchers onsite at the Library of Congress.

Metadata

At the Library of Congress they are currently providing metadata for individual Web sites through brief records using the MODS schema. There is a MARC collection level record (for the collection itself) with a link to an entry/overview page for each collection that links to search and browse functions with MODS metadata for each individual Web site that was collected.

An overview of their metadata approach (at the collection and item levels) is available. They are also in the process of developing more formal descriptive metadata profiles for their digital content and have developed one for the Library of Congress Web archives.
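To give a flavour of what a brief MODS record for an archived Web site might contain, here is a minimal hand-rolled Python sketch; it follows the general shape of the MODS schema but is not one of the Library of Congress’s actual records.

```python
# A minimal illustration of a brief MODS record for an archived Web site.
# The element set is a skeletal subset of MODS; the values are invented.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def brief_mods_record(title, url, collection):
    """Build a skeletal MODS record for a single archived Web site."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    ET.SubElement(mods, f"{{{MODS_NS}}}typeOfResource").text = "text"
    location = ET.SubElement(mods, f"{{{MODS_NS}}}location")
    ET.SubElement(location, f"{{{MODS_NS}}}url").text = url
    note = ET.SubElement(mods, f"{{{MODS_NS}}}note")
    note.text = f"Part of the {collection} Web archive collection."
    return ET.tostring(mods, encoding="unicode")

if __name__ == "__main__":
    print(brief_mods_record("Blogs for Bush (archived)",
                            "http://example.org/archived-site/",  # hypothetical
                            "US Election 2004"))
```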

For a list of publicly available Library of Congress Web archives and access to each, see the Library of Congress Web Archives site.

More information on activities at the Library of Congress is provided in a PowerPoint presentation given at the Digital Library Federation 2008 Fall Forum.

Call for Papers for the Workshop on “missing links: the enduring web”

We recently published a blog post about the workshop on “missing links: the enduring web”. Further information about this one-day event, which will take place on 21st July 2009 at the British Library Conference Centre, London, is now available.

Papers and posters are invited which address long term preservation issues in relation to Web content. Abstracts of not more than 300 words should be sent to the conference organisers by 8th May 2009.

“Your List Will Be Closed In One Week’s Time”

The dangers of reliance on externally-hosted Web 2.0 services have been mentioned previously. And there have been recent incidents in which companies have given a short period of notice of impending closure of services, with users having little time to migrate their data to alternative providers. A recent article in The Guardian (Thursday 2 April 2009) entitled “Can I assume that my online data is safe for ever?” addressed such concerns in the context of the closure of the Filefront.com service, which gave its users just 5 days to migrate their data.

Coincidentally I recently received the following email from a service I subscribe to:

Our previous request to you to provide a new owner for the  list has not produced a response.  Therefore, we assume the list is no longer useful and aim to close it in one week’s time.
We would be happy to provide a zipped copy of the archives and any files on deletion of the list, should they be required.

In this case it appears that the service has been little used for over a year. And yet what if useful information is still available on the service? Is a week’s notice enough for users of the service to consider the implications of this decision, identify appropriate solutions and then implement them? And let’s not forget that this email was sent outside of term time when researchers could be away.

The email did not make it clear whether the data was to be deleted, the service was to continue to be made available in a read-only mode, or the interface to the data simply hidden – all possible solutions if it is felt necessary for a little-used service to be withdrawn.

There’s still a need to establish the best practices when Web-based interfaces to services are to be removed, I feel. And such issues do not just affect the third party services outside of our community.

Workshop on “missing links: the enduring web”

The Digital Preservation Coalition have recently announced a 1-day workshop on “missing links: the enduring web” which will be held at the British Library Conference Centre on 21st July 2009. This event, which is being organised by the JISC and the UK Web Archiving Consortium in conjunction with the Digital Preservation Coalition, aims to “bring together key stakeholders – web managers, archive managers, preservation experts, national libraries, web archivists and content providers – for practical and focussed discussion on shared perspectives, requirements, problems and solutions. Formal presentations and case studies will be presented with an opportunity for posters and demonstrations of tools“.

We will provide information on the detailed programme and how to register when this information is published.