Category Archives: missinglinks09

Releasing the Herds of Cows – The Missing links Workshop

Yesterday many of those most committed to, and interested in, Web preservation attended the Missing links: the enduring web conference held at the British Library. The event, which had over 100 attendees (from academia, galleries, museums, libraries, archives, government, research groups and beyond), was an attempt to ‘develop and strengthen the links between content creators, tool developers, preservation services and users in order to secure an enduring web’.

If anyone can successfully preserve the Web, it’s this lot!

Session 1: Setting the scene

Keynote: Adrian Brown, Assistant Clerk of the Records, Parliamentary Archives

The opening keynote was given by Adrian Brown from the Parliamentary Archives, who did an adept job of setting the scene and the challenges ahead. His talk centred around the key jigsaw pieces of selection, capture, storage, access and preservation. He pointed out that the Web used to be primarily a publisher of fixed content, and that it is the new types of content (discussions on wikis, Twitter, blogs etc.) that offer the biggest challenge. Parliament themselves are using many new forms of information, and although permission is not always an issue, capturing content on other sites often is. You may own the content, but someone else often owns the presentation of that content. Adrian also emphasised the need for us to co-ordinate selection and be as explicit about what we are not going to capture as about what we are going to capture (the unpublishing button on Facebook being a classic example of how difficult it can be to go back!). Another major challenge is that of temporal cohesion – the rate of capture of content is often far slower than the rate of change of content.

He concluded with a Peter Murray-Rust quote that stresses the need for us to “Create and release herds of cows, not preserve hamburgers in a deepfreeze.”

Web Archive and Citation Repository in One: DACHS: Hanno Lecher, Librarian, Sinological Library, Leiden

Hanno Lecher talked about the citation problems he has encountered while working on the Digital Archive for Chinese Studies (DACHS), a resource that captures and archives relevant resources as a primary source for later research. His work has indicated the possible need for citation repositories. Some options might include verifying URL references, evaluating the reliability of online resources, use of PURLs, avoidance of obscure URLs and even the extreme of banning online resources as references. As such approaches are not always practical (or appropriate), he suggested the use of applications like Snagit, Zotero or WebCite.

The future of researching the past of the Internet: Eric T. Meyer, Research Fellow, Oxford Internet Institute, Oxford University

Eric Meyer’s work primarily looks at social behaviour and the Internet, and as part of the Transatlantic Digitisation Project he spent some time building an archive collection of World War 1 and 2 Internet resources. The process gave the team insight into the feasibility of creating such an archive. Lessons were also learnt about the need for involvement of domain experts (webometrics, SNA, eResearch, etc.) and the desire to move from snapshots to more continuous data. Eric concluded by stressing that what researchers really want is a global way to access archives, not a national one.

After the morning coffee break the workshop moved from the general to the specific.

Session 2: Creation, capture & collection

An overview of Web Archiving Tools by Helen Hockx-Yu, Web Archiving Programme Manager, The British Library

Helen Hockx-Yu provided a very practical overview of current tools, covering Heritrix, NetArchiveSuite, Web Curator Tool and PANDAS. She also discussed archival formats such as ARC and WARC, the latter being highly desirable from a long-term archival standpoint. Helen concluded with a brief discussion of the limitations and challenges harvesters present, from issues with rendering and dealing with ‘bad’ content to a reliance on open source tools that are still very much evolving.
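As a rough aside for anyone curious what those archival formats hold in practice, the sketch below lists the URLs captured in a hypothetical WARC file. It uses the open-source Python library warcio, which is my own choice for illustration and was not one of the tools covered in the talk.

```python
# Minimal sketch (assumption: the third-party 'warcio' library is installed).
# 'crawl.warc.gz' is a hypothetical file produced by a harvester such as Heritrix.
from warcio.archiveiterator import ArchiveIterator

with open('crawl.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the harvested HTTP responses
        if record.rec_type == 'response':
            uri = record.rec_headers.get_header('WARC-Target-URI')
            date = record.rec_headers.get_header('WARC-Date')
            print(date, uri)
```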

Context and content: Delivering Coordinated UK Web Archive to User Communities: Cathy Smith, Collections Strategy Manager, The National Archives

Cathy Smith and the UKWAC team have recently been working on a study entitled Delivering coordinated UK web archives, looking at what audiences web archives should anticipate, what this means for selection, ingest and preservation, and what researchers want from archives. The conclusions include the notion of a single view of an archive, though this does not necessarily mean a single repository. Cathy explained that “they just want the car to work – they don’t want to see what’s under the bonnet”. In response to the study the National Archives have recently begun automated whole-domain harvesting – shallow as well as selective harvesting – and they are looking at allowing all the archives to be searched together as a single collection.

Capture and Continuity: Broken links and the UK Central Government Web Presence: Amanda Spencer and Tom Storrar from the National Archives.

Amanda Spencer and Tom Storrar were set the task of sorting out the problem of broken links on government Web sites. 60% of URLs quoted in Hansard between 1997 and 2006 are broken, leaving the public unable to find information and with little faith in the authority of government Web sites. The Web Continuity project has begun to create an archive of all Central Government Web space. Some things they are trying out include XML site maps, using redirects instead of 404 messages and capturing all government Web sites before they are closed.
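To make the ‘redirects instead of 404 messages’ idea concrete, here is a minimal, purely illustrative Python (WSGI) sketch: requests for pages that no longer exist on the live site are redirected to an archived copy rather than answered with a 404. The archive base URL and the list of live pages are placeholders, and this is not the Web Continuity project’s actual implementation.

```python
# Hypothetical sketch of "redirect instead of 404": requests for pages that
# are no longer on the live site are redirected to an archived copy.
# Illustration only - not the Web Continuity project's actual code.
from wsgiref.simple_server import make_server

LIVE_PAGES = {'/': 'Home page', '/contact': 'Contact us'}  # stand-in for the live site
ARCHIVE_BASE = 'http://webarchive.example.gov.uk'          # placeholder archive URL

def app(environ, start_response):
    path = environ.get('PATH_INFO', '/')
    if path in LIVE_PAGES:
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [LIVE_PAGES[path].encode('utf-8')]
    # Instead of returning a 404, send the visitor to the archived copy of the URL
    start_response('302 Found', [('Location', ARCHIVE_BASE + path)])
    return [b'']

if __name__ == '__main__':
    make_server('', 8000, app).serve_forever()
```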

Discussion Session

In the questions and discussion session the panel were asked what advice they would give to Web managers on how they can make archiving of their sites more effective. The suggestions tackled content (following accessibility guidelines, W3C standards, making code valid, adding hard links to files next to streaming content, having transcripts of audio files) and communication (enabling a dialogue between Web archivists and Web creation people). It was noted that there is a fine line between encouraging good practice and stifling innovation and that at times communication with content creators is a luxury. Responsible harvesting also means explaining why you are doing it to those involved.

The discussion then moved to other possible points of capture, such as using the Google Cache or asking ISPs.

Session 3: Issues and approaches to long term preservation of web archives

After lunch the programme moved on from the specific problems of now to problems of the future.

Diamonds in the Rough: Capturing and Preserving Online Content from Blogs: Richard Davis, Project Manager, University of London Computing Centre (ULCC)

Richard Davis, a member of the JISC PoWR team, started off by emphasising the importance of blogs today. This has led him and colleague Maureen Pennock to consider the need for their preservation, particularly at institutional level, resulting in the ArchivePress project. Richard pointed out that traditional Web archiving tools can be difficult to configure and use: a ‘hammer to crack a nut’. A comment by Chris Rusbridge that “blogs represent an area where the content is primary and design secondary” suggested that using blog feeds might be key. The feed could be a point at which to gather blog content for archiving. The test bed for ArchivePress will include blogs from the DCC, Lincoln University and UKOLN. In the future ArchivePress could possibly be used to harvest Twitter.

Beyond Harvest: Long Term Preservation of the UK Web Archive: Maureen Pennock, Web Archive Preservation Project Manager, The British Library

Maureen Pennock started her talk by giving some general background to the UK Web Archive. Now that the archive has been established, the next task is preserving it, an ongoing activity. The main undertakings are documenting system dependencies, consideration of containers and metadata standards (following a review, WARC is the preferred format for preservation, but they are also using a combination of features from METS and PREMIS), preservation workflow and defining the preservation strategy. The future involves the consideration of many new areas – for example, should we preserve computer viruses?

From Web Page to Living Web Archive: Thomas Risse, Senior researcher, L3S Research Center

Thomas Risse introduced the LiWA (Living Web Archives) project, an EU FP7 research project. LiWA has been looking at building tools to address the specific challenges of the current generation Web, e.g. link extraction from dynamic pages.

Emulating access to the web 1.0: Jeffrey van der Hoeven, Koninklijke Bibliotheek, The Netherlands

Jeffrey van der Hoeven started off with a whirlwind history of the Web browser: from the WorldWideWeb browser in 1991, which was more like a document viewer, through Mosaic and Netscape, to Firefox and Google Chrome today. Today’s browsers do many things, from rendering Web sites, bookmarking and caching to authentication and providing RSS feeds. They have changed from viewers to all-round workplaces – Google Wave is an extreme example of this. The browser is currently the central application of the computer and there is a need for a sustainable solution to render Web sites. At the Koninklijke Bibliotheek they have recently been working on the Dioscuri project, a durable emulator. They have also been working with the Planets project using Global Remote Access to Emulation (GRATE) and on the Keep Emulation Environments Portable (KEEP) project.

After another coffee break we moved on to the final discussions.

Session 4: Discussion and next steps

What we want with web-archives; will we win? Kevin Ashley, Head of Digital Archives, University of London Computer Centre (ULCC)

Kevin Ashley, also part of the JISC PoWR team, kicked off his talk by introducing us to the future of the Web, Web 8.0. Although Kevin was just joking with us, his point was that when the next major change in the Web happens we will need a much better metaphor than ‘Web 3.0’: it is the future that we need to think about. We need to be asking the key question of what we want from Web archives, e.g. potential use cases. Preserving the Web is not just about content. Sometimes it will be useful to look at Web content in aggregate, e.g. for visualisation, linguistic analysis, etc. Right now many academics are doing their own Web archiving – content, presentation or both. However we should not take just a document-centred view of data, but also consider the web of data and data about the web. A lot of future interest will be not in Web content, but in the context of its use.

The day concluded with a round-table discussion with many of the speakers from the day. Among the topics discussed was the permissions challenge (primarily with regard to UKWAC and similar initiatives); other countries and initiatives favour a more ‘just do it’ approach.

The day left most people enthused and positive about the future for Web archiving.

A conference report will follow on the DPC Web site and will be sent out via the dpc-discussion and digital-preservation JISCmail lists.

All presentations are now available in PDF Format from the DPC Web site.

WAC09 was used as the Twitter tag for the event. You can see the tweets at http://hashtags.org/tag/wac09/messages.

The event was sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland).

Posters for Missing Links

Today was the day of the Missing Links: The enduring Web conference, held at the British Library Conference Centre, London. A trip report on the event will follow.

The JISC PoWR team presented two posters, one provided a general overview of the project and the other discussed Preservation Policies and Approaches for Use of Social Web Services.

The content of the second poster has been mentioned in a number of previous blog posts.

Although the poster was submitted to the British Library it was unfortunately mislaid, so here is the actual content.

Preservation Policies and Approaches for Use of Social Web Services

Preservation Policies for Content Hosted on Third Party Blogs

We know that a variety of externally-hosted Web 2.0 services are being used to support institutional aims. But what about the associated risks of loss of important resources? One approach to such risk would be to ban use of such services. But this would ignore the benefits that such services provide, and would be likely to alienate their users even if it were possible to implement such a policy.

The approach taken by the JISC PoWR project (and described in more detail in the JISC PoWR handbook – see particularly chapter 13) has been to recognise that there are legitimate reasons to make use of such services and to look at ways in which content hosted on such services can be managed and curated in the long term.

The need to do this is of relevance to UKOLN, which provides a number of blogs on externally-hosted services including the UK Web Focus blog, provided by Brian Kelly, and the Ramblings of a Remote Worker blog, provided by Marieke Guy.

The first stage is to define and publicise a policy covering the long-term access to the content of these two blogs, including what will happen if either of the authors leaves UKOLN.

Policies posted on the UK Web Focus and Ramblings of a Remote Worker blogs state that:

  • A rich copy of the contents of the blog will be made available to UKOLN (my host organisation) if I leave. Note that this may not include the full content if there are complications concerning third party content (e.g. guest blog posts, embedded objects, etc.), technical difficulties, etc.
  • Since the blog reflects personal views I reserve the right to continue providing the blog if I leave UKOLN. If this happens I will remove any UKOLN branding from the blog.

These two simple statements can help, we feel, in ensuring that the content can be managed if the blog authors leave (or if they fall ill, go crazy or die!). The statements seek to avoid uncertainties regarding what can be done with the content.  The second statement also clarifies that if the authors were to leave, they may wish to continue using the blog.

It may be argued that since both blogs make their content available under a Creative Commons licence this already grants the host institution, along with anyone else, the rights to preserve the content. This may be true, but there is no harm in making this explicit, we feel.

Would it not be legitimate for organisations to expect their employees to make similar statements which clarify the ownership of content hosted on Web 2.0 services and created in the normal course of one’s duties?


Note: This blog post has been written to support a poster which will be provided for the Missing links: the enduring web conference. The poster, which has been produced by Marieke Guy and Brian Kelly, UKOLN, is entitled “Preservation Policies and Approaches for Use of Web 2.0 Services“. A series of blog posts published on this blog provide more detailed information of the content summarised in the poster.

Tools For Preserving Twitter Posts

I recently described Some Use Cases For Preserving Twitter Posts, including preservation of an organisation’s digital memory and preservation of data for subsequent data mining. The post, however, failed to include perhaps the most obvious example: preservation of Twitter posts (tweets) related to an event.

In response to that post a number of solutions for preserving tweets were suggested, including FriendFeed, the WordPress Lifestream plugin and What the Hashtag. In addition, following a tweet I posted, I received details of the Tweetdoc service.

With this year’s Institutional Web Management Workshop (IWMW 2009) rapidly approaching it is timely to decide on the tools we’ll be using to preserve the discussions associated with this event. We began keeping a record of the amplification of the IWMW event back in 2005, when an IRC channel was provided for use by the small number of participants who had a laptop and WiFi connectivity at the event. The IWMW 2005 event took place during the 7/7 bombings in London, and a record of the awareness of what was happening can be gleaned from the archive of the IRC discussions.

This year we will once again be making use of Twitter and will be capturing tweets which contain the event hashtag #iwmw2009. The tools we are currently intending to use are What the Hashtag and the Tweetdoc service.

Use of What the Hashtag to view tweets containing the #iwmw2009 tag is illustrated.

As can be seen, the output can be accessed as an RSS feed. In addition the data can be viewed as an HTML resource, and a date range can also be supplied.

We intend to capture the tweets about the IWMW 2009 event after the event is over, and store the data on the UKOLN Web site, in order to avoid dependencies on the What the Hashtag service itself.
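By way of illustration, a capture along these lines could be as simple as the following sketch: download the RSS output once, keep the raw feed as the master copy, and write a small structured extract alongside it. The feed URL and filenames are placeholders rather than the real What the Hashtag addresses.

```python
# Sketch only: capture the RSS output of a hashtag service and keep a local
# copy, so the record does not depend on the service remaining available.
# The feed URL below is a placeholder, not the actual What the Hashtag address.
import json
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = 'http://example.org/hashtag/iwmw2009/rss'  # hypothetical feed URL

with urllib.request.urlopen(FEED_URL) as response:
    raw = response.read()

# Keep the raw feed as the master copy...
with open('iwmw2009-feed.xml', 'wb') as f:
    f.write(raw)

# ...and a simple structured extract for later analysis.
items = [
    {
        'title': item.findtext('title'),
        'link': item.findtext('link'),
        'published': item.findtext('pubDate'),
    }
    for item in ET.fromstring(raw).iter('item')
]
with open('iwmw2009-tweets.json', 'w') as f:
    json.dump(items, f, indent=2)
```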

We will also explore other services, such as Tweetdoc – although in this case, as the data is only available as a PDF resource, it is not well suited to providing data for analysis by other services.

Are there any other services we should be looking at? And what functionality might be desirable for a more generic service for preserving tweets? Any thoughts?


Note: This blog post has been written to support a poster which will be provided for the Missing links: the enduring web conference. The poster, which has been produced by Marieke Guy and Brian Kelly, UKOLN, is entitled “Preservation Policies and Approaches for Use of Web 2.0 Services“. A series of blog posts published on this blog provide more detailed information of the content summarised in the poster.

Some Use Cases For Preserving Twitter Posts

I’ve come across two schools of thought regarding the preservation of Twitter posts (tweets). For Twitter fans, it’s a really useful form of informal conversation, with a particular strength being the way in which messages quickly disappear. It’s like having conversations in your favourite bar – and we wouldn’t want such conversations preserved, so why would we wish to preserve tweets? Sceptics of Twitter might regard it as inane babble in a bar and would agree that it’s pointless to preserve it.

However this week I have come across a couple of examples which illustrate why it might be useful to preserve tweets.

Preservation of Twitter data for research purposes

I recently described how Twitterers Subvert[ed] Daily Mail’s Racist Poll. For me that provided a fascinating example of how Twitter can be used by activists to support a viral campaign (in this case to undermine the racist sentiments in the question “Should the NHS allow gipsies to jump the queue?“). But why did this poll succeed in attracting a large turnout (with the poll being undermined, with around 94% voting yes before the Daily Mail removed it) whereas a subsequent poll on Should immigrants be forced to respect British culture? failed to arouse much interest?

Since many of the tweets and retweets provided a link to “http://bit.ly/w4b6Q” we can look at the statistics provided by the bit.ly short URL service, which tell us that 48,796 people clicked on the link.

Statistics for clicks on the Daily Mail poll. As shown in the histogram, and as might be expected, the vast majority of clicks on the poll took place on 19 June 2009, the day the poll was the hot topic.

But who created the interest in the Twitterverse originally? Was it, as I speculated in my post, the UK-based psychologist community? Or were there a number of starting points? How was the message communicated? Who were the ‘influencers’ (to use the terminology from a recent blog post by Martin Weller) whose tweets were felt to contain links worth following? Who were the ‘hubs’ responsible for communicating the message to a wide audience? And was Twitter actually the main driver, or were there other viral mechanisms in operation, as suggested by Josh, who went on to add “we must keep perspective and not neglect other areas or over-celebrate Twitter as a comms tool!“.

In a recent post on “Influence, connections and outputs” Andy Powell suggested that we want “some sensible measure of scholarly impact on the social Web” and that we need to understand “what do we want to measure; what we can measure and how we can bring these two things close enough together to create something useful”. Although I’m not suggesting that the Twitter responses to the Daily Mail poll have any scholarly impact, I do think that this could provide us with an opportunity to understand how ideas (whether scholarly, political or simply marketing) are communicated and transmitted by Twitter. But if we wish to make use of this particular example, we will need to preserve the Twitter stream – and we will need to preserve the structured data (the Twitterer ID, the time, the IDs of those being retweeted and the numbers of followers) in order to carry out any rich analysis.

Preservation of an organisation’s memory

As recently described on the This is Bath blog, the Bus station Twitter project ends – today. The bathcsc Twitter account, and accompanying blog, were set up to keep users of the First bus service in Bath informed of news – including delays to bus services. Sadly First’s head office in Bath has announced that the company “is having to make changes to its business structure. As part of making the company more cost efficient, the customer services function for Bath Bus Station will now be handled by a specialist, area-wide centre based at Exeter“.

This news has not gone down well with many of the fans of this innovative use of social networking tools (and I should add that, as a daily bus user, I personally found the Twitter service very useful), as can be gauged from the comments on the blog post announcing the demise of the service and on the Bus station Twitter project ends article.

In the article a spokeswoman said that the company was hoping to learn lessons from the nine-month experiment:

The Bath Twitter trial attracted an enormous amount of interest, both inside and outside the organisation. We are now looking at the lessons we have learnt, and benefits associated with the system, with a view to developing a national social media approach to customer services for the future.

But is the company in a position to preserve the tweets? Will the company be in a position to appreciate the ways in which Twitter can be used effectively if this record is lost? The danger, of course, is that a new media marketing agency will at some point in the future advise on how Twitter could be used – and fail to learn from the successes of the initial trial, perhaps even making as big a mistake as Habitat have done recently.

What To Do? When To Do It?

Are these illustrations of when preservation of tweets is reasonable? If so, can you think of other examples which we need to consider?

And if there is a case to be made for preserving tweets, what tools should be used? I am aware of the Print Your Twitter service, but this just creates a PDF file of tweets matching the search criteria.  To summarise what I think is needed:

  •  Ability to create a structured repository of tweets from a Twitter ID
  •  Ability to create a structured repository of tweets matching a search term

Anyone know of software which can be used to do this? I’m aware that one could simply use Twitter to search for “http://bit.ly/w4b6Q” or “@bathcsc” but this doesn’t provide a persistent and reusable data store.

ArchivePress: When One Size Doesn’t Fit All

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary – scholarly, actual, political, or personal – which merit keeping no less than famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
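For readers who want a feel for the general approach (rather than ArchivePress itself, which takes the form of WordPress plugins), here is a rough sketch of feed-driven capture: poll a blog’s newsfeed and store any posts not already held. The feed URL and filenames are invented for the example, and it assumes the third-party feedparser library.

```python
# Rough sketch of feed-driven blog capture (not ArchivePress itself).
# Assumes the third-party 'feedparser' library; the feed URL and the
# filename of the local store are invented for the example.
import json
import os
import feedparser

FEED_URL = 'http://blogs.example.ac.uk/somelab/feed/'  # hypothetical blog feed
STORE = 'archived-posts.json'

archive = json.load(open(STORE)) if os.path.exists(STORE) else {}

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    key = entry.get('id') or entry.get('link')
    if key and key not in archive:            # only keep posts we have not seen before
        archive[key] = {
            'title': entry.get('title'),
            'author': entry.get('author'),
            'published': entry.get('published'),
            'content': entry.get('summary'),  # full content where the feed supplies it
        }

with open(STORE, 'w') as f:
    json.dump(archive, f, indent=2)
```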

In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.

ArchivePress has its own website and blog, which will build a cumulative picture of its results and the issues it encounters over the next six months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at the British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

Preserving Tweets: The CILIP2 Experience

At the  CILIP 2.0 open meeting in London (29th April 2009) delegates and remote participants were encouraged to tweet, using the #cilip2 tag, on issues relating to the role of Web 2.0 for CILIP members. These tweets were displayed on a screen using the Twitterfall client. Twitterfall is a way of viewing the latest ‘tweets’ of upcoming trends and custom searches on Twitter. Updates fall from the top of the page in near-realtime. The client is also a great way for remote participants to keep up-to-date with event posts.

Use of the tag was very popular and a number of blog posts have been written about its success including CILIP: More Popular Than Swine Flu!, Twittering Librarians, Twitter – better than a conference,  and CILIP 2.0 – Open Session.

Why?

Some might argue that the significant use of the tag could turn out to be a defining moment in CILIP’s movement towards engagement with Web 2.0. Those with an eye for preservation might argue that if this is the case we will want to preserve the tweets that embody this ‘crucial moment’.

How?

Luckily help was on hand in the form of Dave Pattern, systems manager at the University of Huddersfield Library. After the event Dave provided a good example of rapid software development (or in his words “a quick and dirty hack“!) when he quickly wrote a piece of software to harvest the tweets and make available a transcript of them. Dave used the Twitter search API to fetch the results (in JSON format) of the #cilip2 tag every 60 seconds. He then had to dedupe the results against the previous searches. Once he’d got all the tweets he wrote a quick Perl script to process them and generate some HTML.
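For anyone curious, the pattern Dave describes might look something like the sketch below (his original was a Perl script, so this Python version is purely illustrative). It polls the Twitter search API for the tag, de-duplicates on the tweet ID and stores the results; the endpoint shown is the public search API as it existed at the time, which has since been retired, so treat the URL and response layout as assumptions.

```python
# Illustrative Python version of the polling-and-dedupe pattern described
# above (Dave's original was Perl). The endpoint is the historical public
# Twitter search API; the URL and response layout are assumptions here.
import json
import time
import urllib.parse
import urllib.request

QUERY = '#cilip2'
SEARCH_URL = 'http://search.twitter.com/search.json?q='  # historical endpoint
seen = {}  # tweet id -> tweet data

for _ in range(10):  # in practice this polled for the duration of the event
    url = SEARCH_URL + urllib.parse.quote(QUERY)
    with urllib.request.urlopen(url) as response:
        results = json.load(response).get('results', [])
    for tweet in results:
        seen.setdefault(tweet['id'], tweet)  # dedupe on the tweet id
    time.sleep(60)  # fetch roughly every 60 seconds, as in the original

with open('cilip2-tweets.json', 'w') as f:
    json.dump(list(seen.values()), f, indent=2)
```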

He also provided a Wordle visualisation of the content of the tweets.

CILIP Tweets

This now means that the CILIP community continue to have access to:

  • A record of the discussions, annotated with the time of posting.
  • A Wordle cloud which provides a visual summary of the topics.

Software such as this could potentially be used by others to preserve tweets. For those interested Dave is happy to share the code.

It does however raise a number of questions: Should you keep a record of Twitter posts? If so, how would you go about doing it? Where should the responsibility lie? Should it be something event organisers need to consider? Is there a role for commercial companies to play, or will this type of preservation just fall at the feet of interested participants (those keen on mashup creation)?

Is there scope for some sort of tool to be developed in this area? Should funding councils in specific areas be funding the creation of a more managed service? Should the tools be more generalised?

Some suggestions are already provided in the Handbook and on the JISC PoWR blog.

This blog post provides the background for one of two case studies which will be displayed on a poster on Preservation Policies and Approaches for Use of Social Web Services to be presented at the Missing links: the enduring web workshop to be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.

Missing links: the enduring web

The JISC PoWR team will be involved in the forthcoming Workshop on missing links: the enduring web.  The workshop is sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and organised by the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland). It will be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.

Richard Davis, ULCC, will be giving a presentation on Diamonds in the Rough: Capturing and Preserving Online Content from Blogs. Other members of the team will be presenting posters on the JISC-PoWR Project and on Preservation Policies and Approaches for Use of Web 2.0 Services.

In the next few posts we’ll describe in more detail what we’ll be covering. Online registration is still open and closes on Friday 10th July 2009. We hope to see you there…