JISC-PoWR

Preservation of Web Resources: a JISC-sponsored project

Archive for June, 2009

Some Use Cases For Preserving Twitter Posts

Posted by Brian Kelly on 26th June 2009

I’ve come across two schools of thought regarding the preservation of Twitter posts (tweets). From Twitter fans, it’s a really useful form of informal conversation, with a particular strength being the way in which messages quickly disappear. It’s like having conversations in your favourite bar - and we wouldn’t want such conversations preserved, so why would we wish to preserve tweets? Sceptics of Twitter might regard Twitter as inane babble in a bar and would agree that it’s pointless to preserve it.

However this week I have come across a couple of examples which illustrate why it might be useful to preserve tweets.

Preservation of Twitter data for research purposes

I recently described how Twitterers Subvert[ed] Daily Mail’s Racist Poll. For me that provided a fascinating example of how Twitter can be used by activists to support a viral campaign (in this case to undermine the racist sentiments in the question “Should the NHS allow gipsies to jump the queue?“). But why did this poll succeed in attract a large turnout (with the poll being undermined with around 94% voting yes before the Daily Mail removed the poll) whereas a subsequent poll on Should immigrants be forced to respect British culture? failed to arouse much interest?

Since many of the tweets and retweets provided a link to “http://bit.ly/w4b6Q” we can look at the statistics provided by the bit.ly short URL service which tells us that 48,796 clicked on the link.

Statistics for clicks to Daily Mail pollAs shown in the histogram as might be expected the vast majority of clicks to the poll took place on 19 June 2009, the day the poll was the hot topic.

But who created the interest in the Twitterverse originally? Was it, as I speculated in my post, the UK-based psychologist community? Or where there a number of starting points? How was the message communicated? Who where the ‘influencers’ (to use the terminology from a recent blog post by Martin Weller) whose tweets were felt to contain links worth following? Who were the ‘hubs’ who were responsible for communicated the message to a wide audience? And was Twitter actually the main driver or were their other viral mechanisms in operation as Josh suggested, who went on to add “we must keep perspective and not neglect other areas or over-celebrate Twitter as a comms tool!“.

In a recent post by Andy Powell on “Influence, connections and outputs” Andy  suggested that we want “some sensible measure of scholarly impact on the social Web” and that we need to understand”what do we want to measure; what we can measure and how can bring these two things close enough together to create something useful”. Although I’m not suggesting that the Twitter responses to the Daily Mail poll have any scholarly impact, I do think that this could provide us with an opportunity to understand how ideas (whether scholarly, political or simply marking) are communicated and transmitted by Twitter. But if we wish to make use of this particular example, we will need to preserve the Twitter stream - and we will need to preserve the structured data (the Twitterer ID, the time, the IDs of those being retweeted and the numbers of followers) in order to carry out any rich analysis.

Preservation of an organisation’s memory

As recently described on the This is Bath blog the Bus station Twitter project ends - today. The bathcsc Twitter account and accompanying blog, were set up to keep users of the First bus service in Bath informed of news - including delays to bus services. Sadly First’s head office in Bath  have announced that the company “is having to make changes to its business structure. As part of making the company more cost efficient, the customer services function for Bath Bus Station will now be handled by a specialist, area-wide centre based at Exeter“.

This news has not gone down well with many of the fans of this innovation use of social networking tools (and I should add that, as a daily bus user, I personally found the Twitter service very useful) as can be gauged from the comments on the blog to the announcement of the demise of the service and on the Bus station Twitter project ends article.

In the article a spokeswoman said that the company was hoping to learn lessons from the nine-month experiment:

The Bath Twitter trial attracted an enormous amount of interest, both inside and outside the organisation. We are now looking at the lessons we have learnt, and benefits associated with the system, with a view to developing a national social media approach to customer services for the future.

But is the company in a position to preserve the tweets? Will the company be in a position to appreciate the ways in which Twitter can be used effectively if this record is lost? The danger, of course, is that a new media markeing agency wil sometime in the future advise on how Twitter could be used - and fail to learn from the successes of the initial trial, perhaps even making as big a mistake as Habitat have done recently.

What To Do? When To Do It?

Are these illustrations of when preservation of Twitter tweets reasonable? If so, can you think of other examples which we need to consider?

And if there is a case to be made for preserving tweets, what tools should be used? I am aware of the Print Your Twitter service, but this just creates a PDF file of tweets matching the search criteria.  To summarise what I think is needed:

  •  Ability to create a structured repository of tweets from a Twitter ID
  •  Ability to create a structured repository of tweets matching a search term

Anyone know of software which can be used to do this? I’m aware that one could simply use Twitter to search for “http://bit.ly/w4b6Q” or “@bathcsc” but this doesn’t provide a persistent and reusable data store.

Posted in missinglinks09mg, missinglinks09, Web 2.0 | 10 Comments »

ArchivePress: When One Size Doesn’t Fit All

Posted by Richard M. Davis on 24th June 2009

ArchivePress (logo)JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary - scholarly, actual, political, or personal - which merit keeping no less than famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools - like Web Curator Tool and PANDAS - that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static - as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software - WordPress, of course - to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily setup by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.

In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders - except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.

ArchivePress has its own website and blog, that will build a cumulative picture of its results and the issues it encounters over the next 6 months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) - and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

Posted in missinglinks09, Challenges, Software, Web 2.0, Events | 9 Comments »

Preserving Tweets: The CILIP2 Experience

Posted by Marieke Guy on 19th June 2009

At the  CILIP 2.0 open meeting in London (29th April 2009) delegates and remote participants were encouraged to tweet, using the #cilip2 tag, on issues relating to the role of Web 2.0 for CILIP members. These tweets were displayed on a screen using the Twitterfall client. Twitterfall is a way of viewing the latest ‘tweets’ of upcoming trends and custom searches on Twitter. Updates fall from the top of the page in near-realtime. The client is also a great way for remote participants to keep up-to-date with event posts.

Use of the tag was very popular and a number of blog posts have been written about its success including CILIP: More Popular Than Swine Flu!, Twittering Librarians, Twitter - better than a conference,  and CILIP 2.0 – Open Session.

Why?

Some might argue that the significant use of the tag could turn out to be a defining moment in CILIP’s movement towards engagement with Web 2.0. Those with an eye for preservation might argue that if this is the case we will want to preserve the tweets that embody this ‘crucial moment’.

How?

Luckily help was on hand in the form of Dave Pattern, systems manager at the University of Huddersfield Library. After the event Dave provided a good example of rapid software development (or in his words “a quick and dirty hack“!) when he quickly wrote a piece of software to harvest the tweets and make available a transcript of them. Dave used the Twitter search API to fetch the results (in JSON format) of the #cilip2 tag every 60 seconds. He then had to dedupe the results from the previous seaches. Once he’d got all the tweets he wrote a quick Perl script to process them and generate some HTML.

He also provided a Wordle visualisation of the content of the tweets.

CILIP Tweets

This now means that the CILIP community continue to have access to:

  • A record of the discussions, annotated with the time of posting.
  • A Wordle cloud which provides a visual summary of the topics.

Software such as this could potentially be used by others to preserve tweets. For those interested Dave is happy to share the code.

If does however raise a number of questions: Should you keep a record of Twitter posts? If so, how would you go about doing it? Where should the responsibility lay? Should it be something event organisers need to consider? Is there a role for commercial companies to play or will this type of preservation just fall at the feet of interested participants (those keen on mashup creation)?

Is there scope for some sort of tool to be developed in this area? Should funding councils in specific areas be funding the creation of a more managed service? Should the tools be more generalised?

Some suggestions are already provided in the Handbook and on the JISC PoWR blog:

This blog post provides the background for one of two case studies which will be displayed on a poster on Preservation Policies and Approaches for Use of Social Web Services to be presented at the Missing links: the enduring web workshop to be held on Tuesday 21st July 2009 at the at the British Library Conference Centre, London.

Posted in missinglinks09mg, missinglinks09, Events | 1 Comment »

Missing links: the enduring web

Posted by Marieke Guy on 11th June 2009

The JISC PoWR team will be involved in the forthcoming Workshop on missing links: the enduring web.  The workshop is sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and organised by the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland). It will be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.

Richard Davis, ULCC, will be giving a presentation on Diamonds in the Rough: Capturing and Preserving Online Content from Blogs. Other members of the team will be presenting posters on the JISC-PoWR Project and on Preservation Policies and Approaches for Use of Web 2.0 Services.

In next few posts we’ll describe in more detail what we’ll be covering. Online registration is still open and closes on Friday 10th July 2009. We hope to see you there…

Posted in missinglinks09mg, missinglinks09, Digital preservation, Events | 1 Comment »