Category Archives: Web 2.0

Making any Upgrades to your Blog Sir?

This blog is hosted by JISC Involve who provide blogs for the JISC community.

Until recently JISC Involve was running on an old version of WordPress (1.2.5). Earlier this month the JISC Digital Communications Team upgraded their server to the latest version of WordPress (2.9.2) and then migrated all the JISC Involve blogs over to the new installation.

Although all blog posts, comments, attachments, user accounts, permissions and customisations were supposed to move over easily, JISC Involve users were encouraged to back up the content of drafts etc. ‘just in case’.

Unfortunately there were some technical problems migrating the content and as a consequence the original theme was lost and URLs now redirect.

Luckily the JISC PoWR team were able to locate the original theme and reinstall it.

However the process has made them aware of the need to record details of the technical components and architecture of the blog. This information can be critical in a migration process and when ‘closing down’ a blog.

The JISC PoWR team will ensure that such information is routinely recorded.
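As a rough illustration of what such a record might look like, the sketch below writes the technical details of a blog to a small JSON manifest. The field names are our own suggestion rather than any standard, and the theme, plugin and permalink values are placeholders, not the real settings of this blog; only the platform, version and host come from the account above.

```python
import json


def blog_manifest(platform, version, host, theme, plugins, permalink_pattern):
    """Collect the technical details worth recording before a migration."""
    return {
        "platform": platform,
        "version": version,
        "host": host,
        "theme": theme,
        "plugins": sorted(plugins),
        "permalink_pattern": permalink_pattern,
    }


manifest = blog_manifest(
    platform="WordPress",
    version="2.9.2",
    host="JISC Involve",
    theme="example-theme",            # placeholder, not the real theme name
    plugins=["example-plugin"],       # placeholder plugin list
    permalink_pattern="/%year%/%monthnum%/%postname%/",  # placeholder pattern
)

# A plain-text JSON file is easy to keep alongside a backup of the content.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

Keeping the manifest as plain JSON means it can be read without the blog software itself, which is the situation one is likely to be in after a failed migration.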

Is there any other information that is important for preservation or migration purposes?

The Library of Congress Twitter Archive

Two weeks ago the Library of Congress announced that they will be archiving all public tweets since Twitter began. The tweets have been given to the library as a ‘gift’ from Twitter.

The announcement was fittingly made on Twitter.

Yesterday the Library of Congress blog published a list of FAQs about the approach they will be taking.

The FAQ explains:

  • Why is it important to preserve the Twitter archive?
    It sees Twitter as part of the historical record of communication, news reporting, and social trends – all of which complement the Library’s existing cultural heritage collections.
  • What is in the Archive?
    Public information. Not private account information or deleted tweets.
  • What does the Library plan to do with the archive?
    Its aims are preserving access to the archive for the long term and making data available to researchers.

What’s the average lifespan of a Web page?

…or is it easier to ask how long is a piece of string?

The statistic much bandied about (for Web pages not pieces of string!) is 44 days, believed to originate in an article by Brewster Kahle (of Internet Archive fame) published in 1997 and titled Preserving the Internet. Brewster’s original quote is specifically about URLs: “…estimates put the average lifetime for a URL at 44 days.”

Whether this figure still stands today is a matter currently being discussed on the CURATORS@LIST.NETPRESERVE.ORG list after a query from Abigail Grotke of the Library of Congress.

Abbie offered up the 44 day statistic and pointed out that on the Digital Preservation Web site they have a graphic that discusses Web volatility stating “44% of the sites available on the internet in 1998 had vanished one year later”.

The other figure often cited is 75 days, from Michael Day’s report Collecting and preserving the world wide web:

The dynamic nature of the Web means that pages and whole sites are continually evolving, meaning that pages are frequently changed or deleted. Alexa Internet once estimated that Web pages disappear after an average time of 75 days. (Lawrence, et al.,2001, p. 30).

Another figure sometimes suggested is 100 days. This seems to come from Rick Weiss’s article for The Washington Post (Washington, DC, 24 November 2003), On the Web, Research Work Proves Ephemeral, which is no longer available.

So what is the average lifespan of a Web page today? Is it getting shorter or longer? The Internet Archive now gives 44 to 75 days as its ballpark figure. I’d have to hazard a guess that with the rise in use of Web 2.0 technologies the Web is actually getting more transient by the day.

Is this OK?

Maybe, if it’s just a tweet you sent your friend. However, if it’s something more substantial that’s disappearing then it’s a real worry.

An Archive Of IWMW 2009 Tweets

In a recent blog post entitled Tools For Preserving Twitter Posts I described some of the Twitter preservation tools we were planning to use to keep a record of the tweets related to UKOLN’s recent IWMW 2009 event.

Twitter proved very popular during this annual event for institutional Web managers, with over 1,500 Twitter posts (tweets) being published during the last week of July. Further statistical information is provided in a post on Evidence on Use of Twitter for Live Blogging.

We suggested that a two character code (P1 to P8)  could be used to identify each plenary session and that using this as a hashtag in conjunction with the event’s hashtag (#iwmw2009) would enable the tweets about a particular talk to be easily identified and, in theory, this data migrated to a managed environment.
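The idea of combining the event hashtag with a session code can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the tweets are plain strings, the sample tweets are invented, and the function name is hypothetical.

```python
import re
from collections import defaultdict

EVENT_TAG = "iwmw2009"
SESSION_CODES = {f"p{n}" for n in range(1, 9)}  # P1 to P8, compared in lower case

HASHTAG_RE = re.compile(r"#(\w+)")


def group_by_session(tweets):
    """Return {session_code: [tweet, ...]} for tweets carrying the event tag."""
    sessions = defaultdict(list)
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if EVENT_TAG not in tags:
            continue  # not about the event at all
        for code in sorted(tags & SESSION_CODES):
            sessions[code.upper()].append(text)
    return dict(sessions)


# Invented sample tweets, for illustration only.
sample = [
    "Great opening talk #iwmw2009 #P1",
    "Lunch was good #iwmw2009",
    "Interesting points on the cloud #IWMW2009 #p3",
    "Unrelated #P1 tweet",
]
grouped = group_by_session(sample)
print(grouped)
```

Tweets that carry the event tag but no session code are left out here; a fuller version might file them under a general heading instead.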

As an example you can search for the tweets related to:

We have recently used The Archivist desktop application to create a local copy of the tweets for the plenary  talks at the conference, and these have been made available on the IWMW 2009 Web site from the individual pages for the plenary talks (e.g. see the page for Derek Law’s opening plenary talk). The pages also contain a summary of the number of Twitter posts which were found using the tool.

One reason for wishing to do this is to provide an answer to the speaker who may ask “I Wonder What They Thought About My Session?”.

Preservation and Google Wave

A number of scientists have written enthusiastic blog posts about the potential of Google Wave including Peter Murray-Rust, Cameron Neylon and several others. A post entitled Google Wave: possibilities for librarians on the Rambling Librarian blog provides a useful summary of Google Wave and how it aims to provide a response to the question “What might email be like if it was invented today?”

The Rambling Librarian post also picks up on the important “implication … that digital preservation will be even more critical. Imagine all the collaborative efforts gone when the server crashes. Or power fails.”

Absolutely! And let’s ensure that the digital preservation aspects are being considered right at the start of any development activities rather than being ignored by those focussing on the new possibilities which this technology can provide.

Hmm, I wonder if there are any funding possibilities available for exploring the preservation aspects of Google Wave?

Preservation Policies for Content Hosted on Third Party Blogs

We know that a variety of externally-hosted Web 2.0 services are being used to support institutional aims. But what about the associated risks of loss of important resources?  One approach to such risk would be to ban use of such services. But this is to ignore the benefits that such services provide and is likely to alienate users of such services if it were possible to implement such a policy.

The approach taken by the JISC PoWR project (and described in more detail in the JISC PoWR handbook – see particularly chapter 13) has been to recognise that there are legitimate reasons to make use of such services and to look at ways in which content hosted on such services can be managed and curated in the long term.

The need to do this is of relevance to UKOLN, which provides a number of blogs on externally-hosted services including the UK Web Focus blog, provided by Brian Kelly, and the Ramblings of a Remote Worker blog, provided by Marieke Guy.

The first stage is to define and publicise a policy covering the long-term access to the content of these two blogs, including what will happen if either of the authors leaves UKOLN.

Policies posted on the UK Web Focus and Ramblings of a Remote Worker blogs state that:

  • A rich copy of the contents of the blog will be made available to UKOLN (my host organisation) if I leave. Note that this may not include the full content if there are complications concerning third party content (e.g. guest blog posts, embedded objects, etc.), technical difficulties, etc.
  • Since the blog reflects personal views I reserve the rights to continue providing the blog if I leave UKOLN. If this happens I will remove any UKOLN branding from the blog.

These two simple statements can help, we feel, in ensuring that the content can be managed if the blog authors leave (or if they fall ill, go crazy or die!). The statements seek to avoid uncertainties regarding what can be done with the content.  The second statement also clarifies that if the authors were to leave, they may wish to continue using the blog.

It may be argued that since both blogs make their content available under a Creative Commons licence this already grants the host institution, along with anyone else, the rights to preserve the content. This may be true, but there is no harm in making this explicit, we feel.

Would it not be legitimate for organisations to expect their employees to make similar statements which clarify the ownership of content hosted on Web 2.0 services and created as a normal course of one’s duties?


Note: This blog post has been written to support a poster which will be provided for the Missing links: the enduring web conference. The poster, which has been produced by Marieke Guy and Brian Kelly, UKOLN, is entitled “Preservation Policies and Approaches for Use of Web 2.0 Services”. A series of blog posts published on this blog provide more detailed information of the content summarised in the poster.

Tools For Preserving Twitter Posts

I recently described Some Use Cases For Preserving Twitter Posts, including preservation of an organisation’s digital memory and preservation of data for subsequent data mining. The post, however, failed to include perhaps the most obvious example: preservation of Twitter posts (tweets) related to an event.

In response to that post a number of solutions for preserving tweets were suggested including FriendFeed, the WordPress Lifestream plugin and What the Hashtag. In addition following a tweet I posted I received details of the Tweetdoc service.

With this year’s Institutional Web Management Workshop (IWMW 2009) rapidly approaching it is timely to decide the tools we’ll be using to preserve the discussions associated with this event. We began keeping a record of the amplification of the IWMW event back in 2005 when an IRC channel was provided for use by the small numbers of participants who had a laptop and WiFi connectivity at the event. The IWMW 2005 event took place during the 7/7 bombings in London and a record of the awareness of what was happening can be gleaned from the archive of the IRC discussions.

This year we will once again be making use of Twitter and will be capturing tweets which contain the event hashtag #iwmw2009. The tools we are currently intending to use are What the Hashtag and the Tweetdoc service.

[Image: Use of the wthashtags service to view Twitter posts about #iwmw2009]

Use of What the Hashtag to view tweets containing the #iwmw2009 tag is illustrated.

As can be seen, the output can be accessed as an RSS feed. In addition the data can be viewed as an HTML resource, and a date range can also be supplied.

We intend to capture the tweets about the IWMW 2009 event after the event is over, and store the data on the UKOLN Web site, in order to avoid dependencies on the What the Hashtag service itself.
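Capturing a feed and storing it locally could be as simple as the following sketch. It assumes the service returns a generic RSS 2.0 document; the sample feed, its example.org links and the function name are invented for illustration and do not reflect the actual output of What the Hashtag.

```python
import json
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample standing in for a saved copy of the hashtag feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>#iwmw2009</title>
    <item>
      <title>First tweet #iwmw2009</title>
      <link>http://example.org/status/1</link>
      <pubDate>Tue, 28 Jul 2009 10:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second tweet #iwmw2009</title>
      <link>http://example.org/status/2</link>
      <pubDate>Tue, 28 Jul 2009 10:05:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def rss_items(xml_text):
    """Extract title, link and publication date from each RSS item."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


records = rss_items(SAMPLE_RSS)

# Serialise to JSON so the copy no longer depends on the original service.
archive_json = json.dumps(records, indent=2)
print(archive_json)
```

In practice the feed text would be fetched once after the event and the resulting file stored with the rest of the event's Web pages.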

We will also explore other services, such as Tweetdoc – although in this case as the data is only available as a PDF resource, it is not well-suited to provide data for analysis by other services.

Are there any other services we should be looking at? And what functionality might be desirable for a more generic service for preserving tweets? Any thoughts?


Note: This blog post has been written to support a poster which will be provided for the Missing links: the enduring web conference. The poster, which has been produced by Marieke Guy and Brian Kelly, UKOLN, is entitled “Preservation Policies and Approaches for Use of Web 2.0 Services”. A series of blog posts published on this blog provide more detailed information of the content summarised in the poster.

Some Use Cases For Preserving Twitter Posts

I’ve come across two schools of thought regarding the preservation of Twitter posts (tweets). From Twitter fans, it’s a really useful form of informal conversation, with a particular strength being the way in which messages quickly disappear. It’s like having conversations in your favourite bar – and we wouldn’t want such conversations preserved, so why would we wish to preserve tweets? Sceptics of Twitter might regard Twitter as inane babble in a bar and would agree that it’s pointless to preserve it.

However this week I have come across a couple of examples which illustrate why it might be useful to preserve tweets.

Preservation of Twitter data for research purposes

I recently described how Twitterers Subvert[ed] Daily Mail’s Racist Poll. For me that provided a fascinating example of how Twitter can be used by activists to support a viral campaign (in this case to undermine the racist sentiments in the question “Should the NHS allow gipsies to jump the queue?”). But why did this poll succeed in attracting a large turnout (with the poll being undermined with around 94% voting yes before the Daily Mail removed the poll) whereas a subsequent poll on Should immigrants be forced to respect British culture? failed to arouse much interest?

Since many of the tweets and retweets provided a link to “http://bit.ly/w4b6Q” we can look at the statistics provided by the bit.ly short URL service, which tell us that the link was clicked 48,796 times.

[Image: Statistics for clicks to Daily Mail poll]

As shown in the histogram, and as might be expected, the vast majority of clicks to the poll took place on 19 June 2009, the day the poll was the hot topic.

But who created the interest in the Twitterverse originally? Was it, as I speculated in my post, the UK-based psychologist community? Or were there a number of starting points? How was the message communicated? Who were the ‘influencers’ (to use the terminology from a recent blog post by Martin Weller) whose tweets were felt to contain links worth following? Who were the ‘hubs’ who were responsible for communicating the message to a wide audience? And was Twitter actually the main driver or were there other viral mechanisms in operation, as suggested by Josh, who went on to add “we must keep perspective and not neglect other areas or over-celebrate Twitter as a comms tool!”.

In a recent post by Andy Powell on “Influence, connections and outputs” Andy suggested that we want “some sensible measure of scholarly impact on the social Web” and that we need to understand “what do we want to measure; what we can measure and how can bring these two things close enough together to create something useful”. Although I’m not suggesting that the Twitter responses to the Daily Mail poll have any scholarly impact, I do think that this could provide us with an opportunity to understand how ideas (whether scholarly, political or simply marketing) are communicated and transmitted by Twitter. But if we wish to make use of this particular example, we will need to preserve the Twitter stream – and we will need to preserve the structured data (the Twitterer ID, the time, the IDs of those being retweeted and the numbers of followers) in order to carry out any rich analysis.
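To make the point about structured data concrete, here is a minimal sketch of a tweet record holding the fields listed above, together with one very simple analysis that a flat PDF of tweets would not allow. The class, the field names and the sample accounts are all invented for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TweetRecord:
    twitterer_id: str
    posted_at: str                    # ISO 8601 timestamp
    text: str
    retweeted_user: Optional[str]     # ID of the person being retweeted, if any
    follower_count: int


def to_json_lines(records):
    """Serialise records one per line, a simple format for long-term storage."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def retweet_counts(records):
    """Count how often each Twitterer was retweeted."""
    counts = {}
    for record in records:
        if record.retweeted_user is not None:
            counts[record.retweeted_user] = counts.get(record.retweeted_user, 0) + 1
    return counts


# Invented sample data, for illustration only.
stream = [
    TweetRecord("alice", "2009-06-19T09:00:00Z", "Vote in this poll", None, 1200),
    TweetRecord("bob", "2009-06-19T09:05:00Z", "RT @alice Vote in this poll", "alice", 300),
    TweetRecord("carol", "2009-06-19T09:07:00Z", "RT @alice Vote in this poll", "alice", 4500),
]
lines = to_json_lines(stream)
counts = retweet_counts(stream)
print(counts)
```

With the follower counts and timestamps kept alongside the text, questions about ‘influencers’ and ‘hubs’ become queries over the data rather than guesswork.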

Preservation of an organisation’s memory

As recently described on the This is Bath blog the Bus station Twitter project ends – today. The bathcsc Twitter account and accompanying blog, were set up to keep users of the First bus service in Bath informed of news – including delays to bus services. Sadly First’s head office in Bath  have announced that the company “is having to make changes to its business structure. As part of making the company more cost efficient, the customer services function for Bath Bus Station will now be handled by a specialist, area-wide centre based at Exeter“.

This news has not gone down well with many of the fans of this innovative use of social networking tools (and I should add that, as a daily bus user, I personally found the Twitter service very useful) as can be gauged from the comments on the blog to the announcement of the demise of the service and on the Bus station Twitter project ends article.

In the article a spokeswoman said that the company was hoping to learn lessons from the nine-month experiment:

The Bath Twitter trial attracted an enormous amount of interest, both inside and outside the organisation. We are now looking at the lessons we have learnt, and benefits associated with the system, with a view to developing a national social media approach to customer services for the future.

But is the company in a position to preserve the tweets? Will the company be in a position to appreciate the ways in which Twitter can be used effectively if this record is lost? The danger, of course, is that a new media marketing agency will sometime in the future advise on how Twitter could be used – and fail to learn from the successes of the initial trial, perhaps even making as big a mistake as Habitat have done recently.

What To Do? When To Do It?

Are these reasonable illustrations of when tweets should be preserved? If so, can you think of other examples which we need to consider?

And if there is a case to be made for preserving tweets, what tools should be used? I am aware of the Print Your Twitter service, but this just creates a PDF file of tweets matching the search criteria.  To summarise what I think is needed:

  •  Ability to create a structured repository of tweets from a Twitter ID
  •  Ability to create a structured repository of tweets matching a search term

Anyone know of software which can be used to do this? I’m aware that one could simply use Twitter to search for “http://bit.ly/w4b6Q” or “@bathcsc” but this doesn’t provide a persistent and reusable data store.
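For what it is worth, the two requirements above do not need much machinery. The sketch below uses SQLite (via Python's standard library) as the persistent store; the table layout, function names and sample tweets are our own assumptions, and fetching the tweets from Twitter in the first place is deliberately left out.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tweet store; pass a file path for a persistent copy."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               tweet_id  TEXT PRIMARY KEY,
               author    TEXT NOT NULL,
               posted_at TEXT NOT NULL,
               text      TEXT NOT NULL)"""
    )
    return conn


def add_tweets(conn, rows):
    """Insert (tweet_id, author, posted_at, text) rows, skipping known IDs."""
    conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def by_author(conn, author):
    """All stored tweets from one Twitter ID, oldest first."""
    cur = conn.execute(
        "SELECT text FROM tweets WHERE author = ? ORDER BY posted_at", (author,)
    )
    return [row[0] for row in cur]


def matching(conn, term):
    """All stored tweets containing a search term, oldest first."""
    cur = conn.execute(
        "SELECT text FROM tweets WHERE text LIKE ? ORDER BY posted_at",
        (f"%{term}%",),
    )
    return [row[0] for row in cur]


# Invented sample rows, for illustration only.
conn = open_store()
rows = [
    ("1", "bathcsc", "2009-06-01T08:00:00Z", "Service 18 running 10 minutes late"),
    ("2", "bathcsc", "2009-06-01T09:00:00Z", "All services now running to time"),
    ("3", "someone", "2009-06-19T10:00:00Z", "Vote here http://bit.ly/w4b6Q"),
]
add_tweets(conn, rows)
add_tweets(conn, rows)  # re-running a capture does not create duplicates
total = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(by_author(conn, "bathcsc"), matching(conn, "http://bit.ly/w4b6Q"), total)
```

Note that the LIKE search treats `%` and `_` in the search term as wildcards, so a real tool would need to escape them.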

ArchivePress: When One Size Doesn’t Fit All

[Image: ArchivePress logo]

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary – scholarly, actual, political, or personal – which merit keeping no less than famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.

In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.

ArchivePress has its own website and blog, that will build a cumulative picture of its results and the issues it encounters over the next 6 months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

Digital Preservation and Nuclear Disaster: An Animation

DigitalPreservationEurope (DPE), an organisation which is “committed to making digital preservation materials available to the widest possible audience and to breaking down barriers to access”, has released the first in a series of short animations introducing and explaining digital preservation problems and solutions for the general public.

Not for everyone, I suspect, but I’m pleased to see a diversity of approaches being taken to explaining digital preservation concepts. And making it available on YouTube means that the animation can be easily used in a wide variety of contexts, such as being embedded in this blog post. What do you think?

Permission to capture Twitter

This blog has been occasionally concerned with issues surrounding the capture of Tweets, in their capacity as web-based resources which may or may not have value to the Institution as record or archive material. See Brian Kelly’s post of July 2008. The discussions have been based around (a) is it worth doing? And (b) is it technically possible to capture Tweets, and how? This post is concerned with a third issue, (c), do we have permission to do it, since twitter.com is a third-party hosting service? The same issue in fact, that applies to many Web 2.0 applications which involve members of staff working in the cloud or generally conducting University business outside of the network.

The PANDORA Librarian at the State Library of Queensland has just asked the question about permission. “Do we have a blanket permission for Twitter http://twitter.com/? That is, if we find a Twitter site we want to gather, and we get permission as well from say a politician or whoever has content on that site, can we go ahead and gather?”

The National Library of Australia, who have been involved in a library-centric web archiving programme for many years, replied with the following advice:

“We received advice from Twitter, when asking to archive the Prime Minister’s twitter page, which gives a clear indication that they have no problem with archiving,” said the NLA’s Manager of Web Archiving. “I have also been in contact with another Twitter staffer who also gave permission. So I think it is safe to assume that we may archive selectively from Twitter, depending upon receiving the permission of the Twitteree. However in the case of the PM, we didn’t seek that explicit permission as we have blanket permission, from his Party and the PM’s dept.”

And that advice from Twitter.com confirms this:

“Archiving the Prime Minister’s twitter updates is fine, since he owns the copyright to the messages that he posted. According to our Terms of Service:

Copyright (What’s Yours is Yours)

1. We claim no intellectual property rights over the material you provide to the Twitter service. Your profile and materials uploaded remain yours. You can remove your profile at any time by deleting your account. This will also remove any text and images you have stored in the system.

2. We encourage users to contribute their creations to the public domain or consider progressive licensing terms.”

This is potentially a useful little development. It may send archivists in the direction of checking the terms of service of other third-party hosting organisations, such as Google Docs or SlideShare, to establish what’s permissible and what isn’t. If copyright and ownership issues are sidestepped or aren’t even mentioned, this could be a cause for concern.

In the meantime, presumably we must also give credit to our Australian colleague for devising the neologism “Twitteree”.

Archiving a wiki

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library Life and Life2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out if you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led to me making a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is because UKWAC users, the target audience for the archived copy – or the designated user community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.
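A selection decision of this kind eventually has to be expressed as rules about URLs. The sketch below shows one way that might look, assuming the common MediaWiki URL conventions (`/wiki/Title`, `index.php?title=...&action=...`, and namespace prefixes such as `Talk:` and `Special:`). It is not how Web Curator Tool is configured, and the example.org addresses and function names are made up.

```python
from urllib.parse import parse_qs, unquote, urlparse

# Views we do not want: edit history, edit forms and similar generated pages.
EXCLUDED_ACTIONS = {"history", "edit", "raw", "info"}
# Namespaces we do not want: discussion pages and automatically generated pages.
EXCLUDED_PREFIXES = ("Talk:", "User_talk:", "Special:")


def page_title(url):
    """Return the wiki page title from either URL style."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "title" in query:
        return query["title"][0]
    return unquote(parsed.path.rsplit("/", 1)[-1])


def should_capture(url):
    """True if the URL is finished content rather than history or discussion."""
    query = parse_qs(urlparse(url).query)
    if query.get("action", ["view"])[0] in EXCLUDED_ACTIONS:
        return False
    if "oldid" in query or "diff" in query:
        return False  # old revisions and comparisons between revisions
    return not page_title(url).startswith(EXCLUDED_PREFIXES)


# Hypothetical URLs, for illustration only.
candidates = [
    "http://wiki.example.org/wiki/Main_Page",
    "http://wiki.example.org/index.php?title=Main_Page&action=history",
    "http://wiki.example.org/wiki/Talk:Main_Page",
    "http://wiki.example.org/index.php?title=Special:UserLogin",
    "http://wiki.example.org/index.php?title=Main_Page&oldid=42",
]
selected = [url for url in candidates if should_capture(url)]
print(selected)
```

If the policy later changes, say to keep the edit history after all, only the two exclusion lists need to be revisited.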

This selection aspect led to Maureen Pennock’s reply, which is a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.

Set a blog to catch a blog…

Much discussion of blog preservation focuses on how to preserve the blogness of blogs: how can we make a web archive store, manage and deliver preserved blogs in a way that is faithful to the original?

Nesting...

Since it is blogging applications that provide this structure and behaviour (usually from simple database tables of Posts, Comments, Users, etc), perhaps we should consider making blogging software behave more like an archive. How difficult would that be? Do we need to hire a developer?

One interesting thing about WordPress is the number of uses its simple blog model has been put to. Under-the-hood it is based on a remarkably simple database schema of about 10 tables and a suite of PHP scripts, functions and libraries that provide the interface to that data. Its huge user-base has contributed a wide variety of themes and additional functions. It can be turned into a Twitter-like microblog (P2 and Prologue) or a fully-fledged social network (WordPress MU, Buddypress).

Another possibility exploited by a 3rd-party plugin is that of using WordPress as an aggregating blog, collecting posts automatically via RSS from other blogs: this seems like a promising basis for starting to develop an archive of blogs, in a blog.

The plugin in question is called FeedWordPress. It uses the Links feature of WordPress as the basis of a list of feeds which it checks regularly, importing new content when it finds it, as Posts within WordPress.
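The general pattern of checking a feed repeatedly and importing only what is new can be sketched as follows. This is an illustration of the idea in Python, not a description of how FeedWordPress itself is written; the function name, the dictionary-based archive and the sample items are assumptions.

```python
def import_new(archive, feed_items):
    """Add unseen items to the archive (a dict keyed by guid).

    Returns the guids that were added on this poll.
    """
    added = []
    for item in feed_items:
        guid = item["guid"]
        if guid not in archive:
            archive[guid] = item
            added.append(guid)
    return added


# Two successive polls of the same invented feed; post-2 appears in both.
archive = {}
first_poll = [{"guid": "post-1", "title": "A"}, {"guid": "post-2", "title": "B"}]
second_poll = [{"guid": "post-2", "title": "B"}, {"guid": "post-3", "title": "C"}]

added_first = import_new(archive, first_poll)
added_second = import_new(archive, second_poll)
print(added_first, added_second, len(archive))
```

Keying on the feed item's identifier is what stops the same post being imported every time the feed is checked.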

I installed FeedWordPress a while ago on ULCC’s DA Blog, and set it up to import all of the ULCC-contributed posts to JISC-PoWR, i.e. those by Ed Pinsent, Kevin Ashley and myself. I did this because I felt that these contributions warrant being part of ULCC’s institutional record of its activities, and that DA Blog was the best place to address this, as things stand.

JISC-PoWR also runs on WordPress, therefore I knew that, thanks to WordPress’s REST-like interface and Cool URIs, it is easy not only to select an individual author’s posts (/author/kevinashley) but also the RSS feed thereof (/author/kevinashley/feed). This, for each of the three author accounts, was all I needed to start setting up FeedWordPress in DA Blog to take an automatic copy each time any of us contributed to JISC-PoWR. The “author” on the original post has been mapped to an author in DA Blog, so posts are automatically (and correctly) attributed. The import also preserves, in custom fields, a considerable amount of contextual information about the posts in their original location.
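The URL pattern is regular enough to generate the list of feeds to subscribe to. A minimal sketch follows; the blog address used here is a placeholder, the function name is our own, and only the `/author/<name>` and `/author/<name>/feed` paths come from the description above.

```python
def author_urls(blog_base, author_slug):
    """Return (posts URL, feed URL) for one author on a WordPress blog."""
    base = blog_base.rstrip("/")
    posts = f"{base}/author/{author_slug}"
    return posts, f"{posts}/feed"


# Placeholder blog address; "kevinashley" is the author slug quoted above.
posts_url, feed_url = author_urls("http://blog.example.org/", "kevinashley")
print(posts_url)
print(feed_url)
```

Repeating the call for each contributing author gives the full set of feeds for the aggregating blog to watch.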

In many cases, I’ve kept the imported post private in DA Blog. “Introductory” posts for the JISC-PoWR project blog, for example: as editor of DA Blog, I didn’t feel we needed to trouble our readers there with them; nevertheless they are stored in the blog database, as part of “the record” of our activities.

This is, admittedly, a very small-scale test of this approach, but the kind of system I’ve described is unquestionably a rudimentary blog archive, that can be set up relatively easily using WordPress and FeedWordPress – no coding necessary. Content is then searchable, sortable, exportable (SQL, RSS, etc). (Note, by the way, what happens when you use the Search box on the JISC-PoWR blog copy in UKWAC: this won’t happen with this approach!)

For organisations with many staff blogging on diverse public platforms this would be one approach to ensuring that these activities are recorded and preserved. UKOLN, for example, manages its own blog farm, while Brian and Marieke have blogs at WordPress.com (as well as contributing to this one), and Paul Walk appears to manage his own blog and web space. This kind of arrangement is not uncommon, nor is the problem of how an institution gets a grasp on material in all these different locations (it’s been at the heart of many JISC-PoWR workshop discussions). A single, central, self-hosted, aggregating blog, automatically harvesting the news feeds of all these blogs, might be a low-cost, quick-start approach to securing data in The Cloud, and safeguarding the corporate memory.

There are more issues to address. What of comments or embedded images? Can it handle Twitter tweets as well as blog posts? Does it scale? What of look-and-feel, individual themes, etc? Now we start needing some more robust tests and decisions, maybe even a developer or two to build a dedicated suite of “ArchivePress” plugins. But thanks to the power and Open-ness of WordPress, and the endless creativity of its many users, we have a promising and viable short-term solution, and a compelling place to start further exploration.

Who Should Preserve The Web?

Members of the JISC PoWR Team will be participating at next week’s JISC conference, which takes place in Edinburgh on 24th March 2009.

In the session, entitled “Who should preserve the web?” a panel will

“Outline the key issues with archiving and preserving the web and will describe practical ways of approaching these issues. Looking at the international picture and the role of major consortia working in this area, the session will also offer practical advice from the JISC Preservation of Web Resources (PoWR) project on the institutional benefits of preserving web resources, what tools and processes are needed, and how a records management approach may be appropriate.”

If you are attending the conference we hope you will attend the session and participate in the discussions. If you are attending one of the other parallel sessions you can meet the UKOLN members of the JISC PoWR team at the UKOLN stand. And if you haven’t booked a place at the conference (which is now fully subscribed) feel free to participate in the discussions on the online forum.

Considerations for the Preservation of Blogs

DigitalPreservationEurope (DPE) fosters collaboration and synergies between many existing national digital preservation initiatives across the European Research Area. As part of their work they publish concise overviews of key digital preservation and curation issues. Earlier this month they published a briefing paper on Considerations for the Preservation of Blogs (PDF).

The preamble sets the context for the paper:

Blogs, it seems, are everywhere these days, but what about the next day (and the next and the next …). Opinions vary on whether or not blogs merit preservation beyond the actions of a blog’s respective authors. This briefing paper does not contribute to that dialogue. Rather, it provides an overview of issues to be considered by organizations planning blog preservation programs. Blogs are the product of a network of players, including blog authors, service providers, and readers. Discussed here are some key attributes of blogs, and the characteristics and behaviors of these players, which may impact preservation activities.

During the JISC PoWR project we recognised that, although blogs were initially characterised as ephemeral (as noted in the DPE paper), their growing importance in both the research context and our cultural history is becoming apparent. Like other Web resources, their preservation is a matter that needs to be addressed, somehow.

The PoWR blog has a number of interesting posts on the preservation of blogs including:

There is also a section on preservation of blogs in the JISC PoWR handbook.

Twitter Groups and Twitter Problems

We’ve written about Twitter on the JISC PoWR site before, mainly when considering preservation of Web 2.0 material. Now Twitter could become a useful tool in helping you communicate about Web resource preservation.

The Archivists and Records Managers Twitter Group is up and running. You can register at http://twittgroups.com/group/archives.

I’m sure there will be lots of interesting posts.

The preservation of Twitter posts (tweets) has again been discussed in the blogosphere. Maureen Pennock commented in her post entitled ‘Making retrospective sense of cross media communications: a new archival challenge‘ that the increasing number of communication mechanisms presents a big problem for archivists.

She points out that “Some of our conversations are cross-media; they may start on Twitter, but they move to Facebook and then the blog. Capturing only one of those accounts means that only part of our conversation is captured. Okay, so you’re probably not interested in capturing our interactions in your archives. But you probably are interested in capturing interactions from important people (back to Stephen Fry and Obama again) and you will thus face the same issues.”

She then says “We all know the problems we’ve got in capturing and archiving emails. What of Twitter? How do you get Tweets out of the system and integrate them into a collection? What of Facebook data? And YouTube?”

It seems the Twitter challenge is becoming more real as it becomes increasingly mainstream.


Auricle: The Case Of The Disappearing E-learning Blog

The Auricle E-Learning Blog

The e-learning team at the University of Bath was one of the early adopters of blog technologies to provide a forum for reflecting on e-learning in a Web context. The blog was set up by Derek Morrison when he was head of the e-learning unit. Derek had an interest in exploring the potential of new technologies, with one example of this being the series of podcast interviews he recorded and made available on the blog back in 2005. This included an interview with John Dale about the innovative blogging service developed at the University of Warwick (the first large scale student blogging service in the UK) and, perhaps not as noteworthy, an interview with me on my reflections of the WWW 2005 conference.

The name of the e-learning team’s blog was Auricle, which has the advantage of being a very Google-friendly name, and a Google search for “Auricle Bath” finds links to the blog itself and various pages which refer to the blog. Unfortunately it seems that the blog no longer exists – following a link to the blog’s home page gets a 404 error message:

The web address http://www.bath.ac.uk/dacs/cdntl/pMachine/morriblog.php was not found. It may have moved, or it may no longer be available.

How unfortunate – all that potentially valuable historical content giving views on the potential of the Web (including technologies such as blogs and podcasts) to enhance the quality of the student’s learning experiences now no longer available. And how should the University of Bath feel about this loss of its intellectual endeavours and of the role that the University had in being one of the early adopters of blogs by an e-learning team?

Why Did The Blog Disappear?

The URL for the Auricle blog provides an indication of some of the reasons for the disappearance of the blog: dacs refers to the Division of Access and Continuing Studies and cdntl to the Centre for the Development of New Technologies in Learning – but neither of these departments still exists. Following staff departures and organisational changes, support for learning at the University is now provided by the Learning and Teaching Enhancement Office (LTEO), with the e-Learning Team having responsibility for managing and supporting e-learning developments.

In addition to these organisational changes, the pMachine part of the blog’s URL refers to the pMachine blog engine and morriblog clearly refers to Derek Morrison, who left the University a number of years ago to support the HE Academy’s Pathfinder programme.
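This reading of the URL can be sketched as a small Python check that splits an address into its path segments and flags any that name a department, a product or a person. The function name `fragile_segments` and the `KNOWN_NAMES` table are hypothetical; in practice such a table would have to be compiled by someone who knows the institution, and the labels here simply restate the observations above.

```python
from urllib.parse import urlsplit

# Illustrative lookup of names that are likely to change over time.
KNOWN_NAMES = {
    "dacs": "department name",
    "cdntl": "department name",
    "pmachine": "software product",
    "morriblog": "personal name",
}


def fragile_segments(url, known=KNOWN_NAMES):
    """Return (segment, reason) pairs for path segments found in `known`.

    Matching ignores case and any file extension, so 'morriblog.php'
    is matched against 'morriblog'.
    """
    findings = []
    for segment in urlsplit(url).path.split("/"):
        stem = segment.rsplit(".", 1)[0].lower()
        if stem in known:
            findings.append((segment, known[stem]))
    return findings


report = fragile_segments(
    "http://www.bath.ac.uk/dacs/cdntl/pMachine/morriblog.php"
)
```

For the Auricle address every path segment is flagged, which is a fair summary of why the URL did not survive the reorganisation.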

It is perhaps not surprising that, following such changes and the influx of a large number of new staff in the e-Learning Team, the Auricle blog got lost somewhere along the way!

Can We Retrieve Any Of The Resources?

Is it possible to retrieve any of the blog posts and related resources? Is it possible to obtain any details about the blog, such as when it was launched, the number of posts published during its lifetime, how popular it was and, perhaps, the impact that the blog may have had?

Since the blog was public, as opposed to a blog which was restricted to members of the University of Bath, the contents of the blog have been indexed by Google. And using a combination of search terms, such as “Auricle Bath”, it is also possible to discover Web resources which cite the Auricle blog. This helped me to find a blog post on Stephen Downes’s blog on The Weblog as the Model for a New Type of Virtual Learning Environment? in which Stephen (a high profile Canadian e-learning guru)  clearly acknowledged the importance of Derek Morrison’s views on the potential of the blog as providing “the basis for a distributed, not centralised, information and learning object system“:

The author of Auricle nails it. “In the weblog, however, the announcements, articles, stories are the raison d’etre’ so much so that, not satisfied to present articles from one source, the weblog has the temerity, due to the adoption of the RSS standard, to receive syndicated stories from other sources and, in turn, offer it’s own portfolio of articles for use by others. For example, a blog supporting a programme or module could be the vehicle by which faculty post date and time-stamped short articles relevant to the course but which also link to related, but distributed, learning resources which are presented via RSS feeds. Such feeds can be static or dynamic so that updated RSS formatted information will be reflected in whatever application is displaying it, e.g. a la Auricle’s RSS Dispenser. Here then is the basis for a distributed, not centralised, information and learning object system.” (My emphasis) Derek Morrison, Auricle, February 27, 2004.

And the date of Stephen’s post (27 February 2004) indicates that the Auricle blog was available in early 2004.

With some further use of Google I discover that the Auricle podcast resources are still available on the University of Bath Web site – and I’m pleased that the MP3 file of my interview has not been lost. The RSS file also contains the publication dates, which show that the podcasts were published during 2005. We seem to have unearthed some further information about the Auricle blog.

Rediscovering The Blog!

It required a Google search for “Auricle Morrison” for me to discover that the Auricle blog is alive and well! It is now hosted at http://www.auricle.org/auriclewp/ (much better than the original URI!). And as well as providing access to the original posts (although with a new look-and-feel, as the blog is now based on the WordPress blog software) the blog is still active, with Derek using the blog to support his Pathfinder work at the HE Academy. As Chris Rusbridge pointed out in a post on “‘Digital Preservation’ term considered harmful?” on the Digital Curation blog, phrases like “long term accessibility” or “usability over time” are better than the process-oriented phrase “digital preservation”. And here’s an example of how the Auricle blog has been preserved by continuing to be used by, and accessible to, its user community.

The Lessons

What are the implications of this case study for the wider community? And what lessons can be learnt?

We should be aware of the dangers of associating services with departmental names and specific technologies. This has been well documented, including in Tim Berners-Lee’s article on “Cool URIs Don’t Change!” – although this is clearly easier to say than to implement in practice.

I feel there is also a need for departments to audit their networked services and to document their policies regarding the sustainability of such services.  And such documented policies should be examined when departments change their names or there are significant changes in personnel.

And this case study provides an interesting example of a service which has been driven by an individual – Derek Morrison. As Derek clearly felt ownership of the Auricle blog, he was motivated to migrate the content of the blog to a new platform and, at a later date, to continue to contribute to the blog, although not as frequently as previously. This probably saves the e-Learning Team at Bath from having to retrieve backup copies of the blog posts and provide an archived copy of the resource. But who owns the blog? And what would have happened if there had been an ownership dispute over the blog and the name of the blog? These are questions which will be relevant to many academics and support staff who make use of blogs to support their professional activities – including myself and my UK Web Focus blog. But as the contents of my blog have a Creative Commons licence I would hope that this will minimise any legal barriers to the contents of my blog being migrated to a new environment should circumstances change.

JISC PoWR Workshop 2: Preservation and Web 2.0

The second JISC PoWR workshop was held on 23rd July 2008 as part of UKOLN’s annual institutional Web management workshop, IWMW 2008.

This workshop provided an opportunity to review the outcomes of the first workshop, in which members of the JISC PoWR team and the 30+ participants identified some of the challenges to be faced in preserving content held on institutional Web services and explored some of the ways in which these challenges can be addressed. The slides for this review are available on Slideshare and are embedded below.


The main focus of the second workshop, however, was to look at the additional challenges which need to be addressed in a Web 2.0 context, when the content may be more dynamic, hosted by third party services and created by a wide range of users.

A PowerPoint presentation was used to initiate discussions based on a number of scenarios including use of blogs, wikis, Twitter, communications tools, social networks, ‘amplified events’ and use of third party repository services such as Slideshare – which is appropriate as this presentation is itself available on Slideshare and is embedded below.


This presentation doesn’t have any answers to these challenges – it was intended to initiate the debate at the workshop. Some of the approaches which may be relevant to the various scenarios have already been discussed on this blog including use of wikis, student blogs, use of Slideshare, instant messaging and Twitter and the wider set of discussions which took place at the workshop will feed into the final JISC PoWR handbook.

It is worth noting that this presentation was spotlighted on the Slideshare home page. This has helped to increase the visibility of the work of the JISC PoWR project: a week after the presentation had been given there had been 713 views of the slides. It should also be noted that other Slideshare users had assigned various tags to the presentation (including data-portability, digital-preservation, sioc and preservation). As can be seen if you follow these links, we are beginning to see use of such social Web technologies which can help users to discover related resources of interest to the digital preservation community. This, to me, is a good example of the potential benefits which Web 2.0 can provide to those with an interest in the preservation of Web resources.

Preservation And Instant Messaging

Background

The Web 2.0 environment has a strong emphasis on communications between individuals and not just one-way publishing. This pattern of usage places additional challenges for institutions wishing to ensure that records are kept of the dialogue which takes place. And these challenges may well need to be addressed within the context of policies on the preservation of Web resources as increasingly digital communications technologies will have Web interfaces.

We will be publishing a series of posts looking at different aspects of Web 2.0. In this initial post we will provide a brief case study on use of instant messaging to support communications between two institutions. The case study will attempt to draw out some of the general policy issues which should be applicable more widely.

Use of IM for the QA Focus Project

This example describes the approaches taken to use of instant messaging to support communications between the project partners for the JISC-funded QA Focus project which was launched in January 2002. The project partners were UKOLN (based at the University of Bath) and, initially, ILRT, University of Bristol. However after the end of the first year of the project ILRT withdrew from the project and were replaced by AHDS, who were based in London.

In order to minimise the amount of travel and to help to provide closely integrated working across the project partners it was agreed to make use of instant messaging technologies. As well as enabling the team members to have speedy contact with each other it was also recognised that official project meetings could be held using the technology. It was appreciated that in this context there was a need to have a slightly formal protocol for managing the meetings, to compensate for the limitations of online meetings. And in addition to the best practices for managing the online meetings it was also agreed that a record of the transcript would be kept, and that this record would be copied across to the Intranet along with other formal documents.

After AHDS replaced ILRT as project partner we decided to change our IM client from Yahoo Messenger to MSN Messenger. It was either during this change of IM tools or whilst making use of another IM client (I can’t recollect the exact details) that we noticed that different IM applications work in slightly different ways. This includes whether a transcript of dialogue is kept automatically and whether new participants to a group chat will see only new discussions or discussions which have taken place previously (which has the potential to cause embarrassments at the least).

The experiences we gained in use of IM led the project partners to develop a policy on use of IM (which covered issues such as the possible dangers of interruptions, as well as keeping records of formal meetings held on IM). The policy also clarified use of IM in an informal context, with there being no guarantee that records would be kept.

The policy stated that:

  • IM software may be used for formal scheduled meetings. In such cases standard conventions for running meetings should be used. For example an agenda should be produced, actions clearly defined, changes of topics flagged and a record of the meeting kept.
  • IM software may be used for direct communications between individual team members. For example it may be used for working on particular tasks, to clarify issues when working on collaborative tasks and to support team working. IM may be particularly suited for short term tasks for which no archive is needed and other team members need not be involved – for example, arranging a meeting place.
  • Highly confidential information will not be sent using IM, due to the lack of strong encryption.

General Issues

The general issues arising from this case study include:

  • The need to ensure that the users of the IM technologies and those involved in developing policies related to its use have a good understanding of how the technologies work together with an understanding of the differences between different IM systems.
  • The need for simple documented policy statements.

Preservation and Slideshare

Slideshare is a popular externally hosted Web 2.0 service for providing access to presentations. And as I’ve described on the UK Web Focus blog, there is evidence to demonstrate its impact in maximising awareness of presentations – and this might include both awareness of research activities, as described in my post, but also marketing activities.

But what about the risks associated in making use of a third party service in this way? What will happen if, for example, the Slideshare’s business model is flawed and the company goes bankrupt? Rather than making use of a Web 2.0 service shouldn’t we be providing Slideshare’s functionality in-house?

I feel this is the wrong response: it would be similar to saying that we should not allow third party organisations to manage our savings – but we all have bank accounts. And, although we know from recent experiences in the UK that there can be risks when using banks, we don’t shut down our accounts when we become aware of incidents such as Northern Rock’s financial difficulties. Rather we assess the risks and then manage them (in the case of savings, this might be to limit one’s savings to a maximum of £35,000 with any single bank, as this amount is guaranteed by the Government).

In the case of Slideshare an in-house solution would not only be costly to replicate its functionality, but it would also be unlikely to provide the impact and popularity which Slideshare has.

The challenge then is to assess possible risks and to explore mechanisms for managing such risks. The approach I take is to look at the popularity of the service and its user community (an approach, incidentally, which has also been recommended when selecting open source software). The Techcrunch service can be useful in providing information on the financial background to many Web 2.0 companies and its information on Slideshare seems reassuring, with a post in May 2008 describing how SlideShare had Secured $3M for Embeddable Presentations.

The risk management approach I have taken is to store a managed master copy of the slides on the UKOLN Web site and ensure that links to this resource are provided on Slideshare. As can be seen from the image, the URL is included on the title slide and in the accompanying metadata. In addition the URL is also included in the footer of the hard copy printouts. I also provide a Creative Commons licence for the resource, which seeks to avoid any legal barriers to future curation of the resource and allows the resource to be downloaded from the Slideshare site.

Metadata provided on the Slideshare service

This approach aims to ensure that the master resource is kept at a stable managed location, allows users to make a copy of the resource (if, for example, the Slideshare service suffers from performance or reliability problems) and allows users to bookmark or cite the managed master version of the file.