My attention has just been drawn to this event by a blog post by Aleks Krotoski. The panel session, which will be streamed live and available for later download, will discuss ways in which the web can be studied at postgraduate level. Many of the examples focus on contemporary issues – the web as it is now – but this looks to be an ideal opportunity to highlight the research potential of web archives, and the services that those archives need to provide to enable research to be carried out. (JISC are commissioning work in this area.) More details are available at ECS Southampton. Worth a visit if you are nearby; I wish we had been able to give more warning!
5th International Digital Curation Conference
“Moving to Multi-Scale Science: Managing Complexity and Diversity” | 2-4 December 2009
The IDCC is an established annual event reaching out to individuals, organisations and institutions across all disciplines and domains involved in curating data for e-science and e-research.
The DCC will be hosting a workshop programme on 2 December followed by a Pre-Conference Drinks Reception at the Natural History Museum. The main conference will open on 3 December with a keynote speech from Professor Douglas Kell, Chief Executive of the Biotechnology & Biological Sciences Research Council (BBSRC). Other key speakers will include: Professor Ed Seidel, National Science Foundation; Cliff Lynch, Coalition for Networked Information; Timo Hannay, Nature Publishing Group. The first day of the conference will incorporate an interactive afternoon for posters and demos, followed by a Symposium entitled “Citizen Science: Data Challenges” led by Richard Cable, BBC Lab UK.
The second day will be made up of peer-reviewed papers in themed sessions covering Disciplinary and Institutional Challenges, Practitioner Experience, Metadata, Software Preservation & Managing Risk.
Places are limited so please register now.
Registration to close on 20 November 2009
As part of the JISC ITT Workshops & Seminars: Achievements & Challenges in Digitisation & e-Content strand JISC Digital Media have hosted two free seminars focussing on key topics for individuals involved with digital media. Today I attended the second of these entitled The digital media collection +100 years.
Obsolescence, deterioration of physical storage media or withdrawal of institutional support: just what will prove to be the greatest threat to the materials we digitise today? This seminar projects one hundred years into the future and attempts to predict the future ‘preservability’ of what we digitise today. This seminar will examine changing user demands and inevitable developments in technology.
After a brief opening from Dave Kilbey of JISC Digital Media the scene setting introduction was given by Dr William Kilbride, Executive Director of the Digital Preservation Coalition.
The Preservation Landscape
As well as the more conventional look at the key issues (the volumes of data available, the complexities and complicated requirements of this data teamed with rising public expectations) William gave a really interesting talk on the path of literacy. He demonstrated through the Stroop interference test how, once we can read and write, we tend to process this information quicker than image information. The result is that literate cultures tend to be hegemonic through discursive power. His point was that the consequences of our work are not inevitable or neutral: digitisation is a social practice that can be used for good and for ill. After this slight aside William ran us through some of the main challenges, which include obsolescence of technologies, correct configuration of hardware, software and operators, and the need for a constantly managed service. He ended with a few ‘answers’ from a survey of recent JISC digitisation projects. When asked how long their resources were to be available, answers varied from “perpetuity” to “forever or three years”. He concluded that digital preservation is possible but our legacy will be what we make of it and cannot be taken for granted.
The Camera Raw format and preservation
Nigel Goldsmith, a photographer working for JISC Digital Media, gave a quick run through of the possibilities of using Raw camera format. Raw offers the photographer greater control over the processing of their images; however, this flexibility comes at a price. Raw is a proprietary format which requires specialist applications to view. Nigel’s suggestion was to archive Raw but to keep it alongside another format, possibly TIFF or JPEG 2000.
Preservation Metadata Initiatives and Standards
After coffee Getaneh Alemu from the Humanities Computing Department, the University of Portsmouth gave us a whirlwind tour of state-of-the-art metadata standards and how metadata can help ensure the integrity, identity and authenticity of digital documents. His overview included a look at OAIS, NLA PANDORA, CEDARS, NEDLIB, LMER, PREMIS, and METS metadata initiatives and standards. He concluded that at the moment preservation metadata formats tend to have element naming issues that descriptive metadata initiatives don’t tend to have.
The challenges of archiving computer games and other multipart digital interactives
After lunch Tom Woolley from the National Media Museum talked about some of the digital media preservation issues they are tackling on-site at the museum. The museum is involved in a number of initiatives that aim to let visitors ‘have a go’ at old games and old internet environments. The tricky dilemma is giving users a taster of old games in a cost effective way: actually using original kit (like ZX Spectrums) would have a heavy cost attached. The key is often emulation. The museum also try to capture the context of games by capturing fan information, discussion forums, FAQs etc. Tom was followed by James Newman from Bath Spa University who works with Tom on the National Video Game Archive.
James talked about one of the biggest challenges of video game archiving: supersession. Within the gaming world there is a tendency to be always looking for the ‘next big game’ which has resulted in an environment where games creators don’t value old games. Although there is a niche market for retro games, gaming is an area where the experience is almost completely associated with the technology, making archiving very difficult.
The importance of collaboration
Simon Tanner, director of King’s Digital Consultancy Services, focused on institutional preservation and the importance of collaboration in sustainability. He started off by saying that one of the biggest challenges is that we may run out of the minerals to make microchips. He later played on the climate issue again by saying that he currently saw digital preservation as sitting in the same space as climate change: people viewed it as potentially a terrible thing (the loss of digital objects) but currently it does not impact on individuals, so it remains low on the priority list. Simon pointed out that sustainability of resources was becoming a mandate but remains an unfunded mandate. The way to deal with this was through the ecology of collaboration – within your institution and outside.
A Poisoned Chalice? Accepting Responsibility for Sustainable Access
The day concluded with a talk from Neil Grindley, JISC Programme Manager for Digital Preservation. Neil pointed out that ensuring that an organisation’s digital assets are safe, secure and accessible for the long term should (in theory) be an interesting, responsible and useful role for anyone in an organisation to accept. The critical importance of digital assets, the ubiquity of digital methods and the need for people in all walks of life to have effective means to refer to persistent sources of data reinforce this notion. How is it then that long-term asset management, information lifecycle management, data curation, digital preservation (call it what you will) is often regarded as a peripheral specialist activity that it is difficult to resource, complex to carry out, and delivers benefits that are, at best, simply an insurance policy rather than an activity that adds value to an organisation? Neil’s presentation examined the importance of defining clear roles for those involved with digital preservation and considered the importance of associating this professional activity with strategic and tactical frameworks. He advocated the need for allocation of responsibility and internal preservation policies. JISC has spent 6 million in the digital preservation arena between 2005 and 2009, yet there is still work to be done. He concluded by pointing out the need for human judgement when deciding what to keep and predicted that in the future digital preservation will be integrated with administration departments, have better tools and will take more terms from the cultural heritage area.
After Neil’s talk there was a panel session and time for questions. Unfortunately I had to leave to make the difficult drive home through rush hour traffic!
The day was an interesting one: although the talks were a real mixed bag, they all offered constructive steps forward to make today’s digital media collection something that we may be able to access and use 100 years on.
This year’s event will be hosted by California Digital Library (CDL) at Mission Bay Conference Center in San Francisco on October 5th and 6th, 2009.
UK presentations include Maureen Pennock on ArchivePress, David Giaretta on significant properties in OAIS and Adam Farquhar on (Planets) metadata.
Yesterday many of those most committed to, and interested in, Web preservation attended the Missing links: the enduring web conference held at the British Library. The event, which had over 100 attendees (from academia, galleries, museums, libraries, archives, government, research groups and beyond), was an attempt to ‘develop and strengthen the links between content creators, tools developers, preservation services and users in order to secure an enduring web’.
If anyone can successfully preserve the Web it is this lot!
Session 1: Setting the scene
Keynote: Adrian Brown, Assistant Clerk of the Records, Parliamentary Archives
The opening keynote was given by Adrian Brown from the Parliamentary Archives, who did an adept job of setting the scene and the challenges ahead. His talk centred around the key jigsaw pieces of selection, capture, storage, access and preservation. He pointed out that the Web used to be primarily a publisher of fixed content and it is the new types of content (discussions on wikis, Twitter, blogs etc.) that offer the biggest challenge. Parliament themselves are using many new forms of information, and although permission is not always an issue, capturing content on other sites often is. You may own the content but someone else often owns the presentation of that content. Adrian also emphasised the need for us to co-ordinate selection and be as explicit about what we are not going to capture as what we are going to capture (the unpublishing button on Facebook being a classic example of how difficult it can be to go back!). Another major challenge is that of temporal cohesion – the rate of capture of content is often far slower than the rate of change of content.
He concluded with a Peter Murray Rust quote that stresses the need for us to “Create and release herds of cows, not preserve hamburgers in a deepfreeze.”
Web Archive and Citation Repository in One: DACHS: Hanno Lecher, Librarian, Sinological Library, Leiden
Hanno Lecher talked about the citation problems he has encountered while working on the digital archive for Chinese studies, a resource that captures and archives relevant resources as primary sources for later research. His work has indicated the possible need for citation repositories. Some options might include verifying URL references, evaluating the reliability of online resources, use of PURLs, avoidance of obscure URLs and even the extreme banning of online resources as references. As such approaches are not always practical (or appropriate), he suggested the use of applications like Snagit, Zotero or WebCite.
The future of researching the past of the Internet: Eric T. Meyer, Research Fellow, Oxford Internet Institute, Oxford University
Eric Meyer works primarily looking at social behaviour and the Internet and, as part of the Transatlantic Digitisation Project, spent some time building an archive collection of World War 1 and 2 Internet resources. The process gave the team insight into the feasibility of creating such an archive. Lessons were also learnt on the need for involvement of domain experts (webometrics, SNA, eResearch, etc.) and the desire to move from snapshots to more continuous data. Eric concluded by stressing that what researchers really want is a global way to access archives, not a national one.
After the morning coffee break the workshop moved from the general to the specific.
Session 2: Creation, capture & collection
An overview of Web Archiving Tools by Helen Hockx-Yu, Web Archiving Programme Manager, The British Library
Helen Hockx-Yu provided a very practical overview of current tools, covering Heritrix, NetArchiveSuite, Web Curator Tool and PANDAS. She also discussed archival formats such as ARC and WARC, which are highly desirable from a long-term archival standpoint. Helen concluded with a brief discussion of the limitations and challenges harvesters present, from issues with rendering and dealing with ‘bad’ content to reliance on open source tools that are still very much evolving.
Context and content: Delivering Coordinated UK Web Archive to User Communities: Cathy Smith, Collections Strategy Manager, The National Archives
Cathy Smith and the UKWAC team have recently been working on a study entitled Delivering coordinated UK web archives, looking at what audiences web archives should anticipate, what this means for selection, ingest and preservation, and what researchers want from archives. The conclusions include the notion of a single view to an archive, though this does not necessarily mean a single repository. Cathy explained that “they just want the car to work – they don’t want to see what’s under the bonnet”. In response to the study the NA have recently begun automated whole domain harvesting – shallow as well as selective harvesting – and they are looking at allowing all archives to be searched together as a collection.
Capture and Continuity: Broken links and the UK Central Government Web Presence: Amanda Spencer and Tom Storrar from the National Archives.
Amanda Spencer and Tom Storrar were set the task of sorting out the problem of broken links on government Web sites. 60% of URLs quoted in Hansard between 1997 and 2006 are broken, leaving the public unable to find information and with little faith in the authority of government Web sites. The Web Continuity project has begun to create an archive of all Central Government Web space. Some things they are trying out include XML site maps, using redirects instead of 404 messages and capturing all government Web sites before they are closed.
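The idea of redirecting rather than returning a 404 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Web Continuity project's implementation: the archive address `http://archive.example.gov.uk/`, the function name `resolve` and the choice of a 301 status are all hypothetical.

```python
ARCHIVE_PREFIX = "http://archive.example.gov.uk/"  # hypothetical archive address


def resolve(requested_url, live_urls, archived_urls):
    """Return an (HTTP status, location) pair for a requested URL."""
    if requested_url in live_urls:
        return 200, requested_url
    if requested_url in archived_urls:
        # The page has gone from the live site but a copy was captured:
        # redirect to the archived copy rather than reporting an error.
        return 301, ARCHIVE_PREFIX + requested_url
    # Never captured, so there is nothing to redirect to.
    return 404, None


live = {"http://www.example.gov.uk/about"}
archived = {"http://www.example.gov.uk/report-2001"}

status_live = resolve("http://www.example.gov.uk/about", live, archived)
status_moved = resolve("http://www.example.gov.uk/report-2001", live, archived)
status_unknown = resolve("http://www.example.gov.uk/never-existed", live, archived)
```

The point of the sketch is simply that a URL cited years ago keeps leading somewhere, provided the site was captured before it closed.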
In the questions and discussion session the panel were asked what advice they would give to Web managers on how they can allow archiving of their sites to be more effective. The suggestions tackled content (following accessibility guidelines, W3C standards, making code valid, adding in hard links to files next to streaming content, having transcripts of audio files) and communication (enabling a dialogue between Web archivists and Web creation people). It was noted that there is a fine line between encouraging good practice and stifling innovation and that at times communication with content creators is a luxury. Responsible harvesting also means explaining why you are doing it to those involved.
The discussion then moved to other possible points of capture such as using Google Cache or asking ISPs.
Session 3: Issues and approaches to long term preservation of web archives
After lunch the programme moved on from the specific problems of now to problems of the future.
Diamonds in the Rough: Capturing and Preserving Online Content from Blogs: Richard Davis, Project Manager, University of London Computing Centre (ULCC)
Richard Davis, a member of the JISC PoWR team, started off by emphasising the importance of blogs today. This has led him and colleague Maureen Pennock to consider the need for their preservation, particularly at institutional level, resulting in the ArchivePress project. Richard pointed out that traditional Web archiving tools can be difficult to configure and use: a ‘hammer to crack a nut’. A comment by Chris Rusbridge that “blogs represent an area where the content is primary and design secondary” suggested that using blog feeds might be key. The feed could be a point at which to gather blog content for archiving. The test bed for ArchivePress will include blogs from the DCC, Lincoln University and UKOLN. In the future ArchivePress could possibly be used to harvest Twitter.
Beyond Harvest: Long Term Preservation of the UK Web Archive: Maureen Pennock, Web Archive Preservation Project Manager, The British Library
Maureen Pennock started her talk by giving some general background to the UK Web Archive. Now that the archive has been established, the next task is preserving it, an ongoing activity. The main undertakings are documenting system dependencies, consideration of containers and metadata standards (following a review, WARC is the preferred format for preservation, but they are also using a combination of features from METS and PREMIS), preservation workflow and defining the preservation strategy. The future involves the consideration of many new areas, for example whether we should preserve computer viruses.
From Web Page to Living Web Archive: Thomas Risse, Senior researcher, L3S Research Center
Thomas Risse introduced the LiWA (Living Web Archives) project, an EU FP7 research project. LiWA has been looking at building tools to address the specific challenges of the current generation Web, e.g. link extraction from dynamic pages.
Emulating access to the web 1.0: Jeffrey van der Hoeven, Koninklijke Bibliotheek, The Netherlands
Jeffrey van der Hoeven started off with a whirlwind history of the Web browser: from the WorldWideWeb browser in 1991, which was more like a document viewer, through Mosaic and Netscape, to Firefox and Google Chrome today. Today’s browsers do many things, including rendering Web sites, bookmarking, caching, authentication and providing RSS feeds. They have changed from viewers to all round workplaces – Google Wave is an extreme example of this. The browser is currently the central application of the computer and there is a need for a sustainable solution to render Web sites. At the Koninklijke Bibliotheek they have recently been working on the Dioscuri project, a durable emulator. They have also been working with the Planets project using Global Remote Access to Emulation (GRATE) and on the Keep Emulation Environments Portable (KEEP) project.
After another coffee break we moved on to the final discussions.
Session 4: Discussion and next steps
What we want with web-archives; will we win?: Kevin Ashley, Head of Digital Archives, University London Computer Centre (ULCC)
Kevin Ashley, also part of the JISC PoWR team, kicked off his talk by introducing us to the future of the Web, Web 8.0. Although Kevin was just joking with us, he explained that when the next major change in the Web happens, we will need a much better metaphor than ‘Web 3.0’: it is the future that we need to think about. We need to be asking the key question as to what we want from Web archives, e.g. potential use cases. Preserving the Web is not just about content. Sometimes it will be useful to look at Web content in aggregate, e.g. for visualisation, linguistic analysis, etc. Right now many academics are doing their own Web archiving – content, presentation or both. However, we need to take not just a document-centred view of data but also consider the web of data and data about the web. A lot of future interest will not be in Web content, but in the context of use.
The day concluded with a round table discussion with many speakers from the day. Some of the topics discussed were the permissions challenge (primarily with regard to UKWAC and similar initiatives). Other countries and initiatives have favoured a more ‘just do it’ approach.
The day left most people enthused and positive about the future for Web archiving.
A conference report will follow on the DPC Web site and will be sent out via the dpc-discussion and digital-preservation JISCmail lists.
All presentations are now available in PDF Format from the DPC Web site.
WAC09 was used as the Twitter tag for the event. You can see the tweets at http://hashtags.org/tag/wac09/messages.
The event was sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland).
Today was the Missing Links: The enduring Web conference held at the British Library Conference Centre, London. A trip report on the event will follow.
The JISC PoWR team presented two posters, one provided a general overview of the project and the other discussed Preservation Policies and Approaches for Use of Social Web Services.
The content of the second poster has been mentioned in a number of previous blog posts:
- Preservation Policies for Content Hosted on Third Party Blogs
- Tools For Preserving Twitter Posts
- Some Use Cases For Preserving Twitter Posts
- Preserving Tweets: The CILIP2 Experience
- Missing links: the enduring web
Although the poster was submitted to the British Library it was unfortunately mislaid, so here is the actual content.
JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary – scholarly, actual, political, or personal – which merit keeping no less than famous and not-so-famous journals and diaries of the past.
Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.
Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?
So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.
This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
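To make the feed-based idea a little more concrete, here is a minimal Python sketch of the harvesting step: read an RSS 2.0 feed and store each item once, keyed on its `guid`. It is an illustration only and not ArchivePress code (which builds on WordPress); the sample feed, the `harvest` function and the in-memory `archive` dictionary are all hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a blog's RSS 2.0 feed (made-up content).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/first</link>
      <guid>http://blog.example.org/?p=1</guid>
      <pubDate>Mon, 01 Jun 2009 09:00:00 GMT</pubDate>
      <category>preservation</category>
      <description>Body of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/second</link>
      <guid>http://blog.example.org/?p=2</guid>
      <pubDate>Tue, 02 Jun 2009 09:00:00 GMT</pubDate>
      <category>events</category>
      <description>Body of the second post.</description>
    </item>
  </channel>
</rss>"""


def harvest(feed_xml, archive):
    """Add any feed items not yet in the archive; return how many were new."""
    root = ET.fromstring(feed_xml)
    added = 0
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid is None or guid in archive:
            continue  # skip items without an identifier or already stored
        archive[guid] = {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "categories": [c.text for c in item.findall("category")],
            "content": item.findtext("description", default=""),
        }
        added += 1
    return added


archive = {}
first_run = harvest(SAMPLE_FEED, archive)   # both posts are new
second_run = harvest(SAMPLE_FEED, archive)  # a repeat poll adds nothing

# Because posts are stored as structured records, they can be selected by
# field (here, by category), much as on a live blog.
events_posts = [p["title"] for p in archive.values() if "events" in p["categories"]]
```

Polling the same feed repeatedly is harmless in this sketch, since items already held are skipped; a real service would also need to handle comments, embedded media and posts edited after capture.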
In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.
ArchivePress has its own website and blog, which will build a cumulative picture of its results and the issues it encounters over the next 6 months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.
At the CILIP 2.0 open meeting in London (29th April 2009) delegates and remote participants were encouraged to tweet, using the #cilip2 tag, on issues relating to the role of Web 2.0 for CILIP members. These tweets were displayed on a screen using the Twitterfall client. Twitterfall is a way of viewing the latest ‘tweets’ of upcoming trends and custom searches on Twitter. Updates fall from the top of the page in near-realtime. The client is also a great way for remote participants to keep up-to-date with event posts.
Use of the tag was very popular and a number of blog posts have been written about its success including CILIP: More Popular Than Swine Flu!, Twittering Librarians, Twitter – better than a conference, and CILIP 2.0 – Open Session.
Some might argue that the significant use of the tag could turn out to be a defining moment in CILIP’s movement towards engagement with Web 2.0. Those with an eye for preservation might argue that if this is the case we will want to preserve the tweets that embody this ‘crucial moment’.
Luckily help was on hand in the form of Dave Pattern, systems manager at the University of Huddersfield Library. After the event Dave provided a good example of rapid software development (or in his words “a quick and dirty hack”!) when he quickly wrote a piece of software to harvest the tweets and make available a transcript of them. Dave used the Twitter search API to fetch the results (in JSON format) of the #cilip2 tag every 60 seconds. He then had to dedupe the results from the previous searches. Once he’d got all the tweets he wrote a quick Perl script to process them and generate some HTML.
He also provided a Wordle visualisation of the content of the tweets.
This now means that the CILIP community continue to have access to:
- A record of the discussions, annotated with the time of posting.
- A Wordle cloud which provides a visual summary of the topics.
Software such as this could potentially be used by others to preserve tweets. For those interested Dave is happy to share the code.
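As a rough illustration of the approach described above (poll a search repeatedly, drop tweets already seen, then render a transcript), here is a minimal Python sketch. It is not Dave's code, which was written in Perl. The field names (`id`, `from_user`, `created_at`, `text`) and the function names are assumptions made for the example, and the fetching step is left out because it depends on whichever search API is available.

```python
import html
import json


def merge_batches(batches):
    """Combine repeated search results, keeping each tweet once (keyed on id)."""
    seen = {}
    for batch in batches:
        for tweet in batch:
            # Later polls repeat earlier tweets; keep the first copy only.
            seen.setdefault(tweet["id"], tweet)
    # Assumption: ids increase over time, so sorting by id gives posting order.
    return [seen[key] for key in sorted(seen)]


def render_transcript(tweets, title):
    """Build a simple HTML transcript, one list item per tweet."""
    items = []
    for tweet in tweets:
        items.append(
            "<li><b>{user}</b> ({when}): {text}</li>".format(
                user=html.escape(tweet["from_user"]),
                when=html.escape(tweet["created_at"]),
                text=html.escape(tweet["text"]),
            )
        )
    return "<h1>{}</h1>\n<ul>\n{}\n</ul>".format(html.escape(title), "\n".join(items))


# Two overlapping polls, as JSON strings standing in for API responses.
poll_1 = json.loads(
    '[{"id": 1, "from_user": "a", "created_at": "10:00", "text": "Hello #cilip2"}]'
)
poll_2 = json.loads(
    '[{"id": 1, "from_user": "a", "created_at": "10:00", "text": "Hello #cilip2"},'
    ' {"id": 2, "from_user": "b", "created_at": "10:01", "text": "Web 2.0 & CILIP"}]'
)

tweets = merge_batches([poll_1, poll_2])
transcript = render_transcript(tweets, "#cilip2 tweets")
```

Escaping the tweet text before writing it into HTML matters here, since tweets are user-supplied and may contain characters such as `&` or `<`.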
It does however raise a number of questions: Should you keep a record of Twitter posts? If so, how would you go about doing it? Where should the responsibility lie? Should it be something event organisers need to consider? Is there a role for commercial companies to play or will this type of preservation just fall at the feet of interested participants (those keen on mashup creation)?
Is there scope for some sort of tool to be developed in this area? Should funding councils in specific areas be funding the creation of a more managed service? Should the tools be more generalised?
Some suggestions are already provided in the Handbook and on the JISC PoWR blog:
This blog post provides the background for one of two case studies which will be displayed on a poster on Preservation Policies and Approaches for Use of Social Web Services to be presented at the Missing links: the enduring web workshop to be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.
The JISC PoWR team will be involved in the forthcoming Workshop on missing links: the enduring web. The workshop is sponsored by the Digital Preservation Coalition (DPC) and the Joint Information Systems Committee (JISC) and organised by the six partners of the UK Web Archiving Consortium (British Library, National Library of Wales, JISC, Wellcome Library, The National Archives and the National Library of Scotland). It will be held on Tuesday 21st July 2009 at the British Library Conference Centre, London.
Richard Davis, ULCC, will be giving a presentation on Diamonds in the Rough: Capturing and Preserving Online Content from Blogs. Other members of the team will be presenting posters on the JISC-PoWR Project and on Preservation Policies and Approaches for Use of Web 2.0 Services.
In the next few posts we’ll describe in more detail what we’ll be covering. Online registration is still open and closes on Friday 10th July 2009. We hope to see you there…
We recently published a blog post about the workshop on “missing links: the enduring web”. Further information about this one-day event, which will take place on 21st July 2009 at the British Library Conference Centre, London, is now available.
Papers and posters are invited which address long term preservation issues in relation to Web content. Abstracts of not more than 300 words should be sent to the conference organisers by 8th May 2009.
The Digital Preservation Coalition have recently announced a 1-day workshop on “missing links: the enduring web” which will be held at the British Library Conference Centre on 21st July 2009. This event, which is being organised by the JISC and the UK Web Archiving Consortium in conjunction with the Digital Preservation Coalition, aims to “bring together key stakeholders – web managers, archive managers, preservation experts, national libraries, web archivists and content providers – for practical and focussed discussion on shared perspectives, requirements, problems and solutions. Formal presentations and case studies will be presented with an opportunity for posters and demonstrations of tools”.
We will provide information on the detailed programme and how to register when this information is published.
Members of the JISC PoWR Team will be participating at next week’s JISC conference, which takes place in Edinburgh on 24th March 2009.
In the session, entitled “Who should preserve the web?” a panel will
“Outline the key issues with archiving and preserving the web and will describe practical ways of approaching these issues. Looking at the international picture and the role of major consortia working in this area, the session will also offer practical advice from the JISC Preservation of Web Resources (PoWR) project on the institutional benefits of preserving web resources, what tools and processes are needed, and how a records management approach may be appropriate.”
If you are attending the conference we hope you will attend the session and participate in the discussions. If you are attending one of the other parallel sessions you can meet the UKOLN members of the JISC PoWR team at the UKOLN stand. And if you haven’t booked a place at the conference (which is now fully subscribed) feel free to participate in the discussions on the online forum.
Members of the JISC PoWR team from UKOLN and ULCC will be attending the JISC 2009 conference in Edinburgh on 24th March 2009. UKOLN will have a stand in the accompanying exhibition and we intend to produce a poster about the work of the JISC PoWR project which will be on display.
In order to help you spot the poster at what is likely to be a very busy event we’ve included an image of the poster in this post (which is also available on Slideshare, if you’d like to see more details of the content of the poster).
Too late to be of much use, I suspect, but just before Christmas I received an email containing details of two CASPAR (Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval) Training Days. The CASPAR Training Day for the Cultural Domain will be held on 12 January 2009 and the CASPAR Training Day for the Scientific Domain on the following day (13 January 2009).
The seminars will take place in Rome, and are free to attend. If you require further information please email:
A recent post on the digital-preservation list indicates that at least one scholarly community has recognised the long-term scholarly value of online resources such as blogs, and the potential damage to future scholarship that might result from their loss. It draws attention to a symposium taking place at Georgetown University next year. The email says that the symposium:
…will build upon the fundamental assumption that blogs are an integral part of today’s legal scholarship.
and goes on to say:
This symposium will bring together academic bloggers, librarians, and experts in digital preservation …. Symposium participants will collectively develop innovative practices to ensure that valuable scholarship is not easily lost.
Join the conversation now by tagging items you think are relevant to this symposium with the del.icio.us tag FTLS2009.
It’s interesting to observe that this is an example of a community acting to preserve information of interest that is likely to be scattered over many institutions and none. (I suspect a fair amount of blogging in this area is done by practitioners who aren’t at an academic institution.) One of the concerns we identified in PoWR was that much material of this type was unlikely to be preserved as a result of institutional interests, unless one institution tried to bring materials like this into the remit of its special collections (and some have done this.)
The conference web site goes on to say:
This unique symposium will seek answers to the questions:
1. How can quality academic scholarship reliably be discovered?
2. How can future researchers be assured of perpetual access to the information currently available in blogs?
3. How can any researcher be confident that documents posted to blogs are genuine?
The symposium will include a working group break-out session to create a uniform standard for preservation of blogs, a document to be shared by bloggers and librarians alike.
That last goal of a uniform standard for blog preservation looks like a tall order and it will be interesting to see what emerges from this group, and what its wider relevance might be. But it’s a clear demonstration of the value of web material to some research communities, and their willingness to do something about it if their institutions can’t, or won’t, help them.
A presentation on JISC PoWR entitled Preservation for the Next Generation was given yesterday at the Internet Librarian International Conference 2008 held at the Novotel London West.
The slides of the talk are now available from Slideshare and embedded below.
The presentation was well received and sparked a lot of interest particularly from delegates from US libraries. Donald Grose, Dean of Libraries from the University of North Texas, informed me that they have been preserving many of the US government Web sites for a number of years. See this related press release. Looking into Web resource preservation activity in other sectors and possibly other countries is definitely an area of interest for the future. Hopefully the JISC PoWR project will be able to talk more to Donald about his work in the future.
The web-archiving strand at the iPRES Conference was an opportunity for a number of National Libraries to describe their initiatives, their approaches to web capture and preservation, and the things that were (for them) the biggest hurdles. It’s interesting to me that librarians – as opposed to archivists – seem to have been first off the mark with web-archiving, although in the UK The National Archives have at least two parallel initiatives underway (UKWAC and European Archive) to capture government websites deemed to be records.
From the PoWR point of view, one thing I found useful was the summary statement from Iceland which suggests that the community is now starting to agree on three main approaches to selection – bulk/domain harvesting, thematic, and event-based. Sometimes I wonder if these approaches are too library-influenced (looking at websites primarily as publications, rather than as records, whereas in PoWR we are suggesting that HFE websites contain and function as a mixture of both), but it helped me sharpen up my thinking and I fed this idea back into the PoWR Handbook’s chapter on Selection.
Since they were all representing libraries, most of the presenters tended to focus on legal deposit (and attendant permission problems) as the biggest hurdle to gathering websites – and the one that eats up the most in terms of resources. Denmark have not resolved this completely, and although they have gathered lots of material from the web, they don’t make it widely available – only allowing controlled access for research or statistical purposes. France, on the other hand, have had legal deposit laws in place since 2006. Australia have found this such a problem (despite the strenuous efforts they have made to get the law changed) that it pretty much forced the decision to go the curatorial route. All permissions are negotiated, but the collections are shaped to a certain extent by community input.
With my archivist hat on, I sat up when we were told quite categorically that registration and cataloguing of web resources would not work, and that no-one should ever even attempt it. The BL admitted they were ‘not really collecting metadata’. If this is true, I immediately wondered, why do both PANDAS and Web Curator Tool (which I have used) have conventional Dublin Core metadata elements built into their workflow? Don’t we anticipate cataloguing archived web resources in some way? I almost asked a question about this, until I reflected on the use of NutchWax and full-text indexing (which is probably acceptable until we can come up with some form of semantic tagging or automated metadata extraction for web resources).
Then I found myself reaching for the mic to ask a question about what I call ‘endangered resources’. We have already blogged about this on dablog, when my colleague Joanne Anthony raised a question about a web resource owned by a smallish institution which suddenly found itself with its funding removed. Was there anything such an institution could do, I asked the panel, to preserve its website? And what were the pro-active steps being taken by these National Libraries to rescue or identify resources at risk? I know that UKWAC, for example, offers a public submission service on its website, although it is not very prominent or visible, nor is it quite clear what happens to requests for archiving once the form has been filled in. I received some interesting replies, including the amusing anecdote from France which suggests that their archival collections have been accessed by red-faced politicians who have accidentally deleted their own blogs. However, I still wasn’t quite sure what national initiatives exist to address what I perceive as a significant gap in the preservation of unrecognised (and therefore uncaptured) resources.
Brian and Marieke have already written about iPres2008 and PoWR, and I have written and will write more about it from a general perspective on DABlog. But we thought it would be worth saying a bit more about what this conference, which is looking at the complete picture of digital preservation, had to say which is of relevance to PoWR’s work of web preservation in UK Universities.
There was an entire session devoted to various web archiving initiatives on the second day, which at first sight one might think is of particular relevance (almost as much as Brian’s presentation, one might think.) I wasn’t at this session – it was one of those running in parallel tracks, and I was speaking in the other track – but Ed Pinsent was and will be writing at more length about it soon. But even without attending, I’m aware that many of the projects, operating as they do within their national domains in Australia or elsewhere, won’t have much role in helping save UK University web content (unless we move our domains to .edu.au – there’s a thought.) Even when the BL realises its long-term aim of harvesting across the entire UK web domain, it still will be selective in some ways about what it captures – about depth and frequency of harvests, and about the type of content. You won’t be able to depend on those institutions to capture what you want to be captured. So if these initiatives aren’t going to meet all our needs, do we need to do it ourselves? The PoWR project thinks not, but that is one of the options institutions will need to examine. The work the IIPC is doing to develop harvesting and access tools will be of interest to those few institutions that feel able to operate these tools themselves – not something to be undertaken lightly.
Yet there was much of relevance at iPres2008. One recurring theme, picked up at the outset by Lynne Brindley and in Steve Knight’s closing remarks, was that ‘digital preservation’ is not the term to be using in discussions with our institutions and the world, echoing remarks on the DCC blog which Brian later picked up on here. Steve prefers the phrase ‘permanent access’, which is indeed outcome-focussed. However, we’ve also said in PoWR that preservation isn’t always forever, so I would prefer something a little more all-embracing – ‘long-lived access’ might fit.
The sessions covering things like significant properties also touched on issues that PoWR is concerned with. When we decide to preserve something, what is it that we’re really trying to keep? Most forms of preservation change the original object in some way, just as long-life milk isn’t the same as pasteurised, and neither are quite as tasty as fresh milk (or so I’ve been told.) This is clearly still a very difficult problem, and one that (to my mind) demonstrates that the digital preservation community hasn’t even developed a clear problem statement, much less a fully worked-out solution. So, in the meantime, we need to be pragmatic and do what seems best at the time. Always a good plan.
As my colleague Marieke Guy commented recently, I presented a paper on “Preservation of Web Resources: The JISC PoWR Project” at the iPRES 2008 conference on Monday 29 September 2008 which described the work of the JISC PoWR project. The iPRES 2008 conference, incidentally, was featured in an article “In praise of … preserving digital memories” published in The Guardian Editorial page yesterday (1 October 2008). The article stated that “If all goes well, we will have the capacity to preserve as many of our memories, personal and national, as we want”. So it was very pleasing to present the work of the JISC PoWR project, which explored ways in which memories held on Web sites can be selected and preserved.
The slides of the talk (in which I focus primarily on preservation within a Web 2.0 environment) are now available and are embedded below.
There is also a video recording of the talk available (although I haven’t yet been able to upload the video to Google Video to allow it to be embedded in other Web pages, I’m afraid).
I should also add that Chris Rusbridge provided a comprehensive report on the conference. I was pleased to read Chris’s comments on my talk which he described as “a very entertaining talk, and well worth looking up”. He went on to describe me as “not a preservationist, but is a full-blown technogeek discussing the roles of the latest Web 2.0 technologies on his blog, in his role as UK Web Focus”. And this technogeek was particularly pleased to read that the JISC PoWR “project achieved a strong level of interaction through its several workshops”.
Brian Kelly will be presenting a paper on “Preservation of Web Resources: The JISC PoWR Project” authored by the JISC PoWR team at the fifth International Conference on Preservation of Digital Objects (iPres 2008) this coming Monday (29th September 2008). The conference will be held at the British Library from 29 – 30th September 2008 and brings together researchers and practitioners from around the world to explore the latest trends, innovations, thinking, and practice in digital preservation.
The slides and accompanying paper are available from the UKOLN Web site.