<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/wordpress-mu-1.2.5" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>
<channel>
	<title>Comments on: ArchivePress: When One Size Doesn&#8217;t Fit All</title>
	<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/</link>
	<description>Preservation of Web Resources: a JISC-sponsored project</description>
	<pubDate>Fri, 19 Mar 2010 04:14:11 +0000</pubDate>
	<generator>http://wordpress.org/?v=wordpress-mu-1.2.5</generator>

	<item>
		<title>By: Our first month &#171; ArchivePress + APrints</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-34787</link>
		<dc:creator>Our first month &#171; ArchivePress + APrints</dc:creator>
		<pubDate>Tue, 16 Mar 2010 23:16:43 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-34787</guid>
		<description>[...] also had some highly useful discussions about the project on the JISC-PoWR blog and at Peter Murray-Rust&#8217;s blog. Among the things I&#8217;ve learned from them is [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] also had some highly useful discussions about the project on the JISC-PoWR blog and at Peter Murray-Rust&rsquo;s blog. Among the things I&rsquo;ve learned from them is [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: JISC-PoWR &#187; Blog Archive &#187; ArchivePress Blog Preservation Survey</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-19432</link>
		<dc:creator>JISC-PoWR &#187; Blog Archive &#187; ArchivePress Blog Preservation Survey</dc:creator>
		<pubDate>Wed, 02 Sep 2009 09:05:58 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-19432</guid>
		<description>[...] focusing on blogs as records of institutional activity and corporate memory. A recent post entitled ArchivePress: When One Size Doesn’t Fit All explains the project aims in more [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] focusing on blogs as records of institutional activity and corporate memory. A recent post entitled ArchivePress: When One Size Doesn’t Fit All explains the project aims in more [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ArchivePress &#187; Blog Archive &#187; Our first month</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-16716</link>
		<dc:creator>ArchivePress &#187; Blog Archive &#187; Our first month</dc:creator>
		<pubDate>Wed, 08 Jul 2009 09:08:18 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-16716</guid>
		<description>[...] also had some highly useful discussions about the project on the JISC-PoWR blog and at Peter Murray-Rust&#8217;s blog. Among the things I&#8217;ve learned from them is [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] also had some highly useful discussions about the project on the JISC-PoWR blog and at Peter Murray-Rust&#8217;s blog. Among the things I&#8217;ve learned from them is [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Latest bloging software news - JISC-PoWR » Blog Archive » ArchivePress: When One Size Doesn&#8217;t Fit All &#124; Best Firepow Review</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-16073</link>
		<dc:creator>Latest bloging software news - JISC-PoWR » Blog Archive » ArchivePress: When One Size Doesn&#8217;t Fit All &#124; Best Firepow Review</dc:creator>
		<pubDate>Sat, 27 Jun 2009 13:00:36 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-16073</guid>
		<description>[...] JISC-PoWR &#187; Blog Archive &#187; ArchivePress: When One Size Doesn&#8217;t Fit All [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] JISC-PoWR &#187; Blog Archive &#187; ArchivePress: When One Size Doesn&#8217;t Fit All [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Richard M. Davis</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-16005</link>
		<dc:creator>Richard M. Davis</dc:creator>
		<pubDate>Fri, 26 Jun 2009 15:08:34 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-16005</guid>
		<description>Hi Owen... Note we're not talking about websites as a whole, just blogs, which I think are a sufficiently distinct phenomenon to merit special attention. 

Our premise is that blog posts are atomic events, data packets, that /might/ be more effectively captured cumulatively. Crawling is a very intensive process, typically at arbitrary intervals, often unrelated to the life-cycle of information in the target site, and frequently results in both unnecessary duplication and unwanted gaps. I'd be interested to know what the UK Web Archive is doing with any newsfeeds it harvests - but I might argue that capturing (A) HTML and (B) Newsfeeds is unnecessary duplication, if you can capture the underlying data and the queries/templates that render (A) and (B). (To say nothing of (C) the WAP view, and probably some others too). I admit this is a data-centric view - no doubt influenced by my database background, and time in a &lt;a href="http://ndad.nationalarchives.gov.uk/" rel="nofollow"&gt;dataset archive&lt;/a&gt;. 

As for the preservation, whether crawling or A-Pressing, that's to some extent about what one does next, whether one has a relational database or a bunch of (W)ARC files. My guess is, also, that your average IT department will find it easier to understand the former than the latter.

FYI, Ed Pinsent, our resident archivist and Man From UKWAC, has just offered some of his &lt;a href="http://archivepress.ulcc.ac.uk/2009/06/25/ap-pres/" rel="nofollow"&gt;preliminary thoughts&lt;/a&gt; relating to the crawling approach.</description>
		<content:encoded><![CDATA[<p>Hi Owen&#8230; Note we&#8217;re not talking about websites as a whole, just blogs, which I think are a sufficiently distinct phenomenon to merit special attention. </p>
<p>Our premise is that blog posts are atomic events, data packets, that /might/ be more effectively captured cumulatively. Crawling is a very intensive process, typically at arbitrary intervals, often unrelated to the life-cycle of information in the target site, and frequently results in both unnecessary duplication and unwanted gaps. I&#8217;d be interested to know what the UK Web Archive is doing with any newsfeeds it harvests - but I might argue that capturing (A) HTML and (B) Newsfeeds is unnecessary duplication, if you can capture the underlying data and the queries/templates that render (A) and (B). (To say nothing of (C) the WAP view, and probably some others too). I admit this is a data-centric view - no doubt influenced by my database background, and time in a <a href="http://ndad.nationalarchives.gov.uk/" rel="nofollow" onclick="javascript:urchinTracker ('/outbound/comment/ndad.nationalarchives.gov.uk');">dataset archive</a>. </p>
<p>As for the preservation, whether crawling or A-Pressing, that&#8217;s to some extent about what one does next, whether one has a relational database or a bunch of (W)ARC files. My guess is, also, that your average IT department will find it easier to understand the former than the latter.</p>
<p>FYI, Ed Pinsent, our resident archivist and Man From UKWAC, has just offered some of his <a href="http://archivepress.ulcc.ac.uk/2009/06/25/ap-pres/" rel="nofollow" onclick="javascript:urchinTracker ('/outbound/comment/archivepress.ulcc.ac.uk');">preliminary thoughts</a> relating to the crawling approach.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Owen Stephens</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-15999</link>
		<dc:creator>Owen Stephens</dc:creator>
		<pubDate>Fri, 26 Jun 2009 14:23:33 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-15999</guid>
		<description>Thanks Richard. You'll be glad to know I've challenged Brian on this as well - so I do blame him :)

The use cases you describe here definitely help me understand more of what you see as the advantages - but would fall outside what I would think of as 'preservation' - perhaps it is agreement on what it means to 'preserve' a website that needs unpicking.

I think there are some distinctions to be made between how you capture the content, and how you store it, and how you can manipulate it. My instinct is that crawling html to capture the content is a more reliable way of getting the full picture for any particular blog at any particular time - and of course, if you capture the full content of the site, you get the RSS feeds as well. You could then decide to store this in a structured format if you wanted to do the kind of queries you describe.

I know Maureen promised a blog post on the project, so I'm looking forward to reading that as well.</description>
		<content:encoded><![CDATA[<p>Thanks Richard. You&#8217;ll be glad to know I&#8217;ve challenged Brian on this as well - so I do blame him :)</p>
<p>The use cases you describe here definitely help me understand more of what you see as the advantages - but would fall outside what I would think of as &#8216;preservation&#8217; - perhaps it is agreement on what it means to &#8216;preserve&#8217; a website that needs unpicking.</p>
<p>I think there are some distinctions to be made between how you capture the content, and how you store it, and how you can manipulate it. My instinct is that crawling html to capture the content is a more reliable way of getting the full picture for any particular blog at any particular time - and of course, if you capture the full content of the site, you get the RSS feeds as well. You could then decide to store this in a structured format if you wanted to do the kind of queries you describe.</p>
<p>I know Maureen promised a blog post on the project, so I&#8217;m looking forward to reading that as well.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Richard M. Davis</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-15992</link>
		<dc:creator>Richard M. Davis</dc:creator>
		<pubDate>Fri, 26 Jun 2009 11:02:43 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-15992</guid>
		<description>Hi Owen

I think I'm clear on the idea that harvesting the HTML is not enough: what I believe we have in blogs are data, and data objects, and so the idea is to treat them as such. If I ask the version of this blog in the UK Web Archive to 'show me all Brian's posts', or 'show me all the posts from June 2009', it's dumb. With a database (relational or XML) I should be able to do that quite easily (as I can by querying the backend database to this blog; or &lt;a href="http://jiscpowr.jiscinvolve.org/author/briankelly" rel="nofollow"&gt;wrapping that query in a web script&lt;/a&gt;). 

Over an aggregated accumulation of blogs from an institution I think that if we can avoid losing the flexibility that the implicit data structure of blogs gives us, we should: all sorts of semantic, text-mining possibilities might await. To settle for a screen-scraping solution is to lose that: we might as well just PDF everything.

We'll be posting more thoughts on the data structure on the ArchivePress blog, but I don't see it as 'creating a database structure' - all blog systems share common assumptions about data and metadata, in order to map to the key elements of RSS/Atom (or is it the other way round?!) and that's kind of what we're focusing on. 

If you like you can blame Brian :) who told me a while back that a significant proportion of his readers consumed his blog via feed readers, not the web pages themselves; and Chris Rusbridge's suggestion that "blogs represent an area where the content is primary and design secondary": our premise lurks somewhere between those two ideas.

The versioning aspect I'm also interested in, as it relates to the archival issues of authenticity and reliability: blog posts are increasingly cited and yet highly susceptible to change or deletion. On balance, I'd prefer to easily cite blog posts in a stable archive rather than in the wild, and I hope the project might help us get closer to that outcome.

Maureen will probably have more and better ideas. I'm looking forward to discussing it more over the next 6 months.</description>
		<content:encoded><![CDATA[<p>Hi Owen</p>
<p>I think I&#8217;m clear on the idea that harvesting the HTML is not enough: what I believe we have in blogs are data, and data objects, and so the idea is to treat them as such. If I ask the version of this blog in the UK Web Archive to &#8216;show me all Brian&#8217;s posts&#8217;, or &#8216;show me all the posts from June 2009&#8217;, it&#8217;s dumb. With a database (relational or XML) I should be able to do that quite easily (as I can by querying the backend database to this blog; or <a href="http://jiscpowr.jiscinvolve.org/author/briankelly" rel="nofollow" >wrapping that query in a web script</a>). </p>
<p>Over an aggregated accumulation of blogs from an institution I think that if we can avoid losing the flexibility that the implicit data structure of blogs gives us, we should: all sorts of semantic, text-mining possibilities might await. To settle for a screen-scraping solution is to lose that: we might as well just PDF everything.</p>
<p>We&#8217;ll be posting more thoughts on the data structure on the ArchivePress blog, but I don&#8217;t see it as &#8216;creating a database structure&#8217; - all blog systems share common assumptions about data and metadata, in order to map to the key elements of RSS/Atom (or is it the other way round?!) and that&#8217;s kind of what we&#8217;re focusing on. </p>
<p>If you like you can blame Brian :) who told me a while back that a significant proportion of his readers consumed his blog via feed readers, not the web pages themselves; and Chris Rusbridge&#8217;s suggestion that &#8220;blogs represent an area where the content is primary and design secondary&#8221;: our premise lurks somewhere between those two ideas.</p>
<p>The versioning aspect I&#8217;m also interested in, as it relates to the archival issues of authenticity and reliability: blog posts are increasingly cited and yet highly susceptible to change or deletion. On balance, I&#8217;d prefer to easily cite blog posts in a stable archive rather than in the wild, and I hope the project might help us get closer to that outcome.</p>
<p>Maureen will probably have more and better ideas. I&#8217;m looking forward to discussing it more over the next 6 months.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Owen Stephens</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-15982</link>
		<dc:creator>Owen Stephens</dc:creator>
		<pubDate>Fri, 26 Jun 2009 08:37:13 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-15982</guid>
		<description>I'm a little bit sceptical about this. What is the actual worth of 'preserving' the database? The set of html pages that make up the blog are the publication - if you want to preserve the publication, then surely this is what you should preserve? We might need to get cleverer about how we do this, but I'm not sure that the solution is what you describe here.

If you want to preserve the database as an artifact (which I'm not sure about as a general approach, but can see you might want to do this in some circumstances) then the only way to do this is to preserve the database - you can't 'recreate' it from the feeds.

The more I think about this, I'm afraid the more problems I can see. What happens if I make a change to an old post? I can say that it is very standard practice for the web page representing that post to be updated (otherwise what's the point) - but I don't think you would see a consistent behaviour in the feeds from all packages - does an updated post from 3 years ago appear in the blog's RSS feed?

OK - so answering some of these questions is the point of the project - fair enough, and I don't want to be overly negative about this - but I just can't see the justification for doing this above harvesting the html. Could you post some more information on why you think creating a database structure (which may or may not bear some resemblance to the original database structure) for the content is important for archiving purposes?</description>
		<content:encoded><![CDATA[<p>I&#8217;m a little bit sceptical about this. What is the actual worth of &#8216;preserving&#8217; the database? The set of html pages that make up the blog are the publication - if you want to preserve the publication, then surely this is what you should preserve? We might need to get cleverer about how we do this, but I&#8217;m not sure that the solution is what you describe here.</p>
<p>If you want to preserve the database as an artifact (which I&#8217;m not sure about as a general approach, but can see you might want to do this in some circumstances) then the only way to do this is to preserve the database - you can&#8217;t &#8216;recreate&#8217; it from the feeds.</p>
<p>The more I think about this, I&#8217;m afraid the more problems I can see. What happens if I make a change to an old post? I can say that it is very standard practice for the web page representing that post to be updated (otherwise what&#8217;s the point) - but I don&#8217;t think you would see a consistent behaviour in the feeds from all packages - does an updated post from 3 years ago appear in the blog&#8217;s RSS feed?</p>
<p>OK - so answering some of these questions is the point of the project - fair enough, and I don&#8217;t want to be overly negative about this - but I just can&#8217;t see the justification for doing this above harvesting the html. Could you post some more information on why you think creating a database structure (which may or may not bear some resemblance to the original database structure) for the content is important for archiving purposes?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ResourceShelf &#187; Blog Archive &#187; UK: A New Project to Learn What it Takes to Archive Blog Content</title>
		<link>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-15856</link>
		<dc:creator>ResourceShelf &#187; Blog Archive &#187; UK: A New Project to Learn What it Takes to Archive Blog Content</dc:creator>
		<pubDate>Wed, 24 Jun 2009 15:44:39 +0000</pubDate>
		<guid>http://jiscpowr.jiscinvolve.org/2009/06/24/archivepress-when-one-size-doesnt-fit-all/#comment-15856</guid>
		<description>[...] Source: ArchivePress Hat Tip: The JISC-PoWR Blog (Excellent Overview of Project [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] Source: ArchivePress Hat Tip: The JISC-PoWR Blog (Excellent Overview of Project [&#8230;]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
