JISC PoWR

Preservation of Web Resources: a JISC-funded project [Archived Blog]

PDF web page capture

Posted by Kevin Ashley on September 13th, 2008

During the last JISC PoWR workshop yesterday in Manchester (of which more anon) I made brief mention of a tool from Adobe which allows web pages, or entire sites, to be captured to a PDF file. I mentioned this primarily to illustrate one of the three points at which web capture can take place (behind the server; from the HTTP transaction; or browser-side) but it generated considerable interest, and I promised to blog about the product since I could not remember what it was called.

It turns out that it’s not a separate product, nor a plug-in, but a built-in part of Adobe Acrobat. It was first available as a free add-on for Acrobat 4 in 1998 or 1999, and I think it was then that I first saw this demonstrated at the PRO (as it then was) – hence my misunderstanding. Tools like this have their place, but (like all web preservation technologies) they also have their drawbacks. PDF’s print-oriented format isn’t a good match for some sites, much as some sites don’t look good when you try to print them. (In fact, I believe that Acrobat Web Capture effectively uses the browser’s print engine combined with a PDF writer pseudo-printer to do its work, so there will be a close correlation.) But we’ll be covering this tool, along with others, in the handbook.

2 Responses to “PDF web page capture”

  1. Gareth Knight Says:

    OpenOffice can also be used to mass convert web pages into PDF. I attended a presentation on a.nnotate at the Repository Fringe event where they suggested they had used OO to provide platform-independent, collaborative annotation of web pages.

  2. Fred Howell Says:

    [from one of the A.nnotate.com developers who was at the Edinburgh repository fringe...]

    A.nnotate will let you do web page capture – you can enter a URL or use a bookmarklet to take a snapshot of a web page and store a copy of the HTML in your private space on the a.nnotate.com site – which can be useful for archiving, as you get the page at a particular point in time. Currently it does a shallow copy (i.e. just the HTML) – the images etc. are left on the original site, so it would need to download those too if you wanted to use it for archiving. The A.nnotate server is also available for local installation (with an API) if you want to integrate it with some other CMS.

    You can also upload PDFs to A.nnotate (we use OpenOffice to convert from the various Office formats to PDF first) and these get converted to images and rendered in the browser using pure HTML / AJAX (without any dependency on Flash or Adobe Reader).
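
The difference between a shallow copy (just the HTML) and a capture suitable for archiving, as described in the comment above, comes down to the resources the page embeds. As a rough illustration only, and not a description of how A.nnotate or Acrobat actually work, the sketch below lists the images, scripts and stylesheets that a saved HTML page still points at. The names `ResourceCollector` and `embedded_resources` are hypothetical, and the sketch assumes that only `img`, `script` and stylesheet `link` tags matter, which ignores things like CSS background images and frames.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of resources embedded in an HTML page.

    Hypothetical helper for illustration: it looks only at <img src>,
    <script src> and <link rel="stylesheet" href>.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script"):
            ref = attrs.get("src")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower().split():
            ref = attrs.get("href")
        else:
            ref = None
        if ref:
            # Resolve relative references against the page's own URL.
            url = urljoin(self.base_url, ref)
            if url not in self.resources:
                self.resources.append(url)


def embedded_resources(html, base_url):
    """Return the de-duplicated list of resource URLs, in document order."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.resources
```

Every URL returned by `embedded_resources` is something a shallow copy leaves on the original site; a capture intended for preservation would need to fetch and store each of them alongside the HTML.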