• Register

Is there a web archiving tool that produces WARC + directory tree?

+4 votes
14,390 views

An important feature of any web archiving tool is the ability to specify that one would like to pull down embedded resources ("page requisites" in the parlance of wget) that are hosted on a domain other than the site that is the target of the crawl. Such cases are found not only as examples of people "hot-linking" other people's images, but is encountered heavily on sites with cloud based content delivery networks, such as Tumblr. Including such assets in a crawl is absolutely necessary when one's aim is to achieve a complete mirror of a site.

Heritrix has an option for including such assets in the crawl scope:https://webarchive.jira.com/wiki/display/Heritrix/unexpected+offsite+content

As does Httrack, using the --near flag:\

http://www.httrack.com/html/fcguide.html\ But of course, Httrack does not offer WARC output.

Wget has the -H flag, allowing one to "span hosts" (in other words hit sites with domain names other than the starting url), it lacks the ability to specify that the crawler should span hosts only for page requisites, and so tries to download the entire web if one combines an infinitely recursive crawl with -H. There are some hacky ways of getting around this, but they aren't pretty or reliable. The great thing about Wget though is that it allows the user to output WARC and a directory tree of the crawled site, thus not locking one in to WARC completely.

Are there any tools of the trade that allow one to conduct the comprehensive type of crawl mentioned above, but also have the affordance of outputting WARC, and a directory tree?

asked Sep 9, 2014 by benfinoradin (460 points)

1 Answer

+2 votes

I'd say you have two options. You could crawl using Heritrix and generate the WARC files, but then pop the contents of those files out into a directory tree. You could do this via one of these two warctozip scripts 1, 2 (disclaimer: I wrote the latter one and it's just a rough prototype and should probably be merged with the former).

The other option is to use an existing tool to enumerate all the URLs and then pass that list to wget. This is what my flashfreeze tool does, using a PhantomJS script to determine all the dependencies of a specific page, and then using wget to grab them as a WARC file.

answered Sep 9, 2014 by anjackson (2,930 points)
Note that my PhantomJS script should probably be superseded by this one: https://github.com/kimmobrunfeldt/url-to-image
Another tool has just been brought to my attention (https://twitter.com/atomotic/status/509653702324789248 ), called warcat: https://github.com/chfoo/warcat
...