<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Digital Preservation Q&amp;A - Recent questions tagged wget</title>
<link>https://qanda.digipres.org/tag/wget</link>
<description>Powered by Question2Answer</description>
<item>
<title>Best way to crawl website from localhost with wget, preserving all files in source directory</title>
<link>https://qanda.digipres.org/1166/crawl-website-localhost-preserving-files-source-directory</link>
<description>&lt;p&gt;
	We recently recovered the contents of an old (2004) website from CD-ROM. I managed to get a local instance of the site running using the Apache web server; by editing the machine’s &lt;em&gt;hosts&lt;/em&gt; file the site is available on that local machine from its original URL, which is &lt;code&gt;&lt;a href=&quot;http://www.nl-menu.nl&quot; rel=&quot;nofollow&quot;&gt;http://www.nl-menu.nl&lt;/a&gt;&lt;/code&gt; (some more context can be found &lt;a rel=&quot;nofollow&quot; href=&quot;http://openpreservation.org/blog/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited/&quot;&gt;here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;
	I’m now looking into ways to crawl the contents of the site into a WARC, so we can ingest it into our web archive. When initial experiments with Heritrix failed, I moved on to wget; after some experimentation, the following wget command appeared to work reasonably well:&lt;/p&gt;
&lt;h2 id=&quot;attempt-1-mirror-from-site-root&quot;&gt;
	Attempt 1: mirror from site root&lt;/h2&gt;
&lt;pre&gt;
&lt;code&gt;wget --mirror \
    --page-requisites \
    --warc-file=&quot;nl-menu&quot; \
    --warc-cdx \
    --output-file=&quot;nl-menu.log&quot; \
    &lt;a href=&quot;http://www.nl-menu.nl&quot; rel=&quot;nofollow&quot;&gt;http://www.nl-menu.nl&lt;/a&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;
	However, closer inspection of the result showed that about 668 files in the source directory are missing from the resulting WARC file. The majority (90%) of these files are “orphan” resources that are not referenced by any of the HTML files in the crawl. However, the remaining 10% of missing files are resources that &lt;em&gt;are&lt;/em&gt; referenced, in most cases through JavaScript variables. These aren’t picked up by wget, and therefore they end up missing from the WARC. So I am looking for a way to force wget to include these resources anyway.&lt;/p&gt;
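&lt;p&gt;
	(A hypothetical sketch of how such gaps can be listed: build the on-disk URL list with the same find/sed rewrite as in Attempt 2 below, extract the URL column from the wget-generated CDX, and diff the two with &lt;code&gt;comm&lt;/code&gt;. The file names and the assumption that the original URL is the first CDX column are mine; check the CDX header line.)&lt;/p&gt;

```shell
# Hypothetical sketch: list URLs that exist on disk but are absent from the
# WARC, by diffing the on-disk URL list against the wget CDX index.
# Stand-in data is used here; in the real case disk-urls.txt comes from the
# find/sed pipeline and warc-urls.txt from nl-menu.cdx.
printf 'http://www.nl-menu.nl/index.html\nhttp://www.nl-menu.nl/js/menu.js\n' | sort -u > disk-urls.txt
printf 'http://www.nl-menu.nl/index.html\n' | sort -u > warc-urls.txt
# Real case (assuming URL is the first column after the header):
#   awk 'NR > 1 {print $1}' nl-menu.cdx | sort -u > warc-urls.txt
comm -23 disk-urls.txt warc-urls.txt    # lines only in disk-urls.txt
```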
&lt;h2 id=&quot;attempt-2-use-input-file&quot;&gt;
	Attempt 2: use --input-file&lt;/h2&gt;
&lt;p&gt;
	At first wget’s &lt;code&gt;--input-file&lt;/code&gt; switch (which takes a list of URLs) looked like a good way to achieve this. I created a directory listing of all files that are part of the website, and then transformed them into corresponding URLs:&lt;/p&gt;
&lt;pre&gt;
&lt;code&gt;find /var/www/www.nl-menu.nl -type f \
    | sed -e 's/\/var\/www\//http:\/\//g' &amp;gt; urls.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;
	Then I ran wget like this (note that I removed the &lt;code&gt;--mirror&lt;/code&gt; option, as this apparently causes wget to do a recursive crawl &lt;em&gt;for each single URL&lt;/em&gt; in the list, which takes forever):&lt;/p&gt;
&lt;pre&gt;
&lt;code&gt;wget --page-requisites \
    --warc-file=&quot;nl-menu&quot; \
    --warc-cdx \
    --output-file=&quot;nl-menu.log&quot; \
    --input-file=urls.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;
	This results in a WARC file that contains &lt;em&gt;all&lt;/em&gt; files from the source directory: perfect! But it introduces a different problem: when I try to access the WARC using &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/webrecorder/pywb&quot;&gt;pywb&lt;/a&gt;, it turns out that the WARC is made up of 85864 individual captures (i.e.&amp;nbsp;each file appears to be treated as an individual capture). This makes rendering of the WARC impossible (loading the list of captures alone takes forever).&lt;/p&gt;
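&lt;p&gt;
	(For what it’s worth, the capture count can be checked directly from the CDX sidecar that &lt;code&gt;--warc-cdx&lt;/code&gt; writes: one record per line after the header. A sketch with stand-in data; the real file would be &lt;code&gt;nl-menu.cdx&lt;/code&gt;.)&lt;/p&gt;

```shell
# Sketch: count the records in the wget CDX sidecar; this should match the
# number of captures pywb reports. A stand-in CDX with two records is used.
printf ' CDX a b a m s k r M V g u\nhttp://example.org/a - - - - - - - - - -\nhttp://example.org/b - - - - - - - - - -\n' > nl-menu.cdx
tail -n +2 nl-menu.cdx | wc -l
```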
&lt;h2 id=&quot;attempt-3-include-list-of-urls-in-crawl&quot;&gt;
	Attempt 3: include list of URLs in crawl&lt;/h2&gt;
&lt;p&gt;
	So as a last resort I created a list of all URLs in HTML format, and put that file in the source directory. Steps:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot;&gt;
	&lt;li&gt;
		&lt;p&gt;
			Create a list of URLs in Markdown format (add an “&amp;lt;” prefix and a “&amp;gt;” suffix to each line):&lt;/p&gt;
		&lt;p&gt;
			&lt;code&gt;find /var/www/www.nl-menu.nl -type f | sed -e 's/\/var\/www\//&amp;lt;http:\/\//; s/$/&amp;gt;\n/g' &amp;gt; urls.txt&lt;/code&gt;&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
		&lt;p&gt;
			Replace any space characters with &lt;em&gt;%20&lt;/em&gt; to avoid malformed URLs:&lt;/p&gt;
		&lt;p&gt;
			&lt;code&gt;sed -i 's/\ /%20/g' urls.txt&lt;/code&gt;&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
		&lt;p&gt;
			Convert the URL list to HTML and place the result at the root of the source dir:&lt;/p&gt;
		&lt;p&gt;
			&lt;code&gt;sudo pandoc -s urls.txt -o /var/www/www.nl-menu.nl/urls.html&lt;/code&gt;&lt;/p&gt;
	&lt;/li&gt;
&lt;/ol&gt;
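&lt;p&gt;
	(For reference, steps 1–3 can be combined into a single pipeline. The sketch below runs against a throwaway fixture directory rather than the real &lt;code&gt;/var/www&lt;/code&gt; tree, so the rewrite can be checked safely; the &lt;code&gt;\x3c&lt;/code&gt;/&lt;code&gt;\x3e&lt;/code&gt; escapes are GNU sed hex notation for the literal angle brackets.)&lt;/p&gt;

```shell
# Sketch of steps 1-3 as one pipeline, demonstrated on a fixture directory.
# \x3c and \x3e are GNU sed hex escapes for the Markdown autolink brackets;
# in the real case the find root is /var/www/www.nl-menu.nl and pandoc then
# converts urls.txt to a standalone HTML page in the web root.
mkdir -p /tmp/www-demo/www.nl-menu.nl
touch '/tmp/www-demo/www.nl-menu.nl/my page.html'
find /tmp/www-demo/www.nl-menu.nl -type f \
    | sed -e 's|^/tmp/www-demo/|\x3chttp://|' -e 's/ /%20/g' -e 's|$|\x3e\n|' \
    > urls.txt
cat urls.txt
# then: pandoc -s urls.txt -o urls.html
```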
&lt;p&gt;
	Then I ran wget, using the above URL list as crawl root:&lt;/p&gt;
&lt;pre&gt;
&lt;code&gt;wget --mirror \
    --page-requisites \
    --warc-file=&quot;nl-menu&quot; \
    --warc-cdx \
    --output-file=&quot;nl-menu.log&quot; \
    &lt;a href=&quot;http://www.nl-menu.nl/urls.html&quot; rel=&quot;nofollow&quot;&gt;http://www.nl-menu.nl/urls.html&lt;/a&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;
	The resulting WARC contains all files that are in the source dir, and it can be accessed as one single capture in pywb. The obvious downside of this hack is that it compromises the integrity of the ‘original’ website by adding one (huge) HTML file that was not part of the original site to the WARC.&lt;/p&gt;
&lt;p&gt;
	This makes me wonder if there is another, more elegant way to do this that I have overlooked here? Any suggestions welcome!&lt;/p&gt;
&lt;p&gt;
	BTW I know this question is somewhat similar to &lt;a href=&quot;http://qanda.digipres.org/337/there-web-archiving-tool-that-produces-warc-directory-tree&quot; rel=&quot;nofollow&quot;&gt;this earlier one&lt;/a&gt;, but option 2 as mentioned by &lt;span class=&quot;citation&quot;&gt;@anjackson&lt;/span&gt; there looks similar to Attempt 2 in my case.&lt;/p&gt;
<guid isPermaLink="true">https://qanda.digipres.org/1166/crawl-website-localhost-preserving-files-source-directory</guid>
<pubDate>Tue, 03 Jul 2018 14:16:10 +0000</pubDate>
</item>
<item>
<title>Is there a web archiving tool that produces WARC + directory tree?</title>
<link>https://qanda.digipres.org/337/there-web-archiving-tool-that-produces-warc-directory-tree</link>
<description>&lt;p style=&quot;margin: 0px 0px 20px; color: rgb(119, 119, 119); font-family: Lato, 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 21px;&quot;&gt;
	An important feature of any web archiving tool is the ability to specify that one would like to pull down embedded resources (&quot;page requisites&quot; in the parlance of wget) that are hosted on a domain other than the site that is the target of the crawl. Such cases are found not only as examples of people &quot;hot-linking&quot; other people's images, but is encountered heavily on sites with cloud based content delivery networks, such as Tumblr. Including such assets in a crawl is absolutely necessary when one's aim is to achieve a complete mirror of a site.&lt;/p&gt;
&lt;p&gt;
	&lt;strong&gt;Heritrix&lt;/strong&gt; has an option for including such assets in the crawl scope: &lt;a rel=&quot;nofollow&quot; href=&quot;https://webarchive.jira.com/wiki/display/Heritrix/unexpected+offsite+content&quot;&gt;https://webarchive.jira.com/wiki/display/Heritrix/unexpected+offsite+content&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
	As does &lt;strong&gt;Httrack&lt;/strong&gt;, using the &lt;code&gt;--near&lt;/code&gt; flag: &lt;a rel=&quot;nofollow&quot; href=&quot;http://www.httrack.com/html/fcguide.html&quot;&gt;http://www.httrack.com/html/fcguide.html&lt;/a&gt;. But of course, Httrack does not offer WARC output.&lt;/p&gt;
&lt;p&gt;
	&lt;strong&gt;Wget&lt;/strong&gt; has the &lt;code&gt;-H&lt;/code&gt; flag, allowing one to &quot;span hosts&quot; (in other words, hit sites with domain names other than the starting URL), but it lacks the ability to span hosts only for page requisites during a recursive crawl, and so tries to download the entire web if one combines an infinitely recursive crawl with &lt;code&gt;-H&lt;/code&gt;. There are some hacky ways of getting around this, but they aren't pretty or reliable. The great thing about Wget, though, is that it allows the user to output WARC &lt;em&gt;and&lt;/em&gt; a directory tree of the crawled site, thus not locking one in to WARC completely.&lt;/p&gt;
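&lt;p&gt;
	(One such hacky workaround, sketched here with stand-in data rather than a live crawl: crawl the target domain first, extract the off-site URLs from the downloaded HTML, then fetch those in a second pass with &lt;code&gt;wget --page-requisites --span-hosts --input-file=...&lt;/code&gt;. All file names, the example domains, and the grep-based link extraction are hypothetical simplifications.)&lt;/p&gt;

```shell
# Hypothetical two-pass workaround: after a same-domain crawl, pull off-site
# resource URLs out of the downloaded HTML and fetch them separately.
# Stand-in HTML attribute line; a real page would need proper link extraction.
printf 'img src="https://cdn.example.net/pic.png" href="http://www.example.org/a.html"\n' > page.html
grep -oE 'https?://[^" ]+' page.html | grep -v 'www\.example\.org' > offsite-urls.txt
cat offsite-urls.txt
# second pass would be something like:
#   wget --page-requisites --span-hosts --input-file=offsite-urls.txt --warc-file=site
```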
&lt;p&gt;
	Are there any tools of the trade that allow one to conduct the comprehensive type of crawl mentioned above, but also have the affordance of outputting WARC &lt;em&gt;and&lt;/em&gt; a directory tree?&lt;/p&gt;
<guid isPermaLink="true">https://qanda.digipres.org/337/there-web-archiving-tool-that-produces-warc-directory-tree</guid>
<pubDate>Tue, 09 Sep 2014 20:05:01 +0000</pubDate>
</item>
</channel>
</rss>