
How should you scope a crawl for web archiving online discussion forums

+3 votes
3,917 views

A lot of popular online discussion platforms (phpBB, vBulletin, Invision Power Board, etc.) generate many different kinds of URLs for the same discussion threads and digital assets, and do a lot of strange things with links for pagination and the like. You can easily get stuck in a range of crawler traps. What are some good tactics to use when scoping a crawl of an online discussion forum in order to archive it? In other words, going into scoping and planning to archive a discussion forum, what kinds of ideas and tactics should one be thinking about and considering?

asked Jun 5, 2014 by tjowens (2,360 points)
edited Aug 13, 2014 by tjowens

1 Answer

+2 votes

In most cases, an archive will only really be interested in the unique content of a forum - threads, user profile pages, and the high-level organization of the site (the arrangement of forums, subforums, and threads).  Unfortunately, most links on forums exist for ease of navigation and user administration.  A single thread may have its unique URL, a #lastpost anchor URL, links to its neighbors via #next and #previous, display settings passed as URL arguments, and so on.

 

To give an idea of the complexity that forums can have, a single thread's page (10 posts) from http://www.ubuntuforums.org contained 327 intradomain links.  Most of these are for user actions and serve no purpose in a web archive.

 

As such, there are a few options here:

 

Blacklisting - Assume that a forum is essentially all useful, and reduce the scope to exclude URLs you aren't interested in.  For example, on Ubuntu Forums, you would start with a simple recursive crawl from www.ubuntuforums.org and add pattern-matching exclusions for URLs such as *markread*, *sort*, *newreply*, and so on.  The benefit of this approach is that it guarantees capturing (nearly) everything of interest on the forum; however, it requires a lot of setup, testing, and fine-tuning of the scope to minimize the number of non-useful pages captured.
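
To make the blacklist idea concrete, here is a minimal sketch in Python (not any particular crawler's configuration syntax) of a scope filter built from exclusion patterns.  The host name and patterns follow the Ubuntu Forums example above, but the actual list would have to be tuned by watching what the crawl pulls in.

    import re

    # Illustrative exclusion patterns for a vBulletin-style forum such as
    # www.ubuntuforums.org; a real list is refined by inspecting crawl logs.
    EXCLUDE_PATTERNS = [
        re.compile(r"markread", re.IGNORECASE),
        re.compile(r"newreply", re.IGNORECASE),
        re.compile(r"[?&]sort="),
        re.compile(r"#(?:post\d+|lastpost)"),  # anchor variants of the same thread page
    ]

    def in_scope(url: str) -> bool:
        """Blacklist scoping: everything on the seed host is in scope
        unless it matches one of the exclusion patterns."""
        if "ubuntuforums.org" not in url:
            return False
        return not any(p.search(url) for p in EXCLUDE_PATTERNS)

    # Navigation/administration URLs are rejected; thread pages are kept.
    print(in_scope("http://www.ubuntuforums.org/showthread.php?t=12345"))            # True
    print(in_scope("http://www.ubuntuforums.org/newreply.php?do=newreply&t=12345"))  # False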

 

Whitelisting - Assume that a forum is essentially all useless, and expand the scope to include URLs you are interested in.  Using the Ubuntu Forums example again, you would start by including *forumdisplay*, *showthread*, and *member* (with regular expressions written to block unnecessary arguments) and expand from there.  The benefit of a whitelist approach is that it requires less monitoring and testing to ensure that traps and unwanted pages aren't being captured; however, it also requires knowing exactly what you want beforehand and comes with the risk of excluding pages that you want, but didn't know existed.
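
For contrast, a whitelist-style filter along the same lines.  The inclusion regexes below are hypothetical and assume vBulletin-style forumdisplay/showthread/member URLs, so they would need adjusting for other forum software; they are also written to reject extra display and sort arguments.

    import re

    # Illustrative inclusion patterns: only forum listings, threads
    # (optionally paginated), and member profiles are wanted.
    INCLUDE_PATTERNS = [
        re.compile(r"/forumdisplay\.php\?f=\d+$"),
        re.compile(r"/showthread\.php\?t=\d+(&page=\d+)?$"),
        re.compile(r"/member\.php\?u=\d+$"),
    ]

    def in_scope(url: str) -> bool:
        """Whitelist scoping: nothing is in scope unless it matches
        one of the inclusion patterns."""
        return any(p.search(url) for p in INCLUDE_PATTERNS)

    print(in_scope("http://www.ubuntuforums.org/showthread.php?t=12345&page=2"))        # True
    print(in_scope("http://www.ubuntuforums.org/showthread.php?t=12345&goto=newpost"))  # False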

answered Jun 5, 2014 by alexanderduryee (800 points)