
How should an organisation QA the results of its outsourced web archiving activities?

+2 votes
Outsourcing can help an organisation make progress when it doesn't have the resources, skills, technology or infrastructure to conduct particular activities itself. But it's only worth doing if there is a way to validate the quality and completeness of the work, which is possibly the biggest challenge in outsourcing. Organisations sometimes also outsource the provision of access to crawled websites. In this case the crawled data is supplied to the collecting organisation for preservation as well, but because end users never see that copy of the data, access provides no validation of it.
So what should that organisation be looking for in order to check the quality and completeness of the supplied crawl data, and how should it go about doing it? A key issue is perhaps what the third party can reasonably be expected to supply alongside the crawled data (for example, in the form of manifests) to enable this QA.
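To make the manifest idea concrete, here is a minimal sketch of what a collecting organisation could do with one. It assumes the supplier delivers a manifest in `sha256sum` style (one SHA-256 digest and a relative file path per line, separated by two spaces); that format, the file names, and the function names are illustrative assumptions, not any supplier's actual spec.

```python
"""Sketch: verify a delivered crawl against a supplier-provided manifest.

Assumed manifest format (illustrative): one line per delivered file,
'<sha256 hex digest><two spaces><relative path>', as produced by sha256sum.
"""
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_delivery(manifest: Path, delivery_dir: Path) -> list[str]:
    """Return a list of problems found: missing files and digest mismatches."""
    problems = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, _, rel = line.partition("  ")
        target = delivery_dir / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

An empty return value means every file listed in the manifest arrived intact; it says nothing about crawl completeness, which still needs the human QA discussed below.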
I'm interested in what organisations that outsource their web archiving are doing with regards to QA, and also in what the web archiving experts who do their own web archiving (and are familiar with the common web archiving pitfalls) think these organisations should be doing as part of their QA activities.
asked Nov 25, 2015 by prwheatley (310 points)
I'd also be interested to know if anyone other than the Internet Archive is offering outsourced web archiving services.
Internet Memory Foundation, Hanzo Archives and Aleph Archives spring to mind.
@euanc Also MirrorWeb

1 Answer

+2 votes
Great question! At the Parliamentary Archives we have done exactly as you describe, outsourcing the majority of our web archive capability to the Internet Memory Foundation. The justification is simple: we wouldn't have the resources or infrastructure to do this in-house. The Houses of Parliament web estate forms a vital corporate record of the organisation's activity which should be preserved and made accessible.
To set our Quality Assurance practices into context, it will help to give a very brief introduction to our web archive process:
  1. Archive staff sign off the Seed List (usually around 30 URLs), then IMF carries out the crawls. There is also the option for Archives staff to set off crawls themselves using ArchiveTheNet (AtN).
  2. An initial QA stage is then carried out by IMF to identify and rectify immediate issues (ensuring crawl parameters have been met, noting technical limitations, etc.).
  3. Using project management software (JIRA), IMF then hands over the captured URLs for the second QA stage, which is carried out by Archive staff. We assign one person to lead the QA and distribute QA tasks across multiple staff.
  4. Any further QA issues are reported through JIRA and resolved by IMF.
  5. Finally, captured crawls are delivered by IMF to the Archives for ingest into the Parliament’s digital repository (Preservica Enterprise Edition).
As you can see, QA occurs at multiple levels: the initial QA carried out by IMF, followed by the second QA undertaken by the Archives. I would argue there is a final level of QA during the ingest workflow, as the ARC/WARC files are characterised, validated and virus scanned. Setting aside the ingest workflow, the bulk of our QA is a manual human process which involves a significant amount of time checking the captured data. We also have a Service Level Agreement (SLA) with IMF, which is vital in ensuring that the required service and QA standards are met. For obvious reasons, I can't go into too much detail about what the SLA contains!
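As one illustration of the kind of check that can happen at ingest, the stdlib sketch below confirms that a delivered WARC file decompresses cleanly and counts its record headers. This is only a stand-in for the characterisation and validation a real ingest workflow would perform with dedicated tools (for example JHOVE or warcio); the function name and the idea of using it at this step are our own assumptions, not a description of any particular repository's workflow.

```python
"""Sketch: a minimal ingest-time sanity check on a delivered WARC file."""
import gzip


def count_warc_records(path: str) -> int:
    """Count 'WARC/x.y' record headers in a gzipped WARC file.

    Raises OSError/EOFError if the gzip stream is corrupt, which is
    itself a useful (if crude) validation signal. Python's gzip module
    reads the multi-member streams typical of WARC files transparently.
    """
    count = 0
    with gzip.open(path, "rb") as f:
        for line in f:
            if line.startswith(b"WARC/"):
                count += 1
    return count
```

A record count of zero, or a decompression error, would flag a delivery for investigation before ingest proceeds.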
I am aware that our process could be improved, and that our QA steps rely upon manual intervention. I’d be happy to discuss in more detail the QA steps I have outlined and potential areas of improvement. I’d also be particularly interested to hear from other web archive experts with suggestions as to what we should be doing as part of our QA activities.
The Parliamentary Archives pages for the web archive can be found here: http://www.parliament.uk/business/publications/parliamentary-archives/web-archive/
The actual web archive portal (hosted by IMF) is here: http://webarchive.parliament.uk/
~Chris Fryer, Senior Digital Archivist, Parliamentary Archives
answered Nov 25, 2015 by fryerc (230 points)
Thanks Chris! Could you elaborate slightly on step 3? Are you checking a sample of URLs or checking for completeness in some way?
We check and QA all URLs that have been crawled. This is feasible because we capture relatively little compared with national web archive initiatives. We check for completeness, digging into the functionality of the site as much as possible, and we check that external third-party content has not been captured. We also come across numerous technical limitations in crawler technology, which we report to IMF.

There is a balance between ensuring relatively simple technical issues are resolved (with additional resource if need be) and recognising that there are limits to what can be achieved; this goes for our whole QA process.
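Part of that manual check, flagging external third-party content, could be partly automated. The sketch below assumes a plain list of captured URLs (for example exported from the supplier's crawl log or a CDX index) and the signed-off seed list, and treats a URL as in scope only if its host is a seed host or a subdomain of one. That is a deliberate simplification of real crawl scoping rules, and the function name is ours.

```python
"""Sketch: flag captured URLs whose host falls outside the seed scope."""
from urllib.parse import urlsplit


def out_of_scope(captured: list[str], seeds: list[str]) -> list[str]:
    """Return captured URLs whose host is not a seed host or subdomain of one."""
    seed_hosts = {urlsplit(s).hostname for s in seeds}
    flagged = []
    for url in captured:
        host = urlsplit(url).hostname or ""
        if not any(host == s or host.endswith("." + s)
                   for s in seed_hosts if s):
            flagged.append(url)
    return flagged
```

A human reviewer would still decide whether each flagged URL is genuinely unwanted third-party content (e.g. an advertising script) or embedded material the archive actually wants.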

It's important to state that these web archive resources are catalogued too: http://www.portcullis.parliament.uk/CalmView/Record.aspx?src=CalmView.Catalog&id=PARL%2fWEB&pos=1