
Tool(s) for extracting administrative metadata from WARC?

+3 votes
I'm researching best practices for administrative metadata--preservation metadata in particular--for web archives. So far I've found some very helpful rationales and schemas, all PREMIS-in-METS-based, but I haven't seen anything that directly explains how one gets from point A to point B. That is, I haven't seen any descriptions of the steps or the tools used to actually extract this type of metadata (or as much as can reasonably be gleaned) from WARCs. Have you? Have you extracted administrative metadata from your WARCs and lived to tell the tale? I'd love to know what you used and what you thought of the process.
asked Jan 29, 2015 by blumenthal (230 points)

1 Answer

+2 votes
I'm still researching this issue, but in case this topic is of interest to you as well, here's an update:

Every DROID-, FITS-, and/or JHOVE-based workflow that I have seen thus far can extract PREMIS-conformant core metadata from the WARC container file for eventual packaging with a METS manifest. This includes processing through Archivematica and, presumably, Artefactual's storage-integrated product ArchivesDirect, but we'll see about that one when it's fully released.
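To make "core metadata from the container file" concrete, here is a minimal Python sketch of the kind of values those tools report for a single WARC: size, fixity, and a rough format identification. It is not how DROID, FITS, or JHOVE work internally. The function name `describe_container` and the dictionary keys (loosely modeled on PREMIS semantic unit names) are my own assumptions for illustration.

```python
import gzip
import hashlib
import os
import tempfile


def describe_container(path):
    """Return PREMIS-style characteristics for one WARC container file.

    The key names are illustrative assumptions, not output from any
    characterization tool.
    """
    # Fixity: hash the file exactly as stored on disk.
    sha256 = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha256.update(chunk)

    # Gzip files start with the magic bytes 1f 8b.
    with open(path, "rb") as fh:
        gzipped = fh.read(2) == b"\x1f\x8b"

    # A WARC file opens with a version line such as "WARC/1.0".
    opener = gzip.open if gzipped else open
    with opener(path, "rb") as fh:
        first_line = fh.readline().strip().decode("ascii", "replace")
    is_warc = first_line.startswith("WARC/")

    return {
        "size": os.path.getsize(path),
        "messageDigestAlgorithm": "SHA-256",
        "messageDigest": sha256.hexdigest(),
        "formatName": "WARC" if is_warc else "unknown",
        "formatVersion": first_line.partition("/")[2] if is_warc else "",
        "compressed": gzipped,
    }


if __name__ == "__main__":
    # Tiny synthetic WARC so the sketch runs without a real crawl file.
    sample = b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n"
    with tempfile.NamedTemporaryFile(delete=False, suffix=".warc") as tmp:
        tmp.write(sample)
    print(describe_container(tmp.name))
    os.remove(tmp.name)
```

Everything here treats the WARC as one opaque object, which is why container-level extraction is the easy case.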

Fewer options seem to be available for extracting and packaging the same metadata at the internal file level. Preservica claims to have this extra level of functionality, though they've been reluctant to explain in detail how their (proprietary) system works. I'll test it in the near future.
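For a sense of what internal-file-level extraction involves, here is a rough Python sketch that walks the records inside an uncompressed WARC and pulls out a few per-resource values. It says nothing about how Preservica or any other product does this. The function names and output keys are hypothetical, and the parser assumes well-formed records with header names spelled exactly as in the WARC specification; a real workflow would want a maintained parser (the JWAT tools, for example) that also handles gzip and malformed input.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {"version": line.strip().decode("ascii")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the record header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the exact size of the record block in bytes.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def http_content_type(block):
    """Return the Content-Type of the HTTP response stored in a record block."""
    head = block.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in head.split("\r\n")[1:]:  # skip the HTTP status line
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            return value.strip()
    return None


def summarize(stream):
    """Collect per-resource metadata from every 'response' record."""
    rows = []
    for headers, block in iter_warc_records(stream):
        if headers.get("WARC-Type") != "response":
            continue
        rows.append({
            "uri": headers.get("WARC-Target-URI"),
            "date": headers.get("WARC-Date"),
            "payload_digest": headers.get("WARC-Payload-Digest"),
            "http_content_type": http_content_type(block),
            "block_size": len(block),
        })
    return rows
```

Calling `summarize(open("example.warc", "rb"))` on an uncompressed file (the filename is a placeholder) would return one dictionary per captured resource, which is roughly the raw material a PREMIS object entry per internal file would need.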

For the moment, the Bibliothèque nationale de France's SPAR Project repository seems to represent the only open(-ish?) source alternative. It runs principally on JHOVE2, but may transition to JWAT-based tools in the near future, and it provides a METS manifest that includes structural mapping tailored to container file formats. This is laid out fairly simply in a 2011 iPRES paper here: http://web.archive.org/web/20150325135508/https://halshs.archives-ouvertes.fr/halshs-00868729/document
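To illustrate what a structural map "tailored to container file formats" could look like, here is a small Python sketch that builds a METS structMap with one div for the WARC and one nested div per internal record. This is not SPAR's actual METS profile. The `TYPE` values ("physical", "container", "record") and the function name `build_struct_map` are assumptions of mine; only the METS namespace and the `structMap`/`div` element and attribute names come from the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def build_struct_map(container_name, record_uris):
    """Return a mets element whose structMap nests records under a container."""
    def tag(name):
        return f"{{{METS_NS}}}{name}"

    mets = ET.Element(tag("mets"))
    struct_map = ET.SubElement(mets, tag("structMap"), TYPE="physical")
    # Outer div: the container file as a whole (hypothetical TYPE value).
    container = ET.SubElement(
        struct_map, tag("div"), TYPE="container", LABEL=container_name
    )
    # Inner divs: one per internal record, in capture order.
    for order, uri in enumerate(record_uris, start=1):
        ET.SubElement(
            container, tag("div"), TYPE="record", ORDER=str(order), LABEL=uri
        )
    return mets


if __name__ == "__main__":
    doc = build_struct_map(
        "crawl.warc", ["http://example.org/", "http://example.org/style.css"]
    )
    print(ET.tostring(doc, encoding="unicode"))
```

A full manifest would also need a fileSec and amdSec so that each div can point at its PREMIS metadata; the sketch only shows the nesting idea.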

It looks like developing this level of extraction into workflows here in the U.S. would take some serious funding and management, but I don't yet see any reason why it couldn't work!
answered Mar 25, 2015 by blumenthal (230 points)