
Strategy for preserving scanned files outside of repository

+1 vote
We have undertaken various digitisation projects over the past couple of years; each item digitised has resulted in output in TIFF, JPEG and PDF formats, with up to 600 or so TIFFs/JPEGs for some items.

We have loaded the PDF files into our research repository. However, our repository software does not cope well with more than a dozen or so files per record. So we need to find an alternative mechanism for storing at least the TIFF files (the JPEGs and PDFs can always be regenerated from the TIFFs). Currently, we have 3 copies of each of the files stored on our local network (as well as whatever system backups occur), until we can figure out a more permanent preservation strategy.

We're now at the point where we want to figure out that more permanent strategy. The PDF files in the repository will remain the primary access point. Can anyone suggest a suitable approach/software for the preservation masters?
asked Jul 1, 2015 by bernieh (550 points)

3 Answers

+2 votes

I am assuming that each "item" can logically be thought of as a unit, and the large number of files in some of them is merely an artefact of your digitization process.

For such a situation, I would consider using Zip without compression as a container format.

Doing so will allow you to add each item to your repository as a single archive file, which can be retrieved and extracted when the need arises.

Not using compression (in effect, using Zip only as a logical container) means that even if the Zip file header and directory somehow get corrupted, standard recovery techniques used on damaged storage media should be able to pick out the respective files stored within the archive with little difficulty. The fact that the archive (unlike storage media) will have no fragmentation should also help if such recovery is ever necessary.
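As a rough illustration of the "Zip as a logical container" idea, here is a minimal sketch using Python's standard zipfile module. The function name pack_item and its arguments are made up for this example; ZIP_STORED and allowZip64 are the module's own options. It assumes the archive is written somewhere outside the item folder, otherwise the archive would try to include itself.

```python
import zipfile
from pathlib import Path


def pack_item(item_dir, archive_path):
    """Store every file under item_dir in an uncompressed Zip archive.

    pack_item is a hypothetical helper name, not part of any tool
    mentioned in this thread.
    """
    item_dir = Path(item_dir)
    # ZIP_STORED writes the file bytes verbatim (no compression).
    # allowZip64=True lets the archive use ZIP64 fields when it
    # grows past the classic 4 GiB ceiling.
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_STORED,
                         allowZip64=True) as archive:
        for path in sorted(item_dir.rglob("*")):
            if path.is_file():
                # Keep names relative to the item folder inside the archive.
                archive.write(path, arcname=path.relative_to(item_dir).as_posix())
    return archive_path
```

Calling pack_item("scans/item_0001", "archives/item_0001.zip") (both paths are placeholders) would give one archive per item, which is the unit you would then deposit.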

Zip is in wide use in many areas, including being used as a container format e.g. in ISO 26300 (OpenDocument) and Microsoft Office Open XML (as well as many others). Support on modern systems, including Windows, OS X and Linux, is basically ubiquitous. Both open source and proprietary implementations exist, and the format itself is publicly documented.

Zip files offer basic fixity verification through the stored checksums, but no real recovery mechanism.
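To make the fixity point concrete, a small sketch of checking the stored CRC-32 values with Python's standard zipfile module follows. The helper name first_corrupt_member is invented for this example; testzip() is the module's own method. As noted above, this only detects damage and does nothing to repair it.

```python
import zipfile


def first_corrupt_member(archive_path):
    """Return the name of the first member that fails its CRC-32 check, or None."""
    with zipfile.ZipFile(archive_path) as archive:
        # testzip() reads every member and compares the data against
        # the CRC-32 recorded in the archive.
        return archive.testzip()
```

A return value of None means every member matched its stored checksum.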

Standard Zip files have a few limitations that you might run into in a situation like what you are describing. Perhaps most importantly, the maximum size of a Zip archive is capped at 4 GiB. The ZIP64 extension appears to resolve that deficiency by raising the size fields from 32 to 64 bits, but according to Wikipedia, support for ZIP64 is not as ubiquitous. If you are able to set software requirements for the systems that will handle the Zip files themselves, standardizing on ZIP64 might be a reasonable option even if the larger file size capability is not yet needed.

answered Jul 1, 2015 by michael (400 points)
edited Jul 2, 2015 by michael
TAR (https://en.wikipedia.org/wiki/Tar_(computing)) is another option if you have more than 4 GB, but you would also lose the advantage that many people know what to do with a Zip file but fewer know what a TAR is.
@nkrabben There's also the ubiquity argument. Zip files work fine on Unix-like systems (Linux and OS X being the variants you're most likely to encounter on non-server systems) but Tar files are basically unknown outside of the Unix world. Zip files are supported natively by Windows, and ZIP64 is apparently supported natively since Vista. That's a lot of installed base.
Agreed, although TARs have more native support than ZIP64 on OS X, iOS, and Android. Of course, I'm not sure what the use case for downloading TIFFs on a smartphone is.

A kludgy option might be to create multiple < 4 GB zips if necessary.
Thanks, guys. There are 87 GB worth of TIFFs, 4 GB worth of JPGs and 630 MB of PDFs just for one record alone. On a quick estimate, it looks like around 50 records out of ~200 will exceed the 4 GiB limit.
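For anyone wanting to replace the quick estimate with an exact count, a minimal sketch follows. It assumes each record lives in its own folder under a common root, which may not match the actual layout; records_over_limit and ZIP32_LIMIT are names invented for this example.

```python
from pathlib import Path

# 4 GiB, the size ceiling of classic (non-ZIP64) Zip archives.
ZIP32_LIMIT = 4 * 1024**3


def records_over_limit(root, limit=ZIP32_LIMIT):
    """Map each record folder under root that exceeds limit to its total size in bytes."""
    oversized = {}
    for record in sorted(Path(root).iterdir()):
        if record.is_dir():
            # Sum the sizes of all files in the record, including subfolders.
            total = sum(f.stat().st_size for f in record.rglob("*") if f.is_file())
            if total > limit:
                oversized[record.name] = total
    return oversized
```

The length of the returned dictionary is the number of records that would need ZIP64, TAR, or splitting.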
+1 vote
A decent short term strategy would be to place all of your files in bags so that you can run fixity checks while you are deciding on a longer term solution.

Bagger is a graphical interface for creating bags (https://github.com/LibraryOfCongress/bagger), but there are other utilities such as bagit-python (https://github.com/LibraryOfCongress/bagit-python) if you're comfortable on the command line.
answered Jul 2, 2015 by nkrabben (1,760 points)
Thanks, nkrabben. Do you know if there are file size limitations with the baggers? Our largest record contains ~91 GB of files.
Bags don't have a file size limit because they're basically a folder structure. A bag consists of a folder with a subdirectory for all data, a manifest of all files in the data subdirectory with their checksums, a file with metadata about the bag, and additional files.

When you verify a bag, a program reads the manifest and checks all the contents of the data directory against that manifest.
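To show roughly what that check involves, here is a standard-library sketch that writes and then verifies a SHA-256 manifest for a data subdirectory. It is not a full BagIt implementation (it skips the bag declaration and metadata files, and it does not flag extra files missing from the manifest), and the function names are invented for this example; in practice Bagger or bagit-python would do this for you.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large TIFFs never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir):
    """Write manifest-sha256.txt listing every file under bag_dir/data."""
    bag_dir = Path(bag_dir)
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if path.is_file():
            rel = path.relative_to(bag_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {rel}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def verify_manifest(bag_dir):
    """Return the listed paths that are missing or no longer match their checksum."""
    bag_dir = Path(bag_dir)
    failures = []
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line is "<checksum>  <relative path>".
        expected, rel = line.split(maxsplit=1)
        target = bag_dir / rel
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel)
    return failures
```

An empty list from verify_manifest means every listed file is present and unchanged; because files are hashed one at a time, a 91 GB record is slow to check but not a size problem.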
Thanks for a very quick response, nkrabben. Will look into bagger.
0 votes
Here's our final solution: putting the files into Bagger/BagIt bags, then storing in our Research Data Store.

We also set up a mechanism to verify all the checksums every two months, and added (non-public) links to the bags from our catalogue and research repository.

Thanks, everyone, for your help :-)
answered Dec 6, 2015 by bernieh (550 points)