
What tools do you use for the ongoing monitoring of checksums?

+4 votes
6,645 views
I am currently working on choosing and implementing a tool to continuously monitor checksums for our digital holdings (digitized content and born-digital materials) across our entire institution.  We have many terabytes of data on our preservation server (and want to plan ahead for the near future, when we break a petabyte) that we want to continuously monitor for bit-rot, and we're wondering about the experiences of other institutions.
 
What do you use to monitor the fixity of your preservation masters on an ongoing basis?  What (generally speaking) is the workflow like to make sure that checksums happen as often as they're supposed to?  
 
Bonus points if you know of a tool that is automated (or at least schedulable), and super-bonus points if it could work on our server and have multiple user accounts for different content administrators.
asked Sep 9, 2014 by sarah.barsness (1,250 points)
There's a list of potential tools on the COPTR wiki, but there's not much information on user experiences with those tools yet. http://coptr.digipres.org/Category:Fixity

4 Answers

+3 votes

At MoMA we store all digital collections materials in the BagIt format, and worked with Artefactual Systems to develop a fixity checker that runs on the same server that hosts our Archivematica instance. It checks all AIPs found in the Archivematica Storage Service for validity (which, if you're familiar with the BagIt format, you'll know covers more than file fixity, e.g. files that shouldn't be there, or files that are missing), and emails any failures to Storage Service admin accounts.
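For illustration, here is a minimal sketch of this style of check using the open-source bagit-python library. The storage root, e-mail addresses, and SMTP setup below are assumptions; the real checker works against the Archivematica Storage Service rather than a bare directory.

```python
# Minimal sketch: validate every bag under a storage root and report
# failures by email. Paths and addresses are hypothetical assumptions.
import smtplib
from email.message import EmailMessage
from pathlib import Path

import bagit  # pip install bagit

STORAGE_ROOT = Path("/mnt/aipstore")        # assumption: bags live here
ADMIN_EMAIL = "storage-admin@example.org"   # assumption

failures = []
for bag_dir in STORAGE_ROOT.iterdir():
    if not (bag_dir / "bagit.txt").exists():
        continue  # not a bag
    try:
        # validate() checks payload fixity plus completeness
        # (missing files, files that shouldn't be there)
        bagit.Bag(str(bag_dir)).validate()
    except bagit.BagError as err:
        failures.append(f"{bag_dir.name}: {err}")

if failures:
    msg = EmailMessage()
    msg["Subject"] = f"Fixity check: {len(failures)} bag(s) failed"
    msg["From"] = "fixity@example.org"
    msg["To"] = ADMIN_EMAIL
    msg.set_content("\n".join(failures))
    with smtplib.SMTP("localhost") as smtp:  # assumption: local MTA
        smtp.send_message(msg)
```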

This fixity checker sends metadata about its activity to our browser-based repository management system (also built by the lovely people at Artefactual Systems), so that we can index and retain a permanent record of all fixity checks, allowing reporting in various ways. In the event of a fixity error, our management app allows users to recover the AIP from a manually retrieved offline backup (i.e. an offsite LTO copy).

[Screenshot: fixity audit dashboard widget (the failures shown are fake!)]

[Screenshot: part of a full report]

This system is brand new, so I don't have much in the way of lessons learned at scale. All of this is being released by Artefactual Systems (free and open source, naturally).

There are, however, some major changes we are making to our storage system this year due to an incoming massive influx of collections materials in the area of 2PB. We are moving from a fully disk-based storage system to a hierarchical system that is primarily LTO with a fast disk cache. This will give us a dramatically lower cost per TB and lower energy bills, as well as a built-in method for making a third backup for off-site offline storage.

This new system is a managed service, and it actually provides file-level fixity checking that can be queried via an API. We will be working to get our fixity checker app talking to this API so that we can get the managed service's fixity-check metadata into our management app in the same way we get our "own" fixity metadata in. We will probably rely solely on the managed service's fixity checks, since its logic for ensuring things get checked is more sensitive to the fact that the materials are on tape rather than disk.

If you are comfortable with moving away from file-level fixity checks, there are some proprietary tape formats (i.e. T10000C) with built-in fixity metadata that allow fixity to be checked without reading the full bit streams – though that's not what we're doing. We're still big on file-level checks (for now).

The kind of fixity Andy is talking about with HDFS is in my mind a given – or at least should be at any institution where you have an IT department. If your storage appliance is not doing some method of integrity checking (at the filesystem level, or at the appliance level via RAID, etc.), then you're in big trouble and have bigger fish to fry than human-readable fixity checks. This is a basic requirement of any enterprise-ish storage appliance, not just for "archival" applications.
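The managed service's API isn't public, so purely as an illustration of the integration pattern, here is a hedged sketch of polling a storage service's fixity endpoint and forwarding results to a management app. The endpoints, token, and response fields are entirely hypothetical.

```python
# Hypothetical sketch: pull fixity-check results from a managed storage
# service's API and forward them to a repository management app. Every
# URL, header, and payload field here is invented for illustration.
import requests

STORAGE_API = "https://storage.example.org/api/v1/fixity"  # hypothetical
MGMT_API = "https://repo.example.org/api/fixity-events"    # hypothetical
HEADERS = {"Authorization": "Bearer <token>"}              # hypothetical

resp = requests.get(STORAGE_API, headers=HEADERS,
                    params={"since": "2014-09-01"})
resp.raise_for_status()

for check in resp.json()["checks"]:          # hypothetical response shape
    event = {
        "aip_uuid": check["object_id"],
        "checked_at": check["timestamp"],
        "success": check["status"] == "ok",
        "source": "managed-storage-service",
    }
    # record the external check alongside our "own" fixity metadata
    requests.post(MGMT_API, headers=HEADERS, json=event).raise_for_status()
```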

The AVPreserve Fixity tool serves a great purpose, even in our situation. I'm thinking about using it to monitor our pre-ingest staging area, where things live prior to ingest into our repository. The stats in the report are pretty great, not just for monitoring fixity but for keeping a handle on what's going in and out of a file share.

answered Sep 10, 2014 by benfinoradin (460 points)
edited Sep 10, 2014 by benfinoradin
+2 votes

We've developed a checksum monitoring tool called Fixity (http://www.avpreserve.com/tools/fixity/) that will handle many of your requirements.  Fixity will:

- scan your directories on a schedule and verify checksums against its stored baseline
- monitor file attendance, flagging new, moved, renamed, and deleted files
- send email alert reports to configured recipients

answered Sep 9, 2014 by alexanderduryee (800 points)
We've actually been looking quite a bit at Fixity, and it seems like a really awesome tool!  Since I've got you here, I'm going to ply you with a couple questions:
1. Do you know if/how we could install Fixity directly on a server?  We've tried doing local installations on computers with access to the server, but it gets very slow because of the size of our network (among other things).
2. Are there any plans to add quarterly and annual scanning for scheduling options?
If your servers have some sort of client access (e.g., they're running Windows/OSX and you can access them via keyboard/mouse and monitor), then you can run Fixity on them.  If they're headless machines (no keyboard/mouse/monitor attached), which most storage servers are, then you won't be able to run Fixity on them easily.  Fixity also does not have a Linux version.

If you need to run something on your storage environment directly, you may want to develop requirements around some of Fixity's features (email alerts, scheduled scans, file attendance) for an in-house system.  This will give you the features that you want in a checksum application while also running directly on your storage environment.

We don't plan to add quarterly/annual scanning to Fixity right now, although it may be implemented at a future date.
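As a starting point for the kind of in-house, server-side system described above, here is a minimal sketch covering the same feature set (checksums, file attendance, email alerts). All paths and addresses are assumptions, and this is a sketch of the general technique, not AVPreserve's implementation.

```python
# Minimal sketch of an in-house fixity monitor: compares the current
# state of a storage root against a saved manifest, flags checksum
# mismatches and file-attendance changes, and emails a report.
# Intended to be run from cron. Paths and addresses are assumptions.
import hashlib
import json
import smtplib
from email.message import EmailMessage
from pathlib import Path

ROOT = Path("/mnt/preservation")                 # assumption
MANIFEST = Path("/var/lib/fixity/manifest.json") # assumption
ADMIN = "curators@example.org"                   # assumption

def sha256(path):
    """Hash a file in 1 MB chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

current = {str(p.relative_to(ROOT)): sha256(p)
           for p in ROOT.rglob("*") if p.is_file()}
previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

problems = []
problems += [f"MISSING  {f}" for f in previous.keys() - current.keys()]
problems += [f"NEW      {f}" for f in current.keys() - previous.keys()]
problems += [f"CHANGED  {f}" for f in previous.keys() & current.keys()
             if previous[f] != current[f]]

MANIFEST.parent.mkdir(parents=True, exist_ok=True)
MANIFEST.write_text(json.dumps(current))

if problems:
    msg = EmailMessage()
    msg["Subject"] = f"Fixity report: {len(problems)} issue(s)"
    msg["From"], msg["To"] = "fixity@example.org", ADMIN
    msg.set_content("\n".join(problems))
    with smtplib.SMTP("localhost") as smtp:  # assumption: local MTA
        smtp.send_message(msg)
```

With this approach, quarterly or annual scheduling is just a cron entry (e.g. `0 2 1 */3 *` runs it at 02:00 on the first day of every third month).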
+2 votes

As you head up towards a petabyte, you might want to consider switching to HDFS. It's very cost effective, highly resilient against disk failures, and has been proven to work at scale. It has fixity checks built in (IIRC, it defaults to checking every data block once every three weeks), and if data is found to be damaged or lost it automatically heals itself using the checksum metadata and the other copies spread over the cluster (it defaults to three copies, across at least two distinct racks).
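If you go this route and still want file-level fixity records of your own, HDFS also exposes per-file checksums over its WebHDFS REST API. A minimal sketch, where the NameNode host, port, and file path are assumptions:

```python
# Minimal sketch: fetch HDFS's own per-file checksum via the WebHDFS
# REST API (op=GETFILECHECKSUM). Host, port, and path are assumptions.
# Note: the returned value is HDFS's composite MD5-of-MD5-of-CRC32
# checksum, so it's comparable across time (same cluster settings)
# but not to a plain md5sum of the file.
import requests

NAMENODE = "http://namenode.example.org:50070"  # assumption (Hadoop 2 default)
path = "/preservation/masters/item-0001.tif"    # assumption

resp = requests.get(f"{NAMENODE}/webhdfs/v1{path}",
                    params={"op": "GETFILECHECKSUM"})
resp.raise_for_status()

cs = resp.json()["FileChecksum"]
print(cs["algorithm"], cs["bytes"])
```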

Any filesystem that goes up to a petabyte will probably include something like this, as HDD failures become fairly frequent when you have hundreds of disks. We have about a petabyte of raw storage, and experience a couple of disk failures per month.

Of course, filesystem-level checks only protect against storage rot and hardware failure. If you want to ensure data isn't accidentally deleted, you'll need to monitor things at the workflow and/or collection level. Tools like Fixity can help here.

answered Sep 9, 2014 by anjackson (2,950 points)
0 votes
After I asked you all what you use to monitor checksums about two weeks ago, responses started flowing in, along with requests that I share all of the suggestions with everybody else.  I'm sure the intervening period has been one of anxiety for you all, but you can release a collective sigh: the results are in!
 
By far the number one suggestion was AVPreserve's Fixity tool, which really does meet most of the needs I brought up in my initial posting.  There were a lot of other suggestions, however, which can be split into three categories (full-on repository software, checksumming programs, and places to look for lists of tools).  Without further ado, here's the list!
 

1. Institutions looking to develop a full digital preservation program -- most digital repository software handles ongoing/scheduled checksum verification for master files, but these systems in particular were suggested:

        a. Preservica
        b. Archivematica (MoMA shared that they developed a custom fixity-checking tool that will become publicly available, as well)
        c. DSpace
        d. Chronopolis
        e. RODA
        f. Rosetta
 
2. Institutions looking to regularly monitor material outside a digital preservation program -- if you have a file store outside of a repository system but still want to regularly monitor checksums, these might be good options: 
        a. Fixity (Windows only for the time being)
        b. Manually generate/verify checksums (works best for small amounts of data; see the sketch at the end of this answer)
        c. HDFS
        d. Fixity_checker
        e. Fixi
        f. Make your own custom tool (Python)
 
3. Resources for finding tools (and more information about specific tools):
        a. POWRR project
        b. CAROL's evaluation of checksum programs (from my very own institution; offers overviews of several free checkers that do not have scheduling capabilities)
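
For option 2b above, here is a minimal sketch of what "manually generate/verify checksums" boils down to. The directory and manifest names are assumptions; the output uses the md5sum manifest format so it can be verified later with standard tooling.

```python
# Minimal sketch: write an md5sum-compatible manifest for a directory.
# Verify later with `md5sum -c manifest.md5` (or re-run and diff).
# Directory and manifest names are assumptions.
import hashlib
from pathlib import Path

root = Path("/data/masters")  # assumption: directory to manifest

def md5(path, bufsize=1 << 20):
    """Hash a file in chunks so large masters don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

with open("manifest.md5", "w") as out:
    for p in sorted(root.rglob("*")):
        if p.is_file():
            out.write(f"{md5(p)}  {p}\n")  # two spaces: md5sum format
```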
answered Sep 22, 2014 by sarah.barsness (1,250 points)