Do filesystem-based checksums add value in a digital preservation context?

+5 votes
Recently, file systems like ZFS and, to some extent, Btrfs have been making inroads onto quite affordable hardware. Unlike most file systems (Windows' NTFS, Linux's ext2/3/4, Mac OS X's HFS+, and so on), these file systems store a checksum for each data block and verify that checksum on every read, which to a very large degree guarantees that any successful read has not suffered from bit rot. In a redundant storage configuration, the checksums also act as an additional safeguard: when parity is used to reconstruct missing parts, they confirm that the reconstruction yields the originally stored data and not something else.

During a "scrub", each block of data is checked against its corresponding checksum, and possibly against redundant data as well. This ensures that all data is accurate and readable, and allows data to be reconstructed from redundant blocks if a problem is detected. These file systems cannot, however, protect against changes made through the operating system's "normal" channels (using the operating system's documented interfaces to open a file and write garbage to it is not protected against, for example).

Seeing that separate fixity data is likely to be necessary as well to ensure the integrity of the full archive, can file system-based checksums like these add value in the context of digital preservation?
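
To make "separate fixity data" concrete, here is a minimal Python sketch of file-level fixity kept outside the file system. It is an illustration only, not any particular repository's tooling: the function names (`file_sha256`, `build_manifest`, `verify_manifest`) and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(root)
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or file_sha256(target) != expected:
            failures.append(rel_path)
    return failures
```

A manifest built at ingest and stored away from the archive can later be passed to `verify_manifest`. Unlike a scrub, this also notices a file that was overwritten through the normal write path, because the recorded digest is not updated along with the file.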
asked Jul 15, 2014 by michael (400 points)

2 Answers

+3 votes
Yes and no - it depends what you're asking.

Yes, in the sense that more advanced filesystems can provide excellent protection against hardware/firmware issues. ZFS, for example, is highly resistant to flipped bits (due to its use of block-level checksums) and otherwise invisible write errors (due to copy-on-write verification). Having the extra layers of data integrity is always beneficial for preservation - you won't ever need to worry about phantom writes, for example. Given the option, using a filesystem with block-level checksums (ZFS, Btrfs, and an LTO enhancement whose name escapes me) is typically a very good idea for a preservation environment.

If your question is about using block-level checksums for preservation practices (e.g. storing them as metadata alongside the file- and asset-level checksums), then no, they cannot be used like that. Block-level checksums essentially say "the sequence of bytes within this physical space should be equal to this value", which is only of use if you care about very specific drive geometry. Unless you can somehow map the 512-byte or 4K blocks back onto your file (an extremely difficult task), you won't be able to do anything useful with them.
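
As a rough illustration of why per-block checksums make poor preservation metadata, the Python sketch below checksums the same bytes at two block sizes. It is a toy model, not how ZFS or Btrfs actually compute or store their checksums: SHA-256 over raw, uncompressed bytes and the helper names are assumptions chosen for the example.

```python
import hashlib


def block_digests(data, block_size):
    """Checksum each fixed-size block of data separately."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def streamed_digest(data, chunk_size):
    """Whole-content digest, fed to the hash in chunks of chunk_size."""
    digest = hashlib.sha256()
    for offset in range(0, len(data), chunk_size):
        digest.update(data[offset:offset + chunk_size])
    return digest.hexdigest()


# 16 KiB of sample content standing in for a file.
payload = bytes((i * 31 + i // 256) % 256 for i in range(16 * 1024))

small_blocks = block_digests(payload, 512)
large_blocks = block_digests(payload, 4096)

# The per-block lists differ in length and share no values, so they only
# make sense together with the block layout that produced them.
print(len(small_blocks), len(large_blocks))
print(set(small_blocks).isdisjoint(large_blocks))

# The whole-content digest does not depend on how the bytes were chunked.
print(streamed_digest(payload, 512) == streamed_digest(payload, 4096))
```

The per-block values are tied to how the storage layer happened to divide the data, while the file-level digest travels with the content wherever it is copied.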
answered Jul 15, 2014 by alexanderduryee (800 points)
By scrubbing the storage, or even by simply reading through an entire file and observing that no errors are thrown, you can be quite certain that what you just checked hasn't suffered from bit rot. While the individual block checksums are less likely to be useful *on their own*, can't it be argued that such checking adds value? (And the file system by its very design already stores them alongside file- and asset-level checksums; you just can't readily access them because, as you say, they don't carry much meaning in any larger context. Even more so if you also employ file-system-level compression; at least ZFS checksums are computed after compression.) However, I don't see how block-level checksums are drive-geometry-specific; if anything, modern storage components (HDDs and SSDs) are very good at abstracting away the actual physical geometry.
+1 vote

I second Alex's answer (yes, they do add value but should not be used alone), but will add one more reason not to rely only on filesystem-based checksums:

As this 2007 report by the IT Policy Compliance Group shows, user error is one of the biggest causes of data loss. For this reason, it is good practice to hold your data in at least two distinct systems, with different processes in place to manage them. When you do this, you will also need some way of synchronising the data between the two (or more) distinct systems. Filesystem-based checksums will not help with that very much.
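
One way to picture the synchronisation check is to compare file-level checksum manifests exported from each system. The Python sketch below assumes each manifest is a plain dictionary of relative path to hex digest; that layout, the function name `compare_manifests`, and the sample values are hypothetical and only there for illustration.

```python
def compare_manifests(primary, secondary):
    """Report paths missing from either copy and paths whose digests differ."""
    primary_paths = set(primary)
    secondary_paths = set(secondary)
    return {
        "only_in_primary": sorted(primary_paths - secondary_paths),
        "only_in_secondary": sorted(secondary_paths - primary_paths),
        "mismatched": sorted(
            path
            for path in primary_paths & secondary_paths
            if primary[path] != secondary[path]
        ),
    }


# Hypothetical manifests from two independently managed copies.
copy_a = {"reports/2007.pdf": "aa11", "images/scan1.tif": "bb22"}
copy_b = {"reports/2007.pdf": "aa11", "images/scan1.tif": "cc33", "notes.txt": "dd44"}

print(compare_manifests(copy_a, copy_b))
```

A mismatch only says that the two copies disagree. Deciding which one is right still needs the fixity value recorded at ingest, which is exactly the information block-level checksums do not carry across systems.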


answered Jul 16, 2014 by euanc (3,200 points)
edited Jul 16, 2014 by euanc