When to redact AIPs vs DIPs in digital archives workflows?

+1 vote

Tools like BitCurator are integrating processes for redacting information from disk images. So, not just identifying information that you might want to use to make appraisal decisions about what to keep, but also the ability to directly redact information from files. I know some folks are planning to use those tools to automatically locate personal information and then redact whole files or information inside files.

Given that we have (at least in theory) Archival Information Packages (AIPs) and Dissemination Information Packages (DIPs) that we could be redacting, when should archivists redact the archival copy, and when do we want to redact only the access materials? What if one has information that is sensitive now but would be useful and non-sensitive in 50 years? Or is the threat of maintaining sensitive information and multiple copies of materials too significant to warrant such an approach? Or is this something that is going to depend on the particular issues in a collection? If the answer is "it depends," what is it that it depends on?

asked Jun 24, 2014 by tjowens (2,360 points)
Ideally the archivist would never have to redact the archival copy. Disk images are best thought of as instances of the original disk--that is, we keep the disk image of a 3.5" floppy disk because we don't want to have to read the data from a floppy disk every time we want to access it and because we're concerned that we'll lose the data due to bit rot, etc. Remember that a disk image that is the full bitstream from the original media isn't a copy; it's the same object moved to a more sustainable medium. To redact a disk image, then, would be like redacting a paper archive by scratching words out on the original with a Sharpie.

That said, there may be legal constraints that necessitate the redaction of the original disk image, but realize that once you start doing that, you lose the ability to think of the disk image as the original object. There are, of course, other reasons to use disk images, such as maintaining the original file tree structure, preserving file system metadata, and being able to share the disk image as a coherent object, not just a directory of files (particularly important if the original media had any particular significance). But something is certainly lost when the bitstream on the original media is no longer the bitstream in your repository. So, to answer your question, in an ideal scenario I would recommend that all redaction happen at the DIP level, with the original disk image kept intact in a dark archive.
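To make the DIP-level approach a bit more concrete, here is a minimal Python sketch of one way it could look. It assumes the AIP's files have already been extracted to a directory, and it withholds whole files rather than redacting inside them. The function name `build_redacted_dip`, the `withheld` set, and the directory layout are all made up for illustration and don't come from BitCurator or any other tool. The only point is that the access copy is derived from the archival copy, and the archival copy is checksummed before and after to confirm it wasn't touched.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_redacted_dip(aip_dir: Path, dip_dir: Path, withheld: set[str]) -> list[str]:
    """Copy files from aip_dir into dip_dir, skipping the paths in withheld.

    withheld holds POSIX-style paths relative to aip_dir (a hypothetical
    convention for this sketch). The AIP is only ever read; its checksums
    are compared before and after as a simple fixity check.
    Returns the relative paths that were copied into the DIP.
    """
    files = sorted(p for p in aip_dir.rglob("*") if p.is_file())
    before = {p: sha256_of(p) for p in files}

    copied = []
    for src in files:
        rel = src.relative_to(aip_dir).as_posix()
        if rel in withheld:
            continue  # sensitive for now: stays only in the dark archive
        dest = dip_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps where it can
        copied.append(rel)

    after = {p: sha256_of(p) for p in files}
    if before != after:
        raise RuntimeError("AIP changed while the DIP was being generated")
    return copied
```

Something like `build_redacted_dip(Path("aip"), Path("dip"), {"letters/b.txt"})` would then produce an access copy without that one file. When the material stops being sensitive, you shrink the `withheld` set and regenerate the DIP; the archival copy never changes.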

There are some significant technical challenges to disk image redaction to keep in mind. There are no tools that I know of that allow you to redact a disk image directly. Rather, you have to copy the files from the disk image into a working directory and then generate a new disk image from those files. The problem is that the redacted disk image isn't really a disk image at all; it's just a directory of files bundled up together in a disk image format. FTK does some sleight of hand and suggests that their AD1 disk image format is... well, a disk image, but really it's just a proprietary tarball, though one that can be mounted like a disk with the proper tools.

Things change, of course, so if anyone knows of a tool for redacting disk images that keeps them intact (minus the redacted info, of course), I would be very interested to hear about it.

1 Answer

+1 vote
I would be inclined to create a new, redacted AIP: different parts of the file may lose their sensitivity at different rates, and the very process of producing different redacted copies over time, progressively revealing more information, might well be of interest to future researchers. A slightly different case is where we receive a set of images from a company that has been licensed to do a digitisation and run a commercial service to provide access to the images. After a certain period of time the images may be resold to other commercial providers. Where there is closed data in those images, a further set of redacted images (opening more of the data) will be prepared from time to time, and to get access to the newer set, the secondary providers would have to purchase them again. If for some reason they opted to purchase only every other set of images or similar, a situation could arise where they discover an issue with some proportion of the earlier images and we need to resupply those images to them, making sure (in effect) that they don't also receive the additional information which they haven't purchased. We can only be sure of doing that if each set is a separate AIP.
answered Jun 24, 2014 by DavidUnderdown (790 points)