Digital Preservation Q&A - Recent questions

What is a good tool for normalizing image file formats?

Mon, 29 Jan 2024 15:17:36 +0000

We were going to use Adobe for file normalization, but the subscription is pricey. Instead of Adobe Audition, we can use Audacity for audio files. Is there anyone who can recommend a tool for image file normalization?

What is a good tool to create a graphical map of a web site (sitemap)?

Fri, 15 Oct 2021 12:48:26 +0000

I'm looking for a tool that crawls a website, a local file tree, or a WARC and produces a graphical sitemap showing the links between these files.

Maybe it could export graphviz code to make into a diagram, or export an SVG...?

Most tools suggested when googling are commercial SEO tools or create Google's preferred sitemap.xml format, which is missing the linking information.
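A minimal sketch of the Graphviz export idea, assuming the pages are already available as HTML strings keyed by URL (crawling, reading a file tree, or unpacking a WARC is left out). `LinkCollector`, `link_graph_to_dot` and the two example pages are illustrative names and data, not an existing tool:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def link_graph_to_dot(pages):
    """Turn {url: html_text} into Graphviz DOT source, one edge per link.

    Only links that point at another page in `pages` are kept, so the
    graph shows the internal link structure.
    """
    edges = set()
    for url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment.
            target, _fragment = urldefrag(urljoin(url, href))
            if target in pages and target != url:
                edges.add((url, target))
    lines = ["digraph sitemap {"]
    for source, target in sorted(edges):
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical two-page site used only for illustration.
example_pages = {
    "http://example.org/index.html": '<a href="about.html">About</a>',
    "http://example.org/about.html": '<a href="/index.html#top">Home</a>',
}
dot_source = link_graph_to_dot(example_pages)
print(dot_source)
```

Saved to a file such as sitemap.dot, the output could then be rendered with Graphviz, e.g. dot -Tsvg sitemap.dot -o sitemap.svg.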

How to access a subset of Web Archive pages from a vintage web browser on vintage hardware?

Tue, 29 Sep 2020 09:57:50 +0000

I own a 1998 PowerBook and want to browse era-specific websites (1995-2000, roughly) from the Web Archive using Netscape Communicator 4.08. I'm doing so specifically to demonstrate the full capabilities of this specific browser, not simply to view the websites from any other browser or on any other machine. Furthermore, I need to show that these websites are, in fact, web-accessible, so simply saving the web pages on a modern computer, removing the offending incompatibilities, transferring the files to the PowerBook through other means and opening them locally in the browser doesn't cut it. Obviously, there are many issues with this, chief among them incompatibility with the modern HTTPS protocol now used by Web Archive and with the version of JavaScript it uses. The solutions I'm thinking of attempting are as follows:

1) Save the web pages I want from Web Archive, manually strip away the incompatible elements, host the pages on local servers on my network using protocols compatible with my vintage browser. This is my "brute force" option.

2) Write a transcoding proxy that automatically strips away the incompatible elements from the Web Archived versions of the web sites, so that any attempt to access the site by my vintage browser will return these transcoded versions.

Both options are time-consuming, and I was wondering whether this problem has already been tackled before I waste my time reinventing a solution.
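For what it's worth, the rewriting step that both options share could be sketched as below. This is a crude text-level rewrite, not a working proxy: it assumes that dropping script elements and downgrading https:// links is enough, which may well not hold for real archived pages. `simplify_for_old_browser` and the sample markup are made up for illustration:

```python
import re

# Matches a whole <script>...</script> element, across line breaks.
SCRIPT_BLOCK = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def simplify_for_old_browser(html):
    """Remove <script> elements and rewrite https:// links to http://."""
    without_scripts = SCRIPT_BLOCK.sub("", html)
    return without_scripts.replace("https://", "http://")


# Hypothetical page fragment, not taken from a real capture.
sample = (
    '<html><head><script src="https://example.org/toolbar.js"></script></head>'
    '<body><a href="https://example.org/page.html">link</a></body></html>'
)
simplified = simplify_for_old_browser(sample)
print(simplified)
```

A proxy or local server would apply this function to each fetched page before handing it to the old browser over plain HTTP.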

Tools to accession via file transfer

Tue, 21 Jan 2020 14:57:14 +0000

I'm wondering how to accession digital objects from donors with very little tech knowledge - who may also be resistant to downloading software for the transfer. Previously we have been using WeTransfer. Thanks!

Best practices for digitising tapes captured at a low resolution

Tue, 29 Oct 2019 23:20:50 +0000

What would be best practice for digitising VHS and miniDV tapes where the camera was set to capture at a low resolution? Our usual practice for video digitisation to date has been to capture in uncompressed AVI format, which results in very large files. However, I am wondering whether it is worth all the space required, given that the quality of the digitised files will be relatively poor anyway. I do understand that using a compressed format to save space may result in even poorer quality files. Thoughts?

Do you have any LTFS Tape Library Recommendations?

Tue, 11 Dec 2018 14:44:05 +0000

Hello all,

My institution currently uses LTO tape as part of our backup strategy for both business and collections data. We are both scaling up the number of tapes we create and are seeking to switch to the LTFS standard (https://en.wikipedia.org/wiki/Linear_Tape_File_System) for our collections data. We are looking for an LTFS-compliant tape library to help us increase our capacity, and I'm hoping that some of you might be able to share some details about what you're doing...

If you use tape for backups, do you use LTO? If you use LTO, do you use LTFS or a different standard/format?
What kind of tape drive/library set up do you have?
What software do you use to manage your hardware?

Best way to crawl website from localhost with wget, preserving all files in source directory

Tue, 03 Jul 2018 14:16:10 +0000

We recently recovered the contents of an old (2004) website from CD-ROM. I managed to get a local instance of the site running using the Apache web server; by editing the machine’s hosts file the site is available on that local machine from its original URL, which is http://www.nl-menu.nl (some more context can be found here).

I’m now looking into ways to crawl the contents of the site into a WARC, so we can ingest it into our web archive. After initial experiments with Heritrix failed, I moved on to wget. After some experimentation the following wget command appeared to work reasonably well:

Attempt 1: mirror from site root

wget --mirror \
    --page-requisites \
    --warc-file="nl-menu" \
    --warc-cdx \
    --output-file="nl-menu.log" \
    http://www.nl-menu.nl

However, closer inspection of the result showed that about 668 files in the source directory are missing in the resulting WARC file. The majority (90%) of these files are “orphan” resources that are not used/referenced by any of the HTML files in the crawl. However, the remaining 10% of missing files are resources that are referenced, in most cases through JavaScript variables. These aren’t picked up by wget, and therefore they end up missing in the WARC. So I am looking for a way to force wget to include these resources anyway.

Attempt 2: use --input-file

At first wget’s --input-file switch (which takes a list of URLs) looked like a good way to achieve this. I created a directory listing of all files that are part of the website, and then transformed them into corresponding URLs:

find /var/www/www.nl-menu.nl -type f \
    | sed -e 's/\/var\/www\//http:\/\//g' > urls.txt

Then I ran wget like this (note that I removed the --mirror option, as this apparently causes wget to do a recursive crawl for each single URL in the list, which takes forever):

wget --page-requisites \
    --warc-file="nl-menu" \
    --warc-cdx \
    --output-file="nl-menu.log" \
    --input-file=urls.txt

This results in a WARC file that contains all files from the source directory: perfect! But it does introduce a different problem: when I try to access the WARC using pywb, it turns out that the WARC is made up of 85864 individual captures (i.e. each file appears to be treated as an individual capture)! This makes rendering of the WARC impossible (loading the list of captures alone takes forever to begin with).

Attempt 3: include list of URLs in crawl

So as a last resort I created a list of all URLs in HTML format, and put that file in the source directory. Steps:

Create list of URLs in Markdown format (add “<” and “>” as prefix and suffix to each line):

find /var/www/www.nl-menu.nl -type f | sed -e 's/\/var\/www\//<http:\/\//; s/$/>\n/g' > urls.txt

Replace any whitespace characters with %20 to avoid malformed URLs:

sed -i 's/\ /%20/g' urls.txt

Convert URL list to HTML, which is placed at the root directory of the source dir:

sudo pandoc -s urls.txt -o /var/www/www.nl-menu.nl/urls.html

Then I ran wget, using the above URL list as crawl root:

wget --mirror \
    --page-requisites \
    --warc-file="nl-menu" \
    --warc-cdx \
    --output-file="nl-menu.log" \
    http://www.nl-menu.nl/urls.html

The resulting WARC contains all files that are in the source dir, and it can be accessed as one single capture in pywb. The obvious downside of this hack is that it compromises the integrity of the ‘original’ website by adding one (huge) HTML file that was not part of the original site to the WARC.

This makes me wonder if there is another, more elegant way to do this that I have overlooked here? Any suggestions welcome!

BTW I know this question is somewhat similar to this earlier one (http://qanda.digipres.org/337/there-web-archiving-tool-that-produces-warc-directory-tree), but option 2 as mentioned by @anjackson there looks similar to Attempt 2 in my case.
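As an aside, the path-to-URL step used in Attempts 2 and 3 could also be sketched in Python, which percent-encodes spaces and other unsafe characters in one go. This assumes the same /var/www/<host>/ layout as above; `path_to_url` and the example file names are hypothetical:

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def path_to_url(file_path, web_root="/var/www", scheme="http"):
    """Map /var/www/<host>/<path> to <scheme>://<host>/<path>, percent-encoded."""
    relative = PurePosixPath(file_path).relative_to(web_root)
    # quote() leaves "/" alone by default and encodes a space as %20.
    return f"{scheme}://{quote(str(relative))}"


paths = [
    "/var/www/www.nl-menu.nl/index.html",           # hypothetical file name
    "/var/www/www.nl-menu.nl/images/my logo.gif",   # hypothetical file name
]
urls = [path_to_url(p) for p in paths]
print("\n".join(urls))
```

The resulting list could be written to urls.txt and used in the same way as the output of the find/sed pipeline.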

Best way to read SCSI tape drive on a modern PC?

Wed, 06 Jun 2018 12:00:51 +0000

We recently acquired a number of DLT-IV and DDS-1 tapes, and we'd like to recover the data stored on them. We already have readers for both tape formats; however these are both SCSI devices, and since modern PCs don't have SCSI connectors we cannot hook them up directly. After a bit of Googling I came up with a few options myself:

1) Buy a SCSI to USB adapter cable (used, as these are not produced anymore). However, I came across some reports that this is not a good option for tape drives, as tape transport commands (rewind, fast-forward) won't work. See e.g. here
2) Buy a write blocker that has a SCSI connection. I'd expect that this would have the same limitations as option 1 above (also, these days no one seems to be making SCSI write blockers anymore).
3) Buy a used SCSI adapter card and build that into the machine that is used for imaging (but are these cards even compatible with modern desktops?).
4) Do the imaging using an old desktop that already has a SCSI card (or is compatible with it). For various reasons I'd rather not go this route unless absolutely necessary ...

As I don't want to reinvent the wheel, I'm curious if anyone with experience reading SCSI-connected tape drives could give me some recommendations on the best way to proceed with this. (BTW the workstation we'll be using for data extraction runs Linux.)

Merging & Deduping WARC files

Tue, 01 May 2018 12:06:57 +0000

Is there any programme/script that will take a group of WARC files and merge them, removing exact duplicate responses?

I realise this probably goes somewhat against good practice, but for reasons of space I would like to remove the approximately 90% replication of content (e.g. unchanged images) but retain the varying parts.
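To make the idea concrete, here is a minimal sketch of digest-based deduplication. It works on plain (URL, payload) pairs rather than on real WARC records, since reading and writing WARC files would need a WARC library that is not shown here; `deduplicate` and the example captures are hypothetical. Note that the WARC format itself defines "revisit" records for recording that a capture was identical to an earlier one, which may be preferable to dropping records outright:

```python
import hashlib


def deduplicate(records):
    """Yield (url, payload, is_duplicate) for each (url, payload) pair.

    A record counts as a duplicate when the same URL was already seen
    with a byte-identical payload (compared by SHA-1 digest).
    """
    seen = set()
    for url, payload in records:
        key = (url, hashlib.sha1(payload).hexdigest())
        duplicate = key in seen
        seen.add(key)
        yield url, payload, duplicate


# Hypothetical captures standing in for response records from several WARCs.
captures = [
    ("http://example.org/logo.png", b"PNG-bytes"),
    ("http://example.org/", b"<html>version 1</html>"),
    ("http://example.org/logo.png", b"PNG-bytes"),       # unchanged image
    ("http://example.org/", b"<html>version 2</html>"),  # changed page
]
kept = [(url, payload) for url, payload, dup in deduplicate(captures) if not dup]
print(len(kept), "of", len(captures), "records kept")
```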

Does anyone have a procedure in place for redacting selected content (e.g. PII) from files that they can share?

Wed, 07 Feb 2018 11:38:43 +0000

In addition, if you are using IdentityFinder, how are you making it work? We have been unable to get the "scrub" function to work in our copy. If you are using some other tool, what is it?

Is there any best practice/guidance on where to store, in METS, metadata about digitisation/digital creation processes?

Wed, 07 Feb 2018 11:38:34 +0000

For digitised files, what is the best way to capture metadata about the digitisation process (equipment/software, settings, dates, operator names, etc) in METS? Should it be captured?

I'm not just thinking of image based digitisation, so a couple of example scenarios would be:

Digitising a newspaper to a TIFF or JP2?
"Digitising" sound content from carrier form to a WAVE file?

My thoughts were around using the <techMD> or <digiprovMD> elements, however neither seems to completely fit.

The LoC’s METS primer suggests "<techMD> records technical metadata about a component of the METS object, such as a digital content file" (http://www.loc.gov/standards/mets/METSPrimerRevised.pdf#page=41). So this appears to be aimed at the technical characteristics of the files being preserved.

Just below that, in the same document (http://www.loc.gov/standards/mets/METSPrimerRevised.pdf#page=43), it says the "<digiprovMD> can be used to record preservation-related actions taken on the various files which comprise a digital object (e.g., those subsequent to the initial digitization of the files such as transformation or migrations) or, in the case of born digital materials, the files' creation". So for digitised material, this seems to be about actions *after* digitisation, not the digitisation itself. By contrast, for born-digital content the wording implies it can be used to capture details of content creation.

Perhaps I've missed something obvious (or used the wrong search terms), but I'm not finding much in the way of digitisation provenance capture.

Does anyone else record such info in METS, and if so, how so?

What are the risks involved in a DIY solution like RODA?

Wed, 07 Feb 2018 11:38:21 +0000

I am curious to hear from anyone, particularly smaller institutions, that has implemented DIY solutions with Archivematica, RODA, etc. We have a non-trivial amount of data (300TB - 400TB) that would be cost-prohibitive in a vendor-based, pure-cloud scenario using S3 and Glacier.

It seems to me the most cost-effective method is to have data on-site in duplicate and replicated to Amazon Glacier, perhaps using a commercial system for the added support. However, it seems (in my ambitious mind) possible to manage a system like RODA in-house and handle the replication ourselves, needing only to pull from Glacier if the fixity checking reports an error. Can someone disabuse me of my fanciful notions? Any horror stories or success stories? Am I right to assume that as long as the bags are stored in Glacier we could migrate those to another system later on if we found maintaining our own system too unwieldy? How horrifying is it to all of you to rely on Glacier as a fail-safe option?

Does splitting large master files (temporarily) go against best practices?

Wed, 07 Feb 2018 11:38:12 +0000

At my academic institution, we currently have a set of 15 hard drives with 42 TB (total) of video recordings stored on them. We are seeking to move the material to more secure storage, but the only storage option immediately available to us has a cap of 15 GB per upload. Our files exceed that limit. The IT department wants us to split the uncompressed video files to meet this cap, which we do not want to do. At the same time, we are currently seeking approval for more robust, scalable storage, but as an academic institution, that can take some time.

Has anybody else faced a similar situation and how did you address it?

Digital Preservation - Server Side

Wed, 07 Feb 2018 11:38:10 +0000

Is there any guidance/best practice on server-side archiving of a web server? I can think of things such as documenting hardware setup/partitions/processes running etc. in text files, using BagIt for important directories (/etc/, /usr/local/, /home, software in /src, etc.), but this seems slightly haphazard.

I should say the disk image itself *may* be preserved but I would like to ensure the non-OS parts are preserved in a more accessible form.

And as always, time is short and the budget is small. I'd like to just do enough that whoever may come along in 10 years' time doesn't curse me for not saving the one essential thing they need.

External Write Blocker that works with an External Optical Drive?

Wed, 07 Feb 2018 11:38:02 +0000

I need an external write blocker that works with an external optical drive (CD / DVD drive).

We bought a Tableau T8U but it doesn't recognise the optical drive. Tableau tell me they have nothing that will work.

It needs to be external as it gets attached to more than one PC (quarantine and then network).

Any suggestions, please?

Is checking fixity at the folder level sufficient for knowing if any files within that folder have been altered?

Wed, 07 Feb 2018 11:37:59 +0000
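
A minimal sketch of the difference between one folder-level digest and per-file digests, assuming "folder-level fixity" means a single checksum derived from all files in the folder. The function names and the temporary test files are hypothetical; for large files the hashing should read in chunks rather than with read_bytes():

```python
import hashlib
import tempfile
from pathlib import Path


def file_digests(folder):
    """Return {relative_path: sha256_hex} for every file under folder."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def folder_digest(manifest):
    """Collapse a per-file manifest into one digest for the whole folder."""
    combined = hashlib.sha256()
    for name in sorted(manifest):
        combined.update(f"{name}\0{manifest[name]}\n".encode("utf-8"))
    return combined.hexdigest()


def changed_files(before, after):
    """List paths that were added, removed, or altered between two manifests."""
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("alpha")
    (Path(tmp) / "b.txt").write_text("beta")
    before = file_digests(tmp)
    (Path(tmp) / "b.txt").write_text("beta!")  # simulate an alteration
    after = file_digests(tmp)

folder_changed = folder_digest(before) != folder_digest(after)
altered = changed_files(before, after)
print(folder_changed, altered)
```

In this sketch the single folder digest reveals that something changed, but only the per-file manifest can say which file it was.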

What are current best practices for acquiring & preserving Google Docs?

Mon, 05 Feb 2018 18:56:07 +0000

What are current (2018) best practices for acquiring and preserving records created in Google's cloud platform?

In particular, are there methods that preserve the rich metadata around document creation, editing, and commenting that exist in the native apps?

How/where to store metadata about optical media sector layout in METS/PREMIS

Mon, 05 Feb 2018 16:09:33 +0000

I'm drafting a METS/PREMIS profile for images/rips of optical media (ISOs for data sessions; WAV or FLAC files for audio). One of the pieces of metadata I'd like to include is the output of the cd-info tool, which contains information about the sector layout of the disc. Here's an example:

https://gist.github.com/bitsgalore/9a2838481574c040f7c4b7da4ed59926

However I'm unsure how (and where) to store this info in METS. My initial idea was something like this:

Create a METS techMD element which is associated with the structMap div element that encompasses all files that were extracted from the physical disc (typically one ISO image and/or multiple audio files).
Inside this techMD element, create a PREMIS OBJECT instance with xsi:type="premis:representation" (since it describes the disc as a whole, and not an individual ISO image or audio file!)
Then use PREMIS unit 1.5.7 objectCharacteristicsExtension as a container for wrapping the cd-info output.

The problem here is that the PREMIS 3.0 data dictionary says that the objectCharacteristicsExtension unit (and also its parent unit 1.5 objectCharacteristics) is "Not Applicable" for the Intellectual Entity and Representation object types!

This makes me wonder how others are handling this. Is this an oversight of PREMIS, or is there some other (possibly better) way to do this that I've overlooked?

Any suggestions appreciated!

IP limiting/restrictions for providing in-house access to copyrighted/restricted material?

Mon, 11 Sep 2017 19:55:17 +0000

My institution is interested in providing in-house-only access to digital materials we can't simply put online -- copyrighted material, restricted materials, etc.

We currently embed digital materials directly into finding aids, and want to avoid a system that would involve implementing a separate access solution (and we know our patrons aren't interested in setting up accounts/logins, either). We are currently exploring the use of IP addresses to restrict access to the digital content to computers physically located inside our library reading room. Ideally, anybody would be able to look at the finding aid, but only IP addresses in the permitted range would be able to click through to view/download the actual digital content.

Does your institution or any other institution that you know of do something similar? Have you explored IP address restrictions for your institution? What advantages and disadvantages did you find? We're hoping to avoid re-inventing the wheel if we can.
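
The check itself is small. Below is a minimal sketch, assuming the reading-room machines sit in one known subnet; the 192.0.2.0/28 range and the `may_view_content` name are placeholders, not a real configuration:

```python
import ipaddress

# Hypothetical reading-room subnet; replace with the real range.
READING_ROOM = ipaddress.ip_network("192.0.2.0/28")


def may_view_content(client_ip):
    """Return True when the client address lies inside the permitted range."""
    return ipaddress.ip_address(client_ip) in READING_ROOM


inside = may_view_content("192.0.2.5")
outside = may_view_content("198.51.100.7")
print(inside, outside)
```

One thing to watch for: if the web server sits behind a proxy or load balancer, the address it sees may be the proxy's rather than the patron's, so the check has to use whatever address the infrastructure reliably passes on.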

Ingesting large & hybrid digital collections

Thu, 18 May 2017 14:15:46 +0000

Dear all,

For our web archaeology project The Digital City Revives [1], we are looking for answers to a complex set of questions. Over the years, we have gathered a considerable amount of heterogeneous files related to the original websites. Our next question is what would be a logical way to keep this material together and manageable.

In order to better understand the necessary approach we are looking for concrete examples and use cases. We’re looking for descriptions of the ways archives or museums have ingested heterogeneous digital collections, in the hope of finding answers to questions such as:

- to what extent do the provenance and original order of the materials in the state we found them in, matter?

- what descriptions are out there of organisations that have ingested this type of heterogeneous (dare we say, messy) collection?

- to what extent do these organisations describe the various contents and at what (unit) level do they do it?

- how do these organisations inventory the dependencies for this variety of materials?

If you’d be able to point us to some concrete examples (or, potentially, references) off the top of your head, we’d already be grateful. With warm regards and many thanks in advance for your help!

Erwin Verbruggen

Netherlands Institute for Sound and Vision

[1] http://waag.org/en/project/digital-city-revives

Preserving VCDs: what is the best format?

Wed, 21 Dec 2016 22:43:10 +0000

Our institution preserves DVDs in the ISO format instead of converting to an MPEG-2 format. This allows us to maintain the interactivity within the DVD structure and allows playback on a computer. Because VCDs are multi-track like audio CDs, you can't just make an ISO. I have made a few BIN/CUE images of the discs, but playback is an issue.

Has anyone else found a great way to preserve this content?

Are there accepted alternatives to WAVE for storing audio files larger than 4GB?

Thu, 23 Jun 2016 15:56:59 +0000

The total size of a WAVE file is defined by a 32-bit integer in its header. Depending on whether a developer interpreted the spec as a signed or unsigned integer, WAVE files have a maximum size of 2 or 4 GB. Digitizing audio at 24 bits and 96 kHz creates about 2 GB per hour in stereo. So audio preservation master WAVE files have a duration limit of about 2 hours, even at the 4 GB limit. We have a number of audio media that exceed that duration, especially with formats like open reel audio that don't have a defined size limit.
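
The arithmetic behind these figures can be checked with a short sketch, assuming 24-bit stereo PCM sampled at 96 kHz and ignoring the few bytes of header overhead:

```python
SAMPLE_RATE_HZ = 96_000   # 96 kHz
BYTES_PER_SAMPLE = 3      # 24-bit samples
CHANNELS = 2              # stereo

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
bytes_per_hour = bytes_per_second * 3600


def max_hours(size_limit_bytes):
    """Longest recording, in hours, that fits under a given size limit."""
    return size_limit_bytes / bytes_per_second / 3600


print(f"{bytes_per_hour / 1e9:.2f} GB per hour")
print(f"{max_hours(2**31):.2f} hours with a signed 32-bit size field")
print(f"{max_hours(2**32):.2f} hours with an unsigned 32-bit size field")
```

So the ceiling is a little over 1 hour if the size field is read as signed and a little over 2 hours if it is read as unsigned.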

As far as I can tell, it is standard practice in the digitization community to save a 2+ hour audio stream as 2 or more WAVE files with 30 seconds of overlap to match up the streams. Here's an example being served by the American Archive of Public Broadcasting (http://americanarchive.org/catalog/cpb-aacip_28-gf0ms3kc4t).

However, splitting up the audio stream is causing downstream preservation problems because I cannot find a simple way to encode the relationship between the preservation masters. The semantic relationships become even more complicated when edit masters are created that do not have the same relationship to each other as the preservation masters. Finally, the split files provide a disjointed performance of the original asset for our users.

The easiest solution would be to use a format that can hold more than 4GB of data. There have been some attempts to extend WAVE with WAVE64 and RF64. There are also other audio formats such as FLAC. IASA even recommended creating multiple mono WAV files and saving them in a single TAR (2.8.3). However, I haven't found documentation that any of these have been widely accepted. Is there a common strategy for creating 4GB+ audio files?

Best SIP / AIP creation practices for optical carriers that span multiple volumes

Wed, 22 Jun 2016 16:30:04 +0000

I’m trying to figure out a general SIP/AIP architecture for optical media images. For CD-ROMs the images will typically be ISO9660 files, and for audio CDs audio files for each track.

It’s pretty common to have carriers that span multiple volumes, and from an access point of view I think it would make sense to combine those cases into a composite SIP/AIP. Following a discussion I had about this with a colleague, I’m curious how other institutions are handling this, and if there are any best practices. Below is some information I found myself.

Library of Congress Audio Compact Disc METS Profile:

http://www.loc.gov/standards/mets/profiles/00000007.html

This explicitly takes into account the possibility of composite objects:

The primary physical component of a compactDiscObject is one or more compact discs. (…) When there is more than one disc, the div TYPE=“cd:disc” elements must occur in document order that corresponds to the physical order of the discs (Disk 1, Disk 2, etc.).

The virtual CD-ROM and floppy collection of Indiana University seems to take a similar approach, judging by e.g. this METS file:

http://webapp1.dlib.indiana.edu/virtual_disk_library/index.cgi/4252478/mets

An alternative approach would be to represent each carrier as one SIP / AIP, and then aggregate the AIPs that belong together using what the OAIS model refers to as an Archival Information Collection (AIC). However I’m not sure how widely used this approach is, and I’m slightly worried it might later complicate things on the access side. But I’d be interested to hear what others are doing.

Parity data for ISO images: anyone doing this? Best practices?

Wed, 25 May 2016 13:48:28 +0000

Earlier today I had a discussion with a colleague on possible ways to store (ISO) images of CD-ROMs and DVDs in our repository system. In addition to checksums, he suggested also generating and storing parity data for each image file (e.g. using the par2 tool: http://manpages.ubuntu.com/manpages/trusty/man1/par2.1.html). This would enable one to repair files in case of bit-level corruption.

This made me wonder how commonly parity info is used in digital archives, and if there are any best (or at least recommended) practices. E.g. what levels of redundancy are typically used? I'm not really familiar with this at all, and it's not a subject that is often mentioned in digital preservation discussions (but see https://twitter.com/anjacks0n/status/733281959762395139).
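
To illustrate the principle only: the sketch below uses simple XOR parity, which can rebuild exactly one lost block per group. This is a deliberately simplified stand-in, not what par2 does (par2 uses Reed-Solomon coding, which can recover several damaged blocks). The block contents and `xor_blocks` are made up for the example:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Stand-ins for equally sized chunks of an image file.
data_blocks = [b"ISO9", b"660 ", b"data"]
parity = xor_blocks(data_blocks)

# Suppose the middle block is lost or fails its checksum: XOR-ing the
# surviving blocks with the parity block gives it back.
recovered = xor_blocks([data_blocks[0], data_blocks[2], parity])
print(recovered == data_blocks[1])
```

Here one parity block protects three data blocks, i.e. roughly 33% redundancy buys recovery from a single bad block; the checksum is still needed to tell which block is bad.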

I am looking for a straightforward accession tool. Can people please advise?

Fri, 01 Apr 2016 13:40:15 +0000

How can I set up a test instance of Archivematica on OS X (10.11.2)?

Fri, 01 Apr 2016 13:40:05 +0000

I want to test out Archivematica on my MacBook if it's possible. Should I run a Linux virtual machine via VirtualBox or Docker? I'm looking for a tutorial if any are out there.

Cheers!

DIY digitisation, good or bad?

Thu, 18 Feb 2016 08:59:04 +0000

Our digitisation workflow is a long and complex one that takes well over a year from the time we start gathering another batch to digitise, to the time we finally make the digitised items publicly available.

Because it takes so long, we often get requests to bypass our existing procedures by scanning or photographing works using standard desktop scanners or cameras. What do others think of this? As long as we comply with our digitisation standards and properly check the quality of the results, would it be acceptable to bypass our established digitisation procedures?

So far we have only done this for items we don't intend to preserve long term, but we are facing increasing pressure to also do it for more valuable special collections items.

Setting a custom date for wget or wpull?

Thu, 17 Dec 2015 11:47:58 +0000

Both wget and wpull, when saving to WARC, will store the date and time of when the archiving action took place in every HTTP request.

Is there a way to modify this date, either during or after recording?

How should an organisation QA the results of its outsourced web archiving activities?

Wed, 25 Nov 2015 18:07:55 +0000

Outsourcing can help an organisation achieve progress when it doesn't have the resources, skills, technology or infrastructure to conduct particular activities itself. But it's only worth doing if there is a way to validate the quality and completeness of the work, possibly the biggest challenge in outsourcing. Organisations sometimes also outsource the provision of access to crawled websites. In this case, the crawled data is also supplied to the collecting organisation for preservation, but this data isn't seen by the end users so there is no validation of the data.

So what should that organisation be looking for in order to check the quality and completeness of the supplied crawl data, and how should they go about doing it? A key issue is perhaps what the 3rd party can reasonably be expected to supply alongside the crawled data (in the form of manifests) to enable this QA.

I'm interested in what organisations that outsource their web archiving are doing with regard to QA and also what the web archiving experts who do their own web archiving (and are familiar with the common web archiving pitfalls) think these organisations should be doing as part of their QA activities.

Managing Timestamps

Thu, 29 Oct 2015 23:01:06 +0000

How do you document that your digital records are authentic when timestamps can change upon transfer from one drive to another?

Best method to record track playing order for ripped audio CDs

Tue, 20 Oct 2015 13:52:30 +0000

Although there’s quite a bit of info on the web on audio CD preservation, most of the resources I found tend to focus on the audio ripping process. However, when you rip a CD to a bunch of audio files, it’s important that the original playing order of the tracks is recorded somewhere. Of course you can (and should!) do this at the preservation metadata level, but for access you also want to have some mechanism that allows an end user to hear the individual tracks in their original playing order (e.g. in a multimedia player application). Off the top of my head I can think of 2 general approaches to achieve this:

Record the playing order inside the audio files themselves, e.g. using “Track Number” tags that are embedded as RIFF Info or ID3 tags.
Use a separate playlist file. For this several formats exist, such as M3U and XSPF.

The main problem I see is that all of the above methods are supported by some players (and not by others), and I have some difficulty in judging which approach would ensure the most widespread support. This made me wonder if there are any recommended best practices for this? What are other people doing?
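
For the playlist approach, a plain M3U file is just a list of file paths, one per line, in playing order. A minimal sketch, with hypothetical track file names and a made-up `make_m3u` helper:

```python
def make_m3u(track_files):
    """Return plain M3U playlist text listing the tracks in the given order."""
    return "\n".join(track_files) + "\n"


# Hypothetical file names, ordered as on the source CD.
tracks = ["track01.wav", "track02.wav", "track03.wav"]
playlist = make_m3u(tracks)
print(playlist, end="")
```

Zero-padded track numbers in the file names, as in this example, have the side benefit that a plain alphabetical sort also reproduces the playing order in players that ignore playlists.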

CD-ROM / DVD imaging: is it customary to save scans of booklets / covers as well?

Tue, 22 Sep 2015 12:48:07 +0000

I'm trying to compile a minimum set of metadata that an end-user would need for using ISO images from a ~15,000 CD-ROM/DVD collection. In particular I'm wondering about CD booklets and covers. They often contain lots of useful info and documentation related to technical environments and usage instructions. However, having to digitize them would add a substantial overhead on top of the actual imaging workflow. Another option might be to make scans of the actual CDs, which often contain useful info as well. This is what e.g. Internet Archive seems to be doing:

https://archive.org/details/SharewearBreakthroughUtil_ProdCol

I'm curious how other institutions are dealing with this?

Incomplete ISO image after imaging CD-ROM - how to prevent and detect this?

Thu, 03 Sep 2015 10:33:26 +0000

While running some tests creating CD-ROM ISO images with ddrescue, I ended up with ISO images that were incomplete in some cases (last ~59 MB of image file missing), even though ddrescue’s log file didn’t report any errors. Below are the results I got from 4 attempts at imaging the same CD-ROM on the same PC (note that some of the ddrescue options I used are slightly different, but this appears to be unrelated to my issue). For this I used 2 different external DVD readers:

Reader A - modern Samsung USB device;
Reader B - old SATA (internal) device, refurbished to USB.

Attempt 1 - reader A

Command line:

ddrescue -b 2048 -r4 -v /dev/sr0 windows_98_upgrade_nl.iso windows_98_upgrade_nl.log

This resulted in a 601.7 MB ISO image. Here are the contents of the log file:

# Rescue Logfile. Created by GNU ddrescue version 1.17
# Command line: ddrescue -b 2048 -r4 -v /dev/sr0 windows_98_upgrade_nl.iso windows_98_upgrade_nl.log
# current_pos  current_status
0x23DC0000     +
#      pos        size  status
0x00000000  0x23DCB000  +

I.e. the log file indicates the CD was imaged without problems. MD5 checksum is 82603be06a8142aad1dfaa9e1279371f

Attempt 2 - reader B

Command line:

ddrescue -d -n -b 2048 /dev/sr0 windows_98_upgrade.iso windows_98_upgrade.log

Again this resulted in a 601.7 MB ISO image, again with no indication of read errors in the ddrescue log. MD5 checksum was (again) 82603be06a8142aad1dfaa9e1279371f.

Then by chance I discovered some text files in the image file weren’t readable, so I did a third try, now again with reader A.

Attempt 3 - reader A

Command line:

ddrescue -d -n -b 2048 /dev/sr0 windows_98_upgrade_test.iso windows_98_upgrade_test.log

This resulted in a 660.9 MB ISO file. Again no errors in the ddrescue log; MD5 checksum is 24f0f746d0817121253c6b1242d4246e. After mounting the image, the text files that were problematic in the earlier images were normally readable.

Attempt 4 - reader B

Command line:

ddrescue -d -n -b 2048 /dev/sr0 windows_98_upgrade_refurbished_onemoretry.iso windows_98_upgrade_refurbished_onemoretry.log

Result was identical to result of attempt 3!

So summarising, 2 runs of ddrescue (using 2 different USB readers) resulted in exactly the same error, whereas the remaining 2 runs (again using 2 different readers) completed fine. So what’s going on here!?

Md5sum directly on physical CD

As a first step I computed the MD5 checksum directly on the physical disc, using:

md5sum /dev/sr0

I repeated this 4 times, using both readers A and B, plugging them into different USB slots. In each case the result was 24f0f746d0817121253c6b1242d4246e, which is identical to the hash I got for the ISO in attempts 3 and 4 (confirming these images are correct).
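The same comparison can be wrapped in a small helper. This is only a sketch, with a hypothetical function name (`same_md5`). It reads both inputs through standard input, so that `md5sum` prints the same placeholder file name for each and the two output lines can be compared directly.

```shell
# Succeeds (exit status 0) if both inputs have the same MD5 checksum.
# Each argument may be a device such as /dev/sr0 or a regular file.
same_md5() {
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}
```

Example use: `same_md5 /dev/sr0 windows_98_upgrade_test.iso && echo identical || echo DIFFERENT`. Reading the whole disc again is slow, so this roughly doubles the time needed per disc.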

Comparison of ISO images in hex editor

I also did a comparison of the intact and faulty ISOs in a hex editor. This revealed that in the faulty images a block of about 59 MB of data is missing at the end of the file. I double checked this by copying the block of missing data to a separate file (missingblock.dat), after which I appended it to one of the faulty files using:

cat windows_98_upgrade_nl.iso missingblock.dat > isorepaired.iso

Then check:

md5sum isorepaired.iso

Result:

24f0f746d0817121253c6b1242d4246e

Which corresponds to the value of the intact image.
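For anyone who wants to repeat this check without a hex editor, here is a rough sketch of the same two steps. The function names are made up, and `cmp -n` (compare only the first N bytes) is assumed to be available, as it is in GNU diffutils.

```shell
# Succeeds if the first file is byte-for-byte identical to the start of
# the second one, i.e. the only difference is data missing at the end.
is_truncated_copy() {
    short_size=$(( $(wc -c < "$1") ))
    cmp -s -n "$short_size" "$1" "$2"
}

# Write the bytes that the short image lacks (the tail of the full image)
# to a third file. usage: extract_missing_tail SHORT FULL OUTPUT
extract_missing_tail() {
    short_size=$(( $(wc -c < "$1") ))
    tail -c +"$(( short_size + 1 ))" "$2" > "$3"
}
```

With the files from this post that would be `is_truncated_copy windows_98_upgrade_nl.iso windows_98_upgrade_test.iso`, followed by `extract_missing_tail windows_98_upgrade_nl.iso windows_98_upgrade_test.iso missingblock.dat`.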

But why is this happening in the first place?!

The really important question is why this is happening in the first place, and if there’s any way to avoid it? The thread below on the ddrescue mailing list describes a somewhat similar (but not quite the same) issue:

https://lists.gnu.org/archive/html/bug-ddrescue/2014-02/msg00003.html

Note the following quote from the response by ddrescue’s main author. He suggests that the problem might be an issue with a USB port, adding:

Ddrescue can’t know if the data are really good or if the hardware is lying about it.

If correct, this would apply to other imaging tools as well. Based on my results, I’m curious if other people may have run into similar issues. More importantly: how does one even detect errors like these? Of course it is always possible to run a checksum on the physical medium and then compare it to the ISO checksum, but this takes ages. A quicker and dirtier approach would be to compare the size of each created image against the size of the input medium. E.g. to get the size of a CD-ROM I can use something like this:

lsblk /dev/sr0 -n -b

Result:

sr0   11:0    1 660850688  0 rom

(Fourth column is size of CD in bytes).

To get the size of the ISO image:

du -b windows_98_upgrade_refurbished_onemoretry.iso

Result:

660850688   windows_98_upgrade_refurbished_onemoretry.iso

This does not guarantee the image is correct, but it will detect missing blocks of data.
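The two size look-ups can be combined into one check that is easy to run after every imaging job. This is a sketch only: the function names are hypothetical, and the extra `lsblk` options (`-d` to skip child devices, `-o SIZE` to print only the size column) are additions to the command shown above.

```shell
# Print the size in bytes of a block device or a regular file.
size_in_bytes() {
    if [ -b "$1" ]; then
        lsblk -n -b -d -o SIZE "$1" | tr -d ' '
    else
        wc -c < "$1" | tr -d ' '
    fi
}

# Succeeds only if source and image have exactly the same size.
same_size() {
    [ "$(size_in_bytes "$1")" -eq "$(size_in_bytes "$2")" ]
}
```

Example use: `same_size /dev/sr0 windows_98_upgrade_nl.iso || echo "image size differs from disc"`.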

I also ran some cursory checks with isovfy and isoinfo, but the output of those tools turned out to be identical for both faulty and intact images, so they’re probably not very helpful for this sort of error.

I’m curious how other people/memory institutions are dealing with this. Any thoughts / suggestions are welcome!

Addition

On Twitter Alexander Duryee rightly pointed out that a CD-ROM's Primary Volume Descriptor contains a field with the size of the disc (this is also where lsblk gets this value). So one would assume that ddrescue would check against this number. Apparently it doesn't do this, so I think I'll report this as a bug. (Note that such a check doesn't guarantee the copied data are identical to the source disc.)
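The declared size can also be read straight from an image file. The sketch below assumes the standard ISO 9660 layout: the Primary Volume Descriptor sits in sector 16 (byte offset 32768), with the Volume Space Size (in logical blocks) at offset 80 and the Logical Block Size at offset 128 within the descriptor, both stored little-endian first. The function names are made up, and discs with other layouts (e.g. multi-session or non-ISO 9660) are not handled.

```shell
# Read an unsigned little-endian integer. usage: read_le FILE OFFSET NBYTES
read_le() (
    value=0
    shift_bits=0
    for byte in $(od -An -v -tu1 -j "$2" -N "$3" "$1"); do
        value=$(( value + ($byte << shift_bits) ))
        shift_bits=$(( shift_bits + 8 ))
    done
    echo "$value"
)

# Print the volume size in bytes as declared in the Primary Volume Descriptor.
# The body runs in a subshell, so 'exit' only leaves this function.
pvd_volume_bytes() (
    pvd=32768                                   # sector 16 * 2048 bytes
    magic=$(dd if="$1" bs=1 skip=$(( pvd + 1 )) count=5 2>/dev/null)
    if [ "$magic" != "CD001" ]; then
        echo "no ISO 9660 volume descriptor found in $1" >&2
        exit 1
    fi
    blocks=$(read_le "$1" $(( pvd + 80 )) 4)        # Volume Space Size
    block_size=$(read_le "$1" $(( pvd + 128 )) 2)   # Logical Block Size
    echo $(( blocks * block_size ))
)
```

Comparing the output of `pvd_volume_bytes image.iso` with the actual file size (e.g. from `wc -c < image.iso`) flags an image that is shorter than the volume it claims to contain.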

What organizations are preserving software?

Thu, 13 Aug 2015 20:04:13 +0000

I've been building a list of organizations that are attempting to preserve software including:

NIST's software reference library
some game archives such as Stanford's
Microsoft has an internal Archive
the Internet Archive is doing amazing work
so is the National Library of Australia
the Play It Again project
the University of Michigan Archive
Rhizome are preserving software based/dependent art
as is MoMA
and of course the Computer History Museum

There are resources here and here with more information but I'm interested in community input.

But who else is doing software preservation?

Would anyone else find such a list useful?

Is it problematic to include spaces in file names?

Thu, 09 Jul 2015 15:53:37 +0000

I have read about and heard different opinions about including spaces in file names, and based on casual observation it seems like, as a community, we typically recommend that people do not use spaces in their file names (perhaps for many of the reasons outlined here). Do you have experience with working with spaces in file names that provides insight into whether they are problematic? Should we continue to recommend that digital object creators avoid spaces in file names, or is it an obsolete concern?

Favorite online resources to learn more about DP tools?

Tue, 07 Jul 2015 20:13:52 +0000

I'm trying to gather information about some ways for folks just starting out in digital preservation to learn more about tools and resources. Does anyone have favorites, or know of good webinars that can offer more insight on tools? Any information sessions (preferably free) that you'd like to share that you found helpful?

I'm aware of the resources through NDSA and DPOE, and often refer to the COPTR tool grid. But I'd love to see if anyone has other favorites or has a resource they found particularly helpful. For example, the new ArchivesDirect service hosts information sessions and can be a good place for people to learn more about the new hosted service.

Thanks!

Strategy for preserving scanned files outside of repository

Wed, 01 Jul 2015 04:05:52 +0000

We have undertaken various digitisation projects the past couple of years; each item digitised has resulted in output in TIFF, JPEG and PDF formats, with up to 600 or so TIFFs/JPEGs for some items.

We have loaded the PDF files into our research repository. However, our repository software does not cope well with more than a dozen or so files per record. So we need to find an alternative mechanism for storing at least the TIFF files (the JPEGs and PDFs can always be regenerated from the TIFFs). Currently, we have 3 copies of each of the files stored on our local network (as well as whatever system backups occur), until we can figure out a more permanent preservation strategy.

We're now at the point where we want to figure out that more permanent strategy. The PDF files in the repository will remain the primary access point. Can anyone suggest a suitable approach/software for the preservation masters?

What are best hardware/software for automated back-up to external hard drive?

Tue, 09 Jun 2015 21:39:20 +0000

I'm wondering if we as a digipres community have centered on one or a few trusted solutions for ongoing, automated backup of personal data to physical media, i.e., external hard drives. I'm interested in hardware and software recommendations for a Windows environment. This would be active research data management, and the data is potentially valuable for the long-term. Any data identified for long-term preservation would be migrated to a trusted repository. Please, if there's anyone out there who can help me home in on some trusted solutions among the zillions of options that seem to be out there in the PC world, I'd be most appreciative.

Recommendations - Drives for Offline Storage

Thu, 04 Jun 2015 18:36:34 +0000

I'm looking into options for drives for 'cold' storage for my institution, possibly for long periods of time (we would like a shelf-life of about four years). They don't need to be particularly fast so much as reliable and long-lived (and, of course, cost-effective). We've already got some LTO tape we use; we're looking more for hard disk drives or solid state drives to be held offline in a climate-controlled area. The scenario involves a write-once, read-a-few-times situation.

So, first of all, does anybody have any experiences or know of any studies about the longevity and data integrity of HDD versus SSD? I know, for example, that SSD is better-equipped for unpowered storage than HDD, but that they can also be more prone to data loss.

Second, does anybody have a particular brand that they use and would particularly recommend to purchase or to avoid?

If you'd rather not answer on-list, you can email me at sarah.barsness@mnhs.org and I'll post your replies to this thread anonymously for you (I could *really* use some very candid recommendations).

Preserving PDF files with multimedia content

Wed, 20 May 2015 20:24:50 +0000

I am curious what other institutions are doing with PDF files that have embedded video or audio content. We migrate almost all of our PDFs to PDF/A. The PDF/A standard does not allow for embedded content like this.

We have discussed either normalizing the files to a format other than PDF/A to ensure long-term usability, or extracting the embedded content and handling it separately. But the latter takes the content out of its original presentation.

Are there other options?

Thanks.

How to distinguish 68k and PPC executables on classic Macintoshes?

Thu, 26 Mar 2015 18:48:45 +0000

Given a binary for a classic Macintosh operating system (up to and including version 9), how can one tell which processor architecture the binary was built for?

Scanned manuscript - saved as TIFF image files on archival disc. Is it OK?

Mon, 16 Feb 2015 19:58:42 +0000

Any LTP "Packaging Format" Standard you Support that Makes More Sense than WARC?

Fri, 13 Feb 2015 21:20:18 +0000

Here's my question: Do you prefer a packaging format other than WARC? What other justifications should I take into account? Could you share additional resources for my consideration (in addition to those below)? And I'll elaborate:

We generally agree that digital permanent records should be kept in open and supported formats (with tools and communities behind them). We even have a broad selection of formats to choose from (from TIFF/JPEG2000 through PDF/A to XML). What remains, in my view, is a similarly broad agreement on the package format that houses the above content and linked metadata. The target here is the format of the container, wrapper, or capsule.

There are many requirements that this container should meet; above all, I would like the container to be extensible, so that one can add more content, additional metadata, user-contributed context, etc.

So far, the .zip container has more or less led the charge, though other formats such as .warc are emerging from studies and comparisons and showing more promise, which prompts me to ask:

Do you prefer a packaging format other than WARC? What other justifications should I take into account? Could you share additional resources I could consider in addition to these below?

Recommendations/studies supporting WARC
- For web objects: http://www.getpocket.com/a/read/777166755
- For all objects: https://fedora.phaidra.univie.ac.at/fedora/get/o:293682/bdef:Content/get
Standard using ZIP as container/wrapper/capsule:
- Victorian Electronic Records Strategy / VERS Encapsulated Object (VEO) : http://prov.vic.gov.au/wp-content/uploads/2014/05/VERSStdRevisionProposal-v1-0.pdf
Standard-to-be considering SIRF or an unspecified format (needs proof of concept)
- Storage Networking Industry Association's SIRF (Self-contained Information Retention Format): http://www.snia.org/SIRF

Answers, links and comments will be well received!

Is there a GUI alternative to Bagger that performs a similar function?

Fri, 06 Feb 2015 01:06:38 +0000

As I understand it, Bagger is the only GUI tool that was developed to use the BagIt specification and it is no longer being updated. If that's correct (that is, there's no other GUI BagIt tool), are there other GUI tools that enable you to carry out the following steps:

1. Package a set of files with a manifest containing a list of files and checksums

2. Transfer the package to a new system/location

3. Verify that the files have been received

I realize that there are updated BagIt CLI tools, but I ask because not everyone is comfortable with the command line and I wonder if there are other GUI tools that do similar things, but don't follow the BagIt spec.
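For reference, steps 1 and 3 amount to something like the following on the command line. This is a plain checksum manifest, not a BagIt-conformant bag, and the function names are made up for illustration. It assumes GNU coreutils (`sha256sum`) and that the manifest is stored outside the directory being packaged.

```shell
# Step 1: write SHA-256 checksums for every file under DIR to MANIFEST.
# usage: make_manifest DIR MANIFEST   (MANIFEST must live outside DIR)
make_manifest() (
    manifest=$(cd "$(dirname "$2")" && pwd)/$(basename "$2")
    cd "$1" && find . -type f -exec sha256sum {} + > "$manifest"
)

# Step 3: after the transfer, check every file listed in MANIFEST.
# Fails if a listed file is missing or its checksum has changed.
# usage: verify_manifest DIR MANIFEST
verify_manifest() (
    manifest=$(cd "$(dirname "$2")" && pwd)/$(basename "$2")
    cd "$1" && sha256sum --quiet -c "$manifest"
)
```

Note that this only checks the files named in the manifest; files added after the manifest was written go unnoticed.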

Tool(s) for extracting administrative metadata from WARC?

Thu, 29 Jan 2015 07:08:14 +0000

I'm researching best practices for administrative metadata--preservation metadata in particular--for web archives. So far I've found some very helpful rationales and schemas, all PREMIS-in-METS-based, but I haven't seen anything that directly explains how one gets from point A to point B. That is, I haven't seen any descriptions of the steps nor the tools used to actually extract this type of metadata (or as much as can reasonably be gleaned) from WARCs. Have you? Have you extracted administrative metadata from your WARCs and lived to tell the tale? I'd love to know what you used and what you thought of the process.

Optimal file sizes for access PDFs

Sat, 24 Jan 2015 20:41:55 +0000

We have digitised a collection of old books, pamphlets, etc, which we will be making available as PDFs and as images through BookReader. The original files were scanned as TIFFs and will not be available to users. The PDFs we originally created were high quality and large file sizes; some over 100 MB. Obviously, this is far too large to expect most users - which will be the public - to download. Is it better to split the larger files into smaller and retain the high quality? Or regenerate the PDFs into smaller, lower quality files? And in either case, what is the ideal maximum file size for cultural heritage-type PDFs available to the general public? Have not been able to find any recommended standards at all around this issue.

Splitting PDFs while preserving quality

Wed, 21 Jan 2015 19:18:52 +0000

For easier presentation, we need to break some very large PDFs into smaller files (we already have the originals saved in TIFF format). However, when we do this via Adobe, the total file sizes get halved, e.g. a 600MB PDF results in 6 PDFs that total 300MB in size. Does this denote loss of quality? How do I check? And what's the best way to break large PDFs into smaller files without that loss of quality?

I thought this info would be relatively easy to find on the Adobe web site or elsewhere, but no luck so far.

What preservation format for video to use when digitizing VHS tapes?

Mon, 05 Jan 2015 19:57:23 +0000

Who uses the SPOT Model for Risk Assessment?

Mon, 15 Dec 2014 18:17:17 +0000

I'm currently investigating using the Simple Property-Oriented Threat (SPOT) Model for risk assessing our repository content and I'd like to know if anyone else is using it in practice?

Two example organisations (Statistics New Zealand and the Florida Digital Archive) are included in the D-Lib article, but I'd like to know if anyone else has attempted to use SPOT since 2012. Good and negative experiences welcome!

~Chris Fryer, Senior Digital Archivist, Parliamentary Archives

Does anyone have experience capturing and archiving WeChat data?

Tue, 09 Dec 2014 15:51:18 +0000

Does anyone have experience capturing and archiving WeChat data? I am working with a research group that wants to capture text conversation, emoticon usage, as well as shared audio clips.