Should there be a Pronom ID for unidentifiable/unidentified?

+3 votes
I may have to rephrase this question, but this is very close to the original question I asked on Twitter, which sparked enough conversation to justify adding it here.  I originally asked "Does Pronom have an ID for unidentifiable/unidentified?"
asked Jul 15, 2014 by euanc (3,910 points)

3 Answers

+2 votes
Best answer

To expand a little on some of my comments in the original Twitter thread: first of all we need to think about what DROID returning a PUID means. It means that (some part of) the bytestream has matched one of the signatures held in PRONOM, or possibly that the match was made simply on extension if no signature match was achieved (depending on your settings). Then there is why no PUID is returned: there is no match to any signature. That may be because there is no signature at all in PRONOM, or because PRONOM is missing a variant of a particular filetype. Depending on your settings, it may be because you haven't actually scanned the whole file. The file may be damaged in some way so that it doesn't match any of the available signatures (note that this may, or may not, mean it is so corrupted that it is unopenable). Or, as Andy says, there may have been some sort of tool failure; usually this should generate some sort of error code as well, which should also be detectable. As Rob indicates, DROID may only be a first step in a characterisation process: additional tools (or additional steps with DROID, such as container signatures) may then be invoked to confirm or refine the initial identification(s).

So by returning no PUID we are saying we have (for possibly a variety of reasons) been unable to match to a possible PUID. To then say we'll assign this case its own PUID seems a bit upside-down to me. In terms of some of the cases Andy mentions, DROID reports separately (from any PUID) on whether something is a file or a folder, and on its size.

answered Jul 16, 2014 by DavidUnderdown (790 points)
selected Jul 16, 2014 by euanc
+1 vote

I'm not sure about PRONOM in particular - it depends how they want to scope things - but I would argue that there are various cases where identification does not succeed but where it is useful to distinguish between them.

The primary use case is as a short-hand to distinguish the "failure to identify a format for a bitstream" case from the case where the actual identification process itself failed.

For example, when profiling the formats in the web archive, we hit so many malformed objects that identification often doesn't just fail - it crashes quite badly. So, when the identification process succeeds, but no format is identified, we record this as "application/octet-stream", reflecting the fact that all we know is that we have a bytestream. When the identification process itself fails, we return no format identifier at all, because we literally know nothing at all about the format. (We happen to shift everything over to MIME types rather than simply using PUIDs, but the issues are the same.)
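As a rough illustration of that distinction (not the actual web archive code), the two outcomes could be kept apart like this. The function name `record_format` and the `identify` callable are hypothetical stand-ins for whatever identification tool is being wrapped.

```python
from typing import Callable, Optional

# Recorded when identification ran to completion but matched nothing.
UNKNOWN_FORMAT = "application/octet-stream"


def record_format(data: bytes,
                  identify: Callable[[bytes], Optional[str]]) -> Optional[str]:
    """Return a MIME type, the generic bytestream type, or None.

    `identify` is a hypothetical wrapper around an identification tool:
    it returns a MIME type, returns None when nothing matched, and
    raises an exception when the tool itself fails.
    """
    try:
        result = identify(data)
    except Exception:
        # The identification process itself failed: we know nothing,
        # so no format identifier is returned at all.
        return None
    # The process succeeded but found no format: all we know is that
    # we have a bytestream.
    return result if result is not None else UNKNOWN_FORMAT
```

A caller can then tell "no format found" (the octet-stream value) apart from "tool failed" (None) without needing a second field.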

Similarly, it can be useful to create identifiers for other edge cases:

  • Folders
  • Empty files
  • Soft or hard links
  • Various classes of block device

For web archives, only really the "empty file" case is worth considering. In the context of system analysis, as in personal digital archiving or forensics, these other cases become more important. This is probably why the 'fine free file' command is quite good at distinguishing between them.
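A minimal sketch of how those edge cases might be told apart on a POSIX-style filesystem, using only the Python standard library. The function name and the label strings are made up for illustration and are not PRONOM or DROID identifiers. Hard links are not covered here, since a hard link looks like any other regular file from this view.

```python
import os
import stat


def classify_entry(path: str) -> str:
    """Return an illustrative label for the kind of filesystem entry."""
    st = os.lstat(path)  # lstat, so that soft links are not followed
    mode = st.st_mode
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISDIR(mode):
        return "folder"
    if stat.S_ISBLK(mode):
        return "block-device"
    if stat.S_ISREG(mode):
        # A zero-length regular file has no bytes to identify a format from.
        return "empty-file" if st.st_size == 0 else "regular-file"
    return "other"
```

Only entries labelled "regular-file" would then need to be passed on to a format identification tool.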

answered Jul 15, 2014 by anjackson (2,950 points)
Talking around the office - a general rule of thumb should be "Don't have fields which are ambiguous".  eg "I know what to put in the field" and "I am confident that I don't need to use this field" is so much better than "What does this do?" and "Well I suppose we could use this field to mean that ...".

Flex - the guy who sits next to me - is creating a tool which populates a csv with extracted metadata.  He's filling every field.  For the stuff that's currently 'unknown' he's populating with the text string 'unknown' - so he knows that [blank] fields are actually a problem.

I'm blathering on.  More chocolate!
That's also how we tend to work, although one of these days I'll probably end up needing to distinguish an explicit metadata field with the value "Unknown" from a metadata field that's "Unknown" because it's not there!

But, to re-iterate what I said before, it is a scoping issue. You could instead decide to have two separate fields - one which remains blank and ambiguous, and another that captures whether the process failed or not (then rinse and repeat for each exceptional case). There's not necessarily a right answer here, but certainly I find when aggregating a lot of separate metadata fields, having 'special values' for the edge cases is easier to work with than multiplying the fields.
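To make the 'special values' option concrete, here is a small sketch under assumed names. The field list and both sentinel strings are hypothetical, and the second sentinel is one possible way of separating a value that could not be determined from a field that was never there.

```python
# Hypothetical field list for a metadata CSV row.
FIELDS = ["filename", "format", "size"]

UNKNOWN = "unknown"          # the tool looked but could not determine a value
NOT_PRESENT = "not-present"  # the field was absent from the source entirely


def fill_row(extracted: dict) -> dict:
    """Fill every field so that a blank cell always signals a real problem."""
    row = {}
    for name in FIELDS:
        if name not in extracted:
            row[name] = NOT_PRESENT
        elif extracted[name] in (None, ""):
            row[name] = UNKNOWN
        else:
            row[name] = str(extracted[name])
    return row
```

With this approach every cell carries a value, so one column does the work that would otherwise need a separate status field per exceptional case.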

p.s. Also blathering on - more coffee required!
+1 vote
Interesting question.  I think the short answer is no.  

In addition, I'd argue that it probably shouldn't.  I see PRONOM (or perhaps more relevantly DROID) as a tool for doing a particular job: a quick general parse to try to work out the format.  The interpretation of the result is context-dependent and so I'd argue best left to other pieces of software.

In most cases we (in Preservica) use DROID to determine what to do next (e.g., ideally run a more detailed characterisation tool to validate the identification and/or extract further properties).  I'm quite happy for the second (or third or fourth) tool to update the format identification result to something more specific should it be capable of doing that (e.g., some TIFF variant).  Some of this will need detailed information that DROID's algorithms may not be capable of dealing with. I don't see this as a big problem with DROID: it achieved its purpose by, in essence, telling us which tool to run next.

If DROID produces no result I think this is fine: we just record that.  We can report on it (real time or later) and attempt to do something about it (e.g., create a new signature and re-characterise later if needed).  It is true that we can't distinguish between something that is corrupted and something that is, say, just a snippet so not formally compliant with any format specification.  It would be useful to distinguish these but I think this is just a part of the general case of adding more format definitions (e.g., defining a 'format' for a "snippet of text"?).

If DROID reports multiple identifications we try to validate them all.  If one succeeds and the others fail we will update the format identification to just the successful one.  If, however, we can't work it out we just keep all identifications.  This limits future capability (e.g., we won't migrate a file if we don't have a single unique format identification) but again it can be reported on and the issue (if indeed it is an issue) resolved and dealt with as above: hopefully leading to a future re-characterisation once format specifications in PRONOM (or validation tools) have been improved.
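The rule described above could be sketched roughly as follows. This is an illustration of the logic only, not Preservica's implementation; `validate` is a hypothetical callable standing in for whatever validation tool applies to a candidate format.

```python
from typing import Callable, List


def resolve_identifications(candidates: List[str],
                            validate: Callable[[str], bool]) -> List[str]:
    """Narrow multiple identifications to one if exactly one validates."""
    valid = [puid for puid in candidates if validate(puid)]
    # Exactly one candidate validated: keep just that one.
    # Otherwise we can't work it out, so keep all identifications.
    return valid if len(valid) == 1 else list(candidates)


def can_migrate(identifications: List[str]) -> bool:
    """Migration needs a single unique format identification."""
    return len(identifications) == 1
```

The ambiguous case is kept rather than discarded, so it can still be reported on and re-characterised later.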

Andy mentioned some specific cases that are also worthy of discussion.  For empty files we can determine this by measuring the file size.  Likewise, we have a flag that determines if something is a folder.  Both of these are determined before DROID is run (and any files in these categories are excluded from further characterisation).

For embedded objects, our characterisation framework will first try to extract them and then will characterise them as stand-alone files using DROID (provided they are not empty files) and other tools as described above.

I have also blathered on for too long...

answered Jul 16, 2014 by RSharpe (160 points)