What is a file format?

+5 votes
2,778 views

Not to get too theoretical, but I think it is important to develop a common understanding of key concepts in the digital preservation field. 

File formats are often discussed but there are varying ideas put forward to explain what they are/what the term means. This variance can cause confusion and make it difficult to agree on common processes and tools to use for digital preservation purposes. 

Some sub-questions to help the discussion:

Is a file format a standard for use in formatting files?

Is a file format a common way of structuring the internal structure (format) of files?

Is a file format whatever The National Archives' DROID tool identifies?

Can the spreadsheet files created by Microsoft Excel 2007 SP2 with an ODS extension be considered OpenDocument Spreadsheet formatted files?

Can the (significantly different) spreadsheet files created by OpenOffice 3.x with the ODS extension be considered OpenDocument Spreadsheet formatted files?

Do we need multiple terms to refer to the various different types of concepts that we currently refer to as "file formats"?

asked May 21, 2014 by euanc (3,910 points)
edited May 21, 2014 by euanc
I wonder if PREFORMA will have a response to these questions. http://preforma-project.eu/

7 Answers

+7 votes

For what it's worth, we say:

"A file format is a method of storing digital information in a computer file, allowing its later use by computer systems or people. There are thousands of different file formats for different kinds of digital content and there may be several different versions of the "same" file format. 

A file format is often confused with the software most commonly used to create or use it. 
For example, we talk about "Microsoft Word" files, or "Acrobat PDF" files. Despite these 
naming conventions, in principle a file format is not bound to any particular software – 
even if this is sometimes the case in practice. 
 
A file format is like a language which is only spoken by certain pieces of software. In 
general, the greater the number of languages, and the fewer speakers there are for 
each, the harder it is to maintain the ability to read and understand them over long 
periods of time."

answered Jun 6, 2014 by dclipsham (380 points)
This is the definition I most adhere to and it makes me think: Using this definition a set of format trees could be developed. Signatures could be associated with points in the trees, e.g. "word 97-2003" could be a format and "word 97-2003 created by Office 2003 SP1" could be a subformat of that format, each with their own signature. Those that only cared about files being of "word 97-2003" format could make use of the high level signature and those needing to understand the potential differences introduced by the creating application could use that "lower level" signature.
There could also be an identifier (though probably not a signature) for the intended interaction environment. Files could be tagged with that ID (though probably not automatically) and that could be used for risk analysis (e.g. don't have the environment, your content is at risk).
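A minimal sketch of the format-tree idea suggested in this comment, assuming a toy model in which each node carries one byte-sequence signature. The `FormatNode` class, the `identify` function and the byte sequences are all invented for illustration; real signatures for these formats would look different.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormatNode:
    """One point in a hypothetical format tree."""
    name: str
    signature: bytes  # invented byte sequence that must appear in the file
    children: List["FormatNode"] = field(default_factory=list)

    def matches(self, data: bytes) -> bool:
        return self.signature in data


def identify(node: FormatNode, data: bytes) -> List[str]:
    """Return the chain of matching names, broadest first, most specific last."""
    if not node.matches(data):
        return []
    for child in node.children:
        path = identify(child, data)
        if path:
            return [node.name] + path
    return [node.name]


# Toy tree: these byte sequences are made up, not real Word signatures.
TREE = FormatNode(
    "word 97-2003",
    b"TOYDOC",
    [FormatNode("word 97-2003 created by Office 2003 SP1", b"APP=O2003SP1")],
)

generic_file = b"TOYDOC....body"
specific_file = b"TOYDOC....APP=O2003SP1....body"
```

Someone who only cares about the high-level format would read the first element of the result; someone who needs the creating application would read the last.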
Right now (personally, and my views at this point may not necessarily represent The National Archives' past, current, or future policy), I think rendering intent and canonical prescription will not always marry up (for examples, see any web page created and optimised for e.g. IE5 or Netscape 3), and my personally favoured approach is to make the bitstream in its original form wholly available and to allow the end user to choose how they interpret it, through whatever means they deem appropriate.

I also think where possible, heritage institutions should make available rendering options, but I'm conscious of costs and relative value.

Lately I've been thinking about video games (okay, I'm usually thinking about video games), but specifically about how devs would tailor their sound, colour schemes etc. to suit a range of hardware environments. Even back in the day, everybody knew that SVGA/SoundBlaster was king, but a person with EGA/Tandy, or even onboard or possibly no sound, could still experience the 'intellectual entity' of the video game. The experience will certainly have been different, but which version was 'true'? Is the canonical version in this case the developer's 'ideal rendering', or are all renderings fundamentally acceptable, and should we provide them all? Can we even possibly hope to?
Thanks for the detailed comment David. I have a fair bit to share regarding your questions and concerns. This whole discussion is really fascinating to me and does help to explain the difficulty in answering the question of this page.

I have discussed some answers to your last three questions here: http://www.openplanetsfoundation.org/comment/371#comment-371 . To summarise, I believe we should endeavour to provide at least one "representative" interaction environment for use in interacting with the content being preserved. Ideally we could provide multiple representative interaction environments but that may never be fiscally practical.
Providing a representative rendering environment would likely be the best we can do in the absence of knowledge of the intended interaction environment.

As for websites, I've discussed them many times in the past and can assure you that, for such content, it is really quite straightforward to recreate the original interaction environments: http://digitalcontinuity.org/post/22115350229/accessing-authentic-archived-websites-well-aaaww

Lastly, addressing the second paragraph of your comment, I really believe that the cost of alternative strategies is likely to be greater than the cost of maintaining interaction environments. I discussed that further here: http://digitalcontinuity.org/post/47460207305/reply-to-dshrs-comment
+3 votes

The FADGI glossary (http://www.digitizationguidelines.gov/term.php?term=fileformat) has this to say:

Set of structural conventions that define a wrapper, formatted data, and embedded metadata, and that can be followed to represent images, audiovisual waveforms, texts, etc., in a digital object. The wrapper component on its own is often colloquially called a file format. The formatted data may consist of one or more encoded binary bitstreams for such entities as images or waveforms, and/or textually-encoded data, often marked up with XML or HTML, for texts. The embedded metadata may be skeletal or extensive.

This definition has been tailored to fit the planning activities carried out by the FADGI File Format Subgroup. Meanwhile, in the digital library community, the broad concepts underlying the FADGI definition are often subsumed under the generic term format, although this usage does not generally require that all three elements (wrapper, bitstream, and metadata) be present at the same time. Here are two definitions for format from authoritative bodies in the field:

  • A set of syntactic and semantic rules for mapping between an information model and a serialized bit stream. Many formats can be grouped into loose categories, or families, sharing a general set of encoding rules that are further restricted or extended for the specific format or profile. A format version is considered a profile. (Combined definition from the Unified Digital Formats Registry (UDFR), slide 7 in the Unified Digital Formats Registry Stakeholder Meeting PowerPoint Slides; and JHOVE2, JHOVE2 glossary.)
  • The internal structure and encoding of a digital object, which allows it to be processed, or to be rendered in human-accessible form. A digital object may be a file, or a bitstream embedded within a file. (From the U.K. National Archives Digital Preservation Technical Paper Automatic Format Identification Using PRONOM and DROID.)

Additional definitions of format have been offered by the InterPARES 2 Project and the Library of Congress Sustainability of Digital Formats Planning Web site.

answered Jun 25, 2014 by kmur (230 points)
Thanks, Kate.  When we-all started in on the Format Sustainability Web site, we saw that "file format" was rather confining.  We felt it more useful to talk about "formatting" -- a broader concept that includes packages and wrappers and bundles but also includes encodings and other things.  The page you cite sketches our understanding of format in this broad sense.  We were (and still are) influenced and inspired by Steve Abrams, whose excellent perspectives are presented at the now-antique GDFR Web site: http://library.harvard.edu/preservation/digital-preservation_gdfr.html.
The GDFR website appears to have gone, but it made it into the Internet Archive: http://web.archive.org/web/20110721202425/http://www.gdfr.info/
+1 vote
I would suggest that there is no good answer to this question, as much as we all would like there to be one. A "file format" can be defined by many things and can have a range of purposes. Even a new and (supposedly) clearly defined file format can be adapted, extended, appropriated or twisted over time as new software implementations add proprietary functionality, interpret ambiguous definitions in different ways or simply fail to support all aspects of a format. As a consequence it's impossible to come up with a concise definition. At best a file format is a somewhat fuzzy entity.
 
A file format might be defined by a file format specification, a reference implementation for an application that renders it or indeed any other software that saves out data in a particular way. In practice it can often be a somewhat unclear combination of all three.
 
A file format might be designed to capture and store data in a carefully ordered and organised manner. Or it might simply capture the state of a software application at a particular time (eg. early MS Word formats).
 
Sheila Morrissey provides an excellent example of the challenges in understanding, defining, or indeed modelling, what a format actually is in: "The Network is the Format: PDF and the Long-term Use of Digital Content". This quote gives a hint at some of the complexity wrapped up in just this one example (but the whole paper is well worth a read):
"Older versions of the PDF specification included an appendix called “Implementation Notes”, which describes at least some of the deviations from the specification for which Acrobat reader attempts to compensate. These notes do not comprise a part of the ISO PDF 32000-1:2008 document. Further, these notes, while helpful, beg the question as to what we are to consider authoritative with respect to PDF format instances: the specification, or the behavior of the Acrobat reader application."
 
answered Jun 10, 2014 by prwheatley (310 points)
I think we need a qanda.meta to do this, else it's going to get very messy. Very quickly, the fuzzier the better. So (1) is the hardest to break. I reckon I could have a pretty good pop at the rest. Of course (1) doesn't really say anything useful in preservation terms, but I guess that's my point.

Incidentally I see exactly what you're after from that very clear example, but in practice you probably don't want to be precise about it. You just want a quick impression of what delights you might find in the file. To say it's a "doc" or "I saved it from Word" is probably enough.

Isn't the only precise definition of what a file is (in file format terms), actually the version of the software that created it? Anything else is just an observation of a file's characteristics that might loosely classify it in some way that we generally associate with the term "file format".
So, a file format:

Is a digital container for content - made to a specification (the container, not the content) - which allows human beings to view the content using a computer.

Sort of (1), (2) and (sort of) (9) above.

No?
@crouchmi if you think all formats have a specification, you're in for a nasty surprise! :-)
I can't tell how flippant you're being here, Mr Jackson, so I'll plow on:

1) Chaos and entropy are extant in this crazy mundane world we live in, but just as 'database' *implies* structured data -- 'format' *implies* specifications.  They may not be documented, heck - they may not be known outside of the skull of the mad scientist who is using them, but you encapsulate content in a file format to re-use the content in a consistent, predictable way. SHARE and REUSE.  No?

2) And I guess that it's not just human beings who access content contained in file formats - computers are potential users as well.  My bad.
Ah, it seems I misinterpreted 'specification' to mean some kind of documentation or definition independent of the specific software/user that reads or writes the bitstreams. Of course, something somewhere 'specifies' the content of a bitstream, but I don't tend to think of that as 'a specification' because it can be so weak/ambiguous. But you are right, it can still be considered a specification.
0 votes
In my experience, many people distinguish formats based on extensions. This means that people will distinguish a TSV from a CSV, but not an Excel ODS from an OpenOffice ODS or a PDF 1.3 from a PDF/A-3.

I don't think that's an exact or even good use of the term, but it's the most understandable use of it for most people. If the same content is somehow different when placed in a PDF 1.3 and a PDF/A-3, why do both of them carry the same .pdf at the end? Given that, maybe we should start speaking of file format versions for anything beneath the level of 3-letter abstraction.
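To make the extension-versus-version point concrete, here is a rough sketch that compares what the extension says with what the first bytes of a PDF say. It assumes the `%PDF-x.y` header sits at the very start of the file, which is a simplification; the helper names `extension_of` and `pdf_header_version` are invented for this example. Note that the header alone does not reveal PDF/A conformance, so this only separates header versions.

```python
import re
from pathlib import PurePath
from typing import Optional

# A PDF file begins with a header such as "%PDF-1.3".
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def extension_of(filename: str) -> str:
    """Return the lower-cased extension, e.g. '.pdf'."""
    return PurePath(filename).suffix.lower()


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version declared in the header, or None if there is none."""
    m = PDF_HEADER.match(data)
    if m is None:
        return None
    return f"{m.group(1).decode()}.{m.group(2).decode()}"


# Two toy byte strings that would both be saved with a .pdf extension.
old_pdf = b"%PDF-1.3\n% rest of file"
new_pdf = b"%PDF-1.7\n% rest of file"
```

Both toy files share the extension, yet `pdf_header_version` tells them apart, which is roughly the "file format version" level proposed here.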

Edit: As suggested in the comments, extensions have many failings. For example, a text file created on a Unix system could be JavaScript code, JSON data, or just text. MIME type does a better job of differentiating between these cases. It isn't perfect either, but I think a definition needs a practical output for it to be useful. Otherwise, we are stuck in the same epistemological problems.
answered May 22, 2014 by nkrabben (1,990 points)
edited Jun 5, 2014 by nkrabben
I like the approach, taking the most simple definition as the base and defining other terms "format version"(?) for referring to related, more complex, concepts. However there were and are many programs that create files without extensions. So incorporating extensions into the definition seems like it might limit it too much.
I agree that extensions are too limiting. There was a time on certain platforms where extensions were unnecessary. Maybe an approach based on MIME types would be more appropriate?
Since extensions are usually just three characters, they're often unintentionally used for unrelated formats. In cases like these, we'd need to refer to them as something like "the .xyz format from ABC corporation." I think extension is just one facet, even at the level of popular perception.
0 votes

This is how Microsoft defines the word format in their application for the application/msword mime type:

From any microsoft word application select "Save As..." from the "File" menu.  Enter a filename, make sure that "Normal" is specified for the file type, and click "Save".

If you wanted multiple terms I'd suggest there are the:

saved as file format (which is the file's lineage: the application that created a file, its version, and the set of options selected by the user when saving the file). This is how MS chose to describe the msword file format in the quote above.

open as file format (the file's interface: the set of intentions about how the file might be accessed e.g. I'm declaring that it is a PDF, that can be opened by PDF applications compliant with v 1.7+, and that it conforms with a PDF accessibility standard so it can be faithfully rendered in a screen reader)

In practice, I think it is the open as file format that we really care about & I think Pronom does a pretty good job of mapping that space. Of course, because applications inevitably make mistakes when declaring open as file formats, it can often be very helpful to know the saved as file format too.

answered Jun 6, 2014 by richardlehane (1,000 points)
I really like the saved as and openable as options but I also think we need something for the set consisting of the combination of saved as and open with. “open with” would be the configured software environments that the object was intended to be opened with in order to be interacted with.
Each object/file/set of files may have one or more “open with” environments. But an important distinction between “open with” and “open as” might be that some possible “open as” options may not be appropriate for the file for whatever reason (e.g. they misrepresent the content or distort it in some meaningful way), which would be why the “open with” option would be useful.
–1 vote
These are great questions to ask and I think a common understanding of these key concepts is important.

I have found myself asking these questions when working on collections of born-digital material. I use tools like DROID, JHOVE and FITS to determine whether the files are well formed and valid and are what they say they are. But I find myself forcing a few files to conform to the DROID standard just to get them through the system.

DROID is an ongoing project and needs to be updated frequently to stay ahead of changing file structures.

I think the best approach really depends on your digital preservation strategy: will forcing all files to conform to a standard make them easier to migrate in the future, or are you more interested in preserving the original file?
answered Jun 4, 2014 by thorsted (590 points)
I think it's important to understand what DROID is doing in order to answer this question.

DROID is scanning the bitstream of the file, comparing it against a list of signature sequences, and reporting back when it finds a match. Nothing less, nothing more.

A suitably skilled individual can easily create a 'file' that contains a matching byte sequence and nothing else, and DROID will report a match (See Ross Spencer's Skeleton Suite work for a tool that does exactly that - https://github.com/ross-spencer).
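As a rough illustration of the matching described above (not DROID's actual implementation, and not the PRONOM signature syntax, which is considerably richer), the sketch below checks the start of a bitstream against a small table of well-known magic numbers. The `SIGNATURES` table, `identify` and `skeleton` are names made up for this example.

```python
from typing import Dict, List

# A tiny, hand-picked table of leading byte sequences ("magic numbers").
# Real signature files carry offsets, wildcards and much more.
SIGNATURES: Dict[str, bytes] = {
    "PDF": b"%PDF-",
    "PNG": b"\x89PNG\r\n\x1a\n",
    "ZIP container": b"PK\x03\x04",
}


def identify(data: bytes) -> List[str]:
    """Report every entry whose byte sequence starts the bitstream."""
    return [name for name, magic in SIGNATURES.items() if data.startswith(magic)]


def skeleton(name: str) -> bytes:
    """Build a 'file' that holds the matching sequence and nothing else."""
    return SIGNATURES[name]
```

A skeleton built this way is reported as a match even though no viewer could render it, which is the sense in which identification says nothing about validity.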

So where has the sequence come from? Simply our internal research. We are a very small team who work on PRONOM as part of our wider roles. We are not world experts in every (or even any) particular file format, but we have enough experience of developing file format signatures to understand what we are looking for. The National Archives' (current) use case for PRONOM/DROID is to understand what types of formats are entering our collection, primarily so we can drive our own workflows and preservation strategies. We do this for our own business purposes and make our research freely available. I'm pleased and proud that other institutions and individuals (generally) seem to trust our data output.

What we are not, and have never asserted ourselves to be, is arbiters about what constitutes a canonical format. DROID simply has no notion of validation, and users who want validation should seek other appropriate tools for that, which they themselves judge to be fit for purpose.

As for registration trees, I do like this too, however it kind of falls down (currently) at the opening sentence. "All media types should be registered using the IANA registration procedures." I agree, they should be. But they're not and I'm not sure it's realistic that they ever will be. For our part, we (currently) tend to add a media type to PRONOM only when either the IANA register, or the vendor themselves have asserted a particular media type.
DROID/PRONOM have never attempted to assert anything about the validity of a file format; rather, they report that the byte stream matches particular identifying characteristics which are common to files of a certain format.  If both variants of ODS match the defined signature, they'll be identified as ODS.
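A toy illustration of that last point, assuming identification looked only at the `mimetype` entry that OpenDocument packages carry inside their ZIP container. This is a sketch, not how DROID actually identifies ODS, and `make_package` and `declared_type` are hypothetical helpers; the two packages stand in for the Excel and OpenOffice variants without reproducing either.

```python
import io
import zipfile
from typing import Optional

ODS_MIMETYPE = "application/vnd.oasis.opendocument.spreadsheet"


def make_package(content_xml: str) -> bytes:
    """Build a minimal ZIP package that declares itself as ODS."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry is written first and stored uncompressed.
        zf.writestr("mimetype", ODS_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.xml", content_xml)
    return buf.getvalue()


def declared_type(data: bytes) -> Optional[str]:
    """Return the package's declared media type, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        if "mimetype" not in zf.namelist():
            return None
        return zf.read("mimetype").decode("ascii").strip()


# Two packages with different internals but the same declaration.
variant_a = make_package("<a/>")
variant_b = make_package("<b>structured quite differently</b>")
```

Both variants come back with the same declared type although their contents differ, so any tool keyed on that characteristic would label both as ODS.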
Forgive me if I sounded like I was being negative toward DROID and the team. Thank you for the clarification. I think it is wise for everyone to know what tools like DROID are for. I was hoping to answer the question stated in this thread: "Is a file format whatever The National Archive's DROID tool identifies?"
Not at all - my latest response was to Euan's question about ODS! Non-threading comments boards are always prone to this :)
I'll just add that DROID and PRONOM are excellent, and particularly impressive given the size of the team!
–1 vote

Apologies for answering my own question, but one answer can be found on the Pronom website:

 

"File formats encode information into a form which can only be processed and rendered comprehensible by very specific combinations of hardware and software."

 

It does occur to me that many signatures in Pronom match with more than one of the things that match that definition. 

answered Jun 10, 2014 by euanc (3,910 points)
...