
When should one keep database content in its original format, and when should one export or flatten it?

+2 votes
There is a lot of invaluable material in relational databases. The original, and in some cases proprietary, formats of some of these systems likely mean that as those systems become less available, or transition to newer versions which don't support older files, it will become increasingly difficult to access the content of a database through its original software. With that noted, exporting to more generic formats for tabular data, or to XML structures for capturing relational information, transforms the data and potentially introduces a range of quality assurance issues around the fidelity and authenticity of the data.

So, in what cases should one export or migrate data to more common or accessible formats, and in what cases should one stick with the original? Similarly, I would be curious to suss out the factors that would prompt an institution to go with one approach over another.
asked Mar 27, 2014 by tjowens (2,350 points)

2 Answers

+3 votes

Both, because I see this as two distinct goals:

  1. Discovery and access: for search and other content-oriented tasks, a simple standard format is easier to work with. Something like a CSV file should be easy to generate and to import into things like search engines.
  2. Preservation and access programs specific to this data: in this case, you really want to maintain the full fidelity of the original structure, including things like validation rules. I would recommend storing whatever native export format the database in question supports rather than the version-specific data files – for example, MySQL, Postgres, MS SQL Server, etc. have tools which will dump a database as a SQL file, including the extra data for any proprietary features you're using. This is more future-proof than storing binary files produced by a specific version, but avoids the loss of semantic meaning which you inevitably suffer when exporting to something like CSV.
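The split between the two goals can be sketched in a few lines. This is a minimal illustration using Python's built-in `sqlite3` and `csv` modules; the `items` table, its columns, and the file name are all hypothetical stand-ins, and in a real workflow the source would be the database being archived rather than one built in memory.

```python
import csv
import sqlite3

# Stand-in for the database being archived ("items" is a hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany("INSERT INTO items (title, year) VALUES (?, ?)",
                 [("Ledger A", 1891), ("Ledger B", 1902)])

# Goal 1: a flat CSV that a search engine or spreadsheet can ingest.
# The relational structure is lost here, which is why this copy serves
# discovery while a full dump serves preservation.
with open("items.csv", "w", newline="") as f:
    writer = csv.writer(f)
    cursor = conn.execute("SELECT id, title, year FROM items")
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
```

The CSV is deliberately lossy: it keeps the content but drops keys, types, and constraints, which is acceptable for indexing precisely because the preservation copy retains them.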

Two other thoughts:

  1. Most raw database files are likely to contain old data which has been deleted or replaced but whose underlying storage records have not yet been wiped or overwritten. This may or may not be desirable, but as with disk imaging it's a question you would need to ask, and one which does not arise with either approach above.
  2. I would be careful about using a custom format which is not widely used outside of the archiving community unless you have the resources and time to invest in tool development. Something like a mysqldump file is going to be loadable for a long time because it's the primary backup mechanism used by most MySQL users, and it will thus support all custom database extensions. In particular, if the goal is to have the ability to load the archive into a new database to work with, I would heavily test that capability before assuming it will be available.
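That load-it-back test can itself be automated as a round trip: dump to SQL text, reload into a fresh database, and compare. The sketch below uses Python's built-in `sqlite3` (and its `iterdump()` method) as a stand-in for mysqldump or pg_dump, since it needs no server; the `records` table is hypothetical, but the principle carries over.

```python
import sqlite3

# Source database to be archived (hypothetical content).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, note TEXT)")
src.executemany("INSERT INTO records (note) VALUES (?)",
                [("first",), ("second",)])
src.commit()

# The archival artefact: a plain-text SQL dump of schema plus data.
dump_sql = "\n".join(src.iterdump())

# Simulate the future reload into a brand-new database.
dst = sqlite3.connect(":memory:")
dst.executescript(dump_sql)
reloaded = dst.execute("SELECT id, note FROM records ORDER BY id").fetchall()
```

Running this kind of check at ingest time, rather than hoping the dump is loadable years later, is the point of the caution above.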
answered May 29, 2014 by acdha (320 points)
+2 votes
I think you should assess databases on a case-by-case basis, rather than adopt a single approach.

The main two criteria would be the nature of the data in the DB & how it was used: if it is a relatively simple thing that can be accurately represented in a single tabular structure, then sure, turn it into a CSV. If the DB was only ever used by end users to generate PDF reports, maybe you could just export all those reports.

Of course many relational databases aren't like that & in those cases I think flat formats like CSV or hierarchical ones like XML are a bad fit. But that doesn't necessarily mean you are stuck with the original. Why not migrate the DB into SQLite? It is open source, cross-platform, public domain and a single-file format. If you hand-write the schema during the migration you can add comments so the thing becomes (relatively) self-describing. RDF using something like http://www.w3.org/TR/r2rml/ might work too & would allow you to integrate that particular DB with other ones in your collection.
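The self-describing part works because SQLite stores the text of each CREATE statement verbatim in its `sqlite_master` table, comments included. A minimal sketch via Python's built-in `sqlite3`, with a hypothetical `accessions` table (use a file path instead of `:memory:` for a real migration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real archive
# Hand-written schema with inline comments; SQLite keeps the CREATE TABLE
# text verbatim, so the comments travel inside the database file itself.
conn.executescript("""
CREATE TABLE accessions (
    id    INTEGER PRIMARY KEY,  -- internal identifier from the source DB
    title TEXT NOT NULL,        -- item title as entered by cataloguers
    year  INTEGER               -- year of accession, four digits
);
""")

# Any future user can read the documented schema back out of the file:
stored_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'accessions'"
).fetchone()[0]
```

So the documentation lives in the single file you are preserving, not in a separate README that can drift or go missing.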
answered May 29, 2014 by richardlehane (1,000 points)