
Should data repositories remove illegal characters from filenames?

+1 vote
Should normalizing filenames be a standard curation practice in data repositories? There is consensus that filenames without illegal characters are better for interoperability and long-term preservation, but changing a filename could immediately break a program that relies on a certain name to operate.

In my mind, filename normalization falls in the same realm as format normalization: we normalize/migrate formats for more sustainable long-term preservation and access. Many preservation policies address this with data producers at the time of deposit.

If filename normalization occurs, what sorts of standards are available to address this? Simply inserting a hyphen (-) in place of whitespace seems unsustainable and potentially problematic. ISO 9660? Are there any repositories out there actively addressing this issue?
asked Nov 13, 2014 by CDearborn (130 points)

1 Answer

+2 votes

Short answer:

Ideally no


Long answer: 

If something in your processing system requires you to change the files you are taking in, in order to "preserve" them, then you might want to ask whether you are using the right system or process.

It might often be advantageous to create a derivative of an original with a different filename, but that should probably be considered a derivative rather than an original (it might be considered a new preservation master though).

On the other hand, if you document the renaming process and it is easily reversible, then it might be worthwhile, especially if it enables you to take other actions to preserve the content that you could not take without renaming the files (e.g. ingesting them into a problematic but ultimately more effective preservation system).
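
To illustrate what "documented and easily reversible" could look like, here is a minimal Python sketch. It is only one possible approach, not a standard: it assumes percent-encoding (via the standard library's `urllib.parse.quote`) is an acceptable naming scheme for the target system, and the function names and the manifest layout are hypothetical.

```python
import csv
import io
from urllib.parse import quote, unquote


def normalize_name(original: str) -> str:
    # Assumption: percent-encoding is acceptable on the target system.
    # quote() leaves ASCII letters, digits, and "_.-~" unchanged and
    # encodes everything else, including "%" itself, so no information is lost.
    return quote(original, safe="")


def restore_name(normalized: str) -> str:
    # Reverses normalize_name() exactly.
    return unquote(normalized)


def build_manifest(names):
    """Return CSV text documenting each original -> normalized mapping."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["original", "normalized"])
    for name in names:
        writer.writerow([name, normalize_name(name)])
    return buffer.getvalue()


if __name__ == "__main__":
    deposited = ["my data: v2.csv", "résumé (final).txt"]
    print(build_manifest(deposited))
    # Every name can be recovered from its normalized form.
    assert all(restore_name(normalize_name(n)) == n for n in deposited)
```

Storing the manifest alongside the files keeps the original names on record, so the renaming can be undone or audited later without relying on the code that performed it.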

answered Nov 13, 2014 by euanc (3,910 points)
edited Nov 14, 2014 by euanc