File Formats - Working Level
Who should read this?
This is likely to be of interest to researchers, institutional support staff, data managers and repository staff working with file formats for data storage, transmission and sharing.
What is a file format?
In simple terms, a file format describes the way information is organised in a computer file. File formats apply to documents, images, audio files, video files and research data sets, for example .doc or .pdf.
A comprehensive list of file formats can be found by a web search using the key term ‘file formats list’.
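As a rough illustration of what "the way information is organised" means in practice, many formats begin with a fixed byte signature that software uses to recognise them. The sketch below checks a handful of well-known signatures. The `SIGNATURES` table and the `identify` function are hypothetical names introduced for this example, and the list is far from complete.

```python
# Illustrative only: a small, incomplete table of leading byte signatures.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"II*\x00", "TIFF image (little-endian)"),
    (b"MM\x00*", "TIFF image (big-endian)"),
    (b"PK\x03\x04", "ZIP container (used by .docx and .odt)"),
]


def identify(header: bytes) -> str:
    """Return a format name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


print(identify(b"%PDF-1.7\n"))
print(identify(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(identify(b"plain text with no signature"))
```

A file extension such as .doc or .pdf is only a label; the signature and the internal layout that follows it are what software actually has to understand.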
There are many different file formats available for creating and storing data.
Choosing a suitable file format for data preservation and sharing is vital for sustaining future access to, and reuse of, that data. This may require careful analysis of the advantages of proprietary or open standards software to ensure that the chosen format supports future access, reuse and storage of the data.
The ICPSR Digital Preservation Management Tutorial provides a useful overview of obsolescence for file formats and software.
File format obsolescence
File formats can become obsolete for various reasons:
• Software / file formats are upgraded and the new software version no longer works with files created by the old version;
• The software that supports the format gets bought out by a competitor and withdrawn;
• The format falls into disuse or no-one writes software to support/implement it;
• The format is no longer compatible with current software or is not backwards compatible with older software.
As a result of this obsolescence, it may no longer be possible to access the file, read the file or reuse the data, either entirely or partially. Risks also emerge for users if access to the software required to read the format is restricted, or if the developer changes the licensing or cost of that software.
If data is stored using a format that is, or is about to become, obsolete, then it may be necessary to migrate to a more suitable format.
The alternative is to preserve the entire environment needed to access and/or use the data.
This approach either involves maintaining old computer hardware, together with the operating system and all the required software, or writing/obtaining special emulation software that recreates the software-operating environment within more recent systems.
Open and proprietary formats
It is important that organisations implement data management policies that conform to standards that manage the risk of file format obsolescence or degradation of stored information.
A proprietary format is one that is owned by an individual or a corporation. Some common examples of proprietary formats are: AutoCAD's DWG drawing format, the MP3 (MPEG Audio Layer 3) format and Adobe Photoshop's PSD native image format. Most proprietary formats are closed, meaning that neither the definition nor the development of the format is available to the public.
This means that data stored in the format can only be accessed using the format owner's software. Some formats are both open and proprietary, e.g. Adobe PDF and Microsoft OOXML.
An open format is one where the description of the format is available to the public.
Examples of open formats include the JPEG 2000, PNG and SVG standard image formats; ASCII, PDF, Open Document Format and Office Open XML format (the native format for recent versions of Microsoft Word) for text; HTML, XHTML, RSS and CSS for the web and NetCDF for some scientific data.
A lossless format, for example TIFF for images, retains the original detail of the data file.
A lossy format, such as JPEG for images, discards information permanently in order to reduce the file size, in effect lowering the quality of that data.
"Lossy" compression is a data encoding method that compresses data by discarding (losing) some of it. The procedure aims to minimize the amount of data that needs to be held, handled, and/or transmitted by a computer. Images become progressively coarser as the data that made up the original one is discarded (lost). Typically, a substantial amount of data can be discarded before the result is sufficiently degraded to be noticed by the user.
For audio file formats, the ubiquitous MP3 format is lossy, while the WAV format is lossless. The implications of re-saving or converting data from one format to another become apparent when the quality of that data is compromised by this removal of information. Uncompressed, non-lossy file formats take up a lot more storage space, which needs to be taken into account when budgeting for storage.
Note - metadata such as file title, description, date etc. is not removed during this process.
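To give a feel for the storage budgeting point, the following sketch estimates the size of one hour of audio stored as uncompressed PCM (as typically held in a WAV file) and as a constant-bitrate lossy file. The parameters (44.1 kHz, 16-bit, stereo, 128 kbit/s) are assumptions chosen for illustration, and the small overhead of file headers and metadata is ignored.

```python
def pcm_bytes(seconds: int, sample_rate: int = 44_100,
              bit_depth: int = 16, channels: int = 2) -> int:
    """Approximate size of uncompressed PCM audio, ignoring the file header."""
    return seconds * sample_rate * (bit_depth // 8) * channels


def bitrate_bytes(seconds: int, kilobits_per_second: int = 128) -> int:
    """Approximate size of a constant-bitrate lossy file, ignoring metadata."""
    return seconds * kilobits_per_second * 1000 // 8


one_hour = 60 * 60
wav_size = pcm_bytes(one_hour)
mp3_size = bitrate_bytes(one_hour)

print(f"uncompressed: {wav_size:,} bytes")  # 635,040,000 bytes
print(f"lossy:        {mp3_size:,} bytes")  # 57,600,000 bytes
print(f"ratio:        {wav_size / mp3_size:.1f}x")
```

Under these assumptions the uncompressed copy needs roughly eleven times the space of the lossy one, which is the kind of difference a storage budget has to absorb when lossless masters are retained.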
Lossless compression refers to ways of making data take up less storage space without losing any of the content. For long-term preservation, uncompressed formats are less risk-prone.
A lossless file that has been compressed can be restored to its original state, completely unchanged. In the case of lossy formats, the reduction in size is achieved by throwing selected data away (losing it).
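The difference can be seen in a minimal sketch. It uses Python's standard `zlib` module for the lossless case and a toy `quantise` function (a hypothetical stand-in, not how JPEG or MP3 actually work) for the lossy case. The sample data is made up for illustration.

```python
import zlib

# Hypothetical sample data: 8-bit values standing in for pixel or audio samples.
original = bytes(range(256)) * 4

# Lossless: compress, decompress, and compare with the original.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
lossless_ok = restored == original


def quantise(data: bytes) -> bytes:
    """Toy lossy step: keep only the top 4 bits of every sample."""
    return bytes(b & 0xF0 for b in data)


# Lossy: the discarded low bits cannot be reconstructed from the result.
lossy = quantise(original)
lossy_ok = lossy == original

print("lossless round trip identical:", lossless_ok)  # True
print("lossy copy identical:", lossy_ok)              # False
print("distinct values before:", len(set(original)), "after:", len(set(lossy)))
```

The lossless round trip returns every byte unchanged, while the quantised copy collapses 256 distinct values into 16, and no later processing can tell which original value each one came from.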
The compression process makes data more susceptible to "bit rot". Bit rot occurs when the small electric charge that stores a bit in memory disperses, possibly altering the stored data or program code. The risk is that a change of one bit in a compressed text file may cause major changes across the entire document, rendering it useless.
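The effect of a single damaged bit can be sketched as follows, assuming `zlib` as the compression method and a made-up text sample. The `flip_bit` helper is a hypothetical name used to simulate bit rot. Other compression formats may behave differently, for example by silently producing garbled output instead of reporting an error.

```python
import zlib

# Made-up sample text for illustration.
text = ("Research data should remain readable for decades. " * 40).encode("ascii")


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with one bit inverted, simulating bit rot."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


# Uncompressed: one flipped bit alters exactly one character.
damaged_plain = flip_bit(text, len(text) // 2)
changed = sum(a != b for a, b in zip(text, damaged_plain))

# Compressed: flip one bit in the middle of the stream, then try to read it.
compressed = zlib.compress(text)
damaged_compressed = flip_bit(compressed, len(compressed) // 2)
try:
    recovered = zlib.decompress(damaged_compressed)
    outcome = "unchanged" if recovered == text else "decompressed, but content differs"
except zlib.error as exc:
    outcome = f"could not be decompressed ({exc})"

print("characters changed in uncompressed copy:", changed)
print("compressed copy:", outcome)
```

In the uncompressed copy the damage stays local to one character, whereas the same single-bit change in the compressed copy is expected to make the stream unreadable as a whole.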
The importance of standards
Standard file formats are essential for effective data sharing. In many cases a research discipline will have a mandatory or preferred standard for saving and storing research data, e.g. SPSS data files for social science data sets.
Using appropriate file formats helps ensure the sustainability of the data being stored and of its access and reuse.
Retaining multiple formats
Retaining multiple formats and instances of data may add to the volume of data being stored and create synchronisation difficulties; however, doing so can reduce the risk of losing the original high-resolution file. The advantage of retaining a low-resolution, lossy-format file set is quick navigation and ease of transmission.
An alternative to keeping multiple formats is to use content management system software, e.g. Alfresco, that can convert stored data to multiple alternative formats on the fly. For example, a repository might store a text document in a gold-standard preservation format like DocBook XML, but provide a service that can also disseminate the document as HTML, PDF or Word, depending on the preference of the reader.
File format types should ideally be considered and decided upon before the commencement of data collection. Information lost by storing data using a lossy image, sound or video format cannot be recovered.
Migrating data from an unsuitable format to a more sustainable option is always difficult and expensive, and may in some cases be impossible.
Future File Formats
Development of file formats in the near future will likely incorporate information that pertains to geospatial platforms and environments.
Mobile technologies and the rise of virtual-reality-based data creation have prompted a number of consortiums, such as the Open Standards for Real-Time 3D Communication, to develop formats that conform with the standards of the International Organization for Standardization.
The growth of geospatial data also presents new challenges as raster, vector and grid formats develop alongside other formats such as the world file, which is used for geo-referencing a raster image such as a JPEG or BMP file. It is critical that organisations remain vigilant about, and responsive to, format sustainability in order to future-proof the access, re-use and compatibility of data.
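As an illustration of how a world file geo-references a raster image, the sketch below assumes the commonly documented six-line layout: x pixel size (A), rotation term (D), rotation term (B), y pixel size (E, usually negative), and the x and y map coordinates of the centre of the upper-left pixel (C and F). The sample values and the function names `parse_world_file` and `pixel_to_map` are hypothetical.

```python
def parse_world_file(text: str) -> tuple:
    """Read the six numeric lines of a world file in the order A, D, B, E, C, F."""
    a, d, b, e, c, f = (float(value) for value in text.split())
    return a, d, b, e, c, f


def pixel_to_map(col: float, row: float, params: tuple) -> tuple:
    """Apply the affine transform to turn a pixel position into map coordinates."""
    a, d, b, e, c, f = params
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Made-up example: 10-unit pixels, no rotation, arbitrary upper-left origin.
sample = """10.0
0.0
0.0
-10.0
500000.0
6000000.0
"""

params = parse_world_file(sample)
print(pixel_to_map(0, 0, params))     # (500000.0, 6000000.0)
print(pixel_to_map(100, 50, params))  # (501000.0, 5999500.0)
```

Because the image and its world file are separate files, both need to be preserved together; the raster alone carries no information about where it sits on the ground.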
An introduction to geospatial resources and formats
Preservation formats and display formats
High-resolution data, e.g. a lossless uncompressed bitmap, may require conversion to a .jpeg format for ease of visualisation online or transmission via email. Another example is a standard XML format, which is best rendered to HTML or PDF for ease of viewing or printing.
Consideration must therefore be given to the long-term preservation of data, taking into account the storage, display, visualisation, conversion and re-use of that data.
The US Library of Congress ‘Sustainability of Digital Formats’ website provides a dynamic overview of preservation and sustainability of digital formats.