File Formats
File formats are an important consideration in managing research data.
To keep your data usable and accessible over time, you need to make sure it is stored in durable file formats. This applies throughout the life cycle of a research project and to any long-term access required after the project is completed, for example to meet a data retention period.
Not all file formats are durable over time or compatible with the need to share data with others. There is also a distinction between the types of file formats that are optimal for presentation versus those optimal for preservation and longer term access.
Considerations include:
The National Archives of Australia has documented suggested formats for 'born-digital' files to support preservation, accessibility and interoperability. Alternatively, you can review the table below.
Long Term Preservation
There are a number of resources available to assist you in determining an appropriate file format for your data. The following table, prepared by the UK Data Archive, lists the data formats identified by the Archive as optimal for long-term preservation of data. These can serve as a guide.
Type of data | Acceptable formats for sharing, reuse and preservation | Other acceptable formats for data preservation |
Quantitative tabular data with extensive metadata: a dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data | SPSS portable format (.por); delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information; some structured text or mark-up file containing metadata information, e.g. DDI XML file | proprietary formats of statistical packages, e.g. SPSS (.sav), Stata (.dta); MS Access (.mdb/.accdb) |
Quantitative tabular data with minimal metadata: a matrix of data with or without column headings or variable names, but no other metadata or labelling | comma-separated values (CSV) file (.csv); tab-delimited file (.tab); including delimited text of given character set with SQL data definition statements where appropriate | delimited text of given character set - only characters not present in the data should be used as delimiters (.txt); widely-used formats, e.g. MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf) and OpenDocument Spreadsheet (.ods) |
Geospatial data: vector and raster data | ESRI Shapefile (essential - .shp, .shx, .dbf, optional - .prj, .sbx, .sbn); geo-referenced TIFF (.tif, .tfw); CAD data (.dwg); tabular GIS attribute data | ESRI Geodatabase format (.mdb); MapInfo Interchange Format (.mif) for vector data; Keyhole Mark-up Language (KML) (.kml); Adobe Illustrator (.ai), CAD data (.dxf or .svg); binary formats of GIS and CAD packages |
Qualitative data: textual | eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml); Rich Text Format (.rtf); plain text data, ASCII (.txt) | Hypertext Mark-up Language (HTML) (.html); widely-used proprietary formats, e.g. MS Word (.doc/.docx); some proprietary/software-specific formats, e.g. NUD*IST, NVivo and ATLAS.ti |
Digital image data | TIFF version 6 uncompressed (.tif) | JPEG (.jpeg, .jpg) but only if created in this format; TIFF (other versions) (.tif, .tiff); Adobe Portable Document Format (PDF/A, PDF) (.pdf); standard applicable RAW image format (.raw); Photoshop files (.psd) |
Digital audio data | Free Lossless Audio Codec (FLAC) (.flac) | MPEG-1 Audio Layer 3 (.mp3) but only if created in this format; Audio Interchange File Format (AIFF) (.aif); Waveform Audio Format (WAV) (.wav) |
Digital video data | MPEG-4 (.mp4); motion JPEG 2000 (.mj2) | |
Documentation and scripts | Rich Text Format (.rtf); PDF/A or PDF (.pdf); HTML (.htm); OpenDocument Text (.odt) | plain text (.txt); some widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/.xlsx); XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0 |
Source: UK Data Archive, http://www.data-archive.ac.uk/create-manage/format/formats-table
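As a rough illustration of how the table above might be used in practice, the following Python sketch checks a file name against a small lookup built from some of its rows. The names FORMATS and classify, the data type keys, and the tier labels "preferred" and "other" are assumptions made for this example and are not part of the UK Data Archive guidance. The lookup covers only a subset of the table, and it checks the file extension only, so it cannot confirm conditions such as "TIFF version 6 uncompressed" or "only if created in this format".

```python
from pathlib import Path

# Hypothetical lookup covering a subset of the table above.
# "preferred" = acceptable formats for sharing, reuse and preservation
# "other"     = other acceptable formats for data preservation
# The "tabular" entry merges the two quantitative tabular rows.
FORMATS = {
    "tabular": {
        "preferred": {".por", ".csv", ".tab"},
        "other": {".sav", ".dta", ".mdb", ".accdb", ".txt",
                  ".xls", ".xlsx", ".dbf", ".ods"},
    },
    "text": {
        "preferred": {".xml", ".rtf", ".txt"},
        "other": {".html", ".doc", ".docx"},
    },
    "image": {
        # Only TIFF version 6 uncompressed is preferred; the version
        # cannot be determined from the extension and is not checked.
        "preferred": {".tif"},
        "other": {".jpeg", ".jpg", ".tiff", ".pdf", ".raw", ".psd"},
    },
    "audio": {
        "preferred": {".flac"},
        "other": {".mp3", ".aif", ".wav"},
    },
    "video": {
        "preferred": {".mp4", ".mj2"},
        "other": set(),
    },
}


def classify(filename: str, data_type: str) -> str:
    """Return 'preferred', 'other' or 'not listed' for a file name."""
    suffix = Path(filename).suffix.lower()  # compare case-insensitively
    tiers = FORMATS[data_type]
    for tier in ("preferred", "other"):
        if suffix in tiers[tier]:
            return tier
    return "not listed"


if __name__ == "__main__":
    for name, kind in [("survey.CSV", "tabular"),
                       ("interview.docx", "text"),
                       ("clip.avi", "video")]:
        print(f"{name}: {classify(name, kind)}")
```

A result of "other" or "not listed" could be treated as a prompt to consider exporting a copy in one of the preferred formats before the data is archived.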