It is important to give some thought to the life of your research data after your project is completed and what file formats will enable long-term accessibility and reusability.
When you create digital research data, the file format is often dictated by how you choose to collect, analyse and store your data, by the hardware being used or by the availability of software. It can also be determined by discipline-specific standards and customs.
The file formats used when collecting and working with research data are not always ideal for long-term accessibility and reusability. You should therefore consider which formats best suit each purpose.
You may find that you need to use one format for your own data recording and analysis and another for data archiving and sharing. Where possible, convert your files to open, non-proprietary formats that can be used on any operating system, to maximise accessibility and interoperability.
We do not know which data formats will still be readable in 10 or 20 years' time, but the following guidance may help inform decision making:
Open, non-proprietary file formats - where documentation that describes the format is complete and freely available - are far more likely to remain usable even if the software that created them is no longer available or functioning. If the only option for reading or accessing the data is the program that created the file, it is most likely a proprietary, closed format.
Example: TIFF is a good format for the long-term preservation of digital image files. The TIFF image specification was published in 1992 and is available for anyone to consult.
Ubiquitous file formats that are in very wide circulation are more likely to continue to be readable for some time. Even if the creating application becomes obsolete, it is likely that the critical mass of users will ensure the supply of necessary import routines.
For cutting-edge research projects, where the use of new techniques for data creation and analysis necessitates the use of obscure and niche file formats, ubiquity will not be a major consideration.
Examples: Popular file formats include Microsoft Office files such as MS Word and MS Excel.
Compression is often introduced to make a file smaller and easier to store. However, some forms of compression discard information, which leads to data loss. Uncompressed versions of files should be maintained where possible.
Example: Uncompressed TIFF is the de facto standard for storage of image files. Use of other formats that apply lossy compression (e.g. JPEG) will result in a lower image quality.
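The difference between lossless and lossy compression can be sketched in a few lines of Python. This is a toy illustration only: the `samples` values are made up, `zlib` stands in for a lossless scheme, and the `quantise` helper is a hypothetical stand-in for the kind of detail a lossy scheme discards. It is not how JPEG actually works.

```python
import zlib

# Made-up 8-bit sample values standing in for image pixels (an assumption).
samples = bytes([12, 13, 15, 200, 201, 203, 90, 91])

# Lossless: compressing and decompressing returns exactly the original bytes.
restored = zlib.decompress(zlib.compress(samples))
lossless_ok = restored == samples


def quantise(data: bytes, step: int = 16) -> bytes:
    """Toy lossy step: round every value down to a multiple of `step`."""
    return bytes((value // step) * step for value in data)


# Lossy: the fine detail is gone and cannot be recovered from the result.
lossy = quantise(samples)
lossy_ok = lossy == samples

print("lossless round trip identical:", lossless_ok)
print("lossy result identical:", lossy_ok)
```

Once the fine differences between neighbouring values have been discarded, no later conversion can bring them back, which is why keeping an uncompressed or losslessly compressed master copy matters.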
Plain text (ASCII) formats can be opened in a basic text editor, and the text within them should be readable. If your data can be viewed successfully in a text editor, it should remain readable into the future.
Example: A CSV file can be opened within a text editor and the data can be inspected within the text editor or can be imported into a range of other software packages.
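As a minimal sketch of why plain-text formats age well, the Python below writes a small table as CSV and reads it back using only the standard library. The column names and values are invented for illustration.

```python
import csv
import io

# Invented example data (an assumption, not from any real dataset).
rows = [
    ["sample_id", "temperature_c"],
    ["A1", "21.5"],
    ["A2", "19.8"],
]

# Write the rows as CSV into an in-memory text buffer.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# The stored form is ordinary text that any text editor can display.
is_ascii = csv_text.isascii()

# Reading it back needs no specialist software, only a CSV parser.
read_back = list(csv.reader(io.StringIO(csv_text, newline="")))

print(csv_text)
print("plain ASCII:", is_ascii)
print("round trip identical:", read_back == rows)
```

Note that every value comes back as text: CSV does not record whether a column held numbers or dates, so it is worth documenting that alongside the file.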
PDF/A is the archival version of the PDF standard. The original version, PDF/A-1, is based on PDF 1.4; it reduces the complexity of some of the more recent versions of PDF and ensures that all necessary information is embedded in the file. Because PDF/A does not rely on particular system fonts being present on the device on which it is accessed, it is more likely to be readable in the future.
Original files from the creating application often contain more information than those that are subsequently derived from them. Derived files may be of poorer quality due to the introduction of compression or the loss of other information such as metadata. Derived files may also be harder to extract information from.
Example: An original MS Word document is easier to preserve for the long term than a PDF version of the same file.
Many data repositories specify the file formats they will accept. These are formats chosen by the repository to help it keep data usable over the long term.
The file formats recommended by the UK Data Service and the data requirements table of the Archaeology Data Service, for example, offer useful lists of preferred formats that can help you to choose the file formats to use.
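One way to apply such a list is to check a set of file names against it before deposit. The sketch below assumes a hypothetical `PREFERRED_EXTENSIONS` set and made-up file names. It is not the actual list of any repository, so substitute the formats your chosen repository publishes.

```python
from pathlib import PurePath

# Hypothetical preferred extensions; replace with your repository's own list.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".tif", ".tiff", ".pdf"}


def needs_conversion(filenames: list[str]) -> list[str]:
    """Return the file names whose extension is not in the preferred set."""
    return [
        name
        for name in filenames
        if PurePath(name).suffix.lower() not in PREFERRED_EXTENSIONS
    ]


# Made-up file names for illustration.
files = ["survey.csv", "notes.docx", "scan_001.TIF", "model.sav"]
print(needs_conversion(files))
```

An extension check is only a first pass: it cannot tell, for instance, whether a .pdf file is actually PDF/A, or whether a TIFF file is uncompressed.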
Native Google files (e.g. Google Docs and Google Sheets) will need to be downloaded into different file formats before deposit with a data repository. You should ensure that the file format chosen adequately and accurately captures the content of the item, e.g. that calculated values in spreadsheets are retained or comments within documents captured.
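A simple sanity check after export is to confirm that spreadsheet cells hold calculated values rather than formula text. The sketch below assumes the sheet was downloaded as CSV and uses a hypothetical helper, `find_formula_cells`, with invented cell contents. It treats any cell beginning with `=` as an unevaluated formula, which is a heuristic rather than a guarantee.

```python
import csv
import io


def find_formula_cells(csv_text: str) -> list[tuple[int, int, str]]:
    """Return (row, column, text) for cells that look like formulas.

    Rows and columns are counted from 1. A cell is flagged when its text
    starts with "=", a heuristic for formula text that was not evaluated.
    """
    flagged = []
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    for row_number, row in enumerate(reader, start=1):
        for column_number, cell in enumerate(row, start=1):
            if cell.strip().startswith("="):
                flagged.append((row_number, column_number, cell))
    return flagged


# Invented export: the last cell kept its formula text instead of a value.
exported = "item,cost\npens,10\npaper,20\ntotal,=SUM(B2:B3)\n"
print(find_formula_cells(exported))
```

A check like this complements, but does not replace, opening the exported file and comparing it by eye with the original, particularly for comments and formatting that a script cannot easily detect.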
It's important to carefully consider the file formats you are going to use as conversion to different file formats will add to your workload.
You should also consider the software and/or application you are using to present your research data.
If you are using new, specialist and/or expensive software, you should consider converting your data to a format that can be opened in a more widely available, user-friendly application, if feasible.