LibGuides: Researcher Skills Toolkit: Organise and manage

Manage data

Organise and manage

Documenting your data is good practice and will help to ensure that it is optimised for reuse.

Factors to consider when collecting, organising, and managing data:

Organising and structuring data
Data formats
File naming conventions
Metadata
Data analysis tools
Data cleaning
Data containerisation
Privacy, sensitivity, and confidentiality considerations.

Consider how best to structure a filing system for the elements of research that are likely to be created as part of the process. Where multiple people are involved in collecting and interacting with the data, it is important to adopt a clear structure to avoid duplicated or misplaced information.

Use a folder structure that represents your approach to your research and is meaningful to you
Use folders and sub-folders to group data
Do not let your folder structure go too deep, and do not create overlapping sub-categories of folders
Be consistent (dates and formats, vocabulary, version, revision number). Use a standard date format, for example ISO 8601 (YYYYMMDD or similar).
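
As a minimal sketch of the points above, the following Python snippet creates a shallow folder structure with a YYYYMMDD date in the top-level folder name. The project name and the sub-folder names are hypothetical placeholders chosen for illustration, not a recommended standard.

```python
from datetime import date
from pathlib import Path
import tempfile

# Hypothetical sub-folders; adapt these to your own research approach.
SUBFOLDERS = [
    "01_raw_data",
    "02_processed_data",
    "03_analysis",
    "04_outputs",
    "05_documentation",
]

def create_project_folders(root: Path, project: str, start: date) -> Path:
    """Create <root>/<YYYYMMDD>_<project>/ with a single level of sub-folders."""
    project_dir = root / f"{start.strftime('%Y%m%d')}_{project}"
    for name in SUBFOLDERS:
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    return project_dir

# Run against a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    created = create_project_folders(Path(tmp), "CoralSurvey", date(2024, 3, 5))
    created_names = sorted(p.name for p in created.iterdir())
    print(created.name)
    print(created_names)
```

The numeric prefixes keep the folders in a predictable order, and the single level of sub-folders avoids a structure that is too deep.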

The type of research being conducted will determine options for documenting and recording the research data collected.

Other researchers in your field may have suggestions and understand any limitations of software and instruments that are being considered for use.

Keep the following points in mind when selecting data formats:

Consider data formats for long term access at the beginning of your project
Good practice is to use open standard file formats as opposed to proprietary formats:

Text documents – use TXT or RTF
Spreadsheets – use CSV
Images – use TIFF
Audio files – WAV
Video files – MP4 (see UK Data Service. Recommended formats)

Consider using data formats that are widely known and adopted - in the future you may no longer be able to access your files with newer versions of software and different file formats
Consider saving data in plain text format for long-term preservation, i.e. in ASCII text format for future porting and accessibility
The best formats for preservation are unencrypted and uncompressed
Where third-party data is used, it is good practice to keep a record of the source and to request any copyright permissions in advance of publication.
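
The format recommendations above can be turned into a simple check. The sketch below flags files whose extension is not one of the open formats listed in this guide. The mapping and the example file names are assumptions made for illustration (for instance, treating both .tif and .tiff as TIFF), so extend the mapping to suit your discipline.

```python
from pathlib import Path

# Extensions taken from the list above; the grouping names are illustrative.
PREFERRED_EXTENSIONS = {
    "text": {".txt", ".rtf"},
    "spreadsheet": {".csv"},
    "image": {".tif", ".tiff"},
    "audio": {".wav"},
    "video": {".mp4"},
}

def is_preferred_format(filename: str, kind: str) -> bool:
    """Return True if the file extension is a preferred format for this kind of data."""
    return Path(filename).suffix.lower() in PREFERRED_EXTENSIONS[kind]

def flag_non_preferred(files: dict[str, str]) -> list[str]:
    """Return the names of files that may need converting before long-term storage."""
    return [name for name, kind in files.items() if not is_preferred_format(name, kind)]

# Hypothetical project files, each labelled with the kind of data it holds.
project_files = {
    "survey_results.xlsx": "spreadsheet",
    "survey_results.csv": "spreadsheet",
    "site_photo.TIFF": "image",
    "interview01.mp3": "audio",
}
to_convert = flag_non_preferred(project_files)
print(to_convert)
```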

Be consistent

Use descriptive and contextual file names:

Project name/experiment/type of data/version
Agree on a date format. For dates, it is recommended to use YYYYMMDD. Numbering can influence the location of a file in a list and therefore what file might be used
Use leading zeroes to make sure files are stored sequentially (e.g. 01, 02, …)
Avoid long file names
Avoid using special characters or spaces as some software may not recognise these. Other options are underscores, dashes, and camel case (e.g. InterviewTranscript)
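
One way to apply these conventions consistently is to generate file names with a small helper rather than typing them by hand. The following is a sketch only: the function name, the order of the elements, and the underscore separator are assumptions, and any agreed convention would work as well.

```python
import re
from datetime import date

def build_file_name(project: str, experiment: str, data_type: str,
                    when: date, version: int, extension: str) -> str:
    """Build <project>_<experiment>_<datatype>_<YYYYMMDD>_v<NN>.<ext>."""
    parts = [project, experiment, data_type]
    # Remove spaces and special characters, keeping letters, digits, and dashes.
    cleaned = [re.sub(r"[^A-Za-z0-9-]", "", part) for part in parts]
    stamp = when.strftime("%Y%m%d")
    # A leading zero on the version number keeps files in sequence when sorted.
    return f"{'_'.join(cleaned)}_{stamp}_v{version:02d}.{extension.lstrip('.')}"

example_name = build_file_name(
    "CoralSurvey", "Reef Site 3", "InterviewTranscript",
    date(2024, 3, 5), 2, "txt",
)
print(example_name)
```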

File versioning

Over time, elements of the project will be updated such as ‘Results’ sections in documents. Version control will ensure that changes to the original file are documented and that previous versions can be accessed and checked.

Use revision numbers (e.g. Project document v1.0, Project document v1.1)
Use application-specific codes in the file names (TIFF, MOV)
Include a document history or version control section in your documents and keep last updated information up to date
Keeping track of version history of your documents is good practice as it allows you and others to understand how your document was created, developed, and changed over time
Decide on and record file version and versioning format for redrafting and corrections and revisions of documents
Consider using version control software, such as Mercurial
Document your file version conventions
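
The revision numbering shown above (v1.0, v1.1) can be incremented programmatically so that the format stays consistent. This sketch assumes names end in a suffix of the form vMAJOR.MINOR. Treating corrections as minor revisions and redrafts as major revisions is one possible convention, not a rule from this guide.

```python
import re

# Matches a trailing version suffix such as "v1.0" or "v12.3".
VERSION_PATTERN = re.compile(r"v(\d+)\.(\d+)$")

def next_version(name: str, major: bool = False) -> str:
    """Return the name with its version suffix increased by one revision."""
    match = VERSION_PATTERN.search(name)
    if match is None:
        raise ValueError(f"No version suffix like 'v1.0' found in {name!r}")
    major_no, minor_no = int(match.group(1)), int(match.group(2))
    if major:
        # A major revision resets the minor number to zero.
        major_no, minor_no = major_no + 1, 0
    else:
        minor_no += 1
    return f"{name[:match.start()]}v{major_no}.{minor_no}"

minor_update = next_version("Project document v1.0")
major_update = next_version("Project document v1.1", major=True)
print(minor_update)
print(major_update)
```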

Documenting file name conventions

Include a document history or version control section in your documents and keep last updated information up to date
List your file naming convention in the Data Management Plan.

Tools for 'bulk' renaming of files

Bulk renaming tools provide the ability to rename many files in a single action. Several bulk file renaming tools are available, including Bulk Rename Utility (Windows, free), Renamer (Mac), and RenameIT.
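
Where a dedicated tool is not available, a short script can do the same job. The sketch below is not how any of the tools named above work internally. It is a plain Python illustration that first builds a list of planned renames, so they can be reviewed, and then applies them. The prefix and file names are hypothetical.

```python
import tempfile
from pathlib import Path

def plan_renames(folder: Path, prefix: str) -> list[tuple[Path, Path]]:
    """Pair each file with a new name of the form <prefix>_<NN><suffix>."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    # Use at least two digits so that names sort sequentially (01, 02, ...).
    width = max(2, len(str(len(files))))
    return [
        (old, old.with_name(f"{prefix}_{i:0{width}d}{old.suffix.lower()}"))
        for i, old in enumerate(files, start=1)
    ]

def apply_renames(plan: list[tuple[Path, Path]]) -> None:
    """Rename the files, refusing to overwrite anything that already exists."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(new)
        old.rename(new)

# Demonstration in a temporary folder so that no real files are changed.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["scan c.TXT", "scan a.TXT", "scan b.TXT"]:
        (folder / name).write_text("example")
    plan = plan_renames(folder, "Transcript")
    apply_renames(plan)
    renamed = sorted(p.name for p in folder.iterdir())
    print(renamed)
```

Reviewing the plan before calling apply_renames gives a chance to catch mistakes, and it is worth testing on a copy of the data first.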

Metadata is data about data. Metadata ensures that data is discoverable in databases and repositories and increases the chance that the data will be reused. Metadata also outlines provenance - how the data was collected and where it has come from.

Metadata can describe a single item or a collection. What is relevant will depend on the item that is being described.

Metadata includes the following elements:

Descriptive: Name of the creator, location, time, and date, who/what the item is about
Technical: Specifications of the equipment used to create the item, format, etc.
Access and rights: Defining who can access the content
Preservation: Information to assist with longer-term use and access.
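
To make the four groups of elements concrete, the sketch below stores a metadata record as JSON alongside a data file and checks that no group has been left empty. The field names and values are invented for illustration and do not follow any formal metadata standard. A repository or discipline standard, where one applies, should take precedence.

```python
import json

# A hypothetical metadata record for a single audio recording.
metadata = {
    "descriptive": {
        "title": "Interview 01",
        "creator": "A. Researcher",
        "location": "Field site 3",
        "date_collected": "2024-03-05",
        "subject": "Community attitudes to reef management",
    },
    "technical": {
        "format": "WAV",
        "equipment": "Handheld digital recorder",
    },
    "access_and_rights": {
        "access": "Project team only",
    },
    "preservation": {
        "storage_note": "Uncompressed original retained",
    },
}

REQUIRED_GROUPS = ["descriptive", "technical", "access_and_rights", "preservation"]

def missing_groups(record: dict) -> list[str]:
    """Return the metadata groups that are absent or empty in the record."""
    return [group for group in REQUIRED_GROUPS if not record.get(group)]

# Serialise the record so that it can be saved next to the data file.
record_json = json.dumps(metadata, indent=2)
print(record_json)
print(missing_groups(metadata))
```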

Metadata Standards

Some disciplines have developed standard vocabularies to help with data discovery of concepts. For example, in the medical field MeSH (Medical Subject Headings) terms are often used.

The most successful vocabularies are those where terms are clearly defined, so it’s important to always check definitions before using terms. If definitions do not fit the research, then options exist to create or add to vocabularies. This may be applicable where research is moving into a new area.

Check ARDC’s Research Vocabularies Australia (RVA) for examples of controlled vocabularies that could be used. It is also possible for Australian research organisations to publish, re-purpose, create, and manage their own controlled vocabularies. Vocabularies can change over time and this service enables management of new versions while retaining superseded versions.

The University facilitates access to a range of research and data analysis tools available under site licence, for purchase, or as open-source software. These tools cover a range of functions, including:

Analysis and visualisation – such as NVivo, SAS, SPSS, ArcGIS, and JMP
Data and workflow management – DIVER and LabVIEW
Data collection – such as QuestionPro
Database systems – Microsoft SQL Server and MySQL
Operating systems and virtualisation – Red Hat, Ubuntu and VirtualBox
Programming – Java, Mathematica, MATLAB, Microsoft Visual Studio, Python
Simulation, design, and modelling – ANSYS, ArchiCAD, AutoCAD, Blender, Creo Parametric, CPLEX Optimizer
Utilities – including Adobe Creative Cloud, EndNote, Covidence.

Training is available for some programs – check IT Services, The Library, Graduate Research and Research Advantage pages for details.

Also check ARDC’s Working with research software pages.

Collaborative data tools and platforms

Collaboration tools allow you to work on research projects with others within or external to the University. The right tool for your project and your team will depend on:

The type of information or data which needs to be shared
Whether the data requires a specific platform to operate
Whether there are any cross-institutional collaborators.

ARDC-Supported Research Platforms are online environments that draw together research data, models, analysis tools and workflows to support collaborative research across institutional and discipline boundaries.

Before commencing any analysis, it is important to check the raw data for completeness, identify any gaps, and assess how easily the data could be analysed in its current state. Cleaning the data may be necessary to produce a form that can be analysed.

The choice of program will depend on the type of data involved. It is important to note any previous data cleaning that was undertaken, including what protocols were employed and why it was required. Data cleaning may need to be undertaken manually where a program does not identify all the issues in the raw data that need to be corrected before it can be properly analysed.

OpenRefine is an example of a data-cleaning tool. It can be used to correct text where there are numerous variations in the format of entries, such as date formats, or to combine duplicated columns. The data is likely to still need checking, but this can help reduce the amount of editing required.
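
The kind of correction described above, bringing varied date entries into one format, can be illustrated in a few lines of Python. This is not OpenRefine itself, only a sketch of the idea. The list of accepted input formats is an assumption and would need to match the variations actually present in your data. Entries that cannot be parsed are returned as None so that they can be checked manually.

```python
from datetime import datetime
from typing import Optional

# Input formats to try, in order; extend to suit the data being cleaned.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%Y%m%d"]

def normalise_date(value: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

raw_entries = ["2024-03-05", "05/03/2024", "5 March 2024", " 20240305 ", "March 2024"]
cleaned_entries = [normalise_date(entry) for entry in raw_entries]
print(cleaned_entries)
```

Note that an entry such as 05/03/2024 is ambiguous between day-first and month-first conventions. The sketch assumes day-first, and that choice should be recorded with the other cleaning protocols.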

Once the data and analysis tools have been decided, data containers can help to record all the elements needed to reproduce the same research outcomes.

These containers hold all the components needed to analyse the data. They can be shared with others and limit the need to remember all the elements used.

Containerisation also addresses the issue of changing software versions and automated software upgrades; this is useful when the research takes place over several years and may require different data formats over time.
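
The following sketch is not a container. It only records the software versions in use for a Python-based analysis, which is the kind of information a container image would fix in place. The function name and the structure of the output are assumptions made for illustration.

```python
import json
import platform
from importlib import metadata

def describe_environment() -> dict:
    """Collect the Python version, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }

environment = describe_environment()
# The JSON text could be saved alongside the analysis outputs.
environment_json = json.dumps(environment, indent=2)
print(environment_json)
```

A record like this helps others see which versions were used, but unlike a container it does not package the software itself.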

Data collected during your research may be subject to privacy, sensitivity, and confidentiality considerations.

Sensitive data identifies “individuals, species, objects or locations, and carries a risk of causing discrimination, harm or unwanted attention.” (ARDC Publishing sensitive data)

Sensitive data may relate to:

People – particularly health and personal information that can be used to identify individuals or groups of people. This data may include:

Racial or ethnic origin
Political opinions, or memberships of political associations, professional or trade associations or unions
Religious beliefs or affiliations
Philosophical beliefs
Sexual orientation
Criminal records
Health, genetic or biometric information

Beyond human data – examples include sensitive environmental or diversity data, such as the location of rare, endangered, or commercially viable species, data representing potentially patentable intellectual property (IP), or other sensitivities, such as cultural, Indigenous, or ecological sensitivities, or research that may cause public controversy for its methodology or subject matter.

Identifiable data

Data may often need to be identifiable (i.e. contain personal information) during the process of research, e.g. for study administration, qualitative analysis, etc.

If data is identifiable, then ethical and privacy requirements may be met through access control and data security, but establishing a well-defined Data Management Plan before a research activity begins is the most effective way of meeting these requirements.

This may include:

control of access through physical or digital means (e.g. authentication)
encryption of data, particularly if it is being moved between locations
ensuring that data is not stored in an identifiable and unencrypted format when on easily lost items such as USB flash drives, laptops, or external hard drives
taking reasonable actions to prevent the inadvertent disclosure, release, or loss of sensitive personal information.
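
One possible way to reduce the amount of identifiable data held in working files is to replace direct identifiers with codes. The sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library to do this. It is an illustration of one technique, not a requirement of this guide, and it does not replace encryption, access control, or ethics approval conditions. The key shown is a placeholder, the email addresses are invented, and shortening the code to 12 characters is an arbitrary choice.

```python
import hashlib
import hmac

def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a stable code for an identifier using a keyed hash (HMAC-SHA256)."""
    # Normalise first so that trivial variations map to the same code.
    normalised = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()
    return "P-" + digest[:12]

# Placeholder key for illustration only. In practice, generate a random key
# and store it securely and separately from the data.
example_key = b"replace-with-a-randomly-generated-secret"

participants = [
    "jane.citizen@example.org",
    "Jane.Citizen@example.org ",
    "sam.lee@example.org",
]
codes = [pseudonymise(person, example_key) for person in participants]
print(codes)
```

The same identifier always produces the same code, so records can still be linked, while anyone without the key cannot easily work back from a code to a person.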