Good documentation is essential in research projects because it underpins good-quality data and transparent research. Describing how the datasets were created, how they are structured, and what they mean is key to making your data understandable to others as well as to your future self. Metadata provides such ‘data about data’. Metadata is needed at several levels to describe the study, the samples, the experiments, the analysis and so on. It may include information on the methodology and instrumentation used to collect the data, analytical and procedural information, definitions of variables, units of measurement, any assumptions made, the format and file type of the data, and the software used to collect and/or process the data.
Researchers are strongly encouraged to use community metadata standards where these are in place (see further down), and to do so from the beginning of the project. Data repositories may also provide guidance about appropriate metadata standards and requirements; for example, the European Nucleotide Archive (ENA) has ENA sample checklists. We provide templates for some of these checklists; see Sharing phase - ENA. Structuring the metadata to conform to the intended repository from the beginning enables data submission without having to reformat the metadata.
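One way to keep metadata repository-ready from the start is to check each sample record against the mandatory fields of the chosen checklist. The sketch below is illustrative only: the field names in `REQUIRED_FIELDS` are hypothetical placeholders, not the contents of an actual ENA sample checklist, and should be replaced with the fields of the checklist you intend to use.

```python
# Hypothetical mandatory fields; replace with those of the chosen checklist.
REQUIRED_FIELDS = ("sample_alias", "organism", "collection_date")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a sample record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Made-up sample records for illustration.
samples = [
    {"sample_alias": "S1", "organism": "Homo sapiens", "collection_date": "2021-03-01"},
    {"sample_alias": "S2", "organism": "", "collection_date": "2021-03-02"},
]

for sample in samples:
    problems = missing_fields(sample)
    if problems:
        print(f"{sample['sample_alias']}: missing {', '.join(problems)}")
```

Running such a check whenever new samples are registered may surface gaps while the information is still easy to recover.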
Ontologies, controlled vocabularies and data dictionaries are used to standardize the language used to describe the metadata. Think of the many ways to write that the organism is human (human, Human, homo sapiens, H. sapiens, Homo Sapiens, man, etc.). Using an ontology such as NCBI Taxonomy unifies the language and makes it easier for both humans and machines to interpret and work with the data. While an ontology has a hierarchical structure (e.g. Homo sapiens belongs to Mammalia, which belongs to Eukaryota), a controlled vocabulary is an unstructured set of terms that serves the same purpose: to standardize the language used. A data dictionary is a user-defined description of what all the variable names and values in the data really mean.
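As a minimal sketch of the idea, the free-text spellings above can be mapped to a single NCBI Taxonomy entry (Homo sapiens, taxonomy ID 9606). The lookup table and the `normalize_organism` helper are assumptions made for this example; in practice the mapping would come from an ontology service rather than a hand-written dictionary.

```python
# Small illustrative lookup from free-text organism labels to one NCBI Taxonomy
# entry. The synonym list is an assumption made for this example.
NCBI_TAXONOMY = {
    "human": ("Homo sapiens", 9606),
    "homo sapiens": ("Homo sapiens", 9606),
    "h. sapiens": ("Homo sapiens", 9606),
    "man": ("Homo sapiens", 9606),
}


def normalize_organism(label):
    """Map a free-text label to (scientific name, taxonomy ID), or None if unknown."""
    # Lower-case and collapse whitespace so spelling variants share one key.
    key = " ".join(label.strip().lower().split())
    return NCBI_TAXONOMY.get(key)


for raw in ["human", "Human", "homo sapiens", "H. sapiens", "Homo Sapiens", "man"]:
    print(raw, "->", normalize_organism(raw))
```

Storing the identifier alongside the name lets both people and software treat all six spellings as the same organism.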
For a suggested list of ontologies appropriate for the life science community, please see FAIRsharing.org and filter on e.g. Domain.
Below are ontology resources, adapted from Table 2 in Griffin PC, Khadake J, LeMay KS et al. Best practice data life cycle approaches for the life sciences. F1000Research 2018, 6:1618. doi: https://doi.org/10.12688/f1000research.12344.2:
- Ontology Lookup Service - Discover different ontologies and their contents.
- OBO Foundry - Table of open biomedical ontologies with information on development status, license and content.
- ZOOMA - Assign ontology terms using curated mapping.
- Ontobee - A linked data server that facilitates ontology data sharing, visualization, and use.
Data types, file formats and metadata standards
Curated, up-to-date guidance regarding file types and metadata standards is found at FAIRsharing.org. The most common ones, including links to data-type-specific FAIRsharing queries, are listed below. The information is adapted from RDA COVID-19 Working Group. Recommendations and Guidelines on data sharing. Research Data Alliance. 2020. doi: https://doi.org/10.15497/rda00052
A list of relevant data and metadata standards can be found in FAIRsharing; some specific examples are given below.
Gene expression - Transcriptomics
- Preferred minimal metadata standard: MINSEQE - Minimal Information about a high throughput SEQuencing Experiment
- Preferred file formats (sequencing-based):
- Raw sequences: .fastq (compression can be added with gzip)
- Mapped sequences: .sam (compressed as .bam or .cram). Please ensure that the reference sequence used is also publicly available and that the @SQ header is present and unambiguously describes that reference sequence.
- Transcripts per million (TPM): .csv
- Gene Structure: .gtf
- Gene Features: .gff
- Variant calling: .vcf. Please ensure that the reference sequence used is also publicly available and that it is unambiguously referenced in the header of the .vcf file, e.g. using the URL key of the ##contig header line.
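The header requirements for mapped sequences and variant calls can be checked before submission. The following is a minimal sketch that works on plain-text header lines only; the helper names, the example.org URLs and the sequence lengths are made up for illustration, and the ##contig parser assumes no commas inside quoted values.

```python
def sam_reference_names(header_lines):
    """Return the sequence names (SN) declared in the @SQ lines of a SAM header."""
    names = []
    for line in header_lines:
        if line.startswith("@SQ"):
            # Header fields are tab-separated TAG:VALUE pairs.
            fields = dict(
                field.split(":", 1)
                for field in line.rstrip("\n").split("\t")[1:]
                if ":" in field
            )
            if "SN" in fields:
                names.append(fields["SN"])
    return names


def vcf_contigs_without_url(header_lines):
    """Return the IDs of ##contig header lines that carry no URL key."""
    missing = []
    for line in header_lines:
        line = line.strip()
        if line.startswith("##contig=<") and line.endswith(">"):
            body = line[len("##contig=<"):-1]
            # Simplification: assumes no commas inside quoted values.
            pairs = dict(item.split("=", 1) for item in body.split(",") if "=" in item)
            if "URL" not in pairs:
                missing.append(pairs.get("ID", "?"))
    return missing


# Made-up header lines for illustration.
sam_header = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:1000\tUR:https://example.org/ref.fa\n",
]
vcf_header = [
    "##fileformat=VCFv4.3",
    "##contig=<ID=chr1,length=1000,URL=https://example.org/ref.fa>",
    "##contig=<ID=chr2,length=2000>",
]

print(sam_reference_names(sam_header))
print(vcf_contigs_without_url(vcf_header))
```

An empty result from `sam_reference_names` would suggest that the @SQ lines are missing, and any ID returned by `vcf_contigs_without_url` points to a contig whose reference is not linked in the header.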
Gene expression - Microarray-based
- Preferred minimal metadata standard: MIAME - Minimum Information About a Microarray Experiment
- Preferred file formats:
- tab-delimited text e.g. MAGE-TAB and ISA-TAB
- raw data file formats from commercial microarray platforms (see Annotare accepted formats)
Genome-wide association studies (GWAS)
- Preferred metadata standards:
- MIxS - MIGS/MIMS - Minimum Information about a (Meta)Genome Sequence. The MIMS extension includes key environmental metadata. Developed by the Genomic Standards Consortium. Numerous adopters including NCBI/EBI/DDBJ databases.
- MIMARKS - Minimum Information about a MARKer gene Sequence. This is an extension of MIGS/MIMS for environmental sequences. Developed by the Genomic Standards Consortium. Numerous adopters including NCBI/EBI/DDBJ databases.
Imaging / Structural data
Images and structural data cover a wide range of data types and thus metadata standards. Please find below guidance for a selection of them.
Processed structural information is submitted to structural databases in the PDBx/mmCIF format.
- Data archiving and validation standards for cryo-EM maps and models are coordinated internationally by EMDataResource (EMDR).
- Cryo-EM structures (map, experimental metadata, and optionally coordinate model) are deposited and processed through the wwPDB OneDep system, following the same annotation and validation workflow also used for X-ray crystallography and nuclear magnetic resonance (NMR) structures. EMDB holds all workflow metadata while PDB holds a subset of the metadata.
- Most electron microscopy data is stored in either raw data formats (binary, bitmap images, tiff, etc.) or proprietary formats developed by vendors (dm3, emispec, etc.).
- Processed structural information is submitted to structural resources as PDBx/mmCIF.
- Experimental metadata include information about the sample, specimen preparation, imaging, image processing, symmetry, reconstruction method, resolution and resolution method, as well as a description of the modeling/fitting procedures used. These metadata are described in EMDR; see also Lawson et al 2020.
- There are no widely accepted standards for NMR (Nuclear Magnetic Resonance) raw data files. Generally these are stored and archived in single FID/SER files.
- One effort for the standardization of NMR parameters extracted from 1D and 2D spectra of organic compounds to the proposed chemical structure is the NMReDATA format.
- There is no universally accepted format, especially for crucial FID-associated metadata. NMR-STAR and its NMR-STAR Dictionary is the archival format used by the Biological Nuclear Magnetic Resonance data Bank (BMRB), the international repository of biomolecular NMR data and an archive of the Worldwide Protein Data Bank (wwPDB).
- The nmrML format specification (XML Schema Definition (XSD) and an accompanying controlled vocabulary called nmrCV) is an open mark up language and ontology for NMR data.
- Processed structural information is submitted in the PDBx/mmCIF format.
- ENDF/B-VI of Cross-Section Evaluation Working Group (CSEWG) and JEFF of OECD/NEA have been widely utilized in the nuclear community. The latest versions of the two nuclear reaction data libraries are JEFF-3.3 and ENDF/B-VIII.0 (Brown et al., 2018) with a significant upgrade in data for a number of nuclides (Carlson et al., 2018).
- Neutron scattering data are stored in the internationally-adopted ENDF-6 format maintained by CSEWG.
- Processed structural information is submitted in the PDBx/mmCIF format.
For a curated list of relevant standards, see FAIRsharing using the query ‘metabolomics’; some examples are given below:
- Core Information for Metabolomics Reporting CIMR standard
- For identifying chemical compounds use SMILES or InChI
- To document Investigation/Study/Assay data, use the ISA Abstract Model, also implemented as a tabular format, ISA-Tab in MetaboLights. For an introduction to ISA, see (Sansone S-A et al., 2012)
- Recommended formats for LC-MS data: ANDI-MS specification, an analytical data interchange protocol for chromatographic data representation and/or mzML
- Recommended formats for NMR data: nmrCV, nmrML
- Metadata should follow recommendations from the CIMR standard by the Metabolomics Standards Initiative. It should be made available as tab- or comma-separated files (.tsv or .csv).
- Data can be stored as LC-MS files or in tab-separated (.tsv) or comma-separated (.csv) formats.
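Making metadata available as tab-separated text can be sketched with Python's standard `csv` module. The column names and values below are hypothetical and are not taken from the CIMR standard; they only illustrate the file layout.

```python
import csv
import io

# Hypothetical column names and values, for illustration only.
rows = [
    {"sample_id": "S1", "organism": "Homo sapiens", "instrument": "LC-MS"},
    {"sample_id": "S2", "organism": "Homo sapiens", "instrument": "LC-MS"},
]


def to_tsv(records):
    """Serialize a list of dicts to tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_tsv(rows))
```

To write a file instead of a string, the same `DictWriter` can be pointed at a handle opened with `open("metadata.tsv", "w", newline="")`, where the file name is again only an example.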
For a curated list of relevant standards, see FAIRsharing using the query ‘proteomics’; some examples are given below:
- Use the minimal information model specified in MIAPE by the HUPO Proteomics Standards Initiative (HUPO PSI), and fill the model using the controlled vocabularies specified by the Proteomics Standards Initiative: PSI-MS CV
- Recommended formats: