Training & Education
This document lists and describes useful research data management tools for each step of the research data lifecycle.
It is mostly generic from an institutional point of view; however, it offers specific information intended for EPFL and Swiss researchers on some targeted points.
This selection aims to help researchers make the most out of their data, and especially:
- save time in the long run
- collaborate efficiently on their data
- promote reproducible research
- enhance the visibility of their work
- meet funders’ data requirements (Horizon 2020, SNF…)
- meet publishers’ data requirements (Nature Publishing Group, PLoS…)
- minimize the risks around their data (such as data loss or corruption, data leak, etc.)
- open their data or secure their intellectual property
2. Data lifecycle
2.1 Data lifecycle management phases
Many tools are available to help manage data throughout the research process; these can be categorized using data lifecycle management phases (illustrated below):
IMAGE TO INSERT
- discovery: find useful, technically and legally reusable datasets
- acquisition: maximize the potential, visibility, and reproducibility of your data by choosing appropriate data and metadata formats from the beginning; anticipation is crucial, as it may be too late to change them afterwards
- analysis: analyze the data
- collaboration: share your data and collaborate with your colleagues and partners (locally or worldwide)
- writing: use comprehensive and collaborative paper writing tools
- publication: deposit your datasets in visible and trusted data repositories
2.2 Data management planning tools
Data management plans (DMP) are recognized as important means to produce data of good quality. DMPs are generally conceived from the beginning of research projects and updated throughout their duration. They aim to anticipate all data-related needs (such as storage capacities, collaborative tools, data licenses, data formats, metadata or description, sensitive data anonymization, etc.) and avoid any problems (lacunar data, data loss, data corruption, intellectual properties issues, storage, etc.).
DMP Online (UK) and DMP Tool (US) are free online tools assisting in the conception of DMPs. Both will guide the user through the essential questions that must be answered to manage your data. In addition, various funders’ requirements are built in and can be used out of the box (e.g. DMP Online supports the Horizon 2020 guidelines).
How to Develop a Data Management is a more traditional guide describing the elaboration of a DMP.
Guidelines on Data Management in Horizon 2020 are the requirements to follow when applying to a Horizon 2020 project subject to the data pilot program.
2.3 Data policies
The following institutional data policies together constitute a good overview of this topic in Europe: University of Edinburgh, University of Oxford, Humboldt University of Berlin, University of Southampton and University of Manchester.
3. Data management tools
In this section, data management tools are listed for each step of the data lifecycle.
3.1 Data discovery tools
3.1.1 Data repositories
Re3data.org is a registry of data repositories. This tool indexes over a thousand archives, both subject-specific and generalist, which can be browsed by discipline, by available repository features such as persistent identifier support (e.g. DOIs, which play a crucial role in guaranteeing access to datasets, as web links tend to break after a few years), and by other important information such as data license availability, standards and policies.
Nature’s Recommended Data Repositories is a set of disciplinary repositories covering the following fields: Biological sciences; Health sciences; Chemistry; Earth and environmental sciences; Physics, astrophysics & astronomy; Social sciences; and General science.
Zenodo, Dryad and Figshare are state-of-the-art general-purpose data repositories. Zenodo is made available by CERN and OpenAIRE and is free for any researcher publishing their data openly. Dryad is a curated repository, maintained by a non-profit organization. Figshare belongs to a for-profit company, the Macmillan group, which also owns the Nature Publishing Group.
3.1.2 Data papers and data journals
Data papers are publications describing datasets. In other words, they constitute peer-reviewed searchable metadata, and they can be used to find or highlight datasets. Data papers can be found in pure data journals, or in journals mixing them with traditional scholarly publications. An important point is that these papers may be found through classical scholarly search engines. In addition, the following resources can help you find multi-disciplinary data journals and data papers:
- Dryad’s examples of journal data policies lists journals that require data archiving and journals with data policies
- Trac’s multidisciplinary data journals list
- Nature Publishing Group’s scientific data website
- DataShare’s list of data journals (sources of dataset peer review)
Some discipline-specific data journals exist too, for example:
- Wiley’s Geoscience Data Journal and Earth System Science Data
- UpMetaJournal’s Open Health Data
- Pensoft’s Biodiversity Data Journal
- UpMetaJournal’s Journal of Open Archaeology Data
In addition, a list of Journal Data Policies compiled by Dryad may also be consulted.
3.2 Data acquisition, format and description
In order to make the most out of your research data, it is important to use appropriate data and metadata standards. This has to be thought through at the beginning of the project, and in any case before data acquisition: it is often too late to correct or complete data once the project is well under way. Good data and metadata standards help collect coherent data and avoid missing important points. In addition, badly described data will not be reusable by others at all, nor will it be easy to find. A standard and open data format will allow more people to access and reuse a dataset thanks to its good compatibility across software and platforms. Finally, open standards maximize the chances of accessing your results in the future, because they are supported longer.
3.2.1 Data acquisition
ScienceExchange is a platform where scientists can order data for specific experiments (including from their own design): it is an online scientific experiment marketplace.
rOpenSci offers packages providing easy access to data repositories through the R statistical programming environment. R is free software available on all platforms (Windows, Mac, Linux). These packages cover data access in the following domains: primary data, full text of journal articles, altmetrics, data publication, reproducibility and data visualization. Many other data analysis R packages are available through CRAN.
3.2.2 Data formats
A directory of Recommended Data Formats is maintained by the US Library of Congress. It covers the following categories: still images, sounds, moving images, textual documents, web archives, datasets, geospatial data, as well as generic data.
The DataTypeRegistry is a generic open source data type description platform. In particular, it allows combining already described units or data types to create new ones. Data types are labeled with unique identifiers. In addition to a web interface, automated access is possible through an API.
HDF5 “is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format”. HDF5 is supported from within many environments such as Matlab, Octave, Python (H5py, PyTables), GNU-R, Java, C++, Fortran and Mathematica.
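As a minimal sketch of the ideas above, the following Python snippet uses the H5py package to store a dataset together with descriptive metadata (attributes) in a single HDF5 file; the group, dataset and attribute names are purely illustrative.

```python
import numpy as np
import h5py

# Write: an HDF5 file can hold groups (like folders), datasets (arrays)
# and attributes (metadata attached to groups or datasets).
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("experiment1")
    dset = grp.create_dataset("temperatures", data=np.arange(10, dtype="f8"))
    dset.attrs["units"] = "kelvin"  # metadata travels with the data

# Read: data and its metadata are retrieved from the same file.
with h5py.File("example.h5", "r") as f:
    temps = f["experiment1/temperatures"][...]
    units = f["experiment1/temperatures"].attrs["units"]
```

Keeping units and other descriptions as attributes inside the file means the dataset remains self-describing when shared.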
The Structured Query Language (SQL) is “a special-purpose programming language designed for managing data held in a relational database management system (RDBMS)”. SQL is well suited to storing and sharing relational data. RDBMSs are a great help in maintaining a dataset’s coherence and enforcing data constraints. Several multi-platform open source RDBMSs are available, such as MariaDB (MySQL) or PostgreSQL.
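To illustrate how an RDBMS enforces coherence, here is a small sketch using SQLite (another open source RDBMS, bundled with Python as the sqlite3 module); the table and column names are invented for the example.

```python
import sqlite3

# In-memory database for illustration; a real project would use a file
# or a server-based RDBMS such as MariaDB or PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Constraints (NOT NULL, UNIQUE, FOREIGN KEY, CHECK) let the RDBMS
# guarantee the dataset's coherence instead of relying on discipline.
conn.execute("""
    CREATE TABLE instrument (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    )""")
conn.execute("""
    CREATE TABLE measurement (
        id            INTEGER PRIMARY KEY,
        instrument_id INTEGER NOT NULL REFERENCES instrument(id),
        value         REAL NOT NULL CHECK (value >= 0)
    )""")

conn.execute("INSERT INTO instrument (name) VALUES ('spectrometer-A')")
conn.execute("INSERT INTO measurement (instrument_id, value) VALUES (1, 0.42)")

# A row referencing a non-existent instrument is rejected outright,
# rather than silently corrupting the dataset.
try:
    conn.execute("INSERT INTO measurement (instrument_id, value) VALUES (99, 1.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The invalid insert raises an `IntegrityError`, so only coherent rows ever enter the table.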
The Semantic Web Resource Description Framework (RDF) “is a standard model for data interchange on the Web”. RDF is a very general format and may be used to encode most types of data. A significant advantage of RDF over other data formats resides in its interoperability with other data sources, allowing analyses to extend beyond a given dataset by connecting it to other data sources. In practice, this is done using the SPARQL query language on triple stores such as Virtuoso, Jena or 4store.
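As a sketch of this interoperability, the following hypothetical SPARQL query (the `ex:` vocabulary is invented for the example) joins a local triple store with DBpedia, a public SPARQL endpoint, using the federated `SERVICE` keyword:

```sparql
# For each station in a local dataset, fetch the population of its
# city from DBpedia. The ex: namespace is purely illustrative.
PREFIX ex:  <http://example.org/stations#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?station ?city ?population
WHERE {
  ?station ex:locatedIn ?city .
  SERVICE <http://dbpedia.org/sparql> {
    ?city dbo:populationTotal ?population .
  }
}
```

Because both datasets speak RDF, the analysis crosses dataset boundaries without any format conversion.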
Sometimes, a simple but clear dataset naming convention can help a lot. Consider e.g. the pattern Project-Institution-Group-DataSetName-Version-YYYYMMDD.DataFormat, which in practice could become something like FictiveProject-UniversityX-SignalProcessingLab-SolarIntensity-Version2.3-20151130.csv.
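Such a convention is easy to enforce in code; here is a small Python sketch that builds file names following the pattern above (the helper name and the example values, including a valid example date, are invented for illustration):

```python
from datetime import date

def dataset_filename(project, institution, group, name, version, day, fmt):
    """Build a file name following the illustrative pattern
    Project-Institution-Group-DataSetName-Version-YYYYMMDD.DataFormat."""
    return "-".join([project, institution, group, name,
                     f"Version{version}", day.strftime("%Y%m%d")]) + "." + fmt

fname = dataset_filename("FictiveProject", "UniversityX", "SignalProcessingLab",
                         "SolarIntensity", "2.3", date(2015, 11, 30), "csv")
print(fname)
# → FictiveProject-UniversityX-SignalProcessingLab-SolarIntensity-Version2.3-20151130.csv
```

Generating names programmatically ensures every file in a project follows the same convention.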
3.2.3 Metadata formats
DublinCore is a vocabulary consisting of only 15 basic elements, such as Creator, Title, Date, Description, Format, Rights, or Subject. It is widely used in scholarly communication, but it is not specifically designed for dataset description. For that reason, it is a minimalist solution, and we recommend one of the solutions listed below instead.
DataCite Metadata Schema is a standard designed with datasets in mind, and hence better adapted than DublinCore mentioned just above. For example, GeoLocation, ResearchGroups, Collections, Videos or Work have their own specific resource types.
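For a feel of the schema, here is a minimal, illustrative DataCite record sketch (assuming the kernel-4 version of the schema; all values are fictive):

```xml
<!-- Minimal illustrative DataCite record; DOI and values are fictive. -->
<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.0000/example.doi</identifier>
  <creators>
    <creator><creatorName>Doe, Jane</creatorName></creator>
  </creators>
  <titles><title>Solar Intensity Measurements</title></titles>
  <publisher>UniversityX</publisher>
  <publicationYear>2015</publicationYear>
  <resourceType resourceTypeGeneral="Dataset">Time series</resourceType>
  <geoLocations>
    <geoLocation><geoLocationPlace>Lausanne, Switzerland</geoLocationPlace></geoLocation>
  </geoLocations>
</resource>
```

A record like this travels with the dataset to the repository and makes it citable and findable.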
A directory of Recommended Metadata Formats is available on this open source collaborative platform. Formats are indexed by discipline, extensions, associated tools and associated use cases. The general subject categories available are Arts and Humanities, Engineering, Life Sciences, Physical Sciences and Mathematics, Social and Behavioral Sciences, and General Research Data. Dozens of subcategories are also available.
The Semantic Web Resource Description Framework (RDF), presented in the data formats section above, can also be used to describe metadata: many OWL and RDF based ontologies exist. Ontologies are the “formal naming and definition of types, properties and interrelationships” of data.