Data and the FAIR Principles

Lesson 3: Data Publishing

Overview

Teaching: Self-paced
Exercises: 1 min
Questions
  • Am I ready to publish my data?

  • What resources are available for my research data needs?

Objectives
  • To learn what constitutes good stewardship of research data

  • To learn about resources that can be used to assist in the process of data stewardship and publication

Introduction

This lesson provides an overview of best practices in data publishing. Note that we prefer the term “data publishing” to “data sharing” because the goal of this module is to make data publicly available for third-party inspection and re-use.

History of open mandates and guidelines

Prior to computers and the internet, routinely publishing data was not practical beyond what could be included in articles or books. Rather, it was the hypotheses proposed, the experimental design, the analysis and the insights gained from collected data that were valued and preserved through our system of journals and books. Eventually, a culture grew up around scientific publishing in which data were considered either disposable after some specified regulatory period or a personal asset to be maintained and exploited for personal use (Martone et al., 2018).

As science continues to move online, almost all major funding agencies in the US and abroad are developing policies around open sharing of research data and other research products such as code. These policies seek to promote the integrity of scientific research through greater transparency, given recent concerns about reproducibility, but they are also driven by the promise of new insights to be gained from increased human- and machine-based access to data.

Within neuroimaging, there exists a set of recommendations for best practices in data analysis and sharing. To advance open science in neuroimaging, the Organization for Human Brain Mapping’s Committee on Best Practice in Data Analysis and Sharing (COBIDAS) has released a set of recommendations (doi: https://doi.org/10.1101/054262). These guidelines cover the various aspects of a study via tabular listings of items that help researchers plan, execute, report and ultimately share their work in support of reproducible neuroimaging.

Data sharing versus data publishing

We are seeing more and more calls for domains not just to establish a culture of human-centric sharing, i.e., “data available upon request”, but to move towards an e-science vision in which researchers conduct their research digitally and data standards and programmatic interfaces make it easy for machines to access and ingest large amounts of data. Achieving this goal requires that the products of research, including the data, be FAIR not just for humans but for machines. We have already covered the FAIR principles in another module; as a review, FAIR stands for Findable, Accessible, Interoperable and Re-usable. FAIR requires that enough metadata be associated with a data set that it can be interpreted and reused appropriately. It also requires that data can be easily coupled to computational resources that operate on them at scale, that the provenance of these research products can be tracked as they transition between uses, and that digital artifacts can be found, accessed and reused through computational interfaces (APIs) with minimal restrictions.
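To make “computational interfaces” concrete, the sketch below, written in Python, retrieves the descriptive metadata for a data set from the REST API of a generalist repository. Zenodo is used purely as an illustration, the record identifier is a placeholder, and the field names follow Zenodo’s published API; check the repository’s current documentation before relying on them.

    import json
    from urllib.request import urlopen

    # Placeholder record identifier; substitute the ID of an actual deposit.
    record_id = "123456"

    # Zenodo serves record metadata as JSON from a REST endpoint; most
    # qualified repositories offer a comparable programmatic interface.
    url = f"https://zenodo.org/api/records/{record_id}"

    with urlopen(url) as response:
        record = json.load(response)

    # Descriptive metadata that helps make the data set Findable and Re-usable.
    meta = record["metadata"]
    print(meta["title"])
    print(meta["doi"])
    print([creator["name"] for creator in meta["creators"]])

A machine can run this same request across thousands of records without human intervention, which is exactly the kind of at-scale access the FAIR principles call for.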

In other words, to take advantage of data, we must devote time, attention and resources to publishing data and code, just as we devote time and energy to publishing enduring narrative works meant to be understood and used by third parties, now and in the future.

Best Practices for publishing data

We are still in the early days of publishing data, and the FAIR principles have not yet been interpreted completely within any domain (Mons et al., 2017). Part of the remit of ReproNim is to establish a viable set of FAIR data practices for neuroimaging. But some best practices and practical guidelines have already become clear:

1) Proper data management throughout the research lifecycle: Documentation and management of data within the laboratory is a critical first step in publishing data. Many laboratories still have no formal systems for storing, annotating and managing critical lab data. Often, when a graduate student or post-doc leaves, valuable data and metadata are lost. How many data sets sit on archival media, such as Zip disks, that are no longer readable? As the FAIR principles state, rich and standardized metadata are critical to ensure interoperability and re-usability. While some of this annotation can be done after the fact, establishing good data management practices within the laboratory, understanding what minimal information and data standards will be required before the experiment is performed, and ensuring that the data are properly stewarded throughout the lifecycle will save countless hours during and after the study.

Funding agencies are starting to require that researchers have a data management plan in their grant applications. Towards that end, many research libraries are now creating practical guidance and resources for effective data management and for creating acceptable data management plans for funding agencies.

2) Ensure that your data are deposited into a reputable repository when publishing data: The simplest and most effective mechanism for making data FAIR is to deposit them in a qualified data repository (Roche et al., 2014; Gewin, 2016; White et al., 2013). Hosting data on personal websites, or even as supplementary material for a published article, is not ideal: the first is prone to link rot, and in both cases the data will generally not be further curated, so known impediments to re-usability, e.g., proprietary formats, may not be caught. Putting data in the cloud is also not a panacea if the FAIR principles are not followed.

Many data repositories have been created for scientific data (see Exercise 1: Finding a data repository for your data). Qualified data repositories ensure long-term maintenance of the data, generally enforce community standards, and handle things like obtaining an identifier, maintaining a landing page with appropriate descriptive metadata, and providing programmatic access. Many institutions are beginning to provide data management services as well.

Types of repositories: A variety of repositories are available, from specialized repositories developed around a specific domain, e.g., NITRC-IR, OpenNeuro, to general repositories that accept all domains and most data types, e.g., Figshare, Dryad, OSF, DataVerse, Zenodo (Table 2). Many research institutions maintain data repositories for their researchers as well (e.g., University of California DASH).

The advantage of the more specialized repositories is that they can invest in much more specialized metadata, data models, formats and tools than the generalist repositories can. Because the generalist repositories contain mixtures of different data types across many domains, they have a difficult time harmonizing across different data sets or developing data representations that allow programmatic access to the full data without significant human intervention.

3) Plan ahead when publishing your data: The process and costs associated with publishing research articles are well known to researchers, who routinely include adequate resources in their proposals to prepare and publish articles. Data are far more varied in size and complexity than research articles, however, and thought must be given to how they will be published before they are collected. For data of a reasonable size, most repositories still host the data free of charge. Some repositories, e.g., NDAR, require a fee to deposit data, but they also provide cost estimates that can be included within grant proposals. In addition to costs, just as with publishing articles, you need to ensure that all parties who have contributed to the data are credited and agree to publishing the data (see below: Credit for publishing data).

Here are some things to consider:

Before the data are collected:

Tips for making data FAIR

DataLad

DataLad builds on top of git-annex and extends it with an intuitive command-line interface. It enables users to operate on data using familiar concepts, such as files and directories, while transparently managing data access and authorization with underlying hosting providers.
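As a minimal sketch of what this looks like in practice, the snippet below uses DataLad’s Python API (an equivalent command-line interface exists) to install a published dataset and then fetch the content of a single file on demand. The dataset URL and file path are illustrative placeholders, not a prescribed workflow.

    import datalad.api as dl

    # Install ("clone") a published DataLad dataset; at this point only
    # lightweight metadata and file stubs are transferred, not the data.
    ds = dl.clone(
        source="https://github.com/OpenNeuroDatasets/ds000001",  # illustrative URL
        path="ds000001",
    )

    # Retrieve the content of one file on demand; DataLad transparently
    # negotiates access with the underlying hosting provider.
    ds.get("sub-01/anat/sub-01_T1w.nii.gz")  # illustrative file path

Because only the files you request are downloaded, this design lets researchers work with very large collections while transferring only the subset of data they actually need.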

Credit for publishing data

Publishing data should be no different than publishing an article: the creators should be credited and the data formally cited when they are reused. This view is expressed in the Joint Declaration of Data Citation Principles, issued and endorsed by organizations around the globe.

In recognition of the growing importance of publishing data, publishers are providing specialized journals, e.g., Scientific Data, published by Springer Nature, or a specialized article type called a data paper, specifically for publishing well-curated and well-described data sets. These papers are published using traditional publishing metadata and article structure but are not expected to include any analyses or conclusions; rather, the paper is devoted to providing rich metadata and a rigorous description of the experimental and data collection methods. Scientific Data also implements a standard format for structuring metadata, to help ensure that the data are FAIR. These journals usually require that data be deposited in an approved community repository, and they provide lists of recommended repositories.

Publishing a data paper is one way to ensure that data can be cited, but many journals are currently working on implementing more formal systems of data citation, led by community efforts to push for equal status for data in the publishing pipeline (e.g., the Joint Declaration of Data Citation Principles; Starr et al., 2015). Citations to data sets would look like citations to articles, with a standard set of metadata, and would appear in the reference list. With the ability to list published data sets on a scientific CV, cite them within published articles, and search for them via tools like DataMed, data are finally taking their place as a primary product of scholarship.
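To make the “standard set of metadata” concrete, here is a small sketch, in Python, that assembles a reference-list-style citation from the core elements named in the Joint Declaration of Data Citation Principles (creators, year, title, version, repository and persistent identifier). All of the values shown are placeholders.

    # Core elements of a data citation; every value below is a placeholder.
    dataset = {
        "creators": "Doe, J., & Roe, R.",
        "year": 2016,
        "title": "Example structural MRI data set",
        "version": "1.0",
        "repository": "OpenNeuro",
        "identifier": "https://doi.org/10.xxxx/example",
    }

    citation = (
        "{creators} ({year}). {title} (Version {version}) [Data set]. "
        "{repository}. {identifier}".format(**dataset)
    )
    print(citation)
    # Doe, J., & Roe, R. (2016). Example structural MRI data set
    # (Version 1.0) [Data set]. OpenNeuro. https://doi.org/10.xxxx/example

Because every element, including the persistent identifier, is machine-readable, citations of this form can be indexed and tracked just like article citations.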

A formal citation system assigns credit for the re-use of data, but it also establishes links to the evidence on which claims are based and provides the means for tracking, and therefore measuring, the impact of data re-use. In our current publishing system, authors adopt a variety of styles for referencing data when they are re-used, from accession numbers, to URLs, to citing a paper associated with the data set. Some journals set aside a special section on data use that contains lists of data sets and other resources. Unlike references to articles, which have a standard format along with tools and services to insert and analyze them, uncovering and tracking the re-use of data typically involves manual identification, text mining and natural language processing, requiring full-text access and considerable time and effort (Read et al., 2015).

Honor et al. (2016) have published a recommendation for citing neuroimaging data sets.

References

Online courses:

Additional materials:

Key Points