The European Commission has identified the opportunity to save €10.2 billion per year by using FAIR data (Findable, Accessible, Interoperable, Reusable). As policies requiring FAIR data begin to emerge, it’s timely to consider the open infrastructure needed to embed FAIRness into research and research communication workflows and outputs.
Coko recently received a grant from the Sloan Foundation to build DataSeer, a web service that uses Natural Language Processing to identify and call out datasets associated with research articles. Datasets are often not explicitly identified, let alone made FAIR and accessible. The first step is knowing how many datasets were used in a body of work. DataSeer “reads” documents and finds mentions of dataset creation and use. Based on the context, DataSeer can offer recommendations to curate, deposit, add metadata to, or otherwise better handle datasets. DataSeer can fit into the workflows of researchers, publishers, aggregators, funders, and institutions.
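To make the idea concrete, here is a toy sketch of what “finding mentions of dataset creation and use” might look like. This is purely illustrative: DataSeer itself uses trained NLP models, not keyword rules, and the cue phrases below are hypothetical examples of the kind of language such a system must recognise.

```python
import re

# Hypothetical cue phrases for "dataset creation" language,
# e.g. "we collected ...", "data were generated ..."
CREATION_CUES = re.compile(
    r"\b(we (collected|recorded|measured|surveyed|sequenced)|"
    r"data (were|was) (collected|generated))\b",
    re.IGNORECASE,
)

# Hypothetical cue phrases for "dataset use" language,
# e.g. "we analysed ...", "data obtained from ..."
USE_CUES = re.compile(
    r"\b(we (analysed|analyzed|downloaded|reused)|"
    r"data (obtained|downloaded) from)\b",
    re.IGNORECASE,
)

def find_dataset_mentions(text):
    """Return (sentence, label) pairs for sentences that look like
    statements of dataset creation or dataset use."""
    mentions = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if CREATION_CUES.search(sentence):
            mentions.append((sentence, "creation"))
        elif USE_CUES.search(sentence):
            mentions.append((sentence, "use"))
    return mentions
```

A real system would go further, using the surrounding context to decide what recommendation to make (deposit the data, add metadata, cite an existing repository record, and so on), but the core task, classifying sentences that signal a dataset's existence, is the same.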
DataSeer would be a good starting point for building a tool that can assess FAIRness. Datasets come in a multitude of formats and typically require extensive metadata to be understandable. DataSeer tackles these complexities directly, and our development workflow can readily be applied to other research outputs.
Before FAIR compliance can be assessed, the full range of datasets associated with a research project must first be identified. There are often ‘hidden’ datasets mentioned in the text that are not included among the ‘official’ outputs. DataSeer finds these mentions and helps authors identify and share all of the datasets involved in their work. Development of the software and algorithm, and creation of the training set, are already underway. Subsequent projects can build on this work rather than starting from the beginning and duplicating effort.