Preprints won’t just publish themselves: Why we need centralized services for preprints

Preprints, the early versions of journal articles that haven’t yet been peer-reviewed, have received much attention recently. While preprints have been around since before arXiv launched in 1991, fields outside of physics are starting to push for more early sharing of research data, results and conclusions. This will undoubtedly speed up research and make more of it available under open and reusable licenses.

We are seeing the beginning of a proliferation in the number of preprint publishing services. This is a good thing. We know from the organic growth of journals that researchers often choose to publish in places that serve their own community. This will no doubt be true with preprint services as well, and offering researchers choices makes sense. Even within the very large arXiv preprint server, there are many different community channels where researchers look for their colleagues’ work.

Last year ASAPbio formed with the goal of increasing preprint posting in the life sciences. There was agreement that preprints should be searchable, discoverable, mineable, linked, and reliably archived. These are all steps that the online journal publishing industry needed to take 20 years ago, and there are well-understood mechanisms in place. This is how cross-journal databases such as PubMed came to be, how best practices such as assigning DOIs evolved, how standards such as COUNTER were developed to ensure consistent reporting on usage, and how integration with research databases such as GenBank was worked out.

These same efforts will be needed across the different preprint services to ensure that preprints are taken seriously as research artifacts. As more preprint channels arise, this infrastructure and these operating standards will only become more important. A research communication service is not necessarily the same as its underlying technology and, though people tend to equate the two, shared preprint infrastructure is actually the best way to ensure costs are kept down and standards are applied.

As the ASAPbio conversation evolved, so did the discussion of whether a central service was needed for aggregation of preprints. I believe that what is needed is a collection of services, centralized in some way, that gives preprint services a low-cost and easy path to operating and ensures that they work together as effectively as possible.

Several of the needed services include:

  1. Consistent standards applied to preprints (identifiers, formats, metadata)
  2. Reliable archiving for long term preservation
  3. A record of all preprints in one place for further research purposes (text and data mining, informatics, etc.)
  4. Version recording and control
  5. Best practices in preprint publishing applied across all services
  6. Sustainability mechanisms for existing and new preprint services

Comparable services for journals have helped to make journal literature reliable and persistent. If we want preprints to turn into first-class research artifacts in the life sciences and other fields outside of physics, we need to apply some degree of the same treatment to them. At this early stage, now is the time to plan for these services.

A centralized set of services could ensure that, for preprint services that already exist, their efforts are tracked and a record is kept. If they don’t have DOIs, they can get them affordably. If they have DOIs, those DOIs are tracked and searchable through a central API. If the preprints are PDF-only, a version could be converted to structured data and held in a mineable database.

The sooner the research communication community gets out in front of these support services for preprints, the less chance there is of data loss and of an incomplete record of this growing segment of research literature.