Reimagining Preprints: a new generation of early sharing

The goal of preprints is to share early and often and to improve the range and quality of what is shared. Preprints must be treated as first-class research objects, and there is an opportunity to create a workflow for preprints that raises the quality of all published works. While the focus is currently on early versions of journal manuscripts, any new services or tools can be built to accommodate many forms of research, including datasets, code, protocols, and null results.

The ideal preprint ecosystem will make early versions of manuscripts and accompanying works:


  • Accessible (openly available at the earliest possible date)
  • Flexible (ingests, produces and disseminates many types of research objects)
  • Discoverable (adequate indexing and metadata)
  • Reproducible (includes all background, data, code, materials to reproduce)
  • Reusable (able to be pulled in to overlay journals and other reuse vehicles)
  • Reliable (not plagiarized or already published on another service)
  • Versioned (able to be updated and stamped with new but associated identifiers)
  • Minable (structured and available for text and data mining)
  • Networked (all related research objects connected throughout their life cycles)
  • Trackable (Collaborators/followers/funders/etc. can get the latest updates on the research outputs in progress)


Required infrastructure

There are three key functions that technologies need to support:

  1. Ingest and conversion
  2. Production workflow
  3. Dissemination and delivery

Dramatically improving each of these functions requires deep expertise. In a modular setup, discrete technologies can handle each function, allowing more concentrated expertise to be applied within each area.

Interoperability through adherence to standards and use of APIs means that the modules can work together yet also operate separately as stand-alone services. This will enable components to be updated or replaced without interfering with the operation of other components.
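One way to picture this modularity is as components that share only a common interface, so any one of them can be swapped out without touching the rest. The sketch below is illustrative, not a description of any existing preprint system; the module names and the shape of the shared document record are assumptions.

```python
from abc import ABC, abstractmethod

class PipelineModule(ABC):
    """A stand-alone service in a hypothetical preprint pipeline.

    Modules agree only on the shape of the document record they
    exchange, mirroring interoperability through shared standards.
    """

    @abstractmethod
    def process(self, document: dict) -> dict:
        ...

class IngestModule(PipelineModule):
    def process(self, document: dict) -> dict:
        document["status"] = "ingested"
        return document

class DisseminationModule(PipelineModule):
    def process(self, document: dict) -> dict:
        document["status"] = "disseminated"
        return document

def run_pipeline(document: dict, modules: list[PipelineModule]) -> dict:
    # Each module depends only on the shared record, so replacing or
    # upgrading one module does not disturb the others.
    for module in modules:
        document = module.process(document)
    return document
```

Because each module can also be invoked on its own, the same `IngestModule` could serve one community's workflow while a different dissemination component serves another's.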

A key question is whether there should be many different preprint services or if a shared platform or service would be effective or efficient. Different communities of scholars tend to have different needs when it comes to how content is vetted and curated. There is a good argument to be made for diversity at the level of editorial and production pipelines.

For the input and output portions of the process, however, the ingest and dissemination functions, shared infrastructure might lower costs and improve the final product.

Centralized Ingest and Conversion

Many of the current limitations of preprints are a result of the initial submission process. Authors typically upload a PDF, Word or LaTeX file with minimal metadata to a preprint server. The full text rarely ends up in a more structured format such as HTML or XML. Early conversion to xHTML, for example, would enable much of the enrichment, discoverability and other features for a next generation preprint.

A centralized tool that can ingest many different author-supplied formats and convert them to xHTML would be a major advance in the capabilities of preprints. If these functions operated within an adaptable and extensible framework that also performs other tasks, such as extracting metadata, enriching the content, and assigning identifiers, then much of the work of making preprints networked objects would be automated and accessible. With a centralized set of rules, this tool could reliably ensure that funder and licensing data are accurately identified, or ping authors to add missing metadata.
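The metadata check described above could be as simple as comparing a submission's record against a centrally maintained list of required fields. The sketch below is a minimal illustration under that assumption; the field names and record format are hypothetical, not drawn from any actual preprint server.

```python
# Hypothetical centrally maintained rules: fields every submission must carry.
REQUIRED_FIELDS = ["title", "authors", "license", "funders"]

def find_missing_metadata(record: dict) -> list[str]:
    """Return required fields that are absent or empty in a submission record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

# Example submission with no license value and no funder information.
record = {"title": "A preprint", "authors": ["A. Author"], "license": ""}
missing = find_missing_metadata(record)
# missing == ["license", "funders"]; the service could then ping the
# author to supply these before conversion proceeds.
```

Because the rules live in one place, updating `REQUIRED_FIELDS` immediately tightens the check for every service that relies on the shared ingest tool.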

The ingest and conversion service could apply rules that codify the standards and best practices needed for preprint production, ensuring that preprints are on par with the rest of the published record.

The advantage of shared infrastructure in this area is that the ingest and conversion system will “learn” as more content flows through it. Many different preprint and publishing services will benefit from the collective intelligence and growing codebase as the service evolves.

Diverse workflows

Because there are many community-driven variations in how manuscripts are vetted and curated, it is entirely appropriate to have a diverse set of preprint services, each with a unique set of editorial and production tasks. Medical preprints will likely need a different level of vetting than those in other fields. While a single workflow tool could be applied across many community-run preprint services, this may produce significant overhead.

Dissemination and web delivery

If preprints and accompanying files are well characterized during ingest and conversion, they are automatically more discoverable. Centralized syndication, discoverability, and validation can be handled by the same ingest and conversion service at the time of publication. With DOIs and other identifiers attached, metadata about each research object lands in CrossRef and other centralized databases and is syndicated automatically. Indexing by Google and other search engines can happen within that service or once the preprints are delivered to their web delivery site. Delivery to a website then becomes just the final step in the process, offering the formal display and site-specific search functions that researchers expect, but it is not the sole avenue for discovery. Delivery will include a report on the preprint's adherence to standards, plagiarism checks, and other validation steps, which should make it a more reliable publication for preprint services as well as for journal publishers, should it end up being submitted to a journal.
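The delivery report described above could be a simple structured summary of each validation step and an overall pass/fail flag that downstream services can read. This is a speculative sketch; the check names and report shape are assumptions for illustration only.

```python
def build_delivery_report(preprint_id: str, checks: dict[str, bool]) -> dict:
    """Summarize validation results attached to a preprint at delivery.

    `checks` maps each validation step (e.g. standards adherence,
    plagiarism screening) to whether it passed.
    """
    return {
        "preprint_id": preprint_id,
        "passed": all(checks.values()),
        "checks": checks,
    }

# Hypothetical report for a preprint that cleared plagiarism and schema
# checks but is still missing some metadata.
report = build_delivery_report(
    "preprint-0001",
    {
        "schema_valid": True,
        "plagiarism_clear": True,
        "metadata_complete": False,
    },
)
# report["passed"] is False, signaling that delivery needs follow-up
# before the preprint can be treated as fully reliable.
```

A journal publisher receiving the same preprint later could reuse this report rather than repeating the checks, which is the reliability benefit the text describes.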

Beyond Preprints

Improvements made to preprint services will have a follow-on effect of conferring these qualities upon all shared or published research objects, including traditional journal articles.

Post by Kristen Ratan