DOCX, XML, HTML, and Publishing Workflows

Coko Foundation Staff May 14, 2018

Most publishing workflows start with an author submitting a manuscript. For better or worse, this manuscript is most often in docx format, created using Microsoft Word. The docx manuscript is typically used throughout the reviewing and editing process, often shuttled around as an email attachment. Once the manuscript is accepted, the publisher must move it out of docx and into formats that can be displayed on the web and disseminated.

To make a manuscript ready for publication, most publishers hire external vendors to convert the docx into HTML, PDF, and Extensible Markup Language (XML). Not all publishers require XML, but those that do consider it one of three canonical formats for journal content, along with HTML and PDF. The XML version holds the article metadata in a structured, machine-readable form, which has long been considered necessary to maximize discoverability and to support text and data mining. This format conversion is usually done by an external vendor during the production stage, which is costly and time-consuming.

Here at Coko, we are working to improve the speed and usability of publisher workflows. One way we are doing this is by building tools that get the manuscript out of docx as early as possible. Publishers can design their own workflows using Coko’s suite of platform technologies, and we recommend converting the document to clean, well-structured HTML during the editorial and peer review stages. HTML is the language of the web, and it is easy to work with at every stage of publishing. HTML can be as structured as XML at a fraction of the cost. Once the article has been reviewed, accepted, copyedited, and proofed, conversion from structured HTML to PDF and XML for publication is a much more straightforward step.

To help with this, we recently announced the release of XSweet 1.0, an open source suite of tools for transforming the contents of docx into HTML. The advantages of using HTML during the editorial process include:

  • Make work instantly ready to share in the web browser at any stage of the workflow, from preprints to review copies to author proofs;
  • Avoid the versioning problems and delays in editing and revising that come from passing static files back and forth;
  • Add correct semantic structure to the document for easy export in a variety of formats; and
  • Reduce the time and cost of publication.
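To make the docx-to-HTML idea concrete, here is a minimal sketch in Python. It is not XSweet and does not reflect how XSweet works internally; it only relies on the fact that a docx file is a zip archive whose main text lives in word/document.xml. The function names and the sample document are hypothetical, and all formatting (bold, headings, lists, footnotes) is simply dropped.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs_to_html(docx_bytes: bytes) -> str:
    """Return one <p> per non-empty Word paragraph, dropping all formatting."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    parts = []
    for para in root.iter(W + "p"):
        # A paragraph's text is spread across runs; join every text node in order.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if text.strip():
            parts.append("<p>" + escape(text) + "</p>")
    return "\n".join(parts)


def make_sample_docx() -> bytes:
    """Build a tiny stand-in archive (main document part only) for the demo."""
    ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    document = (
        f'<w:document xmlns:w="{ns}"><w:body>'
        "<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello, </w:t></w:r>"
        "<w:r><w:t>world &amp; web</w:t></w:r></w:p>"
        "<w:p/>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()


if __name__ == "__main__":
    print(docx_paragraphs_to_html(make_sample_docx()))
```

Even this toy version shows why the result is easier to work with: two formatted runs and an empty paragraph in the docx markup collapse into a single short HTML paragraph that any browser can display.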

The HTML can be displayed to the author for edits, used by the editor to assign appropriate reviewers, and annotated by the reviewers. This makes the pre-production tasks both easier and faster and opens up opportunities for collaboration.

After the manuscript is accepted, it can be easily moved into a web delivery system for display as HTML. If XML is needed for archiving or syndication, conversion from HTML can be largely automated, with minimal production work needed for quality assurance. High-quality PDFs can also be generated automatically from HTML. All this can significantly reduce time to publication and may dramatically reduce the costs of using external vendors for typesetting.
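As a rough illustration of why well-structured HTML converts to XML so readily, the sketch below renames HTML elements to XML elements one for one. The tag mapping is a hypothetical example that borrows element names from JATS (the post itself does not name a target vocabulary), and the code assumes the HTML is well formed enough to be parsed as XML. A production conversion would also need to handle metadata, attributes, references, and validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML-to-XML tag mapping; the target names follow JATS conventions.
TAG_MAP = {"section": "sec", "h2": "title", "p": "p", "em": "italic", "strong": "bold"}


def convert(element: ET.Element) -> ET.Element:
    """Recursively copy an element, renaming tags via TAG_MAP and dropping attributes."""
    out = ET.Element(TAG_MAP.get(element.tag, element.tag))
    out.text = element.text
    out.tail = element.tail
    for child in element:
        out.append(convert(child))
    return out


def html_to_xml(xhtml: str) -> str:
    """Parse a well-formed HTML fragment and serialize the renamed tree."""
    return ET.tostring(convert(ET.fromstring(xhtml)), encoding="unicode")


if __name__ == "__main__":
    sample = "<section><h2>Methods</h2><p>We used <em>open</em> tools.</p></section>"
    print(html_to_xml(sample))
```

Because the semantic structure is already present in the HTML, the conversion reduces to a lookup table plus a tree walk, which is the kind of step that can be automated and then spot-checked during quality assurance.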

Adam Hyde illustrated the differences between HTML and docx in a recent blog post on his own website:

…the following screenshots that more viscerally illustrate the benefits of this strategy. Displayed below are two screenshots of source code. The first is the markup of a docx file, followed by a screenshot showing the result of converting that docx to HTML using the HTML Typescript converters (XSweet) that we built.

Keep scrolling!
