Blog
Wax Show ‘n’ Tell
Hindawi Limited & Coko Announce Partnership
Coko & eLife partner on first PubSweet fueled journals submission & peer-review platform
Seeding a New Ecosystem: open infrastructure
Take Editoria for a spin
Making decisions in a small team and keeping it fun
A look at the future of journals with xpub
Editoria 1.1: Meet the Automagic Book Builder
A sneak peak at what’s next for PubSweet
Travel the long and winding road to PubSweet
Ink 1.0 is here!
Baby steps to user-centric open source development
Why we’re all in open source now
Getting Started with Coko
Editoria 1.0 preview
Preprints won’t just publish themselves: Why we need centralized services for preprints
INK – the file conversion engine
How we’re building the ‘mountain chalet’ of complex conversions
Sowing the seeds for change in scholarly publishing
Open Source Alliance for Open Science
Editoria Newsletter Out Now!
INK client upgrade
All About INK (explained with cake)
Track Changes (Request for Comments)
Book on Open Source Product Development Method Released!
Italics, Buenos Aires and Coko?
Editoria Update
Where we are with File Conversion
A Typescript for the Web
Coko Celebrates Year One
Editoria – Scholarly Monograph Platform
Adam Hyde’s Blog
Introducing Christos
Introducing Yannis
New PubSweet release
Attribution in Open Source Projects
Open Source for Open Access
Reimagining Preprints: a new generation of early sharing
Introducing Stencila and Nokome Bentley
Reimagining Publishing
Introducing Charlie
PubSweet 1.0 “Science Blogger” alpha 2
PubSweet 1.0 “Science Blogger” alpha, INK 1.0 alpha RELEASES!!!
Collaborative Product Development
Publishing for reproducibility: collaborative input and networked output
Substance Consortium
UCP & CDL Announcement
Release 0.2.0 is here!
CKF receives funding from the Gordon and Betty Moore Foundation to transform research communication
Technology Slows Down Science
[tech post] CSS and Drop Caps
Vote for the pubsweet logo!
Introducing Substance
Digging Collaboration and Cooperation: Code for a New Era
Coko 2015
PubSweet 0.1 Release
Coko Resources
Making science writing smarter
What I Have Learned About Building Community
Introducing the Tech Team
Knowledge and Communication
PKP and CKF Strategic Alliance
CKF Launches
December 5, 2016

A Typescript for the Web

In the days of the typewriter, a typescript was a typed copy of a work. The typescript copy was used for improving the document through the editorial process – for reviewing, commenting, fact and rights checking, revision etc.

By Schlesinger Library, RIAS, Harvard University [No restrictions], via Wikimedia Commons
By Schlesinger Library, RIAS, Harvard University [No restrictions], via Wikimedia Commons
For many years, since the advent of desktop publishing, the hand-typed typescript has been replaced by Microsoft Word documents. This advanced the publishing world quite a bit. Emailing typescript in the form of Word documents was far easier than mailing paper copies, and making revised copies moved from being an arduous typing exercise to the one click ‘save as’. There are obvious efficiencies.

But now we are in the age of networked documents. We have the opportunity to make another paradigmatic workflow change – in a sense: bringing typescript to the browser. With this evolution, we can bring an end to many of the frustrations of emailing Word documents around for comment and revision – while exploring wider options for collaboration, innovating more interesting content types, more easily understanding and managing a document’s history, using automated typesetting and data exchange. All of these represent cost and time savings for the publishing process and have the potential to move the communication of research beyond the limited paradigm of the manuscript.

However, migrating typescript formats away from the desktop to the web has proven to be very difficult. One of the major problems is that while there is a growing move towards online authoring environments, many authors still start in Word, and many previous attempts have shown that MS Word is not an easy file format to convert to other formats. Fortunately, we believe, if we think of MS Word as a software for “typescript preparation,” that HTML is a viable option for conversion from Word’s  .docx format into a typescript format ready for the web.

HTML works out well as a format for these purposes due to features that are more commonly considered weaknesses. How so? Well, first we must consider that, at early stages of a publishing workflow, a Word document will not have achieved its final structure, or indeed, much of any structure at all. Nonetheless, attempts to ‘get out of Word’ have tried to jump from unstructured Word to very structured XML formats by:
  1. copying over all the data in the document and
  2. interpolating structure at the same time in an attempt to ‘understand’ the “intent of the author” (or a proxy) as represented by the display semantics of the document. 
But if the structure does not exist in the first place, you have a problem. It is a little like going, as Peter Murray Rust has previously commented, from Hamburger to Cow.

However, with the XSweet Word-to-HTML scripts we are developing, we retain step one (copying over all the data) and replace step two (interpolating structure) with a process that forsakes the introduction of any structure in favor of carrying over whatever information from the original Word file might be useful later on, whether for programmatically understanding or manually applying document structure. The best solution of course, being a little bit of both.

Hence we:
  1. convert the unstructured Word document into an unstructured (or partially better structured) HTML document; and
  2. interpret the original Word file and carry forward as much information as that original Word file contained for possible future use in interpreting the document structure – or representing any features of interest – while not actually structuring the document.
Interestingly, since HTML does not require you to enforce a strict document structure if you do not have it, an unstructured document flows into HTML as easily as a structured, or partially structured, document flows into it. If your aim is a well-controlled document format, such a failure to enforce strict structures is regarded as a flaw or weakness. Yet, since we do not have much or any document structure in the originating Word document, and our goal is to improve it – HTML’s flexibility becomes a critical feature.

This process produces an intermediary carrier format — an information- rich, sanitized HTML format that is suitable for improvement. In an HTML editor, we can then bring to bear further tools for adding structure, reviewing, commenting, revising. Hence, there is a kind of similarity to its historical predecessor – which is why we are using the working title ‘HTML typescript’.

If the decision is to improve the structure programmatically, the Coko INK platform comes in nicely. INK enables developers to chain conversion steps together. Hence, a developer can use an XSweet “docx extraction” to get to HTML typescript, and then write (or reuse) another step, in the language of their choice, to do structural interpolation. By separating concerns here, XSweet under INK allows developers to provide structural interpolations that suit the kinds of content their organisations are working with, in the way that makes most sense for them.

When manual improvement is required, we can use sophisticated editors like those built with the Substance libraries. These editors manage strict control of the source and ensure that the manually applied structure is semantically robust. This also means that we can apply explicit structure to the text and co-relate that to the other output formats we require, such as JATS XML.

It is not a revolutionary new approach, rather a shift in emphasis. By avoiding over-complicating the conversion as described above, we have been able to make faster headway – and to turn our focus to the tools required to support the editorial processes via a web-based typescript.

Post by Wendell Piez and Adam Hyde.