January 24, 2017

Where we are with File Conversion

There has been quite a bit of interest in our HTML-first strategy and how we manage file conversion. The questions range from ‘isn’t converting from MS Word to HTML impossible?’ to which technologies we use and what our general approach is.

Well, we are pleased to report that conversion from MS Word to HTML isn’t impossible. In fact, it is relatively easy (many people before us have done it). What is tricky is doing it right. This is where we have spent a lot of thought and development time. Our strategy comes down to two key approaches:

  1. HTML Typescript – this is the name we are giving to our technical approach to MS Word to HTML conversion. Many file conversion experts baulk at the prospect of this conversion pipeline because MS Word documents are fundamentally very loosely formatted. That is a problem. At the other end of the pipeline, we know that scholarly publishers want to be able to distribute content in highly structured XML. So how do we get from one to the other? Essentially by using HTML as an intermediary. The very flexible nature of HTML enables us to convert from Word to an equivalently unstructured HTML file, infer (programmatically) as much structure as possible, and then bring the partially structured content into a sophisticated (substance.io-based) web-based editor to add the remaining structure. If you would like to read more about this approach read here and (in more detail) here. The docs for XSweet (the MS Word to HTML conversion scripts made by file format guru Wendell Piez) are also a good place to start.
  2. INK – INK is the other important part of our answer. It is a web-based service for managing conversions. Actually, it can handle a lot more than conversions, but this is, at present, its primary use case. Because INK is a web service, other platforms (like Editoria) can throw a file at it (e.g. in MS Word) and ask it to convert it to another format (e.g. HTML). INK does this very cleverly, managing incoming requests (in case a lot of them arrive simultaneously) as well as fallbacks and error reporting. The architecture of INK is, due to the hard work of Charlie Ablett, very smart. All ‘converters’ (processors really, since INK can do a lot more than conversion) are in fact individual steps. Each step does one thing and is written as a plugin to the main INK framework (these plugins are Ruby gems, for the technically minded). So the plugins can be built and shared, which is nice, but most importantly, INK enables these individual steps to be chained together into multi-step processing pipelines. Why is this important? Imagine first converting that horrible MS Word file to HTML. That’s a good start, but then imagine the HTML going through another step that extracts metadata from it. This metadata could then be used, for example, to automatically populate the metadata fields of your submission system, substantially reducing the work required of authors (pre-submission). It also means the stored metadata is consistent with the information in the manuscript (we can also automatically normalise both with an additional step to catch any errors). Now you can possibly see why these steps are interesting. By linking them together, you can construct very interesting processing pipelines and, if you are community-minded, you can share these steps with others. There is much more to INK, but I will leave that to Charlie to tell you soon, as we will have an exciting new release of INK out in the next few weeks.
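
The chaining idea can be sketched in a few lines. This is only an illustration in Python, not INK's actual code or API (INK's real steps are gems). The names `Document`, `infer_headings`, `extract_title` and `run_pipeline` are hypothetical, the input string stands in for the loosely structured HTML a Word conversion might produce, and the "an all-bold paragraph is probably a heading" rule is an assumption made purely for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Payload handed from step to step: the HTML plus any metadata found so far."""
    html: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A step is any function that takes a Document and returns a new Document.
Step = Callable[[Document], Document]


def infer_headings(doc: Document) -> Document:
    """Hypothetical heuristic: treat a paragraph that is entirely bold as a heading."""
    html = re.sub(r"<p><b>([^<]+)</b></p>", r"<h1>\1</h1>", doc.html)
    return Document(html, dict(doc.metadata))


def extract_title(doc: Document) -> Document:
    """Record the text of the first <h1> as the title, if there is one."""
    metadata = dict(doc.metadata)
    match = re.search(r"<h1>([^<]+)</h1>", doc.html)
    if match:
        metadata["title"] = match.group(1).strip()
    return Document(doc.html, metadata)


def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    """Run the steps in order, feeding each step's output into the next."""
    for step in steps:
        doc = step(doc)
    return doc


# Stand-in for the unstructured HTML that comes out of a Word conversion.
loose_html = "<p><b>On the Origin of Files</b></p><p>Body text.</p>"
result = run_pipeline(Document(loose_html), [infer_headings, extract_title])
print(result.html)
print(result.metadata)
```

Each step only needs to know about the shared payload, not about its neighbours, so steps can be reordered, dropped, or swapped for someone else's without touching the rest of the chain. Here the metadata step works only because the structure-inference step ran first, which is the kind of dependency that chaining makes easy to express.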
Together, these two form the mainstay of our approach to file conversion. It is the tip of the iceberg, as I haven’t broken down a lot of the other things we are working on in this field: HTML to JATS conversion and rendering beautiful PDF from HTML, for example, and much more. All this will come in following posts. For now, we are lucky to have experts with many years of experience working on all parts of this. After many years of working on exactly these issues, I can tell you it is a pleasure to be working with such folk and, at last, doing it the right way.