
I would like to write a long text in a structure that allows a set of operations on that text. Which structure or format best suits the use I plan to make of it?

Here is the intended use:

  • I would like to write text in natural language, possibly with translations to several languages. Translations would simply be the same structure with different data (text).
  • I would like to keep that text in a VCS, check diffs, branch and merge, etc. The structure should fit this use well.
  • I would like to keep the text free of as much clutter as possible so that it is human-readable.
  • I would like to easily convert the text to other formats, not necessarily many, but at least HTML and PDF.
  • I would like to be able to manipulate that text easily, for instance changing the order of some elements, filtering them, etc. based on metadata in that text.
  • Metadata is data too: it may be printed or not, and it may be printed in different ways.

Here are the main options I have considered so far:

  • LaTeX: basically a language designed for this task. The problems I see are that it is not as readable as other options, for instance Markdown, and that it is not really structured text. The text is there, and the metadata about formatting options and so on can be separated out with a set of macros, but changing the order of elements requires either parsing the source or defining all the text as macros so that only the order of the macro invocations needs to change. It's great for what it does, but becomes clumsy where it falls short in some feature, such as structuring the text. I don't see a good separation between control information and data to be printed. It is a very good option for converting to PDF.
  • XML: the structure is fairly good, but in this context I see no advantage in using XML when HTML could be used instead, since HTML provides the same features and some more.
  • HTML: the conversion to HTML would be immediate in this case, but the conversion to PDF is not so clear. In terms of human readability Markdown may be better, but HTML is probably the most widespread language for the task at hand. There are supporting languages like CSS (Less, Sass, too many options) that can make life easier, JavaScript can handle it, anyone with a browser can easily read it, etc. Maybe some restricted HTML could be converted to good-quality LaTeX and from there to PDF, I don't know.
  • Markdown: a very good option in terms of readability, but I'm uncertain about how it could be manipulated, maybe through conversion to HTML and then DOM manipulation or any other processing that can be done on XML and thus on proper HTML. I'm also uncertain about how flexible it is for defining metadata (for instance, marking a paragraph as a summary of other paragraphs), which is easily done in XML or HTML via classes or other attributes.
  • JSON: most languages include a JSON parser, so it is very friendly to programming languages and easy to manipulate. Obviously some convention would have to be defined on top of JSON, but the same holds for the rest of the options, including LaTeX (macros).
  • CoffeeScript: this removes some clutter from the usual JSON, it may be more readable and can be converted into JSON easily.
  • Mixing: the problem with JSON and CoffeeScript is that the structure holding the contents is very flexible (maybe too flexible) but does not naturally support inline annotations. A possible solution is to use Markdown or HTML for the text fragments themselves, covering bold text or whatever else may be needed.
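
To make the mixed option more concrete, here is a minimal sketch in Python. It assumes a made-up layout in which each section is a JSON object with an `id`, a `kind`, an `order`, and a Markdown `text` per language. None of these field names come from an existing standard; they are only there to show filtering and reordering by metadata.

```python
import json

# Hypothetical source: structure and metadata in JSON, inline markup in Markdown.
SOURCE = """
[
  {"id": "intro", "kind": "body", "order": 2,
   "text": {"en": "We **value** open review.",
            "es": "Valoramos la revisión **abierta**."}},
  {"id": "summary", "kind": "summary", "order": 1,
   "text": {"en": "A short summary.", "es": "Un resumen breve."}},
  {"id": "note", "kind": "draft", "order": 3,
   "text": {"en": "Unfinished idea."}}
]
"""

def select(sections, lang, exclude_kinds=("draft",)):
    """Drop excluded kinds and untranslated sections, then sort by 'order'."""
    kept = [s for s in sections
            if s["kind"] not in exclude_kinds and lang in s["text"]]
    return sorted(kept, key=lambda s: s["order"])

def to_markdown(sections, lang):
    """Join the selected Markdown fragments into a single document."""
    return "\n\n".join(s["text"][lang] for s in sections)

sections = json.loads(SOURCE)
chosen = select(sections, "en")
markdown_en = to_markdown(chosen, "en")
print(markdown_en)
```

The resulting Markdown could then be handed to a converter such as pandoc to get HTML or PDF, although whether that is good enough depends on how much control over the output is needed.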

The objective is to write a manifesto, or something that looks like a manifesto and evolves over time. This is based on some ideas that recommend using a VCS. The point is to have a structure that allows writing once and publishing as many times as needed and in different ways (blog posts, PDFs, etc.). A lot of effort goes into reaching consensus on the text, so rewriting and rewording it for each output does not seem like a good idea. This rules out some other nice options, like a wiki, but it would be nice to have the text structured in such a way that a set of wiki-like pages could be built from the source data.

In the end the technology may not be there yet, but I think it is not too far. There are actually so many options that a clever use of some of them should be enough.

gnat
Trylks

1 Answer


One thing I notice about the options you mention is that most of them can easily be converted into one another (assuming you follow some predefined rules when writing the document).

Although LaTeX is easily compiled to PDF, I would advise against it, as it contains too much information about the actual presentation and too little structure.

In many cases I would say HTML also reflects the presentation too much (although it is of course possible to minimize this).

I would probably define an XML schema and write the document as XML. That way it is easily converted to all the other formats (especially HTML). Depending on the complexity of your documents, you might want to use a predefined schema such as DocBook (although I have little experience with it):

http://en.wikipedia.org/wiki/DocBook

Further, you can use tools such as XML transformations (XSLT) to handle some of the conversions (if you like). This makes it really easy to convert a well structured XML document to HTML for instance.
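
As a rough illustration of this kind of conversion, here is a sketch that uses Python's standard `xml.etree.ElementTree` rather than XSLT, since the Python standard library does not include an XSLT processor. The element names `manifesto`, `section`, `title` and `para` and the `kind` attribute are invented for the example; they are not DocBook.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using a small made-up schema (not DocBook).
SOURCE = """
<manifesto>
  <section kind="summary"><title>Summary</title><para>Short version.</para></section>
  <section kind="body"><title>Principles</title><para>Write once.</para><para>Publish often.</para></section>
</manifesto>
"""

def to_html(xml_text):
    """Map each <section> to a <div> whose class is the section's 'kind'."""
    root = ET.fromstring(xml_text.strip())
    body = ET.Element("body")
    for section in root.findall("section"):
        div = ET.SubElement(body, "div", {"class": section.get("kind", "body")})
        ET.SubElement(div, "h2").text = section.findtext("title", default="")
        for para in section.findall("para"):
            ET.SubElement(div, "p").text = para.text
    return ET.tostring(body, encoding="unicode")

html = to_html(SOURCE)
print(html)
```

An XSLT stylesheet would express the same mapping declaratively; the point here is only that once the structure is explicit, the conversion is a short walk over the tree.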

All that being said, many of the remaining choices would be viable as well. In the end it depends on your own taste (personally, I don't find JSON very appealing for such a job, though others might) and on the complexity you want to represent (Markdown might not be good if you need very fine-grained control).

As long as you actually structure the document, conversion to many other formats should be rather trivial.

nilu
  • XML may be fairly suitable for conversion to HTML (XSLT) and to PDF (XSL-FO), but one primary use is to have something human-readable for collaboration. XML is far from the readability of Markdown, for instance. If there were a terser serialization for XML, that would be great, something like what Turtle is to RDF/XML, or CoffeeScript to JavaScript. I failed to find anything in that direction; all I found is YAML and SDL. I'm starting to feel inclined to make my own syntax, which is probably the worst option. – Trylks Mar 31 '13 at 15:29
  • Obviously, Markdown is very easy to read. Still, I do find XML rather easy to read and write if structured appropriately. Nevertheless, there certainly is some conflict between structure and minimalism (readability). If you can live with the features that Markdown supports, it should be OK to use it; after all, it is rather easily converted to XML either way. But it seems the main problem is the editing/collaboration part, since reading is easy once the text is converted into something else. Maybe even the right choice of editor/tools would remedy the problems to some extent. – nilu Mar 31 '13 at 15:39