
I need a textual, human-readable format that is reasonably compact and version-control friendly, to serialize a persistent memory heap. My Bismon system (GPLv3) has such a format: it is textual, human-readable, git-friendly, and occasionally editable under emacs, but it is specific to Bismon, and it is usually loaded and dumped by the bismon program. That format is documented in the Bismon technical draft report (please skip the first few pages of H2020 bureaucracy), chapter §2, Data and its persistence in Bismon. For an example of a file using that format, look into Bismon's store1.bmon (and the other store*.bmon files).

I am considering whether such a format would better be JSON-like (but I am not sure), simply because many developers are familiar with JSON.

The JSON format requires object keys to be quoted strings, e.g. { "x":1, "y":2 }.

I am thinking of an application (maybe RefPerSys, which conceptually could become a Bismon done right) where a JSON notation is very useful (and where a human-readable textual file format is essential), but where we deal only with JSON objects whose keys are always C-like identifiers (each starts with a Latin letter and contains only letters, digits, and underscores). However, that application may need to parse perhaps a million such objects, so parsing performance matters a little and, more significantly, file space matters a lot (since {x:1,y:2} needs only 9 bytes, but {"x":1,"y":2} requires 13 bytes, i.e. over 40% more space). My exact goal is any textual, human-readable, quickly and easily machine-parsable, tree-structured, compact, version-controllable (i.e. git-friendly) format. Most of the time it is dumped and loaded by the same application. Occasionally, I may need to glance at it with some editor, and perhaps even change a small bit of it with that editor. I do not expect to need a generic JSON transformer or processor like jq.
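The byte counts above can be checked, and extended to larger objects, with a few lines of Python. This is only a sketch: the helper names `count_keys` and `key_quote_sizes` are made up for illustration, and it assumes that every key is a plain ASCII identifier, so that dropping the quotes saves exactly two bytes per key.

```python
import json

def count_keys(value):
    """Count every object key in a nested JSON-like Python value."""
    if isinstance(value, dict):
        return len(value) + sum(count_keys(v) for v in value.values())
    if isinstance(value, list):
        return sum(count_keys(v) for v in value)
    return 0

def key_quote_sizes(value):
    """Return (strict_size, unquoted_size) in bytes for compact output.

    Assumption: keys are plain identifiers, so removing the two quote
    characters around each key is the only difference between the formats.
    """
    strict = json.dumps(value, separators=(",", ":")).encode("utf-8")
    return len(strict), len(strict) - 2 * count_keys(value)

strict_size, unquoted_size = key_quote_sizes({"x": 1, "y": 2})
print(strict_size, unquoted_size)  # 13 9
```

Running the same function on a realistic object with long values shows how much smaller the relative saving becomes once keys are no longer the dominant part of the text.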

But my feeling is that, when the keys are C-like identifiers (and different from the three JSON keywords true, false, null), the quotes could be omitted, as for example in {x:1, y:2}. I also understand that some JavaScript implementations might be able to parse that.

I am guessing that parsing {x:1,y:2} is faster than parsing { "x":1, "y":2 } or even {"x":1,"y":2} (simply because the textual representation is slightly shorter), especially when we deal with millions of such JSON objects.

In a Bismon or RefPerSys like system, a possible example could be:

{ oid: _7T9OwSFlgov_0wVJaK1eZbn,
  name: word,
  mtime: 1502296590.98,
  class: _7T9OwSFlgov_0wVJaK1eZbn,
  attrs: [ { at:  _01h86SAfOfg_1q2oMegGRwW, va: "for words" } ]
}

(currently, in commit ff19f15ecd2f647d42 of Bismon, the equivalent is in lines 1011 and following of store1.bmon; the | there delimits comments and these comments like |=word| there could be removed, since skipped at parsing; the comments in these dumped and loaded *.bmon files will be removed once Bismon is stable enough)

In a few years, I could have many millions of such JSON objects. The bismon program is a server, started every morning (it then loads its persistent state in textual format) and stopped every evening (it then dumps its persistent state in textual format). So taking one or a few minutes to load, and one or a few minutes to dump, a large persistent state is definitely acceptable. But the git-committed disk size of that textual persistent state is more of a concern (since both gitlab and github are unhappy with large textual files).

Since humans will very rarely look into the textual persistent store (as rarely as a compiler writer looks into generated assembler, or as rarely as the sqlite team looks into huge *.sql files), I value compactness and git-friendliness over readability. So I could even consider something as compact as:

 {oid:_7T9OwSFlgov_0wVJaK1eZbn,nam:word,mti:1502296590.98,
  cla:_7T9OwSFlgov_0wVJaK1eZbn,
  att:[{a:_01h86SAfOfg_1q2oMegGRwW,v:"for words"}]}

or even the same in a single line. However, being occasionally able to git diff is valuable.

In other words, I like the JSON model very much, but its concrete syntax less so. Patching most JSON libraries to accept such a simplified syntax is very probably trivial work.
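To illustrate how small the gap to strict JSON is, here is a minimal Python sketch that rewrites the simplified syntax into strict JSON and then hands it to the standard `json` module. The function names `relaxed_to_json` and `load_relaxed` are hypothetical, and the sketch makes one assumption that goes beyond the question: bare identifiers used as values (such as the oids in the examples above) are loaded as plain strings.

```python
import json
import re

# One token is a double-quoted string, a number, or a bare C-like identifier.
# Strings and numbers are matched only so that they are skipped unchanged
# (otherwise the "e5" in 1e5 would be mistaken for an identifier).
TOKEN = re.compile(
    r'"(?:[^"\\]|\\.)*"'                  # JSON string
    r'|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?'  # JSON number
    r'|[A-Za-z_][A-Za-z0-9_]*'            # bare identifier
)
KEYWORDS = {"true", "false", "null"}

def relaxed_to_json(text):
    """Quote every bare identifier that is not a JSON keyword."""
    def quote(match):
        token = match.group(0)
        if token[0].isalpha() or token[0] == "_":
            return token if token in KEYWORDS else '"' + token + '"'
        return token  # string or number: leave as is
    return TOKEN.sub(quote, text)

def load_relaxed(text):
    """Parse the simplified syntax by converting it to strict JSON first."""
    return json.loads(relaxed_to_json(text))

record = load_relaxed(
    '{oid:_7T9OwSFlgov_0wVJaK1eZbn,nam:word,mti:1502296590.98,'
    'att:[{a:_01h86SAfOfg_1q2oMegGRwW,v:"for words"}]}'
)
print(record["nam"], record["att"][0]["v"])  # word for words
```

A regex pass like this costs an extra scan over the text, so it is a way to experiment with the format rather than a replacement for a dedicated parser.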

This brings three questions:

  • What is the exact name of this common variant of JSON whose keys are only C-like identifiers? (The YAML specification seems to still call it JSON, although it is not exactly JSON but something very close to it.) While that format is not exactly JSON, it is very JSON-like (and the conversion to exact JSON is trivial, assuming a parsing library exists for it).

  • Which open source C or C++ libraries deal with that format (for Linux/x86-64)? I am guessing that adapting the source code of a JSON parsing library to that special case is trivial. But I really want to avoid forking one.

  • Can recent Web browsers (Firefox or Chrome) efficiently parse {x:1,y:2} as JSON? I tend to believe so (since that notation is exactly compatible with JavaScript).

This GIT and YAML answer could be relevant.

And I just discovered HJSON which might be what I want.


PS. I can avoid any set of given C keywords or identifiers in the keys, if I have such a list of forbidden or reserved names. In particular, I will avoid every JavaScript or C++ keyword (like for or auto or while) for key names. The only platform I care about is Linux (currently x86-64).

PPS. Another application where a human-readable textual file format is essential is my Bismon project (a persistent reflexive monitor for static source code analysis, under GPLv3+ license), and I explain why in the Bismon draft report (that is an H2020 draft deliverable, so please skip the first few pages of H2020 bureaucracy). I have chosen in Bismon to have my own human-readable textual format, but that particular choice might have been a big mistake, and I probably should have used some JSON-like one (or even JSON itself), as suggested in this question. The RefPerSys project might become a "Bismon done right" project. And the persistent data of Bismon (e.g. its store2.bmon textual file) is git-version controlled and occasionally hand-edited (but most of the time, loaded and dumped by bismon itself). So, yes, there are cases where a 20% space difference matters for textual data: on both gitlab and github, a textual version-controlled file of 700 Kbytes and one of 1.1 Mbytes are presented very differently. In Bismon, the store2.bmon file is already shown only in raw format.

Basile Starynkevitch
  • Your premise is that your pseudo-JSON will be faster because it saves two bytes per entry. This reeks of micro-optimization. If you want maximally compact representations you may prefer non-textual formats. However, any parser is likely to be slower than a browser's built-in parser. – amon Apr 07 '19 at 09:51
  • The more important motivation is file space (or data volume, i.e. bandwidth). 40% is not that insignificant. Maybe in the real world it would be 20%, but that is still significant IMHO. For reasons not mentioned here I value a human-readable format. – Basile Starynkevitch Apr 07 '19 at 12:45
  • I count more like a max 33% asymptotic reduction for the smallest possible JSON entry (single byte key, single byte value, incl. comma at the end). For real data, much less. If you're more concerned about space than speed, gzip compression is a better answer. – amon Apr 07 '19 at 12:50
  • I value a lot the fact that the data is `git`-versioned, and so using `gzip`-compressed data is not recommended. However, in practice, `gitlab` or `github` prefer not-too-big textual files, so a 20% difference *is* significant. – Basile Starynkevitch Apr 07 '19 at 12:58
  • A goal “compact textual representation for version control” is very different from “compact pseudo-JSON that works in the browser”. Which is it? Perhaps you really want to store your data in a compact ad-hoc format, or as CSV (only 2 bytes overhead per entry, it doesn't get better than that unless the data has fixed width, or other characteristics that make a key-value separator unnecessary). – amon Apr 07 '19 at 13:02
  • The exact goal is a structured, compact textual representation which is version-control friendly. I have improved my question to mention that goal – Basile Starynkevitch Apr 07 '19 at 13:04
  • Do you have a verifiable performance issue? If not, any optimization is premature. Start with a reasonable choice that matches the requirements and is not too expensive to implement (native JSON libraries are often a good choice). Gain experience, measure total application performance and memory needs. Calculate how much you can save by tweaking the format in the way you propose. (Hint: it's not nearly 40% mem and likely less than 1% run time). Weigh the possible gains against the disadvantages of having an incompatible format that needs to be maintained by you. – Hans-Martin Mosner Apr 07 '19 at 14:15
  • Any application-specific JSON format needs to be maintained: if your application expects `{"x":1,"y":2}` and that later becomes `{"X":1,"Y":2}`, you do need to change it. And JSON (or a JSON-like format) and JSON libraries are so simple that it does not matter that much. What really matters is how you use it (what JSON schema you have), and that stays application-specific. But a 15% difference in file size does matter for me – Basile Starynkevitch Apr 07 '19 at 14:19
  • Your question is very dense and a bit hard to understand. In a comment you wrote that your main goal is to invent a _"structured compact textual representation which is version control friendly."_ - that statement should probably be the very first statement of the question. – Bryan Oakley Apr 07 '19 at 14:36
  • AFAIK, `bson` is not at all `git`-friendly. How does a `bson` file look in `github` or `gitlab`? – Basile Starynkevitch Apr 07 '19 at 16:36
  • @Hans-MartinMosner: my main issue is not parsing performance (I guess it is good enough), it is file size and `git`-friendliness. Look at how differently `github` is showing my [store1.bmon](https://github.com/bstarynk/bismon/blob/master/store1.bmon) and [store2.bmon](https://github.com/bstarynk/bismon/blob/master/store2.bmon) – Basile Starynkevitch Apr 07 '19 at 17:20
  • @BasileStarynkevitch when dealing with git, file size isn't the only parameter. Files are compressed when committed, and I'd estimate that in compressed file size the difference is pretty small. However, every version of a file is stored as a new compressed file in the repository. This means that when you change only some records, one big file will cause more issues than many smaller files, most of which will stay constant. As git is designed for source code management, it works best when changes are localized into few files. – Hans-Martin Mosner Apr 07 '19 at 17:34
  • Yes, but that favors big textual files, not many small ones. And philosophically, the textual persistent store *is* source code. I am typing some kind of source code (today in a GTK window, in a few months in my browser) and it gets persisted in that textual persistent store. – Basile Starynkevitch Apr 07 '19 at 17:37
  • Depending on the insert/update/delete patterns of records, you may be able to cluster them into files such that only the most recent files ever need to be persisted in several versions. For a standardized JSON-like format yaml might indeed be a better choice because you can omit quotes in many places, but don't get too focused on the space savings. In your realistic examples with longer keys and values, savings get lower, down to a few percent. – Hans-Martin Mosner Apr 07 '19 at 17:39
  • So you recommend using ordinary JSON in practice? – Basile Starynkevitch Apr 07 '19 at 17:40
  • What other formats have you considered besides JSON? HDF5? Proto (the descriptor, not the binary), ConfiFile? – Laiv Apr 08 '19 at 19:31

3 Answers


Keys in a JSON dictionary are not quoted strings, they are strings. Strings in JSON start with a quote, continue with escaped or unescaped characters, and end with a quote. You can’t have different JSON. You can define a different exchange format, but it won’t be JSON and you are completely on your own.
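As a quick check of this point, a standard JSON parser rejects unquoted keys outright. The sketch below uses Python's `json` module purely as a representative parser; the helper name `parses_as_json` is made up for illustration.

```python
import json

def parses_as_json(text):
    """Return True if Python's json module accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

print(parses_as_json('{"x":1,"y":2}'))  # True
print(parses_as_json('{x:1,y:2}'))      # False
```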

gnasher729
  • It would be *extremely* close to JSON, and is the native JavaScript syntax for its "literal" objects. So it is very JSON-like – Basile Starynkevitch Apr 07 '19 at 13:01
  • And my question includes "how is such nearly JSON format named" – Basile Starynkevitch Apr 07 '19 at 13:16
  • This is about software engineering. From a software engineering point of view, a format "extremely close to JSON" is a very bad idea. A format with arbitrary restrictions on the keys is also a very bad idea. – gnasher729 Apr 07 '19 at 21:33
  • Why would a format close to JSON be worse than an entirely ad-hoc format? I see on the contrary similarity to JSON as an advantage (because most developers know about JSON) – Basile Starynkevitch Apr 08 '19 at 06:58
  • Being similar but different is the problem, because all those developers who know about JSON _think_ they know about this format, but they don't. – gnasher729 Oct 25 '19 at 20:44

It sounds like you are looking for a data-serialization format that is human-readable and version-control-friendly but not as strict about quotes as JSON.

Such formats include:

  • Relaxed JSON (RJSON) (simple keys and simple values generally do not require quotes)
  • Hjson (simple keys and simple values generally do not require quotes)
  • YAML (keys and values generally do not require quotes)
  • JavaScript object literal (also printed out by many implementations of "console.dir()" when passed a JavaScript object; simple keys generally not required to be quoted, but string values must be quoted by either single quotes or double quotes)

for completeness:

  • JSON (requires double-quotes around keys, also called property names, and requires double-quotes around string data values).
David Cary

You have a file-system. It has the unique ability to store multiple files, in sub-directories.

This will solve:

  • git large files and versioning, the diffs will be much clearer, the files themselves much smaller.
  • human observation. The git tree provides a higher level communication for changes, and having a path structure allows humans to more easily reason about content.
  • IO. Being spread across several files can allow parallel parsing/generation.
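One way to spread objects across files is to bucket them by a short prefix of their identifier, much as git does for its own object store. The following Python sketch assumes that each record has an `oid` field like the ones in the question; the function names and the two-character prefix width are arbitrary choices for illustration.

```python
from collections import defaultdict

def shard_name(oid, width=2):
    """Name of the shard for an oid: its first characters after the '_'."""
    return oid.lstrip("_")[:width]

def group_by_shard(records, width=2):
    """Group records into shards; each shard could be dumped to one file."""
    shards = defaultdict(list)
    for record in records:
        shards[shard_name(record["oid"], width)].append(record)
    return dict(shards)

records = [
    {"oid": "_7T9OwSFlgov_0wVJaK1eZbn", "name": "word"},
    {"oid": "_01h86SAfOfg_1q2oMegGRwW", "name": "comment"},
]
for shard, members in sorted(group_by_shard(records).items()):
    print(shard, [m["name"] for m in members])
```

With such a layout, a change to one object touches only its shard file, so the diffs and the per-file sizes shown by a git hosting service stay small.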

As for a data language, go with the one which allows you to accurately and unambiguously store your fundamental types, and which supports the structures (map, sequence, set, etc.) that your data is expressed in terms of.

As this is a server and you are transmitting data to clients, I presume this is where you want to optimise your overheads. The problem is that you really need to understand the exact usage here. Compare downloading a 50K parser to handle 5K of messages with using a 0K parser (built into the platform) to handle 25K of messages. Your optimisation might be premature, but if you are going to optimise, go all the way and implement a binary format: it is the most efficient from a message perspective. Also do not forget to use channel-based compression.

I would not worry too much about data being stored into git, since it applies its own compression mechanisms. Also, as this data needs to be git and human friendly, the format will probably be pretty-printed. This does increase the overheads in terms of file size, but reduces git deltas and human cognitive load.
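The effect of compression on the quoted-key overhead can be measured rather than guessed. Git stores objects with zlib, so the Python sketch below compresses the same synthetic records in both notations and reports raw and compressed sizes. The record contents and the helper names are invented for illustration; real data will behave differently, so treat this only as a measuring harness.

```python
import json
import zlib

def dump_unquoted(value):
    """Serialize compactly with unquoted keys (keys assumed to be identifiers)."""
    if isinstance(value, dict):
        items = (f"{k}:{dump_unquoted(v)}" for k, v in value.items())
        return "{" + ",".join(items) + "}"
    if isinstance(value, list):
        return "[" + ",".join(dump_unquoted(v) for v in value) + "]"
    return json.dumps(value)

def sample_records(count):
    """Hypothetical records, only loosely shaped like the question's objects."""
    return [
        {"oid": f"_obj{i:06d}", "name": f"word{i}", "mtime": 1502296590 + i}
        for i in range(count)
    ]

def compare_sizes(records):
    """Raw and zlib-compressed byte sizes for both notations."""
    strict = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)
    unquoted = "\n".join(dump_unquoted(r) for r in records)
    return {
        "strict_raw": len(strict.encode("utf-8")),
        "unquoted_raw": len(unquoted.encode("utf-8")),
        "strict_zlib": len(zlib.compress(strict.encode("utf-8"))),
        "unquoted_zlib": len(zlib.compress(unquoted.encode("utf-8"))),
    }

for label, size in compare_sizes(sample_records(1000)).items():
    print(label, size)
```

Note that this measures only what git stores internally; the size limits that hosting services apply when rendering a file are based on the uncompressed text.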

Neither would I worry too much about data on disk, unless you are operating on a resource-limited device, in which case why aren't you using a binary format?

Kain0_0
  • Your comments are true and insightful, but do not answer my question. BTW, I was already aware of all you have said. – Basile Starynkevitch Apr 08 '19 at 07:11
  • I inferred that you were discussing the operational concerns of a data format. I'm glad that your estimation of my response was positive. But fair enough, direct answers. You've already answered the first question with several suitable answers: JSON, YAML, make it yourself. Any answer to the second question would be opinionated and could be trivially solved by a web search. Adapting such a library may not be trivial though. The third answer is *No, it is not JSON*. If you want to parse it in the browser download a parser for it, or live dangerously and use JavaScript. – Kain0_0 Apr 08 '19 at 07:54