15

For a project, I need to work with varying types of files from some old games and related software--configuration files, saves, resource archives, and so on. The bulk of these aren't yet documented, nor do tools exist to work with them, so I must reverse-engineer the formats and build my own libraries to handle them.

Although I don't suppose there's great demand for most of it, I intend to publish the results of my efforts. Are there any accepted standards for documenting file formats? Looking around, there are several styles in use: some, like the .ZIP File Format Specification, are very wordy; others, like those on XentaxWiki, are much more terse--I find some of them difficult to read; the one I personally like best is this description of the PlayStation 2 Memory Card File System, which includes both detailed descriptive text and several 'memory maps' with offsets and such--it also most closely matches my use case. It will vary a little for different formats, but it seems there should be some general principles that I should try to follow.

Edit: I seem not to have explained very well what I want to do. Let me construct an example.

I may have some old piece of software which stores its configuration in a 'binary' file--a series of bitfields, integers, strings, and whatnot all glued together and understood by the program, but not human-readable. I decipher this. I wish to document exactly what the format of this file is, in a human-readable way, as a specification for implementing a library to parse and modify this file. Additionally, I'd like this to be easily understood by other people.

There are several ways such a document might be written. The PKZIP example above is very wordy and mostly describes the file format in free text. The PS2 example gives tables of value types, offsets, and sizes, with extensive comments on what they all mean. Many others, like those on XentaxWiki, only list the variable types and sizes, with little or no commentary.
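To make the table-style approach concrete, here is a minimal sketch in Python for a purely hypothetical configuration file. The field names, offsets, sizes, and the `CFG1` signature are all invented for illustration and do not describe any real game. The idea is that one table can drive both the parser and the human-readable 'memory map', so the two are less likely to drift apart.

```python
import struct

# Hypothetical layout: (offset, struct format, field name, description).
# All values are little-endian; none of this describes a real format.
FIELDS = [
    (0, "<4s", "magic", "file signature, always b'CFG1'"),
    (4, "<H", "flags", "bitfield: bit 0 = sound on, bit 1 = fullscreen"),
    (6, "<H", "volume", "0-100"),
    (8, "<8s", "player", "ASCII name, NUL-padded"),
]

def parse_config(data):
    """Decode each field at its documented offset."""
    return {name: struct.unpack_from(fmt, data, off)[0]
            for off, fmt, name, _ in FIELDS}

def memory_map():
    """Render the same table as a human-readable memory map."""
    lines = ["offset  size  field   description"]
    for off, fmt, name, comment in FIELDS:
        size = struct.calcsize(fmt)
        lines.append(f"0x{off:04X}  {size:>4}  {name:<7} {comment}")
    return "\n".join(lines)

# A made-up 16-byte sample file matching the table above.
sample = b"CFG1" + struct.pack("<HH", 0b01, 80) + b"HERO\x00\x00\x00\x00"
config = parse_config(sample)
print(memory_map())
print(config)
```

Whether the descriptive comments live in such a table or in surrounding prose is exactly the stylistic choice I am asking about.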

I ask whether there is any standard, akin to a coding style guide, which provides guidance on how to write this kind of documentation. If not, is there any well-known excellent example that I should emulate? If not, can anyone at least summarize some useful advice?

Sopoforic
  • 1
    http://i.stack.imgur.com/4illz.jpg -> http://stackoverflow.com/a/769443/839601 – gnat Apr 05 '14 at 21:27
  • Ha! I know that feeling. One format I was looking at I actually had the original source code that wrote the file. The problem was that the variables were being written in a different order than in the struct definition, with some extra stuff sprinkled in between. And the comments were wrong about the offsets. It's part of what inspired this question--a strong desire to DON'T DO THAT. – Sopoforic Apr 05 '14 at 21:32
  • 1
    My only experience with documented reverse engineered filetypes is from wiibrew.org. If I remember correctly, they documented the file as a `struct`. It worked quite well. – MetaFight Apr 05 '14 at 21:52
  • 1
    I may be misunderstanding the question but it seems like you are looking for something like [EBNF](http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form). –  Apr 06 '14 at 00:22
  • @MattFenwick: BNF is for specifying the syntax of a language; not quite what I'm after. I'll edit to be clearer what kind of file format I mean. – Sopoforic Apr 06 '14 at 00:30
  • There's an important rule I like to follow in documentation, that I think you've hit on at least partially. "Although I don't suppose there's great demand for most of it...". On the contrary, there is exactly ONE person that you need to speak to above all others: your future self. Your future self may find that they hate some so-called standard because it's hard to read or didn't cover edge cases, etc. - so it had better be something that reflects what *you* find clear. Personally, my future self isn't a big fan of E/BNF - I'm not my compiler. – J Trana Apr 07 '14 at 04:11
  • @JTrana: Your point is well taken. Most of what I write seems to be _My Documentation for Me, by Me Myself_, and given the responses I'm getting here, it looks like 'however I feel like' is the style I'll be using. Hopefully, anyone else who reads it will share my particular brand of 'brilliance' and thus be able to decipher it. – Sopoforic Apr 07 '14 at 12:06
  • A [very similar question](http://reverseengineering.stackexchange.com/questions/1579/what-are-the-best-practice-methods-for-documenting-research-into-the-reverse-eng) was asked on the Reverse Engineering site, as it turns out, with more or less the same result as this: some formal machine-readable standards exist, otherwise do more or less what you see others doing. Deer Hunter's answer is as complete as I'm likely to get. – Sopoforic Apr 09 '14 at 15:58
  • 1
    Closed because the question is too broad... on a Q&A site specifically about software engineering?! I'm just going to leave this here https://www.joelonsoftware.com/2018/04/23/strange-and-maddening-rules/ ... – Cocowalla Apr 25 '18 at 19:21

2 Answers

6

A binary file is just a sequence of bits arranged into logical units according to certain rules. Such a set of rules is usually called a grammar. Grammars can be classified into four types (the Chomsky hierarchy), and for context-free grammars you should use Extended Backus-Naur Form, as pointed out by Matt Fenwick in his comment. The interpretation (or semantics) of the sequence stored in the file can be described verbally or with well-annotated sample programs serializing and deserializing the information.
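As a rough illustration of what such an annotated sample program might look like, here is a minimal Python sketch that serializes and deserializes one record. The record layout (a level, a score, and a length-prefixed name) is hypothetical and chosen only to show the idea; it is not taken from any real format.

```python
import struct

# Hypothetical save-record layout (little-endian), invented for illustration:
#   uint16  level
#   uint32  score
#   uint8   name_length
#   bytes   name (name_length bytes, ASCII)
HEADER = struct.Struct("<HIB")

def serialize(level, score, name):
    """Write the fixed-size header, then the length-prefixed name."""
    raw = name.encode("ascii")
    return HEADER.pack(level, score, len(raw)) + raw

def deserialize(data):
    """Read the header, then exactly name_length bytes of name."""
    level, score, name_length = HEADER.unpack_from(data, 0)
    start = HEADER.size
    name = data[start:start + name_length].decode("ascii")
    return level, score, name

# Round trip: what is written must read back unchanged.
blob = serialize(3, 12500, "HERO")
assert deserialize(blob) == (3, 12500, "HERO")
```

The layout comment at the top carries the syntax, while the function docstrings and the round-trip check carry the semantics a reader needs to reimplement the format.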

To learn more about documenting binary file formats, I suggest reading up on, e.g., the ASN.1 standard.

Deer Hunter
  • _Technically_, most config files have a context-free language, since they have a finite language. Practically, writing 'the set of all 2-byte strings' (e.g. for a config file that's just a 16-item bitfield) in EBNF doesn't teach anyone anything. The pointer to the ASN.1 standard is the closest thing to an answer I've gotten, though it seems a specification in ASN.1 is meant to be read by computers, and I wanted information for writing documentation for humans. However, if nothing more closely matching my requirements turns up, shortly, I'll accept this answer. Thanks for your assistance. – Sopoforic Apr 07 '14 at 11:48
2

That's odd, because a quick search for file formats brought up a Wikipedia article (List of file formats). It also includes several Video Game Data formats.

List of common file formats of data for video games on systems that support filesystems, most commonly PC games.

It also includes a large selection of Video Game Storage Media formats.

List of the most common filename extensions used when a game's ROM image or storage medium is copied from an original ROM device to an external memory such as hard disk for back up purposes or for making the game playable with an emulator. In the case of cartridge-based software, if the platform specific extension is not used then filename extensions ".rom" or ".bin" are usually used to clarify that the file contains a copy of a content of a ROM. ROM, disk or tape images usually do not consist of a single file or ROM, rather an entire file or ROM structure contained within a single file on the backup medium.


Are there any accepted standards for documenting file formats?

There is no "official" standard anywhere. Since the file formats are made by a company, the company decides on the format for the documentation.

Adam Zuckerman
  • 2
    I think you've misunderstood my question. Of course there are many file formats that have been documented--I mentioned XentaxWiki, which includes over 1500 of them. But the files I'm interested in are often not documented--game-specific things like save files or configuration, rather than general container formats, usually. My situation is that no documentation exists, and I intend to write some--so how shall this be done? – Sopoforic Apr 05 '14 at 22:36
  • The same way all those other file formats were documented. – Robert Harvey Apr 05 '14 at 22:38
  • 4
    @RobertHarvey: Confusing, conflicting, inaccurate, and incomplete? Seriously, though, as I mentioned, I noted several different general styles in use. I'm not familiar enough with work in this area to know if any particular style is to be preferred. The ones on XentaxWiki, the single largest resource I've seen, are almost exclusively for container formats, so they don't quite map to the more general case. If I thought that just picking a random example to emulate would be good enough, I wouldn't be asking for advice. – Sopoforic Apr 05 '14 at 23:41
  • @Sopoforic: Then you need to be clearer in your question what you want. Are you seriously asking us "How do I write documentation for a file format?" There are entire educational curriculums on technical writing that are devoted to that subject. Find a format that has clear, well-written documentation (according to your personal standards), and emulate that one. They can't all be crap. **Hint:** Usage examples are king. Clarity of explanation comes a close second. – Robert Harvey Apr 05 '14 at 23:59
  • 2
    @RobertHarvey: Yes, much like questions about how to comment your code or how to document a function, I am looking for a 'style guide' for writing a comprehensible format specification. If I want to know how to write an RFC, I can look at RFC 2223. If I want to know what style to use in Python code, I can read PEP 8. If I want to know How to Ask Questions The Smart Way, ESR has me covered. Is there some similar guidance for file format specifications? Or a well-known excellent example of one? I can surely use my own judgment, but if a standard exists, it'd be sensible to follow it. – Sopoforic Apr 06 '14 at 00:23