Is there a proper way to create a file format?

Question

I'm building a proprietary file format for an application I wrote in C# .NET to store save information and perhaps down the line project assets. Is there a standard on how to do this in any way? I was simply going to Serialize my objects into binary and create a header that would tell me how to parse the file. Is this a bad approach?

Whatever approach (from the answers) you choose, always include a version number in the format! Your question already suggests that it may change, and the version number will save you a lot of effort if you have to be backward compatible. — Jan Doggen, Mar 13 '13 at 10:32

score 11 · Accepted Answer · answered Feb 27 '13 at 02:07

The most straightforward method is probably to serialize your structure to XML using the `XmlSerializer` class. You probably wouldn't need to create a separate header and body structure; just serialize all assets into XML. This lets you inspect and edit your file structure outside of your own program, and it is easy to manage.
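As a rough illustration of the XML route, here is a minimal sketch of a round trip through `XmlSerializer`. The `SaveFile` type, its properties, and the `XmlSaveDemo` helper are hypothetical names made up for this example; only `XmlSerializer` itself comes from the answer.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical save-file type. XmlSerializer needs a public type
// with a public parameterless constructor.
public class SaveFile
{
    // Stored as an attribute on the root element so a reader can check it first.
    [XmlAttribute("version")]
    public int Version { get; set; } = 1;

    public string PlayerName { get; set; } = "";

    public List<string> Inventory { get; set; } = new List<string>();
}

public static class XmlSaveDemo
{
    // Serializes the whole object graph to an XML string.
    public static string ToXml(SaveFile save)
    {
        var serializer = new XmlSerializer(typeof(SaveFile));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, save);
            return writer.ToString();
        }
    }

    // Rebuilds the object from XML produced by ToXml.
    public static SaveFile FromXml(string xml)
    {
        var serializer = new XmlSerializer(typeof(SaveFile));
        using (var reader = new StringReader(xml))
        {
            return (SaveFile)serializer.Deserialize(reader);
        }
    }
}
```

The resulting text can be opened in any editor, which is the inspectability benefit described above.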

However, if your file structure is really complex, containing many different assets of different types, such that serializing the entire structure to XML is too burdensome, you might look at serializing each asset separately and compiling them into a single package using the Packaging library (`System.IO.Packaging`) in .NET. This is essentially how .docx, .xlsx, .pptx, and other Office file formats are constructed.
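To make the "separately serialized parts in one package" idea concrete, here is a small sketch. Note the substitution: it uses `ZipArchive` from `System.IO.Compression` rather than the Packaging library itself, because the Office formats are zip containers underneath and `ZipArchive` needs no extra reference; the Packaging API layers content types and relationships on top of the same idea. `PackageDemo`, `Pack`, and `ReadPart` are hypothetical names.

```csharp
using System.IO;
using System.IO.Compression;

public static class PackageDemo
{
    // Writes each named part as its own entry in a single zip container.
    public static byte[] Pack(params (string Name, byte[] Data)[] parts)
    {
        using (var buffer = new MemoryStream())
        {
            // The archive must be disposed before the buffer is read,
            // otherwise the zip directory has not been written yet.
            using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
            {
                foreach (var part in parts)
                {
                    ZipArchiveEntry entry = zip.CreateEntry(part.Name);
                    using (Stream s = entry.Open())
                    {
                        s.Write(part.Data, 0, part.Data.Length);
                    }
                }
            }
            return buffer.ToArray();
        }
    }

    // Reads one part back by name without loading the others.
    public static byte[] ReadPart(byte[] package, string name)
    {
        using (var zip = new ZipArchive(new MemoryStream(package), ZipArchiveMode.Read))
        {
            ZipArchiveEntry entry = zip.GetEntry(name)
                ?? throw new FileNotFoundException("No such part: " + name);
            using (Stream s = entry.Open())
            using (var copy = new MemoryStream())
            {
                s.CopyTo(copy);
                return copy.ToArray();
            }
        }
    }
}
```

Each part can use whatever serializer suits it (XML for metadata, raw bytes for assets), and a reader only has to open the parts it needs.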

p.s.w.g

Yes, my project is much more complex than just that, but I am also attempting to make it less user readable since we might deploy these in a field in a licensed context. I am currently using `protobuf-net` to serialize my data and that much works great. But I have to serialize pieces separately, so what you are talking about with the Packaging library sounds like what I need. – corylulu Feb 27 '13 at 02:22
Dear god not XML – James Feb 27 '13 at 14:06
@James yeah XML has its downsides, of course. I favor packaging and XML in most cases for the same reasons: 1. it's a pre-existing framework, so requires low effort. 2. It's easy for other systems to support, since it's a widely accepted standard. 3. It's easy for a human to inspect the resulting file to verify the serialization process. – p.s.w.g Feb 27 '13 at 14:56
XML has advantages, but it is because of those advantages that I do not like using the XML serializer. I believe it requires the XML to be in a specific format. XML is a semi-structured format, which allows my file format to change over time and still be backward and even forward compatible. In the past, I've written my own XML parsing while being careful not to make any assumptions about ordering or about there being no tags I'm unaware of in the future. If you can load the entire XML file, XPath would probably work pretty well. Otherwise you're left with some more complicated stream parsing. – Alan Mar 13 '13 at 03:11
I would suggest looking into [JSON](http://json.org/) – Basile Starynkevitch Jun 16 '17 at 08:17

score 8 · Answer 2 · answered Jun 14 '17 at 23:39

From someone who has had to parse a lot of file formats, I have opinions on this from a different point of view to most.

- Make the magic number very unique so that people's file format detectors for other formats don't misidentify it as yours. If you use binary, allocate 8 or 16 randomly generated bytes at the start of the file for the magic number. If you use XML, allocate a proper namespace in your domain so that it can't clash with other people's. If you use JSON, god help you. Maybe someone has sorted out a solution for that abomination of a format by now.
- Plan for backwards compatibility. Store the version number of the format somehow so that later versions of your software can deal with differences.
- If the file can be large, or there are sections of it which people might want to skip over for some reason, make sure there is a nice way to do this. XML, JSON and most other text formats are particularly terrible for this, because they force the reader to parse all the data between the start and end element even if they don't care about it. EBML is somewhat better because it stores the length of elements, allowing you to skip all the way to the end. If you make a custom binary format, there is a fairly common design where you store a chunk identifier and a length as the first thing in each chunk's header, and then the reader can skip the entire chunk.
- Store all strings in UTF-8.
- If you care about long-term extensibility, store all integers in a variable-length form.
- Checksums are nice because they allow the reader to immediately abort on invalid data, instead of potentially stepping into sections of the file which could produce confusing results.
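Several of the points above (magic number, version, skippable chunks, variable-length integers) can be sketched together in C#. This is only one possible layout under stated assumptions: the magic bytes are arbitrary values invented for this example, chunk ids are assumed to be exactly four ASCII characters, the stream is assumed to be seekable, and checksums are left out to keep it short. `ChunkFile` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

public static class ChunkFile
{
    // Hypothetical magic number: 8 arbitrary bytes chosen for this sketch.
    static readonly byte[] Magic = { 0x8F, 0x43, 0x4B, 0x46, 0x0D, 0x0A, 0x1A, 0xE2 };
    const ushort FormatVersion = 1;

    // Variable-length unsigned integer: 7 bits per byte, high bit means "more follows".
    static void WriteVarUInt(BinaryWriter w, ulong value)
    {
        while (value >= 0x80)
        {
            w.Write((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        w.Write((byte)value);
    }

    static ulong ReadVarUInt(BinaryReader r)
    {
        ulong result = 0;
        for (int shift = 0; shift < 64; shift += 7)
        {
            byte b = r.ReadByte();
            result |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new InvalidDataException("Variable-length integer is too long.");
    }

    // Layout: magic, version, then repeated (4-byte id, varint length, payload).
    public static void Write(Stream output, IEnumerable<(string Id, byte[] Payload)> chunks)
    {
        using (var w = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true))
        {
            w.Write(Magic);
            w.Write(FormatVersion);
            foreach (var (id, payload) in chunks)
            {
                byte[] idBytes = Encoding.ASCII.GetBytes(id);
                if (idBytes.Length != 4)
                    throw new ArgumentException("Chunk ids must be 4 ASCII characters.");
                w.Write(idBytes);
                WriteVarUInt(w, (ulong)payload.Length);
                w.Write(payload);
            }
        }
    }

    // Returns the payload of the first chunk with the given id, or null if absent.
    // Chunks that are not wanted are skipped by seeking past them, not parsed.
    public static byte[] Find(Stream input, string wantedId)
    {
        using (var r = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
        {
            if (!r.ReadBytes(Magic.Length).SequenceEqual(Magic))
                throw new InvalidDataException("Not a chunk file.");
            ushort version = r.ReadUInt16();
            if (version > FormatVersion)
                throw new InvalidDataException("File is newer than this reader.");

            while (input.Position < input.Length)
            {
                string id = Encoding.ASCII.GetString(r.ReadBytes(4));
                long length = (long)ReadVarUInt(r);
                if (id == wantedId) return r.ReadBytes((int)length);
                input.Seek(length, SeekOrigin.Current);
            }
            return null;
        }
    }
}
```

Because every chunk carries its own length, an older reader can step over chunk types it has never heard of, which is what makes the design friendly to later extension.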

+1 for making me realize that I'm not the only person who thinks json is an abomination of a format. — RubberDuck, Jun 15 '17 at 00:06
Why the hate for json? Just put a known string in a known location to identify the format. Problem solved. — Esben Skov Pedersen, Jun 15 '17 at 13:02
It's not perfect, but it works seamlessly with javascript, faster to parse than XML and smaller size, and still human readable. — corylulu, Jun 15 '17 at 18:51
"Why the hate for JSON?" No support for human-readable comments, crap escaping of Unicode, and a weird syntax requiring me to quote the keys even though they never contain whitespace. Plus the usual inability to extend things because nobody thought about namespacing... by the time you resolve that one, you end up with something that looks even worse than XML did in the first place, all for what, the benefit of avoiding some angle brackets? — Hakanai, Jun 16 '17 at 01:44
Yeah, but as with all things with programming, use the right tool for the job. There are applications where XML is better than JSON and vice versa. — corylulu, Jun 16 '17 at 22:13

score 4 · Answer 3 · answered Feb 27 '13 at 04:58

Well, there are times when what you describe can be a very bad approach. This is assuming that when you say 'serialize' you're talking about using a language/framework's ability to simply take an object and output it directly to some sort of binary stream. The problem is that class structures change over the years. Will you be able to reload a file made in a previous version of your app if all your classes change in a newer one?

For long-term stability of a file format, I've found it better to roll up your sleeves a little bit now and specifically write your own 'serializing'/'streaming' methods within your classes, i.e., manually handle the writing of values to a stream. Write a header, as you state, that describes the format version, and then the data you want saved in the order you want it in. On the reading side, handling different versions of the file format becomes a lot easier.
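A minimal sketch of what such hand-written streaming methods might look like follows. The `PlayerState` class, its fields, and the idea that `Health` was added in version 2 are all invented for illustration; the point is only that the class decides the byte order and the reader branches on the stored version.

```csharp
using System.IO;
using System.Text;

// Hypothetical class that owns its own on-disk layout.
public class PlayerState
{
    const int CurrentVersion = 2;

    public string Name = "";
    public int Level;
    public float Health = 100f; // assumed to have been added in version 2

    public void WriteTo(Stream output)
    {
        using (var w = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true))
        {
            w.Write(CurrentVersion); // header: format version
            w.Write(Name);           // length-prefixed UTF-8 string
            w.Write(Level);
            w.Write(Health);
        }
    }

    public static PlayerState ReadFrom(Stream input)
    {
        using (var r = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
        {
            int version = r.ReadInt32();
            if (version < 1 || version > CurrentVersion)
                throw new InvalidDataException("Unsupported version " + version);

            var state = new PlayerState();
            state.Name = r.ReadString();
            state.Level = r.ReadInt32();
            // Version 1 files had no Health field; keep the default for them.
            if (version >= 2) state.Health = r.ReadSingle();
            return state;
        }
    }
}
```

Renaming or reordering the class's fields later does not affect old files, because the layout lives in `WriteTo` and `ReadFrom` rather than in the class shape.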

The other option of course is XML or JSON. Not necessarily the greatest for binary-heavy content, but simple and human readable... a big plus for long-term viability.

GrandmasterB

I am serializing using protobuf-net (https://code.google.com/p/protobuf-net/), which is extensible. Your points are valid; however, I don't think there is any file format approach that is immune to this. – corylulu Feb 27 '13 at 05:05
Yep... that's why I say sometimes you just have to get your hands dirty and handle the order in which data is written & loaded manually. – GrandmasterB Feb 27 '13 at 05:11
The application I am building is far too dynamic and has far too many values for something like that. – corylulu Feb 27 '13 at 05:20
The more complicated the application, the more important it is to have very fine control over the file format. Keep in mind I'm not saying each class shouldn't have its own streamable output... just that you should control that for each class. Then just call those routines. – GrandmasterB Feb 27 '13 at 16:30
Yeah, I have methods in place that upgrade legacy versions to modern versions and I have a very clear layout of how my classes are laid out. I'm not overly worried about that, but I do agree it's important. I've been working on this for almost a year, so I do have a pretty clear view of how its structure works. – corylulu Feb 27 '13 at 16:52

score 2 · Answer 4 · answered Jun 23 '23 at 12:24

> I was simply going to Serialize my objects into binary and create a header that would tell me how to parse the file. Is this a bad approach?

From someone who's been on the receiving end of someone else doing this ...

YES, it's a Bad Idea.

We had a very old application, written in a now-obsolete technology that did exactly this - dumped the object out of memory and wrote it into a file. Easy to code, nice quick solution for the Developers. Two decades and some down the line, when that technology got trashed on security grounds, we were left with thousands of these binary nightmare files lying around, still used by the business, but with no way to edit them.
Picking the file "format" apart and interpreting it into a replacement application was ... "Fun".

score 0 · Answer 5 · answered Mar 13 '13 at 03:17

I would also love to hear answers to this question from people with years more experience than myself.

I have personally implemented several file formats for my work, and I've moved over to using an XML file format. My requirements and the hardware that I interact with change all the time, and there is no telling what I will need to add to the format in the future. One of XML's primary advantages is that it is semi-structured. For this reason, I generally avoid the automatic XML serialization that .NET provides, because I believe it forces the reader to expect an exact format.

My goal was to create an XML format that allowed for new elements and attributes to be added in the future and for the order of the tags to not matter whenever possible. If you are sure that you can load your entire file into memory, then XPath is probably a good choice.
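One way this tolerant, XPath-based reading might look is sketched below. The `<device>`/`<sensor>`/`<gain>` vocabulary and the `TolerantXml` helper are hypothetical; the sketch assumes the whole document fits in memory and relies on the `XPathSelectElement` extension method from `System.Xml.XPath`.

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

public static class TolerantXml
{
    // Looks up only what it knows about. Unknown elements, unknown attributes
    // and the order of child elements are all ignored.
    public static (string Name, int Gain) ReadSensor(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        XElement sensor = doc.XPathSelectElement("/device/sensor");

        // The explicit conversions return null for a missing node,
        // so each value can fall back to a default.
        string name = (string)sensor?.Attribute("name") ?? "unnamed";
        int gain = (int?)sensor?.Element("gain") ?? 1;
        return (name, gain);
    }
}
```

Because nothing here depends on position or on the absence of extra tags, files written by a newer version with additional elements still load.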

If you are dealing with particularly large files, or for other reasons cannot load the file all at once, then you are probably left with using an `XmlReader` and scanning for known elements, recursing into those elements with `ReadSubtree`, and scanning again...
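The scan-and-recurse pattern could be sketched roughly like this with `XmlReader.ReadSubtree`. The `<asset>` and `<name>` element names and the `StreamingXml` helper are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StreamingXml
{
    // Streams through the document and collects the text of every <name>
    // found inside an <asset> element, ignoring everything it does not know.
    public static List<string> ReadAssetNames(TextReader source)
    {
        var names = new List<string>();
        using (XmlReader reader = XmlReader.Create(source))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != "asset")
                    continue;

                // ReadSubtree scopes a second reader to just this element.
                using (XmlReader asset = reader.ReadSubtree())
                {
                    while (!asset.EOF)
                    {
                        if (asset.NodeType == XmlNodeType.Element && asset.Name == "name")
                            // Already advances past </name>, so no extra Read() here.
                            names.Add(asset.ReadElementContentAsString());
                        else
                            asset.Read();
                    }
                }
            }
        }
        return names;
    }
}
```

Only the current node is held in memory at any time, so this works on files too large for XPath, at the cost of more careful cursor handling.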

This answer is not very directed to the Q, this site is not meant to be a discussion board but rather is intended for non-speculative Q&A. You have some valid points made in your answer that could be used to argue a suggestion of why the questioner's approach is or isn't good, but it's not very focussed. Please focus your answer on the question a little more, thanks! — Jimmy Hoffa, Mar 13 '13 at 04:00
@JimmyHoffa While my answer also supported the OP's question, I did make it clear that I was suggesting an XML semi-structured approach.. but I do see what you mean, I may edit — Alan, Mar 13 '13 at 04:20
