I have a large number of record types derived from a binary format specification. So far, I’ve written a computation expression builder that lets me read structures from the files easily:
type Data = { Value: … } // Data record

// This function takes a ReadOnlyMemory<byte> and maybe returns a record or a wrapped exception.
let readData : ReadOnlyMemory<byte> -> Result<Data option, exn> =
    parser {
        let decode algorithm bytes =
            … // Some code to transform the bytes
        let! algorithm = readUInt32LE 0 // The algorithm value from the first 4 bytes in little endian order
        let! length = readUInt32LE 4    // The length of bytes to read for the value
        if length > 0 then
            let! value = readBytes 8 248 >=> decode algorithm // The actual data described by the bytes
            return { Value = value }
    }
The nice thing about this approach is that I can easily convert the format specification tables stored in a spreadsheet into parsers as F# computation expressions for every kind of record defined, plus some additional code here and there for validation logic (like above). A lot of the messiness of matching and conditional statements goes away with computation expressions, and I get imperative-style code with the brevity of F# syntax. (Notice that the if statement in the code above has no corresponding else.)
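For reference, this is roughly the shape of builder I’m talking about; the Parser<'T> representation and member bodies are a simplified stand-in, not my actual implementation. The Zero member is what lets the if stand without an else, since a failed condition simply yields no record:

open System

// A parser is a function from the input bytes to a Result-wrapped optional value,
// matching the signature of readData above.
type Parser<'T> = ReadOnlyMemory<byte> -> Result<'T option, exn>

type ParserBuilder () =
    member _.Bind (p: Parser<'a>, f: 'a -> Parser<'b>) : Parser<'b> =
        fun bytes ->
            match p bytes with
            | Ok (Some value) -> f value bytes // feed the parsed value to the rest of the expression
            | Ok None -> Ok None               // nothing parsed, so the whole record is absent
            | Error e -> Error e               // propagate the wrapped exception
    member _.Return (value: 'a) : Parser<'a> =
        fun _ -> Ok (Some value)
    // Zero is what the missing else branch desugars to: no record at all.
    member _.Zero () : Parser<'a> =
        fun _ -> Ok None

let parser = ParserBuilder ()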
However, it’s not clear to me how best to do the reverse: taking records and serializing them into bytes. As in the above example, the byte representation can vary in length, and there are other considerations a writer must be aware of:
- Variable-length: the byte representation is not necessarily fixed-length, although many records do have a fixed length.
- Context: the byte representation for some types changes depending on where the bytes are written, what parent type they point to, and sometimes even bytes further ahead. (I’ve got one type where the encoder must process all the bytes and then go back to the first byte position to write the algorithm identifier, so the resulting byte sequences are not always written sequentially; see the sketch after this list.)
- Order: some records have a concept of pointers to parents, children, or siblings, so the order of writing is also important.
- Size: the resulting file sizes range from a megabyte to hundreds of gigabytes.
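To make the Context point concrete, here is a minimal, builder-free sketch of that back-patching case (writeRecord and chooseAlgorithm are hypothetical names, not part of my code): the payload goes in first, and the algorithm identifier is patched into the leading bytes only after all the data has been seen.

open System
open System.Buffers.Binary

// Hypothetical example: reserve a 4-byte header slot, write the payload after it,
// then backtrack and fill in the algorithm identifier once all the bytes are known.
let writeRecord (payload: byte[]) (chooseAlgorithm: byte[] -> uint32) : byte[] =
    let buffer : byte[] = Array.zeroCreate (4 + payload.Length)
    payload.CopyTo (buffer, 4)                  // the data is written first
    let algorithm = chooseAlgorithm payload     // only determinable after processing the payload
    BinaryPrimitives.WriteUInt32LittleEndian (Span (buffer, 0, 4), algorithm)
    buffer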
I’ve given some cursory thought to it and come up with the following:
- A computation expression builder that caches all the write operations and returns a newly initialized byte array/Memory once the length of the final byte representation is known (a rough sketch of what I mean follows this list):
  let encode algorithm bytes = // This is defined outside of the computation expression because the expression is …

  let serialize data context =
      serializer {
          let algorithm = if context … then … else …
          do! writeUInt32LE 0 algorithm
          let length = if algorithm … then … else …
          do! writeUInt32LE 4 length
          do! writeBytesTo 8 <=< encode algorithm <| data.Value
          return Array.zeroCreate <| sizeof<uint32> + sizeof<uint32> + length
      }
- An optimized version of the above for serializations with a known fixed size or small upper bound.
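Here is roughly what I mean by the first option; again, the Write/Serializer types and the helpers are illustrative stand-ins, not my actual builder. Each do! only queues a deferred write, and nothing touches real memory until return supplies the buffer and replays the queue:

open System
open System.Buffers.Binary

type Write = byte[] -> unit                     // a single deferred write into the final buffer
type Serializer<'T> = ResizeArray<Write> -> 'T  // accumulates deferred writes, eventually yields the buffer

type SerializerBuilder () =
    member _.Bind (m: Serializer<'a>, f: 'a -> Serializer<'b>) : Serializer<'b> =
        fun writes -> f (m writes) writes
    // Return receives the freshly allocated buffer and replays every cached write into it.
    member _.Return (buffer: byte[]) : Serializer<byte[]> =
        fun writes ->
            for write in writes do write buffer
            buffer

let serializer = SerializerBuilder ()

// A deferred little-endian uint32 write at a fixed offset (illustrative helper).
let writeUInt32LE offset (value: uint32) : Serializer<unit> =
    fun writes ->
        writes.Add (fun buffer ->
            BinaryPrimitives.WriteUInt32LittleEndian (Span (buffer, offset, 4), value))

// Running a serializer hands it an empty queue and lets return do the allocation and replay.
let run (s: Serializer<byte[]>) : byte[] = s (ResizeArray ())

Calling run on the result of serialize data context then yields the final byte array.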
I’ve implemented the above with working results, but on second thought, the resulting computation expressions are not very intuitive: the return statement at the very end creates the buffer that the preceding do! statements write to, and the builder type behind the computation expression has to do a lot of extra work to make this possible.
Something tells me that I’m barking up the wrong tree here. If I want to pursue code with a high signal-to-noise ratio without significantly impacting clarity or performance, what is a better way?