[36.6] How exactly do I read/write simple types in non-human-readable ("binary") format?
Before you read this, make sure to
evaluate all the tradeoffs between
human-readable and non-human-readable formats. The tradeoffs are
non-trivial, so you should resist a knee-jerk reaction to do it the way you
did it on the last project — one size does not fit all.
After you have made an eyes-open decision to use non-human-readable ("binary")
format, you should remember these keys:
- Make sure you open the input and output streams using
std::ios::binary. Do this even if you are on a Unix system since it's
easy to do, it documents your intent, and it's one less non-portability to
locate and change down the road.
- You probably want to use iostream's read() and write()
methods instead of its >> and << operators. read()
and write() are better for binary mode; >> and << are
better for text mode.
- If the binary data might get read by a different computer than the one
that wrote it, be very careful about endian issues (little-endian
vs. big-endian) and sizeof issues. The easiest way to handle this is
to anoint one of those two formats as the official "network" format, and to
create a header file that contains machine dependencies (I usually call it
machine.h). That header should define inline functions like
readNetworkInt(std::istream& istr) to read a "network int,"
and so forth for reading and writing all the primitive types. You can define
the format for these pretty much any way you want. E.g., you might define a
"network int" as exactly 32 bits in little endian format. In any
case, the functions in machine.h will do any necessary endian
conversions, sizeof conversions, etc. You'll either end up with a
different machine.h on each machine architecture, or you'll end up
with a lot of #ifdefs in your machine.h, but either way, all
this ugliness will be buried in a single header, and all the rest of your code
will be clean(er). Note: the floating point differences are the most subtle
and tricky to handle. It can be done, but you'll have to be careful with
things like NaN, overflow and underflow, the number of bits in the mantissa
or exponent, etc.
- When space-cost is an issue, such as when you are storing the serialized
form in a small memory device or sending it over a slow link, you can compress
the stream and/or you can do some manual tricks. The simplest is to store
small numbers in a smaller number of bytes. For example, to store an unsigned
integer in a stream that has 8-bit bytes, you can hijack the 8th bit of each
byte to indicate whether or not there is another byte. That means you get 7
meaningful bits/byte, so 0...127 fit in 1 byte, 128...16383 fit in 2 bytes,
etc. Any number below 2^28 (about 268 million) needs at most four bytes this
way, so if nearly all of your numbers are below that, this will use no more
space, and typically less, than storing every four-byte unsigned number in
four 8-bit bytes. There are lots of other variations on this theme, e.g.,
storing the difference between successive numbers in a sorted array, or
storing extremely small values in unary format.
- String data is tricky because you have to unambiguously know when the
string's body stops. You can't unambiguously terminate all strings with a
'\0' if some string might contain that character; recall that
std::string can store '\0'. The easiest solution is to write
the integer length just before the string data. Make sure the integer length
is written in "network format" to avoid sizeof and endian problems
(see the solutions in earlier bullets).
Please remember that these are primitives that you will need to use in the
other FAQs in this section.