Why text-based serialization is awesome



One of the most important parts of Akumuli is the serialization mechanism and network protocol. It should allow us to sustain a steady message flow between client and server. There are many serialization tools suited for this purpose: Cap'n Proto, Thrift, Protocol Buffers, MessagePack, etc. After weighing their strengths and weaknesses, I've decided to use RESP (REdis Serialization Protocol), a text-based serialization format, in Akumuli.

RESP is very easy to implement, human-readable, and fast to parse. More importantly, it does not require an additional build step and can be very secure.

This is what an actual RESP-encoded message can look like:

+network.loadavg host=postgres\r\n
+2015-02-09T07:42:42Z\r\n
+24.3\r\n

Line breaks (\r\n) delimit the fields of the struct. The first field is an ID string, the second is a timestamp, and the last is a floating-point number encoded as a RESP string. In an editor it looks like three lines of text, unlike protobuf, Thrift, or any other binary serialization format.
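To make the layout concrete, here is a minimal sketch that splits a message like the one above into its fields. The function name `split_resp_strings` is hypothetical and not part of Akumuli; it handles only RESP simple strings (fields that start with `+`) and expects the whole message to be in memory.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not Akumuli's API): splits a buffer of RESP simple
// strings ("+<text>\r\n") into the text of each field.
std::vector<std::string> split_resp_strings(const std::string& buf) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos < buf.size()) {
        // Every field must start with the simple-string marker.
        if (buf[pos] != '+') {
            throw std::runtime_error("expected '+' at offset " + std::to_string(pos));
        }
        // Every field must end with CRLF.
        std::size_t end = buf.find("\r\n", pos);
        if (end == std::string::npos) {
            throw std::runtime_error("unterminated field at offset " + std::to_string(pos));
        }
        fields.push_back(buf.substr(pos + 1, end - pos - 1));
        pos = end + 2;  // skip past "\r\n"
    }
    return fields;
}
```

Called on the three-line message above, this returns the series ID, the timestamp, and the text `24.3`, which a caller could then convert with `std::stod`.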

This data can be parsed easily. Let's look at the integer parser as an example:

uint64_t _read_int_body(InputStream *stream) {
    uint64_t result = 0;
    const int MAX_DIGITS = 20;  // Maximum number of decimal digits in uint64_t
    int quota = MAX_DIGITS + 1;  // Digits plus the terminating '\r'
    while(quota) {
        Byte c = stream->get();
        if (c == '\r') {
            c = stream->get();
            if (c == '\n') {
                return result;
            }
            throw_exception("Bad stream");
        }
        // c must be in [0x30:0x39] range
        if (c > 0x39 || c < 0x30) {
            throw_exception("can't parse integer (character value out of range)");
        }
        result = result*10 + static_cast<int>(c & 0x0F);
        quota--;
    }
    throw_exception("integer is too long");
}

uint64_t read_int(InputStream *stream) {
    Byte c = stream->get();
    if (c != ':') {
        throw_exception("bad call");
    }
    return _read_int_body(stream);
}
  • This parser is very compact and readable (unlike code generated by tools like Thrift or protoc) and can be optimized by hand.
  • If the stream is malformed, an error is generated. Errors are human-readable and contain context information and an error description, which is not the case when binary serialization is used. This is what it looks like:

[Figure: RESP error message]

  • For security reasons, this parser uses a quota that limits the number of characters it will read for a single integer.
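The points above can be tried out with a self-contained sketch. `MemStream` and `fail` are hypothetical stand-ins for the `InputStream` and `throw_exception` used in the listing, not Akumuli's actual classes. The overflow check and the byte offset in the error text are additions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for the InputStream used in the listing.
class MemStream {
    std::string data_;
    std::size_t pos_ = 0;
public:
    explicit MemStream(std::string data) : data_(std::move(data)) {}
    char get() {
        if (pos_ == data_.size()) throw std::runtime_error("unexpected end of stream");
        return data_[pos_++];
    }
    std::size_t pos() const { return pos_; }
};

// Throws an error that names the offset of the byte that was just read.
[[noreturn]] void fail(const MemStream& s, const std::string& what) {
    throw std::runtime_error(what + " at offset " + std::to_string(s.pos() - 1));
}

std::uint64_t read_int(MemStream& s) {
    if (s.get() != ':') fail(s, "expected ':'");
    const std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t result = 0;
    int quota = 20 + 1;  // 20 digits (the most uint64_t can hold) plus '\r'
    while (quota--) {
        char c = s.get();
        if (c == '\r') {
            if (s.get() != '\n') fail(s, "expected '\\n'");
            return result;
        }
        if (c < '0' || c > '9') fail(s, "non-digit character");
        std::uint64_t d = static_cast<std::uint64_t>(c - '0');
        // Reject values that would not fit into 64 bits.
        if (result > (max - d) / 10) fail(s, "integer overflow");
        result = result * 10 + d;
    }
    fail(s, "integer is too long");
}
```

Feeding it `:12x4\r\n` raises an error whose text names the offending byte's position, which is the kind of context a text protocol makes cheap to provide.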

Data can easily be encoded in this format on the client side in any programming language, and some client-side code can be reused between Redis and Akumuli. The format is also very secure because there are no back-references or length prefixes, just a stream of bytes (unlike many binary serialization formats).
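As an illustration of how little client code is needed, here is a minimal encoder sketch. The names `resp_string` and `encode_sample` are hypothetical, and the number formatting is only a placeholder; a real client would choose its precision deliberately.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: wraps text in a RESP simple string ("+<text>\r\n").
// Simple strings cannot contain CR or LF, so such input is rejected.
std::string resp_string(const std::string& text) {
    if (text.find_first_of("\r\n") != std::string::npos) {
        throw std::invalid_argument("simple strings must not contain CR or LF");
    }
    return "+" + text + "\r\n";
}

// Hypothetical helper: encodes one sample as series ID, timestamp, value.
std::string encode_sample(const std::string& series,
                          const std::string& timestamp,
                          double value) {
    std::ostringstream num;
    num << value;  // default stream formatting (6 significant digits)
    return resp_string(series) + resp_string(timestamp) + resp_string(num.str());
}
```

Calling `encode_sample("network.loadavg host=postgres", "2015-02-09T07:42:42Z", 24.3)` produces the three-line message shown at the start of this post.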

Performance

Everything has its downsides. RESP-encoded data is less compact and slower to parse than binary-encoded data. But in this case decoding and encoding performance is an order of magnitude better than needed. Akumuli's storage engine can handle several million writes per second, and the RESP parser can decode data fast enough to keep the storage engine busy. On AWS, where you pay for traffic, compression on top of RESP would be a good option too. This is a subject for future improvements.