NOTE: this article is out of date. Akumuli doesn’t work this way anymore.
Let’s talk about compression algorithms. You probably know that a DBMS can be row-oriented or column-oriented, and both designs have strengths and weaknesses. Row-orientation has higher write-throughput but weaker compression. Column-orientation, on the other hand, has lower write-throughput but better compression. This is possible because the data in each column is uniform: we can compress all the timestamps using one algorithm and all the IDs using another one that’s better suited for IDs.
Akumuli is row-oriented in the sense that every chunk is a row, but internally each chunk is column-oriented. This gives us both high write-throughput and a good compression ratio.
Akumuli uses several compression algorithms for different column types.
My compression algorithm is based on this paper (you can find the PDF online). If we have an array of floating-point values, we can’t delta-encode them without losing precision. Instead, we can XOR two adjacent values and get a binary diff. For a numeric time series this yields a 64-bit number with many leading zeros, which can be encoded using a very small amount of memory.
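To make the idea concrete, here is a minimal sketch (not Akumuli’s actual code) that reinterprets two adjacent double values as raw 64-bit integers and XORs them; for slowly changing series the high bits cancel out and the diff starts with a long run of zero bits:

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Reinterpret a double as its raw 64-bit pattern (memcpy avoids aliasing issues).
static uint64_t to_bits(double x) {
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return bits;
}

int main() {
    double prev = 100.0, curr = 100.25;           // two adjacent series values
    uint64_t diff = to_bits(prev) ^ to_bits(curr);
    // For nearby values the sign, exponent, and top mantissa bits match,
    // so `diff` has a long run of leading zeros that we don't need to store.
    std::printf("xor diff = %016llx\n", (unsigned long long)diff);
    return 0;
}
```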
Akumuli uses four bits to represent the length of a number and a sequence of four-bit values to encode its binary diff. The aforementioned paper describes a slightly more complex algorithm that uses forecasting to predict the next value; the predicted and actual values are then used to compute the binary diff, which should be smaller than the diff between the previous and new value. I’m not using this because it’s harder to implement and my simple algorithm already gives very good results.
The floating-point compression algorithm implementation can be found here. It uses the __builtin_clzl compiler intrinsic to count leading zeros for better performance.
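Below is a minimal sketch of this kind of nibble scheme, not the actual Akumuli encoder. It assumes the 4-bit header stores the number of significant nibbles minus one (so a full 64-bit diff still fits), special-cases a zero diff (the leading-zero intrinsic is undefined for zero), uses __builtin_clzll for a uint64_t (equivalent to __builtin_clzl on 64-bit Linux), and leaves out the final step of packing two nibbles per byte:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Append one XOR diff to `out` as: [length nibble][payload nibbles, MSB first].
static void encode_diff(uint64_t diff, std::vector<uint8_t>& out) {
    // Number of significant 4-bit groups, 1..16; a zero diff still emits one nibble.
    int payload = (diff == 0) ? 1 : 16 - __builtin_clzll(diff) / 4;
    out.push_back(static_cast<uint8_t>(payload - 1));   // 4-bit length header (length - 1)
    for (int i = payload - 1; i >= 0; --i) {
        out.push_back(static_cast<uint8_t>((diff >> (i * 4)) & 0xF));
    }
}

int main() {
    std::vector<uint8_t> out;
    encode_diff(0x0000100000000000ULL, out);  // diff with 12 significant nibbles
    encode_diff(0, out);                      // unchanged value
    std::printf("emitted %zu nibbles (%zu bytes once packed two per byte)\n",
                out.size(), (out.size() + 1) / 2);
    return 0;
}
```

With this layout a small diff costs a handful of nibbles instead of a full 8 bytes, which is where the compression comes from.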
In my tests this specialized compression algorithm gives up to a 50% better compression ratio than a generic compression algorithm (gzip).