There are a lot of distributed time-series databases nowadays, but Akumuli takes a different approach. It’s a standalone solution, and there is some logic behind this design decision.
### Time-series storage throughput requirements
Time-series data storage is actually quite a simple problem. It has been solved many times by many different companies. And the reason why we don’t have a lot of distributed TSDBs yet is very simple - most of us simply don’t need them. I’m not speaking here about monitoring startups - maybe they do need scalable TSDBs, but most projects simply don’t generate enough data.
Several years ago I was heavily involved in the development of a SCADA system that handles data from the entire Russian electric grid. The time-series storage component of this system runs on a single machine (with a hot standby). The difficulties were not in time-series storage throughput; in fact, the storage was backed by a relational database (chosen for its good tooling and convenience).
Let’s do some “back of the envelope” calculations. Imagine that we have 1000 servers, each sending 100 data-points per second (CPU time, memory usage, etc.). That is only 100K data-points per second - not a big number, it’s actually quite small. If writes are batched and data is compressed (4 bytes per data-point), only about 390KB/s of write bandwidth will be used. Even 1M data-points per second is just about 3.8MB/s of write bandwidth, while modern SSDs have sequential write throughput measured in hundreds of megabytes per second. The write path is not even I/O bound, so a system that can write tens of millions of data-points per second on a single machine is feasible (but hard to build - one has to parallelize the write path to get there).
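Here is the same arithmetic as a small Python sketch (the 4 bytes per data-point figure is just the compression assumption from above, not a measured Akumuli number):

```python
# Back-of-the-envelope write bandwidth, assuming 4 bytes per compressed data-point.
def write_bandwidth_kb(points_per_second, bytes_per_point=4):
    """Required write bandwidth in KB/s (1 KB = 1024 bytes)."""
    return points_per_second * bytes_per_point / 1024.0

# 1000 servers x 100 data-points/s -> ~390 KB/s
print(write_bandwidth_kb(1000 * 100))

# 1M data-points/s -> ~3.8 MB/s
print(write_bandwidth_kb(1000000) / 1024.0)
```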
Modern servers are number-crunching beasts with many CPUs, fast memory, storage, and network. There is a good chance that you can handle your entire monitoring workload using only one machine.
### My views on this subject
In many cases you don’t even need to store the original time-series data. Stream processing is fine for most monitoring tasks (thresholds, anomaly detection, alarms, etc.). You don’t need raw time-series to draw a graph or to see that a value has been out of range for too long. Raw time-series data is usually needed for data analysis - calculating a DTW distance measure, SAXifying the series, and so on. But most of these tasks are often fast enough for a single machine anyway (except kNN, which requires a lot of hardware resources).
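As an illustration, here is a minimal stream-processing sketch (the threshold and window values are hypothetical, and this is not Akumuli code): it raises an alarm when a value stays out of range for too long, keeping only one counter per series instead of the raw points.

```python
# Hypothetical threshold-alarm over a stream of data-points.
# Raw values are discarded; only a per-series violation counter is kept.
from collections import defaultdict

THRESHOLD = 90.0       # e.g. CPU usage percent (illustrative value)
MAX_VIOLATIONS = 60    # alarm after 60 consecutive out-of-range samples

violations = defaultdict(int)

def on_data_point(series, value):
    """Process one incoming data-point and forget it."""
    if value > THRESHOLD:
        violations[series] += 1
        if violations[series] == MAX_VIOLATIONS:
            print("ALERT: %s above %.1f for %d samples" % (series, THRESHOLD, MAX_VIOLATIONS))
    else:
        violations[series] = 0
```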