I’ve decided to post progress reports periodically. The first one is going to be large.
Akumuli is not just a library anymore: you can use it as a service now. There is an ingestion protocol based on the RESP format that works over TCP. Akumuli can handle up to 2 million data points per second (on an m3.xlarge instance). I’m planning to implement RESP over UDP, which will allow Akumuli to handle millions of UDP packets per second (with some crazy quirks).
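To make this concrete, here is a minimal sketch of what ingestion could look like from a client’s point of view. The host, port, and the series/timestamp/value framing are assumptions for illustration, not taken from the documentation:

import socket

# Assumptions for illustration: the port number and the
# series/timestamp/value framing below are hypothetical.
HOST, PORT = "127.0.0.1", 8282

def resp_string(payload: str) -> bytes:
    # RESP simple string: '+' prefix, payload, CRLF terminator.
    return b"+" + payload.encode("ascii") + b"\r\n"

with socket.create_connection((HOST, PORT)) as sock:
    # One data point: series name with tags, then timestamp, then value.
    sock.sendall(resp_string("cpu host=balancerA"))
    sock.sendall(resp_string("20150102T030405"))
    sock.sendall(resp_string("0.75"))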
Akumuli has a query language based on JSON. A query can look like this:
{
    "metric": ["cpu"],
    "where": [
        { "in": { "host": ["balancerA"] } }
    ],
    "range": {
        "from": "20150102T030400",
        "to":   "20150102T030500"
    }
}
This query asks for all points of the “cpu” metric with the tag “host” set to “balancerA” in the specified time range. The query can be sent over HTTP, and Akumuli will return the requested data in RESP format. The response is sent using chunked HTTP transfer encoding, which allows Akumuli to stream large amounts of data over a long period of time. I’m planning to implement “continuous queries” that can stream query results to your application as soon as new data arrives (WIP).
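For illustration, here is one way a client might send this query and consume the chunked response; the URL and port are assumptions, not documented values:

import json
import urllib.request

query = {
    "metric": ["cpu"],
    "where": [{"in": {"host": ["balancerA"]}}],
    "range": {"from": "20150102T030400", "to": "20150102T030500"},
}

# The URL (host, port, path) is an assumption for illustration.
req = urllib.request.Request(
    "http://127.0.0.1:8181/", data=json.dumps(query).encode("ascii")
)
with urllib.request.urlopen(req) as resp:
    # The server streams the answer with chunked transfer encoding,
    # so read incrementally instead of buffering the whole response.
    while True:
        chunk = resp.read(4096)
        if not chunk:
            break
        print(chunk.decode("ascii"), end="")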
You can compose different processing steps using this query language. The toolbox is not very large yet: you can use random sampling, compute moving averages, filter data, find frequent items and heavy hitters, and perform anomaly detection using an SMA forecasting model. Some of these processing steps can be composed; for example, you can compute moving averages first and then sample the result using reservoir sampling. In this case you will get the same number of samples for each time frame. A hypothetical composed query is sketched below.
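The post doesn’t show the syntax for composing steps, so the keys and parameters of the two processing steps below are invented for illustration; only "metric", "where" and "range" come from the example above:

# Hypothetical composed query: "moving-average" and "sample" and their
# parameters are invented, not the actual query syntax.
composed_query = {
    "metric": ["cpu"],
    "where": [{"in": {"host": ["balancerA"]}}],
    "range": {"from": "20150102T030400", "to": "20150102T030500"},
    # Step 1: smooth each series with a moving average.
    "moving-average": {"window": "10s"},
    # Step 2: reduce the smoothed series to a fixed number of points.
    "sample": {"algorithm": "reservoir", "size": 1000},
}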
I’m planning to add more aggregations and data-processing steps: order statistics, histograms, the PAA and SAX transforms, Haar wavelets, etc.
Anomaly detection in Akumuli is based on predictive analytics. The service scans each time series using a sliding window, produces a forecast, and then computes the error between the forecast and the actual value. It uses an SMA (Simple Moving Average) model now, but it can easily be extended with other forecasting models (EWMA, Holt-Winters, ARIMA).
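To make the idea concrete, here is a small self-contained sketch of SMA forecasting with error-based flagging; the thresholding rule is an illustrative choice, not necessarily the one Akumuli uses:

from collections import deque

def sma_anomalies(values, window=10, threshold=3.0):
    """Flag points whose deviation from an SMA forecast is too large.

    The forecast for point t is the mean of the previous `window` points;
    the error is |actual - forecast|. A point is flagged when the error
    exceeds `threshold` times the running mean error (illustrative rule).
    """
    history = deque(maxlen=window)
    mean_error, seen = 0.0, 0
    for t, actual in enumerate(values):
        if len(history) == window:
            forecast = sum(history) / window
            error = abs(actual - forecast)
            if seen > 0 and error > threshold * mean_error:
                yield t, actual, forecast
            seen += 1
            mean_error += (error - mean_error) / seen
        history.append(actual)

# Example: a flat series with one spike at index 25.
series = [1.0] * 25 + [10.0] + [1.0] * 25
print(list(sma_anomalies(series)))  # -> [(25, 10.0, 1.0)]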
This method is not new and many developers are already familiar with it, but Akumuli’s approach is a bit different. It doesn’t compute a forecast for each time series individually; instead, it computes a special sketch data structure. This sketch is used to produce forecast and error sketches, and the error sketch is used to prune the search space and find candidates for further analysis. This algorithm allows Akumuli to perform anomaly detection on a large number of series in real time (which will be great in conjunction with continuous queries).
The sketch-based algorithm works on an approximation of the dataset. It can produce false positives, so every series flagged as anomalous should be checked individually using the direct approach instead of the approximation.
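The post doesn’t describe the sketch itself, so the following is only a toy illustration of the pruning idea under assumed details: series are hashed into buckets, errors are aggregated per bucket, and only series in high-error buckets become candidates for the exact check:

from collections import defaultdict

def prune_candidates(points, num_buckets=64, threshold=5.0):
    """Toy illustration of sketch-based pruning (details assumed).

    `points` maps series name -> (forecast, actual) for the current step.
    Each series' error contributes to one hash bucket; only series that
    land in buckets whose aggregate error exceeds `threshold` are returned
    as candidates for an exact, per-series check.
    """
    bucket_error = defaultdict(float)
    for name, (forecast, actual) in points.items():
        bucket_error[hash(name) % num_buckets] += abs(actual - forecast)

    hot = {b for b, err in bucket_error.items() if err > threshold}
    return [name for name in points if hash(name) % num_buckets in hot]

# Example: one series out of a thousand deviates from its forecast.
points = {f"cpu host={i}": (1.0, 1.0) for i in range(1000)}
points["cpu host=balancerA"] = (1.0, 20.0)
print(prune_candidates(points))

Note how the other series that hash into the same bucket show up as false positives; they are exactly what the per-series verification step filters out.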
The storage engine was revamped a lot. The compression algorithm was changed: Akumuli now reorders elements on disk to increase the compression rate. The histogram is gone; it was replaced by a new indirection vector that contains both timestamps and offsets. The indirection vector is now stored at the end of the volume, and data chunks are added at the beginning. This change has simplified calculations in the write path.
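As a toy model of that layout (all names and sizes below are invented, not Akumuli’s actual on-disk format): chunks grow from the front of the volume while the (timestamp, offset) index grows from the back, and the volume is full when the two regions would meet.

class Volume:
    """Toy model of the volume layout (field names are invented)."""

    ENTRY_SIZE = 16  # assumed size of one (timestamp, offset) entry

    def __init__(self, capacity):
        self.capacity = capacity
        self.write_pos = 0   # data chunks grow upward from offset 0
        self.index = []      # indirection vector, logically at the end

    def append(self, timestamp, chunk: bytes):
        index_bytes = (len(self.index) + 1) * self.ENTRY_SIZE
        if self.write_pos + len(chunk) + index_bytes > self.capacity:
            raise IOError("volume is full")
        offset = self.write_pos
        self.write_pos += len(chunk)
        self.index.append((timestamp, offset))
        return offset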