Time-series storage design (part three)


In my previous storage rant I sketched the Akumuli write algorithm. In this algorithm, two different memory regions are updated on every chunk write. Each region is updated sequentially, but both are updated at the same time. In addition, the volume header has to be updated too. One can argue that all of this results in a non-sequential write pattern and should cause a slowdown. This is not the case in reality, and I want to explain why. Let’s start with good old hard drives.

####HDD performance.

HDDs usually have one or several rotating platters and a magnetic head. The head can move back and forth and read information from the rotating platter. To read or write data at a random location we need to move the magnetic head first (this is usually called a “seek”). This process is very slow. In contrast, sequential reads and writes are very fast because no seeking is required.

All modern HDDs have an on-board controller. All read and write requests go through it. These controllers usually run some advanced version of the SCAN disk scheduling algorithm. SCAN buffers read and write requests in the on-board cache and reorders them to minimize disk seeks.

Even the simplest SCAN implementation can handle several write streams without loss of throughput if one stream produces most of the writes and the other streams are tiny. The small write streams can be buffered, so they cause the disk head to move only occasionally.
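To make this concrete, here is a minimal sketch of the idea in Python. It is not how any real drive firmware works: it models a single upward SCAN sweep over a buffer of pending block addresses and compares the total head travel with serving the same requests in arrival order. The workload (`interleaved_requests`), the block numbers, and the function names are all made up for illustration.

```python
def scan_order(pending, head):
    """Order buffered requests the way a single upward SCAN sweep would serve them.

    Requests at or above the head position are served in ascending order on the
    way up, the remaining ones in descending order on the way back down.
    """
    ahead = sorted(p for p in pending if p >= head)
    behind = sorted((p for p in pending if p < head), reverse=True)
    return ahead + behind


def head_travel(order, head):
    """Total distance (in blocks) the head moves when serving requests in this order."""
    total = 0
    for pos in order:
        total += abs(pos - head)
        head = pos
    return total


def interleaved_requests(big_start=1000, big_len=64, small_start=5000, every=16):
    """One large sequential stream with a tiny second stream mixed in (hypothetical workload)."""
    requests = []
    for i in range(big_len):
        requests.append(big_start + i)            # the dominant stream
        if i % every == every - 1:
            requests.append(small_start + i // every)  # the occasional small write
    return requests


if __name__ == "__main__":
    head = 1000
    requests = interleaved_requests()
    print("head travel, arrival order:", head_travel(requests, head))
    print("head travel, SCAN order:   ", head_travel(scan_order(requests, head), head))
```

In this toy workload the head crosses the gap between the two streams seven times when requests are served in arrival order, and only once after SCAN reordering, because the writes of the small stream are held back and served together.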

####SSD performance.

Solid state drives are a very different beast. They too have an on-board controller, albeit a much more complex one. One of the main problems of SSDs is write amplification: the drive physically writes more data than the host asked it to. This effect can decrease throughput when only a small amount of data is updated in each write.

SSDs have an FTL (Flash Translation Layer) that maps logical addresses to physical addresses. When you write new data to the SSD sequentially, it allocates new pages and stores your data there. Every page is filled completely with new data. In this case write amplification is very small (and throughput is high). If we write new data sequentially in several places, the controller can do the same thing: new data from both streams is written to one physical location. In this case write amplification will be small as well.
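The following toy model illustrates the argument. It is a deliberately simplified sketch, not a description of any real controller: it assumes a 4096-byte flash page, always places a logical page write into the next free physical page, and ignores garbage collection, erase blocks, and wear leveling entirely. `ToyFTL` and the helper functions are hypothetical names. Write amplification is computed as physical bytes written divided by logical bytes written.

```python
PAGE_SIZE = 4096  # bytes per flash page (assumed value for this sketch)


class ToyFTL:
    """Log-structured mapping: every write lands in the next free physical page."""

    def __init__(self):
        self.mapping = {}        # logical page number -> physical page number
        self.next_free = 0       # next unused physical page
        self.logical_bytes = 0   # bytes the host asked to write
        self.physical_bytes = 0  # bytes actually programmed into flash

    def write(self, logical_page, nbytes=PAGE_SIZE):
        # Flash is programmed in whole pages, so in this model even a small
        # update costs a full physical page.
        self.mapping[logical_page] = self.next_free
        self.next_free += 1
        self.logical_bytes += nbytes
        self.physical_bytes += PAGE_SIZE

    def write_amplification(self):
        return self.physical_bytes / self.logical_bytes


def two_sequential_streams(pages_per_stream=8):
    """Two interleaved sequential streams, each writing full pages."""
    ftl = ToyFTL()
    for i in range(pages_per_stream):
        ftl.write(logical_page=i)           # stream A
        ftl.write(logical_page=10_000 + i)  # stream B, far away in logical space
    return ftl


def small_updates(count=16, nbytes=512):
    """Many small writes scattered over different logical pages."""
    ftl = ToyFTL()
    for i in range(count):
        ftl.write(logical_page=i * 37, nbytes=nbytes)
    return ftl


if __name__ == "__main__":
    print("two sequential streams:", two_sequential_streams().write_amplification())
    print("small scattered updates:", small_updates().write_amplification())
```

In this model the two interleaved streams end up in consecutive physical pages and every page is fully used, so the ratio stays at 1. The small scattered updates each consume a whole page for a fraction of a page of data, which is the case where write amplification hurts.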