Introduction
In production, storage is the sinews of war: each layer of the storage stack needs to be watched carefully to exploit 100% of the resources the devices offer.
In this series of blog posts we'll study local storage operation and optimization on Linux systems, from the device layer up to the filesystem (VFS). We will not address remote storage topics like NVMe-oF, Ceph, SAN, etc.
Storage key points
Performance
Compared to CPU caches (L1/L2/L3) and volatile memory (RAM), persistent storage devices are considerably slower.
This is problematic because while a read or write is outstanding, the CPU cannot make progress on that task, and the time it spends waiting is accounted as iowait. If iowait is 10%, it means only 90% of your CPU time is dedicated to computing. Hyper-threading and multi-core CPUs reduce the impact of iowait, but it can still be harmful in some cases.
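To get a feel for this (a rough sketch, not a monitoring tool), the kernel's cumulative iowait counter can be read straight from /proc/stat:

```shell
# Field 6 of the "cpu" line in /proc/stat is cumulative iowait time (in jiffies).
# Dividing it by the sum of all fields gives the iowait share since boot.
awk '/^cpu /{t=0; for(i=2;i<=NF;i++) t+=$i; printf "iowait since boot: %.1f%%\n", 100*$6/t}' /proc/stat
```

In practice, tools like iostat or vmstat compute the same ratio over short intervals, which is far more useful than a since-boot average.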
Integrity / Redundancy
Besides performance, data integrity is also a very important topic to consider. Depending on the workload, data loss can lead to major outages and financial impact. For example, data loss is not handled the same way for a cache server as for a database.
The principal method to get better redundancy is to duplicate data, which impacts performance, mainly on writes.
Cost
Finally, even though the business can be strongly impacted by performance and the ability to ensure data integrity, budgets are never unlimited, and it's essential to properly evaluate the systems put in place so that they don't cost too much.
We can use a small diagram to position these different key points and place example workloads on it. The placement is fairly realistic, but it remains subjective and can change depending on the company's core business.
Storage layers in Linux Kernel
We can divide the Linux storage stack into 4 layers with this simple representation:
Applications in user space perform syscalls to interact with the filesystem (read, write, open, close, etc.).
You can check a more complete list of available syscalls on the Linux Assembly website. The filesystem then has 2 possible options:
- Perform direct I/O straight to the block layer, which is faster for large sequential reads or writes because it avoids the overhead of copying data into the page cache
- Use the page cache to perform buffered I/O, which gives better performance by merging small write operations
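As a sketch of the difference using dd (file names here are placeholders, and O_DIRECT is refused by filesystems that don't support it, such as tmpfs):

```shell
# Buffered write: completes as soon as the data sits in the page cache.
dd if=/dev/zero of=./buffered.tmp bs=1M count=32 2>&1 | tail -n1
# Direct write: oflag=direct bypasses the page cache (block size must be aligned,
# and the filesystem must support O_DIRECT).
out=$(dd if=/dev/zero of=./direct.tmp bs=1M count=32 oflag=direct 2>&1) \
  && printf '%s\n' "$out" | tail -n1 \
  || echo "O_DIRECT not supported on this filesystem"
rm -f ./buffered.tmp ./direct.tmp
```

On a machine with free RAM, the buffered figure is usually much higher, precisely because nothing has reached the device yet.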
The block layer's main job is to provide a single way to interact with various device types. Its main component is blk-mq, an API for parallel I/O requests; you can find complete documentation on kernel.org. This is also where you'll find I/O schedulers, in the form of modules that let you manage policies per device. An I/O scheduler is an additional layer, so it impacts raw performance, but it allows better management of parallel operations and provides complex features (mandatory for ionice in some cases).
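For example (a read-only sketch; device names will vary per machine), the scheduler for each disk is exposed in sysfs, with the active one shown in brackets:

```shell
# Print each block device's available I/O schedulers; the active one is
# bracketed, e.g. "sda: mq-deadline kyber [bfq] none".
for q in /sys/block/*/queue/scheduler; do
  [ -e "$q" ] || continue   # skip if the glob matched nothing
  dev=${q#/sys/block/}; dev=${dev%/queue/scheduler}
  printf '%s: %s\n' "$dev" "$(cat "$q")"
done
```

Writing a scheduler name into the same file switches policy for that device, which is exactly what we'll play with in the next post.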
Then come the drivers that allow the kernel to communicate with physical devices. There is a huge list of them to handle all device types (RAID cards, HBA cards, virtio, NVMe, etc.).
The Thomas-Krenn company created a great diagram of the Linux storage stack, available here.
Storage performance measurement tools
Fio (Flexible I/O Tester)
Fio is the most widely used performance testing tool for storage. It provides a lot of options to test all possible cases (filesystem, block, direct I/O, etc.).
This is the tool we will use in the next blog posts to take measurements.
sysbench
Sysbench has a lot of features to benchmark overall system performance (CPU, RAM, storage, etc.), and among the available benchmarks you can find fileio for storage.
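A typical fileio sequence looks like this (a sketch: sizes and durations are arbitrary, and the snippet is guarded in case sysbench isn't installed):

```shell
# sysbench fileio works in three phases: prepare the test files, run, clean up.
if command -v sysbench >/dev/null 2>&1; then
  sysbench fileio --file-total-size=256M prepare
  sysbench fileio --file-total-size=256M --file-test-mode=rndrw --time=10 run
  sysbench fileio --file-total-size=256M cleanup
else
  echo "sysbench not installed"
fi
```

Other --file-test-mode values (seqrd, seqwr, rndrd, rndwr) map onto the test matrix described later in this post.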
stress-ng
Stress-ng does not provide performance measurement as such. However, it allows you to abuse your system, which is useful for testing stability. You can pair it with additional tools such as sysstat to see what's going on.
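For instance, its hdd stressor hammers the disk with temporary files (a sketch; worker count and duration are arbitrary, and the snippet is guarded in case stress-ng isn't installed):

```shell
# 4 workers writing/reading 256 MiB temp files each, for 10 seconds,
# with a brief metrics summary at the end.
if command -v stress-ng >/dev/null 2>&1; then
  stress-ng --hdd 4 --hdd-bytes 256M --timeout 10s --metrics-brief
else
  echo "stress-ng not installed"
fi
```

Running iostat or sar in another terminal while this executes shows how the stack behaves under pressure.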
dd
dd allows you to copy blocks from one device to another. It isn't a benchmarking tool, but it can give you a rough idea in some cases.
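For example, a rough sequential-write figure against a scratch file (a sketch; the path is a placeholder):

```shell
# Write 128 MiB and include the final flush in the timing (conv=fsync),
# so the reported throughput reflects the device, not just the page cache.
dd if=/dev/zero of=/tmp/ddtest.img bs=1M count=128 conv=fsync 2>&1 | tail -n1
rm -f /tmp/ddtest.img
```

Without conv=fsync, dd often reports page-cache speed, which is one reason its numbers should be taken with a grain of salt.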
Bonus
For some use cases like HPC you will have really powerful storage solutions, and benchmarking can become problematic because the CPU becomes the bottleneck. For this kind of setup you can use DPDK-based solutions, which perform DMA operations to bypass the kernel. One of these tools is dpdk-test-dma-perf (documentation here).
Performance benchmark key points and methodology
Key points
When measuring performance, we have 3 main points to address:
- Bandwidth for large file copies
- IOPS for small random operations
- Latency for time-sensitive workloads
To cover them, the following tests need to be done:
- Sequential tests with a 1M or larger block size for bandwidth:
- Sequential Read
- Sequential Write
- Sequential Read/Write
- Random tests with a 4k block size for IOPS:
- Random Read
- Random Write
- Random Read/Write
On device datasheets, random-access performance is usually specified at 4K (QD32), which means the maximum number of in-flight I/Os (the queue depth) is 32.
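Putting this matrix together with fio, the 4K QD32 random-read case could be described by a job file like this (a sketch: file name, size, and runtime are placeholder values; run it with `fio randread.fio`):

```ini
; Random-read IOPS test at the datasheet operating point: 4k blocks, QD32.
[global]
; async engine so the queue depth of 32 is actually sustained
ioengine=libaio
; bypass the page cache to measure the device itself
direct=1
filename=/tmp/fio-test.img
size=1G
runtime=60
time_based=1
; ignore warm-up results (see the flash warm-up section below)
ramp_time=30

[randread-4k-qd32]
rw=randread
bs=4k
iodepth=32
```

Swapping rw= for write, randwrite, rw, or randrw, and adjusting bs=, covers the other tests in the list above.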
Saturation point
For each test we need to check latencies and adapt parameters to find the device's real limits, not just raw performance, because of a saturation point (or knee point).
This saturation point is problematic because CPU load rises sharply once it is reached, which can strongly impact workloads like databases. In production, it can be a good idea to set limits just below this point.
For latencies, the 99.9th percentile is a good measure for standard workloads, and the 99.99th for latency-sensitive workloads like databases.
Flash devices warm up
On SSDs and NVMe drives, running a test right after a trim/erase will bypass the garbage collector and hit the (write) cache. To avoid this, disks need to be warmed up by ignoring at least the first 30 seconds of results (60s is better).
In fio, --ramp_time=<warmup time in seconds> will help us achieve that.
This isn't necessary on HDDs.
Importance of fsync
fsync() is a syscall that forces the kernel to flush its caches and synchronize with the device, giving you the assurance that your data is safely written. For example, a DBMS performs an fsync before replying to the client. This strongly impacts performance, so for database-like workloads you need to set fsync=1 when running tests (forcing an fsync after each write).
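A crude way to feel this cost with dd (a sketch; the path is a placeholder) is to compare per-write synced I/O with purely buffered I/O:

```shell
# 256 writes of 4k, each synced to stable storage before the next (oflag=dsync),
# mimicking a DBMS that flushes after every commit.
dd if=/dev/zero of=/tmp/sync.img bs=4k count=256 oflag=dsync 2>&1 | tail -n1
# The same workload through the page cache only -- usually far faster.
dd if=/dev/zero of=/tmp/sync.img bs=4k count=256 2>&1 | tail -n1
rm -f /tmp/sync.img
```

The gap between the two throughput figures is the price of durability, and it is exactly what fio's fsync=1 option makes visible in a benchmark.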
What’s next ?
In the next blog post we'll see which parameters we can tune to take advantage of all the performance a device has to offer.
A week into linux storage performances - Day 2 : Device optimization and scheduler