A week into linux storage performances - Day 2 : Device optimization and scheduler

This blog post is part of a series. If you haven't read the previous blog post, it's here: A week into linux storage performances - Day 1 : Introduction.

How a storage device works

On a storage device, a sector is the smallest unit to store data.

Here are typical sector sizes for some device types:

  • HDD : 512 bytes on older devices; modern drives use 4K physical sectors with 512-byte emulated logical sectors (512e)
  • SSD/NVMe : usually 4096 bytes, can range from 512 to 16384
  • CD-ROM : 2048 bytes

Because of this, a file cannot occupy less space on disk than one sector (in practice, one filesystem block). Example:

Creating a file with only 1 byte of data:
echo -n 1 > test.txt

We can check the file size to confirm it contains 1 byte of data:

ls -l test.txt
-rw-r--r-- 1 root root 1 Apr 24 17:25 test.txt

But when checking the usage on disk, we get this:

du -hs test.txt
4.0K test.txt

Even though our file is 1 byte long, its usage on disk is 4096 bytes (1 sector).
We can also check the physical block size of our device:

cat /sys/block/sda/queue/physical_block_size
4096

It looks consistent.
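We can also display both numbers with a single stat call; %s is the apparent size in bytes, %b the number of allocated blocks and %B the size stat uses for one block (the file is the one created in the example above):

```shell
# Compare the apparent size with the space actually allocated on disk
echo -n 1 > test.txt
stat -c 'size=%s bytes, allocated=%b blocks of %B bytes' test.txt
```

On a filesystem with 4K blocks this typically reports 8 blocks of 512 bytes, i.e. 4096 bytes allocated for a 1-byte file.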

Align physical and logical sector size

Devices expose both a physical and a logical (possibly emulated) block size, for backward compatibility with older operating systems and filesystems.
HDD firmware usually doesn't allow changing the logical size, but NVMe and SSD devices often do.

Aligning the physical and logical sizes reduces latency when performing I/O.
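As a quick sketch, we can loop over every block device and flag those whose logical and physical block sizes differ (virtual devices like loop will simply report identical values):

```shell
# Flag block devices whose logical block size differs from the physical one
for dev in /sys/block/*/; do
  phys=$(cat "${dev}queue/physical_block_size")
  log=$(cat "${dev}queue/logical_block_size")
  [ "$phys" = "$log" ] && state="aligned" || state="MISALIGNED"
  printf '%s: physical=%s logical=%s (%s)\n' "$(basename "$dev")" "$phys" "$log" "$state"
done
```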

NVMe

We can check this for a device:

root@em-priceless-rubin-rescue:~# cat /sys/block/nvme0n1/queue/physical_block_size
4096
root@em-priceless-rubin-rescue:~# cat /sys/block/nvme0n1/queue/logical_block_size
512

Here we see that the sector sizes aren't aligned.
We can also use nvme-cli to display this information:

nvme id-ns -H /dev/nvme0n1 | grep 'Relative Performance'
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best

If the command above doesn't show multiple LBA formats, it means you can't change the logical sector size (or you need a manufacturer tool).

Or with smartmontools:

smartctl -c /dev/nvme0n1 | grep -A 4 'Supported LBA'
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 4096 0 0
Warning: changing the logical sector size destroys all data on the device. Don't perform this operation on a production device.

We can change the logical sector size by setting the lbaf flag to the desired format (0 for 512 and 1 for 4096):

nvme format --lbaf=1 /dev/nvme0n1

And now we get this:

nvme id-ns -H /dev/nvme0n1 | grep 'Relative Performance'
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)

HDD / SSD

If your device firmware allows changing the logical sector size, you can proceed like this:

Check the logical and physical block sizes:

cat /sys/block/sda/queue/{logical,physical}_block_size
512
4096

We can also use hdparm to check this:

hdparm -I /dev/sda | grep 'Sector size:'
Logical Sector size: 512 bytes
Physical Sector size: 4096 bytes

And now we can change the logical size (this is destructive for the data on disk):

hdparm --set-sector-size 4096 --please-destroy-my-drive /dev/sda

In my case, the disk firmware doesn't allow logical sector size reconfiguration and I got this error:

/dev/sda:
READ_LOG_EXT(SECTOR_CONFIGURATION) failed: Unknown error -2

NVMe / HDD / SSD special cases

If you can't change the logical sector size with the methods detailed above, you can try the manufacturer's tool.
This is generally available only on enterprise-grade devices.

I/O Scheduler

Initially, I/O schedulers were designed for mechanical disks: they reorder requests to group them by logical address, which was necessary because of the incompressible seek and rotation times of the drive head. The legacy, non-multiqueue schedulers (noop, deadline, cfq) add a small CPU overhead for this reordering, but that wasn't an issue with slow drives; they were removed together with the legacy block layer in Linux 5.0 (March 2019).
With modern flash devices, which have no mechanical constraints and deliver high IOPS and bandwidth, this reordering is no longer necessary and its CPU overhead becomes significant.

To address this issue, a new generation of schedulers was created, so let me introduce: BFQ, mq-deadline and Kyber!

none

no-op has been replaced by the none scheduler, which is built into the blk-mq framework. It is a first-in first-out algorithm that still performs basic request merging. When in doubt, and for NVMe devices, it is a pretty good choice.

mq-deadline

mq-deadline is the multiqueue adaptation of deadline, with one queue for reads and one for writes. By default it prioritizes reads over writes, and it executes the oldest write request in the queue once its deadline is reached.

mq-deadline is better suited to mechanical disks and older SSDs because it scales poorly on modern NVMe (high IOPS) and its CPU overhead is significant.
However, a patch merged in January 2024 seems to fix the performance issues on high-performance NVMe (up to +112% performance).

BFQ

Budget Fair Queueing creates one queue per process, each with an equal budget. It seems to be the best choice when you want to ionice specific processes, but it is also the scheduler with the highest CPU overhead.

However, it scales poorly: IOPS seem to be capped around ~500 kIOPS. A patch was submitted to fix this but was never merged. This scheduler actually seems to be unmaintained nowadays.

Kyber

Kyber is a low-overhead scheduler for fast multiqueue devices that prioritizes reads over writes. It is based on a very simple mechanism, ideal for low-latency workloads like databases.
Kyber is useless for HDDs and old/cheap SSDs.

Scheduler recap

| Scheduler | Device type | Workload | Detail |
| --- | --- | --- | --- |
| none | NVMe, enterprise-grade SSD | pure sequential I/O, backups | maximum throughput, simple workloads |
| mq-deadline | HDD, cheap/old SSD | | |
| kyber | NVMe | databases, webservers, VM hypervisors | low latency, critical reads, many processes hitting the storage device |

In May 2024, a great paper comparing schedulers on high-performance NVMe was released: BFQ, Multiqueue-Deadline, or Kyber? Performance Characterization of Linux Storage Schedulers in the NVMe Era.

How to change scheduler

First, we can check the current scheduler and the available ones:

cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline

Here, the scheduler is set to none.

Schedulers are kernel modules, so we need to load them before use:

# For BFQ
modprobe bfq
# For kyber
modprobe kyber-iosched

For persistence across reboots, add them to /etc/modules:

cat <<EOF >> /etc/modules
bfq
kyber-iosched
EOF

Now our schedulers are available:

cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline bfq kyber

To change the scheduler, do this:

echo kyber > /sys/block/nvme0n1/queue/scheduler
# Then we can check all went well
cat /sys/block/nvme0n1/queue/scheduler
mq-deadline bfq [kyber] none

However, this change doesn't persist across reboots. To make it persistent, we need to create a udev rule for all NVMe devices:

/etc/udev/rules.d/60-nvme-scheduler.rules
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="kyber"
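If the machine also has rotational disks, a similar rule can pin them to mq-deadline; this is a sketch, and the file name and kernel name pattern should be adjusted to your setup:

```
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
```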

And then we can apply the rule without rebooting:

udevadm control --reload
udevadm trigger

Note that changing the I/O scheduler resets all changes made in /sys/block/<device>/queue/.

Scheduler-specific options are located in /sys/block/<device>/queue/iosched/.

Device queue general options tuning

All the following queue options can be edited in /sys/block/<device>/queue/ to improve performance. Let's go through them one by one:

  • io_poll : Enabling polling can lower latency because the CPU polls for completions instead of waiting for interrupts. Especially useful on NVMe.

  • io_poll_delay : Three possible values when polling is enabled: -1 (default) for classic, immediate polling; 0 for hybrid polling, which slightly increases latency but reduces CPU load; positive values for delayed polling, with the delay in µs. Leave it at -1 or run your own tests.

  • io_timeout : You can adjust this, but usually not a big performance gain unless you are heavily optimizing error recovery.

  • max_sectors_kb : Maximum size of a single I/O. It cannot be less than logical_block_size/1024. Increase it for better bandwidth, decrease it for better latency.

  • nomerges : Three options: 0 (default), 1 to disable only single-hit merges, 2 to disable merging completely. Setting it to 2 can slightly improve IOPS in extreme high-performance tuning, at the cost of some CPU efficiency: it can be useful on high-performance NVMe but is disastrous on HDDs and SSDs. 1 is a good value.

  • nr_requests : Queue size for the scheduler. Increasing it allows more merging and can be useful on HDDs. On high-performance NVMe I got better performance by setting it to 8.

  • read_ahead_kb : Used to prefetch the next data in sequential reads; pretty useful for mechanical disks. Set it to the max_sectors_kb value.

  • rq_affinity : 1 by default (the I/O completes on the same core group where it was submitted); set it to 2 to complete it on the exact same CPU core. This matters more when you perform IRQ pinning. Set it to 0 to disable affinity entirely (not recommended).

  • wbt_lat_usec : Throttles writes when the measured latency exceeds the selected value (default 2000 µs). Benchmark whenever you change this value, and don't set it too high, to avoid hangs.

  • write_cache : The default value is write through, meaning the write cache is disabled. You can set it to write back, but this can be disastrous in case of a server crash or power failure. Don't do this with a database, for example.
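As an illustrative sketch (not a universal recipe), here is how several of these knobs could be combined on one NVMe device; the device path and the values are examples taken from the points above and must be benchmarked on your own hardware:

```shell
# Apply example queue tunables (run as root; Q points to an example device)
Q=${Q:-/sys/block/nvme0n1/queue}
echo 1 > "$Q/nomerges"                        # only disable single-hit merges
echo 8 > "$Q/nr_requests"                     # small scheduler queue
echo 2 > "$Q/rq_affinity"                     # complete I/O on the submitting core
cat "$Q/max_sectors_kb" > "$Q/read_ahead_kb"  # align read-ahead with max I/O size
```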

Tuning options by scheduler

mq-deadline

  • async_depth : Maximum number of asynchronous I/Os in flight. Decrease it for low latency, increase it for async-heavy workloads.

  • fifo_batch : Number of requests served from one queue (read or write) before switching. Increase it for higher bandwidth, decrease it for lower latency.

  • front_merges : Set it to 1 for better throughput (merging saves operations); set it to 0 if your workload is random (no benefit).

  • read_expire : Maximum time (in milliseconds) a read request can wait before being prioritized. Lower it for read-latency-sensitive applications; higher values favor throughput when read latency is not critical.

  • write_expire : Maximum time (in milliseconds) a write request can wait before being prioritized. Lower it for write-latency-critical applications; higher values allow batching more writes, which means better bandwidth.

  • writes_starved : How many read batches are served before writes are allowed. Use 1 or 2 for write-intensive workloads, 2 or more for read-intensive workloads.

Kyber

  • read_lat_nsec : Target read latency (in nanoseconds). Lower values for lower latency, higher values for higher bandwidth.

  • write_lat_nsec : Target write latency (in nanoseconds). Lower values for faster writes, higher values for better bandwidth.
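For example, tightening Kyber's read latency target could look like this (the 100 µs value is purely illustrative; benchmark before adopting it):

```shell
# Set Kyber's read latency target to 100 µs on an example device (run as root)
P=${P:-/sys/block/nvme0n1/queue/iosched}
echo 100000 > "$P/read_lat_nsec"   # value is in nanoseconds
cat "$P/read_lat_nsec"             # confirm the new target
```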

What's next?

In the next blog post we'll see how to manage redundancy with software RAID, with a focus on the different ways to build RAID10.

A week into linux storage performances - Day 3 : Manage redundancy with RAID