This blog post is part of a series. If you haven’t read the previous blog post, it’s here: A week into linux storage performances - Day 4: Partitions and Filesystem
Introduction
In the previous blog posts we saw which points to focus on in the storage layers to improve performance and capacity. Some Linux behaviors may impact storage performance by creating pressure on the storage itself or on other components.
These are the points we can address to gain a bit more performance:
- memory management
- IRQ management
- CPU changes
Memory management
Swap management
The exchange space, or swap, is often a partition on disk that allows your system to move some RAM pages onto it to avoid running out of memory. Many automated distribution installers set up a swap partition by default.
Because of this, when the system needs to read a page that sits on swap, it slows down considerably (disk access time) and puts a strong load on your storage (lots of small IOPS).
By default, Linux readily copies pages to swap even when memory isn’t full. To control this behavior, we have several tuning knobs.
Disabling swap is a good solution in many cases. You may want to avoid doing that in the following cases:
- When you need to hibernate your system
- Small memory devices where ram usage spikes can lead to OOM
- Systems hosting critical services where uptime is more important than performances
Chris Down wrote a great blog post to defend swap.
If you still want to remove swap, you can disable it on the fly:

```shell
swapoff -a
```
Then you can remove the corresponding entry in fstab and delete the partition.
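Before deleting anything, it can help to check what swap devices are active and to comment out the fstab entry rather than deleting it right away. This is a sketch; the sed pattern assumes the classic fstab layout where the type field reads ` swap `:

```shell
# Show active swap devices and overall memory usage
swapon --show
free -h

# Comment out swap entries in fstab instead of deleting them (reversible)
sudo sed -i '/ swap /s/^/#/' /etc/fstab
```

Commenting out the line keeps the rollback trivial: uncomment it and run `swapon -a`.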
Another alternative is to stop using swap on disk and use zram to put it on a compressed RAM disk. To do this:

```shell
sudo apt install zram-tools
```
In the configuration file, set the percentage of your RAM you allow for zram:

```shell
echo -e "ALGO=zstd\nPERCENT=20" | sudo tee -a /etc/default/zramswap
```
Finally, you can reload the service:

```shell
sudo systemctl reload zramswap
```
You can display information about the zram space:

```shell
sudo zramctl
```
If neither of these solutions is possible, you can change how the kernel manages cache and swap operations.
Before touching this parameter it’s important to understand how memory reclaim works.
On a running system you have a process named kswapd whose mission is to reclaim pages when free memory is low. To achieve this, it proceeds in 3 steps:
- Drop clean file pages
- Write dirty file pages back to disk
- Swap out cold anonymous pages
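You can watch these reclaim steps at work through counters in /proc/vmstat (exact counter names can vary slightly between kernel versions):

```shell
# pgscan_kswapd / pgsteal_kswapd: pages scanned/reclaimed by kswapd
# pswpin / pswpout: pages read from / written to swap
grep -E '^(pgscan_kswapd|pgsteal_kswapd|pswpin|pswpout) ' /proc/vmstat
```

Rapidly growing pswpout values mean step 3 is actively pushing anonymous pages to swap.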
Swappiness comes into play at step 3. It is an I/O cost heuristic for choosing between swapping and reclaiming cache. Since Linux 5.8 the value ranges from 0 to 200 (previously 0-100). The default value is 60.
A value near 0 means “swap is expensive”; at 100 the kernel treats writing to swap and reclaiming cache equally; above 100 means “swap is cheap”.
Contrary to what many documents say, it’s not a percentage. The complete swappiness formula is the following:

```
swap_tendency = mapped_ratio/2 + vm_swappiness + distress
```
- mapped_ratio = percentage of RAM currently mapped by processes (halved in the formula)
- distress = a 0 to 100 value that grows when repeated reclaim passes have failed
- vm_swappiness = your configured value
If swap_tendency exceeds 100, kswapd will scan anonymous pages (process heap or stack) and push them into swap.
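To make the formula concrete, here is a small worked example with hypothetical numbers (40% of RAM mapped, default swappiness, no reclaim distress):

```shell
mapped_ratio=40    # hypothetical: 40% of RAM mapped by processes
vm_swappiness=60   # default value
distress=0         # no failed reclaim passes
swap_tendency=$(( mapped_ratio / 2 + vm_swappiness + distress ))
echo "$swap_tendency"   # 80: below 100, kswapd leaves anonymous pages alone
```

With the same mapped_ratio, raising vm_swappiness to 100 would give 120 and anonymous pages would start being swapped out.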
Here are some recommended values:
| Value | Behaviour | Use case |
|---|---|---|
| 0 | Do not swap until OOM | Latency-critical real-time systems (higher risk of OOM) |
| 1-10 | Swapping only really cold anon pages | Databases |
| 60 | Default | Desktop |
| 100 | Free cache aggressively | Heavy fileserver |
| >100 | Prefer swapping to keep big page-cache hot; intended for systems where swap is fast (zram, NVMe) | Low memory systems with zram and swap on NVMe devices |
To change the swappiness value:

```shell
sysctl -w vm.swappiness=10
```
To make it persistent across reboots:

```shell
echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/99-my-swappiness.conf
```
You can monitor swap activity like this:

```shell
vmstat 1
```

The si/so columns show swap in / swap out. This can help you adjust the swappiness value.
Manage dirty memory
In opposition to anonymous pages (a process’s heap or stack), a file page is part of some on-disk file.
How does a file page become dirty?
- When you read() a file on disk, the kernel pulls the data from disk, puts it in RAM, and maps those pages into your process (the page is clean).
- If you modify the file with write(), the data in memory changes and no longer matches the data on disk (the page is marked dirty).
- Kernel flusher threads then perform a writeback operation to disk (the page becomes clean again).
Writeback operations are performed in the following situations:
- Periodically, at a fixed interval (every 5 seconds by default)
- When memory pressure forces reclaim
- On explicit calls such as sync, fsync, fdatasync, msync, or unmount
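You can see how much dirty memory is waiting for writeback at any moment via /proc/meminfo:

```shell
# Dirty: pages modified in RAM but not yet written to disk
# Writeback: pages currently being written out
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

A persistently large Dirty value means the flusher threads can't keep up with the write rate.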
Because this mechanism lets the kernel group I/O before performing writeback, values that are too low will put higher pressure on your device. Adding delay reduces the pressure but increases the possible data loss if your system crashes. To change the dirty settings, you can modify these sysctl parameters:
- vm.dirty_background_ratio: percentage of system memory that can be filled with dirty pages before background writeback starts
- vm.dirty_ratio: maximum percentage of system memory that can be filled with dirty pages before everything must get committed to disk
- vm.dirty_expire_centisecs: age (in hundredths of a second) after which dirty pages are eligible for writeback
You can set these parameters to reduce pressure on the disk, for example:

```
vm.dirty_background_ratio = 5
```
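As with swappiness, these settings can be made persistent with a drop-in file under /etc/sysctl.d (the vm.dirty_ratio value here is an illustrative assumption, tune it for your workload):

```shell
printf 'vm.dirty_background_ratio = 5\nvm.dirty_ratio = 10\n' \
  | sudo tee /etc/sysctl.d/99-dirty.conf
sudo sysctl --system   # reload all sysctl configuration files
```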
Filesystem cache in RAM
The filesystem caches some objects in RAM (directory entries and inodes). Depending on your amount of RAM, you can configure how aggressively the kernel reclaims this cache.
The vfs_cache_pressure option is here for that.
With the default value (100), the kernel tries to reclaim dentries and inodes at a fair rate with respect to pagecache and swapcache reclaim.
Reducing the value reduces cache reclaim and uses more RAM (improving performance on storage operations). Be careful: this can lead to OOM.
For workloads with a lot of operations on small files, you can reduce this value to 50.
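A minimal sketch to inspect and lower the value (the runtime change is lost at reboot unless persisted in sysctl.d like the other parameters above):

```shell
cat /proc/sys/vm/vfs_cache_pressure      # current value, 100 by default
sudo sysctl -w vm.vfs_cache_pressure=50  # keep dentry/inode caches around longer
```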
Push throw-away files to RAM
Some software uses drives for caches and temp files. With a lot of small write operations, this can affect the performance of other software. To avoid this, you can configure the cache directory as tmpfs (a filesystem in RAM).
To achieve this, you can add this to fstab:

```
tmpfs /tmp tmpfs rw,nosuid,nodev,size=2G 0 0
```
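You can apply the mount without rebooting and verify it (assuming nothing is actively holding files open under /tmp when you remount):

```shell
sudo mount -t tmpfs -o rw,nosuid,nodev,size=2G tmpfs /tmp
df -h /tmp   # should show a 2G tmpfs mounted on /tmp
```

Keep in mind tmpfs contents are lost at reboot, which is exactly what you want for throw-away files.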
IRQ management
As we saw in this blog post, with multiple NUMA nodes, pinning IRQs properly has an impact on device performance. For NVMe devices you can also perform this operation for better performance.
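A sketch of how to inspect and pin an NVMe IRQ by hand (the IRQ number 24 and the CPU mask are hypothetical; if irqbalance is running, it may override manual pinning):

```shell
# Find the IRQs used by the NVMe queues
grep nvme /proc/interrupts

# Inspect and change the CPU affinity mask of one of them (24 is hypothetical)
cat /proc/irq/24/smp_affinity
echo 1 | sudo tee /proc/irq/24/smp_affinity   # mask 0x1 = pin to CPU 0
```

Ideally pick CPUs on the NUMA node closest to the device (see `/sys/class/nvme/nvme0/device/numa_node`).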
CPU changes
CPU governor
To reduce power consumption, all modern OSes use frequency scaling to lower the CPU frequency when load is low. In most cases this is a great thing, but when you need better performance and reactivity you can switch the governor to performance mode.
To achieve this:

```shell
sudo apt install linux-cpupower
sudo cpupower frequency-set -g performance
```
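You can check which governor is active per core through sysfs (the path only exists when the kernel exposes cpufreq, e.g. not in most virtual machines):

```shell
gov=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
if [ -r "$gov" ]; then cat "$gov"; else echo "cpufreq not exposed on this system"; fi
```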
CPU mitigations
CPU mitigations respond to hardware security vulnerabilities (e.g. Meltdown) and can drastically reduce performance. You can disable all mitigations by adding this to your kernel cmdline and updating your bootloader:

```
mitigations=off
```
For some mitigations you have different configuration modes (like SRSO, for example).
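Before and after changing the cmdline, you can check which mitigations are currently active:

```shell
# One line per known vulnerability, with its current mitigation status
grep . /sys/devices/system/cpu/vulnerabilities/*
```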
End of this series of blog posts
Before I close: if you’re interested in this kind of problem and would like to work on large-scale environments, I’m looking for motivated people to join my SRE team at Scaleway. If you’d like to apply, you can get my professional email here:

```txt
prodisonfire.timotebrusson.fr
```
I hope you’ve enjoyed this series of blog posts on optimizing storage performance on Linux. There are plenty of other optimizations not covered in this series, but the subject is so vast that it’s impossible to cover everything. See you next time for more blog posts.