This blog post is part of a series. If you haven’t read the previous blog post, it’s here: A week into linux storage performances - Day 4: Partitions and Filesystem
Introduction
In the previous blog posts we saw which points to focus on in the storage layers to improve performance and capacity. Some Linux behaviors may impact storage performance by creating pressure on the storage itself or on other components.
These are the points we can address to gain a bit more performance:
- memory management
- IRQ management
- CPU changes
Memory management
Swap management
The exchange space, or swap, is often a partition on disk that allows your system to move some RAM pages onto it to avoid running out of memory. Many automated distribution installers set up a swap partition by default.
Because of this, when the system needs to read a page that sits on swap, it slows down considerably (disk access time) and puts a strong load on your storage (lots of small IOPS).
By default, Linux readily copies pages to swap even when memory isn’t full. To control this behavior, we have several tuning knobs.
Disabling swap is a good solution in many cases. You may want to avoid doing that in the following cases:
- When you need to hibernate your system
- Small memory devices where ram usage spikes can lead to OOM
- Systems hosting critical services where uptime is more important than performances
Chris Down wrote a great blog post to defend swap.
If you still want to remove swap, you can disable it on the fly:

```shell
swapoff -a
```
Then you can remove the corresponding entry in fstab and delete the partition.
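Before deleting anything, it can help to check what swap devices are active and to comment out the fstab entry rather than deleting it right away. This is a sketch; the sed pattern assumes the classic fstab layout where the type field reads ` swap `:

```shell
# Show active swap devices and overall memory usage
swapon --show
free -h

# Comment out swap entries in fstab instead of deleting them (reversible)
sudo sed -i '/ swap /s/^/#/' /etc/fstab
```

Commenting out the line keeps the rollback trivial: uncomment it and run `swapon -a`.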
Another alternative is to stop using swap on disk and use zram to put it on a compressed RAM disk. To do this:

```shell
sudo apt install zram-tools
```
In the configuration file, set the percentage of your RAM you allow for zram:

```shell
echo -e "ALGO=zstd\nPERCENT=20" | sudo tee -a /etc/default/zramswap
```
Finally, you can reload the service:

```shell
sudo systemctl reload zramswap
```
You can display information about the zram space:

```shell
sudo zramctl
```
If neither of these solutions is possible, you can change how the kernel manages cache and swap operations.
Before touching this parameter it’s important to understand how memory reclaim works.
On a running system you have a process named kswapd whose mission is to reclaim pages when free memory is low. To achieve this, it proceeds in 3 steps:
- Drop clean file pages
- Write dirty file pages back to disk
- Swap out cold anonymous pages
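You can watch these reclaim steps at work through counters in /proc/vmstat (exact counter names can vary slightly between kernel versions):

```shell
# pgscan_kswapd / pgsteal_kswapd: pages scanned/reclaimed by kswapd
# pswpin / pswpout: pages read from / written to swap
grep -E '^(pgscan_kswapd|pgsteal_kswapd|pswpin|pswpout) ' /proc/vmstat
```

Rapidly growing pswpout values mean step 3 is actively pushing anonymous pages to swap.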
Swappiness comes into play at step 3. It is an I/O cost heuristic for choosing between swapping and reclaiming cache. Since Linux 5.8 the value ranges from 0 to 200 (previously 0-100). The default value is 60.
A value near 0 means “swap is expensive”; at 100 the kernel treats writing to swap and reclaiming cache equally; above 100 means “swap is cheap”.
Contrary to what many documents say, it’s not a percentage. The complete swappiness formula is the following:

```
swap_tendency = mapped_ratio/2 + vm_swappiness + distress
```
- mapped_ratio = percentage of RAM currently mapped by processes (halved in the formula)
- distress = a 0 to 100 value that grows when repeated reclaim passes have failed
- vm_swappiness = your configured value
If swap_tendency exceeds 100, kswapd will scan anonymous pages (process heap or stack) and push them into swap.
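To make the formula concrete, here is a small worked example with hypothetical numbers (40% of RAM mapped, default swappiness, no reclaim distress):

```shell
mapped_ratio=40    # hypothetical: 40% of RAM mapped by processes
vm_swappiness=60   # default value
distress=0         # no failed reclaim passes
swap_tendency=$(( mapped_ratio / 2 + vm_swappiness + distress ))
echo "$swap_tendency"   # 80: below 100, kswapd leaves anonymous pages alone
```

With the same mapped_ratio, raising vm_swappiness to 100 would give 120 and anonymous pages would start being swapped out.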
Here are some recommended values:
| Value | Behaviour | Use case |
|---|---|---|
| 0 | Do not swap until OOM | Latency-critical real-time systems (higher risk of OOM) |
| 1-10 | Swapping only really cold anon pages | Databases |
| 60 | Default | Desktop |
| 100 | Free cache aggressively | Heavy fileserver |
| >100 | Prefer swapping to keep big page-cache hot; intended for systems where swap is fast (zram, NVMe) | Low memory systems with zram and swap on NVMe devices |
To change the swappiness value:

```shell
sysctl -w vm.swappiness=10
```
To make it persistent across reboots:

```shell
echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/99-my-swappiness.conf
```
You can monitor swap activity like this:

```shell
vmstat 1
```

The si/so columns show swap in / swap out. This can help you adjust the swappiness value.
Manage dirty memory
In opposition to anonymous pages (a process’s heap or stack), a file page is part of some on-disk file.
How does a file page become dirty?
- When you read() a file on disk, the kernel pulls the data from disk, puts it in RAM, and maps those pages into your process (the page is clean).
- If you modify the file with write(), the data in memory changes and no longer matches the data on disk (the page is marked dirty).
- Kernel flusher threads then perform a writeback operation to disk (the page becomes clean again).
Writeback operations are performed in the following situations:
- Periodically, at a fixed interval (every 5 seconds by default)
- When memory pressure forces reclaim
- On explicit calls such as sync, fsync, fdatasync, msync, or unmount
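You can see how much dirty memory is waiting for writeback at any moment via /proc/meminfo:

```shell
# Dirty: pages modified in RAM but not yet written to disk
# Writeback: pages currently being written out
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

A persistently large Dirty value means the flusher threads can't keep up with the write rate.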
Because this mechanism lets the kernel group I/O before performing writeback, values that are too low will put higher pressure on your device. Adding delay reduces the pressure but increases the possible data loss if your system crashes. To change the dirty settings, you can modify these sysctl parameters:
- vm.dirty_background_ratio: percentage of system memory that can be filled with dirty pages before background writeback starts
- vm.dirty_ratio: maximum percentage of system memory that can be filled with dirty pages before everything must get committed to disk
- vm.dirty_expire_centisecs: age (in hundredths of a second) after which dirty pages are eligible for writeback
You can set these parameters to reduce pressure on the disk, for example:

```
vm.dirty_background_ratio = 5
```
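As with swappiness, these settings can be made persistent with a drop-in file under /etc/sysctl.d (the vm.dirty_ratio value here is an illustrative assumption, tune it for your workload):

```shell
printf 'vm.dirty_background_ratio = 5\nvm.dirty_ratio = 10\n' \
  | sudo tee /etc/sysctl.d/99-dirty.conf
sudo sysctl --system   # reload all sysctl configuration files
```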
Filesystem cache in RAM
The filesystem caches some objects in RAM (directory entries and inodes). Depending on your amount of RAM, you can configure how aggressively the kernel reclaims this cache.
The vfs_cache_pressure option is here for that.
With the default value (100), the kernel tries to reclaim dentries and inodes at a fair rate with respect to pagecache and swapcache reclaim.
Reducing the value reduces cache reclaim and uses more RAM (improving performance on storage operations). Be careful: this can lead to OOM.
For workloads with a lot of operations on small files, you can reduce this value to 50.
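A minimal sketch to inspect and lower the value (the runtime change is lost at reboot unless persisted in sysctl.d like the other parameters above):

```shell
cat /proc/sys/vm/vfs_cache_pressure      # current value, 100 by default
sudo sysctl -w vm.vfs_cache_pressure=50  # keep dentry/inode caches around longer
```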
Push throw-away files to RAM
Some software uses drives for caches and temp files. With a lot of small write operations, this can affect the performance of other software. To avoid this, you can configure the cache directory as tmpfs (a filesystem in RAM).
To achieve this, you can add this to fstab:

```
tmpfs /tmp tmpfs rw,nosuid,nodev,size=2G 0 0
```
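You can apply the mount without rebooting and verify it (assuming nothing is actively holding files open under /tmp when you remount):

```shell
sudo mount -t tmpfs -o rw,nosuid,nodev,size=2G tmpfs /tmp
df -h /tmp   # should show a 2G tmpfs mounted on /tmp
```

Keep in mind tmpfs contents are lost at reboot, which is exactly what you want for throw-away files.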
IRQ management
As we saw in this blog post, with multiple NUMA nodes, pinning IRQs properly has an impact on device performance. For NVMe devices you can also perform this operation for better performance.
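A sketch of how to inspect and pin an NVMe IRQ by hand (the IRQ number 24 and the CPU mask are hypothetical; if irqbalance is running, it may override manual pinning):

```shell
# Find the IRQs used by the NVMe queues
grep nvme /proc/interrupts

# Inspect and change the CPU affinity mask of one of them (24 is hypothetical)
cat /proc/irq/24/smp_affinity
echo 1 | sudo tee /proc/irq/24/smp_affinity   # mask 0x1 = pin to CPU 0
```

Ideally pick CPUs on the NUMA node closest to the device (see `/sys/class/nvme/nvme0/device/numa_node`).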
CPU changes
CPU governor
To reduce power consumption, all modern OSes use frequency scaling to lower the CPU frequency when load is low. In most cases this is a great thing, but when you need better performance and reactivity you can switch the governor to performance mode.
To achieve this:

```shell
sudo apt install linux-cpupower
sudo cpupower frequency-set -g performance
```
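You can check which governor is active per core through sysfs (the path only exists when the kernel exposes cpufreq, e.g. not in most virtual machines):

```shell
gov=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
if [ -r "$gov" ]; then cat "$gov"; else echo "cpufreq not exposed on this system"; fi
```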
CPU mitigations
CPU mitigations respond to hardware security vulnerabilities (e.g. Meltdown) and can drastically reduce performance. You can disable all mitigations by adding this to your kernel cmdline and updating your bootloader:

```
mitigations=off
```
For some mitigations you have different configuration modes (like SRSO, for example).
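Before and after changing the cmdline, you can check which mitigations are currently active:

```shell
# One line per known vulnerability, with its current mitigation status
grep . /sys/devices/system/cpu/vulnerabilities/*
```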
End of this series of blog posts
Before I close: if you’re interested in this kind of problem and would like to work on large-scale environments, I’m looking for motivated people to join my SRE team at Scaleway. If you’d like to apply, you can get my professional email here:

```txt
prodisonfire.timotebrusson.fr
```
I hope you’ve enjoyed this series of blog posts on optimizing storage performance on Linux. There are plenty of other optimizations not covered in this series, but the subject is so vast that it’s impossible to cover everything. See you next time for more blog posts.