This blog post is part of a series. If you haven’t read the previous post, it’s here: A week into linux storage performances - Day 3: Manage redundancy with RAID
Context
A file system provides hierarchical data management on a physical device.
There are three kinds of file systems:
- Local file systems (ext4, btrfs, xfs)
- Network file systems (nfs, cephfs)
- Virtual file systems (procfs, sysfs)
We won’t cover network and virtual file systems here.
File systems are usually created on partitions, which are sections of a physical storage device.
Because this adds another tier to the storage stack, it must be managed carefully to avoid losing performance.
Filesystem comparison
Here, we will quickly compare five filesystems and benchmark their performance.
- EXT4: One of the most widely used journaling filesystems; fast, with a strong level of data integrity.
- BTRFS: Modern copy-on-write filesystem with advanced features such as snapshots, checksumming and built-in RAID.
- XFS: High-performance filesystem for large files, high throughput and databases.
- Bcachefs: High-performance, low-latency filesystem with built-in compression, encryption and block caching to combine mechanical and flash devices.
- F2FS: For flash storage only; designed by Samsung for mobile devices, with low overhead on write-intensive workloads.
Analysis: In sequential workloads, bcachefs is clearly the best filesystem; in random workloads, EXT4 is still a good choice, but XFS also performs well.
For the rest of this blog post we’ll focus on EXT4, for its versatility and because it is the most widely used filesystem.
Partition alignment
As we saw in Day 2: Device optimization and scheduler with physical and logical device alignment, we also need to align partitions with the block size.
If your partition starts in the middle of a block, each access needs 2 I/Os instead of 1, and small files waste storage space: with a 4k sector size, a 1-byte file takes 8k of space.
Fortunately, modern partitioning tools do the job for you, but if you partition manually you’ll need to take care of this yourself. A good practice is to start partitions on a 1MiB boundary.
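To illustrate, the 1MiB rule comes down to simple arithmetic. Here is a small sketch (the helper name is mine; a partition’s start sector is exposed in /sys/block/<disk>/<part>/start, in 512-byte sectors):

```shell
# A partition start is 1MiB-aligned when start_sector * 512 bytes
# is a multiple of 1048576 (1MiB). Helper name is illustrative.
is_aligned_1mib() {
  start_sector=$1   # in 512-byte sectors, e.g. from /sys/block/sda/sda1/start
  [ $(( start_sector * 512 % 1048576 )) -eq 0 ]
}

# The common default start of sector 2048 is exactly 1MiB:
is_aligned_1mib 2048 && echo "sector 2048: aligned"
# Old DOS-style layouts starting at sector 63 are not:
is_aligned_1mib 63 || echo "sector 63: misaligned"
```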
You also need to align the filesystem block size with your device block size. By default, modern tools set the filesystem block size to 4096. When creating a new filesystem you can set it manually like this:

```shell
mkfs.ext4 -b 4096 /dev/sda1
```
Extended options with RAID when striping
When using RAID 0, 4, 5, 6 or 10, the address space of the array is conceptually divided into chunks, and consecutive chunks are striped onto neighbouring devices. To avoid writes that straddle stripe boundaries, you need to set extended options to align the filesystem layout with the stripes.
There are two options to define:
- stride: number of filesystem blocks that fit into one RAID chunk.
- stripe-width: number of filesystem blocks that span the entire RAID stripe.
You need to apply the following formulas:
- stride = chunk_size / filesystem_block_size
- stripe-width = stride × number_of_data_disks
We will give an example for each RAID level with a 512k chunk (the default with modern mdadm; legacy was 64k) and a 4k logical and physical disk block size.
RAID0 (4 disks / 4 data disks):
stride = 512K / 4K = 128
stripe-width = 128 × 4 = 512
RAID5 (3 disks / 2 data disks):
stride = 512K / 4K = 128
stripe-width = 128 × 2 = 256
RAID6 (4 disks / 2 data disks):
stride = 512K / 4K = 128
stripe-width = 128 × 2 = 256
RAID 10 (6 disks / 3 mirror pairs, so 3 data disks):
stride = 512K / 4K = 128
stripe-width = 128 × 3 = 384 # mirroring doesn’t add data disks, striping across the 3 pairs does
To set the extended options when formatting the device (e.g. for this RAID 10 layout):

```shell
mkfs.ext4 -E stride=128,stripe-width=384 /dev/md0
```
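The two formulas are easy to script; here is a small sketch (the function name is mine) that prints the -E argument for a given chunk size, filesystem block size and data-disk count:

```shell
# Compute ext4 extended options from RAID geometry.
# Arguments: chunk size (KiB), filesystem block size (KiB), data disks.
raid_ext4_opts() {
  chunk_kib=$1; block_kib=$2; data_disks=$3
  stride=$(( chunk_kib / block_kib ))
  echo "stride=${stride},stripe-width=$(( stride * data_disks ))"
}

raid_ext4_opts 512 4 2   # RAID5 with 3 disks (2 data disks)
```

This prints stride=128,stripe-width=256, matching the RAID5 example above.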
Reserved blocks
Some filesystems have a reserved-blocks mechanism. According to the mke2fs man page: this avoids fragmentation, and allows root-owned daemons, […] to continue to function correctly after non-privileged processes are prevented from writing to the filesystem. The default percentage is 5%.
On modern large drives, 5% can represent a huge amount of lost space: on a 26TB drive you lose about 1.3TB.
If the lost space is more than ~50G, you can safely reduce it.
To disable reserved space entirely (not recommended, since it protects against filesystem-full situations):

```shell
mkfs.ext4 -m 0 /dev/sda1
```
You can set a lower percentage (e.g. 2%):

```shell
mkfs.ext4 -m 2 /dev/sda1
```
Or set the reserved space as a block count (this can be changed on a mounted filesystem):

```shell
tune2fs -r <size in bytes / device block size> /dev/sda1
```

E.g. to reserve 50GiB on a device with a 4k block size:

```shell
tune2fs -r $(( 50 * 1024 * 1024 * 1024 / 4096 )) /dev/sda1
```
You can display the current reserved block count for a partition:

```shell
tune2fs -l /dev/sda1 | grep 'Reserved block '
```

Multiplying this value by the block size and dividing by 1024³ gives the lost space in GiB.
NB: this tip doesn’t improve performance; it only recovers some usable space.
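As a sketch, that conversion can be done with shell arithmetic (the function name is mine):

```shell
# Convert a reserved block count to GiB (integer result).
reserved_gib() {
  blocks=$1; block_size=$2
  echo $(( blocks * block_size / 1024 / 1024 / 1024 ))
}

reserved_gib 13107200 4096   # the 50GiB reservation from the tune2fs example above
```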
FS optimizations
atime, diratime
atime (the access timestamp) records the last read time of each file; diratime does the same for directories. You can disable this feature to reduce file-access overhead on your filesystem.
To disable atime and diratime on a running system:

```shell
mount -o remount,noatime,nodiratime /
```

Add these options to /etc/fstab to make them persistent across reboots.
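For example, an /etc/fstab entry could look like this (device and mount point are placeholders):

```
/dev/sda1  /data  ext4  defaults,noatime,nodiratime  0  2
```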
fast_commit
An important feature of ext4 is delayed allocation. That means the filesystem defers block allocation until data is actually flushed to disk. This avoids unneeded operations for short-lived small files and large batched writes, and helps allocate space contiguously. By default the flush interval is 60s, so data can take up to a minute to really be written to disk. Some applications, like databases, cannot wait that long to guarantee data integrity, so they use the fsync() syscall to write to the device immediately.
If you run fio with fsync enabled, you will see a big performance hit.
To handle this issue, the fast_commit feature was introduced. It adds a second journal with a simplified commit path, so operations that can be simplified get lower overhead. In production this means better performance, balanced by a longer journal recovery time in case of a crash.
To enable fast_commit:

```shell
tune2fs -O fast_commit /dev/sda1
```
For more on fsync() optimizations, you can read this great paper from USENIX.
commit delay
As we saw above, data can stay in flight for up to 60 seconds. If you have a battery-backed RAID card and a UPS-protected system, you can increase this value to better absorb I/O peaks. Don’t do it if you don’t know what you’re doing.
This option is set in your fstab with commit=<time in seconds>.
discard
The discard mount option performs a TRIM for each deleted file. This has a high negative performance impact, so discard is usually disabled by default.
If necessary, you can make sure it is off with the nodiscard option in fstab.
EXT4 journaling mode
On EXT4 you have three journaling modes:
- journal (safest, slowest): logs both data and metadata to the journal
- ordered (the default; faster than journal, safer than writeback): logs only metadata to the journal, but writes data to disk before committing the metadata
- writeback (fastest but least safe): logs only metadata, and data can be written before the journal is updated
Set data=<mode> in your fstab to change the journaling mode.
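For example, to put a data volume in writeback mode (device and mount point are placeholders; note that changing the data mode of the root filesystem usually has to go through a kernel boot parameter instead of fstab):

```
/dev/sda1  /data  ext4  defaults,data=writeback  0  2
```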
Bonus: lvmcache
Setup
If you have both mechanical and flash drives in the same server, you can use LVM across the devices to set up caching and improve performance. This is based on the dm-cache feature.
Assuming a 2TB slow HDD (/dev/sda) and a 128GB fast NVMe (/dev/nvme1n1), we can set up the storage like this:
Set up basic LVM:

```shell
pvcreate /dev/sda1 /dev/nvme1n1p1
vgcreate vg0 /dev/sda1 /dev/nvme1n1p1
```

Then create the logical volumes (the cache LV sizes below are illustrative):

```shell
lvcreate -l 100%FREE -n data vg0 /dev/sda1
lvcreate -L 100G -n lv_cache vg0 /dev/nvme1n1p1      # cache data, size illustrative
lvcreate -L 1G -n lv_cache_meta vg0 /dev/nvme1n1p1   # cache metadata, size illustrative
```
A good ratio of cache to cache-metadata size is ~1000:1, with a minimum of 8MiB. In this example lv_cache_meta is a bit oversized, but this is only for testing.
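That rule of thumb is easy to express as a sketch (the function name is mine; sizes in MiB):

```shell
# dm-cache metadata sizing rule of thumb: ~1/1000 of the cache size,
# with an 8MiB floor.
cache_meta_mib() {
  cache_mib=$1
  meta=$(( cache_mib / 1000 ))
  [ "$meta" -lt 8 ] && meta=8
  echo "$meta"
}

cache_meta_mib 102400   # for a 100GiB cache LV
```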
Then we can convert lv_cache into a cache pool:

```shell
lvconvert --type cache-pool --cachemode writeback --cachepolicy smq --poolmetadata vg0/lv_cache_meta vg0/lv_cache
```
Finally, we attach the cache pool to the data LV:

```shell
lvconvert --type cache --cachepool vg0/lv_cache vg0/data
```

You can now use /dev/vg0/data as a normal LV.
Performance comparison
As we can see, the NVMe cache improves performance. However, in production the gain will be smaller, because here fio hits the same file every time, so the cache-hit ratio is higher than it would be across many files.
N.B.: The cache isn’t redundant in this setup; losing the cache device will result in data loss. To avoid that, consider using RAID on both the NVMe and HDD sides to stay redundant in case of a device failure.
Filesystem trim
When you delete a file on a filesystem, only the inode (or its reference) is removed, so the space appears free for new writes, but the blocks on the device aren’t erased. Performing a TRIM erases those blocks on the device. This gives you the following advantages:
- Free blocks are ready when a write comes in, avoiding the read / erase / write-back cycle
- Data can be rebalanced correctly across the device, avoiding overuse of some areas
Depending on the intensity of write operations, you need to adjust the time between two TRIMs. Waiting too long between TRIMs creates high pressure on the disk during the operation.
You can simply create a basic systemd timer to run TRIM automatically.
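Note that most distributions already ship such a timer with util-linux (fstrim.timer, which trims mounted filesystems weekly by default), so you may not need to write your own unit:

```shell
systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer   # shows the next scheduled run
```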
What’s next?
In the next and last blog post of this series, we’ll see which global system parameters can affect storage performance.
A week into linux storage performances - Day 5 : Miscellaneous