This blog post is part of a series. If you haven't read the previous post, it's here: A week into linux storage performances - Day 2 : Device optimization and scheduler
Introduction
Managing multiple disks in a system while maintaining good performance and proper redundancy can become overwhelming with modern devices.
A few years ago, using a hardware RAID card was a pretty good choice for handling mechanical drives, but it came with some nasty surprises:
- Manufacturer dependency
- A failed cache battery could lead to downtime
- Additional licensed features
- Hard to handle different types of disks (e.g. mixing SATA and SAS)
- Can become a bottleneck with many disks
- Pretty expensive
Fortunately, there are many ways to do software RAID that address these issues and also handle NVMe devices.
Reminder
There are different RAID levels to address different setups and criteria.
| RAID Level | Name | Min disks | Disks that can fail | Usable space | Pros | Cons |
|---|---|---|---|---|---|---|
| 0 | Striped | 1 | 0 | 100% | High performance | Losing 1 disk results in complete data loss |
| 1 | Mirror | 2 | N-1 | 1/N | Good performance | Redundancy cost is high with more than 2 disks |
| 5 | 1 parity | 3 | 1 | (N-1)/N | High usable space | Weak redundancy with a high number of disks, poor performance |
| 6 | 2 parity | 4 | 2 | (N-2)/N | Like RAID5 with better redundancy | Really bad performance |
| 10 | RAID 1+0 | 4 | 1 per mirror | N/2 | Good redundancy level and high performance | 50% of space lost regardless of the number of disks |
| 50 | RAID 5+0 | 6 | 1 per sub-array | G×(N-1)×S | Better redundancy than RAID5 with many disks, low space loss | High rebuild time |
| 60 | RAID 6+0 | 8 | 2 per sub-array | G×(N-2)×S | Better redundancy than RAID50 | Really, really high rebuild time. Poor performance |
G: number of sub-arrays · N: number of disks (per sub-array for RAID50/60, total otherwise) · S: size of one disk
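As a quick sanity check of the nested-level capacity figures, here is a small shell calculation (the disk counts and the 2TB disk size are arbitrary examples, not a real setup):

```shell
S=2   # size of one disk in TB (example value)

# RAID50: 6 disks as 2 RAID5 sub-arrays of 3 disks -> 2 data disks per sub-array
echo "RAID50 usable: $(( 2 * (3 - 1) * S ))TB"   # prints: RAID50 usable: 8TB

# RAID60: 8 disks as 2 RAID6 sub-arrays of 4 disks -> 2 data disks per sub-array
echo "RAID60 usable: $(( 2 * (4 - 2) * S ))TB"   # prints: RAID60 usable: 8TB
```

Note that with these small examples RAID50 and RAID60 end up with the same usable space; the gap only appears as sub-arrays grow.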
With modern workloads, disk prices and performance levels, RAID1 and RAID10 are preferred for their performance and redundancy. With a high number of disks (more than 60), technologies other than RAID can be considered.
Quick RAID bench
Just to compare performance, I ran quick benchmarks with FIO on a server with 4 NVMe drives of 2.05TB each, using MDADM for RAID. I didn't apply any optimizations for these benchmarks.
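The exact FIO job files aren't reproduced in this post; for reference, a job of roughly this shape exercises 4k random reads (the target device, sizes and job names here are my assumptions, not the original jobs):

```ini
; sketch of a 4k random-read job — not the exact job used in the post
[global]
ioengine=libaio   ; async I/O engine
direct=1          ; bypass the page cache
runtime=60
time_based=1
group_reporting=1

[randread-4k]
filename=/dev/md0 ; hypothetical target device
rw=randread
bs=4k
iodepth=32
numjobs=4
```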
NVME raw performances
As we can see, all four NVMe drives show the same performance.
RAID 1/5/6/10 benchmark
Based on this small benchmark, RAID10 is the most versatile level, with a pretty good tradeoff between performance and usable capacity. Because of this, we'll focus on RAID10 for the rest of this blog post.
Software RAID method on Linux
Here we'll only explore MDADM, LVM and ZFS, but below is a small list of the available ways to do software RAID on Linux.
MDADM
The historical way to do software RAID on Linux.
Pros :
- Mature, with extensive documentation and a large community
- Works everywhere, every time
Cons :
- No advanced features like snapshots or checksums
LVM
Logical volume management.
Pros :
- Flexible
- Easy management and well documented
Cons :
- Often needs a custom monitoring system to get disk failure information
ZFS
Combined disk and volume management, with an integrated filesystem.
Pros :
- End-to-end checksumming prevents silent data corruption
- Built-in RAID (RAID-Z) and snapshot capabilities
- Pretty good performance
Cons :
- High memory usage
- Needs third-party packages
- May be tightly coupled to the kernel version (not always up to date with the latest kernels)
BTRFS
New-generation filesystem with built-in RAID, snapshots, and checksumming.
Pros :
- Snapshots and copy-on-write for efficient backups
Cons :
- May still be subject to instability
dm-raid
Kernel-level framework using the device-mapper infrastructure to provide RAID features.
Pros :
- Can replace mdadm by providing direct kernel-level RAID management
Cons :
- Less straightforward to configure and monitor
- Limited features compared to ZFS
Lab context
For these benchmarks, I used a Scaleway EM-L220E-NVME dedicated server with the following specifications:
- AMD EPYC 7232P (8c/16t)
- 64GB RAM
- 4x NVMe SSSTC CA6-8D2048
NVME specifications :
- Sequential Read/Write 6,800/4,800 MB/s
- 4K Random Read/Write 500K/500K IOPS
You can find the complete specification here.
At the time of the benchmarks, the NVMe drives were almost new with equal wear on each, and they are evenly distributed over 2 PCIe lanes.
I installed the needed packages:

```shell
apt update && apt install nvme-cli fio lvm2 zfsutils-linux mdadm -y
```
Between each RAID method, I formatted the NVMe drives:

```shell
for nvme in $(nvme list | grep nvme | awk '{print $1}'); do nvme format -s1 $nvme & done
```
Setup different types of RAID
MDADM
Create the MDADM RAID10 array:

```shell
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
```
Raise the recovery speed limit to avoid waiting for hours:

```shell
echo 9999999 > /proc/sys/dev/raid/speed_limit_max
```
This operation isn't recommended in a production environment.
Now we need to wait for the initial sync to complete. To follow it:

```shell
watch -n 1 cat /proc/mdstat
```
Once the sync is done, we can run FIO.
RAID10 with LVM
Set all NVMe drives as LVM physical volumes:

```shell
pvcreate /dev/nvme[0-3]n1
```
Create a volume group with all NVMe drives:

```shell
vgcreate vg_raid10 /dev/nvme[0-3]n1
```
Create a logical volume in RAID10 mode:

```shell
lvcreate -l 100%FREE -n lv_raid10 --type raid10 --mirrors 1 --stripes 2 vg_raid10
```
RAID10 with ZFS
Create the ZFS pool in RAID10 (a stripe of two mirrors):

```shell
zpool create -o ashift=12 ztank mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1
```
Create a dataset for our tests:

```shell
zfs create ztank/fio_test
```
Check that everything went well (note that `zpool status` takes the pool name, not a dataset):

```shell
zpool status ztank
```
Hybrid RAID10 with MDADM and LVM
Here we'll build a hybrid RAID10: an LVM volume striped across two MDADM RAID1 arrays.
This can be a good idea because MDADM manages RAID1 well, while LVM is faster at striping data.
First, create both MDADM RAID1 arrays:

```shell
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/nvme2n1 /dev/nvme3n1
```

Then create the LVM volume on top of them:

```shell
pvcreate /dev/md[0-1]
vgcreate vg_hybrid /dev/md[0-1]
```

Finally, stripe a logical volume across both arrays with `lvcreate -l 100%FREE -n lv_hybrid --type striped --stripes 2 vg_hybrid`.
Benchmark results
Bandwidth
IOPS
Analysis
ZFS is pretty good at sequential operations and not too bad at random operations, except for random reads. It is important to note that during the tests, ZFS took around 20GB of RAM (out of 64GB total), so in real life with RAM limitations, performance will be lower. It's also worth pointing out that ZFS was the only solution tested in filesystem mode.
It's much the same story for LVM, with lower RAM consumption.
MDADM is good only on random read operations and underperforms on the other tests.
The hybrid mode gives balanced results depending on the workload.
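On the ZFS memory point above: if you need to bound the ARC size, ZFS exposes a `zfs_arc_max` module parameter. A modprobe configuration fragment like this caps it (the 16GiB value is an arbitrary example, not what was used in these benchmarks):

```
# /etc/modprobe.d/zfs.conf
# Cap the ZFS ARC at 16GiB (16 * 1024^3 = 17179869184 bytes)
options zfs zfs_arc_max=17179869184
```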
Conclusion: pick a solution adapted to your workload and available resources. There is also a ton of tuning you can do to get better performance out of each system.
What’s next ?
In the next blog post, we'll compare different filesystem features and performance, and see how to tune them.
A week into linux storage performances - Day 4 : Partitions and Filesystem
Addendum 2025-06-18
Further to a comment by Guillaume M. on LinkedIn, I'm adding this addendum to address block alignment between RAID5/6 MDADM and LVM, because it's a bit specific.
When doing RAID5/6, MDADM works in stripes: each stripe contains n data chunks (minimum 2) plus 1 parity chunk (2 for RAID6), each chunk being 512k long by default on modern systems (64k on legacy ones).
The data stripe size is therefore (n - 1) × chunk_size for RAID5 (with n the number of disks), so for a minimal RAID5 setup (3 disks) the stripe size is 1M. For RAID6 it is (n - 2) × chunk_size.
If LVM extents aren't properly aligned with the stripe size, there are negative impacts:
- Writes may straddle stripe boundaries, resulting in bad write performance
- RAID5 must read-modify-write more often
- Disk wear may increase (SSD)
To avoid this, you need to initialize the physical volume properly with LVM:

```shell
pvcreate --dataalignment 1M /dev/mdX
```
Set it to the stripe size, or a multiple of it. The default value (1M) is fine for RAID5/6 with the minimal number of disks, but with more disks you have to specify it.
For example, if you have 8 disks in your RAID5, your stripe size is 3584k (7×512k), so you need to set it manually.
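The calculation above can be scripted; this little sketch just reproduces the 8-disk example (it only echoes the command rather than running it):

```shell
chunk_kb=512   # mdadm default chunk size in KiB
disks=8        # disks in the array (the example above)
parity=1       # 1 for RAID5, 2 for RAID6

# data stripe size = (data disks) * chunk size
stripe_kb=$(( (disks - parity) * chunk_kb ))
echo "pvcreate --dataalignment ${stripe_kb}k /dev/md0"   # prints: pvcreate --dataalignment 3584k /dev/md0
```

Check the computed value against the `Chunk Size` line of `mdadm --detail /dev/md0` before actually running the command.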