A week into Linux storage performance - Day 3: Managing redundancy with RAID

This blog post is part of a series. If you haven't read the previous blog post, it's here: A week into Linux storage performance - Day 2: Device optimization and scheduler

Introduction

Managing multiple disks in a system while maintaining good performance and proper redundancy can be mind-bending with modern devices.

A few years ago, hardware RAID cards were a pretty good choice for handling mechanical drives, but they come with some nasty surprises:

  • Manufacturer lock-in
  • A failed battery (BBU) can lead to downtime
  • Additional licensed features
  • Hard to handle different types of disks (e.g. mixing SATA and SAS)
  • Can become a bottleneck with a lot of disks
  • Pretty expensive

Fortunately, there are plenty of software RAID methods to handle these issues and NVMe devices.

Reminder

There are different RAID levels to address different setups and criteria.

| RAID level | Name | Min disks | Disks that can fail | Usable space | Pros | Cons |
|---|---|---|---|---|---|---|
| 0 | Striped | 1 | 0 | 100% | High performance | Losing 1 disk results in complete data loss |
| 1 | Mirror | 2 | N-1 | 1/N | Good performance | Redundancy cost gets high with more than 2 disks |
| 5 | 1 parity | 3 | 1 | (N-1)/N | High usable space | Poor redundancy with a high number of disks, poor performance |
| 6 | 2 parity | 4 | 2 | (N-2)/N | Like RAID5 with better redundancy | Really bad performance |
| 10 | RAID 1+0 | 4 | 1 per mirror pair | N/2 × S | Good redundancy level and high performance | 50% of space lost regardless of the number of disks |
| 50 | RAID 5+0 | 6 | 1 per array | (N-G) × S | Better redundancy than RAID5 with a high number of disks, low space loss | High rebuild time |
| 60 | RAID 6+0 | 8 | 2 per array | (N-2G) × S | Better redundancy than RAID50 | Really high rebuild time, poor performance |

G: number of arrays · N: total number of disks · S: size of one disk

With modern workloads, disk prices and performance, RAID1 and RAID10 are preferred for their performance and redundancy level. With a very high number of disks (more than 60), technologies other than RAID can be considered.
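The capacity formulas from the table can be checked with a quick shell sketch. The disk count, array count and disk size below are made-up example values (note that (N-2G) × S for RAID60 is equivalent to G × (n-2) × S with n disks per array):

```shell
# Usable capacity for RAID50/60 from the table's formulas.
# Example values: N = 8 disks total, G = 2 arrays, S = 2 TB per disk.
N=8; G=2; S=2

echo "RAID50 usable: $(( (N - G) * S )) TB"       # (N - G) x S  -> 12 TB
echo "RAID60 usable: $(( (N - 2 * G) * S )) TB"   # (N - 2G) x S -> 8 TB
```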

Quick RAID bench

For a quick performance comparison, I ran FIO benchmarks on a server with 4 NVMe drives of 2.05TB each, using mdadm for RAID. I didn't make any optimizations for these benchmarks.

NVMe raw performance

As we can see, all NVMe drives show the same performance.

RAID 1/5/6/10 benchmark

This small benchmark shows that RAID10 is the most versatile RAID level, with a pretty good tradeoff between performance and available capacity. Because of this, we will focus on RAID10 in the rest of this blog post.

Software RAID methods on Linux

Here we'll only explore MDADM, LVM and ZFS, but below is a short list of the available ways to perform software RAID.

MDADM

The historical method on Linux to achieve software RAID.

Pros :

  • Mature method with extensive documentation and a large community
  • Works everywhere, every time

Cons :

  • No advanced features like snapshots or checksums

LVM

Logical volume management.

Pros :

  • Flexible
  • Easy management and well documented

Cons :

  • Often needs a custom monitoring setup to get disk failure information

ZFS

Combined disk and volume management. Also provides an integrated filesystem.

Pros :

  • End-to-end checksumming prevents silent data corruption
  • Built-in RAID (RAID-Z) and snapshot capabilities
  • Pretty good performance

Cons :

  • High memory usage
  • Needs third-party packages
  • Tight coupling with the kernel version (not always up to date with the latest kernels)

BTRFS

New-generation filesystem with built-in RAID, snapshots, and checksumming.

Pros :

  • Snapshots and copy-on-write for efficient backups

Cons :

  • May still be subject to instability (notably its RAID5/6 modes)

dm-raid

Kernel-level framework using the device-mapper infrastructure to provide RAID features.

Pros :

  • Can replace mdadm by providing direct kernel-level RAID management

Cons :

  • Less straightforward to configure and monitor
  • Limited features compared to ZFS

Lab context

For these benchmarks, I used a Scaleway EM-L220E-NVME dedicated server with the following specifications:

  • AMD EPYC 7232P (8c/16t)
  • 64GB RAM
  • 4× NVMe SSSTC CA6-8D2048

NVMe specifications:

  • Sequential Read/Write 6,800/4,800 MB/s
  • 4K Random Read/Write 500K/500K IOPS

You can find the complete specification here

When the benchmarks were performed, the NVMe drives were pretty new, with equal wear on each. The drives are evenly distributed over 2 PCIe lanes.

I installed the required packages:

apt update && apt install nvme-cli fio lvm2 zfsutils-linux mdadm -y

Between each RAID method, I formatted the NVMe drives (the `wait` ensures all backgrounded formats finish before moving on):

for nvme in $(nvme list | grep nvme | awk '{print $1}'); do nvme format -s1 $nvme & done; wait

Setup different types of RAID

MDADM

Create the mdadm RAID10 array:

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Speed up the resync to avoid waiting for hours:

echo 9999999 > /proc/sys/dev/raid/speed_limit_max

This operation isn't recommended in a production environment.

Now we need to wait for the initial sync to complete. To follow it:

watch -n 1 cat /proc/mdstat

Once the rebuild is done, we can run FIO.
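The post doesn't show the exact FIO job used, so here is an illustrative sketch of a 4k random-read job; the device path, queue depth, job count and runtime are assumptions to adapt to your setup:

```shell
# Hypothetical FIO job file for a 4k random-read test on the array.
# filename, iodepth, numjobs and runtime are example values, not the
# exact settings used for the benchmarks in this post.
cat > /tmp/randread.fio <<'EOF'
[randread]
filename=/dev/md0
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=4
runtime=60
time_based=1
group_reporting=1
EOF
```

Run it with `fio /tmp/randread.fio`; swap `rw=randread` for `randwrite`, `read` or `write` to cover the other access patterns.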

RAID10 with LVM

Set all NVMe drives as LVM physical volumes:

pvcreate /dev/nvme[0-3]n1

Create a volume group with all the NVMe drives:

vgcreate vg_raid10 /dev/nvme[0-3]n1

Create a logical volume in RAID10 mode:

lvcreate -l 100%FREE -n lv_raid10 --type raid10 --mirrors 1 --stripes 2 vg_raid10

RAID10 with ZFS

Create a ZFS pool in RAID10 (a stripe of mirrors):

zpool create -o ashift=12 ztank mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1

Create a dataset for our tests:

zfs create ztank/fio_test

Check that everything went well (`zpool status` takes a pool name, not a dataset):

zpool status ztank

Hybrid RAID10 with MDADM and LVM

Here we'll build a hybrid RAID10: an LVM volume striped across 2 mdadm RAID1 arrays.

This can be a good idea because mdadm manages RAID1 well, while LVM is faster at striping data.

First, create both mdadm RAID1 arrays:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/nvme2n1 /dev/nvme3n1

Then create the LVM volume:

pvcreate /dev/md[0-1]
vgcreate vg_raid10 /dev/md[0-1]
lvcreate --type raid0 -l 100%FREE --stripes 2 --stripesize 4 -n lv_raid10 vg_raid10

Benchmark results

Bandwidth

IOPS

Analysis

ZFS performs well on sequential operations and is not bad on random operations, except for random reads. Note that during the tests, ZFS used around 20GB of RAM (out of 64GB total), so in real life with RAM constraints, performance will be poorer. It's also worth pointing out that ZFS was the only solution tested through a filesystem.

It's much the same for LVM, with lower RAM consumption.

MDADM is good only on random-read operations and falls behind on the other tests.

The hybrid mode yields balanced results depending on the workload.

Conclusion: pick a solution adapted to your workload and available resources. There is also a ton of tuning you can do to get better performance out of each system.

What's next?

In the next blog post, we'll compare different filesystem features and performance, and see how to tune them.

A week into Linux storage performance - Day 4: Partitions and Filesystems

Addendum 2025-06-18

Following a comment from Guillaume M. on LinkedIn, I'm adding this addendum to address block alignment between mdadm RAID5/6 and LVM, because it's a bit specific.

When doing RAID5/6, mdadm organizes data in stripes: each stripe contains one chunk per disk, i.e. n-1 data chunks plus 1 parity chunk for RAID5 (n-2 data plus 2 parity for RAID6), where n is the number of disks. Each chunk is 512k by default on modern systems (64k on legacy ones).

So the data stripe size is (n - 1) × chunk_size for RAID5 (with n the number of disks): for a minimal RAID5 setup (3 disks), the stripe size is 1M. For RAID6 it is (n - 2) × chunk_size.

If LVM extents aren't properly aligned with the stripe size, this leads to negative impacts:

  • Writes may straddle stripe boundaries, resulting in bad write performance
  • RAID5/6 must read-modify-write more often
  • Disk wear may increase (SSDs)

To avoid this, you need to initialize the physical volume with the proper alignment:

pvcreate --dataalignment 1M /dev/mdX

Set it to at least the stripe size, or a multiple of it. The default value (1M) is fine for RAID5/6 with the minimal number of disks, but with more disks you need to specify it explicitly.

For example, if you have 8 disks in your RAID5, the stripe size is 3584k (7 × 512k), so you need to set it manually.
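The arithmetic can be scripted. The disk count and chunk size below are example values; check your array's real chunk size with `mdadm --detail /dev/mdX` and adjust:

```shell
# Compute the LVM --dataalignment value for an mdadm RAID5/6 array.
# disks and chunk_k are example values; parity is 1 for RAID5, 2 for RAID6.
disks=8
chunk_k=512
parity=1

stripe_k=$(( (disks - parity) * chunk_k ))
echo "pvcreate --dataalignment ${stripe_k}k /dev/mdX"
# -> pvcreate --dataalignment 3584k /dev/mdX
```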