Launch more efficient VMs on Qemu (AMD CPU)

In this blog post we’ll see how to improve the performance of KVM/QEMU virtual machines running on AMD CPUs. This is not a universal recipe guaranteeing good performance, but rather a list of good practices and some explanations of AMD CPU particularities.

We will explore the following topics:

  • AMD CPU architecture with CCX
  • NUMA handling
  • Hugepages

Prepare environment

Let’s install the tools needed to create a VM and display some information:

apt install -y qemu-kvm hwloc cpuid sysbench

Now we can create a VM:

# Create directory for our VM
mkdir bench_vm && cd $_
# Download debian iso
wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-12.6.0-amd64-netinst.iso
# Create 60G image for vm
qemu-img create -f qcow2 disk0.img 60G

Then we can start the VM to perform the install:

qemu-system-x86_64 -machine accel=kvm,kernel-irqchip=split -cpu host,topoext=on -m 8192 -hda disk0.img -boot d -enable-kvm -smp cores=4 -vga std -vnc 0.0.0.0:0 -cdrom debian-12.6.0-amd64-netinst.iso -monitor unix:$(pwd)/monitor.sock,server,nowait -daemonize

With this you can access the VM console with a VNC client on port 5900 of your hypervisor IP.
In this context, I performed a minimal install with an SSH server.
On the VM we only need the sysbench package.
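Inside the guest, that is simply:

apt install -y sysbench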

I also edited the guest’s GRUB config and ran update-grub to enable the serial console:

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"

To access the serial console:

socat UNIX-CONNECT:$(pwd)/serial.sock STDIO

When the install is complete, we can stop the VM from the guest and restart it from the host, replacing VNC with an SSH port forward on the host:

qemu-system-x86_64 -machine accel=kvm,kernel-irqchip=split -cpu host,topoext=on -m 8192 -hda disk0.img -boot d -enable-kvm -smp cores=4 -net nic -net user,hostfwd=tcp::2222-:22 -monitor unix:$(pwd)/monitor.sock,server,nowait -serial unix:$(pwd)/serial.sock,server,nowait -display none -daemonize

If you need to stop the VM from the host:

echo "system_powerdown" | socat unix-connect:$(pwd)/monitor.sock stdio

Our lab is ready!

AMD CPU architecture with CCX

On AMD architectures, a CCX (Core CompleX) is a cluster of cores sharing a common L3 cache. There can be several CCXs on a CCD (Core Chiplet Die).

┌───────┬────┬──────────┬────┬───────┐
│ CORE0 │ L2 │          │ L2 │ CORE2 │
├───────┼────┤ L3 CACHE ├────┼───────┤
│ CORE1 │ L2 │          │ L2 │ CORE3 │
└───────┴────┴──────────┴────┴───────┘

AMD white paper about CPU architecture

The number of cores per CCX depends on the CPU generation. For EPYC the values are the following:

CPU GEN                           Cores per CCX   CCX per die
1st Generation (Zen - Naples)     4               Up to 4
2nd Generation (Zen 2 - Rome)     4               Up to 8
3rd Generation (Zen 3 - Milan)    8               Up to 8
4th Generation (Zen 4 - Genoa)    8               Up to 8
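To see how the threads of your own CPU are grouped, lscpu can show which L3 cache each thread belongs to; the CACHE column prints the L1d:L1i:L2:L3 ids (it requires a reasonably recent util-linux):

lscpu -e=CPU,CORE,SOCKET,NODE,CACHE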

You can display your CPU architecture with lstopo, which returns something like this:

L3 L#0 (8192KB)
  L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
    PU L#0 (P#0)
    PU L#1 (P#8)
  L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
    PU L#2 (P#1)
    PU L#3 (P#9)
L3 L#1 (8192KB)
  L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
    PU L#4 (P#2)
    PU L#5 (P#10)
  L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
    PU L#6 (P#3)
    PU L#7 (P#11)

In this example we see 2 CCXs with 2 cores / 4 threads each. If you CPU pin your VM, it’s better to group the vCPUs according to the CCX layout, keeping threads that share an L3 together. For example, with the topology above, a 4 vCPU VM pinned on threads 0,8,1,9 (a single CCX) will be faster than one pinned on threads spread across both CCXs such as 0,1,2,3 (see the sketch below).
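As a sketch, assuming the topology shown above and the VM image created earlier, keeping the whole QEMU process on a single CCX could look like this (the thread numbers come from the lstopo output, adapt them to your machine):

# Threads 0,8,1,9 all share L3 L#0 (one CCX) in the lstopo output above
taskset -ca 0,8,1,9 qemu-system-x86_64 -machine accel=kvm -enable-kvm -cpu host,topoext=on \
  -smp cores=2,threads=2 -m 8192 -hda disk0.img -display none -daemonize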

NUMA handling

NUMA (Non-Uniform Memory Access) is a multiprocessor design where memory DIMMs are attached to different buses, each local to a CPU. This design makes it possible to handle large amounts of memory with a high number of CPU cores, and compensates for the limits of the classic SMP architecture.

┌──────┐                                     ┌──────┐
│ RAM1 ├─────┐                         ┌─────┤ RAM5 │
├──────┤     ├────────┐       ┌────────┤     ├──────┤
│ RAM2 ├─────┤        │  BUS  │        ├─────┤ RAM6 │
├──────┤     │  CPU0  ├───────┤  CPU1  │     ├──────┤
│ RAM3 ├─────┤        │       │        ├─────┤ RAM7 │
├──────┤     ├────────┘       └────────┤     ├──────┤
│ RAM4 ├─────┘                         └─────┤ RAM8 │
└──────┘                                     └──────┘

An important NUMA value is nodes per socket (NPS), which defines how memory is distributed across each CPU socket. This is usually configurable in the BIOS, depending on how many CPUs and how much memory you have (e.g. Dell recommended configuration for the R6525).

You can see the NUMA and NPS configuration with lscpu.
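For example (numactl is not installed by default, apt install numactl if you want the second command):

# NUMA node count and which CPUs belong to each node
lscpu | grep -i numa
# More detail: memory per node and inter-node distances
numactl --hardware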

If you create a VM with more cores than a CCX can hold, you should CPU pin by NUMA node. This improves performance by avoiding context switches and traffic on the interconnect bus. If you create a VM with more vCPUs than a NUMA node can hold, you should create a virtual NUMA mapping / dual socket VM.

On AMD this interconnect bus is called HT (HyperTransport), while on Intel it is called QPI (QuickPath Interconnect).

Core numbering isn’t contiguous per NUMA node; for example, NUMA node 1 can hold cores 0,1,2,3,16,17,18,19.
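As a minimal sketch, assuming NUMA node 0 holds the cores and memory you want to dedicate to the VM (numactl required, options otherwise the same as earlier in this post):

# Restrict QEMU's vCPUs and memory allocations to NUMA node 0
numactl --cpunodebind=0 --membind=0 qemu-system-x86_64 -machine accel=kvm -enable-kvm \
  -cpu host,topoext=on -smp cores=4 -m 8192 -hda disk0.img -display none -daemonize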

Hugepages and memory management

Before talking about hugepages, let’s look at how memory is managed inside a computer.

A memory page is a fixed-size, contiguous segment of your physical memory. The size of a memory page is generally 4KB.

When a program runs, it needs to access memory addresses, but in fact a program always uses virtual memory addresses.

To handle this, there is a structure named the page table that translates virtual memory addresses into physical memory addresses.

However, walking the page table for every translation is slow and costly in CPU time. That’s why each CPU has a TLB (translation lookaside buffer), a small buffer (a few KB) that caches recent translations. Each time memory is accessed, the CPU looks in the TLB, with two possible outcomes:

  • TLB hit: the physical memory address has been found
  • TLB miss: the page table must be walked

So, the more TLB hits you get, the faster your system is.

The issue is that the TLB is really small: if you have 4GB of RAM with 4KB pages, you get 2^20 pages (2^32 / 2^12), so there will be plenty of page table walks.
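You can get a rough idea of your TLB sizes with the cpuid tool installed earlier (the exact lines and entry counts depend on the CPU model):

# Show TLB related lines for a single core
cpuid -1 | grep -i tlb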

A smart way to avoid this is to use hugepages, i.e. memory pages bigger than 4KB. For a process using a huge amount of memory (like a VM :)) this brings a performance gain. For example, with 1GB pages and 4GB of RAM you only have 4 pages, so they all fit into the TLB (in reality it is not possible to use hugepages everywhere).

On a modern Linux based OS you generally have a mechanism named THP (transparent hugepages) that automatically creates hugepages when needed. However, it can degrade performance in some cases, for databases for example. In addition, in a virtualisation context it’s better to manage hugepages manually.
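You can check, and disable at runtime, THP through sysfs; the transparent_hugepage=never kernel parameter set further below makes it persistent:

# Current mode: the value in brackets is the active one
cat /sys/kernel/mm/transparent_hugepage/enabled
# Disable it until the next reboot
echo never > /sys/kernel/mm/transparent_hugepage/enabled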

Let’s go!

Before performing any modifications, we can run a benchmark to have a baseline.

sysbench threads --threads=4 --thread-locks=2 --events=60000 run

Result:

General statistics:
    total time:                          10.0012s
    total number of events:              28465

Latency (ms):
         min:                                    0.27
         avg:                                    1.40
         max:                                    7.55
         95th percentile:                        3.02
         sum:                                39989.47

Threads fairness:
    events (avg/stddev):           7116.2500/98.34
    execution time (avg/stddev):   9.9974/0.00

Now we can take some actions to improve this:

Let’s allocate hugepages and persist them in fstab:

mkdir -p /mnt/hugepages
mount -t hugetlbfs -o pagesize=1G,size=16G none /mnt/hugepages
echo vm.nr_hugepages=16 >> /etc/sysctl.conf
cat /proc/mounts |grep /mnt/hugepages >> /etc/fstab

Advice: the hugepage size should match the RAM size of your smallest VM, but not exceed 1GB. Before using 1GB pages you need to check that your CPU supports them with cat /proc/cpuinfo | grep pdpe1gb.

Update the host kernel cmdline for proper hugepage configuration:

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16 transparent_hugepage=never"

And update GRUB:

update-grub
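After a reboot, you can verify that the pages were reserved with the expected size:

# Should report HugePages_Total: 16 and Hugepagesize: 1048576 kB
grep -i huge /proc/meminfo
cat /proc/cmdline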

Then we can start our VM with some options to take hugepages and CPU pinning into account:

taskset -ca 4,5,6,7 qemu-system-x86_64 -machine accel=kvm,kernel-irqchip=split -cpu host -smp cores=4 \
-m 8192 -mem-prealloc -mem-path /mnt/hugepages \
-hda disk0.img -boot d -enable-kvm \
-net nic -net user,hostfwd=tcp::2222-:22 \
-monitor unix:$(pwd)/monitor.sock,server,nowait -serial unix:$(pwd)/serial.sock,server,nowait -display none -daemonize
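Note that taskset here pins the whole QEMU process. If you want to go further, you can pin each vCPU thread to a dedicated host thread by listing the thread IDs through the monitor socket; a minimal sketch (the thread_id below is a placeholder):

# List vCPU thread IDs via the QEMU monitor ("CPU #0: thread_id=...")
echo "info cpus" | socat unix-connect:$(pwd)/monitor.sock stdio
# Then pin each vCPU thread to one host thread of the chosen CCX, e.g.:
taskset -cp 4 12345   # 12345 = thread_id of vCPU 0 (placeholder)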

Finally, we can run a new benchmark:

sysbench threads --threads=4 --thread-locks=2 --events=60000 run

General statistics:
    total time:                          10.0011s
    total number of events:              36576

Latency (ms):
         min:                                    0.27
         avg:                                    1.09
         max:                                    7.99
         95th percentile:                        2.26
         sum:                                39987.73

Threads fairness:
    events (avg/stddev):           9144.0000/45.73
    execution time (avg/stddev):   9.9969/0.00

As you can see, we got a 28.5% gain in the number of events, and the 95th percentile latency dropped from 3.02 ms to 2.26 ms (about 25% lower).

The performance gain can be even bigger on larger VMs. Note that these settings aren’t universal and may vary depending on your hardware and workload.

Additional tip:

If you run databases on the guest, it can be a good idea to disable the high resolution timer to avoid performance issues.
You can do this by adding the following to the kernel cmdline: nohz=off highres=off
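For example, applied to the guest, this extends the GRUB config we edited earlier (run update-grub and reboot afterwards):

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200n8 nohz=off highres=off"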

More documentation