Optimize Linux performance with IRQ balancing

Context

This blog post describes what an IRQ is and how we can improve bare-metal server performance on Linux.

What is an IRQ (Interrupt request)

Interrupt

An IRQ is a signal sent to the processor by hardware or software indicating that an event needs immediate attention. When a CPU receives an IRQ, it stops running its current task and runs an interrupt handler instead.

Types of interrupts

Hardware interrupts are used to handle events like the reception of a network packet. Software interrupts are used for exceptions or specific syscalls.

PIC and IRQ priority

The Programmable Interrupt Controller (PIC) is a hardware component that manages and prioritizes the interrupt signals sent from the various peripheral devices to the CPU.

IRQ line

Each device that can generate an interrupt is connected to a specific IRQ line. The number of available IRQ lines depends on the hardware architecture.

Modern systems integrate an APIC (Advanced Programmable Interrupt Controller), which allows up to 255 physical hardware IRQ lines, far more than older systems (8 or 24).

How we can improve performance by managing IRQs

In the context of NUMA nodes (multiple NPS domains or multiple sockets), pinning IRQs to the right NUMA node can improve performance because it optimizes the locality of data access and reduces latency. It improves cache usage, avoids context switches, and reduces the cost of I/O operations.

The goal is to detect which NUMA node each device (network card, NVMe drive) is attached to, and pin its IRQs to the cores of that node.
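If the numactl package is installed, numactl --hardware gives a quick overview of the NUMA topology (nodes, their cores and memory), which is a handy starting point before pinning anything:

numactl --hardware

lscpu, used later in this post, exposes the same CPU-to-node mapping.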

Let’s see how we can improve this

First, we can list the interrupts on our system:

cat /proc/interrupts

This shows the number of interrupts triggered on each CPU. For example, on a dual-core VM:

root@scw-goofy-chatelet:~# cat /proc/interrupts
CPU0 CPU1
10: 6380 7046 GICv3 27 Level arch_timer
12: 0 0 GICv3 33 Level uart-pl011
49: 0 0 GICv3 41 Edge ACPI:Ged
50: 0 0 GICv3 23 Level arm-pmu
51: 1 0 ITS-MSI 491520 Edge PCIe PME, aerdrv, pciehp
52: 0 1 ITS-MSI 493568 Edge PCIe PME, aerdrv, pciehp
53: 1 0 ITS-MSI 495616 Edge PCIe PME, aerdrv, pciehp
54: 0 1 ITS-MSI 497664 Edge PCIe PME, aerdrv, pciehp
55: 1 0 ITS-MSI 499712 Edge PCIe PME, aerdrv, pciehp
56: 0 1 ITS-MSI 501760 Edge PCIe PME, aerdrv, pciehp
57: 1 0 ITS-MSI 503808 Edge PCIe PME, aerdrv, pciehp
58: 0 1 ITS-MSI 505856 Edge PCIe PME, aerdrv, pciehp
59: 0 0 ITS-MSI 49152 Edge virtio2-config
60: 0 30 ITS-MSI 49153 Edge virtio2-virtqueues
61: 0 0 ITS-MSI 32768 Edge virtio1-config
62: 0 0 ITS-MSI 32769 Edge virtio1-control
63: 0 0 ITS-MSI 32770 Edge virtio1-event
64: 2415 0 ITS-MSI 32771 Edge virtio1-request
65: 0 2443 ITS-MSI 32772 Edge virtio1-request
66: 0 0 ITS-MSI 16384 Edge virtio0-config
67: 89 0 ITS-MSI 16385 Edge virtio0-input.0
68: 92 0 ITS-MSI 16386 Edge virtio0-output.0
69: 0 113 ITS-MSI 16387 Edge virtio0-input.1
70: 0 147 ITS-MSI 16388 Edge virtio0-output.1
IPI0: 1273 1051 Rescheduling interrupts
IPI1: 1712 1770 Function call interrupts
IPI2: 0 0 CPU stop interrupts
IPI3: 0 0 CPU stop (for crash dump) interrupts
IPI4: 0 0 Timer broadcast interrupts
IPI5: 0 0 IRQ work interrupts
IPI6: 0 0 CPU wake-up interrupts
Err: 0
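To watch these counters evolve in real time (for example while generating network traffic), one simple option is watch:

watch -d -n1 cat /proc/interrupts

The -d flag highlights the counters that changed between two refreshes, which makes it easy to spot which CPU handles which device.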

However, this doesn’t show all IRQs (only 10-70 in the example above). To list all of them, you can run ls /proc/irq/, which shows something like this:

0  1  10  11  12  13  14  15  18  19  2  25  26  27  28  29  3  30  31  32  36  37  38  39  4  42  45  47  49  5  50  51  52  53  54  55  57  59  6  60  61  62  63  64  65  66  67  68  69  7  70  72  73  8  9  default_smp_affinity

Here we see all the IRQ lines and a file named default_smp_affinity. This file specifies the default affinity mask that applies to all non-active IRQs. Once an IRQ is allocated/activated, its affinity bitmask is set to the default mask. It can then be changed as described below. The default mask is 0xffffffff.

We can read this file:

# cat /proc/irq/default_smp_affinity
ffffff

In this case, new IRQ affinities can be allocated on any core of the system.
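The mask is hexadecimal and bit N corresponds to CPU N, so ffffff (24 bits set) means CPUs 0-23 are allowed, which matches the 24-core machine used in the NUMA examples below. A quick way to check that from the shell:

printf '%x\n' $(( (1 << 24) - 1 ))
# ffffff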

For each IRQ, you can see the current affinity (a CPU bitmask) and the corresponding CPU list:

cat /proc/irq/19/smp_affinity
000007

cat /proc/irq/19/smp_affinity_list
0,1,2

You can change either smp_affinity or smp_affinity_list, but you only need to change one of them; the other is updated automatically.

For smp_affinity_list, you can set the CPU list with individual cores (e.g. 0,1,2) or with a range of cores (e.g. 0-2).

Example:

echo 0-2 > /proc/irq/19/smp_affinity_list

cat /proc/irq/19/smp_affinity
000007

echo 0,1,2 > /proc/irq/19/smp_affinity_list

cat /proc/irq/19/smp_affinity
000007
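Both notations end up as the same bitmask because each CPU N in the list simply sets bit N of the mask. A small illustrative snippet to compute the mask for CPUs 0, 1 and 2:

mask=0
for cpu in 0 1 2; do
    mask=$(( mask | (1 << cpu) ))
done
printf '%06x\n' "$mask"
# 000007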

Let’s optimize this

Our goal here is to get all PCI devices with an IRQ and change their smp_affinity_list to match their NUMA node.

To get the NUMA nodes and their cores, we can run lscpu | grep NUMA, which returns:

NUMA node(s):                         2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
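The same mapping is also exposed in sysfs, which can be easier to consume from a script than parsing lscpu output:

cat /sys/devices/system/node/node0/cpulist
# same list as the lscpu output above for node0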

For each PCI device, we can see its NUMA node like this:

cat /sys/bus/pci/devices/<DEVICE_ID>/numa_node

which can return two types of values:

  • -1: the device is not associated with any specific NUMA node
  • Any other number: the NUMA node the device is attached to

Also, some PCI devices don’t have any IRQ line, so we should ignore them.
I’ve made a small script to list all devices with an IRQ and their NUMA node:

check_irq.sh
#!/bin/bash

set -eo pipefail

for device in /sys/bus/pci/devices/*
do
    # Skip the loop entirely if the glob did not match anything
    [[ -e "$device" ]] || break

    device_id="$(basename "${device}")"
    numa_node=$(cat "/sys/bus/pci/devices/${device_id}/numa_node")
    irq=$(cat "/sys/bus/pci/devices/${device_id}/irq")

    # Only report devices attached to a NUMA node and owning a real IRQ line
    if [[ "$numa_node" != "-1" ]] && [ -f "/proc/irq/${irq}/smp_affinity_list" ]; then
        irq_affinity_list=$(cat "/proc/irq/${irq}/smp_affinity_list")
        echo "PCI device with id $device_id is located on numa node $numa_node and use irq line ${irq}"
        echo "The irq line ${irq} is ${irq_affinity_list}"
    fi

done
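Assuming the script is saved as check_irq.sh, you can simply run it with:

sudo bash check_irq.sh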

It produces output like this:

PCI device with id 0000:00:00.0 is located on numa node 0 and use irq line 0
The irq line 0 is 0-23
PCI device with id 0000:00:01.0 is located on numa node 0 and use irq line 25
The irq line 25 is 0,2,4,6,8,10,12,14,16,18,20,22
PCI device with id 0000:00:02.0 is located on numa node 0 and use irq line 26
The irq line 26 is 0,2,4,6,8,10,12,14,16,18,20,22
PCI device with id 0000:00:02.2 is located on numa node 0 and use irq line 27
The irq line 27 is 0,2,4,6,8,10,12,14,16,18,20,22
PCI device with id 0000:00:03.0 is located on numa node 0 and use irq line 28
The irq line 28 is 0,2,4,6,8,10,12,14,16,18,20,22
[...]

Now we can remediate this by pinning each IRQ to the cores of its NUMA node. We can adapt the previous script to also apply the changes.
N.B.: We can’t change the affinity of IRQ line 0.

fix_irq.sh
#!/bin/bash

set -eo pipefail

for device in /sys/bus/pci/devices/*
do
    # Skip the loop entirely if the glob did not match anything
    [[ -e "$device" ]] || break

    device_id="$(basename "${device}")"
    numa_node=$(cat "/sys/bus/pci/devices/${device_id}/numa_node")
    irq=$(cat "/sys/bus/pci/devices/${device_id}/irq")

    if [[ "$numa_node" != "-1" ]] && [ -f "/proc/irq/${irq}/smp_affinity_list" ]; then

        irq_affinity_list=$(cat "/proc/irq/${irq}/smp_affinity_list")
        # The trailing " CPU" avoids matching "node1" against "node10", "node11", ... on large systems
        cpu_list="$(lscpu | grep "NUMA node${numa_node} CPU" | awk '{print $4}')"

        # Only rewrite the affinity when it differs, and never touch IRQ line 0
        if [[ "$cpu_list" != "$irq_affinity_list" ]] && [[ "$irq" != "0" ]]; then
            echo "PCI device with id $device_id is located on numa node $numa_node and use irq line ${irq}"
            echo "The irq line ${irq} is ${irq_affinity_list}"
            echo "CPU list for ${numa_node} is $cpu_list which isn't matching ${irq_affinity_list}"
            echo "$cpu_list" > "/proc/irq/${irq}/smp_affinity_list"
        fi

    fi
done

Remember to disable irqbalance (if installed and enabled) before performing this kind of operation, to avoid automatic rebalancing of the affinities you just set.
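On distributions that ship the irqbalance service under systemd, that typically means stopping and disabling it before applying the script (assuming it is saved as fix_irq.sh):

# stop the daemon and keep it from restarting at boot
sudo systemctl stop irqbalance
sudo systemctl disable irqbalance

# then apply the pinning (root is needed to write to /proc/irq)
sudo bash fix_irq.sh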

This kind of change can improve your network throughput and latency (for interfaces with bandwidth >= 10 Gbps) as well as local storage operations (NVMe).
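To verify the gain on your own workload, measure before and after the change, for example with iperf3 for the network path and fio for an NVMe drive. The commands below are only illustrative starting points; adjust the target and parameters to your setup:

# network throughput against an iperf3 server running on another host
iperf3 -c <server_ip> -t 30 -P 4

# 4k random reads on a local NVMe drive (read-only, does not write to the disk)
fio --name=randread --filename=/dev/nvme0n1 --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=4 --runtime=30 --time_based --group_reporting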