Context
This blog post describes what an IRQ is and how we can improve bare-metal server performance on Linux.
What is an IRQ (Interrupt request)
Interrupt
An IRQ is a signal sent to the processor by hardware or software indicating that an event needs immediate attention. When a CPU receives an IRQ, it stops running its current task and runs an interrupt handler instead.
Types of interrupt
Hardware interrupts are used to handle events like the reception of a network packet. Software interrupts are used for exceptions or specific syscalls.
PIC and IRQ priority
A Programmable Interrupt Controller (PIC) is a hardware device used to manage and prioritize the interrupt signals sent from the various peripheral devices to the CPU.
IRQ line
Each device that can generate an interrupt is connected to a specific IRQ line. The number of available IRQ lines depends on the hardware architecture.
Modern systems integrate an APIC (Advanced Programmable Interrupt Controller), which supports up to 255 physical hardware IRQ lines, far more than older systems (8 or 24 lines).
How we can improve performance by managing IRQs
In the context of NUMA nodes (multiple NPS domains or multiple sockets), pinning IRQs to the right NUMA node can improve performance because it optimizes the locality of data access and reduces latency. It improves cache usage, avoids context switches, and keeps I/O operations local.
The goal is to detect on which NUMA node each device (network card, NVMe drive) is located and pin its IRQs to cores of that node.
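For example, for a network interface the NUMA node can be read directly through sysfs (the interface name eth0 is just an illustration here):

    cat /sys/class/net/eth0/device/numa_node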
Let’s see how we can improve this.
First, we can list the interrupts on our system like this:
    cat /proc/interrupts
This shows the number of interrupts triggered on each CPU, e.g. on a dual-core VM:
    root@scw-goofy-chatelet:~# cat /proc/interrupts
But this does not show all IRQ lines (10 to 70 in the example above); to list all of them you can run ls /proc/irq/, which shows us something like this:
    0 1 10 11 12 13 14 15 18 19 2 25 26 27 28 29 3 30 31 32 36 37 38 39 4 42 45 47 49 5 50 51 52 53 54 55 57 59 6 60 61 62 63 64 65 66 67 68 69 7 70 72 73 8 9 default_smp_affinity
Here we see all the IRQ lines and a file named default_smp_affinity. This file specifies the default affinity mask that applies to all non-active IRQs: once an IRQ is allocated/activated, its affinity bitmask is set to the default mask. It can then be changed as described below. The default mask is 0xffffffff.
We can read this file:
    # cat /proc/irq/default_smp_affinity
In this case, a newly activated IRQ can be allocated on any core of the system.
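To read the mask: each bit corresponds to one CPU (bit 0 is CPU0, bit 1 is CPU1, and so on), so 0xffffffff allows any of the first 32 CPUs; on the dual-core VM above only the two lowest bits matter. As an illustration (using IRQ 19, the line used in the examples below), restricting an IRQ to CPU1 only means writing a mask with only bit 1 set:

    # Keep IRQ 19 on CPU1 only: bit 1 set -> hex mask 2
    echo 2 > /proc/irq/19/smp_affinity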
For each IRQ you can see the current affinity, both as a CPU mask (smp_affinity) and as a list of CPUs (smp_affinity_list):
    cat /proc/irq/19/smp_affinity
You can change smp_affinity or smp_affinity_list, but you only need to change one of them: the other one is updated automatically. For smp_affinity_list you can set individual cores (e.g. 0,1,2) or a range of cores (e.g. 0-2).
Example:
    echo 0-2 > /proc/irq/19/smp_affinity_list
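Since the kernel keeps the two files in sync, a quick way to check the change, still using IRQ 19 from the example:

    # List form, as written above
    cat /proc/irq/19/smp_affinity_list
    # Mask form, updated automatically by the kernel
    cat /proc/irq/19/smp_affinity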
Let’s optimize this
Our goal here is to find all PCI devices that have an IRQ and change their smp_affinity_list to match their NUMA node.
To get the NUMA nodes and their cores, we can run lscpu | grep NUMA, which returns:

    NUMA node(s): 2

followed by one NUMA nodeX CPU(s) line per node listing the cores that belong to it.
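To extract just the CPU list of one node in a script (node 0 here, purely as an illustration), something like this works:

    # Print the cores of NUMA node 0, e.g. "0-15" or "0-15,32-47"
    lscpu | grep "NUMA node0 CPU(s):" | awk -F: '{print $2}' | tr -d ' '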
And for each PCI device we can see its NUMA node like this:
    cat /sys/bus/pci/devices/<DEVICE_ID>/numa_node
which can return two types of values:
- -1: the device is not associated with any specific NUMA node
- any other number, which is the NUMA node the device is attached to
Also, some PCI devices do not have any IRQ line, so we should ignore them.
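The IRQ line of a PCI device is also exposed in sysfs (same <DEVICE_ID> placeholder as above); a value of 0 there typically means the device has no IRQ line assigned:

    cat /sys/bus/pci/devices/<DEVICE_ID>/irq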
I’ve made a small script to see all devices with an IRQ and their NUMA node.
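A minimal sketch of what such a script can look like, assuming plain POSIX shell and the sysfs paths used above:

    #!/bin/sh
    # For every PCI device, print its NUMA node and IRQ line.
    # Devices that report irq 0 have no IRQ line and can be ignored later.
    for dev in /sys/bus/pci/devices/*; do
        id=$(basename "$dev")
        node=$(cat "$dev/numa_node")
        irq=$(cat "$dev/irq")
        echo "PCI device with id $id is located on numa node $node and use irq line $irq"
    done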
With the following type of output:
    PCI device with id 0000:00:00.0 is located on numa node 0 and use irq line 0
Now we can remediate by pinning each IRQ to its device's NUMA node. We can adapt the previous script to also make the changes.
N.B.: we can't change the affinity of IRQ line 0.
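A sketch of the adapted version, under the same assumptions plus the lscpu output format shown earlier:

    #!/bin/sh
    # For every PCI device with an IRQ and a known NUMA node,
    # pin its IRQ to the cores of that node.
    for dev in /sys/bus/pci/devices/*; do
        irq=$(cat "$dev/irq")
        node=$(cat "$dev/numa_node")
        # Skip devices without an IRQ line (and IRQ 0, whose affinity cannot be changed)
        [ "$irq" -eq 0 ] && continue
        # Skip devices not attached to a specific NUMA node
        [ "$node" -lt 0 ] && continue
        # Some IRQ numbers have no entry under /proc/irq
        [ -d "/proc/irq/$irq" ] || continue
        # Cores of the device's NUMA node, e.g. "0-15" or "0-15,32-47"
        cpus=$(lscpu | grep "NUMA node${node} CPU(s):" | awk -F: '{print $2}' | tr -d ' ')
        [ -n "$cpus" ] || continue
        if echo "$cpus" > "/proc/irq/$irq/smp_affinity_list" 2>/dev/null; then
            echo "Pinned IRQ $irq ($(basename "$dev")) to NUMA node $node (cores $cpus)"
        else
            echo "Could not change affinity of IRQ $irq"
        fi
    done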
Remember to disable irqbalance (if it is installed and enabled) before performing this kind of operation, to avoid automatic rebalancing undoing your changes.
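On a systemd-based distribution where irqbalance runs as a service, that typically means:

    systemctl stop irqbalance
    systemctl disable irqbalance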
This kind of change can improve your network throughput and latency (for interfaces with bandwidth >= 10 Gbps) and local storage operations (NVMe).