A nasty surprise
Recently I had a problem with a bond on a hypervisor: bandwidth was half its usual level.
Both links were UP, and so was the bond, with full duplex and the correct link speed.
When I SSHed to the hypervisor, I quickly saw that something was wrong.
```shell
cat /proc/net/bonding/bond0
```
At that point I had never seen a bond in the churned state, so before simply restarting the bonding I did some investigation to understand exactly what it means.
How does LACP work?
LACP (802.3ad) allows creating a logical link out of multiple physical links, with different policies (active/active, active/passive, etc.).
To initiate communication, each member interface of the LACP group emits LACPDU packets with an EtherType value that identifies the payload as an LACPDU. The payload usually contains this information:
- Priority
- MAC address
- Port priority
- Port number
- State
When the device on the other side receives an LACPDU packet, the two devices try to establish the connection. The status switches to monitoring for a maximum of 60 seconds.
After this period, if the interface cannot negotiate, the port state switches from monitoring to churned: no data passes through the link, yet the interface status still looks fine to many monitoring systems.
Recent Linux kernel versions also display the partner churn state and counter.
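On such kernels you can pull the churn fields straight out of the proc file. A minimal sketch (the sample lines in the comments are illustrative, not from my hypervisor):

```shell
# Show the churn state and counters for bond0, if the bond exists.
bond_file=/proc/net/bonding/bond0
if [ -r "$bond_file" ]; then
  grep -E 'Churn' "$bond_file"
fi
# On a broken bond you would see lines such as:
#   Actor Churn State: churned
#   Partner Churn State: churned
```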
NB: LACPDU packets are emitted every 30 seconds (the default slow rate) to check LACP status.
Remedial measures
You can run standard link diagnostics. Usually, when an interface switches to the churned state, the link failure count has a high value.
The churned status can also appear when two links (or more) are plugged into different LACP groups: one link comes up while the other cannot join the LACP group.
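As a sketch, the usual checks look something like this (the interface and bond names are examples, not from my setup):

```shell
# Physical layer first: speed, duplex and carrier on each member link
# (requires ethtool, usually run as root).
if command -v ethtool >/dev/null; then
  ethtool eth0 | grep -E 'Speed|Duplex|Link detected'
fi

# Then per-slave status in the bonding driver: a high "Link Failure Count"
# on one slave points at the flapping member.
if [ -r /proc/net/bonding/bond0 ]; then
  grep -E 'Slave Interface|MII Status|Link Failure Count' /proc/net/bonding/bond0
fi
```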
Improve your observability
You can monitor bond status by reading /proc/net/bonding/<bond_name>.
In my case, the solution was to create a simple bash script that writes the MII status into a flat file read by node_exporter.
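Such a script could look roughly like this. It is a sketch, not my original script: the paths, the metric name, and the textfile-collector directory are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: export the bond's MII status in a format the
# node_exporter textfile collector understands. Paths and metric name
# are assumptions, not from the original setup.

write_bond_metric() {
  local bond_file="$1" out="$2"
  local status value

  # The first "MII Status:" line in the proc file is the bond's own status;
  # the following ones belong to the slave interfaces.
  status=$(awk '/^MII Status:/ {print $3; exit}' "$bond_file")

  value=0
  if [ "$status" = "up" ]; then value=1; fi

  # Write atomically so node_exporter never reads a half-written file.
  {
    echo '# HELP bond_mii_status MII status of the bond (1 = up, 0 = down)'
    echo '# TYPE bond_mii_status gauge'
    echo "bond_mii_status{bond=\"$(basename "$bond_file")\"} ${value}"
  } > "${out}.tmp" && mv "${out}.tmp" "$out"
}

# Run from cron or a systemd timer, e.g. every minute:
if [ -r /proc/net/bonding/bond0 ]; then
  write_bond_metric /proc/net/bonding/bond0 \
    /var/lib/node_exporter/textfile/bonding.prom
fi
```

The atomic rename matters because node_exporter scrapes the textfile directory on its own schedule and would otherwise occasionally read a truncated file.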