A nasty surprise
Recently I had a problem with a bond on a hypervisor: bandwidth was half its usual level.
Both links were UP, and so was the bond, with full duplex and the correct link speed.
When I SSHed to the hypervisor, I quickly saw that something was wrong.
```shell
cat /proc/net/bonding/bond0
```
At that point I had never seen a bond in the churned state, so before simply restarting the bonding I did some investigation to understand exactly what it means.
How does LACP work?
LACP (802.3ad) allows creating a logical link out of multiple physical links, with different policies (active/active, active/passive, etc.).
To initiate communication, each member interface of the LACP group emits LACPDU packets with an EtherType value that identifies the payload as an LACPDU. The payload usually contains this information:
- Priority
- MAC address
- Port priority
- Port number
- State
When the device on the other side receives an LACPDU packet, the two devices try to establish the connection. The status switches to monitoring for a maximum of 60 seconds.
After this period, if the interface cannot negotiate, the port state switches from monitoring to churned: no data passes through the link, yet the interface status still looks fine to many monitoring systems.
Recent Linux kernel versions also display the partner churn state and counter.
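On such kernels you can pull the churn fields straight out of the proc file. A minimal sketch (the sample lines in the comments are illustrative, not from my hypervisor):

```shell
# Show the churn state and counters for bond0, if the bond exists.
bond_file=/proc/net/bonding/bond0
if [ -r "$bond_file" ]; then
  grep -E 'Churn' "$bond_file"
fi
# On a broken bond you would see lines such as:
#   Actor Churn State: churned
#   Partner Churn State: churned
```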
NB: LACPDU packets are emitted every 30 seconds (the default slow rate) to check LACP status.
Remedial measures
You can run standard link diagnostics. Usually, when an interface switches to the churned state, the link failure count has a high value.
The churned status can also appear when two links (or more) are plugged into different LACP groups: one link comes up while the other cannot join the LACP group.
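As a sketch, the usual checks look something like this (the interface and bond names are examples, not from my setup):

```shell
# Physical layer first: speed, duplex and carrier on each member link
# (requires ethtool, usually run as root).
if command -v ethtool >/dev/null; then
  ethtool eth0 | grep -E 'Speed|Duplex|Link detected'
fi

# Then per-slave status in the bonding driver: a high "Link Failure Count"
# on one slave points at the flapping member.
if [ -r /proc/net/bonding/bond0 ]; then
  grep -E 'Slave Interface|MII Status|Link Failure Count' /proc/net/bonding/bond0
fi
```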
Improve your observability
You can monitor bond status by reading /proc/net/bonding/<bond_name>.
In my case, the solution was to create a simple bash script that writes the MII status into a flat file read by node_exporter.
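Such a script could look roughly like this. It is a sketch, not my original script: the paths, the metric name, and the textfile-collector directory are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: export the bond's MII status in a format the
# node_exporter textfile collector understands. Paths and metric name
# are assumptions, not from the original setup.

write_bond_metric() {
  local bond_file="$1" out="$2"
  local status value

  # The first "MII Status:" line in the proc file is the bond's own status;
  # the following ones belong to the slave interfaces.
  status=$(awk '/^MII Status:/ {print $3; exit}' "$bond_file")

  value=0
  if [ "$status" = "up" ]; then value=1; fi

  # Write atomically so node_exporter never reads a half-written file.
  {
    echo '# HELP bond_mii_status MII status of the bond (1 = up, 0 = down)'
    echo '# TYPE bond_mii_status gauge'
    echo "bond_mii_status{bond=\"$(basename "$bond_file")\"} ${value}"
  } > "${out}.tmp" && mv "${out}.tmp" "$out"
}

# Run from cron or a systemd timer, e.g. every minute:
if [ -r /proc/net/bonding/bond0 ]; then
  write_bond_metric /proc/net/bonding/bond0 \
    /var/lib/node_exporter/textfile/bonding.prom
fi
```

The atomic rename matters because node_exporter scrapes the textfile directory on its own schedule and would otherwise occasionally read a truncated file.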