Discussion:
bonding + arp monitoring fails if interface is a vlan
Santiago Garcia Mantinan
2013-08-01 12:11:42 UTC
Hi!

I'm trying to set up a bond over a couple of VLANs; these VLANs are different
paths to an upstream switch from a local switch. I want to use ARP
monitoring so that the bonding interface knows which path is OK and which
one is broken. If I set it up using ARP monitoring without VLANs it works,
and it also works if I set it up using VLANs without ARP monitoring, so the
broken combination seems to be bonding + ARP monitoring + VLANs. Here is a
schema:

 -------------
|Remote Switch|
 -------------
   |       |
   P       P
   A       A
   T       T
   H       H
   1       2
   |       |
 ------------
|Local switch|
 ------------
      |
      | VLAN for PATH1
      | VLAN for PATH2
      |
 Linux machine

The broken setup seems to work, but ARP monitoring makes it lose the logical
link from time to time, so it switches to the other slave if one is
available. What I saw when monitoring this with tcpdump is that all the ARP
requests were going out and all the replies were coming in, so according to
the traffic seen in tcpdump the link should have been stable. However,
/proc/net/bonding/bond0 showed the link failure count increasing, and when
testing with just one VLAN interface I was losing ping whenever the link
went down.
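
One way to watch the failure counter mentioned above is to pull the per-slave
"Link Failure Count" lines out of the bonding status file. This is a minimal
shell sketch, assuming the usual "Slave Interface:" / "Link Failure Count:"
layout of /proc/net/bonding/bond0; the function name bond_link_failures is
made up for illustration.

```shell
#!/bin/sh
# Print "<slave> <link failure count>" for every slave of a bond.
# Reads the status file given as $1 (default: /proc/net/bonding/bond0).
# bond_link_failures is an illustrative name, not an existing tool.
bond_link_failures() {
    awk -F': ' '
        /^Slave Interface:/    { slave = $2 }
        /^Link Failure Count:/ { print slave, $2 }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Calling bond_link_failures every few seconds while tcpdump is running shows
whether the counter keeps growing even though the ARP replies are arriving.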

I've tried this on Debian wheezy with its 3.2.46 kernel and also with the
3.10.3 version in unstable. The tests were done on a couple of machines
using a 32-bit kernel with different NICs (r8169 and skge).

I created a small lab to replicate the problem. In this setup I avoided all
the switching and connected the bonding machine directly to another Linux
machine on which I just had eth0.1002 configured with IP 192.168.1.1. The
results were the same as in the full scenario: the link on the bonding slave
went down from time to time.

This is the setup on the bonding interface.

auto bond0
iface bond0 inet static
address 192.168.1.2
netmask 255.255.255.0
bond-slaves eth0.1002
bond-mode active-backup
bond-arp_validate 0
bond-arp_interval 5000
bond-arp_ip_target 192.168.1.1
pre-up ip link set eth0 up || true
pre-up ip link add link eth0 name eth0.1002 type vlan id 1002 || true
down ip link delete eth0.1002 || true
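
To confirm that the options from this stanza actually reached the driver, the
values can be read back from the bond's sysfs directory. This is a small
sketch, assuming the standard /sys/class/net/<bond>/bonding/ files; the
function name show_bond_params and its optional second argument are
illustrative only.

```shell
#!/bin/sh
# Print the ARP-monitoring parameters the bonding driver is currently using.
# $1 is the bond name (default: bond0); $2 is the sysfs net directory
# (default: /sys/class/net). show_bond_params is an illustrative name.
show_bond_params() {
    dir="${2:-/sys/class/net}/${1:-bond0}/bonding"
    for p in mode arp_interval arp_ip_target arp_validate active_slave; do
        # Skip files that are missing or unreadable on this kernel.
        [ -r "$dir/$p" ] && printf '%s: %s\n' "$p" "$(cat "$dir/$p")"
    done
    return 0
}
```

Running show_bond_params bond0 after ifup gives one line per parameter, which
can be compared against the bond-* options above.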

These are the messages I was seeing on the bonding machine:

[ 452.436750] bonding: bond0: adding ARP target 192.168.1.1.
[ 452.436851] bonding: bond0: Setting ARP monitoring interval to 5000.
[ 452.440287] bonding: bond0: setting mode to active-backup (1).
[ 452.440429] bonding: bond0: setting arp_validate to none (0).
[ 452.458349] bonding: bond0: Adding slave eth0.1002.
[ 452.458964] bonding: bond0: making interface eth0.1002 the new active one.
[ 452.458983] bonding: bond0: first active interface up!
[ 452.458999] bonding: bond0: enslaving eth0.1002 as an active interface with an up link.
[ 452.482560] 8021q: adding VLAN 0 to HW filter on device bond0
[ 467.500143] bonding: bond0: link status definitely down for interface eth0.1002, disabling it
[ 467.500193] bonding: bond0: now running without any active interface !
[ 622.748102] bonding: bond0: link status definitely up for interface eth0.1002.
[ 622.748122] bonding: bond0: making interface eth0.1002 the new active one.
[ 622.748522] bonding: bond0: first active interface up!
[ 637.772179] bonding: bond0: link status definitely down for interface eth0.1002, disabling it
[ 637.772228] bonding: bond0: now running without any active interface !
[ 642.780173] bonding: bond0: link status definitely up for interface eth0.1002.
[ 642.780192] bonding: bond0: making interface eth0.1002 the new active one.
[ 642.780603] bonding: bond0: first active interface up!
[ 657.804154] bonding: bond0: link status definitely down for interface eth0.1002, disabling it
[ 657.804209] bonding: bond0: now running without any active interface !
[ 662.812165] bonding: bond0: link status definitely up for interface eth0.1002.
[ 662.812185] bonding: bond0: making interface eth0.1002 the new active one.
[ 662.812592] bonding: bond0: first active interface up!
[ 677.836167] bonding: bond0: link status definitely down for interface eth0.1002, disabling it
[ 677.836223] bonding: bond0: now running without any active interface !
[ 682.844162] bonding: bond0: link status definitely up for interface eth0.1002.
[ 682.844181] bonding: bond0: making interface eth0.1002 the new active one.
[ 682.844590] bonding: bond0: first active interface up!
[ 697.868153] bonding: bond0: link status definitely down for interface eth0.1002, disabling it

As I said, tcpdump on both Linux machines shows everything is fine: all ARP
requests and replies are there, yet the link goes down from time to time. In
this setup the bond is built with just one slave, so the network is lost
whenever the link goes down.

Some questions:

Am I doing something wrong here?
Is this setup not supported?
If it should work... can anybody reproduce this?
Bug?

What should I do now?

Regards...
--
Manty/BestiaTester -> http://manty.net
Erik Hugne
2013-08-01 13:00:58 UTC
Post by Santiago Garcia Mantinan
Hi!
I'm trying to set up a bond over a couple of VLANs; these VLANs are different
paths to an upstream switch from a local switch. I want to use ARP
monitoring so that the bonding interface knows which path is OK and which
one is broken. If I set it up using ARP monitoring without VLANs it works,
and it also works if I set it up using VLANs without ARP monitoring, so the
broken combination seems to be bonding + ARP monitoring + VLANs.
This has helped me troubleshoot various bonding problems in the past:
mount -t debugfs none /sys/kernel/debug/
ln -s /sys/kernel/debug /debug
echo -n 'module bonding +p' > /debug/dynamic_debug/control
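
The same control file also accepts '-p' to switch the messages off again once
the trace has been captured. This is a minimal sketch, assuming debugfs is
mounted at /sys/kernel/debug (so the /debug symlink is not needed) and that
the kernel was built with dynamic debug support; bond_debug is an
illustrative helper name.

```shell
#!/bin/sh
# Toggle the bonding module's debug messages through dynamic debug.
# $1 is "on" or "off"; $2 is the control file (defaults to the standard
# debugfs location, which needs debugfs mounted and root privileges).
# bond_debug is an illustrative helper name.
bond_debug() {
    control="${2:-/sys/kernel/debug/dynamic_debug/control}"
    case "$1" in
        on)  flag='+p' ;;
        off) flag='-p' ;;
        *)   echo "usage: bond_debug on|off [control-file]" >&2; return 2 ;;
    esac
    # Same query string as above, written without a trailing newline.
    printf '%s' "module bonding $flag" > "$control"
}
```

For example, bond_debug on before reproducing the problem and bond_debug off
afterwards keeps the kernel log from filling up.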
Post by Santiago Garcia Mantinan
The broken setup seems to work, but ARP monitoring makes it lose the logical
link from time to time, so it switches to the other slave if one is
available. What I saw when monitoring this with tcpdump is that all the ARP
requests were going out and all the replies were coming in, so according to
the traffic seen in tcpdump the link should have been stable. However,
/proc/net/bonding/bond0 showed the link failure count increasing, and when
testing with just one VLAN interface I was losing ping whenever the link
went down.
Did you sniff externally, on the native device, bond slaves or on bond0?

//E
Santiago Garcia Mantinan
2013-08-02 07:26:46 UTC
Post by Erik Hugne
mount -t debugfs none /sys/kernel/debug/
ln -s /sys/kernel/debug /debug
echo -n 'module bonding +p' > /debug/dynamic_debug/control
I'm compiling a 3.11-rc3 version with dynamic_debug enabled in order
to be able to test this.
Post by Erik Hugne
Did you sniff externally, on the native device, bond slaves or on bond0?
The sniffing was done on both the bonding host (eth0 device) and the
remote host, the one with just the vlan.

Regards.
--
Manty/BestiaTester -> http://manty.net
Santiago Garcia Mantinan
2013-08-02 09:33:26 UTC
Post by Santiago Garcia Mantinan
Post by Erik Hugne
mount -t debugfs none /sys/kernel/debug/
ln -s /sys/kernel/debug /debug
echo -n 'module bonding +p' > /debug/dynamic_debug/control
I'm compiling a 3.11-rc3 version with dynamic_debug enabled in order
to be able to test this.
Done. Running 3.11-rc3 with this debug output enabled, I consistently see
one out of every four ARP probes fail: three succeed and then one fails,
then again three succeed and one fails. I got the same 1/4 ratio with
arp_interval set to 2 seconds and to 5 seconds.
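
The 1/4 ratio can be checked mechanically by counting probe and failure
messages in the kernel log. This is a rough sketch that matches on the
"arp ... on slave" and "link status definitely down" messages printed by the
bonding driver with debugging enabled; arp_probe_stats is an illustrative
name, and the counts are only as good as the log excerpt fed to it.

```shell
#!/bin/sh
# Count ARP probes and "definitely down" events in a bonding debug log
# read from stdin. arp_probe_stats is an illustrative name.
arp_probe_stats() {
    awk '
        /bonding: arp [0-9]+ on slave/ { probes++ }
        /link status definitely down/  { down++ }
        END { printf "probes=%d down=%d\n", probes, down }
    '
}
```

A typical call would be dmesg | arp_probe_stats; a down count close to a
quarter of the probe count matches the pattern described here.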

I'm pasting the debug output here in case it helps somebody else find out
what is going on; it hasn't helped me at all :-(

11:16:29 [ 1510.493414] bonding: event_dev: eth0, event: 15
11:16:29 [ 1510.493459] bonding: event_dev: eth0.1002, event: 10
11:16:29 [ 1510.493979] bonding: event_dev: eth0.1002, event: 5
11:16:29 [ 1510.526266] bonding: bond0: adding ARP target 192.168.1.1.
11:16:29 [ 1510.526391] bonding: bond0: Setting ARP monitoring interval to 2000.
11:16:29 [ 1510.530093] bonding: bond0: setting mode to active-backup (1).
11:16:29 [ 1510.530217] bonding: bond0: setting arp_validate to none (0).
11:16:29 [ 1510.537106] bonding: bond0: Adding slave eth0.1002.
11:16:29 [ 1510.537119] bonding: eth0.1002: ! NETIF_F_VLAN_CHALLENGED
11:16:29 [ 1510.537136] bonding: event_dev: eth0.1002, event: 14
11:16:29 [ 1510.537147] bonding: bond_dev=f5581000 slave_dev=f4dcc000
slave_dev->addr_len=6
11:16:29 [ 1510.537206] bonding: event_dev: bond0, event: 8
11:16:29 [ 1510.537213] bonding: IFF_MASTER
11:16:29 [ 1510.537244] bonding: event_dev: eth0.1002, event: 8
11:16:29 [ 1510.537283] bonding: event_dev: eth0.1002, event: 15
11:16:29 [ 1510.537308] bonding: event_dev: eth0.1002, event: d
11:16:29 [ 1510.537389] bonding: event_dev: eth0.1002, event: 1
11:16:29 [ 1510.537424] bonding: event_dev: bond0, event: b
11:16:29 [ 1510.537430] bonding: IFF_MASTER
11:16:29 [ 1510.537945] bonding: Initial state of slave_dev is BOND_LINK_UP
11:16:29 [ 1510.537955] bonding: bond0: making interface eth0.1002 the
new active one.
11:16:29 [ 1510.538032] bonding: event_dev: bond0, event: c
11:16:29 [ 1510.538039] bonding: IFF_MASTER
11:16:29 [ 1510.538051] bonding: bond0: first active interface up!
11:16:29 [ 1510.538073] bonding: bond0: enslaving eth0.1002 as an
active interface with an up link.
11:16:29 [ 1510.565482] bonding: event_dev: bond0, event: d
11:16:29 [ 1510.565490] bonding: IFF_MASTER
11:16:29 [ 1510.565564] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:29 [ 1510.565572] bonding: basa: target 192.168.1.1
11:16:29 [ 1510.565577] bonding: basa: empty vlan: arp_send
11:16:29 [ 1510.565587] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:29 [ 1510.565789] 8021q: adding VLAN 0 to HW filter on device bond0
11:16:29 [ 1510.565798] bonding: bond: bond0, vlan id 0
11:16:29 [ 1510.565803] bonding: added VLAN ID 0 on bond bond0
11:16:29 [ 1510.565873] bonding: event_dev: bond0, event: 1
11:16:29 [ 1510.565876] bonding: IFF_MASTER
11:16:30 [ 1511.492281] bonding: event_dev: bond0, event: 4
11:16:30 [ 1511.492293] bonding: IFF_MASTER
11:16:31 [ 1512.568138] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:31 [ 1512.568154] bonding: basa: target 192.168.1.1
11:16:31 [ 1512.568174] bonding: basa: rtdev == bond->dev: arp_send
11:16:31 [ 1512.568189] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:33 [ 1514.572184] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:33 [ 1514.572203] bonding: basa: target 192.168.1.1
11:16:33 [ 1514.572224] bonding: basa: rtdev == bond->dev: arp_send
11:16:33 [ 1514.572238] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:35 [ 1516.576150] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:35 [ 1516.576171] bonding: bond0: link status definitely down
for interface eth0.1002, disabling it
11:16:35 [ 1516.576225] bonding: bond0: now running without any active
interface !
11:16:35 [ 1516.576237] bonding: basa: target 192.168.1.1
11:16:35 [ 1516.576254] bonding: basa: rtdev == bond->dev: arp_send
11:16:35 [ 1516.576266] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:35 [ 1516.576512] bonding: event_dev: bond0, event: 4
11:16:35 [ 1516.576520] bonding: IFF_MASTER
11:16:37 [ 1518.580153] bonding: bond_should_notify_peers: bond bond0 slave NULL
11:16:37 [ 1518.580174] bonding: bond0: link status definitely up for
interface eth0.1002.
11:16:37 [ 1518.580187] bonding: bond0: making interface eth0.1002 the
new active one.
11:16:37 [ 1518.580232] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:37 [ 1518.580437] bonding: event_dev: bond0, event: c
11:16:37 [ 1518.580444] bonding: IFF_MASTER
11:16:37 [ 1518.580621] bonding: event_dev: bond0, event: 13
11:16:37 [ 1518.580629] bonding: IFF_MASTER
11:16:37 [ 1518.580658] bonding: bond0: first active interface up!
11:16:37 [ 1518.580673] bonding: basa: target 192.168.1.1
11:16:37 [ 1518.580696] bonding: basa: rtdev == bond->dev: arp_send
11:16:37 [ 1518.580715] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:37 [ 1518.580920] bonding: event_dev: bond0, event: 4
11:16:37 [ 1518.580927] bonding: IFF_MASTER
11:16:39 [ 1520.584150] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:39 [ 1520.584170] bonding: basa: target 192.168.1.1
11:16:39 [ 1520.584195] bonding: basa: rtdev == bond->dev: arp_send
11:16:39 [ 1520.584209] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:41 [ 1522.588153] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:41 [ 1522.588174] bonding: basa: target 192.168.1.1
11:16:41 [ 1522.588198] bonding: basa: rtdev == bond->dev: arp_send
11:16:41 [ 1522.588211] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:43 [ 1524.592161] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:43 [ 1524.592182] bonding: bond0: link status definitely down
for interface eth0.1002, disabling it
11:16:43 [ 1524.592227] bonding: bond0: now running without any active
interface !
11:16:43 [ 1524.592238] bonding: basa: target 192.168.1.1
11:16:43 [ 1524.592253] bonding: basa: rtdev == bond->dev: arp_send
11:16:43 [ 1524.592267] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:43 [ 1524.592507] bonding: event_dev: bond0, event: 4
11:16:43 [ 1524.592516] bonding: IFF_MASTER
11:16:45 [ 1526.596178] bonding: bond_should_notify_peers: bond bond0 slave NULL
11:16:45 [ 1526.596199] bonding: bond0: link status definitely up for
interface eth0.1002.
11:16:45 [ 1526.596212] bonding: bond0: making interface eth0.1002 the
new active one.
11:16:45 [ 1526.596249] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:45 [ 1526.596449] bonding: event_dev: bond0, event: c
11:16:45 [ 1526.596457] bonding: IFF_MASTER
11:16:45 [ 1526.596638] bonding: event_dev: bond0, event: 13
11:16:45 [ 1526.596646] bonding: IFF_MASTER
11:16:45 [ 1526.596676] bonding: bond0: first active interface up!
11:16:45 [ 1526.596691] bonding: basa: target 192.168.1.1
11:16:45 [ 1526.596713] bonding: basa: rtdev == bond->dev: arp_send
11:16:45 [ 1526.596730] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:45 [ 1526.596939] bonding: event_dev: bond0, event: 4
11:16:45 [ 1526.596947] bonding: IFF_MASTER
11:16:47 [ 1528.600160] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:47 [ 1528.600181] bonding: basa: target 192.168.1.1
11:16:47 [ 1528.600207] bonding: basa: rtdev == bond->dev: arp_send
11:16:47 [ 1528.600220] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:49 [ 1530.604152] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:49 [ 1530.604172] bonding: basa: target 192.168.1.1
11:16:49 [ 1530.604196] bonding: basa: rtdev == bond->dev: arp_send
11:16:49 [ 1530.604210] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:51 [ 1532.608155] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:51 [ 1532.608176] bonding: bond0: link status definitely down
for interface eth0.1002, disabling it
11:16:51 [ 1532.608232] bonding: bond0: now running without any active
interface !
11:16:51 [ 1532.608244] bonding: basa: target 192.168.1.1
11:16:51 [ 1532.608259] bonding: basa: rtdev == bond->dev: arp_send
11:16:51 [ 1532.608272] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:51 [ 1532.608505] bonding: event_dev: bond0, event: 4
11:16:51 [ 1532.608513] bonding: IFF_MASTER
11:16:53 [ 1534.612150] bonding: bond_should_notify_peers: bond bond0 slave NULL
11:16:53 [ 1534.612170] bonding: bond0: link status definitely up for
interface eth0.1002.
11:16:53 [ 1534.612183] bonding: bond0: making interface eth0.1002 the
new active one.
11:16:53 [ 1534.612220] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:53 [ 1534.612419] bonding: event_dev: bond0, event: c
11:16:53 [ 1534.612427] bonding: IFF_MASTER
11:16:53 [ 1534.612606] bonding: event_dev: bond0, event: 13
11:16:53 [ 1534.612614] bonding: IFF_MASTER
11:16:53 [ 1534.612643] bonding: bond0: first active interface up!
11:16:53 [ 1534.612658] bonding: basa: target 192.168.1.1
11:16:53 [ 1534.612676] bonding: basa: rtdev == bond->dev: arp_send
11:16:53 [ 1534.612691] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:53 [ 1534.612934] bonding: event_dev: bond0, event: 4
11:16:53 [ 1534.612942] bonding: IFF_MASTER
11:16:55 [ 1536.616158] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:55 [ 1536.616180] bonding: basa: target 192.168.1.1
11:16:55 [ 1536.616204] bonding: basa: rtdev == bond->dev: arp_send
11:16:55 [ 1536.616218] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:57 [ 1538.620145] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:57 [ 1538.620166] bonding: basa: target 192.168.1.1
11:16:57 [ 1538.620179] bonding: basa: rtdev == bond->dev: arp_send
11:16:57 [ 1538.620192] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:59 [ 1540.624155] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:16:59 [ 1540.624176] bonding: bond0: link status definitely down
for interface eth0.1002, disabling it
11:16:59 [ 1540.624227] bonding: bond0: now running without any active
interface !
11:16:59 [ 1540.624238] bonding: basa: target 192.168.1.1
11:16:59 [ 1540.624253] bonding: basa: rtdev == bond->dev: arp_send
11:16:59 [ 1540.624266] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:16:59 [ 1540.624501] bonding: event_dev: bond0, event: 4
11:16:59 [ 1540.624509] bonding: IFF_MASTER
11:17:01 [ 1542.628132] bonding: bond_should_notify_peers: bond bond0 slave NULL
11:17:01 [ 1542.628148] bonding: bond0: link status definitely up for
interface eth0.1002.
11:17:01 [ 1542.628158] bonding: bond0: making interface eth0.1002 the
new active one.
11:17:01 [ 1542.628198] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:01 [ 1542.628396] bonding: event_dev: bond0, event: c
11:17:01 [ 1542.628404] bonding: IFF_MASTER
11:17:01 [ 1542.628581] bonding: event_dev: bond0, event: 13
11:17:01 [ 1542.628589] bonding: IFF_MASTER
11:17:01 [ 1542.628619] bonding: bond0: first active interface up!
11:17:01 [ 1542.628634] bonding: basa: target 192.168.1.1
11:17:01 [ 1542.628657] bonding: basa: rtdev == bond->dev: arp_send
11:17:01 [ 1542.628675] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:01 [ 1542.628846] bonding: event_dev: bond0, event: 4
11:17:01 [ 1542.628856] bonding: IFF_MASTER
11:17:03 [ 1544.632138] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:03 [ 1544.632154] bonding: basa: target 192.168.1.1
11:17:03 [ 1544.632177] bonding: basa: rtdev == bond->dev: arp_send
11:17:03 [ 1544.632190] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:05 [ 1546.636147] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:05 [ 1546.636167] bonding: basa: target 192.168.1.1
11:17:05 [ 1546.636186] bonding: basa: rtdev == bond->dev: arp_send
11:17:05 [ 1546.636201] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:07 [ 1548.640129] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:07 [ 1548.640150] bonding: bond0: link status definitely down
for interface eth0.1002, disabling it
11:17:07 [ 1548.640208] bonding: bond0: now running without any active
interface !
11:17:07 [ 1548.640221] bonding: basa: target 192.168.1.1
11:17:07 [ 1548.640235] bonding: basa: rtdev == bond->dev: arp_send
11:17:07 [ 1548.640249] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:07 [ 1548.640524] bonding: event_dev: bond0, event: 4
11:17:07 [ 1548.640532] bonding: IFF_MASTER
11:17:09 [ 1550.644152] bonding: bond_should_notify_peers: bond bond0 slave NULL
11:17:09 [ 1550.644172] bonding: bond0: link status definitely up for
interface eth0.1002.
11:17:09 [ 1550.644184] bonding: bond0: making interface eth0.1002 the
new active one.
11:17:09 [ 1550.644226] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:09 [ 1550.644421] bonding: event_dev: bond0, event: c
11:17:09 [ 1550.644429] bonding: IFF_MASTER
11:17:09 [ 1550.644608] bonding: event_dev: bond0, event: 13
11:17:09 [ 1550.644616] bonding: IFF_MASTER
11:17:09 [ 1550.644645] bonding: bond0: first active interface up!
11:17:09 [ 1550.644659] bonding: basa: target 192.168.1.1
11:17:09 [ 1550.644683] bonding: basa: rtdev == bond->dev: arp_send
11:17:09 [ 1550.644697] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:09 [ 1550.644866] bonding: event_dev: bond0, event: 4
11:17:09 [ 1550.644875] bonding: IFF_MASTER
11:17:11 [ 1552.648128] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:11 [ 1552.648144] bonding: basa: target 192.168.1.1
11:17:11 [ 1552.648167] bonding: basa: rtdev == bond->dev: arp_send
11:17:11 [ 1552.648180] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:13 [ 1554.652132] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:13 [ 1554.652149] bonding: basa: target 192.168.1.1
11:17:13 [ 1554.652166] bonding: basa: rtdev == bond->dev: arp_send
11:17:13 [ 1554.652179] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:15 [ 1556.656131] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:15 [ 1556.656147] bonding: bond0: link status definitely down
for interface eth0.1002, disabling it
11:17:15 [ 1556.656198] bonding: bond0: now running without any active
interface !
11:17:15 [ 1556.656210] bonding: basa: target 192.168.1.1
11:17:15 [ 1556.656224] bonding: basa: rtdev == bond->dev: arp_send
11:17:15 [ 1556.656238] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:15 [ 1556.656488] bonding: event_dev: bond0, event: 4
11:17:15 [ 1556.656497] bonding: IFF_MASTER
11:17:17 [ 1558.660147] bonding: bond_should_notify_peers: bond bond0 slave NULL
11:17:17 [ 1558.660168] bonding: bond0: link status definitely up for
interface eth0.1002.
11:17:17 [ 1558.660180] bonding: bond0: making interface eth0.1002 the
new active one.
11:17:17 [ 1558.660221] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:17 [ 1558.660424] bonding: event_dev: bond0, event: c
11:17:17 [ 1558.660432] bonding: IFF_MASTER
11:17:17 [ 1558.660611] bonding: event_dev: bond0, event: 13
11:17:17 [ 1558.660619] bonding: IFF_MASTER
11:17:17 [ 1558.660649] bonding: bond0: first active interface up!
11:17:17 [ 1558.660665] bonding: basa: target 192.168.1.1
11:17:17 [ 1558.660683] bonding: basa: rtdev == bond->dev: arp_send
11:17:17 [ 1558.660699] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:17 [ 1558.660905] bonding: event_dev: bond0, event: 4
11:17:17 [ 1558.660913] bonding: IFF_MASTER
11:17:19 [ 1560.664082] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:19 [ 1560.664103] bonding: basa: target 192.168.1.1
11:17:19 [ 1560.664128] bonding: basa: rtdev == bond->dev: arp_send
11:17:19 [ 1560.664141] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:21 [ 1562.668149] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:21 [ 1562.668169] bonding: basa: target 192.168.1.1
11:17:21 [ 1562.668189] bonding: basa: rtdev == bond->dev: arp_send
11:17:21 [ 1562.668202] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:23 [ 1564.672153] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:23 [ 1564.672174] bonding: bond0: link status definitely down
for interface eth0.1002, disabling it
11:17:23 [ 1564.672230] bonding: bond0: now running without any active
interface !
11:17:23 [ 1564.672243] bonding: basa: target 192.168.1.1
11:17:23 [ 1564.672258] bonding: basa: rtdev == bond->dev: arp_send
11:17:23 [ 1564.672271] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:23 [ 1564.672512] bonding: event_dev: bond0, event: 4
11:17:23 [ 1564.672520] bonding: IFF_MASTER
11:17:25 [ 1566.676149] bonding: bond_should_notify_peers: bond bond0 slave NULL
11:17:25 [ 1566.676170] bonding: bond0: link status definitely up for
interface eth0.1002.
11:17:25 [ 1566.676182] bonding: bond0: making interface eth0.1002 the
new active one.
11:17:25 [ 1566.676222] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:25 [ 1566.676422] bonding: event_dev: bond0, event: c
11:17:25 [ 1566.676431] bonding: IFF_MASTER
11:17:25 [ 1566.676610] bonding: event_dev: bond0, event: 13
11:17:25 [ 1566.676617] bonding: IFF_MASTER
11:17:25 [ 1566.676646] bonding: bond0: first active interface up!
11:17:25 [ 1566.676660] bonding: basa: target 192.168.1.1
11:17:25 [ 1566.676678] bonding: basa: rtdev == bond->dev: arp_send
11:17:25 [ 1566.676695] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:25 [ 1566.676906] bonding: event_dev: bond0, event: 4
11:17:25 [ 1566.676914] bonding: IFF_MASTER
11:17:27 [ 1568.680155] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:27 [ 1568.680176] bonding: basa: target 192.168.1.1
11:17:27 [ 1568.680200] bonding: basa: rtdev == bond->dev: arp_send
11:17:27 [ 1568.680214] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:29 [ 1570.684149] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:29 [ 1570.684171] bonding: basa: target 192.168.1.1
11:17:29 [ 1570.684189] bonding: basa: rtdev == bond->dev: arp_send
11:17:29 [ 1570.684203] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:31 [ 1572.688152] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:31 [ 1572.688173] bonding: bond0: link status definitely down
for interface eth0.1002, disabling it
11:17:31 [ 1572.688225] bonding: bond0: now running without any active
interface !
11:17:31 [ 1572.688237] bonding: basa: target 192.168.1.1
11:17:31 [ 1572.688253] bonding: basa: rtdev == bond->dev: arp_send
11:17:31 [ 1572.688266] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:31 [ 1572.688505] bonding: event_dev: bond0, event: 4
11:17:31 [ 1572.688513] bonding: IFF_MASTER
11:17:33 [ 1574.692147] bonding: bond_should_notify_peers: bond bond0 slave NULL
11:17:33 [ 1574.692168] bonding: bond0: link status definitely up for
interface eth0.1002.
11:17:33 [ 1574.692179] bonding: bond0: making interface eth0.1002 the
new active one.
11:17:33 [ 1574.692221] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:33 [ 1574.692424] bonding: event_dev: bond0, event: c
11:17:33 [ 1574.692432] bonding: IFF_MASTER
11:17:33 [ 1574.692610] bonding: event_dev: bond0, event: 13
11:17:33 [ 1574.692618] bonding: IFF_MASTER
11:17:33 [ 1574.692648] bonding: bond0: first active interface up!
11:17:33 [ 1574.692664] bonding: basa: target 192.168.1.1
11:17:33 [ 1574.692681] bonding: basa: rtdev == bond->dev: arp_send
11:17:33 [ 1574.692697] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:33 [ 1574.692907] bonding: event_dev: bond0, event: 4
11:17:33 [ 1574.692915] bonding: IFF_MASTER
11:17:35 [ 1576.696156] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:35 [ 1576.696176] bonding: basa: target 192.168.1.1
11:17:35 [ 1576.696203] bonding: basa: rtdev == bond->dev: arp_send
11:17:35 [ 1576.696218] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0
11:17:37 [ 1578.700129] bonding: bond_should_notify_peers: bond bond0
slave eth0.1002
11:17:37 [ 1578.700145] bonding: basa: target 192.168.1.1
11:17:37 [ 1578.700162] bonding: basa: rtdev == bond->dev: arp_send
11:17:37 [ 1578.700175] bonding: arp 1 on slave eth0.1002: dst
192.168.1.1 src 192.168.1.2 vid 0

Regards.
--
Manty/BestiaTester -> http://manty.net
Veaceslav Falico
2013-08-01 20:21:25 UTC
On Thu, Aug 1, 2013 at 2:11 PM, Santiago Garcia Mantinan
Post by Santiago Garcia Mantinan
Hi!
...snip...
Post by Santiago Garcia Mantinan
This is the setup on the bonding interface.
auto bond0
iface bond0 inet static
address 192.168.1.2
netmask 255.255.255.0
bond-slaves eth0.1002
bond-mode active-backup
bond-arp_validate 0
Could you please try with arp_validate=1?
Santiago Garcia Mantinan
2013-08-02 07:30:29 UTC
Post by Veaceslav Falico
Post by Santiago Garcia Mantinan
bond-slaves eth0.1002
bond-mode active-backup
bond-arp_validate 0
Could you please try with arp_validate=1?
arp_validate=1 was what I tried first. Then I thought that validation could
be causing packets to be dropped, so I switched to arp_validate=0; both
settings showed the same problem.

Regards.
--
Manty/BestiaTester -> http://manty.net
Nikolay Aleksandrov
2013-08-02 11:58:29 UTC
Post by Santiago Garcia Mantinan
Hi!
This is the setup on the bonding interface.
auto bond0
iface bond0 inet static
address 192.168.1.2
netmask 255.255.255.0
bond-slaves eth0.1002
bond-mode active-backup
bond-arp_validate 0
bond-arp_interval 5000
bond-arp_ip_target 192.168.1.1
pre-up ip link set eth0 up || true
pre-up ip link add link eth0 name eth0.1002 type vlan id 1002 || true
down ip link delete eth0.1002 || true
I believe this is because dev_trans_start() returns 0 for 8021q devices, so
the calculation of whether the slave has transmitted is wrong and the link
flip-flops.
Please try the attached patch, it should resolve your issue (basically it gets
the dev_trans_start of the vlan's underlying device if a vlan is found).
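
The fix hinges on the VLAN's underlying device. From user space, one way to
see which device that is for a given slave is the 8021q config table. This is
a small sketch, assuming the 8021q module exposes /proc/net/vlan/config in
its usual "name | id | device" layout; vlan_real_dev is an illustrative name,
and the helper only identifies the device, it does not reproduce what the
patch does inside the kernel.

```shell
#!/bin/sh
# Print the underlying (real) device of a VLAN interface by looking it up
# in the 8021q config table. $1 is the VLAN interface name; $2 is the
# table to read (default: /proc/net/vlan/config).
# vlan_real_dev is an illustrative name.
vlan_real_dev() {
    awk -F'|' -v want="$1" '
        NF == 3 {
            # Trim the padding around each column before comparing.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($1 == want) { print $3; found = 1; exit }
        }
        END { exit !found }
    ' "${2:-/proc/net/vlan/config}"
}
```

For the setup in this thread, vlan_real_dev eth0.1002 should name the device
whose transmit timestamp the patch consults instead of the VLAN's own.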

The patch is against Linus' tree.

Cheers,
Nik
Jay Vosburgh
2013-08-02 15:49:18 UTC
Post by Nikolay Aleksandrov
Post by Santiago Garcia Mantinan
Hi!
This is the setup on the bonding interface.
auto bond0
iface bond0 inet static
address 192.168.1.2
netmask 255.255.255.0
bond-slaves eth0.1002
bond-mode active-backup
bond-arp_validate 0
bond-arp_interval 5000
bond-arp_ip_target 192.168.1.1
pre-up ip link set eth0 up || true
pre-up ip link add link eth0 name eth0.1002 type vlan id 1002 || true
down ip link delete eth0.1002 || true
I believe that it is because dev_trans_start() returns 0 for 8021q devices and
so the calculations if the slave has transmitted are wrong, and the flip-flop
happens.
Please try the attached patch, it should resolve your issue (basically it gets
the dev_trans_start of the vlan's underlying device if a vlan is found).
The patch is against Linus' tree.
Cheers,
Nik
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 07f257d4..6aac0ae 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -665,6 +665,16 @@ static int bond_check_dev_link(struct bonding *bond,
 	return reporting ? -1 : BMSR_LSTATUS;
 }
+static unsigned long bond_dev_trans_start(struct net_device *dev)
+{
+	struct net_device *real_dev = dev;
+
+	if (dev->priv_flags & IFF_802_1Q_VLAN)
+		real_dev = vlan_dev_real_dev(dev);
+
+	return dev_trans_start(real_dev);
+}
Should this handle nested VLANs? E.g.,

static unsigned long bond_dev_trans_start(struct net_device *dev)
{
	while (dev->priv_flags & IFF_802_1Q_VLAN)
		dev = vlan_dev_real_dev(dev);

	return dev_trans_start(dev);
}

Also, this (ARP monitoring of a VLAN slave) has likely never
worked, and therefore this change should be considered for -stable.

-J
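
The nested-VLAN walk suggested above can be tried outside the kernel with a small mock. This is only a sketch: struct fake_dev and its fields are hypothetical stand-ins for struct net_device, the IFF_802_1Q_VLAN flag and vlan_dev_real_dev(), and the NULL check is an extra guard that the kernel version does not have.

```c
#include <stddef.h>

/* Userspace stand-in for struct net_device; all fields are hypothetical. */
struct fake_dev {
	int is_vlan;                /* plays the role of IFF_802_1Q_VLAN */
	struct fake_dev *real_dev;  /* what vlan_dev_real_dev() would return */
	unsigned long trans_start;  /* last transmit time in ticks */
};

/* Follow VLAN-on-VLAN stacking down to the first non-VLAN device
 * and report that device's trans_start. */
static unsigned long fake_dev_trans_start(const struct fake_dev *dev)
{
	while (dev->is_vlan && dev->real_dev != NULL)
		dev = dev->real_dev;

	return dev->trans_start;
}
```

For a VLAN stacked on a VLAN stacked on eth0, the loop returns eth0's value, whereas a single if would stop at the inner VLAN, whose trans_start is never updated.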
Post by Nikolay Aleksandrov
+
/*----------------------------- Multicast list ------------------------------*/
/*
@@ -2750,7 +2760,7 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
* so it can wait
*/
bond_for_each_slave(bond, slave, i) {
- unsigned long trans_start = dev_trans_start(slave->dev);
+ unsigned long trans_start = bond_dev_trans_start(slave->dev);
if (slave->link != BOND_LINK_UP) {
if (time_in_range(jiffies,
@@ -2912,7 +2922,7 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
* - (more than 2*delta since receive AND
* the bond has an IP address)
*/
- trans_start = dev_trans_start(slave->dev);
+ trans_start = bond_dev_trans_start(slave->dev);
if (bond_is_active_slave(slave) &&
(!time_in_range(jiffies,
trans_start - delta_in_ticks,
@@ -2947,7 +2957,7 @@ static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
continue;
- trans_start = dev_trans_start(slave->dev);
+ trans_start = bond_dev_trans_start(slave->dev);
if ((!bond->curr_active_slave &&
time_in_range(jiffies,
trans_start - delta_in_ticks,
---
-Jay Vosburgh, IBM Linux Technology Center, ***@us.ibm.com
Nikolay Aleksandrov
2013-08-02 16:13:23 UTC
Post by Jay Vosburgh
Post by Nikolay Aleksandrov
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 07f257d4..6aac0ae 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -665,6 +665,16 @@ static int bond_check_dev_link(struct bonding *bond,
 	return reporting ? -1 : BMSR_LSTATUS;
 }
+static unsigned long bond_dev_trans_start(struct net_device *dev)
+{
+	struct net_device *real_dev = dev;
+
+	if (dev->priv_flags & IFF_802_1Q_VLAN)
+		real_dev = vlan_dev_real_dev(dev);
+
+	return dev_trans_start(real_dev);
+}
Should this handle nested VLANs? E.g.,
static unsigned long bond_dev_trans_start(struct net_device *dev)
{
	while (dev->priv_flags & IFF_802_1Q_VLAN)
		dev = vlan_dev_real_dev(dev);

	return dev_trans_start(dev);
}
Also, this (ARP monitoring of a VLAN slave) has likely never
worked, and therefore this change should be considered for -stable.
-J
Yes, it should :-)
Thanks Jay, I'll re-submit it as a proper patch for -net in a bit.

Nik
Santiago Garcia Mantinan
2013-08-04 10:45:43 UTC
Post by Nikolay Aleksandrov
I believe that it is because dev_trans_start() returns 0 for 8021q devices and
so the calculations if the slave has transmitted are wrong, and the flip-flop
happens.
Please try the attached patch, it should resolve your issue (basically it gets
the dev_trans_start of the vlan's underlying device if a vlan is found).
Thanks, patched and compiling. I'll try it today with my laptops,
tomorrow in the lab I had set up, and then on the original machine.

I'll let you know how things go.

Regards.
--
Manty/BestiaTester -> http://manty.net
Santiago Garcia Mantinan
2013-08-05 10:26:16 UTC
Post by Santiago Garcia Mantinan
Post by Nikolay Aleksandrov
I believe that it is because dev_trans_start() returns 0 for 8021q devices and
so the calculations if the slave has transmitted are wrong, and the flip-flop
happens.
Please try the attached patch, it should resolve your issue (basically it gets
the dev_trans_start of the vlan's underlying device if a vlan is found).
Thanks, patched and compiling, I'll try today with my laptops and
tomorrow at the lab I had setup and then at the original machine.
I'll let you know how things go.
Ok, initial tests seem to show that a bond defined like the very basic
setup I sent to the list is now working.

What doesn't seem to work is putting a bond under the vlans and then
bonding those vlans, I mean:

iface bond0 inet manual
bond-slaves eth0
bond-mode 802.3ad
bond-miimon 100
...
iface bond2 inet static
address 192.168.1.2
netmask 255.255.255.0
bond-slaves bond0.1001 bond0.1002
bond-mode active-backup
bond-arp_validate 0
bond-arp_interval 2000
bond-arp_ip_target 192.168.1.1
...

Should this bond of bonds work?

After finding that the bond of bonds wasn't working, I'm doing more tests
to make sure that the basic eth0.1001 and eth0.1002 setup works 100%, just
in case it was also failing. So far the double bond is failing and the
basic bond seems to work ok.

Regards.
--
Manty/BestiaTester -> http://manty.net
Nikolay Aleksandrov
2013-08-05 10:26:14 UTC
Post by Santiago Garcia Mantinan
Post by Santiago Garcia Mantinan
Post by Nikolay Aleksandrov
I believe that it is because dev_trans_start() returns 0 for 8021q devices and
so the calculations if the slave has transmitted are wrong, and the flip-flop
happens.
Please try the attached patch, it should resolve your issue (basically it gets
the dev_trans_start of the vlan's underlying device if a vlan is found).
Thanks, patched and compiling, I'll try today with my laptops and
tomorrow at the lab I had setup and then at the original machine.
I'll let you know how things go.
Ok, initial tests seem to show that a bonding defined like I had on my
very basic setup that I sent to the list is now working.
What doesn't seem to be working is if I set it up using bonding under
iface bond0 inet manual
bond-slaves eth0
bond-mode 802.3ad
bond-miimon 100
...
iface bond2 inet static
address 192.168.1.2
netmask 255.255.255.0
bond-slaves bond0.1001 bond0.1002
bond-mode active-backup
bond-arp_validate 0
bond-arp_interval 2000
bond-arp_ip_target 192.168.1.1
...
Should this bond of bonds work?
No. After the patch we take trans_start from the first non-vlan interface,
which in this case is a bonding interface, and that doesn't update its
trans_start either. So a bond over a bond (or over vlans over a bond) with arp
monitoring is not expected to work.
Post by Santiago Garcia Mantinan
I'm doing more tests to make sure that the basic eth0.1001 and
eth0.1002 works 100% after finding that the bond of bonds wasn't
working ok, just in case the basic was also failing, but at least the
double bond is failing and basic bond seems to work ok.
Regards.
Santiago Garcia Mantinan
2013-08-07 07:26:49 UTC
Post by Nikolay Aleksandrov
Post by Santiago Garcia Mantinan
bond-arp_validate 0
No, because we take the first non-vlan's interface trans_start after the patch
which in this case is a bonding interface which also doesn't update its
trans_start, i.e. bond over bond (or over vlans over bond) with arp monitoring
shouldn't work.
Ok, after several days of testing it seems to work ok if I go with
arp_validate 0, but going for arp_validate 1 makes the link failure
count increase from time to time. Is this ok?

Regards.
--
Manty/BestiaTester -> http://manty.net
Nikolay Aleksandrov
2013-08-07 07:39:09 UTC
Post by Santiago Garcia Mantinan
Post by Nikolay Aleksandrov
Post by Santiago Garcia Mantinan
bond-arp_validate 0
No, because we take the first non-vlan's interface trans_start after the patch
which in this case is a bonding interface which also doesn't update its
trans_start, i.e. bond over bond (or over vlans over bond) with arp monitoring
shouldn't work.
Ok, after several days of testing it seems to work ok if I go with
arp_validate 0, going for arp_validate 1 would cause the link failure
count to be increased from time to time, is this ok?
Regards.
If arp_interval is changed then you have to disable the interface (e.g.
ifconfig bondX down) and enable it again (ifconfig bondX up), or set it
while the interface is down. There are also pr_debug()s in bond_validate_arp
and bond_arp_rcv that you can enable to check whether the ARPs are validated
properly.
Santiago Garcia Mantinan
2013-08-07 10:44:06 UTC
Post by Nikolay Aleksandrov
Post by Santiago Garcia Mantinan
Ok, after several days of testing it seems to work ok if I go with
arp_validate 0, going for arp_validate 1 would cause the link failure
count to be increased from time to time, is this ok?
If arp_interval is changed then you have to disable the interface (e.g.
ifconfig bondX down) and enable it again (ifconfig bondX up), or set it
while the interface is down. Also there're pr_debug()s in bond_validate_arp
and bond_arp_rcv that you can enable to check if it's validated properly.
I don't quite get what you mean here. If you want me to activate the
debug output and check that, please explain what I should do.

Aside from this, I've been doing some other tests, like enabling arp
validate with xor balance, and I've got a problem:

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: load balancing (xor)
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 2000
ARP IP target/s (n.n.n.n form): 192.168.1.1

Slave Interface: eth1.1001
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:23:7d:30:bc:48
Slave queue ID: 0

Slave Interface: eth1.1002
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:23:7d:30:bc:48
Slave queue ID: 0

Slave Interface: eth2.1001
MII Status: down
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:23:7d:30:bc:48
Slave queue ID: 0

Slave Interface: eth2.1002
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:23:7d:30:bc:48

The arp target is only visible through vlan 1002, not vlan
1001, so eth1.1001 and eth2.1001 should both be down. However,
eth1.1001 is up. I'm monitoring eth1.1001 and I can see the arp
queries but no replies, so why is it up?

Is arp validation incompatible with xor balance? If so, why is
eth2.1001 set to down?

Regards.
--
Manty/BestiaTester -> http://manty.net
Santiago Garcia Mantinan
2013-08-20 08:05:13 UTC
Hi!

Sorry it took me so long to reply back. I've been doing more tests on
xor mode and I see that arp monitoring is not working at all. I
haven't found any doc that says which modes should be compatible with
arp monitoring, maybe xor mode shouldn't be used at all.

My latest setup is a Linux machine with a couple of vlan interfaces
(eth2.1001 and eth2.1002), both with IP 192.168.1.1 (no bonding at all), and
another Linux machine running 3.11-rc3 with Nikolay's arp fix for
bonding and a bond configured like this:

iface bond0 inet static
address 192.168.1.2
netmask 255.255.255.0
bond-slaves eth0.1001 eth0.1002 eth1.1001 eth1.1002
bond-mode balance-xor
bond-arp_validate 0
bond-arp_interval 2000
bond-arp_ip_target 192.168.1.1

A silly switch connects the two ethernet ports of the machine with
the bond to the interface of the non-bonded machine.

What I saw was that the bonded machine didn't detect the ifconfig down
of the interfaces of the non-bonded machine at all. That led me to
the hypothesis that the bonded machine was considering its own traffic
(there was no traffic other than the arp requests of the bonding) as an
indication that the link was ok.

To test the hypothesis, I unplugged the non-bonded machine (192.168.1.1),
which is the target for the arp requests. While the bonding was still
seeing all interfaces up (not detecting that the other machine was not
responding), I unplugged one of the bonded interfaces and all 4 slaves
went down; when I replugged it, all 4 went up again.

Maybe this is to be expected because arp monitoring doesn't
work with balance-xor, but I haven't found any doc saying this.

If you need the debug info for this I can send it, but the events show
nothing, as there is no event saying that the link is lost or anything
:-(

Regards.
--
Manty/BestiaTester -> http://manty.net
Nikolay Aleksandrov
2013-08-20 10:11:34 UTC
Hi,
This setup works for me. What might be wrong with your setup is that you connect
all 4 ports to a "dumb" switch and you have the same vlans over the real
devices that are connected, so they see each other's packets, the port's
last_rx gets updated and they stay up.
I tried your setup with a "smart" switch so the ports couldn't see each other,
and only the one that saw 192.168.1.1 was up. The moment 192.168.1.1 went
down, the port went down in the bonding.
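
As a rough model of the behaviour described above, the sketch below assumes that, without ARP validation, the monitor only compares each slave's last receive and last transmit times against the ARP interval and never looks at who sent the received frame. The struct and function names are hypothetical, and the real driver logic has more cases than this.

```c
#include <stdbool.h>

/* Hypothetical per-slave state: tick of the last frame received and sent. */
struct slave_model {
	unsigned long last_rx;
	unsigned long trans_start;
};

/* Wraparound-safe "b <= a <= c", modelled on the kernel's time_in_range(). */
static bool tick_in_range(unsigned long a, unsigned long b, unsigned long c)
{
	return (long)(a - b) >= 0 && (long)(c - a) >= 0;
}

/* Assumed rule without validation: recent rx AND recent tx means "up".
 * Nothing here checks whether the received frame came from the ARP target. */
static bool slave_looks_up(const struct slave_model *s, unsigned long now,
			   unsigned long delta)
{
	return tick_in_range(now, s->trans_start - delta, s->trans_start + delta) &&
	       tick_in_range(now, s->last_rx - delta, s->last_rx + delta);
}
```

Under this model a broadcast ARP request sent by a sibling slave and flooded back by the switch refreshes last_rx just as well as a reply from 192.168.1.1 would, which is consistent with slaves staying up while the target is silent.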

Cheers,
Nik
Santiago Garcia Mantinan
2013-08-21 07:39:07 UTC
Hi!

I think we have to clarify the setup...
Post by Nikolay Aleksandrov
Post by Santiago Garcia Mantinan
iface bond0 inet static
address 192.168.1.2
netmask 255.255.255.0
bond-slaves eth0.1001 eth0.1002 eth1.1001 eth1.1002
bond-mode balance-xor
bond-arp_validate 0
bond-arp_interval 2000
bond-arp_ip_target 192.168.1.1
This setup works for me, what might be wrong with your setup is that you connect
all 4 ports to a "dumb" switch,
What I have is three ports, not 4: two network cards on the
bonded machine and one on the non-bonded machine. On the non-bonded
machine I configure the two vlan interfaces
over the same physical ethernet like this:
ifconfig eth2.1001 192.168.1.1
ifconfig eth2.1002 192.168.1.1
Post by Nikolay Aleksandrov
and you have the same vlans over the real
devices that are connected so they see each other's packets and the port's
last_rx gets updated so they stay up.
I'd like to clarify this a bit. Reading the bonding.txt file (the
howto), especially the arp_all_targets option (I haven't set this in my
setup), one would think that an arp reply from at least one of the
specified targets has to be received for the link to be
considered in a good state, not just any traffic, especially if the traffic
is generated by your very own bonding driver. Isn't that the case?

What I'm trying to check in the real world scenario is whether the gw,
which is at a remote location, is available, but I can have local
traffic that could be incrementing the counters.
Post by Nikolay Aleksandrov
I tried your setup with a "smart" switch so the ports couldn't see each other
and only the one that saw 192.168.1.1 was up, and the moment 192.168.1.1 went
down - the port went down in the bonding.
I think that the problem here is not the "dumb" or "smart" switch. I
believe we somehow have different setups. Please, if anything in my
setup is unclear (two machines, one with the bonding
config I explained, and the other with one card and the ifconfig
commands I gave above), just let me know.

My first "dumb" switch was the switch of a soho adsl wifi router, then
I tried a soho "dumb" 8 port 10/100 switch, then I tried an old
Cabletron SSR2000 where I had to define the two vlans on the three
ports and make these ports tagged ports, then I tried an Enterasys
B5 (where I also had to specify that these ports had those vlans as
egress and tagged). On the smart switches the slaves were considered
to be down while the vlans were not configured, as the switch was dropping
all traffic, but once the vlans were set up the slaves came up.

The behaviour I get is the same on "dumb" and "smart" switches. When I
have the eth2.1001 and eth2.1002 interfaces up everything is as expected,
but then I run:
ifconfig eth2.1001 0.0.0.0 down
ifconfig eth2.1002 0.0.0.0 down
and the bonded machine still sees all the slaves up, even though I can
see in the tcpdump I run on eth2 on the target machine that all 4
requests are arriving but none of them is being answered.

I have checked the counters you mentioned and indeed they are
increasing, both with "dumb" and "smart" switches (note that I haven't
defined any bond on the switch side). I believe that either switch has
to forward what comes from eth0.1001 (connected to switch port X) to
eth1.1001 (connected to switch port Y), as they are broadcast messages
and I haven't defined any bonding, so it has to forward what comes in on
port X to port Y; not doing so would break broadcast for a lot of
setups. What doesn't make sense to me is the assumption that
increasing counters mean we have a good link when none of the specified
targets are replying.

I don't know what else to add to clarify what is going on, please, if
something is not clear ask me.

Thanks for your replies.

Regards.
--
Manty/BestiaTester -> http://manty.net