Discussion:
[PATCH net-next v3 0/2] vxlan: allow specifying multiple default destinations
Mike Rapoport
2013-05-29 10:00:14 UTC
Permalink
Hi,

These patches add ability to specify multiple default destinations to
vxlan. This ability is usefull in cases when multicast are disabled on
infrastructure level, for instance in public clouds.

v3 changes:
* make netlink interface for remote destinations management more
generic and extend vxlan_validate to handle newly added atttributes.
(as proposed by Thomas Graf)

v2 changes:
* rebased on current net-next
* flush default destinations list at dellink as per Atzm Watanabe comment
* support only IPv4

Mike Rapoport (2):
vxlan: introduce vxlan_rdst_append
vxlan: allow specifying multiple default destinations

drivers/net/vxlan.c | 252 ++++++++++++++++++++++++++++++++++++++++++-
include/uapi/linux/if_link.h | 17 +++
2 files changed, 265 insertions(+), 4 deletions(-)
--
1.8.1.5
Mike Rapoport
2013-05-29 10:00:15 UTC
Permalink
to allow remotes list management for both FDB entries and default
destinations

Signed-off-by: Mike Rapoport <***@ravellosystems.com>
---
drivers/net/vxlan.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 8111565..5a2cf2f 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -366,14 +366,13 @@ static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev *vxlan,
return f;
}

-/* Add/update destinations for multicast */
-static int vxlan_fdb_append(struct vxlan_fdb *f,
- __be32 ip, __be16 port, __u32 vni, __u32 ifindex)
+static int vxlan_rdst_append(struct vxlan_rdst *rdst, __be32 ip, __be16 port,
+ __u32 vni, __u32 ifindex)
{
struct vxlan_rdst *rd_prev, *rd;

rd_prev = NULL;
- for (rd = &f->remote; rd; rd = rd->remote_next) {
+ for (rd = rdst; rd; rd = rd->remote_next) {
if (rd->remote_ip == ip &&
rd->remote_port == port &&
rd->remote_vni == vni &&
@@ -393,6 +392,13 @@ static int vxlan_fdb_append(struct vxlan_fdb *f,
return 1;
}

+/* Add/update destinations for multicast */
+static int vxlan_fdb_append(struct vxlan_fdb *f,
+ __be32 ip, __be16 port, __u32 vni, __u32 ifindex)
+{
+ return vxlan_rdst_append(&f->remote, ip, port, vni, ifindex);
+}
+
/* Add new entry to forwarding table -- assumes lock held */
static int vxlan_fdb_create(struct vxlan_dev *vxlan,
const u8 *mac, __be32 ip,
--
1.8.1.5
Stephen Hemminger
2013-05-29 22:56:42 UTC
Permalink
On Wed, 29 May 2013 13:00:15 +0300
Post by Mike Rapoport
-/* Add/update destinations for multicast */
-static int vxlan_fdb_append(struct vxlan_fdb *f,
- __be32 ip, __be16 port, __u32 vni, __u32 ifindex)
+static int vxlan_rdst_append(struct vxlan_rdst *rdst, __be32 ip, __be16 port,
+ __u32 vni, __u32 ifindex)
{
struct vxlan_rdst *rd_prev, *rd;
rd_prev = NULL;
- for (rd = &f->remote; rd; rd = rd->remote_next) {
+ for (rd = rdst; rd; rd = rd->remote_next) {
The remote destinations should be using one of our nice list macros.
And what about locking? or RCU?
Mike Rapoport
2013-05-30 08:42:42 UTC
Permalink
On Thu, May 30, 2013 at 1:56 AM, Stephen Hemminger
Post by Stephen Hemminger
On Wed, 29 May 2013 13:00:15 +0300
Post by Mike Rapoport
-/* Add/update destinations for multicast */
-static int vxlan_fdb_append(struct vxlan_fdb *f,
- __be32 ip, __be16 port, __u32 vni, __u32 ifindex)
+static int vxlan_rdst_append(struct vxlan_rdst *rdst, __be32 ip, __be16 port,
+ __u32 vni, __u32 ifindex)
{
struct vxlan_rdst *rd_prev, *rd;
rd_prev = NULL;
- for (rd = &f->remote; rd; rd = rd->remote_next) {
+ for (rd = rdst; rd; rd = rd->remote_next) {
The remote destinations should be using one of our nice list macros.
Can you explain why?
Current linked list implementation for multiple destinations in fdb
seems quite Ok and I've just reused it for default destinations
management...
Post by Stephen Hemminger
And what about locking? or RCU?
Oops, mu fault :)

--
Sincerely yours,
Mike.
Mike Rapoport
2013-05-29 10:00:17 UTC
Permalink
Signed-off-by: Mike Rapoport <***@ravellosystems.com>
---
ip/iplink_vxlan.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 125 insertions(+)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 1025326..be6c0ac 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -28,11 +28,87 @@ static void explain(void)
fprintf(stderr, " [ port MIN MAX ] [ [no]learning ]\n");
fprintf(stderr, " [ [no]proxy ] [ [no]rsc ]\n");
fprintf(stderr, " [ [no]l2miss ] [ [no]l3miss ]\n");
+ fprintf(stderr, " [ dstadd DST ]\n");
+ fprintf(stderr, " [ dstdel ADDR ]\n");
fprintf(stderr, "\n");
fprintf(stderr, "Where: VNI := 0-16777215\n");
fprintf(stderr, " ADDR := { IP_ADDRESS | any }\n");
fprintf(stderr, " TOS := { NUMBER | inherit }\n");
fprintf(stderr, " TTL := { 1..255 | inherit }\n");
+ fprintf(stderr, " DST := [ ADDR [port PORT] [vni VNI] [via DEV]]\n");
+}
+
+static int vxlan_parse_dstadd(int *argcp, char ***argvp, struct nlmsghdr *n)
+{
+ int argc = *argcp;
+ char **argv = *argvp;
+ __u32 vni, ifindex;
+ __u16 port;
+ struct rtattr *nest;
+ int addr_set = 0;
+
+ nest = addattr_nest(n, 1024, IFLA_VXLAN_REMOTE_NEW);
+
+ while (argc > 0) {
+ if (!matches(*argv, "vni")) {
+ NEXT_ARG();
+ if (get_u32(&vni, *argv, 0) ||
+ vni >= 1u << 24)
+ invarg("invalid id", *argv);
+ addattr32(n, 1024, IFLA_VXLAN_REMOTE_VNI, vni);
+ } else if (!matches(*argv, "port")) {
+ NEXT_ARG();
+ if (get_u16(&port, *argv, 0))
+ invarg("port", *argv);
+ addattr32(n, 1024, IFLA_VXLAN_REMOTE_PORT, htons(port));
+ } else if (!matches(*argv, "via")) {
+ NEXT_ARG();
+ ifindex = if_nametoindex(*argv);
+ addattr32(n, 1024, IFLA_VXLAN_REMOTE_IFINDEX, ifindex);
+ } else {
+ inet_prefix addr;
+ get_prefix(&addr, *argv, AF_UNSPEC);
+ addattr_l(n, 1024, IFLA_VXLAN_REMOTE_ADDR,
+ &addr.data, addr.bytelen);
+ addr_set = 1;
+ }
+ argc--, argv++;
+ }
+
+ if (!addr_set)
+ incomplete_command();
+
+ addattr_nest_end(n, nest);
+
+ *argcp = argc;
+ *argvp = argv;
+ return 0;
+}
+
+static int vxlan_parse_dstdel(int *argcp, char ***argvp, struct nlmsghdr *n)
+{
+ int argc = *argcp;
+ char **argv = *argvp;
+ struct rtattr *nest;
+
+ nest = addattr_nest(n, 1024, IFLA_VXLAN_REMOTE_DEL);
+
+ while (argc > 0) {
+ inet_prefix addr;
+ get_prefix(&addr, *argv, AF_UNSPEC);
+ addattr_l(n, 1024, IFLA_VXLAN_REMOTE_ADDR,
+ &addr.data, addr.bytelen);
+ argc--, argv++;
+ }
+
+ if (argc == *argcp)
+ incomplete_command();
+
+ addattr_nest_end(n, nest);
+
+ *argcp = argc;
+ *argvp = argv;
+ return 0;
}

static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
@@ -54,6 +130,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
__u32 age = 0;
__u32 maxaddr = 0;
struct ifla_vxlan_port_range range = { 0, 0 };
+ struct rtattr *remotes;

while (argc > 0) {
if (!matches(*argv, "id") ||
@@ -125,6 +202,16 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
invarg("max port", *argv);
range.low = htons(minport);
range.high = htons(maxport);
+ } else if (!matches(*argv, "dstadd")) {
+ NEXT_ARG();
+ remotes = addattr_nest(n, 1024, IFLA_VXLAN_REMOTES);
+ vxlan_parse_dstadd(&argc, &argv, n);
+ addattr_nest_end(n, remotes);
+ } else if (!matches(*argv, "dstdel")) {
+ NEXT_ARG();
+ remotes = addattr_nest(n, 1024, IFLA_VXLAN_REMOTES);
+ vxlan_parse_dstdel(&argc, &argv, n);
+ addattr_nest_end(n, remotes);
} else if (!matches(*argv, "nolearning")) {
learning = 0;
} else if (!matches(*argv, "learning")) {
@@ -187,6 +274,41 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
return 0;
}

+static void vxlan_print_remote(FILE *f, struct rtattr *attr)
+{
+ struct rtattr *tb[IFLA_VXLAN_REMOTE_MAX];
+ char s1[1024];
+
+ parse_rtattr_nested(tb, IFLA_VXLAN_REMOTE_MAX, attr);
+
+ if (tb[IFLA_VXLAN_REMOTE_ADDR]) {
+ struct rtattr *i = tb[IFLA_VXLAN_REMOTE_ADDR];
+ if (RTA_PAYLOAD(i) >= sizeof(struct in6_addr)) {
+ struct in6_addr addr;
+ memcpy(&addr, RTA_DATA(i), sizeof(struct in6_addr));
+ fprintf(f, " %s\n",
+ format_host(AF_INET6, sizeof(struct in6_addr),
+ &addr, s1, sizeof(s1)));
+ } else if (RTA_PAYLOAD(i) >= sizeof(__be32)) {
+ __be32 addr = rta_getattr_u32(i);
+ fprintf(f, " %s\n",
+ format_host(AF_INET, 4, &addr, s1, sizeof(s1)));
+ }
+ }
+}
+
+static void vxlan_print_remotes(FILE *f, struct rtattr *attr)
+{
+ struct rtattr *i;
+ int rem, n = 0;
+
+ fprintf(f, "\n default destinations :\n");
+
+ rem = RTA_PAYLOAD(attr);
+ for (i = RTA_DATA(attr); RTA_OK(i, rem); i = RTA_NEXT(i, rem), n++)
+ vxlan_print_remote(f, i);
+}
+
static void vxlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
{
__u32 vni;
@@ -277,6 +399,9 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
if (tb[IFLA_VXLAN_LIMIT] &&
(maxaddr = rta_getattr_u32(tb[IFLA_VXLAN_LIMIT]) != 0))
fprintf(f, "maxaddr %u ", maxaddr);
+
+ if (tb[IFLA_VXLAN_REMOTES])
+ vxlan_print_remotes(f, tb[IFLA_VXLAN_REMOTES]);
}

struct link_util vxlan_link_util = {
--
1.8.1.5
Cong Wang
2013-05-29 10:13:48 UTC
Permalink
Post by Mike Rapoport
---
ip/iplink_vxlan.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 125 insertions(+)
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 1025326..be6c0ac 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -28,11 +28,87 @@ static void explain(void)
fprintf(stderr, " [ port MIN MAX ] [ [no]learning ]\n");
fprintf(stderr, " [ [no]proxy ] [ [no]rsc ]\n");
fprintf(stderr, " [ [no]l2miss ] [ [no]l3miss ]\n");
+ fprintf(stderr, " [ dstadd DST ]\n");
+ fprintf(stderr, " [ dstdel ADDR ]\n");
Excuse me, but this looks like a design failure as you manipulate
remotes with `ip link` while creating vxlan devices, shouldn't this be
in a standard alone tool if we can't reuse any existing tool? Or am I
missing anything?
Mike Rapoport
2013-05-29 10:52:55 UTC
Permalink
Post by Cong Wang
Post by Mike Rapoport
---
ip/iplink_vxlan.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 125 insertions(+)
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 1025326..be6c0ac 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -28,11 +28,87 @@ static void explain(void)
fprintf(stderr, " [ port MIN MAX ] [ [no]learning ]\n");
fprintf(stderr, " [ [no]proxy ] [ [no]rsc ]\n");
fprintf(stderr, " [ [no]l2miss ] [ [no]l3miss ]\n");
+ fprintf(stderr, " [ dstadd DST ]\n");
+ fprintf(stderr, " [ dstdel ADDR ]\n");
Excuse me, but this looks like a design failure as you manipulate
remotes with `ip link` while creating vxlan devices, shouldn't this be
in a standard alone tool if we can't reuse any existing tool? Or am I
missing anything?
Frankly, I had a long hesitation about the userspace implementation.
From one side it seems very logical to use ip/iplink_vxlan for vxlan
device manipulations. Moreover, since the remotes are used pretty much
the same way as the group address, adding the remotes management to
ip/iplink_vxlan makes a lot of sense. Besides, creation of stand alone
tool for remote list manipulation in vxlan seemed to me little bit far
fetched.

On the other hand, I quite agree with you that
ip link add vxlan0 ... dstadd 192.168.1.1
or
ip link set vxlan0 ... dstdel 192.168.1.1
looks weird at least.
Post by Cong Wang
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Sincerely yours,
Mike.
Stephen Hemminger
2013-05-29 22:56:25 UTC
Permalink
On Wed, 29 May 2013 13:52:55 +0300
Post by Mike Rapoport
Post by Cong Wang
Post by Mike Rapoport
---
ip/iplink_vxlan.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 125 insertions(+)
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 1025326..be6c0ac 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -28,11 +28,87 @@ static void explain(void)
fprintf(stderr, " [ port MIN MAX ] [ [no]learning ]\n");
fprintf(stderr, " [ [no]proxy ] [ [no]rsc ]\n");
fprintf(stderr, " [ [no]l2miss ] [ [no]l3miss ]\n");
+ fprintf(stderr, " [ dstadd DST ]\n");
+ fprintf(stderr, " [ dstdel ADDR ]\n");
Excuse me, but this looks like a design failure as you manipulate
remotes with `ip link` while creating vxlan devices, shouldn't this be
in a standard alone tool if we can't reuse any existing tool? Or am I
missing anything?
Frankly, I had a long hesitation about the userspace implementation.
From one side it seems very logical to use ip/iplink_vxlan for vxlan
device manipulations. Moreover, since the remotes are used pretty much
the same way as the group address, adding the remotes management to
ip/iplink_vxlan makes a lot of sense. Besides, creation of stand alone
tool for remote list manipulation in vxlan seemed to me little bit far
fetched.
On the other hand, I quite agree with you that
ip link add vxlan0 ... dstadd 192.168.1.1
or
ip link set vxlan0 ... dstdel 192.168.1.1
looks weird at least.
Don't like add/delete semantics here either.
Maybe replace or modify, or has this grown enough that having its own
command line tool "vxlan ..." makes sense?
Mike Rapoport
2013-05-30 08:42:59 UTC
Permalink
On Thu, May 30, 2013 at 1:56 AM, Stephen Hemminger
Post by Stephen Hemminger
On Wed, 29 May 2013 13:52:55 +0300
Post by Mike Rapoport
Post by Cong Wang
Post by Mike Rapoport
---
ip/iplink_vxlan.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 125 insertions(+)
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 1025326..be6c0ac 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -28,11 +28,87 @@ static void explain(void)
fprintf(stderr, " [ port MIN MAX ] [ [no]learning ]\n");
fprintf(stderr, " [ [no]proxy ] [ [no]rsc ]\n");
fprintf(stderr, " [ [no]l2miss ] [ [no]l3miss ]\n");
+ fprintf(stderr, " [ dstadd DST ]\n");
+ fprintf(stderr, " [ dstdel ADDR ]\n");
Excuse me, but this looks like a design failure as you manipulate
remotes with `ip link` while creating vxlan devices, shouldn't this be
in a standard alone tool if we can't reuse any existing tool? Or am I
missing anything?
Frankly, I had a long hesitation about the userspace implementation.
From one side it seems very logical to use ip/iplink_vxlan for vxlan
device manipulations. Moreover, since the remotes are used pretty much
the same way as the group address, adding the remotes management to
ip/iplink_vxlan makes a lot of sense. Besides, creation of stand alone
tool for remote list manipulation in vxlan seemed to me little bit far
fetched.
On the other hand, I quite agree with you that
ip link add vxlan0 ... dstadd 192.168.1.1
or
ip link set vxlan0 ... dstdel 192.168.1.1
looks weird at least.
Don't like add/delete semantics here either.
Maybe replace or modify,
I think that replace or modify do not express the actual operation
meaning. My intention with dstadd was "add remote host X to
pseudo-multicast group". Replace/modify maybe nice to have features to
avoid doing delete+ add.
Post by Stephen Hemminger
or has this grown enough that having its own
command line tool "vxlan ..." makes sense?
Say, misc/vxlan that will handle remote destinations management? Or
should it take care of some vxlan parameters currently implemented in
ip/iplink_vxlan and bridge/fdb?

--
Sincerely yours,
Mike.
Thomas Graf
2013-05-30 11:44:24 UTC
Permalink
Post by Mike Rapoport
On Thu, May 30, 2013 at 1:56 AM, Stephen Hemminger
Post by Stephen Hemminger
On Wed, 29 May 2013 13:52:55 +0300
Post by Mike Rapoport
Frankly, I had a long hesitation about the userspace implementation.
From one side it seems very logical to use ip/iplink_vxlan for vxlan
device manipulations. Moreover, since the remotes are used pretty much
the same way as the group address, adding the remotes management to
ip/iplink_vxlan makes a lot of sense. Besides, creation of stand alone
tool for remote list manipulation in vxlan seemed to me little bit far
fetched.
On the other hand, I quite agree with you that
ip link add vxlan0 ... dstadd 192.168.1.1
or
ip link set vxlan0 ... dstdel 192.168.1.1
looks weird at least.
Don't like add/delete semantics here either.
Maybe replace or modify,
I think that replace or modify do not express the actual operation
meaning. My intention with dstadd was "add remote host X to
pseudo-multicast group". Replace/modify maybe nice to have features to
avoid doing delete+ add.
The alternative would be to require iproute2 to always provide the
full list of remote addresses like we do we route nexthops.

I do like the add/del though and don't see a problem with requiring
an ''ip link set [..] dstadd/dstdel''
Post by Mike Rapoport
Post by Stephen Hemminger
or has this grown enough that having its own
command line tool "vxlan ..." makes sense?
Say, misc/vxlan that will handle remote destinations management? Or
should it take care of some vxlan parameters currently implemented in
ip/iplink_vxlan and bridge/fdb?
What do we gain from a separate tool?
Mike Rapoport
2013-05-30 12:46:52 UTC
Permalink
Post by Thomas Graf
Post by Mike Rapoport
On Thu, May 30, 2013 at 1:56 AM, Stephen Hemminger
Post by Stephen Hemminger
On Wed, 29 May 2013 13:52:55 +0300
Post by Mike Rapoport
Frankly, I had a long hesitation about the userspace implementation.
From one side it seems very logical to use ip/iplink_vxlan for vxlan
device manipulations. Moreover, since the remotes are used pretty much
the same way as the group address, adding the remotes management to
ip/iplink_vxlan makes a lot of sense. Besides, creation of stand alone
tool for remote list manipulation in vxlan seemed to me little bit far
fetched.
On the other hand, I quite agree with you that
ip link add vxlan0 ... dstadd 192.168.1.1
or
ip link set vxlan0 ... dstdel 192.168.1.1
looks weird at least.
Don't like add/delete semantics here either.
Maybe replace or modify,
I think that replace or modify do not express the actual operation
meaning. My intention with dstadd was "add remote host X to
pseudo-multicast group". Replace/modify maybe nice to have features to
avoid doing delete+ add.
The alternative would be to require iproute2 to always provide the
full list of remote addresses like we do we route nexthops.
I do like the add/del though and don't see a problem with requiring
an ''ip link set [..] dstadd/dstdel''
I'm feeling Ok about "ip link set [..] dstadd/dstdel". What does bother
me is that you can't have different parameters for "ip link add" and "ip
link set" for vxlan (and other iplink) utility. So, one can use
ip link add [..] dstdel
which does not make sense...
Post by Thomas Graf
Post by Mike Rapoport
Post by Stephen Hemminger
or has this grown enough that having its own
command line tool "vxlan ..." makes sense?
Say, misc/vxlan that will handle remote destinations management? Or
should it take care of some vxlan parameters currently implemented in
ip/iplink_vxlan and bridge/fdb?
What do we gain from a separate tool?
--
Sincerely yours,
Mike.
Thomas Graf
2013-05-30 15:57:19 UTC
Permalink
Post by Mike Rapoport
I'm feeling Ok about "ip link set [..] dstadd/dstdel". What does bother
me is that you can't have different parameters for "ip link add" and "ip
link set" for vxlan (and other iplink) utility. So, one can use
ip link add [..] dstdel
which does not make sense...
You can easily pass an additional argument into iplink_modify()
and exclude certain options in the "add" use case.
Mike Rapoport
2013-06-02 07:09:23 UTC
Permalink
Post by Thomas Graf
Post by Mike Rapoport
I'm feeling Ok about "ip link set [..] dstadd/dstdel". What does bother
me is that you can't have different parameters for "ip link add" and "ip
link set" for vxlan (and other iplink) utility. So, one can use
ip link add [..] dstdel
which does not make sense...
You can easily pass an additional argument into iplink_modify()
and exclude certain options in the "add" use case.
I think there's no need to pass an additional argument to iplink_modify.
The vxlan_parse_opts may check the flags in nlmsghdr to distinguish
between the "add" and "set" cases.
Than we'll have 'ip link add [..]' as it was and the 'ip link set
[..]' will be used to manage default destinations.

--
Sincerely yours,
Mike.
Stephen Hemminger
2013-06-05 04:30:55 UTC
Permalink
On Sun, 2 Jun 2013 10:09:23 +0300
Post by Mike Rapoport
Post by Thomas Graf
Post by Mike Rapoport
I'm feeling Ok about "ip link set [..] dstadd/dstdel". What does bother
me is that you can't have different parameters for "ip link add" and "ip
link set" for vxlan (and other iplink) utility. So, one can use
ip link add [..] dstdel
which does not make sense...
You can easily pass an additional argument into iplink_modify()
and exclude certain options in the "add" use case.
I think there's no need to pass an additional argument to iplink_modify.
The vxlan_parse_opts may check the flags in nlmsghdr to distinguish
between the "add" and "set" cases.
Than we'll have 'ip link add [..]' as it was and the 'ip link set
[..]' will be used to manage default destinations.
--
Sincerely yours,
Mike.
I think multiple destinations should be handled like multipath routes.
I.e you don't specify multiple destinations on the command line, you specify them
individually and can add/delete them

If you delete the last destination then the forwarding entry should disappear.
The collapsing of multiple entries into one entry in table is an internal data structure
choice of vxlan and shouldn't be part of the netlink API requirement.

The API to iproute2/netlink should look like routing (through bridge fdb command).
Feel free to reject this if since I don't actually use this stuff.
Mike Rapoport
2013-06-05 12:58:50 UTC
Permalink
On Wed, Jun 5, 2013 at 7:30 AM, Stephen Hemminger
Post by Stephen Hemminger
On Sun, 2 Jun 2013 10:09:23 +0300
Post by Mike Rapoport
Post by Thomas Graf
Post by Mike Rapoport
I'm feeling Ok about "ip link set [..] dstadd/dstdel". What does bother
me is that you can't have different parameters for "ip link add" and "ip
link set" for vxlan (and other iplink) utility. So, one can use
ip link add [..] dstdel
which does not make sense...
You can easily pass an additional argument into iplink_modify()
and exclude certain options in the "add" use case.
I think there's no need to pass an additional argument to iplink_modify.
The vxlan_parse_opts may check the flags in nlmsghdr to distinguish
between the "add" and "set" cases.
Than we'll have 'ip link add [..]' as it was and the 'ip link set
[..]' will be used to manage default destinations.
--
Sincerely yours,
Mike.
I think multiple destinations should be handled like multipath routes.
I.e you don't specify multiple destinations on the command line, you specify them
individually and can add/delete them
If you delete the last destination then the forwarding entry should disappear.
The collapsing of multiple entries into one entry in table is an internal data structure
choice of vxlan and shouldn't be part of the netlink API requirement.
The API to iproute2/netlink should look like routing (through bridge fdb command).
Feel free to reject this if since I don't actually use this stuff.
Well, if we're to follow David Stevens suggestion to make default
destination fdb entry with ALL_ZEROS_MAC (1), they surely can be
managed using 'bridge fdb'.

(1) http://thread.gmane.org/gmane.linux.network/270969/focus=271791

--
Sincerely yours,
Mike.

Stephen Hemminger
2013-05-30 17:07:47 UTC
Permalink
On Thu, 30 May 2013 12:44:24 +0100
Post by Thomas Graf
Post by Mike Rapoport
Say, misc/vxlan that will handle remote destinations management? Or
should it take care of some vxlan parameters currently implemented in
ip/iplink_vxlan and bridge/fdb?
What do we gain from a separate tool?
At some point the syntax becomes unwieldy, could even just be a shell script.
Mike Rapoport
2013-05-29 10:00:16 UTC
Permalink
A list of multiple default destinations can be used in environments that
disable multicast on the infrastructure level, e.g. public clouds.

Signed-off-by: Mike Rapoport <***@ravellosystems.com>
---
drivers/net/vxlan.c | 238 +++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/if_link.h | 17 ++++
2 files changed, 255 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 5a2cf2f..29c1752 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -137,6 +137,8 @@ struct vxlan_dev {
unsigned int addrcnt;
unsigned int addrmax;

+ unsigned int remote_cnt; /* for additional default destinations */
+
struct hlist_head fdb_head[FDB_HASH_SIZE];
};

@@ -642,6 +644,102 @@ static void vxlan_snoop(struct net_device *dev,
}
}

+/* Add remote to default destinations list */
+static int vxlan_remote_add(struct vxlan_dev *vxlan, struct nlattr *attr)
+{
+ struct nlattr *i;
+ __be32 ip = htonl(INADDR_NONE);
+ __be16 port;
+ u32 ifindex, vni;
+ int rem, err = 0;
+
+ port = vxlan->dst_port;
+ vni = vxlan->default_dst.remote_vni;
+ ifindex = vxlan->default_dst.remote_ifindex;
+
+ nla_for_each_nested(i, attr, rem) {
+ switch (nla_type(i)) {
+ case IFLA_VXLAN_REMOTE_ADDR:
+ ip = nla_get_be32(i);
+ break;
+ case IFLA_VXLAN_REMOTE_PORT:
+ port = nla_get_be16(i);
+ break;
+ case IFLA_VXLAN_REMOTE_VNI:
+ vni = nla_get_u32(i);
+ break;
+ case IFLA_VXLAN_REMOTE_IFINDEX:
+ ifindex = nla_get_u32(i);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ };
+
+ if (err)
+ return err;
+ }
+
+ if (ip == htonl(INADDR_NONE))
+ return -EINVAL;
+
+ err = vxlan_rdst_append(&vxlan->default_dst, ip, port, vni, ifindex);
+ if (err < 0)
+ return err;
+
+ if (err == 0)
+ return -EEXIST;
+
+ vxlan->remote_cnt++;
+
+ return 0;
+}
+
+static void vxlan_remote_destroy(struct vxlan_dev *vxlan,
+ struct vxlan_rdst *rd)
+{
+ vxlan->remote_cnt--;
+ kfree(rd);
+}
+
+/* Delete remote from default destinations list */
+static int vxlan_remote_delete(struct vxlan_dev *vxlan, struct nlattr *attr)
+{
+ struct vxlan_rdst *rd, *rd_prev = NULL;
+ struct nlattr *i;
+ __be32 ip = htonl(INADDR_NONE);
+ int rem, err = 0;
+
+ nla_for_each_nested(i, attr, rem) {
+ switch (nla_type(i)) {
+ case IFLA_VXLAN_REMOTE_ADDR:
+ ip = nla_get_be32(i);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ if (err)
+ return err;
+ }
+
+ if (ip == htonl(INADDR_NONE))
+ return -EINVAL;
+
+ rd_prev = &vxlan->default_dst;
+
+ for (rd = vxlan->default_dst.remote_next; rd; rd = rd->remote_next) {
+ if (rd->remote_ip == ip) {
+ rd_prev->remote_next = rd->remote_next;
+ vxlan_remote_destroy(vxlan, rd);
+ return 0;
+ }
+ rd_prev = rd;
+ }
+
+ return -ENOENT;
+}

/* See if multicast group is already in use by other ID */
static bool vxlan_group_used(struct vxlan_net *vn,
@@ -1380,6 +1478,13 @@ static void vxlan_setup(struct net_device *dev)
INIT_HLIST_HEAD(&vxlan->fdb_head[h]);
}

+static const struct nla_policy vxlan_remotes_policy[IFLA_VXLAN_REMOTE_MAX + 1] = {
+ [IFLA_VXLAN_REMOTE_ADDR] = { .len = FIELD_SIZEOF(struct iphdr, daddr) },
+ [IFLA_VXLAN_REMOTE_IFINDEX] = { .type = NLA_U32 },
+ [IFLA_VXLAN_REMOTE_PORT] = { .type = NLA_U16 },
+ [IFLA_VXLAN_REMOTE_VNI] = { .type = NLA_U32 },
+};
+
static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
[IFLA_VXLAN_ID] = { .type = NLA_U32 },
[IFLA_VXLAN_GROUP] = { .len = FIELD_SIZEOF(struct iphdr, daddr) },
@@ -1396,10 +1501,35 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
[IFLA_VXLAN_L2MISS] = { .type = NLA_U8 },
[IFLA_VXLAN_L3MISS] = { .type = NLA_U8 },
[IFLA_VXLAN_PORT] = { .type = NLA_U16 },
+ [IFLA_VXLAN_REMOTES] = { .type = NLA_NESTED },
};

+static int vxlan_validate_remotes(struct nlattr *data)
+{
+ struct nlattr *attr;
+ int rem, err;
+
+ if (!data)
+ return 0;
+
+ nla_for_each_nested(attr, data, rem) {
+ if ((nla_type(attr) != IFLA_VXLAN_REMOTE_NEW) &&
+ (nla_type(attr) != IFLA_VXLAN_REMOTE_DEL))
+ return -EINVAL;
+
+ err = nla_validate_nested(attr, IFLA_VXLAN_REMOTE_MAX,
+ vxlan_remotes_policy);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
{
+ int err;
+
if (tb[IFLA_ADDRESS]) {
if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) {
pr_debug("invalid link address (not ethernet)\n");
@@ -1432,6 +1562,10 @@ static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
}
}

+ err = vxlan_validate_remotes(data[IFLA_VXLAN_REMOTES]);
+ if (err)
+ return err;
+
return 0;
}

@@ -1512,6 +1646,46 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port)
return vs;
}

+static int vxlan_remotes_update(struct vxlan_dev *vxlan, struct nlattr *attr)
+{
+ struct nlattr *i;
+ int rem, err = 0;
+
+ nla_for_each_nested(i, attr, rem) {
+ switch (nla_type(i)) {
+ case IFLA_VXLAN_REMOTE_NEW:
+ err = vxlan_remote_add(vxlan, i);
+ break;
+ case IFLA_VXLAN_REMOTE_DEL:
+ err = vxlan_remote_delete(vxlan, i);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ };
+
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static int vxlan_changelink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[])
+{
+ struct vxlan_dev *vxlan = netdev_priv(dev);
+ int err;
+
+ if (data[IFLA_VXLAN_REMOTES]) {
+ err = vxlan_remotes_update(vxlan, data[IFLA_VXLAN_REMOTES]);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
static int vxlan_newlink(struct net *net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[])
{
@@ -1630,11 +1804,20 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
return 0;
}

+static void vxlan_remotes_flush(struct vxlan_dev *vxlan)
+{
+ struct vxlan_rdst *rd;
+
+ for (rd = vxlan->default_dst.remote_next; rd; rd = rd->remote_next)
+ vxlan_remote_destroy(vxlan, rd);
+}
+
static void vxlan_dellink(struct net_device *dev, struct list_head *head)
{
struct vxlan_dev *vxlan = netdev_priv(dev);
struct vxlan_sock *vs = vxlan->vn_sock;

+ vxlan_remotes_flush(vxlan);
hlist_del_rcu(&vxlan->hlist);
list_del(&vxlan->next);
unregister_netdevice_queue(dev, head);
@@ -1645,6 +1828,20 @@ static void vxlan_dellink(struct net_device *dev, struct list_head *head)
}
}

+static size_t vxlan_remote_list_size(const struct net_device *dev)
+{
+ struct vxlan_dev *vxlan = netdev_priv(dev);
+ ssize_t size = nla_total_size(sizeof(struct nlattr));
+ struct vxlan_rdst *rd;
+
+ for (rd = vxlan->default_dst.remote_next; rd; rd = rd->remote_next) {
+ size += nla_total_size(sizeof(struct nlattr));
+ size += nla_total_size(sizeof(__be32));
+ }
+
+ return size;
+}
+
static size_t vxlan_get_size(const struct net_device *dev)
{

@@ -1663,9 +1860,46 @@ static size_t vxlan_get_size(const struct net_device *dev)
nla_total_size(sizeof(__u32)) + /* IFLA_VXLAN_LIMIT */
nla_total_size(sizeof(struct ifla_vxlan_port_range)) +
nla_total_size(sizeof(__be16))+ /* IFLA_VXLAN_PORT */
+ vxlan_remote_list_size(dev) +
0;
}

+static int vxlan_fill_remotes_info(struct sk_buff *skb,
+ const struct vxlan_dev *vxlan)
+{
+ struct vxlan_rdst *rdst;
+ struct nlattr *nest, *rdst_nest;
+ __be32 ip;
+ int i;
+
+ if (vxlan->remote_cnt) {
+ nest = nla_nest_start(skb, IFLA_VXLAN_REMOTES);
+ if (nest == NULL)
+ goto nla_put_failure;
+
+ for (rdst = vxlan->default_dst.remote_next, i = 0; rdst;
+ rdst = rdst->remote_next, i++) {
+ ip = rdst->remote_ip;
+
+ rdst_nest = nla_nest_start(skb, i);
+ if (rdst_nest == NULL)
+ goto nla_put_failure;
+
+ if (nla_put_be32(skb, IFLA_VXLAN_REMOTE_ADDR, ip))
+ goto nla_put_failure;
+
+ nla_nest_end(skb, rdst_nest);
+ }
+
+ nla_nest_end(skb, nest);
+ }
+
+ return 0;
+
+nla_put_failure:
+ return -EMSGSIZE;
+}
+
static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
{
const struct vxlan_dev *vxlan = netdev_priv(dev);
@@ -1706,6 +1940,9 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
if (nla_put(skb, IFLA_VXLAN_PORT_RANGE, sizeof(ports), &ports))
goto nla_put_failure;

+ if (vxlan_fill_remotes_info(skb, vxlan))
+ goto nla_put_failure;
+
return 0;

nla_put_failure:
@@ -1720,6 +1957,7 @@ static struct rtnl_link_ops vxlan_link_ops __read_mostly = {
.setup = vxlan_setup,
.validate = vxlan_validate,
.newlink = vxlan_newlink,
+ .changelink = vxlan_changelink,
.dellink = vxlan_dellink,
.get_size = vxlan_get_size,
.fill_info = vxlan_fill_info,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b05823c..3f25bbd 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -311,10 +311,27 @@ enum {
IFLA_VXLAN_L2MISS,
IFLA_VXLAN_L3MISS,
IFLA_VXLAN_PORT, /* destination port */
+ IFLA_VXLAN_REMOTES,
__IFLA_VXLAN_MAX
};
#define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)

+enum {
+ IFLA_VXLAN_REMOTE_NEW,
+ IFLA_VXLAN_REMOTE_DEL,
+};
+
+enum {
+ IFLA_VXLAN_REMOTE_UNSPEC,
+ IFLA_VXLAN_REMOTE_ADDR,
+ IFLA_VXLAN_REMOTE_IFINDEX,
+ IFLA_VXLAN_REMOTE_PORT,
+ IFLA_VXLAN_REMOTE_VNI,
+ __IFLA_VXLAN_REMOTE_MAX
+};
+
+#define IFLA_VXLAN_REMOTE_MAX (__IFLA_VXLAN_REMOTE_MAX - 1)
+
struct ifla_vxlan_port_range {
__be16 low;
__be16 high;
--
1.8.1.5
Thomas Graf
2013-05-30 11:09:48 UTC
Permalink
Looks much better, thanks for taking the feedback. Some additional
Post by Mike Rapoport
+static int vxlan_remote_add(struct vxlan_dev *vxlan, struct nlattr *attr)
+{
+ struct nlattr *i;
+ __be32 ip = htonl(INADDR_NONE);
+ __be16 port;
+ u32 ifindex, vni;
+ int rem, err = 0;
+
+ port = vxlan->dst_port;
+ vni = vxlan->default_dst.remote_vni;
+ ifindex = vxlan->default_dst.remote_ifindex;
+
+ nla_for_each_nested(i, attr, rem) {
+ switch (nla_type(i)) {
+ ip = nla_get_be32(i);
+ break;
+ port = nla_get_be16(i);
+ break;
+ vni = nla_get_u32(i);
+ break;
+ ifindex = nla_get_u32(i);
+ break;
+ err = -EINVAL;
+ break;
+ };
+
+ if (err)
+ return err;
+ }
The above construct is not forward compatible. If we want to add
more attributes in the future any older kernel will break with
EINVAL. We typically support partial requests in older kernels.
Post by Mike Rapoport
@@ -1380,6 +1478,13 @@ static void vxlan_setup(struct net_device *dev)
INIT_HLIST_HEAD(&vxlan->fdb_head[h]);
}
+static const struct nla_policy vxlan_remotes_policy[IFLA_VXLAN_REMOTE_MAX + 1] = {
+ [IFLA_VXLAN_REMOTE_ADDR] = { .len = FIELD_SIZEOF(struct iphdr, daddr) },
I would just make this a u32 for now.
Post by Mike Rapoport
@@ -1512,6 +1646,46 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port)
return vs;
}
+static int vxlan_remotes_update(struct vxlan_dev *vxlan, struct nlattr *attr)
+{
+ struct nlattr *i;
+ int rem, err = 0;
+
+ nla_for_each_nested(i, attr, rem) {
+ switch (nla_type(i)) {
+ err = vxlan_remote_add(vxlan, i);
+ break;
+ err = vxlan_remote_delete(vxlan, i);
+ break;
+ err = -EINVAL;
+ break;
I would return EOPNOTSUPP here
Post by Mike Rapoport
+static int vxlan_fill_remotes_info(struct sk_buff *skb,
+ const struct vxlan_dev *vxlan)
+{
+ struct vxlan_rdst *rdst;
+ struct nlattr *nest, *rdst_nest;
+ __be32 ip;
+ int i;
+
+ if (vxlan->remote_cnt) {
+ nest = nla_nest_start(skb, IFLA_VXLAN_REMOTES);
+ if (nest == NULL)
+ goto nla_put_failure;
+
+ for (rdst = vxlan->default_dst.remote_next, i = 0; rdst;
+ rdst = rdst->remote_next, i++) {
+ ip = rdst->remote_ip;
+
+ rdst_nest = nla_nest_start(skb, i);
Attribute type '0' is reserved, please don't use it. Start with '1'.
Post by Mike Rapoport
+enum {
+ IFLA_VXLAN_REMOTE_NEW,
+ IFLA_VXLAN_REMOTE_DEL,
+};
Same here, attribute type '0' is reserved.
Mike Rapoport
2013-05-30 11:16:26 UTC
Permalink
Post by Thomas Graf
Looks much better, thanks for taking the feedback. Some additional
+ const struct vxlan_dev *vxlan)
+{
+ struct vxlan_rdst *rdst;
+ struct nlattr *nest, *rdst_nest;
+ __be32 ip;
+ int i;
+
+ if (vxlan->remote_cnt) {
+ nest = nla_nest_start(skb, IFLA_VXLAN_REMOTES);
+ if (nest == NULL)
+ goto nla_put_failure;
+
+ for (rdst = vxlan->default_dst.remote_next, i = 0; rdst;
+ rdst = rdst->remote_next, i++) {
+ ip = rdst->remote_ip;
+
+ rdst_nest = nla_nest_start(skb, i);
Attribute type '0' is reserved, please don't use it. Start with '1'.
+enum {
+ IFLA_VXLAN_REMOTE_NEW,
+ IFLA_VXLAN_REMOTE_DEL,
+};
Same here, attribute type '0' is reserved.
Does it mean I have to add _UNSPEC and _MAX? And, consequently, rename
either REMOTE_{NEW,DEL} or REMOTE_ADDR and friends to avoid redefinition
of _MAX attribute?

--
Sincerely yours,
Mike.
Thomas Graf
2013-05-30 11:37:39 UTC
Permalink
Post by Mike Rapoport
Post by Thomas Graf
Looks much better, thanks for taking the feedback. Some additional
+ const struct vxlan_dev *vxlan)
+{
+ struct vxlan_rdst *rdst;
+ struct nlattr *nest, *rdst_nest;
+ __be32 ip;
+ int i;
+
+ if (vxlan->remote_cnt) {
+ nest = nla_nest_start(skb, IFLA_VXLAN_REMOTES);
+ if (nest == NULL)
+ goto nla_put_failure;
+
+ for (rdst = vxlan->default_dst.remote_next, i = 0; rdst;
+ rdst = rdst->remote_next, i++) {
+ ip = rdst->remote_ip;
+
+ rdst_nest = nla_nest_start(skb, i);
Attribute type '0' is reserved, please don't use it. Start with '1'.
+enum {
+ IFLA_VXLAN_REMOTE_NEW,
+ IFLA_VXLAN_REMOTE_DEL,
+};
Same here, attribute type '0' is reserved.
Does it mean I have to add _UNSPEC and _MAX? And, consequently, rename
either REMOTE_{NEW,DEL} or REMOTE_ADDR and friends to avoid redefinition
of _MAX attribute?
I think this is good enough:

enum {
IFLA_VXLAN_REMOTE_NEW = 1,
IFLA_VXLAN_REMOTE_DEL,
};
Stephen Hemminger
2013-05-31 16:17:40 UTC
Permalink
Looking at this code in more detail, I see a slew of problems.

First the list of destinations isn't really a list. The default one is still
embedded in the fdb entry. This means you can't change it safely.

Also the notification via netlink only sends back a single destination
value.

And the lack of locking on the open coded link list means it is not safe since
the forwarding table is used with RCU. In order to be safe, proper RCU
barriers would be needed or better yet convert to list_rcu..

Overall, I feel guilty for not inspecting this more closely and am surprised
that others did not catch the lack of locking.
Mike Rapoport
2013-06-02 10:29:42 UTC
Permalink
Post by Stephen Hemminger
Looking at this code in more detail, I see a slew of problems.
First the list of destinations isn't really a list. The default one is still
embedded in the fdb entry. This means you can't change it safely.
Also the notification via netlink only sends back a single destination
value.
And the lack of locking on the open coded link list means it is not safe since
the forwarding table is used with RCU. In order to be safe, proper RCU
barriers would be needed or better yet convert to list_rcu..
Overall, I feel guilty for not inspecting this more closely and am surprised
that others did not catch the lack of locking.
I've tried to convert remotes list to use hlist_rcu. The patch below implements the conversion, but it does not address the netlink notification issue.
Post by Stephen Hemminger
From 4de2001606a258462369520231643874e3e5c14c Mon Sep 17 00:00:00 2001
From: Mike Rapoport <***@ravellosystems.com>
Date: Sun, 2 Jun 2013 13:21:23 +0300
Subject: [PATCH] vxlan: convert remotes list to hlist_rcu

Signed-off-by: Mike Rapoport <***@ravellosystems.com>
---
drivers/net/vxlan.c | 75 +++++++++++++++++++++++++++++++++--------------------
1 file changed, 47 insertions(+), 28 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 8111565..664853e 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -101,7 +101,8 @@ struct vxlan_rdst {
__be16 remote_port;
u32 remote_vni;
u32 remote_ifindex;
- struct vxlan_rdst *remote_next;
+ struct hlist_node hlist;
+ struct rcu_head rcu;
};

/* Forwarding table entry */
@@ -110,7 +111,7 @@ struct vxlan_fdb {
struct rcu_head rcu;
unsigned long updated; /* jiffies */
unsigned long used;
- struct vxlan_rdst remote;
+ struct hlist_head remotes;
u16 state; /* see ndm_state */
u8 flags; /* see ndm_flags */
u8 eth_addr[ETH_ALEN];
@@ -163,6 +164,13 @@ static inline struct hlist_head *vs_head(struct net *net, __be16 port)
return &vn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
}

+/* The first remote destination in FDB remotes list */
+static inline struct vxlan_rdst *remotes_head(struct vxlan_fdb *f)
+{
+ return hlist_entry(hlist_first_rcu(&f->remotes),
+ struct vxlan_rdst, hlist);
+}
+
/* Find VXLAN socket based on network namespace and UDP port */
static struct vxlan_sock *vxlan_find_port(struct net *net, __be16 port)
{
@@ -278,7 +286,7 @@ static void vxlan_fdb_notify(struct vxlan_dev *vxlan,
if (skb == NULL)
goto errout;

- err = vxlan_fdb_info(skb, vxlan, fdb, 0, 0, type, 0, &fdb->remote);
+ err = vxlan_fdb_info(skb, vxlan, fdb, 0, 0, type, 0, remotes_head(fdb));
if (err < 0) {
/* -EMSGSIZE implies BUG in vxlan_nlmsg_size() */
WARN_ON(err == -EMSGSIZE);
@@ -297,11 +305,14 @@ static void vxlan_ip_miss(struct net_device *dev, __be32 ipa)
{
struct vxlan_dev *vxlan = netdev_priv(dev);
struct vxlan_fdb f;
+ struct vxlan_rdst remote;

memset(&f, 0, sizeof f);
f.state = NUD_STALE;
- f.remote.remote_ip = ipa; /* goes to NDA_DST */
- f.remote.remote_vni = VXLAN_N_VID;
+ remote.remote_ip = ipa; /* goes to NDA_DST */
+ remote.remote_vni = VXLAN_N_VID;
+
+ hlist_add_head_rcu(&remote.hlist, &f.remotes);

vxlan_fdb_notify(vxlan, &f, RTM_GETNEIGH);
}
@@ -370,17 +381,16 @@ static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev *vxlan,
static int vxlan_fdb_append(struct vxlan_fdb *f,
__be32 ip, __be16 port, __u32 vni, __u32 ifindex)
{
- struct vxlan_rdst *rd_prev, *rd;
+ struct vxlan_rdst *rd;

- rd_prev = NULL;
- for (rd = &f->remote; rd; rd = rd->remote_next) {
+ hlist_for_each_entry_rcu(rd, &f->remotes, hlist) {
if (rd->remote_ip == ip &&
rd->remote_port == port &&
rd->remote_vni == vni &&
rd->remote_ifindex == ifindex)
return 0;
- rd_prev = rd;
}
+
rd = kmalloc(sizeof(*rd), GFP_ATOMIC);
if (rd == NULL)
return -ENOBUFS;
@@ -388,8 +398,9 @@ static int vxlan_fdb_append(struct vxlan_fdb *f,
rd->remote_port = port;
rd->remote_vni = vni;
rd->remote_ifindex = ifindex;
- rd->remote_next = NULL;
- rd_prev->remote_next = rd;
+
+ hlist_add_head_rcu(&rd->hlist, &f->remotes);
+
return 1;
}

@@ -441,16 +452,13 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
return -ENOMEM;

notify = 1;
- f->remote.remote_ip = ip;
- f->remote.remote_port = port;
- f->remote.remote_vni = vni;
- f->remote.remote_ifindex = ifindex;
- f->remote.remote_next = NULL;
f->state = state;
f->flags = ndm_flags;
f->updated = f->used = jiffies;
memcpy(f->eth_addr, mac, ETH_ALEN);

+ vxlan_fdb_append(f, ip, port, vni, ifindex);
+
++vxlan->addrcnt;
hlist_add_head_rcu(&f->hlist,
vxlan_fdb_head(vxlan, mac));
@@ -462,16 +470,22 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
return 0;
}

+static void vxlan_rdst_free(struct rcu_head *head)
+{
+ struct vxlan_rdst *rd = container_of(head, struct vxlan_rdst, rcu);
+ kfree(rd);
+}
+
static void vxlan_fdb_free(struct rcu_head *head)
{
struct vxlan_fdb *f = container_of(head, struct vxlan_fdb, rcu);
+ struct vxlan_rdst *rd;

- while (f->remote.remote_next) {
- struct vxlan_rdst *rd = f->remote.remote_next;
-
- f->remote.remote_next = rd->remote_next;
- kfree(rd);
+ hlist_for_each_safe(rd, &f->remotes, hlist) {
+ hlist_del_rcu(&rd->hlist);
+ call_rcu(&rd->rcu, vxlan_rdst_free);
}
+
kfree(f);
}

@@ -581,7 +595,7 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,

hlist_for_each_entry_rcu(f, &vxlan->fdb_head[h], hlist) {
struct vxlan_rdst *rd;
- for (rd = &f->remote; rd; rd = rd->remote_next) {
+ hlist_for_each_entry_rcu(rd, &f->remotes, hlist) {
if (idx < cb->args[0])
goto skip;

@@ -613,15 +627,17 @@ static void vxlan_snoop(struct net_device *dev,

f = vxlan_find_mac(vxlan, src_mac);
if (likely(f)) {
- if (likely(f->remote.remote_ip == src_ip))
+ struct vxlan_rdst *remote = remotes_head(f);
+
+ if (likely(remote->remote_ip == src_ip))
return;

if (net_ratelimit())
netdev_info(dev,
"%pM migrated from %pI4 to %pI4\n",
- src_mac, &f->remote.remote_ip, &src_ip);
+ src_mac, &remote->remote_ip, &src_ip);

- f->remote.remote_ip = src_ip;
+ remote->remote_ip = src_ip;
f->updated = jiffies;
} else {
/* learned new entry */
@@ -856,6 +872,7 @@ static int arp_reduce(struct net_device *dev, struct sk_buff *skb)
if (n) {
struct vxlan_fdb *f;
struct sk_buff *reply;
+ struct vxlan_rdst *remote;

if (!(n->nud_state & NUD_CONNECTED)) {
neigh_release(n);
@@ -863,7 +880,8 @@ static int arp_reduce(struct net_device *dev, struct sk_buff *skb)
}

f = vxlan_find_mac(vxlan, n->ha);
- if (f && f->remote.remote_ip == htonl(INADDR_ANY)) {
+ remote = remotes_head(f);
+ if (f && remote->remote_ip == htonl(INADDR_ANY)) {
/* bridge-local neighbor */
neigh_release(n);
goto out;
@@ -1181,12 +1199,13 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
!is_multicast_ether_addr(eth->h_dest))
vxlan_fdb_miss(vxlan, eth->h_dest);
} else
- rdst0 = &f->remote;
+ rdst0 = remotes_head(f);

rc = NETDEV_TX_OK;

/* if there are multiple destinations, send copies */
- for (rdst = rdst0->remote_next; rdst; rdst = rdst->remote_next) {
+ rdst = rdst0;
+ hlist_for_each_entry_continue_rcu(rdst, hlist) {
struct sk_buff *skb1;

skb1 = skb_clone(skb, GFP_ATOMIC);
--
1.8.1.5


--
Sincerely yours,
Mike.
Stephen Hemminger
2013-06-03 15:57:37 UTC
Permalink
On Sun, 2 Jun 2013 13:29:42 +0300
Post by Mike Rapoport
Post by Stephen Hemminger
Looking at this code in more detail, I see a slew of problems.
First the list of destinations isn't really a list. The default one is still
embedded in the fdb entry. This means you can't change it safely.
Also the notification via netlink only sends back a single destination
value.
And the lack of locking on the open coded link list means it is not safe since
the forwarding table is used with RCU. In order to be safe, proper RCU
barriers would be needed or better yet convert to list_rcu..
Overall, I feel guilty for not inspecting this more closely and am surprised
that others did not catch the lack of locking.
I've tried to convert remotes list to use hlist_rcu. The patch below implements the conversion, but it does not address the netlink notification issue.
Ok, let me have a go at this.
1. Using list rather than hlist makes more sense for handling simple list
2. The complexity is in how do do the netlink API.

3. If we can't work this out for 3.11, the multiple remotes may have to be reverted;
I don't want the existing API to be exposed in a release
Mike Rapoport
2013-06-03 19:47:31 UTC
Permalink
(added David Stevens to CC)
Post by Stephen Hemminger
On Sun, 2 Jun 2013 13:29:42 +0300
Post by Mike Rapoport
Post by Stephen Hemminger
Looking at this code in more detail, I see a slew of problems.
First the list of destinations isn't really a list. The default one is still
embedded in the fdb entry. This means you can't change it safely.
Also the notification via netlink only sends back a single destination
value.
And the lack of locking on the open coded link list means it is not safe since
the forwarding table is used with RCU. In order to be safe, proper RCU
barriers would be needed or better yet convert to list_rcu..
Overall, I feel guilty for not inspecting this more closely and am surprised
that others did not catch the lack of locking.
I've tried to convert remotes list to use hlist_rcu. The patch below implements the conversion, but it does not address the netlink notification issue.
Ok, let me have a go at this.
1. Using list rather than hlist makes more sense for handling simple list
I can replace hlist with list.
Post by Stephen Hemminger
2. The complexity is in how do do the netlink API.
As far as I can tell, there are two places in the driver that should
report multiple destinations via netlink. They are vxlan_fdb_dump and
vxlan_fdb_restore. These methods can traverse the remote destinations
list and send netlink notification for each list item.
Post by Stephen Hemminger
3. If we can't work this out for 3.11, the multiple remotes may have to be reverted;
I don't want the existing API to be exposed in a release
I'd really like to help to sort this out ASAP, so that I could update and resend
my patches that, by coinsidence, require list of remote destinations.
I'm not the right person to redefine the fdb API, though.

--
Sincerely yours,
Mike.
Stephen Hemminger
2013-06-03 18:26:16 UTC
Permalink
This is based on earlier step from Mike Rapoport

My changes:
1. eliminate second rcu for rdst free
2. handle possilble races where fdb notify called but no rdst
3. only update snooped entries for dynamic entries, shouldn't modify static entries
4. clean up the multi-dest tx code

This is compile tested only.

#3 is probably a bug that should be fixed differently in -stable


---
drivers/net/vxlan.c | 75 +++++++++++++++++++++++++++++++++--------------------
1 file changed, 47 insertions(+), 28 deletions(-)

--- a/drivers/net/vxlan.c 2013-06-03 11:08:39.214230482 -0700
+++ b/drivers/net/vxlan.c 2013-06-03 11:11:37.220114181 -0700
@@ -101,7 +101,7 @@ struct vxlan_rdst {
__be16 remote_port;
u32 remote_vni;
u32 remote_ifindex;
- struct vxlan_rdst *remote_next;
+ struct list_head list;
};

/* Forwarding table entry */
@@ -110,7 +110,7 @@ struct vxlan_fdb {
struct rcu_head rcu;
unsigned long updated; /* jiffies */
unsigned long used;
- struct vxlan_rdst remote;
+ struct list_head remotes;
u16 state; /* see ndm_state */
u8 flags; /* see ndm_flags */
u8 eth_addr[ETH_ALEN];
@@ -163,6 +163,12 @@ static inline struct hlist_head *vs_head
return &vn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
}

+/* First remote destination for a forwarding entry (or NULL) */
+static inline struct vxlan_rdst *first_remote(struct vxlan_fdb *fdb)
+{
+ return list_first_or_null_rcu(&fdb->remotes, struct vxlan_rdst, list);
+}
+
/* Find VXLAN socket based on network namespace and UDP port */
static struct vxlan_sock *vxlan_find_port(struct net *net, __be16 port)
{
@@ -216,10 +222,11 @@ static int vxlan_fdb_info(struct sk_buff

if (type == RTM_GETNEIGH) {
ndm->ndm_family = AF_INET;
- send_ip = rdst->remote_ip != htonl(INADDR_ANY);
+ send_ip = rdst && rdst->remote_ip != htonl(INADDR_ANY);
send_eth = !is_zero_ether_addr(fdb->eth_addr);
} else
ndm->ndm_family = AF_BRIDGE;
+
ndm->ndm_state = fdb->state;
ndm->ndm_ifindex = vxlan->dev->ifindex;
ndm->ndm_flags = fdb->flags;
@@ -228,18 +235,20 @@ static int vxlan_fdb_info(struct sk_buff
if (send_eth && nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->eth_addr))
goto nla_put_failure;

- if (send_ip && nla_put_be32(skb, NDA_DST, rdst->remote_ip))
- goto nla_put_failure;
-
- if (rdst->remote_port && rdst->remote_port != vxlan->dst_port &&
- nla_put_be16(skb, NDA_PORT, rdst->remote_port))
- goto nla_put_failure;
- if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
- nla_put_be32(skb, NDA_VNI, rdst->remote_vni))
- goto nla_put_failure;
- if (rdst->remote_ifindex &&
- nla_put_u32(skb, NDA_IFINDEX, rdst->remote_ifindex))
- goto nla_put_failure;
+ if (rdst) {
+ if (send_ip && nla_put_be32(skb, NDA_DST, rdst->remote_ip))
+ goto nla_put_failure;
+
+ if (rdst->remote_port && rdst->remote_port != vxlan->dst_port &&
+ nla_put_be16(skb, NDA_PORT, rdst->remote_port))
+ goto nla_put_failure;
+ if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
+ nla_put_be32(skb, NDA_VNI, rdst->remote_vni))
+ goto nla_put_failure;
+ if (rdst->remote_ifindex &&
+ nla_put_u32(skb, NDA_IFINDEX, rdst->remote_ifindex))
+ goto nla_put_failure;
+ }

ci.ndm_used = jiffies_to_clock_t(now - fdb->used);
ci.ndm_confirmed = 0;
@@ -268,7 +277,7 @@ static inline size_t vxlan_nlmsg_size(vo
}

static void vxlan_fdb_notify(struct vxlan_dev *vxlan,
- const struct vxlan_fdb *fdb, int type)
+ struct vxlan_fdb *fdb, int type)
{
struct net *net = dev_net(vxlan->dev);
struct sk_buff *skb;
@@ -278,7 +287,7 @@ static void vxlan_fdb_notify(struct vxla
if (skb == NULL)
goto errout;

- err = vxlan_fdb_info(skb, vxlan, fdb, 0, 0, type, 0, &fdb->remote);
+ err = vxlan_fdb_info(skb, vxlan, fdb, 0, 0, type, 0, first_remote(fdb));
if (err < 0) {
/* -EMSGSIZE implies BUG in vxlan_nlmsg_size() */
WARN_ON(err == -EMSGSIZE);
@@ -297,11 +306,16 @@ static void vxlan_ip_miss(struct net_dev
{
struct vxlan_dev *vxlan = netdev_priv(dev);
struct vxlan_fdb f;
+ struct vxlan_rdst remote;

memset(&f, 0, sizeof f);
f.state = NUD_STALE;
- f.remote.remote_ip = ipa; /* goes to NDA_DST */
- f.remote.remote_vni = VXLAN_N_VID;
+
+ remote.remote_ip = ipa; /* goes to NDA_DST */
+ remote.remote_vni = VXLAN_N_VID;
+
+ INIT_LIST_HEAD(&f.remotes);
+ list_add_rcu(&remote.list, &f.remotes);

vxlan_fdb_notify(vxlan, &f, RTM_GETNEIGH);
}
@@ -370,17 +384,17 @@ static struct vxlan_fdb *vxlan_find_mac(
static int vxlan_fdb_append(struct vxlan_fdb *f,
__be32 ip, __be16 port, __u32 vni, __u32 ifindex)
{
- struct vxlan_rdst *rd_prev, *rd;
+ struct vxlan_rdst *rd;

- rd_prev = NULL;
- for (rd = &f->remote; rd; rd = rd->remote_next) {
+ /* protected by vxlan->hash_lock */
+ list_for_each_entry(rd, &f->remotes, list) {
if (rd->remote_ip == ip &&
rd->remote_port == port &&
rd->remote_vni == vni &&
rd->remote_ifindex == ifindex)
return 0;
- rd_prev = rd;
}
+
rd = kmalloc(sizeof(*rd), GFP_ATOMIC);
if (rd == NULL)
return -ENOBUFS;
@@ -388,8 +402,9 @@ static int vxlan_fdb_append(struct vxlan
rd->remote_port = port;
rd->remote_vni = vni;
rd->remote_ifindex = ifindex;
- rd->remote_next = NULL;
- rd_prev->remote_next = rd;
+
+ list_add_tail_rcu(&rd->list, &f->remotes);
+
return 1;
}

@@ -441,16 +456,14 @@ static int vxlan_fdb_create(struct vxlan
return -ENOMEM;

notify = 1;
- f->remote.remote_ip = ip;
- f->remote.remote_port = port;
- f->remote.remote_vni = vni;
- f->remote.remote_ifindex = ifindex;
- f->remote.remote_next = NULL;
f->state = state;
f->flags = ndm_flags;
f->updated = f->used = jiffies;
+ INIT_LIST_HEAD(&f->remotes);
memcpy(f->eth_addr, mac, ETH_ALEN);

+ vxlan_fdb_append(f, ip, port, vni, ifindex);
+
++vxlan->addrcnt;
hlist_add_head_rcu(&f->hlist,
vxlan_fdb_head(vxlan, mac));
@@ -465,13 +478,10 @@ static int vxlan_fdb_create(struct vxlan
static void vxlan_fdb_free(struct rcu_head *head)
{
struct vxlan_fdb *f = container_of(head, struct vxlan_fdb, rcu);
+ struct vxlan_rdst *rd, *nd;

- while (f->remote.remote_next) {
- struct vxlan_rdst *rd = f->remote.remote_next;
-
- f->remote.remote_next = rd->remote_next;
+ list_for_each_entry_safe(rd, nd, &f->remotes, list)
kfree(rd);
- }
kfree(f);
}

@@ -577,26 +587,25 @@ static int vxlan_fdb_dump(struct sk_buff

for (h = 0; h < FDB_HASH_SIZE; ++h) {
struct vxlan_fdb *f;
- int err;

hlist_for_each_entry_rcu(f, &vxlan->fdb_head[h], hlist) {
- struct vxlan_rdst *rd;
- for (rd = &f->remote; rd; rd = rd->remote_next) {
- if (idx < cb->args[0])
- goto skip;
+ struct vxlan_rdst *rdst;

+ list_for_each_entry_rcu(rdst, &f->remotes, list) {
+ int err;
err = vxlan_fdb_info(skb, vxlan, f,
NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq,
RTM_NEWNEIGH,
- NLM_F_MULTI, rd);
+ NLM_F_MULTI, rdst);
if (err < 0)
- break;
-skip:
+ goto errout;
+
++idx;
}
}
}
+ errout:

return idx;
}
@@ -613,16 +622,24 @@ static void vxlan_snoop(struct net_devic

f = vxlan_find_mac(vxlan, src_mac);
if (likely(f)) {
- if (likely(f->remote.remote_ip == src_ip))
+ struct vxlan_rdst *remote;
+
+ /* is this a static entry, ignore snooping */
+ if (!(f->flags & NTF_SELF))
+ return;
+
+ remote = first_remote(f);
+ if (!remote || remote->remote_ip == src_ip)
return;

if (net_ratelimit())
netdev_info(dev,
"%pM migrated from %pI4 to %pI4\n",
- src_mac, &f->remote.remote_ip, &src_ip);
+ src_mac, &remote->remote_ip, &src_ip);

- f->remote.remote_ip = src_ip;
+ remote->remote_ip = src_ip;
f->updated = jiffies;
+ vxlan_fdb_notify(vxlan, f, RTM_NEWNEIGH);
} else {
/* learned new entry */
spin_lock(&vxlan->hash_lock);
@@ -856,6 +873,7 @@ static int arp_reduce(struct net_device
if (n) {
struct vxlan_fdb *f;
struct sk_buff *reply;
+ struct vxlan_rdst *remote;

if (!(n->nud_state & NUD_CONNECTED)) {
neigh_release(n);
@@ -863,7 +881,9 @@ static int arp_reduce(struct net_device
}

f = vxlan_find_mac(vxlan, n->ha);
- if (f && f->remote.remote_ip == htonl(INADDR_ANY)) {
+ if (f &&
+ (remote = first_remote(f)) &&
+ remote->remote_ip == htonl(INADDR_ANY)) {
/* bridge-local neighbor */
neigh_release(n);
goto out;
@@ -1153,9 +1173,8 @@ static netdev_tx_t vxlan_xmit(struct sk_
struct vxlan_dev *vxlan = netdev_priv(dev);
struct ethhdr *eth;
bool did_rsc = false;
- struct vxlan_rdst *rdst0, *rdst;
+ struct vxlan_rdst *rdst0;
struct vxlan_fdb *f;
- int rc1, rc;

skb_reset_mac_header(skb);
eth = eth_hdr(skb);
@@ -1173,32 +1192,40 @@ static netdev_tx_t vxlan_xmit(struct sk_
f = vxlan_find_mac(vxlan, eth->h_dest);
}

- if (f == NULL) {
+ if (f) {
+ struct vxlan_rdst *rdst;
+
+ rdst0 = NULL;
+ list_for_each_entry_rcu(rdst, &f->remotes, list) {
+ if (rdst0) {
+ struct sk_buff *skb1;
+ skb1 = skb_clone(skb, GFP_ATOMIC);
+ if (skb1)
+ vxlan_xmit_one(skb1, dev,
+ rdst0, did_rsc);
+ else
+ ++dev->stats.tx_dropped;
+ }
+ rdst0 = rdst;
+ }
+
+ if (!rdst0) {
+ /* forwarding entry but destination list empty */
+ ++dev->stats.tx_dropped;
+ kfree_skb(skb);
+ return NETDEV_TX_OK;
+ }
+
+ } else {
rdst0 = &vxlan->default_dst;

if (rdst0->remote_ip == htonl(INADDR_ANY) &&
(vxlan->flags & VXLAN_F_L2MISS) &&
!is_multicast_ether_addr(eth->h_dest))
vxlan_fdb_miss(vxlan, eth->h_dest);
- } else
- rdst0 = &f->remote;
-
- rc = NETDEV_TX_OK;
-
- /* if there are multiple destinations, send copies */
- for (rdst = rdst0->remote_next; rdst; rdst = rdst->remote_next) {
- struct sk_buff *skb1;
-
- skb1 = skb_clone(skb, GFP_ATOMIC);
- rc1 = vxlan_xmit_one(skb1, dev, rdst, did_rsc);
- if (rc == NETDEV_TX_OK)
- rc = rc1;
}

- rc1 = vxlan_xmit_one(skb, dev, rdst0, did_rsc);
- if (rc == NETDEV_TX_OK)
- rc = rc1;
- return rc;
+ return vxlan_xmit_one(skb, dev, rdst0, did_rsc);
}

/* Walk the forwarding table and purge stale entries */
David Stevens
2013-06-03 20:18:14 UTC
Permalink
--- a/drivers/net/vxlan.c 2013-06-03 11:08:39.214230482 -0700
+++ b/drivers/net/vxlan.c 2013-06-03 11:11:37.220114181 -0700
@@ -101,7 +101,7 @@ struct vxlan_rdst {
__be16 remote_port;
u32 remote_vni;
u32 remote_ifindex;
- struct vxlan_rdst *remote_next;
+ struct list_head list;
};
/* Forwarding table entry */
@@ -110,7 +110,7 @@ struct vxlan_fdb {
struct rcu_head rcu;
unsigned long updated; /* jiffies */
unsigned long used;
- struct vxlan_rdst remote;
+ struct list_head remotes;
u16 state; /* see ndm_state */
u8 flags; /* see ndm_flags */
u8 eth_addr[ETH_ALEN];
@@ -163,6 +163,12 @@ static inline struct hlist_head *vs_head
return &vn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
}
+/* First remote destination for a forwarding entry (or NULL) */
+static inline struct vxlan_rdst *first_remote(struct vxlan_fdb *fdb)
+{
+ return list_first_or_null_rcu(&fdb->remotes, struct vxlan_rdst,
list);
+}
+
/* Find VXLAN socket based on network namespace and UDP port */
static struct vxlan_sock *vxlan_find_port(struct net *net, __be16 port)
{
@@ -216,10 +222,11 @@ static int vxlan_fdb_info(struct sk_buff
if (type == RTM_GETNEIGH) {
ndm->ndm_family = AF_INET;
- send_ip = rdst->remote_ip != htonl(INADDR_ANY);
+ send_ip = rdst && rdst->remote_ip != htonl(INADDR_ANY);
send_eth = !is_zero_ether_addr(fdb->eth_addr);
} else
ndm->ndm_family = AF_BRIDGE;
rdst cannot be NULL in the original code, and no fdb entry should
ever have a NULL destination, so I'm not sure I understand why you
appear to be sending netlink messages at all if rdst is NULL.
+
ndm->ndm_state = fdb->state;
ndm->ndm_ifindex = vxlan->dev->ifindex;
ndm->ndm_flags = fdb->flags;
@@ -228,18 +235,20 @@ static int vxlan_fdb_info(struct sk_buff
if (send_eth && nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->eth_addr))
goto nla_put_failure;
- if (send_ip && nla_put_be32(skb, NDA_DST, rdst->remote_ip))
- goto nla_put_failure;
-
- if (rdst->remote_port && rdst->remote_port != vxlan->dst_port &&
- nla_put_be16(skb, NDA_PORT, rdst->remote_port))
- goto nla_put_failure;
- if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
- nla_put_be32(skb, NDA_VNI, rdst->remote_vni))
- goto nla_put_failure;
- if (rdst->remote_ifindex &&
- nla_put_u32(skb, NDA_IFINDEX, rdst->remote_ifindex))
- goto nla_put_failure;
+ if (rdst) {
+ if (send_ip && nla_put_be32(skb, NDA_DST, rdst->remote_ip))
+ goto nla_put_failure;
+
+ if (rdst->remote_port && rdst->remote_port != vxlan->dst_port &&
+ nla_put_be16(skb, NDA_PORT, rdst->remote_port))
+ goto nla_put_failure;
+ if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
+ nla_put_be32(skb, NDA_VNI, rdst->remote_vni))
+ goto nla_put_failure;
+ if (rdst->remote_ifindex &&
+ nla_put_u32(skb, NDA_IFINDEX, rdst->remote_ifindex))
+ goto nla_put_failure;
+ }
ci.ndm_used = jiffies_to_clock_t(now - fdb->used);
ci.ndm_confirmed = 0;
Again, if rdst == NULL, why are we sending anything? It should never
happen, so should be an error case if you check at all, shouldn't it?
@@ -1173,32 +1192,40 @@ static netdev_tx_t vxlan_xmit(struct sk_
f = vxlan_find_mac(vxlan, eth->h_dest);
}
- if (f == NULL) {
+ if (f) {
+ struct vxlan_rdst *rdst;
+
+ rdst0 = NULL;
+ list_for_each_entry_rcu(rdst, &f->remotes, list) {
+ if (rdst0) {
+ struct sk_buff *skb1;
+ skb1 = skb_clone(skb, GFP_ATOMIC);
+ if (skb1)
+ vxlan_xmit_one(skb1, dev,
+ rdst0, did_rsc);
This ignores the return value of vxlan_xmit_one(); the original
code returns an error if any of the destinations fail, while this
code ignores errors for all but the last destination.
+ else
+ ++dev->stats.tx_dropped;
+ }
+ rdst0 = rdst;
+ }
+
+ if (!rdst0) {
+ /* forwarding entry but destination list empty */
+ ++dev->stats.tx_dropped;
+ kfree_skb(skb);
+ return NETDEV_TX_OK;
+ }
This should be a panic, in the original code -- it can't
happen unless the fdb table is corrupted.
1. eliminate second rcu for rdst free
Why is this a good thing?
2. handle possilble races where fdb notify called but no rdst
How can an fdb entry ever not have an rdst? Isn't that a reason
to remove the fdb entry? I don't count "0.0.0.0" as not having an
rdst, since that is explicitly used for ARP reduction suppression,
but rather I mean: when can a netlink message adding an fdb entry
not have an NDA_DST (even if it is 0.0.0.0) and why would that not
be an error to add an fdb entry without a dst?
3. only update snooped entries for dynamic entries, shouldn't modify static entries
I agree with this part.
4. clean up the multi-dest tx code
Everybody likes their own code, of course, but I think the original
was cleaner. A vxlan_rdst was part of an fdb structure because all
fdbs require a remote destination, while the clean-up changes that
to a list which may then be NULL, and in my opinion is notably less
clean. I don't know the reasoning for that-- perhaps some thread I
missed?

So, I guess I don't see what you're trying to fix here, other than
preventing snoop updates on static entries, which I agree with.

+-DLS
Stephen Hemminger
2013-06-03 20:45:12 UTC
Permalink
On Mon, 3 Jun 2013 16:18:14 -0400
Post by Mike Rapoport
--- a/drivers/net/vxlan.c 2013-06-03 11:08:39.214230482 -0700
+++ b/drivers/net/vxlan.c 2013-06-03 11:11:37.220114181 -0700
@@ -101,7 +101,7 @@ struct vxlan_rdst {
__be16 remote_port;
u32 remote_vni;
u32 remote_ifindex;
- struct vxlan_rdst *remote_next;
+ struct list_head list;
};
/* Forwarding table entry */
@@ -110,7 +110,7 @@ struct vxlan_fdb {
struct rcu_head rcu;
unsigned long updated; /* jiffies */
unsigned long used;
- struct vxlan_rdst remote;
+ struct list_head remotes;
u16 state; /* see ndm_state */
u8 flags; /* see ndm_flags */
u8 eth_addr[ETH_ALEN];
@@ -163,6 +163,12 @@ static inline struct hlist_head *vs_head
return &vn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
}
+/* First remote destination for a forwarding entry (or NULL) */
+static inline struct vxlan_rdst *first_remote(struct vxlan_fdb *fdb)
+{
+ return list_first_or_null_rcu(&fdb->remotes, struct vxlan_rdst,
list);
+}
+
/* Find VXLAN socket based on network namespace and UDP port */
static struct vxlan_sock *vxlan_find_port(struct net *net, __be16 port)
{
@@ -216,10 +222,11 @@ static int vxlan_fdb_info(struct sk_buff
if (type == RTM_GETNEIGH) {
ndm->ndm_family = AF_INET;
- send_ip = rdst->remote_ip != htonl(INADDR_ANY);
+ send_ip = rdst && rdst->remote_ip != htonl(INADDR_ANY);
send_eth = !is_zero_ether_addr(fdb->eth_addr);
} else
ndm->ndm_family = AF_BRIDGE;
rdst cannot be NULL in the original code, and no fdb entry should
ever have a NULL destination, so I'm not sure I understand why you
appear to be sending netlink messages at all if rdst is NULL.
Just being safe. need to audit to make sure not possible for all
destinations to be removed from FDB entry via netlink.
Post by Mike Rapoport
+
ndm->ndm_state = fdb->state;
ndm->ndm_ifindex = vxlan->dev->ifindex;
ndm->ndm_flags = fdb->flags;
@@ -228,18 +235,20 @@ static int vxlan_fdb_info(struct sk_buff
if (send_eth && nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->eth_addr))
goto nla_put_failure;
- if (send_ip && nla_put_be32(skb, NDA_DST, rdst->remote_ip))
- goto nla_put_failure;
-
- if (rdst->remote_port && rdst->remote_port != vxlan->dst_port &&
- nla_put_be16(skb, NDA_PORT, rdst->remote_port))
- goto nla_put_failure;
- if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
- nla_put_be32(skb, NDA_VNI, rdst->remote_vni))
- goto nla_put_failure;
- if (rdst->remote_ifindex &&
- nla_put_u32(skb, NDA_IFINDEX, rdst->remote_ifindex))
- goto nla_put_failure;
+ if (rdst) {
+ if (send_ip && nla_put_be32(skb, NDA_DST, rdst->remote_ip))
+ goto nla_put_failure;
+
+ if (rdst->remote_port && rdst->remote_port != vxlan->dst_port &&
+ nla_put_be16(skb, NDA_PORT, rdst->remote_port))
+ goto nla_put_failure;
+ if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
+ nla_put_be32(skb, NDA_VNI, rdst->remote_vni))
+ goto nla_put_failure;
+ if (rdst->remote_ifindex &&
+ nla_put_u32(skb, NDA_IFINDEX, rdst->remote_ifindex))
+ goto nla_put_failure;
+ }
ci.ndm_used = jiffies_to_clock_t(now - fdb->used);
ci.ndm_confirmed = 0;
Again, if rdst == NULL, why are we sending anything? It should never
happen, so should be an error case if you check at all, shouldn't it?
It maybe possible to create or modify existing fdb entry so no destinatons
left.
Post by Mike Rapoport
@@ -1173,32 +1192,40 @@ static netdev_tx_t vxlan_xmit(struct sk_
f = vxlan_find_mac(vxlan, eth->h_dest);
}
- if (f == NULL) {
+ if (f) {
+ struct vxlan_rdst *rdst;
+
+ rdst0 = NULL;
+ list_for_each_entry_rcu(rdst, &f->remotes, list) {
+ if (rdst0) {
+ struct sk_buff *skb1;
+ skb1 = skb_clone(skb, GFP_ATOMIC);
+ if (skb1)
+ vxlan_xmit_one(skb1, dev,
+ rdst0, did_rsc);
This ignores the return value of vxlan_xmit_one(); the original
code returns an error if any of the destinations fail, while this
code ignores errors for all but the last destination.
Return value doesn't really matter here anyway, and certainly not
in the fanout case.
Post by Mike Rapoport
+ else
+ ++dev->stats.tx_dropped;
+ }
+ rdst0 = rdst;
+ }
+
+ if (!rdst0) {
+ /* forwarding entry but destination list empty */
+ ++dev->stats.tx_dropped;
+ kfree_skb(skb);
+ return NETDEV_TX_OK;
+ }
This should be a panic, in the original code -- it can't
happen unless the fdb table is corrupted.
1. eliminate second rcu for rdst free
Why is this a good thing?
Because FDB is only freed by RCU, and don't need another RCU synchronization
pass to drop the dst entries.
Post by Mike Rapoport
2. handle possilble races where fdb notify called but no rdst
How can an fdb entry ever not have an rdst? Isn't that a reason
to remove the fdb entry? I don't count "0.0.0.0" as not having an
rdst, since that is explicitly used for ARP reduction suppression,
but rather I mean: when can a netlink message adding an fdb entry
not have an NDA_DST (even if it is 0.0.0.0) and why would that not
be an error to add an fdb entry without a dst?
3. only update snooped entries for dynamic entries, shouldn't modify
static entries
I agree with this part.
4. clean up the multi-dest tx code
Everybody likes their own code, of course, but I think the original
was cleaner. A vxlan_rdst was part of an fdb structure because all
fdbs require a remote destination, while the clean-up changes that
to a list which may then be NULL, and in my opinion is notably less
clean. I don't know the reasoning for that-- perhaps some thread I
missed?
I don't like code with _continue_ since it is often error prone.
But the original was okay.
Post by Mike Rapoport
So, I guess I don't see what you're trying to fix here, other than
preventing snoop updates on static entries, which I agree with.
+-DLS
Mike's code didn't allow delete of modification of dst entries to be
done in a safe manner.
David Stevens
2013-06-03 21:46:46 UTC
Permalink
Post by Stephen Hemminger
Just being safe. need to audit to make sure not possible for all
destinations to be removed from FDB entry via netlink.
In the original code, you can't delete destinations at all.
Because the netlink API only had fdb-entry-level operations, you
can add with the NLM_F_APPEND flag to add new destinations to
existing entries (it's an "add/create" if it didn't exist, and it
appends the new dest if it did exist and has NLM_F_APPEND set),
but an fdb delete deletes the entire fdb entry, and all destinations.
So, you can't delete individual destinations, you can only delete
all of them and rebuild it with appends to not include the entry
you don't want.
This was for compatibility rather than adding new add/del of
destinations within an fdb, it uses the existing fdb add/del and
the existing exclusive/append flag. Deleting individual destinations
from an fdb entry would require extensions for the generic fdb
management, or a separate set of netlink handlers.
Post by Stephen Hemminger
Post by David Stevens
Again, if rdst == NULL, why are we sending anything? It should never
happen, so should be an error case if you check at all, shouldn't it?
It maybe possible to create or modify existing fdb entry so no
destinatons
Post by Stephen Hemminger
left.
It shouldn't be (by design). The only way you can shorten an fdb
entry's destination list is by deleting it and rebuilding it with fewer
entries. And all fdb_add's require an NDA_DST, though that can be
0.0.0.0.
Post by Stephen Hemminger
Post by David Stevens
This ignores the return value of vxlan_xmit_one(); the original
code returns an error if any of the destinations fail, while this
code ignores errors for all but the last destination.
Return value doesn't really matter here anyway, and certainly not
in the fanout case.
My intention in the code was to count any loss in the multiple
destinations as a "drop", for the purposes of drawing attention to
an admin. This new code reports only the sate of the last destination
and ignores any errors from all the other destinations-- I don't
agree with that. If it didn't work perfectly as intended, it should
show up in stats for admin investigation.

But i don't think any of this fdb code relates to the default
destination directly, since the generic fdb code does not support
default, or any other aggregate matching. VXLAN can define the
netlink API for default destinations without affecting anything
else, because it is already a per-device feature.

+-DLS
Mike Rapoport
2013-06-04 09:18:08 UTC
Permalink
Post by David Stevens
But i don't think any of this fdb code relates to the default
destination directly, since the generic fdb code does not support
default, or any other aggregate matching. VXLAN can define the
netlink API for default destinations without affecting anything
else, because it is already a per-device feature.
+-DLS
The only relation is the 'struct vxlan_rdst' which I intended to use
for default destinations list. So, until there's a consensus about
remote destinations list in fdb, I cannot continue to work on multiple
default destinations.

--
Sincerely yours,
Mike.
David Stevens
2013-06-04 12:48:06 UTC
Permalink
Post by Mike Rapoport
Post by David Stevens
But i don't think any of this fdb code relates to the default
destination directly, since the generic fdb code does not support
default, or any other aggregate matching. VXLAN can define the
netlink API for default destinations without affecting anything
else, because it is already a per-device feature.
+-DLS
The only relation is the 'struct vxlan_rdst' which I intended to use
for default destinations list. So, until there's a consensus about
remote destinations list in fdb, I cannot continue to work on multiple
default destinations.
Yes, I agree with that.

I'd expect we could just either add a lock for the default to do
deletes, or overload vxlan_dev->hash_lock and acquire it for deletes
of the default list. [overload probably simplest]

Actually, we could make it an fdb and also have "use" and "updated"
stats that way, and/or use the existing fdb code and add an entry
with MAC "00:00:00:00:00:00" [which is disallowed via the generic
fdb code, so can't be in the fdb table otherwise].

So something like:
f = vxlan_find_mac(vxlan, mac);
did_rsc = false;

if (f && (f->flags & NTF_ROUTER ... {
...
}

if (f == NULL) {
f = vxlan_find_mac(vxlan, ALL_ZEROS_MAC);
if (f == NULL) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^
do these two lines instead of checking "default_dst" for INADDR_ANY

if (vxlan->flags & VXLAN_F_L2MISS &&
..
vxlan_fdb_miss(...);
dev->stats.tx_dropped++;
dev_kfree_skb(skb);
return NETDEV_TX_OK;
}
}

rc = NETDEV_TX_OK;

and the default will then use the same code as the fdb; can
then remove the NULL dst check in vxlan_xmit_one(), and do
fdb_adds/fdb_append/fdb_del with ALL_ZEROS_MAC as in the fdb,
but get there with a device-specific netlink message. If there
is no default, delete the all-zeros MAC fdb entry, like any
other fdb entry.

Or even extend the generic rtnetlink fdb code to not check
for all-zeros and just treat the all-zeros mac, if there, as a
default destination and use all the same code as fdb to
manage the default.

+-DLS
Mike Rapoport
2013-06-04 17:20:22 UTC
Permalink
Post by David Stevens
Actually, we could make it an fdb and also have "use" and "updated"
stats that way, and/or use the existing fdb code and add an entry
with MAC "00:00:00:00:00:00" [which is disallowed via the generic
fdb code, so can't be in the fdb table otherwise].
If I've understood you right it would be something like this:

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 8111565..b7801e3 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -80,6 +80,8 @@ MODULE_PARM_DESC(log_ecn_error, "Log packets
received with corrupted ECN");

static unsigned int vxlan_net_id;

+static u8 ALL_ZEROS_MAC[ETH_ALEN];
+
/* per UDP socket information */
struct vxlan_sock {
struct hlist_node hlist;
@@ -1025,13 +1027,10 @@ static netdev_tx_t vxlan_xmit_one(struct
sk_buff *skb, struct net_device *dev,
vni = rdst->remote_vni;
dst = rdst->remote_ip;

- if (!dst) {
- if (did_rsc) {
- /* short-circuited back to local bridge */
- vxlan_encap_bypass(skb, vxlan, vxlan);
- return NETDEV_TX_OK;
- }
- goto drop;
+ if (did_rsc) {
+ /* short-circuited back to local bridge */
+ vxlan_encap_bypass(skb, vxlan, vxlan);
+ return NETDEV_TX_OK;
}

if (!skb->encapsulation) {
@@ -1174,14 +1173,19 @@ static netdev_tx_t vxlan_xmit(struct sk_buff
*skb, struct net_device *dev)
}

if (f == NULL) {
- rdst0 = &vxlan->default_dst;
+ f = vxlan_find_mac(vxlan, ALL_ZEROS_MAC);
+ if (f == NULL) {
+ if ((vxlan->flags & VXLAN_F_L2MISS) &&
+ !is_multicast_ether_addr(eth->h_dest))
+ vxlan_fdb_miss(vxlan, eth->h_dest);
+
+ dev->stats.tx_dropped++;
+ dev_kfree_skb(skb);
+ return NETDEV_TX_OK;
+ }
+ }

- if (rdst0->remote_ip == htonl(INADDR_ANY) &&
- (vxlan->flags & VXLAN_F_L2MISS) &&
- !is_multicast_ether_addr(eth->h_dest))
- vxlan_fdb_miss(vxlan, eth->h_dest);
- } else
- rdst0 = &f->remote;
+ rdst0 = &f->remote;

rc = NETDEV_TX_OK;

@@ -1590,6 +1594,17 @@ static int vxlan_newlink(struct net *net,
struct net_device *dev,
return -EEXIST;
}

+ if (dst->remote_ip != htonl(INADDR_ANY)) {
+ err = vxlan_fdb_create(vxlan, ALL_ZEROS_MAC, dst->remote_ip,
+ NUD_REACHABLE,
+ NLM_F_EXCL|NLM_F_CREATE,
+ dst->remote_port,
+ dst->remote_vni,
+ 0, NTF_SELF);
+ if (err)
+ return err;
+ }
+
vs = vxlan_find_port(net, vxlan->dst_port);
if (vs)
++vs->refcnt;
--
Sincerely yours,
Mike.
David Stevens
2013-06-04 19:02:26 UTC
Permalink
All the tabs are removed in the code I got-- don't know if
it's your mailer or my reader....
Post by Mike Rapoport
@@ -1025,13 +1027,10 @@ static netdev_tx_t vxlan_xmit_one(struct
sk_buff *skb, struct net_device *dev,
vni = rdst->remote_vni;
dst = rdst->remote_ip;
- if (!dst) {
- if (did_rsc) {
- /* short-circuited back to local bridge */
- vxlan_encap_bypass(skb, vxlan, vxlan);
- return NETDEV_TX_OK;
- }
- goto drop;
+ if (did_rsc) {
+ /* short-circuited back to local bridge */
+ vxlan_encap_bypass(skb, vxlan, vxlan);
+ return NETDEV_TX_OK;
}
This isn't correct -- I led you into that one,
but the original code is what we'd need -- it isn't
"rdst NULL" this is checking, but really "dst == INADDR_ANY",
which is probably what it ought to be, instead of "0".
At any rate, the check there works and needs to stay;
dst == INADDR_ANY is a distinct case from rdst == NULL.

Also looks like you're missing a brace on the L2MISS check --
the miss and the drop are only if the all-zeroes lookup
fails.

But the general idea looks good. If you remove the
is_zero_ether_addr() check in rtnl_fdb_add and rtnl_fdb_del,
then the existing fdb_add/fdb_append/fdb_del code code
would allow changes, including multiple destinations, to
the default destinations as well.

But there's more to it, too -- default_dst and all its references
would need to be replaced for joining groups and the netlink
part, and removed from vxlan_dev.

+-DLS
Mike Rapoport
2013-06-05 12:53:30 UTC
Permalink
Post by David Stevens
All the tabs are removed in the code I got-- don't know if
it's your mailer or my reader....
My mailer, I afraid. Sorry about that.
Post by David Stevens
This isn't correct -- I led you into that one,
but the original code is what we'd need -- it isn't
"rdst NULL" this is checking, but really "dst == INADDR_ANY",
which is probably what it ought to be, instead of "0".
At any rate, the check there works and needs to stay;
dst == INADDR_ANY is a distinct case from rdst == NULL.
Yeah, it looked wrong to me, but it was too late in the evening for me :)
Post by David Stevens
Also looks like you're missing a brace on the L2MISS check --
the miss and the drop are only if the all-zeroes lookup
fails.
I'm not sure if we can allow empty list of default destinations. I
think that the fdb entry for ALL_ZEROS_MAC should always contain at
least one entry corresponding to current default_dst.
Post by David Stevens
But the general idea looks good. If you remove the
is_zero_ether_addr() check in rtnl_fdb_add and rtnl_fdb_del,
then the existing fdb_add/fdb_append/fdb_del code code
would allow changes, including multiple destinations, to
the default destinations as well.
Yeah, but in this case 'fdb del 00:00:00:00:00:00' kills all the
defaults and makes userspace responsible to restore those that are
still valid.
Maybe in the case of 'fdb del 00:00:00:00:00:00' we should require
NDA_DST to be present...
Another issue with making default destinations fdb with ALL_ZEROS_MAC
is that vxlan_cleanup may eventually garbage collect it.

I'll experiment some more to see if maintaining separate list of
default destinations or making them fdb comes cleaner at the end.
Post by David Stevens
But there's more to it, too -- default_dst and all its references
would need to be replaced for joining groups and the netlink
part, and removed from vxlan_dev.
Again, there should always be at least one default destination. And if
the references for default_dst should be replaced, some defaults need
to be kept in vxlan_dev for fallbacks...
Post by David Stevens
+-DLS
--
Sincerely yours,
Mike.
Mike Rapoport
2013-06-04 09:10:22 UTC
Permalink
Post by David Stevens
Post by Stephen Hemminger
4. clean up the multi-dest tx code
Everybody likes their own code, of course, but I think the original
was cleaner. A vxlan_rdst was part of an fdb structure because all
fdbs require a remote destination, while the clean-up changes that
to a list which may then be NULL, and in my opinion is notably less
clean. I don't know the reasoning for that-- perhaps some thread I
missed?
Maybe this one:
http://thread.gmane.org/gmane.linux.network/271383
Post by David Stevens
So, I guess I don't see what you're trying to fix here, other than
preventing snoop updates on static entries, which I agree with.
+-DLS
--
Sincerely yours,
Mike.
Stephen Hemminger
2013-06-04 16:00:31 UTC
Permalink
I am still not 100% convinced of the use case of providing multiple destinations.
It seems like either a solution for migrating guests, a way to do SPAN, or a weird
form of multicast.

Wouldn't it be better to use something like TC actions for these special cases and
handle it at the IP layer rather than having to add so much complexity into the
VXLAN driver itself.
David Stevens
2013-06-04 16:29:05 UTC
Permalink
Post by Stephen Hemminger
I am still not 100% convinced of the use case of providing multiple destinations.
It seems like either a solution for migrating guests, a way to do SPAN, or a weird
form of multicast.
The use is, as you would say it, a weird form of multicast. It
is to allow forwarding MAC multicasts without relying on the
underlying network having multicast routers in the same domain.
Post by Stephen Hemminger
Wouldn't it be better to use something like TC actions for these special cases and
handle it at the IP layer rather than having to add so much
complexity into the
VXLAN driver itself.
This isn't a "special case" for DOVE -- it is the mechanism
for doing MAC multicasts. I don't agree that the code is
complex -- it simply replaces what was a vxlan_rdst with a
list of vxlan_rdsts. And, no, I don't believe it can be
done with equivalent performance using external mechanisms.

Didn't we have this discussion when the patch went in?

+-DLS
Mike Rapoport
2013-06-04 17:22:02 UTC
Permalink
Post by David Stevens
Post by Stephen Hemminger
I am still not 100% convinced of the use case of providing multiple destinations.
It seems like either a solution for migrating guests, a way to do SPAN, or a weird
form of multicast.
The use is, as you would say it, a weird form of multicast. It
is to allow forwarding MAC multicasts without relying on the
underlying network having multicast routers in the same domain.
For instance, public clouds do not allow multicasts, and vxlan with
multiple destinations would be a neat SDN solution there.
Post by David Stevens
Post by Stephen Hemminger
Wouldn't it be better to use something like TC actions for these special cases and
handle it at the IP layer rather than having to add so much
complexity into the
VXLAN driver itself.
This isn't a "special case" for DOVE -- it is the mechanism
for doing MAC multicasts. I don't agree that the code is
complex -- it simply replaces what was a vxlan_rdst with a
list of vxlan_rdsts. And, no, I don't believe it can be
done with equivalent performance using external mechanisms.
Didn't we have this discussion when the patch went in?
+-DLS
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Loading...