Discussion: Netlink mmap tx security?
Andy Lutomirski
2014-05-12 21:08:20 UTC
[moving to netdev -- this is much lower impact than I thought, since
you can't set up a netlink mmap ring without global CAP_NET_ADMIN]
Disclaimer: I haven't tried to write a proof-of-concept, so I could be
wrong here.
/* Netlink messages are validated by the receiver before processing.
* In order to avoid userspace changing the contents of the message
* after validation, the socket and the ring may only be used by a
* single process, otherwise we fall back to copying.
*/
if (atomic_long_read(&sk->sk_socket->file->f_count) > 2 ||
atomic_read(&nlk->mapped) > 1)
excl = false;
1. Shouldn't that be atomic_long_read(&f_count) > 1?
2. threads
3. process_vm_writev
I wouldn't be surprised if RDMA and AIO also break this assumption.
Does anything rely on mmapped netlink tx being fast? If not, can this
code just be deleted?
For that matter, does anything rely on mmapped netlink tx at all?
(On a non-KASLR, non-SMEP system, this probably gives reliable kernel
code execution.)
For that matter, netlink_mmap_sendmsg also appears to be vulnerable to a
TOCTOU attack via hdr.
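
[Illustrative sketch: points 2 and 3 above boil down to the fact that a
second thread shares the socket's struct file and its single mapping, and
that process_vm_writev() can rewrite the pages from another process, so
neither counter in the quoted check ever exceeds its threshold. The
fragment below is not netlink-specific and every name in it is made up; it
only demonstrates the resulting check-versus-use window when a validator
reads a buffer that another thread keeps rewriting.]

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

static char shared_buf[16];             /* stands in for one TX ring frame */
static atomic_int racing = 1;

static void *flipper(void *arg)
{
    (void)arg;
    /* concurrent writer: keeps toggling the "frame" contents */
    while (atomic_load(&racing)) {
        memcpy(shared_buf, "benign request.", 16);
        memcpy(shared_buf, "hostile request", 16);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    char checked[16], used[16];

    pthread_create(&t, NULL, flipper, NULL);
    for (long i = 0; i < 100000000; i++) {
        memcpy(checked, shared_buf, 16);        /* "validate" the buffer   */
        memcpy(used, shared_buf, 16);           /* "process" it afterwards */
        if (memcmp(checked, used, 16)) {
            puts("contents changed between check and use");
            break;
        }
    }
    atomic_store(&racing, 0);
    pthread_join(t, NULL);
    return 0;
}
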
Andy Lutomirski
2014-10-11 22:29:17 UTC
Post by Andy Lutomirski
[moving to netdev -- this is much lower impact than I thought, since
you can't set up a netlink mmap ring without global CAP_NET_ADMIN]
Did anything ever happen here? Despite not being a privilege
escalation in the normal sense, it's still a bug, and it's still a
fairly easy bypass of module signatures.


--Andy
Post by Andy Lutomirski
Disclaimer: I haven't tried to write a proof-of-concept, so I could be
wrong here.
/* Netlink messages are validated by the receiver before processing.
* In order to avoid userspace changing the contents of the message
* after validation, the socket and the ring may only be used by a
* single process, otherwise we fall back to copying.
*/
if (atomic_long_read(&sk->sk_socket->file->f_count) > 2 ||
atomic_read(&nlk->mapped) > 1)
excl = false;
1. Shouldn't that be atomic_long_read(&f_count) > 1?
2. threads
3. process_vm_writev
I wouldn't be surprised if RDMA and AIO also break this assumption.
Does anything rely on mmapped netlink tx being fast? If not, can this
code just be deleted?
For that matter, does anything rely on mmapped netlink tx at all?
(On a non-KASLR, non-SMEP system, this probably gives reliable kernel
code execution.)
For that matter, netlink_mmap_sendmsg also appears to be vulnerable to a
TOCTOU attack via hdr.
David Miller
2014-10-11 23:09:57 UTC
From: Andy Lutomirski <***@amacapital.net>
Date: Sat, 11 Oct 2014 15:29:17 -0700
Post by Andy Lutomirski
Post by Andy Lutomirski
[moving to netdev -- this is much lower impact than I thought, since
you can't set up a netlink mmap ring without global CAP_NET_ADMIN]
Did anything ever happen here? Despite not being a privilege
escalation in the normal sense, it's still a bug, and it's still a
fairly easy bypass of module signatures.
We definitely still need to address this, and since Patrick has still
not responded on this issue after being given 5 months to do so, the
only reasonable path is to get rid of this feature altogether.

Thanks for reminding me.
David Miller
2014-10-14 19:19:49 UTC
From: Andy Lutomirski <***@amacapital.net>
Date: Sat, 11 Oct 2014 15:29:17 -0700
Post by Andy Lutomirski
Post by Andy Lutomirski
[moving to netdev -- this is much lower impact than I thought, since
you can't set up a netlink mmap ring without global CAP_NET_ADMIN]
Did anything ever happen here? Despite not being a privilege
escalation in the normal sense, it's still a bug, and it's still a
fairly easy bypass of module signatures.
Andy, please review:

====================
[PATCH] netlink: Remove TX mmap support.

There is no reasonable manner in which to absolutely make sure that another
thread of control cannot write to the pages in the mmap()'d area and thus
make sure that netlink messages do not change underneath us after we've
performed verifications.

Reported-by: Andy Lutomirski <***@amacapital.net>
Signed-off-by: David S. Miller <***@davemloft.net>
---
net/netlink/af_netlink.c | 135 ++---------------------------------------------
1 file changed, 5 insertions(+), 130 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c416725..771e6c0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -289,11 +289,6 @@ static bool netlink_rx_is_mmaped(struct sock *sk)
return nlk_sk(sk)->rx_ring.pg_vec != NULL;
}

-static bool netlink_tx_is_mmaped(struct sock *sk)
-{
- return nlk_sk(sk)->tx_ring.pg_vec != NULL;
-}
-
static __pure struct page *pgvec_to_page(const void *addr)
{
if (is_vmalloc_addr(addr))
@@ -662,13 +657,6 @@ static unsigned int netlink_poll(struct file *file, struct socket *sock,
}
spin_unlock_bh(&sk->sk_receive_queue.lock);

- spin_lock_bh(&sk->sk_write_queue.lock);
- if (nlk->tx_ring.pg_vec) {
- if (netlink_current_frame(&nlk->tx_ring, NL_MMAP_STATUS_UNUSED))
- mask |= POLLOUT | POLLWRNORM;
- }
- spin_unlock_bh(&sk->sk_write_queue.lock);
-
return mask;
}

@@ -698,104 +686,6 @@ static void netlink_ring_setup_skb(struct sk_buff *skb, struct sock *sk,
NETLINK_CB(skb).sk = sk;
}

-static int netlink_mmap_sendmsg(struct sock *sk, struct msghdr *msg,
- u32 dst_portid, u32 dst_group,
- struct sock_iocb *siocb)
-{
- struct netlink_sock *nlk = nlk_sk(sk);
- struct netlink_ring *ring;
- struct nl_mmap_hdr *hdr;
- struct sk_buff *skb;
- unsigned int maxlen;
- bool excl = true;
- int err = 0, len = 0;
-
- /* Netlink messages are validated by the receiver before processing.
- * In order to avoid userspace changing the contents of the message
- * after validation, the socket and the ring may only be used by a
- * single process, otherwise we fall back to copying.
- */
- if (atomic_long_read(&sk->sk_socket->file->f_count) > 2 ||
- atomic_read(&nlk->mapped) > 1)
- excl = false;
-
- mutex_lock(&nlk->pg_vec_lock);
-
- ring = &nlk->tx_ring;
- maxlen = ring->frame_size - NL_MMAP_HDRLEN;
-
- do {
- hdr = netlink_current_frame(ring, NL_MMAP_STATUS_VALID);
- if (hdr == NULL) {
- if (!(msg->msg_flags & MSG_DONTWAIT) &&
- atomic_read(&nlk->tx_ring.pending))
- schedule();
- continue;
- }
- if (hdr->nm_len > maxlen) {
- err = -EINVAL;
- goto out;
- }
-
- netlink_frame_flush_dcache(hdr);
-
- if (likely(dst_portid == 0 && dst_group == 0 && excl)) {
- skb = alloc_skb_head(GFP_KERNEL);
- if (skb == NULL) {
- err = -ENOBUFS;
- goto out;
- }
- sock_hold(sk);
- netlink_ring_setup_skb(skb, sk, ring, hdr);
- NETLINK_CB(skb).flags |= NETLINK_SKB_TX;
- __skb_put(skb, hdr->nm_len);
- netlink_set_status(hdr, NL_MMAP_STATUS_RESERVED);
- atomic_inc(&ring->pending);
- } else {
- skb = alloc_skb(hdr->nm_len, GFP_KERNEL);
- if (skb == NULL) {
- err = -ENOBUFS;
- goto out;
- }
- __skb_put(skb, hdr->nm_len);
- memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
- netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- }
-
- netlink_increment_head(ring);
-
- NETLINK_CB(skb).portid = nlk->portid;
- NETLINK_CB(skb).dst_group = dst_group;
- NETLINK_CB(skb).creds = siocb->scm->creds;
-
- err = security_netlink_send(sk, skb);
- if (err) {
- kfree_skb(skb);
- goto out;
- }
-
- if (unlikely(dst_group)) {
- atomic_inc(&skb->users);
- netlink_broadcast(sk, skb, dst_portid, dst_group,
- GFP_KERNEL);
- }
- err = netlink_unicast(sk, skb, dst_portid,
- msg->msg_flags & MSG_DONTWAIT);
- if (err < 0)
- goto out;
- len += err;
-
- } while (hdr != NULL ||
- (!(msg->msg_flags & MSG_DONTWAIT) &&
- atomic_read(&nlk->tx_ring.pending)));
-
- if (len > 0)
- err = len;
-out:
- mutex_unlock(&nlk->pg_vec_lock);
- return err;
-}
-
static void netlink_queue_mmaped_skb(struct sock *sk, struct sk_buff *skb)
{
struct nl_mmap_hdr *hdr;
@@ -842,10 +732,8 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
#else /* CONFIG_NETLINK_MMAP */
#define netlink_skb_is_mmaped(skb) false
#define netlink_rx_is_mmaped(sk) false
-#define netlink_tx_is_mmaped(sk) false
#define netlink_mmap sock_no_mmap
#define netlink_poll datagram_poll
-#define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, siocb) 0
#endif /* CONFIG_NETLINK_MMAP */

static void netlink_skb_destructor(struct sk_buff *skb)
@@ -864,16 +752,11 @@ static void netlink_skb_destructor(struct sk_buff *skb)
hdr = netlink_mmap_hdr(skb);
sk = NETLINK_CB(skb).sk;

- if (NETLINK_CB(skb).flags & NETLINK_SKB_TX) {
- netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- ring = &nlk_sk(sk)->tx_ring;
- } else {
- if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
- hdr->nm_len = 0;
- netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
- }
- ring = &nlk_sk(sk)->rx_ring;
+ if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
+ hdr->nm_len = 0;
+ netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
}
+ ring = &nlk_sk(sk)->rx_ring;

WARN_ON(atomic_read(&ring->pending) == 0);
atomic_dec(&ring->pending);
@@ -2165,8 +2048,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
err = 0;
break;
#ifdef CONFIG_NETLINK_MMAP
- case NETLINK_RX_RING:
- case NETLINK_TX_RING: {
+ case NETLINK_RX_RING: {
struct nl_mmap_req req;

/* Rings might consume more memory than queue limits, require
@@ -2295,13 +2177,6 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
goto out;
}

- if (netlink_tx_is_mmaped(sk) &&
- msg->msg_iov->iov_base == NULL) {
- err = netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group,
- siocb);
- goto out;
- }
-
err = -EMSGSIZE;
if (len > sk->sk_sndbuf - 32)
goto out;
--
1.7.11.7
Andy Lutomirski
2014-10-14 19:33:43 UTC
Post by David Miller
Date: Sat, 11 Oct 2014 15:29:17 -0700
Post by Andy Lutomirski
Post by Andy Lutomirski
[moving to netdev -- this is much lower impact than I thought, since
you can't set up a netlink mmap ring without global CAP_NET_ADMIN]
Did anything ever happen here? Despite not being a privilege
escalation in the normal sense, it's still a bug, and it's still a
fairly easy bypass of module signatures.
====================
[PATCH] netlink: Remove TX mmap support.
There is no reasonable manner in which to absolutely make sure that another
thread of control cannot write to the pages in the mmap()'d area and thus
make sure that netlink messages do not change underneath us after we've
performed verifications.
For full honesty, there is now the machinery developed for memfd
sealing, but I can't imagine that this is ever faster than just
copying the buffer.
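
[Illustrative sketch of the memfd sealing machinery referred to above:
memfd_create() plus fcntl(F_ADD_SEALS, F_SEAL_WRITE), merged around v3.17.
Nothing in the netlink ring code ever used it; error handling is omitted,
the fallback #defines cover older libc headers, and SYS_memfd_create is
assumed to be known to the installed kernel headers.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_ALLOW_SEALING               /* fallbacks for older libc headers */
#define MFD_ALLOW_SEALING 0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS  (1024 + 9)
#define F_SEAL_SEAL  0x0001
#define F_SEAL_WRITE 0x0008
#endif

int main(void)
{
    int fd = syscall(SYS_memfd_create, "nl-msg", MFD_ALLOW_SEALING);
    char *buf;

    ftruncate(fd, 4096);
    buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(buf, "message to be validated");
    munmap(buf, 4096);          /* F_SEAL_WRITE refuses writable mappings */

    /* From here on neither write() nor a new writable mapping can alter
     * the contents, so a validator could safely trust what it reads. */
    fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_SEAL);
    return 0;
}
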

I think that the NETLINK_SKB_TX declaration in include/linux/netlink.h
should probably go, too. And there's the last parameter to
netlink_set_ring, too, and possibly even the nlk->tx_ring struct
itself.

--Andy
Post by David Miller
---
net/netlink/af_netlink.c | 135 ++---------------------------------------------
1 file changed, 5 insertions(+), 130 deletions(-)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c416725..771e6c0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -289,11 +289,6 @@ static bool netlink_rx_is_mmaped(struct sock *sk)
return nlk_sk(sk)->rx_ring.pg_vec != NULL;
}
-static bool netlink_tx_is_mmaped(struct sock *sk)
-{
- return nlk_sk(sk)->tx_ring.pg_vec != NULL;
-}
-
static __pure struct page *pgvec_to_page(const void *addr)
{
if (is_vmalloc_addr(addr))
@@ -662,13 +657,6 @@ static unsigned int netlink_poll(struct file *file, struct socket *sock,
}
spin_unlock_bh(&sk->sk_receive_queue.lock);
- spin_lock_bh(&sk->sk_write_queue.lock);
- if (nlk->tx_ring.pg_vec) {
- if (netlink_current_frame(&nlk->tx_ring, NL_MMAP_STATUS_UNUSED))
- mask |= POLLOUT | POLLWRNORM;
- }
- spin_unlock_bh(&sk->sk_write_queue.lock);
-
return mask;
}
@@ -698,104 +686,6 @@ static void netlink_ring_setup_skb(struct sk_buff *skb, struct sock *sk,
NETLINK_CB(skb).sk = sk;
}
-static int netlink_mmap_sendmsg(struct sock *sk, struct msghdr *msg,
- u32 dst_portid, u32 dst_group,
- struct sock_iocb *siocb)
-{
- struct netlink_sock *nlk = nlk_sk(sk);
- struct netlink_ring *ring;
- struct nl_mmap_hdr *hdr;
- struct sk_buff *skb;
- unsigned int maxlen;
- bool excl = true;
- int err = 0, len = 0;
-
- /* Netlink messages are validated by the receiver before processing.
- * In order to avoid userspace changing the contents of the message
- * after validation, the socket and the ring may only be used by a
- * single process, otherwise we fall back to copying.
- */
- if (atomic_long_read(&sk->sk_socket->file->f_count) > 2 ||
- atomic_read(&nlk->mapped) > 1)
- excl = false;
-
- mutex_lock(&nlk->pg_vec_lock);
-
- ring = &nlk->tx_ring;
- maxlen = ring->frame_size - NL_MMAP_HDRLEN;
-
- do {
- hdr = netlink_current_frame(ring, NL_MMAP_STATUS_VALID);
- if (hdr == NULL) {
- if (!(msg->msg_flags & MSG_DONTWAIT) &&
- atomic_read(&nlk->tx_ring.pending))
- schedule();
- continue;
- }
- if (hdr->nm_len > maxlen) {
- err = -EINVAL;
- goto out;
- }
-
- netlink_frame_flush_dcache(hdr);
-
- if (likely(dst_portid == 0 && dst_group == 0 && excl)) {
- skb = alloc_skb_head(GFP_KERNEL);
- if (skb == NULL) {
- err = -ENOBUFS;
- goto out;
- }
- sock_hold(sk);
- netlink_ring_setup_skb(skb, sk, ring, hdr);
- NETLINK_CB(skb).flags |= NETLINK_SKB_TX;
- __skb_put(skb, hdr->nm_len);
- netlink_set_status(hdr, NL_MMAP_STATUS_RESERVED);
- atomic_inc(&ring->pending);
- } else {
- skb = alloc_skb(hdr->nm_len, GFP_KERNEL);
- if (skb == NULL) {
- err = -ENOBUFS;
- goto out;
- }
- __skb_put(skb, hdr->nm_len);
- memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
- netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- }
-
- netlink_increment_head(ring);
-
- NETLINK_CB(skb).portid = nlk->portid;
- NETLINK_CB(skb).dst_group = dst_group;
- NETLINK_CB(skb).creds = siocb->scm->creds;
-
- err = security_netlink_send(sk, skb);
- if (err) {
- kfree_skb(skb);
- goto out;
- }
-
- if (unlikely(dst_group)) {
- atomic_inc(&skb->users);
- netlink_broadcast(sk, skb, dst_portid, dst_group,
- GFP_KERNEL);
- }
- err = netlink_unicast(sk, skb, dst_portid,
- msg->msg_flags & MSG_DONTWAIT);
- if (err < 0)
- goto out;
- len += err;
-
- } while (hdr != NULL ||
- (!(msg->msg_flags & MSG_DONTWAIT) &&
- atomic_read(&nlk->tx_ring.pending)));
-
- if (len > 0)
- err = len;
-out:
- mutex_unlock(&nlk->pg_vec_lock);
- return err;
-}
-
static void netlink_queue_mmaped_skb(struct sock *sk, struct sk_buff *skb)
{
struct nl_mmap_hdr *hdr;
@@ -842,10 +732,8 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
#else /* CONFIG_NETLINK_MMAP */
#define netlink_skb_is_mmaped(skb) false
#define netlink_rx_is_mmaped(sk) false
-#define netlink_tx_is_mmaped(sk) false
#define netlink_mmap sock_no_mmap
#define netlink_poll datagram_poll
-#define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, siocb) 0
#endif /* CONFIG_NETLINK_MMAP */
static void netlink_skb_destructor(struct sk_buff *skb)
@@ -864,16 +752,11 @@ static void netlink_skb_destructor(struct sk_buff *skb)
hdr = netlink_mmap_hdr(skb);
sk = NETLINK_CB(skb).sk;
- if (NETLINK_CB(skb).flags & NETLINK_SKB_TX) {
- netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- ring = &nlk_sk(sk)->tx_ring;
- } else {
- if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
- hdr->nm_len = 0;
- netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
- }
- ring = &nlk_sk(sk)->rx_ring;
+ if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
+ hdr->nm_len = 0;
+ netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
}
+ ring = &nlk_sk(sk)->rx_ring;
WARN_ON(atomic_read(&ring->pending) == 0);
atomic_dec(&ring->pending);
@@ -2165,8 +2048,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
err = 0;
break;
#ifdef CONFIG_NETLINK_MMAP
- case NETLINK_RX_RING:
- case NETLINK_TX_RING: {
+ case NETLINK_RX_RING: {
struct nl_mmap_req req;
/* Rings might consume more memory than queue limits, require
@@ -2295,13 +2177,6 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
goto out;
}
- if (netlink_tx_is_mmaped(sk) &&
- msg->msg_iov->iov_base == NULL) {
- err = netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group,
- siocb);
- goto out;
- }
-
err = -EMSGSIZE;
if (len > sk->sk_sndbuf - 32)
goto out;
--
1.7.11.7
--
Andy Lutomirski
AMA Capital Management, LLC
David Miller
2014-10-14 20:00:56 UTC
From: Andy Lutomirski <***@amacapital.net>
Date: Tue, 14 Oct 2014 12:33:43 -0700
Post by Andy Lutomirski
For full honesty, there is now the machinery developed for memfd
sealing, but I can't imagine that this is ever faster than just
copying the buffer.
I don't have much motivation to even check if it's a worthwhile
pursuit at this point.

Someone who wants to can :-)
Post by Andy Lutomirski
I think that the NETLINK_SKB_TX declaration in include/linux/netlink.h
should probably go, too. And there's the last parameter to
netlink_set_ring, too, and possibly even the nlk->tx_ring struct
itself.
Agreed on all counts, here is the new patch:

====================
[PATCH] netlink: Remove TX mmap support.

There is no reasonable manner in which to absolutely make sure that another
thread of control cannot write to the pages in the mmap()'d area and thus
make sure that netlink messages do not change underneath us after we've
performed verifications.

Reported-by: Andy Lutomirski <***@amacapital.net>
Signed-off-by: David S. Miller <***@davemloft.net>
---
include/linux/netlink.h | 5 +-
net/netlink/af_netlink.c | 161 +++++------------------------------------------
net/netlink/af_netlink.h | 1 -
3 files changed, 16 insertions(+), 151 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 9e572da..57080a9 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -17,9 +17,8 @@ static inline struct nlmsghdr *nlmsg_hdr(const struct sk_buff *skb)

enum netlink_skb_flags {
NETLINK_SKB_MMAPED = 0x1, /* Packet data is mmaped */
- NETLINK_SKB_TX = 0x2, /* Packet was sent by userspace */
- NETLINK_SKB_DELIVERED = 0x4, /* Packet was delivered */
- NETLINK_SKB_DST = 0x8, /* Dst set in sendto or sendmsg */
+ NETLINK_SKB_DELIVERED = 0x2, /* Packet was delivered */
+ NETLINK_SKB_DST = 0x4, /* Dst set in sendto or sendmsg */
};

struct netlink_skb_parms {
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c416725..07ef0c9 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -289,11 +289,6 @@ static bool netlink_rx_is_mmaped(struct sock *sk)
return nlk_sk(sk)->rx_ring.pg_vec != NULL;
}

-static bool netlink_tx_is_mmaped(struct sock *sk)
-{
- return nlk_sk(sk)->tx_ring.pg_vec != NULL;
-}
-
static __pure struct page *pgvec_to_page(const void *addr)
{
if (is_vmalloc_addr(addr))
@@ -359,7 +354,7 @@ err1:
}

static int netlink_set_ring(struct sock *sk, struct nl_mmap_req *req,
- bool closing, bool tx_ring)
+ bool closing)
{
struct netlink_sock *nlk = nlk_sk(sk);
struct netlink_ring *ring;
@@ -368,8 +363,8 @@ static int netlink_set_ring(struct sock *sk, struct nl_mmap_req *req,
unsigned int order = 0;
int err;

- ring = tx_ring ? &nlk->tx_ring : &nlk->rx_ring;
- queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
+ ring = &nlk->rx_ring;
+ queue = &sk->sk_receive_queue;

if (!closing) {
if (atomic_read(&nlk->mapped))
@@ -476,11 +471,9 @@ static int netlink_mmap(struct file *file, struct socket *sock,
mutex_lock(&nlk->pg_vec_lock);

expected = 0;
- for (ring = &nlk->rx_ring; ring <= &nlk->tx_ring; ring++) {
- if (ring->pg_vec == NULL)
- continue;
+ ring = &nlk->rx_ring;
+ if (ring->pg_vec)
expected += ring->pg_vec_len * ring->pg_vec_pages * PAGE_SIZE;
- }

if (expected == 0)
goto out;
@@ -490,10 +483,8 @@ static int netlink_mmap(struct file *file, struct socket *sock,
goto out;

start = vma->vm_start;
- for (ring = &nlk->rx_ring; ring <= &nlk->tx_ring; ring++) {
- if (ring->pg_vec == NULL)
- continue;
-
+ ring = &nlk->rx_ring;
+ if (ring->pg_vec) {
for (i = 0; i < ring->pg_vec_len; i++) {
struct page *page;
void *kaddr = ring->pg_vec[i];
@@ -662,13 +653,6 @@ static unsigned int netlink_poll(struct file *file, struct socket *sock,
}
spin_unlock_bh(&sk->sk_receive_queue.lock);

- spin_lock_bh(&sk->sk_write_queue.lock);
- if (nlk->tx_ring.pg_vec) {
- if (netlink_current_frame(&nlk->tx_ring, NL_MMAP_STATUS_UNUSED))
- mask |= POLLOUT | POLLWRNORM;
- }
- spin_unlock_bh(&sk->sk_write_queue.lock);
-
return mask;
}

@@ -698,104 +682,6 @@ static void netlink_ring_setup_skb(struct sk_buff *skb, struct sock *sk,
NETLINK_CB(skb).sk = sk;
}

-static int netlink_mmap_sendmsg(struct sock *sk, struct msghdr *msg,
- u32 dst_portid, u32 dst_group,
- struct sock_iocb *siocb)
-{
- struct netlink_sock *nlk = nlk_sk(sk);
- struct netlink_ring *ring;
- struct nl_mmap_hdr *hdr;
- struct sk_buff *skb;
- unsigned int maxlen;
- bool excl = true;
- int err = 0, len = 0;
-
- /* Netlink messages are validated by the receiver before processing.
- * In order to avoid userspace changing the contents of the message
- * after validation, the socket and the ring may only be used by a
- * single process, otherwise we fall back to copying.
- */
- if (atomic_long_read(&sk->sk_socket->file->f_count) > 2 ||
- atomic_read(&nlk->mapped) > 1)
- excl = false;
-
- mutex_lock(&nlk->pg_vec_lock);
-
- ring = &nlk->tx_ring;
- maxlen = ring->frame_size - NL_MMAP_HDRLEN;
-
- do {
- hdr = netlink_current_frame(ring, NL_MMAP_STATUS_VALID);
- if (hdr == NULL) {
- if (!(msg->msg_flags & MSG_DONTWAIT) &&
- atomic_read(&nlk->tx_ring.pending))
- schedule();
- continue;
- }
- if (hdr->nm_len > maxlen) {
- err = -EINVAL;
- goto out;
- }
-
- netlink_frame_flush_dcache(hdr);
-
- if (likely(dst_portid == 0 && dst_group == 0 && excl)) {
- skb = alloc_skb_head(GFP_KERNEL);
- if (skb == NULL) {
- err = -ENOBUFS;
- goto out;
- }
- sock_hold(sk);
- netlink_ring_setup_skb(skb, sk, ring, hdr);
- NETLINK_CB(skb).flags |= NETLINK_SKB_TX;
- __skb_put(skb, hdr->nm_len);
- netlink_set_status(hdr, NL_MMAP_STATUS_RESERVED);
- atomic_inc(&ring->pending);
- } else {
- skb = alloc_skb(hdr->nm_len, GFP_KERNEL);
- if (skb == NULL) {
- err = -ENOBUFS;
- goto out;
- }
- __skb_put(skb, hdr->nm_len);
- memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
- netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- }
-
- netlink_increment_head(ring);
-
- NETLINK_CB(skb).portid = nlk->portid;
- NETLINK_CB(skb).dst_group = dst_group;
- NETLINK_CB(skb).creds = siocb->scm->creds;
-
- err = security_netlink_send(sk, skb);
- if (err) {
- kfree_skb(skb);
- goto out;
- }
-
- if (unlikely(dst_group)) {
- atomic_inc(&skb->users);
- netlink_broadcast(sk, skb, dst_portid, dst_group,
- GFP_KERNEL);
- }
- err = netlink_unicast(sk, skb, dst_portid,
- msg->msg_flags & MSG_DONTWAIT);
- if (err < 0)
- goto out;
- len += err;
-
- } while (hdr != NULL ||
- (!(msg->msg_flags & MSG_DONTWAIT) &&
- atomic_read(&nlk->tx_ring.pending)));
-
- if (len > 0)
- err = len;
-out:
- mutex_unlock(&nlk->pg_vec_lock);
- return err;
-}
-
static void netlink_queue_mmaped_skb(struct sock *sk, struct sk_buff *skb)
{
struct nl_mmap_hdr *hdr;
@@ -842,10 +728,8 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
#else /* CONFIG_NETLINK_MMAP */
#define netlink_skb_is_mmaped(skb) false
#define netlink_rx_is_mmaped(sk) false
-#define netlink_tx_is_mmaped(sk) false
#define netlink_mmap sock_no_mmap
#define netlink_poll datagram_poll
-#define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, siocb) 0
#endif /* CONFIG_NETLINK_MMAP */

static void netlink_skb_destructor(struct sk_buff *skb)
@@ -864,16 +748,11 @@ static void netlink_skb_destructor(struct sk_buff *skb)
hdr = netlink_mmap_hdr(skb);
sk = NETLINK_CB(skb).sk;

- if (NETLINK_CB(skb).flags & NETLINK_SKB_TX) {
- netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- ring = &nlk_sk(sk)->tx_ring;
- } else {
- if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
- hdr->nm_len = 0;
- netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
- }
- ring = &nlk_sk(sk)->rx_ring;
+ if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
+ hdr->nm_len = 0;
+ netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
}
+ ring = &nlk_sk(sk)->rx_ring;

WARN_ON(atomic_read(&ring->pending) == 0);
atomic_dec(&ring->pending);
@@ -921,10 +800,7 @@ static void netlink_sock_destruct(struct sock *sk)

memset(&req, 0, sizeof(req));
if (nlk->rx_ring.pg_vec)
- netlink_set_ring(sk, &req, true, false);
- memset(&req, 0, sizeof(req));
- if (nlk->tx_ring.pg_vec)
- netlink_set_ring(sk, &req, true, true);
+ netlink_set_ring(sk, &req, true);
}
#endif /* CONFIG_NETLINK_MMAP */

@@ -2165,8 +2041,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
err = 0;
break;
#ifdef CONFIG_NETLINK_MMAP
- case NETLINK_RX_RING:
- case NETLINK_TX_RING: {
+ case NETLINK_RX_RING: {
struct nl_mmap_req req;

/* Rings might consume more memory than queue limits, require
@@ -2178,8 +2053,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
return -EINVAL;
if (copy_from_user(&req, optval, sizeof(req)))
return -EFAULT;
- err = netlink_set_ring(sk, &req, false,
- optname == NETLINK_TX_RING);
+ err = netlink_set_ring(sk, &req, false);
break;
}
#endif /* CONFIG_NETLINK_MMAP */
@@ -2295,13 +2169,6 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
goto out;
}

- if (netlink_tx_is_mmaped(sk) &&
- msg->msg_iov->iov_base == NULL) {
- err = netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group,
- siocb);
- goto out;
- }
-
err = -EMSGSIZE;
if (len > sk->sk_sndbuf - 32)
goto out;
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index b20a173..4741b88 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -45,7 +45,6 @@ struct netlink_sock {
#ifdef CONFIG_NETLINK_MMAP
struct mutex pg_vec_lock;
struct netlink_ring rx_ring;
- struct netlink_ring tx_ring;
atomic_t mapped;
#endif /* CONFIG_NETLINK_MMAP */
--
1.7.11.7
Andy Lutomirski
2014-10-14 22:16:46 UTC
Post by David Miller
Date: Tue, 14 Oct 2014 12:33:43 -0700
Post by Andy Lutomirski
For full honesty, there is now the machinery developed for memfd
sealing, but I can't imagine that this is ever faster than just
copying the buffer.
I don't have much motivation to even check if it's a worthwhile
pursuit at this point.
Someone who wants to can :-)
Post by Andy Lutomirski
I think that the NETLINK_SKB_TX declaration in include/linux/netlink.h
should probably go, too. And there's the last parameter to
netlink_set_ring, too, and possibly even the nlk->tx_ring struct
itself.
====================
[PATCH] netlink: Remove TX mmap support.
There is no reasonable manner in which to absolutely make sure that another
thread of control cannot write to the pages in the mmap()'d area and thus
make sure that netlink messages do not change underneath us after we've
performed verifications.
Looks sensible to me, but I have no idea how to test it.

It's at least remotely possible that there's something that assumes
that the availability of NETLINK_RX_RING implies
NETLINK_TX_RING, which would be unfortunate.

--Andy
David Miller
2014-10-15 02:01:11 UTC
From: Andy Lutomirski <***@amacapital.net>
Date: Tue, 14 Oct 2014 15:16:46 -0700
Post by Andy Lutomirski
It's at least remotely possible that there's something that assumes
that the availability of NETLINK_RX_RING implies
NETLINK_TX_RING, which would be unfortunate.
I already found one such case, nlmon :-/

It also reminds me that I'll have to update
Documentation/networking/netlink_mmap.txt

Thomas, the context is that we have to remove NETLINK_TX_RING support
(there is absolutely no way whatsoever to reliably keep some thread of
control from modifying the underlying pages while we parse and
validate the netlink request).

I'd like to be able to do so while retaining NETLINK_RX_RING because
that works fine and is great for monitoring when the rate of events
is high.

But I already have found userland pieces of code, like nlmon, which
assume that if one is present then both must be present.

I really think this means I'll have to remove all of the netlink
mmap() support in order to avoid breaking applications. :(

The other option is to keep NETLINK_TX_RING, but copy the data into
a kernel side buffer before acting upon it.
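
[Illustrative sketch of the NETLINK_RX_RING consumer pattern at stake here,
following the old Documentation/networking/netlink_mmap.txt ABI; the
nl_mmap_* definitions only exist in CONFIG_NETLINK_MMAP-era headers,
handle_msg() is an assumed application callback, and socket setup plus
error handling are omitted.]

#include <linux/netlink.h>  /* nl_mmap_req/hdr, NL_MMAP_*, NETLINK_RX_RING */
#include <poll.h>
#include <sys/mman.h>
#include <sys/socket.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

extern void handle_msg(struct nlmsghdr *nlh, unsigned int len); /* app-defined */

static void rx_ring_loop(int fd)
{
    struct nl_mmap_req req = {
        .nm_block_size = 4096, .nm_block_nr = 64,
        .nm_frame_size = 4096, .nm_frame_nr = 64,
    };
    unsigned int ring_size = req.nm_block_nr * req.nm_block_size;
    unsigned int off = 0;
    char *ring;

    setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req));
    ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLERR };

        poll(&pfd, 1, -1);
        for (;;) {
            struct nl_mmap_hdr *hdr = (struct nl_mmap_hdr *)(ring + off);

            if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
                /* payload lives in the ring itself; nm_len == 0 is an
                 * empty frame that is simply handed back */
                if (hdr->nm_len)
                    handle_msg((struct nlmsghdr *)
                               ((char *)hdr + NL_MMAP_HDRLEN), hdr->nm_len);
            } else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
                /* frame was too large for the ring, fetch a copy instead */
                char buf[16384];
                int len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
                handle_msg((struct nlmsghdr *)buf, len);
            } else {
                break;                  /* ring drained, go back to poll() */
            }
            /* release the frame to the kernel and advance */
            hdr->nm_status = NL_MMAP_STATUS_UNUSED;
            off = (off + req.nm_frame_size) % ring_size;
        }
    }
}
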
Andy Lutomirski
2014-10-15 02:03:11 UTC
Post by David Miller
Date: Tue, 14 Oct 2014 15:16:46 -0700
Post by Andy Lutomirski
It's at least remotely possible that there's something that assumes
that the availability of NETLINK_RX_RING implies
NETLINK_TX_RING, which would be unfortunate.
I already found one such case, nlmon :-/
It also reminds me that I'll have to update
Documentation/networking/netlink_mmap.txt
Thomas, the context is that we have to remove NETLINK_TX_RING support
(there is absolutely no way whatsoever to reliably keep some thread of
control from modifying the underlying pages while we parse and
validate the netlink request).
I'd like to be able to do so while retaining NETLINK_RX_RING because
that works fine and is great for monitoring when the rate of events
is high.
But I already have found userland pieces of code, like nlmon, which
assume that if one is present then both must be present.
I really think this means I'll have to remove all of the netlink
mmap() support in order to avoid breaking applications. :(
The other option is to keep NETLINK_TX_RING, but copy the data into
a kernel side buffer before acting upon it.
Option 3, which sucks but maybe not that badly: change the value of
NETLINK_RX_RING. (Practically: add NETLINK_RX_RING2 or something like
that.)

--Andy
David Miller
2014-10-15 02:09:08 UTC
From: Andy Lutomirski <***@amacapital.net>
Date: Tue, 14 Oct 2014 19:03:11 -0700
Post by Andy Lutomirski
Post by David Miller
I really think this means I'll have to remove all of the netlink
mmap() support in order to avoid breaking applications. :(
The other option is to keep NETLINK_TX_RING, but copy the data into
a kernel side buffer before acting upon it.
Option 3, which sucks but maybe not that badly: change the value of
NETLINK_RX_RING. (Practically: add NETLINK_RX_RING2 or something like
that.)
That would work as well.

There are pros and cons to all of these approaches.

I was thinking that if we do the "TX mmap --> copy to kernel buffer"
approach, then if in the future we find a way to make it work
reliably, we can avoid the copy. And frankly performance wise it's no
worse than what happens via normal sendmsg() calls.

And all applications using NETLINK_RX_RING keep working and keep
getting the performance boost.
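
[Illustrative sketch of that fallback, modelled on the non-exclusive branch
of the netlink_mmap_sendmsg() code quoted earlier in the thread; the helper
name is hypothetical and it presumes the surrounding af_netlink.c context.
Reading nm_len only once matters, so the length cannot change between the
bounds check and the copy.]

/* Hypothetical helper, next to the existing ones in af_netlink.c: always
 * snapshot the TX frame into a private skb before any validation, so a
 * later userspace write to the ring only affects userspace's copy. */
static struct sk_buff *netlink_copy_tx_frame(struct nl_mmap_hdr *hdr,
                                             unsigned int maxlen)
{
        unsigned int nm_len = ACCESS_ONCE(hdr->nm_len); /* read length once */
        struct sk_buff *skb;

        if (nm_len > maxlen)
                return ERR_PTR(-EINVAL);

        skb = alloc_skb(nm_len, GFP_KERNEL);
        if (skb == NULL)
                return ERR_PTR(-ENOBUFS);

        __skb_put(skb, nm_len);
        memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, nm_len);

        /* the frame goes straight back to userspace; from here on the
         * kernel parses only its own copy */
        netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
        return skb;
}
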
Daniel Borkmann
2014-10-16 06:45:44 UTC
Post by David Miller
Date: Tue, 14 Oct 2014 19:03:11 -0700
Post by Andy Lutomirski
Post by David Miller
I really think this means I'll have to remove all of the netlink
mmap() support in order to avoid breaking applications. :(
The other option is to keep NETLINK_TX_RING, but copy the data into
a kernel side buffer before acting upon it.
Option 3, which sucks but maybe not that badly: change the value of
NETLINK_RX_RING. (Practically: add NETLINK_RX_RING2 or something like
that.)
That would work as well.
There are pros and cons to all of these approaches.
I was thinking that if we do the "TX mmap --> copy to kernel buffer"
approach, then if in the future we find a way to make it work
reliably, we can avoid the copy. And frankly performance wise it's no
worse than what happens via normal sendmsg() calls.
And all applications using NETLINK_RX_RING keep working and keep
getting the performance boost.
That would be better, yes. It would avoid the kind of TPACKET_V* API
chaos we have in packet sockets if this can eventually be fixed for
netlink.
Thomas Graf
2014-10-16 07:07:53 UTC
Post by Daniel Borkmann
Post by David Miller
That would work as well.
There are pros and cons to all of these approaches.
I was thinking that if we do the "TX mmap --> copy to kernel buffer"
approach, then if in the future we find a way to make it work
reliably, we can avoid the copy. And frankly performance wise it's no
worse than what happens via normal sendmsg() calls.
And all applications using NETLINK_RX_RING keep working and keep
getting the performance boost.
That would be better, yes. It would avoid the kind of TPACKET_V* API
chaos we have in packet sockets if this can eventually be fixed for
netlink.
I only saw the second part of Dave's message now. I agree that this
is an even better option.

Daniel Borkmann
2014-10-15 23:45:22 UTC
Post by David Miller
Date: Tue, 14 Oct 2014 15:16:46 -0700
Post by Andy Lutomirski
It's at least remotely possible that there's something that assumes
that the availability of NETLINK_RX_RING implies
NETLINK_TX_RING, which would be unfortunate.
I already found one such case, nlmon :-/
Hmm, can you elaborate? I currently don't think that nlmon cares
actually.
David Miller
2014-10-15 23:57:37 UTC
From: Daniel Borkmann <***@redhat.com>
Date: Thu, 16 Oct 2014 01:45:22 +0200
Post by Daniel Borkmann
Post by David Miller
Date: Tue, 14 Oct 2014 15:16:46 -0700
Post by Andy Lutomirski
It's at least remotely possible that there's something that assumes
that the availability of NETLINK_RX_RING implies
NETLINK_TX_RING, which would be unfortunate.
I already found one such case, nlmon :-/
Hmm, can you elaborate? I currently don't think that nlmon cares
actually.
nlmon cares, openvswitch cares, etc:

http://openvswitch.org/pipermail/dev/2013-December/034496.html
Andy Lutomirski
2014-10-15 23:58:39 UTC
Post by David Miller
Date: Thu, 16 Oct 2014 01:45:22 +0200
Post by Daniel Borkmann
Post by David Miller
Date: Tue, 14 Oct 2014 15:16:46 -0700
Post by Andy Lutomirski
It's at least remotely possible that there's something that assumes
that the availability of NETLINK_RX_RING implies
NETLINK_TX_RING, which would be unfortunate.
I already found one such case, nlmon :-/
Hmm, can you elaborate? I currently don't think that nlmon cares
actually.
It'll still work, just slower, right?

+    if (setsockopt(sock->fd, SOL_NETLINK, NETLINK_RX_RING, &req,
+                   sizeof(req)) < 0
+        || setsockopt(sock->fd, SOL_NETLINK, NETLINK_TX_RING, &req,
+                      sizeof(req)) < 0) {
+        VLOG_INFO("mmap netlink is not supported");
+        return 0;
+    }
+

--Andy
David Miller
2014-10-16 03:34:59 UTC
From: Andy Lutomirski <***@amacapital.net>
Date: Wed, 15 Oct 2014 16:58:39 -0700
Post by Andy Lutomirski
Post by David Miller
Date: Thu, 16 Oct 2014 01:45:22 +0200
Post by Daniel Borkmann
Post by David Miller
Date: Tue, 14 Oct 2014 15:16:46 -0700
Post by Andy Lutomirski
It's at least remotely possible that there's something that assumes
that the availability of NETLINK_RX_RING implies
NETLINK_TX_RING, which would be unfortunate.
I already found one such case, nlmon :-/
Hmm, can you elaborate? I currently don't think that nlmon cares
actually.
It'll still work, just slower, right?
+    if (setsockopt(sock->fd, SOL_NETLINK, NETLINK_RX_RING, &req,
+                   sizeof(req)) < 0
+        || setsockopt(sock->fd, SOL_NETLINK, NETLINK_TX_RING, &req,
+                      sizeof(req)) < 0) {
+        VLOG_INFO("mmap netlink is not supported");
+        return 0;
+    }
So NETLINK_RX_RING succeeds but NETLINK_TX_RING fails, and now the
recvmsg() path will still take the RX ring code path, because this code
doesn't clean up and undo the earlier setsockopt().

This is the common coding paradigm I've seen in all NETLINK_*_RING
uses.
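
[Illustrative sketch of how userspace could probe the two options
independently instead of assuming they come as a pair; the probe function
is hypothetical, the constants come from the CONFIG_NETLINK_MMAP-era
headers, and error handling is omitted.]

#include <linux/netlink.h>      /* NETLINK_RX_RING/TX_RING, struct nl_mmap_req */
#include <stdbool.h>
#include <sys/socket.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

struct ring_caps { bool rx; bool tx; };

/* hypothetical probe: each setsockopt() stands on its own, so a kernel
 * that only offers the RX ring leaves the application using mmap for RX
 * and plain sendmsg() for TX, instead of concluding that mmap'ed netlink
 * is unsupported altogether. */
static struct ring_caps probe_netlink_rings(int fd)
{
    struct nl_mmap_req req = {
        .nm_block_size = 4096, .nm_block_nr = 64,
        .nm_frame_size = 4096, .nm_frame_nr = 64,
    };
    struct ring_caps caps = { false, false };

    if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) == 0)
        caps.rx = true;
    if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) == 0)
        caps.tx = true;
    return caps;
}
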
Thomas Graf
2014-10-16 05:52:47 UTC
Post by David Miller
Date: Thu, 16 Oct 2014 01:45:22 +0200
Post by Daniel Borkmann
Post by David Miller
Date: Tue, 14 Oct 2014 15:16:46 -0700
Post by Andy Lutomirski
It's at least remotely possible that there's something that assumes
that the availability of NETLINK_RX_RING implies
NETLINK_TX_RING, which would be unfortunate.
I already found one such case, nlmon :-/
Hmm, can you elaborate? I currently don't think that nlmon cares
actually.
http://openvswitch.org/pipermail/dev/2013-December/034496.html
(Fortunately) the OVS patch has not been merged yet, because the number
of Netlink sockets created per vport in the current architecture makes
it a non-scalable approach.

I think introducing a NETLINK_RX_RING2 and having NETLINK_RX_RING fail
is not a bad way out of this.