Discussion:
[PATCH v2] ipv4: dst_entry leak in ip_append_data()
Vasily Averin
2014-10-14 04:57:14 UTC
Permalink
v2: adjust the indentation of the arguments __ip_append_data() call

Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")

If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
that creates skb and adds it to sk_write_queue.

If skb was added successfully following ip_push_pending_frames() call
reassign dst entries from cork to skb, and kfree_skb frees dst_entry.

However nobody frees stolen dst_entry if skb was not added into sk_write_queue.

Signed-off-by: Vasily Averin <***@parallels.com>
---
net/ipv4/ip_output.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e35b712..3ba2291 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1120,6 +1120,15 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
return 0;
}

+static void ip_cork_release(struct inet_cork *cork)
+{
+ cork->flags &= ~IPCORK_OPT;
+ kfree(cork->opt);
+ cork->opt = NULL;
+ dst_release(cork->dst);
+ cork->dst = NULL;
+}
+
/*
* ip_append_data() and ip_append_page() can make one large IP datagram
* from many pieces of data. Each pieces will be holded on the socket
@@ -1152,9 +1161,14 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
transhdrlen = 0;
}

- return __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base,
- sk_page_frag(sk), getfrag,
- from, length, transhdrlen, flags);
+ err = __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base,
+ sk_page_frag(sk), getfrag,
+ from, length, transhdrlen, flags);
+
+ if (skb_queue_empty(&sk->sk_write_queue))
+ ip_cork_release(&inet->cork.base);
+
+ return err;
}

ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
@@ -1304,15 +1318,6 @@ error:
return err;
}

-static void ip_cork_release(struct inet_cork *cork)
-{
- cork->flags &= ~IPCORK_OPT;
- kfree(cork->opt);
- cork->opt = NULL;
- dst_release(cork->dst);
- cork->dst = NULL;
-}
-
/*
* Combined all pending IP fragments on the socket as one IP datagram
* and push them out.
--
1.9.1
David Miller
2014-10-14 20:12:25 UTC
Permalink
From: Vasily Averin <***@parallels.com>
Date: Tue, 14 Oct 2014 08:57:14 +0400
Post by Vasily Averin
v2: adjust the indentation of the arguments __ip_append_data() call
Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
that creates skb and adds it to sk_write_queue.
If skb was added successfully following ip_push_pending_frames() call
reassign dst entries from cork to skb, and kfree_skb frees dst_entry.
However nobody frees stolen dst_entry if skb was not added into sk_write_queue.
Why doesn't ip_make_skb() need the same fix? It seems to do the same
thing.
Vasily Averin
2014-10-15 07:48:02 UTC
Permalink
Post by David Miller
Date: Tue, 14 Oct 2014 08:57:14 +0400
Post by Vasily Averin
v2: adjust the indentation of the arguments __ip_append_data() call
Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
that creates skb and adds it to sk_write_queue.
If skb was added successfully following ip_push_pending_frames() call
reassign dst entries from cork to skb, and kfree_skb frees dst_entry.
However nobody frees stolen dst_entry if skb was not added into sk_write_queue.
Why doesn't ip_make_skb() need the same fix? It seems to do the same
thing.
It seems for me ip_make_skb() works (almost) correctly,
but seems refcounting can be is incorrect if queue can be not empty
(Please see details below).

If __ip_append_data() returns errors ip_make_skb() calls
__ip_flush_pending_frames() that calls ip_cork_release() inside
and frees stolen dst_entry.

If __ip_append_data() returns success -- dst refcounter changes are not required.
In this case skb will be created and added to queue (and it will not be empty)
Later in __ip_make_skb() these skb will get dst reference,
and refcounter will be decremented during kfree_skb().

I do not like that there is such unclear dependency between functions,
but seems currently it works correctly.

However I afraid dst refcountng can work incorrectly if sk_write_queue
can be not empty at the moment of ip_append_data() call.
It was not happen in case ip_send_unicast_reply() but probably
can happen in other places.

Let's calculate dst refcounters changes in this case.

First packet:

dst_refcount increment was happen in ip_append_data() caller, taken during rt lookup
- ip_append_data():
-- sk_write_queue is empty, ip_setup_cork() steals dst entry
-- __ip_append_data() adds skb to queue, queue is not flushed, waiting for next packets.
ip_rt_put in ip_append_data() caller does not work, because dst reference was stolen.

dst refcount here +1

then we want to sent 2nd packet:
dst refcount increment was happen in ip_append_data() caller
- ip_append_data():
-- sk_write_queue is NOT empty, dst was not stolen
-- __ip_append_data() adds skb to queue
ip_rt_put in ip_append_data() caller decrements dst refcount, because it as not stolen

dst refcount here +1

Then we handle new packets, all of them are added to queue

dst refcount is still +1

Then queue is flushed.
Each packet in queue get dst reference from cork,
Each kfree_skb decrements dst refcounter, and it may become negative.

Am I wrong probably?
Eric Dumazet
2014-10-15 04:46:25 UTC
Permalink
Post by Vasily Averin
v2: adjust the indentation of the arguments __ip_append_data() call
Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
that creates skb and adds it to sk_write_queue.
If skb was added successfully following ip_push_pending_frames() call
reassign dst entries from cork to skb, and kfree_skb frees dst_entry.
However nobody frees stolen dst_entry if skb was not added into sk_write_queue.
I thought this was done by ip_flush_pending_frames() ?

Can you describe the issue more precisely ?

Thanks !
Vasily Averin
2014-10-15 06:56:47 UTC
Permalink
Post by Eric Dumazet
Post by Vasily Averin
v2: adjust the indentation of the arguments __ip_append_data() call
Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
that creates skb and adds it to sk_write_queue.
If skb was added successfully following ip_push_pending_frames() call
reassign dst entries from cork to skb, and kfree_skb frees dst_entry.
However nobody frees stolen dst_entry if skb was not added into sk_write_queue.
I thought this was done by ip_flush_pending_frames() ?
Take look at ip_send_unicast_reply():

ip_flush_pending_frames() is not called if skb was not added to sk_write_queue.
And ip_rt_put() does not work, because dst entry was stolen in ip_setup_cork().

Probably it can happen in raw_sendmsg() and udp_sendmsg() too.
Eric Dumazet
2014-10-15 09:30:33 UTC
Permalink
Post by Vasily Averin
Post by Eric Dumazet
Post by Vasily Averin
v2: adjust the indentation of the arguments __ip_append_data() call
Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
that creates skb and adds it to sk_write_queue.
If skb was added successfully following ip_push_pending_frames() call
reassign dst entries from cork to skb, and kfree_skb frees dst_entry.
However nobody frees stolen dst_entry if skb was not added into sk_write_queue.
I thought this was done by ip_flush_pending_frames() ?
So maybe the bug is here ?
Post by Vasily Averin
ip_flush_pending_frames() is not called if skb was not added to sk_write_queue.
And ip_rt_put() does not work, because dst entry was stolen in ip_setup_cork().
Probably it can happen in raw_sendmsg() and udp_sendmsg() too.
UDP & RAW do :

err = ip_append_data(...);
if (err)
udp_flush_pending_frames(sk);


It seems you chose to add a test in fast path, with not even adding an
unlikely() clause, while it seems that we took care of all the cases but
missed a single one : ip_send_unicast_reply()

I am suggesting to fix this bug in another way.

Thanks.
Vasily Averin
2014-10-15 11:31:37 UTC
Permalink
Post by Eric Dumazet
Post by Eric Dumazet
Post by Vasily Averin
v2: adjust the indentation of the arguments __ip_append_data() call
Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
that creates skb and adds it to sk_write_queue.
If skb was added successfully following ip_push_pending_frames() call
reassign dst entries from cork to skb, and kfree_skb frees dst_entry.
However nobody frees stolen dst_entry if skb was not added into sk_write_queue.
I thought this was done by ip_flush_pending_frames() ?
So maybe the bug is here ?
Thank you, I'll remake my patch.

Loading...