Discussion:
Data corruption issue with splice() on 2.6.27.10
(too old to reply)
Willy Tarreau
2008-12-24 15:28:41 UTC
Permalink
Hi Jens,

I'm facing a data corruption problem with splice() between two
non-blocking TCP sockets on 2.6.27.10. I could finally write a
simpler proof of concept, and capture a snapshot of the issue
with the associated strace result.

My program does the following :
- accept an incoming connection
- connect to a remote server
- forward all data from the server to the client using splice()

The data count is always correct, but some parts are corrupted and
contain data which seem to come from random memory locations (this
raises a security concern BTW). It *sometimes* happens that a few
megabytes can be transferred without any problem, but most of the
time, corruption happens for a few hundreds of bytes every few
hundreds of kilobytes.

I first noticed that sometimes the corrupted part sized a multiple of
the MSS. But now it's more precise. Each time I observe the problem,
it is when the splice from the pipe to the outgoing socket has sent
two chunks at once.

For my test, I run my PoC program on my laptop. I connect to it using
netcat from localhost, then it connects to another machine on which
netcat is waiting for an incoming connection. The source of gcc-4.1.2
is then feed on stdin, and transferred via my program to the client-
side netcat. Both NICs are tg3, but the problem was first observed
with ixgbe NICs, then reproduced on sky2, so I don't think there's a
relation.

Here's an example of what I observed. This trace is really nice
because the problem happened very early (after about 14kB). 20k1 is an
"od -vAx -tx1" output of the original file, 20k2 is an output of the
corrupted one :

--- 20k1.hex 2008-12-24 11:35:36 +0100
+++ 20k2.hex 2008-12-24 14:48:26 +0100
@@ -883,17 +883,17 @@
003720 05 06 78 1b e8 0f 3b 88 6f 96 e1 99 93 51 fd 07
003730 a7 e9 d6 f6 9f b4 b4 78 ff 7f ea 3b b9 79 cc 4c
003740 65 d8 9e de 49 e5 fa 45 ab b4 3f 95 d4 42 23 e1
-003750 a8 19 22 c4 fd 6a 8c 8e 08 24 01 29 1c f5 65 86
-003760 7d be de 81 d2 07 41 44 40 3a 31 1c b9 6e ce 19
-003770 e6 a0 5c 81 f6 40 6e 1b a1 9f 58 61 9d 37 c7 d5
-003780 e3 10 83 b7 3f 36 e3 26 fd 22 e3 03 50 8c 3d 54
-003790 f0 f3 d4 73 cc 7e 61 96 43 04 13 e6 d3 96 eb 2a
-0037a0 f5 37 9e 09 9f d2 25 e5 73 ec f3 88 08 47 ca 3f
-0037b0 b0 27 e9 ae 7d a0 04 db 3a c3 c4 2f f8 23 05 04
-0037c0 76 c7 7f 90 65 56 ed 4f 31 03 fb 3b a6 de b5 01
-0037d0 48 02 55 f2 d0 38 89 03 7f 90 cf 5a a6 01 04 18
-0037e0 5b ae 62 c6 84 4b f2 ff 82 04 12 18 4f 32 6e 4f
-0037f0 2e 01 15 14 51 23 81 de 53 9a 86 a6 71 50 47 ee
+003750 a8 19 22 c4 02 00 00 00 00 00 00 00 08 00 00 00
+003760 00 00 00 00 4d 69 63 00 00 00 00 00 00 00 00 00
+003770 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+003780 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+003790 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
003800 fe 0a 4b 11 82 07 55 32 e6 44 21 e6 19 07 d2 38
003810 77 8f 5a 8f 9c 7b 9f 1e 72 01 04 fc 6a 88 c9 cf
003820 f3 6c e0 f8 a0 fe 43 76 50 13 5e d1 b2 08 7c e9

As you can see, the first corrupted byte starts at offset 0x3754 =
14164 and the corruption remains for 172 bytes.

Now let's take a look at the strace output :

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(4000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 10) = 0
accept(3, 0, NULL) = 4

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.1.1.1")}, 16) = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
pipe([6, 7]) = 0

select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1024
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0x9b4, 0x3) = 2484
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920

==> at this point, exactly 14164 bytes have been transferred from the
server to the client, ie exactly the amount of data before the
corruption.

select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 172

==> we have a read of 172 bytes, exactly the number of corrupted bytes.

select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0x660, 0x3) = 1632

==> and here, we write the last two chunks at once and the reader
observes random data for the first chunk.

select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920
...


I tried to replace either socket with a file, and in both cases, the
problem disappears (which explains why we have not seen reports of
corruption on sendfile). So it's really between TCP and TCP the the
problem lies.

I tried to disable SPLICE_F_MOVE, it did not change anything.
Disabling SPLICE_F_NONBLOCK makes the code unusable as it blocks when
the pipe is full, while the same process is supposed to flush it.

It looks like the corruption happens between sockets and/or pipes,
because the first time it was noticed on a production system, the
system was running haproxy (reverse-proxy load-balancer), and we found
multiple copies of request and response headers in a data file. I
don't know however if those were from other active splices or from
earlier ones running on the same FDs.

Also, it's very hard to reproduce the problem under strace, so I feel
really lucky to have caught this trace. I think it's just a timing
problem.

Another method I found to quickly detect corruption is to feed
/dev/zero to netcat on the server side, and pipe the output to "od",
which will only show sequences of non-zeroes. Hmm just did it right
now again and I got a complete capture of the output of the "top"
utility that I just started in another window.

At first, I thought that blocking reads when there are data in the
pipe would make the problem go away, but it's not the case. I modified
my program to stop polling for reads when "bytes" is not null, and I
finally got on my socket a complete output of strace running in
another xterm.

I found an analysis [1] for a potential corruption problem between two
sockets, but I noticed there were no responses and I did not fully
understand the report anyway.

What can I do to help debug the problem ? I'm really willing to help
getting this fixed, and I also have at least one user who definitely
wants splice() to work because the recv/send model currently limits
haproxy to 3 Gbps on his machines, while I have no problem reaching
10 Gbps with splice().

Please find the attached program. The server's address is hard-coded
to 10.1.1.1:4001. It accepts connections on port 4000. Here's how you
can try to reproduce :

# server :
# nc -lp 4000 </dev/zero

# test machine :
# ./splice-net-to-net

# test machine (again) :
# nc 127.1 4000 | od -Ax -tx1

If you don't get the problem, start strace on splice-net-to-net, here
it seems to help (probably the timing race again).

Thanks in advance,
Willy

----
[1] http://lkml.org/lkml/2008/2/26/210
Jarek Poplawski
2009-01-06 08:54:42 UTC
Permalink
Post by Willy Tarreau
Hi Jens,
I'm facing a data corruption problem with splice() between two
non-blocking TCP sockets on 2.6.27.10. I could finally write a
simpler proof of concept, and capture a snapshot of the issue
with the associated strace result.
...
Post by Willy Tarreau
I found an analysis [1] for a potential corruption problem between two
sockets, but I noticed there were no responses and I did not fully
understand the report anyway.
What can I do to help debug the problem ? I'm really willing to help
getting this fixed, and I also have at least one user who definitely
wants splice() to work because the recv/send model currently limits
haproxy to 3 Gbps on his machines, while I have no problem reaching
10 Gbps with splice().
...
Post by Willy Tarreau
----
[1] http://lkml.org/lkml/2008/2/26/210
Great story! Alas I don't understand this fully either, but it seems
Changli Gao was concerned with sendpage sending this "as pages", so
when NETIF_F_SG flag is available. Did you try this without SG btw?

Thanks,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-06 09:41:38 UTC
Permalink
Hi Jarek,
Post by Jarek Poplawski
Post by Willy Tarreau
Hi Jens,
I'm facing a data corruption problem with splice() between two
non-blocking TCP sockets on 2.6.27.10. I could finally write a
simpler proof of concept, and capture a snapshot of the issue
with the associated strace result.
...
Post by Willy Tarreau
I found an analysis [1] for a potential corruption problem between two
sockets, but I noticed there were no responses and I did not fully
understand the report anyway.
What can I do to help debug the problem ? I'm really willing to help
getting this fixed, and I also have at least one user who definitely
wants splice() to work because the recv/send model currently limits
haproxy to 3 Gbps on his machines, while I have no problem reaching
10 Gbps with splice().
...
Post by Willy Tarreau
----
[1] http://lkml.org/lkml/2008/2/26/210
Great story! Alas I don't understand this fully either, but it seems
Changli Gao was concerned with sendpage sending this "as pages", so
when NETIF_F_SG flag is available. Did you try this without SG btw?
No I did not. I can try, it's not too hard. It would in part defeat the
purpose of the mechanism (especially at 10 Gbps) but at least it will
help narrow the problem down.

Thanks for the tip, I'll keep you informed !
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-06 10:01:13 UTC
Permalink
On Tue, Jan 06, 2009 at 10:41:38AM +0100, Willy Tarreau wrote:
...
Post by Willy Tarreau
Post by Jarek Poplawski
Great story! Alas I don't understand this fully either, but it seems
Changli Gao was concerned with sendpage sending this "as pages", so
when NETIF_F_SG flag is available. Did you try this without SG btw?
No I did not. I can try, it's not too hard. It would in part defeat the
purpose of the mechanism (especially at 10 Gbps) but at least it will
help narrow the problem down.
Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
bit for these sendpages should then make it more reproducible, I guess.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-06 10:04:08 UTC
Permalink
Post by Jarek Poplawski
...
Post by Willy Tarreau
Post by Jarek Poplawski
Great story! Alas I don't understand this fully either, but it seems
Changli Gao was concerned with sendpage sending this "as pages", so
when NETIF_F_SG flag is available. Did you try this without SG btw?
No I did not. I can try, it's not too hard. It would in part defeat the
purpose of the mechanism (especially at 10 Gbps) but at least it will
help narrow the problem down.
Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
bit for these sendpages should then make it more reproducible, I guess.
ah ? I can try that too, using iptables on the target machine to drop
a few outgoing acks.

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-06 15:57:15 UTC
Permalink
Post by Jarek Poplawski
...
Post by Willy Tarreau
Post by Jarek Poplawski
Great story! Alas I don't understand this fully either, but it seems
Changli Gao was concerned with sendpage sending this "as pages", so
when NETIF_F_SG flag is available. Did you try this without SG btw?
No I did not. I can try, it's not too hard. It would in part defeat the
purpose of the mechanism (especially at 10 Gbps) but at least it will
help narrow the problem down.
Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
bit for these sendpages should then make it more reproducible, I guess.
OK here is an update. It does not change anything to turn off any acceleration
feature on the interface (tg3) :

***@wtap:~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off

It still forwards corrupted data like mad. I noticed that the corruption rate
is 10-100 times higher when forwarding from eth0 to eth0 than from eth0 to lo.

Maybe this can help find the culprit ?

Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 09:39:15 UTC
Permalink
Post by Willy Tarreau
Post by Jarek Poplawski
...
Post by Willy Tarreau
Post by Jarek Poplawski
Great story! Alas I don't understand this fully either, but it seems
Changli Gao was concerned with sendpage sending this "as pages", so
when NETIF_F_SG flag is available. Did you try this without SG btw?
No I did not. I can try, it's not too hard. It would in part defeat the
purpose of the mechanism (especially at 10 Gbps) but at least it will
help narrow the problem down.
Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
bit for these sendpages should then make it more reproducible, I guess.
OK here is an update. It does not change anything to turn off any acceleration
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
It still forwards corrupted data like mad. I noticed that the corruption rate
is 10-100 times higher when forwarding from eth0 to eth0 than from eth0 to lo.
Maybe this can help find the culprit ?
I hope it will, but I still don't get it. Anyway, here is an untested
patch, which I guess partly tries Changli Gao's recommendation to give
real pages to splice/pipe (but here it's always - not for sendpage
only).

BTW, I added Changli to Cc - great review!

Thanks,
Jarek P.

---

net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..4c080cd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}

- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(p + offset, page + offset, len);
+
+ return p;
}

/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+ get_page(page);
+
return 0;
}

@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:24:07 UTC
Permalink
Post by Jarek Poplawski
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..4c080cd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}
static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}
Well this patch can only make it worse because not only are you
still ref counting skb->head with get_page, but you've also
completely removed the skb ref count which means that the corruption
can only occur sooner.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 12:38:04 UTC
Permalink
Post by Herbert Xu
Post by Jarek Poplawski
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..4c080cd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}
static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}
Well this patch can only make it worse because not only are you
still ref counting skb->head with get_page, but you've also
completely removed the skb ref count which means that the corruption
can only occur sooner.
But we don't need this skb... except its ->frags[] pages, which are
get_paged?! (The rest is copied to new pages.)

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-07 12:22:05 UTC
Permalink
[ CCing Evgeniy and Herbert who also participate to the thread ]
Post by Jarek Poplawski
...
Great story! Alas I don't understand this fully either, but i=
t seems
Post by Jarek Poplawski
Changli Gao was concerned with sendpage sending this "as page=
s", so
Post by Jarek Poplawski
when NETIF_F_SG flag is available. Did you try this without S=
G btw?
Post by Jarek Poplawski
=20
No I did not. I can try, it's not too hard. It would in part de=
feat the
Post by Jarek Poplawski
purpose of the mechanism (especially at 10 Gbps) but at least i=
t will
Post by Jarek Poplawski
help narrow the problem down.
=20
Yes, I meant it only as a proof of concept. BTW, delaying TCP ack=
s a
Post by Jarek Poplawski
bit for these sendpages should then make it more reproducible, I =
guess.
Post by Jarek Poplawski
=20
OK here is an update. It does not change anything to turn off any a=
cceleration
Post by Jarek Poplawski
=20
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
=20
It still forwards corrupted data like mad. I noticed that the corru=
ption rate
Post by Jarek Poplawski
is 10-100 times higher when forwarding from eth0 to eth0 than from =
eth0 to lo.
Post by Jarek Poplawski
=20
Maybe this can help find the culprit ?
=20
I hope it will, but I still don't get it. Anyway, here is an untested
patch, which I guess partly tries Changli Gao's recommendation to giv=
e
Post by Jarek Poplawski
real pages to splice/pipe (but here it's always - not for sendpage
only).
BTW, I added Changli to Cc - great review!
=20
Thanks,
Jarek P.
Well, I've just tested it. It did not fix the problem but made it worse=
=2E
Sending a few bytes at a time, the corruption is still there, from the
beginning. Here's what I fed :

***@pcw:~$ nc -lp4001=20
azerazerazerazerazerazer
eiguhaeihgaeighaeighaeirghiareg
aeroigjaeorgjaeorgjaoeigjaeoig
ioejrgoiaerjgoiaerjgoiaerjgaoiejgoaiejg

Here's what I got :

***@pcw:~$ telnet 10.0.3.2 4000
Trying 10.0.3.2...
Connected to 10.0.3.2.
Escape character is '^]'.
_J0s9=F1uMG1S9=D0t2D$E=F0L$T$

However, when feeding /dev/zero as in previous tests, the kernel panice=
d
in skb_release_data().

Regards,
Willy
Post by Jarek Poplawski
---
=20
net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)
=20
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..4c080cd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __r=
ead_mostly;
Post by Jarek Poplawski
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb =3D (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}
=20
static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb =3D (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}
=20
static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned =
int i)
Post by Jarek Poplawski
{
- struct sk_buff *skb =3D (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}
=20
- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigne=
d int len,
Post by Jarek Poplawski
+ unsigned int offset)
+{
+ struct page *p =3D alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(p + offset, page + offset, len);
+
+ return p;
}
=20
/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pi=
pe_desc *spd, unsigned int i)
Post by Jarek Poplawski
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct=
page *page,
Post by Jarek Poplawski
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages =3D=3D PIPE_BUFFERS))
return 1;
=20
+ if (linear) {
+ page =3D linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] =3D page;
spd->partial[spd->nr_pages].len =3D len;
spd->partial[spd->nr_pages].offset =3D offset;
- spd->partial[spd->nr_pages].private =3D (unsigned long) skb_get(skb=
);
Post by Jarek Poplawski
spd->nr_pages++;
+ get_page(page);
+
return 0;
}
=20
@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page *=
*page, unsigned int *poff,
Post by Jarek Poplawski
static inline int __splice_segment(struct page *page, unsigned int p=
off,
Post by Jarek Poplawski
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page =
*page, unsigned int poff,
Post by Jarek Poplawski
/* the linear region may spread across several pages */
flen =3D min_t(unsigned int, flen, PAGE_SIZE - poff);
=20
- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;
=20
__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *sk=
b, unsigned int *offset,
Post by Jarek Poplawski
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;
=20
/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *sk=
b, unsigned int *offset,
Post by Jarek Poplawski
const skb_frag_t *f =3D &skb_shinfo(skb)->frags[seg];
=20
if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}
=20
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 12:31:53 UTC
Permalink
Post by Willy Tarreau
[ CCing Evgeniy and Herbert who also participate to the thread ]
...
Post by Willy Tarreau
Well, I've just tested it. It did not fix the problem but made it worse.
...

Terrible mistake! Here is take 2.

Sorry!
Jarek P.

---

net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..a598034 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}

- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy((void *)p + offset, (void *)page + offset, len);
+
+ return p;
}

/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+ get_page(page);
+
return 0;
}

@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe
2009-01-07 12:35:04 UTC
Permalink
Post by Jarek Poplawski
Post by Willy Tarreau
[ CCing Evgeniy and Herbert who also participate to the thread ]
...
Post by Willy Tarreau
Well, I've just tested it. It did not fix the problem but made it worse.
...
Terrible mistake! Here is take 2.
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy((void *)p + offset, (void *)page + offset, len);
is trying to do. I'm assuming you want to copy the page contents? If so,
you'd want something like

memcpy(page_address(p) + offset, page_address(page) + offset, len);

with possible kmaps for 'page'.

Irregardless of that particular oddity, I don't think this is the right
path to take at all. We need to delay the pipe buffer consumption until
the appropriate time.
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-07 12:40:34 UTC
Permalink
Post by Jens Axboe
Irregardless of that particular oddity, I don't think this is the right
path to take at all. We need to delay the pipe buffer consumption until
the appropriate time.
As a proof of concept we can put a delayed work_struct into the buffer
and only release its content after some timeout big enough (like one
second or so) for the hardware to actually transmit its buffers.
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-07 12:52:01 UTC
Permalink
Post by Evgeniy Polyakov
Post by Jens Axboe
Irregardless of that particular oddity, I don't think this is the right
path to take at all. We need to delay the pipe buffer consumption until
the appropriate time.
As a proof of concept we can put a delayed work_struct into the buffer
and only release its content after some timeout big enough (like one
second or so) for the hardware to actually transmit its buffers.
Evgeniy, I'd like to understand something related to our apparent lack of
knowledge of when the data is effectively transmitted. If we're focusing
on the send part, I can't understand why I never reproduce the corruption
when the data source is a file or loopback, but I only see it when the source
is an ethernet interface. How is it possible that a problem affecting only
the send side is so much selective about the source ? And in fact, why can't
we apply the same workflow for outgoing data for both types of sources ? It
seems to me that the page is released at the right time when sending a file,
and I don't see why we cannot apply the same principle when splicing between
sockets.

Please excuse me for my blattant ignorance in this area, as I once said, I
could not completely follow the whole splice process between tcp_splice_read()
and the moment the data leaves the machine. Also, I failed to understand what
linear data means. It seems to me this is the parts that are memcpy'd, but I'm
not sure.

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:53:56 UTC
Permalink
Post by Willy Tarreau
Evgeniy, I'd like to understand something related to our apparent lack of
knowledge of when the data is effectively transmitted. If we're focusing
on the send part, I can't understand why I never reproduce the corruption
when the data source is a file or loopback, but I only see it when the source
is an ethernet interface. How is it possible that a problem affecting only
It doesn't happen with a file because in that case you don't
start with an skb so there is no skb->head. It probably doesn't
happen with loopback because loopback does GSO so again skb->head
does not exist (so to speak).

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-07 12:57:56 UTC
Permalink
Post by Herbert Xu
Post by Willy Tarreau
Evgeniy, I'd like to understand something related to our apparent lack of
knowledge of when the data is effectively transmitted. If we're focusing
on the send part, I can't understand why I never reproduce the corruption
when the data source is a file or loopback, but I only see it when the source
is an ethernet interface. How is it possible that a problem affecting only
It doesn't happen with a file because in that case you don't
start with an skb so there is no skb->head. It probably doesn't
happen with loopback because loopback does GSO so again skb->head
does not exist (so to speak).
Yup, basically splice's transmit pipe buffer contains page references,
where the first one is actually not a real page but skb, while in the
case of sendfile() and/or splice() from the file first page is a real
page of the appropriate file.
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-07 13:08:29 UTC
Permalink
Post by Evgeniy Polyakov
Post by Herbert Xu
Post by Willy Tarreau
Evgeniy, I'd like to understand something related to our apparent lack of
knowledge of when the data is effectively transmitted. If we're focusing
on the send part, I can't understand why I never reproduce the corruption
when the data source is a file or loopback, but I only see it when the source
is an ethernet interface. How is it possible that a problem affecting only
It doesn't happen with a file because in that case you don't
start with an skb so there is no skb->head. It probably doesn't
happen with loopback because loopback does GSO so again skb->head
does not exist (so to speak).
Yup, basically splice's transmit pipe buffer contains page references,
where the first one is actually not a real page but skb, while in the
case of sendfile() and/or splice() from the file first page is a real
page of the appropriate file.
OK thanks guys for the clarifications.

Evgeniy, my printk() in tcp_sendpage() fired several times indicating we were
going through do_tcp_sendpage. During the same test, I observed a lot of
corruption.

Also, I have a good news. As you suggested, disabling both SG and GSO indeed
fixes the issue. do_tcp_sendpage() is not called anymore from tcp_sendpage()
in this case (according to dmesg).

Cheers,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 12:49:46 UTC
Permalink
Post by Jens Axboe
Post by Jarek Poplawski
Post by Willy Tarreau
[ CCing Evgeniy and Herbert who also participate to the thread ]
...
Post by Willy Tarreau
Well, I've just tested it. It did not fix the problem but made it worse.
...
Terrible mistake! Here is take 2.
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy((void *)p + offset, (void *)page + offset, len);
is trying to do. I'm assuming you want to copy the page contents? If so,
you'd want something like
memcpy(page_address(p) + offset, page_address(page) + offset, len);
with possible kmaps for 'page'.
My BAD!!! Thanks!
Post by Jens Axboe
Irregardless of that particular oddity, I don't think this is the right
path to take at all. We need to delay the pipe buffer consumption until
the appropriate time.
Hmm... in any case: take 3

Jarek P.

---

net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..6e43d52 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}

- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+ return p;
}

/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+ get_page(page);
+
return 0;
}

@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:52:17 UTC
Permalink
Post by Jarek Poplawski
Hmm... in any case: take 3
Yes this should fix the corruption but it kind of defeats the
purpose of splice by copying the data.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 13:01:29 UTC
Permalink
that's what I was initially wondering about, but it looks like only linear
data is copied. Will that cause too much a overhead ? (I don't like copying
For splicing from socket to socket or socket to file, all the
data will be linear with most drivers.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 13:02:45 UTC
Permalink
Post by Herbert Xu
Post by Jarek Poplawski
Hmm... in any case: take 3
Yes this should fix the corruption but it kind of defeats the
purpose of splice by copying the data.
Hmm.. Again - the main point of this patch was a proof of concept...
If the bug is really here it could be optimized (e.g. done only for
sendmesage) or fixed another way.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-07 13:00:31 UTC
Permalink
Post by Herbert Xu
Post by Jarek Poplawski
Hmm... in any case: take 3
Yes this should fix the corruption but it kind of defeats the
purpose of splice by copying the data.
that's what I was initially wondering about, but it looks like only linear
data is copied. Will that cause too much a overhead ? (I don't like copying
at all anyway, but if it can help us find the cause, I'll happily test it).

BTW, Jarek, don't be sorry, I *am* expecting to crash my laptop a number of
times before ensuring the bug is fixed. I just don't want to lose my data.

Cheers,
Willy
Herbert Xu
2009-01-12 12:02:57 UTC
Permalink
Post by Herbert Xu
Post by Jarek Poplawski
Hmm... in any case: take 3
Yes this should fix the corruption but it kind of defeats the
purpose of splice by copying the data.
However, as we don't have a better fix yet, we probably should
take Jarek's patch for now since data corruption is bad.

This is a very hard problem, so in the end the only viable solution
might be to get the drivers to switch to using page frags, like
the Intel page split method.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-12 12:45:46 UTC
Permalink
Post by Herbert Xu
Post by Herbert Xu
Post by Jarek Poplawski
Hmm... in any case: take 3
Yes this should fix the corruption but it kind of defeats the
purpose of splice by copying the data.
However, as we don't have a better fix yet, we probably should
take Jarek's patch for now since data corruption is bad.
Iirc it copies data from skb to the new pipe page unconditionally while
it is needed only for skb->sendpage path, although it is not possible to
know what is the other side of the pipe (or not?).

What about storing a callback and private pointer in the shared info for
the skb and clone them during usual clone, and invoke the callback at
shared info freeing time, which in turn will call spd->spd_release()?

Given that we only need to protect linear part, it should be simple
enough and we will not need to mess with the pskb_expand* calls.
Post by Herbert Xu
This is a very hard problem, so in the end the only viable solution
might be to get the drivers to switch to using page frags, like
the Intel page split method.
As a long-term solution this sounds as the best case, but introduces
quite heavy overhead for the allocators. Right now we allocate
1500+shared_info rounded up to the nearest power of the two (2k), but
then we will either need to have own network allocator (I have one :) or
allocate PAGE_SIZE+shared_info rounded up to the pwoer of the two (i.e.
8k), which is unfeasible.
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-12 12:56:07 UTC
Permalink
Post by Evgeniy Polyakov
As a long-term solution this sounds as the best case, but introduces
quite heavy overhead for the allocators. Right now we allocate
1500+shared_info rounded up to the nearest power of the two (2k), but
then we will either need to have own network allocator (I have one :) or
allocate PAGE_SIZE+shared_info rounded up to the pwoer of the two (i.e.
8k), which is unfeasible.
No that's not what I was suggesting. The page split model allocates
an skb with a very small head that accomodates only the headers.
All payload is stored in the frags structure.

For 1500-byte packets, we can manage the payload area efficiently
by dividing each allocated page into 2K chunks. The page will
then be automatically freed once all the 2K chunks on it have been
freed through the page ref count mechanism.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-12 12:59:14 UTC
Permalink
Post by Herbert Xu
Post by Evgeniy Polyakov
As a long-term solution this sounds as the best case, but introduces
quite heavy overhead for the allocators. Right now we allocate
1500+shared_info rounded up to the nearest power of the two (2k), but
then we will either need to have own network allocator (I have one :) or
allocate PAGE_SIZE+shared_info rounded up to the pwoer of the two (i.e.
8k), which is unfeasible.
No that's not what I was suggesting. The page split model allocates
an skb with a very small head that accomodates only the headers.
All payload is stored in the frags structure.
For 1500-byte packets, we can manage the payload area efficiently
by dividing each allocated page into 2K chunks. The page will
then be automatically freed once all the 2K chunks on it have been
freed through the page ref count mechanism.
That's the part I referred to as a network own allocator. We can have
multiple kmem_caches though for the popular MTUs and round up the
requested size otherwise.
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-12 21:11:08 UTC
Permalink
Post by Evgeniy Polyakov
That's the part I referred to as a network own allocator. We can have
multiple kmem_caches though for the popular MTUs and round up the
requested size otherwise.
No we can't use kmem_caches because we're relying on using the
page refereounce count to free the page. It has to be naked
pages.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-12 13:15:17 UTC
Permalink
Post by Herbert Xu
Post by Herbert Xu
Post by Jarek Poplawski
Hmm... in any case: take 3
Yes this should fix the corruption but it kind of defeats the
purpose of splice by copying the data.
However, as we don't have a better fix yet, we probably should
take Jarek's patch for now since data corruption is bad.
I've wondered if something like this could work as a temporary hack?

!!! not compiled/tested !!!
---

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ce572f9..99b0876 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -264,6 +264,7 @@
#include <linux/cache.h>
#include <linux/err.h>
#include <linux/crypto.h>
+#include <linux/page-flags.h>

#include <net/icmp.h>
#include <net/tcp.h>
@@ -776,7 +777,8 @@ ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
struct sock *sk = sock->sk;

if (!(sk->sk_route_caps & NETIF_F_SG) ||
- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ !(sk->sk_route_caps & NETIF_F_ALL_CSUM) ||
+ PageSlab(page))
return sock_no_sendpage(sock, page, offset, size, flags);

lock_sock(sk);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-12 21:12:37 UTC
Permalink
Post by Jarek Poplawski
I've wondered if something like this could work as a temporary hack?
No we should fix it in skb_splice_bits because the corruption
can affect other terminations as well if a delay occurs.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-19 07:32:35 UTC
Permalink
On Mon, Jan 12, 2009 at 01:15:17PM +0000, Jarek Poplawski wrote:
...
Post by Jarek Poplawski
I've wondered if something like this could work as a temporary hack?
...
Post by Jarek Poplawski
@@ -776,7 +777,8 @@ ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
struct sock *sk = sock->sk;
if (!(sk->sk_route_caps & NETIF_F_SG) ||
- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ !(sk->sk_route_caps & NETIF_F_ALL_CSUM) ||
+ PageSlab(page))
return sock_no_sendpage(sock, page, offset, size, flags);
Just for the record: this wouldn't work yet:-( It should be probably
something like "PageCompound(compound_head(page))" test instead.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-07 12:39:02 UTC
Permalink
Post by Jarek Poplawski
Post by Willy Tarreau
[ CCing Evgeniy and Herbert who also participate to the thread ]
...
Post by Willy Tarreau
Well, I've just tested it. It did not fix the problem but made it worse.
...
Terrible mistake! Here is take 2.
no problem. However, Herbert explained that this fix could not work for
multiple reasons (refcounting not fixed where the issue really is). BTW,
I've just tried it, and it crashed again, this time in tcp_send_ack().

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 12:56:19 UTC
Permalink
On Wed, Jan 07, 2009 at 01:39:02PM +0100, Willy Tarreau wrote:
...
Post by Willy Tarreau
no problem. However, Herbert explained that this fix could not work for
multiple reasons (refcounting not fixed where the issue really is). BTW,
I've just tried it, and it crashed again, this time in tcp_send_ack().
I'm extremely sorry! (I don't even expect you'll dare take 3...)

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:44:52 UTC
Permalink
Post by Willy Tarreau
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
Can you please upgrade your ethtool so that it knows about GSO?
Once you've done that please retest with everything including
GSO turned off.

Alternatively edit out the bit that enables GSO by default in
net/core/dev.c.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ben Mansell
2009-01-06 17:42:31 UTC
Permalink
Hi,
Post by Willy Tarreau
I'm facing a data corruption problem with splice() between two
non-blocking TCP sockets on 2.6.27.10. I could finally write a
simpler proof of concept, and capture a snapshot of the issue
with the associated strace result.
- accept an incoming connection
- connect to a remote server
- forward all data from the server to the client using splice()
The data count is always correct, but some parts are corrupted and
contain data which seem to come from random memory locations (this
raises a security concern BTW). It *sometimes* happens that a few
megabytes can be transferred without any problem, but most of the
time, corruption happens for a few hundreds of bytes every few
hundreds of kilobytes.
FWIW, I can easily reproduce this on a Linux 2.6.27-9 (Ubuntu kernel),
using both forcedeth and tg3 network drivers. It's reassuring to hear of
this network corruption as I have been puzzling over non-blocking
splice() code recently!

The corruption does seem to be confined to the user data on the
connections, as I have been able to run some benchmarks using my own
splice()-enabled HTTP proxy to transfer lots of data. All the TCP
connections 'work' fine (as in no broken TCP), the initial HTTP request
& response headers get through OK, but the body data of the responses
sometimes gets corrupted. The benchmark seems to work flawlessly until
you look at the web page data!

I'm happy to run any test code on systems here or provide any debug
information if it would help to track this down.


Ben
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-06 18:15:59 UTC
Permalink
Hi Ben,
Post by Ben Mansell
Hi,
Post by Willy Tarreau
I'm facing a data corruption problem with splice() between two
non-blocking TCP sockets on 2.6.27.10. I could finally write a
simpler proof of concept, and capture a snapshot of the issue
with the associated strace result.
- accept an incoming connection
- connect to a remote server
- forward all data from the server to the client using splice()
The data count is always correct, but some parts are corrupted and
contain data which seem to come from random memory locations (this
raises a security concern BTW). It *sometimes* happens that a few
megabytes can be transferred without any problem, but most of the
time, corruption happens for a few hundreds of bytes every few
hundreds of kilobytes.
FWIW, I can easily reproduce this on a Linux 2.6.27-9 (Ubuntu kernel),
using both forcedeth and tg3 network drivers. It's reassuring to hear of
this network corruption as I have been puzzling over non-blocking
splice() code recently!
Ah, so you might also have discovered a few annoyances with the API, eg
the fact that splice() returns after the first read in non-blocking mode,
as well as the fact that it never returns zero on close, but -EAGAIN,
which requires an additional recv(MSG_PEEK) to distinguish between a
close and a lack of data. But I leave that for a later discussion, let's
address the corruption issue first.
Post by Ben Mansell
The corruption does seem to be confined to the user data on the
connections, as I have been able to run some benchmarks using my own
splice()-enabled HTTP proxy to transfer lots of data. All the TCP
connections 'work' fine (as in no broken TCP), the initial HTTP request
& response headers get through OK, but the body data of the responses
sometimes gets corrupted. The benchmark seems to work flawlessly until
you look at the web page data!
I confirm your observations. Benchmarks were OK, it's the first user of
my experimental code who reported wrong md5sums on their ISO images :-/
For this reason, I think that it's completely related to the way pages
are passed between sockets, but I'm too much a loser in this area. I
understood tcp_splice_read(), but can't manage to find what is called
on the other side nor what the data become :-(
Post by Ben Mansell
I'm happy to run any test code on systems here or provide any debug
information if it would help to track this down.
That's nice, because I'd like to ensure that whatever fix is proposed
is properly validated, not only on my platforms!

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-08 07:16:51 UTC
Permalink
On 06-01-2009 19:15, Willy Tarreau wrote:
...
Post by Willy Tarreau
Ah, so you might also have discovered a few annoyances with the API, eg
the fact that splice() returns after the first read in non-blocking mode,
as well as the fact that it never returns zero on close, but -EAGAIN,
which requires an additional recv(MSG_PEEK) to distinguish between a
close and a lack of data. But I leave that for a later discussion, let's
address the corruption issue first.
FYI, this should be just fixed:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1

Regards,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-08 08:05:20 UTC
Permalink
Post by Jarek Poplawski
...
Post by Willy Tarreau
Ah, so you might also have discovered a few annoyances with the API, eg
the fact that splice() returns after the first read in non-blocking mode,
as well as the fact that it never returns zero on close, but -EAGAIN,
which requires an additional recv(MSG_PEEK) to distinguish between a
close and a lack of data. But I leave that for a later discussion, let's
address the corruption issue first.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1
Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
I had pending here ;-)

I'll ping Greg for a backport into -stable, as applications relying on this
will clearly not work without that fix.

The other one I had consists in removing "|| !timeo" at the end of the loop,
because otherwise splice() returns very small chunks (typically 1448 or
1460 bytes), leading to disastrous performance on high bandwidth links.
At 10 Gbps, this means about 800000 calls to splice() per second!

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ingo Molnar
2009-01-08 14:53:00 UTC
Permalink
Post by Willy Tarreau
Post by Jarek Poplawski
...
Post by Willy Tarreau
Ah, so you might also have discovered a few annoyances with the API, eg
the fact that splice() returns after the first read in non-blocking mode,
as well as the fact that it never returns zero on close, but -EAGAIN,
which requires an additional recv(MSG_PEEK) to distinguish between a
close and a lack of data. But I leave that for a later discussion, let's
address the corruption issue first.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1
Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
I had pending here ;-)
I'll ping Greg for a backport into -stable, as applications relying on
this will clearly not work without that fix.
The other one I had consists in removing "|| !timeo" at the end of the
loop, because otherwise splice() returns very small chunks (typically
1448 or 1460 bytes), leading to disastrous performance on high bandwidth
links. At 10 Gbps, this means about 800000 calls to splice() per second!
looks interesting - would you mind to submit it?

Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ben Mansell
2009-01-08 15:16:16 UTC
Permalink
Post by Ingo Molnar
Post by Willy Tarreau
Post by Jarek Poplawski
...
Post by Willy Tarreau
Ah, so you might also have discovered a few annoyances with the API, eg
the fact that splice() returns after the first read in non-blocking mode,
as well as the fact that it never returns zero on close, but -EAGAIN,
which requires an additional recv(MSG_PEEK) to distinguish between a
close and a lack of data. But I leave that for a later discussion, let's
address the corruption issue first.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1
Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
I had pending here ;-)
I'll ping Greg for a backport into -stable, as applications relying on
this will clearly not work without that fix.
The other one I had consists in removing "|| !timeo" at the end of the
loop, because otherwise splice() returns very small chunks (typically
1448 or 1460 bytes), leading to disastrous performance on high bandwidth
links. At 10 Gbps, this means about 800000 calls to splice() per second!
looks interesting - would you mind to submit it?
FWIW, I've also tested this change with some splice() benchmarks. I can
confirm that removing the "|| !timeo" works well and improves
performance significantly.


Ben
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-08 17:14:19 UTC
Permalink
Post by Ingo Molnar
Post by Willy Tarreau
Post by Jarek Poplawski
...
Post by Willy Tarreau
Ah, so you might also have discovered a few annoyances with the API, eg
the fact that splice() returns after the first read in non-blocking mode,
as well as the fact that it never returns zero on close, but -EAGAIN,
which requires an additional recv(MSG_PEEK) to distinguish between a
close and a lack of data. But I leave that for a later discussion, let's
address the corruption issue first.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1
Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
I had pending here ;-)
I'll ping Greg for a backport into -stable, as applications relying on
this will clearly not work without that fix.
The other one I had consists in removing "|| !timeo" at the end of the
loop, because otherwise splice() returns very small chunks (typically
1448 or 1460 bytes), leading to disastrous performance on high bandwidth
links. At 10 Gbps, this means about 800000 calls to splice() per second!
looks interesting - would you mind to submit it?
OK I will. I preferred to focus on the bug first, which is clearly
more annoying since it basically makes splice() unusuable in a proxy.
I wanted to avoid changes before the bug is fixed, but now, with the
first investigations, it seems that the fix will be in a different
area anyway so that doesn't matter.

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-06 18:32:23 UTC
Permalink
Hi Willy.

Unfortunately I can not work on this problem right now, but will do if
things are not resolved after Jan 11 (long vacations will be finished in
Russia and I will return to my test machines :) But right now I have
one quesstion: I read several times your mail but still can not figure
out if receiving or sending side is broken?

I.e. can you splice from socket into the file, check the file, and then
splice to the another socket and check received data to find out which
side is broken? Or did I just missed that in the problem description?

Thanks a lot for the test application, it will greatly help to resolve
this issue.
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe
2009-01-06 18:37:05 UTC
Permalink
Post by Evgeniy Polyakov
Hi Willy.
Unfortunately I can not work on this problem right now, but will do if
things are not resolved after Jan 11 (long vacations will be finished in
Russia and I will return to my test machines :) But right now I have
one quesstion: I read several times your mail but still can not figure
out if receiving or sending side is broken?
I.e. can you splice from socket into the file, check the file, and then
splice to the another socket and check received data to find out which
side is broken? Or did I just missed that in the problem description?
Thanks a lot for the test application, it will greatly help to resolve
this issue.
I'll give this a spin tomorrow as well. A hunch tells me that this is
likely a page reuse issue, that splice is getting the reference to the
buffer dropped before the data has really been transmitted. IOW, the
page is likely fine reaching the ->sendpage() bit, but will be reused
before the data has actually been transmitted. So once you get that far,
other random data from that page is going out.

Just a guess, I'll try and reproduce this tomorrow!
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-06 18:55:38 UTC
Permalink
Hi Jens,
Post by Jens Axboe
Post by Evgeniy Polyakov
Hi Willy.
Unfortunately I can not work on this problem right now, but will do if
things are not resolved after Jan 11 (long vacations will be finished in
Russia and I will return to my test machines :) But right now I have
one quesstion: I read several times your mail but still can not figure
out if receiving or sending side is broken?
I.e. can you splice from socket into the file, check the file, and then
splice to the another socket and check received data to find out which
side is broken? Or did I just missed that in the problem description?
Thanks a lot for the test application, it will greatly help to resolve
this issue.
I'll give this a spin tomorrow as well. A hunch tells me that this is
likely a page reuse issue, that splice is getting the reference to the
buffer dropped before the data has really been transmitted. IOW, the
page is likely fine reaching the ->sendpage() bit, but will be reused
before the data has actually been transmitted. So once you get that far,
other random data from that page is going out.
I like your explanation because eventhough I don't understand the code
(can't follow it past the actors in fact), I understand the problem you're
suggesting ;-)
Post by Jens Axboe
Just a guess, I'll try and reproduce this tomorrow!
OK. In order not to waste your time, run the test app from one interface to
the same one, with both the client and the server on the same machine, distinct
from the test app. It will trigger immediately. "nc|od -Ax -tx1" will save you
a lot of time on the client side too BTW.

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 04:42:32 UTC
Permalink
Post by Jens Axboe
I'll give this a spin tomorrow as well. A hunch tells me that this is
likely a page reuse issue, that splice is getting the reference to the
buffer dropped before the data has really been transmitted. IOW, the
page is likely fine reaching the ->sendpage() bit, but will be reused
before the data has actually been transmitted. So once you get that far,
other random data from that page is going out.
I see the problem.

The socket pipes in net/core/skbuff.c use references on the skb
to hold down the memory in skb->head as well as the pages in the
skb.

Unfortunately, once the pipe is fed into sendpage we only use
page reference counting to pin down the memory. So as soon as
sendpage returns we drop the ref count on the skb, thus freeing
the memory in skb->head, which is yet to be transmitted.

Moral: Using page reference counts on skb->head is wrong.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Willy Tarreau
2009-01-07 06:38:59 UTC
Permalink
Post by Herbert Xu
Post by Jens Axboe
I'll give this a spin tomorrow as well. A hunch tells me that this is
likely a page reuse issue, that splice is getting the reference to the
buffer dropped before the data has really been transmitted. IOW, the
page is likely fine reaching the ->sendpage() bit, but will be reused
before the data has actually been transmitted. So once you get that far,
other random data from that page is going out.
I see the problem.
The socket pipes in net/core/skbuff.c use references on the skb
to hold down the memory in skb->head as well as the pages in the
skb.
Unfortunately, once the pipe is fed into sendpage we only use
page reference counting to pin down the memory. So as soon as
sendpage returns we drop the ref count on the skb, thus freeing
the memory in skb->head, which is yet to be transmitted.
So this means that anything relying on sendpage() is at risk ? What
I find really strange is that I can only reproduce the issue if the
spliced data come from a real interface. If they come from the loopback
or from a file, there is no problem. Maybe the ref counting is different
depending on the origin of the data ?
Post by Herbert Xu
Moral: Using page reference counts on skb->head is wrong.
My question will sound stupid to some of you, but wouldn't increasing
the refcount on those skb solve the problem (and decreasing it once
the skb is effectively sent) ?

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 09:52:18 UTC
Permalink
Post by Willy Tarreau
Post by Herbert Xu
Unfortunately, once the pipe is fed into sendpage we only use
page reference counting to pin down the memory. So as soon as
sendpage returns we drop the ref count on the skb, thus freeing
the memory in skb->head, which is yet to be transmitted.
So this means that anything relying on sendpage() is at risk ? What
No the bug is in the splice code. sendfile() and other sendpage
users are not affected.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-07 09:54:15 UTC
Permalink
Post by Herbert Xu
Post by Willy Tarreau
Post by Herbert Xu
Unfortunately, once the pipe is fed into sendpage we only use
page reference counting to pin down the memory. So as soon as
sendpage returns we drop the ref count on the skb, thus freeing
the memory in skb->head, which is yet to be transmitted.
So this means that anything relying on sendpage() is at risk ? What
No the bug is in the splice code. sendfile() and other sendpage
users are not affected.
but sendfile() now uses splice().

Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 11:52:42 UTC
Permalink
Post by Willy Tarreau
but sendfile() now uses splice().
Good point. However, this bug only triggers when inbound splice
is reinjected into sendpage so sendfile() users shouldn't see it.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe
2009-01-07 08:17:08 UTC
Permalink
Post by Herbert Xu
Post by Jens Axboe
I'll give this a spin tomorrow as well. A hunch tells me that this is
likely a page reuse issue, that splice is getting the reference to the
buffer dropped before the data has really been transmitted. IOW, the
page is likely fine reaching the ->sendpage() bit, but will be reused
before the data has actually been transmitted. So once you get that far,
other random data from that page is going out.
I see the problem.
The socket pipes in net/core/skbuff.c use references on the skb
to hold down the memory in skb->head as well as the pages in the
skb.
Unfortunately, once the pipe is fed into sendpage we only use
page reference counting to pin down the memory. So as soon as
sendpage returns we drop the ref count on the skb, thus freeing
the memory in skb->head, which is yet to be transmitted.
Moral: Using page reference counts on skb->head is wrong.
So my hunch was pretty close. The fix would seem to involve NOT calling
ops->release(pipe, buf) until we actually have an ACK on that data gone
out.
--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-07 11:29:06 UTC
Permalink
Hi Herbert.
Post by Herbert Xu
I see the problem.
The socket pipes in net/core/skbuff.c use references on the skb
to hold down the memory in skb->head as well as the pages in the
skb.
Unfortunately, once the pipe is fed into sendpage we only use
page reference counting to pin down the memory. So as soon as
sendpage returns we drop the ref count on the skb, thus freeing
the memory in skb->head, which is yet to be transmitted.
Moral: Using page reference counts on skb->head is wrong.
That would not happen without scatter-gather support on the interface,
date would be plain copied, and after Jarek's requst Willy confirmed
that corruption happens with all acceleration being turned off.
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 11:50:32 UTC
Permalink
Post by Evgeniy Polyakov
That would not happen without scatter-gather support on the interface,
date would be plain copied, and after Jarek's requst Willy confirmed
that corruption happens with all acceleration being turned off.
It will happen without scatter-gather support. The problem
is with skb->head, which is always there. In any case, SG can't
make any difference because the skb is an inbound one and most
drivers only produce linear packets.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-07 11:56:05 UTC
Permalink
Post by Herbert Xu
Post by Evgeniy Polyakov
That would not happen without scatter-gather support on the interface,
date would be plain copied, and after Jarek's requst Willy confirmed
that corruption happens with all acceleration being turned off.
It will happen without scatter-gather support. The problem
is with skb->head, which is always there. In any case, SG can't
make any difference because the skb is an inbound one and most
drivers only produce linear packets.
But it will not push pages into the stack, but copy them including copy
of the submitted head?
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 11:59:21 UTC
Permalink
Post by Evgeniy Polyakov
But it will not push pages into the stack, but copy them including copy
of the submitted head?
It will use a single page entry for skb->head with splice, see
skb_splice_bits.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-07 12:15:30 UTC
Permalink
Post by Herbert Xu
Post by Evgeniy Polyakov
But it will not push pages into the stack, but copy them including copy
of the submitted head?
It will use a single page entry for skb->head with splice, see
skb_splice_bits.
Looks like we are talking about different directions of the dataflow.
I meant that set of pages submitted into the sending part will be copied
if sending interface does not support acceleration, and thus it will
copy part of the page corresponding to the linear part of the skb prior
the transmission, so even if skb will be freed right after that call
(prior data transmission by the hardware), it should not affect copied
data.
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:22:38 UTC
Permalink
Post by Evgeniy Polyakov
Looks like we are talking about different directions of the dataflow.
I meant that set of pages submitted into the sending part will be copied
if sending interface does not support acceleration, and thus it will
copy part of the page corresponding to the linear part of the skb prior
the transmission, so even if skb will be freed right after that call
(prior data transmission by the hardware), it should not affect copied
data.
You must be looking at a different tcp.c than the one I've got
because mine clearly always uses skb frags in sendpage regardless
of SG support.

Yes we will linearize the packet in dev_queue_xmit but as soon
as the netdev stops the tx queue you'll get corruption.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:27:31 UTC
Permalink
Post by Herbert Xu
Yes we will linearize the packet in dev_queue_xmit but as soon
as the netdev stops the tx queue you'll get corruption.
OK that can't happen because the linearisation precedes that, but
ARP (or anything else that delays the skb prior to dev_queue_xmit)
can still cause corruption.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:30:36 UTC
Permalink
Post by Herbert Xu
OK that can't happen because the linearisation precedes that, but
ARP (or anything else that delays the skb prior to dev_queue_xmit)
can still cause corruption.
In fact it's probably TCP that's delaying the packet.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-07 12:37:41 UTC
Permalink
Post by Herbert Xu
Post by Evgeniy Polyakov
Looks like we are talking about different directions of the dataflow.
I meant that set of pages submitted into the sending part will be copied
if sending interface does not support acceleration, and thus it will
copy part of the page corresponding to the linear part of the skb prior
the transmission, so even if skb will be freed right after that call
(prior data transmission by the hardware), it should not affect copied
data.
You must be looking at a different tcp.c than the one I've got
because mine clearly always uses skb frags in sendpage regardless
of SG support.
Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
tcp_sendpage()? And then just feeds data into the stack the same way it
happens with send() i.e. by copying it.
Post by Herbert Xu
Yes we will linearize the packet in dev_queue_xmit but as soon
as the netdev stops the tx queue you'll get corruption.
That's perfectly valid when sendpage() returns and holds a reference to
the pages but not skb->head, so freed skb will free (and potentially
reuse) that area which has not been transmitted yet.
But without acceleration it will copy data and the whole original skb
may be freed without any problems.
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:42:46 UTC
Permalink
Post by Evgeniy Polyakov
Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
tcp_sendpage()? And then just feeds data into the stack the same way it
happens with send() i.e. by copying it.
Good point. Did he check GSO though? GSO will always enable SG
on the socket regardless of the netdev's setting. And if the device
started out with SG enabled then recent kernels will enable GSO
by default.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-07 12:46:29 UTC
Permalink
Post by Herbert Xu
Post by Evgeniy Polyakov
Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
tcp_sendpage()? And then just feeds data into the stack the same way it
happens with send() i.e. by copying it.
Good point. Did he check GSO though? GSO will always enable SG
on the socket regardless of the netdev's setting. And if the device
started out with SG enabled then recent kernels will enable GSO
by default.
Willy, what was the kernel you are tested no-accel behaviour and what
were the gso settings?
Can you add a simple single print into tcp_sendpage() to determine if
content was copied or fed into do_tcp_sendpages() otherwise?
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-07 12:55:35 UTC
Permalink
Post by Evgeniy Polyakov
Post by Herbert Xu
Post by Evgeniy Polyakov
Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
tcp_sendpage()? And then just feeds data into the stack the same way it
happens with send() i.e. by copying it.
Good point. Did he check GSO though? GSO will always enable SG
on the socket regardless of the netdev's setting. And if the device
started out with SG enabled then recent kernels will enable GSO
by default.
Willy, what was the kernel you are tested no-accel behaviour and what
were the gso settings?
kernel is 2.6.27.10 + a few patches (squashfs, etc..., nothing related to
this area). My ethtool is a bit old and does not report GSO. I'll download
and rebuild a more recent one and retest.
Post by Evgeniy Polyakov
Can you add a simple single print into tcp_sendpage() to determine if
content was copied or fed into do_tcp_sendpages() otherwise?
Yes, will do that too.

Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Herbert Xu
2009-01-07 12:57:34 UTC
Permalink
Post by Willy Tarreau
kernel is 2.6.27.10 + a few patches (squashfs, etc..., nothing related to
2.6.27 will enable GSO by default if SG is enabled at registration
time.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <***@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov
2009-01-07 13:02:59 UTC
Permalink
Post by Herbert Xu
Post by Willy Tarreau
kernel is 2.6.27.10 + a few patches (squashfs, etc..., nothing related to
2.6.27 will enable GSO by default if SG is enabled at registration
time.
So even if later sg was turned off without gso being turned off, it
will try doing fair page transfer and not falling back to the copy.

The simplest case is to also turn gso off, but with older ethtools it is
not possible and you hit the bug.

If I understood correctly Jarek's patch, he wants to allocate a page and
copy linear content of the skb there, so this will end up being the same
case as with splicing from the file. And while generally this may be a
good idea, but with usual 1.5k mtu this will copy every skb, which is
rather bad idea. Will have to think :)
--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 13:10:57 UTC
Permalink
On Wed, Jan 07, 2009 at 04:02:59PM +0300, Evgeniy Polyakov wrote:
...
Post by Evgeniy Polyakov
If I understood correctly Jarek's patch, he wants to allocate a page and
copy linear content of the skb there, so this will end up being the same
case as with splicing from the file. And while generally this may be a
good idea, but with usual 1.5k mtu this will copy every skb, which is
rather bad idea. Will have to think :)
As I mentioned earlier, I (poorly) tried to realize Changli's idea
only, and it's more to verify the reason still (after it failed with
SG method...).

Thanks,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-07 13:15:39 UTC
Permalink
Post by Jarek Poplawski
...
Post by Evgeniy Polyakov
If I understood correctly Jarek's patch, he wants to allocate a page and
copy linear content of the skb there, so this will end up being the same
case as with splicing from the file. And while generally this may be a
good idea, but with usual 1.5k mtu this will copy every skb, which is
rather bad idea. Will have to think :)
As I mentioned earlier, I (poorly) tried to realize Changli's idea
only, and it's more to verify the reason still (after it failed with
SG method...).
Jarek, since it now works when disabling both SG and GSO, we know that the
bug is triggered when passing through do_tcp_sendpage(). Thus, I will not
try your patch #3 yet, but rather wait for fix candidates to try.

Thanks!
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 13:22:58 UTC
Permalink
Post by Willy Tarreau
Post by Jarek Poplawski
...
Post by Evgeniy Polyakov
If I understood correctly Jarek's patch, he wants to allocate a page and
copy linear content of the skb there, so this will end up being the same
case as with splicing from the file. And while generally this may be a
good idea, but with usual 1.5k mtu this will copy every skb, which is
rather bad idea. Will have to think :)
As I mentioned earlier, I (poorly) tried to realize Changli's idea
only, and it's more to verify the reason still (after it failed with
SG method...).
Jarek, since it now works when disabling both SG and GSO, we know that the
bug is triggered when passing through do_tcp_sendpage(). Thus, I will not
try your patch #3 yet, but rather wait for fix candidates to try.
Sure! Congratulations for debugging to you and Evgeniy!

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-01-07 14:01:41 UTC
Permalink
On Wed, Jan 07, 2009 at 01:22:58PM +0000, Jarek Poplawski wrote:
...
Post by Jarek Poplawski
Sure! Congratulations for debugging to you and Evgeniy!
...and Herbert! (I missed the beginning of this GSO trail...)

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-06 18:50:12 UTC
Permalink
Hi Evgeniy,
Post by Evgeniy Polyakov
Hi Willy.
Unfortunately I can not work on this problem right now, but will do if
things are not resolved after Jan 11 (long vacations will be finished in
Russia and I will return to my test machines :) But right now I have
one quesstion: I read several times your mail but still can not figure
out if receiving or sending side is broken?
I.e. can you splice from socket into the file, check the file, and then
splice to the another socket and check received data to find out which
side is broken? Or did I just missed that in the problem description?
Maybe I wasn't very clear. It's only when splicing from a socket to another
socket that it breaks. Splicing from a file to a socket is OK (sendfile is
OK too BTW). Splicing from a socket to a file is OK.

And BTW, the receiving side must be on a real interface, not loopback. Here
are my observations :
- splice data from lo to lo => no corruption at all
- splice data from lo to eth0 => no corruption at all
- splice data from eth0 to lo => occasional corruption
- splice data from eth0 to eth0 => massive corruption

I think this may ring a bell for someone.
Post by Evgeniy Polyakov
Thanks a lot for the test application, it will greatly help to resolve
this issue.
I figured it was an absolute necessity. The original code in my proxy is in
an experimental state and far too hard to debug for these purposes. It was
enough to detect the problem, but I could run a lot more tests with this
small test app ! An who know, maybe it will serve as an example for
non-blocking splice ;-)

Cheers,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Lennert Buytenhek
2009-01-19 08:39:21 UTC
Permalink
Post by Willy Tarreau
Post by Evgeniy Polyakov
Thanks a lot for the test application, it will greatly help to resolve
this issue.
I figured it was an absolute necessity. The original code in my proxy is in
an experimental state and far too hard to debug for these purposes. It was
enough to detect the problem, but I could run a lot more tests with this
small test app ! An who know, maybe it will serve as an example for
non-blocking splice ;-)
:-)

Just to throw some more (hacky) example code into the pool, below is
an echo server that I hacked up to test nonblocking splice(). (You'll
need sf.net/projects/libivykis to use it.) I also have a splice()
discard server and a patch to my intercept-connection-via-iptables-and-
forward-it-to-a-remote-SOCKS5-server-to-deal-with-crappy-VPNs app to
use splice() somewhere.

My main annoyances with splice(2) are/were:

1. -EAGAIN return on splice from socket/pipe to socket/pipe doesn't
directly tell you whether the source ran out of data or the
destination can't accept more data, which means you can't e.g. use
epoll in edge triggered mode without jumping through some minor
number of extra hoops. (For a pipe you can keep track of how many
bytes are in it by hand, but for a socket->pipe splice -EAGAIN return
you'll have to assume that the pipe is full if there are >0 bytes in
it.)

2. Because of (1), and because when splicing from a socket to a pipe
it returns after the first bit of data (you mentioned this as well),
you don't know at that point whether your pipe is full or not.

3. Always returns -EAGAIN even if there was a FIN or error on the
source socket. (Now fixed.)



#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <iv.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

struct connection {
struct iv_fd sock;
int pfd[2];
int pipe_bytes;
int saw_fin;
};


static void conn_kill(struct connection *conn)
{
iv_unregister_fd(&conn->sock);
close(conn->sock.fd);
close(conn->pfd[0]);
close(conn->pfd[1]);
free(conn);
}

static void conn_pollin(void *_conn);
static void conn_pollout(void *_conn);
static void conn_pollerr(void *_conn);

static void conn_pollin(void *_conn)
{
struct connection *conn = (struct connection *)_conn;
int ret;

while (1) {
ret = splice(conn->sock.fd, NULL, conn->pfd[1], NULL,
1048576, SPLICE_F_NONBLOCK);
if (ret <= 0)
break;

if (conn->pipe_bytes == 0)
iv_fd_set_handler_out(&conn->sock, conn_pollout);
conn->pipe_bytes += ret;
}

if (ret == 0) {
conn->saw_fin = 1;
iv_fd_set_handler_in(&conn->sock, NULL);
if (conn->pipe_bytes == 0) {
shutdown(conn->sock.fd, SHUT_WR);
conn_kill(conn);
}
return;
}

if (errno == EAGAIN && conn->pipe_bytes > 0) {
int bytes_sock = 1;

ioctl(conn->sock.fd, FIONREAD, &bytes_sock);
if (bytes_sock > 0)
iv_fd_set_handler_in(&conn->sock, NULL);
} else if (errno != EAGAIN) {
conn_kill(conn);
}
}

static void conn_pollout(void *_conn)
{
struct connection *conn = (struct connection *)_conn;
int ret;

if (!conn->pipe_bytes) {
fprintf(stderr, "conn_pollout: no pipe bytes!\n");
abort();
}

do {
ret = splice(conn->pfd[0], NULL, conn->sock.fd, NULL,
conn->pipe_bytes, 0);
if (ret <= 0)
break;

conn->pipe_bytes -= ret;
if (!conn->saw_fin)
iv_fd_set_handler_in(&conn->sock, conn_pollin);
} while (conn->pipe_bytes);

if (ret == 0) {
fprintf(stderr, "pollout: splice returned zero!\n");
abort();
}

if (errno == EAGAIN && conn->pipe_bytes == 0) {
iv_fd_set_handler_out(&conn->sock, NULL);
if (conn->saw_fin) {
shutdown(conn->sock.fd, SHUT_WR);
conn_kill(conn);
}
} else if (errno != EAGAIN) {
conn_kill(conn);
}
}

static void conn_pollerr(void *_conn)
{
struct connection *conn = (struct connection *)_conn;
socklen_t len;
int ret;

len = sizeof(ret);
if (getsockopt(conn->sock.fd, SOL_SOCKET, SO_ERROR, &ret, &len) < 0) {
fprintf(stderr, "pollerr: error %d while "
"getsockopt(SO_ERROR)\n", errno);
abort();
}

if (ret == 0) {
fprintf(stderr, "pollerr: no error?!\n");
abort();
}

conn_kill(conn);
}


static struct iv_fd listening_socket;

static void got_connection(void *_dummy)
{
struct sockaddr_in addr;
struct connection *conn;
socklen_t addrlen;
int ret;

addrlen = sizeof(addr);
ret = accept(listening_socket.fd, (struct sockaddr *)&addr, &addrlen);
if (ret < 0)
return;

conn = malloc(sizeof(*conn));
if (conn == NULL) {
fprintf(stderr, "memory squeeze\n");
abort();
}

if (pipe(conn->pfd) < 0) {
fprintf(stderr, "pipe squeeze\n");
abort();
}

INIT_IV_FD(&conn->sock);
conn->sock.fd = ret;
conn->sock.cookie = (void *)conn;
conn->sock.handler_in = conn_pollin;
conn->sock.handler_err = conn_pollerr;
iv_register_fd(&conn->sock);

conn->pipe_bytes = 0;
conn->saw_fin = 0;
}

static int open_listening_socket(void)
{
struct sockaddr_in addr;
int sock;

sock = socket(AF_INET, SOCK_STREAM, 0);
if (sock < 0) {
perror("socket");
return 1;
}

addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons(6969);
if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
perror("bind");
return 1;
}

listen(sock, 5);

INIT_IV_FD(&listening_socket);
listening_socket.fd = sock;
listening_socket.cookie = NULL;
listening_socket.handler_in = got_connection;
iv_register_fd(&listening_socket);

return 0;
}

int main()
{
iv_init();
open_listening_socket();
iv_main();

return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Willy Tarreau
2009-01-19 09:53:21 UTC
Permalink
Post by Lennert Buytenhek
Post by Willy Tarreau
Post by Evgeniy Polyakov
Thanks a lot for the test application, it will greatly help to resolve
this issue.
I figured it was an absolute necessity. The original code in my proxy is in
an experimental state and far too hard to debug for these purposes. It was
enough to detect the problem, but I could run a lot more tests with this
small test app ! An who know, maybe it will serve as an example for
non-blocking splice ;-)
:-)
Just to throw some more (hacky) example code into the pool, below is
an echo server that I hacked up to test nonblocking splice(). (You'll
need sf.net/projects/libivykis to use it.) I also have a splice()
discard server and a patch to my intercept-connection-via-iptables-and-
forward-it-to-a-remote-SOCKS5-server-to-deal-with-crappy-VPNs app to
use splice() somewhere.
1. -EAGAIN return on splice from socket/pipe to socket/pipe doesn't
directly tell you whether the source ran out of data or the
destination can't accept more data, which means you can't e.g. use
epoll in edge triggered mode without jumping through some minor
number of extra hoops. (For a pipe you can keep track of how many
bytes are in it by hand, but for a socket->pipe splice -EAGAIN return
you'll have to assume that the pipe is full if there are >0 bytes in
it.)
I proceeded the same way : if EAGAIN and data still in the pipe, then
stop polling.
Post by Lennert Buytenhek
2. Because of (1), and because when splicing from a socket to a pipe
it returns after the first bit of data (you mentioned this as well),
you don't know at that point whether your pipe is full or not.
In fact this is fixed now. tcp_splice_read() returns all available data,
which somewhat hides problem #1. I'm running with 23 kB in a push/pull
method all the time, so it remains optimal.
Post by Lennert Buytenhek
3. Always returns -EAGAIN even if there was a FIN or error on the
source socket. (Now fixed.)
Yes I saw your fix, it was indeed very annoying because the only workaround
I found was to perform an recv(MSG_PEEK) on the socket after each EAGAIN
to check whether the connection was closed or not.

For these reasons, I'd really love to see the few recent fixes backported
to -stable ASAP. It will boost splice() adoption among products.

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Continue reading on narkive:
Loading...