Discussion: irq disable in __netdev_alloc_frag()?
Alexei Starovoitov
2014-10-23 00:15:19 UTC
Hi Eric,

in the commit 6f532612cc24 ("net: introduce netdev_alloc_frag()")
you mentioned that the reason to disable interrupts
in __netdev_alloc_frag() is:
"- Must be IRQ safe (non NAPI drivers can use it)"

Is there a way to do this conditionally?

Without it I see 10% performance gain for my RX tests
(from 6.9Mpps to 7.7Mpps) and __netdev_alloc_frag()
itself goes from 6.6% to 2.1%
(popf seems to be quite costly)

Thanks
Alexei
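
For context, the masking in question looks roughly like this; a simplified sketch of the 3.x-era __netdev_alloc_frag() in net/core/skbuff.c, with the refill logic elided and field names approximate:

    /*
     * Simplified sketch of __netdev_alloc_frag(); the refill/recycle
     * logic is elided and field names are approximate. The
     * local_irq_save()/local_irq_restore() pair is the masking under
     * discussion: on x86 the restore executes the popf that shows up
     * in Alexei's profile.
     */
    void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
    {
            struct netdev_alloc_cache *nc;
            unsigned long flags;
            void *data = NULL;

            local_irq_save(flags);          /* mask hard irqs on this CPU */
            nc = this_cpu_ptr(&netdev_alloc_cache);
            /* ... refill nc->frag.page if needed, then carve fragsz
             * bytes off the per-CPU page and advance the offset ... */
            local_irq_restore(flags);       /* popf on x86 */
            return data;
    }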
Eric Dumazet
2014-10-23 01:52:40 UTC
Post by Alexei Starovoitov
Hi Eric,
in the commit 6f532612cc24 ("net: introduce netdev_alloc_frag()")
you mentioned that the reason to disable interrupts
"- Must be IRQ safe (non NAPI drivers can use it)"
Is there a way to do this conditionally?
Without it I see 10% performance gain for my RX tests
(from 6.9Mpps to 7.7Mpps) and __netdev_alloc_frag()
itself goes from 6.6% to 2.1%
(popf seems to be quite costly)
Well, your driver is probably a NAPI one, but you still need to
mask irqs, or else remove all non-NAPI drivers from Linux.

__netdev_alloc_frag() (via __netdev_alloc_skb()) is used by all drivers.

Problem is, __netdev_alloc_frag() is generally deep inside the call
chain, so using a private pool might carry quite an overhead.

The same could be said for skb_queue_head() / skb_queue_tail() /
sock_queue_rcv_skb(): many callers don't need to block irqs.
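
skb_queue_tail() shows the same pattern; this is essentially the net/core/skbuff.c implementation, reproduced for illustration:

    /*
     * The list lock is taken with irqs masked because some callers run
     * in hard irq context; callers that never do still pay for the
     * pushf/cli/popf.
     */
    void skb_queue_tail(struct sk_buff_head *list, struct sk_buff *newsk)
    {
            unsigned long flags;

            spin_lock_irqsave(&list->lock, flags);
            __skb_queue_tail(list, newsk);
            spin_unlock_irqrestore(&list->lock, flags);
    }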
Alexei Starovoitov
2014-10-23 02:22:29 UTC
Post by Eric Dumazet
Post by Alexei Starovoitov
Hi Eric,
in the commit 6f532612cc24 ("net: introduce netdev_alloc_frag()")
you mentioned that the reason to disable interrupts
"- Must be IRQ safe (non NAPI drivers can use it)"
Is there a way to do this conditionally?
Without it I see 10% performance gain for my RX tests
(from 6.9Mpps to 7.7Mpps) and __netdev_alloc_frag()
itself goes from 6.6% to 2.1%
(popf seems to be quite costly)
Well, your driver is probably a NAPI one, but you still need to
mask irqs, or else remove all non-NAPI drivers from Linux.
Yeah, the 10G+ NICs I care about are all NAPI :)
Post by Eric Dumazet
__netdev_alloc_frag() (via __netdev_alloc_skb()) is used by all drivers.
Problem is, __netdev_alloc_frag() is generally deep inside the call
chain, so using a private pool might carry quite an overhead.
Yes, I was thinking: since dev is already passed
into __netdev_alloc_skb(), we can check whether the dev is
registered with NAPI via dev->napi_list and, if so,
tell the inner __netdev_alloc_frag() to skip irq disabling...

I don't know about skb_queue_head() and friends.
I'm only looking at pure RX for now. One challenge at a time.
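
A hypothetical sketch of that check; dev_is_napi() is an illustrative name, not an existing kernel helper, though dev->napi_list is a real struct net_device field:

    /*
     * Hypothetical sketch of the proposed check (never merged).
     * dev->napi_list holds the NAPI instances registered via
     * netif_napi_add(), so it is non-empty for NAPI drivers.
     */
    static inline bool dev_is_napi(const struct net_device *dev)
    {
            return dev && !list_empty(&dev->napi_list);
    }

    /* __netdev_alloc_skb() would pass this down so that the inner
     * __netdev_alloc_frag() could skip local_irq_save()/restore(). */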
Eric Dumazet
2014-10-23 03:48:06 UTC
Post by Alexei Starovoitov
Yes, I was thinking: since dev is already passed
into __netdev_alloc_skb(), we can check whether the dev is
registered with NAPI via dev->napi_list and, if so,
tell the inner __netdev_alloc_frag() to skip irq disabling...
This does not matter. The problem is not _this_ device; the problem is
that another device might trigger a hard irq, and that hard irq could
mess up your data.
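
Sketched as an interleaving, with approximate field names:

    /*
     * netdev_alloc_cache is per-CPU and shared by every device on that
     * CPU, so checking _this_ device's NAPI status proves nothing about
     * who else can touch the cache:
     *
     *   NAPI softirq, device A            hard irq, non-NAPI device B
     *   ----------------------            ----------------------------
     *   nc = this_cpu_ptr(...);
     *   off = nc->frag.offset;
     *                                     netdev_alloc_skb(dev_b, ...)
     *                                       -> advances nc->frag.offset
     *   nc->frag.offset = off + fragsz;
     *
     * Both paths now hand out overlapping fragments.
     */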
Alexander Duyck
2014-10-23 03:19:59 UTC
Post by Eric Dumazet
Post by Alexei Starovoitov
Hi Eric,
in the commit 6f532612cc24 ("net: introduce netdev_alloc_frag()")
you mentioned that the reason to disable interrupts
"- Must be IRQ safe (non NAPI drivers can use it)"
Is there a way to do this conditionally?
Without it I see 10% performance gain for my RX tests
(from 6.9Mpps to 7.7Mpps) and __netdev_alloc_frag()
itself goes from 6.6% to 2.1%
(popf seems to be quite costly)
Well, your driver is probably a NAPI one, but you still need to
mask irqs, or else remove all non-NAPI drivers from Linux.
__netdev_alloc_frag() (via __netdev_alloc_skb()) is used by all drivers.
Problem is, __netdev_alloc_frag() is generally deep inside the call
chain, so using a private pool might carry quite an overhead.
The same could be said for skb_queue_head() / skb_queue_tail() /
sock_queue_rcv_skb(): many callers don't need to block irqs.
Couldn't __netdev_alloc_frag() be forked into two functions, one that is
only called from inside the NAPI context and one that is called for all
other contexts? It would mean having to double the number of pages
being held per CPU, but I would think something like that would be doable.

Thanks,

Alex
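
A minimal sketch of that split; __alloc_frag() and napi_alloc_cache are illustrative names, though a split along these lines later landed upstream as napi_alloc_frag():

    /* Two per-CPU caches instead of one; __alloc_frag() stands in for
     * the shared carve-a-fragment-off-the-page logic. */
    static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
    static DEFINE_PER_CPU(struct netdev_alloc_cache, napi_alloc_cache);

    /* Callable from any context, including hard irq: must mask irqs. */
    void *netdev_alloc_frag(unsigned int fragsz)
    {
            unsigned long flags;
            void *data;

            local_irq_save(flags);
            data = __alloc_frag(this_cpu_ptr(&netdev_alloc_cache), fragsz);
            local_irq_restore(flags);
            return data;
    }

    /* Callable only from NAPI (softirq) context: hard irq handlers
     * never touch napi_alloc_cache, so no masking is needed. */
    void *napi_alloc_frag(unsigned int fragsz)
    {
            return __alloc_frag(this_cpu_ptr(&napi_alloc_cache), fragsz);
    }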
Eric Dumazet
2014-10-23 03:51:16 UTC
Post by Alexander Duyck
Couldn't __netdev_alloc_frag() be forked into two functions, one that is
only called from inside the NAPI context and one that is called for all
other contexts? It would mean having to double the number of pages
being held per CPU, but I would think something like that would be doable.
Possibly, but this looks like code bloat to me.

On my hosts, this hard irq masking is pure noise.

What CPU are you using, Alexander?

The same could be done for some kmem_cache_alloc() cases: SLAB uses hard
irq masking even though some caches are never used from hard irq context.
Eric Dumazet
2014-10-23 03:56:10 UTC
Post by Eric Dumazet
Post by Alexander Duyck
Couldn't __netdev_alloc_frag() be forked into two functions, one that is
only called from inside the NAPI context and one that is called for all
other contexts? It would mean having to double the number of pages
being held per CPU, but I would think something like that would be doable.
Possibly, but this looks like code bloat to me.
On my hosts, this hard irq masking is pure noise.
What CPU are you using, Alexander?
Sorry, the question was for Alexei ;)
Alexei Starovoitov
2014-10-23 04:29:22 UTC
Post by Eric Dumazet
Post by Eric Dumazet
Post by Alexander Duyck
Couldn't __netdev_alloc_frag() be forked into two functions, one that is
only called from inside the NAPI context and one that is called for all
other contexts? It would mean having to double the number of pages
being held per CPU, but I would think something like that would be doable.
Possibly, but this looks like code bloat to me.
On my hosts, this hard irq masking is pure noise.
What CPU are you using, Alexander?
Sorry, the question was for Alexei ;)
My numbers were from the just-released Xeon E5-1630 v3,
which is Haswell-EP.
I think Alexander's idea is worth pursuing.
Code-wise the bloat is minimal: one extra netdev_alloc_cache
and a few 'if's to choose which one to use.
Most of the time only one of them will be used, since
systems with both NAPI and non-NAPI NICs are very rare.
I think it should help the virtualization case as well,
since virtio_net is already NAPI.
Eric Dumazet
2014-10-23 05:14:04 UTC
Post by Alexei Starovoitov
My numbers were from the just-released Xeon E5-1630 v3,
which is Haswell-EP.
I think Alexander's idea is worth pursuing.
Code-wise the bloat is minimal: one extra netdev_alloc_cache
and a few 'if's to choose which one to use.
Most of the time only one of them will be used, since
systems with both NAPI and non-NAPI NICs are very rare.
I think it should help the virtualization case as well,
since virtio_net is already NAPI.
Really, I doubt this will make any difference in real workloads
where the bottleneck is memory bandwidth, not the time taken by
pushf/cli/popf.

What difference do you see on a real workload like netperf TCP_RR or
UDP_RR?
Alexei Starovoitov
2014-10-23 06:12:21 UTC
Post by Eric Dumazet
Really, I doubt this will make any difference in real workloads
where the bottleneck is memory bandwidth, not the time taken by
pushf/cli/popf.
What difference do you see on a real workload like netperf TCP_RR or
UDP_RR?
There are different workloads; L2/L3 switching is very real
as well. For TCP_RR on the host the difference will be
minimal, since the time is spent elsewhere. For cases
that don't use the TCP stack but do pure L2/L3 processing,
all the improvements add up. IMO the kernel should be
optimized for all use cases when possible.
