Queue with wait-free enqueue, blocking dequeue, splice

Discussion:

Mathieu Desnoyers

2014-10-18 11:48:12 UTC

Hi Jesper,

(re-send after getting the right netdev ML address)

Following our LPC discussion on lock-free queue algorithms
for qdisc, here is some info on the wfcqueue implementation
found in Userspace RCU. See http://urcu.so for info and
git repository.

Here is the wfcqueue ported to the Linux kernel I sent last
year as RFC:
https://lkml.org/lkml/2013/3/14/289

I'm very interested to learn if it fits well for your
use-case,

Feedback is welcome,

Thanks!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Jesper Dangaard Brouer

2014-10-20 14:02:37 UTC

Permalink

Post by Mathieu Desnoyers
Following our LPC discussion on lock-free queue algorithms
for qdisc, here is some info on the wfcqueue implementation
found in Userspace RCU. See http://urcu.so for info and
git repository.

Thank for following up on our very interesting discussions.

I've started with the more simple variant "urcu/static/wfqueue.h" to
understand the concepts. And I'm now reading wfcqueue.h, which I guess
it replacing wfqueue.h.

Post by Mathieu Desnoyers
Here is the wfcqueue ported to the Linux kernel I sent last
https://lkml.org/lkml/2013/3/14/289
I'm very interested to learn if it fits well for your
use-case,

Does this wfcqueue API support bulk dequeue? (A feature needed for the
lock-less qdisc implementation, else it cannot compete with our new
bulk dequeue strategy).

AFAIK your queue implementation is a CAS-based, Wait-Free on enqueue,
but Lock-Free on dequeue with the potential for waiting/blocking on
a enqueue processes.
I'm not 100% sure, that we want this behavior for the qdisc system.

I can certainly use the wfcq_empty() check, but I guess I need to
maintain a separate counter to maintain the qdisc limit, right?
(I would use the approximate/split counter API percpu_counter to keep
this scalable, and wfcq_empty() would provide an accurate empty check)

I think, we/I should start micro benchmarking the different approaches.
As our time budget is only 67.2ns
http://netoptimizer.blogspot.dk/2014/05/the-calculations-10gbits-wirespeed.html
(or bulking tricks artificially "increase" this budget)

The motivation behind this lockless qdisc is, the current qdisc locking
cost 48ns, see slide 9/16 titled "Qdisc locking is nasty":
http://people.netfilter.org/hawk/presentations/LinuxPlumbers2014/performance_tx_qdisc_bulk_LPC2014.pdf

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

Mathieu Desnoyers

2014-10-21 00:04:10 UTC

Permalink

----- Original Message -----

Sent: Monday, October 20, 2014 10:02:37 AM
Subject: Re: Queue with wait-free enqueue, blocking dequeue, splice
On Sat, 18 Oct 2014 11:48:12 +0000 (UTC) Mathieu Desnoyers

Thank for following up on our very interesting discussions.
I've started with the more simple variant "urcu/static/wfqueue.h" to
understand the concepts. And I'm now reading wfcqueue.h, which I guess
it replacing wfqueue.h.

Yep, this is correct.

Post by Mathieu Desnoyers
Here is the wfcqueue ported to the Linux kernel I sent last
https://lkml.org/lkml/2013/3/14/289
I'm very interested to learn if it fits well for your
use-case,

Does this wfcqueue API support bulk dequeue? (A feature needed for the
lock-less qdisc implementation, else it cannot compete with our new
bulk dequeue strategy).

Yes, it does. See the "__wfcq_splice" API.

AFAIK your queue implementation is a CAS-based, Wait-Free on enqueue,
but Lock-Free on dequeue with the potential for waiting/blocking on
a enqueue processes.
I'm not 100% sure, that we want this behavior for the qdisc system.

It's actually xchg-based (not CAS-based). It is indeed wait-free
in the strictest sense of the term on enqueue (at least on x86,
some other arch implement xchg using ll/sc, which is not strictly
wait-free).

On dequeue, it can busy-wait for a short while that the enqueue
completes. Note that in kernel, since we disable preemption during
enqueue, the dequeue does not have to ever block, just busy-looping
is OK, since the longest thing that could nest over the enqueue
is possibly an interrupt and softirq. So yes, I guess the dequeue
would qualify as lock-free.

I can certainly use the wfcq_empty() check,

Not sure why you would want to use it, considering that the dequeue
operation implies it. If there is nothing to dequeue, we just return
immediately. Dequeue operation does not block on empty queue. It
just busy waits if it happen to see the queue in an intermediate
state of the enqueue operation (which is very short, few instructions
at most, with preemption disabled).

but I guess I need to
maintain a separate counter to maintain the qdisc limit, right?
(I would use the approximate/split counter API percpu_counter to keep
this scalable, and wfcq_empty() would provide an accurate empty check)

Yes for split counters, not sure why you need the empty check explicitly
in your use-case though.

I think, we/I should start micro benchmarking the different approaches.
As our time budget is only 67.2ns
http://netoptimizer.blogspot.dk/2014/05/the-calculations-10gbits-wirespeed.html
(or bulking tricks artificially "increase" this budget)
The motivation behind this lockless qdisc is, the current qdisc locking
http://people.netfilter.org/hawk/presentations/LinuxPlumbers2014/performance_tx_qdisc_bulk_LPC2014.pdf

Chances are that the scheme:

__wfcq_enqueue() on enqueue

__wfcq_splice() for bulk dequeue

__wfcq_first()
__wfcq_next() for iteration on the splice dest queue

Would be must faster than the ton of locks you use currently.
The nice part about using just enqueue and splice is that you
don't need locking on the queue at all. Iteration on the
destination queue can be done by a single thread, so no need
for explicit locking there neither.

By the way, you should look at the kernel wfcq implementation
I posted earlier. It's more compact that the one in userspace RCU
because we don't need to support blocking/nonblocking modes,
because we have the luxury of disabling preemption in the kernel.

I look forward to see the numbers,

Thanks!

Mathieu

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Jesper Dangaard Brouer

2014-10-21 11:48:02 UTC

Permalink

[...]

Post by Mathieu Desnoyers

Post by Jesper Dangaard Brouer
I can certainly use the wfcq_empty() check,

Post by Jesper Dangaard Brouer
but I guess I need to
maintain a separate counter to maintain the qdisc limit, right?
(I would use the approximate/split counter API percpu_counter to keep
this scalable, and wfcq_empty() would provide an accurate empty check)

Yes for split counters, not sure why you need the empty check explicitly
in your use-case though.

In case the qdisc is empty, we avoid/bypass the enqueue + dequeue phase
and instead transmit the packet directly.

Iif the flag TCQ_F_CAN_BYPASS is set.
https://github.com/torvalds/linux/blob/master/net/core/dev.c#L2799
But I'm not 100% sure that we can set this flag on a lock-less qdisc.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

Mathieu Desnoyers

2014-10-21 12:02:38 UTC

Permalink

----- Original Message -----

Sent: Monday, October 20, 2014 8:04:10 PM
Subject: Re: Queue with wait-free enqueue, blocking dequeue, splice
----- Original Message -----

Sent: Monday, October 20, 2014 10:02:37 AM
Subject: Re: Queue with wait-free enqueue, blocking dequeue, splice

[...]

It's actually xchg-based (not CAS-based). It is indeed wait-free
in the strictest sense of the term on enqueue (at least on x86,
some other arch implement xchg using ll/sc, which is not strictly
wait-free).
On dequeue, it can busy-wait for a short while that the enqueue
completes. Note that in kernel, since we disable preemption during
enqueue, the dequeue does not have to ever block, just busy-looping
is OK, since the longest thing that could nest over the enqueue
is possibly an interrupt and softirq. So yes, I guess the dequeue
would qualify as lock-free.

Scratch this last part about dequeue lock-freedom: dequeue can
busy-wait endlessly long if an enqueue is, for instance, interrupted
by a dequeue operation that would sit within an interrupt handler.
Dequeue is therefore not "lock-free". It is not even obstruction-free.

This just means that you need to be aware of this if you use dequeue
in an execution context that can interrupt enqueue.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com