Discussion:
rps: question
jamal
2010-02-07 18:42:02 UTC
Hi Tom,

First off: Kudos on the numbers you are seeing; they are
impressive. Do you have any numbers on a forwarding path test?

My first impression when I saw the numbers was one of surprise.
Back in the days when we tried to split stack processing the way
you did (it was one of the experiments on early NAPI), IPIs were
_damn_ expensive. What changed in current architectures that makes
this more palatable? IPIs are still synchronous AFAIK (and the more
IPI receivers there are, the worse the ACK latency). Did you test this
across other archs, or on, say, 3-4 year old machines?

cheers,
jamal

Tom Herbert
2010-02-08 05:58:32 UTC
Post by jamal
Hi Tom,
First off: Kudos on the numbers you are seeing; they are
impressive. Do you have any numbers on a forwarding path test?
I don't have specific numbers, although we are using this in an
application doing forwarding, and the numbers seem in line with what we
see for an end host.
Post by jamal
My first impression when I saw the numbers was one of surprise.
Back in the days when we tried to split stack processing the way
you did (it was one of the experiments on early NAPI), IPIs were
_damn_ expensive. What changed in current architectures that makes
this more palatable? IPIs are still synchronous AFAIK (and the more
IPI receivers there are, the worse the ACK latency). Did you test this
across other archs, or on, say, 3-4 year old machines?
No, the cost of the IPIs hasn't been an issue for us performance-wise.
We are using them extensively-- up to one per core per device
interrupt.

We're calling __smp_call_function_single, which is asynchronous in that
the caller provides the call structure and there is no waiting for
the IPI to complete. A flag in each call structure is set while the
IPI is in progress; this prevents simultaneous use of a call
structure.
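
For illustration, a minimal sketch of that pattern against the kernel API
of that era; the rps_queue structure and RPS_IPI_PENDING flag are made-up
names for the example, while __smp_call_function_single(),
raise_softirq_irqoff() and the bitops are real interfaces:

/*
 * Illustrative sketch only: an asynchronous cross-CPU kick using a
 * per-queue call_single_data guarded by an "IPI pending" flag, as
 * described above.  rps_queue and RPS_IPI_PENDING are made-up names.
 */
#include <linux/smp.h>
#include <linux/interrupt.h>
#include <linux/bitops.h>

struct rps_queue {
	struct call_single_data csd;	/* reused for every IPI to this CPU */
	unsigned long flags;
};
#define RPS_IPI_PENDING	0

/* Runs on the remote CPU: kick packet processing, then allow csd reuse. */
static void rps_trigger_softirq(void *data)
{
	struct rps_queue *q = data;

	raise_softirq_irqoff(NET_RX_SOFTIRQ);
	clear_bit(RPS_IPI_PENDING, &q->flags);
}

static void rps_notify_cpu(int cpu, struct rps_queue *q)
{
	/* Only issue a new IPI once the previous one has completed;
	 * this is the flag that prevents simultaneous use of the csd. */
	if (test_and_set_bit(RPS_IPI_PENDING, &q->flags))
		return;

	q->csd.func = rps_trigger_softirq;
	q->csd.info = q;
	q->csd.flags = 0;

	/* wait == 0: the caller does not block for the remote handler. */
	__smp_call_function_single(cpu, &q->csd, 0);
}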

I haven't seen any architecture-specific issues with the IPIs; I
believe they are completing in < 2 usecs on the platforms we're running
(some Opteron systems that are over 3 years old).

Tom
jamal
2010-02-08 15:09:08 UTC
Post by Tom Herbert
I don't have specific numbers, although we are using this in an
application doing forwarding, and the numbers seem in line with what we
see for an end host.
When I get the chance I will give it a run. I have access to an i7
somewhere. It seems like I need some specific NICs?
Post by Tom Herbert
No, the cost of the IPIs hasn't been an issue for us performance-wise.
We are using them extensively-- up to one per core per device
interrupt.
OK, so you are not going across cores then? I wonder if there's
some new optimization to reduce IPI latency when both sender and
receiver reside on the same core?
Post by Tom Herbert
We're calling __smp_call_function_single, which is asynchronous in that
the caller provides the call structure and there is no waiting for
the IPI to complete. A flag in each call structure is set while the
IPI is in progress; this prevents simultaneous use of a call
structure.
It is possible that this is just an abstraction hiding the details...
AFAIK, IPIs are synchronous: the remote CPU has to ack with another IPI
while the issuing CPU waits for the ack IPI and then returns.
Post by Tom Herbert
I haven't seen any architecture-specific issues with the IPIs; I
believe they are completing in < 2 usecs on the platforms we're running
(some Opteron systems that are over 3 years old).
2 usecs ain't bad (at 10G you only accumulate a few packets while
stalled -- a 1500-byte frame takes roughly 1.2 usecs on the wire). I
think we saw much higher values.
I was asking about different architectures because I tried something
equivalent as recently as 2 years ago on a MIPS multicore and the
forwarding results were horrible.
IPIs flush the processor pipeline so they ain't cheap - but that may
vary depending on the architecture. Someone more knowledgeable should
be able to give better insights.
My suspicion is that at a low transaction rate (with appropriate traffic
patterns) you will see much higher latency, since you will
be sending more IPIs...

cheers,
jamal

jamal
2010-04-14 11:53:06 UTC
Post by jamal
Post by Tom Herbert
I don't have specific numbers, although we are using this in an
application doing forwarding, and the numbers seem in line with what we
see for an end host.
When I get the chance I will give it a run. I have access to an i7
somewhere. It seems like I need some specific NICs?
I did step #0 last night on an i7 (single Nehalem). I think more than
anything I was impressed by the Nehalem's excellent caching system.
Robert, I am almost tempted to say skb recycling performance will be
excellent on this machine, given that the cost of a cache miss is much
lower than on previous-generation hardware.

My test was simple: IRQ affinity on cpu0 (core 0) and RPS redirection to
cpu1 (core 1); I also tried redirecting to different SMT threads (aka CPUs)
on different cores, with similar results. I baselined against no RPS
being used and against a kernel which didn't have any RPS config on.
[BTW, I had to hand-edit the .config since I couldn't do it from
menuconfig (is there any reason for it to be so?)]
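
For reference, a minimal sketch of that kind of setup, assuming the
standard /proc/irq/<n>/smp_affinity knob and the rps_cpus sysfs file added
by the RPS patches; the interface name (eth0), IRQ number (25) and CPU
masks are placeholders, not the values actually used in this test:

/*
 * Minimal sketch: pin the NIC interrupt to cpu0 and steer RPS receive
 * processing to cpu1.  eth0, IRQ 25 and the masks are placeholders.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* cpu0 (mask 0x1) handles the device interrupt */
	write_str("/proc/irq/25/smp_affinity", "1");
	/* RPS steers receive processing to cpu1 (mask 0x2) */
	write_str("/sys/class/net/eth0/queues/rx-0/rps_cpus", "2");
	return 0;
}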

Traffic was sent from another machine into the i7 via an el-cheapo sky2
(don't know how shitty this NIC is, but it seems to know how to do MSI, so
it is probably capable of multiqueuing); the test was several sets of
a plain ping first and then a ping -f (I will get more sophisticated in my
next test, likely this weekend).

Results:
CPU utilization was about 20-30% higher in the case of RPS. On cpu0 the
CPU was mostly being chewed by sky2_poll, and on the redirected-to core
it was always smp_call_function_single.
Latency was (consistently) about 5 microseconds higher on average:
if I sent 1M ping -f packets, without RPS it took on average
176 seconds and with RPS it took 181 seconds to do the round trips
(5 extra seconds over 1M packets, i.e. roughly 5 usecs per packet).
Throughput didn't change, but this could be attributed to the low amount
of data I was sending.
I observed that we were generating, on average, an IPI per packet even
with ping -f (I added an extra stat to record when we sent an IPI and
counted it against the number of packets sent).
In my opinion it is these IPIs that contribute the most to the latency,
and I think the Nehalem just happens to be highly improved in this
area. I wish I had a more commonly used machine to test RPS on.
I expect that RPS will perform worse on cheaper/older hardware
for the traffic characteristics I tested.

On IPIs:
Is anyone familiar with what is going on with Nehalem? Why is it this
good? I expect things will get a lot nastier with other hardware, like
Xeon-based systems, or even Nehalem with RPS going across QPI.
Here's why I think IPIs are bad; please correct me if I am wrong:
- they are synchronous, i.e. an IPI issuer has to wait for an ACK (which
is in the form of an IPI).
- the data cache has to be synced to main memory
- the instruction pipeline is flushed
- what else did I miss? Andi?

So my question to Tom, Eric and Changli, or anyone else who has been
running RPS:
What hardware did you use? Is there anyone using older hardware than,
say, AMD Opteron or Intel Nehalem?

My impressions of RPS so far:
I think I may end up being impressed when I generate a lot more traffic,
since the cost of the IPIs will be amortized.
At this point multiqueue seems like a much more impressive alternative, and
multiqueue hardware seems to be much more of a commodity (price-point wise)
than a Nehalem.

Plan:
I still plan to attack the app space (and write a basic UDP app that
binds to one or more RPS CPUs, then try blasting a lot of UDP traffic at it
to see what happens); my step after that is to move to forwarding tests...

cheers,
jamal

Tom Herbert
2010-04-14 17:31:34 UTC
The point of RPS is to increase parallelism, but the cost of that is
more overhead per packet. If you are running a single flow, then
you'll see latency increase for that flow. With more concurrent flows
the benefits of parallelism kick in and latency gets better -- we've
seen the break-even point around ten connections in our tests. Also,
I don't think we've made the claim that RPS should generally perform
better than multiqueue; the primary motivation for RPS is to make
single-queue NICs give reasonable performance.
Eric Dumazet
2010-04-14 18:04:02 UTC
Yes, multiqueue is far better of course, but in the case of hardware
lacking multiqueue, RPS can help many workloads where the application has
_some_ work to do, not only counting frames or so...

RPS overhead (IPI, cache misses, ...) must be amortized by
parallelization or we lose.

A ping test is not an ideal candidate for RPS, since everything is done
at softirq level, and should be faster without RPS...



jamal
2010-04-14 18:53:42 UTC
Post by Eric Dumazet
Yes, multiqueue is far better of course, but in the case of hardware
lacking multiqueue, RPS can help many workloads where the application has
_some_ work to do, not only counting frames or so...
Agreed. So to enumerate, the benefits come in if:
a) you have many processors
b) you have a single-queue NIC
c) at sub-threshold traffic you don't care about a little latency
d) you have a specific cache hierarchy
e) the app is working hard to process incoming messages
Post by Eric Dumazet
RPS overhead (IPI, cache misses, ...) must be amortized by
parallelization or we lose.
Indeed.
How well they can be amortized seems very CPU- or board-specific.

I think the main challenge for my pedantic mind is the missing details. Is
there a paper on RPS? For example, for #d above, the commit log mentions that
RPS benefits if you have certain types of "cache hierarchy". Probably
some arch with a large shared L2/L3 (maybe inclusive) cache will benefit;
for example, it does well on Nehalem and probably on Opterons, as long as you
don't start stacking these things on some interconnect like QPI or HT.
But what happens when you have FSB sharing across cores (still a very
common setup)? etc.

Can I ask what hardware you run this on?
Post by Eric Dumazet
A ping test is not an ideal candidate for RPS, since everything is done
at softirq level, and should be faster without RPS...
ping won't do justice to the potential of RPS, mostly because it
generates very little traffic, i.e. point #c above. But it helps me at
least boot a machine with a proper setup - and it is not totally useless,
because I think the cost of the IPI can be deduced from the results.
I am going to put together some UDP app with variable think-time to see
what happens. Would that be a reasonable thing to test on?
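
A rough sketch of such a UDP sink, pinned to one CPU with a configurable
per-packet think-time; the port number and the usleep() model of
application work are arbitrary choices for the example:

/*
 * Rough sketch of the kind of test app described above: a UDP sink
 * pinned to one CPU, with a per-packet "think time" to emulate
 * application work.  Port 9000 and the usleep() model are arbitrary.
 *
 * Usage: ./udp_sink <cpu> <think_usecs>
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 0;
	int think_us = argc > 2 ? atoi(argv[2]) : 0;
	unsigned long packets = 0;
	struct sockaddr_in addr;
	cpu_set_t set;
	char buf[2048];
	int fd;

	/* Pin the receiver to the CPU that RPS is steering packets to. */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		return 1;
	}

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(9000);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");
		return 1;
	}

	for (;;) {
		if (recv(fd, buf, sizeof(buf), 0) < 0)
			continue;
		if (think_us)
			usleep(think_us);	/* emulate application work */
		if ((++packets & 0xfffff) == 0)
			printf("cpu %d: %lu packets\n", cpu, packets);
	}
}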

It would be valuable to have something like Documentation/networking/rps
to detail things a little more.

cheers,
jamal

Stephen Hemminger
2010-04-14 19:44:26 UTC
On Wed, 14 Apr 2010 14:53:42 -0400
Post by jamal
a) you have many processors
b) you have a single-queue NIC
c) at sub-threshold traffic you don't care about a little latency
There probably needs to be better autotuning for this; there is no reason
for RPS to be steering packets unless the queue is getting backed up.
Some kind of high/low watermark mechanism is needed.
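
Purely as an illustration of that idea (nothing like this exists in the
RPS patches), the gating check might look something like the following,
with made-up thresholds and no claim about where it would hook in:

#include <stdbool.h>

/*
 * Illustration only: gate steering on how backed up the local input
 * queue is.  The thresholds are made up; nothing like this is in the
 * current RPS patches.
 */
static bool rps_should_steer(unsigned int backlog_len, bool *steering)
{
	const unsigned int high_wmark = 300;	/* start steering */
	const unsigned int low_wmark = 50;	/* stop steering */

	if (!*steering && backlog_len > high_wmark)
		*steering = true;		/* queue backing up: spread the load */
	else if (*steering && backlog_len < low_wmark)
		*steering = false;		/* drained: keep packets local */

	return *steering;
}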

RPS might also interact with the core turbo boost functionality on Intel
chips. Newer chips will make a single core faster if the other cores can
be kept idle.
Eric Dumazet
2010-04-14 19:58:50 UTC