Discussion: htb parallelism on multi-core platforms
Radu Rendec
2009-04-17 10:40:44 UTC
Permalink
Hi,

I'm using htb on a dedicated shaping machine. Under heavy traffic (high
packet rate) all htb work is done on a single cpu - only one ksoftirqd
is consuming cpu power.

I have limited network stack knowledge, but I guess all htb work for a
particular interface is done on the same softirq context. Of course this
does not scale with multiple cpus, since only one of them would be used.

Is there any (simple) approach to distribute htb work (for one
interface) on multiple cpus?

Thanks,

Radu Rendec


David Miller
2009-04-17 11:31:51 UTC
Permalink
From: Radu Rendec <***@ines.ro>
Date: Fri, 17 Apr 2009 13:40:44 +0300
Post by Radu Rendec
Is there any (simple) approach to distribute htb work (for one
interface) on multiple cpus?
HTB acts upon global state, so anything that goes into a particular
device's HTB ruleset is going to be single threaded.

There really isn't any way around this.
Badalian Vyacheslav
2009-04-17 11:33:36 UTC
Permalink
hello

You get 100% SI in ksoftirqd on one CPU because the PC can't forward that
many packets (with NAPI off, if I understand correctly). A 2-CPU Xeon at
2.4 GHz can forward about 400-500 Mbit/s full duplex with about 20-30k htb
rules. If we try to do more, we get 100% SI. That is our experience.

We now use multiple PCs for this and will try to buy Intel 10G NICs with
A/IO that can use multiqueue.

Can anyone say how much CPU we need for about 5-7G in/out with
2 x Intel 10G + A/IO (1x10G to LAN + 1x10G to WAN)?
Any statistics or formula to calculate it - pps or Mbit/s?
tc + iptables (+ipset) now use 10-30%; all the other CPU time goes to the
e1000e driver.

Thanks
Post by Radu Rendec
Hi,
I'm using htb on a dedicated shaping machine. Under heavy traffic (high
packet rate) all htb work is done on a single cpu - only one ksoftirqd
is consuming cpu power.
I have limited network stack knowledge, but I guess all htb work for a
particular interface is done on the same softirq context. Of course this
does not scale with multiple cpus, since only one of them would be used.
Is there any (simple) approach to distribute htb work (for one
interface) on multiple cpus?
Thanks,
Radu Rendec
Jarek Poplawski
2009-04-17 22:41:38 UTC
Permalink
Post by Radu Rendec
Hi,
Hi Radu,
Post by Radu Rendec
I'm using htb on a dedicated shaping machine. Under heavy traffic (high
packet rate) all htb work is done on a single cpu - only one ksoftirqd
is consuming cpu power.
I have limited network stack knowledge, but I guess all htb work for a
particular interface is done on the same softirq context. Of course this
does not scale with multiple cpus, since only one of them would be used.
Is there any (simple) approach to distribute htb work (for one
interface) on multiple cpus?
I don't know of anything (simple) for this, but I wonder if you have
already tried any HTB tweaking, like the htb_hysteresis module param or
the burst/cburst class parameters, to cut some possibly useless
resolution/overhead?

Regards,
Jarek P.
Denys Fedoryschenko
2009-04-18 00:21:50 UTC
Permalink
Post by Jarek Poplawski
Hi,
Hi Radu,
I don't know about anything (simple) for this, but I wonder if you
tried already any htb tweaking like htb_hysteresis module param or
burst/cburst class parameters to limit some maybe useless resolution/
overhead?
Like adding HZ=1000 as an environment variable in scripts :-)
For me it helps...
HFSC is also worth trying.
Jarek Poplawski
2009-04-18 07:56:38 UTC
Permalink
Post by Denys Fedoryschenko
Post by Jarek Poplawski
Hi,
Hi Radu,
I don't know about anything (simple) for this, but I wonder if you
tried already any htb tweaking like htb_hysteresis module param or
burst/cburst class parameters to limit some maybe useless resolution/
overhead?
Like adding HZ=1000 as environment variable in scripts :-)
For me it helps....
Right, if you're using high resolution; there is a bug in tc, found by
Denys, which causes wrong (too low) defaults for burst/cburst.
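Until a fixed tc is in place, a workaround is to set burst/cburst
explicitly on the classes instead of relying on the computed defaults,
e.g. (device, classid and values here are only illustrative, not tuned):

tc class change dev eth0 parent 1:1 classid 1:10 htb rate 1mbit ceil 2mbit burst 15k cburst 15k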
Post by Denys Fedoryschenko
HFSC is also worth trying.
Yes, it seems to be especially interesting for 64 bit boxes.

Jarek P.
Radu Rendec
2009-04-22 14:02:06 UTC
Permalink
Post by Jarek Poplawski
Right, if you're using high resolution; there is a bug in tc, found by
Denys, which causes wrong (too low) defaults for burst/cburst.
Post by Denys Fedoryschenko
HFSC is also worth trying.
Yes, it seems to be especially interesting for 64 bit boxes.
Hi Jarek,

Thanks for the hints! As far as I understand, HFSC is also implemented
as a queue discipline (like HTB), so I guess it suffers from the same
design limitations (doesn't span across multiple CPUs). Is this
assumption correct?

As for htb_hysteresis, I actually haven't tried it. Although it is
definitely worth a try (especially if the average traffic grows), I don't
think it can compensate for the lack of multithreading / parallel
execution. At least half of the packet processing time is consumed by
classification (although I am using hashes). I guess htb_hysteresis only
affects the actual shaping (which takes place after the packet is
classified).

Thanks,

Radu Rendec


Jesper Dangaard Brouer
2009-04-22 21:29:32 UTC
Permalink
Jarek Poplawski
2009-04-23 08:20:52 UTC
Permalink
Post by Jesper Dangaard Brouer
Post by Radu Rendec
Thanks for the hints! As far as I understand, HFSC is also implemented
as a queue discipline (like HTB), so I guess it suffers from the same
design limitations (doesn't span across multiple CPUs). Is this
assumption correct?
Yes.
Within a common tree of classes it would need finer locking to
separate some jobs, but considering cache problems I doubt there would
be much gain from such a redesign for SMP. On the other hand, a
common tree is necessary only if these classes really have to share every
byte, which I doubt. Then we could think of config and maybe tiny
hardware "redesign" (to more qdiscs/roots). So, e.g. using additional
(cheap) NICs and even a switch, if possible, looks like quite a natural
way of spanning. A similar thing (multiple HTB qdiscs) should be possible
in the future with one multiqueue NIC too.

There is also an interesting thread "Software receive packet steering"
nearby, but using this for shaping only looks like "less simple":
http://lwn.net/Articles/328339/
Post by Jesper Dangaard Brouer
Post by Radu Rendec
As for htb_hysteresis I actually haven't tried it. Although it is
definitely worth a try (especially if the average traffic grows), I
don't think it can compensate multithreading / parallel execution.
It's runtime adjustable, so it's easy to try out, via
/sys/module/sch_htb/parameters/htb_hysteresis
Post by Radu Rendec
At least half of a packet processing time is consumed by classification
(although I am using hashes).
The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixed that up in 2.6.26. What kernel version are you
using?
Could you explain how you do classification? And perhaps outline where
your possible scalability issue is located?
BTW, I hope you add filters after the classes they point to.

Jarek P.
Radu Rendec
2009-04-23 13:56:42 UTC
Permalink
Post by Jarek Poplawski
Within a common tree of classes it would a need finer locking to
separate some jobs but considering cache problems I doubt there would
be much gain from such redesigning for smp. On the other hand, a
common tree is necessary if these classes really have to share every
byte, which I doubt. Then we could think of config and maybe tiny
hardware "redesign" (to more qdiscs/roots). So, e.g. using additional
(cheap) NICs and even switch, if possible, looks quite natural way of
spanning. Similar thing (multiple htb qdiscs) should be possible in
the future with one multiqueue NIC too.
Since htb has a tree structure by default, I think it's pretty difficult
to distribute shaping across different htb-enabled queues. Actually we
had thought of using completely separate machines, but soon we realized
there are some issues. Consider the following example:

Customer A and customer B share 2 Mbit of bandwidth. Each of them is
guaranteed to reach 1 Mbit and in addition is able to "borrow" up to 1
Mbit from the other's bandwidth (depending on the other's traffic).

This is done like this:

* bucket C -> rate 2 Mbit, ceil 2 Mbit
* bucket A -> rate 1 Mbit, ceil 2 Mbit, parent C
* bucket B -> rate 1 Mbit, ceil 2 Mbit, parent C

IP filters for customer A classify packets to bucket A, and similarly for
customer B to bucket B.
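
In tc terms this is roughly the following (device, classids and
addresses are just placeholders):

tc qdisc add dev eth0 root handle 1: htb
tc class add dev eth0 parent 1: classid 1:1 htb rate 2mbit ceil 2mbit      # bucket C
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit ceil 2mbit    # bucket A
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1mbit ceil 2mbit    # bucket B
tc filter add dev eth0 parent 1: protocol ip u32 match ip dst 192.0.2.10/32 flowid 1:10
tc filter add dev eth0 parent 1: protocol ip u32 match ip dst 192.0.2.20/32 flowid 1:20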

It's obvious that buckets A, B and C must be in the same htb tree,
otherwise customers A and B would not be able to borrow from each
other's bandwidth. One simple rule would be to allocate all buckets
(with all their child buckets) that have rate = ceil to the same tree /
queue / whatever. I don't know if this is enough.
Post by Jarek Poplawski
There is also an interesting thread "Software receive packet steering"
http://lwn.net/Articles/328339/
I am aware of the thread and even tried out the author's patch (despite
the fact that David Miller suggested it was not sane). Under heavy
(simulated) traffic nothing was changed: only one ksoftirqd using 100%
CPU, one CPU in 100%, others idle. This only confirms what I've already
been told: htb is single threaded by design. It also proves that most of
the packet processing work is actually in htb.
Post by Jarek Poplawski
BTW, I hope you add filters after classes they point to.
Do you mean the actual order I use for the "tc filter add" and "tc class
add" commands? Does it make any difference?

Anyway, speaking of htb redesign or improvement (to use multiple
threads / CPUs) I think classification rules can be cloned on a
per-thread basis (to avoid synchronization issues). This means
sacrificing memory for the benefit of performance but probably it is
better to do it this way.

However, shaping data structures must be shared between all threads as
long as it's not sure that all packets corresponding to a certain IP
address are processed in the same thread (they most probably would not,
if a round-robin algorithm is used).

While searching the Internet for what has already been accomplished in
this area, I ran several times across the per-CPU cache issue. The
commonly accepted opinion seems to be that CPU parallelism in packet
processing implies synchronization issues, which in turn imply cache
misses, which ultimately result in performance loss. However, with only
one core at 100% and the other 7 cores idle, I doubt the CPU cache is
really the bottleneck here (it's just a guess and it definitely needs
real tests as evidence).

Thanks,

Radu Rendec


Jarek Poplawski
2009-04-23 18:19:36 UTC
Permalink
Post by Radu Rendec
Post by Jarek Poplawski
Within a common tree of classes it would a need finer locking to
separate some jobs but considering cache problems I doubt there would
be much gain from such redesigning for smp. On the other hand, a
common tree is necessary if these classes really have to share every
byte, which I doubt. Then we could think of config and maybe tiny
hardware "redesign" (to more qdiscs/roots). So, e.g. using additional
(cheap) NICs and even switch, if possible, looks quite natural way of
spanning. Similar thing (multiple htb qdiscs) should be possible in
the future with one multiqueue NIC too.
Since htb has a tree structure by default, I think it's pretty difficult
to distribute shaping across different htb-enabled queues. Actually we
had thought of using completely separate machines, but soon we realized
Customer A and customer B share 2 Mbit of bandwith. Each of them is
guaranteed to reach 1 Mbit and in addition is able to "borrow" up to 1
Mbit from the other's bandwith (depending on the other's traffic).
* bucket C -> rate 2 Mbit, ceil 2 Mbit
* bucket A -> rate 1 Mbit, ceil 2 Mbit, parent C
* bucket B -> rate 1 Mbit, ceil 2 Mbit, parent C
IP filters for customer A classify packets to bucket A, and similar for
customer B to bucket B.
It's obvious that buckets A, B and C must be in the same htb tree,
otherwise customers A and B would not be able to borrow from each
other's bandwidth. One simple rule would be to allocate all buckets
(with all their child buckets) that have rate = ceil to the same tree /
queue / whatever. I don't know if this is enough.
Yes, what I meant was rather a config with more individual clients eg.
20 x rate 50kbit ceil 100kbit. But, if you have many such rate = ceil
classes, separating them to another qdisc/NIC looks even better (no
problem with unbalanced load).
Post by Radu Rendec
Post by Jarek Poplawski
There is also an interesting thread "Software receive packet steering"
http://lwn.net/Articles/328339/
I am aware of the thread and even tried out the author's patch (despite
the fact that David Miller suggested it was not sane). Under heavy
(simulated) traffic nothing was changed: only one ksoftirqd using 100%
CPU, one CPU in 100%, others idle. This only confirms what I've already
been told: htb is single threaded by design. It also proves that most of
the packet processing work is actually in htb.
But, as I wrote, it's not simple. (And the single-threadedness was
mentioned there too.) This method is intended for local traffic (to
sockets) AFAIK, so I thought about using some trick with virtual devs
instead, but maybe I'm totally wrong.
Post by Radu Rendec
Post by Jarek Poplawski
BTW, I hope you add filters after classes they point to.
Do you mean the actual order I use for the "tc filter add" and "tc class
add" commands? Does it make any difference?
Yes, I mean this order:
tc class add ... classid 1:23 ...
tc filter add ... flowid 1:23
Post by Radu Rendec
Anyway, speaking of htb redesign or improvement (to use multiple
threads / CPUs) I think classification rules can be cloned on a
per-thread basis (to avoid synchronization issues). This means
sacrificing memory for the benefit of performance but probably it is
better to do it this way.
However, shaping data structures must be shared between all threads as
long as it's not sure that all packets corresponding to a certain IP
address are processed in the same thread (they most probably would not,
if a round-robin algorithm is used).
While searching the Internet for what has already been accomplished in
this area, I ran several time across the per-CPU cache issue. The
commonly accepted opinion seems to be that CPU parallelism in packet
processing implies synchronization issues which in turn imply cache
misses, which ultimately result in performance loss. However, with only
one core in 100% and other 7 cores idle, I doubt that CPU-cache is
really worth (it's just a guess and it definitely needs real tests as
evidence).
There are many things to learn and to do around SMP yet, just as
this "Software receive packet steering" thread shows. Anyway, really
big HTB traffic is being handled as it is (look at Vyacheslav's
mail in this thread), so I guess you have something to do around your
config/hardware too.

Jarek P.
Jesper Dangaard Brouer
2009-04-23 20:19:45 UTC
Permalink
Post by Jarek Poplawski
Post by Radu Rendec
I am aware of the thread and even tried out the author's patch (despite
the fact that David Miller suggested it was not sane). Under heavy
(simulated) traffic nothing was changed: only one ksoftirqd using 100%
CPU, one CPU in 100%, others idle. This only confirms what I've already
been told: htb is single threaded by design.
It's more general than just HTB. We have a general qdisc serialization
point in net/sched/sch_generic.c via qdisc_lock(q).
Post by Jarek Poplawski
Post by Radu Rendec
It also proves that most of the packet processing work is actually in
htb.
I'm not sure that statement is true.
Can you run oprofile on the system? That will tell us exactly where the
time is spent...
Post by Jarek Poplawski
...
I thought about using some trick with virtual devs instead, but maybe
I'm totally wrong.
I like the idea with virtual devices, as each virtual device could be
bound to a hardware tx-queue.

Then you just have to construct your HTB trees on each virtual
device, and assign customers accordingly.

I just realized, you don't use a multi-queue capable NIC, right?
Then it would be difficult to use the hardware tx-queue idea.
Have you thought of using several physical NICs?

Hilsen
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
Radu Rendec
2009-04-24 09:42:16 UTC
Permalink
Post by Jesper Dangaard Brouer
Post by Jarek Poplawski
Post by Radu Rendec
It also proves that most of the packet processing work is actually in
htb.
I'm not sure that statement is true.
Can you run Oprofile on the system? That will tell us exactly where time
is spend...
I've never used oprofile, but it looks very powerful and simple to use.
I'll compile a 2.6.29 (so that I also benefit from the htb patch you
told me about) then put oprofile on top of it. I'll get back to you by
evening (or maybe Monday noon) with real facts :)
Post by Jesper Dangaard Brouer
Post by Jarek Poplawski
...
I thought about using some trick with virtual devs instead, but maybe
I'm totally wrong.
I like the idea with virtual devices, as each virtual device could be
bound to a hardware tx-queue.
Is there any current support for this or do you talk about it as an
approach to use in future development?

The idea looks interesting indeed. If there's current support for it,
I'd like to try it out. If not, perhaps I can help at least with testing
(or even some coding as well).
Post by Jesper Dangaard Brouer
Then you just have to construct your HTB trees on each virtual
device, and assign customers accordingly.
I don't think it's that easy. Let's say we have the same HTB trees on
both virtual devices A and B (each of them is bound to a different
hardware tx queue). If packets for a specific destination ip address
(pseudo)randomly arrive at both A and B, tokens will be extracted from
both A and B trees, resulting in an erroneous overall bandwidth (at worst
double the ceil, if packets reach the ceil on both A and B).

I have to make sure packets belonging to a certain customer (or ip
address) always come through a specific virtual device. Then HTB trees
don't even need to be identical.

However, this is not trivial at all. A single customer can have
different subnets (even from different class-B networks) but share a
single HTB bucket for all of them. Using a simple hash function on the
ip address to determine which virtual device to send through doesn't
seem to be an option since it does not guarantee all packets for a
certain customer will go together.

What I had in mind for parallel shaping was this:

NIC0 -> mux -----> Thread 0: classify/shape -----> NIC2
\/ \/
/\ /\
NIC1 -> mux -----> Thread 1: classify/shape -----> NIC3

Of course the number of input NICs, processing threads and output NICs
would be adjustable. But this idea has 2 major problems:

* shaping data must be shared between processing threads (in order to
extract tokens from the same bucket regardless of the thread that does
the actual processing)

* it seems to be impossible to do this with (unmodified) HTB
Post by Jesper Dangaard Brouer
I just realized, you don't use a multi-queue capably NIC right?
Then it would be difficult to use the hardware tx-queue idea.
Have you though of using several physical NICs?
The machine we are preparing for production has this:

2 x Intel Corporation 82571EB Gigabit Ethernet Controller
2 x Intel Corporation 80003ES2LAN Gigabit Ethernet Controller

All 4 NICs use the e1000e driver and I think they are multi-queue
capable. So in theory I can use several NICs and/or multi-queue.

Thanks,

Radu Rendec


Jesper Dangaard Brouer
2009-04-28 10:15:05 UTC
Permalink
Post by Radu Rendec
Post by Jesper Dangaard Brouer
Post by Jarek Poplawski
Post by Radu Rendec
It also proves that most of the packet processing work is actually in
htb.
I'm not sure that statement is true.
Can you run Oprofile on the system? That will tell us exactly where time
is spend...
I've never used oprofile, but it looks very powerful and simple to use.
I'll compile a 2.6.29 (so that I also benefit from the htb patch you
told me about) then put oprofile on top of it. I'll get back to you by
evening (or maybe Monday noon) with real facts :)
Remember to keep/copy the file "vmlinux".

Here are the steps I usually use:

opcontrol --vmlinux=/boot/vmlinux-`uname -r`

opcontrol --stop
opcontrol --reset
opcontrol --start

<perform stuff that needs profiling>

opcontrol --stop

"Normal" report
opreport --symbols --image-path=/lib/modules/`uname -r`/kernel/ | less

Looking at specific module "sch_htb"

opreport --symbols -cl sch_htb.ko --image-path=/lib/modules/`uname -r`/kernel/
Post by Radu Rendec
Post by Jesper Dangaard Brouer
Post by Jarek Poplawski
...
I thought about using some trick with virtual devs instead, but maybe
I'm totally wrong.
I like the idea with virtual devices, as each virtual device could be
bound to a hardware tx-queue.
Is there any current support for this or do you talk about it as an
approach to use in future development?
These are definitely only ideas for future development...
Post by Radu Rendec
The idea looks interesting indeed. If there's current support for it,
I'd like to try it out. If not, perhaps I can help at least with testing
(or even some coding as well).
Post by Jesper Dangaard Brouer
Then you just have to construct your HTB trees on each virtual
device, and assign customers accordingly.
I don't think it's that easy. Let's say we have the same HTB trees on
both virtual devices A and B (each of them is bound to a different
hardware tx queue). If packets for a specific destination ip address
(pseudo)randomly arrive at both A and B, tokens will be extracted from
both A and B trees, resulting in an erroneous overall bandwidth (at worst
double the ceil, if packets reach the ceil on both A and B).
I have to make sure packets belonging to a certain customer (or ip
address) always come through a specific virtual device. Then HTB trees
don't even need to be identical.
Correct...
Post by Radu Rendec
However, this is not trivial at all. A single customer can have
different subnets (even from different class-B networks) but share a
single HTB bucket for all of them. Using a simple hash function on the
ip address to determine which virtual device to send through doesn't
seem to be an option since it does not guarantee all packets for a
certain customer will go together.
Well, I know the problem; our customers' IPs are also allocated ad hoc
and not grouped nicely :-(
Post by Radu Rendec
...
Post by Jesper Dangaard Brouer
I just realized, you don't use a multi-queue capably NIC right?
Then it would be difficult to use the hardware tx-queue idea.
Have you though of using several physical NICs?
2 x Intel Corporation 82571EB Gigabit Ethernet Controller
2 x Intel Corporation 80003ES2LAN Gigabit Ethernet Controller
All 4 NICs use the e1000e driver and I think they are multi-queue
capable. So in theory I can use several NICs and/or multi-queue.
I'm not sure that the e1000e driver has multiqueue support for your
devices. The 82571EB chip should have 2 rx and 2 tx queues [1].

Looking through the code, the multiqueue-capable MSI-X IRQ code first
went in in kernel version v2.6.28-rc1, BUT the driver still uses
alloc_etherdev() and not alloc_etherdev_mq().
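
A quick way to see what the driver actually registered is to look at the
interrupt vectors it requested; a real multiqueue setup normally shows
separate per-queue rx/tx vectors (the naming is driver-specific):

grep eth /proc/interrupts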

Cheers,
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

[1]:
http://www.intel.com/products/ethernet/index.htm?iid=embnav1+eth#s1=Gigabit%20Ethernet&s2=82571EB&s3=all

Radu Rendec
2009-04-29 10:21:34 UTC
Permalink
Thanks for the oprofile newbie guide - it saved much time and digging
through man pages.

Normal report looks like this:
samples % image name app name symbol name
38424 30.7350 cls_u32.ko cls_u32 u32_classify
5321 4.2562 e1000e.ko e1000e e1000_clean_rx_irq
4690 3.7515 vmlinux vmlinux ipt_do_table
3825 3.0596 sch_htb.ko sch_htb htb_dequeue
3458 2.7660 vmlinux vmlinux __hash_conntrack
2597 2.0773 vmlinux vmlinux nf_nat_setup_info
2531 2.0245 vmlinux vmlinux kmem_cache_alloc
2229 1.7830 vmlinux vmlinux ip_route_input
1722 1.3774 vmlinux vmlinux nf_conntrack_in
1547 1.2374 sch_htb.ko sch_htb htb_enqueue
1519 1.2150 vmlinux vmlinux kmem_cache_free
1471 1.1766 vmlinux vmlinux __slab_free
1435 1.1478 vmlinux vmlinux dev_queue_xmit
1313 1.0503 vmlinux vmlinux __qdisc_run
1277 1.0215 vmlinux vmlinux netif_receive_skb

All other symbols are below 1%.

sch_htb.ko report is this:

samples % image name symbol name
-------------------------------------------------------------------------------
3825 49.0762 sch_htb.ko htb_dequeue
3825 100.000 sch_htb.ko htb_dequeue [self]
-------------------------------------------------------------------------------
1547 19.8486 sch_htb.ko htb_enqueue
1547 100.000 sch_htb.ko htb_enqueue [self]
-------------------------------------------------------------------------------
608 7.8009 sch_htb.ko htb_lookup_leaf
608 100.000 sch_htb.ko htb_lookup_leaf [self]
-------------------------------------------------------------------------------
459 5.8891 sch_htb.ko htb_deactivate_prios
459 100.000 sch_htb.ko htb_deactivate_prios [self]
-------------------------------------------------------------------------------
417 5.3503 sch_htb.ko htb_add_to_wait_tree
417 100.000 sch_htb.ko htb_add_to_wait_tree [self]
-------------------------------------------------------------------------------
372 4.7729 sch_htb.ko htb_change_class_mode
372 100.000 sch_htb.ko htb_change_class_mode [self]
-------------------------------------------------------------------------------
276 3.5412 sch_htb.ko htb_activate_prios
276 100.000 sch_htb.ko htb_activate_prios [self]
-------------------------------------------------------------------------------
189 2.4249 sch_htb.ko htb_add_to_id_tree
189 100.000 sch_htb.ko htb_add_to_id_tree [self]
-------------------------------------------------------------------------------
101 1.2959 sch_htb.ko htb_safe_rb_erase
101 100.000 sch_htb.ko htb_safe_rb_erase [self]
-------------------------------------------------------------------------------

Am I misinterpreting the results, or does it look like the real problem
is actually packet classification?

Thanks,

Radu Rendec
Post by Jesper Dangaard Brouer
Remember to keep/copy the file "vmlinux".
opcontrol --vmlinux=/boot/vmlinux-`uname -r`
opcontrol --stop
opcontrol --reset
opcontrol --start
<perform stuff that needs profiling>
opcontrol --stop
"Normal" report
opreport --symbols --image-path=/lib/modules/`uname -r`/kernel/ | less
Looking at specific module "sch_htb"
opreport --symbols -cl sch_htb.ko --image-path=/lib/modules/`uname -r`/kernel/
Jesper Dangaard Brouer
2009-04-29 10:31:17 UTC
Permalink
Post by Radu Rendec
Thanks for the oprofile newbie guide - it saved much time and digging
through man pages.
You are welcome :-)

Just noticed that Jeremy Kerr has made some python scripts to make it even
easier to use oprofile.
See http://ozlabs.org/~jk/diary/tech/linux/hiprofile-v1.0.diary/
Post by Radu Rendec
samples % image name app name symbol name
38424 30.7350 cls_u32.ko cls_u32 u32_classify
5321 4.2562 e1000e.ko e1000e e1000_clean_rx_irq
4690 3.7515 vmlinux vmlinux ipt_do_table
3825 3.0596 sch_htb.ko sch_htb htb_dequeue
3458 2.7660 vmlinux vmlinux __hash_conntrack
2597 2.0773 vmlinux vmlinux nf_nat_setup_info
2531 2.0245 vmlinux vmlinux kmem_cache_alloc
2229 1.7830 vmlinux vmlinux ip_route_input
1722 1.3774 vmlinux vmlinux nf_conntrack_in
1547 1.2374 sch_htb.ko sch_htb htb_enqueue
1519 1.2150 vmlinux vmlinux kmem_cache_free
1471 1.1766 vmlinux vmlinux __slab_free
1435 1.1478 vmlinux vmlinux dev_queue_xmit
1313 1.0503 vmlinux vmlinux __qdisc_run
1277 1.0215 vmlinux vmlinux netif_receive_skb
All other symbols are below 1%.
...
I would rather want to see the output from cls_u32.ko

opreport --symbols -cl cls_u32.ko --image-path=/lib/modules/`uname -r`/kernel/
Post by Radu Rendec
Am I misinterpreting the results, or does it look like the real problem
is actually packet classification?
Yes, it looks like the problem is your u32 classification setup... Perhaps
it's not doing what you think it's doing... didn't Jarek provide some hints
for you to follow?
Post by Radu Rendec
Post by Jesper Dangaard Brouer
Remember to keep/copy the file "vmlinux".
opcontrol --vmlinux=/boot/vmlinux-`uname -r`
opcontrol --stop
opcontrol --reset
opcontrol --start
<perform stuff that needs profiling>
opcontrol --stop
"Normal" report
opreport --symbols --image-path=/lib/modules/`uname -r`/kernel/ | less
Looking at specific module "sch_htb"
opreport --symbols -cl sch_htb.ko --image-path=/lib/modules/`uname -r`/kernel/
Hilsen
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
Radu Rendec
2009-04-29 11:03:26 UTC
Permalink
Post by Jesper Dangaard Brouer
Just noticed that Jeremy Kerr has made some python scripts to make it even
easier to use oprofile.
See http://ozlabs.org/~jk/diary/tech/linux/hiprofile-v1.0.diary/
Thanks for the hint; I'll have a look at the scripts too.
Post by Jesper Dangaard Brouer
I would rather want to see the output from cls_u32.ko
opreport --symbols -cl cls_u32.ko --image-path=/lib/modules/`uname -r`/kernel/
samples % image name symbol name
-------------------------------------------------------------------------------
38424 100.000 cls_u32.ko u32_classify
38424 100.000 cls_u32.ko u32_classify [self]
-------------------------------------------------------------------------------

Well, this doesn't tell us much more, but I think it's pretty obvious
what cls_u32 is doing :)
Post by Jesper Dangaard Brouer
Post by Radu Rendec
Am I misinterpreting the results, or does it look like the real problem
is actually packet classification?
Yes, it looks like the problem is your u32 classification setup... Perhaps
its not doing what you think its doing... didn't Jarek provide some hints
for you to follow?
I've just realized that I might be hitting the worst-case bucket with
the (ip) destinations I chose for the test traffic. I'll try

I haven't tried tweaking htb_hysteresis yet (that was one of Jarek's
hints) - it's debatable that it would help since the real problem seems
to be in u32 (not htb), but I'll give it a try anyway.
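If I understand correctly it can be flipped at runtime, something like:

cat /sys/module/sch_htb/parameters/htb_hysteresis
echo 1 > /sys/module/sch_htb/parameters/htb_hysteresis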

Another hint was to make sure that "tc class add" goes before
corresponding "tc filter add" - checked: it's ok.

Another interesting hint came from Calin Velea, whose tests suggest that
the overall performance is better with napi turned off, since (rx)
interrupt work is distributed to all cpus/cores. I'll try to replicate
this as soon as I make some small changes to my test setup so that I'm
able to measure overall htb throughput on the egress nic (bps and pps).
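For bps/pps I will probably just sample the egress interface counters
twice and diff them over the interval, e.g. (interface name is only an
example):

ip -s link show dev eth1
sleep 10
ip -s link show dev eth1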

Thanks,

Radu Rendec


Jarek Poplawski
2009-04-29 12:23:12 UTC
Permalink
...
Post by Radu Rendec
Post by Jesper Dangaard Brouer
Yes, it looks like the problem is your u32 classification setup... Perhaps
its not doing what you think its doing... didn't Jarek provide some hints
for you to follow?
I've just realized that I might be hitting the worst-case bucket with
the (ip) destinations I chose for the test traffic. I'll try
I haven't tried tweaking htb_hysteresis yet (that was one of Jarek's
hints) - it's debatable that it would help since the real problem seems
to be in u32 (not htb), but I'll give it a try anyway.
According to the author's(?) comment with hysteresis "The speed gain
is about 1/6", so not very much here considering htb_dequeue time.
Post by Radu Rendec
Another hint was to make sure that "tc class add" goes before
corresponding "tc filter add" - checked: it's ok.
Another interesting hint came from Calin Velea, whose tests suggest that
the overall performance is better with napi turned off, since (rx)
interrupt work is distributed to all cpus/cores. I'll try to replicate
this as soon as I make some small changes to my test setup so that I'm
able to measure overall htb throughput on the egress nic (bps and pps).
Radu, since not only your worst-case but also your real-case u32
lookups are very long, I think you should mainly have a look at Calin's
u32 hash generator, or at least his method, and try these other tricks
only after optimizing that. Btw. I hope Calin made this nice program
known to networking/admin lists too.

Btw. #2: I think you wrote you didn't use iptables...

Cheers,
Jarek P.
Radu Rendec
2009-04-29 13:15:51 UTC
Permalink
Post by Jarek Poplawski
According to the author's(?) comment with hysteresis "The speed gain
is about 1/6", so not very much here considering htb_dequeue time.
Thought so :)
Post by Jarek Poplawski
Radu, since not only your worst case, but also the real case u32
lookups are very big I think you should mainly have a look at Calin's
u32 hash generator or at least his method, and only after optimizing
it try these other tricks. Btw. I hope Calin made this nice program
known to networking/admins lists too.
I've just had a look at Calin's approach to optimizing u32 lookups. It
does indeed make a very nice use of u32 hash capabilities, resulting in
a maximum of 4 lookups. The algorithm he uses takes advantage of the
fact that only a (small) subset of the whole ipv4 address space is
actually used in an ISP's network.

Unfortunately his approach makes it a bit difficult to dynamically
adjust the configuration, since the controller (program/application)
must remember the exact hash tables, filters etc in order to be able to
add/remove CIDRs without rewriting the entire configuration. Unused hash
tables also need to be "garbage collected" and reused, otherwise the
hash table id space could be exhausted.

Since I only use IP lookups (and u32 is very generic), I'm starting to
ask myself whether a different kind of data structure and classifier
would be more appropriate.

For instance, I think a binary search tree that is matched against the
bits in the ip address would result in pretty nice performance. It would
take at most 32 iterations (descending through the tree) with less
overhead than the (complex) u32 rule match.
Post by Jarek Poplawski
Btw. #2: I think you wrote you didn't use iptables...
No, I don't use iptables.

Btw, the e1000e driver seems to have no way to disable NAPI. Am I
missing something (like a global kernel config option that disables NAPI
completely)?

Thanks,

Radu


Jarek Poplawski
2009-04-29 13:38:11 UTC
Permalink
On Wed, Apr 29, 2009 at 04:15:51PM +0300, Radu Rendec wrote:
...
Post by Radu Rendec
I've just had a look at Calin's approach to optimizing u32 lookups. It
does indeed make a very nice use of u32 hash capabilities, resulting in
a maximum of 4 lookups. The algorithm he uses takes advantage of the
fact that only a (small) subset of the whole ipv4 address space is
actually used in an ISP's network.
...

Anyway, it looks like your main problem, and I doubt even dividing the
current work across e.g. 4 cores (if it were multi-threaded) would be
enough. These lookups are simply too long.
Post by Radu Rendec
Post by Jarek Poplawski
Btw. #2: I think you wrote you didn't use iptables...
No, I don't use iptables.
But your oprofile output shows them. Maybe you shouldn't compile it into
the kernel at all?
Post by Radu Rendec
Btw, the e1000e driver seems to have no way to disable NAPI. Am I
missing something (like a global kernel config option that disables NAPI
completely)?
Calin uses an older kernel, and maybe the e1000 driver, I don't know.

Jarek P.
Radu Rendec
2009-04-29 16:21:11 UTC
Permalink
I finally managed to disable NAPI on e1000e - apparently it can only be
done on the "official" Intel driver (downloaded from their website), by
compiling with "make CFLAGS_EXTRA=-DE1000E_NO_NAPI". This doesn't seem
to be available in the (2.6.29) kernel driver.

With NAPI disabled, 4 (of 8) cores go to 100% (instead of only one), but
overall throughput *decreases* from ~110K pps (with NAPI) to ~80K pps.
This makes sense, since h/w interrupt is much more time consuming than
polling (that's the whole idea behind NAPI anyway).

Radu Rendec


Calin Velea
2009-04-29 22:49:46 UTC
Permalink
Calin Velea
2009-04-29 23:00:53 UTC
Permalink
Hello Calin,
Post by Calin Velea
Post by Radu Rendec
I finally managed to disable NAPI on e1000e - apparently it can only be
done on the "official" Intel driver (downloaded from their website), by
compiling with "make CFLAGS_EXTRA=-DE1000E_NO_NAPI". This doesn't seem
to be available in the (2.6.29) kernel driver.
With NAPI disabled, 4 (of 8) cores go to 100% (instead of only one), but
overall throughput *decreases* from ~110K pps (with NAPI) to ~80K pps.
This makes sense, since h/w interrupt is much more time consuming than
polling (that's the whole idea behind NAPI anyway).
Radu Rendec
I tested with e1000 only, on a single quad-core CPU - the L2 cache was
shared between the cores.
For 8 cores I suppose you have 2 quad-core CPUs. If the cores actually
used belong to different physical CPUs, L2 cache sharing does not occur -
maybe this could explain the performance drop in your case.
Or there may be other explanation...
"HTB acts upon global state, so anything that goes into a particular
device's HTB ruleset is going to be single threaded.
There really isn't any way around this. "
It could be that the only way to get more power is to increase the number
of devices where you are shaping. You could split the IP space into 4 groups
and direct the traffic to 4 IMQ devices with 4 iptables rules -
-d 0.0.0.0/2 -j IMQ --todev imq0,
-d 64.0.0.0/2 -j IMQ --todev imq1, etc...
Or you can customize the split depending on the traffic distribution.
ipset nethash match can also be used.
The 4 devices can have the same htb ruleset, only the right parts
of it will match.
You should test with 4 flows that use all the devices simultaneously and
see what is the aggregate throughput.
The performance gained through parallelism might be a lot higher than the
added overhead of iptables and/or ipset nethash match. Anyway - this is more of
a "hack" than a clean solution :)
p.s.: latest IMQ at http://www.linuximq.net/ is for 2.6.26 so you will need to try with that
You will also need -i ethX (router), or -m physdev --physdev-in ethX
(bridge) to differentiate between upload and download in the iptables rules.
--
Best regards,
Calin mailto:***@gemenii.ro

Radu Rendec
2009-04-30 11:19:36 UTC
Permalink
Post by Calin Velea
I tested with e1000 only, on a single quad-core CPU - the L2 cache was
shared between the cores.
For 8 cores I suppose you have 2 quad-core CPUs. If the cores actually
used belong to different physical CPUs, L2 cache sharing does not occur -
maybe this could explain the performance drop in your case.
Or there may be other explanation...
It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
it is very probable - then I think the L2 cache was actually shared.
That's because the used CPUs were either 0-3 or 4-7 but never a mix of
them. So perhaps there is another explanation (maybe driver/hardware).
Post by Calin Velea
It could be the only way to get more power is to increase the number
of devices where you are shaping. You could split the IP space into 4 groups
and direct the trafic to 4 IMQ devices with 4 iptables rules -
-d 0.0.0.0/2 -j IMQ --todev imq0,
-d 64.0.0.0/2 -j IMQ --todev imq1, etc...
Yes, but what if let's say 10.0.0.0/24 and 70.0.0.0/24 need to share
bandwidth? 10.a.b.c goes to imq0 qdisc, and 70.x.y.z goes to imq1 qdisc,
and the two qdiscs (HTB sets) are independent. This will result in a
maximum of double the allocated bandwidth (if HTB sets are identical and
traffic is equally distributed).
Post by Calin Velea
The performance gained through parallelism might be a lot higher than the
added overhead of iptables and/or ipset nethash match. Anyway - this is more of
a "hack" than a clean solution :)
p.s.: latest IMQ at http://www.linuximq.net/ is for 2.6.26 so you will need to try with that
Yes, the performance gained through parallelism is expected to be higher
than the loss of the additional overhead. That's why I asked for
parallel HTB in the first place, but got very disappointed after David
Miller's reply :)

Thanks a lot for all the hints and for the imq link. Imq is very
interesting regardless of whether it proves to be useful for this
project of mine or not.

Radu Rendec


Jesper Dangaard Brouer
2009-04-30 11:44:53 UTC
Permalink
Post by Radu Rendec
Post by Calin Velea
I tested with e1000 only, on a single quad-core CPU - the L2 cache was
shared between the cores.
For 8 cores I suppose you have 2 quad-core CPUs. If the cores actually
used belong to different physical CPUs, L2 cache sharing does not occur -
maybe this could explain the performance drop in your case.
Or there may be other explanation...
It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
it is very probable - then I think the L2 cache was actually shared.
That's because the used CPUs were either 0-3 or 4-7 but never a mix of
them. So perhaps there is another explanation (maybe driver/hardware).
WRONG assumption regarding CPU IDs.

Look in /proc/cpuinfo for the correct answer.

(From a:
model name : Intel(R) Xeon(R) CPU E5420 @ 2.50GHz)

cat /proc/cpuinfo | egrep -e '(processor|physical id|core id)'
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 1
core id : 0

processor : 2
physical id : 0
core id : 2

processor : 3
physical id : 1
core id : 2

processor : 4
physical id : 0
core id : 1

processor : 5
physical id : 1
core id : 1

processor : 6
physical id : 0
core id : 3

processor : 7
physical id : 1
core id : 3

E.g. here CPU0 and CPU4 are sharing the same L2 cache.
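
If the kernel exposes the cache topology in sysfs, the sharing can also be
read directly (index2 is usually the L2 cache on these CPUs; the file is a
bitmask of the CPUs sharing that cache):

cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map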


Hilsen
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
Calin Velea
2009-04-30 14:04:26 UTC
Permalink
Post by Radu Rendec
Post by Calin Velea
I tested with e1000 only, on a single quad-core CPU - the L2 cache was
shared between the cores.
For 8 cores I suppose you have 2 quad-core CPUs. If the cores actually
used belong to different physical CPUs, L2 cache sharing does not occur -
maybe this could explain the performance drop in your case.
Or there may be other explanation...
It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
it is very probable - then I think the L2 cache was actually shared.
That's because the used CPUs were either 0-3 or 4-7 but never a mix of
them. So perhaps there is another explanation (maybe driver/hardware).
Post by Calin Velea
It could be the only way to get more power is to increase the number
of devices where you are shaping. You could split the IP space into 4 groups
and direct the trafic to 4 IMQ devices with 4 iptables rules -
-d 0.0.0.0/2 -j IMQ --todev imq0,
-d 64.0.0.0/2 -j IMQ --todev imq1, etc...
Yes, but what if let's say 10.0.0.0/24 and 70.0.0.0/24 need to share
bandwidth? 10.a.b.c goes to imq0 qdisc, and 70.x.y.z goes to imq1 qdisc,
and the two qdiscs (HTB sets) are independent. This will result in a
maximum of double the allocated bandwidth (if HTB sets are identical and
traffic is equally distributed).
Post by Calin Velea
The performance gained through parallelism might be a lot higher than the
added overhead of iptables and/or ipset nethash match. Anyway - this is more of
a "hack" than a clean solution :)
p.s.: latest IMQ at http://www.linuximq.net/ is for 2.6.26 so you will need to try with that
Yes, the performance gained through parallelism is expected to be higher
than the loss of the additional overhead. That's why I asked for
parallel HTB in the first place, but got very disappointed after David
Miller's reply :)
Thanks a lot for all the hints and for the imq link. Imq is very
interesting regardless of whether it proves to be useful for this
project of mine or not.
Radu Rendec
Indeed, you need to use ipset with nethash to avoid bandwidth doubling.
Let's say we have a shaping bridge: customer side (download) is
on eth0, the upstream side (upload) is on eth1.

Create customer groups with ipset (http://ipset.netfilter.org/)

ipset -N cust_group1_ips nethash
ipset -A cust_group1_ips <subnet/mask>
....
....for each subnet



To shape the upload with multiple IMQs:

-m physdev --physdev-in eth0 -m set --set cust_group1_ips src -j IMQ --to-dev 0
-m physdev --physdev-in eth0 -m set --set cust_group2_ips src -j IMQ --to-dev 1
-m physdev --physdev-in eth0 -m set --set cust_group3_ips src -j IMQ --to-dev 2
-m physdev --physdev-in eth0 -m set --set cust_group4_ips src -j IMQ --to-dev 3


You will apply the same htb upload limits to imq 0-3.
Upload for customers having source IPs from the first group will be shaped
by imq0, for the second, by imq1, etc...


For download:

-m physdev --physdev-in eth1 -m set --set cust_group1_ips dst -j IMQ --to-dev 4
-m physdev --physdev-in eth1 -m set --set cust_group2_ips dst -j IMQ --to-dev 5
-m physdev --physdev-in eth1 -m set --set cust_group3_ips dst -j IMQ --to-dev 6
-m physdev --physdev-in eth1 -m set --set cust_group4_ips dst -j IMQ --to-dev 7

and apply the same download limits on imq 4-7
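
Each imq device is then brought up and gets the same HTB setup, roughly
like this (the module parameter name and the rates are from memory /
only examples):

modprobe imq numdevs=8
ip link set imq0 up
tc qdisc add dev imq0 root handle 1: htb
tc class add dev imq0 parent 1: classid 1:1 htb rate 100mbit ceil 100mbit
# ... same classes and filters repeated on imq1-imq3 (upload) and imq4-imq7 (download)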
--
Best regards,
Calin mailto:***@gemenii.ro

Paweł Staszewski
2009-05-08 10:15:01 UTC
Permalink
Radu, you have something wrong with your configuration, I think.

I do traffic management for many different nets: a /18 prefix outside
plus 10.0.0.0/18 inside, and some nets like /21, /22, /23 and /20
prefixes.

Some stats from my router:

tc -s -d filter show dev eth0 | grep dst | wc -l
14087
tc -s -d filter show dev eth1 | grep dst | wc -l
14087

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 3075 @ 2.66GHz
stepping : 11
cpu MHz : 2659.843
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5319.68
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 3075 @ 2.66GHz
stepping : 11
cpu MHz : 2659.843
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5320.30
clflush size : 64
power management:


mpstat -P ALL 1 10
Average:  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle    intr/s
Average:  all   0.00   0.00   0.15     0.00   0.00   0.10    0.00   99.75  73231.70
Average:    0   0.00   0.00   0.20     0.00   0.00   0.10    0.00   99.70      0.00
Average:    1   0.00   0.00   0.00     0.00   0.00   0.00    0.00  100.00  27686.80
Average:    2   0.00   0.00   0.00     0.00   0.00   0.00    0.00    0.00      0.00

Some opreport:
CPU: Core 2, speed 2659.84 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a=20
unit mask of 0x00 (Unhalted core cycles) count 100000
samples % app name symbol name
7592 8.3103 vmlinux rb_next
5393 5.9033 vmlinux e1000_get_hw_control
4514 4.9411 vmlinux hfsc_dequeue
4069 4.4540 vmlinux e1000_intr_msi
3695 4.0446 vmlinux u32_classify
3522 3.8552 vmlinux poll_idle
2234 2.4454 vmlinux _raw_spin_lock
2077 2.2735 vmlinux read_tsc
1855 2.0305 vmlinux rb_prev
1834 2.0075 vmlinux getnstimeofday
1800 1.9703 vmlinux e1000_clean_rx_irq
1553 1.6999 vmlinux ip_route_input
1509 1.6518 vmlinux hfsc_enqueue
1451 1.5883 vmlinux irq_entries_start
1419 1.5533 vmlinux mwait_idle
1392 1.5237 vmlinux e1000_clean_tx_irq
1345 1.4723 vmlinux rb_erase
1294 1.4164 vmlinux sfq_enqueue
1187 1.2993 libc-2.6.1.so (no symbols)
1162 1.2719 vmlinux sfq_dequeue
1134 1.2413 vmlinux ipt_do_table
1116 1.2216 vmlinux apic_timer_interrupt
1108 1.2128 vmlinux cftree_insert
1039 1.1373 vmlinux rtsc_y2x
985 1.0782 vmlinux e1000_xmit_frame
943 1.0322 vmlinux update_vf

bwm-ng v0.6 (probing every 5.000s), press 'h' for help
input: /proc/net/dev type: rate
  iface                    Rx             Tx          Total
  ============================================================
  lo:               0.00 KB/s      0.00 KB/s      0.00 KB/s
  eth1:         20716.35 KB/s  24258.43 KB/s  44974.78 KB/s
  eth0:         24365.31 KB/s  30691.10 KB/s  55056.42 KB/s
  ------------------------------------------------------------

bwm-ng v0.6 (probing every 5.000s), press 'h' for help
input: /proc/net/dev type: rate
  iface                    Rx             Tx          Total
  ============================================================
  lo:               0.00 P/s       0.00 P/s       0.00 P/s
  eth1:         38034.00 P/s   36751.00 P/s   74785.00 P/s
  eth0:         37195.40 P/s   38115.00 P/s   75310.40 P/s
Maximum CPU load is during rush hour (from 5:00 pm to 10:00 pm); then it
is 20% - 30% on each CPU.


So I think you must change the type of your u32 hash tree.
I simply split big nets like /18, /20, /21 into /24 prefixes to build
my hash tree.
I made many tests and this hash layout works best for my configuration.
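
A rough sketch of what such a /24-keyed u32 hash looks like (device,
handles, classids and addresses are only examples):

# root u32 filter plus a 256-bucket hash table
tc filter add dev eth0 parent 1:0 prio 5 protocol ip u32
tc filter add dev eth0 parent 1:0 prio 5 handle 2: protocol ip u32 divisor 256
# hash the last octet of the destination IP (offset 16) into table 2:
tc filter add dev eth0 parent 1:0 prio 5 protocol ip u32 ht 800:: match ip dst 192.0.2.0/24 hashkey mask 0x000000ff at 16 link 2:
# per-host rules now land in small buckets (123 = 0x7b) instead of one long list
tc filter add dev eth0 parent 1:0 prio 5 protocol ip u32 ht 2:7b: match ip dst 192.0.2.123/32 flowid 1:10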



Regards
Paweł Staszewski
=20
=20
I tested with e1000 only, on a single quad-core CPU - the L2 cac=
he was
shared between the cores.
For 8 cores I suppose you have 2 quad-core CPUs. If the cores act=
ually
used belong to different physical CPUs, L2 cache sharing does not o=
ccur -
maybe this could explain the performance drop in your case.
Or there may be other explanation...
=20
=20
It is correct, I have 2 quad-core CPUs. If adjacent kernel-identifie=
d
CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) -=
and
it is very probable - then I think the L2 cache was actually shared.
That's because the used CPUs where either 0-3 or 4-7 but never a mix=
of
them. So perhaps there is another explanation (maybe driver/hardware=
).
=20
=20
It could be the only way to get more power is to increase the num=
ber=20
of devices where you are shaping. You could split the IP space into=
4 groups
and direct the trafic to 4 IMQ devices with 4 iptables rules -
-d 0.0.0.0/2 -j IMQ --todev imq0,
-d 64.0.0.0/2 -j IMQ --todev imq1, etc...
=20
=20
Yes, but what if let's say 10.0.0.0/24 and 70.0.0.0/24 need to share
bandwidth? 10.a.b.c goes to imq0 qdisc, and 70.x.y.z goes to imq1 qd=
isc,
and the two qdiscs (HTB sets) are independent. This will result in a
maximum of double the allocated bandwidth (if HTB sets are identical=
and
traffic is equally distributed).

The performance gained through parallelism might be a lot higher than the
added overhead of iptables and/or ipset nethash match. Anyway - this is
more of a "hack" than a clean solution :)
p.s.: latest IMQ at http://www.linuximq.net/ is for 2.6.26 so you will
need to try with that

Yes, the performance gained through parallelism is expected to be higher
than the loss from the additional overhead. That's why I asked for
parallel HTB in the first place, but got very disappointed after David
Miller's reply :)

Thanks a lot for all the hints and for the imq link. Imq is very
interesting regardless of whether it proves to be useful for this
project of mine or not.

Radu Rendec

Indeed, you need to use ipset with nethash to avoid bandwidth doubling.
Let's say we have a shaping bridge: customer side (download) is
on eth0, the upstream side (upload) is on eth1.
Create customer groups with ipset (http://ipset.netfilter.org/):
ipset -N cust_group1_ips nethash
ipset -A cust_group1_ips <subnet/mask>
....
....for each subnet
-m physdev --physdev-in eth0 -m set --set cust_group1_ips src -j IMQ --to-dev 0
-m physdev --physdev-in eth0 -m set --set cust_group2_ips src -j IMQ --to-dev 1
-m physdev --physdev-in eth0 -m set --set cust_group3_ips src -j IMQ --to-dev 2
-m physdev --physdev-in eth0 -m set --set cust_group4_ips src -j IMQ --to-dev 3
You will apply the same htb upload limits to imq 0-3.
Upload for customers having source IPs from the first group will be shaped
by imq0, for the second, by imq1, etc...
-m physdev --physdev-in eth1 -m set --set cust_group1_ips dst -j IMQ --to-dev 4
-m physdev --physdev-in eth1 -m set --set cust_group2_ips dst -j IMQ --to-dev 5
-m physdev --physdev-in eth1 -m set --set cust_group3_ips dst -j IMQ --to-dev 6
-m physdev --physdev-in eth1 -m set --set cust_group4_ips dst -j IMQ --to-dev 7
and apply the same download limits on imq 4-7.
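
(To complete the picture: each imq device then carries an identical HTB tree.
A minimal sketch - the handle, rate and class layout are invented here, the
real tree is whatever you already use on the physical interface:

for dev in imq0 imq1 imq2 imq3; do
    ip link set $dev up
    tc qdisc add dev $dev root handle 1: htb
    tc class add dev $dev parent 1: classid 1:1 htb rate 1000mbit
    # ... the same per-customer classes and filters on every device ...
done

Since the devices are independent, their enqueue/dequeue work is no longer
serialized behind one qdisc, which is the point of the split suggested above.)
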
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladimir Ivashchenko
2009-05-08 17:55:12 UTC
Permalink
Post by Radu Rendec
It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
it is very probable - then I think the L2 cache was actually shared.
That's because the CPUs used were either 0-3 or 4-7 but never a mix of
them. So perhaps there is another explanation (maybe driver/hardware).
Keep in mind that on an Intel quad-core CPU the cache is shared between pairs
of cores, not across all four cores.

http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/desktop/processor/processors/core2quad/feature/index.htm
--
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Denys Fedoryschenko
2009-05-08 18:07:58 UTC
Permalink
Btw, a shared L2 cache has higher latency than a dedicated one.
That's why Core i7 rules (tested recently).
Post by Vladimir Ivashchenko
Post by Radu Rendec
It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
it is very probable - then I think the L2 cache was actually shared.
That's because the CPUs used were either 0-3 or 4-7 but never a mix of
them. So perhaps there is another explanation (maybe driver/hardware).
Keep in mind that on an Intel quad-core CPU the cache is shared between pairs
of cores, not across all four cores.
http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/desktop/processor/processors/core2quad/feature/index.htm
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Radu Rendec
2009-04-23 12:31:47 UTC
Permalink
Post by Jesper Dangaard Brouer
Its runtime adjustable, so its easy to try out.
via /sys/module/sch_htb/parameters/htb_hysteresis
Thanks for the tip! This means I can play around with various values
while the machine is in production and see how it reacts.
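
(Concretely, that is just a sysfs read/write - something along these lines,
assuming sch_htb is loaded as a module:

cat /sys/module/sch_htb/parameters/htb_hysteresis        # current value
echo 0 > /sys/module/sch_htb/parameters/htb_hysteresis   # 0 = off, non-zero = on

and the change takes effect while HTB is running, without reloading the qdisc.)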
Post by Jesper Dangaard Brouer
The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixes that up in 2.6.26. What kernel version are you
using?
I'm using 2.6.26, so I guess the fix is already there :(
Post by Jesper Dangaard Brouer
Could you explain how you do classification? And perhaps outline where your
possible scalability issue is located?
If you are interested in how I do scalable classification, see my
http://nfws.inl.fr/en/?p=115
http://www.netoptimizer.dk/presentations/nfsw2008/Jesper-Brouer_Large-iptables-rulesets.pdf
I had a look at your presentation and it seems to be focused on dividing
a single iptables rule chain into multiple chains, so that rule lookup
complexity decreases from linear to logarithmic.
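
(Roughly, that splitting idea can be sketched with plain iptables user chains;
the chain names and the FORWARD hook below are invented for illustration, not
Jesper's actual ruleset:

iptables -N CUST_LOW
iptables -N CUST_HIGH
iptables -A FORWARD -d 0.0.0.0/1   -j CUST_LOW
iptables -A FORWARD -d 128.0.0.0/1 -j CUST_HIGH

Each sub-chain is then split the same way again, so a packet walks O(log n)
rules instead of one flat O(n) chain.)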

Since I only need to do shaping, I don't use iptables at all. Address
matching is all done on the egress side, using u32. Rule schema is
this:

1. We have two /19 networks that differ pretty much in the first bits:
80.x.y.z and 83.a.b.c; customer address spaces range from /22 nets to
individual /32 addresses.

2. The default ip hash (0x800) is size 1 (only one bucket) and has two
rules that select between two subsequent hash tables (say 0x100 and
0x101) based on the most significant bits in the address.

3. Level 2 hash tables (0x100 and 0x101) are size 256 (256 buckets);
bucket selection is done by bits b10 - b17 (with b0 being the least
significant).

4. Each bucket contains complete cidr match rules (corresponding to real
customer addresses). Since bits b11 - b31 are already checked in upper
levels, this results in a maximum of 2 ^ 10 = 1024 rules, which is the
worst case, if all customer addresses that "fall" into that bucket
are /32 (fortunately this is not the real case).

In conclusion each packet would be matched against at most 1026 rules
(worst case). The real case is actually much better: only one bucket
with 400 rules, all other less than 70 rules and most of them less than
10 rules.
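
(In tc terms the layout described above would look roughly like the sketch
below; the prefixes 80.96.0.0/19 and 83.64.0.0/19, the classid 1:123 and the
device are invented for illustration, and the two-byte hashkey mask is worth
double-checking against your kernel before relying on it:

# two level-2 tables, 256 buckets each
tc filter add dev eth0 parent 1:0 prio 1 handle 100: protocol ip u32 divisor 256
tc filter add dev eth0 parent 1:0 prio 1 handle 101: protocol ip u32 divisor 256
# level 1: the default one-bucket table picks a level-2 table by the top bits,
# then hashes on bits b10-b17 of the destination address
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 800:: \
    match ip dst 80.96.0.0/19 hashkey mask 0x0003fc00 at 16 link 100:
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 800:: \
    match ip dst 83.64.0.0/19 hashkey mask 0x0003fc00 at 16 link 101:
# level 2: customer prefixes go into the bucket their b10-b17 bits select,
# e.g. 80.96.10.128/26 has b10-b17 = 2, so it lives in bucket 100:2:
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 100:2: \
    match ip dst 80.96.10.128/26 flowid 1:123

The oversized bucket - the 400-rule one - is the part that a deeper or wider
hash key would flatten further.)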
Post by Jesper Dangaard Brouer
Post by Radu Rendec
I guess htb_hysteresis only affects the actual shaping (which takes
place after the packet is classified).
Yes, htb_hysteresis basically is a hack to allow extra bursts... we
actually considered removing it completely...
It's definitely worth a try at least. Thanks for the tips!

Radu Rendec


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-04-23 18:43:19 UTC
Permalink
...
Post by Radu Rendec
Post by Jesper Dangaard Brouer
The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixes that up in 2.6.26. What kernel version are you
using?
I'm using 2.6.26, so I guess the fix is already there :(
If Jesper meant the change of the hash, I can see it only in 2.6.27.

...
Post by Radu Rendec
In conclusion each packet would be matched against at most 1026 rules
(worst case). The real case is actually much better: only one bucket
with 400 rules, all other less than 70 rules and most of them less than
10 rules.
Alas, I can't analyze all of this now, and I'm probably missing something, but
your worst and real cases look suspiciously big. Do all these classes
differ so much? Maybe you should have a look at cls_flow?

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer
2009-04-23 19:06:59 UTC
Permalink
Post by Jarek Poplawski
...
Post by Radu Rendec
Post by Jesper Dangaard Brouer
The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixes that up in 2.6.26. What kernel version are you
using?
I'm using 2.6.26, so I guess the fix is already there :(
If Jesper meant the change of hash I can see it in 2.6.27 yet.
I'm referring to:

commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
Author: Patrick McHardy <***@trash.net>
Date: Sat Jul 5 23:22:35 2008 -0700

net-sched: sch_htb: use dynamic class hash helpers

Is there any easy git way to figure out which release this commit got
into?

Cheers,
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-04-23 19:14:55 UTC
Permalink
Post by Jesper Dangaard Brouer
Post by Jarek Poplawski
...
Post by Radu Rendec
Post by Jesper Dangaard Brouer
The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixes that up in 2.6.26. What kernel version are you
using?
I'm using 2.6.26, so I guess the fix is already there :(
If Jesper meant the change of hash I can see it in 2.6.27 yet.
commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
Date: Sat Jul 5 23:22:35 2008 -0700
net-sched: sch_htb: use dynamic class hash helpers
Is there any easy git way to figure out which release this commit got
into?
I guess git-describe, but I prefer clicking at the "raw" (X-Git-Tag):
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer
2009-04-23 19:47:05 UTC
Permalink
Post by Jarek Poplawski
Post by Jesper Dangaard Brouer
Post by Jarek Poplawski
...
Post by Radu Rendec
Post by Jesper Dangaard Brouer
The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixes that up in 2.6.26. What kernel version are you
using?
I'm using 2.6.26, so I guess the fix is already there :(
If Jesper meant the change of hash I can see it in 2.6.27 yet.
commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
Date: Sat Jul 5 23:22:35 2008 -0700
net-sched: sch_htb: use dynamic class hash helpers
Is there any easy git way to figure out which release this commit got
into?
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
I think I prefer the command line edition "git-describe". But it seems
that the two approaches give different results.
(Cc'ing the git mailing list as they might know the reason)

git-describe f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
returns "v2.6.26-rc8-1107-gf4c1f3e"

While you URL returns: "X-Git-Tag: v2.6.27-rc1~964^2~219"

I also did a:
"git log v2.6.26..v2.6.27 | grep f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2"
commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

To Radu: The change I talked about is in 2.6.27, so you should try that
kernel on your system.

Hilsen
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
Jarek Poplawski
2009-04-23 20:00:42 UTC
Permalink
...
Post by Jesper Dangaard Brouer
Post by Jarek Poplawski
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
I think I prefer the command line edition "git-describe". But it seems
that the two approaches give different results.
Probably there is something more needed around this git-describe.
I prefer the command line too when I can remember this command line...

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jeff King
2009-04-23 20:09:32 UTC
Permalink
Post by Jesper Dangaard Brouer
Post by Jarek Poplawski
Post by Jesper Dangaard Brouer
Is there any easy git way to figure out which release this commit got
into?
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
I think I prefer the command line edition "git-describe". But it seems
that the two approaches give different results.
(Cc'ing the git mailing list as they might know the reason)
You want "git describe --contains". The default mode for describe is
"you are at tag $X, plus $N commits, and by the way, the sha1 is $H"
(shown as "$X-$N-g$H").

The default mode is useful for generating a unique semi-human-readable
version number (e.g., to be included in your builds).
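
For illustration, with the commit discussed in this thread (both outputs are
the ones already reported above):

$ git describe --contains f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
v2.6.27-rc1~964^2~219
$ git describe f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
v2.6.26-rc8-1107-gf4c1f3e

The --contains form names a tag that already contains the commit, which is
what answers "which release did this land in" (here: 2.6.27).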

-Peff
Jarek Poplawski
2009-04-24 06:01:16 UTC
Permalink
...
Post by Jarek Poplawski
Post by Radu Rendec
In conclusion each packet would be matched against at most 1026 rules
(worst case). The real case is actually much better: only one bucket
with 400 rules, all other less than 70 rules and most of them less than
10 rules.
Alas, I can't analyze all of this now, and I'm probably missing something, but
your worst and real cases look suspiciously big. Do all these classes
differ so much? Maybe you should have a look at cls_flow?
Actually fixing this u32 config (hashes) should be enough here.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski
2009-04-24 11:19:40 UTC
Permalink
Hi,
Hi,

Very interesting message, but try to use plain format next time.
I guess your mime/html original wasn't accepted by ***@.

Jarek P.
Hardware: quad-core Xeon X3210 (2.13GHz, 8M L2 cache), 2 Intel PCI Express Gigabit NICs
Kernel: 2.6.20
I did some udp flood tests in the following configurations - the machine was configured as a
traffic shaping bridge, about 10k htb rules loaded, using hashing (see below):
A) napi on, irqs for each card statically allocated to 2 CPU cores
when flooding, the same CPU went 100% softirq always (seems logical,
since it is statically bound to the irq)
B) napi on, CONFIG_IRQBALANCE=y
when flooding, a random CPU went 100% softirq always. (here,
at high interrupt rates, NAPI kicks in and starts using polling
rather than irqs, so no more balancing takes place since there are
no more interrupts - checked this with /proc/interrupts - at high packet
rates the irq counters for the network cards stalled)
C) napi off, CONFIG_IRQBALANCE=y
this is the setup I used in the end since all CPU cores were used. All of them
went to 100%, and the pps rate I could pass through was higher than in
case A or B.
Also, your worst case hashing setup could be improved - I suggest you take a look at
http://vcalinus.gemenii.ro/?p=9 (see the generated filters example). The hashing method
described there will take a constant CPU time (4 checks) for each packet, regardless of how many
filter rules you have (provided you only filter by IP address). A tree of hashtables
is constructed which matches each of the four bytes from the IP address in succession.
Using this hashing method, on the hardware above, with 2.6.20, napi off and irq balancing on, I got
throughputs of 1.3Gbps / 250,000 pps aggregated in+out in normal usage. CPU utilization
averages varied between 25 - 50 % for every core, so there was still room to grow.
I expect much higher pps rates with better hardware (higher freq/larger cache Xeons).
Post by Radu Rendec
Post by Jesper Dangaard Brouer
Its runtime adjustable, so its easy to try out.
via /sys/module/sch_htb/parameters/htb_hysteresis
Thanks for the tip! This means I can play around with various values
while the machine is in production and see how it reacts.
Post by Jesper Dangaard Brouer
The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixes that up in 2.6.26. What kernel version are you
using?
I'm using 2.6.26, so I guess the fix is already there :(
Post by Jesper Dangaard Brouer
Could you explain how you do classification? And perhaps outline where you
possible scalability issue is located?
If you are interested how I do scalable classification, see my
http://nfws.inl.fr/en/?p=115
http://www.netoptimizer.dk/presentations/nfsw2008/Jesper-Brouer_Large-iptables-rulesets.pdf
I had a look at your presentation and it seems to be focused in dividing
a single iptables rule chain into multiple chains, so that rule lookup
complexity decreases from linear to logarithmic.
Since I only need to do shaping, I don't use iptables at all. Address
matching is all done in on the egress side, using u32. Rule schema is
80.x.y.z and 83.a.b.c; customer address spaces range from /22 nets to
individual /32 addresses.
2. The default ip hash (0x800) is size 1 (only one bucket) and has two
rules that select between two subsequent hash tables (say 0x100 and
0x101) based on the most significant bits in the address.
3. Level 2 hash tables (0x100 and 0x101) are size 256 (256 buckets);
bucket selection is done by bits b10 - b17 (with b0 being the least
significant).
4. Each bucket contains complete cidr match rules (corresponding to real
customer addresses). Since bits b11 - b31 are already checked in upper
levels, this results in a maximum of 2 ^ 10 = 1024 rules, which is the
worst case, if all customer addresses that "fall" into that bucket
are /32 (fortunately this is not the real case).
In conclusion each packet would be matched against at most 1026 rules
(worst case). The real case is actually much better: only one bucket
with 400 rules, all other less than 70 rules and most of them less than
10 rules.
Post by Jesper Dangaard Brouer
Post by Radu Rendec
I guess htb_hysteresis only affects the actual shaping (which takes
place after the packet is classified).
Yes, htb_hysteresis basically is a hack to allow extra bursts... we
actually considered removing it completely...
It's definitely worth a try at least. Thanks for the tips!
Radu Rendec
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best regards,
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Calin Velea
2009-04-24 11:35:38 UTC
Permalink
Hi,

Maybe some actual results I got some time ago could help you and others who had the same problems:

Hardware: quad-core Xeon X3210 (2.13GHz, 8M L2 cache), 2 Intel PCI Express Gigabit NICs
Kernel: 2.6.20

I did some udp flood tests in the following configurations - the machine was configured as a
traffic shaping bridge, about 10k htb rules loaded, using hashing (see below):

A) napi on, irqs for each card statically allocated to 2 CPU cores

when flooding, the same CPU went 100% softirq always (seems logical,
since it is statically bound to the irq)

B) napi on, CONFIG_IRQBALANCE=y

when flooding, a random CPU went 100% softirq always. (here,
at high interrupt rates, NAPI kicks in and starts using polling
rather than irqs, so no more balancing takes place since there are
no more interrupts - checked this with /proc/interrupts - at high packet
rates the irq counters for the network cards stalled)

C) napi off, CONFIG_IRQBALANCE=y

this is the setup I used in the end since all CPU cores were used. All of them
went to 100%, and the pps rate I could pass through was higher than in
case A or B.


Also, your worst case hashing setup could be improved - I suggest you take a look at
http://vcalinus.gemenii.ro/?p=9 (see the generated filters example). The hashing method
described there will take a constant CPU time (4 checks) for each packet, regardless of how many
filter rules you have (provided you only filter by IP address). A tree of hashtables
is constructed which matches each of the four bytes from the IP address in succession.

Using this hashing method, on the hardware above, with 2.6.20, napi off and irq balancing on, I got
throughputs of 1.3Gbps / 250,000 pps aggregated in+out in normal usage. CPU utilization
averages varied between 25 - 50 % for every core, so there was still room to grow.
I expect much higher pps rates with better hardware (higher freq/larger cache Xeons).
Post by Radu Rendec
Post by Jesper Dangaard Brouer
Its runtime adjustable, so its easy to try out.
via /sys/module/sch_htb/parameters/htb_hysteresis
Thanks for the tip! This means I can play around with various values
while the machine is in production and see how it reacts.
Post by Jesper Dangaard Brouer
The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixes that up in 2.6.26. What kernel version are you
using?
I'm using 2.6.26, so I guess the fix is already there :(
Post by Jesper Dangaard Brouer
Could you explain how you do classification? And perhaps outline where you
possible scalability issue is located?
If you are interested how I do scalable classification, see my
http://nfws.inl.fr/en/?p=115
http://www.netoptimizer.dk/presentations/nfsw2008/Jesper-Brouer_Large-iptables-rulesets.pdf
I had a look at your presentation and it seems to be focused in dividing
a single iptables rule chain into multiple chains, so that rule lookup
complexity decreases from linear to logarithmic.
Since I only need to do shaping, I don't use iptables at all. Address
matching is all done in on the egress side, using u32. Rule schema is
80.x.y.z and 83.a.b.c; customer address spaces range from /22 nets to
individual /32 addresses.
2. The default ip hash (0x800) is size 1 (only one bucket) and has two
rules that select between two subsequent hash tables (say 0x100 and
0x101) based on the most significant bits in the address.
3. Level 2 hash tables (0x100 and 0x101) are size 256 (256 buckets);
bucket selection is done by bits b10 - b17 (with b0 being the least
significant).
4. Each bucket contains complete cidr match rules (corresponding to real
customer addresses). Since bits b11 - b31 are already checked in upper
levels, this results in a maximum of 2 ^ 10 = 1024 rules, which is the
worst case, if all customer addresses that "fall" into that bucket
are /32 (fortunately this is not the real case).
In conclusion each packet would be matched against at most 1026 rules
(worst case). The real case is actually much better: only one bucket
with 400 rules, all other less than 70 rules and most of them less than
10 rules.
Post by Jesper Dangaard Brouer
Post by Radu Rendec
I guess htb_hysteresis only affects the actual shaping (which takes
place after the packet is classified).
Yes, htb_hysteresis basically is a hack to allow extra bursts... we
actually considered removing it completely...
It's definitely worth a try at least. Thanks for the tips!
Radu Rendec
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best regards,
Calin mailto:***@gemenii.ro

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html