Discussion:
[PATCH] net: use hardware buffer pool to allocate skb
Pan Jiafei
2014-10-15 03:26:11 UTC
Permalink
In some platforms, there are some hardware blocks provided
to manage buffers to improve performance. So in some cases,
it is expected that the packets received by some generic
NIC should be put into such hardware-managed buffers
directly, so that such buffers can be released by hardware
or by the driver.

This patch provides general APIs for a generic NIC to
use such hardware-block-managed buffers without any modification
to the generic NIC drivers.
In this patch, the following fields are added to "net_device":
void *hw_skb_priv;
struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
void (*free_hw_skb)(struct sk_buff *skb);
To let a generic NIC driver use hardware-managed
buffers, the functions "alloc_hw_skb" and "free_hw_skb"
provide the implementations for allocating and freeing
hardware-managed buffers, and "hw_skb_priv" passes private data to
these two functions.

When the socket buffer is allocated by these APIs, "hw_skb_state"
is provided in struct "sk_buff". this argument can indicate
that the buffer is hardware managed buffer, this buffer
should freed by software or by hardware.

Documentation on how to use this feature can be found at
<file:Documentation/networking/hw_skb.txt>.

Signed-off-by: Pan Jiafei <***@freescale.com>
---
Documentation/networking/hw_skb.txt | 117 ++++++++++++++++++++++++++++++++++++
include/linux/netdevice.h | 5 ++
include/linux/skbuff.h | 16 +++++
net/Kconfig | 10 +++
net/core/skbuff.c | 28 +++++++++
5 files changed, 176 insertions(+)
create mode 100644 Documentation/networking/hw_skb.txt

diff --git a/Documentation/networking/hw_skb.txt b/Documentation/networking/hw_skb.txt
new file mode 100644
index 0000000..256f3fc
--- /dev/null
+++ b/Documentation/networking/hw_skb.txt
@@ -0,0 +1,117 @@
+Documentation for using hardware-managed SKBs.
+
+1. Description
+
+In some platforms, there are some hardware blocks provided
+to manage buffers to improve performance. So in some cases,
+it is expected that the packets received by some generic
+NIC should be put into such hardware-managed buffers
+directly, so that such buffers can be released by hardware
+or by the driver.
+
+2. Related Struct Definition
+
+Some general APIs are provided for a generic NIC to use
+hardware-block-managed buffers without any modification to the generic
+NIC drivers.
+
+1)Kernel Configuration Item
+
+ "CONFIG_USE_HW_SKB"
+
+2)The DEVICE structure
+
+ struct net_device {
+ ...
+ #ifdef CONFIG_USE_HW_SKB
+ void *hw_skb_priv;
+ struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
+ void (*free_hw_skb)(struct sk_buff *skb);
+ #endif
+ ...
+ }
+
+"hw_skb_priv" is private data for "alloc_hw_skb" and "free_hw_skb" functions.
+"alloc_hw_skb" is for allocating skb by using hardware managed buffer.
+"free_hw_skb" is for freeing skb allocated by hardware manager buffer.
+
+3)struct sk_buff - socket buffer
+
+ struct sk_buff {
+ ...
+ #ifdef CONFIG_USE_HW_SKB
+ __u32 hw_skb_state;
+ void *hw_skb_priv;
+ void (*free_hw_skb)(struct sk_buff *skb);
+ #endif
+ ...
+ }
+
+ /* hw_skb_state list */
+ enum hw_skb_state {
+ /* If set, the SKB uses a hardware-managed buffer */
+ IS_HW_SKB = 1 << 0,
+ /* If set, the skb can be freed by software by calling
+ * netdev->free_hw_skb
+ */
+ HW_SKB_SW_FREE = 1 << 1,
+ };
+
+"hw_skb_priv" and "free_hw_skb" are the same with the field in the
+struct "net_device"
+
+After calling "alloc_hw_skb" to allocate skb by using hardware managed
+buffers, "hw_skb_priv" and "free_hw_skb" is set in SKB driver:
+ skb->hw_skb_priv = dev->hw_skb_priv;
+ skb->free_hw_skb = dev->free_hw_skb;
+So that when "struct net_device *dev" is changed after the skb is allocated,
+It is be confirmed that this skb can be freed by the method synced
+with allocation.
+
+"hw_skb_state" indicates that the state of SKB. When the skb is allocated
+by "alloc_hw_skb" function, the flag of "IS_HW_SKB" is set by
+"__netdev_alloc_skb" function in skbuff.c when returned from "alloc_hw_skb".
+But in "alloc_hw_skb", "HW_SKB_SW_FREE" must be set if the skb should be
+freed by calling "free_hw_skb", otherwise, the skb will never be freed by
+any driver until it is freed by hardware block.
+
+An SKB using a hardware-managed buffer is not recyclable.
+
+3. How to use this feature
+
+For example, driver "A" wants the third-party NIC driver "B" to
+store the data in some hardware managed buffer then send to "A".
+
+1) Select "CONFIG_USE_HW_SKB" to enable this feature.
+
+2) In driver "A", implement the function "alloc_hw_skb" and
+"free_hw_skb". For example:
+
+struct sk_buff *alloc_hw_skb(void *priv, unsigned int length)
+{
+	void *buf;
+	struct sk_buff *skb;
+
+	/* get a buffer from the hardware-managed pool */
+	buf = alloc_hw_buffer();
+	if (!buf)
+		return NULL;
+
+	skb = build_skb(buf, ...);
+	if (skb)
+		/* let software free this buffer via "free_hw_skb" */
+		skb->hw_skb_state |= HW_SKB_SW_FREE;
+	else
+		free_hw_buffer(buf);
+
+	return skb;
+}
+
+void free_hw_skb(struct sk_buff *skb)
+{
+	/* return the buffer to the hardware-managed pool */
+	free_hw_buffer(skb->head);
+}
+
+3) In driver "A", get "net_device" handle of net device case using
+driver "B".
+ ...
+ net_dev_b->hw_skb_priv = priv;
+ net_dev_b->alloc_hw_skb = alloc_hw_skb;
+ net_dev_b->free_hw_skb = free_hw_skb;
+ ...
+
+4) Then, when driver "B" wants to allocate skb, "alloc_hw_skb"
+will be called to allocate hardware manager skb firstly, if
+failed, the normal skb will also be allocate, if successed,
+the skb will be freed by calling free_hw_skb when "kfree_skb"
+is called to free this skb.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 838407a..42b6158 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1689,6 +1689,11 @@ struct net_device {
struct lock_class_key *qdisc_tx_busylock;
int group;
struct pm_qos_request pm_qos_req;
+#ifdef CONFIG_USE_HW_SKB
+ void *hw_skb_priv;
+ struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
+ void (*free_hw_skb)(struct sk_buff *skb);
+#endif
};
#define to_net_dev(d) container_of(d, struct net_device, dev)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 776104b..d9afdeb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -436,6 +436,16 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
}


+/* hw_skb_state list */
+enum hw_skb_state {
+ /* If set, the SKB uses a hardware-managed buffer */
+ IS_HW_SKB = 1 << 0,
+ /* If set, the skb can be freed by software by calling
+ * netdev->free_hw_skb
+ */
+ HW_SKB_SW_FREE = 1 << 1,
+};
+
/**
* struct sk_buff - socket buffer
* @next: Next buffer in list
@@ -646,6 +656,12 @@ struct sk_buff {
__u16 network_header;
__u16 mac_header;

+#ifdef CONFIG_USE_HW_SKB
+ __u32 hw_skb_state;
+ void *hw_skb_priv;
+ void (*free_hw_skb)(struct sk_buff *skb);
+#endif
+
__u32 headers_end[0];

/* These elements must be at the end, see alloc_skb() for details. */
diff --git a/net/Kconfig b/net/Kconfig
index d6b138e..346e021 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -291,6 +291,16 @@ config NET_FLOW_LIMIT
with many clients some protection against DoS by a single (spoofed)
flow that greatly exceeds average workload.

+config USE_HW_SKB
+ bool "NIC use hardware managed buffer to build skb"
+ depends on INET
+ ---help---
+ If you select this, third-party drivers will use hardware-managed
+ buffers to allocate SKBs without any modification to the drivers.
+
+ Documentation on how to use this feature can be found at
+ <file:Documentation/networking/hw_skb.txt>.
+
menu "Network testing"

config NET_PKTGEN
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7b3df0d..f8603e5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -415,6 +415,19 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

+#ifdef CONFIG_USE_HW_SKB
+ if (dev->alloc_hw_skb) {
+ skb = dev->alloc_hw_skb(dev->hw_skb_priv, length);
+ if (likely(skb)) {
+ skb->hw_skb_state |= IS_HW_SKB;
+ skb->hw_skb_priv = dev->hw_skb_priv;
+ skb->free_hw_skb = dev->free_hw_skb;
+ skb_reserve(skb, NET_SKB_PAD);
+ skb->dev = dev;
+ return skb;
+ }
+ }
+#endif
if (fragsz <= PAGE_SIZE && !(gfp_mask & (__GFP_WAIT | GFP_DMA))) {
void *data;

@@ -432,6 +445,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
SKB_ALLOC_RX, NUMA_NO_NODE);
}
+
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -483,6 +497,15 @@ static void skb_clone_fraglist(struct sk_buff *skb)

static void skb_free_head(struct sk_buff *skb)
{
+#ifdef CONFIG_USE_HW_SKB
+ if (skb->hw_skb_state & IS_HW_SKB) {
+ if (skb->hw_skb_state & HW_SKB_SW_FREE) {
+ BUG_ON(!skb->free_hw_skb);
+ skb->free_hw_skb(skb);
+ }
+ return;
+ }
+#endif
if (skb->head_frag)
put_page(virt_to_head_page(skb->head));
else
@@ -506,6 +529,10 @@ static void skb_release_data(struct sk_buff *skb)
* If skb buf is from userspace, we need to notify the caller
* the lower device DMA has done;
*/
+#ifdef CONFIG_USE_HW_SKB
+ if (skb->hw_skb_state & IS_HW_SKB)
+ goto skip_callback;
+#endif
if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
struct ubuf_info *uarg;

@@ -514,6 +541,7 @@ static void skb_release_data(struct sk_buff *skb)
uarg->callback(uarg, true);
}

+skip_callback:
if (shinfo->frag_list)
kfree_skb_list(shinfo->frag_list);

--
2.1.0.27.g96db324

Eric Dumazet
2014-10-15 04:15:03 UTC
Permalink
On Wed, 2014-10-15 at 11:26 +0800, Pan Jiafei wrote:
> In some platforms, there are some hardware blocks provided
> to manage buffers to improve performance. So in some cases,
> it is expected that the packets received by some generic
> NIC should be put into such hardware-managed buffers
> directly, so that such buffers can be released by hardware
> or by the driver.

You repeat 'some' four times.

>
> This patch provides general APIs for a generic NIC to
> use such hardware-block-managed buffers without any modification
> to the generic NIC drivers.

...

> In this patch, the following fields are added to "net_device":
> void *hw_skb_priv;
> struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
> void (*free_hw_skb)(struct sk_buff *skb);
> To let a generic NIC driver use hardware-managed
> buffers, the functions "alloc_hw_skb" and "free_hw_skb"
> provide the implementations for allocating and freeing
> hardware-managed buffers, and "hw_skb_priv" passes private data to
> these two functions.
>
> When the socket buffer is allocated by these APIs, "hw_skb_state"
> is provided in struct "sk_buff". this argument can indicate
> that the buffer is hardware managed buffer, this buffer
> should freed by software or by hardware.
>
> Documentation on how to use this feature can be found at
> <file:Documentation/networking/hw_skb.txt>.
>
> Signed-off-by: Pan Jiafei <***@freescale.com>


I am giving a strong NACK, of course.

We are not going to grow sk_buff and add yet another conditional in fast
path for a very obscure feature like that.

Memory management is not going to be done by drivers.

The way it should work is that if your hardware has specific needs, the rx
and tx paths (of the driver) need to make the needed adaptation.
Not the other way around.

We already have complex skb layouts, we do not need a new one.

Take a look at how drivers can 'lock' pages already, and build skbs with
page frags. It is already there.


David Miller
2014-10-15 04:26:01 UTC
Permalink
From: Eric Dumazet <***@gmail.com>
Date: Tue, 14 Oct 2014 21:15:03 -0700

> Take a look at how drivers can 'lock' pages already, and build skbs
> with page frags. It is already there.

+1
J***@freescale.com
2014-10-15 05:43:05 UTC
Permalink
Eric Dumazet
2014-10-15 09:15:21 UTC
Permalink
On Wed, 2014-10-15 at 05:43 +0000, ***@freescale.com wrote:

> For my case, there are some shortcomings to using page frags. Firstly, I have
> to modify each Ethernet driver to support it, especially since I don't know
> which vendor's driver the customer will use. Secondly, it is not enough to
> only build the skb by 'lock'ing pages; the buffer address comes from the
> hardware block and should be aligned to the hardware block.

So you align to the hardware block. What is the problem with that?


David Miller
2014-10-15 04:25:14 UTC
Permalink
From: Pan Jiafei <***@freescale.com>
Date: Wed, 15 Oct 2014 11:26:11 +0800

> In some platforms, there are some hardware blocks provided
> to manage buffers to improve performance. So in some cases,
> it is expected that the packets received by some generic
> NIC should be put into such hardware-managed buffers
> directly, so that such buffers can be released by hardware
> or by the driver.
>
> This patch provides general APIs for a generic NIC to
> use such hardware-block-managed buffers without any modification
> to the generic NIC drivers.

Why are existing interfaces insufficient for your needs?

Several ethernet drivers already build SKBs from block
buffer pools.

They allocate pools of pages which the hardware divides into various
powers of 2, then the RX descriptor says what pieces of which pools
were used to hold the data for a packet, and then the SKB is
constructed with page frags pointing to those locations.

It's very cheap, inexpensive, and existing APIs are considered to
cover all use cases.
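
Roughly, that pattern looks like this (a minimal sketch only; the helper
and variable names are illustrative, not taken from any particular
driver):

	/* RX completion: wrap a driver-owned page in an skb without copying
	 * the payload.  'page', 'offset' and 'size' come from the driver's
	 * buffer pool and RX descriptor.
	 */
	static struct sk_buff *rx_build_skb(struct net_device *dev,
					    struct page *page,
					    unsigned int offset,
					    unsigned int size)
	{
		unsigned int hlen = min_t(unsigned int, size, 128);
		struct sk_buff *skb;

		/* small linear area for the headers only */
		skb = netdev_alloc_skb_ip_align(dev, hlen);
		if (!skb)
			return NULL;

		/* copy just the headers so the stack can rewrite them */
		skb_copy_to_linear_data(skb, page_address(page) + offset, hlen);
		skb_put(skb, hlen);

		/* the payload stays in the driver's buffer pool; the driver
		 * keeps its own page reference so it can reuse the rest of
		 * the page once the stack releases the frag
		 */
		if (size > hlen) {
			get_page(page);
			skb_add_rx_frag(skb, 0, page, offset + hlen,
					size - hlen, PAGE_SIZE / 2);
		}

		skb->protocol = eth_type_trans(skb, dev);
		return skb;
	}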
J***@freescale.com
2014-10-15 05:34:24 UTC
Permalink
> -----Original Message-----
> From: David Miller [mailto:***@davemloft.net]
> Sent: Wednesday, October 15, 2014 12:25 PM
> To: Pan Jiafei-B37022
> Cc: ***@suse.cz; ***@vger.kernel.org; Li Yang-Leo-R58472; linux-
> ***@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
>
> From: Pan Jiafei <***@freescale.com>
> Date: Wed, 15 Oct 2014 11:26:11 +0800
>
> > In some platforms, there are some hardware blocks provided
> > to manage buffers to improve performance. So in some cases,
> > it is expected that the packets received by some generic
> > NIC should be put into such hardware-managed buffers
> > directly, so that such buffers can be released by hardware
> > or by the driver.
> >
> > This patch provides general APIs for a generic NIC to
> > use such hardware-block-managed buffers without any modification
> > to the generic NIC drivers.
>
> Why are existing interfaces insufficient for your needs?
>
> Several ethernet drivers already build SKBs from block
> buffer pools.
>
Yes, but to do that we would have to modify the Ethernet
drivers. For example, if driver A wants drivers B, C, D, ... to support driver
A's hardware block access functions, we have to modify drivers B, C, D, ...
That gets very complex.

But with my patch, I just set a flag on an Ethernet device (no matter
which driver it uses) from driver A, so that this Ethernet device
uses the hardware block access functions provided by
driver A.

> They allocate pools of pages which the hardware divides into various
> powers of 2, then the RX descriptor says what pieces of which pools
> were used to hold the data for a packet, and then the SKB is
> constructed with page frags pointing to those locations.
>
> It's very cheap, inexpensive, and existing APIs are considered to
> cover all use cases.
Eric Dumazet
2014-10-15 09:15:33 UTC
Permalink
On Wed, 2014-10-15 at 05:34 +0000, ***@freescale.com wrote:

> Yes, but to do that we would have to modify the Ethernet
> drivers. For example, if driver A wants drivers B, C, D, ... to support driver
> A's hardware block access functions, we have to modify drivers B, C, D, ...
> That gets very complex.
>
> But with my patch, I just set a flag on an Ethernet device (no matter
> which driver it uses) from driver A, so that this Ethernet device
> uses the hardware block access functions provided by
> driver A.

We care a lot about all the bugs added by your patches. You have little
idea of how many of them were added. We do not want to spend days of
work explaining everything or fixing all the details for you.

Carefully read net/core/skbuff.c, net/core/dev.c, GRO layer, you'll see
how many spots you missed.

You cannot control how skbs are cooked before reaching your driver
ndo_start_xmit(). You are not going to add hooks in UDP, TCP, or other
drivers' RX paths. This would be absolutely insane.

Trying to control how skbs are cooked in the RX path is absolutely something
drivers do, using page frags that are read-only for the whole stack.

Fix your driver to use existing infra, your suggestion is not going to
be accepted.


J***@freescale.com
2014-10-16 02:17:16 UTC
Permalink
J***@freescale.com
2014-10-16 02:17:14 UTC
Permalink
Eric Dumazet
2014-10-16 04:15:12 UTC
Permalink
On Thu, 2014-10-16 at 02:17 +0000, ***@freescale.com wrote:

> Thanks for your comments and suggestions. In my case, I want to build skbs
> from memory specified by the hardware block. I can only see two ways: one is
> to modify the net card driver to replace the common skb allocation function
> with my special functions; the other is to hook the common skb allocation
> function so that it redirects to my special functions. My patch takes the
> second way. Apart from these two ways, would you please give me some advice
> on some other way for my case? Thanks

I suggest you read drivers/net/ethernet numerous examples.

No need to change anything in net/* or include/*, really.

For a start, look at drivers/net/ethernet/intel/igb/igb_main.c

Mentioning 'hack' in your mails simply should hint you are doing
something very wrong.

What makes you think your hardware is so special ?


J***@freescale.com
2014-10-16 05:15:30 UTC
Permalink
Alexander Duyck
2014-10-16 15:28:19 UTC
Permalink
On 10/15/2014 10:15 PM, ***@freescale.com wrote:
>> -----Original Message-----
>> From: Eric Dumazet [mailto:***@gmail.com]
>> Sent: Thursday, October 16, 2014 12:15 PM
>> To: Pan Jiafei-B37022
>> Cc: David Miller; ***@suse.cz; ***@vger.kernel.org; Li Yang-Leo-R58472;
>> linux-***@vger.kernel.org
>> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
>>
>> On Thu, 2014-10-16 at 02:17 +0000, ***@freescale.com wrote:
>>
>>> Thanks for your comments and suggestions. In my case, I want to build skbs
>>> from memory specified by the hardware block. I can only see two ways: one is
>>> to modify the net card driver to replace the common skb allocation function
>>> with my special functions; the other is to hook the common skb allocation
>>> function so that it redirects to my special functions. My patch takes the
>>> second way. Apart from these two ways, would you please give me some advice
>>> on some other way for my case? Thanks
>> I suggest you read drivers/net/ethernet numerous examples.
>>
>> No need to change anything in net/* or include/*, really.
>>
>> For a start, look at drivers/net/ethernet/intel/igb/igb_main.c
>>
>> Mentioning 'hack' in your mails simply should hint you are doing
>> something very wrong.
>>
>> What makes you think your hardware is so special ?
>>
> In fact, I am developing a bridge driver; it can bridge between any other
> third-party net card and my own net card. My goal is to let any other
> third-party net card directly use my own net card's specified buffers, so
> that there is no memory copy in the whole bridge process.
> By the way, I don't see any similarity between igb_main.c and my case, and
> my bridge also can't be implemented with "skb frag" if it is to achieve
> zero memory copy.

I think the part you are not getting is that is how buffers are
essentially handled now. So for example in the case if igb the only
part we have copied out is usually the header, or the entire frame in
the case of small packets. This has to happen in order to allow for
changes to the header for routing and such. Beyond that the frags that
are passed are the buffers that igb is still holding onto. So
effectively what the other device transmits in a bridging/routing
scenario is my own net card specified buffer plus the copied/modified
header.

For a brief period igb used build_skb but that isn't valid on most
systems as memory mapped for a device can be overwritten if the page is
unmapped resulting in any changes to the header for routing/bridging
purposes being invalidated. Thus we cannot use the buffers for both the
skb->data header which may be changed and Rx DMA simultaneously.

Thanks,

Alex
Eric Dumazet
2014-10-16 16:57:37 UTC
Permalink
On Thu, 2014-10-16 at 08:28 -0700, Alexander Duyck wrote:

> I think the part you are not getting is that is how buffers are
> essentially handled now. So for example in the case if igb the only
> part we have copied out is usually the header, or the entire frame in
> the case of small packets. This has to happen in order to allow for
> changes to the header for routing and such. Beyond that the frags that
> are passed are the buffers that igb is still holding onto. So
> effectively what the other device transmits in a bridging/routing
> scenario is my own net card specified buffer plus the copied/modified
> header.
>
> For a brief period igb used build_skb but that isn't valid on most
> systems as memory mapped for a device can be overwritten if the page is
> unmapped resulting in any changes to the header for routing/bridging
> purposes being invalidated. Thus we cannot use the buffers for both the
> skb->data header which may be changed and Rx DMA simultaneously.

This reminds me that igb still has skb->truesize underestimation by 100%

If a fragment is held in some socket receive buffer, a full page is
consumed, not 2048 bytes.
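With the usual 4 KB page split into two 2 KB halves, once the driver
stops reusing that page, a single fragment sitting in a receive queue
keeps the whole 4096-byte page allocated while the socket is only
charged 2048 bytes of truesize.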

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a21b14495ebd..56ca6c78985e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6586,9 +6586,11 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
struct page *page = rx_buffer->page;
unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
#if (PAGE_SIZE < 8192)
- unsigned int truesize = IGB_RX_BUFSZ;
+ unsigned int segsize = IGB_RX_BUFSZ;
+ unsigned int truesize = PAGE_SIZE;
#else
- unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
+ unsigned int segsize = ALIGN(size, L1_CACHE_BYTES);
+ unsigned int truesize = segsize;
#endif

if ((size <= IGB_RX_HDR_LEN) && !skb_is_nonlinear(skb)) {
@@ -6614,7 +6616,7 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
rx_buffer->page_offset, size, truesize);

- return igb_can_reuse_rx_page(rx_buffer, page, truesize);
+ return igb_can_reuse_rx_page(rx_buffer, page, segsize);
}

static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,


Alexander Duyck
2014-10-16 17:10:27 UTC
Permalink
On 10/16/2014 09:57 AM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 08:28 -0700, Alexander Duyck wrote:
>
>> I think the part you are not getting is that is how buffers are
>> essentially handled now. So for example in the case if igb the only
>> part we have copied out is usually the header, or the entire frame in
>> the case of small packets. This has to happen in order to allow for
>> changes to the header for routing and such. Beyond that the frags that
>> are passed are the buffers that igb is still holding onto. So
>> effectively what the other device transmits in a bridging/routing
>> scenario is my own net card specified buffer plus the copied/modified
>> header.
>>
>> For a brief period igb used build_skb but that isn't valid on most
>> systems as memory mapped for a device can be overwritten if the page is
>> unmapped resulting in any changes to the header for routing/bridging
>> purposes being invalidated. Thus we cannot use the buffers for both the
>> skb->data header which may be changed and Rx DMA simultaneously.
> This reminds me that igb still has skb->truesize underestimation by 100%
>
> If a fragment is held in some socket receive buffer, a full page is
> consumed, not 2048 bytes.
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index a21b14495ebd..56ca6c78985e 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -6586,9 +6586,11 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
> struct page *page = rx_buffer->page;
> unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
> #if (PAGE_SIZE < 8192)
> - unsigned int truesize = IGB_RX_BUFSZ;
> + unsigned int segsize = IGB_RX_BUFSZ;
> + unsigned int truesize = PAGE_SIZE;
> #else
> - unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
> + unsigned int segsize = ALIGN(size, L1_CACHE_BYTES);
> + unsigned int truesize = segsize;
> #endif

So if a page is used twice we are double counting the page size for the
socket then, is that correct? I just want to make sure because prior to
this patch both flows did the same thing and counted the portion of the
page used in this pass, now with this change for PAGE_SIZE of 4K we
count the entire page, and for all other cases we count the portion of
the page used.

Thanks,

Alex


Eric Dumazet
2014-10-16 17:45:29 UTC
Permalink
On Thu, 2014-10-16 at 10:10 -0700, Alexander Duyck wrote:

> So if a page is used twice we are double counting the page size for the
> socket then, is that correct? I just want to make sure because prior to
> this patch both flows did the same thing and counted the portion of the
> page used in this pass, now with this change for PAGE_SIZE of 4K we
> count the entire page, and for all other cases we count the portion of
> the page used.

When a page is split in 2 parts only, probability that a segment holds
the 4K page is quite high (There is a single half page)

When we split say 64KB in 42 segments, the probability a single segment
hold the full 64KB block is very low, so we can almost be safe when we
consider 'truesize = 1536'

Of course there are pathological cases, but attacker has to be quite
smart.

I am just saying that counting 2048 might have a big impact on memory
consumption if all these incoming segments are stored a long time in
receive queues (TCP receive queues or out of order queues) : We might be
off by a factor of 2 on the real memory usage, and delay the TCP
collapsing too much.


Alexander Duyck
2014-10-16 18:20:28 UTC
Permalink
On 10/16/2014 10:45 AM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 10:10 -0700, Alexander Duyck wrote:
>
>> So if a page is used twice we are double counting the page size for the
>> socket then, is that correct? I just want to make sure because prior to
>> this patch both flows did the same thing and counted the portion of the
>> page used in this pass, now with this change for PAGE_SIZE of 4K we
>> count the entire page, and for all other cases we count the portion of
>> the page used.
> When a page is split in 2 parts only, probability that a segment holds
> the 4K page is quite high (There is a single half page)

Actually the likelihood of anything holding onto the 4K page for very
long doesn't seem to occur, at least from the drivers perspective. It
is one of the reasons why I went for the page reuse approach rather than
just partitioning a single large page. It allows us to avoid having to
call IOMMU map/unmap for the pages since the entire page is usually back
in the driver ownership before we need to reuse the portion given to the
stack.
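
Roughly, that reuse scheme works like the sketch below (illustrative
only, with made-up names - not the actual igb code):

	struct rx_buf {
		struct page *page;
		unsigned int page_offset;	/* 0 or PAGE_SIZE / 2 */
	};

	/* Hand one half of the page to the stack as a frag, then decide
	 * whether the buffer can be recycled by flipping to the other half.
	 */
	static bool rx_add_frag_and_try_reuse(struct sk_buff *skb,
					      struct rx_buf *buf,
					      unsigned int size)
	{
		struct page *page = buf->page;
		unsigned int half = PAGE_SIZE / 2;

		/* the frag inherits the driver's page reference and will
		 * put_page() it when the skb is freed
		 */
		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
				buf->page_offset, size, half);

		/* if nobody else still holds the page, the previously used
		 * half has already been freed: flip the offset and take a
		 * fresh reference so the page (and its DMA mapping) stays
		 * with the driver after this frag is released
		 */
		if (page_count(page) == 1) {
			buf->page_offset ^= half;
			get_page(page);
			return true;
		}

		return false;	/* other half still in use, allocate a new page */
	}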

> When we split say 64KB in 42 segments, the probability a single segment
> holds the full 64KB block is very low, so we can almost be safe when we
> consider 'truesize = 1536'

Yes, but the likelihood that only a few segments are holding the page is
still very high. So you might not have one segment holding the 64K
page, but I find it very difficult to believe that all 42 would be
holding it at the same time. In that case should we be adding some
portion of the 64K to the truesize for all frames to account for this?

> Of course there are pathological cases, but attacker has to be quite
> smart.
>
> I am just saying that counting 2048 might have a big impact on memory
> consumption if all these incoming segments are stored a long time in
> receive queues (TCP receive queues or out of order queues) : We might be
> off by a factor of 2 on the real memory usage, and delay the TCP
> collapsing too much.

My concern would be that we are off by a factor of 2 and prematurely
collapse the TCP too soon with this change. For example if you are
looking at a socket that is holding pages for a long period of time
there would be a good chance of it ending up with both halves of the
page. In this case is it fair to charge it for 8K of memory use when in
reality it is only using 4K?

Thanks,

Alex

Eric Dumazet
2014-10-16 21:40:26 UTC
Permalink
On Thu, 2014-10-16 at 11:20 -0700, Alexander Duyck wrote:

> My concern would be that we are off by a factor of 2 and prematurely
> collapse the TCP too soon with this change.

That is the opposite actually. We can consume 4K but we pretend we
consume 2K in some worst cases.

> For example if you are
> looking at a socket that is holding pages for a long period of time
> there would be a good chance of it ending up with both halves of the
> page. In this case is it fair to charge it for 8K of memory use when in
> reality it is only using 4K?

It's better to collapse too soon than too late.

If you want to avoid collapses because one host has plenty of memory,
all you need to do is increase tcp_rmem.

Why are you referring to 8K ? PAGE_SIZE is 4K


Alexander Duyck
2014-10-16 22:12:37 UTC
Permalink
On 10/16/2014 02:40 PM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 11:20 -0700, Alexander Duyck wrote:
>
>> My concern would be that we are off by a factor of 2 and prematurely
>> collapse the TCP too soon with this change.
> That is the opposite actually. We can consume 4K but we pretend we
> consume 2K in some worst cases.

The only case where we consume the full 4K but only list it as 2K should
be if we have memory from the wrong node and we want to flush it from
the descriptor queue. For all other cases we should be using the page
at least twice per buffer. So the first page that was assigned for
an Rx descriptor might be flushed but then after that reuse should take
hold and stay in place as long as the NAPI poll doesn't change NUMA nodes.

That should be no worse than the case where the remaining space in a
large page is not large enough to use as a buffer. You still use the
current size as your truesize, you don't include the overhead of the
unused space in your calculation.

>> For example if you are
>> looking at a socket that is holding pages for a long period of time
>> there would be a good chance of it ending up with both halves of the
>> page. In this case is it fair to charge it for 8K of memory use when in
>> reality it is only using 4K?
> It's better to collapse too soon than too late.
>
> If you want to avoid collapses because one host has plenty of memory,
> all you need to do is increase tcp_rmem.
>
> Why are you referring to 8K ? PAGE_SIZE is 4K

The truesize would be reported as 8K vs 4K for 2 half pages with your
change if we were to hand off both halves of a page to the same socket.

The 2K value makes sense and is consistent with how we handle this in
other cases where we are partitioning pages for use as network buffers.
I think increasing this to 4K is just going to cause performance issues
as flows are going to get choked off prematurely for memory usage that
they aren't actually getting.

Part of my hesitation is that I spent the last couple of years
explaining to our performance testing team and customers that they need
to adjust tcp_rmem with all of the changes that have been made to
truesize and the base network drivers, and I think I would prefer it if
I didn't have to go another round of it. Then again I probably won't
have to anyway since I am not doing drivers for Intel any more, but
still my reaction to this kind of change is what it is.

Thanks,

Alex




David Laight
2014-10-17 09:11:09 UTC
Permalink
From: Alexander Duyck
...
> Actually the likelihood of anything holding onto the 4K page for very
> long doesn't seem to occur, at least from the drivers perspective. It
> is one of the reasons why I went for the page reuse approach rather than
> just partitioning a single large page. It allows us to avoid having to
> call IOMMU map/unmap for the pages since the entire page is usually back
> in the driver ownership before we need to reuse the portion given to the
> stack.

That is almost certainly true for most benchmarks, benchmark processes
consume receive data.
But what about real life situations?

There must be some 'normal' workloads where receive data doesn't get consumed.

David
Alexander Duyck
2014-10-17 14:40:25 UTC
Permalink
On 10/17/2014 02:11 AM, David Laight wrote:
> From: Alexander Duyck
> ...
>> Actually the likelihood of anything holding onto the 4K page for very
>> long doesn't seem to occur, at least from the drivers perspective. It
>> is one of the reasons why I went for the page reuse approach rather than
>> just partitioning a single large page. It allows us to avoid having to
>> call IOMMU map/unmap for the pages since the entire page is usually back
>> in the driver ownership before we need to reuse the portion given to the
>> stack.
> That is almost certainly true for most benchmarks, benchmark processes
> consume receive data.
> But what about real life situations?
>
> There must be some 'normal' workloads where receive data doesn't get consumed.
>
> David
>

Yes, but for workloads where receive data doesn't get consumed it is
very unlikely that much receive data is generated. As such from the
device perspective the time the socket is holding the page is still not
all that long as the descriptor ring is not being looped through that
quickly. The page has almost always been freed by the time we have
processed our way through the full descriptor ring.

Thanks,

Alex
Eric Dumazet
2014-10-17 16:55:13 UTC
Permalink
On Fri, 2014-10-17 at 07:40 -0700, Alexander Duyck wrote:
> On 10/17/2014 02:11 AM, David Laight wrote:
> > From: Alexander Duyck
> > ...
> >> Actually the likelihood of anything holding onto the 4K page for very
> >> long doesn't seem to occur, at least from the drivers perspective. It
> >> is one of the reasons why I went for the page reuse approach rather than
> >> just partitioning a single large page. It allows us to avoid having to
> >> call IOMMU map/unmap for the pages since the entire page is usually back
> >> in the driver ownership before we need to reuse the portion given to the
> >> stack.
> > That is almost certainly true for most benchmarks, benchmark processes
> > consume receive data.
> > But what about real life situations?
> >
> > There must be some 'normal' workloads where receive data doesn't get consumed.
> >
> > David
> >
>
> Yes, but for workloads where receive data doesn't get consumed it is
> very unlikely that much receive data is generated.

This is very optimistic.

Any kind of flood can generate 5 or 6 Million packets per second.

So in stress conditions, we possibly consume twice as much ram as allotted
in tcp_mem. (About 3GBytes per second, think about it)

This is fine, if admins are aware of that and can adjust tcp_mem
accordingly to this.

Apparently none of your customers suffered from this, maybe they had
enough headroom to absorb the over commit or they trusted us and could not
find culprit if they had issues.

Open 50,000 tcp sockets, do not read data on 50% of them (pretend you
are busy on disk access or doing cpu intensive work). As traffic is interleaved
(between consumed data and non consumed data), you'll have the side
effect of consuming more ram than advertised.

Compare /proc/net/protocols (grep TCP /proc/net/protocols) and output of
'free', and you'll see that we are not good citizens.

I will work on TCP stack, to go beyond what I did in commit
b49960a05e3212 ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")

So that TCP should not care if a driver chose to potentially use 4K per
MSS.

Right now, it seems we can drop few packets, and get a slight reduction in
throughput (TCP is very sensitive to losses, even if we drop 0.1 % of packets)


Alexander Duyck
2014-10-17 18:28:58 UTC
Permalink
On 10/17/2014 09:55 AM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 07:40 -0700, Alexander Duyck wrote:
>> On 10/17/2014 02:11 AM, David Laight wrote:
>>> From: Alexander Duyck
>>> ...
>>>> Actually the likelihood of anything holding onto the 4K page for very
>>>> long doesn't seem to occur, at least from the drivers perspective. It
>>>> is one of the reasons why I went for the page reuse approach rather than
>>>> just partitioning a single large page. It allows us to avoid having to
>>>> call IOMMU map/unmap for the pages since the entire page is usually back
>>>> in the driver ownership before we need to reuse the portion given to the
>>>> stack.
>>> That is almost certainly true for most benchmarks, benchmark processes
>>> consume receive data.
>>> But what about real life situations?
>>>
>>> There must be some 'normal' workloads where receive data doesn't get consumed.
>>>
>>> David
>>>
>> Yes, but for workloads where receive data doesn't get consumed it is
>> very unlikely that much receive data is generated.
> This is very optimistic.
>
> Any kind of flood can generate 5 or 6 Million packets per second.

That is fine. The first 256 (default descriptor ring size) might be 4K
while reporting truesize of 2K, after that each page is guaranteed to be
split in half so we get at least 2 uses per page.

> So in stress conditions, we possibly consume twice as much ram as allotted
> in tcp_mem. (About 3GBytes per second, think about it)

I see what you are trying to get at, but I don't see how my scenario is
worse than the setups that use a large page and partition it.

> This is fine, if admins are aware of that and can adjust tcp_mem
> accordingly to this.

I can say I have never had a single report of us feeding too much
memory to the sockets, if anything the complaints I have seen have
always been that the socket is being starved due to too much memory
being used to move small packets. That is one of the reasons I decided
we had to have a copy-break built in for packets 256B and smaller. It
doesn't make much sense to allocate 2K + ~1K (skb + skb->head) for 256B
or less of payload data.
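
A copy-break path is tiny - a hedged sketch, with the threshold and the
names as assumptions rather than actual driver code:

	#define RX_COPYBREAK	256

	/* For small frames, copy the payload into a small freshly allocated
	 * skb and leave the receive page in the driver's pool, instead of
	 * pinning a 2K half page for a 256-byte packet.
	 */
	static struct sk_buff *rx_copybreak(struct net_device *dev,
					    struct page *page,
					    unsigned int offset,
					    unsigned int size)
	{
		struct sk_buff *skb;

		if (size > RX_COPYBREAK)
			return NULL;	/* take the page-frag path instead */

		skb = netdev_alloc_skb_ip_align(dev, size);
		if (!skb)
			return NULL;

		memcpy(skb_put(skb, size), page_address(page) + offset, size);

		/* the page is untouched and stays in the RX ring for reuse */
		return skb;
	}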

> Apparently none of your customers suffered from this, maybe they had
> enough headroom to absorb the over commit or they trusted us and could not
> find culprit if they had issues.

Correct. I've never received complaints about memory overcommit. Like I
have said in most cases we are always getting the page back anyway so we
usually get a good ROI on the page recycling.

> Open 50,000 tcp sockets, do not read data on 50% of them (pretend you
> are busy on disk access or doing cpu intensive work). As traffic is interleaved
> (between consumed data and non consumed data), you'll have the side
> effect of consuming more ram than advertised.

Yes, but in the case we are talking about it is only off by a factor of
2. How do you account for the setups such as the code for allocating an
skb that is allocating a 32K page over multiple frames. In your setup I
would suspect it wouldn't be uncommon for the socket to end up with
multiple spots where only a few sockets are holding the entire 32K page
for some period of time. So does that mean we should hit anybody that
uses netdev_alloc_skb with the overhead for 32K since there are
scenarios where that can happen?

> Compare /proc/net/protocols (grep TCP /proc/net/protocols) and output of
> 'free', and you'll see that we are not good citizens.

I'm assuming there is some sort of test I should be running while I do
this? Otherwise the current dump of those is not too interesting
currently because my system is idle.

> I will work on TCP stack, to go beyond what I did in commit
> b49960a05e3212 ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
>
> So that TCP should not care if a driver chose to potentially use 4K per
> MSS.

So long as it doesn't impact performance significantly I am fine with
it. My concern is that you are bringing up issues that none of the
customers were bringing up when I was at Intel, and the fixes you are
proposing are likely to result in customers seeing things they will
report as issues.

> Right now, it seems we can drop few packets, and get a slight reduction in
> throughput (TCP is very sensitive to losses, even if we drop 0.1 % of packets)

Yes, I am well aware of this bit. That is my concern. You increase the
size for truesize, it will cut the amount of queueing in half, and then
igb will start seeing drops when it has to deal with bursty traffic and
people will start to complain about a performance regression. That is
the bit I want to avoid.

Thanks,

Alex

Eric Dumazet
2014-10-17 18:53:00 UTC
Permalink
On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:

> Yes, I am well aware of this bit. That is my concern. You increase the
> size for truesize, it will cut the amount of queueing in half, and then
> igb will start seeing drops when it has to deal with bursty traffic and
> people will start to complain about a performance regression. That is
> the bit I want to avoid.

Then, what about addressing the issue for good, instead of working
around it?

Don't worry, I will work on it.

We are also working on direct placement in memory for the TCP receive
path (the equivalent of sendfile(), but for the receiver), to avoid the
need to hold payload in kernel memory (out-of-order queues).



Eric Dumazet
2014-10-18 00:26:00 UTC
Permalink
On Fri, 2014-10-17 at 11:53 -0700, Eric Dumazet wrote:

> Don't worry, I will work on it.

A very simple head-drop strategy instead of tail drop on the socket
backlog seems to do the trick:

TCP behaves way better when there are head drops, as fast retransmits
can usually close the gap without any added delay. The sender automatically
adjusts its cwnd (or rate) according to the receiver's skb->truesize /
skb->len ratio.





Eric Dumazet
2014-10-17 19:02:19 UTC
Permalink
On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:

> So long as it doesn't impact performance significantly I am fine with
> it. My concern is that you are bringing up issues that none of the
> customers were bringing up when I was at Intel, and the fixes you are
> proposing are likely to result in customers seeing things they will
> report as issues.

We regularly hit these issues at Google.

We have memory containers, and we detect quite easily if some layer is
lying, because we can't afford having 20% of headroom on our servers.

I am not claiming IGB is the only offender.

I am sorry if you believe it was an attack on IGB, or any network
driver.

truesize should really be the thing that protects us from OOM,
and apparently driver authors hitting TCP performance problems
think it is better to simply let TCP do no garbage collection.

It seems that nobody cares or even understands what I am saying,
so I should probably not care and suggest Google buy petabytes of
memory, right?



Alexander Duyck
2014-10-17 19:38:44 UTC
Permalink
On 10/17/2014 12:02 PM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:
>
>> So long as it doesn't impact performance significantly I am fine with
>> it. My concern is that you are bringing up issues that none of the
>> customers were bringing up when I was at Intel, and the fixes you are
>> proposing are likely to result in customers seeing things they will
>> report as issues.
> We regularly hit these issues at Google.
>
> We have memory containers, and we detect quite easily if some layer is
> lying, because we can't afford having 20% of headroom on our servers.

Data is useful here. If you can give enough data about the setup to
reproduce it then we can actually start looking at fixing it. Otherwise
it is just your anecdotal data versus mine.

> I am not claiming IGB is the only offender.
>
> I am sorry if you believe it was an attack on IGB, or any network
> driver.

My concern is that igb is guilty of being off by at most a factor of 2.
What about the drivers and implementations that are off by possibly much
larger values? I'd be much more interested in this if there was data to
back up your position. Right now it is mostly just conjecture and my
concern is that this type of change may cause more harm than good.

> truesize should really be the thing that protects us from OOM,
> and apparently driver authors hitting TCP performance problems
> think it is better to simply let TCP do no garbage collection.

One key point to keep in mind is that the igb performance should take a
pretty hard hit if pages aren't being freed. The overhead difference
between the page reuse path vs non-page reuse is pretty significant. If
this is a scenario you are actually seeing, this would be of interest.

> It seems that nobody cares or even understands what I am saying,
> so I should probably not care and suggest Google buy petabytes of
> memory, right?

That's not what I am saying, but there is a trade-off we always have to
take into account. Cutting memory overhead will likely have an impact
on performance. I would like to make the best informed trade-off in
that regard rather than just assuming worst case always for the driver.

Thanks,

Alex
Eric Dumazet
2014-10-17 19:51:55 UTC
Permalink
On Fri, 2014-10-17 at 12:38 -0700, Alexander Duyck wrote:

> That's not what I am saying, but there is a trade-off we always have to
> take into account. Cutting memory overhead will likely have an impact
> on performance. I would like to make the best informed trade-off in
> that regard rather than just assuming worst case always for the driver.

It seems you misunderstood me. You believe I suggested doing another
allocation strategy in the drivers.

This was not the case.

This allocation strategy is wonderful. I repeat : This is wonderful.

We only have to make sure we do not fool memory management layers, when
they do not understand where the memory is.

Apparently you think it is hard, while it really is not.
Alexander Duyck
2014-10-17 22:13:29 UTC
Permalink
On 10/17/2014 12:51 PM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 12:38 -0700, Alexander Duyck wrote:
>
>> That's not what I am saying, but there is a trade-off we always have to
>> take into account. Cutting memory overhead will likely have an impact
>> on performance. I would like to make the best informed trade-off in
>> that regard rather than just assuming worst case always for the driver.
> It seems you misunderstood me. You believe I suggested doing another
> allocation strategy in the drivers.
>
> This was not the case.
>
> This allocation strategy is wonderful. I repeat : This is wonderful.

No, I think I understand you. I'm just not sure listing this as a 4K
allocation in truesize makes sense. The problem is the actual
allocation can be either 2K or 4K, and my concern is that by setting it
to 4K we are going to be hurting the case where the actual allocation to
the socket is only 2K for the half page w/ reuse.

I was bringing up the other allocation strategy to prove a point. From my
perspective it wouldn't make any more sense to assign 32K to the
truesize for an allocated fragment using __netdev_alloc_frag, but it can
suffer the same type of issues only to a greater extent due to the use
of the compound page. Just because it is shared among many more uses
doesn't mean it couldn't end up in a scenario where one socket somehow
keeps queueing up the 32K pages and sitting on them. I would think all
it would take is 1 bad acting flow interleaved in ~20 active flows to
suddenly gobble up a ton of memory without it being accounted for.

> We only have to make sure we do not fool memory management layers, when
> they do not understand where the memory is.
>
> Apparently you think it is hard, while it really is not.

I think you are over simplifying it. By setting it to 4K there are
situations where a socket will be double charged for getting two halves
of the same page. In these cases there will be a negative impact on
performance as the number of frames that can be queued is reduced. What
you are proposing is possibly overreporting memory use by a factor of 2
instead of possibly under-reporting it by a factor of 2.

I would be more moved by data than just conjecture on what the driver is
or isn't doing. My theory is that most of the time the page is reused
so 2K is the correct value to report, and very seldom would 4K ever be
the correct value. This is what I have seen historically with igb/ixgbe
using the page reuse. If you have cases that show that the page isn't
being reused then we can explore the 4K truesize change, but until then
I think the page is likely being reused and we should probably just
stick with the 2K value as we should be getting at least 2 uses per page.

Thanks,

Alex

J***@freescale.com
2014-10-17 02:35:35 UTC
Permalink
Eric Dumazet
2014-10-17 14:05:09 UTC
Permalink
On Fri, 2014-10-17 at 02:35 +0000, ***@freescale.com wrote:

> [Pan Jiafei] Hi, Alex, thanks for your comments. I am not sure you have
> caught my concern. For example, I want to add an igb net card
> into my bridge, and I want the igb net driver to allocate skbs using
> my specified memory address, but I don't want to modify the igb net driver
> directly. How can I do this in my bridge driver?

This is exactly our point: we do not want to modify all drivers so that
your bridge is happy with them.

You'll have to make your bridge use the standard infra instead.

IGB has no way to know in advance that a particular frame should
eventually reach your bridge anyway.
Alexander Duyck
2014-10-17 14:12:46 UTC
Permalink
On 10/17/2014 07:05 AM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 02:35 +0000, ***@freescale.com wrote:
>
>> [Pan Jiafei] Hi, Alex, thanks for your comments. I am not sure you have
>> caught my concern. For example, I want to add an igb net card
>> into my bridge, and I want the igb net driver to allocate skbs using
>> my specified memory address, but I don't want to modify the igb net driver
>> directly. How can I do this in my bridge driver?
> This is exactly our point: we do not want to modify all drivers so that
> your bridge is happy with them.
>
> You'll have to make your bridge use the standard infra instead.
>
> IGB has no way to know in advance that a particular frame should
> eventually reach your bridge anyway.

Also, why is it that igb should use your buffers? Is there any reason why
your device cannot use the receive buffers that are handed off to the stack
from igb? It isn't as if there is a copy in the routing or bridging
path. The receive buffer is normally handed off to the stack from the
ingress device, a few headers might get tweaked, and then that buffer is
transmitted by the egress interface. Only in the case of a buffer being
handed to multiple egress devices or user space should it ever be copied.

Thanks,

Alex
David Miller
2014-10-15 15:51:54 UTC
Permalink
From: "***@freescale.com" <***@freescale.com>
Date: Wed, 15 Oct 2014 05:34:24 +0000

> Yes, but to do that we would have to modify the Ethernet
> drivers. For example, if driver A wants drivers B, C, D, ... to support driver
> A's hardware block access functions, we have to modify drivers B, C, D, ...
> That gets very complex.

Experience says otherwise. It's three or four lines of code to attach
a page to an SKB frag.
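
Roughly (a sketch only, assuming the driver already holds its own
reference on 'page' and knows offset/len/truesize):

	get_page(page);		/* the frag drops this ref on kfree_skb() */
	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
			offset, len, truesize);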

The driver needs its own buffer management and setup code anyway,
and no generic facility will replace that.

I think your precondition for these changes therefore doesn't really
exist.

Please, look over all of the drivers that exist already in the tree
and build SKBs using page frags from hardware device buffer pools.

You have to show us that all of those drivers can make use of your
facility.
Oliver Hartkopp
2014-10-15 04:59:24 UTC
Permalink
On 15.10.2014 05:26, Pan Jiafei wrote:
> In some platforms, there are some hardware blocks provided
> to manage buffers to improve performance.

(..)

> diff --git a/net/Kconfig b/net/Kconfig
> index d6b138e..346e021 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -291,6 +291,16 @@ config NET_FLOW_LIMIT
> with many clients some protection against DoS by a single (spoofed)
> flow that greatly exceeds average workload.
>
> +config USE_HW_SKB
> + bool "NIC use hardware managed buffer to build skb"
> + depends on INET

The feature seems to be valid for network devices in general.
Why did you make it depend on INET?

Regards,
Oliver

> + ---help---
> + If you select this, third-party drivers will use hardware-managed
> + buffers to allocate SKBs without any modification to the drivers.
> +
> + Documentation on how to use this feature can be found at
> + <file:Documentation/networking/hw_skb.txt>.
> +
> menu "Network testing"
>
> config NET_PKTGEN

J***@freescale.com
2014-10-15 05:47:46 UTC
Permalink
> -----Original Message-----
> From: Oliver Hartkopp [mailto:***@hartkopp.net]
> Sent: Wednesday, October 15, 2014 12:59 PM
> To: Pan Jiafei-B37022; ***@davemloft.net; ***@suse.cz
> Cc: ***@vger.kernel.org; Li Yang-Leo-R58472; linux-***@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
>
> On 15.10.2014 05:26, Pan Jiafei wrote:
> > In some platforms, there are some hardware blocks provided
> > to manage buffers to improve performance.
>
> (..)
[Pan Jiafei] I want to build a general patch to build skbs
by using a hardware buffer manager block.
>
> > diff --git a/net/Kconfig b/net/Kconfig
> > index d6b138e..346e021 100644
> > --- a/net/Kconfig
> > +++ b/net/Kconfig
> > @@ -291,6 +291,16 @@ config NET_FLOW_LIMIT
> > with many clients some protection against DoS by a single (spoofed)
> > flow that greatly exceeds average workload.
> >
> > +config USE_HW_SKB
> > + bool "NIC use hardware managed buffer to build skb"
> > + depends on INET
>
> The feature seems to be valid for network devices in general.
> Why did you make it depend on INET?
>
> Regards,
> Oliver
>
[Pan Jiafei] Yes, the INET dependency should be removed, thanks.
> > + ---help---
> > + If you select this, third-party drivers will use hardware-managed
> > + buffers to allocate SKBs without any modification to the drivers.
> > +
> > + Documentation on how to use this feature can be found at
> > + <file:Documentation/networking/hw_skb.txt>.
> > +
> > menu "Network testing"
> >
> > config NET_PKTGEN

David Laight
2014-10-15 08:57:52 UTC
Permalink
From: Pan Jiafei
> In some platforms, there are some hardware blocks provided
> to manage buffers to improve performance. So in some cases,
> it is expected that the packets received by some generic
> NIC should be put into such hardware-managed buffers
> directly, so that such buffers can be released by hardware
> or by the driver.

This looks like some strange variant of 'buffer loaning'.
In general it just doesn't work due to the limited number
of such buffers - they soon all become queued waiting for
applications to read from sockets.

It also isn't at all clear how you expect a 'generic NIC'
to actually allocate buffers from your 'special area'.

David



Stephen Hemminger
2014-10-15 09:33:23 UTC
Permalink
Since an skb can sit forever in an application queue, you have created
an easy way to livelock the system when enough skb's are waiting to be
read.
J***@freescale.com
2014-10-16 02:30:05 UTC
Permalink
> -----Original Message-----
> From: Stephen Hemminger [mailto:***@networkplumber.org]
> Sent: Wednesday, October 15, 2014 5:33 PM
> To: Pan Jiafei-B37022
> Cc: ***@davemloft.net; ***@suse.cz; ***@vger.kernel.org; Li Yang-Leo-
> R58472; linux-***@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
>
> Since an skb can sit forever in an application queue, you have created
> an easy way to livelock the system when enough skb's are waiting to be
> read.

I think it is not possible to livelock the system, because in my patch
the function __netdev_alloc_skb first tries to allocate a hardware block
buffer if dev->alloc_hw_skb is set, but it falls back to allocating a
normal skb buffer if the hardware block buffer allocation fails.