Discussion:
[RFC PATCH net-next 0/6] seccomp filter JIT
Xi Wang
2013-04-26 07:51:40 UTC
This patchset brings JIT support to seccomp filters for x86_64 and ARM.
It is against the net-next tree.

The current BPF JIT interface only accepts sk_filter, not seccomp_filter.
Patch 1/6 refactors the interface to make it more general.
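Patch 1/6 changes bpf_jit_compile() so that it no longer writes into an sk_filter. Instead it takes an instruction array and a length and returns a bpf_func_t, which falls back to sk_run_filter on every failure path. A minimal user-space sketch of that shape follows. filter_func_t, interp_run(), jitted_run(), jit_compile() and jit_free() are hypothetical names, and "running" a program here just sums its instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a "program" is an array of ints and running it sums them. */
typedef unsigned int (*filter_func_t)(const int *insns, size_t len);

/* Interpreter: plays the role of sk_run_filter. */
static unsigned int interp_run(const int *insns, size_t len)
{
    unsigned int acc = 0;
    for (size_t i = 0; i < len; i++)
        acc += (unsigned int)insns[i];
    return acc;
}

/* Stand-in for generated machine code; same semantics as the interpreter. */
static unsigned int jitted_run(const int *insns, size_t len)
{
    return interp_run(insns, len);
}

static bool jit_enable = true;   /* plays the role of bpf_jit_enable */

/* Same shape as the refactored bpf_jit_compile(): raw instructions in,
 * function pointer out, interpreter on every failure path. */
static filter_func_t jit_compile(const int *insns, size_t len)
{
    filter_func_t func = interp_run;
    if (!jit_enable || insns == NULL || len == 0)
        return func;
    /* ... code generation would go here; any error returns func ... */
    func = jitted_run;
    return func;
}

/* Same shape as the refactored bpf_jit_free(bpf_func_t): only a real
 * JIT image is released. Returns 1 if something would be freed. */
static int jit_free(filter_func_t func)
{
    return func != interp_run;
}
```

With this shape the caller owns the returned pointer, so the same compile and free entry points can presumably serve both sk_filter and seccomp_filter without the JIT knowing which container it is building for.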

With the refactored interface, patches 2/6 and 3/6 implement the seccomp
BPF_S_ANC_SECCOMP_LD_W instruction in x86 & ARM JIT.
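BPF_S_ANC_SECCOMP_LD_W is the 32-bit load that seccomp filters use to read the word of struct seccomp_data at byte offset K, which is why the x86 hunk compares K against offsetof(struct seccomp_data, arch). The sketch below models only those offset semantics. struct seccomp_data_model is a local copy of the uapi layout, and load_word() is a hypothetical helper, not the kernel's seccomp_bpf_load(), which derives each field from the current task's registers instead of reading a stored struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the layout of struct seccomp_data in the uapi <linux/seccomp.h>. */
struct seccomp_data_model {
    int      nr;                   /* syscall number */
    uint32_t arch;                 /* AUDIT_ARCH_* value */
    uint64_t instruction_pointer;
    uint64_t args[6];
};

/* What a 32-bit "ld [k]" means in a seccomp filter: read the word at byte
 * offset k of the per-syscall data. Only aligned, in-bounds offsets are valid. */
static uint32_t load_word(const struct seccomp_data_model *d, uint32_t k)
{
    uint32_t w;
    assert(k % 4 == 0 && k + sizeof(w) <= sizeof(*d));
    memcpy(&w, (const unsigned char *)d + k, sizeof(w));
    return w;
}
```

Loading at offset 4 therefore yields the arch field, the case the x86 patch tries to special-case.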

Status:

* x86_64 & ARM: JIT tested with seccomp examples.

* powerpc [4/6]: no seccomp change - compile checked.

* sparc [5/6] & s390 [6/6]: no seccomp change - untested.

Sorry, I have no sparc or s390 build environment here. Can someone help
check 5/6 and 6/6? Thanks.

Xi Wang (6):
filter: refactor BPF JIT for seccomp filters
x86: bpf_jit_comp: support BPF_S_ANC_SECCOMP_LD_W instruction
ARM: net: bpf_jit_32: support BPF_S_ANC_SECCOMP_LD_W instruction
PPC: net: bpf_jit_comp: refactor the BPF JIT interface
sparc: bpf_jit_comp: refactor the BPF JIT interface
s390/bpf,jit: refactor the BPF JIT interface

arch/arm/net/bpf_jit_32.c | 64 +++++++++++++++++++++++++----------------
arch/powerpc/net/bpf_jit_comp.c | 36 +++++++++++------------
arch/s390/net/bpf_jit_comp.c | 31 ++++++++++----------
arch/sparc/net/bpf_jit_comp.c | 22 +++++++-------
arch/x86/net/bpf_jit_comp.c | 38 ++++++++++++++++--------
include/linux/filter.h | 16 +++++++----
kernel/seccomp.c | 6 +++-
net/core/filter.c | 6 ++--
8 files changed, 127 insertions(+), 92 deletions(-)
--
1.8.1.2
Xi Wang
2013-04-26 07:51:42 UTC
This patch implements the seccomp BPF_S_ANC_SECCOMP_LD_W instruction
in x86 JIT.

Signed-off-by: Xi Wang <***@gmail.com>
---
arch/x86/net/bpf_jit_comp.c | 38 ++++++++++++++++++++++++++------------
1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index f66b540..03c9c81 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -8,10 +8,11 @@
* of the License.
*/
#include <linux/moduleloader.h>
-#include <asm/cacheflush.h>
#include <linux/netdevice.h>
#include <linux/filter.h>
#include <linux/if_vlan.h>
+#include <asm/cacheflush.h>
+#include <asm/syscall.h>

/*
* Conventions :
@@ -144,7 +145,7 @@ static int pkt_type_offset(void)
return -1;
}

-void bpf_jit_compile(struct sk_filter *fp)
+bpf_func_t bpf_jit_compile(struct sock_filter *filter, unsigned int flen)
{
u8 temp[64];
u8 *prog;
@@ -157,15 +158,14 @@ void bpf_jit_compile(struct sk_filter *fp)
int pc_ret0 = -1; /* bpf index of first RET #0 instruction (if any) */
unsigned int cleanup_addr; /* epilogue code offset */
unsigned int *addrs;
- const struct sock_filter *filter = fp->insns;
- int flen = fp->len;
+ bpf_func_t bpf_func = sk_run_filter;

if (!bpf_jit_enable)
- return;
+ return bpf_func;

addrs = kmalloc(flen * sizeof(*addrs), GFP_KERNEL);
if (addrs == NULL)
- return;
+ return bpf_func;

/* Before first pass, make a rough estimation of addrs[]
* each bpf instruction is translated to less than 64 bytes
@@ -684,6 +684,20 @@ cond_branch: f_offset = addrs[i + filter[i].jf] - addrs[i];
}
EMIT_COND_JMP(f_op, f_offset);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case BPF_S_ANC_SECCOMP_LD_W:
+ if (K == offsetof(struct seccomp_data, arch)) {
+ int arch = syscall_get_arch(current, NULL);
+
+ EMIT1_off32(0xb8, arch); /* mov arch,%eax */
+ break;
+ }
+ func = (u8 *)seccomp_bpf_load;
+ t_offset = func - (image + addrs[i]);
+ EMIT1_off32(0xbf, K); /* mov imm32,%edi */
+ EMIT1_off32(0xe8, t_offset); /* call seccomp_bpf_load */
+ break;
+#endif
default:
/* hmm, too complex filter, give up with jit compiler */
goto out;
@@ -694,7 +708,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf] - addrs[i];
pr_err("bpb_jit_compile fatal error\n");
kfree(addrs);
module_free(NULL, image);
- return;
+ return bpf_func;
}
memcpy(image + proglen, temp, ilen);
}
@@ -731,11 +745,11 @@ cond_branch: f_offset = addrs[i + filter[i].jf] - addrs[i];

if (image) {
bpf_flush_icache(image, image + proglen);
- fp->bpf_func = (void *)image;
+ bpf_func = (void *)image;
}
out:
kfree(addrs);
- return;
+ return bpf_func;
}

static void jit_free_defer(struct work_struct *arg)
@@ -746,10 +760,10 @@ static void jit_free_defer(struct work_struct *arg)
/* run from softirq, we must use a work_struct to call
* module_free() from process context
*/
-void bpf_jit_free(struct sk_filter *fp)
+void bpf_jit_free(bpf_func_t bpf_func)
{
- if (fp->bpf_func != sk_run_filter) {
- struct work_struct *work = (struct work_struct *)fp->bpf_func;
+ if (bpf_func != sk_run_filter) {
+ struct work_struct *work = (struct work_struct *)bpf_func;

INIT_WORK(work, jit_free_defer);
schedule_work(work);
--
1.8.1.2
Eric Dumazet
2013-04-26 14:18:46 UTC
Post by Xi Wang
+#ifdef CONFIG_SECCOMP_FILTER
+ if (K == offsetof(struct seccomp_data, arch)) {
+ int arch = syscall_get_arch(current, NULL);
+
+ EMIT1_off32(0xb8, arch); /* mov arch,%eax */
+ break;
+ }
+ func = (u8 *)seccomp_bpf_load;
+ t_offset = func - (image + addrs[i]);
+ EMIT1_off32(0xbf, K); /* mov imm32,%edi */
+ EMIT1_off32(0xe8, t_offset); /* call seccomp_bpf_load */
+ break;
+#endif
This seems seriously wrong to me.

This cannot have been tested at all.
Xi Wang
2013-04-26 14:50:06 UTC
Post by Eric Dumazet
Post by Xi Wang
+#ifdef CONFIG_SECCOMP_FILTER
+ if (K == offsetof(struct seccomp_data, arch)) {
+ int arch = syscall_get_arch(current, NULL);
+
+ EMIT1_off32(0xb8, arch); /* mov arch,%eax */
+ break;
+ }
+ func = (u8 *)seccomp_bpf_load;
+ t_offset = func - (image + addrs[i]);
+ EMIT1_off32(0xbf, K); /* mov imm32,%edi */
+ EMIT1_off32(0xe8, t_offset); /* call seccomp_bpf_load */
+ break;
+#endif
This seems seriously wrong to me.
Can you elaborate?
Post by Eric Dumazet
This cannot have been tested at all.
Thanks to QEMU for hiding bugs then. :)

- xi
Eric Dumazet
2013-04-26 15:11:34 UTC
Post by Xi Wang
Post by Eric Dumazet
Post by Xi Wang
+#ifdef CONFIG_SECCOMP_FILTER
+ if (K == offsetof(struct seccomp_data, arch)) {
+ int arch = syscall_get_arch(current, NULL);
+
+ EMIT1_off32(0xb8, arch); /* mov arch,%eax */
+ break;
+ }
+ func = (u8 *)seccomp_bpf_load;
+ t_offset = func - (image + addrs[i]);
+ EMIT1_off32(0xbf, K); /* mov imm32,%edi */
+ EMIT1_off32(0xe8, t_offset); /* call seccomp_bpf_load */
+ break;
+#endif
This seems seriously wrong to me.
Can you elaborate?
Post by Eric Dumazet
This cannot have been tested at all.
Thanks to QEMU for hiding bugs then. :)
1) 'current' at the time the code is jitted (compiled) is not the
'current' at the time the filter will be evaluated.

On x86_64, if CONFIG_IA32_EMULATION=y, syscall_get_arch() evaluates to :

if (task_thread_info(task)->status & TS_COMPAT)
return AUDIT_ARCH_I386;
return AUDIT_ARCH_X86_64;

So your code is completely wrong.
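Eric's first point is the difference between evaluating something when the filter is compiled and when it runs. A filter is JIT-compiled once, at attach time, but evaluated on every system call, and syscall_get_arch() depends on the TS_COMPAT status of the task it is given. The model below contrasts baking the value in as an immediate with looking it up at run time. struct task, compile_snapshot(), run_snapshot() and run_deferred() are hypothetical; the AUDIT_ARCH_* values are the ones defined in <linux/audit.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from <linux/audit.h>. */
#define AUDIT_ARCH_X86_64 0xC000003Eu
#define AUDIT_ARCH_I386   0x40000003u

/* Hypothetical task model: only the TS_COMPAT-like bit matters here. */
struct task { bool compat; };

/* Same decision as the quoted syscall_get_arch() body. */
static uint32_t get_arch(const struct task *t)
{
    return t->compat ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}

/* Buggy pattern: the arch of whichever task compiles the filter is baked in. */
struct compiled_load { uint32_t baked_arch; };

static struct compiled_load compile_snapshot(const struct task *compiling_task)
{
    struct compiled_load c = { get_arch(compiling_task) };
    return c;
}

static uint32_t run_snapshot(const struct compiled_load *c, const struct task *running)
{
    (void)running;          /* the running task is ignored: this is the bug */
    return c->baked_arch;
}

/* Correct pattern: look the value up for the task actually making the syscall. */
static uint32_t run_deferred(const struct task *running)
{
    return get_arch(running);
}
```

The run-time variant is what routing the arch offset through seccomp_bpf_load() should already provide, since that helper evaluates the arch for current when the filter executes.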

2) Calling a function potentially destroys some registers.
%rdi,%r8,%r9 for instance, so we are going to crash very easily.
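Eric's second point follows from the x86-64 System V calling convention: %rax, %rcx, %rdx, %rsi, %rdi and %r8 to %r11 are caller-saved, so any function the JIT calls may overwrite them, while %rbx, %rbp and %r12 to %r15 must be preserved by the callee. The sketch below computes which live registers need saving around a call. The enum and helper names are hypothetical, and the live set used in the test (A in %eax, X in %ebx, skb in %rdi, skb->data in %r8, headlen in %r9d) is an assumption based on the register conventions comment at the top of bpf_jit_comp.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register numbering, one bit per x86-64 general-purpose register. */
enum reg { RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP,
           R8, R9, R10, R11, R12, R13, R14, R15 };
#define BIT(r) (1u << (r))

/* Caller-saved registers in the System V AMD64 ABI: a callee may clobber them. */
static const uint32_t caller_saved =
    BIT(RAX) | BIT(RCX) | BIT(RDX) | BIT(RSI) | BIT(RDI) |
    BIT(R8)  | BIT(R9)  | BIT(R10) | BIT(R11);

/* Given the registers whose values must survive a call, return the subset
 * the caller has to save and restore itself. */
static uint32_t regs_to_save(uint32_t live_across_call)
{
    return live_across_call & caller_saved;
}
```

Feeding it BIT(RBX) | BIT(RDI) | BIT(R8) | BIT(R9) yields exactly %rdi, %r8 and %r9, the registers Eric lists; %eax is left out because the call's return value becomes the new A anyway. Xi's argument later in the thread amounts to saying that, for seccomp filters, none of %rdi, %r8 or %r9 is live after the call, which would leave only %rbx and make the mask empty.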

I don't know; I feel a bit uncomfortable having to explain this to
someone sending security-related patches...
Xi Wang
2013-04-26 15:29:27 UTC
Post by Eric Dumazet
2) Calling a function potentially destroys some registers.
%rdi,%r8,%r9 for instance, so we are going to crash very easily.
I don't know; I feel a bit uncomfortable having to explain this to
someone sending security-related patches...
My old code did save these registers. But, do we really need that for
seccomp? For example, %rdi (skb) is always NULL and never used by
seccomp filters. Did I miss anything?

- xi
Eric Dumazet
2013-04-26 15:43:41 UTC
Post by Xi Wang
Post by Eric Dumazet
2) Calling a function potentially destroys some registers.
%rdi,%r8,%r9 for instance, so we are going to crash very easily.
I don't know; I feel a bit uncomfortable having to explain this to
someone sending security-related patches...
My old code did save these registers. But, do we really need that for
seccomp? For example, %rdi (skb) is always NULL and never used by
seccomp filters. Did I miss anything?
I do not know.

This is not explained in your changelog or in any comment.

You have to make the full analysis yourself and make us comfortable with
the results.

You send patches and ask us to spend hours on it, this is not how it
works.
Xi Wang
2013-04-26 15:57:45 UTC
Post by Eric Dumazet
I do not know.
This is not explained in your changelog or in any comment.
You have to make the full analysis yourself and make us comfortable with
the results.
You send patches and ask us to spend hours on it, this is not how it
works.
"do we really need that for seccomp?" was not a request for you to do the
analysis. I was just trying to explain it gently.

%rdi,%r8,%r9 are not used by seccomp filters so I removed the push/pop
part. I agree that I should explain the details in the code comments
or logs. Thanks for catching that.

- xi
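Xi's claim that %rdi, %r8 and %r9 need not be saved rests on seccomp programs never executing the packet loads that read them. One way to make that argument checkable, sketched here under stated assumptions, is to scan the program for such instructions before dropping the push/pop. The CBPF_* constants mirror the classic BPF encoding in the uapi <linux/filter.h>; struct insn, touches_skb_state() and safe_to_skip_save() are hypothetical, and this is not the kernel's own seccomp_check_filter() logic. It also assumes that, in a seccomp program, a 32-bit absolute load reads struct seccomp_data and so does not count as a packet access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Classic BPF encoding, mirroring the values in the uapi <linux/filter.h>. */
#define CBPF_CLASS(c) ((c) & 0x07)
#define CBPF_SIZE(c)  ((c) & 0x18)
#define CBPF_MODE(c)  ((c) & 0xe0)
#define CBPF_LD   0x00
#define CBPF_LDX  0x01
#define CBPF_RET  0x06
#define CBPF_W    0x00
#define CBPF_H    0x08
#define CBPF_B    0x10
#define CBPF_ABS  0x20
#define CBPF_IND  0x40
#define CBPF_MSH  0xa0

struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* True for the packet loads that, in the x86 JIT, rely on skb state
 * kept in %rdi, %r8 or %r9. */
static bool touches_skb_state(uint16_t code)
{
    if (CBPF_CLASS(code) == CBPF_LD) {
        if (CBPF_MODE(code) == CBPF_IND)
            return true;                      /* ld/ldh/ldb [x + k] */
        if (CBPF_MODE(code) == CBPF_ABS && CBPF_SIZE(code) != CBPF_W)
            return true;                      /* ldh/ldb [k] */
    }
    if (CBPF_CLASS(code) == CBPF_LDX && CBPF_MODE(code) == CBPF_MSH)
        return true;                          /* ldx 4*([k]&0xf) */
    return false;
}

/* Skipping the save/restore is only justified if no instruction needs them. */
static bool safe_to_skip_save(const struct insn *prog, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (touches_skb_state(prog[i].code))
            return false;
    return true;
}
```

A program such as `ld [4]; ret #0x7fff0000` (load arch, then allow) passes, while any ldh/ldb, indexed load, or `ldx 4*([k]&0xf)` makes the check fail.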
David Miller
2013-04-26 18:48:07 UTC
From: Eric Dumazet <***@gmail.com>
Date: Fri, 26 Apr 2013 08:43:41 -0700
Post by Eric Dumazet
You send patches and ask us to spend hours on it, this is not how it
works.
+1
Xi Wang
2013-04-26 16:02:04 UTC
Post by Eric Dumazet
1) 'current' at the time the code is jitted (compiled) is not the
'current' at the time the filter will be evaluated.
if (task_thread_info(task)->status & TS_COMPAT)
return AUDIT_ARCH_I386;
return AUDIT_ARCH_X86_64;
So your code is completely wrong.
Just to be clear, are you worrying about a process changing its
personality after installing seccomp filters?

- xi
Eric Dumazet
2013-04-26 16:14:00 UTC
Post by Xi Wang
Post by Eric Dumazet
1) 'current' at the time the code is jitted (compiled) is not the
'current' at the time the filter will be evaluated.
if (task_thread_info(task)->status & TS_COMPAT)
return AUDIT_ARCH_I386;
return AUDIT_ARCH_X86_64;
So your code is completely wrong.
Just to be clear, are you worrying about a process changing its
personality after installing seccomp filters?
You didn't explain how things worked.

Are you assuming we network guys know everything?

Just to make it very clear:

We are very dumb and you must explain everything to us.

If a process would not change personality, why would we have get_arch()
at all? Why isn't it optimized outside of the JIT itself, in the generic
seccomp checker? It's a single "A = K" instruction after all.

Why is this part even in the x86 BPF JIT?

To me it looks like _if_ get_arch() is provided in BPF, it's for a
reason, and your implementation looks very suspicious, if not buggy.
Xi Wang
2013-04-26 18:25:50 UTC
Not sure how many you are speaking for when you say "We are very dumb". :)

Thanks for catching this. I'll remove this arch thing in v2.

To address your other concern about registers, I'll add some comments
to the code, something like:

"%rdi,%r8,%r9 are not used by seccomp filters; it's safe to not save them."

- xi
Post by Eric Dumazet
Post by Xi Wang
Post by Eric Dumazet
1) 'current' at the time the code is jitted (compiled) is not the
'current' at the time the filter will be evaluated.
if (task_thread_info(task)->status & TS_COMPAT)
return AUDIT_ARCH_I386;
return AUDIT_ARCH_X86_64;
So your code is completely wrong.
Just to be clear, are you worrying about a process changing its
personality after installing seccomp filters?
You didn't explain how things worked.
Are you assuming we network guys know everything?
We are very dumb and you must explain everything to us.
If a process would not change personality, why would we have get_arch()
at all? Why isn't it optimized outside of the JIT itself, in the generic
seccomp checker? It's a single "A = K" instruction after all.
Why is this part even in the x86 BPF JIT?
To me it looks like _if_ get_arch() is provided in BPF, it's for a
reason, and your implementation looks very suspicious, if not buggy.
Eric Dumazet
2013-04-26 18:40:06 UTC
Post by Xi Wang
Not sure how many you are speaking for when you say "We are very dumb". :)
Thanks for catching this. I'll remove this arch thing in v2.
To address your other concern about registers, I'll add some comments
"%rdi,%r8,%r9 are not used by seccomp filters; it's safe to not save them."
OK good

BTW, most of us prefer bottom-posting on lkml/netdev

http://en.wikipedia.org/wiki/Posting_style#Bottom-posting

Thanks
David Laight
2013-04-26 15:15:02 UTC
Post by Xi Wang
Post by Eric Dumazet
Post by Xi Wang
+#ifdef CONFIG_SECCOMP_FILTER
+ if (K == offsetof(struct seccomp_data, arch)) {
+ int arch = syscall_get_arch(current, NULL);
+
+ EMIT1_off32(0xb8, arch); /* mov arch,%eax */
+ break;
+ }
+ func = (u8 *)seccomp_bpf_load;
+ t_offset = func - (image + addrs[i]);
+ EMIT1_off32(0xbf, K); /* mov imm32,%edi */
+ EMIT1_off32(0xe8, t_offset); /* call seccomp_bpf_load */
+ break;
+#endif
This seems seriously wrong to me.
Can you elaborate?
The 'call seccomp_bpf_load' needs a pc-relative offset,
I assume that is what EMIT1_off32() generates.

The other two instructions want an absolute 32 bit value...

David
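David's distinction can be made concrete with the two encodings involved. `mov $imm32, %edi` (opcode 0xbf) carries an absolute 32-bit immediate, whereas `call rel32` (opcode 0xe8) carries a signed displacement measured from the address of the instruction after the call, i.e. target - (call_address + 5). The helpers below are hypothetical; they write into a plain byte buffer and take image_base as the address the buffer will eventually run at.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86 immediates and displacements are stored little-endian. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = v & 0xff;
    p[1] = (v >> 8) & 0xff;
    p[2] = (v >> 16) & 0xff;
    p[3] = (v >> 24) & 0xff;
}

/* mov $imm32, %edi: opcode 0xbf followed by the absolute value. */
static size_t emit_mov_edi_imm32(uint8_t *buf, size_t pos, uint32_t imm)
{
    buf[pos] = 0xbf;
    put_le32(buf + pos + 1, imm);
    return pos + 5;
}

/* call rel32: opcode 0xe8; the displacement is relative to the end of
 * this 5-byte instruction, not to its start. */
static size_t emit_call_rel32(uint8_t *buf, size_t pos,
                              uint64_t image_base, uint64_t target)
{
    int64_t rel = (int64_t)target - (int64_t)(image_base + pos + 5);
    assert(rel >= INT32_MIN && rel <= INT32_MAX);   /* must fit in 32 bits */
    buf[pos] = 0xe8;
    put_le32(buf + pos + 1, (uint32_t)(int32_t)rel);
    return pos + 5;
}
```

In the patch, t_offset = func - (image + addrs[i]) matches this formula only if addrs[i] is the offset just past the call, which holds when the call is the last instruction emitted for BPF instruction i; that appears to be the same convention the existing helper calls in the x86 JIT rely on.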
Eric Dumazet
2013-04-26 15:27:46 UTC
Post by David Laight