Discussion:
[PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
Will Drewry
2011-05-12 03:02:56 UTC
Permalink
This change adds a new seccomp mode based on the work by
agl at chromium.org in [1]. This new mode, "filter mode", provides a hash
table of seccomp_filter objects. When in the new mode (2), all system
calls are checked against the filters - first by system call number,
then by a filter string. If an entry exists for a given system call and
all filter predicates evaluate to true, then the task may proceed.
Otherwise, the task is killed (as per seccomp_mode == 1).

Filter string parsing and evaluation is handled by the ftrace filter
engine. Related patches tweak to the perf filter trace and free allow
the call to be shared. Filters inherit their understanding of types and
arguments for each system call from the CONFIG_FTRACE_SYSCALLS subsystem
which already predefines this information in syscall_metadata associated
enter_event (and exit_event) structures. If CONFIG_FTRACE and
CONFIG_FTRACE_SYSCALLS are not compiled in, only "1" filter strings will
be allowed.

The net result is a process may have its system calls filtered using the
ftrace filter engine's inherent understanding of systems calls. A
logical ruleset for a process that only needs stdin/stdout may be:
sys_read: fd == 0
sys_write: fd == 1 || fd == 2
sys_exit: 1

The set of filters is specified through the PR_SET_SECCOMP path in prctl().
For example:
prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 0");
prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
prctl(PR_SET_SECCOMP_FILTER, __NR_exit, "1");
prctl(PR_SET_SECCOMP, 2, 0);

v2: - changed to use the existing syscall number ABI.
- prctl changes to minimize parsing in the kernel:
prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
- defined PR_SECCOMP_MODE_STRICT and ..._FILTER
- added flags
- provide a default fail syscall_nr_to_meta in ftrace
- provides fallback for unhooked system calls
- use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
- added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
- moved to a hlist and 4 bit hash of linked lists
- added support to operate without CONFIG_FTRACE_SYSCALLS
- moved Kconfig support next to SECCOMP
(should this be done in per-platform patches?)
- made Kconfig entries dependent on EXPERIMENTAL
- added macros to avoid ifdefs from kernel/fork.c
- added compat task/filter matching
- drop seccomp.h inclusion in sched.h and drop seccomp_t
- added Filtering to "show" output
- added on_exec state dup'ing when enabling after a fast-path accept.

Signed-off-by: Will Drewry <wad at chromium.org>
---
arch/arm/Kconfig | 10 +
arch/microblaze/Kconfig | 10 +
arch/mips/Kconfig | 10 +
arch/powerpc/Kconfig | 10 +
arch/s390/Kconfig | 10 +
arch/sh/Kconfig | 10 +
arch/sparc/Kconfig | 10 +
arch/x86/Kconfig | 10 +
include/linux/prctl.h | 9 +
include/linux/sched.h | 5 +-
include/linux/seccomp.h | 116 +++++++++-
include/trace/syscall.h | 7 +
kernel/Makefile | 3 +
kernel/fork.c | 3 +
kernel/seccomp.c | 228 +++++++++++++++++-
kernel/seccomp.h | 74 ++++++
kernel/seccomp_filter.c | 581 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 15 +-
18 files changed, 1100 insertions(+), 21 deletions(-)
create mode 100644 kernel/seccomp.h
create mode 100644 kernel/seccomp_filter.c

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..22e1668 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1664,6 +1664,16 @@ config SECCOMP
and the task is only allowed to execute a few safe syscalls
defined by each seccomp mode.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
depends on EXPERIMENTAL
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..7641ee9 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -129,6 +129,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu

menu "Advanced setup"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..fe4cbda 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2245,6 +2245,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config USE_OF
bool "Flattened Device Tree support"
select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..83499e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -605,6 +605,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu

config ISA_DMA_API
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..2777515 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -614,6 +614,16 @@ config SECCOMP

If unsure, say Y.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu

menu "Power Management"
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..00c1521 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -676,6 +676,16 @@ config SECCOMP

If unsure, say N.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SMP
bool "Symmetric multi-processing support"
depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5b42255 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -270,6 +270,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SPARC64 && SMP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..d6d44d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1485,6 +1485,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
---help---
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..379b391 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -63,6 +63,15 @@
/* Get/set process seccomp mode */
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22
+# define PR_SECCOMP_MODE_NONE 0
+# define PR_SECCOMP_MODE_STRICT 1
+# define PR_SECCOMP_MODE_FILTER 2
+# define PR_SECCOMP_FLAG_FILTER_ON_EXEC (1 << 1)
+
+/* Get/set process seccomp filters */
+#define PR_GET_SECCOMP_FILTER 35
+#define PR_SET_SECCOMP_FILTER 36
+#define PR_CLEAR_SECCOMP_FILTER 37

/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..27eacf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -77,7 +77,6 @@ struct sched_param {
#include <linux/percpu.h>
#include <linux/topology.h>
#include <linux/proportions.h>
-#include <linux/seccomp.h>
#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/rtmutex.h>
@@ -1190,6 +1189,8 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

+struct seccomp_state;
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1374,7 +1375,7 @@ struct task_struct {
uid_t loginuid;
unsigned int sessionid;
#endif
- seccomp_t seccomp;
+ struct seccomp_state *seccomp;

/* Thread group tracking */
u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..289c836 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -2,12 +2,34 @@
#define _LINUX_SECCOMP_H


+/* Forward declare for proc interface */
+struct seq_file;
+
#ifdef CONFIG_SECCOMP

+#include <linux/errno.h>
+#include <linux/list.h>
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+/**
+ * struct seccomp_state - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 1, the process is under standard seccomp rules
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @usage: number of references to the current instance.
+ * @flags: a bitmask of behavior altering flags.
+ * @filters: Hash table of filters if using CONFIG_SECCOMP_FILTER.
+ */
+struct seccomp_state {
+ uint16_t mode;
+ atomic_t usage;
+ long flags;
+ struct seccomp_filter_table *filters;
+};

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -16,27 +38,113 @@ static inline void secure_computing(int this_syscall)
__secure_computing(this_syscall);
}

+extern struct seccomp_state *seccomp_state_new(void);
+extern struct seccomp_state *seccomp_state_dup(const struct seccomp_state *);
+extern struct seccomp_state *get_seccomp_state(struct seccomp_state *);
+extern void put_seccomp_state(struct seccomp_state *);
+
+extern long prctl_set_seccomp(unsigned long, unsigned long);
extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+
+extern long prctl_set_seccomp_filter(unsigned long, char __user *);
+extern long prctl_get_seccomp_filter(unsigned long, char __user *,
+ unsigned long);
+extern long prctl_clear_seccomp_filter(unsigned long);
+
+#define inherit_tsk_seccomp_state(_child, _orig) \
+ _child->seccomp = get_seccomp_state(_orig->seccomp);
+#define put_tsk_seccomp_state(_tsk) put_seccomp_state(_tsk->seccomp)

#else /* CONFIG_SECCOMP */

#include <linux/errno.h>

-typedef struct { } seccomp_t;
+struct seccomp_state { };

#define secure_computing(x) do { } while (0)
+#define inherit_tsk_seccomp_state(_child, _orig) do { } while (0)
+#define put_tsk_seccomp_state(_tsk) do { } while (0)

static inline long prctl_get_seccomp(void)
{
return -EINVAL;
}

-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long a2, unsigned long a3)
{
return -EINVAL;
}

+static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_clear_seccomp_filter(unsigned long a2)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
+ unsigned long a4)
+{
+ return -ENOSYS;
+}
+
+static inline struct seccomp_state *seccomp_state_new(void)
+{
+ return NULL;
+}
+
+static inline struct seccomp_state *seccomp_state_dup(
+ const struct seccomp_state *state)
+{
+ return NULL;
+}
+
+static inline struct seccomp_state *get_seccomp_state(
+ struct seccomp_state *state)
+{
+ return NULL;
+}
+
+static inline void put_seccomp_state(struct seccomp_state *state)
+{
+}
+
#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+
+extern int seccomp_show_filters(struct seccomp_state *, struct seq_file *);
+extern long seccomp_set_filter(int, char *);
+extern long seccomp_clear_filter(int);
+extern long seccomp_get_filter(int, char *, unsigned long);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+static inline int seccomp_show_filters(struct seccomp_state *state,
+ struct seq_file *m)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_set_filter(int syscall_nr, char *filter)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_clear_filter(int syscall_nr)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_get_filter(int syscall_nr,
+ char *buf, unsigned long available)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
+
#endif /* _LINUX_SECCOMP_H */
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 242ae04..e061ad0 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -35,6 +35,8 @@ struct syscall_metadata {
extern unsigned long arch_syscall_addr(int nr);
extern int init_syscall_trace(struct ftrace_event_call *call);

+extern struct syscall_metadata *syscall_nr_to_meta(int);
+
extern int reg_event_syscall_enter(struct ftrace_event_call *call);
extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
extern int reg_event_syscall_exit(struct ftrace_event_call *call);
@@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event);
enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
struct trace_event *event);
+#else
+static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
+{
+ return NULL;
+}
#endif

#ifdef CONFIG_PERF_EVENTS
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..84e7dfb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+ifeq ($(CONFIG_SECCOMP_FILTER),y)
+obj-$(CONFIG_SECCOMP) += seccomp_filter.o
+endif
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..46987d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ put_tsk_seccomp_state(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
if (err)
goto out;

+ inherit_tsk_seccomp_state(tsk, orig);
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..502ba04 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -6,12 +6,15 @@
* This defines a simple but solid secure-computing mode.
*/

+#include <linux/err.h>
+#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
+#include <linux/slab.h>
#include <linux/compat.h>
+#include <linux/unistd.h>

-/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
+#include "seccomp.h"

/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,11 +35,13 @@ static int mode1_syscalls_32[] = {

void __secure_computing(int this_syscall)
{
- int mode = current->seccomp.mode;
+ int mode = -1;
+ long ret = 0;
int * syscall;
-
+ if (current->seccomp)
+ mode = current->seccomp->mode;
switch (mode) {
- case 1:
+ case PR_SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())
@@ -47,6 +52,20 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+ case PR_SECCOMP_MODE_FILTER:
+ if (this_syscall >= NR_syscalls || this_syscall < 0)
+ break;
+ ret = seccomp_test_filters(current->seccomp, this_syscall);
+ if (!ret)
+ return;
+ /* Only check for an override if an access failure occurred. */
+ if (ret != -EACCES)
+ break;
+ ret = seccomp_maybe_apply_filters(current, this_syscall);
+ if (!ret)
+ return;
+ seccomp_filter_log_failure(this_syscall);
+ break;
default:
BUG();
}
@@ -57,30 +76,213 @@ void __secure_computing(int this_syscall)
do_exit(SIGKILL);
}

+/* seccomp_state_new - allocate a new state object. */
+struct seccomp_state *seccomp_state_new()
+{
+ struct seccomp_state *new = kzalloc(sizeof(struct seccomp_state),
+ GFP_KERNEL);
+ if (!new)
+ return NULL;
+
+ new->flags = 0;
+#ifdef CONFIG_COMPAT
+ /* Annotate if this filterset is being created by a compat task. */
+ if (is_compat_task())
+ new->flags |= SECCOMP_FLAG_COMPAT;
+#endif
+
+ atomic_set(&new->usage, 1);
+ new->filters = seccomp_filter_table_new();
+ /* Not supported errors are fine, others are a problem. */
+ if (IS_ERR(new->filters) && PTR_ERR(new->filters) != -ENOSYS) {
+ kfree(new);
+ new = NULL;
+ }
+ return new;
+}
+
+/* seccomp_state_dup - copies an existing state object. */
+struct seccomp_state *seccomp_state_dup(const struct seccomp_state *orig)
+{
+ int err;
+ struct seccomp_state *new_state = seccomp_state_new();
+
+ err = -ENOMEM;
+ if (!new_state)
+ goto fail;
+ new_state->mode = orig->mode;
+ /* Flag copying will hide if the new process is a compat task. However,
+ * if the rule was compat/non-compat and the process is the opposite,
+ * enforcement will terminate it.
+ */
+ new_state->flags = orig->flags;
+ err = seccomp_copy_all_filters(new_state->filters,
+ orig->filters);
+ if (err)
+ goto fail;
+
+ return new_state;
+fail:
+ put_seccomp_state(new_state);
+ return NULL;
+}
+
+/* get_seccomp_state - increments the reference count of @orig */
+struct seccomp_state *get_seccomp_state(struct seccomp_state *orig)
+{
+ if (!orig)
+ return NULL;
+ atomic_inc(&orig->usage);
+ return orig;
+}
+
+static void __put_seccomp_state(struct seccomp_state *orig)
+{
+ WARN_ON(atomic_read(&orig->usage));
+ seccomp_drop_all_filters(orig);
+ kfree(orig);
+}
+
+/* put_seccomp_state - decrements the reference count of @orig and may free. */
+void put_seccomp_state(struct seccomp_state *orig)
+{
+ if (!orig)
+ return;
+
+ if (atomic_dec_and_test(&orig->usage))
+ __put_seccomp_state(orig);
+}
+
long prctl_get_seccomp(void)
{
- return current->seccomp.mode;
+ if (!current->seccomp)
+ return 0;
+ return current->seccomp->mode;
}

-long prctl_set_seccomp(unsigned long seccomp_mode)
+long prctl_set_seccomp(unsigned long seccomp_mode, unsigned long flags)
{
long ret;
+ struct seccomp_state *state, *cur_state;

+ cur_state = get_seccomp_state(current->seccomp);
/* can set it only once to be even more secure */
ret = -EPERM;
- if (unlikely(current->seccomp.mode))
+ if (cur_state && unlikely(cur_state->mode))
goto out;

ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ if (seccomp_mode <= 0 || seccomp_mode > NR_SECCOMP_MODES)
+ goto out;
+
+ ret = -ENOMEM;
+ state = (cur_state ? seccomp_state_dup(cur_state) :
+ seccomp_state_new());
+ if (!state)
+ goto out;
+
+ if (seccomp_mode == PR_SECCOMP_MODE_STRICT) {
#ifdef TIF_NOTSC
disable_TSC();
#endif
- ret = 0;
}

- out:
+ rcu_assign_pointer(current->seccomp, state);
+ synchronize_rcu();
+ put_seccomp_state(cur_state); /* For the task */
+
+ /* Convert the ABI flag to the internal flag value. */
+ if (seccomp_mode == PR_SECCOMP_MODE_FILTER &&
+ (flags & PR_SECCOMP_FLAG_FILTER_ON_EXEC))
+ state->flags |= SECCOMP_FLAG_ON_EXEC;
+ /* Encourage flag values to stay synchronized explicitly. */
+ BUILD_BUG_ON(PR_SECCOMP_FLAG_FILTER_ON_EXEC != SECCOMP_FLAG_ON_EXEC);
+
+ /* Only set the thread flag once after the new state is in place. */
+ state->mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
+ ret = 0;
+
+out:
+ put_seccomp_state(cur_state); /* for the get */
+ return ret;
+}
+
+long prctl_set_seccomp_filter(unsigned long syscall_nr,
+ char __user *user_filter)
+{
+ int nr;
+ long ret;
+ char filter[SECCOMP_MAX_FILTER_LENGTH];
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+
+ ret = -EFAULT;
+ if (!user_filter ||
+ strncpy_from_user(filter, user_filter,
+ sizeof(filter) - 1) < 0)
+ goto out;
+
+ nr = (int) syscall_nr;
+ ret = seccomp_set_filter(nr, filter);
+
+out:
+ return ret;
+}
+
+long prctl_clear_seccomp_filter(unsigned long syscall_nr)
+{
+ int nr = -1;
+ long ret;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+
+ nr = (int) syscall_nr;
+ ret = seccomp_clear_filter(nr);
+
+out:
+ return ret;
+}
+
+long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
+ unsigned long available)
+{
+ int ret, nr;
+ unsigned long copied;
+ char *buf = NULL;
+ ret = -EINVAL;
+ if (!available)
+ goto out;
+ /* Ignore extra buffer space. */
+ if (available > SECCOMP_MAX_FILTER_LENGTH)
+ available = SECCOMP_MAX_FILTER_LENGTH;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+ nr = (int) syscall_nr;
+
+ ret = -ENOMEM;
+ buf = kmalloc(available, GFP_KERNEL);
+ if (!buf)
+ goto out;
+
+ ret = seccomp_get_filter(nr, buf, available);
+ if (ret < 0)
+ goto out;
+
+ /* Include the NUL byte in the copy. */
+ copied = copy_to_user(dst, buf, ret + 1);
+ ret = -ENOSPC;
+ if (copied)
+ goto out;
+
+ ret = 0;
+out:
+ kfree(buf);
return ret;
}
diff --git a/kernel/seccomp.h b/kernel/seccomp.h
new file mode 100644
index 0000000..5abd219
--- /dev/null
+++ b/kernel/seccomp.h
@@ -0,0 +1,74 @@
+/*
+ * seccomp/seccomp_filter shared internal prototypes and state.
+ *
+ * Copyright (C) 2011 Chromium OS Authors.
+ */
+
+#ifndef __KERNEL_SECCOMP_H
+#define __KERNEL_SECCOMP_H
+
+#include <linux/ftrace_event.h>
+#include <linux/seccomp.h>
+
+/* #define SECCOMP_DEBUG 1 */
+#define NR_SECCOMP_MODES 2
+
+/* Inherit the max filter length from the filtering engine. */
+#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
+
+/* Presently, flags only affect SECCOMP_FILTER. */
+#define _SECCOMP_FLAG_COMPAT 0
+#define _SECCOMP_FLAG_ON_EXEC 1
+
+#define SECCOMP_FLAG_COMPAT (1 << (_SECCOMP_FLAG_COMPAT))
+#define SECCOMP_FLAG_ON_EXEC (1 << (_SECCOMP_FLAG_ON_EXEC))
+
+
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define SECCOMP_FILTER_HASH_BITS 4
+#define SECCOMP_FILTER_HASH_SIZE (1 << SECCOMP_FILTER_HASH_BITS)
+
+struct seccomp_filter_table;
+extern struct seccomp_filter_table *seccomp_filter_table_new(void);
+extern int seccomp_copy_all_filters(struct seccomp_filter_table *,
+ const struct seccomp_filter_table *);
+extern void seccomp_drop_all_filters(struct seccomp_state *);
+
+extern int seccomp_test_filters(struct seccomp_state *, int);
+extern int seccomp_maybe_apply_filters(struct task_struct *, int);
+extern void seccomp_filter_log_failure(int);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+static inline void seccomp_filter_log_failure(int syscall)
+{
+}
+
+static inline int seccomp_maybe_apply_filters(struct task_struct *tsk,
+ int syscall_nr)
+{
+ return -ENOSYS;
+}
+
+static inline struct seccomp_filter_table *seccomp_filter_table_new(void)
+{
+ return ERR_PTR(-ENOSYS);
+}
+
+static inline int seccomp_test_filters(struct seccomp_state *state, int nr)
+{
+ return -ENOSYS;
+}
+
+extern inline int seccomp_copy_all_filters(struct seccomp_filter_table *dst,
+ const struct seccomp_filter_table *src)
+{
+ return 0;
+}
+
+static inline void seccomp_drop_all_filters(struct seccomp_state *state) { }
+
+#endif /* CONFIG_SECCOMP_FILTER */
+
+#endif /* __KERNEL_SECCOMP_H */
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..ff4e055
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,581 @@
+/* filter engine-based seccomp system call filtering.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev at chromium.org>
+ */
+
+#include <linux/compat.h>
+#include <linux/errno.h>
+#include <linux/hash.h>
+#include <linux/prctl.h>
+#include <linux/seccomp.h>
+#include <linux/seq_file.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/syscall.h>
+#include <trace/syscall.h>
+
+#include "seccomp.h"
+
+#define SECCOMP_MAX_FILTER_COUNT 512
+#define SECCOMP_WILDCARD_FILTER "1"
+
+struct seccomp_filter {
+ struct hlist_node node;
+ int syscall_nr;
+ struct syscall_metadata *data;
+ struct event_filter *event_filter;
+};
+
+struct seccomp_filter_table {
+ struct hlist_head heads[SECCOMP_FILTER_HASH_SIZE];
+ int count;
+};
+
+struct seccomp_filter_table *seccomp_filter_table_new(void)
+{
+ struct seccomp_filter_table *t =
+ kzalloc(sizeof(struct seccomp_filter_table), GFP_KERNEL);
+ if (!t)
+ return ERR_PTR(-ENOMEM);
+ return t;
+}
+
+static inline u32 syscall_hash(int syscall_nr)
+{
+ return hash_32(syscall_nr, SECCOMP_FILTER_HASH_BITS);
+}
+
+static const char *get_filter_string(struct seccomp_filter *f)
+{
+ const char *str = SECCOMP_WILDCARD_FILTER;
+ if (!f)
+ return NULL;
+
+ /* Missing event filters qualify as wildcard matches. */
+ if (!f->event_filter)
+ return str;
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+ str = ftrace_get_filter_string(f->event_filter);
+#endif
+ return str;
+}
+
+static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
+ const char *filter_string)
+{
+ int err = -ENOMEM;
+ struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
+ GFP_KERNEL);
+ if (!filter)
+ goto fail;
+
+ INIT_HLIST_NODE(&filter->node);
+ filter->syscall_nr = syscall_nr;
+ filter->data = syscall_nr_to_meta(syscall_nr);
+
+ /* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
+ * using a predicate at all.
+ */
+ if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
+ goto out;
+
+ /* Argument-based filtering only works on ftrace-hooked syscalls. */
+ if (!filter->data) {
+ err = -ENOSYS;
+ goto fail;
+ }
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+ err = ftrace_parse_filter(&filter->event_filter,
+ filter->data->enter_event->event.type,
+ filter_string);
+ if (err)
+ goto fail;
+#endif
+
+out:
+ return filter;
+
+fail:
+ kfree(filter);
+ return ERR_PTR(err);
+}
+
+static void free_seccomp_filter(struct seccomp_filter *filter)
+{
+#ifdef CONFIG_FTRACE_SYSCALLS
+ ftrace_free_filter(filter->event_filter);
+#endif
+ kfree(filter);
+}
+
+static struct seccomp_filter *copy_seccomp_filter(struct seccomp_filter *orig)
+{
+ return alloc_seccomp_filter(orig->syscall_nr, get_filter_string(orig));
+}
+
+/* Returns the matching filter or NULL */
+static struct seccomp_filter *find_filter(struct seccomp_state *state,
+ int syscall)
+{
+ struct hlist_node *this, *pos;
+ struct seccomp_filter *filter = NULL;
+
+ u32 head = syscall_hash(syscall);
+ if (head >= SECCOMP_FILTER_HASH_SIZE)
+ goto out;
+
+ hlist_for_each_safe(this, pos, &state->filters->heads[head]) {
+ filter = hlist_entry(this, struct seccomp_filter, node);
+ if (filter->syscall_nr == syscall)
+ goto out;
+ }
+
+ filter = NULL;
+
+out:
+ return filter;
+}
+
+/* Safely drops all filters for a given syscall. This should only be called
+ * on unattached seccomp_state objects.
+ */
+static void drop_filter(struct seccomp_state *state, int syscall_nr)
+{
+ struct seccomp_filter *filter = find_filter(state, syscall_nr);
+ if (!filter)
+ return;
+
+ WARN_ON(state->filters->count == 0);
+ state->filters->count--;
+ hlist_del(&filter->node);
+ free_seccomp_filter(filter);
+}
+
+/* This should only be called on unattached seccomp_state objects. */
+static int add_filter(struct seccomp_state *state, int syscall_nr,
+ char *filter_string)
+{
+ struct seccomp_filter *filter;
+ struct hlist_head *head;
+ char merged[SECCOMP_MAX_FILTER_LENGTH];
+ int ret;
+ u32 hash = syscall_hash(syscall_nr);
+
+ ret = -EINVAL;
+ if (state->filters->count == SECCOMP_MAX_FILTER_COUNT - 1)
+ goto out;
+
+ filter_string = strstrip(filter_string);
+
+ /* Disallow empty strings. */
+ if (filter_string[0] == 0)
+ goto out;
+
+ /* Get the right list head. */
+ head = &state->filters->heads[hash];
+
+ /* Find out if there is an existing entry to append to and
+ * build the resultant filter string. The original filter can be
+ * destroyed here since the caller should be operating on a copy.
+ */
+ filter = find_filter(state, syscall_nr);
+ if (filter) {
+ int expected = snprintf(merged, sizeof(merged), "(%s) && %s",
+ get_filter_string(filter),
+ filter_string);
+ ret = -E2BIG;
+ if (expected >= sizeof(merged) || expected < 0)
+ goto out;
+ filter_string = merged;
+ hlist_del(&filter->node);
+ free_seccomp_filter(filter);
+ }
+
+ /* When in seccomp filtering mode, only allow additions. */
+ ret = -EACCES;
+ if (filter == NULL && state->mode == PR_SECCOMP_MODE_FILTER)
+ goto out;
+
+ ret = 0;
+ filter = alloc_seccomp_filter(syscall_nr, filter_string);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ state->filters->count++;
+ hlist_add_head(&filter->node, head);
+out:
+ return ret;
+}
+
+/* Wrap optional ftrace syscall support. Returns 1 on match or if ftrace is not
+ * supported.
+ */
+static int do_ftrace_syscall_match(struct event_filter *event_filter)
+{
+ int err = 1;
+#ifdef CONFIG_FTRACE_SYSCALLS
+ uint8_t syscall_state[64];
+
+ memset(syscall_state, 0, sizeof(syscall_state));
+
+ /* The generic tracing entry can remain zeroed. */
+ err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
+ NULL);
+ if (err)
+ return 0;
+
+ err = filter_match_preds(event_filter, syscall_state);
+#endif
+ return err;
+}
+
+/* 1 on match, 0 otherwise. */
+static int filter_match_current(struct seccomp_filter *filter)
+{
+ /* If no event filter exists, we assume a wildcard match. */
+ if (!filter->event_filter)
+ return 1;
+
+ return do_ftrace_syscall_match(filter->event_filter);
+}
+
+#ifndef KSTK_EIP
+#define KSTK_EIP(x) 0L
+#endif
+
+static const char *syscall_nr_to_name(int syscall)
+{
+ const char *syscall_name = "unknown";
+ struct syscall_metadata *data = syscall_nr_to_meta(syscall);
+ if (data)
+ syscall_name = data->name;
+ return syscall_name;
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ printk(KERN_INFO
+ "%s[%d]: system call %d (%s) blocked at ip:%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ syscall_nr_to_name(syscall), KSTK_EIP(current));
+}
+
+/**
+ * seccomp_drop_all_filters - cleans up the filter list and frees the table
+ * @state: the seccomp_state to destroy the filters in.
+ */
+void seccomp_drop_all_filters(struct seccomp_state *state)
+{
+ struct hlist_node *this, *pos;
+ int head;
+ if (!state->filters)
+ return;
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ hlist_for_each_safe(this, pos, &state->filters->heads[head]) {
+ struct seccomp_filter *f = hlist_entry(this,
+ struct seccomp_filter, node);
+ WARN_ON(state->filters->count == 0);
+ hlist_del(this);
+ free_seccomp_filter(f);
+ state->filters->count--;
+ }
+ }
+ kfree(state->filters);
+}
+
+/**
+ * seccomp_copy_all_filters - copies all filters from src to dst.
+ *
+ * @dst: seccomp_filter_table to populate.
+ * @src: table to read from.
+ * Returns non-zero on failure.
+ * Both the source and the destination should have no simultaneous
+ * writers, and dst should be exclusive to the caller.
+ */
+int seccomp_copy_all_filters(struct seccomp_filter_table *dst,
+ const struct seccomp_filter_table *src)
+{
+ struct seccomp_filter *filter;
+ int head, ret = 0;
+ BUG_ON(!dst || !src);
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ struct hlist_node *pos;
+ hlist_for_each_entry(filter, pos, &src->heads[head], node) {
+ struct seccomp_filter *new_filter =
+ copy_seccomp_filter(filter);
+ if (IS_ERR(new_filter)) {
+ ret = PTR_ERR(new_filter);
+ goto done;
+ }
+ hlist_add_head(&new_filter->node,
+ &dst->heads[head]);
+ dst->count++;
+ }
+ }
+
+done:
+ return ret;
+}
+
+/**
+ * seccomp_show_filters - prints the filter state to a seq_file
+ * @state: the seccomp_state to enumerate the filter and bitmask of
+ * @m: the prepared seq_file to receive the data
+ *
+ * Returns 0 on a successful write.
+ */
+int seccomp_show_filters(struct seccomp_state *state, struct seq_file *m)
+{
+ int head;
+ struct hlist_node *pos;
+ struct seccomp_filter *filter;
+ int filtering = 0;
+ if (!state)
+ return 0;
+ if (!state->filters)
+ return 0;
+
+ filtering = (state->mode == 2);
+ filtering &= !(state->flags & SECCOMP_FLAG_ON_EXEC);
+ seq_printf(m, "Filtering: %d\n", filtering);
+ seq_printf(m, "FilterCount: %d\n", state->filters->count);
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ hlist_for_each_entry(filter, pos, &state->filters->heads[head],
+ node) {
+ seq_printf(m, "SystemCall: %d (%s)\n",
+ filter->syscall_nr,
+ syscall_nr_to_name(filter->syscall_nr));
+ seq_printf(m, "Filter: %s\n",
+ get_filter_string(filter));
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(seccomp_show_filters);
+
+/**
+ * seccomp_maybe_apply_filters - conditionally applies seccomp filters
+ * @tsk: task to update
+ * @syscall_nr: current system call in progress
+ * tsk must already be in seccomp filter mode.
+ *
+ * Returns 0 if the call should be allowed or state has been updated.
+ * This call is only reach if no filters matched the current system call.
+ * In some cases, such as when the ON_EXEC flag is set, failure should
+ * not be terminal.
+ */
+int seccomp_maybe_apply_filters(struct task_struct *tsk, int syscall_nr)
+{
+ struct seccomp_state *state, *new_state = NULL;
+ int ret = -EACCES;
+
+ /* There's no question of application if ON_EXEC is not set. */
+ state = get_seccomp_state(tsk->seccomp);
+ if ((state->flags & SECCOMP_FLAG_ON_EXEC) == 0)
+ goto out;
+
+ ret = 0;
+ if (syscall_nr != __NR_execve)
+ goto out;
+
+ new_state = seccomp_state_dup(state);
+ ret = -ENOMEM;
+ if (!new_state)
+ goto out;
+
+ ret = 0;
+ new_state->flags &= ~(SECCOMP_FLAG_ON_EXEC);
+ rcu_assign_pointer(tsk->seccomp, new_state);
+ synchronize_rcu();
+ put_seccomp_state(state); /* for the task */
+
+out:
+ put_seccomp_state(state); /* for the get */
+ return ret;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @state: seccomp_state of current to use.
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(struct seccomp_state *state, int syscall)
+{
+ struct seccomp_filter *filter = NULL;
+ int ret;
+
+#ifdef CONFIG_COMPAT
+ ret = -EPERM;
+ if (is_compat_task() == !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+ printk(KERN_INFO "%s[%d]: seccomp filter compat() mismatch.\n",
+ current->comm, task_pid_nr(current));
+ goto out;
+ }
+#endif
+
+ ret = 0;
+ filter = find_filter(state, syscall);
+ if (filter && filter_match_current(filter))
+ goto out;
+
+ ret = -EACCES;
+out:
+ return ret;
+}
+
+/**
+ * seccomp_get_filter - copies the filter_string into "buf"
+ * @syscall_nr: system call number to look up
+ * @buf: destination buffer
+ * @bufsize: available space in the buffer.
+ *
+ * Looks up the filter for the given system call number on current. If found,
+ * the string length of the NUL-terminated buffer is returned and < 0 is
+ * returned on error. The NUL byte is not included in the length.
+ */
+long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
+{
+ struct seccomp_state *state;
+ struct seccomp_filter *filter;
+ long ret = -ENOENT;
+
+ if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
+ bufsize = SECCOMP_MAX_FILTER_LENGTH;
+
+ state = get_seccomp_state(current->seccomp);
+ if (!state)
+ goto out;
+
+ filter = find_filter(state, syscall_nr);
+ if (!filter)
+ goto out;
+
+ ret = strlcpy(buf, get_filter_string(filter), bufsize);
+ if (ret >= bufsize) {
+ ret = -ENOSPC;
+ goto out;
+ }
+ /* Zero out any remaining buffer, just in case. */
+ memset(buf + ret, 0, bufsize - ret);
+out:
+ put_seccomp_state(state);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_get_filter);
+
+/**
+ * seccomp_clear_filter: clears the seccomp filter for a syscall.
+ * @syscall_nr: the system call number to clear filters for.
+ *
+ * (acts as a frontend for seccomp_set_filter. All restrictions
+ * apply)
+ *
+ * Returns 0 on success.
+ */
+long seccomp_clear_filter(int syscall_nr)
+{
+ return seccomp_set_filter(syscall_nr, NULL);
+}
+EXPORT_SYMBOL_GPL(seccomp_clear_filter);
+
+/**
+ * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall.
+ * @syscall_nr: system call number to apply the filter to.
+ * @filter: ftrace filter string to apply.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called.
+ *
+ * New filters may be added for system calls when the current task is
+ * not in a secure computing mode (seccomp). Otherwise, filters may only
+ * be added to already filtered system call entries. Any additions will
+ * be &&'d with the existing filter string to ensure no expansion of privileges
+ * will be possible.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_set_filter(int syscall_nr, char *filter)
+{
+ struct seccomp_state *state, *orig_state;
+ long ret = -EINVAL;
+
+ orig_state = get_seccomp_state(current->seccomp);
+
+ /* Prior to mutating the state, create a duplicate to avoid modifying
+ * the behavior of other instances sharing the state and ensure
+ * consistency.
+ */
+ state = (orig_state ? seccomp_state_dup(orig_state) :
+ seccomp_state_new());
+ if (!state) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* A NULL filter doubles as a drop value, but the exposed prctl
+ * interface requires a trip through seccomp_clear_filter().
+ * Filter dropping is allowed across the is_compat_task() barrier.
+ */
+ ret = 0;
+ if (filter == NULL) {
+ drop_filter(state, syscall_nr);
+ goto assign;
+ }
+
+ /* Avoid amiguous filters which may have been inherited from a parent
+ * with different syscall numbers for the logically same calls.
+ */
+#ifdef CONFIG_COMPAT
+ ret = -EACCES;
+ if (is_compat_task() != !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+ if (state->filters->count)
+ goto free_state;
+ /* It's safe to add if there are no existing ambiguous rules.*/
+ if (is_compat_task())
+ state->flags |= SECCOMP_FLAG_COMPAT;
+ else
+ state->flags &= ~(SECCOMP_FLAG_COMPAT);
+ }
+#endif
+
+ ret = add_filter(state, syscall_nr, filter);
+ if (ret)
+ goto free_state;
+
+assign:
+ rcu_assign_pointer(current->seccomp, state);
+ synchronize_rcu();
+ put_seccomp_state(orig_state); /* for the task */
+out:
+ put_seccomp_state(orig_state); /* for the get */
+ return ret;
+
+free_state:
+ put_seccomp_state(orig_state); /* for the get */
+ put_seccomp_state(state); /* drop the dup/new */
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_set_filter);
diff --git a/kernel/sys.c b/kernel/sys.c
index af468ed..d29003a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_ENDIAN:
error = SET_ENDIAN(me, arg2);
break;
-
case PR_GET_SECCOMP:
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
- error = prctl_set_seccomp(arg2);
+ error = prctl_set_seccomp(arg2, arg3);
+ break;
+ case PR_SET_SECCOMP_FILTER:
+ error = prctl_set_seccomp_filter(arg2,
+ (char __user *) arg3);
+ break;
+ case PR_CLEAR_SECCOMP_FILTER:
+ error = prctl_clear_seccomp_filter(arg2);
+ break;
+ case PR_GET_SECCOMP_FILTER:
+ error = prctl_get_seccomp_filter(arg2,
+ (char __user *) arg3,
+ arg4);
break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
--
1.7.0.4
Ingo Molnar
2011-05-12 07:48:50 UTC
Permalink
Ok, i like the direction here, but i think the ABI should be done differently.
Post by Will Drewry
+static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
+ const char *filter_string)
+{
+ int err = -ENOMEM;
+ struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
+ GFP_KERNEL);
+ if (!filter)
+ goto fail;
+
+ INIT_HLIST_NODE(&filter->node);
+ filter->syscall_nr = syscall_nr;
+ filter->data = syscall_nr_to_meta(syscall_nr);
+
+ /* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
+ * using a predicate at all.
+ */
+ if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
+ goto out;
+
+ /* Argument-based filtering only works on ftrace-hooked syscalls. */
+ if (!filter->data) {
+ err = -ENOSYS;
+ goto fail;
+ }
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+ err = ftrace_parse_filter(&filter->event_filter,
+ filter->data->enter_event->event.type,
+ filter_string);
+ if (err)
+ goto fail;
+#endif
+
+ return filter;
+
+ kfree(filter);
+ return ERR_PTR(err);
+}
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = SET_ENDIAN(me, arg2);
break;
-
error = prctl_get_seccomp();
break;
- error = prctl_set_seccomp(arg2);
+ error = prctl_set_seccomp(arg2, arg3);
+ break;
+ error = prctl_set_seccomp_filter(arg2,
+ (char __user *) arg3);
+ break;
+ error = prctl_clear_seccomp_filter(arg2);
+ break;
+ error = prctl_get_seccomp_filter(arg2,
+ (char __user *) arg3,
+ arg4);
To restrict execution to system calls.

Two observations:

1) We already have a specific ABI for this: you can set filters for events via
an event fd.

Why not extend that mechanism instead and improve *both* your sandboxing
bits and the events code? This new seccomp code has a lot more
to do with trace event filters than the minimal old seccomp code ...

kernel/trace/trace_event_filter.c is 2000 lines of tricky code that
interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of
mostly trivial code.

2) Why should this concept not be made available wider, to allow the
restriction of not just system calls but other security relevant components
of the kernel as well?

This too, if you approach the problem via the events code, will be a natural
end result, while if you approach it from the seccomp prctl angle it will be
a limited hack only.

Note, the end result will be the same - just using a different ABI.

So i really think the ABI itself should be closer related to the event code.
What this "seccomp" code does is that it uses specific syscall events to
restrict execution of certain event generating codepaths, such as system calls.

Thanks,

Ingo
Kees Cook
2011-05-12 09:24:24 UTC
Permalink
Hi,
Post by Ingo Molnar
1) We already have a specific ABI for this: you can set filters for events via
an event fd.
Why not extend that mechanism instead and improve *both* your sandboxing
bits and the events code? This new seccomp code has a lot more
to do with trace event filters than the minimal old seccomp code ...
Would this require privileges to get the event fd to start with? If so,
I would prefer to avoid that, since using prctl() as shown in the patch
set won't require any privs.

-Kees
--
Kees Cook
Ubuntu Security Team
Ingo Molnar
2011-05-12 10:49:16 UTC
Permalink
Hi,
Post by Ingo Molnar
1) We already have a specific ABI for this: you can set filters for events via
an event fd.
Why not extend that mechanism instead and improve *both* your sandboxing
bits and the events code? This new seccomp code has a lot more
to do with trace event filters than the minimal old seccomp code ...
Would this require privileges to get the event fd to start with? [...]
No special privileges with the default perf_events_paranoid value.
[...] If so, I would prefer to avoid that, since using prctl() as shown in
the patch set won't require any privs.
and we could also explicitly allow syscall events without any privileges,
regardless of the setting of 'perf_events_paranoid' config value.

Obviously a sandboxing host process wants to run with as low privileges as it
can.

Thanks,

Ingo
James Morris
2011-05-12 11:44:15 UTC
Permalink
Post by Ingo Molnar
2) Why should this concept not be made available wider, to allow the
restriction of not just system calls but other security relevant components
of the kernel as well?
Because the aim of this is to reduce the attack surface of the syscall
interface.

LSM is the correct level of abstraction for general security mediation,
because it allows you to take into account all relevant security
information in a race-free context.
Post by Ingo Molnar
This too, if you approach the problem via the events code, will be a natural
end result, while if you approach it from the seccomp prctl angle it will be
a limited hack only.
I'd say it's a well-defined and readily understandable feature.


- James
--
James Morris
<jmorris at namei.org>
Ingo Molnar
2011-05-12 13:01:04 UTC
Permalink
Post by James Morris
Post by Ingo Molnar
2) Why should this concept not be made available wider, to allow the
restriction of not just system calls but other security relevant components
of the kernel as well?
Because the aim of this is to reduce the attack surface of the syscall
interface.
What i suggest achieves the same, my argument is that we could aim it to be
even more flexible and even more useful.
Post by James Morris
LSM is the correct level of abstraction for general security mediation,
because it allows you to take into account all relevant security information
in a race-free context.
I don't care about LSM though, i find it poorly designed.

The approach implemented here, the ability for *unprivileged code* to define
(the seeds of ...) flexible security policies, in a proper Linuxish way, which
is inherited along the task parent/child hieararchy and which allows nesting
etc. is a *lot* more flexible.

What Will implemented here is pretty huge in my opinion: it turns security from
a root-only kind of weird hack into an essential component of its APIs,
available to *any* app not just the select security policy/mechanism chosen by
the distributor ...

If implemented properly this could replace LSM in the long run.

As a prctl() hack bound to seccomp (which, by all means, is a natural extension
to the current seccomp ABI, so perfectly fine if we only want that scope), that
is much less likely to happen.

And if we merge the seccomp interface prematurely then interest towards a more
flexible approach will disappear, so either we do it properly now or it will
take some time for someone to come around and do it ...

Also note that i do not consider the perf events ABI itself cast into stone -
and we could very well add a new system call for this, independent of perf
events. I just think that the seccomp scope itself is exciting but looks
limited to what the real potential of this could be.
Post by James Morris
Post by Ingo Molnar
This too, if you approach the problem via the events code, will be a natural
end result, while if you approach it from the seccomp prctl angle it will be
a limited hack only.
I'd say it's a well-defined and readily understandable feature.
Note, it was me who suggested this very event-filter-engine design a year ago,
when the first submission still used a crude bitmap of allowed seccomp
syscalls:

http://lwn.net/Articles/332974/

Funnily enough, back then you wrote this:

" I'm concerned that we're seeing yet another security scheme being designed on
the fly, without a well-formed threat model, and without taking into account
lessons learned from the seemingly endless parade of similar, failed schemes. "

so when and how did your opinion of this scheme turn from it being an "endless
parade of failed schemes" to it being a "well-defined and readily
understandable feature"? :-)

The idea itself has not changed since last year, what happened is that the
filter engine got a couple of new features and Will has separated it out and
has implemented a working prototype for sandboxing.

What i do here is to suggest *further* steps down the same road, now that we
see that this scheme can indeed be used to implement sandboxing ... I think
it's a valid line of inquiry.

Thanks,

Ingo
Will Drewry
2011-05-12 16:26:21 UTC
Permalink
[Thanks to everyone for the continued feedback and insights - I appreciate it!]
Post by Ingo Molnar
Post by James Morris
Post by Ingo Molnar
2) Why should this concept not be made available wider, to allow the
? ?restriction of not just system calls but other security relevant components
? ?of the kernel as well?
Because the aim of this is to reduce the attack surface of the syscall
interface.
What i suggest achieves the same, my argument is that we could aim it to be
even more flexible and even more useful.
Post by James Morris
LSM is the correct level of abstraction for general security mediation,
because it allows you to take into account all relevant security information
in a race-free context.
I don't care about LSM though, i find it poorly designed.
The approach implemented here, the ability for *unprivileged code* to define
(the seeds of ...) flexible security policies, in a proper Linuxish way, which
is inherited along the task parent/child hieararchy and which allows nesting
etc. is a *lot* more flexible.
What Will implemented here is pretty huge in my opinion: it turns security from
a root-only kind of weird hack into an essential component of its APIs,
available to *any* app not just the select security policy/mechanism chosen by
the distributor ...
If implemented properly this could replace LSM in the long run.
As a prctl() hack bound to seccomp (which, by all means, is a natural extension
to the current seccomp ABI, so perfectly fine if we only want that scope), that
is much less likely to happen.
And if we merge the seccomp interface prematurely then interest towards a more
flexible approach will disappear, so either we do it properly now or it will
take some time for someone to come around and do it ...
Also note that i do not consider the perf events ABI itself cast into stone -
and we could very well add a new system call for this, independent of perf
events. I just think that the seccomp scope itself is exciting but looks
limited to what the real potential of this could be.
I agree with you on many of these points! However, I don't think that
the views around LSMs, perf/ftrace infrastructure, or the current
seccomp filtering implementation are necessarily in conflict. Here is
my understanding of how the different worlds fit together and where I
see this patchset living, along with where I could see future work
going. Perhaps I'm being a trifle naive, but here goes anyway:

1. LSMs provide a global mechanism for hooking "security relevant"
events at a point where all the incoming user-sourced data has been
preprocessed and moved into userspace. The hooks are called every
time one of those boundaries are crossed.
2. Perf and the ftrace infrastructure provide global function tracing
and system call hooks with direct access to the caller's registers
(and memory).
3. seccomp (as it exists today) provides a global system call entry
hook point with a binary per-process decision about whether to provide
"secure computing" behavior.

When I boil that down to abstractions, I see:
A. Globally scoped: LSMs, ftrace/perf
B. Locally/process scoped: seccomp

The result of that logical equivalence is that I see room for:
I. A per-process, locally scoped security event hooking interface (the
proposed changes in this patchset)
II. A globally scoped security event hooking interface _prior_ to
argument processing
III. A globally scoped security event hooking interface _post_
argument processing

II and III could be reduced further if I assume that ftrace/perf
provides (II) and a simple intermediary layer (hook entry/exit)
provides the argument processing steps that then call out a global
security policy system.

The driving motivation for this patchset is kernel attack surface
reduction, but that need arises because we lack a process-scoped
mechanism for making security decisions -- everything is global:
creds/DAC, containers, LSM, etc. Adding ftrace filtering to agl's
original bitmask-seccomp proposal opens up the process-local security
world. At present, it can limit the attack surface with simple binary
filters or apply limited security policy through the use of filter
strings.

Based on your mails, I see two main deficiencies in my proposed patchset:
a. Deep argument analysis: Any arguments that live in user memory
needs to be copied into the kernel, then checked, and substituted for
the actual system call, then have the original pointers restored (when
applicable) on system call exit. There is a large overhead here and
the LSM hooks provide much of this support on a global level.
b. Lack of support for non-system call events.

For (a), if the long term view of ftrace/perf & LSMs is that LSM-like
functionality will live on top of the ftrace/perf infrastructure, then
adding support for the intermediary layer to analyze arguments will
come with time. It's also likely that for process-local stuff (e.g.,)
a new predicate could be added to callback to a userspace supervisor,
or even a more generic ability for modules to register new
predicates/functions in the filtering engine itself -- like "fd == 1
&& check_path(path) == '/etc/safe.conf'" or "check_xattr(path,
expected)". Of course, I'm just making stuff up right now :)

For (b), we could just add a field we don't use right now in the prctl
interface:
prctl(PR_SET_SECCOMP_FILTER, int event_type, int
event_or_syscall_nr, char *filter)
[or something similar]

Then we can add process-local/scoped supported event types somewhere
down the road without an ABI change.

Tying it all together, it'd look like:
* Now -- add process-scoped security support: secocmp filter with
support for "future" event types
* Soon -- expand ftrace syscall hooks to hook more system calls
* Later -- expand ftrace filter language to support either deep
argument analysis and/or custom registered predicates
* Later, later -- implement a LSM-like hooking layer for "interesting"
event types on top of the ftrace hooks

That would yield process-scoped security controls and global security
controls and the ability to continue to create new and interesting
security modules.

All that said, I'm in over my head. I've focused primarily on the
process-scoped security. I think James, some of the LSM authors, and
out-of-tree security system maintainers would be good to help guide
direction toward the security view you have in mind to ensure the
flexibility desired exists. And that's even assuming this sketch is
even vaguely interesting...

[snip]
Post by Ingo Molnar
What i do here is to suggest *further* steps down the same road, now that we
see that this scheme can indeed be used to implement sandboxing ... I think
it's a valid line of inquiry.
I certainly agree that it's a valid line of inquiry, but I worry about
the massive scope expansion. I know it hurts my head, but I'm hoping
the brain-dump above frames up how I think about this patch and your
line of inquiry. ftrace hooking and the perf code certainly look a
lot like LSMs if I squint hard :) But there is a substantial amount
of work to merge the worlds, and (thankfully) I don't think that
future directly impacts process-scoped security mechanisms even if
they can interact nicely.

thanks!
will
Ingo Molnar
2011-05-16 12:55:05 UTC
Permalink
I agree with you on many of these points! However, I don't think that the
views around LSMs, perf/ftrace infrastructure, or the current seccomp
filtering implementation are necessarily in conflict. Here is my
understanding of how the different worlds fit together and where I see this
patchset living, along with where I could see future work going. Perhaps I'm
1. LSMs provide a global mechanism for hooking "security relevant"
events at a point where all the incoming user-sourced data has been
preprocessed and moved into userspace. The hooks are called every
time one of those boundaries are crossed.
2. Perf and the ftrace infrastructure provide global function tracing
and system call hooks with direct access to the caller's registers
(and memory).
No, perf events are not just global but per task as well. Nor are events
limited to 'tracing' (generating a flow of events into a trace buffer) - they
can just be themselves as well and count and generate callbacks.

The generic NMI watchdog uses that kind of event model for example, see
kernel/watchdog.c and how it makes use of struct perf_event abstractions to do
per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per
task events and integrates it with the ptrace hw-breakpoints code.

Ideally Peter's one particular suggestion is right IMO and we'd want to be able
for a perf_event to just be a list of callbacks, attached to a task and barely
more than a discoverable, named notifier chain in its slimmest form.

In practice it's fatter than that right now, but we should definitely factor
out that aspect of it more clearly, both code-wise and API-wise.
kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is
possible and desirable.
3. seccomp (as it exists today) provides a global system call entry
hook point with a binary per-process decision about whether to provide
"secure computing" behavior.
A. Globally scoped: LSMs, ftrace/perf
B. Locally/process scoped: seccomp
Ok, i see where you got the idea that you needed to cut your surface of
abstraction at the filter engine / syscall enumeration level - i think you were
thinking of it in the ftrace model of tracepoints, not in the perf model of
events.

No, events are generic and as such per task as well, not just global.

I've replied to your other mail with more specific suggestions of how we could
provide your feature using abstractions that share code more widely. Talking
specifics will i hope help move the discussion forward! :-)

Thanks,

Ingo
Will Drewry
2011-05-16 14:42:07 UTC
Permalink
Post by Ingo Molnar
I agree with you on many of these points! ?However, I don't think that the
views around LSMs, perf/ftrace infrastructure, or the current seccomp
filtering implementation are necessarily in conflict. ?Here is my
understanding of how the different worlds fit together and where I see this
patchset living, along with where I could see future work going. ?Perhaps I'm
1. LSMs provide a global mechanism for hooking "security relevant"
events at a point where all the incoming user-sourced data has been
preprocessed and moved into userspace. ?The hooks are called every
time one of those boundaries are crossed.
2. Perf and the ftrace infrastructure provide global function tracing
and system call hooks with direct access to the caller's registers
(and memory).
No, perf events are not just global but per task as well. Nor are events
limited to 'tracing' (generating a flow of events into a trace buffer) - they
can just be themselves as well and count and generate callbacks.
I was looking at the perf_sysenter_enable() call, but clearly there is
more going on :)
Post by Ingo Molnar
The generic NMI watchdog uses that kind of event model for example, see
kernel/watchdog.c and how it makes use of struct perf_event abstractions to do
per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per
task events and integrates it with the ptrace hw-breakpoints code.
Ideally Peter's one particular suggestion is right IMO and we'd want to be able
for a perf_event to just be a list of callbacks, attached to a task and barely
more than a discoverable, named notifier chain in its slimmest form.
In practice it's fatter than that right now, but we should definitely factor
out that aspect of it more clearly, both code-wise and API-wise.
kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is
possible and desirable.
3. seccomp (as it exists today) provides a global system call entry
hook point with a binary per-process decision about whether to provide
"secure computing" behavior.
A. Globally scoped: LSMs, ftrace/perf
B. Locally/process scoped: seccomp
Ok, i see where you got the idea that you needed to cut your surface of
abstraction at the filter engine / syscall enumeration level - i think you were
thinking of it in the ftrace model of tracepoints, not in the perf model of
events.
No, events are generic and as such per task as well, not just global.
I've replied to your other mail with more specific suggestions of how we could
provide your feature using abstractions that share code more widely. Talking
specifics will i hope help move the discussion forward! :-)
Agreed. I'll digest both the watchdog code as well as your other
comments and follow up when I have a fuller picture in my head.

(I have a few initial comments I'll post in response to your other mail.)

Thanks!
will
James Morris
2011-05-13 00:18:52 UTC
Permalink
Post by Ingo Molnar
" I'm concerned that we're seeing yet another security scheme being designed on
the fly, without a well-formed threat model, and without taking into account
lessons learned from the seemingly endless parade of similar, failed schemes. "
so when and how did your opinion of this scheme turn from it being an "endless
parade of failed schemes" to it being a "well-defined and readily
understandable feature"? :-)
When it was defined in a way which limited its purpose to reducing the
attack surface of the sycall interface.


- James
--
James Morris
<jmorris at namei.org>
Peter Zijlstra
2011-05-13 12:19:06 UTC
Permalink
err = event_vfs_getname(result);
I really think we should not do this. Events like we have them should be
inactive, totally passive entities, only observe but not affect
execution (other than the bare minimal time delay introduced by
observance).

If you want another entity that is more active, please invent a new name
for it and create a new subsystem for them, now you could have these
active entities also have an (automatic) passive event side, but that's
some detail.
Ingo Molnar
2011-05-13 12:26:46 UTC
Permalink
Post by Peter Zijlstra
err = event_vfs_getname(result);
I really think we should not do this. Events like we have them should be
inactive, totally passive entities, only observe but not affect execution
(other than the bare minimal time delay introduced by observance).
Well, this patchset already demonstrates that we can use a single event
callback for a rather useful purpose.

Either it makes sense to do, in which case we should share facilities as much
as possible, or it makes no sense, in which case we should not merge it at all.
Post by Peter Zijlstra
If you want another entity that is more active, please invent a new name for
it and create a new subsystem for them, now you could have these active
entities also have an (automatic) passive event side, but that's some detail.
Why should we have two callbacks next to each other:

event_vfs_getname(result);
result = check_event_vfs_getname(result);

if one could do it all?

Thanks,

Ingo
Peter Zijlstra
2011-05-13 12:39:30 UTC
Permalink
Post by Ingo Molnar
Post by Peter Zijlstra
err = event_vfs_getname(result);
I really think we should not do this. Events like we have them should be
inactive, totally passive entities, only observe but not affect execution
(other than the bare minimal time delay introduced by observance).
Well, this patchset already demonstrates that we can use a single event
callback for a rather useful purpose.
Can and should are two distinct things.
Post by Ingo Molnar
Either it makes sense to do, in which case we should share facilities as much
as possible, or it makes no sense, in which case we should not merge it at all.
And I'm arguing we should _not_. Observing is radically different from
Affecting, at the very least the two things should have different
permission schemes. We should not confuse these two matters.
Post by Ingo Molnar
Post by Peter Zijlstra
If you want another entity that is more active, please invent a new name for
it and create a new subsystem for them, now you could have these active
entities also have an (automatic) passive event side, but that's some detail.
event_vfs_getname(result);
result = check_event_vfs_getname(result);
if one could do it all?
Did you actually read the bit where I said that check_event_* (although
I still think that name sucks) could imply a matching event_*?
Peter Zijlstra
2011-05-13 12:43:32 UTC
Permalink
event_vfs_getname(result);
result = check_event_vfs_getname(result);
Another fundamental difference is how to treat the callback chains for
these two.

Observers won't have a return value and are assumed to never fail,
therefore we can always call every entry on the callback list.

Active things otoh do have a return value, and thus we need to have
semantics that define what to do with that during callback iteration,
when to continue and when to break. Thus for active elements its
impossible to guarantee all entries will indeed be called.
Ingo Molnar
2011-05-13 12:54:52 UTC
Permalink
Post by Peter Zijlstra
event_vfs_getname(result);
result = check_event_vfs_getname(result);
Another fundamental difference is how to treat the callback chains for
these two.
Observers won't have a return value and are assumed to never fail,
therefore we can always call every entry on the callback list.
Active things otoh do have a return value, and thus we need to have
semantics that define what to do with that during callback iteration,
when to continue and when to break. Thus for active elements its
impossible to guarantee all entries will indeed be called.
I think the sanest semantics is to run all active callbacks as well.

For example if this is used for three stacked security policies - as if 3 LSM
modules were stacked at once. We'd call all three, and we'd determine that at
least one failed - and we'd return a failure.

Even if the first one failed already we'd still want to trigger *all* the
failures, because security policies like to know when they have triggered a
failure (regardless of other active policies) and want to see that failure
event (if they are logging such events).

So to me this looks pretty similar to observer callbacks as well, it's the
natural extension to an observer callback chain.

Observer callbacks are simply constant functions (to the caller), those which
never return failure and which never modify any of the parameters.

It's as if you argued that there should be separate syscalls/facilities for
handling readonly files versus handling read/write files.

Thanks,

Ingo
Peter Zijlstra
2011-05-13 13:08:52 UTC
Permalink
Post by Ingo Molnar
I think the sanest semantics is to run all active callbacks as well.
For example if this is used for three stacked security policies - as if 3 LSM
modules were stacked at once. We'd call all three, and we'd determine that at
least one failed - and we'd return a failure.
But that only works for boolean functions where you can return the
multi-bit-or of the result. What if you need to return the specific
error code.

Also, there's bound to be other cases where people will want to employ
this, look at all the various notifier chain muck we've got, it already
deals with much of this -- simply because users need it.

Then there's the whole indirection argument, if you don't need
indirection, its often better to not use it, I myself much prefer code
to look like:

foo1(bar);
foo2(bar);
foo3(bar);

Than:

foo_notifier(bar);

Simply because its much clearer who all are involved without me having
to grep around to see who registers for foo_notifier and wth they do
with it. It also makes it much harder to sneak in another user, whereas
its nearly impossible to find new notifier users.

Its also much faster, no extra memory accesses, no indirect function
calls, no other muck.
Ingo Molnar
2011-05-13 13:18:00 UTC
Permalink
Post by Peter Zijlstra
Post by Ingo Molnar
I think the sanest semantics is to run all active callbacks as well.
For example if this is used for three stacked security policies - as if 3 LSM
modules were stacked at once. We'd call all three, and we'd determine that at
least one failed - and we'd return a failure.
But that only works for boolean functions where you can return the
multi-bit-or of the result. What if you need to return the specific
error code.
Do you mean that one filter returns -EINVAL while the other -EACCES?

Seems like a non-problem to me, we'd return the first nonzero value.
Post by Peter Zijlstra
Also, there's bound to be other cases where people will want to employ
this, look at all the various notifier chain muck we've got, it already
deals with much of this -- simply because users need it.
Do you mean it would be easy to abuse it? What kind of abuse are you most
worried about?
Post by Peter Zijlstra
Then there's the whole indirection argument, if you don't need
indirection, its often better to not use it, I myself much prefer code
foo1(bar);
foo2(bar);
foo3(bar);
foo_notifier(bar);
Simply because its much clearer who all are involved without me having
to grep around to see who registers for foo_notifier and wth they do
with it. It also makes it much harder to sneak in another user, whereas
its nearly impossible to find new notifier users.
Its also much faster, no extra memory accesses, no indirect function
calls, no other muck.
But i suspect this question has been settled, given the fact that even pure
observer events need and already process a chain of events? Am i missing
something about your argument?

Thanks,

Ingo
Peter Zijlstra
2011-05-13 13:55:35 UTC
Permalink
Cut the microblaze list since its bouncy.
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Ingo Molnar
I think the sanest semantics is to run all active callbacks as well.
For example if this is used for three stacked security policies - as if 3 LSM
modules were stacked at once. We'd call all three, and we'd determine that at
least one failed - and we'd return a failure.
But that only works for boolean functions where you can return the
multi-bit-or of the result. What if you need to return the specific
error code.
Do you mean that one filter returns -EINVAL while the other -EACCES?
Seems like a non-problem to me, we'd return the first nonzero value.
Assuming the first is -EINVAL, what then is the value in computing the
-EACCESS? Sounds like a massive waste of time to me.
Post by Ingo Molnar
Post by Peter Zijlstra
Also, there's bound to be other cases where people will want to employ
this, look at all the various notifier chain muck we've got, it already
deals with much of this -- simply because users need it.
Do you mean it would be easy to abuse it? What kind of abuse are you most
worried about?
I'm not worried about abuse, I'm saying that going by the existing
notifier pattern always visiting all entries on the callback list is
undesired.
Post by Ingo Molnar
Post by Peter Zijlstra
Then there's the whole indirection argument, if you don't need
indirection, its often better to not use it, I myself much prefer code
foo1(bar);
foo2(bar);
foo3(bar);
foo_notifier(bar);
Simply because its much clearer who all are involved without me having
to grep around to see who registers for foo_notifier and wth they do
with it. It also makes it much harder to sneak in another user, whereas
its nearly impossible to find new notifier users.
Its also much faster, no extra memory accesses, no indirect function
calls, no other muck.
But i suspect this question has been settled, given the fact that even pure
observer events need and already process a chain of events? Am i missing
something about your argument?
I'm saying that there's reasons to not use notifiers passive or active.

Mostly the whole notifier/indirection muck comes up once you want
modules to make use of the thing, because then you need dynamic
management of the callback list.

(Then again, I'm fairly glad we don't have explicit callbacks in
kernel/cpu.c for all the cpu-hotplug callbacks :-)

Anyway, I oppose for the existing events to gain an active role.
Ingo Molnar
2011-05-13 14:57:37 UTC
Permalink
Post by Peter Zijlstra
Cut the microblaze list since its bouncy.
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Ingo Molnar
I think the sanest semantics is to run all active callbacks as well.
For example if this is used for three stacked security policies - as if 3 LSM
modules were stacked at once. We'd call all three, and we'd determine that at
least one failed - and we'd return a failure.
But that only works for boolean functions where you can return the
multi-bit-or of the result. What if you need to return the specific
error code.
Do you mean that one filter returns -EINVAL while the other -EACCES?
Seems like a non-problem to me, we'd return the first nonzero value.
Assuming the first is -EINVAL, what then is the value in computing the
-EACCESS? Sounds like a massive waste of time to me.
No, because the common case is no rejection - this is a security mechanism. So
in the normal case we would execute all 3 anyway, just to determine that all
return 0.

Are you really worried about the abnormal case of one of them returning an
error and us calculating all 3 return values?
Post by Peter Zijlstra
Post by Ingo Molnar
Post by Peter Zijlstra
Also, there's bound to be other cases where people will want to employ
this, look at all the various notifier chain muck we've got, it already
deals with much of this -- simply because users need it.
Do you mean it would be easy to abuse it? What kind of abuse are you most
worried about?
I'm not worried about abuse, I'm saying that going by the existing
notifier pattern always visiting all entries on the callback list is
undesired.
That is because many notifier chains are used in an 'event consuming' manner -
they are responding to things like hardware events and are called in an
interrupt-handler alike fashion most of the time.
Post by Peter Zijlstra
Post by Ingo Molnar
Post by Peter Zijlstra
Then there's the whole indirection argument, if you don't need
indirection, its often better to not use it, I myself much prefer code
foo1(bar);
foo2(bar);
foo3(bar);
foo_notifier(bar);
Simply because its much clearer who all are involved without me having
to grep around to see who registers for foo_notifier and wth they do
with it. It also makes it much harder to sneak in another user, whereas
its nearly impossible to find new notifier users.
Its also much faster, no extra memory accesses, no indirect function
calls, no other muck.
But i suspect this question has been settled, given the fact that even pure
observer events need and already process a chain of events? Am i missing
something about your argument?
I'm saying that there's reasons to not use notifiers passive or active.
Mostly the whole notifier/indirection muck comes up once you want
modules to make use of the thing, because then you need dynamic
management of the callback list.
But your argument assumes that we'd have a chain of functions to call, like
regular notifiers.

While the natural model here would be to have a list of registered event
structs for that point, with different filters but basically the same callback
mechanism (a call into the filter engine in essence).

Also note that the common case would be no event registered - and we'd
automatically optimize that case via the existing jump labels optimization.
Post by Peter Zijlstra
(Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c
for all the cpu-hotplug callbacks :-)
Anyway, I oppose for the existing events to gain an active role.
Why if 'being active' is optional and useful?

Thanks,

Ingo
Peter Zijlstra
2011-05-13 15:27:23 UTC
Permalink
Post by Ingo Molnar
this is a security mechanism
Who says? and why would you want to unify two separate concepts only to
them limit it to security that just doesn't make sense.

Either you provide a full on replacement for notifier chain like things
or you don't, only extending trace events in this fashion for security
is like way weird.

Plus see the arguments Eric made about stacking stuff, not only security
schemes will have those problems.
Ingo Molnar
2011-05-14 07:05:42 UTC
Permalink
Post by Ingo Molnar
this is a security mechanism
Who says? [...]
Kernel developers/maintainers of the affected code.

We have security hooks all around the kernel, which can deny/accept execution
at various key points, but we do not have 'execute arbitrary user-space defined
(safe) scripts' callbacks in general.

But yes, if a particular callback point is defined widely enough to allow much
bigger intervention into the flow of execution, then more is possible as well.
[...] and why would you want to unify two separate concepts only to them
limit it to security that just doesn't make sense.
I don't limit them to security - the callbacks themselves are either for
passive observation or, at most, for security accept/deny callbacks.

It's decided by the subsystem maintainers what kind of user-space control power
(or observation power) they want to allow, not me.

I would just like to not stop the facility itself at the 'observe only' level,
like you suggest.

Thanks,

Ingo
Steven Rostedt
2011-05-16 16:23:46 UTC
Permalink
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Ingo Molnar
Post by Peter Zijlstra
Then there's the whole indirection argument, if you don't need
indirection, its often better to not use it, I myself much prefer code
foo1(bar);
foo2(bar);
foo3(bar);
foo_notifier(bar);
Simply because its much clearer who all are involved without me having
to grep around to see who registers for foo_notifier and wth they do
with it. It also makes it much harder to sneak in another user, whereas
its nearly impossible to find new notifier users.
Its also much faster, no extra memory accesses, no indirect function
calls, no other muck.
But i suspect this question has been settled, given the fact that even pure
observer events need and already process a chain of events? Am i missing
something about your argument?
I'm saying that there's reasons to not use notifiers passive or active.
Mostly the whole notifier/indirection muck comes up once you want
modules to make use of the thing, because then you need dynamic
management of the callback list.
But your argument assumes that we'd have a chain of functions to call, like
regular notifiers.
While the natural model here would be to have a list of registered event
structs for that point, with different filters but basically the same callback
mechanism (a call into the filter engine in essence).
Also note that the common case would be no event registered - and we'd
automatically optimize that case via the existing jump labels optimization.
I agree that I prefer the "notifier" type over having direct function
calls. Yes, it's easier to read and figure out what functions are
called, but notifiers can be optimized for the default case where
nothing is called (jump-label nop).
Post by Ingo Molnar
Post by Peter Zijlstra
(Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c
for all the cpu-hotplug callbacks :-)
Anyway, I oppose for the existing events to gain an active role.
Why if 'being active' is optional and useful?
I'm a bit nervous about the 'active' role of (trace_)events, because of
the way multiple callbacks can be registered. How would:

err = event_x();
if (err == -EACCESS) {

be handled? Would we need a way to prioritize which call back gets the
return value? One way I guess would be to add a check_event option,
where you pass in an ENUM of the event you want:

event_x();
err = check_event_x(MYEVENT);

If something registered itself as "MYEVENT" to event_x, then you get the
return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or
something could be returned. I'm sure we could even optimize it such a
way if no active events have been registered to event_x, that
check_event_x() will return -ENODEV without any branches.

-- Steve
Ingo Molnar
2011-05-16 16:52:49 UTC
Permalink
I'm a bit nervous about the 'active' role of (trace_)events, because of the
err = event_x();
if (err == -EACCESS) {
be handled? [...]
The default behavior would be something obvious: to trigger all callbacks and
use the first non-zero return value.
[...] Would we need a way to prioritize which call back gets the return
value? One way I guess would be to add a check_event option, where you pass
event_x();
err = check_event_x(MYEVENT);
If something registered itself as "MYEVENT" to event_x, then you get the
return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or
something could be returned. I'm sure we could even optimize it such a way if
no active events have been registered to event_x, that check_event_x() will
return -ENODEV without any branches.
I would keep it simple and extensible - that way we can complicate it when the
need arises! :)

Thanks,

Ingo
Steven Rostedt
2011-05-16 17:03:42 UTC
Permalink
Post by Ingo Molnar
I'm a bit nervous about the 'active' role of (trace_)events, because of the
err = event_x();
if (err == -EACCESS) {
be handled? [...]
The default behavior would be something obvious: to trigger all callbacks and
use the first non-zero return value.
But how do we know which callback that was from? There's no ordering of
what callbacks are called first.
Post by Ingo Molnar
[...] Would we need a way to prioritize which call back gets the return
value? One way I guess would be to add a check_event option, where you pass
event_x();
err = check_event_x(MYEVENT);
If something registered itself as "MYEVENT" to event_x, then you get the
return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or
something could be returned. I'm sure we could even optimize it such a way if
no active events have been registered to event_x, that check_event_x() will
return -ENODEV without any branches.
I would keep it simple and extensible - that way we can complicate it when the
need arises! :)
The above is rather trivial to implement. I don't think it complicates
anything.

-- Steve
Ingo Molnar
2011-05-17 12:42:12 UTC
Permalink
Post by Ingo Molnar
I'm a bit nervous about the 'active' role of (trace_)events, because of the
err = event_x();
if (err == -EACCESS) {
be handled? [...]
The default behavior would be something obvious: to trigger all callbacks and
use the first non-zero return value.
But how do we know which callback that was from? There's no ordering of what
callbacks are called first.
We do not have to know that - nor do the calling sites care in general. Do you
have some specific usecase in mind where the identity of the callback that
generates a match matters?

Thanks,

Ingo
Steven Rostedt
2011-05-17 13:05:28 UTC
Permalink
Post by Ingo Molnar
Post by Ingo Molnar
I'm a bit nervous about the 'active' role of (trace_)events, because of the
err = event_x();
if (err == -EACCESS) {
be handled? [...]
The default behavior would be something obvious: to trigger all callbacks and
use the first non-zero return value.
But how do we know which callback that was from? There's no ordering of what
callbacks are called first.
We do not have to know that - nor do the calling sites care in general. Do you
have some specific usecase in mind where the identity of the callback that
generates a match matters?
Maybe I'm confused. I was thinking that these event_*() are what we
currently call trace_*(), but the event_*(), I assume, can return a
value if a call back returns one.

Thus, we now have the ability to dynamically attach function calls to
arbitrary points in the kernel that can have an affect on the code that
called it. Right now, we only have the ability to attach function calls
to these locations that have passive affects (tracing/profiling).

But you say, "nor do the calling sites care in general". Then what do
these calling sites do with the return code? Are we limiting these
actions to security only? Or can we have some other feature. I can
envision that we can make the Linux kernel quite dynamic here with "self
modifying code". That is, anywhere we have "hooks", perhaps we could
replace them with dynamic switches (jump labels). Maybe events would not
be the best use, but they could be a generic one.

Knowing what callback returned the result would be beneficial. Right
now, you are saying if the call back return anything, just abort the
call, not knowing what callback was called.

-- Steve
Ingo Molnar
2011-05-17 13:19:02 UTC
Permalink
Post by Steven Rostedt
Post by Ingo Molnar
Post by Ingo Molnar
I'm a bit nervous about the 'active' role of (trace_)events, because of the
err = event_x();
if (err == -EACCESS) {
be handled? [...]
The default behavior would be something obvious: to trigger all callbacks and
use the first non-zero return value.
But how do we know which callback that was from? There's no ordering of what
callbacks are called first.
We do not have to know that - nor do the calling sites care in general. Do you
have some specific usecase in mind where the identity of the callback that
generates a match matters?
Maybe I'm confused. I was thinking that these event_*() are what we
currently call trace_*(), but the event_*(), I assume, can return a
value if a call back returns one.
Yeah - and the call site can treat it as:

- Ugh, if i get an error i need to abort whatever i was about to do

or (more advanced future use):

- If i get a positive value i need to re-evaluate the parameters that were
passed in, they were changed
Post by Steven Rostedt
Thus, we now have the ability to dynamically attach function calls to
arbitrary points in the kernel that can have an affect on the code that
called it. Right now, we only have the ability to attach function calls to
these locations that have passive affects (tracing/profiling).
Well, they can only have the effect that the calling site accepts and handles.
So the 'effect' is not arbitrary and not defined by the callbacks, it is
controlled and handled by the calling code.

We do not want invisible side-effects, opaque hooks, etc.

Instead of that we want (this is the getname() example i cited in the thread)
explicit effects, like:

if (event_vfs_getname(result))
return ERR_PTR(-EPERM);
Post by Steven Rostedt
But you say, "nor do the calling sites care in general". Then what do
these calling sites do with the return code? Are we limiting these
actions to security only? Or can we have some other feature. [...]
Yeah, not just security. One other example that came up recently is whether to
panic the box on certain (bad) events such as NMI errors. This too could be
made flexible via the event filter code: we already capture many events, so
places that might conceivably do some policy could do so based on a filter
condition.
Post by Steven Rostedt
[...] I can envision that we can make the Linux kernel quite dynamic here
with "self modifying code". That is, anywhere we have "hooks", perhaps we
could replace them with dynamic switches (jump labels). Maybe events would
not be the best use, but they could be a generic one.
events and explicit function calls and explicit side-effects are pretty much
the only thing that are acceptable. We do not want opaque hooks and arbitrary
side-effects.
Post by Steven Rostedt
Knowing what callback returned the result would be beneficial. Right now, you
are saying if the call back return anything, just abort the call, not knowing
what callback was called.
Yeah, and that's a feature: that way a number of conditions can be attached.
Multiple security frameworks may have effect on a task or multiple tools might
set policy action on a given event.

Thanks,

Ingo
Will Drewry
2011-05-19 04:07:15 UTC
Permalink
Post by Steven Rostedt
Post by Ingo Molnar
Post by Ingo Molnar
I'm a bit nervous about the 'active' role of (trace_)events, because of the
? ? ? err = event_x();
? ? ? if (err == -EACCESS) {
be handled? [...]
The default behavior would be something obvious: to trigger all callbacks and
use the first non-zero return value.
But how do we know which callback that was from? There's no ordering of what
callbacks are called first.
We do not have to know that - nor do the calling sites care in general. Do you
have some specific usecase in mind where the identity of the callback that
generates a match matters?
Maybe I'm confused. I was thinking that these event_*() are what we
currently call trace_*(), but the event_*(), I assume, can return a
value if a call back returns one.
?- Ugh, if i get an error i need to abort whatever i was about to do
?- If i get a positive value i need to re-evaluate the parameters that were
? passed in, they were changed
Do event_* that return non-void exist in the tree at all now? I've
looked at the various tracepoint macros as well as some of the other
handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
places where a return value is honored nor could be. At best, the
perf_tp_event can be short-circuited it in the hlist_for_each, but
it'd still need a way to bubble up a failure and result in not calling
the trace/event that the hook precedes.

Am I missing something really obvious? I don't feel I've gotten a
good handle on exactly how all the tracing code gets triggered, so
perhaps I'm still a level (or three) too shallow. (I can see the asm
hooks for trace functions and I can see where that translates to
registered calls - like trace_function - but I don't see how the
hooked calls can be trivially aborted).

As is, I'm not sure how the perf and ftrace infrastructure could be
reused cleanly without a fair number of hacks to the interface and a
good bit of reworking. I can already see a number of challenges
around reusing the sys_perf_event_open interface and the fact that
reimplementing something even as simple as seccomp mode=1 seems to
require a fair amount of tweaking to avoid from being leaky. (E.g.,
enabling all TRACE_EVENT()s for syscalls will miss unhooked syscalls
so either acceptance matching needs to be propagated up the stack
along with some seccomp-like task modality or seccomp-on-perf would
have to depend on sys_enter events with syscall number predicate
matching and fail when a filter discard applies to all active events.)

At present, I'm leaning back towards the v2 series (plus the requested
minor changes) for the benefit of code clarity and its fail-secure
behavior. Even just considering the reduced case of seccomp mode 1
being implemented on the shared infrastructure, I feel like I missing
something that makes it viable. Any clues?

If not, I don't think a seccomp mode 2 interface via prctl would be
intractable if the long term movement is to a ftrace/perf backend - it
just means that the in-kernel code would change to wrap whatever the
final design ended up being.

Thanks and sorry if I'm being dense!
Post by Steven Rostedt
Thus, we now have the ability to dynamically attach function calls to
arbitrary points in the kernel that can have an affect on the code that
called it. Right now, we only have the ability to attach function calls to
these locations that have passive affects (tracing/profiling).
Well, they can only have the effect that the calling site accepts and handles.
So the 'effect' is not arbitrary and not defined by the callbacks, it is
controlled and handled by the calling code.
We do not want invisible side-effects, opaque hooks, etc.
Instead of that we want (this is the getname() example i cited in the thread)
?if (event_vfs_getname(result))
? ? ? ?return ERR_PTR(-EPERM);
Post by Steven Rostedt
But you say, "nor do the calling sites care in general". Then what do
these calling sites do with the return code? Are we limiting these
actions to security only? Or can we have some other feature. [...]
Yeah, not just security. One other example that came up recently is whether to
panic the box on certain (bad) events such as NMI errors. This too could be
made flexible via the event filter code: we already capture many events, so
places that might conceivably do some policy could do so based on a filter
condition.
This sounds great - I just wish I could figure out how it'd work :)
Post by Steven Rostedt
[...] I can envision that we can make the Linux kernel quite dynamic here
with "self modifying code". That is, anywhere we have "hooks", perhaps we
could replace them with dynamic switches (jump labels). Maybe events would
not be the best use, but they could be a generic one.
events and explicit function calls and explicit side-effects are pretty much
the only thing that are acceptable. We do not want opaque hooks and arbitrary
side-effects.
Post by Steven Rostedt
Knowing what callback returned the result would be beneficial. Right now, you
are saying if the call back return anything, just abort the call, not knowing
what callback was called.
Yeah, and that's a feature: that way a number of conditions can be attached.
Multiple security frameworks may have effect on a task or multiple tools might
set policy action on a given event.
Thanks,
? ? ? ?Ingo
Steven Rostedt
2011-05-19 12:22:08 UTC
Permalink
Post by Will Drewry
Do event_* that return non-void exist in the tree at all now? I've
looked at the various tracepoint macros as well as some of the other
handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
places where a return value is honored nor could be. At best, the
perf_tp_event can be short-circuited it in the hlist_for_each, but
it'd still need a way to bubble up a failure and result in not calling
the trace/event that the hook precedes.
No, none of the current trace hooks have return values. That was what I
was talking about how to implement in my previous emails.

-- Steve
Will Drewry
2011-05-19 21:05:14 UTC
Permalink
Post by Steven Rostedt
Do event_* that return non-void exist in the tree at all now? ?I've
looked at the various tracepoint macros as well as some of the other
handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
places where a return value is honored nor could be. ?At best, the
perf_tp_event can be short-circuited it in the hlist_for_each, but
it'd still need a way to bubble up a failure and result in not calling
the trace/event that the hook precedes.
No, none of the current trace hooks have return values. That was what I
was talking about how to implement in my previous emails.
Led on by complete ignorance, I think I'm finally beginning to untwist
the different pieces of the tracing infrastructure. Unfortunately,
that means I took a few wrong turns along the way...

I think function tracing looks something like this:

ftrace_call has been injected into at a specific callsite. Upon hit:
1. ftrace_call triggers
2. does some checks then calls ftrace_trace_function (via mcount instrs)
3. ftrace_trace_function may be a single func or a list. For a list it
will be: ftrace_list_func
4. ftrace_list_func calls each registered hook for that function in a
while loop ignoring return values
5. registered hook funcs may then track the call, farm it out to
specific sub-handlers, etc.

This seems to be a red herring for my use case :/ though this helped
me understand your back and forth (Ingo & Steve) regarding dynamic
versus explicit events.


System call tracing is done via kernel/tracepoint.c events fed in via
arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter. This
yields direct sys_enter and sys_exit event sources (and an event class
to hook up per-system call events). This means that
ftrace_syscall_enter() does the event prep prior to doing a filter
check comparing the ftrace_event object for the given syscall_nr to
the event data. perf_sysenter_enter() is similar but it pushes the
info over to perf_tp_event to be matched not against the global
syscall event entry, but against any sub-event in the linked list on
that syscall's event.

Is that roughly an accurate representation of the two? I wish I
hadn't gotten distracted along the function path, but at least I
learned something (and it is relevant to the larger scope of this
thread's discussion).


After doing that digging, it looks like providing hook call
pre-emption and return value propagation will be a unique and fun task
for each tracer and event subsystem. If I look solely at tracepoints,
a generic change would be to make the trace_##subsys function return
an int (which I think was the event_vfs_getname proposal?). The other
option is to change the trace_sys_enter proto to include a 'int
*retcode'.

That change would allow the propagation of some sort of policy
information. To put it to use, seccomp mode 1 could be implemented on
top of trace_syscalls. The following changes would need to happen:
1. dummy metadata should be inserted for all unhooked system calls
2. perf_trace_buf_submit would need to return an int or a new
TRACE_REG_SECCOMP_REGISTER handler would need to be setup in
syscall_enter_register.
3. If perf is abused, a "kill/fail_on_discard" bit would be added to
event->attrs.
4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches,
'n' for the number of event matches, and -EACCES/? if a
'fail_on_discard' event is seen.
5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
6. trace_sys_enter() would need to be moved to be the first entry
arch/../kernel/ptrace.c for incoming syscalls
7. if trace_sys_enter() yields a negative return code, then
do_exit(SIGKILL) the process and return.

Entering into seccomp mode 1 would require adding a "0" filter for
every system call number (which is why we need a dummy event call for
them since failing to check the bitmask can't be flagged
fail_on_discard) with the fail_on_discard bit. For the three calls
that are allowed, a '1' filter would be set.

That would roughly implement seccomp mode 1. It's pretty ugly and the
fact that every system call that's disallowed has to be blacklisted is
not ideal. An alternate model would be to just use the seccomp mode
as we do today and let secure_computing() handle the return code of "#
of matches". If it the # of matches is 0, it terminates. A
'fail_on_discard' bit then would only be good to stop further
tracepoint callback evaluation. This approach would also *almost* nix
the need to provide dummy syscall hooks. (Since sigreturn isn't
hooked on x86 because it uses ptregs fixup, a dummy would still be
needed to apply a "1" filter to.)

Even with that tweak to move to a whitelist model, the perf event
evaluation and tracepoint callback ordering is still not guaranteed.
Without changing tracepoint itself, all other TPs will still execute.
And for perf events, it'll be first-come-first-serve until a
fail_on_discard is hit.

After using seccomp mode=1 as the sample case to reduce scope, it's
possible to ignore it for now :) and look at the seccomp-filter/mode=2
case. The same mechanism could be used to inject more expressive
filters. This would be done over the sys_perf_event_open() interface
assuming the new attr is added to stop perf event list enumeration.
Assuming a whitelist model, a call to prctl(PR_SET_SECCOMP, 2) would
be needed (maybe with the ON_EXEC flag option too to mirror the
event->attr on-exec bit). That would yield the ability to register
perf events for system calls then use ioctl() to SET_FILTER on them.

Reducing the privileges of the filters after installation could be
done with another attribute bit like 'set_filter_ands'. If that is
also set on the event, and a filter is installed to ensure that
sys_perf_event_open is blocked, then it should be reasonably sane.

I'd need to add a PERF_EVENT_IOC_GET_FILTER handler to allow
extracting the settings.

Clearly, I haven't written the code for that yet, though making the
change for a single platform shouldn't be too much code.

So that leaves me with some questions:
- Is this the type of reuse that was being encouraged?
- Does it really make sense to cram this through the perf interface
and events? While the changed attributes are innocuous and
potentially reusable, it seems that a large amount of the perf
facilities are exposed that could have weird side effects, but I'm
sure I still don't fully grok the perf infrastructure.
- Does extending one tracepoint to provide return values via a pointer
make sense? I'd hesitate to make all tracepoint hooks return an int by
default.
- If all tracepoints returned an int, what would the standard value
look like - or would it need to be per tracepoint impl?
- How is ambiguity resolved if multiple perf_events are registered for
a syscall with different filters? Maybe a 'stop_on_match'? though
ordering is still a problem then.
- Is there a way to affect a task-wide change without a seccomp flag
(or new task_struct entry) via the existing sys_perf_event_open
syscall? I considered suggesting a attr bit call 'secure_computing'
that when an event with the bit is enabled, it automagically forces
the task into seccomp enforcement mode, but that, again, seemed
hackish.

While I like the idea of sharing the tracepoints infrastructure and
the trace_syscalls hooks as well as using a pre-existing interface
with very minor changes, I'm worried that the complexity of the
interface and of the infrastructure might undermine the ability to
continue meeting the desired security goals. I had originally stayed
very close to the seccomp world because of how trivial it is to review
the code and verify its accuracy/effectiveness. This approach leaves
a lot of gaps for kernel code to seep through and a fair amount of
ambiguity in what locked down syscall filters might look like.

To me, the best part of the above is that it shows that even if we go
with a prctl SET_SECCOMP_FILTER-style interface, it is completely
certain that if a perf/ftrace-based security infrastructure is on our
future, it will be entirely compatible -- even if the prctl()
interface is just the "simpler" interface at that point somewhere down
the road.


Regardless, I'll hack up a proof of concept based on the outline
above. Perhaps something more elegant will emerge once I start
tweaking the source, but I'm still seeing too many gaps to be
comfortable so far.


[[There is a whole other approach to this too. We could continue with
the prctl() interface and mirror the trace_sys_enter model for
secure_computing(). Instead of keeping a seccomp_t-based hlist of
events, they could be stored in a new hlist for seccomp_events in
struct ftrace_event_call. The ftrace filters would be installed there
and the seccomp_syscall_enter() function could do the checks and pass
up some state data on the task's seccomp_t that indicates it needs to
do_exit(). That would likely reduce the amount of code in
seccomp_filter.c pretty seriously (though not entirely
beneficially).]]


Thanks again for all the feedback and insights! I really hope we can
come to an agreeable approach for implementing kernel attack surface
reduction.
will
Will Drewry
2011-05-24 15:59:04 UTC
Permalink
Post by Will Drewry
Post by Steven Rostedt
Do event_* that return non-void exist in the tree at all now? ?I've
looked at the various tracepoint macros as well as some of the other
handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
places where a return value is honored nor could be. ?At best, the
perf_tp_event can be short-circuited it in the hlist_for_each, but
it'd still need a way to bubble up a failure and result in not calling
the trace/event that the hook precedes.
No, none of the current trace hooks have return values. That was what I
was talking about how to implement in my previous emails.
Led on by complete ignorance, I think I'm finally beginning to untwist
the different pieces of the tracing infrastructure. ?Unfortunately,
that means I took a few wrong turns along the way...
1. ftrace_call triggers
2. does some checks then calls ftrace_trace_function (via mcount instrs)
3. ftrace_trace_function may be a single func or a list. For a list it
will be: ftrace_list_func
4. ftrace_list_func calls each registered hook for that function in a
while loop ignoring return values
5. registered hook funcs may then track the call, farm it out to
specific sub-handlers, etc.
This seems to be a red herring for my use case :/ though this helped
me understand your back and forth (Ingo & Steve) regarding dynamic
versus explicit events.
System call tracing is done via kernel/tracepoint.c events fed in via
arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter. ?This
yields direct sys_enter and sys_exit event sources (and an event class
to hook up per-system call events). ?This means that
ftrace_syscall_enter() does the event prep prior to doing a filter
check comparing the ftrace_event object for the given syscall_nr to
the event data. ?perf_sysenter_enter() is similar but it pushes the
info over to perf_tp_event to be matched not against the global
syscall event entry, but against any sub-event in the linked list on
that syscall's event.
Is that roughly an accurate representation of the two? ?I wish I
hadn't gotten distracted along the function path, but at least I
learned something (and it is relevant to the larger scope of this
thread's discussion).
After doing that digging, it looks like providing hook call
pre-emption and return value propagation will be a unique and fun task
for each tracer and event subsystem. ?If I look solely at tracepoints,
a generic change would be to make the trace_##subsys function return
an int (which I think was the event_vfs_getname proposal?). ?The other
option is to change the trace_sys_enter proto to include a 'int
*retcode'.
That change would allow the propagation of some sort of policy
information. ?To put it to use, seccomp mode 1 could be implemented on
1. dummy metadata should be inserted for all unhooked system calls
2. perf_trace_buf_submit would need to return an int or a new
TRACE_REG_SECCOMP_REGISTER handler would need to be setup in
syscall_enter_register.
3. If perf is abused, a "kill/fail_on_discard" bit would be added to
event->attrs.
4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches,
'n' for the number of event matches, and -EACCES/? if a
'fail_on_discard' event is seen.
5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
6. trace_sys_enter() would need to be moved to be the first entry
arch/../kernel/ptrace.c for incoming syscalls
7. if trace_sys_enter() yields a negative return code, then
do_exit(SIGKILL) the process and return.
Entering into seccomp mode 1 would require adding a ?"0" filter for
every system call number (which is why we need a dummy event call for
them since failing to check the bitmask can't be flagged
fail_on_discard) with the fail_on_discard bit. ?For the three calls
that are allowed, a '1' filter would be set.
That would roughly implement seccomp mode 1. ?It's pretty ugly and the
fact that every system call that's disallowed has to be blacklisted is
not ideal. ?An alternate model would be to just use the seccomp mode
as we do today and let secure_computing() handle the return code of "#
of matches". ?If it the # of matches is 0, it terminates. A
'fail_on_discard' bit then would only be good to stop further
tracepoint callback evaluation. ?This approach would also *almost* nix
the need to provide dummy syscall hooks. ?(Since sigreturn isn't
hooked on x86 because it uses ptregs fixup, a dummy would still be
needed to apply a "1" filter to.)
Even with that tweak to move to a whitelist model, the perf event
evaluation and tracepoint callback ordering is still not guaranteed.
Without changing tracepoint itself, all other TPs will still execute.
And for perf events, it'll be first-come-first-serve until a
fail_on_discard is hit.
After using seccomp mode=1 as the sample case to reduce scope, it's
possible to ignore it for now :) and look at the seccomp-filter/mode=2
case. ?The same mechanism could be used to inject more expressive
filters. ?This would be done over the sys_perf_event_open() interface
assuming the new attr is added to stop perf event list enumeration.
Assuming a whitelist model, a call to prctl(PR_SET_SECCOMP, 2) would
be needed (maybe with the ON_EXEC flag option too to mirror the
event->attr on-exec bit). That would yield the ability to register
perf events for system calls then use ioctl() to SET_FILTER on them.
Reducing the privileges of the filters after installation could be
done with another attribute bit like 'set_filter_ands'. ?If that is
also set on the event, and a filter is installed to ensure that
sys_perf_event_open is blocked, then it should be reasonably sane.
I'd need to add a PERF_EVENT_IOC_GET_FILTER handler to allow
extracting the settings.
Clearly, I haven't written the code for that yet, though making the
change for a single platform shouldn't be too much code.
- Is this the type of reuse that was being encouraged?
- Does it really make sense to cram this through the perf interface
and events? ?While the changed attributes are innocuous and
potentially reusable, it seems that a large amount of the perf
facilities are exposed that could have weird side effects, but I'm
sure I still don't fully grok the perf infrastructure.
- Does extending one tracepoint to provide return values via a pointer
make sense? I'd hesitate to make all tracepoint hooks return an int by
default.
- If all tracepoints returned an int, what would the standard value
look like - or would it need to be per tracepoint impl?
- How is ambiguity resolved if multiple perf_events are registered for
a syscall with different filters? ?Maybe a 'stop_on_match'? though
ordering is still a problem then.
- Is there a way to affect a task-wide change without a seccomp flag
(or new task_struct entry) via the existing sys_perf_event_open
syscall? ?I considered suggesting a attr bit call 'secure_computing'
that when an event with the bit is enabled, it automagically forces
the task into seccomp enforcement mode, but that, again, seemed
hackish.
While I like the idea of sharing the tracepoints infrastructure and
the trace_syscalls hooks as well as using a pre-existing interface
with very minor changes, I'm worried that the complexity of the
interface and of the infrastructure might undermine the ability to
continue meeting the desired security goals. ?I had originally stayed
very close to the seccomp world because of how trivial it is to review
the code and verify its accuracy/effectiveness. ?This approach leaves
a lot of gaps for kernel code to seep through and a fair amount of
ambiguity in what locked down syscall filters might look like.
To me, the best part of the above is that it shows that even if we go
with a prctl SET_SECCOMP_FILTER-style interface, it is completely
certain that if a perf/ftrace-based security infrastructure is on our
future, it will be entirely compatible -- even if the prctl()
interface is just the "simpler" interface at that point somewhere down
the road.
Regardless, I'll hack up a proof of concept based on the outline
above. Perhaps something more elegant will emerge once I start
tweaking the source, but I'm still seeing too many gaps to be
comfortable so far.
[[There is a whole other approach to this too. We could continue with
the prctl() interface and mirror the trace_sys_enter model for
secure_computing(). ? Instead of keeping a seccomp_t-based hlist of
events, they could be stored in a new hlist for seccomp_events in
struct ftrace_event_call. ?The ftrace filters would be installed there
and the seccomp_syscall_enter() function could do the checks and pass
up some state data on the task's seccomp_t that indicates it needs to
do_exit(). ?That would likely reduce the amount of code in
seccomp_filter.c pretty seriously (though not entirely
beneficially).]]
Thanks again for all the feedback and insights! I really hope we can
come to an agreeable approach for implementing kernel attack surface
reduction.
Just a quick follow up. I have a PoC of the perf-based system call
filtering code in place which uses seccomp.mode as a task_context
flag, adds require_secure and err_on_discard perf_event_attrs, and
enforces these new events terminate the process in perf_syscall_enter
via a return value on perf_buf_submit. This lets a call like:

LD_LIBRARY_PATH=/host/home/wad/kernel.uml/tests/
/host/home/wad/kernel.uml/tests/perf record \
-S \
-e 'syscalls:sys_enter_access' \
-e 'syscalls:sys_enter_brk' \
-e 'syscalls:sys_enter_close' \
-e 'syscalls:sys_enter_exit_group' \
-e 'syscalls:sys_enter_fcntl64' \
-e 'syscalls:sys_enter_fstat64' \
-e 'syscalls:sys_enter_getdents64' \
-e 'syscalls:sys_enter_getpid' \
-e 'syscalls:sys_enter_getuid' \
-e 'syscalls:sys_enter_ioctl' \
-e 'syscalls:sys_enter_lstat64' \
-e 'syscalls:sys_enter_mmap_pgoff' \
-e 'syscalls:sys_enter_mprotect' \
-e 'syscalls:sys_enter_munmap' \
-e 'syscalls:sys_enter_open' \
-e 'syscalls:sys_enter_read' \
-e 'syscalls:sys_enter_stat64' \
-e 'syscalls:sys_enter_time' \
-e 'syscalls:sys_enter_newuname' \
-e 'syscalls:sys_enter_write' --filter "fd == 1" \
/bin/ls

run successfully while omitting a syscall or passing in --filter "fd
== 0" properly terminates it. (ignoring the need for execve and
set_thread_area for PoC purposes).

The change avoids defining a new trace call type or a huge number of
internal changes and hides seccomp.mode=2 from ABI-exposure in prctl,
but the attack surface is non-trivial to verify, and I'm not sure if
this ABI change makes sense. It amounts to:

include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)

And can be found here: http://static.dataspill.org/perf_secure/v1/

If there is any interest at all, I can post it properly to this giant
CC list. However, it's unclear to me where this thread stands. I
also see a completely independent path for implementing system call
filtering now, but I'll hold off posting that patch until I see where
this goes.

Any guidance will be appreciated - thanks again!
will
Peter Zijlstra
2011-05-24 16:20:27 UTC
Permalink
Post by Will Drewry
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
Thomas Gleixner
2011-05-24 16:25:28 UTC
Permalink
Post by Peter Zijlstra
Post by Will Drewry
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
Amen. We have enough crap to cleanup in perf/ftrace already, so we
really do not need security magic added to it.

Thanks,

tglx
Will Drewry
2011-05-24 19:00:56 UTC
Permalink
Post by Thomas Gleixner
Post by Peter Zijlstra
?include/linux/ftrace_event.h ?| ? ?4 +-
?include/linux/perf_event.h ? ?| ? 10 +++++---
?kernel/perf_event.c ? ? ? ? ? | ? 49 +++++++++++++++++++++++++++++++++++++---
?kernel/seccomp.c ? ? ? ? ? ? ?| ? ?8 ++++++
?kernel/trace/trace_syscalls.c | ? 27 +++++++++++++++++-----
?5 files changed, 82 insertions(+), 16 deletions(-)
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
Amen. We have enough crap to cleanup in perf/ftrace already, so we
really do not need security magic added to it.
Thanks for the quick responses!

I agree, but I'm left a little bit lost now w.r.t. the comments around
reusing the ABI. If perf doesn't make sense (which certainly seems
wrong from a security interface perspective), then the preexisting
ABIs I know of for this case are as follows:
- /sys/kernel/debug/tracing/*
- prctl(PR_SET_SECCOMP* (or /proc/...)

Both would require expansion. The latter was reused by the original
patch series. The former doesn't expose much in the way of per-task
event filtering -- ftrace_pids doesn't translate well to
ftrace_syscall_enter-based enforcement. I'd expect we'd need
ftrace_event_call->task_events (like ->perf_events), and either
explore them in ftrace_syscall_enter or add a new tracepoint handler,
ftrace_task_syscall_enter, via something like TRACE_REG_TASK_REGISTER.
It could then do whatever it wanted with the successful or
unsuccessful matching against predicates, stacking or not, which could
be used for a seccomp-like mechanism. However, bubbling that change
up to the non-existent interfaces in debug/tracing could be a
challenge too (Registration would require an alternate flow like perf
to call TRACE_REG_*? Do they become
tracing/events/subsystem/event/task/<tid>/<filter_string_N>? ...?).

This is all just a matter of programming... but at this point, I'm not
seeing the clear shared path forward. Even with per-task ftrace
access in debug/tracing, that would introduce a reasonably large
change to the system and add a new ABI, albeit in debug/tracing. If
the above (or whatever the right approach is) comes into existence,
then any prctl(PR_SET_SECCOMP) ABI could have the backend
implementation to modify the same data. I'm not putting it like this
to say that I'm designing to be obsolete, but to show that the defined
interface wouldn't conflict if ftrace does overlap more in the future.
Given the importance of a clearly defined interface for security
functionality, I'd be surprised to see all the pieces come together in
the near future in such a way that a transition would be immediately
possible -- I'm not even sure what the ftrace roadmap really is!

Would it be more desirable to put a system call filtering interface on
a miscdev (like /dev/syscall_filter) instead of in /proc or prctl (and
not reuse seccomp at all)? I'm not clear what the onus is to justify
a change in the different ABI areas, but I see system call filtering
as an important piece of system security and would like to determine
if there is a viable path forward, or if this will need to be
revisited in another 2 years.

thanks again!
will
Ingo Molnar
2011-05-24 19:54:35 UTC
Permalink
Post by Peter Zijlstra
Post by Will Drewry
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
I'd object to invisible side-effects as well, and vehemently so. But note how
intelligently it's used here: it's explicit in the code, it's used explicitly
in kernel/seccomp.c and the event generation place in
kernel/trace/trace_syscalls.c.

So this is a really flexible solution IMO and does not extend events with some
invisible 'active' role. It extends the *call site* with an open-coded active
role - which active role btw. already pre-existed.

Thanks,

Ingo
Ingo Molnar
2011-05-24 20:10:53 UTC
Permalink
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Will Drewry
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
I'd object to invisible side-effects as well, and vehemently so. But note how
intelligently it's used here: it's explicit in the code, it's used explicitly
in kernel/seccomp.c and the event generation place in
kernel/trace/trace_syscalls.c.
So this is a really flexible solution IMO and does not extend events with
some invisible 'active' role. It extends the *call site* with an open-coded
active role - which active role btw. already pre-existed.
Also see my other mail - i think this seccomp code is too tied in to the perf
core and ABI - but this is fixable IMO.

The fundamental notion that a generator subsystem of events can use filter
results as well (such as kernel/trace/trace_syscalls.c.) for its own purposes
is pretty robust though.

Thanks,

Ingo
Thomas Gleixner
2011-05-25 10:35:03 UTC
Permalink
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Will Drewry
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
I'd object to invisible side-effects as well, and vehemently so. But note how
intelligently it's used here: it's explicit in the code, it's used explicitly
in kernel/seccomp.c and the event generation place in
kernel/trace/trace_syscalls.c.
So this is a really flexible solution IMO and does not extend events with some
invisible 'active' role. It extends the *call site* with an open-coded active
role - which active role btw. already pre-existed.
We do _NOT_ make any decision based on the trace point so what's the
"pre-existing" active role in the syscall entry code?

I'm all for code reuse and reuse of interfaces, but this is completely
wrong. Instrumentation and security decisions are two fundamentally
different things and we want them kept separate. Instrumentation is
not meant to make decisions. Just because we can does not mean that it
is a good idea.

So what the current approach does is:

- abuse the existing ftrace syscall hook by adding a return value to
the tracepoint.

So we need to propagate that for every tracepoint just because we
have a single user.

- abuse the perf per task mechanism

Just because we have per task context in perf does not mean that we
pull everything and the world which requires per task context into
perf. The security folks have per task context already so security
related stuff wants to go there.

- abuse the perf/ftrace interfaces

One of the arguments was that perf and ftrace have permission which
are not available from the existing security interfaces. That's not
at all a good reason to abuse these interfaces. Let the security
folks sort out the problem on their end and do not impose any
expectations on perf/ftrace which we have to carry around forever.

Yes, it can be made working with a relatively small patch, but it has
a very nasty side effect:

You add another user space visible ABI to the existing perf/ftrace
mess which needs to be supported forever.

Brilliant, we have already two ABIs (perf/ftrace) to support and at
the same time we urgently need to solve the problem of better
integration of those two. So adding a third completely unrelated
component with a guaranteed ABI is just making this even more complex.

We can factor out the filtering code and let the security dudes reuse
it for their own purposes. That makes them to have their own
interfaces and does not impose any restrictions upon the tracing/perf
ones. And really security stuff wants to be integrated into the
existing security frameworks and not duct taped into perf/trace just
because it's a conveniant hack around limitiations of the existing
security stuff.

You really should stop to see everything as a nail just because the
only tool you have handy is the perf hammer. perf is about
instrumentation and we don't want to violate the oldest principle of
unix to have simple tools which do one thing and do it good.

Even swiss army knifes have the restriction that you can use only one
tool at a time unless you want to stick the corkscrew through your
palm when you try to cut bread.

Thanks,

tglx
Ingo Molnar
2011-05-25 15:01:53 UTC
Permalink
Post by Thomas Gleixner
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Will Drewry
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
I'd object to invisible side-effects as well, and vehemently so. But note how
intelligently it's used here: it's explicit in the code, it's used explicitly
in kernel/seccomp.c and the event generation place in
kernel/trace/trace_syscalls.c.
So this is a really flexible solution IMO and does not extend events with some
invisible 'active' role. It extends the *call site* with an open-coded active
role - which active role btw. already pre-existed.
We do _NOT_ make any decision based on the trace point so what's the
"pre-existing" active role in the syscall entry code?
The seccomp code we are discussing in this thread.
Post by Thomas Gleixner
I'm all for code reuse and reuse of interfaces, but this is completely
wrong. Instrumentation and security decisions are two fundamentally
different things and we want them kept separate. Instrumentation is
not meant to make decisions. Just because we can does not mean that it
is a good idea.
Instrumentation does not 'make decisions': the calling site, which is
already emitting both the event and wants to do decisions based on
the data that also generates the event wants to do decisions.

Those decisions *will be made* and you cannot prevent that, the only
open question is can it reuse code intelligently, which code it is
btw. already calling for observation reasons?

( Note that pure observers wont be affected and note that pure
observation call sites are not affected either. )
Post by Thomas Gleixner
- abuse the existing ftrace syscall hook by adding a return value to
the tracepoint.
So we need to propagate that for every tracepoint just because we
have a single user.
This is a technical criticism i share with you and i think it can be
fixed - i outlined it to Will yesterday.

And no, if done cleanly it's not 'abuse' to reuse code. Could we wait
for the first clean iteration of this patch instead of rushing
judgement prematurely?
Post by Thomas Gleixner
- abuse the perf per task mechanism
Just because we have per task context in perf does not mean that we
pull everything and the world which requires per task context into
perf. The security folks have per task context already so security
related stuff wants to go there.
We do not pull 'everything and the world' in, but code that wants to
process events in places that already emit events surely sounds
related to me :-)
Post by Thomas Gleixner
- abuse the perf/ftrace interfaces
One of the arguments was that perf and ftrace have permission which
are not available from the existing security interfaces. That's not
at all a good reason to abuse these interfaces. Let the security
folks sort out the problem on their end and do not impose any
expectations on perf/ftrace which we have to carry around forever.
Yes, it can be made working with a relatively small patch, but it has
You add another user space visible ABI to the existing perf/ftrace
mess which needs to be supported forever.
What mess? I'm not aware of a mess - other than the ftrace API which
is not used by this patch.
Post by Thomas Gleixner
Brilliant, we have already two ABIs (perf/ftrace) to support and at
the same time we urgently need to solve the problem of better
integration of those two. So adding a third completely unrelated
component with a guaranteed ABI is just making this even more complex.
So your solution is to add yet another ABI for seccomp and to keep
seccomp a limited hack forever, just because you are not interested
in security?

I think we want fewer ABIs and more flexible/reusable facilities.
Post by Thomas Gleixner
We can factor out the filtering code and let the security dudes
reuse it for their own purposes. That makes them to have their own
interfaces and does not impose any restrictions upon the
tracing/perf ones. And really security stuff wants to be integrated
into the existing security frameworks and not duct taped into
perf/trace just because it's a conveniant hack around limitiations
of the existing security stuff.
You are missing what i tried to point out in earlier discussions:
from a security design POV this isnt just about the system call
boundary. If this seccomp variant is based on events then it could
grow proper security checks in other places as well, in places where
we have some sort of object observation event anyway.

So this is opens up possibilities to reuse and unify code on a very
broad basis.
Post by Thomas Gleixner
You really should stop to see everything as a nail just because the
only tool you have handy is the perf hammer. perf is about
instrumentation and we don't want to violate the oldest principle
of unix to have simple tools which do one thing and do it good.
That is one of the most misunderstood principles of Unix.

The original Unix tool landscape was highly *integrated* and
*unified*, into a very tight codebase that was maintained together.
Yes, there were different, atomic, simple commands but it was all
done together and it was a coherent whole and pleasant to use!

People misunderstood this as a license to fragment the heck out
functionality and build 'simple and stupid' tools that fit nowhere
really and use different, incompatible principles not synced with
each other. That is wrong and harmful.

So yes, over-integration is obviously wrong - but so is needless
fragmentation.

Anyway, i still reserve judgement on this seccomp bit but the patches
so far looked very promising with a very surprisingly small diffstat.

If anything then that should tell you something that events and
seccomp are not just casually related ...
Post by Thomas Gleixner
Even swiss army knifes have the restriction that you can use only
one tool at a time unless you want to stick the corkscrew through
your palm when you try to cut bread.
I'm not sure what you are arguing here - do you pine for the DOS days
where you could only use one command at a time?

Thanks,

Ingo
Peter Zijlstra
2011-05-25 17:43:22 UTC
Permalink
Post by Ingo Molnar
Post by Thomas Gleixner
We do _NOT_ make any decision based on the trace point so what's the
"pre-existing" active role in the syscall entry code?
The seccomp code we are discussing in this thread.
That isn't pre-existing, that's proposed.

But face it, you can argue until you're blue in the face, but both tglx
and I will NAK any and all patches that extend perf/ftrace beyond the
passive observing role.

Your arguments appear to be as non-persuasive to us as ours are to you,
so please drop this endeavor and let the security folks sort it on their
own and let's get back to doing useful work.
Ingo Molnar
2011-05-29 20:17:13 UTC
Permalink
Post by Peter Zijlstra
But face it, you can argue until you're blue in the face,
That is not a technical argument though - and i considered and
answered every valid technical argument made by you and Thomas.
You were either not able to or not willing to counter them.
Post by Peter Zijlstra
[...] but both tglx and I will NAK any and all patches that extend
perf/ftrace beyond the passive observing role.
The thing is, perf is *already* well beyond the 'passive observer'
role: we already generate lots of 'action' in response to events. We
generate notification signals, we write events - all of which can
(and does) modify program behavior.

So what's your point? There's no "passive observer" role really -
it's apparently just that you dislike this use of instrumentation
while you approve of other uses.

Thanks,

Ingo
Thomas Gleixner
2011-05-25 17:48:51 UTC
Permalink
Post by Ingo Molnar
Post by Thomas Gleixner
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Will Drewry
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
I'd object to invisible side-effects as well, and vehemently so. But note how
intelligently it's used here: it's explicit in the code, it's used explicitly
in kernel/seccomp.c and the event generation place in
kernel/trace/trace_syscalls.c.
So this is a really flexible solution IMO and does not extend events with some
invisible 'active' role. It extends the *call site* with an open-coded active
role - which active role btw. already pre-existed.
We do _NOT_ make any decision based on the trace point so what's the
"pre-existing" active role in the syscall entry code?
The seccomp code we are discussing in this thread.
That's proposed code and has absolutely nothing to do with the
existing trace point semantics.
Post by Ingo Molnar
Post by Thomas Gleixner
I'm all for code reuse and reuse of interfaces, but this is completely
wrong. Instrumentation and security decisions are two fundamentally
different things and we want them kept separate. Instrumentation is
not meant to make decisions. Just because we can does not mean that it
is a good idea.
Instrumentation does not 'make decisions': the calling site, which is
already emitting both the event and wants to do decisions based on
the data that also generates the event wants to do decisions.
You can repeat that as often as you want, it does not make it more
true. Fact is that the decision is made in the middle of the perf code.
Post by Ingo Molnar
+ /* Transition the task if required. */
+ if (ctx->type == task_context && event->attr.require_secure) {
+#ifdef CONFIG_SECCOMP
+ /* Don't allow perf events to escape mode = 1. */
+ if (!current->seccomp.mode)
+ current->seccomp.mode = 2;
+#endif
+ }
and further down
Post by Ingo Molnar
+ if (event->attr.err_on_discard)
+ ok = -EACCES;
Those decisions *will be made* and you cannot prevent that, the only
open question is can it reuse code intelligently, which code it is
btw. already calling for observation reasons?
The tracepoint is called for observation reasons and now you make it a
decision function. That's what I call abuse.
Post by Ingo Molnar
( Note that pure observers wont be affected and note that pure
observation call sites are not affected either. )
Hahaha, they still have to run through the additional code when
seccomp is enabled and we still have to propagate the return value
down to the point where the tracepoint itself is. You call that not
affected?
Post by Ingo Molnar
Post by Thomas Gleixner
- abuse the existing ftrace syscall hook by adding a return value to
the tracepoint.
So we need to propagate that for every tracepoint just because we
have a single user.
This is a technical criticism i share with you and i think it can be
fixed - i outlined it to Will yesterday.
And no, if done cleanly it's not 'abuse' to reuse code. Could we wait
for the first clean iteration of this patch instead of rushing
judgement prematurely?
There is no way to do it cleanly. It always comes for the price that
you add additional code into the tracing code path. And there are
other people who try hard to remove stuff to recude the overhead which
is caused by instrumentation.
Post by Ingo Molnar
Post by Thomas Gleixner
- abuse the perf per task mechanism
Just because we have per task context in perf does not mean that we
pull everything and the world which requires per task context into
perf. The security folks have per task context already so security
related stuff wants to go there.
We do not pull 'everything and the world' in, but code that wants to
process events in places that already emit events surely sounds
related to me :-)
We have enough places where different independent parts of the kernel
want to hook into for obvious reasons.

We have notifiers for those where performance does not matter much and
we have separate calls into the independent functions where it matters
or where we need to evaluate the results in specific ways.

So now you turn instrumentation into a security mechanism, which
"works" nicely for a particular purpose, i.e. decision on a particular
syscall number. Now, how do you make that work when a decision has to
be made on more than a simple match, e.g. syscall number + arguments ?

Not at all, unless you add more complexity and arbitrary callbacks
into the instrumentation code.
Post by Ingo Molnar
Post by Thomas Gleixner
Brilliant, we have already two ABIs (perf/ftrace) to support and at
the same time we urgently need to solve the problem of better
integration of those two. So adding a third completely unrelated
component with a guaranteed ABI is just making this even more complex.
So your solution is to add yet another ABI for seccomp and to keep
seccomp a limited hack forever, just because you are not interested
in security?
Well, I'm interested in security, but I'm neither interested in
security decisions inside instrumentation code nor in security models
which are limited hacks by definition unless you want to add callback
complexities to the instrumentation code.

It all looks nice and charming with this minimalistic use case, but
add real features to it and it gets messy in a split second and you
can't hold that off once you started to allow A. And you clearly
stated that you want to have more trace point based security features
than the simple syscall number filtering.
Post by Ingo Molnar
I think we want fewer ABIs and more flexible/reusable facilities.
I'm all for that, but security and instrumentation are different
beasts.
Post by Ingo Molnar
Post by Thomas Gleixner
We can factor out the filtering code and let the security dudes
reuse it for their own purposes. That makes them to have their own
interfaces and does not impose any restrictions upon the
tracing/perf ones. And really security stuff wants to be integrated
into the existing security frameworks and not duct taped into
perf/trace just because it's a conveniant hack around limitiations
of the existing security stuff.
from a security design POV this isnt just about the system call
boundary. If this seccomp variant is based on events then it could
grow proper security checks in other places as well, in places where
we have some sort of object observation event anyway.
Right, that requires callbacks and more stuff to do object based
observation and ties a trace point into a place where it might not
make sense after a while, but the security decision wants to stay at
that place. The syscall example is so tempting because it's simplistic
and easy to solve, but every extension to that model is going to
create a nightmare.

You are duct taping stuff together which has totally different
semantics and requirements. And your only argument is reuse of
existing code. That's a good argument in principle, but there is a
fundamental difference between intelligent reuse and enforcing it just
for reuse sake.
Post by Ingo Molnar
So this is opens up possibilities to reuse and unify code on a very
broad basis.
By making a total mess out of it?
Post by Ingo Molnar
So yes, over-integration is obviously wrong - but so is needless
fragmentation.
Right. And this falls into the category of over-integration. We have
people working on security and they are working on stacked security
models, so where is the justification to start another security
framework inside the instrumentation code which is completely non
interoperable with the existing ones?
Post by Ingo Molnar
If anything then that should tell you something that events and
seccomp are not just casually related ...
They happen to have the hook at the same point in the source and for
pure coincidence it works because the problem to solve is extremly
simplistic. And that's why the diffstat is minimalistic, but that does
not prove anything.

Thanks,

tglx
Ingo Molnar
2011-05-26 08:43:41 UTC
Permalink
Post by Thomas Gleixner
Post by Ingo Molnar
Post by Thomas Gleixner
We do _NOT_ make any decision based on the trace point so
what's the "pre-existing" active role in the syscall entry
code?
The seccomp code we are discussing in this thread.
That's proposed code and has absolutely nothing to do with the
existing trace point semantics.
So because it's proposed code it does not exist?

If the feature is accepted (and given Linus's opinion it's not clear
at all it's accepted in any form) then it's obviously a very
legitimate technical concern whether we do:

ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5);
if (ret)
return -EACCES;

... random code ...

trace_syscall_event(p1, p2, p3, p4, p5);

Where seccomp_check_syscall_event() duplicates much of the machinery
that is behind trace_syscall_event().

Or we do the more intelligent:

ret = check_syscall_event(p1, p2, p3, p4, p5);
if (ret)
return -EACCES;

Where we have the happy side effects of:

- less code at the call site

- (a lot of!) shared infrastructure between the proposed seccomp
code and event filters.

- we'd also be able to trace at security check boundaries - which
has obvious bug analysis advantages.

In fact i do not see *any* advantages in keeping this needlessly
bloaty and needlessly inconsistently sampled form of instrumentation:

ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5);
if (ret)
return -EACCES;

... random code ...

trace_syscall_event(p1, p2, p3, p4, p5);

Do you?

Thanks,

Ingo
Ingo Molnar
2011-05-26 09:15:18 UTC
Permalink
Post by Thomas Gleixner
Post by Ingo Molnar
If anything then that should tell you something that events and
seccomp are not just casually related ...
They happen to have the hook at the same point in the source and
for pure coincidence it works because the problem to solve is
extremly simplistic. And that's why the diffstat is minimalistic,
but that does not prove anything.
Here are the diffstats of the various versions of this proposed
security feature:

bitmask (2009): 6 files changed, 194 insertions(+), 22 deletions(-)
filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
event filters (2011): 5 files changed, 82 insertions(+), 16 deletions(-)

The third variant, 'event filters', is actually the most
sophisticated one of all and it is not simplistic at all.

The main reason why the diffstat is small is because it reuses over
ten thousand lines of pre-existing kernel code intelligently. Are you
interpreting that as some sort of failure of the patch? I think it's
a very good thing.

To demonstrate the non-simplicity of the feature:

- These security rules/filters can be sophisticated like:

sys_close() rule protecting against the closing of
stdin/stdout/stderr:

"fd == 0 || fd == 1 || fd == 2"

sys_ioperm() rule allowing port 0x80 access but nothing else:

"from != 128 || num != 1"

sys_listen() rule limiting the max accept() backlog to 16 entries:

"backlog > 16"

sys_mprotect(), sys_mmap[2](), sys_unmap() and sys_mremap() rule
protecting the first 1 MB NULL pointer guard range:

"addr < 0x00100000"

sys_setscheduler() rule protecting against the switch to
non-SCHED_OTHER scheduler policies:

"policy != 0"

Most of these examples are finegrained access restrictions that
AFAIK are not possible with any of the LSM based security measures
that Linux offers today.

- These security rules/filters can be safely used and installed by
unprivileged userspace, allowing arbitrary end user apps to define
their own, flexible security policies.

- These security rules/filters get automatically inherited into child
tasks and child tasks cannot mess with them - they cannot even
query/observe that these filters *exist*.

- These security rules/filters nest on each other in basically
arbitrary depth, giving us a working, implemented, stackable LSM
concept.

- These security rules/filters can be extended to arbitrary more
object lifetime events in the future, without changing the ABI.

- These security rules/filters, unlike most LSM rules, can execute
not just within hardirqs but also within deeply atomic contexts
such as NMI contexts, putting far less restrictions on what can
be security/access checked.

- Access permission violations can be set up to generate events of
the violations into a scalable ring-buffer, providing unprivileged
security-auditing functionality to the managing task(s).

I'd call that anything but 'simplistic'.

Thanks,

Ingo
Ingo Molnar
2011-05-24 20:08:15 UTC
Permalink
The change avoids defining a new trace call type or a huge number of internal
changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack
surface is non-trivial to verify, and I'm not sure if this ABI change makes
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
And can be found here: http://static.dataspill.org/perf_secure/v1/
Wow, i'm very impressed how few changes you needed to do to support this!

So, firstly, i don't think we should change perf_tp_event() at all - the
'observer' codepaths should be unaffected.

But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that
code like seccomp which wants to use event results can use.

Also, i'm not sure about the seccomp details and assumptions that were moved
into the perf core. How about passing in a helper function to
perf_tp_event_check(), where seccomp would define its seccomp specific helper
function?

That looks sufficiently flexible. That helper function could be an 'extra
filter' kind of thing, right?

Also, regarding the ABI and the attr.err_on_discard and attr.require_secure
bits, they look a bit too specific as well.

attr.err_on_discard: with the filter helper function passed in this is probably
not needed anymore, right?

attr.require_secure: this is basically used to *force* the creation of
security-controlling filters, right? It seems to me that this could be done via
a seccomp ABI extension as well, without adding this to the perf ABI. That
seccomp call could check whether the right events are created and move the task
to mode 2 only if that prereq is met - or something like that.
If there is any interest at all, I can post it properly to this giant
CC list. [...]
I'd suggest to trim the Cc: list aggressively - anyone interested in the
discussion can pick it up on lkml - and i strongly suspect that most of the Cc:
participants would want to be off the Cc: :-)

Thanks,

Ingo
Steven Rostedt
2011-05-24 20:14:22 UTC
Permalink
Post by Ingo Molnar
But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that
code like seccomp which wants to use event results can use.
We should name it something else. The "perf_tp.." is a misnomer as it
has nothing to do with performance monitoring. "dynamic_event_.." maybe,
as it is dynamic to the affect that we can use jump labels to enable or
disable it.

-- Steve
Eric Paris
2011-05-13 15:17:52 UTC
Permalink
[dropping microblaze and roland]
Post by Ingo Molnar
Post by Peter Zijlstra
Post by Ingo Molnar
I think the sanest semantics is to run all active callbacks as well.
For example if this is used for three stacked security policies - as if 3 LSM
modules were stacked at once. We'd call all three, and we'd determine that at
least one failed - and we'd return a failure.
But that only works for boolean functions where you can return the
multi-bit-or of the result. What if you need to return the specific
error code.
Do you mean that one filter returns -EINVAL while the other -EACCES?
Seems like a non-problem to me, we'd return the first nonzero value.
Sounds so easy! Why haven't LSMs stacked already? Because what happens
if one of these hooks did something stateful? Lets say on open, hook #1
returns EPERM. hook #2 allocates memory. The open is going to fail and
hooks #2 is never going to get the close() which should have freed the
allocation. If you can be completely stateless its easier, but there's
a reason that stacking security modules is hard. Serge has tried in the
past and both dhowells and casey schaufler are working on it right now.
Stacking is never as easy as it sounds :)

-Eric
Ingo Molnar
2011-05-13 12:49:02 UTC
Permalink
Post by Peter Zijlstra
event_vfs_getname(result);
result = check_event_vfs_getname(result);
if one could do it all?
Did you actually read the bit where I said that check_event_* (although
I still think that name sucks) could imply a matching event_*?
No, did not notice that - and yes that solves this particular problem.

So given that by your own admission it makes sense to share the facilities at
the low level, i also argue that it makes sense to share as high up as
possible.

Are you perhaps arguing for a ->observe flag that would make 100% sure that the
default behavior for events is observe-only? That would make sense indeed.

Otherwise both cases really want to use all the same facilities for event
discovery, setup, control and potential extraction of events.

Thanks,

Ingo
Peter Zijlstra
2011-05-13 13:55:36 UTC
Permalink
Post by Ingo Molnar
So given that by your own admission it makes sense to share the facilities at
the low level, i also argue that it makes sense to share as high up as
possible.
I'm not saying any such thing, I'm saying that it might make sense to
observe active objects and auto-create these observation points. That
doesn't make them similar or make them share anything.
Ingo Molnar
2011-05-13 15:02:15 UTC
Permalink
Post by Peter Zijlstra
Post by Ingo Molnar
So given that by your own admission it makes sense to share the facilities at
the low level, i also argue that it makes sense to share as high up as
possible.
I'm not saying any such thing, I'm saying that it might make sense to
observe active objects and auto-create these observation points. That
doesn't make them similar or make them share anything.
Well, they would share the lowest level call site:

result = check_event_vfs_getname(result);

You call it 'auto-generated call site', i call it a shared (single line) call
site. The same thing as far as the lowest level goes.

Now (the way i understood it) you'd want to stop the sharing right after that.
I argue that it should go all the way up.

Note: i fully agree that there should be events where filters can have no
effect whatsoever. For example if this was written as:

check_event_vfs_getname(result);

Then it would have no effect. This is decided by the subsystem developers,
obviously. So whether an event is 'active' or 'passive' can be enforced at the
subsystem level as well.

As far as the event facilities go, 'no effect observation' is a special-case of
'active observation' - just like read-only files are a special case of
read-write files.

Thanks,

Ingo
Eric Paris
2011-05-13 15:10:49 UTC
Permalink
[dropping microblaze and roland]
It is a simple and sensible security feature, agreed? It allows most code to
run well and link to countless libraries - but no access to other files is
allowed.
It's simple enough and sounds reasonable, but you can read all the
discussion about AppArmour why many people don't really think it's the
best. Still, I'll agree it's a lot better than nothing.
But if i had a VFS event at the fs/namei.c::getname() level, i would have
access to a central point where the VFS string becomes stable to the kernel and
can be checked (and denied if necessary).
A sidenote, and not surprisingly, the audit subsystem already has an event
audit_getname(result);
Unfortunately this audit callback cannot be used for my purposes, because the
event is single-purpose for auditd and because it allows no feedback (no
deny/accept discretion for the security policy).
err = event_vfs_getname(result);
Wow it sounds so easy. Now lets keep extending your train of thought
until we can actually provide the security provided by SELinux. What do
we end up with? We end up with an event hook right next to every LSM
hook. You know, the LSM hooks were placed where they are for a reason.
Because those were the locations inside the kernel where you actually
have information about the task doing an operation and the objects
(files, sockets, directories, other tasks, etc) they are doing an
operation on.

Honestly all you are talking about it remaking the LSM with 2 sets of
hooks instead if 1. Why? It seems much easier that if you want the
language of the filter engine you would just make a new LSM that uses
the filter engine for it's policy language rather than the language
created by SELinux or SMACK or name your LSM implementation.
- unprivileged: application-definable, allowing the embedding of security
policy in *apps* as well, not just the system
- flexible: can be added/removed runtime unprivileged, and cheaply so
- transparent: does not impact executing code that meets the policy
- nestable: it is inherited by child tasks and is fundamentally stackable,
multiple policies will have the combined effect and they
are transparent to each other. So if a child task within a
sandbox adds *more* checks then those add to the already
existing set of checks. We only narrow permissions, never
extend them.
- generic: allowing observation and (safe) control of security relevant
parameters not just at the system call boundary but at other
relevant places of kernel execution as well: which
points/callbacks could also be used for other types of event
extraction such as perf. It could even be shared with audit ...
I'm not arguing that any of these things are bad things. What you
describe is a new LSM that uses a discretionary access control model but
with the granularity and flexibility that has traditionally only existed
in the mandatory access control security modules previously implemented
in the kernel.

I won't argue that's a bad idea, there's no reason in my mind that a
process shouldn't be allowed to control it's own access decisions in a
more flexible way than rwx bits. Then again, I certainly don't see a
reason that this syscall hardening patch should be held up while a whole
new concept in computer security is contemplated...

-Eric
Peter Zijlstra
2011-05-13 15:23:01 UTC
Permalink
Post by Eric Paris
Then again, I certainly don't see a
reason that this syscall hardening patch should be held up while a whole
new concept in computer security is contemplated...
Which makes me wonder why this syscall hardening stuff is done outside
of LSM? Why isn't is part of the LSM so that say SELinux can have a
syscall bitmask per security context?

Making it part of the LSM also avoids having to add this prctl().
Eric Paris
2011-05-13 15:55:34 UTC
Permalink
Post by Peter Zijlstra
Post by Eric Paris
Then again, I certainly don't see a
reason that this syscall hardening patch should be held up while a whole
new concept in computer security is contemplated...
Which makes me wonder why this syscall hardening stuff is done outside
of LSM? Why isn't is part of the LSM so that say SELinux can have a
syscall bitmask per security context?
I could do that, but I like Will's approach better. From the PoV of
meeting security goals of information flow, data confidentiality,
integrity, least priv, etc limiting on the syscall boundary doesn't make
a lot of sense. You just don't know enough there to enforce these
things. These are the types of goals that SELinux and other LSMs have
previously tried to enforce. From the PoV of making the kernel more
resistant to attacks and making a process more resistant to misbehavior
I think that the syscall boundary is appropriate. Although I could do
it in SELinux it don't really want to do it there.

In case people are interested or confused let me give my definition of
two words I've used a bit in these conversations: discretionary and
mandatory. Any time I talk about a 'discretionary' security decision it
is a security decisions that a process imposed upon itself. Aka the
choice to use seccomp is discretionary. The choice to mark our own file
u-wx is discretionary. This isn't the best definition but it's one that
works well in this discussion. Mandatory security is one enforce by a
global policy. It's what selinux is all about. SELinux doesn't give
hoot what a process wants to do, it enforces a global policy from the
top down. You take over a process, well, too bad, you still have no
choice but to follow the mandatory policy.

The LSM does NOT enforce a mandatory access control model, it's just how
it's been used in the past. Ingo appears to me (please correct me if
I'm wrong) to really be a fan of exposing the flexibility of the LSM to
a discretionary access control model. That doesn't seem like a bad
idea. And maybe using the filter engine to define the language to do
this isn't a bad idea either. But I think that's a 'down the road'
project, not something to hold up a better seccomp.
Post by Peter Zijlstra
Making it part of the LSM also avoids having to add this prctl().
Well, it would mean exposing some new language construct to every LSM
(instead of a single prctl construct) and it would mean anyone wanting
to use the interface would have to rely on the LSM implementing those
hooks the way they need it. Honestly chrome can already get all of the
benefits of this patch (given a perfectly coded kernel) and a whole lot
more using SELinux, but (surprise surprise) not everyone uses SELinux.
I think it's a good idea to expose a simple interface which will be
widely enough adopted that many userspace applications can rely on it
for hardening.

The existence of the LSM and the fact that there exists multiple
security modules that may or may not be enabled really leads application
developers to be unable to rely on LSM for security. If linux had a
single security model which everyone could rely on we wouldn't really
have as big of an issue but that's not possible. So I'm advocating for
this series which will provide a single useful change which applications
can rely upon across distros and platforms to enhance the properties and
abilities of the linux kernel.

-Eric
Will Drewry
2011-05-13 16:29:06 UTC
Permalink
Post by Peter Zijlstra
Post by Eric Paris
Then again, I certainly don't see a
reason that this syscall hardening patch should be held up while a whole
new concept in computer security is contemplated...
Which makes me wonder why this syscall hardening stuff is done outside
of LSM? Why isn't is part of the LSM so that say SELinux can have a
syscall bitmask per security context?
I could do that, but I like Will's approach better. ?From the PoV of
meeting security goals of information flow, data confidentiality,
integrity, least priv, etc limiting on the syscall boundary doesn't make
a lot of sense. ?You just don't know enough there to enforce these
things. ?These are the types of goals that SELinux and other LSMs have
previously tried to enforce. ?From the PoV of making the kernel more
resistant to attacks and making a process more resistant to misbehavior
I think that the syscall boundary is appropriate. ?Although I could do
it in SELinux it don't really want to do it there.
There's also the problem that there are no hooks per-system call for
LSMs, only logical hooks that sometimes mirror system call names and
are called after user data has been parsed. If system call enter
hooks, like seccomp's, were added for LSMs, it would allow the lsm
bitmask approach, but it still wouldn't satisfy the issues you raise
below (and I wholeheartedly agree with).
In case people are interested or confused let me give my definition of
two words I've used a bit in these conversations: discretionary and
mandatory. ?Any time I talk about a 'discretionary' security decision it
is a security decisions that a process imposed upon itself. ?Aka the
choice to use seccomp is discretionary. ?The choice to mark our own file
u-wx is discretionary. ?This isn't the best definition but it's one that
works well in this discussion. ?Mandatory security is one enforce by a
global policy. ?It's what selinux is all about. ?SELinux doesn't give
hoot what a process wants to do, it enforces a global policy from the
top down. ?You take over a process, well, too bad, you still have no
choice but to follow the mandatory policy.
The LSM does NOT enforce a mandatory access control model, it's just how
it's been used in the past. ?Ingo appears to me (please correct me if
I'm wrong) to really be a fan of exposing the flexibility of the LSM to
a discretionary access control model. ?That doesn't seem like a bad
idea. ?And maybe using the filter engine to define the language to do
this isn't a bad idea either. ?But I think that's a 'down the road'
project, not something to hold up a better seccomp.
Post by Peter Zijlstra
Making it part of the LSM also avoids having to add this prctl().
Well, it would mean exposing some new language construct to every LSM
(instead of a single prctl construct) and it would mean anyone wanting
to use the interface would have to rely on the LSM implementing those
hooks the way they need it. ?Honestly chrome can already get all of the
benefits of this patch (given a perfectly coded kernel) and a whole lot
more using SELinux, but (surprise surprise) not everyone uses SELinux.
I think it's a good idea to expose a simple interface which will be
widely enough adopted that many userspace applications can rely on it
for hardening.
The existence of the LSM and the fact that there exists multiple
security modules that may or may not be enabled really leads application
developers to be unable to rely on LSM for security. ?If linux had a
single security model which everyone could rely on we wouldn't really
have as big of an issue but that's not possible. ?So I'm advocating for
this series which will provide a single useful change which applications
can rely upon across distros and platforms to enhance the properties and
abilities of the linux kernel.
-Eric
Ingo Molnar
2011-05-14 07:30:15 UTC
Permalink
Post by Eric Paris
[dropping microblaze and roland]
It is a simple and sensible security feature, agreed? It allows most code to
run well and link to countless libraries - but no access to other files is
allowed.
It's simple enough and sounds reasonable, but you can read all the discussion
about AppArmour why many people don't really think it's the best. [...]
I have to say most of the participants of the AppArmour flamefests were dead
wrong, and it wasnt the AppArmour folks who were wrong ...

The straight ASCII VFS namespace *makes sense*, and yes, the raw physical
objects space that SELinux uses makes sense as well.

And no, i do not subscribe to the dogma that it is not possible to secure the
ASCII VFS namespace: it evidently is possible, if you know and handle the
ambiguitites. It is also obviously true that the ASCII VFS namespaces we use
every day are a *lot* more intuitive than the labeled physical objects space
...

What all the security flamewars missed is the simple fact that being intuitive
matters a *lot* not just to not annoy users, but also to broaden the effective
number of security-conscious developers ...
Post by Eric Paris
Unfortunately this audit callback cannot be used for my purposes, because
the event is single-purpose for auditd and because it allows no feedback
(no deny/accept discretion for the security policy).
err = event_vfs_getname(result);
Wow it sounds so easy. Now lets keep extending your train of thought
until we can actually provide the security provided by SELinux. What do
we end up with? We end up with an event hook right next to every LSM
hook. You know, the LSM hooks were placed where they are for a reason.
Because those were the locations inside the kernel where you actually
have information about the task doing an operation and the objects
(files, sockets, directories, other tasks, etc) they are doing an
operation on.
Honestly all you are talking about it remaking the LSM with 2 sets of
hooks instead if 1. Why? [...]
Not at all. I am taking about using *one* set of events, to keep the intrusion
at the lowest possible level.

LSM could make use of them as well.

Obviously for pragmatic reasons that might not be feasible initially.
Post by Eric Paris
[...] It seems much easier that if you want the language of the filter
engine you would just make a new LSM that uses the filter engine for it's
policy language rather than the language created by SELinux or SMACK or name
your LSM implementation.
Correct, that is what i suggested.

Note that performance is a primary concern, so if certain filters are very
popular then in practice this would come with support for a couple of 'built
in' (pre-optimized) filters that the kernel can accelerate directly, so that we
do not incure the cost of executing the filter preds for really common-sense
security policies that almost everyone is using.

I.e. in the end we'd *roughly* end up with the same performance and security as
we are today (i mean, SELinux and the other LSMs did a nice job of collecting
the things that apps should be careful about), but the crutial difference isnt
just the advantages i menioned, but the fact that the *development model* of
security modules would be a *lot* more extensible.

So security efforts could move to a whole different level: they could move into
key apps and they could integrate with the general mind-set of developers.

At least Will as an application framework developer cares, so that hope is
justified i think.
Post by Eric Paris
- unprivileged: application-definable, allowing the embedding of security
policy in *apps* as well, not just the system
- flexible: can be added/removed runtime unprivileged, and cheaply so
- transparent: does not impact executing code that meets the policy
- nestable: it is inherited by child tasks and is fundamentally stackable,
multiple policies will have the combined effect and they
are transparent to each other. So if a child task within a
sandbox adds *more* checks then those add to the already
existing set of checks. We only narrow permissions, never
extend them.
- generic: allowing observation and (safe) control of security relevant
parameters not just at the system call boundary but at other
relevant places of kernel execution as well: which
points/callbacks could also be used for other types of event
extraction such as perf. It could even be shared with audit ...
I'm not arguing that any of these things are bad things. What you describe
is a new LSM that uses a discretionary access control model but with the
granularity and flexibility that has traditionally only existed in the
mandatory access control security modules previously implemented in the
kernel.
I won't argue that's a bad idea, there's no reason in my mind that a process
shouldn't be allowed to control it's own access decisions in a more flexible
way than rwx bits. Then again, I certainly don't see a reason that this
syscall hardening patch should be held up while a whole new concept in
computer security is contemplated...
Note, i'm not actually asking for the moon, a pony and more.

I fully submit that we are yet far away from being able to do a full LSM via
this mechanism.

What i'm asking for is that because the syscall point steps taken by Will look
very promising so it would be nice to do *that* in a slightly more flexible
scheme that does not condemn it to be limited to the syscall boundary and such
...

Also, to answer you, do you say that by my argument someone should have stood
up and said 'no' to the LSM mess that was introduced a couple of years ago and
which caused so many problems:

- kernel inefficiencies and user-space overhead

- stalled security efforts

- infighting

- friction, fragmentation, overmodularization

- non-stackability

- security annoyances on the Linux desktop

- probably *less* Linux security

and should have asked them to do something better designed instead?

Thanks,

Ingo
Will Drewry
2011-05-14 20:57:55 UTC
Permalink
Post by Ingo Molnar
Post by Eric Paris
[dropping microblaze and roland]
It is a simple and sensible security feature, agreed? It allows most code to
run well and link to countless libraries - but no access to other files is
allowed.
It's simple enough and sounds reasonable, but you can read all the discussion
about AppArmour why many people don't really think it's the best. [...]
I have to say most of the participants of the AppArmour flamefests were dead
wrong, and it wasnt the AppArmour folks who were wrong ...
The straight ASCII VFS namespace *makes sense*, and yes, the raw physical
objects space that SELinux uses makes sense as well.
And no, i do not subscribe to the dogma that it is not possible to secure the
ASCII VFS namespace: it evidently is possible, if you know and handle the
ambiguitites. It is also obviously true that the ASCII VFS namespaces we use
every day are a *lot* more intuitive than the labeled physical objects space
...
What all the security flamewars missed is the simple fact that being intuitive
matters a *lot* not just to not annoy users, but also to broaden the effective
number of security-conscious developers ...
Post by Eric Paris
Unfortunately this audit callback cannot be used for my purposes, because
the event is single-purpose for auditd and because it allows no feedback
(no deny/accept discretion for the security policy).
? ? err = event_vfs_getname(result);
Wow it sounds so easy. ?Now lets keep extending your train of thought
until we can actually provide the security provided by SELinux. ?What do
we end up with? ?We end up with an event hook right next to every LSM
hook. ?You know, the LSM hooks were placed where they are for a reason.
Because those were the locations inside the kernel where you actually
have information about the task doing an operation and the objects
(files, sockets, directories, other tasks, etc) they are doing an
operation on.
Honestly all you are talking about it remaking the LSM with 2 sets of
hooks instead if 1. ?Why? [...]
Not at all. I am taking about using *one* set of events, to keep the intrusion
at the lowest possible level.
LSM could make use of them as well.
Obviously for pragmatic reasons that might not be feasible initially.
Post by Eric Paris
[...] ?It seems much easier that if you want the language of the filter
engine you would just make a new LSM that uses the filter engine for it's
policy language rather than the language created by SELinux or SMACK or name
your LSM implementation.
Correct, that is what i suggested.
Note that performance is a primary concern, so if certain filters are very
popular then in practice this would come with support for a couple of 'built
in' (pre-optimized) filters that the kernel can accelerate directly, so that we
do not incure the cost of executing the filter preds for really common-sense
security policies that almost everyone is using.
I.e. in the end we'd *roughly* end up with the same performance and security as
we are today (i mean, SELinux and the other LSMs did a nice job of collecting
the things that apps should be careful about), but the crutial difference isnt
just the advantages i menioned, but the fact that the *development model* of
security modules would be a *lot* more extensible.
So security efforts could move to a whole different level: they could move into
key apps and they could integrate with the general mind-set of developers.
At least Will as an application framework developer cares, so that hope is
justified i think.
Post by Eric Paris
?- unprivileged: ?application-definable, allowing the embedding of security
? ? ? ? ? ? ? ? ? policy in *apps* as well, not just the system
?- flexible: ? ? ?can be added/removed runtime unprivileged, and cheaply so
?- transparent: ? does not impact executing code that meets the policy
?- nestable: ? ? ?it is inherited by child tasks and is fundamentally stackable,
? ? ? ? ? ? ? ? ? multiple policies will have the combined effect and they
? ? ? ? ? ? ? ? ? are transparent to each other. So if a child task within a
? ? ? ? ? ? ? ? ? sandbox adds *more* checks then those add to the already
? ? ? ? ? ? ? ? ? existing set of checks. We only narrow permissions, never
? ? ? ? ? ? ? ? ? extend them.
?- generic: ? ? ? allowing observation and (safe) control of security relevant
? ? ? ? ? ? ? ? ? parameters not just at the system call boundary but at other
? ? ? ? ? ? ? ? ? relevant places of kernel execution as well: which
? ? ? ? ? ? ? ? ? points/callbacks could also be used for other types of event
? ? ? ? ? ? ? ? ? extraction such as perf. It could even be shared with audit ...
I'm not arguing that any of these things are bad things. ?What you describe
is a new LSM that uses a discretionary access control model but with the
granularity and flexibility that has traditionally only existed in the
mandatory access control security modules previously implemented in the
kernel.
I won't argue that's a bad idea, there's no reason in my mind that a process
shouldn't be allowed to control it's own access decisions in a more flexible
way than rwx bits. ?Then again, I certainly don't see a reason that this
syscall hardening patch should be held up while a whole new concept in
computer security is contemplated...
Note, i'm not actually asking for the moon, a pony and more.
I fully submit that we are yet far away from being able to do a full LSM via
this mechanism.
What i'm asking for is that because the syscall point steps taken by Will look
very promising so it would be nice to do *that* in a slightly more flexible
scheme that does not condemn it to be limited to the syscall boundary and such
...
What do you suggest here?
Post by Ingo Molnar
From my brief exploration of the ftrace/perf (and seccomp) code, I
don't see a clean way of integrating over the existing interfaces to
the ftrace framework (e.g., the global perf event pump seems to be a
mismatch), but I may be missing something obvious. In my view,
implementing this nestled between the seccomp/ftrace world provides a
stepping stone forward without being too restrictive. No matter how
we change security events in the future, system calls will always be
the first line of attack surface reduction. It could just mean that,
in the long term, accessing the "security event filtering" framework
is done through another new interface with seccomp providing only a
targeted syscall filtering featureset that may one day be deprecated
(if that day ever comes).

If there's a clear way to cleanly expand this interface that I'm
missing, I'd love to know - thanks!
will
Post by Ingo Molnar
Also, to answer you, do you say that by my argument someone should have stood
up and said 'no' to the LSM mess that was introduced a couple of years ago and
?- kernel inefficiencies and user-space overhead
?- stalled security efforts
?- infighting
?- friction, fragmentation, overmodularization
?- non-stackability
?- security annoyances on the Linux desktop
?- probably *less* Linux security
and should have asked them to do something better designed instead?
Thanks,
? ? ? ?Ingo
Ingo Molnar
2011-05-16 12:43:04 UTC
Permalink
Post by Will Drewry
Post by Ingo Molnar
Note, i'm not actually asking for the moon, a pony and more.
I fully submit that we are yet far away from being able to do a full LSM
via this mechanism.
What i'm asking for is that because the syscall point steps taken by Will
look very promising so it would be nice to do *that* in a slightly more
flexible scheme that does not condemn it to be limited to the syscall
boundary and such ...
What do you suggest here?
From my brief exploration of the ftrace/perf (and seccomp) code, I don't see
a clean way of integrating over the existing interfaces to the ftrace
framework (e.g., the global perf event pump seems to be a mismatch), but I
may be missing something obvious. [...]
Well, there's no global perf event pump. Here we'd obviously want to use
buffer-less events that do no output (pumping) whatsoever - i.e. just counting
events with filters attached.

What i suggest is to:

- share syscall events # you are fine with that as your patch makes use of them
- share the scripting engine # you are fine with that as your patch makes use of it

- share *any* other event to do_exit() a process at syscall exit time

- share any other active event that kernel developers specifically enable
for active use to impact security-relevant execution even sooner than
syscall exit time - not just system calls

- share the actual facility that manages (sets/gets) filters

So right now you have this structure for your new feature:

Documentation/trace/seccomp_filter.txt | 75 +++++
arch/x86/kernel/ldt.c | 5
arch/x86/kernel/tls.c | 7
fs/proc/array.c | 21 +
fs/proc/base.c | 25 +
include/linux/ftrace_event.h | 9
include/linux/seccomp.h | 98 +++++++
include/linux/syscalls.h | 54 ++--
include/trace/syscall.h | 6
kernel/Makefile | 1
kernel/fork.c | 8
kernel/perf_event.c | 7
kernel/seccomp.c | 144 ++++++++++-
kernel/seccomp_filter.c | 428 +++++++++++++++++++++++++++++++++
kernel/sys.c | 11
kernel/trace/Kconfig | 10
kernel/trace/trace_events_filter.c | 60 ++--
kernel/trace/trace_syscalls.c | 96 ++++++-
18 files changed, 986 insertions(+), 79 deletions(-)

Firstly, one specific problem i can see is that kernel/seccomp_filter.c
hardcodes to the system call boundary. Which is fine to a prototype
implementation (and obviously fine for something that builds upon seccomp) but
not fine in terms of a flexible Linux security feature :-)

You have hardcoded these syscall assumptions via:

struct seccomp_filter {
struct list_head list;
struct rcu_head rcu;
int syscall_nr;
struct syscall_metadata *data;
struct event_filter *event_filter;
};

Which comes just because you chose to enumerate only syscall events - instead
of enumerating all events.

Instead of that please bind to all events instead - syscalls are just one of
the many events we have. Type 'perf list' and see how many event types we have,
and quite a few could be enabled for 'active feedback' sandboxing as well.

Secondly, and this is really a variant of the first problem you have, the way
you process event filter 'failures' is pretty limited. You utilize the regular
seccomp method which works by calling into __secure_computing() and silently
accepting syscalls or generating a hard do_exit() on even the slightest of
filter failures.

Instead of that what we'd want to have is to have regular syscall events
registered, *and* enabled for such active filtering purposes. The moment the
filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag and
some other attribute in the task and the kernel would do a do_exit() at the
earliest of opportunities before calling the syscall or at the
return-from-syscall point latest.

Note that no seccomp specific code would have to execute here, we can already
generate events both at syscall entry and at syscall exit points, the only new
bit we'd need is for the 'kill the task' [or abort the syscall] policy action.

This is *far* more generic still yields the same short-term end result as far
as your sandboxing is concerned.

What we'd need for this is a way to mark existing TRACE_EVENT()s as 'active'.
We'd mark all syscall events as 'active' straight away.

[ Detail: these trace events return a return code, which the calling code can
use, that way event return values could be used sooner than syscall exit
points. IRQ code could make use of it as well, so for example this way we
could filter based early packet inspection, still in the IRQ code. ]

Then what your feature would do is to simply open up the events you are
interested in (just as counting events, without any buffers), and set 'active'
filters in them them to 'deny/allow' specific combinations of parameters only.

Thirdly, you have added a separate facility to set/query filters via /proc,
this lives mostly in /proc/<pid>/seccomp_filter. This seccomp specificity would
go away altogether: it should be possible to query existing events and filters
even outside of the seccomp facility, so if you want something explicit here
beyond the current event ioctl() to set filters please improve the generic
facility - which right now is poorer than what you have implemented so it's in
good need to be improved! :-)

All in one such a solution would enable us to reuse code in a lot more deeper
fashion while still providing all the functionality you are interested in.
These are all short-term concerns, not pie-in-the-sky future ideal LSM
concerns.

Thanks,

Ingo
Will Drewry
2011-05-16 15:29:56 UTC
Permalink
Post by Ingo Molnar
Post by Will Drewry
Post by Ingo Molnar
Note, i'm not actually asking for the moon, a pony and more.
I fully submit that we are yet far away from being able to do a full LSM
via this mechanism.
What i'm asking for is that because the syscall point steps taken by Will
look very promising so it would be nice to do *that* in a slightly more
flexible scheme that does not condemn it to be limited to the syscall
boundary and such ...
What do you suggest here?
From my brief exploration of the ftrace/perf (and seccomp) code, I don't see
a clean way of integrating over the existing interfaces to the ftrace
framework (e.g., the global perf event pump seems to be a mismatch), but I
may be missing something obvious. [...]
Well, there's no global perf event pump. Here we'd obviously want to use
buffer-less events that do no output (pumping) whatsoever - i.e. just counting
events with filters attached.
Cool - I missed these entirely. I'll get reading :)
Post by Ingo Molnar
?- share syscall events ? ? ? ?# you are fine with that as your patch makes use of them
?- share the scripting engine ?# you are fine with that as your patch makes use of it
?- share *any* other event to do_exit() a process at syscall exit time
?- share any other active event that kernel developers specifically enable
? for active use to impact security-relevant execution even sooner than
? syscall exit time - not just system calls
?- share the actual facility that manages (sets/gets) filters
These make sense to me at a high level. I'll throw in a few initial
comments, but I'll be back for a round-two once I catch up on the rest
of the perf code.
Post by Ingo Molnar
?Documentation/trace/seccomp_filter.txt | ? 75 +++++
?arch/x86/kernel/ldt.c ? ? ? ? ? ? ? ? ?| ? ?5
?arch/x86/kernel/tls.c ? ? ? ? ? ? ? ? ?| ? ?7
?fs/proc/array.c ? ? ? ? ? ? ? ? ? ? ? ?| ? 21 +
?fs/proc/base.c ? ? ? ? ? ? ? ? ? ? ? ? | ? 25 +
?include/linux/ftrace_event.h ? ? ? ? ? | ? ?9
?include/linux/seccomp.h ? ? ? ? ? ? ? ?| ? 98 +++++++
?include/linux/syscalls.h ? ? ? ? ? ? ? | ? 54 ++--
?include/trace/syscall.h ? ? ? ? ? ? ? ?| ? ?6
?kernel/Makefile ? ? ? ? ? ? ? ? ? ? ? ?| ? ?1
?kernel/fork.c ? ? ? ? ? ? ? ? ? ? ? ? ?| ? ?8
?kernel/perf_event.c ? ? ? ? ? ? ? ? ? ?| ? ?7
?kernel/seccomp.c ? ? ? ? ? ? ? ? ? ? ? | ?144 ++++++++++-
?kernel/seccomp_filter.c ? ? ? ? ? ? ? ?| ?428 +++++++++++++++++++++++++++++++++
?kernel/sys.c ? ? ? ? ? ? ? ? ? ? ? ? ? | ? 11
?kernel/trace/Kconfig ? ? ? ? ? ? ? ? ? | ? 10
?kernel/trace/trace_events_filter.c ? ? | ? 60 ++--
?kernel/trace/trace_syscalls.c ? ? ? ? ?| ? 96 ++++++-
?18 files changed, 986 insertions(+), 79 deletions(-)
Firstly, one specific problem i can see is that kernel/seccomp_filter.c
hardcodes to the system call boundary. Which is fine to a prototype
implementation (and obviously fine for something that builds upon seccomp) but
not fine in terms of a flexible Linux security feature :-)
?struct seccomp_filter {
? ? ? ?struct list_head list;
? ? ? ?struct rcu_head rcu;
? ? ? ?int syscall_nr;
? ? ? ?struct syscall_metadata *data;
? ? ? ?struct event_filter *event_filter;
?};
(This structure is a bit different in the new rev of the patch, but
nothing relevant to this specific part of the discussion :)
Post by Ingo Molnar
Which comes just because you chose to enumerate only syscall events - instead
of enumerating all events.
While I'm willing to live with the tradeoff, using ftrace event
numbers from FTRACE_SYSCALLS means that the filter will be unable to
hook a fair number of syscalls: execve, clone, etc (all the ptregs
fixup syscalls on x86) and anything that returns int instead of long
(on x86). Though the last two patches in the initial patch series
provided a proposed clean up for the latter :/

The current revision of the seccomp filter patch can function in a
bitmask-like state when CONFIG_FTRACE is unset or
CONFIG_FTRACE_SYSCALLS is unset. This also means any platform with
CONFIG_SECCOMP support can start using this right away, but I realize
that is more of a short-term gain rather than a long-term one.
Post by Ingo Molnar
Instead of that please bind to all events instead - syscalls are just one of
the many events we have. Type 'perf list' and see how many event types we have,
and quite a few could be enabled for 'active feedback' sandboxing as well.
Secondly, and this is really a variant of the first problem you have, the way
you process event filter 'failures' is pretty limited. You utilize the regular
seccomp method which works by calling into __secure_computing() and silently
accepting syscalls or generating a hard do_exit() on even the slightest of
filter failures.
Instead of that what we'd want to have is to have regular syscall events
registered, *and* enabled for such active filtering purposes. The moment the
filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag and
some other attribute in the task and the kernel would do a do_exit() at the
earliest of opportunities before calling the syscall or at the
return-from-syscall point latest.
Note that no seccomp specific code would have to execute here, we can already
generate events both at syscall entry and at syscall exit points, the only new
bit we'd need is for the 'kill the task' [or abort the syscall] policy action.
This is *far* more generic still yields the same short-term end result as far
as your sandboxing is concerned.
Almost :/ I still need to review the code you've pointed out, but, at
present, the ftrace hooks occur after the seccomp and syscall auditing
hooks. This means that that code is exposed no matter what in this
model. To trim the exposed surface to userspace, we really need those
early hooks. While I can see both hacky and less hacky approaches
around this, it stills strikes me that the seccomp thread flag and
early interception are good to reuse. One option might be to allow
seccomp to be a secure-syscall event source, but I suspect that lands
more on the hack-y side of the fence :)

Anyway, I'll read up on the other filters and try to understand
exactly how per-task, counting perf events are being handled.
Post by Ingo Molnar
What we'd need for this is a way to mark existing TRACE_EVENT()s as 'active'.
We'd mark all syscall events as 'active' straight away.
It seems like we'll still end up special casing system calls no matter
what as they are the userland vector into the kernel. Marking
syscalls active right away sounds roughly equivalent to calling
prctl(PR_SET_SECCOMP, 2). This could easily be done following the
model of syscall_nr_to_meta() since that enumerates every system call
meta data entry which yields both entry and exit event numbers. Of
course, I'd need to understand how that ties in with the per-task,
counting code.
Post by Ingo Molnar
[ Detail: these trace events return a return code, which the calling code can
?use, that way event return values could be used sooner than syscall exit
?points. IRQ code could make use of it as well, so for example this way we
?could filter based early packet inspection, still in the IRQ code. ]
Then what your feature would do is to simply open up the events you are
interested in (just as counting events, without any buffers), and set 'active'
filters in them them to 'deny/allow' specific combinations of parameters only.
Thirdly, you have added a separate facility to set/query filters via /proc,
this lives mostly in /proc/<pid>/seccomp_filter. This seccomp specificity would
go away altogether: it should be possible to query existing events and filters
even outside of the seccomp facility, so if you want something explicit here
beyond the current event ioctl() to set filters please improve the generic
facility - which right now is poorer than what you have implemented so it's in
good need to be improved! :-)
I haven't looked at the other interface, so I will. I know about the
existence of the perf_event_open syscall, but that's about it. Would
it make sense to use the same syscall in a unified-interface world?
Even if the right way, semantically it seems awkward.
Post by Ingo Molnar
All in one such a solution would enable us to reuse code in a lot more deeper
fashion while still providing all the functionality you are interested in.
These are all short-term concerns, not pie-in-the-sky future ideal LSM
concerns.
Thanks for the concrete feedback! I'll follow up again once I better
understand the code you're citing as well as the existing userland
interface to it.

cheers!
will
Ingo Molnar
2011-05-17 12:57:00 UTC
Permalink
Post by Ingo Molnar
This is *far* more generic still yields the same short-term end result as
far as your sandboxing is concerned.
Almost :/ [...]
Hey that's a pretty good result from a subsystem that was not written with your
usecase in mind *at all* ;-)
[...] I still need to review the code you've pointed out, but, at present,
the ftrace hooks occur after the seccomp and syscall auditing hooks. This
means that that code is exposed no matter what in this model. To trim the
exposed surface to userspace, we really need those early hooks. While I can
see both hacky and less hacky approaches around this, it stills strikes me
that the seccomp thread flag and early interception are good to reuse. One
option might be to allow seccomp to be a secure-syscall event source, but I
suspect that lands more on the hack-y side of the fence :)
Agreed, there should be no security compromise imposed on your usecase, at all.

You could move the event callback sooner into the syscall-entry sequence to
make sure it's the highest priority thing to process?

There's no semantic dependency on its current location so this can be changed
AFAICS.

Thanks,

Ingo
Ingo Molnar
2011-05-13 12:10:34 UTC
Permalink
Post by Ingo Molnar
" I'm concerned that we're seeing yet another security scheme being designed on
the fly, without a well-formed threat model, and without taking into account
lessons learned from the seemingly endless parade of similar, failed schemes. "
so when and how did your opinion of this scheme turn from it being an
"endless parade of failed schemes" to it being a "well-defined and readily
understandable feature"? :-)
When it was defined in a way which limited its purpose to reducing the attack
surface of the sycall interface.
Let me outline a simple example of a new filter expression based security
feature that could be implemented outside the narrow system call boundary you
find acceptable, and please tell what is bad about it.

Say i'm a user-space sandbox developer who wants to enforce that sandboxed code
should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.

It is a simple and sensible security feature, agreed? It allows most code to
run well and link to countless libraries - but no access to other files is
allowed.

I would also like my sandbox app to be able to install this policy without
having to be root. I do not want the sandbox app to have permission to create
labels on /lib and /usr/lib and what not.

Firstly, using the filter code i deny the various link creation syscalls so
that sandboxed code cannot escape for example by creating a symlink to outside
the permitted VFS namespace. (Note: we opt-in to syscalls, that way new
syscalls added by new kernels are denied by defalt. The current symlink
creation syscalls are not opted in to.)

But the next step, actually checking filenames, poses a big hurdle: i cannot
implement the filename checking at the sys_open() syscall level in a secure
way: because the pathname is passed to sys_open() by pointer, and if i check it
at the generic sys_open() syscall level, another thread in the sandbox might
modify the underlying filename *after* i've checked it.

But if i had a VFS event at the fs/namei.c::getname() level, i would have
access to a central point where the VFS string becomes stable to the kernel and
can be checked (and denied if necessary).

A sidenote, and not surprisingly, the audit subsystem already has an event
callback there:

audit_getname(result);

Unfortunately this audit callback cannot be used for my purposes, because the
event is single-purpose for auditd and because it allows no feedback (no
deny/accept discretion for the security policy).

But if had this simple event there:

err = event_vfs_getname(result);

I could implement this new filename based sandboxing policy, using a filter
like this installed on the vfs::getname event and inherited by all sandboxed
tasks (which cannot uninstall the filter, obviously):

"
if (strstr(name, ".."))
return -EACCESS;

if (!strncmp(name, "/home/sandbox/", 14) &&
!strncmp(name, "/lib/", 5) &&
!strncmp(name, "/usr/lib/", 9))
return -EACCESS;

"

#
# Note1: Obviously the filter engine would be extended to allow such simple string
# match functions. )
#
# Note2: ".." is disallowed so that sandboxed code cannot escape the restrictions
# using "/..".
#

This kind of flexible and dynamic sandboxing would allow a wide range of file
ops within the sandbox, while still isolating it from files not included in the
specified VFS namespace.

( Note that there are tons of other examples as well, for useful security features
that are best done using events outside the syscall boundary. )

The security event filters code tied to seccomp and syscalls at the moment is
useful, but limited in its future potential.

So i argue that it should go slightly further and should become:

- unprivileged: application-definable, allowing the embedding of security
policy in *apps* as well, not just the system

- flexible: can be added/removed runtime unprivileged, and cheaply so

- transparent: does not impact executing code that meets the policy

- nestable: it is inherited by child tasks and is fundamentally stackable,
multiple policies will have the combined effect and they
are transparent to each other. So if a child task within a
sandbox adds *more* checks then those add to the already
existing set of checks. We only narrow permissions, never
extend them.

- generic: allowing observation and (safe) control of security relevant
parameters not just at the system call boundary but at other
relevant places of kernel execution as well: which
points/callbacks could also be used for other types of event
extraction such as perf. It could even be shared with audit ...

I argue that this is the LSM and audit subsystems designed right: in the long
run it could allow everything that LSM does at the moment - and so much more
...

And you argue that allowing this would be bad, if it was extended like that
then you'd consider it a failed scheme? Why?

Thanks,

Ingo
James Morris
2011-05-16 00:36:05 UTC
Permalink
Post by Ingo Molnar
Say i'm a user-space sandbox developer who wants to enforce that sandboxed code
should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.
It is a simple and sensible security feature, agreed? It allows most code to
run well and link to countless libraries - but no access to other files is
allowed.
Not really.

Firstly, what is the security goal of these restrictions? Then, are the
restrictions complete and unbypassable?

How do you reason about the behavior of the system as a whole?
Post by Ingo Molnar
I argue that this is the LSM and audit subsystems designed right: in the long
run it could allow everything that LSM does at the moment - and so much more
...
Now you're proposing a redesign of the security subsystem. That's a
significant undertaking.

In the meantime, we have a simple, well-defined enhancement to seccomp
which will be very useful to current users in reducing their kernel attack
surface.

We should merge that, and the security subsystem discussion can carry on
separately.


- James
--
James Morris
<jmorris at namei.org>
Ingo Molnar
2011-05-16 15:08:37 UTC
Permalink
Post by James Morris
Post by Ingo Molnar
Say i'm a user-space sandbox developer who wants to enforce that sandboxed
code should only be allowed to open files in /home/sandbox/, /lib/ and
/usr/lib/.
It is a simple and sensible security feature, agreed? It allows most code
to run well and link to countless libraries - but no access to other files
is allowed.
Not really.
Firstly, what is the security goal of these restrictions? [...]
To do what i described above? Namely:

" Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
and /usr/lib/ "
Post by James Morris
[...] Then, are the restrictions complete and unbypassable?
If only the system calls i mentioned are allowed, and if the sandboxed VFS
namespace itself is isolated from the rest of the system (no bind mounts, no
hard links outside the sandbox, etc.) then its goal is to not be bypassable -
what use is a sandbox if the sandbox can be bypassed by the sandboxed code?

There's a few ways how to alter (and thus bypass) VFS namespace lookups:
symlinks, chdir, chroot, rename, etc., which (as i mentioned) have to be
excluded by default or filtered as well.
Post by James Morris
How do you reason about the behavior of the system as a whole?
For some usecases i mainly want to reason about what the sandboxed code can do
and can not do, within a fairly static and limited VFS namespace environment.

I might not want to have a full-blown 'physical barrier' for all objects
labeled as inaccessible to sandboxed code (or labeled as accessible to
sandboxed code).

Especially as manipulating file labels is not also slow (affects all files) but
is also often an exclusively privileged operation even for owned files, for no
good reason. For things like /lib/ and /usr/lib/ it also *has* to be a
privileged operation.
Post by James Morris
Post by Ingo Molnar
I argue that this is the LSM and audit subsystems designed right: in the
long run it could allow everything that LSM does at the moment - and so
much more ...
Now you're proposing a redesign of the security subsystem. That's a
significant undertaking.
It certainly is.
Post by James Morris
In the meantime, we have a simple, well-defined enhancement to seccomp which
will be very useful to current users in reducing their kernel attack surface.
We should merge that, and the security subsystem discussion can carry on
separately.
Is that the development and merge process along which the LSM subsystem got
into its current state?

Thanks,

Ingo
James Morris
2011-05-17 02:24:34 UTC
Permalink
Post by Ingo Molnar
Post by James Morris
Not really.
Firstly, what is the security goal of these restrictions? [...]
" Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
and /usr/lib/ "
These are access rules, they don't really describe a high-level security
goal. How do you know it's ok to open everything in these directories?


- James
--
James Morris
<jmorris at namei.org>
Ingo Molnar
2011-05-17 13:10:58 UTC
Permalink
Post by James Morris
Post by Ingo Molnar
Post by James Morris
Not really.
Firstly, what is the security goal of these restrictions? [...]
" Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
and /usr/lib/ "
These are access rules, they don't really describe a high-level security
goal. [...]
Restrictng sandboxed code to only open files within a given VFS namespace
boundary sure sounds like a high-level security goal to me.

If implemented and set up correctly then it restricts sandboxed code to only be
able to open files reachable via that VFS sub-namespace.

That is a rather meaningful high-level concept. What higher level concept do
you want to argue?
Post by James Morris
[...] How do you know it's ok to open everything in these directories?
How do you know it's ok to open /etc/hosts? The sysadmin has configured the
system that way.

How do you know that it's ok for sandboxed code to open files in
/home/sandbox/? The sandbox developer has configured the system that way.

I'm not sure i get your point.

Thanks,

Ingo
James Morris
2011-05-17 13:29:36 UTC
Permalink
Post by Ingo Molnar
I'm not sure i get your point.
Your example was not complete as described. After an apparently simple
specification, you've since added several qualifiers and assumptions, and
I still doubt that it's complete.

A higher level goal would look like

"Allow a sandbox app access only to approved resources, to contain the
effects of flaws in the app", or similar.

Note that this includes a threat model (remote attacker taking control of
the app) and a general and fully stated strategy for dealing with it.
Post by Ingo Molnar
From there, you can start to analyze how to implement the goal, at which
point you'd start thinking about configuration, assumptions, filesystem
access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared
memory, invocation).

Anyway, this is getting off track from the main discussion, but you
asked...



- James
--
James Morris
<jmorris at namei.org>
Ingo Molnar
2011-05-17 18:34:59 UTC
Permalink
Post by James Morris
Post by Ingo Molnar
I'm not sure i get your point.
Your example was not complete as described. After an apparently simple
specification, you've since added several qualifiers and assumptions, [...]
I havent added any qualifiers really (i added examples/description), the opt-in
method i mentioned in my first mail should be pretty robust:

| Firstly, using the filter code i deny the various link creation syscalls so
| that sandboxed code cannot escape for example by creating a symlink to
| outside the permitted VFS namespace. (Note: we opt-in to syscalls, that way
| new syscalls added by new kernels are denied by defalt. The current symlink
| creation syscalls are not opted in to.)
Post by James Morris
[...] and I still doubt that it's complete.
I could too claim that i doubt that the SELinux kernel implementation is
secure!

So how about we both come up with specific examples about how it's not secure,
instead of going down the fear-uncertainty-and-doubt road? ;-)
Post by James Morris
A higher level goal would look like
"Allow a sandbox app access only to approved resources, to contain the
effects of flaws in the app", or similar.
I see what you mean.

I really think that "restricting sandboxed code to only open files within a
given VFS namespace boundary" is the most useful highlevel description here -
which is really a subset of a "allow a sandbox app access only to an easily
approved set of files" highlevel concept.

There's no "to contain ..." bit here: *all* of the sandboxed app code is
untrusted, so there's no 'remote attacker' and we do not limit our threat to
flaws in the app. We want to contain apps to within a small subset of Linux
functionality, and we want to do that within regular apps (without having to be
superuser), full stop.
Post by James Morris
Note that this includes a threat model (remote attacker taking control of the
app) and a general and fully stated strategy for dealing with it.
Attacker does not have to be remote - most sandboxing concepts protect against
locally installed plugins/apps/applets. In sandboxing the whole app is
considered untrusted - not just some flaw in it, abused remotely.
Post by James Morris
From there, you can start to analyze how to implement the goal, at which
point you'd start thinking about configuration, assumptions, filesystem
access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared
memory, invocation).
Sandboxed code generally does not have access to anything fancy like that - if
it is added then all possible side effects have to be examined.

Thanks,

Ingo
Pavel Machek
2011-05-26 06:27:52 UTC
Permalink
Post by James Morris
How do you reason about the behavior of the system as a whole?
Post by Ingo Molnar
I argue that this is the LSM and audit subsystems designed right: in the long
run it could allow everything that LSM does at the moment - and so much more
...
Now you're proposing a redesign of the security subsystem. That's a
significant undertaking.
In the meantime, we have a simple, well-defined enhancement to seccomp
which will be very useful to current users in reducing their kernel attack
surface.
Well, you can do the same with subterfugue, even without kernel
changes. But that's ptrace -- slow. (And it already shows that syscall
based filters are extremely tricky to configure).

If yu want speed, seccomp+server for non-permitted operations seems like reasonable way.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Ingo Molnar
2011-05-26 08:35:13 UTC
Permalink
Post by Pavel Machek
Post by James Morris
How do you reason about the behavior of the system as a whole?
Post by Ingo Molnar
I argue that this is the LSM and audit subsystems designed right: in the long
run it could allow everything that LSM does at the moment - and so much more
...
Now you're proposing a redesign of the security subsystem. That's a
significant undertaking.
In the meantime, we have a simple, well-defined enhancement to seccomp
which will be very useful to current users in reducing their kernel attack
surface.
Well, you can do the same with subterfugue, even without kernel
changes. But that's ptrace -- slow. (And it already shows that
syscall based filters are extremely tricky to configure).
Yes, if you use syscall based filters to implement access to
underlying objects where the access methods do not capture essential
lifetime events properly (such as files) they you'll quickly run into
trouble achieving a secure solution.

But you can robustly use syscall filters to control the underlying
primary *resource*: various pieces of kernel code with *negative*
utility to the current app - which have no use to the app but pose
risks in terms of potential exploits in them.

But you can use event filters to implement arbitrary security
policies robustly.

For example file objects: if you generate the right events for a
class of objects then you can control access to them very robustly.

It's not a surprise that this is what SELinux does primarily: it has
lifetime event hooks at the inode object (and socket, packet, etc.)
level and captures those access attempts and validates them against
the permissions of that object, in light of the accessing task's
credentials.

Exactly that can be done with Will's patch as well, if its potential
scope of event-checking points is not stupidly limited to the syscall
boundary alone ...

Thanks,

Ingo
Frederic Weisbecker
2011-05-12 12:15:42 UTC
Permalink
Post by Ingo Molnar
To restrict execution to system calls.
1) We already have a specific ABI for this: you can set filters for events via
an event fd.
Why not extend that mechanism instead and improve *both* your sandboxing
bits and the events code? This new seccomp code has a lot more
to do with trace event filters than the minimal old seccomp code ...
kernel/trace/trace_event_filter.c is 2000 lines of tricky code that
interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of
mostly trivial code.
2) Why should this concept not be made available wider, to allow the
restriction of not just system calls but other security relevant components
of the kernel as well?
This too, if you approach the problem via the events code, will be a natural
end result, while if you approach it from the seccomp prctl angle it will be
a limited hack only.
Note, the end result will be the same - just using a different ABI.
So i really think the ABI itself should be closer related to the event code.
What this "seccomp" code does is that it uses specific syscall events to
restrict execution of certain event generating codepaths, such as system calls.
Thanks,
Ingo
What's positive with that approach is that the code is all there already.
Create a perf event for a given trace event, attach a filter to it.

What needs to be added is an override of the effect of the filter. By default
it's dropping the event, but there may be different flavours, including sending
a signal. All in one, extending the current code to allow that looks trivial.

The negative points are that

* trace events are supposed to stay passive and not act on the system, except
doing some endpoint things like writing to a buffer. We can't call do_exit()
from a tracepoint for example, preemption is disabled there.

* Also, is it actually relevant to extend that seccomp filtering to other events
than syscalls? Exposing kernel events to filtering sounds actually to me bringing
a new potential security issue. But with fine restrictions this can probably
be dealt with. Especially if by default only syscalls can be filtered

* I think Peter did not want to give such "active" role to perf in the system.
James Morris
2011-05-12 11:33:27 UTC
Permalink
Post by Will Drewry
+void seccomp_filter_log_failure(int syscall)
+{
+ printk(KERN_INFO
+ "%s[%d]: system call %d (%s) blocked at ip:%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ syscall_nr_to_name(syscall), KSTK_EIP(current));
+}
I think it'd be a good idea to utilize the audit facility here.


- James
--
James Morris
<jmorris at namei.org>
David Laight
2011-05-13 15:29:27 UTC
Permalink
... If you can be completely stateless its easier, but there's
a reason that stacking security modules is hard. Serge has tried in
the
past and both dhowells and casey schaufler are working on it right
now.
Stacking is never as easy as it sounds :)
For a bad example of trying to allow alternate security models
look at NetBSD's kauth code :-)

NetBSD also had issues where some 'system call trace' code
was being used to (try to) apply security - unfortunately
it worked by looking at the user-space buffers on system
call entry - and a multithreaded program can easily arrange
to update them after the initial check!
For trace/event type activities this wouldn't really matter,
for security policy it does.
(I've not looked directly at these event points in linux)

David
Ingo Molnar
2011-05-16 12:03:02 UTC
Permalink
[...] unfortunately it worked by looking at the user-space buffers on system
call entry - and a multithreaded program can easily arrange to update them
after the initial check! [...]
Such problems of reliability/persistency of security checks is exactly one of
my arguments why this should not be limited to the syscall boundary, if you
read the example i have provided in this discussion.

Thanks,

Ingo
Arnd Bergmann
2011-05-13 19:35:34 UTC
Permalink
Post by Will Drewry
This change adds a new seccomp mode based on the work by
agl at chromium.org in [1]. This new mode, "filter mode", provides a hash
table of seccomp_filter objects. When in the new mode (2), all system
calls are checked against the filters - first by system call number,
then by a filter string. If an entry exists for a given system call and
all filter predicates evaluate to true, then the task may proceed.
Otherwise, the task is killed (as per seccomp_mode == 1).
I've got a question about this: Do you expect the typical usage to disallow
ioctl()? Given that ioctl alone is responsible for a huge number of exploits
in various drivers, while certain ioctls are immensely useful (FIONREAD,
FIOASYNC, ...), do you expect to extend the mechanism to filter specific
ioctl commands in the future?

Arnd
Will Drewry
2011-05-14 20:58:00 UTC
Permalink
Post by Arnd Bergmann
Post by Will Drewry
This change adds a new seccomp mode based on the work by
agl at chromium.org in [1]. This new mode, "filter mode", provides a hash
table of seccomp_filter objects. ?When in the new mode (2), all system
calls are checked against the filters - first by system call number,
then by a filter string. ?If an entry exists for a given system call and
all filter predicates evaluate to true, then the task may proceed.
Otherwise, the task is killed (as per seccomp_mode == 1).
I've got a question about this: Do you expect the typical usage to disallow
ioctl()? Given that ioctl alone is responsible for a huge number of exploits
in various drivers, while certain ioctls are immensely useful (FIONREAD,
FIOASYNC, ...), do you expect to extend the mechanism to filter specific
ioctl commands in the future?
In many cases, I do expect ioctl's to be dropped, but it's totally up
to whoever is setting the filters.

As is, it can already help out: [even though an LSM, if available,
would be appropriate to define a fine-grained policy]

ioctl() is hooked by the ftrace syscalls infrastructure (via SYSCALL_DEFINE3):

SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned
long, arg)

This means you can do:
sprintf(filter, "cmd == %u || cmd == %u", FIOASYNC, FIONREAD);
prctl(PR_SET_SECCOMP_FILTER, __NR_ioctl, filter);
...
prctl(PR_SET_SECCOMP, 2, 0);
and then you'll be able to call ioctl on any fd with any argument but
limited to only the FIOASYNC and FIONREAD commands.

Depending on integration, it could even be limited to ioctl commands
that are appropriate to a known fd if the fd is opened prior to
entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
with a filter of "1" then narrowed through a later addition of
something like "(fd == %u && (cmd == %u || cmd == %u))" or something
along those lines.

Does that make sense?

In general, this interface won't need specific extensions for most
system call oriented filtering events. ftrace events may be expanded
(to include more system calls), but that's behind the scenes. Only
arguments subject to time-of-check-time-of-use attacks (data living in
userspace passed in by pointer) are not safe to use via this
interface. In theory, that limitation could also be lifted in the
implementation without changing the ABI.

Thanks!
will
Arnd Bergmann
2011-05-15 06:42:07 UTC
Permalink
Post by Will Drewry
Depending on integration, it could even be limited to ioctl commands
that are appropriate to a known fd if the fd is opened prior to
entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
with a filter of "1" then narrowed through a later addition of
something like "(fd == %u && (cmd == %u || cmd == %u))" or something
along those lines.
Does that make sense?
Thanks for the explanation. This sounds like it's already doing all
we need.

Arnd
Ingo Molnar
2011-05-16 12:00:43 UTC
Permalink
Post by Arnd Bergmann
Post by Will Drewry
Depending on integration, it could even be limited to ioctl commands
that are appropriate to a known fd if the fd is opened prior to
entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
with a filter of "1" then narrowed through a later addition of
something like "(fd == %u && (cmd == %u || cmd == %u))" or something
along those lines.
Does that make sense?
Thanks for the explanation. This sounds like it's already doing all
we need.
One thing we could do more clearly here is to help keep the filter expressions
symbolic - i.e. help resolve the various ioctl variants as well, not just the
raw syscall parameter numbers.

But yes, access to the raw syscall parameters and the ability to filter them
already gives us the ability to exclude/include specific ioctls in a rather
flexible way.

Thanks,

Ingo
Will Drewry
2011-05-16 15:28:35 UTC
Permalink
Sorry to be absent from this thread so far, I just got back from my
travels and I'm now catching up on email.
Post by Will Drewry
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..22e1668 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1664,6 +1664,16 @@ config SECCOMP
? ? ? ? and the task is only allowed to execute a few safe syscalls
? ? ? ? defined by each seccomp mode.
+config SECCOMP_FILTER
+ ? ? bool "Enable seccomp-based system call filtering"
+ ? ? depends on SECCOMP && EXPERIMENTAL
+ ? ? help
+ ? ? ? Per-process, inherited system call filtering using shared code
+ ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
+ ? ? ? is not available, enhanced filters will not be available.
+
+ ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
+
?config CC_STACKPROTECTOR
? ? ? bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
? ? ? depends on EXPERIMENTAL
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..7641ee9 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -129,6 +129,16 @@ config SECCOMP
? ? ? ? If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ ? ? bool "Enable seccomp-based system call filtering"
+ ? ? depends on SECCOMP && EXPERIMENTAL
+ ? ? help
+ ? ? ? Per-process, inherited system call filtering using shared code
+ ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
+ ? ? ? is not available, enhanced filters will not be available.
+
+ ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
+
?endmenu
?menu "Advanced setup"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..fe4cbda 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2245,6 +2245,16 @@ config SECCOMP
? ? ? ? If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ ? ? bool "Enable seccomp-based system call filtering"
+ ? ? depends on SECCOMP && EXPERIMENTAL
+ ? ? help
+ ? ? ? Per-process, inherited system call filtering using shared code
+ ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
+ ? ? ? is not available, enhanced filters will not be available.
+
+ ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
+
?config USE_OF
? ? ? bool "Flattened Device Tree support"
? ? ? select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..83499e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -605,6 +605,16 @@ config SECCOMP
? ? ? ? If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ ? ? bool "Enable seccomp-based system call filtering"
+ ? ? depends on SECCOMP && EXPERIMENTAL
+ ? ? help
+ ? ? ? Per-process, inherited system call filtering using shared code
+ ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
+ ? ? ? is not available, enhanced filters will not be available.
+
+ ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
+
?endmenu
?config ISA_DMA_API
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..2777515 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -614,6 +614,16 @@ config SECCOMP
? ? ? ? If unsure, say Y.
+config SECCOMP_FILTER
+ ? ? bool "Enable seccomp-based system call filtering"
+ ? ? depends on SECCOMP && EXPERIMENTAL
+ ? ? help
+ ? ? ? Per-process, inherited system call filtering using shared code
+ ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
+ ? ? ? is not available, enhanced filters will not be available.
+
+ ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
+
?endmenu
?menu "Power Management"
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..00c1521 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -676,6 +676,16 @@ config SECCOMP
? ? ? ? If unsure, say N.
+config SECCOMP_FILTER
+ ? ? bool "Enable seccomp-based system call filtering"
+ ? ? depends on SECCOMP && EXPERIMENTAL
+ ? ? help
+ ? ? ? Per-process, inherited system call filtering using shared code
+ ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
+ ? ? ? is not available, enhanced filters will not be available.
+
+ ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
+
?config SMP
? ? ? bool "Symmetric multi-processing support"
? ? ? depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5b42255 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -270,6 +270,16 @@ config SECCOMP
? ? ? ? If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ ? ? bool "Enable seccomp-based system call filtering"
+ ? ? depends on SECCOMP && EXPERIMENTAL
+ ? ? help
+ ? ? ? Per-process, inherited system call filtering using shared code
+ ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
+ ? ? ? is not available, enhanced filters will not be available.
+
+ ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
+
?config HOTPLUG_CPU
? ? ? bool "Support for hot-pluggable CPUs"
? ? ? depends on SPARC64 && SMP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..d6d44d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1485,6 +1485,16 @@ config SECCOMP
? ? ? ? If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ ? ? bool "Enable seccomp-based system call filtering"
+ ? ? depends on SECCOMP && EXPERIMENTAL
+ ? ? help
+ ? ? ? Per-process, inherited system call filtering using shared code
+ ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
+ ? ? ? is not available, enhanced filters will not be available.
+
+ ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
+
?config CC_STACKPROTECTOR
? ? ? bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
? ? ? ---help---
You just cut-and-pasted 8 copies of a config selection. The proper way
to do that is to add the Kconfig selection in a core kernel Kconfig,
have it depend on "HAVE_SECCOMP_FILTER" and then in each of these
config <ARCH>
? ? ? ?[...]
? ? ? ?select HAVE_SECCOMP_FILTER
That way you don't need to duplicate the config option all over the
place.
Thanks! I had seen the HAVE_* format but failed to acknowledge why!

I'll fix it in the next rev (assuming it still fits there)!
will
Steven Rostedt
2011-05-16 15:26:47 UTC
Permalink
Sorry to be absent from this thread so far, I just got back from my
travels and I'm now catching up on email.
Post by Will Drewry
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..22e1668 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1664,6 +1664,16 @@ config SECCOMP
and the task is only allowed to execute a few safe syscalls
defined by each seccomp mode.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
depends on EXPERIMENTAL
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..7641ee9 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -129,6 +129,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
menu "Advanced setup"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..fe4cbda 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2245,6 +2245,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config USE_OF
bool "Flattened Device Tree support"
select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..83499e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -605,6 +605,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
config ISA_DMA_API
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..2777515 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -614,6 +614,16 @@ config SECCOMP
If unsure, say Y.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
menu "Power Management"
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..00c1521 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -676,6 +676,16 @@ config SECCOMP
If unsure, say N.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SMP
bool "Symmetric multi-processing support"
depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5b42255 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -270,6 +270,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SPARC64 && SMP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..d6d44d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1485,6 +1485,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
---help---
You just cut-and-pasted 8 copies of a config selection. The proper way
to do that is to add the Kconfig selection in a core kernel Kconfig,
have it depend on "HAVE_SECCOMP_FILTER" and then in each of these
configs, simply add in the arch Kconfig:

config <ARCH>
[...]
select HAVE_SECCOMP_FILTER


That way you don't need to duplicate the config option all over the
place.

-- Steve
Steven Rostedt
2011-05-16 15:26:47 UTC
Permalink
Sorry to be absent from this thread so far, I just got back from my
travels and I'm now catching up on email.
Post by Will Drewry
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..22e1668 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1664,6 +1664,16 @@ config SECCOMP
and the task is only allowed to execute a few safe syscalls
defined by each seccomp mode.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
depends on EXPERIMENTAL
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..7641ee9 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -129,6 +129,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
menu "Advanced setup"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..fe4cbda 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2245,6 +2245,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config USE_OF
bool "Flattened Device Tree support"
select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..83499e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -605,6 +605,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
config ISA_DMA_API
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..2777515 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -614,6 +614,16 @@ config SECCOMP
If unsure, say Y.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
menu "Power Management"
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..00c1521 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -676,6 +676,16 @@ config SECCOMP
If unsure, say N.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SMP
bool "Symmetric multi-processing support"
depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5b42255 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -270,6 +270,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SPARC64 && SMP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..d6d44d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1485,6 +1485,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
---help---
You just cut-and-pasted 8 copies of a config selection. The proper way
to do that is to add the Kconfig selection in a core kernel Kconfig,
have it depend on "HAVE_SECCOMP_FILTER" and then in each of these
configs, simply add in the arch Kconfig:

config <ARCH>
[...]
select HAVE_SECCOMP_FILTER


That way you don't need to duplicate the config option all over the
place.

-- Steve
Will Drewry
2011-07-08 19:03:57 UTC
Permalink
This change adds a new seccomp mode based on the work by
agl at chromium.org in [1]. This new mode, "filter mode", provides a hash
table of seccomp_filter objects. When in the new mode (2), all system
calls are checked against the filters - first by system call number,
then by a filter string. If an entry exists for a given system call and
all filter predicates evaluate to true, then the task may proceed.
Otherwise, the task is killed (as per seccomp_mode == 1).

Filter string parsing and evaluation is handled by the ftrace filter
engine. Related patches tweak to the perf filter trace and free allow
the call to be shared. Filters inherit their understanding of types and
arguments for each system call from the CONFIG_FTRACE_SYSCALLS subsystem
which already predefines this information in syscall_metadata associated
enter_event (and exit_event) structures. If CONFIG_FTRACE and
CONFIG_FTRACE_SYSCALLS are not compiled in, only "1" filter strings will
be allowed.

The net result is a process may have its system calls filtered using the
ftrace filter engine's inherent understanding of systems calls. A
logical ruleset for a process that only needs stdin/stdout may be:
sys_read: fd == 0
sys_write: fd == 1 || fd == 2
sys_exit: 1

The set of filters is specified through the PR_SET_SECCOMP path in prctl().
For example:
prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 0");
prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
prctl(PR_SET_SECCOMP_FILTER, __NR_exit, "1");
prctl(PR_SET_SECCOMP, 2, 0);

v2: - changed to use the existing syscall number ABI.
- prctl changes to minimize parsing in the kernel:
prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
- defined PR_SECCOMP_MODE_STRICT and ..._FILTER
- added flags
- provide a default fail syscall_nr_to_meta in ftrace
- provides fallback for unhooked system calls
- use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
- added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
- moved to a hlist and 4 bit hash of linked lists
- added support to operate without CONFIG_FTRACE_SYSCALLS
- moved Kconfig support next to SECCOMP
(should this be done in per-platform patches?)
- made Kconfig entries dependent on EXPERIMENTAL
- added macros to avoid ifdefs from kernel/fork.c
- added compat task/filter matching
- drop seccomp.h inclusion in sched.h and drop seccomp_t
- added Filtering to "show" output
- added on_exec state dup'ing when enabling after a fast-path accept.

Signed-off-by: Will Drewry <wad at chromium.org>
---
arch/arm/Kconfig | 10 +
arch/microblaze/Kconfig | 10 +
arch/mips/Kconfig | 10 +
arch/powerpc/Kconfig | 10 +
arch/s390/Kconfig | 10 +
arch/sh/Kconfig | 10 +
arch/sparc/Kconfig | 10 +
arch/x86/Kconfig | 10 +
include/linux/prctl.h | 9 +
include/linux/sched.h | 5 +-
include/linux/seccomp.h | 116 +++++++++-
include/trace/syscall.h | 7 +
kernel/Makefile | 3 +
kernel/fork.c | 3 +
kernel/seccomp.c | 228 +++++++++++++++++-
kernel/seccomp.h | 74 ++++++
kernel/seccomp_filter.c | 581 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 15 +-
18 files changed, 1100 insertions(+), 21 deletions(-)
create mode 100644 kernel/seccomp.h
create mode 100644 kernel/seccomp_filter.c

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..22e1668 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1664,6 +1664,16 @@ config SECCOMP
and the task is only allowed to execute a few safe syscalls
defined by each seccomp mode.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
depends on EXPERIMENTAL
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..7641ee9 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -129,6 +129,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu

menu "Advanced setup"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..fe4cbda 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2245,6 +2245,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config USE_OF
bool "Flattened Device Tree support"
select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..83499e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -605,6 +605,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu

config ISA_DMA_API
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..2777515 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -614,6 +614,16 @@ config SECCOMP

If unsure, say Y.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu

menu "Power Management"
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..00c1521 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -676,6 +676,16 @@ config SECCOMP

If unsure, say N.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SMP
bool "Symmetric multi-processing support"
depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5b42255 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -270,6 +270,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SPARC64 && SMP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..d6d44d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1485,6 +1485,16 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
---help---
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..379b391 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -63,6 +63,15 @@
/* Get/set process seccomp mode */
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22
+# define PR_SECCOMP_MODE_NONE 0
+# define PR_SECCOMP_MODE_STRICT 1
+# define PR_SECCOMP_MODE_FILTER 2
+# define PR_SECCOMP_FLAG_FILTER_ON_EXEC (1 << 1)
+
+/* Get/set process seccomp filters */
+#define PR_GET_SECCOMP_FILTER 35
+#define PR_SET_SECCOMP_FILTER 36
+#define PR_CLEAR_SECCOMP_FILTER 37

/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..27eacf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -77,7 +77,6 @@ struct sched_param {
#include <linux/percpu.h>
#include <linux/topology.h>
#include <linux/proportions.h>
-#include <linux/seccomp.h>
#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/rtmutex.h>
@@ -1190,6 +1189,8 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

+struct seccomp_state;
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1374,7 +1375,7 @@ struct task_struct {
uid_t loginuid;
unsigned int sessionid;
#endif
- seccomp_t seccomp;
+ struct seccomp_state *seccomp;

/* Thread group tracking */
u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..289c836 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -2,12 +2,34 @@
#define _LINUX_SECCOMP_H


+/* Forward declare for proc interface */
+struct seq_file;
+
#ifdef CONFIG_SECCOMP

+#include <linux/errno.h>
+#include <linux/list.h>
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+/**
+ * struct seccomp_state - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 1, the process is under standard seccomp rules
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @usage: number of references to the current instance.
+ * @flags: a bitmask of behavior altering flags.
+ * @filters: Hash table of filters if using CONFIG_SECCOMP_FILTER.
+ */
+struct seccomp_state {
+ uint16_t mode;
+ atomic_t usage;
+ long flags;
+ struct seccomp_filter_table *filters;
+};

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -16,27 +38,113 @@ static inline void secure_computing(int this_syscall)
__secure_computing(this_syscall);
}

+extern struct seccomp_state *seccomp_state_new(void);
+extern struct seccomp_state *seccomp_state_dup(const struct seccomp_state *);
+extern struct seccomp_state *get_seccomp_state(struct seccomp_state *);
+extern void put_seccomp_state(struct seccomp_state *);
+
+extern long prctl_set_seccomp(unsigned long, unsigned long);
extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+
+extern long prctl_set_seccomp_filter(unsigned long, char __user *);
+extern long prctl_get_seccomp_filter(unsigned long, char __user *,
+ unsigned long);
+extern long prctl_clear_seccomp_filter(unsigned long);
+
+#define inherit_tsk_seccomp_state(_child, _orig) \
+ _child->seccomp = get_seccomp_state(_orig->seccomp);
+#define put_tsk_seccomp_state(_tsk) put_seccomp_state(_tsk->seccomp)

#else /* CONFIG_SECCOMP */

#include <linux/errno.h>

-typedef struct { } seccomp_t;
+struct seccomp_state { };

#define secure_computing(x) do { } while (0)
+#define inherit_tsk_seccomp_state(_child, _orig) do { } while (0)
+#define put_tsk_seccomp_state(_tsk) do { } while (0)

static inline long prctl_get_seccomp(void)
{
return -EINVAL;
}

-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long a2, unsigned long a3)
{
return -EINVAL;
}

+static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_clear_seccomp_filter(unsigned long a2)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
+ unsigned long a4)
+{
+ return -ENOSYS;
+}
+
+static inline struct seccomp_state *seccomp_state_new(void)
+{
+ return NULL;
+}
+
+static inline struct seccomp_state *seccomp_state_dup(
+ const struct seccomp_state *state)
+{
+ return NULL;
+}
+
+static inline struct seccomp_state *get_seccomp_state(
+ struct seccomp_state *state)
+{
+ return NULL;
+}
+
+static inline void put_seccomp_state(struct seccomp_state *state)
+{
+}
+
#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+
+extern int seccomp_show_filters(struct seccomp_state *, struct seq_file *);
+extern long seccomp_set_filter(int, char *);
+extern long seccomp_clear_filter(int);
+extern long seccomp_get_filter(int, char *, unsigned long);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+static inline int seccomp_show_filters(struct seccomp_state *state,
+ struct seq_file *m)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_set_filter(int syscall_nr, char *filter)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_clear_filter(int syscall_nr)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_get_filter(int syscall_nr,
+ char *buf, unsigned long available)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
+
#endif /* _LINUX_SECCOMP_H */
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 242ae04..e061ad0 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -35,6 +35,8 @@ struct syscall_metadata {
extern unsigned long arch_syscall_addr(int nr);
extern int init_syscall_trace(struct ftrace_event_call *call);

+extern struct syscall_metadata *syscall_nr_to_meta(int);
+
extern int reg_event_syscall_enter(struct ftrace_event_call *call);
extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
extern int reg_event_syscall_exit(struct ftrace_event_call *call);
@@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event);
enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
struct trace_event *event);
+#else
+static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
+{
+ return NULL;
+}
#endif

#ifdef CONFIG_PERF_EVENTS
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..84e7dfb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+ifeq ($(CONFIG_SECCOMP_FILTER),y)
+obj-$(CONFIG_SECCOMP) += seccomp_filter.o
+endif
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..46987d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ put_tsk_seccomp_state(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
if (err)
goto out;

+ inherit_tsk_seccomp_state(tsk, orig);
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..502ba04 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -6,12 +6,15 @@
* This defines a simple but solid secure-computing mode.
*/

+#include <linux/err.h>
+#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
+#include <linux/slab.h>
#include <linux/compat.h>
+#include <linux/unistd.h>

-/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
+#include "seccomp.h"

/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,11 +35,13 @@ static int mode1_syscalls_32[] = {

void __secure_computing(int this_syscall)
{
- int mode = current->seccomp.mode;
+ int mode = -1;
+ long ret = 0;
int * syscall;
-
+ if (current->seccomp)
+ mode = current->seccomp->mode;
switch (mode) {
- case 1:
+ case PR_SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())
@@ -47,6 +52,20 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+ case PR_SECCOMP_MODE_FILTER:
+ if (this_syscall >= NR_syscalls || this_syscall < 0)
+ break;
+ ret = seccomp_test_filters(current->seccomp, this_syscall);
+ if (!ret)
+ return;
+ /* Only check for an override if an access failure occurred. */
+ if (ret != -EACCES)
+ break;
+ ret = seccomp_maybe_apply_filters(current, this_syscall);
+ if (!ret)
+ return;
+ seccomp_filter_log_failure(this_syscall);
+ break;
default:
BUG();
}
@@ -57,30 +76,213 @@ void __secure_computing(int this_syscall)
do_exit(SIGKILL);
}

+/* seccomp_state_new - allocate a new state object. */
+struct seccomp_state *seccomp_state_new()
+{
+ struct seccomp_state *new = kzalloc(sizeof(struct seccomp_state),
+ GFP_KERNEL);
+ if (!new)
+ return NULL;
+
+ new->flags = 0;
+#ifdef CONFIG_COMPAT
+ /* Annotate if this filterset is being created by a compat task. */
+ if (is_compat_task())
+ new->flags |= SECCOMP_FLAG_COMPAT;
+#endif
+
+ atomic_set(&new->usage, 1);
+ new->filters = seccomp_filter_table_new();
+ /* Not supported errors are fine, others are a problem. */
+ if (IS_ERR(new->filters) && PTR_ERR(new->filters) != -ENOSYS) {
+ kfree(new);
+ new = NULL;
+ }
+ return new;
+}
+
+/* seccomp_state_dup - copies an existing state object. */
+struct seccomp_state *seccomp_state_dup(const struct seccomp_state *orig)
+{
+ int err;
+ struct seccomp_state *new_state = seccomp_state_new();
+
+ err = -ENOMEM;
+ if (!new_state)
+ goto fail;
+ new_state->mode = orig->mode;
+ /* Flag copying will hide if the new process is a compat task. However,
+ * if the rule was compat/non-compat and the process is the opposite,
+ * enforcement will terminate it.
+ */
+ new_state->flags = orig->flags;
+ err = seccomp_copy_all_filters(new_state->filters,
+ orig->filters);
+ if (err)
+ goto fail;
+
+ return new_state;
+fail:
+ put_seccomp_state(new_state);
+ return NULL;
+}
+
+/* get_seccomp_state - increments the reference count of @orig */
+struct seccomp_state *get_seccomp_state(struct seccomp_state *orig)
+{
+ if (!orig)
+ return NULL;
+ atomic_inc(&orig->usage);
+ return orig;
+}
+
+static void __put_seccomp_state(struct seccomp_state *orig)
+{
+ WARN_ON(atomic_read(&orig->usage));
+ seccomp_drop_all_filters(orig);
+ kfree(orig);
+}
+
+/* put_seccomp_state - decrements the reference count of @orig and may free. */
+void put_seccomp_state(struct seccomp_state *orig)
+{
+ if (!orig)
+ return;
+
+ if (atomic_dec_and_test(&orig->usage))
+ __put_seccomp_state(orig);
+}
+
long prctl_get_seccomp(void)
{
- return current->seccomp.mode;
+ if (!current->seccomp)
+ return 0;
+ return current->seccomp->mode;
}

-long prctl_set_seccomp(unsigned long seccomp_mode)
+long prctl_set_seccomp(unsigned long seccomp_mode, unsigned long flags)
{
long ret;
+ struct seccomp_state *state, *cur_state;

+ cur_state = get_seccomp_state(current->seccomp);
/* can set it only once to be even more secure */
ret = -EPERM;
- if (unlikely(current->seccomp.mode))
+ if (cur_state && unlikely(cur_state->mode))
goto out;

ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ if (seccomp_mode <= 0 || seccomp_mode > NR_SECCOMP_MODES)
+ goto out;
+
+ ret = -ENOMEM;
+ state = (cur_state ? seccomp_state_dup(cur_state) :
+ seccomp_state_new());
+ if (!state)
+ goto out;
+
+ if (seccomp_mode == PR_SECCOMP_MODE_STRICT) {
#ifdef TIF_NOTSC
disable_TSC();
#endif
- ret = 0;
}

- out:
+ rcu_assign_pointer(current->seccomp, state);
+ synchronize_rcu();
+ put_seccomp_state(cur_state); /* For the task */
+
+ /* Convert the ABI flag to the internal flag value. */
+ if (seccomp_mode == PR_SECCOMP_MODE_FILTER &&
+ (flags & PR_SECCOMP_FLAG_FILTER_ON_EXEC))
+ state->flags |= SECCOMP_FLAG_ON_EXEC;
+ /* Encourage flag values to stay synchronized explicitly. */
+ BUILD_BUG_ON(PR_SECCOMP_FLAG_FILTER_ON_EXEC != SECCOMP_FLAG_ON_EXEC);
+
+ /* Only set the thread flag once after the new state is in place. */
+ state->mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
+ ret = 0;
+
+out:
+ put_seccomp_state(cur_state); /* for the get */
+ return ret;
+}
+
+long prctl_set_seccomp_filter(unsigned long syscall_nr,
+ char __user *user_filter)
+{
+ int nr;
+ long ret;
+ char filter[SECCOMP_MAX_FILTER_LENGTH];
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+
+ ret = -EFAULT;
+ if (!user_filter ||
+ strncpy_from_user(filter, user_filter,
+ sizeof(filter) - 1) < 0)
+ goto out;
+
+ nr = (int) syscall_nr;
+ ret = seccomp_set_filter(nr, filter);
+
+out:
+ return ret;
+}
+
+long prctl_clear_seccomp_filter(unsigned long syscall_nr)
+{
+ int nr = -1;
+ long ret;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+
+ nr = (int) syscall_nr;
+ ret = seccomp_clear_filter(nr);
+
+out:
+ return ret;
+}
+
+long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
+ unsigned long available)
+{
+ int ret, nr;
+ unsigned long copied;
+ char *buf = NULL;
+ ret = -EINVAL;
+ if (!available)
+ goto out;
+ /* Ignore extra buffer space. */
+ if (available > SECCOMP_MAX_FILTER_LENGTH)
+ available = SECCOMP_MAX_FILTER_LENGTH;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+ nr = (int) syscall_nr;
+
+ ret = -ENOMEM;
+ buf = kmalloc(available, GFP_KERNEL);
+ if (!buf)
+ goto out;
+
+ ret = seccomp_get_filter(nr, buf, available);
+ if (ret < 0)
+ goto out;
+
+ /* Include the NUL byte in the copy. */
+ copied = copy_to_user(dst, buf, ret + 1);
+ ret = -ENOSPC;
+ if (copied)
+ goto out;
+
+ ret = 0;
+out:
+ kfree(buf);
return ret;
}
diff --git a/kernel/seccomp.h b/kernel/seccomp.h
new file mode 100644
index 0000000..5abd219
--- /dev/null
+++ b/kernel/seccomp.h
@@ -0,0 +1,74 @@
+/*
+ * seccomp/seccomp_filter shared internal prototypes and state.
+ *
+ * Copyright (C) 2011 Chromium OS Authors.
+ */
+
+#ifndef __KERNEL_SECCOMP_H
+#define __KERNEL_SECCOMP_H
+
+#include <linux/ftrace_event.h>
+#include <linux/seccomp.h>
+
+/* #define SECCOMP_DEBUG 1 */
+#define NR_SECCOMP_MODES 2
+
+/* Inherit the max filter length from the filtering engine. */
+#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
+
+/* Presently, flags only affect SECCOMP_FILTER. */
+#define _SECCOMP_FLAG_COMPAT 0
+#define _SECCOMP_FLAG_ON_EXEC 1
+
+#define SECCOMP_FLAG_COMPAT (1 << (_SECCOMP_FLAG_COMPAT))
+#define SECCOMP_FLAG_ON_EXEC (1 << (_SECCOMP_FLAG_ON_EXEC))
+
+
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define SECCOMP_FILTER_HASH_BITS 4
+#define SECCOMP_FILTER_HASH_SIZE (1 << SECCOMP_FILTER_HASH_BITS)
+
+struct seccomp_filter_table;
+extern struct seccomp_filter_table *seccomp_filter_table_new(void);
+extern int seccomp_copy_all_filters(struct seccomp_filter_table *,
+ const struct seccomp_filter_table *);
+extern void seccomp_drop_all_filters(struct seccomp_state *);
+
+extern int seccomp_test_filters(struct seccomp_state *, int);
+extern int seccomp_maybe_apply_filters(struct task_struct *, int);
+extern void seccomp_filter_log_failure(int);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+static inline void seccomp_filter_log_failure(int syscall)
+{
+}
+
+static inline int seccomp_maybe_apply_filters(struct task_struct *tsk,
+ int syscall_nr)
+{
+ return -ENOSYS;
+}
+
+static inline struct seccomp_filter_table *seccomp_filter_table_new(void)
+{
+ return ERR_PTR(-ENOSYS);
+}
+
+static inline int seccomp_test_filters(struct seccomp_state *state, int nr)
+{
+ return -ENOSYS;
+}
+
+extern inline int seccomp_copy_all_filters(struct seccomp_filter_table *dst,
+ const struct seccomp_filter_table *src)
+{
+ return 0;
+}
+
+static inline void seccomp_drop_all_filters(struct seccomp_state *state) { }
+
+#endif /* CONFIG_SECCOMP_FILTER */
+
+#endif /* __KERNEL_SECCOMP_H */
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..ff4e055
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,581 @@
+/* filter engine-based seccomp system call filtering.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev at chromium.org>
+ */
+
+#include <linux/compat.h>
+#include <linux/errno.h>
+#include <linux/hash.h>
+#include <linux/prctl.h>
+#include <linux/seccomp.h>
+#include <linux/seq_file.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/syscall.h>
+#include <trace/syscall.h>
+
+#include "seccomp.h"
+
+#define SECCOMP_MAX_FILTER_COUNT 512
+#define SECCOMP_WILDCARD_FILTER "1"
+
+struct seccomp_filter {
+ struct hlist_node node;
+ int syscall_nr;
+ struct syscall_metadata *data;
+ struct event_filter *event_filter;
+};
+
+struct seccomp_filter_table {
+ struct hlist_head heads[SECCOMP_FILTER_HASH_SIZE];
+ int count;
+};
+
+struct seccomp_filter_table *seccomp_filter_table_new(void)
+{
+ struct seccomp_filter_table *t =
+ kzalloc(sizeof(struct seccomp_filter_table), GFP_KERNEL);
+ if (!t)
+ return ERR_PTR(-ENOMEM);
+ return t;
+}
+
+static inline u32 syscall_hash(int syscall_nr)
+{
+ return hash_32(syscall_nr, SECCOMP_FILTER_HASH_BITS);
+}
+
+static const char *get_filter_string(struct seccomp_filter *f)
+{
+ const char *str = SECCOMP_WILDCARD_FILTER;
+ if (!f)
+ return NULL;
+
+ /* Missing event filters qualify as wildcard matches. */
+ if (!f->event_filter)
+ return str;
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+ str = ftrace_get_filter_string(f->event_filter);
+#endif
+ return str;
+}
+
+static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
+ const char *filter_string)
+{
+ int err = -ENOMEM;
+ struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
+ GFP_KERNEL);
+ if (!filter)
+ goto fail;
+
+ INIT_HLIST_NODE(&filter->node);
+ filter->syscall_nr = syscall_nr;
+ filter->data = syscall_nr_to_meta(syscall_nr);
+
+ /* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
+ * using a predicate at all.
+ */
+ if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
+ goto out;
+
+ /* Argument-based filtering only works on ftrace-hooked syscalls. */
+ if (!filter->data) {
+ err = -ENOSYS;
+ goto fail;
+ }
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+ err = ftrace_parse_filter(&filter->event_filter,
+ filter->data->enter_event->event.type,
+ filter_string);
+ if (err)
+ goto fail;
+#endif
+
+out:
+ return filter;
+
+fail:
+ kfree(filter);
+ return ERR_PTR(err);
+}
+
+static void free_seccomp_filter(struct seccomp_filter *filter)
+{
+#ifdef CONFIG_FTRACE_SYSCALLS
+ ftrace_free_filter(filter->event_filter);
+#endif
+ kfree(filter);
+}
+
+static struct seccomp_filter *copy_seccomp_filter(struct seccomp_filter *orig)
+{
+ return alloc_seccomp_filter(orig->syscall_nr, get_filter_string(orig));
+}
+
+/* Returns the matching filter or NULL */
+static struct seccomp_filter *find_filter(struct seccomp_state *state,
+ int syscall)
+{
+ struct hlist_node *this, *pos;
+ struct seccomp_filter *filter = NULL;
+
+ u32 head = syscall_hash(syscall);
+ if (head >= SECCOMP_FILTER_HASH_SIZE)
+ goto out;
+
+ hlist_for_each_safe(this, pos, &state->filters->heads[head]) {
+ filter = hlist_entry(this, struct seccomp_filter, node);
+ if (filter->syscall_nr == syscall)
+ goto out;
+ }
+
+ filter = NULL;
+
+out:
+ return filter;
+}
+
+/* Safely drops all filters for a given syscall. This should only be called
+ * on unattached seccomp_state objects.
+ */
+static void drop_filter(struct seccomp_state *state, int syscall_nr)
+{
+ struct seccomp_filter *filter = find_filter(state, syscall_nr);
+ if (!filter)
+ return;
+
+ WARN_ON(state->filters->count == 0);
+ state->filters->count--;
+ hlist_del(&filter->node);
+ free_seccomp_filter(filter);
+}
+
+/* This should only be called on unattached seccomp_state objects. */
+static int add_filter(struct seccomp_state *state, int syscall_nr,
+ char *filter_string)
+{
+ struct seccomp_filter *filter;
+ struct hlist_head *head;
+ char merged[SECCOMP_MAX_FILTER_LENGTH];
+ int ret;
+ u32 hash = syscall_hash(syscall_nr);
+
+ ret = -EINVAL;
+ if (state->filters->count == SECCOMP_MAX_FILTER_COUNT - 1)
+ goto out;
+
+ filter_string = strstrip(filter_string);
+
+ /* Disallow empty strings. */
+ if (filter_string[0] == 0)
+ goto out;
+
+ /* Get the right list head. */
+ head = &state->filters->heads[hash];
+
+ /* Find out if there is an existing entry to append to and
+ * build the resultant filter string. The original filter can be
+ * destroyed here since the caller should be operating on a copy.
+ */
+ filter = find_filter(state, syscall_nr);
+ if (filter) {
+ int expected = snprintf(merged, sizeof(merged), "(%s) && %s",
+ get_filter_string(filter),
+ filter_string);
+ ret = -E2BIG;
+ if (expected >= sizeof(merged) || expected < 0)
+ goto out;
+ filter_string = merged;
+ hlist_del(&filter->node);
+ free_seccomp_filter(filter);
+ }
+
+ /* When in seccomp filtering mode, only allow additions. */
+ ret = -EACCES;
+ if (filter == NULL && state->mode == PR_SECCOMP_MODE_FILTER)
+ goto out;
+
+ ret = 0;
+ filter = alloc_seccomp_filter(syscall_nr, filter_string);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ state->filters->count++;
+ hlist_add_head(&filter->node, head);
+out:
+ return ret;
+}
+
+/* Wrap optional ftrace syscall support. Returns 1 on match or if ftrace is not
+ * supported.
+ */
+static int do_ftrace_syscall_match(struct event_filter *event_filter)
+{
+ int err = 1;
+#ifdef CONFIG_FTRACE_SYSCALLS
+ uint8_t syscall_state[64];
+
+ memset(syscall_state, 0, sizeof(syscall_state));
+
+ /* The generic tracing entry can remain zeroed. */
+ err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
+ NULL);
+ if (err)
+ return 0;
+
+ err = filter_match_preds(event_filter, syscall_state);
+#endif
+ return err;
+}
+
+/* 1 on match, 0 otherwise. */
+static int filter_match_current(struct seccomp_filter *filter)
+{
+ /* If no event filter exists, we assume a wildcard match. */
+ if (!filter->event_filter)
+ return 1;
+
+ return do_ftrace_syscall_match(filter->event_filter);
+}
+
+#ifndef KSTK_EIP
+#define KSTK_EIP(x) 0L
+#endif
+
+static const char *syscall_nr_to_name(int syscall)
+{
+ const char *syscall_name = "unknown";
+ struct syscall_metadata *data = syscall_nr_to_meta(syscall);
+ if (data)
+ syscall_name = data->name;
+ return syscall_name;
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ printk(KERN_INFO
+ "%s[%d]: system call %d (%s) blocked at ip:%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ syscall_nr_to_name(syscall), KSTK_EIP(current));
+}
+
+/**
+ * seccomp_drop_all_filters - cleans up the filter list and frees the table
+ * @state: the seccomp_state to destroy the filters in.
+ */
+void seccomp_drop_all_filters(struct seccomp_state *state)
+{
+ struct hlist_node *this, *pos;
+ int head;
+ if (!state->filters)
+ return;
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ hlist_for_each_safe(this, pos, &state->filters->heads[head]) {
+ struct seccomp_filter *f = hlist_entry(this,
+ struct seccomp_filter, node);
+ WARN_ON(state->filters->count == 0);
+ hlist_del(this);
+ free_seccomp_filter(f);
+ state->filters->count--;
+ }
+ }
+ kfree(state->filters);
+}
+
+/**
+ * seccomp_copy_all_filters - copies all filters from src to dst.
+ *
+ * @dst: seccomp_filter_table to populate.
+ * @src: table to read from.
+ * Returns non-zero on failure.
+ * Both the source and the destination should have no simultaneous
+ * writers, and dst should be exclusive to the caller.
+ */
+int seccomp_copy_all_filters(struct seccomp_filter_table *dst,
+ const struct seccomp_filter_table *src)
+{
+ struct seccomp_filter *filter;
+ int head, ret = 0;
+ BUG_ON(!dst || !src);
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ struct hlist_node *pos;
+ hlist_for_each_entry(filter, pos, &src->heads[head], node) {
+ struct seccomp_filter *new_filter =
+ copy_seccomp_filter(filter);
+ if (IS_ERR(new_filter)) {
+ ret = PTR_ERR(new_filter);
+ goto done;
+ }
+ hlist_add_head(&new_filter->node,
+ &dst->heads[head]);
+ dst->count++;
+ }
+ }
+
+done:
+ return ret;
+}
+
+/**
+ * seccomp_show_filters - prints the filter state to a seq_file
+ * @state: the seccomp_state to enumerate the filter and bitmask of
+ * @m: the prepared seq_file to receive the data
+ *
+ * Returns 0 on a successful write.
+ */
+int seccomp_show_filters(struct seccomp_state *state, struct seq_file *m)
+{
+ int head;
+ struct hlist_node *pos;
+ struct seccomp_filter *filter;
+ int filtering = 0;
+ if (!state)
+ return 0;
+ if (!state->filters)
+ return 0;
+
+ filtering = (state->mode == 2);
+ filtering &= !(state->flags & SECCOMP_FLAG_ON_EXEC);
+ seq_printf(m, "Filtering: %d\n", filtering);
+ seq_printf(m, "FilterCount: %d\n", state->filters->count);
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ hlist_for_each_entry(filter, pos, &state->filters->heads[head],
+ node) {
+ seq_printf(m, "SystemCall: %d (%s)\n",
+ filter->syscall_nr,
+ syscall_nr_to_name(filter->syscall_nr));
+ seq_printf(m, "Filter: %s\n",
+ get_filter_string(filter));
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(seccomp_show_filters);
+
+/**
+ * seccomp_maybe_apply_filters - conditionally applies seccomp filters
+ * @tsk: task to update
+ * @syscall_nr: current system call in progress
+ * tsk must already be in seccomp filter mode.
+ *
+ * Returns 0 if the call should be allowed or state has been updated.
+ * This call is only reach if no filters matched the current system call.
+ * In some cases, such as when the ON_EXEC flag is set, failure should
+ * not be terminal.
+ */
+int seccomp_maybe_apply_filters(struct task_struct *tsk, int syscall_nr)
+{
+ struct seccomp_state *state, *new_state = NULL;
+ int ret = -EACCES;
+
+ /* There's no question of application if ON_EXEC is not set. */
+ state = get_seccomp_state(tsk->seccomp);
+ if ((state->flags & SECCOMP_FLAG_ON_EXEC) == 0)
+ goto out;
+
+ ret = 0;
+ if (syscall_nr != __NR_execve)
+ goto out;
+
+ new_state = seccomp_state_dup(state);
+ ret = -ENOMEM;
+ if (!new_state)
+ goto out;
+
+ ret = 0;
+ new_state->flags &= ~(SECCOMP_FLAG_ON_EXEC);
+ rcu_assign_pointer(tsk->seccomp, new_state);
+ synchronize_rcu();
+ put_seccomp_state(state); /* for the task */
+
+out:
+ put_seccomp_state(state); /* for the get */
+ return ret;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @state: seccomp_state of current to use.
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(struct seccomp_state *state, int syscall)
+{
+ struct seccomp_filter *filter = NULL;
+ int ret;
+
+#ifdef CONFIG_COMPAT
+ ret = -EPERM;
+ if (is_compat_task() == !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+ printk(KERN_INFO "%s[%d]: seccomp filter compat() mismatch.\n",
+ current->comm, task_pid_nr(current));
+ goto out;
+ }
+#endif
+
+ ret = 0;
+ filter = find_filter(state, syscall);
+ if (filter && filter_match_current(filter))
+ goto out;
+
+ ret = -EACCES;
+out:
+ return ret;
+}
+
+/**
+ * seccomp_get_filter - copies the filter_string into "buf"
+ * @syscall_nr: system call number to look up
+ * @buf: destination buffer
+ * @bufsize: available space in the buffer.
+ *
+ * Looks up the filter for the given system call number on current. If found,
+ * the string length of the NUL-terminated buffer is returned and < 0 is
+ * returned on error. The NUL byte is not included in the length.
+ */
+long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
+{
+ struct seccomp_state *state;
+ struct seccomp_filter *filter;
+ long ret = -ENOENT;
+
+ if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
+ bufsize = SECCOMP_MAX_FILTER_LENGTH;
+
+ state = get_seccomp_state(current->seccomp);
+ if (!state)
+ goto out;
+
+ filter = find_filter(state, syscall_nr);
+ if (!filter)
+ goto out;
+
+ ret = strlcpy(buf, get_filter_string(filter), bufsize);
+ if (ret >= bufsize) {
+ ret = -ENOSPC;
+ goto out;
+ }
+ /* Zero out any remaining buffer, just in case. */
+ memset(buf + ret, 0, bufsize - ret);
+out:
+ put_seccomp_state(state);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_get_filter);
+
+/**
+ * seccomp_clear_filter: clears the seccomp filter for a syscall.
+ * @syscall_nr: the system call number to clear filters for.
+ *
+ * (acts as a frontend for seccomp_set_filter. All restrictions
+ * apply)
+ *
+ * Returns 0 on success.
+ */
+long seccomp_clear_filter(int syscall_nr)
+{
+ return seccomp_set_filter(syscall_nr, NULL);
+}
+EXPORT_SYMBOL_GPL(seccomp_clear_filter);
+
+/**
+ * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall.
+ * @syscall_nr: system call number to apply the filter to.
+ * @filter: ftrace filter string to apply.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called.
+ *
+ * New filters may be added for system calls when the current task is
+ * not in a secure computing mode (seccomp). Otherwise, filters may only
+ * be added to already filtered system call entries. Any additions will
+ * be &&'d with the existing filter string to ensure no expansion of privileges
+ * will be possible.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_set_filter(int syscall_nr, char *filter)
+{
+ struct seccomp_state *state, *orig_state;
+ long ret = -EINVAL;
+
+ orig_state = get_seccomp_state(current->seccomp);
+
+ /* Prior to mutating the state, create a duplicate to avoid modifying
+ * the behavior of other instances sharing the state and ensure
+ * consistency.
+ */
+ state = (orig_state ? seccomp_state_dup(orig_state) :
+ seccomp_state_new());
+ if (!state) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* A NULL filter doubles as a drop value, but the exposed prctl
+ * interface requires a trip through seccomp_clear_filter().
+ * Filter dropping is allowed across the is_compat_task() barrier.
+ */
+ ret = 0;
+ if (filter == NULL) {
+ drop_filter(state, syscall_nr);
+ goto assign;
+ }
+
+ /* Avoid amiguous filters which may have been inherited from a parent
+ * with different syscall numbers for the logically same calls.
+ */
+#ifdef CONFIG_COMPAT
+ ret = -EACCES;
+ if (is_compat_task() != !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+ if (state->filters->count)
+ goto free_state;
+ /* It's safe to add if there are no existing ambiguous rules.*/
+ if (is_compat_task())
+ state->flags |= SECCOMP_FLAG_COMPAT;
+ else
+ state->flags &= ~(SECCOMP_FLAG_COMPAT);
+ }
+#endif
+
+ ret = add_filter(state, syscall_nr, filter);
+ if (ret)
+ goto free_state;
+
+assign:
+ rcu_assign_pointer(current->seccomp, state);
+ synchronize_rcu();
+ put_seccomp_state(orig_state); /* for the task */
+out:
+ put_seccomp_state(orig_state); /* for the get */
+ return ret;
+
+free_state:
+ put_seccomp_state(orig_state); /* for the get */
+ put_seccomp_state(state); /* drop the dup/new */
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_set_filter);
diff --git a/kernel/sys.c b/kernel/sys.c
index af468ed..d29003a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_ENDIAN:
error = SET_ENDIAN(me, arg2);
break;
-
case PR_GET_SECCOMP:
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
- error = prctl_set_seccomp(arg2);
+ error = prctl_set_seccomp(arg2, arg3);
+ break;
+ case PR_SET_SECCOMP_FILTER:
+ error = prctl_set_seccomp_filter(arg2,
+ (char __user *) arg3);
+ break;
+ case PR_CLEAR_SECCOMP_FILTER:
+ error = prctl_clear_seccomp_filter(arg2);
+ break;
+ case PR_GET_SECCOMP_FILTER:
+ error = prctl_get_seccomp_filter(arg2,
+ (char __user *) arg3,
+ arg4);
break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
--
1.7.0.4
Loading...