DragonFly kernel List (threaded) for 2003-12
:I agree and disagree both in principle and practice :-). Actually, I
:think you'll find that if you dig a little deeper, the placement of MAC
:Framework entry points is done exactly on the philosophy you describe. In
:order to prevent race conditions, you have to perform access control
:checks on the actual objects, not the names provided in system calls. We
:place our checks at the front ends of various subsystems: i.e., the top
:layer of VFS, the top layer of the process signalling pieces, etc. This
:is the point where the name has been resolved to the object, and the
:correct locks are held to make sure you can perform a consistent check. In
:a traditional UNIX kernel, you cannot do this safely at the system call
:layer using wrappers, because that involves multiple lookups, which can be
:raced (time of check, time of use).
:However, if you have a compartmentalized kernel (i.e., microkernel) with
:message passing between subsystems, subsystems can perform the checks at
:the point where the message enters, which might accomplish what both you
:want, and what I want architecturally :-). However, that relies on
:cleaning up object naming, perhaps in the style of Mach
:ports/capabilities, so that the names used in messages are "authoritative"
:and safe to control with.
:The problem with system call level wrappers is pretty hard to fix,
:however, and sometimes, it can cause more security vulnerabilities than it
:solves. Take, for example, systrace's interception and replacement of
:path names. There are actually two race conditions here: first, a dual
:copyin, which can be raced by threaded processes and shared memory -- this
:is fixed through proper encapsulation of system call arguments. Second, a
:semantic race in the implementation of the file system code, which is a
:lot harder to solve. Systrace's lookup occurs before the kernel has
:resolved the file name from the string passed by the process, so the
:lookup actually occurs twice: once in the wrapper for the control, and
:once when the actual system call does the work. Niels has explored using
:a "look aside buffer" to cache system call arguments to address the first
:problem, but I think the "Separation" in DragonFly will solve this much
:more cleanly. The second can't be avoided unless the name used for the
:test acts more like a capability (or, you combine the checks with the same
:locked reference used in the file system code, as in FreeBSD).
:Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
:robert@xxxxxxxxxxxxxxxxx Senior Research Scientist, McAfee Research
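To make the dual-copyin race from the quoted message concrete, here is a minimal user-space sketch. It is only an illustration under stated assumptions: user_buf stands in for user memory, demo_copyin() for the kernel's copyin, and the "/tmp/" policy, attacker_rewrites(), and struct demo_msg are hypothetical names invented for this example. The race is simulated deterministically by rewriting the buffer between the check and the use instead of using a second thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PATH_MAX 64

/* Stand-in for user memory that another thread could rewrite at any time. */
static char user_buf[DEMO_PATH_MAX];

/* Hypothetical helper: place a path in "user" memory before the call. */
void set_user_path(const char *path)
{
    assert(strlen(path) < sizeof(user_buf));
    strcpy(user_buf, path);
}

/* Hypothetical copyin(): copy a NUL-terminated path out of "user" memory. */
static void demo_copyin(const char *uaddr, char *kaddr, size_t len)
{
    size_t i;

    assert(len > 0);
    for (i = 0; i + 1 < len && uaddr[i] != '\0'; i++)
        kaddr[i] = uaddr[i];
    kaddr[i] = '\0';
}

/* Hypothetical policy: only paths under /tmp/ are allowed. */
static int policy_allows(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0;
}

/* Simulates a racing thread swapping the path between check and use. */
static void attacker_rewrites(void)
{
    strcpy(user_buf, "/etc/passwd");
}

/*
 * Wrapper style: the check and the syscall each copy the argument in.
 * Returns -1 if the check rejects, otherwise 1 if the path actually
 * used violates the policy and 0 if it does not.
 */
int wrapper_style(char *used, size_t len)
{
    char checked[DEMO_PATH_MAX];

    demo_copyin(user_buf, checked, sizeof(checked)); /* first copyin */
    if (!policy_allows(checked))
        return -1;
    attacker_rewrites();                             /* race window */
    demo_copyin(user_buf, used, len);                /* second copyin */
    return !policy_allows(used);
}

/* Encapsulated style: one copyin into a message; check and use share it. */
struct demo_msg {
    char path[DEMO_PATH_MAX];
};

int encapsulated_style(char *used, size_t len)
{
    struct demo_msg msg;

    demo_copyin(user_buf, msg.path, sizeof(msg.path)); /* only copyin */
    if (!policy_allows(msg.path))
        return -1;
    attacker_rewrites();                 /* too late: msg is private */
    demo_copyin(msg.path, used, len);    /* use the checked copy */
    return !policy_allows(used);
}
```

Starting from "/tmp/scratch", wrapper_style() ends up using "/etc/passwd" and returns 1, while encapsulated_style() still uses "/tmp/scratch" and returns 0. This only addresses the first race; the second (the double name lookup) is not modeled here.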
Yah. I wasn't planning on actually porting systrace, just porting
the concepts into the message-passing framework as we implement it.
We will definitely be leveraging off of the kern_*() syscall separation
The sequence of events will be:
* Implement syscall message encapsulation through a 'syscall emulation
library' which is unconditionally mapped by the kernel into user memory.
i.e. think of it as a syscall.so userland library which implements all
the syscalls instead of libc implementing the syscalls.
* Revector int 0x80 to enter into this library via userland (SEL_UPL)
to handle legacy int 0x80 syscalls. This effectively 'turns off'
direct kernel access via int 0x80.
* Make the native libc directly aware of the kernel-mapped syscall
library (i.e. have it call into syscall.so directly instead of issuing
* Move all emulation code (linux, sysv, etc...) out of the kernel and
into userland via syscall.so. The kernel will select the correct
syscall*.so library to map into the user process when exec()ing.
This removes all potential security issues from the sysv and linux
* Implement a kernel-loadable security 'filter' on syscall messages
that intercepts the syscall message. Actually, make it a layering
* per-process (child inherited)
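A rough sketch of what a layered, per-process, child-inherited filter on syscall messages might look like. Nothing here is actual DragonFly code: struct demo_sysmsg, struct demo_proc, demo_fork(), demo_dispatch(), and the example filter are all hypothetical names, and the sketch assumes the message already holds kernel-side copies of its arguments by the time the filters see it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FILTERS 4
#define DEMO_EPERM       1

/* Hypothetical syscall message with its arguments already copied in. */
struct demo_sysmsg {
    int  number;     /* hypothetical syscall number */
    char path[64];   /* path argument, empty if unused */
};

/* A filter returns 0 to pass the message on, or an error code to reject. */
typedef int (*demo_filter_fn)(const struct demo_sysmsg *msg);

/* Per-process filter stack (the "layering"). */
struct demo_proc {
    demo_filter_fn filters[DEMO_MAX_FILTERS];
    int            nfilters;
};

void demo_proc_init(struct demo_proc *p)
{
    memset(p, 0, sizeof(*p));
}

/* Layer another filter on top of the existing ones. */
int demo_push_filter(struct demo_proc *p, demo_filter_fn fn)
{
    assert(fn != NULL);
    if (p->nfilters >= DEMO_MAX_FILTERS)
        return -1;
    p->filters[p->nfilters++] = fn;
    return 0;
}

/* Stand-in for fork(): the child inherits the parent's filter stack. */
void demo_fork(const struct demo_proc *parent, struct demo_proc *child)
{
    *child = *parent;
}

/*
 * Run the message through every filter, most recently pushed first.
 * Returns the first non-zero filter result, or 0 if all filters pass;
 * a real kernel would call the syscall implementation at that point.
 */
int demo_dispatch(const struct demo_proc *p, const struct demo_sysmsg *msg)
{
    int i, error;

    for (i = p->nfilters - 1; i >= 0; i--) {
        error = p->filters[i](msg);
        if (error != 0)
            return error;
    }
    return 0;
}

/* Example filter: reject any message whose path is under /etc/. */
int demo_deny_etc(const struct demo_sysmsg *msg)
{
    return strncmp(msg->path, "/etc/", 5) == 0 ? DEMO_EPERM : 0;
}
```

With demo_deny_etc pushed onto a parent and the child created through demo_fork(), a message carrying "/etc/passwd" is rejected with DEMO_EPERM in the child as well, while "/tmp/a" passes through.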
I see two ways to implement the filter mechanism:
(1) The filter would have to implement the copyin/copyout layer and then
call the syscall meat layer. Any filtered syscalls that don't
actually have to examine user-supplied data could simply call the
main syscall entry point after acceptance, so it would not be too
messy. This is easier to do than (2) but makes the layering of
multiple syscall filters difficult.
(2) Do all copyins necessary for filter operations (basically anything
that passes a path) prior to executing the first filter. Then the
filters need only deal with the data. Harder but probably the more