DragonFly BSD

CVS log for src/sys/kern/sys_pipe.c


Keyword substitution: kv
Default branch: MAIN


Revision 1.50: download - view: text, markup, annotated - select for diffs
Tue Sep 9 04:06:13 2008 UTC (5 years, 7 months ago) by dillon
Branches: MAIN
CVS tags: HEAD
Diff to: previous 1.49: preferred, unified
Changes since revision 1.49: +4 -4 lines
Fix issues with the scheduler that were causing unnecessary reschedules
between tightly coupled processes as well as inefficient reschedules under
heavy loads.

The basic problem is that a process entering the kernel is 'passively
released', meaning its thread priority is left at TDPRI_USER_NORM.  The
thread priority is only raised to TDPRI_KERN_USER if the thread switches
out.  This has the side effect of forcing a LWKT reschedule when any other
user process woke up from a blocked condition in the kernel, regardless of
its user priority, because its LWKT thread was at the higher
TDPRI_KERN_USER priority.   This resulted in some significant switching
cavitation under load.

There is a twist here because we do not want to starve threads running in
the kernel acting on behalf of a very low priority user process, because
doing so can deadlock the namecache or other kernel elements that sleep with
lockmgr locks held.  In addition, the 'other' LWKT thread might be associated
with a much higher priority user process that we *DO* in fact want to give
cpu to.

The solution is elegant.  First, do not force a LWKT reschedule for the
above case.  Second, force a LWKT reschedule on every hard clock.  Remove
all the old hacks.  That's it!

The result is that the current thread is allowed to return to user
mode and run until the next hard clock even if other LWKT threads (running
on behalf of a user process) are runnable.  Pure kernel LWKT threads still
get absolute priority, of course.  When the hard clock occurs the other LWKT
threads get the cpu and at the end of that whole mess most of those
LWKT threads will be trying to return to user mode and the user scheduler
will be able to select the best one.  Doing this on a hardclock boundary
prevents cavitation from occurring at the syscall enter and return boundary.

With this change the TDF_NORESCHED and PNORESCHED flags and their associated
code hacks have also been removed, along with lwkt_checkpri_self() which
is no longer needed.

Revision 1.49: download - view: text, markup, annotated - select for diffs
Thu Jun 5 18:06:32 2008 UTC (5 years, 10 months ago) by swildner
Branches: MAIN
CVS tags: DragonFly_RELEASE_2_0_Slip, DragonFly_RELEASE_2_0, DragonFly_Preview
Diff to: previous 1.48: preferred, unified
Changes since revision 1.48: +4 -4 lines
* Fix some cases where NULL was used but 0 was meant (and vice versa).

* Remove some bogus casts of NULL to (void *).

Revision 1.48: download - view: text, markup, annotated - select for diffs
Sat May 10 01:25:55 2008 UTC (5 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.47: preferred, unified
Changes since revision 1.47: +11 -7 lines
Fix feature logic so changing kern.pipe.dwrite_enable on the fly works
properly.  Before it could cause processes to block forever.

Revision 1.47: download - view: text, markup, annotated - select for diffs
Fri May 9 07:24:45 2008 UTC (5 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.46: preferred, unified
Changes since revision 1.46: +6 -3 lines
Fix many bugs and issues in the VM system, particularly related to
heavy paging.

* (cleanup) PG_WRITEABLE is now set by the low level pmap code and not by
  high level code.  It means 'This page may contain a managed page table
  mapping which is writeable', meaning that hardware can dirty the page
  at any time.  The page must be tested via appropriate pmap calls before
  being disposed of.

* (cleanup) PG_MAPPED is now handled by the low level pmap code and only
  applies to managed mappings.  There is still a bit of cruft left over
  related to the pmap code's page table pages but the high level code is now
  clean.

* (bug) Various XIO, SFBUF, and MSFBUF routines which bypass normal paging
  operations were not properly dirtying pages when the caller intended
  to write to them.

* (bug) vfs_busy_pages in kern/vfs_bio.c had a busy race.  Separate the code
  out to ensure that we have marked all the pages as undergoing IO before we
  call vm_page_protect().  vm_page_protect(... VM_PROT_NONE) can block
  under very heavy paging conditions and if the pages haven't been marked
  for IO that could blow up the code.

* (optimization) Make a minor optimization.  When busying pages for write
  IO, downgrade the page table mappings to read-only instead of removing
  them entirely.

* (bug) In platform/pc32/i386/pmap.c fix various places where
  pmap_inval_add() was being called at the wrong point.  Only one was
  critical, in pmap_enter(), where pmap_inval_add() was being called so far
  away from the pmap entry being modified that it could wind up being flushed
  out prior to the modification, breaking the cpusync required.

  pmap.c also contains most of the work involved in the PG_MAPPED and
  PG_WRITEABLE changes.

* (bug) Close numerous pte updating races with hardware setting the
  modified bit.  There is still one race left (in pmap_enter()).

* (bug) Disable pmap_copy() entirely.   Fix most of the bugs anyway, but
  there is still one left in the handling of the srcmpte variable.

* (cleanup) Change vm_page_dirty() from an inline to a real procedure, and
  move the code which set the object to writeable/maybedirty into
  vm_page_dirty().

* (bug) Calls to vm_page_protect(... VM_PROT_NONE) can block.  Fix all cases
  where this call was made with a non-busied page.  All such calls are
  now made with a busied page, preventing blocking races from re-dirtying
  or remapping the page unexpectedly.

  (Such blockages could only occur during heavy paging activity where the
  underlying page table pages are being actively recycled).

* (bug) Fix the pageout code to properly mark pages as undergoing I/O before
  changing their protection bits.

* (bug) Busy pages undergoing zeroing or partial zeroing in the vnode pager
  (vm/vnode_pager.c) to avoid unexpected effects.

Revision 1.46: download - view: text, markup, annotated - select for diffs
Thu May 8 01:31:01 2008 UTC (5 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.45: preferred, unified
Changes since revision 1.45: +21 -9 lines
Fix some lock ordering issues in the pipe code.

In particular fix a bug in the pipe_write() code when multiple writers
are present that could cause garbage to be injected into the pipe due
to a resize possibly occurring while wpipe->pipe_buffer.cnt is non-zero.

Revision 1.45: download - view: text, markup, annotated - select for diffs
Sun May 4 08:42:03 2008 UTC (5 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.44: preferred, unified
Changes since revision 1.44: +5 -1 lines
The direct-write pipe code has a bug in it somewhere when the system is
paging heavily.  Disable it for now.

Revision 1.44: download - view: text, markup, annotated - select for diffs
Thu Dec 28 21:24:01 2006 UTC (7 years, 3 months ago) by dillon
Branches: MAIN
CVS tags: DragonFly_RELEASE_1_8_Slip, DragonFly_RELEASE_1_8, DragonFly_RELEASE_1_12_Slip, DragonFly_RELEASE_1_12, DragonFly_RELEASE_1_10_Slip, DragonFly_RELEASE_1_10
Diff to: previous 1.43: preferred, unified
Changes since revision 1.43: +7 -7 lines
Make kernel_map, buffer_map, clean_map, exec_map, and pager_map direct
structural declarations instead of pointers.  Clean up all related code,
in particular kmem_suballoc().

Remove the offset calculation for kernel_object.  kernel_object's page
indices used to be relative to the start of kernel virtual memory in order
to improve the performance of VM page scanning algorithms.  The optimization
is no longer needed now that VM objects use Red-Black trees.  Removal of
the offset simplifies a number of calculations and makes the code more
readable.

Revision 1.43: download - view: text, markup, annotated - select for diffs
Sat Dec 23 23:47:54 2006 UTC (7 years, 3 months ago) by swildner
Branches: MAIN
Diff to: previous 1.42: preferred, unified
Changes since revision 1.42: +7 -17 lines
Ansify function declarations and fix some minor style issues.

In-collaboration-with: Alexey Slynko <slynko@tronet.ru>

Revision 1.42: download - view: text, markup, annotated - select for diffs
Mon Sep 11 20:25:01 2006 UTC (7 years, 7 months ago) by dillon
Branches: MAIN
Diff to: previous 1.41: preferred, unified
Changes since revision 1.41: +5 -2 lines
Move flag(s) representing the type of vm_map_entry into its own vm_maptype_t
type.  This is a precursor to adding a new VM mapping type for virtualized
page tables.

Revision 1.41: download - view: text, markup, annotated - select for diffs
Tue Sep 5 00:55:45 2006 UTC (7 years, 7 months ago) by dillon
Branches: MAIN
Diff to: previous 1.40: preferred, unified
Changes since revision 1.40: +2 -2 lines
Rename malloc->kmalloc, free->kfree, and realloc->krealloc.  Pass 1

Revision 1.40: download - view: text, markup, annotated - select for diffs
Wed Aug 2 01:25:25 2006 UTC (7 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.39: preferred, unified
Changes since revision 1.39: +8 -4 lines
Get rid of some unused fields in the fileops and adjust the declarations
to use the '.field = blah' initialization method.
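
As an illustration of the '.field = blah' style mentioned above, here is a
minimal sketch using C99 designated initializers.  The structure and function
names (demo_fileops, demo_read, and so on) are hypothetical stand-ins and not
the kernel's actual struct fileops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down operations vector; not the real struct fileops. */
struct demo_fileops {
	int (*fo_read)(int fd);
	int (*fo_write)(int fd);
	int (*fo_close)(int fd);
};

static int demo_read(int fd)  { return fd + 1; }
static int demo_close(int fd) { (void)fd; return 0; }

/*
 * Designated initializers name each field, so the order of the fields in
 * the structure no longer matters and any field left out (fo_write here)
 * is zero-initialized.
 */
static const struct demo_fileops demo_ops = {
	.fo_read  = demo_read,
	.fo_close = demo_close,
};

/* Returns -1 when the vector does not provide a write operation. */
static int demo_call_write(const struct demo_fileops *ops, int fd)
{
	assert(ops != NULL);
	return ops->fo_write ? ops->fo_write(fd) : -1;
}
```

With this style, removing an unused field from the structure does not silently
shift the remaining initializers into the wrong slots.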

Revision 1.39: download - view: text, markup, annotated - select for diffs
Tue Jun 13 08:12:03 2006 UTC (7 years, 10 months ago) by dillon
Branches: MAIN
CVS tags: DragonFly_RELEASE_1_6_Slip, DragonFly_RELEASE_1_6
Diff to: previous 1.38: preferred, unified
Changes since revision 1.38: +25 -8 lines
Add kernel syscall support for explicit blocking and non-blocking I/O
regardless of the setting applied to the file pointer.

send/sendmsg/sendto/recv/recvmsg/recvfrom: New MSG_ flags defined in
sys/socket.h may be passed to these functions to override the settings
applied to the file pointer on a per-I/O basis.

MSG_FBLOCKING	- Force the operation to be blocking
MSG_FNONBLOCKING- Force the operation to be non-blocking

pread/preadv/pwrite/pwritev: These system calls have been renamed and
wrappers will be added to libc.  The new system calls are prefixed with
a double underscore (like getcwd vs __getcwd) and include an additional
flags argument.  The new flags are defined in sys/fcntl.h and may be
used to override settings applied to the file pointer on a per-I/O basis.

Additionally, the internal __ versions of these functions now accept an
offset of -1 to mean 'degenerate into a read/readv/write/writev' (i.e.
use the offset in the file pointer and update it on completion).

O_FBLOCKING	- Force the operation to be blocking
O_FNONBLOCKING	- Force the operation to be non-blocking
O_FAPPEND	- Force the write operation to append (to a regular file)
O_FOFFSET	- (implied if the offset != -1) - offset is valid
O_FSYNCWRITE	- Force a synchronous write
O_FASYNCWRITE	- Force an asynchronous write
O_FUNBUFFERED	- Force an unbuffered operation (O_DIRECT)
O_FBUFFERED	- Force a buffered operation (negate O_DIRECT)

If the flags do not specify an operation (e.g. neither FBLOCKING nor
FNONBLOCKING is set), then the settings in the file pointer are used.
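
The override rule described above can be sketched as a small helper.  This is
only an illustration: the DEMO_ constants use placeholder bit values (the real
ones are defined in sys/fcntl.h), demo_io_is_nonblocking() is a hypothetical
name, and treating both override bits at once as a caller error is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for illustration only. */
#define DEMO_O_NONBLOCK     0x0004	/* setting stored in the file pointer */
#define DEMO_O_FBLOCKING    0x0100	/* per-I/O: force blocking */
#define DEMO_O_FNONBLOCKING 0x0200	/* per-I/O: force non-blocking */

/* Decide whether a single I/O should be non-blocking. */
static bool demo_io_is_nonblocking(int file_flags, int io_flags)
{
	const int both = DEMO_O_FBLOCKING | DEMO_O_FNONBLOCKING;

	/* Assumption of this sketch: requesting both overrides is a bug. */
	assert((io_flags & both) != both);

	if (io_flags & DEMO_O_FBLOCKING)
		return false;
	if (io_flags & DEMO_O_FNONBLOCKING)
		return true;
	/* No per-I/O override: fall back to the file pointer's setting. */
	return (file_flags & DEMO_O_NONBLOCK) != 0;
}
```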

The original system calls will become wrappers in libc, without the flags
arguments.  The new system calls will be made available to libc_r to allow
it to perform non-blocking I/O without having to mess with a descriptor's
file flags.

NOTE: the new __pread and __pwrite system calls are backwards compatible
with the originals due to a pad byte that libc always set to 0.
The new __preadv and __pwritev system calls are NOT backwards compatible,
but since they were added to HEAD just two months ago I have decided
to not renumber them either.

NOTE: The subrev has been bumped to 1.5.4 and installworld will refuse to
install if you are not running at least a 1.5.4 kernel.

Revision 1.38: download - view: text, markup, annotated - select for diffs
Mon Jun 5 07:26:10 2006 UTC (7 years, 10 months ago) by dillon
Branches: MAIN
Diff to: previous 1.37: preferred, unified
Changes since revision 1.37: +1 -1 lines
Modify kern/makesyscall.sh to prefix all kernel system call procedures
with "sys_".  Modify all related kernel procedures to use the new naming
convention.  This gets rid of most of the namespace overloading between
the kernel and standard header files.

Revision 1.37: download - view: text, markup, annotated - select for diffs
Fri May 26 00:33:09 2006 UTC (7 years, 10 months ago) by dillon
Branches: MAIN
Diff to: previous 1.36: preferred, unified
Changes since revision 1.36: +81 -31 lines
More MP work.

* Incorporate fd_knlistsize initialization into fsetfd().

* Mark all fileops vectors as MPSAFE (but get the mplock for most of them).
  Clean up a number of fileops routines, mainly *_ioctl().

* Make crget(), crhold(), and crfree() MPSAFE.  crfree still needs the mplock
  on the last release.  Give ucred a spinlock to handle the crfree()
  0 transition race.

Revision 1.36: download - view: text, markup, annotated - select for diffs
Mon May 22 21:21:21 2006 UTC (7 years, 10 months ago) by dillon
Branches: MAIN
Diff to: previous 1.35: preferred, unified
Changes since revision 1.35: +4 -1 lines
Do a major cleanup of the file descriptor handling code in preparation for
making the descriptor table MPSAFE.  Introduce a new feature that allows a
file descriptor number to be reserved without having to assign a file
pointer to it.  This allows code such as open(), dup(), etc to reserve
descriptors to work with without having to worry about the related file
being ripped out from under them by another thread sharing the descriptor
table.

falloc() -	This function allocates the file pointer and descriptor as
		before, but does NOT associate the file pointer with the
		descriptor.

		Before this change another thread could access the file
		pointer while the system call creating it was blocked,
		before the system call had a chance to completely initialize
		the file pointer.

		The caller must call fsetfd() to assign or clear the
		reserved descriptor.

fsetfd() -	Is now responsible for associating a file pointer with a
		previously reserved descriptor or clearing the reservation.

fdealloc() -	This hack existed to deal with open/dup races against other
		threads.  The above changes remove the possibility so this
		routine has been deleted.

dup code -	kern_dup() and dupfdopen() have been completely rewritten.
		They are much cleaner and less obtuse now.  Additional race
		conditions in the original code were also found and fixed.

funsetfd() -	Now returns the file pointer that was cleared and takes
		responsibility for adjusting fd_lastfile.

		NOTE: fd_lastfile is inclusive of any reserved descriptors.

fdcopy() -	While not yet MPSAFE, fdcopy now properly handles races
		against other threads.

fdp->fd_lastfile -
		This field was not being properly updated in certain failure
		cases.  This commit fixes that.  Also, if all a process's
		descriptors were closed this field was incorrectly left at
		0 when it should have been set to -1.

fdp->fd_files -	A number of code blocks were trying to optimize a for()
		loop over all file descriptors by caching a pointer to
		fd_files.  This is a problem because fd_files can be
		reallocated if code within the loop blocks.  These loops
		have been rewritten.
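
The reserve-then-publish pattern described for falloc() and fsetfd() can be
modelled in a few lines of user-space C.  This is a single-threaded toy with
no locking; every name here (demo_fdtable, demo_fdreserve, demo_fsetfd,
demo_lookup) is hypothetical and the real kernel structures differ.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NFILES 8

struct demo_file { int initialized; };

struct demo_fdtable {
	struct demo_file *fp[DEMO_NFILES];	/* NULL until published */
	unsigned char reserved[DEMO_NFILES];	/* 1 while a slot is reserved */
};

/* Reserve the lowest free slot without publishing any file pointer. */
static int demo_fdreserve(struct demo_fdtable *t)
{
	for (int fd = 0; fd < DEMO_NFILES; fd++) {
		if (t->fp[fd] == NULL && !t->reserved[fd]) {
			t->reserved[fd] = 1;
			return fd;
		}
	}
	return -1;
}

/* Publish fp in a reserved slot, or pass NULL to drop the reservation. */
static void demo_fsetfd(struct demo_fdtable *t, struct demo_file *fp, int fd)
{
	assert(fd >= 0 && fd < DEMO_NFILES && t->reserved[fd]);
	t->reserved[fd] = 0;
	t->fp[fd] = fp;
}

/* What another user of the same table would see for fd. */
static struct demo_file *demo_lookup(const struct demo_fdtable *t, int fd)
{
	return (fd >= 0 && fd < DEMO_NFILES) ? t->fp[fd] : NULL;
}
```

Between the reservation and the publish step a lookup returns NULL, so a
half-initialized file pointer is never visible, yet the descriptor number
cannot be handed out to anyone else.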

Revision 1.35: download - view: text, markup, annotated - select for diffs
Fri May 19 05:15:35 2006 UTC (7 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.34: preferred, unified
Changes since revision 1.34: +1 -6 lines
Consolidate the file descriptor destruction code used when a newly created
file descriptor must be destroyed due to an error into a new procedure,
fdealloc(), rather than manually repeating it over and over again.

Move holdsock() and holdfp() into kern/kern_descrip.c.

Revision 1.34: download - view: text, markup, annotated - select for diffs
Sat May 6 06:38:38 2006 UTC (7 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.33: preferred, unified
Changes since revision 1.33: +4 -4 lines
The fdrop() procedure no longer needs a thread argument, remove it.

Revision 1.33: download - view: text, markup, annotated - select for diffs
Sat May 6 02:43:12 2006 UTC (7 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.32: preferred, unified
Changes since revision 1.32: +16 -19 lines
The thread/proc pointer argument in the VFS subsystem originally existed
for...  well, I'm not sure *WHY* it originally existed when most of the
time the pointer couldn't be anything other than curthread or curproc or
the code wouldn't work.  This is particularly true of lockmgr locks.

Remove the pointer argument from all VOP_*() functions, all fileops functions,
and most ioctl functions.

Revision 1.32: download - view: text, markup, annotated - select for diffs
Fri Sep 2 07:16:58 2005 UTC (8 years, 7 months ago) by hsu
Branches: MAIN
CVS tags: DragonFly_RELEASE_1_4_Slip, DragonFly_RELEASE_1_4
Diff to: previous 1.31: preferred, unified
Changes since revision 1.31: +4 -4 lines
Now that the C language has a "void *", use it instead of caddr_t.

Revision 1.31: download - view: text, markup, annotated - select for diffs
Wed Jul 13 01:38:50 2005 UTC (8 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.30: preferred, unified
Changes since revision 1.30: +39 -1 lines
Make shutdown() a fileops operation rather than a socket operation.
Pipes are full-duplex entities, so implement shutdown support for them.

Revision 1.30: download - view: text, markup, annotated - select for diffs
Mon Jul 4 18:39:16 2005 UTC (8 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.29: preferred, unified
Changes since revision 1.29: +7 -2 lines
The pipe code was not properly handling kernel space writes.  Such writes
can be made by the journaling code when journaling to a pipe.

Revision 1.29: download - view: text, markup, annotated - select for diffs
Wed Jun 22 01:33:21 2005 UTC (8 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.28: preferred, unified
Changes since revision 1.28: +1 -1 lines
File descriptor cleanup stage 2, remove the separate arrays for file
pointers, fileflags, and allocation counts and replace the mess with a
single structural array.  Also revamp the code that checks whether the
file descriptor array is built-in or allocated.

Note that the removed malloc's were doing something weird, allocating
'nf * OFILESIZE + 1' bytes instead of 'nf * OFILESIZE' bytes.  I could
not find any reason at all why it was doing that.  It's gone now anyway.

Revision 1.28: download - view: text, markup, annotated - select for diffs
Tue Jun 21 23:58:53 2005 UTC (8 years, 9 months ago) by hsu
Branches: MAIN
Diff to: previous 1.27: preferred, unified
Changes since revision 1.27: +1 -1 lines
Replace the linear search in file descriptor allocation with an O(log N)
algorithm based on full in-place binary search trees augmented with
subtree free file descriptor counts.

Idea from:	Solaris
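
To illustrate the idea of subtree free counts, here is a simplified sketch.
It uses a complete binary tree stored in an array with descriptors at the
leaves, which is not necessarily the in-place layout the commit uses; the
names (demo_fdtree, demo_fdalloc, demo_fdfree) and the fixed table size are
assumptions made for the example.

```c
#include <assert.h>

#define DEMO_NFD 8	/* number of descriptors; must be a power of two */

/*
 * free_cnt[1] is the root and node i has children 2i and 2i+1.  Leaves
 * DEMO_NFD..2*DEMO_NFD-1 stand for descriptors 0..DEMO_NFD-1.  Each node
 * stores how many free descriptors exist in its subtree.
 */
struct demo_fdtree { int free_cnt[2 * DEMO_NFD]; };

static void demo_fdtree_init(struct demo_fdtree *t)
{
	for (int i = DEMO_NFD; i < 2 * DEMO_NFD; i++)
		t->free_cnt[i] = 1;
	for (int i = DEMO_NFD - 1; i >= 1; i--)
		t->free_cnt[i] = t->free_cnt[2 * i] + t->free_cnt[2 * i + 1];
}

/* Add delta to a leaf and to every ancestor: O(log N). */
static void demo_fdtree_adjust(struct demo_fdtree *t, int fd, int delta)
{
	for (int i = DEMO_NFD + fd; i >= 1; i /= 2)
		t->free_cnt[i] += delta;
}

/* Return the lowest free descriptor, or -1 if none is left: O(log N). */
static int demo_fdalloc(struct demo_fdtree *t)
{
	int i = 1;

	if (t->free_cnt[1] == 0)
		return -1;
	while (i < DEMO_NFD)	/* prefer the left (lower-numbered) subtree */
		i = (t->free_cnt[2 * i] > 0) ? 2 * i : 2 * i + 1;
	demo_fdtree_adjust(t, i - DEMO_NFD, -1);
	return i - DEMO_NFD;
}

static void demo_fdfree(struct demo_fdtree *t, int fd)
{
	assert(fd >= 0 && fd < DEMO_NFD && t->free_cnt[DEMO_NFD + fd] == 0);
	demo_fdtree_adjust(t, fd, 1);
}
```

Because a subtree with a zero count can be skipped as a whole, the search for
the lowest free descriptor never scans the table linearly.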

Revision 1.27: download - view: text, markup, annotated - select for diffs
Wed Mar 9 02:22:31 2005 UTC (9 years, 1 month ago) by dillon
Branches: MAIN
CVS tags: DragonFly_Stable, DragonFly_RELEASE_1_2_Slip, DragonFly_RELEASE_1_2
Diff to: previous 1.26: preferred, unified
Changes since revision 1.26: +6 -0 lines
pipe->pipe_buffer.out was not being reset to 0 when switching from direct
mode back to copy mode, leading to the pipe data becoming corrupt.

Reported-by: Joerg Sonnenberger <joerg@britannica.bec.de>

Revision 1.26: download - view: text, markup, annotated - select for diffs
Tue Mar 1 23:35:14 2005 UTC (9 years, 1 month ago) by dillon
Branches: MAIN
Diff to: previous 1.25: preferred, unified
Changes since revision 1.25: +35 -22 lines
Clean up the XIO API and structure.  XIO no longer tries to 'track' partial
copies into or out of an XIO.  It no longer adjusts xio_offset or xio_bytes
once they have been initialized.  Instead, a relative offset is now passed
to API calls to handle partial copies.  This makes the API a lot less confusing
and makes the XIO structure a lot more flexible, shareable, and more suitable
for use by higher level entities (buffer cache, pipe code, upcoming MSFBUF
work, etc).
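
The difference between a self-adjusting descriptor and a relative-offset call
can be sketched as follows.  The structure and function below (demo_xio,
demo_xio_copy_out) are hypothetical and only model the idea; they are not the
real XIO structure or API signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: offset and bytes are fixed after initialization. */
struct demo_xio {
	const char *base;	/* backing storage */
	size_t      offset;	/* start of valid data within base */
	size_t      bytes;	/* length of valid data */
};

/*
 * Copy up to len bytes starting rel_off bytes into the valid data.
 * Returns the number of bytes copied.  The descriptor is never modified,
 * so several users can share it and keep their own positions.
 */
static size_t demo_xio_copy_out(const struct demo_xio *x, size_t rel_off,
				void *dst, size_t len)
{
	assert(x != NULL && dst != NULL);
	if (rel_off >= x->bytes)
		return 0;
	if (len > x->bytes - rel_off)
		len = x->bytes - rel_off;
	memcpy(dst, x->base + x->offset + rel_off, len);
	return len;
}
```

A partial copy is then just a second call with a larger rel_off, with the
caller tracking how far it has progressed.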

Revision 1.25: download - view: text, markup, annotated - select for diffs
Fri Nov 12 00:09:24 2004 UTC (9 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.24: preferred, unified
Changes since revision 1.24: +1 -1 lines
VFS messaging/interfacing work stage 9/99: VFS 'NEW' API WORK.

NOTE: unionfs and nullfs are temporarily broken by this commit.

* Remove the old namecache API.  vfs_cache_lookup(), cache_lookup(),
  cache_enter(), namei() and lookup() are all gone.  VOP_LOOKUP() and
  VOP_CACHEDLOOKUP() have been collapsed into a single non-caching
  VOP_LOOKUP().

* Complete the new VFS CACHE (namecache) API.  The new API is able to
  supply topological guarantees and is able to reserve namespaces,
  including negative cache spaces (whether the target name exists or not),
  which the new API uses to reserve namespace for things like NRENAME
  and NCREATE (and others).

* Complete the new namecache API.  VOP_NRESOLVE, NLOOKUPDOTDOT, NCREATE,
  NMKDIR, NMKNOD, NLINK, NSYMLINK, NWHITEOUT, NRENAME, NRMDIR, NREMOVE.
  These new calls take (typically locked) namecache pointers rather than
  combinations of directory vnodes, file vnodes, and name components.  The
  new calls are *MUCH* simpler in concept and implementation.  For example,
  VOP_RENAME() has 8 arguments while VOP_NRENAME() has only 3 arguments.

  The new namecache API uses the namecache to lock namespaces without having
  to lock the underlying vnodes.  For example, this allows the kernel
  to reserve the target name of a create function trivially.  Namecache
  records are maintained BY THE KERNEL for both positive and negative hits.

  Generally speaking, the kernel layer is now responsible for resolving
  path elements.  NRESOLVE is called when an unresolved namecache record
  needs to be resolved.  Unlike the old VOP_LOOKUP, NRESOLVE is simply
  responsible for associating a vnode to a namecache record (positive hit)
  or telling the system that it's a negative hit, and not responsible for
  handling symlinks or other special cases or doing any of the other
  path lookup work, much unlike the old VOP_LOOKUP.

  It should be particularly noted that the new namecache topology does not
  allow disconnected namecache records.  In rare cases where a vnode must
  be converted to a namecache pointer for new API operation via a file handle
  (i.e. NFS), the cache_fromdvp() function is provided and a new API VOP,
  VOP_NLOOKUPDOTDOT() is provided to allow the namecache to resolve the
  topology leading up to the requested vnode.  These and other topological
  guarantees greatly reduce the complexity of the new namecache API.

  The new namei() is called nlookup().  This function uses a combination
  of cache_n*() calls, VOP_NRESOLVE(), and standard VOP calls to resolve the
  supplied path, deal with symlinks, and so forth, in a nice small compact
  compartmentalized procedure.

* The old VFS code is no longer responsible for maintaining namecache records,
  a function which was mostly ad hoc cache_purge()s occurring before the VFS
  actually knows whether an operation will succeed or not.

  The new VFS code is typically responsible for adjusting the state of
  locked namecache records passed into it.  For example, if NCREATE succeeds
  it must call cache_setvp() to associate the passed namecache record with
  the vnode representing the successfully created file.  The new requirements
  are much less complex than the old requirements.

* Most VFSs still implement the old API calls, albeit somewhat modified
  and in particular the VOP_LOOKUP function is now *MUCH* simpler.  However,
  the kernel now uses the new API calls almost exclusively and relies on
  compatibility code installed in the default ops (vop_compat_*()) to
  convert the new calls to the old calls.

* All kernel system calls and related support functions which used to do
  complex and confusing namei() operations now do far less complex and
  far less confusing nlookup() operations.

* SPECOPS shortcutting has been implemented.  User reads and writes now go
  directly to supporting functions which talk to the device via fileops
  rather than having to be routed through VOP_READ or VOP_WRITE, saving
  significant overhead.  Note, however, that these only really affect
  /dev/null and /dev/zero.

  Implementing this was fairly easy, we now simply pass an optional
  struct file pointer to VOP_OPEN() and let spec_open() handle the
  override.

SPECIAL NOTES: It should be noted that we must still lock a directory vnode
LK_EXCLUSIVE before issuing a VOP_LOOKUP(), even for simple lookups, because
a number of VFS's (including UFS) store active directory scanning information
in the directory vnode.  The legacy NAMEI_LOOKUP cases can be changed to
use LK_SHARED once these VFS cases are fixed.  In particular, we are now
organized well enough to actually be able to do record locking within a
directory for handling NCREATE, NDELETE, and NRENAME situations, but it hasn't
been done yet.

Many thanks to all of the testers and in particular David Rhodus for
finding a large number of panics and other issues.

Revision 1.24: download - view: text, markup, annotated - select for diffs
Sat Jul 24 20:30:00 2004 UTC (9 years, 8 months ago) by dillon
Branches: MAIN
CVS tags: DragonFly_Snap29Sep2004, DragonFly_Snap13Sep2004
Diff to: previous 1.23: preferred, unified
Changes since revision 1.23: +2 -0 lines
Make fstat() account for pending direct-write data when run on a pipe.

Submitted-by: Hiten Pandya <hmp@freebsd.org>
Obtained-from: FreeBSD 1.172 (Mike Silbersack)

Revision 1.23: download - view: text, markup, annotated - select for diffs
Fri May 21 14:26:28 2004 UTC (9 years, 10 months ago) by hmp
Branches: MAIN
CVS tags: DragonFly_1_0_REL, DragonFly_1_0_RC1, DragonFly_1_0A_REL
Diff to: previous 1.22: preferred, unified
Changes since revision 1.22: +2 -1 lines
Fix SYSCTL description style.

Revision 1.22: download - view: text, markup, annotated - select for diffs
Thu May 13 23:49:23 2004 UTC (9 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.21: preferred, unified
Changes since revision 1.21: +1 -1 lines
device switch 1/many: Remove d_autoq, add d_clone (where d_autoq was).

d_autoq was used to allow the device port dispatch to mix old-style synchronous
calls with new style messaging calls within a particular device.  It was never
used for that purpose.

d_clone will be more fully implemented as work continues.  We are going to
install d_port in the dev_t (struct specinfo) structure itself and d_clone
will be needed to allow devices to 'revector' the port on a minor-number
by minor-number basis, in particular allowing minor numbers to be directly
dispatched to distinct threads.  This is something we will be needing later
on.

Revision 1.21: download - view: text, markup, annotated - select for diffs
Tue May 11 22:48:53 2004 UTC (9 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.20: preferred, unified
Changes since revision 1.20: +4 -4 lines
Fix a bug in sys/pipe.c.  xio_init_ubuf() might not be able to load up the
requested number of bytes even if the request is limited to XIO_INTERNAL_SIZE
if the user buffer base is not page-aligned.  XIO will set xio_bytes to the
actual size of the buffer.
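
The alignment effect can be shown with a little arithmetic.  This sketch
assumes the limit comes from a fixed-size page array, so a buffer that starts
part-way into a page loses that leading fragment; the page size, the page
count, and the function name are illustrative assumptions, not the actual XIO
constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size */
#define DEMO_MAX_PAGES 32u	/* assumed capacity of the page array */

/*
 * Bytes of a user buffer that can be covered when at most DEMO_MAX_PAGES
 * pages may be held.  A non-page-aligned base wastes the first page_off
 * bytes of the first page.
 */
static size_t demo_loadable_bytes(uintptr_t ubase, size_t request)
{
	size_t page_off = (size_t)(ubase % DEMO_PAGE_SIZE);
	size_t capacity = (size_t)DEMO_MAX_PAGES * DEMO_PAGE_SIZE - page_off;

	assert(page_off < DEMO_PAGE_SIZE);
	return request < capacity ? request : capacity;
}
```

A caller therefore has to use the byte count actually loaded rather than the
count it asked for.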

Note that this bug was never exercised due to the 64KB pipe kmem buffer size
limit, so it could not have been the cause of recent problems.

Use kmem_alloc_nofault() instead of kmem_alloc_pageable() for the kmem
reservation.  This is more correct but should have no actual effect on
the system.

Revision 1.20: download - view: text, markup, annotated - select for diffs
Tue May 11 18:05:05 2004 UTC (9 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.19: preferred, unified
Changes since revision 1.19: +2 -0 lines
Add an assertion to sys_pipe to cover a possible overrun case and reorder
the zone cache code in zalloc() to not assign the link pointer until
after various sanity checks are performed.

Revision 1.19: download - view: text, markup, annotated - select for diffs
Sun May 2 07:57:45 2004 UTC (9 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.18: preferred, unified
Changes since revision 1.18: +14 -1 lines
We must pmap_qremove() pages that we previously pmap_qenter()'d before
we can safely call kmem_free().  This corrects a serious corruption issue
that occurred when using PIPE algorithms other than the default.

The default SFBUF algorithm was not affected by this bug.

Revision 1.18: download - view: text, markup, annotated - select for diffs
Sat May 1 18:16:43 2004 UTC (9 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.17: preferred, unified
Changes since revision 1.17: +152 -23 lines
Commit an update to the pipe code that implements various pipe algorithms.
Note that the newer algorithms are either experimental or only exist for
testing purposes.  The default remains the same (sfbuf mode), which is
considered to be stable.  The code is just too useful not to commit it.

Add pmap_qenter2() for installing cpu-localized KVM mappings.

Add pmap_page_assertzero() which will be used in a later diagnostic commit.

Revision 1.17: download - view: text, markup, annotated - select for diffs
Thu Apr 1 17:58:02 2004 UTC (10 years ago) by dillon
Branches: MAIN
Diff to: previous 1.16: preferred, unified
Changes since revision 1.16: +59 -101 lines
Enhance the pmap_kenter*() API and friends, separating out entries which
only need invalidation on the local cpu against entries which need invalidation
across the entire system, and provide a synchronization abstraction.

Enhance sf_buf_alloc() and friends to allow the caller to specify whether the
sf_buf's kernel mapping is going to be used on just the current cpu or
whether it needs to be valid across all cpus.  This is done by maintaining
a cpumask of known-synchronized cpus in the struct sf_buf.

Optimize sf_buf_alloc() and friends by removing both TAILQ operations in the
critical path.  TAILQ operations to remove the sf_buf from the free queue
are now done in a lazy fashion.  Most sf_buf operations allocate a buf,
work on it, and free it, so why waste time moving the sf_buf off the freelist
if we are only going to move it back onto the free list a microsecond later?
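
One way to picture the lazy approach is the sketch below.  It is a simplified
model with hypothetical names (demo_buf, demo_ref, demo_unref, demo_reclaim)
and a hand-rolled queue in place of TAILQ, not the actual sf_buf code: taking
and dropping a reference never touches the queue, and in-use entries are only
unlinked when the slow path walks the queue looking for an idle buffer.

```c
#include <assert.h>
#include <stddef.h>

struct demo_buf {
	int refcnt;		/* active users */
	int on_freeq;		/* still linked on the free queue? */
	struct demo_buf *next;
};

struct demo_freeq { struct demo_buf *head, *tail; };

static void demo_enqueue(struct demo_freeq *q, struct demo_buf *b)
{
	b->next = NULL;
	b->on_freeq = 1;
	if (q->tail)
		q->tail->next = b;
	else
		q->head = b;
	q->tail = b;
}

/* Fast path: reuse a known buffer; the queue is not touched at all. */
static void demo_ref(struct demo_buf *b) { b->refcnt++; }

/* Release: re-queue only if an earlier reclaim already unlinked it. */
static void demo_unref(struct demo_freeq *q, struct demo_buf *b)
{
	assert(b->refcnt > 0);
	if (--b->refcnt == 0 && !b->on_freeq)
		demo_enqueue(q, b);
}

/*
 * Slow path: pop entries from the head, lazily discarding ones that are
 * in use.  Returns an idle buffer holding one reference, or NULL.
 */
static struct demo_buf *demo_reclaim(struct demo_freeq *q)
{
	struct demo_buf *b;

	while ((b = q->head) != NULL) {
		q->head = b->next;
		if (q->head == NULL)
			q->tail = NULL;
		b->on_freeq = 0;
		b->next = NULL;
		if (b->refcnt == 0) {
			b->refcnt = 1;
			return b;
		}
	}
	return NULL;
}
```

In the common allocate, use, free cycle only the reference count changes, so
both queue operations drop out of the critical path.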

Fix a bug in sf_buf_alloc() code as it was being used by the PIPE code.
sf_buf_alloc() was unconditionally using PCATCH in its tsleep() call, which
is only correct when called from the sendfile() interface.

Optimize the PIPE code to require only local cpu_invlpg()'s when mapping
sf_buf's, greatly reducing the number of IPIs required.  On a DELL-2550,
a pipe test which explicitly blows out the sf_buf caching by using huge
buffers improves from 350 to 550 MBytes/sec.  However, note that buildworld
times were not found to have changed.

Replace the PIPE code's custom 'struct pipemapping' structure with a
struct xio and use the XIO API functions rather than its own.

Revision 1.16: download - view: text, markup, annotated - select for diffs
Tue Mar 30 19:14:11 2004 UTC (10 years ago) by dillon
Branches: MAIN
Diff to: previous 1.15: preferred, unified
Changes since revision 1.15: +3 -3 lines
Second major scheduler patch.  This corrects interactive issues that were
introduced in the pipe sf_buf patch.

Split need_resched() into need_user_resched() and need_lwkt_resched().
Userland reschedules are requested when a process is scheduled with a higher
priority than the currently running process, and LWKT reschedules are
requested when a thread is scheduled with a higher priority than the
currently running thread.  As before, these are ASTs; LWKTs are not
preemptively switched while running in the kernel.

Exclusively use the resched wanted flags to determine whether to reschedule
or call lwkt_switch() upon return to user mode.  We were previously also
testing the LWKT run queue for higher priority threads, but this was causing
inefficient scheduler interactions when two processes are doing tightly
bound synchronous IPC (e.g. using PIPEs) because in DragonFly the LWKT
priority of a thread is raised when it enters the kernel, and lowered when
it tries to return to userland.  The wakeups occurring in the pipe code
were causing extra quick-flip thread switches.
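
The split into two reschedule requests, checked only through flags on
return to user mode, can be sketched as below.  This is an illustration
under stated assumptions, not the kernel implementation: the flag names,
the sk_ helper functions, and the convention that a larger number means a
higher priority are all hypothetical.

```c
#include <assert.h>

#define SK_USER_RESCHED 0x1u    /* hypothetical: userland reschedule wanted */
#define SK_LWKT_RESCHED 0x2u    /* hypothetical: LWKT reschedule wanted */

static unsigned sk_reqflags;    /* pending request flags for this cpu */

/* A process became runnable: request a userland reschedule if it wins. */
static void sk_schedule_proc(int new_pri, int cur_pri)
{
    if (new_pri > cur_pri)
        sk_reqflags |= SK_USER_RESCHED;
}

/* A thread became runnable: request an LWKT reschedule if it wins. */
static void sk_schedule_thread(int new_pri, int cur_pri)
{
    if (new_pri > cur_pri)
        sk_reqflags |= SK_LWKT_RESCHED;
}

/*
 * Return to user mode: only the flags are consulted, the run queue is
 * not scanned.  Returns the requests to act on and clears them.
 */
static unsigned sk_userret(void)
{
    unsigned pending = sk_reqflags;

    assert((pending & ~(SK_USER_RESCHED | SK_LWKT_RESCHED)) == 0);
    sk_reqflags = 0;
    return pending;
}
```

Because nothing is requested when the newly runnable entity has a lower or
equal priority, two processes ping-ponging over a pipe do not force a
switch on every wakeup in this model.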

Introduce a new tsleep() flag which disables the need_lwkt_resched() call
when the sleeping thread is woken up.   This is used by the PIPE code in
the synchronous direct-write PIPE case to avoid the above problem.

Redocument and revamp the ESTCPU code.  The original changes reduced the
interrupt rate from 100Hz (FBsd-4 and FBsd-5) to 20Hz, but did not compensate
for the slower ramp-up time.  This commit introduces a 'virtual' ESTCPU
frequency which compensates without us having to bump up the actual systimer
interrupt rate.

Redo the P_CURPROC methodology, which is used by the userland scheduler
to manage processes running in userland.  Create a globaldata->gd_uschedcp
process pointer which represents the current running-in-userland (or about
to be running in userland) process, and carefully recode acquire_curproc()
to allow this gd_uschedcp designation to be stolen from other threads trying
to return to userland without having to request a reschedule (which would
have to switch back to those threads to release the designation).  This
reduces the number of unnecessary context switches that occur due to
scheduler interactions.  Also note that this specifically solves the case
where there might be several threads running in the kernel which are trying
to return to userland at the same time.  A heuristic check against gd_upri
is used to select the correct thread for scheduling to userland 'most of the
time'.  When the correct thread is not selected, we fall back to the old
behavior of forcing a reschedule.

Add debugging sysctl variables to better track userland scheduler efficiency.

With these changes pipe statistics are further improved.  Though some
scheduling aberrations still exist(1), the previous scheduler had totally
broken interactive processes and this one does not.

	BLKSIZE	BEFORE		NEWPIPE		NOW	    Tests on AMD64
		MBytes/s	MBytes/s	MBytes/s	3200+ FN85MB
							    (64KB L1, 1MB L2)
	256KB	1900		2200		2250
	 64KB	1800		2200		2250
	 32KB	-		-		3300
	 16KB	1650		2500-3000	2600-3200
	  8KB	1400		2300		2000-2400(1)
	  4KB	1300		1400-1500	1500-1700

Revision 1.15: download - view: text, markup, annotated - select for diffs
Sun Mar 28 08:25:48 2004 UTC (10 years ago) by dillon
Branches: MAIN
Diff to: previous 1.14: preferred, unified
Changes since revision 1.14: +45 -71 lines
Import Alan Cox's /usr/src/sys/kern/sys_pipe.c 1.171.  This rips out
writer-side KVA mappings and replaces them with writer-side vm_page wiring
(left intact from before) plus reader-side SF_BUF copies.

Import 1.141, which is a simple patch which removes a blocking condition
when space is available in the pipe's write buffer which was causing
non-blocking I/O select-based writes to spin-wait unnecessarily.

Import FreeBSD-5.x's uiomove_fromphys(), which sys_pipe.c now uses.  This
procedure could become very useful in a number of DragonFly subsystems.

This greatly improves PIPE performance for the direct-mapped case (moderate
to large reads and writes).  Additionally, recent scheduler fixes greatly
improve PIPE performance for both the direct-mapped and small-buffer cases.

NOTE: wired page limits for pipes have not yet been imported, and the heavy
use of sf_buf's may require some tuning in the many-pipes case.


    BLKSIZE	BEFORE		AFTER
		MBytes/s	MBytes/s	Tests on AMD64/3200+ FN85 MB
    -------	------		------		(64KB L1, 1MB L2)
    256KB	1900		2200
     64KB	1800		2200
     16KB	1650		2500-3000
      8KB	1400		2300
      4KB	1300		1400-1500	(note 1)

    note 1: The 4KB case is not a direct-write case, the results are due to
    the scheduler fixes only.


Obtained-from: FreeBSD-5.x / FreeBSD's Alan Cox

Revision 1.14: download - view: text, markup, annotated - select for diffs
Fri Feb 20 17:11:07 2004 UTC (10 years, 1 month ago) by dillon
Branches: MAIN
Diff to: previous 1.13: preferred, unified
Changes since revision 1.13: +131 -107 lines
Implement a pipe KVM cache primarily to reduce unnecessary TLB IPIs between
cpus on MP systems due to continuous KVM allocations.

Revision 1.13: download - view: text, markup, annotated - select for diffs
Mon Nov 3 17:11:21 2003 UTC (10 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.12: preferred, unified
Changes since revision 1.12: +2 -1 lines
64 bit address space cleanups which are a prerequisite for future 64 bit
address space work and PAE.  Note: this is not PAE.  This patch basically
adds vm_paddr_t, which represents a 'physical address'.  Physical addresses
may be larger than virtual addresses and on IA32 we make vm_paddr_t a 64
bit quantity.
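
Why a separate physical address type matters can be shown with a short
sketch.  The sk_ type names, SK_PAGE_SHIFT and sk_ptoa() are hypothetical
stand-ins chosen for illustration; a 4KB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t sk_vm_offset_t;    /* 32-bit virtual address, as on IA32 */
typedef uint64_t sk_vm_paddr_t;     /* physical address, 64-bit quantity */

#define SK_PAGE_SHIFT 12            /* assumed 4KB pages */

/* Convert a page frame number to a physical address. */
static sk_vm_paddr_t sk_ptoa(uint32_t pfn)
{
    /* Widen before shifting so frames at or above 4GB do not wrap. */
    return (sk_vm_paddr_t)pfn << SK_PAGE_SHIFT;
}
```

With a 32-bit result type the frame at the 4GB boundary would wrap to zero;
widening first keeps the full physical address.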

Submitted-by: Hiten Pandya <hmp@backplane.com>

Revision 1.12: download - view: text, markup, annotated - select for diffs
Wed Sep 3 14:19:06 2003 UTC (10 years, 7 months ago) by hmp
Branches: MAIN
Diff to: previous 1.11: preferred, unified
Changes since revision 1.11: +1 -1 lines
Pass only one argument to vm_page_hold() as a sane person would do.

Reported by:	DragonFly BuildBox

Revision 1.11: download - view: text, markup, annotated - select for diffs
Wed Sep 3 11:49:27 2003 UTC (10 years, 7 months ago) by hmp
Branches: MAIN
Diff to: previous 1.10: preferred, unified
Changes since revision 1.10: +1 -1 lines
Return a more sane error code, EPIPE.  The EBADF error code is
misleading, since we have already got this far, and it's not a
bad file descriptor.

Obtained from:	FreeBSD

Revision 1.10: download - view: text, markup, annotated - select for diffs
Wed Sep 3 11:47:03 2003 UTC (10 years, 7 months ago) by hmp
Branches: MAIN
Diff to: previous 1.9: preferred, unified
Changes since revision 1.9: +3 -3 lines
Use vm_page_hold() instead of vm_page_wire().

Obtained from:	FreeBSD

Revision 1.9: download - view: text, markup, annotated - select for diffs
Tue Aug 26 21:09:02 2003 UTC (10 years, 7 months ago) by rob
Branches: MAIN
Diff to: previous 1.8: preferred, unified
Changes since revision 1.8: +21 -21 lines
__P() removal

Revision 1.8: download - view: text, markup, annotated - select for diffs
Wed Jul 30 00:19:14 2003 UTC (10 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.7: preferred, unified
Changes since revision 1.7: +2 -2 lines
syscall messaging 3: Expand the 'header' that goes in front of the syscall
arguments in the kernel copy.  The header was previously just an lwkt_msg.
The header is now a 'union sysmsg'.  'union sysmsg' contains an lwkt_msg
plus space for the additional meta data required to asynchronize various
system calls.   We haven't actually asynchronized anything yet and will not
be able to until the reply port and abort processing infrastructure is
in place.  See sys/sysmsg.h for more information on the new header.

Also cleanup syscall generation somewhat and add some ibcs2 stuff I missed.

Revision 1.7: download - view: text, markup, annotated - select for diffs
Tue Jul 29 20:03:05 2003 UTC (10 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.6: preferred, unified
Changes since revision 1.6: +2 -0 lines
fileops messaging stage 1: add port and feature mask to struct fileops and
rename fo_ functions to fold.

Revision 1.6: download - view: text, markup, annotated - select for diffs
Sat Jul 26 18:12:44 2003 UTC (10 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.5: preferred, unified
Changes since revision 1.5: +7 -7 lines
syscall messaging 2: Change the standard return value storage for system
calls from proc->p_retval[] to the message structure embedded in the syscall.
System calls used to set their non-error return value in p_retval[] but
must now set it in the message structure.  This is a necessary precursor to
any sort of asynchronization, for obvious reasons.
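
The change in calling convention can be sketched as follows.  The structure
and field names here (sk_msg, fds, sk_pipe_args, sk_pipe()) are hypothetical
and do not reproduce the real headers; the descriptor numbers are made up.
The point is only that the result travels with the message that carries the
arguments, not with the process.

```c
#include <assert.h>

/* Hypothetical message header placed in front of the syscall arguments. */
struct sk_msg {
    int fds[2];             /* non-error return values live here */
};

/* Hypothetical argument structure for a pipe()-like call (no arguments). */
struct sk_pipe_args {
    struct sk_msg msg;
};

/*
 * The function's return value is the error code only.  The results are
 * stored in the message embedded in the argument structure instead of a
 * per-process p_retval[] array.
 */
static int sk_pipe(struct sk_pipe_args *uap)
{
    uap->msg.fds[0] = 3;    /* made-up read-side descriptor */
    uap->msg.fds[1] = 4;    /* made-up write-side descriptor */
    return 0;               /* success */
}
```

Keeping the result in the message means a reply could in principle be
delivered later without reference to whichever process is current.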

This work was particularly annoying because all the emulation code declares
and manually fills in syscall argument structures.

This commit could potentially destabilize some of the emulation code but I
went through the most important Linux emulation code three times and tested it
with linux-mozilla, so I am fairly confident that I got it right.

Note: proper linux emulation requires setting the fallback elf brand to 3 or
it will default to SVR4.  It really ought to default to linux (3), not SVR4.

    sysctl -w kern.fallback_elf_brand=3

Revision 1.5: download - view: text, markup, annotated - select for diffs
Sat Jul 19 21:14:39 2003 UTC (10 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.4: preferred, unified
Changes since revision 1.4: +11 -11 lines
Remove the priority part of the priority|flags argument to tsleep().  Only
flags are passed now.  The priority was a user scheduler thingy that is not
used by the LWKT subsystem.  For process statistics assume sleeps without
P_SINTR set to be disk-waits, and sleeps with it set to be normal sleeps.

This commit should not contain any operational changes.

Revision 1.4: download - view: text, markup, annotated - select for diffs
Wed Jun 25 03:55:57 2003 UTC (10 years, 9 months ago) by dillon
Branches: MAIN
CVS tags: PRE_MP
Diff to: previous 1.3: preferred, unified
Changes since revision 1.3: +30 -47 lines
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread
pointers instead of process pointers as arguments, similar to what FreeBSD-5
did.  Note however that ultimately both APIs are going to be message-passing
which means the current thread context will not be useable for creds and
descriptor access.

Revision 1.3: download - view: text, markup, annotated - select for diffs
Mon Jun 23 17:55:41 2003 UTC (10 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.2: preferred, unified
Changes since revision 1.2: +7 -8 lines
proc->thread stage 2: MAJOR revamping of system calls, ucred, jail API,
and some work on the low level device interface (proc arg -> thread arg).
As -current did, I have removed p_cred and incorporated its functions
into p_ucred.  p_prison has also been moved into p_ucred and adjusted
accordingly.  The jail interface tests now use ucreds rather than processes.

The syscall(p,uap) interface has been changed to just (uap).  This is inclusive
of the emulation code.  It makes little sense to pass a proc pointer around
which confuses the MP readability of the code, because most system call code
will only work with the current process anyway.  Note that eventually
*ALL* syscall emulation code will be moved to a kernel-protected userland
layer because it really makes no sense whatsoever to implement these
emulations in the kernel.

suser() now takes no arguments and only operates with the current process.
The process argument has been removed from suser_xxx() so it now just takes
a ucred and flags.

The sysctl interface was adjusted somewhat.

Revision 1.2: download - view: text, markup, annotated - select for diffs
Tue Jun 17 04:28:41 2003 UTC (10 years, 10 months ago) by dillon
Branches: MAIN
Diff to: previous 1.1: preferred, unified
Changes since revision 1.1: +1 -0 lines
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids.  Most
ids have been removed from !lint sections and moved into comment sections.

Revision 1.1: download - view: text, markup, annotated - select for diffs
Tue Jun 17 02:55:04 2003 UTC (10 years, 10 months ago) by dillon
Branches: MAIN
CVS tags: FREEBSD_4_FORK
import from FreeBSD RELENG_4 1.60.2.13
