DragonFly BSD

CVS log for src/sys/kern/vfs_bio.c

Keyword substitution: kv
Default branch: MAIN


Revision 1.112.2.2
Thu Sep 25 01:44:52 2008 UTC (6 years, 1 month ago) by dillon
Branches: DragonFly_RELEASE_2_0
CVS tags: DragonFly_RELEASE_2_0_Slip
Diff to: previous 1.112.2.1; next MAIN 1.113
Changes since revision 1.112.2.1: +1 -12 lines
MFC numerous features from HEAD.

* Bounce buffer fixes for physio.
* Disk flush support in scsi and nata subsystems.
* Dead bio handling.

Revision 1.115
Wed Aug 13 11:02:31 2008 UTC (6 years, 2 months ago) by swildner
Branches: MAIN
CVS tags: HEAD
Diff to: previous 1.114
Changes since revision 1.114: +1 -8 lines
Remove a useless assignment and two unused variables.

Found-by: LLVM/Clang Static Analyzer

Revision 1.114
Sun Aug 10 20:03:14 2008 UTC (6 years, 2 months ago) by dillon
Branches: MAIN
Diff to: previous 1.113
Changes since revision 1.113: +0 -4 lines
Implement a bounce buffer for physio if the buffer passed from userland
is not at least 16-byte aligned.

Reported-by: "Steve O'Hara-Smith" <steve@sohara.org>, and others

Revision 1.112.2.1
Fri Jul 18 00:02:10 2008 UTC (6 years, 3 months ago) by dillon
Branches: DragonFly_RELEASE_2_0
Diff to: previous 1.112
Changes since revision 1.112: +41 -27 lines
MFC 1.113 - buffer cache adjustments for handling write errors.

Revision 1.113
Fri Jul 18 00:01:11 2008 UTC (6 years, 3 months ago) by dillon
Branches: MAIN
CVS tags: DragonFly_Preview
Diff to: previous 1.112
Changes since revision 1.112: +41 -27 lines
Make some adjustments to the buffer cache:

* Retain B_ERROR instead of clearing it.

* Change B_ERROR's behavior.  It no longer causes the buffer to be
  invalidated on write.

* Change B_NOCACHE's behavior.  It no longer causes the buffer to be
  invalidated while the buffer is marked dirty.

* Code that was supposed to re-dirty a failed write buffer in brelse()
  was not running because biodone() cleared the fields brelse() was
  testing.  Move the code to biodone().

* When attempting to reflush B_DELWRI|B_ERROR'd buffers, sleep a tick
  to try to avoid a live-lock.
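
To illustrate the fourth point, a hedged sketch of re-dirtying a failed write in
the completion path; the flag names follow the message, the types are stand-ins
for the real ones in sys/buf.h:

    struct buf_sk { int b_flags; };     /* stand-in for struct buf */
    #define B_ERROR_SK   0x01
    #define B_INVAL_SK   0x02
    #define B_DELWRI_SK  0x04

    /*
     * Runs from biodone() rather than brelse(): by the time brelse()
     * tested these fields, biodone() had already cleared them.
     */
    static void
    biodone_redirty_sk(struct buf_sk *bp)
    {
        if (bp->b_flags & B_ERROR_SK) {
            bp->b_flags &= ~B_INVAL_SK;  /* do not invalidate on write error */
            bp->b_flags |= B_DELWRI_SK;  /* keep the buffer dirty for reflush */
        }
    }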

Revision 1.112
Mon Jul 14 03:09:00 2008 UTC (6 years, 3 months ago) by dillon
Branches: MAIN
Branch point for: DragonFly_RELEASE_2_0
Diff to: previous 1.111
Changes since revision 1.111: +21 -2 lines
Kernel support for HAMMER:

* Add another type to the bio->bio_caller_info1 union

* Add two new flags to getblk(), used by the cluster code.

  GETBLK_SZMATCH	- Tell getblk() to fail and return NULL if a
			  pre-existing buffer's size does not match
			  the requested size (this prevents getblk()
			  from doing a potentially undesired bwrite()
			  sequence).

  GETBLK_NOWAIT		- Tell getblk() to use a non-blocking lock.

* pop_bio() now returns the previous BIO (or NULL if there is no previous
  BIO).  This allows HAMMER to chain bio_done()'s.

* Fix a bug in cluster_read().  The cluster code's read-ahead at the
  end could go past the caller-specified limit and force a block to
  the wrong block size.
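
As a usage illustration of the two new getblk() flags (the call shape below is
assumed, not quoted from the commit):

    /*
     * Hypothetical cluster-code probe: fail fast rather than blocking
     * on the lock or bwrite()ing a pre-existing buffer of another size.
     */
    bp = getblk(vp, loffset, blksize, GETBLK_SZMATCH | GETBLK_NOWAIT, 0);
    if (bp == NULL) {
        /* size mismatch or lock contention; skip this block */
    }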

Revision 1.111
Tue Jul 8 03:34:27 2008 UTC (6 years, 3 months ago) by dillon
Branches: MAIN
Diff to: previous 1.110
Changes since revision 1.110: +4 -3 lines
Cleanup - move a warning so it doesn't spam the screen so much and clean up
some syntax.

Revision 1.110
Mon Jul 7 17:31:07 2008 UTC (6 years, 3 months ago) by dillon
Branches: MAIN
Diff to: previous 1.109
Changes since revision 1.109: +35 -12 lines
UFS+softupdates can build up thousands of dirty 1K buffers and run out
of buffers before it even hits the lodirtybufspace point.  The buf_daemon
is never triggered.  This case occurs rarely but can be triggered e.g.
by a cvs update.

Add dirtybufcount back in and flush if it exceeds (nbuf / 2) to handle
this degenerate case.

Reported-by: "Sepherosa Ziehau" <sepherosa@gmail.com>

Revision 1.109
Tue Jul 1 02:02:54 2008 UTC (6 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.108
Changes since revision 1.108: +202 -27 lines
Fix numerous pageout daemon -> buffer cache deadlocks in the main system.
These issues usually only occur on systems with small amounts of ram
but it is possible to trigger them on any system.

* Get rid of the IO_NOBWILL hack.  Just have the VN device use IO_DIRECT,
  which will clean out the buffer on completion of the write.

* Add a timeout argument to vm_wait().

* Add a thread->td_flags flag called TDF_SYSTHREAD.  kmalloc()'s made
  from designated threads are allowed to dip into the system reserve
  when allocating pages.  Only the pageout daemon and buf_daemon[_hw] use
  the flag.

* Add a new static procedure, recoverbufpages(), which explicitly tries to
  free buffers and their backing pages on the clean queue.

* Add a new static procedure, bio_page_alloc(), to do all the nasty work
  of allocating a page on behalf of a buffer cache buffer.

  This function will call vm_page_alloc() with VM_ALLOC_SYSTEM to allow
  it to dip into the system reserve.  If the allocation fails this
  function will call recoverbufpages() to try to recycle VM pages
  from clean buffer cache buffers, and will then attempt to reallocate
  using VM_ALLOC_SYSTEM | VM_ALLOC_INTERRUPT to allow it to dip into
  the interrupt reserve as well.

  Warnings will blare on the console.  If the effort still fails we
  sleep for 1/20 of a second and retry.  The idea though is for all
  the effort above to not result in a failure at the end.

Reported-by: Gergo Szakal <bastyaelvtars@gmail.com>
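
In outline, the allocation ladder described above behaves like the sketch below;
the allocation flags are from the message, the control flow is a paraphrase
rather than the actual code:

    /*
     * bio_page_alloc() escalation sketch:
     *   1. vm_page_alloc() with VM_ALLOC_SYSTEM (system reserve).
     *   2. On failure, recoverbufpages(), then retry with
     *      VM_ALLOC_SYSTEM | VM_ALLOC_INTERRUPT (interrupt reserve too).
     *   3. If that also fails, warn, sleep 1/20 of a second, retry.
     */
    while ((m = vm_page_alloc(obj, pindex, VM_ALLOC_SYSTEM)) == NULL) {
        recoverbufpages();
        m = vm_page_alloc(obj, pindex,
                          VM_ALLOC_SYSTEM | VM_ALLOC_INTERRUPT);
        if (m != NULL)
            break;
        kprintf("bio_page_alloc: low on memory\n");  /* warning blares */
        tsleep(obj, 0, "biopag", hz / 20);           /* 1/20 second */
    }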

Revision 1.108
Mon Jun 30 02:11:53 2008 UTC (6 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.107
Changes since revision 1.107: +68 -80 lines
Fix a buf_daemon performance issue when running on machines with small
amounts of ram.  The daemon was hitting a 1/2-second sleep case that it should
not have been hitting.

Revision 1.107
Sat Jun 28 23:45:18 2008 UTC (6 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.106
Changes since revision 1.106: +87 -124 lines
Hopefully fix all possible deadlocks that can occur when mixed block sizes
are used with the buffer cache.  The fix is simply to base the limiting
and flushing code on a byte count rather than a buffer count.

This will allow UFS to utilize a greater number of dirty buffers and
will cause HAMMER to use fewer.  This also makes tuning the buffer cache
a whole lot easier.

Revision 1.106
Sat Jun 28 17:59:49 2008 UTC (6 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.105
Changes since revision 1.105: +83 -138 lines
Replace the bwillwrite() subsystem to make it more fair to processes.

* Add new API functions, bwillread(), bwillwrite(), bwillinode() which
  the kernel calls when it intends to read, write, or make inode
  modifications.

* Redo the backend.  Add bd_heatup() and bd_wait().  bd_heatup() heats up
  the buf_daemon, starting it flushing before we hit any blocking conditions
  (similar to the previous algorithm).

* The new bwill*() blocking functions no longer introduce escalating delays
  to keep the number of dirty buffers under control.  Instead they take a page
  from HAMMER and estimate the load caused by the caller, then wait for a
  specific number of dirty buffers to complete their write I/O's before
  returning.  If the buffers can be retired quickly these functions will
  return more quickly.
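
A minimal sketch of how a bwill*() entry point plausibly combines the two new
backend calls; bd_heatup() and bd_wait() are named in the message, the load
estimate and signatures are illustrative:

    static void
    bwillwrite_sk(int bytes)
    {
        int count = (bytes + MAXBSIZE - 1) / MAXBSIZE;  /* rough load estimate */

        bd_heatup();     /* heat up buf_daemon before any blocking occurs */
        bd_wait(count);  /* wait for ~count dirty buffer writes to retire */
    }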

Revision 1.105
Thu Jun 19 23:27:35 2008 UTC (6 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.104
Changes since revision 1.104: +8 -6 lines
Miscellaneous performance adjustments to the kernel

* Add an argument to VOP_BMAP so VFSs can discern the type of operation
  the BMAP is being done for.

* Normalize the variable name denoting the blocksize to 'blksize' in
  vfs_cluster.c.

* Fix a bug in the cluster code where a stale bp->b_error could wind up
  getting returned when B_ERROR is not set.

* Do not B_AGE cluster bufs.

* Pass the block size to both cluster_read() and cluster_write() instead
  of those routines getting the block size from
  vp->v_mount->mnt_stat.f_iosize.  This allows different areas of a file
  to use a different block size.

* Properly initialize bp->b_bio2.bio_offset to doffset in cluster_read().
  This fixes an issue where VFSs were making an extra, unnecessary call
  to BMAP.

* Do not recycle vnodes on the free list until numvnodes has reached
  desiredvnodes.  Vnodes were being recycled when their resident page count
  had dropped to zero, but this is actually too early as the VFS may cache
  important information in the vnode that would otherwise require a number
  of I/O's to re-acquire.  This mainly helps HAMMER (whose inode lookups are
  fairly expensive).

* Do not VAGE vnodes.

* Remove the minvnodes test.  There is no reason not to load the vnode cache
  all the way through to its max.

* buf_cmd_t visibility for the new BMAP argument.

Revision 1.104
Thu Jun 12 23:26:37 2008 UTC (6 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.103
Changes since revision 1.103: +44 -21 lines
Reimplement B_AGE.  Have it cycle the buffer in the queue twice instead of
placing buffers at the head of the queue (which causes them to be run-down
backwards).  Leave B_AGE set through the write cycle and have the bufdaemon
set the flag when flushing dirty buffers.  B_AGE no longer affects the
ordering of the actual write and is allowed to slide through to the clean
queue when the write completes.

Revision 1.103
Tue Jun 10 05:02:09 2008 UTC (6 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.102
Changes since revision 1.102: +41 -10 lines
Change bwillwrite() to smooth out performance under heavy loads.  Blocking
based on strict hysteresis was being used to try to gang flushes together
but filesystems can still blow out the buffer cache and cause processes
to block for long periods of time waiting for the dirty count to drop
significantly.

Instead, as the number of dirty buffers exceeds the desired maximum,
bwillwrite() imposes a dynamic delay which increases as the number of
dirty buffers increases.  This improves the stall behavior under heavy loads
and keeps the system responsive.

TODO: The algorithm needs to have a per-LWP heuristic to penalize heavy
writers more than light ones.
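
The dynamic delay amounts to something like this self-contained sketch; only
"the delay grows with the number of dirty buffers" is from the message, the
scaling is hypothetical:

    static int
    bwillwrite_delay_sk(long numdirty, long maxdirty, int hz)
    {
        long over = numdirty - maxdirty;

        if (over <= 0)
            return (0);          /* under the limit: do not stall */
        if (over > maxdirty)
            over = maxdirty;     /* clamp the overshoot */
        /* Delay in ticks rises with the overshoot but stays bounded. */
        return (int)(1 + (over * (hz / 10)) / maxdirty);
    }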

Revision 1.102
Fri May 9 07:24:45 2008 UTC (6 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.101
Changes since revision 1.101: +23 -2 lines
Fix many bugs and issues in the VM system, particularly related to
heavy paging.

* (cleanup) PG_WRITEABLE is now set by the low level pmap code and not by
  high level code.  It means 'This page may contain a managed page table
  mapping which is writeable', meaning that hardware can dirty the page
  at any time.  The page must be tested via appropriate pmap calls before
  being disposed of.

* (cleanup) PG_MAPPED is now handled by the low level pmap code and only
  applies to managed mappings.  There is still a bit of cruft left over
  related to the pmap code's page table pages but the high level code is now
  clean.

* (bug) Various XIO, SFBUF, and MSFBUF routines which bypass normal paging
  operations were not properly dirtying pages when the caller intended
  to write to them.

* (bug) vfs_busy_pages in kern/vfs_bio.c had a busy race.  Separate the code
  out to ensure that we have marked all the pages as undergoing IO before we
  call vm_page_protect().  vm_page_protect(... VM_PROT_NONE) can block
  under very heavy paging conditions and if the pages haven't been marked
  for IO that could blow up the code.

* (optimization) Make a minor optimization.  When busying pages for write
  IO, downgrade the page table mappings to read-only instead of removing
  them entirely.

* (bug) In platform/pc32/i386/pmap.c fix various places where
  pmap_inval_add() was being called at the wrong point.  Only one was
  critical, in pmap_enter(), where pmap_inval_add() was being called so far
  away from the pmap entry being modified that it could wind up being flushed
  out prior to the modification, breaking the cpusync required.

  pmap.c also contains most of the work involved in the PG_MAPPED and
  PG_WRITEABLE changes.

* (bug) Close numerous pte updating races with hardware setting the
  modified bit.  There is still one race left (in pmap_enter()).

* (bug) Disable pmap_copy() entirely.   Fix most of the bugs anyway, but
  there is still one left in the handling of the srcmpte variable.

* (cleanup) Change vm_page_dirty() from an inline to a real procedure, and
  move the code which set the object to writeable/maybedirty into
  vm_page_dirty().

* (bug) Calls to vm_page_protect(... VM_PROT_NONE) can block.  Fix all cases
  where this call was made with a non-busied page.  All such calls are
  now made with a busied page, preventing blocking races from re-dirtying
  or remapping the page unexpectedly.

  (Such blockages could only occur during heavy paging activity where the
  underlying page table pages are being actively recycled).

* (bug) Fix the pageout code to properly mark pages as undergoing I/O before
  changing their protection bits.

* (bug) Busy pages undergoing zeroing or partial zeroing in the vnode pager
  (vm/vnode_pager.c) to avoid unexpected effects.

Revision 1.101
Tue May 6 00:13:53 2008 UTC (6 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.100
Changes since revision 1.100: +30 -13 lines
Keep track of the number of buffers undergoing IO, and include that number
in calculations involving numdirtybuffers.  This prevents the kernel from
believing that there are only a few dirty buffers when, in fact, all the
dirty buffers are running IOs.

Revision 1.100
Wed Apr 30 04:11:44 2008 UTC (6 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.99
Changes since revision 1.99: +2 -0 lines
Add some assertions when a buffer is reused

Revision 1.99
Tue Apr 22 18:46:51 2008 UTC (6 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.98
Changes since revision 1.98: +15 -28 lines
Fix some IO sequencing performance issues and reformulate the strategy
we use to deal with potential buffer cache deadlocks.  Generally speaking
try to remove roadblocks in the vn_strategy() path.

* Remove buf->b_tid (HAMMER no longer needs it)

* Replace IO_NOWDRAIN with IO_NOBWILL, requesting that bwillwrite() not
  be called.  Used by VN to try to avoid deadlocking.  Remove B_NOWDRAIN.

* No longer block in bwrite() or getblk() when we have a lot of dirty
  buffers.   getblk() in particular needs to be callable by filesystems
  to drain dirty buffers and we don't want to deadlock.

* Improve bwillwrite() by having it wake up the buffer flusher at 1/2 the
  dirty buffer limit but not block, and then block if the limit is reached.
  This should smooth out flushes during heavy filesystem activity.

Revision 1.98
Sat Feb 23 21:55:49 2008 UTC (6 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.97
Changes since revision 1.97: +1 -0 lines
HAMMER 30C/many: Fix more TID synchronization issues

* Properly zero-out b_tid in getnewbuf so a buffer does not get an old
  stale (and possibly duplicate) b_tid.

* A b_tid assignment was missing in the truncation case, causing an assertion.

* Panic instead of warn when we find a duplicate record in the B-Tree.

Revision 1.97
Mon Jan 28 07:19:06 2008 UTC (6 years, 9 months ago) by nth
Branches: MAIN
CVS tags: DragonFly_RELEASE_1_12_Slip, DragonFly_RELEASE_1_12
Diff to: previous 1.96
Changes since revision 1.96: +2 -1 lines
Fix spurious "softdep_deallocate_dependencies: dangling deps" panic occuring
on low memory condition.

Add assertion to catch similar bugs automagically.

Reported-by: Peter Avalos <pavalos@theshell.com>

Reviewed-by: Matthew Dillon <dillon@backplane.com>

Revision 1.96
Thu Jan 10 07:34:01 2008 UTC (6 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.95
Changes since revision 1.95: +200 -59 lines
Fix buffer cache deadlocks by splitting dirty buffers into two categories:
Light weight dirty buffers and heavy weight dirty buffers.  Add a second
buffer cache flushing daemon to deal with the heavy weight dirty buffers.

Currently only HAMMER uses the new feature, but it can also easily be used
by UFS in the future.

Buffer cache deadlocks can occur in low memory situations where the buffer
cache tries to flush out dirty buffers and deadlocks when the act of
flushing a dirty buffer requires additional buffers to be acquired.  Because
there was only one buffer flushing daemon, a deadlock on a heavy weight buffer
prevented any further buffer flushes, whether light or heavy weight, and
wound up deadlocking the entire system.

Giving the heavy weight buffers their own daemon solves the problem by
allowing light weight buffers to continue to be flushed even if a stall
occurs on a heavy weight buffer.  The number of dirty heavy weight buffers
is limited to ensure that enough light weight buffers are available.

This is primarily implemented by changing getblk()'s mostly unused slpflag
parameter to a new blkflags parameter and adding a new buffer cache queue
called BQUEUE_DIRTY_HW.
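
In sketch form, the split is a queue selection on the buffer's weight class;
BQUEUE_DIRTY_HW is from the message, the predicate is hypothetical:

    enum bufq_sk { BQUEUE_DIRTY_SK, BQUEUE_DIRTY_HW_SK };

    /*
     * Heavy weight dirty buffers go to their own queue, flushed by the
     * second daemon, so a stall on one of them cannot wedge light
     * weight flushing.
     */
    static enum bufq_sk
    dirty_queue_sk(int heavyweight)
    {
        return (heavyweight ? BQUEUE_DIRTY_HW_SK : BQUEUE_DIRTY_SK);
    }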

Revision 1.95
Wed Nov 7 00:46:36 2007 UTC (6 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.94
Changes since revision 1.94: +22 -38 lines
Add bio_ops->io_checkread and io_checkwrite - a read and write pre-check
which gives HAMMER a chance to set B_LOCKED if the kernel wants to write out
a passively held buffer.

Change B_LOCKED semantics slightly.  B_LOCKED buffers will not be written
until B_LOCKED is cleared.  This allows HAMMER to hold off B_DELWRI writes
on passively held buffers.
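
The shape of the new hooks, sketched; the member names are from the message,
everything else is assumed:

    struct buf_sk;                  /* stand-in for struct buf */

    struct bio_ops_sk {
        /*
         * Pre-checks called before the kernel reads or writes a buffer.
         * io_checkwrite() gives HAMMER a chance to set B_LOCKED on a
         * passively held buffer, holding off the write until the flag
         * is cleared.
         */
        int (*io_checkread)(struct buf_sk *bp);
        int (*io_checkwrite)(struct buf_sk *bp);
    };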

Revision 1.94
Tue Nov 6 20:06:26 2007 UTC (6 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.93
Changes since revision 1.93: +54 -7 lines
Add regetblk() - reacquire a buffer lock.  The buffer must be B_LOCKED or
must be interlocked with bio_ops.  Used by HAMMER.

Further changes to B_LOCKED buffers.  A B_LOCKED|B_DELWRI buffer will be
placed on the dirty queue and then returned to the locked queue once the
I/O completes.  That is, B_LOCKED does not interfere with B_DELWRI
operation.

Revision 1.93
Tue Nov 6 03:49:58 2007 UTC (6 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.92
Changes since revision 1.92: +45 -28 lines
Convert the global 'bioops' into per-mount bio_ops.  For now we also have
to have a per buffer b_ops as well since the controlling filesystem cannot
be located from information in struct buf (b_vp could be the backing store
so that can't be used).  This change allows HAMMER to use bio_ops.

Change the ordering of the bio_ops.io_deallocate call so it occurs before
the buffer's B_LOCKED is checked.  This allows the deallocate call to set
B_LOCKED to retain the buffer in situations where the target filesystem
is unable to immediately disassociate the buffer.  Also keep VMIO intact
for B_LOCKED buffers (in addition to B_DELWRI buffers).

HAMMER will use this feature to keep buffers passively associated with
other filesystem structures and thus be able to avoid constantly brelse()ing
and getblk()ing them.

Revision 1.92
Mon Aug 13 17:31:51 2007 UTC (7 years, 2 months ago) by dillon
Branches: MAIN
Diff to: previous 1.91
Changes since revision 1.91: +1 -1 lines
Remove the vpp (returned underlying device vnode) argument from VOP_BMAP().
VOP_BMAP() may now only be used to determine linearity and clusterability of
the blocks underlying a filesystem object.  The meaning of the returned
block number (other than being contiguous as a means of indicating
linearity or clusterability) is now up to the VFS.

This removes visibility into the device(s) underlying a filesystem from
the rest of the kernel.

Revision 1.91
Sun May 13 18:33:58 2007 UTC (7 years, 5 months ago) by swildner
Branches: MAIN
CVS tags: DragonFly_RELEASE_1_10_Slip, DragonFly_RELEASE_1_10
Diff to: previous 1.90
Changes since revision 1.90: +1 -1 lines
Fix numerous spelling mistakes.

Revision 1.90
Sun May 6 19:23:31 2007 UTC (7 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.89
Changes since revision 1.89: +1 -1 lines
Use SYSREF to reference count struct vnode.  v_usecount is now
v_sysref(.refcnt).  v_holdcnt is now v_auxrefs.  SYSREF's termination state
(using a negative reference count from -0x40000000+) now places the vnode in
a VCACHED or VFREE state and deactivates it.  The vnode is now assigned a
64 bit unique id via SYSREF.

vhold() (which manipulates v_auxrefs) no longer reactivates a vnode and
is explicitly used only to track references from auxiliary structures
and references to prevent premature destruction of the vnode.  vdrop()
will now only move a vnode from VCACHED to VFREE on the 1->0 transition
of v_auxrefs if the vnode is in a termination state.

vref() will now panic if used on a vnode in a termination state.  vget()
must now be used to explicitly reactivate a vnode.  These requirements
existed before but are now explicitly asserted.

vlrureclaim() and allocvnode() should now interact a bit better.  In
particular, vlrureclaim() will do a better job of finding vnodes to flush
and transition from VCACHED to VFREE, and allocvnode() will do a better
job finding vnodes to reuse without getting blocked by a flush.

allocvnode now uses a real VX lock to sequence vnodes into VRECLAIMED.  All
vnode special state processing now uses a VX lock.

Vnodes are now able to be slowly returned to the memory pool when
kern.maxvnodes is reduced at run time.

Various initialization elements have been moved to CTOR/DTOR and are
no longer in the critical path, improving performance.  However, since
SYSREF uses atomic_cmpset_int() (aka cmpxchgl), which reduces performance
somewhat, overall performance tends to be about the same.

Revision 1.89
Fri Jan 12 03:05:49 2007 UTC (7 years, 9 months ago) by dillon
Branches: MAIN
CVS tags: DragonFly_RELEASE_1_8_Slip, DragonFly_RELEASE_1_8
Diff to: previous 1.88
Changes since revision 1.88: +6 -2 lines
Add missing link options to export global symbols to the _DYNAMIC section,
allowing the kernel namelist functions to operate.  For now just make
certain static variables global instead of using linker magic to export
static variables.

Add infrastructure to allow out of band kernel memory to be accessed.  The
virtual kernel's memory map does not include the virtual kernel executable
or data areas.

vmstat, systat, pstat, and netstat now work with virtual kernels.

Revision 1.88
Mon Jan 8 19:42:24 2007 UTC (7 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.87
Changes since revision 1.87: +11 -21 lines
Rewrite vmapbuf() to use vm_fault_page_quick() instead of vm_fault_quick().
Overhead is slightly increased (until we can optimize vm_fault_page_quick()),
but the code is greatly simplified.

Revision 1.87
Mon Jan 1 22:51:17 2007 UTC (7 years, 10 months ago) by corecode
Branches: MAIN
Diff to: previous 1.86
Changes since revision 1.86: +4 -4 lines
1:1 Userland threading stage 2.10/4:

Separate p_stats into p_ru and lwp_ru.

proc.p_ru keeps track of all statistics directly related to a proc.  This
consists of RSS usage and nswap information and aggregate numbers for all
former lwps of this proc.

proc.p_cru is the sum of all stats of reaped children.

lwp.lwp_ru contains the stats directly related to one specific lwp, meaning
packet, scheduler switch or page fault counts, etc.  This information gets
added to lwp.lwp_proc.p_ru when the lwp exits.

Revision 1.86
Sun Dec 31 03:50:07 2006 UTC (7 years, 10 months ago) by dillon
Branches: MAIN
Diff to: previous 1.85
Changes since revision 1.85: +1 -1 lines
Correct a conditional used to detect a panic situation.  The index was off by
one.

Revision 1.85
Thu Dec 28 21:24:01 2006 UTC (7 years, 10 months ago) by dillon
Branches: MAIN
Diff to: previous 1.84
Changes since revision 1.84: +14 -14 lines
Make kernel_map, buffer_map, clean_map, exec_map, and pager_map direct
structural declarations instead of pointers.  Clean up all related code,
in particular kmem_suballoc().

Remove the offset calculation for kernel_object.  kernel_object's page
indices used to be relative to the start of kernel virtual memory in order
to improve the performance of VM page scanning algorithms.  The optimization
is no longer needed now that VM objects use Red-Black trees.  Removal of
the offset simplifies a number of calculations and makes the code more
readable.

Revision 1.84
Thu Dec 28 18:29:03 2006 UTC (7 years, 10 months ago) by dillon
Branches: MAIN
Diff to: previous 1.83
Changes since revision 1.83: +4 -4 lines
Introduce globals: KvaStart, KvaEnd, and KvaSize.  Used by the kernel
instead of the nutty VADDR and VM_*_KERNEL_ADDRESS macros.  Move extern
declarations for these variables as well as for virtual_start, virtual_end,
and phys_avail[] from MD headers to MI headers.

Make kernel_object a global structure instead of a pointer.

Remove kmem_object and all related code (none of it is used any more).

Revision 1.83
Sat Dec 23 23:47:54 2006 UTC (7 years, 10 months ago) by swildner
Branches: MAIN
Diff to: previous 1.82
Changes since revision 1.82: +1 -1 lines
Ansify function declarations and fix some minor style issues.

In-collaboration-with: Alexey Slynko <slynko@tronet.ru>

Revision 1.82
Sat Dec 23 00:35:04 2006 UTC (7 years, 10 months ago) by swildner
Branches: MAIN
Diff to: previous 1.81
Changes since revision 1.81: +21 -21 lines
Rename printf -> kprintf in sys/ and add some defines where necessary
(files which are used in userland, too).

Revision 1.81
Tue Sep 19 16:06:11 2006 UTC (8 years, 1 month ago) by dillon
Branches: MAIN
Diff to: previous 1.80
Changes since revision 1.80: +2 -2 lines
Remove the last bits of code that stored mount point linkages in vnodes.
Mount point linkages are now ENTIRELY a function of the namecache topology,
made possible by DragonFly's advanced namecache.

This fixes a number of problems with NULLFS and adds two major features to
our NULLFS mounting capabilities.

NULLFS mounting paths NO LONGER NEED TO BE DISTINCT.  For example, you
can now safely do things like 'mount_null -o ro / /fubar/jail1' without
creating a recursion and you can now create SUB-MOUNTS within nullfs
mounts, such as 'mount_null -o ro /usr /fubar/jail1/usr', without creating
problems in the original master partitions.

The result is that NULLFS can now be used to glue arbitrary pieces of
filesystems together using a mixture of read-only and read-write NULLFS
mounts for situations where localhost NFS mounts had to be used before.
Jail or chroot construction is now utterly trivial.

With-input-from: Joerg Sonnenberger <joerg@britannica.bec.de>

Revision 1.80
Mon Sep 11 20:25:01 2006 UTC (8 years, 1 month ago) by dillon
Branches: MAIN
Diff to: previous 1.79
Changes since revision 1.79: +3 -1 lines
Move flag(s) representing the type of vm_map_entry into its own vm_maptype_t
type.  This is a precursor to adding a new VM mapping type for virtualized
page tables.

Revision 1.79
Tue Sep 5 00:55:45 2006 UTC (8 years, 1 month ago) by dillon
Branches: MAIN
Diff to: previous 1.78
Changes since revision 1.78: +3 -3 lines
Rename malloc->kmalloc, free->kfree, and realloc->krealloc.  Pass 1

Revision 1.78
Fri Jul 7 13:00:37 2006 UTC (8 years, 3 months ago) by corecode
Branches: MAIN
CVS tags: DragonFly_RELEASE_1_6_Slip, DragonFly_RELEASE_1_6
Diff to: previous 1.77
Changes since revision 1.77: +1 -1 lines
Correct typo in comment

Revision 1.53.2.2
Mon Jun 5 14:51:29 2006 UTC (8 years, 4 months ago) by dillon
Branches: DragonFly_RELEASE_1_4
CVS tags: DragonFly_RELEASE_1_4_Slip
Diff to: previous 1.53.2.1; next MAIN 1.54
Changes since revision 1.53.2.1: +25 -1 lines
Add some diagnostic messages to try to catch a ufs_dirbad panic before it
happens.

MFC: Reorder BUF_UNLOCK() - it must occur after b_flags is modified, not
before.

A newly created non-VMIO buffer is now marked B_INVAL.  Callers of getblk()
now always clear B_INVAL before issuing a READ I/O or when clearing or
overwriting the buffer.  Before this change, a getblk() (getnewbuf),
brelse(), getblk() sequence on a non-VMIO buffer would result in a buffer
with B_CACHE set yet containing uninitialized data.

MFC: B_NOCACHE cannot be set on a clean VMIO-backed buffer as this will
destroy the VM backing store, which might be dirty.

MFC: Reorder vnode_pager_setsize() calls to close a race condition.

Revision 1.77
Sat May 27 20:17:16 2006 UTC (8 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.76
Changes since revision 1.76: +25 -8 lines
Mark various forms of read() and write() MPSAFE.  Note that the MP lock is
still acquired, but now it's a lot deeper in the fileops.

Mark dup(), dup2(), close(), closefrom(), and fcntl() MPSAFE.  Some code
paths don't have to get the MP lock, but most still do deeper into the
fileops.

Revision 1.76
Thu May 25 19:31:13 2006 UTC (8 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.75
Changes since revision 1.75: +36 -31 lines
Fix several buffer cache issues related to B_NOCACHE.

* Do not set B_NOCACHE when calling vinvalbuf(... V_SAVE).  This will
  destroy dirty VM backing store associated with clean buffers before
  the VM system has a chance to check for and flush them.

  Taken-from: FreeBSD

* Properly set B_NOCACHE when destroying buffers related to truncated data.

* Fix a bug in vnode_pager_setsize() that was recently introduced.
  v_filesize was being set before a new/old size comparison, causing a
  file truncation to not destroy related VM pages past the new EOF.

* Remove a bogus B_NOCACHE|B_DIRTY test in brelse().  This was originally
  intended to be a B_NOCACHE|B_DELWRI test which then cleared B_NOCACHE,
  but now that B_NOCACHE operation has been fixed it really does indicate that
  the buffer, its contents, and its backing store are to be destroyed, even
  if the buffer is marked B_DELWRI.

  Instead of clearing B_NOCACHE when B_DELWRI is found to be set, clear
  B_DELWRITE when B_NOCACHE is found to be set.

  Note that B_NOCACHE is still cleared when bdirty() is called in order to
  ensure that data is not lost when softupdates and other code do a
  'B_NOCACHE + bwrite' sequence.  Softupdates can redirty a buffer in its
  io completion hook and a write error can also redirty a buffer.

* The VMIO buffer rundown seems to have morphed into a state where the
  distinction between NFS and non-NFS buffers can be removed.  Remove
  the test.
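
The corrected rule from the fourth point, as a self-contained sketch:

    #define B_DELWRI_SK   0x01
    #define B_NOCACHE_SK  0x02

    /*
     * B_NOCACHE now means the buffer, its contents, and its backing
     * store are to be destroyed, so it overrides a pending delayed
     * write instead of being cancelled by one.
     */
    static int
    brelse_flags_sk(int b_flags)
    {
        if (b_flags & B_NOCACHE_SK)
            b_flags &= ~B_DELWRI_SK;
        return (b_flags);
    }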

Revision 1.75
Sun May 7 00:24:02 2006 UTC (8 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.74
Changes since revision 1.74: +1 -1 lines
We have to use pmap_extract() here.  If we lose a race against page
table cleaning, pmap_kextract() could choke on a missing page directory.

Revision 1.74
Fri May 5 16:35:00 2006 UTC (8 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.73
Changes since revision 1.73: +7 -7 lines
Remove VOP_BWRITE().  This function provided a way for a VFS to override
the bwrite() function and was used *only* by NFS in order to allow NFS to
handle the B_NEEDCOMMIT flag as part of NFSv3's 2-phase commit operation.
However, over time, the handling of this flag was moved to the strategy code.
Additionally, the kernel now fully supports the redirtying of buffers
during an I/O (which both softupdates and NFS need to be able to do).

The override is no longer needed.  All former calls to VOP_BWRITE() now
simply call bwrite().

Revision 1.73
Fri May 5 16:15:56 2006 UTC (8 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.72
Changes since revision 1.72: +12 -12 lines
Cleanup procedure prototypes, get rid of extra spaces in pointer decls.

Revision 1.72
Thu May 4 18:32:22 2006 UTC (8 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.71
Changes since revision 1.71: +1 -1 lines
Block devices generally truncate the size of I/O requests which go past EOF.
This is exactly what we want when manually reading or writing a block device
such as /dev/ad0s1a, but is not desired when a VFS issues I/O ops on
filesystem buffers.  In such cases, any EOF condition must be considered an
error.

Implement a new filesystem buffer flag B_BNOCLIP, which getblk() and friends
automatically set.  If set, block devices are guaranteed to return an error
if the I/O request is at EOF or would otherwise have to be clipped to EOF.
Block devices further guarantee that b_bcount will not be modified when this
flag is set.

Adjust all block device EOF checks to use the new flag, and clean up the code
while I'm there.  Also, set b_resid in a couple of degenerate cases where
it was not being set.
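
A self-contained sketch of the EOF policy with and without B_BNOCLIP (the
function shape is hypothetical):

    /* Returns 0 on success, -1 for the guaranteed-error case. */
    static int
    eof_clip_sk(long long offset, int *bcount, long long media_eof,
                int b_bnoclip)
    {
        if (offset + *bcount <= media_eof)
            return (0);                  /* request fits entirely */
        if (b_bnoclip)
            return (-1);                 /* filesystem buffer: error out;
                                          * b_bcount left unmodified */
        *bcount = (int)(media_eof - offset);  /* raw access: clip to EOF */
        return (0);
    }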

Revision 1.71
Wed May 3 20:44:49 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.70
Changes since revision 1.70: +3 -5 lines
- Clarify the definitions of b_bufsize, b_bcount, and b_resid.
- Remove unnecessary assignments based on the clarified fields.
- Add additional checks for premature EOF.

b_bufsize is only used by buffer management entities such as getblk() and
other vnode-backed buffer handling procedures.  b_bufsize is not required
for calls to vn_strategy() or dev_dstrategy().  A number of other subsystems
use it to track the original request size.

b_bcount is the I/O request size, but b_bcount is allowed to be truncated
by the device chain if the request encompasses EOF (such as on a raw disk
device).  A caller which needs to record the original buffer size versus
the EOF-truncated buffer can compare b_bcount after the I/O against a
recorded copy of the original request size.  This copy can be recorded in
b_bufsize for unmanaged buffers (malloced or getpbuf()'d buffers).

b_resid is always relative to b_bcount, not b_bufsize.  A successful read
that is truncated to the device EOF will thus have a b_resid of 0 and a
truncated b_bcount.
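
These rules make the "bytes actually transferred" computation uniform, for
example (struct stand-in; field semantics per the message):

    struct buf_sk { int b_bufsize; int b_bcount; int b_resid; };

    /*
     * b_resid counts against b_bcount (possibly already truncated to
     * the device EOF), never against b_bufsize, so this holds for
     * clipped transfers too.
     */
    static int
    bytes_transferred_sk(const struct buf_sk *bp)
    {
        return (bp->b_bcount - bp->b_resid);
    }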

Revision 1.70
Sun Apr 30 20:23:24 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.69
Changes since revision 1.69: +48 -35 lines
Remove buf->b_saveaddr, assert that vmapbuf() is only called on pbuf's.  Pass
the user pointer and length to vmapbuf() rather than having it try to pull
the information out of the buffer.  vmapbuf() is now responsible for setting
b_data, b_bufsize, and b_bcount.

Also fix a bug in cam_periph_mapmem().  The procedure was failing to unmap
earlier vmapped bufs if later vmapbuf() calls in the loop failed.

Revision 1.69
Sun Apr 30 18:52:36 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.68
Changes since revision 1.68: +2 -2 lines
The pbuf subsystem now initializes b_kvabase and b_kvasize at startup and
no longer reinitializes these fields in initpbuf().

Users of getpbuf() may no longer modify b_kvabase or b_kvasize.  b_data may
still be modified.

Revision 1.68
Sun Apr 30 18:25:35 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.67
Changes since revision 1.67: +0 -1 lines
Remove b_xflags.  Fold BX_VNCLEAN and BX_VNDIRTY into b_flags as
B_VNCLEAN and B_VNDIRTY.  Remove BX_AUTOCHAINDONE and recode the
swap pager to use one of the caller data fields in the BIO instead.

Revision 1.67
Sun Apr 30 17:22:17 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.66
Changes since revision 1.66: +81 -67 lines
Replace the buffer cache's B_READ, B_WRITE, B_FORMAT, and B_FREEBUF
b_flags with a separate b_cmd field.  Use b_cmd to test for I/O completion
as well (getting rid of B_DONE in the process).  This further simplifies
the setup required to issue a buffer cache I/O.

Remove a redundant header file, bus/isa/i386/isa_dma.h and merge any
discrepancies into bus/isa/isavar.h.

Give ISADMA_READ/WRITE/RAW their own independent flag definitions instead of
trying to overload them on top of B_READ, B_WRITE, and B_RAW.  Add a
routine isa_dmabp() which takes a struct buf pointer and returns the ISA
dma flags associated with the operation.

Remove the 'clear_modify' argument to vfs_busy_pages().  Instead,
vfs_busy_pages() asserts that the buffer's b_cmd is valid and then uses
it to determine the action it must take.
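
Sketched, the conversion replaces flag bits with a small command enum; the
replaced flags are from the message, the specific members and values are
illustrative:

    typedef enum buf_cmd_sk {
        BUF_CMD_DONE_SK = 0,    /* no I/O pending; doubles as 'done' */
        BUF_CMD_READ_SK,        /* was B_READ */
        BUF_CMD_WRITE_SK,       /* was B_WRITE */
        BUF_CMD_FREEBLKS_SK,    /* was B_FREEBUF */
        BUF_CMD_FORMAT_SK       /* was B_FORMAT */
    } buf_cmd_sk_t;

    /* I/O completion test without a separate B_DONE flag. */
    #define BUF_IS_DONE_SK(cmd) ((cmd) == BUF_CMD_DONE_SK)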

Revision 1.66
Fri Apr 28 16:34:01 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.65
Changes since revision 1.65: +14 -10 lines
Get rid of pbgetvp() and pbrelvp().  Instead fold the B_PAGING flag directly
into getpbuf() (the only type of buffer that pbgetvp() could be called on
anyway).  Change related b_flags assignments from '=' to '|='.

Get rid of remaining dependencies on b_vp.  vn_strategy() now relies solely
on the vp passed to it as an argument.  Remove buffer cache code that sets
b_vp for anonymous pbuf's.

Add a stopgap 'vp' argument to vfs_busy_pages().  This is only really needed
by NFS and the clustering code do to the severely hackish nature of the
NFS and clustering code.

Fix a bug in the ext2fs inode code where vfs_busy_pages() was being called
on B_CACHE buffers.  Add an assertion to vfs_busy_pages() to panic if it
encounters a B_CACHE buffer.

Revision 1.65
Fri Apr 28 06:13:54 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.64
Changes since revision 1.64: +2 -27 lines
Get rid of the remaining buffer background bitmap code.  It's been turned
off for a while, and it represents a fairly severe hack to the buffer
cache code that just complicates further development.

Revision 1.64
Fri Apr 28 00:24:46 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.63
Changes since revision 1.63: +0 -5 lines
Remove the buffer cache's B_PHYS flag.  This flag was originally used as
part of a severe hack to treat buffers containing 'user' addresses
differently, in particular by using b_offset instead of b_blkno.  Now that
buffer cache buffers only HAVE b_offset (b_*blkno is gone for good), there
is literally no difference between B_PHYS I/O and non-B_PHYS I/O once
the buffer has been handed off to the device.

Revision 1.63
Thu Apr 27 23:28:32 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.62
Changes since revision 1.62: +20 -0 lines
Move most references to the buffer cache array (buf[]) to kern/vfs_bio.c.
Implement a procedure which scans all buffers, called scan_all_buffers().
Cleanup unused debugging code referencing buf[].
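
A plausible prototype for the new scanner, hedged since the commit does not
show it:

    struct buf_sk;      /* stand-in for struct buf */

    /*
     * Walk every buffer in buf[], handing each one to the callback
     * along with the caller's cookie; a nonzero return from the
     * callback ends the scan early.
     */
    int scan_all_buffers_sk(int (*callback)(struct buf_sk *, void *),
                            void *info);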

Revision 1.62
Mon Apr 24 21:44:52 2006 UTC (8 years, 6 months ago) by dillon
Branches: MAIN
Diff to: previous 1.61
Changes since revision 1.61: +19 -2 lines
If softupdates or some other entity re-dirties a buffer, make sure
that B_NOCACHE is cleared to prevent the buffer from being discarded.
Add printfs to warn if the situation is encountered.

Fix a bug in brelse() where a buffer's flags were being modified after
the unlock instead of before.
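
The bug class fixed here is the classic ordering mistake; in sketch form (flag
chosen arbitrarily):

    bp->b_flags |= B_RELBUF;    /* finalize state while still locked... */
    BUF_UNLOCK(bp);             /* ...and only then release the buffer */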

Revision 1.53.2.1
Tue Apr 18 17:12:25 2006 UTC (8 years, 6 months ago) by dillon
Branches: DragonFly_RELEASE_1_4
Diff to: previous 1.53
Changes since revision 1.53: +16 -7 lines
MFC vfs_bio.c 1.57, vfs_subr.c 1.69 - fix race condition in vfs_bio_awrite().

Revision 1.61
Sat Apr 1 22:20:18 2006 UTC (8 years, 7 months ago) by dillon
Branches: MAIN
Diff to: previous 1.60
Changes since revision 1.60: +31 -49 lines
Require that *ALL* vnode-based buffer cache ops be backed by a VM object.
No exceptions.  Start simplifying getblk() based on the new requirements.

Revision 1.60
Wed Mar 29 18:44:50 2006 UTC (8 years, 7 months ago) by dillon
Branches: MAIN
Diff to: previous 1.59
Changes since revision 1.59: +15 -32 lines
Remove VOP_GETVOBJECT, VOP_DESTROYVOBJECT, and VOP_CREATEVOBJECT.  Rearrange
the VFS code such that VOP_OPEN is now responsible for associating a VM
object with a vnode.  Add the vinitvmio() helper routine.

Revision 1.59
Fri Mar 24 18:35:33 2006 UTC (8 years, 7 months ago) by dillon
Branches: MAIN
Diff to: previous 1.58
Changes since revision 1.58: +61 -73 lines
Major BUF/BIO work commit.  Make I/O BIO-centric and specify the disk or
file location with a 64 bit offset instead of a 32 bit block number.

* All I/O is now BIO-centric instead of BUF-centric.

* File/Disk addresses universally use a 64 bit bio_offset now.  bio_blkno
  no longer exists.

* Stackable BIO's hold disk offset translations.  Translations are no longer
  overloaded onto a single structure (BUF or BIO).

* bio_offset == NOOFFSET is now universally used to indicate that a
  translation has not been made.  The old (blkno == lblkno) junk has all
  been removed.

* There is no longer a distinction between logical I/O and physical I/O.

* All driver BUFQs have been converted to BIOQs.

* BMAP, FREEBLKS, getblk, bread, breadn, bwrite, inmem, cluster_*,
  and findblk all now take and/or return 64 bit byte offsets instead
  of block numbers.  Note that BMAP now returns a byte range for the before
  and after variables.
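
After this change a read takes a byte offset, along these lines (call shape
assumed): callers convert any block number to bytes themselves.

    error = bread(vp, (off_t)lblkno * blksize, blksize, &bp);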

Revision 1.58
Sun Mar 5 18:38:34 2006 UTC (8 years, 7 months ago) by dillon
Branches: MAIN
Diff to: previous 1.57
Changes since revision 1.57: +36 -131 lines
Replace the global buffer cache hash table with a per-vnode red-black tree.
Add a B_HASHED b_flags bit as a sanity check.  Remove the invalhash junk
and replace with assertions in several cases where the buffer must already
not be hashed.  Get rid of incore() and gbincore() and replace with a new
function called findblk().

Merge the new RB management with bgetvp(), the two are now fully integrated.

Previous work has turned reassignbuf() into a mostly degenerate call; simplify
its arguments and functionality to match.  Remove an unnecessary reassignbuf()
call from the NFS code.  Get rid of pbreassignbuf().

Adjust the code in several places where it was assumed that calling
BUF_LOCK() with LK_SLEEPFAIL after previously failing with LK_NOWAIT
would always fail.  This code was used to sleep before a retry.  Instead,
if the second lock unexpectedly succeeds, simply issue an unlock and retry
anyway.

Testing-by: Stefan Krueger <skrueger@meinberlikomm.de>

Revision 1.57
Thu Mar 2 20:28:49 2006 UTC (8 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.56
Changes since revision 1.56: +16 -7 lines
vfs_bio_awrite() was unconditionally locking a buffer without checking
for races, potentially resulting in the wrong buffer, an invalid buffer,
or a recently replaced buffer being written out.  Change the call semantics
to require a locked buffer to be passed into the function rather than
locking the buffer in the function.
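
Callers now follow a lookup-then-lock pattern along these lines (a sketch; the
lock flags are assumed, findblk() is per the adjacent 1.58 entry):

    /*
     * The caller locks the buffer, and can therefore verify it is
     * still the intended one, before vfs_bio_awrite() writes it out.
     */
    bp = findblk(vp, loffset);
    if (bp != NULL && BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT) == 0)
        vfs_bio_awrite(bp);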

Revision 1.56
Thu Mar 2 19:26:14 2006 UTC (8 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.55
Changes since revision 1.55: +0 -2 lines
buftimespinlock is utterly useless since the spinlock is released
within lockmgr().  The only real problem was with lk_prio, which no longer
exists, so get rid of the spin lock and document the remaining passive
races.

Revision 1.55
Thu Mar 2 19:07:59 2006 UTC (8 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.54
Changes since revision 1.54: +7 -3 lines
Pass LK_PCATCH instead of trying to store tsleep flags in the lock
structure, so multiple entities competing for the same lock do not
use unexpected flags when sleeping.

Only NFS really uses PCATCH with lockmgr locks.

Revision 1.54
Fri Feb 17 19:18:06 2006 UTC (8 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.53
Changes since revision 1.53: +221 -216 lines
Make the entire BUF/BIO system BIO-centric instead of BUF-centric.  Vnode
and device strategy routines now take a BIO and must pass that BIO to
biodone().  All code which previously managed a BUF undergoing I/O now
manages a BIO.

The new BIO-centric algorithms allow BIOs to be stacked, where each layer
represents a block translation, completion callback, or caller or device
private data.  This information is no longer overloaded within the BUF.
Translation layer linkages remain intact as a 'cache' after I/O has completed.

The VOP and DEV strategy routines no longer make assumptions as to which
translated block number applies to them.  They use the block number in the
BIO specifically passed to them.

Change the 'untranslated' constant to NOOFFSET (for bio_offset), and
(daddr_t)-1 (for bio_blkno).  Rip out all code that previously set the
translated block number to the untranslated block number to indicate
that the translation had not been made.

Rip out all the cluster linkage fields for clustered VFS and clustered
paging operations.  Clustering now occurs in a private BIO layer using
private fields within the BIO.

Reformulate the vn_strategy() and dev_dstrategy() abstraction(s).  These
routines no longer assume that bp->b_vp == the vp of the VOP operation, and
the dev_t is no longer stored in the struct buf.  Instead, only the vp passed
to vn_strategy() (and related *_strategy() routines for VFS ops), and
the dev_t passed to dev_dstrategy() (and related *_strategy() routines for
device ops) is used by the VFS or DEV code.  This will allow an arbitrary
number of translation layers in the future.

Create an independent per-BIO tracking entity, struct bio_track, which
is used to determine when I/O is in-progress on the associated device
or vnode.

NOTE: Unlike FreeBSD's BIO work, our struct BUF is still used to hold
the fields describing the data buffer, resid, and error state.

Major-testing-by: Stefan Krueger

Revision 1.53
Sat Nov 19 17:19:47 2005 UTC (8 years, 11 months ago) by dillon
Branches: MAIN
Branch point for: DragonFly_RELEASE_1_4
Diff to: previous 1.52
Changes since revision 1.52: +2 -2 lines
Convert the lockmgr interlock from a token to a spinlock.  This fixes a
problem on SMP boxes where the MP lock would unexpectedly lose atomicity for
a short period of time due to token acquisition.

Add a tsleep_interlock() call which takes advantage of tsleep()'s cpu
locality of reference to provide a helper function which allows us to
atomically spin_unlock() and tsleep() in an MP safe manner with only
a critical section.  Basically all it does is set a cpumask bit for the
ident hash index to cause other cpus issuing a wakeup to notify our cpu.
Any actual wakeup occurring during the race period after the spin_unlock
but before the tsleep() call will be delayed by the critical section
until after the tsleep has queued the thread.

Cleanup some unused junk in vm_map.h.
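
The helper enables a race-free unlock-and-sleep pattern roughly like the
following; all signatures here are assumed:

    crit_enter();                /* critical section covers the window */
    tsleep_interlock(ident);     /* flag our cpu for wakeups on ident */
    spin_unlock(&lock);          /* a racing wakeup is no longer lost, */
    error = tsleep(ident, 0, "wait", 0);   /* ...it is delivered here */
    crit_exit();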

Revision 1.52
Mon Nov 14 19:14:05 2005 UTC (8 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.51
Changes since revision 1.51: +10 -2 lines
Temporarily check for and correct a race in getnewbuf() that exists due
to the fact that lockmgr locks use tokens for their interlock.  The use
of a token can cause the atomicity of the big giant lock to be temporarily
lost and wind up breaking the assumed atomicity of higher level operations that
believed themselves to be safe making lockmgr calls with the LK_NOWAIT flag.

The general problem will soon be fixed by changing the lockmgr interlock
from a token to one of Jeffrey Hsu's spin locks.  Fortunately there are
only a few places left in DragonFly where LK_INTERLOCK is used.

Revision 1.51
Mon Oct 24 20:12:11 2005 UTC (9 years ago) by dillon
Branches: MAIN
Diff to: previous 1.50
Changes since revision 1.50: +1 -0 lines
Add a missing BUF_UNLOCK in the last commit.

Revision 1.50
Mon Oct 24 17:14:04 2005 UTC (9 years ago) by dillon
Branches: MAIN
Diff to: previous 1.49
Changes since revision 1.49: +24 -3 lines
Add two checks for potential buffer cache races.

"Warning buffer %p (vp %p lblkno %d) was recycled"
    Occurs if a buffer is recycled unexpectedly.  The code will print this
    warning and retry if it detects the case.

"Warning invalid buffer %p (vp %p lblkno %d) did not have cleared b_blkno cache"
    Occurs if a B_INVAL buffer's b_blkno cache has not been reset.  The
    code will reset the cache if it detects this case.

Revision 1.49
Thu Aug 25 20:11:18 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.48
Changes since revision 1.48: +6 -21 lines
Remove the NO_B_MALLOC preprocessor macro; it was never turned on and was
just a hindrance.

Reviewed-by:	Matthew Dillon

Revision 1.48
Wed Aug 10 01:11:19 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.47
Changes since revision 1.47: +6 -6 lines
Re-word some sysctl descriptions, make them compact.

Revision 1.47
Mon Aug 8 16:53:11 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.46
Changes since revision 1.46: +0 -2 lines
Move the bswlist symbol into vm/vm_pager.c because PBUFs are its only
consumer.

The PBUF abstraction is just a clever hack; this code will be redone
at some point, so this measure is temporary.

Revision 1.46
Mon Aug 8 01:25:31 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.45
Changes since revision 1.45: +4 -3 lines
BUF/BIO cleanup 7/99:

First attempt at separating low-level information from BUF structure into
the new BIO structure.  The latter will be used to represent the actual
I/O underlying the buffer cache, other subsystems and device drivers.

Other information from the BUF structure will be moved eventually once
their place in the grand scheme is determined.  For now, preprocessor macros
have been added to reduce widespread changes; this is a temporary measure
by all means until more of the BIO and BUF API is formalised.

Remove compatibility preprocessor macros in the AAC driver because our
BUF/BIO system is mutating; not to mention they were getting in the way.

NB the name BIO has been used because it's quite appropriate and known
among kernel developers from other operating system groups, be it BSD or
Linux.

This change should not have any operational effect (famous last words).

Reviewed by:	Matthew Dillon <dillon@dragonflybsd.org>

Revision 1.45
Sun Aug 7 03:28:50 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.44
Changes since revision 1.44: +1 -2 lines
BUF/BIO cleanup 6/99:

Move the 'bogus_offset' variable into bufinit(); it is not used anywhere
outside of said function.

Revision 1.44
Sun Aug 7 03:17:37 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.43
Changes since revision 1.43: +2 -0 lines
Add 'debug.sizeof.buf' sysctl for determining size of struct buf on a
system.

Declare the _debug_sizeof sysctl in sys/sysctl.h instead of redundantly
declaring in two files.

Revision 1.34.2.1
Fri Aug 5 16:36:50 2005 UTC (9 years, 2 months ago) by dillon
Branches: DragonFly_RELEASE_1_2
CVS tags: DragonFly_RELEASE_1_2_Slip
Diff to: previous 1.34; next MAIN 1.35
Changes since revision 1.34: +64 -24 lines
MFC 1.38.  Fix a case where a buffer is moved to EMPTY or EMPTYKVA without
disassociating its vnode.

Revision 1.43
Fri Aug 5 04:54:42 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.42: preferred, unified
Changes since revision 1.42: +252 -161 lines
BUF/BIO cleanup 5/99:

Clean up and document the buffer cache sysctls.  The 'getnewbufrestarts',
'getnewbufcalls', 'bufdefragcnt', 'buffreekvacnt' and 'bufreusecnt'
sysctls have been changed to be read-only.  Group them depending on
whether they are writable or not.

Correct, extend and write documentation for various functions in this
file.

Correct typos in various code comments and adjust nearby style issues.

Revision 1.42: download - view: text, markup, annotated - select for diffs
Thu Aug 4 16:44:37 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.41: preferred, unified
Changes since revision 1.41: +3 -2 lines
Initialize buf->b_iodone to NULL during the bufinit(9) stage.

Use NULL instead of 0 for assigning to pointers.

Revision 1.41: download - view: text, markup, annotated - select for diffs
Thu Aug 4 15:37:20 2005 UTC (9 years, 2 months ago) by drhodus
Branches: MAIN
Diff to: previous 1.40: preferred, unified
Changes since revision 1.40: +0 -4 lines
Remove scheduler define which was never used.

Revision 1.40: download - view: text, markup, annotated - select for diffs
Wed Aug 3 16:36:33 2005 UTC (9 years, 2 months ago) by hmp
Branches: MAIN
Diff to: previous 1.39: preferred, unified
Changes since revision 1.39: +8 -6 lines
BUF/BIO cleanup 3/99:

Retire the B_CALL flag in favour of checking the bp->b_iodone pointer
directly, thus simplifying the BUF interface even more.

Move scattered B_UNUSED* flag space definitions into one place, below the
rest of the definitions.
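
A minimal sketch of the simplification, using assumed names rather than
the verbatim commit:

    /* Before: the completion callback was gated on a flag. */
    if (bp->b_flags & B_CALL) {
            bp->b_flags &= ~B_CALL;
            (*bp->b_iodone)(bp);
            return;
    }

    /* After: a non-NULL b_iodone pointer is the flag. */
    if (bp->b_iodone != NULL) {
            void (*iodone)(struct buf *) = bp->b_iodone;

            bp->b_iodone = NULL;    /* one-shot callback */
            (*iodone)(bp);
            return;
    }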

Revision 1.39: download - view: text, markup, annotated - select for diffs
Wed Aug 3 04:59:53 2005 UTC (9 years, 3 months ago) by hmp
Branches: MAIN
Diff to: previous 1.38: preferred, unified
Changes since revision 1.38: +99 -53 lines
BUF/BIO cleanup 2/99:

Localise buffer queue information into kern/vfs_bio.c; it should not be
messed with outside of the named file.  Convert the QUEUE_* #defines into
enum bufq_type, prefixing the names with 'B'.

Move vfs_bufstats() from kern/vfs_syscalls.c into kern/vfs_bio.c since
that's where it should really belong, at least until its use is cleaned
up.

Move bufqueues extern from sys/buf.h into kern/vfs_bio.c as it shouldn't
be messed with by anything else.  It was only sitting in sys/buf.h
because of vfs_bufstats().

Note that the change to initpbuf() is acceptable since pbufs are a hack
anyway, not to mention that the said function and friends should probably
reside in kern/vfs_bio.c.
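
A sketch of the resulting shape, assuming the enumerators simply gained a
'B' prefix (illustrative, not the verbatim commit):

    enum bufq_type {
            BQUEUE_NONE,            /* was QUEUE_NONE */
            BQUEUE_LOCKED,          /* was QUEUE_LOCKED */
            BQUEUE_CLEAN,           /* was QUEUE_CLEAN */
            BQUEUE_DIRTY,           /* was QUEUE_DIRTY */
            BQUEUE_EMPTYKVA,        /* was QUEUE_EMPTYKVA */
            BQUEUE_EMPTY,           /* was QUEUE_EMPTY */
            BUFFER_QUEUES           /* number of queues */
    };

    /* file-local now; previously exported through sys/buf.h */
    static TAILQ_HEAD(bqueues, buf) bufqueues[BUFFER_QUEUES];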

Revision 1.38: download - view: text, markup, annotated - select for diffs
Thu Jul 28 18:15:09 2005 UTC (9 years, 3 months ago) by dillon
Branches: MAIN
Diff to: previous 1.37: preferred, unified
Changes since revision 1.37: +68 -26 lines
There is a case when B_VMIO is clear where a buffer can be placed on the
EMPTY or EMPTYKVA queues without being disassociated from its vnode.
This can lead to a duplicate logical block panic in the red-black tree code.
Rework brelse() to ensure that buffers are properly cleaned up before being
placed on said queues, and add assertions to validate other cases.

Reported-by: Tomaz Borstnar

Revision 1.37: download - view: text, markup, annotated - select for diffs
Mon Jun 6 15:02:28 2005 UTC (9 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.36: preferred, unified
Changes since revision 1.36: +70 -78 lines
Remove spl*() calls from kern, replacing them with critical sections.
Change the meaning of safepri from a cpl mask to a thread priority.
Make a minor adjustment to tests within one of the buffer cache's
critical sections.

Revision 1.36: download - view: text, markup, annotated - select for diffs
Sun May 8 00:12:22 2005 UTC (9 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.35: preferred, unified
Changes since revision 1.35: +1 -3 lines
incore() is used to detect logical block number collisions, and other
callers will check B_INVAL on return.  Do not return a false negative if
the buffer we find happens to be B_INVAL, as this could result in
duplicate buffers in the buffer cache.  Now that the red-black tree code
detects duplicate entries, this case will immediately panic the machine.
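
The essence of the fix, sketched with an assumed lookup helper: stop
filtering out B_INVAL buffers and let the caller decide.

    struct buf *bp;

    bp = buf_lookup(vp, blkno);     /* hypothetical lookup helper */
    /*
     * Previously something like the following hid collisions:
     *
     *     if (bp != NULL && (bp->b_flags & B_INVAL))
     *             bp = NULL;
     */
    return (bp);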

MFC: 2 weeks

Revision 1.35: download - view: text, markup, annotated - select for diffs
Fri Apr 15 19:08:11 2005 UTC (9 years, 6 months ago) by dillon
Branches: MAIN
CVS tags: DragonFly_Stable
Diff to: previous 1.34: preferred, unified
Changes since revision 1.34: +19 -2 lines
Implement Red-Black trees for the vnode clean/dirty buffer lists.

Implement ranged fsyncs and adjust the syncer to use the new capability.
This capability will also soon be used to replace the write_behind
heuristic.  Rewrite the fsync code for all VFSs to use the new APIs
(generally simplifying them).

Get rid of B_WRITEINPROG; it is no longer useful or needed.
Get rid of B_SCANNED; it is no longer useful or needed.

Rewrite the NFS 2-phase commit protocol to take advantage of the new
Red-Black tree topology.

Add RB_SCAN() for callback-scanning of Red-Black trees.  Give RB_SCAN
the ability to track the 'next' scan node and automatically fix it up if
the callback deletes nodes in the tree, directly or indirectly (e.g. by
blocking), while the scan is in progress (a hypothetical use is sketched
at the end of this entry).

Remove most related loop restart conditions; they are no longer necessary.

Disable filesystem background bitmap writes.  This really needs to be
solved a different way and the concept does not work well with red-black
trees.
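
A hypothetical RB_SCAN() use, with assumed tree, comparison, and callback
names, showing why the self-repairing 'next' pointer matters:

    /*
     * Flush every dirty buffer in a logical range.  bwrite() can block,
     * and other buffers may be deleted from the tree while we sleep;
     * RB_SCAN() fixes up its internal 'next' node in that case.
     */
    static int
    flush_range_callback(struct buf *bp, void *data)
    {
            bwrite(bp);             /* may block */
            return (0);
    }

    RB_SCAN(buf_rb_tree, &vp->v_rbdirty_tree, rangecmp,
            flush_range_callback, &range);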

Revision 1.34: download - view: text, markup, annotated - select for diffs
Wed Mar 23 20:37:03 2005 UTC (9 years, 7 months ago) by dillon
Branches: MAIN
Branch point for: DragonFly_RELEASE_1_2
Diff to: previous 1.33: preferred, unified
Changes since revision 1.33: +4 -3 lines
Remove an assertion in bundirty() that requires the buffer to not be on
a queue.  There is a code path in brelse() where the buffer may be put on
a queue prior to calling bundirty().

Reported-by: David Rhodus <sdrhodus@gmail.com>

Revision 1.33: download - view: text, markup, annotated - select for diffs
Sat Jan 29 19:17:06 2005 UTC (9 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.32: preferred, unified
Changes since revision 1.32: +16 -3 lines
getblk() has an old crufty API in which the logical block size is not a
well known quantity.  Device drivers standardize on using DEV_BSIZE (512),
while file ops are supposed to use mount->mnt_stat.f_iosize.

The existing code was testing for a non-NULL vnode->v_mountedhere field
but this field is part of a union and only valid for VDIR types.  It was being
improperly tested on non-VDIR vnode types.  In particular, if vn_isdisk()
fails due to the disk device being ripped out from under a filesystem,
the code would fall through and try to use v_mountedhere, leading to a
crash.  It also makes no sense to use the target mount to calculate the
block size for the underlying mount point's vnode, so this test has been
removed entirely.  The vn_isdisk() test has been replaced with an explicit
VBLK/VCHR test.

Finally, note that filesystems like UFS use varying buffer cache buffer
sizes for different areas of the same block device (e.g. bitmap areas,
inode area, file data areas, superblock), which is why DEV_BSIZE is being
used here.  What really needs to happen is for b_blkno to be entirely
removed in favor of a 64 bit offset.

Crash-Reported-by: Vyacheslav Bocharov <list@smz.com.ua>
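
A sketch of the corrected selection logic with assumed variable names
(device vnodes get DEV_BSIZE, file vnodes get the mount's f_iosize):

    int bsize;

    if (vp->v_type == VBLK || vp->v_type == VCHR)
            bsize = DEV_BSIZE;      /* raw devices mix buffer sizes */
    else if (vp->v_mount != NULL)
            bsize = vp->v_mount->mnt_stat.f_iosize;
    else
            bsize = DEV_BSIZE;      /* no mount association; fall back */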

Revision 1.32: download - view: text, markup, annotated - select for diffs
Tue Nov 9 17:36:41 2004 UTC (9 years, 11 months ago) by dillon
Branches: MAIN
Diff to: previous 1.31: preferred, unified
Changes since revision 1.31: +17 -9 lines
Create a non-blocking version of BUF_REFCNT() called BUF_REFCNTNB() to be
used for non-critical KASSERT()'s or in situations where the buffer lock
is in a known state.

This fixes a blocking condition in the ATA interrupt path.  The normal
BUF_REFCNT() calls lockcount(), which obtains a token; this caused the
interrupt thread to temporarily block in biodone() due to a KASSERT.

Found-from: kernel core provided by David Rhodus.
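
An illustrative contrast, assuming definitions of roughly this shape (not
the verbatim macros):

    /* May acquire a token, and therefore block; fine in normal paths. */
    #define BUF_REFCNT(bp)          lockcount(&(bp)->b_lock)

    /*
     * Reads the count without blocking; for non-critical KASSERTs and
     * interrupt paths where the lock state is already known.
     */
    #define BUF_REFCNTNB(bp)        lockcountnb(&(bp)->b_lock)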

Revision 1.31: download - view: text, markup, annotated - select for diffs
Tue Oct 12 19:29:28 2004 UTC (10 years ago) by dillon
Branches: MAIN
Diff to: previous 1.30: preferred, unified
Changes since revision 1.30: +4 -3 lines
Try to close an occasional VM page related panic that is believed to occur
due to the VM page queues or free lists being indirectly manipulated by
interrupts that are not protected by splvm().  Do this by replacing splvm()'s
with critical sections in a number of places.

Note: some of this work bled over into the "VFS messaging/interfacing work
stage 8/99" commit.

Revision 1.30: download - view: text, markup, annotated - select for diffs
Wed Jul 14 03:43:58 2004 UTC (10 years, 3 months ago) by hmp
Branches: MAIN
CVS tags: DragonFly_Snap29Sep2004, DragonFly_Snap13Sep2004
Diff to: previous 1.29: preferred, unified
Changes since revision 1.29: +1 -1 lines
Correct reference to buf->b_xio.xio_pages in a comment.

Revision 1.29: download - view: text, markup, annotated - select for diffs
Wed Jul 14 03:10:17 2004 UTC (10 years, 3 months ago) by hmp
Branches: MAIN
Diff to: previous 1.28: preferred, unified
Changes since revision 1.28: +99 -92 lines
BUF/BIO work, for removing the requirement of KVA mappings for I/O
requests.

Stage 1 of 8:

	o Replace the b_pages member of the BUF structure with an embedded
	  XIO (b_xio).  The XIO will be used for managing the BUF's page
	  lists.

	o Initialize the XIO at the two main (and only) points: 1) the pbuf
	  code, which is used by the NFS code to create a temporary buffer;
	  and 2) bufinit(9), which is used by the rest of the BUF/BIO
	  consumers.

Discussed-with: 	Matthew Dillon <dillon@apollo.backplane.com>,
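
The shape of the change, abridged (see sys/buf.h for the real structure):

    struct buf {
            /* ... */
            struct xio b_xio;       /* replaces the vm_page_t b_pages[]
                                     * array; page accesses become
                                     * bp->b_xio.xio_pages[i] */
            /* ... */
    };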

Revision 1.28: download - view: text, markup, annotated - select for diffs
Tue Jun 1 22:19:30 2004 UTC (10 years, 5 months ago) by dillon
Branches: MAIN
CVS tags: DragonFly_1_0_REL, DragonFly_1_0_RC1, DragonFly_1_0A_REL
Diff to: previous 1.27: preferred, unified
Changes since revision 1.27: +12 -17 lines
ANSIfication.  No operational changes.

Submitted-by: Tim Wickberg <me@k9mach3.org>

Revision 1.27: download - view: text, markup, annotated - select for diffs
Thu May 20 22:42:24 2004 UTC (10 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.26: preferred, unified
Changes since revision 1.26: +2 -2 lines
Get rid of VM_WAIT and VM_WAITPFAULT crud, replace with calls to
vm_wait() and vm_waitpfault().  This is a non-operational change.

vm_page.c now uses the _vm_page_list_find() inline (which itself is only
in vm_page.c) for various critical path operations.

Revision 1.26: download - view: text, markup, annotated - select for diffs
Wed May 19 22:52:58 2004 UTC (10 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.25: preferred, unified
Changes since revision 1.25: +5 -0 lines
Device layer rollup commit.

* cdevsw_add() is now required.  cdevsw_add() and cdevsw_remove() may specify
  a mask/match indicating the range of supported minor numbers.  Multiple
  cdevsw_add()'s using the same major number, but distinctly different
  ranges, may be issued.  All devices that failed to call cdevsw_add() before
  now do.  (A mask/match sketch follows this entry.)

* cdevsw_remove() now automatically marks all devices within its supported
  range as being destroyed.

* vnode->v_rdev is no longer resolved when the vnode is created.  Instead,
  only v_udev (a newly added field) is resolved.  v_rdev is resolved when
  the vnode is opened and cleared on the last close.

* A great deal of code was making rather dubious assumptions with regard
  to the validity of devices associated with vnodes, primarily due to
  the persistence of a device structure due to being indexed by (major, minor)
  instead of by (cdevsw, major, minor).  In particular, if you run a program
  which connects to a USB device and then you pull the USB device and plug
  it back in, the vnode subsystem will continue to believe that the device
  is open when, in fact, it isn't (because it was destroyed and recreated).

  In particular, note that all the VFS mount procedures now check devices
  via v_udev instead of v_rdev prior to calling VOP_OPEN(), since v_rdev
  is NULL prior to the first open.

* The disk layer's device interaction has been rewritten.  The disk layer
  (i.e. the slice and disklabel management layer) no longer overloads
  its data onto the device structure representing the underlying physical
  disk.  Instead, the disk layer uses the new cdevsw_add() functionality
  to register its own cdevsw using the underlying device's major number,
  and simply does NOT register the underlying device's cdevsw.  No
  confusion is created because the device hash is now based on
  (cdevsw,major,minor) rather than (major,minor).

  NOTE: This also means that underlying raw disk devices may use the entire
  device minor number instead of having to reserve the bits used by the disk
  layer, and also means that we can (theoretically) stack a fully
  disklabel-supported 'disk' on top of any block device.

* The new reference counting scheme prevents the stale-device problem
  described above by associating a device with a cdevsw and disconnecting
  the device from its cdevsw when the cdevsw is removed.  Additionally, all
  udev2dev() lookups run through the cdevsw mask/match and only successfully
  find devices still associated with an active cdevsw.

* Major work on MFS:  MFS no longer shortcuts vnode and device creation.  It
  now creates a real vnode and a real device and implements real open and
  close VOPs.  Additionally, due to the disk layer changes, MFS is no longer
  limited to 255 mounts.  The new limit is 16 million.  Since MFS creates a
  real device node, mount_mfs will now create a real /dev/mfs<PID> device
  that can be read from userland (e.g. so you can dump an MFS filesystem).

* BUF AND DEVICE STRATEGY changes.  The struct buf contains a b_dev field.
  In order to properly handle stacked devices we now require that the b_dev
  field be initialized before the device strategy routine is called.  This
  required some additional work in various VFS implementations.  To enforce
  this requirement, biodone() now sets b_dev to NODEV.  The new disk layer
  will adjust b_dev before forwarding a request to the actual physical
  device.

* A bug in the ISO CD boot sequence which resulted in a panic has been fixed.

Testing by: lots of people, but David Rhodus found the most egregious bugs.
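
A hypothetical illustration of the mask/match registration described above
(invented masks and cdevsw names; only the mask/match idea is from the
commit):

    /*
     * Two drivers share one major number, split by a mask over the
     * minor space: minors whose low three bits are zero go to the raw
     * driver, minors whose low bits are 001 go to the partition driver.
     */
    cdevsw_add(&raw_cdevsw,  0x0007, 0x0000);
    cdevsw_add(&part_cdevsw, 0x0007, 0x0001);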

Revision 1.25: download - view: text, markup, annotated - select for diffs
Thu May 13 17:40:15 2004 UTC (10 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.24: preferred, unified
Changes since revision 1.24: +39 -8 lines
Close an interrupt race between vm_page_lookup() and (typically) a
vm_page_sleep_busy() check by using the correct spl protection.  An
interrupt can occur in between the two operations and unbusy/free the
page in question, causing the busy check to fail and the code to fall
through and then operate on a page that may have been freed and possibly
even reused.  Also note that vm_page_grab() had the same issue between
the lookup, busy check, and vm_page_busy() call.  (A sketch of the
corrected pattern follows this entry.)

Close an interrupt race when scanning a VM object's memq.  Interrupts
can free pages, removing them from memq, which interferes with memq scans
and can cause a page unassociated with the object to be processed as if it
were associated with the object.

Calls to vm_page_hold() and vm_page_unhold() require spl protection.

Rename the passed socket descriptor argument in sendfile() to make the
code more readable.

Fix several serious bugs in procfs_rwmem().  In particular, force it to
block if a page is busy and then retry.

Get rid of vm_pager_map_page() and vm_pager_unmap_page(); make the
functions that used to use these routines use SFBUFs instead.

Get rid of the (userland?) 4MB page mapping feature in pmap_object_init_pt()
for now.  The code appears to not track the page directory properly and
could result in a non-zero page being freed as PG_ZERO.

This commit also includes updated code comments and some additional
non-operational code cleanups.
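
A sketch of the corrected pattern, with assumed context (spl calls as in
the pre-critical-section kernel this commit targets):

    int s;
    vm_page_t m;

retry:
    s = splvm();
    m = vm_page_lookup(object, pindex);
    if (m != NULL && vm_page_sleep_busy(m, TRUE, "pgwait")) {
            /* We slept; the page may have been freed or reused. */
            splx(s);
            goto retry;
    }
    /* ... operate on m while still at splvm() ... */
    splx(s);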

Revision 1.24: download - view: text, markup, annotated - select for diffs
Mon May 10 10:51:31 2004 UTC (10 years, 5 months ago) by hmp
Branches: MAIN
Diff to: previous 1.23: preferred, unified
Changes since revision 1.23: +4 -4 lines
Remove the newline from a panic(9) message; it is redundant.

Revision 1.23: download - view: text, markup, annotated - select for diffs
Sat May 8 04:11:46 2004 UTC (10 years, 5 months ago) by dillon
Branches: MAIN
Diff to: previous 1.22: preferred, unified
Changes since revision 1.22: +33 -2 lines
Peter Edwards brought up an interesting NFS bug which we both originally
thought would be a fairly straightforward bug fix.  But it turns out to
require a nasty hack to fix.

The issue is that near the file EOF NFS uses piecemeal writes and
piecemeal buffer cache buffers.  The result is that manipulation through
the buffer cache only sets some of the m->valid bits in the associated
vm_page(s).  This case may also occur in the middle of a file if for
example a file is piecemeal written and then ftruncated to be much
larger (or lseek/write at a much higher seek position).

The nfs_getpages() routine was assuming that if m->valid was non-0 the
page was basically valid and no read RPC was required to fill it.

The problem is that if you mmap() a piecemeal VM page and fault it in,
m->valid is set to VM_PAGE_BITS_ALL (0xFF).  Then, later, when NFS flushes
the buffer cache, only some of the m->valid bits are cleared (leaving
e.g. 0xFC).  A later page fault will cause NFS to believe that the page is
sufficiently valid and vm_fault will then zero-out the first X bytes of
the page when, in fact, we really should have done an I/O to refill those
X bytes.

The fix in PR misc/64816 (FreeBSD) tried to solve this by checking to see
if the m->valid bits were 'sufficiently valid' in the file EOF case, but
testing with fsx resulted in several failure modes.  This doesn't work
because (1) if you extend the file w/ ftruncate or lseek/write these
partially valid pages can end up in the middle of the file rather than
just at the end, and (2) there may be a dirty buffer associated with these
pages, meaning that the pages may contain dirty data, and we cannot safely
overwrite the pages with a new read I/O.

The solution in this patch is to deal with the screwy m->valid bit clearing
by special-casing NFS and having the BIO system clear ALL the m->valid
bits instead of just some of them when NFS calls vinvalbuf().  That way
m->valid will be set to 0 when the buffer is invalidated and the
nfs_getpages() code can be left doing its simple 'if any m->valid bits
are set assume the whole page is valid' test.  In order for the BIO system
to safely be able to do this (so as not to invalidate portions of a VM page
associated with an adjacent buffer), the NFS io size has been further
restricted to be an integral multiple of PAGE_SIZE.

This is a terrible hack but there is no other way to fix the problem short
of rewriting the entire buffer cache.  We will do that eventually, but not
now.

Reported-by: Peter Edwards <peter.edwards@vordel.com>
Referencing-PR: misc/64816 by Patrick Mackinlay <patrick@spacesurfer.com>
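
A sketch of the size restriction (assumed variable names): clamp the NFS
I/O size down to a whole number of VM pages, so whole-page m->valid
clearing in vinvalbuf() cannot touch pages belonging to an adjacent buffer.

    if (iosize > PAGE_SIZE)
            iosize &= ~PAGE_MASK;   /* round down to a PAGE_SIZE multiple */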

Revision 1.22: download - view: text, markup, annotated - select for diffs
Wed Mar 31 15:32:53 2004 UTC (10 years, 7 months ago) by drhodus
Branches: MAIN
Diff to: previous 1.21: preferred, unified
Changes since revision 1.21: +37 -2 lines
The existing hash algorithm in bufhash() does not distribute entries
very well across buckets, especially in the case of cylinder group blocks
which are located at a sequence of locations that are a multiple of a large
power of two apart.  In the case of large file systems, one or possibly
a few of the hash chains can get excessively long.  Replace the existing
hash algorithm with a variation on the Fibonacci hash.

Merged from FreeBSD
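
An illustrative Fibonacci-style hash (not the committed code): multiplying
by 2^32 divided by the golden ratio smears keys that differ by large powers
of two across the buckets.

    #define FIB_MULT        2654435769u     /* floor(2^32 / phi) */

    static __inline u_int
    bufhash_fib(struct vnode *vp, daddr_t bn, u_int bucket_shift)
    {
            u_int32_t key;

            key = (u_int32_t)(uintptr_t)vp + (u_int32_t)bn;
            return ((key * FIB_MULT) >> (32 - bucket_shift));
    }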

Revision 1.21: download - view: text, markup, annotated - select for diffs
Fri Mar 26 17:23:42 2004 UTC (10 years, 7 months ago) by drhodus
Branches: MAIN
Diff to: previous 1.20: preferred, unified
Changes since revision 1.20: +2 -2 lines
Change the vnode check inside the VFS_BIO_DEBUG code path to check for
erroneous hold counts instead of the reference count check, which was an
irrelevant check here.

Revision 1.20: download - view: text, markup, annotated - select for diffs
Thu Mar 11 20:14:46 2004 UTC (10 years, 7 months ago) by hmp
Branches: MAIN
Diff to: previous 1.19: preferred, unified
Changes since revision 1.19: +1 -1 lines
Replace a manual check for a VMIO candidate with vn_canvmio() under
VFS_BIO_DEBUG.

This silences an annoying warning in getblk() when VMIO'ing on a
VDIR (directory) vnode; this happens due to the vmiodirenable sysctl
being set to `1'.

Discussed with: 	Matthew Dillon

Revision 1.19: download - view: text, markup, annotated - select for diffs
Mon Mar 1 06:33:17 2004 UTC (10 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.18: preferred, unified
Changes since revision 1.18: +1 -1 lines
Newtoken commit.  Change the token implementation as follows:  (1) Obtaining
a token no longer enters a critical section.  (2) Tokens can be held through
scheduler switches and blocking conditions and are effectively released and
reacquired on resume.  Thus tokens serialize access only while the thread
is actually running.  Serialization is not broken by preemptive interrupts.
That is, interrupt threads which preempt do not release the preempted
thread's tokens.  (3) Unlike spl's, tokens will interlock w/ interrupt
threads on the same or on a different cpu.

The vnode interlock code has been rewritten and the API has changed.  The
mountlist vnode scanning code has been consolidated and all known races have
been fixed.  The vnode interlock is now a pool token.

The code that frees unreferenced vnodes whose last VM page has been freed
has been moved out of the low level vm_page_free() code and moved to the
periodic filesystem syncer code in vfs_msync().

The SMP startup code and the IPI code has been cleaned up considerably.
Certain early token interactions on AP cpus have been moved to the BSP.

The LWKT rwlock API has been cleaned up and turned on.

Major testing by: David Rhodus

Revision 1.18: download - view: text, markup, annotated - select for diffs
Mon Feb 16 19:37:48 2004 UTC (10 years, 8 months ago) by dillon
Branches: MAIN
Diff to: previous 1.17: preferred, unified
Changes since revision 1.17: +2 -0 lines
buftimetoken must be declared in a .c file.

Revision 1.17: download - view: text, markup, annotated - select for diffs
Tue Jan 20 05:04:06 2004 UTC (10 years, 9 months ago) by dillon
Branches: MAIN
Diff to: previous 1.16: preferred, unified
Changes since revision 1.16: +2 -2 lines
Retool the M_* flags to malloc() and the VM_ALLOC_* flags to
vm_page_alloc(), and vm_page_grab() and friends.

The M_* flags now have more flexibility, with the intent that we will start
using some of it to deal with NULL pointer return problems in the codebase
(CAM is especially bad at dealing with unexpected return values).  In
particular, add M_USE_INTERRUPT_RESERVE and M_FAILSAFE, and redefine
M_NOWAIT as a combination of M_ flags instead of its own flag.

The VM_ALLOC_* macros are now flags (0x01, 0x02, 0x04) rather than states
(1, 2, 3), which allows us to create combinations that the old interface
could not handle.
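
Illustrative shapes only (exact flag names and compositions assumed):

    /* Flag bits compose where the old states could not: */
    #define VM_ALLOC_NORMAL         0x01    /* was state 1 */
    #define VM_ALLOC_SYSTEM         0x02    /* was state 2 */
    #define VM_ALLOC_INTERRUPT      0x04    /* was state 3 */

    /*
     * M_NOWAIT as a combination of orthogonal M_* bits rather than a
     * standalone flag (component names assumed):
     */
    #define M_NOWAIT                (M_RNOWAIT | M_NULLOK)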

Revision 1.16: download - view: text, markup, annotated - select for diffs
Mon Nov 3 17:11:21 2003 UTC (11 years ago) by dillon
Branches: MAIN
Diff to: previous 1.15: preferred, unified
Changes since revision 1.15: +1 -1 lines
64 bit address space cleanups which are a prerequisite for future 64 bit
address space work and PAE.  Note: this is not PAE.  This patch basically
adds vm_paddr_t, which represents a 'physical address'.  Physical addresses
may be larger than virtual addresses, and on IA32 we make vm_paddr_t a 64
bit quantity.

Submitted-by: Hiten Pandya <hmp@backplane.com>

Revision 1.15: download - view: text, markup, annotated - select for diffs
Wed Oct 8 00:10:56 2003 UTC (11 years ago) by dillon
Branches: MAIN
Diff to: previous 1.14: preferred, unified
Changes since revision 1.14: +12 -1 lines
Disable background bitmap writes.  They appear to cause at least two race
conditions:  First, on MP systems even an LK_NOWAIT lock may block,
invalidating flags checks done just prior to the lock attempt.  Second, on
both MP and UP systems, the original buffer (origbp) may be modified during
the completion of a background write without its lock being held and these
modifications can race against mainline code that is also modifying the same
buffer with the lock held.

Eventually the problem that background bitmap writes solved will be solved
more generally by implementing page COWing during device I/O to avoid
stalls on pages undergoing write I/O.

Revision 1.14: download - view: text, markup, annotated - select for diffs
Wed Aug 27 01:43:07 2003 UTC (11 years, 2 months ago) by dillon
Branches: MAIN
Diff to: previous 1.13: preferred, unified
Changes since revision 1.13: +12 -2 lines
SLAB ALLOCATOR Stage 1.  This brings in a slab allocator written from
scratch by yours truly.  A detailed explanation of the allocator is
included but first, other changes:

* Instead of having vm_map_entry_insert*() and friends allocate the
  vm_map_entry structures, a new mechanism has been put in place whereby
  the vm_map_entry structures are reserved at a higher level and then
  expected to exist in the free pool in deep vm_map code.  This preliminary
  implementation may eventually turn into something more sophisticated that
  includes things like pmap entries and so forth.  The idea is to convert
  what should be low level routines (VM object and map manipulation)
  back into low level routines.

* vm_map_entry structures are now cached per-cpu, which is integrated into
  the reservation model above.

* The zalloc 'kmapentzone' has been removed.  We now only have 'mapentzone'.

* There were race conditions between vm_map_findspace() and actually
  entering the map_entry with vm_map_insert().  These have been closed
  through the vm_map_entry reservation model described above.

* Two new kernel config options now work.  NO_KMEM_MAP has been fleshed out
  a bit more and a number of deadlocks related to having only the kernel_map
  now have been fixed.  The USE_SLAB_ALLOCATOR option will cause the kernel
  to compile-in the slab allocator instead of the original malloc allocator.
  If you specify USE_SLAB_ALLOCATOR you must also specify NO_KMEM_MAP.

* vm_poff_t and vm_paddr_t integer types have been added.  These are meant
  to represent physical addresses and offsets (physical memory might be
  larger than virtual memory, for example with Intel PAE).  They are not
  heavily used yet but the intention is to separate physical representation
  from virtual representation.

			    SLAB ALLOCATOR FEATURES

The slab allocator breaks allocations up into approximately 80 zones based
on their size.  Each zone has a chunk size (alignment).  For example, all
allocations in the 1-8 byte range will allocate in chunks of 8 bytes.  Each
size zone is backed by one or more blocks of memory.  The size of these
blocks is fixed at ZoneSize, which is calculated at boot time to be between
32K and 128K.  The use of a fixed block size allows us to locate the zone
header given a memory pointer with a simple masking operation.
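
A sketch of the masking trick with assumed type names and a fixed example
ZoneSize (the real value is computed at boot):

    #define ZONE_SIZE       (64 * 1024)     /* power of two, aligned */
    #define ZONE_MASK       (ZONE_SIZE - 1)

    typedef struct SLZone SLZone;           /* per-zone header */

    static __inline SLZone *
    zone_of(void *chunk)
    {
            /* The header sits at the base of the aligned zone block. */
            return ((SLZone *)((uintptr_t)chunk & ~(uintptr_t)ZONE_MASK));
    }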

The slab allocator operates on a per-cpu basis.  The cpu that allocates a
zone block owns it.  free() checks the cpu that owns the zone holding the
memory pointer being freed and forwards the request to the appropriate cpu
through an asynchronous IPI.  This request is not currently optimized but it
can theoretically be heavily optimized ('queued') to the point where the
overhead becomes inconsequential.  As of this commit the malloc_type
information is not MP safe, but the core slab allocation and deallocation
algorithms, non-inclusive the having to allocate the backing block,
*ARE* MP safe.  The core code requires no mutexes or locks, only a critical
section.

Each zone contains N allocations of a fixed chunk size.  For example, a
128K zone can hold approximately 16000 8-byte allocations.  The zone
is initially zero'd and new allocations are simply allocated linearly out
of the zone.  When a chunk is freed it is entered into a linked list and
the next allocation request will reuse it.  The slab allocator heavily
optimizes M_ZERO operations at both the page level and the chunk level.
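
A minimal sketch of chunk reuse, with assumed field names:

    typedef struct SLChunk {
            struct SLChunk *c_next;         /* freelist link, stored in
                                             * the freed chunk itself */
    } SLChunk;

    /* free: push the chunk onto the zone's freelist */
    chunk->c_next = z->z_freelist;
    z->z_freelist = chunk;

    /* allocate: pop a freed chunk if available, else carve linearly */
    if ((chunk = z->z_freelist) != NULL) {
            z->z_freelist = chunk->c_next;
    } else {
            chunk = (SLChunk *)z->z_cursor;
            z->z_cursor += z->z_chunksize;  /* zone memory is pre-zeroed */
    }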

The slab allocator maintains various undocumented malloc quirks such as
ensuring that small power-of-2 allocations are aligned to their size,
and malloc(0) requests are also allowed and return a non-NULL result.
kern_tty.c depends heavily on the power-of-2 alignment feature and ahc
depends on the malloc(0) feature.  Eventually we may remove the malloc(0)
feature.

			    PROBLEMS AS OF THIS COMMIT

NOTE!  This commit may destabilize the kernel a bit.  There are issues
with the ISA DMA area ('bounce' buffer allocation) due to the large backing
block size used by the slab allocator, and there are probably some deadlock
issues due to the removal of kmem_map that have not yet been resolved.

Revision 1.13: download - view: text, markup, annotated - select for diffs
Tue Aug 26 21:09:02 2003 UTC (11 years, 2 months ago) by rob
Branches: MAIN
Diff to: previous 1.12: preferred, unified
Changes since revision 1.12: +1 -1 lines
__P() removal

Revision 1.12: download - view: text, markup, annotated - select for diffs
Mon Aug 25 17:01:10 2003 UTC (11 years, 2 months ago) by dillon
Branches: MAIN
Diff to: previous 1.11: preferred, unified
Changes since revision 1.11: +2 -1 lines
Add an alignment feature to vm_map_findspace().  This feature will be used
primarily by the upcoming slab allocator but has many applications.

Use the alignment feature in the buffer cache to hopefully reduce
fragmentation.

Revision 1.11: download - view: text, markup, annotated - select for diffs
Sat Jul 26 19:42:11 2003 UTC (11 years, 3 months ago) by rob
Branches: MAIN
Diff to: previous 1.10: preferred, unified
Changes since revision 1.10: +3 -3 lines
Register keyword removal

Approved by: Matt Dillon

Revision 1.10: download - view: text, markup, annotated - select for diffs
Sat Jul 19 21:14:39 2003 UTC (11 years, 3 months ago) by dillon
Branches: MAIN
Diff to: previous 1.9: preferred, unified
Changes since revision 1.9: +11 -13 lines
Remove the priority part of the priority|flags argument to tsleep().  Only
flags are passed now.  The priority was a user scheduler thingy that is not
used by the LWKT subsystem.  For process statistics assume sleeps without
P_SINTR set to be disk-waits, and sleeps with it set to be normal sleeps.

This commit should not contain any operational changes.

Revision 1.9: download - view: text, markup, annotated - select for diffs
Sun Jul 6 21:23:51 2003 UTC (11 years, 3 months ago) by dillon
Branches: MAIN
Diff to: previous 1.8: preferred, unified
Changes since revision 1.8: +2 -2 lines
MP Implementation 1/2: Get the APIC code working again, sweetly integrate the
MP lock into the LWKT scheduler, replace the old simplelock code with
tokens or spin locks as appropriate.  In particular, the vnode interlock
(and most other interlocks) are now tokens.  Also clean up a few curproc/cred
sequences that are no longer needed.

The APs are left in a degenerate state with non-IPI interrupts disabled, as
additional LWKT work must be done before we can really make use of them,
and FAST interrupts are not managed by the MP lock yet.  The main thing
for this stage was to get the system working with an APIC again.

buildworld tested on UP and 2xCPU/MP (Dell 2550)

Revision 1.8: download - view: text, markup, annotated - select for diffs
Thu Jul 3 17:24:02 2003 UTC (11 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.7: preferred, unified
Changes since revision 1.7: +5 -4 lines
Split the struct vmmeter cnt structure into a global vmstats structure and
a per-cpu cnt structure.  Adjust the sysctls to accumulate statistics
over all cpus.

Revision 1.7: download - view: text, markup, annotated - select for diffs
Fri Jun 27 01:53:25 2003 UTC (11 years, 4 months ago) by dillon
Branches: MAIN
CVS tags: PRE_MP
Diff to: previous 1.6: preferred, unified
Changes since revision 1.6: +2 -2 lines
proc->thread stage 6: kernel threads now create processless LWKT threads.
A number of obvious curproc cases were removed; tsleep/wakeup was made to
work with threads (wmesg, ident, and timeout features moved to threads).
There are probably a few curproc cases left to fix.

Revision 1.6: download - view: text, markup, annotated - select for diffs
Thu Jun 26 20:27:51 2003 UTC (11 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.5: preferred, unified
Changes since revision 1.5: +16 -9 lines
Clean up some odd uses of curproc.  Remove PHOLD/PRELE around physical I/O
(our UPAGES can no longer be swapped out, and if they eventually are made
swappable again it will only be while the thread is sleeping on a
particular address).

Also move the inblock/oublock accounting into vfs_busy_pages(), allowing
us to remove additional curproc references from various filesystem code.
This also makes inblock/oublock more consistent.

Revision 1.5: download - view: text, markup, annotated - select for diffs
Thu Jun 26 05:55:14 2003 UTC (11 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.4: preferred, unified
Changes since revision 1.4: +4 -31 lines
proc->thread stage 5:  BUF/VFS clearance!  Remove the ucred argument from
vop_close, vop_getattr, vop_fsync, and vop_createvobject.  These VOPs can
be called from multiple contexts so the cred is fairly useless, and UFS
ignores it anyway.  For filesystems (like NFS) that sometimes need a cred
we use proc0.p_ucred for now.

This removal also removed the need for a 'proc' reference in the related
VFS procedures, which greatly helps our proc->thread conversion.

bp->b_wcred and bp->b_rcred have also been removed, and for the same reason.
It makes no sense to have a particular cred when multiple users can
access a file.  This may create issues with certain types of NFS mounts
but if it does we will solve them in a way that doesn't pollute the
struct buf.

Revision 1.4: download - view: text, markup, annotated - select for diffs
Sun Jun 22 17:39:42 2003 UTC (11 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.3: preferred, unified
Changes since revision 1.3: +7 -7 lines
proc->thread stage 1: change the kproc_*() API to take and return threads.
Note: we won't be able to turn off the underlying proc until we have a
clean thread path all the way through, which ain't now.

Revision 1.3: download - view: text, markup, annotated - select for diffs
Thu Jun 19 01:55:06 2003 UTC (11 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.2: preferred, unified
Changes since revision 1.2: +3 -1 lines
thread stage 5: Separate the inline functions out of sys/buf.h, creating
sys/buf2.h (a methodology that will continue as time passes).  This solves
inline vs struct ordering problems.

Do a major cleanup of the globaldata access methodology.  Create a
gcc-cacheable 'mycpu' macro & inline to access per-cpu data.  Atomicity is
not required because we will never change cpus out from under a thread,
even if it gets preempted by an interrupt thread, because we want to be
able to implement per-cpu caches that do not require locked bus cycles or
special instructions.
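
One plausible i386-style shape of the accessor (illustrative only; the
real definition lives in machine-dependent headers):

    /*
     * Per-cpu data is reachable through a segment register.  Since a
     * thread never changes cpus while it runs -- even across interrupt
     * preemption -- gcc may safely cache the result in a register.
     */
    static __inline struct globaldata *
    _get_mycpu(void)
    {
            struct globaldata *gd;

            __asm ("movl %%fs:globaldata,%0" : "=r" (gd));
            return (gd);
    }
    #define mycpu   _get_mycpu()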

Revision 1.2: download - view: text, markup, annotated - select for diffs
Tue Jun 17 04:28:41 2003 UTC (11 years, 4 months ago) by dillon
Branches: MAIN
Diff to: previous 1.1: preferred, unified
Changes since revision 1.1: +1 -0 lines
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids.  Most
ids have been removed from !lint sections and moved into comment sections.

Revision 1.1: download - view: text, markup, annotated - select for diffs
Tue Jun 17 02:55:07 2003 UTC (11 years, 4 months ago) by dillon
Branches: MAIN
CVS tags: FREEBSD_4_FORK
import from FreeBSD RELENG_4 1.242.2.20
