MFC numerous features from HEAD. * Bounce buffer fixes for physio. * Disk flush support in scsi and nata subsystems. * Dead bio handling
Remove a useless assignment and two unused variables. Found-by: LLVM/Clang Static Analyzer
Implement a bounce buffer for physio if the buffer passed from userland is not at least 16-byte aligned. Reported-by: "Steve O'Hara-Smith" <firstname.lastname@example.org>, and others
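The alignment test behind this change can be sketched in userland C. This is a minimal illustration, not the kernel's physio code: the names `sk_needs_bounce`, `sk_bounce_alloc`, and `SK_PHYSIO_ALIGN` are hypothetical, and only the 16-byte requirement comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical name; the 16-byte figure is taken from the commit text. */
#define SK_PHYSIO_ALIGN 16u

/* True when the user-supplied address is not 16-byte aligned. */
static bool
sk_needs_bounce(const void *uaddr)
{
    return ((uintptr_t)uaddr & (SK_PHYSIO_ALIGN - 1)) != 0;
}

/*
 * Allocate an aligned staging buffer and copy the caller's data into it
 * (the write direction). The caller frees the result with free().
 */
static void *
sk_bounce_alloc(const void *uaddr, size_t len)
{
    size_t rounded;
    void *kbuf;

    assert(uaddr != NULL && len > 0);
    /* aligned_alloc() wants a size that is a multiple of the alignment. */
    rounded = (len + SK_PHYSIO_ALIGN - 1) & ~(size_t)(SK_PHYSIO_ALIGN - 1);
    kbuf = aligned_alloc(SK_PHYSIO_ALIGN, rounded);
    if (kbuf != NULL)
        memcpy(kbuf, uaddr, len);
    return kbuf;
}
```

A caller would test `sk_needs_bounce()` first and only pay for the copy when the address is misaligned.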
MFC 1.113 - buffer cache adjustments for handling write errors.
Make some adjustments to the buffer cache: * Retain B_ERROR instead of clearing it. * Change B_ERROR's behavior. It no longer causes the buffer to be invalidated on write. * Change B_NOCACHE's behavior. It no longer causes the buffer to be invalidated while the buffer is marked dirty. * Code that was supposed to re-dirty a failed write buffer in brelse() was not running because biodone() cleared the fields brelse() was testing. Move the code to biodone(). * When attempting to reflush B_DELWRI|B_ERROR'd buffers, sleep a tick to try to avoid a live-lock.
Kernel support for HAMMER: * Add another type to the bio->bio_caller_info1 union * Add two new flags to getblk(), used by the cluster code. GETBLK_SZMATCH - Tell getblk() to fail and return NULL if a pre-existing buffer's size does not match the requested size (this prevents getblk() from doing a potentially undesired bwrite() sequence). GETBLK_NOWAIT - Tell getblk() to use a non-blocking lock. * pop_bio() now returns the previous BIO (or NULL if there is no previous BIO). This allows HAMMER to chain bio_done()'s * Fix a bug in cluster_read(). The cluster code's read-ahead at the end could go past the caller-specified limit and force a block to the wrong block size.
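The effect of the two new getblk() flags on a buffer that already exists can be sketched as follows. Only the flag names come from the commit message; the flag values, `struct sk_buf`, and `sk_getblk_cached()` are hypothetical, and the sketch does not model sleeping on a held lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values; only the flag names appear in the commit text. */
#define GETBLK_SZMATCH 0x01
#define GETBLK_NOWAIT  0x02

struct sk_buf {
    int size;    /* current buffer size in bytes */
    int locked;  /* non-zero while another holder owns the buffer */
};

/*
 * Return the pre-existing buffer, or NULL when a flag tells the lookup
 * to give up instead of blocking or resizing.
 */
static struct sk_buf *
sk_getblk_cached(struct sk_buf *bp, int size, int blkflags)
{
    assert(bp != NULL);
    /* The sketch cannot sleep, so a held lock requires GETBLK_NOWAIT. */
    assert(!bp->locked || (blkflags & GETBLK_NOWAIT));

    if ((blkflags & GETBLK_NOWAIT) && bp->locked)
        return NULL;            /* would have needed a blocking lock */
    if ((blkflags & GETBLK_SZMATCH) && bp->size != size)
        return NULL;            /* avoid the resize and its possible bwrite() */
    bp->locked = 1;
    bp->size = size;            /* without SZMATCH the buffer is resized */
    return bp;
}
```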
Cleanup - move a warning so it doesn't spam the screen so much, cleanup some syntax.
UFS+softupdates can build up thousands of dirty 1K buffers and run out of buffers before it even hits the lodirtybufspace point. The buf_daemon is never triggered. This case occurs rarely but can be triggered e.g. by a cvs update. Add dirtybufcount back in and flush if it exceeds (nbuf / 2) to handle this degenerate case. Reported-by: "Sepherosa Ziehau" <email@example.com>
Fix numerous pageout daemon -> buffer cache deadlocks in the main system. These issues usually only occur on systems with small amounts of ram but it is possible to trigger them on any system. * Get rid of the IO_NOBWILL hack. Just have the VN device use IO_DIRECT, which will clean out the buffer on completion of the write. * Add a timeout argument to vm_wait(). * Add a thread->td_flags flag called TDF_SYSTHREAD. kmalloc()'s made from designated threads are allowed to dip into the system reserve when allocating pages. Only the pageout daemon and buf_daemon[_hw] use the flag. * Add a new static procedure, recoverbufpages(), which explicitly tries to free buffers and their backing pages on the clean queue. * Add a new static procedure, bio_page_alloc(), to do all the nasty work of allocating a page on behalf of a buffer cache buffer. This function will call vm_page_alloc() with VM_ALLOC_SYSTEM to allow it to dip into the system reserve. If the allocation fails this function will call recoverbufpages() to try to recycle VM pages from clean buffer cache buffers, and will then attempt to reallocate using VM_ALLOC_SYSTEM | VM_ALLOC_INTERRUPT to allow it to dip into the interrupt reserve as well. Warnings will blare on the console. If the effort still fails we sleep for 1/20 of a second and retry. The idea though is for all the effort above to not result in a failure at the end. Reported-by: Gergo Szakal <firstname.lastname@example.org>
Fix a buf_daemon performance issue when running on machines with small amounts of ram. The daemon was hitting a 1/2 sleep case that it should not have been hitting.
Fix hopefully all possible deadlocks that can occur when mixed block sizes are used with the buffer cache. The fix is simply to base the limiting and flushing code on a byte count rather than a buffer count. This will allow UFS to utilize a greater number of dirty buffers and will cause HAMMER to use fewer. This also makes tuning the buffer cache a whole lot easier.
Replace the bwillwrite() subsystem to make it more fair to processes. * Add new API functions, bwillread(), bwillwrite(), bwillinode() which the kernel calls when it intends to read, write, or make inode modifications. * Redo the backend. Add bd_heatup() and bd_wait(). bd_heatup() heats up the buf_daemon, starting it flushing before we hit any blocking conditions (similar to the previous algorithm). * The new bwill*() blocking functions no longer introduce escalating delays to keep the number of dirty buffers under control. Instead they take a page from HAMMER and estimate the load caused by the caller, then wait for a specific number of dirty buffers to complete their write I/O's before returning. If the buffers can be retired quickly these functions will return more quickly.
Miscellaneous performance adjustments to the kernel * Add an argument to VOP_BMAP so VFSs can discern the type of operation the BMAP is being done for. * Normalize the variable name denoting the blocksize to 'blksize' in vfs_cluster.c. * Fix a bug in the cluster code where a stale bp->b_error could wind up getting returned when B_ERROR is not set. * Do not B_AGE cluster bufs. * Pass the block size to both cluster_read() and cluster_write() instead of those routines getting the block size from vp->v_mount->mnt_stat.f_iosize. This allows different areas of a file to use a different block size. * Properly initialize bp->b_bio2.bio_offset to doffset in cluster_read(). This fixes an issue where VFSs were making an extra, unnecessary call to BMAP. * Do not recycle vnodes on the free list until numvnodes has reached desiredvnodes. Vnodes were being recycled when their resident page count had dropped to zero, but this is actually too early as the VFS may cache important information in the vnode that would otherwise require a number of I/O's to re-acquire. This mainly helps HAMMER (whose inode lookups are fairly expensive). * Do not VAGE vnodes. * Remove the minvnodes test. There is no reason not to load the vnode cache all the way through to its max. * buf_cmd_t visibility for the new BMAP argument.
Reimplement B_AGE. Have it cycle the buffer in the queue twice instead of placing buffers at the head of the queue (which causes them to be run-down backwards). Leave B_AGE set through the write cycle and have the bufdaemon set the flag when flushing dirty buffers. B_AGE no longer affects the ordering of the actual write and is allowed to slide through to the clean queue when the write completes.
Change bwillwrite() to smooth out performance under heavy loads. Blocking based on strict hysteresis was being used to try to gang flushes together but filesystems can still blow out the buffer cache and cause processes to block for long periods of time waiting for the dirty count to drop significantly. Instead, as the number of dirty buffers exceeds the desired maximum bwillwrite() imposes a dynamic delay which increases as the number of dirty buffers increases. This improves the stall behavior under heavy loads and keeps the system responsive. TODO: The algorithm needs to have a per-LWP heuristic to penalize heavy writers more than light ones.
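A delay that grows with the dirty-buffer excess might look like the sketch below. The linear formula, the clamp, and the name `sk_bwillwrite_delay()` are assumptions chosen for illustration; the commit message does not state the actual scaling used.

```c
#include <assert.h>

/*
 * Return a delay in ticks: zero at or below the desired maximum, then
 * growing linearly with the excess and clamped at max_delay.
 */
static int
sk_bwillwrite_delay(int numdirty, int hidirty, int max_delay)
{
    long delay;

    assert(hidirty > 0 && max_delay >= 0);
    if (numdirty <= hidirty)
        return 0;
    delay = (long)(numdirty - hidirty) * max_delay / hidirty;
    return delay > max_delay ? max_delay : (int)delay;
}
```

Compared with strict hysteresis, a writer that is slightly over the limit pays a short delay instead of blocking until the dirty count drops significantly.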
Fix many bugs and issues in the VM system, particularly related to heavy paging. * (cleanup) PG_WRITEABLE is now set by the low level pmap code and not by high level code. It means 'This page may contain a managed page table mapping which is writeable', meaning that hardware can dirty the page at any time. The page must be tested via appropriate pmap calls before being disposed of. * (cleanup) PG_MAPPED is now handled by the low level pmap code and only applies to managed mappings. There is still a bit of cruft left over related to the pmap code's page table pages but the high level code is now clean. * (bug) Various XIO, SFBUF, and MSFBUF routines which bypass normal paging operations were not properly dirtying pages when the caller intended to write to them. * (bug) vfs_busy_pages in kern/vfs_bio.c had a busy race. Separate the code out to ensure that we have marked all the pages as undergoing IO before we call vm_page_protect(). vm_page_protect(... VM_PROT_NONE) can block under very heavy paging conditions and if the pages haven't been marked for IO that could blow up the code. * (optimization) Make a minor optimization. When busying pages for write IO, downgrade the page table mappings to read-only instead of removing them entirely. * (bug) In platform/pc32/i386/pmap.c fix various places where pmap_inval_add() was being called at the wrong point. Only one was critical, in pmap_enter(), where pmap_inval_add() was being called so far away from the pmap entry being modified that it could wind up being flushed out prior to the modification, breaking the cpusync required. pmap.c also contains most of the work involved in the PG_MAPPED and PG_WRITEABLE changes. * (bug) Close numerous pte updating races with hardware setting the modified bit. There is still one race left (in pmap_enter()). * (bug) Disable pmap_copy() entirely. Fix most of the bugs anyway, but there is still one left in the handling of the srcmpte variable. 
* (cleanup) Change vm_page_dirty() from an inline to a real procedure, and move the code which set the object to writeable/maybedirty into vm_page_dirty(). * (bug) Calls to vm_page_protect(... VM_PROT_NONE) can block. Fix all cases where this call was made with a non-busied page. All such calls are now made with a busied page, preventing blocking races from re-dirtying or remapping the page unexpectedly. (Such blockages could only occur during heavy paging activity where the underlying page table pages are being actively recycled). * (bug) Fix the pageout code to properly mark pages as undergoing I/O before changing their protection bits. * (bug) Busy pages undergoing zeroing or partial zeroing in the vnode pager (vm/vnode_pager.c) to avoid unexpected effects.
Keep track of the number of buffers undergoing IO, and include that number in calculations involving numdirtybuffers. This prevents the kernel from believing that there are only a few dirty buffers when, in fact, all the dirty buffers are running IOs.
Add some assertions when a buffer is reused
Fix some IO sequencing performance issues and reformulate the strategy we use to deal with potential buffer cache deadlocks. Generally speaking try to remove roadblocks in the vn_strategy() path. * Remove buf->b_tid (HAMMER no longer needs it) * Replace IO_NOWDRAIN with IO_NOBWILL, requesting that bwillwrite() not be called. Used by VN to try to avoid deadlocking. Remove B_NOWDRAIN. * No longer block in bwrite() or getblk() when we have a lot of dirty buffers. getblk() in particular needs to be callable by filesystems to drain dirty buffers and we don't want to deadlock. * Improve bwillwrite() by having it wake up the buffer flusher at 1/2 the dirty buffer limit but not block, and then block if the limit is reached. This should smooth out flushes during heavy filesystem activity.
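The two-threshold behavior described for bwillwrite() in the last item can be sketched as a pure decision function. The enum and function names are hypothetical; the thresholds (wake the flusher at half the limit, block at the limit) are the ones stated in the commit message.

```c
#include <assert.h>

enum sk_action {
    SK_PROCEED,       /* below half the limit: nothing to do */
    SK_WAKE_FLUSHER,  /* at or above half: start flushing, do not block */
    SK_BLOCK          /* at or above the limit: caller waits */
};

static enum sk_action
sk_bwillwrite(int numdirty, int limit)
{
    assert(limit > 0 && numdirty >= 0);
    if (numdirty >= limit)
        return SK_BLOCK;
    if (numdirty >= limit / 2)
        return SK_WAKE_FLUSHER;
    return SK_PROCEED;
}
```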
HAMMER 30C/many: Fix more TID synchronization issues * Properly zero-out b_tid in getnewbuf so a buffer does not get an old stale (and possibly duplicate) b_tid. * A b_tid assignment was missing in the truncation case, causing an assertion. * Panic instead of warn when we find a duplicate record in the B-Tree.
Fix spurious "softdep_deallocate_dependencies: dangling deps" panic occurring in low-memory conditions. Add assertion to catch similar bugs automagically. Reported-by: Peter Avalos <email@example.com> Reviewed-by: Matthew Dillon <firstname.lastname@example.org>
Fix buffer cache deadlocks by splitting dirty buffers into two categories: Light weight dirty buffers and heavy weight dirty buffers. Add a second buffer cache flushing daemon to deal with the heavy weight dirty buffers. Currently only HAMMER uses the new feature, but it can also easily be used by UFS in the future. Buffer cache deadlocks can occur in low memory situations where the buffer cache tries to flush out dirty buffers and deadlocks when the act of flushing a dirty buffer requires additional buffers to be acquired. Because there was only one buffer flushing daemon, a deadlock on a heavy weight buffer prevented any further buffer flushes, whether light or heavy weight, and wound up deadlocking the entire system. Giving the heavy weight buffers their own daemon solves the problem by allowing light weight buffers to continue to be flushed even if a stall occurs on a heavy weight buffer. The number of dirty heavy weight buffers is limited to ensure that enough light weight buffers are available. This is primarily implemented by changing getblk()'s mostly unused slpflag parameter to a new blkflags parameter and adding a new buffer cache queue called BQUEUE_DIRTY_HW.
Add bio_ops->io_checkread and io_checkwrite - a read and write pre-check which gives HAMMER a chance to set B_LOCKED if the kernel wants to write out a passively held buffer. Change B_LOCKED semantics slightly. B_LOCKED buffers will not be written until B_LOCKED is cleared. This allows HAMMER to hold off B_DELWRI writes on passively held buffers.
Add regetblk() - reacquire a buffer lock. The buffer must be B_LOCKED or must be interlocked with bio_ops. Used by HAMMER. Further changes to B_LOCKED buffers. A B_LOCKED|B_DELWRI buffer will be placed on the dirty queue and then returned to the locked queue once the I/O completes. That is, B_LOCKED does not interfere with B_DELWRI operation.
Convert the global 'bioops' into per-mount bio_ops. For now we also have to have a per buffer b_ops as well since the controlling filesystem cannot be located from information in struct buf (b_vp could be the backing store so that can't be used). This change allows HAMMER to use bio_ops. Change the ordering of the bio_ops.io_deallocate call so it occurs before the buffer's B_LOCKED is checked. This allows the deallocate call to set B_LOCKED to retain the buffer in situations where the target filesystem is unable to immediately disassociate the buffer. Also keep VMIO intact for B_LOCKED buffers (in addition to B_DELWRI buffers). HAMMER will use this feature to keep buffers passively associated with other filesystem structures and thus be able to avoid constantly brelse()ing and getblk()ing them.
Remove the vpp (returned underlying device vnode) argument from VOP_BMAP(). VOP_BMAP() may now only be used to determine linearity and clusterability of the blocks underlying a filesystem object. The meaning of the returned block number (other than being contiguous as a means of indicating linearity or clusterability) is now up to the VFS. This removes visibility into the device(s) underlying a filesystem from the rest of the kernel.
Fix numerous spelling mistakes.
Use SYSREF to reference count struct vnode. v_usecount is now v_sysref(.refcnt). v_holdcnt is now v_auxrefs. SYSREF's termination state (using a negative reference count from -0x40000000+) now places the vnode in a VCACHED or VFREE state and deactivates it. The vnode is now assigned a 64 bit unique id via SYSREF. vhold() (which manipulates v_auxrefs) no longer reactivates a vnode and is explicitly used only to track references from auxiliary structures and references to prevent premature destruction of the vnode. vdrop() will now only move a vnode from VCACHED to VFREE on the 1->0 transition of v_auxrefs if the vnode is in a termination state. vref() will now panic if used on a vnode in a termination state. vget() must now be used to explicitly reactivate a vnode. These requirements existed before but are now explicitly asserted. vlrureclaim() and allocvnode() should now interact a bit better. In particular, vlrureclaim() will do a better job of finding vnodes to flush and transition from VCACHED to VFREE, and allocvnode() will do a better job finding vnodes to reuse without getting blocked by a flush. allocvnode now uses a real VX lock to sequence vnodes into VRECLAIMED. All vnode special state processing now uses a VX lock. Vnodes are now able to be slowly returned to the memory pool when kern.maxvnodes is reduced at run time. Various initialization elements have been moved to CTOR/DTOR and are no longer in the critical path, improving performance. However, since SYSREF uses atomic_cmpset_int() (aka cmpxchgl), which reduces performance somewhat, overall performance tends to be about the same.
Add missing link options to export global symbols to the _DYNAMIC section, allowing the kernel namelist functions to operate. For now just make certain static variables global instead of using linker magic to export static variables. Add infrastructure to allow out of band kernel memory to be accessed. The virtual kernel's memory map does not include the virtual kernel executable or data areas. vmstat, systat, pstat, and netstat now work with virtual kernels.
Rewrite vmapbuf() to use vm_fault_page_quick() instead of vm_fault_quick(). Overhead is slightly increased (until we can optimize vm_fault_page_quick()), but the code is greatly simplified.
1:1 Userland threading stage 2.10/4: Separate p_stats into p_ru and lwp_ru. proc.p_ru keeps track of all statistics directly related to a proc. This consists of RSS usage and nswap information and aggregate numbers for all former lwps of this proc. proc.p_cru is the sum of all stats of reaped children. lwp.lwp_ru contains the stats directly related to one specific lwp, meaning packet, scheduler switch or page fault counts, etc. This information gets added to lwp.lwp_proc.p_ru when the lwp exits.
Correct a conditional used to detect a panic situation. The index was off by one.
Make kernel_map, buffer_map, clean_map, exec_map, and pager_map direct structural declarations instead of pointers. Clean up all related code, in particular kmem_suballoc(). Remove the offset calculation for kernel_object. kernel_object's page indices used to be relative to the start of kernel virtual memory in order to improve the performance of VM page scanning algorithms. The optimization is no longer needed now that VM objects use Red-Black trees. Removal of the offset simplifies a number of calculations and makes the code more readable.
Introduce globals: KvaStart, KvaEnd, and KvaSize. Used by the kernel instead of the nutty VADDR and VM_*_KERNEL_ADDRESS macros. Move extern declarations for these variables as well as for virtual_start, virtual_end, and phys_avail from MD headers to MI headers. Make kernel_object a global structure instead of a pointer. Remove kmem_object and all related code (none of it is used any more).
Ansify function declarations and fix some minor style issues. In-collaboration-with: Alexey Slynko <email@example.com>
Rename printf -> kprintf in sys/ and add some defines where necessary (files which are used in userland, too).
Remove the last bits of code that stored mount point linkages in vnodes. Mount point linkages are now ENTIRELY a function of the namecache topology, made possible by DragonFly's advanced namecache. This fixes a number of problems with NULLFS and adds two major features to our NULLFS mounting capabilities. NULLFS mounting paths NO LONGER NEED TO BE DISTINCT. For example, you can now safely do things like 'mount_null -o ro / /fubar/jail1' without creating a recursion and you can now create SUB-MOUNTS within nullfs mounts, such as 'mount_null -o ro /usr /fubar/jail1/usr', without creating problems in the original master partitions. The result is that NULLFS can now be used to glue arbitrary pieces of filesystems together using a mixture of read-only and read-write NULLFS mounts for situations where localhost NFS mounts had to be used before. Jail or chroot construction is now utterly trivial. With-input-from: Joerg Sonnenberger <firstname.lastname@example.org>
Move flag(s) representing the type of vm_map_entry into its own vm_maptype_t type. This is a precursor to adding a new VM mapping type for virtualized page tables.
Rename malloc->kmalloc, free->kfree, and realloc->krealloc. Pass 1
Correct typo in comment
Add some diagnostic messages to try to catch a ufs_dirbad panic before it happens. MFC: Reorder BUF_UNLOCK() - it must occur after b_flags is modified, not before. A newly created non-VMIO buffer is now marked B_INVAL. Callers of getblk() now always clear B_INVAL before issuing a READ I/O or when clearing or overwriting the buffer. Before this change, a getblk() (getnewbuf), brelse(), getblk() sequence on a non-VMIO buffer would result in a buffer with B_CACHE set yet containing uninitialized data. MFC: B_NOCACHE cannot be set on a clean VMIO-backed buffer as this will destroy the VM backing store, which might be dirty. MFC: Reorder vnode_pager_setsize() calls to close a race condition.
Mark various forms of read() and write() MPSAFE. Note that the MP lock is still acquired, but now it's a lot deeper in the fileops. Mark dup(), dup2(), close(), closefrom(), and fcntl() MPSAFE. Some code paths don't have to get the MP lock, but most still do deeper into the fileops.
Fix several buffer cache issues related to B_NOCACHE. * Do not set B_NOCACHE when calling vinvalbuf(... V_SAVE). This will destroy dirty VM backing store associated with clean buffers before the VM system has a chance to check for and flush them. Taken-from: FreeBSD * Properly set B_NOCACHE when destroying buffers related to truncated data. * Fix a bug in vnode_pager_setsize() that was recently introduced. v_filesize was being set before a new/old size comparison, causing a file truncation to not destroy related VM pages past the new EOF. * Remove a bogus B_NOCACHE|B_DIRTY test in brelse(). This was originally intended to be a B_NOCACHE|B_DELWRITE test which then cleared B_NOCACHE, but now that B_NOCACHE operation has been fixed it really does indicate that the buffer, its contents, and its backing store are to be destroyed, even if the buffer is marked B_DELWRI. Instead of clearing B_NOCACHE when B_DELWRITE is found to be set, clear B_DELWRITE when B_NOCACHE is found to be set. Note that B_NOCACHE is still cleared when bdirty() is called in order to ensure that data is not lost when softupdates and other code do a 'B_NOCACHE + bwrite' sequence. Softupdates can redirty a buffer in its io completion hook and a write error can also redirty a buffer. * The VMIO buffer rundown seems to have morphed into a state where the distinction between NFS and non-NFS buffers can be removed. Remove the test.
We have to use pmap_extract() here. If we lose a race against page table cleaning pmap_kextract() could choke on a missing page directory.
Remove VOP_BWRITE(). This function provided a way for a VFS to override the bwrite() function and was used *only* by NFS in order to allow NFS to handle the B_NEEDCOMMIT flag as part of NFSv3's 2-phase commit operation. However, over time, the handling of this flag was moved to the strategy code. Additionally, the kernel now fully supports the redirtying of buffers during an I/O (which both softupdates and NFS need to be able to do). The override is no longer needed. All former calls to VOP_BWRITE() now simply call bwrite().
Cleanup procedure prototypes, get rid of extra spaces in pointer decls.
Block devices generally truncate the size of I/O requests which go past EOF. This is exactly what we want when manually reading or writing a block device such as /dev/ad0s1a, but is not desired when a VFS issues I/O ops on filesystem buffers. In such cases, any EOF condition must be considered an error. Implement a new filesystem buffer flag B_BNOCLIP, which getblk() and friends automatically set. If set, block devices are guaranteed to return an error if the I/O request is at EOF or would otherwise have to be clipped to EOF. Block devices further guarantee that b_bcount will not be modified when this flag is set. Adjust all block device EOF checks to use the new flag, and clean up the code while I'm there. Also, set b_resid in a couple of degenerate cases where it was not being set.
- Clarify the definitions of b_bufsize, b_bcount, and b_resid. - Remove unnecessary assignments based on the clarified fields. - Add additional checks for premature EOF. b_bufsize is only used by buffer management entities such as getblk() and other vnode-backed buffer handling procedures. b_bufsize is not required for calls to vn_strategy() or dev_dstrategy(). A number of other subsystems use it to track the original request size. b_bcount is the I/O request size, but b_bcount is allowed to be truncated by the device chain if the request encompasses EOF (such as on a raw disk device). A caller which needs to record the original buffer size versus the EOF-truncated buffer can compare b_bcount after the I/O against a recorded copy of the original request size. This copy can be recorded in b_bufsize for unmanaged buffers (malloced or getpbuf()'d buffers). b_resid is always relative to b_bcount, not b_bufsize. A successful read that is truncated to the device EOF will thus have a b_resid of 0 and a truncated b_bcount.
Remove buf->b_saveaddr, assert that vmapbuf() is only called on pbuf's. Pass the user pointer and length to vmapbuf() rather than having it try to pull the information out of the buffer. vmapbuf() is now responsible for setting b_data, b_bufsize, and b_bcount. Also fix a bug in cam_periph_mapmem(). The procedure was failing to unmap earlier vmapped bufs if later vmapbuf() calls in the loop failed.
The pbuf subsystem now initializes b_kvabase and b_kvasize at startup and no longer reinitializes these fields in initpbuf(). Users of getpbuf() may no longer modify b_kvabase or b_kvasize. b_data may still be modified.
Remove b_xflags. Fold BX_VNCLEAN and BX_VNDIRTY into b_flags as B_VNCLEAN and B_VNDIRTY. Remove BX_AUTOCHAINDONE and recode the swap pager to use one of the caller data fields in the BIO instead.
Replace the buffer cache's B_READ, B_WRITE, B_FORMAT, and B_FREEBUF b_flags with a separate b_cmd field. Use b_cmd to test for I/O completion as well (getting rid of B_DONE in the process). This further simplifies the setup required to issue a buffer cache I/O. Remove a redundant header file, bus/isa/i386/isa_dma.h and merge any discrepancies into bus/isa/isavar.h. Give ISADMA_READ/WRITE/RAW their own independent flag definitions instead of trying to overload them on top of B_READ, B_WRITE, and B_RAW. Add a routine isa_dmabp() which takes a struct buf pointer and returns the ISA dma flags associated with the operation. Remove the 'clear_modify' argument to vfs_busy_pages(). Instead, vfs_busy_pages() asserts that the buffer's b_cmd is valid and then uses it to determine the action it must take.
Get rid of pbgetvp() and pbrelvp(). Instead fold the B_PAGING flag directly into getpbuf() (the only type of buffer that pbgetvp() could be called on anyway). Change related b_flags assignments from '=' to '|='. Get rid of remaining dependencies on b_vp. vn_strategy() now relies solely on the vp passed to it as an argument. Remove buffer cache code that sets b_vp for anonymous pbuf's. Add a stopgap 'vp' argument to vfs_busy_pages(). This is only really needed by NFS and the clustering code due to the severely hackish nature of the NFS and clustering code. Fix a bug in the ext2fs inode code where vfs_busy_pages() was being called on B_CACHE buffers. Add an assertion to vfs_busy_pages() to panic if it encounters a B_CACHE buffer.
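Replacing several mutually exclusive flag bits with one command field can be sketched like this. The enumerator names and `struct sk_cbuf` are hypothetical; the sketch only shows the idea from the commit message that a single field carries the operation and that the idle value doubles as the completion test.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical command values; zero means no I/O is in progress. */
typedef enum {
    SK_CMD_DONE = 0,
    SK_CMD_READ,
    SK_CMD_WRITE,
    SK_CMD_FORMAT,
    SK_CMD_FREEBUF
} sk_cmd_t;

struct sk_cbuf {
    sk_cmd_t b_cmd;
};

/* Issue an I/O: exactly one command can be active at a time. */
static void
sk_start_io(struct sk_cbuf *bp, sk_cmd_t cmd)
{
    assert(bp->b_cmd == SK_CMD_DONE);
    assert(cmd != SK_CMD_DONE);
    bp->b_cmd = cmd;
}

/* Completion resets the field, which replaces a separate "done" flag. */
static void
sk_complete_io(struct sk_cbuf *bp)
{
    bp->b_cmd = SK_CMD_DONE;
}

static bool
sk_io_pending(const struct sk_cbuf *bp)
{
    return bp->b_cmd != SK_CMD_DONE;
}
```

Unlike independent flag bits, the enum cannot represent a buffer that is simultaneously marked for read and write.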
Get rid of the remaining buffer background bitmap code. It's been turned off for a while, and it represents a fairly severe hack to the buffer cache code that just complicates further development.
Remove the buffer cache's B_PHYS flag. This flag was originally used as part of a severe hack to treat buffers containing 'user' addresses differently, in particular by using b_offset instead of b_blkno. Now that buffer cache buffers only HAVE b_offset (b_*blkno is gone for good), there is literally no difference between B_PHYS I/O and non-B_PHYS I/O once the buffer has been handed off to the device.
Move most references to the buffer cache array (buf) to kern/vfs_bio.c. Implement a procedure which scans all buffers, called scan_all_buffers(). Cleanup unused debugging code referencing buf.
If softupdates or some other entity re-dirties a buffer, make sure that B_NOCACHE is cleared to prevent the buffer from being discarded. Add printfs to warn if the situation is encountered. Fix a bug in brelse() where a buffer's flags were being modified after the unlock instead of before.
MFC vfs_bio.c 1.57, vfs_subr.c 1.69 - fix race condition in vfs_bio_awrite().
Require that *ALL* vnode-based buffer cache ops be backed by a VM object. No exceptions. Start simplifying the getblk() based on the new requirements.
Remove VOP_GETVOBJECT, VOP_DESTROYVOBJECT, and VOP_CREATEVOBJECT. Rearrange the VFS code such that VOP_OPEN is now responsible for associating a VM object with a vnode. Add the vinitvmio() helper routine.
Major BUF/BIO work commit. Make I/O BIO-centric and specify the disk or file location with a 64 bit offset instead of a 32 bit block number. * All I/O is now BIO-centric instead of BUF-centric. * File/Disk addresses universally use a 64 bit bio_offset now. bio_blkno no longer exists. * Stackable BIO's hold disk offset translations. Translations are no longer overloaded onto a single structure (BUF or BIO). * bio_offset == NOOFFSET is now universally used to indicate that a translation has not been made. The old (blkno == lblkno) junk has all been removed. * There is no longer a distinction between logical I/O and physical I/O. * All driver BUFQs have been converted to BIOQs. * BMAP, FREEBLKS, getblk, bread, breadn, bwrite, inmem, cluster_*, and findblk all now take and/or return 64 bit byte offsets instead of block numbers. Note that BMAP now returns a byte range for the before and after variables.
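The move from block numbers to byte offsets can be illustrated with a conversion helper. `SK_NOOFFSET`, its value, and the function names are assumptions for the sketch; the commit message states only that a NOOFFSET marker indicates a translation that has not been made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel value for "no translation has been made yet". */
#define SK_NOOFFSET ((int64_t)-1)

/* Widen before multiplying so the result cannot overflow 32 bits. */
static int64_t
sk_blkno_to_offset(int32_t blkno, int blksize)
{
    assert(blkno >= 0 && blksize > 0);
    return (int64_t)blkno * blksize;
}

static bool
sk_is_translated(int64_t bio_offset)
{
    return bio_offset != SK_NOOFFSET;
}
```

With 512-byte blocks, the largest signed 32-bit block number maps to byte offset 1,099,511,627,264, just under 1 TiB, which is the ceiling that a 64-bit byte offset removes.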
Replace the global buffer cache hash table with a per-vnode red-black tree. Add a B_HASHED b_flags bit as a sanity check. Remove the invalhash junk and replace with assertions in several cases where the buffer must already not be hashed. Get rid of incore() and gbincore() and replace with a new function called findblk(). Merge the new RB management with bgetvp(), the two are now fully integrated. Previous work has turned reassignbuf() into a mostly degenerate call, simplify its arguments and functionality to match. Remove an unnecessary reassignbuf() call from the NFS code. Get rid of pbreassignbuf(). Adjust the code in several places where it was assumed that calling BUF_LOCK() with LK_SLEEPFAIL after previously failing with LK_NOWAIT would always fail. This code was used to sleep before a retry. Instead, if the second lock unexpectedly succeeds, simply issue an unlock and retry anyway. Testing-by: Stefan Krueger <email@example.com>
vfs_bio_awrite() was unconditionally locking a buffer without checking for races, potentially resulting in the wrong buffer, an invalid buffer, or a recently replaced buffer being written out. Change the call semantics to require a locked buffer to be passed into the function rather than locking the buffer in the function.
buftimespinlock is utterly useless since the spinlock is released within lockmgr(). The only real problem was with lk_prio, which no longer exists, so get rid of the spin lock and document the remaining passive races.
Pass LK_PCATCH instead of trying to store tsleep flags in the lock structure, so multiple entities competing for the same lock do not use unexpected flags when sleeping. Only NFS really uses PCATCH with lockmgr locks.
Make the entire BUF/BIO system BIO-centric instead of BUF-centric. Vnode and device strategy routines now take a BIO and must pass that BIO to biodone(). All code which previously managed a BUF undergoing I/O now manages a BIO. The new BIO-centric algorithms allow BIOs to be stacked, where each layer represents a block translation, completion callback, or caller or device private data. This information is no longer overloaded within the BUF. Translation layer linkages remain intact as a 'cache' after I/O has completed. The VOP and DEV strategy routines no longer make assumptions as to which translated block number applies to them. The use the block number in the BIO specifically passed to them. Change the 'untranslated' constant to NOOFFSET (for bio_offset), and (daddr_t)-1 (for bio_blkno). Rip out all code that previously set the translated block number to the untranslated block number to indicate that the translation had not been made. Rip out all the cluster linkage fields for clustered VFS and clustered paging operations. Clustering now occurs in a private BIO layer using private fields within the BIO. Reformulate the vn_strategy() and dev_dstrategy() abstraction(s). These routines no longer assume that bp->b_vp == the vp of the VOP operation, and the dev_t is no longer stored in the struct buf. Instead, only the vp passed to vn_strategy() (and related *_strategy() routines for VFS ops), and the dev_t passed to dev_dstrateg() (and related *_strategy() routines for device ops) is used by the VFS or DEV code. This will allow an arbitrary number of translation layers in the future. Create an independant per-BIO tracking entity, struct bio_track, which is used to determine when I/O is in-progress on the associated device or vnode. NOTE: Unlike FreeBSD's BIO work, our struct BUF is still used to hold the fields describing the data buffer, resid, and error state. Major-testing-by: Stefan Krueger
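The stacking idea, where each layer holds its own translated offset and the layers stay linked after I/O, can be sketched with a simple chain. `struct sk_bio` and the `sk_*` functions are hypothetical and far simpler than the kernel's structures; they only show one translation per layer and a walk back toward the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_bio {
    struct sk_bio *prev;    /* layer above, closer to the original caller */
    int64_t        offset;  /* offset as translated for this layer */
};

/* Link a new layer below cur and record the translated offset in it. */
static struct sk_bio *
sk_push_bio(struct sk_bio *cur, struct sk_bio *next, int64_t translated)
{
    assert(cur != NULL && next != NULL);
    next->prev = cur;
    next->offset = translated;
    return next;
}

/* Step back up one layer; NULL means there is no previous layer. */
static struct sk_bio *
sk_pop_bio(const struct sk_bio *bio)
{
    return bio->prev;
}

/* Walk from the lowest layer to the top, as a completion path might. */
static int
sk_count_layers(struct sk_bio *bio)
{
    int n = 0;

    for (; bio != NULL; bio = sk_pop_bio(bio))
        n++;
    return n;
}
```

Because each layer stores its own offset, the file-relative offset in the top layer is never overwritten by the device-relative offset below it.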
Convert the lockmgr interlock from a token to a spinlock. This fixes a problem on SMP boxes where the MP lock would unexpectedly lose atomicity for a short period of time due to token acquisition. Add a tsleep_interlock() call which takes advantage of tsleep()'s cpu locality of reference to provide a helper function which allows us to atomically spin_unlock() and tsleep() in an MP safe manner with only a critical section. Basically all it does is set a cpumask bit for the ident hash index to cause other cpu's issuing a wakeup to notify our cpu. Any actual wakeup occurring during the race period after the spin_unlock but before the tsleep() call will be delayed by the critical section until after the tsleep has queued the thread. Cleanup some unused junk in vm_map.h.
Temporarily check for and correct a race in getnewbuf() that exists due to the fact that lockmgr locks use tokens for their interlock. The use of a token can cause the atomicity of the big giant lock to be temporarily lost and wind up breaking the assumed atomicity of higher level operations that believed themselves to be safe making lockmgr calls with the LK_NOWAIT flag. The general problem will soon be fixed by changing the lockmgr interlock from a token to one of Jeffrey Hsu's spin locks. Fortunately there are only a few places left in DragonFly where LK_INTERLOCK is used.
Add a missing BUF_UNLOCK in the last commit.
Add two checks for potential buffer cache races. "Warning buffer %p (vp %p lblkno %d) was recycled" Occurs if a buffer is recycled unexpectedly. The code will print this warning and retry if it detects the case. "Warning invalid buffer %p (vp %p lblkno %d) did not have cleared b_blkno cache" Occurs if a B_INVAL buffer's b_blkno cache has not been reset. The code will reset the cache if it detects this case.
Remove the NO_B_MALLOC preprocessor macro; it was never turned on and was just a hindrance. Reviewed-by: Matthew Dillon
Re-word some sysctl descriptions, make them compact.
Move the bswlist symbol into vm/vm_pager.c because PBUFs are the only consumer of the latter. The PBUF abstraction is just a clever hack, this code will be redone at some point so this measure is temporary.
BUF/BIO cleanup 7/99: First attempt at separating low-level information from BUF structure into the new BIO structure. The latter will be used to represent the actual I/O underlying the buffer cache, other subsystems and device drivers. Other information from the BUF structure will be moved eventually once their place in the grand scheme is determined. For now, preprocessor macros have been added to reduce widespread changes; this is a temporary measure by all means until more of the BIO and BUF API is formalised. Remove compatibility preprocessor macros in the AAC driver because our BUF/BIO system is mutating; not to mention they were getting in the way. NB the name BIO has been used because it's quite appropriate and known among kernel developers from other operating system groups, be it BSD or Linux. This change should not have any operational effect (famous last words). Reviewed by: Matthew Dillon <firstname.lastname@example.org>
BUF/BIO cleanup 6/99: Move 'bogus_offset' variable into bufinit(), it is not used anywhere out of the said function.
Add 'debug.sizeof.buf' sysctl for determining size of struct buf on a system. Declare the _debug_sizeof sysctl in sys/sysctl.h instead of redundantly declaring in two files.
MFC 1.38. Fix a case where a buffer is moved to EMPTY or EMPTYKVA without disassociating its vnode.
BUF/BIO cleanup 5/99: Cleanup and document the buffer cache sysctls. The 'getnewbufrestarts', 'getnewbufcalls', 'bufdefragcnt', 'buffreekvacnt' and 'bufreusecnt' are now changed to be read-only sysctls. Group them depending on whether they are writable or not. Correct, extend and write documentation for various functions in this file. Correct typos in various code comments and adjust nearby style issue.
Initialize buf->b_iodone to NULL during bufinit(9) stage. Use NULL instead of 0 for assigning to pointers.
Remove scheduler define which was never used.
BUF/BIO cleanup 3/99: Retire the B_CALL flag in favour of checking the bp->b_iodone pointer directly, thus simplifying the BUF interface even more. Move scattered B_UNUSED* flag space definitions into one place, that is below the rest of the definitions.
BUF/BIO cleanup 2/99: Localise buffer queue information into kern/vfs_bio.c, it should not be messed with outside of the named file. Convert the QUEUE_* #defines into enum bufq_type, prefix the names with 'B'. Move vfs_bufstats() from kern/vfs_syscalls.c into kern/vfs_bio.c since that's where it should really belong, at least till its use is cleaned. Move bufqueues extern from sys/buf.h into kern/vfs_bio.c as it shouldn't be messed with by anything else. It was only sitting in sys/buf.h because of vfs_bufstats(). Note the change to initpbuf() is acceptable since they are a hack anyway, not to mention that the said function and friends should probably reside in kern/vfs_bio.c.
There is a case when B_VMIO is clear where a buffer can be placed on the EMPTY or EMPTYKVA queues without being disassociated from its vnode. This can lead to a duplicate logical block panic in the red-black tree code. Rework brelse() to ensure that buffers are properly cleaned up before being placed on said queues, and add assertions to validate other cases. Reported-by: Tomaz Borstnar
Remove spl*() calls from kern, replacing them with critical sections. Change the meaning of safepri from a cpl mask to a thread priority. Make a minor adjustment to tests within one of the buffer cache's critical sections.
incore() is used to detect logical block number collisions, and other callers will check B_INVAL on return. Do not return a false negative if the buffer we find happens to be B_INVAL as this could result in duplicate buffers in the buffer cache. As of the red-black tree code, which detects duplicate entries, the case will immediately panic the machine. MFC: 2 weeks
Implement Red-Black trees for the vnode clean/dirty buffer lists. Implement ranged fsyncs and adjust the syncer to use the new capability. This capability will also soon be used to replace the write_behind heuristic. Rewrite the fsync code for all VFSs to use the new APIs (generally simplifying them). Get rid of B_WRITEINPROG, it is no longer useful or needed. Get rid of B_SCANNED, it is no longer useful or needed. Rewrite the NFS 2-phase commit protocol to take advantage of the new Red-Black tree topology. Add RB_SCAN() for callback-scanning of Red-Black trees. Give RB_SCAN the ability to track the 'next' scan node and automatically fix it up if the callback directly or indirectly or through blocking indirectly deletes nodes in the tree while the scan is in progress. Remove most related loop restart conditions, they are no longer necessary. Disable filesystem background bitmap writes. This really needs to be solved a different way and the concept does not work well with red-black trees.
Remove an assertion in bundirty() that requires the buffer to not be on a queue. There is a code path in brelse() where the buffer may be put on a queue prior to calling bundirty(). Reported-by: David Rhodus <email@example.com>
getblk() has an old crufty API in which the logical block size is not a well known quantity. Device drivers standardize on using DEV_BSIZE (512), while file ops are supposed to use mount->mnt_stat.f_iosize. The existing code was testing for a non-NULL vnode->v_mountedhere field but this field is part of a union and only valid for VDIR types. It was being improperly tested on non-VDIR vnode types. In particular, if vn_isdisk() fails due to the disk device being ripped out from under a filesystem, the code would fall through and try to use v_mountedhere, leading to a crash. It also makes no sense to use the target mount to calculate the block size for the underlying mount point's vnode, so this test has been removed entirely. The vn_isdisk() test has been replaced with an explicit VBLK/VCHR test. Finally, note that filesystems like UFS use varying buffer cache buffer sizes for different areas of the same block device (e.g. bitmap areas, inode area, file data areas, superblock), which is why DEV_BSIZE is being used here. What really needs to happen is for b_blkno to be entirely removed in favor of a 64 bit offset. Crash-Reported-by: Vyacheslav Bocharov <firstname.lastname@example.org>
Create a non-blocking version of BUF_REFCNT() called BUF_REFCNTNB() to be used for non-critical KASSERT()'s or in situations where the buffer lock is in a known state. This fixes a blocking condition in the ATA interrupt path. The normal BUF_REFCNT() calls lockcount() which obtains a token which caused the interrupt thread to temporarily block in biodone() due to a KASSERT. Found-from: kernel core provided by David Rhodus.
Try to close an occasional VM page related panic that is believed to occur due to the VM page queues or free lists being indirectly manipulated by interrupts that are not protected by splvm(). Do this by replacing splvm()'s with critical sections in a number of places. Note: some of this work bled over into the "VFS messaging/interfacing work stage 8/99" commit.
Correct reference to buf->b_xio.xio_pages in a comment.
BUF/BIO work, for removing the requirement of KVA mappings for I/O requests. Stage 1 of 8: o Replace the b_pages member of the BUF structure with an embedded XIO (b_xio). The XIO will be used for managing the BUF's page lists. o Initialize the XIO at two main (only) points: 1) the pbuf code, which is used by the NFS code to create a temporary buffer; and bufinit(9), which is used by the rest of the BUF/BIO consumers. Discussed-with: Matthew Dillon <email@example.com>,
ANSIfication. No operational changes. Submitted-by: Tim Wickberg <firstname.lastname@example.org>
Get rid of VM_WAIT and VM_WAITPFAULT crud, replace with calls to vm_wait() and vm_waitpfault(). This is a non-operational change. vm_page.c now uses the _vm_page_list_find() inline (which itself is only in vm_page.c) for various critical path operations.
Device layer rollup commit. * cdevsw_add() is now required. cdevsw_add() and cdevsw_remove() may specify a mask/match indicating the range of supported minor numbers. Multiple cdevsw_add()'s using the same major number, but distinctly different ranges, may be issued. All devices that failed to call cdevsw_add() before now do. * cdevsw_remove() now automatically marks all devices within its supported range as being destroyed. * vnode->v_rdev is no longer resolved when the vnode is created. Instead, only v_udev (a newly added field) is resolved. v_rdev is resolved when the vnode is opened and cleared on the last close. * A great deal of code was making rather dubious assumptions with regards to the validity of devices associated with vnodes, primarily due to the persistence of a device structure due to being indexed by (major, minor) instead of by (cdevsw, major, minor). In particular, if you run a program which connects to a USB device and then you pull the USB device and plug it back in, the vnode subsystem will continue to believe that the device is open when, in fact, it isn't (because it was destroyed and recreated). In particular, note that all the VFS mount procedures now check devices via v_udev instead of v_rdev prior to calling VOP_OPEN(), since v_rdev is NULL prior to the first open. * The disk layer's device interaction has been rewritten. The disk layer (i.e. the slice and disklabel management layer) no longer overloads its data onto the device structure representing the underlying physical disk. Instead, the disk layer uses the new cdevsw_add() functionality to register its own cdevsw using the underlying device's major number, and simply does NOT register the underlying device's cdevsw. No confusion is created because the device hash is now based on (cdevsw,major,minor) rather than (major,minor).
NOTE: This also means that underlying raw disk devices may use the entire device minor number instead of having to reserve the bits used by the disk layer, and also means that we can (theoretically) stack a fully disklabel-supported 'disk' on top of any block device. * The new reference counting scheme prevents this by associating a device with a cdevsw and disconnecting the device from its cdevsw when the cdevsw is removed. Additionally, all udev2dev() lookups run through the cdevsw mask/match and only successfully find devices still associated with an active cdevsw. * Major work on MFS: MFS no longer shortcuts vnode and device creation. It now creates a real vnode and a real device and implements real open and close VOPs. Additionally, due to the disk layer changes, MFS is no longer limited to 255 mounts. The new limit is 16 million. Since MFS creates a real device node, mount_mfs will now create a real /dev/mfs<PID> device that can be read from userland (e.g. so you can dump an MFS filesystem). * BUF AND DEVICE STRATEGY changes. The struct buf contains a b_dev field. In order to properly handle stacked devices we now require that the b_dev field be initialized before the device strategy routine is called. This required some additional work in various VFS implementations. To enforce this requirement, biodone() now sets b_dev to NODEV. The new disk layer will adjust b_dev before forwarding a request to the actual physical device. * A bug in the ISO CD boot sequence which resulted in a panic has been fixed. Testing by: lots of people, but David Rhodus found the most egregious bugs.
Close an interrupt race between vm_page_lookup() and (typically) a vm_page_sleep_busy() check by using the correct spl protection. An interrupt can occur in between the two operations and unbusy/free the page in question, causing the busy check to fail and for the code to fall through and then operate on a page that may have been freed and possibly even reused. Also note that vm_page_grab() had the same issue between the lookup, busy check, and vm_page_busy() call. Close an interrupt race when scanning a VM object's memq. Interrupts can free pages, removing them from memq, which interferes with memq scans and can cause a page unassociated with the object to be processed as if it were associated with the object. Calls to vm_page_hold() and vm_page_unhold() require spl protection. Rename the passed socket descriptor argument in sendfile() to make the code more readable. Fix several serious bugs in procfs_rwmem(). In particular, force it to block if a page is busy and then retry. Get rid of vm_pager_map_page() and vm_pager_unmap_page(), make the functions that used to use these routines use SFBUF's instead. Get rid of the (userland?) 4MB page mapping feature in pmap_object_init_pt() for now. The code appears to not track the page directory properly and could result in a non-zero page being freed as PG_ZERO. This commit also includes updated code comments and some additional non-operational code cleanups.
Remove newline from panic(9) message, it is redundant.
Peter Edwards brought up an interesting NFS bug which we both originally thought would be a fairly straightforward bug fix. But it turns out to require a nasty hack to fix. The issue is that near the file EOF NFS uses piecemeal writes and piecemeal buffer cache buffers. The result is that manipulation through the buffer cache only sets some of the m->valid bits in the associated vm_page(s). This case may also occur in the middle of a file if for example a file is piecemeal written and then ftruncated to be much larger (or lseek/write at a much higher seek position). The nfs_getpages() routine was assuming that if m->valid was non-0, the page is basically valid and no read rpc is required to fill it. The problem is that if you mmap() a piecemeal VM page and fault it in, m->valid is set to VM_PAGE_BITS_ALL (0xFF). Then, later, when NFS flushes the buffer cache, only some of the m->valid bits are cleared (e.g. leaving 0xFC). A later page fault will cause NFS to believe that the page is sufficiently valid and vm_fault will then zero-out the first X bytes of the page when, in fact, we really should have done an I/O to refill those X bytes. The fix in PR misc/64816 (FreeBSD) tried to solve this by checking to see if the m->valid bits were 'sufficiently valid' in the file EOF case but testing with fsx resulted in several failure modes. This doesn't work because (1) if you extend the file w/ ftruncate or lseek/write these partially valid pages can end up in the middle of the file rather than just at the end and (2) There may be a dirty buffer associated with these pages, meaning that the pages may contain dirty data, and we cannot safely overwrite the pages with a new read I/O. The solution in this patch is to deal with the screwy m->valid bit clearing by special-casing NFS and then having the BIO system clear ALL the m->valid bits instead of just some of them when NFS calls vinvalbuf().
That way m->valid will be set to 0 when the buffer is invalidated and the nfs_getpages() code can be left doing its simple 'if any m->valid bits are set assume the whole page is valid' test. In order for the BIO system to safely be able to do this (so as not to invalidate portions of a VM page associated with an adjacent buffer), the NFS io size has been further restricted to be an integral multiple of PAGE_SIZE. This is a terrible hack but there is no other way to fix the problem short of rewriting the entire buffer cache. We will do that eventually, but not now. Reported-by: Peter Edwards <email@example.com> Referencing-PR: misc/64816 by Patrick Mackinlay <firstname.lastname@example.org>
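The m->valid arithmetic in the commit above is easier to follow with numbers. The sketch below assumes a 4096-byte page tracked by eight valid bits of 512 bytes each, which is consistent with the 0xFF and 0xFC values quoted in the message; sk_page_bits() and sk_page_assumed_valid() are hypothetical helpers written for illustration, not the kernel's routines.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u  /* assumed page size */
#define SK_DEV_BSIZE 512u   /* assumed bytes covered by one valid bit */

/*
 * Return the valid-bit mask covering [base, base + size) within one page.
 * Every 512-byte block the range touches gets its bit set.
 */
static uint8_t
sk_page_bits(unsigned base, unsigned size)
{
	unsigned first, last;

	assert(size > 0 && base + size <= SK_PAGE_SIZE);
	first = base / SK_DEV_BSIZE;
	last = (base + size - 1) / SK_DEV_BSIZE;
	return (uint8_t)((2u << last) - (1u << first));
}

/* The simple test the message describes: any bit set means "valid". */
static int
sk_page_assumed_valid(uint8_t valid)
{
	return valid != 0;
}
```

With these assumptions a fully faulted-in page has mask 0xFF, and clearing only the bits for the first 1024 bytes leaves 0xFC, which the simple test still treats as valid. Clearing all bits on invalidation, as the patch does for NFS, makes the simple test safe again.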
The existing hash algorithm in bufhash() does not distribute entries very well across buckets, especially in the case of cylinder group blocks which are located at a sequence of locations that are a multiple of a large power of two apart. In the case of large file systems, one or possibly a few of the hash chains can get excessively long. Replace the existing hash algorithm with a variation on the Fibonacci hash. Merged from FreeBSD
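For readers unfamiliar with the technique named in the commit above, here is a generic multiplicative (Fibonacci) hash for 32-bit keys. It is a sketch of the general method only: the function name fib_hash32() and the choice of key are assumptions, and the actual bufhash() variation merged from FreeBSD may differ in its inputs and constants.

```c
#include <assert.h>
#include <stdint.h>

/* 2^32 divided by the golden ratio, as an odd 32-bit constant. */
#define FIB_MULT_32 2654435769u

/*
 * Hypothetical helper: map a 32-bit key to one of (1 << bits) buckets by
 * multiplying by FIB_MULT_32 (mod 2^32) and keeping the top 'bits' bits.
 */
static uint32_t
fib_hash32(uint32_t key, unsigned bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(key * FIB_MULT_32) >> (32 - bits);
}
```

Because the result is taken from the high bits of the product, keys that differ only by a multiple of a large power of two still land in different buckets, whereas a plain `key & (nbuckets - 1)` would put them all in the same chain.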
Change this vnode check inside the VFS_BIO_DEBUG code path to check for erroneous hold counts instead of the reference count, which was an irrelevant check here.
Replace a manual check for a VMIO candidate with vn_canvmio() under VFS_BIO_DEBUG. This silences an annoying warning in getblk() when VMIO'ing on a VDIR (directory) vnode; this happens due to vmiodirenable sysctl being set to `1'. Discussed with: Matthew Dillon
Newtoken commit. Change the token implementation as follows: (1) Obtaining a token no longer enters a critical section. (2) tokens can be held through scheduler switches and blocking conditions and are effectively released and reacquired on resume. Thus tokens serialize access only while the thread is actually running. Serialization is not broken by preemptive interrupts. That is, interrupt threads which preempt do not release the preempted thread's tokens. (3) Unlike spl's, tokens will interlock w/ interrupt threads on the same or on a different cpu. The vnode interlock code has been rewritten and the API has changed. The mountlist vnode scanning code has been consolidated and all known races have been fixed. The vnode interlock is now a pool token. The code that frees unreferenced vnodes whose last VM page has been freed has been moved out of the low level vm_page_free() code and moved to the periodic filesystem syncer code in vfs_msync(). The SMP startup code and the IPI code has been cleaned up considerably. Certain early token interactions on AP cpus have been moved to the BSP. The LWKT rwlock API has been cleaned up and turned on. Major testing by: David Rhodus
buftimetoken must be declared in a .c file.
Retool the M_* flags to malloc() and the VM_ALLOC_* flags to vm_page_alloc(), and vm_page_grab() and friends. The M_* flags now have more flexibility, with the intent that we will start using some of it to deal with NULL pointer return problems in the codebase (CAM is especially bad at dealing with unexpected return values). In particular, add M_USE_INTERRUPT_RESERVE and M_FAILSAFE, and redefine M_NOWAIT as a combination of M_ flags instead of its own flag. The VM_ALLOC_* macros are now flags (0x01, 0x02, 0x04) rather than states (1, 2, 3), which allows us to create combinations that the old interface could not handle.
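The difference between states and flags in the commit above comes down to whether values can be OR'd together. The sketch below uses hypothetical SK_ALLOC_* names and an invented sk_alloc_permits() helper; it does not reproduce the real VM_ALLOC_* definitions or their meanings.

```c
#include <assert.h>

/* Hypothetical flag values: one bit each, so they can be combined. */
enum {
	SK_ALLOC_A = 0x01,
	SK_ALLOC_B = 0x02,
	SK_ALLOC_C = 0x04
};

/* Return nonzero if the single-bit class 'cls' is present in 'flags'. */
static int
sk_alloc_permits(int flags, int cls)
{
	assert(cls != 0 && (cls & (cls - 1)) == 0);  /* exactly one bit */
	return (flags & cls) != 0;
}
```

With state values 1, 2 and 3 the expression `1 | 2` is indistinguishable from state 3, so a caller could not ask for "A or B but not C". With one bit per class that request is simply `SK_ALLOC_A | SK_ALLOC_B`.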
64 bit address space cleanups which are a prerequisite for future 64 bit address space work and PAE. Note: this is not PAE. This patch basically adds vm_paddr_t, which represents a 'physical address'. Physical addresses may be larger than virtual addresses and on IA32 we make vm_paddr_t a 64 bit quantity. Submitted-by: Hiten Pandya <email@example.com>
Disable background bitmap writes. They appear to cause at least two race conditions: First, on MP systems even an LK_NOWAIT lock may block, invalidating flags checks done just prior to the lock attempt. Second, on both MP and UP systems, the original buffer (origbp) may be modified during the completion of a background write without its lock being held and these modifications can race against mainline code that is also modifying the same buffer with the lock held. Eventually the problem background bitmap writes solved will be solved more generally by implementing page COWing during device I/O to avoid stalls on pages undergoing write I/O.
SLAB ALLOCATOR Stage 1. This brings in a slab allocator written from scratch by yours truly. A detailed explanation of the allocator is included but first, other changes: * Instead of having vm_map_entry_insert*() and friends allocate the vm_map_entry structures, a new mechanism has been put in place whereby the vm_map_entry structures are reserved at a higher level, then expected to exist in the free pool in deep vm_map code. This preliminary implementation may eventually turn into something more sophisticated that includes things like pmap entries and so forth. The idea is to convert what should be low level routines (VM object and map manipulation) back into low level routines. * vm_map_entry structures are now per-cpu cached, which is integrated into the reservation model above. * The zalloc 'kmapentzone' has been removed. We now only have 'mapentzone'. * There were race conditions between vm_map_findspace() and actually entering the map_entry with vm_map_insert(). These have been closed through the vm_map_entry reservation model described above. * Two new kernel config options now work. NO_KMEM_MAP has been fleshed out a bit more and a number of deadlocks related to having only the kernel_map now have been fixed. The USE_SLAB_ALLOCATOR option will cause the kernel to compile-in the slab allocator instead of the original malloc allocator. If you specify USE_SLAB_ALLOCATOR you must also specify NO_KMEM_MAP. * vm_poff_t and vm_paddr_t integer types have been added. These are meant to represent physical addresses and offsets (physical memory might be larger than virtual memory, for example Intel PAE). They are not heavily used yet but the intention is to separate physical representation from virtual representation. SLAB ALLOCATOR FEATURES The slab allocator breaks allocations up into approximately 80 zones based on their size. Each zone has a chunk size (alignment). For example, all allocations in the 1-8 byte range will allocate in chunks of 8 bytes.
Each size zone is backed by one or more blocks of memory. The size of these blocks is fixed at ZoneSize, which is calculated at boot time to be between 32K and 128K. The use of a fixed block size allows us to locate the zone header given a memory pointer with a simple masking operation. The slab allocator operates on a per-cpu basis. The cpu that allocates a zone block owns it. free() checks the cpu that owns the zone holding the memory pointer being freed and forwards the request to the appropriate cpu through an asynchronous IPI. This request is not currently optimized but it can theoretically be heavily optimized ('queued') to the point where the overhead becomes inconsequential. As of this commit the malloc_type information is not MP safe, but the core slab allocation and deallocation algorithms, not including the allocation of the backing block, *ARE* MP safe. The core code requires no mutexes or locks, only a critical section. Each zone contains N allocations of a fixed chunk size. For example, a 128K zone can hold approximately 16000 or so 8 byte allocations. The zone is initially zeroed and new allocations are simply allocated linearly out of the zone. When a chunk is freed it is entered into a linked list and the next allocation request will reuse it. The slab allocator heavily optimizes M_ZERO operations at both the page level and the chunk level. The slab allocator maintains various undocumented malloc quirks such as ensuring that small power-of-2 allocations are aligned to their size, and malloc(0) requests are also allowed and return a non-NULL result. kern_tty.c depends heavily on the power-of-2 alignment feature and ahc depends on the malloc(0) feature. Eventually we may remove the malloc(0) feature. PROBLEMS AS OF THIS COMMIT NOTE! This commit may destabilize the kernel a bit.
There are issues with the ISA DMA area ('bounce' buffer allocation) due to the large backing block size used by the slab allocator and there are probably some deadlock issues due to the removal of kmem_map that have not yet been resolved.
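Two pieces of arithmetic from the slab allocator description above can be sketched in a few lines: finding the zone header by masking a pointer, and rounding a request up to its chunk size. The sketch assumes that zone blocks are aligned to their (power-of-2) size, which is what makes the masking work; sk_zone_base() and sk_chunk_round() are hypothetical names, not functions from the allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the base address of the zone block containing 'addr', assuming
 * blocks are 'zone_size' bytes long and aligned to 'zone_size'.
 */
static uintptr_t
sk_zone_base(uintptr_t addr, uintptr_t zone_size)
{
	assert(zone_size != 0 && (zone_size & (zone_size - 1)) == 0);
	return addr & ~(zone_size - 1);
}

/* Round 'size' up to the next multiple of the power-of-2 'chunk'. */
static size_t
sk_chunk_round(size_t size, size_t chunk)
{
	assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
	return (size + chunk - 1) & ~(chunk - 1);
}
```

For example, with a 128K zone size the pointer 0x12345678 masks down to 0x12340000, and a 5-byte request in the 8-byte zone occupies one 8-byte chunk. A 128K block divided into 8-byte chunks gives 16384 slots before the header is subtracted, in line with the "approximately 16000" figure in the message.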
Add an alignment feature to vm_map_findspace(). This feature will be used primarily by the upcoming slab allocator but has many applications. Use the alignment feature in the buffer cache to hopefully reduce fragmentation.
Register keyword removal Approved by: Matt Dillon
Remove the priority part of the priority|flags argument to tsleep(). Only flags are passed now. The priority was a user scheduler thingy that is not used by the LWKT subsystem. For process statistics assume sleeps without P_SINTR set to be disk-waits, and sleeps with it set to be normal sleeps. This commit should not contain any operational changes.
MP Implementation 1/2: Get the APIC code working again, sweetly integrate the MP lock into the LWKT scheduler, replace the old simplelock code with tokens or spin locks as appropriate. In particular, the vnode interlock (and most other interlocks) are now tokens. Also clean up a few curproc/cred sequences that are no longer needed. The APs are left in degenerate state with non IPI interrupts disabled as additional LWKT work must be done before we can really make use of them, and FAST interrupts are not managed by the MP lock yet. The main thing for this stage was to get the system working with an APIC again. buildworld tested on UP and 2xCPU/MP (Dell 2550)
Split the struct vmmeter cnt structure into a global vmstats structure and a per-cpu cnt structure. Adjust the sysctls to accumulate statistics over all cpus.
proc->thread stage 6: kernel threads now create processless LWKT threads. A number of obvious curproc cases were removed, tsleep/wakeup was made to work with threads (wmesg, ident, and timeout features moved to threads). There are probably a few curproc cases left to fix.
cleanup some odd uses of curproc. Remove PHOLD/PRELE around physical I/O (our UPAGES can no longer be swapped out and if they eventually are made to it will only be when the thread is sleeping on a particular address). Also move the inblock/oublock accounting in vfs_busy_pages() allowing us to remove additional curproc references from various filesystem code. This also makes inblock/oublock more consistent.
proc->thread stage 5: BUF/VFS clearance! Remove the ucred argument from vop_close, vop_getattr, vop_fsync, and vop_createvobject. These VOPs can be called from multiple contexts so the cred is fairly useless, and UFS ignores it anyway. For filesystems (like NFS) that sometimes need a cred we use proc0.p_ucred for now. This removal also removed the need for a 'proc' reference in the related VFS procedures, which greatly helps our proc->thread conversion. bp->b_wcred and bp->b_rcred have also been removed, and for the same reason. It makes no sense to have a particular cred when multiple users can access a file. This may create issues with certain types of NFS mounts but if it does we will solve them in a way that doesn't pollute the struct buf.
proc->thread stage 1: change kproc_*() API to take and return threads. Note: we won't be able to turn off the underlying proc until we have a clean thread path all the way through, which ain't now.
thread stage 5: Separate the inline functions out of sys/buf.h, creating sys/buf2.h (A methodology that will continue as time passes). This solves inline vs struct ordering problems. Do a major cleanup of the globaldata access methodology. Create a gcc-cacheable 'mycpu' macro & inline to access per-cpu data. Atomicity is not required because we will never change cpus out from under a thread, even if it gets preempted by an interrupt thread, because we want to be able to implement per-cpu caches that do not require locked bus cycles or special instructions.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 126.96.36.199