Up to [DragonFly] / src / sys / vm
Keyword substitution: kv
Default branch: MAIN
* Use SYSREF for vmspace structures. This replaces the vmspace structure's roll-your-own refcnt implementation and replaces its zalloc backing store. Numerous procedures have been added to handle termination and DTOR operations and to properly interlock with vm_exitingcnt, all centered around the vmspace_sysref_class declaration.
* Replace pmap_activate() and pmap_deactivate() with pmap_replacevm(). This replaces numerous instances where roll-your-own deactivate/activate sequences were being used, creating small windows of opportunity where an update to the kernel pmap would not be visible to running code.
* Properly deactivate pmaps and add assertions to that effect in the teardown code. Cases had to be fixed in cpu_exit_switch(), the exec code, the AIO code, and a few other places.
* Add pmap_puninit() which is called as part of the DTOR sequence for vmspaces, allowing the kmem mapping and VM object to be recovered. We could not do this with the previous zalloc() implementation.
* Properly initialize the per-cpu sysid allocator (globaldata->gd_sysid_alloc).

Make the following adjustments to the LWP exiting code.
* P_WEXIT interlocks the master exiting thread, eliminating races which can occur when it is signaling the 'other' threads.
* LWP_WEXIT interlocks individual exiting threads, eliminating races which can occur there and streamlining some of the tests.
* Don't bother queueing the last LWP to the reaper. Instead, just leave it in the p_lwps list (but still decrement nthreads), and add code to kern_wait() to reap the last thread. This improves exit/wait performance for unthreaded applications.
* Fix a VMSPACE teardown race in the LWP code. It turns out that it was still possible for the VMSPACE for an exiting LWP to be ripped out from under it by the reaper (due to a conditional that was really supposed to be a loop), or by kern_wait() (due to not waiting for all the LWPs to enter an exiting state). The fix is to have the LWPs PHOLD() the process and then PRELE() it when they are reaped.

This is a little mixed up because the addition of SYSREF revealed a number of other semi-related bugs in the pmap and LWP code which also had to be fixed.
Implement nearly all the remaining items required to allow the virtual kernel to actually execute code on behalf of a virtualized user process. The virtual kernel is now able to execute the init binary through to the point where it sets up a TLS segment.
* Create a pseudo tf_trapno called T_SYSCALL80 to indicate system call traps.
* Add MD shims when creating or destroying a struct vmspace, allowing the virtual kernel to create and destroy real-kernel vmspaces along with them. Add appropriate calls to vmspace_mmap() and vmspace_mcontrol() to map memory inside the user process vmspace. The memory is mapped VPAGETABLE and the page table directory is set to point to the pmap page directory.
* Clean up user_trap, handle T_PAGEFLT properly.
* Implement go_user(). It calls vmspace_ctl(... VMSPACE_CTL_RUN) and user_trap() in a loop, allowing the virtual kernel to 'run' a user mode context under its control.
* Reduce VM_MAX_USER_ADDRESS to 0xb8000000 for now, until I figure out the best way to have the virtual kernel query the actual max user address from the real kernel.
* Correct a pm_pdirpte assignment. We can't look up the PTE until after we have entered it into the kernel pmap.
Make kernel_map, buffer_map, clean_map, exec_map, and pager_map direct structural declarations instead of pointers. Clean up all related code, in particular kmem_suballoc(). Remove the offset calculation for kernel_object. kernel_object's page indices used to be relative to the start of kernel virtual memory in order to improve the performance of VM page scanning algorithms. The optimization is no longer needed now that VM objects use Red-Black trees. Removal of the offset simplifies a number of calculations and makes the code more readable.
Introduce globals: KvaStart, KvaEnd, and KvaSize. Used by the kernel instead of the nutty VADDR and VM_*_KERNEL_ADDRESS macros. Move extern declarations for these variables as well as for virtual_start, virtual_end, and phys_avail from MD headers to MI headers. Make kernel_object a global structure instead of a pointer. Remove kmem_object and all related code (none of it is used any more).
Misc cleanups and CVS surgery. Move a number of header and source files from machine/pc32 to cpu/i386 as part of the ongoing architectural separation work and do a bit of cleanup.
Reformulate the way the kernel updates the PMAPs in the system when adding a new page table page to expand kernel memory. Keep track of the PMAPs in their own list rather than scanning the process list to locate them. This allows PMAPs managed on behalf of virtual kernels to be properly updated. VM spaces can now be allocated from scratch and may not have a parent template to inherit certain fields from. Make sure these fields are properly cleared.
Collapse some bits of repetitive code into their own procedures and allocate a maximally sized default object to back MAP_VPAGETABLE mappings, allowing us to access logical memory beyond the size of the original mmap() call by programming the page table to point at it. This gives us an abstraction and capability similar to a real kernel's ability to map e.g. 2GB of physical memory into its 1GB address space.
MAP_VPAGETABLE support part 3/3. Implement a new system call called mcontrol() which is an extension of madvise(), adding an additional 64 bit argument. Add two new advisories, MADV_INVAL and MADV_SETMAP. MADV_INVAL will invalidate the pmap for the specified virtual address range. You need to do this for the virtual addresses affected by changes made in a virtual page table. MADV_SETMAP sets the top-level page table entry for the virtual page table governing the mapped range. It only works for memory governed by a virtual page table and strange things will happen if you only set the root page table entry for part of the virtual range. Further refine the virtual page table format. Keep with 32 bit VPTE's for the moment, but properly implement VPTE_PS and VPTE_V. VPTE_PS can be used to support 4MB linear maps in the top level page table and it can also be used when specifying the 'root' VPTE to disable the page table entirely and just linear map the backing store. VPTE_V is the 'valid' bit (before it was inverted, now it is normal).
MAP_VPAGETABLE support part 1/3. Reorganize vm_fault() to get more direct access to the VM page resolved by a VM fault. Move vm_fault()'s core shadow object traversal and fault I/O code to a new procedure called vm_fault_object(). Begin adding support for memory mappings which are backed by a virtualized page table under userland control.
Move flag(s) representing the type of vm_map_entry into its own vm_maptype_t type. This is a precursor to adding a new VM mapping type for virtualized page tables.
VNode sequencing and locking - part 3/4. VNode aliasing is handled by the namecache (aka nullfs), so there is no longer a need to have VOP_LOCK, VOP_UNLOCK, or VOP_ISSLOCKED as 'VOP' functions. Both NFS and DEADFS have been using standard locking functions for some time and are no longer special cases. Replace all uses with native calls to vn_lock, vn_unlock, and vn_islocked. We can't have these as VOP functions anyhow because of the introduction of the new SYSLINK transport layer, since vnode locks are primarily used to protect the local vnode structure itself.
LK_NOPAUSE no longer serves a purpose, scrap it.
Remove the (unused) copy-on-write support for a vnode's VM object. This support originally existed to support the badly implemented and severely hacked ENABLE_VFS_IOOPT I/O optimization which was removed long ago. This also removes a bunch of cross-module pollution in UFS.
Simplify vn_lock(), VOP_LOCK(), and VOP_UNLOCK() by removing the thread_t argument. These calls now always use the current thread as the lockholder. Passing a thread_t to these functions has always been questionable at best.
Change *_pager_allocate() to take off_t instead of vm_ooffset_t. The actual underlying type (a 64 bit signed integer) is the same. Recent and upcoming work is standardizing on off_t. Move object->un_pager.vnp.vnp_size to vnode->v_filesize. As before, the field is still only valid when a VM object is associated with the vnode.
Pass LK_PCATCH instead of trying to store tsleep flags in the lock structure, so multiple entities competing for the same lock do not use unexpected flags when sleeping. Only NFS really uses PCATCH with lockmgr locks.
* Remove (void) casts for discarded return values. * Ansify function definitions. In-collaboration-with: Alexey Slynko <firstname.lastname@example.org>
There is no need to set *entry on each entry traversed in the red-black tree when looking up a record.
gdb-6 uses /dev/kmem exclusively for kernel addresses when gdb'ing a live kernel, but the globaldata mapping is outside the bounds of kernel_map. Make sure that the globaldata mapping is visible to it.
Replace the cache-point linear search algorithm for VM map entries with a red-black tree. This makes VM map lookups O(log N) in all cases. Note that FreeBSD seems to have gone the splay-tree route, but I really dislike the fact that splay trees are constantly writing to memory even for simple lookups. This would also limit our ability to implement a separate hinting/caching mechanism. A red-black tree is basically a binary tree with internal nodes containing real data in addition to the leaves, similar to a B+Tree. A red-black tree is very similar to a splay tree but it does not attempt to modify the data structure for pure lookups. Caveat: we tried to revive the map->hint mechanism but there is currently a serious crash/lockup bug related to it so it is disabled in this commit. Submitted-by: Eirik Nygaard <email@example.com> Using-Red-Black-Macros-From: NetBSD (sys/tree.h)
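A minimal sketch of the read-only lookup described above, assuming a hypothetical struct map_entry with start/end bounds and child pointers. It is not the kernel's lookup routine and it omits the red-black rebalancing done on insert; it only illustrates that a pure lookup can walk the tree without writing to any node, which is the property the log contrasts with splay trees.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical map entry covering [start, end); names are illustrative. */
struct map_entry {
    unsigned long start;
    unsigned long end;
    const struct map_entry *left;   /* entries below start */
    const struct map_entry *right;  /* entries at or above end */
};

/*
 * Find the entry containing addr, or NULL if no entry covers it.
 * Every pointer is const: the walk never modifies the tree.
 */
static const struct map_entry *
map_lookup(const struct map_entry *root, unsigned long addr)
{
    const struct map_entry *e = root;

    while (e != NULL) {
        assert(e->start < e->end);
        if (addr < e->start)
            e = e->left;
        else if (addr >= e->end)
            e = e->right;
        else
            return e;
    }
    return NULL;
}
```

On a balanced tree the loop runs at most O(log N) times; the balancing itself would come from the insert and remove paths, which are left out here.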
Fix bugs in the vm_map_entry reservation and zalloc code. This code is a bit sticky because zalloc must be able to call kmem_alloc*() in order to extend mapentzone to allocate a new chunk of vm_map_entry structures, and kmem_alloc*() *needs* two vm_map_entry structures in order to map the new data block into the kernel. To avoid a chicken-and-egg recursion there must already be some vm_map_entry structures available for kmem_alloc*() to use. To ensure that structures are available the vm_map_entry cache maintains a 'reserve'. This cache is initially populated from the vm_map_entry's allocated via zbootinit() in vm_map.c. However, since this is a per-cpu cache there are situations where the vm_map subsystem will be used on other cpus before the cache can be populated on those cpus, but after the static zbootinit structures have all been used up. To fix this we statically allocate two vm_map_entry structures for each cpu which is sufficient for zalloc to call kmem_alloc*() to allocate the remainder of the reserve. Having a lot of preloaded modules seems to be able to trigger the bug. Also get rid of gd_vme_kdeficit which was a confusing methodology to keep track of kernel reservations. Now we just have gd_vme_avail and a negative count indicates a deficit (the reserve is being dug into). From-panic-reported-by: Adam K Kirchhoff <firstname.lastname@example.org>
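A rough sketch of the single signed counter idea, under the assumption that one per-cpu integer is enough to express both "entries available" and "reserve in deficit". The struct and function names are hypothetical; only the convention that a negative count means a deficit comes from the log.

```c
#include <assert.h>

/* Hypothetical per-cpu state; only the signed counter mirrors gd_vme_avail. */
struct cpu_vme {
    int avail;  /* entries available; negative means the reserve is in deficit */
};

/* Take count entries unconditionally, as a caller digging into the reserve would. */
static void
vme_take(struct cpu_vme *c, int count)
{
    assert(count >= 0);
    c->avail -= count;
}

/* Number of entries that must be allocated to get back to `reserve` available. */
static int
vme_shortfall(const struct cpu_vme *c, int reserve)
{
    return (c->avail < reserve) ? reserve - c->avail : 0;
}
```

With one counter there is no second variable to keep in sync, which appears to be the simplification the commit is after.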
Remove the vfs page replacement optimization and its ENABLE_VFS_IOOPT option. This never worked properly... that is, the semantics are broken compared to a normal read or write in that the read 'buffer' will be modified out from under the caller if the underlying file is. What is really needed here is a copy-on-write feature that works in both directions, similar to how a shared buffer is copied after a fork() if either the parent or child modify it. The optimization will eventually be rewritten with that in mind but not right now.
VFS messaging/interfacing work stage 8/99: Major reworking of the vnode interlock and other miscellaneous things. This patch also fixes FS corruption due to prior vfs work in head. In particular, prior to this patch the namecache locking could introduce blocking conditions that confuse the old vnode deactivation and reclamation code paths. With this patch there appear to be no serious problems even after two days of continuous testing.
* VX lock all VOP_CLOSE operations.
* Fix two NFS issues. There was an incorrect assertion (found by David Rhodus), and the nfs_rename() code was not properly purging the target file from the cache, resulting in Stale file handle errors during, e.g. a buildworld with an NFS-mounted /usr/obj.
* Fix a TTY session issue. Programs which open("/dev/tty", ...) and then run the TIOCNOTTY ioctl were causing the system to lose track of the open count, preventing the tty from properly detaching. This is actually a very old BSD bug, but it came out of the woodwork in DragonFly because I am now attempting to track device opens explicitly.
* Gets rid of the vnode interlock. The lockmgr interlock remains.
* Introduced VX locks, which are mandatory vp->v_lock based locks.
* Rewrites the locking semantics for deactivation and reclamation. (A ref'd VX lock'd vnode is now required for vgone(), VOP_INACTIVE, and VOP_RECLAIM). New guarantees emplaced with regard to vnode ripouts.
* Recodes the mountlist scanning routines to close timing races.
* Recodes getnewvnode to close timing races (it now returns a VX locked and refd vnode rather than a refd but unlocked vnode).
* Recodes VOP_REVOKE- a locked vnode is now mandatory.
* Recodes all VFS inode hash routines to close timing holes.
* Removes cache_leaf_test() - vnodes representing intermediate directories are now held so the leaf test should no longer be necessary.
* Splits the over-large vfs_subr.c into three additional source files, broken down by major function (locking, mount related, filesystem syncer).
* Changes splvm() protection to a critical-section in a number of places (bleedover from another patch set which is also about to be committed).

Known issues not yet resolved:
* Possible vnode/namecache deadlocks.
* While most filesystems now use vp->v_lock, I haven't done a final pass to make vp->v_lock mandatory and to clean up the few remaining inode based locks (nwfs I think and other obscure filesystems).
* NullFS gets confused when you hit a mount point in the underlying filesystem.
* Only UFS and NFS have been well tested.
* NFS is not properly timing out namecache entries, causing changes made on the server to not be properly detected on the client if the client already has a negative-cache hit for the filename in question.

Testing-by: David Rhodus <email@example.com>, Peter Kadau <firstname.lastname@example.org>, walt <email@example.com>, others
VFS messaging/interfacing work stage 2/99. This stage retools the vnode ops vector dispatch, making the vop_ops a per-mount structure rather than a per-filesystem structure. Filesystem mount code, typically in blah_vfsops.c, must now register various vop_ops pointers in the struct mount to compile its VOP operations set. This change will allow us to begin adding per-mount hooks to VFSes to support things like kernel-level journaling, various forms of cache coherency management, and so forth. In addition, the vop_*() calls now require a struct vop_ops pointer as the first argument instead of a vnode pointer (note: in this commit the VOP_*() macros currently just pull the vop_ops pointer from the vnode in order to call the vop_*() procedures). This change is intended to allow us to divorce ourselves from the requirement that a vnode pointer always be part of a VOP call. In particular, this will allow namespace based routines such as remove(), mkdir(), stat(), and so forth to pass namecache pointers rather than locked vnodes and is a very important precursor to the goal of using the namecache for namespace locking.
(From Alan): Correct a very old error in both vm_object_madvise() (originating in vm/vm_object.c revision 1.88) and vm_object_sync() (originating in vm/vm_map.c revision 1.36): When descending a chain of backing objects, both use the wrong object's backing offset. Consequently, both may operate on the wrong pages. (From Matt): In DragonFly the code needing correction is in vm_object_madvise() and vm_map_clean() (that code in vm_map_clean() was moved to vm_object_sync() in FreeBSD-5 hence the FreeBSD-5 correction made by Alan was slightly different). The madvise case could produce corrupted user memory when MADV_FREE was used, primarily on server-forked processes (where shadow objects exist) PLUS a special set of additional circumstances: (1) The deeper shadow layers had to no longer be shared, (2) Either the memory had been swapped out in deeper shadow layers (not just the first shadow layer), resulting in the wrong swap space being freed, or (3) the forked memory had not yet been COW'd (and the deeper shadow layer is no longer shared) AND also had not yet been collapsed back into the parent (e.g. the original parent and/or other forks had exited and/or the memory had been isolated from them already). This bug could be responsible for all of the sporadic madvise oddness that has been reported over the years, especially in earlier days when systems had less memory and paged to swap a lot more than they do today. These weird failure cases have led people to generally not use MADV_FREE (in particular the 'H' malloc.conf option) as much as they could. Also note that I tightened up the VM object collapse code considerably in FreeBSD-4.x making the failure cases above even less likely to occur.
The vm_map_clean() (vm_object_sync() in FreeBSD-5) case is not likely to produce failures and it might not even be possible for it to occur in the first place since it requires PROT_WRITE mapped vnodes to exist in a backing object, which either might not be possible or might only occur under extraordinary circumstances. Plus the worst that happens is that the vnode's data doesn't get written out immediately (but always will later on). Kudos to Alan for finding this old bug! Noticed and corrected by: Alan Cox <firstname.lastname@example.org> See also: FreeBSD vm_object.c/1.329
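To illustrate why the choice of backing offset matters when descending a shadow chain, here is a small sketch using a hypothetical, stripped-down object structure. It assumes each object records its own offset into the object directly beneath it; the log does not spell out exactly which object the old code consulted, so this only shows the accumulation that yields the right page at each depth.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified VM object: only the fields needed here. */
struct obj {
    const struct obj *backing_object;  /* next object down the shadow chain */
    unsigned long backing_offset;      /* this object's offset within backing_object */
};

/*
 * Translate an offset in `top` to the equivalent offset `depth` levels
 * down the chain. At each step the offset added is the one stored in
 * the object we are currently in, taken before stepping down.
 */
static unsigned long
offset_at_depth(const struct obj *top, unsigned long offset, int depth)
{
    const struct obj *o = top;

    while (depth-- > 0) {
        assert(o->backing_object != NULL);
        offset += o->backing_offset;   /* current object's offset */
        o = o->backing_object;         /* then descend */
    }
    return offset;
}
```

Adding any other object's offset at a given step produces an address that still looks valid but names a different page, which matches the "operate on the wrong pages" symptom described above.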
Adjust gd_vme_avail after ensuring that sufficient entries exist rather than before. This should solve a panic where the userland vm_map_entry_reserve() was eating out of the kernel's reserve and causing a recursive zalloc() to panic.
Fix a device pager leak for the case where the page already exists in the VM object (typical case: multiple mappings of the device?). If the page already exists we simply update its physical address. It is unclear whether the physical address would ever actually be different, however. This is an untested patch. Original-patch-written-by: Christian Zander @ NVIDIA Workaround-suggested-by: Tor Egge <email@example.com> Submitted-by: Emiel Kollof <firstname.lastname@example.org>
Bring in the fictitious page wiring bug fixes from FreeBSD-5. Make additional major changes to the APIs to clean them up (so this commit is substantially different than what was committed to FreeBSD-5). Obtained-from: Alan Cox <email@example.com> (FreeBSD-5)
Close an interrupt race between vm_page_lookup() and (typically) a vm_page_sleep_busy() check by using the correct spl protection. An interrupt can occur in between the two operations and unbusy/free the page in question, causing the busy check to fail and the code to fall through and then operate on a page that may have been freed and possibly even reused. Also note that vm_page_grab() had the same issue between the lookup, busy check, and vm_page_busy() call. Close an interrupt race when scanning a VM object's memq. Interrupts can free pages, removing them from memq, which interferes with memq scans and can cause a page unassociated with the object to be processed as if it were associated with the object. Calls to vm_page_hold() and vm_page_unhold() require spl protection. Rename the passed socket descriptor argument in sendfile() to make the code more readable. Fix several serious bugs in procfs_rwmem(). In particular, force it to block if a page is busy and then retry. Get rid of vm_pager_map_pag() and vm_pager_unmap_page(), make the functions that used to use these routines use SFBUF's instead. Get rid of the (userland?) 4MB page mapping feature in pmap_object_init_pt() for now. The code appears to not track the page directory properly and could result in a non-zero page being freed as PG_ZERO. This commit also includes updated code comments and some additional non-operational code cleanups.
Bring in the following revs from FreeBSD-4:
18.104.22.168 +3 -2 src/sys/i386/i386/pmap.c
22.214.171.124 +2 -2 src/sys/vm/pmap.h
126.96.36.199 +3 -2 src/sys/vm/vm_map.c
Suggested-by: Alan Cox <firstname.lastname@example.org>
msync(..., MS_INVALIDATE) will incorrectly remove dirty pages without synchronizing them to their backing store under certain circumstances, and can also cause struct buf's to become inconsistent. This can be particularly gruesome when MS_INVALIDATE is used on a range of memory that is mmap()'d to be read-only. Fix MS_INVALIDATE's operation (1) by making UFS honor the invalidation request when flushing to backing store to destroy the related struct buf and (2) by never removing pages wired into the buffer cache and never removing pages that are found to still be dirty. Note that NFS was already coded to honor invalidation requests in nfs_write(). Filesystems other than NFS and UFS do not currently support buffer-invalidation-on-write but all that means now is that the pages will remain in cache, rather than be incorrectly removed and cause corruption. Reported-by: Stephan Uphoff <email@example.com>, Julian Elischer <firstname.lastname@example.org>
ANSIfication (procedure args) cleanup. Submitted-by: Andre Nathan <email@example.com>
In an rfork'd or vfork'd situation where multiple processes are sharing the same vmspace, and one process goes zombie, the vmspace's vm_exitingcnt will be non-zero. If another process then forks or execs the exitingcnt will be improperly inherited by the new vmspace. The solution is to not copy exitingcnt when copying to a new vmspace. Additionally, for DragonFly, I also had to fix a few cases where the upcall list was also being improperly inherited. Heads-up-by: Xin LI <firstname.lastname@example.org> Obtained-From: Peter Wemm <email@example.com> (FreeBSD-5)
Newtoken commit. Change the token implementation as follows: (1) Obtaining a token no longer enters a critical section. (2) tokens can be held through scheduler switches and blocking conditions and are effectively released and reacquired on resume. Thus tokens serialize access only while the thread is actually running. Serialization is not broken by preemptive interrupts. That is, interrupt threads which preempt do not release the preempted thread's tokens. (3) Unlike spl's, tokens will interlock w/ interrupt threads on the same or on a different cpu. The vnode interlock code has been rewritten and the API has changed. The mountlist vnode scanning code has been consolidated and all known races have been fixed. The vnode interlock is now a pool token. The code that frees unreferenced vnodes whose last VM page has been freed has been moved out of the low level vm_page_free() code and moved to the periodic filesystem syncer code in vfs_msync(). The SMP startup code and the IPI code has been cleaned up considerably. Certain early token interactions on AP cpus have been moved to the BSP. The LWKT rwlock API has been cleaned up and turned on. Major testing by: David Rhodus
Resident executable support stage 1/4: Add kernel bits and syscall support for in-kernel caching of vmspace structures. The main purpose of this feature is to make it possible to run dynamically linked programs as fast as if they were statically linked, by vmspace_fork()ing their vmspace and saving the copy in the kernel, then using that whenever the program is exec'd.
Retool the M_* flags to malloc() and the VM_ALLOC_* flags to vm_page_alloc(), and vm_page_grab() and friends. The M_* flags now have more flexibility, with the intent that we will start using some of it to deal with NULL pointer return problems in the codebase (CAM is especially bad at dealing with unexpected return values). In particular, add M_USE_INTERRUPT_RESERVE and M_FAILSAFE, and redefine M_NOWAIT as a combination of M_ flags instead of its own flag. The VM_ALLOC_* macros are now flags (0x01, 0x02, 0x04) rather than states (1, 2, 3), which allows us to create combinations that the old interface could not handle.
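The difference between states and flags can be sketched as follows. The constant names and the helper are illustrative assumptions, not the kernel's actual VM_ALLOC_* definitions; the point is only that power-of-2 values can be OR'd together and tested individually, which sequential state values (1, 2, 3) cannot.

```c
#include <assert.h>

/* Illustrative values only; the real VM_ALLOC_* names and bits may differ. */
enum {
    ALLOC_NORMAL    = 0x01,
    ALLOC_SYSTEM    = 0x02,
    ALLOC_INTERRUPT = 0x04
};

/* Return nonzero if every bit in `want` is also set in `flags`. */
static int
alloc_has(int flags, int want)
{
    assert(want != 0);
    return (flags & want) == want;
}
```

With states, the value 3 would be ambiguous between "state three" and "one plus two"; with distinct bits, ALLOC_NORMAL | ALLOC_INTERRUPT is a well-defined combination.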
vm_uiomove() is a VFS_IOOPT related procedure, conditionalize it appropriately.
Clean up the vm_map_entry_[k]reserve/[k]release() API. This API is used to guarantee that sufficient vm_map_entry structures are available for certain atomic VM operations. The kreserve/krelease API is only supposed to be used to dig into the kernel reserve, used when zalloc() must recurse into kmem_alloc() in order to allocate a new page. Without this we can get into a kmem_alloc -> zalloc -> kmem_alloc deadlock. kreserve/krelease was being used improperly in my original work, causing it to be unable to guarantee the reserve and resulting in an occasional panic. This commit converts the improper usage back to using the non-k version of the API and (should) properly handle the zalloc() recursion case. Reported-by: David Rhodus <firstname.lastname@example.org>
Merge from FreeBSD: alc 2003/12/26 13:54:45 PST FreeBSD src repository Modified files: sys/vm vm_map.c Log: Minor correction to revision 1.258: Use the proc pointer that is passed to vm_map_growstack() in the RLIMIT_VMEM check rather than curthread. Revision Changes Path 1.324 +1 -2 src/sys/vm/vm_map.c
* Prevent leakage of wired pages by setting start_entry during vm_map_wire().
Implement an upcall mechanism to support userland LWKT. This mechanism will allow multiple processes sharing the same VM space (aka clone/threading) to send each other what are basically IPIs. Two new system calls have been added, upc_register() and upc_control(). Documentation is forthcoming. The upcalls are nicely abstracted and a program can register as many as it wants up to the kernel limit (which is 32 at the moment). The upcalls will be used for passing asynch data from kernel to userland, such as asynch syscall message replies, for thread preemption timing, software interrupts, IPIs between virtual cpus (e.g. between the processes that are sharing the single VM space).
Entirely remove the old kernel malloc and kmem_map code. The slab allocator is now mandatory. Also remove the related conf options, USE_KMEM_MAP and NO_SLAB_ALLOCATOR.
Rename:
- vm_map_pageable() -> vm_map_wire()
- vm_map_user_pageable() -> vm_map_unwire()
Remove the NO_KMEM_MAP and USE_SLAB_ALLOCATOR kernel options. Temporarily add the USE_KMEM_MAP and NO_SLAB_ALLOCATOR kernel options, which developers should generally not use. We now use the slab allocator (and no kmem_map) by default.
SLAB ALLOCATOR Stage 1. This brings in a slab allocator written from scratch by yours truly. A detailed explanation of the allocator is included but first, other changes:
* Instead of having vm_map_entry_insert*() and friends allocate the vm_map_entry structures a new mechanism has been emplaced whereby the vm_map_entry structures are reserved at a higher level, then expected to exist in the free pool in deep vm_map code. This preliminary implementation may eventually turn into something more sophisticated that includes things like pmap entries and so forth. The idea is to convert what should be low level routines (VM object and map manipulation) back into low level routines.
* vm_map_entry structures are now per-cpu cached, which is integrated into the reservation model above.
* The zalloc 'kmapentzone' has been removed. We now only have 'mapentzone'.
* There were race conditions between vm_map_findspace() and actually entering the map_entry with vm_map_insert(). These have been closed through the vm_map_entry reservation model described above.
* Two new kernel config options now work. NO_KMEM_MAP has been fleshed out a bit more and a number of deadlocks related to having only the kernel_map now have been fixed. The USE_SLAB_ALLOCATOR option will cause the kernel to compile-in the slab allocator instead of the original malloc allocator. If you specify USE_SLAB_ALLOCATOR you must also specify NO_KMEM_MAP.
* vm_poff_t and vm_paddr_t integer types have been added. These are meant to represent physical addresses and offsets (physical memory might be larger than virtual memory, for example Intel PAE). They are not heavily used yet but the intention is to separate physical representation from virtual representation.

SLAB ALLOCATOR FEATURES

The slab allocator breaks allocations up into approximately 80 zones based on their size. Each zone has a chunk size (alignment). For example, all allocations in the 1-8 byte range will allocate in chunks of 8 bytes. Each size zone is backed by one or more blocks of memory. The size of these blocks is fixed at ZoneSize, which is calculated at boot time to be between 32K and 128K. The use of a fixed block size allows us to locate the zone header given a memory pointer with a simple masking operation.

The slab allocator operates on a per-cpu basis. The cpu that allocates a zone block owns it. free() checks the cpu that owns the zone holding the memory pointer being freed and forwards the request to the appropriate cpu through an asynchronous IPI. This request is not currently optimized but it can theoretically be heavily optimized ('queued') to the point where the overhead becomes inconsequential. As of this commit the malloc_type information is not MP safe, but the core slab allocation and deallocation algorithms, not including having to allocate the backing block, *ARE* MP safe. The core code requires no mutexes or locks, only a critical section.

Each zone contains N allocations of a fixed chunk size. For example, a 128K zone can hold approximately 16000 or so 8 byte allocations. The zone is initially zero'd and new allocations are simply allocated linearly out of the zone. When a chunk is freed it is entered into a linked list and the next allocation request will reuse it. The slab allocator heavily optimizes M_ZERO operations at both the page level and the chunk level.

The slab allocator maintains various undocumented malloc quirks such as ensuring that small power-of-2 allocations are aligned to their size, and malloc(0) requests are also allowed and return a non-NULL result. kern_tty.c depends heavily on the power-of-2 alignment feature and ahc depends on the malloc(0) feature. Eventually we may remove the malloc(0) feature.

PROBLEMS AS OF THIS COMMIT

NOTE! This commit may destabilize the kernel a bit. There are issues with the ISA DMA area ('bounce' buffer allocation) due to the large backing block size used by the slab allocator and there are probably some deadlock issues due to the removal of kmem_map that have not yet been resolved.
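The "simple masking operation" and the chunk rounding can be sketched as below. This is a hedged illustration, not the allocator's code: ZONE_SIZE is fixed at 128K here purely for the example (the log says the real value is computed at boot), the function names are hypothetical, and the sketch assumes zone blocks are aligned to their own size, which is what makes the mask work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed zone block size for illustration; must be a power of 2. */
#define ZONE_SIZE ((uintptr_t)128 * 1024)

/* Round a request up to a multiple of chunk (a power of 2); e.g. 1-8 bytes -> 8. */
static size_t
chunk_round(size_t size, size_t chunk)
{
    assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
    return (size + chunk - 1) & ~(chunk - 1);
}

/*
 * Return the base address of the zone block containing ptr.
 * Valid only if every zone block starts on a ZONE_SIZE boundary.
 */
static uintptr_t
zone_base(uintptr_t ptr)
{
    return ptr & ~(ZONE_SIZE - 1);
}
```

Because the header sits at the block base, free() can find the owning zone (and from it the owning cpu) from the pointer alone, with no lookup table.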
Add the NO_KMEM_MAP kernel configuration option. This is a temporary option that will allow developers to test kmem_map removal and also the upcoming (not this commit) slab allocator. Currently this option removes kmem_map and causes the malloc and zalloc subsystems to use kernel_map exclusively. Change gd_intr_nesting_level. This variable is now only bumped while we are in a FAST interrupt or processing an IPIQ message. This variable is not bumped while we are in a normal interrupt or software interrupt thread. Add warning printf()s if malloc() and related functions detect attempts to use them from within a FAST interrupt or IPIQ. Remove references to the no-longer-used zalloci() and zfreei() functions.
Add an alignment feature to vm_map_findspace(). This feature will be used primarily by the upcoming slab allocator but has many applications. Use the alignment feature in the buffer cache to hopefully reduce fragmentation.
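As a rough illustration of an aligned first-fit search, the sketch below scans a sorted array of used ranges and returns the first aligned start address with enough room. It is not vm_map_findspace(): the data layout, function names, and the power-of-2 restriction on the alignment are all assumptions made for the example, and overflow at the top of the address space is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uintptr_t start, end; };  /* used region [start, end) */

/* Round addr up to a multiple of align; align must be a power of 2. */
static uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/*
 * First-fit search in [min, max) for len free bytes starting on an
 * align boundary. `used` must be sorted by start and non-overlapping.
 * Returns 0 and stores the address in *out, or -1 if nothing fits.
 */
static int
find_space(const struct range *used, size_t n, uintptr_t min,
    uintptr_t max, uintptr_t len, uintptr_t align, uintptr_t *out)
{
    uintptr_t start = align_up(min, align);
    size_t i;

    for (i = 0; i < n; i++) {
        if (used[i].end <= start)
            continue;                   /* range lies entirely below us */
        if (used[i].start >= start && used[i].start - start >= len)
            break;                      /* gap before this range fits */
        start = align_up(used[i].end, align);  /* skip past it, re-align */
    }
    if (start > max || max - start < len)
        return -1;
    *out = start;
    return 0;
}
```

Re-aligning after every skipped range is what distinguishes this from an unaligned first-fit: a gap that is large enough but starts off-boundary may still be rejected.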
__P()!=wanted, clean up the vm subsystem
2003-07-22 Hiten Pandya <email@example.com>
* MFC FreeBSD rev. 1.189 of kern_exit.c (DONE) (shmexit to take vmspace instead of proc) (sort the sys/lock.h include in vm_map.c too)
* MFC FreeBSD rev. 1.143 of kern_sysctl.c (DONE) (don't panic if sysctl is unregistrable)
* Don't panic when enumerating SYSCTL_NODE() without children nodes. (DONE)
* MFC FreeBSD rev. 1.113 of kern_sysctl.c (DONE) (Fix ogetkerninfo() handling for KINFO_BSD_SYSINFO)
* MFC FreeBSD rev. 1.103 of kern_sysctl.c (DONE) (Never reuse AUTO_OID values)
* MFC FreeBSD rev 1.21 of i386/include/bus_dma.h (BUS_DMAMEM_NOSYNC -> BUS_DMA_COHERENT)
* MFC FreeBSD rev. 1.19 of i386/include/bus_dma.h (DONE) (Implement real read/write barriers for i386)
Submitted by: Hiten Pandya <hmp@FreeBSD.ORG>
Remove the priority part of the priority|flags argument to tsleep(). Only flags are passed now. The priority was a user scheduler thingy that is not used by the LWKT subsystem. For process statistics assume sleeps without P_SINTR set to be disk-waits, and sleeps with it set to be normal sleeps. This commit should not contain any operational changes.
MP Implementation 1/2: Get the APIC code working again, sweetly integrate the MP lock into the LWKT scheduler, replace the old simplelock code with tokens or spin locks as appropriate. In particular, the vnode interlock (and most other interlocks) are now tokens. Also clean up a few curproc/cred sequences that are no longer needed. The APs are left in degenerate state with non IPI interrupts disabled as additional LWKT work must be done before we can really make use of them, and FAST interrupts are not managed by the MP lock yet. The main thing for this stage was to get the system working with an APIC again. buildworld tested on UP and 2xCPU/MP (Dell 2550)
Split the struct vmmeter cnt structure into a global vmstats structure and a per-cpu cnt structure. Adjust the sysctls to accumulate statistics over all cpus.
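A minimal sketch of accumulating per-cpu counters on read, assuming a hypothetical two-field counter structure. The field and function names are illustrative and do not correspond to the actual vmmeter layout or sysctl handlers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cpu counters; field names are illustrative. */
struct cpu_cnt {
    unsigned long faults;
    unsigned long syscalls;
};

/* Sum the per-cpu counters into a single total for reporting. */
static struct cpu_cnt
cnt_accumulate(const struct cpu_cnt *percpu, size_t ncpus)
{
    struct cpu_cnt total = { 0, 0 };
    size_t i;

    assert(percpu != NULL || ncpus == 0);
    for (i = 0; i < ncpus; i++) {
        total.faults += percpu[i].faults;
        total.syscalls += percpu[i].syscalls;
    }
    return total;
}
```

Each cpu only ever increments its own structure, so the hot path needs no shared writes; the cost of combining is paid by the comparatively rare reader.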
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note however that ultimately both APIs are going to be message-passing which means the current thread context will not be useable for creds and descriptor access.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 188.8.131.52