Keyword substitution: kv
Default branch: MAIN
Fix numerous pageout daemon -> buffer cache deadlocks in the main system. These issues usually only occur on systems with small amounts of RAM, but it is possible to trigger them on any system.
* Get rid of the IO_NOBWILL hack. Just have the VN device use IO_DIRECT, which will clean out the buffer on completion of the write.
* Add a timeout argument to vm_wait().
* Add a thread->td_flags flag called TDF_SYSTHREAD. kmalloc()'s made from designated threads are allowed to dip into the system reserve when allocating pages. Only the pageout daemon and buf_daemon[_hw] use the flag.
* Add a new static procedure, recoverbufpages(), which explicitly tries to free buffers and their backing pages on the clean queue.
* Add a new static procedure, bio_page_alloc(), to do all the nasty work of allocating a page on behalf of a buffer cache buffer. This function will call vm_page_alloc() with VM_ALLOC_SYSTEM to allow it to dip into the system reserve. If the allocation fails this function will call recoverbufpages() to try to recycle VM pages from clean buffer cache buffers, and will then attempt to reallocate using VM_ALLOC_SYSTEM | VM_ALLOC_INTERRUPT to allow it to dip into the interrupt reserve as well. Warnings will blare on the console. If the effort still fails we sleep for 1/20 of a second and retry. The idea though is for all the effort above to not result in a failure at the end.
Reported-by: Gergo Szakal <firstname.lastname@example.org>
Fix many bugs and issues in the VM system, particularly related to heavy paging.
* (cleanup) PG_WRITEABLE is now set by the low level pmap code and not by high level code. It means 'This page may contain a managed page table mapping which is writeable', meaning that hardware can dirty the page at any time. The page must be tested via appropriate pmap calls before being disposed of.
* (cleanup) PG_MAPPED is now handled by the low level pmap code and only applies to managed mappings. There is still a bit of cruft left over related to the pmap code's page table pages but the high level code is now clean.
* (bug) Various XIO, SFBUF, and MSFBUF routines which bypass normal paging operations were not properly dirtying pages when the caller intended to write to them.
* (bug) vfs_busy_pages in kern/vfs_bio.c had a busy race. Separate the code out to ensure that we have marked all the pages as undergoing IO before we call vm_page_protect(). vm_page_protect(... VM_PROT_NONE) can block under very heavy paging conditions and if the pages haven't been marked for IO that could blow up the code.
* (optimization) Make a minor optimization. When busying pages for write IO, downgrade the page table mappings to read-only instead of removing them entirely.
* (bug) In platform/pc32/i386/pmap.c fix various places where pmap_inval_add() was being called at the wrong point. Only one was critical, in pmap_enter(), where pmap_inval_add() was being called so far away from the pmap entry being modified that it could wind up being flushed out prior to the modification, breaking the cpusync required. pmap.c also contains most of the work involved in the PG_MAPPED and PG_WRITEABLE changes.
* (bug) Close numerous pte updating races with hardware setting the modified bit. There is still one race left (in pmap_enter()).
* (bug) Disable pmap_copy() entirely. Fix most of the bugs anyway, but there is still one left in the handling of the srcmpte variable.
* (cleanup) Change vm_page_dirty() from an inline to a real procedure, and move the code which set the object to writeable/maybedirty into vm_page_dirty().
* (bug) Calls to vm_page_protect(... VM_PROT_NONE) can block. Fix all cases where this call was made with a non-busied page. All such calls are now made with a busied page, preventing blocking races from re-dirtying or remapping the page unexpectedly. (Such blockages could only occur during heavy paging activity where the underlying page table pages are being actively recycled).
* (bug) Fix the pageout code to properly mark pages as undergoing I/O before changing their protection bits.
* (bug) Busy pages undergoing zeroing or partial zeroing in the vnode pager (vm/vnode_pager.c) to avoid unexpected effects.
MFC - Fix a bug in umtx_sleep().
Fix a bug in umtx_sleep(). This function sleeps on the mutex's physical address and will get lost if the physical page underlying the VM address is copied on write. This case can occur when a threaded program fork()'s. Introduce a VM page event notification mechanism and use it to wake-up the umtx_sleep() if the underlying page takes a COW fault. Reported-by: Jordan Gordeev <email@example.com>, "Simon 'corecode' Schubert" <corecode@xxxxxxxxxxxx>
Fix a bug in vnode_pager_generic_getpages(). This function was improperly setting m->valid to 0 and was also improperly trying to free the page after it had potentially become wired by the buffer cache. Add a sysctl to UFS that allows us to force it to call vop_stdgetpages() for debugging purposes.
Implement struct lwp->lwp_vmspace. This allows vkernels to run threaded and to run emulated VM spaces on a per-thread basis. struct proc->p_vmspace is left intact, making it easy to switch into and out of an emulated VM space. This is needed for the virtual kernel SMP work. This also gives us the flexibility to run emulated VM spaces in their own threads, or in a limited number of separate threads. Linux does this and they say it improved performance. I don't think it necessarily improved performance but it's nice to have the flexibility to do it in the future.
Implement vm_fault_object_page(). This function returns a held VM page for the specified offset in the specified object and does all I/O necessary to validate the page (as if it had been faulted in). This function allows us to bypass the vm_map*() code when all we want is the VM page.
Fix the recently committed (and described) page writability nerf. The real kernel was unconditionally mapping writable pages read-only on read faults in order to be able to take another fault on a write attempt. This was needed for early virtual kernel support in order to set the Modify bit in the virtualized page table, but was being applied to ALL mappings rather than just those installed by the virtual kernel. Now the real kernel only does this for virtual kernel mappings. Additionally, the real kernel no longer makes the page read-only when clearing the Modify bit in the real page table (in order to rearm the write fault). When this case occurs VPTE_M has already been set in the virtual page table and no re-fault is required. The virtual kernel now only needs to invalidate the real kernel's page mapping when clearing the virtualized Modify bit in the virtual page table (VPTE_M), in order to rearm the real kernel's write fault so it can detect future modifications via the virtualized Modify bit. Also, the virtual kernel no longer needs to install read-only pages to detect the write fault. This allows the real kernel to do ALL the work required to handle VPTE_M and make the actual page writable. This greatly reduces the number of real page faults that occur and greatly reduces the number of page faults which have to be passed through to the virtual kernel. This fix reduces fork() overhead for processes running under a virtual kernel by 70%, from around 2100uS to around 650uS.
Replace remaining uses of vm_fault_quick() with vm_fault_page_quick(). Do not directly access userland virtual addresses in the kernel UMTX code.
Fix a bug in vm_fault_page(). PG_MAPPED was not getting set, causing the system to fail to remove pmap entries related to a VM page when reusing the VM page. General cleaning of vm_fault*() routines. These routines now expect all appropriate VM_PROT_* flags to be specified instead of just one. Also clean up the VM_FAULT_* flags. Remove VM_FAULT_HOLD - it is no longer used. vm_fault_page() handles the functionality in a far cleaner fashion than vm_fault().
Add a missing pmap_enter() in vm_fault_page(). If a write fault does a COW and must replace a read-only page, the pmap must be updated so the process sees the new page.
Implement vm_fault_page_quick(), which will soon be replacing vm_fault_quick(). vm_fault_quick() does not hold the underlying page in any way and is not SMP friendly. It also uses architecture-specific tricks to force the page into a pmap which do not work with the VKERNEL.
Modify the trapframe sigcontext, ucontext, etc. Add %gs to the trapframe and xflags and an expanded floating point save area to sigcontext/ucontext so traps can be fully specified. Remove all the %gs hacks in the system code and signal trampoline and handle %gs faults natively, like we do %fs faults. Implement writebacks to the virtual page table to set VPTE_M and VPTE_A and add checks for VPTE_R and VPTE_W. Consolidate the TLS save area into a MD structure that can be accessed by MI code. Reformulate the vmspace_ctl() system call to allow an extended context to be passed (for TLS info and soon the FP and eventually the LDT). Adjust the GDB patches to recognize the new location of %gs. Properly detect non-exception returns to the virtual kernel when the virtual kernel is running an emulated user process and receives a signal. And misc other work on the virtual kernel.
Add a new procedure, vm_fault_page(), which does all actions related to faulting in a VM page given a vm_map and virtual address, including any necessary I/O, but returns the held page instead of entering it into a pmap. Use the new function in procfs_rwmem, allowing gdb to 'see' memory that is governed by a virtual page table.
1:1 Userland threading stage 2.10/4: Separate p_stats into p_ru and lwp_ru. proc.p_ru keeps track of all statistics directly related to a proc. This consists of RSS usage and nswap information and aggregate numbers for all former lwps of this proc. proc.p_cru is the sum of all stats of reaped children. lwp.lwp_ru contains the stats directly related to one specific lwp, meaning packet, scheduler switch or page fault counts, etc. This information gets added to lwp.lwp_proc.p_ru when the lwp exits.
Make kernel_map, buffer_map, clean_map, exec_map, and pager_map direct structural declarations instead of pointers. Clean up all related code, in particular kmem_suballoc(). Remove the offset calculation for kernel_object. kernel_object's page indices used to be relative to the start of kernel virtual memory in order to improve the performance of VM page scanning algorithms. The optimization is no longer needed now that VM objects use Red-Black trees. Removal of the offset simplifies a number of calculations and makes the code more readable.
Introduce globals: KvaStart, KvaEnd, and KvaSize. Used by the kernel instead of the nutty VADDR and VM_*_KERNEL_ADDRESS macros. Move extern declarations for these variables as well as for virtual_start, virtual_end, and phys_avail from MD headers to MI headers. Make kernel_object a global structure instead of a pointer. Remove kmem_object and all related code (none of it is used any more).
Rename printf -> kprintf in sys/ and add some defines where necessary (files which are used in userland, too).
Collapse some bits of repetitive code into their own procedures and allocate a maximally sized default object to back MAP_VPAGETABLE mappings, allowing us to access logical memory beyond the size of the original mmap() call by programming the page table to point at it. This gives us an abstraction and capability similar to a real kernel's ability to map e.g. 2GB of physical memory into its 1GB address space.
More cleanups + fix a bug when taking a write fault on a mapping that uses a virtual page table. The page was not being pmap'd with the correct permissions.
MAP_VPAGETABLE support part 3/3. Implement a new system call called mcontrol() which is an extension of madvise(), adding an additional 64 bit argument. Add two new advisories, MADV_INVAL and MADV_SETMAP. MADV_INVAL will invalidate the pmap for the specified virtual address range. You need to do this for the virtual addresses affected by changes made in a virtual page table. MADV_SETMAP sets the top-level page table entry for the virtual page table governing the mapped range. It only works for memory governed by a virtual page table and strange things will happen if you only set the root page table entry for part of the virtual range. Further refine the virtual page table format. Keep with 32 bit VPTE's for the moment, but properly implement VPTE_PS and VPTE_V. VPTE_PS can be used to support 4MB linear maps in the top level page table and it can also be used when specifying the 'root' VPTE to disable the page table entirely and just linear map the backing store. VPTE_V is the 'valid' bit (before it was inverted, now it is normal).
MAP_VPAGETABLE support part 2/3. Implement preliminary virtual page table handling code in vm_fault. This code is strictly temporary so subsystem and userland interactions can be tested, but the real code will be very similar.
MAP_VPAGETABLE support part 1/3. Reorganize vm_fault() to get more direct access to the VM page resolved by a VM fault. Move vm_fault()'s core shadow object traversal and fault I/O code to a new procedure called vm_fault_object(). Begin adding support for memory mappings which are backed by a virtualized page table under userland control.
Remove the (unused) copy-on-write support for a vnode's VM object. This support originally existed to support the badly implemented and severely hacked ENABLE_VFS_IOOPT I/O optimization which was removed long ago. This also removes a bunch of cross-module pollution in UFS.
Fix a null pointer indirection: the VM fault rate limiting code only applies to processes.
Remove the thread pointer argument to lockmgr(). All lockmgr() ops use the current thread. Move the lockmgr code in BUF_KERNPROC to lockmgr_kernproc(). This code allows the lock owner to be set to a special value so any thread can unlock the lock and is required for B_ASYNC I/O so biodone() can release the lock.
Remove the now unused interlock argument to the lockmgr() procedure. This argument has been abused over the years by kernel programmers attempting to optimize certain locking and data modification sequences, resulting in virtually unreadable code in some cases. The interlock also made porting between BSDs difficult as each BSD implemented their interlock differently. DragonFly has slowly removed use of the interlock argument and we can now finally be rid of it entirely.
Implement a VM load heuristic. sysctl vm.vm_load will return an indication of the load on the VM system in the range 0-1000. Implement a page allocation rate limit in vm_fault which is based on vm_load, and enabled via vm.vm_load_enable (default on). As the system becomes more and more memory bound, those processes whose page faults require a page allocation will start to allocate pages in smaller bursts and with greater and greater enforced delays, up to 1/10 of a second. Implement vm.vm_load_debug (for kernels with INVARIANTS), which outputs the burst calculations to the console when enabled. Increase the minimum guaranteed run time without swapping from 2 to 15 seconds.
Make tsleep/wakeup() MP SAFE for kernel threads and get us closer to making it MP SAFE for user processes. Currently the code is operating under the rule that access to a thread structure requires cpu locality of reference, and access to a proc structure requires the Big Giant Lock. The two are not mutually exclusive so, for example, tsleep/wakeup on a proc needs both cpu locality of reference *AND* the BGL. This was true with the old tsleep/wakeup and has now been documented.
The new tsleep/wakeup algorithm is quite simple in concept. Each cpu has its own ident based hash table and each hash slot has a cpu mask which tells wakeup() which cpu's might have the ident. A wakeup iterates through all candidate cpus simply by chaining the IPI message through them until either all candidate cpus have been serviced, or (with wakeup_one()) the requested number of threads have been woken up.
Other changes made in this patch set:
* The sense of P_INMEM has been reversed. It is now P_SWAPPEDOUT. Also, P_SWAPPING and P_SWAPINREQ are no longer relevant and have been removed.
* The swapping code has been cleaned up and seriously revamped. The new swapin code staggers swapins to give the VM system a chance to respond to new conditions. Also some lwp-related fixes were made (more p_rtprio vs lwp_rtprio confusion).
* As mentioned above, tsleep/wakeup have been rewritten. The process p_stat no longer does crazy transitions from SSLEEP to SSTOP. There is now only SSLEEP and SSTOP is synthesized from P_SWAPPEDOUT for userland consumption. Additionally, tsleep() with PCATCH will NO LONGER STOP THE PROCESS IN THE TSLEEP CALL. Instead, the actual stop is deferred until the process tries to return to userland. This removes all remaining cases where a stopped process can hold a locked kernel resource.
* A P_BREAKTSLEEP flag has been added. This flag indicates when an event occurs that is allowed to break a tsleep with PCATCH. All the weird undocumented setrunnable() rules have been removed and replaced with a very simple algorithm based on this flag.
* Since the UAREA is no longer swapped, we no longer faultin() on PHOLD(). This also incidentally fixes the 'ps' command's tendency to try to swap all processes back into memory.
* speedup_syncer() no longer does hackish checks on proc0's tsleep channel (td_wchan).
* Userland scheduler acquisition and release has now been tightened up and KKASSERT's have been added (one of the bugs Stefan found was related to an improper lwkt_schedule() that was found by one of the new assertions). We also have added other assertions related to expected conditions.
* A serious race in pmap_release_free_page() has been corrected. We no longer couple the object generation check with a failed pmap_release_free_page() call. Instead the two conditions are checked independently. We no longer loop when pmap_release_free_page() succeeds (it is unclear how that could ever have worked properly).
Major testing by: Stefan Krueger <firstname.lastname@example.org>
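
The per-slot cpu mask idea from the tsleep/wakeup description can be sketched like this. It is a rough user-space illustration under stated assumptions: the ex_* names, the table size, the hash function, and the use of one shared mask array are all invented for the example and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define EX_HASH_SIZE 64 /* hypothetical number of hash slots */

/* One mask per slot: bit N set means cpu N may have a sleeper there. */
static uint32_t ex_slot_cpumask[EX_HASH_SIZE];

/* Map a sleep ident (an address) to a hash slot. */
unsigned
ex_hash(const void *ident)
{
	return (unsigned)(((uintptr_t)ident >> 4) % EX_HASH_SIZE);
}

/* Called on the sleeping side: record that 'cpu' queued this ident. */
uint32_t
ex_note_sleeper(const void *ident, int cpu)
{
	assert(cpu >= 0 && cpu < 32);
	ex_slot_cpumask[ex_hash(ident)] |= (uint32_t)1 << cpu;
	return ex_slot_cpumask[ex_hash(ident)];
}

/*
 * Called on the wakeup side: the set of cpus worth visiting.
 * Different idents can share a slot, so this is a superset of the
 * cpus that really hold the ident ("might have", as the text says).
 */
uint32_t
ex_wakeup_candidates(const void *ident)
{
	return ex_slot_cpumask[ex_hash(ident)];
}
```

A wakeup would then visit only the cpus whose bits are set and search each one's local hash table; the IPI chaining itself is not modelled here.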
Avoid a recursive kernel fault and subsequent double fault if the VM fault code gets a KVM map_entry with a NULL object. Such entries exist in system maps managed directly by the kernel, such as the buffer cache and kernel_map. Instead, we check for the condition and panic immediately. Programs which access /dev/[k]mem can hit this race/failure. Reported-by: Stefan Krüger <email@example.com>
Try to close an occasional VM page related panic that is believed to occur due to the VM page queues or free lists being indirectly manipulated by interrupts that are not protected by splvm(). Do this by replacing splvm()'s with critical sections in a number of places. Note: some of this work bled over into the "VFS messaging/interfacing work stage 8/99" commit.
Remove an unimplemented advisory function, pmap_pageable(); there is no pmap implementation in existence that requires it to be implemented. Discussed-with: Alan Cox <alc at freebsd.org>, Matthew Dillon <dillon at backplane.com>
Bring in the fictitious page wiring bug fixes from FreeBSD-5. Make additional major changes to the APIs to clean them up (so this commit is substantially different than what was committed to FreeBSD-5). Obtained-from: Alan Cox <firstname.lastname@example.org> (FreeBSD-5)
Get rid of VM_WAIT and VM_WAITPFAULT crud, replace with calls to vm_wait() and vm_waitpfault(). This is a non-operational change. vm_page.c now uses the _vm_page_list_find() inline (which itself is only in vm_page.c) for various critical path operations.
Close an interrupt race between vm_page_lookup() and (typically) a vm_page_sleep_busy() check by using the correct spl protection. An interrupt can occur in between the two operations and unbusy/free the page in question, causing the busy check to fail and the code to fall through and then operate on a page that may have been freed and possibly even reused. Also note that vm_page_grab() had the same issue between the lookup, busy check, and vm_page_busy() call. Close an interrupt race when scanning a VM object's memq. Interrupts can free pages, removing them from memq, which interferes with memq scans and can cause a page unassociated with the object to be processed as if it were associated with the object. Calls to vm_page_hold() and vm_page_unhold() require spl protection. Rename the passed socket descriptor argument in sendfile() to make the code more readable. Fix several serious bugs in procfs_rwmem(). In particular, force it to block if a page is busy and then retry. Get rid of vm_pager_map_page() and vm_pager_unmap_page(), make the functions that used to use these routines use SFBUF's instead. Get rid of the (userland?) 4MB page mapping feature in pmap_object_init_pt() for now. The code appears to not track the page directory properly and could result in a non-zero page being freed as PG_ZERO. This commit also includes updated code comments and some additional non-operational code cleanups.
Move vm_fault_quick() out from the machine specific location as the function is now cpu agnostic.
ANSIfication (procedure args) cleanup. Submitted-by: Andre Nathan <email@example.com>
Newtoken commit. Change the token implementation as follows: (1) Obtaining a token no longer enters a critical section. (2) Tokens can be held through scheduler switches and blocking conditions and are effectively released and reacquired on resume. Thus tokens serialize access only while the thread is actually running. Serialization is not broken by preemptive interrupts. That is, interrupt threads which preempt do not release the preempted thread's tokens. (3) Unlike spl's, tokens will interlock w/ interrupt threads on the same or on a different cpu. The vnode interlock code has been rewritten and the API has changed. The mountlist vnode scanning code has been consolidated and all known races have been fixed. The vnode interlock is now a pool token. The code that frees unreferenced vnodes whose last VM page has been freed has been moved out of the low level vm_page_free() code and moved to the periodic filesystem syncer code in vfs_msync(). The SMP startup code and the IPI code has been cleaned up considerably. Certain early token interactions on AP cpus have been moved to the BSP. The LWKT rwlock API has been cleaned up and turned on. Major testing by: David Rhodus
Retool the M_* flags to malloc() and the VM_ALLOC_* flags to vm_page_alloc(), and vm_page_grab() and friends. The M_* flags now have more flexibility, with the intent that we will start using some of it to deal with NULL pointer return problems in the codebase (CAM is especially bad at dealing with unexpected return values). In particular, add M_USE_INTERRUPT_RESERVE and M_FAILSAFE, and redefine M_NOWAIT as a combination of M_ flags instead of its own flag. The VM_ALLOC_* macros are now flags (0x01, 0x02, 0x04) rather than states (1, 2, 3), which allows us to create combinations that the old interface could not handle.
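
The difference between states and flags can be shown with a small sketch. It assumes one distinct bit per flag; the EX_* names and the helper are hypothetical and only illustrate why bit flags can be combined where sequential state values cannot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags, one bit each, so they can be OR'd together. */
enum {
	EX_ALLOC_NORMAL    = 0x01,
	EX_ALLOC_SYSTEM    = 0x02,
	EX_ALLOC_INTERRUPT = 0x04
};

/* True if every bit in 'wanted' is present in 'flags'. */
bool
ex_alloc_allows(int flags, int wanted)
{
	assert(wanted != 0);
	return (flags & wanted) == wanted;
}
```

With sequential states (1, 2, 3) the value 3 already means "state three", so there is no way to say "state one and state two together". With one bit per flag, EX_ALLOC_SYSTEM | EX_ALLOC_INTERRUPT is a distinct value that can be tested for either component.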
64 bit address space cleanups which are a prerequisite for future 64 bit address space work and PAE. Note: this is not PAE. This patch basically adds vm_paddr_t, which represents a 'physical address'. Physical addresses may be larger than virtual addresses and on IA32 we make vm_paddr_t a 64 bit quantity. Submitted-by: Hiten Pandya <firstname.lastname@example.org>
Rename: - vm_map_pageable() -> vm_map_wire() - vm_map_user_pageable() -> vm_map_unwire()
SLAB ALLOCATOR Stage 1. This brings in a slab allocator written from scratch by yours truly. A detailed explanation of the allocator is included but first, other changes:
* Instead of having vm_map_entry_insert*() and friends allocate the vm_map_entry structures a new mechanism has been emplaced whereby the vm_map_entry structures are reserved at a higher level, then expected to exist in the free pool in deep vm_map code. This preliminary implementation may eventually turn into something more sophisticated that includes things like pmap entries and so forth. The idea is to convert what should be low level routines (VM object and map manipulation) back into low level routines.
* vm_map_entry structures are now per-cpu cached, which is integrated into the reservation model above.
* The zalloc 'kmapentzone' has been removed. We now only have 'mapentzone'.
* There were race conditions between vm_map_findspace() and actually entering the map_entry with vm_map_insert(). These have been closed through the vm_map_entry reservation model described above.
* Two new kernel config options now work. NO_KMEM_MAP has been fleshed out a bit more and a number of deadlocks related to having only the kernel_map have now been fixed. The USE_SLAB_ALLOCATOR option will cause the kernel to compile-in the slab allocator instead of the original malloc allocator. If you specify USE_SLAB_ALLOCATOR you must also specify NO_KMEM_MAP.
* vm_poff_t and vm_paddr_t integer types have been added. These are meant to represent physical addresses and offsets (physical memory might be larger than virtual memory, for example Intel PAE). They are not heavily used yet but the intention is to separate physical representation from virtual representation.
SLAB ALLOCATOR FEATURES
The slab allocator breaks allocations up into approximately 80 zones based on their size. Each zone has a chunk size (alignment). For example, all allocations in the 1-8 byte range will allocate in chunks of 8 bytes. Each size zone is backed by one or more blocks of memory. The size of these blocks is fixed at ZoneSize, which is calculated at boot time to be between 32K and 128K. The use of a fixed block size allows us to locate the zone header given a memory pointer with a simple masking operation.
The slab allocator operates on a per-cpu basis. The cpu that allocates a zone block owns it. free() checks the cpu that owns the zone holding the memory pointer being freed and forwards the request to the appropriate cpu through an asynchronous IPI. This request is not currently optimized but it can theoretically be heavily optimized ('queued') to the point where the overhead becomes inconsequential. As of this commit the malloc_type information is not MP safe, but the core slab allocation and deallocation algorithms, not including having to allocate the backing block, *ARE* MP safe. The core code requires no mutexes or locks, only a critical section.
Each zone contains N allocations of a fixed chunk size. For example, a 128K zone can hold approximately 16000 or so 8 byte allocations. The zone is initially zero'd and new allocations are simply allocated linearly out of the zone. When a chunk is freed it is entered into a linked list and the next allocation request will reuse it. The slab allocator heavily optimizes M_ZERO operations at both the page level and the chunk level.
The slab allocator maintains various undocumented malloc quirks such as ensuring that small power-of-2 allocations are aligned to their size, and malloc(0) requests are also allowed and return a non-NULL result. kern_tty.c depends heavily on the power-of-2 alignment feature and ahc depends on the malloc(0) feature. Eventually we may remove the malloc(0) feature.
PROBLEMS AS OF THIS COMMIT
NOTE! This commit may destabilize the kernel a bit. There are issues with the ISA DMA area ('bounce' buffer allocation) due to the large backing block size used by the slab allocator and there are probably some deadlock issues due to the removal of kmem_map that have not yet been resolved.
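
The masking operation and chunk rounding described for the slab allocator can be sketched as below. This is a minimal illustration, assuming a 128K zone size (the commit says the real value is chosen at boot between 32K and 128K) and assuming zone blocks are aligned to their own size; the ex_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed zone size for the sketch; must be a power of 2. */
#define EX_ZONE_SIZE ((uintptr_t)128 * 1024)

/*
 * Find the start of the zone block containing 'addr'. This only works
 * because blocks are a fixed power-of-2 size and aligned to that size,
 * so clearing the low bits lands on the zone header.
 */
uintptr_t
ex_zone_base(uintptr_t addr)
{
	return addr & ~(EX_ZONE_SIZE - 1);
}

/* Round a request up to the zone's chunk size (a power of 2). */
size_t
ex_round_chunk(size_t size, size_t chunk)
{
	assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
	return (size + chunk - 1) & ~(chunk - 1);
}
```

Under these assumptions a 128K zone divides into 16384 chunks of 8 bytes before subtracting the space taken by the zone header, which is consistent with the "approximately 16000 or so" figure in the text.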
__P()!=wanted, clean up the vm subsystem
Register keyword removal Approved by: Matt Dillon
Split the struct vmmeter cnt structure into a global vmstats structure and a per-cpu cnt structure. Adjust the sysctls to accumulate statistics over all cpus.
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note however that ultimately both APIs are going to be message-passing which means the current thread context will not be useable for creds and descriptor access.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 220.127.116.11