Keyword substitution: kv
Default branch: MAIN
Reorganize the way machine architectures are handled. Consolidate the kernel configurations into a single generic directory. Move machine-specific Makefiles and loader scripts into the appropriate architecture directory. Kernel and module builds also generally add sys/arch to the include path so source files that include architecture-specific headers do not have to be adjusted.
sys/<ARCH> -> sys/arch/<ARCH>
sys/conf/*.<ARCH> -> sys/arch/<ARCH>/conf/*.<ARCH>
sys/<ARCH>/conf/<KERNEL> -> sys/config/<KERNEL>
Reformulate the way the kernel updates the PMAPs in the system when adding a new page table page to expand kernel memory. Keep track of the PMAPs in their own list rather than scanning the process list to locate them. This allows PMAPs managed on behalf of virtual kernels to be properly updated. VM spaces can now be allocated from scratch and may not have a parent template to inherit certain fields from. Make sure these fields are properly cleared.
Add a ton of infrastructure for VKERNEL support. Add code for intercepting traps and system calls, for switching to and executing a foreign VM space, and for accessing trap frames.
Avoid casts as lvalues. Taken-from: FreeBSD
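For illustration only (a sketch, not code from this tree): older GCC releases accepted a cast on the left-hand side of an assignment as an extension, and the portable rewrite does the arithmetic on the cast value and assigns the result back through the variable's own type. The helper name advance_bytes() is hypothetical.

```c
#include <stddef.h>

/*
 * Old, non-portable form (a cast used as an lvalue):
 *     (char *)p += n;
 * Portable form: compute on a char pointer, then assign the result
 * back to the void pointer.
 */
static void *
advance_bytes(void *p, size_t n)
{
	p = (char *)p + n;
	return (p);
}
```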
Ansify the rest of the K&R-style function declarations in sys/i386. Those were somehow[tm] forgotten last time. Noticed-by: corecode While I'm here, perform some stylistic cleanup in math_emulate.c.
MAP_VPAGETABLE support part 2/3. Implement preliminary virtual page table handling code in vm_fault. This code is strictly temporary so subsystem and userland interactions can be tested, but the real code will be very similar.
Adjust pmap_growkernel(), elf_brand_inuse(), and ktrace() to use allproc_scan() instead of scanning the process list manually.
Consolidate the initialization of td_mpcount into lwkt_init_thread(). Fix a bug in kern.trap_mpsafe, the mplock was not being properly released when operating in vm86 mode (when kern.trap_mpsafe was set to 1).
Make tsleep/wakeup() MP SAFE for kernel threads and get us closer to making it MP SAFE for user processes. Currently the code is operating under the rule that access to a thread structure requires cpu locality of reference, and access to a proc structure requires the Big Giant Lock. The two are not mutually exclusive so, for example, tsleep/wakeup on a proc needs both cpu locality of reference *AND* the BGL. This was true with the old tsleep/wakeup and has now been documented. The new tsleep/wakeup algorithm is quite simple in concept. Each cpu has its own ident based hash table and each hash slot has a cpu mask which tells wakeup() which cpus might have the ident. A wakeup iterates through all candidate cpus simply by chaining the IPI message through them until either all candidate cpus have been serviced, or (with wakeup_one()) the requested number of threads have been woken up.
Other changes made in this patch set:
* The sense of P_INMEM has been reversed. It is now P_SWAPPEDOUT. Also, P_SWAPPING, P_SWAPINREQ are no longer relevant and have been removed.
* The swapping code has been cleaned up and seriously revamped. The new swapin code staggers swapins to give the VM system a chance to respond to new conditions. Also some lwp-related fixes were made (more p_rtprio vs lwp_rtprio confusion).
* As mentioned above, tsleep/wakeup have been rewritten. The process p_stat no longer does crazy transitions from SSLEEP to SSTOP. There is now only SSLEEP and SSTOP is synthesized from P_SWAPPEDOUT for userland consumption. Additionally, tsleep() with PCATCH will NO LONGER STOP THE PROCESS IN THE TSLEEP CALL. Instead, the actual stop is deferred until the process tries to return to userland. This removes all remaining cases where a stopped process can hold a locked kernel resource.
* A P_BREAKTSLEEP flag has been added. This flag indicates when an event occurs that is allowed to break a tsleep with PCATCH. All the weird undocumented setrunnable() rules have been removed and replaced with a very simple algorithm based on this flag.
* Since the UAREA is no longer swapped, we no longer faultin() on PHOLD(). This also incidentally fixes the 'ps' command's tendency to try to swap all processes back into memory.
* speedup_syncer() no longer does hackish checks on proc0's tsleep channel (td_wchan).
* Userland scheduler acquisition and release has now been tightened up and KKASSERT's have been added (one of the bugs Stefan found was related to an improper lwkt_schedule() that was found by one of the new assertions). We also have added other assertions related to expected conditions.
* A serious race in pmap_release_free_page() has been corrected. We no longer couple the object generation check with a failed pmap_release_free_page() call. Instead the two conditions are checked independently. We no longer loop when pmap_release_free_page() succeeds (it is unclear how that could ever have worked properly).
Major testing by: Stefan Krueger <email@example.com>
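The hash-slot cpu mask idea described in this entry can be modeled in userland roughly as follows. This is a single-threaded sketch under stated assumptions, not the kernel code: the names (sim_tsleep, sim_wakeup, slot_cpumask), the table sizes, and the hash function are all hypothetical; the IPI chain is replaced by a plain loop over the candidate cpus; and sleepers are counted per hash slot rather than matched by exact ident.

```c
#include <stdint.h>

#define NCPU	4
#define HASHSZ	16

/* Per-slot mask of cpus that may have a sleeper hashed to that slot. */
static uint32_t slot_cpumask[HASHSZ];
/* Per-cpu tables: number of sleepers on each cpu in each slot. */
static int sleepers[NCPU][HASHSZ];

static unsigned
ident_hash(const void *ident)
{
	return ((unsigned)(((uintptr_t)ident >> 4) % HASHSZ));
}

/* Record a sleeper on 'cpu' and advertise that cpu in the slot mask. */
static void
sim_tsleep(int cpu, const void *ident)
{
	unsigned h = ident_hash(ident);

	sleepers[cpu][h]++;
	slot_cpumask[h] |= (uint32_t)1 << cpu;
}

/*
 * Visit only the cpus named in the slot mask. limit == 0 wakes all
 * sleepers; limit > 0 stops early, like wakeup_one(). Returns the
 * number of sleepers woken.
 */
static int
sim_wakeup(const void *ident, int limit)
{
	unsigned h = ident_hash(ident);
	int cpu, woken = 0;

	for (cpu = 0; cpu < NCPU; cpu++) {
		if ((slot_cpumask[h] & ((uint32_t)1 << cpu)) == 0)
			continue;	/* this cpu cannot hold the ident */
		while (sleepers[cpu][h] > 0 && (limit == 0 || woken < limit)) {
			sleepers[cpu][h]--;
			woken++;
		}
		if (sleepers[cpu][h] == 0)
			slot_cpumask[h] &= ~((uint32_t)1 << cpu);
		if (limit != 0 && woken >= limit)
			break;
	}
	return (woken);
}
```

The point of the mask is that a wakeup never has to touch a cpu whose bit is clear for the slot.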
Adjust the globaldata initialization code to accommodate globaldata structures which exceed PAGE_SIZE.
Allow 'options SMP' *WITHOUT* 'options APIC_IO'. That is, an ability to produce an SMP-capable kernel that uses the PIC/ICU instead of the IO APICs for interrupt routing. SMP boxes with broken BIOSes (namely my Shuttle XPC SN95G5) could very well have serious interrupt routing problems when operating in IO APIC mode. One solution is to not use the IO APICs. That is, to run only the Local APICs for the SMP management.
* Don't conditionalize NIDT. Just set it to 256.
* Make the ICU interrupt code MP SAFE. This primarily means using the imen_spinlock to protect accesses to icu_imen.
* When running SMP without APIC_IO, set the LAPIC TPR to prevent unintentional interrupts. Leave LINT0 enabled (normally with APIC_IO LINT0 is disabled when the IO APICs are activated). LINT0 is the virtual wire between the 8259 and LAPIC 0.
* Get rid of NRSVIDT. Just use IDT_OFFSET instead.
* Clean up all the APIC_IO tests which should have been SMP tests, and all the SMP tests which should have been APIC_IO tests. Explicitly #ifdef out all code related to the IO APICs when APIC_IO is not set.
ICU/APIC cleanup part 1/many. Move ICU and APIC support files into their own subdirectory, bump the required config version for the build since this move also requires the use of the new arch/ symlink.
Userland 1:1 threading changes step 1/4+:
o Move thread-local members from struct proc into new struct lwp.
o Add a LIST_HEAD(lwp) p_lwps to struct proc. This links a proc with its lwps.
o Add a td_lwp member to struct thread which links a thread to its lwp, if it exists. This won't replace td_proc completely to save indirections.
o For now embed one struct lwp into struct proc and set up preprocessor linkage so that semantics don't change for the rest of the kernel. Once all consumers are converted to take a struct lwp instead of a struct proc, this will go away.
Reviewed-by: dillon, davidxu
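The "embed one struct lwp and set up preprocessor linkage" step can be pictured with a sketch like the one below. The structure layouts and the macro are illustrative assumptions (hence the x prefix on the type names), not the actual kernel definitions.

```c
/* Per-thread state split out of the process structure. */
struct xlwp {
	void	*lwp_wchan;	/* sleep channel */
	int	 lwp_rtprio;	/* realtime priority */
};

struct xproc {
	int		p_pid;
	struct xlwp	p_lwp;	/* single embedded lwp, for now */
};

/*
 * Compatibility linkage: existing code that still writes p->p_wchan
 * keeps compiling and transparently reaches the embedded lwp.
 */
#define p_wchan		p_lwp.lwp_wchan
```

Old consumers can keep using the p_ name while converted consumers take a struct xlwp pointer directly; the macro can be dropped once no unconverted users remain.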
Remove spl*() calls from i386, replacing them with critical sections. Leave spl support intact for the moment (it will be removed soon). Adjust the interrupt mux to use a critical section for 'old' interrupt handlers not using the new serialization API (which is nearly all of them at the moment).
Try to close an occasional VM page related panic that is believed to occur due to the VM page queues or free lists being indirectly manipulated by interrupts that are not protected by splvm(). Do this by replacing splvm()'s with critical sections in a number of places. Note: some of this work bled over into the "VFS messaging/interfacing work stage 8/99" commit.
Get rid of some conditionalized code which the pmap invalidation API took over from long ago and is no longer used.
Remove a recently added incorrect assertion. I was assuming that pmap_init_thread() was only being called for processes but it is called for both processes and threads.
Add a stack-size argument to the LWKT threading code so threads can be created with different-sized stacks. Adjust libcaps to match. This is a prerequisite to adding NDIS support. NDIS threads need larger stacks because Microsoft drivers expect larger stacks.
Merge from FreeBSD, RELENG_4 branch, revision 220.127.116.11. --- original commit message --- Log: There is a comma missing in the table initializing the pmap_prefault_pageorder array. This has two effects: 1. The resulting bogus contents of the array thwarts part of the optimization effect pmap_prefault() is supposed to have. 2. The resulting array is only 7 elements long (auto-sized), while pmap_prefault() expects it to be the intended 8 elements. So this function in fact accesses memory beyond the end of the array. Fortunately though, if the data at this location is out of bounds it will be ignored. This bug dates back more than 6 years. It has been introduced in revision 1.178. Submitted by: Uwe Doering <firstname.lastname@example.org> PR: 67460 --- original commit message ---
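To see why a single missing comma shortens the array, consider the sketch below. The table contents and the position of the missing comma are assumptions made for illustration and may not match the original source exactly; the array names are hypothetical.

```c
#include <stddef.h>

#define PAGE_SIZE	4096

/*
 * No comma after "3 * PAGE_SIZE": the compiler joins it with the next
 * line into one expression, 3 * PAGE_SIZE - 4 * PAGE_SIZE, so the
 * auto-sized array ends up with 7 elements instead of 8.
 */
static const int pageorder_buggy[] = {
	-PAGE_SIZE, PAGE_SIZE,
	-2 * PAGE_SIZE, 2 * PAGE_SIZE,
	-3 * PAGE_SIZE, 3 * PAGE_SIZE
	-4 * PAGE_SIZE, 4 * PAGE_SIZE
};

/* With the comma restored the array has the intended 8 elements. */
static const int pageorder_fixed[] = {
	-PAGE_SIZE, PAGE_SIZE,
	-2 * PAGE_SIZE, 2 * PAGE_SIZE,
	-3 * PAGE_SIZE, 3 * PAGE_SIZE,
	-4 * PAGE_SIZE, 4 * PAGE_SIZE
};

#define NELEM(a)	(sizeof(a) / sizeof((a)[0]))

static size_t buggy_count(void) { return (NELEM(pageorder_buggy)); }
static size_t fixed_count(void) { return (NELEM(pageorder_fixed)); }
```

A loop that assumes 8 entries reads one element past the end of the 7-element array, which is the out-of-bounds access the original message describes.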
Remove an unimplemented advisory function, pmap_pageable(); there is no pmap implementation in existence that requires it to be implemented. Discussed-with: Alan Cox <alc at freebsd.org>, Matthew Dillon <dillon at backplane.com>
Mask bits properly for pte_prot() in case it is called with additional VM_PROT_ bits. Fix a wired memory leak bug in pmap_enter(). If a page wiring change is made and the page has already been faulted in for read access, and a write-fault occurs, pmap_enter() was losing track of the wiring count in the pmap when it tried to optimize the RO->RW case in the page table. This prevented the page table page from being freed and led to a memory leak. The case is easily reproducible if you attempt to wire the data/bss crossover page in a program (typically just declare a global variable in a small program and mlock() its page, then exit without munlock()ing). 4K is lost each time the program is run.
Document the pmap_kenter_quick(9) function. While I am here, fix comment style for pmap_kenter(9).
Close an interrupt race between vm_page_lookup() and (typically) a vm_page_sleep_busy() check by using the correct spl protection. An interrupt can occur in between the two operations and unbusy/free the page in question, causing the busy check to fail and the code to fall through and then operate on a page that may have been freed and possibly even reused. Also note that vm_page_grab() had the same issue between the lookup, busy check, and vm_page_busy() call. Close an interrupt race when scanning a VM object's memq. Interrupts can free pages, removing them from memq, which interferes with memq scans and can cause a page unassociated with the object to be processed as if it were associated with the object. Calls to vm_page_hold() and vm_page_unhold() require spl protection. Rename the passed socket descriptor argument in sendfile() to make the code more readable. Fix several serious bugs in procfs_rwmem(). In particular, force it to block if a page is busy and then retry. Get rid of vm_pager_map_page() and vm_pager_unmap_page(), make the functions that used to use these routines use SFBUF's instead. Get rid of the (userland?) 4MB page mapping feature in pmap_object_init_pt() for now. The code appears to not track the page directory properly and could result in a non-zero page being freed as PG_ZERO. This commit also includes updated code comments and some additional non-operational code cleanups.
pmap_qremove() takes a page count, not a byte count. This should fix acpica related booting issues that people have reported.
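A caller that only has a byte length needs to convert it to a page count before passing it as the count argument. A minimal sketch, assuming 4K pages; the helper name bytes_to_pages() is hypothetical:

```c
#include <stddef.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Number of whole pages needed to cover 'bytes' bytes (rounds up). */
static size_t
bytes_to_pages(size_t bytes)
{
	return ((bytes + PAGE_SIZE - 1) >> PAGE_SHIFT);
}
```

Passing the byte length directly would ask for far more pages to be unmapped than were ever mapped (8192 instead of 2 for an 8K region).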
Another major mmx/xmm/FP commit. This is a combination of several patches but since the earlier patches didn't actually fix the crashing and corruption issues we were seeing everything has been rolled into one well tested commit. Make the FP more deterministic by requiring that npxthread and the FP state be properly synchronized, and that the FP be in a 'safe' state (meaning that mmx/xmm registers be useable) when npxthread is NULL. Allow the FP save area to be revectored. Kernel entities which use the FP unit, such as the bcopy code, must save the app state if it hasn't already been saved, then revector the save area. Note that combinations of operations must be protected by a critical section or interrupt disablement. Any clearing or setting npxthread combined with an fxsave/fnsave/frstor/fxrstor/fninit must be protected as an atomic entity. Since interrupts are threads and can preempt, such preemption will cause a thread switch to occur and thus cause npxthread and the FP state to be manipulated. The kernel can only depend on the FP state being stable for its use after it has revectored the FP save area. This commit fixes a number of issues, including potential filesystem corruption and kernel crashes.
Commit an update to the pipe code that implements various pipe algorithms. Note that the newer algorithms are either experimental or only exist for testing purposes. The default remains the same (sfbuf mode), which is considered to be stable. The code is just too useful not to commit it. Add pmap_qenter2() for installing cpu-localized KVM mappings. Add pmap_page_assertzero() which will be used in a later diagnostic commit.
Correct a bug in the last FPU optimized bcopy commit. The user FPU state was being corrupted by interrupts. Fix the bug by implementing a feature described as missing in the original FreeBSD comments... add a pointer to the FP saved state in the thread structure so routines which 'borrow' the FP unit can simply revector the pointer temporarily to avoid corruption of the original user FP state. The MMX_*_BLOCK macros in bcopy.s have also been simplified somewhat. We can simplify them even more (in the future) by reserving FPU save space in the per-cpu structure instead of on the stack.
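The pointer revectoring idea can be modeled in plain C as below. This is a userland sketch under stated assumptions: the structure layout, field names, and the fpu_borrow()/fpu_return()/sim_preempt_save() helpers are hypothetical, and a memset() stands in for the real fxsave/fnsave done when an interrupt thread preempts.

```c
#include <string.h>

struct fpstate {
	unsigned char area[512];	/* stand-in for the FP save area */
};

struct xthread {
	struct fpstate	 td_savefpu_store;	/* the user's FP state */
	struct fpstate	*td_savefpu;		/* where saves currently go */
};

/* Point the save area at scratch space; return the previous target. */
static struct fpstate *
fpu_borrow(struct xthread *td, struct fpstate *scratch)
{
	struct fpstate *old = td->td_savefpu;

	td->td_savefpu = scratch;
	return (old);
}

/* Restore the original save-area pointer. */
static void
fpu_return(struct xthread *td, struct fpstate *old)
{
	td->td_savefpu = old;
}

/* Model of a preempting save: writes go wherever the pointer leads. */
static void
sim_preempt_save(struct xthread *td, unsigned char fill)
{
	memset(td->td_savefpu->area, fill, sizeof(td->td_savefpu->area));
}
```

While the pointer is revectored, a preempting save lands in the scratch area, so the user state stored in td_savefpu_store is left untouched.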
Bring in the following revs from FreeBSD-4: 18.104.22.168 +3 -2 src/sys/i386/i386/pmap.c 22.214.171.124 +2 -2 src/sys/vm/pmap.h 126.96.36.199 +3 -2 src/sys/vm/vm_map.c Suggested-by: Alan Cox <email@example.com>
Enhance the pmap_kenter*() API and friends, separating out entries which only need invalidation on the local cpu against entries which need invalidation across the entire system, and provide a synchronization abstraction. Enhance sf_buf_alloc() and friends to allow the caller to specify whether the sf_buf's kernel mapping is going to be used on just the current cpu or whether it needs to be valid across all cpus. This is done by maintaining a cpumask of known-synchronized cpus in the struct sf_buf. Optimize sf_buf_alloc() and friends by removing both TAILQ operations in the critical path. TAILQ operations to remove the sf_buf from the free queue are now done in a lazy fashion. Most sf_buf operations allocate a buf, work on it, and free it, so why waste time moving the sf_buf off the freelist if we are only going to move it back onto the free list a microsecond later? Fix a bug in sf_buf_alloc() code as it was being used by the PIPE code. sf_buf_alloc() was unconditionally using PCATCH in its tsleep() call, which is only correct when called from the sendfile() interface. Optimize the PIPE code to require only local cpu_invlpg()'s when mapping sf_buf's, greatly reducing the number of IPIs required. On a DELL-2550, a pipe test which explicitly blows out the sf_buf caching by using huge buffers improves from 350 to 550 MBytes/sec. However, note that buildworld times were not found to have changed. Replace the PIPE code's custom 'struct pipemapping' structure with a struct xio and use the XIO API functions rather than its own.
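The "cpumask of known-synchronized cpus" can be sketched as follows. This is an illustrative model only: the structure, the field name, and sf_need_local_inval() are hypothetical and do not reproduce the real sf_buf code.

```c
#include <stdint.h>

struct xsf_buf {
	uint32_t cpumask;	/* cpus whose TLB already sees this mapping */
};

/*
 * Called when 'cpu' wants to use the mapping locally. Returns 1 if the
 * caller must invalidate its own TLB entry first, 0 if this cpu is
 * already known to be synchronized. No other cpu is disturbed.
 */
static int
sf_need_local_inval(struct xsf_buf *sf, int cpu)
{
	uint32_t bit = (uint32_t)1 << cpu;

	if (sf->cpumask & bit)
		return (0);
	sf->cpumask |= bit;
	return (1);
}

/* A new mapping is installed: no cpu is synchronized any more. */
static void
sf_remap(struct xsf_buf *sf)
{
	sf->cpumask = 0;
}
```

In this model a cpu pays for one local invalidation per mapping, and cross-cpu IPIs are only needed when a caller asks for the mapping to be valid everywhere.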
Newtoken commit. Change the token implementation as follows: (1) Obtaining a token no longer enters a critical section. (2) tokens can be held through scheduler switches and blocking conditions and are effectively released and reacquired on resume. Thus tokens serialize access only while the thread is actually running. Serialization is not broken by preemptive interrupts. That is, interrupt threads which preempt do not release the preempted thread's tokens. (3) Unlike spl's, tokens will interlock w/ interrupt threads on the same or on a different cpu. The vnode interlock code has been rewritten and the API has changed. The mountlist vnode scanning code has been consolidated and all known races have been fixed. The vnode interlock is now a pool token. The code that frees unreferenced vnodes whose last VM page has been freed has been moved out of the low level vm_page_free() code and moved to the periodic filesystem syncer code in vfs_msync(). The SMP startup code and the IPI code has been cleaned up considerably. Certain early token interactions on AP cpus have been moved to the BSP. The LWKT rwlock API has been cleaned up and turned on. Major testing by: David Rhodus
Synchronize a bunch of things from FreeBSD-5 in preparation for the new ACPICA driver support.
* Bring in a lot of new bus and pci DEV_METHODs from FreeBSD-5
* split apic.h into apicreg.h and apicio.h
* rename INTR_TYPE_FAST -> INTR_FAST and move the #define
* rename INTR_TYPE_EXCL -> INTR_EXCL and move the #define
* rename some PCIR_ registers and add additional macros from FreeBSD-5
* note: new pcib bus call, host_pcib_get_busno() imported.
* kern/subr_power.c no longer optional.
Other changes:
* machine/smp.h machine smp/smptests.h can now be #included unconditionally, and some APIC_IO vs SMP separation has been done as well.
* gd_acpi_id and gd_apic_id added to machine/globaldata.h prep for new ACPI code.
Despite all the changes, the generated code should be virtually the same. These were mostly additions which the pre-existing code does not (yet) use.
Introduce an MI cpu synchronization API, redo the SMP AP startup code, and start cleaning up deprecated IPI and clock code. Add a MMU/TLB page table invalidation API (pmap_inval.c) which properly synchronizes page table changes with other cpus in SMP environments.
* removed (unused) gd_cpu_lockid
* remove confusing invltlb() and friends, normalize use of cpu_invltlb() and smp_invltlb().
* redo the SMP AP startup code to make the system work better in situations where all APs do not startup.
* add memory barrier API, cpu_mb1() and cpu_mb2().
* remove (obsolete, no longer used) old IPI hard and stat clock forwarding code.
* add a cpu synchronization API which is capable of handling multiple simultaneous requests without deadlocking or livelocking.
* major changes to the PMAP code to use the new invalidation API.
* remove (unused) all_procs_ipi() and self_ipi().
* only use all_but_self_ipi() if it is known that all AP's started up, otherwise use a mask.
* remove (obsolete, no longer used) BETTER_CLOCK code
* remove (obsolete, no longer used) Xcpucheckstate IPI code
Testing-by: David Rhodus and others
Create a new machine type, cpumask_t, to represent a mask of cpus, and replace earlier uses of __uint32_t for cpu masks with cpumask_t.
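A dedicated mask type makes intent visible at call sites and lets the width change later in one place. A minimal sketch, assuming a 32-bit mask as the message implies; the helper functions are hypothetical and not part of the commit:

```c
#include <stdint.h>

typedef uint32_t cpumask_t;	/* one bit per cpu */

static cpumask_t
cpumask_add(cpumask_t mask, int cpu)
{
	return (mask | ((cpumask_t)1 << cpu));
}

static cpumask_t
cpumask_remove(cpumask_t mask, int cpu)
{
	return (mask & ~((cpumask_t)1 << cpu));
}

static int
cpumask_has(cpumask_t mask, int cpu)
{
	return ((mask & ((cpumask_t)1 << cpu)) != 0);
}
```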
Retool the M_* flags to malloc() and the VM_ALLOC_* flags to vm_page_alloc(), and vm_page_grab() and friends. The M_* flags now have more flexibility, with the intent that we will start using some of it to deal with NULL pointer return problems in the codebase (CAM is especially bad at dealing with unexpected return values). In particular, add M_USE_INTERRUPT_RESERVE and M_FAILSAFE, and redefine M_NOWAIT as a combination of M_ flags instead of its own flag. The VM_ALLOC_* macros are now flags (0x01, 0x02, 0x04) rather than states (1, 2, 3), which allows us to create combinations that the old interface could not handle.
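The difference between states and flags is that distinct bits can be OR'ed together and tested independently. A sketch with made-up names and values (the XALLOC_* macros below are illustrative, not the real VM_ALLOC_* definitions):

```c
/* Each request class gets its own bit, so classes can be combined. */
#define XALLOC_NORMAL		0x01
#define XALLOC_SYSTEM		0x02
#define XALLOC_INTERRUPT	0x04

/* Test one class without caring which others are also requested. */
static int
xalloc_allows(int flags, int class_bit)
{
	return ((flags & class_bit) != 0);
}
```

With sequential state values such as 1, 2, 3 the combination "normal and interrupt" has no encoding, because 1 | 2 is indistinguishable from state 3.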
CAPS IPC library stage 1/3: The core CAPS IPC code, providing system calls to create and connect to named rendezvous points. The CAPS interface implements a many-to-1 (client:server) capability and is totally self contained. The messaging is designed to support single and multi-threading, synchronous or asynchronous (as of this commit: polling and synchronous only). Message data is 100% opaque and so while the intention is to integrate it into a userland LWKT messaging subsystem, the actual system calls do not depend on any LWKT structures. Since these system calls are experimental and may contain root holes, they must be enabled via the sysctl kern.caps_enabled.
USER_LDT is now required by a number of packages as well as our upcoming user threads support. Make it non-optional. USER_LDT breaks SysV emulated sysarch(... SVR4_SYSARCH_DSCR) support. For now just #if 0 out the support (which is what FreeBSD-5.x does). Submitted-by: Craig Dooley <firstname.lastname@example.org>
Fix LINT issues with vm_paddr_t
Fix the pt_entry_t and pd_entry_t types. They were previously pointers to integers which is completely bogus. What they really represent are page table entries so define them as __uint32_t. Also add a vtophys_pte() macro to distinguish between physical addresses (vm_paddr_t) and physical addresses represented in PTE form (pt_entry_t). vm_paddr_t can be 64 bits even on IA32 boxes without PAE which use 32 bit PTE's. Taken loosely from: FreeBSD-4.x
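The distinction between a physical address and its PTE form can be sketched as below. The typedef widths follow the message; the bit values are the usual i386 ones, but the make_pte() and pte_to_paddr() helpers are hypothetical and are not the vtophys_pte() macro itself.

```c
#include <stdint.h>

typedef uint32_t pt_entry_t;	/* a page table entry, not a pointer */
typedef uint64_t vm_paddr_t;	/* physical address, may exceed 32 bits */

#define PG_V		0x001u		/* valid */
#define PG_RW		0x002u		/* writable */
#define PG_FRAME	0xfffff000u	/* page frame bits of a 32-bit PTE */

/* Build a 32-bit PTE from a physical address plus attribute bits. */
static pt_entry_t
make_pte(vm_paddr_t pa, pt_entry_t bits)
{
	return (((pt_entry_t)pa & PG_FRAME) | bits);
}

/* Recover the page-aligned physical address from a PTE. */
static vm_paddr_t
pte_to_paddr(pt_entry_t pte)
{
	return ((vm_paddr_t)(pte & PG_FRAME));
}
```

Keeping the two types separate makes the narrowing from a 64-bit physical address to a 32-bit PTE an explicit step rather than an accident of pointer arithmetic.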
64 bit address space cleanups which are a prerequisite for future 64 bit address space work and PAE. Note: this is not PAE. This patch basically adds vm_paddr_t, which represents a 'physical address'. Physical addresses may be larger than virtual addresses and on IA32 we make vm_paddr_t a 64 bit quantity. Submitted-by: Hiten Pandya <email@example.com>
Do a bit of Ansification, add some pmap assertions to catch the improper use of certain pmap functions from an interrupt, similar to FreeBSD-5, rewrite a number of comments, and surround some of the pmap functions which manipulate per-cpu CMAPs with critical sections.
cleanup: remove register keyword, ANSIze procedure arguments.
what the heck one last one before i go take a nap... remove __P(); from the i386 directory
Use FOREACH_PROC_IN_SYSTEM() throughout.
The comment was wrong, ptmmap *is* used, put it back in (fix crash accessing /dev/mem)
MP Implementation 3/4: MAJOR progress on SMP, full userland MP is now working! A number of issues relating to MP lock operation have been fixed, primarily that we have to read %cr2 before get_mplock() since get_mplock() may switch away. Idlethreads can now safely HLT without any performance detriment. The userland scheduler has been almost completely rewritten and is now using an extremely flexible abstraction with a lot of room to grow. pgeflag has been removed from mapdev (without per-page invalidation it isn't safe to use PG_G even on UP). Necessary locked bus cycles have been added for the pmap->pm_active field in swtch.s. CR3 has been unoptimized for the moment (see comment in swtch.s). Since the switch code runs without the MP lock we have to adjust pm_active PRIOR to loading %cr3. Additional sanity checks have been added to the code (see PARANOID_INVLTLB and ONLY_ONE_USER_CPU in the code), plus many more in kern_switch.c. A passive release mechanism has been implemented to optimize P_CURPROC/lwkt priority shifting when going from user->kernel and kernel->user. Note: preemptive interrupts don't care due to the way preemption works so no additional complexity there. Non-locking atomic functions to protect only against local interrupts have been added. astpending now uses non-locking atomic functions to set and clear bits. private_tss has been moved to a per-cpu variable. The LWKT thread module has been considerably enhanced and cleaned up, including some fixes to handle MPLOCKED vs td_mpcount races (so eventually we can do MP locking without a pushfl/cli/popfl combo). stopevent() needs critical section protection, maybe.
MP Implementation 1/2: Get the APIC code working again, sweetly integrate the MP lock into the LWKT scheduler, replace the old simplelock code with tokens or spin locks as appropriate. In particular, the vnode interlock (and most other interlocks) are now tokens. Also clean up a few curproc/cred sequences that are no longer needed. The APs are left in degenerate state with non IPI interrupts disabled as additional LWKT work must be done before we can really make use of them, and FAST interrupts are not managed by the MP lock yet. The main thing for this stage was to get the system working with an APIC again. buildworld tested on UP and 2xCPU/MP (Dell 2550)
Generic MP rollup work.
Split the struct vmmeter cnt structure into a global vmstats structure and a per-cpu cnt structure. Adjust the sysctls to accumulate statistics over all cpus.
smp/up collapse stage 2 of 2: cleanup the globaldata structure, cleanup and separate machine dependent portions of thread, proc, and globaldata, and reduce the need to include lots of MD header files.
smp/up collapse stage 1 of 2: Make UP use the globaldata structure the same way SMP does, and start removing all the bad macros and hacks that existed before.
Cleanup lwkt threads a bit, change the exit/reap interlock.
thread stage 10: (note stage 9 was the kern/lwkt_rwlock commit). Cleanup thread and process creation functions. Check the spl against ipending in cpu_lwkt_restore (so the idle loop does not lockup the machine). Remove the old VM object kstack allocation and freeing code. Leave newly created processes in a stopped state to fix wakeup/fork_handler races. Normalize the lwkt_init_*() functions. Add a sysctl debug.untimely_switch which will cause the last crit_exit() to yield, which causes a task switch to occur in wakeup() and catches a lot of 4.x-isms that can be found and fixed on UP.
Add kern/lwkt_rwlock.c -- reader/writer locks. Clean up the process exit & reaping interlock code to allow context switches to occur. Clean up and make operational the lwkt_block/signaling code.
thread stage 8: add crit_enter(), per-thread cpl handling, fix deferred interrupt handling for critical sections, add some basic passive token code, and blocking/signaling code. Add structural definitions for additional LWKT mechanisms. Remove asleep/await. Add generation number based xsleep/xwakeup. Note that when exiting the last crit_exit() we run splz() to catch up on blocked interrupts. There is also some #if 0'd code that will cause a thread switch to occur 'at odd times'... primarily wakeup()-> lwkt_schedule()->critical_section->switch. This will be useful for testing purposes down the line. The passive token code is mostly disabled at the moment. Its primary use will be under SMP and its primary advantage is very low overhead on UP and, if used properly, should also have good characteristics under SMP.
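The "catch up on blocked interrupts at the last crit_exit()" behavior can be modeled with a nesting counter. This is a single-threaded userland sketch; the sim_* names and the counters are hypothetical and only mimic the control flow described in the message, with the flush at the end of sim_crit_exit() playing the role of splz().

```c
static int sim_crit_count;	/* critical section nesting depth */
static int sim_pending;		/* interrupts deferred while nested */
static int sim_handled;		/* interrupts actually serviced */

static void
sim_crit_enter(void)
{
	sim_crit_count++;
}

/* An interrupt arrives: defer it if inside a critical section. */
static void
sim_interrupt(void)
{
	if (sim_crit_count > 0)
		sim_pending++;
	else
		sim_handled++;
}

/* Only the outermost exit services the deferred interrupts. */
static void
sim_crit_exit(void)
{
	if (--sim_crit_count == 0) {
		sim_handled += sim_pending;
		sim_pending = 0;
	}
}
```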
thread stage 7: Implement basic LWKTs, use a straight round-robin model for the moment. Also continue consolidating the globaldata structure so both UP and SMP use it with more commonality. Temporarily match user processes up with scheduled LWKTs on a 1:1 basis. Eventually user processes will have LWKTs, but they will not all be scheduled 1:1 with the user process's runnability. With this commit work can potentially start to fan out, but I'm not ready to announce yet.
thread stage 6: Move thread stack management from the proc structure to the thread structure, cleanup the pmap_new_*() and pmap_dispose_*() functions, and disable UPAGES swapping (if we eventually separate the kstack from the UPAGES we can reenable it). Also LIFO/4 cache thread structures which improves fork() performance by 40% (when used in typical fork/exec/exit or fork/subshell/exit situations).
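A small LIFO cache of freed structures avoids a full allocate and free on every fork/exit cycle. The sketch below reads "LIFO/4" as a last-in-first-out cache holding at most four entries, which is an assumption; the type and function names are hypothetical, and calloc()/free() stand in for the kernel allocator.

```c
#include <stdlib.h>

#define XCACHE_MAX	4

struct xthread {
	struct xthread *next;	/* cache linkage while free */
};

static struct xthread *xcache_head;
static int xcache_count;

/* Reuse the most recently freed structure if one is cached. */
static struct xthread *
xthread_alloc(void)
{
	struct xthread *td = xcache_head;

	if (td != NULL) {
		xcache_head = td->next;
		xcache_count--;
		return (td);
	}
	return (calloc(1, sizeof(*td)));
}

/* Keep up to XCACHE_MAX structures; release the rest for real. */
static void
xthread_free(struct xthread *td)
{
	if (xcache_count < XCACHE_MAX) {
		td->next = xcache_head;
		xcache_head = td;
		xcache_count++;
	} else {
		free(td);
	}
}
```

LIFO order means the structure handed back is the one most likely to still be warm in the cpu cache.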
thread stage 4: remove curpcb, use td_pcb reference instead. Move the pcb to the end of the thread stack, and note that a pcb will always exist because a thread context will always exist. Also note that vm86 replaces td_pcb temporarily and we really need to rip that out and instead make a copy on the stack, because assumptions are made in regards to the pcb's location.
thread stage 3: create independent thread structure, unembed from proc.
thread stage 1: convert curproc to curthread, embed struct thread in proc.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 188.8.131.52