Miscellaneous performance adjustments to the kernel

* Add an argument to VOP_BMAP so VFSs can discern the type of operation the BMAP is being done for.
* Normalize the variable name denoting the blocksize to 'blksize' in vfs_cluster.c.
* Fix a bug in the cluster code where a stale bp->b_error could wind up getting returned when B_ERROR is not set.
* Do not B_AGE cluster bufs.
* Pass the block size to both cluster_read() and cluster_write() instead of those routines getting the block size from vp->v_mount->mnt_stat.f_iosize. This allows different areas of a file to use a different block size.
* Properly initialize bp->b_bio2.bio_offset to doffset in cluster_read(). This fixes an issue where VFSs were making an extra, unnecessary call to BMAP.
* Do not recycle vnodes on the free list until numvnodes has reached desiredvnodes. Vnodes were being recycled when their resident page count had dropped to zero, but this is actually too early as the VFS may cache important information in the vnode that would otherwise require a number of I/O's to re-acquire. This mainly helps HAMMER (whose inode lookups are fairly expensive).
* Do not VAGE vnodes.
* Remove the minvnodes test. There is no reason not to load the vnode cache all the way through to its max.
* buf_cmd_t visibility for the new BMAP argument.
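The vnode recycling change described in the commit above can be pictured with a small userland sketch. This is not the kernel code: struct vnode_cache_state and both should_recycle_*() helpers are hypothetical names made up for illustration, and only the counter names numvnodes and desiredvnodes are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's vnode accounting. */
struct vnode_cache_state {
    int numvnodes;      /* vnodes currently allocated */
    int desiredvnodes;  /* configured ceiling for the vnode cache */
};

/* Old policy as described in the log: a free-list vnode became a
 * recycling candidate as soon as it had no resident pages left. */
static bool should_recycle_old(const struct vnode_cache_state *st,
                               int resident_pages)
{
    (void)st;  /* the cache fill level was not part of the decision */
    return resident_pages == 0;
}

/* New policy as described in the log: leave free-list vnodes alone
 * until the cache has actually filled up to desiredvnodes. */
static bool should_recycle_new(const struct vnode_cache_state *st)
{
    assert(st->desiredvnodes > 0);
    return st->numvnodes >= st->desiredvnodes;
}
```

With numvnodes at 100 and desiredvnodes at 1000, the old predicate would give up a page-less vnode immediately, while the new one keeps it (and whatever the VFS cached in it) until the cache is full.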
Remove the vpp (returned underlying device vnode) argument from VOP_BMAP(). VOP_BMAP() may now only be used to determine linearity and clusterability of the blocks underlying a filesystem object. The meaning of the returned block number (other than being contiguous as a means of indicating linearity or clusterability) is now up to the VFS. This removes visibility into the device(s) underlying a filesystem from the rest of the kernel.
VNode sequencing and locking - part 3/4. VNode aliasing is handled by the namecache (aka nullfs), so there is no longer a need to have VOP_LOCK, VOP_UNLOCK, or VOP_ISSLOCKED as 'VOP' functions. Both NFS and DEADFS have been using standard locking functions for some time and are no longer special cases. Replace all uses with native calls to vn_lock, vn_unlock, and vn_islocked. We can't have these as VOP functions anyhow because of the introduction of the new SYSLINK transport layer, since vnode locks are primarily used to protect the local vnode structure itself.
Simplify vn_lock(), VOP_LOCK(), and VOP_UNLOCK() by removing the thread_t argument. These calls now always use the current thread as the lockholder. Passing a thread_t to these functions has always been questionable at best.
Due to continuing issues with VOP_READ/VOP_WRITE ops being called without a VOP_OPEN, particularly by NFS, redo the way VM objects are associated with vnodes.

* The size of the object is now passed to vinitvmio(). vinitvmio() no longer calls VOP_GETATTR().
* Instead of trying to call vinitvmio() conditionally in various places, we now call it unconditionally when a vnode is instantiated if the filesystem at any time in the future intends to use the buffer cache to access that vnode's dataspace.
* Specfs 'disk' devices are an exception. Since we cannot safely do I/O on such vnodes if they have not been VOP_OPEN()'ed anyhow, the VM objects for those vnodes are still only associated on open.

The performance impact is limited to the case where large numbers of vnodes are being created and destroyed. This case only occurs when a large directory topology (number of files > kernel's vnode cache) is traversed and all related inodes are cached by the system. Being a pure-cpu case the slight loss of performance due to the VM object allocations is not really a big deal.
Clone cd9660_blkatoff() into a new procedure, cd9660_devblkatoff(), which returns a devvp-relative buffer rather than the vp-relative buffer. This allows us to access meta-data relative to a vnode without having to instantiate a VM object for that vnode. The new function is used for all directory scans and (negative offset) meta-data access. This fixes a panic due to recent buffer cache commits that formalized the requirements for using the buffer cache. Also, prior to this change, the CD9660 filesystem was using B_MALLOC buffers for a great deal of meta-data access that could very easily have been backed by the device vnode's VM object instead. B_MALLOC buffers have severe caching limitations. This commit fixes all of that as well.
Remove VOP_GETVOBJECT, VOP_DESTROYVOBJECT, and VOP_CREATEVOBJECT. Rearrange the VFS code such that VOP_OPEN is now responsible for associating a VM object with a vnode. Add the vinitvmio() helper routine.
Major BUF/BIO work commit. Make I/O BIO-centric and specify the disk or file location with a 64 bit offset instead of a 32 bit block number.

* All I/O is now BIO-centric instead of BUF-centric.
* File/Disk addresses universally use a 64 bit bio_offset now. bio_blkno no longer exists.
* Stackable BIO's hold disk offset translations. Translations are no longer overloaded onto a single structure (BUF or BIO).
* bio_offset == NOOFFSET is now universally used to indicate that a translation has not been made. The old (blkno == lblkno) junk has all been removed.
* There is no longer a distinction between logical I/O and physical I/O.
* All driver BUFQs have been converted to BIOQs.
* BMAP, FREEBLKS, getblk, bread, breadn, bwrite, inmem, cluster_*, and findblk all now take and/or return 64 bit byte offsets instead of block numbers. Note that BMAP now returns a byte range for the before and after variables.
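As a rough illustration of the move from block numbers to 64 bit byte offsets, the sketch below converts a logical block number into a byte offset and tests for the "not yet translated" sentinel. The helper names and SKETCH_NOOFFSET are assumptions made for this example; the log does not state the value of the kernel's NOOFFSET, so -1 is used here purely as a placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder sentinel for "no translation made yet". The real
 * NOOFFSET value is not given in the commit log; -1 is an assumption. */
#define SKETCH_NOOFFSET ((int64_t)-1)

/* Convert a logical block number into a 64 bit byte offset.
 * The multiplication is done in 64 bits so that large block numbers
 * do not wrap the way a 32 bit product would. */
static int64_t sketch_blkno_to_offset(int64_t lblkno, int32_t blksize)
{
    assert(lblkno >= 0 && blksize > 0);
    return lblkno * (int64_t)blksize;
}

/* A translation is considered present once the offset is no longer
 * the sentinel; no (blkno == lblkno) comparison is involved. */
static int sketch_is_translated(int64_t bio_offset)
{
    return bio_offset != SKETCH_NOOFFSET;
}
```

For a 2048 byte block size, block 3 maps to byte offset 6144, and block 2^32 maps to 2^43, a value that needs more than 32 bits to represent.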
Make the entire BUF/BIO system BIO-centric instead of BUF-centric. Vnode and device strategy routines now take a BIO and must pass that BIO to biodone(). All code which previously managed a BUF undergoing I/O now manages a BIO.

The new BIO-centric algorithms allow BIOs to be stacked, where each layer represents a block translation, completion callback, or caller or device private data. This information is no longer overloaded within the BUF. Translation layer linkages remain intact as a 'cache' after I/O has completed.

The VOP and DEV strategy routines no longer make assumptions as to which translated block number applies to them. They use the block number in the BIO specifically passed to them.

Change the 'untranslated' constant to NOOFFSET (for bio_offset), and (daddr_t)-1 (for bio_blkno). Rip out all code that previously set the translated block number to the untranslated block number to indicate that the translation had not been made.

Rip out all the cluster linkage fields for clustered VFS and clustered paging operations. Clustering now occurs in a private BIO layer using private fields within the BIO.

Reformulate the vn_strategy() and dev_dstrategy() abstraction(s). These routines no longer assume that bp->b_vp == the vp of the VOP operation, and the dev_t is no longer stored in the struct buf. Instead, only the vp passed to vn_strategy() (and related *_strategy() routines for VFS ops), and the dev_t passed to dev_dstrategy() (and related *_strategy() routines for device ops) are used by the VFS or DEV code. This will allow an arbitrary number of translation layers in the future.

Create an independent per-BIO tracking entity, struct bio_track, which is used to determine when I/O is in-progress on the associated device or vnode.

NOTE: Unlike FreeBSD's BIO work, our struct BUF is still used to hold the fields describing the data buffer, resid, and error state.

Major-testing-by: Stefan Krueger
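The idea of stacked BIO layers that keep their translation as a cache can be sketched in a few lines of userland C. Everything here is hypothetical: struct sketch_bio, sketch_translate(), and the additive "base" translation are simplifications invented for this example and do not mirror the real struct bio layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the "untranslated" marker (value is an assumption). */
#define SKETCH_NOOFFSET ((int64_t)-1)

/* One layer of a hypothetical BIO stack. */
struct sketch_bio {
    struct sketch_bio *next;  /* next lower translation layer, or NULL */
    int64_t offset;           /* offset at this layer, or SKETCH_NOOFFSET */
};

/* Translate the upper layer's offset into the next lower layer by
 * adding a base (for example, where a file's extent starts on disk).
 * If the lower layer already holds a translation it is reused, which
 * models the linkage remaining intact as a cache after I/O. */
static int64_t sketch_translate(struct sketch_bio *upper, int64_t base)
{
    assert(upper != NULL && upper->next != NULL);
    assert(upper->offset != SKETCH_NOOFFSET);
    if (upper->next->offset == SKETCH_NOOFFSET)
        upper->next->offset = upper->offset + base;
    return upper->next->offset;
}
```

A caller would link a logical layer to a device layer, start the device layer at SKETCH_NOOFFSET, and call sketch_translate() once per I/O; repeat calls return the cached value without recomputing it.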
Rename all the functions and structures for the old VOP namespace API functions from vop_* to vop_old_*. e.g. vop_lookup -> vop_old_lookup. This will make it easier to identify areas containing old VOP API code. Remove vop_old_*_ap() functions, they are not used (and not allowed to be used). The old API is only allowed at the leaf of a VFS stack.
VFS messaging/interfacing work stage 9/99: VFS 'NEW' API WORK.

NOTE: unionfs and nullfs are temporarily broken by this commit.

* Remove the old namecache API. vfs_cache_lookup(), cache_lookup(), cache_enter(), namei() and lookup() are all gone. VOP_LOOKUP() and VOP_CACHEDLOOKUP() have been collapsed into a single non-caching VOP_LOOKUP().
* Complete the new VFS CACHE (namecache) API. The new API is able to supply topological guarantees and is able to reserve namespaces, including negative cache spaces (whether the target name exists or not), which the new API uses to reserve namespace for things like NRENAME and NCREATE (and others).
* Complete the new namecache API. VOP_NRESOLVE, NLOOKUPDOTDOT, NCREATE, NMKDIR, NMKNOD, NLINK, NSYMLINK, NWHITEOUT, NRENAME, NRMDIR, NREMOVE. These new calls take (typically locked) namecache pointers rather than combinations of directory vnodes, file vnodes, and name components. The new calls are *MUCH* simpler in concept and implementation. For example, VOP_RENAME() has 8 arguments while VOP_NRENAME() has only 3 arguments.

  The new namecache API uses the namecache to lock namespaces without having to lock the underlying vnodes. For example, this allows the kernel to reserve the target name of a create function trivially. Namecache records are maintained BY THE KERNEL for both positive and negative hits.

  Generally speaking, the kernel layer is now responsible for resolving path elements. NRESOLVE is called when an unresolved namecache record needs to be resolved. Unlike the old VOP_LOOKUP, NRESOLVE is simply responsible for associating a vnode to a namecache record (positive hit) or telling the system that it's a negative hit, and not responsible for handling symlinks or other special cases or doing any of the other path lookup work.

  It should be particularly noted that the new namecache topology does not allow disconnected namecache records. In rare cases where a vnode must be converted to a namecache pointer for new API operation via a file handle (i.e. NFS), the cache_fromdvp() function is provided and a new API VOP, VOP_NLOOKUPDOTDOT(), is provided to allow the namecache to resolve the topology leading up to the requested vnode. These and other topological guarantees greatly reduce the complexity of the new namecache API.

  The new namei() is called nlookup(). This function uses a combination of cache_n*() calls, VOP_NRESOLVE(), and standard VOP calls to resolve the supplied path, deal with symlinks, and so forth, in a nice small compact compartmentalized procedure.
* The old VFS code is no longer responsible for maintaining namecache records, a function which was mostly ad hoc cache_purge()s occurring before the VFS actually knows whether an operation will succeed or not. The new VFS code is typically responsible for adjusting the state of locked namecache records passed into it. For example, if NCREATE succeeds it must call cache_setvp() to associate the passed namecache record with the vnode representing the successfully created file. The new requirements are much less complex than the old requirements.
* Most VFSs still implement the old API calls, albeit somewhat modified and in particular the VOP_LOOKUP function is now *MUCH* simpler. However, the kernel now uses the new API calls almost exclusively and relies on compatibility code installed in the default ops (vop_compat_*()) to convert the new calls to the old calls.
* All kernel system calls and related support functions which used to do complex and confusing namei() operations now do far less complex and far less confusing nlookup() operations.
* SPECOPS shortcutting has been implemented. User reads and writes now go directly to supporting functions which talk to the device via fileops rather than having to be routed through VOP_READ or VOP_WRITE, saving significant overhead. Note, however, that these only really affect /dev/null and /dev/zero. Implementing this was fairly easy; we now simply pass an optional struct file pointer to VOP_OPEN() and let spec_open() handle the override.

SPECIAL NOTES: It should be noted that we must still lock a directory vnode LK_EXCLUSIVE before issuing a VOP_LOOKUP(), even for simple lookups, because a number of VFS's (including UFS) store active directory scanning information in the directory vnode. The legacy NAMEI_LOOKUP cases can be changed to use LK_SHARED once these VFS cases are fixed. In particular, we are now organized well enough to actually be able to do record locking within a directory for handling NCREATE, NDELETE, and NRENAME situations, but it hasn't been done yet.

Many thanks to all of the testers and in particular David Rhodus for finding a large number of panics and other issues.
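The division of labour described for NRESOLVE (attach a vnode for a positive hit, or record a negative hit, and nothing else) can be modelled with a toy resolver. This is a userland sketch under stated assumptions: the sketch_* types and functions are hypothetical, the "directory" is a plain array, and the real VOP_NRESOLVE and cache_setvp() signatures are not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct sketch_vnode { int ino; };

/* Toy directory entry: a name bound to a vnode. */
struct sketch_dirent {
    const char *name;
    struct sketch_vnode *vp;
};

/* Toy namecache record. 'unresolved' models the NCF_UNRESOLVED flag;
 * a resolved record with vp == NULL models a negative hit. */
struct sketch_ncp {
    const char *name;
    int unresolved;
    struct sketch_vnode *vp;
};

/* Models the resolver step: bind a vnode (or NULL) and clear the flag. */
static void sketch_setvp(struct sketch_ncp *ncp, struct sketch_vnode *vp)
{
    ncp->vp = vp;
    ncp->unresolved = 0;
}

/* Resolve one record against a directory. No symlink handling and no
 * path walking happen here; that work belongs to the caller. */
static int sketch_nresolve(struct sketch_ncp *ncp,
                           const struct sketch_dirent *dir, size_t n)
{
    assert(ncp->unresolved);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(dir[i].name, ncp->name) == 0) {
            sketch_setvp(ncp, dir[i].vp);   /* positive hit */
            return 0;
        }
    }
    sketch_setvp(ncp, NULL);                /* negative hit */
    return ENOENT;
}
```

Either way the record leaves the resolver in a resolved state, so a later lookup of the same name can be answered from the cache without asking the filesystem again.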
VFS messaging/interfacing work stage 8/99: Major reworking of the vnode interlock and other miscellaneous things. This patch also fixes FS corruption due to prior vfs work in head. In particular, prior to this patch the namecache locking could introduce blocking conditions that confuse the old vnode deactivation and reclamation code paths. With this patch there appear to be no serious problems even after two days of continuous testing.

* VX lock all VOP_CLOSE operations.
* Fix two NFS issues. There was an incorrect assertion (found by David Rhodus), and the nfs_rename() code was not properly purging the target file from the cache, resulting in Stale file handle errors during, e.g. a buildworld with an NFS-mounted /usr/obj.
* Fix a TTY session issue. Programs which open("/dev/tty", ...) and then run the TIOCNOTTY ioctl were causing the system to lose track of the open count, preventing the tty from properly detaching. This is actually a very old BSD bug, but it came out of the woodwork in DragonFly because I am now attempting to track device opens explicitly.
* Gets rid of the vnode interlock. The lockmgr interlock remains.
* Introduced VX locks, which are mandatory vp->v_lock based locks.
* Rewrites the locking semantics for deactivation and reclamation. (A ref'd VX lock'd vnode is now required for vgone(), VOP_INACTIVE, and VOP_RECLAIM). New guarantees emplaced with regard to vnode ripouts.
* Recodes the mountlist scanning routines to close timing races.
* Recodes getnewvnode to close timing races (it now returns a VX locked and refd vnode rather than a refd but unlocked vnode).
* Recodes VOP_REVOKE - a locked vnode is now mandatory.
* Recodes all VFS inode hash routines to close timing holes.
* Removes cache_leaf_test() - vnodes representing intermediate directories are now held so the leaf test should no longer be necessary.
* Splits the over-large vfs_subr.c into three additional source files, broken down by major function (locking, mount related, filesystem syncer).
* Changes splvm() protection to a critical-section in a number of places (bleedover from another patch set which is also about to be committed).

Known issues not yet resolved:

* Possible vnode/namecache deadlocks.
* While most filesystems now use vp->v_lock, I haven't done a final pass to make vp->v_lock mandatory and to clean up the few remaining inode based locks (nwfs I think and other obscure filesystems).
* NullFS gets confused when you hit a mount point in the underlying filesystem.
* Only UFS and NFS have been well tested.
* NFS is not properly timing out namecache entries, causing changes made on the server to not be properly detected on the client if the client already has a negative-cache hit for the filename in question.

Testing-by: David Rhodus <email@example.com>, Peter Kadau <firstname.lastname@example.org>, walt <email@example.com>, others
VFS messaging/interfacing work stage 7/99. BEGIN DESTABILIZATION!

Implement the infrastructure required to allow us to begin switching to the new nlookup() VFS API.

filedesc->fd_ncdir, fd_nrdir, fd_njdir
    File descriptors (associated with processes) now record the namecache pointer related to the current directory, root directory, and jail directory, in addition to the vnode pointers. These pointers are used as the basis for the new path lookup code (nlookup() and friends).

file->f_ncp
    File pointers may now have a referenced+unlocked namecache pointer associated with them. All fp's representing directories have this attached. This allows fchdir() to properly record the ncp in fdp->fd_ncdir and friends.

mount->mnt_ncp
    The namecache topology for crossing a mount point works as follows: when looking up a path element which is a mount point, cache_nlookup() will locate the ncp for the vnode-under the mount point. mount->mnt_ncp represents the root of the mount, that is the vnode-over. nlookup() detects the mount point and accesses mount->mnt_ncp to skip past the vnode-under. When going backwards (..), nlookup() detects the case and skips backwards.

    The ncp linkages are: ncp->ncp->ncp[vnode_under]->ncp[vnode_over]. That is, when going forwards or backwards nlookup must explicitly skip over the double-ncp when crossing a mount point. This allows us to keep the namecache topology intact across mount points.

NEW CACHE level API functions:

cache_get()
    Reference and lock a namecache entry.
cache_put()
    Dereference and unlock a namecache entry.
cache_lock()
    Lock an already-referenced namecache entry.
cache_unlock()
    Unlock a locked namecache entry.

    NOTE: namecache locks are exclusive and recursive. These are the 'namespace' locks that we will be using to guarantee namespace operations such as in a CREATE, RENAME, or REMOVE.

vfs_cache_setroot()
    Set the new system-wide root directory.
cache_allocroot()
    System bootstrap helper function to allocate the root namecache node.
cache_resolve()
    Resolve a NCF_UNRESOLVED namecache node. The namecache node should be locked on call.
cache_setvp()
    (resolver) Associate a VP or create a negative cache entry representation for a namecache pointer and clear NCF_UNRESOLVED. The namecache node should be locked on call.
cache_setunresolved()
    Revert a resolved namecache entry back to an unresolved state, disassociating any vnode but leaving the topology intact. The namecache node should be locked on call.
cache_vget()
    Obtain the locked+refd vnode related to a namecache entry, resolving the entry if necessary. Return ENOENT if the entry represents a negative cache hit.
cache_vref()
    Obtain a refd (not locked) vnode related to a namecache entry, as above.
cache_nlookup()
    The new namecache lookup routine. This routine does a lookup and allocates a new namecache node (into an unresolved state) if necessary. Returns a namecache record whether or not the item can be found and whether or not it represents a positive or negative hit.
cache_lookup()
    OLD API CODE DEPRECATED, but must be maintained until everything has been converted over.
cache_enter()
    OLD API CODE DEPRECATED, but must be maintained until everything has been converted over.

NEW default VOPs

vop_noresolve()
    Implements a namecache resolver for VFSs which are still using the old VOP_LOOKUP/VOP_CACHEDLOOKUP API (which is all of them still).
VOP_LOOKUP
    OLD API CODE DEPRECATED, but must be maintained until everything has been converted over.
VOP_CACHEDLOOKUP
    OLD API CODE DEPRECATED, but must be maintained until everything has been converted over.

NEW PATHNAME LOOKUP CODE

nlookup_init()
    Similar to NDINIT, initialize a nlookupdata structure for nlookup() and nlookup_done().
nlookup()
    Lookup a path. Unlike the old namei/lookup code the new lookup code does not do any fancy pre-disposition of the cache for create/delete, it simply looks up the requested path and returns the appropriate locked namecache pointer. The caller can obtain the vnode and directory vnode, as applicable, from the one namecache structure that is returned. Access checks are done on directories leading up to the result but not done on the returned namecache node.
nlookup_done()
    Mandatory routine to clean up a nlookupdata structure after it has been initialized and all operations have been completed on it.
nlookup_simple()
    (in progress) all-in-one wrapped new lookup.
nlookup_mp()
    Helper call for resolving a mount point's glue NCP. Hackish, will be cleaned up later.
nreadsymlink()
    Helper call to resolve a symlink. Note that the namecache does not yet cache symlink data but the intention is to eventually do so to avoid having to do VFS ops to get the data.
naccess()
    Perform access checks on a namecache node given a mode and cred.
naccess_va()
    Perform access checks on a vattr given a mode and cred.

Begin switching VFS operations from using namei to using nlookup. In this batch:

* mount (install mnt_ncp for cross-mount-point handling in nlookup, simplify the vfs_mount() API to no longer pass a nameidata structure)
* [l]stat (use nlookup)
* [f]chdir (use nlookup, use recorded f_ncp)
* [f]chroot (use nlookup, use recorded f_ncp)
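The note that namecache locks are "exclusive and recursive" can be made concrete with a tiny single-threaded model. It is only a sketch: struct sketch_nclock and the sketch_cache_*() functions are hypothetical, threads are represented by plain integer ids, and a failed acquisition returns false where a real lock would block.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NOTHREAD 0   /* hypothetical "nobody owns the lock" id */

/* Toy exclusive, recursive lock as it might sit in a namecache record. */
struct sketch_nclock {
    int owner;   /* id of the holding thread, or SKETCH_NOTHREAD */
    int count;   /* recursion depth held by the owner */
};

/* Try to lock on behalf of thread 'td'. Succeeds if the lock is free
 * (exclusive) or already held by the same thread (recursive). */
static bool sketch_cache_lock(struct sketch_nclock *l, int td)
{
    assert(td != SKETCH_NOTHREAD);
    if (l->owner == SKETCH_NOTHREAD) {
        l->owner = td;
        l->count = 1;
        return true;
    }
    if (l->owner == td) {
        l->count++;
        return true;
    }
    return false;   /* a real implementation would block here */
}

/* Drop one level of recursion; the lock is released at depth zero. */
static void sketch_cache_unlock(struct sketch_nclock *l, int td)
{
    assert(l->owner == td && l->count > 0);
    if (--l->count == 0)
        l->owner = SKETCH_NOTHREAD;
}
```

In this model a thread that locked a name twice must unlock it twice before any other thread can reserve the same name, which is the property a CREATE, RENAME, or REMOVE relies on.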
Remove the VREF() macro and uses of it. Remove uses of 0x20 before ^I inside vnode.h
Style(9) cleanup to src/sys/vfs, stage 7/21: isofs. - Convert K&R-style function definitions to ANSI style. Submitted-by: Andre Nathan <firstname.lastname@example.org> Additional-reformatting-by: cpressey
Per-CPU VFS Namecache Effectiveness Statistics:

* Convert nchstats into a CPU indexed array
* Export the per-CPU nchstats as a sysctl vfs.cache.nchstats and let user-land aggregate them.
* Add a function called kvm_nch_cpuagg() to libkvm; it is shared by systat(1) and vmstat(1) and the ncache-stats test program. As the function name suggests, it aggregates the per-CPU nchstats.
* Move struct nchstats into a separate header to avoid header file namespace pollution; sys/nchstats.h.
* Keep a cached copy of the globaldata pointer in the VFS specific LOOKUP op, and use that to increment the namecache effectiveness counters (nchstats).
* Modify systat(1) and vmstat(1) to accommodate the new behavior of accessing nchstats. Remove a (now) redundant sysctl to get the cpu count (hw.ncpu), instead we just divide the total length of the nchstats array returned by sysctl by sizeof(struct nchstats) to get the CPU count.
* Garbage-collect unused variables and fix nearby warnings in systat(1) and vmstat(1).
* Add a very-cool test program, that prints the nchstats per-CPU statistics to show CPU distribution. Here is the output it generates on a 2-processor SMP machine:

    gray# ncache-stats
    VFS Name Cache Effectiveness Statistics
       4207370 total name lookups
    COUNTER         CPU-1      CPU-2       TOTAL
    goodhits      2477657    1060677   (3538334)
    neghits        107531      47294    (154825)
    badhits         28968       7720     (36688)
    falsehits           0          0         (0)
    misses         339671     137852    (477523)
    longnames           0          0         (0)
    passes 2        13104       6813     (19917)
    2-passes        25134      15257     (40391)

The SMP machine used for testing this commit was proudly presented by David Rhodus <email@example.com>.
Reviewed-by: Matthew Dillon <firstname.lastname@example.org>
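The aggregation step and the "length divided by sizeof" trick for finding the CPU count can be sketched as follows. The struct below is a cut-down, hypothetical stand-in for struct nchstats with only three counters, and sketch_nch_cpuagg() is not the libkvm function; its name and signature are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced version of the per-CPU statistics record. */
struct sketch_nchstats {
    unsigned long goodhits;
    unsigned long neghits;
    unsigned long misses;
};

/* Derive the CPU count from the byte length of the returned array,
 * as the commit describes, instead of asking for hw.ncpu. */
static size_t sketch_ncpus(size_t buflen)
{
    assert(buflen % sizeof(struct sketch_nchstats) == 0);
    return buflen / sizeof(struct sketch_nchstats);
}

/* Sum each counter across all CPUs into one totals record. */
static struct sketch_nchstats
sketch_nch_cpuagg(const struct sketch_nchstats *percpu, size_t ncpus)
{
    struct sketch_nchstats total = { 0, 0, 0 };
    for (size_t i = 0; i < ncpus; i++) {
        total.goodhits += percpu[i].goodhits;
        total.neghits  += percpu[i].neghits;
        total.misses   += percpu[i].misses;
    }
    return total;
}
```

Feeding in the two per-CPU rows from the sample output above (goodhits 2477657 and 1060677, and so on) reproduces the TOTAL column for those three counters.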
Newtoken commit. Change the token implementation as follows: (1) Obtaining a token no longer enters a critical section. (2) tokens can be held through scheduler switches and blocking conditions and are effectively released and reacquired on resume. Thus tokens serialize access only while the thread is actually running. Serialization is not broken by preemptive interrupts. That is, interrupt threads which preempt do not release the preempted thread's tokens. (3) Unlike spl's, tokens will interlock w/ interrupt threads on the same or on a different cpu. The vnode interlock code has been rewritten and the API has changed. The mountlist vnode scanning code has been consolidated and all known races have been fixed. The vnode interlock is now a pool token. The code that frees unreferenced vnodes whose last VM page has been freed has been moved out of the low level vm_page_free() code and moved to the periodic filesystem syncer code in vfs_msync(). The SMP startup code and the IPI code has been cleaned up considerably. Certain early token interactions on AP cpus have been moved to the BSP. The LWKT rwlock API has been cleaned up and turned on. Major testing by: David Rhodus
namecache work stage 3a: Adjust the VFS APIs to include a namecache pointer where necessary. For the moment we pass NULL for these parameters (the old 'dvp' vnode pointers cannot be ripped out quite yet).
namecache work stage 1: namespace cleanups. Add a NAMEI_ prefix to CREATE, LOOKUP, DELETE, and RENAME. Add a CNP_ prefix to all the name lookup flags (nd_flags) e.g. ISDOTDOT->CNP_ISDOTDOT.
kernel tree reorganization stage 1: Major cvs repository work (not logged as commits) plus a major reworking of the #include's to accommodate the relocations.

* CVS repository files manually moved. Old directories left intact and empty (temporary).
* Reorganize all filesystems into vfs/, most devices into dev/, sub-divide devices by function.
* Begin to move device-specific architecture files to the device subdirs rather than throwing them all into, e.g. i386/include
* Reorganize files related to system busses, placing the related code in a new bus/ directory. Also move cam to bus/cam though this may not have been the best idea in retrospect.
* Reorganize emulation code and place it in a new emulation/ directory.
* Remove the -I- compiler option in order to allow #include file localization, rename all config generated X.h files to use_X.h to clean up the conflicts.
* Remove /usr/src/include (or /usr/include) dependencies during the kernel build, beyond what is normally needed to compile helper programs.
* Make config create 'machine' softlinks for architecture specific directories outside of the standard <arch>/include.
* Bump the config rev.

WARNING! after this commit /usr/include and /usr/src/sys/compile/* should be regenerated from scratch.
Register keyword removal Approved by: Matt Dillon
proc->thread stage 5: BUF/VFS clearance! Remove the ucred argument from vop_close, vop_getattr, vop_fsync, and vop_createvobject. These VOPs can be called from multiple contexts so the cred is fairly useless, and UFS ignores it anyway. For filesystems (like NFS) that sometimes need a cred we use proc0.p_ucred for now. This removal also removed the need for a 'proc' reference in the related VFS procedures, which greatly helps our proc->thread conversion. bp->b_wcred and bp->b_rcred have also been removed, and for the same reason. It makes no sense to have a particular cred when multiple users can access a file. This may create issues with certain types of NFS mounts but if it does we will solve them in a way that doesn't pollute the struct buf.
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note however that ultimately both APIs are going to be message-passing which means the current thread context will not be useable for creds and descriptor access.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 220.127.116.11