Keyword substitution: kv
Default branch: MAIN
Fix 'unused variable' warning in case neither BOOTP nor NFS_ROOT is defined.
Pass the correct string to md_mount() when doing a diskless nfs mount. Fix a bogus call to kfree() that panics the machine when a diskless nfs mount fails.
MFC NFS performance fixes.
NFS performance fixes. * sync on an NFS mount was a big NOP due to a silly bug. * utimes (setattr w/ mtime-changed) was unconditionally flushing the file, causing programs such as cpdup, rsync, rdist, and tar xp to sync on each file. change it so it does not unconditionally flush the file.
MFC 1.51 - force an over-the-wire transaction when resolving the root of an NFS mount to prevent an endless ESTALE retry loop from occurring in the namecache code.
Force an over-the-wire transaction when resolving the root of an NFS mount point. The namecache will cache the mount point for us so this should not reduce performance. We need to know that the mount point is good or things like the namecache resolver could end up looping forever trying to resolve a stale NFS mount. Reported-by: firstname.lastname@example.org (Petr)
Ansify parameter declarations. In-collaboration-with: Alexey Slynko <email@example.com>
Rename printf -> kprintf in sys/ and add some defines where necessary (files which are used in userland, too).
Rename sprintf -> ksprintf. Rename snprintf -> ksnprintf. Make allowances for source files that are compiled for both userland and the kernel.
Remove the last bits of code that stored mount point linkages in vnodes. Mount point linkages are now ENTIRELY a function of the namecache topology, made possible by DragonFly's advanced namecache. This fixes a number of problems with NULLFS and adds two major features to our NULLFS mounting capabilities. NULLFS mounting paths NO LONGER NEED TO BE DISTINCT. For example, you can now safely do things like 'mount_null -o ro / /fubar/jail1' without creating a recursion and you can now create SUB-MOUNTS within nullfs mounts, such as 'mount_null -o ro /usr /fubar/jail1/usr', without creating problems in the original master partitions. The result is that NULLFS can now be used to glue arbitrary pieces of filesystems together using a mixture of read-only and read-write NULLFS mounts for situations where localhost NFS mounts had to be used before. Jail or chroot construction is now utterly trivial. With-input-from: Joerg Sonnenberger <firstname.lastname@example.org>
Rename malloc->kmalloc, free->kfree, and realloc->krealloc. Pass 1
VNode sequencing and locking - part 3/4. VNode aliasing is handled by the namecache (aka nullfs), so there is no longer a need to have VOP_LOCK, VOP_UNLOCK, or VOP_ISSLOCKED as 'VOP' functions. Both NFS and DEADFS have been using standard locking functions for some time and are no longer special cases. Replace all uses with native calls to vn_lock, vn_unlock, and vn_islocked. We can't have these as VOP functions anyhow because of the introduction of the new SYSLINK transport layer, since vnode locks are primarily used to protect the local vnode structure itself.
Remove several layers in the vnode operations vector init code. Declare the operations vector directly instead of via a descriptor array. Remove most of the recalculation code, it stopped being needed over a year ago. This work is similar to what FreeBSD now does, but was developed along a different line. Ultimately our vop_ops will become SYSLINK ops for userland VFS and clustering support.
Add kernel syscall support for explicit blocking and non-blocking I/O regardless of the setting applied to the file pointer. send/sendmsg/sendto/recv/recvmsg/recvfrom: New MSG_ flags defined in sys/socket.h may be passed to these functions to override the settings applied to the file pointer on a per-I/O basis. MSG_FBLOCKING - Force the operation to be blocking MSG_FNONBLOCKING - Force the operation to be non-blocking pread/preadv/pwrite/pwritev: These system calls have been renamed and wrappers will be added to libc. The new system calls are prefixed with a double underscore (like getcwd vs __getcwd) and include an additional flags argument. The new flags are defined in sys/fcntl.h and may be used to override settings applied to the file pointer on a per-I/O basis. Additionally, the internal __ versions of these functions now accept an offset of -1 to mean 'degenerate into a read/readv/write/writev' (i.e. use the offset in the file pointer and update it on completion). O_FBLOCKING - Force the operation to be blocking O_FNONBLOCKING - Force the operation to be non-blocking O_FAPPEND - Force the write operation to append (to a regular file) O_FOFFSET - (implied if the offset != -1) - offset is valid O_FSYNCWRITE - Force a synchronous write O_FASYNCWRITE - Force an asynchronous write O_FUNBUFFERED - Force an unbuffered operation (O_DIRECT) O_FBUFFERED - Force a buffered operation (negate O_DIRECT) If the flags do not specify an operation (e.g. neither FBLOCKING nor FNONBLOCKING is set), then the settings in the file pointer are used. The original system calls will become wrappers in libc, without the flags arguments. The new system calls will be made available to libc_r to allow it to perform non-blocking I/O without having to mess with a descriptor's file flags. NOTE: the new __pread and __pwrite system calls are backwards compatible with the originals due to a pad byte that libc always set to 0.
The new __preadv and __pwritev system calls are NOT backwards compatible, but since they were added to HEAD just two months ago I have decided to not renumber them either. NOTE: The subrev has been bumped to 1.5.4 and installworld will refuse to install if you are not running at least a 1.5.4 kernel.
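The override rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the kernel code: the `TOY_` constants and their bit values are hypothetical (the real flags live in sys/fcntl.h), and letting the blocking flag win when both override flags are set is an assumption the commit message does not specify.

```c
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
#define TOY_O_NONBLOCK      0x0004u  /* setting stored in the file pointer */
#define TOY_O_FBLOCKING     0x0100u  /* per-I/O: force blocking */
#define TOY_O_FNONBLOCKING  0x0200u  /* per-I/O: force non-blocking */

/*
 * Decide whether one I/O should be non-blocking.  A per-I/O flag
 * overrides the file pointer; with no per-I/O flag the file pointer
 * setting is used.  (Assumption: blocking wins if both are passed.)
 */
static bool
toy_io_is_nonblocking(unsigned file_flags, unsigned io_flags)
{
    if (io_flags & TOY_O_FBLOCKING)
        return false;
    if (io_flags & TOY_O_FNONBLOCKING)
        return true;
    return (file_flags & TOY_O_NONBLOCK) != 0;
}

/*
 * An offset of -1 means "behave like read/write": use the offset
 * stored in the file pointer instead of an explicit one.
 */
static long long
toy_effective_offset(long long requested, long long file_pos)
{
    return (requested == -1) ? file_pos : requested;
}
```

Because the override is evaluated per call, a threading library could issue a non-blocking read on a descriptor without touching the flags other users of that descriptor see.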
Remove the thread argument from all mount->vfs_* function vectors, replacing it with a ucred pointer when applicable. This cleans up a considerable amount of VFS function code that previously delved into the process structure to get the cred, though some code remains. Get rid of the compatibility thread argument for hpfs and nwfs. Our lockmgr calls are now mostly compatible with NetBSD (which doesn't use a thread argument either). Get rid of some complex junk in fdesc_statfs() that nobody uses. Remove the thread argument from dounmount() as well as various other filesystem specific procedures (quota calls primarily) which no longer need it due to the lockmgr, VOP, and VFS cleanups. These cleanups also have the effect of making the VFS code slightly less dependent on the calling thread's context.
The thread/proc pointer argument in the VFS subsystem originally existed for... well, I'm not sure *WHY* it originally existed when most of the time the pointer couldn't be anything other than curthread or curproc or the code wouldn't work. This is particularly true of lockmgr locks. Remove the pointer argument from all VOP_*() functions, all fileops functions, and most ioctl functions.
Remove the thread_t argument from vfs_busy() and vfs_unbusy(). Passing a thread_t to these functions has always been questionable at best.
Simplify vn_lock(), VOP_LOCK(), and VOP_UNLOCK() by removing the thread_t argument. These calls now always use the current thread as the lockholder. Passing a thread_t to these functions has always been questionable at best.
Due to continuing issues with VOP_READ/VOP_WRITE ops being called without a VOP_OPEN, particularly by NFS, redo the way VM objects are associated with vnodes. * The size of the object is now passed to vinitvmio(). vinitvmio() no longer calls VOP_GETATTR(). * Instead of trying to call vinitvmio() conditionally in various places, we now call it unconditionally when a vnode is instantiated if the filesystem at any time in the future intends to use the buffer cache to access that vnode's dataspace. * Specfs 'disk' devices are an exception. Since we cannot safely do I/O on such vnodes if they have not been VOP_OPEN()'ed anyhow, the VM objects for those vnodes are still only associated on open. The performance impact is limited to the case where large numbers of vnodes are being created and destroyed. This case only occurs when a large directory topology (number of files > kernel's vnode cache) is traversed and all related inodes are cached by the system. Being a pure-cpu case the slight loss of performance due to the VM object allocations is not really a big deal.
Correct some minor bugs in the last patch to fix kernel compilation.
Remove NQNFS support. The mechanisms are too crude to co-exist with upcoming cache coherency management work and the original implementation hacked up the NFS code pretty severely. Move nqnfs_clientd() out of nfs_nqlease.c to a new file, nfs_kerb.c, and rename it nfs_clientd().
Make the entire BUF/BIO system BIO-centric instead of BUF-centric. Vnode and device strategy routines now take a BIO and must pass that BIO to biodone(). All code which previously managed a BUF undergoing I/O now manages a BIO. The new BIO-centric algorithms allow BIOs to be stacked, where each layer represents a block translation, completion callback, or caller or device private data. This information is no longer overloaded within the BUF. Translation layer linkages remain intact as a 'cache' after I/O has completed. The VOP and DEV strategy routines no longer make assumptions as to which translated block number applies to them. They use the block number in the BIO specifically passed to them. Change the 'untranslated' constant to NOOFFSET (for bio_offset), and (daddr_t)-1 (for bio_blkno). Rip out all code that previously set the translated block number to the untranslated block number to indicate that the translation had not been made. Rip out all the cluster linkage fields for clustered VFS and clustered paging operations. Clustering now occurs in a private BIO layer using private fields within the BIO. Reformulate the vn_strategy() and dev_dstrategy() abstraction(s). These routines no longer assume that bp->b_vp == the vp of the VOP operation, and the dev_t is no longer stored in the struct buf. Instead, only the vp passed to vn_strategy() (and related *_strategy() routines for VFS ops), and the dev_t passed to dev_dstrategy() (and related *_strategy() routines for device ops) is used by the VFS or DEV code. This will allow an arbitrary number of translation layers in the future. Create an independent per-BIO tracking entity, struct bio_track, which is used to determine when I/O is in-progress on the associated device or vnode. NOTE: Unlike FreeBSD's BIO work, our struct BUF is still used to hold the fields describing the data buffer, resid, and error state. Major-testing-by: Stefan Krueger
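The idea of stacked BIO layers can be sketched as below. This is a toy model under stated assumptions, not the kernel's struct bio: every `toy_` name is hypothetical, each layer carries only a translated offset and an optional completion callback, and completion simply walks from the lowest layer back toward the caller invoking each callback it finds (the real biodone() logic is more involved).

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the 'untranslated' marker on an offset. */
#define TOY_NOOFFSET ((int64_t)-1)

struct toy_bio {
    struct toy_bio *prev;               /* layer above, closer to the caller */
    int64_t         offset;             /* translated offset or TOY_NOOFFSET */
    void          (*done)(struct toy_bio *); /* optional completion callback */
};

/* Example callback: just counts how many layers were completed. */
static int toy_done_calls;

static void
toy_count_done(struct toy_bio *bio)
{
    (void)bio;
    toy_done_calls++;
}

/*
 * Set up 'lower' as a new layer beneath 'upper'.  The translation is a
 * plain base offset here; an untranslated upper layer stays untranslated.
 */
static struct toy_bio *
toy_push_bio(struct toy_bio *upper, struct toy_bio *lower, int64_t base)
{
    lower->prev = upper;
    lower->done = NULL;
    lower->offset = (upper->offset == TOY_NOOFFSET)
        ? TOY_NOOFFSET : upper->offset + base;
    return lower;
}

/* Complete the I/O: run callbacks from the lowest layer upward. */
static void
toy_biodone(struct toy_bio *bio)
{
    for (; bio != NULL; bio = bio->prev) {
        if (bio->done != NULL)
            bio->done(bio);
    }
}
```

Each strategy routine in this model only looks at the offset in the layer handed to it, which is the property the commit describes: no layer has to guess which translation applies.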
Bring in the parallel route table code and clean up ARP. The route table is now replicated across all cpus (ncpus, not ncpus2). Note that cloned routes are not replicated. This removes one of the few remaining obstacles to being able to run the network protocol stacks without the BGL. Primary-Design-by: Jeffrey Hsu Work-by: Jeffrey Hsu and Matthew Dillon
Add an argument to vfs_add_vnodeops() to specify VVF_* flags for the vop_ops structure. Add a new flag called VVF_SUPPORTS_FSMID to indicate filesystems which support persistent storage of FSMIDs. Rework the FSMID code a bit to reduce overhead. Use the spare field in the UFS inode structure to implement a persistent FSMID. The FSMID is recursively marked in the namecache but not adjusted until the next getattr() call on the related inode(s), or when the vnode is reclaimed.
MFC 1.29 - fix missing splx() calls (in HEAD they are crit_enter/exit calls).
Cleanup the module build and conditionalize a goto label.
Give the kernel a native NFS mount rpc capability for mounting NFS roots by splitting off the mount rpc code from the BOOTP code. The loader is no longer required to pass the nfs root mount file handle to the kernel. Pure tftp-based loaders with no knowledge of NFS can now pass a NFS root mount path to the kernel without having to pass a resolved NFS file handle. This change allows kernels booted from tftp loaders to have an NFS root without having to specify BOOTP (which sometimes doesn't work properly when done from both the loader and from the kernel).
Remove old #if 0'd sections of code, add a few comments, and report a bit more information when mounting an NFS root.
Add a missing crit_exit(), fixing a panic. Attempt to continue with the mount even if we cannot add the default route (since the NFS server might be on the LAN). Print out useful information on the console such as the interface chosen and the mount point. Reported-by: Scott Ullrich <email@example.com> (critical section panic)
Clean the VFS operations vector and related code: * take advantage of C99 sparse structure initialisation, this allows us to initialise left out vfsops entries cleanly when vfs_register() is called; any vfsop entries that are not specified will be assigned vfs_std* functions. the only exception to this rule is VFS_SYNC which is assigned vfs_stdnosync() since a file system may not have support for it. file systems can simply assign vfs_stdsync if they do not have their own sync operation. * add KKASSERTS to make sure that the VFS_ROOT, VFS_MOUNT and VFS_UNMOUNT vfs operations are provided by a file system being registered. all of the above are necessary to ensure a minimally working file system. * remove scattered no-op definitions of VFS_START() vfsop vector entry and take advantage of sparse vfsop initialisation. VFS_START is only used by MFS to ensure the calling process is not swapped out when I/O is initialised. The entry point is called from the mount path, before the file system is marked ready. * remove scattered no-op definitions of VFS_QUOTACTL() vfsop vector entry and take advantage of sparse vfsop initialisation. * give UFS a VFS_UNINIT vfsop entry and make use of it in ext2fs when ripping down the hash tables. * many file systems in the kernel seem to not implement the complementing VFS_UNINIT() vfsop entry, this is not so much of a problem when the file system is compiled into the kernel, but it can leave leakage when compiled as KLD modules. add uninitialisation code and entry points for ext2fs, ufs, fdescfs. grab the ufs_ihash_token when free'ing the inode hash table at ripping time. * add typedefs for all the vfsop entry points, make use of it in definition of struct vfsops; this results in clean and consolidated code. use the typedefs for vfs_std* function prototypes.
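The sparse-initialisation pattern from this commit can be sketched as below. All `toy_` and `myfs_` names are hypothetical stand-ins, not the kernel's struct vfsops or vfs_register(); the sketch returns an error where the kernel asserts with KKASSERT.

```c
#include <stddef.h>

/* Toy operations vector with only a handful of entries. */
struct toy_vfsops {
    int (*mount)(void);
    int (*unmount)(void);
    int (*root)(void);
    int (*start)(void);
    int (*sync)(void);
};

/* Toy defaults, standing in for the vfs_std* family. */
static int toy_std_start(void)  { return 0; }
static int toy_std_nosync(void) { return 0; }

/* A file system supplies only the entries it actually implements. */
static int myfs_mount(void)   { return 0; }
static int myfs_unmount(void) { return 0; }
static int myfs_root(void)    { return 0; }

/* C99 designated initialisers: every unnamed member starts out NULL. */
static struct toy_vfsops myfs_ops = {
    .mount   = myfs_mount,
    .unmount = myfs_unmount,
    .root    = myfs_root,
};

/*
 * Registration fills the gaps with defaults.  mount, unmount and root
 * are mandatory; a missing one is rejected.
 */
static int
toy_vfs_register(struct toy_vfsops *ops)
{
    if (ops->mount == NULL || ops->unmount == NULL || ops->root == NULL)
        return -1;
    if (ops->start == NULL)
        ops->start = toy_std_start;
    if (ops->sync == NULL)
        ops->sync = toy_std_nosync;
    return 0;
}
```

With this shape a file system no longer needs its own no-op entries; leaving a member out is enough to get the default.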
Replace spl with critical sections.
Implement Red-Black trees for the vnode clean/dirty buffer lists. Implement ranged fsyncs and adjust the syncer to use the new capability. This capability will also soon be used to replace the write_behind heuristic. Rewrite the fsync code for all VFSs to use the new APIs (generally simplifying them). Get rid of B_WRITEINPROG, it is no longer useful or needed. Get rid of B_SCANNED, it is no longer useful or needed. Rewrite the NFS 2-phase commit protocol to take advantage of the new Red-Black tree topology. Add RB_SCAN() for callback-scanning of Red-Black trees. Give RB_SCAN the ability to track the 'next' scan node and automatically fix it up if the callback directly or indirectly or through blocking indirectly deletes nodes in the tree while the scan is in progress. Remove most related loop restart conditions, they are no longer necessary. Disable filesystem background bitmap writes. This really needs to be solved a different way and the concept does not work well with red-black trees.
Clean up a number of caching edge cases in NFS, rework the code to be a bit more readable, document some bits, and fix some cache coherency detection issues. The caching cleanups should allow the NFS client to retain more of the NFS cache when doing complex operations on a file. * Properly check and update the mtime using WCC records in the NFS response. This record gives us the 'before' and 'after' mtime. The 'before' mtime must match our existing idea of the mtime, if it doesn't we flag the nfsnode as having been modified by the server. Our notion of the mtime is then set to the 'after' time. This was not being done properly for several edge cases. This required extending the nfsm macros a bit in order to be able to tell loadattrcache how to handle the mtime data. This also required rearranging (really fixing) the sequence in nfs_open(), nfs_write(), etc. * Rearrange the flags a bit. NSIZECHANGED -> NRMODIFIED (nfsnode modified by server), NMODIFIED -> NLMODIFIED (nfsnode modified by client). Do not clear NRMODIFIED until we have actually invalidated the cache (this fixes a problem where programs using mmap() were not properly clearing the cache after a file was modified on the server). * Don't code NRMODIFIED as an exception to NLMODIFIED. Recode the flags so they (mostly) operate in tandem. * When appending to a file, use nfs_flush() instead of nfs_vinvalbuf(). There is no need to destroy our data cache for the file. This makes appends considerably more efficient. * Hopefully fix the last problem associated with attribute timeouts. * Clear the attribute cache when a file is opened for write in nfs_open() BEFORE doing other checks rather than after. * Document some of the nastier cache coherency hacks.
Don't use the statfs field f_mntonname in filesystems. For the userland export code, it can be synthesized from mnt_ncp. For debugging code, use f_mntfromname, it should be enough to find the culprit. The vfs_unmountall doesn't use code_fullpath to avoid problems with resource allocation and to make it more likely that a call from ddb succeeds. Change getfsstat and fhstatfs to not show directories outside a chroot path, with the exception of the filesystem counting the chroot root itself.
VFS messaging/interfacing work stage 10/99: Start adding the journaling, range locking, and (very slightly) cache coherency infrastructure. Continue cleaning up the VOP operations vector. Expand on past commits that gave each mount structure its own set of VOP operations vectors by adding additional vector sets for journaling or cache coherency operations. Remove the vv_jops and vv_cops fields from the vnode operations vector in favor of placing those vop_ops directly in the mount structure. Reorganize the VOP calls as a double-indirect and add a field to the mount structure which represents the current vnode operations set (which will change when e.g. journaling is turned on or off). This creates the infrastructure necessary to allow us to stack a generic journaling implementation on top of a filesystem. Introduce a hard range-locking API for vnodes. This API will be used by high level system/vfs calls in order to handle atomicity guarantees. It is a prerequisite for: (1) being able to break I/O's up into smaller pieces for the vm_page list/direct-to-DMA-without-mapping goal, (2) to support the parallel write operations on a vnode goal, (3) to support the clustered (remote) cache coherency goal, and (4) to support massive parallelism in dispatching operations for the upcoming threaded VFS work. This commit represents only infrastructure and skeleton/API work.
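The WCC handling described in the first two bullets can be sketched as below. This is an illustrative model only: the `toy_` structure, the flag values, and the use of a single integer for the mtime are assumptions, and the real nfsnode and loadattrcache code carry far more state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values mirroring the renamed flags. */
#define TOY_NRMODIFIED 0x1u   /* node modified by the server */
#define TOY_NLMODIFIED 0x2u   /* node modified by the client */

struct toy_nfsnode {
    int64_t  mtime;   /* our current idea of the file's mtime */
    unsigned flags;
};

/*
 * Apply one WCC record.  If the 'before' mtime does not match what we
 * have cached, somebody else changed the file, so flag it.  Either way
 * our notion of the mtime becomes the 'after' time.
 */
static void
toy_apply_wcc(struct toy_nfsnode *np, int64_t before, int64_t after)
{
    if (before != np->mtime)
        np->flags |= TOY_NRMODIFIED;
    np->mtime = after;
}

/*
 * The flag is cleared only once the cache has actually been
 * invalidated, never earlier.  Returns true if an invalidation ran.
 */
static bool
toy_invalidate_if_needed(struct toy_nfsnode *np)
{
    if ((np->flags & TOY_NRMODIFIED) == 0)
        return false;
    /* ...cached data for the file would be dropped here... */
    np->flags &= ~TOY_NRMODIFIED;
    return true;
}
```

When the 'before' time matches, the client knows its own write was the only change and can keep its cached data, which is where the retained-cache benefit comes from.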
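The double-indirect dispatch can be sketched as below. The structure and field names are hypothetical and reduced to one operation; the point is only that the vnode points at the mount's "current set" field, so flipping that one field changes the operations used by every vnode on the mount.

```c
/* Toy operations vector with a single operation. */
struct toy_vop_ops {
    int (*vop_write)(int nbytes);
};

struct toy_mount {
    struct toy_vop_ops *use_ops;     /* currently active set */
    struct toy_vop_ops *norm_ops;    /* the file system's own ops */
    struct toy_vop_ops *journal_ops; /* journaling layer stacked on top */
};

struct toy_vnode {
    struct toy_vop_ops **v_ops;      /* points at the mount's use_ops field */
};

static int toy_journal_count;

static int
toy_norm_write(int nbytes)
{
    return nbytes;
}

/* The journaling layer records the operation, then passes it down. */
static int
toy_journal_write(int nbytes)
{
    toy_journal_count++;
    return toy_norm_write(nbytes);
}

/* Double indirection: vnode -> mount field -> ops vector. */
static int
toy_vop_write(struct toy_vnode *vp, int nbytes)
{
    return (*vp->v_ops)->vop_write(nbytes);
}

/* Turning journaling on or off touches only the mount. */
static void
toy_set_journaling(struct toy_mount *mp, int on)
{
    mp->use_ops = on ? mp->journal_ops : mp->norm_ops;
}
```

No per-vnode update is needed when the active set changes, which is what makes it practical to enable or disable a stacked layer on a live mount.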
VFS messaging/interfacing work stage 8/99: Major reworking of the vnode interlock and other miscellaneous things. This patch also fixes FS corruption due to prior vfs work in head. In particular, prior to this patch the namecache locking could introduce blocking conditions that confuse the old vnode deactivation and reclamation code paths. With this patch there appear to be no serious problems even after two days of continuous testing. * VX lock all VOP_CLOSE operations. * Fix two NFS issues. There was an incorrect assertion (found by David Rhodus), and the nfs_rename() code was not properly purging the target file from the cache, resulting in Stale file handle errors during, e.g. a buildworld with an NFS-mounted /usr/obj. * Fix a TTY session issue. Programs which open("/dev/tty" ,...) and then run the TIOCNOTTY ioctl were causing the system to lose track of the open count, preventing the tty from properly detaching. This is actually a very old BSD bug, but it came out of the woodwork in DragonFly because I am now attempting to track device opens explicitly. * Gets rid of the vnode interlock. The lockmgr interlock remains. * Introduced VX locks, which are mandatory vp->v_lock based locks. * Rewrites the locking semantics for deactivation and reclamation. (A ref'd VX lock'd vnode is now required for vgone(), VOP_INACTIVE, and VOP_RECLAIM). New guarantees emplaced with regard to vnode ripouts. * Recodes the mountlist scanning routines to close timing races. * Recodes getnewvnode to close timing races (it now returns a VX locked and refd vnode rather than a refd but unlocked vnode). * Recodes VOP_REVOKE- a locked vnode is now mandatory. * Recodes all VFS inode hash routines to close timing holes. * Removes cache_leaf_test() - vnodes representing intermediate directories are now held so the leaf test should no longer be necessary.
* Splits the over-large vfs_subr.c into three additional source files, broken down by major function (locking, mount related, filesystem syncer). * Changes splvm() protection to a critical-section in a number of places (bleedover from another patch set which is also about to be committed). Known issues not yet resolved: * Possible vnode/namecache deadlocks. * While most filesystems now use vp->v_lock, I haven't done a final pass to make vp->v_lock mandatory and to clean up the few remaining inode based locks (nwfs I think and other obscure filesystems). * NullFS gets confused when you hit a mount point in the underlying filesystem. * Only UFS and NFS have been well tested * NFS is not properly timing out namecache entries, causing changes made on the server to not be properly detected on the client if the client already has a negative-cache hit for the filename in question. Testing-by: David Rhodus <firstname.lastname@example.org>, Peter Kadau <email@example.com>, walt <firstname.lastname@example.org>, others
VFS messaging/interfacing work stage 7/99. BEGIN DESTABILIZATION! Implement the infrastructure required to allow us to begin switching to the new nlookup() VFS API. filedesc->fd_ncdir, fd_nrdir, fd_njdir File descriptors (associated with processes) now record the namecache pointer related to the current directory, root directory, and jail directory, in addition to the vnode pointers. These pointers are used as the basis for the new path lookup code (nlookup() and friends). file->f_ncp File pointers may now have a referenced+unlocked namecache pointer associated with them. All fp's representing directories have this attached. This allows fchdir() to properly record the ncp in fdp->fd_ncdir and friends. mount->mnt_ncp The namecache topology for crossing a mount point works as follows: when looking up a path element which is a mount point, cache_nlookup() will locate the ncp for the vnode-under the mount point. mount->mnt_ncp represents the root of the mount, that is the vnode-over. nlookup() detects the mount point and accesses mount->mnt_ncp to skip past the vnode-under. When going backwards (..), nlookup() detects the case and skips backwards. The ncp linkages are: ncp->ncp->ncp[vnode_under]->ncp[vnode_over]. That is, when going forwards or backwards nlookup must explicitly skip over the double-ncp when crossing a mount point. This allows us to keep the namecache topology intact across mount points. NEW CACHE level API functions: cache_get() Reference and lock a namecache entry cache_put() Dereference and unlock a namecache entry cache_lock() lock an already-referenced namecache entry cache_unlock() unlock a locked namecache entry NOTE: namecache locks are exclusive and recursive. These are the 'namespace' locks that we will be using to guarantee namespace operations such as in a CREATE, RENAME, or REMOVE. vfs_cache_setroot() Set the new system-wide root directory cache_allocroot() System bootstrap helper function to allocate the root namecache node.
cache_resolve() Resolve a NCF_UNRESOLVED namecache node. The namecache node should be locked on call. cache_setvp() (resolver) associate a VP or create a negative cache entry representation for a namecache pointer and clear NCF_UNRESOLVED. The namecache node should be locked on call. cache_setunresolved() Revert a resolved namecache entry back to an unresolved state, disassociating any vnode but leaving the topology intact. The namecache node should be locked on call. cache_vget() Obtain the locked+refd vnode related to a namecache entry, resolving the entry if necessary. Return ENOENT if the entry represents a negative cache hit. cache_vref() Obtain a refd (not locked) vnode related to a namecache entry, as above. cache_nlookup() The new namecache lookup routine. This routine does a lookup and allocates a new namecache node (into an unresolved state) if necessary. Returns a namecache record whether or not the item can be found and whether or not it represents a positive or negative hit. cache_lookup() OLD API CODE DEPRECATED, but must be maintained until everything has been converted over. cache_enter() OLD API CODE DEPRECATED, but must be maintained until everything has been converted over. NEW default VOPs vop_noresolve() Implements a namecache resolver for VFSs which are still using the old VOP_LOOKUP/ VOP_CACHEDLOOKUP API (which is all of them still). VOP_LOOKUP OLD API CODE DEPRECATED, but must be maintained until everything has been converted over. VOP_CACHEDLOOKUP OLD API CODE DEPRECATED, but must be maintained until everything has been converted over. NEW PATHNAME LOOKUP CODE nlookup_init() Similar to NDINIT, initialize a nlookupdata structure for nlookup() and nlookup_done(). nlookup() Lookup a path. Unlike the old namei/lookup code the new lookup code does not do any fancy pre-disposition of the cache for create/delete, it simply looks up the requested path and returns the appropriate locked namecache pointer.
The caller can obtain the vnode and directory vnode, as applicable, from the one namecache structure that is returned. Access checks are done on directories leading up to the result but not done on the returned namecache node. nlookup_done() Mandatory routine to cleanup a nlookupdata structure after it has been initialized and all operations have been completed on it. nlookup_simple() (in progress) all-in-one wrapped new lookup. nlookup_mp() helper call for resolving a mount point's glue NCP. hackish, will be cleaned up later. nreadsymlink() helper call to resolve a symlink. Note that the namecache does not yet cache symlink data but the intention is to eventually do so to avoid having to do VFS ops to get the data. naccess() Perform access checks on a namecache node given a mode and cred. naccess_va() Perform access checks on a vattr given a mode and cred. Begin switching VFS operations from using namei to using nlookup. In this batch: * mount (install mnt_ncp for cross-mount-point handling in nlookup, simplify the vfs_mount() API to no longer pass a nameidata structure) * [l]stat (use nlookup) * [f]chdir (use nlookup, use recorded f_ncp) * [f]chroot (use nlookup, use recorded f_ncp)
VFS messaging/interfacing work stage 2/99. This stage retools the vnode ops vector dispatch, making the vop_ops a per-mount structure rather than a per-filesystem structure. Filesystem mount code, typically in blah_vfsops.c, must now register various vop_ops pointers in the struct mount to compile its VOP operations set. This change will allow us to begin adding per-mount hooks to VFSes to support things like kernel-level journaling, various forms of cache coherency management, and so forth. In addition, the vop_*() calls now require a struct vop_ops pointer as the first argument instead of a vnode pointer (note: in this commit the VOP_*() macros currently just pull the vop_ops pointer from the vnode in order to call the vop_*() procedures). This change is intended to allow us to divorce ourselves from the requirement that a vnode pointer always be part of a VOP call. In particular, this will allow namespace based routines such as remove(), mkdir(), stat(), and so forth to pass namecache pointers rather than locked vnodes and is a very important precursor to the goal of using the namecache for namespace locking.
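The mount-point crossing rule (ncp->ncp->ncp[vnode_under]->ncp[vnode_over]) can be modelled with a toy structure as below. Everything here is hypothetical apart from the mnt_ncp name taken from the commit text: the real namecache entries carry names, vnodes, locks and reference counts, none of which are modelled.

```c
#include <stddef.h>

struct toy_mount;

struct toy_ncp {
    struct toy_ncp   *parent;
    struct toy_mount *mounted_here; /* set on the vnode-under of a mount */
    struct toy_mount *root_of;      /* set on the vnode-over (mount root) */
};

struct toy_mount {
    struct toy_ncp *mnt_ncp;        /* root of the mount, the vnode-over */
};

/*
 * Going forwards: a lookup that lands on a mount point skips past the
 * vnode-under to the root of the mounted file system.
 */
static struct toy_ncp *
toy_cross_forward(struct toy_ncp *ncp)
{
    while (ncp->mounted_here != NULL)
        ncp = ncp->mounted_here->mnt_ncp;
    return ncp;
}

/*
 * Going backwards (".."): from a mount root, first step to the
 * vnode-under, then take its parent.  The system root, which has no
 * parent, resolves to itself.
 */
static struct toy_ncp *
toy_dotdot(struct toy_ncp *ncp)
{
    while (ncp->root_of != NULL && ncp->parent != NULL)
        ncp = ncp->parent;
    return (ncp->parent != NULL) ? ncp->parent : ncp;
}
```

Because the vnode-over is linked beneath the vnode-under, the parent chain is never broken at a mount point; the lookup code just has to hop over the doubled entry in both directions.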
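A rough sketch of the calling convention described here, with a per-mount operations vector and the vector passed as the first argument. The names and the single getattr-style operation are hypothetical simplifications, not the real vop_ops layout.

```c
struct toy_vnode;

/* Each mount owns its own copy of the operations vector. */
struct toy_vop_ops {
    int (*vop_getattr)(struct toy_vop_ops *ops, struct toy_vnode *vp);
};

struct toy_mount {
    struct toy_vop_ops ops;
};

struct toy_vnode {
    struct toy_vop_ops *v_ops;  /* the owning mount's vector */
    int                 size;
};

/* The file system's implementation of the operation. */
static int
toy_fs_getattr(struct toy_vop_ops *ops, struct toy_vnode *vp)
{
    (void)ops;
    return vp->size;
}

/* A per-mount hook that one mount may install without affecting others. */
static int
toy_hooked_getattr(struct toy_vop_ops *ops, struct toy_vnode *vp)
{
    (void)ops;
    (void)vp;
    return -1;
}

/* Mount code registers the ops into its own struct mount. */
static void
toy_mount_init(struct toy_mount *mp)
{
    mp->ops.vop_getattr = toy_fs_getattr;
}

/* The procedure takes the ops vector first, not the vnode. */
static int
toy_vop_getattr(struct toy_vop_ops *ops, struct toy_vnode *vp)
{
    return ops->vop_getattr(ops, vp);
}

/* The macro just pulls the vector from the vnode for now. */
#define TOY_VOP_GETATTR(vp) toy_vop_getattr((vp)->v_ops, (vp))
```

Since the vector is an explicit argument, a caller that holds something other than a vnode could supply the vector from elsewhere, which is the decoupling the commit is preparing for.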
Remove the canwait argument to dup_sockaddr(). Callers of dup_sockaddr() all assume that it just works, so it really has to work. Since interrupts are now threads we can use M_INTWAIT. While it is possible that a memory deadlock issue exists here (e.g. if swapping over NFS), it isn't likely in this case.
Device layer rollup commit. * cdevsw_add() is now required. cdevsw_add() and cdevsw_remove() may specify a mask/match indicating the range of supported minor numbers. Multiple cdevsw_add()'s using the same major number, but distinctly different ranges, may be issued. All devices that failed to call cdevsw_add() before now do. * cdevsw_remove() now automatically marks all devices within its supported range as being destroyed. * vnode->v_rdev is no longer resolved when the vnode is created. Instead, only v_udev (a newly added field) is resolved. v_rdev is resolved when the vnode is opened and cleared on the last close. * A great deal of code was making rather dubious assumptions with regards to the validity of devices associated with vnodes, primarily due to the persistence of a device structure due to being indexed by (major, minor) instead of by (cdevsw, major, minor). In particular, if you run a program which connects to a USB device and then you pull the USB device and plug it back in, the vnode subsystem will continue to believe that the device is open when, in fact, it isn't (because it was destroyed and recreated). In particular, note that all the VFS mount procedures now check devices via v_udev instead of v_rdev prior to calling VOP_OPEN(), since v_rdev is NULL prior to the first open. * The disk layer's device interaction has been rewritten. The disk layer (i.e. the slice and disklabel management layer) no longer overloads its data onto the device structure representing the underlying physical disk. Instead, the disk layer uses the new cdevsw_add() functionality to register its own cdevsw using the underlying device's major number, and simply does NOT register the underlying device's cdevsw. No confusion is created because the device hash is now based on (cdevsw,major,minor) rather then (major,minor). 
NOTE: This also means that underlying raw disk devices may use the entire device minor number instead of having to reserve the bits used by the disk layer, and also means that can we (theoretically) stack a fully disklabel-supported 'disk' on top of any block device. * The new reference counting scheme prevents this by associating a device with a cdevsw and disconnecting the device from its cdevsw when the cdevsw is removed. Additionally, all udev2dev() lookups run through the cdevsw mask/match and only successfully find devices still associated with an active cdevsw. * Major work on MFS: MFS no longer shortcuts vnode and device creation. It now creates a real vnode and a real device and implements real open and close VOPs. Additionally, due to the disk layer changes, MFS is no longer limited to 255 mounts. The new limit is 16 million. Since MFS creates a real device node, mount_mfs will now create a real /dev/mfs<PID> device that can be read from userland (e.g. so you can dump an MFS filesystem). * BUF AND DEVICE STRATEGY changes. The struct buf contains a b_dev field. In order to properly handle stacked devices we now require that the b_dev field be initialized before the device strategy routine is called. This required some additional work in various VFS implementations. To enforce this requirement, biodone() now sets b_dev to NODEV. The new disk layer will adjust b_dev before forwarding a request to the actual physical device. * A bug in the ISO CD boot sequence which resulted in a panic has been fixed. Testing by: lots of people, but David Rhodus found the most aggregious bugs.
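The mask/match idea can be sketched as a small lookup table. This is a toy model: the structure, the `active` flag and the lookup helper are hypothetical, and the test `(minor & mask) == match` is an assumption about how a mask/match pair selects a range of minor numbers rather than a quote of the kernel code.

```c
#include <stddef.h>

/* One registration for a given major number. */
struct toy_cdevsw_link {
    unsigned mask;    /* which minor-number bits select this entry */
    unsigned match;   /* required value of those bits */
    int      id;      /* stand-in for the cdevsw pointer */
    int      active;  /* cleared when the entry is removed */
};

/*
 * Find the registration that claims 'minor'.  Entries that have been
 * removed are skipped, so a stale device can no longer be found.
 * Returns the entry's id, or -1 if nothing claims the minor number.
 */
static int
toy_cdevsw_lookup(const struct toy_cdevsw_link *tab, size_t n,
    unsigned minor)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].active && (minor & tab[i].mask) == tab[i].match)
            return tab[i].id;
    }
    return -1;
}
```

Two registrations with the same major number can coexist as long as their mask/match pairs select disjoint sets of minor numbers, which is how a layer such as the disk code can claim only the part of the minor space it manages.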
Peter Edwards brought up an interesting NFS bug which we both originally thought would be a fairly straightforward bug fix. But it turns out to require a nasty hack to fix. The issue is that near the file EOF NFS uses piecemeal writes and piecemeal buffer cache buffers. The result is that manipulation through the buffer cache only sets some of the m->valid bits in the associated vm_page(s). This case may also occur in the middle of a file if for example a file is piecemeal written and then ftruncated to be much larger (or lseek/write at a much higher seek position). The nfs_getpages() routine was assuming that if m->valid was non-0, the page is basically valid and no read rpc is required to fill it. The problem is that if you mmap() a piecemeal VM page and fault it in, m->valid is set to VM_PAGE_BITS_ALL (0xFF). Then, later, when NFS flushes the buffer cache, only some of the m->valid bits are cleared (e.g. leaving 0xFC). A later page fault will cause NFS to believe that the page is sufficiently valid and vm_fault will then zero-out the first X bytes of the page when, in fact, we really should have done an I/O to refill those X bytes. The fix in PR misc/64816 (FreeBSD) tried to solve this by checking to see if the m->valid bits were 'sufficiently valid' in the file EOF case but testing with fsx resulted in several failure modes. This doesn't work because (1) if you extend the file w/ ftruncate or lseek/write these partially valid pages can end up in the middle of the file rather than just at the end and (2) there may be a dirty buffer associated with these pages, meaning that the pages may contain dirty data, and we cannot safely overwrite the pages with a new read I/O. The solution in this patch is to deal with the screwy m->valid bit clearing by special-casing NFS and then having the BIO system clear ALL the m->valid bits instead of just some of them when NFS calls vinvalbuf().
That way m->valid will be set to 0 when the buffer is invalidated and the nfs_getpages() code can be left doing its simple 'if any m->valid bits are set assume the whole page is valid' test. In order for the BIO system to safely be able to do this (so as not to invalidate portions of a VM page associated with an adjacent buffer), the NFS io size has been further restricted to be an integral multiple of PAGE_SIZE. This is a terrible hack but there is no other way to fix the problem short of rewriting the entire buffer cache. We will do that eventually, but not now. Reported-by: Peter Edwards <email@example.com> Referencing-PR: misc/64816 by Patrick Mackinlay <firstname.lastname@example.org>
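The m->valid problem can be illustrated with a user-space model of the bitmask. This is a sketch under stated assumptions, not the kernel code: it assumes a 4096-byte page tracked in 512-byte chunks (which is what the 0xFF value for a fully valid page suggests), every sk_ name is hypothetical, and rounding the I/O size down with a one-page minimum is just one way to obtain an integral multiple of the page size.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_PAGE_SIZE  4096u  /* assumed page size */
#define SK_CHUNK_SIZE  512u  /* assumed bytes covered by one valid bit */
#define SK_BITS_ALL   0xFFu  /* all eight chunks of the page are valid */

/* Valid bits covering the byte range [off, off + len) within one page. */
static unsigned
sk_page_bits(unsigned off, unsigned len)
{
    assert(len > 0 && off + len <= SK_PAGE_SIZE);
    unsigned first = off / SK_CHUNK_SIZE;
    unsigned last = (off + len - 1) / SK_CHUNK_SIZE;
    unsigned bits = 0;

    for (unsigned i = first; i <= last; i++)
        bits |= 1u << i;
    return bits;
}

/* The simple test: a read RPC is issued only when no valid bit is set. */
static bool
sk_needs_read(unsigned valid)
{
    return valid == 0;
}

/* Old behaviour: invalidating a piecemeal buffer clears only its own bits. */
static unsigned
sk_invalidate_partial(unsigned valid, unsigned off, unsigned len)
{
    return valid & ~sk_page_bits(off, len);
}

/* Behaviour after the commit (NFS only): invalidation clears every bit. */
static unsigned
sk_invalidate_all(unsigned valid)
{
    (void)valid;
    return 0;
}

/* Restrict an I/O size to a whole number of pages, at least one page. */
static unsigned
sk_round_iosize(unsigned iosize)
{
    unsigned rounded = iosize - iosize % SK_PAGE_SIZE;

    return rounded ? rounded : SK_PAGE_SIZE;
}
```

Starting from SK_BITS_ALL, sk_invalidate_partial(SK_BITS_ALL, 0, 1024) leaves 0xFC, which the simple test still treats as valid even though the first 1024 bytes are stale. sk_invalidate_all() leaves 0, so the same test now requests a read.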
Remove the VREF() macro and uses of it. Remove uses of 0x20 before ^I inside vnode.h
Protect nfs socket locks with a critical section. Recheck rep->r_mrep just prior to calling tsleep() in case another thread got in and handled the request being waited for. Rewrite the vnode scanning code in nfs_sync() to use vmntvnodescan(), fixing a number of potential races. Protect the commit phase 2 scan in nfs_subs.c with the appropriate token (note: still needs some work).
Allow the nominal NFS io block size to be set with a sysctl vfs.nfs.nfs_io_size and default it to the largest possible block size (32K), regardless of the network transfer size settings. nfs_iosize() is no longer based on the network transfer size but is matched against the maximum data block size for the protocol and transport (8K for NFSv2, 16K for NFSv3/UDP, 32K for NFSv3/TCP). Adjust statfs() reporting to suit. This should improve performance over high bandwidth connections, primarily by causing the client to use larger buffer cache buffers (16K or 32K instead of 8K prior to this commit), and also improving read-ahead (which goes by blocks). In particular, while the largest network transfer size over UDP is 16K, the largest transfer size of TCP is 32K, so TCP ought to reap the largest reward w/ this commit. Delay retrieval of mountpoint attributes until the mountpoint is actually accessed or a df() occurs. Submitted-by: Hiten Pandya <email@example.com>, and Matthew Dillon
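The block size selection described in this commit can be sketched as a clamp of the configured size against the per-protocol maximum. The limits (8K, 16K, 32K) come from the commit message; reading "matched against" as a simple minimum, and the names sk_max_block and sk_iosize, are assumptions made for illustration and are not the actual nfs_iosize() code.

```c
#include <assert.h>
#include <stdbool.h>

enum { SK_KB = 1024 };

/* Maximum data block size for a protocol/transport pair, per the commit text. */
static int
sk_max_block(bool nfsv3, bool tcp)
{
    if (!nfsv3)
        return 8 * SK_KB;               /* NFSv2 */
    return tcp ? 32 * SK_KB : 16 * SK_KB; /* NFSv3 over TCP or UDP */
}

/*
 * Clamp the configured I/O size (for example the sysctl default of 32K)
 * to what the mount's protocol and transport can carry.
 */
static int
sk_iosize(int configured, bool nfsv3, bool tcp)
{
    assert(configured > 0);
    int limit = sk_max_block(nfsv3, tcp);

    return configured < limit ? configured : limit;
}
```

Under these assumptions the 32K default stays 32K on an NFSv3/TCP mount, drops to 16K on NFSv3/UDP, and to 8K on NFSv2, while a smaller configured value is passed through unchanged.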
Newtoken commit. Change the token implementation as follows: (1) Obtaining a token no longer enters a critical section. (2) Tokens can be held through scheduler switches and blocking conditions and are effectively released and reacquired on resume. Thus tokens serialize access only while the thread is actually running. Serialization is not broken by preemptive interrupts. That is, interrupt threads which preempt do not release the preempted thread's tokens. (3) Unlike spl's, tokens will interlock w/ interrupt threads on the same or on a different cpu. The vnode interlock code has been rewritten and the API has changed. The mountlist vnode scanning code has been consolidated and all known races have been fixed. The vnode interlock is now a pool token. The code that frees unreferenced vnodes whose last VM page has been freed has been moved out of the low level vm_page_free() code and moved to the periodic filesystem syncer code in vfs_msync(). The SMP startup code and the IPI code has been cleaned up considerably. Certain early token interactions on AP cpus have been moved to the BSP. The LWKT rwlock API has been cleaned up and turned on. Major testing by: David Rhodus
Add prototype for bootpc_init
This commit represents a major revamping of the clock interrupt and timebase infrastructure in DragonFly. (missing nfs files)
Misc cleanups to take care of GCC3.x warnings. Missing 'U' and 'LL' postfixes on large unsigned or 64 bit constants, non-storage structural declarations embedded in structures, deprecated use of __FUNCTION__, missing 'break' statements in the last switch case, goto label ops where the label occurs just before an end-brace (many of which appear to be fixable with 'break' or 'continue' instead and existed simply due to programmer-paranoia), garbage data in #endif lines that was not commented out. GCC3 also caught some argument count issues in kernel printfs. Many of these (obvious) fixes are similar to or copied from 5.x. Also fix a few other minor issues such as certain drivers declaring a proc pointer instead of a thread pointer. Move -ffreestanding from CWARNFLAGS to CFLAGS. It doesn't belong in CWARNFLAGS.
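A few of the warning classes listed in this commit are easier to recognize in code. The fragment below is an illustrative sketch, not taken from the tree, and all sk_ names are hypothetical. It shows explicit 'U' and 'ULL' suffixes on large constants, the standard __func__ in place of the deprecated __FUNCTION__, and an explicit 'break' in the last switch case.

```c
#include <assert.h>
#include <stdint.h>

/* Explicit suffixes make the type of a large constant unambiguous. */
static const uint32_t sk_high_bit = 0x80000000U;
static const uint64_t sk_four_gig = 0x100000000ULL;

/* A 64-bit shift needs a 64-bit operand, hence 1ULL rather than 1. */
static uint64_t
sk_bit64(unsigned n)
{
    assert(n < 64);
    return 1ULL << n;
}

/* __func__ is the C99 spelling of the function-name identifier. */
static const char *
sk_whoami(void)
{
    return __func__;
}

static int
sk_classify(int code)
{
    int result;

    switch (code) {
    case 0:
        result = 10;
        break;
    default:
        result = 20;
        break;  /* explicit break, even in the last case */
    }
    return result;
}
```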
Data reads and writes should not need credentials, and most filesystems ignore the ucred argument. NFS does, though, and FreeBSD-4.x had some terrible hacks to associate credentials with data that only 'mostly' worked. There were VM paging and buffer reconstitution cases which broke the credentials even in 4.x. The hacks were removed from DragonFly during the VFS messaging reorganization. In DragonFly credentials are checked on open() but no credentials are required for read and write ops. I had NFS just use the 'root' credential for the RPC. However, this breaks NFS mounts which do not use the -maproot (server side) directive. Really the bug is on the server side, but to maintain general compatibility with NFS servers we have to provide a non-root credential if root did not issue the I/O. This commit hacks up the NFS code (rather than hacking up the rest of the kernel) to restore the hacks that were previously removed from the kernel. Unfortunately it can lead to a proliferation of ucred structures (FreeBSD-4.x did as well), but that's the price we have to pay for now. Report-by: Galen Sampson <firstname.lastname@example.org>
__P()!=wanted, remove old style prototypes from the vfs subtree
kernel tree reorganization stage 1: Major cvs repository work (not logged as commits) plus a major reworking of the #include's to accommodate the relocations. * CVS repository files manually moved. Old directories left intact and empty (temporary). * Reorganize all filesystems into vfs/, most devices into dev/, sub-divide devices by function. * Begin to move device-specific architecture files to the device subdirs rather than throwing them all into, e.g. i386/include * Reorganize files related to system busses, placing the related code in a new bus/ directory. Also move cam to bus/cam though this may not have been the best idea in retrospect. * Reorganize emulation code and place it in a new emulation/ directory. * Remove the -I- compiler option in order to allow #include file localization, rename all config generated X.h files to use_X.h to clean up the conflicts. * Remove /usr/src/include (or /usr/include) dependencies during the kernel build, beyond what is normally needed to compile helper programs. * Make config create 'machine' softlinks for architecture specific directories outside of the standard <arch>/include. * Bump the config rev. WARNING! after this commit /usr/include and /usr/src/sys/compile/* should be regenerated from scratch.
Register keyword removal Approved by: Matt Dillon
Remove the priority part of the priority|flags argument to tsleep(). Only flags are passed now. The priority was a user scheduler thingy that is not used by the LWKT subsystem. For process statistics assume sleeps without P_SINTR set to be disk-waits, and sleeps with it set to be normal sleeps. This commit should not contain any operational changes.
proc->thread stage 5: BUF/VFS clearance! Remove the ucred argument from vop_close, vop_getattr, vop_fsync, and vop_createvobject. These VOPs can be called from multiple contexts so the cred is fairly useless, and UFS ignores it anyway. For filesystems (like NFS) that sometimes need a cred we use proc0.p_ucred for now. This removal also removed the need for a 'proc' reference in the related VFS procedures, which greatly helps our proc->thread conversion. bp->b_wcred and bp->b_rcred have also been removed, and for the same reason. It makes no sense to have a particular cred when multiple users can access a file. This may create issues with certain types of NFS mounts but if it does we will solve them in a way that doesn't pollute the struct buf.
proc->thread stage 4: rework the VFS and DEVICE subsystems to take thread pointers instead of process pointers as arguments, similar to what FreeBSD-5 did. Note however that ultimately both APIs are going to be message-passing which means the current thread context will not be useable for creds and descriptor access.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 22.214.171.124