Keyword substitution: kv
Default branch: MAIN
MFC numerous features from HEAD.
* Bounce buffer fixes for physio.
* Disk flush support in scsi and nata subsystems.
* Dead bio handling
Add BUF_CMD_FLUSH support - issue flush command to mass storage device.
Another round of typo fixes (mostly in messages).
dssetmask() was being called too early, causing the disk layer to believe the disk was open when in fact it was not, in certain cases where the disk open fails.
Make some adjustments to clean up structural field names. Add type and storage UUIDs to the partinfo structure for the DIOCGPART ioctl and load the fields up for GPT slices and disklabel64 partitions.
Implement non-booting support for the DragonFly 64 bit disklabel:
* Add full kernel support. Both 32 and 64 bit labels will be probed.
* Add a new program, disklabel64, which allows you to create and edit the new disklabel.
* Add some logic to prevent foot shooting.
DragonFly's 64 bit disklabels start at byte offset 0 on the disk slice or GPT partition and operate in a slice-relative fashion. No translation is required when going from on-disk to in-core or vice versa, unlike the existing 32 bit disklabels. 512 bytes at the beginning of the label are reserved for legacy boot code. Specifically, the label starts at sector 0, NOT sector 1, which means its location on the disk is the same regardless of the sector size.
The label has a UUID to uniquely identify the storage and a type and object uuid for each partition.
All location specifications are 64 bit byte offsets, NOT logical blocks.
The label enforces an alignment requirement for label-related I/O and partitions which defaults to 4K regardless of the sector size. This makes the label 100% portable across media with different sector sizes within the constraints of the alignment requirement.
All partitions are specified using byte offsets and sizes, constrained by the alignment requirement, relative to the base of the label (i.e. offset 0 in the slice). disklabel64 will adjust the offsets for display purposes to be relative to the partition table area. The label headers, partition table, and boot2 areas come BEFORE the partition table area and partitions which overlap any of those objects are not allowed.
By default, a virgin 64 bit disklabel will reserve 32K for boot2. As of this writing, boot1 and boot2 blocks have not yet been implemented.
Move all the code related to handling the current 32 bit disklabel to subr_disklabel32.c. Move the header file from sys/disklabel.h to sys/disklabel32.h. Rename all the related structures and constants and retire 'struct disklabel'.
Redo the sys/disklabel.h header file to implement a generic disklabel abstraction. Modify kern/subr_diskslice.c to use this abstraction, with some shims for the ops dispatch at the moment which will be cleaned up later.
Adjust all auxiliary code that directly accesses 32 bit disklabels to use the new structure and constant names.
Remove the snoop-adjust code. The kernel would snoop reads and writes to the disklabel area via the raw slice device (e.g. ad0s1) and convert the disklabel from the in-core format to the on-disk format and vice versa. The reads and writes made by disklabel -r and the kernel's own internal readdisklabel and writedisklabel code used the snooping.
Rearrange the kernel's internal code to manually convert the disklabel when reading and writing. Rearrange the /sbin/disklabel program to do the same when the -r option is used. Have the disklabel program also check which DragonFly OS it is running under so it can be run on older systems. Note that the disklabel binary prior to these changes will NOT operate on the disklabel properly if running on a NEW kernel.
Introduce skeleton files for 64 bit disklabel support.
Disklabel separation work - Generally shift all disklabel-specific procedures for the kernel proper to a new source file, subr_disklabel32.c. Move the DTYPE_ and FS_ defines out of sys/disklabel.h and into a new header file, sys/dtype.h. Make adjustments to the uuids file, renaming "DragonFly Label" to "DragonFly Label32" and creating a "DragonFly Label64" uuid.
Fix an overflow in the GPT code, I wasn't allocating enough slice structures. Fix the openmask array declaration, it was declaring too-large an array. Disallow GPT partitions with invalid spans. When calculating a virgin disklabel take into account the possibility of absurdly small GPT or MBR slices that would cause the calculation of disklabel->d_ncylinders to result in 0. Reduce the number of heads and the number of sectors per track until a reasonable cylinder count is achieved.
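The geometry adjustment described above might look roughly like the following C sketch. The structure `struct geom`, the function `shrink_geometry()`, the threshold `X_MIN_CYLINDERS`, and the halving strategy are all assumptions made for illustration; the commit message only says that heads and sectors per track are reduced until the cylinder count is reasonable.

```c
#include <assert.h>
#include <stdint.h>

#define X_MIN_CYLINDERS 4	/* hypothetical "reasonable" lower bound */

/* Hypothetical geometry triple, loosely modelled on disklabel fields. */
struct geom {
	uint32_t heads;
	uint32_t sectors;	/* sectors per track */
	uint32_t cylinders;	/* filled in by shrink_geometry() */
};

/*
 * Halve heads and sectors per track until the slice yields at least
 * X_MIN_CYLINDERS cylinders, or until both have reached 1.
 */
void
shrink_geometry(uint64_t total_sectors, struct geom *g)
{
	assert(g->heads > 0 && g->sectors > 0);
	for (;;) {
		uint64_t cyl = total_sectors /
		    ((uint64_t)g->heads * g->sectors);

		if (cyl >= X_MIN_CYLINDERS ||
		    (g->heads == 1 && g->sectors == 1)) {
			g->cylinders = cyl > UINT32_MAX ?
			    UINT32_MAX : (uint32_t)cyl;
			return;
		}
		if (g->heads > 1)
			g->heads /= 2;
		if (g->sectors > 1)
			g->sectors /= 2;
	}
}
```

For a 100 sector slice starting from 16 heads and 64 sectors per track, the loop settles on 2 heads and 8 sectors per track, which gives 6 cylinders instead of 0.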
Implement (non-bootable) GPT support. If a PMBR partition type is detected the rest of the MBR is ignored and the GPT partition table will be parsed into slices. GPT partition 0 will be s0, GPT partition 1 will be s1, etc. Bootable support is forthcoming.
Remove support for COMPATIBILITY_SLICE when a MBR/GPT table is present. That is, the COMPATIBILITY_SLICE (s0) will still point to the dangerously dedicated disklabel or be synthesized for a CD, but it will no longer point to the 'first BSD slice' in a real MBR or GPT table. For GPT tables slice 0 (s0) will point at GPT partition #0, slice 1 (s1) at GPT partition #1, etc.
Redo the reserved sector handling code. There is now a single reserved sector count instead of separate fields for the slice layer and disklabel layer.
Redo the disklabel snooping code. Note that you cannot run an old /sbin/disklabel in raw (-r) mode with a new OS because the old disklabel will not turn on snooping. For now the on-disk format remains the same, but more changes may be forthcoming (after discussion). I would like to get rid of the snooping entirely.
Add kuuid_is_nil() and use it to ignore unset GPT partitions.
Expand the diskslice->ds_openmask from 8 bits to 256 bits to cover all possible partitions. Partitions from 'i' on, and the whole-disk partition, were not being properly tracked, resulting in multiple device opens and device closes to the underlying device. In particular, this caused USB memory sticks to connect to the CAM driver with ever-increasing DA#n unit numbers because CAM's reference counting got seriously corrupted. Reported-by: "Simon 'corecode' Schubert" <corecode@fs.ei.tum.de>
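A 256 bit open mask of the kind described above can be sketched in C as a byte array with one bit per partition. The names `struct openmask`, `om_set()`, `om_clear()`, `om_test()`, and `om_empty()` are hypothetical; the kernel's actual field layout and helpers may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define X_MAXPARTS 256		/* one bit per possible partition */

/* Hypothetical open mask: 256 bits stored as 32 bytes. */
struct openmask {
	uint8_t bits[X_MAXPARTS / 8];
};

void
om_init(struct openmask *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}

void
om_set(struct openmask *m, unsigned part)
{
	assert(part < X_MAXPARTS);
	m->bits[part >> 3] |= (uint8_t)(1u << (part & 7));
}

void
om_clear(struct openmask *m, unsigned part)
{
	assert(part < X_MAXPARTS);
	m->bits[part >> 3] &= (uint8_t)~(1u << (part & 7));
}

bool
om_test(const struct openmask *m, unsigned part)
{
	assert(part < X_MAXPARTS);
	return (m->bits[part >> 3] & (1u << (part & 7))) != 0;
}

/* True when no partition is open, i.e. the last close may proceed. */
bool
om_empty(const struct openmask *m)
{
	for (size_t i = 0; i < sizeof(m->bits); i++)
		if (m->bits[i] != 0)
			return false;
	return true;
}
```

With a single 8 bit mask, partition index 8 ('i') and index 255 have no bit of their own, which matches the tracking failure the commit describes.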
Fix a bug in recent commits. When creating a virgin disk label for devices in DSO_COMPATPARTA mode, the generated 'a' partition was not sized properly.
When a traditional bsd disklabel is present, try to reserve SBSIZE bytes of total space, regardless of the sector size. newfs presumes no more than SBSIZE bytes of space at the beginning of the block device are reserved.
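As a rough illustration of reserving a fixed byte count independent of sector size, the sketch below converts a byte reservation into a sector count. `X_SBSIZE` is assumed to be 8192 here (check the UFS headers for the real SBSIZE), and `reserved_sectors()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define X_SBSIZE 8192u	/* assumed value of SBSIZE for this sketch */

/*
 * Number of sectors needed to cover X_SBSIZE bytes, rounding up so the
 * reservation is never smaller than X_SBSIZE.
 */
uint32_t
reserved_sectors(uint32_t secsize)
{
	assert(secsize > 0);
	return (X_SBSIZE + secsize - 1) / secsize;
}
```

Under these assumptions a 512 byte sector device reserves 16 sectors and a 2048 byte sector device reserves 4, so the reserved byte count stays the same.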
Cleanup diskerr() output a bit - don't say it was trying to write when the BUF/BIO command is no longer set. The device name was being misreported by dsname(). Fix it up.
Include geometry data in DIOCGPART so fdisk can use it instead of trying to read a faked disklabel. Change MAKEDEV to create 'compatibility slice' devices, e.g. da0s0a, da0s0b, etc. Previously the compatibility slice devices were e.g. da0a, da0b, and there was no 'whole slice' device for the compatibility slice at all, meaning one couldn't disklabel it. Now there is, e.g. da0s0.
Keep the ds_skip_* fields in struct diskslice properly synchronized. ds_skip_bsdlabel is inclusive of bsd_skip_platform but was being improperly set to 0 even when an mbr reserved sector existed. The fields were not being properly reset for a slice whose disklabel is destroyed.
Defer reading the disklabel on a slice until a partition on the slice is opened or a disklabel related DIOC ioctl is performed on the slice. In particular, we do not attempt to read the disklabel when opening the whole-disk-slice for the whole disk or the whole-slice-partition for a slice. Previously the code attempted to scan all available BSD slices for disklabels.
When writing to a raw slice, do not snoop or do reserved-sector checks unless a disklabel has been loaded for the slice. Typically a disklabel will only be loaded in two situations: (1) if filesystems are mounted from that slice or (2) the disklabel program has performed ioctls on the whole-slice-partition to set a disklabel. Now writing to raw slices works almost the same as writing to the whole-disk-slice, with no interpretation.
Remove all remaining references to the LABELSECTOR constant. Instead, use the ds_skip_* fields to determine the sector where the disklabel starts within a slice.
These changes significantly cleaned up the snoop and reserved sector checking code in dscheck().
Implement raw extensions for WHOLE_DISK_SLICE device accesses for acd0. Disallow special accesses on devices that do not support the extensions. Implement direct track reading via /dev/acd0 or /dev/acd0t* (use MAKEDEV acd0t to create per-track devices). Fix a few bugs with the minor device numbers generated by MAKEDEV for /dev/acd*. /dev/acd0a and /dev/acd0c were improperly specifying the WHOLE_DISK_SLICE instead of the compatibility slice. Change all mountroot operations that were trying to access disks via RAW_PART to instead access them via WHOLE_SLICE_PART (removing more dependencies on the old disklabel structure). Replace the unconditional sector sanity check in dsopen() with better sanity checks in dscheck(). The checks are not made for special WHOLE_DISK_SLICE accesses, allowing weird sector sizes to feed through to the device.
Continue untangling the disklabel.
* Move dk*() inline functions and other related stuff not directly related to the BSD disklabel out of sys/disklabel.h and into sys/diskslice.h. Add additional functions to sys/diskslice.h.
* Extend the slice and partition fields in the device minor number. We now support up to 128 slices and up to 256 partitions.
* Implement new minor device numbers for 'raw slices', such as ad0s1. Previously raw slices used the same minor number as partition c within the slice. e.g. ad0s1 and ad0s1c had the same device number. This made it impossible to distinguish between the two. The 'whole disk' device's minor number has also changed. Our new whole-slice and whole-disk devices specify a partition number of (DKMAXPARTITIONS - 1) (aka 255).
* Completely disable disklabel related operations on the raw disk, e.g. da0, and on partitions, e.g. da0s1a. Only allow disklabel operations on whole slices, e.g. da0s1.
  NOTE!! For compatibility while booting drivers which set DSO_COMPATLABEL, the compat disklabel may be read, but not written, via the whole-disk device. e.g. acd0.
  NOTE!! For compatibility we have no choice but to continue to snoop read/write operations on raw slices (e.g. da0s1) because the disklabel program and the kernel still depend on the snooping to modify the in-core version of the disklabel to the on-disk version. No snooping will occur on the whole-disk device (e.g. da0). No snooping will occur on raw slices (e.g. da0s1) if the disk is unlabeled and no in-core label was set. Note that disklabel -r -w DOES set an in-core label before writing to a raw-slice, so it is still ok.
* dsopen() no longer attempts to scan the MBR or slice table when the whole-disk device (e.g. da0) is opened, and no longer attempts to read the disklabel when the whole-slice device is opened (e.g. da0s1). The disklabel is only read when a partition is explicitly opened or the label is explicitly read via an ioctl.
* The virgin disklabel is stored in the struct diskslice for WHOLE_DISK_SLICE (slice 1).
Remove #include <sys/disklabel.h> from various source files which no longer need it.
Remove the roll-your-own disklabel from CCD. Use the kernel disk manager for disklabel support instead. Make CCD a real disk device rather than a fake one. NOTE: All /dev/ccd* devices have changed and must be remade. Introduce DSO_COMPATMBR. This forces an MBR sector to be reserved in front of a disklabel even when the target disk does not have slices. It is used by the CCD and VN devices to keep the disklabel aligned the same way it has been historically. Implement 64 bit block addressing for CCD. Implement a new filesystem type "ccd", and require that the devices backing the CCD use that filesystem type for safety. Fix a bug in DIOCGPART where the partinfo->media_blocks was not being set properly for partitions.
Continue untangling the disklabel. Add sector index reservation fields to the diskslice and partinfo structures. These fields will replace the hardcoded LABELSECTOR constant and also help manage reserved areas in the disklabel.
* The diskslice abstraction now stores offsets/sizes as 64 bit quantities. (NOTE: DOS partition tables and standard disklabels can't handle 64 bit sector numbers yet). For future pluggable disklabel/partitioning schemes.
* The kernel panic / kernel core API is now 64 bits.
* The VN device now uses 64 bit sector numbers and can handle block devices up to what is supported by the filesystem (typically 8TB). This change was made primarily so we can test future disklabel / partition table support.
* Pass 64 bit LBAs to various block devices and to the SCSI layer.
* Check for and assert 32 bit overflow conditions in various places, instead of wrapping.
Continue untangling the disklabel. Reorganize struct partinfo and the DIOCGPART ioctl to extract the required information directly, and fix the DIOCGPART ioctl direction so userland can use it. This removes numerous disklabel references, particularly from the filesystem code which was doing silly indirections just to figure out the sector size. NOTE: The absolute byte offset of the slice or partition (relative to the base of the raw disk) is also made available, but is not currently used by the kernel.
Continue untangling the disklabel. Use the generic disk_info structure to hold template information instead of the disklabel structure. This removes all references to the disklabel structure from the MBR code and leaves mostly opaque references in the slice code.
Continue untangling the disklabel. Have most disk device drivers fill out and install a generic disk_info structure instead of filling out random fields in the disklabel. The generic disk_info structure uses a 64 bit integer to represent the media size in bytes or total sector count.
Start untangling the disklabel from various bits of code with the goal of introducing support for a new 64 bit disklabel. Remove the D_* flags for disklabel.d_flags. These sorts of flags just do not belong in the disk image. Relabel the partition sub-structure in disktab.h, and remove other ancient compatibility defines in disklabel.h.
Rename printf -> kprintf in sys/ and add some defines where necessary (files which are used in userland, too).
Rename sprintf -> ksprintf. Rename snprintf -> ksnprintf. Make allowances for source files that are compiled for both userland and the kernel.
Change the kernel dev_t, representing a pointer to a specinfo structure, to cdev_t. Change struct specinfo to struct cdev. The name 'cdev' was taken from FreeBSD. Remove the dev_t shim for the kernel. This commit generally removes the overloading of 'dev_t' between userland and the kernel. Also fix a bug in libkvm where a kernel dev_t (now cdev_t) was not being properly converted to a userland dev_t.
Rename malloc->kmalloc, free->kfree, and realloc->krealloc. Pass 2
Rename malloc->kmalloc, free->kfree, and realloc->krealloc. Pass 1
MASSIVE reorganization of the device operations vector. Change cdevsw to dev_ops. dev_ops is a syslink-compatible operations vector structure similar to the vop_ops structure used by vnodes. Remove a huge number of instances where a thread pointer is still being passed as an argument to various device ops and other related routines. The device OPEN and IOCTL calls now take a ucred instead of a thread pointer, and the CLOSE call no longer takes a thread pointer.
Do not attempt to read the slice table or disk label when accessing a raw disk device such as /dev/ad0. Otherwise a read failure of sector 0 during the open will cause the open to fail and prevent the recovery of other potentially readable data.
Block devices generally truncate the size of I/O requests which go past EOF. This is exactly what we want when manually reading or writing a block device such as /dev/ad0s1a, but is not desired when a VFS issues I/O ops on filesystem buffers. In such cases, any EOF condition must be considered an error. Implement a new filesystem buffer flag B_BNOCLIP, which getblk() and friends automatically set. If set, block devices are guaranteed to return an error if the I/O request is at EOF or would otherwise have to be clipped to EOF. Block devices further guarantee that b_bcount will not be modified when this flag is set. Adjust all block device EOF checks to use the new flag, and clean up the code while I'm there. Also, set b_resid in a couple of degenerate cases where it was not being set.
- Clarify the definitions of b_bufsize, b_bcount, and b_resid.
- Remove unnecessary assignments based on the clarified fields.
- Add additional checks for premature EOF.
b_bufsize is only used by buffer management entities such as getblk() and other vnode-backed buffer handling procedures. b_bufsize is not required for calls to vn_strategy() or dev_dstrategy(). A number of other subsystems use it to track the original request size.
b_bcount is the I/O request size, but b_bcount is allowed to be truncated by the device chain if the request encompasses EOF (such as on a raw disk device). A caller which needs to record the original buffer size versus the EOF-truncated buffer can compare b_bcount after the I/O against a recorded copy of the original request size. This copy can be recorded in b_bufsize for unmanaged buffers (malloced or getpbuf()'d buffers).
b_resid is always relative to b_bcount, not b_bufsize. A successful read that is truncated to the device EOF will thus have a b_resid of 0 and a truncated b_bcount.
Replace the buffer cache's B_READ, B_WRITE, B_FORMAT, and B_FREEBUF b_flags with a separate b_cmd field. Use b_cmd to test for I/O completion as well (getting rid of B_DONE in the process). This further simplifies the setup required to issue a buffer cache I/O. Remove a redundant header file, bus/isa/i386/isa_dma.h and merge any discrepancies into bus/isa/isavar.h. Give ISADMA_READ/WRITE/RAW their own independent flag definitions instead of trying to overload them on top of B_READ, B_WRITE, and B_RAW. Add a routine isa_dmabp() which takes a struct buf pointer and returns the ISA dma flags associated with the operation. Remove the 'clear_modify' argument to vfs_busy_pages(). Instead, vfs_busy_pages() asserts that the buffer's b_cmd is valid and then uses it to determine the action it must take.
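A command field that doubles as the completion indicator could be sketched as below. The enum values, `struct cmdbuf`, and the helper functions are hypothetical; this log itself only names the b_cmd field and, in a later entry, BUF_CMD_FLUSH.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical command codes; 0 means "no I/O pending / completed". */
enum xbuf_cmd {
	XBUF_CMD_DONE = 0,
	XBUF_CMD_READ,
	XBUF_CMD_WRITE,
	XBUF_CMD_FORMAT,
	XBUF_CMD_FREEBLKS
};

/* Hypothetical buffer with a command field separate from its flags. */
struct cmdbuf {
	enum xbuf_cmd b_cmd;
	int b_flags;		/* no longer carries read/write/done bits */
};

void
start_io(struct cmdbuf *bp, enum xbuf_cmd cmd)
{
	assert(cmd != XBUF_CMD_DONE);	/* a valid command is required */
	bp->b_cmd = cmd;
}

void
finish_io(struct cmdbuf *bp)
{
	bp->b_cmd = XBUF_CMD_DONE;	/* completion is a command state */
}

bool
io_in_progress(const struct cmdbuf *bp)
{
	return bp->b_cmd != XBUF_CMD_DONE;
}
```

With a single field there is no way to issue a buffer that is marked as both a read and a write, which is one way flag-based encodings can go wrong.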
Get rid of pbgetvp() and pbrelvp(). Instead fold the B_PAGING flag directly into getpbuf() (the only type of buffer that pbgetvp() could be called on anyway). Change related b_flags assignments from '=' to '|='. Get rid of remaining dependencies on b_vp. vn_strategy() now relies solely on the vp passed to it as an argument. Remove buffer cache code that sets b_vp for anonymous pbuf's. Add a stopgap 'vp' argument to vfs_busy_pages(). This is only really needed by NFS and the clustering code due to the severely hackish nature of the NFS and clustering code. Fix a bug in the ext2fs inode code where vfs_busy_pages() was being called on B_CACHE buffers. Add an assertion to vfs_busy_pages() to panic if it encounters a B_CACHE buffer.
A number of structures related to UFS and QUOTAS have changed name.
dinode -> ufs1_dinode
dqblk -> ufs_dqblk (and other quota related structures)
In addition, a large number of UFS related structures and procedures have been prefixed with 'ufs_' to allow us to split off EXT2FS. ufs_daddr_t has been moved out of sys/types.h and into vfs/ufs/dinode.h. The #ifndef header file checks for UFS have been normalized.
Major BUF/BIO work commit. Make I/O BIO-centric and specify the disk or file location with a 64 bit offset instead of a 32 bit block number.
* All I/O is now BIO-centric instead of BUF-centric.
* File/Disk addresses universally use a 64 bit bio_offset now. bio_blkno no longer exists.
* Stackable BIO's hold disk offset translations. Translations are no longer overloaded onto a single structure (BUF or BIO).
* bio_offset == NOOFFSET is now universally used to indicate that a translation has not been made. The old (blkno == lblkno) junk has all been removed.
* There is no longer a distinction between logical I/O and physical I/O.
* All driver BUFQs have been converted to BIOQs.
* BMAP, FREEBLKS, getblk, bread, breadn, bwrite, inmem, cluster_*, and findblk all now take and/or return 64 bit byte offsets instead of block numbers. Note that BMAP now returns a byte range for the before and after variables.
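The move from 32 bit block numbers to 64 bit byte offsets, with a sentinel for "not yet translated", can be sketched as follows. `X_NOOFFSET` is an assumed stand-in for NOOFFSET (its value here is a guess), and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X_NOOFFSET ((int64_t)-1)	/* assumed "no translation" sentinel */

/*
 * Convert a 32 bit block number into a 64 bit byte offset.  The block
 * size is limited to 1MB in this sketch so the product cannot overflow.
 */
int64_t
blkno_to_offset(uint32_t blkno, uint32_t blksize)
{
	assert(blksize > 0 && blksize <= (1u << 20));
	return (int64_t)blkno * (int64_t)blksize;
}

/* True once a layer has filled in its translated offset. */
bool
has_translation(int64_t bio_offset)
{
	return bio_offset != X_NOOFFSET;
}
```

Block 8388608 with 512 byte blocks sits at byte 4294967296, one past what a 32 bit byte count can express, which is why the offset must be 64 bits wide.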
Make the entire BUF/BIO system BIO-centric instead of BUF-centric. Vnode and device strategy routines now take a BIO and must pass that BIO to biodone(). All code which previously managed a BUF undergoing I/O now manages a BIO.
The new BIO-centric algorithms allow BIOs to be stacked, where each layer represents a block translation, completion callback, or caller or device private data. This information is no longer overloaded within the BUF. Translation layer linkages remain intact as a 'cache' after I/O has completed.
The VOP and DEV strategy routines no longer make assumptions as to which translated block number applies to them. They use the block number in the BIO specifically passed to them.
Change the 'untranslated' constant to NOOFFSET (for bio_offset), and (daddr_t)-1 (for bio_blkno). Rip out all code that previously set the translated block number to the untranslated block number to indicate that the translation had not been made.
Rip out all the cluster linkage fields for clustered VFS and clustered paging operations. Clustering now occurs in a private BIO layer using private fields within the BIO.
Reformulate the vn_strategy() and dev_dstrategy() abstraction(s). These routines no longer assume that bp->b_vp == the vp of the VOP operation, and the dev_t is no longer stored in the struct buf. Instead, only the vp passed to vn_strategy() (and related *_strategy() routines for VFS ops), and the dev_t passed to dev_dstrategy() (and related *_strategy() routines for device ops) is used by the VFS or DEV code. This will allow an arbitrary number of translation layers in the future.
Create an independent per-BIO tracking entity, struct bio_track, which is used to determine when I/O is in-progress on the associated device or vnode.
NOTE: Unlike FreeBSD's BIO work, our struct BUF is still used to hold the fields describing the data buffer, resid, and error state.
Major-testing-by: Stefan Krueger
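The idea of stacked BIO layers, each holding its own translated offset and optional completion callback, can be illustrated with a simplified C sketch. `struct xbio`, `xbio_push()`, `xbio_done()`, and `xbio_count_done()` are hypothetical, and the completion walk here is a simplification that should not be read as the behaviour of the kernel's biodone().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical BIO layer: one per translation step. */
struct xbio {
	struct xbio *prev;		/* layer above, closer to the caller */
	int64_t offset;			/* byte offset valid at this layer */
	void (*done)(struct xbio *);	/* completion callback, may be NULL */
	void *priv;			/* caller or device private data */
};

/* Initialize 'lower' as a new layer below 'upper', shifted by delta bytes. */
void
xbio_push(struct xbio *upper, struct xbio *lower, int64_t delta)
{
	assert(upper != NULL && lower != NULL);
	lower->prev = upper;
	lower->offset = upper->offset + delta;
	lower->done = NULL;
	lower->priv = NULL;
}

/* Complete I/O: run each layer's callback from the bottom layer upward. */
void
xbio_done(struct xbio *bio)
{
	while (bio != NULL) {
		struct xbio *up = bio->prev;	/* read before the callback */

		if (bio->done != NULL)
			bio->done(bio);
		bio = up;
	}
}

/* Example callback: count completions through an int stored in priv. */
void
xbio_count_done(struct xbio *bio)
{
	++*(int *)bio->priv;
}
```

Each layer keeps its own offset, so the upper layers still hold their untouched values after the I/O completes, which is the caching effect the commit message mentions.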
Add more documentation comments to disk_create() and dscheck().
BUF/BIO cleanup 3/99: Retire the B_CALL flag in favour of checking the bp->b_iodone pointer directly, thus simplifying the BUF interface even more. Move scattered B_UNUSED* flag space definitions into one place, that is below the rest of the definitions.
Remove spl*() calls from kern, replacing them with critical sections. Change the meaning of safepri from a cpl mask to a thread priority. Make a minor adjustment to tests within one of the buffer cache's critical sections.
Remove DEC Alpha support.
ANSIfication and cleanup. No functional changes. Submitted-by: Tim Wickberg <me@k9mach3.org>
Fully synchronize sys/boot from FreeBSD-5.x, but add / to the module path so /kernel will be found and loaded instead of /boot/kernel. This will give us all the capabilities of the FreeBSD-5 boot code including AMD64 and ELF64 support. As part of this work, rather than try to adjust ufs/fs.h and friends to get UFS2 info I instead copied the fs.h and friends from FreeBSD-5 into the sys/boot subtree. Additionally, import Peter Wemm's linker set improvements from FreeBSD-5.x. They happen to be compatible with GCC 2.95.x and it allows very few changes to be made to the boot code. Additionally import a number of other elements from FreeBSD-5 including sys/diskmbr.h separation.
__P() removal
kernel tree reorganization stage 1: Major cvs repository work (not logged as
commits) plus a major reworking of the #include's to accommodate the
relocations.
* CVS repository files manually moved. Old directories left intact
and empty (temporary).
* Reorganize all filesystems into vfs/, most devices into dev/,
sub-divide devices by function.
* Begin to move device-specific architecture files to the device
subdirs rather than throwing them all into, e.g. i386/include
* Reorganize files related to system busses, placing the related code
in a new bus/ directory. Also move cam to bus/cam though this may
not have been the best idea in retrospect.
* Reorganize emulation code and place it in a new emulation/ directory.
* Remove the -I- compiler option in order to allow #include file
localization, rename all config generated X.h files to use_X.h to
clean up the conflicts.
* Remove /usr/src/include (or /usr/include) dependencies during the
kernel build, beyond what is normally needed to compile helper
programs.
* Make config create 'machine' softlinks for architecture specific
directories outside of the standard <arch>/include.
* Bump the config rev.
WARNING! after this commit /usr/include and /usr/src/sys/compile/*
should be regenerated from scratch.
DEV messaging stage 2/4: In this stage all DEV commands are now being funneled through the message port for action by the port's beginmsg function. CONSOLE and DISK device shims replace the port with their own and then forward to the original. FB (Frame Buffer) shims supposedly do the same thing but I haven't been able to test it. I don't expect instability in mainline code but there might be easy-to-fix issues, and some drivers still need to be converted. See primarily: kern/kern_device.c (new dev_*() functions and inherits cdevsw code from kern/kern_conf.c), sys/device.h, and kern/subr_disk.c for the high points.
In this stage all DEV messages are still acted upon synchronously in the context of the caller. We cannot create a separate handler thread until the copyin's (primarily in ioctl functions) are made thread-aware.
Note that the messaging shims are going to look rather messy in these early days but as more subsystems are converted over we will begin to use pre-initialized messages and message forwarding to avoid having to constantly rebuild messages prior to use.
Note that DEV itself is a mess owing to its 4.x roots and will be cleaned up in subsequent passes. e.g. the way sub-devices inherit the main device's cdevsw was always a bad hack and it still is, and several functions (mmap, kqfilter, psize, poll) return results rather than error codes, which will be fixed since now we have a message to store the result in :-)
proc->thread stage 2: MAJOR revamping of system calls, ucred, jail API, and some work on the low level device interface (proc arg -> thread arg). As -current did, I have removed p_cred and incorporated its functions into p_ucred. p_prison has also been moved into p_ucred and adjusted accordingly. The jail interface tests now use ucreds rather than processes. The syscall(p,uap) interface has been changed to just (uap). This is inclusive of the emulation code. It makes little sense to pass a proc pointer around which confuses the MP readability of the code, because most system call code will only work with the current process anyway. Note that eventually *ALL* syscall emulation code will be moved to a kernel-protected userland layer because it really makes no sense whatsoever to implement these emulations in the kernel. suser() now takes no arguments and only operates with the current process. The process argument has been removed from suser_xxx() so it now just takes a ucred and flags. The sysctl interface was adjusted somewhat.
Add the DragonFly cvs id and perform general cleanups on cvs/rcs/sccs ids. Most ids have been removed from !lint sections and moved into comment sections.
import from FreeBSD RELENG_4 1.82.2.6