DragonFly BSD
DragonFly users List (threaded) for 2012-07

Re: machine won't start


From: Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
Date: Wed, 4 Jul 2012 11:34:09 -0700 (PDT)

:There was an interesting debate on OpenBSD a couple of days/weeks ago
:about support for disks larger than 2TB.  It turned out that such disks
:can be used just fine without GPT, but multiboot capability is mostly
:lost as the job is done in disklabel (their fdisk can't do that)
:http://marc.info/?l=openbsd-misc&m=133857397722515&w=2
:

    What we do in our 'fdisk -IB' formatting sequence is cap the LBA
    slice values at all-1's and the CHS values at 1023/255/63 (I think
    that is all-1's too).  We do not wrap the CHS or LBA values, because
    wrapping creates massive edge cases: when the size of the disk just
    barely wraps, the BIOS can wind up thinking the disk is really tiny
    instead (this has happened to me!).
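
    A minimal sketch in C of the capping idea, assuming the usual MBR
    CHS packing and a 255-head/63-sectors-per-track translated geometry
    (a hypothetical illustration, not DragonFly's actual fdisk source):

	#include <stdint.h>

	#define MAXCYL	1023u
	#define MAXHEAD	255u
	#define MAXSECT	63u

	static void
	lba_to_chs(uint64_t lba, uint16_t *cyl, uint8_t *head,
	    uint8_t *sect)
	{
		const uint64_t spt = 63, hpc = 255; /* assumed geometry */
		uint64_t c = lba / (spt * hpc);

		if (c > MAXCYL) {
			/* cap, never wrap: a wrapped value can make
			 * the BIOS believe the disk is tiny */
			*cyl = MAXCYL;
			*head = MAXHEAD;
			*sect = MAXSECT;
		} else {
			*cyl = (uint16_t)c;
			*head = (uint8_t)((lba / spt) % hpc);
			*sect = (uint8_t)((lba % spt) + 1); /* 1-based */
		}
	}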

    The DragonFly OS then detects that a slice is using capped values
    and silently uses the HD-reported values instead.  Or, more to the
    point, the DragonFly disklabel code detects the situation and
    properly allows the disklabel to be sized to the actual media size
    instead of restricting it to the capped LBA values for the slice.
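
    The reading side can be sketched the same way (hypothetical names,
    not the actual disklabel code): if the 32-bit LBA size field in the
    MBR is pegged at all-1's, fall back to the size the drive reports:

	#include <stdint.h>

	struct mbr_slice {
		uint32_t lba_start;
		uint32_t lba_size;	/* 32-bit field in the MBR */
	};

	static uint64_t
	slice_blocks(const struct mbr_slice *sp, uint64_t media_blocks)
	{
		/* all-1's means "capped": the real size overflowed
		 * the field, so trust the drive, not the table */
		if (sp->lba_size == 0xFFFFFFFFu &&
		    media_blocks > sp->lba_start)
			return (media_blocks - sp->lba_start);
		return (sp->lba_size);
	}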

    But, as you can see, the results are mixed.  Even though capping
    the LBA values instead of wrapping them is the officially-supported
    methodology, some BIOS's can't handle it.  Fortunately nearly all
    BIOS's that would otherwise barf on the situation do allow you to
    go into the BIOS setup and manually set the access mode to LBA or
    LARGE yourself.

    --

    BIOS issues are also the reason why most fdisk's use such weird
    CHS values for the bootable slice.

	sysid 165,(DragonFly/FreeBSD/NetBSD/386BSD)
	    start 63, size 78156225 (38162 Meg), flag 80 (active)
		beg: cyl 0/ head 1/ sector 1;
		end: cyl 1023/ head 255/ sector 63

	(NOTE the 'start 63', sector numbers start at '1', not '0' for
	 fdisk reporting... blame Intel).

    fdisk sector numbers start at 1, so the slice start of '63' winds
    up being only 512-byte aligned.  The reason the start is weird
    like this is that the maximum sectors/track is 63, and many
    BIOS's (again, their old decrepit CHS probing) blow up if the
    slice is not on a cylinder boundary.  A lot of BIOS's also blow
    up if sectors/track is not set to 63, so we lose no matter what
    we do.
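
    The alignment arithmetic is easy to verify: sector 63 begins at
    byte 63*512 = 32256 = 0x7E00, which is divisible by 512 but by no
    larger power of two.  A trivial check:

	#include <stdio.h>

	int
	main(void)
	{
		unsigned long off = 63UL * 512;	/* 32256 = 0x7e00 */

		printf("%#lx 4K-aligned: %s\n", off,
		    (off % 4096) == 0 ? "yes" : "no"); /* prints "no" */
		return (0);
	}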

    --

    Newer advanced-format drives with 4K physical sectors become
    instantly inefficient when the resulting filesystems, even if they
    are aligned relative to the slice, wind up not being aligned
    relative to the physical media.  This forces the drive itself to
    do a read-modify-write when handling what are pure media writes
    from the filesystem, resulting in very poor performance.
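
    Illustratively (a sketch, not drive firmware logic): any write
    whose byte offset or end is not a multiple of the 4K physical
    sector leaves a partial sector that the drive must read back
    before it can rewrite it:

	#include <stdint.h>

	static int
	causes_rmw(uint64_t byte_off, uint64_t len)
	{
		const uint64_t psect = 4096; /* physical sector size */

		/* a partial physical sector at either end forces the
		 * drive to read the sector before rewriting it */
		return ((byte_off % psect) != 0 ||
		    ((byte_off + len) % psect) != 0);
	}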

    Some disk manufacturers (e.g. Seagate) apparently tried detecting
    filesystem alignment on the fly but it just created an immense
    mess (including with GPT compatibility slices).  I think most disk
    drive manufacturers are finally settling into requiring media
    accesses to be physically aligned if the consumer wishes the
    accesses to be efficient.

    Again, this applies only to advanced-format drives, but once the
    kinks are worked out by BIOS manufacturers I expect a lot of HD
    vendors will move most of their lines to advanced-format (4K
    physical sectors), because 4K physical sectors allow them to put
    an inter-sector gap back in and because it boosts linear transfer
    rates by 30-50%.

    In any case, DragonFly solves the alignment issue in its
    disklabel64 partition format (which has been our default for a few
    years now) by detecting that the slice table is mis-aligned and
    correcting for it in the disklabel.  Plus disklabel64 uses a very
    large initial alignment... not just 4K.  It's more like ~1MB.

# data space:   39077083 blocks # 38161.21 MB (40014933504 bytes)
#
# NOTE: If the partition data base looks odd it may be
#       physically aligned instead of slice-aligned
#
diskid: 7f45e4eb-9af2-11e1-a2f9-01012e2fd933
label: 
boot2 data base:      0x000000001000
partitions data base: 0x000000100200
partitions data stop: 0x000951237000
backup label:         0x000951237000
total size:           0x000951238200    # 38162.22 MB
alignment: 4096
display block size: 1024        # for partition display only

16 partitions:
#          size     offset    fstype   fsuuid
  a:    1048576          0    4.2BSD    #    1024.000MB
  b:   16777216    1048576      swap    #   16384.000MB
  d:   21251288   17825792    HAMMER    #   20753.211MB
  a-stor_uuid: a5cff4d1-9af2-11e1-a2f9-01012e2fd933
  b-stor_uuid: a5cff4e0-9af2-11e1-a2f9-01012e2fd933
  d-stor_uuid: ac45d623-9af2-11e1-a2f9-01012e2fd933

    The 'partitions data base' in the DragonFly disklabel64 format
    is using an offset of 0x100200, which is 1MB+512 bytes.  The
    extra 512 bytes corrects for the unaligned fdisk slice the
    partition is sitting in (which the partition code probes
    dynamically).  The 1MB is geared towards LVM partitioning for
    future soft-RAID setups.  LVM tends to want very large alignments,
    which it then cuts down as you set up soft-RAID configurations
    within it.
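
    The correction itself is just modular arithmetic.  A sketch with
    hypothetical names (not the disklabel64 source): start from the
    ~1MB nominal base and bump it until the absolute media offset is
    aligned.  For a slice starting at sector 63 this yields exactly
    the 0x100200 shown above:

	#include <stdint.h>

	#define NOMINAL_BASE	(1024 * 1024)	/* ~1MB, LVM-friendly */

	static uint64_t
	data_base(uint64_t slice_byte_off, uint64_t align)
	{
		uint64_t base = NOMINAL_BASE;
		uint64_t mis = (slice_byte_off + base) % align;

		if (mis)
			base += align - mis; /* align absolute offset */
		return (base);
	}

	/*
	 * data_base(63 * 512, 4096) == 0x100200, and the absolute
	 * offset 63*512 + 0x100200 = 1081344 is a multiple of 4096
	 * (and of 32K).
	 */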

    So hard drives care about reasonable alignment (~32K is usually
    plenty good enough), and lvm/dm cares about larger partitioning
    alignments (~1MB is usually plenty good enough for that purpose).

    The better alignments probably also help SSDs, though it will
    depend on the firmware.  SSDs tend to be more dynamic in the
    way they handle write-combining, but the 32K base media
    alignment helps there too (i.e. the first slice is offset by 63
    sectors and we add one more sector, the 1MB being irrelevant),
    so the physical device sees a ~32K base alignment for most I/O
    operations.  In any case, SSDs have better write-combining
    algorithms anyway, but they might still react better to a ~32K
    base alignment than to a ~32K-512-byte base alignment.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


