DragonFly users List (threaded) for 2010-09
Re: SMP (Was: Why did you choose DragonFly?)
I think our ability to advertise our features has been a bit lacking.
We're programmers more than we are salesmen.
Take device serial numbers in devfs for example. A simple feature that
gives one a guaranteed device path to access a physical hard drive,
no matter where it is attached. Soft-labeling schemes such as those
LVM or GEOM use have their place (and of course we support LVM's labeling now),
but personally speaking I don't think anything could be simpler than
simply accessing the device by its serial number, and the non-duplication
guarantee is important in a world where one can 'dd' partitions back
and forth almost at will. But mostly it's the sheer simplicity of
stuffing the serial number in /etc/fstab, /boot/loader.conf, and
/etc/rc.conf, and then not caring how the drive is actually attached
to the system, that makes the feature worth its weight in gold.
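For illustration, a serial-numbered device path can go straight into
/etc/fstab roughly like this (the serial number and partition letters
below are made-up examples; `ls /dev/serno` shows the real ones on a
given box):

```
# /etc/fstab -- refer to the drive by serial number, not attachment point
/dev/serno/WD-WCAU12345678.s1a   /      hammer  rw  1  1
/dev/serno/WD-WCAU12345678.s1b   none   swap    sw  0  0
```

The entries keep working no matter which controller or port the drive
winds up on.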
Implementation details often get lost in the noise as well. The coding
model we use for SMP is far, far less complex than the coding model
other OSs use, and just as fast. It doesn't just make coding easier,
it makes maintaining the codebase easier, and the result tends to be
more stable in the long run IMHO. We don't have any significant
cross-subsystem pollution. We don't have a serious problem with
deadlocks (because tokens CAN'T deadlock). Problems tend to be
localized. Our LWKT token abstraction is a very big deal in more
ways than one.
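The reason tokens can't deadlock is how they are held: a token is only
guaranteed while its thread is actually running, and blocking releases
every token the thread holds (they are transparently reacquired when it
resumes). A toy cooperative-scheduler model (plain Python, nothing
DragonFly-specific; all names here are made up for illustration) shows
why the classic ABBA ordering that deadlocks mutexes simply completes
under token semantics:

```python
from collections import deque

def run(threads):
    """Cooperatively schedule generator-based 'threads'.

    Threads yield ('acquire', name) or ('block',). Token semantics:
    only the currently-running thread effectively holds its tokens;
    blocking releases them all, and resuming reacquires them (always
    successfully here, since no other thread is running at that point).
    """
    ready = deque((g, set()) for g in threads)
    trace = []
    while ready:
        gen, held = ready.popleft()
        # Resuming: the thread transparently regains everything in
        # `held` -- nobody else can hold them, nobody else is running.
        try:
            while True:
                op, *args = next(gen)
                if op == 'acquire':
                    held.add(args[0])
                    trace.append(f"{gen.__name__} acquired {args[0]}")
                elif op == 'block':
                    # Blocking releases all tokens in `held`.
                    ready.append((gen, held))
                    break
        except StopIteration:
            trace.append(f"{gen.__name__} finished")
    return trace

def t1():                     # takes A, then B
    yield ('acquire', 'A')
    yield ('block',)          # e.g. waits on disk I/O
    yield ('acquire', 'B')

def t2():                     # opposite order: B, then A
    yield ('acquire', 'B')
    yield ('block',)
    yield ('acquire', 'A')

trace = run([t1(), t2()])
# With mutexes this ABBA pattern deadlocks; with tokens both finish.
assert trace.count("t1 finished") == 1
assert trace.count("t2 finished") == 1
```

This is a drastic simplification of the real LWKT implementation, but
it captures the property that matters: lock-ordering bugs degrade into
serialization instead of deadlock.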
It is hard to describe the level of integration some of our
reimplemented subsystems have and the additional capabilities they
provide. When was the last time you as a DragonFly user worried
about NULLFS ever not doing the right thing, even from inside a NFS
mount? It just works, so well that the base install depends on it
heavily for HAMMER PFS mounts. Or the namecache... as invisible a
subsystem to the end-user as it is possible to get, yet our
reengineered namecache makes things like 'fstat' work incredibly well,
able to report the actual full file paths for open file descriptors.
Or VN for that matter. It just works.
This stuff doesn't 'just work' in other OSs. There are aliasing
problems, system resource issues, directory recursion issues and
limitations related to recursive mount points, the potential for
system deadlocks, and numerous other issues.
We are lacking in a few areas, but frankly I don't consider other
open systems to be all that much ahead of us. People talk up soft
mirroring and soft RAID all the time (and I want to get them into DFly
too), but I have yet to see an implementation on an open system which
actually has any significant robustness. A friend of mine has a
hot fail-over setup running on Linux which works fine up until the
point something goes wrong, or he makes a mistake. Then it is history.
At best current setups save you from an instant crash, but you still
have to swap out drives and reboot if you want to be safe. At
worst something goes wrong, the system decides to start rebuilding
a 2TB drive, the rebuild itself takes 2 days, and you can say goodbye
to your high-performance production system in the meantime (or you let
it copy more slowly over 5 days instead of 2, which is just as bad). Or
the setup requires a complex stack of subsystems which
are as likely to blow up when a disk detaches as they are to return
the BIOs with an EIO.
I want these features in DragonFly, but I want them done right.
Another similar example would be crypted disks. Alex recently brought
in cryptsetup along with LVM and spent a good deal of time getting all
the crypto algorithms working. But what's the point if the in-kernel
software crypto is only single-threaded on your SMP system? Apparently
the expectation was that one would have to buy a crypto card or a MB
with a built-in crypto accelerator. So I went and fixed our in-kernel
software crypto, and now our crypted disk implementation runs almost
as fast as unencrypted on a quad cpu box. THAT I would consider
production-viable. What we originally inherited I would not.
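As a sketch of what the pipeline looks like from userland (the device
path, label, and mapping name here are hypothetical; this assumes the
cryptsetup and device-mapper tools Alex brought in):

```
cryptsetup luksFormat /dev/serno/WD-WCAU12345678.s1d
cryptsetup luksOpen /dev/serno/WD-WCAU12345678.s1d secret0
newfs_hammer -L SECRET /dev/mapper/secret0
mount_hammer /dev/mapper/secret0 /secret
```

With the in-kernel crypto now multi-threaded, all of the I/O through
that /dev/mapper path gets encrypted across the available cpus instead
of bottlenecking on one.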
In terms of differentiation from other systems HAMMER is a very big
deal. I think in many ways HAMMER has saved the project from the
userbase-bleeding effect that is often endemic with projects like ours.
I wish I had done it earlier. Once someone starts using HAMMER they
find it really difficult to move away to anything else. If there's
an issue at all it is simply that people looking in from the outside
have no idea just how flexible HAMMER's fine-grained history and
quasi-real-time mirroring/backup/retention capabilities are.
Swapcache is also underrated. And the use of swap as well. Swap
has gone out of favor in recent years as the gulf between cpu/memory
and disk performance has widened. But as storage densities continue
to rise into the multi-terabyte range even for entry-level consumer
systems, even throwing in a ton of ram isn't enough to cache the
active 'overnight' dataset. find/locatedb, web services, large
repositories, rsyncs, you name it. They all take their toll.
Swapcache pretty much fixes that whole mess at the cost of a small
40-80G SSD. $100-$200. Not only does it 'refresh' older systems
with less ram by making swap viable again, it also caches filesystem
meta-data generally across the whole system and is capable of caching
file data as well, making it extremely useful even on well-endowed
systems. The 'overnight' meta-data set will easily fit in a 40G SSD.
And swapcache is designed with SSDs in mind. It clusters large I/Os
and has very low write-multiplication effects when used with an SSD.
Normal filesystems tend to have larger write multiplication effects
and wear the SSD out faster.
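As a rough sketch, setting it up mostly amounts to pointing swap at the
SSD and flipping a few sysctls (the serial number below is a made-up
placeholder; swapcache(8) describes the full set of knobs):

```
# /etc/fstab -- swap partition on the SSD (hypothetical serial number)
/dev/serno/SSDSERIAL123.s1b  none  swap  sw  0  0

# /etc/sysctl.conf -- cache clean meta-data and file data on swap
vm.swapcache.read_enable=1
vm.swapcache.meta_enable=1
vm.swapcache.data_enable=1
```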
TMPFS becomes more viable with swapcache too. We have two bulk
pkgsrc building boxes. More than two even, but two that I regularly
use. One has a swapcache SSD (Pkgbox64) and one does not (Avalon).
There is a gulf of difference in overall system performance between
the two. Without TMPFS/swapcache the system requirements would be much
higher, even needing more disk spindles for reasonable performance.
Nearly all of DragonFly's own production systems use swapcache now.
Only Avalon sitting in its remote colo facility doesn't have a small
SSD swapcache setup.