DragonFly users List (threaded) for 2010-09
Re: Why did you choose DragonFly?
:Today I think that the SSI goal has become less important.
:The "cluster hype" has diminished and been partially
:replaced by the "cloud hype". Today, it is extremely
:important to have excellent SMP scalability. Multi-core
:systems are common, my desktop at home is a 6-core AMD
:Phenom II X6 which costs less than 200 Euros. You can
:buy x86 machines with 4 CPU sockets, 8 cores each, plus
I think these are all good points. The evolution of SMP has
been highly predictable, but perhaps what has not been quite
that predictable is the enormous improvement in off-cpu
interconnect bandwidth over the last few years as PCI-e has
really pushed into all aspects of chip design. These days
you can't even find an FPGA that doesn't have support for
numerous serial gigabit links going off-chip. Serial has clearly
won over parallel.
This enormous improvement is one of the many reasons why something
like swapcache/SSD is not only viable on today's systems, but almost a
necessity if one wishes to squeeze out every last drop of performance
from a machine.
It used to be that one could get a good chunk of cpu power
in a consumer machine but not really be able to match servers
on bus bandwidth. That simply is not the case any more. Now
the cheap sweet-spot on the consumer curve has well over 50 GBits/sec
of off-chip bandwidth. 4-8 SATA ports running at 3-6 GBits/sec each, plus
another 24+ PCI-e lanes on top of that. It is getting busy out
there. The only real advantage a 'server' has now is memory
interconnect and even that is being whittled away now that one can
stuff 16GB+ of ECC ram into a 4-slot consumer box.
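
To put rough numbers on the off-chip bandwidth claim, here is a minimal
sketch of the arithmetic. The function name is made up for illustration,
and the per-lane PCI-e figures are my assumption (2.5 Gbit/s raw for
PCI-e 1.x, 5 Gbit/s raw for PCI-e 2.0, per direction); the post itself
only gives the SATA rates and the lane count. All figures are raw line
rates, before 8b/10b encoding overhead.

```c
#include <assert.h>

/*
 * Aggregate off-chip link bandwidth in Gbit/s (raw line rate).
 * Hypothetical helper; the per-lane PCI-e rate is an assumption
 * supplied by the caller, not a figure from the post.
 */
double offchip_gbits(int sata_ports, double sata_gbits_per_port,
                     int pcie_lanes, double pcie_gbits_per_lane)
{
    assert(sata_ports >= 0 && pcie_lanes >= 0);
    return sata_ports * sata_gbits_per_port
         + pcie_lanes * pcie_gbits_per_lane;
}
```

With 4 SATA ports at 3 Gbit/s and 24 lanes at the older 2.5 Gbit/s rate
this gives 72 Gbit/s, and with 8 ports at 6 Gbit/s and 24 lanes at
5 Gbit/s it gives 168 Gbit/s, so "well over 50" holds even at the low end.
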
The next big improvement will probably wind up being ultra-high-speed
memory interconnects. We would have it already if not for Rambus and
their double-blasted patent lawsuits. And after that the links are
going to start running at light frequencies, which Intel I think has
already demonstrated to some degree.
I do think I have to modulate my SSI goal a bit. The cluster
filesystem is still hot... I absolutely want something to replace NFSv3
that is fully cache coherent across all clients and servers, and it
*IS NOT* NFSv4. Adding a quorum protocol capability on top of that
to create distributed filesystem redundancy is also still in the cards.
The important point here is that I do not believe anything but a network
abstraction can create the levels of reliability needed to have truly
distributed redundancy for filesystems.
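
For readers unfamiliar with the term, the simplest form of a quorum rule
is a strict majority vote among replicas. The sketch below shows only
that rule; the function names are hypothetical, and nothing here is meant
to describe the protocol that would actually be layered on the cluster
filesystem.

```c
#include <assert.h>
#include <stdbool.h>

/* Smallest number of replicas that forms a strict majority of n. */
int quorum_size(int n)
{
    assert(n > 0);
    return n / 2 + 1;
}

/*
 * A write is considered committed only if a majority of the n
 * replicas acknowledged it. Any two majorities overlap in at least
 * one replica, which is what lets a later reader find the write.
 */
bool write_committed(int acks, int n)
{
    return acks >= quorum_size(n);
}
```

With 3 replicas the quorum is 2, so one machine can be lost (or be wrong)
without blocking writes; with 5 replicas the quorum is 3.
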
The Single-image abstraction that RAID provides isn't good enough
because it doesn't deal with bugs in the filesystem code itself or
the corruption from software bugs in today's complex kernels (in
Actual SSI might not be in the cards any more. It is virtually
impossible to do it at the vnode, device, and process level without
a complete rewrite of nearly the entire system. Sure one can migrate
whole VMs, but that is more a workaround and less a core solution.
However, with a fully cache coherent remote filesystem solution we
can actually get very close to SSI-like operation. If a shared mmap
of a file across a cache coherent remote mount is made possible then
process migration or at least shared memory spaces across physical
machines can certainly be made possible too.
I'm a bit loath to extend the per-cpu globaldata concept beyond 64
cpus (for 64-bit builds) for numerous reasons, not the least of which
being that the kernel per-cpu caches don't scale well when the
memory:ncpus ratio drops too low. We know this is a problem already
for per-thread caches in pthreads implementations for user programs
which is why our nmalloc implementation in libc tries to be very
careful to not leave too much stuff sitting around in a per-thread
cache that other threads can't get to.
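
The general idea of bounding a per-thread cache can be sketched as
follows. This is not the nmalloc code; the structure names, the cap of 4
blocks, and the fixed-size shared depot are all assumptions chosen to
keep the example short. A freed block stays thread-local only while the
local cache is under its cap; anything beyond that goes to a shared pool
where other threads can reach it.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

#define TCACHE_MAX 4    /* hypothetical cap on blocks one thread may hold */
#define DEPOT_MAX  64   /* hypothetical size of the shared pool */

struct tcache {                 /* one per thread, no locking needed */
    void  *blocks[TCACHE_MAX];
    size_t count;
};

struct depot {                  /* shared by all threads */
    pthread_mutex_t lock;
    void  *blocks[DEPOT_MAX];
    size_t count;
};

/*
 * Release a block. Returns 1 if it was kept in the thread-local cache,
 * 0 if the local cache was full and the block spilled to the depot.
 */
int tcache_free(struct tcache *tc, struct depot *dp, void *blk)
{
    if (tc->count < TCACHE_MAX) {
        tc->blocks[tc->count++] = blk;
        return 1;
    }
    pthread_mutex_lock(&dp->lock);
    assert(dp->count < DEPOT_MAX);
    dp->blocks[dp->count++] = blk;
    pthread_mutex_unlock(&dp->lock);
    return 0;
}

/* Obtain a block: local cache first, then the depot, else NULL. */
void *tcache_alloc(struct tcache *tc, struct depot *dp)
{
    void *blk = NULL;

    if (tc->count > 0)
        return tc->blocks[--tc->count];
    pthread_mutex_lock(&dp->lock);
    if (dp->count > 0)
        blk = dp->blocks[--dp->count];
    pthread_mutex_unlock(&dp->lock);
    return blk;
}
```

The cap is the knob that matters: the more threads there are relative to
memory, the more total memory sits stranded at TCACHE_MAX blocks per
thread, which is the same scaling problem the kernel per-cpu caches have.
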
There is an opportunity here too. There is no real need to support
more than 64 *KERNEL* threads on a massively hyperthreaded cpu when
just 2 per actual cpu (judged by the L1 cache topology) will yield
the same kernel performance.
The solution is obvious to me... multiplex the N hyperthreads per
real cpu into a single globaldata structure and interlock with a
spinlock. Or, alternatively, have no more than two globaldata entities
per real cpu (say in a situation where one has 4 hyperthreads or 8
hyperthreads). Spinlocks contending only between the hyperthreads
associated with the same cpu have an overhead cost of almost nothing.
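
A minimal userland sketch of the first option, using C11 atomics. The
struct here is a stand-in and not the kernel's globaldata; the constants
and function names are hypothetical, and the mapping assumes sibling
hyperthreads are numbered contiguously, which real topology enumeration
does not guarantee.

```c
#include <assert.h>
#include <stdatomic.h>

#define HT_PER_CORE 4   /* hypothetical: hyperthreads per real cpu */
#define NCORES      2   /* hypothetical: number of real cpus */

struct globaldata_sketch {      /* stand-in, not the kernel struct */
    atomic_flag   lock;         /* contended only by sibling hyperthreads */
    unsigned long counter;      /* placeholder for per-cpu state */
};

struct globaldata_sketch gd_table[NCORES];

void gd_init(void)
{
    for (int i = 0; i < NCORES; i++) {
        atomic_flag_clear(&gd_table[i].lock);
        gd_table[i].counter = 0;
    }
}

/* Map a hardware thread to the globaldata slot of its real cpu. */
int gd_index(int hwthread)
{
    assert(hwthread >= 0 && hwthread < NCORES * HT_PER_CORE);
    return hwthread / HT_PER_CORE;
}

void gd_lock(struct globaldata_sketch *gd)
{
    while (atomic_flag_test_and_set_explicit(&gd->lock,
                                             memory_order_acquire))
        ;   /* spin; only siblings on the same core ever contend */
}

void gd_unlock(struct globaldata_sketch *gd)
{
    atomic_flag_clear_explicit(&gd->lock, memory_order_release);
}

/* Example of touching shared per-core state under the spinlock. */
unsigned long gd_bump(int hwthread)
{
    struct globaldata_sketch *gd = &gd_table[gd_index(hwthread)];
    unsigned long v;

    gd_lock(gd);
    v = ++gd->counter;
    gd_unlock(gd);
    return v;
}
```

The second option (at most two globaldata entities per real cpu) would
only change the divisor in gd_index() and the table size.
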
Another possible solution is to not actually transition the extra
hyperthreads into the kernel core but instead hold them at the
userspace<->kernelspace border and have only dedicated kernel
hyperthreads run the kernel core. The user threads would just stall
at the edge until the real kernel thread tells them what to do.
Advantages of this include not having to save/restore the register
state and allowing the real kernel threads to run full floating point
(and all related compiler optimizations).
So, lots of possibilities abound here.