I've been letting the conversation run to see what people have to say,
I am going to select this posting by Brett to answer, as well as
provide some more information.
:I am not sure I understand the potential aim of the new file system -
:is it to allow all nodes on the SSI (I purposefully avoid terms like
:"grid") to have all "local" data actually on their hard drive or is it
:more like each node is aware of all data on the SSI, but the data may
:be scattered about all of the nodes on the SSI?
I have many requirements that need to be fullfilled by the new
filesystem. I have just completed the basic design work and I feel
quite confident that I can have the basics working by our Summer
- On-demand filesystem check and recovery. No need to scan the entire
filesystem before going live after a reboot.
- Infinite snapshots (e.g. on 30 second sync), ability to collapse
snapshots for any time interval as a means of recovering space.
- Multi-master replication at the logical layer (not the physical layer),
including the ability to self-heal a corrupted filesystem by accessing
replicated data. Multi-master means that each replicated store can
act as a master and new filesystem ops can run independantly on any
of the replicated stores and will flow to the others.
- Infinite log Replication. No requirement to keep a log of changes
for the purposes of replication, meaning that replication targets can
be offline for 'days' without effecting performance or operation.
('mobile' computing, but also replication over slow links for backup
purposes and other things).
- 64 bit file space, 64 bit filesystem space. No space restrictions
- Reliably handle data storage for huge multi-hundred-terrabyte
filesystems without fear of unrecoverable corruption.
- Cluster operation - ability to commit data to locally replicated
store independantly of other nodes, access governed by cache
:So, in effect, is it similar in concept to the notion of storing bits
:of files across many places using some unified knowledge of where the
:bits are? This of course implies redunancy and creates synchronization
:problems to handle (assuming bo global clock), but I certainly think
:it is a good goal. In reality, how redundant will the data be? In a
:practical sense, I think the principle of "locality" applies here -
:the pieces that make up large files will all be located very close to
:one another (aka, clustered around some single location).
Generally speaking the topology is up to the end-user. The main
issue for me is that the type of replication being done here is
logical layer replication, not physical replication. You can
think of it as running a totally independant filesystem for each
replication target, but the filesystems cooperate with each other
and cooperate in a clustered environment to provide a unified,
:>From my experiences, one of the largest issues related to large scale
:computing is the movement of large files, but with the trend moving
:towards huges local disks and many-core architectures (which I agree
:with), I see the "grid" concept of geographically diverse machines
:connected as a single system being put to rest in favor of local
:clusters of many-core machines.
:With that, the approach that DfBSD is taking is vital wrt distributed
:computing, but any hard requirement of moving huge files long
:distances (even if done so in parallel) might not be so great. What
:is required is native parallel I/O that is able to handle locally
:distributed situations - because it is within a close proximity that
:many processes would be writing to a "single" file. Reducing the
:scale of the problem may provide some clues into how it may be used
:and how it should handle the various situations effectively.
:Additionally, the concept of large files somewhat disappears when you
:are talking about shipping off virtual processes to execute on some
:other processor or core because they are not shipped off with a whole
:lot of data to work on. I know this is not necessarily a SSI concept,
:but one that DfBSD will have people wanting to do.
There are two major issues here: (1) Where the large files reside
and (2) How much of that data running programs need to access.
For example, lets say you had a 16 gigabyte database. There is a
big difference between scanning the entire 16 gigabytes of data
and doing a query that only has to access a few hundred kilobytes
of the data.
No matter what, you can't avoid reading data that a program insists
on reading. If the data is not cacheable or the amount of data
being read is huge, the cluster has the choice of moving the program's
running context closer to the storage, or transfering the data over
A cluster can also potentially use large local disks as a large
cache in order to be able to distribute the execution contexts, even
when the data being accessed is not local. It might be reasonable to
have a 100TB data store with 1TB local 'caches' in each LAN-based locale.
Either way I think it is pretty clear that individual nodes are
significantly more powerful when coupled with the availability of
local (LAN) hard disk storage, especially in the types of environments
that I want DragonFly to cater to.
The DragonFly clustering is meant to be general purpose... I really want
to be able to migrate execution contexts, data, and everything else
related to a running program *without* having to actually freeze the
program while the migration occurs. That gives system operators the
ultimate ability to fine-tune a cluster.