:Matthew Dillon wrote:
:> In a clustered environment the execution context (what 'cp' is actually
:> running on) can be anywhere. But there is absolutely no reason for the
:> file data to physically pass through that machine if 'cp' itself does
:> not need to know what the file contains. If done properly, the actual
:> file data would be transported directly from machine A to machine B,
:> or stay strictly within machine A in the second example.
:Are such operations going to be exposed through system calls? In other
:words, does this mean that userland utilities will need to be modified
:to fully support (efficiently) this type of copy by reference?
No. Only userland programs acting as data sources or data sinks
via the protocol (i.e. a userland VFS or cluster-related processes)
would need to be modified.
Something like 'cp' would just use read() and write() or mmap() and
write(). The key to making something like this work with 'cp' is
that the kernel would not instantiate the VM pages backing the
buffer being read into or the VM pages backing the memory map. So
it would be possible for 'cp' to read/mmap and write without ever
touching the actual data.
That's just an example. It would be fairly complex to actually make
it work with something like read(), but the mmap/write combination is
far more achievable since VM objects are already hierarchical and
would be fairly easy to 'back' with a cache line ID.
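
To make the mmap/write combination concrete, here is a minimal sketch of
what a 'cp'-like utility does in userland today. It uses only standard
POSIX calls; the function name copy_by_mmap() is made up for
illustration. Nothing in it requests or observes the kernel-side
optimization described above. The point is only that the program never
dereferences the mapping itself, so the kernel would be free to satisfy
the write() without instantiating those pages.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy src to dst using mmap() on the source and write() on the
 * destination.  Returns 0 on success, -1 on failure.
 *
 * The mapped address is only ever handed to write(); this code never
 * reads through it.  Whether the kernel avoids faulting the pages in
 * is an assumption about a future kernel, not something shown here.
 */
int
copy_by_mmap(const char *src, const char *dst)
{
    struct stat st;
    char *map;
    size_t len, off = 0;
    int in, out, rc = -1;

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    if (fstat(in, &st) == 0) {
        len = (size_t)st.st_size;
        if (len == 0) {
            rc = 0;                 /* empty file: nothing to map */
        } else {
            map = mmap(NULL, len, PROT_READ, MAP_SHARED, in, 0);
            if (map != MAP_FAILED) {
                while (off < len) {
                    ssize_t n = write(out, map + off, len - off);
                    if (n <= 0)
                        break;      /* treat short-circuit as failure */
                    off += (size_t)n;
                }
                if (off == len)
                    rc = 0;
                munmap(map, len);
            }
        }
    }
    close(in);
    close(out);
    return rc;
}
```

A caller would simply invoke copy_by_mmap("a", "b") and check for a zero
return; the utility itself needs no cluster-specific changes.
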
:What level of transactional support will be provided? For example, will
:the cp utility return before or after the data itself is made durable?
:Will it be possible for the cp utility to complete successfully, have
:the node containing the referenced cache data fail and thus the
:transaction fail after the fact?
That would be up to the utility. 'cp' doesn't guarantee that an
operation is made durable even now since most of the data winds up
in the buffer cache. Userland would have to perform a sync or fsync
of some sort to make the data durable.
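
As a small illustration of the userland side, a utility that does want
durability has to ask for it explicitly. The sketch below uses only
standard POSIX calls; the helper name write_durably() is made up for
this example, and it says nothing about how a cluster would implement
the flush underneath.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to path and do not report success until fsync()
 * has returned.  Returns 0 on success, -1 on failure.
 */
int
write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    size_t off = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    while (off < len) {
        ssize_t n = write(fd, p + off, len - off);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }
    /* Until fsync() succeeds the data may live only in the buffer cache. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A plain 'cp' skips the fsync() step, which is why its successful exit
says nothing about durability.
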
:What are the error recovery/failure scenarios in the case that a node
:with the only copy of referenced cached data fails?
:Best of luck with your work, and thank you!
This falls into the category of 'complex issues that one has to deal
with to make a clustering system robust'. It is less of a problem
within a machine, even with a userland process (aka a userland VFS)
acting as the data source. If the data is unrecoverably lost then
whatever is trying to access it would probably end up having to seg-fault.
The key is to manage the state of the cache id based on the needs of
So, for example, let's say you have two userland VFS mounts and you
are copying data from one to the other via the kernel. UVFS(A) passes
a cache ID to the kernel which forwards it to UVFS(B). Several things
can happen asynchronously:
* UVFS(A) can decide that it has to flush the data. It sends a cache
flush to the kernel which forwards it to UVFS(B), which forces UVFS(B)
to read the actual data from UVFS(A) and then de-ref UVFS(A)'s cache ID.
* UVFS(B) can decide to instantiate its own copy of the cached data.
It would send a cache ID read command to UVFS(A) to get the data and
then de-ref UVFS(A)'s cache ID.
* Either UVFS(A) or UVFS(B) could decide that these things need to be
done as part of the operation that originally requested the data,
making the data access effectively synchronous. A failure would
cause the original operation to fail (versus a cache ID getting
forwarded through many subsystems and the failure occurring at some
later time in a seemingly unrelated subsystem).
* If UVFS(A) crashes and burns, the data is not necessarily lost. E.g.
if UVFS(A) seg-faults and generates a core, the data would still be
retrievable after restart. If the cache line IDs represent data in
backing store that UVFS(A) hasn't itself read yet, then the data would
also still be retrievable.
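
The cases above can be modelled in a few lines of C. This is a toy,
single-process sketch and not the actual protocol: every struct and
function name in it (cache_id, sink, cache_id_forward, and so on) is
hypothetical, the "kernel" is reduced to a direct function call, and
crash recovery is not modelled at all. It only shows the reference
bookkeeping: forwarding moves no data, a flush forces the holder of a
reference to take its own copy first, and the synchronous variant
reports a failure in the operation that asked for the data.

```c
#include <stdlib.h>
#include <string.h>

/* All names below are hypothetical and exist only for this sketch. */

struct cache_id {            /* owned by the data source, e.g. UVFS(A) */
    int    refs;             /* references handed out to other parties */
    char  *data;             /* source's copy; NULL once flushed or lost */
    size_t len;              /* assumed to be greater than zero */
};

struct sink {                /* a consumer, e.g. UVFS(B) */
    struct cache_id *ref;    /* forwarded cache ID, NULL after de-ref */
    char  *copy;             /* private copy, NULL until instantiated */
    size_t len;
};

/* The kernel forwards a cache ID from source to sink: no data moves. */
void
cache_id_forward(struct cache_id *cid, struct sink *s)
{
    cid->refs++;
    s->ref = cid;
}

/* Sink pulls the real data from the source, then drops its reference. */
int
sink_instantiate(struct sink *s)
{
    struct cache_id *cid = s->ref;

    if (cid == NULL)
        return 0;                    /* nothing outstanding */
    if (cid->data == NULL)
        return -1;                   /* source no longer has the data */
    s->copy = malloc(cid->len);
    if (s->copy == NULL)
        return -1;
    memcpy(s->copy, cid->data, cid->len);
    s->len = cid->len;
    cid->refs--;
    s->ref = NULL;
    return 0;
}

/* Source wants to flush: a sink holding a reference must copy first. */
int
source_flush(struct cache_id *cid, struct sink *s)
{
    if (s->ref == cid && sink_instantiate(s) != 0)
        return -1;
    free(cid->data);
    cid->data = NULL;
    return 0;
}

/* Synchronous variant: failure surfaces in the forwarding operation. */
int
cache_id_forward_sync(struct cache_id *cid, struct sink *s)
{
    cache_id_forward(cid, s);
    if (sink_instantiate(s) != 0) {
        cid->refs--;                 /* undo the forward */
        s->ref = NULL;
        return -1;
    }
    return 0;
}
```

With the asynchronous path, a lost source copy only shows up later when
sink_instantiate() is finally called; with cache_id_forward_sync() the
same loss is reported immediately to whoever requested the data.
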
If one expands this to a clustered system, and one assumes that the
cache line data is not recoverable if a machine crashes, then the issue
becomes one of redundancy.
The data represented by a cache line ID as described in my original
posting *CAN* be cached by multiple machines in the cluster. The
capability is there, but the algorithms to use this feature effectively
are probably going to end up being fairly complex. They are as-yet