Annotation of src/lib/libc/sys/syslink.2, revision 1.1
1.1 ! dillon 1: .\" Copyright (c) 2007 The DragonFly Project. All rights reserved.
! 2: .\"
! 3: .\" This code is derived from software contributed to The DragonFly Project
! 4: .\" by Matthew Dillon <email@example.com>
! 5: .\"
! 6: .\" Redistribution and use in source and binary forms, with or without
! 7: .\" modification, are permitted provided that the following conditions
! 8: .\" are met:
! 9: .\"
! 10: .\" 1. Redistributions of source code must retain the above copyright
! 11: .\" notice, this list of conditions and the following disclaimer.
! 12: .\" 2. Redistributions in binary form must reproduce the above copyright
! 13: .\" notice, this list of conditions and the following disclaimer in
! 14: .\" the documentation and/or other materials provided with the
! 15: .\" distribution.
! 16: .\" 3. Neither the name of The DragonFly Project nor the names of its
! 17: .\" contributors may be used to endorse or promote products derived
! 18: .\" from this software without specific, prior written permission.
! 19: .\"
! 20: .\" THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
! 21: .\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
! 22: .\" LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
! 23: .\" FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
! 24: .\" COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
! 25: .\" INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
! 26: .\" BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
! 27: .\" LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
! 28: .\" AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
! 29: .\" OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
! 30: .\" OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
! 31: .\" SUCH DAMAGE.
! 32: .\"
! 33: .\" $DragonFly$
! 34: .\"
! 35: .Dd March 13, 2007
! 36: .Dt SYSLINK 2
! 37: .Os
! 38: .Sh NAME
! 39: .Nm syslink
! 40: .Nd low level connection to the cluster mesh
! 41: .Sh LIBRARY
! 42: .Lb libc
! 43: .Sh SYNOPSIS
! 44: .In sys/syslink.h
! 45: .Ft int
! 46: .Fn syslink "int fd" "int flags" "sysid_t routenode"
! 47: .Sh DESCRIPTION
! 48: The
! 49: .Fn syslink
! 50: function establishes a link to a kernel-implemented syslink route node
! 51: as specified by
! 52: .Fa routenode .
! 53: If a file descriptor of -1 is specified, a file descriptor representing
! 54: a direct connection to the specified route node will be allocated and
! 55: returned.
! 56: If a file descriptor is specified, it will be connected to the specified
! 57: route node via full-duplex communication and kernel threads will be
! 58: created to shuttle data between the descriptor and the route node. The
! 59: kernel may optimize and shortcut this operation.
! 60: .Pp
! 61: It is also perfectly legal to allocate two route nodes and then connect them
! 62: together by passing the file descriptor returned by the first
! 63: .Fn syslink
! 64: call to the second
! 65: .Fn syslink
! 66: call. It is legal (and usually necessary) to obtain multiple descriptors to
! 67: the same kernel-managed syslink route node.
! 68: .Pp
! 69: The syslink protocol revolves around 64 bit system ids using the
! 70: .Ft sysid_t
! 71: type. A system id can be logical or physical.
! 72: Physical system ids are negotiated dynamically as system links are created
! 73: and destroyed, while Logical system ids are persistently associated with
! 74: particular resources in the cluster.
! 75: For example, a particular filesystem mount will have a persistent logical
! 76: sysid and would have one or more physical sysids depending on how it
! 77: connects into the cluster mesh.
! 78: .Sh SYSLINK PROTOCOL - PHYSICAL SYSIDS
! 79: The Syslink protocol is used to glue the cluster mesh together. It is
! 80: based on the concept of reliable packets and buffered streams. Adding a
! 81: new node to the mesh is as simple as obtaining a stream connection to any
! 82: node already in the mesh, or tying into a packet switch with UDP.
! 83: .Pp
! 84: The first stage of the protocol is to negotiate a physical sysid space.
! 85: Each connection to the mesh negotiates its own space, meaning that
! 86: multi-homed entities (which are expected to be common) may be accessible
! 87: through multiple physical sysids. The physical sysid space can take time
! 88: to settle down and may change while the cluster is operational due to changes
! 89: in the cluster topology. For example, you can reconfigure the system id
! 90: space propagated out from a seed node (or a seed node could go down, or come
! 91: up), and effectively change some of the physical sysid assignments for
! 92: every node in the mesh while the mesh is live.
! 93: .Pp
! 94: Assignment of physical sysid space is simple. The seed nodes take their
! 95: statically assigned sysid space (specified by a 64 bit CIDR block), cut out
! 96: enough bits to handle the number of connections that need to be supported,
! 97: and then dole out a subnet to each connectee. If a connectee is a route node
! 98: it is then able to cut up the subnet CIDR block and dole out subnets to
! 99: nodes that connect to it. Leaf nodes have fixed SYSID space requirements,
! 100: typically 10 bits. If a leaf node is handed a 24 bit sysid space it will
! 101: still use only 10 bits of it. A leaf node handed a sysid space below its
! 102: minimum requirement simply ignores that space.
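The subdivision described above can be sketched in C. This is only an illustration: it assumes a sysid is a plain unsigned 64 bit integer and that a block is a (base, prefix length) pair in the usual CIDR sense. The names sysid_block, sysid_subnet and sysid_leaf_accepts are hypothetical and are not part of the syslink API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a sysid is treated as a plain unsigned 64 bit integer. */
typedef uint64_t sysid_t;

/* Hypothetical representation of a CIDR-style block of sysid space. */
struct sysid_block {
    sysid_t base;   /* first sysid in the block; the free low bits are zero */
    int     prefix; /* number of fixed leading bits, 0..64 */
};

/* Number of free bits remaining in a block. */
int
sysid_block_bits(struct sysid_block b)
{
    return 64 - b.prefix;
}

/*
 * Carve subnet number `index` out of `parent`, reserving `conn_bits`
 * bits to number the connectees.  Returns 0 on success, or -1 if the
 * parent lacks enough free bits or the index is out of range.
 */
int
sysid_subnet(struct sysid_block parent, int conn_bits, uint64_t index,
             struct sysid_block *out)
{
    int free_bits = sysid_block_bits(parent);

    assert(out != NULL);
    if (conn_bits < 1 || conn_bits >= 64 || conn_bits > free_bits)
        return -1;
    if (index >= ((uint64_t)1 << conn_bits))
        return -1;
    out->prefix = parent.prefix + conn_bits;
    out->base = parent.base | (index << (64 - out->prefix));
    return 0;
}

/* A leaf uses a space only if it offers at least `need_bits` free bits. */
int
sysid_leaf_accepts(struct sysid_block b, int need_bits)
{
    return sysid_block_bits(b) >= need_bits;
}
```

A route node handed a subnet can call sysid_subnet() again on it to dole out smaller blocks, which is why the hop count is bounded by the 64 bit width.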
! 103: .Pp
! 104: Eventually every seed node propagates its physical sysid space to every other
! 105: node in the mesh. If a mesh has four seed nodes, then every node in the mesh
! 106: will wind up with at least four SYSID spaces. Nodes may obtain additional
! 107: physical SYSID assignments due to loops in the graph. For example, if you
! 108: create a triangle between nodes A, B, and C, with B as the seed node, then
! 109: SYSID will propagate B->C->A->B and B->A->C->B and node A will wind up with
! 110: two physical SYSID assignments (and node B will have four) even though
! 111: there was only one seed node. Physical SYSID assignments represent routing
! 112: paths. Because the mesh is potentially too large to store the full graph
! 113: in memory, the SYSLINK protocol only requires that the four largest SYSID
! 114: spaces for any given seed be retained by every node. This creates a
! 115: self-healing mesh with reasonable, but not ultimate redundancy.
! 116: .Pp
! 117: Only a limited number of hops are supported in the mesh due to the
! 118: limitations of the 64 bit ID space and the need to be able to route
! 119: messages simply with a single 64 bit id - without having to retain a
! 120: route table for the whole mesh. Very large meshes require some attention
! 121: to the design of the topology to retain reasonable redundancy. For
! 122: example, if you are trying to create an internet-wide mesh to handle
! 123: a massively distributed problem which requires low data bandwidths,
! 124: you might implement a couple of very large CIDR distribution blocks
! 125: for people to connect to via TCP streams.
! 126: .Pp
! 127: Once physical SYSID space is assigned (and remember, the physical SYSID
! 128: space can change on the fly as nodes go up and down), messages may be sent
! 129: from one physical SYSID to another, or broadcast across the entire mesh.
! 130: Only messages to immediate neighbors are guaranteed to be reliable, but
! 131: for the cluster to operate efficiently packet loss is not tolerated.
! 132: Message delivery failures must be almost solely due to losses which occur
! 133: when the mesh changes (due to a node going up or down).
! 134: .Sh SYSLINK PROTOCOL - LOGICAL SYSIDS, REGISTRATION, AND LOOKUP
! 135: .Pp
! 136: Logical sysids are unique, persistent entities which bear little resemblance
! 137: to the physical sysid representing a node's connection to the mesh.
! 138: An entity might be a particular filesystem, piece of storage, or device.
! 139: The key to understanding the logical sysid is that it migrates with the
! 140: entity it represents. If you move a hard drive from one machine to another,
! 141: the logical sysids representing the ANVIL partitions on that hard drive
! 142: will also migrate.
! 143: .Pp
! 144: Whenever a leaf node connects to the mesh, it must register all entities
! 145: under its direct control with the route node it connects to.
! 146: A route node always collects all logical sysid registrations from all
! 147: directly connected leaves, and may optionally propagate the registrations
! 148: to other route nodes to further consolidate the lookup database. In
! 149: very large clusters route nodes typically do not propagate logical sysid
! 150: registrations very far since this would create a massive burden on internal
! 151: route nodes. They need propagate only far enough to reduce the overhead
! 152: of a LOOKUP. LOOKUP requests translate logical sysids to physical sysids.
! 153: A LOOKUP request is a broadcast entity which must be propagated through
! 154: the mesh until it hits route nodes with complete registration tables.
! 155: The fewer such nodes exist, the less overhead a LOOKUP takes.
! 156: LOOKUP operations almost always return multiple physical sysids. Multiple
! 157: sysids may be returned due to having multiple seeding nodes or due to loops
! 158: in the graph, potentially providing a more optimal communications path for
! 159: a packet.
! 160: .Sh SYSLINK PROTOCOL - MESSAGE ROUTING
! 161: A syslink message contains the logical sysid of the originator and the target,
! 162: and may cache the physical sysid for routing purposes. Once cached, the
! 163: physical sysid contains all information required to fully and trivially route
! 164: the message through the mesh.
! 165: A leaf in the mesh typically specifies a physical sysid of 0 and lets the
! 166: nearest route node do the logical sysid lookup of the target. The route
! 167: node will attempt to cache translations along with propagation times to
! 168: choose the best physical sysid to use to get to the target. A simple hop
! 169: count is not used, as links might have different bandwidths and propagation
! 170: delays.
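As a rough illustration of choosing among cached translations, the sketch below picks the candidate with the lowest measured propagation time. The route_cand structure, the microsecond unit and the route_best function are all assumptions made for this example. A real route node would presumably also weigh bandwidth, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a sysid is treated as a plain unsigned 64 bit integer. */
typedef uint64_t sysid_t;

/* Hypothetical cache entry: one physical sysid for a logical target. */
struct route_cand {
    sysid_t  physid;    /* cached physical sysid */
    uint32_t prop_usec; /* measured propagation time (assumed unit) */
};

/*
 * Return the candidate with the smallest propagation time, or NULL
 * when no translation is cached.  The first entry wins on a tie.
 */
const struct route_cand *
route_best(const struct route_cand *cands, size_t n)
{
    const struct route_cand *best = NULL;
    size_t i;

    assert(cands != NULL || n == 0);
    for (i = 0; i < n; ++i) {
        if (best == NULL || cands[i].prop_usec < best->prop_usec)
            best = &cands[i];
    }
    return best;
}
```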
! 171: .Pp
! 172: Syslink messages are transactional in nature and it is possible for a single
! 173: transaction to be made up of multiple messages... for example, to break down
! 174: a large buffer into smaller pieces for the purposes of transmission over the
! 175: mesh. The syslink protocol imposes fairly severe limitations on transactional
! 176: messages and sizes... syslink messages are not meant to abstract very large
! 177: multi-megabyte I/O operations but instead are meant to provide a reliable
! 178: communications abstraction for small messages.
! 179: A transaction may contain no more than 32 individual messages, allowing
! 180: the route node to use a simple bitmap to track messages which may arrive
! 181: out of order.
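The 32 message limit means one 32 bit word is enough to track arrivals. A minimal sketch follows, assuming the total message count is already known when tracking starts (in practice it may only be learned from the last message). The trans_track names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TRANS_MAX_MSGS 32

/* Hypothetical per-transaction reassembly state. */
struct trans_track {
    uint32_t received; /* bit n is set once message n has arrived */
    int      total;    /* number of messages expected, 1..32 */
};

void
trans_track_init(struct trans_track *t, int total)
{
    assert(total >= 1 && total <= TRANS_MAX_MSGS);
    t->received = 0;
    t->total = total;
}

/*
 * Record the arrival of message `seq`.
 * Returns 1 if it is new, 0 if it is a duplicate, -1 if out of range.
 */
int
trans_track_mark(struct trans_track *t, int seq)
{
    uint32_t bit;

    if (seq < 0 || seq >= t->total)
        return -1;
    bit = (uint32_t)1 << seq;
    if (t->received & bit)
        return 0;
    t->received |= bit;
    return 1;
}

/* Nonzero once every expected message has been seen. */
int
trans_track_complete(const struct trans_track *t)
{
    uint32_t all = (t->total == TRANS_MAX_MSGS) ?
        UINT32_MAX : (((uint32_t)1 << t->total) - 1);

    return t->received == all;
}
```

Because arrival order does not matter to the bitmap, out-of-order and duplicate messages are handled by the same test.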
! 182: Multiple transactions may be run in parallel between two logical sysids.
! 183: .Pp
! 184: A 32 bit transaction space field is used to encode the whole mess.
! 185: One bit is used to tag the first message in a transaction, one bit
! 186: to tag the last message (both bits would be set if the transaction
! 187: consists of a single message), one bit indicates which side initiated
! 188: the transaction, allowing both sides to initiate transactions without
! 189: creating conflicts or having to negotiate the transaction space,
! 190: 20 bits implement a unique transaction number that will not be reused for a
! 191: very long time, allowing route nodes to weed out duplicate packets, and 8 bits
! 192: are reserved for the sequence number within the transaction (just in case
! 193: we want to expand the maximum number of messages to 256 in the future).
! 194: Note that a portion of the 20 bit
! 195: unique transaction number is a timestamp.
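One way to pack and unpack such a field is sketched below. The text gives only the field widths, so the bit positions chosen here are an assumption, as are all of the TR_ and tr_ names. The listed fields add up to 31 bits, so this sketch leaves bit 28 unassigned.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout; only the widths come from the description above. */
#define TR_FIRST    0x80000000u /* first message of the transaction */
#define TR_LAST     0x40000000u /* last message of the transaction */
#define TR_INITSIDE 0x20000000u /* which side initiated the transaction */
/* bit 28 is left unassigned in this sketch */
#define TR_ID_SHIFT 8
#define TR_ID_MASK  0x000FFFFFu /* 20 bit unique transaction number */
#define TR_SEQ_MASK 0x000000FFu /* 8 bit sequence number */

/* Build a transaction field from its parts. */
uint32_t
tr_encode(int first, int last, int side, uint32_t id, uint32_t seq)
{
    uint32_t v = 0;

    assert(id <= TR_ID_MASK && seq <= TR_SEQ_MASK);
    if (first)
        v |= TR_FIRST;
    if (last)
        v |= TR_LAST;
    if (side)
        v |= TR_INITSIDE;
    v |= id << TR_ID_SHIFT;
    v |= seq;
    return v;
}

/* Extract the 20 bit transaction number. */
uint32_t
tr_id(uint32_t v)
{
    return (v >> TR_ID_SHIFT) & TR_ID_MASK;
}

/* Extract the 8 bit sequence number. */
uint32_t
tr_seq(uint32_t v)
{
    return v & TR_SEQ_MASK;
}

/* Nonzero when the transaction consists of a single message. */
int
tr_is_single(uint32_t v)
{
    return (v & (TR_FIRST | TR_LAST)) == (TR_FIRST | TR_LAST);
}
```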
! 196: .Pp
! 197: The messages making up a transaction can arrive out of order and will be
! 198: collected by the target until all messages are present. The originator
! 199: must hold onto all messages it sends (so it can re-send if requested by
! 200: the route node), until it has the complete response.
! 201: .Pp
! 202: The route node for a leaf is responsible for weeding out duplicate messages,
! 203: monitoring transactions, and handling timeouts (returning a retry indication
! 204: to the leaf).
! 205: If the physical sysid becomes invalid the route node is typically responsible
! 206: for locating a new physical sysid and returning a transaction abort to the
! 207: leaf.
! 208: Even though dynamic rerouting is possible, the route node and
! 209: originator have no idea whether the new physical sysid represents the same
! 210: actual leaf or some different leaf with access to the same logical entity
! 211: (such as you might find in a SAN environment).
! 212: Because of this, changes in the physical id require a transaction abort
! 213: and full transaction retry.
! 214: This greatly simplifies operation of the leaf node.
! 215: .Pp
! 216: The SYSLINK protocol is not intended to take the place of a reliable link
! 217: level protocol such as TCP and mesh links should only use UDP when packet
! 218: delivery can be virtually guaranteed (such as when operating over switched
! 219: ethernet). UDP-based syslinks may still buffer multiple messages within
! 220: the limitations of the UDP packet.
! 221: .Pp
! 222: The SYSLINK protocol is not intended to provide quorum guarantees. Quorum
! 223: protocols operate over SYSLINK, but are not implemented by SYSLINK.
! 224: .Sh SYSLINK PROTOCOL - MESSAGE BUFFERING
! 225: Syslinks which operate over buffered connections where messages may be
! 226: sent or received in bulk must adhere to certain alignment and cross-over
! 227: requirements to allow buffers to be implemented as FIFOs. The message length
! 228: field in a syslink message is not particularly aligned, but syslink messages
! 229: themselves must always be 16-byte aligned, creating small amounts of dead
! 230: space in the buffer (and the data stream). Additionally, the physical
! 231: sysid propagation protocol also propagates a FIFO cross-over size, which is
! 232: always a power of 2. Typical values range from 64KB to 1024KB. Messages
! 233: received on a stream can be written into a buffer in FIFO fashion. No single
! 234: message may straddle the end of the FIFO's physical buffer (that is, cross
! 235: back over to the beginning). All transmitters must adhere to the FIFO
! 236: size supplied in the initial message traffic by generating a PAD message
! 237: when necessary. Larger FIFO sizes are usually better since they result
! 238: in smaller PADs. I/O transactions containing data are typically broken up
! 239: into smaller messages not only to accommodate limitations in transport
! 240: protocols (such as UDP), but also to reduce the dead space created by PADs.
! 241: On the bright side, these requirements allow very optimal hardware and
! 242: software buffering of syslink message traffic.
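The alignment and cross-over rules reduce to two small calculations, sketched here. The sketch assumes the write offset is a running byte counter that is already 16-byte aligned and that the PAD message simply fills the rest of the buffer. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SL_ALIGN 16

/* Round a message length up to the next 16-byte boundary. */
size_t
sl_align(size_t len)
{
    return (len + (SL_ALIGN - 1)) & ~(size_t)(SL_ALIGN - 1);
}

/*
 * Return the number of PAD bytes a transmitter must emit before a
 * message of `len` bytes can be written at running offset `woff` in a
 * FIFO of `fifo_size` bytes (a power of 2).  Returns 0 when the
 * aligned message fits before the physical end of the buffer.
 */
size_t
sl_pad_needed(size_t woff, size_t len, size_t fifo_size)
{
    size_t pos, room;

    assert(fifo_size != 0 && (fifo_size & (fifo_size - 1)) == 0);
    assert(sl_align(len) <= fifo_size);
    pos = woff & (fifo_size - 1); /* offset within the physical buffer */
    room = fifo_size - pos;       /* bytes left before the wrap point */
    return (sl_align(len) <= room) ? 0 : room;
}
```

Because the FIFO size is a power of 2, the position within the buffer is a single mask operation, and a receiver can copy the stream into its own buffer of the same size without ever splitting a message.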
! 243: .Sh BLOCKING TRANSACTIONS
! 244: Certain operations can block. That is, the target may not be able to
! 245: immediately complete the requested transaction. When a transaction blocks
! 246: the target is responsible for returning a keep-alive blocking indication
! 247: to the originator to prevent the originator from retrying or aborting
! 248: the transaction. Keep-alives can be directly handled by the route node
! 249: connected to the target (since it knows if the leaf disconnects),
! 250: simplifying leaf operation. A route node will very occasionally do a sanity
! 251: check request to the leaf (perhaps once a minute) to verify that
! 252: transactions blocked for a long time are still known to the leaf.
! 253: .Pp
! 254: Blocking indications are special response messages that set the
! 255: blocked-operation bit in the sequence field and do not set the
! 256: end-transaction bit.
! 257: .Sh TRANSACTION ABORTS
! 258: A transaction can be aborted. Normally aborted transactions still
! 259: require an acknowledgement (since the abort may race completion).
! 260: If the target completes the transaction before receiving the abort
! 261: request, it is as if the abort never occurred.
! 262: .Sh ASYNCHRONOUS PUSH TRANSACTIONS
! 263: Most syslink transactions require an acknowledgement to terminate the
! 264: transaction. The acknowledgement is typically a single message in the
! 265: return direction with both the start and stop bits set. Multi-message
! 266: responses are of course possible, such as when the transaction is
! 267: implementing an I/O read operation.
! 268: .Pp
! 269: Certain syslink transactions do not require an acknowledgement and do not
! 270: implement the retry or timeout protocols. Such transactions are typically
! 271: cache-push operations which are used to optimize operation of the cluster
! 272: by allowing a node to asynchronously push data to places where it thinks
! 273: it will be needed immediately. The most common use of this sort of
! 274: operation is the read-ahead optimization. When one node performs a read
! 275: transaction with another node, and the target node is capable of read-ahead
! 276: and determines that read-ahead is useful, the target node can initiate the
! 277: read-ahead and push the data to the originating node in a separate
! 278: asynchronous transaction. Read-aheads are typically not directly adjacent
! 279: to the read that just occurred in order to allow the originator to initiate
! 280: the next synchronous transaction without it crossing paths with the
! 281: asynchronous read-ahead push (resulting in the same data being returned to
! 282: the originator twice).
! 283: .Sh OPERATING AS A ROUTE NODE
! 284: Most userland applications using syslink will operate as leaf nodes, but
! 285: there is nothing preventing you from operating as a route node. Operating
! 286: as a route node requires implementing all route node requirements including
! 287: the handling of logical sysid registrations and the tracking of transactions
! 288: initiated by nodes that directly connect to you. In fact, sysid seeding
! 289: nodes are user processes which operate as degenerate route nodes.
! 290: .Sh RETURN VALUES
! 291: The value -1 is returned if an error occurs.
! 292: The external variable
! 293: .Va errno
! 294: indicates the cause of the error.
! 295: If a descriptor is supplied and the system call is successful, 0 is
! 296: returned. If a descriptor is not supplied and the system call is successful,
! 297: a descriptor is returned representing a direct connection to the mesh's
! 298: route node.
! 299: .Sh SEE ALSO
! 300: .Sh HISTORY
! 301: The
! 302: .Fn syslink
! 303: function first appeared in
! 304: .Dx 1.9 .