Annotation of src/lib/libc/sys/syslink.2, revision 1.1

1.1     ! dillon      1: .\" Copyright (c) 2007 The DragonFly Project.  All rights reserved.
        !             2: .\"
        !             3: .\" This code is derived from software contributed to The DragonFly Project
        !             4: .\" by Matthew Dillon <dillon@backplane.com>
        !             5: .\"
        !             6: .\" Redistribution and use in source and binary forms, with or without
        !             7: .\" modification, are permitted provided that the following conditions
        !             8: .\" are met:
        !             9: .\"
        !            10: .\" 1. Redistributions of source code must retain the above copyright
        !            11: .\"    notice, this list of conditions and the following disclaimer.
        !            12: .\" 2. Redistributions in binary form must reproduce the above copyright
        !            13: .\"    notice, this list of conditions and the following disclaimer in
        !            14: .\"    the documentation and/or other materials provided with the
        !            15: .\"    distribution.
        !            16: .\" 3. Neither the name of The DragonFly Project nor the names of its
        !            17: .\"    contributors may be used to endorse or promote products derived
        !            18: .\"    from this software without specific, prior written permission.
        !            19: .\"
        !            20: .\" THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
        !            21: .\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
        !            22: .\" LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
        !            23: .\" FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
        !            24: .\" COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
        !            25: .\" INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
        !            26: .\" BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
        !            27: .\" LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
        !            28: .\" AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
        !            29: .\" OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
        !            30: .\" OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
        !            31: .\" SUCH DAMAGE.
        !            32: .\"
        !            33: .\" $DragonFly$
        !            34: .\"
        !            35: .Dd March 13, 2007
        !            36: .Dt SYSLINK 2
        !            37: .Os
        !            38: .Sh NAME
        !            39: .Nm syslink
        !            40: .Nd low-level connection to the cluster mesh
        !            41: .Sh LIBRARY
        !            42: .Lb libc
        !            43: .Sh SYNOPSIS
        !            44: .In sys/syslink.h
        !            45: .Ft int
        !            46: .Fn syslink "int fd" "int flags" "sysid_t routenode"
        !            47: .Sh DESCRIPTION
        !            48: The
        !            49: .Fn syslink
        !            50: function establishes a link to a kernel-implemented syslink route node
        !            51: as specified by
        !            52: .Fa routenode .
        !            53: If a file descriptor of -1 is specified, a file descriptor representing
        !            54: a direct connection to the specified route node will be allocated and
        !            55: returned.
        !            56: If a file descriptor is specified, it will be connected to the specified
        !            57: route node via full-duplex communication and kernel threads will be
        !            58: created to shuttle data between the descriptor and the route node.  The
        !            59: kernel may optimize and shortcut this operation.
        !            60: .Pp
        !            61: It is also perfectly legal to allocate two route nodes and then connect them
        !            62: together by passing the file descriptor returned by the first
        !            63: .Fn syslink
        !            64: call to the second
        !            65: .Fn syslink
        !            66: call.  It is legal (and usually necessary) to obtain multiple descriptors to
        !            67: the same kernel-managed syslink route node.
        !            68: .Pp
        !            69: The syslink protocol revolves around 64 bit system ids using the
        !            70: .Ft sysid_t
        !            71: type.  A system id can be logical or physical.
        !            72: Physical system ids are negotiated dynamically as system links are created
        !            73: and destroyed, while logical system ids are persistently associated with
        !            74: particular resources in the cluster. 
        !            75: For example, a particular filesystem mount will have a persistent logical
        !            76: sysid and would have one or more physical sysids depending on how it
        !            77: connects into the cluster mesh.
        !            78: .Sh SYSLINK PROTOCOL - PHYSICAL SYSIDS
        !            79: The Syslink protocol is used to glue the cluster mesh together.  It is
        !            80: based on the concept of reliable packets and buffered streams.  Adding a
        !            81: new node to the mesh is as simple as obtaining a stream connection to any
        !            82: node already in the mesh, or tying into a packet switch with UDP.
        !            83: .Pp
        !            84: The first stage of the protocol is to negotiate a physical sysid space. 
        !            85: Each connection to the mesh negotiates its own space, meaning that
        !            86: multi-homed entities (which are expected to be common) may be accessible
        !            87: through multiple physical sysids.  The physical sysid space can take time
        !            88: to settle down and may change while the cluster is operational due to changes
        !            89: in the cluster topology.  For example, you can reconfigure the system id
        !            90: space propagated out from a seed node (or a seed node could go down, or come
        !            91: up), and effectively change some of the physical sysid assignments for
        !            92: every node in the mesh while the mesh is live.
        !            93: .Pp
        !            94: Assignment of physical sysid space is simple.  The seed nodes take their
        !            95: statically assigned sysid space (specified by a 64 bit CIDR block), cut out
        !            96: enough bits to handle the number of connections that need to be supported,
        !            97: and then dole out a subnet to each connectee.  If a connectee is a route node
        !            98: it is then able to cut up the subnet CIDR block and dole out subnets to
        !            99: nodes that connect to it.  Leaf nodes have fixed SYSID space requirements,
        !           100: typically 10 bits.  If a leaf node is handed a 24 bit sysid space it will
        !           101: still use only 10 bits of it.  A leaf node handed a sysid space below its
        !           102: minimum requirement simply ignores that space.
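.Pp
The subnet assignment described above can be sketched in C as follows.
This is only an illustration under stated assumptions: the
struct sysid_space representation and the helper names are hypothetical
and are not part of sys/syslink.h, and a space is modelled as a 64 bit
base plus a count of assignable low-order bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation of a sysid space (not from sys/syslink.h). */
struct sysid_space {
	uint64_t base;	/* fixed high-order bits, low-order bits zero */
	int	 bits;	/* number of assignable low-order bits (0..64) */
};

/* Smallest number of bits able to index nconn connections. */
static int
bits_for(uint64_t nconn)
{
	int bits = 0;

	while (bits < 64 && ((uint64_t)1 << bits) < nconn)
		++bits;
	return (bits);
}

/*
 * Carve the subnet for connection 'index' out of 'parent', cutting out
 * enough bits to number 'nconn' connections.  Returns 0 on success or
 * -1 if the parent space is too small.
 */
static int
carve_subnet(const struct sysid_space *parent, uint64_t nconn,
    uint64_t index, struct sysid_space *child)
{
	int cut = bits_for(nconn);

	assert(index < nconn);
	if (cut > parent->bits)
		return (-1);
	child->bits = parent->bits - cut;
	child->base = parent->base;
	if (cut > 0)
		child->base |= index << child->bits;
	return (0);
}

/* A leaf uses the space only if it meets its fixed requirement. */
static int
leaf_accepts(const struct sysid_space *space, int leaf_bits)
{
	return (space->bits >= leaf_bits);
}
```

A route node would apply carve_subnet() again to the space it was handed,
while a leaf would call leaf_accepts() with its fixed requirement
(typically 10 bits) and ignore any space that is too small.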
        !           103: .Pp
        !           104: Eventually every seed node propagates its physical sysid space to every other
        !           105: node in the mesh.  If a mesh has four seed nodes, then every node in the mesh
        !           106: will wind up with at least four SYSID spaces.  Nodes may obtain additional
        !           107: physical SYSID assignments due to loops in the graph.  For example, if you
        !           108: create a triangle between nodes A, B, and C, with B as the seed node, then
        !           109: SYSID will propagate B->C->A->B and B->A->C->B and node A will wind up with
        !           110: two physical SYSID assignments (and node B will have four) even though
        !           111: there was only one seed node.  Physical SYSID assignments represent routing
        !           112: paths.  Because the mesh is potentially too large to store the full graph
        !           113: in memory, the SYSLINK protocol only requires that the four largest SYSID
        !           114: spaces for any given seed be retained by every node.  This creates a 
        !           115: self-healing mesh with reasonable, but not ultimate redundancy.
        !           116: .Pp
        !           117: Only a limited number of hops are supported in the mesh due to the 
        !           118: limitations of the 64 bit ID space and the need to be able to route
        !           119: messages simply with a single 64 bit id - without having to retain a
        !           120: route table for the whole mesh.  Very large meshes require some attention
        !           121: to the design of the topology to retain reasonable redundancy.  For
        !           122: example, if you are trying to create an internet-wide mesh to handle
        !           123: a massively distributed problem which requires low data bandwidths,
        !           124: you might implement a couple of very large CIDR distribution blocks
        !           125: for people to connect to via TCP streams.
        !           126: .Pp
        !           127: Once physical SYSID space is assigned (and remember, the physical SYSID
        !           128: space can change on the fly as nodes go up and down), messages may be sent
        !           129: from one physical SYSID to another, or broadcast across the entire mesh.
        !           130: Only messages to immediate neighbors are guaranteed to be reliable, but
        !           131: for the cluster to operate efficiently packet loss is not tolerated.
        !           132: Message delivery failures must be almost solely due to losses which occur
        !           133: when the mesh changes (due to a node going up or down).
        !           134: .Sh SYSLINK PROTOCOL - LOGICAL SYSIDS, REGISTRATION, AND LOOKUP
        !           135: .Pp
        !           136: Logical sysids are unique, persistent entities which bear little resemblance
        !           137: to the physical sysid representing a node's connection to the mesh.
        !           138: An entity might be a particular filesystem, piece of storage, or device.
        !           139: The key to understanding the logical sysid is that it migrates with the
        !           140: entity it represents.  If you move a hard drive from one machine to another,
        !           141: the logical sysids representing the ANVIL partitions on that hard drive
        !           142: will also migrate.
        !           143: .Pp
        !           144: Whenever a leaf node connects to the mesh, it must register all entities
        !           145: under its direct control with the route node it connects to.
        !           146: A route node always collects all logical sysid registrations from all
        !           147: directly connected leaf nodes, and may optionally propagate the registrations
        !           148: to other route nodes to further consolidate the lookup database.  In
        !           149: very large clusters route nodes typically do not propagate logical sysid
        !           150: registrations very far since this would create a massive burden on internal
        !           151: route nodes.  They need propagate only far enough to reduce the overhead
        !           152: of a LOOKUP.  LOOKUP requests translate logical sysids to physical sysids.
        !           153: A LOOKUP request is a broadcast entity which must be propagated through
        !           154: the mesh until it hits route nodes with complete registration tables. 
        !           155: The fewer such nodes exist, the less overhead a LOOKUP takes.
        !           156: LOOKUP operations almost always return multiple physical sysids.  Multiple
        !           157: sysids may be returned due to having multiple seeding nodes or due to loops
        !           158: in the graph, potentially providing a more optimal communications path for
        !           159: a packet.
        !           160: .Sh SYSLINK PROTOCOL - MESSAGE ROUTING
        !           161: A syslink message contains the logical sysid of the originator and the target,
        !           162: and may cache the physical sysid for routing purposes.  Once cached, the
        !           163: physical sysid contains all information required to fully and trivially route
        !           164: the message through the mesh.
        !           165: A leaf in the mesh typically specifies a physical sysid of 0 and lets the
        !           166: nearest route node do the logical sysid lookup of the target.  The route
        !           167: node will attempt to cache translations along with propagation times to
        !           168: choose the best physical sysid to use to get to the target.  A simple hop
        !           169: count is not used, as links might have different bandwidths and propagation
        !           170: delays.
        !           171: .Pp
        !           172: Syslink messages are transactional in nature and it is possible for a single
        !           173: transaction to be made up of multiple messages, for example to break down
        !           174: a large buffer into smaller pieces for the purposes of transmission over the
        !           175: mesh.  The syslink protocol imposes fairly severe limitations on transactional
        !           176: messages and sizes: syslink messages are not meant to abstract very large
        !           177: multi-megabyte I/O operations but instead are meant to provide a reliable 
        !           178: communications abstraction for small messages.
        !           179: A transaction may contain no more than 32 individual messages, allowing
        !           180: the route node to use a simple bitmap to track messages which may arrive
        !           181: out of order.
        !           182: Multiple transactions may be run in parallel between two logical sysids.
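.Pp
The bitmap tracking mentioned above might look like the following C sketch.
The structure and function names are hypothetical, and the sketch assumes
that sequence numbers start at 0 and that the message carrying the
end-of-transaction tag has the highest sequence number.

```c
#include <assert.h>
#include <stdint.h>

#define TRANS_MAXMSGS	32	/* per-transaction limit stated above */

/* Hypothetical per-transaction reassembly state. */
struct trans_track {
	uint32_t received;	/* bit n is set once message n has arrived */
	int	 total;		/* message count, 0 until the last one is seen */
};

/*
 * Record the arrival of message 'seq'.  'last' is non-zero when the
 * message carries the end-of-transaction tag.  Returns 1 if the message
 * is new and 0 if it is a duplicate.
 */
static int
trans_record(struct trans_track *t, int seq, int last)
{
	uint32_t bit;

	assert(seq >= 0 && seq < TRANS_MAXMSGS);
	bit = (uint32_t)1 << seq;
	if (t->received & bit)
		return (0);
	t->received |= bit;
	if (last)
		t->total = seq + 1;
	return (1);
}

/* Non-zero once every message up to and including the last has arrived. */
static int
trans_complete(const struct trans_track *t)
{
	uint32_t want;

	if (t->total == 0)
		return (0);
	want = (t->total == TRANS_MAXMSGS) ? UINT32_MAX :
	    (((uint32_t)1 << t->total) - 1);
	return ((t->received & want) == want);
}
```

Because the limit is 32 messages, the whole arrival state fits in a single
32 bit word, and both duplicate detection and the completeness check reduce
to a mask operation.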
        !           183: .Pp
        !           184: A 32 bit transaction space field is used to encode the whole mess.  
        !           185: One bit is used to tag the first message in a transaction, one bit
        !           186: to tag the last message (both bits would be set if the transaction 
        !           187: consists of a single message), one bit indicates which side initiated
        !           188: the transaction, allowing both sides to initiate transactions without
        !           189: creating conflicts or having to negotiate the transaction space,
        !           190: 20 bits implement a unique transaction number that will not be reused for a
        !           191: very long time, allowing route nodes to weed out duplicate packets, and 8 bits
        !           192: are reserved for the sequence number within the transaction (just in case
        !           193: we want to expand the maximum number of messages to 256 in the future).  The
        !           194: remaining bit flags a blocked operation (see BLOCKING TRANSACTIONS).  Note that a portion of the 20 bit
        !           195: unique transaction number is a timestamp.
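.Pp
As an illustration only, the 32 bit transaction field could be packed and
unpacked as sketched below.
The text above gives the field widths but not the bit positions, so every
position and macro name here is an assumption.
The listed widths account for 31 bits; the sketch assumes the remaining
bit is the blocked-operation bit mentioned under BLOCKING TRANSACTIONS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions; only the widths come from the text. */
#define TRANS_FIRST	0x80000000u	/* first message of a transaction */
#define TRANS_LAST	0x40000000u	/* last message of a transaction */
#define TRANS_SIDE	0x20000000u	/* which side initiated */
#define TRANS_BLOCKED	0x10000000u	/* blocked-operation (assumed) */
#define TRANS_FLAGS	(TRANS_FIRST | TRANS_LAST | TRANS_SIDE | TRANS_BLOCKED)
#define TRANS_ID_SHIFT	8
#define TRANS_ID_MASK	0x000FFFFFu	/* 20 bit transaction number */
#define TRANS_SEQ_MASK	0x000000FFu	/* 8 bit sequence number */

/* Combine flags, transaction number and sequence number into one word. */
static uint32_t
trans_pack(uint32_t flags, uint32_t id, uint32_t seq)
{
	assert((flags & ~TRANS_FLAGS) == 0);
	assert(id <= TRANS_ID_MASK && seq <= TRANS_SEQ_MASK);
	return (flags | (id << TRANS_ID_SHIFT) | seq);
}

/* Extract the 20 bit transaction number. */
static uint32_t
trans_id(uint32_t field)
{
	return ((field >> TRANS_ID_SHIFT) & TRANS_ID_MASK);
}

/* Extract the 8 bit sequence number. */
static uint32_t
trans_seq(uint32_t field)
{
	return (field & TRANS_SEQ_MASK);
}

/* Non-zero if the message is a complete single-message transaction. */
static int
trans_is_single(uint32_t field)
{
	return ((field & (TRANS_FIRST | TRANS_LAST)) ==
	    (TRANS_FIRST | TRANS_LAST));
}
```

With this layout a single-message transaction sets both TRANS_FIRST and
TRANS_LAST, and two peers can allocate transaction numbers independently
because TRANS_SIDE keeps their number spaces apart.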
        !           196: .Pp
        !           197: The messages making up a transaction can arrive out of order and will be
        !           198: collected by the target until all messages are present.  The originator
        !           199: must hold onto all messages it sends (so it can re-send if requested by
        !           200: the route node), until it has the complete response.
        !           201: .Pp
        !           202: The route node for a leaf is responsible for weeding out duplicate messages,
        !           203: monitoring transactions, and handling timeouts (returning a retry indication
        !           204: to the leaf). 
        !           205: If the physical sysid becomes invalid the route node is typically responsible
        !           206: for locating a new physical sysid and returning a transaction abort to the
        !           207: leaf.  
        !           208: Even though dynamic rerouting is possible, the route node and 
        !           209: originator have no idea whether the new physical sysid represents the same
        !           210: actual leaf or some different leaf with access to the same logical entity
        !           211: (such as you might find in a SAN environment).  
        !           212: Because of this, changes in the physical id require a transaction abort
        !           213: and full transaction retry.
        !           214: This greatly simplifies operation of the leaf node.
        !           215: .Pp
        !           216: The SYSLINK protocol is not intended to take the place of a reliable link
        !           217: level protocol such as TCP and mesh links should only use UDP when packet
        !           218: delivery can be virtually guaranteed (such as when operating over switched
        !           219: ethernet).  UDP-based syslinks may still buffer multiple messages within 
        !           220: the limitations of the UDP packet.
        !           221: .Pp
        !           222: The SYSLINK protocol is not intended to provide quorum guarantees.  Quorum
        !           223: protocols operate over SYSLINK, but are not implemented by SYSLINK.
        !           224: .Sh SYSLINK PROTOCOL - MESSAGE BUFFERING
        !           225: Syslinks which operate over buffered connections where messages may be
        !           226: sent or received in bulk must adhere to certain alignment and cross-over
        !           227: requirements to allow buffers to be implemented as FIFOs.  The message length
        !           228: field in a syslink message is not particularly aligned, but syslink messages
        !           229: themselves must always be 16-byte aligned, creating small amounts of dead
        !           230: space in the buffer (and the data stream).  Additionally, the physical
        !           231: sysid propagation protocol also propagates a FIFO cross-over size, which is
        !           232: always a power of 2.  Typical values range from 64KB to 1024KB.  Messages
        !           233: received on a stream can be written into a buffer in FIFO fashion.  No single
        !           234: message may straddle the end of the FIFO's physical buffer (that is, cross
        !           235: back over to the beginning).  All transmitters must adhere to the FIFO
        !           236: size supplied in the initial message traffic by generating a PAD message
        !           237: when necessary.  Larger FIFO sizes are usually better since they result
        !           238: in smaller PADs.  I/O transactions containing data are typically broken up
        !           239: into smaller messages not only to accommodate limitations in transport
        !           240: protocols (such as UDP), but also to reduce the dead space created by PADs.
        !           241: On the bright side, these requirements allow very optimal hardware and
        !           242: software buffering of syslink message traffic.
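.Pp
The alignment and cross-over rules above can be sketched in C as follows.
The function names are hypothetical, and the sketch assumes that the
stream offset is always 16-byte aligned and that a PAD message may be as
small as one 16-byte unit.

```c
#include <assert.h>
#include <stddef.h>

#define SYSLINK_ALIGN	16	/* message alignment stated above */

/* Round a message length up to the next 16-byte boundary. */
static size_t
msg_aligned_size(size_t len)
{
	return ((len + SYSLINK_ALIGN - 1) & ~(size_t)(SYSLINK_ALIGN - 1));
}

/*
 * Given the running stream offset and the negotiated FIFO size (a power
 * of 2), return the number of PAD bytes a transmitter must emit before a
 * message of 'len' bytes so that the message does not straddle the end
 * of the FIFO.  Returns 0 when the message fits as-is.
 */
static size_t
fifo_pad_needed(size_t offset, size_t fifo_size, size_t len)
{
	size_t index, room, need;

	assert(fifo_size != 0 && (fifo_size & (fifo_size - 1)) == 0);
	need = msg_aligned_size(len);
	assert(need <= fifo_size);
	index = offset & (fifo_size - 1);	/* position inside the FIFO */
	room = fifo_size - index;		/* bytes left before the wrap */
	return (need > room ? room : 0);
}
```

Requiring a power-of-2 FIFO size lets the position be computed with a
mask instead of a division, and a PAD is needed only when the aligned
message is larger than the room left before the wrap point.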
        !           243: .Sh BLOCKING TRANSACTIONS
        !           244: Certain operations can block.  That is, the target may not be able to 
        !           245: immediately complete the requested transaction.  When a transaction blocks
        !           246: the target is responsible for returning a keep-alive blocking indication
        !           247: to the originator to prevent the originator from retrying or aborting
        !           248: the transaction.  Keep-alives can be directly handled by the route node
        !           249: connected to the target (since it knows if the leaf disconnects),
        !           250: simplifying leaf operation.  A route node will very occasionally do a sanity
        !           251: check request to the leaf (perhaps once a minute) to verify that
        !           252: transactions blocked for a long time are still known to the leaf.
        !           253: .Pp
        !           254: Blocking indications are special response messages that set the
        !           255: blocked-operation bit in the sequence field and do not set the
        !           256: end-transaction bit.
        !           257: .Sh TRANSACTION ABORTS
        !           258: A transaction can be aborted.  Normally aborted transactions still 
        !           259: require an acknowledgement (since the abort may race completion).
        !           260: If the target completes the transaction before receiving the abort
        !           261: request, it is as if the abort never occurred.
        !           262: .Sh ASYNCHRONOUS PUSH TRANSACTIONS
        !           263: Most syslink transactions require an acknowledgement to terminate the
        !           264: transaction.  The acknowledgement is typically a single message in the
        !           265: return direction with both the start and stop bits set.  Multi-message
        !           266: responses are of course possible, such as when the transaction is
        !           267: implementing an I/O read operation.
        !           268: .Pp
        !           269: Certain syslink transactions do not require an acknowledgement and do not
        !           270: implement the retry or timeout protocols.  Such transactions are typically
        !           271: cache-push operations which are used to optimize operation of the cluster
        !           272: by allowing a node to asynchronously push data to places where it thinks
        !           273: it will be needed immediately.  The most common use of this sort of
        !           274: operation is the read-ahead optimization.  When one node performs a read
        !           275: transaction with another node, and the target node is capable of read-ahead
        !           276: and determines that read-ahead is useful, the target node can initiate the
        !           277: read-ahead and push the data to the originating node in a separate
        !           278: asynchronous transaction.  Read-aheads are typically not directly adjacent
        !           279: to the read that just occurred in order to allow the originator to initiate
        !           280: the next synchronous transaction without it crossing paths with the 
        !           281: asynchronous read-ahead push (resulting in the same data being returned to
        !           282: the originator twice).
        !           283: .Sh OPERATING AS A ROUTE NODE
        !           284: Most userland applications using syslink will operate as leaf nodes, but
        !           285: there is nothing preventing you from operating as a route node.  Operating
        !           286: as a route node requires implementing all route node requirements including
        !           287: the handling of logical sysid registrations and the tracking of transactions
        !           288: initiated by nodes that directly connect to you.  In fact, sysid seeding
        !           289: nodes are user processes which operate as degenerate route nodes.
        !           290: .Sh RETURN VALUES
        !           291: The value -1 is returned if an error occurs.
        !           292: The external variable
        !           293: .Va errno
        !           294: indicates the cause of the error.
        !           295: If a descriptor is supplied and the system call is successful, 0 is
        !           296: returned.  If a descriptor is not supplied and the system call is successful,
        !           297: a descriptor is returned representing a direct connection to the mesh's 
        !           298: route node.
        !           299: .Sh SEE ALSO
        !           300: .Sh HISTORY
        !           301: The
        !           302: .Fn syslink
        !           303: function first appeared in
        !           304: .Dx 1.9 .