Nexedi Enterprise Objects (NEO) Specification

  Introduction

    Nexedi Enterprise Objects (NEO) is a storage system for Zope
    Object Database (ZODB). NEO is a novel technology in that it provides
    an extremely robust and scalable storage service which may not be found
    in existing solutions.

    NEO provides the following features for robustness and scalability:

      - distributed storage nodes

      - redundant copies

      - backup master nodes

      - automatic recovery

      - dynamic storage allocations

      - fail-safe protocol

    For now, this specification corresponds to ZODB version 3.2, which is
    bundled with Zope version 2.7.

    This document describes how NEO works, from the protocol level to the software level.
    This specification is version 3.0, last updated on 2006-07-08.

  Components

    Overview

      NEO consists of three components: master nodes, storage nodes and client nodes.
      Here, node means a software component which runs on a computer. These nodes can
      run in the same computer or different computers in a network. A node communicates
      with another node via the TCP protocol.

      Master nodes are responsible for coordinating the whole cluster. They manage
      information on nodes in the same cluster, and direct states of other nodes to
      implement the semantics of ZODB transactions. At any given time, only a single master
      node is current; the other master nodes are secondary.

      Storage nodes maintain raw data. They store data committed by each transaction,
      and hold redundant copies of ZODB objects.

      Client nodes are usually Zope servers, which act as clients from the viewpoint of NEO.
      They interact with the current master node and storage nodes to perform transactions
      and obtain data.

    Master Node

      Configuration

        Each master node must have a configuration which lists all master nodes.
        Also, the configuration must specify how many copies of objects should be made in the cluster.

      Database

        Each master node must have a database. The database must record the list of known master
        nodes, the list of known storage nodes, the list of known client nodes, the mapping
        between transaction IDs and storage nodes which hold corresponding transactions, and the
        mapping between object IDs and transaction IDs. This database must be reinitialized when it
        turns out to be out of date, except for the list of storage nodes. The list of storage nodes
        must be persistent so that the master node can examine the necessity of recovery.

      Roles

        Each master node must be one of "Primary", "Secondary" and "Unknown". At most one
        master node in a cluster is "Primary", and responsible for managing the whole cluster.
        "Secondary" master nodes must obey the "Primary" master node, as long as it is working
        correctly. If the "Primary" master node is disconnected or broken, one of the "Secondary"
        master nodes must take over the "Primary" role.

      Sorting Order

        The master nodes have a defined sorting order. The sorting algorithm is defined by comparing
        the listening IP addresses and the listening port numbers. Each IP address is converted into
        network byte order. If two nodes have different IP addresses, the node with the lower IP address
        is smaller. If two nodes have the same IP address but different port numbers, the node with the
        lower port number is smaller.
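
        A minimal Python sketch of this ordering, assuming each master node is
        represented as an (ip, port) tuple with a dotted-quad IPv4 string. The helper
        name master_sort_key is hypothetical and not part of NEO. The standard
        socket.inet_aton call yields the 4 address bytes in network byte order, so
        comparing the byte strings compares the addresses numerically.

```python
import socket

def master_sort_key(node):
    """Return a sort key for a master node given as (ip, port)."""
    ip, port = node
    # inet_aton packs a dotted-quad IPv4 address into 4 bytes in network
    # byte order; byte strings compare lexicographically, which matches
    # the numeric order of the addresses.
    return (socket.inet_aton(ip), port)

masters = [
    ("10.0.0.2", 10000),
    ("10.0.0.1", 10001),
    ("10.0.0.1", 10000),
    ("9.255.0.1", 20000),
]
# Smallest node first: lower IP address, then lower port number.
ordered = sorted(masters, key=master_sort_key)
```

        Comparing the packed bytes rather than the address strings matters here:
        as plain text, "10.0.0.1" would sort before "9.255.0.1".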

      States

        Bootstrap State

          Initially, each master node has the role "Unknown". In this state, the master node
          must try to connect to all other known master nodes and accept connections from other
          master nodes. If a smaller node gets a connection from a larger node in the sorting order,
          the smaller node must connect to the larger node and drop the connection from the larger
          node, in order to reduce the number of connections.

          Each node must wait until it establishes connections with all known master nodes. Once
          this has been done, each node must advertise what master nodes are available. If a node
          gets a different answer from another node, that node must report an error and exit.
          Otherwise, it must move to the discussion state.

        Discussion State

          First, a master node must ask if any other node is "Primary". If so,
          it must query which node is "Primary" and move to the recovery state.

          Then, each master node must send the latest transaction ID it holds in its database,
          and the nodes compare which one has the newest transaction ID. The node which has the
          newest transaction ID must be selected as "Primary". If multiple nodes hold the same
          newest transaction ID, the smallest node in the sorting order must be "Primary".

          If a node becomes "Primary", it must move to the verification state.

          If a node becomes "Secondary", it must move to the recovery state.
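
          The selection rule can be sketched as follows. This is an illustration
          only: the function name elect_primary and the (ip, port, last_tid) tuple
          layout are assumptions, and "newest" is assumed to mean the largest TID
          when the 8-byte values are compared as byte strings.

```python
import socket

def elect_primary(candidates):
    """Pick the "Primary" among (ip, port, last_tid) tuples.

    last_tid is an 8-byte string. Byte strings of equal length compare in
    the same order as the big-endian integers they encode.
    """
    newest = max(tid for _ip, _port, tid in candidates)
    tied = [c for c in candidates if c[2] == newest]
    # Tie-break: the smallest node in the master sorting order wins
    # (lower IP address first, then lower port number).
    return min(tied, key=lambda c: (socket.inet_aton(c[0]), c[1]))
```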

        Verification State

          A master node must verify that its database is up to date. The node verifies the
          database by querying each storage node for its latest transaction ID and its number
          of transaction IDs. For this, it is necessary to wait for all known
          storage nodes which are not broken or down to connect to the master node.

          If a storage node is not connected, the node must be marked as "Temporarily Down".
          If a storage node is temporarily down for over 10 minutes, the node must be re-marked
          as "Down", and the master node must stop waiting for it.

          If any storage node disagrees, the node must move to the reconstruction state.
          Otherwise, the node must move to the ready state.

        Reconstruction State

          The master node must reinitialize the database, except for the list of nodes.
          Then, it must ask each storage node to send object IDs and transaction IDs it
          has.

          After updating the database, the master node must examine whether there is any
          transaction or object which does not have the required number of replicas in
          storage nodes. If there is, the master node must initiate a replication process.
          Replication is described in detail below.

          Once this is finished, the node must move to the ready state.
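
          The examination for missing replicas can be sketched as below, assuming
          the reconstructed database is available as a plain mapping from TID to
          the storage nodes holding that transaction. The function name and the
          data layout are hypothetical; this specification does not prescribe them.

```python
def under_replicated(tid_to_nodes, required_copies, unusable=()):
    """Return the TIDs which have fewer usable replicas than required.

    tid_to_nodes maps a TID to an iterable of storage node identifiers.
    unusable lists nodes marked "Down" or "Broken"; they are not counted.
    """
    unusable = set(unusable)
    return sorted(
        tid
        for tid, nodes in tid_to_nodes.items()
        if len(set(nodes) - unusable) < required_copies
    )
```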

        Recovery State

          The master node must wait for the "Primary" master node to be in the ready
          or working state.

          Once the "Primary" master node becomes working, the "Secondary" master node
          must verify if its database is up-to-date with the same algorithm described
          in the verification state.

          If it is not up-to-date, the "Secondary" master node must request all information
          the "Primary" master node holds, and update its database.

          The node must move to the backup state after this.

        Ready State

          The master node must wait for all "Secondary" master nodes to be in the backup
          state.

          Once this is finished, the node must move to the working state.

        Working State

          When entering the working state, the master node must check the list of known storage
          nodes for nodes which are broken or down. If there are any, the master node must
          initiate a replication process appropriately.

          In this state, all protocol commands are supported, and client nodes must be able
          to function normally.

          If a "Secondary" master node is available, the "Primary" master node must send
          all updating messages to it, when any data in the "Primary" database changes.

          If a storage node appears in this state, the master node must interrupt any replication
          process, and query all data from the storage node. Once the information is transferred
          to the master node, the master node must re-examine the database for interrupted
          replication processes, and re-issue replication processes, if necessary.

          If a storage node disappears, it must be marked as "Temporarily Down". If
          this persists for over 10 minutes, it must be re-marked as "Down", and the master node
          must initiate a replication process. Also, the master node must report
          an error and continue the operation.

          If a storage node gets broken, it must be marked as "Broken". The master node
          must initiate a replication process, and report an error, but it must continue
          the operation.

        Backup State

          The master node must listen to the "Primary" node, and synchronize the database
          with the "Primary" database. The "Secondary" master node is not allowed to
          change the database contents without any command from the "Primary" master node.

          If a storage node or a client node reports that the "Primary" node is not
          available or broken, the master node must issue a re-discussion to all other
          "Secondary" master nodes. Then, it must move back to the discussion state.

          If the node gets a message that the "Primary" node is malfunctioning, the node
          must drop the connection with the "Primary" node, and move back to
          the discussion state.

        Shutdown State

          This is a special state to shut down the whole cluster. The transition to this
          state is only possible by a manual instruction. Once this instruction is issued,
          the node must request shutdown from all nodes.

          When a node gets this message, it must enter the shutdown state. In this state,
          client nodes are not allowed to start any new transaction, and must stop as soon as
          possible. Once all ongoing transactions are finished, the "Primary" master node
          must direct all nodes to shut down immediately, and the "Primary" master node
          itself must exit after all the other nodes are down.
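
          The master node state transitions described in this section can be
          summarized as a small table. This is a reading aid, not a normative
          definition: the lowercase state labels and the can_move helper are
          assumptions of this sketch, and it assumes that a manual shutdown
          instruction may arrive in any state.

```python
# Allowed transitions between master node states, as read from the state
# descriptions in this section.
TRANSITIONS = {
    "bootstrap": {"discussion"},
    "discussion": {"verification", "recovery"},   # "Primary" / "Secondary"
    "verification": {"reconstruction", "ready"},
    "reconstruction": {"ready"},
    "recovery": {"backup"},
    "ready": {"working"},
    "working": set(),
    "backup": {"discussion"},                      # "Primary" lost or broken
    "shutdown": set(),
}

def can_move(current, target, manual=False):
    """Return True if a master node may move from current to target."""
    if target == "shutdown":
        # The shutdown state is reachable only by a manual instruction.
        return manual and current != "shutdown"
    return target in TRANSITIONS[current]
```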

    Storage Node

      Configuration

        The storage node may find master nodes by broadcasting. Alternatively, it
        may specify master nodes by a configuration.

      Database

        The storage node must have a persistent database to hold transactions and
        objects.

      States

        Bootstrap State

          The storage node must generate a Universally Unique ID (UUID), if not
          present yet, so that master nodes can identify the storage node.

          The storage node must connect to one master node, and wait until a
          "Primary" master node is selected. Once it is selected, the storage
          node must make sure that it has a connection to the "Primary" master
          node.

          Then, the storage node must move to the working state.

        Working State

          The storage node must interact with master nodes, client nodes and
          other storage nodes. This is mainly retrieving and storing data.

        Shutdown State

          The storage node must shut down immediately when the "Primary"
          master node asks it to.

    Client Node

      Configuration

        The client node may find master nodes by broadcasting. Alternatively, it
        may specify master nodes by a configuration.

      Database

        The client node may have a volatile database to cache information.
        The database must be cleared out, when connecting to the "Primary"
        master node.

      States

        Bootstrap State

          The client node must connect to one master node, and wait until a
          "Primary" master node is selected. Once it is selected, the client
          node must make sure that it has a connection to the "Primary" master
          node.

          Then, the client node must move to the working state.

        Working State

          The client node must interact with master nodes and storage nodes.
          This is mainly committing and aborting transactions. Also, the client
          node must detect a broken storage node.

          If a "Primary" master node is not available, the client node must
          connect to another master node, and notify that the "Primary" master
          node is not available. Then, it must move back to the bootstrap state.

          If no master node is available, the client node must report an error
          and exit.

        Shutdown State

          The client node must not issue any new transaction. It must shut down
          immediately once all ongoing transactions are finished.

  Operations

    Transactions

      Overview

        Transactions must be serialized with a lock in the commit phase to ensure the
        atomicity and the integrity of every transaction. Thus only a single transaction
        commit is allowed at a time for ZODB. For Zope, however, multiple transactions may run
        at a time, until they reach the commit phase.

        ZODB's transaction commit follows this order:

          1. Begin a commit (tpc_begin).

          2. Vote for a commit (tpc_vote).

          3. Finish or abort a commit (tpc_finish or tpc_abort).

        At tpc_begin, a new transaction begins. At tpc_vote, object data is sent to
        storages. Finally, at tpc_finish, the transaction is finished. If anything
        goes wrong in tpc_vote, tpc_abort may be called to abort the transaction.

        In NEO, a transaction commit is realized in this way:

          1. A client node calls tpc_begin, and demands a new transaction ID from the "Primary"
            master node. The "Primary" master node must send back storage node IDs to the
            client node so that the client node will store objects in the set of specified
            storage nodes.

          2. The client node calls tpc_vote, and sends object data to the specified storage
            nodes, and meta information to the "Primary" master node.

          3. The client node calls tpc_finish, and confirms with the "Primary" master node
            that the transaction is complete. The "Primary" master node then must contact
            storage nodes to ensure that the data has been stored, and must ask all client
            nodes but the sending client node to invalidate their cached items, because
            the data has been updated and keeping older data would cause conflicts.

          4. Instead, the client node may call tpc_abort. In this case, the client node
            must tell the "Primary" master node to cancel the transaction. The "Primary"
            master node then must contact storage nodes to drop the object data.
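
        The four steps can be walked through with in-memory stand-ins. This is a
        simulation under stated assumptions, not NEO code: FakeStorage, FakePrimary
        and commit are hypothetical names, there is no networking, and the cache
        invalidation sent to other client nodes is left out.

```python
class FakeStorage:
    """Hypothetical in-memory stand-in for a storage node."""
    def __init__(self):
        self.pending = {}     # tid -> {oid: data}, not finished yet
        self.committed = {}   # tid -> {oid: data}

    def store(self, tid, oid, data):
        self.pending.setdefault(tid, {})[oid] = data

    def finish(self, tid):
        self.committed[tid] = self.pending.pop(tid)

    def drop(self, tid):
        self.pending.pop(tid, None)


class FakePrimary:
    """Hypothetical in-memory stand-in for a "Primary" master node."""
    def __init__(self, storages):
        self.storages = storages
        self.last_tid = 0
        self.meta = {}        # tid -> list of OIDs, recorded at vote time

    def begin(self):
        self.last_tid += 1
        return self.last_tid, list(self.storages)

    def vote(self, tid, oids):
        self.meta[tid] = list(oids)

    def finish(self, tid):
        for sn in self.storages:
            sn.finish(tid)

    def abort(self, tid):
        self.meta.pop(tid, None)
        for sn in self.storages:
            sn.drop(tid)


def commit(primary, objects, fail_in_vote=False):
    """Run one commit; return the new TID, or None if it was aborted."""
    tid, storages = primary.begin()            # step 1: tpc_begin
    try:
        for sn in storages:                    # step 2: tpc_vote
            for oid, data in objects.items():
                sn.store(tid, oid, data)
        if fail_in_vote:
            raise RuntimeError("simulated failure during tpc_vote")
        primary.vote(tid, objects.keys())
    except RuntimeError:
        primary.abort(tid)                     # step 4: tpc_abort
        return None
    primary.finish(tid)                        # step 3: tpc_finish
    return tid
```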

        The above describes the normal situation where everything goes well. It is possible, however,
        that one or more errors happen in a commit phase, on the client side, the master side
        or the storage side. Thus the following situations must be handled with care.

      Error Cases

        In the following cases, abbreviations are used for simplicity: "CN" for a client node,
        "PMN" for a "Primary" master node, "SMN" for a "Secondary" master node, and "SN" for
        a storage node.

        CN calls tpc_begin, then PMN does not reply

          In this case, CN must stop the transaction, and notify SMN that PMN is broken.

        PMN replies to tpc_begin, but CN does not go ahead

          In theory, tpc_vote may take a lot of time when committing a lot of data. However,
          PMN must take care of not stopping the whole cluster, as a transaction commit is
          an exclusive operation. Therefore, in reality, a lock must not be obtained in tpc_begin.
          At the beginning, PMN must only generate a new transaction ID.

          Nevertheless, if CN does not issue tpc_vote to PMN, garbage can remain in SN,
          because CN may have sent data to SN already. Thus SN must implement a garbage collection
          which may expire pending transactional data.

        CN calls tpc_vote, but SN does not reply or is buggy

          CN must not wait for each SN too long. If CN cannot connect to an SN, CN must assume
          that the SN is down, and must report it to PMN. If CN connected to an SN but the SN
          does not reply or the reply indicates an error, CN must report to PMN that the SN is
          broken, so that PMN may abandon the SN. This requires a timeout, as SN may never send
          a reply. The timeout should be at least 10 seconds, because an operating
          system typically flushes its cache every 5 seconds, and this disk activity may delay
          the operation in SN significantly.

          If all SNs specified by PMN fail, CN may ask PMN to send a new set of storage nodes.
          As it is quite rare that all fail, this feature is not obligatory.

          If all SNs specified by PMN, including newly acquired SNs, fail, CN must stop the
          transaction.

        CN calls tpc_vote, SN accepts data, but PMN does not reply

          At this stage, PMN really needs to obtain a lock, thus PMN may spend some time
          waiting for other transactions to finish. However, PMN should not need much time
          for this, as the lock is held only while PMN writes small meta information to its
          database and sends notifications to other nodes.

        PMN replies to tpc_vote, but CN does not go ahead

          Because CN may vote for other storage adapters as well, this may need some time.
          However, as the lock is held already, PMN must not stop other clients too long.
          PMN must abort the transaction with a proper timeout.

        CN calls tpc_finish, but PMN does not reply

          CN must drop the connection to PMN, and report it to SMN. When CN reconnects to
          a new PMN, the cache is invalidated, thus CN does not have to invalidate the
          cached data for the unfinished transaction explicitly.

        PMN asks SN to finish a transaction, but SN does not reply

          PMN must send an error message to CN, so that CN can invalidate the cached data.
          CN cannot do anything else, because ZODB does not have a facility to report an
          error at tpc_finish appropriately.

        PMN asks CN to invalidate cache, but CN is down

          PMN must ignore this error, as the CN will invalidate cache when reconnecting.

        CN or PMN may ask data when a transaction is in an intermediate state

          SN must take care that it does not send meta information to PMN for a transaction
          which is not finished. SN should remove pending data when PMN asks for meta information,
          because PMN does that only when all transactions are aborted.

          PMN must take care that it does not send meta information to CN for a transaction
          which is not finished. PMN should not write meta information to the database for a
          transaction which is not finished.

    Replications

      Overview

        Replications are used to make redundant copies of objects among a cluster.
        This guarantees that a single node failure (or even more, depending on a
        configuration) does not cause a data loss, and helps to distribute the load
        of reading data.

        When a client node commits a transaction, the client node itself makes
        redundant copies by contacting multiple storage nodes. In the normal mode of
        the cluster management, a "Primary" master node is involved only with assigning
        storage nodes to every transaction.

        However, if a storage node is down or broken, the "Primary" master node must
        undergo a recovery by making a new replication in another storage node. In
        this case, the "Primary" master node must check all data which a non-functional
        storage node had, and determine a new storage node for a given transaction.

        When the cluster no longer has a sufficient number of storage nodes,
        the "Primary" master node cannot proceed with a replication. In this case, the only
        solution is to add a new storage node to the cluster, or to reduce the required
        number of replicas. If a new storage node is connected, the "Primary" master
        node must check if the storage node already contains data, and update its
        database, if necessary, because the storage node may hold a replica that was lost.

        Another case is when meta information is reconstructed by collecting data from
        all storage nodes. A "Primary" master node must undertake a complete examination
        of the database to obtain a list of transactions which have too few replicas.

      Considerations

        As a replication only consists of storage nodes exchanging data under the management
        of a "Primary" master node, this specification does not define how a "Primary"
        master node must implement a replication policy.

        However, an implementation should take care that replications should not affect
        the system performance too much, as replications are not always urgent, in comparison
        with transactions. For instance, replications should be divided into smaller
        parts. Replications should be scheduled appropriately, so that they run only when
        the system load of a node is low.

        When a replication process is finished, a "Primary" master node should remove
        a broken or down storage node from the database, if the node is not referred to
        any longer. Otherwise, the database may grow over time and gradually slow down
        replications.

  Protocol

    Format

      NEO sends messages among cluster nodes via TCP connections. Each message has
      the following header::

        +----+------+--------+----------+----
        | ID | Type | Length | Reserved | ...
        +----+------+--------+----------+----
        0    2      4        8          10

      ID -- 2-byte unsigned integer which distinguishes a message.

      Type -- 2-byte unsigned integer which specifies a message type.

      Length -- 4-byte unsigned integer which specifies the length of this
                message, including the header.

      Reserved -- 2-byte unsigned integer for future expansion. This must be
                  set to zero in this version.
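
      The header maps directly onto Python's struct module. The sketch below is
      illustrative; the names HEADER, pack_message and unpack_header are
      assumptions of this example. The "!" prefix selects network byte order
      without padding, so the header is exactly 10 bytes.

```python
import struct

# ID (2 bytes), Type (2), Length (4), Reserved (2), network byte order.
HEADER = struct.Struct("!HHIH")

def pack_message(msg_id, msg_type, body=b""):
    """Build a complete message; Length covers the header and the body."""
    length = HEADER.size + len(body)
    return HEADER.pack(msg_id, msg_type, length, 0) + body

def unpack_header(data):
    """Return (ID, Type, Length) from the first 10 bytes of data."""
    msg_id, msg_type, length, reserved = HEADER.unpack_from(data)
    if reserved != 0:
        raise ValueError("Reserved must be zero in this version")
    return msg_id, msg_type, length
```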

      When a message requires a reply, the ID of the reply message must be identical
      with the ID of the original request message. In addition, the type of a reply
      must be identical with the type of a request with the 15th bit set.
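
      Since the message types listed later pair 0003 with 8003, 0004 with 8004
      and so on, the reply rule amounts to OR-ing the request type with 0x8000.
      A small sketch, with hypothetical helper names:

```python
REPLY_BIT = 0x8000  # the 15th bit of the 2-byte Type field

def reply_type(request_type):
    """Return the Type value of the reply to the given request type."""
    return request_type | REPLY_BIT

def is_reply(msg_type):
    """Return True if the Type value denotes a reply message."""
    return bool(msg_type & REPLY_BIT)

def is_reply_to(request, reply):
    """Check a reply against a request; both are (ID, Type) tuples."""
    return reply[0] == request[0] and reply[1] == reply_type(request[1])
```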

      Most messages carry additional data in the packet, following the header.
      The Length field of a header specifies the size of a message in bytes,
      including the header itself. Thus a receiver of a packet knows the size
      before reading the whole packet.
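
      A receiver can therefore read the fixed 10-byte header first and then
      exactly Length - 10 further bytes. The sketch below assumes a blocking
      file-like stream with a read method (for a socket, sock.makefile("rb")
      provides one); read_exact and read_message are hypothetical names.

```python
import io
import struct

HEADER = struct.Struct("!HHIH")  # ID, Type, Length, Reserved

def read_exact(stream, n):
    """Read exactly n bytes, or raise EOFError if the stream ends early."""
    chunks = []
    while n > 0:
        chunk = stream.read(n)
        if not chunk:
            raise EOFError("connection closed in the middle of a packet")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def read_message(stream):
    """Read one message and return (ID, Type, body)."""
    header = read_exact(stream, HEADER.size)
    msg_id, msg_type, length, _reserved = HEADER.unpack(header)
    if length < HEADER.size:
        raise ValueError("Length is smaller than the header itself")
    body = read_exact(stream, length - HEADER.size)
    return msg_id, msg_type, body

# Two messages back to back: one with a 2-byte body, one with no body.
stream = io.BytesIO(
    HEADER.pack(1, 0x0003, 12, 0) + b"\x00\x05" + HEADER.pack(2, 0x0005, 10, 0)
)
first = read_message(stream)
second = read_message(stream)
```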

      All integer values in packets are always encoded in the network byte order
      for portability.

      When the number of parameters or the size of a parameter is variable, the
      number is specified before the parameters.

    Message Classes

      Messages are classified into asynchronous messages and synchronous messages.
      Asynchronous messages are request-only messages which do not require any reply,
      while synchronous messages require replies, one reply to one request.

      Asynchronous messages are used to notify status changes in the cluster. They
      include additions and deletions of nodes.

      Synchronous messages are used to exchange data between nodes. A sender never
      sends multiple synchronous request messages before getting a reply. Note that,
      however, a receiver may get multiple synchronous requests from a single connection,
      because such a connection may be used by multiple threads.

    Common Parameters

      Messages use the same types of parameters frequently. Here are the definitions of
      those types:

        OID -- Object ID. It is an 8-byte array used to identify an object, regardless
               of transactions.

        TID -- Transaction ID. It is an 8-byte array used to identify a transaction.

        Serial -- Serial number of an object, which identifies a certain generation
                  of an object. In NEO, this is identical to TID, because NEO does
                  not support versioning.

        UUID -- Universally Unique ID. It is a 16-byte array used to identify a node.

        NID -- Node ID. It is a 4-byte unsigned integer used to identify a node. A "Primary"
               master node maps a UUID to an NID for efficiency.

        IP address -- For now, NEO only supports IPv4. So the address is a 4-byte array.

        Port Number -- TCP's port number. It is a 2-byte unsigned integer.

        INVALID_OID -- Invalid OID. It indicates that an OID is invalid. It is
                       '\xff\xff\xff\xff\xff\xff\xff\xff'.

        INVALID_TID -- Invalid TID. It indicates that a TID is invalid. It is
                       '\xff\xff\xff\xff\xff\xff\xff\xff'.

        INVALID_SERIAL -- Invalid Serial Number. It indicates that a Serial Number is invalid.
                          It is '\xff\xff\xff\xff\xff\xff\xff\xff'.
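
      These parameter types can be expressed as Python constants and small
      packing helpers. The helper names below are hypothetical; only the sizes,
      the byte order and the invalid values come from the definitions above.

```python
import socket
import struct

INVALID_OID = b"\xff" * 8
INVALID_TID = b"\xff" * 8
INVALID_SERIAL = b"\xff" * 8

def pack_nid(nid):
    """NID: 4-byte unsigned integer in network byte order."""
    return struct.pack("!I", nid)

def pack_address(ip, port):
    """IPv4 address (4 bytes) followed by a 2-byte port number."""
    return socket.inet_aton(ip) + struct.pack("!H", port)

def is_valid_oid(oid):
    """An OID is usable if it is 8 bytes long and not INVALID_OID."""
    return len(oid) == 8 and oid != INVALID_OID
```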

    Error Messages

      A synchronous message allows a receiver to return an error message when an error occurs.
      An error message must specify the same ID as the request, and the same Message Type but with
      the 15th bit set, like any other return message. The Success code indicates that
      the message is not an error message but a normal return message. In this case, the message
      format is documented in each return message description.

      Type -- Vary

      Sender -- MN, SN

      Receiver -- CN, MN, SN

      Class -- Synchronous

      Format::

          +------------+----------------------+---------------+
          | Error Code | Error Message Length | Error Message |
          +------------+----------------------+---------------+
          10           12                     16              16+n

      Error Code is a 2-byte unsigned integer, and it is one of these:

        0 -- Success. The request is successfully completed. In this case, neither
             Error Message Length nor Error Message follows the Error Code.

        1 -- Not Ready. The node is not ready to accept a given request yet.

        2 -- OID Not Found. A given OID is not found in the database.

        3 -- Serial Number Not Found. A given Serial Number is not found in the database.

        4 -- TID Not Found. A given TID is not found in the database.

        5 -- Disk Full. The node cannot store data due to too little disk space.

        6 -- Conflict Found. A given transaction may not be committed because of a conflict.

        7 -- Inconsistent Configuration. A master node does not have an identical list of
             master nodes with another, or a foreign cluster node is connected.

        8 -- Protocol Version Mismatch. Nodes do not talk in the same protocol.

        9 -- Protocol Error. A node does not follow the protocol.

        10 -- Timeout. A node may not wait for too long.

        11 -- Broken Node Disallowed. A node is known to be broken, so it is not allowed to connect.

        12 -- Internal Error. A node is corrupted in software or hardware internally.

      Error Message Length specifies the length of Error Message, including the trailing NUL character.

      Error Message is a human-readable string which describes an error. The string is terminated with
      a NUL character.
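
      A sketch of encoding and decoding the error part of a return message.
      The offsets in the diagram imply a 4-byte Error Message Length. The ASCII
      encoding and the function names are assumptions of this example; the
      specification does not name a character encoding.

```python
import struct

def pack_error(code, message=""):
    """Encode the error part of a message body (what follows the header)."""
    if code == 0:
        # Success: neither the length nor the message follows the code.
        return struct.pack("!H", 0)
    text = message.encode("ascii") + b"\x00"   # NUL-terminated
    return struct.pack("!HI", code, len(text)) + text

def unpack_error(body):
    """Return (code, message) decoded from the start of a message body."""
    (code,) = struct.unpack_from("!H", body, 0)
    if code == 0:
        return 0, ""
    (length,) = struct.unpack_from("!I", body, 2)
    text = body[6:6 + length]
    return code, text.rstrip(b"\x00").decode("ascii")
```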

    Message Types

      Each message type defines what nodes can send it to what nodes. Node types
      are abbreviated to CN, SN, MN, PMN and SMN for client node, storage node,
      master node, primary master node and secondary master node, respectively.

      Each message type is shown as a hexadecimal value.

      Get New OIDs

        Type -- 0003

        Sender -- CN

        Receiver -- PMN

        Class -- Synchronous

        Format::

          +---+
          | n |
          +---+
          10  12

        Request new OIDs to assign to objects. 'n' is a 2-byte unsigned integer
        which specifies the number of requested OIDs.

      Return New OIDs

        Type -- 8003

        Sender -- PMN

        Receiver -- CN

        Class -- Synchronous

        Format::

          +------------+---+-------+-----+-------+
          | Error Code | n | OID 1 | ... | OID n |
          +------------+---+-------+-----+-------+
          10           12  14      22    6+n*8   14+n*8

        Return new OIDs. 'n' is a 2-byte unsigned integer, specifying the number
        of returned OIDs.
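
        A sketch of the bodies of this request and its reply (the part after the
        10-byte header). Function names are hypothetical. Since each OID is 8
        bytes, a successful reply body is 4 + 8*n bytes long.

```python
import struct

def pack_get_new_oids(n):
    """Request body: the number of requested OIDs as a 2-byte integer."""
    return struct.pack("!H", n)

def pack_return_new_oids(oids):
    """Successful reply body: Error Code 0, n, then n 8-byte OIDs."""
    if any(len(oid) != 8 for oid in oids):
        raise ValueError("every OID must be exactly 8 bytes")
    return struct.pack("!HH", 0, len(oids)) + b"".join(oids)

def unpack_return_new_oids(body):
    """Return the list of OIDs from a successful reply body."""
    (code,) = struct.unpack_from("!H", body, 0)
    if code != 0:
        raise ValueError("error reply, code %d" % code)
    (n,) = struct.unpack_from("!H", body, 2)
    return [body[4 + 8 * i:12 + 8 * i] for i in range(n)]
```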

      Request Node Identification

        Type -- 0004

        Sender -- CN, SN, MN

        Receiver -- MN, SN

        Class -- Synchronous

        Format::

          +---------------+---------------+-----------+------+------------+-------------+
          | Major Version | Minor Version | Node Type | UUID | IP Address | Port Number |
          +---------------+---------------+-----------+------+------------+-------------+
          10              14              18          20     36           40            42

        Every client node must issue this request before any other request,
        to identify itself.

        Major Version and Minor Version must specify the protocol
        versions, and each parameter is a 4-byte unsigned integer. In this version
        Major Version must be '3', and Minor Version must be '0'.

        Node Type is a 2-byte unsigned integer, and it must be one of these:

          1 -- Master Node

          2 -- Storage Node

          3 -- Client Node

        UUID is unique to each node, and this may be zero for CN,
        because the UUID will not be used for any purpose. For MN, the UUID is
        a cluster UUID, not an MN UUID. The difference is that a cluster UUID is
        shared by all nodes joining the same cluster.

        IP Address and Port Number are a listening address and a port to accept connections.
        They must be zero for CN.
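
        The 32-byte body of this request can be packed with a single struct
        format. The names NODE_ID, pack_node_identification and
        unpack_node_identification are assumptions of this sketch; the defaults
        correspond to a client node, which sends a zero UUID, address and port.

```python
import socket
import struct

MASTER, STORAGE, CLIENT = 1, 2, 3

# Major (4), Minor (4), Node Type (2), UUID (16), IP (4), Port (2).
NODE_ID = struct.Struct("!IIH16s4sH")

def pack_node_identification(node_type, uuid=b"\x00" * 16,
                             ip="0.0.0.0", port=0):
    """Build the request body for protocol version 3.0."""
    return NODE_ID.pack(3, 0, node_type, uuid, socket.inet_aton(ip), port)

def unpack_node_identification(body):
    """Return (major, minor, node_type, uuid, ip, port)."""
    major, minor, node_type, uuid, ip, port = NODE_ID.unpack(body)
    return major, minor, node_type, uuid, socket.inet_ntoa(ip), port
```

        A storage node would pass its own 16-byte UUID and listening address,
        for example pack_node_identification(STORAGE, uuid, "192.168.0.5", 10020).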

      Accept Node Identification

        Type -- 8004

        Sender -- MN, SN

        Receiver -- CN, SN, MN

        Class -- Synchronous

        Format::

          +------------+------+
          | Error Code | UUID |
          +------------+------+
          10           12     28

        Accept a connection. This returns a cluster UUID which identifies the cluster.

      Get Database Information

        Type -- 0005

        Sender -- SN, CN

        Receiver -- PMN

        Class -- Synchronous

        Format::

          +
          |
          +
          10

        Request database information.

      Return Database Information

        Type -- 8005

        Sender -- PMN

        Receiver -- SN, CN

        Class -- Synchronous

        Format::

          +------------+-----------------+--------------+--------------------------+
          | Error Code | Version Support | Undo Support | Transaction Undo Support |
          +------------+-----------------+--------------+--------------------------+
          10           12                13             14                         15

          +-----------+-------------+------+------------------+-----------+
          | Read Only | Name Length | Name | Extension Length | Extension |
          +-----------+-------------+------+------------------+-----------+
          15          16            18     18+n               20+n        20+n+m

        Version Support, Undo Support, Transaction Undo Support and Read Only are 1-byte
        boolean parameters. They must be either 1 or 0.

        Name Length and Extension Length are 2-byte unsigned integers, and they specify
        the lengths of Name and Extension, respectively, including trailing NUL characters.

        Name and Extension are human-readable, NUL-terminated strings.

        In this version, Version Support, Undo Support and Transaction Undo Support must be
        always 0, 1 and 1, respectively. Name must be "NEO" and Extension must be empty.
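
        The layout above could be decoded along these lines. This is a sketch under
        assumptions: big-endian byte order, an unsigned Error Code, ASCII text in Name and
        Extension, and an opaque 10-byte header. The helper names are hypothetical.

```python
import struct

HEADER_SIZE = 10  # message bodies start at offset 10 in the diagrams


def _read_string(packet, pos):
    """Read a 2-byte length and a NUL-terminated string; return (text, next_pos)."""
    (length,) = struct.unpack_from(">H", packet, pos)
    raw = packet[pos + 2:pos + 2 + length]
    # The length includes the trailing NUL, so a valid field is never empty.
    if len(raw) != length or not raw.endswith(b"\0"):
        raise ValueError("malformed NUL-terminated string")
    return raw[:-1].decode("ascii"), pos + 2 + length


def parse_database_information(packet):
    """Decode a Return Database Information packet into a dict."""
    error_code, *flags = struct.unpack_from(">H4B", packet, HEADER_SIZE)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("boolean parameters must be either 1 or 0")
    version, undo, transaction_undo, read_only = map(bool, flags)
    name, pos = _read_string(packet, HEADER_SIZE + 6)  # Name Length is at offset 16
    extension, _ = _read_string(packet, pos)
    return {
        "error_code": error_code,
        "version_support": version,
        "undo_support": undo,
        "transaction_undo_support": transaction_undo,
        "read_only": read_only,
        "name": name,
        "extension": extension,
    }
```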

      Get Transaction Information

        Type -- 0006

        Sender -- CN

        Receiver -- PMN

        Class -- Synchronous

        Format::

          +-----+-----+---+
          | OID | TID | n |
          +-----+-----+---+
          10    18    26  30

        Request transaction information. The first transaction is specified by TID.
        If TID is INVALID_TID, the last transaction is selected. 'n' is a 4-byte
        unsigned integer which specifies the maximum number of transactions to return,
        starting from TID or the last transaction. Transactions are ordered by
        descending transaction ID, namely, a larger transaction ID comes first.
        Unfinished transactions are excluded.

        If OID is not INVALID_OID, only transactions in which the object was
        committed are selected. Otherwise, all transactions are considered.
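
        One possible reading of these selection rules is sketched below. For brevity, TIDs
        are plain integers, and None stands for INVALID_TID and INVALID_OID, whose actual
        values are not assumed here. Applying the OID filter before the limit 'n' is an
        interpretation, and the function name is hypothetical.

```python
def select_transactions(finished, tid=None, n=1, oid=None):
    """Return up to n transaction IDs, newest first.

    finished -- iterable of (tid, oids) for finished transactions only
    tid      -- starting transaction ID, or None for the last transaction
    oid      -- restrict to transactions which committed this object, or None
    """
    # Larger transaction IDs come first.
    ordered = sorted(finished, key=lambda entry: entry[0], reverse=True)
    if tid is not None:
        ordered = [entry for entry in ordered if entry[0] <= tid]
    if oid is not None:
        ordered = [entry for entry in ordered if oid in entry[1]]
    return [entry[0] for entry in ordered[:n]]
```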

      Return Transaction Information

        Type -- 8006

        Sender -- PMN

        Receiver -- CN

        Class -- Synchronous

        Format::

          +------------+---+-------+---+-------+-----+-------+------+
          | Error Code | n | TID 1 | m | NID 1 | ... | NID m | .... |
          +------------+---+-------+---+-------+-----+-------+------+
          10           12  16      24  28      32    24+m*4  28+m*4

        Return the selected transactions and the storage nodes which hold each of them.
        'n' is a 4-byte unsigned integer, the number of transaction entries following
        this parameter. The entries must be sorted in the descending order of transaction
        IDs. Each entry consists of a transaction ID, 'm', which is a 4-byte unsigned
        integer describing the number of storage nodes, and 'm' storage node IDs.
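
        A client could walk this variable-length list as sketched below. The assumptions
        are big-endian byte order, an unsigned Error Code, storage node IDs carried as
        4-byte unsigned integers (the width shown in the diagram), and an opaque 10-byte
        header. The function name is hypothetical.

```python
import struct

HEADER_SIZE = 10  # message bodies start at offset 10 in the diagrams


def parse_transaction_information(packet):
    """Return (error_code, [(tid, [node_id, ...]), ...])."""
    error_code, count = struct.unpack_from(">HI", packet, HEADER_SIZE)
    pos = HEADER_SIZE + 6
    transactions = []
    for _ in range(count):
        # Each entry: 8-byte TID, 4-byte node count 'm', then m node IDs.
        tid, node_count = struct.unpack_from(">8sI", packet, pos)
        pos += 12
        node_ids = struct.unpack_from(">%dI" % node_count, packet, pos)
        pos += 4 * node_count
        transactions.append((tid, list(node_ids)))
    return error_code, transactions
```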

      Request Undo

        Type -- 0007

        Sender -- CN

        Receiver -- PMN, SN

        Class -- Synchronous

        Format::

          +-----+
          | TID |
          +-----+
          10    18

        Request undoing a transaction. This must be performed within a transaction.
        A client node is responsible for sending the same transaction ID to all
        storage nodes.

      Accept Undo

        Type -- 8007

        Sender -- PMN, SN

        Receiver -- CN

        Class -- Synchronous

        Format::

          +------------+
          | Error Code |
          +------------+
          10           12

        Return only whether the undo was successful.

      Request New TID

        Type -- 0008

        Sender -- CN

        Receiver -- PMN

        Class -- Synchronous

        Format::

          +
          |
          +
          10

        Request a new transaction. This implies the beginning of a transaction.

      Return New TID

        Type -- 8008

        Sender -- PMN

        Receiver -- CN

        Class -- Synchronous

        Format::

          +------------+-----+
          | Error Code | TID |
          +------------+-----+
          10           12    20

        Return a new transaction ID.

      Request Confirmation For Transaction

        Type -- 0009

        Sender -- CN

        Receiver -- PMN

        Class -- Synchronous

        Format::

          +-----+---+-------+-----+-------+---+-------+----------+-----+-------+----------+
          | TID | n | NID 1 | ... | NID n | m | OID 1 | Serial 1 | ... | OID m | Serial m |
          +-----+---+-------+-----+-------+---+-------+----------+-----+-------+----------+
          10    18  22      26    18+n*4      26+n*4  34+n*4     42+n*4                   26+n*4+m*16

        Send information about a stored transaction. 'n' is a 4-byte unsigned integer which
        specifies the number of storage nodes holding the transaction data. 'm' is another
        4-byte unsigned integer which specifies the number of pairs of an OID and a Serial
        Number. Each pair identifies an object which was modified. Note that the Serial Number
        must be the previous Serial Number of the modified object, not the new Serial Number.
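
        The body of this message could be assembled as sketched below, using the field
        widths from the diagram (8-byte TID, 4-byte counts and NIDs, 8-byte OIDs and Serial
        Numbers). Big-endian byte order and unsigned integer NIDs are assumptions, the
        10-byte header is left out, and the function name is hypothetical.

```python
import struct


def pack_confirmation_request(tid, node_ids, modified):
    """Build the body of a Request Confirmation For Transaction message.

    tid      -- 8-byte transaction ID
    node_ids -- storage node IDs, assumed to be 4-byte unsigned integers
    modified -- list of (oid, previous_serial) pairs, 8 bytes each
    """
    if len(tid) != 8:
        raise ValueError("TID must be 8 bytes")
    parts = [tid, struct.pack(">I%dI" % len(node_ids), len(node_ids), *node_ids)]
    parts.append(struct.pack(">I", len(modified)))
    for oid, previous_serial in modified:
        if len(oid) != 8 or len(previous_serial) != 8:
            raise ValueError("OID and Serial Number must be 8 bytes each")
        # The serial sent is the one the object had before this transaction.
        parts.append(oid + previous_serial)
    return b"".join(parts)
```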

      Confirm Transaction

        Type -- 8009

        Sender -- PMN

        Receiver -- CN

        Class -- Synchronous

        Format::

          +------------+
          | Error Code |
          +------------+
          10           12

        Return only whether the request was successful.

      Send Transaction Data

        Type -- 000A

        Sender -- CN

        Receiver -- SN

        Class -- Synchronous

        Format::

          +-----+-------------+------+--------------------+-------------+------------------+-----------+
          | TID | User Length | User | Description Length | Description | Extension Length | Extension |
          +-----+-------------+------+--------------------+-------------+------------------+-----------+
          10    18           20      20+ul                22+ul         22+ul+dl           24+ul+dl    24+ul+dl+el

          +---+-------+---------------+------------+----------+--------+----
          | n | OID 1 | Compression 1 | Checksum 1 | Length 1 | Data 1 | ...
          +---+-------+---------------+------------+----------+--------+----
          x   x+4     x+12            x+13         x+17       x+25     x+25+l1

        Send all the data in a transaction. User Length, Description Length and Extension Length
        are 2-byte unsigned integers which specify the lengths of User, Description and Extension,
        respectively, including trailing NUL characters. User, Description and Extension are
        NUL-terminated strings. In the diagram, 'ul', 'dl' and 'el' stand for the values of
        User Length, Description Length and Extension Length.

        Following Extension, at the offset 'x' = 24+ul+dl+el, the 4-byte unsigned integer 'n'
        specifies the number of objects in this transaction. Each object is described with an OID,
        a compression algorithm, an Adler-32 checksum, the length of data, and data. The compression
        algorithm is a 1-byte unsigned integer. Currently, these values are defined:

          0 -- No Compression

          1 -- zlib's Compression

        All other values are undefined.

        The checksum is based on Adler-32, which is implemented in zlib. The checksum must be computed
        after compression is applied, if any. The 8-byte unsigned integer, Length, defines the length
        of Data as transmitted. Data is raw object data, and storage nodes must treat it as opaque.
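
        A single object record could be produced and checked as sketched below. The record
        layout follows the diagram (8-byte OID, 1-byte compression algorithm, 4-byte
        checksum, 8-byte Length, Data). Big-endian byte order is an assumption, and the
        function and constant names are hypothetical. Decompression is shown only for a
        reader of the data, such as a client node.

```python
import struct
import zlib

NO_COMPRESSION = 0
ZLIB_COMPRESSION = 1


def pack_object(oid, data, compress=True):
    """Build one object record: OID, algorithm, checksum, length, data."""
    payload = zlib.compress(data) if compress else data
    algorithm = ZLIB_COMPRESSION if compress else NO_COMPRESSION
    # The checksum covers the bytes as sent, i.e. after compression.
    checksum = zlib.adler32(payload) & 0xFFFFFFFF
    return oid + struct.pack(">BIQ", algorithm, checksum, len(payload)) + payload


def unpack_object(buf, pos=0):
    """Return (oid, data, next_pos) for the record starting at pos."""
    oid = buf[pos:pos + 8]
    algorithm, checksum, length = struct.unpack_from(">BIQ", buf, pos + 8)
    start = pos + 21  # 8 + 1 + 4 + 8 bytes precede Data
    payload = buf[start:start + length]
    if zlib.adler32(payload) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch")
    if algorithm == ZLIB_COMPRESSION:
        data = zlib.decompress(payload)
    elif algorithm == NO_COMPRESSION:
        data = payload
    else:
        raise ValueError("undefined compression algorithm")
    return oid, data, start + length
```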

      Accept Transaction Data

        Type -- 800A

        Sender -- SN

        Receiver -- CN

        Class -- Synchronous

        Format::

          +------------+
          | Error Code |
          +------------+
          10           12

        Return only whether the request was successful.

      FIXME: more message types must be defined.
      FIXME: the message types should be renumbered for clarity.


  Notes

    The algorithm for verifying data integrity could be improved. The difficulty is that
    checking all the data would be very expensive.

    It would be better to add a monitoring protocol to watch the whole cluster and to obtain
    the state of each node, such as working or broken. Also, it might be interesting to
    support performance monitoring, such as the CPU load and the disk space of each node.
    A web interface is desirable, but it is probably not a good idea to implement this in
    a client node with Zope, because Zope will not start up correctly if master nodes are
    not working well.

    In the current specification, a full disk is not dealt with correctly. If the disk
    of a node is full, the node is mistakenly reported as broken. This might be inevitable
    for master nodes, since unwritable master nodes are really useless, but storage nodes
    may still function in read-only mode, as they can still serve the replicas they already
    hold. Possibly it would be better to add a read-only state to a storage node.

    How to export and import data is an open question. FileStorage's Data.fs would be the
    best option. However, this is not very suitable for making a backup remotely, because
    all the data must be sent every time. One way is to make a clever protocol which asks
    for only updated data. This is not difficult, because Data.fs is an append-only structure.
    Another way is to run a replicated storage node at a remote site, which is not used for
    reading data by client nodes.

    Undo is hard. If a storage node is down, an undo is not performed on that storage node.
    When the storage node is reconnected, it believes that the undone transaction
    is still effective. To solve this problem, it is necessary to record what has been undone,
    and replay the undo when reconnecting.
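
    The record-and-replay idea could look roughly like the sketch below. It is not part of
    the protocol: the class, its methods and the 'send_undo' callback are all hypothetical,
    and it assumes some component such as the current master node keeps the log.

```python
class UndoLog:
    """Remember which undos each storage node missed while disconnected."""

    def __init__(self, storage_nodes):
        # node -> list of undone TIDs the node has not applied yet, oldest first
        self.pending = {node: [] for node in storage_nodes}

    def record_undo(self, tid, connected):
        """Record that tid was undone; nodes not in `connected` missed it."""
        for node, missed in self.pending.items():
            if node not in connected:
                missed.append(tid)

    def replay(self, node, send_undo):
        """Replay missed undos in order when `node` reconnects."""
        missed = self.pending[node]
        while missed:
            send_undo(node, missed[0])
            # Drop the entry only after it was sent, so a failure keeps the rest.
            missed.pop(0)
```

    A node which was connected for every undo has nothing to replay, so reconnecting it
    costs nothing.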