Nexedi Enterprise Objects (NEO) Specification Introduction Nexedi Enterprise Objects (NEO) is a storage system for Zope Object Database (ZODB). NEO is a novel technology in that it provides an extremely robust and scalable storage service which may not be found in existing solutions. NEO provides these features for the robustness and the scalability: - distributed storage nodes - redundant copies - backup master nodes - automatic recovery - dynamic storage allocations - fail-safe protocol For now, this specification corresponds to ZODB version 3.2, which is bundled with Zope version 2.7. This documents how NEO works from the protocol level to the software level. This specification is version 3.0, last updated at 2006-07-08. Components Overview NEO consists of three components: master nodes, storage nodes and client nodes. Here, node means a software component which runs on a computer. These nodes can run in the same computer or different computers in a network. A node communicates with another node via the TCP protocol. Master nodes are responsible for coordinating the whole cluster. They manage information on nodes in the same cluster, and direct states of other nodes to implement the semantics of ZODB transactions. At a time, only a single master node is current; the other nodes are secondary. Storage nodes maintain raw data. They store data committed by each transaction, and hold redundant copies of ZODB objects. Client nodes are usually Zope servers, and clients from the viewpoint of NEO. They interact with current master node and storage nodes to perform transactions and obtain data. Master Node Configuration Each master node must have a configuration which describes all the list of master nodes. Also, the configuration must specify how many copies of objects should be made in the cluster. Database Each master node must have a database. The database must record the list of known master nodes, the list of known storage nodes, the list of known client nodes, the mapping between transaction IDs and storage nodes which hold corresponding transactions, the mapping between object IDs and transaction IDs. This database must be reinitialized when it turns out to be out-of-dated, excluding the list of storage nodes. The list of storage nodes must be persistent so that the master node can examine the necessity of recovery. Roles Each master node must be one of "Primary", "Secondary" and "Unknown". At most one master node in a cluster is "Primary", and responsible for managing the whole cluster. "Secondary" master nodes must obey the "Primary" master node, as long as it is working correctly. If the "Primary" master node is disconnected or broken, one of the "Secondary" master nodes must take over the "Primary" role. Sorting Order The master nodes have a defined sorting order. The sorting algorithm is defined by comparing the listening IP addresss and the listening port numbers. Each IP address is converted into the network byte order. If two nodes have different IP addresses, the node with the lower IP address is smaller. If two nodes have the same IP address but different port numbers, the node with the lower port number is smaller. States Bootstrap State Initially, each master node has the role "Unknown". In this state, the master node must try to connect to all other known master nodes and accept connections from other master nodes. If a smaller node gets a connection from a larger node in the sorting order, the smaller node must connect to the larger node and drop the connection from the larger node, in order to reduce the number of connections. Each node must wait until it establishes connections with all known master nodes. Once this has been done, each node must advertise what master nodes are available. If a node gets a different answer from another node, that node must report an error and exit. Otherwise, it must move to the discussion state. Discussion State First, a master node must ask if any other node is "Primary". If so, it must query which node is "Primary" and move to the recovery state. Then, a master node must send the latest transaction ID it holds in the database, and compare which has the newest transaction ID. The node which has the newest transaction ID must be selected to "Primary". If multiple nodes hold the same newest transaction ID, the smallest node must be "Primary". If a node becomes "Primary", it must move to the verification state. If a node becomes "Secondary", it must move to the recovery state. Verification State A master node must verify its database up-to-date. The node verifies the database by querying the latest transaction ID and the number of transaction IDs for each storage node to the storage node. For this, it is necessary to wait for all known storage nodes which are not broken or down to connect to the master node. If a storage node is not connected, the node must be marked as "Temporarily Down". If a storage node is temporarily down for over 10 minutes, the node must be remarked as "Down", and stop waiting for the node. If any storage node disagrees, the node must move to the reconstruction state. Otherwise, the node must move to the ready state. Reconstruction State The master node must reinitialize the database, except for the list of nodes. Then, it must ask each storage node to send object IDs and transaction IDs it has. After updating the database, the master node must examine if there is any transaction or object which does not have a good number of replicates in storage nodes. If any, the master must initiate the process of a replication. The detailed information about the replication is described below. Once this is finished, the node must move to the ready state. Recovery State The master node must wait for the "Primary" master node to be in the ready or working state. Once the "Primary" master node becomes working, the "Secondary" master node must verify if its database is up-to-date with the same algorithm described in the verification state. If it is not up-to-date, the "Secondary" master node must ask all information the "Primary" master node holds, and update the database. The node must move to the backup state after this. Ready State The master node must wait for all "Secondary" master nodes to be in the backup state. Once this is finished, the node must move to the working state. Working State When entering a working state, the master node must check the list of known storage nodes which are broken or down. If so, the master node must initiate a replication process appropriately. In this state, all protocol commands are supported, and client nodes must be able to function normally. If a "Secondary" master node is available, the "Primary" master node must send all updating messages to it, when any data in the "Primary" database changes. If a storage node appears in this state, the master node must interrupt any replication process, and query all data from the storage node. Once the information is transferred to the master node, the master node must re-examine the database for interrupted replication processes, and re-issue replication processes, if necessary. If a storage node disappears, it must be marked as "Temporarily Down". If this persists over 10 minutes, it must be remarked as "Down", and the master node must initiate a replication process. Also, the master node must report an error and continue the operation. If a storage node gets broken, it must marked as "Broken". The master node must initiate a replication process, and report an error, but it must continue the operation. Backup State The master node must listen to the "Primary" node, and synchronize the database with the "Primary" database. The "Secondary" master node is not allowed to change the database contents without any command from the "Primary" master node. If a storage node or a client node reports that the "Primary" node is not available or broken, the master node must issue a re-discussion to all other "Secondary" master nodes. Then, it must move back to the discussion state. If the node gets a message that the "Primary" node is malfunctioning, the node must drop out the connection with the "Primary" node, and move back to the discussion state again. Shutdown State This is a special state to shutdown the whole cluster. The transition to this state is only possible with a manual issue. Once this instruction is issued, the node must request shutdown to all nodes. When the node gets this message, it must enter the shutdown state. In this state, client nodes are not allowed to make any new transaction, and stop as soon as possible. Once all ongoing transactions are finished, the "Primary" master node must direct all nodes to shutdown immediately, and the "Primary" master node itself must exit, after all the other nodes are down. Storage Node Configuration The storage node may find master nodes by broadcasting. Alternatively, it may specify master nodes by a configuration. Database The storage node must have a persistent database to hold transactions and objects. States Bootstrap State The storage node must generate a Universally Unique ID (UUID), if not present yet, so that master nodes can identify the storage node. The storage node must connect to one master node, and wait for a "Primary" master node is selected. Once it is selected, the storage node must make sure that it has a connection to the "Primary" master node. Then, the storage node must move to the working state. Working State The storage node must interact with master nodes, client nodes and other storage nodes. This is mainly retrieving and storing data. Shutdown State The storage node must shutdown immediately, when the "Primary" master node asks it. Client Node Configuration The client node may find master nodes by broadcasting. Alternatively, it may specify master nodes by a configuration. Database The client node may have a volatile database to cache information. The database must be cleared out, when connecting to the "Primary" master node. States Bootstrap State The client node must connect to one master node, and wait for a "Primary" master node is selected. Once it is selected, the client node must make sure that it has a connection to the "Primary" master node. Then, the client node must move to the working state. Working State The client node must interact with master nodes and storage nodes. This is mainly commiting and aborting transactions. Also, the client node must detect a broken storage node. If a "Primary" master node is not available, the client node must connect to another master node, and notify that the "Primary" master node is not available. Then, it must move back to the bootstrap state. If no master node is available, the client node must report an error and exit. Shutdown State The client node must not issue any new transaction. It must shutdown immediately, once all ongoing transactions are finished. Operations Transactions Overview Transactions must be serialized with a lock in the commit phase to ensure the atomicity and the integrity of every transaction. Thus only a single transaction commit is allowed at a time for ZODB. For Zope, however, multiple transactions may run at a time, until they reach the commit phase. ZODB's transaction commit follows this order: 1. Begin a commit (tpc_begin). 2. Vote for a commit (tpc_vote). 3. Finish or abort a commit (tpc_finish or tpc_abort). At tpc_begin, a new transaction begins. At tpc_vote, object data is sent to storages. And, finally, at tpc_finish, the transaction is finished. If anything wrong happens in tpc_vote, tpc_abort may be called to abort the transaction. In NEO, a transaction commit is realized in this way: 1. A client node calls tpc_begin, and demand a new transaction ID from a "Primary" master node. The "Primary" master node must send back storage node IDs to the client node so that the client node will store objects in the set of specified storage nodes. 2. The client node calls tpc_vote, and send object data to the specified storage nodes, and meta information to the "Primary" master node. 3. The client node calls tpc_finish, and confirm that the transaction is complete with the "Primary" master node. The "Primary" master node then must contact storage nodes to ensure that the data has been stored, and must ask all client nodes but the sending client node to invalidate their cached items, because data is updated, and keeping older data will make conflicts. 4. Instead, the client node may call tpc_abort. In this case, the client node must tell the "Primary" master node to cancel the transaction. The "Primary" master node then must contact storage nodes to drop out the object data. That above is the normal situation where everything goes well. It is possible, however, that one or more errors may happen in a commit phase, at the client side, the master side and the storage side. Thus care must be taken of the following situations. Error Cases In the following cases, abbriviations are used for simplicity: "CN" for a client node, "PMN" for a "Primary" master node, "SMN" for a "Secondary" master node, and "SN" for a storage node. CN calls tpc_begin, then PMN does not reply In this case, CN must stop the transaction, and notify SMN that PMN is broken. PMN replies to tpc_begin, but CN does not go ahead In theory, tpc_vote may take a lot of time when committing a lot of data. However, PMN must take care of not stopping the whole cluster, as a transaction commit is an exclusive operation. Therefore, in reality, a lock must not be obtained in tpc_begin. At the beginning, PMN must only generate a new transaction ID. Nevertheless, if CN does not issue tpc_vote to PMN, garbage can remain in SN, because CN may have sent data to SN already. Thus SN must implement a garbage collection which may expire pending transactional data. CN calls tpc_vote, but SN does not reply or is buggy CN must not wait for each SN too long. If CN cannot connect to a SN, CN must assume that the SN is down, thus must report it to PMN. If CN connected to a SN but the SN does not reply or the reply indicates an error, CN must report that the SN is broken to PMN, so that PMN may abandone the SN. This requires a timeout, as SN may send no rely forever. The timeout should be at least 10 seconds, because an operating system flushes cache in every 5 second typically, and this disk activity may delay the operation in SN significantly. If all SNs specified by PMN fail, CN may ask PMN to send a new set of storage nodes. As it is quite rare that all fail, this feature is not obligatory. If all SNs specified by PMN, including newly acquired SNs, fail, CN must stop the transaction. CN calls tpc_vote, SN accepts data, but PMN does not reply At this stage, PMN really needs to obtain a lock, thus PMN may spend some time to wait for other transactions to finish. However, PMN should not need much time for this, as the lock is held only when PMN writes small meta information to its database, and send notifications to other nodes. PMN replies to tpc_vote, but CN does not go ahead Because CN may vote for other storage adapters as well, this may need some time. However, as the lock is held already, PMN must not stop other clients too long. PMN must abort the transaction with a proper timeout. CN calls tpc_finish, but PMN does not reply CN must drop the connection to PMN, and report it to SMN. When CN reconnects to a new PMN, the cache is invalidated, thus CN does not have to invalidate the cached data for the unfinished transaction explicitly. PMN asks SN to finish a transaction, but SN does not reply PNM must send an error message to CN, so that CN can invalidate the cached data. CN may not do anything else, because ZODB does not have a facility to make an error at tpc_finish appropriately. PMN asks CN to invalidate cache, but CN is down PMN must ignore this error, as the CN will invalidate cache when reconnecting. CN or PMN may ask data when a transaction is in an intermediate state SN must take care that it does not send meta information to PMN when it is not finished. SN should remove pending data when PMN asks for meta information, because PMN does that only when all transactions are aborted. PMN must take care that it does not send meta information to CN when it is not finished. PMN should not write meta information to the database when it is not finish. Replications Overview Replications are used to make redundant copies of objects among a cluster. This guarantees that a single node failure (or even more, depending on a configuration) does not cause a data loss, and helps to distribute the load of reading data. When a client node commits a transaction, the client node itself makes redundant copies by contact multiple storage nodes. In the normal mode of the cluster management, a "Primary" master node is involved only with assigning storage nodes to every transaction. However, if a storage node is down or broken, the "Primary" master node must undergo a recovery by making a new replication in another storage node. In this case, the "Primary" master node must check all data which a non-functional storage node had, and determine a new storage node for a given transaction. When the cluster does not have a good number of storage nodes any longer, the "Primary" master node may not proceed a replication. In this case, the only solution is to add a new storage node to the cluster, or to reduce the required number of replications. If a new storage node is connected, the "Primary" master node must check if the storage node already contains data, and update its database, if necessary, because the storage node may have a lost replication. Another case is when meta information is reconstructed by collecting data from all storage nodes. A "Primary" master node must undertake a complete examination of the database to obtain a list of transactions which have fewer replications. Considerations As replications are only that storage nodes exchange data under the management by a "Primary" master node, this specification does not define how a "Primary" master node must implement a replication policy. However, an implementation should take care that replications should not affect the system performance too much, as replications are not always urgent, in comparison with transactions. For instance, replications should be divided into smaller parts. Replications should be scheduled appropriately, so that they run only when the system load of a node is low. When a replication process is finished, a "Primary" master node should remove a broken or down storage node from the database, if the node is not referred to any longer. Otherwise, the database may grow up over time, and slows down replications gradually. Protocol Format NEO sends messages among cluster nodes via TCP connections. Each message has the following header:: +----+------+--------+----------+---- | ID | Type | Length | Reserved | ... +----+------+--------+----------+---- 0 2 4 8 10 ID -- 2-byte unsigned integer which distinguishes a message. Type -- 2-byte unsigned integer which specifies a message type. Length -- 4-byte unsigned integer which specifies the length of this message, including the header. Reserved -- 2-byte unsigned integer for future expansion. This must be set to zero in this version. When a message requires a reply, the ID of the reply message must be identical with the ID of the original request message. In addition, the type of a reply must be identical with the type of a request with the 15th bit set. Most messages add additional data into packets, followed by the header. The Length field of a header specifies the size of a message in bytes, including the header itself. Thus a receiver of a packet previse the size before reading the whole packet. All integer values in packets are always encoded in the network byte order for portability. When the number of parameters or the size of a parameter is variable, the number is specified before the parameters. Message Classes Messages are classified into asynchronous messages and synchronous messages. Asynchronous messages are request-only messages which do not require any reply, while synchronous messages require replies, one reply to one request. Asynchronous messages are used to notify status changes in the cluster. They include additions and deletions of nodes. Synchronous messages are used to exchange data between nodes. A sender never sends multiple synchronous request messages before getting a reply. Note that, however, a receiver may get multiple synchronous requests from a single connection, because such a connection may be used by multiple threads. Common Parameters Messages use the same types of parameters frequently. Here are the definitions of those types: OID -- Object ID. It is a 8-byte array used to identify an object, regardless of transactions. TID -- Transaction ID. It is a 8-byte array used to identify a transaction. Serial -- Serial number of an object, which identifies a certain generation of an object. In NEO, this is identical to TID, because NEO does not support versioning. UUID -- Universally Unique ID. It is a 16-byte array used to identify a node. NID -- Node ID. It is a 4-byte unsigned integer used to identify a node. A "Primary" master node maps an UUID to an NID for efficiency. IP address -- For now, NEO only supports IPv4. So the address is a 4-byte array. Port Number -- TCP's port number. It is a 2-byte unsigned integer. INVALID_OID -- Invalid OID. It indicates that an OID is invalid. It is '\xff\xff\xff\xff\xff\xff\xff\xff'. INVALID_TID -- Invalid TID. It indicates that a TID is invalid. It is '\xff\xff\xff\xff\xff\xff\xff\xff'. INVALID_SERIAL -- Invalid Serial Number. It indicates that a Serial Number is invalid. It is '\xff\xff\xff\xff\xff\xff\xff\xff'. Error Messages A sychronous message allows a receiver to return an error message, when an error occurs. An error message must specify the same ID as a request, and the same Message Type but with the 15th bit set, as well as the other return messages. Successful code indicates that the message is not an error message but an usual message. In this case, the message format is documented in each return message description. Type -- Vary Sender -- MN, SN Receiver -- CN, MN, SN Class -- Synchronous Format:: +------------+----------------------+---------------+ | Error Code | Error Message Length | Error Message | +------------+----------------------+---------------+ 10 12 16 16+n Error Code is a 2-byte unsigned integer, and it is one of these: 0 -- Success. The request is successfully completed. In this case, neither Error Message Length nor Error Message follows the Error Code. 1 -- Not Ready. The node is not ready to accept a given request yet. 2 -- OID Not Found. A given OID is not found in the database. 3 -- Serial Number Not Found. A given Serial Number is not found in the database. 4 -- TID Not Found. A given TID is not found in the database. 5 -- Disk Full. The node cannot store data due to too little disk space. 6 -- Conflict Found. A given transaction may not be committed because of a conflict. 7 -- Inconsistent Configuration. A master node does not have an identical list of master nodes with another, or a foreign cluster node is connected. 8 -- Protocol Version Mismatch. Nodes do not talk in the same protocol. 9 -- Protocol Error. A node does not follow the protocol. 10 -- Timeout. A node may not wait for too long. 11 -- Broken Node Disallowed. A node is known to be broken, so it is not allowed to connect. 12 -- Internal Error. A node is corrupted in software or hardware internally. Error Message Length specifies the length of Error Message, including the trailing NUL character. Error Message is a human-readable string which describes an error. The string is terminated with a NUL character. Message Types Each message type defines what nodes can send it to what nodes. Node types are abbrivated to CN, SN, MN, PMN and SMN for client node, storage node, master node, primary master node and secondary master node, respectively. Each message type is shown as a hexadecimal value. Get New OIDs Type -- 0003 Sender -- CN Receiver -- PMN Class -- Synchronous Format:: +---+ | n | +---+ 10 12 Request new OIDs to assign to objects. 'n' is a 2-byte unsigned integer which specifies the number of requested OIDs. Return New OIDs Type -- 8003 Sender -- PMN Receiver -- CN Class -- Synchronous Format:: +------------+---+-------+-----+-------+ | Error Code | n | OID 1 | ... | OID n | +------------+---+-------+-----+-------+ 10 12 14 22 14+n*4 22+n*4 Return new OIDs. 'n' is a 2-byte unsigned integer, specifying the number of returned OIDs. Request Node Identification Type -- 0004 Sender -- CN, SN, MN Receiver -- MN, SN Class -- Synchronous Format:: +---------------+---------------+-----------+------+------------+-------------+ | Major Version | Minor Version | Node Type | UUID | IP Address | Port Number | +---------------+---------------+-----------+------+------------+-------------+ 10 14 18 20 36 40 42 Every client node must issue this request before any other request, to identify itself. Major Version and Minor Version must specify the protocol versions, and each parameter is a 4-byte unsigned integer. In this version Major Version must be '3', and Minor Version must be '0'. Node Type is a 2-byte unsigned integer, and it must be one of these: 1 -- Master Node 2 -- Storage Node 3 -- Client Node UUID is unique to each node, and this may be zero for CN, because the UUID will not be used for any purpose. For MN, the UUID is a cluster UUID but not a MN UUID. The difference is that a cluster UUID is shared by all nodes joining the same cluster. IP Address and Port Number are a listening address and a port to accept connections. They must be zero for CN. Accept Node Identification Type -- 8004 Sender -- MN, SN Receiver -- CN, SN, MN Class -- Synchronous Format:: +------------+------+ | Error Code | UUID | +------------+------+ 10 12 28 Accept a connection. This returns a cluster UUID which idenfies the cluster. Get Database Information Type -- 0005 Sender -- SN, CN Receiver -- PMN Class -- Synchronous Format:: + | + 10 Request database information. Return Database Information Type -- 8005 Sender -- PMN Receiver -- SN, CN Class -- Synchronous Format:: +------------+-----------------+--------------+--------------------------+ | Error Code | Version Support | Undo Support | Transaction Undo Support | +------------+-----------------+--------------+--------------------------+ 10 12 13 14 15 +-----------+-------------+------+------------------+-----------+ | Read Only | Name Length | Name | Extension Length | Extension | +-----------+-------------+------+------------------+-----------+ 15 16 18 18+n 20+n 20+n+m Version Support, Undo Support, Transaction Undo Support and Read Only are 1-byte boolean parameters. They must be either 1 or 0. Name Length and Extension Length are 2-byte unsigned integers, and they specify the lengths of Name and Extension, respectively, including trailing NUL characters. Name and Extension are human-readable, NUL-terminated strings. In this version, Version Support, Undo Support and Transaction Undo Support must be always 0, 1 and 1, respectively. Name must be "NEO" and Extension must be empty. Get Transaction Information Type -- 0006 Sender -- CN Receiever -- PMN Class -- Synchronous Format:: +-----+-----+---+ | OID | TID | n | +-----+-----+---+ 10 18 26 30 Request transaction information. The first transaction is specified by TID. If TID is INVALID_TID, the last transaction is selected. 'n' is a 4-byte unsigned integer which requests the maximum number of transactions returned, starting from TID or the last transaction. The positions of transactions are defined in the descending order of transaction IDs, namely, a larger transactions ID is earlier, excluding unfinished transactions. If OID is not INVALID_OID, only transactions with which the object is committed are selected. Otherwise, all transactions are used. Return Transaction Information Type -- 8006 Sender -- PMN Receiver -- CN Class -- Synchronous Format:: +------------+---+-------+---+-------+-----+-------+------+ | Error Code | n | TID 1 | m | NID 1 | ... | NID m | .... | +------------+---+-------+---+-------+-----+-------+------+ 10 12 16 24 28 32 28+m*4 32+m*4 Return the number of transactions and storage nodes for the transactions. 'n' is a 4-byte unsigned integer, the number of transactions following this parameter. The transaction IDs must be sorted in the descending order, and each transaction information specifies 'm' which is a 4-byte unsigned integer describing the number of storage nodes, and storage node IDs. Request Undo Type -- 0007 Sender -- CN Receiver -- PMN, SN Class -- Synchronous Format:: +-----+ | TID | +-----+ 10 18 Request undoing a transaction. This must be performed within a transaction. A client node is responsible for sending the same transaction ID to all storage nodes. Accept Undo Type -- 8007 Sender -- PMN, SN Receiever -- CN Class -- Synchronous Format:: +------------+ | Error Code | +------------+ 10 12 Return only if undo was successful or not. Request New TID Type -- 0008 Sender -- CN Receiver -- PMN Class -- Synchronous Format:: + | + 10 Request a new transaction. This implies the beginning of a transaction. Return New TID Type -- 8008 Sender -- PMN Receiver -- CN Class -- Synchronous Format:: +------------+-----+ | Error Code | TID | +------------+-----+ 10 12 20 Return a new transaction ID. Request Confirmation For Transaction Type -- 0009 Sender -- CN Receiver -- PMN Class -- Synchronous Format:: +-----+---+-------+-----+-------+---+-------+----------+-----+-------+----------+ | TID | n | NID 1 | ... | NID n | m | OID 1 | Serial 1 | ... | OID m | Serial m | +-----+---+-------+-----+-------+---+-------+----------+-----+-------+----------+ 10 18 22 26 22+n*4 30+n*4 38+n*4 46+n*4 30+n*4+m*16 Send information about a stored transaction. 'n' is a 4-byte unsigned integer which specifies the number of storage nodes holding the transaction data. 'm' is another 4-byte unsigned integer which specifies the number of OIDs and Serial Numbers. Each pair of an OID and a Serial Number are the objects which were modified. Note that a Serial Number must be the previous Serial Number for a modified object, but not the new Serial Number. Confirm Transaction Type -- 8009 Sender -- PMN Receiver -- CN Class -- Synchronous Format:: +------------+ | Error Code | +------------+ 10 12 Return only if the request was successful. Send Transaction Data Type -- 000A Sender -- CN Receiver -- SN Class -- Synchronous Format:: +-----+-------------+------+--------------------+-------------+------------------+-----------+ | TID | User Length | User | Description Length | Description | Extension Length | Extension | +-----+-------------+------+--------------------+-------------+------------------+-----------+ 10 18 20 20+ul 22+ul 22+ul+dl 24+ul+dl 24+ul+dl+el +---+-------+---------------+------------+----------+--------+---- | n | OID 1 | Compression 1 | Checksum 1 | Length 1 | Data 1 | ... +---+-------+---------------+------------+----------+--------+---- x x+4 x+12 x+13 x+17 x+25 x+25+l1 Send all the data in a transaction. User Length, Description Length and Extension Length are 2-byte unsigned integers, and the lengths of User, Description and Extension, respectively, including trailing NUL characters. User, Descrption and Extension are NUL-terminated strings. Following Extension, the 4-byte unsigned integer 'n' specifies the number of objects in this transaction. Each object is described with an OID, a compression algorithm, a Adler-32 checksum, the length of data, and data. The compression algorithm is a 1-byte unsigned integer. Currently, these values are defined: 0 -- No Compression 1 -- zlib's Compression All other values are undefined. The checksum is based on Adler-32, which is implemented in zlib. The checksum must be computed after a compression is applied, if any. The 8-byte unsigned integer, Length, defines the length of data. Data is raw object data, and it must be treated as opaque data for storage nodes. Accept Transaction Data Type -- 800A Sender -- SN Receiver -- CN Class -- Synchronous Format:: +------------+ | Error Code | +------------+ 10 12 Return only if the request was successful. FIXME: more message types must be defined. FIXME: the message types should be renumbered for clarity. Notes It might be better to improve the verification algorithm of checking the data integrity. The difficulty is that it would be very heavy to check all data. It would be better to add a monitoring protocol to watch the whole cluster, to get information about each node, such as working, broken, and so on. Also, it might be interesting to support performance monitoring, such as the CPU load, disk space, etc. of each node. A web interface is desirable, but probably it is not a good idea to implement this in a client node with Zope, because Zope will not start up correctly, if master nodes are not working well. In the current specification, disk full is not dealt with correctly. If the disk space of a node is full, the node is reported to be broken mistakenly. This might be inevitable for master nodes, since unwritable master nodes are really useless, but storage nodes may function in read-only, as they still work for a part of replications. Possibly it would be better to add a read-only state to a storage node. It is interesting how to export and import data. FileStorage's Data.fs would be the best option. However, this is not very suitable for making a backup remotely, because it must send all the data every time. One way is to make a clever protocol which asks for only updated data. This is not difficult, because Data.fs is a pending-only structure. Another way is to make a replicated storage node distantly, which is not used for reading data by client nodes. Undo is hard. If a storage node is down, undo is not performed to that storage node. If the storage node is reconnected, the storage node believes that the undone transaction is still effective. To solve this problem, it is necessary to record what have been undone, and replay the undo when reconnecting.