Nexedi Enterprise Objects (NEO) Specification Introduction Nexedi Enterprise Objects (NEO) is a storage system for Zope Object Database (ZODB). NEO is a novel technology in that it provides an extremely robust and scalable storage service which may not be found in existing solutions. NEO provides these features for the robustness and the scalability: - distributed storage nodes - redundant copies - backup master nodes - automatic recovery - dynamic storage allocations - fail-safe protocol For now, this specification corresponds to ZODB version 3.2, which is bundled with Zope version 2.7. This documents how NEO works from the protocol level to the software level. This specification is version 3.0, last updated at 2006-11-01. Components Overview NEO consists of three components: master nodes, storage nodes and client nodes. Here, node means a software component which runs on a computer. These nodes can run in the same computer or different computers in a network. A node communicates with another node via the TCP protocol. Master nodes store the information about the system. They manage central information on NEO nodes and their states; every node addition or failure is registered in the master and notified to other nodes by the master. At a time, only a single master node is current, other nodes are secondary. Storage nodes maintain raw data. They store data committed by each transaction, as well as redundant copies of ZODB objects. They are grouped into clusters of replicas. In each storage nodes cluster one node is distinguished as a primary which implements the semantics of ZODB transactions and assures data consistency within the cluster. Client nodes are usually Zope servers, and clients from the viewpoint of NEO. They interact with current master node and storage nodes to perform transactions and retrieve data. Master Node Master node is the server of system information of NEO. It works in a cluster of master nodes among which only one has the role of primary master node, other are replicas that take over the role of the primary master node in case of failure. The main functions of master node can be summarized as: - storing information about the storage nodes - storing information about the client nodes - storing information about transactions - generating new transactions id - generating new objects id Node start Master nodes have a defined sorting order. The sorting algorithm is defined by comparing the listening IP addresses and the listening port numbers. Each IP address is converted into the network byte order. If two nodes have different IP addresses, the node with the lower IP address is smaller. If two nodes have the same IP address but different port numbers, the node with the lower port number is smaller. Starting application starts master nodes supplied in the configuration data sequentially until the first one is started successfully. The first started master node is the primary. Each following master nodes have the address of the running primary supplied on the start. Database Masters keep simple databases for storing the following information: - storage nodes information - client nodes information - master nodes information The data that is stored in memory before being commited to the database and sent to masters is: - last transaction id - last generated object id Replication Secondary masters connect to the primary and replicate its data in order to be ready to take over its role in case of failure. During the replication no data modification should be performed on the primary master. This is neccessary to ensure the consistency of data among master nodes. Primary master node, sends the contents of its tables (master nodes, storages and clients) in blocks of defined size. Rows of all three tables are sent together, to each row one byte is added identifying the row as client, storage or master. Each block contains the information on its size an the size of the remaining data on the primary. Replicas Replicas maintain copies of the master database but only the primary master initiates read and write operations. Each update is sent to all replicas. Read is performed by the primary alone. Keeping track of the state of master nodes is done by constant sending hanshake messages among the primary and the rest of nodes in the cluster. It allows for both detecting a node failure by the primary and detecting the primary failure by others. In case of a primary master failure a new primary is elected. New primary sends the information about the election to all clients and storage nodes in the system. In case of a replica failure or addition of a new one a notification is sent to each of the master replicas. Primary node election On primary node fail a new one is selected among the rest of the nodes in the cluster. The cluster nodes can be sorted according to their IP address and port number as described above. The node that comes as a first in this sorting order takes over the role of primary in the cluster. Other replicas after a certain lease period, during which all the nodes should detect the primary node failure, redirect their handshakes to the new primary. This selection procedure goes as follows: 1. Primary failure 2. Each of the secondary nodes detects the failure, in <= lease period of time 3. After the failure detection each node verifies if it is a new primary 3a. If it is a secondary, after waiting a lease period of time it sends a handshake to the node that should become a primary. If the primary does not reply to hanshake it is assumed to fail and the selection procedure restarts. 3b. If it is a primary it waits for handshakes from all other master nodes during <= 2 * lease period of time. If during this time a handshake from the majority of other master nodes is received the node becomes primary and it can advertise itself in the system. If the handshakes are not received, the procedure fails, the node exits with error. Newly elected primary sends notifications to all clients and storages in the system about the change. In order to ensure transactions id uniqueness the new primary master queries all client and storage nodes about the last transaction performed. Following transaction numbers are issued in the ascending order. Storage nodes communication With each change in the states of storage nodes, the primary master is notified - either by a client or by a primary storage - then a proper change is introduced in the database and sent to all clients. Storage nodes maintain information on their cluster within itself (described below). Clients nodes tracking Clients nodes failures are detected by storage nodes or by other clients. Similarly, master is informed, it updates the database and sends the information to all clients and storage nodes. For the safety, all storage nodes keep the list of all clients and reply to requests only from the known clients. If a request is received from an unknown client, a special error code is returned and a client needs to ask the master to insert it again on the clients list. This is to avoid a situation where a client is down for a while, it is reported to be down and then it wakes up again. It is important that the clients list stays consistent for maintaining the cache invalidations among clients. Storage Node The main role of storage nodes is storing the raw data and ZODB transaction implementation. Storage nodes are grouped in clusters of replicas similar to master nodes cluster. Within each cluster one of the nodes is the primary that initiates write operations on all nodes in the cluster. The write operations are resent through the primary to the cluster nodes in order to ensure the data consistency within the cluster. In case of primary failure one of the replicas takes over the role of the primary. The same mechanism of handshaking as in the master nodes cluster is implemented for the storage nodes. A constant handshake messages among the primary and the rest of nodes in the cluster allows for both detecting a node failure by the primary and detecting the primary failure by others. In case of a primary storage failure a new one is elected (see primary node election in the master nodes description). New primary sends the information about the election to the primary master who resends it to all masters and clients in the system. Node start Storage node connects to the primary master node on the start. On the first connection the primary assignes it to a cluster and sends the list of the nodes in the cluster. Then, in order to replicate the data of its cluster, the storage node sends a request to the primary storage in the cluster to lock the objects in this cluster for writing. A lock on all objects means that no transactions should be performed on this cluster. Because there is no system of queue on a lock in NEO, the storage node iteratively sends the request until no all transactions are finished in the cluster. With obtaining the lock the data is replicated from the primary. If a storage node is a first in its cluster no copy of the data is done and it becomes a primary. During the replication the objects stored in the cluster are locked for writing on the primary storage of the cluster to avoid the data inconsistency within the cluster. A lock on all objects of a cluster corresponds to a transaction on all cluster objects. After a successful replication, the storage node is ready to be a part of the cluster, it sends a confirmation to the master. The primary master updates the storages list in its database, sends an update information to other masters and clients. The primary storage unlocks the cluster objects and sends the information on a new storage node to all other nodes in its cluster. Another mode of starting a storage is recovery, that is after a fail, a node can be restarted manually as a member of the previous cluster. One more possibility is that the connection between a storage and primary storage is broken for some time so that the storage does not reply to handshakes and is considered to be down for some time. In both cases the storage data consistency check is performed: - if there are already nodes in the clusters, the storage should ask for a lock and check the consistency of its data with the current cluster data - if there are no working nodes in the cluster the node becomes primary of its cluster with the current state of its data In the second case in this case we cannot assure that the latest transaction data is recovered as the whole storage cluster failed. During both data consistency check and storage node replication the cluster is locked for writing. Since in NEO no kind of queue for locking exists the new or recovered storage node iteratively sends requests for locking the cluster to the primary storage until the objects of the cluster are not locked. Then the primary locks the cluster for writing (but keeps the objects cachable) and unlocks after the replication or check is finished. Replication procedure The newly connected node sends to the primary the last id of transactions performed on all objects stored. The primary node checks for the missing transactions of all objects and for missing objects on the received list. Then it sends blocks of the missing data to the new node. Each block is not bigger than a defined size. It contains information on it size and the size of the remaining data on the primary. Failures Storage nodes in a cluster keep track of the states of each other in a similar manner as master nodes cluster, that is by constant hanshakes. Primary storage sends constant hanshake messages to all storages in its cluster. On the first handshake failure the connection to the node is considered to be temporarily out of reach. No updates are sent to this node. If the communication is not reestablished after some time, the storage node is considered to be broken. If the communication is reestablished the node is requested to make a consistency check as described above. Nodes that don't reply to handshakes after a given lease period are considered to be dead. The information is sent to the master, the handshaking with the node is no longer performed. On the other hand, a lack of handshakes from the primary incites a new primary election (as described in the primary master election). Once there is a new primary it sends an update to primary master that resends it to clients. The information on the change of primary storage allows the clients to decide whether their transaction is in jeopardy and whether it should be aborted due to the failure. New primary performs the abort operation of running transactions independently of clients (as they may fail as well and not call the abort operation while holding the lock) - it compares the latest transactions among the storage nodes in the cluster and removes the ones that are not committed on all nodes. Any uncachable objects are marked as cachable again. An information that is not replicated on all nodes must be a transaction that was being comitted during the failure. Client nevertheless aborts its transaction on all storages and the cluster where the failure occurred as well. This prevents a situation where an unfinished transaction was replicated on all cluster nodes before the fail. Client transactions In order to prevent a situation where a client fails while performing a transaction, primary storage nodes performs constant handshakes with the clients that have started transactions on the cluster. If a client node failure is detected, its transaction is aborted, data unlocked and the master is notified. Garbage collection An extra feature that might be elaborated later is a background process in clusters that would remove part of the history of an object is the disc space is limited. Client Node Client nodes connect both to the primary master, to storage nodes and other clients. Master node is contacted during the change of nodes states in the system, storage nodes are used for querying and updating data. Client nodes maintain a local cache for both storage nodes information and recently accessed data. They inform other clients nodes on the modified cache data invalidation. Configuration On the start a client node connects to the primary master to retrieve the list of storage nodes and clients. On each change on the storage or clients nodes list all clients are informed by the master. Data accessing Data is regrouped on the storage nodes. There is a mapping function that allows to identify the cluster where an object is stored according to an object's id. The history of an object is therefore stored within one cluster of storage nodes and a client does not need to contact other nodes to localise the data. According to the load of certain clusters, a new object id is generated in such a way that it would be placed on a less loades cluster. This is an extra feature that can be elaborated later. The number of storge clusters in the system is defined. Each object can be localised among te storage clusters according to its ID by a hashing function OID % number of clusters. Data write operation is always performed through the primary storage of a cluster whereas the reading can be done on any storage node in the cluster. Cache Clients maintain a local cache of the recenlty accessed objects. After finishing a transaction a client sends an invalidation notifications of the modified objects to all NEO clients. When a transaction is being performed on an object, it is marked as uncachable on the storage nodes. The information on cachability of an object is sent to the client on each read operation. Transactions Transactions' design takes into account not only the concurrent data manipulation issues but the effects of independent machine failures on uncomitted transactions as well. Overview Multiple transactions can run at the same time. On the transaction vote the transaction objects are locked. The lock is released on commit or abort. Transaction outline: 1. Client calls tpc_begin on the primary master. The master generates new transaction id and sends it back to the client. 2. Client modifies data locally and calls tpc_vote. This is sent to all storage clusters whose data has been changed in this transaction. Clients send tpc_vote to primary storages along with the transaction id and ids of modified objects. Primary master checks whether modifications are performed on the latest versions of objects and whether they are not locked by some other transaction. In any of the cases voting fails and the client must first abort the transaction on other storage clusters where tpc_vote has already been sent and then restart the transaction as the data it has been operating on is not up to date. If the transaction objects are up to date and unlocked, primary storage locks the objects for writing and marks them as "uncachable" on all storage nodes in the cluster. The uncachable state of an object allows clients to keep on reading the data even though the transaction is not finished. When the client receives the confirmation from all primary storages it can start to commit the data. Once the client has calles tpc_vote on primary storages, constant handshakes are performed between primary storages and the client. In case of a client failure during a transaction, the transaction is aborted in the cluster and the master node is notified. Note: Such implementation of tpc_vote brings a risk of starving a client who wants to write but fails to vote at the moment when the transaction objects are unlocked. However this solution is safe and should work for objects on which a concurrent writing is not often performed. If the starving problem becomes important we should consider an algorithm of decision making as e.g. timestamp concurrency control. 3. Client sends the data to primary storages. The primaries write the data in their databases. 4a. Client calls tpc_finish on each of the primary storages. The primaries send updates to all the nodes in the cluster, unlock the transaction objects and marks them as cachable again. The client sends invalidation messagess to all clients to remove modified data from their cache. 4b. Client calls tpc_abort instead of tpc_finish. The primary storage nodes remove the update from their databases, unlock locked objects and mark them as cachable again on the cluster. Failures An error that is dangerous for the system functioning is a fail of a client performing a transaction. To avoid the effects of such a fail, each primary storage should keep track of the clients holding locks on it by constant handshakes. In case a client does not respond to a handshake the transaction is cancelled, primary storage unlocks the objects and removes any changes performed during this transaction. Many nodes failures effects should be minimized by the replications and new primary election. Error Cases In the following cases, abbreviations are used for simplicity: "CN" for a client node, "PMN" for a primary master node, "MN" for a secondary master node, and "PSN" for a primary storage node, "SN" for a storage node that is not primary. 1. CN calls tpc_begin, then PMN does not reply As soon as a new master is elected the CN should call tpc_begin once more on it. 2. PMN replies to tpc_begin, but CN does not go ahead This causes no problems as on tpc_begin only a new transaction id is generated. On tpc_begin no locking is done, but the new transaction id necessary for data consistence verification on tpc_vote. 3. CN calls tpc_vote, but PSN does not reply or is buggy CN sends the abort call to any PSNs where the tpc_vote was already called. It informs primary master on a node failure. Primary master tries to connect to the node itself. This is to prevent a situation when an information on a node "wake-up" reaches the master before the failure notice and the node is considered to be down even though it has recovered since. CN waits a certain lease period during which a new PSN should be established in the cluster. Then CN restarts the transaction again. 4. CN sends transaction data but the PSN does not reply Since the client node is not sure what changes have been done by the primary before its fail it calls tpc_abort on all the PSNs where the transaction was started, informs the PMN (as in the previous point), waits a certain "lease period" during which a new PSN should be established in the cluster and restarts the transaction again. 5. PSN replies to tpc_vote, but CN does not go ahead This situation is detected by constant hanshakes with clients holding locks. When a client failure is detected the PSNs abort his transaction, that is unlock the data, mark it as cachable on storage nodes. A failure information is sent to the PMN, PMN resends the information to all nodes in the system. 6. CN calls tpc_finish, but PSN does not reply As in the case of tpc_vote or data updates, the client does not know what updates or unlocking has been done by the primary before its fail. Therefore it calls tpc_abort on all transaction PSNs, informs the PMN (as in point 3), waits a lease period and repeats the whole transaction. Similarly a PSN receiving a notice of a CN failure aborts this client's transaction in its cluster. Even though client's failure would be detected by handshakes later. 7. CN sends a cache invalidation call to another CN that does not reply PMN is notified, it resends the information to other nodes in the system. All transactions of this client are aborted. 8. CN receives an error code saying that it is an unknowned CN It is an information that the node was down for some time and it was reported to be dead. It must clear all its cache and call the PMN to register him as a new CN. 9. CN may ask data when a transaction is in an intermediate state The transaction objects are marked as uncachable on storage nodes. CN receives the data but does not store them in the cache. 10. CN reads from a SN but it does not reply CN reads from another SN in the cluster. If the SN is down then this situation will be detected by the PSN in the cluster and reported to the PMN. 11. PSN receives a handshake from a SN that was considered to be dead PMS returns an error message to the SN, the SN performs a data consistency check against the PSN and then informs the PMN on an addition of a new SN. 12. SN receives a call from an unknown client This indicates that a client was considered to be down by the system. SN returns an error message to the CN the CN clears the cache and contacts the PMN to register it as a client in the system again. Critical Error Cases 1. All masters fail A new master should be started manually. Clients on PMN failure detection wait a lease period of time for a new master to appear. If nevertheless the no PMN shows up, CNs continue but no transaction can be started. 2. All SN in a cluster fail The data of the cluster is no longer available, nor can it be consistent after the cluster restart. All clients remove it from their cache. No write or read of the data can be performed. After a restart of one of the nodes of the cluster, the PMN registers it as the PSN of the cluster. All nodes of this cluster that are restarted later are Secondary SNs of the cluster and make the data consistency check against the PSN. 3. Error on start of a node When there is a connection failure to the master right after the SN or CN start then the started node exits with an error. Protocol Format NEO sends messages among cluster nodes via TCP connections. Each message has the following header: +----+------+--------+----------+---- | ID | Type | Length | Reserved | ... +----+------+--------+----------+---- 0 2 4 8 10 ID -- 2-byte unsigned integer which distinguishes a message. Type -- 2-byte unsigned integer which specifies a message type. Length -- 4-byte unsigned integer which specifies the length of this message, including the header. Reserved -- 2-byte unsigned integer for future expansion. This must be set to zero in this version. When a message requires a reply, the ID of the reply message must be identical with the ID of the original request message. In addition, the type of a reply must be identical with the type of a request with the 15th bit set. Most messages add additional data into packets, followed by the header. The Length field of a header specifies the size of a message in bytes, including the header itself. Thus a receiver of a packet previse the size before reading the whole packet. All integer values in packets are always encoded in the network byte order for portability. When the number of parameters or the size of a parameter is variable, the number is specified before the parameters. Message Classes Messages are classified into asynchronous messages and synchronous messages. Asynchronous messages are request-only messages which do not require any reply, while synchronous messages require replies, one reply to one request. Asynchronous messages are used e.g. to notify status changes in the cluster. They include additions and deletions of nodes. Synchronous messages are used to exchange data between nodes. A sender never sends multiple synchronous request messages before getting a reply. Note that, however, a receiver may get multiple synchronous requests from a single connection, since such a connection may be used by multiple threads. Common Parameters Common parrameters used in the communication are: OID -- Object ID, an 8-byte array used to identify an object, regardless of transactions. TID -- Transaction ID, an 8-byte array used to identify a transaction. Serial -- Serial number of an object, which corresponds to a version of an object. In NEO objects versioning is implemented through transaction numbers, therefore the serial number is identical to TID. UUID -- Universally Unique ID. It is a 16-byte array used to identify a node. *TODO* application? exchange it into a cluster ID? NID -- Node ID. It is a 4-byte unsigned integer used to identify a node. A "Primary" master node maps an UUID to an NID for efficiency. IP address -- For now, NEO only supports IPv4. So the address is a 4-byte array. Port Number -- TCP's port number. It is a 2-byte unsigned integer. INVALID_OID -- Invalid OID. It indicates that an OID is invalid. It is '\xff\xff\xff\xff\xff\xff\xff\xff'. INVALID_TID -- Invalid TID. It indicates that a TID is invalid. It is '\xff\xff\xff\xff\xff\xff\xff\xff'. INVALID_SERIAL -- Invalid Serial Number. It indicates that a Serial Number is invalid. It is '\xff\xff\xff\xff\xff\xff\xff\xff'. Error Messages A sychronous message allows a receiver to return an error message, when an error occurs. An error message must specify the same ID as a request, and the same Message Type but with the 15th bit set, as well as the other return messages. Successful code indicates that the message is not an error message but an usual message. In this case, the message format is documented in each return message description. Type -- Vary Sender -- MN, SN Receiver -- CN, MN, SN Class -- Synchronous Format: +------------+----------------------+---------------+ | Error Code | Error Message Length | Error Message | +------------+----------------------+---------------+ 10 12 16 16+n Error Code is a 2-byte unsigned integer with the following meaning: 0 -- Success. The request is successfully completed. In this case, neither Error Message Length nor Error Message follows the Error Code. 1 -- Not Ready. The node is not ready to accept a given request yet. 2 -- OID Not Found. A given OID is not found in the database. 3 -- Serial Number Not Found. A given Serial Number is not found in the database. 4 -- TID Not Found. A given TID is not found in the database. 5 -- Disk Full. The node cannot store data due to too little disk space. 6 -- Conflict Found. A given transaction may not be committed because of a conflict. 7 -- Inconsistent Configuration. A foreign storage/master cluster node is connected, a request from an unknown client has been received. 8 -- Protocol Version Mismatch. Nodes do not talk in the same protocol. 9 -- Protocol Error. A node does not follow the protocol. 10 -- Timeout. A node may not wait for too long. *TODO* check for the timeout error application 11 -- Broken Node Disallowed. A node is known to be broken, so it is not allowed to connect. 12 -- Internal Error. A node is corrupted in software or hardware internally. Error Message Length specifies the length of Error Message, including the trailing NUL character. Error Message is a human-readable string which describes an error. The string is terminated with a NUL character. Message Types Each message type defines what nodes can send it to what nodes. Node types are abbreviated to CN, SN, MN, PMN and PSN for client node, storage node, master node, primary master node and primary storage node, respectively. Each message type is shown as a hexadecimal value. Request Node Identification Type -- 0001 Sender -- CN, SN, MN Receiver -- PMN Class -- Synchronous Format: +---------------+---------------+-----------+------+------------+-------------+ | Major Version | Minor Version | Node Type | UUID | IP Address | Port Number | +---------------+---------------+-----------+------+------------+-------------+ 10 14 18 20 36 40 42 Every node in NEO must issue this request before any other request, to identify itself. Major Version and Minor Version must specify the protocol versions, and each parameter is a 4-byte unsigned integer. In this version Major Version must be '3', and Minor Version must be '0'. Node Type is a 2-byte unsigned integer, and it must be one of these: 1 -- Master Node 2 -- Storage Node 4 -- Client Node UUID is unique to each node, and this may be zero for CN, because it is not be used by them for any purpose. For SN the UUID is a cluster UUID, if it is equal to zero then the PMN assigns a SN to a cluster. Otherwise the node is started as a member of a cluster. The cluster UUID is shared by all nodes joining the same cluster. IP Address and Port Number are a listening address and a port to accept connections. Reply Node Identification Type -- 8001 Sender -- PMN Receiver -- CN, SN, MN Class -- Synchronous Format: +------------+------+------+ | Error Code | UUID | NID | +------------+------+------+ 10 12 28 30 Accept a connection. This returns a cluster UUID which idenfies the cluster and a node ID assigned by the PMN that identifies the node in the system. Request System Information Type -- 0002 Sender -- SN, CN, MN Receiver -- PMN Class -- Synchronous Format: +------+ | NID | +------+ 10 14 Request information on the system. NID is the requesting node's ID, which allows the PMN to determine which information should be returned. Apart from the basic information, CNs receive information on all storage and client nodes, SNs on all clients and storages in the cluster, MNs all information. Reply System Information Type -- 8002 Sender -- PMN Receiver -- SN, CN, MN Class -- Synchronous Format: +------------+-----------------+--------------+--------------------------+ | Error Code | Version Support | Undo Support | Transaction Undo Support | +------------+-----------------+--------------+--------------------------+ 10 12 13 14 15 +-----------+-------------+------+------------------+-----------+----------+ | Read Only | Name Length | Name | Extension Length | Extension | Clusters | +-----------+-------------+------+------------------+-----------+----------+ 15 16 18 18+n 20+n 20+n+m 24+n+m=k +---+-------+-------------+--------+--------------+---------------+-----+ | n | NID 1 | Node Type 1 | UUID 1 | IP Address 1 | Port Number 1 | ... | +---+-------+-------------+--------+--------------+---------------+-----+ k k+4 k+6 k+8 k+24 k+28 k+30 k+4+26*N Version Support, Undo Support, Transaction Undo Support and Read Only are 1-byte boolean parameters. They must be either 1 or 0. Name Length and Extension Length are 2-byte unsigned integers, and they specify the lengths of Name and Extension, respectively, including trailing NUL characters. Name and Extension are human-readable, NUL-terminated strings. In this version, Version Support, Undo Support and Transaction Undo Support must be always 0, 1 and 1, respectively. Name must be "NEO" and Extension must be empty. 'Clusters' is the number of storage nodes clusters in the system. The information following the basic system information is the system nodes list. N is the number of nodes on the list. Information on a node contains the node ID, type of the node, UUID of the node cluster, address and port. Node Type is a 2-byte unsigned integer, and it must be one of these: 1 -- Master Node 2 -- Storage Node 3 -- Primary Storage Node 4 -- Client Node Request Transaction Information Type -- 0003 Sender -- SN Receiver -- PSN Class -- Synchronous Format: +---+-----+------+-------+-------+-----+-------+-------+ | n | NID | UUID | OID 1 | TID 1 | ... | OID n | TID n | +---+-----+------+-------+-------+-----+-------+-------+ 10 14 18 34 42 50 34+16*n Request for the transactions information. SN sends this request to the PSN for the data consistency check. Included in the request is a list of objects and their latest transactions that were stored on the given node. 'n' is a 4-byte unsigned integer which represents the nubmer of objects on the list. The transactions are defined in the descending order of transaction IDs, namely, a larger transactions ID is earlier, excluding unfinished transactions. Reply Transaction Information Type -- 8003 Sender -- PSN Receiver -- SN Class -- Synchronous Format: +------------+---+---+-------+-------+---------------+------------+----------+--------+---- | Error Code | N | n | OID 1 | TID 1 | Compression 1 | Checksum 1 | Length 1 | Data 1 | ... +------------+---+-------+-------+---------------+------------+----------+--------+---- 10 12 16 20 28 36 37 41 49 49+l1 Return the latest transactions data. 'N' and 'n' are a 4-byte unsigned integers, meaning the number of all objects and the number of objects in the message respectively. Each object of the list is an object and its data stored in the transaction identified by TID. The data must be sent in ascending TID order. N allows the SN to decide whether some more data will be sent. New Node Added Type -- 0004 Sender -- PMN, PSN Receiver -- CN, MN, SN, PMN Class -- Asynchronous Format: +-----+-----------+------+------------+-------------+ | NID | Node Type | UUID | IP Address | Port Number | +-----+-----------+------+------------+-------------+ 10 14 16 32 36 38 This asynchronous message is issued on a succesful node addition to the system. It is sent either by - PMN to CNs, MNs, STs on a new CN addition - PMN to CNs, MNs, STs on a new CN addition - PMN to CNs, MNs, SNs on a new primary master node election - PMN to MNs on a new MN addition - PSN to SN in its cluster on a new SN addition in a cluster Node Type is a 2-byte unsigned integer, and it must be one of these: 0 -- Primary Master Node 1 -- Master Node 2 -- Storage Node 3 -- Primary Storage Node 4 -- Client Node UUID may be zero for CN, for SN the UUID is a cluster UUID. IP Address and Port Number are a listening address and a port to accept connections. Node Down Type -- 0005 Sender -- PMN, PSN, CN Receiver -- CN, MN, SN, PSN, PMN Class -- Asynchronous Format: +-----+-----------+------+ | NID | Node Type | UUID | +-----+-----------+------+ 10 14 16 32 This asynchronous message is issued on a node failure detection. It is sent either by: - CN to PMN on a CN, PSN or SN failure detection - PSN to PMN on a SN failure detection - SN to PMN on a PSN failure detection - PMN to CNs, MNs, STs on SN or CN failure information Node Type is a 2-byte unsigned integer, and it must be one of these: 2 -- Storage Node 3 -- Primary Storage Node 4 -- Client Node UUID may be zero for CN, for SN the UUID is a cluster UUID. Node Type and UUID are additional, identification information. Request Last Transaction ID Type -- 0006 Sender -- PMN Receiver -- PSN, CN Class -- Synchronous Format: + | + 10 A new PMN after being elected queries all CNs and PSNs on the last transaction ID in order to ensure following transactions ID uniqueness. Reply Last Transaction ID Type -- 8006 Sender -- CN, PSN Receiver -- PMN Class -- Synchronous Format: +-----+ | TID | +-----+ 10 18 Return last stored or performed transaction ID. Request Handshake Type -- 0007 Sender -- SN, MN Receiver -- PSN, PMN, CN Class -- Synchronous Format: +-----+ | NID | +-----+ 10 14 Send a hanshake with the requesting node ID. Note: secondary SNs and MNs query the primaries in their cluster. Primaries count the last time a handshake of a secondary node has been received and consider a node to be down if the limit time is over. Reply Handshake Type -- 8007 Sender -- CN, PSN, PMN Receiver -- SN, MN Class -- Synchronous Format: +------------+ | Error Code | +------------+ 10 12 Return 0 on success, error code otherwise. Request New OIDs Type -- 0008 Sender -- CN Receiver -- PMN Class -- Synchronous Format: +-----+---+ | NID | n | +-----+---+ 10 14 18 Request new OIDs to assign to objects. 'n' is a 4-byte unsigned integer which specifies the number of requested OIDs. NID is the ID of the requesting node. It should be a valid ID of a CN. Reply New OIDs Type -- 8008 Sender -- PMN Receiver -- CN Class -- Synchronous Format: +------------+---+-------+-----+-------+ | Error Code | n | OID 1 | ... | OID n | +------------+---+-------+-----+-------+ 10 12 16 24 16+n*4 Return new OIDs. 'n' is a 2-byte unsigned integer, specifying the number of returned OIDs. Request New TID Type -- 0009 Sender -- CN Receiver -- PMN Class -- Synchronous Format: +-----+ | NID | +-----+ 10 14 Request a new transaction. This implies the begining of a transaction. NID is the ID of the requesting node. It should be a valid ID of a CN. Reply New TID Type -- 8009 Sender -- PMN Receiver -- CN Class -- Synchronous Format: +------------+-----+ | Error Code | TID | +------------+-----+ 10 12 20 Return a new transaction ID. Request Object Lock For Transaction Type -- 000A Sender -- CN Receiver -- PSN Class -- Synchronous Format: +-----+-----+---+-------+----------+-----+-------+----------+ | NID | TID | n | OID 1 | Serial 1 | ... | OID n | Serial n | +-----+-----+---+-------+----------+-----+-------+----------+ 10 14 22 26 34 42 26+n*16 Ask for locking objects for writing. 'n' is a 4-byte unsigned integer which specifies the number of objects to be locked on this cluster. Each pair of an OID and a Serial Number are the objects which were modified. Serial Number must be the previous Serial Number for a modified object, not the new one. Reply Lock for Transaction Objects Type -- 800A Sender -- PSN Receiver -- CN Class -- Synchronous Format: +------------+ | Error Code | +------------+ 10 12 Return 0 on success, error code otherwise. Request Write Transaction Data Type -- 000B Sender -- CN, PSN Receiver -- PSN, SN Class -- Synchronous Format: +-----+--------------------+-------------+------------------+-----------+ | TID | Description Length | Description | Extension Length | Extension | +-----+--------------------+-------------+------------------+-----------+ 10 18 20 20+dl 22+dl 22+dl+el=x +---+-------+---------------+------------+----------+--------+---- | n | OID 1 | Compression 1 | Checksum 1 | Length 1 | Data 1 | ... +---+-------+---------------+------------+----------+--------+---- x x+4 x+12 x+13 x+17 x+25 x+25+l1 Send all transaction data of objects of a given cluster. Description Length and Extension Length are 2-byte unsigned integers, and the lengths of Description and Extension, respectively, including trailing NUL characters. Descrption and Extension are NUL-terminated strings. Following the Extension, is the the 4-byte unsigned integer 'n' that specifies the number of objects in this transaction. Each object is described with an OID, a compression algorithm, a Adler-32 checksum, the length of data, and data. The compression algorithm is a 1-byte unsigned integer. Its values are defined as: 0 -- No Compression 1 -- zlib's Compression All other values are undefined. The checksum is based on Adler-32, which is implemented in zlib. The checksum must be computed after a compression is applied, if any. The 8-byte unsigned integer, Length, defines the length of data. Data is raw object data, and it must be treated as opaque data for storage nodes. Reply Write Transaction Data Type -- 800B Sender -- SN Receiver -- CN, PMN Class -- Synchronous Format: +------------+ | Error Code | +------------+ 10 12 Return 0 on success, error code otherwise. Request Finish Transaction Type -- 000C Sender -- CN Receiver -- PSN Class -- Synchronous Format: +-----+-----+ | NID | TID | +-----+-----+ 10 14 18 Commit transaction TID. On this message, the PSN sends the updated of the transaction to all nodes in the cluster. Reply Finish Transaction Type -- 800C Sender -- PSN Receiver -- CN Class -- Synchronous Format: +------------+ | Error Code | +------------+ 10 12 Return 0 on success, error code otherwise. Request Abort Transaction Type -- 000D Sender -- CN Receiver -- PSN Class -- Synchronous Format: +-----+-----+ | NID | TID | +-----+-----+ 10 14 18 Abort transaction TID. On this message, the PSN removes the transaction data stored locally and sends cachable notices to all cluster nodes. Reply Abort Transaction Type -- 800D Sender -- PSN Receiver -- CN Class -- Synchronous Format: +------------+ | Error Code | +------------+ 10 12 Return 0 on success, error code otherwise. Request Read Data Type -- 000E Sender -- CN Receiver -- SN Class -- Synchronous Format: +-----+---+-------+-----+-------+ | NID | n | OID 1 | ... | OID n | +-----+---+-------+-----+-------+ 10 14 18 26 18+n*8 Request the latest versions of the objects OIDs. 'n' is a 4-byte unsigned integer informing on the number of requested objects. Following are the requested objects IDs. Reply Read Data Type -- 800E Sender -- SN Receiver -- CN Class -- Synchronous Format: +------------+---+ | Error Code | n | +------------+---+ 10 12 16 +-------+-------+----------+---------------+------------+----------+--------+---- | OID 1 | TID 1 | Cachable | Compression 1 | Checksum 1 | Length 1 | Data 1 | ... +-------+-------+----------+---------------+------------+----------+--------+---- 16 24 32 34 35 39 47 47+l1 Send latest versions of requested objects. 4-byte unsigned integer 'n' specifies the number of objects and should be equal to the number of requested objects, each object is compressed as desrcibed in the transaction data sending message. To each object a byte is added whether it can be stored in the client's cache, 0 for cachable, 1 for uncachable. Request Undo *TODO* Simplify Undo Type -- 000F Sender -- CN Receiver -- PSN Class -- Synchronous Format: +-----+ | TID | +-----+ 10 18 Request undoing a transaction. This must be done within a new transaction locking the same objects as the transaction that is being cancelled. A client sends the undo message to all PSNs on which the transaction was performed. Reply Undo Type -- 800F Sender -- PSN Receiever -- CN Class -- Synchronous Format: +------------+ | Error Code | +------------+ 10 12 Return 0 if undo was successful, error code otherwise. Set Cachable Objects Type -- 0010 Sender -- PSN Receiver -- SN Class -- Asynchronous Format: +----------+---+-------+-----+-------+ | Cachable | n | OID 1 | ... | OID n | +----------+---+-------+-----+-------+ 10 11 15 23 15+n*8 Set given transaction objects as cachable. 'Cachable' is a one byte value, 0 on cachable, 1 on not. Invalidate Cache Objects Type -- 0011 Sender -- CN Receiver -- CN Class -- Asynchronous Format: +---+-------+-----+-------+ | n | OID 1 | ... | OID n | +---+-------+-----+-------+ 10 14 22 14+n*8 An asynchronous message informing clients to remove modified objects from the cache.