Second prototype protocol and components

Here we choose to store all of a transaction in the same Storage Node (SN) to reduce the number of messages sent. The Master Node (MN) must therefore manage these data structures:

    list of objects -- oid -> serial, list of SN ids
    list of transactions -- tid -> list of SN ids
    list of SNs -- SN_id -> ipv4 address, port, status (status: u = "unreliable", r = "reliable")
    list of Client Nodes (CN) -- CN_id -> ipv4 address, port

The list of SNs and the list of CNs, with their information, are stored on disk because only a few operations request them.

The CN cache must manage data about objects. For each object it should store the oid (as key), a status (v = 'valid', i = 'invalid'), the serial and the data. Before each request to the master, the CN must check whether the object exists in its cache and whether it is valid, before doing a transaction or a load.

The CN must keep a map of SNs: storage node id -> ipv4 address, port. When the MN answers the CN, it sends back only the SN ids. If the client does not have the mapping for an SN id, it requests the information from the MN with the getSNInfo() method and stores it for later use. The CN and SN addresses that are stored are ipv4 addresses.

Protocol overview

Start

When the MN starts, it generates its UUID, or uses the one stored on disk if it has one, and then waits for connections from SNs and CNs.

When a CN starts, it generates an id and sends it to the MN. The MN answers with information which contains Name, supportsVersion, supportsUndo, supportsTransactionalUndo, ReadOnly and ExtensionsMethod. The CN stores this information in its cache.

When an SN starts for the first time, it generates a UUID and sends it to the MN to identify itself. The MN then answers with its own UUID and records the storage information (UUID, ipv4 address and port). If it is not the first start, the SN already has a UUID stored on disk, so it identifies itself to the MN with this UUID, and the MN then sends it the information it needs to synchronize its data with the other SNs.
When the data are synchronized, the SN sends a message to the MN to tell it that it is ready to work.

The best way to start the system is to start the MN first, then the SNs and then the CNs. To allow other start orders, a running MN should reply with a 'temporary failure' message to a CN if no SN, or not enough SNs, are running. If the MN is not running, the CN or SN must retry the connection until an MN is running.

Running

Methods like getSerial(), lastTransaction() and new_oid() only make a request to the MN. The serial and the last transaction id are read straight from the MN cache. The oid is generated by the MN to avoid duplicated ids.

For transactions, as ZODB manages transactions, there is an abort only if all SNs fail. So we can send all the data and do the whole transaction at the same time. At tpc_begin(), the CN gets a new transaction id and a list of SN ids from the MN. If the client does not know an SN id, it gets the information (address and port) about that SN from the MN. At store(), the CN stores objects in a temporary buffer. At tpc_vote(), the CN sends all the data to the SNs in a single packet and gets the transaction id back, then sends data to the MN to update the index and check the sanity of the transaction (see further below for an explanation), and then confirms to ZODB. At tpc_finish(), the CN sends a checkCache() message with the list of oids to invalidate these objects in the other clients' caches. Once the client list has been requested, the CN stores it in its cache for future use (we suppose that the list of CNs is constant most of the time); if a CN is added or deleted, the master sends a message to all CNs to update their list. If all SNs fail, the CN aborts the transaction. If only some SNs fail, the CN goes on with the transaction, finishes it normally and just notifies the MN of the SNs which failed.

The undoInfo() method, the same as undoLog() which is deprecated, requests information about a given range of transactions. This information is stored on the SNs. So the CN asks the MN for the list of transactions and their SNs, then asks the SNs for the information.
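The commit rule described above for tpc_vote() (abort only when every SN fails, otherwise go on and report the failed SNs to the MN) can be sketched as follows. This is a minimal illustration, not the prototype's code: the function name `vote_outcome` and the shape of `results` are hypothetical, and treating an empty SN list as an abort is an assumption.

```python
def vote_outcome(results):
    """Decide what the CN does after sending a transaction to its SNs.

    `results` maps an SN id to True when that SN confirmed the store and
    to False when it failed (hypothetical representation).
    Returns a pair: the decision ("commit" or "abort") and the list of
    failed SN ids that the CN should report to the MN with failure().
    """
    failed = [sn_id for sn_id, ok in results.items() if not ok]
    # Abort only if every SN failed (an empty SN list is treated as a
    # failure too, which is an assumption of this sketch).
    if len(failed) == len(results):
        return "abort", failed
    return "commit", failed
```

With one failed SN out of two, the decision is "commit" and the failed SN is still returned so that the CN can notify the MN.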
The undo() method is deprecated; instead ZODB uses undoTransaction(), which undoes a transaction in the context of a transaction. The CN asks the MN for the SNs of the given transaction, then undoes the transaction on these SNs and sends a checkCache message to all other clients.

For the history() method, the information requested is the timestamp from the serial, user_name, description, serial, information and size. This information is stored on the SNs and is requested for a given number of entries. So the CN asks the MN for the SNs holding the requested number of serials of the object, then asks these SNs for the information.

For load() and loadSerial(), the CN asks the MN for the SNs of the given oid (and serial), and then asks these SNs for the data. The getSize() method needs all SNs, so the CN requests the list of all SNs from the MN and asks each SN for its size.

For all operations, if an SN fails, the CN must notify the MN of the failure and go on with another SN if possible. The MN must then check the SN and mark it as unreliable.

If a CN is disconnected or a new CN is connected, the MN notifies all other CNs of the new list of CNs asynchronously. We do the same for the SNs.

Transaction conflict

As we do not use versions, the transaction id and the serial number of each object in the transaction are the same at the end of a transaction. In order to resolve conflicts, when the CN commits a transaction (endTrans) on the MN, it sends the old serial number of each object with its oid. The MN checks them: if a serial is not the same as the last serial stored for this object, the transaction was done with an old version of the object, so the transaction is not sane. The MN then sends back an error message to the CN, which must undo the transaction and raise an exception to ZODB. This must be done when tpc_vote is called by ZODB and not when tpc_finish is called, because ZODB does not care about the return value of tpc_finish. So tpc_finish is only used to send the checkCache message to the other CNs.
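The sanity check that the MN performs at endTrans can be sketched as below. It is only an illustration under stated assumptions: the function name `check_transaction`, the dictionary-based index and the decision to accept oids that the MN does not know yet (new objects) are hypothetical. The return codes 0 (success) and 6 (transaction not valid) are the ones defined in this document.

```python
SUCCESS = 0
TRANSACTION_NOT_VALID = 6


def check_transaction(index, tid, old_serials):
    """Validate a transaction on the MN and update its index.

    `index` maps oid -> last serial known by the MN.
    `old_serials` maps oid -> serial the client based its changes on.
    The index is updated only if every object is still at the serial
    the client saw; the new serial of each object is then the tid.
    """
    for oid, old_serial in old_serials.items():
        # Assumption: an oid missing from the index is a new object
        # and cannot conflict.
        if oid in index and index[oid] != old_serial:
            return TRANSACTION_NOT_VALID
    for oid in old_serials:
        index[oid] = tid
    return SUCCESS
```

A second client that still holds serial 10 for an object already committed at serial 20 gets return code 6, and the index is left untouched.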
Shutdown

When a CN shuts down, there is nothing to do on the CN itself. It just sends a disconnection message with its client id to the MN, which removes the client from its list and sends a message to all other CNs to make them remove the CN from their lists.

When an SN shuts down, it waits for all transactions on itself to be finished, then sends a disconnection message to the MN, which marks the SN as unreliable but still keeps it in its list for a future restart.

If the MN has to shut down and there is no replicated MN, it sends a message to all SNs and CNs to make them terminate their processes, waits until they answer that they have finished, and then shuts down. If there is a replicated MN, the MN sends the id, address and port of the replicated MN and waits for all CNs and SNs to reply that they have finished their work.

List of methods

Methods called by CN on MN

    getObjectByOid(oid) -- return SNs and serial for the given object
    getObjectBySerial(oid, serial) -- return SNs for the given object with the serial
    getSerial(oid) -- return serial number for the given oid
    getLastTransaction() -- return last transaction id
    getOid(n) -- return n new oids
    clientStart(id) -- return list of information: Name, supportsVersion, supportsUndo, supportsTransactionalUndo, ReadOnly and ExtensionsMethod
    getTransSN(first, last) -- return list of transactions between first and last with their SNs
    getObjectHist(oid, length) -- return list of serials for the oid with their SNs
    undoTrans(tid) -- mark transaction as undone and return SNs for the given transaction
    beginTrans() -- return tid and SNs
    endTrans(tid, oid1, oid2...) -- update cache and return confirmation
    getAllSN() -- return list of all SNs
    getAllCN() -- return list of all CNs
    failure(SN) -- mark SN as unreliable
    getSNInfo(SNid) -- return ip address and port for the given storage id
    clientClose() -- close connection to MN

Methods called by CN on CN

    checkCache(oid1, oid2...) -- verify cache on client

Methods called by CN on SN

    transaction(tid, txn.user, txn.desc, txn.ext, object1, object2...) -- store data in DB and return confirmation or failure
    undo(tid) -- undo the given transaction and return list of oids
    histInfo(oid, serial) -- return history information for the given object
    undoInfo(tid) -- return information for the given transaction
    load(oid, serial) -- return data for the given oid and serial (used by both load() and loadSerial())
    getSize() -- return size of storage file system

Methods called by SN on MN

    storageClose(id) -- send disconnection message to MN
    storageStart(id) -- send id to MN
    storageReady(id) -- tell MN that SN is ready

Methods called by MN on SN and CN

    masterClose() -- ask all CNs and all SNs to finish their processes, wait for their responses and then shut down
    masterChange(MNdata) -- send information to all SNs and all CNs to make them change their MN

Methods called by MN on CN

    addSN(id, addr, port) -- send information for new SN (id, ipv4 address, port)
    addCN(id, addr, port) -- send information for new CN (id, ipv4 address, port)
    delSN(id) -- tell CN to delete SN from its list
    delCN(id) -- tell CN to delete CN from its list

Methods id definition

    getObjectByOid      1
    getObjectBySerial   2
    getSerial           3
    getLastTransaction  4
    getOid              5
    getTransSN          6
    getObjectHist       7
    undoTrans           8
    beginTrans          9
    endTrans            10
    getAllSN            11
    getAllCN            12
    addSN               13
    addCN               14
    delSN               15
    delCN               16
    getSNInfo           17
    failure             18
    checkCache          19
    transaction         20
    undo                21
    histInfo            22
    undoInfo            23
    load                24
    getSize             25
    clientClose         26
    storageClose        27
    masterClose         28
    storageStart        29
    storageReady        30
    clientStart         31
    masterChange        32
    unreliableStorage   33

If the 15th bit of the ID is set, the message is a return message. If not, it is a request message.

Return code definition

    0 success
    1 temporary failure (wait for SN to be ready, or MN is going to shut down)
    2 oid not found
    3 serial not found
    4 tid/transaction not found
    5 abort transaction (SN cannot commit the transaction)
    6 transaction not valid (sent by MN at endTrans if transaction is not sane)

Protocol for connection layer

All integers are in network byte order.
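As an illustration of the two rules above (a flag bit in the method id marks a return message, and integers travel in network byte order), here is a minimal sketch using Python's struct module. The text does not say whether "15th bit" counts from 0 or from 1; this sketch assumes it is the most significant bit of the 2-byte id (mask 0x8000). The names `REPLY_BIT`, `encode_method_id` and `decode_method_id` are hypothetical.

```python
import struct

# Assumption: the "15th bit" is the top bit of the 16-bit method id.
REPLY_BIT = 0x8000


def encode_method_id(method_id, is_reply):
    """Pack a method id as a 2-byte unsigned integer, network byte order."""
    value = method_id | (REPLY_BIT if is_reply else 0)
    return struct.pack("!H", value)


def decode_method_id(raw):
    """Return (method_id, is_reply) from the first 2 bytes of a header."""
    (value,) = struct.unpack("!H", raw[:2])
    return value & (REPLY_BIT - 1), bool(value & REPLY_BIT)
```

For example, beginTrans has id 9, so under this assumption its reply would carry the bytes 0x80 0x09 in the method id field.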
Header for each request message:

    2 byte unsigned integer -- method id
    2 byte array of character -- flags (for future use)
    4 byte unsigned integer -- message data length

Header for each return message:

    2 byte unsigned integer -- method id
    2 byte array of character -- flags (for future use)
    4 byte unsigned integer -- message data length
    2 byte unsigned integer -- return code
        if return code = 0 : success
        if return code != 0 : failure, each failure has its own return code (to be defined)

When there is a failure, the message format for data is:

    4 byte unsigned integer -- error message length
    ? byte array of character -- error message

Methods details

All methods beginning with 'return' are the answers to the corresponding requests.

getObjectByOid(oid)
    8 byte unsigned integer -- object oid

returnGetObjectByOid()
    8 byte unsigned integer -- object serial
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...

getObjectBySerial(oid, serial)
    8 byte unsigned integer -- object oid
    8 byte unsigned integer -- object serial

returnGetObjectBySerial()
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...

getSerial(oid)
    8 byte unsigned integer -- object oid

returnGetSerial()
    8 byte unsigned integer -- object serial number

getLastTransaction()
    nothing

returnGetLastTransaction()
    8 byte unsigned integer -- transaction tid

getOid(n)
    2 byte unsigned integer -- number of new oid wanted : n

returnGetOid()
    2 byte unsigned integer -- number of oid
    8 byte array of 8 integer -- first oid
    8 byte array of 8 integer -- second oid
    ...

clientStart()
    UUID_LEN byte array of character -- client id
    IP_LEN byte array of character -- client ip
    2 byte unsigned integer -- client server port

returnClientStart()
    2 byte unsigned integer -- name length
    ? byte array of character -- name
    2 byte unsigned integer -- supportVersion
    2 byte unsigned integer -- supportUndo
    2 byte unsigned integer -- supportTransUndo
    2 byte unsigned integer -- readOnly
    2 byte unsigned integer -- extension length
    ? byte array of character -- extension

getTransSN(first, last)
    2 byte unsigned integer -- first transaction position
    2 byte unsigned integer -- last transaction position

returnGetTransSN()
    2 byte unsigned integer -- number of transactions
    8 byte unsigned integer -- first transaction tid
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...
    8 byte unsigned integer -- second transaction tid
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...

getObjectHist(oid, length)
    8 byte unsigned integer -- object oid
    2 byte unsigned integer -- length of history

returnGetObjectHist()
    2 byte unsigned integer -- number of serial records
    8 byte unsigned integer -- first serial
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...
    8 byte unsigned integer -- second serial
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...

undoTrans(tid)
    8 byte unsigned integer -- transaction tid

returnUndoTrans()
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...

beginTrans(tid)
    8 byte unsigned integer -- transaction tid

returnBeginTrans()
    8 byte array of 8 integer -- transaction tid
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...
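To make the layout concrete, the following sketch builds and parses a returnBeginTrans() message with Python's struct module. Several points are assumptions that the text above does not settle: UUID_LEN is taken to be 16 bytes, the "message data length" is taken to count only the bytes after the 10-byte return header, and the tid is read as one big-endian 64-bit integer. The helper names are hypothetical.

```python
import struct

UUID_LEN = 16  # assumption: the document does not give the value

# Return header: method id, flags, data length, return code.
REPLY_HEADER = struct.Struct("!H2sIH")
# Start of the returnBeginTrans body: tid, number of storage nodes.
BEGIN_TRANS_PREFIX = struct.Struct("!QI")


def pack_begin_trans_reply(method_id, tid, sn_ids):
    """Build a successful returnBeginTrans message (return code 0)."""
    body = BEGIN_TRANS_PREFIX.pack(tid, len(sn_ids)) + b"".join(sn_ids)
    header = REPLY_HEADER.pack(method_id, b"\x00\x00", len(body), 0)
    return header + body


def parse_begin_trans_reply(message):
    """Return (method_id, return_code, tid, list of SN ids)."""
    method_id, _flags, length, code = REPLY_HEADER.unpack_from(message, 0)
    start = REPLY_HEADER.size
    body = message[start:start + length]
    tid, count = BEGIN_TRANS_PREFIX.unpack_from(body, 0)
    offset = BEGIN_TRANS_PREFIX.size
    sn_ids = [body[offset + i * UUID_LEN:offset + (i + 1) * UUID_LEN]
              for i in range(count)]
    return method_id, code, tid, sn_ids
```

The "!" prefix in the format strings selects network byte order and disables padding, so the header is exactly 10 bytes and the fixed part of the body exactly 12 bytes.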
endTrans(tid, oid1, serial1, oid2, serial2,...)
    8 byte unsigned integer -- transaction tid
    4 byte unsigned integer -- number of storages
    UUID_LEN byte array of character -- first storage id
    UUID_LEN byte array of character -- second storage id
    ...
    4 byte unsigned integer -- number of oid
    8 byte unsigned integer -- first object oid
    8 byte unsigned integer -- first object old serial
    8 byte unsigned integer -- second object oid
    8 byte unsigned integer -- second object old serial
    ...

returnEndTrans()
    only a message with either the success code or an error code

getAllSN()
    nothing

returnGetAllSN()
    4 byte unsigned integer -- number of storage nodes
    UUID_LEN byte array of character -- first storage id
    IP_LEN byte array of character -- first storage ip
    2 byte unsigned integer -- first storage port
    UUID_LEN byte array of character -- second storage id
    ...

getAllCN()
    nothing

returnGetAllCN()
    4 byte unsigned integer -- number of clients
    UUID_LEN byte array of character -- first client id
    IP_LEN byte array of character -- first client address
    2 byte unsigned integer -- first client port
    UUID_LEN byte array of character -- second client id
    IP_LEN byte array of character -- second client address
    2 byte unsigned integer -- second client port
    ...

failure(SN)
    UUID_LEN byte array of character -- storage id

checkCache(oid1, serial1, oid2, serial2, ...)
    8 byte unsigned integer -- serial number
    4 byte unsigned integer -- number of oid
    8 byte unsigned integer -- first oid
    8 byte unsigned integer -- second oid
    ...

transaction(tid, txn.user, txn.desc, txn.ext, object1, object2...)
    8 byte unsigned integer -- transaction tid
    2 byte unsigned integer -- txn.user length
    2 byte unsigned integer -- txn.desc length
    2 byte unsigned integer -- txn.ext length
    ? byte array of character -- txn.user
    ? byte array of character -- txn.desc
    ? byte array of character -- txn.ext
    4 byte unsigned integer -- number of objects in transaction
    8 byte unsigned integer -- first object id
    8 byte unsigned integer -- first object serial
    8 byte unsigned integer -- first object data length
    ? byte array of character -- first object data
    8 byte unsigned integer -- second object id
    8 byte unsigned integer -- second object serial
    8 byte unsigned integer -- second object data length
    ? byte array of character -- second object data
    ...

returnTransaction()
    only check the return code

undo(tid)
    8 byte unsigned integer -- transaction tid

returnUndo()
    4 byte unsigned integer -- number of oid
    8 byte unsigned integer -- first oid
    8 byte unsigned integer -- second oid
    ...

histInfo(oid, serial)
    8 byte unsigned integer -- object oid
    8 byte unsigned integer -- object serial

returnHistInfo()
    2 byte unsigned integer -- time length
    ? byte array of character -- time
    2 byte unsigned integer -- user length
    ? byte array of character -- user
    2 byte unsigned integer -- desc length
    ? byte array of character -- desc
    8 byte unsigned integer -- serial
    8 byte unsigned integer -- object size

undoInfo(tid)
    8 byte array of character -- transaction tid

returnUndoInfo()
    2 byte unsigned integer -- time length
    ? byte array of character -- time
    2 byte unsigned integer -- user length
    ? byte array of character -- user
    2 byte unsigned integer -- desc length
    ? byte array of character -- desc
    2 byte unsigned integer -- id length
    ? byte array of character -- id

load(oid, serial)
    8 byte unsigned integer -- object oid
    8 byte unsigned integer -- object serial

returnLoad()
    8 byte unsigned integer -- data length
    ? byte array of character -- data

getSize()
    nothing

returnGetSize()
    8 byte unsigned integer -- size of storage data

storageStart()
    UUID_LEN byte array of character -- storage id
    IP_LEN byte array of character -- storage ip
    2 byte unsigned integer -- port

returnStorageStart()
    UUID_LEN byte array of character -- master id
    2 byte unsigned integer -- status

storageReady()
    UUID_LEN byte array of character -- storage id
    8 byte unsigned integer -- number of transactions
    8 byte unsigned integer -- first transaction id
    8 byte unsigned integer -- second transaction id
    ...
    8 byte unsigned integer -- number of object-serial pairs
    8 byte array of character -- first object oid
    8 byte array of character -- first object serial
    8 byte array of character -- second object oid
    8 byte array of character -- second object serial
    ...

masterClose()
    UUID_LEN byte array of character -- master id

returnMasterClose()
    UUID_LEN byte array of character -- node id

masterChange()
    UUID_LEN byte array of character -- replicated master id
    IP_LEN byte array of character -- replicated master address
    2 byte unsigned integer -- replicated master port

returnMasterChange()
    UUID_LEN byte array of character -- storage id

clientClose(id)
    UUID_LEN byte array of character -- client id

storageClose(id)
    UUID_LEN byte array of character -- storage id

getSNInfo(id)
    2 byte unsigned integer -- length of storage id
    ? byte array of character -- storage id

returnGetSNInfo()
    2 byte unsigned integer -- length of storage id
    ? byte array of character -- storage id
    2 byte unsigned integer -- length of storage address
    ? byte array of character -- storage address
    2 byte unsigned integer -- storage port

addSN(id, address, port)
    UUID_LEN byte array of character -- storage id
    IP_LEN byte array of character -- storage address
    2 byte unsigned integer -- storage port

addCN(id, address, port)
    UUID_LEN byte array of character -- client id
    IP_LEN byte array of character -- client address
    2 byte unsigned integer -- client port

delSN(id)
    UUID_LEN byte array of character -- storage id

delCN(id)
    UUID_LEN byte array of character -- client id
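As a closing illustration, the sketch below shows how a CN might apply addSN and delSN message bodies to its local map of storage nodes (storage node id -> ipv4 address, port). The values of UUID_LEN and IP_LEN are not given in this document: 16 bytes and a 4-byte packed IPv4 address are assumptions made only for this example, and the function names are hypothetical.

```python
import socket
import struct

UUID_LEN = 16  # assumption
IP_LEN = 4     # assumption: packed IPv4 address, as made by inet_aton

# Body of addSN / addCN: node id, address, port (network byte order).
ADD_NODE = struct.Struct("!%ds%dsH" % (UUID_LEN, IP_LEN))
# Body of delSN / delCN: node id only.
DEL_NODE = struct.Struct("!%ds" % UUID_LEN)


def apply_add_sn(sn_map, body):
    """Record a new SN in the CN map: id -> (dotted address, port)."""
    sn_id, raw_ip, port = ADD_NODE.unpack(body)
    sn_map[sn_id] = (socket.inet_ntoa(raw_ip), port)


def apply_del_sn(sn_map, body):
    """Remove an SN from the CN map; unknown ids are ignored."""
    (sn_id,) = DEL_NODE.unpack(body)
    sn_map.pop(sn_id, None)
```

The same two structures would serve for addCN and delCN, since their bodies have the same layout.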