gzochi 0.11 is out. Enjoy it in good health.
The major innovation over the previous release is that the client side of the distributed storage engine now releases the locks it requests from the meta server. This wasn't easy to orchestrate, so I want to say a little bit about how it works.
Some context: The distributed storage engine is based on a paper by Tim Blackman and Jim Waldo, and it works a bit like a cache: The client requests an intentional lock (read or write) on a particular key from the server, and if the server grants the client's request, it serves up the value for the key along with a temporary lease - essentially a timeout. For the duration of the lease, the client is guaranteed that its lock intentions will be honored: if it holds a read lock, no other client can obtain a write lock; if it holds a write lock, no other client can obtain a read or write lock.

Within the client, the key is added to a transactional B+tree (a special instance of the in-memory storage engine), and game application threads can execute transactions that access or modify the data in the B+tree just as they would in a single-node configuration. When a transaction completes, new and modified values are transmitted back up to the meta server, but they also remain in the local B+tree for access by subsequent transactions. When the lease for a key expires - and the last active transaction using the affected key either commits or rolls back - its lock is released, and the client must petition the meta server to re-obtain the lock before it can do anything else with that key.
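To make the lease mechanics concrete, here's a minimal sketch of the client-side bookkeeping in Python. All of the names (`Lease`, `LockTable`, `grant`, and so on) are illustrative inventions for this post; the real client is written in C and is considerably more involved.

```python
import time

READ, WRITE = "read", "write"

class Lease:
    """A granted lock intention with an expiration time."""
    def __init__(self, key, intention, duration):
        self.key = key
        self.intention = intention  # READ or WRITE
        self.expires_at = time.monotonic() + duration

    def expired(self):
        return time.monotonic() >= self.expires_at

class LockTable:
    """Tracks the leases the meta server has granted to this client."""
    def __init__(self):
        self.leases = {}

    def grant(self, key, intention, duration):
        # Called when the meta server answers a lock request.
        self.leases[key] = Lease(key, intention, duration)

    def may_read(self, key):
        # A live lease of either intention permits local reads.
        lease = self.leases.get(key)
        return lease is not None and not lease.expired()

    def may_write(self, key):
        # Only a live write lease permits local modification.
        lease = self.leases.get(key)
        return (lease is not None and not lease.expired()
                and lease.intention == WRITE)
```

When `may_read` or `may_write` returns false, the client has to go back to the meta server before it can touch the key again.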
The tricky parts happen at the edges of this lifecycle; that is, when a lock is obtained and when it is released. In both cases, the client's view of available data from the perspective of active transactions must change: when the client obtains a lock, it gains access to a new key-value pair, and when it releases a lock, it loses that access. These changes occur asynchronously with respect to transaction execution: the arrival of a message from the meta server notifies the client that a lock has been granted (or denied), and lock release is governed by a local timeout.

It's tempting to try to implement these processes as additional transactions against the B+tree, such that when a new key is added or an expired key is removed, the modification to the B+tree occurs in a transaction executed alongside whatever other transactions are being processed at the time. Unfortunately, this can lead to contention and even deadlock, since adding or removing keys can force modifications to the structure of the B+tree itself, in the form of node splitting or merging. What to do, then, given that these "system" transactions can't be allowed to fail, since they're responsible for maintaining the integrity of the local cache? You could adjust the deadlock resolution algorithm to always rule in favor of system transactions when choosing a transaction to mark for rollback, but since lock acquisition and release are relatively frequent, this just transfers the pain to "userland" transactions, which would in turn see an undue amount of contention and rollback.
The answer, I think, involves going "around" the transactional store. For newly arriving keys, this is straightforward: When a new key-value pair is transmitted from the meta server as part of a lock acquisition response, don't add it to the B+tree store transactionally; instead, add it to a non-transactional data structure like a hash table that'll serve as a cache. Transactions can consult the cache if they find that the key doesn't exist in the B+tree. The first transaction to modify the value can write it to the B+tree, and subsequent transactions will see this modified value since they check the B+tree before the cache.
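In outline, the read path checks the transactional store before falling back to the non-transactional cache, and writes land only in the transactional store. Here's a sketch, with plain dicts standing in for the transactional B+tree and the hash table of incoming values (the class and method names are mine, not gzochi's):

```python
class TwoLevelStore:
    """Reads prefer the transactional store; misses fall back to the
    cache of values received from the meta server."""
    def __init__(self):
        self.btree = {}  # stand-in for the transactional B+tree
        self.cache = {}  # non-transactional incoming-value cache

    def receive(self, key, value):
        # Lock acquisition response: go around the transactional store.
        self.cache[key] = value

    def get(self, key):
        # The B+tree takes precedence over the incoming-value cache.
        if key in self.btree:
            return self.btree[key]
        return self.cache.get(key)

    def put(self, key, value):
        # The first (and every subsequent) modification goes to the
        # B+tree, so later reads see it ahead of the cached copy.
        self.btree[key] = value
```

Note that once a key has been written to the B+tree, the cached copy is never consulted again for that key; as described below, this precedence rule is exactly what makes releasing locks tricky.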
Orchestrating the removal of expired keys is more challenging. Because the B+tree takes precedence over the cache of incoming values, it's critical that the B+tree reflect key expiration accurately, or else new modifications from elsewhere in the cluster may be ignored, as in the following pathological case:
- Transaction on node 1 obtains write lock on key A, value is stored in cache
- Transaction on node 1 modifies key A, writing new value to the B+tree
- Transaction on node 1 releases lock on key A
- Transaction on node 2 obtains write lock on key A, modifies it, and commits
- Transaction on node 2 releases lock on key A
- Transaction on node 1 obtains read lock on key A, new value stored in cache
- Transaction on node 1 attempts to read key A, but sees old value from B+tree
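The sequence above can be replayed against the naive precedence scheme to see the stale read happen (a toy reproduction, with dicts standing in for the two stores):

```python
# Naive scheme: the B+tree always wins over the incoming-value cache.
btree, cache = {}, {}

def get(key):
    return btree[key] if key in btree else cache.get(key)

# Node 1 obtains a write lock on A; the value arrives in the cache.
cache["A"] = "v1"
# Node 1 modifies key A, writing the new value to the B+tree.
btree["A"] = "v2"
# Node 1 releases the lock, but suppose the stale B+tree entry is
# not removed. Elsewhere, node 2 writes a newer value and commits.
# Node 1 re-obtains a read lock; the fresh value lands in the cache.
cache["A"] = "v3"
# The leftover B+tree entry shadows the fresh value:
assert get("A") == "v2"  # should have been "v3"
```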
After some consideration, I borrowed a page from BigTable and attacked the problem using the concept behind tombstone records. Every record - even a deleted record - written to either the B+tree or the incoming value cache is prefixed with a timestamp. To find the effective value for a key, both the B+tree and the cache must be consulted; the version with the most recent timestamp "wins." In most BigTable implementations (e.g., HBase and Cassandra), there's a scheduled, asynchronous "compaction" process that sweeps stale keys. I didn't want to run more threads, so I prevent stale keys from piling up by keeping a running list of released keys. Once that list's length exceeds a configurable threshold, the next transaction commit or rollback triggers a stop-the-world event in which no new transactions can be initiated, and a single thread sweeps any released keys that haven't since been refreshed. With a judiciously configured threshold, the resulting store performs well, since the sweep is quite fast when only a single transaction is active.
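Here's a sketch of the timestamp scheme resolving the pathological case above. A counter stands in for real timestamps, and the helper names are illustrative, not gzochi's:

```python
import itertools

# A monotonically increasing stamp; stands in for a real timestamp.
clock = itertools.count(1)

TOMBSTONE = object()  # marks a released or deleted key

btree, cache = {}, {}

def stamp(value):
    return (next(clock), value)

def effective_value(key):
    """Consult both stores; the most recently stamped version wins."""
    candidates = [store[key] for store in (btree, cache) if key in store]
    if not candidates:
        return None
    ts, value = max(candidates)  # stamps are unique, so this is by stamp
    return None if value is TOMBSTONE else value

# Replaying the pathological sequence:
cache["A"] = stamp("v1")        # node 1 obtains write lock on A
btree["A"] = stamp("v2")        # node 1 modifies A locally
btree["A"] = stamp(TOMBSTONE)   # node 1 releases the lock: tombstone
cache["A"] = stamp("v3")        # node 1 re-obtains the lock; fresh value
assert effective_value("A") == "v3"  # newer stamp beats the tombstone
```

The sweep described above would then walk both structures and discard any key whose newest version is a tombstone.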
This was a real head-scratcher, and I'm proud of having figured it out despite having skipped every database course during my undergraduate studies. Download my project and try it out for yourself!