Currently Genezzo is a single-process database -- only one process on a single server may access the database files at a time. The current Genezzo codebase also does not implement transaction rollback of data already written to disk. The Clustered Genezzo project will add support for multi-server multi-process access. The multi-process support will be specifically designed to support shared data clusters. Transaction rollback support will also be added. These additions will be done via Genezzo's Havok and SysHook extension mechanisms, so little modification of the base code will be necessary.
This implementation of Clustered Genezzo will rely heavily on two outside components: a Cluster File System and a Distributed Lock Manager. Initial choices for these components have been made to enable development. It is expected these components will later be replaced, either with other outside components or with new components developed as part of the Clustered Genezzo project.
The remainder of this document discusses the details of the design and proposed implementation of Clustered Genezzo. Many notes detail alternative design decisions and places for further improvement.
A shared data cluster with a topology like that shown below provides numerous benefits when deploying an application like the Genezzo database. These benefits are in the areas of scalability, high availability, and, most recently, affordability. Several of these benefits arise because every server can equally access every shared disk. The cluster scales in compute power by adding servers, and in capacity by adding disks. The data stored by the cluster is highly available because single server failures don't prevent access to any of the data, and because the SAN disks can easily be set up in a RAID configuration to guard against disk failures. Finally, clusters built with commodity AMD/Intel processors, ATA/SATA hard drives, and ethernet adapters and switches are very affordable. The major drawback to shared data clusters is the greater complexity of the operating system(s) and applications required to utilize them. The holy grail at both the operating system and database level is the "single system image": the illusion presented to the higher-level applications and the users that the cluster is a single monolithic system whose underlying hardware complexity can safely be ignored.
The configuration of the Genezzo test cluster is discussed at http://eric_rollins.home.mindspring.com/genezzo/cluster.html.
One major concern with any file system will be the implementation of the fsync system call. When implementing database transactions it is important to guarantee that data has really been written to disk, and fsync is used to force all cached data to disk. When buffers are passed between processes on different machines via the disk, another process may read an inconsistent block due to a partially completed write. This is quite possible when database blocks are larger than file system blocks. We can detect this failure case using a checksum, and the block will be reread if a mismatch is found.
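As a concrete illustration, a block read with checksum verification might look like the sketch below. It assumes a 4 KB block whose last four bytes hold a checksum of the rest of the block; the layout, constants, and function names are illustrative rather than the actual Genezzo buffer cache code.

    use strict;
    use warnings;

    use constant BLOCKSIZE   => 4096;
    use constant MAX_RETRIES => 5;

    # Cheap core-Perl checksum over the block body; the writer is assumed
    # to store the same value in the last four bytes of the block.
    sub block_checksum { return unpack("%32N*", $_[0]) }

    sub read_block_checked {
        my ($fh, $blocknum) = @_;
        for my $try (1 .. MAX_RETRIES) {
            sysseek($fh, $blocknum * BLOCKSIZE, 0) or die "seek: $!";
            sysread($fh, my $buf, BLOCKSIZE) == BLOCKSIZE or die "short read: $!";
            my $body   = substr($buf, 0, BLOCKSIZE - 4);
            my $stored = unpack("N", substr($buf, BLOCKSIZE - 4, 4));
            # A mismatch usually means we raced with a partially completed
            # write from another node, so simply reread.
            return $buf if block_checksum($body) == $stored;
        }
        die "block $blocknum failed checksum after " . MAX_RETRIES . " reads";
    }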
In the past clustered databases (such as Oracle Parallel Server or RAC) have been implemented utilizing one database instance per machine, where each instance has its own internal lock manager. With row level locking the lock information has been stored directly in the data blocks with the rows. The DLM has been used to implement cache coherence between database instances on different machines, not database transaction-level locks. In our case we will not initially be implementing a separate instance-level (intra-machine) lock manager. Instead the DLM will provide all locking services even if only a single machine is running.
Early tests of OpenDLM's scalability have found it to run out of locks at around 360,000 locks. Only 350MB out of 1GB memory had been consumed, so the issue was not simple memory exhaustion. Initially we will restrict our usage to 100,000 locks. With one lock per page this would restrict the size of an update or select to 400MB, assuming a 4K database page size. Instead we will hash the lock names into a space of 100,000 locks, trading off the possibility of false lock blocking for a larger potential update size.
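The folding of per-block lock names into the smaller lock space could be as simple as the sketch below; the hash function and name format are assumptions, not the actual implementation.

    use strict;
    use warnings;

    use constant LOCK_SPACE => 100_000;

    sub block_lock_name {
        my ($fileno, $blockno) = @_;
        # unpack "%32C*" is a cheap core-Perl checksum over the name bytes.
        # Two different blocks may fold to the same lock, causing false
        # blocking but never incorrect results.
        my $h = unpack("%32C*", "BLK-$fileno-$blockno") % LOCK_SPACE;
        return "BLK-$h";
    }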
There may also be issues with the performance of OpenDLM. In testing it took 1.9 seconds to obtain 10,000 locks, or 5263 locks/second. In the 100,000 lock limit case above it would take 19 seconds to lock the entire database. This is probably acceptable.
While OpenDLM provides many complex locking services, we will initially utilize a very simple set. We will not utilize the async APIs, avoiding issues with callback synchronization in Perl. By utilizing a simple set it should be easier to move to a different DLM (or write our own) in the future.
Each Genezzo process is started with a globally unique process id. Currently it is allocated by requesting a lock [SVR-processid] in exclusive nowait mode, and incrementing the processid and retrying if the processid is already in use (this N^2 technique may need to be revisited with a large number of processes). All processes share a single undo file, but have distinct ranges of blocks preallocated to each process. Only the owning process reads or writes its portion of the undo file.
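That allocation loop might look like the following sketch, assuming a thin hypothetical wrapper dlm_lock_nowait($name, $mode) around the DLM that returns true when the lock is granted and false when it is already held by another process.

    sub allocate_process_id {
        my $pid = 0;
        while (1) {
            # EXCLUSIVE, nowait: success means no other process owns this id.
            # dlm_lock_nowait is a hypothetical DLM wrapper (see text).
            return $pid if dlm_lock_nowait("SVR-$pid", 'EX');
            $pid++;    # id taken; try the next one
        }
    }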
The undo file lists the current status of each process (CLEAR, COMMITTED, ROLLEDBACK, PENDING), with one block per process. A range of blocks per process lists the filenumber-blocknumber pairs of the blocks written to by the current transaction. Before each data block is written, its before image is copied to the tail of its associated data file, to a slot at filelength + blocknum.
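The before-image copy might look like the sketch below, assuming the data file's length in blocks is sampled when the file is opened and that blocks are 4 KB; the names are illustrative only.

    use strict;
    use warnings;
    use IO::Handle;    # for $fh->sync (fsync)

    use constant BLOCKSIZE => 4096;

    sub write_before_image {
        my ($fh, $file_len_blocks, $blocknum, $old_block) = @_;
        my $slot = $file_len_blocks + $blocknum;    # tail slot for this block
        sysseek($fh, $slot * BLOCKSIZE, 0) or die "seek: $!";
        syswrite($fh, $old_block, BLOCKSIZE) == BLOCKSIZE or die "write: $!";
        # The before image must be durable before the data block itself is
        # overwritten in place.
        $fh->sync or die "fsync: $!";
    }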
Blocks written by a process but not yet committed contain that process id. When a process failure occurs, its block locks will be released. If a block is read containing another process id, the state of that process is read and the block is replaced with the before image if necessary.
Note only before-images of blocks are being generated, and they are only being retained for the duration of the transaction. No long-lived redo or log file is generated. Cluster-wide recovery with multiple redo log files is quite complex, and it is believed many customers cannot accept the downtime implied in a restore-backups + mount-offline-logs + roll-logs-forward style recovery. Instead we will rely on the availability of cheap SATA RAID hardware to cover the media failure case. Simple process death or power failure is covered by the undo file scheme.
An online hot-backup facility can be added via a tablespace or cluster-wide lock which must be held in SHARED mode to enable writes to the tablespace. A backup process would obtain it in EXCLUSIVE mode, locking out all writers for the duration of the backup.
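A minimal sketch of that backup protocol, using hypothetical blocking wrappers dlm_lock / dlm_unlock around the DLM:

    # Writers hold the tablespace lock SHARED for the duration of their
    # writes; the backup takes it EXCLUSIVE to drain and exclude them.
    # dlm_lock/dlm_unlock are hypothetical DLM wrappers (see text).
    sub hot_backup {
        my ($tablespace, $copy_sub) = @_;
        dlm_lock("TS-$tablespace", 'EX');    # blocks until all writers release
        eval { $copy_sub->() };              # e.g. copy the data files to backup media
        my $err = $@;
        dlm_unlock("TS-$tablespace");
        die $err if $err;
    }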
Alternately all transactions could be streamed to a remote log server. This is similar to how many databases accomplish replication.
blockType
fileNumber
blockNumber
currentProcess (or none)
...metadata...
...row data...
checksum
If currentProcess is not none, a transaction is in progress against this block and the block has been modified. currentProcess indicates the slot in the undo file; if that slot contains a commit record the transaction was committed, and currentProcess can safely be set to none. Otherwise the transaction has been rolled back (or its process died before committing), and the current block needs to be replaced with the before image from the tail of the data file.
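One possible encoding of these header fields, sketched as a Perl pack/unpack pair; the real Genezzo block layout and field widths may well differ.

    use strict;
    use warnings;

    use constant PID_NONE => 0xFFFFFFFF;    # sentinel for "no current process"

    sub pack_block_header {
        my ($blocktype, $fileno, $blockno, $current_process) = @_;
        # C = blockType, N = fileNumber, N = blockNumber, N = currentProcess
        return pack("C N N N", $blocktype, $fileno, $blockno,
                    defined $current_process ? $current_process : PID_NONE);
    }

    sub unpack_block_header {
        my ($header) = @_;
        my ($blocktype, $fileno, $blockno, $pid) = unpack("C N N N", $header);
        $pid = undef if $pid == PID_NONE;
        return ($blocktype, $fileno, $blockno, $pid);
    }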
Note no global System Commit Number or Log Sequence Number is maintained. This eliminates the need for this type of global sequence generation algorithm.
As SQL statements are processed, blocks will be requested from the buffer cache. Each request will form a lock name [BLK-fileNumber-blockNumber]. The process first determines whether it already holds a lock on this block. If not, a blocking SHARED mode lock request is made.
def blockReadRequest(fileNumber, blockNumber)
    form lock name [BLK-blockNumber mod 100000]
    if(not lock already held by process)
        blocking SHARED lock request
        add lock to hash and list of locks held by process
    end
    checksumMatch = false
    retries = 0
    // all block reads follow this checksum-retry logic;
    // it is only listed here
    while(not checksumMatch and retries < maxRetryCount)
        read block from disk
        compute checksumMatch for block
        retries = retries + 1
    end
    if(not checksumMatch)
        raise BlockCorruptionException
    end
    if(block PID not none (and not current process!))
        promote block lock to EXCLUSIVE
        // Safe to read state, since we were able to obtain EX block lock
        get process state for PID
        if state == COMMITTED
            update block to clear PID
            write updated block to before image tail of file
            write updated block to file
        else if state == ROLLEDBACK || state == PENDING
            replace block with before image from tail of file
        end
    end
    proceed with regular read
end

The Genezzo push-hash implementation automatically detects any attempts to write to a block. The first time a block is written in a transaction, the SHARED lock must be promoted to EXCLUSIVE and the old contents of the block must be written to the undo file.
def blockWriteRequest(fileNumber, blockNumber)
    if(block on list of those already written to undo)
        return
    end
    form lock name
    find lock in hash of locks held by process
    if(not lock already held by process in EXCLUSIVE mode)
        blocking EXCLUSIVE lock conversion request
        update local record of lock state
    end
    append current (old) contents of block to tail of data file
    fsync data file
    add fileno-blockno to process undo block in undo file
        (written to two blocks for recoverability)
    fsync undo file
    current process id will be added to PID in metadata portion of block
        prior to its being written out to disk
end

At this point the block may be updated and safely written out to disk at any time by the buffer cache. It may LRU in and out of the buffer cache repeatedly, so we do support updates larger than the buffer cache.
At commit time the buffer cache must be written out to disk (same as single-process Genezzo). Then the process status in the undo file is set to COMMITTED. Each of the updated data blocks then has its currentProcess field set to none. Finally the undo file is truncated (set to empty), and the locks are released.
def commit()
    write all dirty buffer cache buffers to disk
    fsync data file(s)
    write COMMITTED status to process slot in undo file
    fsync undo file
    // At this point we have now persistently marked the transaction as
    // COMMITTED. Prior to this point recovery will roll back the
    // transaction. After this point recovery will complete the transaction.
    read undo slots in undo file sequentially to determine list of updated blocks
    for each block listed in undo slots
        read block
        update block to clear PID
        write updated block to before image tail of file
        write updated block to file
    end
    fsync data file(s)
    write CLEAR status to process slot in undo file
    fsync undo file
    free all data buffers
    release all locks in lock list
end

In the case of a rollback we can use the undo file to back out the changes:
def rollback()
    write ROLLEDBACK status to process slot in undo file
    fsync undo file
    invalidate buffer cache
    for each block listed in undo slots in undo file
        write block from data file tail to corresponding data file location
    end
    fsync data file(s)
    write CLEAR status to process slot in undo file
    fsync undo file
    release all locks in lock list
end
def recoverProcess()
    committed = FALSE
    lookup process status of failed process
    if COMMITTED status found
        committed = TRUE
    end
    for each block listed in undo slots of failed process
        lock block SHARED
        if(PID in block != failed process id)
            continue    // block already recovered by some other process
        end
        promote lock to EXCLUSIVE
        if committed
            read block
            update block to clear PID
            write updated block to before image tail of file
            write updated block to file
        else
            write block from data file tail to corresponding data file location
        end
        release lock (or wait till end)
    end
    fsync data file(s)
    write CLEAR status to process slot of failed process in undo file
    fsync undo file
    return TRUE
end
We currently don't have any facility to support separate space management transactions. If a user transaction rolls back for any reason the nested space management transaction also rolls back. Space management transactions cannot commit prior to the user transaction commit.
The second major buffer cache difficulty is that currently the cache must be invalidated on completion of each transaction, since all the locks are released. One alternative is to store the blockRevisionNumber in the Lock Value Block (a 32-byte user-defined field) provided by OpenDLM. If the value matched the blockRevisionNumber in the buffer cache a read could be avoided. However, writing a new value requires upgrading the lock to EXCLUSIVE mode, which would be difficult with blocks under high contention. The lock would also need to be down-converted to NULL mode rather than released, so the DLM will preserve the value. Finally, with hashing of locks the lock name would also need to be written in the value field and only one value could be retained in a hash-collision case. A second option is to have the server processes retain locks after committing, and instead only release locks when an async callback occurs requesting the lock. This would require async callback programming in Perl.
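A sketch of the read side of this first alternative, assuming a hypothetical wrapper dlm_lock_get_value($name, $mode) that acquires the lock and returns the 32-byte value block contents:

    sub block_needs_reread {
        my ($cached, $lock_name) = @_;
        # dlm_lock_get_value is a hypothetical DLM wrapper (see text).
        my $lvb = dlm_lock_get_value($lock_name, 'SHARED') // ("\0" x 32);
        # Lock name (24 bytes) plus blockRevisionNumber (4 bytes) leaves 4
        # of the 32 value-block bytes unused in this sketch.
        my ($name, $rev) = unpack("Z24 N", $lvb);
        return 1 unless defined $cached;        # not in the local cache
        return 1 unless $name eq $lock_name;    # hash collision: value belongs
                                                # to some other block
        return $cached->{rev} != $rev;          # stale if the revision moved on
    }

As noted above, keeping the revision current would also require converting to EXCLUSIVE on write and down-converting to NULL rather than releasing, which this sketch does not cover.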
File system performance will be strongly influenced by the implementation of the underlying Cluster File System. If it is built directly on raw devices there will not be any other file system buffering, and all reads will hit the disk (or disk cache). Cluster file systems built on top of the operating system file system have additional levels of buffer cache. These levels of cache reduce the need for a large Genezzo buffer cache, but introduce a greater likelihood of fsync problems.
Performance will also be impacted by the lack of serial writes. Traditional enterprise databases have one dedicated disk per database server log. This way each server can perform serial writes. They also perform write-ahead in the log and minimize fsyncs of the data files. We have a single random-accessed undo file shared between all processes on all servers. A dedicated undo disk per process would be cost prohibitive.
A minimal Apache 2.0 process (with mod_perl) consumes 6 MB [need to verify with code sharing, etc.]. The Genezzo code is small, and the total process size will be primarily determined by the size of the buffer cache. With a 4 MB buffer cache the total process would be 10 MB, and a 1 GB server could run, say, 80 Apache Genezzo child processes. This is small compared to the hundreds of user connections supported by commercial databases, but the difference is that we are using a stateless web protocol instead of persistent connections. If a user makes small simple requests and waits tens of seconds between them, then a single Apache process can serve dozens of users.
Note the stateless protocol means user transactions cannot span multiple HTTP requests. Instead some type of server-side stored procedures will be required to bundle multiple SQL statements into a single transaction. Stored procedures would of course be written in Perl. A simple alternative is to create separate mod_perl web pages for each procedure. They could be shared between the Apache servers in the cluster via the Cluster File System. Another alternative is to store the procedure definitions in the database. Large bulk loads would need to be done using a command-line tool (similar to gendba.pl).
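As a sketch of the mod_perl-page approach, the handler below bundles two SQL statements into a single transaction. The database calls assume a DBI-style wrapper ($dbh->do, commit, rollback) and a hypothetical get_genezzo_handle() helper rather than the exact Genezzo::GenDBI API.

    package Genezzo::Proc::TransferFunds;
    use strict;
    use warnings;
    use Apache2::RequestRec ();
    use Apache2::RequestIO ();
    use Apache2::Const -compile => qw(OK SERVER_ERROR);

    sub handler {
        my $r = shift;
        # e.g. /transfer?from=1&to=2&amt=100
        my %arg = map { split /=/, $_, 2 } split /&/, ($r->args // '');
        my $dbh = get_genezzo_handle();    # hypothetical per-process handle
        my $ok = eval {
            $dbh->do("update acct set bal = bal - $arg{amt} where id = $arg{from}");
            $dbh->do("update acct set bal = bal + $arg{amt} where id = $arg{to}");
            $dbh->commit;                  # both statements commit or neither does
            1;
        };
        unless ($ok) {
            $dbh->rollback;
            return Apache2::Const::SERVER_ERROR;
        }
        $r->content_type('text/plain');
        $r->print("transfer complete\n");
        return Apache2::Const::OK;
    }

    1;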
With the locking model described above we still have the potential for "Phantoms". To prevent phantoms in indexed tables we can acquire additional locks on the index pages in positions before or after where phantom rows would be inserted. Preventing phantoms in non-indexed columns is more difficult.