Relational DBs on Key-Value Stores
Over the last several years, we’ve seen the rise of fast, ordered, transactional key-value stores such as leveldb
, rocksdb
, and lmdb
. These are all very cool projects in their own rights, but perhaps even more notable are the projects built on top of them, such as MyRocks, SQLite4, CockroachDB, and Google’s F1.
To better understand how these work, I’ve been hacking together a basic relational database on a key-value store named voodoo. Interestingly, many of the concepts are applicable to everyday usage of a normal SQL database, especially in understanding query plans and indexes.
Starting from the Beginning #
A relational database is centered around the concept of a table
, which is comprised of rows
and columns
. Notably, all tables have a column labeled as a primary key
, which is used to uniquely identify each row. In the example users
table below, id
is the primary key
.
id | username | |
---|---|---|
1 | ian | root@example.com |
2 | alice | alice@security.com |
3 | eve | eve@pwned.com |
If we had a key value store that implemented the interface:
type KV interface {
Get(key []byte) []byte
Set(key []byte, val []byte)
}
we could map each row of the above table onto it by setting keys for each primary key, making sure to include the table name:
'users/1' => 'username: ian, email: root@example.com'
'users/2' => 'username: alice, email: alice@security.com'
'users/3' => 'username: eve, email: eve@pwned.com'
And in doing so, could set and retrieve rows of our users
table by primary key.
Indexes #
What would we do if we wanted to look up users
where email = alice@security.com
? We have no way to see what keys exist if we just have the above interface. Luckily, all of the aforementioned key-value stores implement an interface that looks more like:
type KV interface {
Get(key []byte) []byte
Set(key []byte, val []byte
Walk(prefix []byte) Iterator
}
type Iterator interface {
// get the value at the iterator location
Val() []byte
// get the current key
Key() []byte
// move to the next key
Next() bool
}
which allows us to traverse the keyspace by prefix. So with some code like:
iterator := kv.Walk([]byte("users/"))
for iterator.Next() {
if hasEmail(iterator.Val(), "alice@security.com") {
return iterator.Val()
}
}
We could find the key associated with email: alice@security.com
by iterating through all of the keys until we find a matching row (or all matching rows if the column isn’t marked unique). This is known in RDBMS terminology as sequential scan, something you may have seen in the output of a SQL EXPLAIN
before (probably marked Seq Scan
). Unfortunately, while this strategy will work, it’s not particularly fast. To make this query fast we’ll need what’s known as a secondary index
, a mapping of all emails to the keys in which they belong, akin to the SQL CREATE UNIQUE INDEX users_email_idx ON users(email);
. This is fairly straightforward, changing our keyspace to look like:
'users/1' => 'username: ian, email: root@example.com'
'users/2' => 'username: alice, email: alice@security.com'
'users/3' => 'username: eve, email: eve@pwned.com'
'users_email_idx/root@example.com' => 'users/1'
'users_email_idx/alice@security.com' => 'users/2'
'users_email_idx/eve@pwned.com' => 'users/3'
With this change implemented, our pseudo-go-code to look up alice@security.com
becomes:
primaryKey := kv.Get([]byte("users_email_idx/alice@security.com"))
row := kv.Get(primaryKey)
As one might imagine, this indexing does not come for free, as it both takes up more space on disk, and means we have to insert two keys for every one row we want to write in a table.
The same concept we used to create a single-column secondary index applies to the creation of composite indexes - mapping the SQL CREATE UNIQUE INDEX users_username_email_idx ON users(username, email)
to keys like 'users_username_email_idx/eve/eve@pwned.com' => 'users/3'
.
Wrapup #
The above techniques allow a relatively powerful mapping of a simple relational database onto a key-value store, while not holding us back as we pave the way for more complex features (JOIN
operations, “real” SQL
parsing, executing, and planning, etc.).