Relational DBs on Key-Value Stores

2017-05-01T10:00:50-07:00

Over the last several years, we’ve seen the rise of fast, ordered, transactional key-value stores such as leveldb, rocksdb, and lmdb. These are all very cool projects in their own rights, but perhaps even more notable are the projects built on top of them, such as MyRocks, SQLite4, CockroachDB, and Google’s F1.

To better understand how these work, I’ve been hacking together a basic relational database on a key-value store named voodoo. Interestingly, many of the concepts are applicable to everyday usage of a normal SQL database, especially in understanding query plans and indexes.

Starting from the Beginning

A relational database is centered around the concept of a table, which is composed of rows and columns. Notably, all tables have a column labeled as a primary key, which is used to uniquely identify each row. In the example users table below, id is the primary key.

id	username	email
1	ian	root@example.com
2	alice	alice@security.com
3	eve	eve@pwned.com

If we had a key-value store that implemented the interface:

type KV interface {
    Get(key []byte) []byte
    Set(key []byte, val []byte)
}

we could map each row of the above table onto it by setting keys for each primary key, making sure to include the table name:

'users/1' => 'username: ian, email: root@example.com'
'users/2' => 'username: alice, email: alice@security.com'
'users/3' => 'username: eve, email: eve@pwned.com'

In doing so, we could set and retrieve rows of our users table by primary key.
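As a rough illustration of that mapping, here is a minimal sketch. MemKV, rowKey, and encodeUser are hypothetical names invented for this example, and the map-backed store is only a stand-in: it is neither ordered nor transactional like leveldb, rocksdb, or lmdb.

```go
package main

import "fmt"

// KV is the minimal interface from the text above.
type KV interface {
	Get(key []byte) []byte
	Set(key []byte, val []byte)
}

// MemKV is a hypothetical in-memory stand-in for a real key-value store.
type MemKV struct {
	data map[string][]byte
}

func NewMemKV() *MemKV {
	return &MemKV{data: make(map[string][]byte)}
}

// Get returns nil when the key is absent.
func (m *MemKV) Get(key []byte) []byte { return m.data[string(key)] }

// Set stores a copy of val so later changes by the caller don't leak in.
func (m *MemKV) Set(key []byte, val []byte) {
	m.data[string(key)] = append([]byte(nil), val...)
}

// rowKey builds the "<table>/<primary key>" key used in the text.
func rowKey(table string, id int) []byte {
	return []byte(fmt.Sprintf("%s/%d", table, id))
}

// encodeUser is an assumed, naive row encoding matching the listing above;
// a real database would use a compact binary format.
func encodeUser(username, email string) []byte {
	return []byte(fmt.Sprintf("username: %s, email: %s", username, email))
}

func main() {
	var kv KV = NewMemKV()
	kv.Set(rowKey("users", 1), encodeUser("ian", "root@example.com"))
	kv.Set(rowKey("users", 2), encodeUser("alice", "alice@security.com"))

	// Look a row up by primary key.
	fmt.Printf("%s\n", kv.Get(rowKey("users", 2)))
}
```

Including the table name in the key is what keeps rows from different tables apart in the single shared keyspace.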

Indexes

What would we do if we wanted to look up users where email = alice@security.com? We have no way to see what keys exist if we just have the above interface. Luckily, all of the aforementioned key-value stores implement an interface that looks more like:

type KV interface {
    Get(key []byte) []byte
    Set(key []byte, val []byte)
    Walk(prefix []byte) Iterator
}

type Iterator interface {
    // get the value at the iterator location
    Val() []byte
    // get the current key
    Key() []byte
    // move to the next key
    Next() bool
}

which allows us to traverse the keyspace by prefix. So with some code like:

iterator := kv.Walk([]byte("users/"))
for iterator.Next() {
    if hasEmail(iterator.Val(), "alice@security.com") {
        return iterator.Val()
    }
}

We could find the row associated with email: alice@security.com by iterating through all of the keys until we find a matching row (or all matching rows, if the column isn’t marked unique). In RDBMS terminology this is known as a sequential scan, something you may have seen in the output of a SQL EXPLAIN before (probably marked Seq Scan). Unfortunately, while this strategy works, it’s not particularly fast: in the worst case, every lookup has to read the whole table.

To make this query fast we’ll need what’s known as a secondary index: a mapping from each email to the key of the row it belongs to, akin to the SQL CREATE UNIQUE INDEX users_email_idx ON users(email);. This is fairly straightforward, changing our keyspace to look like:

'users/1' => 'username: ian, email: root@example.com'
'users/2' => 'username: alice, email: alice@security.com'
'users/3' => 'username: eve, email: eve@pwned.com'
'users_email_idx/root@example.com' => 'users/1'
'users_email_idx/alice@security.com' => 'users/2'
'users_email_idx/eve@pwned.com' => 'users/3'

With this change implemented, our pseudo-Go code to look up alice@security.com becomes:

primaryKey := kv.Get([]byte("users_email_idx/alice@security.com"))
row := kv.Get(primaryKey)

As one might imagine, this indexing does not come for free: it takes up more space on disk, and it means we have to write two keys for every row we insert into the table.
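The write-side cost can be sketched as follows. The store type and the insertUser and userByEmail helpers are hypothetical names for this example, and the map is again only a stand-in for a real key-value store. In a real store the two writes would presumably need to go into a single transaction or batch, so that the row and its index entry can never disagree.

```go
package main

import (
	"fmt"
	"strconv"
)

// store is a hypothetical in-memory stand-in for a key-value store.
type store map[string][]byte

func (s store) Get(key []byte) []byte      { return s[string(key)] }
func (s store) Set(key []byte, val []byte) { s[string(key)] = val }

// insertUser writes the row itself plus its email index entry,
// i.e. two keys for every row.
func insertUser(s store, id int, username, email string) {
	pk := []byte("users/" + strconv.Itoa(id))
	s.Set(pk, []byte(fmt.Sprintf("username: %s, email: %s", username, email)))
	s.Set([]byte("users_email_idx/"+email), pk)
}

// userByEmail resolves the index entry to a primary key and then
// fetches the row. It returns nil if no row has that email.
func userByEmail(s store, email string) []byte {
	pk := s.Get([]byte("users_email_idx/" + email))
	if pk == nil {
		return nil
	}
	return s.Get(pk)
}

func main() {
	s := store{}
	insertUser(s, 1, "ian", "root@example.com")
	insertUser(s, 2, "alice", "alice@security.com")

	fmt.Printf("%s\n", userByEmail(s, "alice@security.com"))
	fmt.Println(len(s)) // 4: two keys per row
}
```

This sketch skips the uniqueness check a UNIQUE index implies: before writing, a real implementation would need to reject an insert whose index key already exists.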

The same concept we used to create a single-column secondary index applies to the creation of composite indexes - mapping the SQL CREATE UNIQUE INDEX users_username_email_idx ON users(username, email) to keys like 'users_username_email_idx/eve/eve@pwned.com' => 'users/3'.
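One way to picture the composite key is the sketch below, where indexKey and scanPrefix are hypothetical helpers and a sorted scan over a map stands in for the store's Walk. Because the username comes first in the key, a prefix scan can also answer queries on username alone. Joining values with "/" is an assumption carried over from the examples above; it becomes ambiguous if a value itself contains "/", so a real implementation would need an order-preserving, unambiguous encoding.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// indexKey joins an index name and its column values with "/",
// mirroring the key layout used in the text.
func indexKey(index string, cols ...string) string {
	return index + "/" + strings.Join(cols, "/")
}

// scanPrefix returns all keys starting with prefix, in sorted order.
// It imitates what a prefix Walk over an ordered store would yield.
func scanPrefix(data map[string]string, prefix string) []string {
	var keys []string
	for k := range data {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	const idx = "users_username_email_idx"
	data := map[string]string{
		indexKey(idx, "ian", "root@example.com"):     "users/1",
		indexKey(idx, "alice", "alice@security.com"): "users/2",
		indexKey(idx, "eve", "eve@pwned.com"):        "users/3",
	}

	// Find every index entry for username = eve, whatever the email.
	for _, k := range scanPrefix(data, indexKey(idx, "eve")+"/") {
		fmt.Println(k, "=>", data[k])
	}
}
```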

Wrapup

The above techniques allow a relatively powerful mapping of a simple relational database onto a key-value store, while not holding us back as we pave the way for more complex features (JOIN operations, “real” SQL parsing, executing, and planning, etc.).

Go Dependencies Considered Harmful

2016-08-15T12:00:03-07:00

In the wake of the ongoing vendoring discussions within the Go community, I think that almost all Go projects should be able to rely on having few, mature dependencies, regular builds (almost all CI systems support some sort of cron for regular builds), and good test coverage to ensure they stay on top of their third-party dependencies.

Mature third-party libraries don’t change their APIs drastically or suddenly. Libraries such as github.com/lib/pq and github.com/gorilla/mux that are cornerstones of many large Go projects simply don’t become different libraries with completely different APIs overnight.

Using immature and unstable libraries is a mistake no matter how you look at it - if the library is still under development or not well maintained (i.e. breaks API compatibility every other week), why are you using it for a production application? It’s entirely the developer’s responsibility to review any third party dependency they introduce to their application - you wouldn’t introduce a random SaaS into your codebase without checking it out / asking around about it first, would you?

It’s also more than doable to build large, production applications without using any third-party code at all. For a great example of this, see Netflix’s Rend, a memcached proxy that manages data chunking and L1 / L2 caches. The core of nats, a high performance messaging system, gnatsd, is another great example, with only 3 dependencies outside the main repo: two golang.org/x/crypto/ packages and github.com/nats-io/nuid, all of which are very stable/controlled dependencies.

In the event that you absolutely have to use an unstable or immature library within your application, e.g. for an exotic file format, Go supports /internal/... packages, into which you can copy these sorts of dependencies, independently of any tooling or trust in the original maintainers, while ensuring they’re not importable outside your application. As Rob Pike says, a little copying is better than a little dependency - the ultimate way to be happy with your dependencies is to eliminate them.

Vendoring dependencies solves a particular problem (no matter the language): it pays off when builds that are not absolutely reproducible (not just build artifacts), being unable to deploy while GitHub is down for an hour (do your CI/CD systems work independently of GitHub, too?), or fixing small, rare API changes will cost you thousands of $CURRENCY. It is not an answer to poorly maintained libraries that break your builds every few hours; that is a consequence of your choice in libraries and maintainers. In all likelihood, vendoring is a solution to a problem you simply don’t face.

Some sort of “package manager” is equally not the solution to the problems that a lack of discipline around managing third-party dependencies creates. Developers of libraries and applications alike should hold themselves to a higher standard of technical excellence: designing stable APIs and avoiding dependency trees that grow out of control. Otherwise, a tool to make things easier to manage will only mask these problems, not solve them.

Follow me on twitter, @fortytw2, or shoot me an email if you think I’ve said anything interesting here and want to chat more
