Introducing Nebari, a key-value data store written using an append-only file format

In the previous thread, I wrote about my journey of implementing what I was calling “roots” at the time. My original intent was for this to be a crate specifically for BonsaiDb to use, but as I drew closer to the point of adoption, I started to see some interesting possibilities for it as a general-purpose library.

What is Nebari?

Nebari is an append-only database implementation loosely inspired by Couchstore. It is not ready to be used in anything except experiments – the file format will likely change multiple times before its first stable release.

Why would you want an append-only database implementation?

  • Database files are plain files that can be copied and backed up through regular file copies. The file format is guaranteed to always be consistent, since once the bytes are written to disk, they will never change.
  • Because the writing process only ever adds bytes to the end of the file, readers are never blocked by an in-progress write operation (see the sketch after this list).
  • With a little extra information written, full version history can be exposed for the database.
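
To make the second point concrete, here is a minimal standalone sketch (plain std::fs, not Nebari's code) of why appends can't disturb readers: a reader only trusts bytes below the file length it observed when it started, and in an append-only file those bytes are immutable.

use std::fs::{File, OpenOptions};
use std::io::{Read, Seek, SeekFrom, Write};

fn main() -> std::io::Result<()> {
    let _ = std::fs::remove_file("example.db"); // start fresh for the demo

    // The writer only ever appends; bytes already on disk never change.
    let mut writer = OpenOptions::new()
        .create(true)
        .append(true)
        .open("example.db")?;
    writer.write_all(b"first entry")?;

    // A reader snapshots the file length when it begins. Everything
    // before that offset is immutable, so a concurrent append can
    // never invalidate what the reader is looking at.
    let mut reader = File::open("example.db")?;
    let snapshot_len = reader.seek(SeekFrom::End(0))?;

    writer.write_all(b"a concurrent append")?;

    // Reading within the snapshot is safe: those bytes are frozen.
    reader.seek(SeekFrom::Start(0))?;
    let mut snapshot = vec![0; snapshot_len as usize];
    reader.read_exact(&mut snapshot)?;
    assert_eq!(&snapshot[..], b"first entry");
    Ok(())
}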

Why wouldn’t you want an append-only database implementation?

If your database is write-heavy, the file will accumulate a lot of data over time that isn’t in use anymore. A “compaction” process is needed to reclaim that disk space. This process isn’t as bad as it sounds, since it can be done without blocking the database, but it adds IO overhead while it is happening.
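To make the compaction idea concrete, here is a rough sketch (not Nebari's implementation) of the core move: copy the live entries into a fresh file while the old file stays fully readable, then atomically swap the new file into place.

use std::fs::{self, File};
use std::io::{self, Write};

// Hypothetical compaction sketch. Readers can keep using the old file
// the entire time; only the final swap needs brief coordination.
fn compact(live_entries: &[Vec<u8>], db_path: &str) -> io::Result<()> {
    let tmp_path = format!("{}.compacting", db_path);
    let mut new_file = File::create(&tmp_path)?;
    for entry in live_entries {
        new_file.write_all(entry)?;
    }
    // Make sure the compacted data is durable before the swap.
    new_file.sync_all()?;
    // rename is atomic on most platforms: readers see either the old
    // file or the complete new one, never a half-written state.
    fs::rename(&tmp_path, db_path)?;
    Ok(())
}

The real process is more involved, since writes keep arriving while compaction runs, but the atomic-swap idea is the same.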

Why use Sled instead of Nebari?

Time for me to admit something embarrassing. I started on this endeavor due to Fear, Uncertainty, and Doubt – and a mistaken assumption. After reading other people’s comments about Sled having high memory usage, I tried to narrow down a strange test failure on my Mac that involved massive memory usage.

In the end, I noticed that if I commented out the only test that inserted a large batch of transactions into the database, the memory usage issue went away entirely. I couldn’t run Instruments on the Mac without figuring out how to code sign the test harness, so I couldn’t trace the allocations back to their source. I also couldn’t reproduce the issue anywhere but my Mac. Thus, I concluded it was a strange set of circumstances that exacerbated this memory issue on my Mac.

Well, a few days ago, I realized I had gotten Nebari to the point that I could complete the test harness for BonsaiDb. I pulled the code onto my Mac and ran it. And… you can guess where this is going. Correlation != Causation. Sled was no longer in the project, yet the bug remained.

What caused the uncontrolled memory usage?

Now that I was 100% sure it wasn’t caused by Sled, I tried in earnest to reproduce it on my Linux PC. By increasing --test-threads to a larger number, I was able to reproduce it occasionally on my PC. With the knowledge that it wasn’t a Mac-specific bug, I still decided to try to debug it on the MacBook – after all, I could reproduce it almost every time on that machine.

After poring over countless thread stack traces across multiple runs, I started noticing that our QUIC-based unit tests were always the ones left running when this failure state occurred. A very subtle bug was lurking in this chunk of code:

while let Some(connecting) = allochronic_util::select! {
    connecting: &mut bi_streams => connecting,
    _: &mut shutdown => None,
} {
    let incoming = connecting.map_err(error::Connection);

    if sender.send(incoming).is_err() {
        // If there is no receiver, it means that we dropped the last
        // `Connection`.
        break;
    }
}

This loop was designed to accept incoming streams from a connection without blocking the connection while negotiating each individual stream. To exit the loop, it relied on sender.send returning an error, which it should once no receiver is alive to receive messages.

What was happening occasionally was that a receiver was still alive somewhere, while the connection was returning a disconnected error. As a result, this loop spun forever, queueing an endless stream of error messages onto the unbounded channel. In the end, the fix was trivial, since all of the errors that quinn can return in this situation indicate that the connection is no longer active.
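The shape of the fix looks something like this sketch (the real patch differs in its details): treat any error yielded by the stream as proof that the connection is gone, rather than forwarding errors forever.

while let Some(connecting) = allochronic_util::select! {
    connecting: &mut bi_streams => connecting,
    _: &mut shutdown => None,
} {
    let incoming = match connecting {
        Ok(stream) => stream,
        // Every error quinn can return here means the connection is
        // no longer active, so exit instead of queueing errors forever.
        Err(_) => break,
    };

    if sender.send(Ok(incoming)).is_err() {
        // If there is no receiver, the last `Connection` was dropped.
        break;
    }
}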

With that problem solved, I had a real dilemma: keep pushing on Nebari, or revert to Sled.

Why still develop Nebari?

One of the biggest selling points to me personally is that I understand it. It’s simple enough that I feel confident that other people could safely contribute without needing to feel like a database engineer. By using Nebari, I am in better control of the full destiny of BonsaiDb.

One of the harshest data points was the existing BonsaiDb benchmark: basics, which tests inserting documents into a simple collection (no views). My initial results showed a slowdown of over 5x. I knew a few things I needed to do to speed it up, but I wasn’t expecting to need to overcome a 5x difference, given that single-tree performance was nearly identical in my simpler benchmarks.

Despite the disappointing realization, I decided to try doing one of the fun projects I had envisioned for Nebari: supporting both versioned and unversioned trees. I also whipped together a multi-threaded approach to transaction committing.
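To illustrate the distinction between the two tree types, here is a toy in-memory sketch (not Nebari's on-disk structure): the unversioned tree keeps only the latest value per key, while the versioned tree retains every revision.

use std::collections::HashMap;

// A toy in-memory illustration, not Nebari's implementation.
// An unversioned tree keeps only the latest value per key.
struct UnversionedTree(HashMap<Vec<u8>, Vec<u8>>);

// A versioned tree retains every revision ever written for a key.
struct VersionedTree(HashMap<Vec<u8>, Vec<Vec<u8>>>);

impl UnversionedTree {
    fn set(&mut self, key: &[u8], value: &[u8]) {
        // Overwrites: older revisions become dead data immediately.
        self.0.insert(key.to_vec(), value.to_vec());
    }
}

impl VersionedTree {
    fn set(&mut self, key: &[u8], value: &[u8]) {
        // Appends: the full history stays queryable, at the cost of
        // extra storage between compactions.
        self.0.entry(key.to_vec()).or_default().push(value.to_vec());
    }
}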

The last piece of data that throws off these benchmarks is the transaction log. Nebari maintains a transaction log, and so does BonsaiDb, which stores its transactions in a Sled/Nebari tree. Thus, when running this benchmark with Nebari, the transaction data is stored twice. This makes fair benchmarks hard to do: being able to disable that code path when executing against Nebari is one of the benefits of using Nebari, but it also somewhat moves the goalposts.

So, here’s a table of results of saving documents that contain 1k bytes:

Backend   Version History   Transaction Logs   Insert Time
Sled      no                single             48µs
Nebari    yes               double             127µs
Nebari    no                double             91µs
Nebari    yes               single             58µs
Nebari    no                single             41µs

Despite my original disappointment in the benchmarks, I never expected that disabling versioning and relying solely on the Nebari transaction log would exceed the original speed of BonsaiDb on Sled. Remember, there’s a caveat with append-only formats: disk usage. At the end of each of these benchmarks, the Nebari database was larger and contained more dead data than the Sled database.

Despite that caveat, the flexibility is the strongest argument I have for developing Nebari for BonsaiDb. I will be able to custom-tailor Nebari to solve the usage patterns of BonsaiDb, including:

  • Leveraging block-level encryption at the Nebari level, allowing us to remove all limitations on encrypted views.
  • Opt-in full version history for collections. Want safety by having full document history? Just enable it for your collection and you’ll have it.

Additionally, Nebari has a subset of functionality that can be adapted for no_std support. I’ve never worked on a no_std library, but all std::fs access is abstracted behind a trait, allowing for alternative storage systems to be implemented.
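For a sense of what that abstraction might look like, here is a hypothetical sketch of such a trait; Nebari's actual trait differs in its details.

// A hypothetical storage abstraction, not Nebari's actual trait.
// Because nothing here requires std::fs, an implementation could
// target memory, an object store, or a no_std environment.
trait Storage {
    type Error;

    /// Fill `buffer` with bytes starting at `offset`.
    fn read_at(&self, offset: u64, buffer: &mut [u8]) -> Result<(), Self::Error>;

    /// Add bytes at the end; previously written bytes never change.
    fn append(&mut self, data: &[u8]) -> Result<u64, Self::Error>;

    /// Flush appended bytes to durable storage.
    fn synchronize(&mut self) -> Result<(), Self::Error>;
}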

With all those things in mind, I’m currently of the mindset that Nebari has its place in the Rust ecosystem as well as in BonsaiDb. Unfortunately, I don’t think it’s tenable to support multiple storage layers for BonsaiDb.

So that’s where I’m at today: deciding to introduce Nebari publicly and, most likely, continuing to embrace it as the core of BonsaiDb.