Implementing a new storage layer for BonsaiDb

While I have done some cool things with Gooey that I will want to talk about soon, I’ve been fixated on a problem since Thursday, after reading a comment on Reddit. A detail that I had been keeping to myself is that I couldn’t actually run my test suite on my MacBook Air M1. The reason was memory usage. On Linux, my test suite doesn’t seem to consume much RAM at all. But on my Mac, before Thursday, it would balloon to over 40 GB of memory used, prompting the OS to ask me to force quit applications.

After reading that comment, I decided to try changing the cache size from the default 1GB down to a 1MB cache, as others had suggested it helped. It seemed to help… but it really didn’t. Still, armed with some more hints that this could be sled-related, I started to dig in more.

The Problematic Test

Edit: As the next post details, the next section was an incorrect diagnosis of the cause of the memory issues.

I eventually determined that the only test suite that caused this issue is the unified core-suite, which uses a single persistent server. Each unit test connects over either a QUIC or WebSocket connection. One of the tests is a test of list_transactions(). That API has a limit of returning a maximum of 1,000 transactions. To test that limit, I needed to insert 1,001 transactions:

        // One push per iteration – 1,001 single-document transactions in total
        .map(|_| async { collection.push(&Basic::default()).await.unwrap() }),

I did some more debugging and began to understand one problem a bit better: each connection could spam the server with as many requests as it wanted, and the server would try to execute all of them in parallel. But the server has a setting that limits how many requests it can execute at a time, which in effect meant that as soon as list_transactions_test began, tests ground to a halt – sometimes stuck in the middle of executing because their next request was queued behind the transaction inserts.

On my Linux machine, this caused some fun timing-sensitive tests to be problematic, but on my Mac, this meant that more tests piled up as it slowly chomped away at the transactions. I decided it was time to add a limit to how many requests a client can have in-flight at any moment. This change made the test suite execute smoothly on my Linux machine for the first time in a long while!

On the Mac, this did help too, but once that test ran, I continued to witness the same memory growth as before. On a whim, I commented the test out, and everything ran smooth as butter. As I re-read some of the sled issues on GitHub, my situation looked more and more similar to what other people had been reporting.

To this day, I’m still baffled as to why this happens on my Mac but not my Linux PC. At the end of Thursday, I summarized my findings to @dAxpeDDa (Dax) in a private chat and went to make dinner. As I cooked, I couldn’t stop thinking about it. My thought process went:

  • 1,000 documents, each must be less than 1,000 bytes for sure. That’s only 1 megabyte of raw data.
  • There’s overhead that I introduce with the transaction log too, but it’s a fraction of the document data. Let’s assume it ends up doubling the data, just as a worst case. That’s still only 2 megabytes of data being written.
  • Issuing these transactions into sled causes the memory to grow steadily as the inserts happen, but by orders of magnitude more than the actual data being written.

With the knowledge that these issues have been reported and worked on since 2019, I started looking for other storage layer replacements. As I searched, I didn’t find many that were pure Rust – one of the reasons I’m writing BonsaiDb in the first place.

I’ve written before about how I’m very familiar with CouchDB, as I used it to grow my previous startup from a single server to a geo-redundant cluster over ten years. I knew that CouchDB used a custom append-only format to implement its B-Tree storage. I decided to search, and lo and behold: there’s a C implementation with good overview documentation. Thus, I left Dax a note:

Link to Couchbase docs with the message, I'm tempted.

Setting the Goalposts

From the outset, I’ve never viewed BonsaiDb as something that will compete on raw performance with PostgreSQL or Redis, the two databases I’m replacing in my own stack with BonsaiDb. This is the same as what I expected when using CouchDB before as well. CouchDB has its own data guarantees and performance characteristics, just as PostgreSQL and Redis have theirs. My goal for BonsaiDb was to be competitive with CouchDB’s performance, because I knew that level of performance was manageable and predictable for me as a developer and business owner.

When I embraced another library as my underlying storage layer, I took no responsibility for the raw performance. I understand the basic philosophies of optimizing IO, but I’m not a database engineer. By taking this responsibility onto myself, I feel like I must make it clear: BonsaiDb should be fast enough for most users, but I highly doubt it will ever compete with PostgreSQL for ACID-compliant transactions nor with Redis for non-ACID-compliant key-value storage. I’m not one to obsess over benchmarks: I will make and publish a benchmark suite that attempts to be fair, and I will do my best to make the obvious optimizations.

Beyond that, I welcome anyone else’s contributions that improve efficiency without sacrificing maintainability too much. I care about code maintainability more than I do about raw performance.

Another key goal is ACID-compliant, multi-collection transactions – something that CouchDB/Couchstore doesn’t offer. I refuse to sacrifice this capability, as it’s one of the reasons I preferred to store some data in PostgreSQL at my previous company. For some data, you just don’t want to ever roll the dice.

Those are the goals that required a little explanation, but I also want to have:

  • Cross-platform, but support io_uring on Linux. I’m primarily interested in hosting servers on Linux, and using this IO abstraction might help make up a little bit for my lack of experience as a database developer.
  • File-level encryption support.
  • Settable in-memory cache limits.
  • Ability to write to trees without transactions.
  • Fine-grained control over flushing behavior.

Starting the journey

I began by creating an abstraction between tokio-uring and tokio::fs. It is a bit ugly in its current form, as it’s been iterated on over and over these past two days, and I haven’t spent time cleaning it up.

As I said before, I’m not someone to chase benchmarks, but on Saturday I got my initial tests working, and I decided to change the test from trying 100 items to larger numbers and turn on release mode.

Before I even start, let me say that this test is a very bad test for performance. It consists of:

  • Creating/opening a transaction log file
  • Inserting a new transaction
  • Flushing it fully, then closing the file

Each loop, the file is opened, inspected to learn the last transaction state, written to, then closed. It’s really not an indicator of performance. But if it were abysmally slow, I would know I was doing something wrong. The total file size ends up at the “page size” times 1 million, which conveniently is about 1 gigabyte.

Good news: at roughly 20k iterations per second on my AMD Ryzen 3800X, I was happy enough to move on! Later that night, I pulled the code onto my M1 MacBook Air and ran the same test: it beat my Ryzen by averaging 25k iterations per second.

Planning for Transactions

At this point, I had working versions of the async file abstraction, the block-level encryption abstraction (with a Rot-13 test implementation), and the append-only transaction log implementation. I had finished doing some concurrency tests, but while writing that test I realized something: my file locking strategy wasn’t flexible enough to support non-exclusive modes in the future.

I began to think about how the real flow of the program would actually function. The way I’m guaranteeing ACID compliance is based around the append-only format. Here’s an overview of how it works:

  • The file is broken into “blocks” of data. For Couchstore, this is 4,096 bytes. For the transaction log, I’ve chosen 1,024 bytes (I expect too much wasted space for the transaction log if I picked a higher value).
  • At the start of each block boundary in the file is either a 0 or a 1. If it’s a 1, it contains the start of a record in the file. If it’s a 0, it’s part of another chunk of data that couldn’t fit within a single block.
  • When opening the file, the reader seeks to the last multiple of the block size. An attempt to find an existing transaction is made by scanning backwards until a 1 is found on a block boundary. The last record is then parsed and verified before the log is considered open and valid.
  • If a write is interrupted before a chunk is able to be written fully, the opening procedure can detect this and truncate the file to the last completed record’s end.
  • After the transaction log is flushed, ACID compliance has been achieved and the writer can report success back.
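The steps above for reopening a log can be made concrete with a minimal sketch (my own illustration over an in-memory buffer, not the actual BonsaiDb code):

```rust
/// Given the raw bytes of a transaction log and its block size, find the
/// offset of the last block that starts a record (first byte == 1).
/// Returns None if no record start exists -- e.g., an empty log.
fn last_record_block(log: &[u8], block_size: usize) -> Option<usize> {
    if log.is_empty() {
        return None;
    }
    // Seek to the last multiple of the block size...
    let mut offset = (log.len() - 1) / block_size * block_size;
    loop {
        // ...and scan backwards until a 1 is found on a block boundary.
        if log[offset] == 1 {
            return Some(offset);
        }
        if offset == 0 {
            return None;
        }
        offset -= block_size;
    }
}
```

From that offset, the record would be parsed and verified; if verification fails (an interrupted write), the file can be truncated back to the previous record’s end.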

If I imagine having a pool of database workers performing transactions, I can see a clear way to optimize those workers: allow multiple transactions to be prepared by each worker while waiting for the transaction log to commit them.

In this flow, I could pass ownership of the log file between each of the workers, or I could introduce a single thread responsible for committing all transaction log entries. To me, giving the transaction log its own thread made immediate sense, so that’s what I did this morning.

One of the other realizations I had in this model is that sometimes the TransactionManager may have more than one transaction queued and ready to write. Each transaction log entry may be as small as 1KB, and in my head, 16KB was the smallest amount that was worthwhile to flush. I don’t know where I got that number – chalk it up to a guesstimate. I wrote the transaction manager to attempt to write up to 16 transactions in a single batch before flushing, and then to notify all 16 transaction workers after the flush succeeded. If a full 16 transactions weren’t ready, only what was available would be written before flushing, to keep the delay for a single insert as small as possible.
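That batching policy can be sketched in a few lines (the names here are mine, not the actual BonsaiDb types):

```rust
use std::collections::VecDeque;

/// Maximum number of queued transactions to write in one batch before
/// flushing.
const MAX_BATCH: usize = 16;

/// Drain up to MAX_BATCH pending transactions from the queue. If fewer
/// are waiting, take only those, so a lone insert is never delayed
/// waiting for a full batch to accumulate.
fn next_batch<T>(queue: &mut VecDeque<T>) -> Vec<T> {
    let take = queue.len().min(MAX_BATCH);
    queue.drain(..take).collect()
}
```

Each drained batch would be written and flushed as a unit, and only then would the waiting workers be notified of success.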

The new test is a bit more realistic: using 10 workers, spam the TransactionManager with 100,000 entries each. The initial run on my Linux machine showed an increase to ~115k transactions per second. On my M1 MacBook Air: ~85k per second.

At the end of the day, these numbers aren’t meant to show that I’m doing anything impressively fast. They’re just to prove to myself that “it’s probably good enough” allowing me to move on. To have a realistic end-user benchmark that’s relevant to BonsaiDb, I need to implement document storage.

Wait, it’s not done?

Of course not, I only started it on Friday after spending the morning playing Factorio with friends.

I found myself giving updates to Dax in a private chat, which doesn’t do anything to further the community. Thus, I was inspired to write up my journey so far, and use this thread as a place to report on my continued journey of replacing the storage layer of BonsaiDb.

I still want to work on Gooey to get to the point of building BonsaiDb administration tools with it. I already have lots of great things to show off since my last update, but my brain has been practically obsessed with this new project. Hopefully, it will resolve one of the last uncertainties I’ve had with BonsaiDb.

Progress has been made! It’s been a rough couple of days trying to figure out how I wanted to approach the problem of an “append-only B-Tree” implementation.

I’m very rusty at reading C code, and unfortunately the B-Tree code in Couchstore isn’t heavily documented. I’ve understood the concepts and algorithms behind B-Trees for a very long time, but I’ve never actually written one. The closest I had come were binary tree-type structures.

Armed with the file format overview, I started by working to understand how the structures worked at the file level. I felt like I understood that, but when I went to try to understand how the actual tree implementation worked, I found myself quickly getting lost.

The questions I had that I ultimately never answered are:

  • What is the maximum “order” of the Couchstore B-Tree format? My understanding is that this would be limited by the root node sizes in the file header, which are limited to 16 bits. The root nodes under high capacity will definitely be pointer nodes, which makes their size key_size (12 bits) + value_size (28 bits) + key_data (variable) + 48 bits + 48 bits. A 12-bit key size means that a pointer node can potentially take up to 529 bytes. This implies to me that the order is 65,536 / 529, or roughly 123 children.

    What I can confirm is that Couchstore tracks the quota, so I believe it probably has some sort of dynamic order based on how big the keys actually are in-use.

  • Does Couchstore have a maximum depth? The only mentions of depth in the Couchstore repository are during the read operations, not during modification. I ended up deciding to stop trying to understand how Couchstore worked before I was able to determine whether it limits itself to a depth of 2 (root, interior, leaves) or whether the depth is dynamic.

I turned to many articles about how to implement B-Trees, but almost all of the examples simplified things a little too much. I found myself growing irritated as I would think that I started understanding how some principle was supposed to work, only to then realize the problem with that understanding later. I called it a night last night with no working code and a lot of half-finished data structures.

Append-Only B-Trees

This morning, I set my goal on getting something working. At the end of the day, I’d prefer to have something working that I can learn from. This meant one thing: ignoring Couchstore and focusing on just trying to understand how to implement the B-Tree itself.

Before I dive into the B-Tree implementation, let me first describe the append-only format’s general structure including the general idea of how the B-Tree is “updated” in an append-only format.

The file is written in pages 4,096 bytes long. At every multiple of 4,096 inside the file is a one-byte header. The last page that contains a PageHeader::Header value contains the “root” of the tree.

I put root in quotes, because technically there are multiple B-Trees being managed. When a new document is inserted, a unique sequence ID is assigned to the newly stored document. The By-Sequence B-Tree index contains information about every sequence, allowing you to iterate over/search through the database in order of modification. The By-Id B-Tree index contains information about every key/ID, allowing quick access to documents by their IDs.

When a B-Tree node is “modified”, the existing structure doesn’t change on-disk. Instead, a new version of that node is written out. Because this changes the location of the node on-disk, it also means any nodes that pointed to that node must also be rewritten with the updated information. Because most of these nodes are simply pointing to other locations on disk, most nodes are quite small.

That is the summary of the approach I needed to take.
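A toy model of that rewrite cascade (my illustration, not BonsaiDb’s actual structures): rewriting a leaf gives it a new file position, which forces its parent to be rewritten, and so on up to the root.

```rust
/// A drastically simplified node: just its on-disk position and the
/// positions of its children.
struct Node {
    position: u64,
    child_positions: Vec<u64>,
}

/// "Modify" the node at the end of `path` (root first, leaf last) by
/// appending new versions of every node on the path. For brevity, this
/// sketch assumes the modified child is each parent's last child.
/// `next_position` stands in for the file's append offset.
fn rewrite_path(mut path: Vec<Node>, mut next_position: u64) -> Node {
    let mut child = path.pop().expect("path must not be empty");
    child.position = next_position;
    next_position += 1;
    while let Some(mut parent) = path.pop() {
        // The parent must be rewritten to reference the child's new home.
        *parent.child_positions.last_mut().unwrap() = child.position;
        parent.position = next_position;
        next_position += 1;
        child = parent;
    }
    child // the new root
}
```

Since interior nodes mostly hold keys and pointers, each rewritten node is small; the cost scales with the tree’s depth, not its total size.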

My actual approach

With the decision to just get something working, I decided to be liberal with my changes from the Couchstore format. I couldn’t feel comfortable limiting myself to an order of 123, nor did I want to deal with the complexities of allowing tighter packing for large keys… so I decided to change the u16s that stored the root node’s serialized sizes to u32s. It is absolutely overkill, but it allows me to just define a “reasonable” order for my B-Tree and get it rolling.

What is reasonable? Well, I have it set to 1,000 right now, but I highly suspect this is a number that will need to be tuned. Essentially, the smaller the order, the deeper the tree gets for the same number of nodes. A B-Tree of depth 3 and order 1,000 can store nearly 1 trillion records. If you throw more than 1 trillion records at it, it will need to go to a depth of 4.
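As a sanity check on those figures, here’s the back-of-the-envelope math (my own arithmetic, counting the root as depth 0, so a depth-3 tree has four node levels including the leaves):

```rust
/// Rough record capacity of a B-Tree where every node holds up to
/// `order` children/entries and the tree has `levels` node levels.
fn capacity(order: u64, levels: u32) -> u64 {
    order.pow(levels)
}
```

capacity(1000, 4) is 1,000,000,000,000 – the “nearly 1 trillion records” at depth 3 – and going one level deeper multiplies that by another 1,000.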

The net effect of adding another level of depth to the tree is one more jump on-disk when looking up content. I’d really like to test the net effects of smaller orders that result in larger depths on modern solid-state storage.

One of the more embarrassing moments of clarity came when I realized that using V for the value of my B-Tree indexes had confused me into thinking that these Key-Value pairs actually were the “key-value” pairs being inserted into the database itself. For clarity, I renamed that generic to I for index.

Once I unraveled that mess in my brain, I realized that one of my original understandings from yesterday was correct: The keys that are stored in the pointer nodes are the “maximum” key from the subtree.

It’s sort of funny: yesterday I was doodling trees on paper, trying to understand how I would actually store the tree. In the process, I sketched a tree that matched that implementation perfectly, but I talked myself out of it because I didn’t understand how it could remotely work in the Couchstore format.

Now that I’ve actually finished a partial implementation, my approach is actually very similar to Couchstore, and my pointer nodes have this characteristic.

What does that characteristic mean? It means that when searching for where a document is stored, our B-Tree nodes will be one of two types: a Leaf node or an Interior node. Leaves contain one or more Key-Index pairs, and Interior nodes contain one or more Key-Pointer pairs. Hunting for a key in a Leaf is easy: just do a binary search, and the value you’re looking for is stored with the node.

If you have an Interior node, however, you don’t have all of the keys. Thus, to be able to query efficiently, you must be able to make some statement of knowledge about the key stored at this level relative to all the keys contained in the node being pointed at. There are two obvious answers: the lowest key or the highest key. Regardless of which you choose, you can now determine which child might contain the key and which children definitely will not.
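With the “highest key” choice, descending through an Interior node reduces to a binary search. This is a sketch with illustrative names, not the real code:

```rust
/// Given the maximum key of each child subtree (sorted ascending), return
/// the index of the only child that could contain `key`, or None if every
/// subtree's maximum is smaller than `key` (the key cannot be present).
fn child_for_key(max_keys: &[&[u8]], key: &[u8]) -> Option<usize> {
    match max_keys.binary_search_by(|max| max.cmp(&key)) {
        // Exact match on a subtree's maximum: the key lives there.
        Ok(index) => Some(index),
        // Otherwise, the first subtree whose maximum exceeds `key`.
        Err(index) if index < max_keys.len() => Some(index),
        Err(_) => None,
    }
}
```

The same decision repeats at each level until a Leaf is reached, where an ordinary binary search finds the key itself.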

What’s left?

A lot! This basic unit test flow succeeds:

let mut tree = TreeFile::<TokioFile>::open(file, state, None).await.unwrap();
tree.push(b"test", b"hello world").await.unwrap();

// Reload the file
let state = State::default();
TreeFile::<TokioFile>::initialize_state(&state, &file_path, None)
    .await
    .unwrap();

let file = manager.append(&file_path).await.unwrap();
let mut tree = TreeFile::<TokioFile>::open(file, state, None).await.unwrap();
let value = tree.get(b"test").await.unwrap();
assert_eq!(&value.unwrap(), b"hello world");

However, it will break once the B-Tree needs to expand – there are todo!() calls with some comments about what’s left to be done in those scenarios.

But, those scenarios just cover insert/update. Ultimately, I want to get to that stage to be able to start testing some of my assumptions about performance thus far.

Unlike last night when I was lamenting returning to the code editor, I’m looking forward to tackling the next pieces tomorrow. It’s funny what having a green checkmark next to a test can do for morale :slight_smile:

It’s been a “fun” few days since my last update, but I finally have some real progress to report. Most of my pain came from thinking I was just “one more bug away” from having it all working. Rather than spending the time to write exhaustive unit tests, I kept trying to get things working by running the entire flow – after all, the basic example worked!

Thankfully, around the middle of the afternoon yesterday, I finally wrote a unit test for the PagedWriter – a type that transparently ensures that the file always has a page header at every 4,096 byte offset. This was the main problem, as I was trying to minimize reads and data copies.
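The invariant PagedWriter maintains can be sketched like this (my illustration; the real type works against the file incrementally rather than building a buffer):

```rust
/// Copy `data` into an output buffer, reserving one header byte at every
/// `page_size` boundary so that payload bytes never land where a page
/// header belongs. The final page is left unpadded in this sketch.
fn paginate(data: &[u8], page_size: usize, header: u8) -> Vec<u8> {
    let mut out = Vec::new();
    // Each page holds page_size - 1 payload bytes after its header byte.
    let payload_per_page = page_size - 1;
    for chunk in data.chunks(payload_per_page) {
        out.push(header);
        out.extend_from_slice(chunk);
    }
    out
}
```

The bugs I was chasing were exactly violations of this invariant: payload writes straddling a boundary without accounting for the header byte.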

Once I fixed the file format issues, it became a lot more clear what actual bugs I had in my implementation of the append-only B-Tree. And then finally, an hour after I normally like to step away from the computer, I had a working test that inserted as much data as possible until I killed it.

I did a quick test. My initial numbers were a bit of a disappointment. I reviewed the benchmark to make sure it wasn’t doing anything stupid and then headed off to sleep.

Benchmarks are a trap

At this point in time, my B-Tree implementation supports inserting one record per fully-flushed transaction, and it can do a Key-Value style “Get” with no way to do multi-row queries/scanning. Over time, it will gain a lot more functionality. But right now, I can only test two things: inserting data and querying one row at a time.

So, from the outset, compared against any database, my benchmark won’t pass anyone’s muster. I wanted to understand how many transactions per second this limited implementation could perform. But the moment I try to write a similar benchmark for any other database, the common complaints would be, “You should batch your inserts in a single transaction” or “Use these settings to tune the database engine for those operations.”

No matter what numbers I give, the reality is that every database application is different. Not only are the schemas different, but the approaches to interacting with those schemas will be different as well. Thus, there is no general purpose benchmark that can prove any database is better or worse for your use case. They may help you make educated guesses, but nothing beats benchmarking your actual schema/code.

For example, to limit PostgreSQL to my exact functionality, I need to execute 100,000 insert statements individually. As an experienced developer, I know that inserting rows inside of a transaction then committing will be much faster. But, then I wouldn’t be comparing apples to apples anymore. So, regardless of what numbers I provide for how PostgreSQL performed vs this experimental implementation, they aren’t real world numbers.

Why am I writing such a big disclaimer if I was disappointed in my numbers? Well, it’s because my disappointment was unfounded.

Evaluating my initial results

Under tokio_uring, I was reaching ~2.3k transactions per second with an “order = 1000” B-Tree (that is, each node’s maximum child count). Under tokio, I was reaching just over 2k transactions per second. I thought these numbers sounded low – and compared to my transaction log’s performance, they are. But the complexity of operations in the B-Tree is significantly higher. Inserting into the B-Tree requires scanning the file, whereas the transaction log simply appends bytes to the end of the file and never has to look at earlier entries except upon first launch.

This morning I knew I was going to tackle a caching layer. By caching the recently written/read chunks of data, I can skip accessing the disk for many operations. This leveled the playing field between tokio_uring and tokio, with both reporting ~2.8k transactions per second. I was hoping for more – again, I was comparing it to the 115k transactions per second that the transaction log could achieve.
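The idea behind the caching layer can be sketched like so (my illustration with a deliberately crude eviction policy, not the real type):

```rust
use std::collections::HashMap;

/// Recently read/written chunks, keyed by their offset in the file. A hit
/// skips the disk entirely.
struct ChunkCache {
    max_chunks: usize,
    chunks: HashMap<u64, Vec<u8>>,
}

impl ChunkCache {
    fn new(max_chunks: usize) -> Self {
        Self {
            max_chunks,
            chunks: HashMap::new(),
        }
    }

    fn insert(&mut self, offset: u64, data: Vec<u8>) {
        // Crude eviction: drop an arbitrary chunk once full. A production
        // cache would evict by recency (LRU) instead.
        if self.chunks.len() >= self.max_chunks && !self.chunks.contains_key(&offset) {
            let victim = self.chunks.keys().next().copied();
            if let Some(victim) = victim {
                self.chunks.remove(&victim);
            }
        }
        self.chunks.insert(offset, data);
    }

    fn get(&self, offset: u64) -> Option<&[u8]> {
        self.chunks.get(&offset).map(Vec::as_slice)
    }
}
```

Capping the cache by chunk count (or bytes) is what makes the in-memory limit settable, one of the goals listed earlier.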

Then, I thought about how to write a fair test against PostgreSQL. The equivalent is just executing 100,000 insert statements – each one inserted in its own fully-flushed transaction.

Using the psql command line to execute that file, it ended up averaging just over 900 transactions per second. I attempted to use the same setup for SQLite, but even with forced batch mode, I gave up after waiting 5 minutes. If I switched SQLite and PostgreSQL to having a transaction wrapping all the insert statements, both obliterate my test – I don’t have equivalent functionality yet.

With these new, unscientifically measured numbers, I have to say I’m a lot more confident in my original goals for this project. As I’ve said from the beginning, I don’t expect that I, as a solo developer, can ever hope to create a database that beats the current leading databases. I think it’s a lofty goal to even try to be in the same ballpark across a reasonable benchmark suite.

Why am I not more ecstatic about a “3x as fast” initial benchmark result?

One of the downsides of an append-only format is that old data, including B-Tree headers, won’t be reused. This means that as your database grows, more dead information gets saved in those files. This can be true in traditional formats as well: fragmentation. With this implementation, if you are worried about disk space, you will occasionally want to “compact” your databases. This process requires IO. It can be a non-blocking operation, but it still will have an impact on the running system’s available IO while executing.

While that sounds like a disadvantage, it can be spun as an advantage: with CouchDB, you could schedule compaction to run at certain times of day, allowing you to run it only while your systems were under reduced load. This lets the benefits of the append-only format be realized during the times when your system needs the most performance.

Ultimately, this is why I consider benchmarks a trap. How does someone accurately demonstrate that over-time performance penalty in benchmarks? These are questions for a day in the future. For now, I’m staying cautiously optimistic about the performance I’m observing, and I’ll only allow myself to get excited once I can do more extensive comparisons.

As of tonight, I’m done chasing speed. I’m pretty happy with where things are at. I’m at a stage where I need to work on putting it all together for BonsaiDb to use. Before I moved on, I wanted to share a bit of the journey.

The less you write, the faster you are

As we all learned in the previous post, it’s important to keep the amount of data written to a minimum.

After testing single-row inserts, the next step was to attempt to support batch writes. The idea behind a batched write is that you can modify the B-Tree in memory with all of the changes necessary, then write one updated set of information.

My initial benchmarks were incredible. Almost too incredible. I specifically called one part “intriguing” as I called it a night.

The next day, I spent time benchmarking and trying to understand the performance. Then, I decided I should expand some tests before I considered this part done. That’s when I discovered it: not just one bug, but multiple bugs that meant I wasn’t writing all the data in the large transaction tests – which perfectly explained the “intriguing” performance.

Sadly, when I finished fixing all the bugs, I was slower than SQLite by a fair margin in most categories, including single-row transaction inserts – the ones I had previously claimed victory over were also affected by one of the bugs.

I was eventually able to get single-row transactions back up to speed, but along the way I realized multiple things.

Append-Only vs Mutable B-Trees: Considering bulk inserts

One interesting thought I had while pondering why I was writing so much data was to consider the IDs I was using. When I was originally writing unit tests for the B-Tree, I wanted random IDs to test growing the tree organically. When inserting sequentially, you end up with a tree that is slightly heavier on one side, so to speak. When inserting randomly, your tree will be more or less fully balanced.

When inserting a large number of random IDs in a single bulk operation, you’re likely to touch a larger number of pages than if the IDs being inserted are sequential. This is one of the key differences when comparing the performance characteristics of an append-only B-Tree against a mutable one: in the append-only situation, we must write all of the affected nodes to a new location on-disk. A mutable B-Tree implementation could be written to rewrite only the small parts that have actually changed.
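A toy model of that effect (my sketch, assuming each leaf covers `order` consecutive ids, so an id’s leaf is simply id / order):

```rust
use std::collections::HashSet;

/// Count how many distinct leaves a batch of ids lands in, modeling a
/// tree whose leaves each cover `order` consecutive ids. In an
/// append-only tree, every touched leaf -- plus its ancestors -- must be
/// rewritten to a new on-disk location.
fn leaves_touched(ids: &[u64], order: u64) -> usize {
    ids.iter().map(|id| id / order).collect::<HashSet<_>>().len()
}
```

One hundred sequential ids dirty a single leaf, while one hundred scattered ids can dirty up to one hundred leaves – which is why random-id bulk inserts pay a much larger write amplification penalty in the append-only design.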

Thus, for randomized ID inserts, even SQLite slows down compared to sequential ID inserts – it has more B-Tree nodes to update as well. But the gap in performance between SQLite and Roots widens in proportion to the number of nodes that need to be written.

So, for transactions that touch many nodes, SQLite will almost certainly always be faster than Roots and thus BonsaiDb. For transactions that touch few nodes, Roots is competitive and thus hopefully BonsaiDb can be as well.

Append-Only vs Mutable B-Trees: Revision History

It’s important to remember the biggest benefit that comes with these tradeoffs: a full history of all changes. While you will likely want to use database compaction (still to be written) to reduce the bloat of your databases, until you do, previous versions of data can still be retrieved by looking them up via their “Sequence Id”.

This feature came in handy many times. You can’t always predict human error, and CouchDB’s version history saved the day at least once a year.

Re-evaluating performance goals

Once I realized that an append-only B-Tree was almost guaranteed to be slower than a mutable B-Tree, I remembered that my primary motivation was to meet or exceed the performance of CouchDB. Thus far, I had only compared my work against SQLite, which is known to be quite performant.

I installed CouchDB and was floored. The numbers for Roots are representative of the performance after the removal of async (see the next heading):

10000 transactions, 1 entries per transaction
| target  | total (s) |      tx/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
|  roots  |     0.242 | 41305.970 |    0.007 |    0.045 |      0.004 |          0.036 |
| sqlite  |     0.335 | 29849.551 |    0.022 |    0.678 |      0.019 |          0.052 |
| couchdb |    78.044 |   128.133 |    6.184 |   41.581 |      1.296 |         13.722 |

1000 transactions, 100 entries per transaction
| target  | total (s) |     tx/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
|  roots  |     0.364 | 2745.834 |    0.325 |    0.530 |      0.030 |          0.497 |
| sqlite  |     0.222 | 4497.521 |    0.183 |    0.432 |      0.022 |          0.305 |
| couchdb |    19.056 |   52.477 |   11.237 |   68.107 |      3.482 |         29.094 |

100 transactions, 1000 entries per transaction
| target  | total (s) |    tx/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
|  roots  |     0.333 | 299.883 |    3.070 |    4.388 |      0.304 |          4.351 |
| sqlite  |     0.181 | 553.675 |    0.954 |    2.186 |      0.153 |          2.164 |
| couchdb |     7.460 |  13.404 |   60.140 |  129.219 |      9.897 |        112.643 |

10 transactions, 10000 entries per transaction
| target  | total (s) |   tx/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
|  roots  |     0.343 | 29.169 |   32.216 |   43.943 |      3.317 |         43.943 |
| sqlite  |     0.180 | 55.563 |   10.217 |   21.812 |      2.841 |         21.812 |
| couchdb |    15.081 |  0.663 | 1369.856 | 1580.031 |     62.915 |       1580.031 |

For CouchDB, I’m using ureq and issuing calls to /{db}/_bulk_docs. As you can see, Roots and SQLite are in a league of their own across all sizes of transactions. The CouchDB benchmark isn’t in the repo yet, as I want to make it easier for other people to run before pushing it.

Async isn’t worth it for my usage pattern

One way that I thought I’d attempt to be competitive was to use newfangled technology like tokio-uring. io_uring has some safety requirements that require a non-standard API, and to support it either I needed to use something like tokio-uring or I had to use unsafe. There’s nothing wrong with unsafe, but I like to avoid being responsible for safety concerns. Thus, I built a mechanism to abstract tokio::fs and tokio_uring, both using an async API.

Today, I was working on adding the ability for PagedWriter to queue up multiple pages in memory and write them to disk in one go. As part of that, I added a method, async fn commit_if_needed(), and added calls to it in the correct locations. To my dismay, this optimization was slower! On a hunch, I rewrote that function as a macro (it only needed to actually await if certain conditions were met), and my optimization worked.

That made me realize something very important: the overhead of a simple async/await function call was measurable in these benchmarks. Additionally, I wasn’t able to see any real benefits from tokio-uring. With an append-only format, one of the benefits of tokio-uring's access model goes away: two writers can’t actually write at the same time, because they will always want to write to the same location. If I were writing a mutable B-Tree, I could in theory leverage async tasks to parallelize my writes and see some gains.

In short, I realized async just wasn’t a good fit for Roots. Here are the results, including before and after. That said, BonsaiDb will continue to be async – just as it previously wrapped sled, it will now wrap Roots.

Time to finish the project up

After seeing my performance relative to CouchDb, I am very confident that I can move BonsaiDb onto what I’ve built and continue to outperform CouchDb. Additionally, with views being written in Rust (or eventually WASM) instead of JavaScript, even the map/reduce performance should be greatly improved relative to CouchDb. In short, I’m completely OK with SQLite/PostgreSQL having a margin on me, because that margin is reasonable compared to the one CouchDb exhibits.

I need to wrap up the core functionality of Roots and begin figuring out how I’m going to integrate it into BonsaiDb. Once it’s integrated into BonsaiDb, I can try to perform more interesting benchmarks such as multi-collection/table transactions with contention.

I’ve been benchmarking read performance today. I honestly haven’t done much real optimization; I’ve mostly just been learning about the performance. On a small database, I surprisingly beat SQLite (this benchmark verifies that reads match the expected values from the database):

searching for 100 ids in a data set of 1000 entries
| target  | total (s) |  batch/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
|  roots  |     0.262 | 3816.858 |    0.229 |    0.698 |      0.044 |          0.426 |
| sqlite  |     0.789 | 1267.826 |    0.688 |    2.580 |      0.184 |          1.667 |
| couchdb |    96.516 |   10.361 |   74.392 |  143.383 |      9.327 |        129.483 |

These are single-request, single-id gets – i.e., a query that mimics the common application flow of SELECT * FROM user WHERE id = 123. CouchDb… well… it still has a fond spot in my heart, but whew. The next test still searches for 100 ids, but now the data set is 100k entries:

searching for 100 ids in a data set of 100000 entries
| target  | total (s) |  batch/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
|  roots  |     1.027 |  973.562 |    0.900 |    1.495 |      0.069 |          1.261 |
| sqlite  |     0.988 | 1012.347 |    0.808 |    1.948 |      0.165 |          1.662 |
| couchdb |   108.170 |    9.245 |   83.148 |  177.311 |     12.885 |        147.575 |

Now that the database is larger, SQLite has gained enough that it’s neck and neck. I will note that my stdev/max/99th percentile are all better, too! I gotta get my bragging in because…

searching for 100 ids in a data set of 1000000 entries
| target  | total (s) |  batch/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
|  roots  |     1.510 |  662.191 |    1.403 |    2.229 |      0.080 |          1.924 |
| sqlite  |     0.901 | 1109.947 |    0.831 |    1.820 |      0.133 |          1.619 |
| couchdb |   102.361 |    9.769 |   78.908 |  200.295 |     11.647 |        140.831 |

Well, that just sucks. With a million records, I kept slowing down. I tried various optimizations, and the best guess I can hazard is that it boils down to the extra seeking that an append-only file necessarily needs to do. Yet, CouchDb didn’t seem to suffer quite as much (although at that resolution, the noise is too large to know for sure). I noticed that read_chunk was spending a lot of time in the method itself. I have a lot more CRCs than SQLite – every “chunk” is individually encoded with its own CRC, while SQLite only puts a CRC at the end of each page.

On a hunch I commented out the CRCs and ran it again:

searching for 100 ids in a data set of 1000 entries
| target | total (s) |  batch/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
| roots  |     0.255 | 3917.836 |    0.234 |    0.610 |      0.031 |          0.378 |
| sqlite |     0.744 | 1344.281 |    0.678 |    1.841 |      0.154 |          1.584 |

searching for 100 ids in a data set of 100000 entries
| target | total (s) |  batch/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
| roots  |     0.755 | 1325.324 |    0.672 |    1.461 |      0.100 |          1.159 |
| sqlite |     0.867 | 1153.214 |    0.769 |    2.443 |      0.182 |          1.613 |

searching for 100 ids in a data set of 1000000 entries
| target | total (s) |  batch/s | min (ms) | max (ms) | stdev (ms) | 99th pctl (ms) |
| roots  |     1.138 |  878.519 |    1.071 |    1.730 |      0.052 |          1.374 |
| sqlite |     0.902 | 1109.092 |    0.825 |    1.781 |      0.131 |          1.551 |

Disabling the CRCs helped significantly. Checking the CRCs this often really is overkill, so I should come up with another approach – either follow SQLite’s model and put a CRC on the page itself, or find another way to reduce their impact.
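For concreteness, here’s a rough sketch of what per-chunk CRC framing looks like. The layout (length + CRC32 + payload) and the function names are illustrative assumptions, not Roots’ actual on-disk format; the verification on every decode is where the measured overhead comes from.

```rust
// Bitwise CRC-32 (reflected IEEE polynomial); slow but dependency-free.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

// Hypothetical chunk layout: [u32 length][u32 crc][payload bytes].
fn encode_chunk(payload: &[u8]) -> Vec<u8> {
    let mut chunk = Vec::with_capacity(8 + payload.len());
    chunk.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    chunk.extend_from_slice(&crc32(payload).to_be_bytes());
    chunk.extend_from_slice(payload);
    chunk
}

fn decode_chunk(chunk: &[u8]) -> Option<&[u8]> {
    let length = u32::from_be_bytes(chunk.get(0..4)?.try_into().ok()?) as usize;
    let stored_crc = u32::from_be_bytes(chunk.get(4..8)?.try_into().ok()?);
    let payload = chunk.get(8..8 + length)?;
    // Verifying the CRC on every single read is the per-chunk cost.
    (crc32(payload) == stored_crc).then_some(payload)
}

fn main() {
    let chunk = encode_chunk(b"hello");
    assert_eq!(decode_chunk(&chunk), Some(&b"hello"[..]));

    // Corrupt one payload byte: verification now fails.
    let mut corrupted = chunk.clone();
    corrupted[10] ^= 0xFF;
    assert_eq!(decode_chunk(&corrupted), None);
    println!("chunk framing ok");
}
```

Moving the CRC up to the page level, as SQLite does, amortizes that per-read cost over every chunk sharing the page.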

I wanted to share these results because to me they’re exciting. I still want to do more investigation to understand how Roots behaves as the dataset grows. If I grow it to 10,000,000 records, will the performance remain roughly the same or will it get slower? If it gets slower, can I determine why?

I’ll be aiming to upload the new benchmark as well as my very minor optimizations tomorrow.

I have some exciting new updates tonight. I not only published the aforementioned benchmark, I also ported my benchmarks over to Criterion. They’re executing on CI, and the reports are being published.

I added Sled benchmarks yesterday, but I kept having failures in CI – the Criterion runs took too long and sometimes failed outright. They’re still nice to have as a reference, and they can still be run locally, but they’re disabled on CI. Because of these failures, I uploaded a report from last night which includes CouchDB, Sled, SQLite, and Roots. I will keep the CouchDB benchmarks for posterity, but because of how far they skew the results, I won’t be publishing many graphs that actually include them.

I want to call attention to a few key metrics. This benchmark performs a single-row “get”-like operation on datasets of varying sizes – from 1,000 records to 1,000,000 records. If you want to reference the Criterion graphs, these are the randomized-dataset numbers.

| Database | Get/1k   | Get/100k | Get/1m   |
|  Roots   |   2.56us |   5.76us |   9.40us |
|   Sled   | 609.15ns | 798.81ns | 993.19ns |
| SQLite   |   7.89us |   7.77us |   7.80us |

The key takeaway here is that Sled is fast. There are some metrics where Roots currently surpasses Sled, but after seeing these results, I was truly shocked. Sub-microsecond reads?

I couldn’t get there. (Yet?) But, I got extremely close. Full report:

| Database | Get/1k   | Get/100k | Get/1m   |
|  Roots   |   1.08us |   1.15us |   1.15us |
|   Sled   | 564.72ns | 750.76ns | 918.18ns |
| SQLite   |   7.54us |   7.76us |   7.79us |

Most importantly, the last post in this thread concluded with uncertainty over how Roots behaved on larger datasets. With these new changes, the overhead of the CRCs is nearly immeasurable, and Roots’ performance is now solid across dataset sizes.

For the first time since I began this project, I am finally feeling confident in my ability to take it across the finish line with performance that I’ll be happy with – and I think others will be too.

This weekend I implemented more functionality, in an attempt to get closer to being able to port BonsaiDb onto Roots.

CRCs are still a performance impact

First, I wanted to point out a flawed understanding of what was being tested in the benchmarks above. I’ve changed the benchmarks after realizing that my “get” tests were essentially just measuring the speed difference between Roots’ and Sled’s caches. By increasing the number of keys requested, the benchmark exhausts the caches and begins testing the real performance of the database.

With the cache exhausted, CRCs still have a performance impact. I think I’ve decided to write a one-time verification function that can optionally be run once per tree load, or disabled entirely.

Multi-Key Requests

The first piece of functionality that I wanted to add support for was getting more than one record at a time. There are two ways this is done in BonsaiDb: a list of keys or a range of keys. As of today, both are now supported – get_multiple() and scan().
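A minimal in-memory sketch of those two access patterns follows. The Tree type here is a hypothetical stand-in backed by a BTreeMap, not Roots’ on-disk implementation; only the get_multiple()/scan() names come from the text above.

```rust
use std::collections::BTreeMap;
use std::ops::RangeBounds;

// Hypothetical in-memory stand-in for a tree of key-value entries.
struct Tree {
    entries: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl Tree {
    /// Fetch a specific list of keys, skipping any that are missing.
    fn get_multiple(&self, keys: &[&[u8]]) -> Vec<(Vec<u8>, Vec<u8>)> {
        keys.iter()
            .filter_map(|key| {
                self.entries
                    .get(*key)
                    .map(|value| (key.to_vec(), value.clone()))
            })
            .collect()
    }

    /// Return all entries whose keys fall within `range`, in key order.
    fn scan<R: RangeBounds<Vec<u8>>>(&self, range: R) -> Vec<(Vec<u8>, Vec<u8>)> {
        self.entries
            .range(range)
            .map(|(key, value)| (key.clone(), value.clone()))
            .collect()
    }
}

fn main() {
    let mut entries = BTreeMap::new();
    for id in 1u8..=5 {
        entries.insert(vec![id], vec![id * 10]);
    }
    let tree = Tree { entries };

    // A list of keys: missing key [9] is simply skipped.
    let got = tree.get_multiple(&[&[1], &[4], &[9]]);
    assert_eq!(got.len(), 2);

    // A range of keys: everything from [2] through [4], in order.
    let scanned = tree.scan(vec![2]..=vec![4]);
    assert_eq!(scanned.len(), 3);
    println!("get_multiple and scan ok");
}
```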

Updated Benchmarks

Before I call it a night, I wanted to share the benchmarks, now updated so they are no longer subject to Sled’s or Roots’ caches. These also include initial multi-get benchmarks, though no time has been spent optimizing this new functionality yet.

Designing the public API: Transactions

The way ACID-compliant transactions are being implemented is to require that a transaction declare which trees it wants exclusive access to. A TransactionHandle that owns the locks is returned – once it’s dropped, other blocked transactions will be unblocked.

This handle contains a unique ID that all writes to the trees are annotated with. Once all of the transactional operations are finished, the TransactionHandle is passed to the TransactionManager to be written to the transaction log. Once that log entry has been saved, the transaction is fully committed to disk.
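A minimal sketch of that lock-owning handle pattern, with hypothetical types (TreeLock, begin_transaction) standing in for Roots’ real API: the handle holds the guards, so dropping it is what releases the trees.

```rust
use std::sync::{Mutex, MutexGuard};

// Hypothetical per-tree lock; the real implementation guards tree state.
struct TreeLock(Mutex<()>);

struct TransactionHandle<'a> {
    // The unique ID that writes made within the transaction are annotated with.
    id: u64,
    // Holding the guards keeps the trees exclusively locked. Dropping the
    // handle releases them, unblocking other waiting transactions.
    _guards: Vec<MutexGuard<'a, ()>>,
}

fn begin_transaction<'a>(id: u64, trees: &'a [TreeLock]) -> TransactionHandle<'a> {
    // Locking in a consistent (slice) order avoids deadlocks between
    // transactions that declare overlapping tree sets.
    let guards = trees.iter().map(|tree| tree.0.lock().unwrap()).collect();
    TransactionHandle { id, _guards: guards }
}

fn main() {
    let trees = vec![TreeLock(Mutex::new(())), TreeLock(Mutex::new(()))];
    {
        let tx = begin_transaction(1, &trees);
        assert_eq!(tx.id, 1);
        // While the handle lives, the trees are exclusively held.
        assert!(trees[0].0.try_lock().is_err());
    } // `tx` dropped here: the locks are released.
    assert!(trees[1].0.try_lock().is_ok());
    println!("transaction handle lock lifecycle ok");
}
```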

Tonight, I pushed two passing unit tests: transaction isolation and rollback. This is great, but I still have one more detail to implement: ensuring that data written but rolled back (or never completed) is skipped.

Upon relaunching, the logic to find the most recent tree header needs to query the transaction manager to find out if the transaction has been completed. If not, it needs to keep scanning backwards. This is what will ensure that both planned and unplanned rollbacks work correctly.
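That relaunch scan can be sketched as follows, with illustrative types (TreeHeader and a committed-ID set standing in for querying the transaction manager): walk headers from newest to oldest and keep the first one whose transaction the log confirms.

```rust
use std::collections::HashSet;

// Hypothetical tree header; the real one carries much more state.
struct TreeHeader {
    transaction_id: u64,
}

// Keep scanning backwards until a header's transaction is confirmed
// committed, so both planned and unplanned rollbacks are skipped.
fn most_recent_committed<'a>(
    headers_newest_first: &'a [TreeHeader],
    committed_transactions: &HashSet<u64>,
) -> Option<&'a TreeHeader> {
    headers_newest_first
        .iter()
        .find(|header| committed_transactions.contains(&header.transaction_id))
}

fn main() {
    // Transaction 3 was written but never made it into the transaction log
    // (a rollback, or a crash mid-commit), so its header must be skipped.
    let headers = vec![
        TreeHeader { transaction_id: 3 },
        TreeHeader { transaction_id: 2 },
        TreeHeader { transaction_id: 1 },
    ];
    let committed: HashSet<u64> = [1, 2].into_iter().collect();
    let header = most_recent_committed(&headers, &committed).unwrap();
    assert_eq!(header.transaction_id, 2);
    println!("recovered to transaction 2, skipping uncommitted 3");
}
```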

Almost ready to integrate

I spent more time today thinking about what’s needed to hook it up to BonsaiDb, and it’s honestly pretty close. While Sled has a lot of features, BonsaiDb doesn’t use much beyond the basic functionality that has now been implemented. I’m hopeful that within the next few days, I’ll be ready to attempt swapping it out and seeing how the unit test suite behaves.

Of course, even if it passes all of the unit tests, there’s still more important work to be done before calling Roots feature complete. Still, I can’t wait to get it all tied together.

Exciting news: The roots branch now passes all unit tests without sled.

This is going to be the last post in this particular thread, as I’m going to be merging this into main soon. I dislike large pull requests, and this seems like a good milestone to merge even though I still have plans for Roots.

Optimizing complex transactions

The new Roots layer is fast enough to allow BonsaiDb to perform quickly, but in its current implementation, simple inserts are 200% slower than they were on Sled. There are multiple reasons for that, all stemming from the initial port to Roots doing the bare minimum to get things functioning.

The way transactions are implemented in BonsaiDb currently requires writing the data to one tree, a transaction log entry to another tree, and some extra data depending on what views that collection might have.

Currently, as each of these operations is made, the files are written and fully flushed. This means that if the same tree has multiple operations done on it within a transaction, each one will hit the disk, and the rest of the code processing the transaction can’t continue until the write is fully flushed.

As I’m sure you can imagine, there are lots of improvements to be made, but to me, the most exciting ones could shape Roots into an interesting crate to use on its own.

Multi-threading transactions

I’ve always envisioned using a thread pool to streamline Roots, but until I had it hooked up to BonsaiDb, I wasn’t sure if it was something I wanted to implement in BonsaiDb or something that Roots should handle independently.

My current plan is to add a thread pool to Roots: when a transaction is sent to be committed, each file will be written and flushed in the pool. This is still a two-step process – the trees must be flushed before the transaction log is, to preserve my ACID guarantees.
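That two-step commit can be sketched like this. Everything here is a stand-in (Roots would reuse a pool rather than spawning a thread per file, and the “flush” is simulated), but the ordering constraint is the point: every tree flush completes before the log entry is written.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical two-step commit: flush each tree on a worker, wait for all of
// them, and only then write the transaction log entry.
fn commit_transaction(tree_names: Vec<String>, log: &mut Vec<String>) {
    let (sender, receiver) = mpsc::channel();
    let tree_count = tree_names.len();
    for tree in tree_names {
        let sender = sender.clone();
        thread::spawn(move || {
            // Stand-in for writing and fsyncing one tree file.
            sender.send(tree).unwrap();
        });
    }
    // Step 1: every tree must be durable before the log entry is written.
    let flushed: Vec<String> = receiver.iter().take(tree_count).collect();
    assert_eq!(flushed.len(), tree_count);
    // Step 2: writing the log entry is what makes the transaction committed.
    log.push(format!("committed {} trees", tree_count));
}

fn main() {
    let mut log = Vec::new();
    commit_transaction(vec!["basic".into(), "by-name-view".into()], &mut log);
    assert_eq!(log.len(), 1);
    println!("{}", log[0]);
}
```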

Switching to Roots’ features

The current implementation is a direct port of the existing code. That includes a transaction log implementation separate from Roots’ own. The transaction log in Roots was purpose-built for this job, so the additional transaction log is no longer needed.

Additionally, Roots has built-in support for encryption, while BonsaiDb was written against a database that didn’t support it. Switching this over will reduce code complexity in BonsaiDb, and it will remove the limitations on what is possible with an encrypted database.

Re-implement view indexes

View indexes are currently implemented by giving them their own TreeFile in Roots. This is wasteful, because a TreeFile implies using both a “by-id” index for access by primary key, and a “by-sequence” index for access by “sequence” – a unique id that represents a specific version of a specific key-value entry. Views only need the by-id index, which means any time or IO spent updating the by-sequence index is wasted.

My idea is to change how TreeFile works. Instead of it assuming the contents of the file to be those two indexes, it will delegate it to a generic type. This will allow view indexes to only save the information they care about. Additionally, it might turn Roots into a more general-purpose crate – I’m not sure what combinations of information someone might want to store, but I’ve already thought of other ideas on how to better utilize Roots to potentially add neat features using this type of customization. For example, quadtrees could be added to BonsaiDb, enabling optimized spatial queries.
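A rough sketch of that delegation, with hypothetical types (an in-memory BTreeMap stands in for Roots’ on-disk B-Tree indexes): TreeFile takes its index layout as a type parameter, so a view index can drop the by-sequence index entirely.

```rust
use std::collections::BTreeMap;

// Hypothetical trait: a TreeFile delegates what it stores per write.
trait TreeIndexes {
    fn record_write(&mut self, key: Vec<u8>, value: Vec<u8>, sequence: u64);
}

// The full document store keeps both indexes...
#[derive(Default)]
struct DocumentIndexes {
    by_id: BTreeMap<Vec<u8>, Vec<u8>>,
    by_sequence: BTreeMap<u64, Vec<u8>>,
}

impl TreeIndexes for DocumentIndexes {
    fn record_write(&mut self, key: Vec<u8>, value: Vec<u8>, sequence: u64) {
        self.by_sequence.insert(sequence, key.clone());
        self.by_id.insert(key, value);
    }
}

// ...while a view index only needs by-id, skipping the wasted by-sequence IO.
#[derive(Default)]
struct ViewIndexes {
    by_id: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl TreeIndexes for ViewIndexes {
    fn record_write(&mut self, key: Vec<u8>, value: Vec<u8>, _sequence: u64) {
        self.by_id.insert(key, value);
    }
}

struct TreeFile<I: TreeIndexes> {
    indexes: I,
    next_sequence: u64,
}

impl<I: TreeIndexes> TreeFile<I> {
    fn push(&mut self, key: &[u8], value: &[u8]) {
        self.next_sequence += 1;
        self.indexes
            .record_write(key.to_vec(), value.to_vec(), self.next_sequence);
    }
}

fn main() {
    let mut docs = TreeFile { indexes: DocumentIndexes::default(), next_sequence: 0 };
    docs.push(b"key", b"v1");
    docs.push(b"key", b"v2");
    // The document store tracks every version via by-sequence...
    assert_eq!(docs.indexes.by_sequence.len(), 2);

    let mut view = TreeFile { indexes: ViewIndexes::default(), next_sequence: 0 };
    view.push(b"key", b"v1");
    view.push(b"key", b"v2");
    // ...while the view index only keeps the current value per key.
    assert_eq!(view.indexes.by_id.len(), 1);
    println!("generic tree indexes ok");
}
```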

Next steps

I am almost ready to merge this pull request, but I may split Roots into its own crate first. I had never envisioned Roots itself being a general-purpose library, but my view of this has changed over the last few days. I expected to be happy enough with the performance to use it in BonsaiDb, but I didn’t expect that anyone would want a lower-level database implementation than what BonsaiDb aims to provide. Now, I’m not so sure!