While I have done some cool things with Gooey that I will want to talk about soon, I’ve been fixated on a problem since Thursday, after reading a comment on Reddit. A detail that I had been keeping to myself is that I couldn’t actually run my test suite on my MacBook Air M1. The reason was memory usage. On Linux, my test suite doesn’t really seem to consume much RAM at all. But on my Mac, before Thursday it would balloon to over 40 GB of memory used, prompting the OS to ask me to force quit applications.
After reading that comment, I decided to try changing the cache size from the default 1 GB down to 1 MB, as others had suggested it helped. It seemed to… but it really didn’t. Still, armed with more hints that this could be sled-related, I started to dig in further.
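For reference, shrinking the cache with sled looks roughly like this; the path and the exact place BonsaiDb sets this are placeholders, not the project’s real configuration code:

```rust
// A minimal sketch of lowering sled's cache limit. The path is a placeholder,
// and this is not the exact spot BonsaiDb configures it.
fn open_test_db() -> sled::Result<sled::Db> {
    sled::Config::new()
        .path("test-db")
        .cache_capacity(1024 * 1024) // 1 MB instead of the roughly 1 GB default
        .open()
}
```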
The Problematic Test
Edit: As the next post details, the next section was an incorrect diagnosis of the cause of the memory issues.
I eventually determined that the only test suite that caused this issue is the unified core-suite, which uses a single persistent server. Each unit test connects over either a QUIC or WebSocket connection. One of the tests exercises list_transactions(). That API returns a maximum of 1,000 transactions, so to test that limit, I needed to insert 1,001 transactions:
futures::future::join_all(
    (0..=(LIST_TRANSACTIONS_MAX_RESULTS))
        .map(|_| async { collection.push(&Basic::default()).await.unwrap() }),
)
.await;
I did some more debugging and began to understand one problem a bit better: each connection could spam the server with as many requests as it wanted, and the server would try to execute all of them in parallel. But the server has a setting limiting how many requests it can execute at a time, which in effect meant that as soon as list_transactions_test began, tests ground to a halt – sometimes stuck in the middle of executing because their next request was queued behind the 1,000 transaction inserts.
On my Linux machine, this caused some fun timing-sensitive tests to be problematic, but on my Mac, this meant that more tests piled up as it slowly chomped away at the transactions. I decided it was time to add a limit to how many requests a client can have in-flight at any moment. This change made the test suite execute smoothly on my Linux machine for the first time in a long while!
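Here’s a minimal sketch of the idea using a tokio semaphore; the limit of 16, the Request type, and handle_request are placeholders rather than BonsaiDb’s actual API:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical request type and handler standing in for the real networking code.
struct Request(u64);

async fn handle_request(request: Request) {
    println!("handled request {}", request.0);
}

#[tokio::main]
async fn main() {
    // At most 16 requests from this connection may be executing at once.
    let permits = Arc::new(Semaphore::new(16));
    let handles: Vec<_> = (0..1_000u64)
        .map(|id| {
            let permits = permits.clone();
            tokio::spawn(async move {
                // Each request must acquire a permit before it is dispatched.
                let _permit = permits.acquire_owned().await.unwrap();
                handle_request(Request(id)).await;
            })
        })
        .collect();
    futures::future::join_all(handles).await;
}
```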
On the Mac, this helped too, but once that test ran, I continued to witness the same memory growth as before. On a whim, I commented the test out and everything ran smooth as butter. As I re-read some of the sled issues on GitHub, my situation looked more and more similar to issues other people had been reporting.
To this day, I’m still baffled as to why this happens on my Mac but not my Linux PC. At the end of Thursday, I summarized my findings to @dAxpeDDa (Dax) in a private chat and went to make dinner. As I cooked, I couldn’t stop thinking about it. My thought process went:
- 1,000 documents, each must be less than 1,000 bytes for sure. That’s only 1 megabyte of raw data.
- There’s overhead that I introduce with the transaction log too, but it’s a fraction of the document data. Let’s assume it ends up doubling the data, just as a worst case. That’s still only 2 megabytes of data being written.
- Issuing these transactions into sled causes memory to grow steadily as the inserts happen, but by orders of magnitude more than the actual data being written.
With the knowledge that these issues have been reported and worked on since 2019, I started looking for other storage layer replacements. As I searched, I didn’t find many that were pure Rust, which is one of the reasons I’m writing BonsaiDb in the first place.
I’ve written before about how I’m very familiar with CouchDB, as I used it to grow my previous startup from a single server to a geo-redundant cluster over ten years. I knew that CouchDB used a custom append-only format to implement its B-Tree storage. I decided to search, and lo and behold: there’s a C implementation (Couchstore) with good overview documentation. Thus, I left Dax a note.
Setting the Goalposts
From the outset, I’ve never viewed BonsaiDb as something that will compete on raw performance with PostgreSQL or Redis, the two databases I’m replacing in my own stack with BonsaiDb. That’s the same expectation I had when using CouchDB before as well. CouchDB has its own data guarantees and performance characteristics, just as PostgreSQL and Redis have theirs. My goal for BonsaiDb was to be competitive with CouchDB’s performance, because I knew that level of performance was manageable and predictable as a developer and business owner.
When I embraced another library as my underlying storage layer, I took no responsibility for raw performance. I understand the basic philosophies of how to optimize IO, but I’m not a database engineer. By taking this responsibility on myself, I feel like I must make it clear: BonsaiDb should be fast enough for most users, but I highly doubt it will ever compete with PostgreSQL for ACID-compliant transactions or with Redis for non-ACID-compliant key-value storage. I’m not someone to obsess over benchmarks: I will make and publish a benchmark suite that attempts to be fair, and I will do my best to apply the obvious optimizations.
Beyond that, I welcome anyone else’s contributions that improve efficiency without sacrificing maintainability too much. I care about code maintainability more than I do about raw performance.
Another key goal is ACID-compliant, multi-collection transactions – something that CouchDB/Couchstore doesn’t offer. I refuse to sacrifice this capability, as it’s one of the reasons I preferred to store some data in PostgreSQL at my previous company. For some data, you just don’t want to ever roll the dice.
Those are the goals that required a little explanation, but I also want to have:
- Cross-platform, but support io_uring on Linux. I’m primarily interested in hosting servers on Linux, and using this IO abstraction might help make up a little bit for my lack of experience as a database developer.
- File-level encryption support.
- Settable in-memory cache limits.
- Ability to write to trees without transactions.
- Fine-control over flushing behavior.
Starting the journey
I began by creating an abstraction between tokio-uring and tokio::fs. It is a bit ugly in its current form, as it’s been iterated on over and over these past two days, and I haven’t spent time cleaning it up.
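To give a feel for its shape, here’s a rough sketch of the kind of trait that could sit in front of both backends; the names are hypothetical and not the project’s actual API:

```rust
use std::io::Result;
use std::path::Path;

// A hypothetical unifying trait. tokio-uring's positional, owned-buffer API is
// the reason reads return a fresh buffer and writes take ownership of theirs.
#[async_trait::async_trait]
trait AsyncFile: Sized {
    /// Open (or create) the file at `path`.
    async fn open(path: &Path) -> Result<Self>;
    /// Read `len` bytes starting at `position` into a freshly allocated buffer.
    async fn read_at(&mut self, position: u64, len: usize) -> Result<Vec<u8>>;
    /// Write all of `data` starting at `position`.
    async fn write_at(&mut self, position: u64, data: Vec<u8>) -> Result<()>;
    /// Flush data and metadata to disk.
    async fn synchronize(&mut self) -> Result<()>;
}
```

The idea is that one implementation wraps tokio::fs::File and another wraps tokio_uring::fs::File, so the rest of the code only ever sees the trait.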
As I said before, I’m not someone to chase benchmarks, but on Saturday I got my initial tests working, and I decided to change the test from trying 100 items to larger numbers and turn on release mode.
Before I even get to the numbers, let me say that this is a very bad test of performance. It consists of repeating these steps:
- Creating/opening a transaction log file
- Inserting a new transaction
- Flushing it fully, then closing the file
Each loop, the file is opened, inspected to learn the last transaction state, written to, then closed. It’s really not an indicator of performance. But, if it was abysmally slow, I knew I was doing something wrong. The total file size ends up at the “page size” times 1 million iterations, which conveniently is about 1 gigabyte.
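As a rough illustration, each iteration boils down to something like this simplified, synchronous sketch; the real test goes through the async file abstraction and the actual log format:

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

// A simplified stand-in for the benchmark loop: open, append one page-sized
// entry, flush fully, close, repeat. The 1,024-byte page and its contents are
// placeholders for the real transaction log format.
fn run_iterations(path: &str, count: u64) -> std::io::Result<()> {
    for id in 0..count {
        let mut file = OpenOptions::new()
            .create(true)
            .read(true)
            .write(true)
            .open(path)?;
        // The real log inspects the file here to find and verify the last
        // transaction before appending.
        file.seek(SeekFrom::End(0))?;
        let mut page = vec![0_u8; 1024];
        page[..8].copy_from_slice(&id.to_be_bytes());
        file.write_all(&page)?;
        // Flush everything to disk before the file is dropped (closed).
        file.sync_all()?;
    }
    Ok(())
}
```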
Good news: at roughly 20k iterations per second on my AMD Ryzen 3800X, I was happy enough to move on! Later that night, I pulled the code onto my M1 MacBook Air and ran the same test: it beat my Ryzen by averaging 25k iterations per second.
Planning for Transactions
At this point, I had working versions of the async file abstraction, the block-level encryption abstraction (with a Rot-13 test implementation), and the append-only transaction log implementation. I had just finished some concurrency tests, but while writing them I realized something: my file locking strategy wasn’t flexible enough to support non-exclusive modes in the future.
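As an aside, here’s a sketch of the kind of block-level encryption abstraction I mean, with a Rot-13 test implementation; the trait name is made up, not the project’s actual API:

```rust
// Hypothetical block-level encryption trait. Rot-13 offers no security; it only
// exists to prove the encryption plumbing round-trips data correctly.
trait BlockCrypto {
    fn encrypt(&self, block: &mut [u8]);
    fn decrypt(&self, block: &mut [u8]);
}

struct Rot13;

impl BlockCrypto for Rot13 {
    fn encrypt(&self, block: &mut [u8]) {
        for byte in block.iter_mut() {
            *byte = match *byte {
                b'a'..=b'z' => (*byte - b'a' + 13) % 26 + b'a',
                b'A'..=b'Z' => (*byte - b'A' + 13) % 26 + b'A',
                other => other,
            };
        }
    }

    fn decrypt(&self, block: &mut [u8]) {
        // Rot-13 is its own inverse, which makes round-trip tests trivial.
        self.encrypt(block);
    }
}
```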
I began to think about how the real flow of the program would actually function. The way I’m guaranteeing ACID compliance is based around the append-only format. Here’s an overview of how it works:
- The file is broken into “blocks” of data. For Couchstore, this is 4,096 bytes. For the transaction log, I’ve chosen 1,024 bytes (a larger block size would waste too much space in the transaction log).
- The first byte at each block boundary is either a 0 or a 1. If it’s a 1, that block contains the start of a record. If it’s a 0, the block is a continuation of a chunk of data that couldn’t fit within a single block.
- When opening the file, the reader seeks to the last multiple of the block size and scans backwards until a 1 is found on a block boundary. That record is then parsed and verified before the log is considered open (see the sketch after this list).
- If a write is interrupted before a chunk is able to be written fully, the opening procedure can detect this and truncate the file to the last completed record’s end.
- After the transaction log is flushed, ACID compliance has been achieved and the writer can report success back.
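Here’s a sketch of what that recovery scan could look like, assuming the 1,024-byte blocks described above; record parsing and truncation are elided, and the function is illustrative rather than the project’s actual code:

```rust
use std::fs::File;
use std::io::{Read, Result, Seek, SeekFrom};

const BLOCK_SIZE: u64 = 1024;

// Walk backwards from the last block boundary until a block that starts a
// record (marker byte of 1) is found. The caller would then parse and verify
// that record, truncating anything after it if the last write was interrupted.
fn find_last_record_start(file: &mut File) -> Result<Option<u64>> {
    let length = file.metadata()?.len();
    let mut offset = (length / BLOCK_SIZE) * BLOCK_SIZE;
    while offset >= BLOCK_SIZE {
        offset -= BLOCK_SIZE;
        file.seek(SeekFrom::Start(offset))?;
        let mut marker = [0_u8; 1];
        file.read_exact(&mut marker)?;
        if marker[0] == 1 {
            return Ok(Some(offset));
        }
    }
    Ok(None)
}
```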
If I imagine having a pool of database workers performing transactions, I can see a clear way to optimize those workers: allow multiple transactions to be prepared by each worker while waiting for the transaction log to commit them.
In this flow, I could pass ownership of the log file between each of the workers, or I could introduce a single thread responsible for committing all transaction log entries. To me, giving the transaction log its own thread made immediate sense, so that’s what I did this morning.
One of the other realizations I had in this model is that sometimes the TransactionManager may have more than one transaction queued and ready to write. Each transaction log entry may be as small as 1KB, and in my head, 16KB was the smallest amount that was worthwhile to flush. I don’t know where I got that number from – chalk it up to a good guesstimate. I wrote the transaction manager to attempt to write up to 16 transactions in a single batch before flushing, and then notify all 16 transaction workers after the flush succeeded. If a full 16 transactions weren’t ready, only what was available would be written before flushing, to keep the delay for a single insert as small as possible.
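The batching itself can be sketched with plain std channels; the PendingTransaction type and log_thread function are stand-ins for whatever the real TransactionManager does internally:

```rust
use std::sync::mpsc::{Receiver, Sender};

// One queued transaction: its serialized log entry plus a way to notify the
// worker that submitted it once the entry is durably flushed.
struct PendingTransaction {
    entry: Vec<u8>,
    completion: Sender<()>,
}

fn log_thread(queue: Receiver<PendingTransaction>, log: &mut impl std::io::Write) {
    const BATCH_SIZE: usize = 16;
    while let Ok(first) = queue.recv() {
        let mut batch = vec![first];
        // Grab whatever else is already waiting, up to the batch limit, but
        // never block waiting for more: a lone insert should flush promptly.
        while batch.len() < BATCH_SIZE {
            match queue.try_recv() {
                Ok(transaction) => batch.push(transaction),
                Err(_) => break,
            }
        }
        for transaction in &batch {
            log.write_all(&transaction.entry).unwrap();
        }
        // One flush covers the whole batch.
        log.flush().unwrap();
        for transaction in batch {
            let _ = transaction.completion.send(());
        }
    }
}
```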
The new test is a bit more realistic: using 10 workers, spam the TransactionManager with 100,000 entries each. The initial test on my Linux machine showed an increase to ~115k transactions per second. On my M1 MacBook Air: ~85k per second.
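The shape of that test is roughly the following, reusing the hypothetical PendingTransaction type from the sketch above; the names and counts mirror the description rather than the actual test code:

```rust
use std::sync::mpsc;

fn spam_transaction_log(log_queue: mpsc::Sender<PendingTransaction>) {
    // Ten workers, each submitting 100,000 entries and waiting for each one to
    // be confirmed as flushed before sending the next.
    let workers: Vec<_> = (0..10)
        .map(|worker| {
            let log_queue = log_queue.clone();
            std::thread::spawn(move || {
                for id in 0u32..100_000 {
                    let (completion, confirmed) = mpsc::channel();
                    log_queue
                        .send(PendingTransaction {
                            entry: format!("worker {worker} entry {id}").into_bytes(),
                            completion,
                        })
                        .unwrap();
                    confirmed.recv().unwrap();
                }
            })
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
}
```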
At the end of the day, these numbers aren’t meant to show that I’m doing anything impressively fast. They’re just to prove to myself that “it’s probably good enough,” allowing me to move on. To have a realistic end-user benchmark that’s relevant to BonsaiDb, I need to implement document storage.
Wait, it’s not done?
Of course not; I only started it on Friday after spending the morning playing Factorio with friends.
I found myself giving updates to Dax in a private chat, which doesn’t do anything to further the community. Thus, I was inspired to write up my journey so far, and use this thread as a place to report on my continued journey of replacing the storage layer of BonsaiDb.
I’m still wanting to work on Gooey to get to the point of building BonsaiDb administration tools in it. I already have lots of great things to show off since my last update, but my brain has been practically obsessed with this new project. Hopefully, it will resolve one of the last uncertainties I’ve had with BonsaiDb.