New BonsaiDb Example: Histograms

ecton · October 15, 2021, 2:06am

I had an interesting thought a few weeks ago: could you use the map/reduce views in BonsaiDb to store histograms? Today, I explored that idea and decided to publish the example.

Using BonsaiDb’s map/reduce Views to produce histograms

When I originally had this idea, I checked whether the excellent hdrhistogram crate had built-in serialization support. It does, although I had to include a small wrapper to support Serde.

With that question answered, it was down to the actual task at hand: storing raw sample data and producing histograms as a result of querying the database. For the data storage, this is the structure I store as a collection (view-histogram.rs:97-104):

/// A set of samples that were taken at a specific time.
#[derive(Debug, Serialize, Deserialize)]
pub struct Samples {
    /// The timestamp of the samples.
    pub timestamp: u64,
    /// The raw samples.
    pub entries: Vec<u64>,
}

BonsaiDb will allow inserting more than one entry with the same timestamp, as we haven’t defined it as a unique key. That’s OK – BonsaiDb will handle that for us by returning all of the relevant data based on the view query.

The View is going to allow us to query for histograms that can be filtered based on the timestamp they were recorded at. Using a real world example, this could be used to compare how benchmarks performed last week against this week.

So, the first part of the view will define the Key as a u64 and the Value as a StoredHistogram (view-histogram.rs:118-151):

/// A view for [`Samples`] which produces a histogram.
#[derive(Debug)]
pub struct AsHistogram;

impl View for AsHistogram {
    type Collection = Samples;
    type Key = u64;
    type Value = StoredHistogram;

    fn version(&self) -> u64 {
        0
    }

    fn name(&self) -> Result<Name, InvalidNameError> {
        Name::new("as-histogram")
    }

    fn map(
        &self,
        document: &Document<'_>,
    ) -> bonsaidb::core::schema::MapResult<Self::Key, Self::Value> {
        let samples = document.contents::<Samples>()?;
        let mut histogram = Histogram::new(4).unwrap();
        for sample in &samples.entries {
            histogram.record(*sample).unwrap();
        }

        Ok(Some(document.emit_key_and_value(
            samples.timestamp,
            StoredHistogram(histogram.into_sync()),
        )))
    }

The map function receives the stored document, and deserializes it as Samples – the type we originally inserted. It then creates a new histogram with all of its samples, and emits the key/value pair of the timestamp and the histogram.
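To see the shape of the map step without pulling in BonsaiDb or hdrhistogram, here is a minimal std-only sketch. `ToyHistogram`, `count_at`, and `map_samples` are hypothetical stand-ins invented for illustration, not part of either crate, and the toy histogram stores exact counts rather than HDR buckets.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an HDR histogram: exact value -> count.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ToyHistogram {
    counts: BTreeMap<u64, u64>,
}

impl ToyHistogram {
    /// Records one sample.
    pub fn record(&mut self, value: u64) {
        *self.counts.entry(value).or_insert(0) += 1;
    }

    /// Total number of recorded samples.
    pub fn len(&self) -> u64 {
        self.counts.values().sum()
    }

    /// How many times `value` was recorded.
    pub fn count_at(&self, value: u64) -> u64 {
        self.counts.get(&value).copied().unwrap_or(0)
    }
}

/// Same fields as the `Samples` document in the example.
pub struct Samples {
    pub timestamp: u64,
    pub entries: Vec<u64>,
}

/// Mirrors the shape of `map`: one document in, one (key, value) pair out.
pub fn map_samples(samples: &Samples) -> (u64, ToyHistogram) {
    let mut histogram = ToyHistogram::default();
    for sample in &samples.entries {
        histogram.record(*sample);
    }
    (samples.timestamp, histogram)
}
```

Calling `map_samples` on a document with timestamp 7 and entries `[10, 10, 25]` yields the key 7 and a histogram holding three samples, two of them at 10.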

With this implemented, the query() and query_with_docs() functions can be used with the view. These let you retrieve all of the key/value mappings as well as their associated documents, which hold the original sample data. This might be useful, but it’s not our real goal here: we want to get a single histogram from the database with a single query.

The reduce function is going to take an array of u64/StoredHistogram pairs, and convert it into a single StoredHistogram (view-histogram.rs:153-169):

    fn reduce(
        &self,
        mappings: &[MappedValue<Self::Key, Self::Value>],
        _rereduce: bool,
    ) -> Result<Self::Value, view::Error> {
        let mut mappings = mappings.iter();
        let mut combined = SyncHistogram::from(
            mappings
                .next()
                .map(|h| h.value.0.deref().clone())
                .unwrap_or_else(|| Histogram::new(4).unwrap()),
        );
        for map in mappings {
            combined.add(map.value.0.deref()).unwrap();
        }
        Ok(StoredHistogram(combined))
    }
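The reduce above ignores `_rereduce`, which works because merging histograms is associative: combining partial results gives the same answer as combining everything at once. The following std-only sketch illustrates that property. `Counts`, `merge`, and `reduce` here are hypothetical toy names, with a plain value-to-count map standing in for the real histogram type.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a histogram: exact value -> count.
pub type Counts = BTreeMap<u64, u64>;

/// Adds every bucket of `other` into `into`, in the spirit of `combined.add(...)`.
pub fn merge(into: &mut Counts, other: &Counts) {
    for (value, count) in other {
        *into.entry(*value).or_insert(0) += *count;
    }
}

/// Reduces any number of histograms to one; empty input yields an empty histogram.
pub fn reduce(histograms: &[Counts]) -> Counts {
    let mut combined = Counts::new();
    for histogram in histograms {
        merge(&mut combined, histogram);
    }
    combined
}
```

Reducing `[a, b, c]` in one call and reducing `[reduce([a, b]), c]` produce identical maps, so a cached partial reduction can be fed back into the same function.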

That should be it – now let’s insert some fake data (view-histogram.rs:36-57):

#[tokio::main]
async fn main() -> Result<(), bonsaidb::local::Error> {
    let db = Database::<Samples>::open_local(
        "view-histogram.bonsaidb",
        Configuration::default(),
    )
    .await?;

    println!("inserting 100 new sets of samples");
    let mut rng = StdRng::from_entropy();
    for timestamp in 1..=100 {
        // This inserts a new record, generating a random range that will trend
        // upwards as `timestamp` increases.
        db.collection::<Samples>()
            .push(&Samples {
                timestamp,
                entries: (0..100)
                    .map(|_| rng.gen_range(50 + timestamp / 2..115 + timestamp))
                    .collect(),
            })
            .await?;
    }
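The bounds passed to `gen_range` are easy to misread, because `..` binds more loosely than `+` and `/`. This small std-only helper (`sample_range` is a hypothetical name, not part of the example) spells out the half-open range used for a given timestamp.

```rust
use std::ops::Range;

/// The half-open range the example passes to `gen_range` for a timestamp.
/// Parses as `(50 + timestamp / 2)..(115 + timestamp)`.
pub fn sample_range(timestamp: u64) -> Range<u64> {
    50 + timestamp / 2..115 + timestamp
}
```

For timestamp 1 the samples fall in `50..116`, and for timestamp 99 they fall in `99..214`, so both ends of the range drift upward as the timestamp grows.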

Now that we have data in our collection, we can use the view to pull a histogram out (view-histogram.rs:60-68):

    // We can ask for a histogram of all the data:
    let total_histogram = db
        .reduce::<AsHistogram>(None, AccessPolicy::UpdateBefore)
        .await?;
    println!(
        "99th Percentile overall: {} ({} samples)",
        total_histogram.value_at_quantile(0.99),
        total_histogram.len()
    );

That one-liner did all the heavy lifting of compiling all of the samples into a histogram for us.
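For readers unfamiliar with quantile lookups, this is roughly what asking a histogram for a percentile means. The sketch below is std-only and works on exact counts. It is an illustration of the idea under that assumption, not hdrhistogram's implementation, which answers at bucket resolution.

```rust
use std::collections::BTreeMap;

/// Returns the smallest recorded value such that at least `quantile` of all
/// samples are less than or equal to it, or `None` for an empty histogram.
pub fn value_at_quantile(counts: &BTreeMap<u64, u64>, quantile: f64) -> Option<u64> {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return None;
    }
    // Number of samples that must sit at or below the answer (at least one).
    let target = ((quantile.clamp(0.0, 1.0) * total as f64).ceil() as u64).max(1);
    let mut seen: u64 = 0;
    // BTreeMap iterates in ascending key order, so this walks values low to high.
    for (value, count) in counts {
        seen += *count;
        if seen >= target {
            return Some(*value);
        }
    }
    counts.keys().next_back().copied()
}
```

With counts of two samples at 10, one at 20, and one at 30, the 0.5 quantile is 10 and the 0.75 quantile is 20.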

Now, what about the goal of pulling a histogram for a specific range of timestamps? Change the None parameter to a range (view-histogram.rs:71-76):

    let range_histogram = db
        .reduce::<AsHistogram>(
            Some(QueryKey::Range(10..20)),
            AccessPolicy::UpdateBefore,
        )
        .await?;

It’s that easy. And, with the power of Rust’s type system, you’ll notice that you receive a StoredHistogram as the result of this query. This is true even when using BonsaiDb as a server over the internet.
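Conceptually, a ranged reduce keeps only the mappings whose key falls inside the range and then merges them. Here is a std-only sketch of that idea, assuming the range is half-open like a Rust `Range`. `Counts` and `reduce_range` are hypothetical names, and the mappings are a plain slice of pairs so that duplicate timestamps are allowed, as in the collection above.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Hypothetical stand-in for a histogram: exact value -> count.
pub type Counts = BTreeMap<u64, u64>;

/// Merges only the mappings whose key falls inside `keys` (end excluded).
pub fn reduce_range(mappings: &[(u64, Counts)], keys: Range<u64>) -> Counts {
    let mut combined = Counts::new();
    for (key, histogram) in mappings {
        if keys.contains(key) {
            for (value, count) in histogram {
                *combined.entry(*value).or_insert(0) += *count;
            }
        }
    }
    combined
}
```

Given mappings at timestamps 9, 10, 10, 19, and 20, `reduce_range(&mappings, 10..20)` merges the two entries at 10 and the one at 19, and skips 9 and 20.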

That sounds like a lot of work for the database

Executing view queries/reductions multiple times will use cached versions of histograms. Some calculations are currently still done each time, but significantly fewer than the first time the view is accessed. Updating a single document or inserting a single new document will not force all histograms to be recalculated, only a subset.
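As a rough mental model of that caching, consider an index that remembers each document's mapped histogram and only re-maps documents that changed. This std-only sketch is a toy built on that assumption, not a description of BonsaiDb's internals. `ToyIndex`, `put`, and `map_calls` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a histogram: exact value -> count.
pub type Counts = BTreeMap<u64, u64>;

/// Toy index that caches the mapped histogram of every document.
#[derive(Default)]
pub struct ToyIndex {
    documents: BTreeMap<u32, Vec<u64>>, // document id -> raw samples
    mapped: BTreeMap<u32, Counts>,      // cached map output per document
    dirty: Vec<u32>,                    // ids changed since the last reduce
    pub map_calls: usize,               // how often the map step has run
}

impl ToyIndex {
    /// Inserts or overwrites a document and marks it for re-mapping.
    pub fn put(&mut self, id: u32, samples: Vec<u64>) {
        self.documents.insert(id, samples);
        self.dirty.push(id);
    }

    /// Re-maps only the dirty documents, then merges all cached histograms.
    pub fn reduce(&mut self) -> Counts {
        for id in std::mem::take(&mut self.dirty) {
            let mut histogram = Counts::new();
            for sample in &self.documents[&id] {
                *histogram.entry(*sample).or_insert(0) += 1;
            }
            self.mapped.insert(id, histogram);
            self.map_calls += 1;
        }
        // The merge still runs on every call, but the map step above does not.
        let mut combined = Counts::new();
        for histogram in self.mapped.values() {
            for (value, count) in histogram {
                *combined.entry(*value).or_insert(0) += *count;
            }
        }
        combined
    }
}
```

After three documents are inserted and reduced, `map_calls` is 3. A second reduce leaves it at 3, and overwriting one document raises it to 4 on the next reduce.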

Still, it can take time to recalculate changed data, and sometimes you want your services to have consistent, low-latency response times. That’s where the AccessPolicy::UpdateBefore parameter comes in. This parameter controls whether you are happy to receive potentially out-of-date data for the sake of speed, or whether you wish to have the view fully updated before your query is evaluated. There’s even a final option, UpdateAfter, which kicks off a background job after returning the cached data.
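The difference between the policies can be sketched with a toy view that holds a cached total plus values that have not been folded in yet. Everything below is a std-only model under stated assumptions: `ToyView` is hypothetical, `NoUpdate` is the name assumed here for the stale-read option, and the "background job" is simulated by updating synchronously after the stale value has been captured.

```rust
/// Mirrors the three access policies; the behavior below is a toy model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccessPolicy {
    UpdateBefore,
    NoUpdate,
    UpdateAfter,
}

/// Hypothetical cached view: a reduced total plus values not yet folded in.
#[derive(Debug, Default)]
pub struct ToyView {
    cached_total: u64,
    pending: Vec<u64>,
}

impl ToyView {
    /// Records a new value without touching the cache.
    pub fn insert(&mut self, value: u64) {
        self.pending.push(value);
    }

    /// Folds all pending values into the cached total.
    fn update(&mut self) {
        self.cached_total += self.pending.drain(..).sum::<u64>();
    }

    /// Returns the reduced value according to `policy`.
    pub fn reduce(&mut self, policy: AccessPolicy) -> u64 {
        match policy {
            // Bring the view fully up to date, then answer.
            AccessPolicy::UpdateBefore => {
                self.update();
                self.cached_total
            }
            // Answer from the cache, however stale it may be.
            AccessPolicy::NoUpdate => self.cached_total,
            // Answer from the cache, then update (stands in for a background job).
            AccessPolicy::UpdateAfter => {
                let stale = self.cached_total;
                self.update();
                stale
            }
        }
    }
}
```

In this model, a read with `UpdateAfter` right after an insert still returns the old total, but the following read sees the new one even with `NoUpdate`.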

Just to be clear, I’m not claiming that this example shows BonsaiDb can compete with popular TSDBs. Rather, its use of the Rust type system positions it nicely as a friendly way for Rust developers to solve some of their data problems in unique ways without reaching for another tool.

That’s my aim with BonsaiDb, after all: create a database that will replace PostgreSQL and Redis in my own stack. And I hope other developers are equally excited at the prospect of being able to write database-driven apps using 100% Rust code.

Other notable updates

On the Nebari front, I’ve been working towards releasing a 0.1 version. Despite my last ramblings on benchmarks, I dove back into benchmarking a few more times, but these were experiments to help me understand what might need to change in the file format in future versions and what might not.

I’m now at the stage where I feel pretty confident the file format won’t change in the near future. That being said, I am not recommending people rush to use a 0.1-versioned crate in production. I’d rather my own projects be the first production projects that trust this code. And for me, that means deploying BonsaiDb.

And that’s the real update here: I’ve noticed that Nebari and BonsaiDb have generated a bit more interest than I was expecting given their early stage of development. Given that there seems to be actual interest in these projects, I’m probably going to focus a lot more on getting both of these crates to a 0.1 release this year. For BonsaiDb, I have a milestone tracking progress, although that list is definitely subject to change.

As always, if you’re interested in chatting with us, we have a Discord server, or you can leave a comment here. If you have questions, suggestions, are thinking of using any of our projects, or might want to contribute someday, we’d love to hear from you. Thank you for reading!