Encoding ids and data using English words: englishid

ecton · October 27, 2021, 3:28am

I released a new crate that I wanted to share: englishid.

We have two problems that we’re trying to solve, and this crate will help us with both of them:

Invitation-type codes: How to encode random unique IDs in a friendly manner?
Backup keys: How to encode backup private keys in a textual format that is easy to enter in a recovery scenario? (textual format is for those who don’t have access to a camera while recovering, as a QR code will also be available)

Enter: englishid. Just as the creator of “gif” has to live with people mispronouncing the name, I’m sure mine will fall victim to the same fate, so I’ll leave the naming inspiration here: it’s common for us crazy Americans to turn a noun into a verb. E.g., “Let me google that for you.”

My attempt at punny humor is present in this crate’s name: “I englishedid the ID for you.” … I tried. Feel free to just call it “English ID” instead of “Englished”.

What does it do?

englishid offers two sets of APIs that convert data into a series of English words. The list of words chosen is based on the EFF’s wordlist which appears to be a “safe” wordlist to use in a commercial environment. I expanded the list from 7,776 entries to 8,192 entries, allowing each word to encode 13 bits of information.

For example:

println!("42 => {}", EnglishId::from(42_u16).to_string()?);

Produces 42 => accept-abacus. To decode, call englishid::parse_u16("accept-abacus")?. The astute reader will notice that two words are used to encode a u16: two words carry 26 bits, so 10 bits are wasted. What if you only want to encode 13 bits? You can restrict the word limit during encoding:

println!("42 => {}", EnglishId::from(42_u16).words(1).to_string()?);

Produces 42 => accept. The same decode function is used to decode the id. The crate offers APIs for each of the unsigned primitives in Rust.
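The word-count arithmetic can be sketched in a few lines. This is only the arithmetic minimum implied by 13 bits per word, not englishid's code; `words_for_bits` and `unused_bits` are hypothetical helper names, and the crate's default word counts for types other than u16 are not confirmed here.

```rust
/// Bits carried by one word from an 8,192-entry list (2^13 = 8,192).
const BITS_PER_WORD: u32 = 13;

/// Minimum number of whole words needed to hold `bits` bits of data
/// (hypothetical helper, rounds up).
fn words_for_bits(bits: u32) -> u32 {
    (bits + BITS_PER_WORD - 1) / BITS_PER_WORD
}

/// Padding bits left unused when `bits` bits are stored in whole words.
fn unused_bits(bits: u32) -> u32 {
    words_for_bits(bits) * BITS_PER_WORD - bits
}
```

By this arithmetic a u16 needs 2 words with 10 bits unused, matching the example above, while a u32 would need at least 3 words, a u64 at least 5, and a u128 at least 10.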

The last set of APIs allows encoding arbitrary data. Since 13-bit words do not line up with 8-bit byte boundaries, the final word is usually padded, so to successfully decode a payload we must know the original length of the payload. My first stab at writing usable APIs uses three pairs of methods:
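To see why the length matters, here is a minimal sketch of 13-bit packing. It is not englishid's actual implementation: `pack_13` and `unpack_13` are hypothetical helpers, and the bit order (most significant bit first, final chunk zero-padded) is an assumption.

```rust
/// Pack bytes into 13-bit chunks, most significant bits first.
/// The final chunk is zero-padded. (Assumed layout, for illustration only.)
fn pack_13(data: &[u8]) -> Vec<u16> {
    let mut out = Vec::new();
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid low bits in `acc`
    for &byte in data {
        acc = (acc << 8) | u32::from(byte);
        bits += 8;
        while bits >= 13 {
            bits -= 13;
            out.push(((acc >> bits) & 0x1FFF) as u16);
        }
        acc &= (1 << bits) - 1; // keep only the bits not yet emitted
    }
    if bits > 0 {
        // Left-align the leftover bits in a final, zero-padded chunk.
        out.push(((acc << (13 - bits)) & 0x1FFF) as u16);
    }
    out
}

/// Unpack 13-bit chunks into exactly `len` bytes.
/// Returns None if the chunk count does not match `len`.
fn unpack_13(chunks: &[u16], len: usize) -> Option<Vec<u8>> {
    if chunks.len() != (len * 8 + 12) / 13 {
        return None;
    }
    let mut out = Vec::with_capacity(len);
    let mut acc: u32 = 0;
    let mut bits: u32 = 0;
    for &chunk in chunks {
        acc = (acc << 13) | u32::from(chunk & 0x1FFF);
        bits += 13;
        // Stop at `len` bytes so trailing padding is never read as data.
        while bits >= 8 && out.len() < len {
            bits -= 8;
            out.push((acc >> bits) as u8);
        }
        acc &= (1 << bits) - 1;
    }
    Some(out)
}
```

Under these assumptions, `pack_13(&[0xAB, 0xCD])` and `pack_13(&[0xAB, 0xCD, 0x00])` produce the same two chunks, so only a separately known length tells the 2-byte payload apart from the 3-byte one.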

encode_with_fixed_length()/decode_with_fixed_length(): Encodes the data without any additional information. Requires the caller to pass in the original payload’s length upon decoding. There is no limit to the amount of data encoded through these APIs.
encode()/decode(): Adds an extra word at the start that encodes the length of the payload. Allows up to 8,190 bytes to be encoded. I can remove this limit… but seriously, this crate isn’t intended for large payloads.
encode_with_custom_header()/decode_with_custom_header(): Adds an extra word at the start that encodes a custom u16 value (limited to the same range 0..=8190). When decoding, a callback is invoked with the stored header value, and the callback is responsible for returning the number of bytes expected in the payload.

This mode is intended to be used if you have an enum representing the type of information stored, and each enum value maps to a specific byte size. For us, this is how our backup key encoding will work.
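The enum-to-length mapping described here might look something like the sketch below. Everything in it is hypothetical: the `PayloadKind` variants, their header values, and their byte sizes are made up for illustration, and `expected_len` only shows the kind of logic a decoding callback could contain, not the crate's actual callback signature.

```rust
/// Hypothetical kinds of payload; names and sizes are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PayloadKind {
    /// Assumed to be a 32-byte private key.
    PrivateKey,
    /// Assumed to be a 16-byte recovery token.
    RecoveryToken,
}

impl PayloadKind {
    /// Value stored in the single header word.
    fn header(self) -> u16 {
        match self {
            PayloadKind::PrivateKey => 0,
            PayloadKind::RecoveryToken => 1,
        }
    }

    /// Map a decoded header value back to a kind, if it is known.
    fn from_header(header: u16) -> Option<Self> {
        match header {
            0 => Some(PayloadKind::PrivateKey),
            1 => Some(PayloadKind::RecoveryToken),
            _ => None,
        }
    }

    /// Fixed payload size in bytes for this kind.
    fn payload_len(self) -> usize {
        match self {
            PayloadKind::PrivateKey => 32,
            PayloadKind::RecoveryToken => 16,
        }
    }
}

/// Header value in, expected payload length out.
/// Unknown headers yield None so the caller can reject the input.
fn expected_len(header: u16) -> Option<usize> {
    PayloadKind::from_header(header).map(PayloadKind::payload_len)
}
```

Keeping the header-to-length mapping in one place means a new payload kind only needs a new enum variant, and an unrecognized header can be rejected before any payload bytes are decoded.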

The crate is fairly well tested, and the algorithms aren’t very tricky, so I feel fairly confident that the encoding format itself can stay stable. However, additional eyes would be appreciated. As my confidence grows, I will release a 1.0 version.

Another area of feedback I would appreciate is on the contents of the wordlist itself. I found, for example, “aids” and “aide”. As I was adding words, I removed both words. Why? If I was a friend helping you recover your account by reading the words to you as you typed, and I said, “aid,” would you know to spell it with the “e”? Similarly, “aides” is a real word as well.

Hopefully other people enjoy this crate.

A quick update from the rest of the labs

We’ve been mostly focused on working on BonsaiDb as well as the next step in our development plans. Our next devlog should cover a new development release of BonsaiDb – the first one under the new name, and the first one that includes user authentication and the ability to define custom APIs.