The other day while working at Chipmunks & Lumberjacks, I was asked to create a computer system to count the distinct set of leaf shapes, sizes and colors of every tree in the forest. I know what you are thinking! Why would anyone want to know this information? For C&L it's really important: they use it when scoring whether a tree can be harvested.
The naive implementation of this system would be to add all of the leaves to a database and run distinct queries. This system might work for a while, but what happens when the total number of trees and their associated leaves climbs into the billions or trillions? Will it still perform? The short answer is no; the longer answer is, absolutely not, are you kidding me!
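To see why, here is what the naive exact count boils down to: hold every distinct leaf descriptor in memory (a Python set stands in for the database's distinct index). The leaf tuples are made up for illustration; the point is that memory grows linearly with the number of distinct items.

```python
# Naive exact distinct count: every distinct leaf descriptor is retained.
# Memory grows linearly with cardinality, which is what falls over
# at billions or trillions of leaves.
leaves = [
    ("oval", "small", "green"),
    ("lobed", "large", "red"),
    ("oval", "small", "green"),  # duplicate, counted once
]
distinct = set(leaves)
print(len(distinct))  # 2
```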
Probabilistic Data Structures
So, what is a geek working for C&L to do? Enter the probabilistic data structure HyperLogLog. This sleek and beautiful data structure will estimate the distinct count of the items entered into it with roughly a 2% error rate. Ok now, come on, 2% is not that bad, especially when you can count 10^9 distinct items in just 1.5 kB of memory. HyperLogLog descends from the Flajolet-Martin algorithm of 1985, which might be the best thing to come out of the 80's besides Nintendo and the Sony Walkman.
HyperLogLog data structures have an Add, a Count and a Merge function. Add will insert a new value into the data structure. Count will tell you the distinct number of items in the data structure. Merge will take two HyperLogLog data structures and return a new combined structure representing the distinct set of both.
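To make the three operations concrete, here is a toy HyperLogLog sketch in Python. The class name, the SHA-1 hash, and the choice of 2^p registers are all my own for illustration; this is a minimal sketch of the algorithm, not a production implementation:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch: Add, Count and Merge."""

    def __init__(self, p=10):
        self.p = p                      # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        # standard bias-correction constant (valid for m >= 128)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # hash the value, truncate to 64 bits
        h = int(hashlib.sha1(value.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            est = self.m * math.log(self.m / zeros)
        return int(est)

    def merge(self, other):
        # union of two sketches is an element-wise max over registers
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged
```

Note that Merge is just an element-wise max over the registers, which is why combining two sketches is cheap and loses nothing relative to having built one sketch from the union of the inputs.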
Add and Count are the meat and potatoes of the HyperLogLog data structure, but don't discount Merge. Some very interesting features can be created with Merge. For C&L we needed the 1-, 7- and 30-day counts for our tree leaves. To save some space we just used one data structure per day and merged the days to find the week and month values. Neat, right?
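Because merging is an element-wise max over registers, per-day sketches compose into week and month views for free. A toy illustration with hypothetical 4-register sketches (real sketches have hundreds or thousands of registers):

```python
# Two per-day sketches, represented by their raw registers.
# These tiny 4-register arrays are purely illustrative.
monday = [3, 0, 5, 1]
tuesday = [2, 4, 1, 1]

# Merging = element-wise max; the result is exactly the sketch you
# would have gotten by adding both days' values into one structure.
two_day_window = [max(a, b) for a, b in zip(monday, tuesday)]
print(two_day_window)  # [3, 4, 5, 1]
```

Extending to a 7- or 30-day window is just folding the same max over more daily sketches.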
There are some significant side benefits of using this type of data structure over the typical row in a database. Three of them are GDPR compliance, easy replay and small size.
For GDPR, if you are just exchanging the data structures, there is no recognizable information stored inside them: the Add command hashes the original value and records only a small register update, so the original value cannot be recovered from the structure.
Replay is all about recovering from a distributed system failure: inserting the same value into a HyperLogLog data structure a second time has no effect on the count. It's basically a no-op.
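A minimal sketch of why replay is safe, assuming a HyperLogLog-style register update (the function name and the SHA-1 hash are my own for illustration): the same value always maps to the same register and rank, and registers only ever keep a max, so re-adding a value changes nothing.

```python
import hashlib

def register_update(value, p=10):
    """Map a value to its (register index, rank) pair — toy sketch."""
    h = int(hashlib.sha1(value.encode()).hexdigest(), 16) & ((1 << 64) - 1)
    idx = h >> (64 - p)
    rest = h & ((1 << (64 - p)) - 1)
    rank = (64 - p) - rest.bit_length() + 1
    return idx, rank

# Hashing is deterministic, so a replayed Add produces the identical
# update; max(register, rank) then leaves the register unchanged.
assert register_update("leaf-42") == register_update("leaf-42")
```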
You have to admit that storing a bazillion items in 1.5 kilobytes is pretty neat. The data size makes sharing and transmitting the structure between data centers, or between servers in the same center, relatively easy.
A Red Bird
The new system, named REDBird, is built and deployed to multiple forests around the world, and C&L is very happy with it. If you have a cardinality challenge at work, you might want to learn more about probabilistic data structures and HyperLogLog.