PWL: Age Partitioned Bloom Filters
Now that is an interesting title for a paper. "Age Partitioned Bloom Filters" is somewhat of a mouthful and packed with meaning. I love it. When this gem of a paper came up in my Twitter feed, I instantly fav'ed it, downloaded it, and read it. And the paper did not disappoint. If you've not read it yet, I suggest you do so now. You can find it here. Go on, this blog post can wait.
The Setup
Was it good? I knew you would like it. One of the best things about the paper was how the authors set up the problem and then walked through the different techniques, variations of Bloom filters, and other data structures for solving it efficiently.
In this paper we present Age-Partitioned Bloom filters, a BF-based approach for duplicate detection in sliding windows that not only is competitive in time-complexity, but has better space usage than current dictionary-based approaches (e.g., SWAMP), at the cost of some moderate slack.
I bet you can just imagine doing duplicate detection on a large continuous stream of data using an Age Partitioned Bloom Filter. Neat!
Time Segments
One of the approaches the authors dismissed was associating a Bloom filter with each segment of time and asking every segment whether the new piece of data was part of the set. To make this concrete, let me set up a scenario. Bring on the cat memes.
Edge Cache Scenario
Let's say that we are going to start a company that caches cat meme images on the edge of a large network. To do this efficiently, we only want to cache images that are actually requested more than once. There's no need to cache every cat meme on the edge if only the latest and greatest are ever requested.
Typically, in this scenario, we would set up a Bloom filter in our request system and add each requested cat meme URL to it. If a requested URL were already in the Bloom filter, we would then pull the image from disk and place it on our edge network. The next time someone wanted that picture of the baby Yoda cat meme, they would get it from the edge of the network.
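To make that request path concrete, here is a minimal Python sketch. The `BloomFilter` class is a bare-bones illustration using double hashing over SHA-256, and `cache_on_edge` and `serve` are hypothetical stand-ins for the real cache and serving layers, not anything from the paper.

```python
import hashlib

class BloomFilter:
    """Bare-bones Bloom filter using double hashing (illustrative only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from two 64-bit halves of a SHA-256.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))


def cache_on_edge(url):
    print(f"caching {url} on the edge")  # hypothetical cache layer

def serve(url):
    print(f"serving {url}")  # hypothetical serving path


requested = BloomFilter()

def handle_request(url):
    if url in requested:
        cache_on_edge(url)   # probably seen before, so worth caching
    else:
        requested.add(url)   # first sighting, just remember it
    serve(url)
```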
With the Bloom filter, we could optimize our cache storage and not store yesterday's old memes on the edge. Everything is fine until the Bloom filter gets full. What do you do now? Start a new one? Maybe, but it will be empty, with no memory of previously requested or cached items. What you need in your cat meme cache business is an Age-Partitioned Bloom filter or a similar technique. If the oldest meme URLs simply aged out of the back of your Bloom filter, you would never have to reset to an empty one when the filter filled up. NEATO, right?
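Here is a rough sketch of the age-partitioned idea as I read the paper: keep k + l Bloom filter slices ordered youngest to oldest, insert each item into the k youngest slices, report a match only when k consecutive slices all answer yes, and age by retiring the oldest slice and opening a fresh one. The slice sizing, aging schedule, and error analysis are where the paper's real work lives, so treat this as a toy. It reuses the `BloomFilter` class from the sketch above.

```python
from collections import deque

class AgePartitionedSketch:
    """Toy Age-Partitioned Bloom filter: k + l slices, youngest first.
    A simplification, not the paper's tuned construction."""

    def __init__(self, k=3, l=6, bits_per_slice=1 << 18):
        self.k = k
        self.bits_per_slice = bits_per_slice
        self.slices = deque(BloomFilter(bits_per_slice) for _ in range(k + l))

    def add(self, item):
        for s in list(self.slices)[:self.k]:  # the k youngest slices
            s.add(item)

    def __contains__(self, item):
        run = 0  # look for k consecutive matching slices
        for s in self.slices:
            run = run + 1 if item in s else 0
            if run == self.k:
                return True
        return False

    def age(self):
        """Retire the oldest slice and open a fresh one, so old items
        fall out the back instead of forcing a reset to empty."""
        self.slices.pop()
        self.slices.appendleft(BloomFilter(self.bits_per_slice))
```

Calling `age()` periodically is what gives the structure its sliding window: an item stays queryable for as long as its k slices survive toward the old end.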
Time Segment Issues
The issues the authors of the paper had with time segments were the choppiness of the technique and its processing costs. If we did this for the cat meme system, we might create a Bloom filter per hour, then query the current Bloom filter plus the past 24 to get a day's worth of data when deciding whether a meme image should be cached. This is a simple and somewhat silly example, to be honest, but I hope you get the point. Asking 25 Bloom filters whether they contain a cat meme URL would take some time, unless you could merge the Bloom filters together and keep the same error rate.
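Sticking with Python, here is a sketch of that hourly segmentation, again reusing the `BloomFilter` class from earlier. The loop in `seen_in_last_day` is exactly the per-query cost being complained about: up to 25 separate membership checks.

```python
import time

HOURS_TO_KEEP = 25  # the current hour plus the past 24

segments = {}  # hour bucket -> BloomFilter (the class sketched earlier)

def hour_now():
    return int(time.time() // 3600)

def record_request(url):
    hour = hour_now()
    segments.setdefault(hour, BloomFilter()).add(url)
    # Segments older than the window just get dropped wholesale.
    for old in [h for h in segments if h <= hour - HOURS_TO_KEEP]:
        del segments[old]

def seen_in_last_day(url):
    hour = hour_now()
    # The choppy part: ask every segment in the window, one by one.
    return any(url in segments[h]
               for h in range(hour - HOURS_TO_KEEP + 1, hour + 1)
               if h in segments)
```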
HyperLogLog and PFMerge
I've used the time segment technique in high-cardinality counting systems built on HyperLogLog in the past. One of the benefits of HyperLogLog in Redis is the PFMERGE command, which takes any number of time-segmented HyperLogLog structures and merges them into one. I can tell you that in production, with billions of these data structures and millions of cardinality checks a minute, such a system runs and runs very nicely.
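For flavor, here is roughly what that looks like with the redis-py client. PFADD, PFMERGE, and PFCOUNT are real Redis commands, but the key names and the hourly bucketing are just my assumptions for the example.

```python
import time

import redis  # assumes redis-py and a reachable Redis server

r = redis.Redis()

def hll_key(hour):
    return f"uniques:{hour}"  # hypothetical key-naming scheme

def record(item):
    # PFADD inserts into the HyperLogLog for the current hour bucket.
    r.pfadd(hll_key(int(time.time() // 3600)), item)

def uniques_last_24h():
    hour = int(time.time() // 3600)
    keys = [hll_key(h) for h in range(hour - 23, hour + 1)]
    # PFMERGE unions the time-segmented HLLs into one destination key;
    # missing source keys are simply treated as empty.
    r.pfmerge("uniques:last24h", *keys)
    return r.pfcount("uniques:last24h")
```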
I love reading papers like this one and discovering new techniques for challenges I'd never even thought of. Have you had a high-cardinality counting problem or a cat meme optimization problem? I hope to have more of both in the future, to be honest.