Do you know how Redis cleans up memory? If not, keep reading. You see, I know much more about how Redis cleans up memory than I ever wanted to know. Recently, I built a system with a variable load that gets hammered during the fall foliage season, and we ran into a little memory issue with my favorite database. We had to keep scaling it up, and it was getting expensive.
TTL and a Red Bird
The REDBird system uses Redis to write and read a lot of time series data. Truth be told, the system creates a lot of keys and leaves them for Redis to clean up, kind of like that brother who doesn't clean up after himself in the kitchen and thinks mom will do it for him. There is only one problem: we write and expire keys faster than Redis can clean them up in its default configuration.
Keys in Redis are expired in two ways: a passive way and an active way. In the passive way, an expired key is deleted when a client tries to read it. In the active way, Redis samples 20 keys at random from the keys that have a Time To Live (TTL) set and deletes the ones that have expired. If more than 25% of the sample was expired, it runs again.
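To make the active way concrete, here is a toy simulation of that sampling loop. It is a sketch of the behavior as described above, not Redis's actual code: the function name, the dict of expiry timestamps, and the constants are mine, and real Redis also limits how much time it spends in each cycle, which this ignores.

```python
import random

SAMPLE_SIZE = 20          # keys tested per pass, per the description above
EXPIRED_THRESHOLD = 0.25  # repeat while more than 25% of the sample was expired


def active_expire_cycle(expires, now, rng=random):
    """Simulate one active-expiry cycle.

    `expires` is a hypothetical stand-in for Redis's table of keys with a TTL:
    a dict mapping key name -> expiry timestamp. Returns how many keys were deleted.
    """
    deleted = 0
    while expires:
        # Look at up to 20 random keys that have a TTL set.
        sample = rng.sample(list(expires), min(SAMPLE_SIZE, len(expires)))
        expired = [key for key in sample if expires[key] <= now]
        for key in expired:
            del expires[key]
        deleted += len(expired)
        # Stop unless more than 25% of this sample turned out to be expired.
        if len(expired) <= EXPIRED_THRESHOLD * len(sample):
            break
    return deleted
```

When nearly everything is expired, the loop keeps going until it has chewed through the backlog. When expired keys are a small share of a large keyspace, most passes stop after a single sample, and that is the situation where expired keys pile up.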
This is where things get a little murky for me. The Redis documentation says that the cleanup function runs 10 times a second, but that it stops if it does not find that more than 25% of the sampled keys are expired. I have some questions. Stop for how long? And what does "start again if more than 25% are expired" actually mean?
hz and Frequency
The answer to most of my questions was inside the redis.conf file. In the file, under the hz configuration setting, we get this little gem of a comment:
Redis calls an internal function to perform many background tasks, like closing connections of clients in timeout, purging expired keys that are never requested, and so forth.
Wow, and when we keep reading, we get this bonus.
By default "hz" is set to 10. Raising the value will use more CPU when Redis is idle, but at the same time will make Redis more responsive when there are many keys expiring at the same time, and timeouts may be handled with more precision.
There it is. The magical number 10. Now I know what you are thinking. Adjust the "hz" configuration value and REDBird will fly again. The problem is, Amazon ElastiCache does not give you access to this configuration setting. I wonder why? Not really, I know why. Moving "hz" to 100 or 500 would reduce the total number of Redis instances that Amazon can deploy on the same hardware compared with a "hz" of 10. OK, that's fair, but it sucks. Moving on.
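For a sense of scale, here is some back-of-envelope math on what that "hz" value buys you. It assumes the 20-key sample described earlier and the worst case where every cycle stops after a single pass. The function name and that one-pass assumption are mine, and the exact behavior varies between Redis versions, so treat the numbers as a rough floor rather than a measurement.

```python
SAMPLE_SIZE = 20  # keys tested per pass, as described earlier


def expiry_floor(hz):
    """Return (seconds between cycles, keys sampled per second),
    assuming every cycle stops after one pass."""
    return 1.0 / hz, hz * SAMPLE_SIZE


for hz in (10, 100, 500):
    gap, sampled = expiry_floor(hz)
    print(f"hz={hz}: a cycle every {gap * 1000:.0f} ms, "
          f"at least {sampled} keys sampled per second")
```

At the default of 10, that works out to a cycle every 100 ms and roughly 200 keys sampled per second whenever the 25% rule does not kick in. That also answers my "stop for how long?" question: until the next tick.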
Fly REDBird Fly
So how did we fix this? We took some inspiration from this and another blog post and started asking Redis to look up random keys in a fire-and-forget pipeline batch. We started out with batches of 50K every second and eventually found that a 10K batch every couple of seconds gave us a good balance of memory cleanup and system load. Today REDBird is flying even during the fall, and we were able to reduce the size of our cluster, saving more money for the chipmunks' and lumberjacks' retirement fund.
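If you want to try something similar, here is a minimal sketch of the idea rather than the exact REDBird code. It assumes the "look up random keys" step is the RANDOMKEY command, which discards any expired key it lands on, and that the client is a redis-py style object with a pipeline() method. The function names, batch size default, and interval are illustrative.

```python
import time


def nudge_expired_keys(client, batch_size=10_000):
    """Queue `batch_size` RANDOMKEY calls in one non-transactional pipeline.

    The replies are deliberately ignored (fire and forget). The only goal is
    to make Redis touch random keys so that expired ones get reclaimed.
    """
    pipe = client.pipeline(transaction=False)
    for _ in range(batch_size):
        pipe.randomkey()
    pipe.execute()  # replies are discarded on purpose
    return batch_size


def run_forever(client, batch_size=10_000, interval_seconds=2.0):
    """Send one batch, sleep, repeat. Tune both knobs against your own load."""
    while True:
        nudge_expired_keys(client, batch_size)
        time.sleep(interval_seconds)
```

With redis-py installed, you would create a client with redis.Redis(...) pointed at your own endpoint and pass it to run_forever. Taking the client as a parameter keeps the sketch easy to test against a stand-in object before you aim it at a real cluster.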