In the right use case, Bloom filters seem like magic. That’s a bold statement, but in this tutorial we’ll explore the curious data structure, how best to use it, and a few practical examples using Redis and Node.js.
Bloom filters are a probabilistic, one-way data structure. The word ‘filter’ can be confusing in this context; filter implies that it is an active thing, a verb, but it might be easier to think of it as storage, a noun. With a simple Bloom filter you can do two things:
- Add an item.
- Check if an item has not been added previously.
These are important limitations to understand—you can’t remove an item nor can you list the items in a Bloom filter. Also, you can’t tell, with certainty, if an item has been added to the filter in the past. This is where the probabilistic nature of a Bloom filter comes in—false positives are possible, but false negatives are not. If the filter is set up correctly, false positives can be extremely rare.
Variants of Bloom filters exist, and they add other abilities, like removal or scaling, but they also add complexity and limitations. It’s important to first understand simple Bloom filters before moving on to the variants. This article will only cover the simple Bloom filters.
With these limitations you have a number of benefits: fixed size, hash-based encryption, and quick lookups.
When you set up a Bloom filter, you give it a size. This size is fixed, so if you have one item or one billion items in the filter, it will never grow beyond the specified size. As you add more items to your filter, the chance of a false positive increases. If you specified a smaller filter, this false positive rate will increase more quickly than if you have a larger size.
Bloom filters are built on the concept of one-way hashing. Much like correctly storing passwords, Bloom filters use a hash algorithm to determine a unique identifier for the items passed into it. Hashes, by nature, cannot be reversed and are represented by a seemingly random string of characters. So, if someone gains access to a Bloom filter, it will not directly reveal any of the contents.
Finally, Bloom filters are quick. The operation involves far fewer comparisons than other methods, and it can easily be stored in-memory, preventing performance-robbing database hits.
Now that you know the limits and advantages of Bloom filters, let’s take a look at some situations where you can use them.
Setup
We’ll be using Redis and Node.js to illustrate Bloom filters. Redis is a storage medium for your Bloom filter; it’s quick, in-memory, and has a few specific commands (GETBIT
, SETBIT
) that make implementation efficient. I’ll assume that you have Node.js, npm, and Redis installed on your system. Your Redis server should be running on localhost
at the default port for our examples to work.
In this tutorial, we won’t be implementing a filter from the ground up; instead, we’ll focus on practical uses with a pre-built module in npm: bloom-redis. bloom-redis has a very concise set of methods: add
, contains
and clear
.
As mentioned earlier, Bloom filters need a hashing algorithm to generated unique identifiers for an item. bloom-redis uses the well-known MD5 algorithm, which, although maybe not the perfect fit for a Bloom filter (a little slow, overkill on bits), will work fine.
Unique Usernames
Usernames, especially those that identify a user in a URL, need to be unique. If you build an application that allows users to change the username, then you’ll likely want a username that has never been used to avoid confusion and sniping of usernames.
Without a Bloom filter, you’d need to reference a table that has every username ever used, and at scale this can be very expensive. Bloom filters allow you to add an item every time a user adopts a new name. When a user checks to see if a username is taken, all you need to do is check the Bloom filter. It will be able to tell you, with absolute certainty, if the requested username has been previously added. It is possible that the filter will falsely return that a username has been used when it hasn’t, but this errs on the side of caution and can cause no real harm (aside from a user maybe not being able to claim ‘k3w1d00d47’).
To illustrate this, let’s build a quick REST server with Express. First, create your package.json
file and then run the following terminal commands.
npm install bloom-redis --save
npm install express --save
npm install redis --save
The default options for bloom-redis have the size set at two megabytes. This errs on the side of caution, but it’s quite large. Setting up the size of the Bloom filter is critical: too large and you waste memory, too small and your false positive rate will be too high. The math involved in determining the size is quite involved and beyond the scope of this tutorial, but thankfully there is a Bloom filter size calculator to get the job done without cracking a textbook.
Now, create your app.js
as follows:
var Bloom = require('bloom-redis'), express = require('express'), redis = require('redis'), app, client, filter; //setup our Express server app = express(); //create the connection to Redis client = redis.createClient(); filter = new Bloom.BloomFilter({ client : client, //make sure the Bloom module uses our newly created connection to Redis key : 'username-bloom-filter', //the Redis key //calculated size of the Bloom filter. //This is where your size / probability trade-offs are made //http://hur.st/bloomfilter?n=100000&p=1.0E-6 size : 2875518, // ~350kb numHashes : 20 }); app.get('/check', function(req,res,next) { //check to make sure the query string has 'username' if (typeof req.query.username === 'undefined') { //skip this route, go to the next one - will result in a 404 / not found next('route'); } else { filter.contains( req.query.username, // the username from the query string function(err, result) { if (err) { next(err); //if an error is encountered, send it to the client } else { res.send({ username : req.query.username, //if the result is false, then we know the item has *not* been used //if the result is true, then we can assume that the item has been used status : result ? 'used' : 'free' }); } } ); } }); app.get('/save',function(req,res,next) { if (typeof req.query.username === 'undefined') { next('route'); } else { //first, we need to make sure that it's not yet in the filter filter.contains(req.query.username, function(err, result) { if (err) { next(err); } else { if (result) { //true result means it already exists, so tell the user res.send({ username : req.query.username, status : 'not-created' }); } else { //we'll add the username passed in the query string to the filter filter.add( req.query.username, function(err) { //The callback arguments to `add` provides no useful information, so we'll just check to make sure that no error was passed if (err) { next(err); } else { res.send({ username : req.query.username, status : 'created' }); } } ); } } }); } }); app.listen(8010);
To run this server: node app.js
. Go to your browser and point it to: https://localhost:8010/check?username=kyle
. The response should be: {"username":"kyle","status":"free"}
.
Now, let’s save that username by pointing your browser at http://localhost:8010/save?username=kyle
. The response will be: {"username":"kyle","status":"created"}
. If you go back to the address http://localhost:8010/check?username=kyle
, the response will be {"username":"kyle","status":"used"}
. Similarly, going back to http://localhost:8010/save?username=kyle
will result in {"username":"kyle","status":"not-created"}
.
From the terminal, you can see the size of the filter:
redis-cli strlen username-bloom-filter
.
Right now, with one item, it should show 338622
.
Now, go ahead and try adding more usernames with the /save
route. Try as many as you’d like.
If you then check the size again, you may notice that your size has gone up slightly, but not for every addition. Curious, right? Internally, a Bloom filter sets individual bits (1’s / 0’s) at different positions in the string saved at username-bloom. However, these are not contiguous, so if you set a bit at index 0 and then one at index 10,000, everything between will be 0. For practical uses, it’s not initially important to understand the precise mechanics of every operation—just know that this is normal and that your storage in Redis will never exceed the value you specified.
Fresh Content
Fresh content on a website keeps a user coming back, so how do you show a user something new every time? Using a traditional database approach, you could add a new row to a table with the user identifier and the identifier of the story, and then you would query that table when deciding to show a piece of content. As you might imagine, your database will grow extremely quickly, especially with growth of both users and content.
In this case, a false negative (e.g. not showing an unseen piece of content) has very little consequence, making Bloom filters a viable option. At first blush, you may think that you’d need a Bloom filter for every user, but we’ll use a simple concatenation of the user identifier and the content identifier, and then insert that string into our filter. This way we can use a single filter for all users.
In this example, let’s build another basic Express server that displays content. Every time you visit the route /show-content/any-username
(with any-username being any URL-safe value), a new piece of content will be shown until the site is out of content. In the example, the content is the first line of the top ten books on Project Gutenberg.
We’ll need to install one more npm module. From the terminal, run:
npm install async --save
Your new app.js file:
var async = require('async'), Bloom = require('bloom-redis'), express = require('express'), redis = require('redis'), app, client, filter, // From Project Gutenberg - opening lines of the top 10 public domain ebooks // https://www.gutenberg.org/browse/scores/top openingLines = { 'pride-and-prejudice' : 'It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.', 'alices-adventures-in-wonderland' : 'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, \'and what is the use of a book,\' thought Alice \'without pictures or conversations?\'', 'a-christmas-carol' : 'Marley was dead: to begin with.', 'metamorphosis' : 'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.', 'frankenstein' : 'You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings.', 'adventures-of-huckleberry-finn' : 'YOU don\'t know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain\'t no matter.', 'adventures-of-sherlock-holmes' : 'To Sherlock Holmes she is always the woman.', 'narrative-of-the-life-of-frederick-douglass' : 'I was born in Tuckahoe, near Hillsborough, and about twelve miles from Easton, in Talbot county, Maryland.', 'the-prince' : 'All states, all powers, that have held and hold rule over men have been and are either republics or principalities.', 'adventures-of-tom-sawyer' : 'TOM!' }; app = express(); client = redis.createClient(); filter = new Bloom.BloomFilter({ client : client, key : '3content-bloom-filter', //the Redis key size : 2875518, // ~350kb //size : 1024, numHashes : 20 }); app.get('/show-content/:user', function(req,res,next) { //we're going to be looping through the contentIds, checking to see if they are in the filter. //Since this spends time on each contentId wouldn't be advisable to do over a high number of contentIds //But, in this case the number of contentIds is small / fixed and our filter.contains function is fast, it is okay. var //creates an array of the keys defined in openingLines contentIds = Object.keys(openingLines), //getting part of the path from the URI user = req.params.user, checkingContentId, found = false, done = false; //since filter.contains is asynchronous, we're using the async library to do our looping async.whilst( //check function, where our asynchronous loop will end function () { return (!found && !done); }, function(cb) { //get the first item from the array of contentIds checkingContentId = contentIds.shift(); //false means we're sure that it isn't in the filter if (!checkingContentId) { done = true; // this will be caught by the check function above cb(); } else { //concatenate the user (from the URL) with the id of the content filter.contains(user+checkingContentId, function(err, results) { if (err) { cb(err); } else { found = !results; cb(); } }); } }, function(err) { if (err) { next(err); } else { if (openingLines[checkingContentId]) { //before we send the fresh contentId, let's add it to the filter to prevent it from showing again filter.add( user+checkingContentId, function(err) { if (err) { next(err); } else { //send the fresh quote res.send(openingLines[checkingContentId]); } } ); } else { res.send('no new content!'); } } } ); }); app.listen(8011);
If you carefully pay attention to round-trip time in Dev Tools, you’ll notice that the more you request a single path with a username, the longer it takes. While checking the filter takes a fixed time, in this example, we’re checking for the presence of more items. Bloom filters are limited in what they can tell you, so you’re testing for the presence of each item. Of course, in our example it is fairly simple, but testing for hundreds of items would be inefficient.
Stale Data
In this example, we’ll build a small Express server that will do two things: accept new data via POST, and display the current data (with a GET request). When the new data is POST’ed to the server, the application will check for its presence in the filter. If it is not present, we’ll add it to a set in Redis, otherwise we’ll return null. The GET request will fetch it from Redis and send it to the client.
This is different than the previous two situations, in that false positives would not be okay. We’ll be using the Bloom filter as a first line of defense. Given the properties of Bloom filters, we’ll only know for certain that something isn’t in the filter, so in this case we can go ahead and let the data in. If the Bloom filter returns that is probably in the filter, we’ll do a check versus the actual data source.
So, what do we gain? We gain the speed of not having to check versus the actual source every time. In situations where the data source is slow (external APIs, pokey databases, the middle of a flat file), the speed increase is really needed. To demonstrate the speed, let’s add in a realistic delay of 150ms in our example. We’ll also use the console.time
/ console.timeEnd
to log the differences between a Bloom filter check and a non-Bloom filter check.
In this example, we’ll also use an extremely constrained number of bits: just 1024. It will fill up quickly. As it fills, it will show more and more false positives—you’ll see the response time increase as the false positive rate fills up.
This server uses the same modules as before, so set the app.js
file to:
var async = require('async'), Bloom = require('bloom-redis'), bodyParser = require('body-parser'), express = require('express'), redis = require('redis'), app, client, filter, currentDataKey = 'current-data', usedDataKey = 'used-data'; app = express(); client = redis.createClient(); filter = new Bloom.BloomFilter({ client : client, key : 'stale-bloom-filter', //for illustration purposes, this is a super small filter. It should fill up at around 500 items, so for a production load, you'd need something much larger! size : 1024, numHashes : 20 }); app.post( '/', bodyParser.text(), function(req,res,next) { var used; console.log('POST -', req.body); //log the current data being posted console.time('post'); //start measuring the time it takes to complete our filter and conditional verification process //async.series is used to manage multiple asynchronous function calls. async.series([ function(cb) { filter.contains(req.body, function(err,filterStatus) { if (err) { cb(err); } else { used = filterStatus; cb(err); } }); }, function(cb) { if (used === false) { //Bloom filters do not have false negatives, so we need no further verification cb(null); } else { //it *may* be in the filter, so we need to do a follow up check //for the purposes of the tutorial, we'll add a 150ms delay in here since Redis can be fast enough to make it difficult to measure and the delay will simulate a slow database or API call setTimeout(function() { console.log('possible false positive'); client.sismember(usedDataKey, req.body, function(err, membership) { if (err) { cb(err); } else { //sismember returns 0 if an member is not part of the set and 1 if it is. //This transforms those results into booleans for consistent logic comparison used = membership === 0 ? false : true; cb(err); } }); }, 150); } }, function(cb) { if (used === false) { console.log('Adding to filter'); filter.add(req.body,cb); } else { console.log('Skipped filter addition, [false] positive'); cb(null); } }, function(cb) { if (used === false) { client.multi() .set(currentDataKey,req.body) //unused data is set for easy access to the 'current-data' key .sadd(usedDataKey,req.body) //and added to a set for easy verification later .exec(cb); } else { cb(null); } } ], function(err, cb) { if (err) { next(err); } else { console.timeEnd('post'); //logs the amount of time since the console.time call above res.send({ saved : !used }); //returns if the item was saved, true for fresh data, false for stale data. } } ); }); app.get('/',function(req,res,next) { //just return the fresh data client.get(currentDataKey, function(err,data) { if (err) { next(err); } else { res.send(data); } }); }); app.listen(8012);
Since POSTing to a server can be tricky with a browser, let’s use curl to test.
curl --data “your data goes here" --header "Content-Type: text/plain" http://localhost:8012/
A quick bash script can be used to show how filling up the entire filter looks:
#!/bin/bash for i in `seq 1 500`; do curl --data “data $i" --header "Content-Type: text/plain" http://localhost:8012/ done
Looking at a filling or full filter is interesting. Since this one is small, you can easily view it with redis-cli
. By running redis-cli get stale-filter
from the terminal between adding items, you’ll see the individual bytes increase. A full filter will be \xff
for every byte. At this point, the filter will always return positive.
Conclusion
Bloom filters aren’t a panacea solution, but in the right situation, a Bloom filter can provide a speedy, efficient complement to other data structures.
Comments