In this post, we would like to talk about a simple data structure we designed: an approximate membership filter with probabilistic time decay. The name doesn’t sound simple, but the filter is easy to implement and highly concurrent, while achieving a very low memory overhead.

In our system there are a lot of incoming events. The system needs to detect recurring events and apply special processing to them. An LRU cache may come to mind for this task, but we don’t use one. Why? If accuracy matters, an LRU cache is a good choice, but for us accuracy is not the highest priority. Our data structure needs to remember newer data with a *higher probability* and older data with a *lower probability*. In addition, we are going to have a large number of filters in the system, so a low memory overhead is required. Lastly, filters are serialized and deserialized frequently, and this needs to be fast. Therefore, we looked for a different solution than LRU, and we came up with the data structure in this blog post.

Let’s start with the most naive way. Suppose an event is an integer number *E*, and we have an integer array of size *s*. For each event, a hash value is computed to locate a position in the array, and *E* is written at that position. An older event is more likely to be overwritten by newer events. The probability of survival of an event is

*φ(t) = (1 - 1/s)^t*

where *t* represents the number of events inserted after this event. This is an exponential decay function, since *(1 - 1/s)^t ≈ e^(-t/s)*.
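
A minimal sketch of this naive scheme, assuming events are plain Ints (the class and method names are ours, not from the original system):

```scala
import scala.util.hashing.byteswap32

// One Int slot per bucket; newer events simply overwrite older ones,
// which is what produces the exponential decay.
class NaiveFilter(s: Int) {
  private val slots = new Array[Int](s)

  private def position(e: Int): Int = (byteswap32(e) & 0x7fffffff) % s

  def insert(e: Int): Unit = slots(position(e)) = e

  // Note: 0 doubles as "empty slot" here, so event 0 would need
  // special handling in practice.
  def mightContain(e: Int): Boolean = slots(position(e)) == e
}
```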

Although this works, in a real environment hash collisions may occur. If a collision happens among frequent events, they keep overwriting each other. Therefore, we need to reduce the collision rate somehow.

How about having more hash functions? Instead of writing to a single place in the array, we write the value to multiple places using multiple hash functions. The probability that all hash values collide should be lower. Let *n* be the number of hash functions. An event survives when at least one out of its *n* places is not overwritten. How does the survival probability change? For brevity, we will gloss over the fact that collisions may occur among the hash values of a single event.
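
This variant can be sketched in a few lines of Scala (again our own illustration; the *i*-th hash function is simulated by hashing the pair of the event and the index):

```scala
import scala.util.hashing.MurmurHash3

// Sketch of the n-hash variant.
class MultiHashFilter(s: Int, n: Int) {
  private val slots = new Array[Int](s)

  private def position(e: Int, i: Int): Int =
    (MurmurHash3.productHash((e, i)) & 0x7fffffff) % s

  def insert(e: Int): Unit =
    for (i <- 0 until n) slots(position(e, i)) = e

  // Answer "YES" when at least one of the n places still holds e.
  def mightContain(e: Int): Boolean =
    (0 until n).exists(i => slots(position(e, i)) == e)
}
```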

Every event writes to *n* places in the array. An event survives if at least *one* out of its *n* places is intact. The survival probability *φ* of an event after *t* event insertions is

*φ(t) = 1 - (1 - p)^n*

where

*p = (1 - 1/s)^(nt)*

is the probability that a given place survives the *nt* writes performed by the *t* subsequent events.
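
The curves below can be reproduced numerically; here is a small helper (our own code, not the post’s plotting script):

```scala
// Survival probability of an event after t subsequent insertions,
// with array size s and n hash functions.
def survival(t: Int, s: Int, n: Int): Double = {
  val p = math.pow(1.0 - 1.0 / s, n.toDouble * t) // one place left intact
  1.0 - math.pow(1.0 - p, n)                      // at least one of n intact
}

// For example, with s = 1000:
//   survival(1000, 1000, 1) ≈ 0.37
//   survival(1000, 1000, 4) ≈ 0.07
```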

The following graph shows a few examples of different *n* (*s* is fixed to 1000).

With a higher *n*, the probability declines slowly at the beginning, which means that the filter remembers the recent events very well. The curve becomes closer to a step function as *n* increases. It is obvious, however, that the overall capacity for remembering events decreases as *n* gets bigger.

In order to compensate for the capacity decrease, let’s consider increasing *s*. In the following graph, *s* is increased proportionally to *log₂(n)*.

The capacity still decreases a little, but growing the size on the order of *log₂(n)* seems to be a reasonable guideline.

We would like to use a bigger *n*, but space efficiency is an issue. It would be great if we could decrease the required storage size without decreasing the capacity. Let’s try approximate matching instead of exact matching and examine how the approximation affects the accuracy.

We will reduce the bit size of an array entry. Another hash function is used to map the 32-bit integer to a smaller number of bits. Let’s call this a fingerprint and let *b* be the size of a fingerprint in bits. Intuitively, the probability of a false positive *f* at a single place is

*f = 2^(-b)*

Considering this, the probability *φ*, which is now the probability that the filter answers “YES” including false positives, is written as

*φ(t) = 1 - (1 - q)^n*

where

*q = f + (1 - f)(1 - 1/s)^(nt)*

is the probability that a single place reports a match, either because it is still intact or because it was overwritten by a colliding fingerprint.
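
Putting the pieces together, here is a sketch of the full filter (our own illustration; a production version would bit-pack the *b*-bit fingerprints rather than spend a full Int per place). Racing writers can at worst overwrite each other’s entries, behavior the filter already tolerates by design, so no locks or pointer updates are needed:

```scala
import scala.util.hashing.MurmurHash3

// Approximate membership filter with probabilistic time decay.
// Each place stores a b-bit fingerprint instead of the full 32-bit value.
class DecayFilter(s: Int, n: Int, b: Int) {
  require(b >= 1 && b <= 31)
  private val slots = new Array[Int](s)
  private val mask  = (1 << b) - 1 // keeps the low b bits

  private def position(e: Int, i: Int): Int =
    (MurmurHash3.productHash((e, i)) & 0x7fffffff) % s

  // Yet another hash maps the 32-bit event to its b-bit fingerprint.
  private def fingerprint(e: Int): Int =
    MurmurHash3.productHash((e, -1)) & mask

  def insert(e: Int): Unit = {
    val fp = fingerprint(e)
    for (i <- 0 until n) slots(position(e, i)) = fp
  }

  // May answer "YES" falsely with probability about 2^(-b) per place.
  def mightContain(e: Int): Boolean = {
    val fp = fingerprint(e)
    (0 until n).exists(i => slots(position(e, i)) == fp)
  }
}
```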

In the next graph, the probabilities are shown with a few variations of *n*; *b* is fixed to 8.

Now the curves don’t approach zero. The effect of false positives is surprisingly high for a big *n*. We cannot choose a small *b* when we use a large *n*. In my opinion, practical choices of parameters are { *b* = 8, *n* < 4 } and { *b* = 16, 4 ≤ *n* ≤ 32 }.

We designed a very simple data structure for an approximate membership filter with probabilistic time decay. We have three parameters (*n*, *b* and *s*) to configure according to the application’s needs. The filter is easy to implement and is highly concurrent since there are no pointers to update. Most importantly, it has a very low memory overhead since it consists of just one array object.
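
A single array object also makes serialization a bulk copy. A minimal sketch, assuming java.nio (the format is our choice, not necessarily what the original system uses):

```scala
import java.nio.ByteBuffer

// Serialize the filter's whole state: one Int array in, bytes out.
def serialize(slots: Array[Int]): Array[Byte] = {
  val buf = ByteBuffer.allocate(4 * slots.length)
  buf.asIntBuffer().put(slots)
  buf.array()
}

def deserialize(bytes: Array[Byte]): Array[Int] = {
  val out = new Array[Int](bytes.length / 4)
  ByteBuffer.wrap(bytes).asIntBuffer().get(out)
  out
}
```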

*We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.*