The main motivation of Flower Filter is that we need a data structure that remembers newer data with higher probabilities and older data with lower probabilities. We call such probability “survival probability”. In this post, we discuss some math subtlety of the survival probability.
First, let’s have a quick review of the background. We have an integer array A of size s, and A stores the fingerprints of events. When an event E comes, we use n hash functions to compute n location indexes, and E stores its fingerprints in these locations. (Due to hash collisions, E may take fewer than n locations in array A. But let’s ignore this for now.) Suppose t events come after event E. These events may overwrite event E’s fingerprints. We say event E survives if at least one of its n fingerprints is intact.
In the original article, the survival probability Φ of event E is computed as:
Here, p is the survival probability of one fingerprint of event E, and it is computed as:
The summands in the summation are binomial probabilities. Thus, an implicit assumption is: the survival of different fingerprints of one event are independent. If you are really picky about math, you will find this assumption is not valid.
A simple counter-example: consider n=2, s=4, t=1. In other words, array A can hold up to 4 fingerprints, and each event can store up to 2 fingerprints. Suppose initially an event E stores two fingerprints in slots 1 and 2. We need to compute the survival probability of E if there is one more event (since t=1).
If the survival of slot 1 and slot 2 were independent, then the conditional probability p(slot 2 intact given that slot 1 intact) should equal p(2 intact). But,
Thus, intactness of 1 and 2 are not independent.
To get around this dependency problem, we can use the inclusion-exclusion principle. Let Xi denote the case that the i-th fingerprint of event E is intact after t events. We have
Clearly, this looks very different from the original formula. The question is, in reality, does this matter much? The answer is no, and we hope the plot below can convince the reader. Here, the green curve represents the new formula, the bule curve for the old formula.
As we can see, the difference is negligible. In fact, in our system, n<<s, so the difference is even smaller.
The new formula assumes that event E stores exactly n fingerprints in array A. If you are picky, you can find the total probability by marginalizing over all cases (E stores exactly i fingerprints, i = 1, …, n). But we believe the difference is marginal.
Conclusion: In the original Flower Filter blog post, we implicitly assumed that the survival probabilities of an event’s different fingerprints are independent. In this article, we showed that that assumption does not hold, derived a new formula to compute the survival probability of an event, and demonstrated that the difference is negligible in practice.
We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.