Given a bloom filter of size N-bits and K hash functions, of which M-bits (where M <= N) of the filter are set.
Is it possible to approximate the number of elements inserted into the bloom filter?
Simple Example
I've been mulling over the following example, assuming a BF of 100-bits and 5 hash functions where 10-bits are set...
Best case scenario: Assuming the hash functions are really perfect and uniquely map a bit for some X number of values, then given 10-bits have been set we can say that there has only been 2 elements inserted into the BF
Worst case scenario: Assuming the hash functions are bad and consistently map to the same bit (yet unique amongst each other), then we can say 10 elements have been inserted into the BF
The range seems to be [2,10] where abouts in this range is probably determined by the false-positive probability of filter - I'm stuck at this point.
This question worries me a bit because there are better algorithms for approximately counting the number of distinct elements with small amounts of storage.
Nevertheless, if we must use a Bloom filter, let's assume that the hash functions are random oracles (all values chosen independently, or "really perfect", not to be confused with perfect hashing). Now we have a balls and bins problem: given that
MofNbins have balls in them, how many balls did we throw? LetBbe the number of balls thrown; the number of items isB/K, since for each item we throwKballs.The standard approximation for balls and bins processes is to model each bin as an independent Poisson process; the time before a bin becomes occupied is exponentially distributed. Letting
1be the time it takes to throw all of the balls, the maximum likelihood estimateλof the rate of this exponential distribution satisfiesPr(Exponential[λ] < 1) = M/N, so1 - exp(-λ) = M/Nandλ = -log(1 - M/N). The parameterλis akin to the number of balls, so the estimate for the number of items isB ≈ -N log(1 - M/N)/K.EDIT: There are
Nbins, so we need to multiply byN.