# Bloom Filters and Probabilistic Data Structures

Bloom filters are probabilistic data structures used to test whether an element is a member of a set. They can give two types of responses: "possibly in the set" or "definitely not in the set".

A Bloom filter consists of a bit array of m bits, all initially set to 0, together with k different hash functions, each of which maps a set element to one of the m array positions with a uniform random distribution.

Let's consider a simple implementation of a Bloom filter in Python:

```python
import mmh3
from bitarray import bitarray


class BloomFilter:
    def __init__(self, size, hash_num):
        self.size = size          # m: number of bits in the array
        self.hash_num = hash_num  # k: number of hash functions
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def add(self, string):
        # Simulate k independent hash functions by varying the MurmurHash seed.
        for seed in range(self.hash_num):
            result = mmh3.hash(string, seed) % self.size
            self.bit_array[result] = 1

    def lookup(self, string):
        for seed in range(self.hash_num):
            result = mmh3.hash(string, seed) % self.size
            if self.bit_array[result] == 0:
                return "Nope"  # at least one probed bit is 0: definitely absent
        return "Probably"  # every probed bit is 1: possibly present
```

In the code above, we use `mmh3` (MurmurHash3) as our hash function and `bitarray` for the bit array. When we add an element x, the bits at the k array positions h1(x), h2(x), ..., hk(x) are set to 1. When checking whether an item is in the set, we check the bits at those same positions: if any of them is 0, the item is definitely not in the set; if all of them are 1, the item may be in the set. A short usage sketch appears at the end of this section.

Bloom filters are space-efficient, but the cost is a certain error probability, known as the false positive rate. For a filter of m bits holding n elements with k hash functions, the false positive rate is approximately (1 - e^(-kn/m))^k: it decreases as the bit array grows and increases with the number of elements stored. A sizing sketch based on this formula also appears at the end of this section.

Remember, Bloom filters never give false negatives: if the filter says an item is not in the set, it really isn't. However, it might say an item is in the set when it isn't (a false positive).

Probabilistic data structures like Bloom filters are a powerful tool in software engineering, particularly when dealing with large data sets where exact precision is not required and some amount of error can be tolerated.
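
As promised above, here is a minimal usage sketch of the `BloomFilter` class. The size, hash count, and sample words are illustrative assumptions, not tuned or canonical values:

```python
# Assumes the BloomFilter class defined above is in scope.
# Illustrative parameters: m = 500,000 bits, k = 7 hash functions.
bf = BloomFilter(size=500_000, hash_num=7)

for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print(bf.lookup("banana"))  # "Probably": every probed bit is set
print(bf.lookup("durian"))  # almost certainly "Nope": some probed bit is still 0
```

Note that `lookup("durian")` could in principle return "Probably" as well; that is exactly the false positive case discussed above.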
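
To make the size/error trade-off concrete, here is a small sketch that estimates the bit-array size m and hash count k for an expected number of items n and a target false positive rate p, using the standard approximations m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2. The function name `bloom_parameters` is ours, not part of any library:

```python
import math

def bloom_parameters(n, p):
    """Estimate bit-array size m and hash count k for n expected items
    and a target false positive rate p (standard approximations)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits needed
    k = max(1, round((m / n) * math.log(2)))              # near-optimal hash count
    return m, k

# For example, 1 million items at a 1% false positive rate:
m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)  # about 9,585,059 bits (~1.2 MB) and 7 hash functions
```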