## Microsoft Interview Question

**Country:**United States

Bloom filter will work only for the false case, not the true case.

One other way to optimize for space could be to keep two levels of hashsets. One regular one, one for contiguous ranges. If in your regular hashset you find that you have hashed numbers in, say, a contiguous block of 10 (like 0-9 inclusive), you can remove them from regular hashshet and then add 1 to your range hashset. Now, as you are checking a new number n, check to see if n/10 exists in the range hashset. If not, check in the regular hashshet.

This can incredibly reduce your space usage.

One more thing, this seems more like an open ended question to see how many questions you can ask about details and come up with different solutions for it rather than one with one correct answer.

The fact that smaller numbers are more likely to appear before large numbers is a hint that the data will be clustered in continues ranges.

Here is a Python code that would keep track of seen numbers by storing contiguous ranges in a binary tree:

```
class MyRange:
def __init__(self, l: int, r: int) -> None:
self.l = l
self.r = r
class IntegerSetNode:
def __init__(self, i: int):
self.range = MyRange(i, i+1)
self.left = None
self.right = None
class IntegerSet:
def __init__(self):
self.root = None
def seen(self, i: int) -> bool:
if self.root is None:
self.root = IntegerSetNode(i)
return False
else:
node = self.root
while True:
if node.range.l > i:
if node.range.l == i + 1:
node.range.l = i
return False
elif node.left is None:
node.left = IntegerSetNode(i)
return False
else:
node = node.left
elif node.range.r <= i:
if node.range.r == i:
node.range.r = i + 1
return False
elif node.right is None:
node.right = IntegerSetNode(i)
return False
else:
node = node.right
else:
return True
def test():
in_set = IntegerSet()
for i in range(10):
if in_set.seen(i):
print(f"should not have seen {i}")
for i in range(10):
if not in_set.seen(i):
print(f"should have have seen {i}")
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
test()
```

The question starts with a probability topic and restriction on space both point to a possible solution using Bloom Filter.

- Saurabh March 18, 2021Now Bloom Filter is a probabilistic data structure that uses a relatively small fixed-size array to record hashes for the elements that it has seen before. It may give you a false positive with some probability P but never a false negative. (So it may say that the element exists in a set when the element may not really exist)

Usually, a bloom filter has an "add" and "contains" operation which both are very fast (time complexity O(1)). But there is an implementation/extension of the bloom filter called, the Counting Bloom filter with a "remove" operation at the cost of incurring an additional space overhead for counting.

I would assume, the first part of the question where smaller numbers have more probability of appearing first in the stream to suggest that once you start getting larger numbers, you can remove hashes generated using smaller numbers to further reduce the false positive scenario.