Google Interview Question
Software Engineer / Developers
Country: United States
Interview Type: In-Person
How about building a hash table? If the key produced by the hash function for the input data matches an existing key, compare the entry that the key points to, and if the entries match, either remove the new item or increase its count, depending on the requirement...
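A minimal sketch of that idea in Python (names such as `handle` and `process` are placeholders, not from the original post):

```python
from collections import defaultdict

counts = defaultdict(int)  # entry -> number of times seen

def process(item):
    print("unique:", item)  # placeholder for real handling

def handle(item):
    # A Python dict already performs the "compare the entry" step:
    # on a hash collision it falls back to an equality check on the key.
    counts[item] += 1
    if counts[item] > 1:
        return  # duplicate: drop it, or keep only the count, per the requirement
    process(item)

for item in ["a", "b", "a"]:
    handle(item)
```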
I believe the key word here is "rapid." Data comes in so fast that we cannot afford the hash table's worst-case O(n) access time. I am thinking we could add data onto a queue first and then check it against the hash table?
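One way to sketch that buffering idea, assuming a single consumer thread and an in-memory set as the "hashtable" (all names here are illustrative):

```python
import queue
import threading

incoming = queue.Queue(maxsize=10_000)  # buffer absorbs bursts from the fast producer
seen = set()

def consumer():
    while True:
        item = incoming.get()
        if item not in seen:  # amortized O(1) membership test
            seen.add(item)
            print("new:", item)
        incoming.task_done()

threading.Thread(target=consumer, daemon=True).start()

for tweet in ["a", "b", "a", "c"]:  # stand-in for the rapid stream
    incoming.put(tweet)
incoming.join()  # wait until the buffered items have been processed
```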
When you say rapid succession, could we take incoming "tweet" data as an example?
Another thing that needs to be clarified: how important is the data? Can we accept losing a little bit of data in order to implement a faster way of checking for redundancy?
As of today, there are various ways to deduplicate data:
File-level:
File-level data deduplication compares an incoming file to archived files by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored.
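As a rough illustration, here is what a file-level index might look like if the attribute checked is a content hash (an assumption; real systems may index other attributes such as size and timestamps):

```python
import hashlib

index = {}  # content hash -> path of the archived copy

def store(path):
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest in index:
        return index[digest]   # duplicate: keep only a pointer to the existing file
    index[digest] = path       # unique: store the file and update the index
    return path
```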
Block-level:
Block-level data deduplication operates at the sub-file level. The file is broken down into chunks or blocks, and these are examined for redundancy against previously stored information.
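A sketch of the block-level variant, assuming simple fixed-size chunking (many real systems use variable-size, content-defined chunking instead):

```python
import hashlib

BLOCK_SIZE = 4096
block_store = {}  # block hash -> block bytes, each unique block stored once

def dedupe(data: bytes):
    """Split a file into blocks and return its 'recipe' of block hashes."""
    recipe = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        block_store.setdefault(h, block)  # store the block only if unseen
        recipe.append(h)
    return recipe
```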
Of the two, if you are looking for computational efficiency, you would go with file-level deduplication, as its indexes are smaller than those of block-level deduplication.
The latest deduplication techniques automate the hunt for duplicates at a very granular level and apply specialized compression algorithms to what remains, shrinking your data even further.
You can also read up on content-aware dedupe.
Since this problem is within an interview context, I believe the answer should be something as simple as hashing the data and seeing if the hash has been seen before. The trick is that hashing once wouldn't be enough because of collisions, so hashing a few times with different hash functions would be necessary; see Bloom Filters.
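A minimal Bloom filter sketch along those lines (the sizes and the salted-SHA-256 trick are arbitrary choices, not part of the answer above):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=5):
        self.m = m_bits                      # number of bits in the filter
        self.k = k                           # number of hash functions
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # False is definitive ("never seen"); True may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```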
The question is probably intentionally vague, and since we don't have access to the interviewer to ask clarifying questions, we have to make our own assumptions. If the range of the incoming data is small, the whole hash table fits in memory and the solution is pretty trivial. But if the range is very large (like finding duplicate incoming images on imgur), we need a huge hash table to minimize collisions, and we also have to think about how to map the data to a hash key; in that case the hash table might no longer fit in memory, or even on a single machine. Another important question is whether duplicates happen often or rarely. If the answer is rarely, we can use a Bloom filter to check very quickly that incoming data is not already present in our database (a Bloom filter is 100% accurate when it reports that an item is not in the set; otherwise we have to go to the database for the definitive answer). A sketch of this filter-first pattern follows below.
- Mona February 24, 2014
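A minimal sketch of that filter-first pattern, reusing the `BloomFilter` class sketched earlier in this thread and a plain set as a stand-in for the database:

```python
bloom = BloomFilter()   # the class sketched in the earlier answer
database = set()        # stand-in for the real, authoritative store

def is_duplicate(item: str) -> bool:
    if not bloom.might_contain(item):
        bloom.add(item)        # definitely new: no database round-trip needed
        database.add(item)
        return False
    if item in database:       # possibly seen: rule out a false positive
        return True
    bloom.add(item)
    database.add(item)
    return False
```

When duplicates are rare, most items take the first branch, so the expensive database check is almost never executed.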