Google Interview Question
Software Engineer / Developers
Country: United States
Interview Type: In-Person
How about building a hash table? If the key produced by the hash function for the input data matches an existing key, compare the entry that the key points to, and if the entries match, either remove the new item or increase its count, depending on the requirement...
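A minimal sketch of that idea in Python (names such as `handle` and `process` are placeholders, not from the original post):

```python
from collections import defaultdict

counts = defaultdict(int)  # entry -> number of times seen

def process(item):
    print("unique:", item)  # placeholder for real handling

def handle(item):
    # A Python dict already performs the "compare the entry" step:
    # on a hash collision it falls back to an equality check on the key.
    counts[item] += 1
    if counts[item] > 1:
        return  # duplicate: drop it, or keep only the count, per the requirement
    process(item)

for item in ["a", "b", "a"]:
    handle(item)
```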
I believe the key word here is "rapid." Data comes in so fast that we cannot afford the hash table's worst-case O(n) access time. I am thinking we could add data onto a queue first and then check it against the hash table?
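One way to sketch that buffering idea, assuming a single consumer thread and an in-memory set as the "hashtable" (all names here are illustrative):

```python
import queue
import threading

incoming = queue.Queue(maxsize=10_000)  # buffer absorbs bursts from the fast producer
seen = set()

def consumer():
    while True:
        item = incoming.get()
        if item not in seen:  # amortized O(1) membership test
            seen.add(item)
            print("new:", item)
        incoming.task_done()

threading.Thread(target=consumer, daemon=True).start()

for tweet in ["a", "b", "a", "c"]:  # stand-in for the rapid stream
    incoming.put(tweet)
incoming.join()  # wait until the buffered items have been processed
```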
When you say rapid succession, could we take incoming "tweet" data as an example?
Another thing that needs to be clarified: how important is the data? Can we accept losing a little bit of data in order to implement a faster way of checking for redundancy?
As of today, there are various ways to deduplicate data:
File-level:
File-level data deduplication compares an incoming file to archived files by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored.
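As a rough illustration, here is what a file-level index might look like if the attribute checked is a content hash (an assumption; real systems may index other attributes such as size and timestamps):

```python
import hashlib

index = {}  # content hash -> path of the archived copy

def store(path):
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest in index:
        return index[digest]   # duplicate: keep only a pointer to the existing file
    index[digest] = path       # unique: store the file and update the index
    return path
```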
Block-level:
Block-level data deduplication operates at the sub-file level. The file is broken down into chunks or blocks, and these are examined for redundancy against previously stored information.
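A sketch of the block-level variant, assuming simple fixed-size chunking (many real systems use variable-size, content-defined chunking instead):

```python
import hashlib

BLOCK_SIZE = 4096
block_store = {}  # block hash -> block bytes, each unique block stored once

def dedupe(data: bytes):
    """Split a file into blocks and return its 'recipe' of block hashes."""
    recipe = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        block_store.setdefault(h, block)  # store the block only if unseen
        recipe.append(h)
    return recipe
```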
Of the two, if you are looking for computational efficiency, you would go with file-level deduplication, as its indexes are smaller than those of block-level deduplication.
The latest deduplication techniques automate the hunt for duplicates at a very granular level and apply specialized compression algorithms to what remains, shrinking your data even further.
You can also read up on content-aware dedupe.
Since this problem is within an interview context, I believe the answer should be something as simple as hashing the data and seeing if the hash has been seen before. The trick is that hashing once wouldn't be enough because of collisions, so hashing a few times with different hash functions would be necessary; see Bloom Filters.
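A minimal Bloom filter sketch along those lines (the sizes and the salted-SHA-256 trick are arbitrary choices, not part of the answer above):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=5):
        self.m = m_bits                      # number of bits in the filter
        self.k = k                           # number of hash functions
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # False is definitive ("never seen"); True may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```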
The question is probably intentionally vague, and since we don't have access to the interviewer to ask clarifying questions, we have to make our own assumptions. If the range of the incoming data is small, the whole hash table fits in memory and the solution is pretty trivial. But if the range is very large (like finding duplicate incoming images on imgur), we need a huge hash table to minimize collisions, and we also have to think about how to map the data to a hash key; in that case the hash table might no longer fit in memory, or even on a single machine. Another important question is whether duplicates happen often or rarely. If the answer is rarely, we can use a Bloom filter to check very quickly that incoming data is not already present in our database (a Bloom filter is 100% accurate when it reports that an item is not in the set; otherwise we have to go to the database for the definitive answer). A sketch of this filter-first pattern follows below.
- Mona February 24, 2014
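A minimal sketch of that filter-first pattern, reusing the `BloomFilter` class sketched earlier in this thread and a plain set as a stand-in for the database:

```python
bloom = BloomFilter()   # the class sketched in the earlier answer
database = set()        # stand-in for the real, authoritative store

def is_duplicate(item: str) -> bool:
    if not bloom.might_contain(item):
        bloom.add(item)        # definitely new: no database round-trip needed
        database.add(item)
        return False
    if item in database:       # possibly seen: rule out a false positive
        return True
    bloom.add(item)
    database.add(item)
    return False
```

When duplicates are rare, most items take the first branch, so the expensive database check is almost never executed.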