Google Interview Question for Software Engineers
Country: United States
For this, the IDs will have to be sorted using external sorting. See the Wikipedia article on "external sorting" if you need further details.
After sorting, we load memory-sized chunks of the sorted file. We can easily track unique IDs by comparing adjacent IDs in each chunk and write them out to a file. We also have to hold back the last ID of each chunk instead of writing it to the unique-ID file right away, to handle the case where a run of duplicate IDs is split across a chunk boundary.
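A minimal Python sketch of that post-sort scan, assuming one ID per line and reading "unique" as "appears exactly once" (the file names are just placeholders). Streaming line by line keeps memory bounded and handles duplicates that straddle a chunk boundary automatically:

def write_unique_ids(sorted_path, out_path):
    with open(sorted_path) as src, open(out_path, "w") as dst:
        prev = None          # ID currently being examined
        prev_count = 0       # how many times it has been seen so far
        for line in src:
            cur = line.strip()
            if cur == prev:
                prev_count += 1
            else:
                if prev is not None and prev_count == 1:
                    dst.write(prev + "\n")   # previous ID occurred exactly once
                prev, prev_count = cur, 1
        if prev is not None and prev_count == 1:
            dst.write(prev + "\n")           # flush the final ID

write_unique_ids("ids_sorted.txt", "unique_ids.txt")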
If possible, I'd just do the appropriate grep/cut/awk and pipe it through sort and uniq. I'd probably mention that in the interview because it takes less than a minute to explain. Obviously it's not the optimal approach, but it's the most practical one in many real-world situations. It breaks down for huge files, not just because the data has to be held in memory but because of the cost of the sort itself.
Otherwise I'd take a MapReduce-style approach. For each chunk of the file, use a hash map to find unique IDs; there's no need to sort here. In fact, if 100% accuracy is not required and one only needs to query whether a given ID is in the file, one could just construct a Bloom filter.
Either way, aggregate the per-chunk hash tables by merging their key sets, as in the sketch below.
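A rough sketch of that chunked hash-map approach (one ID per line assumed; the chunk size and file handling are only illustrative): each chunk is counted independently, the per-chunk counters are merged, and IDs whose total count is 1 are reported. Note the merged counter still has to fit in memory, which is where partitioning by a hash prefix (next comment) or a full MapReduce shuffle helps:

from collections import Counter
from itertools import islice

CHUNK_LINES = 1_000_000  # tune so one chunk fits comfortably in memory

def chunk_counts(path):
    with open(path) as f:
        while True:
            chunk = list(islice(f, CHUNK_LINES))
            if not chunk:
                break
            yield Counter(line.strip() for line in chunk)

def unique_ids(path):
    total = Counter()
    for c in chunk_counts(path):
        total.update(c)    # merge key sets, summing counts
    return [uid for uid, n in total.items() if n == 1]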
I was thinking along the same lines. For a large data set, a distributed hash will work: first hash the UID (or take its first k bits) to pick a hash table within the set of hash tables, then hash the UID again to locate the entry in that table (increment its counter if present, otherwise insert it). Finally, each table can be queried individually for unique IDs. The hashing and the queries can run across multiple threads, where each thread reads a section of the file at some offset and works over its own hash tables. A sketch of the partitioning step follows.
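A hedged Python sketch of that partitioning idea: route each ID to one of 2**K bucket files by the first K bits of its hash, so equal IDs always land in the same bucket, then count each bucket independently with an in-memory table (K, the file names, and the use of MD5 are just illustrative assumptions):

import hashlib

K = 4  # 2**K bucket files

def bucket_of(uid):
    h = hashlib.md5(uid.encode()).digest()
    return h[0] >> (8 - K)          # first K bits of the hash

def partition(path):
    buckets = [open(f"bucket_{i}.txt", "w") for i in range(2 ** K)]
    with open(path) as f:
        for line in f:
            uid = line.strip()
            buckets[bucket_of(uid)].write(uid + "\n")
    for b in buckets:
        b.close()

def unique_in_bucket(bucket_path):
    counts = {}
    with open(bucket_path) as f:
        for line in f:
            uid = line.strip()
            counts[uid] = counts.get(uid, 0) + 1
    return [uid for uid, n in counts.items() if n == 1]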
MapReduce will do too, though I am only a little familiar with that paradigm.
The map takes a block of the file data and outputs UID:1 for each UID seen.
Prior to sharding/shuffling the IDs (keys) to the reducers, a combining step can be run on the output of each map to filter the non-unique IDs.
The reducer aggregates the counts per ID and, for any ID, if count == 1, emits (ID, 1); otherwise it consumes it.
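A plain-Python sketch of those map / combine / reduce steps (the function names are only illustrative; in a real job they would be wired into a MapReduce framework, which handles grouping values by key across machines). One subtlety: the combine step below keeps a local count for every ID rather than discarding locally duplicated ones, because an ID duplicated in one block might also appear once in another block and would otherwise be miscounted as unique:

def mapper(block):
    # Emit (uid, 1) for every UID in a block of the file.
    for line in block:
        yield line.strip(), 1

def combiner(pairs):
    # Local pre-aggregation of one map's output.
    local = {}
    for uid, n in pairs:
        local[uid] = local.get(uid, 0) + n
    for uid, n in local.items():
        yield uid, n

def reducer(uid, counts):
    # Emit the UID only if its total count across all blocks is exactly 1.
    if sum(counts) == 1:
        yield uid, 1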
This looks like a MapReduce problem to me. Fork multiple map jobs, each taking a part of the file and calculating unique IDs. The output of all map jobs is aggregated by a final reducer. If the file is still too big, we can chain a sequence of MapReduce jobs.
- Richeek February 24, 2015