Google Interview Question for Software Engineers
Country: United States
For this, the IDs will have to be sorted using external sorting. See the Wikipedia article on "external sorting" if you need further details.
After sorting, we load memory-sized chunks of the sorted file. We can easily track unique IDs by comparing adjacent IDs in each chunk and write them out to a file. We also have to hold back the last ID of each chunk instead of writing it to the unique-ID file right away, to handle the case where a run of duplicate IDs is split across a chunk boundary.
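A minimal Python sketch of that post-sort scan, assuming one ID per line and reading "unique" as "appears exactly once" (the file names are just placeholders). Streaming line by line keeps memory bounded and handles duplicates that straddle a chunk boundary automatically:

def write_unique_ids(sorted_path, out_path):
    with open(sorted_path) as src, open(out_path, "w") as dst:
        prev = None          # ID currently being examined
        prev_count = 0       # how many times it has been seen so far
        for line in src:
            cur = line.strip()
            if cur == prev:
                prev_count += 1
            else:
                if prev is not None and prev_count == 1:
                    dst.write(prev + "\n")   # previous ID occurred exactly once
                prev, prev_count = cur, 1
        if prev is not None and prev_count == 1:
            dst.write(prev + "\n")           # flush the final ID

write_unique_ids("ids_sorted.txt", "unique_ids.txt")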
If possible, I'd just do the appropriate grep/cut/awk and pipe it through sort and uniq. I'd probably mention that in the interview because it takes less than a minute to explain. Obviously it's not the optimal approach, but it's the most practical one in many real-world situations. It breaks down for huge files, not just because the data has to be held in memory but because of the cost of the sort itself.
Otherwise I'd take a MapReduce-style approach. For each chunk of the file, use a hash map to find unique IDs; there's no need to sort here. In fact, if 100% accuracy is not required and one only needs to query whether a given ID is in the file, one could just construct a Bloom filter.
Either way, aggregate the per-chunk hash tables by merging their key sets, as in the sketch below.
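A rough sketch of that chunked hash-map approach (one ID per line assumed; the chunk size and file handling are only illustrative): each chunk is counted independently, the per-chunk counters are merged, and IDs whose total count is 1 are reported. Note the merged counter still has to fit in memory, which is where partitioning by a hash prefix (next comment) or a full MapReduce shuffle helps:

from collections import Counter
from itertools import islice

CHUNK_LINES = 1_000_000  # tune so one chunk fits comfortably in memory

def chunk_counts(path):
    with open(path) as f:
        while True:
            chunk = list(islice(f, CHUNK_LINES))
            if not chunk:
                break
            yield Counter(line.strip() for line in chunk)

def unique_ids(path):
    total = Counter()
    for c in chunk_counts(path):
        total.update(c)    # merge key sets, summing counts
    return [uid for uid, n in total.items() if n == 1]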
I was thinking along the same lines. For a large data set, a distributed hash will work: first hash the UID (or take its first k bits) to pick a hash table within the set of hash tables, then hash the UID again to locate the entry in that table (increment its counter if present, otherwise insert it). Finally, each table can be queried individually for unique IDs. The hashing and the queries can run across multiple threads, where each thread reads a section of the file at some offset and works over its own hash tables. A sketch of the partitioning step follows.
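A hedged Python sketch of that partitioning idea: route each ID to one of 2**K bucket files by the first K bits of its hash, so equal IDs always land in the same bucket, then count each bucket independently with an in-memory table (K, the file names, and the use of MD5 are just illustrative assumptions):

import hashlib

K = 4  # 2**K bucket files

def bucket_of(uid):
    h = hashlib.md5(uid.encode()).digest()
    return h[0] >> (8 - K)          # first K bits of the hash

def partition(path):
    buckets = [open(f"bucket_{i}.txt", "w") for i in range(2 ** K)]
    with open(path) as f:
        for line in f:
            uid = line.strip()
            buckets[bucket_of(uid)].write(uid + "\n")
    for b in buckets:
        b.close()

def unique_in_bucket(bucket_path):
    counts = {}
    with open(bucket_path) as f:
        for line in f:
            uid = line.strip()
            counts[uid] = counts.get(uid, 0) + 1
    return [uid for uid, n in counts.items() if n == 1]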
MapReduce will do too, though I am only a little familiar with that paradigm.
The map takes a block of the file data and outputs UID:1 for each UID seen.
Prior to sharding/shuffling the IDs (keys) to the reducers, a combining step can be run on the output of each map to filter the non-unique IDs.
The reducer aggregates the counts per ID and, for any ID, if count == 1, emits (ID, 1); otherwise it consumes it.
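A plain-Python sketch of those map / combine / reduce steps (the function names are only illustrative; in a real job they would be wired into a MapReduce framework, which handles grouping values by key across machines). One subtlety: the combine step below keeps a local count for every ID rather than discarding locally duplicated ones, because an ID duplicated in one block might also appear once in another block and would otherwise be miscounted as unique:

def mapper(block):
    # Emit (uid, 1) for every UID in a block of the file.
    for line in block:
        yield line.strip(), 1

def combiner(pairs):
    # Local pre-aggregation of one map's output.
    local = {}
    for uid, n in pairs:
        local[uid] = local.get(uid, 0) + n
    for uid, n in local.items():
        yield uid, n

def reducer(uid, counts):
    # Emit the UID only if its total count across all blocks is exactly 1.
    if sum(counts) == 1:
        yield uid, 1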
This looks like a MapReduce problem to me. Fork multiple map jobs, each taking a part of the file and calculating unique IDs. The output of all map jobs is aggregated by a final reducer. If the file is still too big, we can chain a sequence of MapReduce jobs.
- Richeek February 24, 2015