Facebook Interview Question
Software Engineer / Developers
Have a counter and a hashmap in external memory. Whenever a new word comes in, add it to the hashtable and increment the counter. Do this (add the word and increment the counter) only if the word is not already present in the hashtable.
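For example, a minimal single-machine sketch of that idea in Java (just to show the logic; the set here lives in memory rather than external storage):

import java.util.HashSet;
import java.util.Set;

public class UniqueWordCounter {
    public static long countUnique(Iterable<String> words) {
        Set<String> seen = new HashSet<>();
        long counter = 0;
        for (String w : words) {
            if (seen.add(w)) {   // add() returns true only for a word not seen before
                counter++;
            }
        }
        return counter;
    }
}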
Divide the file into chunks.
Use the MapReduce methodology,
and apply merge sort to each chunk.
"Use map reduce methodology" <<-- Thats too high level. The interviewer was interested in the inner workings of the 'reduce' phase.
There Can be multiple ways to implement 'reduce' phase.
Moreover it depends on your algorithm too.
Suppose you are distributing the file chunks to different nodes, and each node removes the duplicated words from its chunk, i.e.
A,B,C,D,C,A ===> B,D
Now, in the 'reduce' phase: reassemble the data, divide it into chunks again, and redistribute them to the nodes.
This works in a similar way to merge sort.
At the end you will get only the unique words.
The total complexity is O(N log N).
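A rough sketch of one such merge step in Java (my own interpretation, not the poster's code): each node returns a sorted list of the words that survived its local pass, and the merge keeps only words that appear in exactly one of the two lists:

import java.util.ArrayList;
import java.util.List;

public class ReduceMerge {
    // Merge two sorted, locally de-duplicated word lists, keeping
    // only the words that appear in exactly one of them.
    public static List<String> merge(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp < 0)      out.add(a.get(i++));
            else if (cmp > 0) out.add(b.get(j++));
            else {            // word occurs in both lists: drop it from the result
                i++;
                j++;
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}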
Better solutions welcome.
As the interviewer asked about the 'reduce' phase in detail, here's my approach:
1. Split the file into equal-sized chunks; each computer gets its own chunk.
2. For each chunk, create a hash table.
3. Union the two hash tables of computers (i) and (i+1).
4. Continue step 3 in a binary-tree fashion, bottom-up (a sketch follows the analysis below).
Complexity analysis:
n = total # of words in the file
k = # of computers
Step 1: O(n)
Step 2: O(n) on average
Steps 3 & 4: there are O(log k) levels in the tree,
and each level costs O(n) time on average,
so the total complexity is O(n log k).
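A sketch of steps 2-4 in Java, with each "computer" simulated as a chunk in a single process (the chunking and distribution are assumed to have happened already):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TreeUnion {
    // Step 2: build one hash set per chunk, then (steps 3-4) union the
    // sets pairwise, bottom-up, until a single set remains.
    public static int countUnique(List<List<String>> chunks) {
        List<Set<String>> sets = new ArrayList<>();
        for (List<String> chunk : chunks) {
            sets.add(new HashSet<>(chunk));
        }
        while (sets.size() > 1) {
            List<Set<String>> next = new ArrayList<>();
            for (int i = 0; i < sets.size(); i += 2) {
                Set<String> merged = sets.get(i);
                if (i + 1 < sets.size()) {
                    merged.addAll(sets.get(i + 1));  // union of sets i and i+1
                }
                next.add(merged);
            }
            sets = next;
        }
        return sets.get(0).size();
    }
}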
What if the number of unique words is too large as well? In that case, the hash map would not fit in memory.
1) If the input is too large to fit into memory, do the mapping with the word as the key; the value does not matter.
2) In the map phase, the map function only emits key/value pairs keyed on the word.
The output is written to a buffer; when the buffer gets full,
the output is spilled to disk, and a combine function then operates on the spilled data and produces sorted-key output (this is manageable in memory).
At the end it is merged with the data already processed in the map phase (merging sorted data is O(N)).
This avoids the problem of memory being smaller than the amount of data to be processed.
Reduce phase: we can do a similar thing, where a merge is done over all the data received from the map phase (if you are familiar with the Hadoop implementation). Every reducer is guaranteed to receive all the data for a given key.
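To make the merge of sorted spills concrete, here is a rough plain-Java sketch (not Hadoop code, and not the framework linked below; the run format, one word per line, is my assumption):

import java.io.BufferedReader;
import java.io.IOException;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SpillMerge {
    // Stream-merge already-sorted spill files and count the distinct words.
    // Each BufferedReader reads one sorted run, one word per line.
    public static long countDistinct(List<BufferedReader> runs) throws IOException {
        PriorityQueue<String[]> heap =
            new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0]));
        for (int i = 0; i < runs.size(); i++) {
            String w = runs.get(i).readLine();
            if (w != null) heap.add(new String[] { w, String.valueOf(i) });
        }
        long distinct = 0;
        String last = null;
        while (!heap.isEmpty()) {
            String[] top = heap.poll();
            if (!top[0].equals(last)) {   // first time this key appears in the merge
                distinct++;
                last = top[0];
            }
            int run = Integer.parseInt(top[1]);
            String next = runs.get(run).readLine();
            if (next != null) heap.add(new String[] { next, top[1] });
        }
        return distinct;
    }
}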
I have done an implementation, assuming the data being handled does not end up larger than memory. It is quite a simple framework which illustrates the divide-and-conquer approach.
Check out this: tinyurl {dot} com/bvwtm3f
Why not simply insert all your words into a Set, e.g. a HashSet, and then call mySet.size() when the computation has ended?
In a multi-threaded environment you can safely use a ConcurrentSkipListSet unless there are some sort of compound operations.
If the number of words is N, the complexity is O(N) if a HashSet is used, or O(N log N) if a skip list is used.
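A minimal sketch of this, assuming all the words fit in memory:

import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

public class SetCount {
    public static int countUnique(Iterable<String> words) {
        // Thread-safe ordered set; a plain HashSet would do in a single-threaded run.
        Set<String> set = new ConcurrentSkipListSet<>();
        for (String w : words) {
            set.add(w);   // duplicates are simply ignored by the set
        }
        return set.size();
    }
}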
A hash table is an in-memory thing, and the file is too large to be stored in primary memory, so we have to think of some external data structure.
My suggestion is a B-tree: for each word, first search for it in the B-tree. If the word exists, increase its count; if it doesn't, insert it into the tree.
Once the scan of the whole file is done, all entries with a count of 1 are the words that were unique in the document.
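A sketch of that counting logic using Java's TreeMap as an in-memory stand-in for the B-tree (a real solution would need a disk-backed B-tree, which is not shown here):

import java.util.Map;
import java.util.TreeMap;

public class BTreeCount {
    public static long countWordsSeenOnce(Iterable<String> words) {
        // TreeMap plays the role of the ordered, searchable tree.
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);   // insert with count 1, or increment
        }
        long unique = 0;
        for (int c : counts.values()) {
            if (c == 1) unique++;               // words that occurred exactly once
        }
        return unique;
    }
}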
# Create a TRIE, with a special node at the end of each word.
# Read the words one by one until EOF.
# Whenever a new word is added to the TRIE, write 1 in its special terminal node and increment unique_count.
# Whenever a word is repeated, increment the counter in its terminal node. If the counter goes from 1 to 2, also decrement unique_count.
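A sketch of such a trie in Java (the node layout and names are my own):

import java.util.HashMap;
import java.util.Map;

public class WordTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count = 0;              // occurrences of the word ending at this node
    }

    private final Node root = new Node();
    private int uniqueCount = 0;    // words seen exactly once so far

    public void add(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.count++;
        if (cur.count == 1) uniqueCount++;       // first occurrence
        else if (cur.count == 2) uniqueCount--;  // no longer unique
    }

    public int uniqueWords() {
        return uniqueCount;
    }
}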
1) Merge sort
2) Hadoop + Hive