Facebook Interview Question
Software Engineer / Developers
Have a counter and a hashmap in external memory. Whenever a new word comes in, add it to the hashtable and increment the counter. Do this (add the word and increment the counter) only if the word is not already present in the hashtable.
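For example, a minimal single-machine sketch of that idea in Java (just to show the logic; the set here lives in memory rather than external storage):

import java.util.HashSet;
import java.util.Set;

public class UniqueWordCounter {
    public static long countUnique(Iterable<String> words) {
        Set<String> seen = new HashSet<>();
        long counter = 0;
        for (String w : words) {
            if (seen.add(w)) {   // add() returns true only for a word not seen before
                counter++;
            }
        }
        return counter;
    }
}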
Divide the file into chunks.
Use the MapReduce methodology,
and apply merge sort to each chunk.
"Use map reduce methodology" <<-- Thats too high level. The interviewer was interested in the inner workings of the 'reduce' phase.
There Can be multiple ways to implement 'reduce' phase.
Moreover it depends on your algorithm too.
Suppose you are distributing the file chunks to different nodes, and each node removes the duplicated words from its chunk, i.e.
A,B,C,D,C,A ===> B,D
Now, in the 'reduce' phase: reassemble the data, divide it into chunks again, and redistribute them to the nodes.
This works in a similar way to merge sort.
At the end you will get only the unique words.
The total complexity is O(N log N).
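A rough sketch of one such merge step in Java (my own interpretation, not the poster's code): each node returns a sorted list of the words that survived its local pass, and the merge keeps only words that appear in exactly one of the two lists:

import java.util.ArrayList;
import java.util.List;

public class ReduceMerge {
    // Merge two sorted, locally de-duplicated word lists, keeping
    // only the words that appear in exactly one of them.
    public static List<String> merge(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp < 0)      out.add(a.get(i++));
            else if (cmp > 0) out.add(b.get(j++));
            else {            // word occurs in both lists: drop it from the result
                i++;
                j++;
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}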
Better solutions welcome.
As the interviewer asked about the 'reduce' phase in detail, here's my approach:
1. Split the file into equal-sized chunks; each computer gets its own chunk.
2. For each chunk, create a hash table.
3. Union the two hash tables of computers (i) and (i+1).
4. Continue step 3 in a binary-tree fashion, bottom-up (a sketch follows the analysis below).
Complexity analysis:
n = total # of words in the file
k = # of computers
Step 1: O(n)
Step 2: O(n) on average
Steps 3 & 4: there are O(log k) levels in the tree,
and each level costs O(n) time on average,
so the total complexity is O(n log k).
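A sketch of steps 2-4 in Java, with each "computer" simulated as a chunk in a single process (the chunking and distribution are assumed to have happened already):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TreeUnion {
    // Step 2: build one hash set per chunk, then (steps 3-4) union the
    // sets pairwise, bottom-up, until a single set remains.
    public static int countUnique(List<List<String>> chunks) {
        List<Set<String>> sets = new ArrayList<>();
        for (List<String> chunk : chunks) {
            sets.add(new HashSet<>(chunk));
        }
        while (sets.size() > 1) {
            List<Set<String>> next = new ArrayList<>();
            for (int i = 0; i < sets.size(); i += 2) {
                Set<String> merged = sets.get(i);
                if (i + 1 < sets.size()) {
                    merged.addAll(sets.get(i + 1));  // union of sets i and i+1
                }
                next.add(merged);
            }
            sets = next;
        }
        return sets.get(0).size();
    }
}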
What if the number of unique words is too large as well? In that case, the hash map would not fit in memory.
1) If the input is too large to fit into memory, do the mapping with the word as the key; the value does not matter.
2) In the map phase, the map function only emits key/value pairs keyed on the word.
The output is written to a buffer; when the buffer gets full,
the output is spilled to disk, and a combine function then operates on the spilled data and produces sorted-key output (this is manageable in memory).
At the end it is merged with the data already processed in the map phase (merging sorted data is O(N)).
This avoids the problem of memory being smaller than the amount of data to be processed.
Reduce phase: we can do a similar thing, where a merge is done over all the data received from the map phase (if you are familiar with the Hadoop implementation). Every reducer is guaranteed to receive all the data for a given key.
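To make the merge of sorted spills concrete, here is a rough plain-Java sketch (not Hadoop code, and not the framework linked below; the run format, one word per line, is my assumption):

import java.io.BufferedReader;
import java.io.IOException;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SpillMerge {
    // Stream-merge already-sorted spill files and count the distinct words.
    // Each BufferedReader reads one sorted run, one word per line.
    public static long countDistinct(List<BufferedReader> runs) throws IOException {
        PriorityQueue<String[]> heap =
            new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0]));
        for (int i = 0; i < runs.size(); i++) {
            String w = runs.get(i).readLine();
            if (w != null) heap.add(new String[] { w, String.valueOf(i) });
        }
        long distinct = 0;
        String last = null;
        while (!heap.isEmpty()) {
            String[] top = heap.poll();
            if (!top[0].equals(last)) {   // first time this key appears in the merge
                distinct++;
                last = top[0];
            }
            int run = Integer.parseInt(top[1]);
            String next = runs.get(run).readLine();
            if (next != null) heap.add(new String[] { next, top[1] });
        }
        return distinct;
    }
}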
I have done an implementation, assuming the data being handled does not end up larger than memory. It is quite a simple framework which illustrates the divide-and-conquer approach.
Check out this: tinyurl {dot} com/bvwtm3f
Why not simply insert all your words into a Set, e.g. a HashSet, and then call mySet.size() when the computation has ended?
In a multi-threaded environment you can safely use a ConcurrentSkipListSet unless there are some sort of compound operations.
If the number of words is N, the complexity is O(N) if a HashSet is used, or O(N log N) if a skip list is used.
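A minimal sketch of this, assuming all the words fit in memory:

import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

public class SetCount {
    public static int countUnique(Iterable<String> words) {
        // Thread-safe ordered set; a plain HashSet would do in a single-threaded run.
        Set<String> set = new ConcurrentSkipListSet<>();
        for (String w : words) {
            set.add(w);   // duplicates are simply ignored by the set
        }
        return set.size();
    }
}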
A hash table is an in-memory thing, and the file is too large to be stored in primary memory, so we have to think of some external data structure.
My suggestion is a B-tree: for each word, first search for it in the B-tree. If the word exists, increase its count; if it doesn't, insert it into the tree.
Once the scan of the whole file is done, all entries with a count of 1 are the words that were unique in the document.
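A sketch of that counting logic using Java's TreeMap as an in-memory stand-in for the B-tree (a real solution would need a disk-backed B-tree, which is not shown here):

import java.util.Map;
import java.util.TreeMap;

public class BTreeCount {
    public static long countWordsSeenOnce(Iterable<String> words) {
        // TreeMap plays the role of the ordered, searchable tree.
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);   // insert with count 1, or increment
        }
        long unique = 0;
        for (int c : counts.values()) {
            if (c == 1) unique++;               // words that occurred exactly once
        }
        return unique;
    }
}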
# Create a TRIE, with a special node at the end of each word.
# Read the words one by one until EOF.
# Whenever a new word is added to the TRIE, write 1 in its special terminal node and increment unique_count.
# Whenever a word is repeated, increment the counter in its terminal node. If the counter goes from 1 to 2, also decrement unique_count.
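A sketch of such a trie in Java (the node layout and names are my own):

import java.util.HashMap;
import java.util.Map;

public class WordTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count = 0;              // occurrences of the word ending at this node
    }

    private final Node root = new Node();
    private int uniqueCount = 0;    // words seen exactly once so far

    public void add(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.count++;
        if (cur.count == 1) uniqueCount++;       // first occurrence
        else if (cur.count == 2) uniqueCount--;  // no longer unique
    }

    public int uniqueWords() {
        return uniqueCount;
    }
}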
1) Merge sort
2) Hadoop + Hive