Facebook Interview Question for Software Engineer / Developers


Country: United States




7 of 7 votes

1. Do an external sort (merge sort on files).
2. Collapse duplicates to "num, word", e.g. "1025, the".
3. While doing step 2, keep a max heap of word frequencies.

This way no extra storage is needed, but the complexity is O(N*log(N)).

If extra storage is permitted, you can build a hash file, where the line number corresponds to the hash index. You'll probably end up using twice the original storage, though. Also, random access on disk will probably be less efficient than the sorting in the external sort, since there the lines are read sequentially and written back sequentially.

- Martin August 27, 2012 | Flag Reply
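
A minimal sketch of steps 2 and 3 in C++, assuming the external sort has already written the words, one per line and in sorted order, to a file (the name "sorted.txt" is just for illustration), and using a size-10 min-heap for the final ranking, as suggested further down the thread:

#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

int main()
{
	// Output of the external sort: one word per line, duplicates adjacent.
	std::ifstream in("sorted.txt");

	// Min-heap of (count, word): the weakest of the current top 10 sits on top.
	typedef std::pair<long long, std::string> Entry;
	std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > top;

	std::string word, prev;
	long long count = 0;
	while (true)
	{
		bool more = static_cast<bool>(in >> word);
		if (more && word == prev)
		{
			++count;                    // still inside the same run of duplicates
			continue;
		}
		if (count > 0)                  // a run just ended: this is the "num, word" pair
		{
			top.push(Entry(count, prev));
			if (top.size() > 10)
				top.pop();              // evict the smallest of the 11 candidates
		}
		if (!more)
			break;
		prev = word;
		count = 1;
	}

	while (!top.empty())                // prints the least frequent of the top 10 first
	{
		std::cout << top.top().first << ", " << top.top().second << "\n";
		top.pop();
	}
}
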
1 of 1 vote

This is the solution in my mind as well, but it requires a lot of code (creating multiple files, file I/O code, merge sort for individual chunks, and then an N-way merge). Does Facebook expect a complete solution end to end?

- axecapone August 28, 2012 | Flag
0 of 0 votes

If you have O(n) space for building a max heap, why not use a hash map instead and avoid the O(n log n) sort?

Look at the solution provided above and let me know your opinion.

- zasa1010 August 28, 2012 | Flag
0 of 0 votes

You are right. A max heap is not a solution to me either, but external sort is. In an external sort, typically, the sorted chunks are saved back to disk and then you perform an N-way merge on the files.

- axecapone August 28, 2012 | Flag
0 of 0 votes

Linux command:

cat filename.txt | tr -cs "[:alnum:]" "\n"| tr "[:lower:]" "[:upper:]" | awk '{h[$1]++}END{for (i in h){print h[i]" "i}}'|sort -nr | cat -n | head -n 10

- Psycho September 20, 2012 | Flag
0 of 0 votes

Your heap can run out of memory; you need to write out the intermediate results. Then create a max heap of size 10, read back the results computed in the previous steps, and insert them into the heap. Finally, the heap will contain the 10 most frequent strings.

- ram October 02, 2012 | Flag
0 of 0 votes

Min-heap instead of max-heap; you want to pop the smallest key. ;) And it doesn't require linear space; it only needs to maintain the top 10 frequencies.

- Safi December 10, 2014 | Flag
1 of 1 vote

Use Hash-Map and Min Heap

- Psycho August 29, 2012 | Flag Reply
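
A minimal sketch of that combination in C++, assuming the distinct words fit in memory (as the discussion below argues they usually do) and that whitespace-separated words are read from standard input:

#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main()
{
	// Count every word; only the *distinct* words have to fit in memory.
	std::unordered_map<std::string, long long> freq;
	std::string word;
	while (std::cin >> word)
		++freq[word];

	// Min-heap of (count, word), capped at 10 entries: the top of the heap is
	// always the weakest current candidate, so it is the one to evict.
	typedef std::pair<long long, std::string> Entry;
	std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > top;
	for (const auto& kv : freq)
	{
		top.push(Entry(kv.second, kv.first));
		if (top.size() > 10)
			top.pop();
	}

	while (!top.empty())    // least frequent of the winners comes out first
	{
		std::cout << top.top().second << ": " << top.top().first << "\n";
		top.pop();
	}
}
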
1 of 1 vote

Well, there could be two different definitions of "word".
1. A word in an English dictionary; the biggest dictionary I have seen so far has about 400k words.
2. A sequence of characters separated by separators such as spaces, full stops, etc. If we can assume these all come from an English dictionary (or another language's dictionary), or at least that every word sequence in our terabyte file belongs to a word in some language, then the total set of distinct words is limited and can easily be loaded into a hash table in memory.

- Anonymous October 18, 2012 | Flag Reply
0 of 0 votes

Instead of using hash tables, we can use a different data structure (a trie) to track the frequency of occurrences. I've written a basic program for any number of words read from input.

Each node stores the count of occurrences of the word ending at that node. Based on the next letter we create an array of 26 child nodes from the current node (assuming all letters are lowercase), and the process continues until the last letter.

#include <iostream>
#include <string>

struct node
{
	int count;      // number of words ending at this node
	node* child;    // array of 26 children, or NULL
};

int main()
{
	// Value-initialize so every count starts at 0 and every child at NULL.
	node* fnode = new node[26]();
	std::string str;
	int numStr;
	std::cin >> numStr;
	for (int i = 0; i < numStr; i++)
	{
		std::cin >> str;
		// Assumes lowercase ASCII words ('a' == 97).
		node* n = fnode + (str[0] - 'a');
		for (std::string::size_type j = 1; j < str.length(); j++)
		{
			if (n->child == NULL)
			{
				n->child = new node[26]();
			}
			n = n->child + (str[j] - 'a');
		}
		n->count++;
		std::cout << n->count << std::endl;   // running count of this word so far
	}
}

- Rahul Balakavi August 28, 2012 | Flag Reply
0 of 0 votes

What if we are allowed to use a distributed setup, something like MapReduce? We use the hash-map solution on each machine and then finally merge each partial result.
I think this can be solved with a map-reduce approach.

- pritesh August 28, 2012 | Flag Reply
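
A rough single-machine sketch of the partitioning idea in C++, assuming we split by a hash of the word itself so that every occurrence of a given word lands in the same bucket, which lets each bucket be counted with the hash-map solution independently and the per-bucket results merged safely (the bucket count and file names are just for illustration):

#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

int main()
{
	// "Map" step: scatter the input words into B bucket files by hash(word) % B.
	// Every occurrence of a word goes to the same bucket, so each bucket can be
	// counted independently (on a different machine if you like) and the
	// per-bucket results merged without losing any word's total.
	const std::size_t B = 64;                  // bucket count chosen for illustration
	std::hash<std::string> h;

	std::vector<std::ofstream> buckets(B);
	for (std::size_t i = 0; i < B; ++i)
		buckets[i].open("bucket_" + std::to_string(i) + ".txt");

	std::ifstream in("words.txt");             // placeholder input file name
	std::string word;
	while (in >> word)
		buckets[h(word) % B] << word << '\n';

	// "Reduce" step (not shown): run the hash-map + min-heap counter on each
	// bucket file, then merge the per-bucket results by count.
}
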
0 of 0 votes

Create a min-heap of 10 nodes. The node structure would look like:

struct node {
	int frequency;
	char *word;
	node *left;
	node *right;
};

The min-heap is ordered by the frequency of the word. Also maintain a concurrent hash map backing it. With every insert into the hash map, check whether the frequency of the inserted element is greater than the minimum frequency currently on the min-heap; if so, delete the min-heap's minimum element and place it back into the hash map (it could re-enter the min-heap later), then insert the new element into the min-heap and heapify. The insertion into the min-heap has to be an atomic operation. I think this would have good throughput.

Alternatively, map-reduce could also be the way to go.

- saurabh August 28, 2012 | Flag Reply
0 of 0 votes

Since there is a terabyte of strings, a modified trie-based solution combined with a min-heap would be optimal. In the trie, each node has not only an "isEnd" boolean but also a count. When a string is inserted, its count is increased; if the count is greater than the top element of the min-heap and the heap size is already 10, remove the top element and add the new one.

When all the strings have been read, pop the elements and store them in an array; printing the array in reverse order gives the 10 most frequent elements.

- Dinesh April 01, 2014 | Flag Reply
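
A minimal sketch of the final extraction in C++, assuming a count-annotated trie with the same 26-way, lowercase-only node layout as the trie program posted earlier; it does the heap pass in a single walk over the finished trie rather than during insertion:

#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct node
{
	int count;      // number of words ending at this node
	node* child;    // array of 26 children, or NULL
};

typedef std::pair<long long, std::string> Entry;
typedef std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > MinHeap;

// Depth-first walk: offer every (count, word) to the heap, keep only the best 10.
void collect(node* level, std::string& prefix, MinHeap& top)
{
	if (level == NULL)
		return;
	for (int c = 0; c < 26; ++c)
	{
		prefix.push_back(static_cast<char>('a' + c));
		if (level[c].count > 0)
		{
			top.push(Entry(level[c].count, prefix));
			if (top.size() > 10)
				top.pop();              // evict the weakest candidate
		}
		collect(level[c].child, prefix, top);
		prefix.pop_back();
	}
}

int main()
{
	node* root = new node[26]();        // built and filled as in the trie program above
	// ... insertions into the trie would happen here ...

	MinHeap top;
	std::string prefix;
	collect(root, prefix, top);

	while (!top.empty())                // least frequent of the top 10 first
	{
		std::cout << top.top().second << ": " << top.top().first << "\n";
		top.pop();
	}
}
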
0 of 0 votes

The string size is one terabyte! You can't store it in memory with a hash table.

- axecapone August 27, 2012 | Flag
0 of 0 votes

You can if you buffer it line by line and keep a map of word to counter. I can't think of a solution if you have one terabyte of strings with all unique words.

- Dat August 27, 2012 | Flag
0 of 0 votes

What happens if all the words are unique? Then you cannot put everything into memory.

- hchan August 27, 2012 | Flag
0 of 0 votes

What if the string is like a list of all the words in a dictionary, all of them unique? Then the hash map takes much more than the original terabyte, given its overhead.

- axecapone August 27, 2012 | Flag
0 of 0 votes

One mitigation for having a terabyte's worth of unique words is to use a trie. The words (or keys to the hash table) will be compressed to the degree that they share common prefixes.

- Anonymous August 28, 2012 | Flag
0 of 0 votes

If the string contains all unique English words (worst case), that means about 150,000 words. If each word is on average 10 bytes, plus 4 bytes for keeping the number of occurrences, that is 14 bytes per word in total; the hash table will be 2GB, which is not unreasonable and can be handled by a desktop PC or a laptop.

- zasa1010 August 28, 2012 | Flag
0 of 0 votes

@Zasa1010

First of all, a data structure that hogs 2GB of memory IS unreasonable in any scaling application; 2GB for one data structure is ridiculous. Secondly, Oxford has about 300k words, so the requirement is 4GB. Thirdly, in real-life applications like Facebook, words are not limited to just English; there are other things like usernames, which can themselves go into the millions.

So in short, storing everything in memory won't work. Even if it's just the English dictionary, I am sure the interviewer will not be happy with such a program.

- axecapone August 28, 2012 | Flag
0 of 0 votes

Is it just me, or is all the math in this thread incorrect? 300,000 words, at an average of 10 bytes each plus 4 bytes to hold their count, is 300,000 * 14 = 4,200,000 bytes = 4.2 MB, not 2-4 GB as previously stated. Or am I missing something?

- exp August 28, 2012 | Flag
1 of 1 vote

exp is right; it's 4.2 MB, not gigabytes. This solution is fine if the interviewer says to assume English. I would be interested in how to solve the problem if it's not English... like UTF-32 strings, even.

- spock05 August 30, 2012 | Flag
0 of 0 votes

I think whenever an interviewer specifically mentions a "terabyte of strings", an in-memory solution is not what is expected; otherwise "terabyte" has no significance. It should be an out-of-memory solution, such as a disk-based or distributed approach like map-reduce. But the point is that these solutions are not practical to code in a real-time interview setting.

- Anonymous September 07, 2012 | Flag
1 of 1 vote

Yeah, but the point is to realize that the total collection of *unique* words is not too big for memory. So run through the file as a stream, keeping counts of words. Simple.

- spock05 September 07, 2012 | Flag
-1 of 1 vote

But if you do the final merge by adding up the counts of the same word from each part, you will get the answer.

- Man August 30, 2012 | Flag
1 of 1 vote

Nasser is right. Consider two parallel buckets. One comes up with a top ten whose last (10th) entry is Y with 91 occurrences, and whose 11th would have been X with 90. The other bucket's last entry is Z with 91 occurrences, its 11th would have been X with 90, and it has no Y at all. X actually appeared 180 times and Y only 91, but if you merge only the top 10 of each bucket, the answer will not have X in it.

Would tracking more than the top 10 in each bucket and merging more than 10 from each guarantee the correct top 10? I don't know.

- dc360 August 31, 2012 | Flag

