## Expedia Interview Question for Software Engineer / Developers

Country: United States

Construct a trie with an extra count field in each node during a first pass over the book. Once done, output the 10 most frequently occurring words. Maybe there are more optimal algorithms, but this is what struck me immediately. Assuming there are N words in the book, of which k are distinct: building the trie takes O(N) time (times the average word length), and finding the 10 most frequent words takes O(k), since each distinct word is visited once.
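This answer can be sketched in Python as follows; the node layout and the DFS over distinct words are illustrative choices, not part of the original answer:

```python
from heapq import nlargest

class TrieNode:
    """Trie node with the extra count field the answer describes."""
    def __init__(self):
        self.children = {}
        self.count = 0  # occurrences of the word ending at this node

def build_trie(words):
    root = TrieNode()
    for word in words:          # one pass over all N words
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    return root

def top_k(root, k=10):
    """Visit each of the k distinct words once, then keep the k largest."""
    found = []
    def dfs(node, prefix):
        if node.count:
            found.append((node.count, prefix))
        for ch, child in node.children.items():
            dfs(child, prefix + ch)
    dfs(root, "")
    return nlargest(k, found)

text = "the quick fox and the lazy dog and the fox".split()
print(top_k(build_trie(text), 3))  # → [(3, 'the'), (2, 'fox'), (2, 'and')]
```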

---
Something better than a trie, such as a ternary search tree, can be used. Also, since we only want the top 10, a list of the top 10 words can be maintained in real time, eliminating the extra pass to search for them.

---
It will be impossible to do this in real time, because the sort key (frequency) is not known ahead of time.

---
Since we are counting words from a predefined language (e.g. English), we can use a predefined dictionary of the language's words and their digests (hash values). We then place each word from the book into the bucket associated with its digest in constant time (computing the digest and placing the word into the appropriate bucket are both O(1)). Once we are done parsing the book, we end up with the word counts, and these can be sorted in O(lg n) time (base 2).
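The bucketing idea amounts to hashing words into a table of counts; a minimal sketch (a Python dict stands in for the digest/bucket structure, and the function names are my own):

```python
def count_words(words):
    """Hash each word into a bucket in amortized O(1) per word."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

book = "to be or not to be".split()
counts = count_words(book)
# Note: sorting the k distinct entries afterwards costs O(k log k)
# with a comparison sort, not O(lg n).
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)  # → [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```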

---
Which sorting algorithm gives O(lg n)? The best I know of is O(n).

---
A TreeMap data structure can be used: sort the words by count and store them, keeping only the top 10 per the condition specified.
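Python's standard library has no TreeMap, so as a stand-in the "store only the top 10" idea can be sketched with a size-bounded min-heap over (count, word) pairs; this is a substituted technique, not the commenter's exact structure:

```python
import heapq

def keep_top10(counts, k=10):
    """Keep only the k highest-count (count, word) pairs."""
    top = []
    for word, count in counts.items():
        heapq.heappush(top, (count, word))
        if len(top) > k:
            heapq.heappop(top)           # evict the current smallest
    return sorted(top, reverse=True)     # largest count first

counts = {"the": 12, "a": 9, "fox": 3, "dog": 2, "of": 7}
print(keep_top10(counts, 3))  # → [(12, 'the'), (9, 'a'), (7, 'of')]
```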

---
Step 1: If the file (the book) is too large to sort in memory, split it into chunks that can each be sorted in memory.

Step 2: For each sorted chunk, compute sorted pairs of (word, nr_occurrences); at this point you can discard the chunks, because you only need the sorted pairs.

Step 3: Merge the pairs from all chunks, summing counts, and always keep the current top ten.

Example:

Step 1:

a b a ab abb a a b b c c ab ab

split into :

chunk 1: a b a ab
chunk 2: abb a a b b
chunk 3: c c ab ab

Step 2:

chunk 1: a2, b1, ab1
chunk 2: a2, b2, abb1
chunk 3: c2, ab2

Step 3 (merge the chunks and keep the top ten appearances):

a4 b3 ab3 c2 abb1
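The three steps can be sketched as follows; the chunk boundaries here are uniform, unlike the worked example above, but the merged totals agree (function names are my own):

```python
from collections import Counter

def chunk_counts(words, chunk_size):
    """Steps 1+2: split into memory-sized chunks, count each independently."""
    for i in range(0, len(words), chunk_size):
        yield Counter(words[i:i + chunk_size])

def merge_top(counters, k=10):
    """Step 3: merge the per-chunk counts and keep the top k."""
    total = Counter()
    for c in counters:
        total.update(c)
    return sorted(total.items(), key=lambda kv: kv[1], reverse=True)[:k]

words = "a b a ab abb a a b b c c ab ab".split()
print(merge_top(chunk_counts(words, 5)))
# → [('a', 4), ('b', 3), ('ab', 3), ('c', 2), ('abb', 1)]
# i.e. the example's "a4 b3 ab3 c2 abb1"
```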

---
Add the words to a hashmap with the count as the value. Then insert the entries into a max-heap, so that you can directly pop the 10 maximum elements. Running time = insertion of n words into the hashtable + insertion of the entries into the max-heap + selecting the top 10 from the max-heap = O(n) + O(n log n) + O(10 log n) = O(n log n).
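A sketch of this answer using Python's `Counter` and `heapq` (negated counts simulate a max-heap, since `heapq` is a min-heap; the function name is my own):

```python
from collections import Counter
import heapq

def top10(words, k=10):
    """Count with a hashmap, then pop the k largest from a max-heap."""
    counts = Counter(words)                      # O(n) hashtable pass
    heap = [(-c, w) for w, c in counts.items()]  # negate counts for a max-heap
    heapq.heapify(heap)                          # O(k) for k distinct words
    return [(w, -c) for c, w in
            (heapq.heappop(heap) for _ in range(min(k, len(heap))))]

print(top10("b a b c b a".split()))  # → [('b', 3), ('a', 2), ('c', 1)]
```

Building the heap with `heapify` over the k distinct entries is O(k), slightly better than the O(n log n) of n individual insertions, though the overall bound is unchanged by the counting pass.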

