Amazon Interview Question
Country: United States
1. Assuming multiple web servers, collect all the queries from the last 24 hrs. by parsing each server's log file on a centralized processing machine.
2. Now, store all the keywords in a hash table with <key=word, value=freq>.
3. Iterate through the hash table entries, first initializing a min-heap of size 20 with the first 20 freq values from the hash table.
4. Now, process the freq field of each remaining hash table entry; if it is lower than the heap's min, discard it; otherwise extract the min, insert the new freq, and heapify.
5. After processing all the hash table entries, we are left with the top 20 frequent query words.
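Steps 2-5 can be sketched in Python as below (the log parsing from step 1 is omitted; the input is assumed to be an already-extracted list of query strings):

```python
import heapq
from collections import Counter

def top_k_queries(queries, k=20):
    """Count query frequencies, then keep a size-k min-heap of (freq, query)."""
    freq = Counter(queries)          # step 2: hash table <query, freq>
    heap = []                        # steps 3-4: min-heap of size k
    for query, count in freq.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, query))
        elif count > heap[0][0]:     # beats the current minimum: replace it
            heapq.heapreplace(heap, (count, query))
    # step 5: the heap now holds the k most frequent queries
    return sorted(heap, reverse=True)

log = ["foo", "bar", "foo", "baz", "foo", "bar"]
print(top_k_queries(log, k=2))  # [(3, 'foo'), (2, 'bar')]
```

This runs in O(n log k) over the n distinct queries, rather than sorting all of them.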
6. A distributed approach can be designed by running the above logic at each web server machine to generate its local top-20 frequent words for the last 24 hrs.
7. Collect these top-20 frequent query word lists from each machine on a single machine and merge the results to get a combined top-20 list.
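The merge in steps 6-7 might look like the sketch below, summing the frequencies of queries that appear in more than one server's list before re-selecting the global top k:

```python
import heapq

def merge_top_k(per_server_lists, k=20):
    """Merge each server's (query, freq) top-k list; sum the freqs of
    queries reported by multiple servers, then take the global top k."""
    combined = {}
    for top_list in per_server_lists:
        for query, count in top_list:
            combined[query] = combined.get(query, 0) + count
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])

server1 = [("a", 10), ("foo", 4)]
server2 = [("b", 9), ("foo", 5)]
print(merge_top_k([server1, server2], k=2))  # [('a', 10), ('foo', 9)]
```

Note this only helps when a query makes every server's local top 20; a query that narrowly misses some server's list is still undercounted.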
8. This could also be solved using MapReduce, but I am not very familiar with it!
This is about how I answered it. Note that you can get slightly inaccurate results from this with skewed distributions (e.g. server 1's top 21 results are a1..a20 and foo, and server 2's top 21 results are b1..b20 and foo; if a1..a20 and b1..b20 have no overlap, foo probably ought to be in the final top 20 but never gets returned to the centralized server).
I have been asked this question many times at big data companies. It's worth being able to refine your answer and be able to think about what you'll do if, say, the centralized server goes down.
-1 for your algo, because you are assuming the search string is a single word, which is not the case.
The question is about query strings, not query words.
Also, the algorithm sounds fishy for queries like "Britney Spears": what if "Britney" and "Spears" individually rank way below the top 100, but the combined string is at the top? The way you are framing it, common words like "the" will definitely have a frequency way above everything else.
MapReduce may be one option. Just set the query as the key and the frequency as the value. Since MapReduce is built for distributed data and it can sort, it will be an efficient way to get the result. But if you need to compute the result repeatedly, hash table + heap is better, since you can reuse previous results.
But for sites like Google, won't an in-memory hash table easily cause out-of-memory exceptions? There must be millions of queries within 24 hours and storing this in memory and then using a heap is not at all scalable. Instead as a distributed solution, the frequency and date can be stored in a database table where the primary key is <query,date>. As the reduce step, the top 20 frequency from each distributed database can be compared.
No, this doesn't fix the memory problem; it's not a good algo, because a search site logs millions of records over 24 hours. Also, the time complexity of lookups drastically increases. Not at all the solution I would like to have.
You can cache the computed values from the MapReduce job just fine. Add Redis or Memcached to your architecture to distribute query load for previous runs, and recompute the cached values on an interval (the problem requires 24 hrs, so daily would be a starting point, or perhaps hourly for a rolling 24-hr window).
My algo may be a bit complicated, but I guess the implementation has its advantages.
1> I will use a trie to store the words (not letters), because search strings will contain words like "the", "what", etc. very frequently.
2> Now, for every query — i.e. at its end node in the trie — I will keep the frequency count.
This keeps the time complexity close to minimal, because we are searching a word graph (DAWG), and with the trie's space being minimal, the optimisation will be the best one...
Any more or better algos are highly appreciated.
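A minimal sketch of this word-level trie idea, assuming queries are whitespace-tokenized (the `QueryTrie` name and API are illustrative, not from the thread):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next word -> TrieNode
        self.freq = 0        # count of queries ending at this node

class QueryTrie:
    """Word-level trie: each edge is a whole word, so shared prefixes
    like "how to ..." are stored once; the end node holds the frequency."""
    def __init__(self):
        self.root = TrieNode()

    def add(self, query):
        node = self.root
        for word in query.split():
            node = node.children.setdefault(word, TrieNode())
        node.freq += 1           # frequency count lives at the end node

    def counts(self):
        """Yield (query, freq) for every stored query via DFS."""
        stack = [(self.root, [])]
        while stack:
            node, words = stack.pop()
            if node.freq:
                yield " ".join(words), node.freq
            for word, child in node.children.items():
                stack.append((child, words + [word]))

t = QueryTrie()
for q in ["britney spears", "britney spears", "the office"]:
    t.add(q)
print(sorted(t.counts()))  # [('britney spears', 2), ('the office', 1)]
```

The top-20 selection would then run the same min-heap pass over `counts()` instead of over a flat hash table.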
A MapReduce job would be a perfect fit, if that is acceptable. Here are the steps to design the solution:
- Sada, December 28, 2012
1. Create a Hadoop cluster with at least 2000 nodes [this number and the configuration of each node are arbitrary in this solution, based only on the problem statement (Yahoo/Google kind of data flow); there are a lot of other factors that will help decide the right number]
2. Assuming there are a lot of web servers writing log files to a lot of file servers, write a MapReduce job (map-only) to load the log files from all these servers into HDFS.
3. Once all the files are loaded into HDFS, fire another MapReduce job. In this job, have a preprocessor map task that will emit the log records in the following format:
[TimeStamp] [Query]
4. Write a combiner function alongside the mapper mentioned above which will output only the top N values that each mapper generates (in our case each combiner will output only its top 20 values)
5. Write a custom partitioner in such a way that only one partition is produced. Have a single reducer that will output the top N tuples.
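The mapper/combiner/reducer logic of steps 3-5 can be simulated in plain Python as below (this is a data-flow sketch, not actual Hadoop code; the tab-separated `[TimeStamp] [Query]` line format and N=20 are assumptions from the steps above):

```python
from collections import Counter

N = 20

def mapper(log_lines):
    """Step 3: emit (query, 1) for each '[TimeStamp]\t[Query]' log line."""
    for line in log_lines:
        ts, query = line.split("\t", 1)
        yield query, 1

def combiner(pairs, n=N):
    """Step 4: pre-aggregate on each map node; keep only its local top n."""
    local = Counter()
    for query, count in pairs:
        local[query] += count
    return local.most_common(n)

def reducer(all_combiner_outputs, n=N):
    """Step 5: single partition / single reducer; merge and take the top n."""
    total = Counter()
    for partial in all_combiner_outputs:
        for query, count in partial:
            total[query] += count
    return total.most_common(n)

node1 = ["1\tfoo", "2\tbar", "3\tfoo"]
node2 = ["4\tfoo", "5\tbaz", "6\tbaz"]
print(reducer([combiner(mapper(node1), 2),
               combiner(mapper(node2), 2)], 2))  # [('foo', 3), ('baz', 2)]
```

The combiner is what keeps shuffle traffic small: each map node ships only 20 tuples instead of its whole query histogram (with the same caveat about near-miss queries noted earlier in the thread).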
Note: Again, this is only an outline of the solution, and more tuning and tweaking can be done. If you have any suggestions or corrections to the aforementioned solution, please post them.
Also note that using Hive can make the solution considerably easier and faster.