Amazon Interview Question
Country: United States
1. Assuming multiple web servers, collect all the queries from the last 24 hrs. by parsing each server's log file on a centralized processing machine.
2. Now, store all the keywords in a hash table with <key=word, value=freq>.
3. Iterate through the hash table entries, first initializing a min-heap of size 20 with the first 20 freq values from the hash table.
4. Now, process the freq field of each remaining hash table entry; if it is lower than the heap's min, discard it; otherwise extract the min, insert the new freq, and heapify.
5. After processing all the hash table entries, we are left with the top 20 frequent query words.
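Steps 2-5 can be sketched in Python as below (the log parsing from step 1 is omitted; the input is assumed to be an already-extracted list of query strings):

```python
import heapq
from collections import Counter

def top_k_queries(queries, k=20):
    """Count query frequencies, then keep a size-k min-heap of (freq, query)."""
    freq = Counter(queries)          # step 2: hash table <query, freq>
    heap = []                        # steps 3-4: min-heap of size k
    for query, count in freq.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, query))
        elif count > heap[0][0]:     # beats the current minimum: replace it
            heapq.heapreplace(heap, (count, query))
    # step 5: the heap now holds the k most frequent queries
    return sorted(heap, reverse=True)

log = ["foo", "bar", "foo", "baz", "foo", "bar"]
print(top_k_queries(log, k=2))  # [(3, 'foo'), (2, 'bar')]
```

This runs in O(n log k) over the n distinct queries, rather than sorting all of them.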
6. A distributed approach can be designed by running the above logic at each web server machine to generate its local top-20 frequent words for the last 24 hrs.
7. Collect these top-20 frequent query word lists from each machine on a single machine and merge the results to get a combined top-20 list.
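The merge in steps 6-7 might look like the sketch below, summing the frequencies of queries that appear in more than one server's list before re-selecting the global top k:

```python
import heapq

def merge_top_k(per_server_lists, k=20):
    """Merge each server's (query, freq) top-k list; sum the freqs of
    queries reported by multiple servers, then take the global top k."""
    combined = {}
    for top_list in per_server_lists:
        for query, count in top_list:
            combined[query] = combined.get(query, 0) + count
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])

server1 = [("a", 10), ("foo", 4)]
server2 = [("b", 9), ("foo", 5)]
print(merge_top_k([server1, server2], k=2))  # [('a', 10), ('foo', 9)]
```

Note this only helps when a query makes every server's local top 20; a query that narrowly misses some server's list is still undercounted.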
8. This could also be solved using MapReduce, but I am not very familiar with it!
This is about how I answered it. Note that you can get slightly inaccurate results from this with skewed distributions (e.g. server 1's top 21 results are a1..a20 and foo, and server 2's top 21 results are b1..b20 and foo; if a1..a20 and b1..b20 have no overlap, foo probably ought to be in the final top 20 but never gets returned to the centralized server).
I have been asked this question many times at big data companies. It's worth being able to refine your answer and be able to think about what you'll do if, say, the centralized server goes down.
-1 for your algo, because you are assuming the search string is a single word, which is not the case.
The question is about query strings, not query words.
Also, the algorithm sounds fishy for queries like "Britney Spears": what if "Britney" and "Spears" individually rank way below the top 100, but the combined string is at the top? The way you are framing it, common words like "the" will definitely have a frequency way above everything else.
MapReduce may be one option. Just set the query as the key and the frequency as the value. Since MapReduce is built for distributed data and it can sort, it will be an efficient way to get the result. But if you need to compute the result repeatedly, hash table + heap is better, since you can reuse previous results.
But for sites like Google, won't an in-memory hash table easily cause out-of-memory exceptions? There must be millions of queries within 24 hours and storing this in memory and then using a heap is not at all scalable. Instead as a distributed solution, the frequency and date can be stored in a database table where the primary key is <query,date>. As the reduce step, the top 20 frequency from each distributed database can be compared.
No, this doesn't fix the memory problem; it's not a good algo, because a search site logs millions of records over 24 hours. Also, the time complexity of lookups drastically increases. Not at all the solution I would like to have.
You can cache the computed values from the MapReduce job just fine. Add Redis or Memcached to your architecture to distribute query load for previous runs, and recompute the cached values on an interval (the problem requires 24 hrs, so daily would be a starting point, or perhaps hourly for a rolling 24-hr window).
My algo may be a bit complicated, but I guess the implementation has its advantages.
1> I will use a trie to store the words (not letters), because search strings will contain words like "the", "what", etc. very frequently.
2> Now, for every query — i.e. at its end node in the trie — I will keep the frequency count.
This keeps the time complexity close to minimal, because we are searching a word graph (DAWG), and with the trie's space being minimal, the optimisation will be the best one...
Any more or better algos are highly appreciated.
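A minimal sketch of this word-level trie idea, assuming queries are whitespace-tokenized (the `QueryTrie` name and API are illustrative, not from the thread):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next word -> TrieNode
        self.freq = 0        # count of queries ending at this node

class QueryTrie:
    """Word-level trie: each edge is a whole word, so shared prefixes
    like "how to ..." are stored once; the end node holds the frequency."""
    def __init__(self):
        self.root = TrieNode()

    def add(self, query):
        node = self.root
        for word in query.split():
            node = node.children.setdefault(word, TrieNode())
        node.freq += 1           # frequency count lives at the end node

    def counts(self):
        """Yield (query, freq) for every stored query via DFS."""
        stack = [(self.root, [])]
        while stack:
            node, words = stack.pop()
            if node.freq:
                yield " ".join(words), node.freq
            for word, child in node.children.items():
                stack.append((child, words + [word]))

t = QueryTrie()
for q in ["britney spears", "britney spears", "the office"]:
    t.add(q)
print(sorted(t.counts()))  # [('britney spears', 2), ('the office', 1)]
```

The top-20 selection would then run the same min-heap pass over `counts()` instead of over a flat hash table.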
A MapReduce job would be a perfect fit, if that is acceptable. Here are the steps to design the solution:
- Sada, December 28, 2012
1. Create a Hadoop cluster with at least 2000 nodes [this number and the configuration of each node are arbitrary in this solution, based only on the problem statement (Yahoo/Google kind of data flow); there are a lot of other factors that will help decide the right number]
2. Assuming there are a lot of web servers writing log files to a lot of file servers, write a MapReduce job (map-only) to load the log files from all these servers into HDFS.
3. Once all the files are loaded into HDFS, fire another MapReduce job. In this job, have a preprocessor map task that will emit the log records in the following format:
[TimeStamp] [Query]
4. Write a combiner function alongside the mapper mentioned above which will output only the top N values that each mapper generates (in our case each combiner will output only its top 20 values)
5. Write a custom partitioner in such a way that only one partition is produced. Have a single reducer that will output the top N tuples.
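The mapper/combiner/reducer logic of steps 3-5 can be simulated in plain Python as below (this is a data-flow sketch, not actual Hadoop code; the tab-separated `[TimeStamp] [Query]` line format and N=20 are assumptions from the steps above):

```python
from collections import Counter

N = 20

def mapper(log_lines):
    """Step 3: emit (query, 1) for each '[TimeStamp]\t[Query]' log line."""
    for line in log_lines:
        ts, query = line.split("\t", 1)
        yield query, 1

def combiner(pairs, n=N):
    """Step 4: pre-aggregate on each map node; keep only its local top n."""
    local = Counter()
    for query, count in pairs:
        local[query] += count
    return local.most_common(n)

def reducer(all_combiner_outputs, n=N):
    """Step 5: single partition / single reducer; merge and take the top n."""
    total = Counter()
    for partial in all_combiner_outputs:
        for query, count in partial:
            total[query] += count
    return total.most_common(n)

node1 = ["1\tfoo", "2\tbar", "3\tfoo"]
node2 = ["4\tfoo", "5\tbaz", "6\tbaz"]
print(reducer([combiner(mapper(node1), 2),
               combiner(mapper(node2), 2)], 2))  # [('foo', 3), ('baz', 2)]
```

The combiner is what keeps shuffle traffic small: each map node ships only 20 tuples instead of its whole query histogram (with the same caveat about near-miss queries noted earlier in the thread).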
Note: Again, this is only an outline of the solution, and more tuning and tweaking can be done. If you have any suggestions or corrections to the aforementioned solution, please post them.
Also note that using Hive can make the solution considerably easier and faster.