A MapReduce job would be a perfect fit here, if that is acceptable. Here are the steps to design the solution:
1. Create a Hadoop cluster with at least 2000 nodes. (This number and the configuration of each node are arbitrary and based only on the problem statement's Yahoo/Google kind of data flow; many other factors would help decide the right size.)
2. Assuming there are many web servers writing log files to many file servers, write a MapReduce job (map-only) to load the log files from all of these servers into HDFS.
3. Once all the files are loaded into HDFS, run another MapReduce job. In this job, have a preprocessing mapper emit each log record in the following format:
[TimeStamp] [Query]
4. Write a combiner along with the mapper mentioned above that outputs only the top N values each mapper generates (in our case, each combiner outputs only the top 20 values).
5. Write a custom partitioner such that only one partition is produced, and have a single reducer output the first N tuples. (A combined sketch of steps 3-5 follows this list.)
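A rough Hadoop (Java) sketch of steps 3 through 5 is below. It is only an illustration under assumptions: log records are tab-separated as [TimeStamp] [Query], N = 20, "top" means most frequent, and step 2's load into HDFS has already happened (e.g. via hdfs dfs -put). One liberty taken: a combiner that keeps only each mapper's local top 20 can miss globally frequent queries, so this sketch uses a plain local-sum combiner instead and defers the top-N cut to the single reducer.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopQueries {
    static final int N = 20; // top-N size from the problem statement

    // Step 3: parse assumed "[TimeStamp]\t[Query]" records and emit (query, 1).
    public static class QueryMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text query = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2); // assumed layout
            if (parts.length == 2) {
                query.set(parts[1]);
                ctx.write(query, ONE);
            }
        }
    }

    // Step 4 (modified): local-sum combiner, so each mapper ships one count per query.
    public static class SumCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text query, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(query, new LongWritable(sum));
        }
    }

    // Step 5: custom partitioner that sends every key to partition 0.
    public static class SinglePartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            return 0;
        }
    }

    // Step 5: the lone reducer keeps a running top N and emits it at the end.
    public static class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private final TreeMap<Long, String> top = new TreeMap<Long, String>();
        @Override
        protected void reduce(Text query, Iterable<LongWritable> counts, Context ctx) {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            top.put(sum, query.toString());                  // ties overwrite in this sketch
            if (top.size() > N) top.remove(top.firstKey());  // evict the smallest count
        }
        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            for (Map.Entry<Long, String> e : top.descendingMap().entrySet())
                ctx.write(new Text(e.getValue()), new LongWritable(e.getKey()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "top queries");
        job.setJarByClass(TopQueries.class);
        job.setMapperClass(QueryMapper.class);
        job.setCombinerClass(SumCombiner.class);
        job.setPartitionerClass(SinglePartitioner.class);
        job.setReducerClass(TopNReducer.class);
        job.setNumReduceTasks(1); // one reducer, matching the single partition
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a single reduce task, the custom partitioner is technically redundant; it is kept here only to mirror step 5 of the outline.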
Note: again, this is just an outline of the solution; more tuning and tweaking can be done. If you have any suggestions or corrections to the above solution, please post them.
Also note that using Hive could make the solution considerably simpler and faster to write.
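For instance, assuming the logs are exposed as a Hive table (the table and column names below are made up for illustration), the whole pipeline above collapses into one query:

-- Hypothetical external table query_logs(ts STRING, query STRING) over the HDFS logs
SELECT query, COUNT(*) AS cnt
FROM query_logs
GROUP BY query
ORDER BY cnt DESC
LIMIT 20;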