A MapReduce job would be a perfect fit here, if that is acceptable. Here are the steps to design the solution:
1. Create a Hadoop cluster with at least 2000 nodes. (This number and the configuration of each node are arbitrary and based only on the problem statement's Yahoo/Google kind of data flow; many other factors would help decide the right size.)
2. Assuming there are many web servers writing log files to many file servers, write a MapReduce job (map-only) to load the log files from all of these servers into HDFS.
3. Once all the files are loaded into HDFS, run another MapReduce job. In this job, have a preprocessing mapper emit each log record in the following format:
[TimeStamp] [Query]
4. Write a combiner along with the mapper mentioned above that outputs only the top N values each mapper generates (in our case, each combiner outputs only the top 20 values).
5. Write a custom partitioner such that only one partition is produced, and have a single reducer output the first N tuples. (A combined sketch of steps 3-5 follows this list.)
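A rough Hadoop (Java) sketch of steps 3 through 5 is below. It is only an illustration under assumptions: log records are tab-separated as [TimeStamp] [Query], N = 20, "top" means most frequent, and step 2's load into HDFS has already happened (e.g. via hdfs dfs -put). One liberty taken: a combiner that keeps only each mapper's local top 20 can miss globally frequent queries, so this sketch uses a plain local-sum combiner instead and defers the top-N cut to the single reducer.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopQueries {
    static final int N = 20; // top-N size from the problem statement

    // Step 3: parse assumed "[TimeStamp]\t[Query]" records and emit (query, 1).
    public static class QueryMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text query = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2); // assumed layout
            if (parts.length == 2) {
                query.set(parts[1]);
                ctx.write(query, ONE);
            }
        }
    }

    // Step 4 (modified): local-sum combiner, so each mapper ships one count per query.
    public static class SumCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text query, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(query, new LongWritable(sum));
        }
    }

    // Step 5: custom partitioner that sends every key to partition 0.
    public static class SinglePartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            return 0;
        }
    }

    // Step 5: the lone reducer keeps a running top N and emits it at the end.
    public static class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private final TreeMap<Long, String> top = new TreeMap<Long, String>();
        @Override
        protected void reduce(Text query, Iterable<LongWritable> counts, Context ctx) {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            top.put(sum, query.toString());                  // ties overwrite in this sketch
            if (top.size() > N) top.remove(top.firstKey());  // evict the smallest count
        }
        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            for (Map.Entry<Long, String> e : top.descendingMap().entrySet())
                ctx.write(new Text(e.getValue()), new LongWritable(e.getKey()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "top queries");
        job.setJarByClass(TopQueries.class);
        job.setMapperClass(QueryMapper.class);
        job.setCombinerClass(SumCombiner.class);
        job.setPartitionerClass(SinglePartitioner.class);
        job.setReducerClass(TopNReducer.class);
        job.setNumReduceTasks(1); // one reducer, matching the single partition
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a single reduce task, the custom partitioner is technically redundant; it is kept here only to mirror step 5 of the outline.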
Note: again, this is just an outline of the solution; more tuning and tweaking can be done. If you have any suggestions or corrections to the above solution, please post them.
Also note that using Hive could make the solution considerably simpler and faster to write.
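For instance, assuming the logs are exposed as a Hive table (the table and column names below are made up for illustration), the whole pipeline above collapses into one query:

-- Hypothetical external table query_logs(ts STRING, query STRING) over the HDFS logs
SELECT query, COUNT(*) AS cnt
FROM query_logs
GROUP BY query
ORDER BY cnt DESC
LIMIT 20;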