Amazon Interview Question
Software Engineer / DevelopersCountry: India
What about reading the logs, checking the date, putting the search queries to a hash table, with counters. Every time query comes(or a similar query ) increase counter. Also keep a min heap and if counter hits min heap's min, reorganize heap.
As data will be huge, this operation can be divided per letter and to be able to get the most searched queries, we would need to merge heaps from different letters
Would appreciate any feedback
In each datacenter have a set of "count" servers to get the current query and store it. These set of servers communicate with each other and also they communicate with count servers in other data centers. They can use distributed algorithm with Lamport logic to order the search queries and store it. They can use a sliding window algorithm to then get the top 20 in the last 24 hours.
since there could be many servers, and each server has its separate log file.
what we can do , rather than doing processing log file one by one. Maintain a min heap of top requests and a hashmap to index each query and to update min heap of individual server.
combine result of all the servers, i.e compare min heaps of all servers, and consolidate result in to one large heap to produce response.
Use Map reduce
- Anonymous January 14, 2013