Groupon Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
Are you limiting your timestamps to hours, minutes, and seconds, or are you including milliseconds as well? If we keep a counter for each timestamp, we might end up with an enormous number of counters, since every second (or millisecond) produces a new timestamp. Please correct my assumption if I'm wrong.
One more thing that can be done is to keep an extra counter for each day. Then, for large queries, you just add up the counters for the whole days in the range, plus some extra per-second rows (at most 86400) for the partial days at the edges.
e.g.
between 01-01-2013 8:00:00 and 31-01-2013 8:00:00 you would scan
29 whole-day rows + (16 + 8) * 60 * 60 = 86400 per-second rows for the two partial days to get the actual number of hits
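The day-counter idea above can be sketched like this (a minimal sketch in Python; the counter tables and helper names are hypothetical, and epoch seconds stand in for the timestamps):

```python
from collections import defaultdict

# Hypothetical counter tables for the scheme above: a per-second
# counter plus a rolled-up per-day counter, keyed by epoch seconds/days.
second_hits = defaultdict(int)   # key: epoch second
day_hits = defaultdict(int)      # key: epoch day (second // 86400)

def record_hit(ts):
    second_hits[ts] += 1
    day_hits[ts // 86400] += 1

def hits_between(start, end):
    """Count hits in [start, end) using day counters for the whole
    days and per-second counters for the partial days at the edges."""
    first_full_day = -(-start // 86400)          # ceil division
    last_full_day = end // 86400
    if first_full_day >= last_full_day:          # range fits inside one day
        return sum(second_hits.get(t, 0) for t in range(start, end))
    total = sum(day_hits.get(d, 0) for d in range(first_full_day, last_full_day))
    total += sum(second_hits.get(t, 0) for t in range(start, first_full_day * 86400))
    total += sum(second_hits.get(t, 0) for t in range(last_full_day * 86400, end))
    return total
```

Only the two partial days are scanned second by second; every complete day in between costs a single counter lookup.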
We create the following data structure:
build a B+-tree-like structure keyed on the date,
where the first level is the year,
the second level is the month,
the third level is the day,
and the fourth level is the hour.
The hour nodes are then joined by a linked list, so that we can move easily from one hour to the next.
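A minimal sketch of that hierarchy, assuming hits arrive in chronological order (nested dicts stand in for the B+-tree levels, and the hour nodes are chained in a linked list for range scans; all names are mine):

```python
class HourNode:
    def __init__(self, key):
        self.key = key        # (year, month, day, hour)
        self.hits = 0
        self.next = None      # link to the chronologically next hour

class TimeTree:
    def __init__(self):
        self.root = {}        # year -> month -> day -> hour -> HourNode
        self.head = None      # first hour node in the linked list
        self.tail = None      # last hour node in the linked list

    def record(self, y, mo, d, h):
        # Descend year -> month -> day, creating levels on demand.
        day = self.root.setdefault(y, {}).setdefault(mo, {}).setdefault(d, {})
        if h not in day:
            node = HourNode((y, mo, d, h))
            day[h] = node
            if self.tail:                 # append to the hour linked list
                self.tail.next = node
            else:
                self.head = node
            self.tail = node
        day[h].hits += 1

    def hits_between(self, start, end):
        # Walk the hour list and sum hits whose key falls in [start, end].
        # (A real B+-tree would descend the index to find the start node
        # instead of walking from the head.)
        total, cur = 0, self.head
        while cur and cur.key <= end:
            if cur.key >= start:
                total += cur.hits
            cur = cur.next
        return total
```

Tuples like `(2013, 1, 31, 8)` compare lexicographically, which is exactly chronological order here.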
After that we create categories for the URLs (it could be by domain).
Here we actually build a trie-like structure to store the URLs in minimal space and to speed up search.
Now the first part is used to find from where to where we have to search; once we find the interval in the tree, we walk through it and print the output.
I would have one log directory for each URL, and one log file for each day. Each line in the log file holds a time (in seconds) and the accumulated number of accesses since the start of the log; all accesses that share the same second are merged into the same line.
To get the number of accesses for a specified URL, you just find the directory for that URL and all the log files within the time frame. For each log file, you get the number of accesses by subtracting the cumulative count at the earliest in-range line from the count at the latest in-range line, then sum these numbers across the files to form the final result. Some points on the design:
1> The work is naturally distributed per URL and per time frame.
2> The accumulated count within each day means you avoid going through the whole file.
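The per-day cumulative log can be sketched as follows (the file layout and helper names are my own assumptions: one whitespace-separated `epoch_second cumulative_count` pair per line, lines in time order):

```python
import os

def hits_in_file(path, start, end):
    """Hits within [start, end] for one day's log: cumulative count
    at the latest in-range line minus the count just before the
    earliest in-range line."""
    first_before, last_in = 0, None
    with open(path) as f:
        for line in f:
            ts, cum = map(int, line.split())
            if ts < start:
                first_before = cum    # last cumulative value before the range
            elif ts <= end:
                last_in = cum         # latest cumulative value inside the range
            else:
                break                 # lines are time-ordered; nothing more to read
    return 0 if last_in is None else last_in - first_before

def hits_for_url(log_dir, url_key, day_files, start, end):
    # day_files: the day-log filenames overlapping [start, end]
    return sum(hits_in_file(os.path.join(log_dir, url_key, name), start, end)
               for name in day_files)
```

Because the counts are cumulative, each file contributes a single subtraction rather than a full per-line tally.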
The problem boils down to designing a data structure where:
1) A timestamp can map to multiple URLs, i.e. multiple URLs can be opened at a particular timestamp.
2) A single URL maps to multiple timestamps, i.e. a single URL can be opened at multiple timestamps.
This is what I have deduced. Can someone please let me know what can be used in such a situation (a many-to-many relationship)?
Trie + binary search. Store the URLs in a trie, and at each terminal node store the logged times in a list. They have to be chronological (otherwise the concept of "time between" is inconsistent, I guess), so the list is sorted.
Given a URL and a query, just walk to the URL's list and use binary search to retrieve the required count.
This is better than partitioning by time when the number of logs per URL is sparse and the number of URLs is high.
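A minimal sketch of the Trie + binary search idea (the class and method names are mine; `bisect` supplies the binary search over each URL's sorted timestamp list, and timestamps are assumed to arrive in order):

```python
from bisect import bisect_left, bisect_right

class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.times = None     # sorted hit timestamps; set only on terminal nodes

class UrlLog:
    def __init__(self):
        self.root = TrieNode()

    def record(self, url, ts):
        node = self.root
        for ch in url:
            node = node.children.setdefault(ch, TrieNode())
        if node.times is None:
            node.times = []
        node.times.append(ts)    # assumes log lines arrive chronologically

    def hits_between(self, url, start, end):
        # Walk the trie to the URL's terminal node.
        node = self.root
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                return 0         # URL never logged
        if not node.times:
            return 0             # prefix of some URL, but not itself logged
        # Two binary searches give the count in [start, end].
        return bisect_right(node.times, end) - bisect_left(node.times, start)
```

Each query costs the URL's length for the trie walk plus O(log n) for the two binary searches.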
You can store an atomic counter for each timestamp. If a URL is hit and an entry already exists for that timestamp, just increase the counter. When you need to run a query, add up all the counters between the two timestamps. The query is straightforward, since there is only one entry per timestamp.
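A small sketch of this counter idea, with a plain `Counter` standing in for the atomic per-timestamp counters:

```python
from collections import Counter

hits = Counter()   # timestamp (epoch second) -> hit count

def record(ts):
    hits[ts] += 1  # "increase the counter" when the timestamp already exists

def query(start, end):
    # Add up all the counters between the two timestamps (inclusive).
    return sum(c for ts, c in hits.items() if start <= ts <= end)
```

In a real store you would range-scan the keys rather than filter the whole table, but the bookkeeping is the same: one entry per timestamp, incremented atomically.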
- msingh March 17, 2013