Groupon Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
Are you limiting your timestamps to hours, minutes, and seconds, or are you including milliseconds as well? If we keep a counter for each timestamp, we might end up with an enormous number of counters, since every second (or millisecond) produces a new timestamp. Please correct my assumption if I'm wrong.
One more thing that can be done is to keep an extra counter for each day. Then, for large queries, you just add up the counters for the whole days in the range, plus some extra per-second rows (at most 86400) for the partial days at the edges.
e.g.
between 01-01-2013 8:00:00 and 31-01-2013 8:00:00 you would scan
29 whole-day rows + (16 + 8) * 60 * 60 = 86400 per-second rows for the two partial days to get the actual number of hits
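The day-counter idea above can be sketched like this (a minimal sketch in Python; the counter tables and helper names are hypothetical, and epoch seconds stand in for the timestamps):

```python
from collections import defaultdict

# Hypothetical counter tables for the scheme above: a per-second
# counter plus a rolled-up per-day counter, keyed by epoch seconds/days.
second_hits = defaultdict(int)   # key: epoch second
day_hits = defaultdict(int)      # key: epoch day (second // 86400)

def record_hit(ts):
    second_hits[ts] += 1
    day_hits[ts // 86400] += 1

def hits_between(start, end):
    """Count hits in [start, end) using day counters for the whole
    days and per-second counters for the partial days at the edges."""
    first_full_day = -(-start // 86400)          # ceil division
    last_full_day = end // 86400
    if first_full_day >= last_full_day:          # range fits inside one day
        return sum(second_hits.get(t, 0) for t in range(start, end))
    total = sum(day_hits.get(d, 0) for d in range(first_full_day, last_full_day))
    total += sum(second_hits.get(t, 0) for t in range(start, first_full_day * 86400))
    total += sum(second_hits.get(t, 0) for t in range(last_full_day * 86400, end))
    return total
```

Only the two partial days are scanned second by second; every complete day in between costs a single counter lookup.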
We create the following data structure:
build a B+-tree-like structure keyed on the date,
where the first level is the year,
the second level is the month,
the third level is the day,
and the fourth level is the hour.
The hour nodes are then joined by a linked list, so that we can move easily from one hour to the next.
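A minimal sketch of that hierarchy, assuming hits arrive in chronological order (nested dicts stand in for the B+-tree levels, and the hour nodes are chained in a linked list for range scans; all names are mine):

```python
class HourNode:
    def __init__(self, key):
        self.key = key        # (year, month, day, hour)
        self.hits = 0
        self.next = None      # link to the chronologically next hour

class TimeTree:
    def __init__(self):
        self.root = {}        # year -> month -> day -> hour -> HourNode
        self.head = None      # first hour node in the linked list
        self.tail = None      # last hour node in the linked list

    def record(self, y, mo, d, h):
        # Descend year -> month -> day, creating levels on demand.
        day = self.root.setdefault(y, {}).setdefault(mo, {}).setdefault(d, {})
        if h not in day:
            node = HourNode((y, mo, d, h))
            day[h] = node
            if self.tail:                 # append to the hour linked list
                self.tail.next = node
            else:
                self.head = node
            self.tail = node
        day[h].hits += 1

    def hits_between(self, start, end):
        # Walk the hour list and sum hits whose key falls in [start, end].
        # (A real B+-tree would descend the index to find the start node
        # instead of walking from the head.)
        total, cur = 0, self.head
        while cur and cur.key <= end:
            if cur.key >= start:
                total += cur.hits
            cur = cur.next
        return total
```

Tuples like `(2013, 1, 31, 8)` compare lexicographically, which is exactly chronological order here.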
After that we create categories for the URLs (it could be by domain).
Here we actually build a trie-like structure to store the URLs in minimal space and to speed up search.
Now the first part is used to find from where to where we have to search; once we find the interval in the tree, we walk through it and print the output.
I would have one log directory for each URL, and one log file for each day. Each line in the log file holds a time (in seconds) and the accumulated number of accesses since the start of the log; all accesses that share the same second are merged into the same line.
To get the number of accesses for a specified URL, you just find the directory for that URL and all the log files within the time frame. For each log file, you get the number of accesses by subtracting the cumulative count at the earliest in-range line from the count at the latest in-range line, then sum these numbers across the files to form the final result. Some points on the design:
1> The work is naturally distributed per URL and per time frame.
2> The accumulated count within each day means you avoid going through the whole file.
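The per-day cumulative log can be sketched as follows (the file layout and helper names are my own assumptions: one whitespace-separated `epoch_second cumulative_count` pair per line, lines in time order):

```python
import os

def hits_in_file(path, start, end):
    """Hits within [start, end] for one day's log: cumulative count
    at the latest in-range line minus the count just before the
    earliest in-range line."""
    first_before, last_in = 0, None
    with open(path) as f:
        for line in f:
            ts, cum = map(int, line.split())
            if ts < start:
                first_before = cum    # last cumulative value before the range
            elif ts <= end:
                last_in = cum         # latest cumulative value inside the range
            else:
                break                 # lines are time-ordered; nothing more to read
    return 0 if last_in is None else last_in - first_before

def hits_for_url(log_dir, url_key, day_files, start, end):
    # day_files: the day-log filenames overlapping [start, end]
    return sum(hits_in_file(os.path.join(log_dir, url_key, name), start, end)
               for name in day_files)
```

Because the counts are cumulative, each file contributes a single subtraction rather than a full per-line tally.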
The problem boils down to designing a data structure where:
1) A timestamp can map to multiple URLs, i.e. multiple URLs can be opened at a particular timestamp.
2) A single URL maps to multiple timestamps, i.e. a single URL can be opened at multiple timestamps.
This is what I have deduced. Can someone please let me know what can be used in such a situation (a many-to-many relationship)?
Trie + binary search. Store the URLs in a trie, and at each terminal node store the logged times in a list. They have to be chronological (otherwise the concept of "time between" is inconsistent, I guess), so the list is sorted.
Given a URL and a query, just walk to the URL's list and use binary search to retrieve the required count.
This is better than partitioning by time when the number of logs per URL is sparse and the number of URLs is high.
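A minimal sketch of the Trie + binary search idea (the class and method names are mine; `bisect` supplies the binary search over each URL's sorted timestamp list, and timestamps are assumed to arrive in order):

```python
from bisect import bisect_left, bisect_right

class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.times = None     # sorted hit timestamps; set only on terminal nodes

class UrlLog:
    def __init__(self):
        self.root = TrieNode()

    def record(self, url, ts):
        node = self.root
        for ch in url:
            node = node.children.setdefault(ch, TrieNode())
        if node.times is None:
            node.times = []
        node.times.append(ts)    # assumes log lines arrive chronologically

    def hits_between(self, url, start, end):
        # Walk the trie to the URL's terminal node.
        node = self.root
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                return 0         # URL never logged
        if not node.times:
            return 0             # prefix of some URL, but not itself logged
        # Two binary searches give the count in [start, end].
        return bisect_right(node.times, end) - bisect_left(node.times, start)
```

Each query costs the URL's length for the trie walk plus O(log n) for the two binary searches.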
You can store an atomic counter for each timestamp. If a URL is hit and an entry already exists for that timestamp, just increase the counter. When you need to run a query, add up all the counters between the two timestamps. The query is straightforward, since there is only one entry per timestamp.
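A small sketch of this counter idea, with a plain `Counter` standing in for the atomic per-timestamp counters:

```python
from collections import Counter

hits = Counter()   # timestamp (epoch second) -> hit count

def record(ts):
    hits[ts] += 1  # "increase the counter" when the timestamp already exists

def query(start, end):
    # Add up all the counters between the two timestamps (inclusive).
    return sum(c for ts, c in hits.items() if start <= ts <= end)
```

In a real store you would range-scan the keys rather than filter the whole table, but the bookkeeping is the same: one entry per timestamp, incremented atomically.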
- msingh March 17, 2013