Amazon Interview Question
Software Engineer in TestsCountry: United States
Create a dictionary with Key as CustomerId (always unique) and value as a LinkList, whose each node represents the videoDetails (structure shown below):
We will have to update the LinkList Add function to do this:
1. Traverse the list and if videoId doesnt exist, create one at tail with count as 1
2. If videoId exists increment count
Traverse through each entry in the dictionary and if the number of play hit doesnt meet or exceeds the threshold, remove the node.
Return the dictionary
Struct VideoDetails
{
VideoID vID;
VideoLength vLength;
uint numberOfPlayHits;
}
JSDUDE, your structure doesnt care about the timestamp. Let say a customer hits a particular video 6 times in a span of 4 hrs, and the threshhold is 4 then it means customer is having issue but actually he is not facing any issues
@DashDash,
They way i thought about it is, if the videosize is 2 hours long and your timestamp indicates that you have hit play more than Threshold times on this video within the timestamp span of 2 hours then it means you have issues.
Makes sense?
That is the way I am incrementing the count on the video in my list
But i guess i didnot consider re-viewing of the video.
How is the collection of objects of the class given, example, linked list etc?
How much is the RAM capacity?
Does it need to be solved in a distributed environment?
What about solving it using a script?
How about some external sorting mechanisms?
Is the data going to be used for other querying too? if yes, we could put it in database and query.
Answers to some of these questions in the discussion will throw some light on the solution. With the given problem, a possible solution could be:
1. while parsing and creating objects or when scanning class objects, eliminate objects that do not have play event.
2. We probably can also eliminate if video length is too small, that user is pressing play to actually replay it. We will not be able to tell the difference. Good to discuss with interviewer.
3. Assume resume is an event (pause/play) and just "Play" is not equivalent to "Pause/Play'. Again discuss.
4. Sort doubly linked list based on customer ID, then Video ID and then timestamp.
5. maintain a hash of VideoIDs and node addresses.
6. If in this list for adjacent entries, the timestamp difference is less than the video length, add the VideoID to a hash.
7. Remove all the address associated with this video from the list. If this is done, we will have fewer checking to do.
Approach 1:
//generate json at client side, save in filesystem, run mongodb/hdfs(graphdb) to find out
// json { ip:"", user:"userid or guest", videotype:"", networkbandwidth:"slow/medium/fast"
requesttime:"", startTime:"", endTime:""}
Approach 2:
// build a graph with all the vieoIds{ long duration, List<VidRequest> list }
// VidRequest is like above json object
//check if the that node is visisted wihin the time then report it
A couple of things.
- For each customer, Maintain an K-ary B+ tree whose key is videoId. (loopup,insert, delete O(log(base)N)) where base is K.
- Maintain a sorted list of customer by customerId (loopup,insert, delete O(logN))
For each record (customerId, videoId ) in the file
- Add it to sorted list of customerId or find it if exists
- Get that customer's B+ tree and look up by videoId, and increase count of play for that video and for that customer
-
Optionally, store customerIds in a min priority queue sorted by Count-of-play-push for some top K customers.(similar to finding top K max values)
Lets take customerId and VideoId as a key. Therefore for this key we can create a hash and where chaining is done for the same key and the data in the same key are sorted according to the time stamp. In this way we can get the difference between 2 time stamp and can predict whether a customer is facing buffering issues
- DashDash April 30, 2013Please do let me know your suggestions