Amazon Interview Question for Software Engineer / Developers

0

of 0 votes

9
Answers
Round5 | Q2 : Suppose, Amazon have a Logging system Like This:

They Track all logs daily basis, stored in separate log file.Log contains a collection of tuples of Customer ID and Page ID. The length of Customer ID and Page ID is L1 and L2. Suppose We have a log of D-days , and size of each log fine can be Order of n , where n be the number of customer.
In a most generalized situation, you can assume that a customer can visit the same page multiple times in a day or any number of days.

We are interested to find out the number of Distinct customer visited at-least p distinct pages within exactly T days.

Propose a Data Structure to be use to solve this problem efficiently . Design an Algorithm to solve this problem and Find out the complexity of the algorithm.

{Hints:- Double Hashing/ {Hashing +Tries/BST }}
- dutta.dipankar08 February 02, 2012 in India for SDE | Report Duplicate | Flag | PURGE
Amazon Software Engineer / Developer

Email me when people comment.

An error occurred in subscribing you.

Team: SDE
Country: India
Interview Type: In-Person

More Questions from This Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 2 vote

@vran.freelancer
Let me first get it clarified with what your solution talks about. HashCID takes CustomerId as input and gives a hashtable in return. This hashtable is about the PageId being referenced on a particular day.
eg. C1 accesses P1 on D1
C1 accesses P1 on D1
C1 accesses P2 on D1
C2 accesses P1 on D1
C3 accesses P2 on D1

HashCID will look like:
<C1, [<P1,D1>,<P2,D1>]>
<C2, <P1,D1>>
<C3, <P2,D1>>

HashPID will look like
<D1,5>

From the problem, we are expected to find the number of Distinct customers, each of which has accessed atleast 'p' distinct pages in T days.

With your solution, the above requirement is not getting fulfilled. I think the second hashtable should be HashPID (PID, <Day, Count>), otherwise counting access to a page w.r.t to Customer will not be possible.

- Learner February 03, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

what abt the following soln:

first we scan the whole log and extract tuples for the required T days. Next we build up a hash PageID -> list<CustomerID> from this info, ignoring the duplicates. For example, we might get:

P1 -> C1 C2 C3 C5
P2 -> C1 C3 C6 C7
P3 -> C1 C2 C5 C4

then, suppose we need to find all customers who visited at least 2 distinct pages. For that we scan the above hash and build up a new hash CustomerID -> PageID, ie.:

C1 -> P1 P2 P3
C2 -> P1 P3
C3 -> P1 P2
C4 -> P3
C5 -> P1 P3
C6 -> P2
C7 -> P2

finally select those customerID's which have at least 2 pages in the lists

- Anonymous February 09, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

I did not understand the need of hash PageID->list<customerID>. I think that hash CustID->list<PageID> alone would do. And how about using a map here ?

- Anonymous February 15, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

With log files, you won't be able to do straight up hashing on ram. You can do it the Hadoop way - with an external sort, and merge.

- memo February 03, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

You might still be able to bring everything into RAM. It depends on how much data you have.

- eugene.yarovoi February 03, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

With 2 Hashtables. 1) HashCID<int CID, HashTable<PID,Day>> and 2) HashPID<Day,Count>

- vran.freelancer February 03, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

I understand that the question is about data structures......But I tried to approach it from a DB perspective

Table:

create table Pages(Date int, CustID int, PageID int);

I did not consider the multiple visits to the same page as it does not add any value to the question asked.

Query to get the output

SELECT CustID , Count (DISTINCT PageID) AS NoOfPages FROM Pages WHERE Date BETWEEN D1 AND D2 Group BY CUSTID HAVING NoOfPages > p;

a) All the distinct customers are Grouped
b) Customers and No of Distinct Pages are calculated between selected dates (T Days)
c) Having clause trims the recordset to remove the customers who have visited less than p Pages

Create a temp table with result recordset. And count the number of customers. Output it.

But I am struggling to convert the sql structure into a MT data structure. Is there a data structure which can extend to n-Column instead of Key Pair?

- Anonymous February 04, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

We can create a Map<CID, List<PID>> by taking the logs of T days. Then we can traverse the map and whenever for a CID the size of List<PID> greater than p, output that CID.

- Lalit February 16, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

(Approach assuming we can get the logs into memory completely)
Suppose we have to find the no. of distinct customers visiting pages today. This can be decomposed into HashMap<CustomerID, PageID>

Now if we hash all these maps obtained above based on Dates we get required information as follows: HashMap<Date, HashMap<CustomerID, PageID>> and extract the required info.

- ashu February 18, 2012 | Flag Reply

CareerCup

Amazon Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews