A log file which has user deta

Amazon Interview Question for Software Engineer / Developers

0

of 2 votes

12
Answers
A log file which has user details(user ID,timestamp) and pages visited in a particular day by that user.The next day -the same kind of log file gets generated.How do you find the probability of users who logged in consecutive days out of the second day - logged in users? The question is simple,but they look for the efficient data structure and time complexity.
- sriramMS December 20, 2012 in United States | Report Duplicate | Flag | PURGE
Amazon Software Engineer / Developer Algorithm Application / UI Design Arrays Assembly Automata Behavioral Bit Manipulation Brain Teasers C C# C++ Cache Coding Data Mining Data Structures

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: Phone Interview

More Questions from This Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 6 vote

Well this pretty much means finding repeating user ids in the second log file. One must ask how big are these log files or even better how many unique UIDs will it have. If they are big enough to to be held in memory then just parse UIDs from first log file and add them to a HashSet, while parsing the second file check if each uid exists in the set. contains operation in hashset will take O(1) time and so does add operation. Summing up, this program will take O(n+m) time where n is number of users in first log file and m is number of users in second.

- Sap December 20, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

what if the file contains large number of userid's which cannot fit in the hashtable memory?

- bhargav December 24, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 3 votes

Couldn't you just split up the UIDs from the first log file into smaller chunks?
Eg if the UIDs are four times too big to fit into memory, split them up into fourths and load each fourth into a hash set one at a time. Then parse the second log file 4 times, one for each fourth.

- Anonymous December 27, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 3 vote

If the data is tooo big that you cannot put it in memory at once, then follow this approach:

1. Apply a hash function h1(k) on the UserLog1 and UserLog2 files and partition the data into smaller chunks so that each of which can fit into memory.
2. Lets suppose that you get m files for UserLog1 and UserLog2 after step 1.
3. Now prepare a in memory hashSet for the first partition of UserLog1 file.
4. Now check for each element of UserLog2 file, if its already there in the hashSet prepared in step 3. If yes, then add it to a different list of duplicateUsers. Else, continue looking for next element until you are done with all the elements.
5. Repeat steps 3, 4 and 5 for second, third and all partitions of UserLog1 and UserLog2 files.

- A January 19, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

Step1: awk -F ',' '{print $1}' file1.txt > file2.txt // get ids only
Step2: awk '!x[$1]++' FS="," file2.txt > userId.log //get distinct id

- santosh February 19, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

cat day2 | awk -F"," '{print$1;}' > UserId2
cat day1 | awk -F"," '{print$1;}' > UserId1
comm -13 UserId1 UserId2 >> ans

- Anusha March 03, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

in order to use comm, you have to sort both file.
And also, you have to count distinct id for each days.

- ethan April 14, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

The probability cannot be determined by just looking at the second day log. Even if a user had 0 or a million log entries on second day you must make assumptions about the probability distribution from day to day to give an answer. For example assume flipping a coin has a probability distribution or 50-50 heads versus tails. One day in N trials it could come up 50-50 heads and tails. The next day the trials could wind up being 100% heads. So based on second day data you would then surmise that probability is 100% heads. And so you answer the first day that 100% heads was probability. But obvoiusly, this is a mistaken conclusion since you are making a faulty assumption about the probability distribution based on one days data.

- Anonymous March 14, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

As pointed out previously, this question is about the probability. Even if you assume a prior belief the question by itself does not have enough data to answer the question. I believe the interviewer wants us to ask questions about what is the relation between the users logging in the second day after the first day and so on. if you have enough information then you can do some bayesian inference on it.

- red May 29, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

This question is about probability and statistical inference. For a large population size where taking into account each and every user is difficult due to whatever constraints there might be, you can sample the data at random and calculate the probability with upto 99% certainty (confidence interval). This is the method employed by social sciences people when identifying a demographic pattern etc.

One assumption you will need to make is that the days are i.i.d (independent and identically distributed). This means that the user behavior does not change from one day to another. If this holds true, than you can pick a random sample of, let's say 100 users and figure how many of them browsed during both days.

You will then calculate the mean and standard deviation of the sample and conduct a t-test to measure the statistical significance of the sample.

- Waleed December 29, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

@bhargav : In that case you can use Arraylist

- sriramMS December 24, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

I think question has asked to find the probability. Can we take small chunk of file for different time stamps like [0-1], [8-9],[16-17] hours and do the hashing stuff.

- Anonymous January 20, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

I am studying collections at the moment . I just had my analysis.

a) We can put the User Ids in the HashMap by looking at the T+1 file . hashMap.keyset will give me the denominator.

Denominator = HashMap.keySet.size

b) We can put the value for each key and store as an ArrayList.

numerator = sum(eachUser(HashMap.values.size() == 2))

probability = numerator/denominator

Am I Correct in my analysis of the problem ?

Regards,
Harish

- Harish November 12, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Simply implement hashtable concept in third file.
1.) Lets say there are 10k active user in general.
2.) int p = hash(userId)%10k
3.) add this userId once at line number 'p' calculated in (2).
4.) Now process second file in similar manner to find users in our new file.

- Himanshu October 22, 2019 | Flag Reply

CareerCup

Amazon Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews