Amazon Interview Question for Software Engineer / Developers


Country: United States
Interview Type: Phone Interview





Well, this pretty much means finding the repeating user ids in the second log file. One should first ask how big these log files are, or better, how many unique UIDs they hold. If they are small enough to be held in memory, then just parse the UIDs from the first log file and add them to a HashSet, and while parsing the second file check whether each UID exists in the set. Both contains and add on a HashSet take O(1) time on average, so the program takes O(n + m) time, where n is the number of users in the first log file and m is the number of users in the second.
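
A minimal sketch of that in-memory approach in Java (assuming, as the awk answers below do, that each log line starts with the user id followed by a comma; class and file names are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class RepeatingUsers {
    // Returns the ids that appear in both day-1 and day-2 logs.
    public static Set<String> repeatingUsers(String day1Log, String day2Log) throws IOException {
        Set<String> day1Ids = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader(day1Log))) {
            String line;
            while ((line = in.readLine()) != null) {
                day1Ids.add(line.split(",", 2)[0]);          // add is O(1) on average
            }
        }
        Set<String> repeating = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader(day2Log))) {
            String line;
            while ((line = in.readLine()) != null) {
                String uid = line.split(",", 2)[0];
                if (day1Ids.contains(uid)) {                 // contains is O(1) on average
                    repeating.add(uid);
                }
            }
        }
        return repeating;
    }
}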

- Sap December 20, 2012

What if the files contain such a large number of user ids that they cannot all fit into an in-memory hash table?

- bhargav December 24, 2012

Couldn't you just split the UIDs from the first log file into smaller chunks?
E.g., if the UIDs are four times too big to fit into memory, split them into fourths and load each fourth into a hash set one at a time. Then parse the second log file four times, once for each fourth.
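
A rough sketch of that multi-pass idea, assuming the UIDs have already been extracted one per line into two files (file names and chunk size are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class ChunkedScan {
    static final int CHUNK_SIZE = 10_000_000;   // tune to available memory

    public static Set<String> repeatedUsers(String day1IdsFile, String day2IdsFile) throws IOException {
        Set<String> repeated = new HashSet<>();
        try (BufferedReader day1 = new BufferedReader(new FileReader(day1IdsFile))) {
            Set<String> chunk = new HashSet<>();
            String uid;
            while ((uid = day1.readLine()) != null) {
                chunk.add(uid);
                if (chunk.size() == CHUNK_SIZE) {            // chunk full: scan day 2 once
                    scanDay2(day2IdsFile, chunk, repeated);
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) scanDay2(day2IdsFile, chunk, repeated);
        }
        return repeated;
    }

    private static void scanDay2(String day2IdsFile, Set<String> chunk, Set<String> repeated) throws IOException {
        try (BufferedReader day2 = new BufferedReader(new FileReader(day2IdsFile))) {
            String uid;
            while ((uid = day2.readLine()) != null) {
                if (chunk.contains(uid)) repeated.add(uid);
            }
        }
    }
}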

- Anonymous December 27, 2012

If the data is too big to fit in memory at once, then follow this approach:

1. Apply a hash function h1(k) to the UIDs in the UserLog1 and UserLog2 files and partition the data into smaller chunks, each of which fits into memory (a sketch of this step is below).
2. Let's suppose you get m partition files each for UserLog1 and UserLog2 after step 1.
3. Now build an in-memory HashSet from the first partition of UserLog1.
4. Now check each element of the corresponding first partition of UserLog2: if it is already in the HashSet prepared in step 3, add it to a separate list of duplicateUsers; otherwise, continue with the next element until you have processed them all.
5. Repeat steps 3 and 4 for the second, third, and remaining partitions of the UserLog1 and UserLog2 files.
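
A rough sketch of the partitioning step (step 1) for one UID file; the paths, the number of partitions m, and the use of hashCode as h1 are illustrative. After partitioning both logs with the same m, each pair of matching partition files can be compared with the in-memory HashSet approach above.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class HashPartition {
    // Splits a file of UIDs (one per line) into m bucket files named outPrefix_part0..m-1.
    public static void partition(String uidFile, String outPrefix, int m) throws IOException {
        BufferedWriter[] buckets = new BufferedWriter[m];
        for (int i = 0; i < m; i++) {
            buckets[i] = new BufferedWriter(new FileWriter(outPrefix + "_part" + i));
        }
        try (BufferedReader in = new BufferedReader(new FileReader(uidFile))) {
            String uid;
            while ((uid = in.readLine()) != null) {
                int b = Math.floorMod(uid.hashCode(), m);   // h1(k) from step 1
                buckets[b].write(uid);
                buckets[b].newLine();
            }
        } finally {
            for (BufferedWriter w : buckets) w.close();
        }
    }
}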

- A January 19, 2013

Step 1: awk -F ',' '{print $1}' file1.txt > file2.txt   # extract the user ids only
Step 2: awk '!x[$1]++' FS="," file2.txt > userId.log    # keep only the distinct ids

- santosh February 19, 2013

cat day2 | awk -F"," '{print$1;}' > UserId2
cat day1 | awk -F"," '{print$1;}' > UserId1
comm -13 UserId1 UserId2 >> ans

- Anusha March 03, 2013

In order to use comm, you have to sort both files.
You also have to reduce each day's file to distinct ids first.

- ethan April 14, 2013

The probability cannot be determined by looking at the second day's log alone. Even if a user had zero or a million log entries on the second day, you must make assumptions about the probability distribution from day to day to give an answer. For example, suppose flipping a coin has a 50-50 distribution of heads versus tails. One day, in N trials, it could come up 50-50 heads and tails; the next day the trials could wind up being 100% heads. Based on the second day's data alone you would surmise that the probability of heads is 100%, and so answer that it was 100% on the first day as well. But obviously this is a mistaken conclusion, because you are making a faulty assumption about the probability distribution based on one day's data.

- Anonymous March 14, 2013

As pointed out previously, this question is about probability. Even if you assume a prior belief, the question by itself does not contain enough data to answer it. I believe the interviewer wants us to ask questions about the relationship between the users who log in on the second day and those from the first day, and so on. If you have enough information, you can do some Bayesian inference on it.

- red May 29, 2013

This question is about probability and statistical inference. For a large population, where taking into account each and every user is difficult due to whatever constraints there might be, you can sample the data at random and estimate the probability to within a chosen confidence level (say 99%). This is the method social scientists employ when identifying demographic patterns, etc.

One assumption you will need to make is that the days are i.i.d. (independent and identically distributed), i.e. that user behavior does not change from one day to the next. If this holds, then you can pick a random sample of, say, 100 users and figure out how many of them browsed during both days.

You would then calculate the mean and standard deviation of the sample and conduct a t-test to measure the statistical significance of the result.
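
A small illustrative sketch of that estimate (using a normal-approximation interval in place of the t-test for brevity; the sample and the day-1 set are hypothetical inputs):

import java.util.List;
import java.util.Set;

public class SampleEstimate {
    // sample: random sample of day-2 user ids; day1Users: ids seen on day 1.
    // Returns {estimate, lower bound, upper bound}.
    public static double[] estimate(List<String> sample, Set<String> day1Users) {
        int n = sample.size();
        long hits = sample.stream().filter(day1Users::contains).count();
        double p = (double) hits / n;                             // sample proportion
        double se = Math.sqrt(p * (1 - p) / n);                   // standard error
        return new double[] { p, p - 1.96 * se, p + 1.96 * se };  // ~95% interval
    }
}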

- Waleed December 29, 2014

@bhargav: In that case you can use an ArrayList.

- sriramMS December 24, 2012

I think the question asks us to find the probability. Can we take small chunks of the file for different time stamps, like the [0-1], [8-9], and [16-17] hour windows, and do the hashing on those?

- Anonymous January 20, 2013

I am studying collections at the moment; here is my analysis.

a) We can put the user ids from the T+1 file into a HashMap; the key set gives the denominator.

denominator = hashMap.keySet().size()

b) For each key we store, as an ArrayList, the days on which that user appears; the numerator is the number of users whose list has size 2.

numerator = count(users whose value list size() == 2)

probability = numerator / denominator
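
Roughly, in code (the map from user id to the days on which that id appears is a hypothetical input, with day 1 = T and day 2 = T+1):

import java.util.Map;
import java.util.Set;

public class RepeatProbability {
    public static double probability(Map<String, Set<Integer>> userDays) {
        long denominator = userDays.values().stream()
                .filter(days -> days.contains(2))                     // users seen on day T+1
                .count();
        long numerator = userDays.values().stream()
                .filter(days -> days.contains(1) && days.contains(2)) // users seen on both days
                .count();
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }
}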

Am I correct in my analysis of the problem?

Regards,
Harish

- Harish November 12, 2016

Simply implement the hash table concept in a third file (a rough sketch follows the steps below).
1.) Let's say there are 10k active users in general.
2.) Compute int p = hash(userId) % 10000.
3.) Write that userId once at line number p (calculated in step 2) of the new file.
4.) Now process the second file in the same manner, probing the new file to find the repeating users.
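
A very rough sketch of that on-disk idea, using fixed-width slots in a RandomAccessFile so a "line number" can be seeked to directly; hash collisions and ids longer than a slot are glossed over, and all names and sizes are illustrative.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DiskHashTable {
    static final int SLOTS = 10_000;     // "10k active users" from the post
    static final int SLOT_BYTES = 32;    // fixed-width record per slot

    static int slot(String userId) {
        return Math.floorMod(userId.hashCode(), SLOTS);
    }

    // Pre-size the third file so every slot exists.
    static RandomAccessFile create(String path) throws IOException {
        RandomAccessFile table = new RandomAccessFile(path, "rw");
        table.setLength((long) SLOTS * SLOT_BYTES);
        return table;
    }

    // Step 3: write a day-1 user id at slot p.
    static void put(RandomAccessFile table, String userId) throws IOException {
        byte[] record = Arrays.copyOf(userId.getBytes(StandardCharsets.UTF_8), SLOT_BYTES);
        table.seek((long) slot(userId) * SLOT_BYTES);
        table.write(record);
    }

    // Step 4: probe the same slot for a day-2 user id.
    static boolean contains(RandomAccessFile table, String userId) throws IOException {
        byte[] record = new byte[SLOT_BYTES];
        table.seek((long) slot(userId) * SLOT_BYTES);
        table.readFully(record);
        return new String(record, StandardCharsets.UTF_8).trim().equals(userId);
    }
}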

- Himanshu October 22, 2019

