Amazon Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: Phone Interview
what if the file contains large number of userid's which cannot fit in the hashtable memory?
If the data is tooo big that you cannot put it in memory at once, then follow this approach:
1. Apply a hash function h1(k) on the UserLog1 and UserLog2 files and partition the data into smaller chunks so that each of which can fit into memory.
2. Lets suppose that you get m files for UserLog1 and UserLog2 after step 1.
3. Now prepare a in memory hashSet for the first partition of UserLog1 file.
4. Now check for each element of UserLog2 file, if its already there in the hashSet prepared in step 3. If yes, then add it to a different list of duplicateUsers. Else, continue looking for next element until you are done with all the elements.
5. Repeat steps 3, 4 and 5 for second, third and all partitions of UserLog1 and UserLog2 files.
cat day2 | awk -F"," '{print$1;}' > UserId2
cat day1 | awk -F"," '{print$1;}' > UserId1
comm -13 UserId1 UserId2 >> ans
The probability cannot be determined by just looking at the second day log. Even if a user had 0 or a million log entries on second day you must make assumptions about the probability distribution from day to day to give an answer. For example assume flipping a coin has a probability distribution or 50-50 heads versus tails. One day in N trials it could come up 50-50 heads and tails. The next day the trials could wind up being 100% heads. So based on second day data you would then surmise that probability is 100% heads. And so you answer the first day that 100% heads was probability. But obvoiusly, this is a mistaken conclusion since you are making a faulty assumption about the probability distribution based on one days data.
As pointed out previously, this question is about the probability. Even if you assume a prior belief the question by itself does not have enough data to answer the question. I believe the interviewer wants us to ask questions about what is the relation between the users logging in the second day after the first day and so on. if you have enough information then you can do some bayesian inference on it.
This question is about probability and statistical inference. For a large population size where taking into account each and every user is difficult due to whatever constraints there might be, you can sample the data at random and calculate the probability with upto 99% certainty (confidence interval). This is the method employed by social sciences people when identifying a demographic pattern etc.
One assumption you will need to make is that the days are i.i.d (independent and identically distributed). This means that the user behavior does not change from one day to another. If this holds true, than you can pick a random sample of, let's say 100 users and figure how many of them browsed during both days.
You will then calculate the mean and standard deviation of the sample and conduct a t-test to measure the statistical significance of the sample.
I am studying collections at the moment . I just had my analysis.
a) We can put the User Ids in the HashMap by looking at the T+1 file . hashMap.keyset will give me the denominator.
Denominator = HashMap.keySet.size
b) We can put the value for each key and store as an ArrayList.
numerator = sum(eachUser(HashMap.values.size() == 2))
probability = numerator/denominator
Am I Correct in my analysis of the problem ?
Regards,
Harish
Well this pretty much means finding repeating user ids in the second log file. One must ask how big are these log files or even better how many unique UIDs will it have. If they are big enough to to be held in memory then just parse UIDs from first log file and add them to a HashSet, while parsing the second file check if each uid exists in the set. contains operation in hashset will take O(1) time and so does add operation. Summing up, this program will take O(n+m) time where n is number of users in first log file and m is number of users in second.
- Sap December 20, 2012