Google Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: In-Person
Assume 3 lines:
(1) For the first line, probability is 1/1 = 100%. So this line always get picked.
(2) For the second line, probability is 1/2 = 50%. So this line has 50% of being picked.
(3) For the third line, probability is 1/3 = 33%. So this line has 33% of being picked.
Note that we always return the LAST line that was picked. So third line has 33% of being returned. For the second line to be returned, it must be picked, while the third line must not be picked. So that's (1/2)(2/3) = 1/3 = 33%. For the first line to be returned, the second & third lines must not be picked. So that's (1/2)(2/3) = 1/3 again.
Thanks. That does make a lot more sense. But doesn't the first line always get picked? Then it would be impossible for the remaining lines to be picked. Moreover, this algorithm would favor the lines in the front, wouldn't it? Because 2nd line has 1/2 of being picked, 3rd has 1/3 of being picked and 4th has 1/4 of being picked. How can this fit in 'return a random line'?
@kelvin198897 It doesn't favor the lines in front. Consider a file with 5 lines, here are the probabilities as each line is read...
... read a line...
Since Line1 is the only line, it will be selected
Line1: 1/1
... read a line...
Since Line2 has a 1/2 chance being selected, Line1 has a 1/2 chance of remaining selected
Line1: 1/1 * 1/2 = 1/2
Line2: 1/2
... read a line...
Since Line3 has a 1/3 chance of being selected, the previously selected line has a 2/3 chance of remaining selected...
Line1: 1/1 * 1/2 * 2/3 = 1/3
Line2: 1/2 * 2/3 = 1/3
Line 3: 1/3
... read a line...
Since Line4 has a 1/4 chance of being selected, the previously selected line has a 3/4 chance of remaining selected
Line1: 1/1 * 1/2 * 2/3 * 3/4 = 1/4
Line2: 1/2 * 2/3 * 3/4 = 1/4
Line3: 1/3 * 3/4 = 1/4
Line4: 1/4
... read a line ...
Since line 5 has a 1/5 chance of being selected, the previously selected line has a 4/5 chance of remaining selected
Line1: 1/1 * 1/2 * 2/3 * 3/4 * 4/5 = 1/5
Line2: 1/2 * 2/3 * 3/4 * 4/5 = 1/5
Line3: 1/3 * 3/4 * 4/5 = 1/5
Line4: 1/4 * 4/5 = 1/5
Line5: 1/5
According to the explanations given above, if nth line is to picked, then the probability is 1/n and that any previous line was picked is 1 - 1/n .
I dont get this.
I concur with probability of 1/n. But the probability of not being picked = probability of prev lines or any future line to be picked.
Hence the math worked out is wrong.
General idea (no code):
Set current_line initially to 0. Iterate over the lines. For the nth line, set current_line to n with probability 1/n. This will produce a uniform distribution on the set of lines.
Based on what you're saying:
for line=0, P(line) = 1/0 = infinity
for line=1, P(line) = 1/1 = 1
for line=2, P(line) = 1/2 = 0.5
...
How is it a uniform distribution? Am I missing something?
First, you don't divide by zero, obviously (my fault--I should have indexed from 1).
What you're missing is that you potentially update current_line at every iteration. So while there's a 1/2 chance that current_line is set to 2 when n = 2, there is a 1/3 chance that it is changed again to 3 when n = 3, and so on.
Note that, at line n, the odds of current_line being set to n are 1/n. The odds of it *staying* at n are (1 - 1/(n+1)) * (1 - 1/(n+2)) * ... * (1 - 1/N) where N is the total number of lines. Therefore, the odds of current_line being n at the very end are
1/n * (1 - 1/(n+1)) * ... * (1 - 1/N),
which you can verify is equal to 1/N.
So you are basically iterating each line and keep a counter of how many lines you have read so far, and probably call Math.random()*counter to generate a random number? If so, it also requires to store all read lines in an arraylist, right?
This question (or at least his way of solving it) happens to be the same as this previously posted one:
id=5568707711467520
Basically as you iterate through the lines, you keep calling Math.random as you mentioned, but you only need a single variable to keep track of the last string that "matched". Note that the first line has 100% probability of being chosen.
public bool IsMatch(String FilePath,String matchString)
{
StringReader sr= new StreamReader(FilePath);
int hashStr = matchString.GetHashCode();
while(sr.ReadLine()!=null)
{
String tempStr = sr.ReadLine();
int hashTempStr = tempStr.GetHashCode();
if(hashTempStr == hashStr )
{
return true;
break;
}
}
return false;
}
For the code above if true is returned then the particular string has been found else the search will continue and if No Match is found then in the end it would return false.
Only Drawback here is that I have to iterate through all the lines of the file.But as per the given Question no specifications are given . So this code will run.
This code has been written in VisualStudio.Net , like StreamReader other Readers might also be available in different IDE's.
In the above Code there is Correction where
StringReader is written on the left side, actually it is StreamReader. Please ignore the typographical error.
At time i we read the i'th line. We pick this line with the possibility 1/i. If this line is not selected, nothing needs to be done later for this line. At time (i+1) we re-choose the i'th line with possibility i/(i+1), so that the possibility to keep the i'th line at time (i+1) is 1/i * i / (i+1) = 1/(i+1).
This process can be continued until we reach the end of the file. For the lines kept we can randomly pick one.
It suffices to store only a single line in memory at once, specifically the line that current_line currently points to. If current_line gets updated, you can update the line stored in memory as well.
The algorithm is like this, in pseudocode:
At the end of this, current_line is the index of the randomly chosen line.
- nilkn January 30, 2014