Write code to return a random

Google Interview Question for Software Engineer / Developers

-1

of 5 votes

19
Answers
Write code to return a random line from a file of unknown size and variable length rows
- Guy January 30, 2014 in United States | Report Duplicate | Flag | PURGE
Google Software Engineer / Developer Algorithm

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: In-Person

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 6 vote

It suffices to store only a single line in memory at once, specifically the line that current_line currently points to. If current_line gets updated, you can update the line stored in memory as well.

The algorithm is like this, in pseudocode:

current_line = 1
counter  = 1.0
while there is another line:
    counter++
    if rand() <= 1.0 / counter:
        current_line = counter

At the end of this, current_line is the index of the randomly chosen line.

- nilkn January 30, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Argh, this was meant to be a reply to kelvin198897.

- nilkn January 30, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

I still dont quite follow this approach. Could you please explain a bit more

- Guy January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

Assume 3 lines:
(1) For the first line, probability is 1/1 = 100%. So this line always get picked.
(2) For the second line, probability is 1/2 = 50%. So this line has 50% of being picked.
(3) For the third line, probability is 1/3 = 33%. So this line has 33% of being picked.

Note that we always return the LAST line that was picked. So third line has 33% of being returned. For the second line to be returned, it must be picked, while the third line must not be picked. So that's (1/2)(2/3) = 1/3 = 33%. For the first line to be returned, the second & third lines must not be picked. So that's (1/2)(2/3) = 1/3 again.

- Sunny January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

-1

of 1 vote

Thanks. That does make a lot more sense. But doesn't the first line always get picked? Then it would be impossible for the remaining lines to be picked. Moreover, this algorithm would favor the lines in the front, wouldn't it? Because 2nd line has 1/2 of being picked, 3rd has 1/3 of being picked and 4th has 1/4 of being picked. How can this fit in 'return a random line'?

- Guy January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 2 votes

@kelvin198897 It doesn't favor the lines in front. Consider a file with 5 lines, here are the probabilities as each line is read...

... read a line...

Since Line1 is the only line, it will be selected

Line1: 1/1

... read a line...

Since Line2 has a 1/2 chance being selected, Line1 has a 1/2 chance of remaining selected

Line1: 1/1 * 1/2 = 1/2
Line2: 1/2

... read a line...

Since Line3 has a 1/3 chance of being selected, the previously selected line has a 2/3 chance of remaining selected...

Line1: 1/1 * 1/2 * 2/3 = 1/3
Line2: 1/2 * 2/3 = 1/3
Line 3: 1/3

... read a line...

Since Line4 has a 1/4 chance of being selected, the previously selected line has a 3/4 chance of remaining selected

Line1: 1/1 * 1/2 * 2/3 * 3/4 = 1/4
Line2: 1/2 * 2/3 * 3/4 = 1/4
Line3: 1/3 * 3/4 = 1/4
Line4: 1/4

... read a line ...

Since line 5 has a 1/5 chance of being selected, the previously selected line has a 4/5 chance of remaining selected

Line1: 1/1 * 1/2 * 2/3 * 3/4 * 4/5 = 1/5
Line2: 1/2 * 2/3 * 3/4 * 4/5 = 1/5
Line3: 1/3 * 3/4 * 4/5 = 1/5
Line4: 1/4 * 4/5 = 1/5
Line5: 1/5

- Anonymous January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Thanks. I got it now. This is brilliant

- Guy January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

According to the explanations given above, if nth line is to picked, then the probability is 1/n and that any previous line was picked is 1 - 1/n .
I dont get this.
I concur with probability of 1/n. But the probability of not being picked = probability of prev lines or any future line to be picked.
Hence the math worked out is wrong.

- Sunny February 09, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Its a reservoir sampling algorithm

- Thejus April 16, 2019 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

General idea (no code):

Set current_line initially to 0. Iterate over the lines. For the nth line, set current_line to n with probability 1/n. This will produce a uniform distribution on the set of lines.

- nilkn January 30, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Based on what you're saying:
for line=0, P(line) = 1/0 = infinity
for line=1, P(line) = 1/1 = 1
for line=2, P(line) = 1/2 = 0.5
...
How is it a uniform distribution? Am I missing something?

- v1 January 30, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

First, you don't divide by zero, obviously (my fault--I should have indexed from 1).

What you're missing is that you potentially update current_line at every iteration. So while there's a 1/2 chance that current_line is set to 2 when n = 2, there is a 1/3 chance that it is changed again to 3 when n = 3, and so on.

Note that, at line n, the odds of current_line being set to n are 1/n. The odds of it *staying* at n are (1 - 1/(n+1)) * (1 - 1/(n+2)) * ... * (1 - 1/N) where N is the total number of lines. Therefore, the odds of current_line being n at the very end are

1/n * (1 - 1/(n+1)) * ... * (1 - 1/N),

which you can verify is equal to 1/N.

- nilkn January 30, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

So you are basically iterating each line and keep a counter of how many lines you have read so far, and probably call Math.random()*counter to generate a random number? If so, it also requires to store all read lines in an arraylist, right?

- Guy January 30, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

This question (or at least his way of solving it) happens to be the same as this previously posted one:
id=5568707711467520

Basically as you iterate through the lines, you keep calling Math.random as you mentioned, but you only need a single variable to keep track of the last string that "matched". Note that the first line has 100% probability of being chosen.

- Sunny January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

public bool IsMatch(String FilePath,String matchString)
{

StringReader sr= new StreamReader(FilePath);

int hashStr = matchString.GetHashCode();

while(sr.ReadLine()!=null)
{
  String tempStr = sr.ReadLine();

  int hashTempStr = tempStr.GetHashCode();

 if(hashTempStr  == hashStr )
  {
   return true;
   break;
  }        

}


return false;

}

- Anonymous January 31, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

For the code above if true is returned then the particular string has been found else the search will continue and if No Match is found then in the end it would return false.

Only Drawback here is that I have to iterate through all the lines of the file.But as per the given Question no specifications are given . So this code will run.

This code has been written in VisualStudio.Net , like StreamReader other Readers might also be available in different IDE's.

- Aditya January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

In the above Code there is Correction where
StringReader is written on the left side, actually it is StreamReader. Please ignore the typographical error.

- Aditya January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

You're skipping every second line here.

String tempStr;
while((tempStr = sr.ReadLine()) != null)
{
    ...

- Marton January 31, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

At time i we read the i'th line. We pick this line with the possibility 1/i. If this line is not selected, nothing needs to be done later for this line. At time (i+1) we re-choose the i'th line with possibility i/(i+1), so that the possibility to keep the i'th line at time (i+1) is 1/i * i / (i+1) = 1/(i+1).

This process can be continued until we reach the end of the file. For the lines kept we can randomly pick one.

- Westlake January 31, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Ya Martin sorry I made an error while typing , yes we have to do

while((tempStr = sr.ReadLine()) != null)
{
}

My Mistake I missed out this thing.

- Aditya January 31, 2014 | Flag Reply

CareerCup

Google Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews