Amazon Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: In-Person
6
down vote
You can do this without having to read all the lines in memory, thus working well for huge files. Pseudocode:
linenum := 0
ret := ''
while more lines to read:
line := readline()
linenum := linenum + 1
r := uniform_random(0, linenum)
if r < 1:
ret := line
return ret
Proof: We begin by noting that we always save the first line in ret. If the file has one line, you are going to choose it, and you're done.
For two-line file, ret will save the first line 100% of the times, and the second line will be saved in ret 50% of the time during the second iteration of the loop. Thus, each line has a probability of 0.5 of being selected.
Now, let's assume that this method works for files of ≤N lines. To prove that this means it works for N+1, in the (N+1)th iteration of the loop, there is a probability of 1/(N+1) that the last line will be selected (random(0, N+1) < 1 has that probability). Thus, the last line has 1/(N+1) probability of being selected. The probability of all other lines being selected is still going to be equal to each other, let's call this x. Then, N*x + 1/(N+1) == 1, which means that x = 1/(N+1).
Proof by induction is complete.
Edit: Oops, didn't see the first answer's third method before replying. Still, I will keep this post here if only for the proof, and the opportunity for other people to correct it if there are any errors in it.
Divide the file in N chunks, of 10,000 (e.g.) lines each
store the offset of these N chunks in another file / in memory.
For the last chunk, also count the line numbers (Let say have K line) in it, if it is less than 10,000 then store this information as well
Generate a number randomly between 1 to N, and open this chunk, by seeking to that position in that file.
Now generate another number between 1 and 10,000 and return taht line,
in case of the last chunk generate the number between 1 and K, and return that line....
In the book 《programming pearls》 second edition, the 12th capter and the 10th question in the exercise. At last, the probabilty to choose any line is 1/n.
when there are n lines. the probabilty to choose n - 1line is 1 / (n - 1) * (n - 1) / n = 1/ n
the probabilty to choose n - 1 * the probability not choose n.
@milo...can u plz eleborate on the solution...i couldnt get how we can always achieve the equal prob....
Thankx
@candis the probability to choose n - 2 line when there are n lines in the file is: 1 / (n - 2) * (n - 2) / (n - 1) * (n - 1) / n = 1 / n. and so on, for any line, the probability is 1 / n.
@candis we should make distinction between the probability of read a line and the probability of get a line.
The probability of get a line = the probability of read a line * the probability not read the lines after the current line
If there are 2 lines, probability of choosing one line is 1/2. If there are 5 lines then the probability of choosing one line is 1/5. If there are n lines, then the probability of choosing one line is 1/n. So there is an equal probability.
But the question here is how to get this programatically?
A practical solution:
1)Count lines (tot_line) in largefile.txt and record file offset in an array (if memory allows), or binary disk file offset.dat, while contains file offset(int64).
2) Generate random line number randline = randf()*tot_line.
3) If in memory, directly access array using randline-1 as index
Directly seek into offset.dat by (randline-1)*8 bytes, retrieve line offset offset_val.
4) Seek into largef.txt by offset_val bytes.
5) Read line.
1) Scan the file using readline() and note the offset of each line by writing it in binary to a 32 bit unsigned int in a new "table" file (or better yet in RAM if it fits). Keep a count of how many lines you encounter.
2) Generate a random number up to number of lines.
3) Seek into your binary table file using the line number, and then read the offset, then seek into the big file and get the line.
O(n) where n is number of chars in big file, can't avoid that. Each lookup is a fairly small constant. If number of items in table fits in memory, you get a speedup.
See reservoir sampling on wikipedia
- Ashupriya July 03, 2012