## Amazon Interview Question for Software Engineer / Developers

Country: United States
Interview Type: In-Person

Comment hidden because of low score. Click to expand.
4
of 4 vote

See reservoir sampling on wikipedia

Comment hidden because of low score. Click to expand.
2
of 2 vote

6
down vote
You can do this without having to read all the lines in memory, thus working well for huge files. Pseudocode:

linenum := 0
ret := ''
while more lines to read:
line := readline()
linenum := linenum + 1
r := uniform_random(0, linenum)
if r < 1:
ret := line

return ret
Proof: We begin by noting that we always save the first line in ret. If the file has one line, you are going to choose it, and you're done.

For two-line file, ret will save the first line 100% of the times, and the second line will be saved in ret 50% of the time during the second iteration of the loop. Thus, each line has a probability of 0.5 of being selected.

Now, let's assume that this method works for files of ≤N lines. To prove that this means it works for N+1, in the (N+1)th iteration of the loop, there is a probability of 1/(N+1) that the last line will be selected (random(0, N+1) < 1 has that probability). Thus, the last line has 1/(N+1) probability of being selected. The probability of all other lines being selected is still going to be equal to each other, let's call this x. Then, N*x + 1/(N+1) == 1, which means that x = 1/(N+1).

Proof by induction is complete.

Edit: Oops, didn't see the first answer's third method before replying. Still, I will keep this post here if only for the proof, and the opportunity for other people to correct it if there are any errors in it.

Comment hidden because of low score. Click to expand.
1
of 1 vote

Divide the file in N chunks, of 10,000 (e.g.) lines each
store the offset of these N chunks in another file / in memory.
For the last chunk, also count the line numbers (Let say have K line) in it, if it is less than 10,000 then store this information as well

Generate a number randomly between 1 to N, and open this chunk, by seeking to that position in that file.

Now generate another number between 1 and 10,000 and return taht line,
in case of the last chunk generate the number between 1 and K, and return that line....

Comment hidden because of low score. Click to expand.
0
of 0 vote

In the book 《programming pearls》 second edition, the 12th capter and the 10th question in the exercise. At last, the probabilty to choose any line is 1/n.
when there are n lines. the probabilty to choose n - 1line is 1 / (n - 1) * (n - 1) / n = 1/ n
the probabilty to choose n - 1 * the probability not choose n.

Comment hidden because of low score. Click to expand.
0
of 0 votes

@milo...can u plz eleborate on the solution...i couldnt get how we can always achieve the equal prob....
Thankx

Comment hidden because of low score. Click to expand.
0
of 0 votes

@candis the probability to choose n - 2 line when there are n lines in the file is: 1 / (n - 2) * (n - 2) / (n - 1) * (n - 1) / n = 1 / n. and so on, for any line, the probability is 1 / n.

Comment hidden because of low score. Click to expand.
0
of 0 votes

@candis we should make distinction between the probability of read a line and the probability of get a line.
The probability of get a line = the probability of read a line * the probability not read the lines after the current line

Comment hidden because of low score. Click to expand.
0
of 0 votes

If there are 2 lines, probability of choosing one line is 1/2. If there are 5 lines then the probability of choosing one line is 1/5. If there are n lines, then the probability of choosing one line is 1/n. So there is an equal probability.

But the question here is how to get this programatically?

Comment hidden because of low score. Click to expand.
0
of 2 vote

A practical solution:
1)Count lines (tot_line) in largefile.txt and record file offset in an array (if memory allows), or binary disk file offset.dat, while contains file offset(int64).
2) Generate random line number randline = randf()*tot_line.
3) If in memory, directly access array using randline-1 as index
Directly seek into offset.dat by (randline-1)*8 bytes, retrieve line offset offset_val.
4) Seek into largef.txt by offset_val bytes.
5) Read line.

Comment hidden because of low score. Click to expand.
0
of 0 vote

1) Scan the file using readline() and note the offset of each line by writing it in binary to a 32 bit unsigned int in a new "table" file (or better yet in RAM if it fits). Keep a count of how many lines you encounter.
2) Generate a random number up to number of lines.
3) Seek into your binary table file using the line number, and then read the offset, then seek into the big file and get the line.
O(n) where n is number of chars in big file, can't avoid that. Each lookup is a fairly small constant. If number of items in table fits in memory, you get a speedup.

Add a Comment
Name:

Writing Code? Surround your code with {{{ and }}} to preserve whitespace.

### Books

is a comprehensive book on getting a job at a top tech company, while focuses on dev interviews and does this for PMs.

Learn More

### Videos

CareerCup's interview videos give you a real-life look at technical interviews. In these unscripted videos, watch how other candidates handle tough questions and how the interviewer thinks about their performance.

Learn More

### Resume Review

Most engineers make critical mistakes on their resumes -- we can fix your resume with our custom resume review service. And, we use fellow engineers as our resume reviewers, so you can be sure that we "get" what you're saying.

Learn More

### Mock Interviews

Our Mock Interviews will be conducted "in character" just like a real interview, and can focus on whatever topics you want. All our interviewers have worked for Microsoft, Google or Amazon, you know you'll get a true-to-life experience.

Learn More