## Adobe Interview Question

Software Analysts**Country:**United States

According to question, every word should have equal probability of being selected. So if a word appear more than one times then above method will increase probability of that repeated word.

i did not think of the duplicate words...in that case with no extra space i could not think of any solution :(

the question was not a random unique word. think about it. if all words needed to have equal probability per unique word, you would need to keep track of the count of the words. There simply is no other way to do it.

I've read it from algogeeks : "rand()%n is not uniformly distributed between 0 and n-1. To see this, suppose rand() produces positive 32-bit integers, so they have the range 0 to 2^31 - 1 = 2,147,483,647. Now, suppose there are 100,000 words in the file. When you are computing rand()%100000, you will get each of the numbers 0 to 83,646 on average 21,475 times out of 2^31 random numbers, but you will get the numbers 83,647 to 99,999 only 21,474 times out of every 2^31 random numbers. Thus, the first numbers in the file will appear with slightly too high a probability and the numbers at the end of the file will be selected with slightly too low a probability. "

I agree with this . So , any other solution??

Barney, I think you're focusing on the wrong issue here. rand() % n is very close to being distributed uniformly at random for small n. If you need the probability to be exactly equal, there are algorithms for selecting a number uniformly at random between 0 and X given a generator for generating numbers uniformly at random between 0 and Y. I think the point of the problem was to focus on the aspects of the problem mentioned above.

@eugene : ok .

But to find the value of n , first we've to traverse the whole file and count the number of words . Is it possible to print the random word in just 1 traverse ???

@Barney: Sure, you can use a technique known as "Reservoir Sampling". You can read about it in detail on Wikipedia.

The basic approach is that at any point you have a current random choice (which in the beginning is initialized to the first word), and as you see each additional word, you replace the current random choice with the new word you just saw with 1/k probability, if k is the number of words you've seen so far.

After reading the first word, the random choice is the first word with 100% probability

After reading the second word, make it the new random choice with 50% probability. So now the probability distribution is 50% first word, 50% second word.

After reading the third word, make it the new random choice with 1/3 probability. This means the two previous words have 1 - 1/3 = 2/3 probability of remaining selected if they were selected already, which is true with a probability of 1/2 for each word, so the probability is now (2/3)*(1/2) = 1/3 for each of the first two words as well. So every word has probability 1/3 of being chosen.

The statement that after k words, every word will have a 1/k chance of being the selected random word can be formally proven by induction. k=1 is the trivial base case and is true by construction.

For k >= 2, we opt to select the k-th word with probability 1/k, and by induction, every previous word was selected with probability 1/(k-1). Any previously selected word now has probability 1 - 1/k = (k-1)/k probability of remaining selected, and had 1/(k-1) probability of being selected in the first place, so it now has 1/k probability of being the random word. So all words now have probability 1/k of being the random word.

@eugene.yarovoi but how to ensure that the probability of a random number to be selected is 1/k??

@Karthik: Are you asking for an algorithm to give you a bit whose value is 1 with probability 1/k and 0 otherwise?

@eugene In Reservoir Sampling, how to replace a 'random' word and what if that word is encountered afterwards during traversing?

@akashi: This algorithm isn't meant to have an equal chance of picking each distinct word, only each word. You won't be able to do it without O(n) space if you need an equal chance for each distinct word. I see no reason to assume that interpretation of the question, however.

@Karthik Vvs: what do you mean by "output it in programming?"

@Karthik: I assume we already have a random number generator that generates random numbers between 0 and some number X (such as the C rand() function, for which X is RAND_MAX).

If you want to know how, given a random number generator that generates numbers uniformly at random between 0 and some number X, we can generate numbers uniformly at random between 0 and a given number Y, see this CareerCup question I answered a while back: www .careercup. com/question?id=12426697. It contains a special case of what can be made into a more general algorithm.

I think talking about this issue is probably out-of-scope for this question though. Many programming languages already provide convenience functions to do this sort of thing, implemented similarly to how I solve the problem in the link I gave.

I think just create number like m= Interger.Max and do r= random(m); and r%length should be ok

If it is possible to load entire file into memory, we can do something like this:

public void shuffleArray(String[] words) {

String temp;

int index, totalCards = words.length;

for (int i = 0; i < totalCards; i++){

index = (int) (Math.random() * (totalCards - i)) + i;

temp = words[i];

words[i] = words[index];

words[index] = temp;

System.out.print(words[i]+" ");

}

}

Here are few steps

1. Read the file line by line

2. using strtok, split each line using space as delimiter and collect all the words in an array (either static or dynamic array)

3. After we collect all the Words, choose a random number between Array length and get the random word.

my_words[random(Length<my_words> )]

Assumptions:

- The Artist November 15, 20121. Word here means the english word and not 8bytes.(seperated by a single space character)

2. Number of words in the file < RAND_MAX.

Here's what you can do:

1. Count the number of spaces (say x)in the file => no of. words = x + 1

2. y = rand() % (x + 1)

3. return the word after y-1 spaces