Facebook Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: In-Person
Assuming the file is sorted with longer length words appearing after shorter one.
Do binary/interpolation search for aaaa and zzzz.
The positions between those two locations are in effect an array of the 4 letter words. (5 bytes each, assuming this is ascii file, with newline after each word)
The positions will tell you how many 4 letter words are there . Say that is W.
Now you can pick a random number in 1 to W. And pick the corresponding word.
This will issue very few reads (and seeks) on the file.
Of course, if the words is not too big, etc, there are probably simpler methods (like scanning the file and storing the 4 letter words), but which perform slower.
All these unix command line based ideas are great, but if we're asked not to use these but to think of an algorithm and print it out, would reservoir sampling be a good fit?
public class GenerateRandom {
public static void main(String args[]){
String s="/usr/share/dict/words:";
String word="";
int l=s.length();
List shuffledList= new ArrayList();
for(int i = 0; i < l; i++)
shuffledList.add(s.charAt(i));
Collections.shuffle(shuffledList);
List newList=shuffledList.subList(0, 4);
for(Object i:newList)
word+=i;
System.out.println("The new word is "+word);
}
}
It says generate "a 4 letter word" not all. It seems like most of the answers read the whole file and parse the whole file which seems kind of brute force. Here is a python version that reads 3 orders of magnitude less of the file:
#!/usr/bin/python
import random
import os
CHUNK_SIZE = 1024
def random_word( wlen ):
wfile = open('/etc/dictionaries-common/words','r')
flen = os.stat('/etc/dictionaries-common/words').st_size
chunks = flen / CHUNK_SIZE
word = None
# We might have to try a couple times if we get a chunk that does
# not have any words of our requested length.
while not word:
rpos = random.randint(0,chunks) * CHUNK_SIZE
wfile.seek(rpos)
buf = wfile.read(CHUNK_SIZE)
# We skip the first and last word because the seeks and read
# may have left partial words on the ends.
words = [x for x in buf.split("\n")[1:-1] if len(x) == wlen ]
if words:
p = random.randint(0,len(words)-1)
word = words[p]
return word
print random_word(4)
Since you specifically mention a unix file system, you will probably get bonus points for a one line terminal solution before designing a program.
- Ryan Latham October 05, 2012grep -o -w '\w\{4\}' /usr/share/dict/words | sort --random-sort | head -n 1
or
grep -o -w '\w\{4\}' /usr/share/dict/words | shuf -n 1
would work