Interview Question
SDE-1s
Country: United States
Interview Type: In-Person
A trie can be used for this problem.
N - the n-th word
L[n] - number of characters in word n
1. Iterate through the files and get the next word. Complexity: O(L[n]) per word.
2. Build a trie along the way, inserting each word as it is found; each leaf holds the occurrence count of its word. Complexity: O(L[n]) per insertion.
3. Keep a counter (or an array of the top 5 so far) for the most frequent words.
Overall complexity: the sum of L[n] over all words, i.e., linear in the total number of characters read.
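The trie of steps 1-3 can be sketched as follows. This is a minimal sketch; the class and method names are illustrative, not from the original answer:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie where each terminal node stores how many times
// the word ending at that node was inserted.
public class WordTrie {
    private final Map<Character, WordTrie> children = new HashMap<>();
    private int count; // occurrences of the word ending at this node

    // Insert one occurrence of word; cost is O(L), the word's length.
    public void insert(String word) {
        WordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new WordTrie());
        }
        node.count++;
    }

    // Return how many times word was inserted (0 if never seen).
    public int countOf(String word) {
        WordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return 0;
        }
        return node.count;
    }
}
```

A pass over the input calls insert once per word; the running top-5 of step 3 can be updated right after each insert by comparing the returned count against the current candidates.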
I'm using a HashMap to build the word frequencies and a PriorityQueue (a min-heap ordered by count) to maintain the N most frequent words.
Inserting into and removing from the PriorityQueue costs O(log N), i.e., log of 5 here.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/**
 * Created by sky on 21/04/15.
 */
public class MostFrequentWords {

    private static final class Counter {
        private String word;
        private int counter;

        public Counter(String word, int initial) {
            this.word = word;
            this.counter = initial;
        }

        public void increment() {
            counter++;
        }

        @Override
        public String toString() {
            return "[" + word + ", " + counter + "]";
        }
    }

    public static void main(String[] args) throws Exception {
        Reader in = new InputStreamReader(System.in);
        BufferedReader br = new BufferedReader(in);
        Map<String, Counter> freq = new HashMap<>();
        int N = 5; // number of top words to keep
        // Min-heap ordered by count: the least frequent of the current
        // top N sits at the head, so it can be evicted cheaply.
        PriorityQueue<Counter> priorityQueue = new PriorityQueue<>(N + 1, new Comparator<Counter>() {
            @Override
            public int compare(Counter o1, Counter o2) {
                return o1.counter - o2.counter;
            }
        });
        String line;
        while ((line = br.readLine()) != null) {
            Counter counter = freq.get(line);
            if (counter == null) {
                counter = new Counter(line, 1);
                freq.put(line, counter);
                priorityQueue.add(counter);
            } else {
                // Remove before incrementing: mutating the count while the
                // object sits in the heap would break its ordering, and
                // re-adding without removing would create duplicates.
                // remove(Object) is a linear scan, but the heap holds at
                // most N + 1 elements, so it is cheap for small N.
                priorityQueue.remove(counter);
                counter.increment();
                priorityQueue.add(counter);
            }
            if (priorityQueue.size() > N)
                priorityQueue.remove();
        }
        br.close();
        in.close();
        // Drain the heap: words come out in increasing order of frequency.
        while (!priorityQueue.isEmpty()) {
            Counter max = priorityQueue.remove();
            System.out.println(max);
        }
    }
}
Instead of building a suffix tree, we can use a Map<String, Integer>. Load all the file names into a queue and start, say, 100 threads at a time (depending on memory). Each thread reads one file, builds a frequency map, and keeps alongside it a list of its top 5 words (keys). When a thread finishes, it writes all its data to a file, takes the next file name from the queue, and starts again. At the end we have 100 files; we can then merge them and find the top 5 searched elements. To speed this up, the 100 files can be kept as memory-mapped files.
A suffix tree is not a good choice here: to check whether a word already exists, a hash map answers in one lookup, while a suffix tree has to traverse its nodes.
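The merge-and-top-5 step above can be sketched as follows, assuming each worker has already produced a Map<String, Integer> for its file. The class and method names are hypothetical, not from the original answer:

```java
import java.util.*;

// Merge per-file word counts into one global map and extract the top-K words.
public class TopWordsMerger {

    // Combine several per-file frequency maps by summing counts per word.
    public static Map<String, Integer> merge(List<Map<String, Integer>> perFile) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> m : perFile) {
            for (Map.Entry<String, Integer> e : m.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    // Keep a size-k min-heap of entries; O(W log k) over W distinct words.
    public static List<String> topK(Map<String, Integer> total, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Comparator.comparingInt(Map.Entry::getValue));
        for (Map.Entry<String, Integer> e : total.entrySet()) {
            heap.add(e);
            if (heap.size() > k) heap.remove(); // evict least frequent
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.remove().getKey());
        Collections.reverse(result); // most frequent first
        return result;
    }
}
```

In the threaded version described above, each worker's map would be serialized to its own file first; this sketch shows only the final in-memory combination.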
I would go with the following solution:
Store all words in a Map<String, Integer>, where the key is a word and the Integer is that word's counter.
Then running a selection algorithm on the values gives you as many most-frequent words as you need.
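The selection step can be sketched with quickselect over the counter values, which finds the K-th largest count in O(n) average time. This is a sketch, not part of the original answer:

```java
import java.util.Random;

// Quickselect: find the k-th largest value in an array of word counts.
public class Quickselect {
    private static final Random RND = new Random();

    // k is 1-based: k = 1 returns the maximum.
    public static int kthLargest(int[] a, int k) {
        int lo = 0, hi = a.length - 1, target = k - 1; // index in descending order
        while (true) {
            int p = partition(a, lo, hi);
            if (p == target) return a[p];
            if (p < target) lo = p + 1;
            else hi = p - 1;
        }
    }

    // Partition a[lo..hi] in descending order around a random pivot;
    // return the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RND.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] > pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

Once the K-th largest count is known, one more pass over the map collects every word whose count reaches that threshold.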
- Ivan, April 22, 2015
Assumptions:
1. We cannot load all the files into memory (1 TB total).
2. "Word" means a whole word (the text "killer" contains just the one word "killer"; it does not contain the word "kill").
1. Load several source files into memory (for example 100 files, about 1 GB).
2. Build a tree (an RB-tree) where the key is a word and the value is its number of occurrences.
3. Save it into a new temp file, one line per entry in the format "word number".
4. Repeat steps 1-3 until all files are processed.
5. Open 2 temp files and use an external merge sort, reading one line from each file; when the same word appears in both, merge the two lines into one record. Write the result into a new temp file, maintaining the most frequent word(s) along the way.
6. Delete the processed temp files.
7. Repeat until one file remains.
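The merge of step 5 can be sketched as follows, assuming each temp file's lines have the form "word number" and are sorted by word (which an in-order traversal of the RB-tree in step 2 guarantees). The class name and helper are illustrative:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;

// Merge two sorted temp files of "word number" lines; equal words
// are collapsed into one record with their counts summed.
public class TempFileMerge {

    public static void merge(BufferedReader a, BufferedReader b, BufferedWriter out)
            throws IOException {
        String la = a.readLine(), lb = b.readLine();
        while (la != null && lb != null) {
            String wa = la.split(" ")[0], wb = lb.split(" ")[0];
            int cmp = wa.compareTo(wb);
            if (cmp < 0) {
                out.write(la + "\n");
                la = a.readLine();
            } else if (cmp > 0) {
                out.write(lb + "\n");
                lb = b.readLine();
            } else {
                // Same word in both files: emit one record with summed counts.
                out.write(wa + " " + (count(la) + count(lb)) + "\n");
                la = a.readLine();
                lb = b.readLine();
            }
        }
        // Copy whatever remains of the longer file.
        while (la != null) { out.write(la + "\n"); la = a.readLine(); }
        while (lb != null) { out.write(lb + "\n"); lb = b.readLine(); }
        out.flush();
    }

    private static int count(String line) {
        return Integer.parseInt(line.split(" ")[1]);
    }
}
```

Tracking the most frequent word(s) of step 5 amounts to comparing each emitted record's count against the current best as it is written.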