Interview Question
SDE-1s
Country: United States
Interview Type: In-Person
A trie can be used for this problem.
N - the n-th word
L[n] - number of characters in word n
1. Iterate through the files and get the next word. Complexity: O(L[n]) per word.
2. Build a trie along the way, inserting each word as it is found; each leaf holds the occurrence count of its word. Complexity: O(L[n]) per insertion.
3. Keep a counter (or an array of the top 5 so far) for the most frequent words.
Overall complexity: the sum of L[n] over all words, i.e., linear in the total number of characters read.
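The trie of steps 1-3 can be sketched as follows. This is a minimal sketch; the class and method names are illustrative, not from the original answer:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie where each terminal node stores how many times
// the word ending at that node was inserted.
public class WordTrie {
    private final Map<Character, WordTrie> children = new HashMap<>();
    private int count; // occurrences of the word ending at this node

    // Insert one occurrence of word; cost is O(L), the word's length.
    public void insert(String word) {
        WordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new WordTrie());
        }
        node.count++;
    }

    // Return how many times word was inserted (0 if never seen).
    public int countOf(String word) {
        WordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return 0;
        }
        return node.count;
    }
}
```

A pass over the input calls insert once per word; the running top-5 of step 3 can be updated right after each insert by comparing the returned count against the current candidates.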
I'm using a HashMap to build the word frequencies and a PriorityQueue (a min-heap ordered by count) to maintain the N most frequent words.
Inserting into and removing from the PriorityQueue costs O(log N), i.e., log of 5 here.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/**
 * Created by sky on 21/04/15.
 */
public class MostFrequentWords {

    private static final class Counter {
        private String word;
        private int counter;

        public Counter(String word, int initial) {
            this.word = word;
            this.counter = initial;
        }

        public void increment() {
            counter++;
        }

        @Override
        public String toString() {
            return "[" + word + ", " + counter + "]";
        }
    }

    public static void main(String[] args) throws Exception {
        Reader in = new InputStreamReader(System.in);
        BufferedReader br = new BufferedReader(in);
        Map<String, Counter> freq = new HashMap<>();
        int N = 5; // number of top words to keep
        // Min-heap ordered by count: the least frequent of the current
        // top N sits at the head, so it can be evicted cheaply.
        PriorityQueue<Counter> priorityQueue = new PriorityQueue<>(N + 1, new Comparator<Counter>() {
            @Override
            public int compare(Counter o1, Counter o2) {
                return o1.counter - o2.counter;
            }
        });
        String line;
        while ((line = br.readLine()) != null) {
            Counter counter = freq.get(line);
            if (counter == null) {
                counter = new Counter(line, 1);
                freq.put(line, counter);
                priorityQueue.add(counter);
            } else {
                // Remove before incrementing: mutating the count while the
                // object sits in the heap would break its ordering, and
                // re-adding without removing would create duplicates.
                // remove(Object) is a linear scan, but the heap holds at
                // most N + 1 elements, so it is cheap for small N.
                priorityQueue.remove(counter);
                counter.increment();
                priorityQueue.add(counter);
            }
            if (priorityQueue.size() > N)
                priorityQueue.remove();
        }
        br.close();
        in.close();
        // Drain the heap: words come out in increasing order of frequency.
        while (!priorityQueue.isEmpty()) {
            Counter max = priorityQueue.remove();
            System.out.println(max);
        }
    }
}
Instead of building a suffix tree, we can use a Map<String, Integer>. Load all the file names into a queue and start, say, 100 threads at a time (depending on memory). Each thread reads one file, builds a frequency map, and keeps alongside it a list of its top 5 words (keys). When a thread finishes, it writes all its data to a file, takes the next file name from the queue, and starts again. At the end we have 100 files; we can then merge them and find the top 5 searched elements. To speed this up, the 100 files can be kept as memory-mapped files.
A suffix tree is not a good choice here: to check whether a word already exists, a hash map answers in one lookup, while a suffix tree has to traverse its nodes.
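The merge-and-top-5 step above can be sketched as follows, assuming each worker has already produced a Map<String, Integer> for its file. The class and method names are hypothetical, not from the original answer:

```java
import java.util.*;

// Merge per-file word counts into one global map and extract the top-K words.
public class TopWordsMerger {

    // Combine several per-file frequency maps by summing counts per word.
    public static Map<String, Integer> merge(List<Map<String, Integer>> perFile) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> m : perFile) {
            for (Map.Entry<String, Integer> e : m.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    // Keep a size-k min-heap of entries; O(W log k) over W distinct words.
    public static List<String> topK(Map<String, Integer> total, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Comparator.comparingInt(Map.Entry::getValue));
        for (Map.Entry<String, Integer> e : total.entrySet()) {
            heap.add(e);
            if (heap.size() > k) heap.remove(); // evict least frequent
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.remove().getKey());
        Collections.reverse(result); // most frequent first
        return result;
    }
}
```

In the threaded version described above, each worker's map would be serialized to its own file first; this sketch shows only the final in-memory combination.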
I would go with the following solution:
Store all words in a Map<String, Integer>, where the key is a word and the Integer is that word's counter.
Then running a selection algorithm on the values gives you as many most-frequent words as you need.
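The selection step can be sketched with quickselect over the counter values, which finds the K-th largest count in O(n) average time. This is a sketch, not part of the original answer:

```java
import java.util.Random;

// Quickselect: find the k-th largest value in an array of word counts.
public class Quickselect {
    private static final Random RND = new Random();

    // k is 1-based: k = 1 returns the maximum.
    public static int kthLargest(int[] a, int k) {
        int lo = 0, hi = a.length - 1, target = k - 1; // index in descending order
        while (true) {
            int p = partition(a, lo, hi);
            if (p == target) return a[p];
            if (p < target) lo = p + 1;
            else hi = p - 1;
        }
    }

    // Partition a[lo..hi] in descending order around a random pivot;
    // return the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RND.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] > pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

Once the K-th largest count is known, one more pass over the map collects every word whose count reaches that threshold.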
- Ivan, April 22, 2015
Assumptions:
1. We cannot load all the files into memory (1 TB total).
2. "Word" means a whole word (the text "killer" contains just the one word "killer"; it does not contain the word "kill").
1. Load several source files into memory (for example 100 files, about 1 GB).
2. Build a tree (an RB-tree) where the key is a word and the value is its number of occurrences.
3. Save it into a new temp file, one line per entry in the format "word number".
4. Repeat steps 1-3 until all files are processed.
5. Open 2 temp files and use an external merge sort, reading one line from each file; when the same word appears in both, merge the two lines into one record. Write the result into a new temp file, maintaining the most frequent word(s) along the way.
6. Delete the processed temp files.
7. Repeat until one file remains.
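The merge of step 5 can be sketched as follows, assuming each temp file's lines have the form "word number" and are sorted by word (which an in-order traversal of the RB-tree in step 2 guarantees). The class name and helper are illustrative:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;

// Merge two sorted temp files of "word number" lines; equal words
// are collapsed into one record with their counts summed.
public class TempFileMerge {

    public static void merge(BufferedReader a, BufferedReader b, BufferedWriter out)
            throws IOException {
        String la = a.readLine(), lb = b.readLine();
        while (la != null && lb != null) {
            String wa = la.split(" ")[0], wb = lb.split(" ")[0];
            int cmp = wa.compareTo(wb);
            if (cmp < 0) {
                out.write(la + "\n");
                la = a.readLine();
            } else if (cmp > 0) {
                out.write(lb + "\n");
                lb = b.readLine();
            } else {
                // Same word in both files: emit one record with summed counts.
                out.write(wa + " " + (count(la) + count(lb)) + "\n");
                la = a.readLine();
                lb = b.readLine();
            }
        }
        // Copy whatever remains of the longer file.
        while (la != null) { out.write(la + "\n"); la = a.readLine(); }
        while (lb != null) { out.write(lb + "\n"); lb = b.readLine(); }
        out.flush();
    }

    private static int count(String line) {
        return Integer.parseInt(line.split(" ")[1]);
    }
}
```

Tracking the most frequent word(s) of step 5 amounts to comparing each emitted record's count against the current best as it is written.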