Flipkart Interview Question
SDE-2
Country: India
Interview Type: In-Person
I think a trie would be a fine alternative to a hash table here, assuming there are no memory issues.
The problem here is how you will get the frequency of words without keeping all 200 GB of data in memory at one time.
ak - Yes, that is why we would be reading from a buffered file reader. The hashtable and MinHeap should have no problem fitting into 1 GB of memory. There are only about 100k commonly used words in the English language, which is far below our threshold.
If all the strings/words are different, you will end up keeping all 200 GB of words in memory (in the form of a hash table), which is not possible.
Exactly. If all the strings are different, this logic doesn't work. You cannot keep all those strings in memory in a hashtable.
Here is what I think might help:
1. Read 1 GB of data from disk, sort it, and write it back to disk as group0.
Similarly prepare group1, group2, ..., group199.
2. Read the first (1 GB / 200) of each group. Keep a count for each string and insert it into a min-heap (which holds only k elements) where strings are ordered by frequency count. If a newly read string from disk has a greater count than the minimum on the heap, remove the minimum and insert the new one.
Of course, we need to handle a few cases here:
1. A string read from disk is already present in the heap. This can be handled, but a plain heap won't work in this case, since the string might be anywhere in the heap. We can use a customized hashmap instead, where we ourselves keep track of the latest minimum seen.
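For concreteness, step 1 above might look like the minimal sketch below, assuming whitespace-separated words and using a `chunkWords` parameter as a stand-in for "as many words as fit in ~1 GB" (class and method names are illustrative):

```java
import java.io.*;
import java.util.*;

// Sketch of step 1: read the input in chunks that fit in memory, sort each
// chunk, and write it back to disk as a sorted run (group0, group1, ...).
public class ChunkSorter {
    // chunkWords approximates "as many words as fit in ~1 GB"
    static List<File> sortIntoRuns(BufferedReader in, int chunkWords) throws IOException {
        List<File> runs = new ArrayList<>();
        List<String> buf = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null)
            for (String w : line.split("\\s+")) {
                if (w.isEmpty()) continue;
                buf.add(w);
                if (buf.size() >= chunkWords)
                    runs.add(flush(buf, runs.size()));   // chunk full: spill a sorted run
            }
        if (!buf.isEmpty())
            runs.add(flush(buf, runs.size()));           // spill the final partial chunk
        return runs;
    }

    // Sort the buffered words and write them to a temp file, one per line.
    private static File flush(List<String> buf, int idx) throws IOException {
        Collections.sort(buf);
        File f = File.createTempFile("group" + idx + "-", ".txt");
        try (PrintWriter pw = new PrintWriter(new FileWriter(f))) {
            for (String w : buf) pw.println(w);
        }
        buf.clear();
        return f;
    }
}
```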
@ysoseriou
You are right up to the point:
>> Read 1 Gb of data from disk, sort them and put them back onto the disk as group0. Similarly prepare group1, group2,.....group199.
Further to this, you can do an external merge sort to sort all 200 groups. Once you have the whole data sorted, you can use a min-heap of size k to maintain the top k most frequent strings as you read the data in a stream.
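A sketch of that second phase: once the data is globally sorted, equal strings are adjacent, so a single streaming pass can count each run and feed a size-k min-heap (names here are illustrative):

```java
import java.util.*;

// Sketch: given words in globally sorted order (e.g. from an external merge
// sort), count each run of equal words and keep only the k most frequent.
public class TopKSorted {
    static List<String> topK(Iterator<String> sorted, int k) {
        // min-heap on count: the least frequent of the current top k sits at
        // the root, so it is cheap to evict when a bigger run arrives
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        String prev = null;
        long run = 0;
        while (sorted.hasNext()) {
            String w = sorted.next();
            if (w.equals(prev)) { run++; continue; }
            offer(heap, prev, run, k);   // run of prev has ended
            prev = w;
            run = 1;
        }
        offer(heap, prev, run, k);       // flush the final run
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : heap) result.add(e.getKey());
        return result;
    }

    private static void offer(PriorityQueue<Map.Entry<String, Long>> heap,
                              String word, long count, int k) {
        if (word == null) return;            // no run seen yet
        heap.offer(new AbstractMap.SimpleEntry<>(word, count));
        if (heap.size() > k) heap.poll();    // evict current least frequent
    }
}
```

Because every occurrence of a word is contiguous in the sorted stream, each word's total count is known the moment its run ends, which is what makes the size-k min-heap safe here.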
Using a min-heap won't work here, because a word popped off the min-heap due to a low count at the current point may later turn out to occur more times than the word that pushed it out.
So a min-heap alone won't work for this problem.
We can use a "trie" data structure here. As we read the file, insert each word into the trie, and at its last character store a counter for that string. But retrieving the counts becomes a problem. I think we can use some other data structure along with the trie to get the most frequent words.
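A minimal sketch of such a counting trie (names are illustrative; the "other data structure" for retrieving the top words is left out):

```java
import java.util.*;

// Sketch of the commenter's idea: a trie where the node reached by the last
// character of a word stores that word's frequency count.
class TrieCounter {
    private final Map<Character, TrieCounter> children = new HashMap<>();
    private long count = 0;

    // Walk/extend the path for the word and bump the counter at its end.
    void insert(String word) {
        TrieCounter node = this;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, x -> new TrieCounter());
        node.count++;
    }

    // Return how many times the exact word was inserted (0 if never).
    long countOf(String word) {
        TrieCounter node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return 0;
        }
        return node.count;
    }
}
```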
Here is my pseudo-code:
While (!EOF)
    Read 1 GB of data
    Get words one by one (use a tokenizer to split on spaces or any
    punctuation characters)
    Check if the file "word".txt exists  // the "word" from the previous line
    if exists
        // change only the access time
        touch -a "word".txt
    else
        // create the file
        touch "word".txt
End
// Files will have their access time stamps updated based on occurrence:
// the more frequently a word appeared, the more recent its access time
// stamp will be relative to the current time
// Read the file listing of this directory in descending order of access time stamp
One problem with this solution is that it could create an absolutely enormous number of files. In the event that most of the strings are distinct and short, this solution might create billions of files, overwhelming just about any operating system. Directory file lists are not optimized to serve as a giant key-value store.
This also does not produce the correct output. At the end, you end up with the K most recently seen words, not the K most frequent words. While frequent words are more likely to be recently seen than infrequent ones, you will not always get the correct answer.
Appreciate the comments, eugene.
I did consider the number of files. I did think through what conventional wisdom would say (to combine data structures).
I wanted to persist the frequency of words in O(1) so it can work for any K input, and printing the top K will be O(n), leveraging the file system's sort, etc.
Coming back to the comments:
1. The OS will create files as long as the disk can accommodate them; the OS doesn't have any upper limit on file counts (why would the OS be overwhelmed, unless Andrew Tanenbaum has a technical explanation of the limitation ;-)).
2. Empty *.txt files won't take a lot of disk space.
3. This solution just creates files on disk; it's not even suggesting any I/O. I didn't suggest using the file system as a key-value store.
1. Just because you didn't hear about a limitation in your OS textbook doesn't mean it doesn't exist in practice. There is no hard technical reason for why an OS couldn't accommodate billions of files in theory; it's that most OSes are not optimized for it. Go ahead and try to create, say, even a mere 100 million files in the same directory. See how your OS responds.
2. On many OSes, an empty file takes as much disk space as one minimal disk allocation. This can often be something like 4 KB. If you think that's not a lot, think that you will require 4 KB for each distinct word, when the size of the word itself is a few orders of magnitude smaller. For 1 billion distinct words, you will require 4 TB.
3. Creating files on disk requires I/O. You didn't suggest using anything as key-value pairs, but that's effectively what you're doing with this solution.
4. You seem to think that something like touch -a word.txt takes O(1) time. But if the number of files in a directory gets large, the OS may not be optimized to do this operation in O(1) time. In any case, even if the operation were done using an efficient lookup, the constant factors here are absolutely enormous.
5. Why would the file system's sort be O(n)?
All in all, while I like that you're trying to think outside the box, this is not a solution you'd want to give in an interview. If you still don't believe me, try coding this solution and comparing its performance to some other approaches. Make sure to try on large inputs where billions of files would need to be created.
1. Divide the list into 200 divisions and sort each division.
2. Now take the first 1/200th part of each division, combine, and sort.
3. Extract the first 1/200th part of this resulting list and store it.
4. For the other divisions, use insertion sort and add the rest of the resulting list.
5. Now repeat steps 2-4 and append the resulting 1/200th part to the end of the list you stored initially.
6. After 200 such repetitions you will get the 1 GB of strings with the highest frequencies.
If you consider "m" as the RAM size and "n" as the total data:
step 1: (n/m) * m log(m) = n log(m)
step 2: (n/m) * m (insertion sort) = n
So it costs n + n log(m) to get each (n/m)-sized part of the sorted list; repeating for all the parts, the overall cost is roughly (n/m) * (n + n log(m)) ~ (n/m) * n log(m).
I think the solution below may work for this:
1) Fetch 1 GB of data from the file via a BufferedReader and add it into a hashtable with the word as key and its count as value.
2) Now write this hashtable to a file as (key, value) pairs in sorted order of value.
3) Repeat steps 1 & 2 for the complete 200 GB file.
4) Now, looping up to k times over the files (in this case, 200 files), retrieve the object with the maximum count value and add it to an ArrayList.
Any suggestions?
I think we are leaning towards merge and external sort. My question/advice is: do we really need to do this? It will definitely take time to sort 200 GB, and then when a new word string comes, we will have to do an insertion sort.
What I'm suggesting is to first look at what kind of file it is. If it has mostly repeated words, then we can build an in-memory trie (modified, of course) which keeps a count for each word, and a new word can update the count. In this case it won't be a problem to maintain an in-memory list of references to the top K most-used words. We can update this reference list on each new word insertion.
But if we hypothetically imagine that all (or most of) the words are different and unique, the trie will still handle it, but memory usage will depend on the length of the words and how diverse the words are compared with the existing words in the trie. I think we should use a trie with a count variable, and a list of references to the k most-used words would serve our purpose.
The way to do this is to bring the whole hashmap to disk:
1. Make a file hashes.txt.
*** LOOP 1
2. Keep 500 MB for the hashmap and 500 MB for file content (the hashmap can be at most about 500 MB for 500 MB of data).
3. Remove the file content from memory.
4. Make a new file temp.txt.
****** LOOP 2
5. Load the content of hashes.txt in chunks into the remaining 500 MB, and for every entry check whether its hash is found; add it to temp.txt.
****** END LOOP 2
6. Rename temp.txt to hashes.txt.
*** END LOOP 1
7. At last, load hashes.txt into memory in chunks and find the top k words. Use a max-heap.
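The LOOP 2 merge could be sketched roughly like this, assuming the on-disk table stores one word<TAB>count pair per line and that words seen for the first time are appended at the end (file, class, and method names are illustrative):

```java
import java.io.*;
import java.util.*;

// Sketch of the spill-to-disk merge: fold an in-memory chunk map into an
// on-disk counts file by rewriting it to temp.txt and renaming, as in the
// LOOP 2 step above.
public class DiskCounts {
    static void merge(Map<String, Long> chunk, File counts) throws IOException {
        File temp = new File(counts.getParent(), "temp.txt");
        try (PrintWriter out = new PrintWriter(new FileWriter(temp))) {
            if (counts.exists()) {
                try (BufferedReader in = new BufferedReader(new FileReader(counts))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] kv = line.split("\t");
                        // add this chunk's count (if any) to the stored count
                        long c = Long.parseLong(kv[1]) + chunk.getOrDefault(kv[0], 0L);
                        chunk.remove(kv[0]);            // merged, drop from memory
                        out.println(kv[0] + "\t" + c);
                    }
                }
            }
            // words seen for the first time in this chunk
            for (Map.Entry<String, Long> e : chunk.entrySet())
                out.println(e.getKey() + "\t" + e.getValue());
        }
        counts.delete();
        temp.renameTo(counts);
    }
}
```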
Michael.J.Keating, May 20, 2014
My suggestions:
- Use something like a Buffered File Reader to read from the 200 GB text file.
- Use a HashTable (where key = word, value = count) to track the usage count for each word (assuming these strings are words).
- Use a MinHeap (sorted on frequency count) to track the top K words. Fill the heap initially with K words. Then, if the next word has a higher frequency count, pop the min value off the heap and replace it with the next word. Repeat until done reading the 200 GB file.
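Putting those suggestions together, a minimal sketch, assuming (as discussed above) that the word-count table fits in 1 GB of memory (class and method names are illustrative):

```java
import java.io.*;
import java.util.*;

// Sketch: stream the input with a BufferedReader, count words in a HashMap,
// then keep the k most frequent words using a min-heap of size k.
public class TopKWords {
    static List<String> topK(Reader input, int k) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        BufferedReader br = new BufferedReader(input);
        String line;
        while ((line = br.readLine()) != null)
            for (String w : line.toLowerCase().split("\\W+"))   // crude tokenizer
                if (!w.isEmpty()) counts.merge(w, 1L, Long::sum);

        // min-heap on count: the root is the least frequent of the current
        // top k, so it gets evicted when a more frequent word arrives
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : heap) result.add(e.getKey());
        return result;
    }
}
```

Note that this only answers the objection above when the number of distinct words is small (e.g. natural-language text); if most strings are distinct, the HashMap itself outgrows memory and one of the external-sort approaches is needed instead.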