Flipkart Interview Question
SDE-2
Country: India
Interview Type: In-Person
I think a trie would be a fine alternative to a hash table here, assuming there are no memory issues.
The problem here is how you will get the frequency of words without keeping all 200 GB of data in memory at one time.
ak - Yes, that is why we would be reading from a buffered file reader. The hashtable and MinHeap should have no problem fitting into 1 GB of memory. There are only about 100k commonly used words in the English language, which is far below our threshold.
If all the strings/words are different, you will end up keeping all 200 GB of words in memory (in the form of a hash table), which is not possible.
Exactly. If all the strings are different, this logic doesn't work. You cannot keep all those strings in memory in a hashtable.
Here is what I think might help:
1. Read 1 GB of data from disk, sort it, and write it back to disk as group0.
Similarly prepare group1, group2, ..., group199.
2. Read the first (1 GB / 200) of each group. Keep a count for each string and insert it into a min-heap (which holds only k elements) where strings are ordered by frequency count. If a newly read string from disk has a greater count than the minimum on the heap, remove the minimum and insert the new one.
Of course, we need to handle a few cases here:
1. A string read from disk is already present in the heap. This can be handled, but a plain heap won't work in this case, since the string might be anywhere in the heap. We can use a customized hashmap instead, where we ourselves keep track of the latest minimum seen.
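For concreteness, step 1 above might look like the minimal sketch below, assuming whitespace-separated words and using a `chunkWords` parameter as a stand-in for "as many words as fit in ~1 GB" (class and method names are illustrative):

```java
import java.io.*;
import java.util.*;

// Sketch of step 1: read the input in chunks that fit in memory, sort each
// chunk, and write it back to disk as a sorted run (group0, group1, ...).
public class ChunkSorter {
    // chunkWords approximates "as many words as fit in ~1 GB"
    static List<File> sortIntoRuns(BufferedReader in, int chunkWords) throws IOException {
        List<File> runs = new ArrayList<>();
        List<String> buf = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null)
            for (String w : line.split("\\s+")) {
                if (w.isEmpty()) continue;
                buf.add(w);
                if (buf.size() >= chunkWords)
                    runs.add(flush(buf, runs.size()));   // chunk full: spill a sorted run
            }
        if (!buf.isEmpty())
            runs.add(flush(buf, runs.size()));           // spill the final partial chunk
        return runs;
    }

    // Sort the buffered words and write them to a temp file, one per line.
    private static File flush(List<String> buf, int idx) throws IOException {
        Collections.sort(buf);
        File f = File.createTempFile("group" + idx + "-", ".txt");
        try (PrintWriter pw = new PrintWriter(new FileWriter(f))) {
            for (String w : buf) pw.println(w);
        }
        buf.clear();
        return f;
    }
}
```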
@ysoseriou
You are right up to the point:
>> Read 1 Gb of data from disk, sort them and put them back onto the disk as group0. Similarly prepare group1, group2,.....group199.
Further to this, you can do an external merge sort to sort all 200 groups. Once you have the whole data sorted, you can use a min-heap of size k to maintain the top k most frequent strings as you read the data in a stream.
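A sketch of that second phase: once the data is globally sorted, equal strings are adjacent, so a single streaming pass can count each run and feed a size-k min-heap (names here are illustrative):

```java
import java.util.*;

// Sketch: given words in globally sorted order (e.g. from an external merge
// sort), count each run of equal words and keep only the k most frequent.
public class TopKSorted {
    static List<String> topK(Iterator<String> sorted, int k) {
        // min-heap on count: the least frequent of the current top k sits at
        // the root, so it is cheap to evict when a bigger run arrives
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        String prev = null;
        long run = 0;
        while (sorted.hasNext()) {
            String w = sorted.next();
            if (w.equals(prev)) { run++; continue; }
            offer(heap, prev, run, k);   // run of prev has ended
            prev = w;
            run = 1;
        }
        offer(heap, prev, run, k);       // flush the final run
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : heap) result.add(e.getKey());
        return result;
    }

    private static void offer(PriorityQueue<Map.Entry<String, Long>> heap,
                              String word, long count, int k) {
        if (word == null) return;            // no run seen yet
        heap.offer(new AbstractMap.SimpleEntry<>(word, count));
        if (heap.size() > k) heap.poll();    // evict current least frequent
    }
}
```

Because every occurrence of a word is contiguous in the sorted stream, each word's total count is known the moment its run ends, which is what makes the size-k min-heap safe here.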
Using a min-heap won't work here, because a word popped off the min-heap due to a low count at the current point may later turn out to occur more times than the word that pushed it out.
So a min-heap alone won't work for this problem.
We can use a "trie" data structure here. As we read the file, insert each word into the trie, and at its last character store a counter for that string. But retrieving the counts becomes a problem. I think we can use some other data structure along with the trie to get the most frequent words.
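A minimal sketch of such a counting trie (names are illustrative; the "other data structure" for retrieving the top words is left out):

```java
import java.util.*;

// Sketch of the commenter's idea: a trie where the node reached by the last
// character of a word stores that word's frequency count.
class TrieCounter {
    private final Map<Character, TrieCounter> children = new HashMap<>();
    private long count = 0;

    // Walk/extend the path for the word and bump the counter at its end.
    void insert(String word) {
        TrieCounter node = this;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, x -> new TrieCounter());
        node.count++;
    }

    // Return how many times the exact word was inserted (0 if never).
    long countOf(String word) {
        TrieCounter node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return 0;
        }
        return node.count;
    }
}
```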
Here is my pseudo-code:
While (!EOF)
    Read 1 GB of data
    Get words one by one (use a tokenizer to split on spaces or any
    punctuation characters)
    Check if the file "word".txt exists  // the "word" from the previous line
    if exists
        // change only the access time
        touch -a "word".txt
    else
        // create the file
        touch "word".txt
End
// Files will have their access time stamps updated based on occurrence:
// the more frequently a word appeared, the more recent its access time
// stamp will be relative to the current time
// Read the file listing of this directory in descending order of access time stamp
One problem with this solution is that it could create an absolutely enormous number of files. In the event that most of the strings are distinct and short, this solution might create billions of files, overwhelming just about any operating system. Directory file lists are not optimized to serve as a giant key-value store.
This also does not produce the correct output. At the end, you end up with the K most recently seen words, not the K most frequent words. While frequent words are more likely to be recently seen than infrequent ones, you will not always get the correct answer.
Appreciate the comments, eugene.
I did consider the number of files. I did think through what conventional wisdom would say (to combine data structures).
I wanted to persist the frequency of words in O(1) so it can work for any K input, and printing the top K will be O(n), leveraging the file system's sort, etc.
Coming back to the comments:
1. The OS will create files as long as the disk can accommodate them; the OS doesn't have any upper limit on file counts (why would the OS be overwhelmed, unless Andrew Tanenbaum has a technical explanation of the limitation ;-)).
2. Empty *.txt files won't take a lot of disk space.
3. This solution just creates files on disk; it's not even suggesting any I/O. I didn't suggest using the file system as a key-value store.
1. Just because you didn't hear about a limitation in your OS textbook doesn't mean it doesn't exist in practice. There is no hard technical reason for why an OS couldn't accommodate billions of files in theory; it's that most OSes are not optimized for it. Go ahead and try to create, say, even a mere 100 million files in the same directory. See how your OS responds.
2. On many OSes, an empty file takes as much disk space as one minimal disk allocation. This can often be something like 4 KB. If you think that's not a lot, think that you will require 4 KB for each distinct word, when the size of the word itself is a few orders of magnitude smaller. For 1 billion distinct words, you will require 4 TB.
3. Creating files on disk requires I/O. You didn't suggest using anything as key-value pairs, but that's effectively what you're doing with this solution.
4. You seem to think that something like touch -a word.txt takes O(1) time. But if the number of files in a directory gets large, the OS may not be optimized to do this operation in O(1) time. In any case, even if the operation were done using an efficient lookup, the constant factors here are absolutely enormous.
5. Why would the file system's sort be O(n)?
All in all, while I like that you're trying to think outside the box, this is not a solution you'd want to give in an interview. If you still don't believe me, try coding this solution and comparing its performance to some other approaches. Make sure to try on large inputs where billions of files would need to be created.
1. Divide the list into 200 divisions and sort each division.
2. Now take the first 1/200th part of each division, combine, and sort.
3. Extract the first 1/200th part of this resulting list and store it.
4. For the other divisions, use insertion sort and add the rest of the resulting list.
5. Now repeat steps 2-4 and append the resulting 1/200th part to the end of the list you stored initially.
6. After 200 such repetitions you will get the 1 GB of strings with the highest frequencies.
If you consider "m" as the RAM size and "n" as the total data:
step 1: (n/m) * m log(m) = n log(m)
step 2: (n/m) * m (insertion sort) = n
So it costs n + n log(m) to get each (n/m)-sized part of the sorted list; repeating for all the parts, the overall cost is roughly (n/m) * (n + n log(m)) ~ (n/m) * n log(m).
I think the solution below may work for this:
1) Fetch 1 GB of data from the file via a BufferedReader and add it into a hashtable with the word as key and its count as value.
2) Now write this hashtable to a file as (key, value) pairs in sorted order of value.
3) Repeat steps 1 & 2 for the complete 200 GB file.
4) Now, looping up to k times over the files (in this case, 200 files), retrieve the object with the maximum count value and add it to an ArrayList.
Any suggestions?
I think we are leaning towards merge and external sort. My question/advice is: do we really need to do this? It will definitely take time to sort 200 GB, and then when a new word string comes, we will have to do an insertion sort.
What I'm suggesting is to first look at what kind of file it is. If it has mostly repeated words, then we can build an in-memory trie (modified, of course) which keeps a count for each word, and a new word can update the count. In this case it won't be a problem to maintain an in-memory list of references to the top K most-used words. We can update this reference list on each new word insertion.
But if we hypothetically imagine that all (or most of) the words are different and unique, the trie will still handle it, but memory usage will depend on the length of the words and how diverse the words are compared with the existing words in the trie. I think we should use a trie with a count variable, and a list of references to the k most-used words would serve our purpose.
The way to do this is to bring the whole hashmap to disk:
1. Make a file hashes.txt.
*** LOOP 1
2. Keep 500 MB for the hashmap and 500 MB for file content (the hashmap can be at most about 500 MB for 500 MB of data).
3. Remove the file content from memory.
4. Make a new file temp.txt.
****** LOOP 2
5. Load the content of hashes.txt in chunks into the remaining 500 MB, and for every entry check whether its hash is found; add it to temp.txt.
****** END LOOP 2
6. Rename temp.txt to hashes.txt.
*** END LOOP 1
7. At last, load hashes.txt into memory in chunks and find the top k words. Use a max-heap.
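The LOOP 2 merge could be sketched roughly like this, assuming the on-disk table stores one word<TAB>count pair per line and that words seen for the first time are appended at the end (file, class, and method names are illustrative):

```java
import java.io.*;
import java.util.*;

// Sketch of the spill-to-disk merge: fold an in-memory chunk map into an
// on-disk counts file by rewriting it to temp.txt and renaming, as in the
// LOOP 2 step above.
public class DiskCounts {
    static void merge(Map<String, Long> chunk, File counts) throws IOException {
        File temp = new File(counts.getParent(), "temp.txt");
        try (PrintWriter out = new PrintWriter(new FileWriter(temp))) {
            if (counts.exists()) {
                try (BufferedReader in = new BufferedReader(new FileReader(counts))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] kv = line.split("\t");
                        // add this chunk's count (if any) to the stored count
                        long c = Long.parseLong(kv[1]) + chunk.getOrDefault(kv[0], 0L);
                        chunk.remove(kv[0]);            // merged, drop from memory
                        out.println(kv[0] + "\t" + c);
                    }
                }
            }
            // words seen for the first time in this chunk
            for (Map.Entry<String, Long> e : chunk.entrySet())
                out.println(e.getKey() + "\t" + e.getValue());
        }
        counts.delete();
        temp.renameTo(counts);
    }
}
```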
Michael.J.Keating, May 20, 2014
My suggestions:
- Use something like a Buffered File Reader to read from the 200 GB text file.
- Use a HashTable (where key = word, value = count) to track the usage count for each word (assuming these strings are words).
- Use a MinHeap (sorted on frequency count) to track the top K words. Fill the heap initially with K words. Then, if the next word has a higher frequency count, pop the min value off the heap and replace it with the next word. Repeat until done reading the 200 GB file.
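Putting those suggestions together, a minimal sketch, assuming (as discussed above) that the word-count table fits in 1 GB of memory (class and method names are illustrative):

```java
import java.io.*;
import java.util.*;

// Sketch: stream the input with a BufferedReader, count words in a HashMap,
// then keep the k most frequent words using a min-heap of size k.
public class TopKWords {
    static List<String> topK(Reader input, int k) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        BufferedReader br = new BufferedReader(input);
        String line;
        while ((line = br.readLine()) != null)
            for (String w : line.toLowerCase().split("\\W+"))   // crude tokenizer
                if (!w.isEmpty()) counts.merge(w, 1L, Long::sum);

        // min-heap on count: the root is the least frequent of the current
        // top k, so it gets evicted when a more frequent word arrives
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : heap) result.add(e.getKey());
        return result;
    }
}
```

Note that this only answers the objection above when the number of distinct words is small (e.g. natural-language text); if most strings are distinct, the HashMap itself outgrows memory and one of the external-sort approaches is needed instead.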