Adobe Interview Question
Software Engineer / Developer, Country: India
// I am assuming the file contains only lowercase letters; we can also handle
// all symbols by enlarging the array.
int a[26] = {0};
int ch, i;
FILE *fp = fopen("abc.txt", "r");
while ((ch = fgetc(fp)) != EOF)   /* fgetc returns EOF at end of file, not NULL */
    if (ch >= 'a' && ch <= 'z')
        a[ch - 'a']++;
fclose(fp);
// sort the array in decreasing order (note: pair each count with its letter
// first if you also need to print which letter each count belongs to)
qsort(a, 26, sizeof(int), desc);   // desc: comparator returning *(const int *)y - *(const int *)x
// print the first 10 entries from the array
for (i = 0; i < 10; i++)
    printf("%d\n", a[i]);
Please correct me if I am wrong :)
Suppose your input is: my name is dhaker i am from warangal my branch is computer science and enggineering i want placement in adobe.
int n = 0;                        // number of distinct words seen so far
// for all i: count[i] = 0;
char *ch = strtok(str, " ");
while (ch)
{
    hash(ch);
    ch = strtok(NULL, " ");       // subsequent strtok calls must pass NULL
}
// partial bubble sort: 10 passes, each pushing the next-largest count
// (and its word) to the end of the arrays
for (int i = 0; i < 10; i++)
{
    for (int j = 0; j < n - 1 - i; j++)
        if (count[j] > count[j + 1])
        {
            swap(count[j], count[j + 1]);
            swap(s[j], s[j + 1]); // keep words paired with their counts
        }
    cout << s[n - 1 - i] << " ";
}
void hash(char *c)
{
    for (int i = 0; i < n; i++)
        if (strcmp(c, s[i]) == 0) { count[i]++; return; }
    strcpy(s[n], c);              // new word: add it with count 1
    count[n] = 1;
    n++;
}
If the file is small enough to fit in memory then the solution is straightforward ...
Use one min-heap of size 10 and a HashMap<String, Integer> mapping each word to its count.
heap = new MinHeap(k);   // k = 10 here
for each item in the hash map
// if you don't already have k items on the heap, add this one
if (heap.count < k)
heap.Add(item)
else if (item.frequency > heap.Peek().frequency)
{
// The new item's frequency is greater than the lowest frequency
// already on the heap. Remove the item from the heap
// and add the new item.
heap.RemoveRoot();
heap.Add(item);
}
But if it is a big file (a book, etc.), then it is trickier to find the top 10 words. I would use the SPACE-SAVING algorithm.
SPACESAVING(k)
n ← 0;
T ← ∅;
foreach i do
    n ← n + 1;
    if i ∈ T then            // if the word already exists, increase its count
        ci ← ci + 1;
    else if |T| < k then     // if T has fewer than k entries, add the new word
        T ← T ∪ {i};
        ci ← 1;              // a new word's count starts at 1
    else
        j ← arg minj∈T cj;   // if T is full, remove the least frequent word j
        ci ← cj + 1;         // the new word inherits the removed word's count + 1
        T ← T ∪ {i}\{j};
Here is a link to the original research paper:
dimacs.rutgers.edu/~graham/pubs/papers/freqcacm.pdf
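A compact C++ sketch of the SPACESAVING pseudocode above (my own illustration; `SpaceSaving`/`offer` are made-up names, and the linear `min_element` scan stands in for the paper's Stream-Summary structure):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>

// Space-Saving: keep at most k counters; when full, evict the minimum
// counter and credit the incoming item with that count plus one.
class SpaceSaving {
    size_t k_;
    std::unordered_map<std::string, long> counts_;  // T with counts c_i
public:
    explicit SpaceSaving(size_t k) : k_(k) {}

    void offer(const std::string &item) {
        auto it = counts_.find(item);
        if (it != counts_.end()) {
            it->second++;              // word already tracked: bump its count
        } else if (counts_.size() < k_) {
            counts_[item] = 1;         // room left: start a new counter at 1
        } else {
            // Full: find the minimum counter j, replace it with the new
            // item, and give the new item c_j + 1 (a deliberate overestimate).
            auto min_it = std::min_element(
                counts_.begin(), counts_.end(),
                [](const auto &a, const auto &b) { return a.second < b.second; });
            long cj = min_it->second;
            counts_.erase(min_it);
            counts_[item] = cj + 1;
        }
    }

    const std::unordered_map<std::string, long> &counters() const {
        return counts_;
    }
};
```

Memory stays bounded at k counters no matter how long the stream is, which is the whole point for files too big for an exact hash table.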
1. Put in a hash table with <key,value> = <words, freq>
2. Iterate through the elements of the hash table to find the 10 most frequent
Requires O(number of distinct words) space and O(total number of words) time.
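The two steps above can be sketched with `std::unordered_map` plus `std::partial_sort` (a minimal sketch; `topWords` is a made-up name):

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Step 1: count every word in a hash table.
// Step 2: partially sort so only the first n slots are ordered by frequency.
std::vector<std::pair<std::string, int>> topWords(const std::string &text,
                                                  size_t n) {
    std::unordered_map<std::string, int> freq;
    std::istringstream in(text);
    for (std::string w; in >> w;) freq[w]++;

    std::vector<std::pair<std::string, int>> v(freq.begin(), freq.end());
    n = std::min(n, v.size());
    // O(D log n) over D distinct words: cheaper than a full sort.
    std::partial_sort(v.begin(), v.begin() + n, v.end(),
                      [](const auto &a, const auto &b) {
                          return a.second > b.second;
                      });
    v.resize(n);
    return v;
}
```

`partial_sort` avoids ordering the entire table when only the top 10 entries are needed.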
The answer seems trivial, but if the file size reaches megabytes the program may slow down, since thousands of hashes need to be computed. There might be a better solution, which I am still thinking about.
For generating the top 10 from the hash table's elements, follow the method below:
1. Initialize a min-heap with the first ten frequency values from the hash table.
2. As you insert each element into the hash table, compare the heap's minimum element (an O(1) operation) to the updated frequency of the element just inserted.
3. If it is larger, extract the min element from the heap and insert the current element, an O(log k) heapify operation.
4. If it is smaller, no heap operation is required.
The top-K elements can be maintained synchronously with each hash table operation and kept updated in real time at a cost of at most O(log K) per update, where K = 10 in this case.
This problem becomes difficult when the file is really big...
One solution could be to partition the file into multiple segments, compute a frequent-word list (using a hash table) for each segment, and merge them.
This way, the space complexity is minimized: if you split the file into n segments and store the top K frequent words for each segment, you need n*K space, which is less than storing every word of the file in one hash table.
One obvious danger is that some words may not make it into any segment's frequent-word list and would therefore be missing from the final list.
It is possible that a few words ignored in every segment would have made the true top-K list if not for this approximation.
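The partition-and-merge idea above can be sketched like this (my own illustration; `topKOfSegment`/`mergeTopK` are made-up names, and the lossiness the caveat describes is inherent: a word dropped from every segment's top-k never reaches the merge):

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Counts = std::vector<std::pair<std::string, int>>;

// Exact top-k word counts for one segment, sorted by descending frequency.
Counts topKOfSegment(const std::string &segment, size_t k) {
    std::unordered_map<std::string, int> freq;
    std::istringstream in(segment);
    for (std::string w; in >> w;) freq[w]++;
    Counts v(freq.begin(), freq.end());
    std::sort(v.begin(), v.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });
    if (v.size() > k) v.resize(k);   // truncation is where words can be lost
    return v;
}

// Merge the per-segment lists (at most n*k entries) and re-select the top k.
Counts mergeTopK(const std::vector<std::string> &segments, size_t k) {
    std::unordered_map<std::string, int> merged;
    for (const auto &seg : segments)
        for (const auto &p : topKOfSegment(seg, k))
            merged[p.first] += p.second;
    Counts v(merged.begin(), merged.end());
    std::sort(v.begin(), v.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });
    if (v.size() > k) v.resize(k);
    return v;
}
```

Peak memory is bounded by one segment's hash table plus the n*k merged entries, rather than the whole file's vocabulary.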
Use a trie to count the occurrences of the words, then traverse the trie to collect the frequency values at the word-terminating nodes. Insert the word/frequency objects into a max-heap and pop 10.
- kunikos January 01, 2013
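The trie approach above can be sketched like this (a minimal sketch assuming lowercase a-z words; `TrieNode`/`topKTrie` are my own names):

```cpp
#include <memory>
#include <queue>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Trie over lowercase words; each word-terminating node stores its count.
struct TrieNode {
    std::unique_ptr<TrieNode> child[26];
    int count = 0;
};

void insertWord(TrieNode *root, const std::string &w) {
    TrieNode *node = root;
    for (char c : w) {
        int i = c - 'a';
        if (!node->child[i]) node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    node->count++;  // the word ends at this node
}

// DFS: collect (count, word) for every node that terminates a word.
void collect(const TrieNode *node, std::string &prefix,
             std::vector<std::pair<int, std::string>> &out) {
    if (node->count > 0) out.push_back({node->count, prefix});
    for (int i = 0; i < 26; i++)
        if (node->child[i]) {
            prefix.push_back('a' + i);
            collect(node->child[i].get(), prefix, out);
            prefix.pop_back();
        }
}

std::vector<std::string> topKTrie(const std::string &text, size_t k) {
    TrieNode root;
    std::istringstream in(text);
    for (std::string w; in >> w;) insertWord(&root, w);

    std::vector<std::pair<int, std::string>> all;
    std::string prefix;
    collect(&root, prefix, all);

    // Max-heap keyed on count; pop the top k words.
    std::priority_queue<std::pair<int, std::string>> heap(all.begin(), all.end());
    std::vector<std::string> out;
    for (size_t i = 0; i < k && !heap.empty(); i++) {
        out.push_back(heap.top().second);
        heap.pop();
    }
    return out;
}
```

Compared with a hash table, the trie shares prefixes between words, which can save space when the vocabulary has many words with common stems.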