Microsoft Interview Question
Software Engineer / DevelopersUse a hash table
1st level => hashed with 1st alphabet of the word and
2nd level => length of word
list *h[26][MAX_LENGTH_OF_A_WORD]
struct list {
char *word;
unsigned short count;
list *next
}
keep on inserting words in hash table while reading one by one from file.
For each word found again, increment the count.
At the end, search the hash table for top 10 counts.
Performance :-
each word needs to be compared with only those words which have starting character same and length same.
Memory => Maximum for hash table = 26 * <max length of an english word>
+ memory to store these words.
You can also use a BST based on lexical ordering of words.
Read words from the file one by one. Maintain a word_count node in each field. As you encounter new words add them to the BST with a count = 1.
After reading the entire file. traverse the BST inorder. The last 10 words have the highest frequency
@Vel, unfortunately just having an array[10] doesn't work out. Because, "frequently occurring" is relative to other words in the document.
@abybaby, a BST could go skew-ing. So, instead we could use heap - max heap, based on the frequency count. That way for each word in the document, we need to either insert it or update its frequency. After all words in the document are processed, we extract the top 10 words from the heap.
The document processing part takes O(nlgn) time and extraction of words takes O(lgn) time. Memory requirement = O(n). [n = number of words in the document]
@ khexa
If my interpretation is not wrong, you recommended using only a Heap. Heap is definitely required here but its not the only data structure you will use. The key for maintaining Max Heap is count i.e. the frequency. But how do you index or increment the count every time you read a word from the file?
For example
Hello Bonjour Hi Hey Namaste Bonjour Bonjour Hey
If you insert these words in the Max Heap/PQ then the heap property is built based on the content of the word & not the count. So your Heap will look like this :-
Namaste
Hi Hey
Hey Hello Bonjour Bonjour
NOTE - This is not the exact output of Heap but a rough sketch of the way heap data will be organized i.e based on content of the word & not the frequency of the word
You ALSO need a complete binary tree. Apart from left & right pointers your Binary node should have 2 metadata
1) char *pWord i.e the content to the word. This will be used as the key for the binary tree insertion & search
2) int frequency
Every time you read a word from the file do these
1) Search for a Binary Node for this word.
2) If Binary Node is found increment its frequency.
3) Else create a binary node & insert at the proper place in the tree. (with frequency=1)
4) Now, try to push this binary node in a Heap of constant size 10. The key for Push or Pop operation of Heap is frequency.
Finally, you can choose to keep an array of size 10 & push or overwrite. But a heap will be faster because everytime you read word you have to push it inside the Array. Heap will be efficient because it will minimize the data movement during Heapify i.e Bottom-Up as compared to an array.
if want more optimization during Push operation for data move (“ShiftUp” for pointers implement the Heap as a double link list so that its Bottom-Up is only a changing of the links.
The above can be done by taking each node in a heap as a structure which contains the string and the variable count. Modify the heap functions so that when you want to search for an element it is searched on the string and when you want to heapify it it is done using the count. So, when you want to add a string first check in the heap if it is there increment the count else put in the heap with count as 1 and heapify.
this can be done by using simple array...
since we know the range of characters that the given string can have.
allocate an array of that size.
@use counting sort technique.
initially fill the whole array with 0.
keep on increamenting the vlaue at index i if i happens to be the ascii value of a character.
at the end check for the top 10 values .
This is quite efficient(provided the range of characters is small say a-z)
This is generally done using a method called map reduce....
In the map phase we create tuples of <word,1> and in the reduce phase we group the tuples with the same word and update the count....this is a method widely used by google for their search....
This programs is implementation using heap and binary tree.
Please comment...
public void AddNode(String value,Int16 frequency)
{
HeapNode heapNode = new HeapNode(value,frequency);
Int16 selection = 0;
Node prev, cur, temp;
temp = new Node(value, frequency);
if (root == null)
{ //inserting the first word to the heap
if (Insert(heapNode))
{
temp.Selected = true;
}
root = temp;
return;
}
prev = null;
cur = root;
while (cur != null)
{
prev = cur;
if (value.CompareTo(cur.Word) > 0)
{
cur = cur.Right;
}
else if (value.CompareTo(cur.Word) < 0)
{
cur = cur.Left;
}
else
{
cur.Frequency = (Int16)(cur.Frequency + 1);
selection = 3;
break;
}
}
if (value.CompareTo(prev.Word)<0)
{
prev.Left = temp;
selection = 1;
}
else if (value.CompareTo(prev.Word) > 0)
{
prev.Right = temp;
selection = 2;
}
//section to make the heap.
if (selection == 1 && heapNodeSize<10)
{
if (Insert(heapNode))
{
prev.Left.Selected = true;
}
}
else if (selection == 2 && heapNodeSize < 10)
{
if (Insert(heapNode))
{
prev.Right.Selected = true;
}
}
//frequency is greater than 1
else if (selection == 3 && heapNodeSize < 10)
{
//already in heap. here we are increasing the frequency and reheapify.
if (cur.Selected == true)
{
int i = 0;
for (i = 0; i < heapNodeSize; ++i)
{
if (array[i].Word.CompareTo(cur.Word) == 0)
{
array[i].Frequency=cur.Frequency;
break;
}
}
BuildMinHeap();
}
else
{
//it is not already in the heap but frequency is greater than one.
//checking whether frequency is higher than the ExtractMinimum node from the Heap
HeapNode min = ExtractMinimum();
if (min.Frequency < cur.Frequency)
{
HeapNode deletedHeapNode= DeleteElement();
SearchAndUpdateNode(deletedHeapNode.Word);
if (Insert(new HeapNode(cur.Word,cur.Frequency)))
{
cur.Selected = true;
}
}
}
}
}
public HeapNode ExtractMinimum()
{
if (heapNodeSize>0)
return array[0];
return null;
}
public void BuildMinHeap()
{
for (int i = 0; i < (heapNodeSize / 2); ++i)
{
MinHeapify(i);
}
}
public Boolean Insert(HeapNode node)
{
if(heapNodeSize <9)
{
heapNodeSize++;
array[heapNodeSize] = node;
SiftUp(heapNodeSize);
}
else
{
return false;
}
return true;
}
public void SiftUp(int nodeIndex)
{
int parentIndex = 0;
HeapNode tmp=null;
if (nodeIndex != 0)
{
parentIndex = (nodeIndex-1)/2;
if (array[parentIndex].Frequency > array[nodeIndex].Frequency)
{
//exchange
tmp = array[parentIndex];
array[parentIndex] = array[nodeIndex];
array[nodeIndex] = tmp;
SiftUp(parentIndex);
}
}
}
//if the frequency of a heap node comes haigher than the root node, then remove the root node and insert the coming node
//then set the Selected variable to false in the tree
public HeapNode DeleteElement()
{
HeapNode temp;
temp = array[0];
array[0] = array[heapNodeSize];
heapNodeSize--;
MinHeapify(0);
return temp;
}
//siftdown
public void MinHeapify( int i)
{
int left = 2 * i + 1;
int right = 2 * i + 2;
int min = 0;
//heap size is 10
if (left <= 9 && array[left].Frequency < array[i].Frequency)
{
min = left;
}
else
{
min = i;
}
if (right <= 9 && array[right].Frequency < array[min].Frequency)
{
min = right;
}
if (min != i)
{
HeapNode temp = array[i];
array[i] = array[min];
array[min] = temp;
MinHeapify(min);
}
}
public bool SearchAndUpdateNode(String value)
{
Node curr = root;
while (curr != null)
{
if (value.CompareTo(curr.Word) == 0)
{
curr.Selected = false;
return true;
}
if (value.CompareTo(curr.Word) < 0)
{
curr = curr.Left;
}
else
{
curr = curr.Right;
}
}
return false;
}
Sorry, I forget to explain the algorithm
1. Reading each words.
2. Add it to the tree
3.a. If it is a new node, try to insert to the heap.
3.b.If the heap is full, then it will return null
3.c Else it will add the node to the heap and sift up the node.
4.a. if it is not already in the heap but frequency is greater than one, then will
check whether frequency is higher than the ExtractMinimum node from the Heap.
4.b. If the frequency is greater than the minimum node from the heap,then delete the minimum node from the heap and insert the
current node.And sift up the node.
4.c Update the variable 'selected' of node to false.
This will help to select the node to be inserted to the heap during next addnode call ,
if it got its frequency higher than the min node of heap.
5.a. If the word is already in the heap and its frequency will be increased.
5.b. Heapfy it again
Finally the heap will be having 10 mostly occured words.
1. As you are reading words from the file, maintain its count in the hash table.
- algooz April 28, 20082. Keep a k (=10) size min-heap, and whenever you update a word in the hash table and increase its count ci, compare ci with the top element of the heap kt.
3. If ci > kt then replace kt with ci and heapify.
4. At the end when you have traversed the whole of doc you will have 10 most freq element in the heap.