Given a document of text, retu

Microsoft Interview Question for Software Engineer / Developers

0

of 0 votes

17
Answers
Given a document of text, return the 10 most frequently occurred words in the text.
- vodangkhoa March 29, 2007 | Report Duplicate | Flag | PURGE
Microsoft Software Engineer / Developer Algorithm

Email me when people comment.

An error occurred in subscribing you.

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 1 vote

1. As you are reading words from the file, maintain its count in the hash table.
2. Keep a k (=10) size min-heap, and whenever you update a word in the hash table and increase its count ci, compare ci with the top element of the heap kt.
3. If ci > kt then replace kt with ci and heapify.
4. At the end when you have traversed the whole of doc you will have 10 most freq element in the heap.

- algooz April 28, 2008 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

hash

- Anonymous July 05, 2007 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Use a hash table
1st level => hashed with 1st alphabet of the word and
2nd level => length of word

list *h[26][MAX_LENGTH_OF_A_WORD]
struct list {
char *word;
unsigned short count;
list *next
}
keep on inserting words in hash table while reading one by one from file.
For each word found again, increment the count.
At the end, search the hash table for top 10 counts.

Performance :-
each word needs to be compared with only those words which have starting character same and length same.
Memory => Maximum for hash table = 26 * <max length of an english word>
+ memory to store these words.

- AA September 03, 2007 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

The problem with your algorithm is that you have to traverse the entire hash table to find the 10 most frequently occurred words.

- Anonymous June 15, 2008 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Could you explain how you would do the search for the top 10?

thanks

- Anonymous October 15, 2007 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Maintain an array of 10 elements, each time you hash and update, check with this array. since 10 is a constant, it does not add to time or space complexity

- Vel November 02, 2007 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

You can also use a BST based on lexical ordering of words.

Read words from the file one by one. Maintain a word_count node in each field. As you encounter new words add them to the BST with a count = 1.

After reading the entire file. traverse the BST inorder. The last 10 words have the highest frequency

- abybaby January 04, 2008 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

@Vel, unfortunately just having an array[10] doesn't work out. Because, "frequently occurring" is relative to other words in the document.

@abybaby, a BST could go skew-ing. So, instead we could use heap - max heap, based on the frequency count. That way for each word in the document, we need to either insert it or update its frequency. After all words in the document are processed, we extract the top 10 words from the heap.

The document processing part takes O(nlgn) time and extraction of words takes O(lgn) time. Memory requirement = O(n). [n = number of words in the document]

- khexa January 30, 2008 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

@ khexa

If my interpretation is not wrong, you recommended using only a Heap. Heap is definitely required here but its not the only data structure you will use. The key for maintaining Max Heap is count i.e. the frequency. But how do you index or increment the count every time you read a word from the file?

For example
Hello Bonjour Hi Hey Namaste Bonjour Bonjour Hey

If you insert these words in the Max Heap/PQ then the heap property is built based on the content of the word & not the count. So your Heap will look like this :-

Namaste
Hi Hey
Hey Hello Bonjour Bonjour

NOTE - This is not the exact output of Heap but a rough sketch of the way heap data will be organized i.e based on content of the word & not the frequency of the word

You ALSO need a complete binary tree. Apart from left & right pointers your Binary node should have 2 metadata
1) char *pWord i.e the content to the word. This will be used as the key for the binary tree insertion & search
2) int frequency

Every time you read a word from the file do these
1) Search for a Binary Node for this word.
2) If Binary Node is found increment its frequency.
3) Else create a binary node & insert at the proper place in the tree. (with frequency=1)
4) Now, try to push this binary node in a Heap of constant size 10. The key for Push or Pop operation of Heap is frequency.

Finally, you can choose to keep an array of size 10 & push or overwrite. But a heap will be faster because everytime you read word you have to push it inside the Array. Heap will be efficient because it will minimize the data movement during Heapify i.e Bottom-Up as compared to an array.
if want more optimization during Push operation for data move (“ShiftUp” for pointers implement the Heap as a double link list so that its Bottom-Up is only a changing of the links.

- The Hercules February 20, 2008 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

The above can be done by taking each node in a heap as a structure which contains the string and the variable count. Modify the heap functions so that when you want to search for an element it is searched on the string and when you want to heapify it it is done using the count. So, when you want to add a string first check in the heap if it is there increment the count else put in the heap with count as 1 and heapify.

- gauravk.18 April 08, 2008 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

this can be done by using simple array...
since we know the range of characters that the given string can have.
allocate an array of that size.
@use counting sort technique.
initially fill the whole array with 0.
keep on increamenting the vlaue at index i if i happens to be the ascii value of a character.
at the end check for the top 10 values .
This is quite efficient(provided the range of characters is small say a-z)

- Anonymous April 11, 2008 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

This is generally done using a method called map reduce....
In the map phase we create tuples of <word,1> and in the reduce phase we group the tuples with the same word and update the count....this is a method widely used by google for their search....

- NL September 17, 2008 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I guess this solution does not apply here. Microsoft does not use anything like Hadoop which uses MapReduce.

- russoue February 16, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

We can make the Binary search tree of whole file in which node contains the count of each word after creating the whole tree we can heapify this binary search tree taking the count into consideration.

- Anonymous November 30, 2008 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

use trie

- piyush1989kapoor February 10, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

This programs is implementation using heap and binary tree.
Please comment...

public void AddNode(String value,Int16 frequency)
    
    {
            HeapNode heapNode = new HeapNode(value,frequency);

            Int16 selection = 0;
            Node prev, cur, temp;

            temp = new Node(value, frequency);
            if (root == null)
            {    //inserting the first word to the heap            
                if (Insert(heapNode))
                {
                    temp.Selected = true;
                }
                root = temp;
                return;
            }
            prev = null;
            cur = root;
            while (cur != null)
            {
                prev = cur;
                if (value.CompareTo(cur.Word) > 0)
                {
                    cur = cur.Right;
                }
                else if (value.CompareTo(cur.Word) < 0)
                {
                    cur = cur.Left;
                }
                else
                {
                    cur.Frequency = (Int16)(cur.Frequency + 1);
                    selection = 3;
                    break;
                }
            }
            if (value.CompareTo(prev.Word)<0)
            {
                prev.Left = temp;
                selection = 1;
            }
            else if (value.CompareTo(prev.Word) > 0)
            {
                prev.Right = temp;
                selection = 2;
            }
            //section to make the heap.
            if (selection == 1 && heapNodeSize<10)
            {
                if (Insert(heapNode))
                {
                    prev.Left.Selected = true;
                }
            }
            else if (selection == 2 && heapNodeSize < 10)
            {
                if (Insert(heapNode))
                {
                    prev.Right.Selected = true;
                }
            }
                //frequency is greater than 1
            else if (selection == 3 && heapNodeSize < 10)
            {
                //already in heap. here we are increasing the frequency and reheapify.
                if (cur.Selected == true)
                {
                    int i = 0;
                    for (i = 0; i < heapNodeSize; ++i)
                    {
                        if (array[i].Word.CompareTo(cur.Word) == 0)
                        {
                            array[i].Frequency=cur.Frequency;
                            break;
                        }
                    }
                    BuildMinHeap();
                }
                else
                {
                    //it is not already in the heap but frequency is greater than one.
                    //checking whether frequency is higher than the ExtractMinimum node from the Heap
                    HeapNode min = ExtractMinimum();
                    if (min.Frequency < cur.Frequency)
                    {
                       HeapNode deletedHeapNode= DeleteElement();
                       SearchAndUpdateNode(deletedHeapNode.Word);
                        if (Insert(new HeapNode(cur.Word,cur.Frequency)))
                        {
                            cur.Selected = true;
                        }
                    }
                }
            }
            
        }
       public HeapNode ExtractMinimum()
        {
            if (heapNodeSize>0)
            return array[0];
            return null;
         }
        public void BuildMinHeap()
        {
            for (int i = 0; i < (heapNodeSize / 2); ++i)
            {
                MinHeapify(i);
            }
        }
        public Boolean Insert(HeapNode node)
        {
           
            if(heapNodeSize <9)
            {
                heapNodeSize++;
                array[heapNodeSize] = node; 
               SiftUp(heapNodeSize);
            }
            else
            {
                return false;
            }
            return true;
        }
        public void SiftUp(int nodeIndex)
        {
            int parentIndex = 0;
            HeapNode tmp=null;
            if (nodeIndex != 0)
            {
                  parentIndex = (nodeIndex-1)/2;
                  if (array[parentIndex].Frequency > array[nodeIndex].Frequency) 
                  {
                      //exchange
                        tmp = array[parentIndex];
                        array[parentIndex] = array[nodeIndex];
                        array[nodeIndex] = tmp;
                        SiftUp(parentIndex);
                  }
            }
         }
        //if the frequency of a heap node comes haigher than the root node, then remove the root node and insert the coming node
        //then set the Selected variable to false in the tree
        public HeapNode DeleteElement()
        {
            HeapNode temp;
            temp = array[0];
            array[0] = array[heapNodeSize];
            heapNodeSize--;
            MinHeapify(0);
            return temp;
        }
//siftdown
        public void MinHeapify( int i)
        {
            int left = 2 * i + 1;
            int right = 2 * i + 2;
            int min = 0;
            //heap size is 10
            if (left <= 9 && array[left].Frequency < array[i].Frequency)
            {
                min = left;
            }
            else
            {
                min = i;
            }
            if (right <= 9 && array[right].Frequency < array[min].Frequency)
            {
                min = right;
            }
            if (min != i)
            {
                HeapNode temp = array[i];
                array[i] = array[min];
                array[min] = temp;
                MinHeapify(min);
            }            
        }
        
 public bool SearchAndUpdateNode(String value)
        {
            Node curr = root;
            while (curr != null)
            {
                if (value.CompareTo(curr.Word) == 0)
                {
                    curr.Selected = false;
                    return true;
                }
                if (value.CompareTo(curr.Word) < 0)
                {
                    curr = curr.Left;
                }
                else
                {
                    curr = curr.Right;
                }
            }
            return false;
        }

- BVarghese January 12, 2011 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Sorry, I forget to explain the algorithm

1. Reading each words.
2. Add it to the tree
3.a. If it is a new node, try to insert to the heap.
3.b.If the heap is full, then it will return null
3.c Else it will add the node to the heap and sift up the node.
4.a. if it is not already in the heap but frequency is greater than one, then will
check whether frequency is higher than the ExtractMinimum node from the Heap.
4.b. If the frequency is greater than the minimum node from the heap,then delete the minimum node from the heap and insert the
current node.And sift up the node.
4.c Update the variable 'selected' of node to false.
This will help to select the node to be inserted to the heap during next addnode call ,
if it got its frequency higher than the min node of heap.
5.a. If the word is already in the heap and its frequency will be increased.
5.b. Heapfy it again

Finally the heap will be having 10 mostly occured words.

- BVarghese January 12, 2011 | Flag

CareerCup

Microsoft Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews