Suppose you have a large file

Morgan Stanley Interview Question for Software Engineer / Developers

0

of 0 votes

22
Answers
Suppose you have a large file with lots of words. How would you find the unique words and their count?
What kind of data structure u will use? What will be the time complexity and space complexity?
- hamepal November 19, 2010 | Report Duplicate | Flag | PURGE
Morgan Stanley Software Engineer / Developer Algorithm

Email me when people comment.

An error occurred in subscribing you.

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 1 vote

It depends on what operations on the data structure you are interested in. Two alternatives: hashmap , trie. If you just want the count then a hash map with value = count
time: amortized O(1)
space: depends on the hash function, best case: O(unique_words)

- blueskin.neo November 19, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Please tell, why time will be O(1)? If one applies hash function to each word, it`ll be O(n*1) = O(n), won`t it?

- Anonymous November 19, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

you are ignoring the word "amortized",
amortized O(1) means average running time per operation over a worst-case sequence of operations is constant.

- blueskin.neo November 19, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

If the time complexity is based on the number of words in the file, how will it be amortized O(1)? What is the worst case sequence here? I am thinking its just O(n).

- Kannan November 20, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

What do you mean by "unique words". If uniques words are one that appear once than their count will be one or is the unique word given as input by the user and you have to calculate its occurrence?

- JoshMachine November 19, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

for example if text present in the file like "Hi Hello, bob. Hello Matt, Hi jully" so the output should be like

Hi-2
Hello-2
bob -1
Matt- 1
Jully -1 ..
hope you can understand the question now.

- hamepal November 19, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

@blueskin.neo,
suppose I want to print both the word and their count. In that case how would u do that. Can you please write the algorithm for that?

- hamepal November 19, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

hey hamepal, here you go

/* Algorithm to count words in a file. Please dont reply opinions on the code, comments on complexity analysis is appreciated
   For each word check if the word exists in the hash, if it doesnt then insert the word with a count 1
                                                       if it does then increment the value
   For each word in the hashmap, print word and its count
   Complexity: Linear insertion and space
	NOTE:The code only compiles with newer C++ use the switch -std=c++0x in your Makefile*/

#include <iostream> 
#include <vector>
#include <unordered_map>
typedef std::unordered_map<std::string, int> MyHashMap;
void printMyHashMap(MyHashMap::const_iterator itr,MyHashMap::const_iterator enditr){
	while(itr!=enditr){
		std::cout<<(*itr).first<<" : "<<(*itr).second<<std::endl;
		++itr;
	}
}
int main(){ 
    MyHashMap wordsMap; //declare unordered map
	//vector of strings
    std::vector<std::string> strings {"count","words","in","a","file",
					"and","print","words","with","their","count",
					".","more","words","file"};
	int len = strings.size();
	std::vector<std::string>::const_iterator wordsItr;
	wordsItr = strings.begin();
	MyHashMap::iterator mapItr = wordsMap.begin();
	for ( wordsItr=strings.begin(); wordsItr!=strings.end(); wordsItr++ ){
 		mapItr = wordsMap.find(*wordsItr);
		if(mapItr==wordsMap.end()){	//if word not found then 
			wordsMap.insert(MyHashMap::value_type(*wordsItr, 1));
		}else{
			(*mapItr).second = ++(*mapItr).second;		}
 	}
	std::cout<<"words and their count :"<<std::endl;
	printMyHashMap(wordsMap.begin(),wordsMap.end());
    return (0);
}

- blueskin.neo November 20, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@blueskin.neo
thanks

- hamepal November 20, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

#include <iostream>
#include <algorithm>
#include <vector>
#include <fstream>
#include <map>
#include <iterator>

using namespace std;

int main() {
string str;
ifstream in("input.txt");
map<string, int> m;
if (in.is_open()) {
while(in >> str) {
m[str]++;
}
}
for(map<string, int>::iterator it = m.begin(); it != m.end(); it++) {
cout << (*it).first <<"\t" << (*it).second << endl;
}

}

input.txt

abc abc abc abc abc abc abc abcdef abc abcd
abc abc abc abc abc abc abc abcdef abc abcd
abc abc abc abc abc abc abc abcdef abc abcd
abc abc abc abc abc abc abc abcdef abc abcd
count words in a file and print words with their count more words file

output

ajay@ubuntu:~/workspace$ ./a.out
a 1
abc 32
abcd 4
abcdef 4
and 1
count 2
file 2
in 1
more 1
print 1
their 1
with 1
words 3

- ajay November 21, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Internally, the elements in the map are sorted from lower to higher key value following a specific strict weak ordering criterion set on construction. Assuming red-black trees. the complexity would be O(n*logn). try Hashmap, thanks

- Synonymouse November 21, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

RB-trees have insertion item/searching item complexity O(lgn), not O(nlogn). Read Cormen & Co.

- Anonymous November 21, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Insertion of an element is O(logn) --> for n elements O(nlogn)
Now, we want to traverse all elements, retrieval of each element could be O(n) or O(nlogn) depending on the container used.
But for a hashmap, we have functions that give you O(n) insertion for n elements and O(n) for traversal of n elements. which one would you choose?

- Synonymous November 21, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

My bad, a typo: please ignore the above comment.
Insertion of an element is O(logn) --> for n elements O(nlogn)
Now, we want to traverse all elements, retrieval of n elements could be O(n) or O(nlogn) depending on the container used.
But for a hashmap, we have functions that give you O(n) insertion for n elements and O(n) for traversal of n elements. which one would you choose?

- Synonymouse November 22, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

cat $file | sort | uniq -c

- Anonymous November 22, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

How would you find the unique words and their count?
Unique!
So you just use ArrayList to put words as you read a word and remove it if already exists. So the count will always be one (remember unique words). we are not asked to list other words.

- Dawit January 05, 2011 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

It doesn't work because what if a word appears three times, you'll add it back to the list. Plus the time complexity will be very high, because you need to traverse the whole list to see if the word already exists in it.

- Sophia March 03, 2011 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

import java.util.HashMap;

public class TestTest {
	public static void main(String[] args){

		String inputString = "abc abc abc abc abc abc abc abcdef abc abcd abc abc abc abc abc abc abc abcdef abc abcd" +
				" abc abc abc abc abc abc abc abcdef abc abcd abc abc abc abc abc abc abc abcdef abc abcd" +
				" count words in a file and print words with their count more words file";

		HashMap<Object, Integer> myMap = new HashMap<Object, Integer>(){
			@Override
			public Integer put(Object arg0, Integer arg1) {
				Integer i = super.get(arg0);
				arg1 = i==null?1:i+1;
				return super.put(arg0, arg1);
			}
		};
		
		String[] splits = inputString.split(" ");
		for(String s : splits){
			myMap.put(s,1);
		}
		
		System.out.println(myMap);
	}
}

- hope this is right.. January 05, 2011 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Also seems to be a perfect candidate for MapReduce.

- Mikler June 26, 2011 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

import java.util.*;

public class Freq {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<String, Integer>();

        // Initialize frequency table from command line
        for (String a : args) {
            Integer freq = m.get(a);
            m.put(a, (freq == null) ? 1 : freq + 1);
        }

        System.out.println(m.size() + " distinct words:");
        System.out.println(m);
    }
}

- Guillaume Plourde January 02, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

String s = "Hi Hello bob Hello Matt Hi jully" ;
Map<String,Integer> numberOfWords = new HashMap<String,Integer>();
String[] allWords = s.split("\\s");
		for (String s1 : allWords) {
			if (numberOfWords.containsKey(s1)) {
				numberOfWords.put(s1, numberOfWords.get(s1) + 1);
			} else {
				numberOfWords.put(s1, 1);
			}
		}
		
		Iterator it = numberOfWords.entrySet().iterator();
	    while (it.hasNext()) {
	        Map.Entry pairs = (Map.Entry)it.next();
	        System.out.println(pairs.getKey() + " = " + pairs.getValue());
	    }

- Bindu Mahapatra February 28, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

We can create a Trie based structure with one modification.
Every trie node will store the counts of the words that end with it ( the count will be 0 if it is not a leaf ).
Ex. Hi John ! How are you ? Hi Jane, I'm doing good. Hiya everybody, how's the party ?
will have a node structure for H(0) --> i (2) --> y (0) --> a (1)
This will conserve space and worst time complexity will be O(K) where K is the maximum number of alphabets in that structure.

- Krishna May 31, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Of course the author meant: find all the unique words and their count (i.e. how many unique words there are).
Use a binary tree or a hash table (i.e. std::map<string, int> or std::unordered_map<string, int>) to insert all the words with their counts. After this traverse this structure and print all words with count = 1 and also count them (i.e. those with count = 1).

- Leo July 31, 2015 | Flag Reply

CareerCup

Morgan Stanley Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews