Find top K(1000) strings that

Yahoo Interview Question for Software Engineer / Developers

1

of 3 votes

13
Answers
Find top K(1000) strings that occur frequently from a file having some billions of strings.
- Boyer Moor February 20, 2014 in United States | Report Duplicate | Flag | PURGE
Yahoo Software Engineer / Developer

Email me when people comment.

An error occurred in subscribing you.

Country: United States

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 4 vote

You will need a Hashtable and a Min-heap for this.
1) Begin with populating the Hashtable with the String (<key>) and their frequency (<value>)
2) Once you completed 1000 strings, iterate through the Hashtable and maintain a min-heap of the size K .
3) Iterate till the end and only add the Keys from the Hashtable that have a value greater than the top of the heap.

- wolfengineer February 20, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

let us say the file has 2 billion strings and there are 1 billion distinct strings then the hashtable must contain 1 billion strings? if so, it requires lots of memory? Am i missing something?

- snigda May 26, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

I guess any algorithm that stores the strings is going to face space issues. One way to store strings efficiently is a trie. At each leaf node we can store the number of times it appears. Then you can simply use the top K min/max elements in the array algorithm.
Time complexity O(N)
Space complexity not sure. It will be great if anybody can shed more light on it.

- kr.neerav February 20, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I thought of the same,

Build 26 trie's one of each letter store in hashmap
keep a count of each word in the leaf node.

sounds good for now,

how will we find top 1000?

- byteattack June 01, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Can you explain the need to build 26 tries, i didn't get it.
To find top 1000 store the values of each leaf node in an array and then find top 1000.

- kr.neerav June 16, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Reading the other solutions it looks like using max heap after building tree with be more space efficient.

- kr.neerav June 16, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

can we solve it like this:
1) maintain a hashmap for (string, (repeats, pointer to min-heap node)), and a min-heap for the top k strings
2) for a new string, we first increments its "repeats" in hashmap, if the resulting "repeats" is greater than the root value, we then need to remove the root node(if the numbers nodes in heap is equal to K). And, we need to check if the new node already exists in the heap,

- samuel February 20, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

{{{ //find top 1000 strings that appear frequently from billions //find file //input file // File.iostream.input String searchvar "lamp"; While (inputfile.in) { For I = 0 I <= searchvar; i++) Return searchvar.count(n); Output number of occurences } - rizla February 20, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

a form of min heap of max elements would help :)

- Anonymous February 20, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

I think we need calculate a hash value for each string without compare the whole string.
The following are simple.
Maintain a HashMap of <hash value and repeat times>.
Maintain a HashMap of the top 1000 strings <hash value and string>

- huichen229 February 20, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Use a trie along with the min heap.
Trie would take much lesser space as compared to a hash table.

- maxomax March 11, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

This seems like a mapReduce problem. i.e. Calculate the count of each word and then sort them in the end.
Basically have the Map output Words and Count in 100,000 Strings (We have 10,000 Nodes) that would each Output a Word and a Count.
We will then have a Reduce for each word (All rows with the same Word are sent to Reduce that aggregates it).
We will then again sort the list by order, and pick the top 1000 (Not sure if Quicksort is possible in the context of MapReduce/Hive).

- Subbu September 14, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

There are many approach some of them are like using data partition or mapreduce.
I have written code to handle small chunk of data to find top 5 most occurred words, this can be modified accordingly.

and

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

/**
* @author
*
*/
public class FrequentlyOccurredString {

static List<Entry<String, Integer>> listOfWords = new ArrayList<Entry<String, Integer>>();

public static void sortMap(HashMap<String, Integer> inputMap) {

Set<Entry<String, Integer>> set = inputMap.entrySet();
listOfWords = new ArrayList<Entry<String, Integer>>(set);
Collections.sort(listOfWords, new Comparator<Map.Entry<String, Integer>>() {
public int compare(Map.Entry<String, Integer> obj1,
Map.Entry<String, Integer> obj2) {
int value = (obj2.getValue()).compareTo(obj1.getValue());
return value;
}
});

}

public static void printMap(List<Entry<String, Integer>> list) {
int count = 0;
for (Map.Entry<String, Integer> wordOccurrence : list) {
if (count <= 4) {
System.out.println("'" + wordOccurrence.getKey() + "' - " + ""
+ wordOccurrence.getValue());
}
count++;
}
}

public static void findFrequentlyOccurredString(String inpString, String dictionaryString) {

String tempString = "";
int DEFAULT_VALUE = 1;

String[] storeStringToken = inpString.replace(" ", "").trim().split(" ");
HashMap<String, Integer> countStringOccurrence = new HashMap<String, Integer>();

for (int i = 0; i < storeStringToken.length; i++) {
// insert all new words to hashmap
if (!countStringOccurrence.containsKey(storeStringToken[i])) {

if (!dictionaryString.toLowerCase().contains(
storeStringToken[i].toLowerCase())) {

tempString = storeStringToken[i];
countStringOccurrence.put(tempString.trim(), DEFAULT_VALUE);

}
}

// else process existing words from hashmap
else {

tempString = storeStringToken[i];
for (Map.Entry<String, Integer> mapEntry : countStringOccurrence
.entrySet()) {
// check if the key already there
if (mapEntry.getKey().equals(tempString)) {
int keyOccurrence = mapEntry.getValue();
keyOccurrence++;
countStringOccurrence.put(tempString, keyOccurrence);
}
}

}

}

// sort hashmap in decending order
sortMap(countStringOccurrence);

// print hashMap
printMap(listOfWords);

}

public static void main(String[] args) {
String inp = "It is globally known for its Web portal, search engine Yahoo! Search, and related services, including Yahoo! Directory, Yahoo! Mail, Yahoo! News, Yahoo! Finance, Yahoo! Groups, Yahoo! Answers, advertising, online mapping, video sharing, fantasy sports, and its social media website.";
String dictionary = "andtheinorbetweenisaMrtheretheirthemthosetheybyonbutduringafterbeforeassuchreturnsfollowsfromtoaboveofoffwantsbecauseitsknown";
inp = inp.replaceAll("[^a-zA-Z\\s]", "");
findFrequentlyOccurredString(inp, dictionary);

}

}

and

- Sandy0109 September 01, 2016 | Flag Reply

CareerCup

Yahoo Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews