Boyer Moor
BAN USER- 1of 3 votes
AnswersFind top K(1000) strings that occur frequently from a file having some billions of strings.
- Boyer Moor in United States| Report Duplicate | Flag | PURGE
Yahoo Software Engineer / Developer
Here is my algorithm :
Consider you have an array which can be split into A( first N elements), B( remaining array)
1. Reverse the first n elements in the array.
2. Reverse the elements n+1 to end of the of the array.
3. Reverse the whole array once.
Eg : say you have an Array [1, 2, 3, 4, 5 ,6, 7,8,9,10]
say n = 3 meaning 3 elements have to copied to end of the array. expected output is [4,5,6,7,8,9,10,1,2,3]
1) we will have [3,2,1,4,5,6,7,8,9,10]
2) we will have [3,2,1,10,9,8,7,6,5,4]
3) we will have [4,5,6,7,8,9,10,1,2,3] => expected output
1. We could use bloom filters for the first bag of elements (1000 elements). Since the number of elements is small it is easy to use bloom filters in this case. For all the elements in Bag2 pass these elements through bloom filter, if they are already set then print the solution that the element is identicaly. In bloom filter if a number is not present => the index of that element will never be set. So bloom filters could be used here.
2, If the range of numbers(in bag1) is known and is very small, we could as well use a BitSet and set the corresponding bit in bitset for all the 1000 elements. In second pass we could use see for each element if a bit is already set in the bitset, if Yes then we print that the element is identical. advantage of bitset is it is much more memory efficient.
Preprocessing is a good idea, But like you sais " we will keep refining the options available and check all the options" is very expensive operations depending on the size of the character set available. say we have n* 26 characters, your option space turns out to be exponential and checking every option against a tree is the longest running function I could ever imagine.
- Boyer Moor February 25, 2014Programmatically these are the approaches that I would take.
1. Test it on multiple browsers and check if the behavior(delay) is same or occurring only particular browser.
2. use the time collected from past few weeks and observe for any patterns.
3. Write some complete end to end integration/regression test cases.
4. Rely on some unit tests that are written for the code developed from yesterday to today.
5. see if there are any data structures in the back end are changed.
6. see if the datasource/database queries are changed.
once you have all the frequencies in the map, since the map is unordered, we need to sort the map based on values. Use collections.sort passing the entry set and comparator as arguments to it. comparator's compare to function can sort the entries based on the values(frequencies).. Thus your map is sorted and you can print the values of top K elements from the map.
Map<String,Integer> sortByValue(Map<string,Integer> m) {
List<Map.Entry<String,Integer>> list = new LinkedList<Map.Entry<String,Integer>>(map.entryset());
Collections.sort(list, new Comparator<Map.Entry<String,Integer>>(){ public int compare(Map.Entry<String,Integer> m1, Map.Entry<String,Integer> m2){
return (m2.getValue()).CompareTo(m1.getValue());
}
});
Map<String,Integer> result = new LinkedHashMap<String,Integer>();
for(Map.Entry<String,Integer> entry : list) {
result.put(entry.getKey(),entry.getValue());
}
}
In my algorithm : Each node also contains an additional pointer to its parent, not just the left and right child.
recontstructTree() is called for each pair of numbers.
Traverse() does a pre-order traversal of the tree and prints the resulting tree.
I have tested the code and it works fine. Let me know if any questions.
class BuildTree {
static HashMap<Integer,Node> nodes = new HashMap<Integer,Node>();
static Node root = null;
static class Node
{
Node right;
Node left;
Node parent;
int num;
public Node(int number)
{
left = null;
right = null;
parent = null;
num = number;
}
}
public static Node reconstructTree(int[][] input) {
// TODO: Build the tree from the values in input and return the root node.
Integer x = new Integer(input[0][0]);
Integer y = new Integer(input[0][1]);
Node parent = null;
Node child = null;
if(!nodes.containsKey(x))
{
parent = new Node(x);
nodes.put(x, parent);
}
if(!nodes.containsKey(y))
{
child = new Node(y);
nodes.put(y, child);
}
parent = nodes.get(x);
child = nodes.get(y);
if(parent.left==null){
parent.left = child;
System.out.println(child.num +" is attached as Left child to "+ parent.num);
}
else if(parent.right==null){
parent.right = child;
System.out.println(child.num +" is attached as right child to "+ parent.num);
}
if(child.parent== null) {
child.parent = parent;
root = parent;
}
return root;
}
public static void traverse (Node root){ // Each child of a tree is a root of its subtree.
System.out.println(root.num);
if (root.left != null){
traverse (root.left);
}
if (root.right != null){
traverse (root.right);
}
}
RepAtiDavis, Analyst at 8x8
I am a skilled construction worker with 3+ years of experience erecting structures according to blueprints and safely utilizing power ...
RepKoriScott, abc at Omli
Licensed mechanical maintenance engineer with extensive practical experience working with diverse systems and equipment. Solid professional knowledge base built upon ...
Repdeborahdond, Rehabilitation counsellor at Parts salesperson
Hello, I am Deborah Rehabilitation counsellors who help people with physical, mental, developmental, or emotional disabilities live independently. I want ...
cnt = 0 before starting the next iteration of the for loop should be updated to cnt =1, else the count of the consecutive repeating character would be wrong and also since we are iterating through the next distinct character of the string the count should be initialized to 1.
- Boyer Moor February 13, 2015