Given a string, you need to fi

Amazon Interview Question for Software Engineer / Developers

3

of 3 votes

20
Answers
Given a string, you need to find super string by word match. i.e. all words in the input string has to occure in any order in output string.
e.g. given data set:
"string search"
"java string search"
"manual c++ string search equals"
"java search code"
"c++ java code search"
...

input: "java search"
output:
1) "java string search"
2) "java search code"
3) "c++ java code search"

input: "c++ search"
output:
1) "manual c++ string search equals"
2) "c++ java code search"

There are millions of records in given data set and you need to process few million as input.
- zc51 March 29, 2013 in India | Report Duplicate | Flag | PURGE
Amazon Software Engineer / Developer Algorithm Data Mining

Email me when people comment.

An error occurred in subscribing you.

Country: India
Interview Type: Phone Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 4 vote

Idea:

Loop each string, adding the words into a hashmap on the word => string ID. For example:

java => 1, 3, 4
search => 0, 1, 2, 3, 4

To find "java search", look up each word in our map and intersect the results, i.e. we get 1, 3, 4 in this case.

Therefore the result is:

"java string search"
"java search code"
"c++ java code search"

I'm not sure how much more efficient we can make it - since we can now do the lookup in worst-case linear time, and we have the advantage that it's probably quite likely we could store this input in a database, and not have to recompute it constantly (and for each new string insertion, we just have to insert a couple more rows into the database).

- jay March 29, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I think this should be a question to the interviewer:
- if the string can contain any combination of characters/alphabets/numbers we have millions of string that could be formed
- if the string data is going to have only english dictionary words then we have a manageable haspmap size. This doesn't seem to be the case since the sample itself has 'C++' in it.

So if we can hold all the unique words in the data file in memory, hash with the string IDs as values makes sense. In this case you will need to maintain two maps though,
Map 1: word -> String ID
Map 2: String ID -> actual String. Since it is not good on performance if you to loop through the data file to get string at row X.

Alternative solution:
Use regular expression
Input: java search
return String if String matches the pattern .*+java+.*+search + .*
This is good on memory usage but will be slow on performance.

- AmazonUser March 29, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 2 votes

Having 'C++' does not prove we don't have a manageable finite dictionary, since the number of actual programming languages is quite low.

Regexs will be come unmanageable since you cannot assume ordering, as per the question, and building a regex of every possible permutation of alternating patterns could get inefficient as the search string gets large.

- jay March 29, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

The dataset given is more or less english dictionary words. more precisely it contains all human readable meaningful words.

regex cannot be used as jay mentioned. Through we can use regex if we can apply the regex on canonical form of both the input string and on data set.
e.g.
"c++ java code search" -> canonical form "c++ code java search"
"java search" -> canonical form "java search"
"java string search" -> canonical form "java search string"
now apply regex .*+java+.*+search + .*

But the cost of this operation is too much.

- zc51 March 29, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Also by "C++" I meant it could have human readable words which are not in the dictionary. *Webster Dictionary*. If I can have C++, I can also have C+++, C+++++, ++C++ right? Can't ignore the possibility of these strings appearing in the data set

- AmazonUser March 29, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

What @jay describes is called Inverted Index.

- Sahej March 31, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

By regex, I meant if you are looking for 'java search', check if String.matches("*java*") and if it also matches ("*search*"). not both at the same time. If it does not match the first one we don't have to match it further. Saves a lot of time.

- AmazonUser March 29, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

// implementation of strstr
bool match(char* super, char* sub)
{
while (*sub != '\0')
{
if (*sub++ == *super++)
continue;
else
return false;
}
return true;
}
char* needleInHayStack(char* super, char* sub)
{
if (!super || !sub)
{
return 0;
}
while (*super != '\0')
{
if (match(super, sub))
{
return super;
}
super++;
}
return 0;
}

Let me explain what this code does:
1. The very first if condition checks that if any of the string is a NULL pointer, returns 0.
2. In the while loop, we are comparing the search string occurrence starting from every position in the super string.
3. Since we are incrementing pointer pointing to super string “str”, if a match is found pointer has to be decremented by len.
4. Also if a match is not found pointer needs to be decremented by len so that next search starts from next position in “str”.
5. This is a C style string and is always terminated by ‘\0’ character.

Now from a programming perspective, look how the passed parameter and return value has const. This means the content which the pointer is pointing to cannot be changed. This ensures that original string passed is not changed by this question.

Probably a good thing to clarify in the beginning if the expected behavior is fine or not. The standard library in C++ comes with both the variants (const as well as non-const).

One can also terminate the loop once the length of sub string becomes bigger than remaining string pointed by “str”. I will leave this as an exercise to the readers. Also can you guess where this program might have performance problems.

There are many ways of implementing this. Here is another way, try to see if you can find how this code works. We will be using this implementation to solve many other problems. Remember I promised you in the beginning that we will learn one thing and solve similar problems.

Look how I have defined function 'match' before using it. This is because C/C++ expects function to be declared or defined before one can use it. Also notice how we don't need to maintain two variables as in first implementation. This is because we are passing super and sub to the function. Though some people will say that it is pass by reference, the pointer object is actually passed by value (yes it is pointing to the same location), but since we are passing the variable storing the addresses of two strings by value, any change done to it may 'match' method is local to 'match' method. Don't worry too much about it. This will get explained pretty quickly in some of my next post. Also notice that this is non-const implementation of strstr().

- SUBBU September 06, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

String matcher[] = {"string search","java string search","manual c++ string search equals",
"java search code","c++ java code search" };
String[] pattern = "c++".split(" ");

for(String match : matcher){
for(int i=0;i<pattern.length;i++){
if(match.contains(pattern[i])){
if(i==pattern.length-1){
System.out.println(match);
}
continue;
}else
break;
}
}

- vrajendra.singh.mandloi March 30, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

This is a map reduce problem of inverted index.

- red May 29, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

just a thought:
if some offline/pre-processing can be done on the dataset, each word occurring in the sentence/line can be put in a hashmap which points to a list of occurrences (index, filename-line num etc.)

when a input string needs to be checked, we can fetch the occurrences of each of the words and then pick the ones that are common to all words.

- VJ March 29, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

public static void main(String[] args) {
		
		String a[] = {"string search","java string search","manual c++ string search equals",
					  "java search code","c++ java code search" };
		
		String pattern = "java search";
		HashMap patternMap = new HashMap();
		HashSet<String> set = new HashSet<String>();		
		ArrayList lst = getTokens(pattern);
		ArrayList tokens = new ArrayList();
		
		// to convert the pattern to Map.
		for(int i = 0 ; i < lst.size(); i++)
			patternMap.put(lst.get(i),0);
		
		for(int i = 0; i < a.length; i++)
		{
			set.clear();
			tokens = getTokens(a[i]);
			for (int j = 0 ; j < tokens.size(); j++)
				if(patternMap.containsKey(tokens.get(j)))
					set.add(String.valueOf(tokens.get(j)));
			
			if(patternMap.size() == set.size())
				System.out.println(a[i] + ": candiate of result");
		}
	}
	
	public static ArrayList getTokens(String value)
	{
		StringTokenizer ls = new StringTokenizer(value);
		ArrayList<String> tokens = new ArrayList<String>();
		while (ls.hasMoreTokens())
			tokens.add(ls.nextToken());
		return tokens;
	}

- Anonymous March 29, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

why are we using Hashmap instead we can do the same using simple array work.. I know the approach using hashmap would be good.. or more memory efficient but can anybody explain how is it useful.. More memory efficient>>?

- vrajendra.singh.mandloi April 01, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Seems to me, this is a database question. We would need to put millions of records into database first.

- outdoor March 30, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

this code works well for the above question.....pls comment if I have any error....
#include<stdio.h>
#include<string.h>
#include<iostream>
#include<stdlib.h>
using namespace std;
int main()
{
int i=0,j,flag=0;
char *str[10],*str1;
str1=(char *)malloc(30);
printf("enter the number of strings to be entered:\n");
scanf("%d",&j);
while(j--){
str[i]=(char *)malloc(50);
gets(str[i++]);}
printf("enter the searchstring:\n");
gets(str1);
for(j=0;j<i;j++){
if(strstr(str[j],str1)){
puts(str[j]);flag=1;
printf("\n");}}
if(flag)printf("\nno match!!!!!!!");
return 0;
}

- Anonymous April 01, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

1. Pre-process these millions of records in the data set to a hashtable of a vector
{{
for (record: records)
for (word: record)
hashtable[word] = hashtable[word].append(pointer to record)
}}
2. For any given string, got the common record of hastable[ for (word:string)]

- Kill-11 April 02, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Hi zc51, have you got f2f interview call??

- Abhijit April 04, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

In java :

HashMap<String> inverted_list = new HashMap<String>();
void preprocess(String[] list){
	for (int i =0 ; i < list.length ; i++){
		StringTokenizer token = new StringTokenizer(list[i]);
		while (token.hasMoreElements()){				
			Srting q = token.nextToken();
			ArrayList<Integer> r ;
			if (!inverted_list.contains(q)){
				ArrayList<Integer> arr = new ArrayList<Integer>();
				r = arr;
			}else{
				r = inverted_list.get(q);
			}
			r.add(i);
		}
	}
}

void searchString(String list[] , String input){
	StringTokenizer q = new StringTokenizer(input);
	ArrayList<ArrayList<Integer>> pointer = new ArrayList<ArrayList<Integer>>();
	String s = "";
	while ((s = q.nextToken()) != null){
		if (!inverted_list.contains(s))
			return;
		pointer.add(inverted_list.get(s));
	}
	//now we have to find the intersections 
	ArrayList<Integer> intersection = intersect(pointer);
	for (int i = 0 ; i < intersection.size() ; i++)
		System.out.println(list[intersection.get(i)]);
	return;
}

ArrayList<Integer> intersect(ArrayList<ArrayList<Integer>> pointer){
	ArrayList<Integer> result = new ArrayList<Integer>();
	int[] index = new int[pointer.size()];
	outter : while(true){
		for (int i = 0 ; i < pointer.size() ; i++){
			if (index[i] >= pointer.get(i).size())
				return result;
		}
		int min = find_min(pointer , index);
		int second_min = find_second_min(pointer , index);
		while (pointer.get(min).get(index[min]) < pointer.get(second_min).get(index[second_min]))
			min++;
		for (int i = 0 ; i < pointer.size() -1  ; i++)
			if (pointer.get(i).get(index[i]) !=  pointer.get(i+1).get(index[i+1]))
				continue outter;
		result.add(pointer.get(0).get(index[0]));
	}
	return result;
}

int min(ArrayList<ArrayList<Integer>> pointer , int[] index ){
	int min = 0;
	for (int i = 0 ; i < pointer.size() ; i++)
		if (pointer.get(i).get(index[i]) < pointer.get(i).get(min))
			min = i;
	return min;
}

int second_min(ArrayList<ArrayList<Integer>> pointer , int[] index ){
	int min = min(pointer , index);
	int second_min = 0;
	for (int i = 0 ; i < pointer.size() ; i++)
		if (i != min && pointer.get(i).get(index[i]) < pointer.get(i).get(second_min))
			second_min = i;
	return second_min;
}

- Anonymous June 07, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

One approach could be:
Populating the DS:
1. For each string, tokenize into words.
2. Add each word into a trie (discussed the structure later), with each word containing it's parent string index.

each node in Trie could be:

#define CHARSETSIZE 26

typedef struct node{
	int val;
	struct node * children[CHARSETSIZE];
	set<int> parentStrIndex;
}trieNode;

Finding the output strings:
1. For any given input string (not dataset that was populated earlier), tokenize into words.
2. Query each word into the trie.
3. If word is found, get it's parent string id(s) from it's terminating node's parentStrIndex set contents.
4. Repeat 2,3 for each word in input.
5. Intersect the set(s) found in 3, to find the result parent string id(s).

- theGhost September 05, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

python

def index1(searchable):
    retdict = {}
    for word in searchable.split(" "):
        retdict[word]=True
    return retdict

def search(searchterm, searchdb):
    retlist=[]
    for searchable_item in searchdb:
        indexdict = index1(searchable_item)
        search_hit = True
        for word in searchterm.split(" "):
            if word not in indexdict:
                search_hit=False
        if search_hit == True:
            retlist.append(searchable_item)
    return retlist

list1 = ["string search" , "java string search" , "manual c++ string search equals" , "java search code" , "c++ java code search" ]

search1="java search"

res1 = ["java string search" ,"java search code" ,"c++ java code search" ]

assert search(search1,list1) == res1

- mikeolteanu September 12, 2013 | Flag Reply

CareerCup

Amazon Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews