Amazon Interview Question
Software Engineer / DevelopersCountry: India
Interview Type: Phone Interview
I think this should be a question to the interviewer:
- if the string can contain any combination of characters/alphabets/numbers we have millions of string that could be formed
- if the string data is going to have only english dictionary words then we have a manageable haspmap size. This doesn't seem to be the case since the sample itself has 'C++' in it.
So if we can hold all the unique words in the data file in memory, hash with the string IDs as values makes sense. In this case you will need to maintain two maps though,
Map 1: word -> String ID
Map 2: String ID -> actual String. Since it is not good on performance if you to loop through the data file to get string at row X.
Alternative solution:
Use regular expression
Input: java search
return String if String matches the pattern .*+java+.*+search + .*
This is good on memory usage but will be slow on performance.
Having 'C++' does not prove we don't have a manageable finite dictionary, since the number of actual programming languages is quite low.
Regexs will be come unmanageable since you cannot assume ordering, as per the question, and building a regex of every possible permutation of alternating patterns could get inefficient as the search string gets large.
The dataset given is more or less english dictionary words. more precisely it contains all human readable meaningful words.
regex cannot be used as jay mentioned. Through we can use regex if we can apply the regex on canonical form of both the input string and on data set.
e.g.
"c++ java code search" -> canonical form "c++ code java search"
"java search" -> canonical form "java search"
"java string search" -> canonical form "java search string"
now apply regex .*+java+.*+search + .*
But the cost of this operation is too much.
Also by "C++" I meant it could have human readable words which are not in the dictionary. *Webster Dictionary*. If I can have C++, I can also have C+++, C+++++, ++C++ right? Can't ignore the possibility of these strings appearing in the data set
By regex, I meant if you are looking for 'java search', check if String.matches("*java*") and if it also matches ("*search*"). not both at the same time. If it does not match the first one we don't have to match it further. Saves a lot of time.
// implementation of strstr
bool match(char* super, char* sub)
{
while (*sub != '\0')
{
if (*sub++ == *super++)
continue;
else
return false;
}
return true;
}
char* needleInHayStack(char* super, char* sub)
{
if (!super || !sub)
{
return 0;
}
while (*super != '\0')
{
if (match(super, sub))
{
return super;
}
super++;
}
return 0;
}
Let me explain what this code does:
1. The very first if condition checks that if any of the string is a NULL pointer, returns 0.
2. In the while loop, we are comparing the search string occurrence starting from every position in the super string.
3. Since we are incrementing pointer pointing to super string “str”, if a match is found pointer has to be decremented by len.
4. Also if a match is not found pointer needs to be decremented by len so that next search starts from next position in “str”.
5. This is a C style string and is always terminated by ‘\0’ character.
Now from a programming perspective, look how the passed parameter and return value has const. This means the content which the pointer is pointing to cannot be changed. This ensures that original string passed is not changed by this question.
Probably a good thing to clarify in the beginning if the expected behavior is fine or not. The standard library in C++ comes with both the variants (const as well as non-const).
One can also terminate the loop once the length of sub string becomes bigger than remaining string pointed by “str”. I will leave this as an exercise to the readers. Also can you guess where this program might have performance problems.
There are many ways of implementing this. Here is another way, try to see if you can find how this code works. We will be using this implementation to solve many other problems. Remember I promised you in the beginning that we will learn one thing and solve similar problems.
Look how I have defined function 'match' before using it. This is because C/C++ expects function to be declared or defined before one can use it. Also notice how we don't need to maintain two variables as in first implementation. This is because we are passing super and sub to the function. Though some people will say that it is pass by reference, the pointer object is actually passed by value (yes it is pointing to the same location), but since we are passing the variable storing the addresses of two strings by value, any change done to it may 'match' method is local to 'match' method. Don't worry too much about it. This will get explained pretty quickly in some of my next post. Also notice that this is non-const implementation of strstr().
String matcher[] = {"string search","java string search","manual c++ string search equals",
"java search code","c++ java code search" };
String[] pattern = "c++".split(" ");
for(String match : matcher){
for(int i=0;i<pattern.length;i++){
if(match.contains(pattern[i])){
if(i==pattern.length-1){
System.out.println(match);
}
continue;
}else
break;
}
}
just a thought:
if some offline/pre-processing can be done on the dataset, each word occurring in the sentence/line can be put in a hashmap which points to a list of occurrences (index, filename-line num etc.)
when a input string needs to be checked, we can fetch the occurrences of each of the words and then pick the ones that are common to all words.
public static void main(String[] args) {
String a[] = {"string search","java string search","manual c++ string search equals",
"java search code","c++ java code search" };
String pattern = "java search";
HashMap patternMap = new HashMap();
HashSet<String> set = new HashSet<String>();
ArrayList lst = getTokens(pattern);
ArrayList tokens = new ArrayList();
// to convert the pattern to Map.
for(int i = 0 ; i < lst.size(); i++)
patternMap.put(lst.get(i),0);
for(int i = 0; i < a.length; i++)
{
set.clear();
tokens = getTokens(a[i]);
for (int j = 0 ; j < tokens.size(); j++)
if(patternMap.containsKey(tokens.get(j)))
set.add(String.valueOf(tokens.get(j)));
if(patternMap.size() == set.size())
System.out.println(a[i] + ": candiate of result");
}
}
public static ArrayList getTokens(String value)
{
StringTokenizer ls = new StringTokenizer(value);
ArrayList<String> tokens = new ArrayList<String>();
while (ls.hasMoreTokens())
tokens.add(ls.nextToken());
return tokens;
}
this code works well for the above question.....pls comment if I have any error....
#include<stdio.h>
#include<string.h>
#include<iostream>
#include<stdlib.h>
using namespace std;
int main()
{
int i=0,j,flag=0;
char *str[10],*str1;
str1=(char *)malloc(30);
printf("enter the number of strings to be entered:\n");
scanf("%d",&j);
while(j--){
str[i]=(char *)malloc(50);
gets(str[i++]);}
printf("enter the searchstring:\n");
gets(str1);
for(j=0;j<i;j++){
if(strstr(str[j],str1)){
puts(str[j]);flag=1;
printf("\n");}}
if(flag)printf("\nno match!!!!!!!");
return 0;
}
In java :
HashMap<String> inverted_list = new HashMap<String>();
void preprocess(String[] list){
for (int i =0 ; i < list.length ; i++){
StringTokenizer token = new StringTokenizer(list[i]);
while (token.hasMoreElements()){
Srting q = token.nextToken();
ArrayList<Integer> r ;
if (!inverted_list.contains(q)){
ArrayList<Integer> arr = new ArrayList<Integer>();
r = arr;
}else{
r = inverted_list.get(q);
}
r.add(i);
}
}
}
void searchString(String list[] , String input){
StringTokenizer q = new StringTokenizer(input);
ArrayList<ArrayList<Integer>> pointer = new ArrayList<ArrayList<Integer>>();
String s = "";
while ((s = q.nextToken()) != null){
if (!inverted_list.contains(s))
return;
pointer.add(inverted_list.get(s));
}
//now we have to find the intersections
ArrayList<Integer> intersection = intersect(pointer);
for (int i = 0 ; i < intersection.size() ; i++)
System.out.println(list[intersection.get(i)]);
return;
}
ArrayList<Integer> intersect(ArrayList<ArrayList<Integer>> pointer){
ArrayList<Integer> result = new ArrayList<Integer>();
int[] index = new int[pointer.size()];
outter : while(true){
for (int i = 0 ; i < pointer.size() ; i++){
if (index[i] >= pointer.get(i).size())
return result;
}
int min = find_min(pointer , index);
int second_min = find_second_min(pointer , index);
while (pointer.get(min).get(index[min]) < pointer.get(second_min).get(index[second_min]))
min++;
for (int i = 0 ; i < pointer.size() -1 ; i++)
if (pointer.get(i).get(index[i]) != pointer.get(i+1).get(index[i+1]))
continue outter;
result.add(pointer.get(0).get(index[0]));
}
return result;
}
int min(ArrayList<ArrayList<Integer>> pointer , int[] index ){
int min = 0;
for (int i = 0 ; i < pointer.size() ; i++)
if (pointer.get(i).get(index[i]) < pointer.get(i).get(min))
min = i;
return min;
}
int second_min(ArrayList<ArrayList<Integer>> pointer , int[] index ){
int min = min(pointer , index);
int second_min = 0;
for (int i = 0 ; i < pointer.size() ; i++)
if (i != min && pointer.get(i).get(index[i]) < pointer.get(i).get(second_min))
second_min = i;
return second_min;
}
One approach could be:
Populating the DS:
1. For each string, tokenize into words.
2. Add each word into a trie (discussed the structure later), with each word containing it's parent string index.
each node in Trie could be:
#define CHARSETSIZE 26
typedef struct node{
int val;
struct node * children[CHARSETSIZE];
set<int> parentStrIndex;
}trieNode;
Finding the output strings:
1. For any given input string (not dataset that was populated earlier), tokenize into words.
2. Query each word into the trie.
3. If word is found, get it's parent string id(s) from it's terminating node's parentStrIndex set contents.
4. Repeat 2,3 for each word in input.
5. Intersect the set(s) found in 3, to find the result parent string id(s).
python
def index1(searchable):
retdict = {}
for word in searchable.split(" "):
retdict[word]=True
return retdict
def search(searchterm, searchdb):
retlist=[]
for searchable_item in searchdb:
indexdict = index1(searchable_item)
search_hit = True
for word in searchterm.split(" "):
if word not in indexdict:
search_hit=False
if search_hit == True:
retlist.append(searchable_item)
return retlist
list1 = ["string search" , "java string search" , "manual c++ string search equals" , "java search code" , "c++ java code search" ]
search1="java search"
res1 = ["java string search" ,"java search code" ,"c++ java code search" ]
assert search(search1,list1) == res1
Idea:
- jay March 29, 2013Loop each string, adding the words into a hashmap on the word => string ID. For example:
java => 1, 3, 4
search => 0, 1, 2, 3, 4
To find "java search", look up each word in our map and intersect the results, i.e. we get 1, 3, 4 in this case.
Therefore the result is:
"java string search"
"java search code"
"c++ java code search"
I'm not sure how much more efficient we can make it - since we can now do the lookup in worst-case linear time, and we have the advantage that it's probably quite likely we could store this input in a database, and not have to recompute it constantly (and for each new string insertion, we just have to insert a couple more rows into the database).