Goldman Sachs Interview Question
Senior Software Development Engineers
Country: United States
Hmm.
So your suggestion is to create a hash map whose keys are character-count arrays sized to the character set (256 for ASCII, or 65536 for Unicode), each mapping to a list of words.
That is a bit of an overkill in terms of space.
Why not sort each word before inserting it, and put it in a HashMap with the sorted representation as the key and the list of anagrams for that key as the value?
The problem with using sorting is that sorting is expensive.
However, it depends on what is provided to you; if a lot of space is given, we can even do this: have two hash maps, one mapping each word to its sorted characters, the other mapping the sorted characters to all of their anagrams.
My idea would be:
Why don't we use the product of the letter values of each character to calculate the hash key?
For example:
A=1, B=2, C=3, D=4, E=5, F=6, G=7, H=8, I=9, J=10, K=11, L=12, M=13,
N=14, O=15, P=16, Q=17, R=18, S=19, T=20, U=21, V=22, W=23, X=24, Y=25, Z=26
Then, the hash key for POT = P*O*T = 16*15*20 = 4800.
The hash key for any anagram of POT will be the same as well, because multiplication is commutative:
hash key for TOP = 20*15*16 = hash key for OPT = 15*16*20 = hash key for POT = P*O*T = 16*15*20.
Hence, we can maintain a chain of all anagrams in a hash table, and whenever we search for a word, we calculate its key and return the entire list.
One more optimization: maintain a different hash map for every word length.
That is, for words of length 3 we have one hash map, for words of length 4 we have a different one.
This avoids hash-key collisions between words of different lengths.
For example, CAT and DO produce the same hash key of 60, but with an independent hash map for each word length, these two words reside in separate maps and are easy to search for.
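A minimal sketch of the product-of-values keying idea above, using the poster's A=1..Z=26 assignment. Since multiplication is commutative, every anagram of a word yields the same key; pairing the key with the word length keeps words like CAT and DO (both product 60) apart. One caveat worth noting: distinct same-length words can also share a product (e.g. BC and AF both give 2*3 = 1*6 = 6), so bucket entries must still be compared.

```python
# Sketch of the product-of-letter-values hash key (A=1 .. Z=26).
def product_key(word):
    key = 1
    for ch in word.upper():
        key *= ord(ch) - ord('A') + 1
    return key

# Bucket words by (length, product) so different-length collisions
# such as CAT vs DO land in separate buckets.
anagrams = {}
for w in ["POT", "TOP", "OPT", "CAT", "DO"]:
    anagrams.setdefault((len(w), product_key(w)), []).append(w)
```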
def getwords():
    words = {}
    with open("sherlockholmes.txt") as file:
        for line in file:
            for word in line.split():
                # lowercase BEFORE sorting, so "Pot" and "pot" share a key
                key = ''.join(sorted(word.lower()))
                if key in words:
                    words[key].add(word)
                else:
                    words[key] = set([word])
    return words

if __name__ == '__main__':
    words = getwords()
    for key in words:
        if len(words[key]) > 3:
            print(words[key])
It would be easier to build a trie from the data, which maps the different paths that can lead to a legitimate word in the dictionary.
Once we have the trie, searching only requires us to start from the root and take the paths that match the individual letters in the input string. If the word is X letters long, then we essentially go down to depth X along each path. The comparison at each decision point is the dominant action, and a node can have at most 26 children in the English dictionary, so at most we are talking about 26 * X comparisons along one path. Multiplied by X possible paths, this gives on the order of 26 * X^2 comparisons. I am assuming, for the worst case, that the dictionary contains all possible permutations of the input.
Since X is the length of the input string and does not depend on the size of the dictionary, this is effectively constant computation time. The cost of building the trie is of course another matter, but it is a one-time cost.
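One concrete way to build the trie for anagram lookup (an assumed scheme, not spelled out above): insert each word under its *sorted* letters, so all anagrams of a word share a single root-to-leaf path whose depth equals the word length X.

```python
# Trie nodes are plain dicts; '$' marks the end of a path and holds
# the set of original words whose sorted letters reach that node.
def trie_insert(root, word):
    node = root
    for ch in sorted(word.lower()):
        node = node.setdefault(ch, {})
    node.setdefault('$', set()).add(word)

def trie_lookup(root, word):
    node = root
    for ch in sorted(word.lower()):
        if ch not in node:
            return set()
        node = node[ch]
    return node.get('$', set())
```

A lookup walks at most X levels, with at most 26 branches per level, matching the 26 * X bound per path discussed above.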
In Java (a fragment: `word` is a field of the wrapper class these overrides live on), hard-coded here for anagrams of "put":
int hashCode() { // ensures all words of length 3 go into the same bucket
    return word.length();
}

boolean equals(Object o) {
    String s = (String) o;
    return StringUtils.countMatches(s, "p") == 1
        && StringUtils.countMatches(s, "u") == 1
        && StringUtils.countMatches(s, "t") == 1;
}
Use either a HashSet or a HashMap.
/* Function to print permutations of a string.
   This function takes three parameters:
   1. The string
   2. Starting index of the string
   3. Ending index of the string */
#include <stdio.h>

void swap(char *x, char *y)
{
    char tmp = *x;
    *x = *y;
    *y = tmp;
}

void permute(char *a, int i, int n)
{
    int j;
    if (i == n)
        printf("%s\n", a);
    else
    {
        for (j = i; j <= n; j++)
        {
            swap(a + i, a + j);
            permute(a, i + 1, n);
            swap(a + i, a + j); /* backtrack */
        }
    }
}
We can assume that n (the number of words in the file) is much bigger than m (the average number of letters in each word), hence we can use the following key for a hash map:
1. For each word to be inserted, sort the word first.
2. Now use the following to find the hash key.
E.g. if TOP, POT, OPT are the words, after sorting they all give OPT.
Now assume A = 0 and Z = 25.
Therefore TOP, POT, OPT all map to the hash key value O*26^0 + P*26^1 + T*26^2. (The base must be at least 26, since the letter values run from 0 to 25; base 10 would collide for letters J through Z.)
This generates a unique number for any two words that differ in at least one letter.
Hence all words can be mapped into one hash map.
Therefore, insertion takes O(n*m*log m), but searching, deletion, and updates take O(1) time with respect to n.
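A sketch of this positional key, using base 26 so that the letter values 0..25 act as digits (base 10 would collide, since J through Z exceed 9):

```python
# Positional key over the sorted letters, A=0 .. Z=25, radix 26.
# All anagrams sort to the same spelling and thus share a key.
def positional_key(word):
    key = 0
    for i, ch in enumerate(sorted(word.upper())):
        key += (ord(ch) - ord('A')) * (26 ** i)
    return key
```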
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnagramsInFile {

    public static void main(String[] args) {
        List<String> anagramsInFile = new ArrayList<String>();
        anagramsInFile.add("POT");
        anagramsInFile.add("OTp");
        anagramsInFile.add("ottp");
        anagramsInFile.add("Ppot");
        anagramsInFile.add("pot");
        anagramsInFile.add("top");
        anagramsInFile.add("tpo");
        anagramsInFile.add("pto");
        anagramsInFile.add("opt");
        System.out.println(findAnagrams(anagramsInFile, "pot"));
    }

    /**
     * Overall time complexity: with m elements in the list, of average
     * length r, this runs in O(m * r log r).
     */
    private static Object findAnagrams(List<String> anagramsInFile,
            String findAnagramOfText) {
        Map<Integer, List<String>> mappingAnagram = new HashMap<Integer, List<String>>();
        // O(m) iterations
        for (String anagrString : anagramsInFile) {
            // O(r log r) per word
            int key = getKey(anagrString);
            if (!mappingAnagram.containsKey(key)) {
                mappingAnagram.put(key, new ArrayList<String>());
            }
            mappingAnagram.get(key).add(anagrString);
        }
        return mappingAnagram.get(getKey(findAnagramOfText));
    }

    /**
     * This dominates the per-word cost: O(r log r) for a word of length r.
     * Note that using hashCode() of the sorted string as the map key can
     * collide for different sorted strings; keying on the sorted string
     * itself would be safer.
     */
    private static int getKey(String anagrString) {
        if (anagrString == null || anagrString.trim().length() == 0)
            return -1;
        // O(r)
        char[] sortedStringChar = anagrString.toLowerCase().toCharArray();
        // O(r log r)
        Arrays.sort(sortedStringChar);
        return String.valueOf(sortedStringChar).hashCode();
    }
}
by sumit dey
public class Test {
    public static void main(String[] args) {
        String[] arr = {"POT", "VDSF", "OPT", "TOP", "ME", "LOP"};
        String user = "POT";
        char[] arr1 = user.toCharArray();
        int sum = 1;
        for (int i = 0; i < arr1.length; i++) {
            sum *= (int) arr1[i];
        }
        System.out.println(sum);
        for (int i = 0; i < arr.length; i++) {
            char[] arr2 = arr[i].toCharArray();
            int sum1 = 1;
            for (int j = 0; j < arr[i].length(); j++) {
                sum1 *= (int) arr2[j];
            }
            if (sum1 == sum) {
                System.out.println(arr[i]);
            }
        }
    }
}
Assumptions:
1. The file is really large.
2. The sequence to be searched is very small compared to the size of the file.
We can create a hash map of strings that holds all the permutations of the anagram to be searched. Building it costs O(m!), where m is the length of the word to be searched. Once we have such a map, we read every word from the file and check it against the map; each lookup is O(1).
Total complexity = O(n) + O(m!), where n is the number of words in the file and m is the length of the word to be searched.
Space complexity = O(m!) --> the size of the hash map.
This method should work well as long as n >> m.
Example: if the file has 1000 words and we need to search for TOP, whose length is 3, then the number of entries in the hash map is 3! = 6.
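A sketch of this approach: precompute the set of all permutations of the query word (O(m!) of them), then stream the file's words against the set with an O(1) membership check per word.

```python
from itertools import permutations

# Build the permutation set once, then filter the word stream.
def find_anagrams(words, query):
    perms = {''.join(p) for p in permutations(query)}
    return [w for w in words if w in perms]
```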
Have a hashmap for which the key is an array containing the count of all the letters occurring in the word, and as a value have a set in which you put any words which have those letters. For example, if the text contains POT and TOP then for the array (a[p]=1, a[t]=1, a[o]=1) the set will contain (POT, TOP).
When the user enters a query word, simply generate the array which counts all the letters and use it to retrieve all the anagrams.
- Anonymous April 11, 2013
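A sketch of the letter-count key described above; using a tuple of 26 counts (rather than a mutable array) makes the key hashable, and it is identical for every anagram of a word.

```python
# Tuple of 26 letter counts as the hash-map key.
def count_key(word):
    counts = [0] * 26
    for ch in word.lower():
        counts[ord(ch) - ord('a')] += 1
    return tuple(counts)

# Index words by their count key; all anagrams share a bucket.
index = {}
for w in ["POT", "TOP", "DOG"]:
    index.setdefault(count_key(w), set()).add(w)
```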