Yahoo Interview Question
Software Development ManagersTeam: Y! Suite
Country: United States
Interview Type: In-Person
#define ALPH_SIZE 256 //change as desired based on encoding
int count[ALPH_SIZE], i;
if( s1.length != s2.length ) return false;
for(i=0; i < s1.length ; i++) count[ s1[i] ]++, count[ s2[i] ]--;
for(i=0; i < ALPH_SIZE ) if ( count[i] ) return false;
return true;
Use unsigned int or another "rollover/under" save integer type above.
unsigned int count[ALPH_SIZE]; //or keep it as "int" if roll-under is "safe"
Or modify as necessary for your "more strict" platform.
#define ALPH_SIZE 256
unsigned int count[ALPH_SIZE], i;
if( s1.length != s2.length ) return false;
for(i=0; i < s1.length ; i++) count[ s1[i] ]++, count[ s2[i] ]--;
for(i=0; i < ALPH_SIZE ; i++) if ( count[i] ) return false;
return true;
The time complexity of the above code is O(m + s), where m is the length of each string and s is the size of the alphabet. If s could potentially be large in comparison to m, we can use a hash-based set instead of an array to make the complexity O(m).
Hello everyone,
Most of the time for anagram related questions we create a int array of 256 length.
Could you please comment if our solutions will work if strings are not consist of English alphabets. I mean if strings are in some other language (e.g., Chinese, Japanese). Will an array of 256 length will suffice in that case.
@Rakesh - The data structure used for storing character occurrence counts depends on the range the characters in the input string.
If the input contains only alpha-numeric then an array of constant size would suffice.
If the input contains wide range of characters then the hash table would save lot of memory.
It all depends on the assumptions which you take during the interview.
@Eugene, Yes correct. I have commented on the O(s) fixed being too large in a reply somewhere else on this thread.
@Siva, In this problem, getting a workable algorithm comes first, the rest is something easily _designable_ with the interviewer (or engineering team), like you say.
That is, having a DS that is fixed O(s=encoding size), is better than having one that is O(m=stringsize) for this problem, IMO.
Ok I'm biased to a single language alphabet in my answer above ^^^^
Hasing, I feel, is not a "algorithm." Some DSs have natural algorithms coming with them, but "use hash" seems like it's not explaning much. In this problem, I think it's better to think of Hash as a final optimization step to get memory size and/or runtime down for specific applications. And there are many hashing techniques... here is one such idea for a specific application niche (probably best application for this question: 1 language, strings on the alphabet compared) ....
One interesting way to hash to get memory size and runtime O(s) down is to strip out all parts of the encoding that are irrelevant to the problem (that is, anagrams usually refer to strings on the actual alphabet characters).
We can make due without even 256 sized count array (even if we are living in ASCII) with a one-to-one hash that maps 'a' to 'z' into integers 0 to 25, and also characters 'A' to 'Z' into 26 to 51 , irrespective of the underlying encoding size.
#define ALPH_SIZE 26 //size of real alphabet
unsigned count int[2*ALPH_SIZE + 1]; //lower, upper case & other
//shifted hash (this is one-to-one on the actual alphabets)
unsigned int alphahash( char x )
{
return ( x >= 'a' && x <='z' ? x-'a' : //lower case
x >= 'A' && x <='Z' ? x-'A'+ALPH_SIZE : //upper case
2*ALPH_SIZE); //all other characters (junk)
//or add other ranges (numbers?)
}
Since the above hash is invertible on the alphabets, we can do the final anagram for-loop just as easily in ~C*52 fixed checks regardless of encoding size (note, this is for this special 1 language anagram niche application).
So O(s=encoding size) become C*52.
Yeah, we can optimize and modify solutions to make them super amazing, but we shouldn't simply post the most fancy one without any explanation. Rather _maybe_ we should show how to "develop" a solution.
Idea => Idea2 => Idea3 => Idea4 => Idea5 => {Optimize/improve/use tricks}
This problem has a "non-trick" chain of ideas that can lead to a good solution for even someone who has never done such a problem.
Anyways time to take a giant break from this place because I'm tired of the way people post questions and how people post responses. People are doing things to save themselves time (like rushing a question post with lack of details or posting a response hoping for validation or a code check) and causing the whole system to become extremely inefficient.
Most people post their code as a response to questions simply as a way to get other people to find bugs in their code.
1) "If we can manage to linear sort the strings, this preprocessing step is free since we have to check every letter regardless"
2) "Keep a count of characters we've seen as we sweep each string from left to right"
Above are two such ways to explain how to get an idea for the solution:
"Use a hash" off the bat is not the way to go.
typedef unsigned int uint;
#define ALPH_SIZE 26 //for English alphabet
//so long as lower, upper case parts live in contiguous
//ranges of any underlying encoding (ascii, etc.)
uint count[2*ALPH_SIZE + 1]; //lower, upper case & other junk
uint h( char x )
{
return (x >= 'a' && x <='z' ? x-'a' : //lower case
x >= 'A' && x <='Z' ? x-'A'+ALPH_SIZE : //upper case
2*ALPH_SIZE) //all other characters
}
/* precondition: you supply it with strings on English alphabet
that is, the rest of the code in your program
should not create junk strings/characters if you want proper results */
if( s1.length != s2.length ) return false;
for(i=0; i < s1.length ; i++) count[h(s1[i])]++, count[h(s2[i])]--;
for(i=0; i < 2*ALPH_SIZE ; i++) if ( count[i] ) return false;
return true;
There is much better way to check if two strings are anagrams O(n) time and O(1) space where n is the length of the longest string.
1) Assign each of the 26 letters a unique prime number i.e. {'a'=>2, 'b'=>3, 'c'=>5, 'd'=>7, 'e'=>11, ..., 'l' = 37, ..., 's' => 67, .... ,'t' => 71,.... 'z' => 101}
2) All the anagrams of a word will have the same UNIQUE product. For example: prod(TEA) = prod(ATE) = prod(ETA) = prod(TAE) = prod(AET) = prod(EAT) = 71*11*2 = 154
public static final int primes[] = {2, 3, 5, 7, 11, 13, ......, 101} ;
public boolean isAnagram (String word1, String word2)
{
if(word1.length() != word2.length())
return false;
word1 = word1.toLower();
word2 = word2.toLower();
int word1Key = 1;
for(int i=0; i<word1.length; i++)
word1Key*=primes[word1[i]-'a'];
int word2Key = 1;
for(int i=0; i<word2.length; i++)
word2Key*=primes[word2[i]-'a'];
return word1Key == word2Key;
}
Sorting two strings would take O(nlogn) at best. If you're concerned about time, this isn't the best approach. The better approach would be to use a hash map. This would give you an time of O(n). However, if you have really long strings, you could have a problem with space.
public class AnagramChecker {
public static void main(String[] args) {
String s1 = "this is a test";
String s2 = "this test is a";
String s3 = "this test is a";
String s4 = "this is a tes1";
if (isAnagram(s1,s2)) {
System.out.println("these are anagrams, string 1: " + s1 + " and string 2:" + s2);
} else {
System.out.println("these are not anagrams, string 1: " + s1 + " and string 2:" + s2);
}
if (isAnagram(s1,s3)) {
System.out.println("these are anagrams, string 1: " + s1 + " and string 2:" + s3);
} else {
System.out.println("these are not anagrams, string 1: " + s1 + " and string 2:" + s3);
}
if (isAnagram(s1,s4)) {
System.out.println("these are anagrams, string 1: " + s1 + " and string 2:" + s4);
} else {
System.out.println("these are not anagrams, string 1: " + s1 + " and string 2:" + s4);
}
}
private static boolean isAnagram(String s1, String s2) {
Map<String,Integer> letters = new HashMap<String,Integer>();
if (s1.length() != s2.length()) return false;
for (int i = 0; i < s1.length(); i++) {
String cachedLetter = String.valueOf(s1.charAt(i));
Integer count = letters.get(cachedLetter);
if ( count == null ) {
letters.put(cachedLetter, 1);
} else {
letters.put(cachedLetter, count + 1);
}
}
for (int i = 0; i < s2.length(); i++) {
String cachedLetter = String.valueOf(s2.charAt(i));
Integer count = letters.get(cachedLetter);
if ( count == null ) {
return false;
} else {
count = count - 1;
if (count != 0)
letters.put(cachedLetter, count);
else
letters.remove(cachedLetter);
}
}
return true;
}
}
RC, nlgn is only a minimum on < comparison sorting.
And no need for hash table or any storage that grows with string size.
Below uses fixed O(encoding size) storage regardless of strings.
Try,
// put in a function returning true if s1, s2 are anagrams
#define ALPH_SIZE 256 //change as desired based on encoding
int count[ALPH_SIZE], i;
if( s1.length != s2.length ) return false;
for(i=0; i < s1.length ; i++) count[ s1[i] ]++, count[ s2[i] ]--;
for(i=0; i < ALPH_SIZE ) if ( count[i] ) return false;
return true;
//please fix bugs too
Note: Above is essentially optimized bucket sort (stripping out the parts not needed from it).
That is, optimized form of abstract solution below(m= string length):
1) t1 = linearsort(s1), t2 = linearsort(s2) <=== O(m) work
2) now compare t1 and t2 letter by letter <=== O(m) work
Any linear sort followed by step through comparing strings works optimally.
The optimized form above came to me after thinking in abstract about linear sorts (then optimizing by "strip out"). The linear sort came to me because I wanted to match the theoretical minimum for the problem as best as possible.
No magic rabbit out of a hat needed. Can we start explaining how we are "thinking" of solutions? Of course, to some people that answer is "I remember this question like the 1000 other questions I remember" but .... and least reverse engineer what you remember as a solution into a thought out process?
Use unsigned int or another "rollover/under" save integer type above.
unsigned int count[ALPH_SIZE]; //or keep it as "int" if roll-under is "safe"
Your 'bucket sort' is really just a hash table with the key being the value of the character as the index and the value in the index. The problem with this solution is the space complexity. So, while your solution doesn't grow with string size, it does grow with character base.
Hashtables that don't have perfect functions do have problems. As the taken keys increase, the table must increase. This could be a problem with run-time.
I still do not get it, what are you correcting with your reply? Any objections?
Count array method above is basically a hash table with trivial hash h(x) = x, that keeps counts of keys seen instead of storing satellite data (i.e., keeps information about duplicate keys). We all know this.
The count array comes naturally from thinking of the abstract solution using linear sorts. To call it a hash table right off the bat is not the right way to go here. We can call "any" array with some meaning given to the index a hash table, since h(index) = index is used to index into the array (implicitly, without a hash function). That's like calling a cherry bomb a WMD.
If someone says "use a hash table of size O(stringsize)" then that is pulling a rabbit out of a hat.
I have explained the process of developing the solution, which is roughly: try to match. theoretical min. => pre-process by linear sort => bucket sort is one such type of sort => strip out unnecessary bits => keep a tally of letters
If character encoding range is too big for your app., then the logical adjustment is another => hash the character counts (to get smaller memory usage). e.g., if you don't want to use a sizeof(int)x64kB count array for unicode in Java, then the adjustment is minor. For one you can use "byte" or "char" then it's only 64kBytes memory, else you can map onto a smaller chunk of memory (e.g., a type of hashing).
And you are incorrect, there is no problem at runtime. The size of the tally array (or tally hash array if you want to go that route) is a matter of design, not a runtime problem. Encoding is fixed at compile time.
import java.util.HashMap;
public class Anagram {
static HashMap<Character,Integer> Anagram_table = new HashMap<Character,Integer>();
public Anagram() {
}
public static boolean checkIfAnagram(String s1,String s2)
{
if(s1.length()!=s2.length())
return false;
for(int i =0;i < s1.length();i++)
{ char c = s1.charAt(i);
int count =0;
if(!Anagram_table.containsKey(c))
Anagram_table.put(c, ++count);
else
Anagram_table.put(c,(Anagram_table.get(c))+1);
}
for(int j =0;j<s2.length();j++)
{
char c = s2.charAt(j);
if(Anagram_table.containsKey(c))
{
Anagram_table.put(c,Anagram_table.get(c)-1);
if(Anagram_table.get(c)==0)
Anagram_table.remove(c);
}
else
return false;
}
return(Anagram_table.isEmpty());
}
}
For this problem, why can't we just put all the characters of the string into a SET and then try to put the characters of the second string into the set. If the SET already contains the character, we should remove it. In the end, if the set is empty, we know its an anagram :). This o(n) solution
If the two nonanagram strings are different lengths and we know the lengths ahead of time, we're done ( O(1) ) but ignore that case.
- S O U N D W A V E October 20, 2013Let m be length of a string (note, they are both equal in length in interesting case).
{Note: Since the problem had no "letters" I, myself, define a letter instead of bringing "historical reasons" as a hidden assumption for what they mean}
Theoretical MIN. for solving the problem is clearly O(m) - you have to check every letter of the strings. Note, O(m) + O(m) = O(m) so I do mean checking both strings.
Can we find a solution that meets the theoretical min.?
Sure, in the abstract, let s1, s2 be our strings.
1) t1 = linearsort(s1), t2 = linearsort(s2) <=== O(m) work
2) now compare t1 and t2 letter by letter <=== O(m) work
Total O(m). We win.
=================
Details: What linear sort you use is a detail that changes the constants hidden in the big O. We can optimize step by step:
1) Counting sort works fine, but we don't need the stability, so maybe it's enough to...
2) BucketSort or basically count number of letters of each string. {Good enough I think}
{Now you can optimize 2) even further with a trick, of counting UP with one string, then DOWN with the other on the same tally array ... this saves some space and time... big deal who cares }