Yahoo Interview Question
Software Engineer / Developers
I guess the interviewer meant O(n). I don't think it can be done in O(1).
Secondly, I do not understand why people look for the most complex solution to a question. We do not need the heavy machinery of hashing, sorted lists, and merging.
This can be solved in O(n) time with only three variables: one boolean and two ints for maintaining the count.
If you do not agree with me, reread the question carefully: it says "large" text file and asks about "space" complexity, i.e. memory. With a hash-and-merge approach on a clearly "large" file you are going to build a LARGE hash table, and then use even more memory for the merge, i.e. huge memory consumption. So that solution does not even address the spec.
Following is the algorithm with O(1) (i.e. constant) memory and O(n) computation.
This is the basic algorithm; it can be refined for all boundary conditions.
int FindShortestDistance( const char *ipWord1, const char *ipWord2 )
{
    ifstream reader;        // assume properly initialized with a file name & opened
    char wordRead[100];     // handle the size more carefully in real code
    char whiteSpace = ' ';  // word delimiter
    bool isFirstPatternFound = false;
    int shortestDistance = 0;
    int currentCount = 0;

    // Start reading the file word by word.
    while( !reader.eof() )
    {
        // get() takes a delimiter (default '\n'), so pass ' ' to read one word;
        // then skip over the delimiter itself.
        reader.get( wordRead, sizeof(wordRead), whiteSpace );
        reader.ignore();
        if( 0 == strcmp( ipWord1, wordRead ) )
        {   // 1st pattern found; reset the count every time it appears, because
            // only restarting from the latest occurrence gives the shortest distance
            isFirstPatternFound = true;
            currentCount = 0;
        }
        else if( 0 == strcmp( ipWord2, wordRead ) )
        {   // 2nd pattern found
            if( !isFirstPatternFound )
            {   // starting word not yet seen
                continue;
            }
            // both patterns found at least once
            if( currentCount < shortestDistance || 0 == shortestDistance )
            {
                shortestDistance = currentCount;
            }
            // the 1st pattern can occur a 2nd, 3rd, ... nth time,
            // so reset for the onward scan
            isFirstPatternFound = false;
        }
        else if( isFirstPatternFound )
        {   // count only the words between the 1st pattern and the 2nd
            ++currentCount;
        }
    }
    return shortestDistance;
}
It's not completely O(n), since strcmp itself takes O(k) time (taking k to be the average length of each word), so the overall efficiency becomes O(nk). I am not sure whether we need to account for k; I guess that depends on what the interviewer wants.
Gaurav, I am not sure this can be avoided.
I am reading the file one word at a time, i.e. O(k) per word (assuming k is the average length).
strcmp then compares up to k characters in the worst case. For a file with n words I get O(nk) to read plus O(nk) for strcmp.
So the total complexity = O(nk) + O(nk) ~ O(nk).
A second, perhaps better option is to compute a hash (don't build a hash table, just compute the hash value of each word).
Just replace the call to strcmp with something like getHashedValue(currentWord), then compare the hashed values of ipWord1 and ipWord2 with the hashed value of currentWord.
Assuming your hash function does not depend on the length of the word, hash generation then takes O(n) time overall.
But I cannot think of a hash function that does not touch each of the word's k characters to compute its hash. If you can guarantee such a hash function, then:
Total complexity = O(nk) for reading + O(n) for hash generation & comparison, i.e.
~ O(nk)
I don't think it can be optimized further: no reading mechanism can read a file of n words × k characters in less than nk time.
Let me know what you or others think about 1) my hash-based approach and 2) the overall O(nk) complexity.
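The hash idea above can be sketched as follows. This is a hypothetical illustration (class and method names are mine, not from the thread): hash the two target words once up front, then compare each word read against the precomputed hashes, falling back to a full string comparison only on a hash match to guard against collisions. Note the hash still touches every character, so the total remains O(nk).

```java
public class HashCompare {
    // Simple polynomial hash; touches all k characters of the word.
    static long hash(String w) {
        long h = 0;
        for (int i = 0; i < w.length(); i++) {
            h = h * 31 + w.charAt(i);
        }
        return h;
    }

    // Cheap hash comparison first; full equals() only on a hash hit,
    // to rule out collisions.
    static boolean matches(String word, String target, long targetHash) {
        return hash(word) == targetHash && word.equals(target);
    }

    public static void main(String[] args) {
        long h1 = hash("yahoo");                       // computed once per query word
        System.out.println(matches("yahoo", "yahoo", h1)); // true
        System.out.println(matches("was", "yahoo", h1));   // false
    }
}
```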
I think The Hercules is missing some cases: he looks for the distance from word1 to word2 and ignores word2 to word1.
I was under the impression that the question asks for the minimum distance between ipWord1 and ipWord2 "starting" from ipWord1.
If, however, either word can come first and the distance must be computed from either side, the same "algorithm or logic" can be extended to handle that case.
I gave a basic sketch of the logic, not compilable, running, exception-proof code. Just extend the idea to handle those cases accordingly if you end up writing this functionality for an application to be used in production.
Here is how the algorithm goes:
int shortestDistance( char *firstWord, char *secondWord )
local variables:
bool firstWordFound = false, secondWordFound = false;
int count = 0, minDistance = 0;
char *readWord = NULL;
STEP 0. if strcmp( firstWord, secondWord ) == 0 then // if both words are the same, the distance between them is zero
STEP 1.   return 0
STEP 2. while not file.eof() do // loop until end of file
STEP 3.   read the next whitespace-separated word from the file into readWord
STEP 4.   if strcmp( readWord, firstWord ) == 0 then // compare the read word with the first word
STEP 5.     if firstWordFound == false then // first word found for the first time
STEP 6.       firstWordFound = true
STEP 7.     else if firstWordFound == true then // the same word found a second time
STEP 8.       firstWordFound = false
STEP 9.   else if strcmp( readWord, secondWord ) == 0 then // compare with the second word
STEP 10.    if secondWordFound == false then // second word found for the first time
STEP 11.      secondWordFound = true
STEP 12.    else if secondWordFound == true then // the same word found a second time
STEP 13.      secondWordFound = false
STEP 14.  else
STEP 15.    if firstWordFound || secondWordFound then // count the distance once either word has been found
STEP 16.      count++
STEP 17.  if firstWordFound && secondWordFound then // when both words are found, store the count in an array; the file is large, so expect many combinations
STEP 18.    shortDistanceArray[ num ] = count; num++; count = 0;
STEP 19. end // while loop
STEP 20. return minNumberFromArray( shortDistanceArray ); // find the minimum of the array in O(n) time
So the overall complexity is linear.
import java.util.StringTokenizer;

public class WordDistance {
    private static String TEXT = "as was is the was the yahoo you me was the and";
    private static String WORD1 = "yahoo";
    private static String WORD2 = "was";

    public static void main(String[] args) {
        StringTokenizer parser = new StringTokenizer(TEXT);
        int word1index = -1;
        int word2index = -1;
        int shortestDistance = Integer.MAX_VALUE;
        int sword1index = -1;
        int sword2index = -1;
        for (int wordIndex = 0; parser.hasMoreTokens(); ++wordIndex) {
            String word = parser.nextToken();
            if (word.equals(WORD1)) {
                word1index = wordIndex;
                if (word2index >= 0) {
                    if (Math.abs(word2index - word1index) < shortestDistance) {
                        shortestDistance = Math.abs(word2index - word1index);
                        sword1index = word1index;
                        sword2index = word2index;
                    }
                }
                word2index = -1;
            } else if (word.equals(WORD2)) {
                word2index = wordIndex;
                if (word1index >= 0) {
                    if (Math.abs(word2index - word1index) < shortestDistance) {
                        shortestDistance = Math.abs(word2index - word1index);
                        sword1index = word1index;
                        sword2index = word2index;
                    }
                }
                word1index = -1;
            }
        }
        System.out.println("Shortest distance = " + (shortestDistance - 1));
        System.out.println("word1 index = " + sword1index);
        System.out.println("word2 index = " + sword2index);
    }
}
If you are allowed O(n^2) space and O(n^2) preprocessing time, it is possible to answer each query in O(1).
Preprocess the file and compute Distance[Word1][Word2] (with Word1 < Word2 in dictionary order). This takes O(n^2) time and space.
For a specific query of words w1, w2, return Distance[w1][w2]; WLOG w1 < w2 in dictionary order. This is O(1).
The preprocessing space and time requirements are admittedly bad, but that is a one-time cost not seen by queries.
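The preprocessing pass above could look something like this sketch (class name, key encoding, and the single-pass pairing strategy are my assumptions, not from the thread): keep the latest position of every word seen so far, and whenever a word is read, update its minimum distance to each previously seen word.

```java
import java.util.HashMap;
import java.util.Map;

public class DistanceTable {
    // Build a map from "smaller|larger" word pairs to their minimum distance.
    static Map<String, Integer> table(String[] words) {
        Map<String, Integer> last = new HashMap<>();  // latest index of each word
        Map<String, Integer> dist = new HashMap<>();  // "w1|w2" -> min distance
        for (int i = 0; i < words.length; i++) {
            String a = words[i];
            for (Map.Entry<String, Integer> e : last.entrySet()) {
                String b = e.getKey();
                if (a.equals(b)) continue;            // skip self-pairs
                // canonical key: dictionary-smaller word first
                String key = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
                int d = i - e.getValue();             // distance to b's latest occurrence
                dist.merge(key, d, Math::min);
            }
            last.put(a, i);
        }
        return dist;
    }

    public static void main(String[] args) {
        Map<String, Integer> d = table("the quick the fox quick".split(" "));
        System.out.println(d.get("quick|the")); // O(1) lookup after preprocessing
    }
}
```

Note this pass is worse than linear (each word pairs with every distinct word seen so far), which is exactly the one-time cost the comment acknowledges.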
I agree, a trie is the correct data structure for this problem.
We can modify the trie's node structure to hold an array of the indexes at which that word occurs in the file.
Given two words, simply look them up in the trie to obtain the arrays of their occurrence indexes (which are obviously sorted); then process those two arrays to obtain the minimum word distance.
Take five variables: currentindex, firstwordindex, secondwordindex, currentdistance, and mindistance.
Initialise firstwordindex and secondwordindex to -infinity and mindistance to +infinity.
Read the file word by word, incrementing currentindex for each word.
if firstword is found
{
firstwordindex = currentindex;
currentdistance = firstwordindex - secondwordindex;
if (currentdistance < mindistance)
mindistance = currentdistance;
}
if secondword is found
{
secondwordindex = currentindex;
currentdistance = secondwordindex - firstwordindex;
if (currentdistance < mindistance)
mindistance = currentdistance;
}
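The scan above can be sketched concretely as follows (class and method names are mine; distance is counted as the difference of word indexes, and the indexes start far apart so the first match cannot produce a spurious minimum):

```java
public class EitherOrderDistance {
    // Single pass tracking the latest index of each word; either word
    // may come first. Returns the minimum index difference.
    static int minDistance(String[] words, String w1, String w2) {
        int i1 = Integer.MIN_VALUE / 2;  // "-infinity" for word1's latest index
        int i2 = Integer.MIN_VALUE / 2;  // "-infinity" for word2's latest index
        int best = Integer.MAX_VALUE;    // "+infinity" for the minimum distance
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(w1)) {
                i1 = i;
                best = Math.min(best, i1 - i2);
            } else if (words[i].equals(w2)) {
                i2 = i;
                best = Math.min(best, i2 - i1);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] text = "as was is the was the yahoo you me was the and".split(" ");
        System.out.println(minDistance(text, "yahoo", "was")); // prints 2
    }
}
```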
Java implementation; the search takes O(n), where n is the number of words in the file.
public class ShortestDistance {
    public static void main(String[] args) {
        ShortestDistance sd = new ShortestDistance();
        String sentence = "c a b e f b c a b a a d a a d c b e f c f g h b c";
        System.out.println(sentence + ":"
                + sd.shortestDistance(sentence, "b", "c"));
    }

    public int shortestDistance(String sentence, String sWord, String eWord) {
        int shortestDistance = -1;
        String[] words = sentence.split(" ");
        boolean sWordFound = false;
        int count = 0;
        for (int i = 0; i < words.length; i++) {
            if (sWordFound) {
                count++;
            }
            if (words[i].equalsIgnoreCase(sWord)) {
                sWordFound = true;
                // Reset
                count = 0;
            } else if (words[i].equalsIgnoreCase(eWord) && sWordFound) {
                if (shortestDistance == -1 || shortestDistance > (count - 1)) {
                    shortestDistance = (count - 1);
                }
                // Reset
                sWordFound = false;
                count = 0;
            }
        }
        return shortestDistance;
    }
}
I don't think it can be done in O(1). The worst case I can think of is an input of n words where the first n-1 are the same and the last word is different, and we have to find the least distance between those two words; that case is O(n). But it won't depend on n if the input text has normal sentences with proper grammar. :)
Here goes my solution (it uses tries):
1) Read the input character by character and update the trie.
2) When the input character is a <space>, update the trie entry of the word just completed and append that word's position to its list.
----
When the pair of query words is given, simply traverse the trie.
We get two lists containing all the positions of the two words, sorted in ascending order.
Use the merge step of merge sort to find the least distance.
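That merge step might look like this sketch (class and method names are my own): given the two sorted position lists from the trie, advance a pointer into each list, always moving the one with the smaller position, which examines each entry once, i.e. O(p + q) for list lengths p and q.

```java
public class MinListDistance {
    // Two-pointer merge over sorted position lists: the minimum
    // |a[i] - b[j]| is always found by advancing the smaller side.
    static int minDistance(int[] a, int[] b) {
        int i = 0, j = 0, best = Integer.MAX_VALUE;
        while (i < a.length && j < b.length) {
            best = Math.min(best, Math.abs(a[i] - b[j]));
            // advance whichever pointer sits at the smaller position
            if (a[i] < b[j]) i++; else j++;
        }
        return best;
    }

    public static void main(String[] args) {
        int[] w1 = {1, 7, 12};  // positions of the first word
        int[] w2 = {4, 9, 20};  // positions of the second word
        System.out.println(minDistance(w1, w2)); // prints 2 (|7 - 9|)
    }
}
```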
How large is the file? If file is too big hashing keywords may become quite expensive...
- Vlad January 25, 2008