Yahoo Interview Question
Software Engineer / Developers
I guess the interviewer meant O(n). I don't think it can be done in O(1).
Secondly, I do not understand why people look for the most complex solution to a question. We do not need the heavy machinery of hashing, sorted lists, and merging.
This can be solved in O(n) time with only three variables: one boolean and two ints for maintaining the count.
If you do not agree with me, reread the question carefully: it says "large" text file and asks about "space" complexity, i.e. memory. With a hash-and-merge approach on a clearly "large" file you are going to build a LARGE hash table, and then use even more memory for the merge, i.e. huge memory consumption. So that solution does not even address the spec.
Following is the algorithm with O(1) (i.e. constant) memory and O(n) computation.
This is the basic algorithm; it can be refined for all boundary conditions.
int FindShortestDistance( const char *ipWord1, const char *ipWord2 )
{
    ifstream reader;        // assume properly initialized with a file name & opened
    char wordRead[100];     // handle the size more carefully in real code
    char whiteSpace = ' ';  // word delimiter
    bool isFirstPatternFound = false;
    int shortestDistance = 0;
    int currentCount = 0;

    // Start reading the file word by word.
    while( !reader.eof() )
    {
        // get() takes a delimiter (default '\n'), so pass ' ' to read one word;
        // then skip over the delimiter itself.
        reader.get( wordRead, sizeof(wordRead), whiteSpace );
        reader.ignore();
        if( 0 == strcmp( ipWord1, wordRead ) )
        {   // 1st pattern found; reset the count every time it appears, because
            // only restarting from the latest occurrence gives the shortest distance
            isFirstPatternFound = true;
            currentCount = 0;
        }
        else if( 0 == strcmp( ipWord2, wordRead ) )
        {   // 2nd pattern found
            if( !isFirstPatternFound )
            {   // starting word not yet seen
                continue;
            }
            // both patterns found at least once
            if( currentCount < shortestDistance || 0 == shortestDistance )
            {
                shortestDistance = currentCount;
            }
            // the 1st pattern can occur a 2nd, 3rd, ... nth time,
            // so reset for the onward scan
            isFirstPatternFound = false;
        }
        else if( isFirstPatternFound )
        {   // count only the words between the 1st pattern and the 2nd
            ++currentCount;
        }
    }
    return shortestDistance;
}
It's not completely O(n), since strcmp itself takes O(k) time (taking k to be the average length of each word), so the overall efficiency becomes O(nk). I am not sure whether we need to account for k; I guess that depends on what the interviewer wants.
Gaurav, I am not sure this can be avoided.
I am reading the file one word at a time, i.e. O(k) per word (assuming k is the average length).
strcmp then compares up to k characters in the worst case. For a file with n words I get O(nk) to read plus O(nk) for strcmp.
So the total complexity = O(nk) + O(nk) ~ O(nk).
A second, perhaps better option is to compute a hash (don't build a hash table, just compute the hash value of each word).
Just replace the call to strcmp with something like getHashedValue(currentWord), then compare the hashed values of ipWord1 and ipWord2 with the hashed value of currentWord.
Assuming your hash function does not depend on the length of the word, hash generation then takes O(n) time overall.
But I cannot think of a hash function that does not touch each of the word's k characters to compute its hash. If you can guarantee such a hash function, then:
Total complexity = O(nk) for reading + O(n) for hash generation & comparison, i.e.
~ O(nk)
I don't think it can be optimized further: no reading mechanism can read a file of n words × k characters in less than nk time.
Let me know what you or others think about 1) my hash-based approach and 2) the overall O(nk) complexity.
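The hash idea above can be sketched as follows. This is a hypothetical illustration (class and method names are mine, not from the thread): hash the two target words once up front, then compare each word read against the precomputed hashes, falling back to a full string comparison only on a hash match to guard against collisions. Note the hash still touches every character, so the total remains O(nk).

```java
public class HashCompare {
    // Simple polynomial hash; touches all k characters of the word.
    static long hash(String w) {
        long h = 0;
        for (int i = 0; i < w.length(); i++) {
            h = h * 31 + w.charAt(i);
        }
        return h;
    }

    // Cheap hash comparison first; full equals() only on a hash hit,
    // to rule out collisions.
    static boolean matches(String word, String target, long targetHash) {
        return hash(word) == targetHash && word.equals(target);
    }

    public static void main(String[] args) {
        long h1 = hash("yahoo");                       // computed once per query word
        System.out.println(matches("yahoo", "yahoo", h1)); // true
        System.out.println(matches("was", "yahoo", h1));   // false
    }
}
```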
I think The Hercules is missing some cases: he looks for the distance from word1 to word2 and ignores word2 to word1.
I was under the impression that the question asks for the minimum distance between ipWord1 and ipWord2 "starting" from ipWord1.
If, however, either word can come first and the distance must be computed from either side, the same "algorithm or logic" can be extended to handle that case.
I gave a basic sketch of the logic, not compilable, running, exception-proof code. Just extend the idea to handle those cases accordingly if you end up writing this functionality for an application to be used in production.
Here is how the algorithm goes:
int shortestDistance( char *firstWord, char *secondWord )
local variables:
bool firstWordFound = false, secondWordFound = false;
int count = 0, minDistance = 0;
char *readWord = NULL;
STEP 0. if strcmp( firstWord, secondWord ) == 0 then // if both words are the same, the distance between them is zero
STEP 1.   return 0
STEP 2. while not file.eof() do // loop until end of file
STEP 3.   read the next whitespace-separated word from the file into readWord
STEP 4.   if strcmp( readWord, firstWord ) == 0 then // compare the read word with the first word
STEP 5.     if firstWordFound == false then // first word found for the first time
STEP 6.       firstWordFound = true
STEP 7.     else if firstWordFound == true then // the same word found a second time
STEP 8.       firstWordFound = false
STEP 9.   else if strcmp( readWord, secondWord ) == 0 then // compare with the second word
STEP 10.    if secondWordFound == false then // second word found for the first time
STEP 11.      secondWordFound = true
STEP 12.    else if secondWordFound == true then // the same word found a second time
STEP 13.      secondWordFound = false
STEP 14.  else
STEP 15.    if firstWordFound || secondWordFound then // count the distance once either word has been found
STEP 16.      count++
STEP 17.  if firstWordFound && secondWordFound then // when both words are found, store the count in an array; the file is large, so expect many combinations
STEP 18.    shortDistanceArray[ num ] = count; num++; count = 0;
STEP 19. end // while loop
STEP 20. return minNumberFromArray( shortDistanceArray ); // find the minimum of the array in O(n) time
So the overall complexity is linear.
import java.util.StringTokenizer;

public class WordDistance {
    private static String TEXT = "as was is the was the yahoo you me was the and";
    private static String WORD1 = "yahoo";
    private static String WORD2 = "was";

    public static void main(String[] args) {
        StringTokenizer parser = new StringTokenizer(TEXT);
        int word1index = -1;
        int word2index = -1;
        int shortestDistance = Integer.MAX_VALUE;
        int sword1index = -1;
        int sword2index = -1;
        for (int wordIndex = 0; parser.hasMoreTokens(); ++wordIndex) {
            String word = parser.nextToken();
            if (word.equals(WORD1)) {
                word1index = wordIndex;
                if (word2index >= 0) {
                    if (Math.abs(word2index - word1index) < shortestDistance) {
                        shortestDistance = Math.abs(word2index - word1index);
                        sword1index = word1index;
                        sword2index = word2index;
                    }
                }
                word2index = -1;
            } else if (word.equals(WORD2)) {
                word2index = wordIndex;
                if (word1index >= 0) {
                    if (Math.abs(word2index - word1index) < shortestDistance) {
                        shortestDistance = Math.abs(word2index - word1index);
                        sword1index = word1index;
                        sword2index = word2index;
                    }
                }
                word1index = -1;
            }
        }
        System.out.println("Shortest distance = " + (shortestDistance - 1));
        System.out.println("word1 index = " + sword1index);
        System.out.println("word2 index = " + sword2index);
    }
}
If you are allowed O(n^2) space and O(n^2) preprocessing time, it is possible to answer each query in O(1).
Preprocess the file and compute Distance[Word1][Word2] (with Word1 < Word2 in dictionary order). This takes O(n^2) time and space.
For a specific query of words w1, w2, return Distance[w1][w2]; WLOG w1 < w2 in dictionary order. This is O(1).
The preprocessing space and time requirements are admittedly bad, but that is a one-time cost not seen by queries.
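The preprocessing pass above could look something like this sketch (class name, key encoding, and the single-pass pairing strategy are my assumptions, not from the thread): keep the latest position of every word seen so far, and whenever a word is read, update its minimum distance to each previously seen word.

```java
import java.util.HashMap;
import java.util.Map;

public class DistanceTable {
    // Build a map from "smaller|larger" word pairs to their minimum distance.
    static Map<String, Integer> table(String[] words) {
        Map<String, Integer> last = new HashMap<>();  // latest index of each word
        Map<String, Integer> dist = new HashMap<>();  // "w1|w2" -> min distance
        for (int i = 0; i < words.length; i++) {
            String a = words[i];
            for (Map.Entry<String, Integer> e : last.entrySet()) {
                String b = e.getKey();
                if (a.equals(b)) continue;            // skip self-pairs
                // canonical key: dictionary-smaller word first
                String key = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
                int d = i - e.getValue();             // distance to b's latest occurrence
                dist.merge(key, d, Math::min);
            }
            last.put(a, i);
        }
        return dist;
    }

    public static void main(String[] args) {
        Map<String, Integer> d = table("the quick the fox quick".split(" "));
        System.out.println(d.get("quick|the")); // O(1) lookup after preprocessing
    }
}
```

Note this pass is worse than linear (each word pairs with every distinct word seen so far), which is exactly the one-time cost the comment acknowledges.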
I agree, a trie is the correct data structure for this problem.
We can modify the trie's node structure to hold an array of the indexes at which that word occurs in the file.
Given two words, simply look them up in the trie to obtain the arrays of their occurrence indexes (which are obviously sorted); then process those two arrays to obtain the minimum word distance.
Take five variables: currentindex, firstwordindex, secondwordindex, currentdistance, and mindistance.
Initialise firstwordindex and secondwordindex to -infinity and mindistance to +infinity.
Read the file word by word, incrementing currentindex for each word.
if firstword is found
{
firstwordindex = currentindex;
currentdistance = firstwordindex - secondwordindex;
if (currentdistance < mindistance)
mindistance = currentdistance;
}
if secondword is found
{
secondwordindex = currentindex;
currentdistance = secondwordindex - firstwordindex;
if (currentdistance < mindistance)
mindistance = currentdistance;
}
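The scan above can be sketched concretely as follows (class and method names are mine; distance is counted as the difference of word indexes, and the indexes start far apart so the first match cannot produce a spurious minimum):

```java
public class EitherOrderDistance {
    // Single pass tracking the latest index of each word; either word
    // may come first. Returns the minimum index difference.
    static int minDistance(String[] words, String w1, String w2) {
        int i1 = Integer.MIN_VALUE / 2;  // "-infinity" for word1's latest index
        int i2 = Integer.MIN_VALUE / 2;  // "-infinity" for word2's latest index
        int best = Integer.MAX_VALUE;    // "+infinity" for the minimum distance
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(w1)) {
                i1 = i;
                best = Math.min(best, i1 - i2);
            } else if (words[i].equals(w2)) {
                i2 = i;
                best = Math.min(best, i2 - i1);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] text = "as was is the was the yahoo you me was the and".split(" ");
        System.out.println(minDistance(text, "yahoo", "was")); // prints 2
    }
}
```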
Java implementation; the search takes O(n), where n is the number of words in the file.
public class ShortestDistance {
    public static void main(String[] args) {
        ShortestDistance sd = new ShortestDistance();
        String sentence = "c a b e f b c a b a a d a a d c b e f c f g h b c";
        System.out.println(sentence + ":"
                + sd.shortestDistance(sentence, "b", "c"));
    }

    public int shortestDistance(String sentence, String sWord, String eWord) {
        int shortestDistance = -1;
        String[] words = sentence.split(" ");
        boolean sWordFound = false;
        int count = 0;
        for (int i = 0; i < words.length; i++) {
            if (sWordFound) {
                count++;
            }
            if (words[i].equalsIgnoreCase(sWord)) {
                sWordFound = true;
                // Reset
                count = 0;
            } else if (words[i].equalsIgnoreCase(eWord) && sWordFound) {
                if (shortestDistance == -1 || shortestDistance > (count - 1)) {
                    shortestDistance = (count - 1);
                }
                // Reset
                sWordFound = false;
                count = 0;
            }
        }
        return shortestDistance;
    }
}
I don't think it can be done in O(1). The worst case I can think of is an input of n words where the first n-1 are the same and the last word is different, and we have to find the least distance between those two words; that case is O(n). But it won't depend on n if the input text has normal sentences with proper grammar. :)
Here goes my solution (it uses tries):
1) Read the input character by character and update the trie.
2) When the input character is a <space>, update the trie entry of the word just completed and append that word's position to its list.
----
When the pair of query words is given, simply traverse the trie.
We get two lists containing all the positions of the two words, sorted in ascending order.
Use the merge step of merge sort to find the least distance.
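That merge step might look like this sketch (class and method names are my own): given the two sorted position lists from the trie, advance a pointer into each list, always moving the one with the smaller position, which examines each entry once, i.e. O(p + q) for list lengths p and q.

```java
public class MinListDistance {
    // Two-pointer merge over sorted position lists: the minimum
    // |a[i] - b[j]| is always found by advancing the smaller side.
    static int minDistance(int[] a, int[] b) {
        int i = 0, j = 0, best = Integer.MAX_VALUE;
        while (i < a.length && j < b.length) {
            best = Math.min(best, Math.abs(a[i] - b[j]));
            // advance whichever pointer sits at the smaller position
            if (a[i] < b[j]) i++; else j++;
        }
        return best;
    }

    public static void main(String[] args) {
        int[] w1 = {1, 7, 12};  // positions of the first word
        int[] w2 = {4, 9, 20};  // positions of the second word
        System.out.println(minDistance(w1, w2)); // prints 2 (|7 - 9|)
    }
}
```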
How large is the file? If file is too big hashing keywords may become quite expensive...
- Vlad January 25, 2008