Amazon Interview Question
Java Developers | Country: United States
Interview Type: Phone Interview
I don't think you'd need that pair object. It only asks to print the lines and not for the line numbers or any extra data.
Iterate over the 1st file and add the lines to the hashmap.
Iterate over the 2nd file; on each line, if there is a value in the hashmap, print that line. If there are duplicate lines in the files and you only want one instance of each common line, you can remove the value from the hashmap after you print it.
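A minimal sketch of those two passes (in-memory lists stand in for streaming the files line by line; all names here are illustrative):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonLines {
    // Pass 1: put every line of the first file in a hash set.
    // Pass 2: stream the second file; a line already in the set is common.
    // Removing it after the first hit reports each common line only once.
    static Set<String> commonLines(Iterable<String> first, Iterable<String> second) {
        Set<String> seen = new HashSet<>();
        for (String line : first) seen.add(line);
        Set<String> common = new HashSet<>();
        for (String line : second) {
            if (seen.remove(line)) common.add(line);
        }
        return common;
    }

    public static void main(String[] args) {
        // In a real run the two Iterables would wrap readers over the files.
        System.out.println(commonLines(List.of("a", "b", "c"), List.of("b", "c", "d", "b")));
    }
}
```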
@Mrs Jumbo. If you compute hash codes for a billion strings, the likelihood of collisions becomes very high. I therefore thought of using the line numbers to keep track of which lines hash to the same bucket.
While processing the second file, a line may hash to an occupied bucket without being an exact match for any line from the 1st file. Therefore, based on the line #s, we need to do exact string matching against those lines from the first file and, if they match, increment the counter of the matching line.
One billion objects in memory is too much... what if there are 100 billion lines?
Do an external sort and compare.
This can be seen as a variation of LCS where we have to deal with whole lines instead of characters. LCS can be solved using DP.
Correct me if I am wrong.
LCS is the algorithm behind the diff command in Unix.
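For illustration only, the line-level LCS that diff is based on can be sketched with the classic DP; note that the O(n*m) table makes this impractical at billion-line scale:

```java
import java.util.List;

public class LineLcs {
    // Classic LCS dynamic program applied to whole lines instead of
    // characters: dp[i][j] = length of the LCS of a[0..i) and b[0..j).
    // O(n*m) time and space, so this only illustrates the idea; it is
    // not a practical answer for billion-line files.
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    public static void main(String[] args) {
        List<String> f1 = List.of("x", "common1", "y", "common2");
        List<String> f2 = List.of("common1", "z", "common2");
        System.out.println(lcsLength(f1, f2)); // prints 2
    }
}
```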
In Java we can use RandomAccessFile and its #readLine method to read lines.
MapReduce design
=====================================================================
1. One map function reads the 1st file line by line, say 1000 lines at a shot (depending on the memory and speed available).
2. Another map function reads the 2nd file line by line, again say 1000 lines at a shot, and looks each line up in the table built from the first file. If it matches, increase the count.
3. The reduce function takes the hash table keys; for each key with a value > 0, it removes the key from the map and prints the common line (if the volume of lines is too big, save them to external storage using another reduce function).
It can be implemented with threads, or in a distributed or parallel way.
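A single-process sketch of steps 1 and 2, assuming one chunk of the 1st file is tabulated at a time and the 2nd file is scanned against it (all names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedMatch {
    // Step 1 ("map" over one chunk of file 1): count each distinct line.
    static Map<String, Integer> buildChunkTable(List<String> chunk) {
        Map<String, Integer> table = new HashMap<>();
        for (String line : chunk) table.merge(line, 1, Integer::sum);
        return table;
    }

    // Step 2 ("map" over file 2): collect every line present in the chunk
    // table; removing the key reports each common line once per chunk.
    static List<String> matchChunk(Map<String, Integer> table, Iterable<String> file2) {
        List<String> common = new ArrayList<>();
        for (String line : file2) {
            if (table.remove(line) != null) common.add(line);
        }
        return common;
    }

    public static void main(String[] args) {
        Map<String, Integer> table = buildChunkTable(List.of("a", "b", "a"));
        System.out.println(matchChunk(table, List.of("b", "c", "a", "b"))); // prints [b, a]
    }
}
```

In the distributed version each chunk table would live on a different worker and a reduce step would combine the per-chunk match lists.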
and maintain a hash table entry for each line
How about a trie? Since no line can be very large (say length at most 100) and the ASCII alphabet has 256 symbols, trie construction costs O(total characters), and lookup is O(L) per line of length L.
With a hashtable, construction is O(n); however, different strings can have the same hashcode, so on a hashcode match we also need to compare the strings themselves. Ignoring that collision issue, it is O(n).
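A sketch of such a line trie, assuming hash-map children rather than 256-way arrays to keep memory down:

```java
import java.util.HashMap;
import java.util.Map;

public class LineTrie {
    // Each node maps a character to a child; 'terminal' marks the end of
    // a stored line. Insert and lookup are O(L) for a line of length L,
    // so building the trie is O(total characters) overall.
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    void insert(String line) {
        Node cur = root;
        for (char c : line.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.terminal = true;
    }

    boolean contains(String line) {
        Node cur = root;
        for (char c : line.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false;
        }
        return cur.terminal; // a stored prefix does not count as a match
    }

    public static void main(String[] args) {
        LineTrie trie = new LineTrie();
        trie.insert("hello world");
        System.out.println(trie.contains("hello world")); // prints true
        System.out.println(trie.contains("hello"));       // prints false
    }
}
```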
Since both the files have billion lines, I assume they are stored in the secondary memory.
I would initially do an external merge sort on both the files. Note that this will be lexicographic sorting.
Next, I'll match the lines of file 1 against file 2. The matching is now efficient because both files are sorted; since you can't keep both files in main memory, you'll have to do the matching in chunks.
Needless to say this will run in O(nlogn) time, but it is guaranteed to work for big files.
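Assuming the external sort has already produced two lexicographically sorted files, the matching pass can be sketched as a merge-style scan (iterators stand in for buffered readers over the sorted files):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SortedMatch {
    // Merge-style scan of two sorted line streams: advance whichever side
    // compares smaller; equal lines are common. A single pass, so it works
    // on streams read from disk without holding either file in memory.
    static List<String> commonSorted(Iterator<String> a, Iterator<String> b) {
        List<String> common = new ArrayList<>();
        String x = a.hasNext() ? a.next() : null;
        String y = b.hasNext() ? b.next() : null;
        while (x != null && y != null) {
            int cmp = x.compareTo(y);
            if (cmp == 0) {
                common.add(x);
                // skip duplicates of the matched line on both sides
                String matched = x;
                do { x = a.hasNext() ? a.next() : null; } while (matched.equals(x));
                do { y = b.hasNext() ? b.next() : null; } while (matched.equals(y));
            } else if (cmp < 0) {
                x = a.hasNext() ? a.next() : null;
            } else {
                y = b.hasNext() ? b.next() : null;
            }
        }
        return common;
    }

    public static void main(String[] args) {
        System.out.println(commonSorted(
                List.of("a", "b", "b", "d").iterator(),
                List.of("b", "c", "d").iterator())); // prints [b, d]
    }
}
```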
Nice idea. If there are no duplicate lines within each file, I think you can even merge file1 and file2 into file3 and check whether adjacent lines are the same. For the time complexity you also have to factor in the average line length of both files: if it were m, each comparison costs O(m) and the complexity would be O(mn log n).
We can initially store one complete file into a map, with each line as the key. Then, for each line of the second file, try to insert it into the map: if the insert fails (the key already exists), print that common line and remove it from the map. Even if the line did get inserted, delete that entry too, so we keep the map size down. So finally the map holds the non-duplicate entries from both files. We need the whole first file in the map because line 1 of file1 might match the last line of file2. If there is any issue retrieving a string due to hashing, we can use double hashing.
Hi,
This solution will be suitable for this question.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class ReadCommonLinesFromFile {
    public static void main(String[] args) throws IOException {
        File f1 = new File("E:\\Text1.txt");
        File f2 = new File("E:\\Text2.txt");
        String line = null;
        BufferedReader buffReader1 = new BufferedReader(new FileReader(f1));
        BufferedReader buffReader2 = new BufferedReader(new FileReader(f2));
        Set<String> set1 = new HashSet<String>();
        Set<String> set2 = new HashSet<String>();
        while ((line = buffReader1.readLine()) != null) {
            set1.add(line);
        }
        while ((line = buffReader2.readLine()) != null) {
            set2.add(line);
        }
        buffReader1.close();
        buffReader2.close();
        set1.retainAll(set2); // keep only the lines common to both files
        System.out.println(set1);
    }
}

I hope this will be helpful.
Maintain a hash table with the following structure.
- Murali Mohan June 14, 2013
1. Use a Pair object with (line#, count) elements
2. The key of the hash table would be the hashcode (something like MD5) of the string, and the value would be an ArrayList of Pair objects.
[hashCodeOfString]-> ArrayList<Pair>()
For the first file, for each line, compute the hash code to use as the key. As the value, store the corresponding line # with the initial count set to 1.
For the second file, for each line, compute the hash code; if it matches an existing key, use the line #s in its Pair objects to compare the current line against the corresponding lines from the 1st file. If the string comparison also matches, increment the count field of the Pair object.
Iterate over the hash table and print the lines with count > 1.
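A sketch of this digest-plus-verification scheme; MD5 comes from java.security.MessageDigest, and an in-memory list stands in for random access to the 1st file by line number:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DigestMatch {
    // (line number in file 1, number of times file 2 matched that line)
    static class Pair {
        final int lineNo;
        int count;
        Pair(int lineNo) { this.lineNo = lineNo; }
    }

    static String md5Hex(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) { // MD5 is always present in the JDK
            throw new IllegalStateException(e);
        }
    }

    // file1 stands in for random access to the first file by line number.
    static List<String> commonLines(List<String> file1, List<String> file2) {
        Map<String, List<Pair>> table = new HashMap<>();
        for (int i = 0; i < file1.size(); i++) {
            table.computeIfAbsent(md5Hex(file1.get(i)), k -> new ArrayList<>()).add(new Pair(i));
        }
        List<String> common = new ArrayList<>();
        for (String line : file2) {
            List<Pair> candidates = table.get(md5Hex(line));
            if (candidates == null) continue;
            for (Pair p : candidates) {
                // digests can collide, so confirm with an exact string comparison
                if (file1.get(p.lineNo).equals(line)) {
                    if (p.count++ == 0) common.add(line); // report once
                }
            }
        }
        return common;
    }

    public static void main(String[] args) {
        System.out.println(commonLines(List.of("a", "b", "c"), List.of("b", "d", "c", "b"))); // prints [b, c]
    }
}
```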