Google Interview Question
Software Engineer / DevelopersThis softwareinterviewqna person seems to be trying to advertise his/her site here!
He/She copies the question posted here, to the blog and then answers it there and posts a link to the blog here!
I don't see any reason why you need to post your text-only answers elsewhere and post a link to that here. You might as well post the answer here.
Please... cut the self-righteous BS about sharing. If you want to share, just put it here. Anyone who visits this site comes with sharing in mind. I don't see them trying to advertise their blogs/sites here.
btw, how useful and valuable would it be, when your site goes away?
If you want to, you can include the answer here AND include a link to your blog. Putting just the link comes off as unsolicited spam. Maybe your intentions are good, but it sure does not look that way.
btw, have my posts increased traffic to your site? :-)
What does a sorted file mean? Sorted line by line? What?
Assuming each line contains an int and we sort on that,
Just use a Heap of size k, for k files.
Now if the list of files is so huge that the OS can't handle them, then we might need to break the list of files up, create intermediate merge files and re-merge them. We might end up with a tree like structure for the merges, if multiple steps of re-merge are involved, so we will have to write a recursive function for that.
Note that the basic step for each merge step will remain the same and that is split into a separate function.
// All data structures etc made up, but you should get the idea.
BasicMerge(FileList files, File outputFile)
{
MinHeap h = new MinHeap();
while (files.HasFiles()) {
foreach (File file in files) {
if (file.ReachedEnd()) {
files.Remove(file);
continue;
}
int val = file.ReadInt();
h.Insert(val);
}
int value = h.GetMin();
outputFile.WriteInt(value);
}
}
// This does the merge of the whole list, could be 100,000 files
// if required.
// maxOsFiles = constant which determines a limit of OS.
void CompleteMerge(FileList files, File outputFile) {
FileList [] brokenUpList;
FileList MergeFileList = new FileList();
if (files.NumberofFiles() > maxOSFiles) {
brokenUpList = ComputeSplit(files, maxOSFiles);
}else{
BasicMerge(files, outputFile);
return;
}
foreach (FileList fileList in brokenUpList) {
File tempOutput = GenerateTempFile();
BasicMerge(fileList, tempOutput);
MergeFileList.Add(tempOutput);
}
CompleteMerge(MergFileList, outputFile);
}
We can use Map Reduce for this problem. Load the 1000 sorted files into HDFS . In the map phase output individual values as intermediate key value pair. Then the shuffling and sorting process would automatically sort the values and present it to reducer in which we can write values to file c.
OK, I'm the "softwareinterviewqna person".
For me this internet thing is all about sharing. Gayle provided this great place for us to share questions, I love it. I provide my solution with careful consideration, thorough explanation, I contributed my value. I choose to put it on my site, it's just simple like that.
The answer above is almost right.
- Rajesh Konda June 29, 2009Basically you use a min heap (sorting increasing order) of size k, where k is the number of files.
First you read the first records from all the files into the heap
Loop until (no more records in any of the files/file list is empty)
Remove the min element from the heap, add it to output file
Read the next element from the file from which the min element came
if the file has no more records, delete the file from the file list
continue
add it to the min heap
Now you have one file with all elements sorted.
The complexity is n log k, where n is the total number of records.
Another question is what if the number of files itself is too big? The answer is simple. Select the first k files that you can manage using your memory resources. Produce an output file. Then work with the next k files to produce the next output file. So on and so forth. Once done, now recurse your method with the new set of files (the output files, they are all sorted). Do it until you output only one file.