Yahoo Interview Question
Software Engineer / Developer
Team: Search
Country: United States
Interview Type: In-Person
How can you be sure that the merged 2 GB of data won't overwrite data that hasn't been merged yet?
For example, suppose the first 8.192 MB of the first chunk happen to be the largest numbers in the file. Writing the first 2 GB of merged output to the file would then overwrite, and permanently lose, the first 2 GB of the file.
What are your thoughts on how this would be done if you need to sort the 1000 GB file in-place?
@riddimon: Heapsort would be an example of an in-place sort, but it's not very appropriate for files. The trick with manipulating files is that you want to read and write to them in a contiguous way because disk seeks are slow. Heapsort accesses data in a highly non-contiguous way as you move up and down the heap, so it's not a great choice.
Quicksort might be a good fit in this case. The main challenge of quicksort is the in-place partitioning of an array with respect to a pivot, which is something that can be done without much random access of the file. You write elements starting from the front and elements starting from the back in a contiguous way (buffer up values in memory until you have enough of them to write). Quicksort isn't really 100% in-place, requiring something like O(log N) memory, but that's a very small amount even for large N.
Quicksort does have shortcomings like degenerate cases that potentially take O(N^2) time. If you don't want to use random pivots to avoid degenerate cases, it's possible to choose the median as the pivot. There are linear-time median-selection algorithms like "median of medians", which can also be done with only logarithmic extra memory.
Assuming that they wanted an algorithmic approach:
- ASK November 09, 2012
Use a two-pass approach.
First pass (chunk sorting):
1. Load a 4 GB chunk of data into memory and sort it in place (e.g., Quicksort; note that a textbook Merge Sort needs O(n) auxiliary memory, so it is not truly in place).
2. Write the sorted chunk back to the same location on disk.
3. Repeat 250 times, until the end of the file is reached.
Second pass (merge):
Now you have a choice in how to merge: what will your "bucket size" be, which governs the N in an N-way merge? (This gets real tricky, real fast.)
In this case we have 250 chunks of 4 GB each. One way to do this is:
1. Load the first 8.192 MB from each of the 250 chunks (250 × 8.192 MB = 2 GB); this is our input buffer, and the remaining 2 GB of memory is our output buffer.
2. Do a 250-way merge on these 250 sorted bins/buckets.
3. When any of the 250 bins becomes empty, refill it with the next 8.192 MB from its associated chunk.
4. When the 2 GB output buffer is full, write it out to external storage.
Note that a single 250-way merge pass sorts the whole file; additional merge passes would only be necessary if memory constraints forced a smaller N (merging fewer chunks at a time).