Yahoo Interview Question
Software Engineer / Developer
Team: Search
Country: United States
Interview Type: In-Person
How can you be sure that the merged 2 GB of data won't overwrite data that hasn't been merged yet?
For example, suppose the first 8.192 MB of the first chunk happen to be the largest numbers in the file. Writing the first 2 GB of merged output to the file would then overwrite, and permanently lose, the first 2 GB of the file.
What are your thoughts on how this would be done if you need to sort the 1000 GB file in-place?
@riddimon: Heapsort would be an example of an in-place sort, but it's not very appropriate for files. The trick with manipulating files is that you want to read and write to them in a contiguous way because disk seeks are slow. Heapsort accesses data in a highly non-contiguous way as you move up and down the heap, so it's not a great choice.
Quicksort might be a good fit in this case. The main challenge of quicksort is the in-place partitioning of an array with respect to a pivot, which is something that can be done without much random access of the file. You write elements starting from the front and elements starting from the back in a contiguous way (buffer up values in memory until you have enough of them to write). Quicksort isn't really 100% in-place, requiring something like O(log N) memory, but that's a very small amount even for large N.
Quicksort does have shortcomings like degenerate cases that potentially take O(N^2) time. If you don't want to use random pivots to avoid degenerate cases, it's possible to choose the median as the pivot. There are linear-time median-selection algorithms like "median of medians", which can also be done with only logarithmic extra memory.
Assuming that they wanted an algorithmic approach:
- ASK November 09, 2012
Use a two-pass approach.
First pass (chunk sorting):
1. Load a 4 GB chunk of data into memory and sort it in place (e.g., Quicksort; note that a textbook Merge Sort needs O(n) auxiliary memory, so it is not truly in place).
2. Write the sorted chunk back to the same location on disk.
3. Repeat 250 times, until the end of the file is reached.
Second pass (merge):
Now you have a choice in how to merge: what will your "bucket size" be, which governs the N in an N-way merge? (This gets real tricky, real fast.)
In this case we have 250 chunks of 4 GB each. One way to do this is:
1. Load the first 8.192 MB from each of the 250 chunks (250 × 8.192 MB = 2 GB); this is our input buffer, and the remaining 2 GB of memory is our output buffer.
2. Do a 250-way merge on these 250 sorted bins/buckets.
3. When any of the 250 bins becomes empty, refill it with the next 8.192 MB from its associated chunk.
4. When the 2 GB output buffer is full, write it out to external storage.
Note that a single 250-way merge pass sorts the whole file; additional merge passes would only be necessary if memory constraints forced a smaller N (merging fewer chunks at a time).