Amazon Interview Question
For SDE-3s | Country: United States
Since all the files hold streams of integers, I would design a program that reads each file and counts the occurrences of each integer. When the program starts, it creates 8 temporary count files to hold the counts of all integers, each of which keeps the counts of 2^29 numbers in sorted order. As each count is a 64-bit number, each file is 2^29 x 8 bytes = 4GB, and the 8 files together cover the counts of all 2^32 possible numbers.
A program will be designed to read each integer from each input file and accumulate the counts temporarily in a sorted key-value structure such as a sorted dictionary Dictionary<int, long>. When the number of elements in the dictionary reaches the point where it would run out of memory (about 340k elements), the program updates the count files by adding the new counts to the existing ones and saving them back to the same location. The program then clears the dictionary and continues the loop until all input files are read and processed.
When all the input files are processed, the program reads the 8 temporary count files and outputs each integer, repeated by its count, to the final output file.
A limitation of this solution is that each number can appear at most 2^64-1 times across all the input files.
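The flush-to-count-files idea above can be sketched as follows. This is a minimal, scaled-down Python sketch: in the real design the constants would be 2^29 counts per file and a ~340k-entry flush threshold, and the function names (`count_integers`, `flush_counts`) and the count-file naming scheme are my own illustrative choices, not part of the original answer.

```python
import os
import struct

def ensure_count_file(path, counts_per_file):
    """Create a zero-filled count file (counts_per_file uint64 slots) if absent."""
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.truncate(counts_per_file * 8)   # sparse file of zeroed 64-bit slots

def flush_counts(pending, count_dir, counts_per_file):
    """Merge the in-memory {value: count} dict into the on-disk count files."""
    for value, count in pending.items():
        unsigned = value & 0xFFFFFFFF                   # map int32 -> 0 .. 2^32-1
        file_idx, slot = divmod(unsigned, counts_per_file)
        path = os.path.join(count_dir, f"counts_{file_idx}.bin")
        ensure_count_file(path, counts_per_file)
        with open(path, "r+b") as f:                    # read-modify-write one slot
            f.seek(slot * 8)
            (old,) = struct.unpack("<Q", f.read(8))
            f.seek(slot * 8)
            f.write(struct.pack("<Q", old + count))
    pending.clear()

def count_integers(int_stream, count_dir, counts_per_file=2**29,
                   flush_threshold=340_000):
    """Count every integer in the stream using a bounded in-memory dict."""
    pending = {}
    for value in int_stream:
        pending[value] = pending.get(value, 0) + 1
        if len(pending) >= flush_threshold:
            flush_counts(pending, count_dir, counts_per_file)
    flush_counts(pending, count_dir, counts_per_file)   # flush the remainder
```

Since slot index equals the integer's value within its range, each count file is automatically in sorted order, which is what lets the final pass simply stream the files in sequence.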
Assuming the numbers are uniformly distributed, this approach would require a lot of count-file switches -> lots of 4GB disk writes. As disk operations are way slower than in-memory operations, we need to minimize disk access for an efficient solution.
What I would do instead: read each 6GB file into RAM 3GB at a time, sort it in memory, and write it back.
So N x 6GB files become -> 2N x 3GB sorted files.
Then I would do a 2N-way merge sort on those sorted files. Whenever the overall sorted block in RAM reaches 4GB, write it to disk, empty the memory, and go on until all 3GB chunks are processed.
This requires reading and writing all the data once to create the sorted chunks and once more to create the single sorted file, overall an O(N) solution.
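The two phases above can be sketched in a few lines of Python, with in-memory lists standing in for the 6GB files and `heapq.merge` playing the role of the 2N-way merge; the function names and binary run format are illustrative assumptions.

```python
import heapq
import os
import struct

def split_into_sorted_runs(values, run_size, run_dir):
    """Phase 1: write fixed-size sorted runs (the '3GB chunks' in the post)."""
    run_paths = []
    for start in range(0, len(values), run_size):
        chunk = sorted(values[start:start + run_size])   # in-memory sort
        path = os.path.join(run_dir, f"run_{len(run_paths)}.bin")
        with open(path, "wb") as f:
            f.write(struct.pack(f"<{len(chunk)}q", *chunk))
        run_paths.append(path)
    return run_paths

def read_run(path):
    """Stream one sorted run back as a generator of ints."""
    with open(path, "rb") as f:
        while True:
            raw = f.read(8)
            if not raw:
                return
            yield struct.unpack("<q", raw)[0]

def external_merge_sort(values, run_size, run_dir):
    """Phase 2: k-way merge of all sorted runs into one sorted stream."""
    runs = split_into_sorted_runs(values, run_size, run_dir)
    return heapq.merge(*(read_run(p) for p in runs))
```

Because `heapq.merge` only ever holds one pending element per run, the merge phase needs memory proportional to the number of runs, not the data size; buffering its output into 4GB blocks before writing, as the answer suggests, is an I/O optimization on top of this.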
This can be solved with the map/reduce paradigm. Basically you create K mappers, where K is the number of buckets you want for a rough partition sort; say 20 mappers, each covering a range of about (2^31-1)/20 values. After the map tasks, each mapper contains the integers that fall into its range. Now create the same number of sorters, each one sorting the corresponding mapper's output; each sorter (reducer) will write its sorted result into the final output file. As you see, the numbers of mappers and reducers match, and they process streams coming from the file sources. This is a scalable and efficient design with the map/reduce paradigm.
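A small single-process sketch of the range-partition ("map") and per-bucket sort ("reduce") steps described above; the bucket arithmetic, default range, and function names are my own assumptions, and real mappers and reducers would of course run as separate distributed tasks over file streams.

```python
def partition_by_range(values, num_buckets, lo=-2**31, hi=2**31 - 1):
    """'Map' step: route each integer into one of num_buckets range buckets."""
    span = (hi - lo + num_buckets) // num_buckets    # ceil of range / buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        buckets[(v - lo) // span].append(v)
    return buckets

def partition_sort(values, num_buckets=20):
    """'Reduce' step: sort each bucket; concatenating the buckets in order is
    globally sorted, because bucket i holds only values smaller than every
    value in bucket i+1."""
    result = []
    for bucket in partition_by_range(values, num_buckets):
        result.extend(sorted(bucket))
    return result
```

The key property making this scalable is that no communication is needed between reducers: the partition boundaries alone guarantee the concatenated output is sorted.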
1. Let's say there are 100 such files of 6GB each.
2. Conceptually divide each file into 3 segments, i.e. there will be 300 segments.
3. now run this loop -
Does it make sense? It's like bubble sort: in each "bubble" we sort two consecutive segments.
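The loop body in step 3 is not spelled out above. One plausible reading, given the bubble-sort analogy, is repeated left-to-right passes that take each pair of consecutive segments, sort their union, and split it back, stopping when a full pass changes nothing. A small in-memory sketch of that reading, with Python lists standing in for the 2GB on-disk segments (all names are my own):

```python
def bubble_merge_pass(segments):
    """One pass: for each pair of consecutive segments, sort their union and
    split it back so the smaller half lands in the earlier segment."""
    changed = False
    for i in range(len(segments) - 1):
        merged = sorted(segments[i] + segments[i + 1])
        left = merged[:len(segments[i])]
        right = merged[len(segments[i]):]
        if left != segments[i] or right != segments[i + 1]:
            changed = True
        segments[i], segments[i + 1] = left, right
    return changed

def segment_bubble_sort(segments):
    """Repeat passes until a full pass makes no change; at that point every
    segment is sorted and no larger than its successor, so the concatenation
    of the segments is globally sorted."""
    while bubble_merge_pass(segments):
        pass
    return segments
```

As with element-wise bubble sort, each pass bubbles the largest remaining values toward the last segment, so the number of passes is bounded by the number of segments; the cost is that each pass rereads and rewrites a large fraction of the data, which is why the merge-sort answers above need fewer disk passes.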
- ahmedhasan July 15, 2020