Amazon Interview Question
For SDE-3s | Country: United States
Since all the files hold streams of integers, I would design a program that reads each file and counts the occurrences of each integer. When the program starts, it creates 8 temporary count files to hold the counts of all integers, each of which keeps the counts of 2^29 numbers in sorted order. As each count is a 64-bit number, each file is 2^29 x 8 bytes = 4GB, and the 8 files together cover the counts of all 2^32 possible numbers.
A program will be designed to read each integer from each input file and accumulate the counts temporarily in a sorted key-value structure such as a sorted dictionary Dictionary<int, long>. When the number of elements in the dictionary reaches the point where it would run out of memory (about 340k elements), the program updates the count files by adding the new counts to the existing ones and saving them back to the same location. The program then clears the dictionary and continues the loop until all input files are read and processed.
When all the input files are processed, the program reads the 8 temporary count files and outputs each integer, repeated by its count, to the final output file.
A limitation of this solution is that each number can appear at most 2^64-1 times across all the input files.
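The flush-to-count-files idea above can be sketched as follows. This is a minimal, scaled-down Python sketch: in the real design the constants would be 2^29 counts per file and a ~340k-entry flush threshold, and the function names (`count_integers`, `flush_counts`) and the count-file naming scheme are my own illustrative choices, not part of the original answer.

```python
import os
import struct

def ensure_count_file(path, counts_per_file):
    """Create a zero-filled count file (counts_per_file uint64 slots) if absent."""
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.truncate(counts_per_file * 8)   # sparse file of zeroed 64-bit slots

def flush_counts(pending, count_dir, counts_per_file):
    """Merge the in-memory {value: count} dict into the on-disk count files."""
    for value, count in pending.items():
        unsigned = value & 0xFFFFFFFF                   # map int32 -> 0 .. 2^32-1
        file_idx, slot = divmod(unsigned, counts_per_file)
        path = os.path.join(count_dir, f"counts_{file_idx}.bin")
        ensure_count_file(path, counts_per_file)
        with open(path, "r+b") as f:                    # read-modify-write one slot
            f.seek(slot * 8)
            (old,) = struct.unpack("<Q", f.read(8))
            f.seek(slot * 8)
            f.write(struct.pack("<Q", old + count))
    pending.clear()

def count_integers(int_stream, count_dir, counts_per_file=2**29,
                   flush_threshold=340_000):
    """Count every integer in the stream using a bounded in-memory dict."""
    pending = {}
    for value in int_stream:
        pending[value] = pending.get(value, 0) + 1
        if len(pending) >= flush_threshold:
            flush_counts(pending, count_dir, counts_per_file)
    flush_counts(pending, count_dir, counts_per_file)   # flush the remainder
```

Since slot index equals the integer's value within its range, each count file is automatically in sorted order, which is what lets the final pass simply stream the files in sequence.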
Assuming the numbers are uniformly distributed, this approach would require a lot of count-file switches -> lots of 4GB disk writes. As disk operations are way slower than in-memory operations, we need to minimize disk access for an efficient solution.
What I would do instead: read each 6GB file into RAM 3GB at a time, sort it in memory, and write it back.
So N x 6GB files become -> 2N x 3GB sorted files.
Then I would do a 2N-way merge sort on those sorted files. Whenever the overall sorted block in RAM reaches 4GB, write it to disk, empty the memory, and go on until all 3GB chunks are processed.
This requires reading and writing all the data once to create the sorted chunks and once more to create the single sorted file, overall an O(N) solution.
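The two phases above can be sketched in a few lines of Python, with in-memory lists standing in for the 6GB files and `heapq.merge` playing the role of the 2N-way merge; the function names and binary run format are illustrative assumptions.

```python
import heapq
import os
import struct

def split_into_sorted_runs(values, run_size, run_dir):
    """Phase 1: write fixed-size sorted runs (the '3GB chunks' in the post)."""
    run_paths = []
    for start in range(0, len(values), run_size):
        chunk = sorted(values[start:start + run_size])   # in-memory sort
        path = os.path.join(run_dir, f"run_{len(run_paths)}.bin")
        with open(path, "wb") as f:
            f.write(struct.pack(f"<{len(chunk)}q", *chunk))
        run_paths.append(path)
    return run_paths

def read_run(path):
    """Stream one sorted run back as a generator of ints."""
    with open(path, "rb") as f:
        while True:
            raw = f.read(8)
            if not raw:
                return
            yield struct.unpack("<q", raw)[0]

def external_merge_sort(values, run_size, run_dir):
    """Phase 2: k-way merge of all sorted runs into one sorted stream."""
    runs = split_into_sorted_runs(values, run_size, run_dir)
    return heapq.merge(*(read_run(p) for p in runs))
```

Because `heapq.merge` only ever holds one pending element per run, the merge phase needs memory proportional to the number of runs, not the data size; buffering its output into 4GB blocks before writing, as the answer suggests, is an I/O optimization on top of this.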
This can be solved with the map/reduce paradigm. Basically you create K mappers, where K is the number of buckets you want for a rough partition sort; say 20 mappers, each covering a range of about (2^31-1)/20 values. After the map tasks, each mapper contains the integers that fall into its range. Now create the same number of sorters, each one sorting the corresponding mapper's output; each sorter (reducer) will write its sorted result into the final output file. As you see, the numbers of mappers and reducers match, and they process streams coming from the file sources. This is a scalable and efficient design with the map/reduce paradigm.
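A small single-process sketch of the range-partition ("map") and per-bucket sort ("reduce") steps described above; the bucket arithmetic, default range, and function names are my own assumptions, and real mappers and reducers would of course run as separate distributed tasks over file streams.

```python
def partition_by_range(values, num_buckets, lo=-2**31, hi=2**31 - 1):
    """'Map' step: route each integer into one of num_buckets range buckets."""
    span = (hi - lo + num_buckets) // num_buckets    # ceil of range / buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        buckets[(v - lo) // span].append(v)
    return buckets

def partition_sort(values, num_buckets=20):
    """'Reduce' step: sort each bucket; concatenating the buckets in order is
    globally sorted, because bucket i holds only values smaller than every
    value in bucket i+1."""
    result = []
    for bucket in partition_by_range(values, num_buckets):
        result.extend(sorted(bucket))
    return result
```

The key property making this scalable is that no communication is needed between reducers: the partition boundaries alone guarantee the concatenated output is sorted.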
1. Let's say there are 100 such files of 6GB each.
2. Conceptually divide each file into 3 segments, i.e. there will be 300 segments.
3. now run this loop -
Does it make sense? It's like bubble sort: in each "bubble" we sort two consecutive segments.
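The loop body in step 3 is not spelled out above. One plausible reading, given the bubble-sort analogy, is repeated left-to-right passes that take each pair of consecutive segments, sort their union, and split it back, stopping when a full pass changes nothing. A small in-memory sketch of that reading, with Python lists standing in for the 2GB on-disk segments (all names are my own):

```python
def bubble_merge_pass(segments):
    """One pass: for each pair of consecutive segments, sort their union and
    split it back so the smaller half lands in the earlier segment."""
    changed = False
    for i in range(len(segments) - 1):
        merged = sorted(segments[i] + segments[i + 1])
        left = merged[:len(segments[i])]
        right = merged[len(segments[i]):]
        if left != segments[i] or right != segments[i + 1]:
            changed = True
        segments[i], segments[i + 1] = left, right
    return changed

def segment_bubble_sort(segments):
    """Repeat passes until a full pass makes no change; at that point every
    segment is sorted and no larger than its successor, so the concatenation
    of the segments is globally sorted."""
    while bubble_merge_pass(segments):
        pass
    return segments
```

As with element-wise bubble sort, each pass bubbles the largest remaining values toward the last segment, so the number of passes is bounded by the number of segments; the cost is that each pass rereads and rewrites a large fraction of the data, which is why the merge-sort answers above need fewer disk passes.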
- ahmedhasan July 15, 2020