Flipkart Interview QuestionSoftware Engineer / Developers
How would you keep track of which element came from which stream, especially after you have processed them to make a heap?
Looks like there are k incoming streams which are sorted among themselves. So all we need to do here is form another stream i.e. the output stream from the k input streams.
The problem would arise when one or more of the streams have their values start after a huge number of values of other streams.
In this case lets say that all the k-1 streams are having numbers from range 1 and 1000000. So now its easy to merge the k-1 streams.
if the kth stream starts from 10000000000000 i.e. of order several times larger then the values would keep coming in, and you would need to store in a buffer and when that goes full, serialize that into a filestream.
This is the main crux of the problem. The problem would be more magnified if its the other way round, i.e. 1 stream starts from 1 and goes up in increments of 1 and the other k-1 streams start from 100000000000000 and above. So you will have (k-1)*100000000000000 values in store before the first value gets consumed. this would mean requiring the (k-1) file streams to serialize and then read from the files to throw the output into the output stream.
This question is not an algorithmic or time complexity question but more of a large data system design.
External merge sort
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
1. Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
4. Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file, and empty it. If any of the 9 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.
Since the streams are infinite in nature, we have to use MinHeap concept along with insertion sort. Take an empty result array.
From all of the sorted streams, pick the smallest number (comparing index 0 of all streams) and put in result array.
Continue the same.
In the result array, we need to take care of inserting the new element at correct position.
Since the streams are already sorted, the problem domain is not sorting but merging. Take any two streams at one time and perform a simple merge routine. Further merge processing can be made efficient on consideration that the streams are already sorted and tail of one stream may be directly lesser than head of another. Thus the number of times a stream needs to be compared element-wise may be reduced.
@Radhika you are right if there are k streams which are sorted and n elements in total this gives O(nlogk) algorithm.(refer to cormen)
Since we are performing heap operation for every element in all the streams there are a total of n*k elements. Therefore the complexity is O(nklogk)
what if we construct a binary search tree it would have a time complexity of log(no. of elements=k.N) + its inorder traversal O(k.N)
where N is no. of elements in a stream...
becoz if we use heap for such large data set our programme will give heep Exception..
Found on the algo blog
Step1: Take first element of all k streams and build a min heap of them. => O(k)
Step2: Remove the min element (elem at top of heap) from the heap and put in the new stream. => O(log k)
Step3: Put new element in heap from the stream to which the prev elem belongs (which was at heap min). => O(log k)
Step4: continue above steps till we exhaust all the streams. If all streams in combination have n elements then order is O(n log(k))
Build a heap out the first element from all k streams.. Remove the minimum and insert the next element into the heap from the stream to which minimum belongs. Keep repeating this till we run out of all streams.. Will this work?- Radhika July 02, 2011