Ebay Interview Question
Software Engineer / Developer
Country: India
Interview Type: Phone Interview
Use a min-heap of size K. The time complexity is Θ(n*log(K)); the space complexity is O(K).
Right. A min-heap works for max/top-k, since each new element is compared against (and may push out) the current minimum.
Θ for runtime, but not space usage? Why?
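The min-heap approach described above can be sketched as follows (a minimal illustration; the function name `top_k` and the example input are my own, not from the thread):

```python
import heapq

def top_k(stream, k):
    """Keep the k largest items seen so far in a min-heap of size k.

    The heap root is the smallest of the current top k, so each new
    item is compared against (and may replace) that minimum.
    Time: O(n*log(k)); extra space: O(k).
    """
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # pop the current min, push x
    return sorted(heap, reverse=True)

# Example: top_k([5, 1, 9, 3, 7, 8, 2], 3) returns [9, 8, 7]
```

Because the heap never grows past k entries, this also works when the input is a stream read from a file too large for memory.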
When I read "big data file" I understand that it doesn't all fit into memory at once, so a heap over all n elements won't work.
I would use QuickSelect: as each element is compared with the pivot, it is written to one of two separate files (one for the left partition, one for the right). Eventually the files become small enough to fit into memory, but regardless of that, this finds the K-th largest element in expected linear time.
Then go through the initial file again, picking the elements larger than or equal to the K-th largest, which is linear as well.
So the time is expected O(n), and the space stays bounded because you can delete the partition files you no longer need.
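The two-pass idea above, shown in-memory for clarity (the helper names `kth_largest` and `top_k_via_select` are mine; the file-partitioning variant would write each side of the partition to disk instead of swapping in an array):

```python
import random

def kth_largest(nums, k):
    """Quickselect with a 3-way partition: expected O(n) time."""
    target = len(nums) - k          # index of the k-th largest in sorted order
    a = list(nums)
    lo, hi = 0, len(a) - 1
    while True:
        pivot = a[random.randint(lo, hi)]
        lt, gt, i = lo, hi, lo
        while i <= gt:              # partition a[lo..hi] into <, ==, > pivot
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[gt], a[i] = a[i], a[gt]
                gt -= 1
            else:
                i += 1
        if target < lt:
            hi = lt - 1             # answer lies in the left partition
        elif target > gt:
            lo = gt + 1             # answer lies in the right partition
        else:
            return pivot

def top_k_via_select(nums, k):
    """Second linear pass: collect everything above the k-th largest."""
    kth = kth_largest(nums, k)
    out = [x for x in nums if x > kth]
    out += [kth] * (k - len(out))   # pad with ties on the threshold value
    return out
```

Note the padding step: if the K-th largest value is duplicated, a naive "pick elements larger than it" pass can return fewer than K results.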
We can use the following logic:
1. Split the file into N smaller files and assign each one to a separate process - so N processes handle N smaller files
2. In each task, find the largest K values in its file (e.g., using a heap of size K) and write them to a separate output file. Keeping only the single largest value per file is not enough, since several of the overall top K may land in the same split
3. Finally, read the N output files (N*K candidates in total) and pick the largest K values from them
This is very easy to implement using the Hadoop MapReduce framework, because input splitting and assigning splits to tasks are done automatically by the framework.
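The steps above can be sketched without Hadoop as a plain map/reduce over in-memory chunks (the function name `top_k_chunked` is mine, for illustration only):

```python
import heapq

def top_k_chunked(chunks, k):
    """Map: each chunk contributes its own top k candidates.
    Reduce: merge the N*k candidates and take the overall top k.
    Each chunk must emit k values, not just its maximum, or
    answers that cluster in one chunk would be lost.
    """
    candidates = []
    for chunk in chunks:                      # "map" phase, one task per chunk
        candidates.extend(heapq.nlargest(k, chunk))
    return heapq.nlargest(k, candidates)      # "reduce" phase

# Example: top_k_chunked([[5, 1, 9], [3, 7], [8, 2]], 3) returns [9, 8, 7]
```

In an actual Hadoop job, each mapper would emit its local top K and a single reducer would perform the final merge.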
Use a min-heap of size k. O(n*log(k)) complexity
- Anonymous December 29, 2013