Amazon Interview Question for SDE1s


Country: India
Interview Type: In-Person




I just had a brilliant idea! I'm pretty sure it's O(K) in complexity.

Abstract each file out, and treat it like a stack. Call it a FileStack.

Create a min-heap with good decrease-key performance (O(1)) and find-min performance (O(1)).

Put each FileStack on the heap. Building it costs O(100).

# Get K numbers. find_min and decrease_key are both O(1).
for i in range(K):
	fs = heap.find_min()       # FileStack whose top is the current global minimum
	print(fs.top())            # emit the smallest remaining number
	heap.decrease_key(fs)      # pop fs's top; its key becomes its next number

O(100) + K * (O(1) + O(1)) = O(K), treating the 100 files as a constant.

Thoughts?


... and it's effectively the same as the OP's solution, except that heapify() wouldn't need to be called every time.

- rotinom April 08, 2014 | Flag Reply
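
A minimal runnable sketch of this heap-of-file-cursors idea, with plain Python lists standing in for the 100 sorted files (an assumption for illustration). It uses the standard-library heapq, so each step costs O(log 100) rather than the O(1) decrease-key assumed above, giving O(k log 100) overall:

import heapq

def k_smallest(sorted_sources, k):
    """Return the k smallest values across several sorted iterables."""
    heap = []                                  # entries: (value, source_id, iterator)
    for i, src in enumerate(sorted_sources):
        it = iter(src)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i, it))

    out = []
    while heap and len(out) < k:
        value, i, it = heapq.heappop(heap)     # current global minimum
        out.append(value)
        nxt = next(it, None)                   # advance that source's cursor
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, it))
    return out

# Example with three "files"; in practice each source would wrap a real file.
print(k_smallest([[1, 4, 9], [2, 3, 20], [5, 6, 7]], 4))   # [1, 2, 3, 4]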

Good job!

Your whole idea is the same as pulkit.mehra.001's.
You choose an efficient heap data structure with constant "amortized"-time decrease-key (like a Fibonacci heap). Thus, your algorithm is O(k) time, in theory!

However, I think the most time-consuming part is reading the numbers in from the files, since file I/O takes much more time than operations in main memory. That is, reading k numbers takes O(c·k) time, where c is a coefficient for file-processing overhead, which is usually very large compared to the cost of accessing main memory. You may propose O(k) or O(k log 100) algorithms for this problem, but I think the file-reading time dominates the running time. In other words, improving the in-memory algorithm doesn't help much to reduce the time for the overall problem.
Although you "abstract" the files out as stacks, you still need to read the files anyway, don't you? Your solution (and pulkit.mehra.001's) requires at most (k + 99) file reads, which I think is already the most efficient.

In my opinion, this question is more about practical issues. Look at the way the interviewer formulated the question: he has a big data file, split into 100 files with sorted data... And he asks for an "efficient" method, not an "optimal" method. This makes me think of a practical issue rather than a pure theoretical problem.

Another point: for this solution we need a heap of size just 100, which is fairly small. So which heap implementation is more appropriate, considering the overhead cost of maintaining its structure?
I would say that a binary heap is pretty simple yet efficient. It is usually implemented with an array, and its simplicity makes the overhead cost negligible. A Fibonacci heap (and other theoretically efficient heaps), with its complicated implementation, may incur costly overhead, especially at a small size (like 100 in this problem).

Anyway, your idea is great!

I have other questions:
1. Can "abstracting" a file out as a stack make it faster to read?
2. I've heard of "memory-mapped files". Can they be used efficiently for this problem?

Cheers,

- ninhnnsoc April 09, 2014 | Flag

The tricky part about these problems is that the interviewer is looking for something, and that sometimes gets lost by the time it reaches CareerCup. He may have been heading towards a billion numbers in a million files, which leans towards a distributed application.

Memory-mapping the file will be the fastest way to access it. You will also benefit from the multiple levels of caching in the OS. If you read a single number, the OS really reads at least a block of data (maybe more), because if you read one part of the data, you will most likely want the rest of it.

If K is small enough, you could just read the first K numbers from all 100 files into memory, as a front-end way to mitigate the burden.

Heck:

100 million numbers = 400 million bytes = 400 MB of memory (at 4 bytes per int). Based on the bounds of the application, you could read them all in today without a whole lot of pain. Once loaded, your disk latency is eliminated.

Which goes back to what I said earlier: tricks like these are easily eliminated if you go up to a billion numbers, or a trillion, or whatever. Interviews like this keep upping the ante to see what the candidate is capable of.

- rotinom April 09, 2014 | Flag
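
A small sketch of the memory-mapping idea, assuming (purely for illustration) that each file stores its sorted numbers as raw little-endian 4-byte integers and is non-empty; the OS pages data in on demand, so only the bytes actually touched are read from disk:

import mmap
import struct

def read_first_k_ints(path, k):
    """Read up to k leading 4-byte little-endian ints from a memory-mapped file."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)   # map the whole file read-only
        try:
            count = min(k, len(mm) // 4)
            # unpack_from only touches the first count*4 bytes, so the OS
            # faults in just those pages rather than reading the whole file.
            return list(struct.unpack_from("<%di" % count, mm, 0))
        finally:
            mm.close()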

This is a *standard algorithm* for merging multiple sorted lists: put the minimum of each list in a heap, then repeatedly extract-min, fetch the next element from the corresponding list, and insert it into the heap.

- Anonymous April 10, 2014 | Flag
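
For reference, Python's standard library already packages exactly this k-way merge; a sketch with in-memory lists standing in for the sorted files:

import heapq
from itertools import islice

files = [[1, 4, 9], [2, 3, 20], [5, 6, 7]]       # stand-ins for the 100 sorted files
k = 4
# heapq.merge lazily merges already-sorted inputs using a small internal heap.
print(list(islice(heapq.merge(*files), k)))      # [1, 2, 3, 4]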

Yes, as you guys point out, reading one int at a time from each file is grossly inefficient, as there might be a lot of paging. Better to read the first k numbers from every file and put them into a min-heap of size 100k.
Also, if performance is not important but speed of writing code is (say, for a quick test), we can do it in one line in Unix.
Assume the current dir has the files:
cat * > out && sort -n out | head -n k
Algorithm-wise it's obviously inefficient, but it's good to point out in an interview, as it can be quite handy for quick jobs.

- im.code.warrior April 10, 2014 | Flag

You shouldn't create a min-heap: with a min-heap you would have to compare each new element against all the elements in the heap.

You should construct a max-heap from the first k elements. This places the largest of those k at the root. For each incoming element:
1) Check whether the element is less than the root.
2) If it is less than the root:
3) Replace the root with the incoming element.
4) Heapify.

This produces a heap containing the k smallest elements once all the elements have been traversed.
Complexity: O(n log k)

- arsragavan April 07, 2014 | Flag Reply
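
A minimal sketch of this size-k max-heap filter, assuming the numbers arrive as a single stream; Python's heapq is a min-heap, so negating the values makes it behave as a max-heap:

import heapq

def k_smallest_stream(numbers, k):
    """Keep the k smallest values seen so far in a max-heap of size k."""
    heap = []                            # stores negated values -> max-heap behaviour
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:               # smaller than the current k-th smallest
            heapq.heapreplace(heap, -x)  # pop the max, push the new value
    return sorted(-v for v in heap)

print(k_smallest_stream([9, 1, 8, 2, 7, 3, 6], 3))   # [1, 2, 3]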

In the complexity, what is "n"?

- kr.neerav April 07, 2014 | Flag

n is all the 100 million numbers = input size
k = heap size

- arsragavan April 07, 2014 | Flag

not sure if it will be an efficient implementation

- kr.neerav April 07, 2014 | Flag

Yeah, this is a good way. But in this case we would need to process all the elements in all the files. I think we can also keep a flag that is set to true once the root is smaller than the next element of every file, implying that the smallest k elements have already been found!!

- pulkit.mehra.001 April 08, 2014 | Flag

This would have been the answer if the files were not sorted. Since it is given that the files are sorted, I guess the OP's solution works fine.

- rpganesshkumar April 20, 2015 | Flag

Use the concept of merging in merge sort.

Pick the first 2 files and merge them to extract the k smallest values. Complexity O(k).
Repeat this for all files. Total complexity O(100k).
Space complexity O(k).

- kr.neerav April 07, 2014 | Flag Reply
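
A sketch of this repeated pairwise merge, assuming the sorted files have already been read into lists; the running result never grows beyond k elements, so each merge step is O(k):

def merge_k_smallest(a, b, k):
    """Merge two sorted lists, keeping only the k smallest of the result."""
    out, i, j = [], 0, 0
    while len(out) < k and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out

def k_smallest_pairwise(sorted_files, k):
    result = []
    for f in sorted_files:               # fold each file into the running result
        result = merge_k_smallest(result, f, k)
    return result

print(k_smallest_pairwise([[1, 4, 9], [2, 3, 20], [5, 6, 7]], 4))   # [1, 2, 3, 4]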

I doubt the complexity is O(100k).
From the first file, you will create an array of size k.
And every time you have to traverse all the elements of your size-k array for comparison with the first k elements of the next file...
So I guess the complexity is O(100*k^2).
Please correct me if I am wrong.

- arsragavan April 07, 2014 | Flag

As per the problem, each file is sorted, so we can pick the k smallest elements from the first 2 files and keep doing the same using the merge-sort technique. We can add an additional check like:

Assume I need the 100 smallest numbers, so I create smallest[100]. Once we get smallest[100] from the first 2 files, we can compare:
smallest[99] (the largest kept so far) < 3rd_File[0] -> if this is the case we can skip the 3rd file, and so on...

The worst case would be O(100*k^2).

- Rajesh M April 07, 2014 | Flag

For each file that we merge, we do O(k) comparisons. The total number of files to merge is 100, so the complexity is O(100k). I'm not sure how we would get k^2.

- kr.neerav April 07, 2014 | Flag

Every time you have to do k comparisons, one against each of the k elements in your current array. So it's definitely going to be O(k^2).

- arsragavan April 07, 2014 | Flag

I think the complexity will be linear in k.
The files are sorted, and the approach of merging 2 sorted files is being used here.
Suppose we have 2 sorted files with 100 elements each and we need the 10 smallest elements. Then we need only 10 comparisons:
compare the 1st element of file 1 with the 1st element of file 2, take the smaller, move to the next element in that file, and so on.

So for k elements the number of comparisons is O(k).
We do this for all 100 files, finding the k smallest elements using the concept of merging, so the overall time complexity is O(100k).

- pulkit.mehra.001 April 08, 2014 | Flag

Hi all!
The original solution of pulkit.mehra.001 is O(k log 100), with at most (k + 99) file reads.

I think it is the best so far.

Is your solution better than that?

- ninhnnsoc April 08, 2014 | Flag

How about this solution: since he has not mentioned anything about how much the memory can hold, we can take the first k elements from each file and put them into one file.
Then we can use the Quickselect algorithm to find the k smallest elements.
We can argue that this can be achieved in O(n) complexity...

- arsragavan April 07, 2014 | Flag Reply
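
A sketch of that idea, assuming the candidate pool is the first k numbers of each file gathered into one in-memory list; this simple (non-in-place) Quickselect partitions around a random pivot and recurses into one side only, giving expected linear time in the pool size:

import random

def quickselect_k_smallest(values, k):
    """Return the k smallest elements of values (unordered), expected O(len(values))."""
    if k <= 0:
        return []
    if k >= len(values):
        return list(values)
    pivot = random.choice(values)
    less = [x for x in values if x < pivot]
    equal = [x for x in values if x == pivot]
    greater = [x for x in values if x > pivot]
    if k <= len(less):
        return quickselect_k_smallest(less, k)
    if k <= len(less) + len(equal):
        return less + equal[:k - len(less)]
    return less + equal + quickselect_k_smallest(greater, k - len(less) - len(equal))

pool = [9, 1, 8, 2, 7, 3, 6, 2]                  # stand-in for the collected candidates
print(sorted(quickselect_k_smallest(pool, 3)))   # [1, 2, 2]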

We can use Median of medians approach...

- yash April 07, 2014 | Flag Reply

I think pulkit.mehra.001's solution, which is O(k log 100), is quite efficient already.

It reads the files only about k times for the k smallest numbers. Reading from a file is much slower than accessing RAM. Thus, I think reading these k numbers, which takes O(c·k) with a big coefficient c, is the bottleneck; the O(k log 100) algorithm for finding the answer is not the bottleneck.

Can you find the k smallest numbers in these 100 files with fewer than k file reads?
Can a file be randomly accessed?
IDK!

- ninhnnsoc April 07, 2014 | Flag Reply

pulkit.mehra.001's solution needs at most (k + 99) file reads.

- ninhnnsoc April 07, 2014 | Flag

I think he wanted you to think about file system buffering.

- Jaffa Krishna April 07, 2014 | Flag Reply

import java.util.List;
import java.util.PriorityQueue;

public static class FindSmallest {
    // One cursor per sorted "file" (modelled here as a List<Integer>).
    private static class Data implements Comparable<Data> {
        private int value;            // current value at this cursor
        private int index;            // position of that value in its file
        private List<Integer> file;   // the sorted file the cursor walks

        public Data(int value, int index, List<Integer> file) {
            this.value = value;
            this.index = index;
            this.file = file;
        }

        @Override
        public int compareTo(Data o) {
            return Integer.compare(this.value, o.value);   // avoids overflow of value - o.value
        }

        @Override
        public String toString() {
            return String.valueOf(value);
        }
    }

    public static int[] findKSmallest(List<List<Integer>> files, int k) {
        // Seed the min-heap with the first element of every non-empty file.
        PriorityQueue<Data> minQueue = new PriorityQueue<Data>();
        for (List<Integer> file : files) {
            if (!file.isEmpty()) {
                minQueue.add(new Data(file.get(0), 0, file));
            }
        }

        int[] out = new int[k];
        int count = 0;
        while (count < k && !minQueue.isEmpty()) {
            Data min = minQueue.poll();          // remove the current global minimum
            out[count++] = min.value;
            int nextIndex = min.index + 1;
            if (nextIndex < min.file.size()) {   // advance that file's cursor
                minQueue.add(new Data(min.file.get(nextIndex), nextIndex, min.file));
            }
        }
        return out;                              // the k smallest values, in ascending order
    }
}

- codedore April 07, 2014 | Flag Reply

Pure data structure approach: Fibonacci heap

Find min is very cheap, and each subtree could be loaded separately as a file.

Otherwise, I don't know if a data structure is needed.

You have 100 sorted files, so you can open each one in turn; as a speedup, you could read the first N/100 lines of each, if that is small enough, to minimize disk access. Then go through each file and do a merge.

Many approaches with different tradeoffs. What I think they are looking for at the 100-million level is the tradeoffs of each tactic. Lots of wrong answers and no standout right one.

- rotinom April 08, 2014 | Flag Reply

No need for any special DS; there is still an efficient algorithm...
You need to find the k smallest numbers in sorted files, so just scan the files k times, each time picking the smallest value among the current beginnings of the files. The time would be O(100k). Other algorithms offer only minor improvements.

- Jason April 11, 2014 | Flag Reply
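
A sketch of that head-scan approach, with lists standing in for the sorted files; each of the k rounds does a linear scan over the current front elements, no heap involved:

def k_smallest_scan(sorted_files, k):
    """Pick the overall minimum of the current file heads, k times."""
    cursors = [0] * len(sorted_files)       # next unread position in each file
    out = []
    for _ in range(k):
        best = None                         # index of the file with the smallest head
        for i, f in enumerate(sorted_files):
            if cursors[i] < len(f):
                if best is None or f[cursors[i]] < sorted_files[best][cursors[best]]:
                    best = i
        if best is None:                    # all files exhausted
            break
        out.append(sorted_files[best][cursors[best]])
        cursors[best] += 1
    return out

print(k_smallest_scan([[1, 4, 9], [2, 3, 20], [5, 6, 7]], 4))   # [1, 2, 3, 4]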

Let all 'n' files be sorted in increasing order.
We have to find the 'k' smallest numbers across the 'n' sorted files.
MaxHeap h = createMaxHeap(k);

// Read the first k elements from file[0] and create a MAX_HEAP out of them.
FileHandler fh = readFile(file[0]);
for (i = 0; i < k; i++)
    h.insert(line[i].data of fh);

// For every other file, test its first k elements against the current maximum.
for (j = 1; j < n; j++) {
    fh = readFile(file[j]);
    for (i = 0; i < k; i++) {
        if (line[i].data of fh < getMax(h)) {
            deleteMax(h);
            h.insert(line[i].data of fh);
        }
    }
}

return the heap as the result;

- nishant April 13, 2014 | Flag Reply

Hey guys!! What do you think of this approach?

First we take the starting number of each of the 100 files and sort them, which takes 100 log 100 and can be treated as a constant. Along with each number we store the file descriptor of the file it was extracted from.

Now treat the array as a queue. Starting from the minimum element, go to its file and extract all the numbers that are smaller than the next minimum, and store or print them. Then remove the minimum from the queue and update the minimum and next minimum. Keep a count of the numbers printed so far. Continue this process until the queue is empty or count = k. If the queue is empty and count < k, rebuild the queue with the numbers currently pointed to by the file pointers and repeat the process.

Pros: since the OS does block caching while reading a file, printing contiguous elements is a major advantage.
In the worst case, when consecutive minima come from different files, the time complexity would be k*100*log 100, which is also the case with the implementations above.

Please correct me if I am wrong.

- vicky22291 May 09, 2014 | Flag Reply
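
A sketch of this run-draining idea, with in-memory lists standing in for the files; it keeps the file heads in a heap and, once a file owns the minimum, drains that file until its next value exceeds the smallest head of the other files:

import heapq

def k_smallest_runs(sorted_files, k):
    """Drain runs of contiguous minima from whichever file currently owns the minimum."""
    heap = []                                   # (head value, file index)
    cursors = [0] * len(sorted_files)
    for i, f in enumerate(sorted_files):
        if f:
            heapq.heappush(heap, (f[0], i))
    out = []
    while heap and len(out) < k:
        _, i = heapq.heappop(heap)
        f = sorted_files[i]
        limit = heap[0][0] if heap else None    # smallest head among the other files
        # Keep emitting from this file while it still owns the global minimum.
        while len(out) < k and cursors[i] < len(f) and (limit is None or f[cursors[i]] <= limit):
            out.append(f[cursors[i]])
            cursors[i] += 1
        if cursors[i] < len(f):                 # put the file back with its new head
            heapq.heappush(heap, (f[cursors[i]], i))
    return out

print(k_smallest_runs([[1, 2, 3, 50], [10, 20, 30], [40, 60]], 6))   # [1, 2, 3, 10, 20, 30]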

Correction: the worst-case time complexity is k*log 100.

- vicky22291 May 09, 2014 | Flag

This problem could be solved using a next-min variable and a next-min count.
It will take O(K*100).

- Anand June 21, 2016 | Flag Reply

But then you are not making use of the fact that the files are sorted.

- pulkit.mehra.001 April 08, 2014 | Flag

He is right. The complete flow is the same as in my answer above: build a max-heap from the first k elements of file 1, then test the first k elements of every other file against the heap's maximum.

- Nishant April 13, 2014 | Flag

