Amazon Interview Question for SDE1s


Country: India
Interview Type: In-Person




I just had a brilliant idea! I'm pretty sure it's O(K) in complexity.

Abstract each file out, and treat it like a stack. Call it a FileStack.

Create a min-heap with good decrease-key performance (O(1)) and find-min performance (O(1)).

Put each FileStack on the heap. Building it costs O(100).

# Get K numbers. find_min and decrease_key are both O(1).
for i in range(K):
	fs = heap.find_min()       # FileStack whose top is the current global minimum
	print(fs.top())            # emit the smallest remaining number
	heap.decrease_key(fs)      # pop fs's top; its key becomes its next number

O(100) + K * (O(1) + O(1)) = O(K), treating the 100 files as a constant.

Thoughts?


... and it's effectively the same as the OP's solution, except that heapify() wouldn't need to be called every time.

- rotinom April 08, 2014 | Flag Reply
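
A minimal runnable sketch of this heap-of-file-cursors idea, with plain Python lists standing in for the 100 sorted files (an assumption for illustration). It uses the standard-library heapq, so each step costs O(log 100) rather than the O(1) decrease-key assumed above, giving O(k log 100) overall:

import heapq

def k_smallest(sorted_sources, k):
    """Return the k smallest values across several sorted iterables."""
    heap = []                                  # entries: (value, source_id, iterator)
    for i, src in enumerate(sorted_sources):
        it = iter(src)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i, it))

    out = []
    while heap and len(out) < k:
        value, i, it = heapq.heappop(heap)     # current global minimum
        out.append(value)
        nxt = next(it, None)                   # advance that source's cursor
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, it))
    return out

# Example with three "files"; in practice each source would wrap a real file.
print(k_smallest([[1, 4, 9], [2, 3, 20], [5, 6, 7]], 4))   # [1, 2, 3, 4]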

Good job!

Your whole idea is the same as pulkit.mehra.001's.
You choose an efficient heap data structure with constant "amortized"-time decrease-key (like a Fibonacci heap). Thus, your algorithm is O(k) time, in theory!

However, I think the most time-consuming part is reading the numbers in from the files, since file I/O takes much more time than operations in main memory. That is, reading k numbers takes O(c·k) time, where c is a coefficient for file-processing overhead, which is usually very large compared to the cost of accessing main memory. You may propose O(k) or O(k log 100) algorithms for this problem, but I think the file-reading time dominates the running time. In other words, improving the in-memory algorithm doesn't help much to reduce the time for the overall problem.
Although you "abstract" the files out as stacks, you still need to read the files anyway, don't you? Your solution (and pulkit.mehra.001's) requires at most (k + 99) file reads, which I think is already the most efficient.

In my opinion, this question is more about practical issues. Look at the way the interviewer formulated the question: he has a big data file, split into 100 files with sorted data... And he asks for an "efficient" method, not an "optimal" method. This makes me think of a practical issue rather than a pure theoretical problem.

Another point: for this solution we need a heap of size just 100, which is fairly small. So which heap implementation is more appropriate, considering the overhead cost of maintaining its structure?
I would say that a binary heap is pretty simple yet efficient. It is usually implemented with an array, and its simplicity makes the overhead cost negligible. A Fibonacci heap (and other theoretically efficient heaps), with its complicated implementation, may incur costly overhead, especially at a small size (like 100 in this problem).

Anyway, your idea is great!

I have other questions:
1. Can "abstracting" a file out as a stack make it faster to read?
2. I've heard of "memory-mapped files". Can they be used efficiently for this problem?

Cheers,

- ninhnnsoc April 09, 2014 | Flag

The tricky part about these problems is that the interviewer is looking for something, and that sometimes gets lost by the time it reaches CareerCup. He may have been heading towards a billion numbers in a million files, which leans towards a distributed application.

Memory-mapping the file will be the fastest way to access it. You will also benefit from the multiple levels of caching in the OS. If you read a single number, the OS really reads at least a block of data (maybe more), because if you read one part of the data, you will most likely want the rest of it.

If K is small enough, you could just read the first K numbers from all 100 files into memory, as a front-end way to mitigate the burden.

Heck:

100 million numbers = 400 million bytes = 400 MB of memory (at 4 bytes per int). Based on the bounds of the application, you could read them all in today without a whole lot of pain. Once loaded, your disk latency is eliminated.

Which goes back to what I said earlier: tricks like these are easily eliminated if you go up to a billion numbers, or a trillion, or whatever. Interviews like this keep upping the ante to see what the candidate is capable of.

- rotinom April 09, 2014 | Flag
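
A small sketch of the memory-mapping idea, assuming (purely for illustration) that each file stores its sorted numbers as raw little-endian 4-byte integers and is non-empty; the OS pages data in on demand, so only the bytes actually touched are read from disk:

import mmap
import struct

def read_first_k_ints(path, k):
    """Read up to k leading 4-byte little-endian ints from a memory-mapped file."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)   # map the whole file read-only
        try:
            count = min(k, len(mm) // 4)
            # unpack_from only touches the first count*4 bytes, so the OS
            # faults in just those pages rather than reading the whole file.
            return list(struct.unpack_from("<%di" % count, mm, 0))
        finally:
            mm.close()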

This is a *standard algorithm* for merging multiple sorted lists: put the minimum of each list in a heap, then repeatedly extract-min, fetch the next element from the corresponding list, and insert it into the heap.

- Anonymous April 10, 2014 | Flag
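
For reference, Python's standard library already packages exactly this k-way merge; a sketch with in-memory lists standing in for the sorted files:

import heapq
from itertools import islice

files = [[1, 4, 9], [2, 3, 20], [5, 6, 7]]       # stand-ins for the 100 sorted files
k = 4
# heapq.merge lazily merges already-sorted inputs using a small internal heap.
print(list(islice(heapq.merge(*files), k)))      # [1, 2, 3, 4]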

Yes, as you guys point out, reading one int at a time from each file is grossly inefficient, as there might be a lot of paging. Better to read the first k numbers from every file and put them into a min-heap of size 100k.
Also, if performance is not important but speed of writing code is (say, for a quick test), we can do it in one line in Unix.
Assume the current dir has the files:
cat * > out && sort -n out | head -n k
Algorithm-wise it's obviously inefficient, but it's good to point out in an interview, as it can be quite handy for quick jobs.

- im.code.warrior April 10, 2014 | Flag

You shouldn't create a min-heap: with a min-heap you would have to compare each new element against all the elements in the heap.

You should construct a max-heap from the first k elements. This places the largest of those k at the root. For each incoming element:
1) Check whether the element is less than the root.
2) If it is less than the root:
3) Replace the root with the incoming element.
4) Heapify.

This produces a heap containing the k smallest elements once all the elements have been traversed.
Complexity: O(n log k)

- arsragavan April 07, 2014 | Flag Reply
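
A minimal sketch of this size-k max-heap filter, assuming the numbers arrive as a single stream; Python's heapq is a min-heap, so negating the values makes it behave as a max-heap:

import heapq

def k_smallest_stream(numbers, k):
    """Keep the k smallest values seen so far in a max-heap of size k."""
    heap = []                            # stores negated values -> max-heap behaviour
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:               # smaller than the current k-th smallest
            heapq.heapreplace(heap, -x)  # pop the max, push the new value
    return sorted(-v for v in heap)

print(k_smallest_stream([9, 1, 8, 2, 7, 3, 6], 3))   # [1, 2, 3]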

In the complexity, what is "n"?

- kr.neerav April 07, 2014 | Flag

n is all the 100 million numbers = input size
k = heap size

- arsragavan April 07, 2014 | Flag

not sure if it will be an efficient implementation

- kr.neerav April 07, 2014 | Flag

Yeah, this is a good way. But in this case we would need to process all the elements in all the files. I think we can also keep a flag that is set to true once the root is smaller than the next element of every file, implying that the smallest k elements have already been found!!

- pulkit.mehra.001 April 08, 2014 | Flag

This would have been the answer if the files were not sorted. Since it is given that the files are sorted, I guess the OP's solution works fine.

- rpganesshkumar April 20, 2015 | Flag

Use the concept of merging in merge sort.

Pick the first 2 files and merge them to extract the k smallest values. Complexity O(k).
Repeat this for all files. Total complexity O(100k).
Space complexity O(k).

- kr.neerav April 07, 2014 | Flag Reply
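
A sketch of this repeated pairwise merge, assuming the sorted files have already been read into lists; the running result never grows beyond k elements, so each merge step is O(k):

def merge_k_smallest(a, b, k):
    """Merge two sorted lists, keeping only the k smallest of the result."""
    out, i, j = [], 0, 0
    while len(out) < k and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out

def k_smallest_pairwise(sorted_files, k):
    result = []
    for f in sorted_files:               # fold each file into the running result
        result = merge_k_smallest(result, f, k)
    return result

print(k_smallest_pairwise([[1, 4, 9], [2, 3, 20], [5, 6, 7]], 4))   # [1, 2, 3, 4]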

I doubt the complexity is O(100k).
From the first file, you will create an array of size k.
And every time you have to traverse all the elements of your size-k array for comparison with the first k elements of the next file...
So I guess the complexity is O(100*k^2).
Please correct me if I am wrong.

- arsragavan April 07, 2014 | Flag

As per the problem, each file is sorted, so we can pick the k smallest elements from the first 2 files and keep doing the same using the merge-sort technique. We can add an additional check like:

Assume I need the 100 smallest numbers, so I create smallest[100]. Once we get smallest[100] from the first 2 files, we can compare:
smallest[99] (the largest kept so far) < 3rd_File[0] -> if this is the case we can skip the 3rd file, and so on...

The worst case would be O(100*k^2).

- Rajesh M April 07, 2014 | Flag

For each file that we merge, we do O(k) comparisons. The total number of files to merge is 100, so the complexity is O(100k). I'm not sure how we would get k^2.

- kr.neerav April 07, 2014 | Flag

Every time you have to do k comparisons, one against each of the k elements in your current array. So it's definitely going to be O(k^2).

- arsragavan April 07, 2014 | Flag

I think the complexity will be linear in k.
The files are sorted, and the approach of merging 2 sorted files is being used here.
Suppose we have 2 sorted files with 100 elements each and we need the 10 smallest elements. Then we need only 10 comparisons:
compare the 1st element of file 1 with the 1st element of file 2, take the smaller, move to the next element in that file, and so on.

So for k elements the number of comparisons is O(k).
We do this for all 100 files, finding the k smallest elements using the concept of merging, so the overall time complexity is O(100k).

- pulkit.mehra.001 April 08, 2014 | Flag

Hi all!
The original solution of pulkit.mehra.001 is O(k log 100), with at most (k + 99) file reads.

I think it is the best so far.

Is your solution better than that?

- ninhnnsoc April 08, 2014 | Flag

How about this solution: since he has not mentioned anything about how much the memory can hold, we can take the first k elements from each file and put them into one file.
Then we can use the Quickselect algorithm to find the k smallest elements.
We can argue that this can be achieved in O(n) complexity...

- arsragavan April 07, 2014 | Flag Reply
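
A sketch of that idea, assuming the candidate pool is the first k numbers of each file gathered into one in-memory list; this simple (non-in-place) Quickselect partitions around a random pivot and recurses into one side only, giving expected linear time in the pool size:

import random

def quickselect_k_smallest(values, k):
    """Return the k smallest elements of values (unordered), expected O(len(values))."""
    if k <= 0:
        return []
    if k >= len(values):
        return list(values)
    pivot = random.choice(values)
    less = [x for x in values if x < pivot]
    equal = [x for x in values if x == pivot]
    greater = [x for x in values if x > pivot]
    if k <= len(less):
        return quickselect_k_smallest(less, k)
    if k <= len(less) + len(equal):
        return less + equal[:k - len(less)]
    return less + equal + quickselect_k_smallest(greater, k - len(less) - len(equal))

pool = [9, 1, 8, 2, 7, 3, 6, 2]                  # stand-in for the collected candidates
print(sorted(quickselect_k_smallest(pool, 3)))   # [1, 2, 2]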

We can use Median of medians approach...

- yash April 07, 2014 | Flag Reply

I think pulkit.mehra.001's solution, which is O(k log 100), is quite efficient already.

It reads the files only about k times for the k smallest numbers. Reading from a file is much slower than accessing RAM. Thus, I think reading these k numbers, which takes O(c·k) with a big coefficient c, is the bottleneck; the O(k log 100) algorithm for finding the answer is not the bottleneck.

Can you find the k smallest numbers in these 100 files with fewer than k file reads?
Can a file be randomly accessed?
IDK!

- ninhnnsoc April 07, 2014 | Flag Reply

pulkit.mehra.001's solution needs at most (k + 99) file reads.

- ninhnnsoc April 07, 2014 | Flag

I think he wanted you to think about file system buffering.

- Jaffa Krishna April 07, 2014 | Flag Reply

import java.util.List;
import java.util.PriorityQueue;

public static class FindSmallest {
    // One cursor per sorted "file" (modelled here as a List<Integer>).
    private static class Data implements Comparable<Data> {
        private int value;            // current value at this cursor
        private int index;            // position of that value in its file
        private List<Integer> file;   // the sorted file the cursor walks

        public Data(int value, int index, List<Integer> file) {
            this.value = value;
            this.index = index;
            this.file = file;
        }

        @Override
        public int compareTo(Data o) {
            return Integer.compare(this.value, o.value);   // avoids overflow of value - o.value
        }

        @Override
        public String toString() {
            return String.valueOf(value);
        }
    }

    public static int[] findKSmallest(List<List<Integer>> files, int k) {
        // Seed the min-heap with the first element of every non-empty file.
        PriorityQueue<Data> minQueue = new PriorityQueue<Data>();
        for (List<Integer> file : files) {
            if (!file.isEmpty()) {
                minQueue.add(new Data(file.get(0), 0, file));
            }
        }

        int[] out = new int[k];
        int count = 0;
        while (count < k && !minQueue.isEmpty()) {
            Data min = minQueue.poll();          // remove the current global minimum
            out[count++] = min.value;
            int nextIndex = min.index + 1;
            if (nextIndex < min.file.size()) {   // advance that file's cursor
                minQueue.add(new Data(min.file.get(nextIndex), nextIndex, min.file));
            }
        }
        return out;                              // the k smallest values, in ascending order
    }
}

- codedore April 07, 2014 | Flag Reply

Pure data structure approach: Fibonacci heap

Find min is very cheap, and each subtree could be loaded separately as a file.

Otherwise, I don't know if a data structure is needed.

You have 100 sorted files, so you can open each one in turn; as a speedup, you could read the first N/100 lines of each, if that is small enough, to minimize disk access. Then go through each file and do a merge.

Many approaches with different tradeoffs. What I think they are looking for at the 100-million level is the tradeoffs of each tactic. Lots of wrong answers and no standout right one.

- rotinom April 08, 2014 | Flag Reply

No need for any special DS; there is still an efficient algorithm...
You need to find the k smallest numbers in sorted files, so just scan the files k times, each time picking the smallest value among the current beginnings of the files. The time would be O(100k). Other algorithms offer only minor improvements.

- Jason April 11, 2014 | Flag Reply
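
A sketch of that head-scan approach, with lists standing in for the sorted files; each of the k rounds does a linear scan over the current front elements, no heap involved:

def k_smallest_scan(sorted_files, k):
    """Pick the overall minimum of the current file heads, k times."""
    cursors = [0] * len(sorted_files)       # next unread position in each file
    out = []
    for _ in range(k):
        best = None                         # index of the file with the smallest head
        for i, f in enumerate(sorted_files):
            if cursors[i] < len(f):
                if best is None or f[cursors[i]] < sorted_files[best][cursors[best]]:
                    best = i
        if best is None:                    # all files exhausted
            break
        out.append(sorted_files[best][cursors[best]])
        cursors[best] += 1
    return out

print(k_smallest_scan([[1, 4, 9], [2, 3, 20], [5, 6, 7]], 4))   # [1, 2, 3, 4]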

Let all 'n' files be sorted in increasing order.
We have to find the 'k' smallest numbers across the 'n' sorted files.
MaxHeap h = createMaxHeap(k);

// Read the first k elements from file[0] and create a MAX_HEAP out of them.
FileHandler fh = readFile(file[0]);
for (i = 0; i < k; i++)
    h.insert(line[i].data of fh);

// For every other file, test its first k elements against the current maximum.
for (j = 1; j < n; j++) {
    fh = readFile(file[j]);
    for (i = 0; i < k; i++) {
        if (line[i].data of fh < getMax(h)) {
            deleteMax(h);
            h.insert(line[i].data of fh);
        }
    }
}

return the heap as the result;

- nishant April 13, 2014 | Flag Reply

Hey guys!! What do you think of this approach?

First we take the starting number of each of the 100 files and sort them, which takes 100 log 100 and can be treated as a constant. Along with each number we store the file descriptor of the file it was extracted from.

Now treat the array as a queue. Starting from the minimum element, go to its file and extract all the numbers that are smaller than the next minimum, and store or print them. Then remove the minimum from the queue and update the minimum and next minimum. Keep a count of the numbers printed so far. Continue this process until the queue is empty or count = k. If the queue is empty and count < k, rebuild the queue with the numbers currently pointed to by the file pointers and repeat the process.

Pros: since the OS does block caching while reading a file, printing contiguous elements is a major advantage.
In the worst case, when consecutive minima come from different files, the time complexity would be k*100*log 100, which is also the case with the implementations above.

Please correct me if I am wrong.

- vicky22291 May 09, 2014 | Flag Reply
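
A sketch of this run-draining idea, with in-memory lists standing in for the files; it keeps the file heads in a heap and, once a file owns the minimum, drains that file until its next value exceeds the smallest head of the other files:

import heapq

def k_smallest_runs(sorted_files, k):
    """Drain runs of contiguous minima from whichever file currently owns the minimum."""
    heap = []                                   # (head value, file index)
    cursors = [0] * len(sorted_files)
    for i, f in enumerate(sorted_files):
        if f:
            heapq.heappush(heap, (f[0], i))
    out = []
    while heap and len(out) < k:
        _, i = heapq.heappop(heap)
        f = sorted_files[i]
        limit = heap[0][0] if heap else None    # smallest head among the other files
        # Keep emitting from this file while it still owns the global minimum.
        while len(out) < k and cursors[i] < len(f) and (limit is None or f[cursors[i]] <= limit):
            out.append(f[cursors[i]])
            cursors[i] += 1
        if cursors[i] < len(f):                 # put the file back with its new head
            heapq.heappush(heap, (f[cursors[i]], i))
    return out

print(k_smallest_runs([[1, 2, 3, 50], [10, 20, 30], [40, 60]], 6))   # [1, 2, 3, 10, 20, 30]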

Correction: the worst-case time complexity is k*log 100.

- vicky22291 May 09, 2014 | Flag

This problem could be solved using a next-min variable and a next-min count.
It will take O(K*100).

- Anand June 21, 2016 | Flag Reply

But then you are not making use of the fact that the files are sorted.

- pulkit.mehra.001 April 08, 2014 | Flag

He is right. The complete flow is the same as in my answer above: build a max-heap from the first k elements of file 1, then test the first k elements of every other file against the heap's maximum.

- Nishant April 13, 2014 | Flag

