Find the largest k elements fr

Amazon Interview Question for SDE1s

1

of 1 vote

22
Answers
Find the largest k elements from a large file?
You dont have RAM to store even k elements.
- grave June 27, 2013 in India | Report Duplicate | Flag | PURGE
Amazon SDE1 Algorithm

Email me when people comment.

An error occurred in subscribing you.

Country: India
Interview Type: In-Person

More Questions from This Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 1 vote

Just implementing the K-th order statistic algorithm to work with files will solve this problem. Complexity - O(N)

- gevorgk June 28, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

its not the K'th largest, Its K largest

- NaMo June 29, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 2 votes

Does it matter ? :) if you're looking for K-th largest element by partitioning array every time, after finding a K-th largest element, the elements in array after K will be K largest elements. In this caser array=file.

- gevorgk June 29, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

yeah, you are right. It would contain target elements ( but not in sorted order, that was what confused me !! )

- NaMo June 30, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

But how can you find the

k-th order statistics

externally ? Don't you have to keep in memory the whole file ? I haven't seen the Selection algorithm working externally yet and I couldn't find it,
Thanks

- Arian July 01, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Hi Gevorg.

Have you considered meeting in your suggested solution the constraint

"You dont have RAM to store even k elements."

The K-th order statistics does require at least "K * sizeof(object)" allocated memory. Or you maybe suggesting some implementation that could use file system ?

- ashot madatyan July 05, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

Ashot, of course I'm suggesting to modify it to work with file system. Something like
1. Choose a pivot element
2. Start reading a file, write elements less than pivot into one file, and elements greater than pivot into another file. Depending on position of pivot (less than or greater than K) recurse into one of output files.

Working with files will make it really slow, but there is no other choice here. The only optimisation we can do here is to read an input file by blocks of maximum size which can fit into memory instead of reading one element at a time.
So, basically idea is same as in external sort, when you have to sort a file which doesn't fit in memory.

- gevorgk July 06, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Kth order statistic will work when all elements can be stored in main Memory.
I am not sure if we can use secondary storage for implementing kth order statistic.

- varun October 26, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 2 vote

Since we do not have enough memory to store the values, we need to use external file to store the numbers.
File will store the k largest numbers.
Every time we read a number from the input file, we need to compare with the already present numbers in the output file then accordingly store or reject this number.
Complexity is O(k*n).
Can we improve the searching in the output file?

- subahjit June 27, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

in fact linear solution is possible, see my answer below

- gevorgk June 28, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Since we can't hold even k elements in memory, another alternative is to use a min-heap of size k/2. In the first pass, keep storing the top k/2 elements in the min-heap and output them at the end of the iteration.

In the second pass, store the next top k/2 elements and output them.

Time complexity would be O(nlg(k/2)). If k is large, then it would make a difference to Subhajit's algorithm. Otherwise, time complexity would be nearly the same as his.

- Murali Mohan July 02, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

You can do it using kth order statistics. in complexity O(n). while partitioning instead of storing into memory store it on disk. once you find the Kth element merge all the files having values larger than a[k] and output it.

- Sourabh July 03, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

N - size of input
K - size of required set
M - size of memory

- build minheaps of size M, we will have K/M heaps.
- Within memory store the smallest value (say MIN), and a pointer to the minheap (say HP) with this smallest value
- When reading the input if you encounter a value less than MIN, then adjust HP to add the new element - Time complexity log M
- Search through K/M heaps to find new MIN and HP - time complexity log (K/M)

Time complexity will be N (log M + log (K/M))

- IntwPrep July 18, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

-1

of 1 vote

I do this with MemoryMappedFile in C# if I understand your question right.

MemoryMappedFile mySequentialMMF = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open, "MySequentialMap");
	bool IsmutexCreated;
        Mutex mymutex = new Mutex(true, "NonPersisterMemoryMappedFilemutex", out IsmutexCreated);
	StreamReader sr = new StreamReader(mySequentialMMF.CreateViewStream());
        while (!sr.EndOfStream)
        {
               if (sr.ReadLine() > MAXItem)
		{
			MAXItem = sr.ReadLine();
		}
        }
         

        sr.Close();
        mymutex.ReleaseMutex();

- Rapira June 27, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

-1

of 1 vote

and I can make max heap for k elements if we need have a k max elements

- rapirapp June 27, 2013 | Flag

Comment hidden because of low score. Click to expand.

-2

of 2 votes

@rapirapp-i think minheap will help in finding k largest,not maxheap

- Amit June 27, 2013 | Flag

Comment hidden because of low score. Click to expand.

-1

of 3 votes

Ok Amit let's understand what is the difference between min heap and max heap. In the first one uses ascending priority where the smallest item is the first to be popped from the heap. A max heap uses descending priority where the largest item is the first to be popped.

- rapirapp June 27, 2013 | Flag

Comment hidden because of low score. Click to expand.

-1

of 3 votes

@Amit is right. You'll need a min-heap for k largest elements

- oOZz June 27, 2013 | Flag

Comment hidden because of low score. Click to expand.

-2

of 2 votes

ok I don't say that Amit is not right but I want to understand why is Amit right.

- rapirapp June 27, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

-1 to all commenters in this thread.
1. First of all, the solution posted on top does something weird, but does't solve the problem
2. Max heap will not help here, since problem says that even K elements wouldn't fit in the memory
3. If they would fit, max heap is the solution, not the min heap.

- gevorgk June 27, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Please do a dry run of your algo using maxheap..you will find why exactly do you need minheap...Nd if you find nothing wrong, write your code here with maxheap. I will give test cases where that will fail :)

- Amit June 28, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Well, assuming that using a heap already violates problem statement (remember, we can't store even K elements in memory ?), my solution is follows (also violates the property that file cannot fit in memory)

1. read all numbers into memory
2. Create MAX HEAP out of that elements (O(N))
3. Attention.... Extract MAX K times !

total complexity – O(N) + O(k*logN), memory - O(N)

Using the min heap you'll use less memory ( O(K) ) but algorithm will be O(N*logK), which is greater than O(k*LogN) if K < N (which obviously is).

- gevorgk June 28, 2013 | Flag

CareerCup

Amazon Interview Question for SDE1s

Books

Videos

Resume Review

Mock Interviews