If you have a file with 1 tril

Amazon Interview Question for Software Engineer / Developers

0

of 0 votes

12
Answers
If you have a file with 1 trillion numbers, but memory can hold only 1 million, how would you find the lowest 1 million out of all these 1 trillion numbers?
- anon October 20, 2010 | Report Duplicate | Flag | PURGE
Amazon Software Engineer / Developer Large Scale Computing

Email me when people comment.

An error occurred in subscribing you.

More Questions from This Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 3 vote

Implement heap with 1m number, max is on top.
Fill it with first 1m and sort. O(M*logM), M=1m.
Then go over each number, compare to first element, if it is smaller, replace first with new one and move it to right position in heap O(logM).
Total complexity is O(NlogM).

- Anonymous October 20, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

External Sort

- chennavarri October 20, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

If you are making a heap of 1 million number than your memory is full and u cant process the number.

So, the approach here is to minimize the number of memory swaps required between the disc and main memory.

- nittu October 21, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Using Max Tree we can get min 1 million numbers in o(n)

- thiruppathi raja October 24, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Building on the first reply:

Load up (1 million - 1) numbers into a heap.
Take the next number from the file:
If number is smaller than top:
Replace top with next biggest item on the heap
Insert the compared number into heap.

Time complexity = O(nlog(n)).

- Beginner October 24, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

And at the end you could be 1 short of 1M so this doesn't quite work...

- Anonymous October 27, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Naa .. IMO it works just fine to give smallest 1 million nos as the new number we are taking in just for comparison with the top most element in the max heap and memory is only used for storing data.

- JoshMachine October 28, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

test it to get the smallest TWO numbers of a 10-number array, you can only have max of TWO numbers in memory.
ex input: 10 9 8 7 6 5 4 3 2 1 and 1 2 3 4 5 6 7 8 9 10

- Anonymous November 28, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

@Chenna external sort is not required here, we just want to find the smallest one million of the trillion elements....
Solution by Beginner looks fine

- Harshal November 19, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

use selection/partition alg, basically you partition the file (like in qsort) using some pivot value and get the index of pivot value after partition. Then check if the new pivot index is equal to 1M, if it is true output the first 1M elements in file, otherwise continue to run partition in the left or right part of the file. If the pivot value is properly chosen the alg is said to be O(n)

{{
for(int left = 0, right = 1 trillion;;)
{
int pivot_index = some good pivot guess;
int new_pivot_index = partition(file, left, right, pivot_index);
if (new_pivot_index == 1M)
{
output first 1M in the file;
return;
}
if (new_pivot_index < 1M)
{
left = new_pivot_index + 1;
}
else
{
right = pivot_index - 1;
}
}
}}

- ldd January 19, 2011 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

formatting the code:

for(int left = 0, right = 1 trillion;;)
  {
    int pivot_index = some good pivot guess;
    int new_pivot_index = partition(file, left, right, pivot_index);
    if (new_pivot_index == 1M)
    {
      output first 1M in the file;
      return;
    }
    if (new_pivot_index < 1M)
    {
      left = new_pivot_index + 1;
    }
    else
    {
      right = pivot_index - 1;
    }
  }

- Anonymous January 19, 2011 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

{{{ 1T(trillion) = 1000 billion = 1000*1000M(million) = 1M*1M DP1 (Distributed Processing with 1 host) - if there is only one host available for processing: 1) Load 1M records into array, and iteratively heapify into -maximum- heap. O(M) 2) For each one in the remaining (1T-1M) records, do the following: O(T*logM) a) Compare it with heap top b) If it is no less than heap top, discard the record c) Otherwise remove heap top to have an updated maximum heap. d) Add the new record into the heap. 3) HeapSort the maximum heap. O(M*logM) Overall the time complexity is O(T*logM) DPn - if there are n hosts available 1) Divide 1T into n partitions, so that each host work on one partition. O(T) 2) For each host, process in DP1 way to produce 1M sorted minimum numbers as disk file partition.k, where k=1 to n. O(T/n*logM) 3) Merge n sorted partitions by n-way merging. O(M*logM) a) remove smallest element from each parition and add to a heap of size n. b) Do the following iteratively until 1M records are output: i) remove heap top and output. ii) remove the next record from the partition where removed heap top in i) belonged to. iii) add the record to heap. Overall wall time complexity is O(T/n*logM). if n >logM, then the wall time is dominated by partition cost in step 1), i.e., O(T). - jzhu May 18, 2012 | Flag Reply

CareerCup

Amazon Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews