## Amazon Interview Question for Software Development Managers

Country: United States
Interview Type: Phone Interview

1 of 1 vote

I guess the solution above works. Also, the article below talks about various solutions related to this topic.

aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html

1 of 1 vote

There are many ways to solve this problem. The following approaches can be used:

1) If changing the positions of the numbers is allowed, sort the list and remove the duplicates. Sorting requires at least n*log(n) time if quicksort or merge sort is used.
2) Iterate through the list in O(n^2) time, picking each number as a seed and comparing the seed against every other number to remove duplicates. This is not an efficient method.
3) Using hashing, map the numbers into a smaller number of buckets, then compare the numbers within each bucket to arrive at the list of unique numbers.
4) Use a bitmap. Because we know these are 32-bit integers, we require 2^32 bits for the bitmap, initialized to all 0. On the first occurrence of a number, set its corresponding bit to 1. If the number appears again, remove it from the list. Each lookup is O(1), but the bitmap requires a constant amount of memory: 2^32 bits = 2^29 bytes = 512 megabytes.
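The bitmap idea in point 4 can be sketched in Java (a rough sketch, not tuned for 5 billion inputs; the class and method names here are mine). Since java.util.BitSet tops out at 2^31 bits, one BitSet per sign half covers the full 32-bit range:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BitmapDedup {
    // Returns the input in order with later duplicates dropped.
    // Two BitSets cover the full 32-bit range, since a single BitSet
    // is limited to Integer.MAX_VALUE bits.
    static List<Integer> dedup(int[] nums) {
        BitSet nonNegative = new BitSet(); // bits for values 0 .. 2^31-1
        BitSet negative = new BitSet();    // bits for values -1 .. -2^31
        List<Integer> out = new ArrayList<>();
        for (int v : nums) {
            BitSet bits = (v >= 0) ? nonNegative : negative;
            // ~v maps -1..Integer.MIN_VALUE to 0..2^31-1 without overflow
            int idx = (v >= 0) ? v : ~v;
            if (bits.get(idx)) continue; // duplicate: drop it
            bits.set(idx);
            out.add(v);
        }
        return out;
    }
}
```

BitSet allocates lazily up to the highest set bit, so memory peaks near 512 MB only when values span the whole range.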


IMO option 4 is the best and the correct solution to this problem. Allocating 512 MB of memory for such a large input set should not be a problem. However, if the interviewer wants you to reduce the memory required, the same logic still applies with a small change and proportionally more (but still linear per pass) processing time.

What we do is divide the processing into N passes, depending on the memory restriction imposed by the interviewer. Consider N = 2 for this example, so we divide the work into 2 passes. In the first pass over the 5 billion numbers we only consider values from 0 to 2^32/N = 2^31, and accordingly allocate a bitmap of 2^31 bits / 2^3 = 2^28 bytes = 256 MB. In the second pass we consider values in the range 2^31 to 2^32, again allocate 256 MB, and remove duplicates using the technique from point 4.

Hence, by applying the same algorithm to different ranges of numbers, we have reduced the memory footprint by half. It can be reduced further by incorporating more passes.
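A minimal sketch of this multi-pass idea (illustrative names; it counts distinct values rather than rewriting the list, and assumes `passes` divides 2^32 evenly):

```java
import java.util.BitSet;

public class BatchedBitmapDedup {
    // Counts distinct values in `nums` using `passes` scans, each covering
    // 1/passes of the unsigned 32-bit range, so each bitmap is proportionally smaller.
    static long countDistinct(int[] nums, int passes) {
        long range = 1L << 32;
        long bucket = range / passes; // assumes passes evenly divides 2^32
        long distinct = 0;
        for (int p = 0; p < passes; p++) {
            long lo = p * bucket, hi = lo + bucket; // unsigned bounds of this pass
            BitSet seen = new BitSet();
            for (int v : nums) {
                long u = v & 0xFFFFFFFFL; // treat the value as unsigned
                if (u >= lo && u < hi && !seen.get((int) (u - lo))) {
                    seen.set((int) (u - lo));
                    distinct++;
                }
            }
        }
        return distinct;
    }
}
```

The trade-off is exactly as described above: N passes cost N scans of the input in exchange for a bitmap 1/N the size.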


yes! That can be done if there is any restriction on memory usage.


On approach 4, the bit-field space is one area that could do with another improvement. Instead of keeping one bit for each number, set on its first occurrence, we could reduce the memory requirement by using a hash and a Bloom filter. Typically a Bloom filter of 8 bits per hash input can answer in O(1) time with about a 2% false-positive rate. For an input size of 2^32 we can choose a good hash function of 8*32 bits to give a relatively collision-resistant hash, and then use a Bloom filter of 8*(8*32) bits to find out whether the hash of the input occurred before or the pattern is new.
If it is new, we add the hash to the Bloom filter; if it occurred before, we remove the input from the list.


There were some mistakes in the calculation above. The hash space could be 8 or 10 bits in size, mapping the 2^32 inputs down to 2^8 or 2^10 distinct hash outputs. Now, allowing 8 or 10 bits per entry [as recommended on Wikipedia for a 1% false-positive rate], the Bloom filter size would be 2^10 * 10 bits, i.e. 1024 * 10 = 10,240 bits, or about 1.25 KB, which is a lot smaller than keeping a bit for every 32-bit number.
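For concreteness, here is a tiny Bloom filter sketch (the sizes, names, and mixing constant are my own illustrative choices, not from this thread). One caveat worth stating plainly: a Bloom filter's false positives mean some genuinely unique values would be treated as duplicates and wrongly removed, so this only fits if approximate de-duplication is acceptable.

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int size;   // number of bits (m)
    private final int hashes; // number of hash probes (k)

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive k probe positions from two base hashes (double-hashing scheme).
    private int index(int value, int i) {
        int h1 = Integer.hashCode(value);
        int h2 = 0x9E3779B9 * value; // arbitrary odd constant for a second mix
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(int value) {
        for (int i = 0; i < hashes; i++) bits.set(index(value, i));
    }

    // May return true for a value never added (false positive),
    // but never returns false for a value that was added.
    boolean mightContain(int value) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(value, i))) return false;
        return true;
    }
}
```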


Nice explanation!

0 of 0 votes

Allocate two byte arrays of (5 billion / 8) size - one for the positive integers, one for the negative integers.
Go through the array. If the next item is >= 0, check whether bit number [item % 8] of firstByteArray[item / 8] is set. If it is already set, we found a duplicate: print it and remove it from the array. If it is unset, set it and continue.
Do the same for item < 0 against the second array.
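The bit-twiddling in that description can be sketched as a small helper (a sketch with names of my own; the full version would allocate one of these per sign):

```java
public class ByteBitmap {
    private final byte[] bytes;

    ByteBitmap(int nbits) {
        bytes = new byte[(nbits + 7) / 8]; // round up to whole bytes
    }

    // Returns true if bit i was already set (i.e., a duplicate was found);
    // sets the bit either way, exactly as the comment above describes.
    boolean testAndSet(int i) {
        int mask = 1 << (i % 8);
        boolean wasSet = (bytes[i / 8] & mask) != 0;
        bytes[i / 8] |= mask;
        return wasSet;
    }
}
```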


Dude, why do you need two byte arrays? You can treat the integers as unsigned if you want.


Why can't we use 2 bit arrays, each of size 2^31, without any of this % business? The space requirement for that is 0.25 + 0.25 = 0.5 GB with no extra calculation required (which saves at least 5 billion % 8 operations).


It looks like you are assuming all the elements are in the range of 5 billion, which is not mentioned in the problem description.


32-bit limits everything to 2^32 or less.

0 of 2 votes

This can be solved easily with two hash tables: in the first I keep every number that appears at least once in the input; in the second I keep every number that appears more than once (i.e., a duplicate). Simply iterate over the input in one pass, testing the first hash table and then the second to see if it already contains the number, inserting the number and continuing if it is not there already.
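The two-table idea can be sketched as follows (names are mine; the second set plays the role of the second hash table, so each duplicate is recorded only once):

```java
import java.util.HashSet;
import java.util.Set;

public class HashSetDedup {
    // Fills `duplicates` with each value that occurs more than once,
    // and returns the set of distinct values seen.
    static Set<Integer> findDuplicates(int[] nums, Set<Integer> duplicates) {
        Set<Integer> seen = new HashSet<>();
        for (int v : nums) {
            // Set.add returns false if the value was already present
            if (!seen.add(v)) duplicates.add(v);
        }
        return seen;
    }
}
```

As the reply below notes, the cost of this simplicity is hash-table overhead that can far exceed the 512 MB bitmap.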


But this solution has a worst-case space complexity of more than 2 GB (2 GB for the numbers between -2^31 and +2^31, plus additional overhead for maintaining the hash table). The first solution posted here and the solutions in the article linked above come closer to solving this with less space. I am not sure if there are other, better solutions.

Also, correct me if I am wrong, but I guess in the original solution he/she meant to say "Allocate two byte arrays of (2 GB / 8) size". We only need to check the bit indexed by each element to see if it is a duplicate.

0 of 0 votes

May I ask if a B-tree (not a BST) would work? I understand it is frequently used to store large data sets.

0 of 0 votes

One solution that seems reasonably efficient:

Pass 1: allocate an array of 2^32 two-bit entries, initialized to 00. Iterate over the 5-billion-element array and mark the 2^32 array as follows:
00: not seen
01: seen
10: duplicate
Then we can output each duplicate number the first time its entry switches from "seen" to "duplicate".

Pass 2: this pass is only used to eliminate the duplicates.
Create two pointers, each pointing to the 0th element of the array.
Advance the first pointer through the 5-billion-element array, using each number to index the 2^32 array to see whether it is a "seen" or "duplicate" element. If it is a "seen" element, copy it to where the second pointer points and advance the second pointer by 1.
When the first pointer reaches the tail, we are done, and the second pointer marks the tail of the new array.
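The two-pointer compaction of pass 2 can be sketched in-place like this (a simplified sketch: it assumes non-negative values so a single BitSet stands in for the 2^32-entry state array, and it keeps the first occurrence of each value):

```java
import java.util.BitSet;

public class TwoPointerCompact {
    // Removes later occurrences in place; returns the new logical length.
    // `write` is the second pointer of the description, `read` the first.
    static int compact(int[] nums) {
        BitSet seen = new BitSet();
        int write = 0;
        for (int read = 0; read < nums.length; read++) {
            if (!seen.get(nums[read])) {
                seen.set(nums[read]);
                nums[write++] = nums[read]; // copy kept element down, advance write
            }
        }
        return write; // elements [0, write) are the de-duplicated array
    }
}
```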


Case I

You don't need 2 bits; 1 bit is enough:
0 - unseen
1 - seen
Whenever a number comes in, if its bit is already 1 you can print it as a duplicate.

0 of 0 votes

Sorry - that's the link for the previous question ...

0 of 0 votes

``````// requires: import java.util.BitSet;
public static void removeDups2(int[] origArr, int[] newArr)
{
    // A 32-bit signed integer ranges from -2^31 to 2^31-1.
    // We use the value itself as a bit index; 2^31 bits is about 2.1 billion bits (256 MB).
    BitSet positive = new BitSet(Integer.MAX_VALUE); // bits for values 0 .. 2^31-1
    BitSet negative = new BitSet(Integer.MAX_VALUE); // bits for values -1 .. -2^31
    boolean isDuplicate;
    for (int i = 0, j = 0; i < origArr.length; ++i)
    {
        isDuplicate = false;
        if (origArr[i] >= 0)
        {
            // if the bit is already set it's a duplicate, else set the bit
            if (!positive.get(origArr[i]))
                positive.set(origArr[i]);
            else
                isDuplicate = true;
        }
        else
        {
            // ~origArr[i] maps -1..Integer.MIN_VALUE to 0..2^31-1; multiplying
            // by -1 would overflow for Integer.MIN_VALUE
            int idx = ~origArr[i];
            if (!negative.get(idx))
                negative.set(idx);
            else
                isDuplicate = true;
        }
        if (isDuplicate)
        {
            System.out.println(origArr[i] + " is a duplicate .. remove it");
            origArr[i] = 0; // only necessary if we are not using the 2nd array for output
        }
        else
            newArr[j++] = origArr[i];
    }
}``````

0 of 0 votes

Create an array DUP of size 2^32, initialized with -1 (or some other sentinel).
Pass through the given array, using each value as an index into DUP: if DUP[value] is -1, set it to 1; if it is already 1, the value has been seen before, so print it. A one-pass solution is obtained.


Improvement: since these are signed 32-bit numbers in the 5-billion array, an array of 2^31 entries indexed by the magnitude |v| is enough to keep track: 0 if neither +v nor -v has been seen; the last bit set if the positive number has been seen; the second-to-last bit set if the negative number has been seen.

Now if the current number's sign is already recorded for its magnitude, it is a duplicate - and the size of the array is reduced by a lot.
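The magnitude-plus-sign-flag idea can be sketched as a small helper (illustrative names; the demo table is small, whereas the full version needs 2^31 entries, and Integer.MIN_VALUE would need special handling since it has no positive counterpart):

```java
public class SignFlagDedup {
    // flags[|v|] stores two bits: bit 0 = positive v seen, bit 1 = negative v seen.
    // Returns true if v (with its sign) was seen before; records it either way.
    static boolean isDuplicate(byte[] flags, int v) {
        int mag = Math.abs(v);        // demo assumes v != Integer.MIN_VALUE
        int bit = (v >= 0) ? 1 : 2;   // which sign flag to use
        boolean dup = (flags[mag] & bit) != 0;
        flags[mag] |= bit;
        return dup;
    }
}
```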


Well, since only 2 bits are being used in every element of that array, the remaining bits can be used as well, packing the flags for several magnitudes into each element and further reducing the memory requirement. Working with bits is also fast, so this improves performance too.

0 of 0 votes

I wonder whether "eliminate duplicates" means "eliminate the extra copies of a duplicated number but retain one" or "eliminate every copy of a duplicated number".

If it is the former, one pass is feasible.

0 of 2 votes

[C#]
Use a HashSet<int> to keep track of the 32-bit ints that have been seen.

We also need to overwrite duplicate entries with the non-duplicates that follow...

``````static void RemoveAndPrintDuplicates(BigArray<int> input)
{
    HashSet<int> set = new HashSet<int>();
    Int64 destIndex = 0;

    for (Int64 srcIndex = 0; srcIndex < input.Length; srcIndex++)
    {
        if (set.Contains(input[srcIndex]))
        {
            Console.WriteLine("Duplicate: {0}", input[srcIndex]);
            continue;
        }
        set.Add(input[srcIndex]); // record first occurrence

        // If srcIndex and destIndex differ, compact the
        // source element down into the destination slot.
        if (srcIndex != destIndex)
        {
            input[destIndex] = input[srcIndex];
        }
        destIndex++;
    }
}``````

0 of 2 votes

we can do this in two steps:

1. Sort the array using your favorite sorting algorithm.
2. Find the duplicates and remove them.

pseudo code:

``````removeDuplicates(A[], size)
{
    A = sort(A, size);
    last_element = infinity;
    index = 0;
    for i = 0 to size - 1:
        if A[i] == last_element      // duplicate
            // do nothing
        else
            A[index] = A[i];
            index++;
            last_element = A[i];

    return A;    // the first `index` elements are the unique values
}``````

the size of array A is now equal to index.

0 of 0 votes

How are you guys getting the space requirements under 2 GB, or 512 MB in some cases?

The array itself is 20 GB, isn't it? Please correct me if I'm wrong:
32 bits = 4 bytes.
5 billion 32-bit numbers = 5 billion * 4 bytes = 20 billion bytes.
20 billion bytes = 2*10^10 bytes = 2*10^7 KB = 2*10^4 MB = 2*10^1 GB = 20 GB (roughly).


Because a bitmap will be used, which is of fixed length: each bit represents the presence or absence of one number. The 512 MB is the extra memory required for processing, apart from the array of numbers itself.

0 of 0 votes

Linux/Unix solution:

$ sort <filename> | uniq

(uniq only collapses adjacent lines, so the input must be sorted first; note that `uniq -u` would drop every copy of a duplicated value rather than keep one.)


0 of 0 votes

We can think of this as equivalent to: there is a 20 GB file full of unsigned integers; how do we produce a file with no duplicates?
1. Break the large file into 40 files of 500 MB each.
2. Sort the integers in each file, modifying whatever sort you use to eliminate duplicates within the file and print them.
3. Merge the 40 files (like merging 40 sorted arrays), eliminating and printing duplicates as you merge.
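The merge of step 3 can be sketched as a k-way merge over sorted chunks (a sketch with names of my own; in-memory int arrays stand in for the sorted files):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeDedup {
    // Merges already-sorted, internally de-duplicated chunks (the output of step 2),
    // dropping values that appear in more than one chunk.
    static List<Integer> merge(List<int[]> chunks) {
        // Heap entries: {value, chunkIndex, offsetInChunk}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int c = 0; c < chunks.size(); c++)
            if (chunks.get(c).length > 0) heap.add(new int[]{chunks.get(c)[0], c, 0});
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            // Output is nondecreasing, so comparing with the last emitted
            // value is enough to skip cross-chunk duplicates.
            if (out.isEmpty() || out.get(out.size() - 1) != top[0]) out.add(top[0]);
            int next = top[2] + 1;
            if (next < chunks.get(top[1]).length)
                heap.add(new int[]{chunks.get(top[1])[next], top[1], next});
        }
        return out;
    }
}
```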
