Google Interview Question for Software Developers


Country: United States
Interview Type: In-Person




9 of 9 votes

The updates make this a much more interesting problem.

First, I'll explain the algorithm without the 8gb memory limit. Doubles (and floats in general) are designed so that when ordered lexicographically, they'll be ordered from the largest to the smallest. This is important because this means that if a double is smaller than another double based only on the first 32 bits, the rest don't matter.

Now we can have 2**32 = ~4 billion buckets. We do a first pass and count the number of occurrences of each distinct 32-bit prefix. After that, we can easily find the first 32 bits of the median from the counts in the buckets, and we do a second pass taking into account only the doubles with that particular prefix, finding the suffix in the same way.

Now, because we have only 8GB, we're in trouble. We have 1TB of doubles at 8 bytes per double; that makes 125 billion doubles. Because we need ~4 billion buckets and the count in each can go up to 125 billion, we need ceil(log2(125*10**9)) = 37 bits per bucket, meaning we would need 37*2**32 bits of memory, which is ~20GB.
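
As a quick sanity check of the arithmetic:

from math import ceil, log2

n = 10 ** 12 // 8                     # 125 billion doubles in 1TB
bits = ceil(log2(n))                  # 37 bits to count up to n
print(bits * 2 ** 32 / 8 / 2 ** 30)   # ~18.5 GiB, i.e. the ~20GB figure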

Now this is probably the most interesting part. We can't really use anything with pointers, since each pointer takes 8 bytes. Instead, let's create a byte array of size 2**32. This array obviously takes 4GB of memory. In addition, let's create a (C++-style) vector of 32-bit integers.

The array keeps track of the count of each 32-bit prefix/suffix modulo 256, and every time a count overflows (goes from 255 to 0), we append the prefix/suffix corresponding to that count to the vector. A count can overflow at most 125*10**9/2**8 = ~490 million times, so the vector can have at most 490 million entries, taking at most 490*10**6*4 bytes = ~2GB.

So after one pass we have a table with the count of each 32-bit prefix/suffix modulo 256, and a vector with one occurrence of the prefix/suffix for each time its count overflowed. We can sort the vector, start iterating from the smallest prefix/suffix, and add 256 to the count every time we see a corresponding entry in the vector.

This solution takes 4GB for the byte array and at most 2GB for the int vector, so there's even a nice 2GB to spare.
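
A minimal sketch of the counting pass (in Python, assuming the file is streamed as the raw 8 bytes of each little-endian double; the names here are made up):

import bisect
import struct
from array import array

def count_prefixes(chunks):
    counts = bytearray(2 ** 32)      # 4GB: one 8-bit counter per prefix
    overflows = array('I')           # 32-bit ints, like a C++ vector
    for chunk in chunks:             # chunk: the raw 8 bytes of one double
        bits = struct.unpack('<Q', chunk)[0]
        prefix = bits >> 32          # the first 32 bits pick the bucket
        counts[prefix] = (counts[prefix] + 1) & 0xFF
        if counts[prefix] == 0:      # wrapped 255 -> 0: remember it
            overflows.append(prefix)
    return counts, overflows

def true_count(prefix, counts, sorted_overflows):
    # real count = the byte counter plus 256 per overflow entry
    lo = bisect.bisect_left(sorted_overflows, prefix)
    hi = bisect.bisect_right(sorted_overflows, prefix)
    return counts[prefix] + 256 * (hi - lo)

The second pass is identical but buckets on the low 32 bits of the doubles that carry the median's prefix.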

- jani March 21, 2016 | Flag Reply
0 of 0 votes

Hi @jani, I am not clear about "We can sort the vector and start iterating from the smallest prefix/suffix and add 256 to the count every time we also see a corresponding entry in the vector." Do you want to record this "count" as you add 256 to it? If I understand you correctly, you need 4 billion "counts", right? If so, is each count an integer? Wouldn't these counts take an additional large amount of memory?

- Lex March 23, 2016 | Flag
1 of 1 vote

"Doubles (and floats in general) are designed so that when ordered lexicographically, they'll be ordered from the largest to the smallest" <= How true is this? The most significant bit is a sign bit.

- funcoolgeek May 08, 2016 | Flag
0 of 0 votes

"Doubles (and floats in general) are designed so that when ordered lexicographically, they'll be ordered from the largest to the smallest. This is important because this means that if a double is smaller than another double based only on the first 32 bits, the rest don't matter."

-- can you elaborate more on this statement?
If the rest of the bits don't matter, what do they contribute to the value of the double variable? Nothing?
Thanks.

- settyblue June 06, 2016 | Flag
1 of 1 vote

The first solution that comes to mind is to use an external sort to sort the 1TB file of doubles; since the number of doubles is given to be odd, we can then pick the middle number. Of course, we need a function to compare two doubles... need to go back and dig into my comp arch stuff to see how to do this. Complexity is O(n log n), with additional files to store the intermediate runs of the external sort.

The next thought that came to me was: can we use a max-heap and a min-heap to keep the first and second halves of the numbers? After reading the last number, you would know the median: either the last number OR the top of the max- or min-heap. This solution does not use file writes, but will use far more memory than 8GB. So, not a good one.

The next thought that came to me was: can I use a bucket-sort-based method? Here is my thought process. 2^64 is the total number of possible bit patterns; we can use just the mantissa of a double (2^52). 8GB / 8 bytes per double => 1G doubles. So, to fit 2^52 into 2^30, I can use 22 of the bits of the input number as an index. The plan is to form 2^22 buckets, make one pass through the file, and develop a count of numbers for each bucket. Note that I am not storing the actual numbers. Now you have the total count of numbers in the file and the "middle" count. Find the bucket which maps to the middle count. Go through the file a second time and collect all the numbers that belong to the identified bucket. Sort them (taking the exponent into account now) and find the exact median.

Thoughts?

- smallchallenges March 19, 2016 | Flag Reply
2 of 2 votes

But I think I am missing something here.
Initially, while counting the number of doubles belonging to each bucket, you are not considering the exponent, which means numbers with completely different exponents will go into the same bucket. Doesn't this imply that the median we compute is completely wrong?

- settyblue June 06, 2016 | Flag
1 of 1 vote

Alright, the question seems abandoned, so it's safe to post an answer:
IEEE754 doubles have the property that if reinterpret_cast'ed to uint64, comparisons will still yield correct results (except for negative numbers; how to fix this is an exercise for the reader).
Now we interpret the doubles as uint64.
So we know the total number of doubles in the file. Then we count the numbers with the MSB being 1 and being 0. After this is done, we know whether the median has its MSB as 1 or 0. Then we keep only the numbers whose MSB equals this value and do the same for each next bit, 64 times in all, until we get all the bits of the median.
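
Roughly, in code (a sketch; read_all() re-streaming the file as sign-adjusted uint64 keys is an assumption, and n is the total count):

def median_bitwise(read_all, n):
    # fix the median's bits from the most significant bit down
    want = n // 2                    # zero-based rank of the median
    prefix, mask = 0, 0
    for bit in range(63, -1, -1):
        b = 1 << bit
        zeros = sum(1 for u in read_all()
                    if (u & mask) == prefix and not (u & b))
        if want >= zeros:            # median is in the 1-bit group
            prefix |= b
            want -= zeros
        mask |= b                    # this bit of the median is now fixed
    return prefix
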
But hey, we haven't used the memory at all. So instead of having 2 counters for 0 and 1, let's have 2^32 counters. This way we can find the median in just two passes: one for the highest 32 bits and one for the lowest 32 bits.
So we think: there is 1TB of doubles, that's approximately 140 billion numbers, so we would have to use 37 bits for every counter (for the worst skew case), and 37 * 2**32 bits ≈ 158 Gbit, i.e. ~20GB of memory.
Turns out we are OK with just 12 bits for each counter.
The idea is the following: if some counter overflows, append its bucket number to a list, so that later we will know that it overflowed.
So for k-bit counters we need k * (2**32) bits for the counters and 32 * (2**40 / 2**k) bits for the overflow entries.

def mem(k, total=2 ** 40):
    # k-bit counters for all 2**32 buckets, plus one 32-bit overflow
    # entry per wrap (total / 2**k wraps in the worst case).
    # note: 1TB is really 2**37 doubles, so total=2**40 overestimates
    # (acknowledged in a comment below); the bound still fits.
    bits = k * (2 ** 32) + 32 * total // (2 ** k)
    return bits // 8                      # bytes

for k in range(1, 100):
    print(k, mem(k) / float(2 ** 30))     # GiB for each counter width

And this gives us ~7GB of required memory, leaving ~1GB for the OS runtime, file buffers, etc.
After we pass through the file, sort the overflow entries (268 million max) using quicksort, and do a merge-sort-style step through both arrays to get the real uint64_t value of each counter and decide where the median is.
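
The merge step could look like this (a sketch, assuming counts can be indexed by bucket and holds the k-bit residues):

def median_bucket(counts, overflows, n, k=12):
    # sweep buckets in increasing order, adding 2**k per overflow of a
    # k-bit counter, until the running total passes the median's rank
    overflows = sorted(overflows)
    want, total, j = n // 2, 0, 0
    for bucket in range(2 ** 32):
        c = counts[bucket]
        while j < len(overflows) and overflows[j] == bucket:
            c += 2 ** k
            j += 1
        total += c
        if total > want:
            return bucket            # the median's top 32 bits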

So basically this problem is about the bit-bucket trick and counter-compressing trick.

- emb March 21, 2016 | Flag Reply
0 of 0 votes

Let's reduce the size of the problem. Suppose you have 101 numbers and memory enough to store only 10. The median is element number 51.

1. Use the memory to find the 10 smallest elements. You can use a max-heap to do this.
2. Get the maximum element from the heap.
3. We now have the 10th element.
4. Pass through the numbers again and find the next 10, using a max-heap again.
5. We now have the 20th element.
6. Continue this until you have the 50th element.
7. Find the next element.

If n is the number of elements and m is the memory capacity, the time complexity of this method is O((n/m)*n): you have to parse the whole file n/m times.
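
A sketch of this approach (assuming distinct values and a read_all() that re-streams the file; heapq's min-heap acts as a max-heap by negating values):

import heapq

def select_chunked(read_all, k, m):
    # each pass keeps the m smallest values strictly above `floor`;
    # ties at the boundary would need a duplicate count instead of
    # the `x <= floor` filter
    floor, found = float('-inf'), 0
    while True:
        heap = []
        for x in read_all():
            if x <= floor:
                continue
            if len(heap) < m:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:
                heapq.heapreplace(heap, -x)
        batch = sorted(-v for v in heap)
        if found + len(batch) >= k:
            return batch[k - found - 1]   # the k-th smallest overall
        found += len(batch)
        floor = batch[-1]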

- xeq March 19, 2016 | Flag Reply
0 of 0 votes

@smallchallenges
The answer is close to a correct solution, but there are a few quirks.
What if all the numbers in the file are the same? Or what if there happens to be a skew, say a bucket with 0.5TB of numbers?

Sorry, couldn't comment under your comment.

- emb March 19, 2016 | Flag Reply
0 of 0 votes

Quickselect based on random pivot selection. We go through the doubles 8GB at a time until we have consumed the whole 1TB, counting how many are less than the pivot and how many are greater (i.e. the pivot's position in the sorted array). While doing so, we use reservoir sampling to pick two random pivots for the next step: one (p1) greater than the pivot and the other (p2) less than the pivot. If the pivot falls in the first half, i.e. its position is < n/2, we run the next pass with p1 as the pivot; if its position is > n/2, we run with p2 as the pivot.

- Vijay March 19, 2016 | Flag Reply
0 of 0 votes

We can use randomized quickselect with reservoir sampling to select a random pivot.
Start with low = -infinity, high = +infinity, pivot = A[random(n)].
In each iteration:
we count how many values are less than the pivot and how many are greater.
At the same time, we select two random pivots using reservoir sampling:
one (p1) among x > pivot && x < high, and the other (p2) among x < pivot && x > low.
If the pivot's position in the sorted array (found by counting how many values are less than it) is < n/2:
    low = pivot, pivot = p1
else if the position is > n/2:
    high = pivot, pivot = p2
else:
    return pivot
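
A rough sketch of this loop (assuming distinct values and a read_all() that re-streams the file):

import random

def median_quickselect(read_all, n):
    # each pass counts the pivot's rank and, with size-1 reservoirs,
    # picks one random candidate pivot on each side for the next pass
    lo, hi = float('-inf'), float('inf')
    pivot = next(iter(read_all()))     # any element works to start
    target = n // 2                    # zero-based rank of the median
    offset = 0                         # values already known to be <= lo
    while True:
        less = seen1 = seen2 = 0
        p1 = p2 = None
        for x in read_all():
            if not (lo < x < hi):
                continue
            if x < pivot:
                less += 1
                seen2 += 1
                if random.randrange(seen2) == 0:
                    p2 = x             # reservoir sample below the pivot
            elif x > pivot:
                seen1 += 1
                if random.randrange(seen1) == 0:
                    p1 = x             # reservoir sample above the pivot
        rank = offset + less           # pivot's global zero-based rank
        if rank < target:
            lo, offset, pivot = pivot, rank + 1, p1
        elif rank > target:
            hi, pivot = pivot, p2
        else:
            return pivot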

- Vijay Thurpati March 19, 2016 | Flag Reply
0 of 0 votes

1. Read the first 8 bytes (var temp)
2. Convert that to log base 2 [var log2 = Math.Log(temp, 2)]
3. Left shift 1 [temp1 |= 1L << (int)log2]
4. Now temp1 represents (all 1s) the elements in the text file
5. Count all the 1s and take the median:

int count = 0; double median = 0;
for (int i = 0; i < 64; i++)
{
    if ((temp1 & (1L << i)) != 0)   // bit i is set
    {
        count++;
    }
}

median = Math.Pow(2, (1 << (count / 2)));

- JS March 20, 2016 | Flag Reply
0 of 0 votes

@emb:

Fair point.

If all the numbers are the same, then only one bucket will be filled, so it's easy to bail out and pick the single repeated number as the median.

If the median bucket holds more than 2^30 values, the solution can be adapted to repeat until the wanted bucket size is <= 2^30, i.e. 8GB or 1G doubles. It would mean one pass through the file for each repetition, but the algorithm would still be O(n).

Does that get me closer? :-)

- smallchallenges March 20, 2016 | Flag Reply
0 of 0 votes

small-challenges: give it up; what you say is silly and will cause multiple, unbounded iterations over the input file...

- gen-x-s March 20, 2016 | Flag Reply
0 of 0 votes

A variation of quicksort:

1. Create sorted segments out of every 1 billion values (8GB each; ~125 segments for the ~125 billion doubles in 1TB) and write these segments back to the file.
2. Create an array with the numbers 0 to 124, each an index for a sorted segment.
3. Randomly select a segment (say, index 77) and designate its minimum value as a pivot.
4. Partition the segments around this pivot: compare the remaining segments' maximum values to the pivot value. If a segment's maximum is less than the pivot, move that segment's index before 77 in the array.
5. After partitioning, you will know the position of the pivot value in the overall sorted array. If that position is less than ~62.5 billion (the median's position among 125 billion values), repeat this randomized quickselect approach on the segments whose index values appear to the right of segment 77; otherwise recurse on the ones to the left.
6. Keep doing this until you find the segment containing the median position, then apply binary search within that segment to find the median.

- divm01986 March 21, 2016 | Flag Reply
0 of 0 votes

Use an external mergesort, then just pick the middle value as the median, since the count is odd.
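
The final merge might look like this (a sketch; the run-formation phase that produces the sorted runs is not shown):

import heapq, itertools

def median_from_runs(runs, n):
    # `runs` is a list of iterators over already-sorted chunks;
    # heapq.merge does the k-way merge, the median sits at rank n//2
    merged = heapq.merge(*runs)
    return next(itertools.islice(merged, n // 2, None))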

- Dust4 March 21, 2016 | Flag Reply
0 of 0 votes

@jani, nice job! It's awesome how similar our answers look :) It seems I forgot to divide 2**40 by 8 in my calculations, but I still stayed within the memory bounds.

- emb March 21, 2016 | Flag Reply
0 of 0 votes

@emb haha yeah reading your solution was like a deja vu. Your problems are always the most interesting ones, I end up spending an hour or two whenever I see one.

- jani March 22, 2016 | Flag Reply
0 of 0 votes

I'm proven correct. As I said, "small-challenges"'s solution will "cause multiple and unbounded iterations over the input file", and as you can see, there is now a formal restriction that we can only read through the file twice. Of course, this restriction was obvious, since any solution with a worst case leading to unbounded reads is bad. Conclusion: don't accept anything "small-challenges" suggests; this guy is always wrong, and his "solutions" are a horrible mess showing inadequate knowledge.

- gen-x-s March 23, 2016 | Flag Reply
0 of 0 votes

@Lex I'm not sure I understand the question. At that stage we simply want to find out what the Nth largest number is (i.e. the median), so we just go over the counts of the doubles that have appeared (from the largest double to the smallest) and add them together until that running total reaches N. For this we only need one (64-bit) integer, to keep track of the total number of doubles seen.

- jani March 26, 2016 | Flag Reply
0 of 0 votes

First, some basic calculations:
1TB holds 2^34 integers. 8GB holds at most 2^27 integers. We can use counting sort for this.

First pass: keep an array of integers of size 2^26 in the 8GB; this takes no more than 4GB, with some room to spare. The entry at index k is the count of the input values in [k*2^8, (k+1)*2^8). From this array one can find out which bucket the median is in.

Second pass: create another array of size 2^8, this time counting the individual values from k*2^8 to (k+1)*2^8 within the median's bucket. Combining this array with the first one, you can identify the exact median.

- totolipton April 14, 2016 | Flag Reply
0 of 0 votes

First of all, great job on getting the simple logic behind this solution.
I found a small mistake in the calculation:
1TB holds 2^37 numbers [doubles], as 1GB contains 2^27 numbers per the given specs.

- Mystery_coder June 24, 2018 | Flag Reply

