Google Interview Question
SDE-2 | Country: United States
1. The leader of the N machines tells a random worker machine to announce its median to the rest of the workers.
2. All fellow worker machines reply with how many of their numbers are strictly bigger than the broadcasted number.
3-a. If the sum of the counts the workers replied with is 500M, then you have the answer!
3-b. If the sum of the counts is more than 500M, then the leader tells everyone to disregard anything less than or equal to the broadcasted number, then recurse.
3-c. If the sum of the counts, call it K, is less than 500M, then the leader tells everyone to disregard anything greater than the broadcasted number, now looks for the (500M - K)-th largest among the remaining numbers, then recurses.
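A minimal single-process sketch of this protocol (the class name, array layout, and simulation are mine, not from the thread): each inner array stands in for one worker's sorted data, and countGreater plays the workers' "strictly bigger" replies. One detail is swapped in for brevity: instead of recursing on worker medians, the "leader" here narrows the answer by binary-searching on the value range, which the same strictly-greater counts support.

```java
public class DistributedMedianSketch {
    // "Workers": each inner array is one machine's data, sorted ascending.
    // The "leader" broadcasts a candidate value; countGreater sums the
    // workers' replies of how many numbers are strictly bigger than it.
    static long countGreater(long[][] machines, long candidate) {
        long total = 0;
        for (long[] data : machines) {
            int lo = 0, hi = data.length;
            while (lo < hi) {                 // first index with data[i] > candidate
                int mid = (lo + hi) >>> 1;
                if (data[mid] > candidate) hi = mid;
                else lo = mid + 1;
            }
            total += data.length - lo;
        }
        return total;
    }

    // k-th smallest of all n numbers (0-based): the smallest value v such
    // that countGreater(v) <= n - 1 - k, found by binary search on v.
    static long kthSmallest(long[][] machines, long k, long n) {
        long lo = Long.MIN_VALUE / 2, hi = Long.MAX_VALUE / 2;
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            if (countGreater(machines, mid) > n - 1 - k) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[][] machines = {{1, 3, 5}, {2, 4}, {6}};
        // upper median of 1..6 = the 4th smallest (k = n/2, 0-based)
        System.out.println(kthSmallest(machines, 3, 6));  // prints 4
    }
}
```

For 1 billion numbers, k would be 500M (0-based upper median), and each countGreater round costs one message per machine plus a log(n/m) binary search locally.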
This sounds like a good approach. I'm trying to think about the efficiency.
m = # of machines
n = total numbers (a billion)
The workers would have to sort all their numbers. This is n * log(n / m) in total processing time, but at the worker level it is only (n / m) * log(n / m), done concurrently.
When the leader asks for the number of strictly bigger numbers, each worker can find the amount larger in log( n / m ) time since the numbers are sorted.
The number of leader requests for strictly-greater counts should be log(n) on average, since each request eliminates about half of the remaining numbers (for the same reason a random pivot in quicksort eliminates about half on average).
Summary:
Each worker's sorting -- (n / m ) * log( n / m ).
Each worker's count of bigger: log( n / m)
Number of leader requests for strictly bigger: log(n) on average
Sound right?
You can do that in O(N/M). As Skiena describes in his book, you can use the Partition method from the quicksort algorithm to search for a median in O(N), because you run Partition on only one side of the array each time. If all the machines work in parallel, they can finish in O(N/M).
// Quickselect (Skiena's Partition-based selection): returns the index of
// the k-th smallest element of x[a..b). Note the target index k must be
// passed down unchanged - it cannot be recomputed from the shrinking bounds.
int select(int[] x, int a, int b, int k) {
    if (a >= b - 1)
        return a;                      // one element left
    int p = partition(x, a, b);        // pivot lands at its final index p
    if (k == p) return p;
    if (k < p) return select(x, a, p, k);
    else return select(x, p + 1, b, k);
}

int median(int[] x, int n) {
    return select(x, 0, n, n / 2);     // upper median for even n
}

// Lomuto partition using the last element as the pivot.
int partition(int[] x, int a, int b) {
    int p = b - 1;                     // pivot index
    int firstHigh = a;                 // boundary between "< pivot" and the rest
    for (int i = a; i < p; i++)
        if (x[i] < x[p]) {
            swap(x, i, firstHigh);
            firstHigh++;
        }
    swap(x, p, firstHigh);
    return firstHigh;
}
@jakubadamek: So each machine is using the above code to return the median of its data. But when a machine obtains the median of its set, how are the M results used to get the global median? Are the recursive calls being delegated from a leader to other machines?
Good point. What about this:
* Choose the pivot randomly (e.g. the last number in the current sub-partition on the first machine)
* Tell all machines to partition around the pivot and report the three array indexes: a, p, b
* Use the answers to find whether the median lies to the left or right of the pivot
* Repeat recursively
int distributedMedian() {
    return distributedMedian(x[n-1], true);
}

// pv = pivot value; left = partition the left or the right sub-partition?
int distributedMedian(int pv, boolean left) {
    lsum = 0;
    rsum = 0;
    // run on all machines - a real implementation would issue these calls
    // asynchronously to all machines at once
    for (int j = 0; j < m; j++) {
        // each machine reports its indexes, but it also remembers them to be
        // able to run the next distributedPartition call
        (aj, pj, bj) = distributedPartition(j, left, pv);
        lsum += pj - aj;       // elements left of the pivot
        rsum += bj - pj - 1;   // elements right of the pivot
    }
    if (lsum == rsum)
        return pv;             // the pivot is the global median - stop recursing
    left = lsum > rsum;
    pv = ...; // choose the next pivot randomly in the left or right partition
    return distributedMedian(pv, left);
}
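One way to see this leader-driven selection working end to end is a single-process simulation. Everything here (the class name, in-memory lists standing in for machines, taking the first remaining element as pivot instead of a random one) is my stand-in, not the thread's code; discarding list elements mirrors the "disregard" messages a real leader would broadcast.

```java
import java.util.ArrayList;
import java.util.List;

public class DistributedQuickselect {
    // First remaining element across all "machines"; used as the pivot.
    // (The thread suggests a random pivot; first-element keeps the sketch short.)
    static long firstRemaining(List<List<Long>> machines) {
        for (List<Long> m : machines)
            if (!m.isEmpty()) return m.get(0);
        throw new IllegalStateException("k is out of range");
    }

    // Returns the k-th smallest (0-based) of all numbers held by the machines.
    // The lists are mutated: each round every machine discards the half that
    // cannot contain the target, just like the protocol above.
    static long select(List<List<Long>> machines, long k) {
        while (true) {
            final long pivot = firstRemaining(machines);  // leader broadcasts pivot
            long less = 0, equal = 0;
            for (List<Long> m : machines)                 // machines report counts
                for (long v : m) {
                    if (v < pivot) less++;
                    else if (v == pivot) equal++;
                }
            if (k < less) {
                // target is among the "< pivot" elements: discard the rest
                for (List<Long> m : machines) m.removeIf(v -> v >= pivot);
            } else if (k < less + equal) {
                return pivot;                             // pivot is the answer
            } else {
                // target is among the "> pivot" elements: discard and shift k
                for (List<Long> m : machines) m.removeIf(v -> v <= pivot);
                k -= less + equal;
            }
        }
    }

    public static void main(String[] args) {
        List<List<Long>> machines = new ArrayList<>();
        machines.add(new ArrayList<>(List.of(7L, 1L, 9L)));
        machines.add(new ArrayList<>(List.of(4L, 2L)));
        machines.add(new ArrayList<>(List.of(5L)));
        // all numbers: 1, 2, 4, 5, 7, 9 -> 4th smallest (k = 3) is 5
        System.out.println(select(machines, 3));  // prints 5
    }
}
```

Each while-iteration corresponds to one round of leader messages, and the expected number of rounds is O(log n) since roughly half the remaining data is discarded per round.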
Implementation of annon's idea in pseudocode.
sort the numbers on each machine
for each machine m
    low = using binary search, find the position of the element in m.array
          that is smaller than the previous machine's array[low] element;
          if there is no previous machine, set it to 0
    if low == m.array.length
        switch to the next machine
    high = using binary search, find the position of the next element in
           m.array that is higher than the previous machine's array[high]
           element; if there is no previous machine, set it to m.array.length
    if high == 0
        switch to the next machine
do
    pivot = low + (high - low) / 2
    k = lessThan(pivot, machines)
    if (k == N/2) return pivot
    if (k < N/2)
        low = pivot
    if (k > N/2)
        high = pivot
while (low != pivot || high != pivot)
There is a huge chance that this solution will lead to integer overflow.
For example, let's say each number in the array is greater than 6.
Then once we reach the 500 millionth number, the total sum will be more than 2^32 (i.e., the sum will be greater than 4.2 billion).
Please correct me if I'm wrong.
I assume that there is one leader machine and N worker machines.
1. Each machine, except the leader, receives a chunk of the data.
2. Each machine sorts its own list in O(n log(n)).
3. Each machine sends its minimum value along with its ID to the leader machine.
4. The leader machine stores these (value, ID) pairs in a heap.
5. The leader machine is idle until all N machines have sent their values.
6. The leader machine removes the minimum value from its heap, e.g. (v_i, id_i).
7. The leader machine sends a message to machine id_i asking for its next minimum value and waits for the response.
8. The leader performs steps 6 and 7 500 million times.
9. The median would be the next minimum value in the leader's heap.
Note that the heap data structure should be thread-safe!
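A scaled-down, single-process sketch of steps 3-9 (the class name and array layout are my stand-ins): each int[] plays one worker's already-sorted chunk, and the leader's heap holds one (value, machine, position) entry per machine, refilling from the owning "machine" after every extraction.

```java
import java.util.PriorityQueue;

public class LeaderHeap {
    // k-th smallest (0-based) across the machines' sorted chunks;
    // assumes k is less than the total number of elements.
    static int kthSmallest(int[][] machines, int k) {
        // heap entry: {value, machineId, positionWithinMachine}
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((u, v) -> Integer.compare(u[0], v[0]));
        for (int id = 0; id < machines.length; id++)        // steps 3-4:
            if (machines[id].length > 0)                    // each machine
                heap.add(new int[]{machines[id][0], id, 0}); // sends its minimum
        for (int popped = 0; ; popped++) {
            int[] min = heap.poll();                        // step 6: extract min
            if (popped == k) return min[0];
            int id = min[1], next = min[2] + 1;
            if (next < machines[id].length)                 // step 7: ask machine
                heap.add(new int[]{machines[id][next], id, next}); // for next value
        }
    }

    public static void main(String[] args) {
        int[][] machines = {{1, 3, 5}, {2, 4}, {6}};
        // all numbers: 1..6 -> upper median is the 4th smallest (k = 3)
        System.out.println(kthSmallest(machines, 3));  // prints 4
    }
}
```

The cost matches the post's analysis: each of the k extractions is one heap operation (log m) plus one request message, so finding the 500 millionth element takes 500M messages, which is the weak point the later comments discuss.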
- Split list into 1/N chunks.
- Send a 1/N chunk to each of the N machines.
- Do a local sort on each machine.
- Merge the sorted lists across the N machines until you hit the 500 millionth value. That is your median.
@kg, after sending a 1/N chunk to a machine, if that machine shuts down, how will you handle that part?
If we merge N sorted lists, isn't that going to take 1,000,000,000 / 2 runtime & space? This is assuming the merge is coordinated by one of the machines. I'm comparing this to @annon's approach, which would take log(1,000,000,000). It seems we want to avoid an operation that requires the entire data set to be iterated (merged) by a single machine. Maybe I missed something, though.
I think kg's algorithm is called the "median of medians" algorithm if I'm not mistaken. Link: en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
Will it give the exact median for the present case, where there are 1 billion numbers distributed across N machines? Because that solution divides the numbers into a 30%/70% split.
1) Each machine sorts its own elements.
- Brij August 12, 2013
Complexity: n log(n)
Time: the maximum across all the machines.
2) The leader machine builds a heap of m elements (m being the number of machines).
Each heap node contains a number and the machine that number belongs to.
3) The leader machine asks each machine for its next smallest element.
Complexity: m log(m)
4) The leader machine removes the smallest element from the heap (O(log m)) and asks for the next minimum number from the machine that number belonged to.
5) Insert the next minimum number into the heap; repeat from step 4 until the kth minimum number is found.
Total time complexity:
If h is the largest chunk of data on a machine: h log(h) for sorting.
If m is the number of machines: m log(m) for building the heap.
If k is half a billion: k log(m) to find the kth element.
Total messages passed: k (half a billion).
I am wondering if I could do the heap part in parallel.