Google Interview Question
SDE-2 | Country: United States
1. The leader of the N machines tells a random worker machine to announce its median to the rest of the workers.
2. All fellow worker machines reply with how many of their numbers are strictly bigger than the broadcasted number.
3-a. If the sum of the counts the workers replied with is 500M, then you have the answer!
3-b. If the sum of the counts is more than 500M, then the leader tells everyone to disregard anything less than or equal to the broadcasted number, then recurse.
3-c. If the sum of the counts, call it K, is less than 500M, then the leader tells everyone to disregard anything greater than the broadcasted number, now looks for the (500M - K)-th largest among the remaining numbers, then recurses.
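A minimal single-process sketch of this protocol (the class name, array layout, and simulation are mine, not from the thread): each inner array stands in for one worker's sorted data, and countGreater plays the workers' "strictly bigger" replies. One detail is swapped in for brevity: instead of recursing on worker medians, the "leader" here narrows the answer by binary-searching on the value range, which the same strictly-greater counts support.

```java
public class DistributedMedianSketch {
    // "Workers": each inner array is one machine's data, sorted ascending.
    // The "leader" broadcasts a candidate value; countGreater sums the
    // workers' replies of how many numbers are strictly bigger than it.
    static long countGreater(long[][] machines, long candidate) {
        long total = 0;
        for (long[] data : machines) {
            int lo = 0, hi = data.length;
            while (lo < hi) {                 // first index with data[i] > candidate
                int mid = (lo + hi) >>> 1;
                if (data[mid] > candidate) hi = mid;
                else lo = mid + 1;
            }
            total += data.length - lo;
        }
        return total;
    }

    // k-th smallest of all n numbers (0-based): the smallest value v such
    // that countGreater(v) <= n - 1 - k, found by binary search on v.
    static long kthSmallest(long[][] machines, long k, long n) {
        long lo = Long.MIN_VALUE / 2, hi = Long.MAX_VALUE / 2;
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            if (countGreater(machines, mid) > n - 1 - k) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[][] machines = {{1, 3, 5}, {2, 4}, {6}};
        // upper median of 1..6 = the 4th smallest (k = n/2, 0-based)
        System.out.println(kthSmallest(machines, 3, 6));  // prints 4
    }
}
```

For 1 billion numbers, k would be 500M (0-based upper median), and each countGreater round costs one message per machine plus a log(n/m) binary search locally.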
This sounds like a good approach. I'm trying to think about the efficiency.
m = # of machines
n = total numbers (a billion)
The workers would have to sort all their numbers. This is n * log(n / m) in total processing time, but at the worker level it is only (n / m) * log(n / m), done concurrently.
When the leader asks for the number of strictly bigger numbers, each worker can find the amount larger in log( n / m ) time since the numbers are sorted.
The number of leader requests for strictly-greater counts should be log(n) on average, since each request eliminates about half of the remaining numbers (for the same reason a random pivot in quicksort eliminates about half on average).
Summary:
Each worker's sorting -- (n / m ) * log( n / m ).
Each worker's count of bigger: log( n / m)
Number of leader requests for strictly bigger: log(n) on average
Sound right?
You can do that in O(N/M). As Skiena describes in his book, you can use the Partition method from the quicksort algorithm to search for a median in O(N), because you run Partition on only one side of the array each time. If all the machines work in parallel, they can finish in O(N/M).
// Quickselect (Skiena's Partition-based selection): returns the index of
// the k-th smallest element of x[a..b). Note the target index k must be
// passed down unchanged - it cannot be recomputed from the shrinking bounds.
int select(int[] x, int a, int b, int k) {
    if (a >= b - 1)
        return a;                      // one element left
    int p = partition(x, a, b);        // pivot lands at its final index p
    if (k == p) return p;
    if (k < p) return select(x, a, p, k);
    else return select(x, p + 1, b, k);
}

int median(int[] x, int n) {
    return select(x, 0, n, n / 2);     // upper median for even n
}

// Lomuto partition using the last element as the pivot.
int partition(int[] x, int a, int b) {
    int p = b - 1;                     // pivot index
    int firstHigh = a;                 // boundary between "< pivot" and the rest
    for (int i = a; i < p; i++)
        if (x[i] < x[p]) {
            swap(x, i, firstHigh);
            firstHigh++;
        }
    swap(x, p, firstHigh);
    return firstHigh;
}
@jakubadamek: So each machine is using the above code to return the median of its data. But when a machine obtains the median of its set, how are the M results used to get the global median? Are the recursive calls being delegated from a leader to other machines?
Good point. What about this:
* Choose the pivot randomly (e.g. the last number in the current sub-partition on the first machine)
* Tell all machines to partition around the pivot and report the three array indexes: a, p, b
* Use the answers to find whether the median lies to the left or right of the pivot
* Repeat recursively
int distributedMedian() {
    return distributedMedian(x[n-1], true);
}

// pv = pivot value; left = partition the left or the right sub-partition?
int distributedMedian(int pv, boolean left) {
    lsum = 0;
    rsum = 0;
    // run on all machines - a real implementation would issue these calls
    // asynchronously to all machines at once
    for (int j = 0; j < m; j++) {
        // each machine reports its indexes, but it also remembers them to be
        // able to run the next distributedPartition call
        (aj, pj, bj) = distributedPartition(j, left, pv);
        lsum += pj - aj;       // elements left of the pivot
        rsum += bj - pj - 1;   // elements right of the pivot
    }
    if (lsum == rsum)
        return pv;             // the pivot is the global median - stop recursing
    left = lsum > rsum;
    pv = ...; // choose the next pivot randomly in the left or right partition
    return distributedMedian(pv, left);
}
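One way to see this leader-driven selection working end to end is a single-process simulation. Everything here (the class name, in-memory lists standing in for machines, taking the first remaining element as pivot instead of a random one) is my stand-in, not the thread's code; discarding list elements mirrors the "disregard" messages a real leader would broadcast.

```java
import java.util.ArrayList;
import java.util.List;

public class DistributedQuickselect {
    // First remaining element across all "machines"; used as the pivot.
    // (The thread suggests a random pivot; first-element keeps the sketch short.)
    static long firstRemaining(List<List<Long>> machines) {
        for (List<Long> m : machines)
            if (!m.isEmpty()) return m.get(0);
        throw new IllegalStateException("k is out of range");
    }

    // Returns the k-th smallest (0-based) of all numbers held by the machines.
    // The lists are mutated: each round every machine discards the half that
    // cannot contain the target, just like the protocol above.
    static long select(List<List<Long>> machines, long k) {
        while (true) {
            final long pivot = firstRemaining(machines);  // leader broadcasts pivot
            long less = 0, equal = 0;
            for (List<Long> m : machines)                 // machines report counts
                for (long v : m) {
                    if (v < pivot) less++;
                    else if (v == pivot) equal++;
                }
            if (k < less) {
                // target is among the "< pivot" elements: discard the rest
                for (List<Long> m : machines) m.removeIf(v -> v >= pivot);
            } else if (k < less + equal) {
                return pivot;                             // pivot is the answer
            } else {
                // target is among the "> pivot" elements: discard and shift k
                for (List<Long> m : machines) m.removeIf(v -> v <= pivot);
                k -= less + equal;
            }
        }
    }

    public static void main(String[] args) {
        List<List<Long>> machines = new ArrayList<>();
        machines.add(new ArrayList<>(List.of(7L, 1L, 9L)));
        machines.add(new ArrayList<>(List.of(4L, 2L)));
        machines.add(new ArrayList<>(List.of(5L)));
        // all numbers: 1, 2, 4, 5, 7, 9 -> 4th smallest (k = 3) is 5
        System.out.println(select(machines, 3));  // prints 5
    }
}
```

Each while-iteration corresponds to one round of leader messages, and the expected number of rounds is O(log n) since roughly half the remaining data is discarded per round.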
Implementation of annon's idea in pseudocode.
sort the numbers on each machine
for each machine m
    low = using binary search, find the position of the element in m.array
          that is smaller than the previous machine's array[low] element;
          if there is no previous machine, set it to 0
    if low == m.array.length
        switch to the next machine
    high = using binary search, find the position of the next element in
           m.array that is higher than the previous machine's array[high]
           element; if there is no previous machine, set it to m.array.length
    if high == 0
        switch to the next machine
do
    pivot = low + (high - low) / 2
    k = lessThan(pivot, machines)
    if (k == N/2) return pivot
    if (k < N/2)
        low = pivot
    if (k > N/2)
        high = pivot
while (low != pivot || high != pivot)
There is a huge chance that this solution will lead to integer overflow.
For example, let's say each number in the array is greater than 6.
Then once we reach the 500 millionth number, the total sum will be more than 2^32 (i.e., the sum will be greater than 4.2 billion).
Please correct me if I'm wrong.
I assume that there is one leader machine and N worker machines.
1. Each machine, except the leader, receives a chunk of the data.
2. Each machine sorts its own list in O(n log(n)).
3. Each machine sends its minimum value along with its ID to the leader machine.
4. The leader machine stores these (value, ID) pairs in a heap.
5. The leader machine is idle until all N machines have sent their values.
6. The leader machine removes the minimum value from its heap, e.g. (v_i, id_i).
7. The leader machine sends a message to machine id_i asking for its next minimum value and waits for the response.
8. The leader performs steps 6 and 7 500 million times.
9. The median would be the next minimum value in the leader's heap.
Note that the heap data structure should be thread-safe!
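A scaled-down, single-process sketch of steps 3-9 (the class name and array layout are my stand-ins): each int[] plays one worker's already-sorted chunk, and the leader's heap holds one (value, machine, position) entry per machine, refilling from the owning "machine" after every extraction.

```java
import java.util.PriorityQueue;

public class LeaderHeap {
    // k-th smallest (0-based) across the machines' sorted chunks;
    // assumes k is less than the total number of elements.
    static int kthSmallest(int[][] machines, int k) {
        // heap entry: {value, machineId, positionWithinMachine}
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((u, v) -> Integer.compare(u[0], v[0]));
        for (int id = 0; id < machines.length; id++)        // steps 3-4:
            if (machines[id].length > 0)                    // each machine
                heap.add(new int[]{machines[id][0], id, 0}); // sends its minimum
        for (int popped = 0; ; popped++) {
            int[] min = heap.poll();                        // step 6: extract min
            if (popped == k) return min[0];
            int id = min[1], next = min[2] + 1;
            if (next < machines[id].length)                 // step 7: ask machine
                heap.add(new int[]{machines[id][next], id, next}); // for next value
        }
    }

    public static void main(String[] args) {
        int[][] machines = {{1, 3, 5}, {2, 4}, {6}};
        // all numbers: 1..6 -> upper median is the 4th smallest (k = 3)
        System.out.println(kthSmallest(machines, 3));  // prints 4
    }
}
```

The cost matches the post's analysis: each of the k extractions is one heap operation (log m) plus one request message, so finding the 500 millionth element takes 500M messages, which is the weak point the later comments discuss.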
- Split list into 1/N chunks.
- Send a 1/N chunk to each of the N machines.
- Do a local sort on each machine.
- Merge the sorted lists across the N machines until you hit the 500 millionth value. That is your median.
@kg, after sending a 1/N chunk to a machine, if that machine shuts down, how will you handle that part?
If we merge N sorted lists, isn't that going to take 1,000,000,000 / 2 runtime & space? This is assuming the merge is coordinated by one of the machines. I'm comparing this to @annon's approach, which would take log(1,000,000,000). It seems we want to avoid an operation that requires the entire data set to be iterated (merged) by a single machine. Maybe I missed something, though.
I think kg's algorithm is called the "median of medians" algorithm if I'm not mistaken. Link: en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
Will it give the exact median for the present case, where there are 1 billion numbers distributed across N machines? Because that solution divides the numbers into a 30%/70% split.
1) Each machine sorts its own elements.
- Brij August 12, 2013
Complexity: n log(n)
Time: the maximum across all the machines.
2) The leader machine builds a heap of m elements (m being the number of machines).
Each heap node contains a number and the machine that number belongs to.
3) The leader machine asks each machine for its next smallest element.
Complexity: m log(m)
4) The leader machine removes the smallest element from the heap (O(log m)) and asks for the next minimum number from the machine that number belonged to.
5) Insert the next minimum number into the heap; repeat from step 4 until the kth minimum number is found.
Total time complexity:
If h is the largest chunk of data on a machine: h log(h) for sorting.
If m is the number of machines: m log(m) for building the heap.
If k is half a billion: k log(m) to find the kth element.
Total messages passed: k (half a billion).
I am wondering if I could do the heap part in parallel.