Google Interview Question
Software Developers
Team: Developer Tools
Country: United States
Interview Type: Phone Interview
The for loop blocks on each sumOfFile call, where the calls could have run in parallel so that the next file is processed without waiting for the current sumOfFile call to complete.
Instead of holding the entire file in memory, read the file in segments and accumulate the sum as you go until you reach the end.
Once you run out of threads in the thread pool, the remaining files simply wait until a thread becomes free to pick up the next file.
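A minimal Java sketch of this thread-pool approach, assuming the question's sumOfFile helper streams one file and returns its sum (the stub below is a placeholder, not a real implementation):

import java.math.BigInteger;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFileSum {
    // Placeholder for the question's per-file sum; assumed to stream the
    // file in segments rather than loading it whole.
    static BigInteger sumOfFile(Path file) { return BigInteger.ZERO; }

    static BigInteger sumAll(List<Path> files, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // One task per file: at most poolSize files are in flight, and
            // the rest queue until a thread frees up to pick the next one.
            List<Future<BigInteger>> partials = new ArrayList<>();
            for (Path f : files) {
                partials.add(pool.submit(() -> sumOfFile(f)));
            }
            BigInteger total = BigInteger.ZERO;
            for (Future<BigInteger> p : partials) {
                total = total.add(p.get()); // blocks only on unfinished tasks
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}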
Multiple issues:
1) Performance - assuming 1M numbers take 10ms to add, sequential summing on a single machine means 10,000 × 10ms = 100s. Too slow.
2) Overflow - the total is 10 billion integers, so you can't bank on the max int limit of ~2.15 billion. long is safer at a max of ~9.2 quintillion; BigInteger is safest. long can also be faster, so decide the trade-off based on the data range (a small demo follows this answer).
3) Beware of OOM due to heap size limits. With ints, each data set is 4MB+ (4 bytes per int, plus object overhead if boxed); with boxed longs it's ~16MB. Plan for more with BigInteger, as it internally uses an int[].
So assuming a max heap of 8GB with 50% of it available to you, you can hold (4GB / 16MB) ~250 files per node in a multi-threaded implementation; at 10,000 files that needs 40 nodes.
Assuming each node takes 20-30ms to do its share, have workers/jobs delegate the work to the 40 machines with file paths (each machine can read its target files directly from a central mount location). Kick them off in parallel and sum up the 40 results, each taking 20-30ms; account for, say, 2-5ms of network latency, and you can get the final answer in ~35-40ms.
Optimize further as per your perf SLA.
Comments welcome!!
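To make point 2 concrete, here is a tiny self-contained Java check (the limits are standard Java ranges, not figures from the post):

import java.math.BigInteger;

public class OverflowCheck {
    public static void main(String[] args) {
        // int tops out near 2.15 billion, so a 10-billion-element sum overflows it
        System.out.println(Integer.MAX_VALUE); // 2147483647
        // long tops out near 9.2 quintillion; usually enough, though 10B
        // large ints can exceed even this in the worst case
        System.out.println(Long.MAX_VALUE); // 9223372036854775807
        // BigInteger never overflows, trading speed and memory for safety
        BigInteger sum = BigInteger.valueOf(Long.MAX_VALUE)
                .add(BigInteger.valueOf(Long.MAX_VALUE));
        System.out.println(sum); // 18446744073709551614, beyond any primitive
    }
}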
Let's assume that we have 1 node with 1 GB of main memory.
--> Suppose we have just one file with a million integers. Rather than loading all the integers into main memory at once (an approach that stops scaling as files grow), read the file in chunks: each integer is 4 bytes, so we can read around 250,000 integers (~1MB) at a time and add them to a running sum. We need to store the sum in an arbitrary-precision form (e.g., as a string or a BigInteger) to avoid integer overflow. To speed this up we can use multithreading: sum integers [0..62,500) of each chunk in one thread, [62,500..125,000) in a second thread, and so on.
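A sketch of that chunked, multithreaded read in Java; the binary layout (4-byte big-endian ints, as DataInputStream reads them) is my assumption, while the 250,000-integer chunk and the 62,500-per-thread split come from the description above:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSum {
    static final int CHUNK = 250_000; // ints per read (~1MB)
    static final int THREADS = 4;     // sub-ranges of 62,500 ints each

    static BigInteger sumFile(String path) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        BigInteger total = BigInteger.ZERO;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            int[] buf = new int[CHUNK];
            int n;
            while ((n = fill(in, buf)) > 0) {
                int per = (n + THREADS - 1) / THREADS;
                List<Future<Long>> parts = new ArrayList<>();
                for (int t = 0; t < THREADS; t++) {
                    final int lo = Math.min(n, t * per), hi = Math.min(n, lo + per);
                    // a long cannot overflow on a 62,500-int partial sum
                    parts.add(pool.submit(() -> {
                        long s = 0;
                        for (int i = lo; i < hi; i++) s += buf[i];
                        return s;
                    }));
                }
                // keep the running total arbitrary-precision
                for (Future<Long> p : parts) total = total.add(BigInteger.valueOf(p.get()));
            }
        } finally {
            pool.shutdown();
        }
        return total;
    }

    // Fills buf with up to buf.length 4-byte ints; returns the count read.
    static int fill(DataInputStream in, int[] buf) throws IOException {
        int i = 0;
        try {
            while (i < buf.length) buf[i++] = in.readInt();
        } catch (EOFException end) { /* reached end of file */ }
        return i;
    }
}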
--> Now let's assume that we have 10 nodes, each with 1GB of main memory.
We can split the files equally among the 10 nodes and run the algorithm above, so each node gets 100 files.
After the sum computation is done across the 10 nodes, the final sum can be calculated by adding up the total_sum of each node (a sketch follows).
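A sketch of that split-and-combine step, simulating each node with a thread; nodeTotal is a hypothetical placeholder for running the single-node algorithm on that node's 100 files:

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TenNodeSum {
    // Hypothetical placeholder: the single-node chunked sum over one slice.
    static BigInteger nodeTotal(List<String> nodeFiles) { return BigInteger.ZERO; }

    static BigInteger grandTotal(List<String> allFiles) throws Exception {
        int nodes = 10;
        int perNode = (allFiles.size() + nodes - 1) / nodes; // 100 files each for 1000
        ExecutorService pool = Executors.newFixedThreadPool(nodes); // threads stand in for nodes
        List<Future<BigInteger>> partials = new ArrayList<>();
        for (int k = 0; k < nodes; k++) {
            int from = Math.min(allFiles.size(), k * perNode);
            int to = Math.min(allFiles.size(), from + perNode);
            List<String> slice = allFiles.subList(from, to);
            partials.add(pool.submit(() -> nodeTotal(slice)));
        }
        // order of combination does not matter: addition is commutative
        BigInteger total = BigInteger.ZERO;
        for (Future<BigInteger> p : partials) total = total.add(p.get());
        pool.shutdown();
        return total;
    }
}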
/*
 1. Bucketize the files across a thread pool.
 2. Shoot.
*/
// number of files per thread
bucket_size = 100
// assume the list of files is already here
files // yep, files as in a list of path strings
num_files = size( files )

// sums one bucket of files, clamping the last bucket at num_files
def process_files( start_inx ){
    end_inx = start_inx + bucket_size
    end_inx = end_inx > num_files ? num_files : end_inx
    // the last expression is the return value: this bucket's partial sum
    sum( [start_inx : end_inx] ) -> { sumOfFile( files[$.o] ) }
}

num_buckets = ceil( num_files / bucket_size )
// one thread per bucket; derive each bucket's start from its own index,
// so bucket 0 is included and no shared counter is mutated across threads
threads = list( [0:num_buckets] ) -> {
    bucket_start = $.o * bucket_size
    thread( ) -> { process_files( bucket_start ) }
}
// do not busy-wait in a loop, ever: poll until no thread is alive, or time out
succeeded = poll( 42, 300 ) :: { !exists( threads ) :: { $.o.alive } }
// ZoomBA threads return a value; properly, they are an option monad
assert( succeeded , "threads did not finish within the timeout" )
// combine the per-bucket partial sums
total_sum = sum( threads ) -> { $.o.value }
println( total_sum ) // should be cool
Sum is an accumulate-only variable, and since addition is commutative and associative, the order of accumulation doesn't matter. You can send jobs to several multithreaded servers and accumulate the results: a divide-and-conquer multithreading optimization (see the sketch below).
- lalalka March 25, 2017
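A minimal Java illustration of that order-independent accumulation, using parallel streams over made-up data:

import java.math.BigInteger;
import java.util.stream.LongStream;

public class ParallelReduce {
    public static void main(String[] args) {
        // partial sums are produced and combined in no fixed order;
        // parallel() fans the work out over the common fork/join pool
        BigInteger total = LongStream.rangeClosed(1, 10_000_000)
                .parallel()
                .mapToObj(BigInteger::valueOf)
                .reduce(BigInteger.ZERO, BigInteger::add);
        System.out.println(total); // 50000005000000
    }
}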