## NVIDIA Interview Question for Software Engineer / Developers

Country: United States
Interview Type: Written Test

Comment hidden because of low score. Click to expand.
2
of 2 vote

I don't think a global lock is necessary. Have each thread return a partial sum and store it in a separate array element. Then the main thread sums up partial sums once partial sum-calculating threads are done

Comment hidden because of low score. Click to expand.
1
of 1 vote

``````#define M 1000
#define N 5

int list[M];
int sums[N] = { 0 };

void func(int n, int start, int end) {

for (int i = start; i < end; i++) {
sums[n] += list[i];
}

}

int main(int argc, char* argv[])
{

int prev = 0;
int count = 0;
for (int i = 0; i < N; i++) {
if (i == N - 1) {
} else {
threads[i] = new thread(count, func, prev, (i + 1) * M / N);
}
count++;
prev = (i + 1) * M / N;
}

int sum = 0;
for (int i = 0; i < N; i++) {
sum += sums[i];
}

cout << sum << endl;

return 0;
}``````

Note 1 : ask the interviewer if integer addition is atomic on the system, if it is, there is no need for adding up all the elements together at the end

Note 2 : if the number of threads is huge (as Vik said) you can use spin locks on a global integer and thereby enforce atomicity for integer addition

Comment hidden because of low score. Click to expand.
0
of 0 vote

In java we can divide array between N threads to accumulate numbers, also we should use CyclicBarrier for gathering results of all threads.

Comment hidden because of low score. Click to expand.
0
of 0 vote

Each thread have to calculate a part of array. Size of part X is M/N, and last part size is M-X*(N-1); each thread add numbers to a local variable and add this value to global variable via atomic operation.

Comment hidden because of low score. Click to expand.
2

Yeah but then since you have N threads now merging the answer into a single global variable will need some kind of locking mechanism on that global variable. Now consider this if the number of threads is huge the step of merging the result into a single variable becomes bottleneck and all threads are waiting to take a lock on a single variable so that they can update the total sum...

Comment hidden because of low score. Click to expand.
0
of 0 vote

Can anyone please provide the code for this problem.

Comment hidden because of low score. Click to expand.
0
of 0 vote

You can break sums, since is an operation you can do in parallel
IE you have an array of 20 elements and 2 threads when in one thread you can sum 1-10 in one thread and the other 10 elements in other thread, the tricky part here in java is that at the end of every thread you will need to accumulate results in a share variable or something you will need to synchronize this if you don't want to have some race problems.
You may like to review map reduce (this is kind of the idea behind it)

Comment hidden because of low score. Click to expand.
0
of 0 vote

hope it helps~

import java.util.concurrent.CountDownLatch;

public class Sum {

public static Integer sum=new Integer(0);
/**
* @param args
*/
public static void main(String[] args) {
int[] arr = new int;
for (int i = 0; i < arr.length; i++) {
arr[i] = 1;
}
int length = arr.length;
for (int i = 0; i < threads-1; i++) {
new Thread(new SumPart(arr, num*i, (num*(i+1))-1, count)).start();
}
try {
count.await();
} catch (Exception e) { }
System.out.println(sum);
}
}
class SumPart implements Runnable{
private CountDownLatch countDownLatch;
private int[] arr = null;
private int start,end=0;
public SumPart(int[] arr, int start, int end, CountDownLatch countDownLatch) {
this.arr = arr;
this.start = start;
this.end = end;
this.countDownLatch = countDownLatch;
}
@Override
public void run() {
while (start<=end) {
Sum.sum = Sum.sum+arr[start++];
}
countDownLatch.countDown();
}
}

Comment hidden because of low score. Click to expand.
0
of 0 vote

Put the result of each thread into a linked-list. Using another thread , add the linked-list nodes deleting the nodes as it gets summed and then put the result in another node.
At the end of first summing phase there will be only one node in the linked-list containing the intermediate sum . Continuing the same process till we have only 1 node in the list for 3-5 times with same value.

Comment hidden because of low score. Click to expand.
0
of 0 vote

DWORD SUBARRAY(DWORD *ARR,DWORD SIZE)
{
INT I=0;
WHILE(I<M)
{
IF(I=M-1)&&(I%2!=0)
{
I++;
}
ELSE
{
I=I+2;
}
}
}
VOLITLE DWORD SUM=0;
VOID TSUM(DWORD A, DWORD B)
{
}

Comment hidden because of low score. Click to expand.
0
of 0 vote

Language: CUDA/C++
Device specs:
- Tesla C2050
- 448 CUDA cores
- 49152 bytes of shared memory per block
- 3072 MB of global memory
- 1024 max threads per block

This implementation is a two-pass reduction kernel as explained in the CUDA Handbook by Jonathan Wilt, there is room for improvement.

Basically, each block performs a summation on the subarray it is allocated. Then, the sums from all the blocks are added by calling the same kernel again.

For an allocation with blocks of 16 threads, these are the reported times:

``````64 x1 array:0.055840 ms.
256x1 array:0.05616   ms
512x1 array:0.056480 ms
1024x1 array:0.056704 ms
4096x1 array:0.076640 ms
8192x1 array:0.098848 ms.``````

For an allocation with blocks of 256 threads, these are the reported times:

``````64x1 array:0.135904 ms
256x1 array:0.064192 ms
512x1 array:0.064064 ms
1024x1 array:0.064512 ms
4096x1 array:0.064928 ms
8192x1 array:0.067456 ms``````

CUDA Code:

``````#include <cuda.h>
#include <iostream>
#include <stdlib.h>
#include <cstdio>
#include <sys/time.h>
#include <omp.h>
#define BLOCK_SIZE 256
#define ARRAY_SIZE 4096

//to handle API errors
static void HandleError( cudaError_t err,
const char *file,
int line )
{
if (err != cudaSuccess) {
printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
file, line );
exit( EXIT_FAILURE );
}
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

//for timing the time to perform the summation
cudaEvent_t event_gpu_start,event_gpu_end;

float gpu (int * numbers_d, int * blocksSums_d, int * answers_d, int size, int numBlocks, int blockSize);
__global__ void reduction_1 (int * numbers, int * blockSums, int blocks);

int  main (int argc, char ** argv){
//array to reduce
//CPU data
int * numbers = new int [ARRAY_SIZE];
int * answer = new int ;

//arrays to hold GPU data
int * numbers_d;                //copy of data on device
int * blockSums_d;              //holds blocks sums using during the second pass

//allocate the array to be summed. assign ints from 0 to size of array
for (int i=0; i < ARRAY_SIZE; i++) {
numbers[i] = i;
}
//allocate gpu memory space and copy the array to be summed
HANDLE_ERROR(cudaMalloc((void **)&numbers_d,ARRAY_SIZE*sizeof(int)));
HANDLE_ERROR(cudaMemcpy(numbers_d, numbers, ARRAY_SIZE*sizeof(int),cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMalloc((void **)&blockSums_d,(ARRAY_SIZE/BLOCK_SIZE)*sizeof(int)));

//display message
printf ("[INFO]:data allocation done\n");
printf ("[INFO]:reduction started\n");

//call function to initialize and launch threads
float gpu_time = gpu (numbers_d, blockSums_d, answer_d, ARRAY_SIZE, ARRAY_SIZE/BLOCK_SIZE,BLOCK_SIZE);
printf ("[INFO]:reduction done\n");
printf ("[INFO]:time to reduce array %ux%u of integers:%3.6f ms.\n[INFO]:with %u blocks of %u threads\n"
,ARRAY_SIZE,1,gpu_time,ARRAY_SIZE/BLOCK_SIZE,BLOCK_SIZE);
printf ("[INFO]:the sum is %u\n", answer);
HANDLE_ERROR(cudaFree(numbers_d));
HANDLE_ERROR(cudaFree(blockSums_d));
}

float  gpu (int * numbers_d, int * blockSums_d, int * answer_d, int size, int numBlocks, int blockSize) {

float gpu_time = 0.00;
//start timing the running time of the kernels
HANDLE_ERROR(cudaEventCreate(&event_gpu_start));
HANDLE_ERROR(cudaEventRecord(event_gpu_start));
unsigned int sharedMemorySize = blockSize*sizeof(int);

//first pass: calculate sum of subarrays
reduction_1 <<<numBlocks,blockSize,sharedMemorySize>>>(numbers_d,blockSums_d, size);

//second pass: calculate sum of all the sums of subarrays
HANDLE_ERROR(cudaEventCreate(&event_gpu_end));
HANDLE_ERROR(cudaEventRecord(event_gpu_end));

//calculate the time elapsed
cudaEventSynchronize(event_gpu_end);
cudaEventElapsedTime(&gpu_time, event_gpu_start, event_gpu_end);
return gpu_time;
}

__global__ void reduction_1 (int * numbers, int * blockSums, int size) {
int sum = 0;
extern __shared__ int interim[]; //dynamic shared memory
for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < size; i+=blockDim.x*gridDim.x) {
sum+=numbers[i];
}
//store results in a shared array within a block
//make sure all threads reach this point

//now we have reduce the number of active threads per block
//for an array size = 64 and blocksize = 16;
//the first phase we reassign the same value in sum
//for the second pass
//we will use half of the threads to add together partial sums
//then 8/2 = 4 which adds  x and x+4
//then 4/2 = 2 which adds  x and x+2
//then 2/2 = 1 which adds  x and x+1
}
}
//the first thread of each block writes the sum of the subarray in an output array
}
}``````

Name:

Writing Code? Surround your code with {{{ and }}} to preserve whitespace.

### Books

is a comprehensive book on getting a job at a top tech company, while focuses on dev interviews and does this for PMs.

### Videos

CareerCup's interview videos give you a real-life look at technical interviews. In these unscripted videos, watch how other candidates handle tough questions and how the interviewer thinks about their performance.