Microsoft Interview Question
Software EngineersCountry: United States
You just store the difference between the long integers, that way you can save up memory.
You can use bucket sort to sort the array/buffer where you stored the difference then since you know the maximum and the minimum number will be a long integer so the constraints are met and your time complexity would be O(N).
First we could check with the interviewer whether the numbers are in any range. For example if the numbers represent human range. if yes ,
then we need to store numbers between 1 - 130 and counts of each one of them.
Its just about asking right questions.
If the numbers are really distributed and no correlation
we can bring in numbers in memory and use merge sort
That we we bring data 1000 + 500 + 250 + 125 + 63 + .... times
Assumptions: Size of integer is 4 bytes
Range of integers is -2Billion (Integer.minValue) to 2Billion (Integer.maxValue)
Space required to hold 4Billion numbers = 4B * 4bytes = 16B bytes = 16GB
But, it is given that we have 1TB of data which are numbers. Obviously, there are a lot of duplicates. We will need to find out, which of them is repeated how many times.
We can create a hash table of all numbers in the integer range (from -2Billion to 2Billion) as key and the count of their occurrence as value. Count is unsigned int.
Size of 1 entry of hash table (4 Bytes for number and 4 Bytes for count) = 8 Bytes
Total number of entries in the hash table = 4Billion
Total size of hash table = 4Billion * 8 Bytes = 32 Billion bytes = 32 GB
Available RAM is1GB= We will need 32 different hash tables to hold 32GB of data. Alternatively, we can use 64 hash tables, each with 500MB size.
Here is the algorithm:
1. Read 500MB of data from the 1TB file
2. Read numbers sequentially from step 1
3. Pick the hash table (1 to 64) depending on where the number read in step 2 fits. Add the entry and increment the count against the number. Save hashtable to disk if you have to pick another hashtable.
4. Repeat steps 2-3 until you exhaust the number
5. Goto step 1 to read the next 500MB worth of data from disk
Complexity of this algorithm is O(n)
Limitation: This will not work of any number(s) is repeated more than 4B times (range of int)
This requires a lot of copying from memory to disk picking out the right hash table for each number! So even though its O(n), it may be highly inefficient.
This will require a lot of read/write from memory to disk, in order to pick out the appropriate hash table for each number. So even though it is O(n), it may be highly inefficient.
Divide the array in blocks of 0.5Gb each. Then proceed in a Bubble Sort fashion by sorting adjacent blocks. This is equivalent to the swapping procedure used between scalars in Bubble Sort.
If B is the number of blocks
For i=1 to B
For j=1 to B-1
sort(Block_j merged to Block_{j+1})
The sorting can be performed in memory. The complexity is quadratic in B.
Divide the array in blocks of 0.5Gb each. Then proceed in a Bubble Sort fashion by sorting adjacent blocks. This is equivalent to the swapping procedure used between scalars in Bubble Sort.
If B is the number of blocks
For i=1 to B
For j=1 to B-1
sort(Block_j merged to Block_{j+1})
The sorting can be performed in memory. The complexity is quadratic in B.
(sorry for the previous posts)
Divide the array in blocks of 0.5Gb each. Then proceed in a Bubble Sort fashion by sorting adjacent blocks. This is equivalent to the swapping procedure used between scalars in Bubble Sort.
If B is the number of blocks
For i=1 to B
For j=1 to B-i
sort(Block_j merged to Block_{j+1})
The sorting can be performed in memory. The complexity is quadratic in B.
Consider 1 TB array has size N, and 1 GB array has size n
1. Partition N into block of size n
2. Inplace/using memory Sort all blocks
3. use memory as min heap and take first element of every sorted block
4.heapify and take out first element from min heap push to first position of N array
5. take second element from same array n add to min heap. and repeat (4,5) until we get sorted array
Complexity : (n+1)N * log(n)
First split the 1 TB data into 1024 blocks and then sort these files/data individually.
- rangacoder February 02, 2016Then open all files (1024 of them. just have a pointer to the beginning of the file, not load the entire file). read the first entry from all the files and then put them into a min heap. So this way, we get 1st 1000 integers which are sorted globally. Next again load the heap with next 1000 entries from the file, continue this until all the entries in the file are finished.