Microsoft Interview Question
Software Engineers
Country: United States
Interview Type: In-Person
External counted bitmap sort, O(n) complexity.
Since these are integers (assumed 32-bit), there are two properties we can benefit from:
1. The range of values is finite, hence we can use a counter array (a "counted bitmap") to partition the range into sub-ranges.
2. The file is larger than the possible range of values (50GB holds ~12.5 billion 32-bit integers, more than the 2^32 ≈ 4.3 billion possible values), so by pigeonhole there must be duplicates, which will need to be counted.
We split the integer values into ranges that, together with their counters, fit in the available memory. In the worst case a single value repeats through the entire file, so each counter needs to be 64-bit. One counter record is 8 bytes, and 200MB of memory can hold 2^24 such records (2^24 × 8 bytes = 128MB), so we split the 32-bit range of values into 2^32 / 2^24 = 2^8 = 256 ranges.
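As a sanity check on that arithmetic (a quick sketch in Python; the 200MB budget is taken from the question):

```python
MB = 1024 * 1024

memory_budget = 200 * MB   # available RAM
counter_bytes = 8          # one 64-bit counter per distinct value

# Counters for one sub-range of 2**24 values occupy 128MB, which fits in 200MB.
counters_per_range = 2 ** 24
range_table_bytes = counters_per_range * counter_bytes
assert range_table_bytes == 128 * MB
assert range_table_bytes <= memory_budget

# Splitting the full 32-bit space into 2**24-value sub-ranges gives 256 ranges,
# i.e. one temp file per value of the top 8 bits.
num_ranges = 2 ** 32 // counters_per_range
print(num_ranges)  # → 256
```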
Scan the original file and write each value to its range's temp file, in original order.
For each temp file, in range order:
  Scan the file and increment the counter for each integer read.
  Scan the counter array and append each value to the final file, repeated its counted number of times.
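At toy scale, the two passes can be sketched like this (a Python illustration with 8-bit values split by their top 2 bits; the full-scale C# version below uses temp files on disk and 64-bit counters):

```python
from collections import defaultdict

def counted_range_sort(values, range_bits=2, value_bits=8):
    """Toy counted-range sort: bucket 8-bit values by their top 2 bits
    (4 'temp files'), then count within each bucket and emit each value
    in order, repeated by its count."""
    low_bits = value_bits - range_bits
    low_mask = (1 << low_bits) - 1
    # Pass 1: partition into buckets by the high bits (the temp files).
    buckets = defaultdict(list)
    for v in values:
        buckets[v >> low_bits].append(v)
    # Pass 2: per bucket, count occurrences of each low-bits value, then emit.
    out = []
    for r in range(1 << range_bits):
        counters = [0] * (1 << low_bits)
        for v in buckets[r]:
            counters[v & low_mask] += 1
        for low, count in enumerate(counters):
            out.extend([(r << low_bits) | low] * count)
    return out

print(counted_range_sort([200, 3, 3, 150, 90, 3]))
# → [3, 3, 3, 90, 150, 200]
```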
using System;
using System.IO;

public static void SortIntegerLargeFile(string filePath)
{
    const int RangeCount = 256;
    const int RangeSize = 1 << 24; // 2^24 values per range

    // Pass 1: partition the input into 256 temp files by the top 8 bits.
    var outTempStreams = new FileStream[RangeCount];
    var outTempBinaries = new BinaryWriter[RangeCount];
    for (int i = 0; i < RangeCount; i++)
    {
        string tempFilePath = @"C:\temp\sortRange" + i;
        outTempStreams[i] = new FileStream(tempFilePath, FileMode.Create, FileAccess.Write);
        outTempBinaries[i] = new BinaryWriter(outTempStreams[i]);
    }

    using (var inStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    using (var inBinary = new BinaryReader(inStream))
    {
        try
        {
            while (true)
            {
                Int32 intRead = inBinary.ReadInt32();
                // Unsigned shift, so negative values index a valid bucket.
                // Note this orders values by their unsigned bit pattern; to sort
                // signed integers, flip the sign bit when bucketing and indexing.
                int range = (int)((uint)intRead >> 24);
                outTempBinaries[range].Write(intRead);
            }
        }
        catch (EndOfStreamException)
        {
        }
    }

    for (int i = 0; i < RangeCount; i++)
    {
        outTempBinaries[i].Close();
    }

    // Pass 2: count each range in memory, then emit values in sorted order.
    using (var outStream = new FileStream(@"C:\sortedFile", FileMode.Create, FileAccess.Write))
    using (var outBinary = new BinaryWriter(outStream))
    {
        for (int iRange = 0; iRange < RangeCount; iRange++)
        {
            // 2^24 64-bit counters = 128MB, allocated once per range
            // (not inside the read loop, which would reset the counts).
            var rangeCounters = new Int64[RangeSize];
            string tempFilePath = @"C:\temp\sortRange" + iRange;
            using (var inTempStream = new FileStream(tempFilePath, FileMode.Open, FileAccess.Read))
            using (var inTempBinary = new BinaryReader(inTempStream))
            {
                try
                {
                    while (true)
                    {
                        Int32 intRead = inTempBinary.ReadInt32();
                        int index = intRead & 0x00FFFFFF;
                        rangeCounters[index] += 1;
                    }
                }
                catch (EndOfStreamException)
                {
                }
            }
            File.Delete(tempFilePath);

            for (int iValue = 0; iValue < RangeSize; iValue++)
            {
                Int32 sortedValue = (iRange << 24) | iValue;
                // 64-bit loop counter: a single value can repeat more than 2^31 times.
                for (Int64 count = 0; count < rangeCounters[iValue]; count++)
                {
                    outBinary.Write(sortedValue);
                }
            }
        }
    }
}
External sorting is the standard way to do this.
We start by fetching pages of the 50GB array into memory.
With 200MB of RAM, let's take the machine's page size to be, say, 10MB.
# We will fetch the first 10 pages of the array (100MB of RAM is used)
# Perform quicksort on each page
# Now we have 10 pages of sorted integers
# Merge the 10 sorted pages into a new 100MB page (all 200MB used)
# Write the first 10 pages of data back to the array
# Repeat for the next 10 pages (50GB / 100MB = 512 times)
** Now we have an array of 50GB where every 100MB is individually sorted, i.e. 512 blocks of sorted integers
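The run-formation phase above can be sketched as follows (a Python illustration at toy scale; real runs would be written back to disk, and sorting a whole run in memory is equivalent to quicksorting its 10 pages and merging them):

```python
def form_sorted_runs(values, page_ints, pages_per_run):
    """Split `values` into runs of pages_per_run * page_ints integers,
    sort each run in memory, and return the list of sorted runs."""
    run_len = page_ints * pages_per_run
    runs = []
    for start in range(0, len(values), run_len):
        run = values[start:start + run_len]
        run.sort()  # in-memory sort of one run (quicksort + merge in the text)
        runs.append(run)
    return runs

# Toy scale: "pages" of 4 integers, 2 pages per run.
data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
print(form_sorted_runs(data, page_ints=4, pages_per_run=2))
# → [[1, 2, 3, 4, 6, 7, 8, 9], [0, 5]]
```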
# Let's consider the 512 blocks as 4 parts of 128 blocks each
# Fetch the first 1MB of each of the 128 blocks of the first part (128MB in memory)
# Merge-sort these streams and keep writing the output to disk (at a new place)
# Use double buffering (when a block's 1MB is exhausted, discard it and fetch that block's next 1MB)
# Do the same for the other three parts
# Do another double-buffered merge sort over these new 4 parts and overwrite the old array
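The multi-way merge in the steps above can be sketched with a min-heap over the current head of each block (Python's `heapq`; a real implementation would stream 1MB buffers per block instead of holding whole lists in memory):

```python
import heapq

def kway_merge(sorted_runs):
    """Merge k sorted runs into one sorted stream with a min-heap.
    Each heap entry is (value, run_index, position); on disk, exhausting
    a run's buffer would trigger fetching that run's next 1MB chunk."""
    heap = [(run[0], i, 0) for i, run in enumerate(sorted_runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(sorted_runs[i]):
            heapq.heappush(heap, (sorted_runs[i][pos + 1], i, pos + 1))
    return out

print(kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```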
This is one way of doing it; any comments on improving this approach are welcome.
External merge sort. O(log n) passes through the file.
- zr.roman January 02, 2016
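For the concrete numbers in this thread, the pass count works out small (a quick check; for a k-way merge over r initial runs, the number of merge passes is ceil(log_k r)):

```python
import math

runs = 512    # 50GB / 100MB sorted runs after run formation
fan_in = 128  # 1MB input buffer per run within a ~200MB budget

merge_passes = math.ceil(math.log(runs, fan_in))
print(merge_passes)  # → 2
```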