Amazon Interview Question
Software Engineer / DevelopersCountry: United States
I do not know who has downvoted king's answer, but I would try to do it too - use the external merge sort if we are limited in the memory.
Burstsort and its variants are cache-efficient algorithms for sorting strings and are faster than quicksort for large data sets, first published in 2003.[1]
Burstsort algorithms use a trie to store prefixes of strings, with growable arrays of pointers as end nodes containing sorted, unique, suffixes (referred to as buckets). Some variants copy the string tails into the buckets. As the buckets grow beyond a predetermined threshold, the buckets are "burst", giving the sort its name. A more recent variant uses a bucket index with smaller sub-buckets to reduce memory usage. Most implementations delegate to multikey quicksort, an extension of three-way radix quicksort, to sort the contents of the buckets. By dividing the input into buckets with common prefixes, the sorting can be done in a cache-efficient manner.
There are several part of the question
1. Fastest sorting algorithm for string depend on the input
- if all the input string has equal size: radix sorting is the best choice with complexity of O(k*n) with k is the size of the string. Space complexity O(n) since it can not be done in-place
- if the input has very different size: and the longest string can be large > 20, quick sort is generally favored . The space complexity is O(1), it can be done in-place
2. Sorting when the number of string is large. When he mean "large", it should be inferred as large than memory can handle (for both in-place or out-of-place)
The answer for the case that we can sort in the memory I have mentioned above.
Now I assume we need to do external sorting.
One potential solution (that can combine with the other in-memory method that I mentioned above)
- Sort all string based on the length of the string
- you need only a linear scan to collect the length
- you need another linear scan to put string on to the group based on the length
- Based on the counting histogram, divide the pre-sorted string into group that fit onto the memory then sort
- Chose radix sorting if the length in the group uniform
- Chose quicksort if the length is non-uniform or the number of string is small log N is a small number
- At this point the only extra step is to merge between sorting string elements of the same length in consecutive groups.
3-way radix quicksort is a good start
public static void quick3Way(String[] a) {
int lo = 0; // the initial index
int hi = a.length - 1; // the last index
int d = 0;
quick3Way(a, lo, hi, d);
}
private static void quick3Way(String[] a, int lo, int hi, int d) {
if ( hi < lo ) return;
int i = lo + 1;
int lt = lo; int gt = hi;
// partitioning character
int v = charAt(a[lo], d);
while ( i <= gt ) {
int t = charAt(a[i], d);
if ( t < v ) exch( a, lt++, i++ );
else if ( t > v ) exch( a, i, gt-- );
else ++i;
}
quick3Way(a, lo, lt-1, d);
if ( v >= 0 ) quick3Way(a, lt, gt, d+1);
quick3Way(a, gt+1, hi, d);
}
private static void exch(String[] a, int i, int j) {
String t = a[i];
a[i] = a[j];
a[j] = t;
}
Burstsort and its variants are cache-efficient algorithms for sorting strings and are faster than quicksort for large data sets, first published in 2003.[1]
Burstsort algorithms use a trie to store prefixes of strings, with growable arrays of pointers as end nodes containing sorted, unique, suffixes (referred to as buckets). Some variants copy the string tails into the buckets. As the buckets grow beyond a predetermined threshold, the buckets are "burst", giving the sort its name. A more recent variant uses a bucket index with smaller sub-buckets to reduce memory usage. Most implementations delegate to multikey quicksort, an extension of three-way radix quicksort, to sort the contents of the buckets. By dividing the input into buckets with common prefixes, the sorting can be done in a cache-efficient manner.
What is the meaning of file size is too large ? Too large to fit into memory ? If yes, then I will go with external merge sort.
- king July 09, 2013