diff --git a/doc/AutoDoxywarnings.log b/doc/AutoDoxywarnings.log
new file mode 100644
index 0000000..e69de29
diff --git a/doc/autodoc.xml b/doc/autodoc.xml
new file mode 100644
index 0000000..0a730c2
--- /dev/null
+++ b/doc/autodoc.xml
@@ -0,0 +1,693 @@
+
+Boost.Sort C++ Reference
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Cast_type
+ Floating-point IEEE 754/IEC559 type.
+ Integer type (same size) to which to cast.
+ const Data_type & Casts a float to the specified integer type.
+Example:
+struct rightshift {
+ int operator()(const DATA_TYPE &x, const unsigned offset) const {
+ return float_mem_cast<KEY_TYPE, CAST_TYPE>(x.key) >> offset;
+ }
+};
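The rightshift example above relies on reinterpreting a float's bytes as a same-sized integer. As a rough, standard-library-only sketch of that idea (not the library's own implementation), the cast can be done with std::memcpy. The names float_bits and float_rightshift are hypothetical, and the sketch assumes a 32-bit IEEE 754 float and non-negative keys, because right-shifting negative signed values is implementation-defined before C++20.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the library's float_mem_cast): copies the bytes of
// a 32-bit float into a same-sized signed integer without breaking aliasing rules.
inline std::int32_t float_bits(const float& value) {
    static_assert(sizeof(float) == sizeof(std::int32_t),
                  "this sketch assumes a 32-bit float");
    std::int32_t result;
    std::memcpy(&result, &value, sizeof(result));
    return result;
}

// Functor in the same shape as the rightshift example: shifts the integer
// image of the key right by `offset` bits.
struct float_rightshift {
    std::int32_t operator()(const float& x, const unsigned offset) const {
        return float_bits(x) >> offset;
    }
};
```

For non-negative IEEE 754 floats the integer images compare in the same order as the floats themselves, which is what makes a radix pass over the shifted bits meaningful.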
+
+
+void
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. float_sort with casting to the appropriate size.
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_float_sort
+ osx_float_sort
+A simple example of sorting some floating-point values is:
+vector<float> vec;
+vec.push_back(1.0);
+vec.push_back(2.3);
+vec.push_back(1.3);
+spreadsort(vec.begin(), vec.end());
+
+The sorted vector contains ascending values "1.0 1.3 2.3".
+
+void
+
+ Range & Range [first, last) for sorting. Floating-point sort algorithm using range.
+
+void
+
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Right_shift Functor that returns the result of shifting the value_type right a specified number of bits. Floating-point sort algorithm using random access iterators with just a right-shift functor.
+
+void
+
+
+ Range & Range [first, last) for sorting. Right_shift Functor that returns the result of shifting the value_type right a specified number of bits. Floating-point sort algorithm using range with just a right-shift functor.
+
+void
+
+
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Right_shift Functor that returns the result of shifting the value_type right a specified number of bits. Compare A binary functor that returns whether the first element passed to it should go before the second in order. Float sort algorithm using random access iterators with both a right-shift functor and a user-defined comparison operator.
+
+void
+
+
+
+ Range & Range [first, last) for sorting. Right_shift Functor that returns the result of shifting the value_type right a specified number of bits. Compare A binary functor that returns whether the first element passed to it should go before the second in order. Float sort algorithm using range with both a right-shift functor and a user-defined comparison operator.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+void
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Integer sort algorithm using random access iterators. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). integer_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so integer_sort is asymptotically faster than pure comparison-based algorithms. s is max_splits, which defaults to 11, so its worst case with default settings for 32-bit integers is O(N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations)).
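To make the bound O(N * (lg(range)/s + s)) quoted above concrete, the per-element factor can be evaluated for given settings and compared with the log2(N) factor of a comparison sort. This is only a back-of-the-envelope sketch: the function names are hypothetical, constant factors are ignored, and the logarithm is assumed to be base 2.

```cpp
#include <cassert>
#include <cmath>

// Per-element work factor lg(range)/s + s taken from the worst-case bound.
// `range_bits` is lg(range) and `max_splits` is s.
inline double radix_work_factor(double range_bits, double max_splits) {
    return range_bits / max_splits + max_splits;
}

// Per-element factor of an O(N*log(N)) comparison sort, using log base 2.
inline double comparison_work_factor(double n) {
    return std::log2(n);
}
```

Under these assumptions, 32-bit keys with s = 11 give a factor of about 13.9, which falls below log2(N) once N is around 2^14 or larger.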
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_integer_sort
+ osx_integer_sort
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports operator>>, which returns an integer type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+ Range & Range [first, last) for sorting. Integer sort algorithm using range. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). integer_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so integer_sort is asymptotically faster than pure comparison-based algorithms. s is max_splits, which defaults to 11, so its worst case with default settings for 32-bit integers is O(N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations)).
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_integer_sort
+ osx_integer_sort
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Right_shift Functor that returns the result of shifting the value_type right a specified number of bits. Compare A binary functor that returns whether the first element passed to it should go before the second in order. Integer sort algorithm using random access iterators with both a right-shift functor and a user-defined comparison operator. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). integer_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so integer_sort is asymptotically faster than pure comparison-based algorithms. s is max_splits, which defaults to 11, so its worst case with default settings for 32-bit integers is O(N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations)).
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_integer_sort
+ osx_integer_sort
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. The elements in the range [first, last) are sorted in ascending order. Returns void. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
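As an illustration of the Right_shift and Compare parameters described for this overload, the following sketch defines both functors for a small user-defined element type. Record, RecordRightShift and RecordLess are hypothetical names introduced here, and the keys are assumed to be non-negative; the sketch only shows the shape of the functors and does not call the library.

```cpp
#include <cassert>
#include <string>

// Hypothetical element type, ordered by its non-negative integer key.
struct Record {
    int key;
    std::string name;
};

// Right_shift-style functor: returns the key shifted right by `offset` bits.
struct RecordRightShift {
    int operator()(const Record& r, const unsigned offset) const {
        return r.key >> offset;
    }
};

// Compare-style functor: true if `a` should be ordered before `b`.
struct RecordLess {
    bool operator()(const Record& a, const Record& b) const {
        return a.key < b.key;
    }
};
```

For a hybrid radix/comparison sort, the two functors would presumably need to describe the same ordering, so both are written against the same key here.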
+void
+
+
+
+ Range & Range [first, last) for sorting. Right_shift Functor that returns the result of shifting the value_type right a specified number of bits. Compare A binary functor that returns whether the first element passed to it should go before the second in order. Integer sort algorithm using range with both a right-shift functor and a user-defined comparison operator. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). integer_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so integer_sort is asymptotically faster than pure comparison-based algorithms. s is max_splits, which defaults to 11, so its worst case with default settings for 32-bit integers is O(N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations)).
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_integer_sort
+ osx_integer_sort
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. Returns void. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Right_shift A functor that returns the result of shifting the value_type right a specified number of bits. Integer sort algorithm using random access iterators with just a right-shift functor. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). integer_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+ Performance: Worst-case performance is O(N * (lg(range)/s + s)), so integer_sort is asymptotically faster than pure comparison-based algorithms. s is max_splits, which defaults to 11, so its worst case with default settings for 32-bit integers is O(N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations)).
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_integer_sort
+ osx_integer_sort
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. The elements in the range [first, last) are sorted in ascending order. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+ Range & Range [first, last) for sorting. Right_shift A functor that returns the result of shifting the value_type right a specified number of bits. Integer sort algorithm using range with just a right-shift functor. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). integer_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+ Performance: Worst-case performance is O(N * (lg(range)/s + s)), so integer_sort is asymptotically faster than pure comparison-based algorithms. s is max_splits, which defaults to 11, so its worst case with default settings for 32-bit integers is O(N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations)).
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_integer_sort
+ osx_integer_sort
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+boost::enable_if_c< std::numeric_limits< typename std::iterator_traits< RandomAccessIter >::value_type >::is_integer, void >::type
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Generic spreadsort variant that detects integer-type elements and calls integer_sort. If the data type provided is an integer, integer_sort is used. Sorting other data types requires picking between integer_sort, float_sort and string_sort directly, as spreadsort won't accept types that don't have the appropriate type_traits.
+
+
+
+
+
+
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports operator>>, which returns an integer type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order.
+boost::enable_if_c< !std::numeric_limits< typename std::iterator_traits< RandomAccessIter >::value_type >::is_integer &&std::numeric_limits< typename std::iterator_traits< RandomAccessIter >::value_type >::is_iec559, void >::type
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Generic spreadsort variant that detects a float element type and calls float_sort. If the data type provided is a float or castable-float, float_sort is used. Sorting other data types requires picking between integer_sort, float_sort and string_sort directly, as spreadsort won't accept types that don't have the appropriate type_traits.
+
+
+
+
+
+
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports operator>>, which returns an integer type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order.
+boost::enable_if_c< is_same< typename std::iterator_traits< RandomAccessIter >::value_type, typename std::string >::value, void >::type
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Generic spreadsort variant that detects a std::string element type and calls string_sort. If the data type provided is a string, string_sort is used. Sorting other data types requires picking between integer_sort, float_sort and string_sort directly, as spreadsort won't accept types that don't have the appropriate type_traits.
+
+
+
+
+
+
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports operator>>, which returns an integer type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order.
+boost::enable_if_c< is_same< typename std::iterator_traits< RandomAccessIter >::value_type, typename std::wstring >::value &&sizeof(wchar_t)==2, void >::type
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Generic spreadsort variant that detects a std::wstring element type and calls string_sort. If the data type provided is a wstring, string_sort is used. Sorting other data types requires picking between integer_sort, float_sort and string_sort directly, as spreadsort won't accept types that don't have the appropriate type_traits. Also, 2-byte wide characters are the limit above which string_sort is inefficient, so on platforms with wider characters, this will not accept wstrings.
+
+
+
+
+
+
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports operator>>, which returns an integer type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order.
+void
+
+ Range & Range [first, last) for sorting. Generic spreadsort variant that detects the value_type and calls the required sort function. Sorting other data types requires picking between integer_sort, float_sort and string_sort directly, as spreadsort won't accept types that don't have the appropriate type_traits.
+
+
+
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order.
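The spreadsort overloads above select an algorithm from the element type through enable_if conditions on std::numeric_limits and is_same. The following standard-library-only sketch mirrors that selection logic by returning the name of the sort that would be chosen. It is an illustration, not the library's code: which_sort is a hypothetical name, std::enable_if stands in for boost::enable_if_c, and the wstring case is omitted.

```cpp
#include <cassert>
#include <limits>
#include <string>
#include <type_traits>

// Integer element types would be routed to integer_sort.
template <class T>
typename std::enable_if<std::numeric_limits<T>::is_integer, const char*>::type
which_sort() { return "integer_sort"; }

// Non-integer IEC 559 (IEEE 754) element types would be routed to float_sort.
template <class T>
typename std::enable_if<!std::numeric_limits<T>::is_integer &&
                        std::numeric_limits<T>::is_iec559, const char*>::type
which_sort() { return "float_sort"; }

// std::string elements would be routed to string_sort.
template <class T>
typename std::enable_if<std::is_same<T, std::string>::value, const char*>::type
which_sort() { return "string_sort"; }
```

Because exactly one condition holds for each supported type, overload resolution picks a single candidate; a type matching none of the conditions fails to compile, which corresponds to spreadsort not accepting types without the appropriate traits.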
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+void
+ Random access iterator
+ Unsigned character type used for string.
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Unsigned_char_type Value with the same type as the result of the [] operator, defining the Unsigned_char_type. The actual value is unused. String sort algorithm using random access iterators, allowing character-type overloads.
+ (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports operator>>, which returns an integer type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+ Unsigned character type used for string.
+ Range & Range [first, last) for sorting. Unsigned_char_type Value with the same type as the result of the [] operator, defining the Unsigned_char_type. The actual value is unused. String sort algorithm using range, allowing character-type overloads.
+ (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. String sort algorithm using random access iterators, wraps using default of unsigned char. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_string_sort
+ osx_string_sort
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports operator>>, which returns an integer type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+ Range & Range [first, last) for sorting. String sort algorithm using range, wraps using default of unsigned char. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_string_sort
+ osx_string_sort
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+ Random access iterator
+
+ Unsigned character type used for string.
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Compare A binary functor that returns whether the first element passed to it should go before the second in order. Unsigned_char_type Value with the same type as the result of the [] operator, defining the Unsigned_char_type. The actual value is unused. String sort algorithm using random access iterators, allowing character-type overloads. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports operator>>, which returns an integer type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order. Returns void. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+ Unsigned character type used for string.
+ Range & Range [first, last) for sorting. Compare A binary functor that returns whether the first element passed to it should go before the second in order. Unsigned_char_type Value with the same type as the result of the [] operator, defining the Unsigned_char_type. The actual value is unused. String sort algorithm using range, allowing character-type overloads. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+ windows_string_sort
+ osx_string_sort
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. Returns void. std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+ RandomAccessIter Iterator pointer to first element. RandomAccessIter Iterator pointing to one beyond the end of data. Compare A binary functor that returns whether the first element passed to it should go before the second in order. String sort algorithm using random access iterators, wraps using default of unsigned char. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable RandomAccessIter value_type supports the operator>>, which returns an integer-type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order.void.std::exception Propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+ Range &: Range [first, last) for sorting. Compare: A binary functor that returns whether the first element passed to it should go before the second in order. String sort algorithm using range, wraps using default of unsigned char. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. Returns void. Throws std::exception: propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+
+ RandomAccessIter: Iterator pointer to first element. RandomAccessIter: Iterator pointing to one beyond the end of data. Get_char: Bracket functor equivalent to operator[], taking a number corresponding to the character offset. Get_length: Functor to get the length of the string in characters. String sort algorithm using random access iterators, wraps using default of unsigned char. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. RandomAccessIter value_type supports the operator>>, which returns an integer-type right-shifted a specified number of bits. The elements in the range [first, last) are sorted in ascending order. Returns void. Throws std::exception: propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+
+ Range &: Range [first, last) for sorting. Get_char: Bracket functor equivalent to operator[], taking a number corresponding to the character offset. Get_length: Functor to get the length of the string in characters. String sort algorithm using range, wraps using default of unsigned char. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. Returns void. Throws std::exception: propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+
+
+ RandomAccessIter: Iterator pointer to first element. RandomAccessIter: Iterator pointing to one beyond the end of data. Get_char: Bracket functor equivalent to operator[], taking a number corresponding to the character offset. Get_length: Functor to get the length of the string in characters. Compare: A binary functor that returns whether the first element passed to it should go before the second in order. String sort algorithm using random access iterators, wraps using default of unsigned char. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. The elements in the range [first, last) are sorted in ascending order. Returns void. Throws std::exception: propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+
+
+ Range &: Range [first, last) for sorting. Get_char: Bracket functor equivalent to operator[], taking a number corresponding to the character offset. Get_length: Functor to get the length of the string in characters. Compare: A binary functor that returns whether the first element passed to it should go before the second in order. String sort algorithm using range, wraps using default of unsigned char. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. Returns void. Throws std::exception: propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+
+
+ RandomAccessIter: Iterator pointer to first element. RandomAccessIter: Iterator pointing to one beyond the end of data. Get_char: Bracket functor equivalent to operator[], taking a number corresponding to the character offset. Get_length: Functor to get the length of the string in characters. Compare: A binary functor that returns whether the first element passed to it should go before the second in order. Reverse String sort algorithm using random access iterators. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. RandomAccessIter value_type is mutable. RandomAccessIter value_type is LessThanComparable. The elements in the range [first, last) are sorted in ascending order. Returns void. Throws std::exception: propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+void
+
+
+
+
+ Range &: Range [first, last) for sorting. Get_char: Bracket functor equivalent to operator[], taking a number corresponding to the character offset. Get_length: Functor to get the length of the string in characters. Compare: A binary functor that returns whether the first element passed to it should go before the second in order. Reverse String sort algorithm using range. (All variants fall back to std::sort if the data size is too small, < detail::min_sort_size). string_sort is a fast templated in-place hybrid radix/comparison algorithm, which in testing tends to be roughly 50% to 2X faster than std::sort for large tests (>=100kB).
+Worst-case performance is O(N * (lg(range)/s + s)), so string_sort is asymptotically faster than pure comparison-based algorithms.
+
+Some performance plots of runtime vs. n and log(range) are provided:
+windows_string_sort
+osx_string_sort
+
+
+
+
+
+Throwing an exception may cause data loss. This will also throw if a small vector resize throws, in which case there will be no data loss.
+Invalid arguments cause undefined behaviour.
+The spreadsort function provides a wrapper that calls the fastest sorting algorithm available for a data type, enabling faster generic programming.
+The lesser of O(N*log(N)) comparisons and O(N*(K/S + S)) operations worst-case, where:
+* N is last - first,
+* K is the log of the range in bits (32 for 32-bit integers using their full range),
+* S is a constant called max_splits, defaulting to 11 (except for strings where it is the log of the character size).
+[first, last) is a valid range. The elements in the range [first, last) are sorted in ascending order. Returns void. Throws std::exception: propagates exceptions if any of the element comparisons, the element swaps (or moves), the right shift, subtraction of right-shifted elements, functors, or any operations on iterators throw.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/bibliography.qbk b/doc/bibliography.qbk
new file mode 100644
index 0000000..6275334
--- /dev/null
+++ b/doc/bibliography.qbk
@@ -0,0 +1,31 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+
+[section:bibliography 4.- Bibliography]
+
+[*Steven Ross]
+
+[*Francisco Tapia]
+
+[01] Introduction to Algorithms, 3rd Edition (Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein)
+
+[02] C++ STL Sort Algorithms
+
+[03] Algorithms + Data Structures = Programs (Niklaus Wirth), Prentice Hall Series in Automatic Computation
+
+[04] Structured Parallel Programming: Patterns for Efficient Computation (Michael McCool, James Reinders, Arch Robison)
+
+
+[*Orson Peters]
+
+[endsect]
+
+
+
diff --git a/doc/block_indirect_sort.qbk b/doc/block_indirect_sort.qbk
new file mode 100644
index 0000000..3a7c225
--- /dev/null
+++ b/doc/block_indirect_sort.qbk
@@ -0,0 +1,176 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:block_indirect_sort 3.1- block_indirect_sort]
+
+
+[section:block_introcuction 3.1.1- Introduction]
+
+[*BLOCK_INDIRECT_SORT] is a new unstable parallel sort algorithm, created and implemented by Francisco Jose Tapia for the Boost library.
+
+[table AlgorithmDescription
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[block_indirect_sort] [Yes] [No] [block_size * num_threads] [NlogN, NlogN , NlogN]]
+]
+
+The block_size is an internal parameter of the algorithm. To achieve the highest speed, it
+changes with the size of the objects to sort, as shown in the following table.
+Strings use a block_size of 128.
+
+[table BlockSize
+[[object size (bytes)] [1 - 15][16 - 31][32 - 63][64 - 127][128 - 255][256 - 511][512 -]]
+[[block_size (number of elements)] [4096] [2048] [1024][768][512][256][128]]
+]
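+
+As a rough illustration of the table above, the mapping from object size to block_size can be written as a small lookup. The helper name block_size_for is hypothetical and is not part of the library's interface; the sketch only restates the table, assumes objects of at least 1 byte, and does not cover strings, which use a block_size of 128 as noted above.
+
```cpp
#include <cstddef>

// Hypothetical helper, not part of Boost.Sort: returns the block_size
// (in number of elements) that the table lists for an object size in bytes.
constexpr std::size_t block_size_for(std::size_t object_bytes)
{
    return object_bytes < 16  ? 4096
         : object_bytes < 32  ? 2048
         : object_bytes < 64  ? 1024
         : object_bytes < 128 ? 768
         : object_bytes < 256 ? 512
         : object_bytes < 512 ? 256
                              : 128;
}

// Compile-time spot checks against the table.
static_assert(block_size_for(8) == 4096, "1 - 15 bytes");
static_assert(block_size_for(64) == 768, "64 - 127 bytes");
static_assert(block_size_for(512) == 128, "512 bytes or more");
```
+
+For example, 64-bit numbers (8 bytes) fall in the first column, so the sketch gives 4096 elements per block.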
+
+Sorting 100 000 000 randomly generated 64-bit numbers with 12 threads gives the following measured results:
+
+[table MemoryUsed
+[[Algorithm][Time (secs)][Memory used in MB]]
+[[Open MP Parallel Sort] [1.1990][1564 MB]]
+[[Threading Building Blocks (TBB)][1.6411][789 MB]]
+[[Block Indirect Sort] [0.9270][790 MB]]
+]
+
+These algorithms [*do not use any other library or utility]. Compiling this library requires a
+[*C++11 compliant compiler]. You do not need to link against any external static or dynamic library.
+
+The algorithms use a [*comparison object], in the same way as the standard library sort
+algorithms. If you don't define it, the comparison object defaults to std::less, which uses
+the < operator internally for comparisons.
+
+
+The algorithms are [*exception safe], meaning that if an exception is thrown by the algorithms,
+the integrity of the objects to sort is guaranteed, but not their relative order. If the exception
+is generated inside the objects (in the move or copy constructor, for example), the results can be
+unpredictable.
+
+
+You only need to include the file boost/sort/parallel/sort.hpp to use this algorithm.
+
+``
+ #include <boost/sort/parallel/sort.hpp>
+
+
+
+ template <class iter_t>
+ void block_indirect_sort (iter_t first, iter_t last);
+
+ template <class iter_t, class compare>
+ void block_indirect_sort (iter_t first, iter_t last, compare comp);
+
+ template <class iter_t>
+ void block_indirect_sort (iter_t first, iter_t last, uint32_t num_thread);
+
+ template <class iter_t, class compare>
+ void block_indirect_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
+``
+The algorithm is in the namespace boost::sort.
+
+
+[*THREAD SPECIFICATION]
+
+The parallel algorithms have an integer parameter indicating the number of threads to use in the sorting process,
+which is always the last argument in the call. The default value (if left unspecified) is the number of HW threads of
+the machine where the program is running, as provided by std::thread::hardware_concurrency().
+
+If the number is 1 or 0, the algorithm runs with only 1 thread.
+
+The number of threads passed can be greater than the number of HW threads of the machine. You can pass 100 threads on a machine with 4 HW threads,
+and in the same way you can pass a variable or an expression such as (std::thread::hardware_concurrency() / 4). If the resulting value is 0, the program runs with 1 thread.
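+
+The thread-count rule described above can be sketched with the standard library alone. The names effective_threads and default_threads are hypothetical and only illustrate the stated behaviour; they are not functions of the library.
+
```cpp
#include <cstdint>
#include <thread>

// Hypothetical helper: applies the rule that a requested count of 0 or 1
// means the sort runs with a single thread. Larger values are kept as-is,
// even when they exceed the number of HW threads.
constexpr std::uint32_t effective_threads(std::uint32_t requested)
{
    return requested < 2 ? 1u : requested;
}

// Sketch of the default: the number of HW threads reported by the standard
// library. hardware_concurrency() may return 0 when the value is unknown,
// which the rule above maps to 1.
inline std::uint32_t default_threads()
{
    return effective_threads(std::thread::hardware_concurrency());
}
```
+
+With this rule, an expression such as std::thread::hardware_concurrency() / 4 on a machine with 2 HW threads evaluates to 0 and still results in one thread.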
+
+[endsect]
+
+
+[section:block_internal 3.1.2- Internal Description]
+
+
+There are two primary categories of parallelization in sorting algorithms.
+
+[*SUBDIVISION ALGORITHMS]
+
+[:Filter the data and generate two or more parts. Each part obtained is
+filtered and divided by other threads, until the size of the data to
+sort is smaller than a predefined size, then it is sorted by a single
+thread. The algorithm most frequently used for the filtering and sorting
+is quicksort.
+
+These algorithms are fast with a small number of threads, but are inefficient
+with a large number of HW (hardware) threads. Examples of this category are:
+# Intel Threading Building Blocks (TBB)
+# Microsoft PPL Parallel Sort.
+]
+
+[*MERGING ALGORITHMS]
+
+[:Divide the data into parts, and each part is sorted by a thread. When
+the parts are sorted, they are merged to obtain the final results. The
+problem with these algorithms is that they need additional memory for the
+merge, usually the same size as the data.
+
+With a small number of threads, these algorithms have similar speed to
+the subdivision algorithms, but with [*many threads they are much faster].
+Examples of this category are:
+# GCC Parallel Sort (based on OpenMP)
+# Microsoft PPL Parallel Buffered Sort
+]
+
+This creates an undesirable duality. With a small number of threads, the optimal algorithm is not the one that is optimal for a large number of threads.
+For this reason, software designed for a small machine is inadequate for a big machine, and vice versa.
+The main problem of the merging algorithms, however, is the additional memory used, usually the same size as the data.
+
+[*NEW PARALLEL SORT ALGORITHM (Block Indirect Sort) ]
+
+This algorithm, named Block Indirect Sort, created for processors connected with shared memory, is a hybrid algorithm.
+With a small number of threads it is a subdivision algorithm, but with many threads it is a merging algorithm
+that needs only a small auxiliary memory (block_size * number of threads).
+
+This algorithm eliminates the duality. You compile your program using the new algorithm. The number of threads to use is evaluated
+in each execution. It can be a number, a variable, or an expression. When the program runs with a small number of threads, the algorithm
+internally uses a subdivision algorithm and has similar performance to TBB; when it runs with many threads,
+it internally uses the new algorithm and has the performance of GCC Parallel Sort, with the additional advantage of reduced memory consumption.
+
+[*DESIGN PROCESS ]
+
+The initial idea for this algorithm was to build a merge algorithm that is fast with many threads and uses little additional memory.
+
+The benchmark results, in both speed and memory used, are excellent. These are the results of sorting 100 000 000 randomly generated 64-bit numbers
+on an Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz, with 6 cores, 2 threads per core, and 15 MB of cache:
+
+[table MemoryUsed
+[[Algorithm][Time (secs)][Memory used in MB]]
+[[Open MP Parallel Sort] [1.1990][1564 MB]]
+[[Threading Building Blocks (TBB)][1.6411][ 789 MB]]
+[[Block Indirect Sort] [0.9270][ 790 MB]]
+]
+
+The technique used in the algorithm (indirect blocks) is new, and was designed for this algorithm.
+
+The process was long and very hard, mainly because of the uncertainty about whether the ideas were correct and would run
+fast enough to be useful. An aggravating factor was that this could not be confirmed until the last part
+of the code was done and the first benchmark had run.
+
+But it was also a very exciting process: each time a problem was resolved, a new algorithm was designed and
+tested, and the progress could be seen step by step.
+
+During this process, new and previously unknown problems appeared, which made it necessary to design new internal algorithms to resolve them
+and to divide the work into many parts to execute in parallel. Because of this, inside the sorting algorithm you can also find other interesting algorithms
+that resolve and parallelize the internal problems.
+
+The best description of this algorithm is given by the [@#linux_parallel benchmarks] results.
+
+If you are interested in a detailed description of the algorithm, you can find it at the following link: [@./papaers/block_indirect_sort_en.pdf Block Indirect Sort].
+
+
+[endsect]
+
+[endsect]
+
+
+
diff --git a/doc/flat_stable_sort.qbk b/doc/flat_stable_sort.qbk
new file mode 100644
index 0000000..255cd22
--- /dev/null
+++ b/doc/flat_stable_sort.qbk
@@ -0,0 +1,93 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:flat_stable_sort 2.4.- flat_stable_sort]
+
+[*Flat_stable_sort] is a new stable sort algorithm, designed and implemented by Francisco Jose Tapia for the Boost Sort Library.
+
+The goal of the algorithm is to provide a stable sort with low additional memory (about 1% of the memory used by the data).
+
+The stable sort algorithms provided by compilers and libraries use additional memory, usually half the size of the data to sort.
+
+This new algorithm provides around 80%-90% of the speed of spinsort and of the stable sort algorithms provided by compilers and libraries.
+
+The algorithm responds very well when the data are nearly sorted. Often new elements are inserted at the end of the sorted elements,
+or some elements are modified, breaking the order of the elements. In these cases, the flat_stable_sort algorithm is very fast.
+
+You can see this in the benchmark results in point 3.- Single Thread Benchmarks.
+
+[table AlgorithmDescription
+[[Algorithm] [Additional Memory] [Best, average, and worst case]]
+[[flat_stable_sort] [size of the data / 256 + 8K] [N, NlogN , NlogN]]
+]
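+
+To give a sense of scale for the additional memory formula in the table above, the arithmetic can be checked at compile time. This is only a sketch: the helper name is hypothetical, and it assumes that "8K" means 8 * 1024 bytes.
+
```cpp
#include <cstddef>

// Hypothetical helper restating the table entry: data size / 256 + 8K.
// Assumption: "8K" is taken to mean 8 * 1024 bytes.
constexpr std::size_t flat_stable_sort_extra_bytes(std::size_t data_bytes)
{
    return data_bytes / 256 + 8 * 1024;
}

// 100 000 000 numbers of 8 bytes each occupy 800 000 000 bytes, so the
// formula gives 3 125 000 + 8 192 = 3 133 192 bytes of additional memory.
static_assert(flat_stable_sort_extra_bytes(800000000u) == 3133192u,
              "about 3 MB of additional memory");
```
+
+Under that assumption, sorting 800 MB of data needs roughly 3 MB of additional memory according to the formula.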
+
+These benchmarks sort 100 000 000 64-bit numbers.
+
+[table benchmark
+[[Data] [std::stable_sort] [spinsort] [flat_stable_sort]]
+[[random] [ 8.62 ] [ 9.73 ] [10.80 ]]
+[[sorted] [ 4.88 ] [ 0.06 ] [ 0.07 ]]
+[[sorted + 0.1% end] [ 4.92 ] [ 0.41 ] [ 0.36 ]]
+[[sorted + 1% end] [ 4.97 ] [ 0.55 ] [ 0.49 ]]
+[[sorted + 10% end] [ 5.73 ] [ 1.32 ] [ 1.40 ]]
+[[sorted + 0.1% mid] [ 6.58 ] [ 1.89 ] [ 2.61 ]]
+[[sorted + 1% mid] [ 7.06 ] [ 2.12 ] [ 3.07 ]]
+[[sorted + 10% mid] [ 9.56 ] [ 4.02 ] [ 5.49 ]]
+[[reverse sorted] [ 0.13 ] [ 0.14 ] [ 1.87 ]]
+[[reverse sorted + 0.1% end] [ 5.22 ] [ 0.52 ] [ 0.42 ]]
+[[reverse sorted + 1% end] [ 5.29 ] [ 0.66 ] [ 0.55 ]]
+[[reverse sorted + 10% end] [ 6.03 ] [ 1.45 ] [ 1.44 ]]
+[[reverse sorted + 0.1% mid] [ 6.52 ] [ 1.89 ] [ 2.54 ]]
+[[reverse sorted + 1% mid] [ 7.09 ] [ 2.12 ] [ 3.09 ]]
+[[reverse sorted + 10% mid] [ 9.46 ] [ 4.02 ] [ 5.53 ]]
+]
+
+
+Memory used by the stable sort algorithms, measured on Linux x64:
+
+[table memory
+[[Algorithm] [Memory used (MB)] ]
+[[std::stable_sort] [1177 MB] ]
+[[spinsort] [1175 MB] ]
+[[flat_stable_sort] [ 788 MB] ]
+[[spreadsort ] [ 785 MB] ]
+]
+
+
+
+You only need to include the file boost/sort/parallel/sort.hpp
+
+``
+ #include <boost/sort/parallel/sort.hpp>
+
+
+ template
+ void flat_stable_sort (iter_t first, iter_t last, compare comp = compare());
+``
+
+The flat_stable_sort function is in the namespace boost::sort.
+
+For a detailed description of this algorithm, see the [@./papers/flat_stable_sort_eng.pdf flat stable sort document].
+
+
+Compiling this library requires a [*C++11 compliant compiler].
+
+The algorithms use a [*comparison object], in the same way as the standard library sort
+algorithms. If you don't define it, the comparison object defaults to std::less, which uses
+the < operator internally for comparisons.
+
+The algorithm is [*exception safe], meaning that if an exception is thrown by the algorithm,
+the integrity of the objects to sort is guaranteed, but not their relative order. If the exception
+is generated inside the objects (in the move or copy constructor, for example), the results can be
+unpredictable.
+
+[endsect]
+
+
+
diff --git a/doc/gratitude.qbk b/doc/gratitude.qbk
new file mode 100644
index 0000000..71dcc90
--- /dev/null
+++ b/doc/gratitude.qbk
@@ -0,0 +1,30 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:gratitude 5.- Gratitude]
+
+[*Steven Ross]
+
+[*Francisco Tapia]
+
+To [@http://www.cesvima.upm.es CESVIMA], Centro de Cálculo de la Universidad Politécnica de Madrid.
+ When I needed machines to tune this algorithm, I contacted the research departments of many universities in Madrid. Only they helped me.
+
+To Hartmut Kaiser, Adjunct Professor of Computer Science at Louisiana State University, for his faith in my work.
+
+To Steven Ross, for his infinite patience during the long development of this algorithm, and for his wise advice.
+
+
+
+[*Orson Peters]
+
+[endsect]
+
+
+
diff --git a/doc/introduction.qbk b/doc/introduction.qbk
new file mode 100644
index 0000000..5aab1ac
--- /dev/null
+++ b/doc/introduction.qbk
@@ -0,0 +1,17 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:introduction 1.- Introduction]
+
+
+
+[endsect]
+
+
+
diff --git a/doc/linux_parallel.qbk b/doc/linux_parallel.qbk
new file mode 100644
index 0000000..66e6229
--- /dev/null
+++ b/doc/linux_parallel.qbk
@@ -0,0 +1,126 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:linux_parallel 3.4- Linux Benchmarks]
+
+The following results are from more complex benchmarks, not included in the library because they use licensed software.
+(If you are interested in them, contact fjtapia@gmail.com)
+
+There are 3 types of benchmarks:
+
+# 64-bit integers
+# strings
+# objects of several sizes.
+
+The objects are arrays of integers. The heavy comparison sums all the elements in each, and the light comparison uses only the first number of the array.
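+
+The two comparison modes described above can be sketched as follows. This is an illustration only, not the benchmark code: the names object, heavy_less, light_less and sort_objects are hypothetical, and std::sort stands in for the algorithms being measured.
+
```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical object type: an array of N 64-bit numbers.
template <std::size_t N>
using object = std::array<std::uint64_t, N>;

// Heavy comparison: sums every element of both arrays on each call.
struct heavy_less {
    template <std::size_t N>
    bool operator()(const object<N>& a, const object<N>& b) const {
        return std::accumulate(a.begin(), a.end(), std::uint64_t{0})
             < std::accumulate(b.begin(), b.end(), std::uint64_t{0});
    }
};

// Light comparison: uses only the first number of each array as the key.
struct light_less {
    template <std::size_t N>
    bool operator()(const object<N>& a, const object<N>& b) const {
        static_assert(N > 0, "objects need at least one element");
        return a[0] < b[0];
    }
};

// Sorts with the given comparison; std::sort is a stand-in here.
template <std::size_t N, class Compare>
void sort_objects(std::vector<object<N>>& data, Compare comp)
{
    std::sort(data.begin(), data.end(), comp);
    assert(std::is_sorted(data.begin(), data.end(), comp));
}
```
+
+The cost of the heavy comparison grows with the object size, while the light comparison stays constant, which is why the tables report both for each object size.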
+
+The computer used is an Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz, with 6 cores, 2 threads per core, and 15 MB of cache.
+
+
+
+[teletype]
+``
+
+ 100 000 000 NUMBERS OF 64 BITS
+ RANDOMLY FILLED
+
+ | Time | Maximum |
+ | secs | Memory Used |
+ -------------------------+---------+-------------+
+ OMP parallel sort | 1.1990 | 1564 MB |
+ TBB parallel_sort | 1.6411 | 789 MB |
+ block_indirect_sort | 0.9270 | 790 MB |
+ | | |
+ OMP parallel stable sort | 1.5814 | 1972 MB |
+ TBB parallel stable sort | 1.1745 | 1570 MB |
+ sample sort | 1.2872 | 1566 MB |
+ parallel stable sort | 1.7158 | 1176 MB |
+ -------------------------+---------+-------------+
+
+``
+
+
+[teletype]
+``
+
+ 10 000 000 S T R I N G S
+ RANDOMLY FILLED
+
+ | Time | Maximum |
+ | secs | Memory Used |
+ -------------------------+---------+-------------+
+ OMP parallel sort | 1.5738 | 2023 MB |
+ TBB parallel_sort | 1.8626 | 826 MB |
+ block_indirect_sort | 1.2411 | 825 MB |
+ | | |
+ OMP parallel stable sort | 2.3214 | 2024 MB |
+ TBB parallel stable sort | 1.4383 | 1143 MB |
+ sample sort | 1.5097 | 1135 MB |
+ parallel stable sort | 2.0970 | 978 MB |
+ -------------------------+---------+-------------+
+
+``
+[teletype]
+``
+
+
+ =============================================================
+ = OBJECT COMPARISON =
+ = --------------------- =
+ = =
+ = The objects are arrays of 64 bits numbers =
+ = =
+ = They are compared in two ways : =
+ = =
+ = (H) Heavy : The comparison sums all the numbers in the =
+ = array. =
+ = =
+ = (L) Light : Uses the first element of the array as a key =
+ = for comparison. =
+ = =
+ =============================================================
+
+ | 100000000 | 50000000 | 25000000 | 12500000 | 6250000 | 1562500 |
+ | objects of| objects of|objects of |objects of |objects of |objects of |
+ | 8 bytes | 16 bytes | 32 bytes | 64 bytes | 128 bytes | 512 bytes |
+ | | | | | | |
+ | H L | H L | H L | H L | H L | H L |
+ --------------------+-----------+-----------+-----------+-----------+-----------+-----------+
+ OMP parallel sort | 1.18 1.17| 0.73 0.66| 0.51 0.45| 0.43 0.39| 0.41 0.37| 0.37 0.32|
+ TBB parallel_sort | 1.71 1.59| 0.85 0.81| 0.56 0.54| 0.51 0.42| 0.45 0.39| 0.36 0.32|
+ block_indirect_sort | 1.11 1.08| 0.66 0.63| 0.49 0.46| 0.43 0.39| 0.40 0.37| 0.37 0.33|
+ | | | | | | |
+ OMP par stable sort | 1.55 1.55| 1.38 1.35| 1.23 1.22| 1.17 1.17| 1.09 1.08| 0.97 0.97|
+ TBB par stable sort | 1.23 1.23| 0.89 0.85| 0.74 0.72| 0.71 0.69| 0.69 0.69| 0.68 0.68|
+ sample sort | 1.32 1.32| 0.84 0.78| 0.66 0.63| 0.63 0.62| 0.62 0.61| 0.60 0.60|
+ parallel stable sort| 1.80 1.90| 1.17 1.07| 0.83 0.75| 0.76 0.71| 0.72 0.70| 0.70 0.69|
+ | | | | | | |
+ --------------------+-----------+-----------+-----------+-----------+-----------+-----------+
+
+
+ | Maximum |
+ | Memory Used |
+ -------------------------+-------------+
+ OMP parallel sort | 1569 MB |
+ TBB parallel_sort | 788 MB |
+ block_indirect_sort | 794 MB |
+ | |
+ OMP parallel stable sort | 1980 MB |
+ TBB parallel stable sort | 1573 MB |
+ sample sort | 1568 MB |
+ parallel stable sort | 1177 MB |
+ | |
+ -------------------------+-------------+
+
+``
+
+[endsect]
+
+
+
diff --git a/doc/linux_single.qbk b/doc/linux_single.qbk
new file mode 100644
index 0000000..d37fd08
--- /dev/null
+++ b/doc/linux_single.qbk
@@ -0,0 +1,169 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:linux_single 2.5.- Linux Benchmarks]
+
+
+[*LINUX x64 GCC 6.3 BENCHMARK]
+
+
+The benchmark folder of the library contains programs that measure the speed of the algorithms on your machine and operating system.
+These brief benchmarks show the speed with different kinds of data (random, sorted, sorted with unsorted elements appended at the end, ...).
+
+The computer used is an Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz, with 6 cores, 2 threads per core, and 15 MB of cache.
+
+The results obtained with GCC 6.3 on Linux in benchmark_numbers, which sorts integers, are:
+
+[teletype]
+``
+
+
+
+ B O O S T S O R T
+ S I N G L E T H R E A D
+ I N T E G E R B E N C H M A R K
+
+ SORT OF 100 000 000 NUMBERS OF 64 BITS
+
+ ( 1 ) std::sort ( 2 ) pdqsort ( 3 ) std::stable_sort
+ ( 4 ) spin_sort ( 5 ) flat_stable_sort ( 6 ) spreadsort
+
+ | | | | | | |
+ | ( 1 )| ( 2 )| ( 3 )| ( 4 )| ( 5 )| ( 6 )|
+ --------------------+------+------+------+------+------+------+
+ random | 8.21 | 3.99 | 8.62 | 9.73 |10.80 | 4.26 |
+ | | | | | | |
+ sorted | 1.84 | 0.13 | 4.88 | 0.06 | 0.07 | 0.06 |
+ sorted + 0.1% end | 6.41 | 2.91 | 4.92 | 0.41 | 0.36 | 3.16 |
+ sorted + 1% end |14.15 | 3.39 | 4.97 | 0.55 | 0.49 | 3.65 |
+ sorted + 10% end | 6.72 | 4.15 | 5.73 | 1.32 | 1.40 | 4.39 |
+ | | | | | | |
+ sorted + 0.1% mid | 4.41 | 3.31 | 6.58 | 1.89 | 2.61 | 3.29 |
+ sorted + 1% mid | 4.39 | 3.62 | 7.06 | 2.12 | 3.07 | 3.80 |
+ sorted + 10% mid | 6.35 | 4.71 | 9.56 | 4.02 | 5.49 | 4.99 |
+ | | | | | | |
+ reverse sorted | 1.36 | 0.26 | 5.12 | 0.13 | 0.14 | 1.87 |
+ rv sorted + 0.1% end| 7.57 | 2.92 | 5.22 | 0.52 | 0.42 | 2.83 |
+ rv sorted + 1% end| 4.99 | 3.33 | 5.29 | 0.66 | 0.55 | 3.45 |
+ rv sorted + 10% end| 4.62 | 4.16 | 6.03 | 1.45 | 1.44 | 4.35 |
+ | | | | | | |
+ rv sorted + 0.1% mid| 4.38 | 3.29 | 6.52 | 1.89 | 2.54 | 3.28 |
+ rv sorted + 1% mid| 4.43 | 3.65 | 7.09 | 2.12 | 3.09 | 3.81 |
+ rv sorted + 10% mid| 6.42 | 4.70 | 9.46 | 4.02 | 5.53 | 5.00 |
+ --------------------+------+------+------+------+------+------+
+
+``
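+
+As a rough illustration of the "sorted + x% end" rows, the following sketch builds such an input with the standard library only. The function name `sorted_plus_random_end` and its per-mille parameter are assumptions made for this example; the benchmark programs in the library may generate their data differently.
+
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical generator (not part of the library): n sorted values followed
// by n * per_mille / 1000 unsorted random values appended at the end.
std::vector<uint64_t> sorted_plus_random_end(std::size_t n,
                                             std::size_t per_mille,
                                             uint64_t seed = 0) {
    std::mt19937_64 gen(seed);
    std::vector<uint64_t> data(n);
    for (std::size_t i = 0; i < n; ++i) data[i] = gen();
    std::sort(data.begin(), data.end());            // the sorted part

    const std::size_t extra = n * per_mille / 1000; // the unsorted tail
    for (std::size_t i = 0; i < extra; ++i) data.push_back(gen());
    return data;
}
```
+
+Under these assumptions, `sorted_plus_random_end(1000, 10)` corresponds to a "sorted + 1% end" input of 1010 elements.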
+
+The next results come from more complex benchmarks, which are not included in the library because they use non-free software.
+(If you are interested in them, contact fjtapia@gmail.com.)
+
+There are 3 types of benchmarks:
+
+* 64-bit integers
+* strings
+* objects of several sizes
+
+The objects are arrays of integers. The heavy comparison sums all the elements in each, and the light comparison uses only the first number of the array.
+
+
+
+[teletype]
+
+ 100 000 000 NUMBERS OF 64 BITS
+ RANDOMLY FILLED
+
+ | Time | Maximum |
+ | secs | Memory Used |
+ --------------------+---------+-------------+
+ std::sort | 8.2154 | 784 MB |
+ pdqsort | 3.9356 | 784 MB |
+ | | |
+ std::stable_sort | 8.5016 | 1176 MB |
+ spin_sort | 9.4262 | 1175 MB |
+ flat_stable_sort | 10.6790 | 788 MB |
+ spreadsort | 4.2248 | 785 MB |
+ --------------------+---------+-------------+
+
+
+
+
+[teletype]
+
+
+
+ 10 000 000 S T R I N G S
+ RANDOMLY FILLED
+
+ | Time | Maximum |
+ | secs | Memory Used |
+ --------------------+---------+-------------+
+ std::sort | 6.2442 | 822 MB |
+ pdqsort | 6.6661 | 821 MB |
+ | | |
+ std::stable_sort | 12.2620 | 1134 MB |
+ spin_sort | 8.5996 | 978 MB |
+ flat_stable_sort | 9.2559 | 978 MB |
+ spreadsort | 2.4323 | 822 MB |
+ --------------------+---------+-------------+
+
+
+[teletype]
+
+
+
+ =============================================================
+ = OBJECT COMPARISON =
+ = --------------------- =
+ = =
+ = The objects are arrays of 64 bits numbers =
+ = =
+ = They are compared in two ways : =
+ = =
+ = (H) Heavy : The comparison sums all the numbers in the =
+ = array. =
+ = =
+ = (L) Light : Uses the first element of the array as a key =
+ = for comparison. =
+ = =
+ =============================================================
+
+ | 100000000 | 50000000 | 25000000 | 12500000 | 6250000 | 1562500 |
+ | objects of| objects of| objects of| objects of| objects of| objects of|
+ | 8 bytes | 16 bytes | 32 bytes | 64 bytes | 128 bytes | 512 bytes |
+ | | | | | | |
+ | H L | H L | H L | H L | H L | H L |
+ -----------------+-----------+-----------+-----------+-----------+-----------+-----------+
+ std::sort | 8.25 8.26| 4.46 4.23| 2.67 2.33| 2.10 1.45| 1.72 1.11| 1.13 0.76|
+ pdqsort | 8.17 8.17| 4.42 4.11| 2.57 2.26| 1.78 1.37| 1.46 1.06| 0.97 0.70|
+ | | | | | | |
+ std::stable_sort |10.28 10.25| 5.57 5.24| 3.68 3.26| 2.97 2.59| 2.60 2.46| 2.38 2.29|
+ spin_sort | 9.70 9.69| 5.25 4.89| 3.28 2.65| 2.41 1.92| 2.03 1.66| 1.66 1.52|
+ flat_stable_sort |10.75 10.73| 6.44 5.99| 4.36 3.71| 3.59 2.86| 3.04 2.11| 1.64 1.45|
+ spreadsort | 5.10 5.10| 3.79 4.18| 2.22 1.88| 1.58 1.11| 1.51 0.99| 0.74 0.53|
+ | | | | | | |
+ -----------------+-----------+-----------+-----------+-----------+-----------+-----------+
+
+
+ | Maximum |
+ | Memory Used |
+ -----------------+-------------+
+ std::sort | 786 MB |
+ pdqsort | 786 MB |
+ | |
+ std::stable_sort | 1177 MB |
+ spin_sort | 1176 MB |
+ flat_stable_sort | 789 MB |
+ spreadsort | 786 MB |
+ -----------------+-------------+
+
+
+[endsect]
+
+
+
diff --git a/doc/papers/block_indirect_sort_en.pdf b/doc/papers/block_indirect_sort_en.pdf
new file mode 100644
index 0000000..c079ea6
Binary files /dev/null and b/doc/papers/block_indirect_sort_en.pdf differ
diff --git a/doc/papers/flat_stable_sort_eng.pdf b/doc/papers/flat_stable_sort_eng.pdf
new file mode 100644
index 0000000..2d2f3fa
Binary files /dev/null and b/doc/papers/flat_stable_sort_eng.pdf differ
diff --git a/doc/parallel.qbk b/doc/parallel.qbk
new file mode 100644
index 0000000..467ffbb
--- /dev/null
+++ b/doc/parallel.qbk
@@ -0,0 +1,84 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:parallel 3.- Parallel Algorithms]
+
+
+The algorithms use a [*comparison object], in the same way as the standard library sort algorithms. If you don't define it,
+the comparison object defaults to std::less, which uses the < operator internally for comparisons.
+
+The algorithms are [*exception safe]: if an algorithm throws an exception, the integrity of the objects being sorted is guaranteed,
+but not their relative order. If the exception is thrown inside the objects (in the move or copy constructor, for example), the results can be unpredictable.
+
+The following table briefly describes the sort algorithms in the library.
+
+[table Parallel Algorithms
+[[Algorithm][Stable][Additional memory][Best, average, and worst case]]
+[[block_indirect_sort] [no] [block_size * num_threads][N LogN, N LogN , N LogN]]
+[[sample_sort] [yes][N] [N LogN, N LogN , N LogN]]
+[[parallel_stable_sort][yes][N / 2] [N LogN, N LogN , N LogN]]
+]
+
+The block_size is an internal parameter of the algorithm. To achieve the
+highest speed, it changes with the size of the objects to sort, according to the following table.
+[table BlockSize
+[[object size (bytes)] [1 - 15][16 - 31][32 - 63][64 - 127][128 - 255][256 - 511][512 -]]
+[[block_size (number of elements)] [4096] [2048] [1024][768][512][256][128]]
+]
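+
+The table above can be read as a simple lookup. The sketch below is only an illustration: the names `block_size_for` and `auxiliary_bytes` are hypothetical, and the memory estimate assumes that the additional memory is block_size elements for each thread.
+
```cpp
#include <cstddef>

// Hypothetical helper mirroring the BlockSize table: number of elements per
// block for an object of the given size in bytes.
inline std::size_t block_size_for(std::size_t object_bytes) {
    if (object_bytes < 16)  return 4096;
    if (object_bytes < 32)  return 2048;
    if (object_bytes < 64)  return 1024;
    if (object_bytes < 128) return 768;
    if (object_bytes < 256) return 512;
    if (object_bytes < 512) return 256;
    return 128;
}

// Assumption: the additional memory is block_size elements per thread,
// so the estimate in bytes is block_size * object size * threads.
inline std::size_t auxiliary_bytes(std::size_t object_bytes,
                                   std::size_t num_threads) {
    return block_size_for(object_bytes) * object_bytes * num_threads;
}
```
+
+Under that assumption, 12 threads sorting 512-byte objects would need 128 * 512 * 12 = 786432 bytes of auxiliary memory.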
+
+* Sample_sort is an implementation of the [@https://en.wikipedia.org/wiki/Samplesort Samplesort] algorithm by Francisco Tapia.
+* Parallel_stable_sort is based on the samplesort algorithm, but uses half of the memory used by sample_sort. It was designed and implemented by Francisco Tapia.
+* Block_indirect_sort is a novel parallel sort algorithm, designed and implemented by Francisco Tapia.
+
+[*THREAD SPECIFICATION]
+
+The parallel algorithms have an integer parameter indicating the number of threads to use in the sorting process,
+which is always the last value in the call. The default value (if left unspecified) is the number of HW threads of
+the machine where the program is running, as provided by std::thread::hardware_concurrency().
+
+If the number is 1 or 0, the algorithm runs with only 1 thread.
+
+The number of threads is not a fixed number; it is calculated in each execution. The number of threads passed can be greater
+than the number of HW threads of the machine.
+
+For example, we can pass 100 threads on a machine with 4 HW threads. In the same way, we can pass an expression
+such as (std::thread::hardware_concurrency() / 4). If this value is 0, the program runs with 1 thread.
+
+You only need to include the file boost/sort/parallel/sort.hpp
+
+``
+    #include <boost/sort/parallel/sort.hpp>
+``
+
+The parallel algorithms have 4 invocation formats:
+
+[teletype]
+``
+ algorithm ( first iterator, last iterator, comparison object, number of threads )
+ algorithm ( first iterator, last iterator, comparison object )
+ algorithm ( first iterator, last iterator, number of threads )
+ algorithm ( first iterator, last iterator )
+``
+
+All the algorithms are in the namespace boost::sort.
+
+If no comparison object is specified, the default class ( std::less ) is used.
+
+If the number of threads is unspecified, the number of HW threads on the machine where the program is running is used.
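+
+The four invocation formats and their defaults can be sketched with a stand-in function. This is an illustration only, not the library's implementation: `sketch_sort` is a hypothetical name, it sorts sequentially with std::sort, and it returns the thread count it would have used. The enable_if guard is one possible way to keep an integer third argument from being taken as a comparison object.
+
```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iterator>
#include <thread>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a parallel algorithm. It sorts sequentially and
// returns the number of threads it would have used.
template <class Iter, class Compare>
uint32_t sketch_sort(Iter first, Iter last, Compare comp, uint32_t num_threads) {
    std::sort(first, last, comp);
    return num_threads == 0 ? 1 : num_threads;  // 0 and 1 both mean one thread
}

// Comparison object only: the thread count defaults to the HW threads.
// The enable_if guard rejects an integer as the third argument.
template <class Iter, class Compare,
          typename std::enable_if<!std::is_integral<Compare>::value, int>::type = 0>
uint32_t sketch_sort(Iter first, Iter last, Compare comp) {
    return sketch_sort(first, last, comp, std::thread::hardware_concurrency());
}

// Thread count only: the comparison object defaults to std::less.
template <class Iter>
uint32_t sketch_sort(Iter first, Iter last, uint32_t num_threads) {
    typedef typename std::iterator_traits<Iter>::value_type value_t;
    return sketch_sort(first, last, std::less<value_t>(), num_threads);
}

// Neither given: both defaults apply.
template <class Iter>
uint32_t sketch_sort(Iter first, Iter last) {
    typedef typename std::iterator_traits<Iter>::value_type value_t;
    return sketch_sort(first, last, std::less<value_t>(),
                       std::thread::hardware_concurrency());
}
```
+
+With this sketch, `sketch_sort(v.begin(), v.end(), 100)` selects the thread-count overload, while `sketch_sort(v.begin(), v.end(), std::greater<int>())` selects the comparison-object overload.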
+
+
+[include block_indirect_sort.qbk]
+[include sample_sort.qbk]
+[include parallel_stable_sort.qbk]
+[include linux_parallel.qbk]
+[include windows_parallel.qbk]
+[endsect]
+
+
+
diff --git a/doc/parallel_old.qbk b/doc/parallel_old.qbk
new file mode 100644
index 0000000..d93b18b
--- /dev/null
+++ b/doc/parallel_old.qbk
@@ -0,0 +1,855 @@
+
+
+[/ Some composite templates]
+[template super[x]'''<superscript>'''[x]'''</superscript>''']
+[template sub[x]'''<subscript>'''[x]'''</subscript>''']
+[template floor[x]'''⌊'''[x]'''⌋''']
+[template floorlr[x][lfloor][x][rfloor]]
+[template ceil[x] '''⌈'''[x]'''⌉''']
+
+[/ Required for autoindexing]
+[import ../../../tools/auto_index/include/auto_index_helpers.qbk]
+[/ Must be first included file!]
+
+
+
+[section 1.- Introduction]
+
+[section 1.1.- Description]
+
+This library provides both *stable and unstable* sorting algorithms, in *single threaded and parallel* versions.
+
+These algorithms *do not use any other library or utility*. Compiling this library requires a *C++11 compliant compiler*.
+
+The algorithms use a *comparison object*, in the same way as the standard library sort
+algorithms. If you don't define it, the comparison object defaults to std::less, which uses
+the < operator internally for comparisons.
+
+The algorithms are *exception safe*: if an algorithm throws an exception, the integrity of the
+objects being sorted is guaranteed, but not their relative order. If the exception
+is thrown inside the objects (in the move or copy constructor, for example), the results can be
+unpredictable.
+
+This library is *include only*. There is no need to link with any external static or dynamic library.
+It doesn't depend on any other Boost files, variables or libraries, or on any other external libraries.
+To use this library, just include the files in the boost/sort/parallel folder.
+
+The following table briefly describes the sort algorithms in the library.
+
+[table AlgorithmDescription
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[sort] [No] [No] [Log N] [NlogN, NlogN , NlogN]]
+[[stable sort] [No] [Yes] [N / 2] [NlogN, NlogN , NlogN]]
+[[parallel_sort] [Yes] [No] [block_size * num_threads] [NlogN, NlogN , NlogN]]
+[[parallel_stable_sort] [Yes] [Yes] [N / 2] [NlogN, NlogN , NlogN]]
+[[sample_sort] [Yes] [Yes] [N] [NlogN, NlogN , NlogN]]
+]
+The block_size is an internal parameter of the algorithm. To achieve the
+highest speed, it changes with the size of the objects to sort, according to the following table.
+Strings use a block_size of 128.
+
+[table BlockSize
+[[object size (bytes)] [1 - 15][16 - 31][32 - 63][64 - 127][128 - 255][256 - 511][512 -]]
+[[block_size (number of elements)] [4096] [2048] [1024][768][512][256][128]]
+]
+[endsect] [/section 1.1.- Description]
+
+
+
+[section 1.2.- Present Perspective]
+
+There are two primary categories of parallelization in sorting algorithms.
+
+[*SUBDIVISION ALGORITHMS]
+
+[:Filter the data and generate two or more parts. Each part obtained is
+filtered and divided by other threads, until the size of the data to
+sort is smaller than a predefined size; then it is sorted by a single
+thread. The algorithm most frequently used for the filtering and sorting
+is quicksort.
+
+These algorithms are fast with a small number of threads, but are inefficient
+with a large number of HW (hardware) threads. Examples of this category are
+# Intel Threading Building Blocks (TBB)
+# Microsoft PPL Parallel Sort.
+]
+
+[*MERGING ALGORITHMS]
+
+[:Divide the data into parts, and each part is sorted by a thread. When
+the parts are sorted, they are merged to obtain the final results. The
+problem with these algorithms is that they need additional memory for the
+merge, usually the same size as the data.
+
+With a small number of threads, these algorithms have similar speed to
+the subdivision algorithms, but with [*many threads they are much faster].
+Examples of this category are
+# GCC Parallel Sort (based on OpenMP)
+# Microsoft PPL Parallel Buffered Sort
+]
+[endsect] [/section 1.2.- Present Perspective]
+[br]
+
+
+[section 1.3.- New Parallel Sort Algorithm]
+
+This generates an undesirable duality: the optimal algorithm for a small number of threads is not the optimal one for a large number of threads. For this reason, software designed for a small machine is inadequate for a big machine, and vice versa. But the main problem of the merging algorithms is the additional memory used, usually of the same size as the data.
+
+This version has a *new parallel_sort algorithm* (internally named Block Indirect), created for processors connected with shared memory.
+It is a hybrid algorithm. With a small number of threads it is a subdivision algorithm, but with many threads it is a merging algorithm that needs only a small auxiliary memory (block_size * number of threads).
+
+
+The block_size is an internal parameter of the algorithm. To achieve the
+highest speed, it changes with the size of the objects to sort, according to the following table.
+Strings use a block_size of 128.
+
+[table BlockSize
+[[object size (bytes)] [1 - 15][16 - 31][32 - 63][64 - 127][128 - 255][256 - 511][512 -]]
+[[block_size (number of elements)] [4096] [2048] [1024][768][512][256][128]]
+]
+
+This algorithm eliminates the duality. You compile your program using the new algorithm. When your program runs on a machine with a small number of threads, the algorithm internally uses a subdivision algorithm and has similar performance to TBB; when it runs on a machine with many threads, it internally uses the new algorithm and has the performance of GCC Parallel Sort, with the additional advantage of reduced memory consumption.
+
+The algorithm uses an auxiliary memory of block_size elements for each thread. The worst case for the algorithm is when there are very big elements and many threads. With big elements (512 bytes) and 12 threads, the memory measured was:
+
+[table MemoryUsed
+[[Algorithm][Memory used in MB]]
+[[GCC Parallel Sort (OpenMP)][1565 MB]]
+[[Threading Building Blocks (TBB)][783 MB]]
+[[Block Indirect Sort][812 MB]]
+]
+
+This new parallel_sort algorithm was created and implemented specifically for this library by the author.
+
+A brief description of the algorithm is available in [@../../doc/papers/block_indirect_sort_brief_en.pdf Block Indirect Sort Brief].
+
+A detailed description of the algorithm is available in [@../../doc/papers/block_indirect_sort_en.pdf Block Indirect Sort].
+
+If you want to run the benchmarks on your machine, all the code, instructions and procedures are in [@https://github.com/fjtapia/sort_parallel_benchmark Sort Parallel Benchmarks].
+
+[endsect] [/section 1.3.- New Parallel Sort Algorithm]
+[br]
+
+[section 1.4.- Thread specification in the parallel algorithms]
+
+The parallel algorithms have a parameter indicating the number of threads to use in the sorting process, which is always the last value in the call. The default value (if left unspecified) is the number of HW threads of the machine where the program is running.
+
+The parallel algorithms have 4 invocation formats:
+
+ algorithm ( first iterator, last iterator, comparison object, number of threads )
+ algorithm ( first iterator, last iterator, comparison object )
+ algorithm ( first iterator, last iterator, number of threads )
+ algorithm ( first iterator, last iterator )
+
+If no comparison object is specified, the default class ( std::less ) is used.
+
+If the number of threads is unspecified, the number of HW threads on the machine where the program is running is used.
+
+[endsect] [/section 1.4.- Thread specification in the parallel algorithms]
+[br]
+
+[section 1.5.- Programing]
+
+You only need to include the file boost/sort/parallel/sort.hpp to use this library.
+
+
+    #include <boost/sort/parallel/sort.hpp>
+
+
+All the functions and definitions are in the namespace boost::sort::parallel.
+[endsect] [/section 1.5.- Programing]
+[br]
+[section 1.6.- Examples]
+
+This example uses the single threaded sort and stable_sort.
+
+
+    #include <cstdint>
+    #include <iostream>
+    #include <random>
+    #include <vector>
+    #include <boost/sort/parallel/sort.hpp>
+
+    namespace bsp = boost::sort::parallel;
+
+    int main (void)
+    {
+        std::mt19937_64 my_rand(0);
+        const uint32_t NMAX = 1000000;
+        std::vector<uint64_t> A, B;
+
+        // Fill A with random numbers and copy it to B
+        for (uint32_t i = 0; i < NMAX; ++i) A.push_back(my_rand());
+        B = A;
+        bsp::sort (A.begin(), A.end());
+        bsp::stable_sort (B.begin(), B.end());
+
+        // Both algorithms must produce the same sequence
+        for (uint32_t i = 0; i < NMAX; ++i)
+            if (A[i] != B[i]) std::cout<<"Error in the sorting process\n";
+        return 0;
+    }
+
+
+This example uses parallel_sort and sample_sort.
+
+
+    #include <cstdint>
+    #include <iostream>
+    #include <random>
+    #include <vector>
+    #include <boost/sort/parallel/sort.hpp>
+
+    namespace bsp = boost::sort::parallel;
+
+    int main (void)
+    {
+        std::mt19937_64 my_rand(0);
+        const uint32_t NMAX = 1000000;
+        std::vector<uint64_t> A, B;
+
+        for (uint32_t i = 0; i < NMAX; ++i) A.push_back (my_rand());
+        B = A;
+        //------------------------------------------------------------------------
+        // If the thread parameter is not specified, the number of threads used
+        // is the number of HW threads of the machine where the program is running.
+        // This number is calculated in each execution of the code
+        //------------------------------------------------------------------------
+        bsp::parallel_sort (A.begin(), A.end());
+        bsp::sample_sort (B.begin(), B.end());
+
+        for (uint32_t i = 0; i < NMAX; ++i)
+            if (A[i] != B[i]) std::cout<<"Error in the sorting process\n";
+        return 0;
+    }
+
+
+This example uses parallel_sort and sample_sort and specifies the thread count.
+
+    #include <cstdint>
+    #include <iostream>
+    #include <random>
+    #include <thread>
+    #include <vector>
+    #include <boost/sort/parallel/sort.hpp>
+
+    namespace bsp = boost::sort::parallel;
+
+    int main (void)
+    {
+        std::mt19937_64 my_rand(0);
+        const uint32_t NMAX = 1000000;
+        uint32_t number_threads = std::thread::hardware_concurrency();
+        std::vector<uint64_t> A, B;
+
+        for (uint32_t i = 0; i < NMAX; ++i) A.push_back (my_rand());
+        B = A;
+        //------------------------------------------------------------------------
+        // If the result of number_threads / 6 is smaller than 1, internally use 1 thread
+        //------------------------------------------------------------------------
+        bsp::parallel_sort (A.begin(), A.end(), number_threads / 6);
+        //------------------------------------------------------------------------
+        // Force execution with 100 threads
+        //------------------------------------------------------------------------
+        bsp::sample_sort (B.begin(), B.end(), 100);
+
+        for (uint32_t i = 0; i < NMAX; ++i)
+            if (A[i] != B[i]) std::cout<<"Error in the sorting process\n";
+        return 0;
+    }
+
+[endsect] [/section 1.6.- Examples]
+[endsect] [/section 1.- Introduction]
+[br]
+
+[section 2.- Algorithms]
+
+[section 2.1.- Single Thread ( sort, stable_sort)]
+[h3 sort]
+
+Sort is an implementation of the Introsort algorithm. Initially it uses quicksort, but when the number of subdivisions exceeds a limit, it changes to the heapsort algorithm.
+
+Heapsort is an O(NlogN) algorithm, but slower than quicksort. The switch prevents the worst case of quicksort, O(N²).
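+
+The switch from quicksort to heapsort can be sketched as follows. This is a minimal illustration of the Introsort idea using only the standard library, not the code of this library: the name `introsort_sketch`, the depth budget of 2 * log2(N) and the small-range threshold of 16 are assumptions chosen for the example.
+
```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <iterator>

// Quicksort with a depth budget. When the budget is exhausted, the remaining
// range is sorted with heapsort, which bounds the worst case to O(N log N).
template <class Iter, class Compare>
void introsort_sketch(Iter first, Iter last, Compare comp, int depth_budget) {
    typedef typename std::iterator_traits<Iter>::value_type value_t;
    while (last - first > 16) {
        if (depth_budget-- == 0) {              // too many subdivisions
            std::make_heap(first, last, comp);
            std::sort_heap(first, last, comp);
            return;
        }
        // Three-way partition around the middle element:
        // [first, lower) < pivot, [lower, upper) == pivot, [upper, last) > pivot
        const value_t pivot = *(first + (last - first) / 2);
        Iter lower = std::partition(first, last,
            [&](const value_t& x) { return comp(x, pivot); });
        Iter upper = std::partition(lower, last,
            [&](const value_t& x) { return !comp(pivot, x); });
        introsort_sketch(first, lower, comp, depth_budget);  // left part
        first = upper;                                       // loop on the right part
    }
    // Small ranges are finished with an insertion sort.
    for (Iter it = first; it != last; ++it)
        std::rotate(std::upper_bound(first, it, *it, comp), it, std::next(it));
}

// Convenience overload: std::less and a budget of about 2 * log2(N).
template <class Iter>
void introsort_sketch(Iter first, Iter last) {
    typedef typename std::iterator_traits<Iter>::value_type value_t;
    const double n = static_cast<double>(last - first);
    const int budget = 2 * static_cast<int>(std::log2(n + 1.0));
    introsort_sketch(first, last, std::less<value_t>(), budget);
}
```
+
+Passing a depth budget of 0 forces the heapsort path for any range longer than the small-range threshold, which is a simple way to exercise the fallback.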
+
+[table sort
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[sort] [No] [No] [Log N] [NlogN, NlogN , NlogN]]
+]
+
+    template <class iter_t,
+              class compare = std::less<typename std::iterator_traits<iter_t>::value_type> >
+    void sort (iter_t first, iter_t last, compare comp = compare()) ;
+
+[h3 stable_sort]
+
+This is a new single threaded stable sort algorithm, internally named spin_sort, created and developed specifically for this library. This algorithm combines several ideas to improve on other stable sort algorithms.
+
+In the benchmarks you can find a detailed description of the results in time and memory obtained. This algorithm uses an auxiliary memory of (N/2) elements.
+
+[table stable_sort
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[stable sort] [No] [Yes] [N / 2] [NlogN, NlogN , NlogN]]
+]
+
+    template <class iter_t,
+              class compare = std::less<typename std::iterator_traits<iter_t>::value_type> >
+    void stable_sort (iter_t first, iter_t last, compare comp = compare());
+
+[endsect] [/section 2.1.- Single Thread ( sort, stable_sort)]
+
+[section 2.2.- Parallel ( parallel_sort, parallel_stable_sort, sample_sort)]
+[h3 parallel_sort]
+
+This is the new algorithm Block Indirect Sort. It's a hybrid algorithm, because with a small number of HW threads it uses a parallel version of introsort, and with a number of threads > 5 uses the new algorithm. When the number of threads is 1, it uses introsort.
+
+This algorithm combines the speed of GCC Parallel Sort with many cores, with the small memory consumption of Threading Building Blocks (TBB). This algorithm was created and implemented by the author for this library. The auxiliary memory needed is (block_size * number of threads). (See the description in 1.3.- New Parallel Sort Algorithm.)
+
+[table parallel_sort
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[parallel_sort] [Yes] [No] [block_size * num_threads] [NlogN, NlogN , NlogN]]
+]
+
+    template <class iter_t>
+    void parallel_sort (iter_t first, iter_t last);
+
+    template <class iter_t, class compare>
+    void parallel_sort (iter_t first, iter_t last, compare comp);
+
+    template <class iter_t>
+    void parallel_sort (iter_t first, iter_t last, uint32_t num_thread);
+
+    template <class iter_t, class compare>
+    void parallel_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
+
+
+
+[h3 parallel_stable_sort]
+
+This is a parallel stable sort algorithm, built on top of the sample sort algorithm, but using less auxiliary memory (N / 2 elements) in exchange for being about 10% slower.
+
+[table parallel_stable_sort
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[parallel_stable_sort] [Yes] [Yes] [N / 2] [NlogN, NlogN , NlogN]]
+]
+
+    template <class iter_t>
+    void parallel_stable_sort (iter_t first, iter_t last);
+
+    template <class iter_t, class compare>
+    void parallel_stable_sort (iter_t first, iter_t last, compare comp);
+
+    template <class iter_t>
+    void parallel_stable_sort (iter_t first, iter_t last, uint32_t num_thread);
+
+    template <class iter_t, class compare>
+    void parallel_stable_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
+
+
+[h3 sample_sort]
+
+This is a parallel stable sort algorithm. It is faster than parallel_stable_sort, but the auxiliary memory used is N elements.
+You can see the details in the benchmark chapter.
+
+[table sample_sort
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[sample_sort] [Yes] [Yes] [N] [NlogN, NlogN , NlogN]]
+]
+
+    template <class iter_t>
+    void sample_sort (iter_t first, iter_t last);
+
+    template <class iter_t, class compare>
+    void sample_sort (iter_t first, iter_t last, compare comp);
+
+    template <class iter_t>
+    void sample_sort (iter_t first, iter_t last, uint32_t num_thread);
+
+    template <class iter_t, class compare>
+    void sample_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
+
+[endsect] [/section 2.2.- Parallel ( parallel_sort, parallel_stable_sort, sample_sort)]
+
+[section 2.3.- less_ptr_no_null]
+
+Sometimes we don't want to physically sort the data by some concept. In such cases we can create a vector of pointers or iterators to the elements, named an index, and sort the index. This permits keeping several indexes into the same data set at the same time, each sorted by a different concept.
+
+To sort an index, we have a special comparison object, less_ptr_no_null, which compares the objects referenced by two pointers or iterators. The less_ptr_no_null object receives in its constructor the comparison object between two objects. This comparison object makes sorting an index trivial.
+
+
+ //---------------------------------------------------------------------------
+ /// @class less_ptr_no_null
+ ///
+ /// @remarks this is the comparison object for a pair of (non-null) pointers.
+ //---------------------------------------------------------------------------
+    template < class iter_t ,
+               class comp_t
+                 = std::less <typename std::iterator_traits<iter_t>::value_type> >
+ struct less_ptr_no_null
+ { //----------------------------- Variables -----------------------
+ comp_t comp ;
+ //----------------------------- Functions ----------------------
+ inline less_ptr_no_null (comp_t C1 = comp_t()): comp (C1) {};
+
+ inline bool operator () (iter_t T1, iter_t T2) const
+ { return comp (*T1 ,*T2);
+ };
+ };
+
+
+In this example, there are structures sorted by the num field, but we create one index with the elements sorted by name and another with the elements sorted by weight. The program creates the less_ptr_no_null comparison objects, sorts the indices, and then prints the results.
+
+
+    #include <cstdint>
+    #include <iostream>
+    #include <string>
+    #include <vector>
+
+    #include <boost/sort/parallel/sort.hpp>
+
+    using namespace std;
+    namespace bs_sort = boost::sort::parallel;
+    using bs_sort::less_ptr_no_null;
+
+    struct member
+    {   uint32_t num;
+        std::string name;
+        float weight;
+    };
+    typedef vector<member>::iterator iter_t;
+
+    // Comparison objects, one for each field
+    struct cmp_num
+    {   bool operator() (const member &m1, const member &m2) const
+        { return (m1.num < m2.num); }
+    };
+
+    struct cmp_name
+    {   bool operator() (const member &m1, const member &m2) const
+        { return (m1.name < m2.name); }
+    };
+
+    struct cmp_weight
+    {   bool operator() (const member &m1, const member &m2) const
+        { return (m1.weight < m2.weight); }
+    };
+
+    ostream & operator << (ostream & out, const member &m)
+    {   out<<m.num<<" - "<<m.name<<" - "<<m.weight<<endl;
+        return out;
+    }
+
+    int main (void)
+    {   // The data is stored sorted by num
+        vector<member> VM = { {1, "Peter", 85.6}, {2, "Hanna", 63.4},
+                              {3, "John", 83.6}, {4, "Elsa", 56.6} };
+
+        // Each index holds one iterator for each element of VM
+        vector<iter_t> Ix_name, Ix_weight;
+        for (iter_t it = VM.begin(); it != VM.end(); ++it)
+        {   Ix_name.push_back (it);
+            Ix_weight.push_back(it);
+        }
+
+        typedef less_ptr_no_null<iter_t, cmp_name> compare_name ;
+        typedef less_ptr_no_null<iter_t, cmp_weight> compare_weight ;
+
+        bs_sort::sort (Ix_name.begin(), Ix_name.end(), compare_name ());
+        bs_sort::sort (Ix_weight.begin(), Ix_weight.end(), compare_weight());
+
+        cout<<"Printing sorted by number \n";
+        for (auto it = VM.begin(); it != VM.end(); ++it) cout<<(*it);
+
+        cout<<"Printing sorted by name \n";
+        for (auto it = Ix_name.begin(); it != Ix_name.end(); ++it) cout<<(*(*it));
+
+        cout<<"Printing sorted by weight \n";
+        for (auto it = Ix_weight.begin(); it != Ix_weight.end(); ++it) cout<<(*(*it));
+
+        return 0;
+    }
+
+
+The output of the program is
+
+
+ Printing sorted by number
+ 1 - Peter - 85.6
+ 2 - Hanna - 63.4
+ 3 - John - 83.6
+ 4 - Elsa - 56.6
+ Printing sorted by name
+ 4 - Elsa - 56.6
+ 2 - Hanna - 63.4
+ 3 - John - 83.6
+ 1 - Peter - 85.6
+ Printing sorted by weight
+ 4 - Elsa - 56.6
+ 2 - Hanna - 63.4
+ 3 - John - 83.6
+ 1 - Peter - 85.6
+
+[endsect] [/[section 2.3.- less_ptr_no_null] ]
+
+[endsect] [/section 2.- Algorithms ]
+
+[section 3.- Benchmarks ]
+
+The goal of the benchmarks is to give a first approximation of the performance of the algorithms. The performance can vary widely depending on the machine and its characteristics, such as processing power, cache size, memory bandwidth, number of cores, and so on.
+
+There is another repository ([@https://github.com/fjtapia/sort_parallel_benchmark]) with all the code, instructions and scripts to compile and run the benchmarks. This repository doesn't belong to Boost because it contains non-free code, such as TBB and Microsoft PPL, used to compare speed and memory usage with Boost Sort Parallel.
+
+Each algorithm has an "optimal" relation with the characteristics of the machine. For example, if you change to another machine with the same processor and cores, but with better memory bandwidth, some algorithms benefit more than others.
+
+The invariant characteristics of an algorithm are associated with its internal design, which conditions its memory usage and performance. For example, GCC Parallel Sort, with many cores, is faster than Threading Building Blocks (TBB) because it divides the work better between the cores, but TBB uses half of the memory needed by GCC Parallel Sort.
+
+On processors with Hyper-Threading enabled, Boost Parallel Sort is usually faster than GCC Parallel Sort. On machines with Hyper-Threading disabled, GCC Parallel Sort is faster than Boost Parallel Sort.
+
+[h3 Description]
+
+The benchmarks of these algorithms try to measure the speed in a wide range of cases, to provide useful information in all situations.
+There are 3 benchmarks:
+
+# Sort of 100000000 randomly generated uint64_t numbers. This benchmark shows the speed with small elements and a very fast comparison.
+
+# Sort of 10000000 randomly filled strings. The comparison is not as cheap as with integers.
+
+# Sort of objects of several sizes. The objects are arrays of 64-bit numbers, randomly filled. We check arrays of 1, 2, 4, 8, 16, 32 and 64 numbers.
+
+[table Objects
+[[Definition of the object][Bytes][Number of elements to sort]]
+[[uint64_t [1] ][8][100 000 000 ]]
+[[uint64_t [2] ][16][50 000 000 ]]
+[[uint64_t [4] ][32][25 000 000 ]]
+[[uint64_t [8] ][64][12 500 000 ]]
+[[uint64_t [16] ][128][6 250 000 ]]
+[[uint64_t [32] ][256][3 125 000 ]]
+[[uint64_t [64] ][512][1 562 500 ]]
+]
+
+    template <uint32_t NN>
+    struct int_array
+    {   uint64_t M[NN];
+    };
+
+
+
+The comparison between objects can be done in two ways:
+
+* Heavy comparison: The comparison uses the sum of all the numbers in the array. The sum is computed in each comparison.
+* Light comparison: The comparison uses only the first number of the array, as a key in a register.
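+
+The two comparison modes can be sketched as comparison objects over the int_array type defined above. This is a minimal illustration, not the benchmark's own code: the names heavy_less, light_less, sort_heavy and sort_light are hypothetical, and std::sort stands in for any sort that accepts a comparison object.
+
```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Benchmark object, repeated from the text so the sketch is self-contained.
template <std::uint32_t NN>
struct int_array
{
    std::uint64_t M[NN];
};

// The largest object in the table: 64 numbers of 8 bytes each.
static_assert(sizeof(int_array<64>) == 512, "expected a 512-byte object");

// Heavy comparison (hypothetical name): orders by the sum of all the numbers.
// The sums are recomputed on every call and wrap modulo 2^64 on overflow.
struct heavy_less
{
    template <std::uint32_t NN>
    bool operator()(const int_array<NN> &a, const int_array<NN> &b) const
    {
        std::uint64_t sum_a = 0, sum_b = 0;
        for (std::uint32_t i = 0; i < NN; ++i)
        {
            sum_a += a.M[i];
            sum_b += b.M[i];
        }
        return sum_a < sum_b;
    }
};

// Light comparison (hypothetical name): only the first number is the key.
struct light_less
{
    template <std::uint32_t NN>
    bool operator()(const int_array<NN> &a, const int_array<NN> &b) const
    {
        return a.M[0] < b.M[0];
    }
};

// Convenience wrappers showing how the comparison objects are passed to a sort.
template <std::uint32_t NN>
void sort_heavy(std::vector<int_array<NN>> &v)
{
    std::sort(v.begin(), v.end(), heavy_less());
}

template <std::uint32_t NN>
void sort_light(std::vector<int_array<NN>> &v)
{
    std::sort(v.begin(), v.end(), light_less());
}
```
+
+The heavy comparison reads every number of both objects on each call, so its cost grows with the object size, while the light comparison reads one number regardless of the size.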
+
+
+
+[section 3.1.- Linux 64 GCC 5.2 Benchmarks]
+
+The benchmarks were run on a machine with an I7 5820 3.3 GHz processor (6 cores, 12 threads) and quad-channel memory (2133 MHz), using Ubuntu and the GCC 5.2 compiler.
+
+
+[section 3.1.1.- Single Thread Algorithms]
+
+The algorithms involved in this benchmark are:
+
+[table
+[[Algorithm][Stable][Memory used][Comments]]
+[[GCC sort][no][N + Log N][]]
+[[Boost sort][no][N + Log N][]]
+[[GCC stable_sort][yes][N + N / 2][]]
+[[Boost stable_sort][yes][N + N / 2][]]
+[[Boost spreadsort][yes][N + Log N][Extremely fast algorithm, only for integers, floats and strings]]
+]
+
+[h4 Integer Benchmark: sort of 100 000 000 64-bit numbers, randomly filled]
+
+[table
+[[Algorithm][Time][Memory]]
+[[GCC sort][8.33 secs][784 MB]]
+[[Boost sort][8.11 secs][784 MB]]
+[[GCC stable sort][8.69 secs][1176 MB]]
+[[Boost stable sort][8.75 secs][1175 MB]]
+[[Boost Spreadsort][4.33 secs][784 MB]]
+]
+
+[h4 Strings Benchmark: sort of 10 000 000 strings, randomly filled]
+
+[table
+[[Algorithm][Time][Memory]]
+[[GCC sort][6.39 secs][820 MB]]
+[[Boost sort][7.01 secs][820 MB]]
+[[GCC stable sort][12.99 secs][1132 MB]]
+[[Boost stable sort][9.17 secs][976 MB]]
+[[Boost Spreadsort][2.44 secs][820 MB]]
+]
+
+[h4 Objects Benchmark]
+
+Sorting of objects of different sizes. The objects are arrays of 64-bit numbers. This benchmark is done using two kinds of comparison. Times are in seconds.
+
+[*Heavy comparison]: The comparison uses the sum of all the numbers in the array. The sum is computed in each comparison.
+
+[table
+[[Algorithm][8 bytes][16 bytes][32 bytes][64 bytes][128 bytes][256 bytes][512 bytes][Memory used]]
+[[GCC sort][8.75][4.49][3.03][1.97][1.71][1.37][1.17][783 MB]]
+[[Boost sort][8.19][4.42][2.65][1.91][1.67][1.35][1.09][783 MB]]
+[[GCC stable_sort][10.23][5.67][3.67][2.94][2.6][2.49][2.34][1174 MB]]
+[[Boost stable_sort][8.85][5.11][3.18][2.41][2.01][1.86][1.60][1174 MB]]
+]
+
+
+[*Light comparison]: The comparison uses only the first number of the array, as a key in a register.
+
+[table
+[[Algorithm][8 bytes][16 bytes][32 bytes][64 bytes][128 bytes][256 bytes][512 bytes][Memory used]]
+[[GCC sort][8.69][4.31][2.35][1.50][1.23][0.86][0.79][783 MB]]
+[[Boost sort][8.18][4.04][2.25][1.45][1.24][0.88][0.76][783 MB]]
+[[GCC stable_sort][10.34][5.26][3.20][2.57][2.47][2.41][2.30][1174 MB]]
+[[Boost stable_sort][8.92][4.59][2.51][1.94][1.68][1.68][1.50][1174 MB]]
+]
+
+[endsect][/section 3.1.1.- Single Thread Algorithms]
+
+[section 3.1.2.- Parallel Algorithms]
+
+The algorithms involved in this benchmark are:
+
+[table
+[[Algorithm][Stable][Memory used][Comments]]
+[[GCC parallel sort][No][2 N][Based on OpenMP]]
+[[TBB parallel sort][No][N + LogN][]]
+[[Boost parallel sort][No][N + block_size * num threads][New parallel algorithm]]
+[[GCC parallel stable sort][Yes][2 N][Based on OpenMP]]
+[[Boost parallel stable sort][Yes][N + N / 2][]]
+[[Boost sample sort][Yes][2 N][]]
+[[TBB parallel stable sort][Yes][2 N][Experimental code, not part of the official TBB release]]
+]
+
+The block_size is an internal parameter of the algorithm which, in order to achieve the highest speed, changes with the size of the objects to sort, as shown in the next table. Strings use a block_size of 128.
+
+[table BlockSize
+[[object size (bytes)] [1 - 15][16 - 31][32 - 63][64 - 127][128 - 255][256 - 511][512 -]]
+[[block_size (number of elements)] [4096] [2048] [1024][768][512][256][128]]
+]
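+
+The BlockSize table can be read as a simple lookup on the object size. The helper below is a hypothetical sketch (block_size_for and aux_buffer_bytes are not library functions). It assumes the ranges are exactly those of the table and that the auxiliary buffers hold block_size elements per thread, as the "Memory used" entry for Boost parallel sort states. Measured memory may include other overhead.
+
```cpp
#include <cassert>
#include <cstddef>

// Hypothetical lookup mirroring the BlockSize table: object size in bytes
// to block_size in number of elements.
inline std::size_t block_size_for(std::size_t object_bytes)
{
    if (object_bytes < 16)  return 4096;
    if (object_bytes < 32)  return 2048;
    if (object_bytes < 64)  return 1024;
    if (object_bytes < 128) return 768;
    if (object_bytes < 256) return 512;
    if (object_bytes < 512) return 256;
    return 128; // 512 bytes and above
}

// Rough size in bytes of the per-thread buffers, assuming block_size
// elements per thread and nothing else.
inline std::size_t aux_buffer_bytes(std::size_t object_bytes,
                                    std::size_t num_threads)
{
    return block_size_for(object_bytes) * num_threads * object_bytes;
}
```
+
+Under these assumptions, 512-byte objects and 12 threads give 128 * 12 * 512 = 786 432 bytes of buffers, which is small compared with the data being sorted.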
+
+The benchmark uses the following additional code:
+
+* Threading Building Blocks (TBB)
+* OpenMP
+* Threading Building Blocks experimental code ([@https://software.intel.com/sites/default/files/managed/48/9b/parallel_stable_sort.zip])
+
+
+The most significant result of this parallel benchmark is the comparison between the parallel sort algorithms. GCC parallel sort is extremely fast with many cores, but needs auxiliary memory of the same size as the data. Threading Building Blocks (TBB), on the other hand, is not as fast with many cores, but its auxiliary memory is LogN.
+
+The Boost Parallel Sort (internally named Block Indirect Sort) is a new algorithm created and implemented by the author for this library. It combines the speed of GCC Parallel Sort with a small memory consumption (block_size elements for each thread). The worst case for this algorithm is very big elements and many threads. With big elements (512 bytes) and 12 threads, the memory measured was:
+
+# GCC Parallel Sort (OpenMP) 1565 MB
+# Threading Building Blocks (TBB) 783 MB
+# Block Indirect Sort 812 MB
+
+On machines with a small number of HW threads, TBB is faster than GCC, but with a large number of HW threads GCC is faster than TBB. Boost Parallel Sort has a speed similar to GCC Parallel Sort with a large number of HW threads, and similar to TBB with a small number. A brief description of the algorithm and a detailed description are also available.
+
+[h4 Integer Benchmark: sort of 100 000 000 64-bit numbers, randomly filled]
+
+[table
+[[Algorithm][time (secs)][memory (MB)]]
+[[OMP parallel_sort][1.25][1560]]
+[[TBB parallel_sort][1.64][783]]
+[[Boost parallel_sort][1.08][786]]
+[[OMP parallel_stable_sort][1.56][1948]]
+[[TBB parallel_stable_sort][1.56][1561]]
+[[Boost sample_sort][1.19][1565]]
+[[Boost parallel_stable_sort][1.54][1174]]
+]
+
+[h4 Strings Benchmark: sort of 10 000 000 strings, randomly filled]
+
+[table
+[[Algorithm][time (secs)][memory (MB)]]
+[[OMP parallel_sort][1.49][2040]]
+[[TBB parallel_sort][1.84][820]]
+[[Boost parallel_sort][1.30][822]]
+[[OMP parallel_stable_sort][2.25][2040]]
+[[TBB parallel_stable_sort][2.10][1131]]
+[[Boost sample_sort][1.51][1134]]
+[[Boost parallel_stable_sort][2.10][977]]
+]
+
+[h4 Objects Benchmark]
+
+Sorting of objects of different sizes. The objects are arrays of 64-bit numbers. This benchmark is done using two kinds of comparison. Times are in seconds.
+
+
+[*Heavy comparison]: The comparison uses the sum of all the numbers in the array. The sum is computed in each comparison.
+
+[table
+[[Algorithm][8 bytes][16 bytes][32 bytes][64 bytes][128 bytes][256 bytes][512 bytes][Memory used]]
+[[OMP parallel_sort][1.27][0.72][0.56][0.45][0.41][0.39][0.32][1565 MB]]
+[[TBB parallel_sort][1.63][0.8][0.56][0.5][0.44][0.39][0.32][783 MB]]
+[[Boost parallel_sort][1.13][0.67][0.53][0.47][0.43][0.41][0.34][812 MB]]
+[[OMP parallel_stable_sort][1.62][1.38][1.23][1.19][1.09][1.07][0.97][1954 MB]]
+[[TBB parallel_stable_sort][1.58][1.02][0.81][0.76][0.73][0.73][0.71][1566 MB]]
+[[Boost sample_sort][1.15][0.79][0.63][0.62][0.62][0.61][0.6][1566 MB]]
+[[Boost parallel_stable_sort][1.58][1.02][0.8][0.76][0.73][0.73][0.71][1175 MB]]
+]
+
+[*Light comparison]: The comparison uses only the first number of the array, as a key in a register.
+
+[table
+[[Algorithm][8 bytes][16 bytes][32 bytes][64 bytes][128 bytes][256 bytes][512 bytes][Memory used]]
+[[OMP parallel_sort][1.24][0.71][0.48][0.41][0.38][0.35][0.32][1565 MB]]
+[[TBB parallel_sort][1.66][0.8][0.52][0.43][0.4][0.35][0.32][783 MB]]
+[[Boost parallel_sort][1.11][0.65][0.49][0.43][0.41][0.37][0.34][812 MB]]
+[[OMP parallel_stable_sort][1.55][1.36][1.23][1.18][1.09][1.07][0.97][1954 MB]]
+[[TBB parallel_stable_sort][1.58][0.91][0.75][0.72][0.71][0.72][0.71][1566 MB]]
+[[Boost sample_sort][1.16][0.74][0.63][0.62][0.61][0.61][0.6][1566 MB]]
+[[Boost parallel_stable_sort][1.56][0.91][0.75][0.72][0.72][0.72][0.71][1175 MB]]
+]
+
+[endsect][/section 3.1.2.- Parallel Algorithms]
+[endsect][/section 3.1.- Linux 64 GCC 5.2 Benchmarks]
+
+[section 3.2.- Windows 10 Visual Studio 2015 64 bits Benchmarks]
+
+The benchmarks were run in a virtual machine with Windows 10 and 10 threads, on an I7 5820 3.3 GHz processor, using the Visual Studio 2015 C++ compiler.
+
+[section 3.2.1.- Single Thread Algorithms]
+
+The algorithms involved in this benchmark are:
+
+[table
+[[Algorithm][Stable][Memory used][Comments]]
+[[std::sort][no][N + Log N][]]
+[[Boost sort][no][N + Log N][]]
+[[std::stable_sort][yes][N + N / 2][]]
+[[Boost stable_sort][yes][N + N / 2][]]
+[[Boost spreadsort][yes][N + Log N][Extremely fast algorithm, only for integers, floats and strings]]
+]
+
+[h4 Integer Benchmark: sort of 100 000 000 64-bit numbers, randomly filled]
+
+[table
+[[Algorithm][time (secs)][memory (MB)]]
+[[std::sort][13][763]]
+[[Boost sort][10.74][763]]
+[[std::stable_sort][14.94][1144]]
+[[Boost stable_sort][13.37][1144]]
+[[Boost spreadsort][9.58][763]]
+]
+
+[h4 Strings Benchmark: sort of 10 000 000 strings, randomly filled]
+
+[table
+[[Algorithm][time (secs)][memory (MB)]]
+[[std::sort][13.3][862]]
+[[Boost sort][13.6][862]]
+[[std::stable_sort][26.99][1015]]
+[[Boost stable_sort][20.64][1015]]
+[[Boost spreadsort][5.7][862]]
+]
+
+
+[h4 Objects Benchmark]
+
+Sorting of objects of different sizes. The objects are arrays of 64-bit numbers. This benchmark is done using two kinds of comparison. Times are in seconds.
+
+[*Heavy comparison]: The comparison uses the sum of all the numbers in the array. The sum is computed in each comparison.
+
+[table
+[[Algorithm][8 bytes][16 bytes][32 bytes][64 bytes][128 bytes][256 bytes][512 bytes][Memory used]]
+[[std::sort][13.36][6.98][4.2][2.58][2.87][2.37][2.29][763 MB]]
+[[Boost sort][10.54][5.61][3.26][2.72][2.45][1.76][1.73][763 MB]]
+[[std::stable_sort][15.49][8.47][5.47][3.97][3.85][3.55][2.99][1144 MB]]
+[[Boost stable_sort][13.11][8.86][5.06][4.16][3.9][3.06][3.32][1144 MB]]
+]
+
+[*Light comparison]: The comparison uses only the first number of the array, as a key in a register.
+
+[table
+[[Algorithm][8 bytes][16 bytes][32 bytes][64 bytes][128 bytes][256 bytes][512 bytes][Memory used]]
+[[std::sort][14.15][7.26][4.33][2.69][1.92][1.98][1.73][763 MB]]
+[[Boost sort][10.33][5][2.99][1.85][1.53][1.46][1.4][763 MB]]
+[[std::stable_sort][14.68][7.64][4.29][3.33][3.22][2.86][3.08][1144 MB]]
+[[Boost stable_sort][13.59][8.36][4.45][3.73][3.16][2.81][2.6][1144 MB]]
+]
+[endsect][/section 3.2.1.- Single Thread Algorithms]
+
+[section 3.2.2.- Parallel Algorithms]
+
+The algorithms involved in this benchmark are:
+
+[table
+[[Algorithm][Stable][Memory used][Comments]]
+[[PPL parallel sort][No][N][]]
+[[PPL parallel buffered sort][No][2 N][]]
+[[Boost parallel sort][No][N + block_size * num threads][New parallel algorithm]]
+[[Boost parallel stable sort][Yes][N + N / 2][]]
+[[Boost sample sort][Yes][2 N][]]
+]
+A brief description of the new Boost parallel sort algorithm and a detailed description are also available.
+
+[h4 Integer Benchmark: sort of 100 000 000 64-bit numbers, randomly filled]
+
+[table
+[[Algorithm][time (secs)][memory (MB)]]
+[[PPL parallel sort][3.11][764 ]]
+[[PPL parallel buffered sort][1.74][1527]]
+[[Boost parallel sort][2.1][764]]
+[[Boost sample sort][2.78][1511]]
+[[Boost parallel stable sort][3.3][1145]]
+]
+
+[h4 Strings Benchmark: sort of 10 000 000 strings, randomly filled]
+
+[table
+[[Algorithm][time (secs)][memory (MB)]]
+[[PPL parallel sort][3.76][864]]
+[[PPL parallel buffered sort][3.77][1169]]
+[[Boost parallel sort][3.41][866]]
+[[Boost sample sort][3.74][1168]]
+[[Boost parallel stable sort][5.7][1015]]
+]
+
+[h4 Objects Benchmark]
+
+Sorting of objects of different sizes. The objects are arrays of 64-bit numbers. This benchmark is done using two kinds of comparison. Times are in seconds.
+
+[*Heavy comparison]: The comparison uses the sum of all the numbers in the array. The sum is computed in each comparison.
+
+[table
+[[Algorithm][8 bytes][16 bytes][32 bytes][64 bytes][128 bytes][256 bytes][512 bytes][Memory used]]
+[[PPL parallel sort][2.84][1.71][1.01][0.84][0.89][0.77][0.65][764 MB]]
+[[PPL parallel buffered sort][2.2][1.29][2][0.88][0.98][1.32][0.82][1527 MB]]
+[[Boost parallel sort][1.93][0.82][0.9][0.72][0.77][0.68][0.69][764 MB]]
+[[Boost sample sort][3.02][2.03][2.15][1.41][1.55][1.82][1.39][1526 MB]]
+[[Boost parallel stable sort][3.36][2.67][1.62][1.45][1.38][1.19][1.37][1145 MB]]
+]
+
+[*Light comparison]: The comparison uses only the first number of the array, as a key in a register.
+
+[table
+[[Algorithm][8 bytes][16 bytes][32 bytes][64 bytes][128 bytes][256 bytes][512 bytes][Memory used]]
+[[PPL parallel sort][3.1][1.37][0.97][0.7][0.61][0.58][0.57][764 MB]]
+[[PPL parallel buffered sort][2.31][1.39][0.9][0.88][1.1][0.89][1.44][1527 MB]]
+[[Boost parallel sort][2.15][1.21][0.7][0.72][0.41][0.51][0.54][764 MB]]
+[[Boost sample sort][3.4][1.94][1.56][1.41][2][1.41][1.96][1526 MB]]
+[[Boost parallel stable sort][3.56][2.37][1.79][1.45][1.72][1.34][1.44][1145 MB]]
+]
+
+[endsect][/section 3.2.2.- Parallel Algorithms]
+[endsect][/section 3.2.- Windows 10 Visual Studio 2015 64 bits Benchmarks]
+[endsect][/section 3.- Benchmarks]
+[br]
+[section 4.- Bibliography]
+
+
+[01] Introduction to Algorithms, 3rd Edition (Thomas H. Cormen, Charles E. Leiserson, Ronald L.
+Rivest, Clifford Stein)
+
+[02] C++ STL Sort Algorithms
+
+[03] Algorithms + Data Structures = Programs (Niklaus Wirth), Prentice Hall Series in Automatic Computation
+
+[04] Structured Parallel Programming: Patterns for Efficient Computation (Michael McCool, James Reinders, Arch Robison)
+
+[endsect][/section 4.- Bibliography]
+
+[section 5.- Gratitude]
+To CESVIMA ([@http://www.cesvima.upm.es/]), Centro de Cálculo de la Universidad Politécnica de
+Madrid. When I needed machines to tune this algorithm, I contacted the research departments of
+many universities in Madrid. Only they helped me.
+[br][br]
+To Hartmut Kaiser, Adjunct Professor of Computer Science at Louisiana State University, for his faith in my work.
+[br][br]
+To Steven Ross, for his infinite patience throughout the long development of this algorithm, and for his wise
+advice.
+[br][br]
+[endsect][/section 5.- Gratitude]
+
+[/
+ Copyright (c) 2017 Francisco Tapia
+ Distributed under the Boost Software License,
+ Version 1.0. (See accompanying file LICENSE_1_0.txt
+ or copy at http://boost.org/LICENSE_1_0.txt)
+]
+
diff --git a/doc/parallel_stable_sort.qbk b/doc/parallel_stable_sort.qbk
new file mode 100644
index 0000000..5e4f25f
--- /dev/null
+++ b/doc/parallel_stable_sort.qbk
@@ -0,0 +1,58 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:parallel_stable_sort 3.3.- Parallel_stable_sort]
+
+This algorithm is based on the [@https://en.wikipedia.org/wiki/Samplesort Samplesort] algorithm, but uses half of the memory used by samplesort.
+It was designed and implemented by Francisco Tapia for the Boost Library.
+
+[table AlgorithmDescription
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[parallel_stable_sort] [Yes] [Yes] [N / 2 ] [N, NlogN , NlogN]]
+]
+
+This algorithm [*does not use any other library or utility]. Compiling this library requires a
+[*C++11 compliant compiler]. It does not need to be linked with any external static or dynamic library.
+
+The algorithm uses a [*comparison object], in the same way as the
+standard library sort algorithms. If no comparison object is specified,
+it defaults to std::less, which uses the < operator internally for
+comparisons.
+
+
+The algorithm is [*exception safe], meaning that when the algorithm itself throws an exception,
+the integrity of the objects to sort is preserved, but not their relative order. If the exception
+is thrown inside the objects (in the move or copy constructor, for example), the results can be
+unpredictable.
+
+You only need to include the file boost/sort/parallel/sort.hpp to use this algorithm.
+
+``
+ #include <boost/sort/parallel/sort.hpp>
+
+
+ template <class iter_t>
+ void parallel_stable_sort (iter_t first, iter_t last);
+
+ template <class iter_t, typename compare>
+ void parallel_stable_sort (iter_t first, iter_t last, compare comp);
+
+ template <class iter_t>
+ void parallel_stable_sort (iter_t first, iter_t last, uint32_t num_thread);
+
+ template <class iter_t, typename compare>
+ void parallel_stable_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
+
+``
+The algorithm runs in the namespace boost::sort.
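+
+The sketch below illustrates what a stable sort with a comparison object guarantees. It is an illustration only: the record type and the names key_less and sort_records are hypothetical, and std::stable_sort is used as a stand-in so that the example builds without Boost. With the library available, the call would follow the signatures listed above.
+
```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record type used only for this illustration.
struct record
{
    std::uint32_t key;
    std::string   tag;
};

// Comparison object: orders records by key only and ignores the tag.
struct key_less
{
    bool operator()(const record &a, const record &b) const
    {
        return a.key < b.key;
    }
};

// Sorts by key while keeping records with equal keys in their input order.
// std::stable_sort is a stand-in here; per the signatures in the text, the
// library call would have the form
//   boost::sort::parallel_stable_sort(v.begin(), v.end(), key_less(), 4);
// where 4 is the number of threads.
inline void sort_records(std::vector<record> &v)
{
    std::stable_sort(v.begin(), v.end(), key_less());
}
```
+
+Because the comparison looks only at the key, a stable algorithm keeps records with the same key in the order in which they arrived. A non-stable parallel sort gives no such guarantee.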
+
+[endsect]
+
+
+
diff --git a/doc/pdqsort.qbk b/doc/pdqsort.qbk
new file mode 100644
index 0000000..78ba61f
--- /dev/null
+++ b/doc/pdqsort.qbk
@@ -0,0 +1,17 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:pdqsort 2.2.- pdqsort]
+
+your text here
+
+[endsect]
+
+
+
diff --git a/doc/sample_sort.qbk b/doc/sample_sort.qbk
new file mode 100644
index 0000000..fd1d792
--- /dev/null
+++ b/doc/sample_sort.qbk
@@ -0,0 +1,62 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:sample_sort 3.2.- Sample_Sort]
+
+This is an implementation of the [@https://en.wikipedia.org/wiki/Samplesort Samplesort] algorithm by Francisco Tapia for the Boost Library.
+
+[table AlgorithmDescription
+[[Algorithm] [Parallel] [Stable][Additional Memory] [Best, average, and worst case]]
+[[sample_sort] [Yes] [Yes] [N] [N, NlogN , NlogN]]
+]
+
+This algorithm [*does not use any other library or utility]. Compiling this library requires a
+[*C++11 compliant compiler]. It does not need to be linked with any external static or dynamic library.
+
+The algorithm uses a [*comparison object], in the same way as the
+standard library sort algorithms. If no comparison object is specified,
+it defaults to std::less, which uses the < operator internally for
+comparisons.
+
+
+The algorithm is [*exception safe], meaning that when the algorithm itself throws an exception,
+the integrity of the objects to sort is preserved, but not their relative order. If the exception
+is thrown inside the objects (in the move or copy constructor, for example), the results can be
+unpredictable.
+
+
+You only need to include the file boost/sort/parallel/sort.hpp to use this algorithm.
+
+``
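+ #include <boost/sort/parallel/sort.hpp>
+
+
+ template <class iter_t>
+ void sample_sort (iter_t first, iter_t last);
+
+ template <class iter_t, typename compare>
+ void sample_sort (iter_t first, iter_t last, compare comp);
+
+ template <class iter_t>
+ void sample_sort (iter_t first, iter_t last, uint32_t num_thread);
+
+ template <class iter_t, typename compare>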
+ void sample_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
+
+``
+The algorithm runs in the namespace boost::sort.
+
+This is a parallel stable sort algorithm. It is faster than parallel_stable_sort, but it uses auxiliary memory of N elements.
+You can see the details in the benchmark chapter.
+
+
+[endsect]
+
+
+
diff --git a/doc/single_thread.qbk b/doc/single_thread.qbk
new file mode 100644
index 0000000..d224817
--- /dev/null
+++ b/doc/single_thread.qbk
@@ -0,0 +1,23 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:single_thread 2.- Single Threaded Algorithms]
+
+your text here
+
+[include spreadsort.qbk]
+[include pdqsort.qbk]
+[include spinsort.qbk]
+[include flat_stable_sort.qbk]
+[include linux_single.qbk]
+[include windows_single.qbk]
+[endsect]
+
+
+
diff --git a/doc/sort.qbk b/doc/sort.qbk
index 0b3ebe6..e2829f7 100644
--- a/doc/sort.qbk
+++ b/doc/sort.qbk
@@ -1,8 +1,8 @@
[library Boost.Sort
[quickbook 1.7]
[id sort]
- [copyright 2014-2017 Steven Ross and Francisco Tapia]
- [authors [Ross, Steven] [Tapia, Francisco]]
+ [copyright 2014-2017 Steven Ross, Francisco Tapia, Orson Peters]
+ [authors [Ross, Steven] [Tapia, Francisco] [Peters, Orson]]
[dirname sort]
[license Distributed under the
[@http://boost.org/LICENSE_1_0.txt Boost Software License, Version 1.0].
@@ -20,918 +20,15 @@
[import ../../../tools/auto_index/include/auto_index_helpers.qbk]
[/ Must be first included file!]
-[/Files containing quickbook snippets]
-[import ../example/charstringsample.cpp]
-[import ../example/stringfunctorsample.cpp]
-[import ../example/reverseintsample.cpp]
-[import ../example/rightshiftsample.cpp]
-[import ../example/int64.cpp]
-[import ../example/floatfunctorsample.cpp]
-[import ../example/generalizedstruct.cpp]
-
-[import html4_symbols.qbk] [/ Provides various useful squiggles.]
-
-[def __spreadsort [@http://en.wikipedia.org/wiki/Spreadsort spreadsort]]
-[def __introsort [@http://en.wikipedia.org/wiki/Introsort introsort]]
-[def __stl_sort [@http://www.cplusplus.com/reference/algorithm/sort/ STL std::sort]]
-[def __big_o [@http://en.wikipedia.org/wiki/Big_O_notation Big O notation]]
-[def __radix_sort[@http://en.wikipedia.org/wiki/Radix_sort radix sort]]
-[def __adaptive_left_reflex [@http://www.nik.no/2002/Maus.pdf Arne Maus, Adaptive Left Reflex]]
-[def __american_flag [@http://en.wikipedia.org/wiki/American_flag_sort American flag sort]]
-[def __overloading [link sort.overview.overloading overloading]]
-[def __lookup [link sort.rationale.lookup lookup]]
-[def __random_access_iter [@http://en.cppreference.com/w/cpp/concept/RandomAccessIterator RandomAccessIterator]]
-[def __strictweakordering [@http://en.wikipedia.org/wiki/Weak_ordering#Strict_weak_orderings strict weak ordering]]
-
-[/ Links to functions for use in text]
-[def __integer_sort [^[funcref boost::sort::spreadsort::integer_sort integer_sort]]]
-[def __float_sort [^[funcref boost::sort::spreadsort::float_sort float_sort]]]
-[def __string_sort [^[funcref boost::sort::spreadsort::string_sort string_sort]]]
-[def __spreadsort [^[funcref boost::sort::spreadsort::spreadsort spreadsort]]] [/Note diff from Wiki link __spreadsort]
-[def __std_sort [@http://en.cppreference.com/w/cpp/algorithm/sort std::sort]]
-
-[section:introduction Introduction]
-
-TODO(Francisco): Insert overall library description here.
-
-[endsect] [/section:introduction Introduction]
-
-[section:single_threaded Single Threaded Algorithms]
-
-TODO(Francisco): Add other single-threaded algorithms here.
-
-[section:sort_hpp Spreadsort]
-
-[section:spreadsort_overview Spreadsort Overview]
-
-[section:spreadsort_intro Spreadsort Introduction]
-
-The Boost.Sort library provides a generic implementation of high-speed sorting algorithms
-that outperform those in the C++ standard in both average and worst case performance
-when there are over 1000 elements in the list to sort.
-
-They fall back to __stl_sort on small data sets.
-
-[warning These algorithms all only work on
-[@http://www.cplusplus.com/reference/iterator/RandomAccessIterator/ random access iterators].]
-
-They are hybrids using both radix and comparison-based sorting,
-specialized to sorting common data types, such as integers, floats, and strings.
-
-These algorithms are encoded in a generic fashion and accept functors,
-enabling them to sort any object that can be processed like these basic data types.
-In the case of __string_sort, this includes anything
-with a defined strict-weak-ordering that __std_sort can sort,
-but writing efficient functors for some complex key types
-may not be worth the additional effort relative to just using __std_sort,
-depending on how important speed is to your application.
-Sample usages are available in the example directory.
-
-Unlike many radix-based algorithms,
-the underlying __spreadsort algorithm is designed around [*worst-case performance].
-It performs better on chunky data (where it is not widely distributed),
-so that on real data it can perform substantially better than on random data.
-Conceptually, __spreadsort can sort any data for which an absolute ordering can be determined,
-and __string_sort is sufficiently flexible that this should be possible.
-
-Situations where __spreadsort is fastest relative to __std_sort:
-
-# Large number of elements to sort (['N] >= 10000).
-
-# Slow comparison function (such as floating-point numbers on x86 processors or strings).
-
-# Large data elements (such as key + data sorted on a key).
-
-# Completely sorted data when __spreadsort has an optimization to quit early in this case.
-
-Situations where __spreadsort is slower than __std_sort:
-
-# Data sorted in reverse order. Both __std_sort and __spreadsort are faster
-on reverse-ordered data than randomized data,
-but __std_sort speeds up more in this special case.
-
-# Very small amounts of data (< 1000 elements).
-For this reason there is a fallback in __spreadsort to __std_sort
-if the input size is less than 1000,
-so performance is identical for small amounts of data in practice.
-
-These functions are defined in `namespace boost::sort::spreadsort`.
-
-[endsect] [/section:spreadsort_intro Spreadsort Introduction]
-
-[section:overloading Overloading]
-
-[tip In the Boost.Sort C++ Reference section, click on the appropriate overload, for example `float_sort(RandomAccessIter, RandomAccessIter, Right_shift, Compare);` to get full details of that overload.]
-
-Each of __integer_sort, __float_sort, and __string_sort have 3 main versions:
-The base version, which takes a first iterator and a last iterator, just like __std_sort:
-
- integer_sort(array.begin(), array.end());
- float_sort(array.begin(), array.end());
- string_sort(array.begin(), array.end());
-
-The version with an overridden shift functor, providing flexibility
-in case the `operator>>` already does something other than a bitshift.
-The rightshift functor takes two args, first the data type,
-and second a natural number of bits to shift right.
-
-For __string_sort this variant is slightly different;
-it needs a bracket functor equivalent to `operator`\[\],
-taking a number corresponding to the character offset,
-along with a second `getlength` functor to get the length of the string in characters.
-In all cases, this operator must return an integer type that compares with the
-`operator<` to provide the intended order
-(integers can be negated to reverse their order).
-
-In other words (aside from negative floats, which are inverted as ints):
-
- rightshift(A, n) < rightshift(B, n) -> A < B
- A < B -> rightshift(A, 0) < rightshift(B, 0)
-
-[rightshift_1]
-[bracket_1]
-
-See [@../../example/rightshiftsample.cpp rightshiftsample.cpp] for a working example of integer sorting with a rightshift functor.
-
-And a version with a comparison functor for maximum flexibility.
-This functor must provide the same sorting order as the integers returned by the rightshift (aside from negative floats):
-
- rightshift(A, n) < rightshift(B, n) -> compare(A, B)
- compare(A, B) -> rightshift(A, 0) < rightshift(B, 0)
-
-[reverse_int_2]
-
-Examples of functors are:
-
-[lessthan_functor]
-
-[bracket_functor]
-
-[getsize_functor]
-
-and these functors are used thus:
-
-[stringsort_functors_call]
-
-See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for a working example of sorting strings with all functors.
-
-[endsect] [/section:overloading Overloading]
-
-[section:performance Performance]
-
-The __spreadsort algorithm is a hybrid algorithm;
-when the number of elements being sorted is below a certain number,
-comparison-based sorting is used. Above it, radix sorting is used.
-The radix-based algorithm will thus cut up the problem into small pieces,
-and either completely sort the data based upon its radix if the data is clustered,
-or finish sorting the cut-down pieces with comparison-based sorting.
-
-The Spreadsort algorithm dynamically chooses
-either comparison-based or radix-based sorting when recursing,
-whichever provides better worst-case performance.
-This way worst-case performance is guaranteed to be the better of
-['[bigo](N[sdot]log2(N))] comparisons and ['[bigo](N[sdot]log2(K/S + S))] operations where
-
-* ['N] is the number of elements being sorted,
-* ['K] is the length in bits of the key, and
-* ['S] is a constant.
-
-This results in substantially improved performance for large [' N];
-__integer_sort tends to be 50% to 2X faster than __std_sort,
-while __float_sort and _string_sort are roughly 2X faster than __std_sort.
-
-Performance graphs are provided for __integer_sort, __float_sort, and __string_sort in their description.
-
-Runtime Performance comparisons and graphs were made on a Core 2 Duo laptop
-running Windows Vista 64 with MSVC 8.0,
-and an old G4 laptop running Mac OSX with gcc.
-[@http://www.boost.org/build/doc/html/ Boost bjam/b2] was used to control compilation.
-
-Direct performance comparisons on a newer x86 system running Ubuntu,
-with the fallback to __std_sort at lower input sizes disabled are below.
-
-[note The fallback to __std_sort for smaller input sizes prevents
-the worse performance seen on the left sides of the first two graphs.]
-
-__integer_sort starts to become faster than __std_sort at about 1000 integers (4000 bytes),
-and __string_sort becomes faster than __std_sort at slightly fewer bytes (as few as 30 strings).
-
-[note The 4-threaded graph has 4 threads doing [*separate sorts simultaneously]
-(not splitting up a single sort)
-as a test for thread cache collision and other multi-threaded performance issues.]
-
-__float_sort times are very similar to __integer_sort times.
-
-[/ These provide links to the images, but currently graphs are shown - see below]
-[/@../../doc/images/single_threaded.png single_threaded.png] [/file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png]
-[/@../../doc/images/4_threaded.png 4_threaded.png]
-[/@../../doc/images/entropy.png entropy.png]
-[/@../../doc/images/bits_per_byte.png bits_per_byte.png]
-
-[$../images/single_threaded.png] [/
== file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png]
-[$../images/4_threaded.png]
-[$../images/entropy.png]
-[$../images/bits_per_byte.png]
-
-Histogramming with a fixed maximum number of splits is used
-because it reduces the number of cache misses,
-thus improving performance relative to the approach described in detail
-in the [@http://en.wikipedia.org/wiki/Spreadsort original SpreadSort publication].
-
-The importance of cache-friendly histogramming is described
-in __adaptive_left_reflex,
-though without the worst-case handling described below.
-
-The time taken per radix iteration is:
-
-['[bigo](N)] iterations over the data
-
-['[bigo](N)] integer-type comparisons (even for _float_sort and __string_sort)
-
-['[bigo](N)] swaps
-
-['[bigo](2[super S])] bin operations.
-
-To obtain ['[bigo](N)] worst-case performance per iteration,
-the restriction ['S <= log2(N)] is applied, and ['[bigo](2[super S])] becomes ['[bigo](N)].
-For each such iteration, the number of unsorted bits log2(range)
-(referred to as ['K]) per element is reduced by ['S].
-As ['S] decreases depending upon the amount of elements being sorted,
-it can drop from a maximum of ['S[sub max]] to the minimum of ['S[sub min]].
-
-Assumption: __std_sort is assumed to be ['[bigo](N*log2(N))],
-as __introsort exists and is commonly used.
-(If you have a quibble with this please take it up with the implementor of your __std_sort;
-you're welcome to replace the recursive calls to __std_sort with calls
-to __introsort if your __std_sort library call is poorly implemented).
-
-[@http://en.wikipedia.org/wiki/Introsort Introsort] is not included with this algorithm for simplicity and
-because the implementor of the __std_sort call
-is assumed to know what they're doing.
-
-To maintain a minimum value for ['S (S[sub min])],
-comparison-based sorting has to be used to sort when
-['n <= log2(meanbinsize)], where ['log2(meanbinsize) (lbs)] is a small constant,
-usually between 0 and 4, used to minimize bin overhead per element.
-There is a small corner-case where if ['K < S[sub min]] and ['n >= 2^K],
-then the data can be sorted in a single radix-based iteration with an ['S = K]
-(this bucketsorting special case is by default only applied to __float_sort).
-So for the final recursion, worst-case performance is:
-
-1 radix-based iteration if ['K <= S[sub min]],
-
-or ['S[sub min] + lbs] comparison-based iterations if ['K > S[sub min]] but ['n <= 2[super (S[sub min] + lbs)]].
-
-So for the final iteration, worst-case runtime is ['[bigo](N*(S[sub min] + lbs))] but
-if ['K > S[sub min]] and ['N > 2[super (S[sub min] + lbs)]]
-then more than 1 radix recursion will be required.
-
-For the second to last iteration, ['K <= S[sub min] * 2 + 1] can be handled,
-(if the data is divided into ['2[super (S[sub min] + 1)]] pieces)
-or if ['N < 2[super (S[sub min] + lbs + 1)]],
-then it is faster to fall back to __std_sort.
-
-In the case of a radix-based sort plus recursion, it will take
-['[bigo](N*(S[sub min] + lbs)) + [bigo](N) = [bigo](N*(S[sub min] + lbs + 1))] worst-case time,
-as
-['K_remaining = K_start - (S[sub min] + 1)], and ['K_start <= S[sub min] * 2 + 1].
-
-Alternatively, comparison-based sorting is used if ['N < 2[super (S[sub min] + lbs + 1)]],
-which will take ['[bigo](N*(S[sub min] + lbs + 1))] time.
-
-So either way ['[bigo](N*(S[sub min] + lbs + 1))] is the worst-case time for the second to last iteration,
-which occurs if ['K <= S[sub min] * 2 + 1] or ['N < 2[super (S[sub min] + lbs + 1)]].
-
-This continues as long as ['S[sub min] <= S <= S[sub max]],
-so that for ['K_m <= K_(m-1) + S[sub min] + m] where ['m]
-is the maximum number of iterations after this one has finished,
-or where ['N < 2[super (S[sub min] + lbs + m)]],
-then the worst-case runtime is ['[bigo](N*(S[sub min] + lbs + m))].
-
-[space][space]['K_m] at ['m <= (S[sub max] - S[sub min])] works out to:
-
-[space][space]['K_1 <= (S[sub min]) + S[sub min] + 1 <= 2S[sub min] + 1]
-
-[space][space]['K_2 <= (2S[sub min] + 1) + S[sub min] + 2]
-
-as the sum from 0 to ['m] is ['m(m + 1)/2]
-
-[space][space]['K_m <= (m + 1)S[sub min] + m(m + 1)/2 <= (S[sub min] + m/2)(m + 1)]
-
-substituting in ['S[sub max] - S[sub min]] for ['m]
-
-[space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + (S[sub max] - S[sub min])/2)*(S[sub max] - S[sub min] + 1)]
-
-[space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + S[sub max]) * (S[sub max] - S[sub min] + 1)/2]
-
-Since this involves ['S[sub max] - S[sub min] + 1] iterations,
-this works out to dividing ['K] into an average ['(S[sub min] + S[sub max])/2]
-pieces per iteration.
-
-To finish the problem from this point takes ['[bigo](N * (S[sub max] - S[sub min]))] for ['m] iterations,
-plus the worst-case of ['[bigo](N*(S[sub min] + lbs))] for the last iteration,
-for a total of ['[bigo](N *(S[sub max] + lbs))] time.
-
-When ['m > S[sub max] - S[sub min]], the problem is divided into ['S[sub max]] pieces per iteration,
-or __std_sort is called if ['N < 2^(m + S[sub min] + lbs)]. For this range:
-
-[space][space]['K_m <= K_(m - 1) + S[sub max]], providing runtime of
-
-[space][space]['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))] if recursive,
-
-or ['[bigo](N * log(2^(m + S[sub min] + lbs)))] if comparison-based,
-
-which simplifies to ['[bigo](N * (m + S[sub min] + lbs))],
-which substitutes to ['[bigo](N * ((m - (S[sub max] - S[sub min])) + S[sub max] + lbs))],
-which given that ['m - (S[sub max] - S[sub min]) <= (K - K_(S[sub max] - S[sub min]))/S[sub max]]
-(otherwise fewer radix-based iterations would be used)
-
-also comes out to ['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))].
-
-Asymptotically, for large ['N] and large ['K], this simplifies to:
-
-[space][space]['[bigo](N * (K/S[sub max] + S[sub max] + lbs))],
-
-simplifying out the constants related to the ['S[sub max] - S[sub min]] range,
-providing an additional ['[bigo](N * (S[sub max] + lbs))] runtime on top of the
-['[bigo](N * (K/S))] performance of LSD __radix_sort,
-but without the ['[bigo](N)] memory overhead.
-For simplicity, because ['lbs] is a small constant
-(0 can be used, and performs reasonably),
-it is ignored when summarizing the performance in further discussions.
-By checking whether comparison-based sorting is better,
-Spreadsort is also ['[bigo](N*log(N))], whichever is better,
-and unlike LSD __radix_sort, can perform much better than the worst-case
-if the data is either evenly distributed or highly clustered.
-
-This analysis was for __integer_sort and __float_sort.
-__string_sort differs in that ['S[sub min] = S[sub max] = sizeof(Char_type) * 8],
-['lbs] is 0, and that __std_sort's comparison is not a constant-time operation,
-so strictly speaking __string_sort runtime is
-
-[space][space]['[bigo](N * (K/S[sub max] + (S[sub max] comparisons)))].
-
-Worst-case, this ends up being ['[bigo](N * K)]
-(where ['K] is the mean string length in bytes),
-as described for __american_flag, which is better than the
-
-[space][space]['[bigo](N * K * log(N))]
-
-worst-case for comparison-based sorting.
-
-[endsect] [/section:performance Performance]
-
-[section:tuning Tuning]
-__integer_sort and __float_sort have tuning constants that control
-how the radix-sorting portion of those algorithms works.
-The ideal constant values for __integer_sort and __float_sort vary depending on
-the platform, compiler, and data being sorted.
-By far the most important constant is ['max_splits],
-which defines how many pieces the radix-sorting portion splits
-the data into per iteration.
-
-The ideal value of ['max_splits] depends upon the size of the L1 processor cache,
-and is between 10 and 13 on many systems.
-A default value of 11 is used. For mostly-sorted data, a much larger value is better,
-as swaps (and thus cache misses) are rare,
-but this hurts runtime severely for unsorted data, so is not recommended.
-
-On some x86 systems, when the total number of elements being sorted is small
-(less than 1 million or so), the ideal ['max_splits] can be substantially larger,
-such as 17. This is suspected to be because all the data fits into the L2 cache,
-and misses from L1 cache to L2 cache do not impact performance
-as severely as misses to main memory.
-Modifying tuning constants other than ['max_splits] is not recommended,
-as the performance improvement for changing other constants is usually minor.
-
-If you can afford to let it run for a day, and have at least 1GB of free memory,
-the perl command: `./tune.pl -large -tune` (UNIX)
-or `perl tune.pl -large -tune -windows` (Windows)
-can be used to automatically tune these constants.
-This should be run from the `libs/sort` directory inside the boost home directory.
-This will work to identify the ideal `constants.hpp` settings for your system,
-testing on various distributions in a 20 million element (80MB) file,
-and additionally verifies that all sorting routines sort correctly
-across various data distributions.
-Alternatively, you can test with the file size you're most concerned with
-`./tune.pl number -tune` (UNIX) or `perl tune.pl number -tune -windows` (Windows).
-Substitute the number of elements you want to test with for `number`.
-Otherwise, just use the default options, which perform reasonably well.
-With default settings, `./tune.pl -tune` (UNIX) or `perl tune.pl -tune -windows` (Windows),
-the script will take hours to run (less than a day),
-but may not pick the correct ['max_splits] if it is over 10.
-Alternatively, you can add the `-small` option to make it take just a few minutes,
-tuning for smaller vector sizes (one hundred thousand elements),
-but the resulting constants may not be good for large files
-(see above note about ['max_splits] on Windows).
-
-The tuning script can also be used just to verify that sorting works correctly
-on your system, and see how much of a speedup it gets,
-by omitting the `-tune` option. This verification also runs at the end of tuning runs.
-Default args will take about an hour to run and give accurate results
-on decent-sized test vectors. `./tune.pl -small` (UNIX) or `perl tune.pl -small -windows` (Windows)
-is a faster option, that tests on smaller vectors and isn't as accurate.
-
-If any differences are encountered during tuning, please call `tune.pl` with `-debug > log_file_name`.
-If the resulting log file contains compilation or permissions issues,
-it is likely an issue with your setup.
-If some other type of error is encountered (or result differences),
-please send them to the library author at spreadsort@gmail.com.
-Including the zipped `input.txt` that was being used is also helpful.
-
-[endsect] [/section:tuning Tuning]
-
-[endsect] [/section:spreadsort_overview Spreadsort Overview]
-
-[section:header_spreadsort Header ``]
-
-__spreadsort checks whether the data-type provided is an integer,
-castable float, string, or wstring.
-
-* If data-type is an integer, __integer_sort is used.
-* If data-type is a float, __float_sort is used.
-* If data-type is a string or wstring, __string_sort is used.
-* Sorting other data-types requires picking between
-__integer_sort, __float_sort and __string_sort directly,
-as __spreadsort won't accept types that don't have the appropriate type traits.
-
-Overloading variants are provided that permit use of user-defined right-shift functors and comparison functors.
-
-Each function is optimized for its set of arguments; default functors are not provided to avoid the risk of any reduction of performance.
-
-See __overloading section.
-
-[h5 Rationale:]
-
-__spreadsort function provides a wrapper that calls the fastest sorting algorithm
-available for a data-type, enabling faster generic programming.
-
-[section:spreadsort_examples Spreadsort Examples]
-
-See [@../../example/ example] folder for all examples.
-
-See [@../../example/sample.cpp sample.cpp] for a simple working example.
-
-For an example of 64-bit integer sorting, see [@../../example/int64.cpp int64.cpp].
-
-This example sets the element type of a vector to 64-bit integer
-
-[int64bit_1]
-
-and calls the sort
-
-[int64bit_2]
-
-For a simple example sorting `float`s,
-
- vector<float> vec;
- vec.push_back(1.0);
- vec.push_back(2.3);
- vec.push_back(1.3);
- ...
- spreadsort(vec.begin(), vec.end());
- //The sorted vector contains "1.0 1.3 2.3 ..."
-
-See also [@../../example/floatsample.cpp floatsample.cpp] which checks for abnormal values.
-
-[endsect] [/section:spreadsort_examples Spreadsort Examples]
-
-[endsect] [/section:header_spreadsort Header ``]
-
-[section:integer_sort Integer Sort]
-
-__integer_sort is a fast templated in-place hybrid radix/comparison algorithm,
-which in testing tends to be roughly 50% to 2X faster than
-__std_sort for large tests (>=100kB).
-Worst-case performance is ['[bigo](N * (log2(range)/s + s))],
-so __integer_sort is asymptotically faster than pure comparison-based algorithms.
-['s] is ['max_splits], which defaults to 11,
-so its worst-case with default settings for 32-bit integers is
-['[bigo](N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations))].
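
The arithmetic behind this bound can be sketched in a few lines. The helper names below (`radix_bound`, `comparison_bound`, `hybrid_bound`) are hypothetical and are not part of the library, and rounding the iteration counts up to whole numbers is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper, not a library function: worst-case per-element work on
// the radix path, i.e. ceil(key_bits / max_splits) radix iterations plus
// max_splits comparison-based iterations. Requires max_splits > 0.
inline unsigned radix_bound(unsigned key_bits, unsigned max_splits) {
    return (key_bits + max_splits - 1) / max_splits + max_splits;
}

// ceil(log2(n)): the per-element bound assumed for a pure comparison sort.
inline unsigned comparison_bound(std::size_t n) {
    unsigned bits = 0;
    for (std::size_t v = (n > 0 ? n - 1 : 0); v > 0; v >>= 1) ++bits;
    return bits;
}

// The hybrid takes whichever worst-case bound is smaller.
inline unsigned hybrid_bound(unsigned key_bits, unsigned max_splits, std::size_t n) {
    return std::min(radix_bound(key_bits, max_splits), comparison_bound(n));
}
```

With 32-bit keys and the default ['max_splits] of 11, `radix_bound` gives 3 + 11 = 14, which beats the comparison bound of 20 for roughly a million elements, while for 1000 elements the comparison bound of 10 is smaller.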
-
-Some performance plots of runtime vs. n and log2(range) are provided below:
-
-[@../../doc/graph/windows_integer_sort.htm Windows Integer Sort]
-
-[@../../doc/graph/osx_integer_sort.htm OSX Integer Sort]
-
-[section:integersort_examples Integer Sort Examples]
-
-See [@../../example/rightshiftsample.cpp rightshiftsample.cpp] for a working example of using rightshift, using a user-defined functor:
-
-[rightshift_int_functor]
-
-Other examples:
-
-[@../../example/keyplusdatasample.cpp Sort structs using an integer key.]
-
-[@../../example/reverseintsample.cpp Sort integers in reverse order.]
-
-[@../../example/mostlysorted.cpp Simple sorting of integers; this case is a performance test for integers that are already mostly sorted.]
-
-[endsect] [/section:integersort_examples Integer Sort Examples]
-
-[endsect] [/section:integer_sort Integer Sort]
-
-[section:float_sort Float Sort]
-
-__float_sort is a fast templated in-place hybrid radix/comparison algorithm much like __integer_sort, but sorts IEEE floating-point numbers (positive, zero, NaN, and negative) into ascending order by casting them to integers. This works because positive IEEE floating-point numbers sort like integers with the same bits, and negative IEEE floating-point numbers sort in the reverse order of integers with the same bits. float_sort is roughly 2X as fast as std::sort.
-
-The radix-based portion of this algorithm gives -0.0, 0.0, and NaN definite ordered positions, whereas comparison-based sorting does not guarantee their relative positions. The included tests avoid creating NaN and -0.0 so that results match std::sort, which does not handle these values consistently, because they compare as equal to numbers with different values.
-
-float_sort checks the size of the data type and whether it is castable, picking
- an integer type to cast to, if a casting functor isn't provided by the user.
-
-float_mem_cast casts IEEE floating-point numbers (positive, zero, NaN, and negative) into integers. This is an essential utility for creating a custom rightshift functor for float_sort, when one is needed. Only IEEE floating-point numbers of the same size as the integer type being cast to should be used in this specialized method call.
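
The bit-level ordering property described above can be checked with a small sketch. `float_bits` is a hypothetical stand-in written for this illustration, not the library's `float_mem_cast`, and it assumes a 32-bit IEEE 754 `float`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a float-to-integer memory cast: copies the bits
// of a 32-bit float into a signed 32-bit integer without changing them.
inline std::int32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::int32_t),
                  "this sketch assumes a 32-bit float");
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

For positive values the integer order matches the floating-point order, so `float_bits(1.0f) < float_bits(1.3f)`. For negative values the order is reversed, so `float_bits(-2.0f) > float_bits(-1.0f)` even though `-2.0f < -1.0f`. A right-shift functor can then shift the returned integer by the requested offset.
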
-
-Worst-case performance is ['[bigo](N * (log2(range)/s + s))],
-so __float_sort is asymptotically faster than pure comparison-based algorithms.
-['s] is ['max_splits], which defaults to 11,
-so its worst-case with default settings for 32-bit floats is
-['[bigo](N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations))].
-
-Some performance plots of runtime vs. n and log2(range) are provided below:
-
-[@../../doc/graph/windows_float_sort.htm Windows Float Sort]
-
-[@../../doc/graph/osx_float_sort.htm OSX Float Sort]
-
-[section:floatsort_examples Float Sort Examples]
-
-See [@../../example/floatfunctorsample.cpp floatfunctorsample.cpp] for a working example of how to sort structs with a float key:
-
-[float_functor_types]
-
-[float_functor_datatypes]
-
-Right-shift functor:
-
-[float_functor_rightshift]
-
-Comparison lessthan `operator<` functor:
-
-[float_functor_lessthan]
-
-Other examples:
-
-[@../../example/double.cpp Sort doubles.]
-
-[@../../example/shiftfloatsample.cpp Sort floats using a rightshift functor.]
-
-[endsect] [/section:floatsort_examples Float Sort Examples]
-
-[endsect] [/section:float_sort Float Sort]
-
-[section:string_sort String Sort]
-__string_sort is a hybrid radix-based/comparison-based algorithm that sorts strings of characters (or arrays of binary data) in ascending order.
-
-The simplest version (no functors) sorts strings of items that can cast to an unsigned data type (such as an unsigned char), have a < operator, have a size function, and have a data() function that returns a pointer to an array of characters, such as a std::string. The functor version can sort any data type that has a strict weak ordering, via templating, but requires definitions of a get_char (acts like x[offset] on a string or a byte array), get_length (returns length of the string being sorted), and a comparison functor. Individual characters returned by get_char must support the != operator and have an unsigned value that defines their lexicographical order.
-
-This algorithm is not efficient for character types larger than 2 bytes each, and is optimized for one-byte character strings. For this reason, __std_sort will be called instead if the character type is of size > 2.
-
-__string_sort has a special optimization for identical substrings. This adds some overhead on random data, but identical substrings are common in real strings.
-
-reverse_string_sort sorts strings in reverse (descending) order, but is otherwise identical. __string_sort is sufficiently flexible that it should sort any data type that __std_sort can, assuming the user provides appropriate functors that index into a key.
-
-[@../../doc/graph/windows_string_sort.htm Windows String Sort]
-
-[@../../doc/graph/osx_string_sort.htm OSX String Sort]
-
-
-
-[section:stringsort_examples String Sort Examples]
-
-See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for an example of how to sort structs using a string key and all functors:
-
-[lessthan_functor]
-
-[bracket_functor]
-
-[getsize_functor]
-
-and these functors are used thus:
-
-[stringsort_functors_call]
-
-
-See [@../../example/generalizedstruct.cpp generalizedstruct.cpp] for a working example of a generalized approach to sort structs by a sequence of integer, float, and multiple string keys using string_sort:
-
-[generalized_functors]
-
-[generalized_functors_call]
-
-Other examples:
-
-[@../../example/stringsample.cpp String sort.]
-
-[@../../example/reversestringsample.cpp Reverse string sort.]
-
-[@../../example/wstringsample.cpp Wide character string sort.]
-
-[@../../example/caseinsensitive.cpp Case insensitive string sort.]
-
-[@../../example/charstringsample.cpp Sort structs using a string key and indexing functors.]
-
-[@../../example/reversestringfunctorsample.cpp Sort structs using a string key and all functors in reverse order.]
-
-[endsect] [/section:stringsort_examples String Sort Examples]
-
-[endsect] [/section:string_sort String Sort]
-
-[section:rationale Rationale]
-
-[section:radix_sorting Radix Sorting]
-Radix-based sorting allows the data to be divided up into more than 2 pieces per iteration,
-and for cache-friendly versions, it normally cuts the data up into around a thousand pieces per iteration.
-This allows many fewer iterations to be used to complete sorting the data,
-enabling performance superior to the ['[bigo](N*log(N))] comparison-based sorting limit.
-[endsect] [/section:radix_sorting Radix Sorting]
-
-[section:hybrid_radix Hybrid Radix]
-
-There are two primary types of radix-based sorting:
-
-Most-significant-digit Radix sorting (MSD) divides the data recursively
-based upon the top-most unsorted bits.
-This approach is efficient for even distributions that divide nicely,
-and can be done in-place (limited additional memory used).
-There is substantial constant overhead for each iteration to deal
-with the splitting structure.
-The algorithms provided here use MSD Radix Sort for their radix-sorting portion.
-The main disadvantage of MSD Radix sorting is that when the data is cut up into small
-pieces, the overhead for additional recursive calls starts to dominate runtime,
-and this makes worst-case behavior substantially worse than ['[bigo](N*log(N))].
-
-By contrast, __integer_sort, __float_sort, and __string_sort all check to see
-whether Radix-based or Comparison-based sorting has the better worst-case runtime,
-and make the appropriate recursive call.
-Because Comparison-based sorting algorithms are efficient on small pieces,
-the tendency of MSD __radix_sort to cut the problem up small is turned into
-an advantage by these hybrid sorts. It is hard to conceive of a common usage case
-where pure MSD __radix_sort would have any significant advantage
-over hybrid algorithms.
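
The hybrid idea can be sketched as follows. This is a simplified illustration and not the library's implementation: it uses a scratch buffer instead of sorting in place, a fixed 8-bit radix, and an assumed fallback size of 64 rather than a size chosen by comparing worst-case runtimes. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts v[first, last) on the byte at bit offset `shift`, then recurses into
// each bucket using the next lower byte. Small ranges go to std::sort.
inline void hybrid_msd_sort_range(std::vector<std::uint32_t>& v,
                                  std::size_t first, std::size_t last,
                                  unsigned shift,
                                  std::vector<std::uint32_t>& scratch) {
    const std::size_t fallback_size = 64;  // assumed threshold, not a library constant
    std::uint32_t* data = v.data();
    if (last - first <= fallback_size) {
        std::sort(data + first, data + last);  // comparison-based fallback
        return;
    }
    // Histogram of the current byte; after the prefix sum, start[b] is the
    // offset of bucket b relative to `first`.
    std::size_t start[257] = {0};
    for (std::size_t i = first; i < last; ++i)
        ++start[((data[i] >> shift) & 0xFFu) + 1];
    for (std::size_t b = 0; b < 256; ++b)
        start[b + 1] += start[b];
    // Distribute into the scratch buffer, then copy back.
    std::size_t next[256];
    std::copy(start, start + 256, next);
    for (std::size_t i = first; i < last; ++i)
        scratch[first + next[(data[i] >> shift) & 0xFFu]++] = data[i];
    std::copy(scratch.data() + first, scratch.data() + last, data + first);
    if (shift == 0) return;  // all 32 key bits have been consumed
    for (std::size_t b = 0; b < 256; ++b)
        hybrid_msd_sort_range(v, first + start[b], first + start[b + 1],
                              shift - 8, scratch);
}

// Entry point: starts from the most significant byte of the 32-bit key.
inline void hybrid_msd_sort(std::vector<std::uint32_t>& v) {
    std::vector<std::uint32_t> scratch(v.size());
    hybrid_msd_sort_range(v, 0, v.size(), 24, scratch);
}
```

Because each bucket that falls below the fallback size is handed to `std::sort`, the small pieces that a pure MSD radix sort would keep splitting are sorted by the comparison-based algorithm instead.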
-
-Least-significant-digit __radix_sort (LSD) sorts based upon
-the least-significant bits first. This requires a complete copy of the data being sorted,
-using substantial additional memory. The main advantage of LSD Radix Sort
-is that aside from some constant overhead and the memory allocation,
-it uses a fixed amount of time per element to sort, regardless of distribution or
-size of the list. This amount of time is proportional to the length of the radix.
-The other advantage of LSD Radix Sort is that it is a stable sorting algorithm,
-so elements with the same key will retain their original order.
-
-One disadvantage is that LSD Radix Sort uses the same amount of time
-to sort "easy" sorting problems as "hard" sorting problems,
-and this time spent may end up being greater than an efficient ['[bigo](N*log(N))]
-algorithm such as __introsort spends sorting "hard" problems on large data sets,
-depending on the length of the datatype, and relative speed of comparisons,
-memory allocation, and random accesses.
-
-The other main disadvantage of LSD Radix Sort is its memory overhead.
-It's only faster for large data sets, but large data sets use significant memory,
-which LSD Radix Sort needs to duplicate. LSD Radix Sort doesn't make sense for items
-of variable length, such as strings; it could be implemented by starting at the end
-of the longest element, but would be extremely inefficient.
-
-All that said, there are places where LSD Radix Sort is the appropriate and
-fastest solution, so it would be appropriate to create a templated LSD Radix Sort
-similar to __integer_sort and __float_sort. This would be most appropriate in cases where
-comparisons are extremely slow.
-
-[endsect] [/section:hybrid_radix Hybrid Radix]
-
-[section:why_spreadsort Why spreadsort?]
-
-The __spreadsort algorithm used in this library is designed to provide best possible
-worst-case performance, while still being cache-friendly.
-It provides the better of ['[bigo](N*(K/S + S))] and ['[bigo](N*log(N))] worst-case time,
-where ['K] is the log of the range. The log of the range is normally the length in bits
-of the data type; 32 for a 32-bit integer.
-
-`flash_sort` (another hybrid algorithm), by comparison, is ['[bigo](N)]
-for evenly distributed lists. The problem is, `flash_sort` is merely an MSD __radix_sort
-combined with ['[bigo](N*N)] insertion sort to deal with small subsets where
-the MSD Radix Sort is inefficient, so it is inefficient with chunks of data
-around the size at which it switches to `insertion_sort`, and ends up operating
-as an enhanced MSD Radix Sort.
-For uneven distributions this makes it especially inefficient.
-
-__integer_sort and __float_sort use __introsort instead, which provides ['[bigo](N*log(N))]
-performance for these medium-sized pieces. Also, `flash_sort`'s ['[bigo](N)]
-performance for even distributions comes at the cost of cache misses,
-which on modern architectures are extremely expensive, and in testing
-on modern systems ends up being slower than cutting up the data in multiple,
-cache-friendly steps. Also worth noting is that on most modern computers,
-`log2(available RAM)/log2(L1 cache size)` is around 3,
-where a cache miss takes more than 3 times as long as an in-cache random-access,
-and the size of ['max_splits] is tuned to the size of the cache.
-On a computer where cache misses aren't this expensive, ['max_splits]
-could be increased to a large value, or eliminated entirely,
-and `integer_sort/float_sort` would have the same ['[bigo](N)] performance
-on even distributions.
-
-Adaptive Left Radix (ALR) is similar to `flash_sort`, but more cache-friendly.
-It still uses `insertion_sort`. Because ALR uses ['[bigo](N*N)] `insertion_sort`,
-it isn't efficient to use the comparison-based fallback sort on large lists,
-and if the data is clustered in small chunks just over the fallback size
-with a few outliers, radix-based sorting iterates many times doing little sorting
-with high overhead. Asymptotically, ALR is still ['[bigo](N*(K/S + S))],
-but with a very small ['S] (about 2 in the worst case),
-which compares unfavorably with the 11 default value of ['max_splits] for
-Spreadsort.
-
-ALR also does not have the ['[bigo](N*log(N))] fallback, so for small lists
-that are not evenly distributed it is extremely inefficient.
-See the `alrbreaker` and `binaryalrbreaker` testcases for examples;
-either replace the call to sort with a call to ALR and update the ALR_THRESHOLD
-at the top, or as a quick comparison make `get_max_count` return `ALR_THRESHOLD`
-(20 by default based upon the paper).
-These small tests take 4-10 times as long with ALR as __std_sort
-in the author's testing, depending on the test system,
-because they are trying to sort a highly uneven distribution.
-Normal Spreadsort does much better with them, because `get_max_count`
-is designed around minimizing worst-case runtime.
-
-`burst_sort` is an efficient hybrid algorithm for strings that
-uses substantial additional memory.
-
-__string_sort uses minimal additional memory by comparison.
-Speed comparisons between the two haven't been made,
-but the better memory efficiency makes __string_sort more general.
-
-`postal_sort` and __string_sort are similar. A direct performance comparison
-would be welcome, but an efficient version of `postal_sort` was not found
-in a search for source.
-
-__string_sort is most similar to the __american_flag algorithm.
-The main difference is that it doesn't bother trying to optimize how empty
-buckets/piles are handled, instead just checking to see if all characters
-at the current index are equal. Other differences are using __std_sort
-as the fallback algorithm, and a larger fallback size (256 vs. 16),
-which makes empty pile handling less important.
-
-Another difference is not applying the stack-size restriction.
-Because of the equality check in __string_sort, it would take ['m*m] memory
-worth of strings to force __string_sort to create a stack of depth ['m].
-This problem isn't a realistic one on modern systems with multi-megabyte stacksize
-limits, where main memory would be exhausted holding the long strings necessary
-to exceed the stacksize limit. __string_sort can be thought of as modernizing
-__american_flag to take advantage of __introsort as a fallback algorithm.
-In the author's testing, __american_flag (on `std::string`s) had comparable runtime
-to __introsort, but making a hybrid of the two allows reduced overhead and
-substantially superior performance.
-
-[endsect] [/section:why_spreadsort]
-
-[section:unstable_sort Unstable Sorting]
-
-Making a __radix_sort stable requires the usage of an external copy of the data.
-A stable hybrid algorithm also requires a stable comparison-based algorithm,
-and these are generally slow. LSD __radix_sort uses an external copy of the data,
-and provides stability, along with likely being faster (than a stable hybrid sort),
-so that's probably a better way to go for integer and floating-point types.
-It might make sense to make a stable version of __string_sort using external memory,
-but for simplicity this has been left out for now.
-
-[endsect] [/section:unstable_sort Unstable Sorting]
-
-[section:optimization Unused X86 optimization]
-
-Though the ideal ['max_splits] for `n < 1 million` (or so) on x86
-['seems] to be substantially larger, enabling a roughly 15% speedup for such tests,
-this optimization isn't general, and doesn't apply for `n > 1 million`.
-An overly large ['max_splits] can cause sort to take more than twice as long,
-so it should be set on the low end of the reasonable range, where it is right now.
-
-[endsect] [/section:optimization Unused X86 optimization]
-
-[section:lookup Lookup Table?]
-
-The ideal way to optimize the constants would be to have a carefully-tuned
-lookup-table instead of the `get_max_count` function, but 4 tuning variables
-are simpler, `get_max_count` enforces worst-case performance minimization rules,
-and such a lookup table would be difficult to optimize
-for cross-platform performance.
-
-Alternatively, `get_max_count` could be used to generate a static lookup table.
-This hasn't been done due to concerns about cross-platform compatibility
-and flexibility.
-
-[endsect] [/section:lookup]
-
-[endsect] [/section:rationale Rationale]
-
-[endsect] [/section:sort_hpp Spreadsort]
-
-[endsect] [/section:single_threaded Single Threaded Algorithms]
-
-[section:parallel Parallel Algorithms]
-
-TODO(Francisco): Insert parallel libraries documentation here.
-
-[endsect] [/section:parallel Parallel Algorithms]
-
-[section:definitions Definitions]
-
-[h4 stable sort]
-
-A sorting approach that preserves pre-existing order.
-If there are two elements with identical keys in a list that is later stably sorted,
-whichever came first in the initial list will come first in a stably sorted list.
-The algorithms provided here provide no such guarantee; items with identical keys
-will have arbitrary resulting order relative to each other.
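
A stable sort can be illustrated with the standard library's `std::stable_sort`. The `Record` alias and the `stable_by_key` helper are hypothetical names used only for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// A key plus a label that records the element's original position.
using Record = std::pair<int, std::string>;

// Sorts by key only; records with equal keys keep their original order.
inline std::vector<Record> stable_by_key(std::vector<Record> records) {
    std::stable_sort(records.begin(), records.end(),
                     [](const Record& a, const Record& b) {
                         return a.first < b.first;
                     });
    return records;
}
```

Given the input `{2,"a"}, {1,"b"}, {2,"c"}, {1,"d"}`, the result is `{1,"b"}, {1,"d"}, {2,"a"}, {2,"c"}`: within each key, the labels stay in their original order. An unstable sort is free to return `{1,"d"}` before `{1,"b"}`.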
-
-[endsect] [/section:definitions Definitions]
-
-[section:faq Frequently Asked Questions]
-
-There are no FAQs yet.
-
-[endsect] [/section:faq Frequently asked Questions]
-
-[section:acks Acknowledgements]
-
-* The author would like to thank his wife Mary for her patience and support
-during the long process of converting this from a piece of C code
-to a template library.
-
-* The author would also like to thank Phil Endecott and Frank Gennari
-for the improvements they've suggested and for testing.
-Without them this would have taken longer to develop or performed worse.
-
-* `float_mem_cast` was fixed to be safe and fast thanks to Scott McMurray.
-That fix was critical for a high-performance cross-platform __float_sort.
-
-* Thanks also for multiple helpful suggestions provided by Steven Watanabe,
-Edouard Alligand, and others.
-
-* Initial documentation was refactored to use Quickbook by Paul A. Bristow.
-
-[endsect] [/section:acknowledgements Acknowledgements]
-
-[section:bibliog Bibliography]
-
-[h4 Standard Template Library Sort Algorithms]
-
-[@http://www.cplusplus.com/reference/algorithm/sort/ C++ STL sort algorithms].
-
-[h4 Radix Sort]
-
-A type of algorithm that sorts based upon distribution instead of by comparison.
-Wikipedia has an article about Radix Sorting.
-A more detailed description of various Radix Sorting algorithms is provided here:
-
-Donald Knuth. The Art of Computer Programming,
-Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998.
-ISBN 0-201-89685-0. Section 5.2.5: Sorting by Distribution, pp.168-179.
-
-[h4 Introsort]
-
-A high-speed comparison-based sorting algorithm that takes ['[bigo](N * log(N))] time.
-See __introsort and
-Musser, David R. (1997). "Introspective Sorting and Selection Algorithms",
-Software: Practice and Experience (Wiley) 27 (8), pp 983-993,
-available at [@http://www.cs.rpi.edu/~musser/gp/introsort.ps Musser Introsort].
-
-[h4 American Flag Sort]
-
-A high-speed hybrid string sorting algorithm that __string_sort is partially based
-upon. See __american_flag and Peter M. McIlroy, Keith Bostic, M. Douglas McIlroy. Engineering Radix Sort, Computing Systems 1993.
-
-[h4 Adaptive Left Radix (ARL)]
-
-ARL (Adaptive Left Radix) is a hybrid cache-friendly integer sorting algorithm
-with comparable speed on random data to __integer_sort,
-but does not have the optimizations for worst-case performance,
-causing it to perform poorly on certain types of unevenly distributed data.
-
-Arne Maus, [@http://www.nik.no/2002/Maus.pdf ARL, a faster in-place, cache friendly sorting algorithm],
-presented at NIK2002, Norwegian Informatics Conference, Kongsberg, 2002. Tapir, ISBN 82-91116-45-8.
-
-[h4 Original Spreadsort]
-
-The algorithm that __integer_sort was originally based on.
-__integer_sort uses a smaller number of key bits at a time for better cache efficiency
-than the method described in the paper.
-The importance of cache efficiency grew as CPU clock speeds increased
-while main memory latency stagnated.
-See Steven J. Ross,
-The Spreadsort High-performance General-case Sorting Algorithm,
-Parallel and Distributed Processing Techniques and Applications, Volume 3, pp.1100-1106. Las Vegas Nevada. 2002. See
-[@../../doc/papers/original_spreadsort06_2002.pdf Steven Ross spreadsort_2002].
-
-[endsect] [/section:bibliography Bibliography]
-
-[section:history History]
-
-* First release following review in Boost 1.58.
-
-* [@http://permalink.gmane.org/gmane.comp.lib.boost.devel/255194 Review of Boost.Sort/Spreadsort library]
-
-[endsect] [/section:history]
+[include introduction.qbk]
+[include single_thread.qbk]
+[include parallel.qbk]
+[include bibliography.qbk]
+[include gratitude.qbk]
[xinclude autodoc.xml] [/ Using Doxygen reference documentation.]
-[/Include the indexes (class, function and everything) ]
+[/Include the indexes (class, function and everything)]
'''
Function Index
@@ -942,7 +39,7 @@ Parallel and Distributed Processing Techniques and Applications, Volume 3, pp.11
'''
[/
- Copyright (c) 2014 Steven Ross
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, and Orson Peters
Distributed under the Boost Software License,
Version 1.0. (See accompanying file LICENSE_1_0.txt
or copy at http://boost.org/LICENSE_1_0.txt)
diff --git a/doc/spinsort.qbk b/doc/spinsort.qbk
new file mode 100644
index 0000000..d33dbee
--- /dev/null
+++ b/doc/spinsort.qbk
@@ -0,0 +1,71 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+
+[section:spinsort 2.3.- spinsort]
+
+
+[*Spinsort] is a new stable sort algorithm, designed and implemented by Francisco Tapia for the Boost Sort Library.
+
+It is a merge algorithm that combines several ideas to improve on the speed of other stable sort algorithms.
+
+The algorithm responds very well when the data are nearly sorted. Often new elements are inserted at the end
+of the sorted elements, or some elements are modified, breaking the order of the elements. In these cases, spinsort
+is very fast.
+
+[table AlgorithmDescription
+[[Algorithm] [Stable][Additional Memory] [Best, average, and worst case]]
+[[spinsort] [Yes] [N / 2] [N, NlogN , NlogN]]
+]
+
+A benchmark with 100000000 64-bit integers, running on an Intel i7-5820K CPU @ 3.30GHz, shows these characteristics.
+
+[table benchmark
+[[Data] [std::stable_sort] [spin_sort] ]
+[[random] [ 8.62 ] [ 9.73 ] ]
+[[sorted] [ 4.88 ] [ 0.06 ] ]
+[[sorted + 0.1% end] [ 4.92 ] [ 0.41 ] ]
+[[sorted + 1% end] [ 4.97 ] [ 0.55 ] ]
+[[sorted + 10% end] [ 5.73 ] [ 1.32 ] ]
+[[sorted + 0.1% mid] [ 6.58 ] [ 1.89 ] ]
+[[sorted + 1% mid] [ 7.06 ] [ 2.12 ] ]
+[[sorted + 10% mid] [ 9.56 ] [ 4.02 ] ]
+[[reverse sorted] [ 0.13 ] [ 0.14 ] ]
+[[reverse sorted + 0.1% end] [ 5.22 ] [ 0.52 ] ]
+[[reverse sorted + 1% end] [ 5.29 ] [ 0.66 ] ]
+[[reverse sorted + 10% end] [ 6.03 ] [ 1.45 ] ]
+[[reverse sorted + 0.1% mid] [ 6.52 ] [ 1.89 ] ]
+[[reverse sorted + 1% mid] [ 7.09 ] [ 2.12 ] ]
+[[reverse sorted + 10% mid] [ 9.46 ] [ 4.02 ] ]
+]
+
+You only need to include the file boost/sort/parallel/sort.hpp
+
+``
+ #include <boost/sort/parallel/sort.hpp>
+
+
+ template <class iter_t, typename compare>
+ void spinsort (iter_t first, iter_t last, compare comp = compare());
+``
+
+The spinsort function is in the namespace boost::sort
+
+The algorithms use a [*comparison object], in the same way as the standard library sort
+algorithms. If you don't define it, the comparison object defaults to std::less, which uses
+the < operator internally for comparisons.
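+
+As a rough illustration of how a comparison object controls the order, the sketch below sorts records by key.
+It uses `std::stable_sort` as a stand-in so that it compiles with the standard library alone; going by the
+signature above, spinsort would take the same three arguments. The names `record`, `by_key` and `sort_records`
+are invented for this example.
+

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical element type: sorted on `key`; `tag` records the input order.
struct record {
  int key;
  char tag;
};

// Comparison object: orders records by key only, ignoring the tag.
struct by_key {
  bool operator()(const record &a, const record &b) const {
    return a.key < b.key;
  }
};

// Stable sort by key. std::stable_sort stands in for spinsort here.
inline void sort_records(std::vector<record> &v) {
  std::stable_sort(v.begin(), v.end(), by_key());
}
```

+Because the sort is stable, records with equal keys keep the relative order they had in the input.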
+
+The algorithms are [*exception safe]: if an exception is thrown by the algorithm itself,
+the integrity of the objects being sorted is guaranteed, but not their relative order. If the exception
+is thrown from inside the objects (for example, in a move or copy constructor), the results can be
+unpredictable.
+
+[endsect]
+
+
diff --git a/doc/spreadsort.qbk b/doc/spreadsort.qbk
new file mode 100644
index 0000000..dbe7dc9
--- /dev/null
+++ b/doc/spreadsort.qbk
@@ -0,0 +1,934 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:spreadsort 2.1.-Spreadsort]
+
+[/ Some composite templates]
+[template super[x]'''<superscript>'''[x]'''</superscript>''']
+[template sub[x]'''<subscript>'''[x]'''</subscript>''']
+[template floor[x]'''⌊'''[x]'''⌋''']
+[template floorlr[x][lfloor][x][rfloor]]
+[template ceil[x] '''⌈'''[x]'''⌉''']
+
+[/ Required for autoindexing]
+[import ../../../tools/auto_index/include/auto_index_helpers.qbk]
+[/ Must be first included file!]
+
+[/Files containing quickbook snippets]
+[import ../example/charstringsample.cpp]
+[import ../example/stringfunctorsample.cpp]
+[import ../example/reverseintsample.cpp]
+[import ../example/rightshiftsample.cpp]
+[import ../example/int64.cpp]
+[import ../example/floatfunctorsample.cpp]
+[import ../example/generalizedstruct.cpp]
+
+[import html4_symbols.qbk] [/ Provides various useful squiggles.]
+
+[def __spreadsort [@http://en.wikipedia.org/wiki/Spreadsort spreadsort]]
+[def __introsort [@http://en.wikipedia.org/wiki/Introsort introsort]]
+[def __stl_sort [@http://www.cplusplus.com/reference/algorithm/sort/ STL std::sort]]
+[def __big_o [@http://en.wikipedia.org/wiki/Big_O_notation Big O notation]]
+[def __radix_sort [@http://en.wikipedia.org/wiki/Radix_sort radix sort]]
+[def __adaptive_left_reflex [@http://www.nik.no/2002/Maus.pdf Arne Maus, Adaptive Left Radix]]
+[def __american_flag [@http://en.wikipedia.org/wiki/American_flag_sort American flag sort]]
+[def __overloading [link sort.overview.overloading overloading]]
+[def __lookup [link sort.rationale.lookup lookup]]
+[def __random_access_iter [@http://en.cppreference.com/w/cpp/concept/RandomAccessIterator RandomAccessIterator]]
+[def __strictweakordering [@http://en.wikipedia.org/wiki/Weak_ordering#Strict_weak_orderings strict weak ordering]]
+
+[/ Links to functions for use in text]
+[def __integer_sort [^[funcref boost::sort::spreadsort::integer_sort integer_sort]]]
+[def __float_sort [^[funcref boost::sort::spreadsort::float_sort float_sort]]]
+[def __string_sort [^[funcref boost::sort::spreadsort::string_sort string_sort]]]
+[def __spreadsort [^[funcref boost::sort::spreadsort::spreadsort spreadsort]]] [/Note diff from Wiki link __spreadsort]
+[def __std_sort [@http://en.cppreference.com/w/cpp/algorithm/sort std::sort]]
+
+[section:overview Overview]
+
+[section:intro Introduction]
+
+Spreadsort combines generic implementations of multiple high-speed sorting algorithms
+that outperform those in the C++ standard in both average and worst case performance
+when there are over 1000 elements in the list to sort.
+
+They fall back to __stl_sort on small data sets.
+
+[warning All of these algorithms work only on
+[@http://www.cplusplus.com/reference/iterator/RandomAccessIterator/ random access iterators].]
+
+They are hybrids using both radix and comparison-based sorting,
+specialized to sorting common data types, such as integers, floats, and strings.
+
+These algorithms are encoded in a generic fashion and accept functors,
+enabling them to sort any object that can be processed like these basic data types.
+In the case of __string_sort, this includes anything
+with a defined strict-weak-ordering that __std_sort can sort,
+but writing efficient functors for some complex key types
+may not be worth the additional effort relative to just using __std_sort,
+depending on how important speed is to your application.
+Sample usages are available in the example directory.
+
+Unlike many radix-based algorithms,
+the underlying __spreadsort algorithm is designed around [*worst-case performance].
+It performs better on chunky data (data that is not widely distributed),
+so that on real data it can perform substantially better than on random data.
+Conceptually, __spreadsort can sort any data for which an absolute ordering can be determined,
+and __string_sort is sufficiently flexible that this should be possible.
+
+Situations where __spreadsort is fastest relative to __std_sort:
+
+# Large number of elements to sort (['N] >= 10000).
+
+# Slow comparison function (such as floating-point numbers on x86 processors or strings).
+
+# Large data elements (such as key + data sorted on a key).
+
+# Completely sorted data, for which __spreadsort has an optimization to quit early.
+
+Situations where __spreadsort is slower than __std_sort:
+
+# Data sorted in reverse order. Both __std_sort and __spreadsort are faster
+on reverse-ordered data than randomized data,
+but __std_sort speeds up more in this special case.
+
+# Very small amounts of data (< 1000 elements).
+For this reason there is a fallback in __spreadsort to __std_sort
+if the input size is less than 1000,
+so performance is identical for small amounts of data in practice.
+
+These functions are defined in `namespace boost::sort::spreadsort`.
+
+[endsect] [/section Introduction]
+
+[section:overloading Overloading]
+
+[tip In the Boost.Sort C++ Reference section, click on the appropriate overload, for example `float_sort(RandomAccessIter, RandomAccessIter, Right_shift, Compare);` to get full details of that overload.]
+
+Each of __integer_sort, __float_sort, and __string_sort has 3 main versions.
+The base version takes a first iterator and a last iterator, just like __std_sort:
+
+ integer_sort(array.begin(), array.end());
+ float_sort(array.begin(), array.end());
+ string_sort(array.begin(), array.end());
+
+The second version takes an overridden shift functor, providing flexibility
+in case `operator>>` already does something other than a bitshift.
+The rightshift functor takes two arguments: first the data type,
+and second a natural number of bits to shift right.
+
+For __string_sort this variant is slightly different;
+it needs a bracket functor equivalent to `operator[]`,
+taking a number corresponding to the character offset,
+along with a second `getlength` functor to get the length of the string in characters.
+In all cases, this operator must return an integer type that compares with the
+`operator<` to provide the intended order
+(integers can be negated to reverse their order).
+
+In other words (aside from negative floats, which are inverted as ints):
+
+ rightshift(A, n) < rightshift(B, n) -> A < B
+ A < B -> rightshift(A, 0) < rightshift(B, 0)
+
+[rightshift_1]
+[bracket_1]
+
+See [@../../example/rightshiftsample.cpp rightshiftsample.cpp] for a working example of integer sorting with a rightshift functor.
+
+The third version adds a comparison functor for maximum flexibility.
+This functor must provide the same sorting order as the integers returned by the rightshift (aside from negative floats):
+
+ rightshift(A, n) < rightshift(B, n) -> compare(A, B)
+ compare(A, B) -> rightshift(A, 0) < rightshift(B, 0)
+
+[reverse_int_2]
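+
+The two implications above can be checked mechanically for a given pair of functors. The following is only a
+sketch: `neg_rightshift` and `functors_agree` are names invented here, the comparison is `std::greater<int>`
+(descending order), and the check is meant for non-negative inputs, since right-shifting a negative `int` is
+implementation-defined before C++20.
+

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical right-shift functor for descending order: shift, then negate.
struct neg_rightshift {
  int operator()(const int &x, const unsigned offset) const {
    return -(x >> offset);
  }
};

// Returns true if, for every pair (a, b) in `values`:
//   rightshift(a, n) < rightshift(b, n)  implies  compare(a, b)
//     for every shift n below max_offset, and
//   compare(a, b)  implies  rightshift(a, 0) < rightshift(b, 0).
inline bool functors_agree(const std::vector<int> &values,
                           unsigned max_offset) {
  neg_rightshift rightshift;
  std::greater<int> compare;
  for (int a : values) {
    for (int b : values) {
      for (unsigned n = 0; n < max_offset; ++n) {
        if (rightshift(a, n) < rightshift(b, n) && !compare(a, b)) {
          return false;
        }
      }
      if (compare(a, b) && !(rightshift(a, 0) < rightshift(b, 0))) {
        return false;
      }
    }
  }
  return true;
}
```

+A check like this over a handful of representative keys is a cheap way to catch a shift functor and a
+comparison functor that disagree about the order.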
+
+Examples of functors are:
+
+[lessthan_functor]
+
+[bracket_functor]
+
+[getsize_functor]
+
+and these functors are used thus:
+
+[stringsort_functors_call]
+
+See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for a working example of sorting strings with all functors.
+
+[endsect] [/section:overloading Overloading]
+
+[section:performance Performance]
+
+The __spreadsort algorithm is a hybrid algorithm;
+when the number of elements being sorted is below a certain number,
+comparison-based sorting is used. Above it, radix sorting is used.
+The radix-based algorithm will thus cut up the problem into small pieces,
+and either completely sort the data based upon its radix if the data is clustered,
+or finish sorting the cut-down pieces with comparison-based sorting.
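+
+The hybrid idea can be sketched in a few lines. This is not the library's implementation: it sorts only
+`std::uint32_t` keys, always uses 8 bits per radix iteration, distributes through a temporary buffer instead of
+swapping in place, and uses an arbitrary cut-over size (`small_size`). All names are invented for the example.
+

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using iter = std::vector<std::uint32_t>::iterator;

// Hypothetical cut-over size; the library's real threshold differs.
constexpr std::ptrdiff_t small_size = 64;

// Sorts [first, last) on the 8 key bits starting at `shift`, then recurses
// into each bin on the next lower 8 bits.
inline void hybrid_msd_sort(iter first, iter last, int shift = 24) {
  const std::ptrdiff_t n = last - first;
  if (n < small_size) {
    std::sort(first, last);  // small piece: comparison-based sorting
    return;
  }
  // Histogram: start[b + 1] counts the keys that fall into bin b.
  std::array<std::ptrdiff_t, 257> start{};
  for (iter it = first; it != last; ++it) {
    ++start[((*it >> shift) & 0xFFu) + 1];
  }
  // Prefix sums turn the counts into the starting offset of each bin.
  for (std::size_t b = 0; b < 256; ++b) start[b + 1] += start[b];
  // Distribute through a temporary buffer (unlike the in-place library).
  std::vector<std::uint32_t> tmp(static_cast<std::size_t>(n));
  std::array<std::ptrdiff_t, 256> next{};
  std::copy(start.begin(), start.begin() + 256, next.begin());
  for (iter it = first; it != last; ++it) {
    tmp[static_cast<std::size_t>(next[(*it >> shift) & 0xFFu]++)] = *it;
  }
  std::copy(tmp.begin(), tmp.end(), first);
  if (shift == 0) return;  // all 32 key bits consumed: fully sorted
  for (std::size_t b = 0; b < 256; ++b) {
    hybrid_msd_sort(first + start[b], first + start[b + 1], shift - 8);
  }
}
```

+Clustered keys end up in a few large bins that keep being split by radix; widely spread keys quickly fall
+below the cut-over size and are finished by `std::sort`.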
+
+The Spreadsort algorithm dynamically chooses
+either comparison-based or radix-based sorting when recursing,
+whichever provides better worst-case performance.
+This way worst-case performance is guaranteed to be the better of
+['[bigo](N[sdot]log2(N))] comparisons and ['[bigo](N[sdot](K/S + S))] operations where
+
+* ['N] is the number of elements being sorted,
+* ['K] is the length in bits of the key, and
+* ['S] is a constant.
+
+This results in substantially improved performance for large ['N];
+__integer_sort tends to be 50% to 2X faster than __std_sort,
+while __float_sort and __string_sort are roughly 2X faster than __std_sort.
+
+Performance graphs are provided for __integer_sort, __float_sort, and __string_sort in their respective sections.
+
+Runtime Performance comparisons and graphs were made on a Core 2 Duo laptop
+running Windows Vista 64 with MSVC 8.0,
+and an old G4 laptop running Mac OSX with gcc.
+[@http://www.boost.org/build/doc/html/ Boost bjam/b2] was used to control compilation.
+
+Direct performance comparisons on a newer x86 system running Ubuntu,
+with the fallback to __std_sort at lower input sizes disabled, are below.
+
+[note The fallback to __std_sort for smaller input sizes prevents
+the worse performance seen on the left sides of the first two graphs.]
+
+__integer_sort starts to become faster than __std_sort at about 1000 integers (4000 bytes),
+and __string_sort becomes faster than __std_sort at slightly fewer bytes (as few as 30 strings).
+
+[note The 4-threaded graph has 4 threads doing [*separate sorts simultaneously]
+(not splitting up a single sort)
+as a test for thread cache collision and other multi-threaded performance issues.]
+
+__float_sort times are very similar to __integer_sort times.
+
+[/ These provide links to the images, but currently graphs are shown - see below]
+[/@../../doc/images/single_threaded.png single_threaded.png] [/file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png]
+[/@../../doc/images/4_threaded.png 4_threaded.png]
+[/@../../doc/images/entropy.png entropy.png]
+[/@../../doc/images/bits_per_byte.png bits_per_byte.png]
+
+[$../images/single_threaded.png]
+[$../images/4_threaded.png]
+[$../images/entropy.png]
+[$../images/bits_per_byte.png]
+
+Histogramming with a fixed maximum number of splits is used
+because it reduces the number of cache misses,
+thus improving performance relative to the approach described in detail
+in the [@http://en.wikipedia.org/wiki/Spreadsort original SpreadSort publication].
+
+The importance of cache-friendly histogramming is described
+in __adaptive_left_reflex,
+though without the worst-case handling described below.
+
+The time taken per radix iteration is:
+
+['[bigo](N)] iterations over the data
+
+['[bigo](N)] integer-type comparisons (even for __float_sort and __string_sort)
+
+['[bigo](N)] swaps
+
+['[bigo](2[super S])] bin operations.
+
+To obtain ['[bigo](N)] worst-case performance per iteration,
+the restriction ['S <= log2(N)] is applied, and ['[bigo](2[super S])] becomes ['[bigo](N)].
+For each such iteration, the number of unsorted bits log2(range)
+(referred to as ['K]) per element is reduced by ['S].
+As ['S] decreases depending upon the number of elements being sorted,
+it can drop from a maximum of ['S[sub max]] to the minimum of ['S[sub min]].
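+
+The restriction ['S <= log2(N)] can be pictured as a simple clamp. The helper below is hypothetical and
+deliberately ignores ['S[sub min]] and ['lbs]; it only shows how the number of bits per iteration shrinks
+once a piece holds fewer than 2[super S[sub max]] elements.
+

```cpp
#include <cassert>
#include <cstddef>

// floor(log2(n)) for n >= 1 (returns 0 for n == 0 as well).
inline unsigned floor_log2(std::size_t n) {
  unsigned result = 0;
  while (n >>= 1) ++result;
  return result;
}

// Hypothetical helper: bits S to use for one radix iteration over n
// elements, clamped so that 2^S <= n and S <= s_max.
inline unsigned splits_for(std::size_t n, unsigned s_max) {
  const unsigned limit = floor_log2(n);
  return limit < s_max ? limit : s_max;
}
```

+With ['S[sub max]] = 11, a piece of about a million elements uses the full 11 bits, while a piece of 256
+elements would be limited to 8 bits, so the bin operations never outnumber the elements.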
+
+Assumption: __std_sort is assumed to be ['[bigo](N*log2(N))],
+as __introsort exists and is commonly used.
+(If you have a quibble with this please take it up with the implementor of your __std_sort;
+you're welcome to replace the recursive calls to __std_sort with calls
+to __introsort if your __std_sort library call is poorly implemented).
+
+[@http://en.wikipedia.org/wiki/Introsort Introsort] is not included with this algorithm for simplicity and
+because the implementor of the __std_sort call
+is assumed to know what they're doing.
+
+To maintain a minimum value for ['S (S[sub min])],
+comparison-based sorting has to be used to sort when
+['n <= log2(meanbinsize)], where ['log2(meanbinsize) (lbs)] is a small constant,
+usually between 0 and 4, used to minimize bin overhead per element.
+There is a small corner-case where if ['K < S[sub min]] and ['n >= 2^K],
+then the data can be sorted in a single radix-based iteration with an ['S = K]
+(this bucketsorting special case is by default only applied to __float_sort).
+So for the final recursion, worst-case performance is:
+
+1 radix-based iteration if ['K <= S[sub min]],
+
+or ['S[sub min] + lbs] comparison-based iterations if ['K > S[sub min]] but ['n <= 2[super (S[sub min] + lbs)]].
+
+So for the final iteration, worst-case runtime is ['[bigo](N*(S[sub min] + lbs))] but
+if ['K > S[sub min]] and ['N > 2[super (S[sub min] + lbs)]]
+then more than 1 radix recursion will be required.
+
+For the second to last iteration, ['K <= S[sub min] * 2 + 1] can be handled,
+(if the data is divided into ['2[super (S[sub min] + 1)]] pieces)
+or if ['N < 2[super (S[sub min] + lbs + 1)]],
+then it is faster to fallback to __std_sort.
+
+In the case of a radix-based sort plus recursion, it will take
+['[bigo](N*(S[sub min] + lbs)) + [bigo](N) = [bigo](N*(S[sub min] + lbs + 1))] worst-case time,
+as
+['K_remaining = K_start - (S[sub min] + 1)], and ['K_start <= S[sub min] * 2 + 1].
+
+Alternatively, comparison-based sorting is used if ['N < 2[super (S[sub min] + lbs + 1)]],
+which will take ['[bigo](N*(S[sub min] + lbs + 1))] time.
+
+So either way ['[bigo](N*(S[sub min] + lbs + 1))] is the worst-case time for the second to last iteration,
+which occurs if ['K <= S[sub min] * 2 + 1] or ['N < 2[super (S[sub min] + lbs + 1)]].
+
+This continues as long as ['S[sub min] <= S <= S[sub max]],
+so that for ['K_m <= K_(m-1) + S[sub min] + m] where ['m]
+is the maximum number of iterations after this one has finished,
+or where ['N < 2[super (S[sub min] + lbs + m)]],
+then the worst-case runtime is ['[bigo](N*(S[sub min] + lbs + m))].
+
+[space][space]['K_m] at ['m <= (S[sub max] - S[sub min])] works out to:
+
+[space][space]['K_1 <= (S[sub min]) + S[sub min] + 1 <= 2S[sub min] + 1]
+
+[space][space]['K_2 <= (2S[sub min] + 1) + S[sub min] + 2]
+
+as the sum from 0 to ['m] is ['m(m + 1)/2]
+
+[space][space]['K_m <= (m + 1)S[sub min] + m(m + 1)/2 <= (S[sub min] + m/2)(m + 1)]
+
+substituting ['S[sub max] - S[sub min]] for ['m]
+
+[space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + (S[sub max] - S[sub min])/2)*(S[sub max] - S[sub min] + 1)]
+
+[space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + S[sub max]) * (S[sub max] - S[sub min] + 1)/2]
+
+Since this involves ['S[sub max] - S[sub min] + 1] iterations,
+this works out to dividing ['K] into an average ['(S[sub min] + S[sub max])/2]
+pieces per iteration.
+
+To finish the problem from this point takes ['[bigo](N * (S[sub max] - S[sub min]))] for ['m] iterations,
+plus the worst-case of ['[bigo](N*(S[sub min] + lbs))] for the last iteration,
+for a total of ['[bigo](N *(S[sub max] + lbs))] time.
+
+When ['m > S[sub max] - S[sub min]], the problem is divided into ['S[sub max]] pieces per iteration,
+or __std_sort is called if ['N < 2^(m + S[sub min] + lbs)]. For this range:
+
+[space][space]['K_m <= K_(m - 1) + S[sub max]], providing runtime of
+
+[space][space]['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))] if recursive,
+
+or ['[bigo](N * log(2^(m + S[sub min] + lbs)))] if comparison-based,
+
+which simplifies to ['[bigo](N * (m + S[sub min] + lbs))],
+which substitutes to ['[bigo](N * ((m - (S[sub max] - S[sub min])) + S[sub max] + lbs))],
+which given that ['m - (S[sub max] - S[sub min]) <= (K - K_(S[sub max] - S[sub min]))/S[sub max]]
+(otherwise a lesser number of radix-based iterations would be used)
+
+also comes out to ['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))].
+
+Asymptotically, for large ['N] and large ['K], this simplifies to:
+
+[space][space]['[bigo](N * (K/S[sub max] + S[sub max] + lbs))],
+
+simplifying out the constants related to the ['S[sub max] - S[sub min]] range,
+providing an additional ['[bigo](N * (S[sub max] + lbs))] runtime on top of the
+['[bigo](N * (K/S))] performance of LSD __radix_sort,
+but without the ['[bigo](N)] memory overhead.
+For simplicity, because ['lbs] is a small constant
+(0 can be used, and performs reasonably),
+it is ignored when summarizing the performance in further discussions.
+By checking whether comparison-based sorting is better,
+Spreadsort is also ['[bigo](N*log(N))], whichever is better,
+and unlike LSD __radix_sort, can perform much better than the worst-case
+if the data is either evenly distributed or highly clustered.
+
+This analysis was for __integer_sort and __float_sort.
+__string_sort differs in that ['S[sub min] = S[sub max] = sizeof(Char_type) * 8],
+['lbs] is 0, and that __std_sort's comparison is not a constant-time operation,
+so strictly speaking __string_sort runtime is
+
+[space][space]['[bigo](N * (K/S[sub max] + (S[sub max] comparisons)))].
+
+Worst-case, this ends up being ['[bigo](N * K)]
+(where ['K] is the mean string length in bytes),
+as described for __american_flag, which is better than the
+
+[space][space]['[bigo](N * K * log(N))]
+
+worst-case for comparison-based sorting.
+
+[endsect] [/section:performance Performance]
+
+[section:tuning Tuning]
+__integer_sort and __float_sort have tuning constants that control
+how the radix-sorting portion of those algorithms work.
+The ideal constant values for __integer_sort and __float_sort vary depending on
+the platform, compiler, and data being sorted.
+By far the most important constant is ['max_splits],
+which defines how many pieces the radix-sorting portion splits
+the data into per iteration.
+
+The ideal value of ['max_splits] depends upon the size of the L1 processor cache,
+and is between 10 and 13 on many systems.
+A default value of 11 is used. For mostly-sorted data, a much larger value is better,
+as swaps (and thus cache misses) are rare,
+but this hurts runtime severely for unsorted data, so is not recommended.
+
+On some x86 systems, when the total number of elements being sorted is small
+(less than 1 million or so), the ideal ['max_splits] can be substantially larger,
+such as 17. This is suspected to be because all the data fits into the L2 cache,
+and misses from L1 cache to L2 cache do not impact performance
+as severely as misses to main memory.
+Modifying tuning constants other than ['max_splits] is not recommended,
+as the performance improvement for changing other constants is usually minor.
+
+If you can afford to let it run for a day, and have at least 1GB of free memory,
+the Perl command `./tune.pl -large -tune` (UNIX)
+or `perl tune.pl -large -tune -windows` (Windows)
+can be used to automatically tune these constants.
+This should be run from the `libs/sort` directory inside the boost home directory.
+It identifies the ideal `constants.hpp` settings for your system,
+testing on various distributions in a 20 million element (80MB) file,
+and additionally verifies that all sorting routines sort correctly
+across various data distributions.
+Alternatively, you can test with the file size you're most concerned with, using
+`./tune.pl number -tune` (UNIX) or `perl tune.pl number -tune -windows` (Windows).
+Substitute the number of elements you want to test with for `number`.
+Otherwise, just use the default options; they are decent.
+With default settings, `./tune.pl -tune` (UNIX) or `perl tune.pl -tune -windows` (Windows),
+the script will take hours to run (less than a day),
+but may not pick the correct ['max_splits] if it is over 10.
+Alternatively, you can add the `-small` option to make it take just a few minutes,
+tuning for smaller vector sizes (one hundred thousand elements),
+but the resulting constants may not be good for large files
+(see the note above about ['max_splits] on some x86 systems).
+
+The tuning script can also be used just to verify that sorting works correctly
+on your system, and see how much of a speedup it gets,
+by omitting the "-tune" option. This verification also runs at the end of tuning runs.
+Default args will take about an hour to run and give accurate results
+on decent-sized test vectors. `./tune.pl -small` (UNIX) or `perl tune.pl -small -windows` (Windows)
+is a faster option that tests on smaller vectors and isn't as accurate.
+
+If any differences are encountered during tuning, please call `tune.pl` with `-debug > log_file_name`.
+If the resulting log file contains compilation or permissions issues,
+it is likely an issue with your setup.
+If some other type of error is encountered (or result differences),
+please send them to the library author at spreadsort@gmail.com.
+Including the zipped `input.txt` that was being used is also helpful.
+
+[endsect] [/section:tuning Tuning]
+
+[endsect] [/section Overview]
+
+[section:sort_hpp Spreadsort]
+
+[section:header_spreadsort Header ``]
+
+__spreadsort checks whether the data-type provided is an integer,
+castable float, string, or wstring.
+
+* If data-type is an integer, __integer_sort is used.
+* If data-type is a float, __float_sort is used.
+* If data-type is a string or wstring, __string_sort is used.
+* Sorting other data-types requires picking between
+__integer_sort, __float_sort and __string_sort directly,
+as __spreadsort won't accept types that don't have the appropriate type traits.
+
+Overloading variants are provided that permit use of user-defined right-shift functors and comparison functors.
+
+Each function is optimized for its set of arguments; default functors are not provided to avoid the risk of any reduction of performance.
+
+See __overloading section.
+
+[h5 Rationale:]
+
+__spreadsort function provides a wrapper that calls the fastest sorting algorithm
+available for a data-type, enabling faster generic programming.
+
+[section:spreadsort_examples Spreadsort Examples]
+
+See [@../../example/ example] folder for all examples.
+
+See [@../../example/sample.cpp sample.cpp] for a simple working example.
+
+For an example of 64-bit integer sorting, see [@../../example/int64.cpp int64.cpp].
+
+This example sets the element type of a vector to 64-bit integer
+
+[int64bit_1]
+
+and calls the sort
+
+[int64bit_2]
+
+For a simple example sorting `float`s,
+
+ vector<float> vec;
+ vec.push_back(1.0);
+ vec.push_back(2.3);
+ vec.push_back(1.3);
+ ...
+ spreadsort(vec.begin(), vec.end());
+ //The sorted vector contains "1.0 1.3 2.3 ..."
+
+See also [@../../example/floatsample.cpp floatsample.cpp] which checks for abnormal values.
+
+[endsect] [/section:spreadsort_examples Spreadsort Examples]
+
+[endsect] [/section:header_spreadsort Header ``]
+
+[section:integer_sort Integer Sort]
+
+__integer_sort is a fast templated in-place hybrid radix/comparison algorithm,
+which in testing tends to be roughly 50% to 2X faster than
+__std_sort for large tests (>=100kB).
+Worst-case performance is ['[bigo](N * (log2(range)/s + s))],
+so __integer_sort is asymptotically faster than pure comparison-based algorithms.
+['s] is ['max_splits], which defaults to 11,
+so its worst-case with default settings for 32-bit integers is
+['[bigo](N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations))].
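+
+The arithmetic behind that bound can be worked through directly. The helpers below are hypothetical and
+round the division up to whole iterations; they compute only the per-element factor ['log2(range)/s + s]
+and say nothing about constant factors or real runtimes.
+

```cpp
#include <cassert>

// ceil(key_bits / s): radix iterations needed to consume all key bits.
constexpr unsigned radix_iterations(unsigned key_bits, unsigned s) {
  return (key_bits + s - 1) / s;
}

// Per-element factor of the worst-case bound: log2(range)/s + s,
// with the division rounded up to whole iterations.
constexpr unsigned worst_case_factor(unsigned key_bits, unsigned s) {
  return radix_iterations(key_bits, s) + s;
}
```

+For 32-bit keys and ['s] = 11 this gives 3 radix iterations and a factor of 3 + 11 = 14, which does not grow
+with ['N], whereas the ['log2(N)] factor of a comparison-based sort does.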
+
+Some performance plots of runtime vs. n and log2(range) are provided below:
+
+[@../../doc/graph/windows_integer_sort.htm Windows Integer Sort]
+
+[@../../doc/graph/osx_integer_sort.htm OSX integer Sort]
+
+[section:integersort_examples Integer Sort Examples]
+
+See [@../../example/rightshiftsample.cpp rightshiftsample.cpp] for a working example of using rightshift, using a user-defined functor:
+
+[rightshift_int_functor]
+
+Other examples:
+
+[@../../example/keyplusdatasample.cpp Sort structs using an integer key.]
+
+[@../../example/reverseintsample.cpp Sort integers in reverse order.]
+
+[@../../example/mostlysorted.cpp Simple sorting of integers; this case is a performance test for integers that are already mostly sorted.]
+
+[endsect] [/section:integersort_examples Integer Sort Examples]
+
+[endsect] [/section:integer_sort Integer Sort]
+
+[section:float_sort Float Sort]
+
+__float_sort is a fast templated in-place hybrid radix/comparison algorithm much like __integer_sort, but sorts IEEE floating-point numbers (positive, zero, NaN, and negative) into ascending order by casting them to integers. This works because positive IEEE floating-point numbers sort like integers with the same bits, and negative IEEE floating-point numbers sort in the reverse order of integers with the same bits. float_sort is roughly 2X as fast as std::sort.
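+
+The ordering property described above can be observed with a small experiment. `float_bits` is a hypothetical
+helper written for this illustration (it is not the library's `float_mem_cast`), and it assumes a 32-bit
+IEEE `float`.
+

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies the bits of a 32-bit IEEE float into a
// signed 32-bit integer without changing them.
inline std::int32_t float_bits(float f) {
  static_assert(sizeof(float) == sizeof(std::int32_t),
                "float must be 32 bits wide");
  std::int32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return bits;
}
```

+For positive inputs, `float_bits(1.0f) < float_bits(2.5f)` matches the floating-point order; for negative
+inputs the integer order is reversed, so `float_bits(-2.5f) > float_bits(-1.0f)` even though -2.5 < -1.0.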
+
+-0.0 vs. 0.0 and NaN are given definitive ordered positions by the radix-based portion of this algorithm, where comparison-based sorting does not guarantee their relative position. The included tests avoid creating NaN and -0.0 so that results match std::sort, which is not consistent in how it handles these numbers, as they compare as equal to numbers with different values.
+
+float_sort checks the size of the data type and whether it is castable, picking
+ an integer type to cast to, if a casting functor isn't provided by the user.
+
+float_mem_cast casts IEEE floating-point numbers (positive, zero, NaN, and negative) into integers. This is an essential utility for creating a custom rightshift functor for float_sort, when one is needed. Only IEEE floating-point numbers of the same size as the integer type being cast to should be used in this specialized method call.
+
+Worst-case performance is ['[bigo](N * (log2(range)/s + s))],
+so __float_sort is asymptotically faster than pure comparison-based algorithms.
+['s] is ['max_splits], which defaults to 11,
+so its worst-case with default settings for 32-bit floats is
+['[bigo](N * ((32/11) slow radix-based iterations + 11 fast comparison-based iterations))].
+
+Some performance plots of runtime vs. n and log2(range) are provided below:
+
+[@../../doc/graph/windows_float_sort.htm Windows Float Sort]
+
+[@../../doc/graph/osx_float_sort.htm OSX Float Sort]
+
+[section:floatsort_examples Float Sort Examples]
+
+See [@../../example/floatfunctorsample.cpp floatfunctorsample.cpp] for a working example of how to sort structs with a float key:
+
+[float_functor_types]
+
+[float_functor_datatypes]
+
+Right-shift functor:
+
+[float_functor_rightshift]
+
+Comparison lessthan `operator<` functor:
+
+[float_functor_lessthan]
+
+Other examples:
+
+[@../../example/double.cpp Sort doubles.]
+
+[@../../example/shiftfloatsample.cpp Sort floats using a rightshift functor.]
+
+[endsect] [/section:floatsort_examples Float Sort Examples]
+
+[endsect] [/section:float_sort Float Sort]
+
+[section:string_sort String Sort]
+__string_sort is a hybrid radix-based/comparison-based algorithm that sorts strings of characters (or arrays of binary data) in ascending order.
+
+The simplest version (no functors) sorts strings of items that can cast to an unsigned data type (such as an unsigned char), have a < operator, have a size function, and have a data() function that returns a pointer to an array of characters, such as a std::string. The functor version can sort any data type that has a strict weak ordering, via templating, but requires definitions of a get_char (acts like x[offset] on a string or a byte array), get_length (returns length of the string being sorted), and a comparison functor. Individual characters returned by get_char must support the != operator and have an unsigned value that defines their lexicographical order.
+
+This algorithm is not efficient for character types larger than 2 bytes each, and is optimized for one-byte character strings. For this reason, __std_sort will be called instead if the character type is of size > 2.
+
+__string_sort has a special optimization for identical substrings. This adds some overhead on random data, but identical substrings are common in real strings.
+
+reverse_string_sort sorts strings in reverse (descending) order, but is otherwise identical. __string_sort is sufficiently flexible that it should sort any data type that __std_sort can, assuming the user provides appropriate functors that index into a key.
+
+[@../../doc/graph/windows_string_sort.htm Windows String Sort]
+
+[@../../doc/graph/osx_string_sort.htm OSX String Sort]
+
+
+
+[section:stringsort_examples String Sort Examples]
+
+See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for an example of how to sort structs using a string key and all functors:
+
+[lessthan_functor]
+
+[bracket_functor]
+
+[getsize_functor]
+
+and these functors are used thus:
+
+[stringsort_functors_call]
+
+
+See [@../../example/generalizedstruct.cpp generalizedstruct.cpp] for a working example of a generalized approach to sort structs by a sequence of integer, float, and multiple string keys using string_sort:
+
+[generalized_functors]
+
+[generalized_functors_call]
+
+Other examples:
+
+[@../../example/stringsample.cpp String sort.]
+
+[@../../example/reversestringsample.cpp Reverse string sort.]
+
+[@../../example/wstringsample.cpp Wide character string sort.]
+
+[@../../example/caseinsensitive.cpp Case insensitive string sort.]
+
+[@../../example/charstringsample.cpp Sort structs using a string key and indexing functors.]
+
+[@../../example/reversestringfunctorsample.cpp Sort structs using a string key and all functors in reverse order.]
+
+[endsect] [/section:stringsort_examples String Sort Examples]
+
+[endsect] [/section:string_sort String Sort]
+
+[section:rationale Rationale]
+
+[section:radix_sorting Radix Sorting]
+Radix-based sorting allows the data to be divided up into more than 2 pieces per iteration,
+and for cache-friendly versions, it normally cuts the data up into around a thousand pieces per iteration.
+This allows many fewer iterations to be used to complete sorting the data,
+enabling performance superior to the ['[bigo](N*log(N))] comparison-based sorting limit.
+[endsect] [/section:radix_sorting Radix Sorting]
+
+[section:hybrid_radix Hybrid Radix]
+
+There are two primary types of radix-based sorting:
+
+Most-significant-digit Radix sorting (MSD) divides the data recursively
+based upon the top-most unsorted bits.
+This approach is efficient for even distributions that divide nicely,
+and can be done in-place (limited additional memory used).
+There is substantial constant overhead for each iteration to deal
+with the splitting structure.
+The algorithms provided here use MSD Radix Sort for their radix-sorting portion.
+The main disadvantage of MSD Radix sorting is that when the data is cut up into small
+pieces, the overhead for additional recursive calls starts to dominate runtime,
+and this makes worst-case behavior substantially worse than ['[bigo](N*log(N))].
+
+By contrast, __integer_sort, __float_sort, and __string_sort all check to see
+whether Radix-based or Comparison-based sorting has better worst-case runtime,
+and make the appropriate recursive call.
+Because Comparison-based sorting algorithms are efficient on small pieces,
+the tendency of MSD __radix_sort to cut the problem up small is turned into
+an advantage by these hybrid sorts. It is hard to conceive of a common usage case
+where pure MSD __radix_sort would have any significant advantage
+over hybrid algorithms.
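+
+The hybrid idea can be sketched as follows for 32-bit unsigned keys. This is a simplified illustration, not the library's implementation: it uses a fixed, assumed fallback size (`kFallbackSize`) instead of comparing worst-case runtimes, it distributes through a scratch buffer rather than in place, and the name `hybrid_msd_sort` is hypothetical.
+
```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed tuning constant for this sketch only; not a value from the library.
const std::size_t kFallbackSize = 64;

// Sorts data[first, last) on the byte starting at bit `shift`, then recurses
// into each bucket using the next lower byte.
void hybrid_msd_sort(std::vector<std::uint32_t> &data, std::size_t first,
                     std::size_t last, int shift = 24) {
  const std::size_t n = last - first;
  if (n < 2) return;
  // Small pieces: a comparison sort is cheaper than another radix pass.
  if (n <= kFallbackSize) {
    std::sort(data.begin() + first, data.begin() + last);
    return;
  }
  // Count how many elements fall into each of the 256 buckets.
  std::array<std::size_t, 257> start{};
  for (std::size_t i = first; i < last; ++i)
    ++start[((data[i] >> shift) & 0xFF) + 1];
  // Prefix sums turn the counts into bucket start offsets.
  for (std::size_t b = 0; b < 256; ++b) start[b + 1] += start[b];
  // Distribute into a scratch buffer (not in place, for brevity).
  std::vector<std::uint32_t> scratch(n);
  std::array<std::size_t, 256> next;
  std::copy(start.begin(), start.begin() + 256, next.begin());
  for (std::size_t i = first; i < last; ++i)
    scratch[next[(data[i] >> shift) & 0xFF]++] = data[i];
  std::copy(scratch.begin(), scratch.end(), data.begin() + first);
  if (shift == 0) return;  // all four bytes have been consumed
  // Recurse into each bucket on the next lower byte.
  for (std::size_t b = 0; b < 256; ++b)
    hybrid_msd_sort(data, first + start[b], first + start[b + 1], shift - 8);
}
```
+
+A call such as `hybrid_msd_sort(v, 0, v.size())` sorts the whole vector. Buckets that shrink to `kFallbackSize` elements or fewer are handed to `std::sort`, which is where the small pieces produced by the radix step become an advantage rather than overhead.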
+
+Least-significant-digit __radix_sort (LSD) sorts based upon
+the least-significant bits first. This requires a complete copy of the data being sorted,
+using substantial additional memory. The main advantage of LSD Radix Sort
+is that aside from some constant overhead and the memory allocation,
+it uses a fixed amount of time per element to sort, regardless of distribution or
+size of the list. This amount of time is proportional to the length of the radix.
+The other advantage of LSD Radix Sort is that it is a stable sorting algorithm,
+so elements with the same key will retain their original order.
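+
+A minimal LSD radix sort sketch for items with 32-bit unsigned keys, assuming one byte per pass; `Item` and `lsd_radix_sort` are hypothetical names. It shows both properties described above: the full scratch copy of the data, and the stability of the result.
+
```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Key/value pair; only `first` (the key) takes part in the ordering.
using Item = std::pair<std::uint32_t, int>;

void lsd_radix_sort(std::vector<Item> &data) {
  std::vector<Item> scratch(data.size());  // the complete external copy
  // Four passes, least-significant byte first.
  for (int shift = 0; shift < 32; shift += 8) {
    // Count the elements per digit, then convert counts to start offsets.
    std::array<std::size_t, 257> start{};
    for (const Item &it : data) ++start[((it.first >> shift) & 0xFF) + 1];
    for (std::size_t b = 0; b < 256; ++b) start[b + 1] += start[b];
    // Scanning in input order keeps equal digits in order (stability).
    for (const Item &it : data)
      scratch[start[(it.first >> shift) & 0xFF]++] = it;
    data.swap(scratch);  // the newest pass is always in `data`
  }
}
```
+
+Every pass touches every element exactly once, whatever the distribution of the keys, which is the fixed per-element cost mentioned above.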
+
+One disadvantage is that LSD Radix Sort uses the same amount of time
+to sort "easy" sorting problems as "hard" sorting problems,
+and this time spent may end up being greater than an efficient ['[bigo](N*log(N))]
+algorithm such as __introsort spends sorting "hard" problems on large data sets,
+depending on the length of the datatype, and relative speed of comparisons,
+memory allocation, and random accesses.
+
+The other main disadvantage of LSD Radix Sort is its memory overhead.
+It's only faster for large data sets, but large data sets use significant memory,
+which LSD Radix Sort needs to duplicate. LSD Radix Sort doesn't make sense for items
+of variable length, such as strings; it could be implemented by starting at the end
+of the longest element, but would be extremely inefficient.
+
+All that said, there are places where LSD Radix Sort is the appropriate and
+fastest solution, so it would be appropriate to create a templated LSD Radix Sort
+similar to __integer_sort and __float_sort. This would be most appropriate in cases where
+comparisons are extremely slow.
+
+[endsect] [/section:hybrid_radix Hybrid Radix]
+
+[section:why_spreadsort Why spreadsort?]
+
+The __spreadsort algorithm used in this library is designed to provide best possible
+worst-case performance, while still being cache-friendly.
+It provides the better of ['[bigo](N*log(K/S + S))] and ['[bigo](N*log(N))] worst-case time,
+where ['K] is the log of the range. The log of the range is normally the length in bits
+of the data type; 32 for a 32-bit integer.
+
+`flash_sort` (another hybrid algorithm), by comparison, is ['[bigo](N)]
+for evenly distributed lists. The problem is, `flash_sort` is merely an MSD __radix_sort
+combined with ['[bigo](N*N)] insertion sort to deal with small subsets where
+the MSD Radix Sort is inefficient, so it is inefficient with chunks of data
+around the size at which it switches to `insertion_sort`, and ends up operating
+as an enhanced MSD Radix Sort.
+For uneven distributions this makes it especially inefficient.
+
+__integer_sort and __float_sort use __introsort instead, which provides ['[bigo](N*log(N))]
+performance for these medium-sized pieces. Also, `flash_sort`'s ['[bigo](N)]
+performance for even distributions comes at the cost of cache misses,
+which on modern architectures are extremely expensive, and in testing
+on modern systems ends up being slower than cutting up the data in multiple,
+cache-friendly steps. Also worth noting is that on most modern computers,
+`log2(available RAM)/log2(L1 cache size)` is around 3,
+where a cache miss takes more than 3 times as long as an in-cache random-access,
+and the size of ['max_splits] is tuned to the size of the cache.
+On a computer where cache misses aren't this expensive, ['max_splits]
+could be increased to a large value, or eliminated entirely,
+and `integer_sort/float_sort` would have the same ['[bigo](N)] performance
+on even distributions.
+
+Adaptive Left Radix (ALR) is similar to `flash_sort`, but more cache-friendly.
+It still uses insertion_sort. Because ALR uses ['[bigo](N*N)] `insertion_sort`,
+it isn't efficient to use the comparison-based fallback sort on large lists,
+and if the data is clustered in small chunks just over the fallback size
+with a few outliers, radix-based sorting iterates many times doing little sorting
+with high overhead. Asymptotically, ALR is still ['[bigo](N*log(K/S + S))],
+but with a very small ['S] (about 2 in the worst case),
+which compares unfavorably with the 11 default value of ['max_splits] for
+Spreadsort.
+
+ALR also does not have the ['[bigo](N*log(N))] fallback, so for small lists
+that are not evenly distributed it is extremely inefficient.
+See the `alrbreaker` and `binaryalrbreaker` testcases for examples;
+either replace the call to sort with a call to ALR and update the ALR_THRESHOLD
+at the top, or as a quick comparison make `get_max_count return ALR_THRESHOLD`
+(20 by default based upon the paper).
+These small tests take 4-10 times as long with ALR as __std_sort
+in the author's testing, depending on the test system,
+because they are trying to sort a highly uneven distribution.
+Normal Spreadsort does much better with them, because `get_max_count`
+is designed around minimizing worst-case runtime.
+
+`burst_sort` is an efficient hybrid algorithm for strings that
+uses substantial additional memory.
+
+__string_sort uses minimal additional memory by comparison.
+Speed comparisons between the two haven't been made,
+but the better memory efficiency makes __string_sort more general.
+
+`postal_sort` and __string_sort are similar. A direct performance comparison
+would be welcome, but an efficient version of `postal_sort` was not found
+in a search for source.
+
+__string_sort is most similar to the __american_flag algorithm.
+The main difference is that it doesn't bother trying to optimize how empty
+buckets/piles are handled, instead just checking to see if all characters
+at the current index are equal. Other differences are using __std_sort
+as the fallback algorithm, and a larger fallback size (256 vs. 16),
+which makes empty pile handling less important.
+
+Another difference is not applying the stack-size restriction.
+Because of the equality check in __string_sort, it would take ['m*m] memory
+worth of strings to force __string_sort to create a stack of depth ['m].
+This problem isn't a realistic one on modern systems with multi-megabyte stacksize
+limits, where main memory would be exhausted holding the long strings necessary
+to exceed the stacksize limit. __string_sort can be thought of as modernizing
+__american_flag to take advantage of __introsort as a fallback algorithm.
+In the author's testing, __american_flag (on `std::strings`) had comparable runtime
+to __introsort, but making a hybrid of the two allows reduced overhead and
+substantially superior performance.
+
+[endsect] [/section:why_spreadsort]
+
+[section:unstable_sort Unstable Sorting]
+
+Making a __radix_sort stable requires the usage of an external copy of the data.
+A stable hybrid algorithm also requires a stable comparison-based algorithm,
+and these are generally slow. LSD __radix_sort uses an external copy of the data,
+and provides stability, along with likely being faster (than a stable hybrid sort),
+so that's probably a better way to go for integer and floating-point types.
+It might make sense to make a stable version of __string_sort using external memory,
+but for simplicity this has been left out for now.
+
+[endsect] [/section:unstable_sort Unstable Sorting]
+
+[section:optimization Unused X86 optimization]
+
+Though the ideal ['max_splits] for `n < 1 million` (or so) on x86
+['seems] to be substantially larger, enabling a roughly 15% speedup for such tests,
+this optimization isn't general, and doesn't apply for `n > 1 million`.
+A too large ['max_splits] can cause sort to take more than twice as long,
+so it should be set on the low end of the reasonable range, where it is right now.
+
+[endsect] [/section:optimization Unused X86 optimization]
+
+[section:lookup Lookup Table?]
+
+The ideal way to optimize the constants would be to have a carefully-tuned
+lookup-table instead of the `get_max_count` function, but 4 tuning variables
+is simpler, `get_max_count` enforces worst-case performance minimization rules,
+and such a lookup table would be difficult to optimize
+for cross-platform performance.
+
+Alternatively, `get_max_count` could be used to generate a static lookup table.
+This hasn't been done due to concerns about cross-platform compatibility
+and flexibility.
+
+[endsect] [/section:lookup]
+
+[endsect] [/section:rationale Rationale]
+
+[endsect] [/section:sort_hpp Spreadsort]
+
+[section:definitions Definitions]
+
+[h4 stable sort]
+
+A sorting approach that preserves pre-existing order.
+If there are two elements with identical keys in a list that is later stably sorted,
+whichever came first in the initial list will come first in a stably sorted list.
+The algorithms provided here provide no such guarantee; items with identical keys
+will have arbitrary resulting order relative to each other.
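+
+The guarantee can be observed with `std::stable_sort` from the standard library, which does provide it. A small sketch, with `Entry`, `key_less`, and `stably_sorted` as illustrative names:
+
```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical element type: `key` is sorted on, `label` records the
// position of the element in the initial list.
struct Entry {
  int key;
  std::string label;
};

// Compares keys only, so entries with equal keys are equivalent.
bool key_less(const Entry &a, const Entry &b) { return a.key < b.key; }

// Returns a stably sorted copy: equivalent entries keep their input order.
std::vector<Entry> stably_sorted(std::vector<Entry> v) {
  std::stable_sort(v.begin(), v.end(), key_less);
  return v;
}
```
+
+Sorting the entries `{2, "a"} {1, "b"} {2, "c"} {1, "d"}` this way yields the labels in the order `b d a c`: within each key, the entry that came first in the input still comes first.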
+
+[endsect] [/section:definitions Definitions]
+
+[section:faq Frequently Asked Questions]
+
+There are no FAQs yet.
+
+[endsect] [/section:faq Frequently asked Questions]
+
+[section:acks Acknowledgements]
+
+* The author would like to thank his wife Mary for her patience and support
+during the long process of converting this from a piece of C code
+to a template library.
+
+* The author would also like to thank Phil Endecott and Frank Gennari
+for the improvements they've suggested and for testing.
+Without them this would have taken longer to develop or performed worse.
+
+* `float_mem_cast` was fixed to be safe and fast thanks to Scott McMurray.
+That fix was critical for a high-performance cross-platform __float_sort.
+
+* Thanks also for multiple helpful suggestions provided by Steven Watanabe,
+Edouard Alligand, and others.
+
+* Initial documentation was refactored to use Quickbook by Paul A. Bristow.
+
+[endsect] [/section:acknowledgements Acknowledgements]
+
+[section:bibliog Bibliography]
+
+[h4 Standard Template Library Sort Algorithms]
+
+[@http://www.cplusplus.com/reference/algorithm/sort/ C++ STL sort algorithms].
+
+[h4 Radix Sort]
+
+A type of algorithm that sorts based upon distribution instead of by comparison.
+Wikipedia has an article about Radix Sorting.
+A more detailed description of various Radix Sorting algorithms is provided here:
+
+Donald Knuth. The Art of Computer Programming,
+Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998.
+ISBN 0-201-89685-0. Section 5.2.5: Sorting by Distribution, pp.168-179.
+
+[h4 Introsort]
+
+A high-speed comparison-based sorting algorithm that takes ['[bigo](N * log(N))] time.
+See __introsort and
+Musser, David R. (1997). "Introspective Sorting and Selection Algorithms",
+Software: Practice and Experience (Wiley) 27 (8), pp 983-993,
+available at [@http://www.cs.rpi.edu/~musser/gp/introsort.ps Musser Introsort].
+
+[h4 American Flag Sort]
+
+A high-speed hybrid string sorting algorithm that __string_sort is partially based
+upon. See __american_flag and Peter M. McIlroy, Keith Bostic, M. Douglas McIlroy. Engineering Radix Sort, Computing Systems 1993.
+
+[h4 Adaptive Left Radix (ARL)]
+
+ARL (Adaptive Left Radix) is a hybrid cache-friendly integer sorting algorithm
+with comparable speed on random data to __integer_sort,
+but does not have the optimizations for worst-case performance,
+causing it to perform poorly on certain types of unevenly distributed data.
+
+Arne Maus, [@http://www.nik.no/2002/Maus.pdf ARL, a faster in-place, cache friendly sorting algorithm],
+presented at NIK2002, Norwegian Informatics Conference, Kongsberg, 2002. Tapir, ISBN 82-91116-45-8.
+
+[h4 Original Spreadsort]
+
+The algorithm that __integer_sort was originally based on.
+__integer_sort uses a smaller number of key bits at a time for better cache efficiency
+than the method described in the paper.
+The importance of cache efficiency grew as CPU clock speeds increased
+while main memory latency stagnated.
+See Steven J. Ross,
+The Spreadsort High-performance General-case Sorting Algorithm,
+Parallel and Distributed Processing Techniques and Applications, Volume 3, pp.1100-1106. Las Vegas Nevada. 2002. See
+[@../../doc/papers/original_spreadsort06_2002.pdf Steven Ross spreadsort_2002].
+
+[endsect] [/section:bibliography Bibliography]
+
+[section:history History]
+
+* First release following review in Boost 1.58.
+
+* [@http://permalink.gmane.org/gmane.comp.lib.boost.devel/255194 Review of Boost.Sort/Spreadsort library]
+
+[endsect] [/section:history]
+
+[xinclude autodoc.xml] [/ Using Doxygen reference documentation.]
+
+
+[/Include the indexes (class, function and everything) ]
+'''
+
+ Function Index
+
+
+
+
+'''
+
+[endsect]
+
+[/
+ Copyright (c) 2014 Steven Ross
+ Distributed under the Boost Software License,
+ Version 1.0. (See accompanying file LICENSE_1_0.txt
+ or copy at http://boost.org/LICENSE_1_0.txt)
+]
\ No newline at end of file
diff --git a/doc/windows_parallel.qbk b/doc/windows_parallel.qbk
new file mode 100644
index 0000000..a36b63d
--- /dev/null
+++ b/doc/windows_parallel.qbk
@@ -0,0 +1,119 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:windows_parallel 3.5- Windows Benchmark]
+
+The following results were obtained from more complex benchmarks, which are not included in the library because they use non-free software.
+(If you are interested, contact fjtapia@gmail.com.)
+
+There are 3 types of benchmarks:
+
+# 64-bit integers
+# strings
+# objects of several sizes.
+
+The objects are arrays of integers. The heavy comparison sums all the elements of each array; the light comparison uses only the first number of the array.
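+
+The two comparisons might be written as in the following sketch. The benchmark sources are not part of the library, so `Object`, `HeavyLess`, and `LightLess` are hypothetical names and the actual benchmark code may differ.
+
```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical object type: an array of N 64-bit numbers (8 * N bytes).
template <std::size_t N>
struct Object {
  std::array<std::uint64_t, N> data;
};

// Heavy comparison: sums every element of both arrays on each call.
struct HeavyLess {
  template <std::size_t N>
  bool operator()(const Object<N> &a, const Object<N> &b) const {
    const std::uint64_t sum_a =
        std::accumulate(a.data.begin(), a.data.end(), std::uint64_t{0});
    const std::uint64_t sum_b =
        std::accumulate(b.data.begin(), b.data.end(), std::uint64_t{0});
    return sum_a < sum_b;
  }
};

// Light comparison: uses only the first element of the array as the key.
struct LightLess {
  template <std::size_t N>
  bool operator()(const Object<N> &a, const Object<N> &b) const {
    return a.data[0] < b.data[0];
  }
};
```
+
+The cost of `HeavyLess` grows with the object size, while `LightLess` stays constant, which is why the tables report the two as separate columns.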
+
+The benchmark runs on a VirtualBox virtual machine with 8 threads and 16 GB of RAM,
+hosted on an Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz with 6 cores, 2 threads per core, and 15 MB of cache.
+
+
+[teletype]
+``
+
+ 100 000 000 NUMBERS OF 64 BITS
+ RANDOMLY FILLED
+
+                           |  Time   |   Maximum   |
+                           |  secs   | Memory Used |
+ --------------------------+---------+-------------+
+ PPL parallel sort         | 2.4016  |    786 MB   |
+ PPL parallel_buffered_sort| 2.0373  |   1567 MB   |
+ block_indirect_sort       | 1.6101  |    785 MB   |
+                           |         |             |
+ sample sort               | 2.1191  |   1565 MB   |
+ parallel stable sort      | 2.4503  |   1175 MB   |
+ --------------------------+---------+-------------+
+
+``
+
+[teletype]
+``
+
+ 10 000 000 S T R I N G S
+ RANDOMLY FILLED
+
+                           |  Time   |   Maximum   |
+                           |  secs   | Memory Used |
+ --------------------------+---------+-------------+
+ PPL parallel sort         | 4.3241  |    887 MB   |
+ PPL parallel_buffered_sort| 3.5434  |   1199 MB   |
+ block_indirect_sort       | 3.5732  |   1601 MB   |
+                           |         |             |
+ sample sort               | 3.8107  |   1198 MB   |
+ parallel stable sort      | 5.0277  |   1041 MB   |
+ --------------------------+---------+-------------+
+``
+
+[teletype]
+``
+
+ =============================================================
+ =                     OBJECT COMPARISON                     =
+ =                   ---------------------                   =
+ =                                                           =
+ =  The objects are arrays of 64 bits numbers                =
+ =                                                           =
+ =  They are compared in two ways :                          =
+ =                                                           =
+ =  (H) Heavy : The comparison is the sum of all the numbers =
+ =              of the array. In each comparison, sum all    =
+ =              the numbers of the array                     =
+ =                                                           =
+ =  (L) Light : The comparison is with the first element of  =
+ =              the array, as a key                          =
+ =                                                           =
+ =============================================================
+
+                     | 100000000 |  50000000 |  25000000 |  12500000 |   6250000 |   1562500 |
+                     | objects of| objects of| objects of| objects of| objects of| objects of|
+                     |  8 bytes  | 16 bytes  | 32 bytes  | 64 bytes  | 128 bytes | 512 bytes |
+                     |           |           |           |           |           |           |
+                     |   H    L  |   H    L  |   H    L  |   H    L  |   H    L  |   H    L  |
+ --------------------+-----------+-----------+-----------+-----------+-----------+-----------+
+ PPL parallel sort   | 2.50  2.40| 1.34  1.16| 0.85  0.73| 0.70  0.57| 0.72  0.45| 0.54  0.40|
+ PPL parallel_       | 2.20  2.26| 1.34  1.24| 1.03  0.79| 1.00  0.83| 0.90  0.85| 0.78  0.87|
+   buffered_sort     |           |           |           |           |           |           |
+ block_indirect_sort | 1.62  1.59| 0.94  0.87| 0.63  0.57| 0.50  0.44| 0.58  0.38| 0.55  0.35|
+                     |           |           |           |           |           |           |
+ sample sort         | 2.19  2.25| 1.69  1.54| 1.12  1.14| 1.18  1.14| 1.03  1.13| 1.09  1.17|
+ parallel stable sort| 2.54  2.49| 1.69  1.52| 1.25  1.10| 1.10  1.03| 1.07  1.00| 1.05  0.97|
+                     |           |           |           |           |           |           |
+ --------------------+-----------+-----------+-----------+-----------+-----------+-----------+
+
+                            |   Maximum   |
+                            | Memory Used |
+ ---------------------------+-------------+
+ PPL parallel sort          |    785 MB   |
+ PPL parallel_buffered_sort |   1567 MB   |
+ block_indirect_sort        |    785 MB   |
+                            |             |
+ sample sort                |   1565 MB   |
+ parallel stable sort       |   1175 MB   |
+                            |             |
+ ---------------------------+-------------+
+
+``
+
+
+
+[endsect]
+
+
+
diff --git a/doc/windows_single.qbk b/doc/windows_single.qbk
new file mode 100644
index 0000000..50ce94d
--- /dev/null
+++ b/doc/windows_single.qbk
@@ -0,0 +1,170 @@
+[/===========================================================================
+ Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
+
+
+ Distributed under the Boost Software License, Version 1.0
+ See accompanying file LICENSE_1_0.txt or copy at
+ http://www.boost.org/LICENSE_1_0.txt
+=============================================================================/]
+
+[section:windows_single 2.6.- Windows Benchmarks]
+
+[*WINDOWS x64 VC++ 2017]
+
+
+The benchmark folder of the library contains programs that measure the speed of the algorithms on your machine and operating system.
+These are brief benchmarks that show the speed with different kinds of data (random, sorted, sorted with unsorted data appended at the end, ...).
+
+The benchmark runs on a VirtualBox virtual machine with 8 threads and 16 GB of RAM,
+hosted on an Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz with 6 cores, 2 threads per core, and 15 MB of cache.
+
+The results of benchmark_numbers with integers are:
+
+[teletype]
+``
+
+
+
+                      B O O S T   S O R T
+                   S I N G L E   T H R E A D
+                I N T E G E R   B E N C H M A R K
+
+             SORT OF 100 000 000 NUMBERS OF 64 BITS
+
+
+ [ 1 ] std::sort   [ 2 ] pdqsort            [ 3 ] std::stable_sort
+ [ 4 ] spinsort    [ 5 ] flat_stable_sort   [ 6 ] spreadsort
+
+                     |      |      |      |      |      |      |
+                     | [ 1 ]| [ 2 ]| [ 3 ]| [ 4 ]| [ 5 ]| [ 6 ]|
+ --------------------+------+------+------+------+------+------+
+ random              |11.74 | 9.40 |13.42 |12.12 |13.17 | 7.10 |
+                     |      |      |      |      |      |      |
+ sorted              | 1.93 | 0.16 | 4.43 | 0.12 | 0.09 | 0.09 |
+ sorted + 0.1% end   | 3.54 | 2.08 | 4.39 | 0.87 | 0.48 | 5.87 |
+ sorted +   1% end   | 4.52 | 2.77 | 4.82 | 1.12 | 1.00 | 7.20 |
+ sorted +  10% end   | 9.69 | 5.99 | 7.40 | 2.21 | 2.29 | 8.59 |
+                     |      |      |      |      |      |      |
+ sorted + 0.1% mid   | 3.66 | 2.43 | 4.74 | 2.63 | 3.84 | 5.59 |
+ sorted +   1% mid   | 4.36 | 3.04 | 4.97 | 4.35 | 6.21 | 9.20 |
+ sorted +  10% mid   | 9.50 | 7.28 | 7.44 | 5.37 | 8.03 | 9.95 |
+                     |      |      |      |      |      |      |
+ reverse sorted      | 2.38 | 0.35 | 5.61 | 0.24 | 0.18 | 2.84 |
+ rv sorted + 0.1% end| 4.24 | 2.64 | 5.72 | 0.96 | 0.67 | 6.66 |
+ rv sorted +   1% end| 4.44 | 2.67 | 5.22 | 1.10 | 0.86 | 5.68 |
+ rv sorted +  10% end| 6.93 | 4.98 | 6.27 | 2.00 | 1.95 | 7.37 |
+                     |      |      |      |      |      |      |
+ rv sorted + 0.1% mid| 4.63 | 3.18 | 5.76 | 3.18 | 5.22 | 7.52 |
+ rv sorted +   1% mid| 4.38 | 3.06 | 4.94 | 3.10 | 4.54 | 6.55 |
+ rv sorted +  10% mid| 9.20 | 7.08 | 7.56 | 5.28 | 7.40 | 9.04 |
+ --------------------+------+------+------+------+------+------+
+
+``
+
+The following results were obtained from more complex benchmarks, which are not included in the library because they use non-free software.
+(If you are interested, contact fjtapia@gmail.com.)
+
+There are 3 types of benchmarks:
+
+* 64-bit integers
+* strings
+* objects of several sizes.
+
+The objects are arrays of integers. The heavy comparison sums all the elements of each array;
+the light comparison uses only the first number of the array.
+
+
+[teletype]
+``
+
+ 100 000 000 NUMBERS OF 64 BITS
+ RANDOMLY FILLED
+
+                     |  Time   |   Maximum   |
+                     |  secs   | Memory Used |
+ --------------------+---------+-------------+
+ std::sort           | 12.381  |    783 MB   |
+ pdqsort             |  9.760  |    783 MB   |
+                     |         |             |
+ std::stable_sort    | 13.311  |   1174 MB   |
+ spin_sort           | 11.541  |   1174 MB   |
+ flat_stable_sort    | 13.664  |    787 MB   |
+ spreadsort          |  8.507  |    783 MB   |
+ --------------------+---------+-------------+
+
+``
+
+
+[teletype]
+``
+
+ 10 000 000 S T R I N G S
+ RANDOMLY FILLED
+
+                     |  Time   |   Maximum   |
+                     |  secs   | Memory Used |
+ --------------------+---------+-------------+
+ std::sort           |  9.658  |    885 MB   |
+ pdqsort             | 15.247  |   1605 MB   |
+                     |         |             |
+ std::stable_sort    | 19.753  |   1041 MB   |
+ spin_sort           | 17.596  |   1041 MB   |
+ flat_stable_sort    | 19.159  |    887 MB   |
+ spreadsort          |  5.221  |    885 MB   |
+ --------------------+---------+-------------+
+
+``
+
+[teletype]
+``
+
+ =============================================================
+ =                     OBJECT COMPARISON                     =
+ =                   ---------------------                   =
+ =                                                           =
+ =  The objects are arrays of 64 bits numbers                =
+ =                                                           =
+ =  They are compared in two ways :                          =
+ =                                                           =
+ =  (H) Heavy : The comparison is the sum of all the numbers =
+ =              of the array. In each comparison, sum all    =
+ =              the numbers of the array                     =
+ =                                                           =
+ =  (L) Light : The comparison is with the first element of  =
+ =              the array, as a key                          =
+ =                                                           =
+ =============================================================
+
+                  | 100000000 |  50000000 |  25000000 |  12500000 |   6250000 |   1562500 |
+                  | objects of| objects of| objects of| objects of| objects of| objects of|
+                  |  8 bytes  | 16 bytes  | 32 bytes  | 64 bytes  | 128 bytes | 512 bytes |
+                  |           |           |           |           |           |           |
+                  |   H    L  |   H    L  |   H    L  |   H    L  |   H    L  |   H    L  |
+ -----------------+-----------+-----------+-----------+-----------+-----------+-----------+
+ std::sort        |11.86 12.00| 6.53  6.10| 3.85  3.21| 2.79  1.97| 3.17  1.37| 2.04  1.30|
+ pdqsort          | 9.80  9.39| 5.39  4.98| 3.11  2.51| 2.14  1.61| 2.50  1.10| 1.92  1.03|
+                  |           |           |           |           |           |           |
+ std::stable_sort |12.91 13.58| 7.73  7.32| 5.16  4.52| 4.22  3.67| 4.31  3.18| 3.46  2.89|
+ spinsort         |11.58 11.37| 6.88  6.40| 4.43  3.76| 3.58  3.06| 3.84  2.41| 2.76  2.17|
+ flat_stable_sort |13.31 13.87| 8.35  7.83| 5.32  4.46| 4.16  3.14| 3.63  2.27| 2.67  2.13|
+ spreadsort       | 8.37  8.37| 6.51  6.62| 3.72  3.16| 2.75  1.69| 2.56  1.20| 1.38  0.80|
+                  |           |           |           |           |           |           |
+ -----------------+-----------+-----------+-----------+-----------+-----------+-----------+
+
+
+                  |   Maximum   |
+                  | Memory Used |
+ -----------------+-------------+
+ std::sort        |    783 MB   |
+ pdqsort          |    783 MB   |
+                  |             |
+ std::stable_sort |   1174 MB   |
+ spinsort         |   1174 MB   |
+ flat_stable_sort |    787 MB   |
+ spreadsort       |    783 MB   |
+ -----------------+-------------+
+``
+
+
+[endsect]
+
+