This adds a improved reduce() algorithm implementation for
GPUs. Also adds checks to accumulate() which allow it to
use the higher-performance reduce() algorithm if possible.
This removes the init argument from reduce. This simplifies the
implementation and avoids copying a value from the host to the
device on every call to reduce.
If an initial value is required, the accumulate function can be
called instead.
refs kylelutz/compute#9
device, context, and queue are initialized statically in `context_setup.hpp`.
With this change all tests are able to complete when an NVIDIA GPU is in
exclusive compute mode.
Side effect of the change:
Time for all tests to complete reduced from 15.71 to 13.03 sec Tesla C2075.