#include <boost/math/optimization/gradient_descent.hpp>

template<typename ArgumentContainer, typename RealType, class Objective,
         class InitializationPolicy, class ObjectiveEvalPolicy, class GradEvalPolicy>
class gradient_descent {
public:
    void step();
};

/* Convenience overloads */

/* make gradient descent by providing
 *  - objective function
 *  - variables to optimize over
 *  - optionally, learning rate
 *
 * requires that code is written using boost::math::differentiation::rvar
 */
template<class Objective, typename ArgumentContainer, typename RealType>
auto make_gradient_descent(Objective&& obj, ArgumentContainer& x,
                           RealType lr = RealType{0.01});

/* make gradient descent by providing
 *  - objective function
 *  - variables to optimize over
 *  - learning rate (not optional)
 *  - initialization policy
 *
 * requires that code is written using boost::math::differentiation::rvar
 */
template<class Objective, typename ArgumentContainer, typename RealType,
         class InitializationPolicy>
auto make_gradient_descent(Objective&& obj, ArgumentContainer& x, RealType lr,
                           InitializationPolicy&& ip);

/* make gradient descent by providing
 *  - objective function
 *  - variables to optimize over
 *  - learning rate (not optional)
 *  - variable initialization policy
 *  - objective evaluation policy
 *  - gradient evaluation policy
 *
 * code does not have to use boost::math::differentiation::rvar
 */
template<typename ArgumentContainer, typename RealType, class Objective,
         class InitializationPolicy, class ObjectiveEvalPolicy, class GradEvalPolicy>
auto make_gradient_descent(Objective&& obj, ArgumentContainer& x, RealType& lr,
                           InitializationPolicy&& ip, ObjectiveEvalPolicy&& oep,
                           GradEvalPolicy&& gep);
Gradient descent iteratively updates the parameters x
in the direction opposite to the gradient of the objective function, thereby
minimizing the objective:

x[i] -= lr * g[i]

where lr is a user-defined learning rate. For a more complete description of
the underlying theory, see the Wikipedia page.
The implementation delegates:

- the initialization of differentiable variables to an initialization policy
- objective evaluation to an objective evaluation policy
- the gradient computation to a gradient evaluation policy
- the parameter updates to an update policy
The interface is intended to be PyTorch-like: an optimizer object is
constructed and advanced with a step() method. A helper minimize
function is also provided.
Objective&& obj : the objective function to minimize.

ArgumentContainer& x : the variables to optimize over.

RealType& lr : the learning rate. A larger value takes larger steps during
descent, leading to faster but less stable convergence; conversely, smaller
values are more stable but take longer to converge.

InitializationPolicy&& ip : initialization policy for optimizer state and
variables. Users may supply a custom initialization policy to control how the
argument container and any AD-specific runtime state (e.g. reverse-mode tape
attachment/reset) are initialized. By default, the optimizer uses the
user-provided initial values in x and performs the standard reverse-mode AD
initialization required for gradient evaluation. Custom initialization
policies are useful for randomized starts, non-rvar AD types, or when
gradients are supplied externally. See the reverse-mode autodiff policy
documentation for the required initialization policy interface when writing
custom policies.

ObjectiveEvalPolicy&& oep : tells the optimizer how to evaluate the objective
function. Defaults to reverse_mode_function_eval_policy<RealType>.

GradEvalPolicy&& gep : tells the optimizer how to evaluate the gradient of
the objective function. Defaults to
reverse_mode_gradient_evaluation_policy<RealType>.
In this section we present an example of finding optimal configurations of
electrically charged particles confined to a sphere of radius R = 1. This
problem is known as the Thomson problem. In summary, we are looking for the
minimum-energy configuration of an N-electron system subject to the Coulomb
potential, confined to the $S^2$ sphere. The Coulomb potential is given by:

$E = \sum_{i < j} \frac{1}{\lVert \mathbf{r}_i - \mathbf{r}_j \rVert}$
The code below manually minimizes the above potential energy function for N particles over their two angular positions.
#include <boost/math/differentiation/autodiff_reverse.hpp>
#include <boost/math/optimization/gradient_descent.hpp>
#include <boost/math/optimization/minimizer.hpp>
#include <cmath>
#include <fstream>
#include <iostream>
#include <random>
#include <string>

namespace rdiff = boost::math::differentiation::reverse_mode;
namespace bopt  = boost::math::optimization;

double random_double(double min = 0.0, double max = 1.0)
{
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<double> dist(min, max);
    return dist(rng);
}

template<typename S>
struct vec3
{
    /**
     * @brief R^3 coordinates of particle on Thomson sphere
     */
    S x, y, z;
};

template<class S>
static inline vec3<S> sph_to_xyz(const S& theta, const S& phi)
{
    /**
     * convenience overload to convert from [theta, phi] -> (x, y, z)
     */
    return {sin(theta) * cos(phi), sin(theta) * sin(phi), cos(theta)};
}

template<typename T>
T thomson_energy(std::vector<T>& r)
{
    const size_t N = r.size() / 2;
    const T tiny = T(1e-12);
    T E = 0;
    for (size_t i = 0; i < N; ++i)
    {
        const T& theta_i = r[2 * i + 0];
        const T& phi_i   = r[2 * i + 1];
        auto ri = sph_to_xyz(theta_i, phi_i);
        for (size_t j = i + 1; j < N; ++j)
        {
            const T& theta_j = r[2 * j + 0];
            const T& phi_j   = r[2 * j + 1];
            auto rj = sph_to_xyz(theta_j, phi_j);
            T dx = ri.x - rj.x;
            T dy = ri.y - rj.y;
            T dz = ri.z - rj.z;
            T d2 = dx * dx + dy * dy + dz * dz + tiny;
            E += 1.0 / sqrt(d2);
        }
    }
    return E;
}

template<class T>
std::vector<rdiff::rvar<T, 1>> init_theta_phi_uniform(size_t N, unsigned seed = 12345)
{
    const T pi = T(3.1415926535897932384626433832795);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<T> unif01(T(0), T(1));
    std::uniform_real_distribution<T> unifm11(T(-1), T(1));
    std::vector<rdiff::rvar<T, 1>> u;
    u.reserve(2 * N);
    for (size_t i = 0; i < N; ++i)
    {
        T z     = unifm11(rng);
        T phi   = (T(2) * pi) * unif01(rng) - pi;
        T theta = std::acos(z);
        u.emplace_back(theta);
        u.emplace_back(phi);
    }
    return u;
}

int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        std::cerr << "Usage: " << argv[0] << " <N>\n";
        return 1;
    }
    const int N      = std::stoi(argv[1]);
    const int NSTEPS = 100000;
    const double lr  = 1e-3;

    auto u_ad  = init_theta_phi_uniform<double>(N);
    auto gdopt = bopt::make_gradient_descent(&thomson_energy<rdiff::rvar<double, 1>>, u_ad, lr);

    // filenames
    std::string pos_filename    = "thomson_" + std::to_string(N) + ".csv";
    std::string energy_filename = "energy_" + std::to_string(N) + ".csv";
    std::ofstream pos_out(pos_filename);
    std::ofstream energy_out(energy_filename);
    pos_out << "step,particle,x,y,z\n";
    energy_out << "step,energy\n";

    for (int step = 0; step < NSTEPS; ++step)
    {
        gdopt.step();
        for (int pi = 0; pi < N; ++pi)
        {
            double theta = u_ad[2 * pi + 0].item();
            double phi   = u_ad[2 * pi + 1].item();
            auto r = sph_to_xyz(theta, phi);
            pos_out << step << "," << pi << "," << r.x << "," << r.y << "," << r.z << "\n";
        }
        auto E = gdopt.objective_value();
        energy_out << step << "," << E << "\n";
    }
    pos_out.close();
    energy_out.close();
    return 0;
}
The variable

const int N = std::stoi(argv[1]);

is the number of particles, read from the command line;

const int NSTEPS = 100000;

is the number of optimization steps; and

const double lr = 1e-3;

is the optimizer learning rate. As the code is written, the optimizer runs
for 100000 steps. Running the program with
./thomson_sphere N
optimizes the N-particle system. Below is a plot of several optimal configurations for N = 2, ..., 8 particles.
Below is a plot of the final energy of the system and its deviation from the theoretically predicted values. The table of theoretical energy values for the problem is from Wikipedia.
[Figure: thomson_energy_error_gradient_descent.svg]
Often, we don't want to implement our own stepping loop; instead, we want the optimizer to stop once a convergence criterion is met. In the above example, we need to include the minimizer.hpp header:
#include <boost/math/optimization/minimizer.hpp>
and replace the optimization loop:
for (int step = 0; step < NSTEPS; ++step)
{
    gdopt.step();
    for (int pi = 0; pi < N; ++pi)
    {
        double theta = u_ad[2 * pi + 0].item();
        double phi   = u_ad[2 * pi + 1].item();
        auto r = sph_to_xyz(theta, phi);
        pos_out << step << "," << pi << "," << r.x << "," << r.y << "," << r.z << "\n";
    }
    auto E = gdopt.objective_value();
    energy_out << step << "," << E << "\n";
}
with
auto result = minimize(gdopt);
minimize returns an
optimization_result<typename Optimizer::real_type_t>
, a struct with the following fields:
size_t num_iter;
RealType objective_value;
std::vector<RealType> objective_history;
bool converged;
where num_iter is the number of iterations the optimizer performed,
objective_value is the final objective value, objective_history holds the
intermediate objective values, and converged indicates whether the
convergence criterion was satisfied. By default, minimize(optimizer)
uses a gradient-norm convergence criterion: the criterion is satisfied when
norm(gradient_vector) < 1e-3. The maximum number of iterations is set at
100000. For more information on how to use minimize,
check the minimize docs. With default parameters, gradient descent solves
the N=2 problem in 93799 steps.