Boost C++ Libraries Home Libraries People FAQ More

PrevUpHomeNext

L-BFGS

Synopsis
#include <boost/math/optimization/lbfgs.hpp>

namespace boost {
namespace math {
namespace optimization {

namespace rdiff = boost::math::differentiation::reverse_mode;

/**
 *
 * @brief Limited-memory BFGS (L-BFGS) optimizer
 *
 * The `lbfgs` class implements the Limited-memory BFGS optimization algorithm,
 * a quasi-Newton method that approximates the inverse Hessian using a rolling
 * window of the last `m` updates. It is suitable for medium- to large-scale
 * optimization problems where full Hessian storage is infeasible.
 *
 * @tparam ArgumentContainer container type for the parameters, e.g.
 * std::vector<RealType>
 * @tparam RealType scalar floating-point type (e.g. double, float)
 * @tparam Objective objective function type; must support "f(x)" evaluation
 * @tparam InitializationPolicy policy for initializing x
 * @tparam ObjectiveEvalPolicy policy for computing the objective value
 * @tparam GradEvalPolicy policy for computing gradients
 * @tparam LineSearchPolicy line search policy, e.g. Armijo, StrongWolfe
 *
 * https://en.wikipedia.org/wiki/Limited-memory_BFGS
 */

template<typename ArgumentContainer,
         typename RealType,
         class Objective,
         class InitializationPolicy,
         class ObjectiveEvalPolicy,
         class GradEvalPolicy,
         class LineSearchPolicy>
class lbfgs
{
public:
  // Construct the optimizer. `m` is the history size: the number of recent
  // correction pairs kept for the two-loop recursion. `up` controls how
  // correction pairs are accepted and stored.
  lbfgs(Objective&& objective,
        ArgumentContainer& x,
        size_t m,
        InitializationPolicy&& ip,
        ObjectiveEvalPolicy&& oep,
        GradEvalPolicy&& gep,
        lbfgs_update_policy<RealType>&& up,
        LineSearchPolicy&& lsp);

  // Perform a single L-BFGS iteration: evaluate the gradient, compute the
  // search direction, run the line search, and update x in place.
  void step();
};

/* Convenience overloads */
/* Create an l-bfgs optimizer from:
 * - an objective function
 * - an argument container holding the initial x (updated in place)
 * optional:
 * - history size m: how many past correction pairs to keep (default 10)
 */
template<class Objective, typename ArgumentContainer>
auto make_lbfgs(Objective&& obj, ArgumentContainer& x, std::size_t m = 10);

/* Construct an lbfgs optimizer with a custom initialization policy. */
template<class Objective,
         typename ArgumentContainer,
         class InitializationPolicy>
auto make_lbfgs(Objective&& obj,
           ArgumentContainer& x,
           std::size_t m,
           InitializationPolicy&& ip);

/* Construct an lbfgs optimizer with custom initialization and line search
 * policies.
 */
template<class Objective,
         typename ArgumentContainer,
         class InitializationPolicy,
         class LineSearchPolicy>
auto make_lbfgs(Objective&& obj,
           ArgumentContainer& x,
           std::size_t m,
           InitializationPolicy&& ip,
           LineSearchPolicy&& lsp);

/* Construct an lbfgs optimizer with:
 * - a custom initialization policy
 * - a function evaluation policy
 * - a gradient evaluation policy
 * - a line search policy
 */
template<class Objective,
         typename ArgumentContainer,
         class InitializationPolicy,
         class FunctionEvalPolicy,
         class GradientEvalPolicy,
         class LineSearchPolicy>
auto make_lbfgs(Objective&& obj,
           ArgumentContainer& x,
           std::size_t m,
           InitializationPolicy&& ip,
           FunctionEvalPolicy&& fep,
           GradientEvalPolicy&& gep,
           LineSearchPolicy&& lsp);

} // namespace optimization
} // namespace math
} // namespace boost

L-BFGS (limited-memory BFGS) is a quasi-Newton optimizer that builds an approximation to the inverse Hessian using only first-order information (function values and gradients). Unlike full BFGS, it does not store or update a dense matrix; instead it maintains a fixed-size history of the most recent m correction pairs and computes the search direction using a two-loop recursion. In practice, L-BFGS often converges in significantly fewer iterations than plain gradient-based methods, especially on smooth, ill-conditioned objectives.

Algorithm

At each iteration k, L-BFGS:

* Evaluates the gradient g_k = ∇f(x_k).
* Computes a quasi-Newton search direction p_k from the last m updates via the two-loop recursion.
* Chooses a step length alpha_k using a line search policy.
* Updates the parameters:

x_{k+1} = x_k + alpha_k * p_k

* Forms the correction pair:

s_k = x_{k+1} - x_k,   y_k = g_{k+1} - g_k

and stores up to the last m pairs (s_k, y_k).

The line search is a key part of practical LBFGS: it typically removes the need to hand-tune a learning rate and improves robustness.

Parameters
Notes
Example : Thomson Problem
#include <boost/math/differentiation/autodiff_reverse.hpp>
#include <boost/math/optimization/lbfgs.hpp>
#include <boost/math/optimization/minimizer.hpp>

#include <cmath>
#include <fstream>
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>
namespace rdiff = boost::math::differentiation::reverse_mode;
namespace bopt  = boost::math::optimization;
/**
 * @brief Draw a uniformly distributed double from [min, max).
 *
 * @param min lower bound (default 0.0)
 * @param max upper bound (default 1.0)
 * @return a uniform random sample
 */
double random_double(double min = 0.0, double max = 1.0)
{
    // One engine per thread, seeded once from the OS entropy source.
    static thread_local std::mt19937 engine{std::random_device{}()};
    return std::uniform_real_distribution<double>{min, max}(engine);
}

template<typename S>
struct vec3
{
    /**
     * @brief Cartesian (R^3) coordinates of a particle on the Thomson sphere.
     */
    S x, y, z;
};

/**
 * @brief Convert spherical angles (theta, phi) to a unit vector in R^3.
 *
 * theta is the polar angle, phi the azimuthal angle. sin/cos are left
 * unqualified so ADL can select overloads for AD number types.
 */
template<class S>
static inline vec3<S> sph_to_xyz(const S& theta, const S& phi)
{
    const S st = sin(theta);
    return {st * cos(phi), st * sin(phi), cos(theta)};
}

/**
 * @brief Total inverse-distance (Coulomb-like) energy of particles on the
 *        unit sphere.
 *
 * Interprets r as N = r.size()/2 particles given as (theta, phi) pairs,
 * converts each to Cartesian coordinates, and sums 1/d over all unordered
 * pairs, where d is the Euclidean distance. A small epsilon in the squared
 * distance keeps the denominator nonzero for coincident points.
 *
 * @param r flattened parameters [theta_0, phi_0, theta_1, phi_1, ...]
 * @return total pairwise energy
 */
template<typename T>
T thomson_energy(std::vector<T>& r)
{
    const size_t n_particles = r.size() / 2;
    const T      eps         = T(1e-12);

    T energy = 0;
    for (size_t a = 0; a < n_particles; ++a) {
        const auto pa = sph_to_xyz(r[2 * a + 0], r[2 * a + 1]);

        for (size_t b = a + 1; b < n_particles; ++b) {
            const auto pb = sph_to_xyz(r[2 * b + 0], r[2 * b + 1]);

            const T dx = pa.x - pb.x;
            const T dy = pa.y - pb.y;
            const T dz = pa.z - pb.z;

            // sqrt left unqualified for ADL on AD number types.
            energy += 1.0 / sqrt(dx * dx + dy * dy + dz * dz + eps);
        }
    }
    return energy;
}

template<class T>
std::vector<rdiff::rvar<T, 1>> init_theta_phi_uniform(size_t N, unsigned seed = 12345)
{
    const T pi = T(3.1415926535897932384626433832795);

    std::mt19937                      rng(seed);
    std::uniform_real_distribution<T> unif01(T(0), T(1));
    std::uniform_real_distribution<T> unifm11(T(-1), T(1));

    std::vector<rdiff::rvar<T, 1>> u;
    u.reserve(2 * N);

    for (size_t i = 0; i < N; ++i) {
        T z     = unifm11(rng);
        T phi   = (T(2) * pi) * unif01(rng) - pi;
        T theta = std::acos(z);

        u.emplace_back(theta);
        u.emplace_back(phi);
    }
    return u;
}

/**
 * @brief Minimize the Thomson-problem energy for N particles with L-BFGS.
 *
 * Usage: program <N>. Writes the final particle positions to
 * thomson_<N>.csv and the per-step energy history (plus the final energy)
 * to lbfgs_energy_<N>.csv.
 */
int main(int argc, char* argv[])
{

    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <N>\n";
        return 1;
    }

    // Validate the particle count up front: std::stoi throws on non-numeric
    // input, and a negative value would be silently converted to a huge
    // size_t by init_theta_phi_uniform.
    int N = 0;
    try {
        N = std::stoi(argv[1]);
    } catch (const std::exception&) {
        std::cerr << "Error: <N> must be an integer, got '" << argv[1] << "'\n";
        return 1;
    }
    if (N <= 0) {
        std::cerr << "Error: <N> must be positive\n";
        return 1;
    }

    auto u_ad = init_theta_phi_uniform<double>(N);

    // Objective is the templated energy instantiated for reverse-mode AD.
    auto lbfgs_opt = bopt::make_lbfgs(&thomson_energy<rdiff::rvar<double, 1>>, u_ad);

    // filenames
    std::string pos_filename    = "thomson_" + std::to_string(N) + ".csv";
    std::string energy_filename = "lbfgs_energy_" + std::to_string(N) + ".csv";

    std::ofstream pos_out(pos_filename);
    std::ofstream energy_out(energy_filename);
    if (!pos_out || !energy_out) {
        std::cerr << "Error: could not open output files\n";
        return 1;
    }

    energy_out << "step,energy\n";

    auto result = minimize(lbfgs_opt);

    // Final particle positions, converted back to Cartesian coordinates.
    for (int i = 0; i < N; ++i) {
        double theta = u_ad[2 * i + 0].item();
        double phi   = u_ad[2 * i + 1].item();
        auto   r     = sph_to_xyz(theta, phi);
        pos_out << i << "," << r.x << "," << r.y << "," << r.z << "\n";
    }

    // Per-step energy history, then the final energy (empty step column).
    auto E    = lbfgs_opt.objective_value();
    int  step = 0;
    for (const auto& obj_hist : result.objective_history) {
        energy_out << step << "," << obj_hist << "\n";
        ++step;
    }
    energy_out << "," << E << "\n";

    pos_out.close();
    energy_out.close();

    return 0;
}

For the N = 2 case, L-BFGS requires only 5 iterations to converge; the Nesterov (NAG) version of this problem converges in 4663 iterations with default parameters, and gradient descent requires 93799 iterations. Convergence here means the norm of the gradient falls below 1e-3. Below is a plot comparing the three methods for the N = 20 particle case.

In this case, gradient descent reaches the maximum number of iterations and does not converge, NAG converges in 150 iterations, and L-BFGS converges in 67 iterations.


PrevUpHomeNext