[/
Copyright (c) 2025-2026 Maksym Zhelyeznyakov
Use, modification and distribution are subject to the
Boost Software License, Version 1.0. (See accompanying file
LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
]
[section:gd_opt Gradient Based Optimizers]
[template autodiff_equation[name] '''<inlinemediaobject><imageobject><imagedata fileref="../equations/autodiff/'''[name]'''"></imagedata></imageobject></inlinemediaobject>''']
[template autodiff_graph[name] '''<inlinemediaobject><imageobject><imagedata fileref="../graphs/autodiff/'''[name]'''"></imagedata></imageobject></inlinemediaobject>''']
[template gbo_graph[name] '''<inlinemediaobject><imageobject><imagedata fileref="../graphs/gradient_based_optimizers/'''[name]'''"></imagedata></imageobject></inlinemediaobject>''']
[section:introduction Introduction]
Gradient-based optimizers are algorithms that use the gradient of a function to iteratively find locally extreme points of functions over a set of parameters. This section describes a set of gradient-based optimizers. The optimizers are written with `boost::math::differentiation::reverse_mode::rvar` in mind; however, if another way to evaluate the function and its gradient is provided, the optimizers work in exactly the same way.
[endsect] [/section:introduction]
[section:gradient_descent Gradient Descent]
[heading Synopsis]
``
#include <boost/math/optimization/gradient_descent.hpp>

template<typename ArgumentContainer,
         typename RealType,
         class Objective,
         class InitializationPolicy,
         class ObjectiveEvalPolicy,
         class GradEvalPolicy>
class gradient_descent {
public:
    void step();
};

/* Convenience overloads */

/* Make gradient descent by providing:
 * - objective function
 * - variables to optimize over
 * - optionally, a learning rate
 *
 * Requires that code is written using boost::math::differentiation::rvar.
 */
template<class Objective, typename ArgumentContainer, typename RealType>
auto make_gradient_descent(Objective&& obj, ArgumentContainer& x, RealType lr = RealType{ 0.01 });

/* Make gradient descent by providing:
 * - objective function
 * - variables to optimize over
 * - learning rate (not optional)
 * - initialization policy
 *
 * Requires that code is written using boost::math::differentiation::rvar.
 */
template<class Objective, typename ArgumentContainer, typename RealType, class InitializationPolicy>
auto make_gradient_descent(Objective&& obj,
                           ArgumentContainer& x,
                           RealType lr,
                           InitializationPolicy&& ip);

/* Make gradient descent by providing:
 * - objective function
 * - variables to optimize over
 * - learning rate (not optional)
 * - variable initialization policy
 * - objective evaluation policy
 * - gradient evaluation policy
 *
 * Code does not have to use boost::math::differentiation::rvar.
 */
template<typename ArgumentContainer,
         typename RealType,
         class Objective,
         class InitializationPolicy,
         class ObjectiveEvalPolicy,
         class GradEvalPolicy>
auto make_gradient_descent(Objective&& obj,
                           ArgumentContainer& x,
                           RealType lr,
                           InitializationPolicy&& ip,
                           ObjectiveEvalPolicy&& oep,
                           GradEvalPolicy&& gep);
``
Gradient descent iteratively updates parameters `x` in the direction opposite to the gradient of the objective function (minimizing the objective).
``
x[i] -= lr * g[i]
``
where `lr` is a user-defined learning rate. For a more complete description of the theoretical principle, see [@https://en.wikipedia.org/wiki/Gradient_descent the Wikipedia page].
The implementation delegates:

* the initialization of differentiable variables to an initialization policy
* objective evaluation to an objective evaluation policy
* the gradient computation to a gradient evaluation policy
* the parameter updates to an update policy

The interface is intended to be PyTorch-like: an optimizer object is constructed and advanced with a `step()` method. A helper `minimize` function is also provided.
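The update rule is easiest to see in isolation. Below is a minimal, library-free sketch of the same `x[i] -= lr * g[i]` loop on a toy quadratic (the objective, starting point, and step count here are illustrative only, not part of the library):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gradient of the toy objective f(x0, x1) = (x0 - 3)^2 + (x1 + 1)^2.
std::vector<double> gradient(const std::vector<double>& x)
{
    return { 2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0) };
}

// One gradient-descent step: x[i] -= lr * g[i].
void gd_step(std::vector<double>& x, double lr)
{
    auto g = gradient(x);
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] -= lr * g[i];
}

// Run a fixed number of steps and return the final iterate.
std::vector<double> run_gd(std::vector<double> x, double lr, int steps)
{
    for (int k = 0; k < steps; ++k)
        gd_step(x, lr);
    return x;
}
```

Starting from the origin with `lr = 0.1`, the iterates contract geometrically toward the minimizer `(3, -1)`.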
[heading Parameters]
* `Objective&& obj` : objective function to minimize.
* `ArgumentContainer& x` : variables to optimize over.
* `RealType lr` : learning rate. A larger value takes larger steps during descent, leading to faster but less stable convergence. Conversely, smaller values are more stable but take longer to converge.
* `InitializationPolicy&& ip` : initialization policy for `ArgumentContainer`, i.e. the initial guess. By default it is `tape_initializer_rvar<RealType>`, which lets the user provide the initial guess by setting the values of `x` manually. For more information, see the Policies section.
* `ObjectiveEvalPolicy&& oep` : tells the optimizer how to evaluate the objective function. By default `reverse_mode_function_eval_policy<RealType>`.
* `GradEvalPolicy&& gep` : tells the optimizer how to evaluate the gradient of the objective function. By default `reverse_mode_gradient_evaluation_policy<RealType>`.
[heading Example using a manual optimization]
In this section we present an example of finding optimal configurations of electrically charged particles confined to a sphere of radius `R = 1`. This problem is known as the [@https://en.wikipedia.org/wiki/Thomson_problem Thomson problem]: we seek the minimum-energy configuration of an N-electron system interacting through the Coulomb potential, confined to the sphere S^2. The Coulomb potential is given by:
[:[:[autodiff_equation thomson_potential.svg]]]
The code below manually minimizes the above potential energy function for N particles over their two angular positions.
``
#include <boost/math/differentiation/autodiff_reverse.hpp>
#include <boost/math/optimization/gradient_descent.hpp>
#include <boost/math/optimization/minimizer.hpp>
#include <cmath>
#include <fstream>
#include <iostream>
#include <random>
#include <string>

namespace rdiff = boost::math::differentiation::reverse_mode;
namespace bopt  = boost::math::optimization;

double random_double(double min = 0.0, double max = 1.0)
{
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<double> dist(min, max);
    return dist(rng);
}

template<typename S>
struct vec3
{
    /**
     * @brief R^3 coordinates of a particle on the Thomson sphere
     */
    S x, y, z;
};

template<class S>
static inline vec3<S> sph_to_xyz(const S& theta, const S& phi)
{
    /**
     * convenience overload to convert from [theta, phi] -> x, y, z
     */
    return {sin(theta) * cos(phi), sin(theta) * sin(phi), cos(theta)};
}

template<typename T>
T thomson_energy(std::vector<T>& r)
{
    const size_t N = r.size() / 2;
    const T tiny = T(1e-12);
    T E = 0;
    for (size_t i = 0; i < N; ++i) {
        const T& theta_i = r[2 * i + 0];
        const T& phi_i   = r[2 * i + 1];
        auto ri = sph_to_xyz(theta_i, phi_i);
        for (size_t j = i + 1; j < N; ++j) {
            const T& theta_j = r[2 * j + 0];
            const T& phi_j   = r[2 * j + 1];
            auto rj = sph_to_xyz(theta_j, phi_j);
            T dx = ri.x - rj.x;
            T dy = ri.y - rj.y;
            T dz = ri.z - rj.z;
            T d2 = dx * dx + dy * dy + dz * dz + tiny;
            E += 1.0 / sqrt(d2);
        }
    }
    return E;
}

template<class T>
std::vector<rdiff::rvar<T, 1>> init_theta_phi_uniform(size_t N, unsigned seed = 12345)
{
    const T pi = T(3.1415926535897932384626433832795);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<T> unif01(T(0), T(1));
    std::uniform_real_distribution<T> unifm11(T(-1), T(1));
    std::vector<rdiff::rvar<T, 1>> u;
    u.reserve(2 * N);
    for (size_t i = 0; i < N; ++i) {
        T z     = unifm11(rng);
        T phi   = (T(2) * pi) * unif01(rng) - pi;
        T theta = std::acos(z);
        u.emplace_back(theta);
        u.emplace_back(phi);
    }
    return u;
}

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <N>\n";
        return 1;
    }
    const int N = std::stoi(argv[1]);
    const int NSTEPS = 100000;
    const double lr = 1e-3;
    auto u_ad = init_theta_phi_uniform<double>(N);
    auto gdopt = bopt::make_gradient_descent(&thomson_energy<rdiff::rvar<double, 1>>, u_ad, lr);
    // filenames
    std::string pos_filename = "thomson_" + std::to_string(N) + ".csv";
    std::string energy_filename = "energy_" + std::to_string(N) + ".csv";
    std::ofstream pos_out(pos_filename);
    std::ofstream energy_out(energy_filename);
    pos_out << "step,particle,x,y,z\n";
    energy_out << "step,energy\n";
    for (int step = 0; step < NSTEPS; ++step) {
        gdopt.step();
        for (int pi = 0; pi < N; ++pi) {
            double theta = u_ad[2 * pi + 0].item();
            double phi   = u_ad[2 * pi + 1].item();
            auto r = sph_to_xyz(theta, phi);
            pos_out << step << "," << pi << "," << r.x << "," << r.y << "," << r.z << "\n";
        }
        auto E = gdopt.objective_value();
        energy_out << step << "," << E << "\n";
    }
    pos_out.close();
    energy_out.close();
    return 0;
}
``
The variable
``const int N = std::stoi(argv[1]);``
is the number of particles, read from the command line;
``const int NSTEPS = 100000;``
is the number of optimization steps; and
``const double lr = 1e-3;``
is the optimizer learning rate. As written, the optimizer runs for 100000 steps. Running the program with
```
./thomson_sphere N
```
optimizes the N-particle system. Below is a plot of several optimal configurations for N = 2, ..., 8 particles.
[:[:[gbo_graph thomson_sphere_2to8.svg]]]
Below is a plot of the final energy of the system, and its deviation from the theoretically predicted values. The table of theoretical energy values for the problem is from [@https://en.wikipedia.org/wiki/Thomson_problem Wikipedia].
[:[:[gbo_graph thomson_energy_error_gradient_descent.svg]]]
[heading Example using minimize]
Often we don't want to implement our own stepping loop; instead we want to run until a convergence criterion is satisfied. In the above example, we need to include the minimizer.hpp header:
```
#include <boost/math/optimization/minimizer.hpp>
```
and replace the optimization loop:
```
for (int step = 0; step < NSTEPS; ++step) {
gdopt.step();
for (int pi = 0; pi < N; ++pi) {
double theta = u_ad[2 * pi + 0].item();
double phi = u_ad[2 * pi + 1].item();
auto r = sph_to_xyz(theta, phi);
pos_out << step << "," << pi << "," << r.x << "," << r.y << "," << r.z << "\n";
}
auto E = gdopt.objective_value();
energy_out << step << "," << E << "\n";
}
```
with
```
auto result = minimize(gdopt);
```
minimize returns an ``optimization_result<typename Optimizer::real_type_t>``, a struct with the following fields:
```
size_t num_iter;
RealType objective_value;
std::vector<RealType> objective_history;
bool converged;
```
where `num_iter` is the number of iterations the optimizer performed, `objective_value` is the final objective value, `objective_history` holds the intermediate objective values, and `converged` indicates whether the convergence criterion was satisfied. By default, `minimize(optimizer)` uses a gradient-norm convergence criterion: the criterion is satisfied when `norm(gradient_vector) < 1e-3`. The maximum number of iterations is 100000. For more information on how to use `minimize`, see the minimize section. With default parameters, gradient descent solves the `N = 2` problem in `93799` steps.
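As a sketch of what the default criterion computes (this mirrors the description above, not the library's actual implementation): convergence is declared when the Euclidean norm of the gradient vector drops below the tolerance.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gradient-norm convergence test: converged when ||g||_2 < tol.
// The default tolerance used by `minimize` is 1e-3.
bool gradient_norm_converged(const std::vector<double>& g, double tol)
{
    double sq = 0.0;
    for (double gi : g) sq += gi * gi;  // accumulate squared components
    return std::sqrt(sq) < tol;
}
```

For example, a gradient of `{1e-4, 1e-4}` (norm about `1.4e-4`) satisfies the default tolerance, while `{1e-2, 0}` does not.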
[endsect] [/section:gradient_descent]
[section:nesterov Nesterov Accelerated Gradient Descent]
[heading Synopsis]
```
#include <boost/math/optimization/nesterov.hpp>

namespace boost {
namespace math {
namespace optimization {

namespace rdiff = boost::math::differentiation::reverse_mode;

/**
 * @brief The nesterov_accelerated_gradient class
 *
 * https://jlmelville.github.io/mize/nesterov.html
 */
template<typename ArgumentContainer,
         typename RealType,
         class Objective,
         class InitializationPolicy,
         class ObjectiveEvalPolicy,
         class GradEvalPolicy>
class nesterov_accelerated_gradient
    : public abstract_optimizer<
          ArgumentContainer,
          RealType,
          Objective,
          InitializationPolicy,
          ObjectiveEvalPolicy,
          GradEvalPolicy,
          nesterov_update_policy<RealType>,
          nesterov_accelerated_gradient<ArgumentContainer,
                                        RealType,
                                        Objective,
                                        InitializationPolicy,
                                        ObjectiveEvalPolicy,
                                        GradEvalPolicy>>
{
public:
    nesterov_accelerated_gradient(Objective&& objective,
                                  ArgumentContainer& x,
                                  InitializationPolicy&& ip,
                                  ObjectiveEvalPolicy&& oep,
                                  GradEvalPolicy&& gep,
                                  nesterov_update_policy<RealType>&& up);
    void step();
};

/* Convenience overloads */

/* Make Nesterov accelerated gradient descent by providing:
 * - objective function
 * - variables to optimize over
 * - optionally:
 *   - lr: learning rate / step size (typical: 1e-4 .. 1e-1 depending on scaling)
 *   - mu: momentum coefficient in [0, 1) (typical: 0.8 .. 0.99)
 */
template<class Objective, typename ArgumentContainer, typename RealType>
auto make_nag(Objective&& obj,
              ArgumentContainer& x,
              RealType lr = RealType{ 0.01 },
              RealType mu = RealType{ 0.95 });

/* Provide an initialization policy;
 * lr and mu are no longer optional.
 */
template<class Objective,
         typename ArgumentContainer,
         typename RealType,
         class InitializationPolicy>
auto make_nag(Objective&& obj,
              ArgumentContainer& x,
              RealType lr,
              RealType mu,
              InitializationPolicy&& ip);

/* Provide:
 * - initialization policy
 * - objective evaluation policy
 * - gradient evaluation policy
 */
template<typename ArgumentContainer,
         typename RealType,
         class Objective,
         class InitializationPolicy,
         class ObjectiveEvalPolicy,
         class GradEvalPolicy>
auto make_nag(Objective&& obj,
              ArgumentContainer& x,
              RealType lr,
              RealType mu,
              InitializationPolicy&& ip,
              ObjectiveEvalPolicy&& oep,
              GradEvalPolicy&& gep);

} // namespace optimization
} // namespace math
} // namespace boost
```
Nesterov accelerated gradient (NAG) is a first-order optimizer that augments gradient descent with a momentum term and evaluates the gradient at a "lookahead" point. In practice this often improves convergence speed compared to vanilla gradient descent, especially in narrow valleys and ill-conditioned problems.
[heading Algorithm]
NAG maintains a "velocity" vector `v` (same shape as `x`). At iteration k it performs:
```
v_prev = v;
v = mu * v - lr * g;
x += -mu * v_prev + (1 + mu) * v;
```
where `lr` is the learning rate / step size and `mu` is the momentum coefficient (typically close to 1). Setting `mu = 0` reduces NAG to standard gradient descent.
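The velocity-form update above can be sketched without the library machinery. The following standalone C++ applies it to a toy 1-D quadratic (the objective and hyperparameters are illustrative only); note that with `mu = 0` each step collapses to the plain gradient-descent update.

```cpp
#include <cassert>
#include <cmath>

// Gradient of the toy objective f(x) = (x - 3)^2.
double gradient(double x) { return 2.0 * (x - 3.0); }

// One NAG step in the velocity form described above.
void nag_step(double& x, double& v, double lr, double mu)
{
    double v_prev = v;
    double g = gradient(x);
    v = mu * v - lr * g;
    x += -mu * v_prev + (1.0 + mu) * v;
}

// Run NAG for a fixed number of steps starting from x with zero velocity.
double run_nag(double x, double lr, double mu, int steps)
{
    double v = 0.0;
    for (int k = 0; k < steps; ++k)
        nag_step(x, v, lr, mu);
    return x;
}
```

With `lr = 0.01`, `mu = 0.9` the iterates spiral into the minimizer `x = 3`; with `mu = 0` the recursion is exactly gradient descent.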
[heading Parameters]
* `Objective&& obj` : objective function to minimize.
* `ArgumentContainer& x` : variables to optimize over. Updated in place.
* `RealType lr` : learning rate. Larger values take larger steps (faster but potentially unstable). Smaller values are more stable but converge more slowly.
* `RealType mu` : momentum coefficient in `[0, 1)`. Higher values, e.g. 0.9 to 0.99, typically accelerate convergence but may require a smaller `lr`.
* `InitializationPolicy&& ip` : initialization policy for the optimizer state and variables. For NAG, this also initializes the internal momentum/velocity state. By default the optimizer uses the same initializer as gradient descent and initializes the velocity to zero.
* `ObjectiveEvalPolicy&& oep` : objective evaluation policy. By default `reverse_mode_function_eval_policy<RealType>`.
* `GradEvalPolicy&& gep` : gradient evaluation policy. By default `reverse_mode_gradient_evaluation_policy<RealType>`.
[heading Notes]
* NAG uses the same policy-based design as gradient descent: initialization, objective evaluation, and gradient evaluation can be customized independently.
* When using reverse-mode AD `rvar`, the objective should be written in terms of AD variables so gradients can be obtained automatically by the default gradient evaluation policy.
* Typical tuning: start with `mu = 0.9` or `0.95`; if the objective oscillates or diverges, reduce `lr` (or slightly reduce `mu`).
[heading Example : Thomson Sphere]
```
#include <boost/math/differentiation/autodiff_reverse.hpp>
#include <boost/math/optimization/minimizer.hpp>
#include <boost/math/optimization/nesterov.hpp>
#include <cmath>
#include <fstream>
#include <iostream>
#include <random>
#include <string>

namespace rdiff = boost::math::differentiation::reverse_mode;
namespace bopt  = boost::math::optimization;

double random_double(double min = 0.0, double max = 1.0)
{
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<double> dist(min, max);
    return dist(rng);
}

template<typename S>
struct vec3
{
    /**
     * @brief R^3 coordinates of a particle on the Thomson sphere
     */
    S x, y, z;
};

template<class S>
static inline vec3<S> sph_to_xyz(const S& theta, const S& phi)
{
    /**
     * convenience overload to convert from [theta, phi] -> x, y, z
     */
    return {sin(theta) * cos(phi), sin(theta) * sin(phi), cos(theta)};
}

template<typename T>
T thomson_energy(std::vector<T>& r)
{
    /* pairwise Coulomb energy: sum over pairs of 1 / r_ij */
    const size_t N = r.size() / 2;
    const T tiny = T(1e-12);
    T E = 0;
    for (size_t i = 0; i < N; ++i) {
        const T& theta_i = r[2 * i + 0];
        const T& phi_i   = r[2 * i + 1];
        auto ri = sph_to_xyz(theta_i, phi_i);
        for (size_t j = i + 1; j < N; ++j) {
            const T& theta_j = r[2 * j + 0];
            const T& phi_j   = r[2 * j + 1];
            auto rj = sph_to_xyz(theta_j, phi_j);
            T dx = ri.x - rj.x;
            T dy = ri.y - rj.y;
            T dz = ri.z - rj.z;
            T d2 = dx * dx + dy * dy + dz * dz + tiny;
            E += 1.0 / sqrt(d2);
        }
    }
    return E;
}

template<class T>
std::vector<rdiff::rvar<T, 1>> init_theta_phi_uniform(size_t N, unsigned seed = 12345)
{
    const T pi = T(3.1415926535897932384626433832795);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<T> unif01(T(0), T(1));
    std::uniform_real_distribution<T> unifm11(T(-1), T(1));
    std::vector<rdiff::rvar<T, 1>> u;
    u.reserve(2 * N);
    for (size_t i = 0; i < N; ++i) {
        T z     = unifm11(rng);
        T phi   = (T(2) * pi) * unif01(rng) - pi;
        T theta = std::acos(z);
        u.emplace_back(theta);
        u.emplace_back(phi);
    }
    return u;
}

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <N>\n";
        return 1;
    }
    const int N = std::stoi(argv[1]);
    const double lr = 1e-3;
    const double mu = 0.95;
    auto u_ad = init_theta_phi_uniform<double>(N);
    auto nesterov_opt = bopt::make_nag(&thomson_energy<rdiff::rvar<double, 1>>, u_ad, lr, mu);
    // filenames
    std::string pos_filename = "thomson_" + std::to_string(N) + ".csv";
    std::string energy_filename = "nesterov_energy_" + std::to_string(N) + ".csv";
    std::ofstream pos_out(pos_filename);
    std::ofstream energy_out(energy_filename);
    energy_out << "step,energy\n";
    auto result = minimize(nesterov_opt);
    for (int pi = 0; pi < N; ++pi) {
        double theta = u_ad[2 * pi + 0].item();
        double phi   = u_ad[2 * pi + 1].item();
        auto r = sph_to_xyz(theta, phi);
        pos_out << pi << "," << r.x << "," << r.y << "," << r.z << "\n";
    }
    auto E = nesterov_opt.objective_value();
    int i = 0;
    for (auto& obj_hist : result.objective_history) {
        energy_out << i << "," << obj_hist << "\n";
        ++i;
    }
    energy_out << i << "," << E << "\n";
    pos_out.close();
    energy_out.close();
    return 0;
}
```
The Nesterov version of this problem converges much faster than plain gradient descent: only `4663` iterations with default parameters, versus the `93799` iterations required by gradient descent.
[:[:[gbo_graph nag_to_gd_comparison.svg]]]
[endsect] [/section:nesterov]
[section:lbfgs L-BFGS]
[heading Synopsis]
```
#include <boost/math/optimization/lbfgs.hpp>

namespace boost {
namespace math {
namespace optimization {

namespace rdiff = boost::math::differentiation::reverse_mode;

/**
 * @brief Limited-memory BFGS (L-BFGS) optimizer
 *
 * The `lbfgs` class implements the limited-memory BFGS optimization algorithm,
 * a quasi-Newton method that approximates the inverse Hessian using a rolling
 * window of the last `m` updates. It is suitable for medium- to large-scale
 * optimization problems where full Hessian storage is infeasible.
 *
 * @tparam ArgumentContainer: container type for parameters, e.g.
 *         std::vector<RealType>
 * @tparam RealType: scalar floating-point type (e.g. double, float)
 * @tparam Objective: objective function; must support "f(x)" evaluation
 * @tparam InitializationPolicy: policy for initializing x
 * @tparam ObjectiveEvalPolicy: policy for computing the objective value
 * @tparam GradEvalPolicy: policy for computing gradients
 * @tparam LineSearchPolicy: e.g. Armijo, Strong Wolfe
 *
 * https://en.wikipedia.org/wiki/Limited-memory_BFGS
 */
template<typename ArgumentContainer,
         typename RealType,
         class Objective,
         class InitializationPolicy,
         class ObjectiveEvalPolicy,
         class GradEvalPolicy,
         class LineSearchPolicy>
class lbfgs
{
public:
    lbfgs(Objective&& objective,
          ArgumentContainer& x,
          size_t m,
          InitializationPolicy&& ip,
          ObjectiveEvalPolicy&& oep,
          GradEvalPolicy&& gep,
          lbfgs_update_policy<RealType>&& up,
          LineSearchPolicy&& lsp);
    void step();
};

/* Convenience overloads */

/* Create an L-BFGS optimizer with:
 * - objective function
 * - argument container
 * - optionally:
 *   - history size m: how far to look into the past
 */
template<class Objective, typename ArgumentContainer>
auto make_lbfgs(Objective&& obj, ArgumentContainer& x, std::size_t m = 10);

/* Construct L-BFGS with a custom initialization policy */
template<class Objective,
         typename ArgumentContainer,
         class InitializationPolicy>
auto make_lbfgs(Objective&& obj,
                ArgumentContainer& x,
                std::size_t m,
                InitializationPolicy&& ip);

/* Construct L-BFGS with custom initialization and line search policies */
template<class Objective,
         typename ArgumentContainer,
         class InitializationPolicy,
         class LineSearchPolicy>
auto make_lbfgs(Objective&& obj,
                ArgumentContainer& x,
                std::size_t m,
                InitializationPolicy&& ip,
                LineSearchPolicy&& lsp);

/* Construct an L-BFGS optimizer with:
 * - custom initialization policy
 * - function evaluation policy
 * - gradient evaluation policy
 * - line search policy
 */
template<class Objective,
         typename ArgumentContainer,
         class InitializationPolicy,
         class FunctionEvalPolicy,
         class GradientEvalPolicy,
         class LineSearchPolicy>
auto make_lbfgs(Objective&& obj,
                ArgumentContainer& x,
                std::size_t m,
                InitializationPolicy&& ip,
                FunctionEvalPolicy&& fep,
                GradientEvalPolicy&& gep,
                LineSearchPolicy&& lsp);

} // namespace optimization
} // namespace math
} // namespace boost
```
L-BFGS (limited-memory BFGS) is a quasi-Newton optimizer that builds an approximation to the inverse Hessian using only first-order information (function values and gradients). Unlike full BFGS, it does not store or update a dense matrix; instead it maintains a fixed-size history of the most recent `m` correction pairs and computes the search direction using a two-loop recursion. In practice, L-BFGS often converges in significantly fewer iterations than plain gradient-based methods, especially on smooth, ill-conditioned objectives.
[heading Algorithm]
At each iteration k, LBFGS:
* Evaluates the gradient g_k = grad(f(x_k)).
* Computes a quasi-Newton search direction `p_k` using the last `m` updates.
* Chooses a step length alpha_k using a line search policy.
* Updates parameters:
```
x_k += alpha_k * p_k
```
* Forms the correction pairs:
```
s_k = x_k - x_prev
y_k = g_k - g_prev
```
and stores up to the last `m` pairs `(s_k, y_k)`.
The line search is a key part of practical LBFGS: it typically removes the need to hand-tune a learning rate and improves robustness.
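The two-loop recursion itself is compact. Below is a standalone sketch (not the library's implementation) that computes the search direction from a gradient and a history of `(s, y)` pairs. With an empty history it returns the steepest-descent direction `-g`; on a 1-D quadratic, a single exact correction pair already yields the Newton step.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

using vec = std::vector<double>;

double dot(const vec& a, const vec& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Two-loop recursion: given gradient g and stored correction pairs
// (s_k, y_k), oldest first, compute p = -H * g, where H is the implicit
// inverse-Hessian approximation.
vec two_loop_direction(const vec& g, const std::deque<std::pair<vec, vec>>& pairs)
{
    vec q = g;
    std::vector<double> alpha(pairs.size());
    // First loop: newest pair back to oldest.
    for (std::size_t k = pairs.size(); k-- > 0;) {
        const auto& [s, y] = pairs[k];
        double rho = 1.0 / dot(y, s);
        alpha[k] = rho * dot(s, q);
        for (std::size_t i = 0; i < q.size(); ++i) q[i] -= alpha[k] * y[i];
    }
    // Initial inverse-Hessian scaling gamma = (s.y)/(y.y) from the newest pair.
    double gamma = 1.0;
    if (!pairs.empty()) {
        const auto& [s, y] = pairs.back();
        gamma = dot(s, y) / dot(y, y);
    }
    vec r(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) r[i] = gamma * q[i];
    // Second loop: oldest pair forward to newest.
    for (std::size_t k = 0; k < pairs.size(); ++k) {
        const auto& [s, y] = pairs[k];
        double rho = 1.0 / dot(y, s);
        double beta = rho * dot(y, r);
        for (std::size_t i = 0; i < r.size(); ++i) r[i] += s[i] * (alpha[k] - beta);
    }
    for (double& ri : r) ri = -ri;  // negate to get a descent direction
    return r;
}
```

For `f(x) = 2x^2` (Hessian 4), storing the pair `s = 1`, `y = 4` and evaluating at gradient `g = 8` yields exactly the Newton step `-g/4 = -2`.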
[heading Parameters]
* `Objective&& obj` : objective function to minimize.
* `ArgumentContainer& x` : variables to optimize over. Updated in-place.
* `std::size_t m` : history size. Typical values are 5 to 20; the default is 10. Larger `m` can improve search directions but increases memory and per-step cost.
* `InitializationPolicy&& ip` : initialization policy for ArgumentContainer and optimizer state. For reverse-mode AD, the default typically initializes/attaches the tape and uses the user-provided initial values in x.
* `ObjectiveEvalPolicy&& oep` : policy for evaluating the objective function value at a given x. By default this is a reverse-mode AD evaluation policy when using `rvar`.
* `GradEvalPolicy&& gep` : policy for evaluating the gradient of the objective. By default this is a reverse-mode AD gradient evaluation policy when using `rvar`.
* `LineSearchPolicy&& lsp` : policy for selecting the step length alpha. Default is Strong Wolfe, but Armijo is an option.
[heading Notes]
* LBFGS assumes the objective is sufficiently smooth for gradients to be informative. It is typically most effective on unconstrained smooth problems.
* LBFGS usually requires fewer iterations than gradient descent, but each iteration is more expensive because it performs multiple objective/gradient evaluations during line search.
[heading Example : Thomson Problem]
```
#include <boost/math/differentiation/autodiff_reverse.hpp>
#include <boost/math/optimization/lbfgs.hpp>
#include <boost/math/optimization/minimizer.hpp>
#include <cmath>
#include <fstream>
#include <iostream>
#include <random>
#include <string>

namespace rdiff = boost::math::differentiation::reverse_mode;
namespace bopt  = boost::math::optimization;

double random_double(double min = 0.0, double max = 1.0)
{
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<double> dist(min, max);
    return dist(rng);
}

template<typename S>
struct vec3
{
    /**
     * @brief R^3 coordinates of a particle on the Thomson sphere
     */
    S x, y, z;
};

template<class S>
static inline vec3<S> sph_to_xyz(const S& theta, const S& phi)
{
    /**
     * convenience overload to convert from [theta, phi] -> x, y, z
     */
    return {sin(theta) * cos(phi), sin(theta) * sin(phi), cos(theta)};
}

template<typename T>
T thomson_energy(std::vector<T>& r)
{
    /* pairwise Coulomb energy: sum over pairs of 1 / r_ij */
    const size_t N = r.size() / 2;
    const T tiny = T(1e-12);
    T E = 0;
    for (size_t i = 0; i < N; ++i) {
        const T& theta_i = r[2 * i + 0];
        const T& phi_i   = r[2 * i + 1];
        auto ri = sph_to_xyz(theta_i, phi_i);
        for (size_t j = i + 1; j < N; ++j) {
            const T& theta_j = r[2 * j + 0];
            const T& phi_j   = r[2 * j + 1];
            auto rj = sph_to_xyz(theta_j, phi_j);
            T dx = ri.x - rj.x;
            T dy = ri.y - rj.y;
            T dz = ri.z - rj.z;
            T d2 = dx * dx + dy * dy + dz * dz + tiny;
            E += 1.0 / sqrt(d2);
        }
    }
    return E;
}

template<class T>
std::vector<rdiff::rvar<T, 1>> init_theta_phi_uniform(size_t N, unsigned seed = 12345)
{
    const T pi = T(3.1415926535897932384626433832795);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<T> unif01(T(0), T(1));
    std::uniform_real_distribution<T> unifm11(T(-1), T(1));
    std::vector<rdiff::rvar<T, 1>> u;
    u.reserve(2 * N);
    for (size_t i = 0; i < N; ++i) {
        T z     = unifm11(rng);
        T phi   = (T(2) * pi) * unif01(rng) - pi;
        T theta = std::acos(z);
        u.emplace_back(theta);
        u.emplace_back(phi);
    }
    return u;
}

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <N>\n";
        return 1;
    }
    const int N = std::stoi(argv[1]);
    auto u_ad = init_theta_phi_uniform<double>(N);
    auto lbfgs_opt = bopt::make_lbfgs(&thomson_energy<rdiff::rvar<double, 1>>, u_ad);
    // filenames
    std::string pos_filename = "thomson_" + std::to_string(N) + ".csv";
    std::string energy_filename = "lbfgs_energy_" + std::to_string(N) + ".csv";
    std::ofstream pos_out(pos_filename);
    std::ofstream energy_out(energy_filename);
    energy_out << "step,energy\n";
    auto result = minimize(lbfgs_opt);
    for (int pi = 0; pi < N; ++pi) {
        double theta = u_ad[2 * pi + 0].item();
        double phi   = u_ad[2 * pi + 1].item();
        auto r = sph_to_xyz(theta, phi);
        pos_out << pi << "," << r.x << "," << r.y << "," << r.z << "\n";
    }
    auto E = lbfgs_opt.objective_value();
    int i = 0;
    for (auto& obj_hist : result.objective_history) {
        energy_out << i << "," << obj_hist << "\n";
        ++i;
    }
    energy_out << i << "," << E << "\n";
    pos_out.close();
    energy_out.close();
    return 0;
}
```
For the `N = 2` case, L-BFGS requires only 5 iterations to converge, whereas the Nesterov version converges in `4663` iterations with default parameters and gradient descent requires `93799` iterations. Convergence here means the norm of the gradient is less than `1e-3`. Below is a plot comparing the three methods for the `N = 20` particle case.
[:[:[gbo_graph lbfgs_to_nag_and_gd_comparison.svg]]]
In this case, gradient descent reaches the maximum number of iterations and does not converge; NAG converges in `150` iterations; and L-BFGS converges in `67` iterations.
[endsect] [/section:lbfgs]
[section:minimize minimize]
[heading Synopsis]
```
#include <boost/math/optimization/minimizer.hpp>

namespace boost {
namespace math {
namespace optimization {

template<typename RealType>
struct optimization_result
{
    size_t num_iter = 0;
    RealType objective_value;
    std::vector<RealType> objective_history;
    bool converged;
};

/* Minimize an objective using a given optimizer.
 *
 * The optimizer `opt` must expose a `step()` method and the associated
 * typedefs:
 * - argument_container_t
 * - real_type_t
 *
 * The behavior of the minimization loop is controlled by policy objects:
 * - ConstraintPolicy: projects variables onto a feasible set
 * - ConvergencePolicy: determines when convergence has been reached
 * - TerminationPolicy: determines when optimization must stop regardless
 *   of convergence (e.g. max iterations)
 *
 * The function returns an `optimization_result` summarizing the run.
 */
template<class Optimizer,
         class ConstraintPolicy =
             unconstrained_policy<typename Optimizer::argument_container_t>,
         class ConvergencePolicy =
             gradient_norm_convergence_policy<typename Optimizer::real_type_t>,
         class TerminationPolicy = max_iter_termination_policy>
auto minimize(Optimizer& opt,
              ConstraintPolicy project = ConstraintPolicy{},
              ConvergencePolicy converged =
                  ConvergencePolicy{
                      static_cast<typename Optimizer::real_type_t>(1e-3) },
              TerminationPolicy terminate = TerminationPolicy(100000),
              bool history = true);

} // namespace optimization
} // namespace math
} // namespace boost
```
This header provides a generic driver function `minimize` that repeatedly advances an optimizer via `step()`, optionally projects parameters onto a constraint set, and stops when either a convergence criterion is met or a termination criterion triggers.
[heading Parameters]
* `Optimizer& opt` : Optimizer instance.
* `ConstraintPolicy project` : Projection policy applied after each `step()`. Default is `unconstrained_policy` (no projection).
* `ConvergencePolicy converged` : Stopping criterion. Default is `gradient_norm_convergence_policy(tol=1e-3)`.
* `TerminationPolicy terminate` : Hard stop criterion. Default is `max_iter_termination_policy(100000)`.
* `bool history` : If true, records `opt.objective_value()` after each iteration in `optimization_result::objective_history`.
[heading Convergence Policies]
Convergence policies decide when the optimization has converged. Each policy has the signature:
```
bool operator()(const GradientContainer& g, RealType objective_value) const;
```
[heading gradient_norm_convergence_policy]
Stops when the Euclidean norm of the gradient falls below a tolerance:
```
template<typename RealType>
struct gradient_norm_convergence_policy
{
    explicit gradient_norm_convergence_policy(RealType tol);

    template<class GradientContainer>
    bool operator()(const GradientContainer& g, RealType objective_v) const;
};
```
[heading objective_tol_convergence_policy]
Stops when the absolute change in objective value between successive iterations drops below a tolerance:
```
template<typename RealType>
struct objective_tol_convergence_policy
{
    explicit objective_tol_convergence_policy(RealType tol);

    template<class GradientContainer>
    bool operator()(const GradientContainer&, RealType objective_v) const;
};
```
The first call always returns `false`.
[heading relative_objective_tol_policy]
Stops when the relative objective change drops below a tolerance:
```
template<typename RealType>
struct relative_objective_tol_policy
{
    explicit relative_objective_tol_policy(RealType rel_tol);

    template<class GradientContainer>
    bool operator()(const GradientContainer&, RealType objective_v) const;
};
```
The relative change is computed as:
```
abs(obj - last_obj) / max(1, abs(last_obj))
```
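As a sketch of the formula above (this mirrors the stated definition, not the library's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Relative objective change: |obj - last_obj| / max(1, |last_obj|).
// The max(1, ...) guard avoids blow-up when the objective is near zero.
double relative_change(double obj, double last_obj)
{
    return std::abs(obj - last_obj) / std::max(1.0, std::abs(last_obj));
}
```

For example, going from `100.0` to `101.0` gives a relative change of `0.01`, while going from `0.5` to `1.5` gives `1.0` (the denominator is clamped to `1`).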
[heading combined_convergence_policy]
Combines two convergence policies and stops when either triggers:
```
template<class Policy1, class Policy2>
struct combined_convergence_policy
{
    combined_convergence_policy(Policy1 p1, Policy2 p2);

    template<class GradientContainer, class RealType>
    bool operator()(const GradientContainer& g, RealType obj) const;
};
```
[heading Termination Policies]
Termination policies provide a hard stop independent of convergence. Each policy has the signature:
```
bool operator()(size_t iter);
```
[heading max_iter_termination_policy]
Stops after a fixed number of iterations:
```
struct max_iter_termination_policy
{
    explicit max_iter_termination_policy(size_t max_iter);
    bool operator()(size_t iter);
};
```
[heading wallclock_termination_policy]
Stops after a wall-clock time budget:
```
struct wallclock_termination_policy
{
    explicit wallclock_termination_policy(std::chrono::milliseconds max_time);
    bool operator()(size_t iter) const;
};
```
[heading Constraint and Projection Policies]
Projection policies modify the optimizer variables after each step. Each policy has the signature:
```
void operator()(ArgumentContainer& x) const;
```
[heading unconstrained_policy]
No projection; leaves the parameters unchanged:
```
template<typename ArgumentContainer>
struct unconstrained_policy
{
    void operator()(ArgumentContainer&);
};
```
[heading box_constraints]
Clamps each variable into `[min, max]`:
```
template<typename ArgumentContainer, typename RealType>
struct box_constraints
{
    box_constraints(RealType min, RealType max);
    void operator()(ArgumentContainer& x);
};
```
[heading nonnegativity_constraint]
Sets any element less than `0` to `0`:
```
template<typename ArgumentContainer, typename RealType>
struct nonnegativity_constraint
{
    void operator()(ArgumentContainer& x) const;
};
```
[heading l2_ball_constraint]
Projects onto the `L2` ball of radius `radius` by uniform scaling when needed:
```
template<typename ArgumentContainer, typename RealType>
struct l2_ball_constraint
{
    explicit l2_ball_constraint(RealType radius);
    void operator()(ArgumentContainer& x) const;
};
```
[heading l1_ball_constraint]
Projects onto the `L1` ball of radius `radius` by uniform scaling:
```
template<typename ArgumentContainer, typename RealType>
struct l1_ball_constraint
{
    explicit l1_ball_constraint(RealType radius);
    void operator()(ArgumentContainer& x) const;
};
```
This is not the exact Euclidean projection onto the L1 ball; it is a simple scaling-based constraint.
[heading simplex_constraint]
Projects onto the scaled probability simplex by clipping negatives to `0` and renormalizing to sum to `1`:
```
template<typename ArgumentContainer, typename RealType>
struct simplex_constraint
{
    void operator()(ArgumentContainer& x) const;
};
```
[heading function_constraint]
Wraps a user-provided projection function:
```
template<typename ArgumentContainer>
struct function_constraint
{
    using func_t = void (*)(ArgumentContainer&);
    explicit function_constraint(func_t f);
    void operator()(ArgumentContainer& x) const;
};
```
[heading unit_sphere_constraint]
Normalizes the vector to unit L2 norm:
```
template<typename ArgumentContainer, typename RealType>
struct unit_sphere_constraint
{
    void operator()(ArgumentContainer& x) const;
};
```
The normalization is applied only when the L2 norm of `x` is nonzero.
[endsect] [/section:minimize]
[section:policies Reverse Mode autodiff policies]
This header provides default policy implementations for using the gradient-based optimizers with `boost::math::differentiation::reverse_mode::rvar`. These policies handle:
* preparing the reverse-mode tape before evaluating an objective,
* extracting gradients from rvar arguments after backpropagation,
* initializing `rvar` containers and establishing tape checkpoints.
[heading Synopsis]
```
#include <boost/math/optimization/detail/rdiff_optimization_policies.hpp>
namespace boost {
namespace math {
namespace optimization {
namespace rdiff = boost::math::differentiation::reverse_mode;
/**
 * @brief function evaluation policy for reverse mode autodiff
 * @arg objective objective function to evaluate
 * @arg x argument list
 */
template<typename RealType>
struct reverse_mode_function_eval_policy
{
template<typename Objective, class ArgumentContainer>
rdiff::rvar<RealType, 1> operator()(Objective&& objective,
ArgumentContainer& x);
};
/**
 * @brief gradient evaluation policy
 * @arg obj_f objective
 * @arg x argument list
 * @arg f_eval_pol function evaluation policy; function and
 * gradient evaluation need to be done in tandem
 * @arg obj_v reference to the objective value inside the gradient class
 */
template<typename RealType>
struct reverse_mode_gradient_evaluation_policy
{
template<class Objective,
class ArgumentContainer,
class FunctionEvaluationPolicy>
void operator()(Objective&& obj_f,
ArgumentContainer& x,
FunctionEvaluationPolicy&& f_eval_pol,
RealType& obj_v,
                    std::vector<RealType>& g);
};
/*
* init policies
*/
/* default rvar policy */
template<typename RealType>
struct tape_initializer_rvar
{
template<class ArgumentContainer>
void operator()(ArgumentContainer& x) const noexcept;
};
/* random uniform */
template<typename RealType>
struct random_uniform_initializer_rvar
{
RealType low_, high_;
size_t seed_;
random_uniform_initializer_rvar(RealType low = 0,
RealType high = 1,
size_t seed = std::random_device{}());
template<class ArgumentContainer>
void operator()(ArgumentContainer& x) const;
};
/* constant initializer */
template<typename RealType>
struct costant_initializer_rvar
{
RealType constant;
explicit costant_initializer_rvar(RealType v = 0);
template<class ArgumentContainer>
void operator()(ArgumentContainer& x) const;
};
} // namespace optimization
} // namespace math
} // namespace boost
```
[heading Function evaluation policies]
A function evaluation policy defines how an objective is evaluated at the current parameters, and is responsible for any AD bookkeeping required before the call.
[heading reverse_mode_function_eval_policy]
```
template<typename RealType>
struct reverse_mode_function_eval_policy
{
    template<typename Objective, class ArgumentContainer>
    rdiff::rvar<RealType, 1> operator()(Objective&& objective,
                                        ArgumentContainer& x);
};
```
This policy evaluates the objective `objective(x)` using reverse-mode autodiff and ensures the active tape is in a valid state before evaluation. `ArgumentContainer` must contain `rdiff::rvar<RealType, 1>` values.
[heading Gradient evaluation policies]
A gradient evaluation policy defines how gradients are computed and extracted into an ordinary numeric container.
[heading reverse_mode_gradient_evaluation_policy]
```
template<typename RealType>
struct reverse_mode_gradient_evaluation_policy
{
    template<class Objective,
             class ArgumentContainer,
             class FunctionEvaluationPolicy>
    void operator()(Objective&& obj_f,
                    ArgumentContainer& x,
                    FunctionEvaluationPolicy&& f_eval_pol,
                    RealType& obj_v,
                    std::vector<RealType>& g);
};
```
This policy computes the objective value and the gradient using reverse-mode autodiff.
[heading Initialization policies]
Initialization policies prepare the argument container and the reverse-mode tape for optimization. They establish tape checkpoints that are later used by function/gradient evaluation policies.
[heading tape_initializer_rvar]
```
template<typename RealType>
struct tape_initializer_rvar
{
    template<class ArgumentContainer>
    void operator()(ArgumentContainer& x) const noexcept;
};
```
This policy:
* statically requires `ArgumentContainer::value_type` to be `rdiff::rvar<RealType,1>`,
* adds a checkpoint to the active reverse-mode tape via `tape.add_checkpoint()`.
[heading random_uniform_initializer_rvar]
```
template<typename RealType>
struct random_uniform_initializer_rvar
{
random_uniform_initializer_rvar(RealType low = 0,
RealType high = 1,
size_t seed = std::random_device{}());
template<class ArgumentContainer>
void operator()(ArgumentContainer& x) const;
};
```
This policy initializes each element of `x` independently as
```
x[i] = rdiff::rvar<RealType, 1>(U(low, high))
```
where `U(low, high)` is a uniform draw produced with `boost::random::mt19937` and `boost::random::uniform_real_distribution<RealType>`.
[heading costant_initializer_rvar]
```
template<typename RealType>
struct costant_initializer_rvar
{
explicit costant_initializer_rvar(RealType v = 0);
template<class ArgumentContainer>
void operator()(ArgumentContainer& x) const;
};
```
This policy initializes each element of `x` to the constant value `v`.
[endsect] [/section:policies]
[endsect] [/section:gd_opt]