mirror of
https://github.com/boostorg/math.git
synced 2026-01-24 18:12:09 +00:00
* Implement rank vector [ci skip]
* Add documentation. Admittedly terrible.
* Add unit tests.
* Cleanup method of detecting if execution policies are valid or not [ci skip]
* Implement and test chatterjee correlation [ci skip]
* Add spot checks and special handling for constant Y [ci skip]
* Add performance file [ci skip]
* Add execution policy support to rank [ci skip]
* Remove duplicates from v when generating the order vector [ci skip]
* Fix macro error for use of <execution> [ci skip]
* Use explicit types instead of auto to avoid warnings [ci skip]
* Add execution policy testing to rank [ci skip]
* Add threaded implementation [ci skip]
* Added threaded testing
* Fix formatting and ASCII issues in test
* Fix more ASCII issues
* refactoring
* Fix threaded impl
* Remove non-ASCII apostrophe [ci skip]
* Doc fixes and add test comparing generally to paper values
* Significantly tighten tolerance around expected values from paper
* Change tolerance for sin comparison

Co-authored-by: Nick Thompson <nathompson7@protonmail.com>
75 lines
3.0 KiB
Plaintext
[/
Copyright 2022 Matt Borland

Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt).
]

[section:chatterjee_correlation Chatterjee Correlation]

[heading Synopsis]

``
#include <boost/math/statistics/chatterjee_correlation.hpp>

namespace boost::math::statistics {

    // C++17:
    template <typename ExecutionPolicy, typename Container>
    auto chatterjee_correlation(ExecutionPolicy&& exec, const Container& u, const Container& v);

    // C++11:
    template <typename Container>
    auto chatterjee_correlation(const Container& u, const Container& v);
}
``

[heading Description]

Classical correlation coefficients such as Pearson's correlation are useful primarily for detecting when one dataset depends linearly on another.
However, Pearson's correlation coefficient has a known weakness: even when the dependent variable has an obvious functional relationship with the independent variable, the coefficient can take on any value.
As Chatterjee says:

[:Ideally, one would like a coefficient that approaches its maximum value if and only if one variable looks more and more like a noiseless function of the other, just as Pearson correlation is close to its maximum value if and only if one variable is close to being a noiseless linear function of the other.]

This is the problem Chatterjee's coefficient solves.
Let X and Y be random variables, where Y is not constant, and let (X_i, Y_i) be samples from their joint distribution.
Rearrange these samples so that X_{(0)} < X_{(1)} < ... < X_{(n-1)} and create the pairs of ranks (R(X_{(i)}), R(Y_{(i)})), where R denotes the rank of a value within its own sample.
The Chatterjee correlation is then given by

[$../equations/chatterjee_correlation.svg]

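For readers who cannot view the equation image, the formula from Chatterjee's paper for data without ties can be written as below. This is a transcription of the paper's formula with the index range shifted to the zero-based ordering used in this section; it is not copied from the image itself.

```latex
\xi_n(X, Y) = 1 - \frac{3 \sum_{i=0}^{n-2} \left| R(Y_{(i+1)}) - R(Y_{(i)}) \right|}{n^2 - 1}
```

Only the differences of consecutive ranks enter the sum, so it makes no difference whether ranks are counted from zero or from one.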
In the limit of an infinite amount of i.i.d. data, the statistic lies in [0, 1].
However, for a finite sample, the statistic may be negative.
In the same limit, the value is zero if X and Y are independent, and unity if Y is a measurable function of X.
The complexity is O(n log n).

An example is given below:

    #include <vector>
    #include <boost/math/statistics/chatterjee_correlation.hpp>

    // X must already be sorted in ascending order
    std::vector<double> X{1,2,3,4,5};
    std::vector<double> Y{1,2,3,4,5};
    using boost::math::statistics::chatterjee_correlation;
    double coeff = chatterjee_correlation(X, Y);

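The example can be worked by hand, assuming the no-ties formula from Chatterjee's paper, in which the statistic is one minus three times the sum of absolute differences of consecutive Y ranks, divided by n^2 - 1. Here the ranks of Y are already in increasing order, so each of the four consecutive differences is 1. Whether the library returns exactly this value was not checked here.

```latex
\sum_{i=0}^{3} \left| R(Y_{(i+1)}) - R(Y_{(i)}) \right| = 4,
\qquad
\xi_5 = 1 - \frac{3 \cdot 4}{5^2 - 1} = 1 - \frac{12}{24} = \frac{1}{2}
```

More generally, a strictly monotone relationship with n distinct points gives 1 - 3/(n + 1) under this formula, which approaches unity only as n grows.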
The implementation follows [@https://arxiv.org/pdf/1909.10140.pdf Chatterjee's paper].

/Nota bene:/ If the input is an integer type, the output will be a double precision type.

[heading Invariants]

The function expects at least two samples, a non-constant vector Y, and the same number of X's as Y's.
If Y is constant, the result is a quiet NaN.
The data set must be sorted by X values.
If there are ties in the values of X, the statistic depends on how the ties are broken; Chatterjee's paper breaks them at random, so the statistic is itself random.
The implementation does not use random numbers internally, but the result is not guaranteed to be identical on different systems.

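Since the data must arrive sorted by X, callers with unsorted pairs need to reorder both vectors together first. The following is a minimal sketch of one way to do that and to check the other preconditions listed above. The helper `sort_pairs_by_x` is hypothetical and is not part of Boost.Math, and its use of a stable sort (ties in X keep their input order) is an illustrative choice, not the random tie-breaking described in the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Boost.Math.
// Reorders x and y together so that x is ascending.
// Returns false and leaves the inputs untouched if there are fewer than
// two samples, the sizes differ, or y is constant.
bool sort_pairs_by_x(std::vector<double>& x, std::vector<double>& y)
{
    const std::size_t n = x.size();
    if (n < 2 || y.size() != n)
    {
        return false;
    }

    // A constant y is outside the function's stated domain.
    const double first = y.front();
    if (std::all_of(y.begin(), y.end(), [first](double v) { return v == first; }))
    {
        return false;
    }

    // perm[k] is the index of the k-th smallest x; ties keep input order.
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::stable_sort(perm.begin(), perm.end(),
                     [&x](std::size_t a, std::size_t b) { return x[a] < x[b]; });

    // Apply the same permutation to both vectors.
    std::vector<double> xs(n);
    std::vector<double> ys(n);
    for (std::size_t k = 0; k < n; ++k)
    {
        xs[k] = x[perm[k]];
        ys[k] = y[perm[k]];
    }
    x = std::move(xs);
    y = std::move(ys);
    return true;
}
```

After a successful call, the two vectors satisfy the sorting requirement and can be passed to `chatterjee_correlation` in that order.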
[heading References]

* Chatterjee, Sourav. "A new coefficient of correlation." Journal of the American Statistical Association 116.536 (2021): 2009-2022.

[endsect]
[/section:chatterjee_correlation Chatterjee Correlation]