Add M nearest-neighbour Chatterjee correlation (#990) by su-senka · Pull Request #1414 · boostorg/math

su-senka · 2026-07-01T06:04:55Z

Summary

Implements the revised (M nearest-neighbour) Chatterjee rank correlation of
Lin & Han (2021), addressing #990. Adds a new function
chatterjee_correlation_mnn(u, v, M) alongside the existing
chatterjee_correlation, with the same C++11 and C++17 overload structure.

The original coefficient has a detection boundary of n^(-1/4) for independence
testing, well short of the parametric n^(-1/2) rate. By using the M right
nearest neighbours of each point (rather than the single right neighbour) and
letting M grow with n, the revised statistic consistently estimates the same
dependence measure while attaining near-parametric efficiency. See Lin & Han,
"On boosting the power of Chatterjee's rank correlation", Biometrika 110(2)
(2023) 283–299, arXiv:2108.06828.
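
For reference, the statistic can be written as below. This is a transcription of the definition in Lin & Han, not text taken from the patch, so the notation should be checked against the paper. Here R_i is the 1-based rank of Y_i, and j_m(i) is the index of the m-th right nearest neighbour of X_i, taken to be i itself when fewer than m points lie to the right of X_i.

```latex
\xi_{n,M} \;=\; -2 \;+\;
  \frac{6 \sum_{i=1}^{n} \sum_{m=1}^{M} \min\{R_i,\, R_{j_m(i)}\}}
       {(n+1)\,\bigl[\, nM + M(M+1)/4 \,\bigr]}
```

Under this form, and for continuous independent X and Y, the double sum has expectation (n+1)[nM + M(M+1)/4]/3, so the statistic is centred at zero for every n and M.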

Design notes

- Separate function rather than an extended signature. The M-NN statistic
  uses min(R_i, R_j) and a different normalisation, so even at M = 1 it is not
  identical to chatterjee_correlation. A distinct function avoids silently
  changing existing results and keeps the statistical intent explicit.
  M is a required argument with no default.
- Rank base. The internal rank() returns 0-based ranks; the paper's
  formula uses 1-based ranks. The offset cancels in the existing M = 1 statistic
  (which uses |R_i - R_{i+1}|) but not under min(.,.), so it is applied
  explicitly. This is noted in a comment where it matters.
- Complexity. O(n log n + nM). Near-linear for small M; tends to O(n²) as
  M → n.
- Parallel path. The outer index loop is partitioned across threads into
  disjoint ranges, each reading the shared rank vector read-only (indices up to
  i + M may fall in a neighbouring range; there are no writes). This differs
  from the M = 1 parallel path, which splits the data array for the
  difference-based transform.
- Ties / degenerate input. Like chatterjee_correlation, the function
  assumes distinct Y (continuous data). A constant Y returns a quiet NaN; this
  is detected on the input directly, since rank() collapses tied values.
- Choice of M. The asymptotic null variance is minimised at M ~ sqrt(n); the
  choice is documented but left to the caller.
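
The notes on rank base, complexity and constant-Y handling can be illustrated with a minimal sequential sketch. It is not the code in this PR: the name xi_mnn_sketch is hypothetical, the ranks are computed with plain std::sort rather than the internal rank(), and the normalisation is the one recalled from Lin & Han, so treat it as an assumption to verify against the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical illustration only; assumes distinct x and (non-constant) distinct y.
template <typename Real>
Real xi_mnn_sketch(const std::vector<Real>& x, const std::vector<Real>& y, std::size_t M)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    if (n == 0 || M == 0) {
        return std::numeric_limits<Real>::quiet_NaN();
    }
    // Constant y is detected on the raw input, before any ranking.
    if (std::all_of(y.begin(), y.end(), [&y](Real v) { return v == y.front(); })) {
        return std::numeric_limits<Real>::quiet_NaN();
    }

    // Two sorts give the O(n log n) term: indices ordered by x, and by y.
    std::vector<std::size_t> by_x(n), by_y(n);
    std::iota(by_x.begin(), by_x.end(), std::size_t{0});
    std::iota(by_y.begin(), by_y.end(), std::size_t{0});
    std::sort(by_x.begin(), by_x.end(),
              [&x](std::size_t a, std::size_t b) { return x[a] < x[b]; });
    std::sort(by_y.begin(), by_y.end(),
              [&y](std::size_t a, std::size_t b) { return y[a] < y[b]; });

    // 1-based rank of each y value. The +1 matters here: min(R_i, R_j) is
    // not invariant to a rank offset, unlike |R_i - R_j|.
    std::vector<std::size_t> rank_of(n);
    for (std::size_t r = 0; r < n; ++r) {
        rank_of[by_y[r]] = r + 1;
    }
    // Ranks rearranged into x-sorted order, so that the m-th right nearest
    // neighbour of position i is simply position i + m.
    std::vector<std::size_t> rank(n);
    for (std::size_t k = 0; k < n; ++k) {
        rank[k] = rank_of[by_x[k]];
    }

    // The O(nM) term: every point is paired with its M right neighbours.
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t m = 1; m <= M; ++m) {
            const std::size_t j = (i + m < n) ? i + m : i;  // no such neighbour: use i
            total += std::min(rank[i], rank[j]);
        }
    }

    const Real nr = static_cast<Real>(n);
    const Real Mr = static_cast<Real>(M);
    const Real denom = (nr + 1) * (nr * Mr + Mr * (Mr + 1) / 4);
    return Real(-2) + Real(6) * static_cast<Real>(total) / denom;
}
```

With x = {1, 2, 3, 4, 5}, a strictly increasing y and M = 2, the double sum is 30 and the sketch returns -2 + 180/69 = 14/23. With 0-based ranks the same input would give a sum of 20 and a different value, which is the reason the offset has to be applied explicitly.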

Tests

Added to test_chatterjee_correlation.cpp, covering float, double, and
long double:

- Exact closed-form checks against the paper's Remark 2.5 (strictly increasing
  and strictly decreasing dependence), which require no external reference.
- Small exact spot values computed independently as rationals.
- Constant-Y → NaN, and invariance under strictly increasing transforms of X
  and Y.
- Sequential/parallel agreement across several M (under the parallel build).
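
The sequential/parallel agreement check can be pictured with the sketch below. It is an assumption-laden illustration, not the PR's parallel path: partial_min_sum and min_sum_parallel are hypothetical names, std::async stands in for whatever threading the patch actually uses, and the input is taken to be the 1-based Y ranks already arranged in X-sorted order.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Sum of min(rank[i], rank[j]) over i in [first, last) and m = 1..M, where
// j = i + m, or i when that position does not exist. rank is only read.
std::uint64_t partial_min_sum(const std::vector<std::size_t>& rank,
                              std::size_t first, std::size_t last, std::size_t M)
{
    const std::size_t n = rank.size();
    std::uint64_t s = 0;
    for (std::size_t i = first; i < last; ++i) {
        for (std::size_t m = 1; m <= M; ++m) {
            // i + m may land in a neighbouring task's range; that is safe
            // because no task writes to rank.
            const std::size_t j = (i + m < n) ? i + m : i;
            s += std::min(rank[i], rank[j]);
        }
    }
    return s;
}

// Splits the outer index range into disjoint chunks, one task per chunk.
std::uint64_t min_sum_parallel(const std::vector<std::size_t>& rank,
                               std::size_t M, std::size_t num_tasks)
{
    const std::size_t n = rank.size();
    num_tasks = std::max<std::size_t>(1, std::min(num_tasks, n));
    const std::size_t chunk = (n + num_tasks - 1) / num_tasks;

    std::vector<std::future<std::uint64_t>> parts;
    for (std::size_t first = 0; first < n; first += chunk) {
        const std::size_t last = std::min(first + chunk, n);
        parts.push_back(std::async(std::launch::async, [&rank, first, last, M] {
            return partial_min_sum(rank, first, last, M);
        }));
    }

    std::uint64_t total = 0;
    for (auto& p : parts) {
        total += p.get();
    }
    return total;
}
```

Because the partial sums here are integers, min_sum_parallel(rank, M, k) equals partial_min_sum(rank, 0, rank.size(), M) exactly for any k. A version that accumulates in floating point would instead need a tolerance, since the order of summation changes with the partition.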

The sequential path was verified locally under b2 with cxxstd=14 and
cxxstd=17 (clang, arm64).

Add M nearest-neighbour Chatterjee correlation (boostorg#990)

2709fb2

su-senka wants to merge 1 commit into boostorg:develop from su-senka:feature/chatterjee-mnn
