Add M nearest-neighbour Chatterjee correlation (#990) #1414

Open
su-senka wants to merge 1 commit into boostorg:develop from su-senka:feature/chatterjee-mnn


su-senka commented Jul 1, 2026


Summary

Implements the revised (M nearest-neighbour) Chatterjee rank correlation of
Lin & Han (2021), addressing #990. Adds a new function
chatterjee_correlation_mnn(u, v, M) alongside the existing
chatterjee_correlation, with the same C++11 and C++17 overload structure.

The original coefficient has a detection boundary of n^(-1/4) for independence
testing, well short of the parametric n^(-1/2) rate. The revised statistic uses
the M right nearest neighbours of each point (rather than the single right
neighbour) and lets M grow with n; it consistently estimates the same
dependence measure while attaining near-parametric efficiency. See Lin & Han,
"On boosting the power of Chatterjee's rank correlation", Biometrika 110(2)
(2023) 283–299, arXiv:2108.06828.
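
For orientation, the statistic can be sketched as below, assuming the paper's notation: R_i is the 1-based rank of Y_i among Y_1, ..., Y_n, and j_m(i) is the index of the m-th right nearest neighbour of X_i, set to i when fewer than m points lie to its right. This is a summary of the definition as understood here, so the normalisation and the boundary convention should be checked against the paper rather than taken from this description.

```latex
\xi_{n,M} \;=\; -2 \;+\;
\frac{6 \sum_{i=1}^{n} \sum_{m=1}^{M} \min\{R_i,\, R_{j_m(i)}\}}
     {(n+1)\,\bigl[\,nM + M(M+1)/4\,\bigr]}
```

Under this form, a strictly increasing relationship gives a double sum of M n(n+1)/2 and hence a value of (4n - 2M - 2)/(4n + M + 1), which tends to 1 as n grows with M/n → 0.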

Design notes

  • Separate function rather than an extended signature. The M-NN statistic
    uses min(R_i, R_j) and a different normalisation, so even at M = 1 it is not
    identical to chatterjee_correlation. A distinct function avoids silently
    changing existing results and keeps the statistical intent explicit.
    M is a required argument with no default.

  • Rank base. The internal rank() returns 0-based ranks; the paper's
    formula uses 1-based ranks. The offset cancels in the existing M = 1 statistic
    (which uses |R_i - R_{i+1}|) but not under min(.,.), so the +1 shift is
    applied explicitly, with a comment at the point of use.

  • Complexity. O(n log n + nM): O(n log n) for sorting and ranking, O(nM) for
    the neighbour sum. Near-linear for small M; tends to O(n²) as M → n.

  • Parallel path. The outer index loop is partitioned across threads into
    disjoint ranges, each reading the shared rank vector read-only (indices up to
    i + M may fall in a neighbouring range; there are no writes). This differs
    from the M = 1 parallel path, which splits the data array for the
    difference-based transform.

  • Ties / degenerate input. Like chatterjee_correlation, the function
    assumes distinct Y (continuous data). A constant Y returns a quiet NaN; this
    is detected on the input directly, since rank() collapses tied values.

  • Choice of M. The asymptotic null variance is minimised at M ~ sqrt(n); the
    choice is documented but left to the caller.

Tests

Added to test_chatterjee_correlation.cpp, covering float, double, and
long double:

  • Exact closed-form checks against the paper's Remark 2.5 (strictly increasing
    and strictly decreasing dependence), which require no external reference.
  • Small exact spot values computed independently as rationals.
  • Constant-Y → NaN, and invariance under strictly increasing transforms of X
    and Y.
  • Sequential/parallel agreement across several M (under the parallel build).
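
The idea behind the sequential/parallel agreement check can be sketched as follows. This is an illustration, not the test added in this PR: `mnn_partial_sum` and `mnn_sum_in_chunks` are hypothetical names, the input is assumed to be the 0-based Y ranks already ordered by X, and the disjoint ranges are processed in a plain loop rather than on real threads to keep the sketch dependency-free. Each range only reads the shared rank vector, so each call could be handed to a separate thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: neighbour sum over the outer indices [first, last).
// r holds 0-based Y ranks in order of increasing X and is only read.
inline std::uint64_t mnn_partial_sum(const std::vector<std::size_t>& r,
                                     std::size_t M,
                                     std::size_t first, std::size_t last)
{
    const std::size_t n = r.size();
    std::uint64_t sum = 0;
    for (std::size_t k = first; k < last; ++k) {
        for (std::size_t m = 1; m <= M; ++m) {
            // The look-ahead index k + m may fall in a neighbouring range.
            const std::size_t partner = (k + m < n) ? r[k + m] : r[k];
            sum += std::min(r[k], partner) + 1;  // +1: 0-based to 1-based rank
        }
    }
    return sum;
}

// Hypothetical driver: split [0, n) into n_chunks disjoint ranges and add up
// the partial sums. Each iteration is independent of the others.
inline std::uint64_t mnn_sum_in_chunks(const std::vector<std::size_t>& r,
                                       std::size_t M, std::size_t n_chunks)
{
    const std::size_t n = r.size();
    if (n_chunks == 0) { n_chunks = 1; }
    const std::size_t chunk = (n + n_chunks - 1) / n_chunks;  // ceiling division
    std::uint64_t total = 0;
    for (std::size_t t = 0; t < n_chunks; ++t) {
        const std::size_t first = std::min(n, t * chunk);
        const std::size_t last = std::min(n, first + chunk);
        total += mnn_partial_sum(r, M, first, last);
    }
    return total;
}
```

Because the partial sums in this sketch are integers, the chunked total equals the single-range total exactly for any number of chunks, so an agreement check of this kind needs no floating-point tolerance until the final normalisation.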

The sequential path was verified locally under b2 with cxxstd=14 and
cxxstd=17 (clang, arm64).
