silk: Arm NEON for the float SILK analysis path (inner_product, warped autocorrelation, energy) #481

Open
czoli1976 wants to merge 3 commits into xiph:main from czoli1976:arm-silk-float-neon

@czoli1976

Summary

The float SILK analysis path (silk/float/) had no Arm SIMD at all, whereas x86 already has an AVX2 silk_inner_product_FLP. This adds the first silk/float/arm/ NEON tier (three kernels), wired into the existing RTCD dispatch and all three build systems (autotools / CMake / Meson). The full Meson test suite passes, and each kernel carries an OPUS_CHECK_ASM self-check. The check compares to within rounding rather than bit-for-bit, because float kernels reorder the f64 accumulation and so, unlike the fixed-point ones, are not guaranteed bit-exact.
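For readers unfamiliar with this kind of check, a within-rounding comparison could look roughly like the sketch below. It is illustrative only: `within_rounding`, `sum_scalar_order`, `sum_lane_order` and the tolerance used are hypothetical, not the helper or threshold in this PR.

```c
#include <assert.h>

/* Hypothetical helper, not the PR's code: accept `got` when it matches `ref`
 * to within rel_tol, measured relative to max(|ref|, 1) so that values near
 * zero do not trip the check. */
static int within_rounding(double ref, double got, double rel_tol)
{
    double diff  = ref > got ? ref - got : got - ref;
    double mag   = ref < 0.0 ? -ref : ref;
    double scale = mag > 1.0 ? mag : 1.0;
    assert(rel_tol >= 0.0);
    return diff <= rel_tol * scale;
}

/* The same three terms summed in two orders, standing in for a scalar
 * reference and a SIMD kernel whose lane reduction reorders the adds. */
static double sum_scalar_order(double a, double b, double c)
{
    return (a + b) + c;
}

static double sum_lane_order(double a, double b, double c)
{
    return (a + c) + b;
}
```

The two sums may differ in the last bit, which is why an exact-equality assert would be too strict for a kernel that changes the order of additions.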

1. silk_inner_product_FLP

NEON with f64 accumulation matching the C double reference (mirrors the AVX2 path and its RTCD wiring). It's the workhorse float dot product (LPC/Burg autocorrelation, LTP correlation, pitch analysis), so those callers benefit transitively.

  • Kernel: ~1.59x geomean over scalar on Apple M4 (1.6-2.0x at the dominant 48-96 lengths).
  • Numerically faithful: worst rel. error 9.8e-11 vs a long-double reference over 744k adversarial vectors (the scalar reference's own error is 6.8e-11).
  • E2E: encode time within run-to-run noise (this kernel is ~2.5% of SILK encode); encoded bitstream byte-identical across mono/stereo/5.1 x NB-FB x the bitrate range.
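The widen-then-accumulate approach described above can be sketched as follows. This is a minimal illustration, not the kernel in this PR: `dot_f32_acc_f64` is a hypothetical name, the AArch64 branch assumes a toolchain providing `<arm_neon.h>`, and the `#else` branch is a portable stand-in that keeps the same four partial sums.

```c
#include <assert.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical sketch: dot product of two f32 vectors with f64 accumulation. */
static double dot_f32_acc_f64(const float *a, const float *b, int n)
{
    double result;
    int i = 0;
    assert(n >= 0);
#if defined(__aarch64__) && defined(__ARM_NEON)
    {
        /* Two float64x2 accumulators hold four partial sums. */
        float64x2_t acc0 = vdupq_n_f64(0.0);
        float64x2_t acc1 = vdupq_n_f64(0.0);
        for (; i + 4 <= n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);
            float32x4_t vb = vld1q_f32(b + i);
            /* Widen each f32 operand to f64 before multiplying. The product
             * of two f32 values is exact in f64, so the fused multiply-add
             * rounds exactly like a separate multiply and add would. */
            acc0 = vfmaq_f64(acc0, vcvt_f64_f32(vget_low_f32(va)),
                                   vcvt_f64_f32(vget_low_f32(vb)));
            acc1 = vfmaq_f64(acc1, vcvt_high_f64_f32(va),
                                   vcvt_high_f64_f32(vb));
        }
        /* Lane reduction: this is where the order of adds differs from a
         * single running scalar sum. */
        result = vaddvq_f64(vaddq_f64(acc0, acc1));
    }
#else
    {
        /* Portable stand-in with the same four partial sums. */
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (; i + 4 <= n; i += 4) {
            s0 += (double)a[i]     * (double)b[i];
            s1 += (double)a[i + 1] * (double)b[i + 1];
            s2 += (double)a[i + 2] * (double)b[i + 2];
            s3 += (double)a[i + 3] * (double)b[i + 3];
        }
        result = (s0 + s2) + (s1 + s3);
    }
#endif
    /* Scalar tail for lengths that are not a multiple of four. */
    for (; i < n; i++) {
        result += (double)a[i] * (double)b[i];
    }
    return result;
}
```

Because every product is exact in f64, the only difference from a single running sum is the order in which partial sums are added, which is consistent with the small error figures quoted above.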

2. silk_warped_autocorrelation_FLP (~11% of float SILK/hybrid encode)

Scalar on every platform until now (the only NEON warped autocorr was fixed-point). The all-pass cascade is loop-carried, so this keeps that chain scalar in double (bit-exact state) and vectorises the per-lag correlation accumulation in f64.

  • Kernel: ~1.34x geomean (1.47x at order 24).
  • All-pass state bit-exact with the C reference; correlations within ~1e-15 of it -> byte-identical bitstream.
  • E2E: ~1.04-1.05x faster RTC voice encode (VOIP+DTX+FEC, mono WB) -> ~4-6% more concurrent real-time streams per core.

A faster non-bit-exact variant exists (lag-parallel reformulation of the fixed-point NEON kernel, all-pass state in f32): ~1.89x kernel, ~6-9% RTC E2E. Its measured BD-rate cost is +0.69% (PESQ-NB) / +0.61% (PESQ-WB) on real speech and +0.055% on 48 kHz music (opus_compare) - sub-perceptual but real. This PR ships the bit-exact variant by default; the faster one can be offered behind a build option if the trade-off is wanted.
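The split used by the default variant (scalar all-pass chain, separate per-lag accumulation) can be sketched in portable C as below. This is written from the description in this PR, not copied from the tree: the function names, the `SKETCH_MAX_ORDER` bound and the `double *C` output are assumptions made for the illustration, and the second loop in `warped_autocorr_split` is only marked as the SIMD candidate rather than written with intrinsics.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 24 /* hypothetical bound for this sketch */

/* Reference shape: all-pass cascade and correlation update interleaved. */
static void warped_autocorr_fused(double *C, const float *x, double warping,
                                  int length, int order)
{
    double state[SKETCH_MAX_ORDER + 1] = {0};
    int n, i;
    assert(order >= 2 && order <= SKETCH_MAX_ORDER && (order & 1) == 0);
    for (i = 0; i <= order; i++) C[i] = 0.0;
    for (n = 0; n < length; n++) {
        double tmp1 = x[n], tmp2;
        for (i = 0; i < order; i += 2) {
            tmp2 = state[i] + warping * state[i + 1] - warping * tmp1;
            state[i] = tmp1;
            C[i] += state[0] * tmp1;
            tmp1 = state[i + 1] + warping * state[i + 2] - warping * tmp2;
            state[i + 1] = tmp2;
            C[i + 1] += state[0] * tmp2;
        }
        state[order] = tmp1;
        C[order] += state[0] * tmp1;
    }
}

/* Split shape: the loop-carried chain stays scalar in double, then the
 * correlation update runs as a separate loop over lags. */
static void warped_autocorr_split(double *C, const float *x, double warping,
                                  int length, int order)
{
    double state[SKETCH_MAX_ORDER + 1] = {0};
    int n, i;
    assert(order >= 2 && order <= SKETCH_MAX_ORDER && (order & 1) == 0);
    for (i = 0; i <= order; i++) C[i] = 0.0;
    for (n = 0; n < length; n++) {
        double tmp1 = x[n], tmp2;
        /* Step 1: serial all-pass cascade, same operations as above. */
        for (i = 0; i < order; i += 2) {
            tmp2 = state[i] + warping * state[i + 1] - warping * tmp1;
            state[i] = tmp1;
            tmp1 = state[i + 1] + warping * state[i + 2] - warping * tmp2;
            state[i + 1] = tmp2;
        }
        state[order] = tmp1;
        /* Step 2: each lag is independent of the others, so this loop is
         * the part a float64x2 kernel can process two lags at a time. */
        for (i = 0; i <= order; i++) {
            C[i] += state[0] * state[i];
        }
    }
}
```

With `warping` set to 0 the cascade reduces to a plain delay line, so the result is the ordinary autocorrelation; that makes a convenient sanity check for either version.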

3. silk_energy_FLP

Same f64-accumulation approach (sum of squares). ~1.26x kernel geomean (1.5-1.6x at the short lengths it's typically called with, 64-80); encoded bitstream unchanged. (silk_scale_vector_FLP was evaluated but is a trivially auto-vectorised elementwise multiply - measured ~1.01x, no win - so it's excluded.)
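A small sketch of why the widening happens before the squaring: the square of an f32 value is exact in f64, whereas squaring in f32 first rounds the product to 24 bits. `energy_f32_acc_f64` and `square_in_f32` are hypothetical names, and the two scalar partial sums only stand in for SIMD lanes.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical sketch: sum of squares with each f32 widened to f64 before
 * squaring, accumulated in two independent partial sums. */
static double energy_f32_acc_f64(const float *x, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    assert(n >= 0);
    for (i = 0; i + 2 <= n; i += 2) {
        double v0 = x[i], v1 = x[i + 1];
        s0 += v0 * v0;
        s1 += v1 * v1;
    }
    if (i < n) {
        double v = x[i];
        s0 += v * v;
    }
    return s0 + s1;
}

/* For contrast: squaring in f32 rounds the product before it is widened.
 * The volatile store forces the rounding to f32 to actually happen. */
static double square_in_f32(float v)
{
    volatile float p = v * v;
    return (double)p;
}
```

For `v = 1.0f + FLT_EPSILON` the two routes give different values, since the f32 product drops the lowest term of the exact square.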

Dispatch / wiring

Follows existing x86 precedent: inner_product uses the OVERRIDE_* hook + an _IMPL[arch] table in arm_silk_map.c (PRESUME + RTCD). warped/energy have no arch argument at their call sites, so they're dispatched on PRESUME-NEON targets (aarch64); ARMv7 runtime-detection keeps the C path. Build wiring added for autotools, CMake and Meson via a new SILK_SOURCES_FLOAT_ARM_NEON_INTR group.
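The two dispatch styles can be illustrated with the simplified sketch below. All identifiers here (`SKETCH_ARCH_MASK`, `SKETCH_PRESUME_NEON`, `INNER_PRODUCT_IMPL`, the `*_standin` functions) are hypothetical stand-ins for the pattern, not the macros or tables in the tree, and the number and ordering of arch tiers is an assumption.

```c
#include <assert.h>

#define SKETCH_ARCH_MASK 3 /* hypothetical: four tiers, NEON is the last */

typedef double (*inner_product_fn)(const float *a, const float *b, int n);

static double inner_product_c(const float *a, const float *b, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++) s += (double)a[i] * (double)b[i];
    return s;
}

/* Stand-in for the NEON kernel: same maths, pairwise accumulation order. */
static double inner_product_neon_standin(const float *a, const float *b, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 2 <= n; i += 2) {
        s0 += (double)a[i] * (double)b[i];
        s1 += (double)a[i + 1] * (double)b[i + 1];
    }
    if (i < n) s0 += (double)a[i] * (double)b[i];
    return s0 + s1;
}

/* RTCD: one entry per runtime-detected tier; only the last one has NEON. */
static const inner_product_fn INNER_PRODUCT_IMPL[SKETCH_ARCH_MASK + 1] = {
    inner_product_c,
    inner_product_c,
    inner_product_c,
    inner_product_neon_standin
};

#if defined(SKETCH_PRESUME_NEON)
/* PRESUME: NEON is guaranteed at build time, so call the kernel directly. */
# define inner_product(a, b, n, arch) \
    ((void)(arch), inner_product_neon_standin(a, b, n))
#else
# define inner_product(a, b, n, arch) \
    (INNER_PRODUCT_IMPL[(arch) & SKETCH_ARCH_MASK](a, b, n))
#endif

static double energy_c(const float *x, int n)
{
    return inner_product_c(x, x, n);
}

/* A call site without an arch argument can only be switched at compile
 * time, which is why such kernels fall back to C on runtime-detect builds. */
#if defined(SKETCH_PRESUME_NEON)
# define energy(x, n) inner_product_neon_standin(x, x, n)
#else
# define energy(x, n) energy_c(x, n)
#endif
```

The table lookup costs an indirect call but works on builds that detect NEON at run time; the compile-time macro has no such cost but needs the target to guarantee NEON.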

Notes

Benchmarks/validation on Apple M4 (aarch64).

Ckristian Zoli and others added 3 commits June 16, 2026 16:38
The float SILK analysis path had no Arm SIMD at all (no silk/float/arm/),
even though x86 provides an AVX2 silk_inner_product_FLP. This is the
workhorse float dot product, called from the LPC/Burg autocorrelation,
LTP correlation matrix and pitch analysis, so it benefits those callers
transitively.

Add a NEON implementation that, like the AVX2 one, widens each f32 operand
to f64 before multiplying and accumulates in two float64x2 lanes, matching
the C reference's double-precision accumulation. It is numerically faithful
to silk_inner_product_FLP_c (worst relative error 9.8e-11 vs a long-double
reference over 744k adversarial vectors, vs the scalar reference's own
6.8e-11 -- both ~1000x below f32 epsilon).

Wired via the existing OVERRIDE_inner_product_FLP hook: a new
silk/arm/SigProc_FLP_arm.h provides the PRESUME (direct call) and RTCD
(SILK_INNER_PRODUCT_FLP_IMPL table in arm_silk_map.c) dispatch, mirroring
silk/x86/main_sse.h. Build wiring added for autotools, CMake and Meson via
a new SILK_SOURCES_FLOAT_ARM_NEON_INTR group.

Kernel microbench on Apple M4 (real operating lengths 48-256): 1.59x
geomean over scalar (1.6-2.0x at the dominant 48-96). End-to-end encode
time is within run-to-run noise (this kernel is ~2.5% of SILK encode) and
the encoded bitstream is byte-identical across mono/stereo/5.1 x NB-FB x
the full bitrate range, so output is unchanged.

The full meson test suite passes (test_opus_encode/decode/api etc.).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

silk_warped_autocorrelation_FLP is ~11% of float SILK/hybrid encode and is
scalar on every platform (the only NEON warped autocorrelation is the
fixed-point one). It runs whenever warping is enabled (complexity >= 4),
i.e. across the RTC speech operating point.

Add a NEON implementation. The reference runs a serial all-pass cascade per
sample (loop-carried, not parallelisable across taps without changing the
rounding), so we keep that chain scalar in double precision -- producing the
SAME state[] as the C reference bit-for-bit -- and vectorise the per-lag
correlation accumulation across (order+1) lags with float64x2 (f64 matches
the reference's double C[]). Only the 2-wide lane reduction reorders adds, so
the result is within ~1e-15 of the reference.

Renames silk_warped_autocorrelation_FLP to ..._c behind a new
OVERRIDE_warped_autocorrelation_FLP hook in main_FLP.h, mirroring the
inner_product_FLP pattern. The call site has no arch argument, so this is
dispatched on PRESUME-NEON targets (aarch64); ARMv7 runtime-detection builds
keep the C path (adding an arch parameter would enable RTCD there too).

Kernel microbench on Apple M4 (production dims order 16/20/24, length
120-240): 1.34x geomean (1.47x at order 24). End-to-end RTC voice encode
(VOIP+DTX+FEC, mono WB) is ~1.04-1.05x faster -- ~4-6% more concurrent
real-time streams per core -- with byte-identical bitstream (verified across
mono/stereo, NB/WB/FB). Full meson test suite passes.

A faster non-bit-exact variant exists (vectorises the all-pass state in f32
via a lag-parallel reformulation of the fixed-point NEON kernel): ~1.89x
kernel, ~6-9% RTC E2E. It is NOT bit-exact -- validated BD-rate cost on real
speech is +0.69% (PESQ-NB) / +0.61% (PESQ-WB), and +0.055% on real 48 kHz
music (opus_compare), i.e. sub-perceptual but a real, systematic cost. It can
be offered behind a build option if the speed/exactness trade-off is wanted;
this commit ships the bit-exact variant by default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

silk_energy_FLP (sum of squares of a float vector, used in residual-energy
and LTP analysis) was scalar on Arm. Add a NEON implementation following the
same approach as silk_inner_product_FLP: widen each f32 to f64 before
squaring and accumulate in two float64x2 lanes, matching the C reference's
double accumulation (within rounding, well below float precision).

Adds the OVERRIDE_energy_FLP hook (rename to silk_energy_FLP_c + macro in
SigProc_FLP.h) and the NEON dispatch in silk/arm/SigProc_FLP_arm.h. Like the
warped kernel, the call site has no arch argument, so it is dispatched on
PRESUME-NEON targets (aarch64); ARMv7 runtime-detection keeps the C path.

Kernel microbench on Apple M4: 1.26x geomean, 1.5-1.6x at the short vector
lengths it is typically called with (64-80), tapering to memory-bound at
larger sizes. End-to-end within run-to-run noise (small kernel) and the
encoded bitstream is unchanged. Full meson test suite passes.

(silk_scale_vector_FLP was also evaluated but is not included: it is a
trivially auto-vectorisable elementwise multiply that the compiler already
vectorises, so a hand-written NEON version measured 1.01x -- no win.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>