silk: Arm NEON for the float SILK analysis path (inner_product, warped autocorrelation, energy) by czoli1976 · Pull Request #481 · xiph/opus

czoli1976 · 2026-06-16T15:40:17Z

Summary

The float SILK analysis path (silk/float/) had no Arm SIMD at all (only x86 has an AVX2 inner_product_FLP). This adds the first silk/float/arm/ NEON tier — three kernels — wired into the existing RTCD dispatch and all three build systems (autotools / CMake / Meson). The full meson test suite passes, and each kernel carries an OPUS_CHECK_ASM self-check (within-rounding, since float kernels reorder the f64 accumulation and so aren't bit-exact like fixed-point ones).

1. `silk_inner_product_FLP`

NEON with f64 accumulation matching the C double reference (mirrors the AVX2 path and its RTCD wiring). It's the workhorse float dot product (LPC/Burg autocorrelation, LTP correlation, pitch analysis), so those callers benefit transitively.

Kernel: ~1.59x geomean over scalar on Apple M4 (1.6-2.0x at the dominant 48-96 lengths).
Numerically faithful: worst rel. error 9.8e-11 vs a long-double reference over 744k adversarial vectors (the scalar reference's own error is 6.8e-11).
E2E: encode time within run-to-run noise (this kernel is ~2.5% of SILK encode); encoded bitstream byte-identical across mono/stereo/5.1 x NB-FB x the bitrate range.
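The widening-then-accumulate scheme described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `inner_product_ref` and `inner_product_sketch` are hypothetical, the NEON branch is compiled only on aarch64, and other targets fall through to the scalar loop. It assumes the C reference accumulates `data1[i] * (double)data2[i]` in a double.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference (hypothetical name): products and the running sum are double. */
double inner_product_ref(const float *a, const float *b, size_t n)
{
    double sum = 0.0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        sum += a[i] * (double)b[i];
    return sum;
}

/* Sketch: widen f32 -> f64, accumulate in two float64x2 registers. */
double inner_product_sketch(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    double sum = 0.0;
    assert(a != NULL && b != NULL);
#if defined(__aarch64__)
    float64x2_t acc0 = vdupq_n_f64(0.0);
    float64x2_t acc1 = vdupq_n_f64(0.0);
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        /* Low two elements, widened to double before the multiply. */
        acc0 = vfmaq_f64(acc0, vcvt_f64_f32(vget_low_f32(va)),
                               vcvt_f64_f32(vget_low_f32(vb)));
        /* High two elements, widened the same way. */
        acc1 = vfmaq_f64(acc1, vcvt_high_f64_f32(va), vcvt_high_f64_f32(vb));
    }
    /* Horizontal reduction: this is where the add order differs from scalar. */
    sum = vaddvq_f64(vaddq_f64(acc0, acc1));
#endif
    /* Scalar tail (and the whole loop on non-aarch64 targets). */
    for (; i < n; i++)
        sum += a[i] * (double)b[i];
    return sum;
}
```

The product of two f32 values is exactly representable in f64, so the only source of difference from the scalar reference is the order in which partial sums are added.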

2. `silk_warped_autocorrelation_FLP` (~11% of float SILK/hybrid encode)

Scalar on every platform until now (the only NEON warped autocorr was fixed-point). The all-pass cascade is loop-carried, so this keeps that chain scalar in double (bit-exact state) and vectorises the per-lag correlation accumulation in f64.

Kernel: ~1.34x geomean (1.47x at order 24).
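All-pass state is bit-exact with the C reference and the correlations agree to within ~1e-15 (only the 2-wide lane reduction reorders adds) -> byte-identical bitstream.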
E2E: ~1.04-1.05x faster RTC voice encode (VOIP+DTX+FEC, mono WB) -> ~4-6% more concurrent real-time streams per core.
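The split between the serial all-pass chain and the vectorisable per-lag accumulation can be sketched in portable C. This is an illustration under assumptions, not the PR's kernel: the function names and `SKETCH_MAX_ORDER` are hypothetical, and the all-pass recurrence is written in a simplified form that may differ in detail from the Opus source. The point is only that the second loop in `warped_autocorr_split` has no loop-carried dependency, so it is the part that maps onto float64x2 operations.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 24  /* hypothetical bound for this sketch */

/* Fused form: all-pass chain and correlation update interleaved per tap. */
void warped_autocorr_fused(float *corr, const float *input, float warping,
                           int length, int order)
{
    double state[SKETCH_MAX_ORDER + 1] = { 0 };
    double C[SKETCH_MAX_ORDER + 1] = { 0 };
    assert(order <= SKETCH_MAX_ORDER && (order & 1) == 0);
    for (int n = 0; n < length; n++) {
        double tmp1 = input[n];
        for (int i = 0; i < order; i += 2) {
            double tmp2 = state[i] + warping * (state[i + 1] - tmp1);
            state[i] = tmp1;
            C[i] += state[0] * tmp1;
            tmp1 = state[i + 1] + warping * (state[i + 2] - tmp2);
            state[i + 1] = tmp2;
            C[i + 1] += state[0] * tmp2;
        }
        state[order] = tmp1;
        C[order] += state[0] * tmp1;
    }
    for (int i = 0; i <= order; i++)
        corr[i] = (float)C[i];
}

/* Split form: scalar chain first, then an independent per-lag update. */
void warped_autocorr_split(float *corr, const float *input, float warping,
                           int length, int order)
{
    double state[SKETCH_MAX_ORDER + 1] = { 0 };
    double C[SKETCH_MAX_ORDER + 1] = { 0 };
    assert(order <= SKETCH_MAX_ORDER && (order & 1) == 0);
    for (int n = 0; n < length; n++) {
        /* Phase 1: loop-carried all-pass cascade, kept scalar in double. */
        double tmp1 = input[n];
        for (int i = 0; i < order; i += 2) {
            double tmp2 = state[i] + warping * (state[i + 1] - tmp1);
            state[i] = tmp1;
            tmp1 = state[i + 1] + warping * (state[i + 2] - tmp2);
            state[i + 1] = tmp2;
        }
        state[order] = tmp1;
        /* Phase 2: each lag is independent, so this loop can be vectorised. */
        double s0 = state[0];
        for (int i = 0; i <= order; i++)
            C[i] += s0 * state[i];
    }
    for (int i = 0; i <= order; i++)
        corr[i] = (float)C[i];
}
```

Both forms multiply the same `state[0]` by the same per-tap values, so they should agree up to floating-point contraction choices made by the compiler.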

A faster non-bit-exact variant exists (lag-parallel reformulation of the fixed-point NEON kernel, all-pass state in f32): ~1.89x kernel, ~6-9% RTC E2E. Its measured BD-rate cost is +0.69% (PESQ-NB) / +0.61% (PESQ-WB) on real speech and +0.055% on 48 kHz music (opus_compare) - sub-perceptual but real. This PR ships the bit-exact variant by default; the faster one can be offered behind a build option if the trade-off is wanted.

3. `silk_energy_FLP`

Same f64-accumulation approach (sum of squares). ~1.26x kernel geomean (1.5-1.6x at the short lengths, 64-80, it's typically called with); matches the C reference within rounding, and the encoded bitstream is unchanged. (silk_scale_vector_FLP was evaluated but is a trivially auto-vectorised elementwise multiply - measured ~1.01x, no win - so it's excluded.)

Dispatch / wiring

Follows existing x86 precedent: inner_product uses the OVERRIDE_* hook + an _IMPL[arch] table in arm_silk_map.c (PRESUME + RTCD). warped/energy have no arch argument at their call sites, so they're dispatched on PRESUME-NEON targets (aarch64); ARMv7 runtime-detection keeps the C path. Build wiring added for autotools, CMake and Meson via a new SILK_SOURCES_FLOAT_ARM_NEON_INTR group.
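The two dispatch styles described above can be illustrated with a self-contained mock. Everything here is hypothetical (the `toy_`/`TOY_` names, the four-entry table, the `TOY_PRESUME_NEON` switch) and only mimics the shape of the pattern; the real macros and tables live in the Opus headers and arm_silk_map.c.

```c
#include <assert.h>

#define TOY_ARCHMASK 3  /* hypothetical: four CPU tiers, the last one NEON */

double toy_inner_product_c(const float *a, const float *b, int n)
{
    double sum = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        sum += a[i] * (double)b[i];
    return sum;
}

/* Stand-in for a NEON kernel; here it simply forwards to the C version. */
double toy_inner_product_neon(const float *a, const float *b, int n)
{
    return toy_inner_product_c(a, b, n);
}

double toy_energy_c(const float *x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * (double)x[i];
    return sum;
}

/* Stand-in for a NEON kernel, as above. */
double toy_energy_neon(const float *x, int n)
{
    return toy_energy_c(x, n);
}

#if defined(TOY_PRESUME_NEON)
/* PRESUME: NEON is guaranteed at build time, so call it directly. */
# define toy_inner_product(a, b, n, arch) \
    ((void)(arch), toy_inner_product_neon(a, b, n))
/* No arch argument at the call site: only the PRESUME build can use NEON. */
# define toy_energy(x, n) toy_energy_neon(x, n)
#else
/* RTCD: pick the implementation from a table indexed by the detected arch. */
static double (*const TOY_INNER_PRODUCT_IMPL[TOY_ARCHMASK + 1])(
    const float *, const float *, int) = {
    toy_inner_product_c,    /* tier 0 */
    toy_inner_product_c,    /* tier 1 */
    toy_inner_product_c,    /* tier 2 */
    toy_inner_product_neon  /* tier 3: NEON */
};
# define toy_inner_product(a, b, n, arch) \
    ((*TOY_INNER_PRODUCT_IMPL[(arch) & TOY_ARCHMASK])(a, b, n))
/* Without an arch argument there is nothing to index with: keep the C path. */
# define toy_energy(x, n) toy_energy_c(x, n)
#endif
```

A caller writes `toy_inner_product(a, b, n, arch)` or `toy_energy(x, n)` in either build; the macro decides whether that becomes a direct call or a table lookup.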

Notes

Complementary to #473, "Optimize Silk encoder NEON paths and fix warped autocorrelation precision bug" (which optimises the fixed-point SILK paths) - no file overlap.
The override wiring follows current convention; happy to adopt the streamlined mechanism from #392, "RFC: Streamline implementation overrides", if that RFC lands.

Benchmarks/validation on Apple M4 (aarch64).

The float SILK analysis path had no Arm SIMD at all (no silk/float/arm/), even though x86 provides an AVX2 silk_inner_product_FLP. This is the workhorse float dot product, called from the LPC/Burg autocorrelation, LTP correlation matrix and pitch analysis, so it benefits those callers transitively.

Add a NEON implementation that, like the AVX2 one, widens each f32 operand to f64 before multiplying and accumulates in two float64x2 lanes, matching the C reference's double-precision accumulation. It is numerically faithful to silk_inner_product_FLP_c (worst relative error 9.8e-11 vs a long-double reference over 744k adversarial vectors, vs the scalar reference's own 6.8e-11 -- both ~1000x below f32 epsilon).

Wired via the existing OVERRIDE_inner_product_FLP hook: a new silk/arm/SigProc_FLP_arm.h provides the PRESUME (direct call) and RTCD (SILK_INNER_PRODUCT_FLP_IMPL table in arm_silk_map.c) dispatch, mirroring silk/x86/main_sse.h. Build wiring added for autotools, CMake and Meson via a new SILK_SOURCES_FLOAT_ARM_NEON_INTR group.

Kernel microbench on Apple M4 (real operating lengths 48-256): 1.59x geomean over scalar (1.6-2.0x at the dominant 48-96). End-to-end encode time is within run-to-run noise (this kernel is ~2.5% of SILK encode) and the encoded bitstream is byte-identical across mono/stereo/5.1 x NB-FB x the full bitrate range, so output is unchanged. The full meson test suite passes (test_opus_encode/decode/api etc.).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

silk_warped_autocorrelation_FLP is ~11% of float SILK/hybrid encode and is scalar on every platform (the only NEON warped autocorrelation is the fixed-point one). It runs whenever warping is enabled (complexity >= 4), i.e. across the RTC speech operating point. Add a NEON implementation.

The reference runs a serial all-pass cascade per sample (loop-carried, not parallelisable across taps without changing the rounding), so we keep that chain scalar in double precision -- producing the SAME state[] as the C reference bit-for-bit -- and vectorise the per-lag correlation accumulation across (order+1) lags with float64x2 (f64 matches the reference's double C[]). Only the 2-wide lane reduction reorders adds, so the result is within ~1e-15 of the reference.

Renames silk_warped_autocorrelation_FLP to ..._c behind a new OVERRIDE_warped_autocorrelation_FLP hook in main_FLP.h, mirroring the inner_product_FLP pattern. The call site has no arch argument, so this is dispatched on PRESUME-NEON targets (aarch64); ARMv7 runtime-detection builds keep the C path (adding an arch parameter would enable RTCD there too).

Kernel microbench on Apple M4 (production dims order 16/20/24, length 120-240): 1.34x geomean (1.47x at order 24). End-to-end RTC voice encode (VOIP+DTX+FEC, mono WB) is ~1.04-1.05x faster -- ~4-6% more concurrent real-time streams per core -- with byte-identical bitstream (verified across mono/stereo, NB/WB/FB). Full meson test suite passes.

A faster non-bit-exact variant exists (vectorises the all-pass state in f32 via a lag-parallel reformulation of the fixed-point NEON kernel): ~1.89x kernel, ~6-9% RTC E2E. It is NOT bit-exact -- validated BD-rate cost on real speech is +0.69% (PESQ-NB) / +0.61% (PESQ-WB), and +0.055% on real 48 kHz music (opus_compare), i.e. sub-perceptual but a real, systematic cost. It can be offered behind a build option if the speed/exactness trade-off is wanted; this commit ships the bit-exact variant by default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

silk_energy_FLP (sum of squares of a float vector, used in residual-energy and LTP analysis) was scalar on Arm. Add a NEON implementation following the same approach as silk_inner_product_FLP: widen each f32 to f64 before squaring and accumulate in two float64x2 lanes, matching the C reference's double accumulation (within rounding, well below float precision).

Adds the OVERRIDE_energy_FLP hook (rename to silk_energy_FLP_c + macro in SigProc_FLP.h) and the NEON dispatch in silk/arm/SigProc_FLP_arm.h. Like the warped kernel, the call site has no arch argument, so it is dispatched on PRESUME-NEON targets (aarch64); ARMv7 runtime-detection keeps the C path.

Kernel microbench on Apple M4: 1.26x geomean, 1.5-1.6x at the short vector lengths it is typically called with (64-80), tapering to memory-bound at larger sizes. End-to-end within run-to-run noise (small kernel) and the encoded bitstream is unchanged. Full meson test suite passes.

(silk_scale_vector_FLP was also evaluated but is not included: it is a trivially auto-vectorisable elementwise multiply that the compiler already vectorises, so a hand-written NEON version measured 1.01x -- no win.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Ckristian Zoli and others added 3 commits June 16, 2026 16:38

czoli1976 mentioned this pull request Jun 16, 2026

x86: runtime AVX512-VNNI tier for the int8 DNN GEMV (+ multi-accumulator cgemv) #484

Closed

czoli1976 wants to merge 3 commits into xiph:main from czoli1976:arm-silk-float-neon
