silk: Arm NEON for the float SILK analysis path (inner_product, warped autocorrelation, energy) #481
Open
czoli1976 wants to merge 3 commits into
The float SILK analysis path had no Arm SIMD at all (no silk/float/arm/), even though x86 provides an AVX2 silk_inner_product_FLP. This is the workhorse float dot product, called from the LPC/Burg autocorrelation, LTP correlation matrix and pitch analysis, so it benefits those callers transitively.

Add a NEON implementation that, like the AVX2 one, widens each f32 operand to f64 before multiplying and accumulates in two float64x2 lanes, matching the C reference's double-precision accumulation. It is numerically faithful to silk_inner_product_FLP_c (worst relative error 9.8e-11 vs a long-double reference over 744k adversarial vectors, vs the scalar reference's own 6.8e-11 -- both ~1000x below f32 epsilon).

Wired via the existing OVERRIDE_inner_product_FLP hook: a new silk/arm/SigProc_FLP_arm.h provides the PRESUME (direct call) and RTCD (SILK_INNER_PRODUCT_FLP_IMPL table in arm_silk_map.c) dispatch, mirroring silk/x86/main_sse.h. Build wiring added for autotools, CMake and Meson via a new SILK_SOURCES_FLOAT_ARM_NEON_INTR group.

Kernel microbench on Apple M4 (real operating lengths 48-256): 1.59x geomean over scalar (1.6-2.0x at the dominant 48-96). End-to-end encode time is within run-to-run noise (this kernel is ~2.5% of SILK encode) and the encoded bitstream is byte-identical across mono/stereo/5.1 x NB-FB x the full bitrate range, so output is unchanged. The full meson test suite passes (test_opus_encode/decode/api etc.).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
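The widen-then-accumulate scheme described in this commit message can be sketched roughly as follows. This is an illustrative reconstruction, not the code in the PR: the function names `inner_product_ref` and `inner_product_sketch` are hypothetical, "two float64x2 lanes" is read here as two float64x2 accumulators, and the non-aarch64 branch simply falls back to the scalar loop so the file compiles anywhere.

```c
#include <assert.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar model: widen each f32 to double, accumulate in double. */
static double inner_product_ref(const float *a, const float *b, int n)
{
    double acc = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        acc += (double)a[i] * (double)b[i];
    return acc;
}

/* Hypothetical NEON sketch (assumption, not the PR's kernel). */
static double inner_product_sketch(const float *a, const float *b, int n)
{
    assert(n >= 0);
#if defined(__aarch64__)
    float64x2_t acc0 = vdupq_n_f64(0.0);
    float64x2_t acc1 = vdupq_n_f64(0.0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        /* Widen the low and high f32 pairs to f64 before multiplying. */
        float64x2_t a_lo = vcvt_f64_f32(vget_low_f32(va));
        float64x2_t a_hi = vcvt_high_f64_f32(va);
        float64x2_t b_lo = vcvt_f64_f32(vget_low_f32(vb));
        float64x2_t b_hi = vcvt_high_f64_f32(vb);
        /* An f32*f32 product is exact in f64, so the fused multiply-add
           rounds the same way as a separate multiply and add. */
        acc0 = vfmaq_f64(acc0, a_lo, b_lo);
        acc1 = vfmaq_f64(acc1, a_hi, b_hi);
    }
    /* Lane reduction: the only place the add order differs from scalar. */
    double acc = vaddvq_f64(vaddq_f64(acc0, acc1));
    for (; i < n; i++)                     /* scalar tail */
        acc += (double)a[i] * (double)b[i];
    return acc;
#else
    return inner_product_ref(a, b, n);     /* portable fallback */
#endif
}
```

Because only the final reduction reorders the additions, a sketch like this would be expected to agree with the scalar loop to within double-precision rounding, which is consistent with the error figures quoted above.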
silk_warped_autocorrelation_FLP is ~11% of float SILK/hybrid encode and is scalar on every platform (the only NEON warped autocorrelation is the fixed-point one). It runs whenever warping is enabled (complexity >= 4), i.e. across the RTC speech operating point.

Add a NEON implementation. The reference runs a serial all-pass cascade per sample (loop-carried, not parallelisable across taps without changing the rounding), so we keep that chain scalar in double precision -- producing the SAME state[] as the C reference bit-for-bit -- and vectorise the per-lag correlation accumulation across (order+1) lags with float64x2 (f64 matches the reference's double C[]). Only the 2-wide lane reduction reorders adds, so the result is within ~1e-15 of the reference.

Renames silk_warped_autocorrelation_FLP to ..._c behind a new OVERRIDE_warped_autocorrelation_FLP hook in main_FLP.h, mirroring the inner_product_FLP pattern. The call site has no arch argument, so this is dispatched on PRESUME-NEON targets (aarch64); ARMv7 runtime-detection builds keep the C path (adding an arch parameter would enable RTCD there too).

Kernel microbench on Apple M4 (production dims order 16/20/24, length 120-240): 1.34x geomean (1.47x at order 24). End-to-end RTC voice encode (VOIP+DTX+FEC, mono WB) is ~1.04-1.05x faster -- ~4-6% more concurrent real-time streams per core -- with byte-identical bitstream (verified across mono/stereo, NB/WB/FB). Full meson test suite passes.

A faster non-bit-exact variant exists (vectorises the all-pass state in f32 via a lag-parallel reformulation of the fixed-point NEON kernel): ~1.89x kernel, ~6-9% RTC E2E. It is NOT bit-exact -- validated BD-rate cost on real speech is +0.69% (PESQ-NB) / +0.61% (PESQ-WB), and +0.055% on real 48 kHz music (opus_compare), i.e. sub-perceptual but a real, systematic cost. It can be offered behind a build option if the speed/exactness trade-off is wanted; this commit ships the bit-exact variant by default.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
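The split between a scalar all-pass chain and a vectorised per-lag accumulation might look something like the sketch below. Everything here is an assumption made for illustration: the all-pass recurrence in `warped_autocorr_model` is a scalar model written from the description above rather than copied from the Opus sources, and `SKETCH_MAX_ORDER`, `warped_autocorr_model` and `warped_autocorr_sketch` are hypothetical names. The sketch reloads `C[]` from memory each sample for clarity; a tuned kernel would presumably keep it in registers, and the PR's actual lane layout may differ.

```c
#include <assert.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

#define SKETCH_MAX_ORDER 24   /* hypothetical bound (largest order quoted above) */

/* Scalar model: cascade and correlation update interleaved per section. */
static void warped_autocorr_model(float *corr, const float *input,
                                  float warping, int length, int order)
{
    double state[SKETCH_MAX_ORDER + 1] = { 0 };
    double C[SKETCH_MAX_ORDER + 1] = { 0 };
    assert(order <= SKETCH_MAX_ORDER && (order & 1) == 0);
    for (int n = 0; n < length; n++) {
        double tmp1 = input[n], tmp2;
        for (int i = 0; i < order; i += 2) {
            tmp2 = state[i] + warping * (state[i + 1] - tmp1);
            state[i] = tmp1;
            C[i] += state[0] * tmp1;
            tmp1 = state[i + 1] + warping * (state[i + 2] - tmp2);
            state[i + 1] = tmp2;
            C[i + 1] += state[0] * tmp2;
        }
        state[order] = tmp1;
        C[order] += state[0] * tmp1;
    }
    for (int i = 0; i <= order; i++) corr[i] = (float)C[i];
}

/* Sketch: same cascade, but the lag accumulation is hoisted out of it. */
static void warped_autocorr_sketch(float *corr, const float *input,
                                   float warping, int length, int order)
{
    double state[SKETCH_MAX_ORDER + 1] = { 0 };
    double C[SKETCH_MAX_ORDER + 1] = { 0 };
    assert(order <= SKETCH_MAX_ORDER && (order & 1) == 0);
    for (int n = 0; n < length; n++) {
        /* 1) Loop-carried all-pass chain stays scalar, in double. */
        double tmp1 = input[n], tmp2;
        for (int i = 0; i < order; i += 2) {
            tmp2 = state[i] + warping * (state[i + 1] - tmp1);
            state[i] = tmp1;
            tmp1 = state[i + 1] + warping * (state[i + 2] - tmp2);
            state[i + 1] = tmp2;
        }
        state[order] = tmp1;
        /* 2) C[i] += state[0] * state[i] is independent across lags i. */
        int i = 0;
#if defined(__aarch64__)
        float64x2_t s0 = vdupq_n_f64(state[0]);
        for (; i + 2 <= order + 1; i += 2) {
            float64x2_t c = vld1q_f64(C + i);
            c = vfmaq_f64(c, s0, vld1q_f64(state + i));
            vst1q_f64(C + i, c);
        }
#endif
        for (; i <= order; i++)            /* odd lag count leaves a tail */
            C[i] += state[0] * state[i];
    }
    for (int i = 0; i <= order; i++) corr[i] = (float)C[i];
}
```

The hoisting works because, once the cascade for sample n has finished, `state[i]` holds exactly the value that the interleaved loop multiplies by `state[0]` for lag i.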
silk_energy_FLP (sum of squares of a float vector, used in residual-energy and LTP analysis) was scalar on Arm. Add a NEON implementation following the same approach as silk_inner_product_FLP: widen each f32 to f64 before squaring and accumulate in two float64x2 lanes, matching the C reference's double accumulation (within rounding, well below float precision).

Adds the OVERRIDE_energy_FLP hook (rename to silk_energy_FLP_c + macro in SigProc_FLP.h) and the NEON dispatch in silk/arm/SigProc_FLP_arm.h. Like the warped kernel, the call site has no arch argument, so it is dispatched on PRESUME-NEON targets (aarch64); ARMv7 runtime-detection keeps the C path.

Kernel microbench on Apple M4: 1.26x geomean, 1.5-1.6x at the short vector lengths it is typically called with (64-80), tapering to memory-bound at larger sizes. End-to-end within run-to-run noise (small kernel) and the encoded bitstream is unchanged. Full meson test suite passes.

(silk_scale_vector_FLP was also evaluated but is not included: it is a trivially auto-vectorisable elementwise multiply that the compiler already vectorises, so a hand-written NEON version measured 1.01x -- no win.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary

The float SILK analysis path (silk/float/) had no Arm SIMD at all (only x86 has an AVX2 inner_product_FLP). This adds the first silk/float/arm/ NEON tier, consisting of three kernels, wired into the existing RTCD dispatch and all three build systems (autotools / CMake / Meson). The full meson test suite passes, and each kernel carries an OPUS_CHECK_ASM self-check (within-rounding, since float kernels reorder the f64 accumulation and so aren't bit-exact like fixed-point ones).

1. silk_inner_product_FLP

NEON with f64 accumulation matching the C double reference (mirrors the AVX2 path and its RTCD wiring). It's the workhorse float dot product (LPC/Burg autocorrelation, LTP correlation, pitch analysis), so those callers benefit transitively.

2. silk_warped_autocorrelation_FLP (~11% of float SILK/hybrid encode)

Scalar on every platform until now (the only NEON warped autocorr was fixed-point). The all-pass cascade is loop-carried, so this keeps that chain scalar in double (bit-exact state) and vectorises the per-lag correlation accumulation in f64.

A faster non-bit-exact variant exists (lag-parallel reformulation of the fixed-point NEON kernel, all-pass state in f32): ~1.89x kernel, ~6-9% RTC E2E. Its measured BD-rate cost is +0.69% (PESQ-NB) / +0.61% (PESQ-WB) on real speech and +0.055% on 48 kHz music (opus_compare): sub-perceptual but real. This PR ships the bit-exact variant by default; the faster one can be offered behind a build option if the trade-off is wanted.

3. silk_energy_FLP

Same f64-accumulation approach (sum of squares). ~1.26x kernel (1.6x at the short lengths it's called with); the encoded bitstream is unchanged. (silk_scale_vector_FLP was evaluated but is a trivially auto-vectorised elementwise multiply, measured at ~1.01x, so it's excluded.)

Dispatch / wiring

Follows existing x86 precedent: inner_product uses the OVERRIDE_* hook + an _IMPL[arch] table in arm_silk_map.c (PRESUME + RTCD). warped/energy have no arch argument at their call sites, so they're dispatched on PRESUME-NEON targets (aarch64); ARMv7 runtime-detection keeps the C path. Build wiring added for autotools, CMake and Meson via a new SILK_SOURCES_FLOAT_ARM_NEON_INTR group.

Notes

Benchmarks/validation on Apple M4 (aarch64).
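For readers unfamiliar with the PRESUME versus RTCD distinction used in the wiring section, the following self-contained sketch shows the general shape of such a dispatch. It is not taken from the PR or from the Opus headers: every `SKETCH_`/`sketch_` name, the four-entry arch table and the stand-in kernel are hypothetical, and the "NEON" function is plain C so that the example runs on any host.

```c
#include <assert.h>

#define SKETCH_ARCHMASK 3   /* hypothetical: arch levels 0..3, level 3 = NEON */

/* Plain C kernel. */
static double dot_c(const float *a, const float *b, int n)
{
    double acc = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) acc += (double)a[i] * (double)b[i];
    return acc;
}

/* Stand-in for a SIMD kernel: two partial sums, reduced at the end. */
static double dot_neon_standin(const float *a, const float *b, int n)
{
    double acc0 = 0.0, acc1 = 0.0;
    int i = 0;
    assert(n >= 0);
    for (; i + 2 <= n; i += 2) {
        acc0 += (double)a[i] * (double)b[i];
        acc1 += (double)a[i + 1] * (double)b[i + 1];
    }
    if (i < n) acc0 += (double)a[i] * (double)b[i];
    return acc0 + acc1;
}

#if defined(SKETCH_PRESUME_NEON)
/* PRESUME: the target is known to have NEON, so call the kernel directly. */
#define sketch_inner_product(a, b, n, arch) \
    ((void)(arch), dot_neon_standin(a, b, n))
#else
/* RTCD: pick the kernel at run time from a table indexed by arch level. */
static double (*const SKETCH_INNER_PRODUCT_IMPL[SKETCH_ARCHMASK + 1])(
    const float *, const float *, int) = {
    dot_c,             /* level 0: no SIMD  */
    dot_c,             /* level 1: no NEON  */
    dot_c,             /* level 2: no NEON  */
    dot_neon_standin,  /* level 3: NEON     */
};
#define sketch_inner_product(a, b, n, arch) \
    ((*SKETCH_INNER_PRODUCT_IMPL[(arch) & SKETCH_ARCHMASK])(a, b, n))
#endif

/* A kernel whose call site carries no arch value can only be selected at
   compile time, which is why it falls back to C on runtime-detect builds. */
#if defined(SKETCH_PRESUME_NEON)
#define sketch_energy(x, n) dot_neon_standin(x, x, n)
#else
#define sketch_energy(x, n) dot_c(x, x, n)
#endif
```

A call such as `sketch_inner_product(x, y, len, arch)` resolves to a direct call when the hypothetical `SKETCH_PRESUME_NEON` is defined and to a table lookup otherwise, while `sketch_energy(x, len)` has no run-time information to dispatch on.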