gh-151289: Add a wide int fast path for add/sub #151290
Open
KRRT7 wants to merge 17 commits into
…declaration

Add inline infrastructure to pycore_long.h for the upcoming wide int addition fast path:

- _PY_LONG_MAX_DIGITS_FOR_INT64: macro for the maximum digit count that can still fit in int64_t (2 on 30-bit builds, 5 on 15-bit)
- _PyLong_FitsInt64(): cheap tag-based check; fast-paths compact and small-digit ints before inspecting the boundary digit
- _PyLong_CheckExactAndFitsInt64(): exact-type + fits-int64 check for use in specialization guards
- _PyLong_TryAsInt64Exact(): no-exception int64 extraction; special-cases the ndigits==2/30-bit path for the common case
- PyAPI_FUNC declaration for _PyCompactLong_AddWide()

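The no-exception extraction described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code in the patch: it assumes a 30-bit digit build and a sign-magnitude digit array stored least significant digit first, and `try_as_int64`, `DIGIT_BITS`, and `MODEL_MAX_DIGITS` are hypothetical names that need not match `_PyLong_TryAsInt64Exact` or `_PY_LONG_MAX_DIGITS_FOR_INT64`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BITS 30        /* assumption: 30-bit digit build */
#define MODEL_MAX_DIGITS 3   /* hypothetical: digits needed to reach 2^63 */

/* Try to convert a sign-magnitude digit array (least significant digit
   first) to int64_t. Returns true and stores the value in *out when it
   fits, false otherwise. It never reports an error; the caller is
   expected to fall back to a slower path on false. */
static bool
try_as_int64(const uint32_t *digits, size_t ndigits, bool negative,
             int64_t *out)
{
    if (ndigits > MODEL_MAX_DIGITS) {
        return false;
    }
    uint64_t mag = 0;
    for (size_t i = ndigits; i-- > 0; ) {
        assert(digits[i] < (UINT32_C(1) << DIGIT_BITS));
        /* Bits that the next shift would push out of 64 bits mean the
           magnitude cannot fit. */
        if (mag >> (64 - DIGIT_BITS)) {
            return false;
        }
        mag = (mag << DIGIT_BITS) | digits[i];
    }
    if (negative) {
        /* The magnitude of INT64_MIN is INT64_MAX + 1. */
        if (mag > (uint64_t)INT64_MAX + 1) {
            return false;
        }
        *out = (mag == (uint64_t)INT64_MAX + 1) ? INT64_MIN : -(int64_t)mag;
    }
    else {
        if (mag > (uint64_t)INT64_MAX) {
            return false;
        }
        *out = (int64_t)mag;
    }
    return true;
}
```

The boundary case is the negative branch: a magnitude of exactly 2^63 is accepted only with a negative sign, which is why INT64_MIN needs separate handling from INT64_MAX.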
Add three new micro-ops and update the BINARY_OP_ADD_INT macro to use them, replacing the compact-only path:

- _GUARD_TOS_INT_WIDE / _GUARD_NOS_INT_WIDE: type guards that accept any exact int fitting in int64_t (via _PyLong_CheckExactAndFitsInt64)
- _BINARY_OP_ADD_INT_WIDE: calls _PyCompactLong_AddWide; EXIT_IF on int64 overflow (deopt), ERROR_IF on OOM

The existing _GUARD_TOS_INT / _GUARD_NOS_INT compact guards are kept unchanged; they are still used by BINARY_OP_SUBTRACT_INT, BINARY_OP_MULTIPLY_INT, COMPARE_OP_INT, and all subscr ops.

Regenerate: generated_cases.c.h, executor_cases.c.h, optimizer_cases.c.h, pycore_opcode_metadata.h, pycore_uop_ids.h, pycore_uop_metadata.h, test_cases.c.h
Change the add specialization condition from _PyLong_CheckExactAndCompact to _PyLong_CheckExactAndFitsInt64 so that exact int operands in the full int64 range (not just compact/single-digit values) are specialized to BINARY_OP_ADD_INT. Subtract and multiply retain their compact-only conditions.
BINARY_OP_ADD_INT now specializes for non-compact int64-range operands (e.g. 10_000_000_000). Update the test accordingly:

- Assert BINARY_OP_ADD_INT is used for wide int add
- Keep the assertions that BINARY_OP_SUBTRACT_INT and BINARY_OP_MULTIPLY_INT are not used for non-compact ints

…Exact

Verify that _PyLong_TryAsInt64Exact correctly handles INT64_MIN (abs_val == INT64_MAX + 1 with negative sign) and INT64_MAX, and that values outside the int64 range gracefully fall back to the slow path.
Non-compact (2-digit) int results previously bypassed the freelist and called PyObject_Malloc directly. Add an `ints2` freelist alongside the existing `ints` (1-digit) freelist.

- `long_alloc(2)` checks `ints2` before `PyObject_Malloc`
- `_PyLong_ExactDealloc` and `long_dealloc` recycle exact 2-digit ints to `ints2` instead of immediately freeing them
- `_PyObject_ClearFreeLists` clears `ints2` the same way as `ints`

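The allocate/recycle/clear cycle listed above follows the usual size-class freelist pattern. The sketch below models that pattern in isolation; it is not CPython's freelist implementation, and `int_obj`, `freelist`, `FREELIST_MAX`, `int2_alloc`, `int2_dealloc`, and `freelist_clear` are all hypothetical names, with plain `malloc`/`free` standing in for the object allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FREELIST_MAX 100   /* hypothetical cap on cached objects */

/* Model of a 2-digit integer object. The link field is only meaningful
   while the object sits on the freelist. */
typedef struct int_obj {
    struct int_obj *next_free;
    size_t ndigits;
    unsigned int digits[2];
} int_obj;

typedef struct {
    int_obj *head;
    size_t count;
} freelist;

/* Allocate a 2-digit object, reusing a cached one when available. */
static int_obj *
int2_alloc(freelist *fl)
{
    int_obj *obj = fl->head;
    if (obj != NULL) {
        fl->head = obj->next_free;
        fl->count--;
    }
    else {
        obj = malloc(sizeof(int_obj));
        if (obj == NULL) {
            return NULL;
        }
    }
    obj->next_free = NULL;
    obj->ndigits = 2;
    obj->digits[0] = obj->digits[1] = 0;
    return obj;
}

/* Recycle the object onto the freelist, or free it if the list is full. */
static void
int2_dealloc(freelist *fl, int_obj *obj)
{
    assert(obj != NULL);
    if (fl->count < FREELIST_MAX) {
        obj->next_free = fl->head;
        fl->head = obj;
        fl->count++;
    }
    else {
        free(obj);
    }
}

/* Release every cached object. */
static void
freelist_clear(freelist *fl)
{
    while (fl->head != NULL) {
        int_obj *obj = fl->head;
        fl->head = obj->next_free;
        free(obj);
    }
    fl->count = 0;
}
```

In this model, a deallocation followed by an allocation returns the same block without touching the allocator, which is the saving the commit is after for short-lived 2-digit results.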
Extends the ints2 freelist pattern to 3-digit objects, which cover the range [2^60, 2^63-1] (positive) and [-2^63, -2^60] (negative) on 30-bit builds, including INT64_MAX, INT64_MIN, and nanosecond-precision timestamps. Also fuses the two _PyLong_IsCompact + _PyLong_DigitCount checks in long_dealloc under a single PyLong_CheckExact branch.

Benchmark (5M ops, 30-bit build):

- 2-digit+2-digit -> 3-digit result: 19.6 ns -> 17.0 ns (-13%)
- 3-digit+compact -> 3-digit result: 18.3 ns -> 15.4 ns (-16%)
- INT64_MAX + 0: 18.2 ns -> 15.9 ns (-13%)
- INT64_MIN + 0: 18.1 ns -> 16.2 ns (-10%)

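To see why the int64 range needs a 3-digit size class on 30-bit builds, it helps to count digits for a few magnitudes. The helper below is a hypothetical illustration (`digit_count_30` is not a CPython function) that counts how many 30-bit digits a magnitude occupies, treating zero as having no digits for the purposes of this model.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BITS 30   /* assumption: 30-bit digit build */

/* Number of 30-bit digits needed to hold the magnitude `mag`. */
static int
digit_count_30(uint64_t mag)
{
    int n = 0;
    while (mag != 0) {
        n++;
        mag >>= DIGIT_BITS;
    }
    return n;
}

/* Usage: magnitudes below 2^30 take one digit, below 2^60 take two,
   and everything from 2^60 up to 2^63 takes three. */
static void
check_digit_counts(void)
{
    assert(digit_count_30((UINT64_C(1) << 30) - 1) == 1);
    assert(digit_count_30(UINT64_C(1) << 30) == 2);
    assert(digit_count_30((UINT64_C(1) << 60) - 1) == 2);
    assert(digit_count_30(UINT64_C(1) << 60) == 3);
    assert(digit_count_30((uint64_t)INT64_MAX) == 3);
    assert(digit_count_30(UINT64_C(1) << 63) == 3);
}
```

Under the same assumption, 10_000_000_000 is larger than 2^30 and so occupies two digits, which is why it serves as a non-compact test operand.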
…T-free

- Remove the dead `_BINARY_OP_ADD_INT` micro-op (no longer referenced by the macro); remove its abstract op from optimizer_bytecodes.c.
- Annotate `_GUARD_TOS_INT_WIDE`, `_GUARD_NOS_INT_WIDE`, and `_BINARY_OP_ADD_INT_WIDE` as `tier1`-only so the JIT executor and optimizer generator skip them entirely. The JIT defers to tier 1 for any `BINARY_OP_ADD_INT` trace; no new JIT code paths are introduced.
- Add a compact fast-path to `_PyCompactLong_AddWide` so compact-only int addition retains its original `medium_value` cost and avoids the int64-extraction overhead.
- Use `__builtin_add_overflow` in `_Py_i64_add_overflow` on GCC/Clang (single instruction on x86-64 / ARM64).
- Peel the last loop iteration in `_PyLong_TryAsInt64Exact` to hoist the max-digit overflow-guard out of the inner loop body.

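The overflow check mentioned in the list above could look roughly like the following. This is a sketch rather than the patch's `_Py_i64_add_overflow`: the name `i64_add_overflow` and the portable fallback branch are assumptions, while `__builtin_add_overflow` is the real GCC/Clang builtin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a + b overflows int64_t. *result is only meaningful
   when false is returned. */
static bool
i64_add_overflow(int64_t a, int64_t b, int64_t *result)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Compiler builtin: performs the add and reports overflow. */
    return __builtin_add_overflow(a, b, result);
#else
    /* Portable fallback: test the bounds before adding, so the signed
       addition itself can never overflow. */
    if ((b > 0 && a > INT64_MAX - b) || (b < 0 && a < INT64_MIN - b)) {
        return true;
    }
    *result = a + b;
    return false;
#endif
}

/* Usage: an overflow result is the signal to leave the fast path. */
static void
check_i64_add_overflow(void)
{
    int64_t r = 0;
    assert(!i64_add_overflow(1, 2, &r) && r == 3);
    assert(i64_add_overflow(INT64_MAX, 1, &r));
    assert(i64_add_overflow(INT64_MIN, -1, &r));
}
```

In the design the commits describe, a true return corresponds to the deopt case: the specialized instruction exits and the generic long arithmetic handles the operation.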
Change the subtract specialization condition to accept exact ints in the full int64 range, matching the widened add path while keeping multiply compact-only.
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the
gh-151289: Add a wide int fast path for add/sub
This adds a separate fast path for exact PyLong add/sub operands that fit in signed 64-bit integers, while preserving the existing compact-int specialization.
This keeps the compact-int hot path unchanged and avoids broad opcode churn there, while allowing wide exact ints to bypass the slower generic long arithmetic path.
Performance: representative interpreter-only results with JIT disabled:
Related issue: