gh-151289: Add a wide int fast path for add/sub #151290

Open
KRRT7 wants to merge 17 commits into python:main from KRRT7:wide-int-accel

KRRT7 commented Jun 10, 2026

gh-151289: Add a wide int fast path for add/sub

This adds a separate fast path for add/sub on exact PyLong operands whose values fit in a signed 64-bit integer, while preserving the existing compact-int specialization.

The compact-int hot path stays unchanged, which avoids broad opcode churn there, while wide exact ints (non-compact values that still fit in int64_t) bypass the slower generic long arithmetic path.

Performance: representative interpreter-only results with JIT disabled:

  • add_wide:
  • sub_wide:
  • add_compact/sub_compact:

Related issue:

KRRT7 added 11 commits June 10, 2026 19:10
…declaration

Add inline infrastructure to pycore_long.h for the upcoming wide int addition fast path:

- _PY_LONG_MAX_DIGITS_FOR_INT64: macro for the maximum digit count that can still fit in int64_t (3 on 30-bit builds, 5 on 15-bit)
- _PyLong_FitsInt64(): cheap tag-based check; fast-paths compact and small-digit ints before inspecting the boundary digit
- _PyLong_CheckExactAndFitsInt64(): exact-type + fits-int64 guard for use in specialization guards
- _PyLong_TryAsInt64Exact(): no-exception int64 extraction; special-cases the ndigits==2/30-bit path for the common case
- PyAPI_FUNC declaration for _PyCompactLong_AddWide()
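
As an illustration of the boundary-digit idea behind `_PyLong_FitsInt64()`, here is a minimal standalone sketch. It is not the code from this PR: it assumes 30-bit digits stored least-significant first with no leading zero digit, and `toy_fits_int64` and `TOY_SHIFT` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SHIFT 30  /* assumed digit width (hypothetical name) */

/* Return true if the sign-magnitude value described by `digits`
 * (least-significant digit first, each digit < 2**30, top digit nonzero)
 * fits in int64_t. */
static bool
toy_fits_int64(const uint32_t *digits, size_t ndigits, bool negative)
{
    /* Up to two 30-bit digits hold at most 60 bits: always fits. */
    if (ndigits <= 2) {
        return true;
    }
    /* Four or more digits mean a magnitude of at least 2**90: never fits. */
    if (ndigits > 3) {
        return false;
    }
    /* Boundary case: the top digit carries bits 60 and up, so only its
     * low 3 bits may be set for a magnitude <= 2**63 - 1. */
    const uint32_t top_limit = UINT32_C(1) << (63 - 2 * TOY_SHIFT);  /* 8 */
    uint32_t top = digits[2];
    if (top < top_limit) {
        return true;
    }
    /* The only remaining value that fits is -2**63 (INT64_MIN). */
    return negative && top == top_limit && digits[1] == 0 && digits[0] == 0;
}
```

Only the three-digit case needs to look at digit contents; every other digit count is decided by the count alone.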

Add three new micro-ops and update the BINARY_OP_ADD_INT macro to use them, replacing the compact-only path:

- _GUARD_TOS_INT_WIDE / _GUARD_NOS_INT_WIDE: type guards that accept any exact int fitting in int64_t (via _PyLong_CheckExactAndFitsInt64)
- _BINARY_OP_ADD_INT_WIDE: calls _PyCompactLong_AddWide; EXIT_IF on int64 overflow (deopt), ERROR_IF on OOM

The existing _GUARD_TOS_INT / _GUARD_NOS_INT compact guards are kept unchanged — they are still used by BINARY_OP_SUBTRACT_INT, BINARY_OP_MULTIPLY_INT, COMPARE_OP_INT, and all subscr ops.

Regenerate: generated_cases.c.h, executor_cases.c.h, optimizer_cases.c.h, pycore_opcode_metadata.h, pycore_uop_ids.h, pycore_uop_metadata.h, test_cases.c.h

Change the add specialization condition from _PyLong_CheckExactAndCompact to _PyLong_CheckExactAndFitsInt64 so that exact int operands in the full int64 range (not just compact/single-digit values) are specialized to BINARY_OP_ADD_INT.

Subtract and multiply retain their compact-only conditions.
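
The resulting decision table, as of this commit, can be modelled in a few lines. This is a hedged sketch rather than the specializer's real code: the operand classification and all `toy_*` names are hypothetical, and the real checks operate on PyLong objects.

```c
#include <stdbool.h>

/* Hypothetical classification of an exact int operand. */
typedef enum {
    TOY_INT_COMPACT,  /* compact (single-digit) value */
    TOY_INT_WIDE,     /* not compact, but fits in int64_t */
    TOY_INT_BIG       /* needs the generic long arithmetic path */
} toy_int_kind;

typedef enum { TOY_OP_ADD, TOY_OP_SUBTRACT, TOY_OP_MULTIPLY } toy_binop;

/* Compact operands qualify for every op; wide operands only for add. */
static bool
toy_operand_ok(toy_binop op, toy_int_kind kind)
{
    if (kind == TOY_INT_COMPACT) {
        return true;
    }
    return op == TOY_OP_ADD && kind == TOY_INT_WIDE;
}

/* Both operands must qualify for the int specialization to be chosen. */
static bool
toy_should_specialize(toy_binop op, toy_int_kind lhs, toy_int_kind rhs)
{
    return toy_operand_ok(op, lhs) && toy_operand_ok(op, rhs);
}
```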

BINARY_OP_ADD_INT now specializes for non-compact int64-range operands (e.g. 10_000_000_000). Update the test accordingly:

- Assert BINARY_OP_ADD_INT is used for wide int add
- Keep the assertions that BINARY_OP_SUBTRACT_INT and BINARY_OP_MULTIPLY_INT are not used for non-compact ints

…Exact

Verify that _PyLong_TryAsInt64Exact correctly handles INT64_MIN (abs_val == INT64_MAX + 1 with negative sign), INT64_MAX, and that values outside the int64 range gracefully fall back to the slow path.
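
The INT64_MIN edge case is easiest to see on a sign-plus-magnitude input. The sketch below is an assumption about the general technique, not the body of `_PyLong_TryAsInt64Exact`; `toy_try_as_int64` is a hypothetical helper that takes the magnitude already accumulated into a `uint64_t`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert sign + magnitude to int64_t without raising or overflowing.
 * Returns false when the value is outside the int64 range, in which
 * case the caller would fall back to the slow path. */
static bool
toy_try_as_int64(bool negative, uint64_t abs_val, int64_t *out)
{
    if (!negative) {
        if (abs_val > (uint64_t)INT64_MAX) {
            return false;
        }
        *out = (int64_t)abs_val;
        return true;
    }
    if (abs_val > (uint64_t)INT64_MAX + 1) {
        return false;
    }
    if (abs_val == (uint64_t)INT64_MAX + 1) {
        /* The magnitude 2**63 is not representable as a positive
         * int64_t, so it cannot be negated; return INT64_MIN directly. */
        *out = INT64_MIN;
        return true;
    }
    *out = -(int64_t)abs_val;
    return true;
}
```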

Non-compact (2-digit) int results previously bypassed the freelist and called PyObject_Malloc directly. Add an `ints2` freelist alongside the existing `ints` (1-digit) freelist.

- `long_alloc(2)` checks `ints2` before `PyObject_Malloc`
- `_PyLong_ExactDealloc` and `long_dealloc` recycle exact 2-digit ints to `ints2` instead of immediately freeing them
- `_PyObject_ClearFreeLists` clears `ints2` the same way as `ints`
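
The pattern in the list above (check the per-size freelist on allocation, recycle on deallocation, drain on clear) can be sketched with plain `malloc`/`free`. Everything here is a simplified stand-in: `toy_int`, `toy_freelists`, and the cap `TOY_FREELIST_MAX` are hypothetical and do not mirror CPython's actual freelist structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_FREELIST_MAX 100  /* hypothetical cap on cached blocks */

typedef struct {
    size_t ndigits;
    uint32_t digits[];        /* flexible array member */
} toy_int;

typedef struct {
    void *head;               /* freed blocks, linked through their first word */
    size_t count;
} toy_freelist;

typedef struct {
    toy_freelist ints;        /* 1-digit objects */
    toy_freelist ints2;       /* 2-digit objects */
} toy_freelists;

static toy_freelist *
toy_list_for(toy_freelists *fl, size_t ndigits)
{
    if (ndigits == 1) return &fl->ints;
    if (ndigits == 2) return &fl->ints2;
    return NULL;              /* other sizes are not cached */
}

static toy_int *
toy_int_alloc(toy_freelists *fl, size_t ndigits)
{
    toy_freelist *list = toy_list_for(fl, ndigits);
    toy_int *obj;
    if (list != NULL && list->head != NULL) {
        /* Reuse a cached block instead of calling the allocator. */
        obj = list->head;
        list->head = *(void **)obj;
        list->count--;
    }
    else {
        obj = malloc(sizeof(toy_int) + ndigits * sizeof(uint32_t));
        if (obj == NULL) return NULL;
    }
    obj->ndigits = ndigits;
    return obj;
}

static void
toy_int_free(toy_freelists *fl, toy_int *obj)
{
    toy_freelist *list = toy_list_for(fl, obj->ndigits);
    if (list != NULL && list->count < TOY_FREELIST_MAX) {
        /* Recycle: thread the block onto the matching freelist. */
        *(void **)obj = list->head;
        list->head = obj;
        list->count++;
        return;
    }
    free(obj);
}

static void
toy_freelist_clear(toy_freelist *list)
{
    while (list->head != NULL) {
        void *next = *(void **)list->head;
        free(list->head);
        list->head = next;
    }
    list->count = 0;
}
```

Keeping one list per digit count means a recycled block is always exactly the right size, so allocation never has to search.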

Extends the ints2 freelist pattern to 3-digit objects, which on 30-bit builds cover the int64 values in [2^60, 2^63-1] (positive) and [-2^63, -2^60] (negative), including INT64_MAX, INT64_MIN, and nanosecond-precision timestamps.
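
The digit-count boundaries quoted here follow directly from the 30-bit digit width. A small sketch, assuming 30-bit digits and using the hypothetical helper `toy_digit_count`, makes the ranges easy to check:

```c
#include <stdint.h>

#define TOY_SHIFT 30  /* assumed digit width (hypothetical name) */

/* Number of 30-bit digits needed to hold `magnitude`.
 * In this toy model zero needs 0 digits. */
static int
toy_digit_count(uint64_t magnitude)
{
    int ndigits = 0;
    while (magnitude != 0) {
        ndigits++;
        magnitude >>= TOY_SHIFT;
    }
    return ndigits;
}
```

Under this model, magnitudes below 2^30 need one digit, magnitudes in [2^30, 2^60 - 1] need two, and every int64 magnitude from 2^60 up to 2^63 needs three.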

Also fuses the two _PyLong_IsCompact + _PyLong_DigitCount checks in long_dealloc under a single PyLong_CheckExact branch.

Benchmark (5M ops, 30-bit build):
  2-digit+2-digit -> 3-digit result:  19.6 ns -> 17.0 ns  (-13%)
  3-digit+compact -> 3-digit result:  18.3 ns -> 15.4 ns  (-16%)
  INT64_MAX + 0:                     18.2 ns -> 15.9 ns  (-13%)
  INT64_MIN + 0:                     18.1 ns -> 16.2 ns  (-10%)

…T-free

- Remove the dead `_BINARY_OP_ADD_INT` micro-op (no longer referenced by the macro); remove its abstract op from optimizer_bytecodes.c.
- Annotate `_GUARD_TOS_INT_WIDE`, `_GUARD_NOS_INT_WIDE`, and `_BINARY_OP_ADD_INT_WIDE` as `tier1`-only so the JIT executor and optimizer generator skip them entirely. The JIT defers to tier 1 for any `BINARY_OP_ADD_INT` trace; no new JIT code paths are introduced.
- Add a compact fast-path to `_PyCompactLong_AddWide` so compact-only int addition retains its original `medium_value` cost and avoids the int64-extraction overhead.
- Use `__builtin_add_overflow` in `_Py_i64_add_overflow` on GCC/Clang (single instruction on x86-64 / ARM64).
- Peel the last loop iteration in `_PyLong_TryAsInt64Exact` to hoist the max-digit overflow-guard out of the inner loop body.
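
For reference, an overflow-checked int64 add in the style described for `_Py_i64_add_overflow` might look like the sketch below. The helper name `toy_i64_add_overflow` and the portable fallback branch are assumptions for illustration; only the use of `__builtin_add_overflow` on GCC/Clang comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store a + b in *result and return false, or return true on overflow.
 * Callers should not read *result when true is returned: the builtin
 * stores the wrapped value, the fallback leaves it untouched. */
static inline bool
toy_i64_add_overflow(int64_t a, int64_t b, int64_t *result)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_add_overflow(a, b, result);
#else
    /* Portable fallback: test the bounds before adding, because signed
     * overflow is undefined behaviour in C. */
    if ((b > 0 && a > INT64_MAX - b) || (b < 0 && a < INT64_MIN - b)) {
        return true;
    }
    *result = a + b;
    return false;
#endif
}
```

A caller following the deopt-on-overflow scheme would treat a `true` return as the signal to leave the fast path and fall back to generic long arithmetic.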

Change the subtract specialization condition to accept exact ints in the full int64 range, matching the widened add path while keeping multiply compact-only.

bedevere-app Bot commented Jun 10, 2026

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
