feat: context compaction strategies for the react loop#996
Conversation
|
please add @ramon-astudillo as an observer |
|
(original PR body) SummaryAdds an optional Empirically on the BCP benchmark with Granite 4.1-8b, Backwards compatible: MotivationLong agentic loops — especially retrieval-heavy ones — pile up tool responses. Each react iteration re-sends the full history to the model, so prompt-token cost grows quadratically and the loop can exhaust the model's context window before reaching a final answer. Compaction trims that history, lowering both the dollar cost of inference and the likelihood of hitting the context / timeout wall. On the BrowseCompPlus (BCP) benchmark with Granite 4.1-8b (131K context, 830 questions,
Hardware / infrastructure for the table above:
Measurement was done with a separate BCP eval harness (forthcoming PR); the harness data is included here as motivating evidence — the compaction feature itself has no external runtime dependency. DesignProtocol —
|
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
|
@yelkurdi I updated the PR body so that the update-pr-body check would pass (and copied your original body into a comment). Now that the check is passing, feel free to re-edit to include your original comments in the appropriate section. |
|
relevant related discussion #1099 |
|
Will update according to #1099 |
Component PR ChecklistUse this checklist when adding or modifying components in Protocol Compliance
Content Blocks
Integration
|
planetf1
left a comment
There was a problem hiding this comment.
Thanks for this — the Compactor architecture is really well thought through, and the test coverage in test_compactor.py is thorough. A few things to sort out before this lands, mostly around LLMSummarizeCompactor edge cases and a couple of quick doc fixes.
| def test_context_construction(): | ||
| context_construction(SimpleContext) | ||
| # ChatContext defaults to WindowCompactor(5); a single add stays well under | ||
| # the window so the linked-list shape is identical to the pre-compaction |
There was a problem hiding this comment.
[WARNING] This comment says "ChatContext defaults to WindowCompactor(5)" but the actual default is compactor=None — no compaction at all. test_default_has_no_compactor in test_compactor.py asserts this directly. Looks like a leftover from an earlier design iteration.
There was a problem hiding this comment.
@planetf1 you are correct it is a leftover design iteration, will be corrected.
| directly between turns, e.g. when compaction is exposed to the model as a | ||
| tool. | ||
|
|
||
| See ``docs/rewrite/`` for full usage examples. |
There was a problem hiding this comment.
[WARNING] The module docstring points to docs/rewrite/ for usage examples, but that directory only has session_deepdive/ and streaming/ — nothing compaction-related. The actual examples are in docs/examples/context/. Worth fixing so the breadcrumb is useful.
There was a problem hiding this comment.
good catch, will be fixed to point to docs/examples/context/
| return pool.submit(asyncio.run, coro).result() | ||
|
|
||
|
|
||
| class LLMSummarizeCompactor: |
There was a problem hiding this comment.
[WARNING] Worth calling out in the class docstring that this compactor can't be wired into ChatContext(compactor=...) directly. ChatContext.add() calls compact(new) without a backend, so once the body grows past keep_n every add() raises ValueError. The natural instinct after seeing ChatContext(compactor=WindowCompactor(...)) work is to try the same here, and the error message won't point you at the fix. A one-liner — "use via react(compactor=...) only" — would save a lot of confusion.
There was a problem hiding this comment.
@planetf1 how about adding a hard block, such as limiting the types of compactors ChatContext can directly take e.g. 1. BasicCompactor (does not require backend). 2. ThresholdCompactor (algorithmic compactor wraps LLMSummaryCompactor) ?
There was a problem hiding this comment.
Sure, that's better than a docstring.
There was a problem hiding this comment.
Great, I made the following changes to create a hard block at construction time:
- Added
InlineCompactormarker;ChatContext(compactor=...)rejects non-InlineCompactorinstances with aTypeErrorpointing atreact(compactor=...),ThresholdCompactor, or manualcompact(). WindowCompactorandThresholdCompactorinherit the marker;LLMSummarizeCompactordoes not (would call backend everyadd()).ThresholdCompactor(LLMSummarizeCompactor(...))is the recommended wiring — it gates by token usage, so backend calls fire sparsely rather than per-add().LLMSummarizeCompactornow requiresdefault_backendat construction;compact()falls back to it, with call-timebackend=overriding.
| from mellea.stdlib.context.compactor import Compactor | ||
|
|
||
|
|
||
| class ChatContext(Context): |
There was a problem hiding this comment.
[WARNING] ChatContext(window_size=N) used to keep the full history in as_list() and only apply the window in view_for_generation(). Now as_list() itself is truncated to N items. For the model, nothing changes — but callers inspecting as_list() directly (e.g. session.py:376 uses it for interaction_count) will silently get a capped value. The PR description says "backwards compatible" which is true for the model's view, but not for as_list(). Worth a migration note in the docstring at minimum.
There was a problem hiding this comment.
Added better tracking of interaction counts, required small modifications in session.py
ChatContextdocstring: added a Note flagging thatas_list()now reflects post-compaction state (used to keep full history); points callers at out-of-band turn tracking.MelleaSession: turnedctxinto a property whose setter increments_interaction_count;reset()/__init__bypass via_ctxso lifecycle events don't pollute the count.cleanup()publishesself._interaction_countinstead oflen(self.ctx.as_list()).SessionCleanupPayload.interaction_count: docstring rewritten to "turns committed" semantics (was "items in context").
| if len(body) <= self.keep_n: | ||
| return ctx | ||
|
|
||
| return _run_coro_blocking(self._async_compact(ctx, backend)) |
There was a problem hiding this comment.
[WARNING] If the backend call raises here (rate limit, network error, timeout), the exception propagates through _run_coro_blocking and kills the entire react loop. For a long-running research task, that's quite painful. Since compaction is best-effort by nature, wrapping this in a try/except, logging a warning, and returning ctx unchanged would be much more robust.
There was a problem hiding this comment.
Done, compaction is wrapped in try/except and behaves as best effort by returning ctx and logs warning when backend fails, enabling the react loop to continue.
| ) | ||
|
|
||
|
|
||
| def _run_coro_blocking(coro): # type: ignore[no-untyped-def] |
There was a problem hiding this comment.
[WARNING] When called from inside react() (which is async), this blocks the event loop thread for the full duration of the summary LLM call. Nothing else on the loop — telemetry flushers, cancellation signals, other sessions — can make progress during that time. It's documented as "fine for a serial ReACT loop", but worth noting that backends using per-loop resources (e.g. httpx.AsyncClient) may behave unexpectedly. A stronger docstring warning here, and ideally an async variant on the protocol long-term, would be the right direction.
There was a problem hiding this comment.
Done, beefed-up the warning message as you indicated.
| elif isinstance(c, Message): | ||
| lines.append(f"{c.role}: {c.content}") | ||
| elif isinstance(c, ModelOutputThunk): | ||
| lines.append(f"assistant: {c.value}") |
There was a problem hiding this comment.
[WARNING] The rendering loop only handles c.content and c.value. Images, attached documents, and tool-call-only thunks (where value is None) are silently dropped — the latter render as "assistant: None". Users running multimodal or tool-heavy sessions will get incomplete summaries without any warning. Guarding with c.value or '' would at least fix the None case, and a docstring note about lossy summarisation for multimodal content would set expectations.
There was a problem hiding this comment.
Augmented the block with the following:
Message: append"[N image(s) attached]"/"[M document(s) attached]"markers; bytes not reproduced.ModelOutputThunk: render"assistant called tools: name({args}), ..."for tool-call-only thunks (value=None); empty thunks (no value, no tool calls) are skipped entirely. Eliminates"assistant: None".- Catch-all
else:"<TypeName: content>"or"<TypeName>"instead of the default object repr (e.g.ReactInitiatorwhen not pinned).
| lines.append(str(getattr(c, "content", c))) | ||
|
|
||
| prompt = self.prompt_template.format(conversation="\n".join(lines)) | ||
| result, _ = await mfuncs.aact( |
There was a problem hiding this comment.
[WARNING] The two equivalent mfuncs.aact() calls in react.py (lines 93 and 123) both pass silence_context_type_warning=True for internal framework calls. Without it here, every compaction fires a context-type-mismatch warning into the logs — potentially quite noisy with ThresholdCompactor triggering regularly in a long session.
There was a problem hiding this comment.
added silence_context_type_warning=True parameter
| a generic conversation-summary template. | ||
| """ | ||
|
|
||
| def __init__( |
There was a problem hiding this comment.
[WARNING] There's no model_options parameter here, so the summary call uses the backend's default max_tokens. On many local backends that's 256–512, which will silently truncate a summary of a long conversation. react_summary_prompt(max_tokens_hint=N) adds a text nudge to the prompt but doesn't set the actual API parameter. Adding model_options: dict | None = None to the constructor and forwarding it to mfuncs.aact would let users enforce a real token budget.
There was a problem hiding this comment.
model_options added, it is a good idea, I have been actually using it in my separate BCP eval harness.
|
@planetf1 I have incorporated your feedback, I've also ran the tests and the full BCP deep search task tests (3 runs of each compaction method), all checks out. |
planetf1
left a comment
There was a problem hiding this comment.
Second pass after the latest fixes — most of the earlier thread was addressed cleanly, thanks. A few new things from this pass, plus one CI-blocker in the docs example.
Heads up — PR is currently in DIRTY merge state (conflicts against main). Worth a rebase before final review.
Findings inline; one separate top-level note on _run_coro_blocking as a follow-up suggestion (not reopening the resolved thread — your docstring update there is solid).
| from mellea.stdlib.context.chat import _rebuild_chat_context | ||
|
|
||
|
|
||
| class TruncateOldest: |
There was a problem hiding this comment.
Hey — I think this example actually crashes when pattern_1_wired_into_context() runs: TruncateOldest doesn't inherit InlineCompactor, and the new guard in chat.py:52-58 raises TypeError on construction. The # pytest: unit annotation means it'll show up in the default CI run too, so worth catching before merge.
One-line fix:
from mellea.stdlib.context.compactor import InlineCompactor
class TruncateOldest(InlineCompactor):
...Totally tangential thought, take it or leave it — the example imports _rebuild_chat_context and reads ctx._compactor, which is what WindowCompactor does internally too, so it's an honest reflection of reality. Just wondering whether it's worth promoting _rebuild_chat_context to public (rebuild_chat_context) so the "look how easy it is to write your own" framing holds together. Happy to file a follow-up if that's out of scope here.
There was a problem hiding this comment.
The example is fixed, docstring also elaborated, detailed the two patterns, pattern1 for simple custom InlineCompactor and pattern2 for general custom compactors.
I agree with you that _rebuild_chat_context would be promoted to public (rebuild_chat_context), however I would recommend taking that as a separate PR since it was that way in the original context.py file.
|
|
||
| try: | ||
| return _run_coro_blocking(self._async_compact(ctx, backend)) | ||
| except Exception as exc: |
There was a problem hiding this comment.
Small thing on the except Exception here — it'll catch the network/rate-limit cases the docstring mentions, but it'll also quietly swallow TypeError/AttributeError from genuine bugs (e.g. a pin_predicate returning the wrong type, or a component shape change). The user sees the same "warning logged, context unchanged" outcome either way, which can hide a real contract violation for a long time.
Could be worth either narrowing to the actual transient classes, or keeping the broad catch but re-raising on (TypeError, AttributeError, AssertionError, KeyError) after logging. Either way the "best-effort, retries next call" promise still holds for the cases that promise actually targets.
There was a problem hiding this comment.
Added the narrower re-raising on programming-error classes, also docstring updated
| ) | ||
|
|
||
| summary_message = Message( | ||
| role="user", content=f"[CONTEXT SUMMARY]\n{result.value or ''}" |
There was a problem hiding this comment.
Quick question on the role choice for the summary message — was role="user" deliberate? The content's coming back from the backend, but on the next turn the chat template will render it as "user: [CONTEXT SUMMARY] …" which feels like a small fib to the model.
I noticed react.py does the same thing for goal restatement, so it might just be the established convention here, in which case ignore me. Asking partly because I've seen smaller chat-tuned models (granite/llama 3B–8B range) occasionally treat a summary-under-user-role as a fresh instruction. role="system" or a marker on the next genuine user turn would dodge that — but only worth changing if the convention isn't already settled.
There was a problem hiding this comment.
Good catch — you're right that the role doesn't match the source. role="user" is deliberate though: the summary functions as user-supplied context for the next turn, same pattern as react.py's goal restatement. The [CONTEXT SUMMARY] content prefix is meant to be the disambiguator.
If you've seen smaller models conflate it with a fresh instruction in practice, I'd take that seriously — happy to sharpen the prefix (e.g., [CONTEXT SUMMARY — auto-generated, not a new user turn]) if that would help.
On extensibility: would you prefer exposed params (summary_prefix, summary_role, etc.) on LLMSummarizeCompactor, or is subclassing / wrapping with a custom compactor the right escape hatch? Leaning toward the latter to keep the default surface small, but happy either way.
There was a problem hiding this comment.
Checked the code and I think this is fine — consistent with how the goal is surfaced as a user turn in react.py, and role="user" is the most portable choice across chat templates. Happy to leave it as-is.
| strategy, or ``window_size=`` as sugar for ``WindowCompactor(size=...)``. | ||
|
|
||
| Note: | ||
| Compaction is now applied at ``add()`` time and persists in the linked |
There was a problem hiding this comment.
Tiny doc nit, no rush: "Compaction is now applied at add() time…" reads like a release-note for me at the moment, but in six months "now" loses its anchor and just confuses a fresh reader. Could just be "Compaction is applied at add() time and persists in the linked list…" — same information, ages better.
|
Following up on your docstring update on Have you considered routing the coroutine through The benefit is mostly for the sync-from-outside-loop case (someone calling Totally fine as a follow-up issue if you'd rather not expand scope here. |
Adds CompactionStrategy abstraction and KeepLastN implementation to mellea/stdlib/compaction.py, wires an optional compaction parameter into the react() loop, and adds full test coverage in test/stdlib/test_compaction.py. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Switches `CompactionStrategy.threshold` from a component-count trigger to a token-count trigger, read from the most recent `ModelOutputThunk.usage` populated by the backend. This aligns compaction with the real constraint (context size) and sidesteps per-backend tokenizer dependencies by using provider-reported usage; the trade-off is a one-turn lag since usage is recorded at the end of each model call. Also reorders the react loop so compaction runs after the final-answer check, skipping wasted work (and a wasted LLM call for LLMSummarize) on terminal turns. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Signed-off-by: ramon-astudillo <ramon@astudillo.com>
Move the compaction strategies alongside the react framework they serve: - mellea/stdlib/compaction.py -> mellea/stdlib/frameworks/react_compaction.py - test/stdlib/test_compaction.py -> test/stdlib/frameworks/test_react_compaction.py Imports and module docstrings updated accordingly. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
The docstring quality gate (tooling/docs-autogen/audit_coverage.py --quality --threshold 100) requires each documented symbol to have its own Args/Returns sections — inheritance from the abstract parent is not consulted. Six issues were reported against the compact() overrides on ClearAll, KeepLastN, and LLMSummarize. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Replaces the original async ``react_compaction`` strategies (ClearAll,
KeepLastN, LLMSummarize) with a generic, sync ``Compactor`` protocol
that operates on any ``Context``. ``ReACT`` and ``ChatContext`` are
rewired around the new protocol; sample callers, tests, and docs are
updated.
Squash of 29 Mellea-side commits from
context_compaction_for_react_2; the BCP eval harness commits in that
branch are intentionally excluded.
mellea/stdlib/context/ becomes a package
- Compactor protocol: sync ``compact(ctx, *, backend=None) -> Context``
- WindowCompactor(size, pin_predicate)
keep last-N body components; ``size=0`` clears the body and
retains only the pinned prefix
- ThresholdCompactor(inner, threshold)
token-gated wrapper that reads cumulative context size from the
most recent ModelOutputThunk's ``generation.usage`` and forwards
to ``inner.compact`` only above the gate
- LLMSummarizeCompactor(keep_n, pin_predicate, prompt_template)
summarizes old body components via the backend; the (async)
backend call is hidden behind a sync ``compact()`` via
``_run_coro_blocking`` so the protocol stays sync
- PinPredicate API: ``pin_nothing``, ``pin_system``,
``pin_system_and_initial_user``; chat compactors compose freely
mellea/stdlib/frameworks/react.py
- ``react()`` gains a ``compactor: Compactor | None = None`` per-turn
hook; invoked once after each tool observation
- The old ``react_compaction`` module is removed
mellea/stdlib/components/react.py
- ``pin_react_initiator``: a PinPredicate that pins everything up to
and including the first ``ReactInitiator``
- ``react_summary_prompt(goal=None, max_tokens_hint=None)``: factory
that returns a research-flavoured summary prompt template (with the
{conversation} placeholder LLMSummarizeCompactor expects). Optional
``GOAL: <goal>`` line and optional ``- Be at most ~N tokens`` bullet
when callers want goal anchoring or length-cap hints.
mellea/stdlib/context/chat.py
- ``ChatContext()`` defaults to no compactor (full history); pass
``compactor=`` or ``window_size=`` for opt-in compaction. Matches
upstream main's window_size=None unbounded semantics.
Test coverage
- test/stdlib/test_compactor.py (~500 LOC): protocol semantics;
Window / Threshold / LLMSummarize behaviours; pin-predicate edge
cases; ``size=0`` collapse; threshold gate edge cases
- test/stdlib/frameworks/test_react_framework.py (~210 LOC):
react() per-turn hook integration + react_summary_prompt
(default, goal interpolation, brace escaping, max_tokens_hint
bullet ordering, LLMSummarizeCompactor template-validation)
- test/stdlib/test_base_context.py: pin-non-compacting ChatContext
in the session-copy operations test (matches new opt-in default)
Net diff: 17 files, +381 / -896 lines (drops the old
react_compaction.py and its dedicated test file).
Backwards-compatible default behaviour preserved: bare ``ChatContext()``
retains full history; ``react()`` without ``compactor=`` behaves
identically to today; ``LLMSummarizeCompactor`` defaults to a generic
conversation-summary prompt unless callers opt in to the
research-flavoured variant via ``react_summary_prompt``.
Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…ples - Correct comment in test_base_context.py: ChatContext defaults to compactor=None, not WindowCompactor(5). - Point compactor.py module docstring to docs/examples/context/ instead of the nonexistent docs/rewrite/. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…n LLMSummarize Restrict ChatContext(compactor=...) to compactors that inherit InlineCompactor. Wiring a backend-requiring compactor (e.g. LLMSummarizeCompactor) directly would invoke the backend on every add(); the new isinstance guard rejects that with a TypeError pointing at react(compactor=...), ThresholdCompactor, or manual compact() as alternatives. ThresholdCompactor remains accepted regardless of its inner -- it gates by token usage, so backend calls are sparse rather than per-add. LLMSummarizeCompactor now takes a required default_backend at construction and falls back to it when compact() is invoked without an explicit backend. A backend kwarg passed to compact() still overrides the default for that call. This makes ThresholdCompactor(LLMSummarizeCompactor(...)) work end-to-end when attached to ChatContext: at trip time, the inner uses its stored default_backend. InlineCompactor carries the compact() signature (raising NotImplementedError) so it's a usable static type without cast() workarounds. Specialized to ChatContext rather than parameterized over Context -- ThresholdCompactor's prior generic-T signature was unexercised, so the simpler shape applies. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…antic shift
Reviewer flagged that ChatContext(window_size=N) used to keep full history
in as_list() and only window view_for_generation(); the per-turn compactor
work made as_list() itself reflect the post-compaction state, which silently
undercounted in MelleaSession.cleanup() (interaction_count = len(as_list())).
- ChatContext docstring: add a Note describing the semantic shift and
pointing callers at out-of-band turn tracking when full counts matter.
- MelleaSession: turn ctx into a property; the setter increments
_interaction_count, and reset() / __init__ bypass via _ctx so lifecycle
events don't pollute the count. cleanup() now publishes
self._interaction_count, stable under any compaction strategy.
- SessionCleanupPayload: rewrite the interaction_count field doc to
match the new semantics ("turns committed" rather than "items in
context").
Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged that an exception from the summarisation backend call (rate limit, network error, timeout) propagates through _run_coro_blocking and kills the entire react loop. For long-running research tasks that's quite painful, especially since compaction is best-effort by nature. LLMSummarizeCompactor.compact() now wraps _run_coro_blocking in a try/except Exception. On failure it logs a WARNING via MelleaLogger with the exception type and message, then returns ctx unchanged. The next compact() invocation retries; the conversation keeps growing in the meantime. BaseException (KeyboardInterrupt, SystemExit) still propagates so users can interrupt a stuck loop. Added a regression test with a backend that raises RuntimeError on every call: compact() returns the same ctx, original history is intact, and a warning naming the exception type is logged. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged that _run_coro_blocking, when invoked from inside an async caller like react(), blocks the entire event loop for the full duration of the wrapped coroutine — not just the calling task. The previous "fine for a serial ReACT loop" wording undersold the implications. Beefed-up Warning: block now spells out: - The loop, not just the thread, is stalled — callbacks, telemetry, cancellation signals, other sessions sharing the loop, keepalives are all blocked. - Backends with per-loop resources (notably httpx.AsyncClient) may behave unexpectedly because the coroutine runs on a fresh loop in a worker thread; documents the typical failure signatures. - Long-term direction is an async variant on the Compactor protocol so callers can await natively. Docs-only; no behavior change. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged silent drops when rendering the slice fed to the
summariser:
- Message: append "[N image(s) attached]" / "[M document(s) attached]"
markers; bytes not reproduced.
- ModelOutputThunk: render "assistant called tools: name({args}), ..."
for tool-call-only thunks (value=None) and "assistant: <empty>" for
empty thunks. Eliminates "assistant: None".
- Catch-all else: "<TypeName: content>" or "<TypeName>" instead of the
default object repr (e.g. ReactInitiator when not pinned).
Docstring gains a Note: that summaries are text-only and lossy for
multimodal / heavy-tool sessions. Tests cover each branch.
Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…ng "<empty>" A "<empty>" marker for thunks with neither value nor tool_calls tended to leak into the resulting summary verbatim. These turns carry no information worth summarising, so drop them from the rendered slice entirely. Test updated to assert no line is emitted for an empty thunk while neighbouring turns still come through. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…call Reviewer flagged that aact's context-type warning could be noisy under ThresholdCompactor-driven repeated compaction. Match react.py's pattern of setting silence_context_type_warning=True on internal framework calls so the warning stays quiet if the context argument is later changed to a non-SimpleContext, and to self-document the intent. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged that the summary call uses the backend's default max_tokens (often 256-512 on local backends), silently truncating long summaries. react_summary_prompt(max_tokens_hint=N) is only a soft prompt- side nudge, not a real API parameter. Add model_options: dict | None to the constructor and forward it to mfuncs.aact so callers can set a hard token budget (or any other backend option). Default None preserves existing behaviour. Tests cover both forwarded and default paths. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
The body slice was only consumed by len() before being discarded (_async_compact recomputes its own body). Replace with body_len arithmetic, matching the existing pattern in WindowCompactor.compact. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
LLMSummarizeCompactor.compact() and _async_compact were independently calling ctx.as_list() and self.pin_predicate(full). Thread the values through instead so the async core skips the duplicate work. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Replace the while-with-index pattern in pin_system() with a cleaner enumerate-based early-return loop. Behavior unchanged. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
planetf1
left a comment
There was a problem hiding this comment.
Second-pass review — one concrete fix needed (TruncateOldest prefix), plus a few nits.
| # Per-turn compaction hook (terminal turns skip this since `is_final` returned). | ||
| if compactor is not None: | ||
| context = compactor.compact(context, backend=backend) | ||
|
|
There was a problem hiding this comment.
Since compact's return type is T bound to Context, a custom compactor that returns e.g. SimpleContext type-checks fine but then breaks the ChatContext assumptions the next aact call relies on. Worth noting in the compactor= docstring that the compactor must return a ChatContext.
There was a problem hiding this comment.
Good point — added a paragraph to the compactor= docstring noting that the Compactor protocol is generic in T bound to Context, so a compactor returning a non-chat Context (e.g. SimpleContext) type-checks fine but breaks the ChatContext assumptions the next aact call relies on. Custom compactors used with react must return a ChatContext.
| items = ctx.as_list() | ||
| if len(items) <= 1: | ||
| return ctx | ||
| return _rebuild_chat_context(items[1:], compactor=ctx._compactor) |
There was a problem hiding this comment.
One caveat worth flagging here: items[1:] drops element 0 unconditionally, so a leading system message gets evicted on the first add(). That's fine for this example since TruncateOldest is deliberately the smallest possible compactor, but it's worth a one-line note pointing readers at WindowCompactor / pin_system if they want the pinned-system variant rather than reaching for this as a starting point.
There was a problem hiding this comment.
Good catch — added a Note to TruncateOldest's docstring spelling out that items[1:] evicts a leading system message on the first compaction, with a pointer to WindowCompactor + pin_system for the keep-the-system-prompt variant. Kept the body of the example as-is since the smallest-possible shape is the point.
| mutate the input context; they must return a fresh instance and copy over | ||
| any data that should be preserved. | ||
|
|
||
| The protocol is generic in `T` (a `Context` subtype) so concrete |
There was a problem hiding this comment.
The phrase "a chat-only compactor declares T = ChatContext" (line 120) is the misleading bit — there's no T = declaration anywhere. T is a per-call TypeVar on compact(), so a concrete chat compactor just annotates the override as compact(self, ctx: ChatContext) -> ChatContext. Could we reword that sentence to describe the override annotation rather than a class-level T assignment?
There was a problem hiding this comment.
Reworded. T is a per-call TypeVar on compact(), so the docstring now describes the override annotation directly: a chat-only compactor overrides as def compact(self, ctx: ChatContext, *, backend=None) -> ChatContext. No misleading "T = ChatContext" framing.
| self, ctx: ChatContext, *, backend: Backend | None = None | ||
| ) -> ChatContext: | ||
| """Subclasses must override this with their concrete strategy.""" | ||
| raise NotImplementedError("InlineCompactor subclasses must implement compact()") |
There was a problem hiding this comment.
Nit: compact raises NotImplementedError but the docstring has no Raises: entry — minor gap against the project convention.
There was a problem hiding this comment.
Added a Raises: NotImplementedError entry to InlineCompactor.compact()'s docstring noting that the marker base always raises and subclasses are expected to override.
The Compactor protocol is generic in T bound to Context, so a custom compactor returning a non-chat Context (e.g. SimpleContext) type-checks at the react() call site but breaks the ChatContext assumptions the next aact call relies on. Document the constraint on the compactor= parameter so callers don't get surprised at runtime. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer pointed out that items[1:] drops index 0 unconditionally, so a leading system message gets evicted on first compaction. Add a Note to the class docstring flagging the caveat and pointing readers at WindowCompactor + pin_system for the keep-the-system-prompt variant. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer pointed out there's no class-level T = declaration — T is a per-call TypeVar on compact(). Reword to describe the override annotation directly: chat-only compactors override compact() with a ChatContext signature. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged the docstring gap: compact() raises NotImplementedError unconditionally on the marker base class, but the docstring had no Raises: section. Add one noting the always-raises behavior and that subclasses are expected to override. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
The build-and-validate CI job's --fail-on-quality threshold flagged 8 issues across pin_nothing, pin_system, pin_system_and_initial_user, and InlineCompactor.compact (missing Args + Returns). Add the sections. Also collapse a multi-line _run_coro_blocking call back to one line so ruff format passes. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
|
Thanks for addressing or replying to everything. I'll accept your explanation for the event loop blocking concern, but I think we need a follow-up issue and need Nathan's approval. |
|
Fix looks good, but can we rebase to current code and ensure no conflict issues. |
There was a problem hiding this comment.
Comparing against the summary of our previous discussion in #1099, in my opinion the PR conforms to that design. There's a couple of things we need to do before we can merge:
- rebase / resolve the merge conflicts
- update the PR description to match the new design
Resolve conflicts in session cleanup payload: keep out-of-band interaction_count tracking (turns committed) from this branch, and adopt session_id + exception fields added upstream. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
|
DCO check is failing on this commit: 1e28704
@ramon-astudillo I think this is yours. |
|
This commit will get squashed anyway, but let me check with @yelkurdi |
|
@psschwei I resolved the conflicts and now I'm in the process of updating the PR description. |
5b01e1e to
6daa892
Compare
|
PR description updated. |
| mot._generate_log = GenerateLog(is_final_result=True) | ||
| return mot, ctx.add(action).add(mot) | ||
|
|
||
| async def generate_from_raw( |
There was a problem hiding this comment.
The Backend ABC was updated in #1264 — the abstract method is now _generate_from_raw (leading underscore); the public generate_from_raw is a concrete wrapper that fires hooks. These fakes override the wrapper but leave the abstract method unimplemented, so FakeBackend() raises TypeError: Can't instantiate abstract class at test setup and the whole TestLLMSummarizeCompactor class errors out. Same issue in OtherBackend, BrokenBackend, BuggyBackend, InterruptingBackend, and ScriptedBackend in react_compaction.py — a rename to _generate_from_raw in all six fixes it.
There was a problem hiding this comment.
thanks for catching that, corrected.
|
|
||
| # Per-turn compaction hook (terminal turns skip this since `is_final` returned). | ||
| if compactor is not None: | ||
| context = compactor.compact(context, backend=backend) |
There was a problem hiding this comment.
The aact calls a few lines up both assert isinstance(next_context, ChatContext) to catch a wrong return type early — it'd be worth the same guard here, since the compact() docstring notes a non-ChatContext return would break the loop but nothing currently enforces it:
if compactor is not None:
context = compactor.compact(context, backend=backend)
assert isinstance(context, ChatContext), (
"react() compactor must return a ChatContext"
)
planetf1
left a comment
There was a problem hiding this comment.
Third pass — the design work here is solid and the earlier findings are all cleaned up. One thing that needs fixing before merge: six fake Backend subclasses (five in test_compactor.py, one in react_compaction.py) override generate_from_raw instead of the abstract _generate_from_raw that #1264 introduced. The whole TestLLMSummarizeCompactor class errors at fixture setup as a result — details in the inline comment. A rename in each fake unblocks it.
CI is now running and the failing quality check is the active gate — no need for a separate review block on top of it.
The Backend ABC's abstract method was renamed to _generate_from_raw (leading underscore) in generative-computing#1264; the public generate_from_raw is now a concrete hook-firing wrapper. The fakes in test_compactor.py overrode the wrapper but left the abstract method unimplemented, so each backend raised TypeError on instantiation. Rename generate_from_raw -> _generate_from_raw in FakeBackend, OtherBackend, BrokenBackend, BuggyBackend, and InterruptingBackend. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Mirror the guard on the aact next_context returns: the LLMSummarizeCompactor docstring notes a non-ChatContext return would break the loop, but nothing enforced it. Add an isinstance assert after compact() to fail fast. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

Component PR
Use this template when adding or modifying components in
mellea/stdlib/components/.Description
Implementation Checklist
Protocol Compliance
parts()returns list of constituent parts (Components or CBlocks)format_for_llm()returns TemplateRepresentation or string_parse(computed: ModelOutputThunk)parses model output correctly into the specified Component return typeContent Blocks
Integration
mellea/stdlib/components/__init__.pyor, if you are adding a library of components, from your sub-moduleTesting
tests/components/Attribution
Summary
Adds a generic, sync
Compactorprotocol tomellea.stdlib.contextfor shrinking a runningContext, with three implementations —WindowCompactor,ThresholdCompactor,LLMSummarizeCompactor— plus composable pin predicates that decide which leading components survive. ReACT gains an optionalcompactor=parameter;ChatContextgainscompactor=/window_size=for per-append compaction.Empirically on the BCP benchmark with Granite 4.1-8b, LLM-summary compaction cuts inference cost by 23.7% and raises accuracy by 3.5 pp — a dual win, not a cost/quality trade-off. Backwards compatible: the defaults preserve existing behavior exactly.
Motivation
Long agentic loops pile up tool responses. Each react iteration re-sends the full history, so prompt-token cost grows quadratically and the loop can exhaust the context window before finishing. Compaction trims that history, cutting both dollar cost and the chance of hitting the context / timeout wall.
On BrowseCompPlus (BCP) with Granite 4.1-8b (131K context, 830 questions,
loop_budget=400,per_question_timeout_s=1800, mean of 3 runs/strategy):noneclear_allllm_summarizellm_summarizeis the clear winner — cost −23.7% and accuracy +3.5 pp (+23% relative) at similar wall-clock, since the smaller prompt makes each iteration faster.clear_allalso helps (−16.8% cost, +2.9 pp) but its +35% wall-clock is less attractive when the budget allows the extra summarization call.Benchmark labels are the external eval-harness strategy names:
clear_all→WindowCompactor(size=0, pin_predicate=pin_react_initiator),llm_summarize→LLMSummarizeCompactor(pin_predicate=pin_react_initiator), both gated viaThresholdCompactor.Hardware / infrastructure
ibm-granite/granite-4.1-8b(bf16, native 131K context)--enable-prefix-cachingon. GPU 0 reserved for the BCP local search service.OpenAIBackend→ local vLLM,concurrency=56,loop_budget=400,per_question_timeout_s=1800,--compaction_threshold 50000(~38% of 131K),--compaction_keep_n 5.openai/gpt-oss-120bon the same node after agent teardown.Measured with a separate BCP eval harness running on Mellea's react framework; the feature itself has no external runtime dependency.
Design
A
Compactorreturns a fresh, compacted copy of a context (never mutates the input). Sync and generic inT(aContextsubtype), so a chat compactor narrows toChatContext -> ChatContext. Backend-calling compactors hide the async work behind the sync method. Two usage patterns: per-append (ChatContext(compactor=...), runs on everyadd()) and per-turn / manual (react(compactor=...)or a directcompact(ctx, backend=...)between turns).InlineCompactormarker —ChatContext.add()callscompact()without a backend on every append, soChatContext(compactor=...)accepts onlyInlineCompactorinstances (structuralCompactorsatisfaction isn't enough), steering backend-calling compactors toreact(compactor=...),ThresholdCompactor, or manualcompact().Pin predicates — a
PinPredicatereturns the index after the protected prefix (components[:idx]preserved,components[idx:]is the body):pin_nothing,pin_system,pin_system_and_initial_user, andpin_react_initiator(keeps the react goal + tool registration).Compactors:
WindowCompactor(size, pin_predicate=pin_system)(inline) — keep prefix + lastsizebody components;size=0drops the body ("clear all").ThresholdCompactor(inner, threshold)(inline) — forwards toinner.compactonly when measured token size exceedsthreshold; the composable token-gate.LLMSummarizeCompactor(default_backend, keep_n=5, pin_predicate=pin_nothing, ...)— summarize older body into one[CONTEXT SUMMARY]message, keep lastkeep_nverbatim. Not inline (calls the backend each time); use viareact/ThresholdCompactor/manual.Token size (
_last_usage_tokens) is read from the most recentModelOutputThunk.generation.usagetotal_tokens. Helperreact_summary_prompt(goal, max_tokens_hint)builds a research-flavoured template forLLMSummarizeCompactor.Integration — in
react, compaction runs once per turn after theis_finalcheck so terminal turns skip it (saves a wasted summary call). Custom compactors used withreactmust return aChatContext.Non-goals
No
mellea/core/changes. The single-modulemellea/stdlib/context.pyis refactored into amellea/stdlib/context/package (chat.py,simple.py,compactor.py); public imports frommellea.stdlib.contextare unchanged.Test plan
uv run pytest test/stdlib/test_compactor.py— 62 tests, ~12 s, no external servicesuv run pytest test/stdlib/frameworks/test_react_framework.py—compactor=integrationCovers each compactor's output/prefix preservation, all pin predicates, threshold gating (below/at/above,
<=0disables, no-thunk-yet), window sizing incl.size=0and thewindow_size=sugar, theInlineCompactormarker check, andLLMSummarizeCompactorrendering + best-effort error handling (backend failure → unchanged; programming errors propagate). Runnable examples indocs/examples/context/.Pitfalls
generation.usage→ token-gated compaction silently no-ops. All mainline backends populate it (AGENTS.md §5); a new backend without it disablesThresholdCompactor.LLMSummarizeCompactor(#1258).compact()is sync but must call an async backend, so it bridges through_run_coro_blocking(a worker-threadasyncio.run). Two known failure modes: (a) inside a running loop (e.g.react()) the worker blocks the calling thread, stalling the outer loop's other tasks — callbacks, telemetry flushers, cancellation — until the summary finishes; (b) a backend whose async client (e.g.httpx.AsyncClient) is bound to mellea's shared loop can't be used from the fresh worker loop, surfacing asRuntimeError: This event loop is already runningor a hung request. Both are flagged in the docstrings but can't be fixed without a protocol change. The resolution tracked in AsyncCompactorprotocol to replace sync↔async bridge inLLMSummarizeCompactor#1258 is an asyncacompact()on theCompactorprotocol thatreact()awaits directly — no bridge, no loop blocking, no cross-loop client hazard.is_finalcheck — easy to swap in a refactor; guarded by a comment, ideally a placement test later.Files