feat: model aware sliding window context by AngeloDanducci · Pull Request #1270 · generative-computing/mellea

AngeloDanducci · 2026-06-15T21:54:34Z

Pull Request

Issue

Fixes #108

Description

Adds a model-aware sliding context window, as requested in the linked issue.

Note: HF_SMOLLM3_3B_no_ollama: ollama_name="" was changed to ollama_name=None — this is safe because _build_table already guards if name: before indexing, so the empty string was silently excluded from the lookup table anyway.
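The guard mentioned in the note can be illustrated with a minimal sketch. `FakeModelId` and `build_table` below are hypothetical stand-ins, not the real `ModelIdentifier` or `_build_table` from mellea; they only show why an empty string and `None` behave the same under an `if name:` check.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FakeModelId:
    """Hypothetical stand-in for a model identifier with optional names."""

    hf_model_name: str | None = None
    ollama_name: str | None = None
    context_length: int | None = None


def build_table(models: list[FakeModelId]) -> dict[str, int]:
    """Map every truthy name to its context length, skipping falsy names."""
    table: dict[str, int] = {}
    for model in models:
        if model.context_length is None:
            continue
        for name in (model.hf_model_name, model.ollama_name):
            if name:  # both "" and None are skipped here
                table[name] = model.context_length
    return table


with_empty = FakeModelId("org/model", "", 4096)
with_none = FakeModelId("org/model", None, 4096)
same_table = build_table([with_empty]) == build_table([with_none])
```

Under these assumptions the two identifiers produce identical lookup tables, which is the sense in which the change is safe.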

Behavioral change: reset() now preserves model_id, window_size, and token_context_length_limit on the new context. Previously it returned a bare ChatContext() with no configuration. Callers that relied on a config-free context after reset will need to set those fields explicitly if needed.

Testing

Tests added to the respective file if code was changed
New code has 100% coverage if code was added
Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

AI coding assistants used

Adding a new component, requirement, sampling strategy, or tool?

If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.

Component
Requirement
Sampling Strategy
Tool

NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.

ajbozarth

Some feedback from Claude.

Blocking — feature does not implement token-aware truncation

ModelIdentifier.context_length is in tokens. ChatContext.window_size is a count of context items (CBlocks/Components). view_for_generation() passes the token count straight into as_list(last_n_components=...) as an item count.

The new docstring at context.py lines 34–42 candidly admits the ceiling is "never reached" because real conversations don't accumulate 131,072 items. So for the default path (no explicit window_size), this PR is a no-op — full history is always returned, just like before.

#108 asks for a sliding window that moves when the token budget would be exceeded. To do that the implementation needs to estimate per-item token counts (likely via the bound backend's tokenizer) and walk history popping oldest items until the running sum fits under context_length.

Two ways forward, either is fine:

Implement the token-aware truncation here.
Re-scope this PR to "context-length metadata + binding hook," explicitly defer truncation to a follow-up issue, and update the title / docstring so it doesn't read as if truncation already happens.
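The truncation the review asks for can be sketched as follows. This is a minimal illustration, not the PR's implementation: `fit_to_budget` is a hypothetical helper, history items are plain strings, and the per-item cost uses a rough 4-characters-per-token estimate in place of the backend tokenizer the review suggests.

```python
def fit_to_budget(items: list[str], token_budget: int) -> list[str]:
    """Keep the newest items whose estimated cost fits within token_budget."""
    kept: list[str] = []
    spent = 0
    for item in reversed(items):  # walk newest to oldest
        cost = max(1, len(item) // 4)  # crude estimate, not a real tokenizer
        if spent + cost > token_budget:
            break  # everything older than this item is dropped
        kept.append(item)
        spent += cost
    kept.reverse()  # restore chronological (oldest-first) order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 estimated tokens each
window = fit_to_budget(history, token_budget=25)
```

With a budget of 25 only the two newest items survive. A tokenizer-backed cost function could replace the `len(item) // 4` line without changing the loop.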

AngeloDanducci · 2026-06-16T02:26:05Z

Thanks for the review. This should be fixed now: I was missing a batch of staged commits, and I also incorporated your review into the newest changeset.

planetf1

session.py:377 — reset() docstring (lines not in diff hunk, so can't inline):

The docstring still says "replaces self.ctx with the result of ctx.reset_to_new()" but ChatContext now goes through new_instance(). Suggested update:

    def reset(self) -> None:
        """Reset the context state to a fresh, empty context of the same type.

        Fires the `SESSION_RESET` plugin hook if any plugins are registered, then
        replaces `self.ctx` with a fresh empty context, discarding all accumulated
        conversation history. For `ChatContext`, uses `new_instance()` so the
        `model_id` and `window_size` bindings are preserved; for all other context
        types, uses `reset_to_new()`.
        """

ajbozarth

Thanks — the blocker from my previous review is addressed. _as_list_token_budget actually walks history now and the docstring is clear about token-vs-item semantics.

A few small nits below, plus +1 to Nigel's open comments — headroom (packing to 100% of context_length will overflow once the action + response are appended), hasattr → getattr(..., None) is not None, debug log on truncation, the collision guard in _build_table, the # 8B+ comment on Qwen3, the str()-falls-back-to-repr note, and the reset() docstring update. Those are all real and worth taking before merge.
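For the `hasattr` versus `getattr(..., None) is not None` nit, a small sketch shows the difference. The backend classes here are hypothetical and only model the case where the attribute exists but is unset.

```python
class UnboundBackend:
    """Hypothetical backend whose model_id attribute exists but is None."""

    model_id = None


class BoundBackend:
    """Hypothetical backend with a usable model_id."""

    model_id = "some-model"


def has_model_id(backend: object) -> bool:
    """True only when the attribute exists and holds a non-None value."""
    return getattr(backend, "model_id", None) is not None


# hasattr reports True even though there is nothing usable to bind to.
attribute_exists = hasattr(UnboundBackend(), "model_id")
usable = has_model_id(UnboundBackend())
```

`hasattr` answers "is the attribute defined", while the `getattr` form answers "is there a value to use", which is the question the binding code needs answered.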

Two test-coverage gaps worth filling:

Boundary test for _as_list_token_budget where history fits exactly at token_budget — locks the > vs >= choice in the truncation condition (currently if spent + cost > token_budget: break, which correctly allows equality).
If Nigel's _build_table collision guard lands, add a test that two ModelIdentifiers sharing a name with mismatched context_length raises.
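The collision guard and its test could look roughly like the sketch below. The function name, the `(name, context_length)` tuple input, and the choice of `ValueError` are assumptions made for illustration; the thread does not state which exception the real `_build_table` raises.

```python
def build_context_table(entries: list[tuple[str, int]]) -> dict[str, int]:
    """Build a name -> context_length table, rejecting conflicting duplicates."""
    table: dict[str, int] = {}
    for name, context_length in entries:
        if not name:
            continue  # unnamed entries never reach the table
        existing = table.get(name)
        if existing is not None and existing != context_length:
            raise ValueError(
                f"conflicting context_length for {name!r}: "
                f"{existing} vs {context_length}"
            )
        table[name] = context_length
    return table


def raises_on_collision() -> bool:
    """Return True if mismatched duplicates are rejected."""
    try:
        build_context_table([("shared-name", 4096), ("shared-name", 8192)])
    except ValueError:
        return True
    return False
```

Duplicates that agree on `context_length` are accepted in this sketch; only a mismatch is treated as an error.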

One thing still outstanding from the previous round: please add a one-line callout in the PR description for the ollama_name="" → None change in HF_SMOLLM3_3B_no_ollama — it's incidental to the title but reviewers shouldn't have to grep to confirm it's safe.

ajbozarth · 2026-06-16T16:25:45Z

It might also be worth noting that #1264 adds a Backend._model_id string value that may or may not be useful in this change. I'm hoping to merge that PR later today.

ajbozarth

Re-review. #1 (Ollama truncation) is the one real blocker — silent correctness gap. The rest are worth considering but not blocking.

ajbozarth

Looks good — all the blockers from the previous round are addressed, plus the bonus new_instance() move to the base class is a nice cleanup. One small doc-accuracy nit inline.

planetf1

lgtm - consider alex's nit

jakelorocco · 2026-06-22T14:49:30Z

+    def _as_list_token_budget(self, token_budget: int) -> list[Component | CBlock]:
+        """Return history items that fit within *token_budget*, dropping oldest first.
+
+        Walks the linked list from newest to oldest, accumulating items until
+        adding the next item would exceed the budget.  The returned list is in
+        chronological order (oldest-first), matching `as_list` behaviour.
+        Token count per item is estimated as `len(str(item)) // 4`; note that
+        `str()` falls back to `repr()` for `Component` subclasses, so the
+        estimate reflects repr boilerplate rather than rendered content.
+
+        A headroom factor of 0.55 is applied to absorb repr-vs-render skew and
+        to leave capacity for the current action and the model's generated
+        response.  This is a conservative approximation; a tokenizer-backed
+        estimate is a known follow-up.


I don't think this is a good estimate. I have a few issues:

we aren't accounting for default system prompts. I'm fine if we want to leave that in the wiggle room / headroom factor. However, we could:
a. store known lengths of system prompts (this helps but isn't 100% accurate because of tools / docs that might get inserted).
b. use a "dead-reckoning" system for backends that return prompt token lengths
c. and/or, include these sources of variability in the note

A factor of .55 seems like it might be off by a large margin. I'm not sure that it's within the helpful range.

You should probably be calling the TemplateFormatter class to get the exact word count of the stringified object. This template formatter is actually tied to the model id so you should be able to either create a new one (or grab it from a backend, depending on where this actually gets called).
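Option (b), the "dead-reckoning" idea, could be sketched as below. `DeadReckoningEstimator` and its method names are hypothetical and do not correspond to any mellea API; the sketch assumes a backend reports the prompt token count after each call and uses it to recalibrate the characters-per-token ratio.

```python
class DeadReckoningEstimator:
    """Estimate token counts, recalibrating from backend-reported usage."""

    def __init__(self, chars_per_token: float = 4.0) -> None:
        self.chars_per_token = chars_per_token

    def estimate(self, text: str) -> int:
        """Estimate tokens for text using the current ratio (at least 1)."""
        return max(1, round(len(text) / self.chars_per_token))

    def observe(self, prompt_chars: int, reported_prompt_tokens: int) -> None:
        """Update the ratio from a prompt the backend has actually tokenized."""
        if prompt_chars > 0 and reported_prompt_tokens > 0:
            self.chars_per_token = prompt_chars / reported_prompt_tokens


estimator = DeadReckoningEstimator()
before = estimator.estimate("x" * 120)
estimator.observe(prompt_chars=1200, reported_prompt_tokens=400)
after = estimator.estimate("x" * 120)
```

Because the reported count covers the whole rendered prompt, system prompts and inserted tools or documents are folded into the ratio rather than tracked separately. That is a simplification, not a precise accounting.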

ajbozarth

Re-checked after Jake's round. All four of his points are addressed cleanly and tests/lint are green — the previous approval still stands. One small note inline.

ajbozarth · 2026-06-22T17:23:20Z

+        while not current.is_root_node:
+            item = current.node_data
+            assert item is not None
+            cost = max(1, len(formatter.print(item)) // 4)


Minor behavior shift worth noting: formatter.print(item) raises ValueError("could not find template candidate...") for Components without a registered template, where the previous str(item) always returned a repr fallback. In practice the same item would fail at actual generation anyway (same formatter), so this is a latency-of-failure change rather than a new failure mode — but a try/except ValueError here that falls back to len(str(item)) // 4 would keep view_for_generation() from blowing up earlier than the rest of the pipeline. Not blocking; flagging in case you want to tighten it.
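The suggested fallback could be sketched like this. `StrictFormatter` is a hypothetical stand-in that mimics only the described behaviour of raising `ValueError` when no template is found; it is not mellea's `TemplateFormatter`.

```python
class StrictFormatter:
    """Hypothetical formatter that can only render plain strings."""

    def print(self, item: object) -> str:
        if isinstance(item, str):
            return item
        raise ValueError(
            f"could not find template candidate for {type(item).__name__}"
        )


def estimate_cost(formatter: StrictFormatter, item: object) -> int:
    """Estimate token cost, falling back to str(item) if rendering fails."""
    try:
        rendered = formatter.print(item)
    except ValueError:
        rendered = str(item)  # same fallback the earlier revision used
    return max(1, len(rendered) // 4)


rendered_cost = estimate_cost(StrictFormatter(), "a" * 40)
fallback_cost = estimate_cost(StrictFormatter(), 12345678)
```

The fallback keeps the estimate approximate for unrenderable items but lets the failure surface at generation time, where the rest of the pipeline already expects it.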

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

jakelorocco · 2026-06-23T14:03:53Z

+        self,
+        *,
+        window_size: int | None = None,
+        model_id: str | ModelIdentifier | None = None,


Should we allow passing in an arbitrary token context length limit? How would a user currently set an ad hoc limit? or even to override a given limit?

I could see the argument, something like this?

1. window_size (item count): wins if set
2. token_limit (explicit token limit): wins over model-derived
3. Model-derived limit via get_context_length
4. No limit

I don't think anything is currently exposed to allow adjustment of ad hoc limits or overrides.

Yes. I will approve the PR as is; but can you either open a follow-up PR or create an issue for this feature / functionality, @AngeloDanducci?
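The proposed precedence can be written down as a small resolver. This is a sketch of the ordering discussed above, using the parameter names from the proposal; `resolve_limit` and its return shape are hypothetical and not part of the PR.

```python
from __future__ import annotations


def resolve_limit(
    window_size: int | None,
    token_limit: int | None,
    model_context_length: int | None,
) -> tuple[str, int | None]:
    """Return (kind, value), where kind is "items", "tokens", or "none"."""
    if window_size is not None:
        return ("items", window_size)  # 1. explicit item count wins
    if token_limit is not None:
        return ("tokens", token_limit)  # 2. explicit token limit
    if model_context_length is not None:
        return ("tokens", model_context_length)  # 3. model-derived limit
    return ("none", None)  # 4. no limit


explicit_items = resolve_limit(5, 1000, 8192)
explicit_tokens = resolve_limit(None, 1000, 8192)
model_derived = resolve_limit(None, None, 8192)
unlimited = resolve_limit(None, None, None)
```

Returning the kind alongside the value keeps item counts and token counts from being confused downstream, which was the original blocking issue in this PR.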

jakelorocco · 2026-06-23T14:16:25Z

+        fits within `context_length`. Set `window_size` explicitly to enforce
+        an item-count limit instead of a token budget.
+
+        Per-item token count is estimated as ``len(rendered) // 4`` where


Sorry, should've commented this before as well. I didn't quite realize what the //4 heuristic was doing. I think we should explain the choice here that it's saying 1 token == 4 characters.
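A worked example of the heuristic, assuming roughly 4 characters per token and the `max(1, ...)` floor seen in the diff. `estimate_tokens` is a hypothetical name used only for this illustration.

```python
CHARS_PER_TOKEN = 4  # heuristic: one token is treated as about 4 characters


def estimate_tokens(rendered: str) -> int:
    """Estimate the token count of already-rendered text."""
    return max(1, len(rendered) // CHARS_PER_TOKEN)


# 2 characters // 4 would be 0, so the floor of 1 applies.
short = estimate_tokens("hi")
# 44 characters // 4 gives 11 estimated tokens.
sentence = estimate_tokens("The quick brown fox jumps over the lazy dog.")
```

Real tokenizers vary by model and by content (code, non-English text, and whitespace all shift the ratio), so the figure is only a rule of thumb.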

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

planetf1

A couple of notes inline. One broader item: docs/docs/concepts/context-and-sessions.md still describes only window_size and frames context overflow as unmitigated — a short paragraph covering the new model-aware auto-sizing path and the Ollama num_ctx caveat would round it out.

planetf1 · 2026-06-24T09:49:23Z

+            prev = current.previous_node
+            assert prev is not None
+            current = prev
+        dropped = total - len(collected)


dropped here will always be 0 or 1, regardless of how many items were actually truncated — total only increments for items examined before the break, so when the loop exits total == len(collected) + 1 at most. A history of 7 items truncated to 3 logs "dropped 1 item(s)" rather than 4.

A clean fix: walk the full chain once up front, then compute the real count at the end:

    # before the while loop
    chain_length = 0
    node = self
    while not node.is_root_node:
        chain_length += 1
        node = node.previous_node  # type: ignore[assignment]

    # replace the dropped = … line
    dropped = chain_length - len(collected)

Fixed, should be accurate now.
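The counting fix can be checked against the 7-items-truncated-to-3 case from the comment. The sketch below uses a hypothetical `Node` class rather than mellea's `Context`, with string payloads and the 4-characters-per-token estimate.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical linked context node; the root has no data or parent."""

    data: str | None = None
    previous: Node | None = None

    @property
    def is_root(self) -> bool:
        return self.previous is None


def chain_of(items: list[str]) -> Node:
    """Build a chain whose tail holds the newest item."""
    node = Node()
    for item in items:
        node = Node(item, node)
    return node


def truncate(tail: Node, token_budget: int) -> tuple[list[str], int]:
    """Return (kept items oldest-first, number of items dropped)."""
    chain_length = 0
    node = tail
    while not node.is_root:  # count the full chain up front
        chain_length += 1
        node = node.previous
    collected: list[str] = []
    spent = 0
    current = tail
    while not current.is_root:
        item = current.data
        if item is None:
            raise RuntimeError("Malformed chain: data is None at a non-root node")
        cost = max(1, len(item) // 4)
        if spent + cost > token_budget:
            break
        collected.append(item)
        spent += cost
        current = current.previous
    collected.reverse()
    return collected, chain_length - len(collected)


kept, dropped = truncate(chain_of(["msg" + str(i) for i in range(7)]), 3)
```

Each item here costs one estimated token, so a budget of 3 keeps the three newest items and reports four dropped.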

planetf1 · 2026-06-24T09:49:23Z

+        current: Context = self
+        while not current.is_root_node:
+            item = current.node_data
+            assert item is not None


assert is silently stripped by python -O. A corrupted chain would then produce a confusing downstream error rather than a clear diagnostic here. A guard would be safer:

Suggested change

-            assert item is not None
+            if item is None:  # pragma: no cover
+                raise RuntimeError(
+                    "Malformed context chain: node_data is None at a non-root node"
+                )

planetf1 · 2026-06-24T09:49:23Z

+            collected.append(item)
+            spent += cost
+            prev = current.previous_node
+            assert prev is not None


Same as above — assert is a no-op under python -O:

Suggested change

-            assert prev is not None
+            if prev is None:  # pragma: no cover
+                raise RuntimeError(
+                    "Malformed context chain: previous_node is None at a non-root node"
+                )
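The effect of `python -O` on `assert` can be observed directly. The sketch below runs the same one-line snippet in a child interpreter with and without the flag; it assumes the `PYTHONOPTIMIZE` environment variable is not set in the parent environment.

```python
import subprocess
import sys

# The assert always fails; the print only runs if the assert was stripped.
SNIPPET = "assert False, 'chain is corrupt'\nprint('reached')"


def run_snippet(*flags: str) -> subprocess.CompletedProcess:
    """Run SNIPPET in a child interpreter with the given flags."""
    return subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
        check=False,
    )


normal = run_snippet()  # assert fires, process exits with an error
optimized = run_snippet("-O")  # assert is removed, execution continues
```

An explicit `if ...: raise RuntimeError(...)` is unaffected by the flag, which is why the suggested changes above prefer it for invariants that must hold in production.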

planetf1 · 2026-06-24T09:49:24Z

                invoke_hook(HookType.SESSION_RESET, payload, backend=self.backend)
            )
-        self.ctx = self.ctx.reset_to_new()
+        self.ctx = self.ctx.new_instance()


The docstring covers the new behaviour, but worth a note in the PR description too: reset() previously returned a bare ChatContext() with no config; it now preserves model_id, window_size, and token_context_length_limit. Callers who relied on getting a config-free context after reset will see different behaviour silently.

Added description to PR.
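The behavioural change can be pictured with a small sketch. `SketchContext` is a hypothetical class that only mirrors the three preserved field names; it is not mellea's `ChatContext`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SketchContext:
    """Hypothetical context holding configuration plus history."""

    window_size: int | None = None
    model_id: str | None = None
    token_context_length_limit: int | None = None
    history: list[str] = field(default_factory=list)

    def new_instance(self) -> SketchContext:
        """Return an empty context that keeps this context's configuration."""
        return SketchContext(
            window_size=self.window_size,
            model_id=self.model_id,
            token_context_length_limit=self.token_context_length_limit,
        )


ctx = SketchContext(model_id="some-model", token_context_length_limit=4096)
ctx.history.append("hello")
fresh = ctx.new_instance()  # history is gone, configuration survives
bare = SketchContext()  # what callers wanting a config-free context must build
```

Callers that previously depended on a config-free context after reset would, under this model, construct a bare context themselves or clear the fields explicitly.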

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

AngeloDanducci requested a review from a team as a code owner June 15, 2026 21:54

AngeloDanducci requested review from ajbozarth, akihikokuroda and planetf1 June 15, 2026 21:54

github-actions Bot added the enhancement New feature or request label Jun 15, 2026

ajbozarth requested changes Jun 16, 2026

Comment thread mellea/stdlib/context.py Outdated

Comment thread mellea/stdlib/context.py

Comment thread mellea/backends/context_lengths.py Outdated

Comment thread mellea/backends/model_ids.py

Comment thread mellea/stdlib/session.py Outdated

planetf1 reviewed Jun 16, 2026

Comment thread mellea/stdlib/context.py

Comment thread mellea/stdlib/session.py Outdated

Comment thread mellea/stdlib/context.py Outdated

Comment thread mellea/backends/model_ids.py Outdated

Comment thread mellea/backends/context_lengths.py

ajbozarth reviewed Jun 16, 2026

Comment thread mellea/backends/context_lengths.py

Comment thread mellea/stdlib/context.py

AngeloDanducci enabled auto-merge June 16, 2026 19:35

AngeloDanducci requested a review from ajbozarth June 16, 2026 19:35

ajbozarth reviewed Jun 16, 2026

AngeloDanducci requested review from jakelorocco and nrfulton as code owners June 16, 2026 20:49

AngeloDanducci requested a review from ajbozarth June 16, 2026 20:50

ajbozarth reviewed Jun 16, 2026

Comment thread mellea/stdlib/context.py

planetf1 approved these changes Jun 17, 2026

AngeloDanducci requested a review from ajbozarth June 17, 2026 19:00

ajbozarth approved these changes Jun 17, 2026

AngeloDanducci force-pushed the ad-108 branch from e2584ce to 753181b Compare June 22, 2026 12:58

jakelorocco reviewed Jun 22, 2026

AngeloDanducci requested a review from jakelorocco June 22, 2026 17:18

ajbozarth approved these changes Jun 22, 2026

AngeloDanducci added 6 commits June 22, 2026 15:05

first pass at model aware sliding context window

ebd6b7a

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

additional pass at context length window

412ac5a

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

remove double backticks

1f759d5

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

add missing batch of changes and review changes

beac933

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

review feedback

d582785

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

additional review feedback

5f77ed1

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

AngeloDanducci added 3 commits June 22, 2026 15:09

make root now iterates propagated fields

3eb0874

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

review feedback

0a649e1

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

update token budget tests

ff3d362

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

AngeloDanducci force-pushed the ad-108 branch from fc42073 to ff3d362 Compare June 22, 2026 19:10

jakelorocco reviewed Jun 23, 2026

jakelorocco mentioned this pull request Jun 23, 2026

feat: add token usage from previous turns to context length estimation #1319

Open

allow arbitrary token context length limit

dd0c867

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

AngeloDanducci requested a review from jakelorocco June 23, 2026 20:00

planetf1 reviewed Jun 24, 2026

review feedback

d01be4b

Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>

jakelorocco approved these changes Jun 24, 2026

AngeloDanducci added this pull request to the merge queue Jun 24, 2026

Merged via the queue into generative-computing:main with commit 878c98d Jun 24, 2026
9 checks passed

AngeloDanducci deleted the ad-108 branch June 24, 2026 15:59

-            assert item is not None
+            if item is None:  # pragma: no cover
+                raise RuntimeError(
+                    "Malformed context chain: node_data is None at a non-root node"
+                )
