feat: model aware sliding window context#1270
Conversation
ajbozarth
left a comment
There was a problem hiding this comment.
Some feedback from Claude.
Blocking — feature does not implement token-aware truncation
ModelIdentifier.context_length is in tokens. ChatContext.window_size is a count of context items (CBlocks/Components). view_for_generation() passes the token count straight into as_list(last_n_components=...) as an item count.
The new docstring at context.py lines 34–42 candidly admits the ceiling is "never reached" because real conversations don't accumulate 131,072 items. So for the default path (no explicit window_size), this PR is a no-op — full history is always returned, just like before.
#108 asks for a sliding window that moves when the token budget would be exceeded. To do that the implementation needs to estimate per-item token counts (likely via the bound backend's tokenizer) and walk history popping oldest items until the running sum fits under context_length.
Two ways forward, either is fine:
- Implement the token-aware truncation here.
- Re-scope this PR to "context-length metadata + binding hook," explicitly defer truncation to a follow-up issue, and update the title / docstring so it doesn't read as if truncation already happens.
|
Thanks for the review - should be fixed, was missing a batch of staged commits but I also incorporated your review into the newest changeset. |
planetf1
left a comment
There was a problem hiding this comment.
session.py:377 — reset() docstring (lines not in diff hunk, so can't inline):
The docstring still says "replaces self.ctx with the result of ctx.reset_to_new()" but ChatContext now goes through new_instance(). Suggested update:
def reset(self) -> None:
"""Reset the context state to a fresh, empty context of the same type.
Fires the `SESSION_RESET` plugin hook if any plugins are registered, then
replaces `self.ctx` with a fresh empty context, discarding all accumulated
conversation history. For `ChatContext`, uses `new_instance()` so the
`model_id` and `window_size` bindings are preserved; for all other context
types, uses `reset_to_new()`.
"""
ajbozarth
left a comment
There was a problem hiding this comment.
Thanks — the blocker from my previous review is addressed. _as_list_token_budget actually walks history now and the docstring is clear about token-vs-item semantics.
A few small nits below, plus +1 to Nigel's open comments — headroom (packing to 100% of context_length will overflow once the action + response are appended), hasattr → getattr(..., None) is not None, debug log on truncation, the collision guard in _build_table, the # 8B+ comment on Qwen3, the str()-falls-back-to-repr note, and the reset() docstring update. Those are all real and worth taking before merge.
Two test-coverage gaps worth filling:
- Boundary test for
_as_list_token_budgetwhere history fits exactly attoken_budget— locks the>vs>=choice in the truncation condition (currentlyif spent + cost > token_budget: break, which correctly allows equality). - If Nigel's
_build_tablecollision guard lands, add a test that twoModelIdentifiers sharing a name with mismatchedcontext_lengthraises.
One thing still outstanding from the previous round: please add a one-line callout in the PR description for the ollama_name="" → None change in HF_SMOLLM3_3B_no_ollama — it's incidental to the title but reviewers shouldn't have to grep to confirm it's safe.
|
Also might be worth noting that #1264 adds a Backend._model_id string value that may or may not be useful in this change, I'm hoping to merge that PR later today. |
ajbozarth
left a comment
There was a problem hiding this comment.
Looks good — all the blockers from the previous round are addressed, plus the bonus new_instance() move to the base class is a nice cleanup. One small doc-accuracy nit inline.
planetf1
left a comment
There was a problem hiding this comment.
lgtm - consider alex's nit
| def _as_list_token_budget(self, token_budget: int) -> list[Component | CBlock]: | ||
| """Return history items that fit within *token_budget*, dropping oldest first. | ||
|
|
||
| Walks the linked list from newest to oldest, accumulating items until | ||
| adding the next item would exceed the budget. The returned list is in | ||
| chronological order (oldest-first), matching `as_list` behaviour. | ||
| Token count per item is estimated as `len(str(item)) // 4`; note that | ||
| `str()` falls back to `repr()` for `Component` subclasses, so the | ||
| estimate reflects repr boilerplate rather than rendered content. | ||
|
|
||
| A headroom factor of 0.55 is applied to absorb repr-vs-render skew and | ||
| to leave capacity for the current action and the model's generated | ||
| response. This is a conservative approximation; a tokenizer-backed | ||
| estimate is a known follow-up. |
There was a problem hiding this comment.
I don't think this is a good estimate. I have a few issues:
- we aren't accounting for default system prompts. I'm fine if we want to leave that in the wiggle room / headroom factor. However, we could:
a. store known lengths of system prompts (this helps but isn't 100% accurate because of tools / docs that might get inserted).
b. use a "dead-reckoning" system for backends that return prompt token lengths
c. and/or, include these sources of variability in the note - A factor of .55 seems like it might be off by a large margin. I'm not sure that it's within the helpful range.
- You should probably be calling the TemplateFormatter class to get the exact word count of the stringified object. This template formatter is actually tied to the model id so you should be able to either create a new one (or grab it from a backend, depending on where this actually gets called).
ajbozarth
left a comment
There was a problem hiding this comment.
Re-checked after Jake's round. All four of his points are addressed cleanly and tests/lint are green — the previous approval still stands. One small note inline.
| while not current.is_root_node: | ||
| item = current.node_data | ||
| assert item is not None | ||
| cost = max(1, len(formatter.print(item)) // 4) |
There was a problem hiding this comment.
Minor behavior shift worth noting: formatter.print(item) raises ValueError("could not find template candidate...") for Components without a registered template, where the previous str(item) always returned a repr fallback. In practice the same item would fail at actual generation anyway (same formatter), so this is a latency-of-failure change rather than a new failure mode — but a try/except ValueError here that falls back to len(str(item)) // 4 would keep view_for_generation() from blowing up earlier than the rest of the pipeline. Not blocking; flagging in case you want to tighten it.
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
| self, | ||
| *, | ||
| window_size: int | None = None, | ||
| model_id: str | ModelIdentifier | None = None, |
There was a problem hiding this comment.
Should we allow passing in an arbitrary token context length limit? How would a user currently set an ad hoc limit? or even to override a given limit?
There was a problem hiding this comment.
I could see the argument, something like this?
1. window_size (item count) — wins if set
2. token_limit (explicit token limit) — wins over model-derived
3. Model-derived limit via get_context_length
4. No limit
I don't think anything is currently exposed to allow adjustment of ad hoc limits or overrides.
There was a problem hiding this comment.
Yes. I will approve the PR as is; but can you either open a follow-up PR or create an issue for this feature / functionality, @AngeloDanducci?
| fits within `context_length`. Set `window_size` explicitly to enforce | ||
| an item-count limit instead of a token budget. | ||
|
|
||
| Per-item token count is estimated as ``len(rendered) // 4`` where |
There was a problem hiding this comment.
Sorry, should've commented this before as well. I didn't quite realize what the //4 heuristic was doing. I think we should explain the choice here that it's saying 1 token == 4 characters.
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
planetf1
left a comment
There was a problem hiding this comment.
A couple of notes inline. One broader item: docs/docs/concepts/context-and-sessions.md still describes only window_size and frames context overflow as unmitigated — a short paragraph covering the new model-aware auto-sizing path and the Ollama num_ctx caveat would round it out.
| prev = current.previous_node | ||
| assert prev is not None | ||
| current = prev | ||
| dropped = total - len(collected) |
There was a problem hiding this comment.
dropped here will always be 0 or 1, regardless of how many items were actually truncated — total only increments for items examined before the break, so when the loop exits total == len(collected) + 1 at most. A history of 7 items truncated to 3 logs "dropped 1 item(s)" rather than 4.
A clean fix: walk the full chain once up front, then compute the real count at the end:
# before the while loop
chain_length = 0
node = self
while not node.is_root_node:
chain_length += 1
node = node.previous_node # type: ignore[assignment]
# replace the dropped = … line
dropped = chain_length - len(collected)There was a problem hiding this comment.
Fixed, should be accurate now.
| current: Context = self | ||
| while not current.is_root_node: | ||
| item = current.node_data | ||
| assert item is not None |
There was a problem hiding this comment.
assert is silently stripped by python -O. A corrupted chain would then produce a confusing downstream error rather than a clear diagnostic here. A guard would be safer:
| assert item is not None | |
| if item is None: # pragma: no cover | |
| raise RuntimeError( | |
| "Malformed context chain: node_data is None at a non-root node" | |
| ) |
| collected.append(item) | ||
| spent += cost | ||
| prev = current.previous_node | ||
| assert prev is not None |
There was a problem hiding this comment.
Same as above — assert is a no-op under python -O:
| assert prev is not None | |
| if prev is None: # pragma: no cover | |
| raise RuntimeError( | |
| "Malformed context chain: previous_node is None at a non-root node" | |
| ) |
| invoke_hook(HookType.SESSION_RESET, payload, backend=self.backend) | ||
| ) | ||
| self.ctx = self.ctx.reset_to_new() | ||
| self.ctx = self.ctx.new_instance() |
There was a problem hiding this comment.
The docstring covers the new behaviour, but worth a note in the PR description too: reset() previously returned a bare ChatContext() with no config; it now preserves model_id, window_size, and token_context_length_limit. Callers who relied on getting a config-free context after reset will see different behaviour silently.
There was a problem hiding this comment.
Added description to PR.
Signed-off-by: AngeloDanducci <angelo.danducci.ii@ibm.com>
878c98d
Pull Request
Issue
Fixes #108
Description
Adds a model aware sliding context window based on the issues request for such.
Note:
HF_SMOLLM3_3B_no_ollama: ollama_name=""was changed toollama_name=None— this is safe because_build_tablealready guardsif name:before indexing, so the empty string was silently excluded from the lookup table anyway.Behavioral change: reset() now preserves model_id, window_size, and token_context_length_limit on the new context. Previously it returned a bare ChatContext() with no configuration. Callers that relied on a config-free context after reset will need to set those fields explicitly if needed.
Testing
Attribution
Adding a new component, requirement, sampling strategy, or tool?
If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.
NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.