feat(index): support Utf8View prefixes in SargableQueryParser #7351

Open: wombatu-kun wants to merge 2 commits into lance-format:main from wombatu-kun:feat/sargable-utf8view

Conversation

@wombatu-kun

Contributor

Follow-up to #7310.

Problem

#7310 added Utf8View support to the ngram TextQueryParser and noted the identical pre-existing gap in SargableQueryParser (starts_with / LIKE-prefix) as out of scope. The BTree/ZoneMap parser extracts string prefixes by matching only ScalarValue::Utf8 and ScalarValue::LargeUtf8. When the predicate literal is coerced to ScalarValue::Utf8View, the parser drops it, so starts_with(col, 'x') and col LIKE 'x%' do not use the scalar index and fall back to a full scan.

Change

SargableQueryParser::visit_scalar_function (starts_with) and visit_like now accept a Utf8View literal/pattern and normalize the extracted prefix to ScalarValue::Utf8.
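The before/after behavior can be sketched in isolation. This is a minimal illustration, not the parser's actual code: `Scalar` is a hypothetical stand-in for the string variants of DataFusion's `ScalarValue`, the two function names are invented for the example, and leaving `LargeUtf8` in its own variant is an assumption.

```rust
/// Hypothetical stand-in for a few `ScalarValue` variants (not the real type).
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Int64(Option<i64>),
}

/// Pre-change shape: only `Utf8` and `LargeUtf8` literals yield a prefix.
/// A `Utf8View` literal falls through to `None`, so no index query is built.
fn extract_prefix_before(literal: &Scalar) -> Option<Scalar> {
    match literal {
        Scalar::Utf8(Some(s)) => Some(Scalar::Utf8(Some(s.clone()))),
        Scalar::LargeUtf8(Some(s)) => Some(Scalar::LargeUtf8(Some(s.clone()))),
        _ => None,
    }
}

/// Post-change shape: `Utf8View` is accepted and folded into `Utf8`.
/// Assumption: `LargeUtf8` keeps its own variant.
fn extract_prefix_after(literal: &Scalar) -> Option<Scalar> {
    match literal {
        Scalar::Utf8(Some(s)) | Scalar::Utf8View(Some(s)) => {
            Some(Scalar::Utf8(Some(s.clone())))
        }
        Scalar::LargeUtf8(Some(s)) => Some(Scalar::LargeUtf8(Some(s.clone()))),
        _ => None,
    }
}
```

With a `Utf8View` literal, `extract_prefix_before` returns `None` (the full-scan fallback), while `extract_prefix_after` returns a `Utf8` prefix that can be used as an index bound.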

Normalize rather than preserve the variant: unlike the ngram path (where the pattern is just a regex string), the SargableQueryParser emits a SargableQuery::LikePrefix whose ScalarValue flows into the BTree. Page pruning (pages_between) compares the query bound against Utf8 page-statistics arrays with Arrow's type-dispatched make_comparator, which rejects a Utf8View bound vs Utf8 stats ("Can't compare arrays of different types"). Lance already normalizes Utf8View columns to Utf8 at write time, so the stored index data is always Utf8; normalizing the prefix to Utf8 matches that and needs no changes to the shared comparison code.
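A toy model of why the bound's type matters. Everything here is hypothetical (`StrType`, `Typed`, `compare`, `pages_at_or_above`); it only imitates the shape of a type-dispatched comparison and is not how `pages_between` or Arrow's `make_comparator` are implemented.

```rust
use std::cmp::Ordering;

/// Hypothetical tag for the string type of a value.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StrType {
    Utf8,
    Utf8View,
}

/// Hypothetical typed string, standing in for a one-element Arrow array.
struct Typed<'a> {
    ty: StrType,
    value: &'a str,
}

/// Type-dispatched comparison: mismatched types are an error, not a cast.
fn compare(stat: &Typed, bound: &Typed) -> Result<Ordering, String> {
    if stat.ty != bound.ty {
        return Err("Can't compare arrays of different types".to_string());
    }
    Ok(stat.value.cmp(bound.value))
}

/// Keep the indices of pages whose max statistic is >= the lower bound.
/// Page statistics are always `Utf8` here, mirroring the stored index data.
fn pages_at_or_above(page_max: &[&str], bound: &Typed) -> Result<Vec<usize>, String> {
    let mut keep = Vec::new();
    for (i, max) in page_max.iter().enumerate() {
        let stat = Typed { ty: StrType::Utf8, value: *max };
        if compare(&stat, bound)? != Ordering::Less {
            keep.push(i);
        }
    }
    Ok(keep)
}
```

In this model a `Utf8` bound prunes pages normally, while a `Utf8View` bound makes the whole lookup fail, which is why normalizing in the parser is the smaller change.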

Tests

New test_sargable_query_parser_utf8view exercises visit_scalar_function (starts_with) and visit_like directly with Utf8View literals/patterns, asserting the emitted LikePrefix(Utf8) query and recheck behavior, plus a Utf8 parity control. It fails on the pre-change parser and passes after.

github-actions bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels on Jun 18, 2026
@codecov

codecov Bot commented Jun 18, 2026


Codecov Report

❌ Patch coverage is 96.17834% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/expression.rs | 94.35% | 7 Missing ⚠️ |
| rust/lance-index/src/scalar.rs | 88.88% | 5 Missing ⚠️ |

Follow-up to lance-format#7310, which added Utf8View handling to the ngram TextQueryParser and explicitly left the identical gap in SargableQueryParser out of scope. The BTree/ZoneMap parser only matched Utf8 / LargeUtf8 for starts_with and infix-free LIKE prefixes, so a Utf8View predicate literal was dropped and the query silently fell back to a full scan instead of using the scalar index.

Unlike the ngram path (where the pattern is only ever used as a regex string), here the parser emits a SargableQuery::LikePrefix whose ScalarValue flows downstream into the BTree, which compares the query bound against Utf8 page statistics with Arrow's type-dispatched comparator. A Utf8View bound cannot be compared against Utf8 stats arrays. Because Lance already normalizes Utf8View columns to Utf8 at write time (the stored index data is always Utf8), the fix normalizes a Utf8View prefix to Utf8 in the parser rather than threading a new type through the shared comparison code.

Adds test_sargable_query_parser_utf8view, which exercises visit_scalar_function (starts_with) and visit_like directly with Utf8View literals and asserts the resulting LikePrefix(Utf8) query, with a Utf8 parity control. The test fails on the pre-change parser (the Utf8View literal is dropped) and passes after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the feat/sargable-utf8view branch from badeddb to d697df2 Compare June 18, 2026 14:52
// `Utf8View` literal is normalized to `Utf8` to match the indexed data: the
// BTree compares the query bound against `Utf8` page statistics at the Arrow
// level, which rejects a `Utf8View` bound.
Expr::Literal(ScalarValue::Utf8View(Some(s)), _) => {

Collaborator

The new Utf8View prefix path can feed ZoneMap an inexact LikePrefix query, but the later recheck reconstructs the predicate as an unescaped LIKE pattern. Prefixes containing _ or % can therefore match rows that do not satisfy the original starts_with predicate.
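To make the over-match concrete, here is a minimal sketch. `like_unescaped` is a toy LIKE matcher written for this illustration only (it is not Lance or DataFusion code), and `recheck_unescaped` is a hypothetical name for a recheck that appends `%` to the raw prefix.

```rust
/// Toy SQL LIKE matcher: `%` matches any run of characters, `_` matches
/// exactly one character. There is deliberately no ESCAPE support.
fn like_unescaped(text: &str, pattern: &str) -> bool {
    let t: Vec<char> = text.chars().collect();
    let p: Vec<char> = pattern.chars().collect();
    like_match(&t, &p)
}

fn like_match(t: &[char], p: &[char]) -> bool {
    match p.split_first() {
        // Pattern exhausted: match only if the text is exhausted too.
        None => t.is_empty(),
        // `%`: try every possible split point of the remaining text.
        Some((&'%', rest)) => (0..=t.len()).any(|i| like_match(&t[i..], rest)),
        // `_`: consume exactly one character.
        Some((&'_', rest)) => !t.is_empty() && like_match(&t[1..], rest),
        // Literal character: must match the next text character.
        Some((c, rest)) => t.first() == Some(c) && like_match(&t[1..], rest),
    }
}

/// Hypothetical recheck that rebuilds the predicate as `prefix%` without
/// escaping, so metacharacters inside the prefix act as wildcards.
fn recheck_unescaped(value: &str, prefix: &str) -> bool {
    let pattern = format!("{}%", prefix);
    like_unescaped(value, &pattern)
}
```

For the prefix `a_b`, the value `axbc` does not satisfy `starts_with`, yet `recheck_unescaped("axbc", "a_b")` accepts it because `_` is read as a single-character wildcard.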

Contributor Author

Done in 528b9f0: LikePrefix::to_expr now escapes _, %, and \ and reconstructs the recheck as LIKE 'prefix%' ESCAPE '\' (only when the prefix actually contains metacharacters), so the inexact zone-map recheck no longer over-matches.
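A minimal sketch of the escaping rule described in this reply. The helper names `escape_like` and `like_prefix_sql` are hypothetical, and rendering the predicate as SQL text is an assumption made for readability; the actual change builds an expression in `LikePrefix::to_expr`.

```rust
/// Prefix each LIKE metacharacter (`_`, `%`, `\`) with a backslash so the
/// prefix is matched literally under `ESCAPE '\'`.
fn escape_like(prefix: &str) -> String {
    let mut out = String::with_capacity(prefix.len());
    for c in prefix.chars() {
        if matches!(c, '_' | '%' | '\\') {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Render the recheck predicate as SQL text (illustration only).
/// The ESCAPE clause is added only when something was actually escaped.
/// Note: quoting of `'` inside the prefix is out of scope for this sketch.
fn like_prefix_sql(column: &str, prefix: &str) -> String {
    let escaped = escape_like(prefix);
    if escaped.len() == prefix.len() {
        format!("{} LIKE '{}%'", column, prefix)
    } else {
        format!("{} LIKE '{}%' ESCAPE '\\'", column, escaped)
    }
}
```

A plain prefix such as `abc` keeps the simple `LIKE 'abc%'` form, while a prefix such as `a_b` is rendered with the underscore escaped and an `ESCAPE '\'` clause appended.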

// Arrow level (a `Utf8View` bound would fail that comparison).
let prefix_value = match pattern {
- ScalarValue::Utf8(_) => ScalarValue::Utf8(Some(prefix)),
+ ScalarValue::Utf8(_) | ScalarValue::Utf8View(_) => ScalarValue::Utf8(Some(prefix)),

Collaborator

The shared sargable parser now emits LikePrefix for Utf8View prefixes even when the selected index is Bitmap, which rejects that query during search. A query that previously fell back to ordinary filtering can now fail once planned as an index scan.

Contributor Author

Done in 528b9f0: the bitmap index now configures the parser with without_like_prefix, so LIKE/starts_with fall back to ordinary filtering instead of emitting a LikePrefix that bitmap search would reject.
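The opt-out can be pictured with a small builder-style sketch. Only the name `without_like_prefix` comes from this thread; its signature, the `Parser`, `Predicate`, and `Query` types, and the `parse` method are all hypothetical stand-ins.

```rust
/// Hypothetical predicates a planner might hand to the parser.
#[derive(Debug, PartialEq)]
enum Predicate {
    Eq(String),
    StartsWith(String),
}

/// Hypothetical index queries the parser can emit.
#[derive(Debug, PartialEq)]
enum Query {
    Equals(String),
    LikePrefix(String),
}

/// Hypothetical parser with a per-index capability flag.
struct Parser {
    like_prefix: bool,
}

impl Parser {
    /// Prefix queries are enabled by default (as for BTree/ZoneMap).
    fn new() -> Self {
        Parser { like_prefix: true }
    }

    /// Assumed builder-style signature: disable LikePrefix emission.
    fn without_like_prefix(mut self) -> Self {
        self.like_prefix = false;
        self
    }

    /// `None` means "no index query": the planner keeps ordinary filtering.
    fn parse(&self, predicate: &Predicate) -> Option<Query> {
        match predicate {
            Predicate::Eq(v) => Some(Query::Equals(v.clone())),
            Predicate::StartsWith(p) if self.like_prefix => {
                Some(Query::LikePrefix(p.clone()))
            }
            Predicate::StartsWith(_) => None,
        }
    }
}
```

An index that cannot serve prefix queries would construct its parser as `Parser::new().without_like_prefix()`; equality predicates still produce an index query, while `starts_with` yields `None` and is evaluated by ordinary filtering.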

… for bitmap indexes

Escape LIKE metacharacters (_, %, \) when rebuilding the LikePrefix recheck predicate so a literal prefix no longer over-matches on the inexact (zone map) path.

Configure the bitmap index parser with without_like_prefix so LIKE/starts_with fall back to ordinary filtering instead of failing at search time.
@wombatu-kun wombatu-kun requested a review from Xuanwo June 20, 2026 03:07
Labels

A-index (Vector index, linalg, tokenizer), enhancement (New feature or request)

2 participants