feat(index): support Utf8View prefixes in SargableQueryParser #7351
wombatu-kun wants to merge 2 commits into
Follow-up to lance-format#7310, which added Utf8View handling to the ngram TextQueryParser and explicitly left the identical gap in SargableQueryParser out of scope.

The BTree/ZoneMap parser only matched Utf8 / LargeUtf8 for starts_with and infix-free LIKE prefixes, so a Utf8View predicate literal was dropped and the query silently fell back to a full scan instead of using the scalar index.

Unlike the ngram path (where the pattern is only ever used as a regex string), here the parser emits a SargableQuery::LikePrefix whose ScalarValue flows downstream into the BTree, which compares the query bound against Utf8 page statistics with Arrow's type-dispatched comparator. A Utf8View bound cannot be compared against Utf8 stats arrays. Because Lance already normalizes Utf8View columns to Utf8 at write time (the stored index data is always Utf8), the fix normalizes a Utf8View prefix to Utf8 in the parser rather than threading a new type through the shared comparison code.

Adds test_sargable_query_parser_utf8view, which exercises visit_scalar_function (starts_with) and visit_like directly with Utf8View literals and asserts the resulting LikePrefix(Utf8) query, with a Utf8 parity control. The test fails on the pre-change parser (the Utf8View literal is dropped) and passes after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Force-pushed from badeddb to d697df2.
// `Utf8View` literal is normalized to `Utf8` to match the indexed data: the
// BTree compares the query bound against `Utf8` page statistics at the Arrow
// level, which rejects a `Utf8View` bound.
Expr::Literal(ScalarValue::Utf8View(Some(s)), _) => {
The new Utf8View prefix path can feed ZoneMap an inexact LikePrefix query, but the later recheck reconstructs the predicate as an unescaped LIKE pattern. Prefixes containing _ or % can therefore match rows that do not satisfy the original starts_with predicate.
Done 528b9f0 - LikePrefix::to_expr now escapes _/%/\ and reconstructs the recheck as LIKE 'prefix%' ESCAPE '\' (only when the prefix actually contains metacharacters), so the inexact zone-map recheck no longer over-matches.
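The escaping rule described above can be sketched in isolation. This is a minimal illustration only, not the code from 528b9f0: the function names `escape_like_literal` and `like_prefix_pattern` are hypothetical, and the sketch returns a plain pattern string rather than building a DataFusion expression.

```rust
/// Escapes the LIKE metacharacters `_`, `%`, and `\` in a literal prefix,
/// using `\` as the escape character. (Hypothetical helper name.)
fn escape_like_literal(prefix: &str) -> String {
    let mut out = String::with_capacity(prefix.len());
    for ch in prefix.chars() {
        if matches!(ch, '_' | '%' | '\\') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Builds the recheck pattern `prefix%` for a literal prefix.
/// Returns the pattern and whether an `ESCAPE '\'` clause is needed,
/// i.e. whether the prefix contained any metacharacter.
fn like_prefix_pattern(prefix: &str) -> (String, bool) {
    let needs_escape = prefix.chars().any(|c| matches!(c, '_' | '%' | '\\'));
    let body = if needs_escape {
        escape_like_literal(prefix)
    } else {
        prefix.to_string()
    };
    (format!("{}%", body), needs_escape)
}
```

With this sketch, the prefix `a_b` yields the pattern `a\_b%` with the escape flag set, so `_` is matched literally instead of as a single-character wildcard, while a plain prefix such as `abc` yields `abc%` with no escape clause.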
  // Arrow level (a `Utf8View` bound would fail that comparison).
  let prefix_value = match pattern {
-     ScalarValue::Utf8(_) => ScalarValue::Utf8(Some(prefix)),
+     ScalarValue::Utf8(_) | ScalarValue::Utf8View(_) => ScalarValue::Utf8(Some(prefix)),
The shared sargable parser now emits LikePrefix for Utf8View prefixes even when the selected index is Bitmap, which rejects that query during search. A query that previously fell back to ordinary filtering can now fail once planned as an index scan.
Done 528b9f0 - the bitmap index now configures the parser with without_like_prefix, so LIKE/starts_with fall back to ordinary filtering instead of emitting a LikePrefix that bitmap search would reject.
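The opt-out can be pictured with a toy model. Only the name `without_like_prefix` comes from this thread; the `ToySargableParser` type, the `Plan` enum, the `visit_starts_with` method, and the builder-style signature are assumptions made for illustration and do not mirror the real parser.

```rust
/// What the toy parser decides for a `starts_with` predicate.
#[derive(Debug, PartialEq)]
enum Plan {
    /// The index is asked to answer a prefix query.
    IndexLikePrefix(String),
    /// The predicate is left to ordinary filtering.
    FallbackFilter,
}

/// Hypothetical stand-in for a shared parser used by several index types.
struct ToySargableParser {
    like_prefix: bool,
}

impl ToySargableParser {
    /// Default configuration: prefix queries are emitted (BTree/ZoneMap case).
    fn new() -> Self {
        Self { like_prefix: true }
    }

    /// Opt-out for index types that cannot search by prefix (bitmap case).
    fn without_like_prefix(mut self) -> Self {
        self.like_prefix = false;
        self
    }

    /// Emits a prefix query only when the index supports it.
    fn visit_starts_with(&self, prefix: &str) -> Plan {
        if self.like_prefix {
            Plan::IndexLikePrefix(prefix.to_string())
        } else {
            Plan::FallbackFilter
        }
    }
}
```

The point of the design is that the decision is made at parse time: an index that would reject the query never receives it, so the predicate is filtered normally rather than failing during search.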
… for bitmap indexes

Escape LIKE metacharacters (_, %, \) when rebuilding the LikePrefix recheck predicate so a literal prefix no longer over-matches on the inexact (zone map) path. Configure the bitmap index parser with without_like_prefix so LIKE/starts_with fall back to ordinary filtering instead of failing at search time.
Follow-up to #7310.
Problem
#7310 added Utf8View support to the ngram TextQueryParser and noted the identical pre-existing gap in SargableQueryParser (starts_with / LIKE-prefix) as out of scope. The BTree/ZoneMap parser extracts string prefixes by matching only ScalarValue::Utf8 and ScalarValue::LargeUtf8. When the predicate literal is coerced to ScalarValue::Utf8View, the parser drops it, so starts_with(col, 'x') and col LIKE 'x%' do not use the scalar index and fall back to a full scan.
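For readers unfamiliar with what "LIKE-prefix" means here, the following sketch shows one way to decide whether a LIKE pattern is a pure prefix match. It is an assumption-laden illustration, not the parser's actual logic: the function name `like_prefix` is hypothetical, and the sketch conservatively rejects empty prefixes and any pattern containing an escape character.

```rust
/// Returns the literal prefix if `pattern` has the form `prefix%` with no
/// other wildcard, otherwise `None`. (Hypothetical helper; escape sequences
/// are rejected rather than interpreted.)
fn like_prefix(pattern: &str) -> Option<&str> {
    // The pattern must end in a single trailing `%`.
    let prefix = pattern.strip_suffix('%')?;
    // Any remaining wildcard or escape character means this is not a plain
    // prefix match, so it cannot be answered as a simple prefix range.
    let has_special = prefix.contains(|c: char| c == '%' || c == '_' || c == '\\');
    if prefix.is_empty() || has_special {
        return None;
    }
    Some(prefix)
}
```

Under these assumptions `'x%'` yields the prefix `x`, while `'x%y%'`, `'x_%'`, and `'x'` yield nothing and would be left to ordinary filtering.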
Change
SargableQueryParser::visit_scalar_function (starts_with) and visit_like now accept a Utf8View literal/pattern and normalize the extracted prefix to ScalarValue::Utf8.
Normalize rather than preserve the variant: unlike the ngram path (where the pattern is just a regex string), the SargableQueryParser emits a SargableQuery::LikePrefix whose ScalarValue flows into the BTree. Page pruning (pages_between) compares the query bound against Utf8 page-statistics arrays with Arrow's type-dispatched make_comparator, which rejects a Utf8View bound vs Utf8 stats ("Can't compare arrays of different types"). Lance already normalizes Utf8View columns to Utf8 at write time, so the stored index data is always Utf8; normalizing the prefix to Utf8 matches that and needs no changes to the shared comparison code.
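The reasoning above can be illustrated with a self-contained toy model. `ToyScalar` is a simplified stand-in for DataFusion's `ScalarValue`, and `compare_same_type` only imitates the behavior of a type-dispatched comparator; neither is the real API. How the actual parser treats `LargeUtf8` is not shown in this PR, so the `LargeUtf8` arm below is an assumption.

```rust
use std::cmp::Ordering;

/// Simplified stand-in for the string variants of a scalar value type.
#[derive(Debug, Clone, PartialEq)]
enum ToyScalar {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
}

/// Picks the variant for the extracted prefix: `Utf8View` collapses to `Utf8`
/// because the stored index data is always `Utf8`.
fn normalize_prefix(pattern: &ToyScalar, prefix: String) -> ToyScalar {
    match pattern {
        ToyScalar::Utf8(_) | ToyScalar::Utf8View(_) => ToyScalar::Utf8(Some(prefix)),
        // Assumption: LargeUtf8 keeps its own variant.
        ToyScalar::LargeUtf8(_) => ToyScalar::LargeUtf8(Some(prefix)),
    }
}

/// Imitates a type-dispatched comparator: values of different variants are
/// rejected instead of being compared by content.
fn compare_same_type(bound: &ToyScalar, stat: &ToyScalar) -> Result<Ordering, String> {
    match (bound, stat) {
        (ToyScalar::Utf8(a), ToyScalar::Utf8(b))
        | (ToyScalar::LargeUtf8(a), ToyScalar::LargeUtf8(b))
        | (ToyScalar::Utf8View(a), ToyScalar::Utf8View(b)) => Ok(a.cmp(b)),
        _ => Err("Can't compare arrays of different types".to_string()),
    }
}
```

In this model a raw `Utf8View` bound fails against a `Utf8` page statistic, while the normalized bound compares cleanly, which is why the change lives in the parser and leaves the comparison path alone.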
Tests
New test_sargable_query_parser_utf8view exercises visit_scalar_function (starts_with) and visit_like directly with Utf8View literals/patterns, asserting the emitted LikePrefix(Utf8) query and recheck behavior, plus a Utf8 parity control. It fails on the pre-change parser and passes after.