Skip to content

refactor: move core index utilities into a trait that does not depend on lance#7340

Open
westonpace wants to merge 2 commits into
lance-format:mainfrom
westonpace:refactor-core-index-utils-trait
Open

refactor: move core index utilities into a trait that does not depend on lance#7340
westonpace wants to merge 2 commits into
lance-format:mainfrom
westonpace:refactor-core-index-utils-trait

Conversation

@westonpace

Copy link
Copy Markdown
Member

No description provided.

westonpace and others added 2 commits June 16, 2026 22:09
Introduce `lance-index-core` as a lightweight crate containing only
abstract interfaces (traits, enums, expression types) with minimal
dependencies. `lance-index` remains the concrete implementation crate and
re-exports everything from core for backwards compatibility.

Key changes:
- New `lance-index-core` crate: Index/IndexMetadata/IndexParams traits,
  scalar expression parsing, plugin registry interfaces, metrics and
  progress traits, and a new `RowIdRemapper` trait
- `RowIdRemapper` trait replaces direct `Arc<FragReuseIndex>` usage in
  the plugin API, removing the `lance-table` dependency from core
- `FragReuseIndex` implements `RowIdRemapper` in `lance-index`
- `IndexFile` defined locally in core (identical fields, no table dep)
- All call sites in `lance/src/` cast `Arc<FragReuseIndex>` to
  `Arc<dyn RowIdRemapper>` at the crate boundary

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve conflicts from the rebase onto origin/main and fix bugs
introduced when porting the refactoring:

- Fix fmindex plugin lookup: remove the "fmindex"→"fm" rename in
  get_plugin_name_from_details_name so the plugin registers under the
  same name ("fmindex") that BuiltinIndexType::FMIndex produces.

- Add extract_nested_column_path to lance-index-core expression.rs
  so scalar index queries on nested fields with dots in field names
  (e.g. data.`value.v1`) are correctly resolved via get_field chains.

- Port the IsNotNull(regexp_match(...)) unwrapping from lance-index
  into lance-index-core so NGram regex acceleration still fires.

- Move Index/RowIdRemapper impls for FragReuseIndex and MemWalIndex
  to lance-table to avoid orphan rule violations.

- Fix CreatedIndex.files to use Option<Vec<IndexFile>>, fix IndexType
  references (Fm→FMIndex), remove duplicate code from expression.rs,
  remove unused imports and dead BTreeIndex::lookup_batch field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer A-deps Dependency updates A-namespace Namespace impls and removed A-index Vector index, linalg, tokenizer A-deps Dependency updates A-namespace Namespace impls labels Jun 17, 2026
@westonpace

Copy link
Copy Markdown
Member Author

Current status (notes for my own book-keeping)

Non-trivial changes in this refactor

This PR extracts abstract traits and types from lance-index into a new lance-index-core crate, leaving concrete implementations in lance-index. Most of the diff is mechanical (moved code, updated imports), but a handful of places required real decisions. Those are documented here.


1. as_vector_index() removed from the Index trait

What changed: The as_vector_index() -> Result<Arc<dyn VectorIndex>> method was removed from the Index trait entirely.

Why it had to change: The Index trait now lives in lance-index-core, but VectorIndex lives in lance-index. The Rust orphan rule forbids lance-index-core from referencing types defined in lance-index, so keeping as_vector_index() on the trait would create a circular dependency. Scalar indices were already implementing this as a guaranteed error anyway, so the method wasn't meaningful at the trait level. Vector indices still expose their own concrete casting mechanisms within lance-index.


2. RowIdRemapper trait abstracts FragReuseIndex

What changed: Scalar indices previously held Option<Arc<FragReuseIndex>> for row-ID remapping during fragment compaction. This has been replaced with Option<Arc<dyn RowIdRemapper>>, a new trait defined in lance-index-core.

Why it had to change: FragReuseIndex is defined in lance-table. lance-index cannot depend on lance-table without introducing a cycle. Rather than invert the dependency, we defined a minimal RowIdRemapper trait in core and implemented it on FragReuseIndex in lance-table. The four methods required are:

  • remap_row_id(u64) -> Option<u64>
  • remap_row_addrs_tree_map(&RowAddrTreeMap) -> RowAddrTreeMap
  • remap_row_ids_roaring_tree_map(&RoaringTreemap) -> RoaringTreemap
  • remap_row_ids_record_batch(RecordBatch, col_idx) -> Result<RecordBatch>

As a side effect, impl Index for FragReuseIndex and impl Index for MemWalIndex moved to lance-table (the only crate that can define impls for both the trait and the type without violating the orphan rule).


3. CreatedIndex.files is now Option<Vec<IndexFile>>

What changed: The files field on CreatedIndex changed from Vec<IndexFile> to Option<Vec<IndexFile>>.

Why it had to change: CreatedIndex is now defined in lance-index-core, but system indices (FragmentReuse, MemWal) don't produce files in the traditional sense. Option lets them return None cleanly rather than pretending they return an empty file list. All concrete scalar indices continue to wrap their files in Some(...). Call sites that read .files use .unwrap_or_default() or .map(...) as appropriate.

There's also a small conversion layer in lance/src/index.rs (index_file_to_table / index_file_from_table) because lance-index-core and lance-table each define their own IndexFile struct — the core crate can't depend on the table crate's format types.


4. Newtype wrappers in lance_format.rs (orphan rule)

What changed: lance-index/src/scalar/lance_format.rs now uses newtype wrappers LancePreviousReader and LanceCurrentReader rather than implementing IndexReader directly on PreviousFileReader and current_reader::FileReader.

Why it had to change: IndexReader is now defined in lance-index-core; the reader types come from lance-file. The orphan rule forbids implementing a foreign trait on a foreign type, so newtype wrappers were necessary. The wrappers are thin — method bodies just delegate via .0.


5. Plugin registry: trait injection over concrete types

What changed: The plugin registry was split into an abstract PluginRegistry trait (in lance-index-core) and a concrete IndexPluginRegistry struct (in lance-index). Plugins also gained an attach_registry hook:

fn attach_registry(&self, _registry: Arc<dyn PluginRegistry>) {}

Why it had to change: Some plugins need to delegate to other plugins (e.g., JSON index delegates to BTree). For that to work inside core, plugins need a way to look up siblings — but the concrete registry type and all its registered plugins live in lance-index. The solution is dependency injection: after with_default_plugins() builds the registry, it calls attach_registry on every plugin, passing a dyn PluginRegistry reference. Plugins that need cross-plugin lookup store this handle; plugins that don't simply ignore the call (default no-op).


6. FMIndex plugin name: "fm""fmindex"

What changed: The IndexPluginRegistry previously had a special case that mapped the protobuf details name FmIndexIndexDetails"fmindex""fm", registering the plugin under "fm". This special case was removed; the plugin is now registered under "fmindex", which matches BuiltinIndexType::FMIndex.as_str().

Why it had to change: Once the registry and BuiltinIndexType moved to lance-index-core, the lookup path (BuiltinIndexType::FMIndex.as_str()"fmindex") and the registration path (protobuf name stripping → "fm") no longer agreed, causing "No scalar index plugin found for name 'fmindex'" at runtime. Removing the renaming special case aligns the two paths. The IndexType enum variant was also renamed from Fm to FMIndex for consistency with other index types.


7. Nested field index queries via get_field chains

What changed: A new extract_nested_column_path function was added to lance-index-core/src/scalar/expression.rs, and maybe_indexed_column now calls it first before falling back to simple column extraction.

Why it had to change: When DataFusion represents a filter on a nested struct field like data.\value.v1`, it produces a get_field(data, "value.v1")expression — not a bareColumn. The existing maybe_indexed_columnonly recognizedExpr::Column, so nested-field indices were silently bypassed. The fix walks get_field chains upward, reconstructs the full quoted path (e.g., `` "data.value.v1" ``), and checks whether an index exists for that path. This was already present in lance-indexbut had to be explicitly ported when the function moved tolance-index-core`.


8. IsNotNull(regexp_match(...)) unwrapping for NGram acceleration

What changed: The apply_scalar_indices match arm for Expr::IsNotNull in lance-index-core now has a special case: if the inner expression is a regexp_match scalar function, it is routed through visit_scalar_fn rather than visit_is_null.

Why it had to change: DataFusion rewrites regexp_match(col, pat) — which returns an array type — as regexp_match(col, pat) IS NOT NULL to produce a boolean predicate. Without this special case the IsNotNull wrapper was opaque to the index planner and the NGram index was never consulted. This was also already handled in lance-index's expression.rs and had to be ported when the function moved to core.

@westonpace westonpace marked this pull request as ready for review June 17, 2026 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant