feat: add lance_dataset_calculate_data_stats for per-field storage stats by LuciferYang · Pull Request #47 · lance-format/lance-c

LuciferYang · 2026-06-16T13:33:26Z

Exposes upstream's Dataset::calculate_data_stats() so callers can size each field on disk for query planning and cost estimation.

Surface

lance_dataset_calculate_data_stats() walks every fragment and returns an opaque LanceDataStatistics snapshot, following the same handle + indexed accessor shape as lance_dataset_versions():

- lance_data_statistics_count(): number of fields
- lance_data_statistics_field_id_at(i): schema field id
- lance_data_statistics_bytes_on_disk_at(i): compressed on-disk size
- lance_data_statistics_close(): free the snapshot

Entries are ordered by schema field id, one per field (nested struct/list children included). The C++ side adds Dataset::calculate_data_stats() returning std::vector<FieldStatistics>.

One caveat worth flagging in the docs: bytes_on_disk is 0 for datasets written with the legacy (v1) storage format, which doesn't track per-field sizes. Since 0 is also the error sentinel for the accessors, the headers tell callers to check lance_last_error_code() when they pass an untrusted index.
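The sentinel ambiguity described above can be sketched in a few lines of C. The PR text does not show the actual lance-c signatures, so this sketch uses hypothetical stand-ins (`MockStats`, `mock_bytes_on_disk_at`, `mock_last_error_code`, `checked_bytes_on_disk`) that only mimic the handle + indexed accessor shape. It also assumes that a successful accessor call leaves the error code at 0. The real `lance_*` functions may differ in signature and error-code values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the lance-c API: they only mimic the
 * handle + indexed-accessor shape described in this PR. */
typedef struct {
    size_t count;
    const uint32_t *field_ids;
    const uint64_t *bytes_on_disk;
} MockStats;

static int g_mock_last_error = 0; /* assumption: 0 means "no error" */

static int mock_last_error_code(void) { return g_mock_last_error; }

/* Returns 0 on error, but 0 is also a legitimate size
 * (for example, a legacy v1 dataset reports 0 for every field). */
static uint64_t mock_bytes_on_disk_at(const MockStats *s, size_t i) {
    if (s == NULL || i >= s->count) {
        g_mock_last_error = 1; /* assumed non-zero "invalid argument" code */
        return 0;
    }
    g_mock_last_error = 0;
    return s->bytes_on_disk[i];
}

/* Caller-side pattern: a 0 result counts as a failure only when the
 * error code is also set. Returns 1 and writes *out on success,
 * returns 0 on failure. */
static int checked_bytes_on_disk(const MockStats *s, size_t i, uint64_t *out) {
    assert(out != NULL);
    uint64_t v = mock_bytes_on_disk_at(s, i);
    if (v == 0 && mock_last_error_code() != 0) {
        return 0;
    }
    *out = v;
    return 1;
}
```

With this pattern, a field that genuinely occupies 0 bytes is reported as a success with value 0, while an out-of-range index or NULL handle is reported as a failure.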

Tests

Rust integration tests cover the single-fragment happy path (field ids and non-zero sizes, including the field-id-0-vs-error-sentinel disambiguation), a three-field schema, and multi-fragment aggregation. The aggregation test compares a two-fragment dataset against a single-fragment baseline whose rows are identical to the first fragment, so the assertion actually proves summation. Two edge cases exercise the documented contract: a legacy v1 dataset (every field reports 0 bytes, error stays clear) and an empty-schema dataset (count 0 with no error, which distinguishes it from the NULL-handle error). The full rejection surface is covered too: NULL dataset, NULL handle on every accessor, out-of-range index, and NULL-safe close. C and C++ smoke tests run against the freshly written (v2) dataset before the mutation tests reshape it.
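To see why the baseline comparison matters for the multi-fragment test, here is a small self-contained model in C. Everything in it is hypothetical (`MockFragment`, `aggregate_sum`, `aggregate_first_only`, `passes_summation_check`), and the strict "greater than baseline" check is only one plausible form of the assertion, not necessarily the one the PR's tests use. The point is that a non-zero check alone would also pass for an implementation that reads only the first fragment, while the baseline comparison rejects it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_FIELDS 2

/* Hypothetical model: each fragment contributes some bytes per field. */
typedef struct {
    uint64_t field_bytes[NUM_FIELDS];
} MockFragment;

typedef void (*AggregateFn)(const MockFragment *frags, size_t n, uint64_t *out);

/* Intended behavior: sum every fragment's contribution per field. */
static void aggregate_sum(const MockFragment *frags, size_t n, uint64_t *out) {
    for (size_t f = 0; f < NUM_FIELDS; f++) {
        out[f] = 0;
        for (size_t i = 0; i < n; i++) {
            out[f] += frags[i].field_bytes[f];
        }
    }
}

/* Deliberately buggy variant: ignores every fragment after the first. */
static void aggregate_first_only(const MockFragment *frags, size_t n, uint64_t *out) {
    assert(n >= 1);
    for (size_t f = 0; f < NUM_FIELDS; f++) {
        out[f] = frags[0].field_bytes[f];
    }
}

/* Baseline = the first fragment alone; combined = both fragments.
 * Returns 1 only if every field grew once the second fragment was added. */
static int passes_summation_check(AggregateFn agg, const MockFragment *two_frags) {
    uint64_t baseline[NUM_FIELDS];
    uint64_t combined[NUM_FIELDS];
    agg(two_frags, 1, baseline);
    agg(two_frags, 2, combined);
    for (size_t f = 0; f < NUM_FIELDS; f++) {
        if (combined[f] <= baseline[f]) {
            return 0;
        }
    }
    return 1;
}
```

In this model `aggregate_sum` passes the check and `aggregate_first_only` fails it, even though both report non-zero sizes for every field.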

This was the last open binding in the roadmap's write-path/mutations work. The matching roadmap edits ship separately as a docs-only PR.

- Keep the existing LanceScanner/Batch/Versions typedef spacing untouched and add only the new LanceDataStatistics line (the realignment was needless churn on unrelated lines).
- Reword the count and field_id_at Rust doc comments to match the tighter C-header phrasing.

LuciferYang marked this pull request as draft June 16, 2026 13:36

LuciferYang force-pushed the feat/dataset-data-statistics branch from 33622dd to 91d7e48 Compare June 16, 2026 13:39

LuciferYang mentioned this pull request Jun 16, 2026

docs: mark Phase 3/4 mutation and statistics rows complete in roadmap #48

Draft

LuciferYang marked this pull request as ready for review June 16, 2026 13:51

LuciferYang wants to merge 2 commits into lance-format:main from LuciferYang:feat/dataset-data-statistics
