feat: add lance_dataset_calculate_data_stats for per-field storage stats #47

Open
LuciferYang wants to merge 2 commits into lance-format:main from LuciferYang:feat/dataset-data-statistics

LuciferYang (Contributor) commented Jun 16, 2026

Exposes upstream's Dataset::calculate_data_stats() so callers can size each field on disk for query planning and cost estimation.

Surface

lance_dataset_calculate_data_stats() walks every fragment and returns an opaque LanceDataStatistics snapshot, following the same handle + indexed accessor shape as lance_dataset_versions():

  • lance_data_statistics_count() — number of fields
  • lance_data_statistics_field_id_at(i) — schema field id
  • lance_data_statistics_bytes_on_disk_at(i) — compressed on-disk size
  • lance_data_statistics_close() — free the snapshot

Entries are ordered by schema field id, one per field (nested struct/list children included). The C++ side adds Dataset::calculate_data_stats() returning std::vector<FieldStatistics>.
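
As a rough illustration of how a caller might walk the indexed accessors for cost estimation, here is a minimal C sketch. Only the function names come from this PR. The prototypes (argument and return types), the stand-in `LanceDataStatistics` struct, the stub bodies, and the `projected_bytes_on_disk` helper are all assumptions made so the example compiles on its own. A real caller would get the handle from `lance_dataset_calculate_data_stats()` and release it with `lance_data_statistics_close()`, neither of which is exercised here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the opaque snapshot. The real layout is private to the
 * library; this struct exists only so the sketch is self-contained. */
typedef struct LanceDataStatistics {
    size_t count;
    const int32_t *field_ids;      /* schema field ids, ascending */
    const uint64_t *bytes_on_disk; /* compressed size per field */
} LanceDataStatistics;

/* ASSUMED prototypes: the names are from the PR, the types are guesses.
 * The stubs return 0 on a NULL handle or out-of-range index, mirroring
 * the "0 is the error sentinel" contract described in the PR. */
size_t lance_data_statistics_count(const LanceDataStatistics *s) {
    return s ? s->count : 0;
}

int32_t lance_data_statistics_field_id_at(const LanceDataStatistics *s, size_t i) {
    return (s && i < s->count) ? s->field_ids[i] : 0;
}

uint64_t lance_data_statistics_bytes_on_disk_at(const LanceDataStatistics *s, size_t i) {
    return (s && i < s->count) ? s->bytes_on_disk[i] : 0;
}

/* Hypothetical helper: total on-disk size of the fields a query projects.
 * `projected` lists the schema field ids the query would read. */
uint64_t projected_bytes_on_disk(const LanceDataStatistics *stats,
                                 const int32_t *projected, size_t n_projected) {
    assert(projected != NULL || n_projected == 0);
    uint64_t total = 0;
    size_t n = lance_data_statistics_count(stats);
    for (size_t i = 0; i < n; i++) {
        int32_t id = lance_data_statistics_field_id_at(stats, i);
        for (size_t j = 0; j < n_projected; j++) {
            if (projected[j] == id) {
                total += lance_data_statistics_bytes_on_disk_at(stats, i);
                break; /* count each field at most once */
            }
        }
    }
    return total;
}
```

Because nested struct/list children get their own entries, a planner built along these lines would need to list the child field ids it reads, not just the parent.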

One caveat worth flagging in the docs: bytes_on_disk is 0 for datasets written with the legacy (v1) storage format, which doesn't track per-field sizes. Since 0 is also the error sentinel for the accessors, the headers tell callers to check lance_last_error_code() when they pass an untrusted index.
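
The disambiguation can be sketched as a small wrapper that treats a 0 return as an error only when the error code is also set. Everything beyond the two function names is an assumption: the prototypes, the convention that `lance_last_error_code()` returns 0 for "no error", that a successful accessor call leaves the code at 0, and the stub implementations that stand in for the real library. `field_bytes_checked` is a hypothetical helper, not part of this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the opaque snapshot, reduced to what this sketch needs. */
typedef struct LanceDataStatistics {
    size_t count;
    const uint64_t *bytes_on_disk;
} LanceDataStatistics;

/* Stub error state. ASSUMPTION: 0 means "no error", and a successful
 * accessor call leaves the code at 0. */
static int g_last_error = 0;

int lance_last_error_code(void) { return g_last_error; }

/* Stub accessor with an ASSUMED prototype: returns the 0 sentinel and sets
 * a (hypothetical) nonzero error code on a NULL handle or bad index. */
uint64_t lance_data_statistics_bytes_on_disk_at(const LanceDataStatistics *s, size_t i) {
    if (s == NULL || i >= s->count) {
        g_last_error = 1;
        return 0;
    }
    g_last_error = 0;
    return s->bytes_on_disk[i];
}

/* Hypothetical helper: returns false when the accessor reported an error,
 * otherwise stores the size in *out. A legitimate 0 (for example a legacy
 * v1 dataset) is reported as success with *out == 0. */
bool field_bytes_checked(const LanceDataStatistics *stats, size_t i, uint64_t *out) {
    assert(out != NULL);
    uint64_t bytes = lance_data_statistics_bytes_on_disk_at(stats, i);
    if (bytes == 0 && lance_last_error_code() != 0) {
        return false; /* 0 was the error sentinel, not a size */
    }
    *out = bytes;
    return true;
}
```

Callers that only iterate from 0 to `lance_data_statistics_count()` on a handle they know is valid can skip the check; it matters for indexes that come from elsewhere.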

Tests

Rust integration tests cover the single-fragment happy path (field ids and non-zero sizes, including the field-id-0-vs-error-sentinel disambiguation), a three-field schema, and multi-fragment aggregation. The aggregation test compares a two-fragment dataset against a single-fragment baseline whose rows are identical to the first fragment's, so the assertion actually proves summation. Two edge cases exercise the documented contract: a legacy v1 dataset (every field reports 0 bytes, error stays clear) and an empty-schema dataset (count 0 with no error, distinguishing it from the NULL-handle error). The full rejection surface is covered too: NULL dataset, NULL handle on every accessor, out-of-range index, and NULL-safe close. C and C++ smoke tests run against the freshly written (v2) dataset before the mutation tests reshape it.

LuciferYang marked this pull request as draft June 16, 2026 13:36
Exposes upstream's `Dataset::calculate_data_stats()` so callers can size each
field on disk for query planning and cost estimation. This was the last open
binding in the roadmap's write-path/mutations work. The matching roadmap edits
ship separately as a docs-only PR.

LuciferYang force-pushed the feat/dataset-data-statistics branch from 33622dd to 91d7e48 June 16, 2026 13:39
- Keep the existing LanceScanner/Batch/Versions typedef spacing untouched and
  add only the new LanceDataStatistics line (the realignment was needless churn
  on unrelated lines).
- Reword the count and field_id_at Rust doc comments to match the tighter
  C-header phrasing.