feat: add lance_dataset_calculate_data_stats for per-field storage stats #47
Open
LuciferYang wants to merge 2 commits into
Exposes upstream's `Dataset::calculate_data_stats()` so callers can size each field on disk for query planning and cost estimation. This was the last open binding in the roadmap's write-path/mutations work. The matching roadmap edits ship separately as a docs-only PR.

## Surface

`lance_dataset_calculate_data_stats()` walks every fragment and returns an opaque `LanceDataStatistics` snapshot, following the same handle + indexed accessor shape as `lance_dataset_versions()`:

- `lance_data_statistics_count()`: number of fields
- `lance_data_statistics_field_id_at(i)`: schema field id
- `lance_data_statistics_bytes_on_disk_at(i)`: compressed on-disk size
- `lance_data_statistics_close()`: free the snapshot

Entries are ordered by schema field id, one per field (nested struct/list children included).

The C++ side adds `Dataset::calculate_data_stats()` returning `std::vector<FieldStatistics>`.

One caveat worth flagging in the docs: `bytes_on_disk` is 0 for datasets written with the legacy (v1) storage format, which doesn't track per-field sizes. Since 0 is also the error sentinel for the accessors, the headers tell callers to check `lance_last_error_code()` when they pass an untrusted index.

## Tests

Rust integration tests cover the single-fragment happy path (field ids and non-zero sizes, including the field-id-0-vs-error-sentinel disambiguation), a three-field schema, and multi-fragment aggregation (verified by comparing a two-fragment dataset against a single-fragment baseline with identical rows in the first fragment, so the assertion actually proves summation).

Two edge cases exercise the documented contract: a legacy v1 dataset (every field reports 0 bytes, error stays clear) and an empty-schema dataset (count 0 with no error, distinguishing it from the NULL-handle error).

The full rejection surface is covered too: NULL dataset, NULL handle on every accessor, out-of-range index, and NULL-safe close.

C and C++ smoke tests run against the freshly-written (v2) dataset before the mutation tests reshape it.
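The 0-versus-error-sentinel caveat is easier to see in code. The sketch below is not the PR's implementation: it mocks the snapshot and accessors in-process so it compiles on its own. The struct layout, the argument and return types (`size_t` index, `uint32_t` field id, `uint64_t` size), the convention that `lance_last_error_code()` returns 0 when no error is set, and the helpers `read_field_stat` and `total_bytes_on_disk` are all assumptions made for illustration. The real signatures are whatever the PR's headers declare.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MOCK: stands in for the opaque LanceDataStatistics handle.
 * The real handle is opaque; this layout is purely illustrative. */
typedef struct LanceDataStatistics {
    size_t count;
    const uint32_t *field_ids;
    const uint64_t *bytes_on_disk;
} LanceDataStatistics;

/* MOCK: assumed convention is 0 = no error, non-zero = error. */
static int g_last_error = 0;
static int lance_last_error_code(void) { return g_last_error; }

/* MOCK accessors: signatures are assumptions, not the PR's headers.
 * Each call sets the error state and returns 0 on failure. */
static size_t lance_data_statistics_count(const LanceDataStatistics *s) {
    g_last_error = (s == NULL);
    return s ? s->count : 0;
}

static uint32_t lance_data_statistics_field_id_at(const LanceDataStatistics *s,
                                                  size_t i) {
    g_last_error = (s == NULL || i >= s->count);
    return g_last_error ? 0 : s->field_ids[i];
}

static uint64_t lance_data_statistics_bytes_on_disk_at(
    const LanceDataStatistics *s, size_t i) {
    g_last_error = (s == NULL || i >= s->count);
    return g_last_error ? 0 : s->bytes_on_disk[i];
}

/* Hypothetical caller-side helper: a returned 0 is only treated as a
 * failure when the error code is also set, so field id 0 and a legacy
 * (v1) size of 0 both pass through as valid values.
 * Returns 0 on success, -1 if an accessor reported an error. */
static int read_field_stat(const LanceDataStatistics *s, size_t i,
                           uint32_t *id_out, uint64_t *bytes_out) {
    assert(id_out != NULL && bytes_out != NULL);
    uint32_t id = lance_data_statistics_field_id_at(s, i);
    if (id == 0 && lance_last_error_code() != 0) return -1;
    uint64_t bytes = lance_data_statistics_bytes_on_disk_at(s, i);
    if (bytes == 0 && lance_last_error_code() != 0) return -1;
    *id_out = id;
    *bytes_out = bytes;
    return 0;
}

/* Hypothetical helper: sums bytes_on_disk over every entry.
 * A count of 0 with no error is a valid empty snapshot, not a failure. */
static int total_bytes_on_disk(const LanceDataStatistics *s,
                               uint64_t *total_out) {
    assert(total_out != NULL);
    size_t n = lance_data_statistics_count(s);
    if (n == 0 && lance_last_error_code() != 0) return -1;
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t id;
        uint64_t bytes;
        if (read_field_stat(s, i, &id, &bytes) != 0) return -1;
        total += bytes;
    }
    *total_out = total;
    return 0;
}
```

In real use the snapshot would come from `lance_dataset_calculate_data_stats()` and be released with `lance_data_statistics_close()`. The point of the pattern is that a caller never infers failure from a 0 return alone.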
33622dd to
91d7e48
- Keep the existing LanceScanner/Batch/Versions typedef spacing untouched and add only the new LanceDataStatistics line (the realignment was needless churn on unrelated lines).
- Reword the count and field_id_at Rust doc comments to match the tighter C-header phrasing.