refactor(hash-aggr): Use `EmitTo` to output by 2010YOUY01 · Pull Request #23055 · apache/datafusion

2010YOUY01 · 2026-06-20T12:58:06Z

Which issue does this PR close?

Rationale for this change

Regarding the EPIC issue: I have drafted all the migrations locally, and verified that after deleting the old implementation, UTs are passing.

We are now about 4 feature migration PRs away from completing the EPIC. Before continuing with those migrations, this PR performs some cleanup and refactoring.

What changes are included in this PR?

This PR can be read commit by commit:

commit 1: use EmitTo for incremental outputting
commit 2: split hash_table.rs into small files
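
As a rough illustration of what incremental output driven by `EmitTo` could look like, here is a minimal, self-contained sketch. The `EmitTo` enum below is a local stand-in modeled on DataFusion's `EmitTo` (`All` / `First(n)`), and `emit_in_batches` is a hypothetical helper, not code from this PR.

```rust
/// Local stand-in modeled on DataFusion's `EmitTo`; simplified for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EmitTo {
    /// Emit every remaining group.
    All,
    /// Emit only the first `n` groups and keep the rest buffered.
    First(usize),
}

impl EmitTo {
    /// Remove the requested prefix from `values` and return it,
    /// leaving the not-yet-emitted entries in place.
    fn take_needed<T>(&self, values: &mut Vec<T>) -> Vec<T> {
        match self {
            EmitTo::All => std::mem::take(values),
            EmitTo::First(n) => {
                // Clamp so this sketch never panics on a short buffer.
                let n = (*n).min(values.len());
                let rest = values.split_off(n);
                std::mem::replace(values, rest)
            }
        }
    }
}

/// Hypothetical helper: drain accumulated per-group values in chunks of
/// at most `batch_size`, one `EmitTo` decision per output batch.
fn emit_in_batches(mut groups: Vec<i64>, batch_size: usize) -> Vec<Vec<i64>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut batches = Vec::new();
    while !groups.is_empty() {
        let emit_to = if groups.len() <= batch_size {
            EmitTo::All
        } else {
            EmitTo::First(batch_size)
        };
        batches.push(emit_to.take_needed(&mut groups));
    }
    batches
}
```

With five buffered groups and a batch size of two, this sketch emits three batches: two full ones via `EmitTo::First(2)` and a final short one via `EmitTo::All`.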

Are these changes tested?

Are there any user-facing changes?

alamb · 2026-06-22T12:19:42Z

Regarding the EPIC issue: I have drafted all the migrations locally, and verified that after deleting the old implementation, UTs are passing.

Amazing

alamb

This is amazing @2010YOUY01 -- thank you. I found this code really easy to follow and understand. While it is complicated, I think it much more closely mirrors the complexity of the problem being solved now, and setting up the control flow logic in this way means we will be in a much better place to improve the performance / features going forward.

👏

cc @Rachelint

alamb · 2026-06-22T12:29:50Z

+    AggregateExec, PhysicalGroupBy, aggregate_expressions, evaluate_group_by,
+};
+
+/// Marker for raw rows -> partial state aggregation.


I like this structure and how it makes it clearer what is going on with the state here

alamb · 2026-06-22T12:42:11Z

Minor: the struct is called Final but the module is called final_table.rs -- should we keep it consistent with final.rs?

No, that is the marker struct for hash table aggregation mode. I have renamed it AggregateHashTable<Final> -> AggregateHashTable<FinalMarker> to make it more explicit

alamb · 2026-06-22T12:45:17Z

likewise here, the struct is named Partial but the module partial_table.rs -- recommend partial.rs to be consistent

Same as above ⬆️
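
For readers unfamiliar with the marker-struct pattern discussed in this thread, the sketch below shows the general idea: a zero-sized type parameter selects which methods exist on the table. Only `AggregateHashTable` and `FinalMarker` are names from the discussion; `PartialMarker`, the fields, and every method are assumptions made for illustration, and the "table" is reduced to a single running sum and count.

```rust
use std::marker::PhantomData;

/// Marker for raw rows -> partial state aggregation (name assumed here).
struct PartialMarker;
/// Marker for partial state -> final value aggregation.
struct FinalMarker;

/// Simplified stand-in: tracks a sum and a count for a single group.
struct AggregateHashTable<Mode> {
    sum: i64,
    count: u64,
    _mode: PhantomData<Mode>,
}

impl<Mode> AggregateHashTable<Mode> {
    fn new() -> Self {
        Self { sum: 0, count: 0, _mode: PhantomData }
    }
}

impl AggregateHashTable<PartialMarker> {
    /// Partial mode consumes raw input values.
    fn update(&mut self, values: &[i64]) {
        self.sum += values.iter().sum::<i64>();
        self.count += values.len() as u64;
    }

    /// Partial mode outputs intermediate state, not a final value.
    fn emit_state(&self) -> (i64, u64) {
        (self.sum, self.count)
    }
}

impl AggregateHashTable<FinalMarker> {
    /// Final mode merges intermediate state produced by partial tables.
    fn merge(&mut self, state: (i64, u64)) {
        self.sum += state.0;
        self.count += state.1;
    }

    /// Final mode outputs the finished aggregate (an average here).
    fn emit_value(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum as f64 / self.count as f64)
        }
    }
}

/// Hypothetical driver: one partial table per partition, one final table.
fn two_stage_avg(partitions: &[&[i64]]) -> Option<f64> {
    let mut final_table = AggregateHashTable::<FinalMarker>::new();
    for part in partitions {
        let mut partial = AggregateHashTable::<PartialMarker>::new();
        partial.update(part);
        final_table.merge(partial.emit_state());
    }
    final_table.emit_value()
}
```

Because `update` only exists on the partial table and `merge` only on the final one, feeding raw rows to a final-mode table is a compile error rather than a runtime check.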

alamb · 2026-06-22T12:51:06Z

+    ) -> Result<Option<RecordBatch>> {
+        let output_schema = Arc::clone(&self.output_schema);
+        let batch_size = self.batch_size;
+        match &mut self.state {


this state match and some of the outputting state is duplicated across the types of tables, but I think it is ok

There are several small differences like metrics tracking, so probably it's clearer to keep them separated 🤔

alamb · 2026-06-22T12:54:58Z

+    /// In skip-partial-aggregation optimization, when a decision has made to skip
+    /// partial stage, build a typed hash table only for aggregation state conversion
+    /// row-by-row.
+    pub(in crate::aggregates) fn partial_skip_table(


I wonder if we could avoid some clones below if this consumed self rather than took it by reference

Maybe it doesn't matter

Yes, it's doable, and I think we can further simplify it into a much smaller struct, since the partial aggregation skip stage only uses a set of GroupsAccumulators.

This requires a separate PR, but I agree it's more of an idea to polish the code and not super important for now. I'll try to address it when the refactor is mostly done.

Simplify AggregateHashTable<PartialSkip> #23113
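
The clone-avoidance point raised in this thread can be illustrated in isolation. The sketch below is not the PR's code: `HashTableParts`, `SkipTable`, and both constructor methods are hypothetical names, and the only thing it demonstrates is that taking `self` by value lets owned buffers be moved instead of cloned.

```rust
/// Hypothetical stand-in for the state owned by a hash table.
#[derive(Debug, Clone)]
struct HashTableParts {
    accumulator_names: Vec<String>,
}

/// Hypothetical stand-in for the smaller table built for the skip stage.
#[derive(Debug)]
struct SkipTable {
    accumulator_names: Vec<String>,
}

impl HashTableParts {
    /// Borrowing version: the source stays usable, so owned data must be cloned.
    fn skip_table_by_ref(&self) -> SkipTable {
        SkipTable { accumulator_names: self.accumulator_names.clone() }
    }

    /// Consuming version: the source is given up, so owned data is moved.
    fn into_skip_table(self) -> SkipTable {
        SkipTable { accumulator_names: self.accumulator_names }
    }
}

/// Returns `(borrowed_reuses_buffer, consumed_reuses_buffer)` by comparing
/// the heap pointer of the name vector before and after each conversion.
fn buffer_reuse() -> (bool, bool) {
    let parts = HashTableParts {
        accumulator_names: vec!["sum".to_string(), "count".to_string()],
    };
    let original = parts.accumulator_names.as_ptr();

    let by_ref = parts.skip_table_by_ref();
    let borrowed_reuses = by_ref.accumulator_names.as_ptr() == original;

    let consumed = parts.into_skip_table();
    let consumed_reuses = consumed.accumulator_names.as_ptr() == original;

    (borrowed_reuses, consumed_reuses)
}
```

The borrowing version allocates a fresh vector, while the consuming version hands over the original allocation. Whether that saving matters in practice depends on how often the conversion runs, which matches the "maybe it doesn't matter" caveat above.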

alamb · 2026-06-22T13:16:55Z

+            .building()
+            .accumulators
+            .iter()
+            .all(|acc| acc.supports_convert_to_state())


I think we should try and remove this "supports_convert_to_state" API (as a follow on PR / project) to simplify the hash aggregate code and ensure all our groups accumulators have the high performance APIs.

I filed a ticket

Remove GroupsAccumulator::supports_convert_to_state and make convert_to_state mandatory #23081
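
To make the trade-off concrete, here is a simplified sketch of an opt-in capability flag and the `all(...)` check it forces on callers. The `Accumulator` trait is a local stand-in that only borrows the two method names from `GroupsAccumulator`; the signatures, the `SumAcc` and `LegacyAcc` types, and `can_skip_partial` are assumptions for illustration (the real trait operates on Arrow arrays).

```rust
/// Simplified stand-in for a groups accumulator.
trait Accumulator {
    /// Opt-in flag: implementations must override this to advertise support.
    fn supports_convert_to_state(&self) -> bool {
        false
    }

    /// Convert raw input values directly into intermediate state,
    /// one state entry per input row. Unsupported by default.
    fn convert_to_state(&self, _values: &[i64]) -> Result<Vec<i64>, String> {
        Err("convert_to_state is not supported".to_string())
    }
}

/// For a sum, the state of a single row is just the value itself.
struct SumAcc;

impl Accumulator for SumAcc {
    fn supports_convert_to_state(&self) -> bool {
        true
    }

    fn convert_to_state(&self, values: &[i64]) -> Result<Vec<i64>, String> {
        Ok(values.to_vec())
    }
}

/// An accumulator that never implemented the fast path.
struct LegacyAcc;

impl Accumulator for LegacyAcc {}

/// Skipping partial aggregation is only possible if every accumulator
/// can convert raw rows to state.
fn can_skip_partial(accumulators: &[Box<dyn Accumulator>]) -> bool {
    accumulators.iter().all(|acc| acc.supports_convert_to_state())
}
```

A single accumulator without support disables the optimization for the whole query, and every caller has to carry the check. If `convert_to_state` were mandatory, as the linked ticket proposes, the flag and the `all(...)` guard could go away.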

2010YOUY01 · 2026-06-23T09:28:11Z

All comments have been addressed, thank you for the careful reviews! @alamb @Rachelint

I found this code really easy to follow and understand. While it is complicated, I think it much more closely mirrors the complexity of the problem being solved now, and setting up the control flow logic in this way means we will be in a much better place to improve the performance / features going forward.

I only figured this out very recently. The split-stream approach is somewhat counterintuitive: it does introduce a lot of duplicated code, but it can make the code easier to work with.

The key idea, I think, is problem decomposition. If we can break a large problem into smaller subproblems, we can tackle each of them individually.

2010YOUY01 added 2 commits June 20, 2026 20:28

refactor: use EmitTo for aggregate state output

2e7892b

split hash_table.rs into small files

d96b68c

github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 20, 2026

small comments update

6feef68

2010YOUY01 marked this pull request as draft June 21, 2026 01:16

2010YOUY01 marked this pull request as ready for review June 21, 2026 01:16

2010YOUY01 closed this Jun 21, 2026

2010YOUY01 reopened this Jun 21, 2026

alamb approved these changes Jun 22, 2026

alamb reviewed Jun 22, 2026

alamb mentioned this pull request Jun 22, 2026

Remove GroupsAccumulator::supports_convert_to_state and make convert_to_state mandatory #23081

Open

Rachelint reviewed Jun 23, 2026

2010YOUY01 and others added 5 commits June 23, 2026 16:04

Review: rename BuildingHashTableState and more comments

cd9ba97

Review: more comments

d1b825a

Review: cleanup and more comments to common.rs

3e499ce

Review: no need to drop timer

2c568ff

Update datafusion/physical-plan/src/aggregates/aggregate_hash_table/p…

eee3501

…artial_table.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
