fix(lambda): only push referenced params into the merged batch by Adam-Alani · Pull Request #22853 · apache/datafusion

Adam-Alani · 2026-06-09T14:23:29Z

Which issue does this PR close?

Closes #.

Rationale for this change

Extracted from #22689 per maintainer request: the original UDF and the underlying lambda fix deserve separate review threads. This PR carries only the lambda fix.

LambdaExpr previously compressed the column-index projection by enumerating every referenced Column/LambdaVariable index and packing them into a dense range. That collapse is correct for outer captures (and is a no-op for single-parameter lambdas, which is why array_transform was never affected), but it also moves lambda parameters around. A two-parameter lambda like (k, v) -> v (with k unused) would have its LambdaVariable for v re-projected from index 1 to index 0 — so at runtime the body reads the slot the higher-order function had filled with k and silently returns the wrong column.

This is a latent bug today — no in-tree higher-order function exercises it — but it blocks #22689 (transform_values, which uses (k, v) -> body lambdas) and any future HOF that takes more than one parameter.

What changes are included in this PR?

Per @gstvg's suggested non-breaking approach in the discussion thread:

LambdaExpr now tracks used_params: HashSet<String> — the subset of its own declared parameters that the body actually references. It is computed during a single walk of the body in LambdaExpr::new, with a shadow stack that ignores LambdaVariables bound by nested lambdas. For (k, v) -> func(col, (k, v2) -> k + v2 + v) the inner k shadows the outer k, so only v flows up as used by the outer lambda.
LambdaArgument gets an Option<HashSet<String>> for the used parameter names plus a non-breaking LambdaArgument::new_with_used_params(...) constructor. The existing LambdaArgument::new(...) calls it with None, which preserves the old "push every declared parameter" behavior — so external callers that build LambdaArgument directly keep working unchanged.
LambdaArgument::evaluate (through merge_captures_with_variables) only evaluates and pushes the closures whose parameter name appears in used_params, preserving the original declaration order. Unused declared parameters therefore leave no slot in the merged batch, so the body's compressed indices line up directly with the columns the evaluator actually built.
HigherOrderFunctionExpr::evaluate calls LambdaArgument::new_with_used_params(...) and forwards lambda.used_params().clone(), so all in-tree higher-order UDFs pick up the fix automatically with no callsite change.

Compared to the previous revision of this PR (which added an outer_columns_count: usize parameter to LambdaExpr::try_new and expressions::lambda(...)), this revision:

Has no breaking API change. LambdaExpr::try_new, expressions::lambda(...) and LambdaArgument::new keep their existing signatures. cargo-semver-checks should be clean now.
Should be straightforwardly backportable to the 54 release branch.
Also has the nice side effect of skipping evaluation of declared-but-unused parameters entirely (the higher-order function never invokes their closures), which avoids materializing arrays the lambda body never reads.

Are these changes tested?

Yes, two new unit tests in datafusion/physical-expr/src/expressions/lambda.rs:

test_used_params_collects_only_referenced_param — a (k, v) -> v lambda reports only {\"v\"} as used.
test_used_params_handles_shadowing_inside_nested_lambda — for (k, v) -> col + (k, v2) -> k + v2 + v, the outer lambda's used_params is {\"v\"} only; the inner k does not flag the outer k as used.

The existing test_lambda_evaluate, test_lambda_duplicate_name, and test_higher_order_function_* tests continue to pass. cargo test -p datafusion-expr higher_order (11 tests) and cargo test -p datafusion-physical-expr lambda (7 tests) both pass.

Are there any user-facing changes?

No breaking changes. LambdaArgument gains a new non-breaking constructor and a new optional field; the rest is purely internal correctness.

Per maintainer request on PR apache#22689, the `LambdaExpr::try_new` / `expressions::lambda(...)` signature change (adding `outer_columns_count`) is being reviewed separately in apache#22853 because it's a breaking change to the physical-expr public API and warrants its own attention, distinct from this additive UDF. This commit reverts the `lambda.rs` / `higher_order_function.rs` / `planner.rs` files to their `upstream/main` state, removes the sqllogictest file (every query in it uses `(k, v) -> body` lambdas that require the upstream fix), and marks the unit tests that exercise multi-parameter lambdas with `#[ignore = "blocked on apache#22853: multi-param lambda projection fix"]`. `transform_values_uses_keys_via_case` and `transform_values_all_null_rows_returns_null_array` still pass because the former references both `k` and `v` (so projection is a no-op) and the latter short-circuits before evaluating the lambda. This PR will be rebased onto main once apache#22853 merges, at which point the ignore markers will be removed and the sqllogictest file restored.

rluvaton · 2026-06-09T14:36:19Z

@gstvg can you please take a look?

github-actions · 2026-06-09T15:16:02Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion-physical-expr v53.1.0 (current)
       Built [  40.413s] (current)
     Parsing datafusion-physical-expr v53.1.0 (current)
      Parsed [   0.055s] (current)
    Building datafusion-physical-expr v53.1.0 (baseline)
       Built [  30.706s] (baseline)
     Parsing datafusion-physical-expr v53.1.0 (baseline)
      Parsed [   0.054s] (baseline)
    Checking datafusion-physical-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.523s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure function_parameter_count_changed: pub fn parameter count changed ---

Description:
A publicly-visible function now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_parameter_count_changed.ron

Failed in:
  datafusion_physical_expr::expressions::lambda now takes 3 parameters instead of 2, in /home/runner/work/datafusion/datafusion/datafusion/physical-expr/src/expressions/lambda.rs:243

--- failure method_parameter_count_changed: pub method parameter count changed ---

Description:
A publicly-visible method now takes a different number of parameters, not counting the receiver (self) parameter.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/method_parameter_count_changed.ron

Failed in:
  datafusion_physical_expr::expressions::LambdaExpr::try_new takes 2 parameters in /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/007757cece9d2366ac3da6831e9b71798d16403b/datafusion/physical-expr/src/expressions/lambda.rs:64, but now takes 3 parameters in /home/runner/work/datafusion/datafusion/datafusion/physical-expr/src/expressions/lambda.rs:76

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  73.494s] datafusion-physical-expr

gstvg · 2026-06-09T18:45:41Z

Thanks @Adam-Alani for unconvering and fixing this. This looks great to me, but I wonder if we want to fix this without breaking changes so that it can be trivially backported to the 54 branch? If you and @rluvaton agree on this, is possible to:

Remove the body projection in LambdaExpr::new, keeping only the used columns indices collection
In HigherOrderFunctionExpr::evaluate, instead of projecting the batch, derive a new batch with uncaptured columns swapped with a cheap NullArray
In LambdaArgument::evaluate, inline merge_captures_with_variables, evaluate the parameters before spreading captures, get the len of the first evaluated parameter (0-param lambdas should return an error), then, if all columns in captures are DataType::Null, directly create a new batch of null arrays with the len of the first evaluated parameter, instead of calling spread_captures

That's how it was implemented in the first version of the lambda PR

Adam-Alani · 2026-06-10T13:15:47Z

Thanks @Adam-Alani for unconvering and fixing this. This looks great to me, but I wonder if we want to fix this without breaking changes so that it can be trivially backported to the 54 branch? If you and @rluvaton agree on this, is possible to:

Remove the body projection in LambdaExpr::new, keeping only the used columns indices collection

In HigherOrderFunctionExpr::evaluate, instead of projecting the batch, derive a new batch with uncaptured columns swapped with a cheap NullArray

In LambdaArgument::evaluate, inline merge_captures_with_variables, evaluate the parameters before spreading captures, get the len of the first evaluated parameter (0-param lambdas should return an error), then, if all columns in captures are DataType::Null, directly create a new batch of null arrays with the len of the first evaluated parameter, instead of calling spread_captures

That's how it was implemented in the first version of the lambda PR

@rluvaton let me know if this plan sounds good to you, happy to do that.

gstvg · 2026-06-10T23:00:01Z

I imagined another approach:

Add used_params as HashSet<String> or Option<HashSet<String>> to LambdaArgument
Add LambdaArgument::new_with_used_params, the existing new invokes this new method with None or the hash set of all params names
While traversing the lambda body to collect used indices, also collect used lambda variables names of the given lambda (ignoring vars of others lambda), and provide them to LambdaArgument::new_with_used_params. It must take into account shadowing: (k, v) -> func(column, (k, v2) -> k + v2 + v), the innermost lambda param k should not flag the outermost one as used. Only the outermost v is used
In LambdaArgument::evaluate, just evaluates and push to the batch the parameters that are actually used, so unused ones don't shift the others.

Because this skips the evaluation of declared but not used parameters, I personally prefer this one, WDYT?

`LambdaExpr` compresses the body's column-index projection by enumerating every referenced `Column`/`LambdaVariable` index and packing them into a dense range. That compression is correct for outer captures, but it silently broke multi-parameter lambdas: a body like `(k, v) -> v` (where `k` is unused) would have its `LambdaVariable("v")` re-projected from index 1 to index 0 and then, at runtime, read the slot the higher-order function had filled with `k`. Per maintainer feedback on apache#22853, fix it without a breaking change to `LambdaExpr::try_new` / `expressions::lambda(...)`: * `LambdaExpr` now tracks `used_params: HashSet<String>` — the subset of its own declared parameters that the body actually references. The set is computed during a single walk of the body in `LambdaExpr::new`, with a shadow stack that ignores `LambdaVariable`s bound by nested lambdas. For `(k, v) -> func(col, (k, v2) -> k + v2 + v)` the inner `k` shadows the outer `k`, so only `v` flows up as used by the outer lambda. * `LambdaArgument` gets an `Option<HashSet<String>>` for used parameter names plus a non-breaking `new_with_used_params(...)` constructor. The existing `new(...)` calls it with `None`, which preserves the old "push every declared parameter" behavior. * `LambdaArgument::evaluate` (through `merge_captures_with_variables`) only evaluates and pushes the closures whose parameter name appears in `used_params`, preserving the original declaration order. Unused declared parameters therefore leave no slot in the merged batch, so the body's compressed indices line up directly with the columns the evaluator actually built. * `HigherOrderFunctionExpr::evaluate` calls `new_with_used_params` and forwards `lambda.used_params().clone()`, so all in-tree higher-order UDFs benefit automatically without any callsite change. No public API breakage: `LambdaExpr::try_new`, `expressions::lambda(...)` and `LambdaArgument::new` keep their existing signatures. Two new tests cover the unused-parameter case and the nested-lambda shadowing case; existing tests in `physical-expr` and `expr` continue to pass.

Adam-Alani · 2026-06-23T08:54:39Z

@gstvg applied the changes.

github-actions Bot added the physical-expr Changes to the physical-expr crates label Jun 9, 2026

Adam-Alani mentioned this pull request Jun 9, 2026

Add transform_values UDF #22689

Open

Adam-Alani marked this pull request as ready for review June 9, 2026 14:29

github-actions Bot added the auto detected api change Auto detected API change label Jun 9, 2026

gstvg mentioned this pull request Jun 11, 2026

[EPIC] Full lambda support #21172

Open

30 tasks

gstvg mentioned this pull request Jun 22, 2026

Release DataFusion 54.1.0 (minor/patch) Release #22547

Open

13 tasks

Adam-Alani force-pushed the adam.alani/fix-lambda-multi-param-projection branch from b9f816f to f3f6504 Compare June 23, 2026 08:36

Adam-Alani force-pushed the adam.alani/fix-lambda-multi-param-projection branch from f3f6504 to 559c8be Compare June 23, 2026 08:37

github-actions Bot added the logical-expr Logical plan and expressions label Jun 23, 2026

Adam-Alani changed the title ~~fix(lambda): keep multi-param lambda parameter positions stable~~ fix(lambda): only push referenced params into the merged batch Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lambda): only push referenced params into the merged batch#22853

fix(lambda): only push referenced params into the merged batch#22853
Adam-Alani wants to merge 1 commit into
apache:mainfrom
Adam-Alani:adam.alani/fix-lambda-multi-param-projection

Adam-Alani commented Jun 9, 2026 •

edited

Loading

Uh oh!

rluvaton commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

gstvg commented Jun 9, 2026

Uh oh!

Adam-Alani commented Jun 10, 2026

Uh oh!

gstvg commented Jun 10, 2026 •

edited

Loading

Uh oh!

Adam-Alani commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Adam-Alani commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

rluvaton commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

gstvg commented Jun 9, 2026

Uh oh!

Adam-Alani commented Jun 10, 2026

Uh oh!

gstvg commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Adam-Alani commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Adam-Alani commented Jun 9, 2026 •

edited

Loading

gstvg commented Jun 10, 2026 •

edited

Loading