Skip to content

fix(lambda): only push referenced params into the merged batch#22853

Open
Adam-Alani wants to merge 1 commit into
apache:mainfrom
Adam-Alani:adam.alani/fix-lambda-multi-param-projection
Open

fix(lambda): only push referenced params into the merged batch#22853
Adam-Alani wants to merge 1 commit into
apache:mainfrom
Adam-Alani:adam.alani/fix-lambda-multi-param-projection

Conversation

@Adam-Alani

@Adam-Alani Adam-Alani commented Jun 9, 2026

Copy link
Copy Markdown

Which issue does this PR close?

  • Closes #.

Rationale for this change

Extracted from #22689 per maintainer request: the original UDF and the underlying lambda fix deserve separate review threads. This PR carries only the lambda fix.

LambdaExpr previously compressed the column-index projection by enumerating every referenced Column/LambdaVariable index and packing them into a dense range. That collapse is correct for outer captures (and is a no-op for single-parameter lambdas, which is why array_transform was never affected), but it also moves lambda parameters around. A two-parameter lambda like (k, v) -> v (with k unused) would have its LambdaVariable for v re-projected from index 1 to index 0 — so at runtime the body reads the slot the higher-order function had filled with k and silently returns the wrong column.

This is a latent bug today — no in-tree higher-order function exercises it — but it blocks #22689 (transform_values, which uses (k, v) -> body lambdas) and any future HOF that takes more than one parameter.

What changes are included in this PR?

Per @gstvg's suggested non-breaking approach in the discussion thread:

  • LambdaExpr now tracks used_params: HashSet<String> — the subset of its own declared parameters that the body actually references. It is computed during a single walk of the body in LambdaExpr::new, with a shadow stack that ignores LambdaVariables bound by nested lambdas. For (k, v) -> func(col, (k, v2) -> k + v2 + v) the inner k shadows the outer k, so only v flows up as used by the outer lambda.
  • LambdaArgument gets an Option<HashSet<String>> for the used parameter names plus a non-breaking LambdaArgument::new_with_used_params(...) constructor. The existing LambdaArgument::new(...) calls it with None, which preserves the old "push every declared parameter" behavior — so external callers that build LambdaArgument directly keep working unchanged.
  • LambdaArgument::evaluate (through merge_captures_with_variables) only evaluates and pushes the closures whose parameter name appears in used_params, preserving the original declaration order. Unused declared parameters therefore leave no slot in the merged batch, so the body's compressed indices line up directly with the columns the evaluator actually built.
  • HigherOrderFunctionExpr::evaluate calls LambdaArgument::new_with_used_params(...) and forwards lambda.used_params().clone(), so all in-tree higher-order UDFs pick up the fix automatically with no callsite change.

Compared to the previous revision of this PR (which added an outer_columns_count: usize parameter to LambdaExpr::try_new and expressions::lambda(...)), this revision:

  • Has no breaking API change. LambdaExpr::try_new, expressions::lambda(...) and LambdaArgument::new keep their existing signatures. cargo-semver-checks should be clean now.
  • Should be straightforwardly backportable to the 54 release branch.
  • Also has the nice side effect of skipping evaluation of declared-but-unused parameters entirely (the higher-order function never invokes their closures), which avoids materializing arrays the lambda body never reads.

Are these changes tested?

Yes, two new unit tests in datafusion/physical-expr/src/expressions/lambda.rs:

  • test_used_params_collects_only_referenced_param — a (k, v) -> v lambda reports only {\"v\"} as used.
  • test_used_params_handles_shadowing_inside_nested_lambda — for (k, v) -> col + (k, v2) -> k + v2 + v, the outer lambda's used_params is {\"v\"} only; the inner k does not flag the outer k as used.

The existing test_lambda_evaluate, test_lambda_duplicate_name, and test_higher_order_function_* tests continue to pass. cargo test -p datafusion-expr higher_order (11 tests) and cargo test -p datafusion-physical-expr lambda (7 tests) both pass.

Are there any user-facing changes?

No breaking changes. LambdaArgument gains a new non-breaking constructor and a new optional field; the rest is purely internal correctness.

@github-actions github-actions Bot added the physical-expr Changes to the physical-expr crates label Jun 9, 2026
Adam-Alani added a commit to Adam-Alani/datafusion that referenced this pull request Jun 9, 2026
Per maintainer request on PR apache#22689, the `LambdaExpr::try_new` /
`expressions::lambda(...)` signature change (adding `outer_columns_count`)
is being reviewed separately in apache#22853 because it's a
breaking change to the physical-expr public API and warrants its own
attention, distinct from this additive UDF.

This commit reverts the `lambda.rs` / `higher_order_function.rs` /
`planner.rs` files to their `upstream/main` state, removes the
sqllogictest file (every query in it uses `(k, v) -> body` lambdas that
require the upstream fix), and marks the unit tests that exercise
multi-parameter lambdas with
`#[ignore = "blocked on apache#22853: multi-param lambda projection fix"]`.

`transform_values_uses_keys_via_case` and
`transform_values_all_null_rows_returns_null_array` still pass because
the former references both `k` and `v` (so projection is a no-op) and
the latter short-circuits before evaluating the lambda. This PR will be
rebased onto main once apache#22853 merges, at which point the ignore
markers will be removed and the sqllogictest file restored.
@Adam-Alani Adam-Alani marked this pull request as ready for review June 9, 2026 14:29
@rluvaton

rluvaton commented Jun 9, 2026

Copy link
Copy Markdown
Member

@gstvg can you please take a look?

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-physical-expr v53.1.0 (current)
       Built [  40.413s] (current)
     Parsing datafusion-physical-expr v53.1.0 (current)
      Parsed [   0.055s] (current)
    Building datafusion-physical-expr v53.1.0 (baseline)
       Built [  30.706s] (baseline)
     Parsing datafusion-physical-expr v53.1.0 (baseline)
      Parsed [   0.054s] (baseline)
    Checking datafusion-physical-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.523s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure function_parameter_count_changed: pub fn parameter count changed ---

Description:
A publicly-visible function now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_parameter_count_changed.ron

Failed in:
  datafusion_physical_expr::expressions::lambda now takes 3 parameters instead of 2, in /home/runner/work/datafusion/datafusion/datafusion/physical-expr/src/expressions/lambda.rs:243

--- failure method_parameter_count_changed: pub method parameter count changed ---

Description:
A publicly-visible method now takes a different number of parameters, not counting the receiver (self) parameter.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/method_parameter_count_changed.ron

Failed in:
  datafusion_physical_expr::expressions::LambdaExpr::try_new takes 2 parameters in /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/007757cece9d2366ac3da6831e9b71798d16403b/datafusion/physical-expr/src/expressions/lambda.rs:64, but now takes 3 parameters in /home/runner/work/datafusion/datafusion/datafusion/physical-expr/src/expressions/lambda.rs:76

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  73.494s] datafusion-physical-expr

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 9, 2026
@gstvg

gstvg commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Thanks @Adam-Alani for unconvering and fixing this. This looks great to me, but I wonder if we want to fix this without breaking changes so that it can be trivially backported to the 54 branch? If you and @rluvaton agree on this, is possible to:

  • Remove the body projection in LambdaExpr::new, keeping only the used columns indices collection
  • In HigherOrderFunctionExpr::evaluate, instead of projecting the batch, derive a new batch with uncaptured columns swapped with a cheap NullArray
  • In LambdaArgument::evaluate, inline merge_captures_with_variables, evaluate the parameters before spreading captures, get the len of the first evaluated parameter (0-param lambdas should return an error), then, if all columns in captures are DataType::Null, directly create a new batch of null arrays with the len of the first evaluated parameter, instead of calling spread_captures

That's how it was implemented in the first version of the lambda PR

@Adam-Alani

Copy link
Copy Markdown
Author

Thanks @Adam-Alani for unconvering and fixing this. This looks great to me, but I wonder if we want to fix this without breaking changes so that it can be trivially backported to the 54 branch? If you and @rluvaton agree on this, is possible to:

  • Remove the body projection in LambdaExpr::new, keeping only the used columns indices collection
  • In HigherOrderFunctionExpr::evaluate, instead of projecting the batch, derive a new batch with uncaptured columns swapped with a cheap NullArray
  • In LambdaArgument::evaluate, inline merge_captures_with_variables, evaluate the parameters before spreading captures, get the len of the first evaluated parameter (0-param lambdas should return an error), then, if all columns in captures are DataType::Null, directly create a new batch of null arrays with the len of the first evaluated parameter, instead of calling spread_captures

That's how it was implemented in the first version of the lambda PR

@rluvaton let me know if this plan sounds good to you, happy to do that.

@gstvg

gstvg commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

I imagined another approach:

  • Add used_params as HashSet<String> or Option<HashSet<String>> to LambdaArgument
  • Add LambdaArgument::new_with_used_params, the existing new invokes this new method with None or the hash set of all params names
  • While traversing the lambda body to collect used indices, also collect used lambda variables names of the given lambda (ignoring vars of others lambda), and provide them to LambdaArgument::new_with_used_params. It must take into account shadowing: (k, v) -> func(column, (k, v2) -> k + v2 + v), the innermost lambda param k should not flag the outermost one as used. Only the outermost v is used
  • In LambdaArgument::evaluate, just evaluates and push to the batch the parameters that are actually used, so unused ones don't shift the others.

Because this skips the evaluation of declared but not used parameters, I personally prefer this one, WDYT?

@gstvg gstvg mentioned this pull request Jun 11, 2026
30 tasks
@Adam-Alani Adam-Alani force-pushed the adam.alani/fix-lambda-multi-param-projection branch from b9f816f to f3f6504 Compare June 23, 2026 08:36
Adam-Alani added a commit to Adam-Alani/datafusion that referenced this pull request Jun 23, 2026
`LambdaExpr` compresses the body's column-index projection by enumerating
every referenced `Column`/`LambdaVariable` index and packing them into a
dense range. That compression is correct for outer captures, but it
silently broke multi-parameter lambdas: a body like `(k, v) -> v` (where
`k` is unused) would have its `LambdaVariable("v")` re-projected from
index 1 to index 0 and then, at runtime, read the slot the higher-order
function had filled with `k`.

Per maintainer feedback on apache#22853, fix it without a breaking
change to `LambdaExpr::try_new` / `expressions::lambda(...)`:

* `LambdaExpr` now tracks `used_params: HashSet<String>` — the subset of
  its own declared parameters that the body actually references. The set
  is computed during a single walk of the body in `LambdaExpr::new`,
  with a shadow stack that ignores `LambdaVariable`s bound by nested
  lambdas. For
  `(k, v) -> func(col, (k, v2) -> k + v2 + v)` the inner `k` shadows the
  outer `k`, so only `v` flows up as used by the outer lambda.

* `LambdaArgument` gets an `Option<HashSet<String>>` for used parameter
  names plus a non-breaking `new_with_used_params(...)` constructor.
  The existing `new(...)` calls it with `None`, which preserves the old
  "push every declared parameter" behavior.

* `LambdaArgument::evaluate` (through `merge_captures_with_variables`)
  only evaluates and pushes the closures whose parameter name appears
  in `used_params`, preserving the original declaration order. Unused
  declared parameters therefore leave no slot in the merged batch, so
  the body's compressed indices line up directly with the columns the
  evaluator actually built.

* `HigherOrderFunctionExpr::evaluate` calls `new_with_used_params` and
  forwards `lambda.used_params().clone()`, so all in-tree higher-order
  UDFs benefit automatically without any callsite change.

No public API breakage: `LambdaExpr::try_new`, `expressions::lambda(...)`
and `LambdaArgument::new` keep their existing signatures. Two new
tests cover the unused-parameter case and the nested-lambda shadowing
case; existing tests in `physical-expr` and `expr` continue to pass.
`LambdaExpr` compresses the body's column-index projection by enumerating
every referenced `Column`/`LambdaVariable` index and packing them into a
dense range. That compression is correct for outer captures, but it
silently broke multi-parameter lambdas: a body like `(k, v) -> v` (where
`k` is unused) would have its `LambdaVariable("v")` re-projected from
index 1 to index 0 and then, at runtime, read the slot the higher-order
function had filled with `k`.

Per maintainer feedback on apache#22853, fix it without a breaking
change to `LambdaExpr::try_new` / `expressions::lambda(...)`:

* `LambdaExpr` now tracks `used_params: HashSet<String>` — the subset of
  its own declared parameters that the body actually references. The set
  is computed during a single walk of the body in `LambdaExpr::new`,
  with a shadow stack that ignores `LambdaVariable`s bound by nested
  lambdas. For
  `(k, v) -> func(col, (k, v2) -> k + v2 + v)` the inner `k` shadows the
  outer `k`, so only `v` flows up as used by the outer lambda.

* `LambdaArgument` gets an `Option<HashSet<String>>` for used parameter
  names plus a non-breaking `new_with_used_params(...)` constructor.
  The existing `new(...)` calls it with `None`, which preserves the old
  "push every declared parameter" behavior.

* `LambdaArgument::evaluate` (through `merge_captures_with_variables`)
  only evaluates and pushes the closures whose parameter name appears
  in `used_params`, preserving the original declaration order. Unused
  declared parameters therefore leave no slot in the merged batch, so
  the body's compressed indices line up directly with the columns the
  evaluator actually built.

* `HigherOrderFunctionExpr::evaluate` calls `new_with_used_params` and
  forwards `lambda.used_params().clone()`, so all in-tree higher-order
  UDFs benefit automatically without any callsite change.

No public API breakage: `LambdaExpr::try_new`, `expressions::lambda(...)`
and `LambdaArgument::new` keep their existing signatures. Two new
tests cover the unused-parameter case and the nested-lambda shadowing
case; existing tests in `physical-expr` and `expr` continue to pass.
@Adam-Alani Adam-Alani force-pushed the adam.alani/fix-lambda-multi-param-projection branch from f3f6504 to 559c8be Compare June 23, 2026 08:37
@github-actions github-actions Bot added the logical-expr Logical plan and expressions label Jun 23, 2026
@Adam-Alani Adam-Alani changed the title fix(lambda): keep multi-param lambda parameter positions stable fix(lambda): only push referenced params into the merged batch Jun 23, 2026
@Adam-Alani

Copy link
Copy Markdown
Author

@gstvg applied the changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants