fix(lambda): only push referenced params into the merged batch#22853
fix(lambda): only push referenced params into the merged batch#22853Adam-Alani wants to merge 1 commit into
Conversation
Per maintainer request on PR apache#22689, the `LambdaExpr::try_new` / `expressions::lambda(...)` signature change (adding `outer_columns_count`) is being reviewed separately in apache#22853 because it's a breaking change to the physical-expr public API and warrants its own attention, distinct from this additive UDF. This commit reverts the `lambda.rs` / `higher_order_function.rs` / `planner.rs` files to their `upstream/main` state, removes the sqllogictest file (every query in it uses `(k, v) -> body` lambdas that require the upstream fix), and marks the unit tests that exercise multi-parameter lambdas with `#[ignore = "blocked on apache#22853: multi-param lambda projection fix"]`. `transform_values_uses_keys_via_case` and `transform_values_all_null_rows_returns_null_array` still pass because the former references both `k` and `v` (so projection is a no-op) and the latter short-circuits before evaluating the lambda. This PR will be rebased onto main once apache#22853 merges, at which point the ignore markers will be removed and the sqllogictest file restored.
|
@gstvg can you please take a look? |
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
|
Thanks @Adam-Alani for unconvering and fixing this. This looks great to me, but I wonder if we want to fix this without breaking changes so that it can be trivially backported to the 54 branch? If you and @rluvaton agree on this, is possible to:
That's how it was implemented in the first version of the lambda PR |
@rluvaton let me know if this plan sounds good to you, happy to do that. |
|
I imagined another approach:
Because this skips the evaluation of declared but not used parameters, I personally prefer this one, WDYT? |
b9f816f to
f3f6504
Compare
`LambdaExpr` compresses the body's column-index projection by enumerating
every referenced `Column`/`LambdaVariable` index and packing them into a
dense range. That compression is correct for outer captures, but it
silently broke multi-parameter lambdas: a body like `(k, v) -> v` (where
`k` is unused) would have its `LambdaVariable("v")` re-projected from
index 1 to index 0 and then, at runtime, read the slot the higher-order
function had filled with `k`.
Per maintainer feedback on apache#22853, fix it without a breaking
change to `LambdaExpr::try_new` / `expressions::lambda(...)`:
* `LambdaExpr` now tracks `used_params: HashSet<String>` — the subset of
its own declared parameters that the body actually references. The set
is computed during a single walk of the body in `LambdaExpr::new`,
with a shadow stack that ignores `LambdaVariable`s bound by nested
lambdas. For
`(k, v) -> func(col, (k, v2) -> k + v2 + v)` the inner `k` shadows the
outer `k`, so only `v` flows up as used by the outer lambda.
* `LambdaArgument` gets an `Option<HashSet<String>>` for used parameter
names plus a non-breaking `new_with_used_params(...)` constructor.
The existing `new(...)` calls it with `None`, which preserves the old
"push every declared parameter" behavior.
* `LambdaArgument::evaluate` (through `merge_captures_with_variables`)
only evaluates and pushes the closures whose parameter name appears
in `used_params`, preserving the original declaration order. Unused
declared parameters therefore leave no slot in the merged batch, so
the body's compressed indices line up directly with the columns the
evaluator actually built.
* `HigherOrderFunctionExpr::evaluate` calls `new_with_used_params` and
forwards `lambda.used_params().clone()`, so all in-tree higher-order
UDFs benefit automatically without any callsite change.
No public API breakage: `LambdaExpr::try_new`, `expressions::lambda(...)`
and `LambdaArgument::new` keep their existing signatures. Two new
tests cover the unused-parameter case and the nested-lambda shadowing
case; existing tests in `physical-expr` and `expr` continue to pass.
`LambdaExpr` compresses the body's column-index projection by enumerating
every referenced `Column`/`LambdaVariable` index and packing them into a
dense range. That compression is correct for outer captures, but it
silently broke multi-parameter lambdas: a body like `(k, v) -> v` (where
`k` is unused) would have its `LambdaVariable("v")` re-projected from
index 1 to index 0 and then, at runtime, read the slot the higher-order
function had filled with `k`.
Per maintainer feedback on apache#22853, fix it without a breaking
change to `LambdaExpr::try_new` / `expressions::lambda(...)`:
* `LambdaExpr` now tracks `used_params: HashSet<String>` — the subset of
its own declared parameters that the body actually references. The set
is computed during a single walk of the body in `LambdaExpr::new`,
with a shadow stack that ignores `LambdaVariable`s bound by nested
lambdas. For
`(k, v) -> func(col, (k, v2) -> k + v2 + v)` the inner `k` shadows the
outer `k`, so only `v` flows up as used by the outer lambda.
* `LambdaArgument` gets an `Option<HashSet<String>>` for used parameter
names plus a non-breaking `new_with_used_params(...)` constructor.
The existing `new(...)` calls it with `None`, which preserves the old
"push every declared parameter" behavior.
* `LambdaArgument::evaluate` (through `merge_captures_with_variables`)
only evaluates and pushes the closures whose parameter name appears
in `used_params`, preserving the original declaration order. Unused
declared parameters therefore leave no slot in the merged batch, so
the body's compressed indices line up directly with the columns the
evaluator actually built.
* `HigherOrderFunctionExpr::evaluate` calls `new_with_used_params` and
forwards `lambda.used_params().clone()`, so all in-tree higher-order
UDFs benefit automatically without any callsite change.
No public API breakage: `LambdaExpr::try_new`, `expressions::lambda(...)`
and `LambdaArgument::new` keep their existing signatures. Two new
tests cover the unused-parameter case and the nested-lambda shadowing
case; existing tests in `physical-expr` and `expr` continue to pass.
f3f6504 to
559c8be
Compare
|
@gstvg applied the changes. |
Which issue does this PR close?
Rationale for this change
Extracted from #22689 per maintainer request: the original UDF and the underlying lambda fix deserve separate review threads. This PR carries only the lambda fix.
LambdaExprpreviously compressed the column-index projection by enumerating every referencedColumn/LambdaVariableindex and packing them into a dense range. That collapse is correct for outer captures (and is a no-op for single-parameter lambdas, which is whyarray_transformwas never affected), but it also moves lambda parameters around. A two-parameter lambda like(k, v) -> v(withkunused) would have itsLambdaVariableforvre-projected from index 1 to index 0 — so at runtime the body reads the slot the higher-order function had filled withkand silently returns the wrong column.This is a latent bug today — no in-tree higher-order function exercises it — but it blocks #22689 (
transform_values, which uses(k, v) -> bodylambdas) and any future HOF that takes more than one parameter.What changes are included in this PR?
Per @gstvg's suggested non-breaking approach in the discussion thread:
LambdaExprnow tracksused_params: HashSet<String>— the subset of its own declared parameters that the body actually references. It is computed during a single walk of the body inLambdaExpr::new, with a shadow stack that ignoresLambdaVariables bound by nested lambdas. For(k, v) -> func(col, (k, v2) -> k + v2 + v)the innerkshadows the outerk, so onlyvflows up as used by the outer lambda.LambdaArgumentgets anOption<HashSet<String>>for the used parameter names plus a non-breakingLambdaArgument::new_with_used_params(...)constructor. The existingLambdaArgument::new(...)calls it withNone, which preserves the old "push every declared parameter" behavior — so external callers that buildLambdaArgumentdirectly keep working unchanged.LambdaArgument::evaluate(throughmerge_captures_with_variables) only evaluates and pushes the closures whose parameter name appears inused_params, preserving the original declaration order. Unused declared parameters therefore leave no slot in the merged batch, so the body's compressed indices line up directly with the columns the evaluator actually built.HigherOrderFunctionExpr::evaluatecallsLambdaArgument::new_with_used_params(...)and forwardslambda.used_params().clone(), so all in-tree higher-order UDFs pick up the fix automatically with no callsite change.Compared to the previous revision of this PR (which added an
outer_columns_count: usizeparameter toLambdaExpr::try_newandexpressions::lambda(...)), this revision:LambdaExpr::try_new,expressions::lambda(...)andLambdaArgument::newkeep their existing signatures.cargo-semver-checksshould be clean now.Are these changes tested?
Yes, two new unit tests in
datafusion/physical-expr/src/expressions/lambda.rs:test_used_params_collects_only_referenced_param— a(k, v) -> vlambda reports only{\"v\"}as used.test_used_params_handles_shadowing_inside_nested_lambda— for(k, v) -> col + (k, v2) -> k + v2 + v, the outer lambda'sused_paramsis{\"v\"}only; the innerkdoes not flag the outerkas used.The existing
test_lambda_evaluate,test_lambda_duplicate_name, andtest_higher_order_function_*tests continue to pass.cargo test -p datafusion-expr higher_order(11 tests) andcargo test -p datafusion-physical-expr lambda(7 tests) both pass.Are there any user-facing changes?
No breaking changes.
LambdaArgumentgains a new non-breaking constructor and a new optional field; the rest is purely internal correctness.