
Reorder follower pod validation to surface leader scheduling status #1188

Open
0xlen wants to merge 3 commits into kubernetes-sigs:main from 0xlen:reorder-follower-pod-validation-webhook

Contributor

@0xlen 0xlen commented Mar 13, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR improves the diagnostic output of the vpod webhook for JobSets using exclusive placement.

Previously, ValidateCreate checked if a follower pod had a NodeSelector before checking if the leader pod was scheduled. If the leader was stuck in Pending (e.g., due to quotas), the mutating webhook correctly withheld the selector, but the validating webhook immediately rejected the pod with "follower pod node selector not set". This masked the more accurate "expected, transient error" regarding the leader pod's status.

By moving the leader scheduling check first, users will now correctly see the transient error, which guides them to investigate the leader pod's resource constraints rather than assuming a JobSet configuration failure.

Which issue(s) this PR fixes:

Fixes #1187

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 13, 2026
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g March 13, 2026 15:13
netlify Bot commented Mar 13, 2026

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit cad2439
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-jobset/deploys/69b5f4b0a32228000888e227

@k8s-ci-robot k8s-ci-robot requested a review from kannon92 March 13, 2026 15:13
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2026
@k8s-ci-robot
Contributor

Hi @0xlen. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 13, 2026
Contributor

@kannon92 kannon92 left a comment


/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2026
@0xlen 0xlen force-pushed the reorder-follower-pod-validation-webhook branch from d316575 to d8ac436 Compare March 13, 2026 22:51
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 13, 2026
}
}

func TestPodWebhookValidateCreate(t *testing.T) {
Contributor


[AI]
Suggestions

  • Minor: The tests only cover error paths. A happy-path test case (follower pod with correct node selector + scheduled leader → no error) would strengthen confidence, but this is a gap in the existing test suite and not something this PR introduced.
  • Minor: The test does not currently cover the missing-topologyKey case (a node selector exists but lacks the topology key). Again, not a regression.

Looks like there are a few missing test cases, but I'm happy if you want to address this as a follow-up.

Contributor Author


Thanks for pointing this out. I’ve added both the happy-path case and the missing-topologyKey test in this PR:

  1. the case where the follower has a nodeSelector but is missing the required topology key
  2. the happy path where the leader pod is already scheduled and the follower has the expected topology nodeSelector
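The cases discussed in this thread could be laid out as a table along these lines. This is only a sketch under stated assumptions: the `validate` function, the `testCase` fields, and the `topologyKey` value are hypothetical and do not reproduce the actual `TestPodWebhookValidateCreate` table in this PR.

```go
package main

import "fmt"

// topologyKey is an assumed example value, not taken from the PR.
const topologyKey = "example.com/topology"

// testCase is a hypothetical shape for one row of the table.
type testCase struct {
	name         string
	leaderNode   string            // "" means the leader pod is still Pending
	nodeSelector map[string]string // follower's node selector
	wantErr      bool
}

// validate is a simplified stand-in for the follower checks in the webhook.
func validate(leaderNode string, nodeSelector map[string]string) error {
	if leaderNode == "" {
		return fmt.Errorf("leader pod not yet scheduled")
	}
	if nodeSelector == nil {
		return fmt.Errorf("follower pod node selector not set")
	}
	if _, ok := nodeSelector[topologyKey]; !ok {
		return fmt.Errorf("follower pod node selector missing topology key %q", topologyKey)
	}
	return nil
}

// cases lists the error paths plus the two cases added after review.
func cases() []testCase {
	return []testCase{
		{name: "leader pending, selector withheld", wantErr: true},
		{name: "leader scheduled, selector not set", leaderNode: "node-a", wantErr: true},
		{
			name:         "selector present but topology key missing",
			leaderNode:   "node-a",
			nodeSelector: map[string]string{"disk": "ssd"},
			wantErr:      true,
		},
		{
			name:         "happy path",
			leaderNode:   "node-a",
			nodeSelector: map[string]string{topologyKey: "rack-1"},
			wantErr:      false,
		},
	}
}

// runCases returns the names of cases whose outcome did not match wantErr.
func runCases() []string {
	var failed []string
	for _, tc := range cases() {
		err := validate(tc.leaderNode, tc.nodeSelector)
		if (err != nil) != tc.wantErr {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", runCases())
}
```

In a real test file the loop body would normally run under `t.Run(tc.name, ...)` from the `testing` package; a plain `main` is used here only to keep the sketch self-contained.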

@kannon92
Contributor

/approve

@GiuseppeTT can you take a look and give your LGTM?

I'm curious if we should consider a cherry pick for this.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 0xlen, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 14, 2026
@0xlen 0xlen force-pushed the reorder-follower-pod-validation-webhook branch from de8e038 to cad2439 Compare March 14, 2026 23:52
@kannon92
Contributor

@GiuseppeTT @imreddy13 could one of you take a look here?

@0xlen
Contributor Author

0xlen commented Apr 7, 2026

Hi @GiuseppeTT @imreddy13, gentle ping for LGTM when you get a chance. Let me know if a cherry-pick is required; I can create a separate PR for it. Thanks!


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Pod validation webhook obscures leader scheduling state by validating follower NodeSelector too early

3 participants