Reorder follower pod validation to surface leader scheduling status#1188
Reorder follower pod validation to surface leader scheduling status#11880xlen wants to merge 3 commits into
Conversation
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
|
|
Hi @0xlen. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
d316575 to
d8ac436
Compare
| } | ||
| } | ||
|
|
||
| func TestPodWebhookValidateCreate(t *testing.T) { |
There was a problem hiding this comment.
[AI]
Suggestions
- Minor: The tests only cover error paths. A happy-path test case (follower pod with correct node selector + scheduled leader → no error) would strengthen confidence, but this is a gap in the existing test suite and not something this PR introduced.
- Minor: The test currently doesn't cover the topologyKey missing case (node selector exists but doesn't have the topology key). Again, not a regression
Looks like there are a few missing test cases but I'm happy if you want to address this as a follow up.
There was a problem hiding this comment.
Thanks for pointing this out. I’ve added both the happy-path case and the missing-topologyKey test in this PR:
- the case where the follower has a nodeSelector but is missing the required topology key
- the happy path where the leader pod is already scheduled and the follower has the expected topology nodeSelector
|
/approve @GiuseppeTT can you take a look and give your LGTM? I'm curious if we should consider a cherry pick for this. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: 0xlen, kannon92 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
de8e038 to
cad2439
Compare
|
@GiuseppeTT @imreddy13 could one of you take a look here? |
|
Hi @GiuseppeTT @imreddy13, gentle ping for LGTM when you get a chance. Let me know if cherry pick is required, I can help to create a separate PR. Thanks! |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR improves the diagnostic output of the
vpodwebhook for JobSets using exclusive placement.Previously,
ValidateCreatechecked if a follower pod had aNodeSelectorbefore checking if the leader pod was scheduled. If the leader was stuck inPending(e.g., due to quotas), the mutating webhook correctly withheld the selector, but the validating webhook immediately rejected the pod with"follower pod node selector not set". This masked the more accurate"expected, transient error"regarding the leader pod's status.By moving the leader scheduling check first, users will now correctly see the transient error, which guides them to investigate the leader pod's resource constraints rather than assuming a JobSet configuration failure.
Which issue(s) this PR fixes:
Fixes #1187
Special notes for your reviewer:
Does this PR introduce a user-facing change?