HDDS-14774. Avoid datanode restart in TestContainerReportHandling to fix flaky timeout by chihsuan · Pull Request #10620 · apache/ozone

chihsuan · 2026-06-26T23:21:30Z

What changes were proposed in this pull request?

After the earlier fix (#10535) set hdds.container.report.interval to 1s, TestContainerReportHandling (and its HA variant) still timed out intermittently on CI.

Problem. The test marks a freshly-closed, non-empty container DELETING/DELETED and waits for SCM to delete the replicas. It intermittently times out because the container is resurrected out of the deletable state: SCM's safety net (AbstractContainerReportHandler, the HDDS-11136 / HDDS-12421 path) treats a non-empty, non-CLOSED replica on a DELETING/DELETED container as a reason to move it back to QUASI_CLOSED/CLOSED, after which no delete command is issued and the 180s wait times out.

A transient CLOSING replica report can reach SCM after the container is already DELETING/DELETED in two ways:

Datanode restart. The test restarted all datanodes after deleting; during restart a replica is briefly reported CLOSING. The restart was only ever a workaround to force a full container report under the old 60-minute report interval (HDDS-11136. Some containers affected by HDDS-8129 may still be in the DELETING state incorrectly #6967); with the interval now at 1s, the periodic full report already delivers the CLOSED replicas, so the restart is obsolete.
Close-vs-delete race. waitForContainerStateInSCM(CLOSED) returns as soon as the first RATIS replica is reported CLOSED, but the other replicas may still be CLOSING in SCM. The test then immediately forces DELETING/DELETED, and a lagging replica's in-flight CLOSING report resurrects the container.

Evidence (from a repro with the flaky-test-check split-failure masking removed, so timeouts surface): the failing iteration logs Resurrecting container #1 from DELETED to QUASI_CLOSED due to non-empty CLOSING replica and issues zero delete commands for that container, while passing iterations issue several.

Fix.

Remove the datanode restart (obsolete now that the report interval is 1s).
Before moving the container to DELETING, wait until SCM reports all replicas CLOSED, not just the aggregate container state. Once every replica is CLOSED in SCM, no lagging CLOSING report can resurrect the container.

Both changes are applied to TestContainerReportHandling and TestContainerReportHandlingWithHA.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14774

How was this patch tested?

flaky-test-check (10 splits x 15 iterations, both classes), all green, verified two ways:
- un-masked workflow run: the split-failure masking is removed so a timeout actually fails the split (the default flaky-test-check currently reports green even when a split times out).
- on-branch run: branch as-is; all 300 class executions pass, confirmed by parsing the split logs (its badge is green regardless due to the masking).
Ran both test classes locally across several iterations; all parameters pass.
checkstyle passes.
Side effect: removing the restart also cuts CI integration (container) runtime (passing runs): TestContainerReportHandling ~153s to ~93s, TestContainerReportHandlingWithHA ~223s to ~176s.

…fix flaky timeout The test marks a non-empty CLOSED container DELETING/DELETED and expected the datanode restart to trigger a full container report carrying CLOSED replicas, which SCM then deletes. During restart the replica is briefly reported as CLOSING, which trips the DELETING/DELETED resurrection path in AbstractContainerReportHandler and moves the container back to CLOSED, so the replicas are never deleted and the wait times out. The restart was only needed to force a timely report under the old 60-minute report interval. With the report interval now at 1s, the periodic full container report delivers the CLOSED replicas and triggers deletion, so the restart is removed. The same change is applied to TestContainerReportHandlingWithHA.

adoroszlai · 2026-06-27T05:34:55Z

flaky-test-check run: 100 runs (10 splits x 10 iterations) of all methods in both classes, all green.

Looks like HDDS-15589 introduced a bug in flaky-test-check where splits are not marked as failed. Workflow summary still shows 3 splits exited with error code 1 (i.e. failure):

So the patch may need more work. To fix the workflow, can you please try this change?

diff --git .github/workflows/intermittent-test-check.yml .github/workflows/intermittent-test-check.yml
index d6f1f9ef0c..bc208141d8 100644
--- .github/workflows/intermittent-test-check.yml
+++ .github/workflows/intermittent-test-check.yml
@@ -213,7 +213,6 @@ jobs:
             set -x
             hadoop-ozone/dev-support/checks/junit.sh $args -Dtest="$TEST_CLASS#$TEST_METHOD,Abstract*Test*\$*"
           fi
-        continue-on-error: true
         env:
           DEVELOCITY_ACCESS_KEY: ${{ secrets.DEVELOCITY_ACCESS_KEY }}
           repo_path: ${{ steps.download-ozone-repo.outputs.download-path }}

Sorry for the trouble.

chihsuan · 2026-06-27T06:35:46Z

So the patch may need more work. To fix the workflow, can you please try this change?

Thanks for catching this, and for the workflow fix! I'm investigating the real remaining cause now.

…urrection race

chihsuan · 2026-06-27T12:21:08Z

@adoroszlai Ready for another look! 🙏 You were right, my patch was incomplete.

I traced the root cause. Removing the restart wasn't enough. waitForContainerStateInSCM(CLOSED) returns once the first replica is CLOSED, so the test deletes while others are still CLOSING, and a lagging CLOSING report resurrects the container → 0 deletes → timeout.

I've updated the PR to now wait until all replicas are CLOSED in SCM before deleting. I verified it with the masking removed: 150 runs, both classes, all green. Full write-up is in the PR description.

chihsuan mentioned this pull request Jun 26, 2026

HDDS-14774. Fix intermittent timeout in TestContainerReportHandling #10535

Merged

adoroszlai added the test label Jun 27, 2026

chihsuan added 2 commits June 27, 2026 16:18

HDDS-14774. Wait for all replicas CLOSED before deleting to avoid res…

84f34b3

…urrection race

HDDS-14774. Extract waitForReplicaState helper into TestHelper

42ceabd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-14774. Avoid datanode restart in TestContainerReportHandling to fix flaky timeout#10620

HDDS-14774. Avoid datanode restart in TestContainerReportHandling to fix flaky timeout#10620
chihsuan wants to merge 3 commits into
apache:masterfrom
chihsuan:HDDS-14774-followup-drop-restart

chihsuan commented Jun 26, 2026 •

edited

Loading

Uh oh!

adoroszlai commented Jun 27, 2026 •

edited

Loading

Uh oh!

chihsuan commented Jun 27, 2026

Uh oh!

chihsuan commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

chihsuan commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

adoroszlai commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chihsuan commented Jun 27, 2026

Uh oh!

chihsuan commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chihsuan commented Jun 26, 2026 •

edited

Loading

adoroszlai commented Jun 27, 2026 •

edited

Loading