Skip to content

HDDS-14774. Avoid datanode restart in TestContainerReportHandling to fix flaky timeout#10620

Open
chihsuan wants to merge 3 commits into
apache:masterfrom
chihsuan:HDDS-14774-followup-drop-restart
Open

HDDS-14774. Avoid datanode restart in TestContainerReportHandling to fix flaky timeout#10620
chihsuan wants to merge 3 commits into
apache:masterfrom
chihsuan:HDDS-14774-followup-drop-restart

Conversation

@chihsuan

@chihsuan chihsuan commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

After the earlier fix (#10535) set hdds.container.report.interval to 1s, TestContainerReportHandling (and its HA variant) still timed out intermittently on CI.

Problem. The test marks a freshly-closed, non-empty container DELETING/DELETED and waits for SCM to delete the replicas. It intermittently times out because the container is resurrected out of the deletable state: SCM's safety net (AbstractContainerReportHandler, the HDDS-11136 / HDDS-12421 path) treats a non-empty, non-CLOSED replica on a DELETING/DELETED container as a reason to move it back to QUASI_CLOSED/CLOSED, after which no delete command is issued and the 180s wait times out.

A transient CLOSING replica report can reach SCM after the container is already DELETING/DELETED in two ways:

  1. Datanode restart. The test restarted all datanodes after deleting; during restart a replica is briefly reported CLOSING. The restart was only ever a workaround to force a full container report under the old 60-minute report interval (HDDS-11136. Some containers affected by HDDS-8129 may still be in the DELETING state incorrectly #6967); with the interval now at 1s, the periodic full report already delivers the CLOSED replicas, so the restart is obsolete.
  2. Close-vs-delete race. waitForContainerStateInSCM(CLOSED) returns as soon as the first RATIS replica is reported CLOSED, but the other replicas may still be CLOSING in SCM. The test then immediately forces DELETING/DELETED, and a lagging replica's in-flight CLOSING report resurrects the container.

Evidence (from a repro with the flaky-test-check split-failure masking removed, so timeouts surface): the failing iteration logs Resurrecting container #1 from DELETED to QUASI_CLOSED due to non-empty CLOSING replica and issues zero delete commands for that container, while passing iterations issue several.

Fix.

  1. Remove the datanode restart (obsolete now that the report interval is 1s).
  2. Before moving the container to DELETING, wait until SCM reports all replicas CLOSED, not just the aggregate container state. Once every replica is CLOSED in SCM, no lagging CLOSING report can resurrect the container.

Both changes are applied to TestContainerReportHandling and TestContainerReportHandlingWithHA.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14774

How was this patch tested?

  • flaky-test-check (10 splits x 15 iterations, both classes), all green, verified two ways:
    • un-masked workflow run: the split-failure masking is removed so a timeout actually fails the split (the default flaky-test-check currently reports green even when a split times out).
    • on-branch run: branch as-is; all 300 class executions pass, confirmed by parsing the split logs (its badge is green regardless due to the masking).
  • Ran both test classes locally across several iterations; all parameters pass.
  • checkstyle passes.
  • Side effect: removing the restart also cuts CI integration (container) runtime (passing runs): TestContainerReportHandling ~153s to ~93s, TestContainerReportHandlingWithHA ~223s to ~176s.

…fix flaky timeout

The test marks a non-empty CLOSED container DELETING/DELETED and expected the
datanode restart to trigger a full container report carrying CLOSED replicas,
which SCM then deletes. During restart the replica is briefly reported as
CLOSING, which trips the DELETING/DELETED resurrection path in
AbstractContainerReportHandler and moves the container back to CLOSED, so the
replicas are never deleted and the wait times out.

The restart was only needed to force a timely report under the old 60-minute
report interval. With the report interval now at 1s, the periodic full
container report delivers the CLOSED replicas and triggers deletion, so the
restart is removed. The same change is applied to TestContainerReportHandlingWithHA.
@adoroszlai

adoroszlai commented Jun 27, 2026

Copy link
Copy Markdown
Contributor
  • flaky-test-check run: 100 runs (10 splits x 10 iterations) of all methods in both classes, all green.

Looks like HDDS-15589 introduced a bug in flaky-test-check where splits are not marked as failed. Workflow summary still shows 3 splits exited with error code 1 (i.e. failure):

So the patch may need more work. To fix the workflow, can you please try this change?

diff --git .github/workflows/intermittent-test-check.yml .github/workflows/intermittent-test-check.yml
index d6f1f9ef0c..bc208141d8 100644
--- .github/workflows/intermittent-test-check.yml
+++ .github/workflows/intermittent-test-check.yml
@@ -213,7 +213,6 @@ jobs:
             set -x
             hadoop-ozone/dev-support/checks/junit.sh $args -Dtest="$TEST_CLASS#$TEST_METHOD,Abstract*Test*\$*"
           fi
-        continue-on-error: true
         env:
           DEVELOCITY_ACCESS_KEY: ${{ secrets.DEVELOCITY_ACCESS_KEY }}
           repo_path: ${{ steps.download-ozone-repo.outputs.download-path }}

Sorry for the trouble.

@chihsuan

Copy link
Copy Markdown
Contributor Author

So the patch may need more work. To fix the workflow, can you please try this change?

Thanks for catching this, and for the workflow fix! I'm investigating the real remaining cause now.

@chihsuan

Copy link
Copy Markdown
Contributor Author

@adoroszlai Ready for another look! 🙏 You were right, my patch was incomplete.

I traced the root cause. Removing the restart wasn't enough. waitForContainerStateInSCM(CLOSED) returns once the first replica is CLOSED, so the test deletes while others are still CLOSING, and a lagging CLOSING report resurrects the container → 0 deletes → timeout.

I've updated the PR to now wait until all replicas are CLOSED in SCM before deleting. I verified it with the masking removed: 150 runs, both classes, all green. Full write-up is in the PR description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants