HDDS-14774. Avoid datanode restart in TestContainerReportHandling to fix flaky timeout #10620
chihsuan wants to merge 3 commits into
…fix flaky timeout

The test marks a non-empty CLOSED container DELETING/DELETED and expected the datanode restart to trigger a full container report carrying CLOSED replicas, which SCM then deletes. During restart the replica is briefly reported as CLOSING, which trips the DELETING/DELETED resurrection path in AbstractContainerReportHandler and moves the container back to CLOSED, so the replicas are never deleted and the wait times out.

The restart was only needed to force a timely report under the old 60-minute report interval. With the report interval now at 1s, the periodic full container report delivers the CLOSED replicas and triggers deletion, so the restart is removed.

The same change is applied to TestContainerReportHandlingWithHA.
Looks like HDDS-15589 introduced a bug in
So the patch may need more work. To fix the workflow, can you please try this change?

diff --git .github/workflows/intermittent-test-check.yml .github/workflows/intermittent-test-check.yml
index d6f1f9ef0c..bc208141d8 100644
--- .github/workflows/intermittent-test-check.yml
+++ .github/workflows/intermittent-test-check.yml
@@ -213,7 +213,6 @@ jobs:
set -x
hadoop-ozone/dev-support/checks/junit.sh $args -Dtest="$TEST_CLASS#$TEST_METHOD,Abstract*Test*\$*"
fi
- continue-on-error: true
env:
DEVELOCITY_ACCESS_KEY: ${{ secrets.DEVELOCITY_ACCESS_KEY }}
repo_path: ${{ steps.download-ozone-repo.outputs.download-path }}

Sorry for the trouble.
Thanks for catching this, and for the workflow fix! I'm investigating the real remaining cause now.

@adoroszlai Ready for another look! 🙏 You were right, my patch was incomplete. I traced the root cause. Removing the restart wasn't enough. I've updated the PR to now wait until all replicas are
What changes were proposed in this pull request?
After the earlier fix (#10535) set hdds.container.report.interval to 1s, TestContainerReportHandling (and its HA variant) still timed out intermittently on CI.

Problem. The test marks a freshly-closed, non-empty container DELETING/DELETED and waits for SCM to delete the replicas. It intermittently times out because the container is resurrected out of the deletable state: SCM's safety net (AbstractContainerReportHandler, the HDDS-11136 / HDDS-12421 path) treats a non-empty, non-CLOSED replica on a DELETING/DELETED container as a reason to move it back to QUASI_CLOSED/CLOSED, after which no delete command is issued and the 180s wait times out.

A transient CLOSING replica report can reach SCM after the container is already DELETING/DELETED in two ways:

waitForContainerStateInSCM(CLOSED) returns as soon as the first RATIS replica is reported CLOSED, but the other replicas may still be CLOSING in SCM. The test then immediately forces DELETING/DELETED, and a lagging replica's in-flight CLOSING report resurrects the container.

Evidence (from a repro with the flaky-test-check split-failure masking removed, so timeouts surface): the failing iteration logs "Resurrecting container #1 from DELETED to QUASI_CLOSED due to non-empty CLOSING replica" and issues zero delete commands for that container, while passing iterations issue several.

Fix.

Both changes are applied to TestContainerReportHandling and TestContainerReportHandlingWithHA.

What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14774
How was this patch tested?
checkstyle passes.
integration (container) runtime (passing runs): TestContainerReportHandling ~153s to ~93s, TestContainerReportHandlingWithHA ~223s to ~176s.
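
For readers who want the resurrection rule from the "Problem" paragraph in executable form, here is a minimal, self-contained sketch. It only models the condition as this description words it (a non-empty, non-CLOSED replica reported for a DELETING/DELETED container). The class `ResurrectionSketch`, its `State` enum, and the method `resurrects` are hypothetical names made up for illustration; this is not the code in AbstractContainerReportHandler, and it deliberately does not model which state (QUASI_CLOSED or CLOSED) the container is moved back to.

```java
// Hypothetical, simplified model of the rule described in this PR.
// Not the actual Ozone implementation.
final class ResurrectionSketch {

    // Only the states mentioned in this PR description.
    enum State { CLOSING, QUASI_CLOSED, CLOSED, DELETING, DELETED }

    /**
     * Returns true if, per the description above, a replica report would
     * pull the container back out of the deletable state.
     */
    static boolean resurrects(State container, State replica, boolean replicaEmpty) {
        boolean deletable = container == State.DELETING || container == State.DELETED;
        // Assumption taken from the PR text: any non-empty, non-CLOSED replica
        // on a DELETING/DELETED container triggers the safety net.
        return deletable && !replicaEmpty && replica != State.CLOSED;
    }

    public static void main(String[] args) {
        // Lagging replica still CLOSING when the container is already DELETED.
        System.out.println(resurrects(State.DELETED, State.CLOSING, false)); // prints "true"
        // Replica already CLOSED: no resurrection in this model.
        System.out.println(resurrects(State.DELETED, State.CLOSED, false));  // prints "false"
    }
}
```

In this model the failing iteration corresponds to the first call: a single in-flight CLOSING report arriving after the container was forced to DELETED is enough to trip the rule.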