Feat: getNextAliveNode() to failover to alternate endpoints on retryable failures. #2880
kishansinghifs1 wants to merge 4 commits into
Repository collaborators can run the JMH benchmark suite against this PR by commenting: Optional regression threshold override (Δ% on Time or Alloc/op; defaults to 10%): Only one benchmark run per PR is active at a time — issuing a new
…ing and selection support
@kishansinghifs1
…tine for client-v2
0e59d37 to 881ecfb
…nstead of throwing ServerException
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f4a17cd. Configure here.
Hi @chernser, Thank you for your response! Sorry for taking so long to get back to you; I got busy with my midterms. I have resolved the merge conflict. Also, could you please suggest some issues that I could pick up next? I'd really appreciate your suggestions and would be happy to work on them as well. Thanks again!
@kishansinghifs1 Regarding this feature - it will go to 0.11.0 release because 0.10.0 is already packed. So we have time. Thanks a lot for contributing!

Summary
EndpointState.java (new) — Tracks if a server is healthy or blocked.
ClientNodeSelector.java (new) — Picks the first healthy server from the list.
Client.java (modified) — Uses the new selector instead of always picking server 1.
ServerException.java (modified) — HTTP 503 now triggers failover.
ClientNodeSelectorTest.java (new) — 5 unit tests for the selector logic.
ClientFailoverTest.java (new) — 8 integration tests for real failover.
Closes #2855
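As a rough illustration of how a 503 can drive failover in a retry loop, here is a minimal sketch. It is not the code in Client.java: FailoverRetrySketch, sendWithFailover, and isRetryable are hypothetical names, the endpoint list is assumed to be non-empty, only 503 is modeled as retryable (the real ServerException may cover more cases), and the next endpoint is chosen by simple rotation rather than by the PR's selector.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.ToIntFunction;

/** Hypothetical sketch of a retry loop that moves to another endpoint on HTTP 503. */
final class FailoverRetrySketch {

    /** Assumption: only 503 is modeled as retryable here. */
    static boolean isRetryable(int status) {
        return status == 503;
    }

    /**
     * Sends a request via {@code send}, which returns an HTTP status code for an endpoint.
     * Returns the endpoint that produced a non-retryable status, or empty if every
     * attempt ended in a retryable status.
     */
    static Optional<String> sendWithFailover(List<String> endpoints,
                                             int maxAttempts,
                                             ToIntFunction<String> send) {
        int index = 0; // start with the first configured endpoint
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String endpoint = endpoints.get(index);
            int status = send.applyAsInt(endpoint);
            if (!isRetryable(status)) {
                return Optional.of(endpoint);
            }
            // Retryable failure: rotate to the next endpoint for the following attempt.
            index = (index + 1) % endpoints.size();
        }
        return Optional.empty();
    }
}
```

Returning the status instead of throwing keeps the loop free of exception handling, which mirrors the idea of 503 being handed back to the retry path rather than raised immediately.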
Note
Medium Risk
Changes core HTTP routing for all multi-endpoint clients and retry semantics (503, quarantine timing); behavior shifts when backups are configured but is covered by new tests.
Overview
client-v2 now fails over across multiple configured endpoints instead of always using the first URL. A new ClientNodeSelector picks the first non-quarantined endpoint in registration order (primary affinity when healthy), marks failed endpoints quarantined for 30s on retry, and falls back to the primary if every node is quarantined.
Query, insert, and stream insert retry paths call getEndpoint() for the initial attempt and getNextAliveNode(failed) after retryable errors; HTTP 503 is returned without throwing in HttpAPIClientHelper and is treated as retryable in ServerException. The builder keeps endpoint order and deduplicates via LinkedHashSet.
CHANGELOG documents the feature; unit and integration tests cover selector behavior and real failover (including 503).
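The selection rules described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's ClientNodeSelector: the class name NodeSelectorSketch, the use of plain strings for endpoints, the blockedUntil map, and the injectable clock are all hypothetical; only the method names getEndpoint() and getNextAliveNode(failed), the 30s quarantine, the registration-order preference, the primary fallback, and the LinkedHashSet deduplication come from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of quarantine-based endpoint selection. */
final class NodeSelectorSketch {
    static final long QUARANTINE_MILLIS = 30_000L; // 30s, as stated in the overview

    private final List<String> endpoints; // registration order, duplicates dropped
    private final Map<String, Long> blockedUntil = new ConcurrentHashMap<>();
    private final LongSupplier clock;     // assumption: injectable so time can be controlled

    NodeSelectorSketch(List<String> configured, LongSupplier clock) {
        if (configured.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        // LinkedHashSet removes duplicates while keeping first-seen order.
        this.endpoints = new ArrayList<>(new LinkedHashSet<>(configured));
        this.clock = clock;
    }

    /** First endpoint whose quarantine has expired, or the primary if all are quarantined. */
    String getEndpoint() {
        long now = clock.getAsLong();
        for (String endpoint : endpoints) {
            Long until = blockedUntil.get(endpoint);
            if (until == null || until <= now) {
                return endpoint;
            }
        }
        return endpoints.get(0);
    }

    /** Quarantines the failed endpoint, then selects again. */
    String getNextAliveNode(String failed) {
        blockedUntil.put(failed, clock.getAsLong() + QUARANTINE_MILLIS);
        return getEndpoint();
    }
}
```

Because selection always scans from the start of the list, traffic returns to the primary as soon as its quarantine expires, which is one way to read "primary affinity when healthy".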
Reviewed by Cursor Bugbot for commit 790cdeb. Bugbot is set up for automated code reviews on this repo. Configure here.