
Feat: getNextAliveNode() to failover to alternate endpoints on retryable failures. #2880

Open
kishansinghifs1 wants to merge 4 commits into ClickHouse:main from kishansinghifs1:feature/getNextAliveNode

Conversation

@kishansinghifs1

@kishansinghifs1 kishansinghifs1 commented Jun 16, 2026


Summary

EndpointState.java (new) — Tracks if a server is healthy or blocked.
ClientNodeSelector.java (new) — Picks the first healthy server from the list.
Client.java (modified) — Uses the new selector instead of always picking server 1.
ServerException.java (modified) — HTTP 503 now triggers failover.
ClientNodeSelectorTest.java (new) — 5 unit tests for the selector logic.
ClientFailoverTest.java (new) — 8 integration tests for real failover.

Closes #2855

Checklist

Delete items not relevant to your PR:


Note

Medium Risk
Changes core HTTP routing for all multi-endpoint clients and retry semantics (503, quarantine timing); behavior shifts when backups are configured but is covered by new tests.

Overview
client-v2 now fails over across multiple configured endpoints instead of always using the first URL. A new ClientNodeSelector picks the first non-quarantined endpoint in registration order (primary affinity when healthy), marks failed endpoints quarantined for 30s on retry, and falls back to the primary if every node is quarantined.
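
The selection rule described above can be sketched as follows. This is a minimal illustration, not the PR's actual `ClientNodeSelector` or `EndpointState`: the class name `NodeSelectorSketch`, the use of plain strings for endpoints, and the injectable clock are assumptions made so the example is self-contained. Only the behavior (first non-quarantined endpoint in registration order, 30s quarantine, fallback to the primary) comes from the overview.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the selection rule; not the PR's ClientNodeSelector. */
class NodeSelectorSketch {
    // 30s quarantine, as stated in the overview.
    static final long QUARANTINE_MILLIS = 30_000L;

    // Endpoints in registration order with duplicates dropped.
    private final List<String> endpoints;
    // Endpoint -> timestamp (ms) until which it is considered quarantined.
    private final Map<String, Long> blockedUntil = new ConcurrentHashMap<>();
    // Injectable clock (assumption) so the behavior can be checked without sleeping.
    private final LongSupplier clock;

    NodeSelectorSketch(List<String> configured, LongSupplier clock) {
        if (configured.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        this.endpoints = new ArrayList<>(new LinkedHashSet<>(configured));
        this.clock = clock;
    }

    /** Returns the first endpoint that is not quarantined, or the primary if all are. */
    String getEndpoint() {
        long now = clock.getAsLong();
        for (String endpoint : endpoints) {
            Long until = blockedUntil.get(endpoint);
            if (until == null || until <= now) {
                return endpoint;
            }
        }
        return endpoints.get(0);
    }

    /** Quarantines the failed endpoint, then selects again. */
    String getNextAliveNode(String failed) {
        blockedUntil.put(failed, clock.getAsLong() + QUARANTINE_MILLIS);
        return getEndpoint();
    }
}
```

Because selection always restarts from the head of the list, the primary is chosen again as soon as its quarantine expires, which is the "primary affinity" mentioned above.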

Query, insert, and stream insert retry paths call getEndpoint() for the initial attempt and getNextAliveNode(failed) after retryable errors; HTTP 503 is returned without throwing in HttpAPIClientHelper and is treated as retryable in ServerException. The builder keeps endpoint order and deduplicates via LinkedHashSet.
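
A rough sketch of the retry flow described above, under stated assumptions: `FailoverRetrySketch`, `sendWithFailover`, and the functional parameters standing in for the selector and the HTTP transport are hypothetical names, not the client's API. Only HTTP 503 is modelled as retryable here; any other retryable conditions in the real client are left out.

```java
import java.util.function.ToIntFunction;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of a failover retry loop; not the code in Client.java. */
class FailoverRetrySketch {
    static final int HTTP_OK = 200;
    static final int HTTP_SERVICE_UNAVAILABLE = 503;

    /** Only 503 is treated as retryable in this sketch. */
    static boolean isRetryable(int status) {
        return status == HTTP_SERVICE_UNAVAILABLE;
    }

    /**
     * Sends to the initial endpoint, then asks nextAliveNode for a replacement
     * after each retryable failure.
     *
     * @param initial       endpoint for the first attempt (stands in for getEndpoint())
     * @param nextAliveNode maps a failed endpoint to the next one to try
     * @param transport     stand-in for the HTTP call; returns a status code
     * @param maxRetries    number of retries allowed after the first attempt
     * @return the endpoint that answered with 200
     */
    static String sendWithFailover(String initial,
                                   UnaryOperator<String> nextAliveNode,
                                   ToIntFunction<String> transport,
                                   int maxRetries) {
        String endpoint = initial;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            int status = transport.applyAsInt(endpoint);
            if (status == HTTP_OK) {
                return endpoint;
            }
            if (!isRetryable(status)) {
                throw new IllegalStateException(
                        "non-retryable status " + status + " from " + endpoint);
            }
            if (attempt < maxRetries) {
                // Switch endpoints only if another attempt will follow.
                endpoint = nextAliveNode.apply(endpoint);
            }
        }
        throw new IllegalStateException("retries exhausted; last endpoint " + endpoint);
    }
}
```

Returning the 503 as a status value rather than throwing at the transport layer mirrors the description above, where the decision to retry is made in one place by the caller.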

CHANGELOG documents the feature; unit and integration tests cover selector behavior and real failover (including 503).

Reviewed by Cursor Bugbot for commit 790cdeb. Bugbot is set up for automated code reviews on this repo.

@github-actions


Repository collaborators can run the JMH benchmark suite against this PR by commenting:

/benchmark

Optional regression threshold override (Δ% on Time or Alloc/op; defaults to 10%):

/benchmark threshold=15

Only one benchmark run per PR is active at a time — issuing a new /benchmark comment cancels the previous run. After the run finishes a separate comment will be posted comparing it against the latest scheduled run on main; the PR check fails if any benchmark regresses by more than the threshold.

@chernser

Contributor

@kishansinghifs1
Thank you for the contribution!
Please resolve merge conflicts so we can run CI.

@kishansinghifs1 kishansinghifs1 force-pushed the feature/getNextAliveNode branch from 0e59d37 to 881ecfb Compare June 16, 2026 16:53
@CLAassistant

CLAassistant commented Jun 16, 2026


CLA assistant check
All committers have signed the CLA.

Comment thread client-v2/src/main/java/com/clickhouse/client/api/Client.java

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.


Reviewed by Cursor Bugbot for commit f4a17cd.

Comment thread client-v2/src/main/java/com/clickhouse/client/api/Client.java
@kishansinghifs1

Author

Hi @chernser,

Thank you for your response! Sorry for taking so long to get back to you—I got busy with my midterms.

I have resolved the merge conflict. Also, could you please suggest the next issues I could pick up? I'd really appreciate your suggestions and would be happy to work on them as well.

Thanks again!

@chernser

Contributor

@kishansinghifs1
No worries. We expect that it takes time because everyone has other work to do.
If something is urgent, we state it clearly.

Regarding this feature: it will go into the 0.11.0 release because 0.10.0 is already packed. So we have time.

Thanks a lot for contributing!


Development

Successfully merging this pull request may close these issues.

[client-v2] getNextAliveNode() always returns endpoints.get(0) — retries never fail over to alternate endpoints

3 participants