Backport of #104881 to 25.8: Use explicit flag for secondary on cluster queries#1875
Backport of #104881 to 25.8: Use explicit flag for secondary on cluster queries#1875ianton-ru wants to merge 1 commit into
Conversation
…ry-on-cluster Use explicit flag for secondary on cluster queries
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2fa783dc77
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| query_context->makeQueryContext(); | ||
| query_context->setCurrentQueryId(""); // generate random query_id | ||
| query_context->setQueryKind(ClientInfo::QueryKind::SECONDARY_QUERY); | ||
| query_context->setDDLOrOnClusterInternal(true); |
There was a problem hiding this comment.
Preserve worker mode for cluster table functions
When a DDL worker executes CREATE ... ON CLUSTER ... AS SELECT ... FROM fileCluster/urlCluster/s3Cluster, this context is no longer marked as SECONDARY_QUERY, while the *Cluster table functions still use query_kind == SECONDARY_QUERY to choose their worker-local storage path (for example TableFunctionFileCluster.cpp:23 and TableFunctionURLCluster.cpp:13). As a result, each DDL worker can fan out to the whole cluster again instead of reading only its assigned local input, multiplying remote reads and potentially duplicating the data inserted by CTAS on every replica; keep the secondary query kind for execution semantics or update those table-function checks to also recognize this new internal DDL flag.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch, but upstream master branch still uses query_kind == SECONDARY_QUERY, so this issue also there.
There was a problem hiding this comment.
the original PR author stated it was irrelevant to ddl
|
Failed tests looks as unstable, unrelated to PR |
mkmkme
left a comment
There was a problem hiding this comment.
LGTM, pretty much 1-to-1 with the original PR
| query_context->makeQueryContext(); | ||
| query_context->setCurrentQueryId(""); // generate random query_id | ||
| query_context->setQueryKind(ClientInfo::QueryKind::SECONDARY_QUERY); | ||
| query_context->setDDLOrOnClusterInternal(true); |
There was a problem hiding this comment.
the original PR author stated it was irrelevant to ddl
Audit update for PR #1875
PR: Altinity/ClickHouse#1875 — Backport of ClickHouse#104881 to 25.8: Use explicit flag for secondary on cluster queries Confirmed defectsHigh —
|
|
Tables can't be created with cluster table functions Can be created as but this is very strange action. And the same bug be in upstream master, latest commit b376a13 This works: This doesn't: |
|
I think we can live with this while upstream can. |
|
ClickHouse#107057 |
PR #1875 CI Verification — Backport: Use explicit flag for secondary on cluster queries
VerdictNo PR-caused regressions identified. All failing tests are either pre-existing flakes, cascade failures from a flaky test that hung the server, or infrastructure timeouts. The PR can be merged after a green rerun of the two affected jobs.
Job status summary
Failing checks (current SHA
|
| Test | Type |
|---|---|
02907_backup_restore_flatten_nested |
actual test |
Some queries hung |
meta / cascade |
Killed by signal (in clickhouse-server.log…) |
meta / cascade |
Fatal messages (in clickhouse-server.log…) |
meta / cascade |
The latter three are server-crash cascade markers triggered by the first test hanging. Root cause is 02907_backup_restore_flatten_nested.
Pre-existing flakiness of 02907_backup_restore_flatten_nested:
- Failures recorded on master (
pull_request_number = 0) and on many unrelated PRs going back to 2025-06-18.
Verdict: pre-existing flaky test causing cascade. Rerun recommended to confirm; do not block the PR on this run.
Failing checks reported in DB but green on GitHub (rerun cleared)
These show one FAIL row in the CI database from an earlier execution but the GitHub job conclusion is success at the latest run on the same SHA:
| Job | DB row test | GitHub conclusion |
|---|---|---|
Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel) |
03707_set_index_bad_get_null_bug |
success |
Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, parallel) |
03579_create_table_populate_from_s3 |
success |
Flakiness check:
03707_set_index_bad_get_null_bug: 37 failures across 17 unrelated PRs in the last 60 days → known flake.03579_create_table_populate_from_s3: 1 failure ever (this PR), but the rerun on the same SHA passed and the test passes consistently across many other PRs (1869, 1896, 1867, master) on the sameDatabaseReplicatedconfig.
Verdict: both are flaky / one-off; already green at HEAD.
Other unrelated check
- DCO (
ACTION_REQUIRED): commit needs aSigned-off-bytrailer. Author action, not a CI failure.
Recommendations
- Author: add
Signed-off-bytrailer to satisfy DCO. - Rerun
Build (amd_release)(timeout) andStateless tests (amd_debug, distributed plan, s3 storage, parallel)(flaky02907_backup_restore_flatten_nested). - If
02907_backup_restore_flatten_nestedfails again on rerun in the same configuration, investigate further — but historical evidence makes a regression unlikely.
Backport of ClickHouse#104881 by @tavplubix
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Use an explicit flag in Context for secondary DDL/ON CLUSTER queries instead of
SECONDARY_QUERYCI/CD Options
Exclude tests:
Regression jobs to run: