hack: add workflow-stats tool for analyzing CI step durations #23042

Draft
nirs wants to merge 1 commit into kubernetes:master from nirs:workflow-stats


@nirs nirs commented May 25, 2026

Collaborator

Analyzes GitHub Actions workflow step durations to help set per-step timeouts based on historical data. Computes min, avg, P50, P90, P95, max, and a suggested timeout (3x P95, rounded to minutes) across completed workflow runs.

Features:

  • SQLite cache at ~/.cache/workflow-stats/<owner>/<repo>/stats.db avoids redundant API calls; only new runs are fetched
  • Incremental updates using the latest cached run date
  • Filter by job name (-job), conclusion (-conclusion), branch (-branch)
  • Output as table (default), markdown, CSV, or JSON

Dependencies:

  • google/go-github/v85: GitHub Actions API client for fetching workflow runs and job details
  • modernc.org/sqlite: pure-Go SQLite driver (transpiled from C), chosen over mattn/go-sqlite3 to avoid CGO build dependencies

Example usage

$ go run workflow-stats/workflow_stats.go -workflow "Functional Test"
Fetching runs since 2026-05-24 ... 6 runs (0.7s)

  Step                                                N     Min     Avg     P95     Max  Timeout
  Run Functional Test                               728   2m50s   4m05s   5m21s   9m27s   17m00s
  Build minikube and e2e test binaries              156   1m07s   1m39s   2m09s   2m16s    7m00s
  Set up Rootless Docker (rootless)                  67     42s     48s     56s   1m21s    3m00s
  Update apt-get package index (ubuntu)             373      5s     12s     23s     42s    2m00s
  ...

Related-to #23041
Related-to #23043

@nirs nirs requested a review from medyagh May 25, 2026 18:43
@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nirs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes, approved, and size/XL labels May 25, 2026
Comment thread hack/workflow-stats/workflow_stats.go
@nirs nirs force-pushed the workflow-stats branch from e46e3ec to 8bbd61b Compare May 29, 2026 17:07
@nirs nirs requested a review from medyagh May 29, 2026 17:08
Comment thread hack/workflow-stats/README.md Outdated

@medyagh medyagh left a comment

Member


@nirs this is actually super cool.
One idea (not for this PR, of course):

It would be cool to have an automation job that gathers these logs every 14 days, makes a chart out of them, and adds it to our site:

https://minikube.sigs.k8s.io/docs/benchmarks/

We could potentially have something like this for "Functional Test on Docker", for example: we export the data that we get into a CSV

and make a PR to add it to our site once a month (update functional test benchmarking).

@nirs nirs force-pushed the workflow-stats branch from 8bbd61b to a4f3909 Compare May 29, 2026 21:56
nirs commented May 29, 2026

Collaborator Author

@medyagh Comments addressed:

  • Removed the "Building" section so we don't need to handle ignoring tools
  • Use go run in all the examples
  • Unified flag usage (-workflow instead of --workflow)

@nirs nirs requested a review from medyagh May 29, 2026 21:58
@k8s-ci-robot

Contributor

@nirs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-minikube-docker-crio-linux-x86 | a4f3909 | link | false | /test pull-minikube-docker-crio-linux-x86 |
| pull-minikube-kvm-crio-linux-x86 | a4f3909 | link | false | /test pull-minikube-kvm-crio-linux-x86 |
| pull-minikube-kvm-containerd-linux-x86 | a4f3909 | link | true | /test pull-minikube-kvm-containerd-linux-x86 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

nirs commented May 29, 2026

Collaborator Author

/retest-required

Copilot AI left a comment

Contributor


Pull request overview

Adds a new hack/workflow-stats Go tool that fetches completed GitHub Actions runs via the GitHub API, caches jobs/steps in a local SQLite database (modernc.org/sqlite, pure Go — no CGO), and reports per-step duration statistics (min/avg/p50/p90/p95/max) plus a suggested timeout (p95 × multiplier, rounded up to whole minutes, floor 1 min). Output is available as table/markdown/CSV/JSON to support both human review and an upcoming automated timeout-tuning workflow (#23043).

Changes:

  • New hack/workflow-stats/workflow_stats.go CLI with SQLite-backed incremental fetch and four output formats.
  • New hack/workflow-stats/README.md documenting usage and a worked example of tuning Functional Test timeouts.
  • hack/go.mod/hack/go.sum updates pulling in google/go-github/v85, modernc.org/sqlite, and their transitive deps.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| hack/workflow-stats/workflow_stats.go | Implements the CLI: option parsing, GitHub API fetching, SQLite cache schema and queries, stats computation, and formatted output. |
| hack/workflow-stats/README.md | User-facing documentation with usage examples and a sample workflow-edit diff. |
| hack/go.mod | Adds modernc.org/sqlite direct dep and several indirect deps. |
| hack/go.sum | Checksums for new direct/indirect modules. |


Comment on lines +315 to +332
// run_id PRIMARY KEY: uniqueness + O(1) lookup by run ID (dbCachedRunIDs).
if _, err = db.Exec(`
	CREATE TABLE IF NOT EXISTS runs (
		run_id INTEGER PRIMARY KEY,
		workflow_name TEXT NOT NULL DEFAULT '',
		created_at TEXT NOT NULL DEFAULT ''
	)`); err != nil {
	log.Fatalf("Creating runs table: %v", err)
}

// Optimizes dbLatestRunDate (MAX(created_at) per workflow) and
// dbRunIDsSince (run IDs for a workflow within a date range).
if _, err = db.Exec("CREATE INDEX IF NOT EXISTS idx_runs_workflow_created ON runs (workflow_name, created_at)"); err != nil {
	log.Fatalf("Creating index: %v", err)
}

// PRIMARY KEY (run_id, job_name, step_number): uniqueness + fast lookup
// by run_id prefix for dbCollectDurations (all steps for a set of runs).
Comment on lines +172 to +176
jobs := fetchJobsForRun(ctx, client, opts.Owner, opts.Repo, r.ID)
if jobs == nil {
	continue
}
insertRun(db, r.ID, opts.Workflow, r.CreatedAt, jobs)
Comment on lines +147 to +158
func updateDB(ctx context.Context, client *github.Client, db *sql.DB, opts options) {
	fetchSince := latestRunDate(db, opts.Workflow)
	requestedSince := time.Now().UTC().AddDate(0, 0, -opts.Since)
	if fetchSince.Before(requestedSince) {
		fetchSince = requestedSince
	}

	fmt.Fprintf(os.Stderr, "Fetching runs since %s ...", fetchSince.Format("2006-01-02"))
	t := time.Now()
	wfID := findWorkflowID(ctx, client, opts.Owner, opts.Repo, opts.Workflow)
	runs := fetchRuns(ctx, client, opts.Owner, opts.Repo, wfID, opts.Branch, fetchSince)
	fmt.Fprintf(os.Stderr, " %d runs (%.1fs)\n", len(runs), time.Since(t).Seconds())

Collaborator Author


We need to reconsider the -branch flag; I think we can remove it. This tool should be used on the master branch. We can add a branch option later if we have a real need.

Comment on lines +496 to +499
name := s.Name
if len(name) > 50 {
	name = name[:49] + "…"
}
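
One thing worth noting about the hunk above: len and slicing on a Go string count bytes, so a step name with a multi-byte character at the cut point could be split mid-character. A possible rune-based variant is sketched below; the `truncate` helper is hypothetical and not part of the PR.

```go
package main

import "fmt"

// truncate shortens s to at most limit runes, replacing the tail with "…"
// when it cuts. It counts runes rather than bytes, so multi-byte
// characters are never split. Hypothetical helper for illustration.
func truncate(s string, limit int) string {
	runes := []rune(s)
	if limit < 1 || len(runes) <= limit {
		return s
	}
	return string(runes[:limit-1]) + "…"
}

func main() {
	fmt.Println(truncate("Build minikube and e2e test binaries", 50))
	fmt.Println(truncate("héllo wörld", 6))
}
```

Counting runes still does not equal display width for wide or combining characters, so table alignment could remain slightly off for such names.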
@nirs nirs marked this pull request as draft May 30, 2026 00:33
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress label May 30, 2026
Analyzes GitHub Actions workflow step durations to help set per-step
timeouts based on historical data. Computes min, avg, P50, P90, P95,
max, and a suggested timeout (3x P95, rounded to minutes) across
completed workflow runs.

Features:
- SQLite cache at ~/.cache/workflow-stats/<owner>/<repo>/stats.db
  avoids redundant API calls; only new runs are fetched
- Incremental updates using the latest cached run date
- Filter by job name (-job), conclusion (-conclusion), branch (-branch)
- Output as table (default), markdown, CSV, or JSON

Dependencies:
- google/go-github/v85: GitHub Actions API client for fetching
  workflow runs and job details
- modernc.org/sqlite: pure-Go SQLite driver (transpiled from C),
  chosen over mattn/go-sqlite3 to avoid CGO build dependencies

Example:

    $ go run workflow-stats/workflow_stats.go -workflow "Functional Test"
    Fetching runs since 2026-05-24 ... 6 runs (0.7s)

    Step                                                N     Min     Avg     P95     Max  Timeout
    Run Functional Test                               728   2m50s   4m05s   5m21s   9m27s   17m00s
    Build minikube and e2e test binaries              156   1m07s   1m39s   2m09s   2m16s    7m00s
    Set up Rootless Docker (rootless)                  67     42s     48s     56s   1m21s    3m00s
    Update apt-get package index (ubuntu)             373      5s     12s     23s     42s    2m00s
    ...
@nirs nirs force-pushed the workflow-stats branch from a4f3909 to 1806d06 Compare June 12, 2026 16:35
Build minikube and e2e test binaries                 200   1m07s   1m39s   1m38s   2m07s   2m09s   2m21s   7m00s
Set up Rootless Docker (rootless)                     87     42s     47s     46s     53s     55s   1m01s   3m00s
Run actions/setup-go@4a3601121dd01d1626a1e23e3721…  1132      7s     20s     21s     26s     29s   1m01s   2m00s
Update apt-get package index (ubuntu)                475      5s     11s      8s     20s     23s     48s   2m00s

Collaborator Author


We need to add job name to the table, so we can see the best timeout for each job/step combination.

When using matrix, we may be able to set the timeout in the matrix, and use:

timeout-minutes: ${{ matrix.timeout-minutes }}

The code updating the timeouts can search timeout-minutes in the matrix when it finds timeout-minutes: ${{ matrix.timeout-minutes }}. This way we can fail fast jobs (ubuntu) quickly and wait longer only for slow builds (macos-15-intel).


Labels

  • approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
  • area/testing
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)


4 participants