hack: add workflow-stats tool for analyzing CI step durations #23042
nirs wants to merge 1 commit into
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: nirs
medyagh left a comment:
@nirs this is actually super cool.
Some ideas (not for this PR, of course):
It would be cool to have an automation job that gathers these logs every 14 days, makes a chart out of them, and adds it to our site:
https://minikube.sigs.k8s.io/docs/benchmarks/
For example, for "Functional Test on Docker" we could potentially have something like this:
- Export the data we get into a CSV (how long the Run integration step took)
- Add it to the CSV in our hack folder (timestamp, env name, value)
- Generate a chart with Google Charts, like this one:
  https://minikube.sigs.k8s.io/docs/benchmarks/timetok8s/weekly_benchmark/
- Make a PR once a month to add it to our site (update functional test benchmarking)
@medyagh Comment addressed:
/retest-required
Pull request overview
Adds a new hack/workflow-stats Go tool that fetches completed GitHub Actions runs via the GitHub API, caches jobs/steps in a local SQLite database (modernc.org/sqlite, pure Go — no CGO), and reports per-step duration statistics (min/avg/p50/p90/p95/max) plus a suggested timeout (p95 × multiplier, rounded up to whole minutes, floor 1 min). Output is available as table/markdown/CSV/JSON to support both human review and an upcoming automated timeout-tuning workflow (#23043).
Changes:
- New hack/workflow-stats/workflow_stats.go CLI with SQLite-backed incremental fetch and four output formats.
- New hack/workflow-stats/README.md documenting usage and a worked example of tuning Functional Test timeouts.
- hack/go.mod and hack/go.sum updates pulling in google/go-github/v85, modernc.org/sqlite, and their transitive deps.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| hack/workflow-stats/workflow_stats.go | Implements the CLI: option parsing, GitHub API fetching, SQLite cache schema and queries, stats computation, and formatted output. |
| hack/workflow-stats/README.md | User-facing documentation with usage examples and a sample workflow-edit diff. |
| hack/go.mod | Adds modernc.org/sqlite direct dep and several indirect deps. |
| hack/go.sum | Checksums for new direct/indirect modules. |
// run_id PRIMARY KEY: uniqueness + O(1) lookup by run ID (dbCachedRunIDs).
if _, err = db.Exec(`
    CREATE TABLE IF NOT EXISTS runs (
        run_id INTEGER PRIMARY KEY,
        workflow_name TEXT NOT NULL DEFAULT '',
        created_at TEXT NOT NULL DEFAULT ''
    )`); err != nil {
    log.Fatalf("Creating runs table: %v", err)
}

// Optimizes dbLatestRunDate (MAX(created_at) per workflow) and
// dbRunIDsSince (run IDs for a workflow within a date range).
if _, err = db.Exec("CREATE INDEX IF NOT EXISTS idx_runs_workflow_created ON runs (workflow_name, created_at)"); err != nil {
    log.Fatalf("Creating index: %v", err)
}

// PRIMARY KEY (run_id, job_name, step_number): uniqueness + fast lookup
// by run_id prefix for dbCollectDurations (all steps for a set of runs).
jobs := fetchJobsForRun(ctx, client, opts.Owner, opts.Repo, r.ID)
if jobs == nil {
    continue
}
insertRun(db, r.ID, opts.Workflow, r.CreatedAt, jobs)
func updateDB(ctx context.Context, client *github.Client, db *sql.DB, opts options) {
    fetchSince := latestRunDate(db, opts.Workflow)
    requestedSince := time.Now().UTC().AddDate(0, 0, -opts.Since)
    if fetchSince.Before(requestedSince) {
        fetchSince = requestedSince
    }

    fmt.Fprintf(os.Stderr, "Fetching runs since %s ...", fetchSince.Format("2006-01-02"))
    t := time.Now()
    wfID := findWorkflowID(ctx, client, opts.Owner, opts.Repo, opts.Workflow)
    runs := fetchRuns(ctx, client, opts.Owner, opts.Repo, wfID, opts.Branch, fetchSince)
    fmt.Fprintf(os.Stderr, " %d runs (%.1fs)\n", len(runs), time.Since(t).Seconds())
We need to reconsider the -branch flag; I think we can remove it. This tool should be used on the master branch. We can add a branch option later if we have a real need.
name := s.Name
if len(name) > 50 {
    name = name[:49] + "…"
}
Analyzes GitHub Actions workflow step durations to help set per-step
timeouts based on historical data. Computes min, avg, P50, P90, P95,
max, and a suggested timeout (3x P95, rounded to minutes) across
completed workflow runs.
Features:
- SQLite cache at ~/.cache/workflow-stats/<owner>/<repo>/stats.db
avoids redundant API calls; only new runs are fetched
- Incremental updates using the latest cached run date
- Filter by job name (-job), conclusion (-conclusion), branch (-branch)
- Output as table (default), markdown, CSV, or JSON
Dependencies:
- google/go-github/v85: GitHub Actions API client for fetching
workflow runs and job details
- modernc.org/sqlite: pure-Go SQLite driver (transpiled from C),
chosen over mattn/go-sqlite3 to avoid CGO build dependencies
Example:
$ go run workflow-stats/workflow_stats.go -workflow "Functional Test"
Fetching runs since 2026-05-24 ... 6 runs (0.7s)
Step                                     N    Min    Avg    P95    Max  Timeout
Run Functional Test                    728  2m50s  4m05s  5m21s  9m27s   17m00s
Build minikube and e2e test binaries   156  1m07s  1m39s  2m09s  2m16s    7m00s
Set up Rootless Docker (rootless)       67    42s    48s    56s  1m21s    3m00s
Update apt-get package index (ubuntu)  373     5s    12s    23s    42s    2m00s
...
Build minikube and e2e test binaries                 200  1m07s  1m39s  1m38s  2m07s  2m09s  2m21s  7m00s
Set up Rootless Docker (rootless)                     87    42s    47s    46s    53s    55s  1m01s  3m00s
Run actions/setup-go@4a3601121dd01d1626a1e23e3721…  1132     7s    20s    21s    26s    29s  1m01s  2m00s
Update apt-get package index (ubuntu)                475     5s    11s     8s    20s    23s    48s  2m00s
We need to add the job name to the table, so we can see the best timeout for each job/step combination.
When using a matrix, we may be able to set the timeout in the matrix, and use:
timeout-minutes: ${{ matrix.timeout-minutes }}
The code updating the timeouts can search for timeout-minutes in the matrix when it finds timeout-minutes: ${{ matrix.timeout-minutes }}. This way we can fail fast jobs (ubuntu) quickly and wait longer only for slow builds (macos-15-intel).
Related-to #23041
Related-to #23043