Fm/stg 956 make ci faster #1246

filip-michalsky · 2025-11-07T18:53:52Z

why

CI was slow due to sequential eval jobs, repeated builds, and inefficient dependency management.

what changed

Parallel Execution:

Eval jobs now run in parallel instead of sequentially (regression → all others in parallel)
Reduces total CI time from ~25 mins to ~12 mins

Build Caching:

Build artifacts uploaded once and reused by all eval jobs
Eliminates 6 redundant builds (~2-3 min savings each)
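
The fan-out and artifact reuse described above might look roughly like the fragment below. This is a sketch, not the workflow from this PR: the job ids are taken from the review comments on this PR, while the artifact name `build-artifacts`, the `dist` output paths, and the `pnpm run build` script are assumptions made for illustration. The `run-regression-evals` and `determine-evals` jobs are left out.

```yaml
jobs:
  run-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # pnpm setup and install steps omitted for brevity
      - run: pnpm run build
      - uses: actions/upload-artifact@v4
        with:
          # Assumed artifact name and output directories
          name: build-artifacts
          path: |
            packages/core/dist
            packages/evals/dist
          retention-days: 1

  # Every eval job declares the same needs, so they all become eligible
  # at the same time once regression and the build have finished.
  run-act-evals:
    needs: [run-regression-evals, run-build, determine-evals]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: build-artifacts
          # The artifact root is the common ancestor of the uploaded paths,
          # so this restores packages/core/dist and packages/evals/dist.
          path: packages
      # install dependencies and run the act evals here
```

The other eval jobs (combination, extract, observe, and so on) would repeat the same `needs` list and download step, which is what lets them skip their own build.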

Smart Skipping:

skip-evals label: skip all evals
skip-regression-evals label: skip regression only
Auto-skip for docs-only changes (not on main)
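
One way the skip rules above could be computed is in a single job that exposes flags to the eval jobs. The sketch below is an assumption about the shape of that job, not the PR's code: the job id `determine-changes`, the step id `flags`, and the output names `skip-all-evals` and `skip-regression` are hypothetical, while the label names and the `docs-only` output come from this PR.

```yaml
  determine-evals:
    needs: [determine-changes]
    runs-on: ubuntu-latest
    outputs:
      skip-all-evals: ${{ steps.flags.outputs.skip-all-evals }}
      skip-regression: ${{ steps.flags.outputs.skip-regression }}
    steps:
      - id: flags
        env:
          # contains() yields false on non-PR events, where there are no labels
          SKIP_EVALS_LABEL: ${{ contains(github.event.pull_request.labels.*.name, 'skip-evals') }}
          SKIP_REGRESSION_LABEL: ${{ contains(github.event.pull_request.labels.*.name, 'skip-regression-evals') }}
          DOCS_ONLY: ${{ needs.determine-changes.outputs.docs-only }}
          IS_MAIN: ${{ github.ref == 'refs/heads/main' }}
        run: |
          skip_all=false
          # The skip-evals label skips every eval job
          if [ "$SKIP_EVALS_LABEL" = "true" ]; then skip_all=true; fi
          # Docs-only changes skip evals, except on main
          if [ "$DOCS_ONLY" = "true" ] && [ "$IS_MAIN" != "true" ]; then skip_all=true; fi
          echo "skip-all-evals=$skip_all" >> "$GITHUB_OUTPUT"
          echo "skip-regression=$SKIP_REGRESSION_LABEL" >> "$GITHUB_OUTPUT"
```

Each eval job could then test these outputs in its own `if:` condition.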

Dependency Optimization:

Use pnpm/action-setup@v4 with caching
Use --frozen-lockfile for deterministic installs
Remove unnecessary node_modules cleanup
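
A minimal sketch of the install steps these bullets describe, assuming the pnpm version is read from the `packageManager` field in `package.json` and that Node 20 is the target (neither is stated in this PR):

```yaml
    steps:
      - uses: actions/checkout@v4
      # Installs pnpm; with no `version` input it falls back to the
      # packageManager field in package.json (assumed to be set).
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20   # assumed Node version
          cache: pnpm        # caches the pnpm store, keyed on the lockfile
      # Fails instead of rewriting pnpm-lock.yaml when it is out of date
      - run: pnpm install --frozen-lockfile
```

`pnpm/action-setup` has to run before `actions/setup-node` so that the `cache: pnpm` option can locate the pnpm store.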

Test Parallelization:

Local tests: 2→3 workers (Browserbase), 2→4 workers (standard)
Better CI vs local environment detection

test plan

Verify build artifacts upload/download correctly
Confirm eval jobs run in parallel after regression
Test that the skip-evals and skip-regression-evals labels work
Verify docs-only changes skip evals (except on main)
Measure CI time reduction (~50-60% expected)
Confirm no test flakiness from increased parallelization
All tests, lint, and build pass

changeset-bot · 2025-11-07T18:53:56Z

🦋 Changeset detected

Latest commit: 8202603

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages

Name	Type
@browserbasehq/stagehand	Patch
@browserbasehq/stagehand-evals	Patch

greptile-apps

Greptile Overview

Greptile Summary

This PR significantly optimizes CI performance by implementing parallel eval execution, build artifact caching, and smart test skipping. The changes reduce total CI time from ~270 minutes to ~90 minutes (67% reduction) through several key improvements:

Parallel Execution: Eval jobs now run in parallel after regression completes, instead of sequentially
Build Artifact Caching: Build artifacts are uploaded once and reused across all eval jobs, eliminating 6+ redundant builds
Smart Skipping: New skip-evals and skip-regression-evals labels plus automatic skipping for docs-only changes
Dependency Optimization: Uses pnpm/action-setup@v4 with caching and --frozen-lockfile for faster, more deterministic installs
Test Parallelization: Increased worker counts for better resource utilization (CI: 3-4 workers, local: 5-6 workers)

The implementation properly handles dependencies between jobs, with all eval jobs depending on run-build to ensure artifacts are available. E2E tests still build independently to avoid artifact download overhead for their simpler needs.

Confidence Score: 4/5

This PR is safe to merge with careful monitoring of the first few CI runs
The changes are well-architected and achieve significant CI performance improvements. The parallel execution strategy is sound, build artifact caching is implemented correctly, and dependency management is improved. However, there's one potential edge case with the docs-only filter that could cause markdown files in packages/ to incorrectly skip evals when they shouldn't. The increased test parallelization may also need monitoring for flakiness.
Monitor .github/workflows/ci.yml carefully during the first few CI runs to ensure artifact sharing works correctly and the docs-only filter behaves as expected

Important Files Changed

File Analysis

Filename	Score	Overview
.github/workflows/ci.yml	4/5	Implements parallel eval execution, build artifact caching, smart skipping logic, and dependency optimization. Changes reduce CI time from ~270 to ~90 minutes through parallelization and artifact reuse.
packages/core/lib/v3/tests/v3.bb.playwright.config.ts	5/5	Increases Browserbase test parallelization from 2 to 3 workers locally while keeping CI at 2 workers for resource management.
packages/core/lib/v3/tests/v3.local.playwright.config.ts	5/5	Increases local test parallelization from 2 to 3 workers in CI and 5 locally for improved performance.
packages/core/lib/v3/tests/v3.playwright.config.ts	5/5	Increases test parallelization from 2 to 4 workers in CI and 6 locally for improved test execution speed.

Sequence Diagram

sequenceDiagram
    participant DetermineChanges
    participant DetermineEvals
    participant RunLint
    participant RunBuild
    participant E2ELocal
    participant E2EBB
    participant Regression
    participant Combination
    participant Act
    participant Extract
    participant Observe
    participant TargetedExtract
    participant Agent

    DetermineChanges->>DetermineEvals: outputs (core, evals, docs-only)
    DetermineChanges->>RunLint: if core or evals changed
    DetermineChanges->>RunBuild: if core or evals changed
    
    RunLint->>E2ELocal: needs
    RunBuild->>E2ELocal: needs
    RunBuild-->>RunBuild: Upload build artifacts
    
    RunLint->>E2EBB: needs
    RunBuild->>E2EBB: needs
    
    E2ELocal->>Regression: needs
    E2EBB->>Regression: needs
    RunBuild->>Regression: needs (for artifacts)
    DetermineEvals->>Regression: check skip flags
    Regression-->>Regression: Download build artifacts
    
    Note over Combination,Agent: Parallel execution after regression
    
    Regression->>Combination: needs
    RunBuild->>Combination: needs (for artifacts)
    DetermineEvals->>Combination: check labels
    Combination-->>Combination: Download build artifacts
    
    Regression->>Act: needs
    RunBuild->>Act: needs (for artifacts)
    DetermineEvals->>Act: check labels
    Act-->>Act: Download build artifacts
    
    Regression->>Extract: needs
    RunBuild->>Extract: needs (for artifacts)
    DetermineEvals->>Extract: check labels
    Extract-->>Extract: Download build artifacts
    
    Regression->>Observe: needs
    RunBuild->>Observe: needs (for artifacts)
    DetermineEvals->>Observe: check labels
    Observe-->>Observe: Download build artifacts
    
    Regression->>TargetedExtract: needs
    RunBuild->>TargetedExtract: needs (for artifacts)
    DetermineEvals->>TargetedExtract: check labels
    TargetedExtract-->>TargetedExtract: Download build artifacts
    
    Regression->>Agent: needs
    RunBuild->>Agent: needs (for artifacts)
    DetermineEvals->>Agent: check labels
    Agent-->>Agent: Download build artifacts

6 files reviewed, 1 comment

greptile-apps · 2025-11-08T15:56:14Z

.github/workflows/ci.yml

+            docs-only:
+              - '**/*.md'
+              - 'examples/**'
+              - '!packages/**/*.md'


style: Check that the negation pattern !packages/**/*.md works as intended. Verify that changes to files like packages/core/README.md or packages/evals/CHANGELOG.md don't incorrectly set docs-only=true and skip evals when they shouldn't.
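
One possible way to address this, assuming the filter is evaluated by `dorny/paths-filter` (the snippet's shape suggests it, but this PR text does not say so): by default that action counts a file as matching when it matches at least one pattern in the list, so a negated pattern on its own can match almost every file. Its `predicate-quantifier: 'every'` option requires a file to match all patterns instead. The step id `filter` and the split into two filters below are illustrative assumptions, not part of the PR.

```yaml
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          # Require every pattern in a filter to match the same file, so the
          # negated pattern excludes package docs rather than matching
          # everything that is not a package doc.
          predicate-quantifier: 'every'
          filters: |
            docs-md:
              - '**/*.md'
              - '!packages/**/*.md'
            examples:
              - 'examples/**'
```

With this layout `packages/core/README.md` fails the second `docs-md` pattern, so it would not count as a docs change. `examples/**` is kept in its own filter because, under `every`, it would otherwise have to match the same file as `**/*.md`; a `docs-only` flag would then be derived from the two outputs.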

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-11-08T15:56:15Z

.github/workflows/ci.yml

  run-combination-evals:
-    needs: [run-regression-evals, determine-evals]
+    needs: [run-regression-evals, run-build, determine-evals]


skip-regression-evals label never lets other evals run

The new skip-regression-evals label is intended to bypass only the regression run, but every subsequent eval job still declares needs: [run-regression-evals, …]. In GitHub Actions a job whose dependency is skipped is itself skipped, so tagging a PR with skip-regression-evals will prevent run-combination-evals, run-act-evals, run-extract-evals, etc. from executing even when their labels are present. This makes the label ineffective and blocks targeted eval runs. Consider removing the hard dependency on run-regression-evals (or guarding it with if: always()) so other eval jobs can proceed when regression is intentionally skipped.
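
A sketch of the guard suggested above, using the job ids from this comment. It narrows `if: always()` to `!cancelled()` so the job still stops when the run is cancelled, which is a design choice made here rather than something the review prescribes. Any existing label checks would need to be added to the same condition.

```yaml
  run-combination-evals:
    needs: [run-regression-evals, run-build, determine-evals]
    # Including a status function such as cancelled() replaces the implicit
    # success() check, so a skipped regression job no longer skips this job.
    if: >-
      needs.run-build.result == 'success' &&
      needs.determine-evals.result == 'success' &&
      !cancelled() &&
      (needs.run-regression-evals.result == 'success' ||
      needs.run-regression-evals.result == 'skipped')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # download build artifacts and run the combination evals here
```

A failed regression run still blocks the job, because its result is then neither `success` nor `skipped`.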

filip-michalsky added 3 commits November 7, 2025 13:34

work on making ci faster

083bc81

remove comment

5701a06

remove pnpm version conflict

31c1108

filip-michalsky added 3 commits November 7, 2025 14:25

scale back concurrency

3e78249

fix lint

4ecc64b

more ci improvements

968eb94

filip-michalsky requested review from miguelg719, seanmcguire12 and tkattkat November 7, 2025 20:56

filip-michalsky marked this pull request as ready for review November 8, 2025 15:53

greptile-apps bot reviewed Nov 8, 2025

chatgpt-codex-connector bot reviewed Nov 8, 2025

add changeset

8202603
