@filip-michalsky filip-michalsky commented Nov 7, 2025

why

CI was slow due to sequential eval jobs, repeated builds, and inefficient dependency management.

what changed

Parallel Execution:

  • Eval jobs now run in parallel instead of sequentially: regression runs first, then all other eval jobs run concurrently
  • Reduces total CI time from ~25 mins to ~12 mins
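To see why fanning out after regression shortens wall-clock time, here is a minimal sketch that compares sequential and dependency-ordered scheduling for a small job graph. The job names echo some of the eval jobs, but the durations are made-up placeholders, not measurements from this repo, and the `Job` shape is hypothetical.

```typescript
// Hypothetical job description; durations are placeholders, not measured values.
interface Job {
  name: string;
  minutes: number;
  needs: string[];
}

// Wall-clock time if every job runs one after another.
function sequentialMinutes(jobs: Job[]): number {
  return jobs.reduce((total, job) => total + job.minutes, 0);
}

// Wall-clock time when each job starts as soon as all of its `needs` finish.
// Assumes jobs are listed so that dependencies come before dependents.
function parallelMinutes(jobs: Job[]): number {
  const finish: Record<string, number> = {};
  let total = 0;
  for (const job of jobs) {
    let start = 0;
    for (const dep of job.needs) {
      const depFinish = finish[dep];
      if (depFinish === undefined) {
        throw new Error("unknown or out-of-order dependency: " + dep);
      }
      start = Math.max(start, depFinish);
    }
    const done = start + job.minutes;
    finish[job.name] = done;
    total = Math.max(total, done);
  }
  return total;
}

const exampleJobs: Job[] = [
  { name: "regression", minutes: 3, needs: [] },
  { name: "combination", minutes: 4, needs: ["regression"] },
  { name: "act", minutes: 5, needs: ["regression"] },
  { name: "extract", minutes: 2, needs: ["regression"] },
];
```

With these placeholder durations, `sequentialMinutes(exampleJobs)` is 14 and `parallelMinutes(exampleJobs)` is 8: the parallel total is set by regression plus the slowest downstream job.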

Build Caching:

  • Build artifacts uploaded once and reused by all eval jobs
  • Eliminates 6 redundant builds (~2-3 min savings each)

Smart Skipping:

  • skip-evals label: skip all evals
  • skip-regression-evals label: skip regression only
  • Auto-skip evals for docs-only changes (except on main)
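The skipping rules above can be sketched as a small decision function. This is an illustration of the intended behavior only: the `EvalContext` and `EvalPlan` shapes and the `planEvals` name are hypothetical, and per-category label checks for individual eval jobs are left out.

```typescript
// Inputs a workflow might derive from the PR; field names are hypothetical.
interface EvalContext {
  labels: string[];
  docsOnly: boolean;
  branch: string;
}

interface EvalPlan {
  runRegression: boolean;
  runOtherEvals: boolean;
}

function hasLabel(ctx: EvalContext, label: string): boolean {
  return ctx.labels.indexOf(label) !== -1;
}

function planEvals(ctx: EvalContext): EvalPlan {
  const skipAll = hasLabel(ctx, "skip-evals");
  const skipRegression = hasLabel(ctx, "skip-regression-evals");
  // Docs-only changes skip evals, except on main.
  const docsSkip = ctx.docsOnly && ctx.branch !== "main";

  if (skipAll || docsSkip) {
    return { runRegression: false, runOtherEvals: false };
  }
  // skip-regression-evals is meant to skip only the regression run.
  return { runRegression: !skipRegression, runOtherEvals: true };
}
```

For example, `planEvals({ labels: ["skip-regression-evals"], docsOnly: false, branch: "feature" })` is expected to skip regression while still running the other evals.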

Dependency Optimization:

  • Use pnpm/action-setup@v4 with caching
  • Use --frozen-lockfile for deterministic installs
  • Remove unnecessary node_modules cleanup

Test Parallelization:

  • Local tests: 2→3 workers (Browserbase), 2→4 workers (standard)
  • Better CI vs local environment detection

test plan

  • Verify build artifacts upload/download correctly
  • Confirm eval jobs run in parallel after regression
  • Test skip-evals and skip-regression-evals labels work
  • Verify docs-only changes skip evals (except on main)
  • Measure CI time reduction (~50-60% expected)
  • Confirm no test flakiness from increased parallelization
  • All tests, lint, and build pass

changeset-bot bot commented Nov 7, 2025

🦋 Changeset detected

Latest commit: 8202603

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages

| Name | Type |
| --- | --- |
| @browserbasehq/stagehand | Patch |
| @browserbasehq/stagehand-evals | Patch |

@filip-michalsky filip-michalsky marked this pull request as ready for review November 8, 2025 15:53

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR significantly optimizes CI performance by implementing parallel eval execution, build artifact caching, and smart test skipping. The changes reduce total CI time from ~270 minutes to ~90 minutes (67% reduction) through several key improvements:

  • Parallel Execution: Eval jobs now run in parallel after regression completes, instead of sequentially
  • Build Artifact Caching: Build artifacts are uploaded once and reused across all eval jobs, eliminating 6+ redundant builds
  • Smart Skipping: New skip-evals and skip-regression-evals labels plus automatic skipping for docs-only changes
  • Dependency Optimization: Uses pnpm/action-setup@v4 with caching and --frozen-lockfile for faster, more deterministic installs
  • Test Parallelization: Increased worker counts for better resource utilization (CI: 3-4 workers, local: 5-6 workers)

The implementation properly handles dependencies between jobs, with all eval jobs depending on run-build to ensure artifacts are available. E2E tests still build independently to avoid artifact download overhead for their simpler needs.

Confidence Score: 4/5

  • This PR is safe to merge with careful monitoring of the first few CI runs
  • The changes are well-architected and achieve significant CI performance improvements. The parallel execution strategy is sound, build artifact caching is implemented correctly, and dependency management is improved. However, there's one potential edge case with the docs-only filter that could cause markdown files in packages/ to incorrectly skip evals when they shouldn't. The increased test parallelization may also need monitoring for flakiness.
  • Monitor .github/workflows/ci.yml carefully during the first few CI runs to ensure artifact sharing works correctly and the docs-only filter behaves as expected

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| .github/workflows/ci.yml | 4/5 | Implements parallel eval execution, build artifact caching, smart skipping logic, and dependency optimization. Changes reduce CI time from ~270 to ~90 minutes through parallelization and artifact reuse. |
| packages/core/lib/v3/tests/v3.bb.playwright.config.ts | 5/5 | Increases Browserbase test parallelization from 2 to 3 workers locally while keeping CI at 2 workers for resource management. |
| packages/core/lib/v3/tests/v3.local.playwright.config.ts | 5/5 | Increases local test parallelization from 2 to 3 workers in CI and 5 locally for improved performance. |
| packages/core/lib/v3/tests/v3.playwright.config.ts | 5/5 | Increases test parallelization from 2 to 4 workers in CI and 6 locally for improved test execution speed. |
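The worker counts in the table above could be captured in one lookup, sketched below. The numbers are taken from the table; the `Suite` keys, the `WORKERS` constant, and the `workersFor` helper are hypothetical names. How the real configs detect CI is not shown in this thread; a check such as `!!process.env.CI` is a common convention and is assumed here only as the source of `isCI`.

```typescript
// One key per Playwright config file listed in the table (names are illustrative).
type Suite = "v3.bb" | "v3.local" | "v3";

// Worker counts per environment, as described in the file table.
const WORKERS: Record<Suite, { ci: number; local: number }> = {
  "v3.bb": { ci: 2, local: 3 },
  "v3.local": { ci: 3, local: 5 },
  "v3": { ci: 4, local: 6 },
};

// `isCI` is passed in so the function stays pure; in a config file it would
// typically be derived from the environment (assumption, not shown in the PR).
function workersFor(suite: Suite, isCI: boolean): number {
  const counts = WORKERS[suite];
  return isCI ? counts.ci : counts.local;
}
```

The returned value would be what a config passes as Playwright's `workers` option.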

Sequence Diagram

sequenceDiagram
    participant DetermineChanges
    participant DetermineEvals
    participant RunLint
    participant RunBuild
    participant E2ELocal
    participant E2EBB
    participant Regression
    participant Combination
    participant Act
    participant Extract
    participant Observe
    participant TargetedExtract
    participant Agent

    DetermineChanges->>DetermineEvals: outputs (core, evals, docs-only)
    DetermineChanges->>RunLint: if core or evals changed
    DetermineChanges->>RunBuild: if core or evals changed
    
    RunLint->>E2ELocal: needs
    RunBuild->>E2ELocal: needs
    RunBuild-->>RunBuild: Upload build artifacts
    
    RunLint->>E2EBB: needs
    RunBuild->>E2EBB: needs
    
    E2ELocal->>Regression: needs
    E2EBB->>Regression: needs
    RunBuild->>Regression: needs (for artifacts)
    DetermineEvals->>Regression: check skip flags
    Regression-->>Regression: Download build artifacts
    
    Note over Combination,Agent: Parallel execution after regression
    
    Regression->>Combination: needs
    RunBuild->>Combination: needs (for artifacts)
    DetermineEvals->>Combination: check labels
    Combination-->>Combination: Download build artifacts
    
    Regression->>Act: needs
    RunBuild->>Act: needs (for artifacts)
    DetermineEvals->>Act: check labels
    Act-->>Act: Download build artifacts
    
    Regression->>Extract: needs
    RunBuild->>Extract: needs (for artifacts)
    DetermineEvals->>Extract: check labels
    Extract-->>Extract: Download build artifacts
    
    Regression->>Observe: needs
    RunBuild->>Observe: needs (for artifacts)
    DetermineEvals->>Observe: check labels
    Observe-->>Observe: Download build artifacts
    
    Regression->>TargetedExtract: needs
    RunBuild->>TargetedExtract: needs (for artifacts)
    DetermineEvals->>TargetedExtract: check labels
    TargetedExtract-->>TargetedExtract: Download build artifacts
    
    Regression->>Agent: needs
    RunBuild->>Agent: needs (for artifacts)
    DetermineEvals->>Agent: check labels
    Agent-->>Agent: Download build artifacts

6 files reviewed, 1 comment

Comment on .github/workflows/ci.yml, lines +48 to +51
docs-only:
- '**/*.md'
- 'examples/**'
- '!packages/**/*.md'

style: Check that the negation pattern !packages/**/*.md works as intended. Verify that changes to files like packages/core/README.md or packages/evals/CHANGELOG.md don't incorrectly set docs-only=true and skip evals when they shouldn't.
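To make the concern concrete, here is a small sketch of two ways a path filter might interpret that pattern list. Which one applies depends on the filter action and glob library the workflow uses, which this thread does not show, so treat both as assumptions to verify. The `globToRegExp` helper is a deliberately tiny stand-in for a real glob library and only handles `**/`, `**`, and `*`.

```typescript
// Minimal glob-to-RegExp conversion covering only `**/`, `**`, and `*`.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.slice(i, i + 3) === "**/") {
      source += "(?:.*/)?"; // zero or more directories
      i += 3;
    } else if (glob.slice(i, i + 2) === "**") {
      source += ".*"; // anything, including slashes
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // anything within one path segment
      i += 1;
    } else {
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

// Interpretation A: positive patterns select files, negated patterns remove them.
function matchesWithExclusion(file: string, patterns: string[]): boolean {
  const included = patterns
    .filter((p) => p.charAt(0) !== "!")
    .some((p) => globToRegExp(p).test(file));
  const excluded = patterns
    .filter((p) => p.charAt(0) === "!")
    .some((p) => globToRegExp(p.slice(1)).test(file));
  return included && !excluded;
}

// Interpretation B: each pattern is tested on its own and the results are OR-ed,
// so a negated pattern matches any file that is NOT under it.
function matchesAnyIndependently(file: string, patterns: string[]): boolean {
  return patterns.some((p) =>
    p.charAt(0) === "!"
      ? !globToRegExp(p.slice(1)).test(file)
      : globToRegExp(p).test(file)
  );
}

const docsOnlyPatterns = ["**/*.md", "examples/**", "!packages/**/*.md"];
```

Under interpretation A, `packages/core/README.md` is not treated as docs, which is the intended result. Under interpretation B it still matches through `**/*.md`, and even `packages/core/lib/index.ts` matches through the negated pattern, so the filter would fire for almost any change.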

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 239 to +305
  run-combination-evals:
-   needs: [run-regression-evals, determine-evals]
+   needs: [run-regression-evals, run-build, determine-evals]

P1: skip-regression-evals label never lets other evals run

The new skip-regression-evals label is intended to bypass only the regression run, but every subsequent eval job still declares needs: [run-regression-evals, …]. In GitHub Actions a job whose dependency is skipped is itself skipped, so tagging a PR with skip-regression-evals will prevent run-combination-evals, run-act-evals, run-extract-evals, etc. from executing even when their labels are present. This makes the label ineffective and blocks targeted eval runs. Consider removing the hard dependency on run-regression-evals (or guarding it with if: always()) so other eval jobs can proceed when regression is intentionally skipped.
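The skip propagation described above can be modeled in a few lines. This is a simplified simulation, not GitHub's scheduler: it assumes every job that runs succeeds, and the `JobSpec` shape, the `resolve` function, and the `evalJobs` helper are hypothetical. Job names come from the review comment.

```typescript
type Status = "success" | "skipped";

interface JobSpec {
  name: string;
  needs: string[];
  // Outcome of the job's own `if:` condition (labels, changed paths, ...).
  condition: boolean;
  // True when the `if:` uses a status check such as always(), so the job is
  // still evaluated when a dependency was skipped.
  ignoresSkippedNeeds: boolean;
}

// Jobs must be listed in dependency order.
function resolve(jobs: JobSpec[]): Record<string, Status> {
  const result: Record<string, Status> = {};
  for (const job of jobs) {
    const blocked = job.needs.some((dep) => result[dep] !== "success");
    const runs = job.condition && (!blocked || job.ignoresSkippedNeeds);
    result[job.name] = runs ? "success" : "skipped";
  }
  return result;
}

// A PR labeled skip-regression-evals: the regression job's condition is false.
function evalJobs(actIgnoresSkippedNeeds: boolean): JobSpec[] {
  return [
    { name: "run-build", needs: [], condition: true, ignoresSkippedNeeds: false },
    {
      name: "run-regression-evals",
      needs: ["run-build"],
      condition: false,
      ignoresSkippedNeeds: false,
    },
    {
      name: "run-act-evals",
      needs: ["run-regression-evals", "run-build"],
      condition: true,
      ignoresSkippedNeeds: actIgnoresSkippedNeeds,
    },
  ];
}
```

`resolve(evalJobs(false))` leaves `run-act-evals` skipped, matching the reported problem, while `resolve(evalJobs(true))` lets it run. In a real workflow, a job that opts out of the default skip behavior would probably also want to check `needs.run-build.result == 'success'`, since it still depends on the build artifacts.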
