Before merging any code changes, developers often face questions such as "Did my agent workflow become more curious?" or "Are the answers now less subservient?". Such questions are impossible to answer with static benchmarks such as Google's BIG-bench. Dynamic evaluations with LLMs are a more promising approach. LLMs are very good at binary evaluations, for example "Is this recipe vegetarian?", and the `LLMJudge` tool works well for these cases. However, scoring agent responses is more challenging, for example: "Score the creativity of the ice cream flavor 'Miso Caramel' on a scale from 0 to 1.". Scoring each response independently produces inconsistent results that cannot be compared, while including all other responses as context for comparison is not feasible.

This pull request presents an evaluation algorithm based on the Bradley-Terry model. It turns the difficult scoring problem into a set of simple binary evaluations, at which LLMs excel. Instead of asking the LLM to score a response, we give it tasks such as "Which ice cream flavor is more creative: 'Miso Caramel' or 'Vanilla'?". The algorithm is the same one used in Elo ratings and Chatbot Arena.
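Under the Bradley-Terry model each player i has a strength s_i, and the probability that i beats j is `exp(s_i) / (exp(s_i) + exp(s_j))`; fitting these strengths from recorded wins and losses is what the new `choix` dependency provides. As a minimal sketch (the flavors and game outcomes are made up, and whether the implementation calls `choix.ilsr_pairwise` specifically is an assumption):

```python
import choix

# Hypothetical A/B results: each tuple is (winner_index, loser_index).
flavors = ["Miso Caramel", "Vanilla", "Strawberry Basil"]
games = [(0, 1), (0, 2), (2, 1), (0, 1)]  # e.g. collected from LLM A/B judgements

# ilsr_pairwise fits the Bradley-Terry strengths; a small alpha regularizes
# players that never lose (or never win) so every score stays finite.
scores = choix.ilsr_pairwise(len(flavors), games, alpha=0.01)

for flavor, score in sorted(zip(flavors, scores), key=lambda pair: -pair[1]):
    print(f"{flavor}: {score:+.3f}")
```

Because only relative wins enter the fit, the resulting scores are comparable across every response in the same tournament, which is exactly what independent 0-to-1 scoring fails to provide.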
Inspired by the chess analogy, our implementation uses three core classes: `EvalPlayer`, `EvalGame` and `EvalTournament`. `EvalPlayer` contains a single agent response and, ultimately, the corresponding Bradley-Terry score; it maps closely to the `Case` class. `EvalGame` is a single A/B test between two `EvalPlayer`s, such as Miso Caramel vs. Vanilla. Finally, `EvalTournament` contains all the players, the games and the scoring algorithm; it maps closely to `Dataset`.

In practice, an evaluation runs in three steps:
1. Create a `Dataset` with only inputs. Then serialize it.
2. Run the `Dataset` against the agent workflow in the `main` branch and record the responses in the `Dataset`. Serialize it again. These responses serve as our baseline. This step should run whenever the `main` branch changes.
3. Run the `Dataset` against a novel agent in some feature branch. Run an `EvalTournament` with baseline and novel responses and score them in one go. Now you can check whether the scores have improved.
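To show the shape of step 3 end to end, here is a framework-free sketch: `judge_more_creative` stands in for a real LLM judge, the responses are hard-coded instead of coming from the `main` and feature-branch agents, and the pairing loop plays the role of `round_robin_strategy`. None of these names are the actual API from this pull request.

```python
import itertools

import choix


def judge_more_creative(a: str, b: str) -> bool:
    """Stand-in for the LLM A/B question 'Which flavor is more creative?'."""
    return len(set(a.split())) > len(set(b.split()))  # toy heuristic, not an LLM call


# Baseline responses (recorded from `main`) and novel responses (from the
# feature branch) enter the same tournament and are scored in one go.
players = {
    "baseline": ["Vanilla", "Chocolate Chip"],
    "novel": ["Miso Caramel", "Black Sesame Honeycomb"],
}
flat = [(branch, response) for branch, responses in players.items() for response in responses]

# Round-robin pairing: every response plays every other response once.
games = []
for (i, (_, a)), (j, (_, b)) in itertools.combinations(enumerate(flat), 2):
    games.append((i, j) if judge_more_creative(a, b) else (j, i))

scores = choix.ilsr_pairwise(len(flat), games, alpha=0.01)

# A branch has improved if its responses earn higher Bradley-Terry scores.
for branch in players:
    branch_scores = [s for (b, _), s in zip(flat, scores) if b == branch]
    print(branch, round(sum(branch_scores) / len(branch_scores), 3))
```

In the real implementation the pairing would come from one of the strategies listed under "Added" below (`adaptive_uncertainty_strategy` by default), and an actual LLM judge would replace the toy heuristic.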
You can run the three-step use case `test_evaltournament_usecase` together with the unit tests as below.
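For example, a pytest keyword filter selects the use case together with the related unit tests without assuming a particular test module path:

```python
import pytest

# Run the three-step use case plus any tests whose names mention the tournament;
# adjust the keyword if the unit tests are named differently.
pytest.main(["-k", "evaltournament", "-v"])
```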
Note that this pull request is just a rough proof of concept. I have not tried to integrate the code into the `pydantic_evals` framework, nor to appease the CI pipeline. I hope it will help with the discussions on Slack.

### Added
- `choix` and `numpy` dependencies
- `EvalPlayer`, `EvalGame` and `EvalTournament` classes
- `random_sampling_strategy`, `round_robin_strategy` and `adaptive_uncertainty_strategy` (default)
- `test_evaltournament_usecase`

### Changed