Before merging any code changes, developers often face questions such as "Did my agent workflow become more curious?" or "Are the answers now less subservient?". Such questions are impossible to answer with static benchmarks such as Google's BIG-bench. Dynamic evaluations with LLMs are a more promising approach. LLMs are very good at binary evaluations, for example "Is this recipe vegetarian?", and the LLMJudge tool works well for these cases. However, scoring agent responses is more challenging, for example "Score the creativity of the ice cream flavor 'Miso Caramel' on a scale from 0 to 1." Scoring each response independently produces inconsistent results that cannot be compared, while including all other responses as context for comparison is not feasible.

This pull request presents an evaluation algorithm based on the Bradley-Terry model. It turns the difficult scoring problem into a set of simple binary evaluations at which LLMs excel. Instead of asking the LLM to score a response, we give it tasks such as "Which ice cream flavor is more creative: 'Miso Caramel' or 'Vanilla'?". The same algorithm underlies Elo ratings and Chatbot Arena. Following the chess analogy, the implementation uses three core classes: EvalPlayer, EvalGame and EvalTournament. EvalPlayer holds a single agent response and, ultimately, its Bradley-Terry score; it maps closely to the Case class. EvalGame is a single A/B test between two EvalPlayers, such as Miso Caramel vs. Vanilla. EvalTournament contains all players, the games and the scoring algorithm; it maps closely to Dataset.
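
For illustration, here is a minimal sketch of how pairwise game outcomes can be turned into Bradley-Terry scores with the choix library. The flavour names, game results and the alpha regularisation value are made up for the example; the PR's own scoring code may differ.

```python
import choix

# Index the players 0..n-1 and record each game as a (winner, loser) pair.
players = ["Miso Caramel", "Vanilla", "Strawberry Basil"]
games = [(0, 1), (2, 1), (0, 2), (0, 1)]  # e.g. Miso Caramel beat Vanilla twice

# Maximum-likelihood Bradley-Terry strengths; alpha adds a little regularisation
# so that sparse or one-sided results do not blow up the estimates.
scores = choix.ilsr_pairwise(len(players), games, alpha=0.01)

for name, score in sorted(zip(players, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```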

In practice, an evaluation runs in three steps:

  1. Create a Dataset with only inputs. Then serialize it.
  2. Run the Dataset against the agent workflow in the main branch and record the responses in the Dataset. Serialize it again. These responses serve as the baseline. This step should run whenever the main branch changes.
  3. Run the Dataset against the novel agent in a feature branch. Then run an EvalTournament over the baseline and novel responses and score them all in one go (see the sketch below). Now you can check whether the scores have improved.
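
The sketch below illustrates step 3. EvalPlayer and EvalTournament come from this PR, but the import path, constructor arguments and method names shown here are assumptions and may differ from the actual implementation.

```python
# Hypothetical step-3 usage; the real API in the PR may differ.
from tournament import EvalPlayer, EvalTournament  # hypothetical import path

baseline_responses = ["Vanilla", "Chocolate", "Strawberry"]      # recorded from main
novel_responses = ["Miso Caramel", "Black Sesame", "Olive Oil"]  # from the feature branch

players = [EvalPlayer(response=r) for r in baseline_responses + novel_responses]
tournament = EvalTournament(players=players)

# Each game asks the LLM judge a binary question such as
# "Which ice cream flavor is more creative: A or B?".
tournament.run(criterion="Which ice cream flavor is more creative?")

# Compare the Bradley-Terry scores of baseline and novel responses.
for player in sorted(tournament.players, key=lambda p: p.score, reverse=True):
    print(f"{player.response}: {player.score:.3f}")
```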

You can run the three-step use case test_evaltournament_usecase together with the unit tests as follows:

uv run pytest tests/evals/test_tournament.py -v -s

Note that this pull request is just a rough proof of concept. I have not tried to integrate the code into the pydantic_evals framework, nor to appease the CI pipeline. I hope it will help with the discussions on Slack.

Added

  • choix and numpy dependencies
  • EvalPlayer, EvalGame and EvalTournament classes
  • three game-sampling strategies: random_sampling_strategy, round_robin_strategy and adaptive_uncertainty_strategy (default); see the sketch after this list
  • unit tests including a use case example test_evaltournament_usecase
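
For reference, here is one plausible shape of the adaptive strategy, not the PR's actual implementation: schedule the match-ups whose current score estimates are closest together, since those games carry the most information.

```python
import itertools

def adaptive_uncertainty_strategy(scores: list[float], n_games: int) -> list[tuple[int, int]]:
    """Pick the n_games pairs whose current Bradley-Terry scores are closest,
    i.e. the match-ups whose outcome is most uncertain. Illustrative sketch only."""
    pairs = itertools.combinations(range(len(scores)), 2)
    ranked = sorted(pairs, key=lambda ij: abs(scores[ij[0]] - scores[ij[1]]))
    return ranked[:n_games]

# Example: with scores [0.9, 0.1, 0.85], the closest match-up (0, 2) is scheduled first.
print(adaptive_uncertainty_strategy([0.9, 0.1, 0.85], n_games=2))
```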

Changed

  • enabled localhost for VCR recording
