Before merging any code changes, developers often face questions such as "Did my agent workflow become more curious?" or "Are the answers now less subservient?". Such questions are impossible to answer with static benchmarks such as Google's BIG-bench. Dynamic evaluations with LLMs are a more promising approach. LLMs are very good at binary evaluations, for example "Is this recipe vegetarian?", and the `LLMJudge` tool works well for these cases. However, scoring agent responses is more challenging, for example: "Score the creativity of the ice cream flavor 'Miso Caramel' on a scale from 0 to 1.". Scoring each response independently produces inconsistent results that cannot be compared, while including all other responses as context for comparison is not feasible.

This pull request presents an evaluation algorithm based on the Bradley-Terry model. It turns the difficult scoring problem into a set of simple binary evaluations, at which LLMs excel. Instead of asking the LLM to score a response, we give it tasks such as "Which ice cream flavor is more creative: 'Miso Caramel' or 'Vanilla'?". The algorithm is the same one used in Elo ratings and Chatbot Arena.
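Under the Bradley-Terry model each player i has a strength s_i, and the probability that i beats j is `exp(s_i) / (exp(s_i) + exp(s_j))`; fitting these strengths from recorded wins and losses is what the new `choix` dependency provides. As a minimal sketch (the flavors and game outcomes are made up, and whether the implementation calls `choix.ilsr_pairwise` specifically is an assumption):

```python
import choix

# Hypothetical A/B results: each tuple is (winner_index, loser_index).
flavors = ["Miso Caramel", "Vanilla", "Strawberry Basil"]
games = [(0, 1), (0, 2), (2, 1), (0, 1)]  # e.g. collected from LLM A/B judgements

# ilsr_pairwise fits the Bradley-Terry strengths; a small alpha regularizes
# players that never lose (or never win) so every score stays finite.
scores = choix.ilsr_pairwise(len(flavors), games, alpha=0.01)

for flavor, score in sorted(zip(flavors, scores), key=lambda pair: -pair[1]):
    print(f"{flavor}: {score:+.3f}")
```

Because only relative wins enter the fit, the resulting scores are comparable across every response in the same tournament, which is exactly what independent 0-to-1 scoring fails to provide.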
Inspired by the chess analogy, our implementation uses three core classes: `EvalPlayer`, `EvalGame` and `EvalTournament`. `EvalPlayer` contains a single agent response and, ultimately, the corresponding Bradley-Terry score; it maps closely to the `Case` class. `EvalGame` is a single A/B test between two `EvalPlayer`s, such as Miso Caramel vs. Vanilla. Finally, `EvalTournament` contains all the players, the games and the scoring algorithm; it maps closely to `Dataset`.

In practice, an evaluation runs in three steps:
1. Create a `Dataset` with only inputs. Then serialize it.
2. Run the `Dataset` against the agent workflow in the `main` branch and record the responses in the `Dataset`. Serialize it again. These responses serve as our baseline. This step should run whenever the `main` branch changes.
3. Run the `Dataset` against a novel agent in some feature branch. Run an `EvalTournament` with baseline and novel responses and score them in one go. Now you can check whether the scores have improved.
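To show the shape of step 3 end to end, here is a framework-free sketch: `judge_more_creative` stands in for a real LLM judge, the responses are hard-coded instead of coming from the `main` and feature-branch agents, and the pairing loop plays the role of `round_robin_strategy`. None of these names are the actual API from this pull request.

```python
import itertools

import choix


def judge_more_creative(a: str, b: str) -> bool:
    """Stand-in for the LLM A/B question 'Which flavor is more creative?'."""
    return len(set(a.split())) > len(set(b.split()))  # toy heuristic, not an LLM call


# Baseline responses (recorded from `main`) and novel responses (from the
# feature branch) enter the same tournament and are scored in one go.
players = {
    "baseline": ["Vanilla", "Chocolate Chip"],
    "novel": ["Miso Caramel", "Black Sesame Honeycomb"],
}
flat = [(branch, response) for branch, responses in players.items() for response in responses]

# Round-robin pairing: every response plays every other response once.
games = []
for (i, (_, a)), (j, (_, b)) in itertools.combinations(enumerate(flat), 2):
    games.append((i, j) if judge_more_creative(a, b) else (j, i))

scores = choix.ilsr_pairwise(len(flat), games, alpha=0.01)

# A branch has improved if its responses earn higher Bradley-Terry scores.
for branch in players:
    branch_scores = [s for (b, _), s in zip(flat, scores) if b == branch]
    print(branch, round(sum(branch_scores) / len(branch_scores), 3))
```

In the real implementation the pairing would come from one of the strategies listed under "Added" below (`adaptive_uncertainty_strategy` by default), and an actual LLM judge would replace the toy heuristic.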
You can run the three-step use case `test_evaltournament_usecase` together with the unit tests as below.
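For example, a pytest keyword filter selects the use case together with the related unit tests without assuming a particular test module path:

```python
import pytest

# Run the three-step use case plus any tests whose names mention the tournament;
# adjust the keyword if the unit tests are named differently.
pytest.main(["-k", "evaltournament", "-v"])
```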
Note that this pull request is just a rough proof of concept. I have not tried to integrate the code into the `pydantic_evals` framework, nor to appease the CI pipeline. I hope it will help with the discussions on Slack.

### Added
- `choix` and `numpy` dependencies
- `EvalPlayer`, `EvalGame` and `EvalTournament` classes
- `random_sampling_strategy`, `round_robin_strategy` and `adaptive_uncertainty_strategy` (default)
- `test_evaltournament_usecase`

### Changed