[EVAL] Long Horizon Execution #1074

akshathmangudi · 2025-11-21T10:57:49Z

I screwed up my previous git clone, so I had to redo the changes 😅

Description:
The approach is described in #1056.

Tasks:

Initial scaffolding of /tasks/tasks/long_horizon_execution.py
Implement a custom scorer to parse <answer> tags.
Complete implementation of /tasks/tasks/long_horizon_execution.py
Evaluation and Testing
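
For reviewers, a minimal sketch of what parsing the <answer> tags could look like. This is not the scorer from this PR: the names `parse_answer` and `exact_match` are hypothetical, and it assumes the answer is a comma-separated list of integers and that the last <answer> pair in the completion is the one that counts.

```python
import re
from typing import List, Optional

# Hypothetical helpers for illustration; not the scorer shipped in this PR.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def parse_answer(completion: str) -> Optional[List[int]]:
    """Return the integers inside the last <answer>...</answer> pair.

    Returns None when no tag pair is found or a token is not an integer.
    """
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    try:
        # Assumption: values are comma-separated integers; empty tokens are skipped.
        return [int(tok.strip()) for tok in matches[-1].split(",") if tok.strip()]
    except ValueError:
        return None


def exact_match(completion: str, target: List[int]) -> float:
    """Score 1.0 only if the parsed list equals the target exactly."""
    return 1.0 if parse_answer(completion) == target else 0.0
```

Returning None for unparsable output keeps formatting failures distinguishable from wrong sums, which may help when debugging low scores.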

STATUS: ready for review.

Current behavior:

Running lighteval tasks inspect long_horizon_execution produces the following output:

... more lines
           "'basic', 'alive', 'cream', 'dress', 'black', 'brown', 'drama', "
           "'black', 'audio', 'brown', 'album', 'cover', 'avoid', 'aware', "
           "'event', 'dream', 'clean', 'clock', 'apple', 'above', 'close', "
           "'begin', 'allow', 'album', 'draft', 'brain', 'civil', 'faith', "
           "'death', 'coach', 'below', 'doubt', 'aware', 'cover', 'final', "
           "'allow', 'avoid', 'ahead', 'cross', 'child', 'cream', 'error', "
           "'break', 'brief', 'clock', 'final', 'dance', 'award', 'every', "
           "'chief', 'could', 'dream', 'begin', 'burst', 'audio', 'album', "
           "'cross', 'doubt', 'blood', 'child', 'brand', 'brand', 'extra', "
           "'broad', 'cloud', 'check', 'after', 'chart', 'basic', 'child', "
           "'coach', 'chair', 'faith', 'earth', 'audio', 'basic', 'field', "
           "'cloud', 'draft', 'apply', 'court', 'black', 'ahead', 'burst', "
           "'crowd', 'depth', 'enemy', 'drink', 'first', 'could', 'false', "
           "'could', 'blame', 'first', 'album', 'crowd', 'first', 'broad', "
           "'extra', 'clock', 'chart', 'fiber', 'board', 'earth', 'being', "
           "'alive', 'chart', 'avoid', 'dress', 'cloud', 'clean', 'avoid', "
           "'crash', 'clean', 'arise', 'death', 'brand', 'error']\n"
           '\n'
           'Your task: Calculate the cumulative sum after each key. The first '
           'sum is just the value of the first key. The second sum is the '
           'first value plus the second value, and so on.\n'
           '\n'
           'IMPORTANT:\n'
           '- Output your answer as a single line with comma-separated values '
           'inside <answer></answer> tags\n'
           '- Do not include any other text outside the answer tags\n'
           '- Format: <answer>value1,value2,value3,...</answer>\n'
           '- Example: If the cumulative sums are [5, 8, 12], output: '
           '<answer>5,8,12</answer>\n'
           '\n'
           'Your answer:',
  'sampling_methods': [],
  'specific': None,
  'stop_sequences': (),
  'task_name': 'long_horizon_execution',
  'unconditioned_query': None,
  'use_logits': False}
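
To make the expected target concrete, here is a small sketch of the cumulative-sum computation the prompt asks for. It assumes each key maps to an integer in a dictionary given earlier in the prompt (the part elided above); the dictionary values and the helper names `cumulative_sums` and `format_target` are made up for illustration.

```python
from itertools import accumulate
from typing import Dict, List


def cumulative_sums(values: Dict[str, int], keys: List[str]) -> List[int]:
    """Running total of values[key] over the key sequence, in order."""
    return list(accumulate(values[k] for k in keys))


def format_target(sums: List[int]) -> str:
    """Render the sums in the format the prompt requests."""
    return "<answer>" + ",".join(str(s) for s in sums) + "</answer>"


# Toy data: these values are invented, not taken from the dataset.
values = {"apple": 5, "black": 3, "cream": 4}
keys = ["apple", "black", "cream"]

sums = cumulative_sums(values, keys)
print(format_target(sums))  # <answer>5,8,12</answer>
```

Because each sum depends on all previous ones, a single early mistake makes every later value wrong, so exact match over the full list is a strict metric for long key sequences.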

akshathmangudi · 2025-11-21T10:59:50Z

cc: @NathanHB

NathanHB · 2025-11-21T12:42:28Z

Looking good! I'll run it locally and review today or at the start of next week :)
Can you share a Hugging Face Space with the samples, as described here, to make it easier to verify? 🤗

akshathmangudi · 2025-11-22T12:17:27Z

I ran the benchmark on HF Inference's gpt-4o, but a lot of the results I'm seeing are quite poor. Is this expected, or is something wrong with the prompting that I haven't looked at yet?

https://huggingface.co/spaces/akshathmangudi/lhe-gpt4o-single

ready for review

cef0b0f

akshathmangudi mentioned this pull request Nov 21, 2025

[EVAL] Long Horizon Execution #1072

Closed

4 tasks

akshathmangudi marked this pull request as ready for review November 21, 2025 10:59

Merge branch 'main' into akshath/issue-1056-v2

fdc9288

akshathmangudi added 2 commits November 22, 2025 17:47

some fixes

2c0ceae

Merge branch 'akshath/issue-1056-v2' of github.com:akshathmangudi/lighteval into akshath/issue-1056-v2

3d8ac1b
