Code release for the AudioToolAgent paper. See paper here: https://arxiv.org/abs/2510.02995
The repository exposes a language-agent scaffold that calls audio specialists as tools. Two ready-to-run configurations are provided:
- AudioToolAgent-GPT5 (closed): GPT-5 orchestrator with vendor APIs (OpenAI GPT-4o, Google Gemini 2.5 Flash, Mistral Voxtral) plus a local AudioFlamingo server.
- AudioToolAgent-Open (open): DeepSeek-V3.1 orchestrator with Whisper, Voxtral, Qwen2.5 Omni, DeSTA 2.5, and AudioFlamingo 3.
- audiotoolagent/: core package (agent runtime, tools, APIs)
- configs/: example configs
- Evaluation/: benchmark runners
  - MMAU_GPT5.py (e.g. python -m Evaluation.MMAU_GPT5 --limit 50)
  - MMAU_Open.py
  - MMAR_GPT5.py
  - MMAR_Open.py
  - MMAUPro_GPT5.py
  - MMAUPro_Open.py
- scripts/
  - launch_closed.sh: start local services for the GPT-5 config
  - launch_open.sh: start local services for the DeepSeek config
- main.py: CLI for single-run inference
# Clone and enter the project
git clone https://github.com/GLJS/AudioToolAgent.git
cd AudioToolAgent
# Create environment (Python 3.10+ recommended)
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Create a .env file or export the following variables (only the keys relevant to your configuration are required):
# Shared
export TMP_DIR=/tmp/audiotoolagent
# Closed configuration (GPT-5)
export OPENAI_API_KEY="..." # GPT-5 orchestrator + GPT-4o tool
export GOOGLE_API_KEY="..." # Gemini 2.5 Flash
export MISTRAL_API_KEY="..." # Voxtral API
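Before launching anything, it can help to confirm that the keys for your configuration are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name `missing_keys` and the grouping into `CLOSED_KEYS` are assumptions made for illustration, while the variable names come from the exports above.

```python
import os

# Variable names are taken from the export lines above.
# The tuple name and this helper are illustrative, not part of the codebase.
CLOSED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "MISTRAL_API_KEY")


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys(CLOSED_KEYS)
    if missing:
        print("Missing keys for the closed configuration:", ", ".join(missing))
    else:
        print("All closed-configuration keys are set.")
```

Passing an explicit mapping as `env` makes the check easy to exercise without touching the real environment.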
Both configurations rely on local HTTP endpoints for some tools. Two helper scripts launch the required processes and keep logs under logs/.
# Open / DeepSeek configuration
./scripts/launch_open.sh
# Closed / GPT-5 configuration
./scripts/launch_closed.sh
The scripts spawn the following components:
- Open: Qwen2.5 Omni (vLLM on port 4002), DeSTA 2.5 FastAPI (port 4004), and the AudioFlamingo 3 FastAPI proxy (port 4010). Whisper runs in-process via faster-whisper; Voxtral is accessed via API.
- Closed: Qwen2.5 Omni (port 4002) and the AudioFlamingo 3 FastAPI proxy (port 4010).
hostnames.txt is updated automatically so the tool adapters discover the correct endpoints.
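After running a launch script, a quick way to see whether the services came up is to probe the ports listed above. This is a rough sketch under the assumption that the services listen on the local machine (`127.0.0.1`); if hostnames.txt points to other hosts, substitute those. The helper `port_is_open` is hypothetical and only checks that a TCP connection succeeds, not that the model behind it is ready.

```python
import socket

# Ports as listed above for the open configuration.
# The labels and the local host address are assumptions for this sketch.
OPEN_PORTS = {
    "Qwen2.5 Omni (vLLM)": 4002,
    "DeSTA 2.5 FastAPI": 4004,
    "AudioFlamingo 3 proxy": 4010,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in OPEN_PORTS.items():
        state = "up" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```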
Use main.py to run the full tool-calling pipeline for a question + audio file.
python main.py \
--config configs/open_deepseek.yaml \
--audio /path/to/audio.wav \
--question "What instrument is playing?" \
--options "Piano" "Guitar" "Violin" "Drums"
Add --no-stream to disable incremental console streaming and --output result.json to save the response.
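For scripted use, the invocation above can be assembled programmatically. The sketch below only builds and prints the command; the function `build_command` is a hypothetical helper, and it uses just the flags documented above (`--config`, `--audio`, `--question`, `--options`, `--no-stream`, `--output`).

```python
import shlex
import sys


def build_command(config, audio, question, options=(), output=None, stream=True):
    """Assemble the main.py argument list from the flags documented above."""
    cmd = [sys.executable, "main.py",
           "--config", config,
           "--audio", audio,
           "--question", question]
    if options:
        cmd += ["--options", *options]
    if not stream:
        cmd.append("--no-stream")
    if output:
        cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        config="configs/open_deepseek.yaml",
        audio="/path/to/audio.wav",
        question="What instrument is playing?",
        options=["Piano", "Guitar", "Violin", "Drums"],
        output="result.json",
        stream=False,
    )
    # Print a shell-quoted version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems when the question or the options contain spaces.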
Each benchmark/configuration pair has its own script under Evaluation/ so that commands from the paper can be reproduced exactly. Run them as Python modules to keep relative imports working, and use --limit for quick tests.
# MMAU (GPT-5 configuration)
python -m Evaluation.MMAU_GPT5 --limit 50
# MMAU (open configuration)
python -m Evaluation.MMAU_Open --limit 50
# MMAR (GPT-5 configuration)
python -m Evaluation.MMAR_GPT5 --limit 50
# MMAU-Pro (open configuration)
python -m Evaluation.MMAUPro_Open --limit 25
Each runner downloads the corresponding Hugging Face dataset on first use and writes its results to a JSON file when --output is provided.
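Because the runner modules follow a `<benchmark>_<configuration>` naming pattern, a small smoke test across all of them can be generated rather than typed out. This is an illustrative sketch: `runner_command` is a hypothetical helper, and the module names are inferred from the files listed under Evaluation/.

```python
import shlex
import sys

# Benchmark and configuration suffixes as they appear in the Evaluation/ file names.
BENCHMARKS = ("MMAU", "MMAR", "MMAUPro")
CONFIGS = ("GPT5", "Open")


def runner_command(benchmark, config, limit=None):
    """Build the `python -m Evaluation.<benchmark>_<config>` command as a list."""
    if benchmark not in BENCHMARKS or config not in CONFIGS:
        raise ValueError(f"unknown runner: {benchmark}_{config}")
    cmd = [sys.executable, "-m", f"Evaluation.{benchmark}_{config}"]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


if __name__ == "__main__":
    # Print one quick-test command per benchmark/configuration pair.
    for benchmark in BENCHMARKS:
        for config in CONFIGS:
            print(shlex.join(runner_command(benchmark, config, limit=5)))
```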
Configuration files live in configs/ and describe the orchestrator plus the set of enabled tools. Duplicate the YAMLs to experiment with alternative tool suites or decoding parameters.
Key fields:
- orchestrator.llm_type: openai (GPT-5) or vllm (DeepSeek). Use llm_url and api_key_env to point to custom endpoints.
- tools: ordered list of tool descriptors. Set enabled: false to disable a tool quickly.
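To make the key fields concrete, here is a sketch of how a loaded configuration might look as a Python dictionary and how a tool could be switched off for an ablation. Only `orchestrator.llm_type`, `llm_url`, `api_key_env`, `tools`, and `enabled` come from the description above; the `name` key, the tool names, the URL, the environment variable name, and the assumption that a missing `enabled` means "on" are all hypothetical. Check the YAML files in configs/ for the real schema.

```python
import copy

# Hypothetical in-memory form of a config file. Values and the `name` key
# are placeholders for illustration, not the repository's actual schema.
example_config = {
    "orchestrator": {
        "llm_type": "vllm",
        "llm_url": "http://localhost:8000/v1",  # placeholder endpoint
        "api_key_env": "EXAMPLE_API_KEY",       # placeholder variable name
    },
    "tools": [
        {"name": "whisper", "enabled": True},
        {"name": "voxtral", "enabled": True},
    ],
}


def with_tool_disabled(config, tool_name):
    """Return a deep copy of `config` with the named tool set to enabled: false."""
    new_config = copy.deepcopy(config)
    for tool in new_config.get("tools", []):
        if tool.get("name") == tool_name:
            tool["enabled"] = False
    return new_config


def enabled_tools(config):
    """List tool names in order, treating a missing `enabled` as True (an assumption)."""
    return [t.get("name") for t in config.get("tools", []) if t.get("enabled", True)]


if __name__ == "__main__":
    ablated = with_tool_disabled(example_config, "voxtral")
    print(enabled_tools(example_config), "->", enabled_tools(ablated))
```

Working on a copy keeps the original configuration intact, which is convenient when comparing several tool suites in one script.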
- Add new tools under audiotoolagent/tools/ by subclassing AudioAnalysisModelTool, AudioTranscriptionModelTool, or ExternalAPITool.
- Register new FastAPI adapters in audiotoolagent/apis/ if a model needs to be exposed over HTTP.
- Reference the tool in a configuration file and rerun the desired evaluation script.
If you use this codebase, please cite the AudioToolAgent paper.
@misc{wijngaard2025audiotoolagentagenticframeworkaudiolanguage,
title={AudioToolAgent: An Agentic Framework for Audio-Language Models},
author={Gijs Wijngaard and Elia Formisano and Michel Dumontier},
year={2025},
eprint={2510.02995},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2510.02995},
}