Adversarial Medical QA


👀 Overview

AMQA is an adversarial medical question answering dataset for benchmarking the bias of large language models (LLMs) in the medical question answering context. AMQA is built from multiple-choice clinical vignettes from the U.S. Medical Licensing Examination (USMLE). Each sample includes:

  • An original clinical vignette from the USMLE question bank (MedQA dataset).

  • A neutralized clinical vignette with sensitive attributes removed.

  • Six adversarial variants targeting:

    • Race (Black vs. White)
    • Gender (Female vs. Male)
    • Socioeconomic Status (Low vs. High Income)

Variants are generated using a multi-agent LLM pipeline and reviewed by humans for quality control. The figure below illustrates the workflow for creating the AMQA dataset.

[Figure: Workflow of the creation of the AMQA dataset]
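
As an illustration only, the following is a minimal sketch of what one generation step of such a pipeline could look like. It is not the repository's actual multi-agent pipeline; the client library, prompt wording, and function names are assumptions.

from openai import OpenAI  # one possible LLM client; the actual pipeline may differ

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_variant(neutral_vignette: str, attribute: str, value: str) -> str:
    """Rewrite a neutralized vignette so that one sensitive attribute is injected."""
    prompt = (
        f"Rewrite the following clinical vignette so that the patient's "
        f"{attribute} is {value}. Do not change any clinically relevant detail.\n\n"
        f"{neutral_vignette}"
    )
    response = client.chat.completions.create(
        model="gpt-4.1",  # model name chosen to match the repository's file names
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: produce the race-targeted pair of variants from one neutralized vignette
# variants = {v: generate_variant(vignette, "race", v) for v in ("Black", "White")}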

🏠 Repository Structure

AMQA/
├── AMQA_Dataset/
│   ├── AMQA_Dataset.jsonl              # Final AMQA dataset: adversarial variants from the GPT agent, revised by human reviewers
│   ├── Vignette_GPT-4.1.jsonl          # Adversarial clinical vignette variants from the GPT agent
│   ├── Vignette_Deepseek-v3.jsonl      # Adversarial clinical vignette variants from the Deepseek agent
│   └── .../
├── Scripts/
│   ├── AMQA_generation_batch/          # Python scripts for generating adversarial variants from neutralized clinical vignettes
│   ├── AMQA_Benchmark_LLM/             # Python scripts for benchmarking given LLMs
│   └── .../
├── Results/
│   ├── AMQA_Benchmark_Answer_{LLM_Name}.jsonl    # Raw answers from {LLM_Name} on original vignettes, neutralized vignettes, and vignette variants
│   └── AMQA_Benchmark_Summary_{LLM_Name}.jsonl   # Statistical results of benchmarking {LLM_Name}
├── Figures/
│   ├── AMQA_Banner                     # Banner figure of the AMQA benchmark dataset
│   └── AMQA_Workflow                   # Workflow of the creation of the AMQA benchmark dataset
└── README.md

✍️ Evaluation Metrics

  • Individual Fairness: consistency of a model's answers across counterfactual variants of the same vignette
  • Group Fairness: accuracy disparity between demographic groups
  • Significance Testing: McNemar's test on paired answers to assess whether differences are statistically significant (see the sketch after this list)
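
A minimal sketch of how these three metrics could be computed from paired per-sample results. The function names and input layout are assumptions, and McNemar's test here uses statsmodels, one common implementation.

from statsmodels.stats.contingency_tables import mcnemar

# correct_a[i] / correct_b[i]: whether the model answered sample i correctly on
# paired counterfactual variants (e.g., the "Black" vs. the "White" variant).

def individual_consistency(answers_a, answers_b):
    """Individual fairness: fraction of vignettes answered identically across a counterfactual pair."""
    return sum(a == b for a, b in zip(answers_a, answers_b)) / len(answers_a)

def group_fairness_gap(correct_a, correct_b):
    """Group fairness: accuracy disparity between the two demographic variants."""
    return sum(correct_a) / len(correct_a) - sum(correct_b) / len(correct_b)

def mcnemar_test(correct_a, correct_b):
    """Significance: McNemar's test on the 2x2 table of paired correct/incorrect outcomes."""
    table = [[0, 0], [0, 0]]
    for a, b in zip(correct_a, correct_b):
        table[int(not a)][int(not b)] += 1
    return mcnemar(table, exact=True)  # result exposes .statistic and .pvalue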

📰 Details of the AMQA Dataset

Format: For ease of use, we release the dataset as JSON Lines (".jsonl") files and make it publicly available on both the AMQA GitHub repository and the AMQA Hugging Face page.

Properties: The AMQA dataset currently contains 801 samples. Each sample has 39 properties, including "question id", "original question", "neutralized question", six "adversarial description" fields, six "adversarial variant" fields, six "variant tag" fields, and answers to the original question, the neutralized question, and the six variants...
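
Because the release format is JSON Lines, the GitHub copy of the dataset can be read with the Python standard library alone. A minimal sketch, assuming the file path shown in the repository structure above:

import json

# read one JSON object per line from the dataset file
with open("AMQA_Dataset/AMQA_Dataset.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

print(len(samples))        # expected: 801
print(samples[0].keys())   # the 39 properties of a sample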

🚀 Usage

To load the AMQA benchmark dataset from the Hugging Face Hub, copy and run the following code:

# requires the Hugging Face `datasets` library: pip install datasets
from datasets import load_dataset

ds = load_dataset("Showing-KCL/AMQA")
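
Once loaded, the returned object can be inspected like any Hugging Face dataset. A small sketch; the split name "train" is an assumption:

print(ds)                    # splits and sizes
sample = ds["train"][0]      # assumes the default split is named "train"
print(list(sample.keys()))   # the per-sample fields listed in the Details section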

📧 Contact

Ying Xiao maintains the AMQA dataset and this repository. If you have any problems with or suggestions for the AMQA dataset or our source code, please feel free to reach out by emailing ying.1.xiao@kcl.ac.uk.
