Codebase for the paper:
"DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities"
(EMNLP 2024)
Create and activate the environment:
conda create --name lsr python=3.9.12
conda activate lsr

Install required packages:
pip install -r requirements.txt

git clone https://github.com/thongnt99/DyVo
cd DyVo
mkdir dyvo_data
cd dyvo_data

Make sure the Hugging Face CLI is installed:
pip install huggingface_hub

Then download the data:
huggingface-cli download lsr42/dyvo_data

Note:
- You may need to log in to Hugging Face before downloading:
huggingface-cli login
- The downloaded files will be cached locally. Refer to the Hugging Face CLI documentation for cache settings if needed.
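
As an alternative to the CLI, the same `huggingface_hub` package exposes a Python function, `snapshot_download`, that can place the files in a directory of your choice instead of the default cache. The following is only a minimal sketch: the helper name `download_dyvo_data`, the target directory, and the `repo_type` default are assumptions, not something this repository prescribes. If the repository is hosted as a dataset rather than a model on the Hub, `repo_type="dataset"` would be needed.

```python
# Repository ID taken from the CLI command above.
REPO_ID = "lsr42/dyvo_data"


def download_dyvo_data(local_dir="dyvo_data", repo_type=None):
    """Download the DyVo data into local_dir (hypothetical helper).

    repo_type=None lets huggingface_hub use its default repository type;
    pass "dataset" if the repository is hosted as a dataset (assumption).
    """
    # Imported lazily so the module can be loaded without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type=repo_type,
        local_dir=local_dir,
    )
```

Calling `download_dyvo_data()` from the repository root should return the path of the folder that holds the downloaded files.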
Queries and documents are accessible via ir_datasets.
Please refer to the ir_datasets website (https://ir-datasets.com/) for instructions on how to download them.
| Dataset | ir_datasets Key |
|---|---|
| Wapo | wapo/v2/trec-core-2018 |
| Robust04 | disks45/nocr/trec-robust-2004 |
| Codec | codec |
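
The keys in the table can be passed to `ir_datasets.load`. The sketch below shows one way to do that. The short names (`wapo`, `robust04`, `codec`) and the helpers `load_dataset` and `preview_queries` are hypothetical and are not part of this repository. Some corpora may need their source files obtained separately before documents can be read; in that case ir_datasets normally reports what is missing.

```python
from itertools import islice

# Keys copied from the table above; the short names are illustrative only.
DATASET_KEYS = {
    "wapo": "wapo/v2/trec-core-2018",
    "robust04": "disks45/nocr/trec-robust-2004",
    "codec": "codec",
}


def load_dataset(name):
    """Load a dataset by its short name (hypothetical helper)."""
    # Imported lazily; requires `pip install ir_datasets`.
    import ir_datasets

    return ir_datasets.load(DATASET_KEYS[name])


def preview_queries(name, n=3):
    """Print the first n query records of the chosen dataset."""
    dataset = load_dataset(name)
    for query in islice(dataset.queries_iter(), n):
        print(query)
```

For example, `preview_queries("codec")` would print the first three query records once the dataset is available locally.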
Example command to start training:
python -m lsr.train +experiment=qmlp_dmlm_emlm_laque_wapo_msmarco_pretrained_inparsv2_monot53b_distillation_l1_0.0_0.001_entw_0.05.yaml training_arguments.fp16=True

- The list of experiment configuration files can be found in the lsr/configs/experiment/ directory.
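
To see which names can be passed to `+experiment=`, the configuration directory can be listed and a command assembled for each file. This is a small sketch using only the standard library. The helpers `list_experiments` and `build_train_command` are hypothetical, and the sketch assumes it is run from the repository root and that the configuration files use the `.yaml` extension, as in the example command.

```python
from pathlib import Path


def list_experiments(config_dir="lsr/configs/experiment"):
    """Return the sorted file names of the YAML configs in config_dir."""
    return sorted(p.name for p in Path(config_dir).glob("*.yaml"))


def build_train_command(experiment, fp16=True):
    """Build the argument list for the training command (hypothetical helper)."""
    return [
        "python", "-m", "lsr.train",
        f"+experiment={experiment}",
        f"training_arguments.fp16={fp16}",
    ]


if __name__ == "__main__":
    # Print one ready-to-run command per experiment configuration.
    for name in list_experiments():
        print(" ".join(build_train_command(name)))
```

The argument list returned by `build_train_command` can also be handed to `subprocess.run` to launch a run from a script.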
If you find this repository helpful, please cite our paper:
@inproceedings{nguyen-etal-2024-dyvo,
title = "DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities",
author = "Nguyen, Thong and
Chatterjee, Shubham and
MacAvaney, Sean and
Mackie, Iain and
Dalton, Jeff and
Yates, Andrew",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024"
}