Authors: Linghao Jin, Jacqueline He, Jonathan May, Xuezhe Ma
This repository contains the code for our EMNLP 2023 paper, "Challenges in Context-Aware Neural Machine Translation".
Context-aware neural machine translation, a paradigm that involves leveraging information beyond sentence-level context to resolve inter-sentential discourse dependencies and improve document-level translation quality, has given rise to a number of recent techniques. However, despite well-reasoned intuitions, most context-aware translation models yield only modest improvements over sentence-level systems. In this work, we investigate and present several core challenges, relating to discourse phenomena, context usage, model architectures, and document-level evaluation, that impede progress within the field. To address these problems, we propose a more realistic setting for document-level translation, called paragraph-to-paragraph (Para2Para) translation, and collect a new dataset of Chinese-English novels to promote future research.
```bash
conda create -n canmt python=3.8
conda activate canmt
pip install -r requirements.txt
```

Note: We use fairseq 0.9.0 for compatibility with the Mega (Ma et al., 2022) architecture. To download the official version:
```bash
git clone https://github.com/facebookresearch/mega.git && cd mega
pip install --editable ./
```
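As an optional sanity check (not part of the repo's scripts), you can confirm which fairseq build is visible inside the environment:

```bash
# Print the fairseq version visible to Python; the note above pins 0.9.0.
python -c "import fairseq; print(fairseq.__version__)"
```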
We provide sentence counts for the train/valid/test splits of the datasets used in this paper below:

| Dataset | Lg. Pair | Train | Valid | Test | 
|---|---|---|---|---|
| BWB (Jiang et al., 2022) | Zh->En | 9,576,566 | 2,632 | 2,618 | 
| WMT17 (Bojar et al., 2017) | Zh->En | 25,134,743 | 2,002 | 2,001 | 
| IWSLT17 (Cettolo et al., 2012) | En<->Fr | 232,825 | 5,819 | 1,210 | 
| IWSLT17 (Cettolo et al., 2012) | En<->De | 206,112 | 5,431 | 1,080 | 
Pre-processed data for BWB and IWSLT-17 can be found here.
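If you prefer to binarize your own data rather than use the pre-processed files, a typical fairseq invocation is sketched below. The paths and options are illustrative assumptions, not the exact commands used to produce the released data.

```bash
# Sketch: binarize a raw parallel corpus for fairseq (paths are hypothetical)
fairseq-preprocess \
    --source-lang zh --target-lang en \
    --trainpref data/bwb/train --validpref data/bwb/valid --testpref data/bwb/test \
    --destdir data-bin/bwb \
    --workers 8
```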
Run training script
To train all zh->en models implemented in the paper, you can run the following script:

```bash
cd sh/zh-en
chmod +x train_all.sh
./train_all.sh
```

You can configure the hyper-parameters in train_all.sh accordingly. Models are saved to ckpt/.
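For orientation, the models are trained with fairseq, so a stripped-down sentence-level fairseq-train invocation might look like the sketch below. The architecture, hyper-parameters, and paths are placeholders for illustration, not the paper's exact settings; the real settings live in train_all.sh.

```bash
# Illustrative fairseq-train call (placeholder values; see train_all.sh for the actual settings)
fairseq-train data-bin/bwb \
    --arch transformer_vaswani_wmt_en_de_big \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --save-dir ckpt/zh-en-sent-baseline
```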
You can also train each model and setting separately using the following scripts. Note: N, M are source and target context sizes, respectively. Following Fernandes et al., 2021, our settings are 0-1 (denoted as 1-2 in the paper) and 1-1 (denoted as 2-2 in the paper).
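To make the context notation concrete, the toy sketch below shows what source-side inputs look like when one previous sentence is prepended as context (the source side of the 1-1 setting), using a <brk>-style separator in the spirit of Fernandes et al., 2021. This is only an illustration of the input format; the separator token and document-boundary handling are assumptions, not the repo's actual preprocessing.

```bash
# Toy illustration only: prepend one previous sentence as source context.
# Separator token and boundary handling are assumptions, not the repo's pipeline.
printf 's1\ns2\ns3\n' > toy.zh
awk 'NR==1 {print $0; prev=$0; next} {print prev " <brk> " $0; prev=$0}' toy.zh
#   s1
#   s1 <brk> s2
#   s2 <brk> s3
```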
Concatenation-based XFMR baseline (in concat_models):

```bash
cd sh/zh-en
chmod +x train_concat.sh
./train_concat.sh
```

Concatenation-based MEGA baseline (in concat_models):
```bash
cd sh/zh-en
chmod +x train_mega.sh
./train_mega.sh
```

To evaluate all trained zh->en models on BLEU, COMET, and BlonDe, you can run the following script:
```bash
cd sh/zh-en
chmod +x generate_all.sh
./generate_all.sh
```
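If you want to score a single output file outside of generate_all.sh, BLEU can be computed with the sacreBLEU CLI as sketched below (file names are placeholders). The comet-score command comes from the Unbabel COMET package and is an assumption about your local install; BlonDe is computed with the BlonDe package from Jiang et al., 2022 (listed at the bottom of this README).

```bash
# Standalone scoring sketch; hyp.en, ref.en, and src.zh are placeholder file names.
sacrebleu ref.en < hyp.en                    # corpus BLEU
comet-score -s src.zh -t hyp.en -r ref.en    # COMET (Unbabel COMET CLI; assumed installed)
# BlonDe: use the BlonDe package from Jiang et al., 2022.
```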
The Para2Para dataset is drawn from the following Chinese-English parallel novels:

| Title | Pub. Year | # Paragraphs | Avg. Para. Length | 
|---|---|---|---|
| Gone with the Wind (Margaret Mitchell) | 1936 | 3556 | 143 | 
| Rebecca (Daphne du Maurier) | 1938 | 1237 | 157 | 
| Alice’s Adventures in Wonderland (Lewis Carroll) | 1865 | 218 | 144 | 
| Foundation (Isaac Asimov) | 1951 | 3413 | 76 | 
| A Tale of Two Cities (Charles Dickens) | 1859 | 696 | 225 | 
| Twenty Thousand Leagues Under the Seas (Jules Verne) | 1870 | 1425 | 117 | 
We use the following backbone architectures for pre-training before fine-tuning on the Para2Para dataset:
- XFMR (Vaswani et al., 2017), the Transformer-BIG model
 - LIGHTCONV (Wu et al., 2019), which replaces the self-attention modules in the Transformer-BIG with fixed convolutions
 - MBART25 (Liu et al., 2020), which is pre-trained on 25 languages at the document level
 
To train on the Para2Para dataset:

```bash
cd sh/p2p
chmod +x train_all.sh
./train_all.sh
```

We provide the scripts to evaluate the pre-trained models on Para2Para without fine-tuning:
```bash
cd sh/p2p
chmod +x generate_pretrained.sh
./generate_pretrained.sh
```

To evaluate the fine-tuned models on Para2Para:
```bash
cd sh/p2p
chmod +x generate_finetuned.sh
./generate_finetuned.sh
```

This codebase builds on the following packages:

- contextual_mt package from Fernandes et al., 2021
 - BlonDe package from Jiang et al., 2022
 - MEGA package from Ma et al., 2023
 
If you find our work useful, please cite:

```bibtex
@inproceedings{jin2023challenges,
   title={Challenges in Context-Aware Neural Machine Translation},
   author={Jin, Linghao and He, Jacqueline and May, Jonathan and Ma, Xuezhe},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2023}
}
```