Authors: Linghao Jin, Jacqueline He, Jonathan May, Xuezhe Ma
This repository contains the code for our EMNLP 2023 paper, "Challenges in Context-Aware Neural Machine Translation".
Context-aware neural machine translation, a paradigm that involves leveraging information beyond sentence-level context to resolve inter-sentential discourse dependencies and improve document-level translation quality, has given rise to a number of recent techniques. However, despite well-reasoned intuitions, most context-aware translation models yield only modest improvements over sentence-level systems. In this work, we investigate and present several core challenges, relating to discourse phenomena, context usage, model architectures, and document-level evaluation, that impede progress within the field. To address these problems, we propose a more realistic setting for document-level translation, called paragraph-to-paragraph (Para2Para) translation, and collect a new dataset of Chinese-English novels to promote future research.
```bash
conda create -n canmt python=3.8
conda activate canmt
pip install -r requirements.txt
```

Note: We use fairseq 0.9.0 for compatibility with the Mega (Ma et al., 2022) architecture. To download the official version:
```bash
git clone https://github.com/facebookresearch/mega.git && cd mega
pip install --editable ./
```
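As an optional sanity check (not part of the repo's scripts), you can confirm which fairseq build is visible inside the environment:

```bash
# Print the fairseq version visible to Python; the note above pins 0.9.0.
python -c "import fairseq; print(fairseq.__version__)"
```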
We provide sentence counts for the train/valid/test splits of the datasets used in this paper below:

| Dataset | Lg. Pair | Train | Valid | Test | 
|---|---|---|---|---|
| BWB (Jiang et al., 2022) | Zh->En | 9,576,566 | 2,632 | 2,618 | 
| WMT17 (Bojar et al., 2017) | Zh->En | 25,134,743 | 2,002 | 2,001 | 
| IWSLT17 (Cettolo et al., 2012) | En<->Fr | 232,825 | 5,819 | 1,210 | 
| IWSLT17 (Cettolo et al., 2012) | En<->De | 206,112 | 5,431 | 1,080 | 
Pre-processed data for BWB and IWSLT-17 can be found here.
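If you prefer to binarize your own data rather than use the pre-processed files, a typical fairseq invocation is sketched below. The paths and options are illustrative assumptions, not the exact commands used to produce the released data.

```bash
# Sketch: binarize a raw parallel corpus for fairseq (paths are hypothetical)
fairseq-preprocess \
    --source-lang zh --target-lang en \
    --trainpref data/bwb/train --validpref data/bwb/valid --testpref data/bwb/test \
    --destdir data-bin/bwb \
    --workers 8
```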
Run training script
To train all zh->en models implemented in the paper, you can run the following script:

```bash
cd sh/zh-en
chmod +x train_all.sh
./train_all.sh
```

You can configure the hyper-parameters in train_all.sh accordingly. Models are saved to ckpt/.
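For orientation, the models are trained with fairseq, so a stripped-down sentence-level fairseq-train invocation might look like the sketch below. The architecture, hyper-parameters, and paths are placeholders for illustration, not the paper's exact settings; the real settings live in train_all.sh.

```bash
# Illustrative fairseq-train call (placeholder values; see train_all.sh for the actual settings)
fairseq-train data-bin/bwb \
    --arch transformer_vaswani_wmt_en_de_big \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --save-dir ckpt/zh-en-sent-baseline
```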
You can also train each model and setting separately using the following scripts. Note: N, M are source and target context sizes, respectively. Following Fernandes et al., 2021, our settings are 0-1 (denoted as 1-2 in the paper) and 1-1 (denoted as 2-2 in the paper).
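To make the context notation concrete, the toy sketch below shows what source-side inputs look like when one previous sentence is prepended as context (the source side of the 1-1 setting), using a <brk>-style separator in the spirit of Fernandes et al., 2021. This is only an illustration of the input format; the separator token and document-boundary handling are assumptions, not the repo's actual preprocessing.

```bash
# Toy illustration only: prepend one previous sentence as source context.
# Separator token and boundary handling are assumptions, not the repo's pipeline.
printf 's1\ns2\ns3\n' > toy.zh
awk 'NR==1 {print $0; prev=$0; next} {print prev " <brk> " $0; prev=$0}' toy.zh
#   s1
#   s1 <brk> s2
#   s2 <brk> s3
```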
Concatenation-based XFMR baseline (in concat_models):

```bash
cd sh/zh-en
chmod +x train_concat.sh
./train_concat.sh
```

Concatenation-based MEGA baseline (in concat_models):
```bash
cd sh/zh-en
chmod +x train_mega.sh
./train_mega.sh
```

To evaluate all trained zh->en models on BLEU, COMET, and BlonDe, you can run the following script:
```bash
cd sh/zh-en
chmod +x generate_all.sh
./generate_all.sh
```
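If you want to score a single output file outside of generate_all.sh, BLEU can be computed with the sacreBLEU CLI as sketched below (file names are placeholders). The comet-score command comes from the Unbabel COMET package and is an assumption about your local install; BlonDe is computed with the BlonDe package from Jiang et al., 2022 (listed at the bottom of this README).

```bash
# Standalone scoring sketch; hyp.en, ref.en, and src.zh are placeholder file names.
sacrebleu ref.en < hyp.en                    # corpus BLEU
comet-score -s src.zh -t hyp.en -r ref.en    # COMET (Unbabel COMET CLI; assumed installed)
# BlonDe: use the BlonDe package from Jiang et al., 2022.
```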
The Para2Para dataset is drawn from the following Chinese-English parallel novels:

| Title | Pub. Year | # Paragraphs | Avg. Para. Length | 
|---|---|---|---|
| Gone with the Wind (Margaret Mitchell) | 1936 | 3556 | 143 | 
| Rebecca (Daphne du Maurier) | 1938 | 1237 | 157 | 
| Alice’s Adventures in Wonderland (Lewis Carroll) | 1865 | 218 | 144 | 
| Foundation (Isaac Asimov) | 1951 | 3413 | 76 | 
| A Tale of Two Cities (Charles Dickens) | 1859 | 696 | 225 | 
| Twenty Thousand Leagues Under the Seas (Jules Verne) | 1870 | 1425 | 117 | 
We use the following backbone architectures for pre-training before fine-tuning on the Para2Para dataset:
- XFMR (Vaswani et al., 2017), the Transformer-BIG model
 - LIGHTCONV (Wu et al., 2019), which replaces the self-attention modules in the Transformer-BIG with fixed convolutions
 - MBART25 (Liu et al., 2020), which is pre-trained on 25 languages at the document level
 
To train on the Para2Para dataset:

```bash
cd sh/p2p
chmod +x train_all.sh
./train_all.sh
```

We provide the scripts to evaluate the pre-trained models on Para2Para without fine-tuning:
```bash
cd sh/p2p
chmod +x generate_pretrained.sh
./generate_pretrained.sh
```

To evaluate the fine-tuned models on Para2Para:
```bash
cd sh/p2p
chmod +x generate_finetuned.sh
./generate_finetuned.sh
```

This codebase builds on the following packages:

- contextual_mt package from Fernandes et al., 2021
 - BlonDe package from Jiang et al., 2022
 - MEGA package from Ma et al., 2023
 
If you find our work useful, please cite:

```bibtex
@inproceedings{jin2023challenges,
   title={Challenges in Context-Aware Neural Machine Translation},
   author={Jin, Linghao and He, Jacqueline and May, Jonathan and Ma, Xuezhe},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2023}
}
```