Commit 49177c9

nng555 authored and facebook-github-bot committed
Backward reranking public (#667)
Summary: Implementation of noisy channel model reranking for release with paper
Pull Request resolved: fairinternal/fairseq-py#667
Reviewed By: michaelauli
Differential Revision: D15901665
Pulled By: nng555
fbshipit-source-id: 2de2c518be8e5828ffad72db3e741b0940623373
1 parent ac66df4 commit 49177c9

13 files changed: +1629 -3 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -116,3 +116,6 @@ fairseq/modules/*_layer/*_backward.cu
 
 # data
 data-bin/
+
+# reranking
+examples/reranking/rerank_data

eval_lm.py

Lines changed: 4 additions & 2 deletions
@@ -146,8 +146,9 @@ def main(parsed_args):
             hypos = scorer.generate(models, sample)
             gen_timer.stop(sample['ntokens'])
 
-            for hypos_i in hypos:
+            for i, hypos_i in enumerate(hypos):
                 hypo = hypos_i[0]
+                sample_id = sample['id'][i]
 
                 tokens = hypo['tokens']
                 tgt_len = tokens.numel()
@@ -199,7 +200,8 @@ def main(parsed_args):
                             is_bpe = False
                             w = ''
                     if args.output_word_probs:
-                        print('\t'.join('{} [{:2f}]'.format(x[0], x[1]) for x in word_prob))
+                        print(str(int(sample_id)) + " " +
+                              ('\t'.join('{} [{:2f}]'.format(x[0], x[1]) for x in word_prob)))
 
             wps_meter.update(sample['ntokens'])
             t.log({'wps': round(wps_meter.avg)})
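
With this change, each line printed when `args.output_word_probs` is set starts with the integer sample id and a single space, followed by tab-separated `word [score]` fields. Note that the format spec `{:2f}` sets a minimum field width of 2 and keeps the default six decimal places; it is not a two-decimal format. The helper below is a minimal sketch for reading such lines back. The name `parse_word_probs_line` is hypothetical and not part of this commit, and the sketch assumes that words contain no tab characters.

```python
def parse_word_probs_line(line):
    """Parse one word-probability line into (sample_id, [(word, score), ...]).

    Expected shape (per the print statement in the diff above):
        "<sample_id> <word> [<score>]\t<word> [<score>]..."
    """
    # The sample id is everything before the first space.
    sample_id, _, rest = line.rstrip("\n").partition(" ")
    pairs = []
    for field in rest.split("\t"):
        if not field:
            continue  # tolerate an empty hypothesis
        # Split on the last " [" so words containing "[" still parse.
        word, _, score = field.rpartition(" [")
        pairs.append((word, float(score.rstrip("]"))))
    return int(sample_id), pairs


# Build a line the same way the patched print statement does.
word_prob = [("hello", -1.5), ("world", -0.25)]
example_line = str(int(7)) + " " + "\t".join(
    "{} [{:2f}]".format(x[0], x[1]) for x in word_prob
)
parsed_id, parsed_pairs = parse_word_probs_line(example_line)
```

Keeping the sample id on each line lets a downstream script match language-model scores to the hypotheses they belong to, even when batching reorders the samples.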

examples/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# Copyright (c) 2017-present, Facebook, Inc.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the LICENSE file in
+# the root directory of this source tree. An additional grant of patent rights
+# can be found in the PATENTS file in the same directory.
+
+__version__ = '0.7.2'
+
+import examples.noisychannel  # noqa

examples/noisychannel/README.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+# Simple and Effective Noisy Channel Modeling for Neural Machine Translation (Yee et al., 2019)
+This page contains pointers to pre-trained models as well as instructions on how to run the reranking scripts.
+
+## Citation:
+```bibtex
+@inproceedings{yee2018simple,
+  title = {Simple and Effective Noisy Channel Modeling for Neural Machine Translation},
+  author = {Kyra Yee and Yann Dauphin and Michael Auli},
+  booktitle = {Conference on Empirical Methods in Natural Language Processing},
+  year = {2019},
+}
+```
+
+## Pre-trained Models:
+
+Model | Description | Download
+---|---|---
+`transformer.noisychannel.de-en` | De->En Forward Model | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/noisychannel/forward_de2en.tar.bz2)
+`transformer.noisychannel.en-de` | En->De Channel Model | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/noisychannel/backward_en2de.tar.bz2)
+`transformer_lm.noisychannel.en` | En Language model | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/noisychannel/reranking_en_lm.tar.bz2)
+
+Test Data: [newstest_wmt17](https://dl.fbaipublicfiles.com/fairseq/models/noisychannel/wmt17test.tar.bz2)
+
+## Example usage
+
+```
+mkdir rerank_example
+curl https://dl.fbaipublicfiles.com/fairseq/models/noisychannel/forward_de2en.tar.bz2 | tar xvjf - -C rerank_example
+curl https://dl.fbaipublicfiles.com/fairseq/models/noisychannel/backward_en2de.tar.bz2 | tar xvjf - -C rerank_example
+curl https://dl.fbaipublicfiles.com/fairseq/models/noisychannel/reranking_en_lm.tar.bz2 | tar xvjf - -C rerank_example
+curl https://dl.fbaipublicfiles.com/fairseq/models/noisychannel/wmt17test.tar.bz2 | tar xvjf - -C rerank_example
+
+beam=50
+num_trials=1000
+fw_name=fw_model_ex
+bw_name=bw_model_ex
+lm_name=lm_ex
+data_dir=rerank_example/hyphen-splitting-mixed-case-wmt17test-wmt14bpe
+data_dir_name=wmt17
+lm=rerank_example/lm/checkpoint_best.pt
+lm_bpe_code=rerank_example/lm/bpe32k.code
+lm_dict=rerank_example/lm/dict.txt
+batch_size=32
+bw=rerank_example/backward_en2de.pt
+fw=rerank_example/forward_de2en.pt
+
+# reranking with P(T|S), P(S|T) and P(T)
+python examples/noisychannel/rerank_tune.py $data_dir --tune-param lenpen weight1 weight3 \
+    --lower-bound 0 0 0 --upper-bound 3 3 3 --data-dir-name $data_dir_name \
+    --num-trials $num_trials --source-lang de --target-lang en --gen-model $fw \
+    -n $beam --batch-size $batch_size --score-model2 $fw --score-model1 $bw \
+    --backwards1 --weight2 1 \
+    -lm $lm --lm-dict $lm_dict --lm-name en_newscrawl --lm-bpe-code $lm_bpe_code \
+    --model2-name $fw_name --model1-name $bw_name --gen-model-name $fw_name
+
+# reranking with P(T|S) and P(T)
+python examples/noisychannel/rerank_tune.py $data_dir --tune-param lenpen weight3 \
+    --lower-bound 0 0 --upper-bound 3 3 --data-dir-name $data_dir_name \
+    --num-trials $num_trials --source-lang de --target-lang en --gen-model $fw \
+    -n $beam --batch-size $batch_size --score-model1 $fw \
+    -lm $lm --lm-dict $lm_dict --lm-name en_newscrawl --lm-bpe-code $lm_bpe_code \
+    --model1-name $fw_name --gen-model-name $fw_name
+
+# to run with a preconfigured set of hyperparameters for lenpen and the model weights, use rerank.py instead
+python examples/noisychannel/rerank.py $data_dir \
+    --lenpen 0.269 --weight1 1 --weight2 0.929 --weight3 0.831 \
+    --data-dir-name $data_dir_name --source-lang de --target-lang en --gen-model $fw \
+    -n $beam --batch-size $batch_size --score-model2 $fw --score-model1 $bw --backwards1 \
+    -lm $lm --lm-dict $lm_dict --lm-name en_newscrawl --lm-bpe-code $lm_bpe_code \
+    --model2-name $fw_name --model1-name $bw_name --gen-model-name $fw_name
+```
+
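
The `--weight1`, `--weight2`, `--weight3` and `--lenpen` options in the README commands suggest a weighted combination of the three model scores, tuned within the given lower and upper bounds over a fixed number of trials. The sketch below illustrates that general idea only. It is an assumption about the technique, not the code in `rerank.py` or `rerank_tune.py`: every function name, the dictionary keys, the way the length penalty is applied, and the use of uniform random search are hypothetical.

```python
import random


def combined_score(model1_lp, model2_lp, lm_lp, tgt_len,
                   weight1=1.0, weight2=1.0, weight3=1.0, lenpen=0.0):
    """Weighted sum of log-probabilities with a length penalty (illustrative)."""
    total = weight1 * model1_lp + weight2 * model2_lp + weight3 * lm_lp
    # Assumed form of the penalty: divide by target length raised to lenpen.
    return total / (tgt_len ** lenpen)


def rerank(hypotheses, **params):
    """Pick the hypothesis (a dict of scores) with the highest combined score."""
    return max(
        hypotheses,
        key=lambda h: combined_score(
            h["model1"], h["model2"], h["lm"], h["len"], **params
        ),
    )


def random_search(objective, names, lower, upper, num_trials, seed=0):
    """Sample each tuned parameter uniformly within its bounds; keep the best."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(num_trials):
        trial = {n: rng.uniform(lo, hi) for n, lo, hi in zip(names, lower, upper)}
        value = objective(trial)
        if value > best_value:
            best_params, best_value = trial, value
    return best_params, best_value


# Toy n-best list: "model1" stands in for log P(S|T), "model2" for log P(T|S),
# and "lm" for log P(T). The numbers are made up for illustration.
nbest = [
    {"text": "hyp a", "model1": -6.0, "model2": -2.0, "lm": -3.0, "len": 4},
    {"text": "hyp b", "model1": -2.0, "model2": -3.0, "lm": -2.0, "len": 4},
]
best = rerank(nbest, weight1=1.0, weight2=1.0, weight3=1.0, lenpen=0.0)
forward_only = rerank(nbest, weight1=0.0, weight2=1.0, weight3=0.0, lenpen=0.0)
```

In this toy list the forward score alone prefers `hyp a`, while adding the channel and language-model terms switches the choice to `hyp b`. In a tuning run, the `objective` passed to `random_search` would rerank a validation set with the trial parameters and return a quality metric; in the first README command, `weight2` stays fixed at 1 while `lenpen`, `weight1` and `weight3` are tuned within [0, 3].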

examples/noisychannel/__init__.py

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Copyright (c) 2017-present, Facebook, Inc.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the LICENSE file in
+# the root directory of this source tree. An additional grant of patent rights
+# can be found in the PATENTS file in the same directory.
+
+from .rerank_options import *
