
Commit ac66df4

Myle Ott authored and facebook-github-bot committed

Update README

Summary: Pull Request resolved: fairinternal/fairseq-py#826
Differential Revision: D16830402
Pulled By: myleott
fbshipit-source-id: 25afaa6d9de7b51cc884e3f417c8e6b349f5a7bc
1 parent 1d44cc8 commit ac66df4

4 files changed: +129, -100 lines changed


examples/roberta/README.md
Lines changed: 38 additions & 12 deletions

@@ -2,15 +2,15 @@
 
 https://arxiv.org/abs/1907.11692
 
-### Introduction
+## Introduction
 
 RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.
 
 ### What's New:
 
 - August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).
 
-### Pre-trained models
+## Pre-trained models
 
 Model | Description | # params | Download
 ---|---|---|---
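A note on the intro paragraph above: "dynamically changing the masking pattern" means the masked positions are re-sampled every time a sequence is fed to the model, rather than being fixed once during preprocessing as in the original BERT setup. The snippet below is only a toy sketch of that idea in plain Python, not fairseq's actual masking code:

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="<mask>"):
    # Re-sample which positions are masked on every call, so each epoch
    # sees a different masking pattern for the same sentence.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(random.sample(range(len(tokens)), num_to_mask))
    return [mask_token if i in masked_positions else t for i, t in enumerate(tokens)]

sentence = "RoBERTa re-samples its masking pattern every epoch".split()
for epoch in range(3):
    print(dynamic_mask(sentence))  # different positions are masked each time
```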
@@ -19,36 +19,62 @@ Model | Description | # params | Download
 `roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
 `roberta.large.wsc` | `roberta.large` finetuned on [WSC](wsc/README.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)
 
-### Results
+## Results
 
-##### Results on GLUE tasks (dev set, single model, single-task finetuning)
+**[GLUE (Wang et al., 2019)](https://gluebenchmark.com/)**
+_(dev set, single model, single-task finetuning)_
 
 Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
 ---|---|---|---|---|---|---|---|---
 `roberta.base` | 87.6 | 92.8 | 91.9 | 78.7 | 94.8 | 90.2 | 63.6 | 91.2
 `roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4
 `roberta.large.mnli` | 90.2 | - | - | - | - | - | - | -
 
-##### Results on SuperGLUE tasks (dev set, single model, single-task finetuning)
+**[SuperGLUE (Wang et al., 2019)](https://super.gluebenchmark.com/)**
+_(dev set, single model, single-task finetuning)_
 
 Model | BoolQ | CB | COPA | MultiRC | RTE | WiC | WSC
 ---|---|---|---|---|---|---|---
 `roberta.large` | 86.9 | 98.2 | 94.0 | 85.7 | 89.5 | 75.6 | -
 `roberta.large.wsc` | - | - | - | - | - | - | 91.3
 
-##### Results on SQuAD (dev set)
+**[SQuAD (Rajpurkar et al., 2018)](https://rajpurkar.github.io/SQuAD-explorer/)**
+_(dev set, no additional data used)_
 
 Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1
 ---|---|---
 `roberta.large` | 88.9/94.6 | 86.5/89.4
 
-##### Results on Reading Comprehension (RACE, test set)
+**[RACE (Lai et al., 2017)](http://www.qizhexie.com/data/RACE_leaderboard.html)**
+_(test set)_
 
 Model | Accuracy | Middle | High
 ---|---|---|---
 `roberta.large` | 83.2 | 86.5 | 81.3
 
-### Example usage
+**[HellaSwag (Zellers et al., 2019)](https://rowanzellers.com/hellaswag/)**
+_(test set)_
+
+Model | Overall | In-domain | Zero-shot | ActivityNet | WikiHow
+---|---|---|---|---|---
+`roberta.large` | 85.2 | 87.3 | 83.1 | 74.6 | 90.9
+
+**[Commonsense QA (Talmor et al., 2019)](https://www.tau-nlp.org/commonsenseqa)**
+_(test set)_
+
+Model | Accuracy
+---|---
+`roberta.large` (single model) | 72.1
+`roberta.large` (ensemble) | 72.5
+
+**[Winogrande (Sakaguchi et al., 2019)](https://arxiv.org/abs/1907.10641)**
+_(test set)_
+
+Model | Accuracy
+---|---
+`roberta.large` | 78.1
+
+## Example usage
 
 ##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
 ```python
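The hunk above ends just as the torch.hub code block opens, so the example itself is not visible in this diff. For reference, loading RoBERTa through fairseq's hub interface typically looks like the sketch below (the printed shape is illustrative):

```python
import torch

# Download and load the pretrained model (requires PyTorch >= 1.1).
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout for evaluation

# Apply the BPE encoding and extract features from the last layer.
tokens = roberta.encode('Hello world!')
features = roberta.extract_features(tokens)
print(features.shape)  # e.g. torch.Size([1, 5, 1024]) for roberta.large
```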
@@ -124,7 +150,7 @@ roberta.cuda()
 roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
 ```
 
-### Advanced usage
+## Advanced usage
 
 #### Filling masks:
 
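The "Filling masks" example is likewise cut off by the hunk boundary. A minimal sketch of the masked-LM interface, assuming `roberta` was loaded via torch.hub as above (the exact return format can vary between fairseq versions):

```python
# <mask> marks the position RoBERTa should fill in; topk controls how many
# candidate completions are returned, ranked by model probability.
predictions = roberta.fill_mask('The first Star Wars film was released in <mask>.', topk=3)
for filled_sentence, score, *rest in predictions:
    print(f'{score:.3f}  {filled_sentence}')
```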
@@ -216,19 +242,19 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
 # Expected output: 0.9060
 ```
 
-### Finetuning
+## Finetuning
 
 - [Finetuning on GLUE](README.glue.md)
 - [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md)
 - [Finetuning on Winograd Schema Challenge (WSC)](wsc/README.md)
 - [Finetuning on Commonsense QA (CQA)](commonsense_qa/README.md)
 - Finetuning on SQuAD: coming soon
 
-### Pretraining using your own data
+## Pretraining using your own data
 
 See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).
 
-### Citation
+## Citation
 
 ```bibtex
 @article{liu2019roberta,
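Before the next file: the finetuning recipes linked in this hunk produce checkpoints that are used the same way as the released ones. Below is a sketch of sentence-pair classification with the MNLI-finetuned checkpoint from the pre-trained models table; the label index mapping is an assumption, so check the task's label dictionary:

```python
import torch

# Load the MNLI-finetuned checkpoint listed in the pre-trained models table.
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()

# Encode a premise/hypothesis pair and score it with the 'mnli' head.
tokens = roberta.encode(
    'RoBERTa is pretrained on more data than BERT.',
    'RoBERTa uses less pretraining data than BERT.',
)
label_index = roberta.predict('mnli', tokens).argmax().item()
print(label_index)  # index into the MNLI label set (contradiction/neutral/entailment)
```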

examples/roberta/README.pretraining.md
Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@
 
 This tutorial will walk you through pretraining RoBERTa over your own data.
 
-### 1) Preprocess the data.
+### 1) Preprocess the data
 
 Data should be preprocessed following the [language modeling format](/examples/language_model).
 
examples/scaling_nmt/README.md
Lines changed: 24 additions & 12 deletions

@@ -11,45 +11,57 @@ Model | Description | Dataset | Download
 
 ## Training a new model on WMT'16 En-De
 
-Please first download the [preprocessed WMT'16 En-De data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8).
+First download the [preprocessed WMT'16 En-De data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8).
+
 Then:
 
-1. Extract the WMT'16 En-De data:
+##### 1. Extract the WMT'16 En-De data
 ```bash
 TEXT=wmt16_en_de_bpe32k
 mkdir -p $TEXT
 tar -xzvf wmt16_en_de.tar.gz -C $TEXT
 ```
 
-2. Preprocess the dataset with a joined dictionary:
+##### 2. Preprocess the dataset with a joined dictionary
 ```bash
-fairseq-preprocess --source-lang en --target-lang de \
+fairseq-preprocess \
+    --source-lang en --target-lang de \
     --trainpref $TEXT/train.tok.clean.bpe.32000 \
     --validpref $TEXT/newstest2013.tok.bpe.32000 \
     --testpref $TEXT/newstest2014.tok.bpe.32000 \
     --destdir data-bin/wmt16_en_de_bpe32k \
     --nwordssrc 32768 --nwordstgt 32768 \
-    --joined-dictionary
+    --joined-dictionary \
+    --workers 20
 ```
 
-3. Train a model:
+##### 3. Train a model
 ```bash
-fairseq-train data-bin/wmt16_en_de_bpe32k \
+fairseq-train \
+    data-bin/wmt16_en_de_bpe32k \
     --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
     --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
-    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
-    --lr 0.0005 --min-lr 1e-09 \
-    --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
+    --dropout 0.3 --weight-decay 0.0 \
+    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
     --max-tokens 3584 \
     --fp16
 ```
 
-Note that the `--fp16` flag requires you have CUDA 9.1 or greater and a Volta GPU.
+Note that the `--fp16` flag requires you have CUDA 9.1 or greater and a Volta GPU or newer.
 
 If you want to train the above model with big batches (assuming your machine has 8 GPUs):
-- add `--update-freq 16` to simulate training on 8*16=128 GPUs
+- add `--update-freq 16` to simulate training on 8x16=128 GPUs
 - increase the learning rate; 0.001 works well for big batches
 
+##### 4. Evaluate
+```bash
+fairseq-generate \
+    data-bin/wmt16_en_de_bpe32k \
+    --path checkpoints/checkpoint_best.pt \
+    --beam 4 --lenpen 0.6 --remove-bpe
+```
+
 ## Citation
 
 ```bibtex
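Once a checkpoint exists (either trained as above or the released WMT'16 En-De model from the table at the top of this README), it can also be queried from Python. This is a sketch only; the hub model name `transformer.wmt16.en-de` and the tokenizer/BPE settings are assumptions, so check the pre-trained models table before relying on them:

```python
import torch

# Assumed hub name for the released WMT'16 En-De transformer; the
# tokenizer/BPE arguments are assumed to match how the data was prepared.
en2de = torch.hub.load(
    'pytorch/fairseq', 'transformer.wmt16.en-de',
    tokenizer='moses', bpe='subword_nmt',
)
en2de.eval()

print(en2de.translate('Machine learning is great!'))  # German translation
```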
