Report on extensive experiments here.
Europarl parallel corpus (11 languages, common part, release v3)
The common part was extracted by using the English sentences to determine the set of sentences that have a translation in all 11 languages. The extracted data has been checked and cleaned up.
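The extraction step can be pictured with a minimal sketch. It is not the script used to build this release: it assumes each language is available as a list of (English, foreign) sentence pairs and that English sentences are matched by exact string equality. The names `extract_common` and `bitexts` are hypothetical.

```python
from typing import Dict, List, Tuple

def extract_common(
    bitexts: Dict[str, List[Tuple[str, str]]]
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Keep only English sentences that have a translation in every language.

    bitexts maps a language code to its (english, foreign) sentence pairs.
    Returns the shared English sentences and, per language, the translations
    in the same order. Exact string matching is an assumption of this sketch.
    """
    if not bitexts:
        return [], {}

    # Per language: English sentence -> translation (first occurrence wins).
    lookups: Dict[str, Dict[str, str]] = {}
    for lang, pairs in bitexts.items():
        table: Dict[str, str] = {}
        for english, foreign in pairs:
            table.setdefault(english, foreign)
        lookups[lang] = table

    # English sentences present in every language pair.
    common = set.intersection(*(set(table) for table in lookups.values()))

    # Fix one order (that of the first language) so all outputs stay aligned.
    first_pairs = next(iter(bitexts.values()))
    ordered: List[str] = []
    seen = set()
    for english, _ in first_pairs:
        if english in common and english not in seen:
            seen.add(english)
            ordered.append(english)

    aligned = {lang: [table[en] for en in ordered] for lang, table in lookups.items()}
    return ordered, aligned
```

With two toy language pairs, only the English sentence shared by both survives, and every language ends up with the same number of lines.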
Training data: 347,614 lines
Development set: 500 lines
Test set: 38,123 lines
References: 1 reference per line in the test set.
| Language | Train: 347,614 lines | Dev: 500 lines | Test: 38,123 lines |
|---|---|---|---|
| Danish (da) | 9,458,365 | 13,981 | 1,040,819 |
| German (de) | 9,510,833 | 14,033 | 1,046,557 |
| Greek (el) | 9,997,176 | 14,587 | 1,100,255 |
| English (en) | 9,945,267 | 14,612 | 1,094,082 |
| Spanish (es) | 10,472,178 | 15,398 | 1,151,404 |
| Finnish (fi) | 7,179,991 | 10,546 | 789,206 |
| French (fr) | 10,955,901 | 16,157 | 1,204,527 |
| Italian (it) | 9,880,314 | 14,611 | 1,085,840 |
| Dutch (nl) | 10,013,958 | 14,645 | 1,101,028 |
| Portuguese (pt) | 10,287,116 | 15,256 | 1,129,898 |
| Swedish (sv) | 8,988,906 | 13,243 | 988,588 |
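
The table does not label its per-language figures. Assuming they are word (token) counts, which is an assumption and not stated above, dividing each figure by the number of lines in its split gives a rough average sentence length per language. A minimal sketch, with the hypothetical helper `per_line_averages`:

```python
# Lines per split, as listed above.
LINES = {"train": 347_614, "dev": 500, "test": 38_123}

# Per-language figures copied from the table (train, dev, test).
# Assumption: these are word/token counts; the source does not label them.
COUNTS = {
    "da": (9_458_365, 13_981, 1_040_819),
    "de": (9_510_833, 14_033, 1_046_557),
    "el": (9_997_176, 14_587, 1_100_255),
    "en": (9_945_267, 14_612, 1_094_082),
    "es": (10_472_178, 15_398, 1_151_404),
    "fi": (7_179_991, 10_546, 789_206),
    "fr": (10_955_901, 16_157, 1_204_527),
    "it": (9_880_314, 14_611, 1_085_840),
    "nl": (10_013_958, 14_645, 1_101_028),
    "pt": (10_287_116, 15_256, 1_129_898),
    "sv": (8_988_906, 13_243, 988_588),
}

def per_line_averages(counts, lines):
    """Return {language: {split: count / lines_in_split}}."""
    splits = ("train", "dev", "test")
    return {
        lang: {split: value / lines[split] for split, value in zip(splits, values)}
        for lang, values in counts.items()
    }

if __name__ == "__main__":
    for lang, averages in per_line_averages(COUNTS, LINES).items():
        summary = ", ".join(f"{split}={avg:.1f}" for split, avg in averages.items())
        print(f"{lang}: {summary}")
```

Under that assumption, Finnish has the fewest units per line and French the most.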