EBMT / NLP Laboratory

Graduate School of Information, Production and Systems, Waseda University

Tokuteikadai 2015A-063

Report on extensive experiments here.

Europarl parallel corpus (11 languages, common part, release v3)

The common part was extracted using the English sentences to determine the set of sentences that have a translation in all 11 languages. The extracted data has been checked and cleaned up.
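The extraction described above can be pictured with a minimal sketch. It assumes each language is available as a list of (English, translation) sentence pairs and keeps only the English sentences found in every list. The function name `common_part` and the input layout are hypothetical; the laboratory's actual extraction, checking and clean-up procedure is not given on this page.

```python
from typing import Dict, List, Tuple


def common_part(pairs: Dict[str, List[Tuple[str, str]]]) -> List[Dict[str, str]]:
    """Keep only English sentences that have a translation in every language.

    `pairs` maps a language code to a list of (English, translation) pairs.
    This layout is an assumption made for illustration.
    """
    if not pairs:
        return []
    # One lookup table per language: English sentence -> translation.
    # If an English sentence occurs several times, the last translation wins.
    tables = {lang: dict(bitext) for lang, bitext in pairs.items()}
    # English sentences shared by all language pairs.
    shared = set.intersection(*(set(table) for table in tables.values()))
    rows = []
    for en in sorted(shared):
        row = {"en": en}
        for lang, table in tables.items():
            row[lang] = table[en]
        rows.append(row)
    return rows
```

Each returned row holds one English sentence together with its translation in every other language, so all languages end up with the same number of lines.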

Number of lines:

Training data: 347,614 lines

Development set: 500 lines

Test set: 38,123 lines

References: 1 reference per line in the test set.

Number of words in all sets in all languages:

Language          Train (347,614 lines)  Dev (500 lines)  Test (38,123 lines)
Danish (da)                   9,458,365           13,981            1,040,819
German (de)                   9,510,833           14,033            1,046,557
Greek (el)                    9,997,176           14,587            1,100,255
English (en)                  9,945,267           14,612            1,094,082
Spanish (es)                 10,472,178           15,398            1,151,404
Finnish (fi)                  7,179,991           10,546              789,206
French (fr)                  10,955,901           16,157            1,204,527
Italian (it)                  9,880,314           14,611            1,085,840
Dutch (nl)                   10,013,958           14,645            1,101,028
Portuguese (pt)              10,287,116           15,256            1,129,898
Swedish (sv)                  8,988,906           13,243              988,588
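
Line and word counts of this kind can be checked with a short script. The sketch below assumes that a word is a whitespace-separated token; the page does not state how the counts above were obtained, so results may differ slightly. The function name `count_lines_and_words` is hypothetical.

```python
from typing import Iterable, Tuple


def count_lines_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of lines, number of whitespace-separated tokens)."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        # str.split() with no argument splits on any run of whitespace
        # and yields no tokens for an empty line.
        n_words += len(line.split())
    return n_lines, n_words
```

An open text file can be passed directly as `lines`, since iterating over a file object yields one line at a time.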

Contact

EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287