Tokuteikadai 2015A-063

Publications
Experimental Data

Report on extensive experiments here.

Europarl parallel corpus (11 languages, common part, release v3)

The common part was extracted by using the English sentences as a key to determine the set of sentences that have a translation in all 11 languages. The extracted data was then checked and cleaned up.
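The extraction described above can be sketched as follows. This is only an illustration under stated assumptions, not the script actually used to build the data: it assumes each English-to-foreign language pair is already available as a list of (English, foreign) sentence tuples, and the function name `extract_common` is hypothetical.

```python
from typing import Dict, List, Tuple


def extract_common(
    pairs: Dict[str, List[Tuple[str, str]]]
) -> Dict[str, List[str]]:
    """Keep only the sentences whose English side occurs in every language pair.

    `pairs` maps a language code (e.g. "fr") to aligned (english, foreign)
    sentence tuples. This input layout is an assumption made for the sketch.
    """
    if not pairs:
        return {"en": []}

    # One lookup table per language: English sentence -> translation.
    # If an English sentence occurs twice, the last translation wins.
    lookups = {lang: dict(aligned) for lang, aligned in pairs.items()}

    # English sentences that have a translation in all languages.
    common = set.intersection(*(set(table) for table in lookups.values()))

    # Keep the order of the first language's file, dropping duplicates.
    first_lang = next(iter(pairs))
    ordered: List[str] = []
    seen = set()
    for english, _ in pairs[first_lang]:
        if english in common and english not in seen:
            seen.add(english)
            ordered.append(english)

    # Emit one line-aligned list per language, plus the English side.
    result = {"en": ordered}
    for lang, table in lookups.items():
        result[lang] = [table[english] for english in ordered]
    return result


demo = {
    "fr": [("hello", "bonjour"), ("thanks", "merci")],
    "de": [("thanks", "danke"), ("hello", "hallo"), ("yes", "ja")],
}
common_part = extract_common(demo)
```

In this toy input, "yes" is dropped because it has no French translation, and the remaining lines stay aligned across all languages.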

Number of lines:

Training data: 347,614 lines

Development set: 500 lines

Test set: 38,123 lines

References: 1 reference per line in the test set.

Number of words in all sets in all languages:

Language	Train (347,614 lines)	Dev (500 lines)	Test (38,123 lines)
Danish (da)	9,458,365	13,981	1,040,819
German (de)	9,510,833	14,033	1,046,557
Greek (el)	9,997,176	14,587	1,100,255
English (en)	9,945,267	14,612	1,094,082
Spanish (es)	10,472,178	15,398	1,151,404
Finnish (fi)	7,179,991	10,546	789,206
French (fr)	10,955,901	16,157	1,204,527
Italian (it)	9,880,314	14,611	1,085,840
Dutch (nl)	10,013,958	14,645	1,101,028
Portuguese (pt)	10,287,116	15,256	1,129,898
Swedish (sv)	8,988,906	13,243	988,588
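Since every language shares the same number of lines, the word counts can be turned into an average sentence length per language. The following is a minimal sketch that does this for the training set; the figures are copied from the table above, while the variable and function names are illustrative only.

```python
# Training-set figures copied from the table above.
TRAIN_LINES = 347_614
TRAIN_WORDS = {
    "da": 9_458_365,
    "de": 9_510_833,
    "el": 9_997_176,
    "en": 9_945_267,
    "es": 10_472_178,
    "fi": 7_179_991,
    "fr": 10_955_901,
    "it": 9_880_314,
    "nl": 10_013_958,
    "pt": 10_287_116,
    "sv": 8_988_906,
}


def avg_words_per_line(words: int, lines: int) -> float:
    """Average number of words per line (sentence)."""
    return words / lines


averages = {
    lang: avg_words_per_line(count, TRAIN_LINES)
    for lang, count in TRAIN_WORDS.items()
}

# Print the languages from shortest to longest average sentence.
for lang, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{lang}\t{avg:.2f}")
```

Finnish has the fewest words per line and French the most, which follows directly from the totals in the table.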

External Links

Contact

EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287
