EBMT / NLP Laboratory

Graduate School of Information, Production and Systems, Waseda University

Kakenhi 23500187

Outline of Grants-in-Aid for Scientific Research

Grants-in-Aid are awarded to promote creative and pioneering research across a wide spectrum of scientific fields, ranging from the humanities and social sciences to the natural sciences. Grants are awarded to projects organized by individual researchers or research groups at Japanese universities or research institutes engaged in basic research, particularly research in critical fields attuned to advanced research trends. Research results obtained under these Grants-in-Aid are widely published in academic journals.

This project is a Kakenhi project (Kiban C 23500187) entitled "Improvement of alignment for statistical and example-based machine translation and release of multilingual grammatical patterns."

Aims

Improve the production of translation tables in the following respects, mainly using the sub-sentential sampling-based alignment method (Anymalign):

  1. Production of ad-hoc translation tables
  2. Production of longer n-grams in translation tables
  3. Production of rule-based translation tables
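As a rough illustration of the sampling-based idea behind Anymalign (a toy sketch, not the actual tool): random subcorpora are drawn repeatedly, and source and target words that occur in exactly the same sentences of a sample are taken as translation candidates; counts accumulated over many samples yield translation table entries. The function name and parameters below are hypothetical, and only single words (1-grams) are handled:

```python
import random
from collections import defaultdict

def sample_alignments(corpus, n_samples=1000, rng=None):
    """Toy sketch of sampling-based (Anymalign-style) alignment.

    corpus: list of (source_sentence, target_sentence) pairs, each
    sentence a list of tokens.  Words that occur in exactly the same
    sentences of a random subcorpus on both sides are counted as
    translation candidates.
    """
    rng = rng or random.Random(0)
    counts = defaultdict(int)
    for _ in range(n_samples):
        size = rng.randint(1, len(corpus))
        sample = rng.sample(corpus, size)
        # record, for each word, the set of sample sentences it occurs in
        src_profile, tgt_profile = defaultdict(set), defaultdict(set)
        for i, (src, tgt) in enumerate(sample):
            for w in src:
                src_profile[w].add(i)
            for w in tgt:
                tgt_profile[w].add(i)
        # group words by identical sentence-occurrence profile
        by_profile = defaultdict(lambda: ([], []))
        for w, prof in src_profile.items():
            by_profile[frozenset(prof)][0].append(w)
        for w, prof in tgt_profile.items():
            by_profile[frozenset(prof)][1].append(w)
        # a unique source word facing a unique target word is a candidate
        for srcs, tgts in by_profile.values():
            if len(srcs) == 1 and len(tgts) == 1:
                counts[(srcs[0], tgts[0])] += 1
    return counts
```

Normalising the counts per source word would give the translation probabilities stored in an actual table.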

Results

  1. Experimental data
  2. Translation tables

Publications
  1. J. Luo, J. Sun, and Y. Lepage. Improving sampling-based alignment method for statistical machine translation tasks. In Proceedings of the 17th Japanese National Conference in Natural Language Processing, pages 186–189, Toyohashi, March 2011.
  2. J. Luo, A. Lardilleux, and Y. Lepage. Exploring n-grams distribution for sampling-based alignment. In Proceedings of the 5th Language and Technology Conference (LTC'11), pages 289–293, Poznan, Poland, November 2011.
  3. J. Luo, A. Lardilleux, and Y. Lepage. Improving sampling-based alignment by investigating the distribution of n-grams in phrase translation tables. In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (PACLIC 25), pages 150–159, Singapore, December 2011.
  4. J. Lee and Y. Lepage. Fast production of ad hoc translation tables using the sampling-based method. In Proceedings of the 18th Japanese National Conference in Natural Language Processing, pages 809–812, Hiroshima, March 2012.
  5. J. Luo, J. Sun, and Y. Lepage. Producing translation tables by separate N-grams subtables. In Proceedings of the 18th Japanese National Conference in Natural Language Processing, pages 797–800, Hiroshima, March 2012.
  6. J. Luo, A. Lardilleux, and Y. Lepage. Improving the distribution of N-grams in phrase tables obtained by the sampling-based method. Lecture Notes in Artificial Intelligence, ??:??–??, 2013.
  7. J. Luo and Y. Lepage. An investigation of the sampling-based alignment method and its contributions. International Journal of Artificial Intelligence & Applications (IJAIA), 4(4):9–19, July 2013.
  8. J. Luo and Y. Lepage. A comparison of association and estimation approaches to alignment in word-to-word translation. In Proceedings of the 10th International Symposium on Natural Language Processing (SNLP 2013), pages 181–186, Phuket, Thailand, October 2013.
  9. T. Kimura, J. Matsuoka, Y. Nishikawa, and Y. Lepage. Analogy-based machine translation for longer sentences. In Proceedings of the 7th International Collaboration Symposium (IPS-ICS). IPS, Waseda University, November 2013.
  10. J. Luo, A. Max, and Y. Lepage. Using the productivity of language is rewarding for small data: Populating SMT phrase table by analogy. In Z. Vetulani, editor, Proceedings of the 6th Language & Technology Conference (LTC'13), pages 147–151, Poznan, December 2013. Fundacja uniwersytetu im. Adama Mickiewicza.
  11. T. Kimura, Y. Nishikawa, J. Matsuoka, and Y. Lepage. Generation of translation tables adequate for example-based machine translation by analogy. In Proceedings of the 2014 International Conference on Artificial Intelligence and Software Engineering (AISE 2014), pages ??–??, Phuket, Thailand, January 2014. DEStech Publications.
  12. S. Zhang, J. Luo, and Y. Lepage. Improving N-gram distribution for sampling-based alignment by extraction of longer N-grams. In Proceedings of the 215th Research Meeting in Natural Language Processing of the Japanese Information Processing Association, Tokyo, Japan, February 2014.
  13. T. Kimura, Y. Nishikawa, J. Matsuoka, and Y. Lepage. Analogy-based machine translation using secability. In Proceedings of the 2014 International Conference on Computational Science and Computational Intelligence (CSCI'2014), pages ??–??, Las Vegas, Nevada, USA, March 2014. IEEE Computer Society's Conference Publishing Services.
  14. T. Kimura, J. Matsuoka, Y. Nishikawa, and Y. Lepage. Generation and assessment of translation tables for example-based machine translation by analogy (in Japanese). In Proceedings of the 16th Meeting of the Information Processing Society of Japan, pages ??–??, Tokyo, March 2014.
  15. T. Kimura, Y. Nishikawa, J. Matsuoka, and Y. Lepage. The influence of sentence length in example-based machine translation by analogy (in Japanese). In Proceedings of the 20th Yearly Conference of the Japanese Association for Natural Language Processing, pages ??–??, Sapporo, March 2014.
  16. Y. Nishikawa, T. Kimura, J. Matsuoka, and Y. Lepage. A study of analogy-based machine translation using monolingual or bilingual segmentation (in Japanese). In Proceedings of the 20th Yearly Conference of the Japanese Association for Natural Language Processing, pages ??–??, Sapporo, March 2014.

Experimental data

Europarl parallel corpus (11 languages, common part, release v3)

The common part was extracted using English sentences to determine the set of sentences that have a translation in all 11 languages. The extracted data has been checked and cleaned up.
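The extraction step can be sketched as follows (a toy illustration, not the actual scripts used): intersect, over all language pairs, the sets of English sentences that have a translation, then keep those sentences together with their translations. The function name and the data layout are assumptions:

```python
def common_part(pairs):
    """Toy sketch of common-part extraction from Europarl-style data.

    pairs: dict mapping a language code to a dict
    {english_sentence: translated_sentence} for that language pair.
    Returns the English sentences having a translation in every
    language, together with all their translations.
    """
    # English sentences present in every language pair
    common = set.intersection(*(set(d) for d in pairs.values()))
    return {en: {lang: d[en] for lang, d in pairs.items()}
            for en in sorted(common)}
```

In practice the real extraction also normalises and cleans the sentences before intersecting.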

Number of lines

Training data:   347,614 lines
Development set: 500 lines
Test set:        38,123 lines
References:      1 reference per line in the test set

Number of words in all sets in all languages

Language         Train (347,614 lines)   Dev (500 lines)   Test (38,123 lines)
Danish (da)      9,458,365               13,981            1,040,819
German (de)      9,510,833               14,033            1,046,557
Greek (el)       9,997,176               14,587            1,100,255
English (en)     9,945,267               14,612            1,094,082
Spanish (es)     10,472,178              15,398            1,151,404
Finnish (fi)     7,179,991               10,546            789,206
French (fr)      10,955,901              16,157            1,204,527
Italian (it)     9,880,314               14,611            1,085,840
Dutch (nl)       10,013,958              14,645            1,101,028
Portuguese (pt)  10,287,116              15,256            1,129,898
Swedish (sv)     8,988,906               13,243            988,588

Experimental settings

Word-to-word alignment tools
  • GIZA++ (Och and Ney, 2003)
  • Anymalign (Lardilleux and Lepage, 2009)
Translation table generation
GIZA++/Moses or Anymalign.
Experiments with statistical machine translation
  • training and decoding: Moses (Koehn et al., 2007),
  • tuning: MERT (Och, 2003),
  • language models: SRILM (Stolcke, 2002)
Experiments with the example-based approach
Not an open tool: the engine is being developed in-house at the EBMT/NLP laboratory.
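The engine rests on proportional analogies between strings (a : b :: c : d). As a taste of what such analogies involve, here is a sketch of one well-known necessary (not sufficient) condition on character counts; this is an illustration only, not a component of the engine:

```python
from collections import Counter

def analogy_counts_ok(a, b, c, d):
    """Necessary (not sufficient) condition for a proportional analogy
    a : b :: c : d between strings: for every character, its number of
    occurrences in a plus d must equal its number in b plus c.
    """
    return Counter(a) + Counter(d) == Counter(b) + Counter(c)
```

For example, "walk : walked :: laugh : laughed" passes the condition, while replacing "laughed" by "laughs" fails it; a full analogy solver must additionally check character order.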

Results of experiments with different translation tables (TTs)

Experiments with Moses decoder

Baseline experiments
  • (tt_gen=giza++/moses, tt_type=phrases, tt_gen_option=w/o pruning, translation_engine=moses) GIZA++/MOSES standard pipeline, without pruning
  • (tt_gen=giza++/moses, tt_type=phrases, option=with pruning, translation_engine=moses) GIZA++/MOSES standard pipeline with pruning
  • (tt_gen=anymalign, tt_type=phrases, translation_engine=moses) same as above but the translation tables are the ones output by Anymalign with standard options
  • (tt_gen=giza++/moses+anymalign, tt_type=phrases, translation_engine=moses, translation_option=merged_tables) Merged tables and Moses decoder for translation
  • (tt_gen=giza++/moses+anymalign, tt_type=phrases, translation_engine=moses, translation_option=multiple_tables) Multiple phrase tables with Moses decoder; experiments on some language pairs only
  • (tt_gen=anymalign, tt_type=phrases, ttgen_option=Giza++/Moses_nbr_of_entries, translation_engine=moses) Anymalign forced to output the same number of entries in each n-gram x m-gram cell as in the TTs output by GIZA++/Moses
Improvements by allotting different time for the generation of N-grams x M-grams entries
  • (tt_gen=anymalign, tt_type=phrases, ttgen_option=equal_time_distribution, translation_engine=moses) Anymalign with equal distribution of time for each n-gram x m-gram cell
  • (tt_gen=anymalign, tt_type=phrases, ttgen_option=univariate_time_distribution, translation_engine=moses) Anymalign with standard normal distribution of time for n-gram x m-gram cells (= univariate time distribution)
  • (tt_gen=anymalign, tt_type=phrases, ttgen_option=multivariate_time_distribution, translation_engine=moses) Anymalign with multivariate time distribution over n-gram x m-gram cells
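The idea of the time-distribution options above can be sketched as follows. The function, its parameters, and the Gaussian parameterisation on n + m are illustrative assumptions, not the actual settings used in the experiments:

```python
import math

def time_shares(max_n=6, total_time=3600.0, mode="equal", mu=3.0, sigma=1.5):
    """Toy sketch of allotting a total running time to the
    n-gram x m-gram cells of a translation table.

    'equal'  : every cell gets the same share of total_time.
    'normal' : cell (n, m) is weighted by a Gaussian on n + m,
               favouring mid-length entries (hypothetical choice).
    """
    cells = [(n, m) for n in range(1, max_n + 1) for m in range(1, max_n + 1)]
    if mode == "equal":
        weights = {c: 1.0 for c in cells}
    else:
        weights = {(n, m): math.exp(-((n + m - 2 * mu) ** 2) / (2 * sigma ** 2))
                   for n, m in cells}
    z = sum(weights.values())  # normalise so that shares sum to total_time
    return {c: total_time * w / z for c, w in weights.items()}
```

Anymalign would then be run for `time_shares[(n, m)]` seconds per cell, and the resulting subtables merged.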
Rule tables:
  • (tt_gen=giza++/moses, tt_type=rules, translation_engine=moses) GIZA++/MOSES
  • (tt_gen=anymalign, tt_type=rules, tt_gen_option=discontiguous_alignments, translation_engine=moses) Anymalign discontiguous entries filtered to generate rules, i.e., 0 to 2 placeholders
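The placeholder filtering can be sketched as below (toy code: the gap symbol "X", the entry layout, and the function name are assumptions, not Anymalign's actual output format):

```python
def filter_rules(entries, max_placeholders=2, gap="X"):
    """Toy filter in the spirit of the rule-table generation above:
    keep discontiguous entries whose source and target sides each
    contain at most max_placeholders gap symbols.

    entries: list of (source_tokens, target_tokens) pairs.
    """
    kept = []
    for src, tgt in entries:
        if src.count(gap) <= max_placeholders and tgt.count(gap) <= max_placeholders:
            kept.append((src, tgt))
    return kept
```

Entries passing the filter can then serve as rules for a hierarchical-style decoder, with each gap symbol acting as a placeholder.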

Experiments with an in-house analogy-based EBMT engine

Baseline experiments
  • (tt_gen=giza++/moses, tt_type=phrases, option=with pruning, translation_engine=ebmt) Translation tables generated by the standard GIZA++/Moses SMT pipeline
  • (tt_gen=anymalign, tt_type=phrases, translation_engine=ebmt) Translation tables generated by Anymalign (3 hours)
Improvements by use of better suited translation tables
  • (tt_gen=secability, tt_type=phrases, translation_engine=ebmt) Word-to-word alignment using Anymalign and phrases output by secability in each language and phrase-to-phrase alignment generated by in-house alignment
  • (tt_gen=lseq, tt_type=phrases, translation_engine=ebmt) Word-to-word alignment using Anymalign and phrases output by secability in the source language only and phrase-to-phrase alignment generated by in-house alignment
  • (tt_gen=cutnalign, tt_type=phrases, translation_engine=ebmt) Word-to-word alignment using Anymalign and phrases output by cutnalign (simultaneous bilingual segmentation and alignment)

Contact

EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287