EBMT / NLP Laboratory

Graduate School of Information, Production and Systems, Waseda University

Kakenhi 15K00317

Outline of Grants-in-Aid for Scientific Research

Grants-in-Aid are awarded to promote creative and pioneering research across a wide spectrum of scientific fields, ranging from the humanities and social sciences to the natural sciences. Grants are awarded to projects organized by individual researchers or research groups at Japanese universities or research institutes engaged in basic research, particularly research in critical fields attuned to advanced research trends. Research results obtained under these Grants-in-Aid are widely published in academic journals.

Kakenhi 15K00317

This project is a Kakenhi Project (Kakenhi C 15K00317) entitled "Language productivity: efficient extraction of productive analogical clusters and their evaluation using statistical machine translation." See more details.

2016

  1. J. Luo and Y. Lepage. A method of generating translations of unseen n-grams by using proportional analogy. IEEJ Transactions on Electrical and Electronic Engineering, 11(3):325–330, May 2016. DOI:10.1002/tee.22221 [Preparatory work for the research topic of Kakenhi 15K00317]
  2. R. Fam and Y. Lepage. Morphological predictability of unseen words using computational analogy. In Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-16), pages 51–60, Atlanta, Georgia, October 2016.
  3. V. Kaveeta and Y. Lepage. Solving analogical equations between strings of symbols using neural networks. In Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-16), pages 67–76, Atlanta, Georgia, October 2016.
  4. W. Yang, M. Gao, and Y. Lepage. Production of analogical clusters between marker-based chunks in Chinese and Japanese. In Proceedings of the 10th International collaboration Symposium on Information, Production and Systems (ISIPS 2016), pages 238–241, IPS, Waseda University, November 2016.

 

2017 

  1. R. Fam, Y. Lepage, S. Gojali, and A. Purwarianti. A study in explaining unseen words in Indonesian using analogical clusters. In Proceedings of the 15th International Conference on Computer Applications (ICCA 2017), pages 416–421, Yangon, Myanmar, January 2017.
  2. W. Yang, H. Shen, and Y. Lepage. Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation. Journal of Information Processing, 25:88–99, 2017. DOI:10.2197/ipsjjip.25.88
  3. R. Fam, Y. Lepage, S. Gojali, and A. Purwarianti. Indonesian unseen words explained by form, morphology and distributional semantics at the same time. In Proceedings of the 23rd Annual Meeting of the Japanese Association for Natural Language Processing, pages 178–181, Tsukuba, Japan, March 2017.
  4. Y. Lepage. Clusters et grilles analogiques : validation par la traduction automatique (invited talk). 40 ans de TA, Grenoble, France, July 2017.
  5. R. Fam and Y. Lepage. A study of the saturation of analogical grids agnostically extracted from texts. In Proceedings of the Computational Analogy Workshop at the 25th International Conference on Case-Based Reasoning (ICCBR-17), pages 7–16, Trondheim, Norway, August 2017.
  6. Y. Lepage. Character–position arithmetic for analogy questions between word forms. In Proceedings of the Computational Analogy Workshop at the 25th International Conference on Case-Based Reasoning (ICCBR-17), pages 17–26, Trondheim, Norway, August 2017.
  7. P. Liu and Y. Lepage. Confidence of word forms generated in analogical grids. In Proceedings of the 11th International collaboration Symposium on Information, Production and Systems (ISIPS 2017), pages 238–240, IPS, Waseda University, November 2017.
  8. F. Rashel, A. Purwarianti, and Y. Lepage. Plausibility of word forms generated from analogical grids on Indonesian. In Proceedings of the 11th International collaboration Symposium on Information, Production and Systems (ISIPS 2017), pages 245–247, IPS, Waseda University, November 2017.
  9. R. Fam and Y. Lepage. A holistic approach at a morphological inflection task. In Proceedings of the 8th Language & Technology Conference (LTC’17), pages 88–92, Poznań, Poland, November 2017. Fundacja Uniwersytetu im. Adama Mickiewicza.
  10. Y. Lepage. Automatic production of quasi-parallel corpora for machine translation (invited talk). In International Conference on Natural Language, Signal and Speech Processing 2017, Casablanca, Morocco, 6–7 December 2017.
 

2018 

  1. Y. Lepage. Analogy for natural language processing and machine translation (invited talk). In Proceedings of the 16th International Conference on Computer Applications (ICCA 2018), Yangon, Myanmar, February 2018.
  2. R. Fam, A. Purwarianti, and Y. Lepage. Plausibility of word forms generated from analogical grids in Indonesian. In Proceedings of the 16th International Conference on Computer Applications (ICCA 2018), pages 179–184, Yangon, Myanmar, February 2018.
  3. R. Fam and Y. Lepage. Validating analogically generated Indonesian words using Fisher’s exact test. In Proceedings of the 24th Annual Meeting of the Japanese Association for Natural Language Processing, pages 312–315, Okayama, Japan, March 2018.
  4. Y. Lepage. Quasi-parallel corpora: Hallucinating translations for the Chinese–Japanese language pair (invited talk). In Proceedings of the 11th Workshop on Building and Using Comparable Corpora (BUCC), co-located with LREC 2018, Miyazaki, Japan, May 2018.
  5. R. Fam and Y. Lepage. Tools for the production of analogical grids and a resource of n-gram analogical grids in 11 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 1060–1066, Miyazaki, Japan, May 2018.

Experimental data

 

Sentences: Europarl

Released data: Europarl corpus v3 - aligned sentences between 11 languages

The common part was extracted by using the English sentences as a pivot: only the sentences that have a translation in all 11 languages were kept. The extracted data have been checked and cleaned up.
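
The pivot-based extraction described above can be sketched as follows. This is only an illustration, assuming each language pair is available as a list of (English sentence, foreign sentence) tuples; the function name `common_part` and the toy sentences are hypothetical, and the checking and clean-up steps applied to the released data are not reproduced.

```python
def common_part(pairs_by_lang):
    """Keep only English sentences that have a translation in every language.

    pairs_by_lang maps a language code to a list of (english, foreign) pairs.
    Returns one dict per common English sentence, mapping "en" and each
    language code to the corresponding sentence.
    """
    # For each language: English sentence -> translation (first occurrence wins).
    tables = {}
    for lang, pairs in pairs_by_lang.items():
        table = {}
        for english, foreign in pairs:
            table.setdefault(english, foreign)
        tables[lang] = table

    # English sentences that appear in all language pairs.
    common = set.intersection(*(set(t) for t in tables.values()))

    rows = []
    for english in sorted(common):
        row = {"en": english}
        for lang, table in tables.items():
            row[lang] = table[english]
        rows.append(row)
    return rows


# Toy data (hypothetical): only the second English sentence exists in both pairs.
toy = {
    "fr": [("Thank you.", "Merci."),
           ("The debate is closed.", "Le débat est clos.")],
    "de": [("The debate is closed.", "Die Aussprache ist geschlossen.")],
}
rows = common_part(toy)
for row in rows:
    print(row)
```

Using the English side as the key makes the intersection a simple set operation, at the cost of merging English sentences that occur more than once.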

Number of lines
Training data: 347,614 lines
Development set: 500 lines
Test set: 38,123 lines
References: 1 reference per line in the test set.
 
Number of words in each set, per language

Language          Train (347,614 lines)   Dev (500 lines)   Test (38,123 lines)
Danish (da)        9,458,365              13,981            1,040,819
German (de)        9,510,833              14,033            1,046,557
Greek (el)         9,997,176              14,587            1,100,255
English (en)       9,945,267              14,612            1,094,082
Spanish (es)      10,472,178              15,398            1,151,404
Finnish (fi)       7,179,991              10,546              789,206
French (fr)       10,955,901              16,157            1,204,527
Italian (it)       9,880,314              14,611            1,085,840
Dutch (nl)        10,013,958              14,645            1,101,028
Portuguese (pt)   10,287,116              15,256            1,129,898
Swedish (sv)       8,988,906              13,243              988,588

Original data: Europarl parallel corpus (11 languages, common part, release v3)

 

Words: SIGMORPHON data set

Released data: SIGMORPHON 2016 - Analogies (download)

We extract analogy questions from these data by considering all analogies of form between word forms, filtered by their morphological features. For each analogy, four different analogy questions can be asked, with each of the four terms in turn becoming the answer.

Caution: This is not the task proposed in the SIGMORPHON Shared Task, which is a machine learning task: predicting a word form from a lemma and morphological features after learning from the training data.
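
As an illustration of how one analogy yields four questions, here is a minimal sketch. The helper names `analogy_questions` and `show` are hypothetical and are not part of the released data or tools; the example analogy is taken from the Arabic sample in the Format section.

```python
def analogy_questions(a, b, c, d):
    """Return the four questions obtained by hiding each term of A : B :: C : D.

    Each question is a (masked_terms, answer) pair; the hidden term is None.
    """
    terms = (a, b, c, d)
    questions = []
    for hidden in range(4):
        masked = tuple(None if i == hidden else t for i, t in enumerate(terms))
        questions.append((masked, terms[hidden]))
    return questions


def show(masked):
    """Render a question in the A : B :: C : D notation, with ? for the unknown."""
    a, b, c, d = ("?" if t is None else t for t in masked)
    return f"{a} : {b} :: {c} : {d}"


questions = analogy_questions(
    "yūnāniyyun", "al-yūnāniyyatayni", "muʿāṣirun", "al-muʿāṣiratayni"
)
for masked, answer in questions:
    print(show(masked), "->", answer)
```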

Format:

The file contains one analogy per line.

A : B :: C : D

The lines introduced by a # give the name of the language of the analogies that follow.

# arabic/

yūnāniyyun : al-yūnāniyyatayni :: muʿāṣirun : al-muʿāṣiratayni

al-muʿāṣiratayni : al-yūnāniyyatayni :: muʿāṣirun : yūnāniyyun

al-yūnāniyyatayni : yūnāniyyun :: al-muʿāṣiratayni : muʿāṣirun

...
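
A file in this format can be read with a few lines of Python. The sketch below is not a released tool: the function name `read_analogies` is hypothetical, and it assumes that terms are separated by exactly " : " and " :: " and that the trailing "/" in the language line can be dropped.

```python
def read_analogies(lines):
    """Yield (language, (A, B, C, D)) for each analogy line.

    Lines starting with '#' give the language of the analogies that follow.
    Blank lines are skipped.
    """
    language = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # e.g. "# arabic/" becomes "arabic" (dropping "/" is an assumption)
            language = line.lstrip("#").strip().rstrip("/")
            continue
        left, right = line.split(" :: ")
        a, b = left.split(" : ")
        c, d = right.split(" : ")
        yield language, (a, b, c, d)


sample = [
    "# arabic/",
    "",
    "yūnāniyyun : al-yūnāniyyatayni :: muʿāṣirun : al-muʿāṣiratayni",
    "",
    "al-muʿāṣiratayni : al-yūnāniyyatayni :: muʿāṣirun : yūnāniyyun",
]
parsed = list(read_analogies(sample))
for language, terms in parsed:
    print(language, terms)
```

With a real file, pass the open file object instead of the `sample` list, for example `read_analogies(open(path, encoding="utf-8"))`, where `path` is the location of the downloaded file.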

Original data: SIGMORPHON 2016 Shared Task: Morphological Reinflection Task 1 Track 1 (10 languages)

Publications:

If you are using this data, please cite our publication.

Y. Lepage. Character-Position Arithmetic for Analogy Questions between Word Forms. In Proceedings of the Computational Analogy Workshop at the 25th International Conference on Case-Based Reasoning (ICCBR-CAW-17), pages 17–26, Trondheim, Norway, June 2017.

 

Experimental settings

Word-to-word alignment tools
  • GIZA++ (Och and Ney, 2003)
  • Anymalign (Lardilleux and Lepage, 2009)
Translation table generation
  • GIZA++/Moses or Anymalign
Experiments with statistical machine translation
  • training and decoding: Moses (Koehn et al., 2007)
  • tuning: MERT (Och, 2003)
  • language models: SRILM (Stolcke, 2002)
Experiments with the example-based approach
  • an engine under development at the EBMT/NLP lab (not an open tool)

Results of experiments

 

Resources: N-gram analogical clusters and grids

A resource of n-gram analogical clusters and grids extracted from the first thousand corresponding lines of the Europarl corpus v3. Please refer to our paper for further details.

Languages: Danish (da), German (de), Greek (el), English (en), Spanish (es), Finnish (fi), French (fr), Italian (it), Dutch (nl), Portuguese (pt), and Swedish (sv)

N-grams: 1-gram to 6-gram

Version: 1.0.0

Release Date: 2018-05-07

Last Updated: 2018-05-07
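
To give an idea of the input from which such a resource is built, the sketch below collects the word n-grams (1-gram to 6-gram) found in the first thousand lines of a text. It is only an illustration under stated assumptions: the function name `ngrams_from_lines` is hypothetical, tokenisation is plain whitespace splitting, and n-grams do not cross line boundaries. The extraction of clusters and grids themselves is described in our paper and is not shown here.

```python
from itertools import islice


def ngrams_from_lines(lines, max_n=6, max_lines=1000):
    """Collect the distinct word n-grams (n = 1..max_n) of the first max_lines lines.

    Returns a dict mapping n to a set of n-grams, each n-gram being a tuple of words.
    """
    grams = {n: set() for n in range(1, max_n + 1)}
    for line in islice(lines, max_lines):
        words = line.split()  # whitespace tokenisation (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                grams[n].add(tuple(words[i:i + n]))
    return grams


grams = ngrams_from_lines(["resumption of the session", "the session is closed"])
print({n: len(g) for n, g in grams.items()})
```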

Download Links:

License

If you are using this data, please cite our publication.

R. Fam and Y. Lepage. Tools for the production of analogical grids and a resource of n-gram analogical grids in 11 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 1060–1066, Miyazaki, Japan, May 2018. ELRA. PDF

All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Nlg Module

Python 3 module for analogy. Please refer to the README file and our paper for installation and usage instructions.

  • Words2Clusters: Extraction of analogical clusters from a given set of words
  • Words2Grids: Construction of analogical grids from a given set of words
  • Words2Vectors: Building vector representations for words
  • Vectors2Clusters: Extraction of analogical clusters from a given set of words with their corresponding vectors
  • Vectors2Grids: Construction of analogical grids from a given set of words with their corresponding vectors
  • Clusters2Grids: Construction of analogical grids from a given list of analogical clusters
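
For readers unfamiliar with analogical clusters, the toy sketch below groups word pairs that share the same character-count difference. It is not the Nlg module's API or algorithm, and the function names are hypothetical. Equal character-count differences are only a necessary condition for an analogy of form, so the groups produced here are candidate clusters at best; please refer to the paper for the actual method.

```python
from collections import Counter, defaultdict
from itertools import permutations


def count_difference(a, b):
    """Character-count difference between two words, as a hashable signature."""
    diff = Counter(a)
    diff.subtract(Counter(b))
    # Drop zero entries so that equal differences compare equal.
    return tuple(sorted((ch, n) for ch, n in diff.items() if n != 0))


def candidate_clusters(words):
    """Group ordered word pairs sharing the same character-count difference.

    Only groups with at least two pairs are kept, since a single pair
    cannot form an analogy A : B :: C : D.
    """
    groups = defaultdict(list)
    for a, b in permutations(sorted(set(words)), 2):
        groups[count_difference(a, b)].append((a, b))
    return [pairs for pairs in groups.values() if len(pairs) >= 2]


clusters = candidate_clusters(["walk", "walked", "talk", "talked"])
for cluster in clusters:
    print(" :: ".join(f"{a} : {b}" for a, b in cluster))
```

Because ordered pairs are enumerated, each candidate cluster also appears in its reversed direction (for example, walked : walk alongside walk : walked).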

Current Version: 3.2.1

First Release Date: 2018-05-07 (v1.0.0)

Updates:

  1. v1.0.0: 2018-05-07 (first version, in Python 2.7)
  2. v3.0.0: 2020-04-14 (Python 3 version)
  3. v3.1.0: 2021-04-30 (scripts for vector input)
  4. v3.2.0: 2021-09-08 (better clarity on clustering function)
  5. v3.2.1: 2021-12-20 (introduction of NumPy for better matrix manipulation and multiprocessing for faster computation)

Download Link: Nlg module v3.2.1

Dependencies:

 

License

If you use our module, please cite our paper:

R. Fam and Y. Lepage. Tools for the production of analogical grids and a resource of n-gram analogical grids in 11 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 1060–1066, Miyazaki, Japan, May 2018. ELRA. PDF

All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


 

Questions / Contact

For questions regarding these resources, please contact Rashel Fam at fam(dot)rashel@fuji.waseda.jp.

Contact

EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287