Kakenhi Project 21K12038

This project is a Kakenhi Project (Kakenhi Kiban C 21K12038) entitled "Theoretically founded algorithms for the automatic production of analogy test sets in NLP."

Background: Breakthroughs in NLP have produced vector representations of words and sentences through models such as word2vec, GloVe, BERT, GPT, and XLM-R. These models place words, sentences, or parts of sentences in a shared vector space and are mostly evaluated extrinsically, on benchmarks such as GLUE and SuperGLUE, primarily in English. Word embedding models are lighter to train than sentence models; the latter are often trained by major institutions because of the resources they require. Intrinsic evaluation methods include analogies, which are regarded as indicators of the quality of an embedding space. Analogy test sets such as Google's analogy test set and BATS are available, and similar sets exist for multiple languages, but they are resource-intensive to produce.
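To make the idea of an analogy test concrete, here is a minimal sketch of the widely used vector-offset approach: to complete a : b :: c : ?, compute b - a + c and return the nearest remaining word by cosine similarity. The function names and the toy two-dimensional vectors are invented for illustration; they are not taken from the project's tools or from any real embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes a : b :: c : d (vector-offset method)."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Common convention: the three input words are excluded from the candidates.
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Toy vectors, made up for this example only.
toy = {
    "man": [1.0, 1.0],
    "woman": [1.0, 3.0],
    "king": [4.0, 1.0],
    "queen": [4.0, 3.0],
    "apple": [-2.0, 0.5],
}
print(solve_analogy(toy, "man", "woman", "king"))  # prints "queen"
```

An analogy test set is then a list of such (a, b, c, d) quadruples, and the score of an embedding space is the proportion of quadruples for which the predicted word equals d.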

Scientific and technical questions: The goal of the research project is to equip NLP researchers with tools for exploring vector representations of words, sentences, or parts of sentences, so as to conduct intrinsic evaluation of these representations. The practical outcome will be the release of tools that automatically extract all analogies between the objects in a given vector space, or in a portion of it. These tools will allow researchers to explore an entire vector space of words or sentences, or a large portion of it. This will enable the automatic production of analogy test sets: candidate test sets will be produced automatically and can then be verified by humans. The tools will lift two restrictions. First, they will apply to any vector space in any language. Second, because the entire space is explored, any kind of analogy will be retrieved, with a better balance between formal and semantic representations.

Core of research plan: The research project will address the following key scientific question: given a set of vector representations of objects, obtained from distributional semantic methods or from formal representations of words or sentences, how can all analogies be extracted from it? The project will explore ways to extend a method developed in a previous project (Kakenhi Kiban C 15K00317) to real-valued vectors. This is not a trivial question. For the algorithm to work on real numbers, a proper definition of the relaxation of arithmetic analogies between numbers is needed. To that end, theoretical work on the algebraic and analytic properties of analogy will be conducted. The risk is that the resulting methods lead to a large increase in computation time. The research project will also explore methods to cast integer-valued string edit distances into real-valued vector representations.
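The question can be illustrated with a naive baseline. The sketch below assumes one possible relaxation, chosen here for illustration and not the definition the project arrives at: a : b :: c : d holds when the Euclidean distance between the offsets b - a and d - c is at most a tolerance eps. With eps = 0 this reduces to the exact arithmetic analogy b - a = d - c. The function name, the tolerance parameter, and the toy vectors are all hypothetical.

```python
import math
from itertools import permutations

def offset(u, v):
    """Offset vector v - u."""
    return tuple(y - x for x, y in zip(u, v))

def extract_analogies(vectors, eps=0.0):
    """Exhaustively list quadruples (a, b, c, d) with ||(b - a) - (d - c)|| <= eps.

    Only one representative of the eight equivalent forms of an analogy is kept:
    a is the smallest name of the four, and b < c.
    """
    found = []
    for a, b, c, d in permutations(sorted(vectors), 4):
        if not (a < b < c and a < d):
            continue
        ab = offset(vectors[a], vectors[b])
        cd = offset(vectors[c], vectors[d])
        if math.dist(ab, cd) <= eps:
            found.append((a, b, c, d))
    return found

# Toy vectors, made up for this example; "queen" is slightly off the parallelogram.
toy = {
    "man": [1.0, 1.0],
    "woman": [1.0, 3.0],
    "king": [4.0, 1.0],
    "queen": [4.0, 3.05],
    "apple": [-2.0, 0.5],
}
print(extract_analogies(toy, eps=0.1))   # the near-parallelogram is accepted
print(extract_analogies(toy, eps=0.01))  # too strict: nothing is found
```

This enumeration is quartic in the number of objects, which makes the computation-time concern tangible: it is workable for a handful of vectors and not for a full vocabulary.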

References:


[1] X. Deng and Y. Lepage. Resolution of analogies between strings in the case of multiple solutions. In CEUR, editor, Proceedings of ICCBR: Workshop on Analogies: from Theory to Applications (ATA@ICCBR 2023), CEUR Workshop Proceedings, pages 3–14, July 2023.

[2] M. Eget, X. Yang, and Y. Lepage. A study in the generation of multilingually aligned middle sentences. In Z. Vetulani and P. Paroubek, editors, Proceedings of the 10th Language & Technology Conference (LTC 2023) – Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 45–49, April 2023.

[3] M. T. Eget, X. Yang, H. Xiao, and Y. Lepage. A study in the generation of multilingually aligned middle sentences. In Proceedings of the 16th International collaboration Symposium on Information, Production and Systems (ISIPS 2022), pages C2–1 (7678), IPS, Waseda University, Nov. 2022.

[4] R. Fam and Y. Lepage. A study of analogical density in various corpora at various granularity. Information, 12(8):no page number, 17 pages, Aug 2021.

[5] R. Fam and Y. Lepage. Organising lexica into analogical grids: a study of a holistic approach for morphological generation under various sizes of data in various languages. Journal of Experimental & Theoretical Artificial Intelligence, 36(1):1–26, 2022.

[6] R. Fam and Y. Lepage. Investigating parallelograms: Assessing several word embedding spaces against various analogy test sets in several languages using approximation. In Z. Vetulani and P. Paroubek, editors, Proceedings of the 10th Language & Technology Conference (LTC 2023) – Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 68–72, April 2023.

[7] R. Fam and Y. Lepage. Investigating parallelograms inside word embedding space using various analogy test sets in various languages. In Proceedings of the 29th Annual Meeting of the Japanese Association for Natural Language Processing, pages 718–722, Naha, Japan, March 2023.

[8] R. Fam and Y. Lepage. A study of universal morphological analysis using morpheme-based, holistic, and neural approaches under various data size conditions. Annals of Mathematics and Artificial Intelligence, ??(??):??–??, 2024.

[9] R. Hou, H. Liu, and Y. Lepage. Enhancing low-resource neural machine translation by using case-based reasoning. In Proceedings of the 17th International collaboration Symposium on Information, Production and Systems (ISIPS 2023), pages 25–29, IPS, Waseda University, Nov. 2023.

[10] Y. Lepage. Formulae for the solution of an analogical equation between Booleans using the Sheffer stroke (NAND) or the Pierce arrow (NOR). In M. Couceiro, P.-A. Murena, and S. Afantenos, editors, Proceedings of the Workshop Interactions between analogies and machine learning, colocated with IJCAI 2023 (IARML@IJCAI 2023), pages 3–14, August 2023.

[11] Y. Lepage and M. Couceiro. Analogie et moyenne généralisée (Analogy and generalized mean). In Actes de la conférence Journées d’intelligence artificielle françaises – Plateforme française d’intelligence artificielle (PFIA-JIAF 2024), pages ??–??, La Rochelle, France, July 2024.

[12] Y. Mei, R. Fam, and Y. Lepage. Extraction and comparison of analogical cluster sizes in different languages for different vocabulary sizes. In Proceedings of the 15th International collaboration Symposium on Information, Production and Systems (ISIPS 2021), pages A1–6, IPS, Waseda University, Nov. 2021.

[13] L. Wang, Z. Pang, H. Wang, X. Zhao, and Y. Lepage. Solving sentence analogies by using embedding spaces combined with a vector-to-sequence decoder or by fine-tuning pre-trained language models. In Z. Vetulani and P. Paroubek, editors, Proceedings of the 10th Language & Technology Conference (LTC 2023) – Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 325–330, April 2023.

[14] L. Wang, H. Wang, and Y. Lepage. Continued pre-training on sentence analogies for translation with small data. In Proceedings of the 14th International Conference on Language Resources and Evaluation (LREC 2024) and the 30th International Conference on Computational Linguistics (COLING’24), pages ??–??, Turin, Italy, May 2024.

[15] B. Yan, H. Wang, L. Wang, Y. Zhou, and Y. Lepage. Transformer-based hierarchical attention models for solving analogy puzzles between longer, lexically richer and semantically more diverse sentences. In M. Couceiro, P.-A. Murena, and S. Afantenos, editors, Proceedings of the Workshop Interactions between analogies and machine learning, colocated with IJCAI 2024 (IARML@IJCAI 2024), pages ??–??, August 2024.

[16] Q. Zhang and Y. Lepage. Improving sentence embedding with sentence relationships from word analogies. In CEUR, editor, Proceedings of ICCBR: Workshop on Analogies: from Theory to Applications (ATA@ICCBR 2023), CEUR Workshop Proceedings, pages 43–53, July 2023.

Invited talks

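[1] Analogie et données de langue (Analogy and language data), LORIA Colloquium, 15 November 2023, LORIA.
[2] Analogie, explication des données de langue et travaux récents sur représentations vectorielles de phrases et analogie (Analogy, explanation of language data, and recent work on vector representations of sentences and analogy), Workshop Analogies: From learning to explainability, Arras, 27–28 November 2023.
[3] Analogie et moyenne : considérations générales et application aux chaînes (Analogy and mean: general considerations and application to strings), Forum sciences cognitives et traitement automatique des langues (Forum on cognitive science and natural language processing), Nancy, 29 November 2023.
[4] Jeux d'analogies pour le TAL (Analogy sets for NLP), MALOTEC/LORIA talk, 13 December 2023.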

Results of experiments

We provide below three data sets of analogies between sentences.

A resource of more than 22,000 semantico-formal analogies between English sentences

A resource of more than 22,000 semantico-formal analogies between sentences extracted from the English part of the Tatoeba corpus that exploits word analogies from the Google analogy test set.

Languages: English (en)

Type of data: Analogies between sentences

Format: Each line in the file is formatted in the following way:

sentence 1 \t sentence 2 \t sentence 3 \t sentence 4

where sentence 1 : sentence 2 :: sentence 3 : sentence 4 is a sentence analogy.

Examples:

Bamako is a superb city.   Mali is a wonderful country.   Bangkok is a superb city.   Thailand is a wonderful country.
When did you get to Zagreb?   When did you arrive in Croatia?   When did you get to Bern?   When did you arrive in Switzerland?
He was greatly respected, while his son was as much despised.   She was greatly respected, while her daughter was as much despised.   He received great respect, and his son also received contempt.   She received great respect, and her daughter also received contempt.
He woke his son up for the fajr prayer.   She woke her daughter up for the fajr prayer.   He woke up his son and began to pray.   She woke up her daughter and began to pray.
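The format described above can be read with a few lines of code. The following is a minimal sketch, assuming only what is stated here: one analogy per line, four sentences separated by tab characters. The function name is hypothetical and no loader is distributed with the resource as far as this page states.

```python
def read_sentence_analogies(lines):
    """Parse lines of the form 'sentence 1 \\t sentence 2 \\t sentence 3 \\t sentence 4'.

    Returns a list of 4-tuples (s1, s2, s3, s4), meaning s1 : s2 :: s3 : s4.
    Blank lines are skipped; a line without exactly four fields raises ValueError.
    """
    analogies = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 tab-separated sentences, "
                f"got {len(fields)}"
            )
        analogies.append(tuple(fields))
    return analogies

# One line taken from the examples above, with explicit tab separators.
sample = [
    "Bamako is a superb city.\tMali is a wonderful country.\t"
    "Bangkok is a superb city.\tThailand is a wonderful country.\n"
]
for s1, s2, s3, s4 in read_sentence_analogies(sample):
    print(f"{s1} : {s2} :: {s3} : {s4}")
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example one obtained with open(path, encoding="utf-8"), where path points to the downloaded file.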

Version: 1.0.0

Release Date: 2023-05-02

Last Updated: 2023-05-02

Download Link:

Semantico-formal sentence analogies from Tatoeba and the Google analogy test set

License: All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

If you are using this data, please cite our publication.

(under review)

The Nlg package has been improved: acceleration thanks to vectorization and parallelization

See the NLG package under this site: Projects > Kakenhi 15K00317 > Tools -- Nlg package

EBMT / NLP Laboratory
