EBMT / NLP Laboratory

Graduate School of Information, Production and Systems, Waseda University

Kakenhi Kiban C 18K11446

 

Natural language processing for academic writing in English.

 

“Publish or perish” is the motto of today's researchers. In practice, it means publishing in English, both for wider dissemination and to be taken into account in global rankings. Researchers who are non-native speakers of English have difficulty describing their work and results in this language, and their papers are more easily rejected for publication. The purpose of this research proposal is to design and implement a computer-aided system, using natural language processing techniques, to help researchers who are not native speakers of English write articles in English. Scientific journal articles usually share common, formal and specific writing styles. Researchers may be able to roughly convey their ideas in simple English, but more fluent and adequate ways of saying the same thing may exist at a higher proficiency level. The goal of this research proposal is to help transform research ideas expressed in hesitant, simple writing into professional writing.

The Academic Ranking of World Universities (ARWU, also known as the Shanghai Ranking) is one of the most influential annual rankings of the world's research universities. In 2017, out of the top 100 universities, only 3 were from Japan, while two thirds (67 universities) were from English-speaking countries (mainly the United States of America and the United Kingdom). The top Japanese university in the ranking, the University of Tokyo, was ranked only 24th, and Kyoto University came 35th. One of the main indicators used for the Shanghai Ranking is research output, i.e., the number of articles published in Nature and Science and the number of articles indexed in the Science Citation Index - Expanded and the Social Sciences Citation Index, which contain only journals written in English. Two of the best-known bibliographic metrics are the impact factor for journals and the h-index for researchers. Once again, the journals and articles used to compute the impact factor and the h-index are only those in English. Articles written in other languages are not considered in the evaluation, even when the reported research results are excellent.

Researchers who are not native speakers of English, such as Japanese researchers, may not have a sufficient proficiency level. In many cases, papers are rejected for publication because, owing to a low level of English proficiency, they do not convincingly convey research ideas, methods and results. The use of English entails time and cost for non-native speakers.

Time: for a non-native speaker of English, writing in English requires extra time compared with researchers who are native speakers. This extra time is taken away from research time.

Cost: non-native speakers of English usually have their papers proofread by professional technical proofreaders. This is costly. Again, these costs are incurred to the detriment of direct research costs.

Two conditions for publication in international journals are the content of the research results and the quality of the writing. We will of course not address research results, but will focus on improving the quality of writing in English. The purpose of this research proposal is to design and implement a system that (1) helps researchers who are not native speakers of English write scientific articles in English, (2) uses natural language processing (NLP) techniques, and (3) avoids plagiarism.

2019

[1] C. Goh and Y. Lepage. An assessment of substitute words in the context of academic writing proposed by pre-trained and specific word embedding models. In Proceedings of the 2019 16th International Conference of the Pacific Association for Computational Linguistics (PACLING 2019), Hanoi, Vietnam, October 2019. (15 pages). PDF
[2] C. Goh and Y. Lepage. Extraction of lexical bundles used in natural language processing articles. In Proceedings of the 2019 International Conference on Advanced Computer Science and Information Systems (ICACSIS 2019), Bali, Indonesia, October 2019. (6 pages).
[3] C. Goh and Y. Lepage. Word embeddings in place of dictionary lookup in the context of academic writing. In Proceedings of the 25th Annual Meeting of the Japanese Association for Natural Language Processing, pages 1161–1164, Nagoya, Japan, March 2019.
[4] T. Li and Y. Lepage. Informative sections and relevant words for the generation of NLP article abstracts. In Proceedings of the 25th Annual Meeting of the Japanese Association for Natural Language Processing, pages 1281–1284, Nagoya, Japan, March 2019.

We carried out a survey of the use of lexical bundles in natural language processing (NLP) scientific articles. We used the ACL Anthology Reference Corpus (ACL-ARC), a large corpus of recognised NLP articles of high linguistic quality, so that the extracted bundles can serve as a reference for writing NLP articles in English. We collected highly frequent N-gram candidates (N ranging from 3 to 6) from this corpus. A good part of these candidates do not comply with the definition of lexical bundles. We therefore resorted to machine learning techniques to classify the candidate N-grams into true and false lexical bundles. The models were trained with supervised learning, using as training data a list of bundles established in previous work by Salazar (D. J. L. Salazar, “Lexical bundles in scientific English: A corpus-based study of native and non-native writing,” Ph.D. dissertation, Universitat de Barcelona, 2011). Around 32% of the highly frequent N-grams were classified as true lexical bundles. This amounts to 18,000 new lexical bundles, which we make publicly available here.

List of candidate bundles used in NLP

If you use these data, please cite the following publication:

C. Goh and Y. Lepage. Extraction of lexical bundles used in natural language processing articles. In Proceedings of the 2019 International Conference on Advanced Computer Science and Information Systems (ICACSIS 2019), Bali, Indonesia, October 2019. (6 pages). PDF
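
The candidate collection step described above (frequent N-grams with N ranging from 3 to 6) can be illustrated with a minimal sketch. It is not the code used in the publication: the tokenizer, the function names and the `min_count` frequency threshold are assumptions made for illustration, and the subsequent supervised classification into true and false lexical bundles is not shown.

```python
from collections import Counter
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercased alphanumeric word forms only.
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_ngrams(sentences, n_min=3, n_max=6, min_count=2):
    """Count all N-grams with n_min <= N <= n_max, sentence by sentence,
    and keep those occurring at least min_count times.

    min_count is an illustrative threshold, not the one used in the paper.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(n_min, n_max + 1):
            # Slide a window of length n over the tokens of one sentence,
            # so that N-grams never cross sentence boundaries.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Toy input standing in for sentences from a corpus such as ACL-ARC.
toy_sentences = [
    "In this paper we propose a method.",
    "In this paper we describe a corpus.",
]
toy_candidates = candidate_ngrams(toy_sentences)
for ngram, count in sorted(toy_candidates.items()):
    print(" ".join(ngram), count)
```

On a real corpus, the threshold would be set much higher, and the surviving candidates would then be passed to the classifier that separates true lexical bundles from other frequent N-grams.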

See page Experimental Data

 

 

Contact

EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287