The software system for generation and processing of a database of collocations of the Ukrainian language.

  • Т. Riabokon National Technical University of Ukraine "Kyiv Polytechnic Institute named after Igor Sikorsky"
  • А. Petrashenko National Technical University of Ukraine "Kyiv Polytechnic Institute named after Igor Sikorsky" https://orcid.org/0000-0003-0239-1706
Keywords: statistical methods of collocation extraction, database of collocations, collocation, text corpus.

Abstract

This article describes the creation of a database of collocations and identifies the most effective methods of extracting collocations from text. Existing research on statistical methods of collocation extraction from text data is analyzed, and a criterion is proposed for comparing their efficiency on Ukrainian-language texts. The architecture for automated generation of the collocation database is described, along with possible ways to speed it up. An experiment was conducted to determine the most effective method of finding collocations for the selected corpus of Ukrainian texts.
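
The abstract does not name the statistical measures that were compared. As a rough illustration of what a statistical collocation extraction method looks like, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), one widely used association measure. The choice of PMI, the function names `pmi_scores` and `top_collocations`, the `min_count` threshold, and the toy token list are all assumptions made for this example and are not taken from the article.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    Pairs seen fewer than `min_count` times are skipped, because PMI
    is unreliable for rare events (an assumed, commonly used filter).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def top_collocations(tokens, k=5, min_count=2):
    """Return the k highest-scoring word pairs, best first."""
    scores = pmi_scores(tokens, min_count)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy, hand-made token list (hypothetical data, not the article's corpus).
tokens = (
    "залізна дорога веде на схід а залізна дорога на захід "
    "закрита і дорога на південь теж"
).split()

for pair in top_collocations(tokens, k=2):
    print(pair, round(pmi_scores(tokens)[pair], 3))
```

On real data the tokens would first be normalized (for example lemmatized), since Ukrainian is highly inflected and surface forms of the same collocation would otherwise be counted separately. A comparison of methods would swap the scoring formula while keeping the counting step the same.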

Published
2021-11-02
How to Cite
Riabokon Т., & Petrashenko А. (2021). The software system for generation and processing of a database of collocations of the Ukrainian language. COMPUTER-INTEGRATED TECHNOLOGIES: EDUCATION, SCIENCE, PRODUCTION, (44), 141-148. https://doi.org/10.36910/6775-2524-0560-2021-44-22
Section
Computer science and computer engineering