Natural Language Processing and Speech Technologies for Central Asian Turkic Languages: A Review of Current Methods, Resources, and Challenges

Keywords

Turkic NLP
Morphological analysis
Low-resource languages
Speech recognition
Cross-lingual transfer

Abstract

This article provides a comprehensive review of contemporary research on natural language processing (NLP) and speech technologies for Central Asian Turkic languages, including Kazakh, Kyrgyz, and Uzbek. Although a number of theoretical and applied studies have been published in recent years, these languages are still classified as low-resource. The main causes are the limited availability of annotated text corpora, insufficient speech data, the parallel use of Cyrillic and Latin scripts, and the absence of unified annotation and evaluation standards. The article systematically examines current approaches to morphological segmentation, named entity recognition, sentiment analysis, and automatic speech recognition. Agglutinative morphology and vowel harmony are discussed as key typological features of Turkic languages that strongly influence computational processing strategies. The effectiveness of both rule-based and neural morphological analyzers is highlighted. The paper also describes how computational models originally developed for Turkish, English, and Russian are adapted through subword modeling, character-level embeddings, and multilingual transformer architectures. In addition, cross-lingual transfer learning is evaluated as a promising approach to mitigating data scarcity. The study identifies corpus fragmentation, inconsistent annotation schemes, and the lack of standardized speech resources as major challenges. The author argues for the development of open-access datasets, the introduction of shared evaluation tasks, and closer institutional collaboration between linguists and computational language technology specialists. The findings are of both theoretical and practical importance for developing sustainable and effective language technologies for low-resource languages.

https://doi.org/10.64863/2312-4784/2025-3-50/7-18
