Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems

Decision-making in R&D management relies on information about current trends in particular research areas. In this work, we investigate how large language models (LLMs) can be used to transfer a dataset and its annotation from one language to another. This is crucial because sharing knowledge between languages could boost under-resourced directions in the target language, saving considerable effort in data annotation and quick prototyping. We experiment with the English-Russian language pair, translating the DEFT (Definition Extraction from Texts) corpus. This corpus contains three layers of annotation dedicated to term-definition pair mining, which is a rare annotation type for Russian. Such a dataset benefits natural language processing methods for trend analysis in science, since terms and definitions are the basic building blocks of any scientific field. We provide a pipeline for annotation transfer using LLMs. Finally, we train BERT-based models on the translated dataset to establish a baseline. © 2025 Elsevier B.V., All rights reserved.

Authors
Popov Dmitry 1, 2, 3; Terentev Egor 1, 2; Serenko Danil 1, 2; Sochenkov Ilya V. 1, 3, 4; Buyanov Igor 1
Publisher
Multidisciplinary Digital Publishing Institute (MDPI)
Issue
5
Language
English
Status
Published
Article number
116
Volume
9
Year
2025
Affiliations
  • 1 Federal Research Center Informatics and Management of the Russian Academy of Sciences, Moscow, Russian Federation
  • 2 Mathematics and Natural Sciences, RUDN University, Moscow, Russian Federation
  • 3 Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russian Federation
  • 4 Ivannikov Institute for System Programming of the RAS, Moscow, Russian Federation
Keywords
ChatGPT; data transferring; DeepSeek; DEFT; large language model; Llama; machine translation; Qwen