Development of Cross-Language Embeddings for Extracting Chemical Structures from Texts in Russian and English

This study describes an algorithm for implementing cross-lingual embeddings to extract chemical structures from texts in Russian and English. The proposed algorithm focuses on fine-tuning pre-trained models based on the transformer architecture. After an analysis of existing models, mBERT and LaBSE were selected. The training datasets for these models included texts related to chemistry and adjacent fields of science. Fine-tuning was performed on a collected set of scientific articles and patent texts in Russian and English; for English, the ChemProt corpus was also used. The models were trained on tasks such as masked language modeling and entity recognition. Comparisons were made with several models, including BioBERT. The experimental results showed that the proposed embeddings solve the task of recognizing chemical structure names more effectively in both Russian and English texts.
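
To illustrate the entity recognition task mentioned in the abstract, the following minimal sketch shows how chemical-name spans could be encoded as per-token BIO tags, a common target format for token classification. It is not taken from the paper: the function name bio_tags, the CHEM label, and the two example sentences are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]], label: str = "CHEM") -> List[str]:
    """Convert entity spans given as token indices [start, end) into BIO tags.

    The label "CHEM" is a hypothetical tag name, not one defined by the paper.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens of the entity
    return tags


# Hypothetical parallel examples: the same chemical name in English and Russian.
en_tokens = ["Sodium", "chloride", "dissolves", "in", "water"]
ru_tokens = ["Хлорид", "натрия", "растворяется", "в", "воде"]

en_tags = bio_tags(en_tokens, [(0, 2)])
ru_tags = bio_tags(ru_tokens, [(0, 2)])

for token, tag in zip(en_tokens + ru_tokens, en_tags + ru_tags):
    print(f"{token}\t{tag}")
```

Because the tag sequence depends only on token positions, the same labeling scheme applies to both languages, which is what allows a single multilingual encoder to be fine-tuned on mixed Russian and English data.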

Authors
Molodchenkov A.I. 1, 2, Deviatkin D.A. 1, Loginov S.A. 3, Lupatov A.Y. 4, Gisina A.M. 4, Lukin A.V. 1, 2
Publisher
Foundation for the Development of Internet Media, IT Education, Human Potential "League of Internet Media"
Issue
5
Language
English
Pages
62-66
Status
Published
Volume
13
Year
2025
Organizations
  • 1 Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences
  • 2 RUDN University
  • 3 RUDN
  • 4 Institute of Biomedical Chemistry of the Russian Academy of Sciences
Keywords
embeddings; transformer architecture; information extraction; chemical structures