Classification models for RST discourse parsing of texts in Russian [КЛАССИФИКАЦИЯ РИТОРИЧЕСКИХ ОТНОШЕНИЙ ДЛЯ ДИСКУРСИВНОГО АНАЛИЗА ТЕКСТОВ НА РУССКОМ ЯЗЫКЕ]

The paper considers the task of automatic discourse parsing of texts in Russian. Discourse parsing is a well-known approach to capturing text semantics across the boundaries of single sentences. Discourse annotation has been found useful for various tasks, including summarization, sentiment analysis, and question answering. The recent release of the manually annotated Ru-RSTreebank corpus has made it possible to apply supervised machine learning techniques to building such parsers for the Russian language. The corpus provides discourse annotation in a widely adopted formalisation, Rhetorical Structure Theory. In this work, we develop feature sets for rhetorical relation classification in Russian-language texts, investigate the importance of various types of features, and report the results of the first experimental evaluation of machine learning models trained on the Ru-RSTreebank corpus. We consider various machine learning methods, including gradient boosting, a neural network, and ensembling of several models by soft voting. © 2019 ABBYY PRODUCTION LLC. All rights reserved.
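The soft voting mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the relation labels, the three base models, and all probability values below are hypothetical and chosen only to show how averaging per-class probabilities produces an ensemble decision.

```python
from typing import List, Optional, Sequence


def soft_vote(model_probs: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Return the (optionally weighted) mean of per-class probabilities."""
    if not model_probs:
        raise ValueError("need probabilities from at least one model")
    n_classes = len(model_probs[0])
    if any(len(p) != n_classes for p in model_probs):
        raise ValueError("all models must score the same set of classes")
    if weights is None:
        weights = [1.0] * len(model_probs)  # equal say for every model
    if len(weights) != len(model_probs):
        raise ValueError("one weight per model is required")
    total = float(sum(weights))
    return [
        sum(w * p[c] for w, p in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]


def predict(labels: Sequence[str],
            model_probs: Sequence[Sequence[float]],
            weights: Optional[Sequence[float]] = None) -> str:
    """Pick the label with the highest averaged probability."""
    averaged = soft_vote(model_probs, weights)
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return labels[best]


# Hypothetical relation labels and made-up scores for one pair of discourse units.
LABELS = ["elaboration", "contrast", "cause"]
boosting_probs = [0.5, 0.25, 0.25]   # assumed gradient boosting output
network_probs = [0.25, 0.5, 0.25]    # assumed neural network output
third_probs = [0.5, 0.25, 0.25]      # assumed third base model output

all_probs = [boosting_probs, network_probs, third_probs]
ensemble_label = predict(LABELS, all_probs)
```

With equal weights the two models that favour the first label outvote the third; passing `weights` lets a stronger base model count for more, which can change the outcome.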

Authors
Chistova E.V. (1, 2), Shelmanov A.O. (1, 3), Kobozeva M.V. (1), Pisarevskaya D.B. (1), Smirnov I.V. (1), Toldova S.Yu. (4)
Publisher
Rossiiskii Gosudarstvennyi Gumanitarnyi Universitet
Issue number
18
Language
English
Pages
163-176
Status
Published
Volume
2019-May
Year
2019
Organizations
  • 1 FRC CSC RAS, Moscow, Russian Federation
  • 2 RUDN University, Moscow, Russian Federation
  • 3 Skoltech, Moscow, Russian Federation
  • 4 NRU Higher School of Economics, Moscow, Russian Federation
Keywords
Discourse parsing; Feature selection; Machine learning on annotated corpus; RST; Word embedding