CLUSTERING DOCUMENTS USING THE NEURAL NETWORKS

Gabdrakhmanova, N.T.

Document clustering using neural networks

The paper addresses the automation of document clustering, document classification, and dynamic document classification (a situational task). A clustering method is proposed that uses the local clustering coefficient for graphs. The clustering algorithm is based on structural analysis of the graph. Representing a text as a graph makes it possible to define a discrete analogue of Ricci curvature on a metric space, as is done in the works of Ollivier. For solving the document classification problem with neural networks, regularizers based on the introduced concepts are proposed.
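The abstract does not say how the local clustering coefficient enters the algorithm. As a rough illustration only, the sketch below computes the standard local clustering coefficient of a vertex in an unweighted, undirected graph: the fraction of pairs of its neighbours that are themselves connected. The function name `local_clustering`, the dictionary-of-sets graph representation, and the toy graph are assumptions made for this example, not part of the paper.

```python
def local_clustering(adj, v):
    """Local clustering coefficient of vertex v.

    adj maps each vertex to the set of its neighbours (undirected graph).
    This is the textbook definition, assumed here for illustration.
    """
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        # Fewer than two neighbours: no pairs to connect, coefficient taken as 0.
        return 0.0
    # Count each edge between two neighbours of v exactly once (a < b).
    links = sum(1 for a in neighbours for b in neighbours if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
coefficients = {v: local_clustering(adj, v) for v in adj}
```

In this toy graph the vertices inside the triangle get higher coefficients than the hub `a`, whose neighbour `d` is not connected to the others. A weighted variant would need a different formula, which the abstract does not specify.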

A new algorithm for clustering documents based on neural networks, weighted graphs, and adjacency matrices is proposed. Neural networks derive their power from parallel processing and from their ability to self-learn. Constructing a weighted graph for a document amounts to formalizing the object being modeled. The proposed clustering algorithm is as follows. Suppose we have N documents, which are used to build the training set for the neural network. Let each document already be divided into lexemes. A lexeme is a unit of the vocabulary of a language: the totality of the forms of a single word. For each document a weighted graph is constructed according to the following rule: the vertices of the graph are lexemes; two vertices are connected by an edge if the corresponding lexemes occur in the same sentence; the weight of the edge is the relative frequency of the lexemes in the text. In clustering tasks, we call connective words such as "so" and "however" the "noise". To suppress this noise we use filtering: we set a threshold h (its value is not specified here) and remove all edges with weight less than h. Based on the resulting weighted graph, we write the adjacency matrix Ai, where i is the document number. With every adjacency matrix Ai we associate the document class Yi. We obtain the tuples (Ai, Yi), i = 1, 2, ..., N, for training the neural network. Once trained, the network can be used to cluster documents: the adjacency matrix of a document is fed to the input, and the document class number is produced at the output. In future work, the approach is to be developed further using the methods of modern geometry.
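The graph construction and filtering steps can be sketched roughly as follows. The abstract does not define "relative frequency" precisely, so this example assumes the weight of an edge is the share of sentences in which both lexemes occur. It also assumes a fixed, shared vocabulary ordering so that matrices from different documents have the same shape. The names `cooccurrence_weights` and `adjacency_matrix`, the toy document, and the value h = 0.5 are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(sentences):
    """Map each unordered lexeme pair to the share of sentences containing both.

    Assumption: this share stands in for the 'relative frequency' edge weight.
    """
    total = len(sentences)
    if total == 0:
        return {}
    counts = Counter()
    for sentence in sentences:
        # Each pair is counted at most once per sentence.
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return {pair: n / total for pair, n in counts.items()}


def adjacency_matrix(weights, vocabulary, h):
    """Symmetric weighted adjacency matrix; edges with weight below h are dropped."""
    index = {lexeme: k for k, lexeme in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0.0] * size for _ in range(size)]
    for (u, v), w in weights.items():
        if w < h or u not in index or v not in index:
            continue  # filtered out as noise, or outside the vocabulary
        matrix[index[u]][index[v]] = w
        matrix[index[v]][index[u]] = w
    return matrix


# Toy document, already split into sentences of lexemes.
doc = [
    ["graph", "vertex", "edge"],
    ["graph", "edge", "weight"],
    ["however", "graph", "weight"],
    ["graph", "edge"],
]
vocab = ["graph", "vertex", "edge", "weight", "however"]
A = adjacency_matrix(cooccurrence_weights(doc), vocab, h=0.5)
```

With this threshold, the connective word "however" ends up with an all-zero row, which is the filtering effect the abstract describes. Pairing each such matrix with its class label would give one training tuple (Ai, Yi); the network architecture itself is not described in the abstract and is not sketched here.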

Authors

Габдрахманова Н.Т. (Gabdrakhmanova N.T.) ¹

Journal

Речевые технологии (Speech Technology)

Publisher

Научно-исследовательский институт школьных технологий (Research Institute of School Technologies)

Issue number

Language

Russian

Pages

45-53

Status

Published

Year

2019

Organizations

¹ Peoples' Friendship University of Russia (RUDN University)

Keywords

lexeme; cluster; local clustering coefficient; neural network; local curvature coefficient; weighted graph; adjacency matrix
