On improving the accuracy of the classification on imbalanced classes with machine learning

Shchetinin, E.Y.; Sevastianov, L.A.; Kulyabov, D.S.; Ayrjan, E.A.

Class imbalance, characterized by a disproportionate ratio of observations across classes, is one of the significant problems in machine learning. It arises in many areas, including medical diagnostics, spam filtering, and fraud detection. Most machine learning algorithms work best when the number of samples in each class is approximately the same, because they are designed to maximize overall accuracy and reduce error. Under class imbalance, however, the model may overfit, which leads to incorrect classification of objects. To avoid this and achieve better results, it is necessary to study methods for working with imbalanced data and to develop effective algorithms for classifying them. In this paper, we study machine learning methods that eliminate class imbalance in order to improve accuracy in multi-class classification problems. We propose combining classification algorithms with the feature selection methods RFE, Random Forest, and Boruta, after pre-balancing the classes by random sampling, SMOTE, and ADASYN. Using data on skin diseases as an example, computer experiments show that applying sampling algorithms to eliminate class imbalance, together with selecting the most informative features, significantly improves classification accuracy. The Random Forest algorithm was the most effective in terms of classification accuracy when the data were sampled with the ADASYN algorithm.
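To make the pre-balancing step concrete, the following is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the authors' implementation: the function name `smote_oversample`, the parameter defaults, and the toy data are all assumptions made for illustration, and it shows SMOTE rather than ADASYN, which the abstract reports as the most effective. Each synthetic sample is placed on the segment between a minority-class point and one of its nearest same-class neighbours.

```python
import math
import random
from collections import Counter


def smote_oversample(X, y, k=5, seed=0):
    """Oversample every non-majority class up to the majority class size.

    Hypothetical helper for illustration only. Each synthetic point lies
    on the segment between a sample and one of its k nearest neighbours
    from the same class (the core idea of SMOTE).
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(row) for row in X], list(y)

    for label, n in counts.items():
        members = [row for row, lab in zip(X, y) if lab == label]
        if n == target or len(members) < 2:
            continue  # majority class, or too few points to interpolate
        for _ in range(target - n):
            base = rng.choice(members)
            # k nearest same-class neighbours of the base point, excluding itself
            others = [m for m in members if m is not base]
            neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # interpolation weight in [0, 1)
            X_out.append([b + gap * (v - b) for b, v in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Toy three-class data (made up): 6, 2 and 3 samples per class.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2],
     [2.0, 2.0], [2.1, 2.2],
     [5.0, 5.0], [5.2, 5.1], [5.1, 5.3]]
y = [0] * 6 + [1] * 2 + [2] * 3

X_bal, y_bal = smote_oversample(X, y, k=2)
print(Counter(y_bal))
```

On this toy data every class ends up with six samples, and the original eleven rows are kept unchanged at the start of the output. In a pipeline of the kind the abstract describes, feature selection and a classifier such as Random Forest would then be fitted on the balanced data; resampling only the training split keeps synthetic points out of the evaluation set.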

Authors

Shchetinin E.Y. ¹, Sevastianov L.A. ², ³, Kulyabov D.S. ², ³, Ayrjan E.A. ³

Conference proceedings

Distributed computer and communication networks: control, computation, communications (DCCN-2020)

Publisher

V.A. Trapeznikov Institute of Control Sciences of the Russian Academy of Sciences

Language

English

Pages

130-138

State

Published

Year

2020

Organizations

¹ Financial University, Government of the Russian Federation
² Peoples' Friendship University of Russia (RUDN University)
³ Joint Institute for Nuclear Research
⁴ Dubna State University

Keywords

multiclass classification; imbalanced classes; machine learning; SMOTE; ADASYN; random forest

DISTRIBUTION OF COMPUTING LOAD BY USING A P2P NETWORK

Article

Mamonov A.A., Varlamov R.A., Salpagarov S.I.

Distributed computer and communication networks: control, computation, communications (DCCN-2020). 2020. P. 347-354

SENSITIVITY ANALYSIS OF CHARACTERISTICS OF A K-OUT-OF-N:F SYSTEM TO SHAPES OF LIFE AND REPAIR TIMES DISTRIBUTIONS OF ITS COMPONENTS

Article

Rykov V.V., Ivanova N.M., Kozyrev D.V.

Distributed computer and communication networks: control, computation, communications (DCCN-2020). 2020. P. 268-275
