Detecting Near-duplicates in Russian Documents through Using Fingerprint Algorithm Simhash

Rezaeian, N.; Novikova, G.M.

Plagiarism is one of the major problems of the communication age. For many languages, such as English, the issue is taken seriously, and many powerful tools have been developed to counter it. This article aims to detect plagiarism in Russian texts using a fingerprint algorithm. Fingerprint algorithms detect plagiarism quickly because they produce compact features, and only these features are compared between the original and the suspect documents. To increase the power and accuracy of plagiarism detection, common (stop) words must be removed and words reduced to their roots before pre-processing steps such as word separation, number replacement, and homogenization are applied. In this article, four Simhash algorithms were used. These algorithms were implemented and tested on 800 articles on scientific topics, with satisfactory results. © 2017 The Authors.
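
As a rough illustration of the fingerprinting idea summarized above, the following is a minimal sketch of a generic Simhash in Python. It is not the authors' implementation: the abstract does not specify the four Simhash variants, so the 64-bit fingerprint size, the use of MD5 as the token hash, the `<num>` placeholder, and the tiny `STOP_WORDS` set are all assumptions made for illustration. Stemming is omitted to keep the example dependency-free.

```python
import hashlib
import re

# Illustrative only: a tiny, hypothetical stop-word list, not the one used in the paper.
STOP_WORDS = {"и", "в", "на", "не", "что"}


def preprocess(text):
    """Lower-case, split into words, replace numbers, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())  # \w matches Cyrillic letters in Python 3
    tokens = ["<num>" if t.isdigit() else t for t in tokens]  # number replacement
    return [t for t in tokens if t not in STOP_WORDS]


def simhash(tokens):
    """Return a 64-bit Simhash fingerprint of a token list (assumed size)."""
    weights = [0] * 64
    for tok in tokens:
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:  # a bit is set when more token hashes vote 1 than 0
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents would then be compared by the Hamming distance between their fingerprints, for example `hamming(simhash(preprocess(a)), simhash(preprocess(b)))`, with a small distance suggesting a near-duplicate. The threshold that separates near-duplicates from unrelated texts is not given in the abstract and would have to be tuned on data.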

Authors

Rezaeian N.¹, Novikova G.M.¹

Conference proceedings

Procedia Computer Science

Publisher

Elsevier B.V.

Language

English

Pages

421-425

State

Published