Detecting Near-duplicates in Russian Documents through Using Fingerprint Algorithm Simhash

Plagiarism is one of the major problems in the age of communication. For many languages, such as English, the issue is taken very seriously and many powerful tools have been developed to prevent it. This article aims at detecting plagiarism in Russian texts using a fingerprint algorithm. Fingerprint algorithms detect plagiarism quickly because they produce compact features, and only these features need to be compared between original documents and suspicious documents. To increase the power and accuracy of plagiarism detection, general words must be eliminated and words reduced to their roots before pre-processing steps such as word separation, number replacement, and homogenization. In this article, four Simhash algorithms have been used. These algorithms were implemented and tested on 800 articles on scientific topics, and the results were found to be satisfactory. © 2017 The Authors.
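The general idea of a Simhash fingerprint with the pre-processing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce any of the four variants they evaluated. The stop-word list, the `<num>` placeholder, the MD5-based token hash, the 64-bit fingerprint size, the order of the pre-processing steps, and all function names are assumptions made for illustration. Stemming is left as a pluggable `stem` function (identity by default), since no particular Russian stemmer is named in the abstract.

```python
import hashlib
import re

# Tiny illustrative stop-word list (assumption); a real system would use
# a complete list of Russian general words.
STOP_WORDS = {"и", "в", "на", "не", "что", "это", "с", "по", "для"}


def preprocess(text, stem=lambda w: w):
    """Lowercase, split into words, replace numbers, drop stop words, stem."""
    tokens = re.findall(r"\w+", text.lower())  # \w matches Cyrillic letters
    result = []
    for tok in tokens:
        if tok.isdigit():
            result.append("<num>")  # number replacement (placeholder is an assumption)
        elif tok not in STOP_WORDS:
            result.append(stem(tok))  # stem is a hypothetical hook for a real stemmer
    return result


def simhash(tokens, bits=64):
    """Build a Simhash fingerprint from a list of tokens."""
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when `hamming_distance(simhash(preprocess(a)), simhash(preprocess(b)))` falls below a chosen threshold; the threshold itself is a tuning parameter that the abstract does not specify.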

Authors
Conference proceedings
Publisher
Elsevier B.V.
Language
English
Pages
421-425
Status
Published
Volume
103
Year
2017
Organizations
  • 1 RUDN University, 6 Miklukho-Maklaya str., Moscow, 117198, Russian Federation
Keywords
fingerprint algorithm; plagiarism; Simhash