Systems and Means of Informatics

2015, Volume 25, Issue 1, pp 34-53

MULTICRITERIA METHOD FOR DETECTING NEAR-DUPLICATES IN A STREAM OF TEXT MESSAGES

  • A. Andreev
  • D. Berezkin
  • I. Kozlov
  • K. Simakov

Abstract

The problem of near-duplicate detection in a stream of text messages is considered. A model of a text document and a multicriteria method for identifying duplicates are proposed. The model can be flexibly adjusted to different domains. The method is based on binary classification using a support vector machine. The paper also provides a method for prefiltering candidates to keep the approach efficient. Several experiments were carried out on data obtained from a stream of news articles. The results show the feasibility of the suggested approach.
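
The pipeline outlined in the abstract (prefilter candidates, compare each pair on several criteria, then classify the pair) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the two criteria (title and body shingle overlap), the inverted-index prefilter, and all names such as `CandidateIndex` and `is_near_duplicate` are hypothetical, and the hand-set `WEIGHTS` and `BIAS` merely stand in for a linear decision function that a trained support vector machine would supply.

```python
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_vector(doc_a, doc_b):
    """One similarity score per criterion; these criteria are assumptions."""
    return [
        jaccard(shingles(doc_a["title"], 1), shingles(doc_b["title"], 1)),
        jaccard(shingles(doc_a["body"]), shingles(doc_b["body"])),
    ]


class CandidateIndex:
    """Inverted index from body shingles to message ids, used to prefilter."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, doc):
        self.docs[doc_id] = doc
        for s in shingles(doc["body"]):
            self.postings[s].add(doc_id)

    def candidates(self, doc, min_shared=1):
        """Ids of stored messages sharing at least min_shared shingles."""
        counts = defaultdict(int)
        for s in shingles(doc["body"]):
            for doc_id in self.postings.get(s, ()):
                counts[doc_id] += 1
        return sorted(d for d, c in counts.items() if c >= min_shared)


# Placeholder linear decision function w.x + b (hypothetical values).
# In practice the weights would come from an SVM trained on labeled pairs.
WEIGHTS, BIAS = [1.0, 2.0], -1.5


def is_near_duplicate(doc_a, doc_b):
    features = similarity_vector(doc_a, doc_b)
    score = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return score > 0


# Usage: index earlier messages, then check a newly arrived one.
index = CandidateIndex()
index.add(1, {"title": "Storm hits coast",
              "body": "a strong storm hit the coast on monday"})
index.add(2, {"title": "Market update",
              "body": "stocks rose sharply in early trading"})
new_doc = {"title": "Storm hits coast",
           "body": "a strong storm hit the coast on monday night"}
duplicates = [i for i in index.candidates(new_doc)
              if is_near_duplicate(index.docs[i], new_doc)]
```

The prefilter keeps the number of pairwise comparisons small, since only messages that share at least one shingle with the new message reach the classifier.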
