Informatics and Applications

2016, Volume 10, Issue 4, pp 89-95

GENERALIZED STATISTICAL METHOD OF TEXT ANALYSIS BASED ON CALCULATION OF PROBABILITY DISTRIBUTIONS OF STATISTICAL VALUES

  • A. K. Melnikov
  • A. F. Ronzhin

Abstract

Many data streams are a mixture of random and unique data. One property of unique data is that its values occur with a nonuniform probability distribution over the set of possible values. A two-step procedure is used to distinguish unique data. In the first step, candidate selection, a goodness-of-fit test against the uniform distribution is applied. In the second step, a resource-intensive computation under uncertainty checks the candidates for other attributes of unique data. The size of the test (its significance level) is chosen according to the resources allotted to the second step. The accuracy with which the distribution of the test statistic is calculated determines how much second-step overhead is spent on processing random data and, therefore, the share of unique data that is lost. The paper analyzes the boundary values of the parameters for which the exact distribution can be calculated with current computing technology. A generalized statistical method of text analysis, applicable to a wide range of text parameters, is developed.
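The first step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the statistic is Pearson's chi-square, that the exact distribution is obtained by brute-force enumeration of all count vectors (feasible only for small sample lengths and alphabets), and that every symbol in a block belongs to an alphabet of the stated size. The function names `chi_square_uniform`, `exact_tail_probability`, and `select_candidates` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def statistic_from_counts(counts, n):
    """Pearson chi-square statistic against the uniform distribution (exact)."""
    k = len(counts)
    # sum((o - n/k)^2 / (n/k)) rewritten with integers to stay exact
    return Fraction(sum((o * k - n) ** 2 for o in counts), n * k)


def chi_square_uniform(sample, alphabet_size):
    """Statistic of a sample; symbols that never occur count as zero."""
    counts = list(Counter(sample).values())
    counts += [0] * (alphabet_size - len(counts))
    return statistic_from_counts(counts, len(sample))


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def exact_tail_probability(n, alphabet_size, threshold):
    """Exact P(statistic >= threshold) for n uniform draws, by enumeration."""
    tail = Fraction(0)
    for counts in compositions(n, alphabet_size):
        if statistic_from_counts(counts, n) >= threshold:
            ways = factorial(n)
            for c in counts:
                ways //= factorial(c)  # multinomial coefficient
            tail += Fraction(ways, alphabet_size ** n)
    return tail


def select_candidates(blocks, alphabet_size, level):
    """Step one: keep blocks whose exact p-value does not exceed `level`."""
    selected = []
    for block in blocks:
        stat = chi_square_uniform(block, alphabet_size)
        if exact_tail_probability(len(block), alphabet_size, stat) <= level:
            selected.append(block)
    return selected  # step two (the expensive check) would run on these only


candidates = select_candidates(["aaaa", "abab", "aaab"], 2, Fraction(1, 8))
```

In this sketch, raising `level` passes more blocks, including random ones, to the expensive second step, while lowering it discards more unique data; this is the trade-off the abstract describes. The enumeration grows combinatorially with block length and alphabet size, which is why the range of parameters allowing an exact calculation is limited.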
