Informatics and Applications
2016, Volume 10, Issue 4, pp. 89-95
GENERALIZED STATISTICAL METHOD OF TEXT ANALYSIS BASED ON CALCULATION OF PROBABILITY DISTRIBUTIONS OF STATISTICAL VALUES
 A. K. Melnikov
 A. F. Ronzhin
Abstract
Many data streams are a mixture of random and unique data. One property of unique data is that its values are
not uniformly distributed over the set of possible values. A two-step procedure is used to single out unique data.
In the first step, candidate selection, a goodness-of-fit test against the uniform distribution is applied. In the
second step, a resource-intensive calculation under uncertainty checks the candidates for other attributes of
uniqueness. The size of the test is chosen according to the amount of resources allocated to the second step.
The accuracy with which the distribution of the test statistic is calculated determines the overhead that the
second step spends on processing random data and, therefore, the fraction of unique data that is lost.
The paper analyzes the boundary values of the parameters for which the exact distribution can be calculated at
the current level of computer technology. A generalized statistical method of text analysis, applicable to a wide
range of text parameters, is developed.
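The first step can be illustrated with a minimal sketch. The abstract does not name the test statistic, so the code assumes a chi-square-type statistic: the sum of squared symbol counts, which is a monotone function of Pearson's chi-square when the sample length is fixed. All function and parameter names are hypothetical. The exact distribution under the uniform hypothesis is computed by dynamic programming over the cells, which is feasible only for small alphabets and short samples.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def exact_sum_squares_distribution(num_cells, num_samples):
    """Exact distribution of S = sum of squared counts when num_samples
    symbols are drawn uniformly from an alphabet of num_cells symbols."""
    # State: (samples placed so far, sum of squares so far) -> weight,
    # where weight accumulates the product of 1 / c_i! over processed cells.
    states = {(0, 0): Fraction(1)}
    for _ in range(num_cells):
        nxt = {}
        for (used, sumsq), weight in states.items():
            for c in range(num_samples - used + 1):
                key = (used + c, sumsq + c * c)
                nxt[key] = nxt.get(key, Fraction(0)) + weight / factorial(c)
        states = nxt
    # Multinomial probability: n! / (c_1! ... c_N!) / N^n.
    scale = Fraction(factorial(num_samples), num_cells ** num_samples)
    return {s: w * scale for (used, s), w in states.items() if used == num_samples}


def critical_sum_squares(dist, size):
    """Smallest threshold t with P(S >= t) <= size, or None if no such t."""
    tail = Fraction(0)
    threshold = None
    for s in sorted(dist, reverse=True):
        if tail + dist[s] > size:
            break
        tail += dist[s]
        threshold = s
    return threshold


def is_candidate(symbols, alphabet, threshold):
    """Step one: flag a sample as a candidate when uniformity is rejected.
    Symbols outside the alphabet are ignored."""
    counts = Counter(symbols)
    sumsq = sum(counts[a] ** 2 for a in alphabet)
    return threshold is not None and sumsq >= threshold


dist = exact_sum_squares_distribution(num_cells=3, num_samples=3)
threshold = critical_sum_squares(dist, size=Fraction(1, 5))
print(threshold, is_candidate("aaa", "abc", threshold))
```

A larger test size passes more random samples to the expensive second step, while a smaller one discards more unique data. That is the trade-off the abstract describes. The number of states in the table grows quickly with the alphabet size and sample length, which is why the range of parameters admitting an exact calculation is bounded.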
Authors
A. K. Melnikov and A. F. Ronzhin
Author Affiliations
STC CLSC "InformInvestGroup"; 125, Bld. 17 Varshavskoye Shosse, Moscow 117587, Russian Federation
S. A. Lebedev Institute of Precision Mechanics and Computer Engineering of the Russian Academy of Sciences,
51 Leninsky Prosp., Moscow 119991, Russian Federation
