Institute of Informatics Problems of the Russian Academy of Sciences

«INFORMATICS AND APPLICATIONS»
Scientific journal
Volume 14, Issue 3, 2020

Abstract and Keywords

STATISTICAL ESTIMATION OF DISTRIBUTIONS OF RANDOM COEFFICIENTS IN THE LANGEVIN STOCHASTIC DIFFERENTIAL EQUATION
  • A. K. Gorshenin  Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • V. Yu. Korolev  Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation, Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow 119991, Russian Federation
  • A. A. Shcherbinina  Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow 119991, Russian Federation

Abstract: A method is described for statistical estimation of the distributions of random coefficients of the Langevin stochastic differential equation (SDE) by the technique of moving separation of mixtures. Discrete approximations are proposed for these distributions. To study how the distributions of the SDE coefficients vary in time, an algorithm is proposed for sequential identification (determination of local connectivity) of the components of the resulting mixture distributions. This algorithm combines a greedy algorithm for determining the number of components with a clustering method (k- or c-means). The application of the proposed method is illustrated by examples of the analysis of heat transfer processes between the atmosphere and the ocean.

Keywords: mixture distribution; local connectivity; greedy algorithm; clustering

ON MARKOVIAN AND RATIONAL ARRIVAL PROCESSES. I
  • V. A. Naumov  Service Innovation Research Institute, 8A Annankatu, Helsinki 00120, Finland
  • K. E. Samouylov  Peoples' Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya Str., Moscow 117198, Russian Federation, Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: This article is the first part of a review carried out within the framework of the RFBR project No. 1917-50126. The purpose of the review is to familiarize interested readers with the basics of the theory of Markovian arrival processes, to facilitate the application of these models in practice and, if necessary, their detailed study. The first part of the review presents the properties of general Markovian arrival processes and shows their relationship with Markov additive processes and Markov renewal processes. The second part considers the subclasses of Markovian arrival processes that are important for applications, i.e., simple and batch arrival processes of homogeneous and heterogeneous arrivals. It then shows how the properties of Markovian arrival processes are associated with the product form of stationary distributions of Markov systems. In conclusion, matrix-exponential distributions and rational arrival processes are discussed; they expand the capabilities of Markovian arrival processes for modeling complex systems while preserving the convenience of analyzing them computationally.

Keywords: Markov chain; Markovian arrival process; Markov additive process; MAP; MArP

APPROXIMATION OF THE SET OF SOLUTIONS OF SYSTEMS OF NONLINEAR INEQUALITIES USING GRAPHIC ACCELERATORS
  • M. V. Popov  Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • M. A. Posypkin  Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: Some problems can be reduced to solving a system of inequalities, but computing the exact solution set may not be feasible. Thus, various methods for approximating the solution set have been developed. The more accurate the required approximation, the more calculations must be performed and, consequently, the longer the algorithms run. Nowadays, it is common to speed up algorithms by parallelizing computations on graphics accelerators. The paper describes a serial method for approximating the solution set of systems of inequalities and proposes a parallel hybrid algorithm that combines iterations on a uniform grid with the branch and bound method. This algorithm is suited for direct implementation on graphics accelerators and does not suffer from excessive enumeration of candidate solutions. The sequential algorithm and two versions of the parallel algorithm are compared on one example: approximating the working area of a robot, which is the set of the robot's tool positions and is a key characteristic of the robot.

Keywords: optimization; parallel computing; graphics accelerator; GPU; CUDA; nonlinear inequalities
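
The idea of approximating a solution set by boxes can be illustrated with a minimal serial sketch. It is not the authors' hybrid GPU algorithm: it uses plain recursive box subdivision with interval bounds, and the inequality x^2 + y^2 <= 1, the function names, and the stopping width are all assumptions chosen for illustration.

```python
import math


def square_range(lo, hi):
    """Exact range of t * t for t in the interval [lo, hi]."""
    low = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    return low, max(lo * lo, hi * hi)


def classify(box):
    """Label a box relative to the (hypothetical) constraint x^2 + y^2 <= 1."""
    (x_lo, x_hi), (y_lo, y_hi) = box
    x_min, x_max = square_range(x_lo, x_hi)
    y_min, y_max = square_range(y_lo, y_hi)
    if x_max + y_max <= 1.0:
        return "inside"    # every point of the box satisfies the inequality
    if x_min + y_min > 1.0:
        return "outside"   # no point of the box satisfies it
    return "boundary"      # undecided: the box may straddle the boundary


def approximate(box, min_width):
    """Split undecided boxes until they are at most min_width wide."""
    inside, boundary = [], []
    stack = [box]
    while stack:
        current = stack.pop()
        label = classify(current)
        if label == "inside":
            inside.append(current)
        elif label == "boundary":
            (x_lo, x_hi), (y_lo, y_hi) = current
            if max(x_hi - x_lo, y_hi - y_lo) <= min_width:
                boundary.append(current)
            else:
                x_mid = (x_lo + x_hi) / 2.0
                y_mid = (y_lo + y_hi) / 2.0
                for xs in ((x_lo, x_mid), (x_mid, x_hi)):
                    for ys in ((y_lo, y_mid), (y_mid, y_hi)):
                        stack.append((xs, ys))
        # boxes labelled "outside" are discarded
    return inside, boundary


def area(boxes):
    """Total area of a list of axis-aligned boxes."""
    return sum((x_hi - x_lo) * (y_hi - y_lo)
               for (x_lo, x_hi), (y_lo, y_hi) in boxes)


if __name__ == "__main__":
    inner_boxes, edge_boxes = approximate(((-2.0, 2.0), (-2.0, 2.0)), 0.05)
    inner = area(inner_boxes)
    outer = inner + area(edge_boxes)
    # The true area (pi) lies between the inner and outer estimates.
    print(inner, math.pi, outer)
```

The inner boxes give a guaranteed subset of the solution set, and the inner plus boundary boxes give a guaranteed superset; shrinking `min_width` tightens both at the cost of more boxes, which is the accuracy versus runtime trade-off the abstract mentions.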

A SINGLE-SERVER QUEUEING SYSTEM WITH LIFO SERVICE, PROBABILISTIC PRIORITY, BATCH POISSON ARRIVALS, AND BACKGROUND CUSTOMERS
  • T. A. Milovanova  Peoples' Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya Str., Moscow 117198, Russian Federation
  • R. V. Razumchik  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: Consideration is given to the single-server queueing system with two independent flows of customers: a batch Poisson flow of (primary) customers and a saturated flow of background customers. Primary customers have relative priority over background customers, i.e., the service of a background customer cannot be interrupted. A background customer is instantly taken for service every time the buffer for primary customers is empty upon service completion. The service times of primary and background customers are independent and are allowed to be generally distributed. The implemented service policy is LIFO (last in, first out) with probabilistic priority. The method and analytic expressions for the computation (in terms of transforms) of the system's main stationary performance characteristics, including the stationary distributions of the waiting and sojourn times of the primary customers, are presented.

Keywords: queueing system; LIFO service; probabilistic priority; batch arrivals; background customers

ON THE DISTRIBUTION OF THE RATIO OF THE SUM OF SAMPLE ELEMENTS EXCEEDING A THRESHOLD TO THE TOTAL SUM OF SAMPLE ELEMENTS. I
  • V. Yu. Korolev  Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow 119991, Russian Federation, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: The problem of description of the distribution of the ratio of the sum of sample elements exceeding a threshold to the total sum of sample elements is considered. Unlike known versions of this problem in which the number of summed extreme order statistics is fixed, here, the specified threshold can be exceeded by an unpredictable number of sample elements. In the paper, in terms of the distribution function of a separate summand, the explicit form of the distribution of the ratio of the sum of sample elements exceeding a threshold to the total sum of sample elements is formally presented. The asymptotic and limit distributions are heuristically deduced for this ratio. These distributions are convenient for practical computations. The cases are considered in which the distributions of the summands have light tails (the second moments are finite) as well as the cases in which these distributions have heavy tails (belong to the domain of attraction of a stable law). In all cases, the normalization of the ratio is described that provides the existence of a nondegenerate limit (as the number of summands infinitely increases) distribution as well as the limit distribution itself (normal for the case of light tails and stable for the case of heavy tails).

Keywords: sum of independent random variables; random sum; binomial distribution; mixture of probability distributions; normal distribution; stable distribution; extreme order statistics
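
For concreteness, the statistic studied in this paper can be computed for a given sample as in the short sketch below. The function name is hypothetical, and treating "exceeding" as strictly greater than the threshold is an assumption; the abstract does not specify the boundary case.

```python
def exceedance_ratio(sample, threshold):
    """Ratio of the sum of elements above the threshold to the total sum.

    Assumption: an element "exceeds" the threshold when it is strictly greater.
    """
    total = sum(sample)
    if total == 0:
        raise ValueError("the total sum of the sample must be nonzero")
    above = sum(x for x in sample if x > threshold)
    return above / total


if __name__ == "__main__":
    # Elements 3 and 4 exceed 2.5, so the ratio is (3 + 4) / (1 + 2 + 3 + 4).
    print(exceedance_ratio([1, 2, 3, 4], 2.5))
```

The number of elements that enter the numerator depends on the sample itself, which is the feature that distinguishes this setting from sums of a fixed number of extreme order statistics.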

ON THE STATISTICAL PROPERTIES OF RISK ESTIMATE IN THE PROBLEM OF INVERTING THE RADON TRANSFORM WITH A RANDOM VOLUME OF PROJECTION DATA
  • O. V. Shestakov  Department of Mathematical Statistics, Faculty of Computational Mathematics and Cybernetics, M. V. Lomonosov Moscow State University, 1-52 Leninskiye Gory, GSP-1, Moscow 119991, Russian Federation, Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: When reconstructing tomographic images, it is necessary to solve the problem of suppressing the noise arising from registration of projection data. Methods for solving this problem based on wavelet algorithms and threshold processing procedures have several advantages, including computational efficiency and ability to adapt to local features of images. An analysis of the errors of these methods is an important practical task, since it makes it possible to evaluate the quality of both the methods themselves and the equipment used. When using threshold processing procedures, it is usually assumed that the number of decomposition coefficients is fixed and the noise distribution is Gaussian. This model has been well studied in the literature, and the optimal threshold values have been calculated for different classes of functions. However, in some situations, the sample size is not fixed in advance and must be modeled with some random variable. This paper considers a model with a random number of observations and investigates the asymptotic properties of the mean-square risk estimate. It is proved that the limiting distribution of this estimate belongs to the class of shift-scale mixtures of normal laws.

Keywords: threshold processing; random sample size; Radon transform; grid; mean-square risk estimate
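
Threshold processing of decomposition coefficients is commonly done with hard or soft thresholding. The sketch below shows both rules together with the widely used universal threshold sigma * sqrt(2 ln n); it is a generic illustration under the classical fixed-size Gaussian model, not the paper's procedure, and it does not model a random number of observations. All function names are hypothetical.

```python
import math


def hard_threshold(coeffs, t):
    """Keep coefficients whose magnitude exceeds t; zero out the rest."""
    return [c if abs(c) > t else 0.0 for c in coeffs]


def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t, zeroing the small ones."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


def universal_threshold(sigma, n):
    """Common threshold choice sigma * sqrt(2 ln n) for n >= 1 coefficients."""
    return sigma * math.sqrt(2.0 * math.log(n))


if __name__ == "__main__":
    coeffs = [0.5, -2.0, 3.0]
    print(hard_threshold(coeffs, 1.0))
    print(soft_threshold(coeffs, 1.0))
    print(universal_threshold(1.0, len(coeffs)))
```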

METHOD OF LOGARITHMIC MOMENTS FOR ESTIMATING THE GAMMA-EXPONENTIAL DISTRIBUTION PARAMETERS
  • A. A. Kudryavtsev  Department of Mathematical Statistics, Faculty of Computational Mathematics and Cybernetics, M. V. Lomonosov Moscow State University, 1-52 Leninskiye Gory, GSP-1, Moscow 119991, Russian Federation
  • O. V. Shestakov  Department of Mathematical Statistics, Faculty of Computational Mathematics and Cybernetics, M. V. Lomonosov Moscow State University, 1-52 Leninskiye Gory, GSP-1, Moscow 119991, Russian Federation, Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: The article discusses a modified method of moments for estimating the parameters of the gamma-exponential distribution. The strong consistency of the obtained estimates is proved. The gamma-exponential distribution is a convenient tool for modeling processes and phenomena using scale mixtures of generalized gamma distributions. Such problems arise in many fields of science under the assumption that the considered parameters are randomized and can be described in terms of Bayesian balance models. Owing to the wide variety of density shapes of the five-parameter gamma-exponential distribution, the obtained results can be applied to a wide class of problems that model distributions with unbounded positive support, without additional assumptions about representing the studied object as a scale mixture.

Keywords: parameter estimation; gamma-exponential distribution; mixed distributions; generalized gamma distribution; method of moments; consistent estimate

BASIC CONCEPTS OF PROGRAMMING EXPOUNDED FOR PRESCHOOLER
  • V. B. Betelin  Federal Research Center "Scientific Research Institute for System Analysis of the Russian Academy of Sciences," 36-1 Nakhimovsky Prosp., Moscow 117218, Russian Federation
  • A. G. Kushnirenko  Federal Research Center "Scientific Research Institute for System Analysis of the Russian Academy of Sciences," 36-1 Nakhimovsky Prosp., Moscow 117218, Russian Federation
  • A. G. Leonov  Federal Research Center "Scientific Research Institute for System Analysis of the Russian Academy of Sciences," 36-1 Nakhimovsky Prosp., Moscow 117218, Russian Federation, M. V. Lomonosov Moscow State University, 1 Leninskie Gory, GSP-1, Moscow 119991, Russian Federation, Moscow Pedagogical State University, 1-1 Malaya Pirogovskaya Str., Moscow 119991, Russian Federation

Abstract: The development of information technology has created a socioeconomic demand for lowering the age at which children are introduced to programming. After 6 years of effort, the authors have developed and widely introduced a year-long programming course for preschoolers built on the metaphor of program control. While developing the course, the authors selected and formulated a set of basic programming concepts that fully reveals this metaphor and, at the same time, can be mastered by preschool children aged 6+ through active play. This set of concepts is introduced using examples of control programs for moving and stationary objects with an intuitive, visible command system. Control without feedback is introduced at the beginning of the course; the concept of feedback is introduced and used only at the end. The basic pedagogical software product is the PiktoMir text-free pictographic system developed by the Federal Research Center "Scientific Research Institute for System Analysis of the Russian Academy of Sciences," together with its programmatic and methodological content, which allows each preschooler to gain experience in writing and debugging at least 120-150 simple programs by the end of the course.

Keywords: informatics; robot; program; computer; programming language; preschooler; PiktoMir; pictogram

COMPUTATIONAL ASPECTS OF OPTIMIZATION ON CC-VaR IN A COMPLEX OF MARKETS
  • G. A. Agasandyan  A. A. Dorodnicyn Computing Center, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 40 Vavilov Str., Moscow 119333, Russian Federation

Abstract: The work directly continues the author's previous investigation of the continuous VaR-criterion (CC-VaR) in a set of markets of different dimensions that are mutually connected by their underliers. The exposition aims to apply the ideas and methods developed for the theoretical continuous model to discrete scenario markets. For a typical model case, a collection of one two-dimensional market and two one-dimensional markets, a rule for constructing a combined portfolio in these markets is presented. This rule gives a necessary and sufficient condition for portfolio optimality in the weighted composition of basis instruments. The condition is based on the imbalance in relative returns between markets while optimality on CC-VaR is maintained. The optimal combined portfolio with three components is constructed. The idealistic and surrogate versions of this combined portfolio, which are useful for testing all algorithmic calculations and for graphically illustrating the portfolio's payoff functions, are also given. The model can be extended without difficulty, at least in theory, to markets of greater dimensions.

Keywords: underlier; risk preferences function; continuous VaR-criterion; cost and forecast densities; return relative function; Neyman-Pearson procedure; combined portfolio; surrogate portfolio

MATHEMATICAL STATISTICS IN THE TASK OF IDENTIFYING HOSTILE INSIDERS
  • N. A. Grusho  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • M. I. Zabezhailo  A. A. Dorodnicyn Computing Center, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 40 Vavilov Str., Moscow 119333, Russian Federation
  • D. V. Smirnov  Sberbank of Russia, 19 Vavilov Str., Moscow 117999, Russian Federation
  • E. E. Timonina  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • S. Ya. Shorgin  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: The paper explores approaches to identifying hostile insiders of an organization who act in collusion. Identifying an organized group of information security violators is one of the most complex tasks in ensuring the security of an organization. The source data for analysis consist of many small samples describing the functioning of the organization's information technologies. This set can be considered as big data. A clustering method is used to reduce the amount of source data, which makes it possible to use mathematical statistics efficiently, i.e., to identify the small samples carrying information about hostile insiders. The difficulty of the task is to lose as few of the needed small samples as possible. Conditions have been found under which, in the series scheme, the probability of identifying insiders acting in collusion tends to 1.

Keywords: identification of the organized group of hostile insiders; small samples; big data; mathematical statistics

IDENTIFYING ANOMALIES USING METADATA
  • A. A. Grusho  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • E. E. Timonina  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • N. A. Grusho  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • I. Yu. Teryokhina  Faculty of Computational Mathematics and Cybernetics, M. V. Lomonosov Moscow State University, 1-52 Leninskiye Gory, GSP-1, Moscow 119991, Russian Federation

Abstract: The paper discusses the problem of information technology security control based on computer audit data. These data are a sequence of small samples, each of which describes the transmission of information from one transformation to another. Information technologies are represented by mathematical models in the form of directed acyclic graphs. In the article, such graphs describing data transmission are called metadata. Integrated computer audit data may simultaneously contain traces of the execution of several information technologies, each described by its own graph. This makes it difficult to recognize information flows that correspond to arcs of different graphs. The paper introduces the concept of a legal information flow, which corresponds to the transfer of data of all information technologies being performed. Information flows that do not correspond to the execution of existing information technologies are called illegal flows, or anomalies. Such information flows can occur due to hostile activities of insiders or due to errors in user actions. The article solves the problem of effective identification of legal information flows and anomalies on the basis of metadata.

Keywords: information security; information flow; anomalies; metadata; systems of different representatives

APPROXIMATION OF THE MULTIUSER NETWORK FEASIBLE FLOWS SET
  • Yu. E. Malashenko  Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • I. A. Nazarova  Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: A method for approximate description of a convex polyhedral set of feasible flows transmitted between all network nodes simultaneously is considered. A method for constructing an internal convex approximating frame is proposed. The frame is formed based on the vectors of maximum feasible flows between pairs of source-receiver vertices. The system of support vectors is determined by the points lying on the outer edges of the baseline set. Any convex combination of base vectors sets the feasible flow distribution. The developed algorithmic schemes allow parallelization of computational procedures on heterogeneous multiprocessor complexes. The resulting aggregated description can be used for dispatching intensive input information flows that exceed the network's capability.

Keywords: multicommodity flow model; feasible flow set; internal support frame
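
The statement that any convex combination of base vectors yields a feasible flow distribution can be made concrete with a small sketch. The base vectors and weights below are made-up numbers, and the function name is hypothetical; the sketch only shows how a point of the internal frame is formed and does not reproduce the authors' construction of the base vectors.

```python
def convex_combination(base_vectors, weights, tol=1e-9):
    """Return sum_k weights[k] * base_vectors[k] for valid convex weights."""
    if len(base_vectors) != len(weights):
        raise ValueError("one weight is required per base vector")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be nonnegative and sum to 1")
    dim = len(base_vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, base_vectors))
            for i in range(dim)]


if __name__ == "__main__":
    # Hypothetical flow vectors for two source-receiver pairs.
    base = [[4.0, 0.0], [0.0, 2.0]]
    print(convex_combination(base, [0.25, 0.75]))
```

Because the feasible set is convex, every vector produced this way from feasible base vectors is itself feasible, which is what makes the frame an internal approximation.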

BAYESIAN APPROACH TO THE CONSTRUCTION OF AN INDIVIDUAL USER TRAJECTORY IN THE SYSTEM OF DISTANCE LEARNING
  • A. V. Bosov  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation, Moscow State Aviation Institute (National Research University), 4 Volokolamskoe Shosse, Moscow 125933, Russian Federation
  • Ya. G. Martyushova  Moscow State Aviation Institute (National Research University), 4 Volokolamskoe Shosse, Moscow 125933, Russian Federation
  • A. V. Naumov  Moscow State Aviation Institute (National Research University), 4 Volokolamskoe Shosse, Moscow 125933, Russian Federation
  • A. P. Sapunova  Moscow State Aviation Institute (National Research University), 4 Volokolamskoe Shosse, Moscow 125933, Russian Federation

Abstract: The paper considers the task of forming an individual user path in a distance learning system (LMS) for a blended form of education in which students' independent work is organized through the LMS. At the end of each section of the training course, the LMS users are divided into categories determined by solving a Bayesian classification problem. For each category, an individual task of a different level of complexity is proposed for the next section of the course, thus forming the individual trajectory of the student. The Bayesian classifier is set up based on statistics of the work of LMS users. The experimental results of solving the problem at one of the stages of training are presented.

Keywords: distance learning system; Bayesian classifier; adaptive systems; individual learning path

ANALYSIS OF THE NETWORK SLICING MECHANISMS WITH GUARANTEED ALLOCATED RESOURCES FOR VARIOUS TRAFFIC TYPES
  • K. A. Ageev  Peoples' Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya Str., Moscow 117198, Russian Federation
  • E. S. Sopin  Peoples' Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya Str., Moscow 117198, Russian Federation, Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • N. V. Yarkina  Peoples' Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya Str., Moscow 117198, Russian Federation
  • K. E. Samouylov  Peoples' Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya Str., Moscow 117198, Russian Federation, Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
  • S. Ya. Shorgin  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: Network slicing is one of the key capabilities of modern networks, allowing several virtual mobile operators to use the physical resources of one base station. This allows operators and resource owners (tenants) to lease and manage several dedicated logical networks with specific functionality working on top of a common infrastructure. Each of these logical networks is called a network slice and can be adapted to provide certain system behavior to maintain a specified level of quality of service indicators. The paper describes the developed mathematical framework of the network slicing mechanisms and analyzes it by means of extensive simulations.

Keywords: simulation modeling; queuing system; limited resources; network slicing

ON THE CONCEPT OF A STOCHASTIC MODEL WITH CONTROL AT THE MOMENTS OF THE PROCESS AT THE BORDER OF A PRESENTED SUBSET OF MULTIPLE STATES
  • P. V. Shnurkov  National Research University Higher School of Economics, 34 Tallinskaya Str., Moscow 123458, Russian Federation
  • D. A. Novikov  National Research University Higher School of Economics, 34 Tallinskaya Str., Moscow 123458, Russian Federation

Abstract: The work is devoted to the creation and analysis of the general concept of a special stochastic model with controls. The main feature of the model is that the control actions are carried out at times when a stochastic process describing the system under study reaches the boundary of a given subset of the set of states. The control action itself consists in transferring the process from the boundary to one of the internal states of the given subset. In this case, the internal states are interpreted as acceptable and the boundary ones as unacceptable. Control actions are described by a set of discrete probability distributions depending on the boundary state number. Such a set defines a control strategy. The problem of optimal control is formalized as the problem of finding a control strategy that delivers a global extremum to a certain stationary cost-effectiveness indicator, which in terms of its economic content represents the average specific profit arising from a long evolution of the system. It is proposed to call the posed problem of optimal control the tuning problem. The paper notes that this stochastic model and the corresponding tuning problem can be used to study many real phenomena occurring in economic and technical systems. As an example of such a real phenomenon, interventions in the foreign exchange market of the Russian Federation are considered.

Keywords: control in stochastic systems; Markov controlled processes; semi-Markov controlled processes; stochastic tuning problem

OPTIMIZATION MODELS EXTRACTION FROM DATA
  • V. I. Donskoy  V. I. Vernadsky Crimean Federal University, 4 Vernadsky Av., Simferopol 295007, Russian Federation

Abstract: The basic principles, methods, and algorithms representing a new information technology for building optimization mathematical models from data (BOMD) are presented. This technology allows one to automatically build mathematical models of planning and control from precedents (observations) of objects, which makes it possible to solve problems of intelligent control and to determine expedient behavior of economic and other objects in difficult environments. The BOMD technology allows one to obtain objective control models that reflect real-life relationships, goals, constraints, and processes. This is its main advantage over the traditional, subjective approach to control. Linear and nonlinear algorithms for synthesis of models based on precedent information are developed.

Keywords: machine learning; model extraction from data; optimization; neural networks; gradient methods

PROBLEM-ORIENTED VERIFYING THE COMPLETENESS OF TEMPORAL ONTOLOGIES AND FILLING CONCEPTUAL LACUNAS
  • I. M. Zatsman  Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation

Abstract: An approach to verifying the completeness of ontologies and filling the conceptual lacunas found in them is proposed. The approach is based on the following symbiotic information processes: goal-oriented discovery of new knowledge using data, its representation in an ontology, and application of the ontology to solve a problem. While the problem is being solved, the completeness of the ontology is verified and the conceptual lacunas found are registered and filled. The personal, collective, and conventional levels of knowledge representation in ontologies are discussed. The approach allows one to find conceptual lacunas at the conventional level of ontologies and fill them at their personal and/or collective levels, provided that potential sources for discovering new knowledge are available. The purpose of the paper is to consider the model of symbiotic information processes. The developed model is a generalized flowchart that implements the proposed approach. The flowchart serves as the basis for computerization of symbiotic processes. The model description is illustrated by an example of finding conceptual lacunas in a linguistic typology and filling them with the concepts of new knowledge discovered using text data.

Keywords: three-level representation of knowledge; temporal ontology; conceptual lacuna; generation of new knowledge; symbiotic processes

USING TOPIC MODELS FOR PAIRWISE COMPARISON OF COLLECTIONS OF SCIENTIFIC PAPERS
  • F. V. Krasnov  NAUMEN R&D, 49A Tatishcheva Str., Ekaterinburg 620028, Russian Federation
  • A. V. Dimentov  National Electronic Information Consortium, 5 Letnikovskaya Str., Moscow 115114, Russian Federation
  • M. E. Shvartsman  National Electronic Information Consortium, 5 Letnikovskaya Str., Moscow 115114, Russian Federation, Russian State Library, 3/5 Vozdvigenka Str., Moscow 119019, Russian Federation

Abstract: The authors propose a new technique for pairwise comparison of collections of scientific articles via a topic model. The developed methodology is called Comparative Topic Analysis (CTA). Comparative topic analysis yields not only a quantitative assessment of the similarity of collections but also the structural differences between the compared text collections. The authors developed a transparent visualization of the distance between text collections. This study compares existing approaches to topic modeling with respect to the task of comparing collections of scientific papers. The authors consider probabilistic and generative topic models. The requirements that text collections must meet for the correct application of CTA were analyzed. The CTA methodology has shown high efficiency in identifying structural differences in related collections. The authors developed an integral metric, the "Content Uniqueness Ratio," which allows comparing text collections with each other. In the computational experiment, the topic model with additive regularization (ARTM) proved to be the most informative.

Keywords: comparative topic analysis; comparative text model; deep text analysis; topic models metrics

DOI: 10.14357/19922264200318