Systems and Means of Informatics

2022, Volume 32, Issue 4, pp 59-68

TOKENIZATION BASED ON THE METHOD OF FUNCTIONAL PATTERNS

  • Yu. V. Nikitin
  • A. A. Khoroshilov
  • A. E. Makarova

Abstract

The article proposes a new text tokenization method based on generalized functional templates. The method classifies Unicode characters by their role in forming text elements and builds compound patterns from the resulting generalized character classes. It does not rely on the widely used regular expressions. A distinctive feature of the method is that a sequence of characters can be used as part of an interval template. The strengths of the method include successful tokenization of complex information objects (numbers, geographic coordinates, names of articles of engineering products, etc.), detailed classification of tokens at the stage of their formation, the ability to enable or disable tokenization of particular token types, and the ability to add new templates derived from sample text for additional training of the system.
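
The general idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class labels (`L`, `D`, `S`, `P`), the template format, the token type names, and the functions `char_class`, `match_template`, and `tokenize` are all assumptions made for illustration. The sketch classifies characters by Unicode general category, matches templates built from runs of a character class and fixed character sequences, labels each token with its type as it is formed, and lets the caller switch token types on or off, all without regular expressions.

```python
import unicodedata


def char_class(ch):
    """Map a character to a coarse class (hypothetical labels)."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "L"  # letter
    if cat == "Nd":
        return "D"  # decimal digit
    if cat.startswith("Z") or ch in "\t\n\r":
        return "S"  # whitespace
    return "P"      # punctuation, symbols, everything else


# Hypothetical templates, tried in order. Each element is either
# ("run", cls): one or more characters of class cls, or
# ("lit", s):   the fixed character sequence s.
TEMPLATES = [
    ("decimal", [("run", "D"), ("lit", "."), ("run", "D")]),
    ("integer", [("run", "D")]),
    ("word",    [("run", "L")]),
]


def match_template(text, pos, template):
    """Return the end index if the template matches at pos, else None."""
    i = pos
    for kind, value in template:
        if kind == "run":
            j = i
            while j < len(text) and char_class(text[j]) == value:
                j += 1
            if j == i:          # a run needs at least one character
                return None
            i = j
        else:                   # "lit": fixed character sequence
            if not text.startswith(value, i):
                return None
            i += len(value)
    return i


def tokenize(text, enabled=None):
    """Return (type, text) pairs; enabled=None means all types are on."""
    tokens = []
    pos = 0
    while pos < len(text):
        if char_class(text[pos]) == "S":
            pos += 1            # skip whitespace
            continue
        for name, template in TEMPLATES:
            if enabled is not None and name not in enabled:
                continue
            end = match_template(text, pos, template)
            if end is not None:
                tokens.append((name, text[pos:end]))
                pos = end
                break
        else:
            # No enabled template matched: emit a single-character token.
            tokens.append(("other", text[pos]))
            pos += 1
    return tokens
```

With all types enabled, `tokenize("Pi is 3.14, ok")` yields `word` tokens for `Pi`, `is`, and `ok`, a single `decimal` token for `3.14`, and an `other` token for the comma. Passing `enabled={"word", "integer"}` switches the `decimal` type off, so `3.14` is split into `integer`, `other`, `integer`.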
