Systems and Means of Informatics

2019, Volume 29, Issue 2, pp 148-160

ANNOTATION METHODOLOGY OF SUPRACORPORA DATABASES

A. A. Goncharov
O. Yu. Inkova
M. G. Kruzhkov

Abstract

The paper considers methodological principles of annotating linguistic units in parallel corpora using supracorpora databases. Supracorpora databases are a novel information resource in linguistics that allows researchers to save the results of linguistic analysis of corpus data as annotations structured according to the research objectives. For parallel corpora, the annotation procedure consists of four basic stages: looking up the objects to annotate; defining the linguistic unit and its context (in both the original and the translated text); defining the linguistic unit's attributes (in both the original and the translated text); and combining the two linguistic units into a translation correspondence and defining its attributes. The paper summarizes previously described annotation techniques, examines the functional potential of supracorpora databases, and concludes that the developed methodology can be applied to a wide variety of research objects.
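The four stages can be pictured with a minimal Python sketch. It is an illustration only, not the authors' implementation: the class names, the `find_candidates` helper, the attribute keys, and the Russian-French sentence pairs are all assumptions invented for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LinguisticUnit:
    """Stages 2 and 3: a unit, its context, and its attributes (hypothetical model)."""
    text: str        # the unit itself, e.g. a connective
    context: str     # the text fragment in which the unit occurs
    language: str    # language code of this side of the parallel corpus
    attributes: dict = field(default_factory=dict)


@dataclass
class TranslationCorrespondence:
    """Stage 4: two units joined into one annotated correspondence."""
    source: LinguisticUnit
    target: LinguisticUnit
    attributes: dict = field(default_factory=dict)


def find_candidates(aligned_pairs, query):
    """Stage 1: keep aligned pairs whose original contains `query` as a whole word."""
    return [
        (src, tgt)
        for src, tgt in aligned_pairs
        if query in re.findall(r"\w+", src.lower())
    ]


# Invented sample data: (original, translation) fragment pairs.
pairs = [
    ("Он устал, но продолжал работать.",
     "Il était fatigué, mais il continuait à travailler."),
    ("Окно было открыто.", "La fenêtre était ouverte."),
]

# Stage 1: look up annotation objects.
hits = find_candidates(pairs, "но")
src_context, tgt_context = hits[0]

# Stages 2 and 3: define each unit, its context, and its attributes.
source = LinguisticUnit("но", src_context, "ru", {"type": "connective"})
target = LinguisticUnit("mais", tgt_context, "fr", {"type": "connective"})

# Stage 4: combine both units into a translation correspondence.
corr = TranslationCorrespondence(source, target, {"equivalence": "direct"})
print(corr.source.text, "->", corr.target.text)
```

Keeping the attributes of each unit separate from the attributes of the correspondence mirrors the distinction the abstract draws between stage three and stage four; a real supracorpora database would store these records in persistent tables rather than in memory.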

