Systems and Means of Informatics

2022, Volume 32, Issue 4, pp 45-58

NOISY TEXT ANALYTICS

M. P. Krivenko

Abstract

This article surveys methods for interpreting noisy text data in order to extract meaningful information from it. Analytics makes it possible to isolate useful concepts, draw conclusions from the collected data, and produce forecasts. The texts being processed are assumed to possibly deviate from the target and selected reference language. Such deviations may be caused by measurement and recording errors, may result from random or unforeseen factors, or may arise from an incorrect choice or tuning of the model.
The article lists the types of distortions and considers the areas of application of intelligent text processing methods: scientific publications, blogs, e-mail, social media, spoken messages, and web analytics. Methods geared toward processing noisy texts are identified. Promising directions for further research are formulated: clarifying the concepts of "noise" and "dirty" texts; developing ways to measure the degree of anomaly of a text; systematizing the analytical tasks of text processing; and establishing criteria for the effectiveness of intelligent text analysis methods to facilitate the selection of suitable technologies.

