Evaluation of Text Materials Classification Quality Using the "Random Forests" Machine Learning Algorithm
Abstract
The paper presents the results of evaluating the quality of text materials classification by the "random forests" machine learning algorithm as implemented in the "scikit-learn" library. The functions of the "scikit-learn" library that are used, and the parameters that affect classification quality, are described. The main stages of text materials classification are shown: forming the sets of materials for training and control (ensuring sample representativeness, processing the text, defining the training and control groups); training the classifier model; testing the classifier model; and evaluating the quality of the results obtained. Quality is evaluated using the classifier's precision, recall, and F-measure for several data preparation options: balanced and unbalanced training groups of materials, with the latter case also involving conversion of the text into a set of tokens. Based on the results, the main directions for improving the quality of text materials classification by the "random forests" machine learning algorithm are determined.
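The stages listed in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function name `evaluate_random_forest`, the choice of `CountVectorizer` for tokenization, and the values of `test_size` and `n_estimators` are all assumptions made for illustration, since the abstract does not state the settings actually used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus (balanced: 6 texts per class); not the paper's data.
texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the goalkeeper",
    "fans cheered at the stadium",
    "the match ended in a draw",
    "the player signed for a new club",
    "the bank raised interest rates",
    "the stock market fell sharply",
    "investors bought government bonds",
    "the company reported higher profit",
    "inflation slowed the economy",
    "the currency lost value against the dollar",
]
labels = ["sport"] * 6 + ["finance"] * 6


def evaluate_random_forest(texts, labels, test_size=0.33, n_estimators=100):
    """Train a random forest on token counts and return per-class metrics."""
    classes = sorted(set(labels))

    # Stage 1: form training and control groups; stratifying keeps the
    # class proportions the same in both groups.
    train_texts, test_texts, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=0
    )

    # Text processing: convert each text into a vector of token counts.
    # The vocabulary is learned from the training group only.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    # Stage 2: train the classifier model.
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    clf.fit(X_train, y_train)

    # Stage 3: test the model on the control group.
    y_pred = clf.predict(X_test)

    # Stage 4: evaluate quality with precision, recall, and F-measure.
    precision, recall, f1, support = precision_recall_fscore_support(
        y_test, y_pred, labels=classes, zero_division=0
    )
    return {
        cls: {
            "precision": float(p),
            "recall": float(r),
            "f1": float(f),
            "support": int(s),
        }
        for cls, p, r, f, s in zip(classes, precision, recall, f1, support)
    }


if __name__ == "__main__":
    for cls, metrics in evaluate_random_forest(texts, labels).items():
        print(cls, metrics)
```

To explore the unbalanced case in the same sketch, one could drop some texts from one class before the split; `RandomForestClassifier` also accepts `class_weight="balanced"`, although whether the paper used that option is not stated in the abstract.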
DOI 10.14258/izvasu(2017)4-13
Copyright (c) 2017 И.С. Веретенников, Е.А. Карташев, А.Л. Царегородцев
This work is licensed under a Creative Commons Attribution 4.0 International License.
Izvestiya of Altai State University is a golden publisher, as we allow self-archiving, but most importantly we are fully transparent about your rights.