Support Vector Machines Analysis of Web Pages Binary Classification Quality
Abstract
This paper analyzes the quality of binary classification of web pages by the support vector machine (SVM) method. The classification is needed to detect web pages containing text whose distribution is prohibited in the Russian Federation. Results are reported for three document collections: “drug dealing”, “extremism”, and “terrorism”. The collections were assembled by specialists working with an information system for searching and analyzing information on the Internet. For each collection, we describe the class proportions of the training and test samples, the distribution by type of Internet resource, and several problems that complicate classification or classifier training. We also describe how a document's vector is formed. We then present test results for different kernel functions and analyze the classification errors. Precision, recall, and the F1 score serve as quality measures. The support vector machines are implemented with the machine learning library scikit-learn. Finally, we make assumptions about the classification quality.
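As an illustration of the quality measures named above, the following minimal sketch computes precision, recall, and the F1 score for the positive class of a binary classifier. It is not the authors' code: the function name `binary_quality` and the label lists are hypothetical, and the convention that 1 marks a page with prohibited content is an assumption made for the example.

```python
def binary_quality(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class.

    y_true and y_pred are equal-length sequences of class labels.
    Each measure is defined as 0.0 when its denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: positive pages that were predicted positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: other pages that were predicted positive.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: positive pages that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels (assumption): 1 = prohibited content, 0 = other page.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = binary_quality(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In practice the same three values can be obtained from `sklearn.metrics.precision_recall_fscore_support`; the pure-Python version is shown only to make the definitions explicit.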
DOI 10.14258/izvasu(2017)4-14
Copyright (c) 2017 С.В. Волошин, А.Л. Царегородцев, Е.А. Карташев, В.В. Славский
This work is licensed under a Creative Commons Attribution 4.0 International License.