Automatic Classification of Genetic Mutations Based on Machine Learning Methods

UDC 519.687:004.912

  • Olga N. Polovikova Altai State University, Barnaul, Russia Email: ponOlgap@gmail.com
  • Anastasiia S. Manicheva Altai State University, Barnaul, Russia Email: manichevaas@mc.asu.ru
  • Vyacheslav V. Shiryaev IT Sphere LLC, Barnaul, Russia Email: asmuddi628@gmail.com
Keywords: genetic mutations, machine learning methods, classification, text encoding, tokenization, vectorization, training quality metrics, logarithmic loss function, selection of model hyperparameters

Abstract

This paper addresses the problem of identifying the type of genetic mutation in a cancer tumor after its genome has been sequenced. This is a multi-class classification problem. The paper proposes an approach that identifies mutation classes from their text descriptions using supervised machine learning methods. The study uses a cancer data set obtained by analyzing genome mutations in tumor cells. Each record includes the gene, its mutation, a text description of the mutation, and the mutation class (nine classes in total). The paper analyzes and justifies the text preprocessing and vectorization methods best suited to preparing the source data as input to the machine learning methods. It considers several text classifiers based on k-nearest neighbors, decision trees, Bayesian classifiers, and logistic regression. The classification performance metrics obtained over many simulation runs indicate that logistic regression, which has the lowest error rate, is the best of these methods for identifying mutation classes.
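
The keywords name the logarithmic loss function as a training quality metric. As an illustration only, the sketch below computes the standard multi-class logarithmic loss in plain Python. It is not taken from the paper: the function name, the zero-based class indices, the clipping constant, and the sample probabilities are all assumptions made for this example.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability that a model assigns to the true class.

    true_labels: zero-based class indices (an assumption of this sketch).
    predicted_probs: one list of class probabilities per sample.
    """
    if len(true_labels) != len(predicted_probs):
        raise ValueError("labels and predictions must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a predicted probability of exactly 0 or 1 stays finite.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)


NUM_CLASSES = 9  # the data set described above has nine mutation classes

# Hypothetical true classes for three samples.
labels = [0, 4, 8]

# Baseline: a model that spreads probability evenly over all nine classes.
uniform = [1.0 / NUM_CLASSES] * NUM_CLASSES
baseline_loss = multiclass_log_loss(labels, [uniform] * len(labels))


def confident_row(true_class, p_true=0.92):
    """Hypothetical prediction: most of the mass on one class, rest shared."""
    rest = (1.0 - p_true) / (NUM_CLASSES - 1)
    return [p_true if k == true_class else rest for k in range(NUM_CLASSES)]


# A model that is confident and correct on every sample.
confident_loss = multiclass_log_loss(labels, [confident_row(c) for c in labels])

print(f"uniform baseline: {baseline_loss:.4f}")
print(f"confident model:  {confident_loss:.4f}")
```

For nine classes the uniform baseline equals ln 9, so a classifier is informative under this metric only if its loss falls below that value; lower is better.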

Author Biographies

Olga N. Polovikova, Altai State University, Barnaul, Russia

Candidate of Sciences in Physics and Mathematics, Associate Professor, Associate Professor of the Department of Informatics

Anastasiia S. Manicheva, Altai State University, Barnaul, Russia

Candidate of Sciences in Technology, Associate Professor, Associate Professor of the Department of Theoretical Cybernetics and Applied Mathematics

Vyacheslav V. Shiryaev, IT Sphere LLC, Barnaul, Russia

Programmer of the Department of Data Visualization Systems Development

Published
2024-04-05
How to Cite
Polovikova O. N., Manicheva A. S., Shiryaev V. V. Automatic Classification of Genetic Mutations Based on Machine Learning Methods // Izvestiya of Altai State University, 2024, № 1(135). P. 126-131. DOI: 10.14258/izvasu(2024)1-18. URL: http://izvestiya.asu.ru/article/view/%282024%291-18.