Automatic Classification of Genetic Mutations Based on Machine Learning Methods
UDC 519.687:004.912
Abstract
This paper considers the problem of identifying the type of genetic mutation in a cancer tumor after its genome has been sequenced. This is a multi-class classification problem. The paper proposes an approach that identifies the mutation class from the text description of the mutation using supervised machine learning methods. The study uses a cancer data set obtained by analyzing genome mutations in tumor cells. Each record includes the gene, its mutation, a text description of the mutation, and the mutation class (nine classes in total). The paper analyzes and justifies the choice of text preprocessing and vectorization methods used to prepare the source data as input for the machine learning methods. Several text classifiers are built, based on k-nearest neighbors, decision trees, Bayesian classifiers, and logistic regression. The classification performance metrics obtained over many simulation runs show that logistic regression identifies mutation classes best, with the lowest error rate among the methods compared.
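The pipeline outlined in the abstract (preprocess the text, vectorize it, then classify it) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the vectorization method, so TF-IDF with smoothed inverse document frequency is used here as one common choice; k-nearest neighbors with cosine similarity stands in for the classifiers the paper compares; and the training sentences, the two class labels, and the function names are hypothetical toy values, not data or code from the study.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: (description, class label). The real data set has
# nine classes and much longer descriptions; two classes are enough here.
TRAIN = [
    ("truncating mutation leads to loss of protein function", 1),
    ("frameshift deletion causes loss of function of the tumor suppressor", 1),
    ("nonsense mutation results in loss of function", 1),
    ("missense substitution increases kinase activity gain of function", 2),
    ("activating mutation leads to constitutive kinase signaling", 2),
    ("amplification results in gain of function and increased signaling", 2),
]

train_docs = [tokenize(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_docs)
train_vecs = [tfidf(doc, idf) for doc in train_docs]


def classify(text, k=3):
    """Vectorize a new description and predict its class label."""
    return knn_predict(tfidf(tokenize(text), idf), train_vecs, train_labels, k)


if __name__ == "__main__":
    print(classify("deletion results in loss of tumor suppressor function"))
    print(classify("substitution causes constitutive kinase activity"))
```

In practice a library implementation of the vectorizer and classifiers would replace these hand-written functions; the sketch only shows how a text description becomes a numeric vector that a classifier can compare against labeled examples.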
Copyright (c) 2024 Ольга Николаевна Половикова, Анастасия Станиславовна Маничева, Вячеслав Вячеславович Ширяев
This work is licensed under a Creative Commons Attribution 4.0 International License.