Machine Learning Framework for Classification of COVID-19 Variants Using K-mer Based DNA Sequencing
Date
2025
Authors
Publisher
John Wiley and Sons Inc
Abstract
Accurate classification of viral DNA sequences is essential for tracking mutations, understanding viral evolution, and enabling timely public health responses. Traditional alignment-based methods are often computationally intensive and less effective for highly mutating viruses. This article presents a machine learning framework for classifying DNA sequences of COVID-19 variants using K-mer-based tokenization and vectorization techniques inspired by Natural Language Processing (NLP). DNA sequences corresponding to Alpha, Beta, Gamma, and Omicron variants are obtained from the Global Initiative on Sharing All Influenza Data (GISAID) database and encoded into feature vectors. Multiple classifiers, including Extra Trees, Random Forest, Support Vector Classifier (SVC), Decision Tree, Logistic Regression, Naive Bayes, K-Nearest Neighbor (KNN), Ridge Classifier, Stochastic Gradient Descent (SGD), and XGBoost, are evaluated based on accuracy, precision, recall, and F1-score. The Extra Trees model achieved the highest accuracy of 93.10% ± 0.42, followed by Random Forest with 92.60% ± 0.38, both demonstrating robust and balanced performance. Statistical significance tests confirmed the robustness of the results. The results validate the effectiveness of K-mer-based encoding combined with traditional machine learning models in classifying COVID-19 variants, offering a scalable and efficient solution for genomic surveillance. © 2025 Wiley Periodicals LLC.
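To illustrate the K-mer tokenization and vectorization idea summarized in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the value of k, the vocabulary, or the vectorizer used, so the choice k = 3, the A/C/G/T alphabet, the function names `kmer_tokens` and `kmer_count_vector`, and the toy sequence are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers ("words")."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_count_vector(sequence: str, k: int = 3, alphabet: str = "ACGT") -> list[int]:
    """Count every possible k-mer over a fixed vocabulary of len(alphabet)**k entries."""
    # Fixed, ordered vocabulary so every sequence maps to a vector of equal length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(kmer_tokens(sequence, k))
    # K-mers containing characters outside the alphabet (e.g. N) are ignored.
    return [counts.get(kmer, 0) for kmer in vocabulary]


# Made-up fragment for illustration; this is not GISAID data.
toy_sequence = "ATGCGATGCA"
tokens = kmer_tokens(toy_sequence, k=3)
vector = kmer_count_vector(toy_sequence, k=3)

print(tokens)       # the overlapping 3-mers of the toy fragment
print(len(vector))  # 4**3 = 64 features per sequence
```

Under these assumptions, each sequence becomes a fixed-length count vector, analogous to a bag-of-words representation in NLP, which could then be passed to any of the classifiers named in the abstract.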
Keywords
Classification (of information), Classifiers, COVID-19, Decision trees, DNA, DNA sequences, Encoding (symbols), Gene encoding, Gradient methods, Learning systems, Logistic regression, Natural language processing systems, Nearest neighbor search, Public health, Random forests, Signal encoding, Support vector regression, Viruses, DNA Sequencing, Encodings, Extra-trees, K-mer encoding, Language processing, Learning frameworks, Machine-learning, Natural language processing, Natural languages, Viral genotype differentiation, Stochastic systems
Citation
International Journal of Imaging Systems and Technology, 2025, 35, 6, pp. -
