Machine Learning Framework for Classification of COVID-19 Variants Using K-mer Based DNA Sequencing

dc.contributor.author: Kumar, S.
dc.contributor.author: Raju, S.
dc.contributor.author: Bhowmik, B.
dc.date.accessioned: 2026-02-03T13:19:17Z
dc.date.issued: 2025
dc.description.abstract: Accurate classification of viral DNA sequences is essential for tracking mutations, understanding viral evolution, and enabling timely public health responses. Traditional alignment-based methods are often computationally intensive and less effective for rapidly mutating viruses. This article presents a machine learning framework for classifying DNA sequences of COVID-19 variants using K-mer-based tokenization and vectorization techniques inspired by Natural Language Processing (NLP). DNA sequences corresponding to the Alpha, Beta, Gamma, and Omicron variants are obtained from the Global Initiative on Sharing All Influenza Data (GISAID) database and encoded into feature vectors. Multiple classifiers, including Extra Trees, Random Forest, Support Vector Classifier (SVC), Decision Tree, Logistic Regression, Naive Bayes, K-Nearest Neighbor (KNN), Ridge Classifier, Stochastic Gradient Descent (SGD), and XGBoost, are evaluated on accuracy, precision, recall, and F1-score. The Extra Trees model achieved the highest accuracy of 93.10% ± 0.42, followed by Random Forest with 92.60% ± 0.38, both demonstrating robust and balanced performance. Statistical significance tests confirmed the robustness of the results. The results validate the effectiveness of K-mer-based encoding combined with traditional machine learning models in classifying COVID-19 variants, offering a scalable and efficient solution for genomic surveillance. © 2025 Wiley Periodicals LLC.
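The K-mer tokenization and vectorization step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kmer_tokens` and `kmer_count_vector`, the choice k = 3, the stride of 1, and the toy sequence are all assumptions made for illustration, since the abstract does not state them.

```python
from collections import Counter
from itertools import product


def kmer_tokens(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1).

    k=3 is an assumed value; the abstract does not give the k that was used.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_count_vector(sequence, k=3, alphabet="ACGT"):
    """Return k-mer counts in a fixed lexicographic order over the alphabet."""
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(kmer_tokens(sequence, k))
    # k-mers containing symbols outside the alphabet (e.g. N) are ignored.
    return [counts[kmer] for kmer in vocabulary]


toy_sequence = "ATGCATGC"  # hypothetical toy input, not GISAID data
tokens = kmer_tokens(toy_sequence, k=3)
vector = kmer_count_vector(toy_sequence, k=3)
print(tokens)
print(len(vector), sum(vector))
```

Each sequence becomes a fixed-length vector of 4^k counts, analogous to a bag-of-words representation in NLP. Vectors of this kind could then be passed to any of the classifiers listed in the abstract, for example scikit-learn's `ExtraTreesClassifier`.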
dc.identifier.citation: International Journal of Imaging Systems and Technology, 2025, 35, 6, pp. -
dc.identifier.issn: 0899-9457
dc.identifier.uri: https://doi.org/10.1002/ima.70231
dc.identifier.uri: https://idr.nitk.ac.in/handle/123456789/20005
dc.publisher: John Wiley and Sons Inc
dc.subject: Classification (of information)
dc.subject: Classifiers
dc.subject: COVID-19
dc.subject: Decision trees
dc.subject: DNA
dc.subject: DNA sequences
dc.subject: Encoding (symbols)
dc.subject: Gene encoding
dc.subject: Gradient methods
dc.subject: Learning systems
dc.subject: Logistic regression
dc.subject: Natural language processing systems
dc.subject: Nearest neighbor search
dc.subject: Public health
dc.subject: Random forests
dc.subject: Signal encoding
dc.subject: Support vector regression
dc.subject: Viruses
dc.subject: DNA Sequencing
dc.subject: Encodings
dc.subject: Extra-trees
dc.subject: K-mer encoding
dc.subject: Language processing
dc.subject: Learning frameworks
dc.subject: Machine-learning
dc.subject: Natural language processing
dc.subject: Natural languages
dc.subject: Viral genotype differentiation
dc.subject: Stochastic systems
dc.title: Machine Learning Framework for Classification of COVID-19 Variants Using K-mer Based DNA Sequencing
