Kumar, S.; Raju, S.; Bhowmik, B.
2026-02-03
2025
International Journal of Imaging Systems and Technology, 2025, 35, 6, pp. -
8999457
https://doi.org/10.1002/ima.70231
https://idr.nitk.ac.in/handle/123456789/20005
Accurate classification of viral DNA sequences is essential for tracking mutations, understanding viral evolution, and enabling timely public health responses. Traditional alignment-based methods are often computationally intensive and less effective for highly mutating viruses. This article presents a machine learning framework for classifying DNA sequences of COVID-19 variants using K-mer-based tokenization and vectorization techniques inspired by Natural Language Processing (NLP). DNA sequences corresponding to Alpha, Beta, Gamma, and Omicron variants are obtained from the Global Initiative on Sharing All Influenza Data (GISAID) database and encoded into feature vectors. Multiple classifiers, including Extra Trees, Random Forest, Support Vector Classifier (SVC), Decision Tree, Logistic Regression, Naive Bayes, K-Nearest Neighbor (KNN), Ridge Classifier, Stochastic Gradient Descent (SGD), and XGBoost, are evaluated based on accuracy, precision, recall, and F1-score. The Extra Trees model achieved the highest accuracy of 93.10% ± 0.42, followed by Random Forest with 92.60% ± 0.38, both demonstrating robust and balanced performance. Statistical significance tests confirmed the robustness of the results. The results validate the effectiveness of K-mer-based encoding combined with traditional machine learning models in classifying COVID-19 variants, offering a scalable and efficient solution for genomic surveillance.
© 2025 Wiley Periodicals LLC.
Classification (of information); Classifiers; COVID-19; Decision trees; DNA; DNA sequences; Encoding (symbols); Gene encoding; Gradient methods; Learning systems; Logistic regression; Natural language processing systems; Nearest neighbor search; Public health; Random forests; Signal encoding; Support vector regression; Viruses; DNA Sequencing; Encodings; Extra-trees; K-mer encoding; Language processing; Learning frameworks; Machine-learning; Natural language processing; Natural languages; Viral genotype differentiation; Stochastic systems
Machine Learning Framework for Classification of COVID-19 Variants Using K-mer Based DNA Sequencing
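
The K-mer tokenization and vectorization step named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not state the value of k, the vectorizer, or any preprocessing, so k = 3, plain count vectors, the helper names (`kmer_tokens`, `build_vocabulary`, `vectorize`), and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def kmer_tokens(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers, treated as "words"."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(sequences, k=3):
    """Map every k-mer seen in the corpus to a column index (sorted for determinism)."""
    vocab = sorted({tok for s in sequences for tok in kmer_tokens(s, k)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def vectorize(sequence, vocabulary, k=3):
    """Build a bag-of-k-mers count vector; k-mers outside the vocabulary are ignored."""
    counts = Counter(kmer_tokens(sequence, k))
    vec = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vec[vocabulary[tok]] = n
    return vec


# Toy sequences invented for illustration only (not GISAID data).
toy = ["ATGCGA", "ATGCTT"]
vocab = build_vocabulary(toy, k=3)          # hypothetical choice of k
features = [vectorize(s, vocab, k=3) for s in toy]
```

Under these assumptions, each row of `features` is a fixed-length numeric vector that could be passed, together with variant labels, to any of the classifiers listed in the abstract (for example an Extra Trees model).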