Imbalanced Multi-Class Research Article Classification using Sentence Transformers and Machine Learning Algorithms

Date

2025

Publisher

Association for Computing Machinery, Inc

Abstract

Categorizing scientific articles into specific research fields is challenging given the volume and variety of published literature. Existing classification systems are often limited by their taxonomy or by the models used for classification. This article explores approaches that combine Sentence Transformer embeddings with Machine Learning algorithms to classify articles into 123 predefined classes on a heavily imbalanced dataset. It also evaluates Large Language Models (LLMs) for generating synthetic data, alongside synonym augmentation and SMOTE. The best-performing model, a One-vs-Rest classifier trained on MP-Net sentence embeddings with SMOTE, achieved an accuracy of 77% and outperformed all the other models. © 2024 Copyright held by the owner/author(s).
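
For readers unfamiliar with SMOTE, the oversampling step named in the abstract can be sketched roughly as follows. This is an illustrative, standard-library-only sketch and not the authors' implementation: the helper name `smote_oversample`, the neighbour count `k`, and the toy 2-D vectors standing in for sentence embeddings are all assumptions made for the example. In practice one would more likely use an existing library implementation on real embeddings.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D vectors standing in for sentence embeddings (assumed data).
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.0, 0.3]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]

# Generate enough synthetic minority points to match the majority class size.
new_points = smote_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority class already occupies, unlike plain duplication, which only repeats existing points.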

Keywords

Document classification, Machine Learning, Natural Language Processing, Sentence Transformers

Citation

CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data, 2025, p. 309-310
