Faculty Publications

We propose an Extreme Gradient Boosting framework for classification and regression problems on small datasets sampled from a discrete distribution, i.e. data containing discrete or quantized attributes. The model parameters are iteratively refined from a prior belief for specific use cases using Bayesian optimization. We apply this framework to the detection of fraudulent websites. With properly stated reasoning, we empirically test our methodology on the publicly available, benchmarked UCI Phishing dataset to demonstrate the superior performance of this approach compared to existing methods in the literature. © 2021 IEEE.

Inter-protein interactions are critical in biological pathways. Determining protein–protein interaction (PPI) sites is vital for understanding protein behavior and designing medications. Traditional experimental protocols for pinpointing these sites are time-consuming and costly, making computational approaches an efficient alternative. However, many computational methods fail to resolve the problem of class imbalance in PPI datasets and focus predominantly on local contextual features, ignoring global sequence information. In this work, we address class imbalance in PPI site prediction by applying a series of balancing techniques: selective thinning of the majority class, Tomek Links to remove noisy samples near the class boundary, and random augmentation of the minority class. We then further balance the data using the Synthetic Minority Over-sampling Technique (SMOTE) and Generative Adversarial Networks (GANs), with GANs showing a slight edge in improving data quality and reducing noise. Our approach incorporates four key features: secondary structure, raw protein sequence, Position-Specific Scoring Matrix (PSSM), and Relative Solvent Accessibility (RSA). We use both local contextual and global sequence features to train two models: XGBoost and a Deep Neural Network (DNN). The performance of the models was assessed using accuracy, Matthews correlation coefficient (MCC), precision, recall, and F-score. We compare the impact of balanced versus unbalanced datasets and measure the contribution of global features to model performance. The findings demonstrate that class balancing significantly improves prediction performance. The XGBoost model achieved an accuracy of 0.831 and a precision of 0.417, outperforming the DNN in these metrics. The DNN model attained a higher recall of 0.723 and an F-score of 0.485, indicating its effectiveness in detecting true PPI sites. Both models achieved an MCC of 0.30, supporting the effectiveness of the introduced balancing strategies and the incorporation of global features in robust PPI site prediction. © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2025.
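
To illustrate the Tomek Links step named in the abstract above, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the Euclidean distance, the toy two-dimensional points, and the label convention (0 for the majority non-interface class, 1 for the minority interface class) are all assumptions made for this example. A Tomek Link is a pair of samples from opposite classes that are each other's nearest neighbor; removing the majority-class member of each pair cleans up the class boundary.

```python
from math import dist


def tomek_link_pairs(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek Links.

    A pair is a Tomek Link when the two points are mutual nearest
    neighbors and carry different class labels.
    """
    def nearest(i):
        # Index of the closest other point (ties go to the lowest index).
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority-class member of every Tomek Link."""
    drop = set()
    for pair in tomek_link_pairs(points, labels):
        for k in pair:
            if labels[k] == majority_label:
                drop.add(k)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: label 0 is the majority class, label 1 the minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 0.0)]
labels = [0, 0, 0, 1, 1]

links = tomek_link_pairs(points, labels)
clean_points, clean_labels = remove_majority_in_links(points, labels, 0)
print(links)
print(clean_labels)
```

In this toy example the points at x = 1.0 (majority) and x = 1.1 (minority) are mutual nearest neighbors with different labels, so the majority point is discarded. In practice a library such as imbalanced-learn would typically be used for this step, followed by the oversampling stage the abstract describes.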
