Classification of Phishing Email Using Word Embedding and Machine Learning Techniques

Somesha, M.; Pais, A.R.

dc.contributor.author	Somesha, M.
dc.contributor.author	Pais, A.R.
dc.date.accessioned	2026-02-04T12:28:27Z
dc.date.issued	2022
dc.description.abstract	Email phishing is a cyber-attack that causes substantial financial damage to corporate and commercial organizations. A phishing email is a special type of spam used to trick a user into disclosing personal information that gives the attacker access to the user's digital assets. A phishing attack is generally triggered by emailing links to spoofed websites that collect sensitive information. The APWG survey suggests that existing countermeasures remain ineffective and insufficient for detecting phishing attacks. Hence there is a need for an efficient mechanism to detect phishing emails and provide better security against such attacks for the common user. The existing open-source data sets are limited in diversity and therefore do not capture the real picture of the attack, so a real-time input data set is needed to design accurate email anti-phishing solutions. In the current work, a real-time in-house corpus of phishing and legitimate emails is created, and efficient techniques are proposed to detect phishing emails using word embedding and machine learning algorithms. The proposed system uses only four email header-based heuristics for the classification of emails. The proposed word embedding cum machine learning framework comprises six word embedding techniques with five machine learning classifiers to evaluate the best performing combination. Among all the combinations evaluated, Random Forest consistently performed the best: with FastText (CBOW) it achieved an accuracy of 99.50% with a false positive rate of 0.053%, with TF-IDF an accuracy of 99.39% with a false positive rate of 0.4%, and with Count Vectorizer an accuracy of 99.18% with a false positive rate of 0.98%, respectively, for the three datasets used. © 2022 River Publishers.
dc.identifier.citation	Journal of Cyber Security and Mobility, 2022, 11, 3, pp. 279-320
dc.identifier.issn	22451439
dc.identifier.uri	https://doi.org/10.13052/jcsm2245-1439.1131
dc.identifier.uri	https://idr.nitk.ac.in/handle/123456789/22758
dc.publisher	River Publishers
dc.subject	Computer crime
dc.subject	Cybersecurity
dc.subject	Decision trees
dc.subject	Deep learning
dc.subject	Electronic mail
dc.subject	Open systems
dc.subject	Sensitive data
dc.subject	Email phishing detection
dc.subject	Embeddings
dc.subject	False positive rates
dc.subject	Fasttext
dc.subject	Machine-learning
dc.subject	Phishing
dc.subject	Phishing detections
dc.subject	TF-IDF
dc.subject	Word embedding
dc.subject	Word2vec
dc.title	Classification of Phishing Email Using Word Embedding and Machine Learning Techniques
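
The abstract reports results as accuracy and false positive rate. As a reading aid, the sketch below shows how these two figures are conventionally computed from a classifier's predictions. It is not taken from the paper: the names ConfusionCounts, confusion_counts, accuracy and false_positive_rate are hypothetical, and treating "phishing" as the positive class is an assumption, so a false positive here means a legitimate email flagged as phishing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary classifier (hypothetical helper, not from the paper)."""
    tp: int  # phishing emails flagged as phishing
    fp: int  # legitimate emails flagged as phishing
    tn: int  # legitimate emails passed as legitimate
    fn: int  # phishing emails passed as legitimate


def confusion_counts(y_true, y_pred, positive="phishing"):
    """Tally the four outcomes; `positive` is assumed to be the phishing label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def accuracy(c):
    """Fraction of all emails classified correctly: (TP + TN) / total."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def false_positive_rate(c):
    """Fraction of legitimate emails wrongly flagged: FP / (FP + TN)."""
    negatives = c.fp + c.tn
    return c.fp / negatives if negatives else 0.0
```

For example, with six emails of which one legitimate message is wrongly flagged and one phishing message is missed, confusion_counts gives TP=1, FP=1, TN=3, FN=1, so accuracy is 4/6 and the false positive rate is 1/4. Multiplying either value by 100 gives the percentage form used in the abstract.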

Collections

Journal Articles