SPAM EMAIL DETECTION SCHEME BASED ON RANDOM FOREST ALGORITHM
Keywords:
Email, Spam Email Attacks, Detection Accuracy, Ensemble Algorithm
Abstract
Email is used for communication across many sectors of the economy, including education, health, business, manufacturing, and agriculture. People with malicious intent have been using email accounts to launch various spam email attacks. Spam email refers to unsolicited bulk email: the practice of frequently sending large volumes of unwanted messages with commercial content to an indiscriminate set of recipients. Spam emails expose users to problems such as wasted time, high consumption of computing resources, and theft of valuable information. Machine learning approaches are widely accepted to be better than traditional approaches for identifying spam emails. For this reason, several machine learning techniques have been proposed in the literature for classifying spam emails. This paper proposes a Random Forest-based scheme for email spam detection. A fairly large spam email dataset named Spambase was collected from the UCI Machine Learning Repository. The dataset was pre-processed using feature encoding. Promising features were then selected using a feature importance technique, which yielded a 12-feature subset chosen according to the feature scores. The resulting Random Forest (RF) spam email detection model achieved 99.65% accuracy, 99.21% precision, 99.46% recall, and an F1-score of 99.33%. The study concluded that the RF-based spam email detection model performed better than some of the approaches in similar studies.
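The pipeline summarized in the abstract (importance-based feature selection, followed by a Random Forest classifier scored on accuracy, precision, recall, and F1) can be sketched roughly as below. This is an illustrative sketch using scikit-learn, not the authors' implementation: the abstract does not state the train/test split, the number of trees, or any other hyperparameters, so the values here are assumptions. The function names `select_top_features` and `train_and_evaluate` are hypothetical, and the data is a synthetic stand-in generated with 57 columns to mirror the Spambase feature count, so the printed scores will not match the reported results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def select_top_features(X, y, k=12, seed=0):
    """Rank features by Random Forest importance; return indices of the top k."""
    ranker = RandomForestClassifier(n_estimators=100, random_state=seed)
    ranker.fit(X, y)
    # argsort is ascending, so reverse it to get the most important first.
    return np.argsort(ranker.feature_importances_)[::-1][:k]


def train_and_evaluate(X, y, k=12, seed=0):
    """Select k features on the training split, then fit and score an RF model."""
    # Assumed 80/20 stratified split; the abstract does not specify one.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Feature scores are computed on training data only to avoid leakage.
    top = select_top_features(X_train, y_train, k=k, seed=seed)

    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train[:, top], y_train)
    y_pred = model.predict(X_test[:, top])

    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    return top, metrics


# Synthetic stand-in data (assumption): 57 features, label 1 plays the role of spam.
X, y = make_classification(
    n_samples=1000, n_features=57, n_informative=12, random_state=0
)
top, metrics = train_and_evaluate(X, y, k=12)
print("Selected feature indices:", sorted(top.tolist()))
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

To try this on the real data, `X` and `y` would be replaced by the Spambase feature matrix and labels loaded from the UCI repository. Ranking features on the training split only is a design choice made here to keep the test scores honest; the abstract does not say at which stage the feature scores were computed.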