COMPARISON OF MACHINE LEARNING TECHNIQUES IN SPAM E-MAIL CLASSIFICATION

E-mail remains a very popular and efficient communication tool. Due to its misuse, however, managing e-mail has become a problem for organizations and individuals. Spam, also known as unwanted messages, is one example of misuse. Specifically, spam is defined as the arrival of unwelcome bulk e-mail not requested by recipients. This paper compares different machine learning techniques for the classification of spam e-mails. Random Forest (RF), C4.5 and Artificial Neural Network (ANN) were tested to determine which method provides the best results in spam e-mail classification. Our results show that RF is the best technique on the dataset from HP Labs, indicating that ensemble methods may have an edge in spam detection.


INTRODUCTION
E-mail still represents a common and effective communication tool which is unfortunately susceptible to misuse. The most popular example of misuse is spam, also known as unwanted messages. More precisely, spam is defined as the receiving of unwanted bulk commercial messages not demanded by receivers. Spam should not be mistaken for non-commercial solicitations, such as those of a political or religious tone, even if unwelcome. Recent studies show that the most popular spamming practice on the internet was still e-mail by a huge margin (Youn & McLeod, 2007; Gaikwad & Halkarnikar, 2014).
Spammers collect e-mail addresses from websites, chatrooms, customer lists and viruses. In the last few years, spam e-mails have grown into a serious security threat and act as a very effective phishing agent for sensitive data. Furthermore, malicious software is carried to numerous users by spam. A typical user can receive 10-50 spam e-mails daily, and 13 billion unwanted commercial e-mails (which makes around 50% of all e-mail sent) are sent each day (Grant, 2003; Gaikwad & Halkarnikar, 2014). Every e-mail user in America received an average of 2200 spam e-mails in 2002. In 2007 this reached 3600 spam e-mails owing to a growth rate of 2% per month. A survey revealed that a Chinese [...] spam e-mails weekly. Due to spam e-mails, enterprises lose up to 9 billion yearly (CNNIC, 2004). Studies reveal that spam e-mails take about 60% of the incoming mails in a corporate network. With inappropriate or no countermeasures, the situation will worsen and, in the end, spam e-mails may destroy the usability of e-mail systems. Many countries are slowly starting to use anti-spam legal measures (Gaikwad & Halkarnikar, 2014).
The main argument explaining the increase in spam is the fact that spammers bear no costs for it: "Because email technology allows spammers to shift the costs almost entirely to third parties, there is no incentive for the spammers to reduce the volume" (Hann, Hui, Lai, Lee, & Png, 2006). Another issue with spam is the annoying content it carries. Moreover, a significant amount of spam contains offensive materials (Maria & Ng, 2009). In China, some specialists suggest executing effective anti-spam e-mail measures as early as possible. However, because of the Internet's open architecture, these legal measures have had only limited effect so far. For that reason, additional effective methodologies are needed. Currently, the majority of systems stop spam messages by banning frequent spammers (Gaikwad & Halkarnikar, 2014; Chuan, Xian-liang, Xu, & Meng-shu, 2005).
Because of this growing problem, automated approaches for discriminating between junk and legitimate e-mails are becoming a necessity (Sahami, Dumaisy, Heckerman, & Horvitz, 1998). A huge number of documents, a relatively large number of features and unstructured information are the main challenges for automated detection of spam e-mail. As usage increases, all of these factors may degrade performance in terms of both speed and quality. Many recent algorithms therefore use only the most significant features for classification. Since a huge number of features makes most documents indistinguishable, the applicability of existing classification techniques to such datasets is limited. Classification algorithms such as Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naïve Bayesian (NB) classifiers have been applied to different datasets and currently show good classification results (Gaikwad & Halkarnikar, 2014; Youn & McLeod, 2007).
This paper describes the detection of spam messages using various machine learning methods. Random Forest, C4.5 and ANN methods were compared based on different performance evaluation criteria. The organization of the paper is as follows. Section 2 presents background work on detection of spam email, whereas Section 3 describes the Spam dataset and ML techniques applied. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

LITERATURE OVERVIEW
A number of early studies have taken advantage of probabilistic Naïve Bayes theory in spam detection. Deshpande et al. suggested an anti-spam filtering method based on Naïve Bayes. The filters were trained on a large number of non-spam and spam e-mails and tested on unseen incoming e-mail messages. The authors conclude that safety measures are required before a Bayesian anti-spam filter is practically usable, but that it can act as a first-pass filter (Deshpande, Erbacher, & Harris, 2007). Obied suggested a data mining paradigm grounded in Bayesian analysis for filtering spam. The algorithm learns patterns related to legitimate and spam messages and then classifies new e-mail as either legitimate or spam (binary classification). The author demonstrated the capability of the filter to detect spam with high accuracy (Obied, 2007). Sahami and three Microsoft researchers tested techniques for automatically creating filters that remove unwelcome mail by employing probabilistic learning methods. They show that superior results are obtained once domain-specific features and the text of e-mail messages are considered together (Sahami, Dumaisy, Heckerman, & Horvitz, 1998). Another approach to automatic e-mail classification, which applies Bayes' theorem to the textual contents of messages, is presented in (Vira, Raja, & Gada, 2012). An enhanced Bayesian anti-spam mail filter is presented in (Chuan, Xian-liang, Xu, & Meng-shu, 2005). The improvement in overall performance is achieved because features are extracted based on word entropy, and vector weights are characterized by word frequency.
A decision-tree-based ensemble learning paradigm for spam e-mail detection is suggested in (SHI, WANG, MA, WENG, & QIAO, 2012). A public spam e-mail dataset was used to evaluate the performance of several machine learning methods. The suggested ensemble learning technique proved to be mostly superior to the benchmark methods. Ozarkar and Patwardhan applied Random Forest and PART decision trees to discriminate between legitimate and spam messages in a public spam database. Different attribute extraction techniques were implemented. Although this pre-processing step decreased training times, it did not bring a substantial improvement in accuracy. Nevertheless, the Random Forest ensemble outperformed the other benchmark methods (Ozarkar & Patwardhan, 2013). Abu-Nimeh et al. compared six data mining methods for phishing detection. The authors produced a dataset containing 1718 non-phishing and 1171 phishing e-mails, where each e-mail was characterized by 43 attributes. 10-fold cross-validation was used to evaluate the classifiers' performance. Random Forest was again superior to all other algorithms, with an overall accuracy of 92.28%. The worst performers were Support Vector Machines and Neural Networks. However, one of the disadvantages of Random Forest was its high rate of false positives (Abu-Nimeh, Nappa, Wang, & Nair, 2007).
In this paper, we confirm the superiority of Random Forest ensemble learning over single methods, as in (SHI, WANG, MA, WENG, & QIAO, 2012) and (Ozarkar & Patwardhan, 2013). In addition, our work is among the few that include and evaluate ANN for spam detection. Unlike (Abu-Nimeh, Nappa, Wang, & Nair, 2007), our study identified no disadvantages of Random Forest when compared to the other benchmark methods.

Dataset
The e-mail database was acquired from the UCI Machine Learning Repository (UCI, 2015). HP Labs created and donated the dataset in July 1999. The spam messages were collected from individuals and a postmaster who had filed spam, whereas the legitimate messages came from filed work and personal e-mails. The Spam database contains a total of 4601 messages, of which 1813 (39.4%) are labeled as spam. Every e-mail message is represented as a feature vector of 57 real numbers. The majority of them (48) represent frequencies of certain words. The next 6 features store frequencies of certain characters in the e-mail. Statistics on capital letters constitute the remaining 3 features, which hold the average, longest and total length of uninterrupted sequences of capital letters, respectively (Zhao, 2004). The names of all 57 features can be found at (UCI, 2015).
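As an illustration of this feature layout, the following minimal Python sketch splits one record into its three feature groups and the class label. It assumes the comma-separated format documented by the repository (48 word frequencies, 6 character frequencies, 3 capital-run statistics, then a 0/1 label with 1 meaning spam). The function name `parse_row` and the example record are hypothetical and are not part of the study.

```python
# Assumed column layout: 48 word-frequency features, 6 character-frequency
# features, 3 capital-run-length features, then the class label (1 = spam).
N_WORD, N_CHAR, N_CAPS = 48, 6, 3
N_FEATURES = N_WORD + N_CHAR + N_CAPS  # 57 features in total


def parse_row(line):
    """Split one comma-separated record into feature groups and a label."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != N_FEATURES + 1:
        raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(values)}")
    features, label = values[:N_FEATURES], int(values[-1])
    return {
        "word_freq": features[:N_WORD],
        "char_freq": features[N_WORD:N_WORD + N_CHAR],
        "capital_runs": features[N_WORD + N_CHAR:],
        "is_spam": label == 1,
    }


# Synthetic record (not taken from the real dataset), only to exercise the parser.
example = ",".join(["0.0"] * 48 + ["0.1"] * 6 + ["3.5", "12", "40", "1"])
row = parse_row(example)
```

Keeping the three groups separate makes it easy to check that a loaded file really has the expected 57 features before any classifier is trained.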

Random Forest (RF)
Random Forest (RF), proposed by Breiman (Breiman, 2001), is a fast, highly accurate and noise-resistant classification method. RF combines bagging with random feature selection. Every tree in the forest depends on the values of a random vector that is sampled independently and with the same distribution for all trees in the forest (Breiman, 2001). RF consists of a large number of decision trees, where each decision tree selects its splitting features from a bootstrap training set S_i, with i denoting the i-th internal node.
Trees in RF are grown by means of the Classification and Regression Tree (CART) method with no pruning. As the number of trees in the forest becomes large, the generalization error converges to a limit rather than continuing to grow (Breiman, 2001). More details about RF can be found in (Breiman, 2001).
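The combination of bagging, random feature selection and majority voting can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the WEKA implementation used in this study: for brevity each "tree" is a one-level stump chosen by Gini impurity instead of a fully grown, unpruned CART tree, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels, default=0):
    """Most common label, or a default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) among feature_ids with the lowest weighted Gini."""
    fallback = majority(y)
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left, fallback), majority(right, fallback))
    return best[1:]  # (feature, threshold, left_label, right_label)


def fit_forest(X, y, n_trees=25, n_sub_features=1, seed=0):
    """Train each stump on a bootstrap sample using a random subset of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]               # bagging
        feats = rng.sample(range(len(X[0])), n_sub_features)   # random features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Class label by majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Toy data (made up): label 1 stands for spam, 0 for legitimate mail.
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
```

Individual stumps trained on an unlucky bootstrap sample may be poor, but the vote over many of them is stable, which is the intuition behind the ensemble.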

C4.5
The C4.5 algorithm uses the same basic inductive tree construction approach as ID3, but extends it to the classification of continuous data by grouping the discrete values of an attribute into subsets or ranges. Another advantage of C4.5 is that it can predict values for data with missing attributes based on knowledge of the relevant domain (Dunham, 2003). C4.5 also provides a way to prune, or reduce the size of, the tree with no significant decrease in accuracy. Pruning takes two forms (Dunham, 2003): subtree replacement and subtree raising. In the former, a subtree is replaced with a leaf node; in the latter, a subtree is replaced with its most frequently used subtree (Browne & Berry, 2006).
In both cases, replacement is acceptable only when the original tree undergoes minimal distortion as a result of pruning. In situations where tree pruning does not sufficiently reduce the complexity of the decision tree (DT) structure, C4.5 generates decision rules based on the decisions associated with a path, which is defined as a set of branches connecting two nodes (Browne & Berry, 2006).
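To make the handling of continuous attributes more concrete, the sketch below scores binary threshold splits with the gain ratio, the split criterion commonly associated with C4.5 (the criterion itself is not spelled out in the description above, so it is stated here as an assumption). The function names and the toy values are hypothetical, and real implementations add further refinements such as missing-value handling.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split values <= threshold vs. values > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - w_left * entropy(left) - w_right * entropy(right)
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info


def best_threshold(values, labels):
    """Try each observed value as a cut point and keep the best gain ratio."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split anything
    if not candidates:
        return None
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


# Toy attribute (made up): a frequency-like feature that separates the classes.
values = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
cut = best_threshold(values, labels)
```

Dividing the information gain by the split information penalizes very unbalanced splits, which is why the middle cut point wins over a split that isolates a single message.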

ANN
An ANN can be defined as a highly connected array of elementary processors called neurons. A widely used model called the multi-layered perceptron (MLP) is shown in Figure 1. The MLP consists of one input layer, one or more hidden layers and one output layer. Every layer contains several neurons, and every neuron in a layer is connected to the neurons in the adjacent layer with different weights. The attributes (or features) flow into the input layer, pass through the hidden layers, and produce an output at the output layer. Except for the input layer, every neuron receives signals from the neurons of the previous layer, linearly weighted by the interconnect values between neurons. The neuron then produces its output by passing the summed signal through a sigmoid or another type of activation function (Park, El-Sharkawi, Marks II, Atlas, & Damborg, 1991; Sobajic & Pao, 1989).
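The forward pass described above can be sketched as follows. This is a minimal illustration assuming sigmoid activations throughout and an additive bias term per neuron (the bias is an assumption, as it is not mentioned above). The network shape and all weights are arbitrary, hypothetical values rather than the configuration used in the experiments.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate the inputs through a list of (weights, biases) layers."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                     # output layer
]
output = mlp_forward([1.0, 1.0], layers)
```

For spam classification the input list would hold the 57 feature values, and the single output could be thresholded to decide between spam and legitimate mail.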

RESULTS AND DISCUSSION
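The most commonly used approach for algorithm comparison is overall classification performance, which is usually not focused on a particular class (Sokolova, Japkowicz, & Szpakowicz, 2006). For example, accuracy provides no separation among the true labels of different classes (it only evaluates the general performance of the algorithm):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP and TN are the numbers of correctly classified spam and legitimate messages, FP is the number of legitimate messages misclassified as spam, and FN is the number of spam messages misclassified as legitimate. In spam e-mail classification, the Sensitivity shows how good the algorithm is in detecting spam messages, whereas the Specificity is a measure of recognition of legitimate e-mail:
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$
In other words, they both evaluate the probability of each label being correct.
There are three more measures that differentiate properly classified samples within various classes: precision, recall, and F-measure. Precision relates the correctly classified positive samples to all samples classified as positive, including those misclassified as positives. Recall coincides with sensitivity, and F-measure combines the two; for $\beta = 1$ it reduces to their harmonic mean (Sokolova, Japkowicz, & Szpakowicz, 2006):
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F = \frac{(\beta^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$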
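The measures discussed above can all be derived from the four confusion-matrix counts, as the following sketch shows. It treats spam as the positive class and computes the balanced F-measure (the harmonic mean of precision and recall). The function name and the counts are made up for illustration and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts.

    tp/fn: spam messages classified as spam / as legitimate.
    tn/fp: legitimate messages classified as legitimate / as spam.
    Assumes every denominator is non-zero.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Made-up counts for illustration only; these are not results from the paper.
m = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these counts the accuracy is 0.925 even though sensitivity (0.90) and specificity (0.95) differ, which illustrates why accuracy alone does not separate the performance on the two classes.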
On the other hand, ROC can provide a more comprehensive estimation of a classifier's effectiveness:
$$\text{ROC} = \frac{P(x \mid P)}{P(x \mid N)}$$
where $P(x \mid C)$ denotes the likelihood that a sample $x$ belongs to the class C. In other words, ROC represents a function of the classifier's sensitivity and specificity values. Table 1 presents the performance assessment of three machine learning techniques tested on the Spam database: C4.5 decision tree, ANN, and Random Forest (RF). For every algorithm, ROC area and F-measure can be observed for each class and averaged. More importantly, Table 1 shows the detection accuracy values for spam (Sensitivity) and non-spam (Specificity) messages together with the average detection accuracy of the algorithms. All these measures have been obtained by employing the 10-fold cross-validation (CV) approach. The dataset is randomly divided into 10 mutually exclusive folds (subsets) of approximately equal size. Nine (9) folds are used for training and the remaining one (1) fold is used for testing, and the process is repeated 10 times so that each fold serves as the test set once. The average of the accuracies of all iterations is then reported in Table 1.
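The 10-fold procedure can be sketched as below. This is a simplified illustration under stated assumptions: folds are formed by plain random shuffling (no stratification by class), and the classifier is a hypothetical placeholder that always predicts the majority training label, standing in for C4.5, ANN or RF. All names and the toy data are made up.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices and deal them into k folds whose sizes differ by at most one."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Average test accuracy over k rounds, each holding out one fold."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        hits = sum(predict(model, X[j]) == y[j] for j in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Placeholder classifier (hypothetical): always predicts the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, row):
    return model


# Toy data (made up): 14 legitimate (0) and 6 spam (1) messages.
X = [[i] for i in range(20)]
y = [0] * 14 + [1] * 6
mean_accuracy = cross_validate(X, y, fit_majority, predict_majority)
folds_4601 = k_fold_indices(4601)
```

For a dataset of 4601 messages this split yields nine folds of 460 messages and one fold of 461, so every message is used for testing exactly once.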
All three classifiers have been implemented and tested in the WEKA software package (Holmes, Donkin, & Witten, 1994) using default parameters. C4.5 has been used with the pruning option disabled. The same holds true for the Random Forest, where 100 trees were generated and the class label was assigned by majority vote. In ANN, the number of input nodes is equal to the number of features used, namely 57. The other parameters of ANN are provided in Table 2.
Figure 2 is a graphical representation of the accuracy values for all classes and all algorithms from Table 1. The observations and conclusion drawn in the previous paragraph are now even more evident.

CONCLUSION
E-mail spam detection has received enormous attention from researchers, as it serves to identify unwanted information and potentially dangerous activity. Hence, most researchers focus on finding the best classifier for recognizing spam messages. This paper described different ML techniques for the classification of spam messages, among which RF proved to be the best one. The advantage of RF is that it runs efficiently on large datasets with a high number of samples and attributes, which makes it very attractive for text classification. During testing, different performance measures (ROC area, F-measure, and Accuracy) were taken into consideration. The proposed approach achieves an average accuracy of 95.56% in spam detection using RF. Future work will include a comparison of ensemble methods in e-mail spam detection.