We will use Weka to train a Naïve Bayes classifier for the purposes of spam detection.
The data set we will consider is Spambase, a collection of tagged e-mails from a single e-mail account. Read through the description provided with the data to get a feel for what you're dealing with. The dataset is available on D2L.
Some simple preprocessing of the data will be required before it is ready for use. We can do this in Weka:
1. Load the data set into the Weka Explorer (Preprocess tab). A full list of the attributes in this data set will appear in the “Attributes” frame.
2. Delete the capital_run_length_average, capital_run_length_longest, and capital_run_length_total attributes by checking the boxes to their left and hitting the Remove button.
3. The remaining attributes represent relative frequencies of various important words and characters in the e-mails. We wish to convert these to Boolean values instead: 1 if the word or character is present in the e-mail, 0 if not. To do this, click the Choose button in the Filter frame at the top of the window and pick filters > unsupervised > attribute > NumericToBinary, then hit the Apply button. All the numeric frequency attributes are now converted to Booleans, so each e-mail is represented by a 54-dimensional binary vector indicating whether or not each particular word or character occurs in it. This is the so-called bag-of-words representation (clearly a very crude assumption, since it does not take the order of the words into account).
4. Save this preprocessed data set for future use using the Save… button.
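Outside Weka, the NumericToBinary step above can be sketched in a few lines of Python. This is only an illustration: the small array below is a toy stand-in for the frequency attributes, not the real Spambase data.

```python
import numpy as np

# Toy stand-in for a few rows of frequency attributes (one row per e-mail).
# In the real data these would be the word/character frequency columns
# left after removing the capital_run_length_* attributes.
freq = np.array([
    [0.00, 0.64, 0.21],
    [0.32, 0.00, 0.00],
])

# NumericToBinary: any nonzero frequency becomes 1 (word present), else 0.
binary = (freq > 0).astype(int)
print(binary.tolist())  # [[0, 1, 1], [1, 0, 0]]
```

Each row of `binary` is the bag-of-words vector for one e-mail.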
Given the data set we've just loaded, we wish to train a Naïve Bayes classifier to distinguish spam from regular e-mail by fitting, for each class, a distribution over the occurrence of each word in the spam and non-spam e-mails. Under the Classify tab:
1. Select Choose in the Classifier frame at the top and select classifiers > bayes > NaiveBayes.
2. (25 Points) Leave the default settings and hit Start to build the classifier. Study the output produced, most importantly the percentages of correctly and incorrectly classified instances. You will probably notice that your classifier does rather well despite making a very strong (conditional independence) assumption about the form of the data.
o Can you come up with a reason for the good performance? What would be the main practical problems we would face if we did not make this assumption for this particular dataset?
o How long did your classifier take to train and classify? Given this, how scalable do you think the Naïve Bayes classifier is to large datasets? Can you come up with a good reason for this?
3. (25 Points) Examine the classifier model produced by Weka (printed above the performance summary). Find the prior probabilities for each class.
o How does Naïve Bayes compute the probability of an e-mail belonging to a class (spam/not spam)?
o Compute the conditional probability of observing the word “3d” given that an e-mail is spam, P(3d|spam), and given that it is non-spam, P(3d|non-spam). To do this, use the counts of the built model shown in the Classifier output pane under the Classify tab. The general format of Weka's count output is shown below. (Note: this is a toy example; you will need to examine your own Weka output to find the true counts for the word “3d”.)
                 Class
                0     1
     0          1     2
     1          3     4
     total      4     6
o This means that 4 instances (e-mails) of Class 1 (e.g. spam) contain that particular attribute value (e.g. the word “3d”), while 2 instances of Class 1 do not contain it. Similarly, 3 instances of Class 0 (e.g. not spam) contain that attribute value, while 1 instance of Class 0 does not. The totals give the number of instances belonging to each class, i.e. the number of e-mails that are spam and the number that are not.
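The arithmetic on the toy counts above (not the real “3d” counts) can be written out directly. A Python sketch, assuming class 1 is spam as in the explanation:

```python
# Toy counts from the table: rows are attribute values (0 = word absent,
# 1 = word present), columns are classes (0 = not spam, 1 = spam).
counts = {
    (0, 0): 1, (0, 1): 2,   # word absent  in class 0 / class 1
    (1, 0): 3, (1, 1): 4,   # word present in class 0 / class 1
}
total = {0: 4, 1: 6}        # instances per class

# Conditional probability of seeing the word given each class:
p_word_given_spam = counts[(1, 1)] / total[1]      # 4/6
p_word_given_nonspam = counts[(1, 0)] / total[0]   # 3/4

# Naive Bayes scores a class by multiplying its prior by the
# per-attribute conditionals (the independence assumption); with a
# single attribute, the unnormalized scores are:
prior = {0: total[0] / 10, 1: total[1] / 10}
score_spam = prior[1] * p_word_given_spam
score_nonspam = prior[0] * p_word_given_nonspam

print(f"P(word|spam) = {p_word_given_spam:.3f}")
print(f"P(word|non-spam) = {p_word_given_nonspam:.3f}")
print(f"unnormalized scores: spam = {score_spam:.3f}, non-spam = {score_nonspam:.3f}")
```

Repeat the same computation with the actual “3d” counts from your own Classifier output.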
4. (50 Points) Run Naïve Bayes, Decision Tree, and Logistic Regression on the data set using 5-fold cross-validation, 10-fold cross-validation, and a 75% training split. For each classifier, compare the performance obtained under the different evaluation settings. Use Microsoft Excel to draw a 2-D column graph of the accuracy of each model in the different settings (5-fold, 10-fold, and 75% training split). Give explanations for your observations above.
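For reference, an analogous comparison can be sketched outside Weka with scikit-learn. This is only an illustration of the evaluation settings: the synthetic binary data below merely stands in for the preprocessed Spambase set, and the three classifier choices are scikit-learn counterparts, not Weka's exact implementations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the preprocessed (binarized) Spambase data.
X, y = make_classification(n_samples=400, n_features=54, random_state=0)
X = (X > 0).astype(int)

for name, clf in [("NaiveBayes", BernoulliNB()),
                  ("DecisionTree", DecisionTreeClassifier(random_state=0)),
                  ("LogisticRegression", LogisticRegression(max_iter=1000))]:
    acc5 = cross_val_score(clf, X, y, cv=5).mean()     # 5-fold CV
    acc10 = cross_val_score(clf, X, y, cv=10).mean()   # 10-fold CV
    Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=0.75,
                                          random_state=0)  # 75% split
    acc_split = clf.fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: 5-fold={acc5:.3f}  10-fold={acc10:.3f}  75%split={acc_split:.3f}")
```

The three accuracies per classifier are exactly the numbers you would chart in the Excel column graph.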
Deliverable:
• Your report including the screenshots of your implementation for each section and the results.