407916/COMP723 Data Mining and Knowledge Engineering
Assignment 2 – Text Classification (50%)

1 Objective

To develop a broad understanding of text mining by performing a representative task, text classification.

2 Task Specification

This assignment requires you to extend your data mining skills and knowledge from a structured context to an unstructured context, where the items to be classified are “free text” snippets. You are required to use Weka to train a range of classification algorithms on the given dataset, analyse the results and present a report of your findings.

2.1.1 Due Dates and Submission

• This assignment is to be done in pairs. The report should clearly state the name and student ID of both members of the team. Furthermore, the contributions made by each team member must be clearly stated.
• The written part of your assignment is due on 30 October at 6pm.
• You are required to submit two copies of the assignment.
  o A PDF copy via the Turnitin assignment submission tab (on the course homepage) on AutOnline.
  o A second copy via the normal (non-Turnitin) submission tab on AutOnline.

2.1.2 Marking

• This assignment will be marked out of 100 marks and is worth 50% of the overall mark for the paper.
• To pass this module you must pass each assessment separately, and gain at least 50% in total. The minimum pass mark for this assignment is 40%.

3 Assignment Details

• Download the data file from AutOnline under the Assignment 2 folder. The corpus is a single file containing 5574 emails, each classified as SPAM (the positive class) or non-SPAM (the negative class).
• Read through the readme file to understand the data.
• Either programmatically or otherwise, convert the data file into the following 2 arff files.
  o The first file containing 66 percent of the instances, for training.
  o The second file containing 34 percent of the instances, for testing.
• Ensure that the proportions of positive and negative instances are the same in both of the files above.
• Convert the arff files into word vectors by applying the StringToWordVector filter.
• Use NaiveBayes and 3 other classifiers of your choice from Weka to train a model and validate the accuracy of the model on the testing dataset. Record these results and use them as the benchmark for comparison with later runs.
• Now use feature engineering, as discussed in lectures, in an attempt to improve the accuracy of your classification using the same 4 algorithms as above. Your feature engineering can include any one or more of the parameters given in the screen dump below.
• You should make feature property changes in a systematic way and record all your results so that they can be presented in graphical form where applicable.
• Once you have settled on the optimum set of features, reconfigure the training dataset and make it balanced with respect to the number of positive and negative class instances. Now train the classifiers and record the results for the testing dataset.
• Now use 2 attribute selection methods to select a subset of attributes in an attempt to improve the accuracy.

3.1.1 Written Report

• You will write a report of at least 6 and at most 12 pages (excluding the references and appendix) describing the results of your experiment.
• You are required to write a coherent report describing all aspects of the experiment as an attempt to get the best possible accuracy for the classification task. Any screenshots or large result outputs that do not directly contribute to your argument should be included in the appendix rather than in the main report.
• You are not required to have a table of contents or executive summary for this report.
• There is no fixed format for the report. You can format it like an academic paper containing the usual sections such as Abstract, Introduction, Data Description, Results, Discussion, Conclusion and a bibliography.
• As a minimum, your report should contain a discussion of the following points.
  1. Presentation and discussion of the results obtained. You should use the correct evaluation metrics in your discussion.
  2. The rationale used for feature engineering decisions and for tuning any parameter values of the classifiers used.
  3. A discussion of the results for the imbalanced data, including the effect of imbalance on the classifiers and the methods used to deal with this imbalance.
  4. A comparison of the test results from the classifiers trained on the balanced dataset with those trained on the imbalanced dataset.
  5. A brief discussion of applications of text classification.
  6. The difference between using generic machine learning algorithms on structured data, as you did for the first assignment, and on unstructured text, as you did for this assignment.
  7. A discussion of the similarities and differences between the machine learning algorithms that you have used, as applicable to text classification.
  8. A reflection on what you learnt from this assignment and what you would do differently if you were to do the assignment again.
• The following approximate matrix will be used to grade your assignment.

Written Report                                                                 Weight
Formatting, Language and Presentation                                          10%
Discussion to demonstrate an understanding of the experimental tasks           20%
Explanation of the rationale used for various tasks                            20%
Discussion of the results                                                      25%
Discussion to demonstrate an overall understanding of text classification      25%
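For teams that choose the programmatic route for the conversion step in section 3, the following is a minimal sketch of a 66/34 split that keeps the class proportions the same in both files, followed by a simple ARFF writer. It rests on assumptions that are not stated in this brief: the data file is taken to hold one instance per line as a label, a tab, then the text, and the labels are taken to be spam and ham (check the readme for the actual layout). The function names, the attribute name msg_class and the simplified quoting rules are illustrative choices only.

```python
import random
from collections import defaultdict


def read_rows(path):
    """Read (label, text) pairs, ASSUMING one 'label<TAB>text' instance per line."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            label, _, text = line.rstrip("\n").partition("\t")
            if text:
                rows.append((label, text))
    return rows


def stratified_split(rows, train_fraction=0.66, seed=42):
    """Split rows so that each class keeps the same proportion in both parts."""
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


def arff_quote(text):
    """Quote a string value for ARFF (simplified: backslash, quote, tab, newline)."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace("'", "\\'")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
    )
    return "'" + escaped + "'"


def to_arff(rows, relation="spam_corpus", labels=("spam", "ham")):
    """Render rows as ARFF text with one string attribute and one nominal class."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        # Hypothetical name, chosen so it is unlikely to clash with a word
        # attribute produced later by StringToWordVector.
        "@attribute msg_class {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for label, text in rows:
        lines.append(f"{arff_quote(text)},{label}")
    return "\n".join(lines) + "\n"
```

A typical use would be to call read_rows on the downloaded file, pass the result to stratified_split, and write to_arff(train) and to_arff(test) to two .arff files. Fixing the seed keeps the split reproducible, so that later runs are compared against the same benchmark.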
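The balancing step in section 3 and report points 1, 3 and 4 can be illustrated with a short sketch. Random undersampling of the majority class is used here purely as one possible way to balance the training set; the brief does not prescribe a method, and oversampling or Weka's own filters are equally valid choices. The function names and the label spam for the positive class are assumptions made for illustration.

```python
import random
from collections import defaultdict


def undersample(rows, seed=42):
    """Balance (label, text) rows by sampling every class down to the smallest one."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


def binary_metrics(actual, predicted, positive="spam"):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Only the training set should be passed to undersample; the testing set keeps its original class proportions so that results remain comparable across runs. The metrics function shows why accuracy alone can mislead on imbalanced data: a classifier that labels every instance non-SPAM on a test set with 10 SPAM and 90 non-SPAM instances scores 0.9 accuracy but 0.0 recall for the positive class.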
