Plot summaries file: https://we.tl/ZHTV57CuZG
Data Mining Project
Project description
Certain business problems cannot be addressed using the analytical techniques that you have used thus far in the course. Quite often you also need to deal with unstructured data, which requires processing text using web and text mining methods.
In this project, you will work with web or text mining techniques, including Natural Language Processing (NLP). In text mining, the primary goal is generally to extract useful information and put it into a structured format for exploratory analysis and, as necessary, for building statistical and machine learning models. Important elements in natural language-based text mining include terms, concepts, tokenization, stop words, synonyms, homonyms, corpus, word frequency, stemming, n-grams, etc. In most applications of text mining, the typical steps include creating a corpus, performing data transformations (replace, remove…), creating a term-document matrix, creating a word cloud, plotting term frequencies, and sometimes fitting a model using a machine learning algorithm for classification or regression tasks. Although you are not expected to perform model fitting for this project, you are encouraged to explore on your own by fitting a classifier to the NLP-prepared data using some of the machine learning algorithms covered in this class.
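The typical steps listed above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions: the three documents, the short stop-word list, and the helper names `tokenize` and `bigrams` are all hypothetical and are not part of the assignment. Stemming and the word cloud are left out, and a real project would more likely rely on a dedicated text mining package than on hand-written helpers.

```r
# Toy corpus: three short, made-up documents (hypothetical data for illustration).
docs <- c(
  doc1 = "The detective follows the killer through the city.",
  doc2 = "A young detective falls in love in the city!",
  doc3 = "Two friends travel across the country, and the friends fall in love."
)

# A small hand-picked stop-word list (assumption; real lists are much longer).
stop_words <- c("the", "a", "in", "and", "through", "across", "two")

# Data transformation and tokenization: lower-case the text, replace anything
# that is not a letter or a space, split on spaces, and drop stop words.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z ]", " ", text)
  tokens <- strsplit(text, " +")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}

tokens_by_doc <- lapply(docs, tokenize)

# Term-document matrix: one row per term, one column per document,
# each cell holding the number of times the term occurs in the document.
vocab <- sort(unique(unlist(tokens_by_doc)))
tdm <- vapply(
  tokens_by_doc,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
rownames(tdm) <- vocab

# Overall term frequencies across the corpus, most frequent first.
term_freq <- sort(rowSums(tdm), decreasing = TRUE)
print(head(term_freq, 5))

# Plot the most frequent terms when running interactively (e.g., in RStudio).
if (interactive()) {
  barplot(head(term_freq, 5), las = 2, main = "Most frequent terms")
}

# Bigrams (n-grams with n = 2) built from adjacent tokens. Note that they are
# formed after stop-word removal, so they skip over the removed words.
bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}
print(bigrams(tokens_by_doc$doc1))
```

Building the matrix by hand keeps each step visible (transformation, tokenization, counting), which can help when describing the preprocessing in the Analysis section of the report. For a large dataset, a sparse matrix representation would be more appropriate than the dense matrix used here.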
• Locate a large dataset that relates to the domain that interests the team.
• Import the dataset into RStudio. Prepare and analyze the data.
• Prepare a formal data mining study report that is described in the next section.
Deliverables
Prepare an analysis report that addresses the following critical areas:
1. Introduction: give some background and context about the domain of application, provide the rationale for the type of analysis, and state the objective clearly.
2. Analysis: describe the data both qualitatively and quantitatively through exploratory analysis, perform necessary preprocessing activities, give some intuition about the algorithm and core parameters, demonstrate the model building steps along with parameter tuning, and explain all your assumptions.
3. Result: explain the result and interpret the model output using terms that reflect the application area, perform model evaluation using the appropriate metrics, and leverage visualization.
4. Conclusion: summarize your main findings, discuss experimental limitations related to the data and/or the implementation of the algorithm, and suggest areas for improvement as potential future work.
• Follow appropriate APA formatting and provide all references.
• Include your R script and extended model outputs in an Appendix section.
• The length of the final report should be 8 pages excluding the title page, appendix and R script.