Exploratory text analysis

Q1 Perform:

a. Text extraction & creating a corpus
b. Text pre-processing
c. Create the DTM & TDM from the corpus
d. Exploratory text analysis
e. Feature extraction by removing sparsity
f. Build the classification models and compare Logistic Regression to Random Forest
Reference: https://medium.com/analytics-vidhya/customer-review-analytics-using-text-mining-cd1e17d6ee4e
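
Steps a to e of Q1 can be sketched with the tm package as follows. This is a minimal illustration only: the four toy reviews and the 0.6 sparsity threshold are assumptions chosen for the example, not values taken from the assignment.

```r
library(tm)

# Hypothetical toy reviews standing in for a real data set
reviews <- c("The food was great and the service was great",
             "Terrible food, would not come back",
             "Great place, loved the food",
             "The service was slow and the food was cold")

# a. Create a corpus from the raw text
corpus <- VCorpus(VectorSource(reviews))

# b. Pre-process: lower-case, strip punctuation and stop words, tidy whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# c. Document-term matrix (documents in rows) and term-document matrix (terms in rows)
dtm <- DocumentTermMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)

# d. A simple exploratory summary: total count of each term, most frequent first
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# e. Keep only terms whose sparsity is below 0.6 (assumed threshold)
dtm_dense <- removeSparseTerms(dtm, 0.6)

# Feature table that a classifier could be trained on
features <- as.data.frame(as.matrix(dtm_dense))
```

With a real data set the sparsity threshold is usually much closer to 1 (for example 0.99), because most terms appear in only a small share of the documents.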

Q2 – Analyze the customer reviews in the file Restaurant_Reviews.tsv

a. Explain each step of the following text clean-up commands:

corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

Uncomment in order to see the impact:

as.character(corpus[[841]])
as.character(corpus[[1]])

b. What is the classification question?

c. Use the confusion matrix (CM) for the Random Forest classifier to calculate:

TP = # True Positives,
TN = # True Negatives,
FP = # False Positives,
FN = # False Negatives,

Accuracy = (TP + TN) / (TP + TN + FP + FN)

d. Apply the logistic regression classifier to the problem and recalculate "Q2c", i.e. TP, TN, FP, FN, Accuracy

e. Apply the SVM classifier to the same question and recalculate "Q2c"
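
The TP, TN, FP, FN and accuracy quantities asked for in Q2c can be read off a confusion matrix as in the base R sketch below. The label vectors are made up for illustration, and the coding 1 = positive review, 0 = negative review is an assumption about the data rather than something stated in the question.

```r
# Hypothetical actual and predicted labels (assumed: 1 = positive, 0 = negative)
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0), levels = c(0, 1))

# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- table(actual = actual, predicted = predicted)

TP <- cm["1", "1"]  # actual positive, predicted positive
TN <- cm["0", "0"]  # actual negative, predicted negative
FP <- cm["0", "1"]  # actual negative, predicted positive
FN <- cm["1", "0"]  # actual positive, predicted negative

accuracy <- (TP + TN) / (TP + TN + FP + FN)
```

The same four lines apply unchanged to the predictions of any classifier (Random Forest, logistic regression, SVM), as long as the table is built with actual classes in rows and predicted classes in columns.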

Q3: Study the quanteda toolkit for R

Q3a: Compare quanteda to alternative R packages for quantitative text analysis (tm, tidytext, corpus, and koRpus)

Q3b: Run install.packages("quanteda") and then library(quanteda), and explain the different features of the quanteda package for text analysis

Q4 Spam Text Message Classification – Use the quanteda package to perform “spam” classification on the text message file in Q4

The file name: Q4.spam-text-message-classification.zip

a. Create the word cloud for spam and ham messages
b. Apply a Naïve Bayes Classifier and compute TP, TN, FP, FN, Accuracy
c. Use a Logistic Regression Classifier and compute TP, TN, FP, FN, Accuracy
d. Use a Random Forest Classifier and compute TP, TN, FP, FN, Accuracy
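
Part b could be approached roughly as in the sketch below, which uses quanteda together with the companion package quanteda.textmodels. The toy messages, the column names Category and Message, and the fixed train/test split are all assumptions made for illustration; the real file may be laid out differently.

```r
library(quanteda)
library(quanteda.textmodels)

# Hypothetical toy messages; the real file's column names may differ
msgs <- data.frame(
  Category = c("ham", "spam", "ham", "spam", "ham", "spam", "ham", "spam"),
  Message  = c("Are we still meeting for lunch today",
               "Win a free prize now, call this number",
               "I will call you after the meeting",
               "Free entry: claim your cash prize now",
               "See you at lunch tomorrow",
               "Urgent! You have won a free cash prize",
               "Can you call me after lunch",
               "Claim your free prize, call now"),
  stringsAsFactors = FALSE
)

# Corpus -> tokens -> document-feature matrix (dfm lower-cases by default)
corp  <- corpus(msgs, text_field = "Message")
toks  <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
dfmat <- dfm(toks)

# Assumed split: first six messages for training, last two for testing
dfm_train <- dfmat[1:6, ]
dfm_test  <- dfm_match(dfmat[7:8, ], features = featnames(dfm_train))

# Fit Naive Bayes on the training labels and predict the held-out messages
nb   <- textmodel_nb(dfm_train, y = docvars(dfm_train, "Category"))
pred <- predict(nb, newdata = dfm_test)

# Confusion matrix with "spam" treated as the positive class
lv <- c("ham", "spam")
cm <- table(actual    = factor(docvars(dfm_test, "Category"), levels = lv),
            predicted = factor(pred, levels = lv))

TP <- cm["spam", "spam"]; TN <- cm["ham", "ham"]
FP <- cm["ham", "spam"];  FN <- cm["spam", "ham"]
accuracy <- (TP + TN) / sum(cm)
```

On the full data set a random split (for example 80/20) would replace the fixed row indices used here.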

Q5. The State of the Union is an annual address by the President of the United States before a joint session of Congress. In it, the President reviews the previous year and lays out his legislative agenda for the coming year.

This dataset contains the full text of the State of the Union address from 1989 (Reagan) to 2017 (Trump).

a. Topic modeling: Which topics have become more popular over time? Which have become less popular?
b. Sentiment analysis: Are there differences in tone between different Presidents? Between Presidents from different parties?
