Question 1:

The dataset ‘Credit data.xlsx’ contains data on 10,000 borrowers and whether they subsequently experienced serious delinquency (see variable ‘SeriousDlqin2yrs’). Assume the lender now wishes to use this data to build a credit scoring model that predicts serious delinquency based on the other variables. The dataset contains the following variables:

 

1.1 Carefully pre-process the dataset by considering the following activities:

• Exploratory data analysis.

• Missing value handling (if any), including a suitable analysis of missing values and justification of the chosen method.

• Outlier detection and treatment (if any), with appropriate analysis/justification.

• Binning the variables (if deemed useful).

• Coding the variable bins using Weights of Evidence.

• Splitting the dataset into a training and test set.
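The Weights of Evidence coding step can be illustrated with a short sketch. The assignment itself calls for SAS Enterprise Miner, so the Python below is only meant to show the arithmetic. It assumes the common convention WoE = ln(share of goods / share of bads), where a "bad" is a borrower with SeriousDlqin2yrs = 1; the function name, bin labels, and toy data are hypothetical and do not come from the dataset.

```python
import math
from collections import Counter

def woe_and_iv(bins, targets):
    """Return ({bin: WoE}, information value) for one binned variable.

    bins    -- bin label for each borrower (hypothetical labels)
    targets -- 1 if the borrower became seriously delinquent ("bad"), else 0
    Assumed convention: WoE = ln(share of goods / share of bads).
    """
    goods = Counter(b for b, t in zip(bins, targets) if t == 0)
    bads = Counter(b for b, t in zip(bins, targets) if t == 1)
    total_good = sum(goods.values())
    total_bad = sum(bads.values())
    woe, iv = {}, 0.0
    for b in sorted(set(bins)):
        # Each bin needs at least one good and one bad,
        # otherwise the logarithm is undefined.
        p_good = goods[b] / total_good
        p_bad = bads[b] / total_bad
        woe[b] = math.log(p_good / p_bad)
        iv += (p_good - p_bad) * woe[b]
    return woe, iv

# Toy data, invented for illustration only.
bins = ["low"] * 4 + ["high"] * 4
targets = [0, 0, 0, 1, 0, 1, 1, 1]
woe, iv = woe_and_iv(bins, targets)
```

Under this convention a positive WoE marks a bin with relatively more non-delinquent borrowers, and the information value summarises how strongly the binned variable separates the two groups. To avoid leakage, the WoE values would normally be estimated on the training set only and then applied to the test set.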

1.2 Build an intuitive and predictive scorecard using a logistic regression classifier and report the following:

• The most important variables

• The impact of the variables on the target

• The performance of the model. Use various performance metrics and discuss their relationship if any.
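As one possible way to see how some common performance metrics relate, the sketch below computes the AUC, the Gini coefficient, and the Kolmogorov-Smirnov (KS) statistic from the same set of scores. It is plain Python for illustration only (the assignment requires SAS Enterprise Miner), it assumes the score is a predicted probability of delinquency, and the function names and toy data are hypothetical.

```python
def auc(scores, targets):
    """Chance that a random bad scores higher than a random good (ties count 0.5)."""
    bad = [s for s, t in zip(scores, targets) if t == 1]
    good = [s for s, t in zip(scores, targets) if t == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))

def ks_statistic(scores, targets):
    """Largest gap between the cumulative score distributions of goods and bads."""
    bad = [s for s, t in zip(scores, targets) if t == 1]
    good = [s for s, t in zip(scores, targets) if t == 0]
    best = 0.0
    for cutoff in sorted(set(scores)):
        cdf_bad = sum(s <= cutoff for s in bad) / len(bad)
        cdf_good = sum(s <= cutoff for s in good) / len(good)
        best = max(best, abs(cdf_good - cdf_bad))
    return best

# Toy predicted probabilities of delinquency, invented for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
targets = [0, 0, 0, 1, 0, 1, 1, 1]

auc_value = auc(scores, targets)
gini = 2 * auc_value - 1  # Gini is a linear rescaling of the AUC
ks = ks_statistic(scores, targets)
```

The Gini coefficient carries no information beyond the AUC, since it is just 2·AUC − 1, whereas KS measures separation at a single best cutoff rather than across all cutoffs. Threshold-dependent metrics such as accuracy or sensitivity are a different family and depend on the chosen cutoff and on the class imbalance in the data.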

Compare this scorecard with the result of a Random Forest model run over the same data. Discuss your results. Why do banks often use Logistic Regression as their classifier? What do banks gain and lose by doing this?

In terms of software, you are expected to use SAS Enterprise Miner. Carefully report the various steps of your methodology and discuss your results in a rigorous way!

NOTE: It is unlikely that different students will come up with the exact same parameter estimates. Special consideration will be given to submissions whose estimates are identical.

Question 2:

Find an academic paper published in 2020 or later (based on online or print publication date) discussing a real-life application of data mining or credit scoring. It is important

 
