- BACKGROUND
The purpose of this assignment is to learn how to perform a preliminary data analysis, a step required every time we start working with a new database and before we apply more advanced data mining techniques.
- QUESTIONS
Question 1 (25 points): Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order) 13, 15, 16, 17, 19, 20, 20, 21, 32, 32, 32, 35, 35, 35, 40, 43, 43, 45, 45, 45, 35, 46, 50, 55, 56, 62, 80.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile–quantile plot different from a quantile plot?
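The measures named in Question 1 can be sketched in Python with the standard `statistics` module. This is only an illustration on a small hypothetical sample, not the assignment data; the helper name `describe` and the choice of the "inclusive" quartile method are assumptions, and other quartile conventions give slightly different Q1 and Q3 values.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    data = sorted(values)
    # Quartile conventions differ between textbooks and tools;
    # the "inclusive" method used here is an assumption of this sketch.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        # multimode returns every most-frequent value, so a bimodal
        # or trimodal sample yields a list with two or three entries.
        "modes": statistics.multimode(data),
        # Midrange is the average of the smallest and largest values.
        "midrange": (data[0] + data[-1]) / 2,
        # Five-number summary: minimum, Q1, median, Q3, maximum.
        "five_number": (data[0], q1, q2, q3, data[-1]),
    }


# Hypothetical sample, chosen only to keep the arithmetic easy to check.
sample = [2, 4, 4, 5, 7, 9, 10]
summary = describe(sample)
print(summary)
```

The length of the `modes` list gives the modality directly: one entry for unimodal data, two for bimodal, and so on.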
Question 2 (25 points): Create a histogram for one numeric variable from any dataset of your choice among the UCI datasets used in class. Provide the script and the figure.
Question 3 (25 points): Suppose that a hospital tested the age and body fat data for 18 randomly selected adults
with the following results:
age 21 23 24 27 39 41 43 49 50
%fat 9.5 42.5 7.8 43.4 31.4 25.9 32.9 27.2 31.2
age 52 54 55 56 57 58 59 60 61
%fat 34.6 26.5 28.8 27.8 30.2 34.1 27.4 41.2 35.7
(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot and a q-q plot based on these two variables.
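Two points in Question 3 are easy to get wrong: the standard deviation has a population form and a sample form, and a q-q plot pairs the quantiles of one variable with the matching quantiles of the other. The sketch below shows both on hypothetical values, not the hospital data. The helper `qq_pairs` is an illustrative name, and it assumes both samples have the same size, so that pairing the sorted values is enough.

```python
import statistics


def qq_pairs(xs, ys):
    """Pair the sorted values of two equal-sized samples for a q-q plot."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # With equal sizes, the i-th smallest x and the i-th smallest y
    # sit at the same quantile, so they form one point of the plot.
    return list(zip(sorted(xs), sorted(ys)))


# Hypothetical values, used only to illustrate the calculations.
a = [2, 4, 4, 4, 5, 5, 7, 9]
b = [1.5, 0.5, 3.0, 2.0, 2.5, 1.0, 4.0, 3.5]

pop_sd = statistics.pstdev(a)    # divides by n
sample_sd = statistics.stdev(a)  # divides by n - 1
pairs = qq_pairs(a, b)

print(pop_sd, sample_sd)
print(pairs)
```

A scatter plot, by contrast, plots each original (x, y) observation without sorting, so it shows the relationship between the variables rather than a comparison of their distributions.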
Question 4 (25 points): Given two objects represented by the tuples (42, 1, 30, 10) and (38, 0, 26, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using h = 3.
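The three distances in Question 4 are instances of one formula: the Minkowski distance of order h is the h-th root of the sum of the absolute coordinate differences raised to the power h, with h = 1 giving the Manhattan distance and h = 2 the Euclidean distance. A minimal sketch, using hypothetical points rather than the tuples above (the function name `minkowski_distance` is an illustrative choice):

```python
def minkowski_distance(x, y, h):
    """Minkowski distance of order h between two equal-length tuples."""
    if len(x) != len(y):
        raise ValueError("both objects must have the same number of attributes")
    if h < 1:
        raise ValueError("h must be at least 1")
    # Sum |x_i - y_i|^h over all attributes, then take the h-th root.
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


# Hypothetical points, chosen so the results are easy to verify by hand.
p = (1, 2, 3)
q = (4, 6, 3)

manhattan = minkowski_distance(p, q, 1)   # 3 + 4 + 0
euclidean = minkowski_distance(p, q, 2)   # sqrt(9 + 16 + 0)
order_three = minkowski_distance(p, q, 3) # cube root of 27 + 64 + 0

print(manhattan, euclidean, order_three)
```

As h grows, the largest single coordinate difference dominates the result, so the distance never increases when h increases.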