Questions:
- Logistic Regression. Logistic regression treats a binary classification (e.g. is/is-not a dog) probabilistically, where the probability for any example x to be assigned the class y = 1 is given in terms the
logistic sigmoid function g and a weight vector w that must be learned:
g(w · x) = exp (w · x)
1 + exp (w · x)
p(1 | x, w) = g(w · x)
p(0 | x, w) = 1 − g(w · x)
(a) (10 points) Let g(w · x) = q
From the definitions above, prove that w · x is a logit or “log-odds” function when 0 < q < 1
w · x = ln q
1 − q
(b) (10 points) Just as p(1|w, x) = g(w · x) is the probability for a positive classification conditioned
on a set of weights, it is the likelihood of weights w given a classification y = 1. Show that the
gradient (in parameter space – derivatives should be taken with respect to the components of w) of
this log likelihood function reduces to
∇ log g(w · x) = x(1 − g(w · x))
which is a vector quantity parallel to x. - (30 points) Naive Bayes. Use Naive Bayes to estimate whether a student will be an honor student
(H) or normal student (N) in college based on their high school performance. Each instance has two
features: the student’s high school GPA (a real number) and whether or not the student took any AP
courses (a Boolean value, yes=1, no=0). Based on the following training data, create (by hand or with
computational tools) a Naive Bayes prediction rule using normal (Gaussian) distributions to estimate
the conditional probability density of high school GPAs given honors status (H or N) (this assigns nonzero probability to negative or greater-than-four GPA values, but that is fine for our purposes) and a
Bernoulli distribution for the AP feature.
label AP GPA
H yes 4.0
H yes 3.7
H no 2.5
N no 3.8
N yes 3.3
N yes 3.0
N no 3.0
N no 2.7
N no 2.2
Recall that Naive Bayes makes the simplifying assumption that the features are conditionally independent
given a class / label:
p(gpa, ap | honors) = p(gpa | honors)p(ap | honors).
Use maximum likelihood estimation (not the unbiased or Laplace estimates) for the distributions of the
two features conditioned on the two classes. Give the mean and variance for each distribution over GPA
values. For the variance here, you only need to calculate the biased sample variance estimator (divided
by n), not the unbiased one (divided by n − 1).
Describe your prediction rule in the following form:
If AP courses are taken, predict H if the GPA is in a ⊂ R
if AP courses are not taken, predict H if the GPA is in b ⊂ R
where a and b must be found.
Hint: It is probably easier to get this description if you take logarithms. 3 digits of precision should
sufficient. Also, the logarithm of the Gaussian densities are quadratic, possibly yielding two distrinct
zeros that correspond to the boundaries of the intervals a or b. - Nearest Neighbors. Assume that examples are drawn uniformly from the unit square. Independent
of example features (i.e., location (x1, x2) in the unit square), labels are generated at random such that
a proportion q where 0.5 < q < 1 are assigned label y = 1 and the remainder are assigned label y = 0.
The Bayes-optimal hypothesis will minimize the error rate of its predictions given (x1, x2) by always
predicting y = 1 and suffering an error rate of 1 − q.
How should we expect the 3-Nearest-Neighbors algorithm perform? Assume that the algorithm is trained
on a large set of known labels. For each new sample, the algorithm finds the three closest (according to
any metric!) points in its training set, finds the label shared by the majority of these three, and assigns
this label to the new example.
(a) (10 points) What is the expected error rate of the 3-Nearest-Neighbor algorithm, in terms of q?
(b) (10 points) When is this better or worse than the Naive Bayes solution (recall, 0.5 < q < 1)? - Decision Tree. Sammy-the slug owns a car dealership that sells two types of cars: Honda(H) and
BMW(B). He collected the following data of his customers that records their Gender, Annual income
and the type of car they purchased:
Gender AI (Annual Income in thousands) Preference
F 100 B
M 150 B
F 80 B
M 75 B
F 90 H
F 85 H
M 85 H
F 70 H
Page 2
This question is about building decision trees to predict the car type that a new customer will prefer.
Note that one of the attributes, Annual Income (AI), is a continuous variable. Lets assume that we will
only allow binary splits for this attribute of the form AI < a and AI ≥ a, where a lies in the dataset.
However, there can be multiple such splits in one path from root to leaf.
For all your calculations, use log base 2.
(a) (3 points) For this part of the question, lets assume that for some unknown reason, Sammy insists
on keeping the “Annual Income” at the root node. How many possible values of a does he need to
consider? What are they?
(b) (3 points) What is the entropy of labels (car type) in the training dataset?
(c) (9 points) What is the optimal root node for this dataset? Show your calculations.
(d) (15 points) Draw the Decision Tree that would be learned by ID3 on this dataset. Only binary splits
are allowed i.e. a node can have only two children. Your tree should contain all the information
necessary to read it. At each node, indicate: - the attribute you are splitting on (when splitting on AI, also include the a used);
- the label distribution of the instances at that node before the split (e.g. if there are 3 instances
at a node and 2 belong to H and 1 belongs to B class, then label the node as h2, 1i). - Additionally, for each non-leaf node indicate the gain attained by the corresponding split, and
label each leaf node by its class-label (H or B). - All edges between a parent and a child should be labeled with the value of the attribute.
- It is okay to draw the tree by hand and include a clear picture in your pdf.
- Don’t forget to include your calculations (6 points here).