1. When reporting the results of the regressions you run for each question, do not paste raw Stata output. Instead, create clearly labeled tables following the format of the example below. You do not need to report the estimated coefficients for time and entity fixed effects; just indicate whether a regression includes them. Also, make sure you use the full descriptive name of each variable in the table and not the variable name in the data (see the example on the next page).
2. Written answers to each part of a question should be provided below that part in this document and not in a separate document. Provide the tables at the very end of the associated question and refer to them (specifying the column you talk about) in your written answer if you need to.
3. When answering a part requires running a regression in Stata, include the Stata code you used in your answer. For example, if I ask you to run a regression to answer a certain question (to interpret your estimates, compare them with a previous part, and so on), provide the Stata code for that regression, along with any preparatory steps related to that part (such as creating variables), and then provide your answer to my question. Here is an example:
gen lwage = log(wage)
gen expersq = exper^2
reg lwage educ exper expersq black south smsa smsa66 reg661-reg668 [aw=weight]
Running the WLS regression, we observe that, except for the coefficient on educ (.075), the coefficients now differ from Card’s results in column 2 of table 2, although the signs of the coefficients are still the same…
4. The instructions above will allow me to evaluate your answers to each part without going back and forth between different documents, meaning that everything I need to see as part of your answer (the estimates, the program you used to produce them, and your explanation) needs to be provided in this document. Anything that is not provided in this document as part of your answer will not be evaluated.
5. You need to send back to me a copy of this document (with your name and student number on the first page) with all your answers following the instructions above. You also need to send me a copy of your do-files (one do-file for each question) and log files (one log file for each question). Given that we have 4 empirical questions, you need to send me 9 files (this document with answers, 4 do-files, 4 log files).
Table example:
Table 1: Question 1
Dependent Variable: Indicator for promotion
                                  (1)         (2)         (3)         (4)         (5)
Visible Minority Canadian Born    -0.198***   -0.019      -0.050      -0.095      -0.025
                                  (0.059)     (0.047)     (0.067)     (0.080)     (0.051)
White Immigrant                   -0.091      -0.066      0.020       0.008       0.096*
                                  (0.064)     (0.063)     (0.087)     (0.070)     (0.055)
Visible Minority Immigrant        -0.091**    -0.052      0.037       0.059       0.012
                                  (0.039)     (0.045)     (0.063)     (0.058)     (0.045)
Bachelor’s degree or higher       -0.105***   -0.061      0.038       0.018       -0.075**
                                  (0.036)     (0.040)     (0.059)     (0.058)     (0.036)
High school degree                -0.198***   -0.019      -0.050      -0.095      -0.025
                                  (0.059)     (0.047)     (0.067)     (0.080)     (0.051)
Married                           -0.091      -0.066      0.020       0.008       0.096*
                                  (0.064)     (0.063)     (0.087)     (0.070)     (0.055)
Labour market experience          -0.091**    -0.052      0.037       0.059       0.012
                                  (0.039)     (0.045)     (0.063)     (0.058)     (0.045)
Female                            -0.105***   -0.061      0.038       0.018       -0.075**
                                  (0.036)     (0.040)     (0.059)     (0.058)     (0.036)
State Fixed Effects               No          No          Yes         Yes         Yes
Time Fixed Effects                No          No          No          Yes         Yes
Number of observations            42467       42467       42467       42467       42467
Notes: Standard errors are in parentheses; *** indicates statistically significant at 1%, ** indicates statistically significant at 5%, and * indicates statistically significant at 10%.
Question 1 [Fixed Effects Model].
Traffic crashes are the leading cause of death for Americans between the ages of 5 and 32. Through various spending policies, the federal government has encouraged states to institute mandatory seat belt laws to reduce the number of fatalities and serious injuries. In this question you will investigate how effective these laws are in increasing seat belt use and reducing fatalities. The data set “SeatBelts.dta” contains a panel of data from 50 US states plus the District of Columbia for the years 1983 through 1997. A detailed description is given in the “SeatBelts_Description” document.
(A) Estimate the effect of seat belt use on fatalities by regressing FatalityRate on sb_useage, speed65, speed70, ba08, drinkage21, ln(income), and age. Report your results in a table. Does the estimated regression suggest that increased seat belt use reduces fatalities?
(B) Run a regression similar to part (A) and add state fixed effects to your regression using
(i) State de-meaning
(ii) (N-1) binary indicators
(iii) “xtreg” command with clustered standard errors.
Report your results in the same table as part (A).
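As a minimal sketch of the de-meaning approach in (i): the within transformation subtracts each state's own mean from every variable before running OLS. The names `state` (state identifier) and `fatalityrate` (outcome) are assumptions, so check the data description for the actual names. Only two regressors are shown; the others would be handled the same way.

```stata
* Hypothetical names: state (state identifier) and fatalityrate (outcome)
* Flag observations with no missing values so that the state means are
* computed on the same sample the regression uses
gen byte insample = !missing(fatalityrate, sb_useage, speed65)

foreach v of varlist fatalityrate sb_useage speed65 {
    bysort state: egen double mean_`v' = mean(`v') if insample
    gen double dm_`v' = `v' - mean_`v'
}

* OLS on the de-meaned variables gives the fixed-effects slope estimates
regress dm_fatalityrate dm_sb_useage dm_speed65
```

The slope estimates from this regression should match those from the indicator-variable approach, but the reported standard errors will differ somewhat because `regress` does not account for the degrees of freedom used up by the state means.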
(C) What do state fixed effects control for in the regressions in part (B)? Do the results change when you add state fixed effects? Provide an intuitive explanation.
(D) Run a regression similar to part (A) and add both state fixed effects and year fixed effects to your regression using
(i) State and year de-meaning.
(ii) Binary indicators for both state and year fixed effects.
(iii) Binary indicators for year fixed effects and “xtreg” command for state fixed effects (cluster your standard errors).
Report your results in the same table as part (A).
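For (iii), one possible form of the command is sketched below. The names `state_id` (a numeric state identifier), `year` and `fatalityrate` are assumed, not taken from the data description, and the regressor list is shortened for illustration.

```stata
* Hypothetical variable names: state_id, year, fatalityrate
* If the state identifier is stored as a string, a numeric version can be
* created first with: encode state, gen(state_id)

* State effects are absorbed by xtreg; year effects enter as indicators
xtreg fatalityrate sb_useage speed65 speed70 i.year, fe i(state_id) vce(cluster state_id)
```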
(E) What do year fixed effects control for in the regressions in part (D)? Do the results change when you add both state and year fixed effects compared to your results in part (B) where you only have state fixed effects? Provide an intuitive explanation.
(F) Using your results in part (D-iii), discuss the size of the coefficient on sb_useage. Is it large? Small? How many lives would be saved if seat belt use increased from 52% to 90%?
(G) There are two ways that mandatory seat belt laws are enforced: “primary” enforcement means that a police officer can stop a car and ticket the driver if the officer observes an occupant not wearing a seat belt; “secondary” enforcement means that a police officer can write a ticket if an occupant is not wearing a seat belt, but must have another reason to stop the car. In the data set, primary is a binary variable for primary enforcement and secondary is a binary variable for secondary enforcement. Run a regression of sb_useage on primary, secondary, speed65, speed70, ba08, drinkage21, ln(income), and age, including state and year fixed effects in the regression. Does primary enforcement lead to more seat belt use? What about secondary enforcement?
(H) In 2000, New Jersey changed from secondary enforcement to primary enforcement. Estimate the number of lives saved per year by making this change.
(I) What is the source of identification in the regression model with state fixed effects? Explain clearly and in plain English.
Question 2 [IV Model].
In this exercise, you are asked to reproduce some of the results of the paper by Angrist and Krueger (1991) titled “Does compulsory school attendance affect schooling and earnings?”.
Here are the variables you need to use in your analysis:
LWKLYWGE is the log of weekly wage. YOB measures year of birth. QOB measures quarter of birth. AGE measures age of individuals and AGEQSQ is its quadratic term. EDUC is the variable of interest and measures years of education. The instruments Angrist and Krueger use are the interactions between the quarter of birth and year of birth (there are four quarters of birth and ten years of birth for men born between 1920 and 1929, which is the sample analyzed in this table). Keep in mind that you need to leave out the set of interactions between one of the quarters of birth and the year dummies. If you implement the regressions correctly, you will get estimates identical to those in table IV in the paper. RACE is =1 if black and zero otherwise, SMSA is =1 if center city and zero otherwise, and MARRIED is =1 if married and zero otherwise. Finally, “NEWENG MIDATL ENOCENT WNOCENT SOATL ESOCENT WSOCENT MT” are eight region-of-residence dummies.
(A) Use an OLS model to estimate the returns to education controlling for race, marital status, whether the city where an individual lives is a center city, and regions of residence. Interpret the estimated coefficient on education. Why is the estimated return to education likely to be biased?
(B) Angrist and Krueger use the IV method to estimate the causal effect of education on wages. Students born in different months of the year start school at different ages. This fact, in conjunction with compulsory schooling laws, which require students to attend school until they reach a specified birthday, produces a correlation between date of birth and years of schooling. Students who are born early in the calendar year are typically older when they enter school than children born late in the year. Because children born in the first quarter of the year enter school at an older age, they attain the legal dropout age after having attended school for a shorter period of time than those born near the end of the year. Hence, if a fixed fraction of students is constrained by the compulsory attendance law, those born in the beginning of the year will have less schooling, on average, than those born near the end of the year (see the paper for more details).
Replicate the results reported in table IV of the paper (page 999).
The instruments Angrist and Krueger use are the interactions between the quarter of birth and year of birth (there are four quarters of birth and ten years of birth for men born between 1920 and 1929, which is the sample analyzed in this table). Keep in mind that you need to leave out the set of interactions between one of the quarters of birth and the year dummies. If you implement the regressions correctly, you will get estimates identical to those in table IV. You can use “ivregress 2sls” to implement the 2SLS regressions.
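The instrument construction described above could be sketched as follows. This is an illustration under stated assumptions, not the paper's own code: it assumes the sample has already been restricted to the 1920 to 1929 cohort, it leaves out the fourth quarter as the omitted set of interactions, and the generated variable names (`q1_yob...` and so on) are hypothetical. The set of additional controls depends on which column of the table is being replicated.

```stata
* Assumes the data in memory contain only men born 1920-1929
* Build quarter-of-birth x year-of-birth indicators, leaving out quarter 4
levelsof YOB, local(years)
foreach y of local years {
    forvalues q = 1/3 {
        gen byte q`q'_yob`y' = (QOB == `q') & (YOB == `y')
    }
}

* Year-of-birth dummies are included as controls; the interactions are the
* excluded instruments for EDUC
ivregress 2sls LWKLYWGE i.YOB (EDUC = q1_yob* q2_yob* q3_yob*)
```

Looping over the values returned by `levelsof` avoids hard-coding how YOB is stored (for example 1920 versus 20).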
(C) Test the exogeneity of the instruments using the J-test. What does the test suggest regarding the exogeneity of the instruments?
Question 3 [Limited Dependent Variable: Binary Indicators].
For this question you will use the “LEE” data set, which is a linked employer-employee data set. In this data set we observe 149 different firms (docket) in different years (1999, 2001, 2003, 2005). Not all firms are observed in all 4 years (unbalanced panel). We also observe multiple workers within each firm (seq_no). The total number of workers in the data is 2377.
[Note1: This panel structure is different from what you have seen in the lecture. In the lecture, when we talked about beer tax and fatality rate, you had a state-by-year panel where the unit of observation was a state in a given year. Here you have a firm-by-year panel where the unit of observation is a worker in a given firm. This different structure will change some of the interpretations for the fixed effect model; keep that in mind when answering the following questions.]
(A) The variable “promot” is an indicator which is equal to one for workers who have been promoted at least one time while working for their current employer, and zero otherwise. Use a linear probability model regression to examine whether immigrants are less likely to get promoted than Canadian-borns. Report your estimated coefficient in a table and interpret your result.
(B) Now add the following control variables to your LPM regression above:
• Education (educ_maphdmd, educ_othgrad educ_bachelors educ_someuniv educ_collcert educ_somecollnonuniv educ_hsgrad) [omitted category: education less than high school]
• number of children (kids_1, kids_2, kids_3 and kids_4plus) [omitted category: zero children]
• full-time/part-time status (fulltime), age, gender(female), years of labour market experience (yrs_exp), union membership (cba_or_union)
• marital status(mar_married mar_commonlaw mar_separated mar_divorced mar_widowed) [omitted category: single workers]
Report your estimates in the same table as Part (A). Are your results different than part (A)? Explain why or why not?
(C) Re-estimate the regression specification in part (B) using a probit model this time. Report your results in the same table (report the marginal effects for an average individual). Are your results different than part (B)? What do you infer from the difference (or absence of difference) between these two sets of estimates?
(D) Your friend argues that it might be interesting to dig a bit deeper and examine whether there is any heterogeneity in differences in promotion opportunities across different groups of immigrants and Canadian-borns. Specifically, he suggests examining whether visible minority immigrants, visible minority Canadian-borns and white immigrants face different promotion opportunities compared to white Canadian-borns. Run a probit regression to investigate this issue. Report your estimates (marginal effects) in the same table as before and answer the question your friend is interested in [Hint: “vismin” is an indicator which is equal to one if a worker is a visible minority and zero if the worker is white. Use this indicator and the “immigrant” indicator to create the appropriate indicators for your regression specification].
(E) Consider your regression specification in part (C). Another friend suggests it might be a good idea to examine whether time spent in Canada has any impact on differences in promotion opportunities faced by immigrants. He expects that, as immigrants spend more time in Canada, the gap in promotion opportunities they experience compared to their Canadian counterparts might narrow or disappear.
The variable “yrs_since_immig” measures the number of years an immigrant worker has spent in Canada (its value is equal to zero for all Canadian-borns). Use this variable to create four indicators: an indicator for immigrants who have been in Canada for 0 to 5 years, for 6 to 10 years, for 11 to 20 years, and for more than 20 years.
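One way to build these four indicators is sketched below, using the variable names that appear in the regression command in this question. The sketch assumes that “immigrant” is coded 0/1 and that yrs_since_immig takes whole-number values; both are assumptions about the data.

```stata
* The immigrant == 1 condition keeps Canadian-born workers, who are coded
* as 0 years since immigration, out of the 0-5 band
gen byte yr_img_0to5   = immigrant == 1 & inrange(yrs_since_immig, 0, 5)
gen byte yr_img_6to10  = immigrant == 1 & inrange(yrs_since_immig, 6, 10)
gen byte yr_img_11to20 = immigrant == 1 & inrange(yrs_since_immig, 11, 20)

* Missing values compare as larger than any number in Stata, so they are
* excluded explicitly from the open-ended band
gen byte yr_img_21plus = immigrant == 1 & yrs_since_immig > 20 & !missing(yrs_since_immig)
```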
You run the following regression:
probit promot immigrant yr_img_0to5 yr_img_6to10 yr_img_11to20 yr_img_21plus
When you look at your Stata outputs, you notice that Stata drops one of your variables. Why?
(F) Now that you have identified the source of the problem in part (E), run the regression again using the correct specification. Report your results (marginal effects) in the same table as before and interpret your estimated coefficients (all of them, one by one). Do you find any evidence that the gap in promotion opportunities between immigrants and Canadians gets smaller as immigrants spend more time in Canada? Explain.
(G) Re-estimate your LPM regression in part (B), but this time add firm effects to your regression (use xtreg).
(i) Report your results in the same table as before (exclude estimated firm effects from the table).
[Note2: If you try to use “xtset docket year” to let Stata know you are using panel data, you will get the following error: “repeated time values within panel”. xtset works when your unit of observation is the same as your fixed effect (for instance, when your unit of observation is state-by-year and you are using state fixed effects, which was the case in the beer tax and fatality rate example we talked about in the lecture). In this case, your unit of observation is a worker and you are using firm fixed effects, so Stata gives you an error because you have more than one observation per firm-year (multiple workers within each firm in a given year). In a situation like this, you can use the following method to implement the fixed effect regression:
xtreg log_hourly_wage female yrs_exp …, fe i(docket)
where docket is the firm ID.]
(ii) [Bonus: 2%] What do firm effects control for in this specification? Explain in plain English.
(iii) [Bonus: 2%] Are there differences in the source of identification between your specifications in part (B) and (G)? Explain in plain English.
[Hint: keep in mind the note at the beginning of the question and what we talked about in the last lecture when you are answering parts (ii) and (iii)]
(H) [Bonus: 3%] The margins command used after the probit regression reports the marginal effects at the average values of the regressors. Your friend suggests that it might be a good idea to calculate the average marginal effects rather than the marginal effects at the average.
(i) What is the difference between the two? Explain.
(ii) Use Stata and your specification in part (C) to implement what your friend has suggested and report your results.
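The two calculations can be sketched as follows. The probit below uses a shortened regressor list purely for illustration; it is not the full specification from part (C).

```stata
* Illustrative probit with a shortened regressor list
probit promot immigrant age female yrs_exp

* Marginal effects evaluated at the sample means of the regressors
margins, dydx(*) atmeans

* Average marginal effects: computed for each observation, then averaged
margins, dydx(*)
```

Entering indicator variables with the `i.` prefix (for example `i.immigrant`) makes `margins` report the discrete change from 0 to 1 instead of a derivative.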
Question 4 [Multinomial and Ordered Logit]
Use “ologit.dta” in “Question4” folder to answer the following questions.
We would like to examine factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our outcome variable, apply, has three categories. The factors we are interested in are:
1. Parental educational status: pared, which is a 0/1 variable indicating whether at least one parent has a graduate degree
2. Whether the undergraduate institution is public or private: public, which is a 0/1 variable where 1 indicates that the undergraduate institution is public and 0 private
3. Current GPA: gpa, which is the student’s grade point average
The researchers have reason to believe that the “distances” between the three categories of the outcome variable are not equal. For example, the “distance” between “unlikely” and “somewhat likely” may be shorter than the distance between “somewhat likely” and “very likely”.
(A) What is the problem with using OLS in this case? Briefly explain.
(B) Use ordered logit to estimate how the three factors mentioned above influence the decision to apply to graduate school. Report the odds ratios from this estimation in a table.
(C) Use your results in part (B) to explain how parental education influences the decision to apply to graduate school. Use plain English to provide a non-technical explanation understandable for a lay person and then back up your claim using an accurate statistical interpretation of the estimated coefficient for “pared”.
(D) Use your results in part (B) to explain how undergraduate institution (public vs private) influences the decision to apply to graduate school. Use plain English to provide a non-technical explanation understandable for a lay person and then back up your claim using an accurate statistical interpretation of the estimated coefficient for “public”.
(E) Use your results in part (B) to explain how current GPA influences the decision to apply to graduate school. Use plain English to provide a non-technical explanation understandable for a lay person and then back up your claim using an accurate statistical interpretation of the estimated coefficient for “gpa”.
(F) Use “margins” command to answer the following questions:
1. What is the marginal effect of parental education on the probability of being unlikely to apply to graduate school? Explain.
2. What is the marginal effect of going to a private school on the probability of being very likely to apply to graduate school? Explain.
3. What is the marginal effect of current GPA on the probability of being unlikely to apply to graduate school? Explain.
(G) Use the “margins” command to calculate the predicted probability of each outcome category for different variables. Use your results to populate the second column of the table below (provide both estimates and standard errors).
Table
Ordered Logit Multinomial Logit
Outcome: probability (apply to graduate school = unlikely)
At least one parent has a graduate degree
None of the parents has a graduate degree
Difference (marginal effect)
Undergraduate institutions = public
Undergraduate institutions = private
Difference (marginal effect)
GPA (marginal effect)
Outcome: probability (apply to graduate school = somewhat likely)
At least one parent has a graduate degree
None of the parents has a graduate degree
Difference (marginal effect)
Undergraduate institutions = public
Undergraduate institutions = private
Difference (marginal effect)
GPA (marginal effect)
Outcome: probability (apply to graduate school = very likely)
At least one parent has a graduate degree
None of the parents has a graduate degree
Difference (marginal effect)
Undergraduate institutions = public
Undergraduate institutions = private
Difference (marginal effect)
GPA (marginal effect)
* significant at the 10% level; ** significant at the 5% level; *** significant at the 1% level
Notes: Standard errors are reported in parentheses.
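A possible pattern for the ordered logit entries of a table like this one is sketched below. It assumes that “unlikely” is the first (lowest) category of apply, which is why the outcome is referenced by position as `#1`; the same commands with `#2` and `#3` would cover the other two categories.

```stata
* Ordered logit, reporting odds ratios
ologit apply i.pared i.public gpa, or

* Predicted probability of the first outcome category by parental education
margins pared, predict(outcome(#1))

* Marginal effects on the probability of the first outcome category
margins, dydx(pared public gpa) predict(outcome(#1))
```

Because pared and public enter with the `i.` prefix, their `dydx()` results are differences in predicted probability between the two groups.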
(H) Now use multinomial logit to estimate how the three factors mentioned above influence the decision to apply to graduate school. Report the relative risk ratios from this estimation in the second column of the table you created for part (B) above.
(I) Use the “margins” command after multinomial logit to calculate the predicted probability of each outcome category for different variables. Use your results to populate the third column of the table you used for part (G) (provide both estimates and standard errors).
(J) Do you find any difference in the marginal effects from the ordered logit model versus those from multinomial logit? Explain.
(K) What are some of the differences between the ordered logit estimator and the multinomial logit estimator? Which one do you think is more appropriate as an estimator in this context?
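For the multinomial logit, the same calculations can be looped over the three outcome categories. This is a sketch only; it leaves the base outcome at Stata's default and refers to the categories by position.

```stata
* Multinomial logit, reporting relative risk ratios
mlogit apply i.pared i.public gpa, rrr

* Predicted probabilities and marginal effects for each outcome category
forvalues k = 1/3 {
    margins pared public, predict(outcome(#`k'))
    margins, dydx(pared public gpa) predict(outcome(#`k'))
}
```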