Methods of summarizing the relationship between two variables

Problem Set 1
This exercise examines methods of summarizing the relationship between two variables: a simple
graphical analysis and the bivariate linear regression model. The application is to the relationship
between infant mortality rates (IMRs) and total suspended particulates (TSPs) air pollution. The
Environmental Protection Agency recently toughened the regulations that limit firms' ability to
emit TSPs because of the presumed health effects of TSPs. Whether IMRs and TSPs are
causally related is an issue of tremendous importance to public policy.
Feel free to work cooperatively but each person is required to turn in their own problem set that
provides the solutions in their own words.
For those of you who become interested in this topic, the following papers may be useful:
Chay, Kenneth Y. and Michael Greenstone. 2005. Does Air Quality Matter?: Evidence from the
Housing Market. Journal of Political Economy, 113(2): 376-424.
Chay, Kenneth Y. and Michael Greenstone. 2003. Air Quality, Infant Mortality, and the Clean
Air Act of 1970. MIT Department of Economics Working Paper No. 04-08.
Data Source: imrtsp71.dta and imrtsp72.dta
imrtsp71.dta is a data file from 1971. The unit of observation is the county and there are 715
observations of 21 variables.
This Stata-format data file contains county-level information on the number of infant
deaths per 1000 births (IMR), the natural log of this same number, TSPs concentrations, number of
births, characteristics of new parents (e.g., race of mother, years of education, marital status of
mother, mother's age), the share of births with low birth weight (an indicator of poor
infant health), the month of pregnancy in which the mother initiated prenatal care, and
mean per-capita income.
The relevant variables with descriptions in quotations are:
imr71 “# inf deaths per 1000 births 71”
lnimr71 “ln(# inf death per 1000 births 71)”
mtspar71 “county-level tsps concentration, measured in micrograms per cubic meter 71”
tsp_sq “the square of mtspar71”
birth71 “# births 71”
white71 “% births, white mom 71”
othr71 “% births, nonwhite/nonblack mom 71”
female71 “% female births 71”
edudad71 “mean father years of ed 71”
edumom71 “mean mother years of ed 71”
maried71 “% mother married 71”
umard71 “% mother unmarried 71”
agemom71 “mean mother age 71”
lwght71 “% births with weight <2500 g in 71”
pcare171 “% mother began prenatal care in 1st or 2nd month 71”
pcare271 “% mother began prenatal care in 3rd month 71”
pcare371 “% mother began prenatal care in 4th-6th month 71”
pcare471 “% mother began prenatal care in 7th-9th month 71”
pcinc71 “county-level per cap income 71”
location “5-digit county fips code”
fstate “2-digit state fips code”
[Note: There may be a few extra variables in the data file, but you should ignore them.]
imrtsp72.dta is structured exactly the same way except that the observations are from 1972 and
all the appropriate variable names end with “72” instead of “71”. Again, the unit of observation
is the county and here there are 983 observations of 22 variables. DO NOT USE imrtsp72.dta
in this problem set.

  1. Summarize the relationship between the number of infant deaths per 1000 births and TSPs
    concentrations.
    a. Create histograms of imr71 and lnimr71. Does either of these variables look normally
    distributed? (Hint: experimenting with the number of bins and overlaying a normal curve
    will help with this.)
    b. Graph scatter plots of imr71 and lnimr71 against mtspar71. Does it look like there is
    an association between infant mortality and TSPs?
    c. Examine the edudad71 variable. What are the deciles of the variable? What is the
    average number of years of education in the top decile? Graph a scatter plot of imr71
    against edudad71. Do you think that counties with more educated fathers have lower levels of
    infant mortality?
    d. Graph scatter plots of imr71 and lnimr71 against mtspar71, but this time, weight the
    observations by the total number of births in the county. What is your prediction
    about the covariance of infant mortality rates and TSPs? Does this relationship appear
    linear for either form of the dependent variable?
  2. Background Questions
    a. Does the available data allow for a determination of the causal relationship between
    infant mortality and TSPs? Why not? Describe the data file that would allow for an
    examination of this issue.
    b. Under what assumptions is the least squares estimator the best linear unbiased estimator (BLUE)?
    c. What assumption is necessary for LS to produce an unbiased estimate of the IMR/TSPs
    relationship? Do you think this assumption is likely to hold? If you had any data file
    that you wanted, how would you test whether this assumption may be valid? Describe
    your ideal data file. With the current data file, present some evidence as to whether
    this assumption is likely to hold.
    d. In the bivariate linear regression model, derive the estimating equations for the
    intercept and slope coefficients. Derive their standard errors.
  3. The bivariate linear regression model of infant mortality rates and TSPs.
    a. Run the regression of imr71 on a constant and mtspar71, and the regression of lnimr71
    on a constant and mtspar71. In both cases, weight the regressions by birth71 so that larger
    counties have a greater influence. Interpret the parameter estimate (i.e., Beta Hat) in words;
    for instance, describe the effect of a 10-unit decline in TSPs on infant mortality.
    b. Plot the residuals from both regressions and overlay a normal curve. Does the
    normality assumption appear reasonable? Does homoscedasticity of residuals hold? (Hint:
    graph residuals against the fitted values.)
    c. Use the total sum of squares (TSS), error sum of squares (ESS), and regression sum of
    squares (RSS) to derive the R² statistic. Determine the components of the corrected
    R² statistic and show that Stata accurately calculated that statistic.
    d. Determine the values of TSPs that define the deciles of TSPs. Create 10 dummy
    variables where each one corresponds to a decile of TSPs. For instance, an observation
    that has a TSPs concentration in the smallest decile would have a value of 1 for the
    dummy variable that corresponds to the smallest decile and a value of 0 for the other
    9 dummy variables. Regress imr71 on a constant and the 10 dummy variables. Why
    does Stata drop one of the dummy variables? Plot the parameter estimates from the
    dummy variables where the y-axis is the parameter estimate of the dummy variables
    and the values on the x-axis are the midpoint of the range that determines each of the
    dummy variables. Is the effect of TSPs on imr71 linear in TSPs?
    e. Now regress imr71 on mtspar71 and the square of mtspar71. (Note: you will have
    to generate the square variable.) Plot the predicted values of this regression against
    mtspar71. Describe the shape of this function.
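
As a point of reference for questions 3a and 3c, the following is a minimal Python sketch of a weighted bivariate regression and its sums of squares. It is an illustration under stated assumptions, not the expected solution: the function name weighted_ols and the five data points are hypothetical (they are not drawn from imrtsp71.dta), and the weights are treated as analytic-style weights, which is an assumption because the problem set does not say which weight type to use. ESS and RSS follow the naming used in 3c (error and regression sums of squares).

```python
def weighted_ols(x, y, w):
    """Weighted least squares of y on a constant and x (hypothetical helper)."""
    sw = sum(w)
    # Weighted means of the regressor and the outcome.
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted cross-product and weighted sum of squared deviations of x.
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    fitted = [intercept + slope * xi for xi in x]
    # Naming as in question 3c: ESS = error SS, RSS = regression SS.
    tss = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    ess = sum(wi * (yi - fi) ** 2 for wi, yi, fi in zip(w, y, fitted))
    rss = sum(wi * (fi - ybar) ** 2 for wi, fi in zip(w, fitted))
    n, k = len(x), 2  # k counts the intercept and the slope
    r2 = 1 - ess / tss
    adj_r2 = 1 - (ess / (n - k)) / (tss / (n - 1))
    return {"intercept": intercept, "slope": slope, "tss": tss,
            "ess": ess, "rss": rss, "r2": r2, "adj_r2": adj_r2}


# Made-up illustrative values, NOT taken from the course data file.
tsp = [40.0, 60.0, 80.0, 100.0, 120.0]      # stand-in for mtspar71
imr = [18.0, 19.5, 20.5, 22.5, 23.0]        # stand-in for imr71
births = [500.0, 1200.0, 800.0, 3000.0, 1500.0]  # stand-in for birth71

fit = weighted_ols(tsp, imr, births)
print("slope:", fit["slope"])
print("implied change in IMR from a 10-unit decline in TSPs:", -10 * fit["slope"])
print("TSS - (ESS + RSS):", fit["tss"] - (fit["ess"] + fit["rss"]))
print("R2:", fit["r2"], "adjusted R2:", fit["adj_r2"])
```

With an intercept in the model, TSS should equal ESS + RSS up to rounding, so the printed difference offers a quick check on the decomposition used to build R².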
