Your data set must have at least 50 observations and may have many more. Where you find the data is up to you. Lots of economic data are available on the web. (Your data do not need to be strictly “economic.”) You should NOT simply use data that came with the book (or other textbooks) or that are downloadable from the Gretl website. Instead, you should look for a topic that is relevant to you and find data to test it. This is your opportunity to do something new and creative. Note that you will likely want to concentrate on cross-sectional data, since we focused on this type for most of the class (and since methods differ for time series and panels). If you still choose another type of data, be careful in your write-up to discuss critically any interpretation problems that remain, especially in the conclusions section of your second submission.
The Yale Law Library has a great list of U.S. data sources at https://library.law.yale.edu/news/75-sources-economic-data-statistics-reports-and-commentary, and international organizations such as the World Bank (https://data.worldbank.org/) and the International Monetary Fund (https://www.imf.org/en/Data) also have public-use data. So do state and local government websites such as https://data.colorado.gov/ and https://opendata.fcgov.com/. There are also lots of data on other things in life (e.g., financial data at https://www.wsj.com/market-data; sports data at https://www.espn.com/, where you pick a sport and then select “stats”). There are all kinds of other sources out there too. It is up to you where you locate data.
To use Gretl, you will likely need to import your data from another format (e.g., Excel, ASCII, or CSV). To do this, use file → open data → userfile. Then select the data type on the bottom right-hand side of the window and navigate your computer to find your file. (In some versions of Gretl, use file → open data → import and then select the file type from the menu that pops up.) During this import stage, Gretl will prompt you to indicate whether the data are cross-sectional, time series, or panel (longitudinal). I often suggest putting the data in Excel first and then loading them into Gretl in a second step, but that’s up to you.
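Before importing, it can help to confirm that a file meets the 50-observation minimum and has no blank cells. The following is a minimal sketch in Python (not Gretl) and is only an illustration: the function name `check_csv`, the column names, and the three sample rows are all hypothetical, and with a real file you would pass an opened file instead of the in-memory sample.

```python
import csv
import io

# Hypothetical CSV contents standing in for your own file. With a real file,
# replace io.StringIO(...) with open("mydata.csv", newline="").
sample = io.StringIO(
    "district,testscr,str\n"
    "A,690.8,17.9\n"
    "B,661.2,21.5\n"
    "C,643.6,\n"
)

def check_csv(handle, min_obs=50):
    """Return (observation count, rows with a blank or malformed cell, meets minimum?)."""
    rows = list(csv.DictReader(handle))
    flagged = [
        i for i, row in enumerate(rows, start=1)
        # A short row yields None; a row with extra fields yields a list.
        if any(not isinstance(v, str) or v.strip() == "" for v in row.values())
    ]
    return len(rows), flagged, len(rows) >= min_obs

n_obs, flagged_rows, enough = check_csv(sample)
print(n_obs, flagged_rows, enough)
```

Row numbers count data rows only, starting at 1 after the header line.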
To facilitate grading, please clearly label all subsections.
Problem Set Part I
(a) Statement of Research Question
Briefly pose the question that you will be investigating using data. Good econometric questions are generally based on economic theory; however, econometrics can be used to analyze all kinds of cause-and-effect relationships, even if they don’t directly relate to your past economic theory courses. You can study just about anything that interests you. Since the goal of the project is causal identification of the effect of one variable on another, a good question specifies one primary X variable and one primary Y variable. (As in the book and other class examples, you will add other X’s as controls later, choosing specific variables that satisfy the criteria for omitted variable bias.)
Write out the question in the form “What is the effect of X on Y?” (with X and Y filled in respectively). Note again that X should be one main variable instead of a set of variables. (For reference, in our class example X was the student-teacher ratio and Y was the test score, though other specific X variables were also included in the models.)
(b) Introduction
Motivate your research question by writing a clear introduction to your topic question. Unlike part (a) (which is just for my reference to quickly understand your topic), part (b) should look like a regular term paper introduction. Describe the question and why it is of interest. Briefly cite relevant literature (e.g., previous studies) that you come across while researching your topic or other relevant information/background. Your goal is to help me understand both your question and the mechanism that you are envisioning behind the association between your X and Y.
(c) Formulation of Your Baseline Linear Model
Now express the question (which should match what you have done in parts (a) and (b)) in the form of an equation to be estimated. In other words, your goal is to provide a clear theoretical equation to be estimated. Fill in illustrative names of the dependent and independent variables instead of just writing in terms of generic X’s and a Y. Make sure to use population parameters (e.g., β’s with appropriate subscripts) and an error term. You must include at least three independent variables (your main X plus at least two more), but you can include as many as you’d like beyond that threshold. (Note that you will eventually present at least one nonlinear specification for comparison, but that won’t be until Part II.) Here, you are being asked to construct a baseline linear case.
Clearly explain what each of the variables in your equation should capture and exactly why you are including it. In other words, explain the mechanisms that you believe pertain to your study question and that led you to select your particular X variables. Note that, unlike part (b), you are now doing this for all X variables, and that this is a theoretical discussion rather than a data description (which comes in the next part). When describing additional regressors (beyond your primary X), please make sure to clearly indicate how these variables satisfy the criteria for omitted variable bias. For each variable, clearly indicate the sign you expect on its coefficient. Are you expecting a positive relationship? Negative? Why?
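As an illustration only, a baseline linear equation for the class example (test scores and the student-teacher ratio) might be written as below. The two control variables and all variable names (PctEL for the share of English learners, LchPct for the share eligible for subsidized lunch) are assumptions chosen for this sketch; your own equation should use variables suited to your question.

```latex
\[
\mathit{TestScore}_i = \beta_0 + \beta_1\,\mathit{STR}_i + \beta_2\,\mathit{PctEL}_i
  + \beta_3\,\mathit{LchPct}_i + u_i
\]
```

Here the subscript i indexes observations (e.g., school districts), the β’s are population parameters, and u is the error term.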
(d) Data Description
Now describe the data set that you are using to estimate the model. Where do the data come from? (Note that there may be multiple sources which you put together.) How is each of the variables measured? In what units? (Note that this is different from (c): in (c) you discuss how the variables relate theoretically, and here you define the specific data that will be used empirically.)
You should construct (1) a table that provides the mean, minimum, maximum, and standard deviation (and any other summary statistics that you feel are relevant) of all variables in the model and (2) a scatterplot of your primary X versus your primary Y variable.
Describe in the text what you are learning from this table and figure in this section. You might also want to show and discuss other key scatterplots of interest (e.g., between your Y variable and other key regressors). If any of your figures show significant outliers, discuss these and whether you are doing something about them.
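The summary-statistics table described above can be sketched as follows. This is a minimal Python illustration, not Gretl output: the variable names and the five values per variable are hypothetical, and the scatterplot is not covered here.

```python
import statistics

# Hypothetical values for illustration only; substitute your own columns.
data = {
    "testscr": [690.8, 661.2, 643.6, 647.7, 640.9],
    "str": [17.9, 21.5, 18.7, 17.4, 18.7],
}

def summarize(values):
    """Summary statistics for one variable."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
        "n": len(values),
    }

table = {name: summarize(values) for name, values in data.items()}

# Print one row per variable, with units left for the table caption.
print(f"{'variable':<10}{'mean':>10}{'sd':>10}{'min':>10}{'max':>10}{'n':>5}")
for name, s in table.items():
    print(f"{name:<10}{s['mean']:>10.2f}{s['sd']:>10.2f}"
          f"{s['min']:>10.2f}{s['max']:>10.2f}{s['n']:>5d}")
```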
(e) Bibliography
All cited sources should be reported in a bibliography in a consistent format. One useful reference is S&W. The bibliography should be short and should include our course textbook as well as sources related to your question.
Problem Set Part II
To facilitate grading, please clearly label all subsections (e.g., part (a), part (b)…). Please do not include your Part I.
(a) (One line) Restate your primary question in the form “What is the effect of X on Y?”
(b) (One line) Also rewrite the base theoretical linear equation to be estimated.
(Parts (a) and (b) are for my memory because I will be reading a bunch of these papers.)
(c) Estimation of Linear Models
Present the results of the OLS estimation in the form of a regression table. Your table should be formatted as an easy-to-read table (not just cut and pasted out of Gretl). You should show at least three specifications of your design (in this one table). Include coefficients, standard errors, *’s to indicate the result of hypothesis testing, and goodness of fit statistics in the formatted table. Table 8.3 in your text is a good example of the formatting that I am expecting.
Your first regression should include your primary X variable and no other regressors. Your second regression should add at least one more regressor entered linearly. Your third regression should match your theoretical linear equation from part (b). (You may include regressions beyond these three if they are relevant to your project, but this is not required.)
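The three nested specifications can be sketched as below. This is a rough Python illustration under stated assumptions, not a substitute for Gretl: the helper `ols`, the variables `x1`, `x2`, `x3`, and the data are hypothetical, and `y` is constructed with no error term so that the full specification recovers the chosen coefficients (2, 3, -1, 0.5) exactly. Real data will not behave this cleanly, and the sketch does not compute standard errors.

```python
def ols(y, columns):
    """OLS coefficients [intercept, slopes...] from the normal equations (X'X)b = X'y."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data: x2 is correlated with x1, so omitting it biases the x1 slope.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [1, 0, 0, 1, 1, 0, 1, 0]
y = [2 + 3 * a - 1 * b + 0.5 * c for a, b, c in zip(x1, x2, x3)]

spec1 = ols(y, [x1])          # primary X only
spec2 = ols(y, [x1, x2])      # adds one control
spec3 = ols(y, [x1, x2, x3])  # full baseline equation

for name, coefs in [("(1)", spec1), ("(2)", spec2), ("(3)", spec3)]:
    print(name, [round(b, 3) for b in coefs])
```

Comparing the coefficient on `x1` across the three columns is the comparison the regression table is meant to make visible.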
(d) Interpretations of Linear Models
(i) Discuss the estimated effect of X on Y across your models. Interpret your results in terms of both (1) economic significance and (2) statistical significance. Specifically, explain the meaning of estimated coefficients in your model in terms of magnitudes of impacts (economic significance). Please be very precise about units in all your interpretations. Also, indicate if the coefficients are statistically different from zero (and at what significance level). Include a discussion of both your primary slope and of the intercept.
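For the statistical-significance part, the link between a coefficient, its standard error, and the stars in the table can be sketched as follows. This is a minimal Python illustration under stated assumptions: it uses the large-sample standard normal approximation for the two-sided p-value, the example numbers are hypothetical, and the star convention (* for 5%, ** for 1%) is an assumption. Whatever convention you use, state it in the table note.

```python
import math

def t_stat_and_p_value(coefficient, standard_error):
    """t-statistic for H0: coefficient = 0 and its two-sided large-sample p-value."""
    t = coefficient / standard_error
    # For a standard normal Z, P(|Z| > |t|) = erfc(|t| / sqrt(2)).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

def stars(p, levels=((0.01, "**"), (0.05, "*"))):
    """Map a p-value to significance stars; the thresholds are an assumed convention."""
    for threshold, mark in levels:
        if p < threshold:
            return mark
    return ""

# Hypothetical slope estimate and standard error.
t, p = t_stat_and_p_value(-2.28, 0.52)
print(f"t = {t:.2f}, p = {p:.5f}, stars = '{stars(p)}'")
```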
(e) Discussion of Single versus Multivariate Regressions
Now interpret the coefficients on the added regressors. Discuss particularly how the inclusion of additional X’s changes your primary coefficient of interest. Make sure to discuss how and why your single variable regression model has omitted variable bias and how the specific additional X’s that you include decrease this bias. Refer to the criteria for and the direction of omitted variable bias (from Chapter 6) in your discussion of the single variable model relative to the expanded ones.
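For reference, the large-sample omitted variable bias result for the single-regressor model can be written as below. The notation is assumed to follow the S&W Chapter 6 convention, so check it against the text before citing it.

```latex
\[
\hat{\beta}_1 \;\xrightarrow{\;p\;}\; \beta_1 + \rho_{Xu}\,\frac{\sigma_u}{\sigma_X}
\]
```

Here ρ_Xu is the correlation between the included regressor X and the error term u, which contains the omitted variable. The bias is nonzero only if the omitted variable both helps determine Y and is correlated with X, and the sign of ρ_Xu gives the direction of the bias.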
(f) Discussion of Model Fit
Discuss the goodness of fit of your final linear model (e.g., R-squared, SER, adjusted R-squared). Focus on your last specification (the one that includes all the variables that you are using).
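The three fit measures named above can be computed from actual and fitted values as in this minimal Python sketch. The function name `fit_statistics` and the five-observation data set are hypothetical. The degrees-of-freedom correction n - k - 1 (with k regressors excluding the intercept) is assumed to match the S&W definitions.

```python
import math

def fit_statistics(y, fitted, k):
    """R-squared, adjusted R-squared, and SER for a regression with k regressors."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # sum of squared residuals
    tss = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    return {
        "r2": 1 - ssr / tss,
        "adj_r2": 1 - (n - 1) / (n - k - 1) * ssr / tss,
        "ser": math.sqrt(ssr / (n - k - 1)),
    }

# Hypothetical data; the OLS line for these points is y-hat = 2.2 + 0.6 x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * xi for xi in x]

fit = fit_statistics(y, fitted, k=1)
print({name: round(value, 4) for name, value in fit.items()})
```

Note that the SER is in the units of Y, which is worth stating when you discuss it.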
(g) Empirical Results from a Nonlinear Model
Present the results from at least one nonlinear model in the form of a table (e.g., add a natural log term, a polynomial, or an interaction). Your table should be formatted as an easy-to-read table, not cut and pasted out of Gretl. Include measures of statistical significance and goodness of fit in the formatted table.
(h) Interpretation of Nonlinear Model(s)
Discuss the estimated nonlinear effect (or effects) which you have found in terms of both (1) economic significance and (2) statistical significance. Make sure you are clear about the units of measurement in your interpretations.
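As a reminder of how the interpretation changes with the functional form, the standard cases can be summarized as below. This is a sketch with generic X and Y (it assumes the amsmath package for `align*` and `\text`), and the log interpretations are approximations that hold for small changes.

```latex
% Quadratic: Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + u_i
% The predicted effect of changing X by \Delta X depends on the starting value of X.
\[
\Delta\hat{Y} = \hat{\beta}_1\,\Delta X + \hat{\beta}_2\left[(X + \Delta X)^2 - X^2\right]
\]

% Logarithmic specifications
\begin{align*}
Y_i &= \beta_0 + \beta_1 \ln(X_i) + u_i
  && \text{a 1\% increase in $X$ changes $Y$ by about $0.01\,\beta_1$ units} \\
\ln(Y_i) &= \beta_0 + \beta_1 X_i + u_i
  && \text{a one-unit increase in $X$ changes $Y$ by about $100\,\beta_1$\%} \\
\ln(Y_i) &= \beta_0 + \beta_1 \ln(X_i) + u_i
  && \text{a 1\% increase in $X$ changes $Y$ by about $\beta_1$\% (an elasticity)}
\end{align*}

% Interaction: Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i
% The effect of X_1 depends on the level of X_2.
\[
\frac{\Delta Y}{\Delta X_1} = \beta_1 + \beta_3 X_2
\]
```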
(i) Summary and Discussion
Summarize your main results and indicate limitations to your approach. Carefully discuss caveats to your study, especially whether parameter estimates may still be biased at the end of your study. Suggest what other independent variables, possible functional forms, or statistical tests might be appropriate and/or any interesting follow-up questions or extensions that have come to your mind. You should clearly address all five issues of internal validity and should also address external validity. Use S&W Chapter 9 as a reference/checklist to guide your discussion.