Regression Analysis

Description

You own a small real estate office. Part of your services involves advising your clients who are putting their home on the market about what would be a realistic price for their home. Currently, your method for pricing new property involves looking at similar listings, consulting with other professionals in your network, and following opinions of market experts. To speed up this process and make it more quantitative, you have decided to estimate a regression function that can be used to predict the price of a home in your area using the home’s information, such as number of rooms, size, etc.

Your task for this assignment is to visit Redfin.com (Links to an external site.) and to retrieve home sale data on in the area of your choice. Using the downloaded data, you will decide which variables you wish to include in your regression model. Once you have specified the model, you will estimate it using MS Excel or any other comparable software of your choice capable of running multiple linear regression.

To help you get up to speed with the technical aspects of this assignment, below is a crash course in linear regression estimation with all the information that you will need for this project.

Linear Regression Modeling

If you decide to use Excel, which will probably be the default choice for most of you, you will need to make sure your copy of Excel has the Analysis ToolPak activated. For information on how to load the Analysis ToolPak in Excel, visit this Microsoft Support (Links to an external site.) page.

A simple bivariate linear regression model has the general form y = b0 + b1x, where b0 is the vertical-axis intercept and b1 is the slope on the variable x (which here is also the slope of the line here in the simple model).

Your model might look like this: price = b0 + b1rooms, where price is the sale price of the home and rooms is the number of rooms in the house.

As you might suspect, this model can be extended to include more than one explanatory variable on the right-hand side. So, we might decide to extend our model to include not only the number of rooms, but also the size of the lot. This revised model would look as follows: price = b0 + b1rooms + b2lotsize. Although you could add as many variable as your imagination can come up with, you should strive for parsimony, meaning, you want the model to be as simple and as elegant as possible, but not at the cost of truncating it for the mere sake of elegance. Include only the major variables that you think determine the value of a home. Obviously, this is an area where you will need to exercise your own judgement. If in doubt as to which variables should most definitely be in the model, feel free to research this further to see what other experts have to say on the subject. Keep also in mind that the variables you choose to include in your model will also be dictated in no small degree by the data that you have at your disposal and by the data that you can realistically expect to obtain.

The variables that you include in your model will fall into two categories: metric and categorical. Metric data is data that was obtained through measurement. Categorical data is data that describes whether something belongs to a certain group.

An example of a model that includes both metric and categorical data is the model price = b0 + b1rooms + b2pool, where pool is a variable that takes into account whether the house for sale has a swimming pool.

In Excel, my spreadsheet with a model like that would look something like this.

Price Rooms Pool
$123,456 2 0
$234,567 3 1
$240,000 4 0
$160,000 2 1
$300,000 4 1
Notice how the explanatory variables rooms and pool are placed next to each other. This is important for Excel, as Excel needs the independent variables to be placed in contiguous (adjacent) columns. Notice also the following. The variable pool actually has two categories: pool available and pool not available. Even though this variable has two categories, we only included one variable, pool (i.e., pool available). The variable pool is also called a dummy variable. When building regression models, it is important to always include one less level of the variable than there are levels. The reasons for this are technical and something you need not concern yourself with, as long as you make sure you avoid falling into the dummy variable trap. Here is another example. Say you decide a variable on the season in which the house was listed. Since there are four seasons (spring, summer, fall, winter), you would include only three of them as individual terms in your regression model. Which three you include is up to you, but keep in mind that the season that you omit will become the point of reference for the other three seasons.

Assuming you have downloaded your data, you might have to also do some cleaning up before you can run the regression. Thus, missing values might have to be dealt with. How you decide to deal with observations that have missing values is up to you, but keep in mind that Excel does not accept cells with missing values (or otherwise you will get an error). Sometimes, researchers also examine each variable for extreme values, called outliers, and sometimes decide to remove extreme values from the analysis. This will be left to you to decide, if you choose to look into this as part of your data clean-up process.

Assuming you have your data and it has been cleaned up, you are ready to run the regression. The following webpage at Excel-easy.com (Links to an external site.) provides brief instructions on how to do that, including a crash-course on how to interpret the results in a little more depth.

Important note: Before you run the regression, choose one observation in your dataset (the last one at the bottom would be a convenient choice) and put it aside (remove it from your data set and write it down or save it somewhere else). You will use the information on this home and its price later to plug its information into your model and see how well your model is performing at estimating prices of homes in the area.

A few more general tips on modeling:

Avoid including variables on the right-hand side that are highly correlated. For example, lot size and square footage are probably strongly correlated. Generally speaking, the bigger the lot, the bigger the house is likely to be, so you will obtain inaccurate results if you include both lot size and square footage of the home in your model.
Whenever possible, use short abbreviations for your variable names in Excel and in the model. For example, use price instead of price_of_home, etc.
Avoid using too many variables just for the sake of increasing the reported R-squared.
The more variables you include on the right-hand side, the more data you will need to estimate a good model. As a rule of thumb, somewhere between 70–80 observations (homes) will be sufficient to get reasonably good results.
Write-up Guidelines and Assignment Submission

This question has been answered.

Get Answer