Population Analysis

Take a look at www.realtor.ca and decide upon a city that you would like to
focus on (e.g., Vancouver). I’d recommend investigating a city with a population of at least 200,000
people. Next, decide what area or suburb you will be focusing on (e.g., Kitsilano) and type the area into
the search bar. Zoom into your suburb/area until there are between 250 and 500 property listings. Refine
your search by selecting “Filter” and define options (1) Building Type = House, and (2) Style = Detached.
Write down the boundaries of your data collection (e.g., Kitsilano: West of Burrard St. and North of 29th).
Randomly select (as best as possible) 100 listings from the total number of property listings. The 100
listings will constitute your sample. MLS search listings are displayed from lowest to highest in prices on
the right side of the webpage. Try to randomly select homes across the spectrum of prices that are
displayed in your search listing. For example, suppose you have a total of 250 listings of homes in your
selected area. Note that on the right side of the webpage, the cheapest 12 of the 250 listings are shown and
© Dr. Michael R. Johnson (2021). All Rights Reserved. All content provided is not to be
used, copied, revised or shared without explicit written permission from copyright holder.
with each sequential page the prices increase. Thus, for this example, there are approximately 250/12 = 21
pages of listings for the 250 houses. Therefore, try to select approx. 5 listings from each of the 21 pages
(rather than selecting all prices in one page) so that the data collected is a better representation of the
prices in the area. Collect the following information (8 variables) from each of the 100 real estate property
listings:
MLS Listing Number (this is REQUIRED or your data is NOT valid!!)
Y = listing price
X1 = interior floor space (square footage of the house)
X2 = land size (square footage). This is usually represented by length of the front of the lot (in ft) x the
depth of the lot (in ft). Sizes range but a common lot size in Vancouver is 33 x 120. You will need to convert this
to feet squared (area) before entering it into your spreadsheet (i.e., 3960)
X3 = number of bedrooms
X4 = number of bathrooms
X5 = age of building (often listed as “Built in” year date. Thus, you will need to calculate it!)
X6 = a variable of your team’s choice! You can use anything here. Perhaps a binary variable?
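The two conversions mentioned above (lot dimensions to square feet for X2, and “Built in” year to age for X5) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the "33 x 120" text format is an assumption about how a listing shows the lot, and the example year values are made up.

```python
def lot_area_sqft(lot_text: str) -> float:
    """X2: parse a lot size such as '33 x 120' (frontage x depth, in feet)
    and return the area in square feet. The text format is an assumption."""
    frontage, depth = (float(part) for part in lot_text.lower().split("x"))
    return frontage * depth


def building_age(built_in_year: int, collection_year: int) -> int:
    """X5: age of the building in years, measured at the data collection year."""
    return collection_year - built_in_year


# The common Vancouver lot quoted in the text: 33 ft x 120 ft
print(lot_area_sqft("33 x 120"))   # 3960.0

# Made-up example: a house built in 1985, with data collected in 2021
print(building_age(1985, 2021))    # 36
```

In Excel the same work is a single formula per column (frontage times depth, and collection year minus built year).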
Key all data into an Excel spreadsheet. Make column A “MLS listing number” and put the one response
(dependent) variable and the 6 (independent) explanatory variables into columns B–H.
Note: Please be sure to carefully listen to Mike’s video that discusses this assignment. You have a lot of
flexibility in terms of what variables to collect (as long as you collect the MLS Listing Number, Listing Price
and 6 independent variables). Also, some listings will not have all the data. If you are collecting data in a
geographical area that does not provide the information above, please move onto another area. You will
find that suburban areas around the Lower Mainland provide the above information quite readily.
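If you would like help spreading your 100 picks across the price-sorted pages, as described in the sampling instructions above, the following is a minimal sketch. It assumes 12 listings per page (the number quoted in the example above); the function name and its parameters are hypothetical, and it only tells you which positions on each page to record.

```python
import math
import random


def pick_listings_per_page(total_listings, per_page=12, sample_size=100, seed=None):
    """Return {page number: positions on that page to record}.

    Picks are spread as evenly as possible across all pages so that the
    sample covers the whole price range instead of one price band.
    """
    rng = random.Random(seed)
    n_pages = math.ceil(total_listings / per_page)
    base, extra = divmod(sample_size, n_pages)
    # 'extra' randomly chosen pages receive one additional pick
    bonus_pages = set(rng.sample(range(n_pages), extra))
    plan = {}
    for page in range(n_pages):
        # the last page may hold fewer than per_page listings
        on_page = min(per_page, total_listings - page * per_page)
        want = base + (1 if page in bonus_pages else 0)
        plan[page + 1] = sorted(rng.sample(range(1, on_page + 1), min(want, on_page)))
    return plan


plan = pick_listings_per_page(250, seed=1)
for page, positions in plan.items():
    print(f"page {page}: record listings at positions {positions}")
```

For the 250-listing example this gives 21 pages with 4 or 5 picks each, for 100 picks in total. If the last page is very short, the total can fall slightly below the requested sample size, so check the count before you start collecting.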
Steps

  1. Check whether the assumptions required by a linear regression model are valid by creating 6
    scatter plots and commenting on them. The regression assumptions are 1) linearity, 2)
    constant variance, 3) normality, and 4) independence. To check these assumptions, plot
    Y versus each Xi for i = 1, 2, …, 6 and comment on whether an
    “approximate” linear relationship is present. Are any of your variables potentially problematic from
    this perspective? (Answer in, at most, one bullet point per scatter plot.) Note: You are not expected to
    provide formal, rigorous tests of the regression assumptions. Produce the scatter plots above only.
  2. Check for potential multicollinearity by creating a correlation matrix. A correlation stronger than ±0.6
    (i.e., above 0.6 or below −0.6) between any two X‐variables indicates that they are somewhat redundant
    and MAY cause problems in your analysis. Do you have any such potential problems? (Answer in, at most,
    three bullet points.)
  3. Now, determine the regression model that best predicts selling price. (Proceed regardless of any
    potential problems you see in steps 1 or 2.)
    (a) You will be performing Backward Step‐Wise Regression. Run a multiple regression of Y on all six X‐
    variables together and review the output.
    (b) Is the overall model significant? Carry out an F‐test to determine this. Be certain to state the
    hypotheses, your decision rule and provide a concluding statement in the context of the problem.
    (c) Are some of the X‐variables clearly insignificant and others apparently significant? Choose one X‐
    variable that seems to be the most insignificant and eliminate it from your analysis. Run another
    regression of Y on the remaining X‐variables. Review the output.
    (d) If the reduced model is still inadequate (i.e., insignificant X‐variables are still present), repeat step
    (c) and try to reduce it further. For each reduction, provide a clear statement why a particular
    independent variable was eliminated.
    (e) Continue this iterative process of running another regression until you have discarded all
    variables that are not significant in predicting the selling price. This is your final reduced model.
    Your final reduced model should be checked for multicollinearity by comparing the
    signs of the coefficients to the correlation matrix. Remember that multicollinearity MAY
    surface in more ways than just high p‐values: incorrect “signs” in front of coefficients need to be
    assessed in relation to the correlation matrix!
    (f) What is your final TRUE REGRESSION model? State your final model using the population
    parameters (It should look something like Y = β0 + β1X1 + β2X2 + … + ε with the Y and all Xi clearly
    defined.)
    (g) What is your estimate of the final TRUE REGRESSION model? (It should look something like ŷ = b0 +
    b1X1 + … but you should have the coefficient estimates from the regression output in place of the bs.)
    (h) Clearly state the meaning of the regression coefficients in part (g) in the context of this problem.
    (i) Provide a proper Hypothesis Test at the 5% significance level on your final estimated regression
    model to demonstrate whether there is a significant linear relationship between each independent
    variable and the dependent variable.
    (j) How good is the fit of your model? Quote a measure from the regression output. Provide a clear
    statement of its meaning.
    (k) Use your estimated regression model to predict the selling price of a house in your selected city of
    choice. Select any reasonable values for your independent variables in your calculation. Create an
    approximate 95% confidence interval of your predicted selling price as conducted in class.
    (l) Provide a conclusion with regard to the predictive utility of the model. In other words, do you think
    the model that you have developed is a good or poor predictor of selling price for your studied
    area? Substantiate your conclusion with statistical evidence that you already conducted. If you
    think it could be improved, how would you improve it?
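The backward elimination loop in steps 3(a) to 3(e) can be sketched outside Excel as well. The version below is a rough illustration using NumPy on synthetic data, not the required method for this assignment. Within one fitted model, the variable with the largest p‐value is the one with the smallest |t|, so the sketch drops the smallest |t| until every remaining |t| clears a cutoff. The cutoff of 1.99 is an assumption (roughly the two‐sided 5% t value for about 90 to 100 degrees of freedom), and the function names are hypothetical.

```python
import numpy as np


def ols_t_stats(X, y):
    """Least-squares fit of y on X with an intercept.
    Returns (coefficients, t statistics); index 0 is the intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # add the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    df = n - A.shape[1]                            # residual degrees of freedom
    sigma2 = resid @ resid / df                    # estimated error variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return coef, coef / se


def backward_eliminate(X, y, names, t_crit=1.99):
    """Drop the least significant X-variable one at a time (t_crit is an assumed cutoff)."""
    keep = list(range(X.shape[1]))
    while keep:
        _, t = ols_t_stats(X[:, keep], y)
        abs_t = np.abs(t[1:])                      # skip the intercept
        weakest = int(np.argmin(abs_t))
        if abs_t[weakest] >= t_crit:
            break                                  # everything left is significant
        print(f"drop {names[keep[weakest]]} (|t| = {abs_t[weakest]:.2f})")
        del keep[weakest]
    return [names[i] for i in keep]


# Synthetic data for illustration only: X3 has no real effect on y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)
names = ["X1", "X2", "X3"]
kept = backward_eliminate(X, y, names)
print("final model uses:", kept)
```

Each printed "drop" line corresponds to the statement you are asked to give in step 3(d) explaining why a variable was eliminated.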
    Please see Page 5 of this assignment for suggestions on how to work between Excel and Word so that
    you can properly format this assignment using the following guidelines (Note: the following are
    guidelines ONLY. As long as your solution file is well organized and neat no marks will be deducted):
    First 3 pages on your Word document: State the community where (approximate is fine) you conducted
    your data collection (e.g., Kitsilano area west of Burrard St. and north of 29th Ave.). Six scatter plots and
    comments about what they tell you about the validity of the assumptions implicit in your use of a linear
    regression model. Make sure to properly label your scatter plots with appropriate labels and a title.
    4th page: A correlation matrix and a comment or two on whether it indicates any potential problems.
    5th page: The regression output using all independent variables and the overall F‐test (formally stated with hypotheses, decision rule and conclusion).
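For the correlation matrix on the 4th page, the ±0.6 screen from step 2 can be sketched as a short script. This is an illustration only: the five‐row columns below are made‐up numbers, not real listings, and the function names are hypothetical. In Excel, the Correlation tool or the CORREL function produces the same coefficients.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def flag_correlated_pairs(columns, threshold=0.6):
    """Return (name1, name2, r) for every pair of X-variables with |r| > threshold."""
    flagged = []
    for (name1, col1), (name2, col2) in combinations(columns.items(), 2):
        r = pearson(col1, col2)
        if abs(r) > threshold:
            flagged.append((name1, name2, round(r, 3)))
    return flagged


# Made-up values for illustration only
columns = {
    "X3": [2, 3, 3, 4, 5],        # bedrooms
    "X4": [1, 2, 2, 3, 3],        # bathrooms
    "X5": [40, 12, 55, 30, 20],   # age of building
}
for name1, name2, r in flag_correlated_pairs(columns):
    print(f"{name1} and {name2}: r = {r} (potential multicollinearity)")
```

With these made‐up numbers only the bedrooms and bathrooms pair is flagged, which is the kind of pair you would comment on in your bullet points.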
