Housing Costs

12 Jan 2020 - Roberto Lambert

INTRODUCTION

In this project, the monthly housing cost is predicted by applying several supervised machine learning models available in the Python scikit-learn library. Such an application could benefit a potential home buyer by providing a cost estimate of a property before closing the deal. Government agencies could use it to identify households in financial distress in a given region, thereby informing their policy decisions.

DATA

The data set used is the Housing Affordability Data System (HADS), which consists of individual data sets spanning the years 1985 to 2013. The data sets are available for download here: American Housing Survey: Housing Affordability Data System. The HADS data are sourced from the American Housing Survey (AHS) national sample microdata and the AHS metropolitan sample microdata, so more details can be obtained by referring to the AHS documentation. Each row in the data is an observation of a housing unit.

To avoid data leakage, the features corresponding to cost measures (which are closely tied to the target variable) are not fed into the machine learning models. The table below lists the features used to predict the monthly housing cost.

Feature Name Description
AGE Age of head of household
BEDRMS Number of bedrooms in unit
FMR Fair market rent
INCRELAMIPCT Household income relative to area median income (percent)
IPOV Poverty income
LMED Area median income (average)
NUNITS Number of units in building
PER Number of persons in household
ROOMS Number of rooms in unit
VALUE Current market value of unit
REGION Census region
YEAR Year of housing survey
FMTBUILT Year unit was built
FMTASSISTED Assisted housing
FMTSTRUCTURETYPE Structure type
FMTMETRO Central city/suburban status
FMTZADEQ Adequacy of unit

The HADS data files provide additional features that are not included in the following analysis; these discarded variables are related to either LMED or INCRELAMIPCT.

Note that the features whose names start with FMT (for formatted) are categorical variables. One-hot encoding was therefore applied to these, yielding a 29-dimensional feature space. After various data cleaning/wrangling steps, the data set contains 486,785 observations.
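As a rough sketch of this step (the file name and exact loading details here are assumptions; only the FMT naming convention and the use of one-hot encoding come from the analysis):

```python
import pandas as pd

# Hypothetical file name; the HADS data are distributed as CSV-style text files.
hads = pd.read_csv("hads_combined.csv")

# Columns whose names start with FMT are categorical.
fmt_cols = [c for c in hads.columns if c.startswith("FMT")]

# One-hot encode the categorical columns; numeric columns pass through unchanged.
hads_encoded = pd.get_dummies(hads, columns=fmt_cols)
```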

For the analysis that follows, Python code was written in various Jupyter notebooks. The notebook files can be found here.

MODELS

Two supervised learning algorithms are applied to predict the monthly housing cost (column name ZSMHC in the data files): ridge linear regression and random forest. The data is split into a training set and a test set; the training set is used to fit each model, and the test set is used to calculate the model's score. The table below lists the scores for the two models [1]:

Model Score
Linear Ridge 0.554
Random Forest 0.606
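A minimal sketch of this split-fit-score workflow (variable names are illustrative, and hads_encoded carries over from the encoding sketch above):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

# Features are the columns listed in the DATA section; ZSMHC is the target.
X = hads_encoded.drop(columns=["ZSMHC"])
y = hads_encoded["ZSMHC"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (Ridge(alpha=10), RandomForestRegressor(n_estimators=100)):
    model.fit(X_train, y_train)
    # .score() returns the R-squared value on the held-out test set.
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```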

Ridge Linear Model

This model is applied for its simplicity and the easy interpretability of its output coefficients. First, the hyperparameter alpha is tuned over the range 0.01 to 100. The best value was found to be alpha = 10, although the score was roughly the same across all values of alpha tried (about 0.57). To compare the relative importance of the coefficients, the data was scaled to mean 0 and variance 1, and the scaled training data was used to fit a ridge instance with alpha set to 10. The corresponding coefficients for the scaled data are plotted below.
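A sketch of the tuning and scaling steps just described (the alpha grid spacing and cross-validation settings are assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Tune alpha over 0.01 to 100 (logarithmic spacing is an assumption).
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-2, 2, 5)})
search.fit(X_train, y_train)
print(search.best_params_)  # alpha = 10 in the analysis above

# Scale features to mean 0, variance 1 so coefficient magnitudes are comparable;
# the pipeline re-applies the same scaling at prediction time.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10)).fit(X_train, y_train)
coefs = sorted(zip(X_train.columns, ridge[-1].coef_), key=lambda t: -abs(t[1]))
```

Wrapping the scaler and the ridge model in a pipeline keeps the scaling consistent between fitting and later predictions.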

[Figure: coefficients of the scaled linear ridge model]

Random Forest

The next model used is the random forest, a popular model that, unlike the ridge linear model, can capture nonlinear effects in the data. The training data is fed into a random forest instance with default parameters, giving a score on the test data of around 0.61. Next, the training set is used for 3-fold cross validation, yielding scores in the 0.60-0.62 range. The feature importance values, which sum to one, are plotted below; a higher value indicates that a feature contributes more to the model's predictions.
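A sketch of these steps (cross-validation details beyond the 3 folds are assumptions):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))          # about 0.61 in the analysis above

# 3-fold cross validation on the training set (scores of roughly 0.60-0.62).
print(cross_val_score(rf, X_train, y_train, cv=3))

# Feature importances sum to one; sort from most to least important.
importances = sorted(zip(X_train.columns, rf.feature_importances_),
                     key=lambda t: -t[1])
```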

[Figure: random forest feature importances]

Note that parameter tuning was performed over the following parameter grid [2]:

Grid 1
Parameter Values
bootstrap True, False
max_depth 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, None
max_features auto, sqrt, log2
min_samples_leaf 1, 2, 4
min_samples_split 2, 5, 10
n_estimators 20, 50, 100, 150, 200

Instead of scanning the entire parameter space, 100 parameter combinations were chosen at random and the score was evaluated at each of these points. Over these 100 points, the score varied from 0.26 up to 0.62. The best score obtained was 0.619, with the following best parameter values:

Best Parameters Grid 1
Parameter Value
bootstrap False
max_depth 18
max_features sqrt
min_samples_leaf 1
min_samples_split 10
n_estimators 200
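This randomized search could look roughly as follows (note that the "auto" option for max_features reflects the scikit-learn versions of the time):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "bootstrap": [True, False],
    "max_depth": [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, None],
    "max_features": ["auto", "sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [20, 50, 100, 150, 200],
}

# Sample 100 random parameter combinations instead of scanning the full grid.
search = RandomizedSearchCV(RandomForestRegressor(), param_dist,
                            n_iter=100, random_state=0)
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)
```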

Next, a second grid around the best parameter values listed above was created:

Grid 2
Parameter Values
bootstrap False
max_depth 16, 18, 20
max_features sqrt
min_samples_leaf 1, 2
min_samples_split 2, 5, 10
n_estimators 100, 200, 500

The score was evaluated at every point in this grid, with scores ranging from 0.616 up to the best score of 0.619. The best parameters turned out to be:

Best Parameters Grid 2
Parameter Value
bootstrap False
max_depth 20
max_features sqrt
min_samples_leaf 1
min_samples_split 10
n_estimators 500
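Since Grid 2 contains only 54 combinations, an exhaustive search is feasible; a minimal sketch:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "bootstrap": [False],
    "max_depth": [16, 18, 20],
    "max_features": ["sqrt"],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [100, 200, 500],
}

# Evaluate every point in the refined grid (3 x 2 x 3 x 3 = 54 combinations).
grid_search = GridSearchCV(RandomForestRegressor(), param_grid)
grid_search.fit(X_train, y_train)
print(grid_search.best_score_, grid_search.best_params_)
```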

For the results that follow in the discussion section, n_estimators (the number of trees) was set to 100 and the other parameters were left at their default values. These settings result in a score of 0.61, which is quite close to the best score of 0.62 obtained from the parameter grid searches.

Finally, the plot below shows the average fit time versus the n_estimators parameter for the points from the second grid. Note the positive linear relationship. For our purposes, a 100-tree random forest might not give the best score (although it comes close), but it runs about 5 times faster than a 500-tree random forest.
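The fit times behind this plot can be read straight out of the search results; a sketch, assuming the grid_search object from the Grid 2 sketch above:

```python
import pandas as pd

# cv_results_ records the mean fit time of each parameter combination.
results = pd.DataFrame(grid_search.cv_results_)
fit_times = results.groupby("param_n_estimators")["mean_fit_time"].mean()
print(fit_times)  # expect a roughly linear increase with the number of trees
```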

[Figure: average fit time versus number of trees]

Support Vector Machine

Finally, a support vector machine (SVM) model is used to predict whether or not a housing unit's monthly cost is in the top 10%. For this part, a new column was created indicating whether a housing unit is in the top 10% of monthly housing cost for its survey year; this column becomes the variable the SVM predicts. The results of this model are presented in the next section. Again, the training data is used to fit the model and the test data is used to quantify prediction accuracy.
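A sketch of this setup (the SVM variant is an assumption; LinearSVC is used here because it scales to a data set of this size, but the original kernel choice is not stated):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Label: 1 if the unit is in the top 10% of monthly cost for its survey year.
cutoff = hads_encoded.groupby("YEAR")["ZSMHC"].transform(lambda s: s.quantile(0.9))
top10 = (hads_encoded["ZSMHC"] > cutoff).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, top10, random_state=0)

svm = LinearSVC().fit(X_tr, y_tr)
pred = svm.predict(X_te)
print(accuracy_score(y_te, pred),
      precision_score(y_te, pred),
      recall_score(y_te, pred))
```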

DISCUSSION

In the table below, the top 10 features are listed for the ridge linear and random forest models. Both models agree that the housing unit market value (feature name VALUE) is the most important factor in determining the monthly housing cost.

For the random forest, the second most important predictor is AGE (age of head of household), followed by INCRELAMIPCT (income). The next most important predictors for this model are variables that measure local market conditions: IPOV (poverty-level income), FMR (fair market rent), and LMED (median income). The least important variables include the survey year (YEAR) and variables describing the housing unit's characteristics (ROOMS, BEDRMS, and PER, the number of people).

For the linear model, the same set of variables (except for BEDRMS) is considered important; however, the order of importance differs from that of the random forest model. Whether the housing unit was built in 2000-2009 was also an important predictor for the linear model.

Ridge Linear Random Forest
Feature Coefficient Feature Importance
VALUE 0.396 VALUE 0.483
IPOV 0.301 AGE 0.120
PER 0.179 INCRELAMIPCT 0.086
AGE 0.173 IPOV 0.066
FMR 0.146 FMR 0.064
YEAR 0.135 LMED 0.054
INCRELAMIPCT 0.125 ROOMS 0.027
ROOMS 0.104 YEAR 0.023
LMED 0.099 BEDRMS 0.012
FMTBUILT_2000-2009 0.088 PER 0.007

Next, the percent error is calculated for each model. As a baseline for comparison, a third model is defined that predicts the sample average regardless of the feature values; such a model has an R-squared score of 0 by construction. The average percent error is plotted across survey years for the linear ridge, random forest, and baseline models. The random forest has by far the smallest average percent error. For earlier survey years, the linear model performs better than the baseline model; for later survey years, the baseline model outperforms the linear model.
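A sketch of this comparison, assuming the models fitted earlier and taking the absolute percent error (the exact error definition used in the plots is not stated):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Baseline: always predict the training-set mean (an R-squared score of 0).
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

def mean_percent_error(model, X, y):
    # Average absolute percent error; assumes y is strictly positive.
    return np.mean(np.abs(model.predict(X) - y) / y) * 100

for model in (ridge, rf, baseline):
    print(type(model).__name__,
          round(mean_percent_error(model, X_test, y_test), 1))
```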

[Figure: average percent error by survey year for all three models]

The plot below shows only the linear ridge and random forest models:

[Figure: average percent error by survey year, linear ridge and random forest only]

Below are boxplots of the percent error for the random forest model (full plot and zoomed-in version):

[Figure: boxplot of random forest percent error]

[Figure: boxplot of random forest percent error, zoomed in]

For the support vector machine, a binary variable is defined indicating whether the housing unit is above the 90th percentile of monthly housing cost. The accuracy, precision, and recall are calculated for this variable [3]. These values are listed below.

Metric Value
Accuracy 0.925
Precision 0.747
Recall 0.421

These values are plotted versus survey year below. Note that for 1985 and 1987, the second column of the confusion matrix (the predicted positives) consists of zeroes; the precision is therefore undefined and has been plotted as zero.

[Figure: SVM accuracy, precision, and recall by survey year]

FURTHER WORK

To improve the predictive power of these machine learning models, several further steps will be considered in future work.

CONCLUSION

In conclusion, a linear ridge model and a random forest model were used to predict the monthly housing cost. Both models determined that the housing unit's market value is an important indicator of the monthly housing cost, and both identified almost the same set of important predictor variables, though the order of importance differed. The random forest model has, on average, a smaller percent error than the linear ridge model. A support vector machine was used to predict whether or not a housing unit is in the top 10% of monthly housing cost, and its accuracy, precision, and recall were calculated. Finally, ideas for improving the predictive power of these models were discussed.


[1] The score function used is the R-squared value. Other score functions are available in scikit-learn. For R-squared, a score of 1 indicates a model that exactly predicts the actual values.

[2] See the scikit-learn RandomForestRegressor documentation for the definitions of these parameters.

[3] Accuracy is the sum of true positives and true negatives divided by the total number of observations; precision is true positives divided by all predicted positives; recall is true positives divided by all actual positives.