Mod Three: Classification

Classification Models

What factors make affordable housing actually affordable? That is the question I sought to answer using affordable housing data from the US government, running a logistic regression, a Random Forest, and an XGBoost model. The result was four factors that explained whether or not a housing unit was affordable, defined as rent being less than or equal to 30% of income.

Finding a dataset large enough (around 50,000 rows) was difficult, but luckily the government collects survey data on this topic and includes only complete survey responses. Complete responses mean fewer null values to work with. I downloaded the dataset and began to explore it, isolating the key variables before performing EDA. Although most of the data was present, several factors still had nulls, represented as negative numbers. I replaced these with either the median or the mode, depending on the type of data. I also removed outliers and highly correlated variables, and made sure each column was the appropriate data type. Once the data was clean, I proceeded to run models.
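Roughly, the cleaning step looked like the sketch below. The file name and column names (RENT, HINCP, TENURE, BLD) are placeholders rather than the actual survey fields, but the pattern of converting negative codes to nulls and filling with the median or mode is the same.

import pandas as pd

df = pd.read_csv("ahs_housing.csv")     # placeholder file name

numeric_cols = ["RENT", "HINCP"]        # numeric fields (placeholder names)
categorical_cols = ["TENURE", "BLD"]    # categorical fields (placeholder names)
cols = numeric_cols + categorical_cols

# Negative codes stand in for missing answers, so convert them to NaN first
df[cols] = df[cols].where(df[cols] >= 0)

# Fill numeric gaps with the median and categorical gaps with the mode
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])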

The first model was a logistic regression. I created my target variable by assigning “Yes” to units where housing costs were greater than 30% of income and “No” to units where they were less than or equal to 30%; a binary target like this is essential for logistic regression. I then one-hot encoded my data. When I created my training and testing sets and fitted the model, I surprisingly got 99% accuracy. This was unusually high, so I ran a Random Forest, checked feature importance, and found two features that were overwhelmingly important, one of which was redundant given the other features. After removing it, my model reflected 77% accuracy, which was more credible.
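In sketch form, the workflow was something like the following, continuing from the cleaned df above. The RENT and HINCP columns and the exact target formula are placeholders for illustration; the raw cost and income fields are dropped from the predictors here because the target is computed directly from them.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Binary target: "Yes" where yearly housing costs exceed 30% of income
df["cost_burden"] = np.where(df["RENT"] * 12 > 0.30 * df["HINCP"], "Yes", "No")

# One-hot encode the predictors; drop the raw cost and income fields,
# since the target is derived directly from them
X = pd.get_dummies(df.drop(columns=["cost_burden", "RENT", "HINCP"]),
                   drop_first=True, dtype=int)
y = df["cost_burden"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("Logistic regression accuracy:", accuracy_score(y_test, logreg.predict(X_test)))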

To try to raise the accuracy, I corrected for the imbalanced data using SMOTE, which yielded a small improvement.
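The resampling itself is a couple of lines with imbalanced-learn, applied to the training split only so the test set stays untouched:

from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score

# Oversample the minority class in the training data, then refit the model
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

logreg.fit(X_train_res, y_train_res)
print("After SMOTE:", accuracy_score(y_test, logreg.predict(X_test)))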

I moved on to a Random Forest, which uses an ensemble of decision trees to classify data. The model originally had 77% accuracy, but after running feature importance I eliminated the four least important variables. Re-running the model resulted in 79% accuracy.
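A sketch of that step, reusing the split from above (the choice of 100 trees is arbitrary here, not a tuned value):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Rank features by importance and drop the four least important ones
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values()
least_important = importances.index[:4]

rf.fit(X_train.drop(columns=least_important), y_train)
print("Random Forest accuracy:",
      accuracy_score(y_test, rf.predict(X_test.drop(columns=least_important))))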

I took that reduced feature set and fitted it with XGBoost, which provided 81% accuracy. After eliminating a few more features based on feature importance, I felt comfortable with the fit of the remaining features.
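The XGBoost step follows the same pattern. In the sketch below the hyperparameters are defaults rather than anything tuned, and the labels are mapped to 0/1 because XGBoost expects numeric classes:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Map "Yes"/"No" to 1/0 for XGBoost
y_train_num = (y_train == "Yes").astype(int)
y_test_num = (y_test == "Yes").astype(int)

# Reuse the reduced feature set from the Random Forest step
X_train_red = X_train.drop(columns=least_important)
X_test_red = X_test.drop(columns=least_important)

xgb = XGBClassifier(n_estimators=200, random_state=42)
xgb.fit(X_train_red, y_train_num)
print("XGBoost accuracy:", accuracy_score(y_test_num, xgb.predict(X_test_red)))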
