Mod Three: Classification

Classification Models

What factors make affordable housing actually affordable? That is the question I set out to answer using affordable housing data from the US government, running a logistic regression, a Random Forest, and an XGBoost model. The result was four factors that explained whether or not a housing unit was affordable, defined as rent less than or equal to 30% of income.

Finding a dataset large enough (around 50,000 rows) was difficult, but luckily the government collects survey data on this topic and includes only complete survey responses. Complete responses mean fewer null values to work with. I downloaded the dataset, isolated the key variables, and performed EDA. Although most of the data was present, several factors still had nulls, represented by negative numbers. I replaced these with either the median or the mode, depending on the type of data. I also removed outliers and highly correlated variables, and made sure every column was the appropriate type. Once the data was clean, I proceeded to run models.
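
As a rough sketch of those cleaning steps (the file and column names here are hypothetical; the real survey uses its own variable codes), the imputation and pruning might look like this in pandas:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("housing_survey.csv")  # hypothetical file name

# Nulls are coded as negative sentinel values in the survey.
num_col, cat_col = "RENT", "UNIT_TYPE"  # hypothetical columns
df.loc[df[num_col] < 0, num_col] = df.loc[df[num_col] >= 0, num_col].median()
df.loc[df[cat_col] < 0, cat_col] = df.loc[df[cat_col] >= 0, cat_col].mode()[0]

# Drop rows more than three standard deviations from the mean.
z = (df[num_col] - df[num_col].mean()) / df[num_col].std()
df = df[z.abs() <= 3]

# Drop one feature from each highly correlated pair.
corr = df.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])
```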

The first model was a logistic regression. I created my target variable by assigning “Yes” to units where housing costs were greater than 30% of income and “No” to units where they were less than or equal to 30%; a binary target like this is essential for logistic regression. Then I one-hot encoded my data. When I created my training and testing sets and fitted the model, I surprisingly got 99% accuracy. This was unusually high, so I ran a Random Forest, checked feature importance, and found two features that were overwhelmingly important, one of which was redundant given the other features. Upon removing it, my model showed 77% accuracy, which was more credible.
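
A minimal sketch of that first pass, assuming the cleaned DataFrame from above and a hypothetical COST_RATIO column holding housing costs as a share of income:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Target: "Yes" if housing costs exceed 30% of income, else "No".
y = np.where(df["COST_RATIO"] > 0.30, "Yes", "No")
X = pd.get_dummies(df.drop(columns=["COST_RATIO"]), drop_first=True)  # one-hot encode

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, logreg.predict(X_test)))
```

A near-perfect score like 99% is usually a sign that some feature is leaking the target, which is exactly what the feature importance check surfaced here.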

To try to raise the accuracy, I corrected for class imbalance using SMOTE, which produced a small improvement.
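
A sketch with imbalanced-learn, resampling only the training split so the test set stays untouched:

```python
from imblearn.over_sampling import SMOTE

# Synthesize minority-class examples in the training data only.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
logreg.fit(X_train_res, y_train_res)
print(accuracy_score(y_test, logreg.predict(X_test)))
```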

I moved on to Random Forest, which classifies data with an ensemble of decision trees. The model initially had 77% accuracy, but after running feature importance I eliminated the bottom four variables. Re-running the model resulted in 79% accuracy.
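
Something like the following, reusing the split from the logistic regression:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Rank features and drop the four least important before refitting.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
weakest = importances.nsmallest(4).index
rf.fit(X_train.drop(columns=weakest), y_train)
print(rf.score(X_test.drop(columns=weakest), y_test))
```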

I then fitted the same data with XGBoost, which provided 81% accuracy. After eliminating a few more features based on feature importance, I felt comfortable with the fit of the remaining features.
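
A sketch of the XGBoost step, with the one wrinkle that XGBoost wants numeric labels rather than the “Yes”/“No” strings:

```python
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Map the string labels to 1/0 for XGBoost.
y_train_num = (y_train == "Yes").astype(int)
y_test_num = (y_test == "Yes").astype(int)

xgb = XGBClassifier(n_estimators=100, eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train_num)
print(accuracy_score(y_test_num, xgb.predict(X_test)))
```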

Project 2: Hypothesis Testing

For this project, we had to conduct four hypothesis tests on data contained in a provided SQL database.

This required accessing the appropriate data with SQL, designing hypothesis tests, and then conducting the tests using t-tests and ANOVA.

The first hypothesis test was given to us: do customers buy more discounted products? At what level of discount?

To test this, we needed to define the null hypothesis: that discounts have no effect on the quantity of products purchased, or H0: μ_discount = μ_no-discount. Our alternative hypothesis is that customers buy more discounted products, or Ha: μ_discount > μ_no-discount. The plan was to get the data from the “Order Detail” table, sample it, and then run a two-sample t-test (one-tailed, since the alternative hypothesis is directional). Our alpha, for this and all other tests, was set at .05.

This was simple enough. The data was all contained in the Order Detail table, so no SQL joins were needed. I simply pulled the Quantity and Discount attributes into two separate tables: the control, where Discount = 0, and the experimental, where Discount != 0. My final tables held the average quantity purchased for each product in the two respective groups: discounted and not discounted.
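
In Python, that amounts to a couple of sqlite3 queries (the file and table names here follow the write-up but may differ in the actual database):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("database.sqlite")  # hypothetical file name

control = pd.read_sql(
    "SELECT Quantity, Discount FROM OrderDetail WHERE Discount = 0;", conn)
experimental = pd.read_sql(
    "SELECT Quantity, Discount FROM OrderDetail WHERE Discount != 0;", conn)
```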

I then sampled the two groups so that the sample means were approximately normally distributed, ran some cursory stats on them, checked the effect size using Cohen's d, and performed a t-test. Just looking at the histograms and the means of the two groups showed a difference in average quantity bought. The effect size was greater than .8, so the difference we see is meaningful even taking into account the units involved: there is a large effect. The t-statistic was used in conjunction with the degrees of freedom to produce a p-value. That p-value was less than our alpha of .05, so we rejected the null hypothesis. Discounts do affect the quantity of products purchased.
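
A sketch of the effect size and the test, skipping the resampling step for brevity:

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    # Effect size: difference in means over the pooled standard deviation.
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
                 / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

quantity_exp = experimental["Quantity"]
quantity_ctl = control["Quantity"]
print("Cohen's d:", cohens_d(quantity_exp, quantity_ctl))

t_stat, p_value = stats.ttest_ind(quantity_exp, quantity_ctl)
print("t =", t_stat, "p =", p_value)
```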

To see at what level, I split the discounts into three tiers based on the size of the discount: small, medium, and large. I performed a t-test on each group to see whether it had a significant effect on the quantity of products purchased. All three had p-values less than .05, so every level of discount is relevant.
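
A sketch of the per-tier tests, with hypothetical cut points for small, medium, and large:

```python
# Hypothetical bin edges for the three discount tiers.
tiers = {"small": (0.00, 0.10), "medium": (0.10, 0.20), "large": (0.20, 1.00)}

for label, (lo, hi) in tiers.items():
    group = experimental.loc[
        (experimental["Discount"] > lo) & (experimental["Discount"] <= hi),
        "Quantity"]
    t, p = stats.ttest_ind(group, control["Quantity"])
    print(f"{label}: t = {t:.2f}, p = {p:.4f}")
```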

That concluded the first hypothesis test. I then designed three more: 1) does price affect whether a product is reordered? 2) does the season affect the quantity of products purchased? and 3) what is the effect of quantity and price on whether a product is discontinued?

For each of these, I used SQL to get the data I needed, sampled the data, and then performed hypothesis tests to see whether the null hypothesis could be rejected.

Findings:

  1. More expensive products are not reordered.
  2. People buy more products in winter.
  3. Quantity has a significant effect on whether a product is discontinued, and so does price.