King County Project

Jaden
5 min read · Aug 18, 2020

King County, Washington is home to 2 million people, but to an aspiring data scientist it's a land of countless houses with secret monetary value waiting to be revealed by our analytical skills. The King County housing price data set has been worked on countless times by others on Kaggle, and this is my documented experience with it.

The data was modified by Flatiron staff to give us more of a challenge. The goal was to build a linear regression model to predict the prices of homes in King County. There are many steps we must take before a prediction can even be attempted, so this blog will walk through part of my process and reasoning.

Getting familiar with the Data

The first step I take when working with a large CSV file is to import it as a Pandas DataFrame. The Pandas library contains many methods and functions that help visualize the information. As a visual learner, I feel this helps me understand how the information is structured and answer fundamental questions like: What are the feature names? What is the shape? What are the data types? Etc.

import pandas as pd

#load data
df = pd.read_csv('data/kc_house_data.csv')
#return first 5 rows
df.head()
#return info of df
df.info()
#return descriptive statistics
df.describe()

From these three methods we are able to obtain crucial information about the data we are working with. df.head(), which returns the first 5 rows, gives a good visualization of a small portion of the data. Immediately we see a null value, so that’s something we need to take a look at. From df.info() we can get a sense of the size of the data, 21,597 entries with 21 columns, and the data type of each column. While scanning the data types, sqft_basement stood out to me: this column is an object type, so I noted it for further investigation. Lastly, df.describe() returns a DataFrame with descriptive statistics of our data. From this we can observe outliers, understand the range of our data, etc. The only obvious outlier that stood out to me was bedrooms having a max value of 33, but of course we have to investigate further for a clearer understanding of outliers in our data.

Dealing with nulls

#Total number of nulls for each feature
df.isnull().sum()

We see that 3 columns contain null values. I ended up imputing 0.0 for the nulls. For the full analysis of why I imputed the null values with 0, check out the GitHub repository.
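The imputation itself is quick. A minimal sketch, assuming the three null columns are waterfront, view, and yr_renovated (see the repository for the actual columns and reasoning):

#fill the nulls with 0.0 (column names assumed here)
for col in ['waterfront', 'view', 'yr_renovated']:
    df[col] = df[col].fillna(0.0)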

Handling Outliers

#value count 
df.bedrooms.value_counts()

As we saw before, there was a home with 33 bedrooms. Although possible, it is highly unlikely. We see that only 1 home has 33 bedrooms. There are a few ways we can handle this: completely drop the row, figure out the exact location on Google Maps and count the floors, or replace it with the most common value. I decided to just remove it, along with the data point with 11 bedrooms. Although I believed that this home had 3 bedrooms and 33 was just a typo, I couldn't definitively prove that.
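Dropping both rows takes one line; a minimal sketch:

#keep only homes with fewer than 11 bedrooms
df = df[df['bedrooms'] < 11]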

Uninterpretable Values

#value count
df.sqft_basement.value_counts()

Remember when we noted that the column sqft_basement was an object type? Well, it turns out to be so because 454 entries were “?”. Typically square footage should be stored as integers or floats, just like the other square footage features in the dataset.
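One way to clean this up, assuming we impute 0.0 for the “?” entries just like the other missing values, is to replace them and cast the column to float:

#replace the '?' placeholders (0.0 is an assumed imputation) and cast to float
df['sqft_basement'] = df['sqft_basement'].replace('?', '0.0').astype(float)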

Exploratory Analysis

During the process of data cleaning and exploratory analysis, a few features stood out to me and made me seek answers. To me this is the most exciting part of the project: when you are able to find real-world explanations for your digital findings.

This is a regression plot of square footage of lot against price. We can see a clear linear relationship between the two features: as lot size increases, so does price, which makes sense. A bigger property tends to cost more than a smaller one. But as you can see in this graph, in the red circle, there is a data point all the way on the right side of the graph, yet it's not one of the most expensive properties. Luckily for us, each data point has coordinates (latitude, longitude), which allowed me to find out that this was in fact a farm, which explains why it has such a big lot. Also, according to King County’s website, there are 1,800 farms covering 50,000 acres of land. I dropped the data point from the set because we are trying to model house prices only and this entry could skew the prediction.
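For reference, a plot like this and the coordinate lookup only take a few lines. A sketch, using seaborn and the dataset's lat and long columns:

import seaborn as sns
import matplotlib.pyplot as plt

#regression plot of lot size against price
sns.regplot(x='sqft_lot', y='price', data=df)
plt.show()

#pull the coordinates of the largest lot to look it up on a map
farm_idx = df['sqft_lot'].idxmax()
print(df.loc[farm_idx, ['lat', 'long', 'sqft_lot', 'price']])

#drop the farm from the set
df = df.drop(farm_idx)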

Again, staying with the same logic, having more floors should increase the value of a property. Yet the median price of the houses, according to this box plot, shows a decrease in price when going from 2.5 floors up to 3 floors. Ignoring the possibility that outliers among 2.5-floor homes skew the median price, I searched for an explanation.
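A box plot like this is quick to reproduce; a sketch with the same plotting libraries as above:

#box plot of price by number of floors
sns.boxplot(x='floors', y='price', data=df)
plt.show()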

https://www.king5.com/article/news/local/disaster/why-you-should-be-prepared-3-big-earthquake-threats-in-pnw/281-457421137

I came across numerous news articles that reported multiple earthquakes in Seattle. It turns out King County is located in an area known as the Pacific Ring of Fire, where earthquakes, volcanoes, and tsunamis are frequent. King County goes through thousands of minor earthquakes each year, and although the majority of these natural disasters go unnoticed, the threat of living in an earthquake-prone area could possibly cause a disinterest in taller structures.

OLS Model

The variables selected for our final model are: bedrooms, bathrooms, sqft_above, sqft_basement, yr_built, wf_property, renovated, Condition_3, Condition_4, Condition_5, top_5_zip, floors_2, floors_3, floors_4, grade_7, grade_8, grade_9, grade_10, grade_11, grade_12. The final model achieved an R-squared of 0.693 and an RMSE of 158,084.
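A sketch of the fit with statsmodels, assuming the dummy variables listed above have already been created as columns in df:

import numpy as np
import statsmodels.api as sm

#features selected for the final model
features = ['bedrooms', 'bathrooms', 'sqft_above', 'sqft_basement',
            'yr_built', 'wf_property', 'renovated', 'Condition_3',
            'Condition_4', 'Condition_5', 'top_5_zip', 'floors_2',
            'floors_3', 'floors_4', 'grade_7', 'grade_8', 'grade_9',
            'grade_10', 'grade_11', 'grade_12']

#ordinary least squares with an intercept
X = sm.add_constant(df[features])
model = sm.OLS(df['price'], X).fit()
print(model.summary())

#root mean squared error of the in-sample predictions
rmse = np.sqrt(np.mean((model.predict(X) - df['price']) ** 2))
print(rmse)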
