Aidan Cole, Sean McClellan, and Ben Aiken
Predicting home values is incredibly challenging: homebuyers value features differently, and housing markets fluctuate for a variety of reasons. The tremendous financial decision of buying a home is influenced greatly by internal features such as square footage and number of bedrooms, especially as more Americans work from home. However, predicting home prices exclusively from internal features, ignoring external amenities and public services, is a grave mistake.
Using the hedonic model framework, we set out to develop an accurate predictive model for the Boulder, Colorado housing market that considers both internal and external features. The hedonic model suggests:
Home Price = Internal Characteristics + Public Services/Amenities + Spatial Structure + Error
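Expressed as a regression (a standard hedonic specification; this formalization is an editorial gloss rather than the exact functional form fit below):

$$\text{Price}_i = \beta_0 + \sum_j \beta_j X_{ij} + \sum_k \gamma_k A_{ik} + \varepsilon_i$$

where $X_{ij}$ are the internal characteristics of house $i$, $A_{ik}$ are the public service, amenity, and spatial measures, each coefficient is the implicit marginal price of one unit of the corresponding characteristic, and $\varepsilon_i$ is the error term.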
Internal characteristics such as square footage, heating and cooling systems, and number of rooms are included along with factors such as proximity to schools and local crime rates. While our new model is not perfect, it is a great improvement on the internal-features-only model and produces limited error. Implementing this model will give Zillow more accurate data for its users and customers, thus maximizing the profits that stem from Zillow services.
We utilized the student dataset for internal features and found public service and amenity data online. Additionally, we developed several new variables that improved upon the provided data, such as total square feet and heating value, both derived from existing variables.
Our public services and amenities data were downloaded from the Census as well as from the Boulder County and City of Boulder open data portals, including the locations of bicycle thefts, emergency calls, accessory dwelling units, and playgrounds. We also found school and hospital locations from state and national websites. Several of the datasets were not in the same coordinate reference system and therefore required transformation. Additionally, since we were unable to find a geospatial crime dataset, the 28-day police call data was selected as a close approximation of the locations of crimes, or at the very least threats of crimes. However, that dataset was exhaustive and included calls unlikely to indicate where crimes took place, such as information calls. To refine the data to the most relevant locations, only the violent or drug-related calls were kept and the rest were filtered out.
#load libraries and functions
library(tidyverse)
library(tidycensus)
library(sf)
library(kableExtra)
library(tigris)
library(ggmap)
library(raster)
library(stargazer)
library(caTools)
library(caret)
library(spdep)
library(ggcorrplot)
library(GISTools)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(jtools)
library(ggstance)
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
#load styling options
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 16,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 14))
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 16,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
# Load Quantile break functions
qBr <- function(df,variable,rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]],0),
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]],
c(.01,.2,.4,.6,.8), na.rm=T), digits = 3))
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
# Load hexadecimal color palette
palette5 <- c("#edf8fb","#b3cde3","#8c96c6","#8856a7","#810f7c")
# wrangle internal features and census data
# --housing data--
boulder <- st_read("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/studentData.geojson",crs = 'ESRI:102254')%>%
filter(MUSA_ID!=8735)
boulder.sf <- boulder%>%
dplyr::select(price,builtYear,qualityCode,nbrBedRoom,nbrRoomsNobath,bsmtSF,carStorageSF,mainfloorSF,nbrThreeQtrBaths,nbrFullBaths,nbrHalfBaths,TotalFinishedSF,HeatingDscr,toPredict,geometry)
# Make a column that has the REAL total square foot of house
boulder.sf <- boulder.sf%>%
mutate(TotalSF = ifelse(TotalFinishedSF< bsmtSF+carStorageSF+mainfloorSF, bsmtSF+carStorageSF+mainfloorSF, TotalFinishedSF))
# Make a column that has the price per square foot
boulder.sf <- boulder.sf%>%
mutate(PricePerSq = price/TotalSF) # %>% na.omit() takes na values out of map below
# Make a column that has house age
boulder.sf <- boulder.sf%>%
mutate(Age = 2021-builtYear)
# Delete houses that have insufficient data
neatboulder.sf <- subset(boulder.sf, TotalSF!=0 & TotalSF!=1)
# --Census--
census_api_key("YOUR_CENSUS_API_KEY", overwrite = TRUE) # key redacted; substitute your own Census API key
tracts19 <-
get_acs(geography = "tract", variables = c("B01003_001E","B02001_002E","B02001_003E",
"B03002_012E","B02001_005E","B19013_001E",
"B15002_015E","B15002_032E","B06012_002E"),
year=2019, state=08, county=013, geometry=T, output='wide') %>%
st_transform('ESRI:102254') %>%
rename(TotalPop = B01003_001E,
Whites = B02001_002E,
FemaleBachelors = B15002_032E,
MaleBachelors = B15002_015E,
MedHHInc = B19013_001E,
AfricanAmericans = B02001_003E,
Latinos = B03002_012E,
Asians = B02001_005E,
TotalPoverty = B06012_002E) %>%
dplyr::select(-NAME, -starts_with("B")) %>%
mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
pctBlack = ifelse(TotalPop > 0, AfricanAmericans / TotalPop,0),
pctLatino = ifelse(TotalPop > 0, Latinos / TotalPop,0),
pctAsian = ifelse(TotalPop > 0, Asians / TotalPop,0),
pctBachelors = ifelse(TotalPop > 0, ((FemaleBachelors + MaleBachelors) / TotalPop),0),
pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
year = "2019") %>%
dplyr::select(-Whites, -FemaleBachelors, -MaleBachelors, -TotalPoverty, -AfricanAmericans, -Asians, -Latinos)
# wrangling external features
# --Libraries--
# Read in the library CSV data and store it in a variable
libAddress <- read.csv("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/Libraries.csv")
# Convert lat/long to a sf
libs_sf <- libAddress %>%
st_as_sf(coords = c("X", "Y"), crs=4326)%>%
st_transform('ESRI:102254') #was 102253
# --Inc/Poverty--
# Read in the income/poverty CSV data and store it in a variable
inc_pov <- read.csv("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/Income_Poverty_(Census_Tracts).csv")
# Filter to only Boulder county
bol_pov <- filter(inc_pov, County=="BOULDER")
#Change FIPS to GEOID
colnames(bol_pov)[2] <- "GEOID"
# Edit GEOID column in tracts data to match that of bol_pov by stripping the
# leading zero from the state FIPS code ("08013..." becomes "8013...")
tracts19$GEOID <- as.character(tracts19$GEOID)
tracts19$GEOID <- sub("^0", "", tracts19$GEOID)
inc_pov_tracts <- merge(x = tracts19, y = bol_pov, by = "GEOID", all.x=TRUE)
# --Hospitals--
# Read in the hospital CSV data and store it in a variable
hospAddress <- read.csv("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/us_hospital_locations.csv")
# Filter to only Boulder county
bol_hosp <- filter(hospAddress, COUNTY=="BOULDER")
# Convert lat/long to a sf
hosps_sf <- bol_hosp %>%
st_as_sf(coords = c("LONGITUDE","LATITUDE"), crs=4326) %>%
st_transform('ESRI:102254') #was 102253
# --Playgrounds--
# Read in playground geojson data and store it in a variable
playgrounds <- st_read("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/Playground_Sites_Points.geojson",crs = 4326) %>%
dplyr::select(OBJECTID,NAME,geometry)%>%
st_transform('ESRI:102254') #was 102253
# Delete NA values
playgrounds[is.na(playgrounds)] = 0
# --Schools--
# Read in school geojson data and store it in a variable
schools <- st_read("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/CDPHE_CDOE_School_Locations_and_District_Office_Locations.geojson",crs=4326)
# Filter to Boulder county
bol_schools <- filter(schools, COUNTY=="BOULDER") %>%
st_transform('ESRI:102254') #was 102253
# --Municipalities--
# Read in municipalities geojson data and store it in a variable
munis <- st_read("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/Municipalities.geojson",crs=4326)%>%
st_transform('ESRI:102254')
# Select wanted columns
bol_munis <- munis %>%
dplyr::select(ZONEDESC,ShapeSTArea,ShapeSTLength,geometry)
# --Stolen Bikes--
# Read in stolen bike geojson data and store it in a variable
bikes <- st_read("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/Stolen_Bikes.geojson",crs=4326)%>%
st_transform('ESRI:102254')
# Drop two records with missing location data
cleanbikes <- subset(bikes, OBJECTID!=2173514 & OBJECTID!=2172860)
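Hard-coding the two offending OBJECTIDs works, but a more general filter is possible if, as we assume, those records are the ones with empty geometries (a sketch, not the code used to produce the results below):
# Alternative sketch: drop any feature whose geometry is empty
# (assumes the two hard-coded records are exactly the empty-geometry rows)
cleanbikes <- bikes %>% filter(!st_is_empty(geometry))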
# --Accessory Dwelling Units--
dwellingUnits <- st_read("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/Accessory_Dwelling_Units.geojson",crs=4326) %>%
st_transform('ESRI:102254')
# --28 Day Police Call data--
# Read in police call geojson data and store it in a variable
police <- st_read("/Users/baiken/Desktop/MUSA/508 - Public Policy Analytics/Mid-term/Police_Call_Data_28_Days.geojson")
#filter out violent/drug crimes
Dangerous.policecalls <-
police %>%
filter(str_detect(PROBLEM, "ASSAU|SAS|WEAP|SHOOT|SHOTS|NARCO|DTF|DOME|HARAS|ROBBE"))
#extract x and y (GeoJSON point coordinates are stored as (longitude, latitude))
separated_coord <- Dangerous.policecalls %>%
mutate(long = unlist(map(Dangerous.policecalls$geometry,1)),
lat = unlist(map(Dangerous.policecalls$geometry,2)))
just_coord <- st_drop_geometry(separated_coord)
#make new geometry
police4326 <-
just_coord %>%
st_as_sf(coords = c("long", "lat"), crs = 4326, agr = "constant")
#reproject to esri 102254
cleanedpolice <-
police4326 %>%
st_transform('ESRI:102254')
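If the police GeoJSON was simply read without a declared CRS, the coordinate extraction and rebuild above could in principle be replaced by declaring the CRS directly; a minimal sketch, assuming the stored coordinates are already (longitude, latitude):
# Sketch: declare the CRS instead of rebuilding the geometry
cleanedpolice <- Dangerous.policecalls %>%
st_set_crs(4326) %>%
st_transform('ESRI:102254')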
# Buffers and Nearest Neighbors
# Filter out only geometries
libraryGeometry <- subset(libs_sf, select = -c(OBJECTID, Name, Address,
City, Phone, State, Notes))
hospitalGeometry <- subset(hosps_sf, select = -c(X, Y, FID, ID, NAME, ADDRESS,
CITY, STATE, ZIP, ZIP4,
TELEPHONE, TYPE, STATUS,
POPULATION, COUNTY, COUNTYFIPS,
COUNTRY, NAICS_CODE, NAICS_DESC,
SOURCE, SOURCEDATE, VAL_METHOD,
VAL_DATE, WEBSITE, STATE_ID,
ALT_NAME, ST_FIPS, OWNER,
TTL_STAFF, BEDS, TRAUMA,
HELIPAD))
playgroundGeometry <- subset(playgrounds, select = -c(OBJECTID, NAME))
schoolGeometry <- subset(bol_schools, select = -c(OBJECTID, School_Code, School_Name,
Address, City, State, Zip, Lowest_Grade,
Highest_Grade, District_Code, District_Name,
Setting, Charter_YN, Type_, COUNTY,
LATITUDE, LONGITUDE, Phone))
dwellingUnitsGeometry <- subset(dwellingUnits, select = -c(OBJECTID, CASENUMBER, UNITTYPE, ADDRESS,
CLASSIFICATION, DECISIONDATE, SUBMITTALDATE,
DECISION))
bikesGeometry <- subset(cleanbikes, select = -c(OBJECTID, REPORTNUM, NUMBIKESSTOLEN, STOLENVALUE,
NUMBIKESRECOVERED, RECOVEREDVALUE, ARREST, REPORTYEAR,
REPORTDATE, ADDRESS, CITY, DISTRICT, BEAT, Shape))
policeGeometry <- subset(cleanedpolice, select = -c(OBJECTID, ID, INCNUM, AGENCY,
RESPONSEDATE, FIRSTUNITARRIVEDTIME,
PROBLEM, DISPOSITION, CALLCLOSEDTIME,
CALLDURATION, ADDRESS))
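Each subset() call above exists only to drop the attribute columns; because sf geometry is "sticky", the same geometry-only layers can be produced more compactly (a sketch):
# Compact equivalent: keep only the geometry column
libraryGeometry <- dplyr::select(libs_sf, geometry)
hospitalGeometry <- dplyr::select(hosps_sf, geometry)
# ...and likewise for the playground, school, ADU, bike, and police layers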
# Add buffer columns: count features within 3,219 m (~2 miles) of each home
neatboulder.sf$hosp_count <-
st_buffer(neatboulder.sf, 3219) %>%
aggregate(mutate(hospitalGeometry, counter = 1),., sum) %>%
pull(counter)
neatboulder.sf$lib_count <-
st_buffer(neatboulder.sf, 3219) %>%
aggregate(mutate(libraryGeometry, counter = 1),., sum) %>%
pull(counter)
neatboulder.sf$playground_count <-
st_buffer(neatboulder.sf, 3219) %>%
aggregate(mutate(playgroundGeometry, counter = 1),., sum) %>%
pull(counter)
neatboulder.sf$school_count <-
st_buffer(neatboulder.sf, 3219) %>%
aggregate(mutate(schoolGeometry, counter = 1),., sum) %>%
pull(counter)
neatboulder.sf$stolenbike_count <-
st_buffer(neatboulder.sf, 3219) %>%
aggregate(mutate(bikesGeometry, counter = 1),., sum) %>%
pull(counter)
neatboulder.sf$policecall_count <-
st_buffer(neatboulder.sf, 3219) %>%
aggregate(mutate(policeGeometry, counter = 1),., sum) %>%
pull(counter)
neatboulder.sf$ADU_count <-
st_buffer(neatboulder.sf, 3219) %>%
aggregate(mutate(dwellingUnitsGeometry, counter = 1),., sum) %>%
pull(counter)
## Nearest Neighbor Feature
st_c <- st_coordinates
neatboulder.sf <-
neatboulder.sf %>%
mutate(
school_nn1 = nn_function(st_c(neatboulder.sf), st_c(bol_schools), 1),
school_nn2 = nn_function(st_c(neatboulder.sf), st_c(bol_schools), 2),
school_nn3 = nn_function(st_c(neatboulder.sf), st_c(bol_schools), 3),
school_nn4 = nn_function(st_c(neatboulder.sf), st_c(bol_schools), 4),
school_nn5 = nn_function(st_c(neatboulder.sf), st_c(bol_schools), 5),
playground_nn1 = nn_function(st_c(neatboulder.sf), st_c(playgrounds), 1),
playground_nn2 = nn_function(st_c(neatboulder.sf), st_c(playgrounds), 2),
playground_nn3 = nn_function(st_c(neatboulder.sf), st_c(playgrounds), 3),
playground_nn4 = nn_function(st_c(neatboulder.sf), st_c(playgrounds), 4),
playground_nn5 = nn_function(st_c(neatboulder.sf), st_c(playgrounds), 5),
hospital_nn1 = nn_function(st_c(neatboulder.sf), st_c(hosps_sf), 1),
hospital_nn2 = nn_function(st_c(neatboulder.sf), st_c(hosps_sf), 2),
hospital_nn3 = nn_function(st_c(neatboulder.sf), st_c(hosps_sf), 3),
hospital_nn4 = nn_function(st_c(neatboulder.sf), st_c(hosps_sf), 4),
hospital_nn5 = nn_function(st_c(neatboulder.sf), st_c(hosps_sf), 5),
library_nn1 = nn_function(st_c(neatboulder.sf), st_c(libs_sf), 1),
library_nn2 = nn_function(st_c(neatboulder.sf), st_c(libs_sf), 2),
library_nn3 = nn_function(st_c(neatboulder.sf), st_c(libs_sf), 3),
library_nn4 = nn_function(st_c(neatboulder.sf), st_c(libs_sf), 4),
library_nn5 = nn_function(st_c(neatboulder.sf), st_c(libs_sf), 5),
policecall_nn1 = nn_function(st_c(neatboulder.sf), st_c(policeGeometry), 1),
policecall_nn2 = nn_function(st_c(neatboulder.sf), st_c(policeGeometry), 2),
policecall_nn3 = nn_function(st_c(neatboulder.sf), st_c(policeGeometry), 3),
policecall_nn4 = nn_function(st_c(neatboulder.sf), st_c(policeGeometry), 4),
policecall_nn5 = nn_function(st_c(neatboulder.sf), st_c(policeGeometry), 5),
bike_nn1 = nn_function(st_c(neatboulder.sf), st_c(bikesGeometry), 1),
bike_nn2 = nn_function(st_c(neatboulder.sf), st_c(bikesGeometry), 2),
bike_nn3 = nn_function(st_c(neatboulder.sf), st_c(bikesGeometry), 3),
bike_nn4 = nn_function(st_c(neatboulder.sf), st_c(bikesGeometry), 4),
bike_nn5 = nn_function(st_c(neatboulder.sf), st_c(bikesGeometry), 5),
ADU_nn1 = nn_function(st_c(neatboulder.sf), st_c(dwellingUnitsGeometry), 1),
ADU_nn2 = nn_function(st_c(neatboulder.sf), st_c(dwellingUnitsGeometry), 2),
ADU_nn3 = nn_function(st_c(neatboulder.sf), st_c(dwellingUnitsGeometry), 3),
ADU_nn4 = nn_function(st_c(neatboulder.sf), st_c(dwellingUnitsGeometry), 4),
ADU_nn5 = nn_function(st_c(neatboulder.sf), st_c(dwellingUnitsGeometry), 5))
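The 35 near-identical mutate() lines above could also be generated programmatically; a minimal sketch using the objects already defined (nn_function comes from the sourced functions.r helper script):
# Sketch: build the same *_nn1 through *_nn5 columns in a loop
amenity_list <- list(school = bol_schools, playground = playgrounds,
hospital = hosps_sf, library = libs_sf,
policecall = policeGeometry, bike = bikesGeometry,
ADU = dwellingUnitsGeometry)
for (name in names(amenity_list)) {
for (k in 1:5) {
neatboulder.sf[[paste0(name, "_nn", k)]] <-
nn_function(st_c(neatboulder.sf), st_c(amenity_list[[name]]), k)
}
}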
neatboulder.sf[is.na(neatboulder.sf)] = 0
# Create Heating Value column
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Wall Furnace"] <- 2.32
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Radiant Floor"] <- 7.69
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Heat Pump"] <- 4.66
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Electric"] <- 2.21
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Hot Water"] <- 4.07
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "No HVAC"] <- 3.71
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Package Unit"] <- 1.6
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Gravity"] <- 2.13
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Electric"] <- 2.21
neatboulder.sf$HeatingValue[neatboulder.sf$HeatingDscr == "Forced Air"] <- 2.91
The tables below display the summary statistics of selected variables, first for the internal variables and second for the public service and amenity variables. The tables show the tremendous range of housing prices, from $10,000 to over $7,000,000, with a standard deviation of $543,875. This wide range made the modeling especially difficult.
## --- Summary Statistics ---
# Houses with known sale prices (toPredict == 0); avoid naming the object
# "predict", which shadows the stats::predict() generic
knownPrices <- filter(neatboulder.sf, toPredict==0)
internal <- subset(knownPrices, select = c(price, nbrBedRoom, nbrRoomsNobath, bsmtSF, carStorageSF, mainfloorSF, nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths, TotalFinishedSF, TotalSF, PricePerSq, Age, qualityCode, HeatingValue))
#Turn into Data frames
internal_stats <- data.frame(internal)
#Produce the summary statistics table
summary_stats <- stargazer(internal_stats, type = "text", omit.stat="n",title="Summary Statistics of Internal Variables", digits=1)
##
## Summary Statistics of Internal Variables
## ==============================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ------------------------------------------------------------------------------
## price 11,255 746,065.6 543,875.6 10,000 452,500 830,500 7,350,000
## nbrBedRoom 11,255 3.4 1.0 0 3 4 24
## nbrRoomsNobath 11,255 7.4 2.1 0 6 9 28
## bsmtSF 11,255 744.7 627.6 0 140 1,126 4,534
## carStorageSF 11,255 471.9 235.5 0 380 600 3,040
## mainfloorSF 11,255 1,314.1 606.7 0 924 1,619 7,795
## nbrThreeQtrBaths 11,255 0.6 0.7 0 0 1 9
## nbrFullBaths 11,255 1.8 0.9 0 1 2 12
## nbrHalfBaths 11,255 0.6 0.5 0 0 1 3
## TotalFinishedSF 11,255 1,956.0 890.4 0 1,278 2,443 10,188
## TotalSF 11,255 2,627.0 1,183.1 135 1,764 3,242 11,294
## PricePerSq 11,255 307.6 226.7 4.2 195.8 350.3 6,805.6
## Age 11,255 33.9 26.8 0 15 49 161
## qualityCode 11,255 37.3 7.9 10 30 41 80
## HeatingValue 11,123 3.0 0.5 1.6 2.9 2.9 7.7
## ------------------------------------------------------------------------------
The variables in the public service table were derived using buffers around the houses and the nearest neighbor function. The count variables express how many hospitals, libraries, playgrounds, schools, stolen bikes, police calls, and accessory dwelling units occur within a two-mile buffer. The nn, or nearest neighbor, variables express the average distance to the x closest hospitals, schools, and so on, where x is the number of nearest occurrences considered. For example, policecall_nn5 is the average distance to the 5 closest police calls.
public <- subset(knownPrices, select = c(hosp_count, hospital_nn1, lib_count, library_nn1, playground_count, school_count, school_nn1, stolenbike_count, bike_nn5, policecall_count, policecall_nn5, ADU_count))
public_stats <- data.frame(public)
summary_stats2 <- stargazer(public_stats, type = "text", omit.stat="n",title="Descriptive Statistics of Public Service/Amenity Variables", digits=1)
##
## Descriptive Statistics of Public Service/Amenity Variables
## =========================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------------
## hosp_count 11,255 0.5 0.7 0 0 1 2
## hospital_nn1 11,255 4,757.9 3,970.8 96.2 2,262.6 5,817.4 34,086.7
## lib_count 11,255 1.5 2.6 0 0 1 12
## library_nn1 11,255 3,250.1 2,383.0 51.2 1,559.1 4,145.7 21,165.4
## playground_count 11,255 5.7 11.3 0 0 1 45
## school_count 11,255 14.7 10.1 0 6 22 44
## school_nn1 11,255 1,136.7 1,764.2 15.9 396.5 1,126.3 20,801.3
## stolenbike_count 11,255 196.1 459.4 0 0 3 2,024
## bike_nn5 11,255 7,018.7 4,589.1 28.8 4,398.4 10,226.7 30,521.1
## policecall_count 11,255 34.7 82.6 0 0 0 363
## policecall_nn5 11,255 10,014.7 6,179.5 62.0 5,671.4 14,645.1 30,961.2
## ADU_count 11,255 36.0 81.0 0 0 0 316
## -------------------------------------------------------------------------
Using the ggcorrplot package in R, we plotted relevant variables on the correlation matrix below. This matrix informed our variable selection for the final model. For example, school_nn1 appears to have little to no correlation with price, so it was left out. On the other hand, policecall_nn5 has a strong negative correlation and playground_count a strong positive correlation, so both were included in the model.
# Correlation Matrix
corr.sf <- neatboulder.sf%>%
dplyr::select(price,Age,qualityCode,nbrBedRoom,nbrRoomsNobath,mainfloorSF,nbrFullBaths,carStorageSF,TotalSF,HeatingValue,hosp_count,hospital_nn1,lib_count,playground_count,school_count,school_nn1,policecall_count,policecall_nn5,stolenbike_count,bike_nn5,ADU_count)
corr <-
select_if(st_drop_geometry(corr.sf), is.numeric) %>% na.omit()
ggcorrplot(
round(cor(corr), 1),
p.mat = cor_pmat(corr),
colors = c("#b3cde3", "white", "#8856a7"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric house and indicator variables",
caption = "Figure 1")
The scatterplots below investigate these correlations further, showing the relationships of age, total square feet, distance to the nearest school, and distance to the nearest hospital with housing price. TotalSF has a clear positive correlation with home price, which is not surprising since larger houses logically cost more. Distance to the nearest hospital and to the nearest school, surprisingly, both appear to have a slight negative correlation.
#Scatter Plots
st_drop_geometry(neatboulder.sf) %>%
dplyr::select(price, Age, TotalSF) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of Age and Total Square Feet",
caption = "Figure 2.1") +
plotTheme()
st_drop_geometry(neatboulder.sf) %>%
dplyr::select(price, school_nn1, hospital_nn1) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of Nearest School and Hospital",
caption = "Figure 2.2") +
plotTheme()
Figure 3 below maps house prices throughout the county. There appears to be clear clustering of expensive homes in the city and southeast of the city, while northeast Boulder appears to have a cluster of lower-priced homes.
# Visualize House Prices
ggplot() +
geom_sf(data = inc_pov_tracts)+
geom_sf(data = bol_munis, fill = "grey40") +
geom_sf(data = neatboulder.sf, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(neatboulder.sf,"price"),
name="Quintile\nBreaks") +
labs(title="Price of Houses, Boulder County",
caption = "Figure 3") +
mapTheme()
Furthermore, the maps below display the spatial relationships of several variables. The playgrounds are clustered in the city, where home prices appear to be highest. The population density map does not show a strong correlation with home prices, since two of the density clusters (the city and southeast of the city) have high home prices while the third cluster, in the northeast, appears to have lower prices. Finally, the schools appear most similar to the population density, which makes logical sense.
#----Maps----
# Playgrounds
ggplot() +
geom_sf(data=tracts19) +
geom_sf(data=playgrounds, color = "#810f7c", size = 2.5, alpha = 0.65)+
labs(title="Playgrounds",
subtitle="Boulder, CO",
caption="Figure 4.1") +
mapTheme()
# population density
ggplot() +
geom_sf(data=st_union(inc_pov_tracts)) +
geom_sf(data=inc_pov_tracts,aes(fill=q5(Population_Density_PerLandSquareMile)))+
scale_fill_manual(values=palette5,
labels=qBr(inc_pov_tracts,"Population_Density_PerLandSquareMile"),
name="Population Density\n(Quintile Breaks)")+
labs(title="Population Density in Boulder, CO.",
subtitle="Per Square Mile",
caption="Figure 4.2") +
mapTheme()
# schools
ggplot() +
geom_sf(data=tracts19) +
geom_sf(data=schoolGeometry, color = "#810f7c", size = 2.5, alpha = 0.65)+
labs(title="Schools",
subtitle="Boulder, CO",
caption="Figure 4.3") +
mapTheme()
A final map of the density of police calls reveals that the vast majority of calls take place in the center of the City of Boulder, where prices appear to be highest according to Figure 3.
ggplot() + geom_sf(data = inc_pov_tracts, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(policeGeometry)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "#b3cde3", high = "#8856a7", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
labs(title = "Density of Police Calls, Boulder County",
caption = "Figure 5") +
mapTheme()
After wrangling the provided and downloaded data and developing the buffer and nearest neighbor variables, the correlation matrix was used to identify the variables with noticeable correlation with price. After running R's regression functions on several sets of variables, the following were ultimately selected for the final model: Age, qualityCode, HeatingValue, nbrRoomsNobath, mainfloorSF, carStorageSF, TotalSF, playground_count, lib_count, ADU_count, hospital_nn2, policecall_nn5, and bike_nn5.
These were selected because the resulting model had the least error of all the combinations attempted. To avoid multicollinearity (two variables overlapping and essentially predicting the same thing, which would break an important assumption of OLS regression), only one variable from each of the downloaded datasets was used: either the number of occurrences within the two-mile buffer or the average distance to the x nearest occurrences.
The resulting model is presented and tested in the following section.
The table below displays the selected variables and their coefficients when entered into an OLS regression. Each coefficient indicates the amount the home price changes when the predictor increases by one unit, holding all other variables constant.
allvars.sf <- neatboulder.sf%>%
dplyr::select(price,Age,qualityCode,nbrBedRoom,nbrRoomsNobath,bsmtSF,carStorageSF,mainfloorSF,nbrThreeQtrBaths,nbrFullBaths,nbrHalfBaths,TotalFinishedSF,toPredict,geometry,hosp_count,lib_count,playground_count,school_count,stolenbike_count,policecall_count,ADU_count,TotalSF,PricePerSq,school_nn1,school_nn2,school_nn3,school_nn4,school_nn5,playground_nn1,playground_nn2,playground_nn3,playground_nn4,playground_nn5,hospital_nn1,hospital_nn2,hospital_nn3,hospital_nn4,hospital_nn5,library_nn1,library_nn2,library_nn3,library_nn4,library_nn5,policecall_nn1,policecall_nn2,policecall_nn3,policecall_nn4,policecall_nn5,bike_nn1,bike_nn2,bike_nn3,bike_nn4,bike_nn5,ADU_nn1,ADU_nn2,ADU_nn3,ADU_nn4,ADU_nn5,HeatingValue)
allVars <-
select_if(st_drop_geometry(allvars.sf), is.numeric) %>% na.omit()
reg1 <- lm(price ~ ., data = st_drop_geometry(allvars.sf) %>%
dplyr::select(price, Age, qualityCode, HeatingValue,
nbrRoomsNobath, mainfloorSF, carStorageSF, TotalSF,
playground_count, lib_count,
ADU_count, hospital_nn2, policecall_nn5,
bike_nn5))
#summary(reg1)
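As a quick check on the multicollinearity concern discussed above, the variance inflation factors of the fitted model could be inspected; a sketch assuming the car package is installed (it is not loaded in the setup chunk above):
# Sketch: VIF values well above ~5 would flag predictors that overlap heavily
# install.packages("car") if needed
car::vif(reg1)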
predict0 <- filter(allvars.sf,toPredict==0)
#75% of the sample size
smp_size <- floor(0.75 * nrow(predict0))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(predict0)), size = smp_size)
train <- predict0[train_ind, ]
test <- predict0[-train_ind, ]
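An equivalent split could also be drawn with caret (already loaded above); a sketch, noting that unlike the simple random sample used here, createDataPartition stratifies on the outcome:
# Sketch: stratified 75/25 split on price using caret
set.seed(123)
train_idx <- caret::createDataPartition(predict0$price, p = 0.75, list = FALSE)
train_strat <- predict0[train_idx, ]
test_strat <- predict0[-train_idx, ]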
reg.training <- lm(price ~ ., data = st_drop_geometry(train) %>%
dplyr::select(price, Age, qualityCode,
nbrRoomsNobath, mainfloorSF, TotalSF,
playground_count, lib_count, carStorageSF,
ADU_count, hospital_nn2, policecall_nn5,
bike_nn5,HeatingValue))
# Make summary table for lm
summary(reg.training)
##
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(train) %>% dplyr::select(price,
## Age, qualityCode, nbrRoomsNobath, mainfloorSF, TotalSF, playground_count,
## lib_count, carStorageSF, ADU_count, hospital_nn2, policecall_nn5,
## bike_nn5, HeatingValue))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1536815 -150558 -21518 110907 6646997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.264e+05 3.512e+04 -23.534 < 2e-16 ***
## Age 2.326e+03 1.812e+02 12.835 < 2e-16 ***
## qualityCode 2.579e+04 7.185e+02 35.896 < 2e-16 ***
## nbrRoomsNobath 2.462e+04 2.252e+03 10.936 < 2e-16 ***
## mainfloorSF 1.309e+02 1.381e+01 9.479 < 2e-16 ***
## TotalSF 2.359e+00 8.440e+00 0.280 0.780
## playground_count -1.588e+04 1.085e+03 -14.636 < 2e-16 ***
## lib_count -2.827e+04 2.546e+03 -11.105 < 2e-16 ***
## carStorageSF 1.131e+02 2.239e+01 5.052 4.46e-07 ***
## ADU_count 4.415e+03 1.602e+02 27.554 < 2e-16 ***
## hospital_nn2 9.526e+00 1.347e+00 7.071 1.67e-12 ***
## policecall_nn5 -2.607e+01 1.714e+00 -15.215 < 2e-16 ***
## bike_nn5 1.163e+00 1.744e+00 0.667 0.505
## HeatingValue 8.637e+04 7.097e+03 12.170 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 335600 on 8330 degrees of freedom
## (97 observations deleted due to missingness)
## Multiple R-squared: 0.6026, Adjusted R-squared: 0.602
## F-statistic: 971.6 on 13 and 8330 DF, p-value: < 2.2e-16
The table below shows the mean absolute error (MAE) and the mean absolute percent error (MAPE), which indicates the model is off by roughly 32.5 percent on average. For example, a home predicted at $500,000 that actually sold for $400,000 contributes an absolute percent error of |500,000 - 400,000| / 500,000 = 20 percent.
test <-
test %>%
mutate(price.predict = predict(reg.training, test),
price.error = price.predict - price,
price.abserror = abs(price.predict - price),
price.APE = (abs(price.predict - price)) / price.predict)%>%
filter(price < 5000000)%>%
filter(price.predict > 0)
# Make summary table for MAE and MAPE
test.summary <-
st_drop_geometry(test) %>%
summarize(Mean_Absolute_Error = mean(price.abserror, na.rm = T),
Mean_Absolute_Percent_Error = mean(price.APE, na.rm = T))
kable(test.summary, "simple")
| Mean_Absolute_Error | Mean_Absolute_Percent_Error |
|---|---|
| 180475.3 | 0.325396 |
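Note that price.APE above divides by the predicted rather than the observed price; the conventional MAPE denominator is the observed sale price. A sketch of that variant for comparison:
# Sketch: MAPE with the observed price in the denominator
st_drop_geometry(test) %>%
summarize(MAPE_observed = mean(abs(price.predict - price) / price, na.rm = TRUE))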
Next, a k-fold cross-validation was completed with 100 folds. The histogram in Figure 6 shows a slightly positively skewed distribution of the mean absolute error, with an outlier driven by extremely expensive houses. Ideally, the distribution would be normal, indicating that the model is generalizable. Since the distribution is slightly skewed, the model could still be improved to be more generalizable.
# Cross-Validation Tests
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(price ~ ., data = st_drop_geometry(allvars.sf) %>%
dplyr::select(price, Age, qualityCode,
nbrRoomsNobath, mainfloorSF, TotalSF,
playground_count, lib_count, carStorageSF,
ADU_count, hospital_nn2, policecall_nn5,
bike_nn5,HeatingValue),
method = "lm", trControl = fitControl, na.action = na.pass)
#reg.cv
reg.cv$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 551290.7 0.6614009 227845.4 Fold001
## 2 291030.0 0.6852373 192818.4 Fold002
## 3 257525.5 0.6003039 186005.6 Fold003
## 4 236032.8 0.6678465 160617.4 Fold004
## 5 553614.7 0.5232083 232804.4 Fold005
mean_MAE <- mean(reg.cv$resample[,3]) # average MAE across the 100 folds
# Histogram
mae_hist <- hist(reg.cv$resample[,3],
main="Figure 6: Distribution of MAE",
xlab="Mean Absolute Error",
ylab="Count",
col="darkmagenta",
freq=TRUE,
breaks=20)
Figure 7 displays the predicted prices (the orange line) against the observed prices (scatter plot). The line is certainly not over-fit to the data, but it does show a generally accurate trend. It is also important to note that the errors fortunately do not appear to be heteroscedastic, which would violate a major assumption of OLS regression.
test <- na.omit(test)
# Scatterplot predicted price as function of observed price
pred_obs <- st_drop_geometry(test) %>%
dplyr::select(price, price.predict) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, scales = "free") +
labs(title = "Predicted price as a function of observed price",
caption = "Figure 7") +
plotTheme()
pred_obs
Figure 8 shows that the most significant errors on the test set are largely clustered in the city. Given the density of the City of Boulder compared with the rest of the county, the buffer and nearest neighbor features are hard to generalize across the entire county, which could be part of the reason the error is so much higher in the densest part of the county.
Next, the spatial lag of errors and Moran's I are displayed in Figures 9 and 10. The Moran's I statistic, which measures whether nearby errors resemble one another (values near 1 indicate clustering, values near 0 spatial randomness), indicates clear spatial autocorrelation.
# Provide map of residuals
ggplot() +
geom_sf(data = inc_pov_tracts)+
geom_sf(data = bol_munis, fill = "grey40") +
geom_sf(data = test, aes(colour = q5(price.abserror)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(test,"price.abserror"),
name="Quintile\nBreaks") +
labs(title="Sale price errors on the test set, Boulder County",
caption = "Figure 8") +
mapTheme()
# Spatial Lag in errors
coords.test <- st_coordinates(test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
test$lagPriceError <- lag.listw(spatialWeights.test, test$price.error)
spatial_lag <- st_drop_geometry(test) %>%
dplyr::select(lagPriceError, price.error) %>%
gather(Variable, Value, -price.error) %>%
ggplot(aes(Value, price.error)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, scales = "free") +
labs(title = "Error as a function of the spatial lag of price",
caption = "Figure 9") +
plotTheme()
spatial_lag
# Moran I Test
moranTest <- moran.mc(test$price.error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
caption = "Figure 10",
x="Moran's I",
y="Count") +
plotTheme()
The map below in Figure 11 shows the predicted home sale prices for all houses, both those with known prices (toPredict == 0) and those to be predicted (toPredict == 1).
# Map of predicted values for toPredict ==0 and toPredict == 1
test1_0 <- allvars.sf
reg1_0 <- lm(price ~ ., data = st_drop_geometry(test1_0) %>%
dplyr::select(price, Age, qualityCode,
nbrRoomsNobath, mainfloorSF, TotalSF,
playground_count, lib_count, carStorageSF,
ADU_count, hospital_nn2, policecall_nn5,
bike_nn5,HeatingValue))
test1_0 <-
test1_0 %>%
mutate(price.predict = predict(reg1_0, test1_0),
price.error = price.predict - price,
price.abserror = abs(price.predict - price),
price.APE = (abs(price.predict - price)) / price.predict)%>%
filter(price < 5000000)%>%
filter(price.predict >0)
test1_0 <- na.omit(test1_0)
ggplot() +
geom_sf(data = inc_pov_tracts)+
geom_sf(data = bol_munis, fill = "grey40") +
geom_sf(data = test1_0, aes(colour = q5(price.predict)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(test1_0,"price.predict"),
name="Quintile\nBreaks") +
labs(title="Predicted home sale prices, Boulder County",
caption = "Figure 11") +
mapTheme()
Below, in Figure 12, the mean absolute percent error (MAPE) is displayed by census tract. Tracts with similar house values have smaller MAPE values, while tracts with a wide range of house values, like the darkest eastern tract, have higher MAPE values, possibly due to this variety.
# Map of MAPE by tract
intersection <- st_intersection(x = inc_pov_tracts, y = test1_0) %>%
dplyr::select(GEOID,Tract_Name,price,price.APE,geometry)%>%
group_by(GEOID)%>%
mutate(MAPE=mean(price.APE))%>%
mutate(meanprice=mean(price))
int_tracts <- merge(x = st_drop_geometry(intersection), y = inc_pov_tracts, by = "GEOID", all.x=FALSE)%>%
dplyr::select(GEOID,MAPE,meanprice,geometry)%>%
distinct()
ggplot() +
geom_sf(data=int_tracts, aes(fill=MAPE,geometry=geometry))+
geom_sf(data=boulder.sf, colour="black",show.legend="point",size=.5)+
scale_fill_gradient(low=palette5[1], high=palette5[5],name="MAPE")+
labs(title="MAPE by Census Tract, Boulder County",
caption = "Figure 12")+
mapTheme()
Figure 13 shows the MAPE by census tract compared with the mean price by census tract.
scatterMAPE <- int_tracts %>%
dplyr::select(MAPE, meanprice) %>%
gather(Variable, Value, -MAPE) %>%
ggplot(aes(Value, MAPE)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, scales = "free") +
labs(title = "MAPE by census tract as a function of mean price by census tract",
caption = "Figure 13") +
plotTheme()
scatterMAPE
To test the generalizability of our model, we ran two splits on our data. First, we split the census tracts by education and ran the model, resulting in an average MAPE of 0.300 for high-education tracts and 0.334 for lower-education tracts. This difference of 0.034 shows that our model is somewhat biased toward higher-educated residents. Next, we split the tracts by income level. The resulting MAPE was 0.312 for high income and 0.326 for low income. This difference of roughly 0.014, while smaller than the education difference, once again suggests some bias. Considering the sizes of these differences, our model is somewhat generalizable but could certainly be improved.
generalize_tracts <- inc_pov_tracts %>%
mutate(eduContext = ifelse(pctBachelors > .255, "High Education", "Low Education"),
incomeContext = ifelse(MedHHInc > 100000, "High Income", "Low Income"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(generalize_tracts), aes(fill = eduContext)) +
scale_fill_manual(values = c("#b3cde3", "#8856a7"), name="Education Context") +
labs(title = "Education Context",
caption = "Figure 14.1") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(generalize_tracts), aes(fill = incomeContext)) +
scale_fill_manual(values = c("#b3cde3", "#8856a7"), name="Income Context") +
labs(title = "Income Context",
captions = "Figure 14.2") +
mapTheme() + theme(legend.position="bottom"))
# Education Context table (should have MAPE(in percent) across high education and low education, just two rows)
int_tracts_final <- merge(x = st_drop_geometry(intersection), y = generalize_tracts, by = "GEOID", all.x=FALSE)%>%
dplyr::select(GEOID,MAPE,meanprice,eduContext,incomeContext,geometry)%>%
distinct()
education_table <- int_tracts_final %>%
group_by(eduContext) %>%
summarize(Average_MAPE = mean(MAPE, na.rm = T)) %>%
kable() %>% kable_styling()
education_table
| eduContext | Average_MAPE |
|---|---|
| High Education | 0.2999450 |
| Low Education | 0.3335119 |
# Income Context table (should have MAPE(in percent) across high income and low income, just two rows)
income_table <- int_tracts_final %>%
group_by(incomeContext) %>%
summarize(Average_MAPE = mean(MAPE, na.rm = T)) %>%
kable() %>% kable_styling()
income_table
| incomeContext | Average_MAPE |
|---|---|
| High Income | 0.3123265 |
| Low Income | 0.3261770 |
Analysis of our results shows that the housing market prediction tool we built for Zillow was somewhat effective. While the tool met the request to incorporate local intelligence into the model, and that intelligence produced some of our most interesting variables, the size of the mean absolute percent error calls for further work. The emphasis on including local intelligence in a new model introduced some significant challenges.
The numbers of stolen bikes, police calls, and playgrounds within a certain distance of a home influenced market value in our model. These variables, along with others seen in the correlation matrix, were associated with value changes for homes in Boulder County. However, how these variables related to value was subject to the data available: many of the local intelligence datasets available through open data portals cover only the city limits, leaving the rest of Boulder County unaccounted for. Still, these were important datasets and created a stronger prediction model.
Beyond the concentration of large errors in the city noted above, there was no major pattern in where our prediction model was more accurate; error was scattered between urban and suburban areas. For local intelligence such as playground locations, which were only listed for sites within the city boundary, the data did not overwhelm the predictions for the rest of Boulder County. This demonstrates the potential of the model to balance the wide range of land use and density across Boulder County.
Due to the geographic diversity of the county, factors that play a role in value in the city will not have the same impact in the rural western part of the county. As the MAPE-by-tract figure shows, our model handled this spatial variation reasonably well. The model demonstrates a well-rounded approach to market value prediction with potential for county-wide use, not just within the city. Despite a higher-than-desired error, access to more local intelligence data at the county level could make the tool even stronger and more useful for Zillow and its customers.
Overall, this model is an improvement over using exclusively internal features to predict housing prices, and in that respect we would recommend it to Zillow. However, there is a great deal of room for improvement, and before presenting this work to Zillow we would search for more relevant data to incorporate. A better source of crime data would be a great place to start. Additionally, a better understanding of the priorities of Boulder's residents might reveal other datasets to explore; if residents are especially worried about wildfires, for example, we would investigate where wildfires have occurred recently and add that information to the model. Ultimately, we are proud of this model but recognize its shortcomings, and with more time and access to data we would decrease its error.