I. Introduction

Predictive Policing is becoming a popular policing method across the world. Policing strategies based on predicitive models and algorithms are very susceptible to bias, however. The data these predictive models are built upon heavily influence the results these models will produce. For instance, if a city’s violent crime is disproportionately concentrated in a neighborhood where the majority of the population is black, which could be a result of decades of other policing methods and policies, the predictive model will likely continue to target mostly black neighborhoods. This built-in bias in predictive models is dangerous and must be accounted for.

In this model, simple batteries were measured and analyzed in Chicago for the year 2017. This data was used to predict risk areas for future crime. These risk areas are realistic tools that police departments could find useful.

II. Data Wrangling

A. Load Data

Police districts and beats were read using st_read and transformed to a common and desriable projection for further plotting and analysis.

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

policeUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

2017 crime data was obtained from the City of Chicago open data portal. This crime was initially read using read.socrata and subsequently filtered to include only simple battery offenses. Once the data was filtered, coordinates were created from the open data. Additionally, a boundary of the City of Chicago was loaded into the model.

chicagoCrimes <-
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr")

batteries <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
  filter(Primary.Type == "BATTERY" & Description == "SIMPLE") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform('ESRI:102271') %>% 
  distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271')

B. Initial Plots

1. Plot Density

grid.arrange(ncol=2,
             ggplot() + 
               geom_sf(data = chicagoBoundary, size=0.5) +
               geom_sf(data = batteries, color="navyblue", size=0.09,
                       show.legend = "point") +
               labs(title= "Simple Batteries, Chicago - 2017") +
               mapTheme(title_size = 14),
             
             ggplot() + 
               geom_sf(data = chicagoBoundary, fill = "gray40", size=0.5) +
               stat_density2d(data = data.frame(st_coordinates(batteries)), 
                              aes(X, Y, fill = ..level.., alpha = ..level..),
                              size = 0.01, bins = 40, geom = 'polygon') +
               scale_fill_viridis() +
               scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
               labs(title = "Density of Simple Batteries") +
               mapTheme(title_size = 14) + theme(legend.position = "none"))

2. Fishnet

Joining the 2017 battery data to a fishnet allocates each individual incident into a grid cell for a given area. In this case, the grid cell size was set to 500. This then creates a count of simple batteries within each grid cell.

fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%       
  st_sf() %>%
  mutate(uniqueID = rownames(.))

crime_net <- 
  dplyr::select(batteries) %>% 
  mutate(countBatteries = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countBatteries = replace_na(countBatteries, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countBatteries), color = NA, linewidth = 0) +
  scale_fill_viridis() +
  labs(title = "Count of Simple Batteries in Fishnet") +
  mapTheme()

C. Additional Risk Factors

Loading additional variables and predictors into the model can show possible relationships and strengthen the model based on correlations between these and the dependent variable.

Data from Chicago’s Open Data Portal was loaded into the model including information on abandoned cars, abandoned buildings, graffiti locations, broken street lights, sanitation complaints, and liquor retail locations.

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")
  
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

III. Feature Engineering

A. Risk Factor Fishnets

Similar to how simple batteries were added to a fishnet above, each of the additional risk factors were added to fishnets. This offers a visual understanding of which areas are experiencing which problems.

vars_net <- 
  rbind(abandonCars, abandonBuildings, graffiti, streetLightsOut, sanitation,
   liquorRetail) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet, by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit() %>%
  ungroup()

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)

mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol=3, top="Risk Factors by Fishnet"))

B. Near Neighbors

While the fishnet grids offer a strong understanding of exposure, these visualizations do not account for much more. Using the near neighbor function nn_function shows an alternative understanding of how these individual events are connected with one another, specifically their neighboring events.

st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3))

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol = 3, top = "Nearest Neighbor risk Factors by Fishnet"))

C. Create Final Net

In order to run our model through a regression, the crime_net and vars_net are joined using left_join on their “uniqueID”.

final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

neighborhood_net <- 
  ggplot() +
    geom_sf(data = final_net, aes(fill = name), show.legend = FALSE,
            color = "transparent") +
    scale_fill_viridis_d() + 
    labs(title = "Neighborhood") +
    mapTheme()

district_net <- 
  ggplot() +
    geom_sf(data = final_net, aes(fill = District), show.legend = FALSE,
            color = "transparent") +
    scale_fill_viridis_d() + 
    labs(title = "Police District") +
    mapTheme()

grid.arrange(district_net, neighborhood_net, nrow = 1)

IV. Spatial Process

A. Local Moran’s I

To observe spatial autocorrelation at a lower elevation, localmoran was used to conduct a Local Moran’s I test. This indicates spatial association, and does so on a localized level.

local_morans <- localmoran(final_net$countBatteries, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Batteries_Count = countBatteries, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)
  
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), linewidth=0, colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Moran's I statistics, Battery"))

B. Dependent Variable Near Neighbor

final_net <- final_net %>% 
  mutate(battery.isSig = 
           ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
  mutate(battery.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                          battery.isSig == 1))), 
                       k = 1))

reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn")

reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn",
                 "Graffiti.nn", "Liquor_Retail.nn",
                 "Street_Lights_Out.nn", "Sanitation.nn",
                 "battery.isSig", "battery.isSig.dist")

ggplot() +
      geom_sf(data = final_net, aes(fill=battery.isSig.dist), colour=NA) +
      scale_fill_viridis(name="NN Distance") +
      labs(title="Simple Battery NN Distance") +
      mapTheme()

C. Risk Factor Correlations

Circling back to the additional variables and risk factors that were imported, a correlation plot with a line of best fit is created for each factor. This shows which variables are actually related or correlated with our dependent variable.

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name, -District) %>%
    gather(Variable, Value, -countBatteries)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countBatteries, use = "complete.obs"))

ggplot(correlation.long, aes(Value, countBatteries)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "gray40") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Simple Battery Count As a Function of Risk Factors") +
  plotTheme()

V. Regression

A. Distribution

The skewed distribution below is expected. Any distribution other than one resembling this would be extremely concerning, as batteries should not be a common occurrence. Even though this skewed distribution is expected, it still requires a regression other than OLS. Instead, a Poisson Regression is much better suited for this data.

ggplot(crime_net, aes(countBatteries)) + 
  geom_histogram(binwidth = 2, color = 'black', fill = "#FDE725FF") +
  labs(title = "Simple Battery Distribution", subtitle = "Chicago, IL") +
  plotTheme()

B. Cross-Validated Poisson Regression

Because crime is a complex and multifaceted issue, this model must be accurate in predicting crime at high and low elevations. If the model is strong only at a citywide level and not a local level, it should not be used by police departments. That is why this model uses Leave One Group Out Cross Validation. It test the data by holding one particular piece out and training on the rest of the data ensuring a well balanced outcome.

final_net <- final_net %>%
  rename(countBurglaries = countBatteries)

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countBurglaries",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countBurglaries, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countBurglaries",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countBurglaries, Prediction, geometry)
  
reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countBurglaries",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countBurglaries, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countBurglaries",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countBurglaries, Prediction, geometry)

1. Accuracy and Generalizability

The figure below shows the difference in error between the Spatial LOGO-CV and a regular k-fold CV. The Spatial LOGO-CV clearly has lower MAEs than the k-fold CV.

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countBurglaries,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countBurglaries,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countBurglaries,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countBurglaries,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf()

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countBurglaries, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + 
    labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
         x="Mean Absolute Error", y="Count") +
    plotTheme()

The table and maps below again show the LOGO-CV and Spatial Process reduce error.

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF")

Regression	Mean_MAE	SD_MAE
Random k-fold CV: Just Risk Factors	1.56	1.50
Random k-fold CV: Spatial Process	1.40	1.43
Spatial LOGO-CV: Just Risk Factors	5.41	8.16
Spatial LOGO-CV: Spatial Process	4.11	6.22

error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Battery errors by LOGO-CV Regression") +
    mapTheme() + theme(legend.position="bottom")

neighborhood.weights <-
  filter(error_by_reg_and_fold, Regression == "Spatial LOGO-CV: Spatial Process") %>%
    group_by(cvID) %>%
      poly2nb(as_Spatial(.), queen=TRUE) %>%
      nb2listw(., style="W", zero.policy=TRUE)

filter(error_by_reg_and_fold, str_detect(Regression, "LOGO"))  %>% 
    st_drop_geometry() %>%
    group_by(Regression) %>%
    summarize(Morans_I = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[1]],
              p_value = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[3]])

## # A tibble: 2 × 3
##   Regression                         Morans_I p_value
##   <chr>                                 <dbl>   <dbl>
## 1 Spatial LOGO-CV: Just Risk Factors    0.224   0.005
## 2 Spatial LOGO-CV: Spatial Process      0.211   0.002

st_drop_geometry(reg.summary) %>%
  group_by(Regression) %>%
    mutate(burglary_Decile = ntile(countBurglaries, 10)) %>%
  group_by(Regression, burglary_Decile) %>%
    summarize(meanObserved = mean(countBurglaries, na.rm=T),
              meanPrediction = mean(Prediction, na.rm=T)) %>%
    gather(Variable, Value, -Regression, -burglary_Decile) %>%          
    ggplot(aes(burglary_Decile, Value, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = burglary_Decile), colour = "black") +
      scale_shape_manual(values = c(2, 17)) +
      facet_wrap(~Regression) + xlim(0,10) +
      labs(title = "Predicted and observed burglary by observed burglary decile")  +
      plotTheme()

tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

ggplot()+
  geom_sf(data = tracts18, aes(fill = raceContext)) +
  scale_fill_viridis_d(begin = 0, end = 1, option = "plamsa") +
  labs(title= "Race Context") +
    mapTheme()

reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
    st_centroid() %>%
    st_join(tracts18) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "Mean Error by neighborhood racial context") %>%
        kable_styling("striped", full_width = F)

Mean Error by neighborhood racial context
Regression	Majority_Non_White	Majority_White
Spatial LOGO-CV: Just Risk Factors	-1.1017887	1.1654398
Spatial LOGO-CV: Spatial Process	-0.6151689	0.6330728

batt_ppp <- as.ppp(st_coordinates(batteries), W = st_bbox(final_net))
batt_KD.1000 <- spatstat.core::density.ppp(batt_ppp, 1000)
batt_KD.1500 <- spatstat.core::density.ppp(batt_ppp, 1500)
batt_KD.2000 <- spatstat.core::density.ppp(batt_ppp, 2000)
batt_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(batt_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(batt_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(batt_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

batt_KD.df$Legend <- factor(batt_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data=batt_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(name="Density") +
  labs(title = "Kernel density with 3 different search radii") +
  mapTheme(title_size = 14)

batt_ppp <- as.ppp(st_coordinates(batteries), W = st_bbox(final_net))
batt_KD <- spatstat.core::density.ppp(batt_ppp, 1000)

as.data.frame(batt_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value), color = NA) +
     geom_sf(data = sample_n(batteries, 1500), size = .3) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2017 batteries") +
     mapTheme()

chicagoCrimes18 <-
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy")

batteries18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "BATTERY" & 
         Description == "SIMPLE") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

batt_KDE_sf <- as.data.frame(batt_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category  <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(batteries18) %>% mutate(battCount = 1), ., sum) %>%
    mutate(battCount = replace_na(battCount, 0))) %>%
  dplyr::select(label, Risk_Category, battCount)

batt_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(batteries18) %>% mutate(battCount = 1), ., sum) %>%
      mutate(battCount = replace_na(battCount, 0))) %>%
  dplyr::select(label,Risk_Category, battCount)

rbind(batt_KDE_sf, batt_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(batteries18, 3000), size = .3, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 Battery Risk Predictions; 2018 Batteries") +
    mapTheme()

rbind(batt_KDE_sf, batt_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countBatteries = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countBatteries / sum(countBatteries)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE) +
      labs(title = "Risk prediction vs. Kernel density, 2018 Batteries",
           subtitle = "Chicago, IL") +
      plotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

VI. Conclusion

While a predictive model like the one in this report may seem like an easy way to approach crime since it simply shows where the crime is occurring, we must be careful when it comes to actually implementing these systems. Because this model focused on Simple Battery offenses alone, I would not recommend this model for use in any policing strategy. It is simply too specific of an offense to account for an entire policing strategy. If this model analyzed and predicted on all violent crimes, including aggravated assaults, robberies, rape, homicides, and other offenses, I would feel much more comfortable about its ability to accurately predict crime.

Even with a model based on a broader category of crimes like violent crime, I would still be extremely hesitant to implement it into a real policing practice. Predictive models are limited to the data they are created with. Crime is a multi-faceted issue with unique and complicated connections to many things that a predictive model cannot take into account without years of careful guidance and attention to detail. It is a dangerous risk to take when models are the backbone of a policing strategy. Policing has a history of biases and mistreatment that has transformed communities into places that both welcome and generate crime. These communities become targets with these types of models, even when the true causes of crime are most likely connected to other simple causes. However, the model will not account for these outside causes, and this community will be overly policed and likely produce “positive” results from the model and continue to be targeted by future models.

Predictive Policing in Chicago

Sean McClellan

10/22/2021