Climate Resilience is an important, but neglected, field of City Planning. Developing a climate action plan can be just as important to a city as a economic, housing, and community plan.
For Calgary in 2013, their neglect to perform substantial flood preventative measures caught up to the city. The June 2013 floods hit Calgary hard as the city is at the meeting point of multiple rivers. So the abnormally large snow melt flooded the city’s downtown – causing approximately $409 million in infrastructure damages. [source]
For this project, we set out to understand where Calagary’s Flood Hazard areas are then understand which factors contribute the most to those hazards. As a comparison, this project will also use Calgary’s neighbor, Edmonton, as they are compareable in size.
This report will be structure in this format:
1. Data Import & Visualizations – City Area, Water, Flood Hazards, Elevation
2. Feature Engineering – Topology, Water Distances, Soil Types
3. Logistic Model – Fitting, Summary
4. Results – Cut-Off Threshold, ROC Curve, Confusion Matrix & Maps
5. Conclusions
In order to best predict which areas will flood in Calgary & Edmonton, we will (1) pull in reference data, (2) elevations, (3) the flood hazard zones themselves, and then (4) attach them to fishnets.
A majority of this data comes from the Province of Alberta’s Open Data portal, and their private contractor, Altalis. A more in-depth list of our data can be found in our Google Sheet.
proj_crs = st_crs(file.path(DATA_PATH,'Alberta_Projection.prj'))
pather = function(focus_paths){
focus_paths = paste(c(focus_paths), collapse='/')
final_path = paste(DATA_PATH, focus_paths, sep='/')
return(final_path)
}
# # GRIDS
# ## ALBERTA TOWNSHIP SURVEY SYSTEM's SUBDIVISIONS -- 1/16 square mile grid
# ## https://www.alberta.ca/alberta-township-survey-system.aspx
# GRIDS_PATH = file.path(DATA_PATH, 'clean_data/grids')
#
# ## CALGARY
# GRID_NAME = 'CGY_GRIDS.shp'
# #GRID_NAME = 'CGY_sub_0625mi.shp'
# cgy.grid = st_read(file.path(GRIDS_PATH, GRID_NAME)) %>%
# st_transform(proj_crs) %>%
# st_make_valid()
#
# ## EDMONTON
# GRID_NAME = 'EDM_GRIDS.shp'
# #GRID_NAME = 'EDM_sub_0625mi.shp'
# edm.grid = st_read(file.path(GRIDS_PATH, GRID_NAME)) %>%
# st_transform(proj_crs) %>%
# st_make_valid()
# CITY BOUNDARIES & BOUNDARY BOXES (~10mi away from city boundaries)
## function to add
REF_PATH = 'raw_data/reference'
BOUND_PATH = file.path(DATA_PATH, 'clean_data/boundaries')
get_bound = function(name) {
file.path(BOUND_PATH, name) %>%
st_read() %>%
st_transform(proj_crs)
}
## CALGARY
cgy.bbox =
st_read(pather(c(REF_PATH, 'cgy_bbox.shp'))) %>%
# reduce by 3 miles
st_buffer(-4828.03) %>%
st_make_grid(., n=1) %>%
st_sf()
cgy.bound = get_bound('CGY_citybound.shp')
## EDMONTON
edm.bbox =
st_read(pather(c(REF_PATH, 'edm_bbox.shp'))) %>%
# reduce by 3 miles
st_buffer(-4828.03) %>%
st_make_grid(., n=1) %>%
st_sf()
edm.bound = get_bound('EDM_citybound.shp')
WATER_PATH = pather(c(REF_PATH, 'alberta_water.shp'))
WATER_ALL_PATH = pather(c(REF_PATH, 'alberta_water_all.shp'))
alta.water = st_read(WATER_PATH) %>% st_transform(proj_crs)
cgy.water = alta.water %>% st_intersection(cgy.bbox)
edm.water = alta.water %>% st_intersection(edm.bbox)
rm(alta.water)
alta.water.all = st_read(WATER_ALL_PATH) %>% st_transform(proj_crs)
cgy.water.all = alta.water.all %>% st_intersection(cgy.bbox)
edm.water.all = alta.water.all %>% st_intersection(edm.bbox)
rm(alta.water.all)
Our two project areas of Calgary & Edmonton can be seen in Map 1.1. With the map having the same scale, it is apparent that they share a few similarities in: 1. Shape 2. Geographic Size 3. Orientation around a Major River
geom_bound = function(data=cgy.bbox, color = 'black', lwd=1, fill=NA, ...){
geom_sf(data=data,color=color,lwd=lwd,fill=fill,...)
}
geom_edm_bbox = function(data=edm.bbox, color = 'black', lwd=1, fill=NA, ...){geom_bound(data,color,lwd,fill,...)}
geom_cgy_bbox = function(data=cgy.bbox, color = 'black', lwd=1, fill=NA, ...){geom_bound(data,color,lwd,fill,...)}
geom_edm_bound = function(data=edm.bound, color = 'black', lwd=1, fill=NA, ...){geom_bound(data,color,lwd,fill,...)}
geom_cgy_bound = function(data=cgy.bound, color = 'black', lwd=1, fill=NA, ...){geom_bound(data,color,lwd,fill,...)}
water_color = '#00A9E6'
edm_plot =
ggplot() +
geom_edm_bound(fill='lightgrey', alpha=.5) +
geom_sf(data=edm.water, fill='#00A9E6', color=NA) +
geom_edm_bound() +
geom_edm_bbox() +
labs(title = "Map 1.1: Edmonton", subtitle='Alberta, Canada') +
mapTheme
cgy_plot =
ggplot() +
geom_cgy_bound(fill='lightgrey', alpha=.5) +
geom_sf(data=cgy.water, fill='#00A9E6', color=NA) +
geom_cgy_bound() +
geom_cgy_bbox() +
labs(title = "Map 1.1: Calgary", subtitle='Alberta, Canada') +
mapTheme
ggarrange(
cgy_plot,
edm_plot,
ncol=2,
common.legend = TRUE,
legend = 'right'
)
The city’s elevations, in Map 1.2, negate the geographic similarities in the last map. Calgary has a much higher overall elevation and sits at the foothills of the Rocky Mountains. Although Edmonton is 183 miles Northeast of Calgary, it sits 500 meters lower than than its southern neighbor. Not to mention that Edmonton is relatively flat. I had to configure the breaks so that you see at least two elevation breaks in the Edmonton map.
# IMPORT
DEM_PATH = 'raw_data/dem'
CGY_DEM_PATH = pather(c(DEM_PATH, 'cgy_dem_100m.tif'))
cgy.dem.100 =
raster(CGY_DEM_PATH) %>%
crop(., cgy.bbox)
EDM_DEM_PATH = pather(c(DEM_PATH, 'edm_dem_100m.tif'))
edm.dem.100 = raster(EDM_DEM_PATH) %>%
crop(., edm.bbox)
Spectral6 = c("#d53e4f","#fc8d59","#fee08b","#e6f598","#99d594","#3288bd")
Spectral7 = c("#d53e4f","#fc8d59","#fee08b","#ffffbf","#e6f598","#99d594","darkgreen")#"#3288bd")
pbreaks = c(0, 600, 700, 800, 900, 1000, 1200, 3000)
plabs = c('<600', '<700', '<800', '<900', '<1,000', '<1,200', '<1,400')
pColors = Spectral7
ggarrange(
ggplot() +
geom_raster(data=rast(cgy.dem.100),
aes(x,y,fill=as.factor(cut(value, breaks=pbreaks)))
) +
geom_sf(data=cgy.water, fill='lightgrey', color=NA) +
scale_fill_manual(values = pColors,
labels=plabs,
name="Elevation (Meters)",
drop = FALSE
) +
geom_cgy_bound() +
labs(title = "Map 1.2: Calgary Elevation", subtitle='Alberta, Canada') +
mapTheme,
ggplot() +
geom_raster(data=rast(edm.dem.100),
aes(x,y,fill=as.factor(cut(value, breaks=pbreaks)))
) +
geom_sf(data=edm.water, fill='lightgrey', color=NA) +
scale_fill_manual(values = pColors,
labels=plabs,
name="Elevation (Meters)",
drop = FALSE
) +
geom_edm_bound() +
labs(title = "Map 1.2: Edmonton Elevation", subtitle='Alberta, Canada') +
mapTheme,
ncol=2,
common.legend = TRUE,
legend = 'right'
)
Now jumping to the flood indudation map in Map 1.3, it appears that Calgary has larger sections of flood hazards. Not to mention that main area in the center is downtown (thinking back to the 2013 Calgary flood picture in the introduction). The larger red “Floodway” in Calgary to the center-west is the up-stream reservoir that couldn’t contain the excess flood in 201, leading to overflows of the river all throughout the rest of the city.
Although Edmonton has a sizeable flood hazard in the Northeast section of the city, it is at a slight lower elevation and located outside the main city area. So even if Edmonton’s single river flooded, it narrow and deep enough to mitigate the damages.
FLOOD_PATH = pather(c(REF_PATH, 'alberta_floodhazard.shp'))
cgy.flood = st_read(FLOOD_PATH) %>% st_intersection(cgy.bbox)
edm.flood = st_read(FLOOD_PATH) %>% st_intersection(edm.bbox)
edm_plot =
ggplot() +
geom_edm_bound(fill='lightgrey', alpha=.5) +
geom_sf(data=edm.flood, aes(fill=Multi_Zone), color=NA) +
geom_sf(data=edm.water, fill='#00A9E6', color=NA) +
geom_edm_bound() +
geom_edm_bbox() +
scale_fill_manual(values = c('salmon','red','maroon'),name="Flood Zone Types") +
labs(title = "Map 1.3: Edmonton Flood Zones", subtitle='Alberta, Canada') +
mapTheme
cgy_plot =
ggplot() +
geom_cgy_bound(fill='lightgrey', alpha=.5) +
geom_sf(data=cgy.flood, aes(fill=Multi_Zone), color=NA) +
geom_sf(data=cgy.water, fill='#00A9E6', color=NA) +
geom_cgy_bound() +
geom_cgy_bbox() +
scale_fill_manual(values = c('salmon','red','maroon'),name="Flood Zone Types") +
labs(title = "Map 1.4: Calgary Flood Zones", subtitle='Alberta, Canada') +
mapTheme
ggarrange(
cgy_plot,
edm_plot,
ncol=2,
common.legend = TRUE,
legend = 'right'
)
cgy.flood =
rbind(
cgy.flood %>% dplyr::select(geometry),
cgy.water %>% dplyr::select(geometry)
) %>%
st_union() %>%
st_sf() %>%
mutate(flood=1)
edm.flood =
rbind(
edm.flood %>% dplyr::select(geometry),
edm.water %>% dplyr::select(geometry)
) %>%
st_union() %>%
st_sf() %>%
mutate(flood=1)
With Calgary’s more concerning flood hazards, we will join all of this stated data to fishnets. The fishnets are 100m by 100m cells that are based on the DEM’s raster. This allows the elevation to easily compare to other features we create.
cgy.fish =
cgy.dem.100 %>%
raster_to_fishnet(.) %>%
rownames_to_column("id.fishnet") %>%
mutate(id.fishnet = as.numeric(id.fishnet)) %>%
st_sf(., crs=proj_crs) %>%
rename(dem_100m = cgy_dem_100m)
edm.fish =
edm.dem.100 %>%
raster_to_fishnet(.) %>%
rownames_to_column("id.fishnet") %>%
mutate(id.fishnet = as.numeric(id.fishnet)) %>%
st_sf(., crs=proj_crs) %>%
rename(dem_100m = edm_dem_100m)
edm.empty_raster = edm.dem.100
edm.empty_raster[] = NA
cgy.empty_raster = cgy.dem.100
cgy.empty_raster[] = NA
st_rasterize = function(focus_sf, empty_raster){
focus_sf %>%
as(.,'Spatial') %>%
rasterize(., empty_raster)
}
edm.flood.rast = st_rasterize(edm.flood, edm.empty_raster)
edm.flood.rast[is.na(edm.flood.rast)] = 0
cgy.flood.rast = st_rasterize(cgy.flood, cgy.empty_raster)
cgy.flood.rast[is.na(cgy.flood.rast)] = 0
# add raster to col
add_raster_as_col = function(
# raster
focus_raster,
# focus col to rename in raster
# & add to focus_fishnet as column
focus_col,
focus_fishnet){
st_join(
focus_fishnet %>% dplyr::select(geometry),
focus_raster %>%
raster_to_centroid() %>%
rename_first_col(.,focus_col)
) %>%
st_drop_geometry() %>%
pull(focus_col)
}
edm.fish$flood =
add_raster_as_col(edm.flood.rast, 'flood', edm.fish)
cgy.fish$flood =
add_raster_as_col(cgy.flood.rast, 'flood', cgy.fish)
Map 1.4 highlights these flood hazards as fishnet cells, rather than the winding vector shapes. Whether of not the cell is a flood indundation zone, will be the dependent variable of our logistic regression in Section 3.
colors2 = c('white', 'red')
edm_plot =
ggplot() +
geom_edm_bound(fill='lightgrey', alpha=.5) +
geom_sf(data=edm.fish, aes(fill=as.factor(flood)), color=NA) +
geom_edm_bound() +
geom_edm_bbox() +
scale_fill_manual(values = colors2,name="Flood Zones") +
labs(title = "Map 1.4: Edmonton Flood Fishnets", subtitle='Alberta, Canada') +
mapTheme
cgy_plot =
ggplot() +
geom_cgy_bound(fill='lightgrey', alpha=.5) +
geom_sf(data=cgy.fish, aes(fill=as.factor(flood)), color=NA) +
geom_cgy_bound() +
geom_cgy_bbox() +
scale_fill_manual(values = colors2,name="Flood Zones") +
labs(title = "Map 1.4: Calgary Flood Fishnets", subtitle='Alberta, Canada') +
mapTheme
ggarrange(
cgy_plot,
edm_plot,
ncol=2,
common.legend = TRUE,
legend = 'right'
)
Before modeling, this section will try to form other features that can predict flooding. In this project, we will be looking at topological formations, distances to water, and soil types.
For topology features, we will be looking at the cell’s slope, flow direction, and the aspect (defining which direction the slope is pointed).
# 1. slope
cgy.slope = terrain(cgy.dem.100, opt = 'slope', unit = 'degrees')
# 2. water flow
cgy.flow = terrain(cgy.dem.100, opt="flowdir", neighbors=8)
# 3. aspect/orientation
cgy.aspect = terrain(cgy.dem.100, opt = "aspect", unit = "degrees")
# 4. Topographic Position Index (TPI) of elevation
## difference between the [elevation] of a cell and the mean [elevation] of its 8 surrounding cells
cgy.tpi = terrain(cgy.dem.100, opt = "tpi")
# 5. Terrain Ruggedness Index (TRI) of elevation
## the mean of the absolute differences between the [elevation] of a cell and the [elevation] of its 8 surrounding cells
cgy.tri = terrain(cgy.dem.100, opt = "tri")
# 6. Roughness of elevation
## the difference between the maximum and the minimum [elevation] of a cell and its 8 surrounding cells
cgy.rough = terrain(cgy.dem.100, opt = "roughness")
# 1. slope
edm.slope = terrain(edm.dem.100, opt = 'slope', unit = 'degrees')
# 2. water flow
edm.flow = terrain(edm.dem.100, opt="flowdir", neighbors=8)
# 3. aspect/orientation
edm.aspect = terrain(edm.dem.100, opt = "aspect", unit = "degrees")
# 4. Topographic Position Index (TPI) of elevation
## difference between the [elevation] of a cell and the mean [elevation] of its 8 surrounding cells
edm.tpi = terrain(edm.dem.100, opt = "tpi")
# 5. Terrain Ruggedness Index (TRI) of elevation
## the mean of the absolute differences between the [elevation] of a cell and the [elevation] of its 8 surrounding cells
edm.tri = terrain(edm.dem.100, opt = "tri")
# 6. Roughness of elevation
## the difference between the maximum and the minimum [elevation] of a cell and its 8 surrounding cells
edm.rough = terrain(edm.dem.100, opt = "roughness")
cgy.slope[is.na(cgy.slope)]=0
cgy.fish$topo.slope =
add_raster_as_col(cgy.slope, 'slope', cgy.fish)
cgy.flow[is.na(cgy.flow)]=0
cgy.fish$topo.flow =
add_raster_as_col(cgy.flow, 'flow', cgy.fish)
cgy.aspect[is.na(cgy.aspect)]=0
cgy.fish$topo.aspect =
add_raster_as_col(cgy.aspect, 'aspect', cgy.fish)
edm.slope[is.na(edm.slope)]=0
edm.fish$topo.slope =
add_raster_as_col(edm.slope, 'slope', edm.fish)
edm.flow[is.na(edm.flow)]=0
edm.fish$topo.flow =
add_raster_as_col(edm.flow, 'flow', edm.fish)
edm.aspect[is.na(edm.aspect)]=0
edm.fish$topo.aspect =
add_raster_as_col(edm.aspect, 'aspect', edm.fish)
Map 2.1 shows these three topology features for each city. Starting at slope, Map 2.1A shows that Calgary has a West to East downhill with many streams & valleys – all aiming at the center of Calgary. Edmonton’s slopes, in Map 2.1D, shows that the steepest areas are the banks of the single river. The river’s depth will make it relatively harder to overflow.
Calgary & Edmonton’s aspects help reinforce the downhill direction and canyons. Calgary’s Westerly canyons & streams funnel towards the center of the city then exit to the south. Edmonton’s single river cuts through the city in a Northeasterly direction.
cgy_plot1 =
ggplot() +
geom_raster(data=rast(cgy.slope),aes(x,y,fill=value)) +
geom_cgy_bbox() +
geom_cgy_bound() +
scale_fill_viridis(option = "D", name='') +
labs(title='Map 2.1A: Calgary Slope') +
mapTheme +
guides(color = guide_legend(override.aes = list(size = 0.5)))
cgy_plot2 =
ggplot() +
geom_raster(data=rast(cgy.flow),aes(x,y,fill=value)) +
geom_cgy_bbox() +
geom_cgy_bound() +
scale_fill_viridis(option = "A", name='') +
labs(title='Map 2.1B: Calgary Flow') +
mapTheme +
guides(color = guide_legend(override.aes = list(size = 0.5)))
cgy_plot3 =
ggplot() +
geom_raster(data=rast(cgy.aspect),aes(x,y,fill=value)) +
geom_cgy_bbox() +
geom_cgy_bound() +
scale_fill_viridis(option = "A", name='') +
labs(title='Map 2.1C: Calgary Aspect') +
mapTheme +
guides(color = guide_legend(override.aes = list(size = 0.5)))
edm_plot1 =
ggplot() +
geom_raster(data=rast(edm.slope),aes(x,y,fill=value)) +
geom_edm_bbox() +
geom_edm_bound() +
scale_fill_viridis(option = "D", name='') +
labs(title='Map 2.1B: Edmonton Slope') +
mapTheme +
guides(color = guide_legend(override.aes = list(size = 0.5)))
edm_plot2 =
ggplot() +
geom_raster(data=rast(edm.flow),aes(x,y,fill=value)) +
geom_edm_bbox() +
geom_edm_bound() +
scale_fill_viridis(option = "A", name='') +
labs(title='Map 2.1E: Edmonton Flow') +
mapTheme +
guides(color = guide_legend(override.aes = list(size = 0.5)))
edm_plot3 =
ggplot() +
geom_raster(data=rast(edm.aspect),aes(x,y,fill=value)) +
geom_edm_bbox() +
geom_edm_bound() +
scale_fill_viridis(option = "A", name='') +
labs(title='Map 2.1F: Edmonton Aspect') +
mapTheme +
guides(color = guide_legend(override.aes = list(size = 0.5)))
ggarrange(
cgy_plot1,
cgy_plot2,
cgy_plot3,
edm_plot1,
edm_plot2,
edm_plot3
)
Another important consideration is how far areas are to water. Map 2.2 shows that the cities appear to have the same amount of close distances to water. However, Calgary’s distances emphasize that most areas are close to a major river or stream. Edmonton has more lakes, leading to semingly shorter distances. But the lakes could potentially capture more flood water
cgy.water.raster = st_rasterize(cgy.water.all, cgy.empty_raster)
cgy.water.raster[is.na(cgy.water.raster)] = 0
cgy.fish$water =
add_raster_as_col(cgy.water.raster, 'water', cgy.fish)
edm.water.raster = st_rasterize(edm.water.all, edm.empty_raster)
edm.water.raster[is.na(edm.water.raster)] = 0
edm.fish$water =
add_raster_as_col(edm.water.raster, 'water', edm.fish)
cgy.fish$water.dist =
st_distance(
st_centroid(cgy.fish),
st_cast(cgy.water$geometry, "LINESTRING") %>%
st_union() %>%
st_as_sf()
) %>%
as.numeric()
edm.fish$water.dist =
st_distance(
st_centroid(edm.fish),
st_cast(edm.water$geometry, "LINESTRING") %>%
st_union() %>%
st_as_sf()
) %>%
as.numeric()
edm_plot =
ggplot() +
geom_edm_bound(fill='lightgrey', alpha=.5) +
geom_sf(data=edm.fish, aes(fill=water.dist), color=NA) +
geom_edm_bound() +
geom_edm_bbox() +
labs(title = "Map 2.2: Edmonton Distance to Water", subtitle='Alberta, Canada') +
mapTheme
cgy_plot =
ggplot() +
geom_cgy_bound(fill='lightgrey', alpha=.5) +
geom_sf(data=cgy.fish, aes(fill=water.dist), color=NA) +
geom_cgy_bound() +
geom_cgy_bbox() +
labs(title = "Map 2.2: Calgary Distance to Water", subtitle='Alberta, Canada') +
mapTheme
ggarrange(
cgy_plot,
edm_plot,
ncol=2,
common.legend = TRUE,
legend = 'right'
)
An interesting aspect to bring in is ‘how likely will the flood water be absorbed by the ground’? Map 2.3 shows the different types of sandy soils. Both cities appear to have an equal amount of sandy soils, but different types. Sand, gravel, and clay don’t absorb water as well as other soils, including silt. Calgary’s large amount of non-absorbing sand in the middle of the city could make it difficult for the water to soak up.
alta.soils =
pather(c('raw_data/sand_and_gravel', 'sand_and_gravel_proj.shp')) %>%
st_read(.)
cgy.soils = alta.soils %>% st_intersection(cgy.bbox)%>%
mutate(
soil_materials = ifelse(
MATERIAL=="Sand and silt", 'soil.sand_silt',
ifelse(
MATERIAL=="Sand, silt and clay, minor gravel", 'soil.sand_silt_clay',
"soil.sand_gravel"
)))
edm.soils = alta.soils %>% st_intersection(edm.bbox) %>%
mutate(
soil_materials = ifelse(
MATERIAL=="Sand and silt", 'soil.sand_silt',
ifelse(
MATERIAL=="Sand, silt and clay, minor gravel", 'soil.sand_silt_clay',
"soil.sand_gravel"
)))
get_soil_type = function(main_soils, material_type, empty_raster, focus_fish){
focus_soils = main_soils %>%
filter(soil_materials==material_type) %>%
st_union() %>% st_sf() %>%
mutate(value=1) %>% dplyr::select(value)
focus_soil_raster = st_rasterize(focus_soils, empty_raster)
focus_soil_raster[is.na(focus_soil_raster)] = 0
return(add_raster_as_col(focus_soil_raster, material_type, focus_fish))}
material_types = edm.soils$soil_materials %>% unique()
for(material_type in material_types){
edm.fish[[material_type]] =
get_soil_type(edm.soils, material_type, edm.empty_raster, edm.fish)
}
material_types = cgy.soils$soil_materials %>% unique()
for(material_type in material_types){
cgy.fish[[material_type]] =
get_soil_type(cgy.soils, material_type, cgy.empty_raster, cgy.fish)
}
edm_plot =
edm.fish %>%
dplyr::select(soil.sand_silt, soil.sand_gravel, soil.sand_silt_clay) %>%
transmute(
`Sand & Gravel` = soil.sand_gravel,
`Sand & Silt` = soil.sand_silt,
`Sand, Silt, & Clay` = soil.sand_silt_clay
) %>%
gather(Variable, Value, -geometry) %>%
ggplot(.) +
geom_sf(
aes(fill=as.factor(Value)),
color=NA) +
geom_edm_bbox() +
geom_edm_bound() +
facet_wrap(~Variable) +
scale_fill_manual(values = c('#FFFFFF','red'),
labels=c("","Soil Type"),
name="") +
labs(title="Map 2.3: Edmonton Soil Types", subtitle="Alberta, Canada") +
mapTheme +
theme(legend.position="none")
cgy_plot =
cgy.fish %>%
dplyr::select(soil.sand_silt, soil.sand_gravel, soil.sand_silt_clay) %>%
transmute(
`Sand & Gravel` = soil.sand_gravel,
`Sand & Silt` = soil.sand_silt,
`Sand, Silt, & Clay` = soil.sand_silt_clay
) %>%
gather(Variable, Value, -geometry) %>%
ggplot(.) +
geom_sf(
aes(fill=as.factor(Value)),
color=NA) +
geom_cgy_bbox() +
geom_cgy_bound() +
facet_wrap(~Variable) +
scale_fill_manual(values = c('#FFFFFF','red'),
labels=c("","Soil Type"),
name="") +
labs(title="Map 2.3: Calgary Soil Types", subtitle="Alberta, Canada") +
mapTheme +
theme(legend.position="none")
ggarrange(
cgy_plot,
edm_plot,
ncol=1
)
To model which areas are Flooding Hazards, I will be using a logistic regression. With the dependent variable being a binary of flooding hazard zones. The independent/predictor variables have been the list of elevation, topology, water, and soil columns. This model will be run on both cities so we can see which predictors are more valuable in each geography. Even though we displayed values outside of city boundaries, the model will only evaluate fishnet cells inside city boundaries.
The model will have a 75%/25% train/test split
edm.fish.final =
edm.fish %>%
st_intersection(edm.bound %>% dplyr::select(geometry)) %>%
dplyr::select(-id.fishnet, -water)
cgy.fish.final =
cgy.fish %>%
st_intersection(cgy.bound %>% dplyr::select(geometry)) %>%
dplyr::select(-id.fishnet, -water)
###########
####
set.seed(420)
trainIndex =
createDataPartition(edm.fish.final$flood, p = .75,
list = FALSE,
times = 1)
edm.fish.train = edm.fish.final[ trainIndex,]
edm.fish.test = edm.fish.final[-trainIndex,]
set.seed(420)
trainIndex =
createDataPartition(cgy.fish.final$flood, p = .75,
list = FALSE,
times = 1)
cgy.fish.train = cgy.fish.final[ trainIndex,]
cgy.fish.test = cgy.fish.final[-trainIndex,]
########
dv.var = 'flood'
iv.vars = colnames(edm.fish.final)
iv.vars = iv.vars[!(iv.vars %in% c(dv.var,'geometry'))]
edm_dev_formula =
paste(iv.vars, collapse=" + ") %>%
paste0(dv.var,' ~ ', .) %>%
as.formula()
model.edm =
glm(formula=edm_dev_formula,
family="binomial"(link="logit"),
data = edm.fish.train, maxit = 100)
dv.var = 'flood'
iv.vars = colnames(cgy.fish.final)
iv.vars = iv.vars[!(iv.vars %in% c(dv.var,'geometry'))]
cgy_dev_formula =
paste(iv.vars, collapse=" + ") %>%
paste0(dv.var,' ~ ', .) %>%
as.formula()
model.cgy =
glm(formula=cgy_dev_formula,
family="binomial"(link="logit"),
data = cgy.fish.train, maxit = 100)
The logistic model summaries will show potent variables through their coefficients, p-values and Odds Ratio metrics. For Odd Ratio, if a value is farther from 1 (whether higher or lower) then it has a higher likelihood of being an effective variable. Closer to 1, then there is less of a likelihood of a strong relationship with flood zones.
For the Calgary model, most variables are fairly close to a 1 Odds Ratio. This could be due to these variables being too consistent throughout the entire city. Or the predictor variables could be too similar.
Water distance was pretty even throughout the city, but I would think that it would be impactful. It is likely that water distance, elevation, and slope could all locate the same cells in the flood zone. Although this is helpful, if they are predicting the same thing, it could be diluting the model due to multicollinearity (which I should have tested for).
Variables that are significant include all of the soil types.
library(broom)
focus_model = model.cgy
model.sum =
focus_model %>%
tidy() %>%
#filter(!grepl("id.dist",term)) %>%
mutate(
Variable = term,
Estimate = estimate %>% round(3) %>% format(big.mark=','),
std.error = std.error %>% round(3) %>% format(big.mark=','),
t.value = round_thresh(statistic, thresh = .001, digits=3),
p.value = p.value %>% round_thresh(., thresh = .001, digits=3)
) %>%
dplyr::select(Variable, Estimate, std.error, t.value, p.value)
model.sum.CI =
exp(
cbind(
OR = coef(focus_model),
confint(focus_model)
)
) %>%
round(3) %>%
as.data.frame()
## Waiting for profiling to be done...
model.sum.CI$Variable = rownames(model.sum.CI)
model.all.summ.CI =
model.sum %>%
merge(
.,
model.sum.CI,
on='Variable',
sort = FALSE
)
model.all.summ.CI[!apply( model.all.summ.CI, 1, function(x) sum(is.na(x))>1 ), ] %>%
mutate(
Estimate = Estimate, #%>% as.numeric() %>% round(2),
std.error = std.error, # %>% as.numeric() %>% round(0),
OR = ifelse(
(OR>1000 | OR == Inf),
'>1,000',
OR %>% round_thresh(., int_check = F)),
# `2.5 %`=`2.5 %` %>%
# round_thresh(., int_check = F),
# `97.5 %`=ifelse(
# (`97.5 %` >1000 | `97.5 %` == Inf),
# '>1,000',
# `97.5 %` %>% round_thresh(., int_check = F))
) %>%
dplyr::select(-`2.5 %`, -`97.5 %`) %>%
flextable(.) %>%
theme_vanilla(.) %>%
# align(.,
# align = "center",
# part = "header") %>%
# align(., j=3:length(cols)-2,
# align = "right",
# part = "body") %>%
set_table_properties(
., layout='autofit') #%>%
The low estimate coefficients and closet-to-1 Odds Ratios suggest that this model might not have enough features to provide a cohesive prediction. We will have to wait for the results section.
The Edmonton model appears to have more significant variables – but still a low-amount. The soil types appear to be the have the highest Odds Ratio values. This could be due to sandy soils not being as present in the city besides the small river area – which happens to be have the most flooding.
(There was an odd knitting error that prevented me from adding the Odds Ratio column to this table)
focus_model = model.edm
model.sum =
focus_model %>%
tidy() %>%
#filter(!grepl("id.dist",term)) %>%
mutate(
Variable = term,
Estimate = estimate %>% round(3) %>% format(big.mark=','),
std.error = std.error %>% round(3) %>% format(big.mark=','),
t.value = round_thresh(statistic, thresh = .001, digits=3),
p.value = p.value %>% round_thresh(., thresh = .001, digits=3)
) %>%
dplyr::select(Variable, Estimate, std.error, t.value, p.value)
# model.sum.CI =
# exp(
# cbind(
# OR = coef(focus_model),
# confint(focus_model)
# )
# ) %>%
# round(3)
# model.sum.CI$Variable = rownames(model.sum.CI)
# model.all.summ.CI =
# model.sum %>%
# merge(
# .,
# model.sum.CI,
# on='Variable',
# sort = FALSE
# )
model.sum[!apply( model.sum, 1, function(x) sum(is.na(x))>1 ), ] %>%
mutate(
Estimate = Estimate, #%>% as.numeric() %>% round(2),
std.error = std.error, # %>% as.numeric() %>% round(0),
# OR = ifelse(
# (OR>1000 | OR == Inf),
# '>1,000',
# OR %>% round_thresh(., int_check = F)),
# `2.5 %`=`2.5 %` %>%
# round_thresh(., int_check = F),
# `97.5 %`=ifelse(
# (`97.5 %` >1000 | `97.5 %` == Inf),
# '>1,000',
# `97.5 %` %>% round_thresh(., int_check = F))
) %>%
#dplyr::select(-`2.5 %`, -`97.5 %`) %>%
flextable(.) %>%
theme_vanilla(.) %>%
# align(.,
# align = "center",
# part = "header") %>%
# align(., j=3:length(cols)-2,
# align = "right",
# part = "body") %>%
set_table_properties(
., layout='autofit') #%>%
Our model won’t immediately give us the predicted flood zones. Instead it will give us the probability it is a flood zone. To determine the true predictive nature of our binomial logistic model, we will:
Determine the Cut-Off threshold of how to label the outcomes of our New Developments Dependent Variable,
Re-Evaluate the Specificity, Sensitivity, & Misclassification rates, then
Use a ROC Curve to determine the quality of our fit
To save time, I will just be evaluating Calgary’s results more than Edmonton’s.
Using a function that finds where the Sensitivity and Specificity rates intersect (where the True Positive & Negative rate is maximized), we find that the optimum cut-off threshold is 7.24%.
7% appears like a low probability, but this is likely due to flood hazard zones being a smaller share of the cells in the city. We will have to better understand which share it is actually predicting.
# lm.dwtn.all$labels
#
library(ROCR)
focus_dv = 'flood'
focus_df = cgy.fish.train
focus_glm = model.cgy
focus.ROC = cbind(focus_df[[focus_dv]], focus_glm$fitted.values) %>%
as.data.frame()
colnames(focus.ROC) = c("labels","predictions")
pred = prediction(focus.ROC$predictions, focus.ROC$labels)
ROC.perf = performance(pred, measure = "tpr", x.measure="fpr")
####
opt.cut = function(perf, pred){
cut.ind = mapply(FUN=function(x, y, p){
d = (x - 0)^2 + (y-1)^2
ind = which(d == min(d))
c(sensitivity = y[[ind]],
specificity = 1-x[[ind]],
cutoff = p[[ind]])
}, perf@x.values, perf@y.values, pred@cutoffs)
}
cut_off_opt_df = opt.cut(ROC.perf, pred) %>%
t() %>% as.data.frame()
cut_off_opt = cut_off_opt_df$cutoff
cut_off_list = c(0.001, 0.01, 0.05, .11, 0.25, .5, .75)
cut_off_list = c(cut_off_list,cut_off_opt)
cut_off_values = get_cut_off_values(cut_off = cut_off_list)
sf.cutoff = get_sens_spec_miss(
focus_df = cut_off_values,
cut_off=cut_off_list[1])
for (curr_cut_off in cut_off_list[2:length(cut_off_list)]){
sf.cutoff = rbind(
sf.cutoff,
get_sens_spec_miss(
focus_df = cut_off_values,
cut_off = curr_cut_off)
) %>% as.data.frame()
row.names(sf.cutoff) = NULL}
sf.cutoff =
sf.cutoff %>%
round(4) %>% arrange(Cut_Off) %>%
mutate(
Cut_Off = sub("0+$", "", as.character(Cut_Off)) %>%
str_remove(., "^0+")
)
cols = colnames(sf.cutoff)
new_cols = setNames(
replace(cols, cols=='Cut_Off', 'Cut-Off Value'),
cols)
#sf.cutoff.flex =
sf.cutoff %>%
mutate(Cut_Off =
ifelse(Cut_Off=='.072',
'7.2%*',
Cut_Off %>%
as.numeric() %>%
percent_formatter(n=3)
),
Sensitivity = Sensitivity %>%
as.numeric() %>%
percent_formatter(n=1),
Specificity = Specificity %>%
as.numeric() %>%
percent_formatter(n=1),
Misclassification = Misclassification %>%
as.numeric() %>%
percent_formatter(n=1)
) %>%
flextable(., col_keys = cols) %>%
set_header_labels(
.,
values = new_cols) %>%
theme_vanilla(.) %>%
align(.,
align = "center",
part = "header") %>%
align(., j=2:length(cols),
align = "right",
part = "body") %>%
align(., j=1,
align = "center",
part = "body") %>%
set_table_properties(
., layout='autofit') %>%
bg(., i = ~ Cut_Off == '9.1%*', bg = "wheat", part = "body") %>%
add_footer_row(
.,
values=c("* Optimium Cut-Off"),
colwidths = c(length(cols)))
The confusion matrix for the threshold of 7.24% highlights the True/False Negative/Positive rates that form the previously discussed rates. Our True Positives (90%) are high. But so is our False Positive (7%) – suggesting that the model is definitely over predicting New Developments.
Is this because our model needs to more features?
cgy.fish =
cgy.fish %>%
mutate(
flood.predict.pct = predict(
model.cgy,
cgy.fish, # %>% na.omit(),
type= "response"),
flood.predict =
as.factor(
ifelse(
flood.predict.pct > cut_off_opt,
1,
0
)))
testlen = nrow(cgy.fish)
cgy.fish.test.mx = caret::confusionMatrix(
cgy.fish$flood.predict %>% as.factor(),
cgy.fish$flood %>% as.factor()) %>%
as.matrix(., what = "xtabs") %>%
as.data.frame() %>%
transmute(
Results = c('Predicted\nOutcome'),
Outcome = c("Below Cutoff\n(No Dev)","Above Cutoff\n(Developed)"),
`Below Cutoff\n(No Dev)` = (`0`/testlen) %>% percent_formatter() %>%
paste(c('(a)','(c)'), ., sep=' '),
`Above Cutoff\n(Developed)` = (`1`/testlen) %>% percent_formatter() %>%
paste(c('(b)','(d)'), ., sep=' ')
)
rownames(cgy.fish.test.mx) = NULL
rename_cols = setNames(
c("", "", "Below Cutoff\n(No Dev)", "Above Cutoff\n(Developed)" ),
colnames(cgy.fish.test.mx))
cgy.fish.test.mx %>%
flextable(.) %>%
theme_vanilla(.) %>%
align(.,
align = "center",
part = "header") %>%
align(., j=2:4,
align = "center",
part = "body") %>%
align(., j=1,
align = "center",
part = "body") %>%
set_table_properties(
., layout='autofit') %>%
merge_v(j = ~Results) %>%
set_header_labels(
.,
values = rename_cols) %>%
add_header_row(
.,
values = c('', 'Observed Outcome'),
colwidths = c(2,2)) %>%
bold(., i = 1:2, j = 1:2, bold = TRUE, part = "body") %>%
add_footer_row(
.,
values=c(glue("Cutoff = {percent_formatter(cut_off_opt,n=3)} \nSpecificity = a/(a+c)\nSensitivity = d/(d+b)\nMisclassification Rate = (b+c)/(a+b+c+d)")),
colwidths = c(4))
The Figure 2.1 Predicted Probabilities graph below provides a visual of the confusion matrix. My concerns that the dataset just has too many True Negatves seems valid. The probability of flood hazard zones are typically low, which could make it difficult to make a prediction with surface level features.
auc.perf = performance(pred, measure ="auc")
AUC = auc.perf@y.values[[1]] %>% round(4)
palette2 = c("#981FAC","#FF006A")
ggplot() +
# geom_density(
# data=focus.ROC,
# aes(x = predictions, group=labels),
# fill='grey90', color='black', lwd=1.5
# ) +
geom_density(
data=focus.ROC %>%
mutate(
labels = ifelse(
labels==1,
'New Development',
'No Change in Land Use'
)),
aes(x = predictions, group=labels, fill=labels, color=labels),
fill='grey90', lwd=1.25 #, color='orange'
# linetype='dashed'
) +
facet_grid(labels ~ .) +
scale_fill_manual(values = palette2) +
labs(
x = "Probability",
y = "Density of Probabilities",
title = "Figure 2.1: Predicted Probabilities - Calgary Flood Zones",
subtitle = glue("Optimal Cut-Off Rate = {cut_off_opt %>% percent_formatter(n=1)} (dashed line)")
) +
scale_x_continuous(
labels = function(num) num %>% percent_formatter(),
name="Probability") +
geom_vline(xintercept=cut_off_opt,
linetype='dashed') +
theme(strip.text.x = element_text(size = 18),
legend.position = "none") +
plotTheme
#flood.off()
The Figure 2.2 ROC curve is a goodness-of-fit plot of the true positive rate (i.e., sensitivity) against false positive rate (i.e., specificity). A curve at a 45 degree angle (i.e. just a diagonal line) suggests the model is inaccurate. A curve with a perfect 90 degree angle would be very predictive but is likely the result of over fitting. In the chart below, our curve’s semi-bow shape is concerning as it leans closer to a 45-degree angle than the 90-degree right angle.
The area under our model’s ROC curve is 0.9185 – suggesting that our model is high quality! But with the information from the Confusion Matrix & Graph, it is high quality at predicting the majority of our data – which is not hazard zones. Lets get a map to truly understand the locational issues
#library(rROC)
library(plotROC)
ggplot() +
geom_roc(
data = focus.ROC,
aes(d = labels, m = predictions),
n.cuts = 50, labels = FALSE, colour = "#000000") +
# geom_roc(
# data = test.predict,
# aes(d = as.numeric(test.predict$outcome), m = predict.2),
# n.cuts = 50, labels = FALSE, colour = "#FE9900") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(
title = "Figure 2.2: ROC Curve - Calgary Flood Zones",
subtitle = glue("Area Under Curve = {AUC}")
)
#### Confusion Map
Map 5.1 secures the evidence that the model is over estimating certain areas but underestimating others. Our model’s predictions appear to be missing the smaller streams. The predictions make a heavy overestimate of the size of the flood areas near the river.
cgy.fish %>%
dplyr::select(flood, flood.predict) %>%
transmute(
`Observed` = flood,
`Predicted` = flood.predict
) %>%
gather(Variable, Value, -geometry) %>%
ggplot(
data=.,
aes(x=xyC(.data[['geometry']])[,1], y=xyC(.data[['geometry']])[,2],
colour=Value %>% as.factor())
) +
geom_point(shape=18) +
facet_wrap(~Variable) +
scale_colour_manual(values = c('#FFFFFF','#800000'), labels=c("No Hazards","Flood Hazards"),
name="") +
geom_cgy_bound() +
geom_cgy_bbox() +
labs(title="Map 2.1: Calgary Flood Zones", subtitle="Which areas are flood indundation zones? (with 7.2% Probability Cutoff)") +
mapTheme
Excuses aside, Map 2.2 highlights the results in a map. It highlights that the areas near rivers are over estimated as flood zones. There is likely a feature missing that stresses a difference between floodable area near the river. Maybe adding concrete or asphalt variables could help.
ConfusionMatrix.metrics <-
cgy.fish %>%
mutate(flood.True.Pos = ifelse(flood == 1 & flood.predict == 1, 1,0),
flood.True.Neg = ifelse(flood == 0 & flood.predict == 0, 1,0),
flood.False.Pos = ifelse(flood == 0 & flood.predict == 1, 1,0),
flood.False.Neg = ifelse(flood == 1 & flood.predict == 0, 1,0)
) %>%
dplyr::select(.,
c('flood.True.Pos', 'flood.True.Neg', 'flood.False.Pos', 'flood.False.Neg')) %>%
transmute(
`True Negative\nActually No Hazard` = flood.True.Neg,
`False Positive\nPredicted Flood Hazard` = flood.False.Pos,
`False Negative\nPredicted No Hazaerd` = flood.False.Neg,
`True Positive\nActually Flood Hazard` = flood.True.Pos
) %>%
gather(Variable, Value, -geometry)
ggplot(data=ConfusionMatrix.metrics) +
geom_point(aes(x=xyC(ConfusionMatrix.metrics)[,1],
y=xyC(ConfusionMatrix.metrics)[,2], colour = as.factor(Value))) +
facet_wrap(~Variable) +
scale_colour_manual(values = c('#FFFFFF', '#FF0000'),
labels=c("No Hazards","Flood Hazards"),
name="") +
geom_cgy_bound() +
geom_cgy_bbox() +
labs(title="Map 2.2: Calgary Flood Hazards",
subtitle="Red = Test Results (with 7.2% Cutoff)") +
mapTheme +
theme(legend.position = "none")
Although the Roc Curve suggested a high area under the curve, the other results charts suggest the model performs too much overestimation.
Overall the earlier visualizations show that Calgary has a physical and urban geography that is vulnerable to flooding. Although Edmonton has some shared features (e.g. rivers), it doesn’t have to worry about flooding as much.
Calgary’s history and vulnerability means that the city has to focus on understanding and mitigating the flooding. Even a local activist group is calling for a [Comprehensive Flood Mitigation Strategy] (http://protectcalgary.com/need-comprehensive-flood-mitigation-strategy-protect-economic-engine-canada-calgary-alberta/). To be fair, Calgary has already been active against another big flood by raising the reservoir walls, creating an early flood warning system, and developing upstream storage [source].
The prediction model created in this project could help understand which areas could be hazardous to flooding, but aren’t labeled as hazard zones.