I. Introduction

Bike share is a modern take on public transportation that is becoming a popular means for tourists, commuters, students, and others in cities across the country. In a bike share program, such as Indego, bicycles are strategically placed around the city at docking stations where anyone can purchase a pass to be able to use the bike for a chosen amount of time before returning it to any other station in the city. Because these systems are designed for the user to return a bike to another station, it could be very useful for the bike share company to have information on where most trips begin and end. Additionally, being able to predict bike demand can help companies like Indego balance their location distribution.

Balancing the distribution of bikes is absolutely vital to any bike share program. If redistribution or rebalancing is done incorrectly, bikes may end up in one location, leaving users in other areas of demand searching for another form of transportation. The goal of this analysis is to give Indego a predictive model that will allow them to perfect their rebalancing strategy and allocate the correct amount of bikes to specific areas. This will maximize ridership as well as profits. Along with this rebalacing tool, this analysis proposes a strategy for rebalancing bike distribution that borrows from rebalancing stratgies from electric scooter companies such as Lime in Austin, Texas. Lime Juicing Tasks are offered through Lime that people of the public can earn money to collect Lime Scooters, charge them, and drop them off in areas of demand or areas that are under-stocked. Indego should implement a similar system that offers rewards, discounts, or performance-based pay for those who transport bikes to the areas of high demand denoted by time and day in this analysis.

Data Wrangling

Import Indego data

Data from Indego’s website for the second quarter of 2019 is imported and filtered for the 5 weeks we need for our analysis. The data includes information such as the start time, duration, end time, start location, end location, and type of pass for each Indego ride. This 5 week stretch will later be split into a training and test set.

indego_trips_raw <- read_csv("data/indego-trips-2019-q2.csv")

indego_trips <- filter(indego_trips_raw, start_time <= as.Date("2019-06-21") & start_time >= as.Date("2019-05-17"))

bikerides <- indego_trips %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")) %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))

Census Data

Gathering US Census data offers information on the population, household income, age, means of transport and other important feautures like the geography of our study area, in this case Philadelphia. The geography gathered from the Census allows for the Indego trips to be identified by the Census Tract the trip originated in as well as the tract the trip ended in.

tracts19 <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2019, 
          state = 42, 
          county = 101,
          geometry = TRUE, 
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

phlTracts<- 
  tracts19 %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

tracts_joined <- st_join(bikerides %>% 
                        filter(is.na(start_lat) == FALSE &
                                 is.na(end_lat) == FALSE &
                                 is.na(start_lon) == FALSE &
                                 is.na(end_lon) == FALSE) %>%
                        st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
                      phlTracts %>%
                        st_transform(crs=4326),
                      join=st_intersects,
                      left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phlTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

Further Feature Engineering

Weather Data

Weather data for Philadelphia is obtained through the riem_measures function from the riem library. This function imports weather data from airports. For this analysis, data containing weather over our study period was collected at Philadelphia International Airport (PHL). This weather data includes temperature, precipitation, and wind speed. Obviously, bike ridership will fluctuate with weather. Rain will likely decrease ridership as well as colder temperatures. Analyzing the relationships between weather variables and ridership will improve the accuracy of the model for Indego.

weather <- riem_measures(station = "PHL", 
                         date_start = "2019-05-02", 
                         date_end = "2019-06-29")

weatherPanel <-  
  weather %>%
  mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>% 
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

grid.arrange(top = "Weather Data - Philadelphia (2019)",
             ggplot(weatherPanel, aes(interval60,Precipitation)) + geom_line() + 
               labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
             ggplot(weatherPanel, aes(interval60,Wind_Speed)) + geom_line() + 
               labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
             ggplot(weatherPanel, aes(interval60,Temperature)) + geom_line() + 
               labs(title="Temperature", x="Hour", y="Temperature") + plotTheme())

Indego Stations

stations <- st_read("https://kiosks.bicycletransit.workers.dev/phl") %>%
  st_transform(st_crs(tracts19)) %>%
  dplyr::select(name, id, geometry) %>%
  arrange(name)

II. Exploratory Analysis

Visualize Ridership

Across Time

ggplot(tracts_joined %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike share trips per hr by station. Philadelphia",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme()

tracts_joined %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(interval60, start_station, time_of_day) %>%
  tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Philadelphia",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()

dotw_rides <- 
  ggplot(tracts_joined %>% mutate(hour = hour(start_time)))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike Share Trips in Philadelphia, by Day of the Week",
       x="Hour", 
       y="Trip Counts")+
  plotTheme()

weekend_rides <- 
  ggplot(tracts_joined %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sat", "Sun"), "Weekend", "Weekday")))+
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike Share Trips in Philadelphia - Weekend vs Weekday",
       x="Hour", 
       y="Trip Counts")+
  scale_fill_viridis()+
  plotTheme()

grid.arrange(dotw_rides, weekend_rides, ncol = 1)

Across Space

rideTotals_origin <-ggplot(
  bikerides %>%
         group_by(start_station, time_of_day, start_lon, start_lat, weekend) %>%
         tally()) +
  geom_sf(data = phlTracts, fill = "transparent", color="black", size=2)+
  geom_sf(data = phlTracts, fill = "gray98", color="gray85")+
  geom_point(aes(x = start_lon, y = start_lat, color = n, size = n), alpha = 0.9)+
  scale_color_viridis_c(direction = -1,
                       option = "D")+
  guides(color=guide_legend(title="Trips"))+
  labs(title="Gross Bike Share Trips By Origin") +
  ylim(min(tracts_joined$start_lat), max(tracts_joined$start_lat))+
  xlim(min(tracts_joined$start_lon), max(tracts_joined$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  mapTheme()

rideTotals_destination <- ggplot(
  bikerides %>%
         group_by(end_station, time_of_day, end_lon, end_lat, weekend) %>%
         tally()) +
  geom_sf(data = phlTracts, fill = "transparent", color="black", size=2)+
  geom_sf(data = phlTracts, fill = "gray98", color="gray85")+
  geom_point(aes(x = end_lon, y = end_lat, color = n, size = n), alpha = 0.9)+
  scale_color_viridis_c(direction = -1,
                       option = "D")+
  guides(color=guide_legend(title="Trips"))+
  labs(title="Gross Bike Share Trips By Destination") +
  ylim(min(tracts_joined$start_lat), max(tracts_joined$start_lat))+
  xlim(min(tracts_joined$start_lon), max(tracts_joined$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  mapTheme()

grid.arrange(rideTotals_origin, rideTotals_destination, ncol = 1)

Across Time and Space

rides.animation.df <- bikerides %>%
  filter(week == 23 & dotw == "Mon") %>%
  mutate(Trip_Counter = 1) %>%
  group_by(interval15, start_station) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(., stations, by =c("start_station" = "id")) %>%
  st_sf() %>%
  ungroup() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
          Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
          Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
          Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
          Trip_Count > 10 ~ "11+ trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
  "7-10 trips","10+ trips")) %>%
  arrange(Trips)

ride_animation <- 
  ggplot()+
  geom_sf(data = phlTracts, fill = "transparent", color="black", size=2) +
  geom_sf(data = tracts19, size =0.2, color = "grey45", fill = "gray55")+
  geom_sf(data = stations, size = 0.2, alpha = 0.5, color = "#fde725ff") +
  geom_point(data = rides.animation.df,
          aes(color=Trip_Count, size = Trip_Count, geometry = geometry),
          stat = "sf_coordinates",
          fill = "transparent", alpha = 0.9)+
  scale_size_area(max_size = 7) +
  scale_colour_viridis("Trips", direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(tracts_joined$start_lat), max(tracts_joined$start_lat))+
  xlim(min(tracts_joined$start_lon), max(tracts_joined$start_lon))+
  labs(title="Indego Bike Trips: {current_frame}",
       subtitle = "15 minute intervals") +
  mapTheme() +
  enter_fade() + 
    exit_shrink() +
  transition_manual(interval15)

ride_animation2 <- 
  ggplot()+
  geom_sf(data = phlTracts, fill = "transparent", color="black", size=2) +
  geom_sf(data = tracts19, size =0.2, color = "grey45", fill = "gray55")+
  geom_sf(data = stations, size = 0.2, alpha = 0.5, color = "#fde725ff") +
  geom_point(data = rides.animation.df,
          aes(color=Trip_Count, size = Trip_Count, geometry = geometry),
          stat = "sf_coordinates",
          fill = "transparent", alpha = 0.8)+
  scale_size_area(max_size = 6) +
  scale_colour_viridis("Trips", direction = -1,
                       discrete = FALSE, option = "D")+
  labs(title="Indego Bike Trips: {current_frame}",
       subtitle = "15 minute intervals") +
  mapTheme() +
    enter_fade() + 
    exit_shrink() +
  transition_manual(interval15)

animate(ride_animation2, duration=22, renderer = gifski_renderer())

If we zoom in closer to the Indego docking stations, we get a better idea of patterns in the number of bike trips based on time of day. Both of these animations plot bike rides at 15 minute intervals for a full 24 hours on a Monday from the study period in this analysis.

animate(ride_animation, duration=22, renderer = gifski_renderer())

III. Model Testing

Create Ride Panel

study.panel <- 
  expand.grid(interval60=unique(tracts_joined$interval60), 
              start_station = unique(tracts_joined$start_station)) %>%
  left_join(., tracts_joined %>%
              select(start_station, Origin.Tract, start_lon, start_lat)%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

ride.panel <- 
  tracts_joined %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weatherPanel) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, tracts19 %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

Create Time Lags

ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
  mutate(day = yday(interval60)) 

as.data.frame(ride.panel) %>%
  group_by(interval60) %>% 
  summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -Trip_Count) %>%
  mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                              "lag12Hours","lag1day"))) %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, Trip_Count),2))

Split Data

Here the data from our original 5 week set is split into a training set of 3 weeks and a test set of 2 weeks.

ride.Train <- filter(ride.panel, week <= 23)
ride.Test <- filter(ride.panel, week > 23)

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)
reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)
reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)
reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours + lag3Hours + lag4Hours + lag12Hours + lag1day, 
     data=ride.Train)

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week)

model_pred <- function(dat, fit){
  pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
  mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
         BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
         CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
         DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred)) %>% 
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions

## # A tibble: 8 × 8
##    week data      Regression    Prediction  Observed  Absolute_Error   MAE sd_AE
##   <dbl> <list>    <chr>         <list>      <list>    <list>         <dbl> <dbl>
## 1    24 <tibble … ATime_FE      <dbl [22,3… <dbl [22… <dbl [22,344]> 0.924 1.04 
## 2    25 <tibble … ATime_FE      <dbl [9,31… <dbl [9,… <dbl [9,310]>  0.920 1.06 
## 3    24 <tibble … BSpace_FE     <dbl [22,3… <dbl [22… <dbl [22,344]> 0.945 1.05 
## 4    25 <tibble … BSpace_FE     <dbl [9,31… <dbl [9,… <dbl [9,310]>  0.974 1.03 
## 5    24 <tibble … CTime_Space_… <dbl [22,3… <dbl [22… <dbl [22,344]> 0.925 1.04 
## 6    25 <tibble … CTime_Space_… <dbl [9,31… <dbl [9,… <dbl [9,310]>  1.01  1.12 
## 7    24 <tibble … DTime_Space_… <dbl [22,3… <dbl [22… <dbl [22,344]> 0.789 0.931
## 8    25 <tibble … DTime_Space_… <dbl [9,31… <dbl [9,… <dbl [9,310]>  0.835 0.951

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    scale_x_continuous(breaks = c(24, 25)) +
    labs(title = "Mean Absolute Errors",
         subtitle = "By Model and Week",
         x = "Week") +
  plotTheme() +
  theme(legend.position = "bottom")

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station)) %>%
  dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = mean(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      scale_colour_manual(values = palette2) +
      labs(title = "Mean Predicted/Observed ride share by hourly interval", 
           x = "Hour", y= "Rideshare Trips") +
      plotTheme()

Plot Error at Each Station

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon)) %>%
  select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phlTracts, fill = "transparent", color="black", size=2) +
  geom_sf(data = tracts19, size =0.2, color = "grey85", fill = "gray98")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             alpha = 0.9, size = 4)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(tracts_joined$start_lat), max(tracts_joined$start_lat))+
  xlim(min(tracts_joined$start_lon), max(tracts_joined$start_lon))+
  labs(title="MAE - Test Set",
  subtitle = "DTime_Space_FE_timelags Model")+
  mapTheme()

Observed vs Predicted Values

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw)) %>%
  select(interval60, start_station, start_lon, 
         start_lat, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
  geom_abline(slope = 1, intercept = 0)+
  facet_grid(~time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme()

Error by Time and Space

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw) ) %>%
  select(interval60, start_station, start_lon, 
         start_lat, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
    geom_sf(data = phlTracts, fill = "transparent", color="black", size=2) +
    geom_sf(data = tracts19,size =0.2, color = "grey85", fill = "gray98")+
    geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 3, alpha = 0.9)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(tracts_joined$start_lat), max(tracts_joined$start_lat))+
  xlim(min(tracts_joined$start_lon), max(tracts_joined$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set",
       subtitle = "DTime_Space_FE_timeLags Model")+
  mapTheme()

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw),
         Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
         Med_Inc = map(data, pull, Med_Inc),
         Percent_White = map(data, pull, Percent_White)) %>%
  select(interval60, start_station, start_lon, 
         start_lat, Observed, Prediction, Regression,
         dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE))+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a Function of Socio-Economic Variables",
       x= "Value",
       y="Mean Absolute Error (Trips)")+
  plotTheme()

control <- trainControl(method = "cv", number = 50)
set.seed(337)

# train model
reg_cv <- train(
    Trip_Count ~ ., 
    data = ride.panel,
    method = "lm",
    trControl = control,
    na.action = na.omit)

# retrieve MAE for all folds
cv_mae <- data.frame(MAE = reg_cv$resample[, 3])

ggplot(data = reg_cv$resample) +
  geom_histogram(aes(x = reg_cv$resample$MAE), fill = '#88aab8') +
  geom_vline(xintercept = mean(cv_mae$MAE), color = "gray10", linetype = 2) +
  labs(title="Distribution of Cross-validation MAE",
       subtitle = "K = 50",
       caption = "Figure 5.3 ") +
  xlab('MAE ') +
  ylab('Count') +
  plotTheme()

IV. Conclusion

This type of model can be transformative for a bike share company such as Indego. Being able to allocate bicycles to areas known to have higher ridership based on time of day or day of the week can push bike share towards being a direct competitor for other forms of ride share like Uber and Lyft. This model should be implemented into Indego’s rebalancing strategy once it is fitted with more local intelligence. Proximity to hospitals, for instance, may boost ridership for nurses commuting to and from work. Additionally, local intelligence such as hospitals and universities can allow Indego to tailor rebalacing for specific times of day. While students may ride more during rush hour, nurses may ride further into the evening when shifts end. This model is a useful one and belongs in a bike share program like Indego.

Predicting Indego Bike Rides in Philadelphia

Sean McClellan

11/19/2021