Duc Vu Data Science Blog

Want to avoid a car accident? Don't go to high-crime neighborhoods


My approach

The goal of this second project is to analyze and predict the number of car accidents in the city of Chicago, taking into account multiple variables such as crime rate, weather, accident location, driving conditions, population density, traffic density, and road condition.

Table of contents

Packages:

  • Pandas: for data frames
  • Seaborn: for statistical data visualization
  • Statsmodels: for statistical models and statistical tests
  • Scikit-learn: for machine learning
  • Scipy: for mathematical functions
  • dbfread: for reading DBF files and returning the data as native Python data types
  • Shapely: for manipulation and analysis of planar geometric objects
  • Folium: for interactive maps

Get the data

Now that we know what we want to predict, what are the inputs? What could cause a car crash? There are many factors, some of which I include in this analysis.

  • Weather (temperature, wind speed, visibility, rainy/snowy/icy, snow depth, rain depth, etc)

  • Time features: hour of day, day of the week, month of the year.

  • Static features such as number of intersections, speed limit, average traffic volumes, etc.

  • Human factors such as population density, etc

  • Countless others

The city of Chicago publishes several open datasets that are worth exploring.

I find vehicle accidents to be strongly related to location (geography), so I will grab the GeoJSON file to perform geographic data analysis and aggregate the data at the zipcode and month level.

Handling spatial data to geoinference the zipcode

I utilized a dataset from the OSMnx Street Networks Dataverse (U.S. Street Network Shapefiles, Node/Edge Lists, and GraphML Files), which provides the lat/lon coordinates of all crossroads in Chicago. There are a total of 28,372 intersections recorded in Chicago. The next step is to retrieve the zipcode for each crossroad.
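(As an aside, a similar node list could also be pulled directly with the OSMnx package; the snippet below is only a sketch of that route, with an assumed place query and output format, not the exact pipeline used here.)

import osmnx as ox

# download the drivable street network of Chicago (assumed place query)
G = ox.graph_from_place('Chicago, Illinois, USA', network_type='drive')

# graph nodes are the intersections; keep their lat/lon coordinates
nodes = ox.graph_to_gdfs(G, nodes=True, edges=False)
nodes[['y', 'x']].rename(columns={'y': 'LATITUDE', 'x': 'LONGITUDE'}).to_csv(
    './dataset/intersections.csv', index=False)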

Given a GeoJSON database of Chicago containing many polygons (census tracts, specifically) and the intersection coordinates, I did some geoinferencing with Shapely and GeoJSON to identify which census tract and zipcode each intersection belongs to.

Step 1. Import all the required libraries

import pandas as pd
import json
from shapely.geometry import shape, Point

Step 2. Load the dataset and geoinference the zipcode

# load GeoJSON file containing sectors
with open('./dataset/chicago_boundaries_zipcodes.geojson') as f:
    js = json.load(f)

# load file containing the lat/lon coordinates of each intersection
df_intersection = pd.read_csv('./dataset/intersections.csv')

# extract the zipcode for each intersection
zipcode = []
for i in range(len(df_intersection)):

    # construct a point based on the lon/lat of the intersection
    point = Point(df_intersection.iloc[i]['LONGITUDE'], df_intersection.iloc[i]['LATITUDE'])

    # check each polygon to see if it contains the point
    for feature in js['features']:
        polygon = shape(feature['geometry'])
        if polygon.contains(point):
            zipcode.append(feature['properties']['zip'])
            break
    else:
        # no polygon matched; append None to keep the list aligned with the dataframe
        zipcode.append(None)
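With the zipcodes in hand, they can be attached back to the dataframe and the intersections counted per zipcode (a small sketch; the column and variable names here are mine):

# attach the geoinferred zipcode to each intersection
df_intersection['zipcode'] = zipcode

# count intersections per zipcode, later used as the intersection_count feature
intersection_count = (df_intersection.dropna(subset=['zipcode'])
                                     .groupby('zipcode')
                                     .size()
                                     .rename('intersection_count')
                                     .reset_index())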

Visualize crash rate with Folium map

import folium

chicago_metis = [41.876585, -87.653472]
chicago_map = folium.Map(chicago_metis, zoom_start=10, tiles="Stamen Terrain")  # or tiles="OpenStreetMap"

# path to the updated GeoJSON file
chicago_geo = './dataset/chicago_boundaries_zipcodes.geojson'

# choropleth of the number of crashes per zipcode
choropleth = folium.Choropleth(
    geo_data=chicago_geo,
    data=df_crash,
    columns=['zipcode', 'Number_of_Crash'],
    key_on='feature.properties.zip',
    fill_opacity=0.7,
    line_opacity=0.2,
    fill_color='BuGn',
    nan_fill_color='blue',
    legend_name='Crash',
    name='Crash density'
).add_to(chicago_map)

folium.LayerControl().add_to(chicago_map)
chicago_map.save('chicago_map.html')

Here is the visualization of the car accident rate:

crash

Exploratory Data Analysis

By exploring the data and analyzing vehicle crashes alongside the other factors, we notice some correlations in the dataset. First, let's check the relationship between the number of crossroads and the number of accidents.
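A minimal sketch of how this relationship could be plotted with Seaborn (assuming the aggregated dataframe df has intersection_count and num_accident columns):

import matplotlib.pyplot as plt
import seaborn as sns

# scatter of intersection count vs. accident count per zipcode, with a fitted line
plt.figure(figsize=(12, 8))
sns.regplot(x='intersection_count', y='num_accident', data=df)
plt.title('Number of intersections vs. number of accidents')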

intersection_vs_accident

Explore the target data and plot a histogram of the dependent variable to check whether it is normally distributed.

plt.figure(figsize=(12, 8))
sns.distplot(df['num_accident'])
plt.title('Histogram of target')

pothole

Pairwise scatter plots and correlation heatmap for checking multicollinearity

pairplot

Visualize the correlation matrix using a heatmap

Here, I visualize the correlation matrix of the features using a heatmap.
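Both plots can be produced in a couple of lines; this is a sketch assuming the features of interest live in the dataframe df:

import matplotlib.pyplot as plt
import seaborn as sns

# pairwise scatter plots of the candidate features
sns.pairplot(df)

# correlation heatmap to spot multicollinearity between features
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation matrix')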

correlationmatrix

I anticipated a strong correlation between road condition and car crashes. Contrary to expectations, it is the crime rate that turned out to be positively related to the accident rate.

Model selection

Based on the exploratory analysis, here are the features I use as independent variables to fit the model:

  • percent_accident_at_night – percentage of accidents at night
  • percent_accident_in_snow – percentage of accidents in snow
  • lat, lgn – latitude and longitude of each zipcode
  • population – population of the zipcode where the accidents happen
  • density – population density per zipcode
  • num_potholes – total number of potholes per zipcode
  • percent_traffic_related_crime – percentage of vehicle-related crime
  • num_crime – total number of crimes per zipcode
  • intersection_count – number of intersections in each zipcode
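The model-fitting code below uses a feature matrix X and target y; here is a small sketch of how they could be assembled (df and the exact column names are assumptions mirroring the list above, with one row per zipcode and month):

feature_cols = ['percent_accident_at_night', 'percent_accident_in_snow',
                'lat', 'lgn', 'population', 'density', 'num_potholes',
                'percent_traffic_related_crime', 'num_crime',
                'intersection_count']

X = df[feature_cols]
y = df['num_accident']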

When picking a model to fit the data, there are several factors to consider:

  • What type of problem am I solving? Classification or regression?

  • Does only predictive accuracy matter to me, or do I also want to understand how the variables affect the model?

For this problem, I am predicting the number of accidents, which is a classic regression problem. I do want to understand the variables' importance and know how they affect the model. Ideally, I'd want to tell drivers which areas they should avoid or take extra caution in when driving through.

Build linear regression model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# score the fitted model on training and validation data
train_score = lr_model.score(X_train, y_train)
val_score = lr_model.score(X_val, y_val)

# report results
print('\nTrain R^2 score was:', train_score)
print('\nValidation R^2 score was:', val_score)

Here are the $R^2$ scores:

Train $R^2$ score was: 0.6717622381724969
Validation $R^2$ score was: 0.6752109416359127

Let’s do some feature engineering to improve $R^2$

X2 = X.copy()
X2['population2'] = X2['population'] ** 2
X2['num_crime_/_num_intersection'] = X2['num_crime'] / X2['intersection_count']
X2['intersection_count5'] = X2['intersection_count'] ** 5
X2['num_crime_x_num_pothole'] = X2['num_crime'] * X2['num_potholes']

Now let's check the accuracy of the model again.

Train $R^2$ score was: 0.7465677013645713
Validation $R^2$ score was: 0.7505576728834923

Surprisingly, the $R^2$ score improved significantly, which is quite good.

I also fitted a polynomial model to the data, and the result is much better:

Degree 2 polynomial regression val $R^2$: 0.814
Degree 2 polynomial regression test $R^2$: 0.812
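For reference, here is a minimal sketch of how such a degree-2 polynomial fit could be set up with scikit-learn (assuming the engineered features X2 and target y from above; the exact split and pipeline behind the numbers above may differ):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X2, y, test_size=0.2, random_state=42)

# degree-2 polynomial expansion of the features followed by ordinary least squares
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)

print('Degree 2 polynomial regression val R^2:', poly_model.score(X_val, y_val))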

Check the quality of the regression model by plotting residuals against predictions

A plot of predictions vs. residuals helps to check homoscedasticity. When we plot the predictions against the residuals, we clearly observe that the residuals are distributed uniformly and randomly around the zero line and do not form specific clusters. Therefore, the model respects homoscedasticity.
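The plot itself can be produced with a few lines; this sketch assumes model_2_fit is the fitted statsmodels regression also used for the Cook's distance plot below:

import matplotlib.pyplot as plt

# fitted values vs. residuals from the statsmodels fit
plt.figure(figsize=(12, 8))
plt.scatter(model_2_fit.fittedvalues, model_2_fit.resid, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted number of accidents')
plt.ylabel('Residual')
plt.title('Prediction vs. residuals')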

pred_res

Plot Cook's distance to detect outliers

import statsmodels.api as sm

fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.influence_plot(model_2_fit, alpha=0.05, ax=ax, criterion="cooks")

cook_dist

Find the most significant features with LARS model selection

LARS (Least Angle Regression) is a method for computing the entire lasso regularization path at roughly the cost of a single least-squares fit; like the lasso, it shrinks coefficients towards zero in a data-dependent way. The path clearly shows that num_crime has the strongest positive effect on the target num_accident.
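A sketch of how such a path could be computed and plotted with scikit-learn's lars_path (assuming the feature matrix X2 and target y from above; standardization and plotting choices are mine):

import matplotlib.pyplot as plt
from sklearn.linear_model import lars_path
from sklearn.preprocessing import StandardScaler

# standardize so coefficient sizes are comparable across features
X_std = StandardScaler().fit_transform(X2)

# compute the full LARS/lasso regularization path
alphas, _, coefs = lars_path(X_std, y.values, method='lasso')

# plot each feature's coefficient along the path
plt.figure(figsize=(12, 8))
for coef, name in zip(coefs, X2.columns):
    plt.plot(alphas, coef, label=name)
plt.xlabel('alpha')
plt.ylabel('coefficient')
plt.legend()
plt.title('LARS regularization path')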

lars

Predict number of accidents for zipcode 60616 (Chinatown and Armour Square)

# Predict car accidents at zipcode 60616 in December
crash_id = X_test.index[10]

crash = df['id'].loc[crash_id]
zipcode, month = crash[0], crash[1]

predictions = model_fit.predict(sm.add_constant(X_test))

print(f'The actual number of accidents at zipcode {zipcode:.0f} in {month} is {y_test.num_accident[crash_id]:.0f}')
print(f'The predicted number of accidents at zipcode {zipcode:.0f} in {month} is {predictions[crash_id]:.0f}')

The actual number of accidents at zipcode 60616 in December is 187
The predicted number of accidents at zipcode 60616 in December is 162

The predicted number of accidents in Chinatown during December is not too far from the actual one.

Conclusion

In this blog post, I covered some methods to extract spatial data, clean it, and fit a regression model to estimate car crashes at the zipcode level. There is still much room for improvement:

  • Deploying different algorithms to improve the $R^2$ score and the accuracy of the model.

  • Adding more features, such as ridesharing data or median income, to figure out the effect of those new features on the model.

  • Predicting car crashes could also help reduce risk through enforcement and education.

You can find the project on GitHub and reach out to me on LinkedIn.

Thank you for reading this article.

First week at Metis - Project Benson


Challenge

An email from a potential client:

Vinny & Julia -

It was great to meet with you and chat at the event where we recently met. We'd love to take some next steps to see if working together is something that would make sense for both parties.

As we mentioned, we are interested in harnessing the power of data and analytics to optimize the effectiveness of our street team work, which is a significant portion of our fundraising efforts.

WomenTechWomenYes (WTWY) has an annual gala at the beginning of the summer each year. As we are a new and inclusive organization, we try to do double duty with the gala both to fill our event space with individuals passionate about increasing the participation of women in technology, and to concurrently build awareness and reach. To this end we place street teams at entrances to subway stations. The street teams collect email addresses and those who sign up are sent free tickets to our gala.

Where we'd like to solicit your engagement is to use MTA subway data, which as I'm sure you know is available freely from the city, to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.

The ball is in your court now—do you think this is something that would be feasible for your group? From there we can explore what kind of an engagement would make sense for all of us.

Best,
Karrine and Dahlia
WTWY International

Approach

The motivation for this project is to help WTWY figure out the best places and times to deploy their street teams, so they can invite more people to the gala. To give them a good recommendation, we utilized historical MTA turnstile data and ACS census data to gain insight into commuting patterns in NYC.

Our potential targets are female commuters and women working in tech, so we combined traffic flow from the MTA data with demographic information to give our client the best locations. Because a lot of tourists visit NYC on weekends and WTWY employees do not work on weekends, Saturdays and Sundays are excluded from our recommendation to avoid potential outliers in the ridership data.

Data Analysis

By exploring the data and analyzing daily commuter traffic and traffic by hour of day, we noticed inconsistencies in the hourly records due to turnstile glitches. Therefore, we removed records with negative ridership or ridership greater than 80,000 per day.
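A sketch of that cleaning step with pandas, including the weekend exclusion mentioned earlier (the file name and the DAILY_ENTRIES column are assumptions about how the per-day ridership was stored):

import pandas as pd

df_mta = pd.read_csv('./dataset/mta_turnstile_daily.csv', parse_dates=['DATE'])

# drop glitchy turnstile records: negative counts or implausibly high daily ridership
df_mta = df_mta[(df_mta['DAILY_ENTRIES'] >= 0) & (df_mta['DAILY_ENTRIES'] <= 80000)]

# exclude Saturdays and Sundays (dayofweek 5 and 6)
df_mta = df_mta[df_mta['DATE'].dt.dayofweek < 5]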

daily_traffic

The top five MTA stations we recommend are:

  • FULTON ST
  • 23 ST
  • 86 ST
  • 59 ST COLUMBUS
  • 14 ST-UNION SQ

yearly_traffic

map_mta

Conclusion

  • The stations with the highest traffic have transfers to multiple lines; 23rd Street, for example, connects to several MTA lines and also to buses, which guarantees high traffic for capturing emails

  • The morning period (8am-12pm) is the best time to capture emails, followed by the 12pm-8pm period

  • Expanding the data to cover multiple years could improve the analysis and reduce the effect of outliers and special events

  • Expanding street teams beyond subway stations could improve targeting accuracy, since not everybody commutes to work by train; adding street teams in trendy neighborhoods could increase reach

You can find the project on GitHub.