Exploring Neighbourhood Crime in the City of Toronto using Open Data

Toronto is considered to be a safe city in comparison to other big cities. In an article in The Economist (2015), Toronto was ranked as the safest major city in North America and the eighth safest major city in the world. Despite being ranked as a relatively safe city, Toronto has its fair share of crime. The city consists of 140 officially recognized neighbourhoods along with many other unofficial, smaller neighbourhoods. As is the case with any big city, some neighbourhoods are considered to be less safe than others. Higher crime is commonly attributed to factors such as lower income, higher unemployment, and lower literacy and access to education, among others.

The City of Toronto has an Open Data portal, which consists of over 200 datasets organized into 15 different categories. I was motivated to download some of these readily available open datasets and explore neighbourhood crime within Toronto. I found three datasets from the Open Data portal, which I thought were relevant to crime – safety, demographics, and economics data. The advantage of using these datasets was that the data was already available in a relatively clean format and did not involve extensive wrangling. Also, each of these three datasets had exactly 140 rows – one for each official neighbourhood in Toronto. The disadvantage was that the data was available only for two years, 2008 and 2011, which limited my freedom for making predictions based on this data. Despite this limitation, I decided to subject these datasets to a typical data science pipeline (i.e., wrangling, data analysis, data visualization, and prediction) and extract any hidden value with respect to neighbourhood crime.

For detailed steps, to replicate my results, or to run your own analyses, please go to my GitHub page to download or clone the IPython notebook and datasets.

The source file for each dataset was provided as an Excel file with two sheets – one for 2008 and one for 2011. I converted these sheets into separate CSV files and imported them as pandas dataframes in Python. So each raw dataset resulted in two pandas dataframes – six in total. There were differences in the demographics data for 2008 and 2011. In 2008, the city collected language and ethnicity data for each neighbourhood, whereas in 2011 it only collected language data.

The datasets had long column titles with spaces in between. For easier data access, sub-selection and sorting, I shortened the column names. I focused only on the total number of major crime incidents in 2008 and 2011. The column titles for major crime incidents were TMCI and TMCI2 for 2011 and 2008, respectively. Major crime is the sum of eight different crime categories – Assaults (Ass), Break & Enters, Drug Arrests, Murders, Robberies, Sexual Assaults, Thefts, and Vehicle Thefts. TMCI2 had to be calculated since it was not available as an existing column in the 2008 data.
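The renaming and the TMCI2 calculation described above can be sketched roughly as follows. This is a minimal illustration on an invented two-row table, not the actual notebook code: the long column titles and the numbers are made up, and the short names R, SA, and VT are my assumptions (only Ass, BE, DA, M, and T appear in this post).

```python
import pandas as pd

# Toy stand-in for the 2008 safety sheet. Column titles and values are
# invented for illustration; the real sheet has 140 rows.
safety_2008 = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Assaults": [10, 4],
    "Break & Enters": [5, 2],
    "Drug Arrests": [3, 1],
    "Murders": [0, 1],
    "Robberies": [2, 0],
    "Sexual Assaults": [1, 1],
    "Thefts": [0, 2],
    "Vehicle Thefts": [4, 3],
})

# Short names: Ass, BE, DA, M, T follow the post; R, SA, VT are assumed.
short_names = {
    "Assaults": "Ass",
    "Break & Enters": "BE",
    "Drug Arrests": "DA",
    "Murders": "M",
    "Robberies": "R",
    "Sexual Assaults": "SA",
    "Thefts": "T",
    "Vehicle Thefts": "VT",
}
safety_2008 = safety_2008.rename(columns=short_names)

# TMCI2 = total major crime incidents for 2008, the row-wise sum of the
# eight crime categories.
crime_cols = list(short_names.values())
safety_2008["TMCI2"] = safety_2008[crime_cols].sum(axis=1)

print(safety_2008[["Neighbourhood", "TMCI2"]])
```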

While generating a few initial plots, I realized that population would be a confounding variable. In other words, a neighbourhood might have a larger number of major crimes merely because it has a larger population. This could overpower other salient contributors to crime. To avoid this effect, I decided to normalize the data by calculating major crime per capita. So, I divided all crime counts for each neighbourhood by the neighbourhood’s population (obtained from the demographics data), and then multiplied these values by 1,000. This gave me the number of major crime incidents in each neighbourhood per 1,000 people.
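As a rough sketch of this normalization, assuming the crime counts and the population sit in two dataframes that share a neighbourhood key (the column names `Population` and `TMCI_per_1000` and all numbers here are hypothetical):

```python
import pandas as pd

# Invented toy data: crime counts from the safety data and population
# from the demographics data, keyed by neighbourhood.
crime = pd.DataFrame({"Neighbourhood": ["A", "B"], "TMCI": [250, 90]})
demographics = pd.DataFrame(
    {"Neighbourhood": ["A", "B"], "Population": [10000, 3000]}
)

# Join the two tables on the neighbourhood key.
merged = crime.merge(demographics, on="Neighbourhood")

# Major crime incidents per 1,000 residents.
merged["TMCI_per_1000"] = merged["TMCI"] * 1000 / merged["Population"]

print(merged)
```

In this toy example, neighbourhood B has fewer incidents than A in absolute terms but a higher rate per 1,000 residents, which is exactly the effect the normalization is meant to expose.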

As a first step, I compared the means of all crimes that fell under major crime incidents for 2011 with 2008. I found that Assaults (Ass), Drug Arrests (DA), and Break & Enters (BE) were the main major crime categories for both these years in terms of frequency. Murders (M) and Thefts (T) were the lowest two categories of major crime. Perhaps this finding is one reason why Toronto is generally a safe city.

Mean normalized major crime data for 2011:


Mean normalized major crime data for 2008:


Next, I sorted the data to find the five neighbourhoods for both these years with the least and most major crime incidents.
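One way to do this kind of ranking in pandas is with `nlargest` and `nsmallest`. The sketch below uses invented rates for seven placeholder neighbourhoods and is not necessarily how the notebook does it.

```python
import pandas as pd

# Invented per-1,000 crime rates for seven placeholder neighbourhoods.
rates = pd.DataFrame({
    "Neighbourhood": list("ABCDEFG"),
    "TMCI": [12.0, 55.5, 8.1, 30.2, 41.7, 19.9, 5.4],
})

# Five neighbourhoods with the most and the least major crime.
top5 = rates.nlargest(5, "TMCI")
bottom5 = rates.nsmallest(5, "TMCI")

print(top5)
print(bottom5)
```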

What were the top five major crime neighbourhoods in 2011?


What were the top five major crime neighbourhoods in 2008?


Four of the top five neighbourhoods matched in 2008 and 2011. The newcomer in 2011 was Yorkdale-Glen Park (indexed as 30), which was ranked eighth in 2008 and had a slight increase in crime in 2011, moving it into the 2011 list of five neighbourhoods with the most major crime incidents. The neighbourhood that dropped out was Danforth (indexed as 65), which is generally considered to be a high-crime neighbourhood but had a drop in crime in 2011. A closer look at Danforth showed that, alongside an overall reduction in major crime, the Drug Arrests category in particular fell by approximately 75%, which may have accounted for Danforth not showing up in the top five neighbourhoods for 2011.

What were the bottom five major crime neighbourhoods in 2011?


What were the bottom five major crime neighbourhoods in 2008?


Again, I noticed that four of the five neighbourhoods matched in 2008 and 2011, suggesting that major crime had been more or less stable in Toronto over these three years.

Next, I decided to visualize major crime incidents for 2008 and 2011. A good way to compare both years was to show the major crimes as a bivariate distribution in a scatter plot, with the crimes for each year also plotted as histograms. This also showed how correlated both distributions were. A higher correlation indicates smaller differences in major crime rates between 2008 and 2011. In this plot, 2011 major crime data is on the x-axis and 2008 major crime data is on the y-axis.


As expected, there was a strong correlation of 0.91 in major crime between the two years. In addition to outliers in the plot, I also noticed two interesting points which did not fall along the general trend. These two points refer to major crime (per 1,000 people) in two neighbourhoods where 45 < TMCI2 (i.e., major crime in 2008) < 55, and 25 < TMCI (i.e., major crime in 2011) < 35. I found that these two neighbourhoods were Danforth (Neighbourhood ID: 66) and Waterfront Communities (Neighbourhood ID: 77). As discussed above, Danforth, despite being a crime-prone neighbourhood, had a reduction in major crimes in 2011. Likewise, Waterfront Communities seemed to have a drop in crime as well.
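The correlation and the lookup of the off-trend points can be sketched with plain pandas, leaving the plot itself aside. The five rows below are invented; only the column names TMCI and TMCI2 and the two range conditions come from the analysis above.

```python
import pandas as pd

# Invented per-1,000 rates; row E is deliberately placed off the trend.
rates = pd.DataFrame({
    "Neighbourhood": list("ABCDE"),
    "TMCI": [10, 20, 30, 40, 30],    # 2011
    "TMCI2": [12, 19, 33, 41, 50],   # 2008
})

# Pearson correlation between the two years (the pandas default).
r = rates["TMCI"].corr(rates["TMCI2"])

# Neighbourhoods with 45 < TMCI2 < 55 and 25 < TMCI < 35.
off_trend = rates[
    rates["TMCI2"].between(45, 55, inclusive="neither")
    & rates["TMCI"].between(25, 35, inclusive="neither")
]

print(round(r, 2))
print(off_trend)
```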

Then I plotted the distribution of major crime for both years as box plots. Major crime data looked similar for both 2008 and 2011.


Another exploration I felt would be interesting was to find out how major crime had changed in Toronto’s neighbourhoods between 2008 and 2011. So, I computed a percentage change value, TCDiff, for major crime from 2008 to 2011. Positive values of TCDiff indicate an increase in major crime and negative values indicate a decrease in major crime, from 2008 to 2011.
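A minimal sketch of the percentage change, assuming TCDiff is defined relative to the 2008 value (the usual convention, but an assumption on my part here) and using invented numbers:

```python
import pandas as pd

# Invented per-1,000 rates for three placeholder neighbourhoods.
rates = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "TMCI2": [50.0, 20.0, 40.0],  # 2008
    "TMCI": [25.0, 25.0, 40.0],   # 2011
})

# Percentage change from 2008 to 2011, relative to the 2008 value.
rates["TCDiff"] = (rates["TMCI"] - rates["TMCI2"]) / rates["TMCI2"] * 100

# Largest increase and largest decrease.
biggest_increase = rates.nlargest(1, "TCDiff")
biggest_decrease = rates.nsmallest(1, "TCDiff")

print(rates)
```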

The five neighbourhoods with the maximum increase in major crime from 2008 to 2011 were the following.


The five neighbourhoods with the maximum decrease in major crime from 2008 to 2011 were the following.


We already discussed how Danforth, generally a high major crime neighbourhood, had a decrease in crime in 2011. This becomes obvious when we look at the TCDiff value which shows a 48.5% decrease from 2008 to 2011. We also noticed the decrease in major crime for Waterfront Communities in the bivariate scatter plot. In support of this finding, the TCDiff data reveals a 33.7% decrease in major crime for Waterfront Communities from 2008 to 2011.

As a next step, I looked at the demographics data for 2011 and 2008 and focused on four different age groups – (1) Children (0-14 years), (2) Youth (15-24 years), (3) Adults (25-54 years), (4) Seniors (55 and over). Most of these categories were already available. Computing values for the remaining categories was relatively easy. I also wanted to look at the median/mean household income in each neighbourhood for 2008 and 2011. Unfortunately, income data was available only with the 2008 safety data from the Open Data portal, and was unavailable for 2011. So, I had to take a closer look at the economics datasets. Surprisingly, the economics datasets did not contain any income data. However, they had other potentially important variables such as number of people employed, and number of people on social assistance.

I selected three columns which I thought were the most relevant – number of businesses in each neighbourhood, number of people employed in each neighbourhood, and number of social assistance recipients. All three variables seemed important as they are connected to income and employment, which are traditionally considered to be important motivators for committing crime. These values were normalized per 100 people.

Finally, I decided to use all this data for predicting major crime in 2008 and 2011. First, I plotted cross-correlations to see if there were any features that could be removed.


For the 2011 data, shown above, there was a moderate to high positive correlation between the number of males and major crime (TMCI), number of adults and major crime, number of people employed and major crime, and number of social assistance recipients and major crime. There was a negative correlation between the number of seniors and major crime. There was also a strong positive correlation between the number of people employed and the number of businesses in that neighbourhood, suggesting that these features might contain redundant information. I also noticed a strong negative correlation between the percentage of female population and the percentage of male population.
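To complement reading the correlation plot by eye, the strongly correlated feature pairs can also be listed programmatically. This is a sketch on invented data with hypothetical column names, and the 0.8 cut-off is an arbitrary choice for illustration, not a threshold used in the analysis.

```python
import pandas as pd

# Invented toy features; Businesses and Employed are built to move together.
features = pd.DataFrame({
    "Businesses": [1, 2, 3, 4, 5, 6],
    "Employed": [11, 19, 32, 41, 48, 62],
    "Seniors": [5, 1, 4, 2, 6, 3],
})

# Pairwise Pearson correlations between all features.
corr = features.corr()

# Collect each pair once (upper triangle) whose |r| exceeds the cut-off.
threshold = 0.8  # arbitrary, for illustration only
cols = corr.columns
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]

print(corr.round(2))
print(high_pairs)
```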


For the 2008 data, surprisingly, there was a moderate positive correlation between the percentage of females and the percentage of males, unlike in 2011, indicating that there was no consistent pattern. In any case, major crime is primarily associated with males rather than females. This led me to remove the percentage of females as a feature. I noticed a strong correlation between the number of people employed and the number of businesses in that neighbourhood for 2008 as well. Ideally, I would have removed one of these features. However, I decided against it based on the following reasoning. First, it is important to acknowledge that there is likely to be some correlation between these two variables. The number of businesses in a neighbourhood is likely to be a good indicator of the overall economic health of that neighbourhood. However, it is also possible that the businesses in a specific neighbourhood do not necessarily employ people from that neighbourhood. Many businesses are located in the downtown areas of cities, yet neighbourhoods close to downtown can still be crime-prone areas. Additionally, the nature of businesses can vary. A neighbourhood might consist entirely of small businesses, such as sole proprietorships and small partnerships, which might not have a sizeable number of employees. Given all these reasons, I included both these variables within the feature set. So, in summary, I used eight features.

This was essentially a regression problem where I had decided to predict major crime in each neighbourhood using the eight original features as independent variables. I wanted to have two separate machine learning models – one for 2008 and one for 2011. All machine learning was performed using scikit-learn. I started with linear regression as my first choice of machine learning model. I used a randomized partitioning – 70% of the data was used for training and 30% for testing. Both models were able to explain the variance in the training data reasonably well, as reflected by their R² values (71% for 2011 and 72% for 2008). For the 2011 model, mean squared errors were 22.3 on the training data and 40.7 on the test data, whereas for the 2008 model, mean squared errors were 40.9 on the training data and 46.6 on the test data. So, although the 2011 model had the lower errors, the 2008 model seemed more robust with respect to generalizability, given that the gap between its training and test MSE was much smaller.
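The workflow above can be sketched with scikit-learn as follows. The data here is synthetic (140 rows and 8 random features standing in for the neighbourhood table), and the random seeds are arbitrary, so the printed numbers will not match the ones I report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 140 "neighbourhoods" with 8 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(140, 8))
true_weights = np.array([3.0, -2.0, 0.0, 1.5, 0.0, 4.0, -1.0, 2.5])
y = X @ true_weights + rng.normal(scale=2.0, size=140)

# Randomized 70/30 split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# R^2 on the training data, and MSE on both partitions.
train_r2 = model.score(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

print(f"train R^2 = {train_r2:.3f}")
print(f"train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

Comparing the training MSE with the test MSE, as in the last line, is the check I relied on when judging how well each model generalized.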

To get an idea on feature importance, I examined the beta coefficients of all the predictors for both models using the “coef_” attribute for linear regression in scikit-learn.


For the 2008 linear regression model, the beta coefficient values suggest that almost all the features, except the number of people employed, are important predictors of major crime in a neighbourhood. These features were people in each of the four age groups, percentage of males, number of businesses, and number of social assistance recipients in each neighbourhood. For the 2011 linear regression model, however, only the percentage of males, number of businesses, and number of social assistance recipients showed up as important predictors of major crime. So, the models for both years were not consistent with each other.
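Reading coefficients off a fitted model is easier when they are paired with feature names. Here is a small sketch on synthetic data where the true coefficients are known; the feature names are hypothetical placeholders. Note that raw coefficients depend on the scale of each feature, so comparing magnitudes is only meaningful when the features are on comparable scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical feature names on synthetic, comparably scaled data.
feature_names = ["PctMale", "Businesses", "SocialAssistance"]
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=feature_names)

# Noise-free target with known coefficients 2, -3 and 0.
y = 2.0 * X["PctMale"] - 3.0 * X["Businesses"] + 0.0 * X["SocialAssistance"]

model = LinearRegression().fit(X, y)

# Pair each coefficient with its feature and rank by absolute size.
coefs = pd.Series(model.coef_, index=feature_names)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranked)
```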

As a next step, I used a random forest regression model to see if there would be an improvement in prediction performance, with the same 70:30 partitioning of the data into training and test data.

Both models were able to explain the variance in the training data very well, as reflected by their R² values (90.8% for 2011 and 92.4% for 2008). For the 2011 model, the mean squared error was 153.3 on the test data, whereas for the 2008 model, the mean squared error was 156.4 on the test data. These results indicate that the random forest models, despite having better R² values on the training data, performed worse than linear regression on the test data. In other words, they were too flexible and as a result overfitted to the training data. From a generalizability and robustness perspective, linear regression might be the better option as a regression model for predicting major crime in each neighbourhood.
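A sketch of the same train-versus-test comparison for a random forest, again on synthetic data. The number of trees and the seeds are arbitrary choices for illustration, not the settings from my notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: only the first two of eight features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(140, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=2.0, size=140)

# Same randomized 70/30 partitioning as for linear regression.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# A high training R^2 next to a much larger test MSE hints at overfitting.
train_r2 = forest.score(X_train, y_train)
train_mse = mean_squared_error(y_train, forest.predict(X_train))
test_mse = mean_squared_error(y_test, forest.predict(X_test))

print(f"train R^2 = {train_r2:.3f}")
print(f"train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```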

Since random forest models allow us to find out the most important features for that model, I plotted the features.


For both 2008 and 2011, again the percentage of males, number of businesses, and number of social assistance recipients show up as the most important predictors of major crime, similar to the 2011 linear regression model.
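For reference, a random forest in scikit-learn exposes these rankings through its `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical feature names, constructed so that one feature dominates.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names on synthetic data.
feature_names = ["PctMale", "Businesses", "SocialAssistance", "Seniors"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(140, 4)), columns=feature_names)

# The target depends mostly on SocialAssistance, weakly on PctMale.
y = (
    5.0 * X["SocialAssistance"]
    + 0.5 * X["PctMale"]
    + rng.normal(scale=0.5, size=140)
)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```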

My most important finding from the 2011 linear regression model and the random forest models was that, among the limited set of independent variables that were available to me, the percentage of males, the number of businesses, and the number of social assistance recipients within a neighbourhood are the most important predictors of major crime in that neighbourhood.

This finding cannot be generalized because the data was limited to only two years – 2008 and 2011. Besides, there were several other features that should ideally have been included as independent variables (i.e., features in a machine learning model) which I could not include as they were unavailable for both years. Perhaps the most important missing feature was income data. I would also have liked to include a feature that either quantified urbanization or captured gentrification in these neighbourhoods. Toronto has been undergoing a lot of change in the form of major construction projects in low- and mid-income neighbourhoods. Some of these are considered to be part of revitalization projects. For example, the City of Toronto began an initiative in 2005 known as the Regent Park Revitalization Plan. The plan involved transforming an area from a social housing neighbourhood into a thriving mixed-income neighbourhood by implementing construction in three phases that included a mix of rental and condominium buildings, townhouses, commercial space with community facilities, active parks and open spaces. Currently, phases 2 and 3 are underway. Variables that capture urbanization could show how the demographics of the city are being reshaped and how these changes are affecting crime in each neighbourhood. To obtain some of these features I would have to explore beyond the Open Data portal made available by the City of Toronto. This would be a possible extension to the current work.

Toronto is also going through a condominium construction boom, which is increasingly escalating rental and housing prices, as well as affecting affordability of living for low-income residents. How much of this change could be affecting crime? Housing and rental prices could serve as important independent variables that affect crime.

One final aspect I would like to investigate is the effectiveness of social assistance programs. My findings show that the greater the number of people on social assistance, the more crime there is in that area. But does this mean that social assistance is causing more crime? Clearly not. Rather, it is a reflection of the income needs of people in that neighbourhood. The expectation is that with greater social assistance, the income of each person, and therefore the overall economic health of that neighbourhood, will improve. But is this really happening? One way to investigate this is to look at neighbourhoods that received more social assistance, and see whether the crime rates in those neighbourhoods fell in the few years after they began receiving it.

From a prediction standpoint, obtaining data covering at least 15-20 years would have allowed the models to capture predictable trends. Despite these limitations, the data provided us with interesting findings and a set of action items for future extension of this work.