Exploring Neighbourhood Crime in the City of Toronto using Open Data

Toronto is considered to be a safe city in comparison to other big cities. In an article in the Economist (2015), Toronto was ranked as the safest major city in North America and the eighth safest major city in the world. Despite being ranked as a relatively safe city, Toronto has its fair share of crime. The city consists of 140 officially recognized neighbourhoods along with many other unofficial, smaller neighbourhoods. As is the case with any big city, some neighbourhoods are considered to be less safe than others. Higher crime is attributed to several factors, such as lower income, higher unemployment, and lower literacy and access to education, among others.

The City of Toronto has an Open Data portal, which consists of over 200 datasets organized into 15 different categories. I was motivated to download some of these readily available open datasets and explore neighbourhood crime within Toronto. I found three datasets from the Open Data portal, which I thought were relevant to crime – safety, demographics, and economics data. The advantage of using these datasets was that the data was already available in a relatively clean format and did not involve extensive wrangling. Also, each of these three datasets had exactly 140 rows – one for each official neighbourhood in Toronto. The disadvantage was that the data was available only for two years, 2008 and 2011, which limited my freedom for making predictions based on this data. Despite this limitation, I decided to subject these datasets to a typical data science pipeline (i.e., wrangling, data analysis, data visualization, and prediction) and extract any hidden value with respect to neighbourhood crime.

For detailed steps, to replicate my results, or to run your own analyses, please go to my github page to download or clone the IPython notebook and datasets.

The source file for each dataset was provided as an Excel file with two sheets – one for 2008 and one for 2011. I converted these sheets into separate CSV files and imported them as Python pandas dataframes, so each raw dataset resulted in two pandas dataframes – six in total. There were differences in the demographics data between 2008 and 2011: in 2008, the city collected language and ethnicity data for each neighbourhood, whereas in 2011 it collected only language data.
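The import step can be sketched as follows. A `StringIO` with toy values stands in for one of the exported CSV files here; the column titles and numbers are illustrative, not the actual dataset's.

```python
import io
import pandas as pd

# In the actual workflow, each Excel sheet was exported to its own CSV file
# and loaded with pd.read_csv. A StringIO with toy values stands in for a
# file on disk so this sketch is self-contained.
csv_2011 = io.StringIO(
    "Neighbourhood,Total Major Crime Incidents,Total Population\n"
    "Neighbourhood A,350,9700\n"
    "Neighbourhood B,600,43400\n"
)
safety_2011 = pd.read_csv(csv_2011)  # one row per neighbourhood
```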

The datasets had long column titles with spaces in between. For easier data access, sub-selection and sorting, I shortened the column names. I focused only on the total number of major crime incidents in 2008 and 2011. The column titles for major crime incidents were TMCI and TMCI2 for 2011 and 2008, respectively. Major crime is the sum of eight different crime categories – Assaults (Ass), Break & Enters, Drug Arrests, Murders, Robberies, Sexual Assaults, Thefts, and Vehicle Thefts. TMCI2 had to be calculated since it was not available as an existing column in the 2008 data.
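The renaming and the TMCI2 computation can be sketched like this; the frame below uses toy counts, and the shortened labels beyond those quoted in the post (BE, DA, etc.) are my guesses for illustration.

```python
import pandas as pd

# Toy 2008 frame with the original long column titles (values are illustrative).
df08 = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Assaults": [100, 40],
    "Break & Enters": [50, 20],
    "Drug Arrests": [30, 10],
    "Murders": [1, 0],
    "Robberies": [20, 5],
    "Sexual Assaults": [10, 4],
    "Thefts": [5, 2],
    "Vehicle Thefts": [25, 9],
})

# Shorten the unwieldy titles for easier access, sub-selection, and sorting.
df08 = df08.rename(columns={
    "Assaults": "Ass", "Break & Enters": "BE", "Drug Arrests": "DA",
    "Murders": "M", "Robberies": "R", "Sexual Assaults": "SA",
    "Thefts": "T", "Vehicle Thefts": "VT",
})

# TMCI2 (total major crime in 2008) is the row-wise sum of the eight categories.
crime_cols = ["Ass", "BE", "DA", "M", "R", "SA", "T", "VT"]
df08["TMCI2"] = df08[crime_cols].sum(axis=1)
```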

While generating a few initial plots, I realized that population would be a confounding variable. In other words, a neighbourhood might record more major crimes merely because it has a larger population, and this could overpower other salient contributors to crime. To avoid this effect, I normalized the data by calculating major crime per capita: I divided each neighbourhood's crime counts by its population (obtained from the demographics data), and then multiplied these values by 1000. This gave me the number of major crime incidents in each neighbourhood per 1000 people.
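The normalization is a one-liner in pandas; the numbers below are toy values chosen to show how the same raw count can hide very different per-capita rates.

```python
import pandas as pd

# Two toy neighbourhoods with identical raw counts but different populations.
df = pd.DataFrame({"Neighbourhood": ["A", "B"],
                   "TMCI": [500, 500],
                   "Population": [10000, 50000]})

# Divide by population, then scale to incidents per 1000 residents.
df["TMCI_per_1000"] = df["TMCI"] / df["Population"] * 1000
```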

As a first step, I compared the means of all crimes that fell under major crime incidents for 2011 with 2008. I found that Assaults (Ass), Drug Arrests (DA), and Break & Enters (BE) were the main major crime categories for both these years in terms of frequency. Murders (M) and Thefts (T) were the lowest two categories of major crime. Perhaps this finding is one reason why Toronto is generally a safe city.

Mean normalized major crime data for 2011:


Mean normalized major crime data for 2008:


Next, I sorted the data to find the five neighbourhoods for both these years with the least and most major crime incidents.
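The ranking step can be sketched with `sort_values` and `head` on a toy frame (neighbourhood labels and rates are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Neighbourhood": list("ABCDEFG"),
                   "TMCI_per_1000": [12.0, 45.5, 3.2, 30.1, 8.8, 51.0, 19.4]})

# Highest-crime neighbourhoods: sort descending and take the head.
top5 = df.sort_values("TMCI_per_1000", ascending=False).head(5)

# Lowest-crime neighbourhoods: sort ascending instead.
bottom5 = df.sort_values("TMCI_per_1000", ascending=True).head(5)
```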

What were the top five major crime neighbourhoods in 2011?


What were the top five major crime neighbourhoods in 2008?


Four of the top five neighbourhoods matched between 2008 and 2011. Yorkdale-Glen Park (indexed as 30) was ranked eighth in 2008 and had a slight increase in crime in 2011, which is why it shows up in the 2011 list of the five neighbourhoods with the most major crime incidents. Conversely, Danforth (indexed as 65), which is generally considered a high-crime neighbourhood, had a drop in crime in 2011. A closer look showed that, as part of its overall reduction in major crime, Danforth had a considerable reduction of approximately 75% in the Drug Arrests category specifically, which may account for it not appearing in the top five for 2011.

What were the bottom five major crime neighbourhoods in 2011?


What were the bottom five major crime neighbourhoods in 2008?


Again, I noticed that four of the five neighbourhoods matched between 2008 and 2011, suggesting that major crime in Toronto had been more or less stable over these years.

Next, I decided to visualize major crime incidents for 2008 and 2011. A good way to compare both years is to show major crime as a bivariate distribution in a scatter plot, with each year's marginal distribution plotted as a histogram. This also shows how correlated the two distributions are: a higher correlation indicates smaller differences in major crime rates between 2008 and 2011. In this plot, 2011 major crime data is on the x-axis and 2008 major crime data is on the y-axis.


As expected, there was a strong correlation of 0.91 in major crime for both years. In addition to outliers in the plot, I also noticed two interesting points, which did not fall along the general trend. These two points refer to major crime (per 1000 people) in two neighbourhoods where 45 < TMCI2 (i.e., major crime in 2008) < 55, and 25 < TMCI (i.e., major crime in 2011) < 35. I found that these two neighbourhoods were Danforth (Neighbourhood ID: 66) and Waterfront Communities (Neighbourhood ID: 77). We discussed how Danforth, despite being a crime prone neighbourhood, had a reduction in major crimes in 2011. Likewise, Waterfront Communities seemed to have a drop in crime as well.

Then I plotted the distribution of major crime for both years as box plots. Major crime data looked similar for both 2008 and 2011.


Another exploration I felt would be interesting was to find out how major crime had changed in Toronto’s neighbourhoods between 2008 and 2011. So, I computed a percentage change value for major crime from 2008 to 2011. Positive values of TCDiff indicate an increase in major crime and negative values indicate a decrease in major crime, from 2008 to 2011.
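The percentage-change computation can be sketched as follows, with toy 2008 and 2011 rates (the Danforth figures here are illustrative round numbers, not the dataset's actual values):

```python
import pandas as pd

df = pd.DataFrame({"Neighbourhood": ["Danforth", "Other"],
                   "TMCI2": [50.0, 20.0],    # 2008 rates (toy values)
                   "TMCI": [25.75, 24.0]})   # 2011 rates (toy values)

# Percentage change from 2008 to 2011: positive = increase, negative = decrease.
df["TCDiff"] = (df["TMCI"] - df["TMCI2"]) / df["TMCI2"] * 100
```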

The five neighbourhoods with the maximum increase in major crime from 2008 to 2011 were the following.


The five neighbourhoods with the maximum decrease in major crime from 2008 to 2011 were the following.


We already discussed how Danforth, generally a high major crime neighbourhood, had a decrease in crime in 2011. This becomes obvious when we look at the TCDiff value which shows a 48.5% decrease from 2008 to 2011. We also noticed the decrease in major crime for Waterfront Communities in the bivariate scatter plot. In support of this finding, the TCDiff data reveals a 33.7% decrease in major crime for Waterfront Communities from 2008 to 2011.

As a next step, I looked at the demographics data for 2011 and 2008 and focused on four different age groups – (1) Children (0-14 years), (2) Youth (15-24 years), (3) Adults (25-54 years), (4) Seniors (55 and over). Most of these categories were already available. Computing values for the remaining categories was relatively easy. I also wanted to look at the median/mean household income in each neighbourhood for 2008 and 2011. Unfortunately, income data was available only with the 2008 safety data from the Open Data portal, and was unavailable for 2011. So, I had to take a closer look at the economics datasets. Surprisingly, the economics datasets did not contain any income data. However, they had other potentially important variables such as number of people employed, and number of people on social assistance.

I selected the three columns I thought were most relevant – the number of businesses, the number of people employed, and the number of social assistance recipients in each neighbourhood. All three variables seemed important, as they are connected to income and employment, which are traditionally considered important motivators for committing crime. These values were normalized per 100 people.

Finally, I decided to use all this data for predicting major crime in 2008 and 2011. First, I plotted cross-correlations to see if there were any features that could be removed.
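Cross-correlations of this kind come straight from `DataFrame.corr()`, which computes pairwise Pearson coefficients. The column names and values below are toy stand-ins for the real feature frame.

```python
import pandas as pd

# Toy feature frame; the real one holds the eight predictors plus TMCI.
df = pd.DataFrame({
    "TMCI":       [10.0, 20.0, 30.0, 40.0],
    "Employed":   [55.0, 60.0, 64.0, 71.0],
    "Businesses": [5.0, 6.0, 6.5, 7.2],
    "Seniors":    [30.0, 25.0, 22.0, 18.0],
})

# Pairwise Pearson correlations between every pair of columns.
# (A heatmap, e.g. seaborn's sns.heatmap(corr), would visualize this matrix.)
corr = df.corr()
```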


For the 2011 data, shown above, there was a moderate to high positive correlation between the number of males and major crime (TMCI), number of adults and major crime, number of people employed and major crime, and number of social assistance recipients and major crime. There was a negative correlation between the number of seniors and major crime. There was also a strong positive correlation between the number of people employed and the number of businesses in that neighbourhood, suggesting that these features might be containing redundant information. I also noticed a strong negative correlation between the percentage of female population and the percentage of male population.


For the 2008 data, surprisingly, there was a moderate positive correlation between the percentage of females and the percentage of males, unlike in 2011, indicating no consistent pattern. In any case, major crime is primarily associated with males rather than females, so I removed the percentage of females as a feature. I noticed a strong correlation between the number of people employed and the number of businesses in a neighbourhood for 2008 as well. Ideally, I would have removed one of these features, but I decided against it for the following reasons. Some correlation between these two variables is to be expected: the number of businesses in a neighbourhood is likely to be a good indicator of its overall economic health. However, the businesses in a neighbourhood are not necessarily employing the people of that neighbourhood – many businesses exist in the downtown areas of cities, even though neighbourhoods close to downtown can still be high-crime areas. Additionally, the nature of businesses can vary: a neighbourhood might consist entirely of small businesses such as sole proprietorships and small partnerships, which might not have a sizeable number of employees. For these reasons, I included both variables in the feature set. In summary, I used eight features.

This was essentially a regression problem: I decided to predict major crime in each neighbourhood using the eight original features as independent variables. I wanted two separate machine learning models – one for 2008 and one for 2011. All machine learning was performed using scikit-learn, and I started with linear regression as my first choice of model. I used a randomized partitioning – 70% of the data for training and 30% for testing. Both models explained the variance in the training data reasonably well, as reflected by their R² values (71% for 2011 and 72% for 2008). For the 2011 model, the mean squared errors were 22.3 on the training data and 40.7 on the test data, whereas for the 2008 model they were 40.9 and 46.6, respectively. So, although the 2011 model's performance seemed superior, the 2008 model seemed more robust with respect to generalizability, given that the difference between its training and test MSE was quite small.
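The modelling step can be sketched as follows with scikit-learn. The data here is synthetic (140 rows of random features with a toy linear target), since the real neighbourhood features are not reproduced in this post; only the workflow matches.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 140-neighbourhood feature matrix (8 features).
rng = np.random.RandomState(0)
X = rng.rand(140, 8)
y = X @ np.arange(1, 9) + rng.normal(scale=0.5, size=140)  # toy target

# Randomized 70/30 train/test partition, as in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_train, y_train)                  # variance explained (R^2)
mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_test = mean_squared_error(y_test, model.predict(X_test))
betas = model.coef_                                 # per-feature beta coefficients
```

The `coef_` attribute exposes one fitted beta coefficient per feature, which is how feature importance is inspected for the linear models.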

To get an idea of feature importance, I examined the beta coefficients of all the predictors for both models using the “coef_” attribute of scikit-learn's linear regression.


For the 2008 linear regression model, the beta coefficients suggest that almost all the features, except the number of people employed, are important predictors of major crime in a neighbourhood: the number of people in each of the four age groups, the percentage of males, the number of businesses, and the number of social assistance recipients. For the 2011 linear regression model, however, only the percentage of males, the number of businesses, and the number of social assistance recipients showed up as important predictors of major crime. So, the models for the two years were not consistent with each other.

As a next step, I used a random forest regression model to see if there would be an improvement in prediction performance, with the same 70:30 partitioning of the data into training and test data.

Both models were able to explain the variance in the training data very well, as reflected by their R² values (90.8% for 2011 and 92.4% for 2008). However, the mean squared error on the test data was 153.3 for the 2011 model and 156.4 for the 2008 model. These results indicate that the random forest models, despite having better R² values on the training data, performed worse than linear regression: they were too flexible and, as a result, overfitted the training data. From a generalizability and robustness perspective, linear regression might be the better option for predicting major crime in each neighbourhood.

Since random forest models allow us to find out the most important features for that model, I plotted the features.
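A random forest version under the same synthetic-data assumption looks like this; the feature names are my shorthand for the eight predictors described in the post, not the dataset's actual column labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

feature_names = ["Children", "Youth", "Adults", "Seniors",
                 "PctMale", "Businesses", "Employed", "SocialAssist"]

# Synthetic stand-in data (140 neighbourhoods x 8 features, toy target).
rng = np.random.RandomState(0)
X = rng.rand(140, 8)
y = 3 * X[:, 4] + 2 * X[:, 5] + X[:, 7] + rng.normal(scale=0.1, size=140)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
```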


For both 2008 and 2011, again the percentage of males, number of businesses, and number of social assistance recipients show up as the most important predictors of major crime, similar to the 2011 linear regression model.

My most important finding from the 2011 linear regression model and the random forest models was that, among the limited set of independent variables available to me, the percentage of males, the number of businesses, and the number of social assistance recipients within a neighbourhood are the most important predictors of major crime in that neighbourhood.

This finding cannot be generalized because the data was limited to only two years – 2008 and 2011. Besides, there were several other features that should ideally have been included as independent variables (i.e., features in a machine learning model) which I could not include as they were unavailable for both years. Perhaps the most important missing feature was income data. I would also have liked to include a feature that either quantified urbanization or captured gentrification in these neighbourhoods. Toronto has been undergoing a lot of change in the form of major construction projects in low- and mid-income neighbourhoods. Some of these are considered to be part of revitalization projects. For example, the City of Toronto began an initiative in 2005 known as the Regent Park Revitalization Plan. The plan involved transforming an area from a social housing neighbourhood into a thriving mixed-income neighbourhood by implementing construction in three phases that included a mix of rental and condominium buildings, townhouses, commercial space with community facilities, active parks, and open spaces. Currently, phases 2 and 3 are underway. Variables that capture urbanization can definitely show how the demographics of the city are being reshaped and how these changes are affecting crime in each neighbourhood. To obtain some of these features I would have to explore beyond the Open Data portal made available by the City of Toronto. This would be a possible extension of the current work.

Toronto is also going through a condominium construction boom, which is increasingly escalating rental and housing prices, as well as affecting affordability of living for low-income residents. How much of this change could be affecting crime? Housing and rental prices could serve as important independent variables that affect crime.

One final aspect I would like to investigate is the effectiveness of social assistance programs. My findings show that the greater the number of people on social assistance, the more the crime in that area. But does this mean that social assistance is causing more crime? Clearly no. However, it is a reflection of the income needs of people in that neighbourhood. The expectation is that with greater social assistance, the income of each person and therefore the overall economic health of that neighbourhood will change. But is this really happening? One way to investigate this is to look at neighbourhoods that received more social assistance, and see if the crime rates in that neighbourhood reduced in a few years from the point of receiving higher social assistance.

From a prediction standpoint, obtaining more data for at least 15-20 years would have allowed the models to capture predictable trends. Despite these limitations, the data provided us with interesting findings and a set of action items for future extension of this work.





An analysis of music hits across decades: 1950-2009


Lately, I’ve been wondering how music has changed across decades. Specifically, what are the distinctive features that separate the music of one decade from another? To this end, I have created a dataset of the top 100 music hits per decade, spanning six decades from 1950 to 2009. This dataset represents derived work collected from two sources – tsort.info/music/ and the Echo Nest database. It consists of the Echo Nest’s audio attributes for these 600 songs. For more information about the dataset and its attributes, how I created it, and how to use it (licensing and citation info), please download the latest release from my github page and refer to the README.md file.

The Echo Nest (in case you haven’t heard of it) is a music intelligence and data platform that was acquired by Spotify in 2014 – an excellent acquisition, in my opinion. From a business strategy perspective, this acquisition positions Spotify very well in the competitive domain of music recommendation/music streaming service providers (e.g., Pandora, Apple Music, Google Play Music). tsort.info, provided by Steve Hawtin et al., is a comprehensive site that uses 130 music charts as its sources for aggregating information about hit songs. More information about the Echo Nest’s song attributes is available here and here.

To replicate the results below in R, download the dataset with the decades data titled “NVDecades.csv”, from the latest release of Music-Decades-Top-100. Then make sure to set your working directory in R to the folder containing this dataset, or mention the full/absolute path of the file in the read.csv function, and follow along with the instructions in the “MusicDecades.pdf” file.

Box plots are a great way to summarize and compare distributions of data from each decade for any attribute. Here are some box plots for each decade.
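The accompanying walkthrough is in R, but the five-number summary behind each box can be sketched in pandas as well. The rows and attribute values below are toy stand-ins, and the column names are assumptions rather than the dataset's actual headers.

```python
import pandas as pd

# Toy rows; the real dataset has 100 hit songs per decade (1950s-2000s).
songs = pd.DataFrame({
    "decade":       ["1950s", "1950s", "1970s", "1970s", "2000s", "2000s"],
    "danceability": [0.45, 0.50, 0.62, 0.68, 0.70, 0.78],
})

# The quartiles and median per decade -- the core of each box in a box plot.
summary = (songs.groupby("decade")["danceability"]
                .quantile([0.25, 0.5, 0.75])
                .unstack())
# songs.boxplot(column="danceability", by="decade") would draw the boxes.
```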
Box plots of danceability:


The Echo Nest describes danceability as how suitable a track is for dancing. The Echo Nest computes this attribute using a combination of tempo, rhythm stability, beat strength, and regularity. When considering the median danceability for each decade, we notice two obvious jumps in danceability – one for the 1970s and one for the 2000s. The sudden jump in the 1970s clearly represents how the 70s ushered in the disco and disco-funk era – a departure from the rock & roll and R&B hits of the 50s and 60s.

Box plots of energy:
Energy represents how energetic the song is. While danceability is more subjective to the listener, energy depends more directly on the audio characteristics of the song. The Echo Nest uses a combination of loudness and segment durations to compute energy. We notice a spike in energy once in the 60s and again in the 70s. One possible explanation for the increase in the 60s could be the popularity of amplified distortion (especially in the latter part of the decade). As for the 70s, this was when big mixing consoles (still analog, but allowing 32 channels) started gaining popularity in recording studios, representing a major revolution in music recording: more audio content could be accommodated and EQs could be adjusted. Could this perhaps have had something to do with the increase in energy starting in the 70s?

Box plots of tempo:

We don’t notice much difference in median tempi between decades. As one might expect, songs from the 50s have the lowest tempi. An interesting thing to note, however, is that hit songs from the 50s varied more in tempo than those of the other decades, as indicated by the size of the box in the box plot. In other words, the 50s exhibited a wider range of tempi.

Box plots of duration:

We don’t see anything of note here, except that songs in the 50s and 60s were slightly shorter in duration.

Box plots of acousticness:

According to the Echo Nest, acousticness is a measure of the likelihood that a song was created by acoustic means such as voice and acoustic instruments that are not electronically synthesized or amplified. So, the inclusion of electric guitars, distortion, synthesizers, auto-tuned vocals, and drum machines will considerably lower the acousticness of a song. Not surprisingly, we notice a big drop in acousticness from the 50s to the 60s when distorted amplifiers and electric guitars started becoming popular. We also notice another drop, although not as huge, between the 60s and the 70s corresponding to the advances in recording consoles as well as the use of moog synthesizers.

Box plots of valence:
Valence is a subjective measure of pleasantness and is very listener-dependent. It needs to be interpreted with caution. The Echo Nest associates valence with positivity. The slightly lower median valence for the top songs of the 50s and the 90s, when compared with other decades, is hard to explain.

Now that we have a summarized overview of how hit songs are distributed with respect to these attributes across each decade, we can ask some interesting questions:

  • The box plots for danceability and energy show a similar pattern. Is a more energetic song generally more danceable or vice versa?
  • Do faster songs tend to be more energetic (or more danceable)?
  • How do danceability, energy, duration, and tempo affect the perceived pleasantness of a song? In other words, how are these attributes correlated with valence?

A good basic method to help address these questions is to compute cross-correlations (i.e., Pearson’s correlation coefficients).


Perhaps a better way to visualize these correlations would be either as a table of correlations or as a series of ellipses. In the second plot, the closer an ellipse is to a straight line, the more correlated the two features are; this is also indicated by its color. Additionally, the direction of correlation between any two attributes is indicated.


Coming back to our questions, we notice that energy and danceability are positively correlated, although the correlation is weak. Again there is a weak positive correlation between tempo and energy. This suggests that faster songs and danceable songs might sound more energetic, but not always. The positive correlation between danceability and valence is much clearer, as might be expected. The more danceable a song is, the more positive or pleasant the listening experience. Another clear insight is the strong negative correlation between acousticness and energy. So, the less acoustic or more synthesized (i.e., artificially amplified, compressed etc.) a song is, the more energetic it sounds.

We can also dig deeper and examine each decade. Each decade could be thought of as a genre. This becomes clearer in retrospect. For instance, songs of the 70s, when thought of as one unified schema, have a distinct sound and a strong association with disco/disco-funk. However, when living in the 70s and listening to music, this might not have seemed obvious, given that the decade had a variety of music – hard rock, early progressive rock and heavy metal, disco, synth rock, funk, and so on. But first, let’s find the top 5 songs with maximum danceability across all decades.
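Top-N queries like this are one call in pandas. The frame below is a toy subset: the song titles are real hits mentioned in this post, but the danceability values attached to them are hypothetical, purely for illustration.

```python
import pandas as pd

# Toy subset; danceability values are made up, not the Echo Nest's.
songs = pd.DataFrame({
    "song":   ["SexyBack", "Scatman", "I Will Survive",
               "Hound Dog", "Le Freak", "Billie Jean"],
    "decade": ["2000s", "1990s", "1970s", "1950s", "1970s", "1980s"],
    "danceability": [0.92, 0.80, 0.74, 0.55, 0.85, 0.88],
})

# Five most danceable songs across all decades, highest first.
top5 = songs.nlargest(5, "danceability")
```

Swapping the attribute column (energy, valence, acousticness) or switching to `nsmallest` reproduces the other rankings below; filtering on `decade` first gives the per-decade lists.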


Three of the top 5 are from the 2000s, two from the 90s, and one from the 80s. Justin Timberlake’s “SexyBack” ranks at #1 (note that this dataset contains songs only up to 2009). What are the top 5 songs with maximum energy, across all decades?


The decades are better represented here, with at least one song each from the 70s, 80s, 90s, and 2000s. Since an officially streamable version is available on YouTube, here is Scatman John’s “Scatman”. Gloria Gaynor’s “I Will Survive” is an interesting entry, suggesting that the song remains a mood uplifter perhaps not just because of its lyrics but also because of the structural arrangement and characteristics that make it more energetic. What are the top 5 most pleasant songs (i.e., songs with maximum valence), across all decades?


Elvis scores high with two entries in this list. What are the top 5 songs with maximum acousticness?


As might be expected, 4 of the songs are from the 50s and the 60s. Just for fun, what are the 5 songs with minimum acousticness?


Recall that lower acousticness indicates a song is more electronically synthesized or amplified: the inclusion of electric guitars, distortion, synthesizers, auto-tuned vocals, and drum machines would have considerably lowered a song’s acousticness. Ricky Martin’s “Livin’ La Vida Loca” deserves special mention here. It is a great song, in no small part due to Ricky Martin’s delivery and showmanship, but it also marks a watershed moment in music recording history: the song was noted for its extreme use of dynamic range compression to increase its perceived loudness. More information about the recording process behind “Livin’ La Vida Loca” is provided in this article. We can dig deeper and examine the top songs for each decade. What are the 5 most danceable songs from the 50s?


What are the 5 most danceable songs from the 60s?


Elvis again shows up twice – another clear reflection of why he was (and possibly still is?) such a popular artist. What are the 5 most danceable songs from the 70s?


No surprises here with Chic coming on top. The Nile Rodgers/Bernard Edwards combo was legendary in bringing a unique sound that shaped and defined disco funk. What are the 5 most danceable songs from the 80s?


The fact that Michael Jackson shows up on the 80s and 70s lists indicates two things: (a) his exceptional talent as an artist and a performer, and (b) the fantastic production teams he worked with for his “Off the Wall” and “Thriller” albums. What are the 5 most danceable songs from the 90s?


Finally, what are the 5 most danceable songs from the 2000s?


I have just about scratched the surface here and would love to see other analyses with this dataset.

Image credit: The Plastic Mancunian

Will the “real” data scientist please stand up?


The term ‘data science’ has become increasingly popular over the last few years, especially when coupled with its more famous cousin, ‘big data’. A simple keyword search on social media forums and job portals is sufficient to suggest that data science is here to stay beyond its initial hype.

In a recent conversation with a friend who is a data scientist for a Toronto-based technology start-up, we discussed how the knowledge domains and skills associated with data science (e.g. creating datasets, querying databases for information, data analysis/interpretation using statistical methods, prediction using machine learning methods, and so on) existed for decades before the term started gaining popularity. Both of us agreed that it makes sense to embrace some of the current hype around data science.

For those interested in the history of the discipline, here is an excellent blog post that talks about the chronological evolution and usage of data science. To me, metaphorically speaking, adopting a new term in business and technological circles is akin to the creation of a music genre. A new genre, as symbolized by a name (e.g. heavy metal, jazz fusion, disco in the 70’s; synth pop in the 80’s; grunge in the early 90’s; dubstep in the late 90’s; etc.), is popularized, partly, due to the uniqueness of that genre as represented by its musical elements, and also by the media surrounding that culture/community. A new genre is not born out of nothing. The musical elements/subgenres comprising this new genre have existed on their own somewhat independently, but now play a collaborative role in defining this new genre as something unique. Likewise, ‘data science’ symbolizes a unique amalgamation of various previously existing domains.

Despite the popularity of this term, many of us, including hiring managers at IT companies, are still surprisingly unclear about what data science entails. There is vagueness in the usage of this term as regards skill-sets. How many of us would qualify as data scientists and how many of us would be considered as not quite up to the mark, or “fake”? Bernard Marr presents this reality check.

Extending the music analogy further, with data science we are in an interesting transitional phase where the genre is still continuing to evolve and re-define itself. Let’s take heavy metal, for example. During its initial years, bands such as Led Zeppelin, Deep Purple, and Black Sabbath were associated with the sound of heavy metal. However, today, five decades later, many metal enthusiasts including myself would wince at the thought of referring to Led Zeppelin and Deep Purple as metal bands, but would unhesitatingly prescribe Black Sabbath as required course material for Heavy Metal 101. In other words, the boundaries and components of data science are still evolving. The fully defined picture for data science will be clear only in retrospect.

Having said that, are there ways to enlighten ourselves about what constitutes data science in its present state? Fortunately, several people have been asking similar questions, and their efforts have resulted in an abundance of useful, relevant information on the web. Data science can be understood through the core areas of study or knowledge domains and disciplines (e.g. computer science, statistics, databases). We could also think of it in terms of job roles and skills within the data science spectrum.

Data science can also be understood as a combination of disciplines involved and skills required. Ferris Jumah provides a novel way of visualizing what is hot in data science using a “data-centric” approach. Finally, here is an exhaustive information resource from DataCamp for aspiring data scientists, which provides areas of study, skills and background required, and even a roadmap with resources to acquire those skills – the mother of all data science infographics, in my opinion! There is a wealth of this kind of information available online and I have barely scraped the surface.

So, who qualifies as a data scientist? This blog post gets right to the point by listing 14 definitions, each highlighting different aspects of data science, some poignant, some detailed, and some tickling the funny bone. A quick read of all these definitions reveals key areas and corresponding skill sets, allowing us to build a summarized knowledge schema around data science – data analysis; data munging, cleaning, and manipulation; inference from big datasets; interpretation using statistical methods and machine learning; software engineering; storytelling and visualization, to name a few in no particular order. My particular favorite is #11 by John Rauser, which I have often heard being paraphrased. Another interesting point to note is that the nature of the definition varies depending on the background (i.e. knowledge, training, experience) and current position of the person defining it. Each individual’s perception of data science has a bias that is unique to his or her present and past experiences, and knowledge. I find this exciting because it enables us to aggregate diverse perspectives from current data science practitioners in the real world and assimilate these views into a common data science schema. This also serves as a good method to learn more about data science – by hearing it straight from the horse’s mouth, so to speak.

Data Science Weekly runs a section titled “Data Scientist Interviews” in its weekly newsletter. The title is self-explanatory. At least two volumes of interviews with data scientists have been published since 2014. After reading the first volume, I can say with conviction that this is clearly one of the more valuable resources for keeping abreast of the field of data science, understanding its scope and the nature of data science problems, and monitoring how the field is evolving. Volume 1 contains interviews with data scientists from diverse backgrounds, working in a wide range of fields and sharing a passion for answering questions with data.

Now that we have a good overview of what data science involves, how do we recognize the hidden data scientist within us? I believe the key is to find our niche. Going back to musical metaphors, we are like musicians in a band playing jazz fusion, for instance. There isn’t enough time to master every style. But it helps to perfect one or two styles and have at least a surface-level understanding of others. Members of ’70s jazz-rock fusion bands such as The Mahavishnu Orchestra, Weather Report, and Shakti were exemplary individual musicians who brought their unique strengths to the band’s common goals, forming a distinctive collective voice. In data science teams, irrespective of whether we are statisticians, computer scientists, software engineers, or hybrids of these areas, we all have a role to play. We use our strengths to provide insights and useful answers to the common questions the team is attempting to address, while utilizing the rich data sources that are presently available. Have you found the hidden data scientist within yourself?

To your success as a data scientist!

Image credit: Perkins Consulting

To predict or to interpret in data science, that is the question


If you’re a data scientist or an aspiring one, you may have asked yourself this question: is a more complex learning algorithm better for making predictions?

Let’s assume we have a supervised learning task at hand that involves predicting the average selling price of houses in different neighbourhoods of Toronto, projected over 5 years, 10 years, and so on. We have access to vast geographic, demographic, socioeconomic, and historical data, from which we have extracted features that we consider to be important in determining real estate prices. We could think of these features as our independent variables or predictors, and the selling price as our dependent variable. Since we are trying to predict the average selling price of a house (i.e., a continuous value) as a function of a set of features, this is clearly a regression problem.
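To make the setup concrete, here is a minimal sketch of how such a problem might be framed in pandas. All of the column names and numbers below are hypothetical stand-ins for the kinds of features described above, with synthetic data in place of real neighbourhood records:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500  # a hypothetical sample of 500 house sales

# Hypothetical predictors (independent variables)
df = pd.DataFrame({
    "median_income": rng.normal(80_000, 20_000, n),   # socioeconomic feature
    "transit_score": rng.uniform(0, 100, n),          # geographic feature
    "retail_stores": rng.poisson(12, n),              # neighbourhood feature
})

# Hypothetical dependent variable: selling price as a noisy
# function of the predictors (coefficients chosen arbitrarily)
df["selling_price"] = (
    300_000
    + 4.0 * df["median_income"]
    + 1_500 * df["transit_score"]
    + 8_000 * df["retail_stores"]
    + rng.normal(0, 50_000, n)
)

# Standard regression framing: feature matrix X, continuous target y
X = df[["median_income", "transit_score", "retail_stores"]]
y = df["selling_price"]
print(X.shape, y.shape)
```

Because the target is a continuous dollar amount rather than a category, any model we fit to `X` and `y` is doing regression, not classification.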

What next? Should we start with some deep learning, which has been getting a lot of attention over the last few years, or support vector regression, in an attempt to achieve maximum performance? To address this question, we need to look at two things: (a) the size and quality of our dataset, and (b) our end goals.

Much has been written about the importance of data in the context of machine learning. If we have good data and a sizeable amount of it, we do not need complex algorithms. More data allows us to see more. Essentially, we can perceive clearer patterns in the data beyond the noise, which in turn could guide us towards simpler algorithms for modeling the data, obviating the need for something more complex. In other words, complex algorithms may not provide a good return on investment, in terms of performance, compared to simpler ones that require fewer assumptions. Garrett Wu expands on these points in more detail in this blog post.

Speaking to the second point, what are our end goals? In the context of our example, are we interested in good prediction performance alone, or do we want to capture relationships between the features and the selling price of a house? If performance is what really matters to us, then complex nonlinear methods offer more flexibility in finding a function that fits our training data better (assuming we have taken measures to control for overfitting). However, this comes at the cost of interpretability. The more complex the method, the more likely it is that we have lost track of how our features relate to the dependent variable. But if we wish to understand how the number of up-and-coming retail stores in a neighbourhood affects home prices in that neighbourhood, a simpler algorithm such as linear regression is far more useful.
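This is what interpretability buys us in practice: with linear regression, each fitted coefficient reads directly as "dollars of price per unit of this feature." A small sketch, using scikit-learn on synthetic data where the true effect of each (hypothetical) feature is known, so we can see the model recover it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 1000

# Hypothetical features: by construction, each extra retail store
# adds ~$8,000 to the price and each transit-score point ~$1,200
retail_stores = rng.poisson(10, n)
transit_score = rng.uniform(0, 100, n)
price = (500_000
         + 8_000 * retail_stores
         + 1_200 * transit_score
         + rng.normal(0, 30_000, n))  # noise

X = np.column_stack([retail_stores, transit_score])
model = LinearRegression().fit(X, price)

# The coefficients are directly interpretable effect sizes
for name, coef in zip(["retail_stores", "transit_score"], model.coef_):
    print(f"{name}: ~${coef:,.0f} per unit")
```

A deep network or a kernel method fit to the same data might predict just as well or better, but it would not hand us a single number summarizing the retail-store effect the way `model.coef_` does here.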

So, coming back to our question, is a more complex learning algorithm better for making predictions? The answer depends on our definition of “better”: it is a matter of finding the right balance between performance and interpretability, and choosing an algorithm accordingly. James, Witten, Hastie, and Tibshirani provide an excellent figure (Figure 2.7) and a helpful explanation capturing the essence of this point in Chapter 2 of their book, “An Introduction to Statistical Learning”, freely downloadable here.
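The tradeoff is easy to demonstrate end to end. In this sketch the true price function is deliberately nonlinear, so a flexible model (a random forest, standing in for the "complex" end of the spectrum) outscores plain linear regression on held-out data, while the linear model remains the one whose coefficients we could actually explain. The data and functional form are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000
X = rng.uniform(-3, 3, size=(n, 2))

# Nonlinear ground truth: a straight line cannot capture
# sin(x0) or the quadratic in x1, but a flexible model can
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.3, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Held-out R^2: the flexible model wins on fit,
# the linear model wins on explainability
print(f"linear R^2: {linear.score(X_te, y_te):.2f}")
print(f"forest R^2: {forest.score(X_te, y_te):.2f}")
```

If the underlying relationship had been close to linear, the gap between the two scores would largely disappear, and the simpler model would be the obvious choice on both criteria – which is exactly the balance the question asks us to weigh.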

Image credit: http://blog.kaggle.com/2014/08/01/learning-from-the-best/