To predict or to interpret in data science, that is the question


If you’re a data scientist, or an aspiring one, you may have found yourself asking this question: is a more complex learning algorithm better for making predictions?

Let’s assume we have a supervised learning task at hand: predicting the average selling price of houses in different neighbourhoods of Toronto, projected over 5 years, 10 years, and so on. We have access to vast geographic, demographic, socioeconomic, and historical data, from which we have extracted features that we consider important in influencing and determining real estate prices. We can think of these features as our independent variables, or predictors, and the selling price as our dependent variable. Since we are trying to predict the average selling price of a house (i.e., a continuous value) as a function of a set of features, this is clearly a regression problem.
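To make the setup concrete, here is a minimal sketch in Python of how such a data set might be framed. The feature names and values are purely illustrative placeholders, not real Toronto data:

```python
import pandas as pd

# Hypothetical training data: each row is a house sale, each column a
# predictor extracted from geographic, demographic, socioeconomic, and
# historical data. All names and values here are invented for illustration.
houses = pd.DataFrame({
    "median_household_income": [85_000, 62_000, 110_000, 74_000],
    "distance_to_transit_km":  [0.4, 1.2, 0.8, 2.1],
    "new_retail_stores_5yr":   [12, 3, 7, 5],
    "avg_selling_price":       [910_000, 640_000, 1_250_000, 700_000],
})

# Independent variables (predictors) and a continuous dependent
# variable: a regression setup.
X = houses.drop(columns="avg_selling_price")
y = houses["avg_selling_price"]
```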

What next? Should we start with some deep learning, which has been getting a lot of attention over the last few years, or support vector regression, in an attempt to achieve maximum performance? To address this question, we need to look at two things: (a) the size and quality of our data set, and (b) our end goals.

Much has been written about the importance of data in the context of machine learning. If we have good data and a sizeable amount of it, we do not need complex algorithms. More data allows us to see more: we can perceive clearer patterns in the data beyond the noise, which in turn can guide us towards simpler algorithms for modeling the data, obviating the need for something more complex. In other words, complex algorithms may not provide a good return on investment, in terms of performance, compared to simpler ones that require fewer assumptions to be made. Garrett Wu expands on these points in more detail in this blog post.
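As a rough illustration, the sketch below compares a simple and a complex regressor on a synthetic data set whose signal really is linear (a deliberately favourable case for the simple model, so treat it as a sketch rather than a benchmark). With a sizeable, reasonably clean data set, the simple model’s cross-validated performance matches the complex one’s:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a sizeable, noisy data set with a clear
# (here, genuinely linear) underlying pattern.
X, y = make_regression(n_samples=5000, n_features=10, noise=25.0,
                       random_state=0)

for name, model in [("linear regression", LinearRegression()),
                    ("gradient boosting", GradientBoostingRegressor(random_state=0))]:
    sizes, _, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 5),
        cv=5, scoring="r2")
    # Mean cross-validated R^2 at each training-set size: once data is
    # plentiful, the simpler model gives as good a return as the complex one.
    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{name}: n={n}, R^2={score:.3f}")
```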

Speaking to the second point, what are our end goals? In the context of our example, are we interested in good prediction performance alone, or do we want to capture relationships between the features and the selling price of a house? If performance is what really matters to us, then complex nonlinear methods offer more flexibility in finding a function that fits our training data better (assuming we have taken measures to control for overfitting). However, this comes at the cost of interpretability: the more complex the method, the more likely it is that we have lost track of how our features relate to the dependent variable. But if we wish to understand how the number of up-and-coming retail stores in a neighbourhood affects home prices in that neighbourhood, a simpler algorithm such as linear regression is far more powerful.
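To see what that interpretability buys us, here is a sketch using simulated (entirely invented) neighbourhood data, with a known per-store effect baked into the prices. The fitted linear coefficient reads off directly as “dollars per additional retail store, holding income fixed” — a plain answer that a complex nonlinear model rarely offers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500

# Simulated neighbourhoods: each up-and-coming retail store adds roughly
# $15,000 to the average selling price, on top of an income effect and noise.
retail_stores = rng.integers(0, 20, size=n).astype(float)
median_income = rng.normal(80_000, 15_000, size=n)
price = (400_000 + 15_000 * retail_stores + 4.0 * median_income
         + rng.normal(0, 30_000, size=n))

X = np.column_stack([retail_stores, median_income])
model = LinearRegression().fit(X, price)

# The coefficient has a direct interpretation: holding median income
# fixed, one additional retail store is associated with about this much
# change in average selling price.
print(f"estimated effect per retail store: ${model.coef_[0]:,.0f}")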

So, coming back to our question: is a more complex learning algorithm better for making predictions? The answer depends on our definition of “better” — on finding the right balance between performance and interpretability, and choosing an algorithm accordingly. James, Witten, Hastie, and Tibshirani provide an excellent figure (Figure 2.7) and a helpful explanation capturing the essence of this point in Chapter 2 of their book, “An Introduction to Statistical Learning”, freely downloadable here.



