This article is here to answer all the questions you’ve ever had about linear regression. I’ll walk you through every step of the process, from acquiring data, describing it, creating a model, evaluating it, optimizing it, and preventing it from overfitting. Along the way I’ll introduce generally applicable data science techniques like data munging, visualizations, and some nifty Python code. I take a two-pronged approach to introducing linear regression: code and math. Every problem I present in this article can be described from an abstract, mathematical perspective as well as concretely in code. Both of these perspectives are important, and shed light on different aspects of linear regression, so I do a careful dance back and forth between both of these. At the end of this article, you’ll have a deep understanding of linear regression. If you’re a beginner, you’ll find yourself comfortable. And if you’ve already encountered linear regression before, you’ll leave with all of your nagging questions answered.
Before reading this you should have a basic understanding of coding principles. All the code in this article is in Python, but if you know another high-level language it should be easy to follow along even if you’ve never written a single line of Python before. I also use Python’s matplotlib and pandas libraries frequently, and it’ll be worth your while to check out their documentation if you find yourself getting lost. I also introduce some mathematical formalisms from calculus and linear algebra. However, I explain these topics in such a way that they will make intuitive (if not explicit sense) even to people who’ve never taken a calculus class before.
Although linear regression is often introduced in a formal, statistical way, I’m going to talk about the topic from a less formal, more machine-learly perspective. This doesn’t mean that I’m any less rigorous, but that the rigour comes from the practical perspective of a data scientist rather than the theoretical persective of a mathematician. Hold onto your seats and enjoy the ride!
This blog post also exists as a Jupyter notebook. Check it out here.