Introduction to Simple Linear Regression
In this very simple example, we'll explore how to create a simple fit line, the classic case of y=mx+b. We'll go carefully through each step so you can see what type of question a simple fit line can answer. Keep in mind that this case is heavily simplified and is not the approach we'll take later on; it's just here to get you thinking about linear regression in perhaps the same way Galton did.
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Sample Data
This sample data is from ISLR. It displays sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media.
df = pd.read_csv("Advertising.csv")
df.head()
Is there a relationship between total advertising spend and sales?
df['total_spend'] = df['TV'] + df['radio'] + df['newspaper']
sns.scatterplot(x='total_spend',y='sales',data=df)
Least Squares Line
Full formulas are available on Wikipedia (https://en.wikipedia.org/wiki/Linear_regression) as well as in the ISLR reading.
To understand what a line of best fit answers, consider this question: if someone were to spend a total of $200, what would the expected sales be? Recall that spend is recorded in thousands of dollars and sales in thousands of units, so this corresponds to total_spend = 200. We have simplified this quite a bit by combining all the features into "total spend", but we will come back to individual features later on. For now, let's focus on understanding what a linear regression line can help answer.
Our next ad campaign will have a total spend of $200. How many units do we expect to sell as a result?
# Basically, we want to figure out how to create this line
sns.regplot(x='total_spend',y='sales',data=df)
Let's go ahead and start solving: $$y=mx+b$$
We simply solve for m and b. Remember that, as shown in the video, we are solving in a generalized form:
$$ \hat{y} = \beta_0 + \beta_1X$$
Here beta_0 plays the role of b and beta_1 the role of m. X is capitalized to signal that we are dealing with a matrix of values: we have a known matrix of labels (sales numbers), Y, and a known matrix of total_spend values, X. We are going to solve for the beta coefficients; as we expand to more than a single feature, these will be important for understanding which features have the most predictive power. We write y hat to indicate a prediction or estimate, whereas y is a true label (a known value).
We can use NumPy for this (if you really wanted to, you could solve this by hand).
X = df['total_spend']
y = df['sales']
help(np.polyfit)
# Returns highest order coef first!
np.polyfit(X,y,1)
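As a cross-check on what np.polyfit returns for order 1, here is a minimal sketch of the "by hand" route using the standard closed-form least squares estimates for a single feature: the slope is the sum of (x - mean(x))*(y - mean(y)) divided by the sum of (x - mean(x))^2, and the intercept is mean(y) - slope*mean(x). The data below is synthetic and only stands in for the Advertising columns; the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in data (NOT the Advertising dataset): y is roughly 0.05*x + 4
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=50)
y = 0.05 * x + 4 + rng.normal(0, 1, size=50)

# Closed-form least squares estimates for a single feature
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# np.polyfit returns the highest-order coefficient first: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

print(beta_1, beta_0)
print(slope, intercept)
```

Both print statements should show the same slope and intercept up to floating-point rounding.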
# Potential Future Spend Budgets
potential_spend = np.linspace(0,500,100)
predicted_sales = 0.04868788*potential_spend + 4.24302822
plt.plot(potential_spend,predicted_sales)
sns.scatterplot(x='total_spend',y='sales',data=df)
plt.plot(potential_spend,predicted_sales,color='red')
Our next ad campaign will have a total spend of $200. How many units do we expect to sell as a result?
spend = 200
predicted_sales = 0.04868788*spend + 4.24302822
predicted_sales
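Typing the fitted coefficients in by hand works, but it is easy to mistype a digit. One possible alternative, sketched here on a tiny synthetic dataset rather than the Advertising data, is to keep the array returned by np.polyfit and wrap it in np.poly1d, which gives a callable polynomial. The names `coefs`, `line`, and `predicted` are illustrative.

```python
import numpy as np

# Synthetic stand-in data (NOT the Advertising dataset); exactly y = 0.05*x + 4
x = np.array([50.0, 100.0, 150.0, 200.0, 250.0])
y = np.array([6.5, 9.0, 11.5, 14.0, 16.5])

coefs = np.polyfit(x, y, 1)   # [slope, intercept], highest order first
line = np.poly1d(coefs)       # callable polynomial built from the coefficients

spend = 200
predicted = line(spend)       # equivalent to coefs[0]*spend + coefs[1]
print(predicted)
```

The same `line` object also accepts an array, so it could be called on a range of potential spend values to draw the fit line.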
Further considerations...which we will explore in much more depth!
Overfitting, Underfitting, and Measuring Performance
Notice that we fit with order 1, which is a straight line. We could begin to explore higher orders, but does a higher order mean a better fit overall? Is it possible to fit too much? Too little? How would we know, and how do we even define a good fit?
np.polyfit(X,y,3)
# Potential Future Spend Budgets
potential_spend = np.linspace(0,500,100)
predicted_sales = 3.07615033e-07*potential_spend**3 + -1.89392449e-04*potential_spend**2 + 8.20886302e-02*potential_spend**1 + 2.70495053e+00
sns.scatterplot(x='total_spend',y='sales',data=df)
plt.plot(potential_spend,predicted_sales,color='red')
Is this better than our straight line fit? What are good ways of measuring this?
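One common way to put a number on "how good is the fit" is the root mean squared error (RMSE) of the residuals. The sketch below compares an order 1 and an order 3 fit on synthetic data (not the Advertising data); the helper name `rmse` is made up for this example. Keep in mind that on the same data used for fitting, a higher-order least squares polynomial can never have a larger error than a lower-order one, so this number alone cannot reveal overfitting; that requires data the model has not seen.

```python
import numpy as np

# Synthetic stand-in data (NOT the Advertising dataset): a noisy straight line
rng = np.random.default_rng(1)
x = rng.uniform(0, 300, size=80)
y = 0.05 * x + 4 + rng.normal(0, 1.5, size=80)

def rmse(x, y, degree):
    """Fit a polynomial of the given degree and return its RMSE on the same data."""
    coefs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coefs, x)
    return np.sqrt(np.mean(residuals ** 2))

rmse_line = rmse(x, y, 1)
rmse_cubic = rmse(x, y, 3)
print(rmse_line, rmse_cubic)
```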
Multiple Features
The real data has 3 features rather than a single combined total spend. Could repeating the process with the individual features give a more accurate result?
X = df[['TV','radio','newspaper']]
y = df['sales']
# Note: X now holds 3 separate features, but the order is still 1, so we're not polynomial yet
np.polyfit(X,y,1)
Uh oh! Polyfit only works with a 1D X array! We'll need to move on to a more powerful library...
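For a sense of what a multi-feature fit looks like, here is a minimal sketch using np.linalg.lstsq on a design matrix with a column of ones for the intercept. This is an assumption-laden illustration, not necessarily the library or approach used later on: the data is synthetic and noise-free, and names such as `features` and `design` are made up for the example.

```python
import numpy as np

# Synthetic stand-in for three feature columns (NOT the Advertising dataset)
rng = np.random.default_rng(2)
n = 100
features = rng.uniform(0, 100, size=(n, 3))
true_coefs = np.array([0.05, 0.2, 0.01])
true_intercept = 3.0
y = features @ true_coefs + true_intercept  # noise-free, so the fit is exact

# Prepend a column of ones so the first coefficient acts as the intercept
design = np.column_stack([np.ones(n), features])
beta, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)

intercept, coefs = beta[0], beta[1:]
print(intercept, coefs)
```

Each entry of `coefs` is the estimated change in y for a one-unit change in the corresponding feature, holding the others fixed.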