{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linear Regression with SciKit-Learn\n",
"\n",
"We saw how to create a very simple best-fit line. Now let's expand our toolkit to consider overfitting, underfitting, model evaluation, and multiple features!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\ProgramData\\Anaconda3\\lib\\site-packages\\statsmodels\\tools\\_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.\n",
" import pandas.util.testing as tm\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sample Data\n",
"\n",
"This sample data is from ISLR. It displays sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"Advertising.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>TV</th>\n",
" <th>radio</th>\n",
" <th>newspaper</th>\n",
" <th>sales</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>230.1</td>\n",
" <td>37.8</td>\n",
" <td>69.2</td>\n",
" <td>22.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>44.5</td>\n",
" <td>39.3</td>\n",
" <td>45.1</td>\n",
" <td>10.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>17.2</td>\n",
" <td>45.9</td>\n",
" <td>69.3</td>\n",
" <td>9.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>151.5</td>\n",
" <td>41.3</td>\n",
" <td>58.5</td>\n",
" <td>18.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>180.8</td>\n",
" <td>10.8</td>\n",
" <td>58.4</td>\n",
" <td>12.9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" TV radio newspaper sales\n",
"0 230.1 37.8 69.2 22.1\n",
"1 44.5 39.3 45.1 10.4\n",
"2 17.2 45.9 69.3 9.3\n",
"3 151.5 41.3 58.5 18.5\n",
"4 180.8 10.8 58.4 12.9"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Everything BUT the sales column\n",
"X = df.drop('sales',axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"y = df['sales']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SciKit Learn \n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Polynomial Regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**From Preprocessing, import PolynomialFeatures, which will help us transform our original data set by adding polynomial features.**\n",
"\n",
"We will go from the equation in the form (shown here as if we only had one x feature):\n",
"\n",
"$$\\hat{y} = \\beta_0 + \\beta_1x_1 + \\epsilon $$\n",
"\n",
"and create more features from the original x feature for some polynomial degree *d*:\n",
"\n",
"$$\\hat{y} = \\beta_0 + \\beta_1x_1 + \\beta_2x^2_1 + ... + \\beta_dx^d_1 + \\epsilon$$\n",
"\n",
"We can then fit an ordinary linear regression model on the result, because the model simply treats the polynomial terms $x^2, x^3, \\dots, x^d$ as additional features. We need to be careful when choosing *d*, the degree of the model. Our metric results on the test set will help us with this!\n",
"\n",
"**The other thing to note is that we have multiple X features, not just the single one in the formula above, so PolynomialFeatures also generates *interaction* terms. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].**"
]
},
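{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the [a, b] example above, using the made-up values a = 2 and b = 3 (illustrative only, not taken from the advertising data). The names `sample`, `demo_converter` and `demo_features` are introduced just for this cell. The fitted `powers_` attribute shows which exponent of each input goes into each output column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# Hypothetical sample [a, b] = [2, 3], chosen only for illustration\n",
"sample = np.array([[2.0, 3.0]])\n",
"\n",
"# include_bias defaults to True, so the constant column of 1s is kept here\n",
"demo_converter = PolynomialFeatures(degree=2)\n",
"demo_features = demo_converter.fit_transform(sample)\n",
"\n",
"# Output columns are ordered [1, a, b, a^2, ab, b^2]\n",
"print(demo_features)\n",
"\n",
"# Each row holds the exponents of (a, b) for the matching output column\n",
"print(demo_converter.powers_)"
]
},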
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"polynomial_converter = PolynomialFeatures(degree=2,include_bias=False)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Converter \"fits\" to data, in this case, reads in every X column\n",
"# Then it \"transforms\" and outputs the new polynomial data\n",
"poly_features = polynomial_converter.fit_transform(X)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(200, 9)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"poly_features.shape"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(200, 3)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.shape"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"TV 230.1\n",
"radio 37.8\n",
"newspaper 69.2\n",
"Name: 0, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2.301000e+02, 3.780000e+01, 6.920000e+01, 5.294601e+04,\n",
" 8.697780e+03, 1.592292e+04, 1.428840e+03, 2.615760e+03,\n",
" 4.788640e+03])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"poly_features[0]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([230.1, 37.8, 69.2])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"poly_features[0][:3]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([52946.01, 1428.84, 4788.64])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"poly_features[0][:3]**2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The interaction terms $$x_1 \\cdot x_2 \\text{ and } x_1 \\cdot x_3 \\text{ and } x_2 \\cdot x_3 $$"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8697.779999999999"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"230.1*37.8"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"15922.92"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"230.1*69.2"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2615.7599999999998"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"37.8*69.2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train | Test Split\n",
"\n",
"Make sure you have watched the Machine Learning Overview videos on Supervised Learning to understand why we do this step"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# random_state: \n",
"# https://stackoverflow.com/questions/28064634/random-state-pseudo-random-number-in-scikit-learn\n",
"X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model for fitting on Polynomial Data\n",
"\n",
"#### Create an instance of the model with parameters"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression(fit_intercept=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit/Train the Model on the training data\n",
"\n",
"**Make sure you only fit to the training data, in order to fairly evaluate your model's performance on future data**"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X_train,y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"-----\n",
"\n",
"## Evaluation on the Test Set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculate Performance on Test Set\n",
"\n",
"We want to fairly evaluate our model, so we get performance metrics on the test set (data the model has never seen before)."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"test_predictions = model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_absolute_error,mean_squared_error"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"MAE = mean_absolute_error(y_test,test_predictions)\n",
"MSE = mean_squared_error(y_test,test_predictions)\n",
"RMSE = np.sqrt(MSE)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.489679804480361"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MAE"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4417505510403426"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MSE"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6646431757269028"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"RMSE"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14.022500000000003"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['sales'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison with Simple Linear Regression\n",
"\n",
"**Results on the Test Set (Note: Use the same Random Split to fairly compare!)**\n",
"\n",
"* Simple Linear Regression:\n",
"    * MAE: 1.213\n",
"    * RMSE: 1.516\n",
"\n",
"* Polynomial 2-degree:\n",
"    * MAE: 0.490\n",
"    * RMSE: 0.665"
]
},
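{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of how the simple linear regression baseline could be recomputed in this notebook, assuming it is a plain `LinearRegression` on the three original features and that `X` and `y` are still defined from the cells above. The `base_` names are introduced only for this illustration so they do not overwrite the polynomial split."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Same test_size and random_state as the polynomial model, but on the original features\n",
"base_X_train, base_X_test, base_y_train, base_y_test = train_test_split(\n",
"    X, y, test_size=0.3, random_state=101)\n",
"\n",
"base_model = LinearRegression()\n",
"base_model.fit(base_X_train, base_y_train)\n",
"base_pred = base_model.predict(base_X_test)\n",
"\n",
"print('MAE :', mean_absolute_error(base_y_test, base_pred))\n",
"print('RMSE:', np.sqrt(mean_squared_error(base_y_test, base_pred)))"
]
},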
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"---\n",
"## Choosing a Model\n",
"\n",
"### Adjusting Parameters\n",
"\n",
"Are we satisfied with this performance? Perhaps a higher-order polynomial would improve it even more, but how high is too high? It is now up to us to go back and adjust our model and parameters. Let's explore higher-order polynomials in a loop and plot their error, which leads nicely into a discussion of overfitting.\n",
"\n",
"Let's use a for loop to do the following:\n",
"\n",
"1. Create different order polynomial X data\n",
"2. Split that polynomial data for train/test\n",
"3. Fit on the training data\n",
"4. Report back the metrics on *both* the train and test results\n",
"5. Plot these results and explore overfitting"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# TRAINING ERROR PER DEGREE\n",
"train_rmse_errors = []\n",
"# TEST ERROR PER DEGREE\n",
"test_rmse_errors = []\n",
"\n",
"for d in range(1,10):\n",
" \n",
" # CREATE POLY DATA SET FOR DEGREE \"d\"\n",
" polynomial_converter = PolynomialFeatures(degree=d,include_bias=False)\n",
" poly_features = polynomial_converter.fit_transform(X)\n",
" \n",
" # SPLIT THIS NEW POLY DATA SET\n",
" X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)\n",
" \n",
" # TRAIN ON THIS NEW POLY SET\n",
" model = LinearRegression(fit_intercept=True)\n",
" model.fit(X_train,y_train)\n",
" \n",
" # PREDICT ON BOTH TRAIN AND TEST\n",
" train_pred = model.predict(X_train)\n",
" test_pred = model.predict(X_test)\n",
" \n",
" # Calculate Errors\n",
" \n",
" # Errors on Train Set\n",
" train_RMSE = np.sqrt(mean_squared_error(y_train,train_pred))\n",
" \n",
" # Errors on Test Set\n",
" test_RMSE = np.sqrt(mean_squared_error(y_test,test_pred))\n",
"\n",
" # Append errors to lists for plotting later\n",
" \n",
" \n",
" train_rmse_errors.append(train_RMSE)\n",
" test_rmse_errors.append(test_RMSE)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x168c0d109c8>"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(range(1,6),train_rmse_errors[:5],label='TRAIN')\n",
"plt.plot(range(1,6),test_rmse_errors[:5],label='TEST')\n",
"plt.xlabel(\"Polynomial Complexity\")\n",
"plt.ylabel(\"RMSE\")\n",
"plt.legend()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x168c1d7df08>"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(range(1,10),train_rmse_errors,label='TRAIN')\n",
"plt.plot(range(1,10),test_rmse_errors,label='TEST')\n",
"plt.xlabel(\"Polynomial Complexity\")\n",
"plt.ylabel(\"RMSE\")\n",
"plt.legend()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x168c41e5a88>"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(range(1,10),train_rmse_errors,label='TRAIN')\n",
"plt.plot(range(1,10),test_rmse_errors,label='TEST')\n",
"plt.xlabel(\"Polynomial Complexity\")\n",
"plt.ylabel(\"RMSE\")\n",
"plt.ylim(0,100)\n",
"plt.legend()"
]
},
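{
"cell_type": "markdown",
"metadata": {},
"source": [
"To read the errors off the lists rather than the charts, a small sketch, assuming `train_rmse_errors` and `test_rmse_errors` from the loop above (list index 0 corresponds to degree 1). The name `best_degree` is introduced here for illustration. The degree with the lowest test RMSE is only a starting point, since a simpler model with a similar error may still be the safer choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# List index i holds the error for polynomial degree i + 1\n",
"for degree, (train_err, test_err) in enumerate(zip(train_rmse_errors, test_rmse_errors), start=1):\n",
"    print(f'degree={degree}  train RMSE={train_err:.3f}  test RMSE={test_err:.3f}')\n",
"\n",
"best_degree = int(np.argmin(test_rmse_errors)) + 1\n",
"print('Degree with the lowest test RMSE:', best_degree)"
]
},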
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finalizing Model Choice\n",
"\n",
"There are now two things we need to save: the polynomial feature converter AND the model itself. Let's explore how we would proceed from here:\n",
"\n",
"1. Choose final parameters based on test metrics\n",
"2. Retrain on all data\n",
"3. Save Polynomial Converter object\n",
"4. Save model"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Based on our chart, could have also been degree=4, but \n",
"# it is better to be on the safe side of complexity\n",
"final_poly_converter = PolynomialFeatures(degree=3,include_bias=False)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"final_model = LinearRegression()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"final_model.fit(final_poly_converter.fit_transform(X),y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Saving Model and Converter"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"from joblib import dump, load"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['sales_poly_model.joblib']"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dump(final_model, 'sales_poly_model.joblib') "
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['poly_converter.joblib']"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dump(final_poly_converter,'poly_converter.joblib')"
]
},
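{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative sketch (not what this notebook does), scikit-learn's `make_pipeline` can bundle the converter and the model into one object, so only a single file has to be saved and loaded. The names `sales_pipeline`, `loaded_pipeline` and `sales_poly_pipeline.joblib` are made up for this example, and `X` and `y` are assumed to be defined from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from joblib import dump, load\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# The pipeline applies the polynomial transform first, then the regression\n",
"sales_pipeline = make_pipeline(\n",
"    PolynomialFeatures(degree=3, include_bias=False),\n",
"    LinearRegression())\n",
"sales_pipeline.fit(X, y)\n",
"\n",
"dump(sales_pipeline, 'sales_poly_pipeline.joblib')\n",
"\n",
"# Raw (untransformed) inputs can be passed straight to the loaded pipeline\n",
"loaded_pipeline = load('sales_poly_pipeline.joblib')\n",
"loaded_pipeline.predict(X.iloc[:1])"
]
},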
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deployment and Predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prediction on New Data\n",
"\n",
"Recall that we will need to **convert** any incoming data to polynomial data, since that is what our model is trained on. We simply load our saved converter object and call only **.transform()** on the new data, since we're not refitting to a new data set.\n",
"\n",
"**Our next ad campaign will have a total spend of 149k on TV, 22k on Radio, and 12k on Newspaper ads. How many units could we expect to sell as a result?**"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"loaded_poly = load('poly_converter.joblib')\n",
"loaded_model = load('sales_poly_model.joblib')"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"campaign = [[149,22,12]]"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"campaign_poly = loaded_poly.transform(campaign)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1.490000e+02, 2.200000e+01, 1.200000e+01, 2.220100e+04,\n",
" 3.278000e+03, 1.788000e+03, 4.840000e+02, 2.640000e+02,\n",
" 1.440000e+02, 3.307949e+06, 4.884220e+05, 2.664120e+05,\n",
" 7.211600e+04, 3.933600e+04, 2.145600e+04, 1.064800e+04,\n",
" 5.808000e+03, 3.168000e+03, 1.728000e+03]])"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"campaign_poly"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([14.64501014])"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loaded_model.predict(campaign_poly)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-----\n",
"---"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}