{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "\n", "\n", "___\n", "
Copyright by Pierian Data Inc.
\n", "
For more information, visit us at www.pieriandata.com
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Cross Validation\n", "\n", "In this lecture series we take a much deeper dive into various methods of cross-validation and discuss the general philosophy behind it. The official scikit-learn guide is a good reference: https://scikit-learn.org/stable/modules/cross_validation.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\statsmodels\\tools\\_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.\n", " import pandas.util.testing as tm\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Example" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"../DATA/Advertising.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>TV</th><th>radio</th><th>newspaper</th><th>sales</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>230.1</td><td>37.8</td><td>69.2</td><td>22.1</td></tr>\n",
"    <tr><th>1</th><td>44.5</td><td>39.3</td><td>45.1</td><td>10.4</td></tr>\n",
"    <tr><th>2</th><td>17.2</td><td>45.9</td><td>69.3</td><td>9.3</td></tr>\n",
"    <tr><th>3</th><td>151.5</td><td>41.3</td><td>58.5</td><td>18.5</td></tr>\n",
"    <tr><th>4</th><td>180.8</td><td>10.8</td><td>58.4</td><td>12.9</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " TV radio newspaper sales\n", "0 230.1 37.8 69.2 22.1\n", "1 44.5 39.3 45.1 10.4\n", "2 17.2 45.9 69.3 9.3\n", "3 151.5 41.3 58.5 18.5\n", "4 180.8 10.8 58.4 12.9" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "----\n", "----\n", "## Train | Test Split Procedure \n", "\n", "0. Clean and adjust data as necessary for X and y\n", "1. Split Data in Train/Test for both X and y\n", "2. Fit/Train Scaler on Training X Data\n", "3. Scale X Test Data\n", "4. Create Model\n", "5. Fit/Train Model on X Train Data\n", "6. Evaluate Model on X Test Data (by creating predictions and comparing to Y_test)\n", "7. Adjust Parameters as Necessary and repeat steps 5 and 6" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "## CREATE X and y\n", "X = df.drop('sales',axis=1)\n", "y = df['sales']\n", "\n", "# TRAIN TEST SPLIT\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)\n", "\n", "# SCALE DATA\n", "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()\n", "scaler.fit(X_train)\n", "X_train = scaler.transform(X_train)\n", "X_test = scaler.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Create Model**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Poor Alpha Choice on purpose!\n", "model = Ridge(alpha=100)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,\n", " normalize=False, random_state=None, solver='auto', tol=0.001)" ] }, 
"execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_train,y_train)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "y_pred = model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Evaluation**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.34177578903413" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_test,y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Adjust Parameters and Re-evaluate**" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "model = Ridge(alpha=1)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,\n", " random_state=None, solver='auto', tol=0.001)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_train,y_train)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "y_pred = model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Another Evaluation**" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "2.319021579428752" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_test,y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Much better! We could repeat this until satisfied with performance metrics. 
(We previously showed RidgeCV can do this for us, but the purpose of this lecture is to generalize the CV process for any model)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "----\n", "----\n", "## Train | Validation | Test Split Procedure \n", "\n", "The final test set in this procedure is often called a \"hold-out\" set: you should not adjust parameters based on it, but instead use it *only* for reporting final expected performance.\n", "\n", "0. Clean and adjust data as necessary for X and y\n", "1. Split Data into Train/Validation/Test sets for both X and y\n", "2. Fit/Train Scaler on Training X Data\n", "3. Scale X Train, Eval (Validation), and Test Data with the fitted Scaler\n", "4. Create Model\n", "5. Fit/Train Model on X Train Data\n", "6. Evaluate Model on X Evaluation Data (by creating predictions and comparing to y_eval)\n", "7. Adjust Parameters as Necessary and repeat steps 5 and 6\n", "8. Get final metrics on Test set (not allowed to go back and adjust after this!)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "## CREATE X and y\n", "X = df.drop('sales',axis=1)\n", "y = df['sales']" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "######################################################################\n", "#### SPLIT TWICE! 
Here we create TRAIN | VALIDATION | TEST #########\n", "####################################################################\n", "from sklearn.model_selection import train_test_split\n", "\n", "# 70% of data is training data, set aside other 30%\n", "X_train, X_OTHER, y_train, y_OTHER = train_test_split(X, y, test_size=0.3, random_state=101)\n", "\n", "# Remaining 30% is split into evaluation and test sets\n", "# Each is 15% of the original data size\n", "X_eval, X_test, y_eval, y_test = train_test_split(X_OTHER, y_OTHER, test_size=0.5, random_state=101)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# SCALE DATA\n", "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()\n", "scaler.fit(X_train)\n", "X_train = scaler.transform(X_train)\n", "X_eval = scaler.transform(X_eval)\n", "X_test = scaler.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Create Model**" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Poor Alpha Choice on purpose!\n", "model = Ridge(alpha=100)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,\n", " normalize=False, random_state=None, solver='auto', tol=0.001)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_train,y_train)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "y_eval_pred = model.predict(X_eval)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Evaluation**" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error" ] }, { 
"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.320101458823871" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_eval,y_eval_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Adjust Parameters and Re-evaluate**" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "model = Ridge(alpha=1)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,\n", " random_state=None, solver='auto', tol=0.001)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_train,y_train)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "y_eval_pred = model.predict(X_eval)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Another Evaluation**" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "2.383783075056986" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_eval,y_eval_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Final Evaluation (Can no longer edit parameters after this!)**" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "y_final_test_pred = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.254260083800517" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_test,y_final_test_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "----\n", "----\n", "## Cross Validation with 
cross_val_score\n", "\n", "----\n", "\n", "\n", "\n", "----" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "## CREATE X and y\n", "X = df.drop('sales',axis=1)\n", "y = df['sales']\n", "\n", "# TRAIN TEST SPLIT\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)\n", "\n", "# SCALE DATA\n", "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()\n", "scaler.fit(X_train)\n", "X_train = scaler.transform(X_train)\n", "X_test = scaler.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "model = Ridge(alpha=100)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# SCORING OPTIONS:\n", "# https://scikit-learn.org/stable/modules/model_evaluation.html\n", "scores = cross_val_score(model,X_train,y_train,\n", " scoring='neg_mean_squared_error',cv=5)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ -9.32552967, -4.9449624 , -11.39665242, -7.0242106 ,\n", " -8.38562723])" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8.215396464543607" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Average of the MSE scores (we set back to positive)\n", "abs(scores.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Adjust model based on metrics**" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "model = Ridge(alpha=1)" ] }, { 
"cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "# SCORING OPTIONS:\n", "# https://scikit-learn.org/stable/modules/model_evaluation.html\n", "scores = cross_val_score(model,X_train,y_train,\n", " scoring='neg_mean_squared_error',cv=5)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.344839296530695" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Average of the MSE scores (we set back to positive)\n", "abs(scores.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Final Evaluation (Can no longer edit parameters after this!)**" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,\n", " random_state=None, solver='auto', tol=0.001)" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Need to fit the model first!\n", "model.fit(X_train,y_train)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "y_final_test_pred = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.319021579428752" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_test,y_final_test_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "----\n", "----\n", "\n", "# Cross Validation with cross_validate\n", "\n", "The cross_validate function differs from cross_val_score in two ways:\n", "\n", "It allows specifying multiple metrics for evaluation.\n", "\n", "It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.\n", "\n", "For single metric evaluation, where the 
scoring parameter is a string, callable or None, the keys will be:\n", " \n", " - ['test_score', 'fit_time', 'score_time']\n", "\n", "And for multiple metric evaluation, the return value is a dict with the following keys:\n", "\n", "    ['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']\n", "\n", "return_train_score is set to False by default to save computation time. To evaluate the scores on the training set as well, you need to set it to True." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "## CREATE X and y\n", "X = df.drop('sales',axis=1)\n", "y = df['sales']\n", "\n", "# TRAIN TEST SPLIT\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)\n", "\n", "# SCALE DATA\n", "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()\n", "scaler.fit(X_train)\n", "X_train = scaler.transform(X_train)\n", "X_test = scaler.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "model = Ridge(alpha=100)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_validate" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "# SCORING OPTIONS:\n", "# https://scikit-learn.org/stable/modules/model_evaluation.html\n", "scores = cross_validate(model,X_train,y_train,\n", " scoring=['neg_mean_absolute_error','neg_mean_squared_error','max_error'],cv=5)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'fit_time': array([0.00102687, 0.00088882, 0.00099993, 0.00099945, 0. ]),\n", " 'score_time': array([0.00108409, 0. , 0. 
, 0.00064516, 0.00086308]),\n", " 'test_neg_mean_absolute_error': array([-2.31243044, -1.74653361, -2.56211701, -2.01873159, -2.27951906]),\n", " 'test_neg_mean_squared_error': array([ -9.32552967, -4.9449624 , -11.39665242, -7.0242106 ,\n", " -8.38562723]),\n", " 'test_max_error': array([ -6.44988486, -5.58926073, -10.33914027, -6.61950405,\n", " -7.75578515])}" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>fit_time</th><th>score_time</th><th>test_neg_mean_absolute_error</th><th>test_neg_mean_squared_error</th><th>test_max_error</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0.001027</td><td>0.001084</td><td>-2.312430</td><td>-9.325530</td><td>-6.449885</td></tr>\n",
"    <tr><th>1</th><td>0.000889</td><td>0.000000</td><td>-1.746534</td><td>-4.944962</td><td>-5.589261</td></tr>\n",
"    <tr><th>2</th><td>0.001000</td><td>0.000000</td><td>-2.562117</td><td>-11.396652</td><td>-10.339140</td></tr>\n",
"    <tr><th>3</th><td>0.000999</td><td>0.000645</td><td>-2.018732</td><td>-7.024211</td><td>-6.619504</td></tr>\n",
"    <tr><th>4</th><td>0.000000</td><td>0.000863</td><td>-2.279519</td><td>-8.385627</td><td>-7.755785</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " fit_time score_time test_neg_mean_absolute_error \\\n", "0 0.001027 0.001084 -2.312430 \n", "1 0.000889 0.000000 -1.746534 \n", "2 0.001000 0.000000 -2.562117 \n", "3 0.000999 0.000645 -2.018732 \n", "4 0.000000 0.000863 -2.279519 \n", "\n", " test_neg_mean_squared_error test_max_error \n", "0 -9.325530 -6.449885 \n", "1 -4.944962 -5.589261 \n", "2 -11.396652 -10.339140 \n", "3 -7.024211 -6.619504 \n", "4 -8.385627 -7.755785 " ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(scores)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fit_time 0.000783\n", "score_time 0.000518\n", "test_neg_mean_absolute_error -2.183866\n", "test_neg_mean_squared_error -8.215396\n", "test_max_error -7.350715\n", "dtype: float64" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(scores).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Adjust model based on metrics**" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "model = Ridge(alpha=1)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "# SCORING OPTIONS:\n", "# https://scikit-learn.org/stable/modules/model_evaluation.html\n", "scores = cross_validate(model,X_train,y_train,\n", " scoring=['neg_mean_absolute_error','neg_mean_squared_error','max_error'],cv=5)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fit_time 0.000901\n", "score_time 0.000200\n", "test_neg_mean_absolute_error -1.319685\n", "test_neg_mean_squared_error -3.344839\n", "test_max_error -5.161145\n", "dtype: float64" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(scores).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"**Final Evaluation (Can no longer edit parameters after this!)**" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,\n", " random_state=None, solver='auto', tol=0.001)" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Need to fit the model first!\n", "model.fit(X_train,y_train)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "y_final_test_pred = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.319021579428752" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_test,y_final_test_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "----" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }