You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

555 lines
16 KiB

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linear Regression Project Exercise "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have learned about feature engineering, cross validation, and grid search, let's test all your new skills with a project exercise in Machine Learning. This exercise will have a more guided approach, later on the ML projects will begin to be more open-ended. We'll start off with using the final version of the Ames Housing dataset we worked on through the feature engineering section of the course. Your goal will be to create a Linear Regression Model, train it on the data with the optimal parameters using a grid search, and then evaluate the model's capabilities on a test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"---\n",
"---\n",
"## Complete the tasks in bold\n",
"\n",
"**TASK: Run the cells under the Imports and Data section to make sure you have imported the correct general libraries as well as the correct datasets. Later on you may need to run further imports from scikit-learn.**\n",
"\n",
"### Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"../DATA/AMES_Final_DF.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Lot Frontage</th>\n",
" <th>Lot Area</th>\n",
" <th>Overall Qual</th>\n",
" <th>Overall Cond</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>...</th>\n",
" <th>Sale Type_ConLw</th>\n",
" <th>Sale Type_New</th>\n",
" <th>Sale Type_Oth</th>\n",
" <th>Sale Type_VWD</th>\n",
" <th>Sale Type_WD</th>\n",
" <th>Sale Condition_AdjLand</th>\n",
" <th>Sale Condition_Alloca</th>\n",
" <th>Sale Condition_Family</th>\n",
" <th>Sale Condition_Normal</th>\n",
" <th>Sale Condition_Partial</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>141.0</td>\n",
" <td>31770</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>112.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
" <td>441.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80.0</td>\n",
" <td>11622</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>0.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
" <td>270.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>81.0</td>\n",
" <td>14267</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>108.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
" <td>406.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>93.0</td>\n",
" <td>11160</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>0.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
" <td>1045.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>74.0</td>\n",
" <td>13830</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>0.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
" <td>137.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 274 columns</p>\n",
"</div>"
],
"text/plain": [
" Lot Frontage Lot Area Overall Qual Overall Cond Year Built \\\n",
"0 141.0 31770 6 5 1960 \n",
"1 80.0 11622 5 6 1961 \n",
"2 81.0 14267 6 6 1958 \n",
"3 93.0 11160 7 5 1968 \n",
"4 74.0 13830 5 5 1997 \n",
"\n",
" Year Remod/Add Mas Vnr Area BsmtFin SF 1 BsmtFin SF 2 Bsmt Unf SF ... \\\n",
"0 1960 112.0 639.0 0.0 441.0 ... \n",
"1 1961 0.0 468.0 144.0 270.0 ... \n",
"2 1958 108.0 923.0 0.0 406.0 ... \n",
"3 1968 0.0 1065.0 0.0 1045.0 ... \n",
"4 1998 0.0 791.0 0.0 137.0 ... \n",
"\n",
" Sale Type_ConLw Sale Type_New Sale Type_Oth Sale Type_VWD \\\n",
"0 0 0 0 0 \n",
"1 0 0 0 0 \n",
"2 0 0 0 0 \n",
"3 0 0 0 0 \n",
"4 0 0 0 0 \n",
"\n",
" Sale Type_WD Sale Condition_AdjLand Sale Condition_Alloca \\\n",
"0 1 0 0 \n",
"1 1 0 0 \n",
"2 1 0 0 \n",
"3 1 0 0 \n",
"4 1 0 0 \n",
"\n",
" Sale Condition_Family Sale Condition_Normal Sale Condition_Partial \n",
"0 0 1 0 \n",
"1 0 1 0 \n",
"2 0 1 0 \n",
"3 0 1 0 \n",
"4 0 1 0 \n",
"\n",
"[5 rows x 274 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 2925 entries, 0 to 2924\n",
"Columns: 274 entries, Lot Frontage to Sale Condition_Partial\n",
"dtypes: float64(11), int64(263)\n",
"memory usage: 6.1 MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: The label we are trying to predict is the SalePrice column. Separate out the data into X features and y labels**"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Use scikit-learn to split up X and y into a training set and test set. Since we will later be using a Grid Search strategy, set your test proportion to 10%. To get the same data split as the solutions notebook, you can specify random_state = 101**"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: The dataset features has a variety of scales and units. For optimal regression performance, scale the X features. Take carefuly note of what to use for .fit() vs what to use for .transform()**"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: We will use an Elastic Net model. Create an instance of default ElasticNet model with scikit-learn**"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: The Elastic Net model has two main parameters, alpha and the L1 ratio. Create a dictionary parameter grid of values for the ElasticNet. Feel free to play around with these values, keep in mind, you may not match up exactly with the solution choices**"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Using scikit-learn create a GridSearchCV object and run a grid search for the best parameters for your model based on your scaled training data. [In case you are curious about the warnings you may recieve for certain parameter combinations](https://stackoverflow.com/questions/20681864/lasso-on-sklearn-does-not-converge)**"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Display the best combination of parameters for your model**"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'alpha': 100, 'l1_ratio': 1}"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Evaluate your model's performance on the unseen 10% scaled test set. In the solutions notebook we achieved an MAE of $\\$$14149 and a RMSE of $\\$$20532**"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14149.055026374837"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"20532.890234901013"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Great work!\n",
"\n",
"----"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}