{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dealing with Categorical Data\n",
"\n",
"Many machine learning models cannot work directly with categorical data stored as strings. For example, linear regression cannot apply a beta coefficient to colors like \"red\" or \"blue\". Instead, we need to convert these categories into \"dummy\" variables, a process also known as \"one-hot\" encoding."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"We will open the .csv file that was \"cleaned\" in the previous lectures to remove outliers and NaN values."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"../DATA/Ames_NO_Missing_Data.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Lot Frontage</th>\n",
" <th>Lot Area</th>\n",
" <th>Street</th>\n",
" <th>Lot Shape</th>\n",
" <th>Land Contour</th>\n",
" <th>Utilities</th>\n",
" <th>Lot Config</th>\n",
" <th>Land Slope</th>\n",
" <th>...</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Screen Porch</th>\n",
" <th>Pool Area</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Yr Sold</th>\n",
" <th>Sale Type</th>\n",
" <th>Sale Condition</th>\n",
" <th>SalePrice</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>141.0</td>\n",
" <td>31770</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>215000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>20</td>\n",
" <td>RH</td>\n",
" <td>80.0</td>\n",
" <td>11622</td>\n",
" <td>Pave</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>120</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>105000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>81.0</td>\n",
" <td>14267</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>12500</td>\n",
" <td>6</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>172000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>93.0</td>\n",
" <td>11160</td>\n",
" <td>Pave</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>244000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60</td>\n",
" <td>RL</td>\n",
" <td>74.0</td>\n",
" <td>13830</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>189900</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 76 columns</p>\n",
"</div>"
],
"text/plain": [
" MS SubClass MS Zoning Lot Frontage Lot Area Street Lot Shape \\\n",
"0 20 RL 141.0 31770 Pave IR1 \n",
"1 20 RH 80.0 11622 Pave Reg \n",
"2 20 RL 81.0 14267 Pave IR1 \n",
"3 20 RL 93.0 11160 Pave Reg \n",
"4 60 RL 74.0 13830 Pave IR1 \n",
"\n",
" Land Contour Utilities Lot Config Land Slope ... Enclosed Porch 3Ssn Porch \\\n",
"0 Lvl AllPub Corner Gtl ... 0 0 \n",
"1 Lvl AllPub Inside Gtl ... 0 0 \n",
"2 Lvl AllPub Corner Gtl ... 0 0 \n",
"3 Lvl AllPub Corner Gtl ... 0 0 \n",
"4 Lvl AllPub Inside Gtl ... 0 0 \n",
"\n",
" Screen Porch Pool Area Misc Val Mo Sold Yr Sold Sale Type \\\n",
"0 0 0 0 5 2010 WD \n",
"1 120 0 0 6 2010 WD \n",
"2 0 0 12500 6 2010 WD \n",
"3 0 0 0 4 2010 WD \n",
"4 0 0 0 3 2010 WD \n",
"\n",
" Sale Condition SalePrice \n",
"0 Normal 215000 \n",
"1 Normal 105000 \n",
"2 Normal 172000 \n",
"3 Normal 244000 \n",
"4 Normal 189900 \n",
"\n",
"[5 rows x 76 columns]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Description"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MSSubClass: Identifies the type of dwelling involved in the sale.\t\n",
"\n",
" 20\t1-STORY 1946 & NEWER ALL STYLES\n",
" 30\t1-STORY 1945 & OLDER\n",
" 40\t1-STORY W/FINISHED ATTIC ALL AGES\n",
" 45\t1-1/2 STORY - UNFINISHED ALL AGES\n",
" 50\t1-1/2 STORY FINISHED ALL AGES\n",
" 60\t2-STORY 1946 & NEWER\n",
" 70\t2-STORY 1945 & OLDER\n",
" 75\t2-1/2 STORY ALL AGES\n",
" 80\tSPLIT OR MULTI-LEVEL\n",
" 85\tSPLIT FOYER\n",
" 90\tDUPLEX - ALL STYLES AND AGES\n",
" 120\t1-STORY PUD (Planned Unit Development) - 1946 & NEWER\n",
" 150\t1-1/2 STORY PUD - ALL AGES\n",
" 160\t2-STORY PUD - 1946 & NEWER\n",
" 180\tPUD - MULTILEVEL - INCL SPLIT LEV/FOYER\n",
" 190\t2 FAMILY CONVERSION - ALL STYLES AND AGES\n",
"\n",
"MSZoning: Identifies the general zoning classification of the sale.\n",
"\t\t\n",
" A\tAgriculture\n",
" C\tCommercial\n",
" FV\tFloating Village Residential\n",
" I\tIndustrial\n",
" RH\tResidential High Density\n",
" RL\tResidential Low Density\n",
" RP\tResidential Low Density Park \n",
" RM\tResidential Medium Density\n",
"\t\n",
"LotFrontage: Linear feet of street connected to property\n",
"\n",
"LotArea: Lot size in square feet\n",
"\n",
"Street: Type of road access to property\n",
"\n",
" Grvl\tGravel\t\n",
" Pave\tPaved\n",
" \t\n",
"Alley: Type of alley access to property\n",
"\n",
" Grvl\tGravel\n",
" Pave\tPaved\n",
" NA \tNo alley access\n",
"\t\t\n",
"LotShape: General shape of property\n",
"\n",
" Reg\tRegular\t\n",
" IR1\tSlightly irregular\n",
" IR2\tModerately Irregular\n",
" IR3\tIrregular\n",
" \n",
"LandContour: Flatness of the property\n",
"\n",
" Lvl\tNear Flat/Level\t\n",
" Bnk\tBanked - Quick and significant rise from street grade to building\n",
" HLS\tHillside - Significant slope from side to side\n",
" Low\tDepression\n",
"\t\t\n",
"Utilities: Type of utilities available\n",
"\t\t\n",
" AllPub\tAll public Utilities (E,G,W,& S)\t\n",
" NoSewr\tElectricity, Gas, and Water (Septic Tank)\n",
" NoSeWa\tElectricity and Gas Only\n",
" ELO\tElectricity only\t\n",
"\t\n",
"LotConfig: Lot configuration\n",
"\n",
" Inside\tInside lot\n",
" Corner\tCorner lot\n",
" CulDSac\tCul-de-sac\n",
" FR2\tFrontage on 2 sides of property\n",
" FR3\tFrontage on 3 sides of property\n",
"\t\n",
"LandSlope: Slope of property\n",
"\t\t\n",
" Gtl\tGentle slope\n",
" Mod\tModerate Slope\t\n",
" Sev\tSevere Slope\n",
"\t\n",
"Neighborhood: Physical locations within Ames city limits\n",
"\n",
" Blmngtn\tBloomington Heights\n",
" Blueste\tBluestem\n",
" BrDale\tBriardale\n",
" BrkSide\tBrookside\n",
" ClearCr\tClear Creek\n",
" CollgCr\tCollege Creek\n",
" Crawfor\tCrawford\n",
" Edwards\tEdwards\n",
" Gilbert\tGilbert\n",
" IDOTRR\tIowa DOT and Rail Road\n",
" MeadowV\tMeadow Village\n",
" Mitchel\tMitchell\n",
" Names\tNorth Ames\n",
" NoRidge\tNorthridge\n",
" NPkVill\tNorthpark Villa\n",
" NridgHt\tNorthridge Heights\n",
" NWAmes\tNorthwest Ames\n",
" OldTown\tOld Town\n",
" SWISU\tSouth & West of Iowa State University\n",
" Sawyer\tSawyer\n",
" SawyerW\tSawyer West\n",
" Somerst\tSomerset\n",
" StoneBr\tStone Brook\n",
" Timber\tTimberland\n",
" Veenker\tVeenker\n",
"\t\t\t\n",
"Condition1: Proximity to various conditions\n",
"\t\n",
" Artery\tAdjacent to arterial street\n",
" Feedr\tAdjacent to feeder street\t\n",
" Norm\tNormal\t\n",
" RRNn\tWithin 200' of North-South Railroad\n",
" RRAn\tAdjacent to North-South Railroad\n",
" PosN\tNear positive off-site feature--park, greenbelt, etc.\n",
" PosA\tAdjacent to postive off-site feature\n",
" RRNe\tWithin 200' of East-West Railroad\n",
" RRAe\tAdjacent to East-West Railroad\n",
"\t\n",
"Condition2: Proximity to various conditions (if more than one is present)\n",
"\t\t\n",
" Artery\tAdjacent to arterial street\n",
" Feedr\tAdjacent to feeder street\t\n",
" Norm\tNormal\t\n",
" RRNn\tWithin 200' of North-South Railroad\n",
" RRAn\tAdjacent to North-South Railroad\n",
" PosN\tNear positive off-site feature--park, greenbelt, etc.\n",
" PosA\tAdjacent to postive off-site feature\n",
" RRNe\tWithin 200' of East-West Railroad\n",
" RRAe\tAdjacent to East-West Railroad\n",
"\t\n",
"BldgType: Type of dwelling\n",
"\t\t\n",
" 1Fam\tSingle-family Detached\t\n",
" 2FmCon\tTwo-family Conversion; originally built as one-family dwelling\n",
" Duplx\tDuplex\n",
" TwnhsE\tTownhouse End Unit\n",
" TwnhsI\tTownhouse Inside Unit\n",
"\t\n",
"HouseStyle: Style of dwelling\n",
"\t\n",
" 1Story\tOne story\n",
" 1.5Fin\tOne and one-half story: 2nd level finished\n",
" 1.5Unf\tOne and one-half story: 2nd level unfinished\n",
" 2Story\tTwo story\n",
" 2.5Fin\tTwo and one-half story: 2nd level finished\n",
" 2.5Unf\tTwo and one-half story: 2nd level unfinished\n",
" SFoyer\tSplit Foyer\n",
" SLvl\tSplit Level\n",
"\t\n",
"OverallQual: Rates the overall material and finish of the house\n",
"\n",
" 10\tVery Excellent\n",
" 9\tExcellent\n",
" 8\tVery Good\n",
" 7\tGood\n",
" 6\tAbove Average\n",
" 5\tAverage\n",
" 4\tBelow Average\n",
" 3\tFair\n",
" 2\tPoor\n",
" 1\tVery Poor\n",
"\t\n",
"OverallCond: Rates the overall condition of the house\n",
"\n",
" 10\tVery Excellent\n",
" 9\tExcellent\n",
" 8\tVery Good\n",
" 7\tGood\n",
" 6\tAbove Average\t\n",
" 5\tAverage\n",
" 4\tBelow Average\t\n",
" 3\tFair\n",
" 2\tPoor\n",
" 1\tVery Poor\n",
"\t\t\n",
"YearBuilt: Original construction date\n",
"\n",
"YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)\n",
"\n",
"RoofStyle: Type of roof\n",
"\n",
" Flat\tFlat\n",
" Gable\tGable\n",
" Gambrel\tGabrel (Barn)\n",
" Hip\tHip\n",
" Mansard\tMansard\n",
" Shed\tShed\n",
"\t\t\n",
"RoofMatl: Roof material\n",
"\n",
" ClyTile\tClay or Tile\n",
" CompShg\tStandard (Composite) Shingle\n",
" Membran\tMembrane\n",
" Metal\tMetal\n",
" Roll\tRoll\n",
" Tar&Grv\tGravel & Tar\n",
" WdShake\tWood Shakes\n",
" WdShngl\tWood Shingles\n",
"\t\t\n",
"Exterior1st: Exterior covering on house\n",
"\n",
" AsbShng\tAsbestos Shingles\n",
" AsphShn\tAsphalt Shingles\n",
" BrkComm\tBrick Common\n",
" BrkFace\tBrick Face\n",
" CBlock\tCinder Block\n",
" CemntBd\tCement Board\n",
" HdBoard\tHard Board\n",
" ImStucc\tImitation Stucco\n",
" MetalSd\tMetal Siding\n",
" Other\tOther\n",
" Plywood\tPlywood\n",
" PreCast\tPreCast\t\n",
" Stone\tStone\n",
" Stucco\tStucco\n",
" VinylSd\tVinyl Siding\n",
" Wd Sdng\tWood Siding\n",
" WdShing\tWood Shingles\n",
"\t\n",
"Exterior2nd: Exterior covering on house (if more than one material)\n",
"\n",
" AsbShng\tAsbestos Shingles\n",
" AsphShn\tAsphalt Shingles\n",
" BrkComm\tBrick Common\n",
" BrkFace\tBrick Face\n",
" CBlock\tCinder Block\n",
" CemntBd\tCement Board\n",
" HdBoard\tHard Board\n",
" ImStucc\tImitation Stucco\n",
" MetalSd\tMetal Siding\n",
" Other\tOther\n",
" Plywood\tPlywood\n",
" PreCast\tPreCast\n",
" Stone\tStone\n",
" Stucco\tStucco\n",
" VinylSd\tVinyl Siding\n",
" Wd Sdng\tWood Siding\n",
" WdShing\tWood Shingles\n",
"\t\n",
"MasVnrType: Masonry veneer type\n",
"\n",
" BrkCmn\tBrick Common\n",
" BrkFace\tBrick Face\n",
" CBlock\tCinder Block\n",
" None\tNone\n",
" Stone\tStone\n",
"\t\n",
"MasVnrArea: Masonry veneer area in square feet\n",
"\n",
"ExterQual: Evaluates the quality of the material on the exterior \n",
"\t\t\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tAverage/Typical\n",
" Fa\tFair\n",
" Po\tPoor\n",
"\t\t\n",
"ExterCond: Evaluates the present condition of the material on the exterior\n",
"\t\t\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tAverage/Typical\n",
" Fa\tFair\n",
" Po\tPoor\n",
"\t\t\n",
"Foundation: Type of foundation\n",
"\t\t\n",
" BrkTil\tBrick & Tile\n",
" CBlock\tCinder Block\n",
" PConc\tPoured Contrete\t\n",
" Slab\tSlab\n",
" Stone\tStone\n",
" Wood\tWood\n",
"\t\t\n",
"BsmtQual: Evaluates the height of the basement\n",
"\n",
" Ex\tExcellent (100+ inches)\t\n",
" Gd\tGood (90-99 inches)\n",
" TA\tTypical (80-89 inches)\n",
" Fa\tFair (70-79 inches)\n",
" Po\tPoor (<70 inches\n",
" NA\tNo Basement\n",
"\t\t\n",
"BsmtCond: Evaluates the general condition of the basement\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tTypical - slight dampness allowed\n",
" Fa\tFair - dampness or some cracking or settling\n",
" Po\tPoor - Severe cracking, settling, or wetness\n",
" NA\tNo Basement\n",
"\t\n",
"BsmtExposure: Refers to walkout or garden level walls\n",
"\n",
" Gd\tGood Exposure\n",
" Av\tAverage Exposure (split levels or foyers typically score average or above)\t\n",
" Mn\tMimimum Exposure\n",
" No\tNo Exposure\n",
" NA\tNo Basement\n",
"\t\n",
"BsmtFinType1: Rating of basement finished area\n",
"\n",
" GLQ\tGood Living Quarters\n",
" ALQ\tAverage Living Quarters\n",
" BLQ\tBelow Average Living Quarters\t\n",
" Rec\tAverage Rec Room\n",
" LwQ\tLow Quality\n",
" Unf\tUnfinshed\n",
" NA\tNo Basement\n",
"\t\t\n",
"BsmtFinSF1: Type 1 finished square feet\n",
"\n",
"BsmtFinType2: Rating of basement finished area (if multiple types)\n",
"\n",
" GLQ\tGood Living Quarters\n",
" ALQ\tAverage Living Quarters\n",
" BLQ\tBelow Average Living Quarters\t\n",
" Rec\tAverage Rec Room\n",
" LwQ\tLow Quality\n",
" Unf\tUnfinshed\n",
" NA\tNo Basement\n",
"\n",
"BsmtFinSF2: Type 2 finished square feet\n",
"\n",
"BsmtUnfSF: Unfinished square feet of basement area\n",
"\n",
"TotalBsmtSF: Total square feet of basement area\n",
"\n",
"Heating: Type of heating\n",
"\t\t\n",
" Floor\tFloor Furnace\n",
" GasA\tGas forced warm air furnace\n",
" GasW\tGas hot water or steam heat\n",
" Grav\tGravity furnace\t\n",
" OthW\tHot water or steam heat other than gas\n",
" Wall\tWall furnace\n",
"\t\t\n",
"HeatingQC: Heating quality and condition\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tAverage/Typical\n",
" Fa\tFair\n",
" Po\tPoor\n",
"\t\t\n",
"CentralAir: Central air conditioning\n",
"\n",
" N\tNo\n",
" Y\tYes\n",
"\t\t\n",
"Electrical: Electrical system\n",
"\n",
" SBrkr\tStandard Circuit Breakers & Romex\n",
" FuseA\tFuse Box over 60 AMP and all Romex wiring (Average)\t\n",
" FuseF\t60 AMP Fuse Box and mostly Romex wiring (Fair)\n",
" FuseP\t60 AMP Fuse Box and mostly knob & tube wiring (poor)\n",
" Mix\tMixed\n",
"\t\t\n",
"1stFlrSF: First Floor square feet\n",
" \n",
"2ndFlrSF: Second floor square feet\n",
"\n",
"LowQualFinSF: Low quality finished square feet (all floors)\n",
"\n",
"GrLivArea: Above grade (ground) living area square feet\n",
"\n",
"BsmtFullBath: Basement full bathrooms\n",
"\n",
"BsmtHalfBath: Basement half bathrooms\n",
"\n",
"FullBath: Full bathrooms above grade\n",
"\n",
"HalfBath: Half baths above grade\n",
"\n",
"Bedroom: Bedrooms above grade (does NOT include basement bedrooms)\n",
"\n",
"Kitchen: Kitchens above grade\n",
"\n",
"KitchenQual: Kitchen quality\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tTypical/Average\n",
" Fa\tFair\n",
" Po\tPoor\n",
" \t\n",
"TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)\n",
"\n",
"Functional: Home functionality (Assume typical unless deductions are warranted)\n",
"\n",
" Typ\tTypical Functionality\n",
" Min1\tMinor Deductions 1\n",
" Min2\tMinor Deductions 2\n",
" Mod\tModerate Deductions\n",
" Maj1\tMajor Deductions 1\n",
" Maj2\tMajor Deductions 2\n",
" Sev\tSeverely Damaged\n",
" Sal\tSalvage only\n",
"\t\t\n",
"Fireplaces: Number of fireplaces\n",
"\n",
"FireplaceQu: Fireplace quality\n",
"\n",
" Ex\tExcellent - Exceptional Masonry Fireplace\n",
" Gd\tGood - Masonry Fireplace in main level\n",
" TA\tAverage - Prefabricated Fireplace in main living area or Masonry Fireplace in basement\n",
" Fa\tFair - Prefabricated Fireplace in basement\n",
" Po\tPoor - Ben Franklin Stove\n",
" NA\tNo Fireplace\n",
"\t\t\n",
"GarageType: Garage location\n",
"\t\t\n",
" 2Types\tMore than one type of garage\n",
" Attchd\tAttached to home\n",
" Basment\tBasement Garage\n",
" BuiltIn\tBuilt-In (Garage part of house - typically has room above garage)\n",
" CarPort\tCar Port\n",
" Detchd\tDetached from home\n",
" NA\tNo Garage\n",
"\t\t\n",
"GarageYrBlt: Year garage was built\n",
"\t\t\n",
"GarageFinish: Interior finish of the garage\n",
"\n",
" Fin\tFinished\n",
" RFn\tRough Finished\t\n",
" Unf\tUnfinished\n",
" NA\tNo Garage\n",
"\t\t\n",
"GarageCars: Size of garage in car capacity\n",
"\n",
"GarageArea: Size of garage in square feet\n",
"\n",
"GarageQual: Garage quality\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tTypical/Average\n",
" Fa\tFair\n",
" Po\tPoor\n",
" NA\tNo Garage\n",
"\t\t\n",
"GarageCond: Garage condition\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tTypical/Average\n",
" Fa\tFair\n",
" Po\tPoor\n",
" NA\tNo Garage\n",
"\t\t\n",
"PavedDrive: Paved driveway\n",
"\n",
" Y\tPaved \n",
" P\tPartial Pavement\n",
" N\tDirt/Gravel\n",
"\t\t\n",
"WoodDeckSF: Wood deck area in square feet\n",
"\n",
"OpenPorchSF: Open porch area in square feet\n",
"\n",
"EnclosedPorch: Enclosed porch area in square feet\n",
"\n",
"3SsnPorch: Three season porch area in square feet\n",
"\n",
"ScreenPorch: Screen porch area in square feet\n",
"\n",
"PoolArea: Pool area in square feet\n",
"\n",
"PoolQC: Pool quality\n",
"\t\t\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tAverage/Typical\n",
" Fa\tFair\n",
" NA\tNo Pool\n",
"\t\t\n",
"Fence: Fence quality\n",
"\t\t\n",
" GdPrv\tGood Privacy\n",
" MnPrv\tMinimum Privacy\n",
" GdWo\tGood Wood\n",
" MnWw\tMinimum Wood/Wire\n",
" NA\tNo Fence\n",
"\t\n",
"MiscFeature: Miscellaneous feature not covered in other categories\n",
"\t\t\n",
" Elev\tElevator\n",
" Gar2\t2nd Garage (if not described in garage section)\n",
" Othr\tOther\n",
" Shed\tShed (over 100 SF)\n",
" TenC\tTennis Court\n",
" NA\tNone\n",
"\t\t\n",
"MiscVal: $Value of miscellaneous feature\n",
"\n",
"MoSold: Month Sold (MM)\n",
"\n",
"YrSold: Year Sold (YYYY)\n",
"\n",
"SaleType: Type of sale\n",
"\t\t\n",
" WD \tWarranty Deed - Conventional\n",
" CWD\tWarranty Deed - Cash\n",
" VWD\tWarranty Deed - VA Loan\n",
" New\tHome just constructed and sold\n",
" COD\tCourt Officer Deed/Estate\n",
" Con\tContract 15% Down payment regular terms\n",
" ConLw\tContract Low Down payment and low interest\n",
" ConLI\tContract Low Interest\n",
" ConLD\tContract Low Down\n",
" Oth\tOther\n",
"\t\t\n",
"SaleCondition: Condition of sale\n",
"\n",
" Normal\tNormal Sale\n",
" Abnorml\tAbnormal Sale - trade, foreclosure, short sale\n",
" AdjLand\tAdjoining Land Purchase\n",
" Alloca\tAllocation - two linked properties with separate deeds, typically condo with a garage unit\t\n",
" Family\tSale between family members\n",
" Partial\tHome was not completed when last assessed (associated with New Homes)\n",
"\n"
]
}
],
"source": [
"with open('../DATA/Ames_Housing_Feature_Description.txt','r') as f: \n",
" print(f.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Numerical Column to Categorical\n",
"\n",
"We need to be careful when it comes to encoding categories as numbers. We want to make sure that the numerical relationship makes sense for a model. For example, MSSubClass is essentially just a numeric code for each class:\n",
"\n",
" MSSubClass: Identifies the type of dwelling involved in the sale.\t\n",
"\n",
" 20\t1-STORY 1946 & NEWER ALL STYLES\n",
" 30\t1-STORY 1945 & OLDER\n",
" 40\t1-STORY W/FINISHED ATTIC ALL AGES\n",
" 45\t1-1/2 STORY - UNFINISHED ALL AGES\n",
" 50\t1-1/2 STORY FINISHED ALL AGES\n",
" 60\t2-STORY 1946 & NEWER\n",
" 70\t2-STORY 1945 & OLDER\n",
" 75\t2-1/2 STORY ALL AGES\n",
" 80\tSPLIT OR MULTI-LEVEL\n",
" 85\tSPLIT FOYER\n",
" 90\tDUPLEX - ALL STYLES AND AGES\n",
" 120\t1-STORY PUD (Planned Unit Development) - 1946 & NEWER\n",
" 150\t1-1/2 STORY PUD - ALL AGES\n",
" 160\t2-STORY PUD - 1946 & NEWER\n",
" 180\tPUD - MULTILEVEL - INCL SPLIT LEV/FOYER\n",
" 190\t2 FAMILY CONVERSION - ALL STYLES AND AGES\n",
"\n",
"The number itself does not appear to have a relationship to the other numbers. While 30 > 20 is True, it doesn't really make sense that \"1-STORY 1945 & OLDER\" > \"1-STORY 1946 & NEWER ALL STYLES\". Keep in mind that this isn't always the case: 1st class versus 2nd class seats encoded as 1 and 2, for example, do carry a meaningful order. Make sure you fully understand your data set so you can decide what needs to be converted or changed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### MSSubClass"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Convert to String\n",
"df['MS SubClass'] = df['MS SubClass'].apply(str)"
]
},
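{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the string conversion above matters, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its `code` and `area` columns are hypothetical and not part of the Ames data). By default `pd.get_dummies` only encodes string-like columns, so an integer code column stays a single numeric feature unless it is converted to strings first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data for illustration only (not the Ames data set)\n",
"toy = pd.DataFrame({'code': [20, 30, 20, 60], 'area': [100.0, 80.0, 120.0, 90.0]})\n",
"\n",
"# Integer codes are treated as ordinary numbers, so nothing is encoded here\n",
"before = pd.get_dummies(toy)\n",
"print(before.columns.tolist())\n",
"\n",
"# After converting to strings, each code becomes its own indicator column\n",
"toy['code'] = toy['code'].astype(str)\n",
"after = pd.get_dummies(toy)\n",
"print(after.columns.tolist())"
]
},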
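{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating \"Dummy\" Variables\n",
"\n",
"## Avoiding Multicollinearity and the Dummy Variable Trap\n",
"\n",
"If we create one dummy column for every category, the columns sum to 1 in every row, which makes them perfectly collinear with a model's intercept term. This is known as the dummy variable trap, and dropping one category per feature (`drop_first=True`) avoids it. Further reading: https://stats.stackexchange.com/questions/144372/dummy-variable-trap"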
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"person_state = pd.Series(['Dead','Alive','Dead','Alive','Dead','Dead'])"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Dead\n",
"1 Alive\n",
"2 Dead\n",
"3 Alive\n",
"4 Dead\n",
"5 Dead\n",
"dtype: object"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"person_state"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Alive</th>\n",
" <th>Dead</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Alive Dead\n",
"0 0 1\n",
"1 1 0\n",
"2 0 1\n",
"3 1 0\n",
"4 0 1\n",
"5 0 1"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.get_dummies(person_state)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Dead</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Dead\n",
"0 1\n",
"1 0\n",
"2 1\n",
"3 0\n",
"4 1\n",
"5 1"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.get_dummies(person_state,drop_first=True)"
]
},
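{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough numerical check of the dummy variable trap, the sketch below rebuilds the `person_state` example and adds a column of ones to stand in for a model's intercept (the intercept column and the use of `np.linalg.matrix_rank` are illustrative assumptions, not part of the original lecture). With both dummy columns the matrix has three columns but only rank 2, because Alive + Dead always equals the intercept column. With `drop_first=True` the number of columns matches the rank, so no column is redundant."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"person_state = pd.Series(['Dead','Alive','Dead','Alive','Dead','Dead'])\n",
"\n",
"# Full set of dummies versus the set with the first category dropped\n",
"full = pd.get_dummies(person_state).astype(float)\n",
"reduced = pd.get_dummies(person_state, drop_first=True).astype(float)\n",
"\n",
"# Prepend a column of ones as a stand-in for a regression intercept (assumption)\n",
"X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])\n",
"X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])\n",
"\n",
"# A rank below the number of columns means at least one column is redundant\n",
"rank_full = np.linalg.matrix_rank(X_full)\n",
"rank_reduced = np.linalg.matrix_rank(X_reduced)\n",
"print('all dummies  :', X_full.shape[1], 'columns, rank', rank_full)\n",
"print('drop_first   :', X_reduced.shape[1], 'columns, rank', rank_reduced)"
]
},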
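{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Dummy Variables from Object Columns\n",
"\n",
"The `select_dtypes` method lets us separate the string (object) columns, which need encoding, from the numeric columns: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html"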
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Street</th>\n",
" <th>Lot Shape</th>\n",
" <th>Land Contour</th>\n",
" <th>Utilities</th>\n",
" <th>Lot Config</th>\n",
" <th>Land Slope</th>\n",
" <th>Neighborhood</th>\n",
" <th>Condition 1</th>\n",
" <th>...</th>\n",
" <th>Kitchen Qual</th>\n",
" <th>Functional</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Garage Type</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Garage Cond</th>\n",
" <th>Paved Drive</th>\n",
" <th>Sale Type</th>\n",
" <th>Sale Condition</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>NAmes</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>TA</td>\n",
" <td>Typ</td>\n",
" <td>Gd</td>\n",
" <td>Attchd</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>P</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>20</td>\n",
" <td>RH</td>\n",
" <td>Pave</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Gtl</td>\n",
" <td>NAmes</td>\n",
" <td>Feedr</td>\n",
" <td>...</td>\n",
" <td>TA</td>\n",
" <td>Typ</td>\n",
" <td>None</td>\n",
" <td>Attchd</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>NAmes</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>Gd</td>\n",
" <td>Typ</td>\n",
" <td>None</td>\n",
" <td>Attchd</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>NAmes</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>Ex</td>\n",
" <td>Typ</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Gtl</td>\n",
" <td>Gilbert</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>TA</td>\n",
" <td>Typ</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2920</th>\n",
" <td>80</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>CulDSac</td>\n",
" <td>Gtl</td>\n",
" <td>Mitchel</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>TA</td>\n",
" <td>Typ</td>\n",
" <td>None</td>\n",
" <td>Detchd</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2921</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Low</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Mod</td>\n",
" <td>Mitchel</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>TA</td>\n",
" <td>Typ</td>\n",
" <td>None</td>\n",
" <td>Attchd</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2922</th>\n",
" <td>85</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Gtl</td>\n",
" <td>Mitchel</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>TA</td>\n",
" <td>Typ</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2923</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Mod</td>\n",
" <td>Mitchel</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>TA</td>\n",
" <td>Typ</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>RFn</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2924</th>\n",
" <td>60</td>\n",
" <td>RL</td>\n",
" <td>Pave</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Mod</td>\n",
" <td>Mitchel</td>\n",
" <td>Norm</td>\n",
" <td>...</td>\n",
" <td>TA</td>\n",
" <td>Typ</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Y</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2925 rows × 40 columns</p>\n",
"</div>"
],
"text/plain": [
" MS SubClass MS Zoning Street Lot Shape Land Contour Utilities Lot Config \\\n",
"0 20 RL Pave IR1 Lvl AllPub Corner \n",
"1 20 RH Pave Reg Lvl AllPub Inside \n",
"2 20 RL Pave IR1 Lvl AllPub Corner \n",
"3 20 RL Pave Reg Lvl AllPub Corner \n",
"4 60 RL Pave IR1 Lvl AllPub Inside \n",
"... ... ... ... ... ... ... ... \n",
"2920 80 RL Pave IR1 Lvl AllPub CulDSac \n",
"2921 20 RL Pave IR1 Low AllPub Inside \n",
"2922 85 RL Pave Reg Lvl AllPub Inside \n",
"2923 20 RL Pave Reg Lvl AllPub Inside \n",
"2924 60 RL Pave Reg Lvl AllPub Inside \n",
"\n",
" Land Slope Neighborhood Condition 1 ... Kitchen Qual Functional \\\n",
"0 Gtl NAmes Norm ... TA Typ \n",
"1 Gtl NAmes Feedr ... TA Typ \n",
"2 Gtl NAmes Norm ... Gd Typ \n",
"3 Gtl NAmes Norm ... Ex Typ \n",
"4 Gtl Gilbert Norm ... TA Typ \n",
"... ... ... ... ... ... ... \n",
"2920 Gtl Mitchel Norm ... TA Typ \n",
"2921 Mod Mitchel Norm ... TA Typ \n",
"2922 Gtl Mitchel Norm ... TA Typ \n",
"2923 Mod Mitchel Norm ... TA Typ \n",
"2924 Mod Mitchel Norm ... TA Typ \n",
"\n",
" Fireplace Qu Garage Type Garage Finish Garage Qual Garage Cond \\\n",
"0 Gd Attchd Fin TA TA \n",
"1 None Attchd Unf TA TA \n",
"2 None Attchd Unf TA TA \n",
"3 TA Attchd Fin TA TA \n",
"4 TA Attchd Fin TA TA \n",
"... ... ... ... ... ... \n",
"2920 None Detchd Unf TA TA \n",
"2921 None Attchd Unf TA TA \n",
"2922 None None None None None \n",
"2923 TA Attchd RFn TA TA \n",
"2924 TA Attchd Fin TA TA \n",
"\n",
" Paved Drive Sale Type Sale Condition \n",
"0 P WD Normal \n",
"1 Y WD Normal \n",
"2 Y WD Normal \n",
"3 Y WD Normal \n",
"4 Y WD Normal \n",
"... ... ... ... \n",
"2920 Y WD Normal \n",
"2921 Y WD Normal \n",
"2922 Y WD Normal \n",
"2923 Y WD Normal \n",
"2924 Y WD Normal \n",
"\n",
"[2925 rows x 40 columns]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.select_dtypes(include='object')"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# Split the columns by dtype: numeric features vs. string (object) features\n",
"df_nums = df.select_dtypes(exclude='object')\n",
"df_objs = df.select_dtypes(include='object')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 2925 entries, 0 to 2924\n",
"Data columns (total 36 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Lot Frontage 2925 non-null float64\n",
" 1 Lot Area 2925 non-null int64 \n",
" 2 Overall Qual 2925 non-null int64 \n",
" 3 Overall Cond 2925 non-null int64 \n",
" 4 Year Built 2925 non-null int64 \n",
" 5 Year Remod/Add 2925 non-null int64 \n",
" 6 Mas Vnr Area 2925 non-null float64\n",
" 7 BsmtFin SF 1 2925 non-null float64\n",
" 8 BsmtFin SF 2 2925 non-null float64\n",
" 9 Bsmt Unf SF 2925 non-null float64\n",
" 10 Total Bsmt SF 2925 non-null float64\n",
" 11 1st Flr SF 2925 non-null int64 \n",
" 12 2nd Flr SF 2925 non-null int64 \n",
" 13 Low Qual Fin SF 2925 non-null int64 \n",
" 14 Gr Liv Area 2925 non-null int64 \n",
" 15 Bsmt Full Bath 2925 non-null float64\n",
" 16 Bsmt Half Bath 2925 non-null float64\n",
" 17 Full Bath 2925 non-null int64 \n",
" 18 Half Bath 2925 non-null int64 \n",
" 19 Bedroom AbvGr 2925 non-null int64 \n",
" 20 Kitchen AbvGr 2925 non-null int64 \n",
" 21 TotRms AbvGrd 2925 non-null int64 \n",
" 22 Fireplaces 2925 non-null int64 \n",
" 23 Garage Yr Blt 2925 non-null float64\n",
" 24 Garage Cars 2925 non-null float64\n",
" 25 Garage Area 2925 non-null float64\n",
" 26 Wood Deck SF 2925 non-null int64 \n",
" 27 Open Porch SF 2925 non-null int64 \n",
" 28 Enclosed Porch 2925 non-null int64 \n",
" 29 3Ssn Porch 2925 non-null int64 \n",
" 30 Screen Porch 2925 non-null int64 \n",
" 31 Pool Area 2925 non-null int64 \n",
" 32 Misc Val 2925 non-null int64 \n",
" 33 Mo Sold 2925 non-null int64 \n",
" 34 Yr Sold 2925 non-null int64 \n",
" 35 SalePrice 2925 non-null int64 \n",
"dtypes: float64(11), int64(25)\n",
"memory usage: 822.8 KB\n"
]
}
],
"source": [
"df_nums.info()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 2925 entries, 0 to 2924\n",
"Data columns (total 40 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 MS SubClass 2925 non-null object\n",
" 1 MS Zoning 2925 non-null object\n",
" 2 Street 2925 non-null object\n",
" 3 Lot Shape 2925 non-null object\n",
" 4 Land Contour 2925 non-null object\n",
" 5 Utilities 2925 non-null object\n",
" 6 Lot Config 2925 non-null object\n",
" 7 Land Slope 2925 non-null object\n",
" 8 Neighborhood 2925 non-null object\n",
" 9 Condition 1 2925 non-null object\n",
" 10 Condition 2 2925 non-null object\n",
" 11 Bldg Type 2925 non-null object\n",
" 12 House Style 2925 non-null object\n",
" 13 Roof Style 2925 non-null object\n",
" 14 Roof Matl 2925 non-null object\n",
" 15 Exterior 1st 2925 non-null object\n",
" 16 Exterior 2nd 2925 non-null object\n",
" 17 Mas Vnr Type 2925 non-null object\n",
" 18 Exter Qual 2925 non-null object\n",
" 19 Exter Cond 2925 non-null object\n",
" 20 Foundation 2925 non-null object\n",
" 21 Bsmt Qual 2925 non-null object\n",
" 22 Bsmt Cond 2925 non-null object\n",
" 23 Bsmt Exposure 2925 non-null object\n",
" 24 BsmtFin Type 1 2925 non-null object\n",
" 25 BsmtFin Type 2 2925 non-null object\n",
" 26 Heating 2925 non-null object\n",
" 27 Heating QC 2925 non-null object\n",
" 28 Central Air 2925 non-null object\n",
" 29 Electrical 2925 non-null object\n",
" 30 Kitchen Qual 2925 non-null object\n",
" 31 Functional 2925 non-null object\n",
" 32 Fireplace Qu 2925 non-null object\n",
" 33 Garage Type 2925 non-null object\n",
" 34 Garage Finish 2925 non-null object\n",
" 35 Garage Qual 2925 non-null object\n",
" 36 Garage Cond 2925 non-null object\n",
" 37 Paved Drive 2925 non-null object\n",
" 38 Sale Type 2925 non-null object\n",
" 39 Sale Condition 2925 non-null object\n",
"dtypes: object(40)\n",
"memory usage: 914.2+ KB\n"
]
}
],
"source": [
"df_objs.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting to Dummy Variables"
]
},
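{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# One-hot encode every string column; drop_first=True drops one level per feature to avoid a redundant column\n",
"df_objs = pd.get_dummies(df_objs, drop_first=True)"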
]
},
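{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `drop_first=True` does, here is a minimal sketch on a made-up three-row frame. The `toy` DataFrame and its `color` column are hypothetical and not part of the Ames data, and `dtype=int` is passed only so the result shows 0/1 regardless of pandas version.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not from the Ames set\n",
"toy = pd.DataFrame({'color': ['red', 'blue', 'green']})\n",
"\n",
"# Full one-hot encoding: one column per category\n",
"full = pd.get_dummies(toy, dtype=int)\n",
"\n",
"# drop_first=True drops the first category in sorted order ('blue');\n",
"# a row of all zeros then stands for the dropped category\n",
"reduced = pd.get_dummies(toy, drop_first=True, dtype=int)\n",
"\n",
"print(list(full.columns))\n",
"print(list(reduced.columns))\n",
"```\n",
"\n",
"Dropping one level per feature removes a column that the remaining ones fully determine, which matters for models such as linear regression."
]
},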
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Recombine the numeric columns with the encoded columns, side by side\n",
"final_df = pd.concat([df_nums, df_objs], axis=1)"
]
},
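{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `pd.concat(..., axis=1)` lines rows up by index label, not by position. Here both pieces come from the same `df`, so the indexes match. The sketch below uses hypothetical toy frames (`left`, `right`, `shifted`) to show what would happen if they did not.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frames, not from the Ames set\n",
"left = pd.DataFrame({'x': [1, 2]})                   # index 0, 1\n",
"right = pd.DataFrame({'y': [3, 4]})                  # index 0, 1\n",
"shifted = pd.DataFrame({'y': [3, 4]}, index=[1, 2])  # index 1, 2\n",
"\n",
"# Matching indexes: rows pair up one to one\n",
"aligned = pd.concat([left, right], axis=1)\n",
"\n",
"# Mismatched indexes: the result covers the union of labels and fills gaps with NaN\n",
"misaligned = pd.concat([left, shifted], axis=1)\n",
"\n",
"print(aligned.shape)\n",
"print(misaligned.shape)\n",
"print(int(misaligned.isna().sum().sum()))\n",
"```\n",
"\n",
"If rows were filtered or reordered in only one of the two frames, checking the shape and the NaN count after concatenating is a cheap sanity test."
]
},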
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Lot Frontage</th>\n",
" <th>Lot Area</th>\n",
" <th>Overall Qual</th>\n",
" <th>Overall Cond</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>...</th>\n",
" <th>Sale Type_ConLw</th>\n",
" <th>Sale Type_New</th>\n",
" <th>Sale Type_Oth</th>\n",
" <th>Sale Type_VWD</th>\n",
" <th>Sale Type_WD</th>\n",
" <th>Sale Condition_AdjLand</th>\n",
" <th>Sale Condition_Alloca</th>\n",
" <th>Sale Condition_Family</th>\n",
" <th>Sale Condition_Normal</th>\n",
" <th>Sale Condition_Partial</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>141.000000</td>\n",
" <td>31770</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>112.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
" <td>441.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80.000000</td>\n",
" <td>11622</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>0.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
" <td>270.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>81.000000</td>\n",
" <td>14267</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>108.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
" <td>406.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>93.000000</td>\n",
" <td>11160</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>0.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
" <td>1045.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>74.000000</td>\n",
" <td>13830</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>0.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
" <td>137.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2920</th>\n",
" <td>37.000000</td>\n",
" <td>7937</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>1984</td>\n",
" <td>1984</td>\n",
" <td>0.0</td>\n",
" <td>819.0</td>\n",
" <td>0.0</td>\n",
" <td>184.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2921</th>\n",
" <td>75.144444</td>\n",
" <td>8885</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1983</td>\n",
" <td>1983</td>\n",
" <td>0.0</td>\n",
" <td>301.0</td>\n",
" <td>324.0</td>\n",
" <td>239.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2922</th>\n",
" <td>62.000000</td>\n",
" <td>10441</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1992</td>\n",
" <td>1992</td>\n",
" <td>0.0</td>\n",
" <td>337.0</td>\n",
" <td>0.0</td>\n",
" <td>575.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2923</th>\n",
" <td>77.000000</td>\n",
" <td>10010</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1974</td>\n",
" <td>1975</td>\n",
" <td>0.0</td>\n",
" <td>1071.0</td>\n",
" <td>123.0</td>\n",
" <td>195.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2924</th>\n",
" <td>74.000000</td>\n",
" <td>9627</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>1993</td>\n",
" <td>1994</td>\n",
" <td>94.0</td>\n",
" <td>758.0</td>\n",
" <td>0.0</td>\n",
" <td>238.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2925 rows × 274 columns</p>\n",
"</div>"
],
"text/plain": [
" Lot Frontage Lot Area Overall Qual Overall Cond Year Built \\\n",
"0 141.000000 31770 6 5 1960 \n",
"1 80.000000 11622 5 6 1961 \n",
"2 81.000000 14267 6 6 1958 \n",
"3 93.000000 11160 7 5 1968 \n",
"4 74.000000 13830 5 5 1997 \n",
"... ... ... ... ... ... \n",
"2920 37.000000 7937 6 6 1984 \n",
"2921 75.144444 8885 5 5 1983 \n",
"2922 62.000000 10441 5 5 1992 \n",
"2923 77.000000 10010 5 5 1974 \n",
"2924 74.000000 9627 7 5 1993 \n",
"\n",
" Year Remod/Add Mas Vnr Area BsmtFin SF 1 BsmtFin SF 2 Bsmt Unf SF \\\n",
"0 1960 112.0 639.0 0.0 441.0 \n",
"1 1961 0.0 468.0 144.0 270.0 \n",
"2 1958 108.0 923.0 0.0 406.0 \n",
"3 1968 0.0 1065.0 0.0 1045.0 \n",
"4 1998 0.0 791.0 0.0 137.0 \n",
"... ... ... ... ... ... \n",
"2920 1984 0.0 819.0 0.0 184.0 \n",
"2921 1983 0.0 301.0 324.0 239.0 \n",
"2922 1992 0.0 337.0 0.0 575.0 \n",
"2923 1975 0.0 1071.0 123.0 195.0 \n",
"2924 1994 94.0 758.0 0.0 238.0 \n",
"\n",
" ... Sale Type_ConLw Sale Type_New Sale Type_Oth Sale Type_VWD \\\n",
"0 ... 0 0 0 0 \n",
"1 ... 0 0 0 0 \n",
"2 ... 0 0 0 0 \n",
"3 ... 0 0 0 0 \n",
"4 ... 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"2920 ... 0 0 0 0 \n",
"2921 ... 0 0 0 0 \n",
"2922 ... 0 0 0 0 \n",
"2923 ... 0 0 0 0 \n",
"2924 ... 0 0 0 0 \n",
"\n",
" Sale Type_WD Sale Condition_AdjLand Sale Condition_Alloca \\\n",
"0 1 0 0 \n",
"1 1 0 0 \n",
"2 1 0 0 \n",
"3 1 0 0 \n",
"4 1 0 0 \n",
"... ... ... ... \n",
"2920 1 0 0 \n",
"2921 1 0 0 \n",
"2922 1 0 0 \n",
"2923 1 0 0 \n",
"2924 1 0 0 \n",
"\n",
" Sale Condition_Family Sale Condition_Normal Sale Condition_Partial \n",
"0 0 1 0 \n",
"1 0 1 0 \n",
"2 0 1 0 \n",
"3 0 1 0 \n",
"4 0 1 0 \n",
"... ... ... ... \n",
"2920 0 1 0 \n",
"2921 0 1 0 \n",
"2922 0 1 0 \n",
"2923 0 1 0 \n",
"2924 0 1 0 \n",
"\n",
"[2925 rows x 274 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"final_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Thoughts\n",
"\n",
"Keep in mind that we don't yet know whether all 274 columns are useful. More columns do not necessarily lead to better results. In fact, we may want to remove more columns (or later use a model with regularization to select the important columns for us). What we have done here greatly increases the ratio of columns to rows, which may actually lead to worse performance (although you won't know until you've compared multiple models and approaches)."
]
},
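{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick back-of-the-envelope check of how the balance between rows and columns changed, using only the shapes reported earlier in this notebook (2925 rows, 36 numeric plus 40 string columns before encoding, 274 columns after). This is a rough illustration, not a rule for how many rows per column a model needs.\n",
"\n",
"```python\n",
"n_rows = 2925\n",
"cols_before = 36 + 40  # numeric + string columns before get_dummies\n",
"cols_after = 274       # columns in final_df\n",
"\n",
"rows_per_col_before = n_rows / cols_before\n",
"rows_per_col_after = n_rows / cols_after\n",
"\n",
"print(f'{rows_per_col_before:.1f} rows per column before encoding')\n",
"print(f'{rows_per_col_after:.1f} rows per column after encoding')\n",
"```\n",
"\n",
"The number of rows available per column drops by a factor of about 3.6, since the row count is unchanged while the column count grows from 76 to 274."
]
},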
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Exter Qual_TA -0.591459\n",
"Kitchen Qual_TA -0.527461\n",
"Fireplace Qu_None -0.481740\n",
"Bsmt Qual_TA -0.453022\n",
"Garage Finish_Unf -0.422363\n",
" ... \n",
"Garage Cars 0.648488\n",
"Total Bsmt SF 0.660983\n",
"Gr Liv Area 0.727279\n",
"Overall Qual 0.802637\n",
"SalePrice 1.000000\n",
"Name: SalePrice, Length: 274, dtype: float64"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"final_df.corr()['SalePrice'].sort_values()"
]
},
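{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sorting the raw correlations puts strong negative values at the top and strong positive values at the bottom. To rank features by strength regardless of sign, one option is to sort by absolute value. The sketch below uses a hypothetical toy frame (`size`, `age`, `noise`, `price` are made-up columns), not the Ames data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not from the Ames set\n",
"toy = pd.DataFrame({\n",
"    'size': [1, 2, 3, 4, 5],\n",
"    'age': [5, 3, 4, 2, 1],\n",
"    'noise': [2, 1, 2, 1, 2],\n",
"    'price': [10, 20, 30, 40, 50],\n",
"})\n",
"\n",
"# Correlation of every column with the target, excluding the target itself\n",
"corr_with_price = toy.corr()['price'].drop('price')\n",
"\n",
"# Rank by magnitude so strong negative correlations are not buried\n",
"ranked = corr_with_price.abs().sort_values(ascending=False)\n",
"print(ranked)\n",
"```\n",
"\n",
"The same pattern could be applied to `final_df` with `'SalePrice'` as the target. Correlation only captures linear relationships, so a low value does not prove a column is useless."
]
},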
{
"cell_type": "markdown",
"metadata": {},
"source": [
" OverallQual: Rates the overall material and finish of the house\n",
"\n",
" 10\tVery Excellent\n",
" 9\tExcellent\n",
" 8\tVery Good\n",
" 7\tGood\n",
" 6\tAbove Average\n",
" 5\tAverage\n",
" 4\tBelow Average\n",
" 3\tFair\n",
" 2\tPoor\n",
" 1\tVery Poor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most likely a human realtor rated this \"Overall Qual\" column, which means it very likely takes many of the other features into account. It also means that any future house we want to price will need this \"Overall Qual\" feature, so every new house priced with our ML model will still require a human to rate it!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save Final DF"
]
},
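{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"# index=False keeps the row index out of the file, so no extra unnamed column appears on reload\n",
"final_df.to_csv('../DATA/AMES_Final_DF.csv', index=False)"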
]
},
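{
"cell_type": "markdown",
"metadata": {},
"source": [
"One detail worth knowing when saving: `to_csv` writes the row index by default, and `read_csv` then loads it back as an extra column. The sketch below demonstrates this on a hypothetical two-row frame, writing to in-memory buffers instead of files.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not from the Ames set\n",
"toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})\n",
"\n",
"# Default index=True writes the row index as an unnamed first column\n",
"with_index = io.StringIO()\n",
"toy.to_csv(with_index)\n",
"with_index.seek(0)\n",
"cols_with = list(pd.read_csv(with_index).columns)\n",
"\n",
"# index=False writes only the data columns\n",
"without_index = io.StringIO()\n",
"toy.to_csv(without_index, index=False)\n",
"without_index.seek(0)\n",
"cols_without = list(pd.read_csv(without_index).columns)\n",
"\n",
"print(cols_with)\n",
"print(cols_without)\n",
"```\n",
"\n",
"If a file was saved with the index included, reading it with `pd.read_csv(path, index_col=0)` restores the index instead of creating a stray feature column."
]
},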
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}