{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dealing with Missing Data\n",
"\n",
"We have already reviewed the pandas operations for missing data; now let's apply them to clean a real data file. Keep in mind that there is no single correct way to do this, and this notebook serves only as an example of some reasonable approaches to this data.\n",
"\n",
"#### Note: Throughout this section we will gradually clean and add features to the Ames Housing Dataset for use in the next section. Make sure you always load the same file name as the notebook does.\n",
"\n",
"#### 2nd Note: Some of the methods shown here may not lead to optimal performance; they are included to illustrate the various methods available.\n",
"-----\n",
"\n",
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MSSubClass: Identifies the type of dwelling involved in the sale.\t\n",
"\n",
" 20\t1-STORY 1946 & NEWER ALL STYLES\n",
" 30\t1-STORY 1945 & OLDER\n",
" 40\t1-STORY W/FINISHED ATTIC ALL AGES\n",
" 45\t1-1/2 STORY - UNFINISHED ALL AGES\n",
" 50\t1-1/2 STORY FINISHED ALL AGES\n",
" 60\t2-STORY 1946 & NEWER\n",
" 70\t2-STORY 1945 & OLDER\n",
" 75\t2-1/2 STORY ALL AGES\n",
" 80\tSPLIT OR MULTI-LEVEL\n",
" 85\tSPLIT FOYER\n",
" 90\tDUPLEX - ALL STYLES AND AGES\n",
" 120\t1-STORY PUD (Planned Unit Development) - 1946 & NEWER\n",
" 150\t1-1/2 STORY PUD - ALL AGES\n",
" 160\t2-STORY PUD - 1946 & NEWER\n",
" 180\tPUD - MULTILEVEL - INCL SPLIT LEV/FOYER\n",
" 190\t2 FAMILY CONVERSION - ALL STYLES AND AGES\n",
"\n",
"MSZoning: Identifies the general zoning classification of the sale.\n",
"\t\t\n",
" A\tAgriculture\n",
" C\tCommercial\n",
" FV\tFloating Village Residential\n",
" I\tIndustrial\n",
" RH\tResidential High Density\n",
" RL\tResidential Low Density\n",
" RP\tResidential Low Density Park \n",
" RM\tResidential Medium Density\n",
"\t\n",
"LotFrontage: Linear feet of street connected to property\n",
"\n",
"LotArea: Lot size in square feet\n",
"\n",
"Street: Type of road access to property\n",
"\n",
" Grvl\tGravel\t\n",
" Pave\tPaved\n",
" \t\n",
"Alley: Type of alley access to property\n",
"\n",
" Grvl\tGravel\n",
" Pave\tPaved\n",
" NA \tNo alley access\n",
"\t\t\n",
"LotShape: General shape of property\n",
"\n",
" Reg\tRegular\t\n",
" IR1\tSlightly irregular\n",
" IR2\tModerately Irregular\n",
" IR3\tIrregular\n",
" \n",
"LandContour: Flatness of the property\n",
"\n",
" Lvl\tNear Flat/Level\t\n",
" Bnk\tBanked - Quick and significant rise from street grade to building\n",
" HLS\tHillside - Significant slope from side to side\n",
" Low\tDepression\n",
"\t\t\n",
"Utilities: Type of utilities available\n",
"\t\t\n",
" AllPub\tAll public Utilities (E,G,W,& S)\t\n",
" NoSewr\tElectricity, Gas, and Water (Septic Tank)\n",
" NoSeWa\tElectricity and Gas Only\n",
" ELO\tElectricity only\t\n",
"\t\n",
"LotConfig: Lot configuration\n",
"\n",
" Inside\tInside lot\n",
" Corner\tCorner lot\n",
" CulDSac\tCul-de-sac\n",
" FR2\tFrontage on 2 sides of property\n",
" FR3\tFrontage on 3 sides of property\n",
"\t\n",
"LandSlope: Slope of property\n",
"\t\t\n",
" Gtl\tGentle slope\n",
" Mod\tModerate Slope\t\n",
" Sev\tSevere Slope\n",
"\t\n",
"Neighborhood: Physical locations within Ames city limits\n",
"\n",
" Blmngtn\tBloomington Heights\n",
" Blueste\tBluestem\n",
" BrDale\tBriardale\n",
" BrkSide\tBrookside\n",
" ClearCr\tClear Creek\n",
" CollgCr\tCollege Creek\n",
" Crawfor\tCrawford\n",
" Edwards\tEdwards\n",
" Gilbert\tGilbert\n",
" IDOTRR\tIowa DOT and Rail Road\n",
" MeadowV\tMeadow Village\n",
" Mitchel\tMitchell\n",
" Names\tNorth Ames\n",
" NoRidge\tNorthridge\n",
" NPkVill\tNorthpark Villa\n",
" NridgHt\tNorthridge Heights\n",
" NWAmes\tNorthwest Ames\n",
" OldTown\tOld Town\n",
" SWISU\tSouth & West of Iowa State University\n",
" Sawyer\tSawyer\n",
" SawyerW\tSawyer West\n",
" Somerst\tSomerset\n",
" StoneBr\tStone Brook\n",
" Timber\tTimberland\n",
" Veenker\tVeenker\n",
"\t\t\t\n",
"Condition1: Proximity to various conditions\n",
"\t\n",
" Artery\tAdjacent to arterial street\n",
" Feedr\tAdjacent to feeder street\t\n",
" Norm\tNormal\t\n",
" RRNn\tWithin 200' of North-South Railroad\n",
" RRAn\tAdjacent to North-South Railroad\n",
" PosN\tNear positive off-site feature--park, greenbelt, etc.\n",
" PosA\tAdjacent to postive off-site feature\n",
" RRNe\tWithin 200' of East-West Railroad\n",
" RRAe\tAdjacent to East-West Railroad\n",
"\t\n",
"Condition2: Proximity to various conditions (if more than one is present)\n",
"\t\t\n",
" Artery\tAdjacent to arterial street\n",
" Feedr\tAdjacent to feeder street\t\n",
" Norm\tNormal\t\n",
" RRNn\tWithin 200' of North-South Railroad\n",
" RRAn\tAdjacent to North-South Railroad\n",
" PosN\tNear positive off-site feature--park, greenbelt, etc.\n",
" PosA\tAdjacent to postive off-site feature\n",
" RRNe\tWithin 200' of East-West Railroad\n",
" RRAe\tAdjacent to East-West Railroad\n",
"\t\n",
"BldgType: Type of dwelling\n",
"\t\t\n",
" 1Fam\tSingle-family Detached\t\n",
" 2FmCon\tTwo-family Conversion; originally built as one-family dwelling\n",
" Duplx\tDuplex\n",
" TwnhsE\tTownhouse End Unit\n",
" TwnhsI\tTownhouse Inside Unit\n",
"\t\n",
"HouseStyle: Style of dwelling\n",
"\t\n",
" 1Story\tOne story\n",
" 1.5Fin\tOne and one-half story: 2nd level finished\n",
" 1.5Unf\tOne and one-half story: 2nd level unfinished\n",
" 2Story\tTwo story\n",
" 2.5Fin\tTwo and one-half story: 2nd level finished\n",
" 2.5Unf\tTwo and one-half story: 2nd level unfinished\n",
" SFoyer\tSplit Foyer\n",
" SLvl\tSplit Level\n",
"\t\n",
"OverallQual: Rates the overall material and finish of the house\n",
"\n",
" 10\tVery Excellent\n",
" 9\tExcellent\n",
" 8\tVery Good\n",
" 7\tGood\n",
" 6\tAbove Average\n",
" 5\tAverage\n",
" 4\tBelow Average\n",
" 3\tFair\n",
" 2\tPoor\n",
" 1\tVery Poor\n",
"\t\n",
"OverallCond: Rates the overall condition of the house\n",
"\n",
" 10\tVery Excellent\n",
" 9\tExcellent\n",
" 8\tVery Good\n",
" 7\tGood\n",
" 6\tAbove Average\t\n",
" 5\tAverage\n",
" 4\tBelow Average\t\n",
" 3\tFair\n",
" 2\tPoor\n",
" 1\tVery Poor\n",
"\t\t\n",
"YearBuilt: Original construction date\n",
"\n",
"YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)\n",
"\n",
"RoofStyle: Type of roof\n",
"\n",
" Flat\tFlat\n",
" Gable\tGable\n",
" Gambrel\tGabrel (Barn)\n",
" Hip\tHip\n",
" Mansard\tMansard\n",
" Shed\tShed\n",
"\t\t\n",
"RoofMatl: Roof material\n",
"\n",
" ClyTile\tClay or Tile\n",
" CompShg\tStandard (Composite) Shingle\n",
" Membran\tMembrane\n",
" Metal\tMetal\n",
" Roll\tRoll\n",
" Tar&Grv\tGravel & Tar\n",
" WdShake\tWood Shakes\n",
" WdShngl\tWood Shingles\n",
"\t\t\n",
"Exterior1st: Exterior covering on house\n",
"\n",
" AsbShng\tAsbestos Shingles\n",
" AsphShn\tAsphalt Shingles\n",
" BrkComm\tBrick Common\n",
" BrkFace\tBrick Face\n",
" CBlock\tCinder Block\n",
" CemntBd\tCement Board\n",
" HdBoard\tHard Board\n",
" ImStucc\tImitation Stucco\n",
" MetalSd\tMetal Siding\n",
" Other\tOther\n",
" Plywood\tPlywood\n",
" PreCast\tPreCast\t\n",
" Stone\tStone\n",
" Stucco\tStucco\n",
" VinylSd\tVinyl Siding\n",
" Wd Sdng\tWood Siding\n",
" WdShing\tWood Shingles\n",
"\t\n",
"Exterior2nd: Exterior covering on house (if more than one material)\n",
"\n",
" AsbShng\tAsbestos Shingles\n",
" AsphShn\tAsphalt Shingles\n",
" BrkComm\tBrick Common\n",
" BrkFace\tBrick Face\n",
" CBlock\tCinder Block\n",
" CemntBd\tCement Board\n",
" HdBoard\tHard Board\n",
" ImStucc\tImitation Stucco\n",
" MetalSd\tMetal Siding\n",
" Other\tOther\n",
" Plywood\tPlywood\n",
" PreCast\tPreCast\n",
" Stone\tStone\n",
" Stucco\tStucco\n",
" VinylSd\tVinyl Siding\n",
" Wd Sdng\tWood Siding\n",
" WdShing\tWood Shingles\n",
"\t\n",
"MasVnrType: Masonry veneer type\n",
"\n",
" BrkCmn\tBrick Common\n",
" BrkFace\tBrick Face\n",
" CBlock\tCinder Block\n",
" None\tNone\n",
" Stone\tStone\n",
"\t\n",
"MasVnrArea: Masonry veneer area in square feet\n",
"\n",
"ExterQual: Evaluates the quality of the material on the exterior \n",
"\t\t\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tAverage/Typical\n",
" Fa\tFair\n",
" Po\tPoor\n",
"\t\t\n",
"ExterCond: Evaluates the present condition of the material on the exterior\n",
"\t\t\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tAverage/Typical\n",
" Fa\tFair\n",
" Po\tPoor\n",
"\t\t\n",
"Foundation: Type of foundation\n",
"\t\t\n",
" BrkTil\tBrick & Tile\n",
" CBlock\tCinder Block\n",
" PConc\tPoured Contrete\t\n",
" Slab\tSlab\n",
" Stone\tStone\n",
" Wood\tWood\n",
"\t\t\n",
"BsmtQual: Evaluates the height of the basement\n",
"\n",
" Ex\tExcellent (100+ inches)\t\n",
" Gd\tGood (90-99 inches)\n",
" TA\tTypical (80-89 inches)\n",
" Fa\tFair (70-79 inches)\n",
" Po\tPoor (<70 inches\n",
" NA\tNo Basement\n",
"\t\t\n",
"BsmtCond: Evaluates the general condition of the basement\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tTypical - slight dampness allowed\n",
" Fa\tFair - dampness or some cracking or settling\n",
" Po\tPoor - Severe cracking, settling, or wetness\n",
" NA\tNo Basement\n",
"\t\n",
"BsmtExposure: Refers to walkout or garden level walls\n",
"\n",
" Gd\tGood Exposure\n",
" Av\tAverage Exposure (split levels or foyers typically score average or above)\t\n",
" Mn\tMimimum Exposure\n",
" No\tNo Exposure\n",
" NA\tNo Basement\n",
"\t\n",
"BsmtFinType1: Rating of basement finished area\n",
"\n",
" GLQ\tGood Living Quarters\n",
" ALQ\tAverage Living Quarters\n",
" BLQ\tBelow Average Living Quarters\t\n",
" Rec\tAverage Rec Room\n",
" LwQ\tLow Quality\n",
" Unf\tUnfinshed\n",
" NA\tNo Basement\n",
"\t\t\n",
"BsmtFinSF1: Type 1 finished square feet\n",
"\n",
"BsmtFinType2: Rating of basement finished area (if multiple types)\n",
"\n",
" GLQ\tGood Living Quarters\n",
" ALQ\tAverage Living Quarters\n",
" BLQ\tBelow Average Living Quarters\t\n",
" Rec\tAverage Rec Room\n",
" LwQ\tLow Quality\n",
" Unf\tUnfinshed\n",
" NA\tNo Basement\n",
"\n",
"BsmtFinSF2: Type 2 finished square feet\n",
"\n",
"BsmtUnfSF: Unfinished square feet of basement area\n",
"\n",
"TotalBsmtSF: Total square feet of basement area\n",
"\n",
"Heating: Type of heating\n",
"\t\t\n",
" Floor\tFloor Furnace\n",
" GasA\tGas forced warm air furnace\n",
" GasW\tGas hot water or steam heat\n",
" Grav\tGravity furnace\t\n",
" OthW\tHot water or steam heat other than gas\n",
" Wall\tWall furnace\n",
"\t\t\n",
"HeatingQC: Heating quality and condition\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tAverage/Typical\n",
" Fa\tFair\n",
" Po\tPoor\n",
"\t\t\n",
"CentralAir: Central air conditioning\n",
"\n",
" N\tNo\n",
" Y\tYes\n",
"\t\t\n",
"Electrical: Electrical system\n",
"\n",
" SBrkr\tStandard Circuit Breakers & Romex\n",
" FuseA\tFuse Box over 60 AMP and all Romex wiring (Average)\t\n",
" FuseF\t60 AMP Fuse Box and mostly Romex wiring (Fair)\n",
" FuseP\t60 AMP Fuse Box and mostly knob & tube wiring (poor)\n",
" Mix\tMixed\n",
"\t\t\n",
"1stFlrSF: First Floor square feet\n",
" \n",
"2ndFlrSF: Second floor square feet\n",
"\n",
"LowQualFinSF: Low quality finished square feet (all floors)\n",
"\n",
"GrLivArea: Above grade (ground) living area square feet\n",
"\n",
"BsmtFullBath: Basement full bathrooms\n",
"\n",
"BsmtHalfBath: Basement half bathrooms\n",
"\n",
"FullBath: Full bathrooms above grade\n",
"\n",
"HalfBath: Half baths above grade\n",
"\n",
"Bedroom: Bedrooms above grade (does NOT include basement bedrooms)\n",
"\n",
"Kitchen: Kitchens above grade\n",
"\n",
"KitchenQual: Kitchen quality\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tTypical/Average\n",
" Fa\tFair\n",
" Po\tPoor\n",
" \t\n",
"TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)\n",
"\n",
"Functional: Home functionality (Assume typical unless deductions are warranted)\n",
"\n",
" Typ\tTypical Functionality\n",
" Min1\tMinor Deductions 1\n",
" Min2\tMinor Deductions 2\n",
" Mod\tModerate Deductions\n",
" Maj1\tMajor Deductions 1\n",
" Maj2\tMajor Deductions 2\n",
" Sev\tSeverely Damaged\n",
" Sal\tSalvage only\n",
"\t\t\n",
"Fireplaces: Number of fireplaces\n",
"\n",
"FireplaceQu: Fireplace quality\n",
"\n",
" Ex\tExcellent - Exceptional Masonry Fireplace\n",
" Gd\tGood - Masonry Fireplace in main level\n",
" TA\tAverage - Prefabricated Fireplace in main living area or Masonry Fireplace in basement\n",
" Fa\tFair - Prefabricated Fireplace in basement\n",
" Po\tPoor - Ben Franklin Stove\n",
" NA\tNo Fireplace\n",
"\t\t\n",
"GarageType: Garage location\n",
"\t\t\n",
" 2Types\tMore than one type of garage\n",
" Attchd\tAttached to home\n",
" Basment\tBasement Garage\n",
" BuiltIn\tBuilt-In (Garage part of house - typically has room above garage)\n",
" CarPort\tCar Port\n",
" Detchd\tDetached from home\n",
" NA\tNo Garage\n",
"\t\t\n",
"GarageYrBlt: Year garage was built\n",
"\t\t\n",
"GarageFinish: Interior finish of the garage\n",
"\n",
" Fin\tFinished\n",
" RFn\tRough Finished\t\n",
" Unf\tUnfinished\n",
" NA\tNo Garage\n",
"\t\t\n",
"GarageCars: Size of garage in car capacity\n",
"\n",
"GarageArea: Size of garage in square feet\n",
"\n",
"GarageQual: Garage quality\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tTypical/Average\n",
" Fa\tFair\n",
" Po\tPoor\n",
" NA\tNo Garage\n",
"\t\t\n",
"GarageCond: Garage condition\n",
"\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tTypical/Average\n",
" Fa\tFair\n",
" Po\tPoor\n",
" NA\tNo Garage\n",
"\t\t\n",
"PavedDrive: Paved driveway\n",
"\n",
" Y\tPaved \n",
" P\tPartial Pavement\n",
" N\tDirt/Gravel\n",
"\t\t\n",
"WoodDeckSF: Wood deck area in square feet\n",
"\n",
"OpenPorchSF: Open porch area in square feet\n",
"\n",
"EnclosedPorch: Enclosed porch area in square feet\n",
"\n",
"3SsnPorch: Three season porch area in square feet\n",
"\n",
"ScreenPorch: Screen porch area in square feet\n",
"\n",
"PoolArea: Pool area in square feet\n",
"\n",
"PoolQC: Pool quality\n",
"\t\t\n",
" Ex\tExcellent\n",
" Gd\tGood\n",
" TA\tAverage/Typical\n",
" Fa\tFair\n",
" NA\tNo Pool\n",
"\t\t\n",
"Fence: Fence quality\n",
"\t\t\n",
" GdPrv\tGood Privacy\n",
" MnPrv\tMinimum Privacy\n",
" GdWo\tGood Wood\n",
" MnWw\tMinimum Wood/Wire\n",
" NA\tNo Fence\n",
"\t\n",
"MiscFeature: Miscellaneous feature not covered in other categories\n",
"\t\t\n",
" Elev\tElevator\n",
" Gar2\t2nd Garage (if not described in garage section)\n",
" Othr\tOther\n",
" Shed\tShed (over 100 SF)\n",
" TenC\tTennis Court\n",
" NA\tNone\n",
"\t\t\n",
"MiscVal: $Value of miscellaneous feature\n",
"\n",
"MoSold: Month Sold (MM)\n",
"\n",
"YrSold: Year Sold (YYYY)\n",
"\n",
"SaleType: Type of sale\n",
"\t\t\n",
" WD \tWarranty Deed - Conventional\n",
" CWD\tWarranty Deed - Cash\n",
" VWD\tWarranty Deed - VA Loan\n",
" New\tHome just constructed and sold\n",
" COD\tCourt Officer Deed/Estate\n",
" Con\tContract 15% Down payment regular terms\n",
" ConLw\tContract Low Down payment and low interest\n",
" ConLI\tContract Low Interest\n",
" ConLD\tContract Low Down\n",
" Oth\tOther\n",
"\t\t\n",
"SaleCondition: Condition of sale\n",
"\n",
" Normal\tNormal Sale\n",
" Abnorml\tAbnormal Sale - trade, foreclosure, short sale\n",
" AdjLand\tAdjoining Land Purchase\n",
" Alloca\tAllocation - two linked properties with separate deeds, typically condo with a garage unit\t\n",
" Family\tSale between family members\n",
" Partial\tHome was not completed when last assessed (associated with New Homes)\n",
"\n"
]
}
],
"source": [
"with open('../DATA/Ames_Housing_Feature_Description.txt','r') as f: \n",
" print(f.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"../DATA/Ames_outliers_removed.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PID</th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Lot Frontage</th>\n",
" <th>Lot Area</th>\n",
" <th>Street</th>\n",
" <th>Alley</th>\n",
" <th>Lot Shape</th>\n",
" <th>Land Contour</th>\n",
" <th>Utilities</th>\n",
" <th>...</th>\n",
" <th>Pool Area</th>\n",
" <th>Pool QC</th>\n",
" <th>Fence</th>\n",
" <th>Misc Feature</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Yr Sold</th>\n",
" <th>Sale Type</th>\n",
" <th>Sale Condition</th>\n",
" <th>SalePrice</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>526301100</td>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>141.0</td>\n",
" <td>31770</td>\n",
" <td>Pave</td>\n",
" <td>NaN</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>215000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>526350040</td>\n",
" <td>20</td>\n",
" <td>RH</td>\n",
" <td>80.0</td>\n",
" <td>11622</td>\n",
" <td>Pave</td>\n",
" <td>NaN</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>MnPrv</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>105000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>526351010</td>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>81.0</td>\n",
" <td>14267</td>\n",
" <td>Pave</td>\n",
" <td>NaN</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Gar2</td>\n",
" <td>12500</td>\n",
" <td>6</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>172000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>526353030</td>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>93.0</td>\n",
" <td>11160</td>\n",
" <td>Pave</td>\n",
" <td>NaN</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>244000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>527105010</td>\n",
" <td>60</td>\n",
" <td>RL</td>\n",
" <td>74.0</td>\n",
" <td>13830</td>\n",
" <td>Pave</td>\n",
" <td>NaN</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>MnPrv</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>189900</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 81 columns</p>\n",
"</div>"
],
"text/plain": [
" PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley \\\n",
"0 526301100 20 RL 141.0 31770 Pave NaN \n",
"1 526350040 20 RH 80.0 11622 Pave NaN \n",
"2 526351010 20 RL 81.0 14267 Pave NaN \n",
"3 526353030 20 RL 93.0 11160 Pave NaN \n",
"4 527105010 60 RL 74.0 13830 Pave NaN \n",
"\n",
" Lot Shape Land Contour Utilities ... Pool Area Pool QC Fence Misc Feature \\\n",
"0 IR1 Lvl AllPub ... 0 NaN NaN NaN \n",
"1 Reg Lvl AllPub ... 0 NaN MnPrv NaN \n",
"2 IR1 Lvl AllPub ... 0 NaN NaN Gar2 \n",
"3 Reg Lvl AllPub ... 0 NaN NaN NaN \n",
"4 IR1 Lvl AllPub ... 0 NaN MnPrv NaN \n",
"\n",
" Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice \n",
"0 0 5 2010 WD Normal 215000 \n",
"1 0 6 2010 WD Normal 105000 \n",
"2 12500 6 2010 WD Normal 172000 \n",
"3 0 4 2010 WD Normal 244000 \n",
"4 0 3 2010 WD Normal 189900 \n",
"\n",
"[5 rows x 81 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"81"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 2927 entries, 0 to 2926\n",
"Data columns (total 81 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 PID 2927 non-null int64 \n",
" 1 MS SubClass 2927 non-null int64 \n",
" 2 MS Zoning 2927 non-null object \n",
" 3 Lot Frontage 2437 non-null float64\n",
" 4 Lot Area 2927 non-null int64 \n",
" 5 Street 2927 non-null object \n",
" 6 Alley 198 non-null object \n",
" 7 Lot Shape 2927 non-null object \n",
" 8 Land Contour 2927 non-null object \n",
" 9 Utilities 2927 non-null object \n",
" 10 Lot Config 2927 non-null object \n",
" 11 Land Slope 2927 non-null object \n",
" 12 Neighborhood 2927 non-null object \n",
" 13 Condition 1 2927 non-null object \n",
" 14 Condition 2 2927 non-null object \n",
" 15 Bldg Type 2927 non-null object \n",
" 16 House Style 2927 non-null object \n",
" 17 Overall Qual 2927 non-null int64 \n",
" 18 Overall Cond 2927 non-null int64 \n",
" 19 Year Built 2927 non-null int64 \n",
" 20 Year Remod/Add 2927 non-null int64 \n",
" 21 Roof Style 2927 non-null object \n",
" 22 Roof Matl 2927 non-null object \n",
" 23 Exterior 1st 2927 non-null object \n",
" 24 Exterior 2nd 2927 non-null object \n",
" 25 Mas Vnr Type 2904 non-null object \n",
" 26 Mas Vnr Area 2904 non-null float64\n",
" 27 Exter Qual 2927 non-null object \n",
" 28 Exter Cond 2927 non-null object \n",
" 29 Foundation 2927 non-null object \n",
" 30 Bsmt Qual 2847 non-null object \n",
" 31 Bsmt Cond 2847 non-null object \n",
" 32 Bsmt Exposure 2844 non-null object \n",
" 33 BsmtFin Type 1 2847 non-null object \n",
" 34 BsmtFin SF 1 2926 non-null float64\n",
" 35 BsmtFin Type 2 2846 non-null object \n",
" 36 BsmtFin SF 2 2926 non-null float64\n",
" 37 Bsmt Unf SF 2926 non-null float64\n",
" 38 Total Bsmt SF 2926 non-null float64\n",
" 39 Heating 2927 non-null object \n",
" 40 Heating QC 2927 non-null object \n",
" 41 Central Air 2927 non-null object \n",
" 42 Electrical 2926 non-null object \n",
" 43 1st Flr SF 2927 non-null int64 \n",
" 44 2nd Flr SF 2927 non-null int64 \n",
" 45 Low Qual Fin SF 2927 non-null int64 \n",
" 46 Gr Liv Area 2927 non-null int64 \n",
" 47 Bsmt Full Bath 2925 non-null float64\n",
" 48 Bsmt Half Bath 2925 non-null float64\n",
" 49 Full Bath 2927 non-null int64 \n",
" 50 Half Bath 2927 non-null int64 \n",
" 51 Bedroom AbvGr 2927 non-null int64 \n",
" 52 Kitchen AbvGr 2927 non-null int64 \n",
" 53 Kitchen Qual 2927 non-null object \n",
" 54 TotRms AbvGrd 2927 non-null int64 \n",
" 55 Functional 2927 non-null object \n",
" 56 Fireplaces 2927 non-null int64 \n",
" 57 Fireplace Qu 1505 non-null object \n",
" 58 Garage Type 2770 non-null object \n",
" 59 Garage Yr Blt 2768 non-null float64\n",
" 60 Garage Finish 2768 non-null object \n",
" 61 Garage Cars 2926 non-null float64\n",
" 62 Garage Area 2926 non-null float64\n",
" 63 Garage Qual 2768 non-null object \n",
" 64 Garage Cond 2768 non-null object \n",
" 65 Paved Drive 2927 non-null object \n",
" 66 Wood Deck SF 2927 non-null int64 \n",
" 67 Open Porch SF 2927 non-null int64 \n",
" 68 Enclosed Porch 2927 non-null int64 \n",
" 69 3Ssn Porch 2927 non-null int64 \n",
" 70 Screen Porch 2927 non-null int64 \n",
" 71 Pool Area 2927 non-null int64 \n",
" 72 Pool QC 12 non-null object \n",
" 73 Fence 572 non-null object \n",
" 74 Misc Feature 105 non-null object \n",
" 75 Misc Val 2927 non-null int64 \n",
" 76 Mo Sold 2927 non-null int64 \n",
" 77 Yr Sold 2927 non-null int64 \n",
" 78 Sale Type 2927 non-null object \n",
" 79 Sale Condition 2927 non-null object \n",
" 80 SalePrice 2927 non-null int64 \n",
"dtypes: float64(11), int64(27), object(43)\n",
"memory usage: 1.8+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Removing the PID\n",
"\n",
"We already have an index, so we don't need the PID unique identifier for the regression we will perform later on."
]
},
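{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check that could be run before dropping an identifier column, sketched on a small made-up frame. The `toy` frame and its values are illustrative assumptions, not the Ames data. The idea is to confirm that the identifier is unique per row, so it carries no more information than the index, and then drop it by name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical miniature frame standing in for the housing data\n",
"toy = pd.DataFrame({\n",
"    'PID': [101, 102, 103],\n",
"    'Lot Area': [31770, 11622, 14267],\n",
"    'SalePrice': [215000, 105000, 172000],\n",
"})\n",
"\n",
"# An identifier that is unique per row duplicates the role of the index\n",
"print(toy['PID'].is_unique)\n",
"\n",
"# columns= is equivalent to passing the label with axis=1\n",
"toy = toy.drop(columns='PID')\n",
"print(toy.columns.tolist())"
]
},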
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df = df.drop('PID',axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"80"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Observing NaN Features"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Lot Frontage</th>\n",
" <th>Lot Area</th>\n",
" <th>Street</th>\n",
" <th>Alley</th>\n",
" <th>Lot Shape</th>\n",
" <th>Land Contour</th>\n",
" <th>Utilities</th>\n",
" <th>Lot Config</th>\n",
" <th>...</th>\n",
" <th>Pool Area</th>\n",
" <th>Pool QC</th>\n",
" <th>Fence</th>\n",
" <th>Misc Feature</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Yr Sold</th>\n",
" <th>Sale Type</th>\n",
" <th>Sale Condition</th>\n",
" <th>SalePrice</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2922</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2923</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2924</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2925</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2926</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2927 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" MS SubClass MS Zoning Lot Frontage Lot Area Street Alley \\\n",
"0 False False False False False True \n",
"1 False False False False False True \n",
"2 False False False False False True \n",
"3 False False False False False True \n",
"4 False False False False False True \n",
"... ... ... ... ... ... ... \n",
"2922 False False False False False True \n",
"2923 False False True False False True \n",
"2924 False False False False False True \n",
"2925 False False False False False True \n",
"2926 False False False False False True \n",
"\n",
" Lot Shape Land Contour Utilities Lot Config ... Pool Area Pool QC \\\n",
"0 False False False False ... False True \n",
"1 False False False False ... False True \n",
"2 False False False False ... False True \n",
"3 False False False False ... False True \n",
"4 False False False False ... False True \n",
"... ... ... ... ... ... ... ... \n",
"2922 False False False False ... False True \n",
"2923 False False False False ... False True \n",
"2924 False False False False ... False True \n",
"2925 False False False False ... False True \n",
"2926 False False False False ... False True \n",
"\n",
" Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type \\\n",
"0 True True False False False False \n",
"1 False True False False False False \n",
"2 True False False False False False \n",
"3 True True False False False False \n",
"4 False True False False False False \n",
"... ... ... ... ... ... ... \n",
"2922 False True False False False False \n",
"2923 False True False False False False \n",
"2924 False False False False False False \n",
"2925 True True False False False False \n",
"2926 True True False False False False \n",
"\n",
" Sale Condition SalePrice \n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"... ... ... \n",
"2922 False False \n",
"2923 False False \n",
"2924 False False \n",
"2925 False False \n",
"2926 False False \n",
"\n",
"[2927 rows x 80 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isnull()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MS SubClass 0\n",
"MS Zoning 0\n",
"Lot Frontage 490\n",
"Lot Area 0\n",
"Street 0\n",
" ... \n",
"Mo Sold 0\n",
"Yr Sold 0\n",
"Sale Type 0\n",
"Sale Condition 0\n",
"SalePrice 0\n",
"Length: 80, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MS SubClass 0.00000\n",
"MS Zoning 0.00000\n",
"Lot Frontage 16.74069\n",
"Lot Area 0.00000\n",
"Street 0.00000\n",
" ... \n",
"Mo Sold 0.00000\n",
"Yr Sold 0.00000\n",
"Sale Type 0.00000\n",
"Sale Condition 0.00000\n",
"SalePrice 0.00000\n",
"Length: 80, dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"100* df.isnull().sum() / len(df)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def percent_missing(df):\n",
" percent_nan = 100* df.isnull().sum() / len(df)\n",
" percent_nan = percent_nan[percent_nan>0].sort_values()\n",
" return percent_nan"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAE6CAYAAADtBhJMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAA9wElEQVR4nO2debglRZG33+huEGQTpGmRfRsVGRBsFMFRFveFfXXQBhEEEQEdURwdHFw/RxGBAURREQFH2Ww3FhsaRATpptlBQVBAEdoVRFTQ+P6IPN11z62qU1nn3tuX4vc+Tz2nlsyqrKw6UZmREZHm7gghhOgWU5Z0AYQQQow9Eu5CCNFBJNyFEKKDSLgLIUQHkXAXQogOIuEuhBAdZNqSLgDAqquu6uuuu+6SLoYQQjypmD9//m/dfXrZsUkh3Nddd13mzZu3pIshhBBPKszsl1XHpJYRQogOIuEuhBAdRMJdCCE6iIS7EEJ0kIHC3cy+ZGYPmdkthX2rmNmlZnZn+l25cOxoM7vLzH5qZq8er4ILIYSopknL/SvAa/r2vR+Y4+4bAXPSNma2MbA38PyU52QzmzpmpRVCCNGIgcLd3a8Eft+3eyfgjLR+BrBzYf/X3f1v7n4PcBfworEpqhBCiKa01bnPcPcHANLvamn/GsB9hXT3p31CCCEmkLF2YrKSfaWzgZjZQcBBAGuvvfYYF0MIIZ6cPHTSdxulW+2dr6893rbl/qCZrQ6Qfh9K++8H1iqkWxP4ddkJ3P00d5/p7jOnTy/1nhVCCNGStsJ9NjArrc8CvlXYv7eZPc3M1gM2An4yXBGFEELkMlAtY2bnANsCq5rZ/cAxwCeBb5jZAcC9wB4A7n6rmX0DuA14AjjU3f8xTmUXQghRwUDh7u77VBzaoSL9x4CPDVMoIYQQwyEPVSGE6CAS7kII0UEmRTx3IYToKg+e8MOBaWa869/G/LpquQshRAeRcBdCiA4i4S6EEB1Ewl0IITqIhLsQQnQQCXchhOggMoUUQoiGPPjZmxqlm3HkpuNcksGo5S6EEB1Ewl0IITqIhLsQQnQQCXchhOggEu5CCNFBJNyFEKKDSLgLIUQHkXAXQogOIuEuhBAdRMJdCCE6iIS7EEJ0EAl3IYToIBLuQgjRQSTchRCig0i4CyFEB5FwF0KIDiLhLoQQHUTCXQghOoiEuxBCdBAJdyGE6CAS7kII0UEk3IUQooMMJdzN7Egzu9XMbjGzc8xsGTNbxcwuNbM70+/KY1VYIYQQzWgt3M1sDeBdwEx33wSYCuwNvB+Y4+4bAXPSthBCiAlkWLXMNGBZM5sGPB34NbATcEY6fgaw85DXEEIIkcm0thnd/Vdm9mngXuAx4BJ3v8TMZrj7AynNA2a22hiVVQghxpQHPnV/o3SrH7XmOJdk7BlGLbMy0UpfD3g2sJyZ7ZuR/yAzm2dm8xYuXNi2GEIIIUoYRi3zCuAed1/o7o8D5wNbAw+a2eoA6fehsszufpq7z3T3mdOnTx+iGEIIIfoZRrjfC2xlZk83MwN2AG4HZgOzUppZwLeGK6IQQohchtG5X2tm5wLXA08AC4DTgOWBb5jZAcQHYI+xKKgQQojmtBbuAO5+DHBM3+6/Ea14IYQQSwh5qAohRAeRcBdCiA4i4S6EEB1Ewl0IITqIhLsQQnQQCXchhOggEu5CCNFBJNyFEKKDSLgLIUQHkXAXQogOIuEuhBAdRMJdCCE6iIS7EEJ0EAl3IYToIBLuQgjRQSTchRCig0i4CyFEB5FwF0KIDiLhLoQQHUTCXQghOoiEuxBCdBAJdyGE6CAS7kII0UEk3IUQooNIuAshRAeRcBdCiA4i4S6EEB1Ewl0IITqIhLsQQnQQCXchhOggEu5CCNFBJNyFEKKDSLgLIUQHGUq4m9kzzOxcM7vDzG43s5eY2SpmdqmZ3Zl+Vx6rwgohhGjGsC33zwEXuftzgc2A24H3A3PcfSNgTtoWQggxgb
QW7ma2IvAy4HQAd/+7u/8R2Ak4IyU7A9h5uCIKIYTIZZiW+/rAQuDLZrbAzL5oZssBM9z9AYD0u9oYlFMIIUQGwwj3acAWwCnuvjnwKBkqGDM7yMzmmdm8hQsXDlEMIYQQ/Qwj3O8H7nf3a9P2uYSwf9DMVgdIvw+VZXb309x9prvPnD59+hDFEEII0U9r4e7uvwHuM7PnpF07ALcBs4FZad8s4FtDlVAIIUQ204bMfxhwlpktDdwN7E98ML5hZgcA9wJ7DHkNIYQQmQwl3N39BmBmyaEdhjmvEEKI4ZCHqhBCdBAJdyGE6CAS7kII0UEk3IUQooNIuAshRAeRcBdCiA4i4S6EEB1Ewl0IITqIhLsQQnQQCXchhOggEu5CCNFBJNyFEKKDSLgLIUQHkXAXQogOIuEuhBAdRMJdCCE6iIS7EEJ0EAl3IYToIBLuQgjRQYadIFsIISYFt53yYKN0Gx8yY5xLMjlQy10IITqIhLsQQnQQCXchhOggEu5CCNFBJNyFEKKDSLgLIUQHkXAXQogOIuEuhBAdRMJdCCE6iIS7EEJ0EAl3IYToIBLuQgjRQYYW7mY21cwWmNl30vYqZnapmd2ZflcevphCCCFyGIuW++HA7YXt9wNz3H0jYE7aFkIIMYEMJdzNbE3g9cAXC7t3As5I62cAOw9zDSGEEPkM23I/HjgK+Gdh3wx3fwAg/a425DWEEEJk0lq4m9kbgIfcfX7L/AeZ2Twzm7dw4cK2xRBCCFHCMC33bYAdzewXwNeB7c3sa8CDZrY6QPp9qCyzu5/m7jPdfeb06dOHKIYQQoh+Wgt3dz/a3dd093WBvYHL3H1fYDYwKyWbBXxr6FIKIYTIYjzs3D8JvNLM7gRembaFEEJMIGMyQba7zwXmpvXfATuMxXmFEEK0Qx6qQgjRQSTchRCig0i4CyFEB5FwF0KIDiLhLoQQHUTCXQghOoiEuxBCdBAJdyGE6CAS7kII0UEk3IUQooNIuAshRAeRcBdCiA4i4S6EEB1Ewl0IITqIhLsQQnQQCXchhOggEu5CCNFBJNyFEKKDSLgLIUQHkXAXQogOIuEuhBAdRMJdCCE6iIS7EEJ0EAl3IYToIBLuQgjRQSTchRCig0i4CyFEB5FwF0KIDiLhLoQQHUTCXQghOoiEuxBCdBAJdyGE6CAS7kII0UFaC3czW8vMLjez283sVjM7PO1fxcwuNbM70+/KY1dcIYQQTRim5f4E8B53fx6wFXComW0MvB+Y4+4bAXPSthBCiAmktXB39wfc/fq0/ghwO7AGsBNwRkp2BrDzkGUUQgiRyZjo3M1sXWBz4Fpghrs/APEBAFYbi2sIIYRoztDC3cyWB84DjnD3hzPyHWRm88xs3sKFC4cthhBCiAJDCXczW4oQ7Ge5+/lp94Nmtno6vjrwUFledz/N3We6+8zp06cPUwwhhBB9DGMtY8DpwO3uflzh0GxgVlqfBXyrffGEEEK0YdoQebcB3gzcbGY3pH0fAD4JfMPMDgDuBfYYqoRCCCGyaS3c3f0qwCoO79D2vEIIIYZHHqpCCNFBJNyFEKKDDKNzF0KIceNHX21mIr3NW2RtV4Za7kII0UEk3IUQooNIuAshRAeRcBdCiA4i4S6EEB1Ewl0IITqIhLsQQnQQCXchhOggEu5CCNFBJNyFEKKDSLgLIUQHkXAXQogOIuEuhBAdRMJdCCE6iIS7EEJ0EAl3IYToIBLuQgjRQSTchRCig0i4CyFEB5FwF0KIDiLhLoQQHUTCXQghOsi0JV0AIUQ9u5x3RaN0F+z2cgB2P+/GRunP3W2zRevvuuC+RnlO2GUtAE49/8FG6Q/edcai9dnf/O3A9DvusWqj84rBqOUuhBAdRMJdCCE6iIS7EEJ0EAl3IYToIBLuQgjRQWQtI540vO6CTzRK971djgbg9eef3Cj9d3d9x6L1N5z3lUZ5vrPbfpH+3G80S7/7novWdzz3OwPTz979DY3OK0QVarkLIU
QHGTfhbmavMbOfmtldZvb+8bqOEEKI0YyLcDezqcD/Aq8FNgb2MbONx+NaQgghRjNeOvcXAXe5+90AZvZ1YCfgtnG
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Removing Features or Removing Rows\n",
"\n",
"If only a few rows relative to the size of your dataset are missing some values, then it might just be a good idea to drop those rows. What does this cost you in terms of performace? It essentialy removes potential training/testing data, but if its only a few rows, its unlikely to change performance.\n",
"\n",
"Sometimes it is a good idea to remove a feature entirely if it has too many null values. However, you should carefully consider why it has so many null values, in certain situations null could just be used as a separate category. \n",
"\n",
"Take for example a feature column for the number of cars that can fit into a garage. Perhaps if there is no garage then there is a null value, instead of a zero. It probably makes more sense to quickly fill the null values in this case with a zero instead of a null. Only you can decide based off your domain expertise and knowledge of the data set!"
]
},
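{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the garage example above, using a small made-up frame rather than the Ames data. The column names `garage_cars` and `price` and all values are assumptions for illustration only.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy data: NaN here stands for 'no garage', not for an unknown value\n",
"toy = pd.DataFrame({'garage_cars': [2.0, np.nan, 1.0, np.nan],\n",
"                    'price': [200, 120, 150, 110]})\n",
"\n",
"# Option 1: drop the rows, which throws away half of this tiny dataset\n",
"dropped = toy.dropna(subset=['garage_cars'])\n",
"\n",
"# Option 2: encode 'no garage' as zero cars and keep every row\n",
"filled = toy.fillna({'garage_cars': 0})\n",
"\n",
"print(len(dropped), len(filled))\n",
"```\n",
"\n",
"On real data, the choice between the two depends on whether NaN truly means 'none' for that feature, which is worth confirming against the data description."
]
},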
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working based on Rows Missing Data\n",
"\n",
"## Filling in Data or Dropping Data?\n",
"\n",
"Let's explore how to choose to remove or fill in missing data for rows that are missing some data. Let's choose some threshold where we decide it is ok to drop a row if its missing some data (instead of attempting to fill in that missing data point). We will choose 1% as our threshold. This means if less than 1% of the rows are missing this feature, we will consider just dropping that row, instead of dealing with the feature itself. There is no right answer here, just use common sense and your domain knowledge of the dataset, obviously you don't want to drop a very high threshold like 50% , you should also explore correlation to the dataset, maybe it makes sense to drop the feature instead.\n",
"\n",
"Based on the text description of the features, you will see that most of this missing data is actually NaN on purpose as a placeholder for 0 or \"none\"."
]
},
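{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 1% rule can be sketched as a small helper that splits a percent-missing Series (such as `percent_nan`) into two groups. The function name `split_by_threshold` is hypothetical, and the toy percentages are rounded values used only for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def split_by_threshold(percent_nan, threshold=1.0):\n",
"    # Features below the threshold are candidates for row-level handling\n",
"    below = percent_nan[percent_nan < threshold]\n",
"    # Features at or above it need a feature-level decision (fill or drop the column)\n",
"    above = percent_nan[percent_nan >= threshold]\n",
"    return below, above\n",
"\n",
"toy_percent = pd.Series({'Electrical': 0.03, 'Mas Vnr Area': 0.79, 'Lot Frontage': 16.74})\n",
"row_level, feature_level = split_by_threshold(toy_percent)\n",
"\n",
"print(list(row_level.index))\n",
"print(list(feature_level.index))\n",
"```\n",
"\n",
"The threshold is a judgment call, so it is kept as a parameter rather than hard-coded."
]
},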
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example of Filling in Data : Basement Columns"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, 1.0)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAE+CAYAAACdoOtZAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAA+tUlEQVR4nO2de9xnY9X/32tmiJwPQ8I4d5CHaJSO6EgHIhWdUBEhql9UT6eHTs/TSYihkqh4npCmUpRjEpkxjFNqojLJKZSoGNbvj3V9Z/b9vffe333t+3vf99h93q/Xft177++69r724V77uta11rrM3RFCCPH4Z8pkV0AIIcRwkEIXQoiOIIUuhBAdQQpdCCE6ghS6EEJ0BCl0IYToCAMVupmdbGZ3mdn1Fb+bmR1jZgvMbL6ZbTP8agohhBhEkxb6KcBONb/vDGyWlv2BE8ZeLSGEELkMVOjufilwb43IrsCpHlwBrGpm6wyrgkIIIZoxDBv6usBthe2FaZ8QQogJZNoQjmEl+0rzCZjZ/oRZhhVWWOFZT3va04ZwetFlfnv/HY3kNlv1SUn+7oby0xevL7jvL43KbLraGkn+vobyqyX5vzaUX2Xx+u/ue6BRmU1WWynJ/6Oh/PKL12+7/+FGZdZfdVkA7r7/kUby01ddBoD771vUSH7V1Zaoob//pVmZFdeIMv+8u1mdlpu+zOL1R+5odt3LPCmu+5E7m93bZdaOe/vIXX9vJr/WiovXF93V7B2ZttYqzJ079x53n176e6Oj1LMQWL+wvR5we5mgu58EnAQwc+ZMnzNnzhBOL7rMK7/3mUZy5+72IQBedfbxjeR/tPu7F6+/+qxTGpX54ev2Cfkz/6+Z/B5vAGCXM3/YSH72Hq9evL7bWZc0KvO9120PwB5nXdtI/szXbbV4/T3fu61GcgnH7Bb/3rPOvrOR/AG7rw3A7O/e00h+l9evuXj9F6c2+yA//22hz248oVmdNj9w7cXrf/6fhY3KrHP4egDc+aX5jeTXfu+WIX/Mz5vJv+eFi9fvOu5HjcqsdfCrMLM/VP0+DJPLbOBtydtlO+Cv7v7nIRxXCCFEBgNb6GZ2OrADsKaZLQQ+DiwD4O6zgHOBVwILgIeAfcerskIIIaoZqNDdfa8Bvztw0NBqJIQQohWKFBVCiI4ghS6EEB1BCl0IITqCFLoQQnQEKXQhhOgIUuhCCNERpNCFEKIjSKELIURHkEIXQoiOIIUuhBAdQQpdCCE6ghS6EEJ0BCl0IYToCFLoQgjREaTQhRCiI0ihCyFER5BCF0KIjiCFLoQQHUEKXQghOoIUuhBCdAQpdCGE6AhS6EII0RGk0IUQoiNIoQshREeQQhdCiI4ghS6EEB1BCl0IITqCFLoQQnQEKXQhhOgIUuhCCNERpNCFEKIjSKELIURHkEIXQoiOIIUuhBAdQQpdCCE6QiOFbmY7mdnNZrbAzD5Y8vsqZvYDM7vWzG4ws32HX1UhhBB1DFToZjYV+AqwM7A5sJeZbd4ndhBwo7tvBewAfMHMlh1yXYUQQtTQpIX+bGCBu9/i7g8DZwC79sk4sJKZGbAicC+waKg1FUIIUUsThb4ucFthe2HaV+Q44OnA7cB1wKHu/thQaiiEEKIRTRS6lezzvu1XANcATwaeCRxnZiuPOpDZ/mY2x8zm3H333ZlVFUIIUUcThb4QWL+wvR7REi+yL3C2BwuAW4Gn9R/I3U9y95nuPnP69Olt6yyEEKKEJgr9KmAzM9soDXTuCczuk/kj8BIAM1sbeCpwyzArKoQQop5pgwTcfZGZHQycB0wFTnb3G8zsgPT7LOAo4BQzu44w0Rzh7veMY73FUsC+39upkdw3dvsJADufc2gj+R+/9sut6yTEvzMDFTqAu58LnNu3b1Zh/Xbg5cOtmhBCiBwUKSqEEB1BCl0IITqCFLoQQnQEKXQhhOgIUuhCCNERpNCFEKIjSKELIURHkEIXQoiOIIUuhB
AdQQpdCCE6ghS6EEJ0BCl0IYToCFLoQgjREaTQhRCiI0ihCyFER5BCF0KIjiCFLoQQHUEKXQghOoIUuhBCdAQpdCGE6AhS6EII0RGk0IUQoiNIoQshREeQQhdCiI4ghS6EEB1BCl0IITqCFLoQQnQEKXQhhOgIUuhCCNERpNCFEKIjSKELIURHkEIXQoiOIIUuhBAdQQpdCCE6QiOFbmY7mdnNZrbAzD5YIbODmV1jZjeY2SXDraYQQohBTBskYGZTga8ALwMWAleZ2Wx3v7EgsypwPLCTu//RzNYap/oKIYSooEkL/dnAAne/xd0fBs4Adu2TeRNwtrv/EcDd7xpuNYUQQgyiiUJfF7itsL0w7SvyFGA1M7vYzOaa2duGVUEhhBDNGGhyAaxkn5cc51nAS4DlgV+a2RXu/psRBzLbH9gfYMaMGfm1FUIIUUmTFvpCYP3C9nrA7SUyP3H3B939HuBSYKv+A7n7Se4+091nTp8+vW2dhRBClNBEoV8FbGZmG5nZssCewOw+me8DLzSzaWb2ROA5wE3DraoQQog6Bppc3H2RmR0MnAdMBU529xvM7ID0+yx3v8nMfgLMBx4Dvubu149nxYUQQoykiQ0ddz8XOLdv36y+7c8Bnxte1YQQQuSgSFEhhOgIUuhCCNERpNCFEKIjSKELIURHkEIXQoiOIIUuhBAdQQpdCCE6ghS6EEJ0BCl0IYToCFLoQgjREaTQhRCiI0ihCyFER5BCF0KIjiCFLoQQHUEKXQghOoIUuhBCdAQpdCGE6AhS6EII0RGk0IUQoiNIoQshREeQQhdCiI4ghS6EEB1BCl0IITqCFLoQQnQEKXQhhOgIUuhCCNERpNCFEKIjSKELIURHkEIXQoiOIIUuhBAdQQpdCCE6ghS6EEJ0BCl0IYToCFLoQgjREaTQhRCiIzRS6Ga2k5ndbGYLzOyDNXLbmtmjZrbH8KoohBCiCQMVuplNBb4C7AxsDuxlZptXyP03cN6wKymEEGIwTVrozwYWuPst7v4wcAawa4ncIcBZwF1DrJ8QQoiGNFHo6wK3FbYXpn2LMbN1gd2AWcOrmhBCiByaKHQr2ed920cDR7j7o7UHMtvfzOaY2Zy77767YRWFEEI0YVoDmYXA+oXt9YDb+2RmAmeYGcCawCvNbJG7n1MUcveTgJMAZs6c2f9REEIIMQaaKPSrgM3MbCPgT8CewJuKAu6+UW/dzE4BftivzIUQQowvAxW6uy8ys4MJ75WpwMnufoOZHZB+l91cCCGWApq00HH3c4Fz+/aVKnJ332fs1RJCCJGLIkWFEKIjSKELIURHkEIXQoiOIIUuhBAdQQpdCCE6ghS6EEJ0BCl0IYToCFLoQgjREaTQhRCiI0ihCyFER5BCF0KIjiCFLoQQHUEKXQghOoIUuhBCdAQpdCGE6AhS6EII0RGk0IUQoiNIoQshREeQQhdCiI4ghS6EEB1BCl0IITqCFLoQQnQEKXQhhOgIUuhCCNERpNCFEKIjSKELIURHkEIXQoiOIIUuhBAdQQpdCCE6ghS6EEJ0BCl0IYToCFLoQgjREaTQhRCiI0ihCyFER2ik0M1sJzO72cwWmNkHS35/s5nNT8vlZrbV8KsqhBCijoEK3cymAl8BdgY2B/Yys837xG4Ftnf3LYGjgJOGXVEhhBD1NGmhPxtY4O63uPvDwBnArkUBd7/c3e9Lm1cA6w23mkIIIQbRRKGvC9xW2F6Y9lXxDuDHY6mUEEKIfKY1kLGSfV4qaLYjodBfUPH7/sD+ADNmzGhYRSGEEE1o0kJfCKxf2F4PuL1fyMy2BL4G7Orufyk7kLuf5O4z3X3m9OnT29RXCCFEBU0U+lXAZma2kZktC+wJzC4KmNkM4Gzgre7+m+FXUwghxCAGmlzcfZGZHQycB0wFTnb3G8zsgPT7LOBjwBrA8WYGsMjdZ45ftYUQQvTTxIaOu58LnNu3b1Zh/Z3AO4dbNSGEEDkoUlQIITqCFLoQQn
QEKXQhhOgIUuhCCNERpNCFEKIjSKELIURHkEIXQoiOIIUuhBAdQQpdCCE6ghS6EEJ0BCl0IYToCFLoQgjREaTQhRC
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);\n",
"\n",
"# Set 1% Threshold\n",
"plt.ylim(0,1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's drop or fill the rows based on this data. You could either manually fill in the data (especially the Basement data based on the description text file) OR you could simply drop the row and not consider it. Watch the video for a full explanation of this, in reality it probably makes more sense to fill in the Missing Basement data since its well described in the text description."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Electrical 0.034165\n",
"Garage Area 0.034165\n",
"Total Bsmt SF 0.034165\n",
"Bsmt Unf SF 0.034165\n",
"BsmtFin SF 1 0.034165\n",
"BsmtFin SF 2 0.034165\n",
"Garage Cars 0.034165\n",
"Bsmt Full Bath 0.068329\n",
"Bsmt Half Bath 0.068329\n",
"Mas Vnr Area 0.785787\n",
"Mas Vnr Type 0.785787\n",
"dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Could also imply we should ex\n",
"percent_nan[percent_nan < 1]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0341646737273659"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"100/len(df)"
]
},
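{
"cell_type": "markdown",
"metadata": {},
"source": [
"The percentages listed above can be converted back into row counts, which makes it easier to see how little data is affected. This sketch assumes the 2927 rows reported earlier and hard-codes three of the percentages shown above; the variable names are illustrative.\n",
"\n",
"```python\n",
"n_rows = 2927  # row count reported in the df.isnull() output\n",
"\n",
"# Percentage contributed by a single missing row\n",
"one_row_pct = 100 / n_rows\n",
"\n",
"# Convert percent-missing values back into numbers of rows\n",
"rows_missing = {pct: round(pct * n_rows / 100)\n",
"                for pct in (0.034165, 0.068329, 0.785787)}\n",
"\n",
"print(one_row_pct)\n",
"print(rows_missing)\n",
"```\n",
"\n",
"Under these assumptions the three percentages correspond to 1, 2 and 23 rows respectively."
]
},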
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Lot Frontage</th>\n",
" <th>Lot Area</th>\n",
" <th>Street</th>\n",
" <th>Alley</th>\n",
" <th>Lot Shape</th>\n",
" <th>Land Contour</th>\n",
" <th>Utilities</th>\n",
" <th>Lot Config</th>\n",
" <th>...</th>\n",
" <th>Pool Area</th>\n",
" <th>Pool QC</th>\n",
" <th>Fence</th>\n",
" <th>Misc Feature</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Yr Sold</th>\n",
" <th>Sale Type</th>\n",
" <th>Sale Condition</th>\n",
" <th>SalePrice</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1341</th>\n",
" <td>20</td>\n",
" <td>RM</td>\n",
" <td>99.0</td>\n",
" <td>5940</td>\n",
" <td>Pave</td>\n",
" <td>NaN</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>FR3</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>MnPrv</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>2008</td>\n",
" <td>ConLD</td>\n",
" <td>Abnorml</td>\n",
" <td>79000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape \\\n",
"1341 20 RM 99.0 5940 Pave NaN IR1 \n",
"\n",
" Land Contour Utilities Lot Config ... Pool Area Pool QC Fence \\\n",
"1341 Lvl AllPub FR3 ... 0 NaN MnPrv \n",
"\n",
" Misc Feature Misc Val Mo Sold Yr Sold Sale Type Sale Condition \\\n",
"1341 NaN 0 4 2008 ConLD Abnorml \n",
"\n",
" SalePrice \n",
"1341 79000 \n",
"\n",
"[1 rows x 80 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['Total Bsmt SF'].isnull()]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Lot Frontage</th>\n",
" <th>Lot Area</th>\n",
" <th>Street</th>\n",
" <th>Alley</th>\n",
" <th>Lot Shape</th>\n",
" <th>Land Contour</th>\n",
" <th>Utilities</th>\n",
" <th>Lot Config</th>\n",
" <th>...</th>\n",
" <th>Pool Area</th>\n",
" <th>Pool QC</th>\n",
" <th>Fence</th>\n",
" <th>Misc Feature</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Yr Sold</th>\n",
" <th>Sale Type</th>\n",
" <th>Sale Condition</th>\n",
" <th>SalePrice</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1341</th>\n",
" <td>20</td>\n",
" <td>RM</td>\n",
" <td>99.0</td>\n",
" <td>5940</td>\n",
" <td>Pave</td>\n",
" <td>NaN</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>FR3</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>MnPrv</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>2008</td>\n",
" <td>ConLD</td>\n",
" <td>Abnorml</td>\n",
" <td>79000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1497</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>123.0</td>\n",
" <td>47007</td>\n",
" <td>Pave</td>\n",
" <td>NaN</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>2008</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>284700</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape \\\n",
"1341 20 RM 99.0 5940 Pave NaN IR1 \n",
"1497 20 RL 123.0 47007 Pave NaN IR1 \n",
"\n",
" Land Contour Utilities Lot Config ... Pool Area Pool QC Fence \\\n",
"1341 Lvl AllPub FR3 ... 0 NaN MnPrv \n",
"1497 Lvl AllPub Inside ... 0 NaN NaN \n",
"\n",
" Misc Feature Misc Val Mo Sold Yr Sold Sale Type Sale Condition \\\n",
"1341 NaN 0 4 2008 ConLD Abnorml \n",
"1497 NaN 0 7 2008 WD Normal \n",
"\n",
" SalePrice \n",
"1341 79000 \n",
"1497 284700 \n",
"\n",
"[2 rows x 80 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['Bsmt Half Bath'].isnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Filling in data based on column names. There are 2 types of basement features, numerical and string descriptives.**\n",
"\n",
"The numerical basement columns:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"bsmt_num_cols = ['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF','Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath']\n",
"df[bsmt_num_cols] = df[bsmt_num_cols].fillna(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The string basement columns:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"bsmt_str_cols = ['Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2']\n",
"df[bsmt_str_cols] = df[bsmt_str_cols].fillna('None')"
]
},
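{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to listing the columns by hand, the same fill could be driven by dtype. This is only a sketch on a toy frame: selecting columns with `startswith('Bsmt')` is an assumption, and on the real data it would miss 'Total Bsmt SF', which would need to be added explicitly.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'Bsmt Unf SF': [120.0, np.nan],\n",
"                    'Bsmt Full Bath': [1.0, np.nan],\n",
"                    'Bsmt Qual': ['TA', np.nan],\n",
"                    'Lot Frontage': [60.0, np.nan]})\n",
"\n",
"# Pick the basement columns by name, then split them by dtype\n",
"bsmt_cols = [c for c in toy.columns if c.startswith('Bsmt')]\n",
"num_cols = toy[bsmt_cols].select_dtypes(include='number').columns\n",
"str_cols = toy[bsmt_cols].select_dtypes(exclude='number').columns\n",
"\n",
"toy[num_cols] = toy[num_cols].fillna(0)       # no basement -> 0 sq ft, 0 baths\n",
"toy[str_cols] = toy[str_cols].fillna('None')  # no basement -> 'None' category\n",
"\n",
"print(toy.isnull().sum())\n",
"```\n",
"\n",
"Only the basement columns are filled here; 'Lot Frontage' keeps its missing value because it was not selected."
]
},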
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEzCAYAAADKCUOEAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAArr0lEQVR4nO3de7zlc73H8dfbDI1oRMak3KMkEY1CnQrddHEJhdJUIh2R6kSpc5Qup/tNJ5qSphsJoZtoGCWhubmFMw5CiUmSFMLn/PH9rtlrr9l7Ztbv+/vNXvs37+fjsR5r/X5rr8/+rrXX/vy+v+/ve1FEYGZm7bLKWBfAzMzq5+RuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQhPHugAA6667bmyyySZjXQwzs3Fl7ty5f46IKSM9NxDJfZNNNmHOnDljXQwzs3FF0u9He87NMmZmLeTkbmbWQk7uZmYt5ORuZtZCy0zukr4h6S5J13TtW0fSBZIW5vu1u557v6QbJd0g6WVNFdzMzEa3PDX3bwIv79n3PmBWRGwBzMrbSNoK2B94Rn7NVyRNqK20Zma2XJaZ3CPil8BfenbvCczMj2cCe3XtPy0iHoyIm4EbgefUU1QzM1teVdvcp0bEHQD5fr28/8nAbV0/d3vetwRJh0qaI2nOokWLKhbDzMxGUvcgJo2wb8TVQCJiBjADYNq0aV4xxMxWCnedcFFxjPWO2GWZP1O15n6npPUB8v1def/twIZdP7cB8MeKv8PMzCqqmtzPBabnx9OBc7r27y/pMZI2BbYArigropmZ9WuZzTKSTgVeBKwr6XbgOOATwOmSDgZuBfYDiIhrJZ0O/A54GDg8Ih5pqOxmZjaKZSb3iDhglKd2G+XnPwZ8rKRQZmZWxiNUzcxayMndzKyFBmI+dzOzQfSnz15fHOOJ79myhpL0zzV3M7MWcnI3M2shJ3czsxZycjczayEndzOzFnJvGTNrhWtPurM4xjMOm1pDSQaDa+5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYtVJTcJb1L0rWSrpF0qqRJktaRdIGkhfl+7boKa2Zmy6dycpf0ZOBIYFpEbA1MAPYH3gfMiogtgFl528zMVqDSZpmJwOqSJgKPBf4I7AnMzM/PBPYq/B1mZtaniVVfGBF/kPQZ4Fbgn8D5EXG+pKkRcUf+mTskrVdTWc2sRX72/T8XvX73161bU0naqaRZZm1SLX1T4EnAGpLe0MfrD5U0R9KcRYsWVS2GmZmNoKRZ5sXAzRGxKCL+BZwF7AzcKWl9gHx/10gvjogZETEtIqZNmTKloBhmZtarJLnfCuwo6bGSBOwGXAecC0zPPzMdOKesiGZm1q+SNvfLJZ0BzAMeBuYDM4A1gdMlHUw6AOxXR0HNzGz5VU7uABFxHHBcz+4HSbV4MzMbIx6hambWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCRcld0uMlnSHpeknXSdpJ0jqSLpC0MN+vXVdhzcxs+ZTW3L8InBcRWwLbAtcB7wNmRcQWwKy8bWZmK1Dl5C5pMvAC4GSAiHgoIv4K7AnMzD82E9irrIhmZtavkpr7ZsAi4BRJ8yV9XdIawNSIuAMg369XQznNzKwPJcl9IrA9cGJEbAfcTx
9NMJIOlTRH0pxFixYVFMPMzHqVJPfbgdsj4vK8fQYp2d8paX2AfH/XSC+OiBkRMS0ipk2ZMqWgGGZm1qtyco+IPwG3SXpa3rUb8DvgXGB63jcdOKeohGZm1reJha8/AviupNWAm4A3kw4Yp0s6GLgV2K/wd5iZWZ+KkntELACmjfDUbiVxzcysjEeompm1kJO7mVkLObmbmbWQk7uZWQs5uZuZtZCTu5lZCzm5m5m1kJO7mVkLObmbmbWQk7uZWQs5uZuZtZCTu5lZCzm5m5m1kJO7mVkLObmbmbWQk7uZWQs5uZuZtZCTu5lZC5WuoWpmK4FP/vCOotcfs/f6NZXElpdr7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLeTkbmbWQk7uZmYt5ORuZtZCTu5mZi3k5G5m1kJO7mZmLVSc3CVNkDRf0o/z9jqSLpC0MN+vXV5MMzPrRx0193cC13Vtvw+YFRFbALPytpmZrUBFyV3SBsArga937d4TmJkfzwT2KvkdZmbWv9Ka+xeAo4FHu/ZNjYg7APL9eoW/w8zM+lQ5uUt6FXBXRMyt+PpDJc2RNGfRokVVi2FmZiMoqbk/D9hD0i3AacCukr4D3ClpfYB8f9dIL46IGRExLSKmTZkypaAYZmbWq3Jyj4j3R8QGEbEJsD9wYUS8ATgXmJ5/bDpwTnEpzcysL030c/8E8BJJC4GX5G0zM1uBJtYRJCJmA7Pz47uB3eqIa2Zm1XiEqplZCzm5m5m1kJO7mVkLObmbmbWQk7uZWQs5uZuZtZCTu5lZCzm5m5m1kJO7mVkLObmbmbWQk7uZWQs5uZuZtZCTu5lZCzm5m5m1kJO7mVkLObmbmbWQk7uZWQs5uZuZtZCTu5lZCzm5m5m1kJO7mVkLObmbmbWQk7uZWQs5uZuZtZCTu5lZCzm5m5m1kJO7mVkLObmbmbWQk7uZWQs5uZuZtZCTu5lZCzm5m5m1kJO7mVkLVU7ukjaUdJGk6yRdK+mdef86ki6QtDDfr11fcc3MbHmU1NwfBt4TEU8HdgQOl7QV8D5gVkRsAczK22ZmtgJVTu4RcUdEzMuP7wOuA54M7AnMzD82E9irsIxmZtanWtrcJW0CbAdcDkyNiDsgHQCA9er4HWZmtvyKk7ukNYEzgaMi4m99vO5QSXMkzVm0aFFpMczMrEtRcpe0Kimxfzcizsq775S0fn5+feCukV4bETMiYlpETJsyZUpJMczMrEdJbxkBJwPXRcTnup46F5ieH08HzqlePDMzq2JiwWufBxwEXC1pQd53LPAJ4HRJBwO3AvsVldDMzPpWOblHxCWARnl6t6pxzcysnEeompm1UEmzjJkNoH3PnFf0+jP22b6mkthYcs3dzKyFnNzNzFrIyd3MrIWc3M3MWsjJ3cyshZzczcxayMndzKyFnNzNzFrIyd3MrIWc3M3MWsjJ3cyshZzczcxayMndzKyFnNzNzFrIyd3MrIWc3M3MWsjJ3cyshZzczcxayMndzKyFnNzNzFrIyd3MrIWc3M3MWsjJ3cyshSaOdQFsfDnmjJcXx/jkvucN237F2e8pjvnTvT47bPuVZ51QHPMnrzli2Parzvhuccwf7/v6Ydt7nvGzonjn7Lt70eutvVxzNzNrISd3M7MWcnI3M2shJ3czsxZycjczayEndzOzFnJyNzNrocaSu6SXS7pB0o2S3tfU7zEzsyU1ktwlTQD+B9gd2Ao4QNJWTfwuMzNbUlMjVJ8D3BgRNwFIOg3YE/hdQ79v3DvnG+UjDfd8y/DRjl/99suKY77toJ8XxzCzFU8RUX9QaV/g5RHx1rx9EPDciHhH188cChyaN58G3LCc4dcF/lxjcVfmmOOhjI7pmI45uo0jYspITzRVc9cI+4YdRSJiBjCj78DSnIiYVrVgjtlcPMd0TMccnJhNXVC9Hdiwa3sD4I8N/S4zM+vRVHL/LbCFpE0lrQbsD5zb0O8yM7MejT
TLRMTDkt4B/ByYAHwjIq6tKXzfTTmOucLiOaZjOuaAxGzkgqqZmY0tj1A1M2shJ3czsxZycreBI2m/5dlnZqNbKdv
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dropping Rows\n",
"\n",
"A few of these features appear that it is just one or two rows missing the data. Based on our description .txt file of the dataset, we could also fill in these data points easily, and that is the more correct approach, but here we show how to drop in case you find yourself in a situation where it makes more sense to drop a row, based on missing column features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" df.dropna() ---\n",
" subset : array-like, optional\n",
" Labels along other axis to consider, e.g. if you are dropping rows\n",
" these would be a list of columns to include."
]
},
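{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `subset` changes, here is a small sketch on made-up data (the values are assumptions; the column names simply mirror the ones in this dataset).\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'Electrical': ['SBrkr', np.nan, 'FuseA'],\n",
"                    'Garage Cars': [2.0, 1.0, 1.0],\n",
"                    'Pool QC': [np.nan, np.nan, np.nan]})\n",
"\n",
"# Plain dropna() removes every row, because 'Pool QC' is always missing\n",
"print(len(toy.dropna()))\n",
"\n",
"# subset limits the check to the listed columns, so only row 1 is dropped\n",
"kept = toy.dropna(axis=0, subset=['Electrical', 'Garage Cars'])\n",
"print(kept.index.tolist())\n",
"```\n",
"\n",
"Without `subset`, a single mostly-empty column would be enough to wipe out the whole frame, which is why the columns are named explicitly."
]
},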
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"df = df.dropna(axis=0,subset= ['Electrical','Garage Cars'])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, 1.0)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAE3CAYAAAC6r7qRAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAn5ElEQVR4nO3deZxcVZn/8c+XBATDOhBRgSBChInIGhaVUUBEwBHEuBBGZBNEiYo6gjLzGxCXcXAZBJRMlMUdR1kGNQOoEJAJWwJhE5EMqARUwp4BBQLP749zK6lUV3dX0ufeVB+/79erX133VvV9bnV1P/ecc8+iiMDMzEa/VVb2CZiZWR5O6GZmhXBCNzMrhBO6mVkhnNDNzArhhG5mVohhE7qkcyQ9KOn2QZ6XpNMlzZd0q6Qd8p+mmZkNp5cS+nnAPkM8vy8wsfo6Gjhr5KdlZmbLa9iEHhFXA48M8ZIDgG9Fch2wrqSX5DpBMzPrTY429I2A+9q2F1T7zMysQWMzHENd9nWdT0DS0aRmGcaNG7fjVlttlSF8f1n00N3Zj7nWBhMH7Hv44d9kj7P++q8YsO/+R/O+n43WG/heAO5+7P6scSauO7BMcfejD2WNATBxvQ0G7Jv/6GPZ42yx3roD9v3vo09mjbH5euO67v/TY89mjbPhuqsO2LfokcVZYwCs9TcD09uzf8z7XgBWffHA9/Psg/+XN8aL1lzyeO7cuQ9FxPhur8uR0BcAm7Rtbww80O2FETEDmAEwefLkmDNnTobw/eXKb7w5+zH3eO9PB+w775t7Z49z2KGXD9j3Tz8c6vbJ8vvsOy7tun+/i0/MGmfmWz83YN+bL/h61hgAP51y1IB9b/nRRdnj/PjtBw7YN+WCG7LGuGDKzl33f/miP2aN89EDXzxg36zvLMwaA2D3dw/MeX849Q/Z47zk+IEtzH86/ZqsMTb80G5LHkv63WCvy9Hkcgnwnqq3y67A4xGR/7dmZmZDGraELun7wO7ABpIWACcBqwJExHRgJrAfMB94Cji8rpM1M7PBDZvQI2LqMM8HcGy2MzIzsxXikaJmZoVwQjczK4QTuplZIZzQzcwK4YRuZlYIJ3Qzs0I4oZuZFcIJ3cysEE7oZmaFcEI3MyuEE7qZWSGc0M3MCuGEbmZWCCd0M7NCOKGbmRXCCd3MrBBO6GZmhXBCNzMrhBO6mVkhnNDNzArhhG5mVggndDOzQjihm5kVwgndzKwQTuhmZoVwQjczK4QTuplZIZzQzcwK4YRuZlYIJ3Qzs0I4oZuZFcIJ3cysEE7oZmaFcEI3MyuEE7qZWSF6SuiS9pF0l6T5kj7R5fl1JP1Y0i2S7pB0eP5TNTOzoQyb0CWNAb4K7AtMAqZKmtTxsmOBX0XEtsDuwJckrZb5XM3MbAi9lNB3BuZHxD0R8QxwPnBAx2sCWEuSgDWBR4DFWc/UzMyG1EtC3wi4r217QbWv3ZnA3wIPALcBH46I5zsPJOloSXMkzVm4cOEKnrKZmXXTS0JXl33Rsf0mYB7wUmA74ExJaw/4oYgZETE5IiaPHz9+OU/VzMyG0ktCXwBs0ra9Makk3u5w4MJI5gP3AlvlOUUzM+vF2B5ecyMwUdJmwP3AQcDBHa/5PfAG4JeSNgS2BO7p9SQWnvWdXl/ak/Hvf3fX/QvOPCJrnI2nnZP1eGZmIzFsQo+IxZKmAZcBY4BzIuIOScdUz08HPg2cJ+k2UhPNCRHxUI3nbWZmHXopoRMRM4GZHfumtz1+ANg776mZmdny8EhRM7NCOKGbmRXCCd3MrBBO6GZmhXBCNzMrhBO6mVkhnNDNzArhhG5mVggndDOzQjihm5kVwgndzKwQTuhmZoVwQjczK4QTuplZIZzQzcwK4YRuZlYIJ3Qzs0I4oZuZFcIJ3cysEE7oZmaFcEI3MyuEE7qZWSGc0M3MCuGEbmZWCCd0M7NCOKGbmRXCCd3MrBBO6GZmhXBCNzMrhBO6mV
khnNDNzArhhG5mVggndDOzQjihm5kVoqeELmkfSXdJmi/pE4O8ZndJ8yTdIemqvKdpZmbDGTvcCySNAb4KvBFYANwo6ZKI+FXba9YFvgbsExG/l/Sims7XzMwG0UsJfWdgfkTcExHPAOcDB3S85mDgwoj4PUBEPJj3NM3MbDi9JPSNgPvathdU+9q9AlhP0ixJcyW9J9cJmplZb4ZtcgHUZV90Oc6OwBuANYBrJV0XEb9Z5kDS0cDRABMmTFj+szUzs0H1UkJfAGzStr0x8ECX11waEU9GxEPA1cC2nQeKiBkRMTkiJo8fP35Fz9nMzLroJaHfCEyUtJmk1YCDgEs6XvNfwN9JGivphcAuwJ15T9XMzIYybJNLRCyWNA24DBgDnBMRd0g6pnp+ekTcKelS4FbgeeAbEXF7nSduZmbL6qUNnYiYCczs2De9Y/sLwBfynZqZmS0PjxQ1MyuEE7qZWSGc0M3MCuGEbmZWCCd0M7NCOKGbmRXCCd3MrBBO6GZmhXBCNzMrhBO6mVkhnNDNzArhhG5mVggndDOzQjihm5kVwgndzKwQTuhmZoVwQjczK4QTuplZIZzQzcwK4YRuZlYIJ3Qzs0I4oZuZFcIJ3cysEE7oZmaFcEI3MyuEE7qZWSGc0M3MCuGEbmZWCCd0M7NCOKGbmRXCCd3MrBBO6GZmhXBCNzMrhBO6mVkhnNDNzArRU0KXtI+kuyTNl/SJIV63k6TnJL093ymamVkvhk3oksYAXwX2BSYBUyVNGuR1/wZclvskzcxseL2U0HcG5kfEPRHxDHA+cECX130QuAB4MOP5mZlZj3pJ6BsB97VtL6j2LSFpI+BAYPpQB5J0tKQ5kuYsXLhwec/VzMyG0EtCV5d90bF9GnBCRDw31IEiYkZETI6IyePHj+/xFM3MrBdje3jNAmCTtu2NgQc6XjMZOF8SwAbAfpIWR8TFOU7SzMyG10tCvxGYKGkz4H7gIODg9hdExGatx5LOA37iZG5m1qxhE3pELJY0jdR7ZQxwTkTcIemY6vkh283NzKwZvZTQiYiZwMyOfV0TeUQcNvLTMjOz5eWRomZmhXBCNzMrhBO6mVkhnNDNzArhhG5mVggndDOzQjihm5kVwgndzKwQTuhmZoVwQjczK4QTuplZIZzQzcwK4YRuZlYIJ3Qzs0I4oZuZFcIJ3cysEE7oZmaFcEI3MyuEE7qZWSGc0M3MCuGEbmZWCCd0M7NCOKGbmRXCCd3MrBBO6GZmhXBCNzMrhBO6mVkhnNDNzArhhG5mVggndDOzQjihm5kVwgndzKwQTuhmZoVwQjczK0RPCV3SPpLukjRf0ie6PP8Pkm6tvmZL2jb/qZqZ2VCGTeiSxgBfBfYFJgFTJU3qeNm9wOsjYhvg08CM3CdqZmZD66WEvjMwPyLuiYhngPOBA9pfEBGzI+LRavM6YOO8p2lmZsPpJaFvBNzXtr2g2jeYI4H/HslJmZnZ8hvbw2vUZV90faG0Bymh7zbI80cDRwNMmDChx1M0M7Ne9FJCXwBs0ra9MfBA54skbQN8AzggIh7udqCImBERkyNi8vjx41fkfM3MbBC9JPQbgYmSNpO0GnAQcEn7CyRNAC4EDomI3+Q/TTMzG86wTS4RsVjSNOAyYAxwTkTcIemY6vnpwL8A6wNfkwSwOCIm13faZmbWqZc2dCJiJjCzY9/0tsfvBd6b99TMzGx5eKSomVkhnNDNzArhhG5mVggndDOzQjihm5kVwgndzKwQTuhmZoVwQjczK4QTuplZIZzQzcwK4YRuZlYIJ3Qzs0I4oZuZFcIJ3cysEE7oZmaFcEI3MyuEE7qZWSGc0M3MCuGEbmZWCCd0M7NCOKGbmRXCCd3MrBBO6GZmhXBCNzMrhBO6mVkhnNDNzArhhG5mVggndDOzQjihm5kVwgndzKwQTuhmZoVwQjczK4QTuplZIZzQzcwK4YRuZlaInhK6pH0k3SVpvqRPdHlekk6vnr9V0g75T9
XMzIYybEKXNAb4KrAvMAmYKmlSx8v2BSZWX0cDZ2U+TzMzG0YvJfSdgfkRcU9EPAOcDxzQ8ZoDgG9Fch2wrqSXZD5
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);\n",
"plt.ylim(0,1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mas Vnr Feature "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the Description Text File, Mas Vnr Type and Mas Vnr Area being missing (NaN) is likely to mean the house simply just doesn't have a masonry veneer, in which case, we will fill in this data as we did before."
]
},
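{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before filling, it may be worth checking that the two veneer columns are missing together, since that is what 'no veneer' would look like. A sketch on toy data (all values are made up):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'Mas Vnr Type': ['BrkFace', np.nan, np.nan],\n",
"                    'Mas Vnr Area': [196.0, np.nan, 50.0]})\n",
"\n",
"type_missing = toy['Mas Vnr Type'].isnull()\n",
"area_missing = toy['Mas Vnr Area'].isnull()\n",
"\n",
"# Both missing: consistent with a house that has no veneer\n",
"both_missing = type_missing & area_missing\n",
"\n",
"# Exactly one missing: worth a closer look before filling with 'None' or 0\n",
"only_one = type_missing ^ area_missing\n",
"\n",
"print(both_missing.tolist())\n",
"print(only_one.tolist())\n",
"```\n",
"\n",
"If `only_one` flags any rows on the real data, a blanket fill may hide an inconsistency rather than fix it."
]
},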
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"df[\"Mas Vnr Type\"] = df[\"Mas Vnr Type\"].fillna(\"None\")\n",
"df[\"Mas Vnr Area\"] = df[\"Mas Vnr Area\"].fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Filling In Missing Column Data\n",
"\n",
"Our previous approaches focused on rows with missing data. Now we will take an approach based on the column features themselves, since larger percentages of the data appear to be missing in these features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Garage Columns\n",
"\n",
"According to the data description, NaN in these columns indicates that the house has no garage, so we will substitute \"None\" for the string columns and 0 for the numeric column (Garage Yr Blt)."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Garage Type</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Garage Cond</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Attchd</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Attchd</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Attchd</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Attchd</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Attchd</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2922</th>\n",
" <td>Detchd</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2923</th>\n",
" <td>Attchd</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2924</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2925</th>\n",
" <td>Attchd</td>\n",
" <td>RFn</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2926</th>\n",
" <td>Attchd</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2925 rows × 4 columns</p>\n",
"</div>"
],
"text/plain": [
" Garage Type Garage Finish Garage Qual Garage Cond\n",
"0 Attchd Fin TA TA\n",
"1 Attchd Unf TA TA\n",
"2 Attchd Unf TA TA\n",
"3 Attchd Fin TA TA\n",
"4 Attchd Fin TA TA\n",
"... ... ... ... ...\n",
"2922 Detchd Unf TA TA\n",
"2923 Attchd Unf TA TA\n",
"2924 NaN NaN NaN NaN\n",
"2925 Attchd RFn TA TA\n",
"2926 Attchd Fin TA TA\n",
"\n",
"[2925 rows x 4 columns]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond']]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"gar_str_cols = ['Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond']\n",
"df[gar_str_cols] = df[gar_str_cols].fillna('None')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"df['Garage Yr Blt'] = df['Garage Yr Blt'].fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dropping Feature Columns\n",
"\n",
"Sometimes you may decide that, above a certain percentage of missing values, you will simply remove the feature from the data set. For example, if 99% of rows are missing a feature, it is unlikely to be predictive, since almost none of the rows have a value for it. In our particular data set, many of these high-percentage NaN features are actually placeholders for \"none\" or 0. But for the sake of showing variations on dealing with missing data, we will remove these features instead of filling them in with the appropriate value."
]
},
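{
"cell_type": "markdown",
"metadata": {},
"source": [
"The threshold idea described above can be sketched as a small helper. This is only a minimal illustration under stated assumptions, not part of the original workflow: the function name `drop_sparse_columns`, the `max_missing_pct` parameter, the 50% cutoff, and the toy values are all hypothetical.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def drop_sparse_columns(frame, max_missing_pct=50.0):\n",
"    # Hypothetical helper: return a copy of `frame` without the columns\n",
"    # whose percentage of NaN values is strictly above the cutoff.\n",
"    missing_pct = 100 * frame.isnull().mean()\n",
"    to_drop = missing_pct[missing_pct > max_missing_pct].index\n",
"    return frame.drop(columns=to_drop)\n",
"\n",
"\n",
"# Toy data (made up for illustration, not taken from the Ames file).\n",
"toy = pd.DataFrame({\n",
"    'Pool QC': [np.nan, np.nan, np.nan, 'Gd'],      # 75% missing\n",
"    'Lot Frontage': [141.0, 80.0, np.nan, 93.0],    # 25% missing\n",
"    'SalePrice': [215000, 105000, 172000, 244000],  # 0% missing\n",
"})\n",
"\n",
"trimmed = drop_sparse_columns(toy, max_missing_pct=50.0)\n",
"print(list(trimmed.columns))\n",
"```\n",
"\n",
"With a 50% cutoff only the mostly empty column is removed, and the original frame is left unchanged because `drop` returns a new DataFrame. In this notebook we instead drop a hand-picked list of columns."
]
},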
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Lot Frontage', 'Fireplace Qu', 'Fence', 'Alley', 'Misc Feature',\n",
" 'Pool QC'],\n",
" dtype='object')"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"percent_nan.index"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Lot Frontage</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Fence</th>\n",
" <th>Alley</th>\n",
" <th>Misc Feature</th>\n",
" <th>Pool QC</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>141.0</td>\n",
" <td>Gd</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80.0</td>\n",
" <td>NaN</td>\n",
" <td>MnPrv</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>81.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Gar2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>93.0</td>\n",
" <td>TA</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>74.0</td>\n",
" <td>TA</td>\n",
" <td>MnPrv</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2922</th>\n",
" <td>37.0</td>\n",
" <td>NaN</td>\n",
" <td>GdPrv</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2923</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>MnPrv</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2924</th>\n",
" <td>62.0</td>\n",
" <td>NaN</td>\n",
" <td>MnPrv</td>\n",
" <td>NaN</td>\n",
" <td>Shed</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2925</th>\n",
" <td>77.0</td>\n",
" <td>TA</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2926</th>\n",
" <td>74.0</td>\n",
" <td>TA</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2925 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" Lot Frontage Fireplace Qu Fence Alley Misc Feature Pool QC\n",
"0 141.0 Gd NaN NaN NaN NaN\n",
"1 80.0 NaN MnPrv NaN NaN NaN\n",
"2 81.0 NaN NaN NaN Gar2 NaN\n",
"3 93.0 TA NaN NaN NaN NaN\n",
"4 74.0 TA MnPrv NaN NaN NaN\n",
"... ... ... ... ... ... ...\n",
"2922 37.0 NaN GdPrv NaN NaN NaN\n",
"2923 NaN NaN MnPrv NaN NaN NaN\n",
"2924 62.0 NaN MnPrv NaN Shed NaN\n",
"2925 77.0 TA NaN NaN NaN NaN\n",
"2926 74.0 TA NaN NaN NaN NaN\n",
"\n",
"[2925 rows x 6 columns]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['Lot Frontage', 'Fireplace Qu', 'Fence', 'Alley', 'Misc Feature','Pool QC']]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"df = df.drop(['Pool QC','Misc Feature','Alley','Fence'],axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filling in Fireplace Quality based on Description Text"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"df['Fireplace Qu'] = df['Fireplace Qu'].fillna(\"None\")"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# [Imputation](https://en.wikipedia.org/wiki/Imputation_(statistics)) of Missing Data\n",
"\n",
"To impute missing data, we need to decide which other fully populated feature (one with no NaN values) is most likely to be correlated with the feature that has missing values. In this particular case we will use:\n",
"\n",
"Neighborhood: Physical locations within Ames city limits\n",
"\n",
"LotFrontage: Linear feet of street connected to property\n",
"\n",
"We will operate under the assumption that Lot Frontage is related to the neighborhood a house is in."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['NAmes', 'Gilbert', 'StoneBr', 'NWAmes', 'Somerst', 'BrDale',\n",
" 'NPkVill', 'NridgHt', 'Blmngtn', 'NoRidge', 'SawyerW', 'Sawyer',\n",
" 'Greens', 'BrkSide', 'OldTown', 'IDOTRR', 'ClearCr', 'SWISU',\n",
" 'Edwards', 'CollgCr', 'Crawfor', 'Blueste', 'Mitchel', 'Timber',\n",
" 'MeadowV', 'Veenker', 'GrnHill', 'Landmrk'], dtype=object)"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Neighborhood'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='Lot Frontage', ylabel='Neighborhood'>"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 576x864 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(8,12))\n",
"sns.boxplot(x='Lot Frontage',y='Neighborhood',data=df,orient='h')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Impute Missing Data based on other Features\n",
"\n",
"More complex methods exist, but simpler is usually better here: it avoids building models on top of other models.\n",
"\n",
"More Info on Options: https://scikit-learn.org/stable/modules/impute.html"
]
},
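{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a point of comparison with the options linked above, here is a minimal sketch of scikit-learn's `SimpleImputer` on a hypothetical toy column (the `toy` DataFrame is an assumption for illustration, not the Ames data). Note that it fills with the mean of the whole column, whereas this notebook goes on to fill with a per-neighborhood mean.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# Hypothetical toy column, not the Ames data\n",
"toy = pd.DataFrame({'Lot Frontage': [60.0, np.nan, 80.0]})\n",
"\n",
"# strategy='mean' replaces every NaN with the mean of the whole column\n",
"imputer = SimpleImputer(strategy='mean')\n",
"\n",
"# fit_transform returns a 2D array, so flatten it before assigning back\n",
"toy['Lot Frontage'] = imputer.fit_transform(toy[['Lot Frontage']]).ravel()\n",
"```\n",
"\n",
"In this toy case the missing value becomes 70.0, the mean of the two observed values."
]
},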
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000211F18E23C8>"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Neighborhood')['Lot Frontage']"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Neighborhood\n",
"Blmngtn 46.900000\n",
"Blueste 27.300000\n",
"BrDale 21.500000\n",
"BrkSide 55.789474\n",
"ClearCr 88.150000\n",
"CollgCr 71.336364\n",
"Crawfor 69.951807\n",
"Edwards 64.794286\n",
"Gilbert 74.207207\n",
"Greens 41.000000\n",
"GrnHill NaN\n",
"IDOTRR 62.383721\n",
"Landmrk NaN\n",
"MeadowV 25.606061\n",
"Mitchel 75.144444\n",
"NAmes 75.210667\n",
"NPkVill 28.142857\n",
"NWAmes 81.517647\n",
"NoRidge 91.629630\n",
"NridgHt 84.184049\n",
"OldTown 61.777293\n",
"SWISU 59.068182\n",
"Sawyer 74.551020\n",
"SawyerW 70.669811\n",
"Somerst 64.549383\n",
"StoneBr 62.173913\n",
"Timber 81.303571\n",
"Veenker 72.000000\n",
"Name: Lot Frontage, dtype: float64"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Neighborhood')['Lot Frontage'].mean()"
]
},
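{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform Column\n",
"\n",
"Calling `transform` on a groupby returns a result with the same index as the original column, so each missing value can be filled with a statistic computed from its own group (here, the mean `Lot Frontage` of its neighborhood).\n",
"\n",
"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html"
]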
},
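{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why `transform` fits this job, here is a small sketch on hypothetical toy data (the `toy` DataFrame and its two made-up neighborhoods are assumptions, not rows from the Ames file). It contrasts `mean()`, which collapses to one row per group, with `transform()`, which keeps one row per original row.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not the Ames file\n",
"toy = pd.DataFrame({\n",
"    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],\n",
"    'Lot Frontage': [60.0, np.nan, 80.0, 20.0, np.nan],\n",
"})\n",
"\n",
"# mean() returns one value per neighborhood\n",
"group_means = toy.groupby('Neighborhood')['Lot Frontage'].mean()\n",
"\n",
"# transform() returns one value per original row, aligned on the original index\n",
"filled = toy.groupby('Neighborhood')['Lot Frontage'].transform(lambda val: val.fillna(val.mean()))\n",
"```\n",
"\n",
"Here the missing value in neighborhood A is filled with 70.0 and the one in B with 20.0, while the observed values are left unchanged."
]
},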
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 141.0\n",
"1 80.0\n",
"2 81.0\n",
"3 93.0\n",
"4 74.0\n",
"Name: Lot Frontage, dtype: float64"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()['Lot Frontage']"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Lot Frontage</th>\n",
" <th>Lot Area</th>\n",
" <th>Street</th>\n",
" <th>Lot Shape</th>\n",
" <th>Land Contour</th>\n",
" <th>Utilities</th>\n",
" <th>Lot Config</th>\n",
" <th>Land Slope</th>\n",
" <th>...</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Screen Porch</th>\n",
" <th>Pool Area</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Yr Sold</th>\n",
" <th>Sale Type</th>\n",
" <th>Sale Condition</th>\n",
" <th>SalePrice</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>7980</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>500</td>\n",
" <td>3</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>185000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>120</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>6820</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>140</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>212000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>60</td>\n",
" <td>FV</td>\n",
" <td>NaN</td>\n",
" <td>7500</td>\n",
" <td>Pave</td>\n",
" <td>Reg</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>216000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>11241</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>CulDSac</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>700</td>\n",
" <td>3</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>149000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>12537</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>CulDSac</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>2010</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>149900</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2891</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>16669</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2006</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>228000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2894</th>\n",
" <td>60</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>11170</td>\n",
" <td>Pave</td>\n",
" <td>IR2</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>2006</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2895</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>8098</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>10</td>\n",
" <td>2006</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>202000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2909</th>\n",
" <td>90</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>11836</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Lvl</td>\n",
" <td>AllPub</td>\n",
" <td>Corner</td>\n",
" <td>Gtl</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>2006</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>146500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2923</th>\n",
" <td>20</td>\n",
" <td>RL</td>\n",
" <td>NaN</td>\n",
" <td>8885</td>\n",
" <td>Pave</td>\n",
" <td>IR1</td>\n",
" <td>Low</td>\n",
" <td>AllPub</td>\n",
" <td>Inside</td>\n",
" <td>Mod</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" <td>2006</td>\n",
" <td>WD</td>\n",
" <td>Normal</td>\n",
" <td>131000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>490 rows × 76 columns</p>\n",
"</div>"
],
"text/plain": [
" MS SubClass MS Zoning Lot Frontage Lot Area Street Lot Shape \\\n",
"11 20 RL NaN 7980 Pave IR1 \n",
"14 120 RL NaN 6820 Pave IR1 \n",
"22 60 FV NaN 7500 Pave Reg \n",
"23 20 RL NaN 11241 Pave IR1 \n",
"24 20 RL NaN 12537 Pave IR1 \n",
"... ... ... ... ... ... ... \n",
"2891 20 RL NaN 16669 Pave IR1 \n",
"2894 60 RL NaN 11170 Pave IR2 \n",
"2895 20 RL NaN 8098 Pave IR1 \n",
"2909 90 RL NaN 11836 Pave IR1 \n",
"2923 20 RL NaN 8885 Pave IR1 \n",
"\n",
" Land Contour Utilities Lot Config Land Slope ... Enclosed Porch \\\n",
"11 Lvl AllPub Inside Gtl ... 0 \n",
"14 Lvl AllPub Corner Gtl ... 0 \n",
"22 Lvl AllPub Inside Gtl ... 0 \n",
"23 Lvl AllPub CulDSac Gtl ... 0 \n",
"24 Lvl AllPub CulDSac Gtl ... 0 \n",
"... ... ... ... ... ... ... \n",
"2891 Lvl AllPub Corner Gtl ... 0 \n",
"2894 Lvl AllPub Corner Gtl ... 0 \n",
"2895 Lvl AllPub Inside Gtl ... 0 \n",
"2909 Lvl AllPub Corner Gtl ... 0 \n",
"2923 Low AllPub Inside Mod ... 0 \n",
"\n",
" 3Ssn Porch Screen Porch Pool Area Misc Val Mo Sold Yr Sold Sale Type \\\n",
"11 0 0 0 500 3 2010 WD \n",
"14 0 140 0 0 6 2010 WD \n",
"22 0 0 0 0 1 2010 WD \n",
"23 0 0 0 700 3 2010 WD \n",
"24 0 0 0 0 4 2010 WD \n",
"... ... ... ... ... ... ... ... \n",
"2891 0 0 0 0 1 2006 WD \n",
"2894 0 0 0 0 4 2006 WD \n",
"2895 0 0 0 0 10 2006 WD \n",
"2909 0 0 0 0 3 2006 WD \n",
"2923 0 0 0 0 6 2006 WD \n",
"\n",
" Sale Condition SalePrice \n",
"11 Normal 185000 \n",
"14 Normal 212000 \n",
"22 Normal 216000 \n",
"23 Normal 149000 \n",
"24 Normal 149900 \n",
"... ... ... \n",
"2891 Normal 228000 \n",
"2894 Normal 250000 \n",
"2895 Normal 202000 \n",
"2909 Normal 146500 \n",
"2923 Normal 131000 \n",
"\n",
"[490 rows x 76 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['Lot Frontage'].isnull()]"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"21 85.0\n",
"22 NaN\n",
"23 NaN\n",
"24 NaN\n",
"25 65.0\n",
"Name: Lot Frontage, dtype: float64"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[21:26]['Lot Frontage']"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 141.000000\n",
"1 80.000000\n",
"2 81.000000\n",
"3 93.000000\n",
"4 74.000000\n",
" ... \n",
"2922 37.000000\n",
"2923 75.144444\n",
"2924 62.000000\n",
"2925 77.000000\n",
"2926 74.000000\n",
"Name: Lot Frontage, Length: 2925, dtype: float64"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Neighborhood')['Lot Frontage'].transform(lambda val: val.fillna(val.mean()))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"21 85.000000\n",
"22 64.549383\n",
"23 75.210667\n",
"24 75.210667\n",
"25 65.000000\n",
"Name: Lot Frontage, dtype: float64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Neighborhood')['Lot Frontage'].transform(lambda val: val.fillna(val.mean())).iloc[21:26]"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"# Fill each missing Lot Frontage with the mean Lot Frontage of that row's neighborhood\n",
"df['Lot Frontage'] = df.groupby('Neighborhood')['Lot Frontage'].transform(lambda val: val.fillna(val.mean()))"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x=percent_nan.index,y=percent_nan)\n",
"plt.xticks(rotation=90);"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"df['Lot Frontage'] = df['Lot Frontage'].fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"percent_nan = percent_missing(df)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Series([], dtype: float64)"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"percent_nan"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! We no longer have any missing data in our entire data set! Keep in mind, we should eventually turn all these transformations into an easy-to-use function. For now, let's save this dataset:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"../DATA/Ames_NO_Missing_Data.csv\",index=False)"
]
},
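{
"cell_type": "markdown",
"metadata": {},
"source": [
"The note above about eventually wrapping these transformations in a function could look something like the sketch below, which covers only the `Lot Frontage` step. The helper name `fill_by_group_mean`, its parameters, and the toy data are all hypothetical; this is one reasonable shape under those assumptions, not a definitive implementation.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def fill_by_group_mean(df, column, group, fallback=0):\n",
"    # Hypothetical helper, not part of pandas or of this notebook\n",
"    df = df.copy()\n",
"    # Fill each NaN with the mean of its own group\n",
"    df[column] = df.groupby(group)[column].transform(lambda val: val.fillna(val.mean()))\n",
"    # Groups with no observed values have a NaN mean, so fall back to a constant\n",
"    df[column] = df[column].fillna(fallback)\n",
"    return df\n",
"\n",
"# Hypothetical toy data: neighborhood B has no observed Lot Frontage\n",
"toy = pd.DataFrame({\n",
"    'Neighborhood': ['A', 'A', 'B'],\n",
"    'Lot Frontage': [60.0, np.nan, np.nan],\n",
"})\n",
"cleaned = fill_by_group_mean(toy, 'Lot Frontage', 'Neighborhood')\n",
"```\n",
"\n",
"Working on a copy leaves the input DataFrame untouched, which makes the function safer to rerun from any cell."
]
},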
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}