You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
udemy-ML/03-Pandas/12-Pandas-Project-Exercise-...

1428 lines
51 KiB

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas Project Exercise - Solutions\n",
"\n",
"# The Data\n",
"\n",
"This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.\n",
"\n",
"All personally identifying information has been removed from the data.\n",
"\n",
"Acknowledgements\n",
"The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.\n",
"\n",
"\n",
"----------------------------\n",
"\n",
"#### NOTE: Names, Emails, Phone Numbers, and Credit Card numbers in the data are synthetic and not real information from people. The hotel data is real from the publication listed above.\n",
"\n",
"## <div style=\"text-align: center\">Data Column Reference</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table><thead><tr class=\"rowsep-1\"><th scope=\"col\"><strong>Variable</strong></th><th scope=\"col\"><strong>Type</strong></th><th scope=\"col\"><strong>Description</strong></th><th scope=\"col\"><strong>Source/Engineering</strong></th></tr></thead><tbody><tr><th scope=\"row\"><em>ADR</em></th><td>Numeric</td><td>Average Daily Rate as defined by <a name=\"bbib5\" href=\"#bib5\" class=\"workspace-trigger\">[5]</a></td><td>BO, BL and TR / Calculated by dividing the sum of all lodging transactions by the total number of staying nights</td></tr><tr><th scope=\"row\"><em>Adults</em></th><td>Integer</td><td>Number of adults</td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>Agent</em></th><td>Categorical</td><td>ID of the travel agency that made the booking<a name=\"btbl1fna\" href=\"#tbl1fna\" class=\"workspace-trigger\"><sup>a</sup></a></td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>ArrivalDateDayOfMonth</em></th><td>Integer</td><td>Day of the month of the arrival date</td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>ArrivalDateMonth</em></th><td>Categorical</td><td>Month of arrival date with 12 categories: “January” to “December”</td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>ArrivalDateWeekNumber</em></th><td>Integer</td><td>Week number of the arrival date</td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>ArrivalDateYear</em></th><td>Integer</td><td>Year of arrival date</td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>AssignedRoomType</em></th><td>Categorical</td><td>Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons</td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>Babies</em></th><td>Integer</td><td>Number of babies</td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>BookingChanges</em></th><td>Integer</td><td>Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation</td><td>BO and BL/Calculated by adding the number of unique iterations that change some of the booking attributes, namely: persons, arrival date, nights, reserved room type or meal</td></tr><tr><th scope=\"row\"><em>Children</em></th><td>Integer</td><td>Number of children</td><td>BO and BL/Sum of both payable and non-payable children</td></tr><tr><th scope=\"row\"><em>Company</em></th><td>Categorical</td><td>ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons</td><td>BO and BL.</td></tr><tr><th scope=\"row\"><em>Country</em></th><td>Categorical</td><td>Country of origin. Categories are represented in the ISO 31553:2013 format <a name=\"bbib6\" href=\"#bib6\" class=\"workspace-trigger\">[6]</a></td><td>BO, BL and NT</td></tr><tr><th scope=\"row\"><br></th><td><br></td><td><br></td><td><br></td></tr><tr><th scope=\"row\" rowspan=\"5\"><em>CustomerType</em></th><td rowspan=\"5\">Categorical</td><td>Type of booking, assuming one of four categories:</td><td rowspan=\"5\">BO and BL</td></tr><tr><td>Contract - when the booking has an allotment or other type of contract associated to it;</td></tr><tr><td>Group when the booking is associated to a group;</td></tr><tr><td>Transient when the booking is not part of a group or contract, and is not associated to other transient booking;</td></tr><tr><td>Transient-party when the booking is transient, but is associated to at least other transient booking</td></tr><tr><th scope=\"row\"><em>DaysInWaitingList</em></th><td>Integer</td><td>Number of days the booking was in the waiting list before it was confirmed to the customer</td><td>BO/Calculated by subtracting the date the booking was confirmed to the customer from the date the booking entered on the PMS</td></tr><tr><th scope=\"row\"><br></th><td><br></td><td><br></td><td><br></td></tr><tr><th scope=\"row\" rowspan=\"7\"><em>DepositType</em></th><td rowspan=\"7\">Categorical</td><td>Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:</td><td rowspan=\"2\">BO and TR/Value calculated based on the payments identified for the booking in the transaction (TR) table before the booking׳s arrival or cancellation date.</td></tr><tr><td rowspan=\"3\">No Deposit no deposit was made;</td></tr><tr><td>In case no payments were found the value is “No Deposit”.</td></tr><tr><td rowspan=\"2\">If the payment was equal or exceeded the total cost of stay, the value is set as “Non Refund”.</td></tr><tr><td rowspan=\"2\">Non Refund a deposit was made in the value of the total stay cost;</td></tr><tr><td rowspan=\"2\">Otherwise the value is set as “Refundable”</td></tr><tr><td>Refundable a deposit was made with a value under the total cost of stay.</td></tr><tr><th scope=\"row\"><em>DistributionChannel</em></th><td>Categorical</td><td>Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”</td><td>BO, BL and DC</td></tr><tr><th scope=\"row\"><em>IsCanceled</em></th><td>Categorical</td><td>Value indicating if the booking was canceled (1) or not (0)</td><td>BO</td></tr><tr><th scope=\"row\"><em>IsRepeatedGuest</em></th><td>Categorical</td><td>Value indicating if the booking name was from a repeated guest (1) or not (0)</td><td>BO, BL and C/ Variable created by verifying if a profile was associated with the booking customer. If so, and if the customer profile creation date was prior to the creation date for the booking on the PMS database it was assumed the booking was from a repeated guest</td></tr><tr><th scope=\"row\"><em>LeadTime</em></th><td>Integer</td><td>Number of days that elapsed between the entering date of the booking into the PMS and the arrival date</td><td>BO and BL/ Subtraction of the entering date from the arrival date</td></tr><tr><th scope=\"row\"><em>MarketSegment</em></th><td>Categorical</td><td>Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”</td><td>BO, BL and MS</td></tr><tr><th scope=\"row\"><br></th><td><br></td><td><br></td><td><br></td></tr><tr><th scope=\"row\" rowspan=\"5\"><em>Meal</em></th><td rowspan=\"5\">Categorical</td><td>Type of meal booked. Categories are presented in standard hospitality meal packages:</td><td rowspan=\"5\">BO, BL and ML</td></tr><tr><td>Undefined/SC no meal package;</td></tr><tr><td>BB Bed &amp; Breakfast;</td></tr><tr><td>HB Half board (breakfast and one other meal usually dinner);</td></tr><tr><td>FB Full board (breakfast, lunch and dinner)</td></tr><tr><th scope=\"row\"><em>PreviousBookingsNotCanceled</em></th><td>Integer</td><td>Number of previous bookings not cancelled by the customer prior to the current booking</td><td>BO and BL / In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and not canceled.</td></tr><tr><th scope=\"row\"><em>PreviousCancellations</em></th><td>Integer</td><td>Number of previous bookings that were cancelled by the customer prior to the current booking</td><td>BO and BL/ In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and canceled.</td></tr><tr><th scope=\"row\"><em>RequiredCardParkingSpaces</em></th><td>Integer</td><td>Number of car parking spaces required by the customer</td><td>BO and BL</td></tr><tr><th scope=\"row\"><br></th><td><br></td><td><br></td><td><br></td></tr><tr><th scope=\"row\" rowspan=\"4\"><em>ReservationStatus</em></th><td rowspan=\"4\">Categorical</td><td>Reservation last status, assuming one of three categories:</td><td rowspan=\"4\">BO</td></tr><tr><td>Canceled booking was canceled by the customer;</td></tr><tr><td>Check-Out customer has checked in but already departed;</td></tr><tr><td>No-Show customer did not check-in and did inform the hotel of the reason why</td></tr><tr><th scope=\"row\"><em>ReservationStatusDate</em></th><td>Date</td><td>Date at which the last status was set. This variable can be used in conjunction with the <em>ReservationStatus</em> to understand when was the booking canceled or when did the customer checked-out of the hotel</td><td>BO</td></tr><tr><th scope=\"row\"><em>ReservedRoomType</em></th><td>Categorical</td><td>Code of room type reserved. Code is presented instead of designation for anonymity reasons</td><td>BO and BL</td></tr><tr><th scope=\"row\"><em>StaysInWeekendNights</em></th><td>Integer</td><td>Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel</td><td>BO and BL/ Calculated by counting the number of weekend nights from the total number of nights</td></tr><tr><th scope=\"row\"><em>StaysInWeekNights</em></th><td>Integer</td><td>Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel</td><td>BO and BL/Calculated by counting the number of week nights from the total number of nights</td></tr><tr><th scope=\"row\"><em>TotalOfSpecialRequests</em></th><td>Integer</td><td>Number of special requests made by the customer (e.g. twin bed or high floor)</td><td>BO and BL/Sum of all special requests</td></tr></tbody></table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-----------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TASKS\n",
"\n",
"** Complete the tasks shown in bold below. The expected output is shown in a cell below. Be careful not to run the cell above the expected output, as it will clear the expected output. Try your best to solve these in one line of pandas code (not every single question can be solved in one line, but many can be!) Refer to solutions notebook and video to view possible solutions. NOTE: Many tasks have multiple correct solution methods!**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"hotels = pd.read_csv(\"hotel_booking_data.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hotel</th>\n",
" <th>is_canceled</th>\n",
" <th>lead_time</th>\n",
" <th>arrival_date_year</th>\n",
" <th>arrival_date_month</th>\n",
" <th>arrival_date_week_number</th>\n",
" <th>arrival_date_day_of_month</th>\n",
" <th>stays_in_weekend_nights</th>\n",
" <th>stays_in_week_nights</th>\n",
" <th>adults</th>\n",
" <th>...</th>\n",
" <th>customer_type</th>\n",
" <th>adr</th>\n",
" <th>required_car_parking_spaces</th>\n",
" <th>total_of_special_requests</th>\n",
" <th>reservation_status</th>\n",
" <th>reservation_status_date</th>\n",
" <th>name</th>\n",
" <th>email</th>\n",
" <th>phone-number</th>\n",
" <th>credit_card</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Resort Hotel</td>\n",
" <td>0</td>\n",
" <td>342</td>\n",
" <td>2015</td>\n",
" <td>July</td>\n",
" <td>27</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>Transient</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Check-Out</td>\n",
" <td>2015-07-01</td>\n",
" <td>Ernest Barnes</td>\n",
" <td>Ernest.Barnes31@outlook.com</td>\n",
" <td>669-792-1661</td>\n",
" <td>************4322</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Resort Hotel</td>\n",
" <td>0</td>\n",
" <td>737</td>\n",
" <td>2015</td>\n",
" <td>July</td>\n",
" <td>27</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>Transient</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Check-Out</td>\n",
" <td>2015-07-01</td>\n",
" <td>Andrea Baker</td>\n",
" <td>Andrea_Baker94@aol.com</td>\n",
" <td>858-637-6955</td>\n",
" <td>************9157</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Resort Hotel</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>2015</td>\n",
" <td>July</td>\n",
" <td>27</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>Transient</td>\n",
" <td>75.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Check-Out</td>\n",
" <td>2015-07-02</td>\n",
" <td>Rebecca Parker</td>\n",
" <td>Rebecca_Parker@comcast.net</td>\n",
" <td>652-885-2745</td>\n",
" <td>************3734</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Resort Hotel</td>\n",
" <td>0</td>\n",
" <td>13</td>\n",
" <td>2015</td>\n",
" <td>July</td>\n",
" <td>27</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>Transient</td>\n",
" <td>75.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Check-Out</td>\n",
" <td>2015-07-02</td>\n",
" <td>Laura Murray</td>\n",
" <td>Laura_M@gmail.com</td>\n",
" <td>364-656-8427</td>\n",
" <td>************5677</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Resort Hotel</td>\n",
" <td>0</td>\n",
" <td>14</td>\n",
" <td>2015</td>\n",
" <td>July</td>\n",
" <td>27</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>Transient</td>\n",
" <td>98.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Check-Out</td>\n",
" <td>2015-07-03</td>\n",
" <td>Linda Hines</td>\n",
" <td>LHines@verizon.com</td>\n",
" <td>713-226-5883</td>\n",
" <td>************5498</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 36 columns</p>\n",
"</div>"
],
"text/plain": [
" hotel is_canceled lead_time arrival_date_year arrival_date_month \\\n",
"0 Resort Hotel 0 342 2015 July \n",
"1 Resort Hotel 0 737 2015 July \n",
"2 Resort Hotel 0 7 2015 July \n",
"3 Resort Hotel 0 13 2015 July \n",
"4 Resort Hotel 0 14 2015 July \n",
"\n",
" arrival_date_week_number arrival_date_day_of_month \\\n",
"0 27 1 \n",
"1 27 1 \n",
"2 27 1 \n",
"3 27 1 \n",
"4 27 1 \n",
"\n",
" stays_in_weekend_nights stays_in_week_nights adults ... customer_type \\\n",
"0 0 0 2 ... Transient \n",
"1 0 0 2 ... Transient \n",
"2 0 1 1 ... Transient \n",
"3 0 1 1 ... Transient \n",
"4 0 2 2 ... Transient \n",
"\n",
" adr required_car_parking_spaces total_of_special_requests \\\n",
"0 0.0 0 0 \n",
"1 0.0 0 0 \n",
"2 75.0 0 0 \n",
"3 75.0 0 0 \n",
"4 98.0 0 1 \n",
"\n",
" reservation_status reservation_status_date name \\\n",
"0 Check-Out 2015-07-01 Ernest Barnes \n",
"1 Check-Out 2015-07-01 Andrea Baker \n",
"2 Check-Out 2015-07-02 Rebecca Parker \n",
"3 Check-Out 2015-07-02 Laura Murray \n",
"4 Check-Out 2015-07-03 Linda Hines \n",
"\n",
" email phone-number credit_card \n",
"0 Ernest.Barnes31@outlook.com 669-792-1661 ************4322 \n",
"1 Andrea_Baker94@aol.com 858-637-6955 ************9157 \n",
"2 Rebecca_Parker@comcast.net 652-885-2745 ************3734 \n",
"3 Laura_M@gmail.com 364-656-8427 ************5677 \n",
"4 LHines@verizon.com 713-226-5883 ************5498 \n",
"\n",
"[5 rows x 36 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 119390 entries, 0 to 119389\n",
"Data columns (total 36 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 hotel 119390 non-null object \n",
" 1 is_canceled 119390 non-null int64 \n",
" 2 lead_time 119390 non-null int64 \n",
" 3 arrival_date_year 119390 non-null int64 \n",
" 4 arrival_date_month 119390 non-null object \n",
" 5 arrival_date_week_number 119390 non-null int64 \n",
" 6 arrival_date_day_of_month 119390 non-null int64 \n",
" 7 stays_in_weekend_nights 119390 non-null int64 \n",
" 8 stays_in_week_nights 119390 non-null int64 \n",
" 9 adults 119390 non-null int64 \n",
" 10 children 119386 non-null float64\n",
" 11 babies 119390 non-null int64 \n",
" 12 meal 119390 non-null object \n",
" 13 country 118902 non-null object \n",
" 14 market_segment 119390 non-null object \n",
" 15 distribution_channel 119390 non-null object \n",
" 16 is_repeated_guest 119390 non-null int64 \n",
" 17 previous_cancellations 119390 non-null int64 \n",
" 18 previous_bookings_not_canceled 119390 non-null int64 \n",
" 19 reserved_room_type 119390 non-null object \n",
" 20 assigned_room_type 119390 non-null object \n",
" 21 booking_changes 119390 non-null int64 \n",
" 22 deposit_type 119390 non-null object \n",
" 23 agent 103050 non-null float64\n",
" 24 company 6797 non-null float64\n",
" 25 days_in_waiting_list 119390 non-null int64 \n",
" 26 customer_type 119390 non-null object \n",
" 27 adr 119390 non-null float64\n",
" 28 required_car_parking_spaces 119390 non-null int64 \n",
" 29 total_of_special_requests 119390 non-null int64 \n",
" 30 reservation_status 119390 non-null object \n",
" 31 reservation_status_date 119390 non-null object \n",
" 32 name 119390 non-null object \n",
" 33 email 119390 non-null object \n",
" 34 phone-number 119390 non-null object \n",
" 35 credit_card 119390 non-null object \n",
"dtypes: float64(4), int64(16), object(16)\n",
"memory usage: 32.8+ MB\n"
]
}
],
"source": [
"hotels.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"**TASK: How many rows are there?**"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"119390"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(hotels) #hotels.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Is there any missing data? If so, which column has the most missing data?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"hotel 0\n",
"is_canceled 0\n",
"lead_time 0\n",
"arrival_date_year 0\n",
"arrival_date_month 0\n",
"arrival_date_week_number 0\n",
"arrival_date_day_of_month 0\n",
"stays_in_weekend_nights 0\n",
"stays_in_week_nights 0\n",
"adults 0\n",
"children 4\n",
"babies 0\n",
"meal 0\n",
"country 488\n",
"market_segment 0\n",
"distribution_channel 0\n",
"is_repeated_guest 0\n",
"previous_cancellations 0\n",
"previous_bookings_not_canceled 0\n",
"reserved_room_type 0\n",
"assigned_room_type 0\n",
"booking_changes 0\n",
"deposit_type 0\n",
"agent 16340\n",
"company 112593\n",
"days_in_waiting_list 0\n",
"customer_type 0\n",
"adr 0\n",
"required_car_parking_spaces 0\n",
"total_of_special_requests 0\n",
"reservation_status 0\n",
"reservation_status_date 0\n",
"name 0\n",
"email 0\n",
"phone-number 0\n",
"credit_card 0\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# hotels.isna().sum() #works as well\n",
"hotels.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Yes, missing data, company column missing: 112593 rows.\n"
]
}
],
"source": [
"print(f\"Yes, missing data, company column missing: {hotels['company'].isna().sum()} rows.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Drop the \"company\" column from the dataset.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"hotels = hotels.drop('company',axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What are the top 5 most common country codes in the dataset?**"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"PRT 48590\n",
"GBR 12129\n",
"FRA 10415\n",
"ESP 8568\n",
"DEU 7287\n",
"Name: country, dtype: int64"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels['country'].value_counts()[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What is the name of the person who paid the highest ADR (average daily rate)? How much was their ADR?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"adr 5400\n",
"name Daniel Walter\n",
"Name: 48515, dtype: object"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels.sort_values('adr',ascending=False)[['adr','name']].iloc[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: The adr is the average daily rate for a person's stay at the hotel. What is the mean adr across all the hotel stays in the dataset?**"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"101.83"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"round(hotels['adr'].mean(),2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What is the average (mean) number of nights for a stay across the entire data set? Feel free to round this to 2 decimal points.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.43"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels['total_stay_days'] = hotels['stays_in_week_nights'] + hotels['stays_in_weekend_nights']\n",
"\n",
"round(hotels['total_stay_days'].mean(),2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What is the average total cost for a stay in the dataset? Not *average daily cost*, but *total* stay cost. Feel free to round this to 2 decimal points.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"357.85"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels['total_paid'] = hotels['adr'] * hotels['total_stay_days']\n",
"round(hotels['total_paid'].mean(),2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What are the names and emails of people who made 5 \"Special Requests\"?**"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>email</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>7860</th>\n",
" <td>Amanda Harper</td>\n",
" <td>Amanda.H66@yahoo.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11125</th>\n",
" <td>Laura Sanders</td>\n",
" <td>Sanders_Laura@hotmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14596</th>\n",
" <td>Tommy Ortiz</td>\n",
" <td>Tommy_O@hotmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14921</th>\n",
" <td>Gilbert Miller</td>\n",
" <td>Miller.Gilbert@aol.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14922</th>\n",
" <td>Timothy Torres</td>\n",
" <td>TTorres@protonmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24630</th>\n",
" <td>Jennifer Weaver</td>\n",
" <td>Jennifer_W@aol.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27288</th>\n",
" <td>Crystal Horton</td>\n",
" <td>Crystal.H@mail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27477</th>\n",
" <td>Brittney Burke</td>\n",
" <td>Burke_Brittney16@att.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29906</th>\n",
" <td>Cynthia Cabrera</td>\n",
" <td>Cabrera.Cynthia@xfinity.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29949</th>\n",
" <td>Sarah Floyd</td>\n",
" <td>Sarah_F@gmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32267</th>\n",
" <td>Michelle Villa</td>\n",
" <td>Michelle.Villa@aol.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39027</th>\n",
" <td>Nichole Hebert</td>\n",
" <td>Hebert.Nichole@gmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39129</th>\n",
" <td>Lindsey Mckenzie</td>\n",
" <td>Lindsey.Mckenzie@att.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39525</th>\n",
" <td>Ashley Edwards</td>\n",
" <td>Edwards.Ashley@yahoo.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70114</th>\n",
" <td>Christopher Torres</td>\n",
" <td>Torres.Christopher@gmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78819</th>\n",
" <td>Mrs. Tara Sullivan DVM</td>\n",
" <td>Mrs..DVM@xfinity.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78820</th>\n",
" <td>Michaela Brown</td>\n",
" <td>MichaelaBrown@att.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78822</th>\n",
" <td>Kurt Maldonado MD</td>\n",
" <td>KMD15@xfinity.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97072</th>\n",
" <td>Jason Richardson</td>\n",
" <td>Jason.R@zoho.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97099</th>\n",
" <td>Terri Hurley</td>\n",
" <td>THurley@xfinity.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97261</th>\n",
" <td>Mrs. Caitlin Webb</td>\n",
" <td>Mrs._W@comcast.net</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98410</th>\n",
" <td>Holly Arroyo</td>\n",
" <td>Arroyo_Holly@mail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98674</th>\n",
" <td>Denise Campbell</td>\n",
" <td>Denise_C@gmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99887</th>\n",
" <td>Michael Smith</td>\n",
" <td>Michael.S42@aol.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99888</th>\n",
" <td>Dr. Trevor Sellers</td>\n",
" <td>Dr._S@aol.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101569</th>\n",
" <td>Kayla Murphy</td>\n",
" <td>Kayla.Murphy@yahoo.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102061</th>\n",
" <td>Taylor Martinez</td>\n",
" <td>Taylor.Martinez@hotmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>109511</th>\n",
" <td>Charles Wilson</td>\n",
" <td>Charles_Wilson@yahoo.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>109590</th>\n",
" <td>Tyler Allison</td>\n",
" <td>Tyler.A@protonmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110082</th>\n",
" <td>Matthew Bailey</td>\n",
" <td>Matthew_Bailey@aol.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110083</th>\n",
" <td>Charlotte Acevedo</td>\n",
" <td>Charlotte_A@verizon.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>111909</th>\n",
" <td>Darrell Brennan</td>\n",
" <td>Brennan_Darrell51@hotmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>111911</th>\n",
" <td>Melinda Jensen</td>\n",
" <td>MelindaJensen@zoho.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>113915</th>\n",
" <td>Terry Arnold</td>\n",
" <td>Arnold.Terry@zoho.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>114770</th>\n",
" <td>Mary Nguyen</td>\n",
" <td>Nguyen.Mary@protonmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>114909</th>\n",
" <td>Lindsay Cuevas</td>\n",
" <td>Lindsay.Cuevas40@mail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>116455</th>\n",
" <td>Cynthia Hernandez</td>\n",
" <td>CynthiaHernandez@xfinity.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>116457</th>\n",
" <td>Angela Hawkins</td>\n",
" <td>Angela_H@gmail.com</td>\n",
" </tr>\n",
" <tr>\n",
" <th>118817</th>\n",
" <td>Sue Lawson</td>\n",
" <td>Sue.L52@comcast.net</td>\n",
" </tr>\n",
" <tr>\n",
" <th>119161</th>\n",
" <td>Alyssa Richards</td>\n",
" <td>Alyssa_Richards@aol.com</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name email\n",
"7860 Amanda Harper Amanda.H66@yahoo.com\n",
"11125 Laura Sanders Sanders_Laura@hotmail.com\n",
"14596 Tommy Ortiz Tommy_O@hotmail.com\n",
"14921 Gilbert Miller Miller.Gilbert@aol.com\n",
"14922 Timothy Torres TTorres@protonmail.com\n",
"24630 Jennifer Weaver Jennifer_W@aol.com\n",
"27288 Crystal Horton Crystal.H@mail.com\n",
"27477 Brittney Burke Burke_Brittney16@att.com\n",
"29906 Cynthia Cabrera Cabrera.Cynthia@xfinity.com\n",
"29949 Sarah Floyd Sarah_F@gmail.com\n",
"32267 Michelle Villa Michelle.Villa@aol.com\n",
"39027 Nichole Hebert Hebert.Nichole@gmail.com\n",
"39129 Lindsey Mckenzie Lindsey.Mckenzie@att.com\n",
"39525 Ashley Edwards Edwards.Ashley@yahoo.com\n",
"70114 Christopher Torres Torres.Christopher@gmail.com\n",
"78819 Mrs. Tara Sullivan DVM Mrs..DVM@xfinity.com\n",
"78820 Michaela Brown MichaelaBrown@att.com\n",
"78822 Kurt Maldonado MD KMD15@xfinity.com\n",
"97072 Jason Richardson Jason.R@zoho.com\n",
"97099 Terri Hurley THurley@xfinity.com\n",
"97261 Mrs. Caitlin Webb Mrs._W@comcast.net\n",
"98410 Holly Arroyo Arroyo_Holly@mail.com\n",
"98674 Denise Campbell Denise_C@gmail.com\n",
"99887 Michael Smith Michael.S42@aol.com\n",
"99888 Dr. Trevor Sellers Dr._S@aol.com\n",
"101569 Kayla Murphy Kayla.Murphy@yahoo.com\n",
"102061 Taylor Martinez Taylor.Martinez@hotmail.com\n",
"109511 Charles Wilson Charles_Wilson@yahoo.com\n",
"109590 Tyler Allison Tyler.A@protonmail.com\n",
"110082 Matthew Bailey Matthew_Bailey@aol.com\n",
"110083 Charlotte Acevedo Charlotte_A@verizon.com\n",
"111909 Darrell Brennan Brennan_Darrell51@hotmail.com\n",
"111911 Melinda Jensen MelindaJensen@zoho.com\n",
"113915 Terry Arnold Arnold.Terry@zoho.com\n",
"114770 Mary Nguyen Nguyen.Mary@protonmail.com\n",
"114909 Lindsay Cuevas Lindsay.Cuevas40@mail.com\n",
"116455 Cynthia Hernandez CynthiaHernandez@xfinity.com\n",
"116457 Angela Hawkins Angela_H@gmail.com\n",
"118817 Sue Lawson Sue.L52@comcast.net\n",
"119161 Alyssa Richards Alyssa_Richards@aol.com"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels[hotels['total_of_special_requests'] == 5][['name','email']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What percentage of hotel stays were classified as \"repeat guests\"? (Do not base this off the name of the person, but instead of the is_repeated_guest column)**"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.19"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You can sum booleans, False gets treated as zero, True as one\n",
"round(100 * sum(hotels['is_repeated_guest'] == 1) / len(hotels),2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What are the top 5 most common last name in the dataset? Bonus: Can you figure this out in one line of pandas code? (For simplicity treat the a title such as MD as a last name, for example Caroline Conley MD can be said to have the last name MD)**"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Smith 2510\n",
"Johnson 1998\n",
"Williams 1628\n",
"Jones 1441\n",
"Brown 1433\n",
"Name: name, dtype: int64"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels['name'].apply(lambda name: name.split()[1]).value_counts()[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What are the names of the people who had booked the most number children and babies for their stay? (Don't worry if they canceled, only consider number of people reported at the time of their reservation)**"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>adults</th>\n",
" <th>total_kids</th>\n",
" <th>babies</th>\n",
" <th>children</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>328</th>\n",
" <td>Jamie Ramirez</td>\n",
" <td>2</td>\n",
" <td>10.0</td>\n",
" <td>0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46619</th>\n",
" <td>Nicholas Parker</td>\n",
" <td>2</td>\n",
" <td>10.0</td>\n",
" <td>10</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78656</th>\n",
" <td>Marc Robinson</td>\n",
" <td>1</td>\n",
" <td>9.0</td>\n",
" <td>9</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name adults total_kids babies children\n",
"328 Jamie Ramirez 2 10.0 0 10.0\n",
"46619 Nicholas Parker 2 10.0 10 0.0\n",
"78656 Marc Robinson 1 9.0 9 0.0"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels['total_kids'] = hotels['babies'] + hotels['children']\n",
"\n",
"hotels.sort_values('total_kids',ascending=False)[['name','adults','total_kids','babies','children']][:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: What are the top 3 most common area code in the phone numbers? (Area code is first 3 digits)**"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Code - Total Count\n"
]
},
{
"data": {
"text/plain": [
"799 168\n",
"185 167\n",
"541 166\n",
"Name: phone-number, dtype: int64"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Code - Total Count')\n",
"hotels['phone-number'].apply(lambda num:num[:3]).value_counts()[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: How many arrivals took place between the 1st and the 15th of the month (inclusive of 1 and 15) ? Bonus: Can you do this in one line of pandas code?**"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"58152"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hotels['arrival_date_day_of_month'].apply(lambda day: day in range(1,16)).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**HARD BONUS TASK: Create a table for counts for each day of the week that people arrived. (E.g. 5000 arrivals were on a Monday, 3000 were on a Tuesday, etc..)**"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def convert(day,month,year):\n",
" return f'{day}-{month}-{year}'"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [],
"source": [
"hotels['date'] = np.vectorize(convert)(hotels['arrival_date_day_of_month'],\n",
" hotels['arrival_date_month'],\n",
" hotels['arrival_date_year'])"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"hotels['date'] = pd.to_datetime(hotels['date'])"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Friday 19631\n",
"Thursday 19254\n",
"Monday 18171\n",
"Saturday 18055\n",
"Wednesday 16139\n",
"Sunday 14141\n",
"Tuesday 13999\n",
"Name: date, dtype: int64"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.day_name.html#pandas.Series.dt.day_name\n",
"hotels['date'].dt.day_name().value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}