Practical 12 Data Science Project: Predicting Hotel Room Booking Demand with Respect to Inventory (Machine Learning Model)
18IT110: Maharshi Relia
Overview
Room booking prediction is currently in high demand among hotel owners and inventory managers. They want to keep the future in mind while setting up and finalizing inventory and rate strategies, whether for offline rack rates or OTA rates.
Let us understand our model
We build a Logistic Regression model, covering data cleaning, preprocessing, and exploratory data analysis along the way. Due to copyright issues I am not sharing the GitHub repo link here, but you can definitely check out the screenshots.
1. Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")
2. Loading the dataset
df = pd.read_csv('hotel_bookings.csv')
df = df.iloc[0:2999]  # keep only the first 2,999 rows of the dataset
df.head()

df.shape
Out[3]: (2999, 32)
df.info()

df.columns

3. Data Cleaning & Preprocessing
Checking for NaN Values
In [6]: df.isnull().sum()

4. Cleaning Country Feature
In [7]: df['country'].describe()
Out[7]:
count 2998
unique 53
top PRT
freq 1774
Name: country, dtype: object

In [8]: df['country'].value_counts()
Out[8]:
PRT 1774
ESP 343
GBR 316
IRL 143
FRA 79
DEU 38
NLD 35
CN 28
USA 27
CHE 22
BEL 22
POL 19
RUS 18
ITA 15
BRA 13
ROU 12
AUT 10
FIN 8
SWE 7
LVA 6
DNK 5
ARG 4
NOR 4
LUX 4
IND 3
AUS 3
CHN 3
CHL 2
EST 2
LTU 2
OMN 2
CZE 2
TUR 2
MOZ 2
DZA 2
MEX 2
SVN 2
MAR 2
BWA 1
ZAF 1
ALB 1
GRC 1
BLR 1
CPV 1
UKR 1
SRB 1
SMR 1
ISR 1
ZWE 1
PRI 1
ZMB 1
CYM 1
AGO 1
Name: country, dtype: int64

In [9]: df[df['country'].isnull()]

In [10]: df[df['lead_time'] == 118]['country'].value_counts()
Out[10]:
PRT 7
GBR 3
Name: country, dtype: int64

In [11]: df['country'] = df['country'].fillna('PRT')
df['country'].isnull().sum()
Out[11]: 0
I replaced the NaN value in 'country' with 'PRT' based on the lead_time feature: as the code above shows, the row containing the NaN has lead_time = 118, and PRT is the most common country among bookings with lead_time = 118.
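As a side note, this conditional-mode idea can be generalized into a small helper. The sketch below is my own illustration, not code from the original notebook:

# Sketch: fill each missing 'country' with the most common country
# among rows that share the same lead_time, falling back to the
# global mode when no informative peers exist.
def fill_country_by_lead_time(df):
    global_mode = df['country'].mode()[0]
    for idx in df.index[df['country'].isnull()]:
        peers = df.loc[df['lead_time'] == df.at[idx, 'lead_time'], 'country']
        peer_modes = peers.mode()  # mode() ignores NaN by default
        df.at[idx, 'country'] = peer_modes[0] if not peer_modes.empty else global_mode
    return df

With only one missing value here the manual fillna('PRT') is simpler, but the helper scales to datasets with many gaps.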
5. Cleaning ‘agent’ Feature
In [12]: df['agent'].describe()
Out[12]:
count 2534.000000
mean 209.990529
std 70.282437
min 2.000000
25% 208.000000
50% 240.000000
75% 241.000000
max 334.000000
Name: agent, dtype: float64

In [13]: df['agent'].value_counts()
Out[13]:
240.0 1242
250.0 246
241.0 151
242.0 68
40.0 56
...
303.0 1
273.0 1
244.0 1
167.0 1
304.0 1
Name: agent, Length: 66, dtype: int64

In [14]: df[df['agent'].isnull()]

In [21]: df['agent'] = df['agent'].fillna(0)
df['agent'].isnull().sum()
Out[21]: 0
I initially tried to replace the NaN values in the 'agent' feature based on lead_time, arrival_date_month, and arrival_date_week_number, but most groups had '240' as the most common agent. After reading the dataset description available on the internet, I found that the author(s) describe the 'agent' feature as the "ID of the travel agency that made the booking".
So rows that have an 'agent' value are bookings made through a travel agency, while rows where 'agent' is NaN are bookings made without one. Based on that, I think it is best to fill the NaN values with 0 rather than with an agent ID, which would make the dataset different from the original.
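A possible refinement, which I did not apply here, is to make the agency/no-agency distinction an explicit feature instead of relying on the sentinel 0 alone. A quick sketch (the 'booked_via_agency' column name is my own invention):

# Sketch: flag whether a booking came through a travel agency
# before replacing the NaN agent IDs with the sentinel value 0.
df['booked_via_agency'] = df['agent'].notnull().astype(int)
df['agent'] = df['agent'].fillna(0)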
6. Cleaning ‘company’ Feature
In [22]: df['company'].describe()
Out[22]:
count 123.000000
mean 221.178862
std 83.433790
min 12.000000
25% 115.500000
50% 270.000000
75% 281.000000
max 318.000000
Name: company, dtype: float64

In [23]: df['company'].unique()
Out[23]:
array([ nan, 110., 113., 270., 178., 240., 154., 144., 307., 268.,  59.,
       204., 312., 318.,  94., 174., 274., 195., 223., 317., 281., 118.,
        53., 286.,  12.,  47.])

In [24]: df['company'].value_counts()
Out[24]:
281.0 43
110.0 18
307.0 8
268.0 7
94.0 5
223.0 4
317.0 4
154.0 4
113.0 3
174.0 3
270.0 3
195.0 3
59.0 2
178.0 2
274.0 2
312.0 2
240.0 2
204.0 1
286.0 1
53.0 1
318.0 1
47.0 1
118.0 1
12.0 1
144.0 1
Name: company, dtype: int64

In [25]: df[df['company'].isnull()]

In [26]: print("Percentage of NaN values in 'company' feature is: ")
print(2876/3000)
Percentage of NaN values in 'company' feature is:
0.9586666666666667

In [27]: df = df.drop(['company'], axis=1)
I chose to drop the entire 'company' feature because about 96% of its values are NaN. Imputing that many values would substantially distort the data, especially within the company feature itself.
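More generally, this decision can be automated. The sketch below (my illustration, with an assumed 90% cutoff) computes the NaN ratio per column and drops any column above the threshold:

# Sketch: drop every column whose fraction of NaN values exceeds a cutoff.
nan_ratio = df.isnull().mean()   # fraction of NaNs per column
threshold = 0.9                  # assumed cutoff; tune as needed
df = df.drop(columns=nan_ratio[nan_ratio > threshold].index)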
In [28]: df.isnull().sum()
Out[28]:
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 0
babies 0
meal 0
country 0
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 0
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
dtype: int64
The data has been cleaned.
MODEL BUILDING & HYPERPARAMETER TUNING
1. Splitting Dataset
In [55]: df_new = df.copy()[['required_car_parking_spaces','lead_time','booking_changes','adr','adults','is_canceled']]
df_new.head()
Out[55]:
   required_car_parking_spaces  lead_time  booking_changes   adr  adults  is_canceled
0                            0        342                3   0.0       2            0
1                            0        737                4   0.0       2            0
2                            0          7                0  75.0       1            0
3                            0         13                0  75.0       1            0
4                            0         14                0  98.0       2            0
In [56]: x = df_new.drop(['is_canceled'], axis=1)
y = df_new['is_canceled']
I split the dataset using the five features with the strongest correlation to the target (is_canceled): 'required_car_parking_spaces', 'lead_time', 'booking_changes', 'adr', and 'adults'.
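For reference, here is a sketch of how such a ranking could be produced (my illustration, not the notebook's code): correlate every numeric column with the target and sort by absolute value.

# Sketch: rank numeric features by absolute correlation with the target.
numeric = df.select_dtypes(include='number')
corr_with_target = (
    numeric.corr()['is_canceled']
    .drop('is_canceled')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target.head(5))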
In [57]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, shuffle=False)
Performing the 80/20 train/test split (shuffle=False preserves the original row order).
Fitting Model — Logistic Regression
In [69]:
model_LogReg_Asli = LogisticRegression()
model_LogReg_Asli.fit(x_train, y_train)
print(model_LogReg_Asli.coef_)
print(model_LogReg_Asli.intercept_)

m = model_LogReg_Asli.coef_[0][0]
c = model_LogReg_Asli.intercept_[0]

[[-4.05812101 0.0065949 -0.77725534 0.00978826 0.08451063]]
[-2.01165506]
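As a sanity check (my own addition, not part of the article), the printed coefficients can reproduce the model's probability for a single row via the sigmoid of the linear score:

# Sketch: recompute predict_proba for one test row from coef_ and intercept_.
row = x_test.iloc[0].to_numpy()
z = model_LogReg_Asli.coef_[0] @ row + model_LogReg_Asli.intercept_[0]
p_cancel = 1 / (1 + np.exp(-z))  # probability that is_canceled = 1
print(p_cancel, model_LogReg_Asli.predict_proba(x_test.iloc[:1])[0, 1])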
Predict
In [114]: # prediction
y_pred_lr = model_LogReg_Asli.predict(x_test)
In [115]: # trying to predict using the model
coba = x_test.iloc[:1]
model_LogReg_Asli.predict(coba)
coba
Out[115]:
      required_car_parking_spaces  lead_time  booking_changes   adr  adults
2399                            0         12                0  50.4       2
Model Performance
In [116]: model_LogReg_Asli.score(x_train, y_train)
Out[116]: 0.68528553563985
In [117]: model_LogReg_Asli.score(x_test, y_test)
Out[117]: 0.6933333333333334
model_LogReg_Asli reaches only about 69.3% accuracy on the test set (and 68.5% on the training set).
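Since confusion_matrix and classification_report are already imported, a natural follow-up (not shown in the original screenshots) is to inspect per-class performance rather than accuracy alone:

# Sketch: per-class evaluation of the fitted model on the test set.
print(confusion_matrix(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))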
Model Parameters
In [73]: # parameters used in the original model
model_LogReg_Asli.get_params()
Out[73]:{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
In [74]: # logistic regression parameters to be tuned + possible values
penalty = ['l1', 'l2', 'elasticnet', 'none']
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
max_iter = [1, 10, 100, 1000, 10000]

param = {'penalty': penalty, 'solver': solver, 'max_iter': max_iter}
param
Out[74]:
{'penalty': ['l1', 'l2', 'elasticnet', 'none'],
 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
 'max_iter': [1, 10, 100, 1000, 10000]}
and so on: the grid would then be fed to a search procedure, as sketched below.
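To make that continuation concrete, here is a sketch of how the grid could be fed to scikit-learn's GridSearchCV; this is my assumption about the next step, not the author's code. Note that not every penalty/solver pair is valid, so error_score=np.nan lets the search skip failing combinations:

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(),
    param_grid=param,
    cv=5,
    scoring='accuracy',
    error_score=np.nan,  # skip invalid penalty/solver combinations
)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)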
Contact me to request access to the dataset and the .ipynb source notebook. I hope you liked it. Thank you.
Happy data predicting & mining! :)