Practical 12 Data Science Project: Prediction on Hotel Room booking demand with respect to inventory (Machine Learning model).

Maharshi Relia
6 min readOct 28, 2021

18IT110: Maharshi Relia

Overview

Presently, rooms booking prediction is a quite in demand for Hotel owners and inventory managers. They always want to keep the future in mind while setting up and finalize the future strategies of inventories and rates, whether they are offline rack rates or OTA rates.

Let us understand our model

Using Logistic Regression model including the Data Cleaning, Preprocessing, and Exploratory Data Analysis, we can execute this model. For some copyright issues I am not declaring the GitHub repo link here but you can definitely check out the screenshots.

EXPLORATORY DATA ANALYSIS
Based on the plot above, most of the adults arrive at their hotels in September, with over 2.0 adults.

1. Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings("ignore")

2. Loading the dataset

df = pd.read_csv('hotel_bookings.csv')
df = df.iloc[0:2999]
df.head()
df.shape
Out[3]:(2999, 32)
df.info()
df.columns

3. Data Cleaning & Preprocessing

Checking for NaN Values

In [6]: df.isnull().sum()

4. Cleaning Country Feature

In [7]: df['country'].describe()Out[7]:
count 2998
unique 53
top PRT
freq 1774
Name: country, dtype: object
In [8]: df['country'].value_counts()Out[8]:
PRT 1774
ESP 343
GBR 316
IRL 143
FRA 79
DEU 38
NLD 35
CN 28
USA 27
CHE 22
BEL 22
POL 19
RUS 18
ITA 15
BRA 13
ROU 12
AUT 10
FIN 8
SWE 7
LVA 6
DNK 5
ARG 4
NOR 4
LUX 4
IND 3
AUS 3
CHN 3
CHL 2
EST 2
LTU 2
OMN 2
CZE 2
TUR 2
MOZ 2
DZA 2
MEX 2
SVN 2
MAR 2
BWA 1
ZAF 1
ALB 1
GRC 1
BLR 1
CPV 1
UKR 1
SRB 1
SMR 1
ISR 1
ZWE 1
PRI 1
ZMB 1
CYM 1
AGO 1
Name: country, dtype: int64
In [9]:df[df['country'].isnull()]
In [10]: df[df['lead_time'] == 118]['country'].value_counts()Out[10]:
PRT 7
GBR 3
Name: country, dtype: int64
In [11]: df['country'] = df['country'].fillna('PRT')
df['country'].isnull().sum()
Out[11]: 0

I replaced the NaN value in country with PRT based on the lead_time feature, where based on the code above, the lead_time feature in the row where the Nan is located is 118, and PRT is the most common country in which lead_time = 118.

5. Cleaning ‘agent’ Feature

In [12]: df['agent'].describe()Out[12]:
count 2534.000000
mean 209.990529
std 70.282437
min 2.000000
25% 208.000000
50% 240.000000
75% 241.000000
max 334.000000
Name: agent, dtype: float64
In [13]: df['agent'].value_counts()Out[13]:
240.0 1242
250.0 246
241.0 151
242.0 68
40.0 56
...
303.0 1
273.0 1
244.0 1
167.0 1
304.0 1
Name: agent, Length: 66, dtype: int64
In [14]:
df[df['agent'].isnull()]
In [21]: df['agent'] = df['agent'].fillna('0')
df['agent'].isnull().sum()
Out[21]: 0

I was trying to replace the Nan values on the ‘agent’ feature based on lead_time, arrival_date_month and arrival_date_week_number, but most of them have ‘240’ as the most common agent. After I read the description and explanation of the dataset that can be found on the internet, the author(s) describe the ‘agent’ feature as “ID of the travel agency that made the bookinga’.

So, those who has ‘agent’ in the dataset are the only ones that made the book through travel agency, and those who don’t have ‘agent’ or the value is Nan, are those who did not make the book through travel agency. So, based on that, I think it is best to fill the Nan values with 0 other than fill them with the agent which would make the dataset different from the original.

6. Cleaning ‘company’ Feature

In [22]: df['company'].describe()Out[22]:
count 123.000000
mean 221.178862
std 83.433790
min 12.000000
25% 115.500000
50% 270.000000
75% 281.000000
max 318.000000
Name: company, dtype: float64
In [23]: df['company'].unique()Out[23]:
array([ nan, 110., 113., 270., 178., 240., 154., 144., 307., 268., 59.,
204., 312., 318., 94., 174., 274., 195., 223., 317., 281., 118.,
53., 286., 12., 47.])
In [24]: df['company'].value_counts()Out[24]:
281.0 43
110.0 18
307.0 8
268.0 7
94.0 5
223.0 4
317.0 4
154.0 4
113.0 3
174.0 3
270.0 3
195.0 3
59.0 2
178.0 2
274.0 2
312.0 2
240.0 2
204.0 1
286.0 1
53.0 1
318.0 1
47.0 1
118.0 1
12.0 1
144.0 1
Name: company, dtype: int64
In [25]: df[df['company'].isnull()]
In [26]: print("Percentage of NaN values in 'company' feature is: ")
print(2876/3000)
Percentage of NaN values in 'company' feature is:
0.9586666666666667
In [27]: df = df.drop(['company'], axis=1)

I choose to drop the entire ‘company’ feature because the Nan in that feature is about 96% of the data. If I choose to mofify the data, it could make a huge difference to the data, and could change the entire data, especially in the company feature.

In [28]: df.isnull().sum()Out[28]:
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 0
babies 0
meal 0
country 0
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 0
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
dtype: int64

The data has been cleaned.

MODEL BUILDING & HYPERPARAMETER TUNING

1. Splitting Dataset

In [55]: df_new = df.copy()[['required_car_parking_spaces','lead_time','booking_changes','adr','adults', 'is_canceled']]
df_new.head()
Out[55]: required_car_parking_spaceslead_timebooking_changesadradultsis_canceled0034230.0201073740.020207075.0103013075.0104014098.020
In [56]:
x = df_new.drop(['is_canceled'], axis=1)
y = df_new['is_canceled']

I am trying to split the dataset based on the top 5 that have the biggest correlation to the target (is_cancelled) which are required_car_parking_spaces’,’lead_time’,’booking_changes’,’adr’,’adults’, ‘is_canceled’

In [57]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, shuffle=False)

Trying the train test split

Fitting Model — Logistic Regression

In [69]: 
model_LogReg_Asli = LogisticRegression()
model_LogReg_Asli.fit(x_train, y_train)
print(model_LogReg_Asli.coef_)
print(model_LogReg_Asli.intercept_)
m = model_LogReg_Asli.coef_[0][0]
c = model_LogReg_Asli.intercept_[0]
[[-4.05812101 0.0065949 -0.77725534 0.00978826 0.08451063]]
[-2.01165506]

Predict

In [114]: # prediksi
y_pred_lr = model_LogReg_Asli.predict(x_test)
In [115]:
#trying to predict using model
coba = x_test.iloc[:1]
model.predict(coba)
coba
Out[115]: required_car_parking_spaceslead_timebooking_changesadradults2399012050.42

Model Performance

In [116]: model_LogReg_Asli.score(x_train, y_train)Out[116]: 0.68528553563985
In [117]: model_LogReg_Asli.score(x_test, y_test)
Out[117]: 0.6933333333333334

model_LogReg_Asli only has about 69.3% accuracy

Model Parameter

In [73]: # Parameter yang dipakai di model asli
model_LogReg_Asli.get_params()
Out[73]:
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
In [74]:
# parameter model linear regression yang akan dituned + nilai yang mungkinpenalty = ['l1', 'l2', 'elasticnet', 'none']
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
max_iter = [1, 10, 100, 1000, 10000]
param = {'penalty': penalty, 'solver': solver, 'max_iter': max_iter}
param
Out[74]:{'penalty': ['l1', 'l2', 'elasticnet', 'none'],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
'max_iter': [1, 10, 100, 1000, 10000]}

and so on..

Contact for request to access the dataset and .ipynb source model. I hope you liked it. Thank You.

Happy Data Predicting &Mining ! :)

--

--

Maharshi Relia

IT Consultant | UI-UX Designer | Web Developer | SEO Analyst & Executive | Marketing Executive | Passionate Hotelier