Practical 12 Data Science Project: Predicting Hotel Room Booking Demand with Respect to Inventory (Machine Learning Model)
18IT110: Maharshi Relia
Overview
Room booking prediction is currently in high demand among hotel owners and inventory managers. They want to keep the future in mind while setting up and finalizing inventory and rate strategies, whether for offline rack rates or OTA rates.
Let us understand our model
We build a Logistic Regression model, covering data cleaning, preprocessing, and exploratory data analysis along the way. Due to copyright issues I am not sharing the GitHub repo link here, but you can definitely check out the screenshots.
1. Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")
2. Loading the dataset
df = pd.read_csv('hotel_bookings.csv')
df = df.iloc[0:2999]  # keep only the first 2,999 rows of the dataset
df.head()

df.shape
Out[3]: (2999, 32)
df.info()

df.columns

3. Data Cleaning & Preprocessing
Checking for NaN Values
In [6]: df.isnull().sum()

4. Cleaning Country Feature
In [7]: df['country'].describe()
Out[7]:
count 2998
unique 53
top PRT
freq 1774
Name: country, dtype: object

In [8]: df['country'].value_counts()
Out[8]:
PRT 1774
ESP 343
GBR 316
IRL 143
FRA 79
DEU 38
NLD 35
CN 28
USA 27
CHE 22
BEL 22
POL 19
RUS 18
ITA 15
BRA 13
ROU 12
AUT 10
FIN 8
SWE 7
LVA 6
DNK 5
ARG 4
NOR 4
LUX 4
IND 3
AUS 3
CHN 3
CHL 2
EST 2
LTU 2
OMN 2
CZE 2
TUR 2
MOZ 2
DZA 2
MEX 2
SVN 2
MAR 2
BWA 1
ZAF 1
ALB 1
GRC 1
BLR 1
CPV 1
UKR 1
SRB 1
SMR 1
ISR 1
ZWE 1
PRI 1
ZMB 1
CYM 1
AGO 1
Name: country, dtype: int64

In [9]: df[df['country'].isnull()]

In [10]: df[df['lead_time'] == 118]['country'].value_counts()
Out[10]:
PRT 7
GBR 3
Name: country, dtype: int64

In [11]: df['country'] = df['country'].fillna('PRT')
df['country'].isnull().sum()
Out[11]: 0
I replaced the NaN value in 'country' with 'PRT' based on the lead_time feature: as the code above shows, the row containing the NaN has lead_time = 118, and PRT is the most common country among bookings with lead_time = 118.
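As a side note, this conditional-mode idea can be generalized into a small helper. The sketch below is my own illustration, not code from the original notebook:

# Sketch: fill each missing 'country' with the most common country
# among rows that share the same lead_time, falling back to the
# global mode when no informative peers exist.
def fill_country_by_lead_time(df):
    global_mode = df['country'].mode()[0]
    for idx in df.index[df['country'].isnull()]:
        peers = df.loc[df['lead_time'] == df.at[idx, 'lead_time'], 'country']
        peer_modes = peers.mode()  # mode() ignores NaN by default
        df.at[idx, 'country'] = peer_modes[0] if not peer_modes.empty else global_mode
    return df

With only one missing value here the manual fillna('PRT') is simpler, but the helper scales to datasets with many gaps.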
5. Cleaning ‘agent’ Feature
In [12]: df['agent'].describe()
Out[12]:
count 2534.000000
mean 209.990529
std 70.282437
min 2.000000
25% 208.000000
50% 240.000000
75% 241.000000
max 334.000000
Name: agent, dtype: float64

In [13]: df['agent'].value_counts()
Out[13]:
240.0 1242
250.0 246
241.0 151
242.0 68
40.0 56
...
303.0 1
273.0 1
244.0 1
167.0 1
304.0 1
Name: agent, Length: 66, dtype: int64

In [14]: df[df['agent'].isnull()]

In [21]: df['agent'] = df['agent'].fillna(0)
df['agent'].isnull().sum()
Out[21]: 0
I initially tried to replace the NaN values in the 'agent' feature based on lead_time, arrival_date_month, and arrival_date_week_number, but most groups had '240' as the most common agent. After reading the dataset description available on the internet, I found that the author(s) describe the 'agent' feature as the "ID of the travel agency that made the booking".
So rows that have an 'agent' value are bookings made through a travel agency, while rows where 'agent' is NaN are bookings made without one. Based on that, I think it is best to fill the NaN values with 0 rather than with an agent ID, which would make the dataset different from the original.
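A possible refinement, which I did not apply here, is to make the agency/no-agency distinction an explicit feature instead of relying on the sentinel 0 alone. A quick sketch (the 'booked_via_agency' column name is my own invention):

# Sketch: flag whether a booking came through a travel agency
# before replacing the NaN agent IDs with the sentinel value 0.
df['booked_via_agency'] = df['agent'].notnull().astype(int)
df['agent'] = df['agent'].fillna(0)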
6. Cleaning ‘company’ Feature
In [22]: df['company'].describe()
Out[22]:
count 123.000000
mean 221.178862
std 83.433790
min 12.000000
25% 115.500000
50% 270.000000
75% 281.000000
max 318.000000
Name: company, dtype: float64

In [23]: df['company'].unique()
Out[23]:
array([ nan, 110., 113., 270., 178., 240., 154., 144., 307., 268.,  59.,
       204., 312., 318.,  94., 174., 274., 195., 223., 317., 281., 118.,
        53., 286.,  12.,  47.])

In [24]: df['company'].value_counts()
Out[24]:
281.0 43
110.0 18
307.0 8
268.0 7
94.0 5
223.0 4
317.0 4
154.0 4
113.0 3
174.0 3
270.0 3
195.0 3
59.0 2
178.0 2
274.0 2
312.0 2
240.0 2
204.0 1
286.0 1
53.0 1
318.0 1
47.0 1
118.0 1
12.0 1
144.0 1
Name: company, dtype: int64

In [25]: df[df['company'].isnull()]

In [26]: print("Percentage of NaN values in 'company' feature is: ")
print(2876/3000)
Percentage of NaN values in 'company' feature is:
0.9586666666666667

In [27]: df = df.drop(['company'], axis=1)
I chose to drop the entire 'company' feature because about 96% of its values are NaN. Imputing that many values would substantially distort the data, especially within the company feature itself.
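More generally, this decision can be automated. The sketch below (my illustration, with an assumed 90% cutoff) computes the NaN ratio per column and drops any column above the threshold:

# Sketch: drop every column whose fraction of NaN values exceeds a cutoff.
nan_ratio = df.isnull().mean()   # fraction of NaNs per column
threshold = 0.9                  # assumed cutoff; tune as needed
df = df.drop(columns=nan_ratio[nan_ratio > threshold].index)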
In [28]: df.isnull().sum()
Out[28]:
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 0
babies 0
meal 0
country 0
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 0
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
dtype: int64
The data has been cleaned.
MODEL BUILDING & HYPERPARAMETER TUNING
1. Splitting Dataset
In [55]: df_new = df.copy()[['required_car_parking_spaces','lead_time','booking_changes','adr','adults','is_canceled']]
df_new.head()
Out[55]:
   required_car_parking_spaces  lead_time  booking_changes   adr  adults  is_canceled
0                            0        342                3   0.0       2            0
1                            0        737                4   0.0       2            0
2                            0          7                0  75.0       1            0
3                            0         13                0  75.0       1            0
4                            0         14                0  98.0       2            0
In [56]: x = df_new.drop(['is_canceled'], axis=1)
y = df_new['is_canceled']
I split the dataset using the five features with the strongest correlation to the target (is_canceled): 'required_car_parking_spaces', 'lead_time', 'booking_changes', 'adr', and 'adults'.
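For reference, here is a sketch of how such a ranking could be produced (my illustration, not the notebook's code): correlate every numeric column with the target and sort by absolute value.

# Sketch: rank numeric features by absolute correlation with the target.
numeric = df.select_dtypes(include='number')
corr_with_target = (
    numeric.corr()['is_canceled']
    .drop('is_canceled')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target.head(5))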
In [57]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, shuffle=False)
Performing the 80/20 train/test split (shuffle=False preserves the original row order).
Fitting Model — Logistic Regression
In [69]:
model_LogReg_Asli = LogisticRegression()
model_LogReg_Asli.fit(x_train, y_train)
print(model_LogReg_Asli.coef_)
print(model_LogReg_Asli.intercept_)

m = model_LogReg_Asli.coef_[0][0]
c = model_LogReg_Asli.intercept_[0]

[[-4.05812101 0.0065949 -0.77725534 0.00978826 0.08451063]]
[-2.01165506]
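As a sanity check (my own addition, not part of the article), the printed coefficients can reproduce the model's probability for a single row via the sigmoid of the linear score:

# Sketch: recompute predict_proba for one test row from coef_ and intercept_.
row = x_test.iloc[0].to_numpy()
z = model_LogReg_Asli.coef_[0] @ row + model_LogReg_Asli.intercept_[0]
p_cancel = 1 / (1 + np.exp(-z))  # probability that is_canceled = 1
print(p_cancel, model_LogReg_Asli.predict_proba(x_test.iloc[:1])[0, 1])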
Predict
In [114]: # prediction
y_pred_lr = model_LogReg_Asli.predict(x_test)
In [115]: # trying to predict using the model
coba = x_test.iloc[:1]
model_LogReg_Asli.predict(coba)
coba
Out[115]:
      required_car_parking_spaces  lead_time  booking_changes   adr  adults
2399                            0         12                0  50.4       2
Model Performance
In [116]: model_LogReg_Asli.score(x_train, y_train)
Out[116]: 0.68528553563985
In [117]: model_LogReg_Asli.score(x_test, y_test)
Out[117]: 0.6933333333333334
model_LogReg_Asli reaches only about 69.3% accuracy on the test set (and 68.5% on the training set).
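Since confusion_matrix and classification_report are already imported, a natural follow-up (not shown in the original screenshots) is to inspect per-class performance rather than accuracy alone:

# Sketch: per-class evaluation of the fitted model on the test set.
print(confusion_matrix(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))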
Model Parameters
In [73]: # parameters used in the original model
model_LogReg_Asli.get_params()
Out[73]:{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
In [74]: # logistic regression parameters to be tuned + possible values
penalty = ['l1', 'l2', 'elasticnet', 'none']
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
max_iter = [1, 10, 100, 1000, 10000]

param = {'penalty': penalty, 'solver': solver, 'max_iter': max_iter}
param
Out[74]:
{'penalty': ['l1', 'l2', 'elasticnet', 'none'],
 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
 'max_iter': [1, 10, 100, 1000, 10000]}
and so on: the grid would then be fed to a search procedure, as sketched below.
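To make that continuation concrete, here is a sketch of how the grid could be fed to scikit-learn's GridSearchCV; this is my assumption about the next step, not the author's code. Note that not every penalty/solver pair is valid, so error_score=np.nan lets the search skip failing combinations:

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(),
    param_grid=param,
    cv=5,
    scoring='accuracy',
    error_score=np.nan,  # skip invalid penalty/solver combinations
)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)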
Contact me to request access to the dataset and the .ipynb source notebook. I hope you liked it. Thank you.
Happy data predicting & mining! :)