Machine learning in practice – Quick data analysis

Recently I was asked to do a sample data analysis for a hotel affiliate business. The company makes money by displaying ads for hotels online.
They explained that they don't really keep historical data, and they didn't have a traditional database where you could look at their activity in detail.

They had a simple Excel spreadsheet with columns like Hotel_ID, Clicks, Bookings, Average Book, Cost, Revenue, Profit, Cost per click, Conversion rate, Hotel_Stars, Hotel_Rating, Hotel_Luxurious, Hotel_ChainID, Hotel_City, Hotel_Country.
What they wanted was to start analyzing their data in order to get some idea of what they can do with it.
Here is what I did in two hours' time.
The first step in data analysis is Exploratory Data Analysis (EDA): analyzing the data set to summarize its main characteristics, often with visual methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.
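As a quick illustration (a Python sketch rather than the tools used below; the column names and values are invented to match the spreadsheet's shape), EDA can start with plain summary statistics:

```python
# Minimal EDA sketch: summary statistics over a toy sample shaped like
# the hotel spreadsheet. Column names and values are invented.
import pandas as pd

data = pd.DataFrame({
    "Clicks":   [120, 45, 300, 80],
    "Bookings": [10, 2, 25, 5],
    "Cost":     [60.0, 20.0, 150.0, 40.0],
    "Revenue":  [200.0, 30.0, 500.0, 90.0],
})

# describe() reports count, mean, std, min, quartiles and max per column
summary = data.describe()
print(summary.loc["mean", "Clicks"])  # 136.25
```

In practice the real spreadsheet would be loaded with pd.read_excel before calling describe().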

Install necessary software

First, I installed a Tableau trial. I think it is really easy and fast to set up the dashboard and visualizations with Tableau.


From the Excel data I derived a few more metrics to help me better understand and interpret the data:
Profit (Revenue – Cost) – useful for spotting the best and worst performers
Cost per click (Cost / Clicks) – useful for optimizing the click price
Conversion rate (Bookings / Clicks) – useful for spotting the high and low converters. It can give us an idea of where things go wrong: for example, a low conversion rate on a well-placed ad suggests something is wrong on the hotel side, and vice versa.
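These derived metrics are simple differences and ratios; a small Python sketch (invented numbers, and the function name is my own) shows the arithmetic:

```python
# Derive Profit, Cost per click and Conversion rate from the raw columns.
# The column names and the sample values are assumptions for illustration.
def derived_metrics(clicks, bookings, cost, revenue):
    return {
        "profit": revenue - cost,
        "cost_per_click": cost / clicks if clicks else 0.0,
        "conversion_rate": bookings / clicks if clicks else 0.0,
    }

m = derived_metrics(clicks=200, bookings=10, cost=50.0, revenue=180.0)
print(m)  # {'profit': 130.0, 'cost_per_click': 0.25, 'conversion_rate': 0.05}
```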


First and foremost, correlation is not causation. Seeing a relation between two variables does not mean that one causes the other. For example, global warming started around 1800, and the number of pirates declined as it took off. This does not mean that global warming caused the pirates to go away.

I used R to make a simple exploratory analysis and to see the correlations between the different variables.
#Install and load the necessary packages
install.packages("corrgram")
library(corrgram)
#Read the data set
Hotels.Data <- read.csv("C:/Users/Hotel data/Hotels.csv")
#Convert some measures to numeric type
#Generate the US data set
#Generate correlograms to see the correlations between the variables
corrgram(Hotels.Data, order=TRUE, lower.panel=panel.ellipse,
         upper.panel=panel.pts, text.panel=panel.txt,
         main="Variable Relationships")
From the metrics available, we can see a few strong correlations that can help us in optimizing the business:
.00-.19 "very weak" · .20-.39 "weak" · .40-.59 "moderate" · .60-.79 "strong" · .80-1.0 "very strong"
From Chart1 and Chart2 above we can make the following observations:
· Clicks vs Cost – 0.961991978 – The strongest correlation, which at first sight is obvious: if you spend more money on display, you get more clicks.
· Clicks vs Bookings – 0.53948721 – Moderate correlation, meaning a click does not always bring a booking.
· Cost vs Bookings – 0.53952241 – Moderate correlation, meaning investing money does not always bring bookings.
· Clicks vs Average Booking – 0.27334842 – Weak correlation: the average booking value will not grow just because we have more clicks, so we should invest more in displaying ads in better positions or on better sites, or in quality in general. This is confirmed by looking at Cost per click vs Bookings.
· Taken together: the very strong Clicks vs Cost correlation (0.96173728) means spending more on ads buys more clicks, but given the weaker Clicks vs Bookings and Clicks vs Average Booking correlations, chasing raw clicks will not necessarily get us more bookings. We should instead invest time and strategy in more expensive, more effective ads, such as better placement on more popular (and thus more expensive) sites.
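The strength scale quoted earlier can be applied mechanically once Pearson's r is computed. A small Python sketch with invented data (the real figures came from the correlogram analysis):

```python
# Compute Pearson's r from its definition and bucket it using the
# strength scale quoted in the text. Sample data is invented.
import math

def pearson(xs, ys):
    # Pearson correlation coefficient: covariance over product of spreads
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strength(r):
    # Bucket |r| per the .00-.19 / .20-.39 / .40-.59 / .60-.79 / .80-1.0 scale
    a = abs(r)
    if a < 0.20: return "very weak"
    if a < 0.40: return "weak"
    if a < 0.60: return "moderate"
    if a < 0.80: return "strong"
    return "very strong"

clicks = [100, 200, 300, 400, 500]  # invented example data
cost = [55, 98, 160, 205, 255]
print(strength(pearson(clicks, cost)))  # near-linear data gives "very strong"
```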


I used Tableau to generate visual reports and a dashboard for better data exploration.

Clicks vs Bookings

From here we can conclude that most of the Bookings are made within 2 to 4 Clicks.
Most of the bookings were made in the US and Western Europe, and most of the profit is made there as well. From this dashboard we can see possible new markets in South America, South Africa, and the Pacific.
I would do more detailed research on other travel sites to try to identify the popularity for each of these areas before creating a new strategy.

Hotel rating vs Conversion rate

The report above shows Costs vs Clicks vs Profit side by side. Here we can get a good idea of how our strategies are performing in terms of hotel rating.
Namely, we can see that the best-performing hotels have ratings of 8.23, 8.39 and 8.44.
The worst performers are hotels with ratings of 9.0 and 9.18.
We can also see that the biggest interest in terms of clicks is in hotels rated 7.0, 8.0 and 9.0, but the conversion rate there is not as expected, which means there is big room for improvement in this segment.
Another interesting observation: for hotels rated 4.0 and 8.89 we achieved substantial profit with minimal investment. That is a potential segment where we can invest more to gain more, or learn what caused the increased interest at these ratings.


City Conversion rate

This report gives a good picture of our per-city investment strategy. Orange is cost and green is profit. From it we can quickly spot the loss-makers and, beyond that, the weak performers. For example, I drew a red line touching the tops of most of the profit bars to get a rough median over the observed sample. From there it is easy to see the many cities where profit exceeds cost, or where the conversion rate is higher. This gives us action points: stop investing in weak performers like c00014, c00734 and c00064, and invest more in cities that perform well at low cost, like c01213 and c01258.

Country conversion rate


We can make a similar analysis by country. From here we can quickly pinpoint the weak performers: NL, BE, MA. We can instead invest in better performers like BJ, SC and HU, which delivered much more for the money invested.

Conversion rate vs profit

Not every high converter brings high profit; the chart above gives good examples. BR and FI converted well but brought negative profit. BE, DE and IN are other examples of countries that converted well but whose profit is rather low. In contrast, countries like TN, ID and BJ bring bigger profit with lower conversion, meaning we can develop better advertising strategies for these countries and thus boost profit.

Observations from Top 10 Countries

Using Excel I added a new tab named Pivot Table, where I generated a pivot table and two reports from it.
Looking at the chart above, generated from the pivot table in the Excel file, we can assume that the best performers are hotels with 4 and 3 stars, which is also where the company invested the most, judging by the costs per hotel star. The exceptions are FR and BE, where 5- and 4-star hotels are the best performers.

Since I don't have historical data to compare against, I cannot say definitively whether this is an improvement, but if nothing changes in the next period, continuing to invest in ads for 3- and 4-star hotels looks like a reasonable decision.


Predictive modeling

Using R, I set up some simple models to predict the number of bookings per hotel in the US. There is room for much better and more complex models over this data, but in two hours' time this is just to give an idea.
#Generate a model to estimate the number of Bookings taking into account
#Hotel_Stars + Hotel_Rating + Hotel_Luxurious + Hotel_City
library(caret)
library(doParallel)
#10-fold cross-validation, repeated ten times; caret will test the model
#across the folds and pick the best-performing tuning parameters
fitControl <- trainControl(method = "repeatedcv", ## 10-fold CV
                           number = 10,
                           repeats = 10,          ## repeated ten times
                           preProcOptions = list(thresh = 0.8))
#selectionFunction = "best" could be added to choose the optimal tuning parameters
#Run parallel executions on 12 cores; change this to the number of cores
#you want to dedicate to the model
registerDoParallel(cores = 12)
#Formula for the model
formula <- Bookings ~ Hotel_Stars + Hotel_Rating + Hotel_Luxurious + Hotel_City
#Basic linear model
model <- lm(formula, data = Hotels.US)
#Random forest with 10-fold cross-validation, scaling and centering the data
model2 <- train(formula, data = Hotels.US,
                method = "rf",                  #random forest
                trControl = fitControl,         #10-fold cross-validation
                preProc = c("center", "scale"), #scaling and centering the data
                tuneLength = 8,
                importance = TRUE) #so we can later extract variable importance and fine-tune even further
#Save the model to disk
save(model2, file = "model2.RData")
#Add the final predictions as a new column in the Hotels.US data frame
Hotels.US$PredictedBookings <- predict(model2, Hotels.US)
#Check the correlation between Clicks and Cost
cor(Hotels.Data$Clicks, Hotels.Data$Cost, method = "pearson") #high correlation
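For reference, a rough Python (scikit-learn) analogue of the caret setup above, sketched on synthetic data; the feature layout and numbers are assumptions, not the client's data:

```python
# Repeated k-fold cross-validated random forest, a Python analogue of the
# caret train() call. Data is synthetic; feature meanings are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))  # stand-ins for stars, rating, luxury, encoded city
y = 5 + 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=120)  # "Bookings"

# Center and scale before fitting, as in preProc = c("center", "scale")
model = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))

# 10-fold CV repeated 3 times (the caret call used 10 repeats; fewer here for speed)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(round(scores.mean(), 2))  # mean R^2 across the 30 folds
```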
Other ideas:
I considered setting up a model to find the best click price per city and country; this could help in the final bidding procedure.
Another idea is forecasting the demand for hotels in a specific area. This is a more complex model and cannot be done within a day. Since I have already set up this kind of model in the airfare industry, I am comfortable explaining it in detail if needed.
I am using a chart to explain the idea as simply as possible:


Current machine learning models

I used two models: a linear model and a random forest.
I demonstrated 10-fold cross-validation for testing and choosing the best-performing configuration of the random forest.
Neither model is perfect, and there is a lot of room for improvement. I am aware that the chosen variables can be fine-tuned by adding or removing some of them; for this, I included the importance parameter in the random forest model to help with fine-tuning.
I have not tuned or tested other models for lack of time, but I am fully able to do so if requested.

Advanced features engineering of machine learning models

I set up a multi-linear exhaustive screening of a candidate variable set to find the best model for Bookings.
This can take a long time, and the result gives the most significant variables for a model predicting the number of Bookings, which in turn helps tune our models.
This is only one approach to automatic exhaustive model tuning.
## Advanced feature engineering
## Multi-linear exhaustive screening of a candidate set to find the best model for Bookings.
# This can take a long time; the result gives the most significant variables
# for setting up a model to predict the number of Bookings
library(glmulti)
global.model <- glm(Bookings ~ Average_Book + Clicks + Cost + Hotel_City + Hotel_Country +
                      Hotel_Rating + Hotel_Stars + Hotel_Luxurious,
                    data = Hotels.Data)
model <- glmulti(global.model, #use the model built above as a starting point
                 level = 1,    #just look at main effects
                 crit = "aicc") #use AICc because it works better than AIC for small samples
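The same exhaustive idea can be sketched outside R; the small Python example below (the data and the aic helper are my own, purely for illustration) fits ordinary least squares on every subset of candidate predictors and keeps the one with the lowest AIC:

```python
# Exhaustive screening sketch: fit OLS on every non-empty subset of
# candidate features and keep the lowest-AIC model. Data is invented.
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)
n = 200
features = {
    "clicks": rng.normal(size=n),
    "cost":   rng.normal(size=n),
    "stars":  rng.normal(size=n),
}
# "bookings" truly depends only on clicks and cost
y = 3 + 2 * features["clicks"] + features["cost"] + rng.normal(scale=0.3, size=n)

def aic(subset):
    # OLS via least squares, then Gaussian AIC = n*log(RSS/n) + 2k
    X = np.column_stack([np.ones(n)] + [features[f] for f in subset])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1  # coefficients plus the error variance
    return n * math.log(rss / n) + 2 * k

subsets = [s for r in range(1, 4) for s in itertools.combinations(features, r)]
best = min(subsets, key=aic)
print(best)  # the screening should recover 'clicks' and 'cost'
```

glmulti does this, plus interactions and smarter search strategies, in R.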

Future actions for improving machine learning models and observations

As a future action I always recommend model ensembling: training different models with different parameters and combining them into one final model. This can improve the model accuracy.
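A toy sketch of the idea (Python, synthetic data): average the predictions of two different regressors. By convexity, the ensemble's squared error is never worse than the average of the two models' errors:

```python
# Toy ensembling sketch: average the predictions of a linear model and a
# random forest. Data is synthetic; this only illustrates the idea.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 1 + X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.4, size=300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Simple ensemble: the mean of the two models' predictions
ensemble_pred = (lin.predict(X_test) + rf.predict(X_test)) / 2
print(round(mean_squared_error(y_test, ensemble_pred), 3))
```

In practice one would weight the members by validation performance rather than averaging them equally.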


Low conversion

A low conversion rate doesn't always mean that you're getting invalid clicks. A number of different factors can lead to high traffic but relatively few sales. Here are some common causes of low conversion rates:
Changes in market conditions, user behavior, and web content
Sites that are difficult to navigate
Your keywords and ad text
Your campaign may not be optimized for the Display Network
