Deep Contextual Bandits for model selection in travel e-commerce

Vijay Yadav
MakeMyTrip-Engineering
7 min read · Nov 11, 2022


Primary authors: Vijay Yadav, Deepak HR

Co-author: Narasimha

1. INTRODUCTION

MakeMyTrip has multiple ranking/recommendation systems. We use learning-to-rank, sequence-based recommendation models, content- and behavior-representation-based algorithms, and collaborative filtering.

After browsing around and searching for hotels in city A, B, or C, a substantial portion of customers search for a single hotel and review its responses/price/content. For such “direct hotel search” scenarios, it is prudent to show a few alternatives as well, so customers can compare and contrast. This matters even more when alternative hotels can bring a higher value proposition to the business or to the customer (for example, better service or value for the same price). In this scenario, one could recommend hotels similar to the pivot hotel the user has searched.

Recommendation page for single hotel search

2. APPROACHES

Ranking algorithms use content, behavior, model-based representation learning, and metadata such as location and price. In many destinations, especially tourist destinations, customers search for high-end properties, resorts, and villas, and look for similar properties which may not be in the same location/area.

While representations learnt from user behavior (GNN embeddings) were very powerful and useful in our context, we wondered why not enhance the algorithm with content-based representations or similarities too. Note that MakeMyTrip and GoIbibo have many hotel reviews and also hotel images.

There are multiple ways to handle such problems other than contextual-bandit model selection. One could train a joint embedding model, or combine behavioral and content signals into a single ranking/recommendation model. One can define a loss over behavioral data and estimate the appropriate NN weights/representations. For example, one can frame the problem as a link prediction problem and use GNN or heterogeneous GNN models. Each method has its own advantages and disadvantages. If the loss is defined over a behavioral outcome (hotel click yes/no), the model might underperform in long-tail scenarios unless it has a provision to generalize to unseen data or to handle uncertainty. Also, how closely the model target is linked to the online business metric of interest is subjective.

Similar to the algorithm-selection experiments run by Amazon, Spotify, and Netflix, we explored contextual bandits. In our model we took a neural network + linear Thompson sampling (TS) based approach. There were a few interesting patterns in this project, and it led to a 1.5%-2% lift in overall conversion ratio. When the “pivot hotel” is sold out, the impact of recommending good alternatives is much larger (20%+ lift).

The rest of the blog explains this approach.

Example business applications by action set sizes

3. OUR METHOD

Train multiple models with the data you have, then learn from the feedback loop which model is appropriate for a given context.

Model 1: Shallow GNN embedding similarity. Train node2vec hotel embeddings on user-hotel click/browse sequence data and compute cosine similarity between hotels.

Representation learning is indeed a core ingredient in the deep learning recipe, and it has worked quite well at MakeMyTrip. There are methods to train representations with or without a prior belief. Even though we train these models without strong priors, they are able to estimate great representations in most cases. If we look at which hotels are similar to one another in the representation space and project them onto a 2D map (illustrative), we find intuitive clusters. Besides location-based similarity assessment, we use multiple quantitative metrics to assess the quality of these models and representations.
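To make Model 1 concrete, here is a minimal sketch using the open-source node2vec package on a toy click co-occurrence graph; the hotel IDs, edge weights and hyper-parameters are illustrative, not our production values.

```python
# Minimal sketch of Model 1: node2vec hotel embeddings + cosine similarity.
# Hotel IDs, edge weights and hyper-parameters below are illustrative only.
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# Build a graph where edges connect hotels that co-occur in user click/browse
# sessions, weighted by how often they co-occur.
click_graph = nx.Graph()
click_graph.add_weighted_edges_from([
    ("hotel_A", "hotel_B", 12),
    ("hotel_A", "hotel_C", 7),
    ("hotel_B", "hotel_C", 3),
    ("hotel_C", "hotel_D", 9),
])

# Random walks + skip-gram to learn shallow node embeddings.
n2v = Node2Vec(click_graph, dimensions=64, walk_length=20, num_walks=50, workers=2)
model = n2v.fit(window=5, min_count=1)

# Cosine similarity between the pivot hotel and a candidate alternative.
print(model.wv.similarity("hotel_A", "hotel_C"))

# Top-k most similar hotels to the pivot, used as candidate recommendations.
print(model.wv.most_similar("hotel_A", topn=3))
```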

Model 2: User review (user-generated content) based similarity. We concatenate all user reviews of a hotel and have experimented with multiple sentence representations in the past:

  • FastText + IDF embeddings
  • Custom domain specific embeddings
  • Google universal sentence encoder

These methods are highly competitive, and universal sentence encoders are quite rich and easy to use.
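For illustration, a minimal sketch of the review-based similarity using the Universal Sentence Encoder from TF Hub; the review snippets and the ranking code are placeholders, not our production pipeline.

```python
# Sketch of Model 2: review-text similarity with the Universal Sentence Encoder.
# The review snippets are made up for illustration.
import numpy as np
import tensorflow_hub as hub  # pip install tensorflow tensorflow-hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# One concatenated review blob per hotel (illustrative).
hotel_reviews = {
    "hotel_A": "Great sea view, friendly staff, breakfast could be better.",
    "hotel_B": "Spacious rooms near the beach, excellent buffet breakfast.",
    "hotel_C": "Budget stay close to the railway station, basic amenities.",
}

ids = list(hotel_reviews)
emb = encoder(list(hotel_reviews.values())).numpy()     # (n_hotels, 512)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

sim = emb @ emb.T                                       # cosine similarity matrix
pivot = ids.index("hotel_A")
ranked = sorted(zip(ids, sim[pivot]), key=lambda t: -t[1])
print(ranked)  # hotel_A itself first, then the most review-similar alternatives
```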

Model 3: Metadata-based similarity. Cosine similarity computed on metadata vectors.
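An illustrative sketch of Model 3; the metadata features below (star rating, normalized price, amenity flags) are assumptions, not our exact schema.

```python
# Sketch of Model 3: cosine similarity over hand-built metadata vectors.
# The features (star rating, normalized price, amenity flags) are illustrative.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

hotels = ["hotel_A", "hotel_B", "hotel_C"]
#                 [star/5, price_norm, pool, spa, beach_front]
meta = np.array([[5 / 5,  0.90,       1,    1,   1],    # hotel_A
                 [4 / 5,  0.75,       1,    0,   1],    # hotel_B
                 [3 / 5,  0.30,       0,    0,   0]])   # hotel_C

sim = cosine_similarity(meta)
pivot = hotels.index("hotel_A")
print(sorted(zip(hotels, sim[pivot]), key=lambda t: -t[1]))
```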

Model 4: Location-based proximity. Uber's hexagonal hierarchical spatial indices (H3) are used to encode each hotel's location into index cells and to compute proximity. It is a highly flexible and efficient library for our use case.
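A small sketch with the h3-py bindings; the v3 API, coordinates and resolution chosen here are our assumptions.

```python
# Sketch of Model 4: location proximity with Uber's H3 hexagonal indices.
# Coordinates and resolution are placeholders; the h3-py v3 API is assumed.
import h3  # pip install h3

# Index each hotel's lat/long into a hexagonal cell at resolution 8 (~0.74 km^2 cells).
pivot_cell = h3.geo_to_h3(15.4989, 73.8278, 8)   # pivot hotel
cand_cell = h3.geo_to_h3(15.5045, 73.8302, 8)    # candidate alternative

# Grid distance = number of hexagon hops between the two cells; smaller is closer.
print(h3.h3_distance(pivot_cell, cand_cell))

# Candidate pool: all cells within k hops of the pivot hotel's cell.
nearby_cells = h3.k_ring(pivot_cell, 2)
print(len(nearby_cells))
```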

4. DATA DECISION FLOW

Customers search directly for a hotel (the pivot hotel); the searched hotel info is passed to the back-end DS ranking systems, which retrieve recommended alternative hotels. The architecture is described below. Our data science RL/bandit experimentation platform “ODIN” helps speed up such experiments.

Each sub-model is an arm of a contextual multi-armed bandit. Initially, arms (model selections) are chosen at random, for exploration. Customers respond to the recommendations by clicking (implicit feedback) either the searched hotel (pivot hotel) or one of the recommended hotels. This data is fed back continuously to the contextual bandit model training step.
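For illustration, this is roughly the shape of the feedback record each impression could log back to the bandit trainer; the field names are assumptions, not the actual ODIN event schema.

```python
# Illustrative feedback record logged per recommendation impression; field names
# are assumptions, not the actual ODIN event schema.
from dataclasses import dataclass
from typing import List

@dataclass
class BanditFeedback:
    request_id: str
    context: List[float]   # context embedding from the deep model (Section 6)
    chosen_arm: int        # which sub-model / ensemble served the recommendations
    reward: float          # e.g. 1.0 if a recommended hotel was clicked, else 0.0

# Example: user searched a pivot hotel, arm 2 served the alternatives, user clicked one.
event = BanditFeedback(request_id="r-123", context=[0.12, -0.4, 0.9],
                       chosen_arm=2, reward=1.0)
```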

5. DATA DISTRIBUTION

The distribution of clicks looks approximately as below: skewed, not unlike a zero-inflated log-normal distribution.

Distribution of outcome (illustrative)

Here “Pivot Hotel” is the direct/single hotel search, “Rec Htl 1” is the 1st alternative recommended hotel, “Rec Htl 2” is the 2nd alternative recommended hotel, and so on up to “Rec Htl 11”, the 11th alternative recommended hotel.

6. MODELS

There are two parts:

  1. A deep learning model to learn context embedding.
  2. A contextual bandit model, to factor in reward predictability and uncertainty.

Contextual bandit problems are a special case of reinforcement learning where the state (context) at each time step t is independent of the states and actions at previous time steps. Here the goal is to maximize the reward yₜ obtained by taking the action aₜ for a given input context xₜ at each time step t.
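As an illustration, such an agent boils down to two operations; the interface below is ours, for exposition only.

```python
# Illustrative contextual-bandit agent interface (names are ours, for exposition).
from abc import ABC, abstractmethod
import numpy as np

class ContextualBanditAgent(ABC):
    @abstractmethod
    def select_action(self, context: np.ndarray) -> int:
        """Choose an arm a_t for the context x_t."""

    @abstractmethod
    def update(self, context: np.ndarray, action: int, reward: float) -> None:
        """Incorporate the observed reward y_t for the pair (x_t, a_t)."""
```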

To learn more about contextual bandit applications, read the other MakeMyTrip tech blogs.

  1. Deep Learning model:

The primary objective of this model is to estimate a contextual representation vector and use it in a linear contextual bandit (linear Thompson sampling).

It predicts the outcome distribution, which is akin to a zero-inflated log-normal (ZILN) distribution.

Feature set: search parameters such as pax, number of rooms, length of stay, city embedding, the pivot hotel's embedding and metadata, and the gap between the date of browsing and the check-in date.

There are two components to fit:

  1. The probability of clicking the pivot hotel vs. switching to an alternative recommendation.
  2. Conditioned on the user clicking an alternative recommendation, which recommendation was clicked.

The loss function is defined as follows:

L(x; p, μ, σ) = CrossEntropy(1{pivot click}; p) + 1{switch} · L_Lognormal(x; μ, σ)

where x is the reciprocal rank of the clicked recommendation and the log-normal loss is defined as

L_Lognormal(x; μ, σ) = log(x · σ · √(2π)) + (log x − μ)² / (2σ²)

The first term corresponds to the classification loss (probability of pivot hotel click, yes/no). The second term is the log of the error distribution for the Mean Reciprocal Rank (MRR).

The last layer of the NN has three pre-activation logit units, separately determining the pivot hotel click probability p, and the mean (μ) and standard deviation (σ) of the MRR distribution.

Illustrative network
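A hedged sketch of what such a network and its ZILN-style loss could look like in Keras, following the loss from reference 1; the layer sizes, feature dimension and label encoding (a pivot-click flag plus the reciprocal rank of the clicked alternative) are illustrative assumptions.

```python
# Hedged sketch of the deep model head and a ZILN-style loss (after reference 1).
# Layer sizes, feature dimension and label encoding are illustrative.
import numpy as np
import tensorflow as tf

n_features = 64  # illustrative context feature dimension


def ziln_style_loss(y_true, y_pred):
    """y_true: [pivot_clicked, reciprocal_rank]; y_pred: [p_logit, mu, sigma_raw]."""
    pivot_clicked = y_true[:, 0]
    rr = y_true[:, 1]            # reciprocal rank of clicked rec, 0 if pivot clicked
    p_logit = y_pred[:, 0]
    mu = y_pred[:, 1]
    sigma = tf.math.softplus(y_pred[:, 2]) + 1e-6

    # Term 1: classification loss - did the user click the pivot hotel or switch?
    cls_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=pivot_clicked, logits=p_logit)

    # Term 2: log-normal negative log-likelihood of the reciprocal rank, applied
    # only to impressions where the user switched to a recommended hotel (rr > 0).
    switched = tf.cast(rr > 0.0, tf.float32)
    safe_rr = tf.maximum(rr, 1e-6)
    lognormal_nll = (tf.math.log(safe_rr * sigma * np.sqrt(2.0 * np.pi))
                     + tf.square(tf.math.log(safe_rr) - mu)
                     / (2.0 * tf.square(sigma)))
    return tf.reduce_mean(cls_loss + switched * lognormal_nll)


# Simple feed-forward network; the penultimate layer is the context embedding
# that feeds the linear bandit in the next step.
inputs = tf.keras.Input(shape=(n_features,))
h = tf.keras.layers.Dense(128, activation="relu")(inputs)
h = tf.keras.layers.Dense(64, activation="relu", name="context_embedding")(h)
outputs = tf.keras.layers.Dense(3, name="ziln_logits")(h)  # [p_logit, mu, sigma_raw]

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss=ziln_style_loss)
```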

From the network, we pull out the “final hidden layer”, i.e. the (L−1) layer representation vector, for each training row and use it as features in the second step.
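Continuing the sketch above (so the model object and the layer name context_embedding are our assumptions), extracting that representation is a one-liner with the Keras functional API.

```python
# Continuing the sketch above: expose the penultimate "context_embedding" layer
# and use its output as the per-request context vector for the linear bandit.
import numpy as np
import tensorflow as tf

embedding_model = tf.keras.Model(
    inputs=model.input,
    outputs=model.get_layer("context_embedding").output,
)
context_vectors = embedding_model.predict(
    np.random.rand(5, 64).astype("float32"))  # 5 dummy requests
print(context_vectors.shape)                   # (5, 64)
```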

2. Contextual Bandit: Linear contextual Thompson sampling

Let us restate the notation.

At each time step t, the agent collects feedback about the action set a and updates θ. Since θ is unknown, we design the agent to first explore and then exploit its learnings.

In the sequential linear Thompson sampling model, these are the steps we follow:
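Roughly: for each arm, sample θ from its posterior, score the current context, serve the arm with the highest sampled score, observe the reward, and update that arm's posterior. Below is a minimal sketch assuming per-arm Bayesian linear regression with a Normal-Inverse-Gamma noise prior, in the spirit of references 2-4; the hyper-parameters are illustrative, not our production settings.

```python
# Minimal sketch of per-arm linear Thompson sampling with a Normal-Inverse-Gamma
# prior (in the spirit of references 2-4). Hyper-parameters are illustrative.
import numpy as np


class LinearTSArm:
    def __init__(self, dim, lam=1.0, a0=3.0, b0=3.0):
        self.precision = lam * np.eye(dim)   # B = lam*I + sum_t x_t x_t^T
        self.xty = np.zeros(dim)             # sum_t r_t x_t
        self.a0, self.b0 = a0, b0            # Inverse-Gamma prior on noise variance
        self.a, self.b = a0, b0
        self.n, self.sq_sum = 0, 0.0

    def sample_score(self, x):
        # 1) Sample the noise variance from its Inverse-Gamma posterior,
        # 2) sample theta ~ N(mu, sigma2 * B^-1), 3) score the context.
        sigma2 = self.b / np.random.gamma(self.a)
        mu = np.linalg.solve(self.precision, self.xty)
        cov = sigma2 * np.linalg.inv(self.precision)
        theta = np.random.multivariate_normal(mu, cov)
        return float(x @ theta)

    def update(self, x, reward):
        # Conjugate Normal-Inverse-Gamma update (zero prior mean on theta).
        self.precision += np.outer(x, x)
        self.xty += reward * x
        self.n += 1
        self.sq_sum += reward ** 2
        mu = np.linalg.solve(self.precision, self.xty)
        self.a = self.a0 + self.n / 2.0
        self.b = self.b0 + 0.5 * max(self.sq_sum - mu @ self.xty, 0.0)  # guard >= 0


def select_arm(arms, context):
    # Thompson sampling: every arm draws a plausible theta; serve the best draw.
    scores = [arm.sample_score(context) for arm in arms]
    return int(np.argmax(scores))


# Usage: 5 arms (4 sub-models + 1 ensemble), 64-d context from the deep model.
arms = [LinearTSArm(dim=64) for _ in range(5)]
context = np.random.rand(64)
chosen = select_arm(arms, context)
reward = 1.0   # observed click feedback (illustrative)
arms[chosen].update(context, reward)
```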

7. RESULTS AND INSIGHTS

In our experiment we have 5 arms: 4 arms correspond to the four sub-models, and one arm is an ensemble of all four. We ran the experiment for many days and analyzed the results.

  1. From the plots below we can see that the optimal arm depends on the context. The optimal arm turns out to be arm 2 when the context is the Android flavour, whereas it switches to arm 1 when the context is the desktop flavour.
  2. We can also see how the model moves from exploration to exploitation over the starting days. One can control the exploration by modifying the hyper-parameters for the noise scale (the priors of the Inverse-Gamma distribution).
#requests served by each arm: Arm 2 is selected when the context is the Android flavour
#requests served by each arm: Arm 1 is selected when the context is the desktop flavour

Arm 2 is the most rewarding and is chosen most often, but the other arms are not far behind; they are selected a sufficiently large number of times for certain sub-contexts (sub-contexts not shown here).

#requests served by each arm

Thanks to the readers who patiently read through the blog. Please share questions or comments if you have any.

References

  1. A Deep Probabilistic Model for Customer Lifetime Value Prediction
  2. Efficient Online Bayesian Inference for Neural Bandits
  3. Thompson Sampling for Contextual Bandits with Linear Payoffs
  4. Deep Bayesian Bandits Showdown
