Deep Contextual Bandits in Travel Ecommerce — journey towards real time machine learning

Narasimha M
MakeMyTrip-Engineering
10 min read · Jan 5, 2022


Co-authored by data scientist RaviKiran Daparthy

Other key contributors (data scientists): Chandrakant N, Tarang Agrawal

1. Introduction

Many e-commerce sites or apps have subsystems or microservices running behind the scenes, making many decisions. A system could be selecting a choice out of a few options based on a rule or criteria or based on the output of another recommendation/decision system.

AI systems could, for example, use bookings, clickstream data, user, device, and other metadata to learn. Supervised AI/ML models trained on such “offline” data are tuned based on offline metrics such as accuracy, F1 score, AUC-ROC, MAPE/MSE error, etc.

While supervised-ML models perform quite well against a static/rule-based baseline, their impact on business may not be consistent. Data drift, covariate shift, training-data sampling biases, position/click biases, data logging issues, omitted confounding variables, unexplained features and many other factors can cause misalignment between the offline performance of a model and its impact on business/product metrics post-production.

Through appropriate data augmentation, correction, sampling, MLOps, data quality, and modeling techniques, these issues can be addressed to a large extent. However, there can still be a gap in learning. For example, there might not be sufficient data exposed to “decision A” as opposed to decisions “B”, “C” and “D”, so no model can accurately learn the expected outcome of decision A for every context. Even where decision A would be optimal for many subgroups, models may pick other options. Supervised AI models, especially those without uncertainty quantification, perform poorly under such circumstances.

Also, the business/behavioural outcome from a user, given a recommended decision, may not come immediately, i.e. the feedback can be implicit and delayed. For example, users might browse travel choices across multiple sessions in a day, or even across multiple days, before finalizing a decision. Throughout that decision-making journey, a backend system could have recommended multiple decisions, and users would have seen multiple choices, touchpoints and pieces of information, both in-app and on competitor sites, before finalizing. This brings in a layer of multi-touch, multi-decision attribution complexity: attributing the outcome/reward to the various touchpoints and decisions made by backend systems.

On a specific dataset, one could experiment with various supervised ML models, NN architectures, loss functions, etc., guided by offline metrics. But the e-commerce sector has many different problems/systems, and the same supervised ML model form or architecture might not yield satisfactory online results for all of them.

To that end, reinforcement learning (fully or partially observed Markov decision processes) and, by extension, contextual bandit methods have been useful for at least a subset of applied data science problems: mainly, bridging the gap between offline learning and online, feedback-based sequential learning. They also bring a few other handy benefits (though NOT unique to the reinforcement learning framework), such as a provision to account for uncertainty (at least aleatoric uncertainty, and in some cases epistemic uncertainty too), and a provision to model both immediate and discounted future rewards/outcomes.

These advantages also push data science systems towards real time machine learning or real time AI.

In this blog we focus on contextual bandit framework. As you may know, contextual bandit algorithms capture two key themes — “exploitation” and “exploration”.

Exploitation:

Learn “expected rewards” as a function of the context, using state-of-the-art models: linear or non-linear function approximators, especially DNNs. After training, use the function approximator to predict the expected reward and choose the decision with the maximum expected reward, given a context.

Exploration:

Acknowledge the uncertainty in reward given a context and utilize it optimally. Use models that have provisions for uncertainty quantification. Explore all decisions to some extent, especially in the early stages of sequential learning. As more data accumulates for all arms, exploration should ideally decrease and exploitation should strengthen, all else being equal.
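As a toy illustration of these two themes, here is a minimal epsilon-greedy sketch (one of the simplest bandit policies, revisited in Section 3); the arm names and reward probabilities are hypothetical and purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = {"arm_A": 0.04, "arm_B": 0.06, "arm_C": 0.05}   # hypothetical reward rates
arms = list(true_ctr)
shows = {a: 0 for a in arms}
clicks = {a: 0 for a in arms}
epsilon = 0.1                                              # share of traffic reserved for exploration

for t in range(10_000):
    if rng.random() < epsilon:                             # explore: pick a random arm
        arm = rng.choice(arms)
    else:                                                  # exploit: best empirical reward rate so far
        arm = max(arms, key=lambda a: clicks[a] / shows[a] if shows[a] else 0.0)
    shows[arm] += 1
    clicks[arm] += rng.random() < true_ctr[arm]            # simulated binary reward

print({a: round(clicks[a] / max(shows[a], 1), 4) for a in arms})
```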

In the next section, we discuss domain-specific scenarios for context.

2. DOMAIN

In travel e-commerce, decisions from an online model/agent are passed back to the environment (apps and customers), and the customers react in the same session or in subsequent sessions (delayed feedback). The context, recommended action(s), and the reward/outcomes are logged in clickstream data tables (off-policy data) which are used by RL/bandit models to learn sequentially to make better decisions.

What types of decision systems at MakeMyTrip have RL/bandits touched upon?

  • How much importance to give to hotels that were clicked vs booked vs reviewed in the past, while estimating their value for personalized hotel recommendations.
  • Which slots to select while boosting or interspersing properties (Apartments, villas, cottages, hostels)?
  • How to boost hotels that have high “value” now (episodic), which in past might have had poor market presence?
  • Which payment/insurance option to show as a function of context?
  • Vertical or horizontal placement of special panels (Ads, games, etc).
  • What coupon % should be offered for flights or hotels?

Our data science teams have successfully experimented with various bandit models, in particular the following model types:

  • Contextual bandits
  • Linear contextual Thompson sampling
  • Linear UCB
  • Deep NN + linear contextual Thompson sampling

The next section describes models and variations.

3. Contextual Bandits — models

Introduction

In the previous sections, we discussed use cases and the role of supervised learning vs reinforcement-learning-style contextual bandit algorithms. In this section, we will discuss linear and deep contextual bandits, especially with the Thompson sampling method.

Let us call each decision an “action” or “arm”, the features/predictors/variables the “context”, and the business outcome given arm_i and context_x the “reward”. Also, note that the action space is discretized; we save continuous-action-space methods for another discussion.

There are many algorithms in the literature introducing bandits and showing how bandits improve on alternatives such as a static action, random selection, or an A/B-test-based action selection system. These model forms tend to follow one of the following themes:

  • Epsilon Greedy, adaptive epsilon greedy variants
  • Upper confidence bound method (UCB, linear UCB, constrained UCB-ALP, etc).
  • Thompson Sampling (TS) variants (constrained TS, linear contextual TS, Deep NN Thompson sampling, etc.).

Thompson sampling and UCB methods have been highly competitive and robust as per the literature. At MakeMyTrip, we have experimented with both methods on different sets of problems. However, the Thompson sampling method has been used more often by our DS team because it looks at “uncertainty” through a distributional lens, which can be modeled using many methods.

For example,

  • Use Beta distribution to learn expected reward/regret distribution, per arm, within mutually-exclusive collectively-exhaustive context cohorts.
  • Classical linear/logistic regression, leveraging the variance-covariance around the parameters plus an expected “noise level”.
  • Bayesian linear/logistic regression models where parameter distributions are estimated, to account for both epistemic and aleatoric uncertainty.
  • Deep NN + Linear TS
  • Deep learning model, leveraging “drop out” perturbation during inference to mimic uncertainty distribution
  • And, the gamut of Bayesian NN models.

3.a Thompson sampling, within context cohorts

Algorithm to build sequential-learning TS bandits: treat each context cohort as an independent “population”, with no partial or full pooling of information across groups.
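A minimal sketch of this per-cohort Thompson sampling, assuming binary rewards (e.g., click/no-click) and independent Beta posteriors per (cohort, arm); the cohort key and priors below are illustrative.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(7)
N_ARMS = 3

# One independent Beta(alpha, beta) posterior per (cohort, arm): no pooling across cohorts.
alpha = defaultdict(lambda: np.ones(N_ARMS))
beta = defaultdict(lambda: np.ones(N_ARMS))

def choose_arm(cohort):
    """Thompson sampling: draw one sample per arm from its posterior and pick the max."""
    samples = rng.beta(alpha[cohort], beta[cohort])
    return int(np.argmax(samples))

def update(cohort, arm, reward):
    """Conjugate Beta-Bernoulli update for the observed (possibly delayed) binary reward."""
    alpha[cohort][arm] += reward
    beta[cohort][arm] += 1 - reward

# Example interaction for a hypothetical cohort key.
cohort = ("mobile_app", "domestic_hotels")
arm = choose_arm(cohort)
update(cohort, arm, reward=1)
```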

3.b Linear Contextual Bandits

How about pooling information across cohorts and learning the expected reward using a function approximator, while also accounting for uncertainty in reward? This becomes necessary when there are hundreds or thousands of context cohorts, without sufficient data to learn in each cohort independently.

Let us start with the linear regression function approximator.

Linear regression basics (closed-form, analytical solution):
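For reference, the standard ordinary-least-squares closed form and coefficient covariance (presumably what the “Basics” figures walked through) are:

$$
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}Y,
\qquad
\operatorname{Var}(\hat{\beta}) = \sigma^{2}\,(X^{\top}X)^{-1}
$$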

Next, let us review linear contextual bandits with the Thompson sampling method. There are two approaches: 1) sequential learning and 2) batch/off-policy data learning.

For sequential learning TS, Gaussian priors and a Gaussian likelihood function are used to design the algorithm. If you have an off-policy/offline dataset from which you want to learn, the formulae would be slightly different.

Sequential learning TS bandits

The literature proposes an algorithm along the following lines:
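A minimal sketch of such a sequential-learning linear TS loop, assuming a Gaussian prior with precision λ and a Gaussian noise scale v (both hyperparameters); the context dimension, number of arms, and the simulated interaction below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
d, n_arms = 8, 3            # context dimension and number of arms (illustrative)
lam, v = 1.0, 0.25          # prior/ridge precision and noise-scale hyperparameters

# Per-arm sufficient statistics for the Gaussian posterior over beta_a.
B = [lam * np.eye(d) for _ in range(n_arms)]   # X'X + lam*I, accumulated online
f = [np.zeros(d) for _ in range(n_arms)]       # X'y, accumulated online

def select_arm(x):
    """Sample beta_a ~ N(mu_a, v^2 * B_a^-1) per arm and pick the best sampled reward."""
    scores = []
    for a in range(n_arms):
        B_inv = np.linalg.inv(B[a])
        mu = B_inv @ f[a]
        beta_tilde = rng.multivariate_normal(mu, v ** 2 * B_inv)
        scores.append(x @ beta_tilde)
    return int(np.argmax(scores))

def update(a, x, reward):
    """Rank-one Bayesian update for the chosen arm only."""
    B[a] += np.outer(x, x)
    f[a] += reward * x

# One round of interaction with a simulated context and reward.
x = rng.normal(size=d)
a = select_arm(x)
update(a, x, reward=1.0)
```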

Off-policy/batch dataset learner TS bandit

If you have an off-policy dataset with a significant amount of data per arm from an existing system (rule-based or an older RL/bandit system), then instead of starting with priors (diffuse or informed) and updating parameter posteriors via Bayes’ rule, you can estimate the linear regression coefficients βp and the variance/covariance of βp directly.
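A minimal sketch of that batch estimation step, assuming a logged off-policy dataset with per-row context features, the chosen arm, and the realized reward; the column names and the synthetic log below are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_arm(df_arm, feature_cols, reward_col="reward", lam=1.0):
    """Estimate beta_hat and its covariance for one arm directly from logged data."""
    X = df_arm[feature_cols].to_numpy(dtype=float)
    y = df_arm[reward_col].to_numpy(dtype=float)
    n, p = X.shape
    B = X.T @ X + lam * np.eye(p)
    B_inv = np.linalg.inv(B)
    beta_hat = B_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / max(n - p, 1)     # residual-variance noise estimate
    return beta_hat, sigma2 * B_inv            # posterior mean and covariance for TS sampling

# Illustrative usage with a tiny synthetic log; at serving time, sample
# beta_tilde ~ N(beta_hat, cov) per arm and score the live context vector.
log = pd.DataFrame({"x1": [0.1, 0.4, 0.7, 0.2], "x2": [1.0, 0.0, 1.0, 0.0],
                    "arm": [0, 0, 1, 1], "reward": [0.0, 1.0, 1.0, 0.0]})
beta_hat, cov = fit_arm(log[log["arm"] == 0], feature_cols=["x1", "x2"])
beta_tilde = np.random.default_rng(1).multivariate_normal(beta_hat, cov)
```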

In both methods, the variance (including the “noise level”) around the coefficients is used as a lever to acknowledge uncertainty in the expected or actual reward as a function of context. When variance is high, decisions are implicitly driven more by “exploration” than when variance estimates are low, i.e. when the confidence bounds of the coefficients are “tighter”.

The “noise level/scale”:

The literature on bandits has treated this as a hyperparameter.

  1. For example, a constant value (0.0025, 0.005, 0.2, 2, 5, 10, etc.) is assumed as the hyperparameter, and simulated reward/regret is used to identify the optimal “noise level”.
  2. A few references point us to the classical unbiased estimate σ̂² = Σᵢ (Yᵢ − Ŷᵢ)² / (n − p). Perhaps you can experiment with it, or with a scaled-down version of σ̂², where the scale-down factor is a hyperparameter based on the training sample size and dataset (a short sketch follows).
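A short sketch of option 2, exposing both the classical estimate and an optional scale-down factor; the function name is hypothetical.

```python
import numpy as np

def noise_level(y, y_hat, p, scale=1.0):
    """Residual variance sigma^2 = sum((y - y_hat)^2) / (n - p), optionally scaled down."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = y.size
    sigma2 = np.sum((y - y_hat) ** 2) / max(n - p, 1)
    return scale * sigma2
```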

How to improve this linear algorithm?

The literature proposes adding a ridge regularization hyperparameter λ to the regression.

3.c Ridge linear regression

Motivation:

  • Often projects have features (context) that are highly or partially correlated with each other.
  • And on occasion, the correlation between the outcome (Y) and the features can be very weak or sparse.
  • Ill-conditioned feature/context matrix, where (X′X)⁻¹ cannot be computed if X′X is singular or near-singular. You may observe this on rare occasions when features are highly correlated with one another and data sizes are small.

Impact:

  • β parameters may not generalize outside the training samples.
  • Small changes to context could cause unrealistic changes in β parameters and/or to (X′X)⁻¹
  • When features are correlated with one another, this inflates the variance around the estimates (see Variance Inflation Factor), i.e. the confidence bounds around the β coefficients. Every time the β coefficient vector is sampled from the distribution, it is likely to change a lot, making your algorithm “explore” more and “exploit” less than is optimal.

To estimate β coefficients that generalize better outside the training sample, an L2 (ridge) regularization cost is imposed on the magnitude of the coefficients in the loss function. The closed-form analytical solution is as follows:
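The standard ridge closed form (and the covariance typically plugged into the TS sampler in its place) is:

$$
\hat{\beta}_{\text{ridge}} = (X^{\top}X + \lambda I)^{-1}X^{\top}Y,
\qquad
\operatorname{Cov}(\tilde{\beta}) \approx v^{2}\,(X^{\top}X + \lambda I)^{-1}
$$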

These can be plugged easily into the TS bandit algorithm described above. Both the v² “noise level” and λ are hyperparameters.

3.d. Deep learning bandits

Motivation:

  • Obviously, linear function approximators are not ideal for capturing interaction effects.
  • To strike the right balance of “exploration” and “exploitation”, we want to quantify uncertainty as accurately as possible. As seen in the previous section, higher multicollinearity between features inflates the confidence bounds and variance estimates of β.

Passing the context features through a DNN helps with 1) transformations, 2) feature interactions, and 3) a feature representation whose components tend to be decorrelated from one another.

DNN Architecture — Illustrative
  • Train a deep learning function approximator to predict the expected reward, using an appropriate loss function.
  • After training the model, take the L-1 (last hidden) layer representation (zᵖ) as the “features” and estimate β and var(β) per arm/action using linear ridge contextual TS.
Training and inference steps (a minimal sketch follows)
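A minimal sketch of this two-stage “neural-linear” setup. The trained DNN body is stood in for here by a fixed random ReLU projection (a placeholder for whatever architecture is actually trained); the per-arm Bayesian ridge TS head on top of the L-1 representation is the same machinery as in Sections 3.b/3.c, and all dimensions and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_rep, n_arms = 16, 32, 3     # raw context dim, L-1 representation dim, number of arms
lam, v = 1.0, 0.25                  # ridge and noise-scale hyperparameters

# Placeholder for the trained DNN body: in practice z^p is the last hidden layer
# of the trained network; a fixed random ReLU projection stands in for it here.
W = rng.normal(size=(d_in, d_rep)) / np.sqrt(d_in)

def representation(x):
    return np.maximum(x @ W, 0.0)   # z^p, the L-1 layer output

# Per-arm Bayesian ridge head on top of z, exactly as in the linear TS case.
B = [lam * np.eye(d_rep) for _ in range(n_arms)]
f = [np.zeros(d_rep) for _ in range(n_arms)]

def select_arm(x):
    """Thompson sampling in representation space: sample beta per arm, pick the best score."""
    z = representation(x)
    scores = []
    for a in range(n_arms):
        B_inv = np.linalg.inv(B[a])
        mu = B_inv @ f[a]
        beta_tilde = rng.multivariate_normal(mu, v ** 2 * B_inv)
        scores.append(z @ beta_tilde)
    return int(np.argmax(scores))

def update(a, x, reward):
    """Update the chosen arm's head with the observed reward in representation space."""
    z = representation(x)
    B[a] += np.outer(z, z)
    f[a] += reward * z

x = rng.normal(size=d_in)
a = select_arm(x)
update(a, x, reward=1.0)
```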

Conclusion

RL/Contextual bandits have been very useful in the e-commerce space. There are many opportunities for real time AI/ML in e-commerce, and bandits or reinforcement learning frameworks have significant potential.

At MakeMyTrip and Goibibo, RL/bandits have yielded ~4% to 15% lift on business metrics (compared to baseline) such as conversion ratio or room nights, or CTR, or Ad Tech revenue.

There is healthy scope for improvement to look forward to, with more advanced model forms, function approximators and uncertainty estimators.

For example:

  1. Explore the full MDP with discounted rewards, i.e. RL. While we have dabbled with full MDPs at MakeMyTrip, significant progress is yet to be made to ensure these models offer stable, intuitive solutions, which is quite hard with off-policy, limited-sample datasets that have very high noise-to-signal ratios.
  2. Experiments with continuous action space bandits/RL, in place of discrete action.
  3. Explore Multinomial Bandits, for example, MNL TS bandits
  4. Experiment with Gated Linear Network bandits or other function approximators.
  5. Experiment with models in the intersection of reinforcement learning and probabilistic inference.

In another blog, we will explain how our data scientists are solving multi-objective/reward problems in conjunction with RL/bandit policy recommenders.

If you are curious to learn how to build large-scale real time AI/ML systems based on the above concepts, or about model training nuances/caveats, reach out to us.

Finally, thanks to the data engineers, software engineers (for data), product, category, and revenue managers who contributed to these projects.

Special thanks to the RL/bandit data scientists who led these projects to success: Vijay Yadav, Nevin Kollanoor, Pulkit Bansal, Debayan Mitra, Mayur Hampiholi.
