Taking Travel E-commerce ML decisions to Pareto Frontier

Vijay Yadav
MakeMyTrip-Engineering
Apr 8, 2022 · 5 min read


Co-writer — Narasimha.M

In previous MakeMyTrip blogs, we discussed decision-making systems in the e-commerce space that recommend actions based on rules or learned models. We also discussed how reinforcement learning methods, contextual bandits combined with deep learning, and other function approximators can be used to learn the expected reward.

In those discussions, we assumed there is a single reward to measure. In many situations, however, multiple rewards must be considered when selecting an action. These rewards can be largely orthogonal or even in conflict with one another.

Here are a few illustrative use cases from the travel e-commerce space.

  • Ranking hotels based on “click-propensity” vs. “book-propensity.” Various ranking sub-modules make decisions trading off these two dimensions.
  • Ensembles of sub-models, each optimized for a different signal.
  • Weighing the importance/value of a user’s previously clicked, booked, and reviewed hotels while personalizing the ranking for that user.
  • Boosting a subset of “high opportunity” hotels which historically delivered poor value. In this case, we want to boost sub-clusters of good hotels to maximize their CTR (Click-Through Rate) while not negatively affecting overall business metrics.
  • Placing advertisements/gaming consoles to maximize impressions without deteriorating the quality of the regular funnel/business.
  • Slot selection for a sub-category of products, maximizing the sub-category’s CTR while retaining the overall bookings/conversion ratio.

Often, we combine objectives or rewards into a single weighted-average score using pre-defined weights that product, revenue, and data science teams agree are reasonable. However, evaluating or learning with such an a-priori weighted-average score can be sub-optimal: choices that are mediocre on every reward can end up beating choices that are far better on at least one of them. For example, with equal weights, an arm scoring (0.55, 0.55) averages 0.55 and beats an arm scoring (0.9, 0.15), which averages 0.525 despite being dramatically better on the first objective.

To handle such multi-objective problems, the literature recommends either “linear/non-linear scalarization” methods or the “Pareto frontier.” In this blog, we discuss the Pareto frontier and explain how we combined it with the contextual bandit problem.

Multi-objective Multi-Armed Bandits with Thompson Sampling

For illustration, let us consider simple multi-armed-bandit Thompson sampling instead of a contextual parametric model, and look at how it can be adjusted when there are multiple objectives.

For each objective 𝑑ⱼ ∈ (𝑑₁, 𝑑₂, …, 𝑑ₙ), we assume the prior (and hence the posterior) is a 𝐵𝑒𝑡𝑎(𝛼, 𝛽) distribution with starting values 𝛼 = 𝛽 = 1, and the likelihood is either a Bernoulli or a Binomial distribution (the Beta distribution is the conjugate prior for both).

We can update the posterior for each objective using Bayes’ rule. After observing the outcome of the selected arm on objective 𝑑ⱼ, the posterior becomes 𝐵𝑒𝑡𝑎(𝛼ⱼ + number of successes, 𝛽ⱼ + number of failures): 𝛼ⱼ is incremented on a success (e.g., a click or a booking) and 𝛽ⱼ on a failure.
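To make this concrete, here is a minimal Python sketch of the bookkeeping described above: one Beta posterior per (arm, objective) pair, a sampling step that yields an n-dimensional expected-reward vector per arm, and a Bernoulli-feedback update. The arm/objective counts and the example feedback are illustrative assumptions, not our production configuration.

```python
import numpy as np

# A minimal sketch of the bookkeeping above; n_arms, n_objectives and the
# example feedback are illustrative assumptions, not a production setup.
rng = np.random.default_rng(0)
n_arms, n_objectives = 5, 2                 # e.g., objectives = (click, booking)

# One Beta(alpha, beta) posterior per (arm, objective), initialized at (1, 1).
alpha = np.ones((n_arms, n_objectives))
beta = np.ones((n_arms, n_objectives))

def sample_reward_vectors():
    """Sample an n_objectives-dimensional expected-reward vector for every arm."""
    return rng.beta(alpha, beta)            # shape: (n_arms, n_objectives)

def update(arm, outcomes):
    """Conjugate Beta-Bernoulli update from one 0/1 outcome per objective."""
    for j, success in enumerate(outcomes):
        if success:
            alpha[arm, j] += 1
        else:
            beta[arm, j] += 1

# One round: sample reward vectors, pick an arm (selection rule discussed below),
# observe per-objective feedback, and update that arm's posteriors.
sampled = sample_reward_vectors()
chosen_arm = 0                              # placeholder; Pareto/LSF selection comes next
update(chosen_arm, outcomes=[1, 0])         # e.g., the hotel was clicked but not booked
```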

Once an n-dimensional expected-reward vector is sampled for each arm, the agent has to identify the optimal arm considering all rewards. This is where multi-objective optimization principles kick in.

We can either convert the multiple rewards into a single score and select the best arm, or select the arms that are not inferior across all objectives (i.e., non-dominated arms). The latter principle is best explained with Pareto frontiers.

What is the Pareto Frontier?

Imagine a multi-dimensional space where each dimension is a reward/objective scale: a large positive value implies a greater reward, while a small positive or negative value implies a lesser reward.

In such a multi-dimensional space, there exist multiple frontiers. Every point on a frontier is non-dominated within some subset of points. The best frontier is the Pareto frontier, whose points are not dominated by any other point when all dimensions are considered.

In our setting, the arms are the points, each axis corresponds to an “expected reward” estimated using a simple or contextual MAB or any other function approximator, and the task is to find the Pareto frontier of arms and select an arm from it.
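As a minimal illustration of the dominance check behind this, the sketch below compares the sampled expected-reward vectors of two arms, assuming every objective is on a “higher is better” scale; the numeric vectors are made up for illustration.

```python
import numpy as np

def dominates(u, v):
    """True if reward vector u Pareto-dominates v: at least as good on every
    objective and strictly better on at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

# Hypothetical sampled expected rewards (CTR, booking conversion) for pairs of arms.
print(dominates([0.30, 0.05], [0.25, 0.04]))   # True  -> the second arm is dominated
print(dominates([0.30, 0.03], [0.25, 0.04]))   # False -> the two arms are incomparable
```

In the second pair, each arm is better on one objective, so neither dominates the other and both would sit on the same frontier.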

How to identify the Pareto Frontier?

The Pareto front A* can be found either with scalarization functions, e.g., the Linear Scalarization Function (LSF), or with the Pareto dominance relation (PDR) rule.

The Pareto dominance relation compares the estimated reward vectors of two arms as follows: arm u dominates arm v if u is at least as good as v on every objective and strictly better on at least one; u weakly dominates v if u is at least as good on every objective; and u and v are incomparable if each is strictly better than the other on at least one objective. The Pareto front A* is then the set of arms not dominated by any other arm.

After finding the Pareto frontier consisting of all non-dominated arms, we randomly select one of the arms lying on it.
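Putting the pieces together, here is a sketch of the selection step under the PDR rule: extract the non-dominated arms from the sampled reward vectors and pick one of them uniformly at random. The brute-force scan over all arm pairs is an illustrative choice that is fine for a handful of arms.

```python
import numpy as np

def pareto_front(reward_vectors):
    """Return indices of non-dominated arms given an (n_arms, n_objectives) array."""
    r = np.asarray(reward_vectors)
    front = []
    for i in range(len(r)):
        # Arm i is dominated if some arm is >= on every objective and > on at least one.
        dominated = any(
            np.all(r[k] >= r[i]) and np.any(r[k] > r[i]) for k in range(len(r))
        )
        if not dominated:
            front.append(i)
    return front

rng = np.random.default_rng(0)
# Posterior samples per arm: 5 hypothetical arms, 2 objectives (e.g., click, booking).
sampled = rng.beta(np.ones((5, 2)), np.ones((5, 2)))
front = pareto_front(sampled)          # all non-dominated arms
chosen_arm = rng.choice(front)         # uniform random pick on the Pareto front
```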

Pareto Frontier vs Linear Scalarization

When the number of dimensions is large, identifying the Pareto frontier can be challenging. Methods such as linear scalarization are used in practice for such scenarios.

A Linear Scalarization Function (LSF) converts the multi-objective optimization (MOO) problem into a single score. However, solving a MOO problem means finding the Pareto front A*. Thus, we need a set of scalarization functions 𝐹 = {𝑓¹, 𝑓², …, 𝑓ˢ}, each with its own weight vector, to generate a variety of elements belonging to the Pareto front A*.

The estimated rewards can be seen as success rates. The scale of the success rate can differ across objectives; for example, in the travel e-commerce space, the scale/range of CTR differs from that of the booking conversion rate. Therefore, one needs to be extra cautious when using this method. We can, of course, rescale/standardize the values before applying LSFs, as sketched below.
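Below is a sketch of this LSF-based selection, under the assumptions that the per-objective rewards are z-score standardized and the weight vectors are drawn from a Dirichlet distribution; both choices are illustrative, not prescriptive.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsf_choices(sampled, weight_vectors):
    """Pick one arm per weight vector after standardizing each objective.

    sampled:        (n_arms, n_objectives) sampled expected rewards
    weight_vectors: (n_functions, n_objectives) non-negative weights, one row per LSF
    """
    # Z-score each objective so CTR-scale and conversion-scale rewards are comparable.
    z = (sampled - sampled.mean(axis=0)) / (sampled.std(axis=0) + 1e-9)
    scores = z @ np.asarray(weight_vectors).T      # (n_arms, n_functions) weighted sums
    return scores.argmax(axis=0)                   # best arm under each scalarization

# 5 hypothetical arms, 2 objectives; 4 random weight vectors summing to 1.
sampled = rng.beta(np.ones((5, 2)), np.ones((5, 2)))
weights = rng.dirichlet(np.ones(2), size=4)
candidate_arms = lsf_choices(sampled, weights)     # one nominee per scalarization
chosen_arm = rng.choice(candidate_arms)
```

Each weight vector nominates one arm, so varying the weights approximates sampling different points on the (convex part of the) Pareto front.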

Does LSF always guarantee the optimal Pareto frontier?

LSFs are popular because of their simplicity. However, an LSF might not find all the optimal arms in the Pareto front A* if the success-rate vectors form a non-convex Pareto set (refer to this paper for an explanation). There are multiple scalarization methods, and among them, Chebyshev scalarization has the advantage that, under certain conditions, it can find all the points in a non-convex Pareto set.
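For completeness, here is a sketch of weighted Chebyshev scalarization over the sampled rewards, using a reference point placed slightly below the per-objective minimum; the epsilon offset and the single fixed weight vector are assumptions made for illustration.

```python
import numpy as np

def chebyshev_choice(sampled, weights, eps=1e-3):
    """Pick the arm maximizing the weighted Chebyshev scalarization.

    Scores each arm by min_j w_j * (r_j - z_j), where the reference point z sits
    slightly below the per-objective minimum of the sampled rewards.
    """
    r = np.asarray(sampled)
    z = r.min(axis=0) - eps                              # reference point (an assumption)
    scores = np.min(np.asarray(weights) * (r - z), axis=1)
    return int(scores.argmax())

rng = np.random.default_rng(0)
sampled = rng.beta(np.ones((5, 2)), np.ones((5, 2)))     # 5 hypothetical arms, 2 objectives
arm = chebyshev_choice(sampled, weights=[0.5, 0.5])      # equal weights, for illustration
```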

In practice, if you have a small number of dimensions you can opt for the Pareto frontier search; otherwise, use LSFs, in conjunction with contextual MABs.

Conclusion:

In this blog, we discussed Pareto Thompson sampling and linear scalarization Thompson sampling for solving multi-objective decision-selection problems.

The literature and experiments show that Pareto Thompson sampling outperforms linear scalarization Thompson sampling. However, if the weights to apply to each objective are already known, one would prefer linear scalarization Thompson sampling over the Pareto variant.

At MakeMyTrip, data science teams have experimented with both methods, depending on the data/problem requirements, and found them useful.

#method-to-madness

#Machinelearning

#pareto

#multi-objective

#contextualbandits

References:

https://www.esann.org/sites/default/files/proceedings/legacy/es2015-27.pdf

https://www.researchgate.net/publication/236480059_Designing_multi-objective_multi-armed_bandits_algorithms_A_study

G. Eichfelder, Adaptive Scalarization Methods in Multiobjective Optimization
