Learning to Rank in the Hotel Travel E-commerce Space

Narasimha M
MakeMyTrip-Engineering
11 min read · Sep 13, 2021


Co-written by data scientists Rakhi Agrawal, Vishal Sharma & Shivam Prasad

Introduction

With the advent of the internet and e-commerce, ranking documents or items in response to a user query/request has become a key task in the area of information retrieval or recommendation systems. In response to this, researchers have proposed many models which can be categorized as relevance ranking models and importance ranking models.

The learning to rank framework captures elements of both the relevance of items to the query and the importance/popularity of the items. However, many recommendation or IR systems use a two-step approach, where an importance ranking model retrieves a subset of items from millions of items, and a relevance ranking or learning-to-rank model orders that subset.

The problem is often framed as supervised ML, learning to predict “relevance” as a function of item features, item/document representations, and the query.

Offline process of a Learning to Rank approach

At MakeMyTrip, ranking items is a critical problem in hotels, flights, and other lines of business. A consumer-centric ranking system can drive conversion, improve user experience, reduce time to book (when done appropriately), and reduce portfolio risk by giving an opportunity to a diverse set of products/items to be seen by consumers.

The data science team has developed many ranking algorithms, especially in the hotels line of business. A different set of models powers each ranking scenario, based on user type and the degree of information available during inference.

Broadly, the ranking scenarios can be grouped into three types.

1. New users and logged-off/guest users

2. Users who have booked/interacted with MakeMyTrip in previous trip searches.

3. Users who have recently interacted with, clicked, or reviewed hotels.

For the 2nd and 3rd scenarios, multiple AI/ML algorithms are in production, which have yielded a 10% to 15% lift in conversion for domestic and international hotel categories, compared to their respective baselines.

In scenario 2, for example, collaborative filtering/matrix factorization methods performed quite well (in 70–80% of cases) compared to a similar, simpler baseline.

In scenario 3, for international hotels, an ensemble of content embeddings and click-sequence embeddings (a graph network model) performed better than collaborative filtering by 10% and 12% (conversion ratio) on Android and iOS respectively. We will explain these embedding and graph network models in another blog.

For this topic, consider the 1st scenario. It is a cold start problem when viewed from the perspective of matrix factorization (MF or neural MF) or collaborative filtering “solutions”. But there is merit in improving ranking, as many users come to the platform for just a single session or browse without logging in. The ranking objective for scenario 1 thus translates to: how to leverage user search context, seasonal popularity of hotels, hotel metadata, and hotels’ recent performance data (CTR, conversion ratios, number of bookings and clicks) to improve the ranking for new users and logged-off/guest users.

The data science team has experimented with learning to rank (L2R) framework for this scenario, in domestic and international hotels over the past couple of years for MakeMyTrip and GoIbibo.

We have also begun personalized L2R experiments for domestic flights and home landing card ranking. The rest of the article dives into L2R in the context of hotels.

Hotel Sample Funnel Flow

Logged-off/new users enter search criteria

Illustrative funnel flow, model results may vary based on user and search context, date/time, and device platforms.

How to define relevance?

The key purpose of L2R algorithms is to learn “relevance”. The user’s behavior within the app tells us how relevant a recommendation in the item list is (implicit feedback). For example, in the hotel LOB, funnel depth indicates user-item interaction quality: the deeper a user goes into the funnel for a hotel, the more relevant that hotel is for the user’s trip. The hotel funnel broadly has the following key sections.

1. Hotel listing

2. Hotel detail

3. Detail page, location map view

4. Detail page, hotel photos

5. Select room and rate plan code, for the hotel

6. Review

7. Payment page review

8. Book the hotel

The lowest relevance is anchored at “listing”, and the highest at “booked hotels”.
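As a concrete sketch, the funnel depths above can be encoded as graded relevance labels. The grade values below are hypothetical and purely illustrative; choosing the gaps between grades is itself a modeling decision, discussed later in this post.

```python
# Hypothetical funnel-depth -> relevance grade mapping; the actual grade
# values used in production are a modeling choice, not shown here.
FUNNEL_RELEVANCE = {
    "listing": 0,          # lowest relevance: hotel was merely listed
    "detail": 1,
    "map_view": 1,
    "photos": 1,
    "room_rateplan": 2,
    "review": 3,
    "payment_review": 4,
    "booked": 5,           # highest relevance: hotel was booked
}

def relevance_grade(funnel_events):
    """Grade a user-hotel interaction by its deepest observed funnel event."""
    return max(FUNNEL_RELEVANCE[event] for event in funnel_events)
```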

Model Algorithm types

L2R algorithms can be categorized into three classes, based on the function approximator’s “objective”, “likelihood”, or “loss function”.

1. Pointwise

- Treats each row in the training dataset as independent; assumes errors are i.i.d. (independent and identically distributed).

- The basic version; it can easily capture importance-based ranking when appropriate features are present.

2. Pairwise

- Learns the relative importance of hotels/items within each pair, taking pairs of hotels as input.

- Each informative pair contributes equally to the loss function.

3. Listwise

- Learns relevance across the entire list of items for a given user-trip-query.

- One approach is to modify the loss/objective function to treat pairs as unequal: the weight for a pair can be proportional to the expected drop in the model metric if the model gets the order of the items in that pair wrong.
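The pairwise and listwise ideas can be sketched with a minimal example. The RankNet-style logistic pair loss below is a common choice (an illustration, not the exact loss used in our models); the listwise flavour simply reweights each pair, e.g. by the drop in NDCG from swapping it, as in LambdaRank.

```python
import math

def pairwise_logistic_loss(score_pos, score_neg):
    """RankNet-style pair loss: penalizes the more-relevant item (pos)
    scoring below the less-relevant one (neg)."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))

def weighted_pair_loss(score_pos, score_neg, delta_metric):
    """Listwise flavour (LambdaRank idea): weight each pair by the drop
    in the list metric (e.g. |delta NDCG|) if the pair is mis-ordered."""
    return delta_metric * pairwise_logistic_loss(score_pos, score_neg)
```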

What offline metrics are appropriate to track?

Normalized Discounted Cumulative Gain (NDCG)

NDCG summarizes the cumulative effect of the items that were “relevant” in a list of recommendations, while discounting for the rank/position of the items in the list. In an ideal world, all relevant items would be positioned at the top of the list, in descending order of relevance.

Since the “discounted cumulative” score/gain can differ from one user to another, based on the number of items interacted with and the relevance/depth of interaction, the score should be normalized with respect to the “ideal discounted cumulative” score, which leads to NDCG.

EXAMPLE: Consider a ranking system that returns a hotel list for a given search context in a city. Let’s assume there are only 7 hotels in the city, and for now we will look at the top 5 hotels only, so p is 5. Result list = [H1, H2, H3, H4, H5, H6, H7]. From historical clickstream data we know the relevance of these hotels for the current search context: relevance = [2, 1, 0, 1, 0, 0, 1].

Discounted Cumulative Gain (using the common exponential-gain form):

DCG@p = sum over i = 1..p of (2^rel_i - 1) / log2(i + 1)

Ideal Discounted Cumulative Gain:

IDCG@p is the same sum computed over the list re-sorted in descending order of relevance, and NDCG@p = DCG@p / IDCG@p.
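As a sketch, here is the NDCG computation for the example above, assuming the common exponential-gain formulation (2^rel - 1) / log2(rank + 1):

```python
import math

def dcg_at_p(relevances, p):
    """DCG@p with exponential gain: sum of (2^rel - 1) / log2(rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:p], start=1))

def ndcg_at_p(relevances, p):
    """NDCG@p: DCG of the list as ranked, normalized by the ideal DCG."""
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_p(ideal, p)
    return dcg_at_p(relevances, p) / idcg if idcg > 0 else 0.0

relevance = [2, 1, 0, 1, 0, 0, 1]  # the example above, p = 5
score = ndcg_at_p(relevance, p=5)  # ~0.89 under this gain formulation
```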

Mean Reciprocal Rank (MRR)

The Reciprocal Rank (RR) measure is the inverse of the rank at which the first relevant item was retrieved: RR is 1 if a relevant item was retrieved at rank 1, 0.5 if the first relevant item was retrieved at rank 2, and so on. When averaged across search queries, the measure is called the Mean Reciprocal Rank (MRR).
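A minimal sketch of MRR over a batch of queries, where each result list is a sequence of booleans marking the relevant items:

```python
def mean_reciprocal_rank(results_per_query):
    """MRR: average, over queries, of 1/rank of the first relevant item
    (0 when no relevant item was retrieved for that query)."""
    reciprocal_ranks = []
    for results in results_per_query:
        rr = 0.0
        for rank, is_relevant in enumerate(results, start=1):
            if is_relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```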

A few other metrics explored by researchers are MAP (Mean Average Precision), Expected Reciprocal Rank, and Yandex pFound. Some have also recommended metrics for dispersion, diversity, and the degree of novelty of recommendations.

Simulated ranking stats

While we emphasize NDCG, MRR, and a measure of dispersion as primary metrics at MakeMyTrip, before deploying a model we also simulate and assess whether key feature distributions among the top 30 recommended items are satisfactory (a “sanity check”). The key features to check could be, for example, the recommended hotels’ ranks on booking count, clicks, context-specific rank, CTR, etc., within the city or within a category of hotels.

Assortment of Model types

How to choose ideal relevance values?

We experimented with multiple relevance sets and assessed which set maximizes product metrics (CTR) and business metrics (conversion ratio). While it is trivial to define relevance scores that increase monotonically with funnel depth, the gaps between funnel steps don’t have to be uniform.

For example, the gap between the booking and review funnel depths can make a huge non-linear contribution to the objective function (2^booking_relevance - 2^review_relevance), which may or may not be ideal.

Based on relevance structure, the contribution of features and feature interactions in the function approximator could vary, leading to a very different set of hotel recommendations.

For example, price/discount is a primary driver for converting users from the review step to the book step. If the relevance gap (a modeling assumption) between these two steps is too big, the function approximator learns to be biased toward ranking cheaper hotels, or hotels with steep discounts, at the top, sacrificing the specific relevance of hotels for the search context.
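To see how the grade gap feeds the objective, consider the 2^relevance gain term above with two hypothetical relevance sets (the grade values are invented for illustration): stretching the booking grade inflates the booking-vs-review contribution non-linearly.

```python
def booking_review_gain_gap(booking_rel, review_rel):
    """Gap in exponential gain (2^rel) between booking and review grades."""
    return 2 ** booking_rel - 2 ** review_rel

# Same ordering of grades, very different contribution gaps:
modest_gap = booking_review_gain_gap(booking_rel=4, review_rel=3)     # 2^4 - 2^3 = 8
stretched_gap = booking_review_gain_gap(booking_rel=7, review_rel=3)  # 2^7 - 2^3 = 120
```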

In applied data science, NDCG alone is not sufficient to decide which relevance set or model is appropriate. On many occasions, we observed a different performance rank order of models/sets based on offline metrics (such as NDCG, MRR) versus online metric impact (% lift in CTR or % lift in conversion ratio). One must be prepared to test multiple models in production and learn which model/relevance set is appropriate from both offline AND online metric improvements.

Model form and loss functions

We have experimented with pairwise and listwise loss functions, using XGBoost, LightGBM, TensorFlow (the TF-Ranking library), and custom ListMLE losses (PyTorch).
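As a sketch of the custom listwise loss mentioned above, ListMLE (reference 4) minimizes the Plackett-Luce negative log-likelihood of the ideal (relevance-sorted) permutation under the model's scores; a plain-Python version, not our production implementation:

```python
import math

def list_mle_loss(scores, relevances):
    """ListMLE: negative log-likelihood of the relevance-sorted ("ideal")
    permutation under a Plackett-Luce model of the predicted scores."""
    # Arrange predicted scores in the ideal order (descending true relevance).
    ideal_order = sorted(range(len(scores)), key=lambda i: -relevances[i])
    ordered = [scores[i] for i in ideal_order]
    loss = 0.0
    for i, s in enumerate(ordered):
        # -log P(item i is picked next among the remaining items)
        loss -= s - math.log(sum(math.exp(t) for t in ordered[i:]))
    return loss
```

Scores that already rank items in the ideal order yield a lower loss than scores that reverse it.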

In addition to testing a few loss functions, we have also experimented with MLP NN, Dense Net, AutoINT, and deep/medium depth neural nets with residual connections.

Of all the neural network architectures, the NN with residual connections gave better offline results (a mild improvement in NDCG in the 3rd decimal, a substantive lift in MRR) compared to the XGBoost implementation of listwise learners.

However, the simulated rank stats were only comparable for LambdaMART/GBDT LambdaRank. As the NN structures warranted further iterations, the GBDT LambdaRank-based function approximator was productized for domestic hotels.

A few experiments with NN model structures were conducted for international hotels in early 2020.

AutoINT (automatic feature interaction learning via self-attentive networks)

In preliminary iterations with AutoINT, NDCG hardly improved, and inference latency was higher, so we deprioritized further optimizing/fine-tuning this architecture. Note that, unlike other datasets where the AutoINT structure could be more “effective”, our datasets had fewer categorical variables; most variables had continuous value distributions.

The TF-Ranking library’s group-wise loss yielded a comparable NDCG score, but the simulated rank statistics were poor compared with the GBDT LambdaMART/LambdaRank function approximators.

NN parameter initialization experiments

Orthogonal initialization helped stabilize model iterations and converge in fewer epochs, with no substantive impact on offline metrics (NDCG, MRR) on the test dataset.

Feature transformations

With continuous-value features, the exact numerical values matter somewhat less in tree-based function approximators, as long as their relative ordering is meaningful. But for neural network models, relative ordering alone doesn’t suffice: extreme feature values can cause large gradients that shut down activation functions like ReLU.

To avoid these pitfalls, long-tailed features can be transformed and restricted/clipped.

Most of the continuous variables follow skewed power-law, non-normal gamma, or log-normal distributions. For such features, the median value is mapped closer to 0 via log(1 + value / (1 + median)).

Also, to make the inference results more sensitive to the important features, we apply rank indexing: bin or rank each feature of a hotel (e.g., booking counts, click counts, P15day CTR, conversion ratio) with respect to all other hotels within the city.
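A sketch of the transforms described above (the function names and clipping range are ours, for illustration):

```python
import math

def log_median_squash(value, median):
    """Long-tail transform from the text: log(1 + value / (1 + median))."""
    return math.log(1 + value / (1 + median))

def clip(value, low, high):
    """Restrict extreme feature values to a fixed range."""
    return max(low, min(high, value))

def rank_index(feature_values):
    """Rank-index a feature within a city: each hotel's raw value is
    replaced by its rank (1 = highest) among the city's hotels."""
    order = sorted(range(len(feature_values)),
                   key=lambda i: -feature_values[i])
    ranks = [0] * len(feature_values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks
```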

Selection bias correction (negative sample insertion)

Every ranking system primes users to engage more with items at the top of the list. Options to sort or filter mitigate this problem but not entirely.

Hotel impressions are systematically higher for popular hotels than for others; hence the training data has substantial “selection bias”. Training a model on such data creates a selection-bias gap between training and inference: the parameters may not generalize enough to score all items/hotels in the city.

To reduce such bias in cold-start L2R ranking problems, we generate negative samples/items per QID and insert them into the training data.

In many situations, one cannot generate appropriate negative samples. Instead, selection bias correction techniques such as “Heckman_rank”, “propensity_SVM_rank”, or counterfactual methods, as described in the literature, could be explored.
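A minimal sketch of the negative-sample insertion described above (the hotel IDs, data layout, and uniform sampling scheme are illustrative assumptions):

```python
import random

def insert_negative_samples(qid_group, city_hotels, n_negatives, seed=42):
    """For one query (QID), append never-impressed hotels from the same
    city as relevance-0 rows, so the model learns to score items beyond
    the popular, frequently-impressed ones.

    qid_group: list of (hotel_id, relevance) rows already in training data.
    """
    rng = random.Random(seed)
    impressed = {hotel_id for hotel_id, _ in qid_group}
    candidates = [h for h in city_hotels if h not in impressed]
    sampled = rng.sample(candidates, min(n_negatives, len(candidates)))
    return qid_group + [(hotel_id, 0) for hotel_id in sampled]
```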

Position bias correction

In addition to selection bias, the training data has click bias in “relevance”/implicit feedback. Users tend to interact/click/review top-ranked hotels though some of them may not be relevant to them.

To correct for position bias, we are currently experimenting with inverse propensity score weighting and other debiasing methods.
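One form of inverse propensity scoring can be sketched as follows: estimate a per-position examination propensity from click logs and up-weight clicks observed at low-propensity positions. The estimator below is a crude observational one for illustration; randomization-based estimators from the literature are cleaner.

```python
def position_propensities(click_log, n_positions):
    """Crude propensity estimate: click rate at each position, normalized
    to position 0. click_log is a list of (position, clicked) pairs."""
    clicks = [0] * n_positions
    shows = [0] * n_positions
    for position, clicked in click_log:
        shows[position] += 1
        clicks[position] += int(clicked)
    rates = [c / s if s else 0.0 for c, s in zip(clicks, shows)]
    base = rates[0] if rates[0] > 0 else 1.0
    return [r / base for r in rates]

def ips_weight(position, propensities, floor=0.1):
    """Training weight 1/propensity for a click at `position`; the floor
    caps the variance blow-up from near-zero propensities."""
    return 1.0 / max(propensities[position], floor)
```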

Business Impact

Overall, we have found a ~5% to ~14% lift in conversion ratio (varying across projects and device types, compared to a dynamic-yet-rule-based engine as the baseline), along with lifts in CTR and in the dispersion of hotel preference clicks.

Across the spectrum of ML projects at MakeMyTrip & GoIbibo, there are a few where online metric improvements did not tally with offline metric improvements. A few such experiments were in the learning to rank (L2R) charter too.

For example, L2R experiments with similarity scores (based on recent user-hotel interactions, scenario 3) had a positive impact on conversion ratio on iOS but not on Android traffic (w.r.t. the baseline collaborative filtering algorithm). Note that separate models are trained & tuned for iOS and Android with the respective consumer behavior data.

At the same time, L2R experiments for the cold start problem at Goibibo gave a substantive positive lift in conversion ratio (7.5%–15%), improved CTR in the top ranks, and improved the dispersion metric for both Android & iOS users.

Also, the preliminary results of personalized L2R experiments for Domestic Flights have been positive.

Based on these insights, the data science teams at MakeMyTrip (& GoIbibo) are working on both model- and data-driven ML tasks to bolster learning to rank model performance across all device users (Android, iOS, desktop), for both brands.

Thanks for reading along !!

Finally, thanks to the product leaders, and data platform team for powering these projects, and to an ex-MakeMyTrip data scientist (Arpit Katiyar) for his contributions to the ranking charter.

References:

1. Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview. MSR-TR-2010-82. https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf

2. Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In Proceedings of CIKM 2009. https://dl.acm.org/doi/10.1145/1645953.1646033

3. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-40.pdf

4. ListMLE: Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise Approach to Learning to Rank: Theory and Algorithm. In Proceedings of the 25th ICML. 1192–1199.

5. AutoINT: https://arxiv.org/pdf/1810.11921.pdf

6. Correcting for Selection Bias in Learning-to-rank Systems: https://arxiv.org/pdf/2001.11358.pdf

7. Explaining and illustrating Orthogonal Initializations in RNNs: https://smerity.com/articles/2016/orthogonal_init.html

