

9:00 am - 12:00 pm, August 14, 2022, @KDD2022, Washington DC Convention Center, Room 204C.



Counterfactual estimators make it possible to use existing log data to estimate how a new target policy would have performed had it been used instead of the policy that logged the data. We say that these estimators work “off-policy”, since the policy that logged the data differs from the target policy. In this way, counterfactual estimators enable Off-Policy Evaluation (OPE), akin to an unbiased offline A/B test, as well as learning new decision-making policies through Off-Policy Learning (OPL). The goal of this tutorial is to summarize the foundations, implementations, and recent advances of OPE and OPL (OPE/OPL), with applications in recommendation, search, and an ever-growing range of interactive systems. Specifically, we will introduce the fundamentals of OPE/OPL and provide theoretical and empirical comparisons of conventional methods. Then, we will cover emerging practical challenges such as how to handle large action spaces, distributional shift, and hyper-parameter tuning. We will then present Open Bandit Pipeline, an open-source Python package for OPE/OPL, to better enable new research and applications. We will conclude the tutorial with future directions and an interactive Q&A session.

The learning outcomes of this tutorial are to enable the participants (such as applied researchers, practitioners, and students):

  • to know fundamental concepts and conventional methods of OPE/OPL
  • to be familiar with recent advances to address practical challenges such as large action spaces and hyper-parameter tuning
  • to understand how to implement OPE/OPL in their research and applications
  • to be aware of remaining challenges and opportunities in the relevant field

Note that all materials, including slides and demo code, will be available on this website during and after the tutorial.

Target Audience and Prerequisites

This tutorial is aimed at an audience with intermediate experience in machine learning, data mining, or recommender systems who are interested in using OPE/OPL methods in their research and applications. Participants are expected to have basic knowledge of machine learning, probability theory, and statistics. Basic knowledge of causal inference may make the content easier to follow, but is not required.


Section                        Presenter           Duration
1: Introduction to OPE/OPL     Thorsten Joachims   30min
2: Bias-Variance Control       Yuta Saito          35min
3: Recent Advances in OPE      Yuta Saito          45min
Break                                              10min
4: Off-Policy Learning         Thorsten Joachims   40min
5: Implementations             Yuta Saito          15min
6: Conclusions                 Both                5min

Section Abstracts

1. Introduction to OPE/OPL (Thorsten Joachims; 30min)

We will introduce the conventional formulation of OPE and how it helps improve interactive systems quickly and safely. We will also introduce basic OPE estimators, including the Direct Method (DM) and Inverse Propensity Score (IPS) weighting, with empirical illustrations that highlight their bias-variance trade-off.
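As a rough illustration of the two basic estimators named above, the following sketch computes DM and IPS estimates on a toy, context-free problem. The data, the policy probabilities, and the function names (`ips_estimate`, `dm_estimate`) are assumptions made for this example and are not taken from the tutorial materials. The reward model given to DM is deliberately misspecified to show how DM inherits the bias of its model, while IPS stays close to the true value at the price of higher variance.

```python
import numpy as np

def ips_estimate(rewards, pi_b, pi_e):
    """IPS: reweight each logged reward by pi_e(a|x) / pi_b(a|x)."""
    weights = pi_e / pi_b
    return float(np.mean(weights * rewards))

def dm_estimate(q_hat, pi_e_all):
    """DM: average the model-predicted reward under the target policy."""
    return float(np.mean(np.sum(pi_e_all * q_hat, axis=1)))

# Toy setup with 3 actions and no context (all numbers are made up).
rng = np.random.default_rng(0)
n, n_actions = 50_000, 3
true_q = np.array([0.2, 0.5, 0.8])      # expected reward of each action
pi_b_probs = np.array([0.6, 0.3, 0.1])  # logging (behavior) policy
pi_e_probs = np.array([0.1, 0.3, 0.6])  # target (evaluation) policy

# Log data by running the logging policy.
actions = rng.choice(n_actions, size=n, p=pi_b_probs)
rewards = rng.binomial(1, true_q[actions]).astype(float)

# Ground-truth value of the target policy, known only in simulation.
true_value = float(pi_e_probs @ true_q)

# IPS only needs the propensities of the logged actions.
ips_value = ips_estimate(rewards, pi_b_probs[actions], pi_e_probs[actions])

# DM needs a reward model for every action; this one is intentionally wrong.
q_hat = np.tile(np.array([0.3, 0.5, 0.6]), (n, 1))
pi_e_all = np.tile(pi_e_probs, (n, 1))
dm_value = dm_estimate(q_hat, pi_e_all)

print(f"true={true_value:.3f}  IPS={ips_value:.3f}  DM={dm_value:.3f}")
```

In this toy setting the DM estimate is off by exactly the error of the assumed reward model, whereas the IPS estimate fluctuates around the true value from one random seed to the next.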

2. Bias-Variance Control (Yuta Saito; 35min)

This section summarizes a wide range of existing OPE estimators, including Self-Normalized IPS, Doubly Robust, Switch, and Doubly Robust with Shrinkage. These estimators aim to achieve a better bias-variance trade-off than DM and IPS. We will provide comprehensive comparisons of these estimators from both theoretical and empirical perspectives.
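To make two of the estimators listed above concrete, here is a minimal sketch of Self-Normalized IPS and Doubly Robust on a toy, context-free problem. The data, the reward model, and the function names (`snips_estimate`, `dr_estimate`) are assumptions for illustration only; Switch and Doubly Robust with Shrinkage are not shown.

```python
import numpy as np

def snips_estimate(rewards, pi_b, pi_e):
    """Self-Normalized IPS: divide by the sum of weights instead of n."""
    weights = pi_e / pi_b
    return float(np.sum(weights * rewards) / np.sum(weights))

def dr_estimate(rewards, actions, pi_b, pi_e, q_hat, pi_e_all):
    """Doubly Robust: DM baseline plus an IPS-weighted residual correction."""
    weights = pi_e / pi_b
    rows = np.arange(len(rewards))
    baseline = np.sum(pi_e_all * q_hat, axis=1)
    correction = weights * (rewards - q_hat[rows, actions])
    return float(np.mean(baseline + correction))

# Toy setup with 3 actions and no context (all numbers are made up).
rng = np.random.default_rng(1)
n, n_actions = 50_000, 3
true_q = np.array([0.2, 0.5, 0.8])
pi_b_probs = np.array([0.6, 0.3, 0.1])  # logging policy
pi_e_probs = np.array([0.1, 0.3, 0.6])  # target policy

actions = rng.choice(n_actions, size=n, p=pi_b_probs)
rewards = rng.binomial(1, true_q[actions]).astype(float)
true_value = float(pi_e_probs @ true_q)

# An imperfect reward model; DR corrects its error using the logged rewards.
q_hat = np.tile(np.array([0.3, 0.5, 0.6]), (n, 1))
pi_e_all = np.tile(pi_e_probs, (n, 1))

snips_value = snips_estimate(rewards, pi_b_probs[actions], pi_e_probs[actions])
dr_value = dr_estimate(
    rewards, actions, pi_b_probs[actions], pi_e_probs[actions], q_hat, pi_e_all
)

print(f"true={true_value:.3f}  SNIPS={snips_value:.3f}  DR={dr_value:.3f}")
```

Both estimates should land near the true value here even though the reward model is wrong, because the logging propensities are known exactly in this simulation.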

3. Recent Advances in OPE (Yuta Saito; 45min)

This section will cover recent methods for handling emerging practical challenges such as OPE of ranking policies, large-scale applications, deficient support, multiple loggers, and hyper-parameter tuning for OPE. These challenges are closely related to real-world applications such as recommender and retrieval systems, where the estimators have to deal with a large number of actions and non-stationary dynamics.

4. Off-Policy Learning (Thorsten Joachims; 40min)

This section will cover the fundamental methods for OPL, where the aim is to train a new decision-making policy using only logged bandit data.
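One simple way to picture learning from logged bandit data is gradient ascent on the IPS estimate of a parameterized policy's value. The sketch below does this for a context-free softmax policy over three actions. The data, the step size, the number of iterations, and the function names (`softmax`, `ips_gradient`) are assumptions chosen for illustration; this is not presented as the specific method covered in the tutorial.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action scores."""
    z = np.exp(theta - np.max(theta))
    return z / np.sum(z)

def ips_gradient(theta, actions, rewards, pi_b):
    """Gradient of the IPS value estimate with respect to theta."""
    pi = softmax(theta)
    weights = pi[actions] / pi_b                 # importance weights
    score = np.eye(len(theta))[actions] - pi     # grad of log pi_theta(a)
    return np.mean((weights * rewards)[:, None] * score, axis=0)

# Logged data from a uniform logging policy (all numbers are made up).
rng = np.random.default_rng(0)
n, n_actions = 20_000, 3
true_q = np.array([0.2, 0.5, 0.8])
pi_b_probs = np.full(n_actions, 1.0 / n_actions)
actions = rng.choice(n_actions, size=n, p=pi_b_probs)
rewards = rng.binomial(1, true_q[actions]).astype(float)
pi_b = pi_b_probs[actions]  # propensity of each logged action

# Plain gradient ascent on the IPS objective, using only the logged data.
theta = np.zeros(n_actions)
for _ in range(200):
    theta += 1.0 * ips_gradient(theta, actions, rewards, pi_b)

learned_policy = softmax(theta)
print("learned action probabilities:", np.round(learned_policy, 3))
```

The learned policy should shift most of its probability toward the action with the highest logged reward rate, without ever being deployed during training.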

5. Implementations (Yuta Saito; 15min)

This section will introduce Open Bandit Pipeline, an open-source Python package for OPE/OPL, and demonstrate how it helps us implement OPE/OPL for both research and practical purposes.

6. Conclusions (Both Presenters; 5min)

This section will conclude the tutorial by summarizing the previous sections and presenting the remaining research challenges in the area. There will also be a live Q&A session.


Yuta Saito ([email protected])

He is a Ph.D. student in the Department of Computer Science at Cornell University, advised by Prof. Thorsten Joachims. His current research focuses on OPE of bandit algorithms, learning from human behavior data, and fairness in ranking. Some of his recent work has been published at top conferences, including ICML, NeurIPS, SIGIR, RecSys, and WSDM. He won the Best Paper Runner-Up Award at WSDM 2020 and co-lectured a tutorial related to counterfactual inference at RecSys 2021.

Thorsten Joachims ([email protected])

He is a Professor in the Department of Computer Science and in the Department of Information Science at Cornell University, and he is an Amazon Scholar. His research interests center on the synthesis of theory and system building in machine learning, with applications in information retrieval and recommendation. His past research focused on support vector machines, learning to rank, learning with preferences, learning from implicit feedback, text classification, and structured output prediction. Working with his students and collaborators, his papers won 9 Best Paper Awards and 4 Test-of-Time Awards. He is also an ACM Fellow, AAAI Fellow, KDD Innovations Award recipient, and member of the SIGIR Academy.


