Collection of articles about Reinforcement Learning in Real-Time Bidding

Real-Time Bidding (RTB) optimization is a complex problem that requires instant decisions in the context of highly dynamic auctions, non-stationary competition and limited budgets. Traditional methods are often not flexible enough to adapt to a rapidly changing environment, which opens up opportunities for applying Reinforcement Learning (RL) methods.

This repository brings together key theoretical materials, algorithms, and practical tools for diving into RL-based bid optimization in RTB. Here you will find a structured selection of educational resources on the basics of RL, reviews of modern algorithms (from the classic DQN to the advanced SAC and TD3), research on bid landscape forecasting, and examples of implementing an auction simulator and RL-based agents.

A processed dataset sample and links to popular benchmarks such as iPinYou are provided for experiments. Using the example agents and environment, you can tailor the experiments to your own tasks: build specific goals into the agents' logic, try more complex reward functions, tune the agents, and so on.
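
To make the setup concrete, below is a minimal sketch of what a second-price auction environment with a budget constraint might look like in the Gymnasium interface. It is only an illustration: the class name, observation layout, and synthetic CTR/market-price distributions are assumptions made for the example, not the repository's actual simulator API.

```python
# Hypothetical sketch of a simplified second-price auction environment
# (not the repository's actual simulator API).
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class SimpleRTBEnv(gym.Env):
    """One episode = one ad campaign with a fixed budget and a stream of bid requests."""

    def __init__(self, budget=1000.0, n_requests=500, seed=None):
        super().__init__()
        self.budget, self.n_requests = budget, n_requests
        self.rng = np.random.default_rng(seed)
        # Observation: [remaining budget fraction, remaining requests fraction, predicted CTR]
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)
        # Action: bid price, scaled to [0, 1] of some maximum bid
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

    def _obs(self):
        return np.array([self.left / self.budget, self.t_left / self.n_requests, self.pctr],
                        dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.left, self.t_left = self.budget, self.n_requests
        self.pctr = float(self.rng.beta(1, 50))            # synthetic predicted CTR
        return self._obs(), {}

    def step(self, action):
        max_bid = 10.0
        bid = float(action[0]) * max_bid
        market_price = float(self.rng.gamma(2.0, 1.5))     # synthetic highest competing bid
        reward = 0.0
        if bid > market_price and self.left >= market_price:
            self.left -= market_price                      # second-price: pay the market price
            reward = self.pctr                             # expected clicks as reward
        self.t_left -= 1
        self.pctr = float(self.rng.beta(1, 50))
        terminated = self.t_left <= 0 or self.left <= 0
        return self._obs(), reward, terminated, False, {}
```

An off-policy agent (for example SAC or TD3 from Stable-Baselines3, listed under Related repositories below) could then be trained against an environment like this.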

Introduction

RTB is the process of automated buying and selling of advertisements in real time. RTB allows advertisers to compete for ad impressions based on target audiences and other factors, making it an attractive tool for effective advertising placement.

Optimizing an RTB bid is challenging because it requires making a decision in a short period of time (usually within tens of milliseconds) while taking into account many factors, such as user history, the context of the openRTB request, and competition from other advertisers. In such conditions, traditional bid optimization algorithms may not be effective enough.

Applying RL methods to RTB bid optimization can increase the efficiency of your campaigns by assigning a near-optimal bid to each openRTB request, improving conversion rates and reducing advertising costs.
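
For contrast, a widely used non-RL baseline is linear bidding, which simply scales a base bid by the predicted CTR. A minimal sketch (the constants are illustrative, not tuned values):

```python
def linear_bid(pctr: float, base_bid: float = 2.0, avg_ctr: float = 0.002) -> float:
    """Classic linear bidding heuristic: bid proportionally to the predicted CTR.

    base_bid and avg_ctr are illustrative; in practice base_bid is tuned so the
    budget is spent smoothly over the campaign.
    """
    return base_bid * pctr / avg_ctr
```

A static rule like this cannot react to the remaining budget or to shifts in competition, which is exactly where RL-based bidding aims to help.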

Basics

| Title | Short description | Link |
| --- | --- | --- |
| Reinforcement Learning: An Introduction (second edition) | Comprehensive introduction | [Link] |
| CS234: Reinforcement Learning | Lectures from the Stanford RL course | [Link] |
| CS224R: Deep Reinforcement Learning | Lectures from the Stanford Deep RL course | [Link] |
| UCL Course on RL | David Silver's RL course | [Link] |
| CS 285: Deep Reinforcement Learning | Deep RL course at UC Berkeley | [Link] |
| DeepMind Advanced DL and RL | DL and RL course (slides and videos) | [Link] |
| Foundations of Deep RL | Pieter Abbeel lectures (videos) | [Link] |
| Lilian Weng's blog | Detailed descriptions of algorithms | [Link] |
| Algorithms of Reinforcement Learning | Lecture notes | [Link] |
| Reinforcement Learning: A Comprehensive Overview | Overview of the field of (Deep) RL | [Link] |
| Bandit Algorithms | Complete tutorial on the Multi-Armed Bandit (MAB) problem | [Link] |
| COMS E6998.001: Bandits and Reinforcement Learning | Alex Slivkins' MAB course | [Link] |
| Introduction to Multi-Armed Bandits | MAB book | [Link] |
| Hugging Face Deep RL Course | Deep RL course with theory and practice | [Link] |
| Practical_RL | Open course on RL in the wild | [Link] |
| CSE 599: Adaptive Machine Learning (Winter 2018) | Online and adaptive methods for machine learning | [Link] |
| Real-Time Bidding with Side Information | MAB regret analysis in online advertising auctions | [Link] |
| Regret Minimization for Reserve Prices in Second-Price Auctions | Regret minimization algorithm for reserve price optimization in second-price auctions | [Link] |
| Real-Time Bidding by Reinforcement Learning in Display Advertising | Frames the bid decision process as a reinforcement learning problem | [Link] |
| Efficient Algorithms for Stochastic Repeated Second-price Auctions | MAB (Upper Confidence Bound, UCB), regret analysis in a second-price auction | [Link] |
| Multi-Armed Bandit with Budget Constraint and Variable Costs | MAB (UCB-based) for constrained budgets and variable costs | [Link] |
| Display Advertising with Real-Time Bidding (RTB) and Behavioural Targeting | RTB basics, bid landscape, bidding strategies, dynamic pricing | [Link] |
| Offline and Online Optimization with Applications in Online Advertising | Optimal bidding and pacing strategies | [Link] |
| ROI-Constrained Bidding via Curriculum-Guided Bayesian Reinforcement Learning | A framework for adaptively managing constraint-target tradeoffs in non-stationary advertising markets | [Link] |
| Bidding Machine: Learning to Bid for Directly Optimizing Profits in Display Advertising | Describes how a bidder works | [Link] |
| Online Causal Inference for Advertising in Real-Time Bidding Auctions | Online method for causal inference in RTB advertising | [Link] |
| Real-Time Bidding: A New Frontier of Computational Advertising Research | Auction types, bidding strategies, pacing (slides) | [Link] |
| Real-Time Bid Optimization with Smooth Budget Delivery in Online Advertising | Description of pacing types | [Link] |
| Optimal Real-Time Bidding for Display Advertising | Overview of bidding strategies | [Link] |
| Auto-bidding in real-time auctions via Oracle Imitation Learning | Multiple-choice knapsack problem with a nonlinear objective | [Link] |

Bid Landscape Forecasting

| Title | Short description | Link | Year |
| --- | --- | --- | --- |
| Functional Bid Landscape Forecasting for Display Advertising | Bid landscape forecasting: tree-based, node splitting, survival modeling (slides) | [Link] | 2016 |
| Deep Landscape Forecasting for Real-time Bidding Advertising | Deep learning forecast of the bid landscape without assumptions about the winning price distribution | [Link] | 2019 |
| Scalable Bid Landscape Forecasting in Real-time Bidding | Price landscape forecasting (censored regression) under some simplifying assumptions | [Link] | 2020 |
| Arbitrary Distribution Modeling with Censorship in Real-Time Bidding Advertising | Neighborhood Likelihood Loss for the bid landscape forecasting problem | [Link] | 2021 |

Theory of algorithms

More attention is paid to model-free RL algorithms, since they are preferable for bid optimization in RTB. The advantage of this approach is not only that it can stay close to the theoretically optimal bidding strategy, but also that it avoids the expensive computation involved in simulating an extremely dynamic, non-stationary environment.

| Title | Algorithm | Link | Year |
| --- | --- | --- | --- |
| Technical Note: Q-learning | Q-learning | [Link] | 1992 |
| Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning | REINFORCE | [Link] | 1992 |
| On-line Q-learning Using Connectionist Systems | SARSA | [Link] | 1994 |
| Policy Gradient Methods for Reinforcement Learning with Function Approximation | Policy Gradient (PG) | [Link] | 1999 |
| Actor-Critic Algorithms | Actor-Critic (AC) | [Link] | 1999 |
| Playing Atari with Deep Reinforcement Learning | Deep Q-Network (DQN) | [Link] | 2013 |
| Deterministic Policy Gradient Algorithms | Deterministic Policy Gradient (DPG) | [Link] | 2014 |
| Continuous Control with Deep Reinforcement Learning | Deep Deterministic Policy Gradient (DDPG) | [Link] | 2015 |
| Asynchronous Methods for Deep Reinforcement Learning | Asynchronous Advantage Actor-Critic (A3C), Advantage Actor-Critic (A2C) | [Link] | 2016 |
| Deep Reinforcement Learning with Double Q-learning | Double Q-learning (Double DQN) | [Link] | 2016 |
| Dueling Network Architectures for Deep Reinforcement Learning | Dueling Network Architectures (Dueling DQN) | [Link] | 2016 |
| Prioritized Experience Replay | Prioritized Experience Replay (PER) (DQN replay buffer improvement) | [Link] | 2016 |
| Asynchronous Methods for Deep Reinforcement Learning | N-step Q-learning (N-step DQN) | [Link] | 2016 |
| Proximal Policy Optimization Algorithms | Proximal Policy Optimization (PPO) | [Link] | 2017 |
| Hindsight Experience Replay | Hindsight Experience Replay (HER) (wrapper) | [Link] | 2017 |
| Rainbow: Combining Improvements in Deep Reinforcement Learning | Rainbow (combination of DQN improvements) | [Link] | 2017 |
| A Distributional Perspective on Reinforcement Learning | C51 (Categorical DQN) | [Link] | 2017 |
| Distributional Reinforcement Learning with Quantile Regression | Quantile Regression DQN (QR-DQN) | [Link] | 2017 |
| Implicit Quantile Networks for Distributional Reinforcement Learning | Implicit Quantile Networks (IQN, distributional generalization of DQN) | [Link] | 2018 |
| Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor | Soft Actor-Critic (SAC) | [Link] | 2018 |
| Noisy Networks for Exploration | Noisy Nets (NN) | [Link] | 2018 |
| Addressing Function Approximation Error in Actor-Critic Methods | Twin Delayed Deep Deterministic Policy Gradient (TD3) | [Link] | 2018 |
| Distributed Distributional Deterministic Policy Gradients | Distributed Distributional Deterministic Policy Gradients (D4PG) | [Link] | 2018 |

RL in RTB

Model-free approach: getting rid of modeling complexity

As already mentioned above, creating an accurate model of the RTB auction dynamics is an extremely difficult task. The auction environment depends on many factors: the behavior of other bidders, changing user preferences, auction algorithms, which are often opaque. Trying to build an explicit model of this environment, which is necessary for model-based RL methods, can be prohibitively difficult and probably ineffective due to inevitable simplifications and inaccuracies.

In contrast, model-free RL methods learn directly from experience interacting with the environment, without the need for explicit modeling. The agent simply observes the state of the environment, takes actions (places bids), receives rewards (e.g. clicks, conversions or profits) and adjusts its strategy based on these observations. This makes model-free methods much more practical and flexible for application in complex and dynamic environments such as RTB auctions.
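
In code, this interaction is just a loop of observe, bid, reward, update. A schematic sketch, where env follows the Gymnasium step/reset convention and agent.act / agent.observe are hypothetical placeholders for any model-free algorithm (DQN, SAC, TD3, ...):

```python
# Schematic model-free training loop; `agent.act` and `agent.observe` are
# placeholder methods, not a specific library's API.
def train(env, agent, episodes: int = 100) -> None:
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = agent.act(obs)                            # choose a bid for the request
            next_obs, reward, terminated, truncated, _ = env.step(action)
            agent.observe(obs, action, reward, next_obs, terminated)  # learn from the outcome
            obs, done = next_obs, terminated or truncated
```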

Off-policy learning: learning from past experience and exploring new strategies

In addition, to learn effectively in RTB auctions, we often need to use data collected in the past or data obtained while exploring different bidding strategies. Off-policy RL methods are well suited to this task.

On-policy methods, such as SARSA or Policy Gradient, learn directly from the experience gained from the agent’s current strategy. This means that to explore new strategies and improve the current one, the agent needs to constantly interact with the environment, generating new experience using the current strategy. This can be slow and inefficient, especially if exploring new strategies results in a temporary decrease in performance.

Off-policy methods, on the other hand, allow the agent to learn from the experience gained by any strategy, including past strategies or even random actions. This is achieved by separating the strategy used to collect data (behavior policy) and the strategy we are trying to optimize (target policy). This approach gives us a number of advantages:

  1. Efficient use of experience: we can use accumulated data (e.g. from past simulations or even real auctions) to train the agent, even if this data was obtained using other strategies.

  2. Experience replay: off-policy methods often use an experience replay buffer, where past transitions (state, action, reward, next state) are stored. The agent can revisit and learn from these past experiences many times, which significantly improves training efficiency and stability (a minimal buffer sketch follows this list).

  3. Exploring new strategies more safely: we can explore new, potentially risky strategies by collecting data that can then be used to learn a more conservative and stable target strategy.
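
As referenced in point 2, a replay buffer is the core data structure behind this reuse of experience. A minimal sketch in plain Python (libraries such as Stable-Baselines3 provide their own, more efficient implementations):

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores past transitions so an off-policy agent can reuse them many times."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Uniform sampling; prioritized variants (PER) instead weight transitions by TD error.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```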

Of course, these algorithms have their own nuances and shortcomings (see Off-Policy Deep Reinforcement Learning without Exploration), but they remain a good fit for this task.

| Title | Algorithm | Link | Year |
| --- | --- | --- | --- |
| Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising | DQN | [Link] | 2018 |
| Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising | MADDPG | [Link] | 2018 |
| Real-Time Bidding with Soft Actor-Critic Reinforcement Learning in Display Advertising | SAC | [Link] | 2019 |
| A Dynamic Bidding Strategy Based on Model-Free Reinforcement Learning in Display Advertising | TD3 | [Link] | 2020 |
| Bid Optimization using Maximum Entropy Reinforcement Learning | SAC | [Link] | 2021 |
| Dynamic pricing under competition using Reinforcement Learning | DQN, SAC | [Link] | 2021 |
| Multi-Objective Actor-Critics for Real-Time Bidding in Display Advertising | DQN, A2C, A3C | [Link] | 2022 |
| Real-time Bidding Strategy in Display Advertising: An Empirical Analysis | DQN, TD3 | [Link] | 2022 |
| RTBAgent: A LLM-based Agent System for Real-Time Bidding | LLM agent | [Link] | 2025 |

Datasets and benchmarks

| Title | Short description | Paper link | Download link | Year |
| --- | --- | --- | --- | --- |
| Real-Time Bidding Benchmarking with iPinYou Dataset | The most popular dataset/benchmark. Advertising campaigns for products from 9 different categories over 10 days in 2013. Contains 64.5M bids, 19.5M impressions, and 14.79K clicks. Full dataset size approx. 5.6 GB. | [Link] | [Link] | 2014 |
| User Response Learning for Directly Optimizing Campaign Performance in Display Advertising | A huge RTB dataset of advertising campaigns that ran for 10 days in 2016. Contains 402M impressions and 500K clicks. Full dataset size approx. 88.0 GB. | [Link] | [Link] | 2016 |

Related repositories

| Title | Short description | Link |
| --- | --- | --- |
| OpenAI Spinning Up | Educational resource to help anyone learn Deep RL | [Link] |
| Stable-Baselines3 | PyTorch version of Stable Baselines, reliable implementations of RL algorithms | [Link] |
| Paper Collection of Real-Time Bidding | A collection of research and survey papers on RTB-based display advertising techniques | [Link] |
| Deep RL with PyTorch | PyTorch implementations of various algorithms | [Link] |
| CleanRL | Single-file implementations of Deep RL algorithms with research-friendly features | [Link] |
| CORL | Single-file implementations of SOTA offline and offline-to-online RL algorithms | [Link] |
| EasyRL | PyTorch implementations of various algorithms | [Link] |

Useful resources

| Title | Short description | Link |
| --- | --- | --- |
| OpenAI Spinning Up | Educational resource to help anyone learn Deep RL | [Link] |
| Stable-Baselines3 | Stable-Baselines3 docs, reliable RL implementations | [Link] |
| Gym | Collection of reference environments (moved to Gymnasium) | [Link] |
| Gymnasium | Collection of reference environments | [Link] |
| Third-Party Environments (Gym list) | Collection of third-party environments | [Link] |
| Third-Party Environments (Gymnasium list) | Collection of third-party environments | [Link] |

Authors

  • Dmitrii Frolov

License

The MIT License (MIT) Copyright © 2025 MTS ADTECH, LLC. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
