Real-Time Bidding (RTB) optimization is a complex problem that requires instant decisions in the context of highly dynamic auctions, non-stationary competition and limited budgets. Traditional methods are often not flexible enough to adapt to a rapidly changing environment, which opens up opportunities for applying Reinforcement Learning (RL) methods.
This repository brings together key theoretical materials, algorithms and practical tools for diving into RL-based bid optimization in RTB. Here you will find a structured selection of educational resources on the basics of RL, reviews of modern algorithms (from classic DQN to advanced SAC and TD3), research on bid landscape forecasting, and examples of implementing an auction simulator and RL-based agents.
A processed dataset sample and links to popular benchmarks such as iPinYou are provided for experiments. Using the example agents and environment, you can tailor experiments to your own tasks: embed specific campaign goals into the agent logic, experiment with more complex reward functions, tune the agents, and so on.
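For illustration, below is a minimal sketch of what a Gymnasium-style second-price auction environment might look like. It is not the repository's actual simulator: the class name, observation layout, reward and competitor model are simplifying assumptions made for this sketch.

```python
# Minimal, hypothetical sketch of a second-price auction environment in the
# Gymnasium API. Observation layout, reward and competitor model are
# illustrative assumptions only, not the repository's simulator.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class SecondPriceAuctionEnv(gym.Env):
    """One episode = one campaign with a fixed budget and a stream of bid requests."""

    def __init__(self, budget=1000.0, n_requests=1000, max_bid=10.0, seed=None):
        super().__init__()
        self.rng = np.random.default_rng(seed)
        self.budget, self.n_requests, self.max_bid = budget, n_requests, max_bid
        # Observation: [remaining budget fraction, remaining requests fraction, predicted CTR]
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)
        # Action: a continuous bid in [0, max_bid]
        self.action_space = spaces.Box(low=0.0, high=max_bid, shape=(1,), dtype=np.float32)

    def _obs(self):
        return np.array([self.budget_left / self.budget,
                         self.requests_left / self.n_requests,
                         self.pctr], dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.budget_left, self.requests_left = self.budget, self.n_requests
        self.pctr = float(self.rng.beta(2, 50))  # predicted CTR of the next request
        return self._obs(), {}

    def step(self, action):
        bid = float(np.clip(action[0], 0.0, self.max_bid))
        # Highest competing bid drawn from an assumed (unknown to the agent) distribution.
        market_price = float(self.rng.gamma(shape=2.0, scale=1.5))
        reward = 0.0
        if bid > market_price and market_price <= self.budget_left:
            self.budget_left -= market_price               # second price: pay the runner-up bid
            reward = float(self.rng.random() < self.pctr)  # 1 if the impression is clicked
        self.requests_left -= 1
        self.pctr = float(self.rng.beta(2, 50))
        terminated = self.requests_left == 0 or self.budget_left <= 0.0
        return self._obs(), reward, terminated, False, {}
```

Such an environment can be plugged into any Gymnasium-compatible agent, and the click reward can be swapped for a campaign-specific objective (conversions, profit, budget-pacing penalties).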
RTB is the process of automated buying and selling of advertisements in real time. RTB allows advertisers to compete for ad impressions based on target audiences and other factors, making it an attractive tool for effective advertising placement.
Optimizing an RTB bid is challenging because the decision must be made within a very short time (usually tens of milliseconds) while taking into account many factors, such as user history, the context of the openRTB request, and competition from other advertisers. Under such conditions, traditional bid optimization algorithms may not be effective enough.
Applying RL methods to RTB bid optimization can increase campaign efficiency by assigning a near-optimal bid to each openRTB request, improving conversion rates and reducing advertising costs.
| Title | Short description | Link |
|---|---|---|
| Reinforcement Learning: An Introduction (second edition) | Comprehensive introduction to RL | [Link] |
| CS234: Reinforcement Learning | Lectures from Stanford RL course | [Link] |
| CS224R: Deep Reinforcement Learning | Lectures from Stanford Deep RL course | [Link] |
| UCL Course on RL | David Silver's RL course | [Link] |
| CS 285: Deep Reinforcement Learning | Deep RL course at UC Berkeley | [Link] |
| DeepMind Advanced DL and RL | DL and RL course (slides and videos) | [Link] |
| Foundations of Deep RL | Pieter Abbeel lectures (videos) | [Link] |
| Lilian Weng’s blog | Detailed description of algorithms | [Link] |
| Algorithms of Reinforcement Learning | Lecture notes | [Link] |
| Reinforcement Learning: A Comprehensive Overview | Overview of the field of (Deep) RL | [Link] |
| Bandit Algorithms | Complete tutorial on the Multi-Armed Bandit (MAB) problem | [Link] |
| COMS E6998.001: Bandits and Reinforcement Learning | Alex Slivkins' MAB course | [Link] |
| Introduction to Multi-Armed Bandits | MAB book | [Link] |
| Hugging Face Deep RL course | Deep RL course with theory and practice | [Link] |
| Practical_RL | Open course on RL in the wild | [Link] |
| CSE 599: Adaptive Machine Learning (Winter 2018) | Online and Adaptive Methods for Machine Learning | [Link] |
| Real-Time Bidding with Side Information | MAB regret analysis in online advertising auctions | [Link] |
| Regret Minimization for Reserve Prices in Second-Price Auctions | Analysis of regret minimization algorithm for reserve price optimization in a second-price auction | [Link] |
| Real-Time Bidding by Reinforcement Learning in Display Advertising | Studying the bid decision process as a reinforcement learning problem | [Link] |
| Efficient Algorithms for Stochastic Repeated Second-price Auctions | MAB (Upper Confidence Bound, UCB); regret analysis in a second-price auction | [Link] |
| Multi-Armed Bandit with Budget Constraint and Variable Costs | MAB (UCB-based) for constrained budgets and variable costs | [Link] |
| Display Advertising with Real-Time Bidding (RTB) and Behavioural Targeting | RTB basics, bid landscape, bidding strategies, dynamic pricing | [Link] |
| Offline and Online Optimization with Applications in Online Advertising | Optimal bidding and pacing strategies | [Link] |
| ROI-Constrained Bidding via Curriculum-Guided Bayesian Reinforcement Learning | A framework for adaptively managing constraint-target tradeoffs in non-stationary advertising markets | [Link] |
| Bidding Machine: Learning to Bid for Directly Optimizing Profits in Display Advertising | Description of the bidder's work | [Link] |
| Online Causal Inference for Advertising in Real-Time Bidding Auctions | Online method for performing causal inference in RTB advertising | [Link] |
| Real-Time Bidding: A New Frontier of Computational Advertising Research | Description of auction types, bidding strategies, pacing (slides) | [Link] |
| Real-Time Bid Optimization with Smooth Budget Delivery in Online Advertising | Description of pacing types | [Link] |
| Optimal Real-Time Bidding for Display Advertising | Bidding strategies overview | [Link] |
| Auto-bidding in real-time auctions via Oracle Imitation Learning | Multiple-choice Knapsack problem with a nonlinear objective | [Link] |
| Title | Short description | Link | Year |
|---|---|---|---|
| Functional Bid Landscape Forecasting for Display Advertising | Bid landscape forecasting: tree-based, node splitting, survival modeling (slides) | [Link] | 2016 |
| Deep Landscape Forecasting for Real-time Bidding Advertising | Bid landscape forecasting (deep learning) that models sequential price patterns without assumptions about the winning-price distribution | [Link] | 2019 |
| Scalable Bid Landscape Forecasting in Real-time Bidding | Bid landscape forecasting (censored regression) under some simplifying assumptions | [Link] | 2020 |
| Arbitrary Distribution Modeling with Censorship in Real-Time Bidding Advertising | Neighborhood Likelihood Loss in bidding landscape forecasting problem | [Link] | 2021 |
More attention is paid to model-free RL algorithms, since they are better suited to bid optimization in RTB. Their advantage lies not only in approaching the theoretically optimal bidding strategy, but also in avoiding the heavy computational cost of simulating an extremely dynamic, non-stationary environment.
| Title | Algorithm | Link | Year |
|---|---|---|---|
| Technical Note: Q-learning | Q-learning | [Link] | 1992 |
| Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning | REINFORCE | [Link] | 1992 |
| On-line Q-learning Using Connectionist Systems | SARSA | [Link] | 1994 |
| Policy Gradient Methods for Reinforcement Learning with Function Approximation | Policy Gradient (PG) | [Link] | 1999 |
| Actor-Critic Algorithms | Actor-Critic (AC) | [Link] | 1999 |
| Playing Atari with Deep Reinforcement Learning | Deep Q-Network (DQN) | [Link] | 2013 |
| Deterministic Policy Gradient Algorithms | Deterministic Policy Gradient (DPG) | [Link] | 2014 |
| Continuous Control with Deep Reinforcement Learning | Deep Deterministic Policy Gradient (DDPG) | [Link] | 2015 |
| Asynchronous Methods for Deep Reinforcement Learning | Asynchronous Advantage Actor-Critic (A3C), Advantage Actor-Critic (A2C) | [Link] | 2016 |
| Deep Reinforcement Learning with Double Q-learning | Double Q-learning (Double DQN) | [Link] | 2016 |
| Dueling Network Architectures for Deep Reinforcement Learning | Dueling Network Architectures (Dueling DQN) | [Link] | 2016 |
| Prioritized Experience Replay | Prioritized Experience Replay (PER) (DQN replay buffer improvement) | [Link] | 2016 |
| Asynchronous Methods for Deep Reinforcement Learning | N-step Q-learning (N-step DQN) | [Link] | 2016 |
| Proximal Policy Optimization Algorithms | Proximal Policy Optimization (PPO) | [Link] | 2017 |
| Hindsight Experience Replay | Hindsight Experience Replay (HER) (wrapper) | [Link] | 2017 |
| Rainbow: Combining Improvements in Deep Reinforcement Learning | Rainbow (combination of DQN improvements) | [Link] | 2017 |
| A Distributional Perspective on Reinforcement Learning | C51 (Categorical DQN) | [Link] | 2017 |
| Distributional Reinforcement Learning with Quantile Regression | Quantile Regression DQN (QR-DQN) | [Link] | 2017 |
| Implicit Quantile Networks for Distributional Reinforcement Learning | Implicit Quantile Networks (IQN, distributional generalization of DQN) | [Link] | 2018 |
| Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor | Soft Actor-Critic (SAC) | [Link] | 2018 |
| Noisy Networks for Exploration | Noisy Nets (NoisyNet) | [Link] | 2018 |
| Addressing Function Approximation Error in Actor-Critic Methods | Twin Delayed Deep Deterministic policy gradient (TD3) | [Link] | 2018 |
| Distributed Distributional Deterministic Policy Gradients | Distributed Distributional Deterministic Policy Gradients (D4PG) | [Link] | 2018 |
Model-free approach: getting rid of modeling complexity
As mentioned above, creating an accurate model of RTB auction dynamics is extremely difficult. The auction environment depends on many factors: the behavior of other bidders, changing user preferences, and auction mechanics that are often opaque. Building the explicit environment model required by model-based RL methods can therefore be prohibitively hard and is likely to be ineffective due to inevitable simplifications and inaccuracies.
In contrast, model-free RL methods learn directly from experience interacting with the environment, without the need for explicit modeling. The agent simply observes the state of the environment, takes actions (places bids), receives rewards (e.g. clicks, conversions or profits) and adjusts its strategy based on these observations. This makes model-free methods much more practical and flexible for application in complex and dynamic environments such as RTB auctions.
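As a toy illustration of this observe-act-learn loop, here is a sketch of an epsilon-greedy tabular Q-learning agent interacting with the hypothetical auction environment sketched earlier. A practical agent would use function approximation (DQN, SAC, TD3 and so on), so treat this purely as a schematic.

```python
# Toy model-free loop: epsilon-greedy tabular Q-learning over discretized
# observations and a coarse bid grid. Assumes the hypothetical
# SecondPriceAuctionEnv sketched above; real agents replace the table with a
# neural network.
import numpy as np
from collections import defaultdict

env = SecondPriceAuctionEnv(seed=0)
bid_levels = np.linspace(0.0, 10.0, 11)             # discrete bid grid
q = defaultdict(lambda: np.zeros(len(bid_levels)))  # Q[state][bid_index]
alpha, gamma, eps = 0.1, 0.99, 0.1

def discretize(obs):
    # Bucket the continuous observation into a small hashable state key.
    return tuple(np.round(obs, 1))

for episode in range(100):
    obs, _ = env.reset()
    state, done = discretize(obs), False
    while not done:
        # Behavior policy: epsilon-greedy over the bid grid.
        if np.random.random() < eps:
            a = np.random.randint(len(bid_levels))
        else:
            a = int(np.argmax(q[state]))
        next_obs, reward, terminated, truncated, _ = env.step(np.array([bid_levels[a]]))
        next_state = discretize(next_obs)
        # Q-learning bootstraps on the greedy next action, so it learns off-policy.
        td_target = reward + gamma * np.max(q[next_state]) * (not terminated)
        q[state][a] += alpha * (td_target - q[state][a])
        state, done = next_state, terminated or truncated
```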
Off-policy learning: learning from past experience and exploring new strategies
In addition, learning effectively in RTB auctions often requires using data collected in the past or data obtained by exploring different bidding strategies. Off-policy RL methods are well suited to this task.
On-policy methods, such as SARSA or Policy Gradient, learn directly from the experience gained from the agent’s current strategy. This means that to explore new strategies and improve the current one, the agent needs to constantly interact with the environment, generating new experience using the current strategy. This can be slow and inefficient, especially if exploring new strategies results in a temporary decrease in performance.
Off-policy methods, on the other hand, allow the agent to learn from the experience gained by any strategy, including past strategies or even random actions. This is achieved by separating the strategy used to collect data (behavior policy) and the strategy we are trying to optimize (target policy). This approach gives us a number of advantages:
- Efficient use of experience: we can use accumulated data (e.g. from past simulations or even real auctions) to train the agent, even if this data was obtained using other strategies.
- Experience replay: off-policy methods often use an experience replay buffer, where past transitions (state, action, reward, next state) are stored. The agent can revisit and learn from these past experiences many times, which significantly improves training efficiency and stability.
- Safer exploration of new strategies: we can explore new, potentially risky strategies by collecting data that can then be used to learn a more conservative and stable target strategy.
Of course, these algorithms have their own nuances and shortcomings (see Off-Policy Deep Reinforcement Learning without Exploration), but they remain well suited to our task.
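To make the experience-replay idea concrete, here is a minimal sketch of a replay buffer and how an off-policy learner would sample minibatches from it. The buffer layout is a common textbook pattern, and `update_q_network` is a hypothetical placeholder rather than a real API.

```python
# Minimal experience replay buffer: transitions collected by ANY behavior
# policy can be stored and replayed many times to train the target policy.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling; prioritized replay (PER) would weight by TD error instead.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage inside a training loop (schematic, update_q_network is a placeholder):
# buffer.push(state, action, reward, next_state, done)  # collected by any policy
# if len(buffer) >= batch_size:
#     batch = buffer.sample(batch_size)                 # reused many times
#     update_q_network(batch)                           # off-policy update
```

In practice, off-the-shelf off-policy implementations such as DQN, SAC or TD3 in Stable-Baselines3 already bundle such a buffer, so you rarely need to write one yourself.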
| Title | Algorithm | Link | Year |
|---|---|---|---|
| Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising | DQN | [Link] | 2018 |
| Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising | MADDPG | [Link] | 2018 |
| Real-Time Bidding with Soft Actor-Critic Reinforcement Learning in Display Advertising | SAC | [Link] | 2019 |
| A Dynamic Bidding Strategy Based on Model-Free Reinforcement Learning in Display Advertising | TD3 | [Link] | 2020 |
| Bid Optimization using Maximum Entropy Reinforcement Learning | SAC | [Link] | 2021 |
| Dynamic pricing under competition using Reinforcement Learning | DQN, SAC | [Link] | 2021 |
| Multi-Objective Actor-Critics for Real-Time Bidding in Display Advertising | DQN, A2C, A3C | [Link] | 2022 |
| Real-time Bidding Strategy in Display Advertising: An Empirical Analysis | DQN, TD3 | [Link] | 2022 |
| RTBAgent: A LLM-based Agent System for Real-Time Bidding | LLM agent | [Link] | 2025 |
| Title | Short description | Paper link | Download link | Year |
|---|---|---|---|---|
| Real-Time Bidding Benchmarking with iPinYou Dataset | The most popular dataset/benchmark. Advertising campaigns cover products from 9 different categories over 10 days in 2013. Contains 64.5M bids, 19.5M impressions and 14.79K clicks. Full dataset size approx. 5.6 GB. | [Link] | [Link] | 2014 |
| User Response Learning for Directly Optimizing Campaign Performance in Display Advertising | A large RTB dataset of advertising campaigns that ran for 10 days in 2016. Contains 402M impressions and 500K clicks. Full dataset size approx. 88.0 GB. | [Link] | [Link] | 2016 |
| Title | Short description | Link |
|---|---|---|
| OpenAI Spinning Up | Educational resource to help anyone learn Deep RL | [Link] |
| Stable-Baselines3 | PyTorch version of Stable Baselines, reliable implementations of RL algorithms | [Link] |
| Paper Collection of Real-Time Bidding | A collection of research and survey papers of RTB based display advertising techniques | [Link] |
| Deep RL with PyTorch | PyTorch implementation of various algorithms | [Link] |
| CleanRL | Single-file implementation of Deep RL algorithms with research-friendly features | [Link] |
| CORL | Single-file implementations of SOTA offline and offline-to-online RL algorithms | [Link] |
| EasyRL | PyTorch implementation of various algorithms | [Link] |
| Title | Short description | Link |
|---|---|---|
| OpenAI Spinning Up | Educational resource to help anyone learn Deep RL | [Link] |
| Stable-Baselines3 | Stable-Baselines3 Docs - reliable RL implementations | [Link] |
| Gym | Collection of reference environments (moved to Gymnasium) | [Link] |
| Gymnasium | Collection of reference environments | [Link] |
| Third-Party Environments (Gym list) | Collection of third-party environments | [Link] |
| Third-Party Environments (Gymnasium list) | Collection of third-party environments | [Link] |
- Dmitrii Frolov
The MIT License (MIT) Copyright © 2025 MTS ADTECH, LLC. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.