Connection between Causal Inference and Off-Policy Methods in Reinforcement Learning

Lili Wu, PhD Student

Lately I have been kind of excited, because I suddenly realized that there are some amazing similarities between causal inference and off-policy methods in reinforcement learning. Who “stole” from whom?

From some of the previous blogs, you may already know more or less about reinforcement learning (for example, from Lin’s blog about reinforcement learning in education), which is now very popular in artificial intelligence. Reinforcement learning tries to find a sequential decision strategy that maximizes the long-term cumulative reward. Okay, now I am going to jump straight to introducing what off-policy is!

To understand “off-policy” it is probably easiest to start with “on-policy.” First, a policy is a “mapping” from a state to an action. For example, clinicians use a patient’s health information (the “state”) to recommend a treatment (the “action”). Clearly, in this example, we care about how good a policy is at determining the best action. But how do we evaluate that? One way is to follow the policy, record the outcomes, and then use some measure to evaluate how good the policy is. Such a procedure is called “on-policy”: the policy being evaluated is the one that is always followed.

However, it may not be possible to always follow a policy. For instance, clinicians cannot give some treatments to patients because the treatment may be dangerous; some clinicians may only follow a conservative policy; or we may only have access to observational data that did not follow a specific policy. Then “off-policy” plays a role! It deals with the situation where we want to learn about one policy (we call it the “target policy”) while following some other, different policy (the “behavior policy”). Most off-policy methods use a general technique known as “importance sampling” to estimate the expected value of the target policy. The importance sampling ratio is the probability of a trajectory under the target policy divided by its probability under the behavior policy. We can use it to reweight the outcomes observed under the behavior policy to estimate what they would have been had we followed the target policy, and thus measure the “goodness” of the target policy. (For this to work, the behavior policy must give positive probability to every action the target policy might take.)
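To make the reweighting idea concrete, here is a minimal sketch in Python. Everything in it is a made-up toy setup for illustration: the state is a single number between 0 and 1, the behavior policy treats with probability 0.3 no matter what, the target policy treats only when the state is above 0.5, and the outcome model in `reward` is invented. None of these come from a real study.

```python
import random

def behavior_prob(action, state):
    # Hypothetical behavior policy: treat (action 1) with probability 0.3,
    # regardless of the state.
    return 0.3 if action == 1 else 0.7

def target_prob(action, state):
    # Hypothetical target policy: deterministically treat when state > 0.5.
    best = 1 if state > 0.5 else 0
    return 1.0 if action == best else 0.0

def reward(state, action, rng):
    # Invented outcome model: treatment gives mean outcome `state`,
    # no treatment gives mean outcome 0.5.
    mean = state if action == 1 else 0.5
    return mean + rng.gauss(0, 0.1)

def simulate(n, rng):
    # Collect one-step data by following the behavior policy.
    data = []
    for _ in range(n):
        s = rng.random()
        a = 1 if rng.random() < 0.3 else 0
        data.append((s, a, reward(s, a, rng)))
    return data

def importance_sampling_value(data):
    # Reweight each observed outcome by target prob / behavior prob.
    total = 0.0
    for s, a, r in data:
        total += target_prob(a, s) / behavior_prob(a, s) * r
    return total / len(data)

def trajectory_ratio(trajectory):
    # For a multi-step trajectory of (state, action) pairs, the ratio is
    # the product of the per-step ratios.
    ratio = 1.0
    for s, a in trajectory:
        ratio *= target_prob(a, s) / behavior_prob(a, s)
    return ratio

rng = random.Random(0)
data = simulate(20000, rng)
print("average outcome under behavior policy:",
      sum(r for _, _, r in data) / len(data))
print("importance sampling estimate for target policy:",
      importance_sampling_value(data))
```

In this toy model the true value of the target policy works out to 0.625 by direct integration, while the behavior policy is worth 0.5, so the reweighted estimate should land near the former even though the target policy was never followed.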

Okay, if you know about causal inference, you may already be familiar with our patient example. You only have observational data, but you want to use that data to learn the effect of some idealized treatment. In the language of our AI method, we can regard the idealized treatment as the target policy and the treatment actually assigned in the observational data as the behavior policy. Now, can we reweight the outcomes? Yes! There is a method called “inverse probability weighting,” proposed by Horvitz and Thompson in 1952, which does the same thing as importance sampling methods: it reweights the outcomes! See, this is the connection between the two!
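Here is a small sketch of inverse probability weighting on the causal inference side, again in Python and again with an invented toy setup: the propensity function, the outcome model, and the true treatment effect of 0.5 are all assumptions made up for illustration. Notice that `ipw_mean(data, 1)` is just importance sampling where the target policy is “always treat” and the behavior policy is the propensity score.

```python
import random

def propensity(state):
    # Hypothetical known probability of being treated given the state:
    # sicker patients (larger state) are treated more often.
    return 0.2 + 0.6 * state

def simulate_observational(n, rng):
    # Invented outcome model with a true treatment effect of 0.5.
    data = []
    for _ in range(n):
        s = rng.random()
        a = 1 if rng.random() < propensity(s) else 0
        y = 1.0 + 2.0 * s + 0.5 * a + rng.gauss(0, 0.1)
        data.append((s, a, y))
    return data

def ipw_mean(data, treatment):
    # Horvitz-Thompson style estimate of the mean outcome if everyone
    # had received `treatment`: weight by 1 / P(observed treatment | state).
    total = 0.0
    for s, a, y in data:
        if a == treatment:
            p = propensity(s) if treatment == 1 else 1.0 - propensity(s)
            total += y / p
    return total / len(data)

def naive_difference(data):
    # Unweighted difference in group means, confounded by the state.
    treated = [y for _, a, y in data if a == 1]
    control = [y for _, a, y in data if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

rng = random.Random(0)
data = simulate_observational(50000, rng)
print("naive difference in means:", naive_difference(data))
print("IPW estimate of the treatment effect:",
      ipw_mean(data, 1) - ipw_mean(data, 0))
```

Because treatment depends on the state here, the naive comparison overstates the effect, while the reweighted estimate should come out close to the 0.5 built into the simulation.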

Nowadays, there are more and more connections between causal inference and reinforcement learning. Statisticians bring ideas from reinforcement learning, like Q-learning, into the causal inference framework, for example in dynamic treatment regimes. Computer scientists get inspired by work in causal inference to estimate policy value, for example with doubly robust estimators for off-policy evaluation. I am excited about these connections, because causal inference has a long history and has built up tons of good work, and because reinforcement learning is getting more and more attention and has a lot of interesting ideas. Can we get more inspiration from each other? I think this is an emerging area with a lot of possibilities to explore! I am looking forward to working on this and finding more!


Lili is a PhD candidate working with Laber Labs. Her research interests include reinforcement learning and local linear models. We thought this posting was a great excuse to get to know a little more about her, so we asked her:

Q: Do you have a motto?

A:

人生得意须尽欢,莫使金樽空对月。

(When hopes are won, oh! drink your fill in utmost delight.

And never leave your wine-cup empty in moonlight!)

天生我材必有用,千金散尽还复来。

(Heaven has made us talents, we’re not made in vain.

A thousand gold coins spent, more will turn up again.)

— from 《将进酒(Invitation to Wine)》by 李白 (Li, Bai)
