Exploration and Exploitation Trade-off

Jesse Clifton, PhD Student 

Most people want to eat as many delicious meals as possible throughout the course of their life. What’s the optimal strategy for accomplishing this goal? Every time you decide on a restaurant, you have a choice between going to the best restaurant that you know or trying somewhere new. If you always go to your favorite restaurant and never try anything new, you’re likely to miss out on even better dishes at new places you’ve never been to. But if you always try a new restaurant, you’ll eat a lot of meals that aren’t as good as your current favorite. Maximizing the number of delicious meals over a lifetime means balancing this trade-off.

The exploration-exploitation trade-off is a dilemma we face in sequential decision-making: each time we have a decision to make, should we try to learn more about the world or stick with what we currently think is the best decision? Acting to learn (exploration) gives us more information to help achieve our goals in the long run, but we lose out on the gains from going with our current best guess. When we exploit, we give up the chance to learn something new.

The exploration-exploitation trade-off arises in many problems studied in Laber Labs: decision-making in artificial intelligence, the design of optimal medical treatment regimes, and the prevention of disease spread are a few examples. I’m currently researching exploration and exploitation in cases where there is a huge number of choices available to the actor. For example, when public health decision-makers were trying to stop the spread of the recent Ebola epidemic, they had to decide, given limited resources, whether to treat each of dozens or hundreds of locations. The possible combinations of treat or not-treat decisions across all locations add up to an astronomical number, so this is an example of a large action-space problem.
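To get a feel for how quickly the action space grows, here is a minimal sketch that counts the joint decisions. It assumes each location is simply treated or not treated and ignores the resource limits mentioned above; the function name `num_joint_decisions` is made up for illustration.

```python
def num_joint_decisions(num_locations: int) -> int:
    """Count the treat / not-treat assignments across all locations.

    Assumption: every location is an independent binary choice and there
    is no budget constraint, so the count is 2 ** num_locations.
    """
    return 2 ** num_locations


# "Dozens or hundreds" of locations: the count explodes very quickly.
for n in (12, 50, 100):
    print(f"{n} locations -> {num_joint_decisions(n)} possible joint decisions")
```

A budget that caps the number of treated locations would shrink this count, but for budgets of realistic size it stays far too large to enumerate.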

To explore effectively in large action spaces, I’m looking into variants of an old technique called Thompson sampling. In Thompson sampling, we maintain a probability distribution over the parameters of candidate models of the environment, expressing our uncertainty about which model is correct, and we update it continually as data arrive. In order to explore, we sample one model from this probability distribution and make the best decision we can, acting as if this model were true. The same procedure also exploits effectively: as we get more data, the distribution concentrates on the most accurate models, so the sampled models lead to increasingly good decisions.
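The sample-then-act loop described above can be sketched on a toy version of the restaurant problem. This is only an illustration under simplifying assumptions, not the large action-space variant I am researching: each restaurant serves a delicious meal with a fixed, unknown probability, and our uncertainty about each probability is a Beta distribution that starts out uniform. The names `thompson_sampling` and `true_probs` are made up for this example.

```python
import random


def thompson_sampling(true_probs, num_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling on a toy multi-armed bandit.

    true_probs: hidden chance of a delicious meal at each restaurant
                (used only to simulate outcomes, never seen by the learner).
    Returns (visits per restaurant, total number of delicious meals).
    """
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [0] * k  # delicious meals observed at each restaurant
    failures = [0] * k   # disappointing meals observed at each restaurant
    total_reward = 0

    for _ in range(num_rounds):
        # Explore: draw one plausible model from the current posterior.
        # Beta(successes + 1, failures + 1) is the posterior under a
        # uniform prior on each restaurant's success probability.
        sampled = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(k)
        ]
        # Act as if the sampled model were true.
        choice = max(range(k), key=lambda i: sampled[i])

        # Observe the outcome and update our uncertainty.
        reward = 1 if rng.random() < true_probs[choice] else 0
        if reward:
            successes[choice] += 1
        else:
            failures[choice] += 1
        total_reward += reward

    visits = [successes[i] + failures[i] for i in range(k)]
    return visits, total_reward


visits, meals = thompson_sampling([0.1, 0.9], num_rounds=2000)
print("visits per restaurant:", visits)
print("delicious meals:", meals)
```

Early on the posteriors are wide, so the sampled values vary a lot and both restaurants get tried. As the counts grow, the samples for the better restaurant are almost always larger, and the visits shift toward it without any separate switch between exploring and exploiting.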

Continuing the Ebola example, our models might be epidemiological models of how disease spreads between locations. As we observe the actual spread of disease, we update our uncertainty (probability distribution) over the parameters of these disease models. Each time we need to make a decision, we sample a single model from this probability distribution and try to act optimally according to this sampled model.

So much for our brief introduction to Thompson sampling. While the techniques of formal sequential decision-making may be less relevant to our everyday lives, the exploration-exploitation trade-off crops up in many of the decisions we make under uncertainty. Simply being aware of the costs and benefits of exploring and exploiting may help you to maximize your own payoffs in the long run.


Jesse is a PhD Candidate whose research interests include reinforcement learning and artificial intelligence. His current research focuses on finite approximations to infinite-horizon decision problems. We asked a fellow Laber Labs colleague to ask Jesse a probing question:

Q: Suppose you made a significant discovery in the course of your research that could lead to the development of an Artificially Intelligent Digital Assistant (AIDA), which could result in medical breakthroughs that we have up until now only been able to dream about. However, there’s a 0.01% chance that AIDA could develop a mind of HER own, work toward the annihilation of the human race, and succeed. Would you publish your research, or would you destroy it so that it never sees the light of day? Perhaps the discovery of a cure for cancer is worth the risk?

A: Assuming we ought to maximize expected value, consider that the expected number of lives saved by not turning on AIDA is 0.0001 * (expected number of people ever to live if AIDA doesn’t destroy the world), since a 0.01% chance is a probability of 0.0001. The second factor is astronomically large, given that if civilization survives it may spread through the galaxy and beyond and persist until the heat death of the universe. This dwarfs the good that would come from medical breakthroughs, unless we expect these medical breakthroughs to be a necessary condition for civilization’s colonization of the universe.
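The comparison can be written out as a rough sketch. The symbols are introduced here only for illustration: p is the question's 0.01% chance of catastrophe, N is the expected number of people ever to live if AIDA doesn’t destroy the world, and B is the expected number of lives saved by the medical breakthroughs.

```latex
p = 0.01\% = 10^{-4}, \qquad
\underbrace{p \, N}_{\text{expected lives lost by publishing}}
\;>\;
\underbrace{B}_{\text{expected lives saved by publishing}}
\quad\Longleftrightarrow\quad
N > \frac{B}{p} = 10^{4} \, B .
```

Under these simplifying assumptions, destroying the research has the higher expected value whenever N is more than ten thousand times B.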

This leaves out some considerations, such as scenarios where an AIDA-like discovery is made by someone else even if I don’t share my findings. But altogether, on the (debatable) assumption that saving astronomically many lives in expectation is good, I would destroy my research.
