Any adult of a certain age knows the video game Pac-Man! It is a classic “pursuit-evasion” game, where the player maneuvers Pac-Man around a maze to gobble up as many dots as possible before being captured by a ghost. Such pursuit-evasion problems are found in the real world as well, such as applications to missile guidance systems. Unfortunately, learning to optimally coordinate pursuer behaviors so as to minimize time to capture of the evader is challenging because the action space for the pursuers can be quite large and some noisy information exists. Consequently, previous approaches have relied primarily on heuristics. However, we have developed a variant of Thompson Sampling for pursuit-evasion that performs favorably relative to competitors in terms of time-to-capture.
Thompson sampling is a online decision algorithm that balances the exploration and exploitation of the decision making process based on the Bayesian approach. At each time point in a pursuit-evasion problem, we have the following steps:
(i) computing the posterior distribution over the space of possible evader strategies and the posterior distribution of the evader’s location;
(ii) sampling a strategy from the posterior distribution; and
(iii) using the model-based method in reinforcement learning to estimate the optimal pursuer strategy.
The proposed algorithm performs favorably relative to competitors in terms of time-to-capture in a suite of simulation experiments including pursuit-evasion over a grid and the coordination of ghost behavior in the class arcade game Pac-Man.
Zhen is a PhD candidate working with Laber Labs. His research interests include optimal treatment regime, machine learning, and Bayesian Inference . We thought this posting was a great excuse to get to know a little more about him, so we we asked:
Q: Explain what a p-value is, from the perspective of a deranged professor who’s clearly had enough, can’t take it any more, and is having a career-ending meltdown in front of their stoic, uncomprehending students.
A:We want to evaluate the free throw percentage of Jack, who loves playing basketball. We are interested in whether his free throw percentage is over 50%. Thus, we can set the null hypothesis that Jack’s free throw percentage is not greater than 50%. For simplicity, we want him to try two free throws. We consider two cases as follows: (i) Jack successfully puts the two in or (ii) Jack misses both free throws. The p-value is the probability Jack can get the same or better result than his current result when his free throw percentage is 50%. For the case (i), the same or better result than he has made two shots is only he makes the two shots. If his free throw percentage is 50%, then the probability he makes the two shots is 50%x50%=25%. So the p-value in case (i) is 25%. For the case (ii), every shooting result is the same or better than he has missed two shots. So, the probability he gets the same or better result is 100%. So the p-value in case (ii) is 100%. The p-value is used to help us decide if we accept our null hypothesis (i.e. Jack’s free throw percentage is not greater than 50% here). We should always remember a rule: the smaller the p-value is, we are less likely to accept the null hypothesis. Thus in the case (i), as the p-value is 25% (relatively small), we may not like to accept the null hypothesis and we think Jack’s free throw percentage is over 50%. In the case (ii), as the p-value is 100% (very big), we should accept the null hypothesis and we think Jack’s free throw percentage is not greater than 50%. We can see that the use of p-value matches to our intuition to judge.
That’s all. I’m ready to be fired.