Pursuit-Evasion


Zhen Li, PhD Student

Any adult of a certain age knows the video game Pac-Man! It is a classic “pursuit-evasion” game, in which the player maneuvers Pac-Man around a maze to gobble up as many dots as possible before being captured by a ghost. Such pursuit-evasion problems are found in the real world as well, for example in missile guidance systems. Unfortunately, learning to optimally coordinate pursuer behaviors so as to minimize the time to capture of the evader is challenging because the action space for the pursuers can be quite large and the information available about the evader is noisy. Consequently, previous approaches have relied primarily on heuristics. However, we have developed a variant of Thompson sampling for pursuit-evasion that performs favorably relative to competitors in terms of time-to-capture.

Thompson sampling is an online decision algorithm that balances exploration and exploitation in decision making using a Bayesian approach. At each time point in a pursuit-evasion problem, we carry out the following steps:

(i) computing the posterior distribution over the space of possible evader strategies and the posterior distribution of the evader’s location;
(ii) sampling a strategy from the posterior distribution; and
(iii) using model-based reinforcement learning to estimate the optimal pursuer strategy given the sampled evader strategy.
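To make these steps concrete, here is a minimal Python sketch of one Thompson-sampling update for a pursuit-evasion problem. The candidate strategy set, the posterior update, and the planner are illustrative placeholders rather than the actual components used in our experiments:

```python
import numpy as np

# Hypothetical finite set of candidate evader strategies.
STRATEGIES = ["random_walk", "flee_when_close", "forage"]

def thompson_step(posterior, history, update_posterior, plan_pursuit):
    """One Thompson-sampling step for the pursuers.

    posterior        : dict mapping each strategy to its posterior probability
    history          : observed (state, pursuer actions, observation) tuples so far
    update_posterior : Bayesian update of the strategy/location posterior
    plan_pursuit     : model-based planner (e.g., value iteration) that returns
                       pursuer actions assuming a given evader strategy
    """
    # (i) update beliefs about the evader's strategy (and location) from data
    posterior = update_posterior(posterior, history)

    # (ii) sample one evader strategy from the posterior
    probs = np.array([posterior[s] for s in STRATEGIES])
    sampled = np.random.choice(STRATEGIES, p=probs / probs.sum())

    # (iii) plan the pursuers' actions as if the sampled strategy were true
    pursuer_actions = plan_pursuit(sampled, history)
    return posterior, pursuer_actions
```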

The proposed algorithm performs favorably relative to competitors in terms of time-to-capture in a suite of simulation experiments, including pursuit-evasion over a grid and the coordination of ghost behavior in the classic arcade game Pac-Man.

In the Pac-Man game implemented in JavaScript, the ghosts (pursuers) could follow one of several pursuit strategies. We devised three strategies for Pac-Man. The first is a pure random walk. The second is defined such that when the distance between Pac-Man and one of the ghosts is less than some value d, Pac-Man moves in the opposite direction of the closest ghost; otherwise Pac-Man moves uniformly at random. The third is the same as the second except that when the distance between Pac-Man and the closest ghost is at least d, Pac-Man moves towards the closest Pac-Dot with probability p and moves randomly otherwise. We compare our method with a search strategy in which the ghosts know Pac-Man’s exact location and select their actions to minimize the distance to Pac-Man. The results show that our method performs slightly better than this benchmark strategy. In future work, we plan to derive theoretical results for our algorithm and to extend it beyond pursuit-evasion to general multi-agent games and other applications. We expect to see the results soon!
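For illustration, here is a rough Python sketch of the evader strategies just described, ignoring walls and using Manhattan distance on the maze grid; the threshold d, the probability p, and the grid simplifications are stand-ins, not the game's actual JavaScript logic:

```python
import random

def distance(a, b):
    # Manhattan distance on the maze grid (a simplifying assumption).
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def evader_move(pacman, ghosts, dots, d=5, p=0.7, forage=False):
    """Sketch of the distance-threshold evader strategies described above."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    after = lambda m: (pacman[0] + m[0], pacman[1] + m[1])
    nearest_ghost = min(ghosts, key=lambda g: distance(pacman, g))
    if distance(pacman, nearest_ghost) < d:
        # Strategies 2 and 3: step away from the closest ghost when it is near.
        return max(moves, key=lambda m: distance(after(m), nearest_ghost))
    if forage and dots and random.random() < p:
        # Strategy 3 only: with probability p, step toward the closest Pac-Dot.
        nearest_dot = min(dots, key=lambda t: distance(pacman, t))
        return min(moves, key=lambda m: distance(after(m), nearest_dot))
    # Otherwise move uniformly at random (strategy 1 is this branch alone).
    return random.choice(moves)
```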

Zhen is a PhD candidate working with Laber Labs. His research interests include optimal treatment regimes, machine learning, and Bayesian inference. We thought this posting was a great excuse to get to know a little more about him, so we asked:

Q: Explain what a p-value is, from the perspective of a deranged professor who’s clearly had enough, can’t take it any more, and is having a career-ending meltdown in front of their stoic, uncomprehending students.

A: We want to evaluate the free throw percentage of Jack, who loves playing basketball. We are interested in whether his free throw percentage is over 50%. Thus, we can set the null hypothesis that Jack’s free throw percentage is not greater than 50%. For simplicity, we have him attempt two free throws. We consider two cases: (i) Jack makes both free throws, or (ii) Jack misses both free throws. The p-value is the probability that Jack gets a result as good as or better than his actual result if his free throw percentage were exactly 50%. In case (i), the only result as good as or better than making both shots is making both shots. If his free throw percentage is 50%, the probability he makes both shots is 50% x 50% = 25%, so the p-value in case (i) is 25%. In case (ii), every possible result is as good as or better than missing both shots, so the probability of a result at least as good is 100%, and the p-value in case (ii) is 100%. The p-value helps us decide whether to accept our null hypothesis (i.e., that Jack’s free throw percentage is not greater than 50%). We should always remember a rule: the smaller the p-value, the less inclined we are to accept the null hypothesis. Thus in case (i), since the p-value is 25% (relatively small), we may be reluctant to accept the null hypothesis, and we conclude that Jack’s free throw percentage is over 50%. In case (ii), since the p-value is 100% (very large), we should accept the null hypothesis, and we conclude that Jack’s free throw percentage is not greater than 50%. We can see that the use of the p-value matches our intuition.

That’s all. I’m ready to be fired.

The Connection between Causal Inference and Off-Policy Reinforcement Learning

Lili Wu, PhD Student

Recently I have been kind of excited because I suddenly realized that there are some amazing similarities between causal inference and off-policy methods in reinforcement learning. Which one “stole” from the other?

Through some of the previous blogs, you may already know more or less about reinforcement learning (such as Lin’s blog about reinforcement learning in education), which is now very popular in artificial intelligence. Reinforcement learning tries to find a sequential decision strategy that maximizes long-term cumulative reward. Okay, now I am going to jump straight to introducing what off-policy is!

To understand “off-policy” it is probably easiest to start with “on-policy.” First, a policy is a “mapping” from a state to an action. For example, clinicians use a patient’s health information (the “state”) to recommend treatment (the “action”). Clearly, in this example, we care about how good a policy is at determining the best action. But how do we evaluate that? One way is to follow the policy, record the outcomes, and then use some measure to evaluate how good the policy is. Such a procedure is called “on-policy” – the policy is always followed. However, it may not be possible to always follow a policy. For instance, clinicians cannot give some treatments to patients because the treatment may be dangerous; some clinicians may only follow a conservative policy; or we may only have access to observational data that did not follow a specific policy. Then “off-policy” comes into play! It deals with the situation where we want to learn about one policy (the “target policy”) while following some other policy (the “behavior policy”). Most off-policy methods use a general technique known as “importance sampling” to estimate the expected value. The importance sampling ratio is the relative probability of the trajectory under the target and behavior policies, so we can use it to reweight the outcomes observed under the behavior policy, estimate what they would have been had we followed the target policy, and thus measure the “goodness” of the target policy.
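Here is a minimal Python sketch of off-policy evaluation with importance sampling; the trajectories and the two policy-probability functions are placeholders you would supply, not part of any particular method discussed above:

```python
import numpy as np

def importance_sampling_value(trajectories, target_prob, behavior_prob):
    """Estimate the target policy's value from behavior-policy trajectories.

    trajectories  : list of trajectories, each a list of (state, action, reward)
    target_prob   : function (state, action) -> probability under the target policy
    behavior_prob : function (state, action) -> probability under the behavior policy
    """
    estimates = []
    for traj in trajectories:
        # Importance sampling ratio: relative probability of the trajectory
        # under the target versus the behavior policy.
        ratio = np.prod([target_prob(s, a) / behavior_prob(s, a)
                         for s, a, _ in traj])
        total_reward = sum(r for _, _, r in traj)
        # Reweight the observed return as if the target policy had been followed.
        estimates.append(ratio * total_reward)
    return float(np.mean(estimates))
```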

Okay, if you know about causal inference, you may already be familiar with our patient example. You only have observational data, but you want to use that data to learn the effect of some idealized treatment. In the language of our AI method — we can regard the idealized treatment as the target policy and the treatment actually assigned in the observational data as the behavior policy. Now – can we reweight the outcomes?? Yes! There is a method — “inverse probability weighting,” proposed by Horvitz and Thompson in 1952 — that does the same thing as importance sampling: reweight the outcomes! See, this is the connection between the two!
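And here is the causal-inference side of the same trick: a minimal inverse-probability-weighting sketch in the style of Horvitz and Thompson, again with hypothetical inputs (outcomes, treatment indicators, and an already-fitted propensity model):

```python
import numpy as np

def ipw_mean_outcome(outcomes, treated, propensity):
    """Estimate the mean outcome had everyone received the treatment,
    using observational data.

    outcomes   : observed outcome for each patient
    treated    : 1 if the patient actually received the treatment, else 0
    propensity : estimated probability of receiving treatment given covariates
    """
    outcomes, treated, propensity = map(np.asarray, (outcomes, treated, propensity))
    # Each treated patient is reweighted by 1 / P(treated | covariates) --
    # exactly the importance-sampling ratio for a single decision point.
    weights = treated / propensity
    return float(np.sum(weights * outcomes) / np.sum(weights))
```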

Nowadays, there are more and more connections between causal inference and reinforcement learning. Statisticians bring reinforcement-learning ideas like Q-learning into the causal inference framework, for example in dynamic treatment regimes. Computer scientists draw inspiration from causal inference to estimate policy values, for example the doubly robust estimator for off-policy evaluation. And I am excited about these connections: causal inference has a long history and a deep body of work, and reinforcement learning is getting more and more attention and is full of interesting ideas. Can the two fields get more inspiration from each other? I think this is an emerging area with a lot of possibilities to explore! I am looking forward to working on this and finding more!


Lili is a PhD candidate working with Laber Labs. Her research interests include reinforcement learning and local linear models. We thought this posting was a great excuse to get to know a little more about her, so we asked her:

Q: Do you have a motto?

A:

人生得意须尽欢,莫使金樽空对月。

(When hopes are won, oh! drink your fill in utmost delight.

And never leave your wine-cup empty in moonlight! )

天生我材必有用,千金散尽还复来。

(Heaven has made us talents, we’re not made in vain.

A thousand gold coins spent, more will turn up again.)

— from 《将进酒(Invitation to Wine)》by 李白 (Li, Bai)

Uncovering the Truth

Zekun (Jack) Xu, PhD Candidate

We almost never observe the absolute truth. In fact, there are entire industries driven by the single motivation of distorting it! How many times a day do you hear “You can look years younger!”? However, the distortion is not always intentional – has your GPS ever shown you driving through a nearby field? Or maybe your FitBit didn’t realize your leisurely walk was not a nap? Such distortions happen all of the time, and it can be hard to know what is true.

In our day to day lives, we, as the observers, must recognize that what we see or hear is not all that is there. We must continuously dig deeper to understand what is real. Frankly, it can be exhausting!

Fortunately, in science, there are tools that help us estimate the truth based on what we observe! In statistics, a class of models has been developed for just this purpose — the latent (or hidden) state models. Most of the models in this class are based on either the so-called dynamic linear model or the hidden Markov model. Both models date back to the 1960s [1][2], but they are still popular. And, in my opinion, they are the coolest generative models!

In the dynamic linear model, we assume that the data we observe over time are a noisy realization of a latent true process. For instance, to monitor air quality the EPA records hourly concentrations of a variety of airborne pollutants (O3, NO2, etc.). However, due to measurement error from the device and its operation, the recorded data are a noisy version of the true values. A similar example is a navigation system, where the GPS coordinates from the satellite deviate from the actual coordinates to some extent – thus your appearing to drive through a corn field!
In both cases, a dynamic linear model can be used to filter out the noise and perform predictions.
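As a rough illustration of the idea, here is a one-dimensional Kalman filter in Python for a local-level dynamic linear model; the noise variances and the simulated data are made up, and this is of course far simpler than the models used for real air-quality or GPS data:

```python
import numpy as np

def kalman_filter_1d(observations, obs_var=1.0, state_var=0.1):
    """Filter a noisy series under a local-level dynamic linear model:
        truth_t = truth_{t-1} + state noise,  observed_t = truth_t + obs noise.
    Returns filtered estimates of the latent true process."""
    estimate, variance = observations[0], obs_var   # crude initialization
    filtered = []
    for y in observations:
        # Predict: the latent state drifts, so our uncertainty grows.
        variance += state_var
        # Update: blend the prediction with the new noisy observation.
        gain = variance / (variance + obs_var)
        estimate += gain * (y - estimate)
        variance *= (1 - gain)
        filtered.append(estimate)
    return np.array(filtered)

# Example: recover a smooth signal from noisy hourly readings.
truth = np.cumsum(np.random.normal(0, 0.1, 200))
noisy = truth + np.random.normal(0, 1.0, 200)
smoothed = kalman_filter_1d(noisy)
```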


In the hidden Markov model, we assume that there is more than one underlying data-generating mechanism, the so-called states. For instance, consider physical activity data gathered through wearable devices like FitBit and iWatch. Those data do not contain activity labels; they provide only the intensity measured during the wearing time, which is driven by the actual activity state. In fact, one of my current research projects is to identify interesting patterns in human activity data measured by wearable devices worn continuously. Based on those data, we want to be able to determine whether a subject is doing high intensity activity (e.g., running), medium intensity activity (e.g., walking), or low intensity activity (e.g., resting) during different times of the day. We can use this information to compare lifestyles across subjects. This is an interesting topic, especially in this era of artificial intelligence. For example, we might be able to build “smart” wearable devices based on this model that provide personalized suggestions regarding healthy lifestyle choices. This framework is useful both for prediction and for modeling the effect of activity on health outcomes.
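To give a flavor of how the latent activity states might be recovered, here is a small numpy Viterbi decoder for a three-state (rest / walk / run) Gaussian hidden Markov model. The transition matrix, means, and standard deviations are invented for illustration; they are not estimates from real wearable-device data:

```python
import numpy as np
from scipy.stats import norm

# Invented parameters for a 3-state HMM over activity intensity.
states = ["rest", "walk", "run"]
trans = np.array([[0.90, 0.08, 0.02],     # rest tends to stay rest
                  [0.10, 0.80, 0.10],
                  [0.05, 0.15, 0.80]])
means, sds = np.array([1.0, 4.0, 9.0]), np.array([0.5, 1.0, 1.5])
init = np.array([0.6, 0.3, 0.1])

def viterbi(intensity):
    """Most likely state sequence given observed intensities (log-space Viterbi)."""
    emit = norm.logpdf(np.asarray(intensity)[:, None], means, sds)  # T x 3
    log_trans, log_init = np.log(trans), np.log(init)
    score = log_init + emit[0]
    back = np.zeros((len(intensity), 3), dtype=int)
    for t in range(1, len(intensity)):
        cand = score[:, None] + log_trans      # score of each possible transition
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    path = [int(score.argmax())]
    for t in range(len(intensity) - 1, 0, -1): # backtrack through best choices
        path.append(back[t, path[-1]])
    return [states[s] for s in reversed(path)]

print(viterbi([1.2, 0.8, 3.5, 4.2, 8.7, 9.4, 4.1, 1.1]))
```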

To paraphrase from Plato’s idealism, truth is an abstraction of the external world that we live in. All that we observe is a projection of the ideal world into reality. It is great to have some tools that aim to uncover the truth from the observation, but care must be taken regarding when those tools are applicable.

[1] Kalman, Rudolph Emil. “A new approach to linear filtering and prediction problems.” Journal of Basic Engineering 82.1 (1960): 35-45.

[2] Baum, Leonard E., and Ted Petrie. “Statistical inference for probabilistic functions of finite state Markov chains.” The Annals of Mathematical Statistics 37.6 (1966): 1554-1563.


Jack is a PhD candidate working with Laber Labs. His research interests include wearable computing and hidden Markov models. Currently he is working on hidden Markov models with applications in veterinary data. We thought this posting was a great excuse to get to know a little more about him, so we asked him:

Q: What are the five qualities that great PhDs and great artists share?

A:

  1. Honesty. Falsification, fabrication, and plagiarism are despicable in both professions.
  2. Curiosity. Great PhDs and great artists are highly motivated to ask questions and seek answers.
  3. Creativity. Groundbreaking work almost always originates from out-of-the-box thinking.
  4. Detail orientation. Great PhDs and great artists demand perfection in every teeny-tiny detail in their work.
  5. Perseverance. Success in both professions requires persistence in spite of obstacles and setbacks.


The Two-Outcome Problem

Daniel Luckett, PhD
Post-doctoral Fellow, UNC

If you’re like me, you often find yourself struggling to select an outfit for a party. Finding the garments to provide the right balance between comfort and style is a difficult task. If you focus all your attention on looking sharp, you’ll find yourself showing up to the party in a suit and tie, while if you focus all your attention on being comfortable, you’ll find yourself showing up in sweatpants. In the absence of a professional stylist to consult, striking the right balance is tricky. I often find myself settling on a compromise: I’ll wear jeans, but I’ll pick out my nicest pair (what some refer to as “grad student formal”).

The difficulty in this decision making process lies in the fact that there are two competing goals. While the optimal decision for each goal may be obvious (wear a suit for style and sweatpants for comfort), there’s no decision that is simultaneously optimal for both. The right balance between the two goals isn’t obvious and the optimal balance may vary across individuals. Even if you could ask an individual directly, most people wouldn’t be able to articulate how they weight the trade-off between style and comfort. In this case, it might seem that the best solution would be to hire a team of professional stylists and observe how they select outfits for a number of individuals, doing their best to balance style and comfort using their expertise. Then, one could try to emulate the decisions of the professionals.

Similar themes show up in medical decision making. A large body of statistical literature has focused on estimating decision rules for assigning treatment to optimize a clinical outcome. However, this idea creates a disconnect with what actually happens in the clinic; much like we all have to select outfits to balance the trade-off between comfort and style, physicians often must make treatment decisions to balance the trade-off between multiple outcomes. Suppose you were a mental health professional treating a patient with bipolar disorder. You know that prescribing an antidepressant may help your patient control their symptoms of depression. However, you’ve recently read research articles, like the one by Gabriele Leverich and colleagues, indicating that antidepressants may induce manic episodes [1]. The value that each patient places on symptoms of depression and symptoms of mania is unknown and may vary from patient to patient. How can we use data to learn decision rules for treatment that balance two outcomes in a meaningful way?

A recent project that I have worked on approaches the two-outcome problem through the lens of utility functions. We assume that there exists some unknown utility function (a possibly patient-dependent function of the two outcomes) that physicians seek to optimize, perhaps subconsciously, when selecting treatments. The physician will not always be able to assign a patient the best treatment for that patient’s utility function, but we can assume that they are successful with some probability. In observational data, where treatment decisions are not randomized, this assumption allows us to model clinician decisions, estimate a patient-specific utility function, and estimate an optimal decision rule for the estimated utility function.
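As a toy illustration of the idea (and not the actual method from the project), suppose the utility were a weighted average of the two outcomes, u = w*y1 + (1 - w)*y2, and suppose physicians pick the utility-optimal treatment with some fixed probability. One could then estimate w by maximizing the likelihood of the observed treatment decisions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_utility_weight(decisions, gain1, gain2, p_expert=0.8):
    """Toy estimator of the weight w in the utility u = w*y1 + (1 - w)*y2.

    decisions : observed treatments (1 = treat, 0 = don't), one per patient
    gain1     : predicted benefit of treating on outcome 1, per patient
    gain2     : predicted benefit of treating on outcome 2, per patient
    p_expert  : assumed probability the physician picks the utility-optimal
                treatment (a made-up fixed value for this sketch)
    """
    decisions, gain1, gain2 = map(np.asarray, (decisions, gain1, gain2))

    def neg_log_lik(w):
        # Treating is utility-optimal when the weighted benefit is positive.
        optimal_is_treat = (w * gain1 + (1 - w) * gain2 > 0)
        p_treat = np.where(optimal_is_treat, p_expert, 1 - p_expert)
        lik = np.where(decisions == 1, p_treat, 1 - p_treat)
        return -np.sum(np.log(lik))

    return minimize_scalar(neg_log_lik, bounds=(0.0, 1.0), method="bounded").x
```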

This idea represents a new way of thinking about observational data. Randomized controlled trials are widely considered the gold standard for medical research, and many statistical methods are designed to take observational data and apply transformations that allow us to perform analyses as if treatment decisions were randomized. However, the statistical method we’ve been developing for this project handles observational data differently: by recognizing that when treatment decisions are not randomized, there may be information to be gleaned from the decisions themselves. This can be viewed as a form of inverse reinforcement learning, where we observe decisions made by someone with expertise, attempt to discern the goals of the expert, and finally, attempt to learn policies that will achieve the expert’s goals. This idea is similar in spirit to imitation learning, covered in more detail in a previous post on this blog entitled “The Computer is Watching!” by Eric Rose.

We applied our method to data from the observational component of the Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD) study. By observing treatment decisions made in the clinic and assuming that physicians are implicitly acting with the intent to balance each patient’s depression symptom score and mania symptom score, we were able to construct a composite “depression-mania” score and estimate a decision rule for determining which patients should receive an antidepressant in order to optimize this composite score. We estimated that applying the resulting decision rule in a population would achieve a 7% improvement in the composite score compared to standard practice. Much like observing the actions of a professional stylist could help us all improve our fashion sense, in the future we may be able to use observed actions of experienced physicians to help us systematically construct better decision rules for assigning treatment.

[1] Leverich, G. S. et al., (2006). “Risk of switch in mood polarity to hypomania or mania in patients with bipolar depression during acute and continuation trials of venlafaxine, sertraline, and bupropion as adjuncts to mood stabilizers.” American Journal of Psychiatry, 163(2), 232-239.


Daniel recently completed his PhD in biostatistics at UNC! Congratulations, Daniel!! We thought this posting was a great excuse to get to know a little more about him, so we asked him:

Q: Provide a list of five do’s and don’ts that apply both to effective teaching in STEM and dealing with a wild bear.

A:
DO:

  1. Do be patient. Sometimes all you need to do is wait for a student’s  understanding to catch up or wait for the bear to wander away.
  2. Do document your experiences. You’ll want to study your notes to become a better teacher, and you’ll want that bear photo to show off to your friends.
  3. Do recognize that there are multiple approaches to problem solving. Do you play dead? Or fight back?
  4. Do report any concerns to a department chair/park ranger.
  5. Do look for ways to improve for next time. There might be other bears out there, and you have another lecture on Thursday.

DON’T:

  1. Don’t be distracted while on the job. Keep your attention on your students/surroundings.
  2. Don’t be afraid to use technology. You should use every resource you can to be successful.
  3. Don’t try to go it alone. Have your students work in teams, and hike with your friends.
  4. Don’t carry food with strong scents. It’s distracting for everyone.
  5. Don’t run. You can’t outrun a bear, and your students will be confused.

To the festival and back: A journey of fun and Statistics


Yeng Saanchi, PhD Candidate

The Science and Engineering Festival is a biennial event that attracts numerous exhibits from people in the Science, Technology, Engineering, and Mathematics (STEM) fields and is held at the Walter E. Washington Convention Center in Washington DC. The purpose of the festival is to enlighten children, especially teens, about the great work that is being done in STEM and to engender interest in these fields.

This year, the festival took place from April 6 to April 8. A few of us from Laber Labs took a trip to DC to exhibit a computer game called Laser Foxes, which was developed by some members of the lab. The game is built primarily on a statistical concept called classification — a term used for problems that involve predicting a label from among several possible labels based on some known features.

Now to a simple description of the game. As the name suggests, the game involves two foxes shooting lasers at each other while navigating a series of blocks, where the blue fox is controlled by the human player and the orange fox is controlled by the computer. A player could adopt one of four possible strategies at any point in the game. There is the Camper, who generally stays in one place; the Aggressor, who actively seeks out his opponent; the Forager, who searches for tools within the game to aid his mission; and the Evader, who actively avoids his opponent. Throughout the course of the game, the computer, or the Artificial Intelligence (AI), adjusts its strategy to counter what it predicts the human player’s current strategy to be. This prediction is based on the observed past moves of the human player. As the game goes on, the AI gets “smarter” at predicting and responding to the human’s strategy. In effect, the AI learns to beat the human player by observing the human’s plays.

The first day of the festival was a sneak peek and was open to only middle schoolers. The turnout was great. As soon as the doors opened, busloads of children, together with their teachers, disembarked at the entrance to the convention center. It was heartwarming to see how excited the kids were to play the game and how good some of them were at beating the AI, which is quite a feat in the game of Laser Foxes. Some of the children and many of the parents were actually quite interested in the theory behind the game and thought it was really cool to be able to create a computer game using Statistics. We gave out stuffed foxes as a memento as well as information cards with details about the American Statistical Association and Laber Labs. During brief lulls in the traffic of kids waiting to play the game, some of us played the game and even managed to win at the expert level. Alas, some of us still managed to lose every game at the easy level.

Attendance on the second day was overwhelming. This time the festival was open to the public. By early afternoon, we had run out of the over one thousand stuffed animals we had brought with us. Up until the very end of the day’s session, we had children playing the game, some of whom had to be convinced by their parents that they could return the next day before they agreed to leave. In between shifts, we took in some of the sights in DC since for some of us it was our first visit. My first sight of the city of Washington DC, with its primarily white-stone buildings, brought to mind Gondor, that make-believe city in Tolkien’s Lord of the Rings that was the prominent kingdom of the race of men in Middle-earth. It was impossible to see the museums because of the long queue of people waiting to enter at each one we attempted to visit. However, we managed to see the cherry blossom trees, the Lincoln Memorial, and the White House.

The third day of the festival dawned bright and clear and we were looking forward to another day of Laser Foxes with the kids. Our last day went really well. The turnout was remarkably good and the kids were as excited as ever. It was with mixed feelings that we packed up at the end of the day, sad that the festival was over but glad at the same time that for a few days we had been able to share our love of Statistics through a fun game called Laser Foxes. All in all, it was a trip to remember!


Yeng is a PhD Candidate whose research interests include predictive modeling and variable selection. We asked a colleague to ask Yeng a probing question —

Q: Classic good news / bad news situation. First, the good news. You’ve accidentally been buried alive in a casket with enough air to keep you alive for three days. Now the bad news. You’re bored and need to occupy your thoughts until you suffocate so you decide to write a short story about a squirrel who befriends an anthropomorphic acorn but struggles against eating him during the long winter when food supplies are running low. Tell us that story.

A:
Winter had lasted far longer than Mr. Toad, the squirrel, had anticipated. Ten months of snow and ice was something no one could have foreseen. “What have those infernal humans done this time to cause such a drastic imbalance in weather?”, he wondered for the umpteenth time. As he sat and contemplated this, he also wondered how he was going to survive if the weather did not warm up soon. He was down to his penultimate acorn. The last acorn happened to be a particularly fat one that he thought could last him at least a week. Three days went by and still winter persisted. Mr. Toad was now down to the shiny and seemingly delectable acorn he had saved for last. He stared at the acorn sadly, reluctant to mar its smooth surface with teeth marks, but alas, he was too hungry to resist for long. Muttering an apology under his breath, he opened his mouth to take a nibble. Just before his teeth could make contact with the acorn, he heard a squeaky voice say, “Please sir, I beg of you, don’t eat me.” To say Mr. Toad was startled would be a gross understatement. He was positively stupefied and not a little terrified. He had assumed all this while that he was the only living being in his dwelling. He knew without a doubt that the voice must have come from the acorn but it was too much. A living, breathing, talking acorn? Who had ever heard of such a thing? Not even his Grandpa Turtle, the famous squirrel who had sailed the seven seas with Captain Blood, had ever mentioned such a phenomenon. “What devilry is this”, he wondered? “Has hunger made me delusional?” After a few minutes of silence, he mustered courage and asked, “Who are you?” In a breathy, chocolaty voice, the acorn replied, “My name is Mr. Mulberry”. “How is it that you can talk?”, asked Mr. Toad. And so began the tale of how the acorn became anthropomorphic. But that is a story for another time. Suffice it to say that by the time Mr. Mulberry was done telling his tale, the ice had begun to melt. Spring had arrived! Mr. Toad and Mr. Mulberry, now fast friends, ventured outdoors for the first time in almost a year, breathing in the fresh smell of spring. The End.

This is Yeng’s second post! To learn more about her research, check out her first article here!

Exploration and Exploitation Trade-off

Jesse Clifton, PhD Student 

Most people want to eat as many delicious meals as possible throughout the course of their life. What’s the optimal strategy for accomplishing this goal? Every time you decide on a restaurant, you have a choice between going to the best restaurant that you know or trying somewhere new. If you always go to your favorite restaurant and never try anything new, you’re likely to miss out on even better dishes at new places you’ve never been to. But if you always try a new restaurant, you’ll eat a lot of meals that aren’t as good as your current favorite. Maximizing the number of delicious meals over a lifetime means balancing this trade-off.

The exploration-exploitation trade-off is a dilemma we face in sequential decision-making: each time we have a decision to make, should we try to learn more about the world or stick with what we currently think is the best decision? Acting to learn — exploration — gives you more information to help achieve your goals in the long run, but you lose out on gains from going with your current best guess. When you exploit, you give up the chance to learn something new.

The exploration-exploitation trade-off arises in many problems studied in Laber Labs: decision-making in artificial intelligence, design of optimal medical treatment regimes, and effectively preventing the spread of disease are a few examples. I’m currently researching exploration and exploitation in cases where there is a huge number of choices available to the actor. For example, when public health decision-makers were trying to stop the spread of the recent Ebola epidemic, they had to decide whether to treat (given limited resources) each of dozens or hundreds of locations. All possible combinations of decisions to treat or not-treat each location add up to an astronomical number of possible decisions, so this is an example of a large action-space problem.

To explore effectively in large action-spaces, I’m looking into variants of an old technique called Thompson sampling. In Thompson sampling, we maintain a probability distribution that expresses our uncertainty over models of the environment, and we continually update this distribution over the models’ parameters as data arrive. In order to explore, we sample one model from this probability distribution and try to make the best decision acting as if this model were true. However, we also exploit effectively, because — as we get more data — our probability distribution will concentrate on the most accurate models and, therefore, lead to reasonable decisions.

Continuing the Ebola example, our models might be epidemiological models of how disease spreads between locations. As we observe the actual spread of disease, we update our uncertainty (probability distribution) over the parameters of these disease models. Each time we need to make a decision, we sample a single model from this probability distribution and try to act optimally according to this sampled model.
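To see the mechanics in the simplest possible setting, here is a toy Thompson sampling loop for choosing among three locations with unknown treatment-success probabilities, a Beta-Bernoulli bandit. The success probabilities are made up, and a real application would replace them with an epidemiological model like the one described above:

```python
import numpy as np

rng = np.random.default_rng(0)
true_success = [0.2, 0.5, 0.7]     # unknown to the decision maker
alpha = np.ones(3)                 # Beta posterior parameters, one pair per action
beta = np.ones(3)

for t in range(1000):
    # Sample one "model" (a success rate for each action) from the posterior
    # and act as if it were true -- this is the explore/exploit balance.
    sampled_rates = rng.beta(alpha, beta)
    action = int(np.argmax(sampled_rates))
    # Observe the outcome and update the posterior for the chosen action.
    reward = rng.random() < true_success[action]
    alpha[action] += reward
    beta[action] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```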

So much for our brief introduction to Thompson sampling. While the techniques of formal sequential decision-making may be less relevant to our everyday lives, the exploration-exploitation trade-off crops up in many of the decisions we make under uncertainty. Simply being aware of the costs and benefits of exploring and exploiting may help you to maximize your own payoffs in the long run.


Jesse is a PhD Candidate whose research interests include reinforcement learning and artificial intelligence. His current research focuses on finite approximations to infinite-horizon decision problems. We asked a fellow Laber-Labs colleague to ask Jesse a probing question —

Q: Suppose you made a significant discovery in the course of your research that could lead to the development of an Artificial Intelligent Digital Assistant (AIDA) which could result in medical breakthroughs that we have up until now only been able to dream about. However, there’s a 0.01% chance that AIDA could develop a mind of HER own, work toward the annihilation of the human race and succeed. Would you publish your research or would you destroy it so that it never sees the light of day? Perhaps, the discovery of a cure to cancer is worth the risk?

A: Assuming we ought to maximize expected value, consider that the expected number of lives saved by not turning on AIDA is 0.0001 * (Expected number of people ever to live if AIDA doesn’t destroy the world). The latter is astronomically large, given that if civilization survives it may spread through the galaxy and beyond and persist until the heat death of the universe. This dwarfs the good that would come from medical breakthroughs, unless we expect these medical breakthroughs to be a necessary condition for civilization’s colonization of the universe.

This leaves out some considerations, such as scenarios where an AIDA-like discovery is made by someone else even if I don’t share my findings. But altogether, on the (debatable) assumption that saving astronomically many lives in expectation is good, I would destroy my research.

Modeling Illicit Network Behavior

Conor Artman, PhD Student 

Many TV shows and movies characterize the act of law enforcement chasing cybercriminals as a classic “cat and mouse” game, but it can be more akin to a “shark and fish”. If you’ve ever watched anything during Shark Week, or any other docu-drama on ocean biology (think, Blue Planet), you might have seen a birds-eye view of sharks repeatedly spear-bombing through massive clouds of fish. Each time they dive, the fish somehow form a shark-shaped gap, and from our vantage point, it looks like the shark doesn’t really catch anything. In reality, the shark scrapes by with a few of the stragglers in the school and dives time after time for more fish to eat. Likening law enforcement to the shark, this kind of pursuit and evasion more accurately depicts their efforts when discovering and unwinding a criminal network. Now imagine that the shark is doing this blind and may or may not start out in the middle of the school of fish. If we can’t see the fish, how can we find them? How can we catch them?

Working with the Lab for Analytic Sciences (LAS), we are using an approach known as agent-based models (ABMs) to help find answers to these questions and provide strategies for law enforcement to find and disrupt criminal networks. This project is a collaboration with specialists whose backgrounds range from anthropology and forensic psychology to experience as intelligence analysts in the FBI.

Agent-based modeling is a “bottom-up” approach. Typically, one defines an environment, agents, and behavioral rules for the agents. An “agent” is often defined to be the smallest autonomous unit acting in a system — this could be a cubic foot of air in the atmosphere, a single machine in a factory, or a single person in a larger economy. Agents follow rules that can be as simple as a series of if-else statements, with no agent adaptation, or as sophisticated as Q-learning, which allows each agent to learn over time. The goal is to recreate the appearance of complex real-world phenomena. We call the appearance of complex aggregate relationships arising from micro-level agents “emergent” behavior. The broad idea with ABMs is that we run many simulations, and once we attain stable emergent behavior in a particular simulation, we calibrate the ABM to data.
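Here is a bare-bones ABM loop in Python to show the ingredients: an environment (a small grid), agents, and a simple if-else behavioral rule, with a crude summary of the emergent pattern at the end. Everything in it is invented for illustration; it is not our criminal-network model:

```python
import random

GRID, N_AGENTS, STEPS = 20, 50, 200

# Each agent is just a grid position; its behavioral rule is `step` below.
agents = [(random.randrange(GRID), random.randrange(GRID)) for _ in range(N_AGENTS)]

def step(pos, others):
    """Toy if-else rule: if another agent is adjacent, move away from it;
    otherwise take a step uniformly at random."""
    x, y = pos
    neighbors = [(ox, oy) for ox, oy in others
                 if abs(ox - x) <= 1 and abs(oy - y) <= 1 and (ox, oy) != pos]
    if neighbors:
        ox, oy = neighbors[0]
        dx, dy = (1 if x >= ox else -1), (1 if y >= oy else -1)
    else:
        dx, dy = random.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

for _ in range(STEPS):
    agents = [step(a, agents) for a in agents]

# A crude "emergent" summary: how spread out did the agents end up?
print(len(set(agents)), "distinct cells occupied out of", GRID * GRID)
```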

But why do we have to jump to using ABMs in the first place? Can’t we use real-world data or try to set up natural experiments? Human self-awareness creates notoriously difficult problems for those trying to model human behavior. Richard Feynman bluntly illustrated this idea: “Imagine how much harder physics would be if electrons had feelings!” Feynman implies that the volatility in modeling human behavior stems from emotions, but I don’t believe that is really the problem. The problem is how decision-making processes in human beings change over time. Human beings learn to adapt to the task at hand, which isn’t inherently problematic — many systems in the physical sciences show adaptive behavior. The trouble comes in when human beings willfully improve their ability to learn. How should someone trying to come up with good rules for human behavior make adaptive rules for making rules? Ideally, social scientists would like to emulate the success of physical scientists by starting with simple rules and then trying to generate the behavior they are interested in observing from those rules. Unfortunately, this approach has had mixed success at best.

As a result, social scientists often take one of two approaches. The first is a “take what you can get” approach. Scientists build a statistical model based on their field’s theories. From there, they run these models on observational data (or data collected outside of a randomized experiment) to find empirical evidence for or against their theories. The downside is that disentangling the causes from the effects can be difficult in this approach. For instance, ice cream purchases and murder-rates have a strong positive correlation. But does it make sense to say ice cream causes murder, or murder causes ice cream? Of course not! There’s an effect that’s common to both of them that makes it look like they’re related: heat. As heat increases in cities, murder rates often increase and so do ice cream purchases. Statisticians refer to this concept as “confounding,” where ice cream purchases and murder rates in this example are said to be “confounded” with heat. As a result, if we don’t have the correct insights or the right data, observational data can be confounded with many things at once, and we may not be able to tell.

The second approach uses experiments to formalize questions of cause and effect. The rationale is that the world is a perfect working model of itself. So, the problem isn’t that we do not have a perfect working model but that we do not understand the perfect working model. This means that if we could make smaller working versions of the world in a controlled experimental setting, then we should be able to make some ground in understanding how the perfect model works.

However, for studying illicit networks, we often do not have good observational data available: criminals don’t want to be caught, so they try to make their activities difficult to observe. Similarly, it is usually impossible to perform experiments. To illustrate, if we wanted to study the effect of poverty on crime, there is no way for us as scientists to randomize poverty in an experiment.

A third approach says to simply try to simulate! If you can construct a reliable facsimile of the environment in which your phenomenon exists, then the data generated from the facsimile may help your investigation. In some environments, this works great. For instance, in weather forecasting simulations climatologists can apply well-developed theories of atmospheric chemistry and physics to get informative simulated data. Unfortunately, this may not be a great deal of help if we do not already have a strong theoretical foundation from which to work.

As a result, we try to pool information from experts and the data we have available to build a simplified version of criminal agents, and we make tweaks on our simulation until it produces data that look similar to what we see in the real world (as told by our content specialists and the data we have available). From there, we do our best to make informed decisions based on the results of the simulation. ABMs have their own issues, and they may not be the ideal way to look at a problem. But, we hope that they’ll give us insights into what an optimal strategy for finding and disrupting networks may look like to prevent crime in the future.


Conor is a PhD Candidate whose research interests include reinforcement learning, dynamic treatment regimes, statistical learning, and predictive modeling. His current research focuses on pose estimation for predicting online sex trafficking. We asked a fellow Laber-Labs colleague to ask Conor a probing question —

Q: Sleeping Beauty volunteers to undergo the following experiment and is told all of the following details: On Sunday she will be put to sleep. Once or twice during the experiment, Beauty will be awakened, interviewed, and put back to sleep with an amnesia-inducing drug that makes her forget that awakening. A fair coin will be tossed to determine which experimental procedure to undertake:

  • If the coin comes up heads, Beauty will be awakened and interviewed on Monday only.
  • If the coin comes up tails, she will be awakened and interviewed on Monday and Tuesday.

In either case, she will be awakened on Wednesday without interview and the experiment ends. Any time Sleeping Beauty is awakened and interviewed she will not be able to tell which day it is or whether she has been awakened before. During the interview Beauty is asked: “What is the probability that the coin landed heads?” What would your answer be? Please explain.

A: I think you could approach this problem from lots of perspectives, depending on how you conceptualize randomness and uncertainty, and how you conceptualize how people actually think versus what we say they should think.

On one hand, speaking purely from the perspective of Sleeping Beauty, I think there’s an argument to be made that you could claim that the probability is still just ½. If, from the perspective of Sleeping Beauty, you do not gain any information from being awakened, you could say, “Well, it was 50-50 before we started, and since I get no information, it’s equivalent to asking me this question before we even started the experiment.” On the other hand, you could try to think of this experiment in terms of a long-term repeated average, or you could even think of it in a more Bayesian way, so I think the point of this question is to give an example of the tension that exists between human heuristic reasoning about uncertainty and precisely converting that intuition into useful statements about the world. (So that’s neat.)

If you ask me what I would personally think, given that I’ve just presumably awakened in some place where I can’t tell what day it is, I might say, “Well, I know I’m awake with an interviewer, so it’s definitely Monday or Tuesday. From my perspective, I can’t tell the difference between awakening on Monday via heads, Monday via tails, or Tuesday via tails. So, from my perspective, there’s only one version of the world corresponding to heads of these three versions, so if you absolutely must have me give a guess for the probability of heads for this experiment to continue, I think one reasonable guess is 1 out of 3.”

The Computer Is Watching!

Eric Rose, PhD Candidate

In a previous Laber Labs blog post, Marshall Wang discussed using reinforcement learning to teach an AI to play the game Laser Cats, which has since been rebranded as Space Mice. In this scenario, the AI knows nothing about the mechanics of the game but is able to learn to play far more effectively than any human player. To achieve this level of skill, the computer player records the current state of the game, the action it took, and a resulting reward or penalty based on the resulting next state of the game. The computer then learns the optimal strategy for maximizing its reward.

However, in some complex situations this approach can be difficult for a computer. For example, it might be difficult to design a reward function, or the decision about which action is best may be very complicated. In situations where a human can play close to optimally, it can be far easier for a computer to learn to replicate how a human player plays the game. This is called imitation learning!

Let’s use the game Flying Squirrel (playable at http://www.laber-labs.com/games/flying-squirrel/) to demonstrate how imitation learning can be used to teach a computer player how to play a game. In this game, the player controls the squirrel and is playing against a clock. The goal is to traverse the hills as fast as possible so you can complete the level before you run out of time. At first, you have only one possible choice: to dive or not to dive. Diving adds a downward force to the squirrel. (Insider tip: to move as fast as possible, you want to dive so that you land on the downward sloping part of the hill, continue diving as you move downhill, and then release the dive button right before you reach the uphill so that your momentum lets you jump off the hill and fly through the air. It’s pretty cool!) As you move through the game you gain special abilities and obstacles appear that you need to dodge, but we will limit our discussion to the simplest case.

In imitation learning, the computer is watching! It is recording features in each frame that summarize the current state of the game (such as how high you are above the ground, your velocity, direction, and features summarizing the shape of the hill in front of you) as well as the action you, the human player, took in this state. This record keeping changes the problem of how to play the game to one of classification. The computer uses the state of the game as input, explores its database of user experiences, and outputs the choice to dive or not to dive. Many classification algorithms can be used for this problem. For Flying Squirrel, we used k-nearest neighbors to teach the computer to play. This works by taking the current state of the game and finding the k past states that are most similar to the current state. We can then look at what action was chosen by the human in each of these past states and choose the action that was most common among those k states.
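Here is a minimal sketch of that classification step using scikit-learn's k-nearest-neighbors classifier; the state features (height above ground, vertical velocity, slope of the upcoming hill) and the logged choices are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical logged data: one row per frame of human play.
# Columns: height above ground, vertical velocity, slope of the upcoming hill.
states = np.array([[12.0, -1.5, -0.4],
                   [ 3.0, -0.2, -0.1],
                   [ 8.0,  2.0,  0.5],
                   [ 1.0,  0.5,  0.6],
                   [15.0, -3.0, -0.7]])
actions = np.array([1, 1, 0, 0, 1])   # 1 = dive, 0 = don't dive

# "The computer is watching": fit k-NN to the human's recorded choices.
imitator = KNeighborsClassifier(n_neighbors=3)
imitator.fit(states, actions)

# At play time, find the k most similar past states and copy the most
# common choice the human made in them.
current_state = np.array([[10.0, -1.0, -0.5]])
print("dive" if imitator.predict(current_state)[0] == 1 else "glide")
```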

To see this in action, you can play the game and switch to watching a computer player play using imitation learning based on the data that was just collected on you!


Eric is a PhD Candidate whose research interests include machine learning and statistical computing. His current research focuses on sample size calculations for dynamic treatment regimes. We asked a fellow Laber-Labs colleague to ask Eric a probing question —

Q: Imagine that you’re a game of thrones character – what is your house name, its sigil, and the house motto?
A: I guess my house name would be House Rose since I’m pretty sure they’re all the same as the family name. I also didn’t read the books and have only seen a couple of episodes so that may not even be true. Our sigil would contain just a vinyl record. If my imaginary Game of Thrones character is anything like the real me, music is probably going to be a huge part of his life. Then again if he was anything like the actual me he probably wouldn’t survive for very long. Our house motto would be along the same lines and would be a line from one of my favorite songs by the Drive-By Truckers that I also think could apply to Game of Thrones. “It’s great to be alive!”

This is Eric’s second post! To learn more about his research, check out his first article here!

Facing Missing Data


Lin Dong, PhD Candidate

This is a detour from my last post about education. It turns out that I have been working on a project about sequential decision making in the face of missing data for several months, so why not talk about that. Missing data arise in all sorts of settings. For sequential data, like data from sequential multiple assignment randomized trials (SMARTs), the problem is that patients often drop out. Q-learning and other techniques for estimating an optimal strategy cannot be applied directly to data sets containing missing values, so we need a way to get around them.

The first question we may ask is – why is missing data a problem? Further, can we just throw out the missing entries? Why do people care so much about it and develop sophisticated methods to deal with it? Things are not that simple.

Missing data are not a big issue if the data are missing completely at random (MCAR). Yes, that’s jargon. MCAR means that missingness is completely random and is independent of the data. Suppose we have a typical n by p data matrix with n rows corresponding to subjects and p columns corresponding to variables. A quick and dirty way to handle missing data is to throw away all the rows containing missing values. This is not a great idea if a large proportion of your data contains missing values. Suppose you have an unbiased estimator for the full data. Under MCAR, the estimator remains unbiased, but you may lose a lot of efficiency (your estimates become less certain).

Another type of missingness is called missing at random (MAR), which means missingness is not completely random but depends only on the observed data. If you throw away missing data under this scenario, you will obtain a biased estimator. For example, cautious and wealthy people tend to avoid answering questions about their income. For this reason, an income estimate based only on complete cases would be lower than the truth because your sample covers only the less wealthy respondents. Nonetheless, MAR is actually a very handy assumption because the missingness mechanism is tractable and thus can be modeled. The methods I will introduce later are all based on MAR.

If one is unwilling to assume MCAR or MAR, there is a third and final missingness assumption called missing not at random (MNAR) – it says that missingness may depend on values you did not observe. A very important paper that introduced these assumptions is Rubin 1976 [1].

So, we cannot simply throw away missing entries. What then are the alternatives? One can use the general class of imputation methods. Imputation methods are intuitive and work by filling in the missing entries based on the researcher’s best knowledge. The simplest imputation is to fill with the mean/median of the covariate. If we are willing to assume MAR, a more advanced way is to build a model for the variable. We can get a model-based estimator that can serve as the fill-in value. Instead of filling in with one estimator, one can estimate the conditional distribution of each variable given all other observed variables and then draw samples from the estimated conditional distribution to fill in the missing value. To account for the uncertainty in drawing samples, we can repeat the sampling procedure several times so that we have multiple imputed data sets. The inference is then performed on each of the imputed data sets. We combine the multiple results into one final estimator, for example, by averaging them. This is called multiple imputation and is a very popular approach to deal with missing data.
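A minimal sketch of that multiple-imputation recipe in Python, using a simple linear model for the partially missing variable; the data layout and the choice of model are purely illustrative:

```python
import numpy as np

def multiple_imputation_mean(x, y, n_imputations=10, seed=0):
    """Estimate the mean of y when some y values are missing (NaN),
    assuming MAR given the fully observed covariate x.

    Each imputation fits y ~ a + b*x on the complete cases, fills each missing
    y with a prediction plus a resampled residual (to reflect uncertainty),
    and the per-imputation estimates are combined by averaging.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    miss = np.isnan(y)
    b, a = np.polyfit(x[~miss], y[~miss], 1)       # slope, intercept
    resid = y[~miss] - (a + b * x[~miss])
    estimates = []
    for _ in range(n_imputations):
        y_fill = y.copy()
        y_fill[miss] = a + b * x[miss] + rng.choice(resid, size=miss.sum())
        estimates.append(y_fill.mean())
    return float(np.mean(estimates))
```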

Another method, which is less well known, models the missingness mechanism directly. It is called the inverse probability weighted estimator, where the probability refers to the probability of being observed. When missingness is not MCAR, bias is introduced because the complete cases that remain are no longer a representative sample of the population. A way to fix that is to give each complete row a weight equal to 1 over the probability that it would be observed (i.e., not missing). Then we get a re-weighted sample that mimics a full and representative sample. The estimation of interest can be performed on the re-weighted sample, which uses only the complete rows. The key to this method is estimating the probability of being observed – the missingness mechanism. Luckily, one can model this probability under the MAR assumption.

Not until recently did I realize that I had also encountered and studied the missing data issue in my undergraduate years. We were dealing with sensitive questionnaires, where people are asked very sensitive questions that they might be reluctant to answer, so we believed respondents were likely to be untruthful. The mechanism we used to address this was the following: I wanted to ask about a binary, sensitive status, which I coded as {No = 0, Yes = 1}. Instead of asking directly, I listed a non-sensitive and independent question, e.g., how many times did you catch a flight in the last 3 months (an integer). Then, I asked the respondent to report only the sum of the number of flights and the answer to the sensitive question. For example, if a respondent traveled by air 3 times in the last three months and their sensitive status is “Yes”, he/she should write down 3 + 1 = 4. As the researcher, we only observe 4, which could also mean that the respondent flew 4 times and does not have the sensitive status. In this way, respondents’ compliance is believed to increase. Using this method, we translated the sensitive status into missing data, as it was not directly observed. Typically, researchers are interested only in the population level of the sensitive status, so we applied maximum likelihood to estimate the expected value of the missing quantity (in this case, a proportion). More details of this idea can be found in [2].
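A rough sketch of how the prevalence might be recovered from such reports. The paper in [2] uses maximum likelihood; the moment-based shortcut below assumes the nonsensitive count is Poisson and that a separate control group answers only the nonsensitive question, so the difference in means estimates the prevalence:

```python
import numpy as np

def estimate_prevalence(control_counts, treatment_reports):
    """Moment-based sketch of the Poisson item count technique.

    control_counts    : nonsensitive counts (e.g., flights) from respondents
                        asked only the nonsensitive question
    treatment_reports : count + sensitive indicator from respondents asked to
                        report the sum
    Since E[report] = E[count] + P(status = Yes), the difference in means
    estimates the prevalence of the sensitive status.
    """
    prevalence = np.mean(treatment_reports) - np.mean(control_counts)
    return float(np.clip(prevalence, 0.0, 1.0))

# Tiny made-up example with true prevalence of about 30%.
rng = np.random.default_rng(1)
control = rng.poisson(2.0, 500)
treatment = rng.poisson(2.0, 500) + (rng.random(500) < 0.3)
print(estimate_prevalence(control, treatment))
```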

A perfect world would have no missing values. The real world, however, is so flawed that missing data arise wherever data are generated. Working on this issue gives me the illusion that I am helping to fix the world! A great reference for general missing data problems is Prof. Marie Davidian’s course [3].

[1] Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–592.
[2] GL Tian, ML Tang, Q Wu, Y Liu (2017). Poisson and negative binomial item count techniques for surveys with sensitive question. Statistical Methods in Medical Research. Vol 26, Issue 2, pp. 931 – 947
[3] http://www4.stat.ncsu.edu/~davidian/st790/


Lin is a PhD Candidate whose research interests include dynamic treatment regimes, reinforcement learning,  and survival analysis. Her current research focuses on shared decision making in resource allocation problems. We asked a fellow Laber-Labs colleague to ask Lin a probing question —

Q: Explain your favorite statistical method, but from the perspective of a crooked politician running a smear campaign against it.
A: Linear regression. This is definitely my favorite model. It is so simple, pure yet powerful. You can generalize it, penalize it and even interpret it.
The human brain should be linear – not some complicated, intricate, twisted, impenetrable, nonlinear, *deep* network. Believe me, the whole world should be linear.

This is Lin’s second post! To learn more about her research, check out her first article here!

Improving Football Play-calls Using Madden

Nick Kapur, PhD Candidate

The ability to make crucial decisions in real time is one of the most sought after attributes of a head coach in any sport. Being able to improve upon these decisions is thus an important problem, as it can improve a team’s chances of winning. In baseball, there have been numerous studies on managerial decisions such as defensive alignments, bullpen usage, bunting, and more. These studies have resulted in managers making more efficient decisions, leading directly to better play. In football, coaches are faced with fundamental decisions to make every down: the personnel, the formation, and the play their team will run. Unlike baseball, where there is an abundance of data, it is difficult to determine whether coaches are making these important decisions effectively in football for several reasons. First, obtaining labeled data is extremely expensive and requires hand-labeling by domain experts. Furthermore, with a 16-game schedule and an average of only 130 plays per game, NFL football does not generate nearly enough data to reach reliable conclusions.

The lack of sufficient data is not an uncommon problem in science, and due to the proliferation of computing power, it is a problem commonly remedied by simulation studies. Luckily, there is a realistic NFL simulation environment that has been developed and extensively updated for nearly 30 years, EA Sports’ Madden video game franchise. Madden games can act as a model for the underlying system dynamics of an NFL game. We utilize data generated from Madden 17, the most recent version of the game, to train reinforcement learning algorithms that make every play-calling decision throughout an entire game. We compare the results of these algorithms with a baseline established from the game’s built-in play-calling algorithm, an initial surrogate for real-life coaching decisions.

Controllers with Raspberry Pi Computers

To generate the data at rates far greater than actual NFL games, we constructed 4 controllers that were operable through an interface with Raspberry Pi computers. We ran each of these controllers continuously on separate Xboxes, and we used optical character recognition techniques to capture the current state of the game from image data. Then, we used the current state as input to our reinforcement learning algorithms, which would return the play to run. The correct buttons were subsequently passed to the Raspberry Pi, resulting in 4 Madden games that could run continuously with no human input, collecting data 24 hours per day.

Our results show that the reinforcement learning algorithms are able to perform at better rates than the built-in Madden play-calling algorithm, leading to better decision-making and thus more victories. These results can potentially provide a framework for evaluating and improving play-calling in football. Additionally, they can potentially be augmented with real data to provide a model that performs better than a model based on the real data alone. With enough evidence, football coaches may be compelled to alter strategic decisions for the better, leading to more efficiently called football games.


Nick is a PhD Candidate whose research interests include machine learning and statistical genetics. His current research focuses on pursuit-evasion and cooperative reinforcement learning. We thought this posting was a great excuse to get to know a little more about him, so we asked a fellow Laber-Labs colleague to ask Nick a probing question —

Q: Explain the countable axiom of choice with an analogy involving hot dogs.  

A: Let’s say you really want a hotdog. You are walking down the street, and suddenly you stumble upon an infinite number of hotdog vendors who each have tubs with many hotdogs in them. You know that you are incredibly hungry right now, and that in the future you may want to go back to the best hotdog vendor. Therefore, you get out your trusty megaphone and announce to the hotdog vendors a rule (some function that allows them to choose…let’s call it a choice function) so that each of them will know exactly which of their hotdogs to give you. This way, you don’t have to go and pick out one hotdog from each of them individually. The axiom of choice has now saved you a lot of valuable time and probably doomed you to a sedentary lifestyle.

This is Nick’s second post! To learn more about his research, check out his first article here!