Exploration and Exploitation Trade-off

Jesse Clifton, PhD Student 

Most people want to eat as many delicious meals as possible over the course of their lives. What’s the optimal strategy for accomplishing this goal? Every time you decide on a restaurant, you have a choice between going to the best restaurant you know and trying somewhere new. If you always go to your favorite restaurant and never try anything new, you’re likely to miss out on even better dishes at places you’ve never been. But if you always try a new restaurant, you’ll eat a lot of meals that aren’t as good as your current favorite. Maximizing the number of delicious meals over a lifetime means balancing this trade-off.

The exploration-exploitation trade-off is a dilemma we face in sequential decision-making: each time we have a decision to make, should we try to learn more about the world or stick with what we currently think is the best decision? Acting to learn — exploration — gives you more information to help achieve your goals in the long run, but you lose out on gains from going with your current best guess. When you exploit, you give up the chance to learn something new.

The exploration-exploitation trade-off arises in many problems studied in Laber Labs: decision-making in artificial intelligence, design of optimal medical treatment regimes, and effective prevention of the spread of disease are a few examples. I’m currently researching exploration and exploitation in cases where the decision-maker has a huge number of choices available. For example, when public health decision-makers were trying to stop the spread of the recent Ebola epidemic, they had to decide whether to treat (given limited resources) each of dozens or hundreds of locations. All possible combinations of decisions to treat or not treat each location add up to an astronomical number of possible actions, so this is an example of a large action-space problem.

To explore effectively in large action spaces, I’m looking into variants of an old technique called Thompson sampling. In Thompson sampling, we maintain a probability distribution over the parameters of a set of candidate models of the environment, which expresses our uncertainty about how the world works, and we continually update this distribution as data arrive. To explore, we sample one model from this distribution and make the best decision we can while acting as if the sampled model were true. However, we also exploit effectively, because — as we get more data — the distribution concentrates on the most accurate models and therefore leads to reasonable decisions.

Continuing the Ebola example, our models might be epidemiological models of how disease spreads between locations. As we observe the actual spread of disease, we update our uncertainty (probability distribution) over the parameters of these disease models. Each time we need to make a decision, we sample a single model from this probability distribution and try to act optimally according to this sampled model.
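For readers who want to see the mechanics, here is a minimal Python sketch of Thompson sampling for a toy Bernoulli bandit (three “restaurants” with made-up success probabilities). It is not the large action-space setting described above, just the basic sample-then-act loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) probabilities that each "restaurant" serves a great meal (made up).
true_probs = [0.4, 0.6, 0.75]

# Beta(1, 1) priors over each arm's success probability.
successes = np.ones(len(true_probs))
failures = np.ones(len(true_probs))

for t in range(1000):
    # Explore: sample one model of the world from the current posterior...
    sampled_probs = rng.beta(successes, failures)
    # ...and act as if that sampled model were true.
    arm = int(np.argmax(sampled_probs))
    reward = rng.random() < true_probs[arm]
    # Update the posterior with the observed outcome.
    successes[arm] += reward
    failures[arm] += 1 - reward

print("Posterior mean per arm:", successes / (successes + failures))
```

As the loop runs, the posterior concentrates on the best arm, so the same sampling step gradually shifts from exploring to exploiting.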

So much for our brief introduction to Thompson sampling. While the techniques of formal sequential decision-making may be less relevant to our everyday lives, the exploration-exploitation trade-off crops up in many of the decisions we make under uncertainty. Simply being aware of the costs and benefits of exploring and exploiting may help you to maximize your own payoffs in the long run.


Jesse is a PhD Candidate whose research interests include reinforcement learning and artificial intelligence. His current research focuses on finite approximations to infinite-horizon decision problems. We asked a fellow Laber-Labs colleague to ask Jesse a probing question —

Q: Suppose you made a significant discovery in the course of your research that could lead to the development of an Artificial Intelligent Digital Assistant (AIDA) which could result in medical breakthroughs that we have up until now only been able to dream about. However, there’s a 0.01% chance that AIDA could develop a mind of HER own, work toward the annihilation of the human race and succeed. Would you publish your research or would you destroy it so that it never sees the light of day? Perhaps, the discovery of a cure to cancer is worth the risk?

A: Assuming we ought to maximize expected value, consider that the expected number of lives saved by not turning on AIDA is 0.0001 (i.e., 0.01%) * (Expected number of people ever to live if AIDA doesn’t destroy the world). The latter is astronomically large, given that if civilization survives it may spread through the galaxy and beyond and persist until the heat death of the universe. This dwarfs the good that would come from medical breakthroughs, unless we expect these medical breakthroughs to be a necessary condition for civilization’s colonization of the universe.

This leaves out some considerations, such as scenarios where an AIDA-like discovery is made by someone else even if I don’t share my findings. But altogether, on the (debatable) assumption that saving astronomically many lives in expectation is good, I would destroy my research.

Modeling Illicit Network Behavior

Conor Artman, PhD Student 

Many TV shows and movies characterize law enforcement chasing cybercriminals as a classic “cat and mouse” game, but it can be more akin to “shark and fish.” If you’ve ever watched anything during Shark Week, or any other docu-drama on ocean biology (think Blue Planet), you might have seen a bird’s-eye view of sharks repeatedly spear-bombing through massive clouds of fish. Each time they dive, the fish somehow form a shark-shaped gap, and from our vantage point, it looks like the shark doesn’t really catch anything. In reality, the shark scrapes by with a few of the stragglers in the school and dives time after time for more fish to eat. If law enforcement is the shark, this kind of pursuit and evasion more accurately depicts its efforts to discover and unwind a criminal network. Now imagine that the shark is doing this blind and may or may not start out in the middle of the school of fish. If we can’t see the fish, how can we find them? How can we catch them?

Working with the Lab for Analytic Sciences (LAS), we are using an approach known as agent-based models (ABMs) to help find answers to these questions and provide strategies for law enforcement to find and disrupt criminal networks. This project is a collaboration with specialists whose backgrounds range from anthropology and forensic psychology to work as intelligence analysts in the FBI.

Agent-based modeling is a “bottom-up” approach. Typically, one defines an environment, agents, and behavioral rules for the agents. An “agent” is often defined to be the smallest autonomous unit acting in a system — this could be a cubic foot of air in the atmosphere, a single machine in a factory, or a single person in a larger economy. Agents follow rules that can be as simple as a series of if-else statements, with no agent adaptation, or as sophisticated as Q-learning, which allows each agent to learn over time. The goal is to recreate the appearance of complex real-world phenomena. We call the appearance of complex aggregate relationships arising from micro-level agents “emergent” behavior. The broad idea with ABMs is that we run many simulations, and when we attain stable emergent behavior in a particular simulation, we then calibrate the ABM to data.
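As a hedged illustration of that loop (environment, agents, simple if-else rules, emergent aggregate behavior), here is a toy Python ABM on a random network. The rules and numbers are made up for illustration and have nothing to do with the lab’s actual criminal-network model.

```python
import random

random.seed(1)

N = 200          # number of agents
P_EDGE = 0.03    # chance any two agents are connected
THRESHOLD = 2    # an inactive agent becomes active if this many neighbors are active

# Environment: a random network of agents.
neighbors = {i: set() for i in range(N)}
for i in range(N):
    for j in range(i + 1, N):
        if random.random() < P_EDGE:
            neighbors[i].add(j)
            neighbors[j].add(i)

# Initial state: a handful of agents start out "active".
active = {i for i in range(N) if random.random() < 0.05}

# Behavioral rule: a simple if-else threshold, no learning; active agents stay active.
for step in range(20):
    active = {
        i for i in range(N)
        if i in active or len(neighbors[i] & active) >= THRESHOLD
    }
    print(f"step {step:2d}: {len(active)} active agents")  # the emergent aggregate
```

In practice one would run many such simulations under different rules and parameters and then calibrate the model against whatever real data are available.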

But why do we have to jump to using ABMs in the first place? Can’t we use real-world data or try to set up natural experiments? Human self-awareness creates notoriously difficult problems for those trying to model human behavior. Richard Feynman bluntly illustrates this idea: “Imagine how much harder physics would be if electrons had feelings!” Feynman implies that the volatility in modeling human behavior stems from emotions, but I don’t believe that is really the problem. The problem is that decision-making processes in human beings change over time. Human beings learn to adapt to the task at hand, which isn’t inherently problematic — lots of scenarios in the physical sciences show adaptive behavior. The trouble comes when human beings willfully improve their ability to learn. How should someone trying to come up with good rules for human behavior make adaptive rules for making rules? Ideally, social scientists would like to emulate the success of physical scientists by starting with simple rules and then trying to generate the behavior they are interested in observing from those rules. Unfortunately, this approach has had mixed success at best.

As a result, social scientists often take one of two approaches. The first is a “take what you can get” approach. Scientists build a statistical model based on their field’s theories. From there, they run these models on observational data (data collected outside of a randomized experiment) to find empirical evidence for or against their theories. The downside is that disentangling causes from effects can be difficult in this approach. For instance, ice cream purchases and murder rates have a strong positive correlation. But does it make sense to say ice cream causes murder, or murder causes ice cream? Of course not! There’s a factor common to both of them that makes it look like they’re related: heat. As heat increases in cities, murder rates often increase and so do ice cream purchases. Statisticians refer to this concept as “confounding”; ice cream purchases and murder rates in this example are said to be “confounded” with heat. As a result, if we don’t have the correct insights or the right data, observational data can be confounded with many things at once, and we may not be able to tell.

The second approach uses experiments to formalize questions of cause and effect. The rationale is that the world is a perfect working model of itself. So, the problem isn’t that we do not have a perfect working model but that we do not understand the perfect working model. This means that if we could make smaller working versions of the world in a controlled experimental setting, then we should be able to gain some ground in understanding how the perfect model works.

However, for studying illicit networks, we often do not have good observational data available: criminals don’t want to be caught, so they try to make their activities difficult to observe. Similarly, it is usually impossible to perform experiments. To illustrate, if we wanted to study the effect of poverty on crime, there is no way for us as scientists to randomize poverty in an experiment.

A third approach says to simply try to simulate! If you can construct a reliable facsimile of the environment in which your phenomenon exists, then the data generated from the facsimile may help your investigation. In some environments, this works great. For instance, in weather forecasting simulations climatologists can apply well-developed theories of atmospheric chemistry and physics to get informative simulated data. Unfortunately, this may not be a great deal of help if we do not already have a strong theoretical foundation from which to work.

As a result, we try to pool information from experts and the data we have available to build a simplified version of criminal agents, and we make tweaks on our simulation until it produces data that look similar to what we see in the real world (as told by our content specialists and the data we have available). From there, we do our best to make informed decisions based on the results of the simulation. ABMs have their own issues, and they may not be the ideal way to look at a problem. But, we hope that they’ll give us insights into what an optimal strategy for finding and disrupting networks may look like to prevent crime in the future.


Conor is a PhD Candidate whose research interests include reinforcement learning, dynamic treatment regimes, statistical learning, and predictive modeling. His current research focuses on pose estimation for predicting online sex trafficking. We asked a fellow Laber-Labs colleague to ask Conor a probing question —

Q: Sleeping Beauty volunteers to undergo the following experiment and is told all of the following details: On Sunday she will be put to sleep. Once or twice during the experiment, Beauty will be awakened, interviewed, and put back to sleep with an amnesia-inducing drug that makes her forget that awakening. A fair coin will be tossed to determine which experimental procedure to undertake:

  • If the coin comes up heads, Beauty will be awakened and interviewed on Monday only.
  • If the coin comes up tails, she will be awakened and interviewed on Monday and Tuesday.

In either case, she will be awakened on Wednesday without interview and the experiment ends. Any time Sleeping Beauty is awakened and interviewed she will not be able to tell which day it is or whether she has been awakened before. During the interview Beauty is asked: “What is the probability that the coin landed heads?” What would your answer be? Please explain.

A: I think you could approach this problem from lots of perspectives, depending on how you conceptualize randomness and uncertainty, and how you conceptualize how people actually think versus what we say they should think.

On one hand, speaking purely from the perspective of Sleeping Beauty, I think there’s an argument to be made that the probability is still just ½. If, from the perspective of Sleeping Beauty, you do not gain any information from being awakened, you could say, “Well, it was 50-50 before we started, and since I get no information, it’s equivalent to asking me this question before we even started the experiment.” On the other hand, you could try to think of this experiment in terms of a long-run repeated average, or you could even think of it in a more Bayesian way. So I think the point of this question is to give an example of the tension that exists between human heuristic reasoning about uncertainty and precisely converting that intuition into useful statements about the world. (So that’s neat.)

If you ask me what I would personally think, given that I’ve just presumably awakened in some place where I can’t tell what day it is, I might say, “Well, I know I’m awake with an interviewer, so it’s definitely Monday or Tuesday. From my perspective, I can’t tell the difference between awakening on Monday via heads, Monday via tails, or Tuesday via tails. Only one of these three versions of the world corresponds to heads, so if you absolutely must have me give a guess for the probability of heads for this experiment to continue, I think one reasonable guess is 1 out of 3.”

The Computer Is Watching!

Eric Rose, PhD Candidate

In a previous Laber Labs blog post, Marshall Wang discussed using Reinforcement Learning to teach an AI to play the game Laser Cats, which has since been rebranded as Space Mice. In this scenario, the AI knows nothing about the mechanics of the game but is able to learn to play far more effectively than any human player can. To achieve this level of skill, the computer player records the current state of the game, the action it took, and a reward or penalty based on the resulting next state of the game. The computer then learns the optimal strategy for maximizing its reward.

However, in some complex situations this approach can be difficult for a computer. For example, it might be hard to design a reward function, or the decision about which action is best might be very complicated. In situations where a human can play close to optimally, it can be far easier for a computer to learn to replicate how a human player plays the game. This is called imitation learning!

Let’s use the game Flying Squirrel (playable at http://www.laber-labs.com/games/flying-squirrel/) to demonstrate how imitation learning can be used to teach a computer player how to play a game. In this game, the player controls the squirrel and is playing against a clock. The goal is to traverse the hills as fast as possible so you can complete the level before you run out of time. At first, you have only one possible choice: to dive or not to dive. Diving adds a downward force to the squirrel. (Insider tip: to move as fast as possible, you want to dive so that you land on the downward sloping part of the hill, continue diving as you move downhill, and then release the dive button right before you reach the uphill so that your momentum lets you jump off the hill and fly through the air. It’s pretty cool!) As you move through the game you gain special abilities and obstacles appear that you need to dodge, but we will limit our discussion to the simplest case.

In imitation learning, the computer is watching! It is recording features in each frame that summarize the current state of the game (such as how high you are above the ground, your velocity, direction, and features summarizing the shape of the hill in front of you) as well as the action you, the human player, took in this state. This record keeping changes the problem of how to play the game to one of classification. The computer uses the state of the game as input, explores its database of user experiences, and outputs the choice to dive or not to dive. Many classification algorithms can be used for this problem. For Flying Squirrel, we used k-nearest neighbors to teach the computer to play. This works by taking the current state of the game and finding the k past states that are most similar to the current state. We can then look at what action was chosen by the human in each of these past states and choose the action that was most common among those k states.
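Here is a hedged sketch of that k-nearest-neighbors step using scikit-learn. The feature values and names below are invented for illustration, and the real game records far more frames than the four shown.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each row summarizes one recorded frame: [height above ground, vertical velocity,
# horizontal velocity, slope of the hill ahead].  Feature names are illustrative.
human_states = np.array([
    [12.0, -1.5, 8.0, -0.4],
    [ 3.0,  2.0, 6.5,  0.3],
    [20.0, -3.0, 9.0, -0.6],
    [ 1.5,  0.5, 5.0,  0.5],
])
# Action the human took in each state: 1 = dive, 0 = don't dive.
human_actions = np.array([1, 0, 1, 0])

# The "classification" view of imitation learning: state in, human's action out.
policy = KNeighborsClassifier(n_neighbors=3)
policy.fit(human_states, human_actions)

# For a new frame, find the most similar recorded states and copy the majority action.
current_state = np.array([[15.0, -2.0, 8.5, -0.5]])
print("Computer player dives" if policy.predict(current_state)[0] else "No dive")
```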

To see this in action, you can play the game and switch to watching a computer player play using imitation learning based on the data that was just collected on you!


Eric is a PhD Candidate whose research interests include machine learning and statistical computing. His current research focuses on sample size calculations for dynamic treatment regimes. We asked a fellow Laber-Labs colleague to ask Eric a probing question —

Q: Imagine that you’re a game of thrones character – what is your house name, its sigil, and the house motto?
A: I guess my house name would be House Rose since I’m pretty sure they’re all the same as the family name. I also didn’t read the books and have only seen a couple of episodes so that may not even be true. Our sigil would contain just a vinyl record. If my imaginary Game of Thrones character is anything like the real me, music is probably going to be a huge part of his life. Then again if he was anything like the actual me he probably wouldn’t survive for very long. Our house motto would be along the same lines and would be a line from one of my favorite songs by the Drive-By Truckers that I also think could apply to Game of Thrones. “It’s great to be alive!”

This is Eric’s second post! To learn more about his research, check out his first article here!

Facing Missing Data

 

Lin Dong, PhD Candidate

This is a detour from my last post about education. It turns out that I have been working for several months on a project about sequential decision making in the face of missing data, so why not talk about that. Missing data arise in all sorts of studies. For data with a sequential structure, like data from sequential multiple assignment randomized trials (SMARTs), the problem is that patients often drop out. Q-learning and other techniques for estimating the optimal strategy cannot be directly applied to data sets containing missing values, so we need a way to get around having missing values.

The first question we may ask is – why is missing data a problem? Further, can we just throw out the missing entries? Why do people care so much about it and develop sophisticated methods to deal with it? Things are not that simple.

Missing data is not a big issue if the data are missing completely at random (MCAR). Yes, that’s jargon. MCAR means that missingness is completely random and is independent of the data. Suppose we have a typical n by p data matrix with n rows corresponding to the subjects and p variables. A quick and dirty way to handle the missing data is to throw away all the rows containing missing values. This is not a great idea if a large proportion of your data contains missing values. Suppose you have an unbiased estimator for the full data. Under MCAR, the estimator remains unbiased, but you may lose a lot of efficiency (your estimates become less precise).

Another type of missingness is called missing at random (MAR), which means missingness is not completely random but depends only on the observed data. If you throw away incomplete rows under this scenario, you will obtain a biased estimator. For example, suppose older respondents, who also tend to be wealthier, are more likely to skip questions about their income; the missingness then depends only on the observed age variable, yet an income estimate based on the remaining responses would be lower than the truth because the sample over-represents less wealthy respondents. Nonetheless, MAR is actually a very handy assumption because the missingness mechanism is tractable and thus can be modeled. The methods I will introduce later are all based on MAR.
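To make the bias concrete, here is a small simulated version of that income example in Python (all numbers invented). The chance of skipping the income question depends only on an observed variable, age, yet the complete-case mean still comes out too low.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observed covariate (age) and an income that tends to rise with it.
age = rng.integers(20, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=n)

# MAR: the probability of answering depends only on the observed age,
# with older (on average wealthier) respondents answering less often.
p_answer = 1 / (1 + np.exp(0.08 * (age - 45)))
observed = rng.random(n) < p_answer

print("true mean income:         ", round(income.mean()))
print("complete-case mean income:", round(income[observed].mean()))  # biased low
```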

If neither MCAR nor MAR is plausible, there is a third and final missingness assumption called missing not at random (MNAR): the missingness depends on values you did not observe. A very important paper that introduced these assumptions is Rubin 1976 [1].

So, we cannot simply throw away missing entries. What, then, are the alternatives? One option is the general class of imputation methods. Imputation methods are intuitive and work by filling in the missing entries based on the researcher’s best knowledge. The simplest imputation is to fill in with the mean or median of the covariate. If we are willing to assume MAR, a more principled way is to build a model for the variable and use a model-based estimate as the fill-in value. Instead of filling in a single estimate, one can estimate the conditional distribution of each variable given all other observed variables and then draw samples from that estimated conditional distribution to fill in the missing values. To account for the uncertainty in drawing samples, we repeat the sampling procedure several times so that we have multiple imputed data sets. The analysis is then performed on each imputed data set, and the results are combined into one final estimator, for example by averaging them. This is called multiple imputation and is a very popular approach to dealing with missing data.
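A hedged sketch of the two approaches just described, reusing the same toy income/age setup: mean imputation inherits the complete-case bias, while a simple regression-based imputation (drawing from an estimated conditional distribution) does much better. Repeating the draw M times and pooling the M analyses is multiple imputation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

age = rng.integers(20, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=n)
observed = rng.random(n) < 1 / (1 + np.exp(0.08 * (age - 45)))   # MAR, as above

# 1. Mean imputation: fill every hole with the complete-case mean.
mean_filled = np.where(observed, income, income[observed].mean())

# 2. Model-based imputation under MAR: regress income on age among complete cases,
#    then draw each missing value from the estimated conditional distribution.
slope, intercept = np.polyfit(age[observed], income[observed], 1)
resid_sd = np.std(income[observed] - (intercept + slope * age[observed]))
model_filled = income.copy()
model_filled[~observed] = (intercept + slope * age[~observed]
                           + rng.normal(0, resid_sd, size=(~observed).sum()))
# Repeating the draw above M times and pooling the M analyses gives multiple imputation.

print("true mean:            ", round(income.mean()))
print("mean imputation:      ", round(mean_filled.mean()))
print("regression imputation:", round(model_filled.mean()))
```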

Another method, which is less well known, models the missingness mechanism directly. It is called the inverse probability weighted estimator, where the probability refers to the probability of being observed. When missingness is not MCAR, bias is introduced because the complete cases that remain are no longer a representative sample of the population. A fix is to give each complete row a weight equal to 1 over the probability that it is observed. We then get a re-weighted sample that mimics the full, representative sample. The estimation of interest can be performed on the re-weighted sample, which uses only the complete rows. The key to this method is estimating the probability of being observed, i.e., the missingness mechanism. Luckily, one can model this probability under the MAR assumption.
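And a hedged sketch of the inverse probability weighted mean in the same toy setup, with the probability of being observed modeled by logistic regression (scikit-learn assumed; not the estimator from any particular paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 100_000

age = rng.integers(20, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=n)
observed = rng.random(n) < 1 / (1 + np.exp(0.08 * (age - 45)))   # MAR, as above

# Model the probability of being observed given the fully observed covariate (age).
p_model = LogisticRegression().fit(age.reshape(-1, 1), observed)
p_obs = p_model.predict_proba(age.reshape(-1, 1))[:, 1]

# Weight each complete case by 1 / P(observed) so the complete cases
# mimic the full, representative sample.
w = 1.0 / p_obs[observed]
ipw_mean = np.sum(w * income[observed]) / np.sum(w)

print("true mean:          ", round(income.mean()))
print("complete-case mean: ", round(income[observed].mean()))
print("IPW mean:           ", round(ipw_mean))
```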

Not until recently did I realize that I had also encountered and studied the missing data issue in my undergraduate years. We were dealing with sensitive questionnaires, where people were asked questions they might be reluctant to answer truthfully. The mechanism we used to address this was the following: I wanted to ask about a binary, sensitive status, coded as {No = 0, Yes = 1}. Instead of asking directly, I paired it with a non-sensitive and independent question, e.g., how many times did you catch a flight in the last 3 months (an integer). Then, I asked the respondent to report only the sum of the number of flights and the answer to the sensitive question. For example, if a respondent traveled by air 3 times in the last three months and their sensitive status is “Yes,” they should write down 3 + 1 = 4. As the researcher, we observe only the 4, which could just as well mean the respondent flew 4 times and has no sensitive status. In this way, respondents are believed to be more willing to comply. Utilizing this method, we translated the sensitive status into missing data, since it is never directly observed. Typically, researchers are only interested in the population-level prevalence of the sensitive status, so we applied maximum likelihood to estimate the expected value of the missing variable (in this case, a proportion). More details of this idea can be found in [2].
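Reference [2] develops Poisson and negative binomial versions of this item count idea. As a hedged sketch (not the original analysis), the snippet below assumes the non-sensitive count is Poisson and maximizes the likelihood of the reported sums, which form a two-component mixture depending on the hidden 0/1 status; the simulated numbers are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(3)

# Simulate reported answers: a non-sensitive count (flights) plus the hidden 0/1 status.
true_lambda, true_p, n = 3.0, 0.30, 5_000
reported = rng.poisson(true_lambda, n) + (rng.random(n) < true_p)

def neg_log_lik(theta):
    lam, p = theta
    # P(Y = y) = (1 - p) * Pois(y; lam) + p * Pois(y - 1; lam)
    like = (1 - p) * poisson.pmf(reported, lam) + p * poisson.pmf(reported - 1, lam)
    return -np.sum(np.log(like))

fit = minimize(neg_log_lik, x0=[1.0, 0.5],
               bounds=[(1e-6, None), (1e-6, 1 - 1e-6)])
lam_hat, p_hat = fit.x
print(f"estimated prevalence of the sensitive status: {p_hat:.3f}")
```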

A perfect world would have no missing values. The real world, however, is so flawed that missing data arise wherever data are generated. Working on this issue gives me the illusion that I am helping to fix the world! A great reference for the general missing data problem is Prof. Marie Davidian’s course [3].

[1] Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–592.
[2] GL Tian, ML Tang, Q Wu, Y Liu (2017). Poisson and negative binomial item count techniques for surveys with sensitive question. Statistical Methods in Medical Research. Vol 26, Issue 2, pp. 931 – 947
[3] http://www4.stat.ncsu.edu/~davidian/st790/


Lin is a PhD Candidate whose research interests include dynamic treatment regimes, reinforcement learning,  and survival analysis. Her current research focuses on shared decision making in resource allocation problems. We asked a fellow Laber-Labs colleague to ask Lin a probing question —

Q: Explain your favorite statistical method, but from the perspective of a crooked politician running a smear campaign against it.
A: Linear regression. This is definitely my favorite model. It is so simple, pure yet powerful. You can generalize it, penalize it and even interpret it.
Human brain should be linear – not some complicated, intricate, twisted, impenetrable, nonlinear, *deep* networks. Believe me, the whole world should be linear.

This is Lin’s second post! To learn more about her research, check out her first article here!

Improving Football Play-calls Using Madden

Nick Kapur, PhD Candidate

The ability to make crucial decisions in real time is one of the most sought-after attributes of a head coach in any sport. Being able to improve upon these decisions is thus an important problem, as it can improve a team’s chances of winning. In baseball, there have been numerous studies on managerial decisions such as defensive alignments, bullpen usage, bunting, and more. These studies have resulted in managers making more efficient decisions, leading directly to better play. In football, coaches face fundamental decisions on every down: the personnel, the formation, and the play their team will run. Unlike baseball, where there is an abundance of data, it is difficult to determine whether football coaches are making these important decisions effectively, for several reasons. First, obtaining labeled data is extremely expensive and requires hand-labeling by domain experts. Furthermore, with a 16-game schedule and an average of only 130 plays per game, NFL football does not generate nearly enough data to reach reliable conclusions.

The lack of sufficient data is not an uncommon problem in science, and due to the proliferation of computing power, it is a problem commonly remedied by simulation studies. Luckily, there is a realistic NFL simulation environment that has been developed and extensively updated for nearly 30 years, EA Sports’ Madden video game franchise. Madden games can act as a model for the underlying system dynamics of an NFL game. We utilize data generated from Madden 17, the most recent version of the game, to train reinforcement learning algorithms that make every play-calling decision throughout an entire game. We compare the results of these algorithms with a baseline established from the game’s built-in play-calling algorithm, an initial surrogate for real-life coaching decisions.

Controllers with Raspberry Pi Computers

To generate the data at rates far greater than actual NFL games, we constructed 4 controllers that were operable through an interface with Raspberry Pi computers. We ran each of these controllers continuously on separate Xboxes, and we used optical character recognition techniques to capture the current state of the game from image data. Then, we used the current state as input to our reinforcement learning algorithms, which would return the play to run. The correct buttons were subsequently passed to the Raspberry Pi, resulting in 4 Madden games that could run continuously with no human input, collecting data 24 hours per day.

Our results show that the reinforcement learning algorithms outperform the built-in Madden play-calling algorithm, leading to better decision-making and thus more victories. These results can potentially provide a framework for evaluating and improving play-calling in football. Additionally, they can potentially be augmented with real data to provide a model that performs better than a model based on the real data alone. With enough evidence, football coaches may be compelled to alter strategic decisions for the better, leading to more efficiently called football games.


Nick is a PhD Candidate whose research interests include machine learning and statistical genetics. His current research focuses on pursuit-evasion and cooperative reinforcement learning. We thought this posting was a great excuse to get to know a little more about him, so we asked a fellow Laber-Labs colleague to ask Nick a probing question —

Q: Explain the countable axiom of choice with an analogy involving hot dogs.  

A: Let’s say you really want a hotdog. You are walking down the street, and suddenly you stumble upon an infinite number of hotdog vendors who each have tubs with many hotdogs in them. You know that you are incredibly hungry right now, and that in the future you may want to go back to the best hotdog vendor. Therefore, you get out your trusty megaphone and announce to the hotdog vendors a rule (some function that allows them to choose…let’s call it a choice function) so that each of them will know exactly which of their hotdogs to give you. This way, you don’t have to go and pick out one hotdog from each of them individually. The axiom of choice has now saved you a lot of valuable time and probably doomed you to a sedentary lifestyle.

This is Nick’s second post! To learn more about his research, check out his first article here!

What is in a Model?

Isaac J. Michaud, PhD Candidate

Nothing kills communication like jargon: it signals the tribe you belong to. Jargon makes the distinction between the insiders and the outsiders painfully clear. One particular piece of jargon that has always bothered me is the concept of a “model.” I suppose this has been on my mind recently because I have heard people relay the famous quote from George Box, “All models are wrong, but some are useful.” This is an adage that is hard to escape in Statistics, and like all maxims it becomes trite when overused. I am most annoyed when presenters throw this into their lecture as some legal caveat emptor to mitigate the criticisms of their work….

…But, getting back on topic, what exactly did Box mean by a model? We use this term all the time. Taking a blunt view of Statisticians, all we really do is build models. Of course, other scientists also build models; we don’t have a monopoly yet (insert evil laugh). My definition of a model, albeit inept, is: a description of either an object or a process. Now some descriptions are better than others. A detailed blueprint is a more useful description for building a skyscraper than a poem. This is why mathematical models are so prevalent. They cut directly to a quantitative description without any confusion. Models don’t need to be equations; they can take many different forms, for example a computer program. The important thing is that a model is a description.

Joyce’s second blog post discusses two camps of modeling. There are those who want the model to be interpretable and those who do not care about the form of the model but only want it to achieve some result, say winning at Go or chess. Both are valid descriptions, but they illuminate different aspects of the same object. Neither of them is right and neither of them is wrong. The only flawed assumption is that your own model is the one correct description.

My research deals specifically with what are called surrogate models. These are models that are built and calibrated to produce the same results as another model. Now why would anyone want to do this? It seems meta and academic. Well, you’re not wrong! But there are very good reasons to do this. Simpler models, assuming they have enough fidelity, are easier to analyze and understand without losing relevant information. When thinking about surrogate models I always remember the short story “On Exactitude in Science” by Jorge Luis Borges, which describes an empire whose cartographers were so proficient that their maps were the same size as the empire itself, with every detail of the terrain reproduced exactly. Obviously such a map, although accurate, is rather unwieldy. A cut-down version would be sufficient for most practical purposes. The tricky issue is how to perform the trimming.

My surrogate modeling falls into a gray region between the two camps Joyce describes. Often the surrogate takes the form of some inscrutable Gaussian process model, while the model being approximated is a computer simulation built up from scientific knowledge. The simulation is understandable but slow, whereas the surrogate is the reverse. The Gaussian process model is not a better description of the reality that the computer code is simulating, but it does make certain information available to us that would otherwise be locked away in a computer program running until the end of time. In my case, one model is not enough to describe everything. I believe this plurality holds across Statistics and the other sciences. We must be flexible so we are not dogmatically stuck at the expense of progress.
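As a hedged sketch of the surrogate idea (not my actual application), the snippet below runs a stand-in “expensive simulation” at a handful of inputs and fits a Gaussian process to those runs with scikit-learn; the surrogate then gives fast predictions, with uncertainty, anywhere in the input space.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def slow_simulation(x):
    # Stand-in for an expensive computer model (the real one might run for hours).
    return np.sin(3 * x) + 0.5 * x

# Run the expensive model at a handful of inputs...
X_train = np.linspace(0, 3, 8).reshape(-1, 1)
y_train = slow_simulation(X_train).ravel()

# ...and fit a cheap Gaussian process surrogate to those runs.
surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_train, y_train)

# The surrogate predicts quickly everywhere, and also reports its own uncertainty.
X_new = np.linspace(0, 3, 5).reshape(-1, 1)
mean, sd = surrogate.predict(X_new, return_std=True)
print(np.round(mean, 2), np.round(sd, 3))
```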

 


Isaac is a PhD Candidate whose research interests include epidemiology, differential equation modeling, and reinforcement learning. His current research focuses on pursuit-evasion and cooperative reinforcement learning. We asked a fellow Laber-Labs colleague to ask Isaac a probing question —

Q: On the table in front of you are two boxes. One is clear and contains $1000. The other is opaque and contains either $1 million or nothing. You have two choices:

1. Take only the opaque box. 
2. Take both boxes. 
The catch is, before you were asked to play this game, a being called Omega, who has nearly-perfect foresight, predicted what you would do.  If Omega predicted you would take one box, they put $1 million in the opaque box.  If Omega predicted you’d take 2 boxes, they put nothing in the opaque box.
Do you choose one or two boxes?

A: Both boxes. What you assume about Omega’s foresight and objectives will lead to different conclusions. If I believe that Omega is more likely to be right than wrong when predicting my actions, then I would choose the opaque box in order to maximize my expected reward. But if I assume that Omega is a rational being who believes I am a rational being and has the goal of maximizing the chance of being correct, then she will know that I am going to choose the opaque box with 100% certainty and will predict it. But I, knowing that Omega will predict this, would maximize my reward by instead choosing both boxes. Omega will know this and adjust accordingly. I would then have no reason to switch back to picking only the opaque box, because it would have nothing in it. Instead, I would settle for taking both boxes while collecting a 1000 dollar prize, and Omega would be correct in her foresight. Moral of the story: 1000 dollars on the table is worth 1 million gambled with an omniscient being.

This is Isaac’s second post! To learn more about his research, check out his first article here!

Grad School is a Miserable Experience

Joyce Yu Cahoon, PhD Candidate

I’m kidding. Your time in graduate school can be challenging, but like so many things in life, it’s how you take on those challenges that matters. My resolution to succeed was tested after I jumped to the conclusion that two semesters of research would never see the light of day. I had a bit of an identity crisis. I questioned my life decisions. I was bitter and resentful. But getting through moments like these has made me realize the intrinsic value of a PhD.

Everyone has some theory of the world in which they conceptualize themselves. At least I’d like to think so. When someone or something dear to us objects to our theory or lurks outside our structure, chaos ensues. Luckily, I had the fortune to have such a formative experience and gain this perspective through the wide gamut of projects in our lab. I even had a short stint as a data scientist this summer at a local start-up. Over the past year, the projects I’ve worked on include:

  • Monitoring food safety violation rates.
  • Using digital mammography to predict breast cancer.
  • Text mining Twitter data to identify incidences of food poisoning.
  • Developing a means to detect age from facial and body markers.
  • Reconciling disparate data sources.
  • Building a simulation tool to illuminate the benefits and costs of microtransit.

If you aren’t familiar with these topics, let me assure you that this year was a random walk through research areas – no one topic naturally flowed from the last. In hindsight, it was interesting to encounter the broad divide among statisticians in understanding the nature of these problems and the best approaches to solve them—and no, this is not another one of those frequentists vs. bayesians posts. If there is a common thread to these projects, it would be that we have some set of inputs, x, to which we hope to apply some statistical magic so that we arrive at the response of interest, y. But what magic do we use? How do we get from x to y?

In one camp are those who generally assume that the data are generated by some stochastic model and can be fit using a class of parametric models. By applying this template, we can elegantly conduct our hypothesis tests, arrive at our confidence intervals, and get the asymptotics we desire. This tends to be the lens provided by our core curriculum. The strength of this approach lies in its simplicity. However, with the rise of Bayesian methods, Markov chains, and the like, this camp is beginning to lose the “most interpretable” designation. Moreover, what if the data model doesn’t hold? What if it doesn’t emulate nature at all?

In the other camp are the statisticians whose magic relies on the proverbial “black-box” to get from x to y. They use algorithmic methods, such as trees, forests, neural nets, and SVMs, which can achieve high prediction accuracy. I must admit, most of the projects I’ve worked on fall in this camp. But despite its many advantages, there are issues: multiplicity, interpretability, and dimensionality, to name a few. Case in point, the team I worked with on the digital mammography project was provided a pilot data set of 500 mammograms from 58 patients with and without breast cancer. Our goal was to design a model that can flag a patient with or without cancer. But how can we make rich inferences from such limited training data? Some argue our algorithmic models can be sufficiently groomed to learn representations that match those of a human mind; in this case, that of a radiologist identifying mammograms associated with a risk of breast cancer. However, our team tinkered with a variety of adaptations of often-cited convolutional neural nets, and none of the variations was able to fully capture the representations we desired for identifying radiological features. The tools at hand were simply not designed to achieve the objective; grooming was not the solution.

So now that I’ve come through this experience and am again looking forward — I have to ask — in which camp do I fall? Perhaps it’s not either-or; perhaps it’s not even a combination, but something entirely new. Whatever the path forward is, I’m excited to be playing a part. It’s been an intense year, but the level of intellectual growth and personal self-discovery made it all the more worthwhile.

References
Leo Breiman. Statistical modeling: the two cultures, Statistical Science, 2001.


Joyce is a PhD Candidate whose research focuses on machine learning. We asked a fellow Laber-Labs colleague to ask Joyce a probing question —

Q:  Propose a viable strategy to Kim Jong-un on how to take over the world in the next 5 years. — Marshall Wang

A:

With just nuclear capability, NK is left with a route with low odds but a high payout. They should continue to do missile tests that inflate their nuclear capability. Kim should also ramp up the disparaging comments against Trump, for Trump’s inaction insinuates that Americans would never use an atomic bomb. Such comments would likely not draw sanctions from strong allies, namely China and Russia. Kim should then leak intel on a planned nuclear weapon launch as close to the SK border as possible. If the stars align, Trump could justify nuking NK, but the collateral damage in SK would likely draw political ire from the global community. If successful to this point, the US would fall into great political turmoil as Trump would be demonized as worse than Putin. Kim would then need the US and Russia to somehow engage with one another in WW3. While they are preoccupied on that front, Kim could start a violent civil war within Korea. This may involve bombing highly populated areas in SK, though the US would HAVE to be fiercely preoccupied on other fronts AND China would have to be involved, perhaps on the Eastern front, for Kim to execute such a scheme successfully. NK should at this time revert to a defensive strategy in order to move toward a united Korea. Over the course of a few years, Kim should hope both sides take heavy casualties, provide help to China when it can, and win over other Asian allies of the US. And since Korea is among the most advanced nations in technological production, these years would leave a destructive gap in technological progress.

 

This is Joyce’s second post! To learn more about her research, check out her first article here!

Teaching human language to a computer

Longshaokan (Marshall) Wang, PhD Candidate

Have you ever learned to write code? If so, you were learning a “computer language.” But, have you ever considered the reverse; teaching a computer to “understand” a human language? With the advancement of machine learning techniques, we can now build models to convert audio signals to texts (Automatic Speech Recognition), detect emotions carried by sentences (Sentiment Analysis), identify intentions from texts (Natural Language Understanding), translate texts from one language to another (Machine Translation), synthesize audio signals from texts (Text-To-Speech) and more! In fact, you probably have already been using these models without knowing because they are the brains of the popular artificial intelligence (AI) assistants such as Amazon’s Alexa, Google Assistant, Apple’s Siri, Microsoft’s Cortana, and most likely Iron Man’s Jarvis. If you have ever wondered how these AI assistants interact with you, then you are in luck! We are going to take a high-level look at how these models are built.

Many of the language-processing tasks listed above use variations of a machine-learning model called a Recurrent Neural Network (RNN). But let’s start from the very beginning. Say you spotted an animal you haven’t seen before, and you attempt to classify it. Your brain is implicitly considering multiple factors (or features): Is it big? Does it make a noise? Does it have a tail? And so on. You weight these factors differently because maybe the color of its fur is not as important as the shape of its face. Then your guess will be the “closest” animal you know. A machine-learning model for classification works similarly. It maps a set of input features (e.g., big, purrs, has a tail, …) to a classification label (cat). First, the model needs to be trained using samples with correct labels, so that it knows what features correspond to each label. Then, given the features of a new sample, the model can assign it to the “closest” label it knows.

A simple example of a classification model is a perceptron. This model uses a weighted sum of the input features to produce a binary classification based on whether the sum passes a threshold:

[1]
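The figure above (from [1]) shows this rule; as a hedged illustration, here it is as a few lines of Python, with made-up “animal” features, weights, and threshold.

```python
import numpy as np

def perceptron(features, weights, threshold):
    # Weighted sum of the inputs; classify as 1 if the sum passes the threshold.
    return int(np.dot(features, weights) >= threshold)

# Toy features: [is big, purrs, has a tail]  (weights and threshold are illustrative)
weights = np.array([0.2, 0.9, 0.4])
print(perceptron(np.array([0, 1, 1]), weights, threshold=1.0))   # 1 -> "cat"
print(perceptron(np.array([1, 0, 0]), weights, threshold=1.0))   # 0 -> not "cat"
```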

But a perceptron is too simple for many tasks, such as the “Exclusive Or (XOR)” problem. In an XOR problem with 2 input variables, the correct classification is 1 if exactly one input variable is 1, and 0 otherwise:

Values of input variables A and B        True output / correct classification
A = 0, B = 0                             0
A = 1, B = 0                             1
A = 0, B = 1                             1
A = 1, B = 1                             0

However, this classification rule is impossible for a perceptron to learn. To see this, note that if there are only two input features, a perceptron essentially draws a line in the plane to separate the 2 classes, and in the XOR problem, a line can never classify the labels correctly (separate the yellow and gray dots):

[2]

To handle more complicated tasks, we need to make our model more flexible. One method is to stack multiple perceptrons to form a layer, stack multiple layers to form a network, and add non-linear transformations to the perceptrons:

[3]

The result is called an Artificial Neural Network (ANN). Instead of learning only a linear separation, this model can learn extremely complicated classification rules. We can increase the number of layers to make the model “deep” and more powerful, which we refer to as a Deep Neural Network (DNN) or Deep Learning.
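To make the XOR example concrete, here is a hedged sketch using scikit-learn (an assumption; any ML library would do): the single perceptron cannot reproduce the table above, while a small network with one hidden layer and a nonlinearity typically can, although the exact result depends on the random initialization.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

# The XOR table from above.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([0, 1, 1, 0])

# A single perceptron draws one line, so it cannot separate XOR.
single = Perceptron(max_iter=1000).fit(X, y)
print("perceptron:", single.predict(X))          # will get at least one row wrong

# A small network with one hidden layer and a nonlinearity can.
net = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", random_state=0, max_iter=2000).fit(X, y)
print("small ANN: ", net.predict(X))             # typically [0, 1, 1, 0]
```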

Despite the flexibility of the DNN model, language processing remains a challenging classification task. For starters, sentences can have different lengths. In cases like Machine Translation, the output is not just a single label. What’s more, how can we train a model to extract useful linguistic features on its own? Just think about how hard it is for a human to become a linguist. So, to handle language processing, we need a few more twists on our DNN model.

To deal with the variable lengths of sentences, one can employ a method known as word embedding. Here, each word of a sentence is processed individually and mapped to a numeric vector of a fixed length. A good word embedding tends to put words with related meanings, such as “dolphin” and “SeaWorld,” close to one another in the vector space and words with distinct meanings far apart:

[4]

The embeddings are then fed to the DNN for classification.
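As a toy illustration of the idea, here are hand-made 3-dimensional vectors (real embeddings are learned from data and have hundreds of dimensions); cosine similarity is one common way to measure “closeness” in the vector space.

```python
import numpy as np

# Hand-made toy embeddings; the values are invented purely for illustration.
embedding = {
    "dolphin":  np.array([0.9, 0.8, 0.1]),
    "seaworld": np.array([0.8, 0.9, 0.2]),
    "keyboard": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    # Cosine similarity: close to 1 for related words, smaller for unrelated ones.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embedding["dolphin"], embedding["seaworld"]))   # ~0.99: related
print(cosine(embedding["dolphin"], embedding["keyboard"]))   # ~0.30: unrelated

# A sentence of any length becomes a sequence of fixed-length vectors.
sentence = ["dolphin", "seaworld"]
vectors = [embedding[w] for w in sentence]
```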

But a word’s meaning and function also depend on its context in the sentence! How can we preserve the context when processing a sentence word by word? Instead of using only the current word as our DNN’s input, we also use the output of our DNN for the previous word as an additional input. The resulting structure is called a Recurrent Neural Network (RNN) because the previous output becomes part of the current input:

[5]

Now we know how to make our model “read” a sentence, but how do we format all the language-processing tasks as classification problems? It’s straightforward in Sentiment Analysis, where we use the output of an RNN for the last word as a summary of the sentence and add a simple classification model on top of the summary. The labels can be [“positive”, “neutral”, “negative”] or [“happy”, “angry”, “sad”, …]. In Machine Translation, we have an encoder RNN and a decoder RNN. The encoder reads and summarizes the sentence in language A; the decoder sequentially generates the translation word-by-word in language B. Given what you’ve learned so far, can you figure out how to use an RNN for Natural Language Understanding, Automatic Speech Recognition, and Text-To-Speech?
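Here is a hedged NumPy sketch of that recurrence for Sentiment Analysis: the previous word’s hidden output feeds into the current step, and a classifier sits on the final summary. The weights are random rather than trained, so the output probabilities are meaningless; the point is only the wiring.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hidden_dim = 4, 8

# Randomly initialized parameters (training is omitted in this sketch).
W_x = rng.normal(size=(hidden_dim, emb_dim))     # input-to-hidden
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden (the "recurrent" part)
b = np.zeros(hidden_dim)
W_out = rng.normal(size=(3, hidden_dim))         # 3 labels: positive / neutral / negative

def rnn_sentiment(word_vectors):
    h = np.zeros(hidden_dim)
    for x in word_vectors:
        # The current word AND the previous step's output feed into this step.
        h = np.tanh(W_x @ x + W_h @ h + b)
    logits = W_out @ h                              # classifier on the sentence summary
    return np.exp(logits) / np.exp(logits).sum()    # softmax over the 3 labels

sentence = [rng.normal(size=emb_dim) for _ in range(5)]   # 5 embedded words
print(rnn_sentiment(sentence))
```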

On this journey, we started with the basic classification model, the perceptron, and finished with the bleeding-edge classification models that can process human language. We have peeked into the brains of the AI assistants. Exciting research in language processing is happening as we speak, but there is still a long road ahead for the AI assistants to converse like humans. Language processing is, as mentioned before, not easy. At least next time you get frustrated with Siri, instead of yelling “WHY ARE YOU SO DUMB?” you can yell “YOU CLASSIFIED MY INTENTION WRONG! DO YOU NEED A BETTER EMBEDDING?”

[1] Programming a Perceptron in Python, 2013, Danilo Bargen.

[2] A deep learning tutorial: from perceptrons to deep networks, 2014, Ivan Vasilev.

[3] Overview of artificial neural networks and its applications, 2017, Jagreet.

[4] Wonderful world of word embeddings: what are they and why are they needed?, 2017, Madrugado.

[5] Understanding LSTM networks, 2015, Colah.


Marshall is a PhD Candidate whose research focuses on artificial intelligence, machine learning, and sufficient dimension reduction. We asked a fellow Laber-Labs colleague to ask Marshall a probing question —

Q:  If you were running a company in Boston and had summer interns coming from out of town, what would be the best way to scam some money off of them? — James Gilman

A:

Call my company Ataristicians and ask for seed money.

Just kidding. On a more serious note, if I were a scammer, I would take advantage of the fact that in Boston, gifting weed is legal but selling it is not. The way the transaction works is that the buyer “accidentally” drops his money and then picks up the “gift bag” from the seller. The employees of my company would go to all the intern events, establish contacts with the interns, find the potential customers, and pose as discreet weed dealers. Then we would simply put garbage in the gift bag and take the interns’ “dropped” money. Nothing illegal about gifting garbage. Those interns can’t go to the police for help. And because they came from out of town, they are unlikely to have connections with local gangs. Now, if we wanted to make more money, we would record the whole price negotiation and the transaction, then blackmail the interns, threatening to email the recordings to their managers and ruin their careers.

This is Marshall’s second post! To learn more about his research, check out his first article here!

Variable Selection using LASSO

Wenhao Hu, PhD Candidate

How do you identify a gene related to cancer? What factors are correlated with graduation rates across NCAA universities? To answer such questions, statisticians usually use a technique called variable selection. Variable selection identifies the significant factors related to a response, e.g., graduation rates. One of the most widely used variable selection methods is the LASSO (least absolute shrinkage and selection operator), a standard tool among quantitative researchers working across nearly all areas of science.

The LASSO can handle data with lots of factors, e.g., thousands of genes. In the era of big data, this is extremely useful. For example, suppose that there are 50 patients with cancer and another 50 healthy people, and scientists sequence each subject’s genome at ~100k positions. To identify the genes related to cancer, one needs to check all ~100k positions. Traditional regression methods fail in this case because they usually require that the number of subjects be larger than the number of variables. The LASSO avoids this problem by introducing regularization, an idea that has since been adopted by many other machine learning and deep learning algorithms. The LASSO has been implemented in most statistical software environments: R has a package called glmnet, and SAS has a PROC called glmselect.
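For readers who prefer Python, here is a hedged scikit-learn sketch of the same workflow on simulated data with far more variables than subjects (the “gene” numbers are invented); cross-validation picks the tuning parameter and the L1 penalty zeroes out most coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n_subjects, n_genes = 100, 2_000           # far more variables than subjects

X = rng.normal(size=(n_subjects, n_genes))
true_effects = np.zeros(n_genes)
true_effects[:5] = 2.0                      # only 5 "genes" actually matter
y = X @ true_effects + rng.normal(size=n_subjects)

# Cross-validation chooses the tuning parameter; the L1 penalty zeroes most coefficients.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("tuning parameter chosen:", model.alpha_)
print("variables selected:", selected)
```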

To achieve good performance with the LASSO, it is vital to choose an appropriate tuning parameter, which balances model complexity against model fit. Classical methods usually focus on selecting a single optimal tuning parameter that minimizes some criterion, e.g., AIC or BIC. However, researchers usually ignore the uncertainty in tuning parameter selection. Our research studies the distribution of the tuning parameter, and thus provides scientists with information about the variability of model selection. Furthermore, we are developing an interactive R package for the LASSO. Using the package, scientists can dynamically see the selected model and the corresponding false selection rates. This allows them to explore the data set and to incorporate their own subject knowledge into model selection.

Illustration of the interactive R package under development for variable selection.


Wenhao is a PhD Candidate whose research interests include variable selection and statistical learning. We thought this posting was a great excuse to get to know a little more about him, so we asked him a few questions!

Q: What do you find most interesting/compelling about your research?

A: My research gives me a better understanding of the theory of linear models, which is one of the most widely used statistical methods.

Q: What do you see are the biggest or most pressing challenges in your research area?

A: One of the biggest challenges is model interpretability and inference after model selection. Meanwhile, users usually have little freedom to incorporate their domain knowledge into the process of model selection.

Q: Finish this parable:
A Tiger is walking through the jungle whereupon he sees a python strangling a lemur. The Tiger asks the Python, “Why must you kill in this way? It is slow and painful. We all must eat, but have you no compassion for your fellow animals?” To which the Python replied, “Why must you kill with teeth and fangs? The gore and violence of it is scarring to all who are unfortunate enough to see it.” The Tiger considered this for a moment and finally said, “Let us ask the Lemur. Lemur, which is your preferred way to go?”

A: The Python relaxed his grip slightly so that the Lemur could speak: “I don’t know which way is better. But if I can choose, I prefer to be killed by the strongest animal. Is the Python or the Tiger stronger?” The Tiger answered confidently, “I am the strongest animal in the jungle. Python, you should leave the Lemur to me.” The Python felt very unhappy and started to debate with the Tiger and the Lemur. After several minutes, the Python and the Tiger started fighting with each other. The Lemur escaped…

Following the money trail… To catch the bad guys

Yeng Saanchi, PhD Candidate

Imagine a world devoid of human exploitation, a world free from the fear of being trapped under the yoke of slavery. Sounds like a perfectly splendid world, if you ask me! Alas, this is a world that remains elusive because there are people who refuse to accept that owning their fellow human beings is the epitome of evil. Modern-day slavery, also known as human trafficking, involves the abuse of power by certain individuals or groups to coerce and exploit victims through threats or force. It is usually characterized by the giving or receipt of payments or benefits in order to assume control over a person. Forms of modern-day slavery include sex trafficking, labor trafficking, domestic servitude, forced marriage, bonded labor, and child labor.

 

Human trafficking ranks as the third most profitable crime in the world and generates about $32 billion a year. But there is hope! The illicit trade of humans is a problem that is acknowledged by many governments and organisations in the world, and the battle against it has been ongoing for decades now. Although there have been many attempts to curb human trafficking activities, many have come to the realization that these criminal acts cannot be disrupted by conventional policing methods. The inefficacy of conventional methods has given rise to what is referred to as “follow the money” techniques. These methods target illicit assets, such as the financial assets of criminal organisations. It is claimed that targeting illicit assets demonstrates that crime does not pay, disrupts criminal networks and markets, and acts as a deterrent through reduced returns. While the confiscation systems developed have been partly effective, there is an ongoing discussion as to whether these current methods are truly achieving the objective of curbing these crimes.

The purpose of the current human trafficking project being undertaken by Laber Labs is to map out connections between people based on personal and geographic information, as well as their financial transactions. The aim is to help devise a means of detecting, with reasonable accuracy, which transactions appear suspicious and are more likely to be associated with criminal activities. The long-term goal of the project is to assist law enforcement in apprehending the criminals and, where possible, stopping these crimes before they are committed.

The dark web plays an important role in exploring this method of apprehending criminals involved in illegal activities, primarily human trafficking. The dark web is the content on the World Wide Web that can only be accessed with specific software, configurations, or authorisation. Though the dark web is deemed a treasure trove of criminal activity, a study conducted by Terbium Labs showed that about 48% of the activities that take place on the dark web are legal. Interesting, right? The dark web is actually patronised by a lot of people who simply wish for privacy or anonymity.

The task of “catching the bad guys” perpetrating these inhuman acts is no doubt a challenging one and hopefully the outcome of this project will provide an effective method for curbing this canker.

I will end this post with a quote by William Wilberforce: “If to be feelingly alive to the sufferings of my fellow creatures is to be a fanatic, I am one of the most incurable fanatics ever permitted to be at large.”


Yeng is a PhD Candidate whose research interests include predictive modeling and variable selection. We thought this posting was a great excuse to get to know a little more about her, so we asked her a few questions!

Q: What do you find most interesting/compelling about your research?

A: What I find most compelling about my research is the potential of saving lives by helping to put a stop to modern-day slavery.

Q: What do you see are the biggest or most pressing challenges in your research area?

A: The most pressing challenge at the moment is building a statistical model for age prediction using body poses in order to help in distinguishing between underage and adult victims.

Q: Give five tips for starting a successful doomsday cult! One tip should be about fostering the deviancy amplification spiral in your potential followers.

A: i) Run for student body president as a way of getting students on board. Could insert subtle messages about an imminent robot apocalypse in the numerous emails that the student president is allowed to send to students.

ii) Reach out to the fraternities and sororities as a way of garnering more support

iii) Put out a story online about a prominent figure in the academic community who is working on helping law enforcement curtail human trafficking and yet has his own coffle of slaves in the guise of a lab, with proof and all (made-up or not). This should elicit moral outrage and help foster the deviancy amplification spiral somewhat, I think.

iv) Work on getting a couple of notable figures involved, probably someone from the academic community. For instance, convincing EBL that an apocalypse is imminent will be a step in convincing many. How to go about that, I’m not certain.

v) The least probable tactic will be to convince the most powerful man in the world that unlike global warming, a robot apocalypse is real and imminent.