A good data set is a gold mine for a statistician. A good data set provides an opportunity to put one’s favorite statistical methods to the test or to try out new ones. In fact, a good data set is worth even more: it presents novel challenges and questions, which can in turn inspire creative approaches to problem solving that meld statistical, mathematical, and other ideas that may have never been realized otherwise.
At Doran’s Lab, the undergraduate data science research group at Laber Labs, we have found our gold mine, and it sits atop the worldwide phenomenon League of Legends. Released by Riot Games in 2009, League of Legends (or, “League”) is a 5-vs-5 competitive online game with a massive following and a professional scene of unprecedented maturity. In keeping with Riot’s openness toward their player base, they maintain a public API that lets anyone with a free account make requests to obtain detailed game data. In our case, we were given permission to make requests at a high volume, which has allowed us to fill up almost nine terabytes of hard drive space in about a year; we collect over 100,000 games every day and have logged data from well over 3 million accounts in North America. This data is ripe for insights.
After a year of collaborative work we’ve still only scratched the surface of the possibilities presented by the data, which contains information as detailed as (1) minute-to-minute updates on player location, gold, and experience (which are the primary resources in the game) and (2) exact times and details of important game events like item purchases, champion, building, and elite monster kills. If you’re interested in reading about these projects, check out our awesome website at doranslab.gg
Since our undergraduate data analysts have already done an excellent job writing up their work on past projects, I’d like to highlight three examples of unanticipated learning that resulted from a challenge in our data. These are a few tiny examples of times when the data demanded a new approach, although I stress that there were many more challenges and the bulk of the good stuff is contained in the linked articles themselves!
Game prediction. The project: for 600,000 games, given which 5 of League’s 141 champions are on blue team and which 5 are on red team, estimate the probability that the blue team will win using a procedure that is transparent and interpretable to humans. The challenge: we wanted to include terms to measure potential synergies between champions, or effects where a champion “counters” another (e.g., an already-strong fighter might be even stronger against a defenseless mage). In statistical language, we wanted to estimate pairwise interaction effects. The only issue is, there are (141 choose 2) times 2, or about 20,000 possible interactions, and storing a table with this many columns and 600,000 rows became a computational nightmare. This data challenge caused us to look into methods for storing sparse matrices, which allowed us to develop a highly efficient solution.
Critical strike smoothing: The project: in the game, a player can issue the command for their champion to automatically attack another unit. Based on items, these attacks have a chance to deal double damage (a “critical strike,” or “crit,” for short), in a process that is known, perhaps surprisingly, to be non-independent. This data is not available through the API, so we needed to collect it ourselves by reading in attacks as “crits” or “non-crits” from video recorded from the game. The challenge: what seemed like a basic task not worthy of the “computer vision” moniker proved surprisingly difficult, as slight variations in the game state required to obtain video footage made it difficult to apply simple heuristics to detect attacks. In the end, we learned the importance of deliberate pre-processing and systematic validation and error detection.
Champion location metrics: The project: to develop useful and easily-interpreted metrics for measuring ways players move around the map, for example, how much a champion “roams” to apply pressure to different strategic regions. The challenge: while the League of Legends map sits comfortably on a square grid and movement can be measured by traditional Euclidean distance, it turns out that naively recording “total distance traveled” (as estimated by the minute-to-minute location updates we have access to) provides little useful information, as certain areas of the game’s map are expected to be traversed often and other regions represent highly contested and noteworthy places. This forced us to think creatively about new ways to measure movement.
These are just a few examples, and there are more every day for everyone at Doran’s Lab. If you’d like to jump in to experience the magic yourself, check out our publicly available, curated datasets here: https://github.com/DoransLab/data and follow us on Twitter @DoransLab to hear more soon.
Alex is a PhD candidate working with Laber Labs. His research interests include reinforcement learning and interpretable models. We thought this posting was a great excuse to get to know a little more about him, so we we asked him:
Q: What is your favorite statistics-themed limerick?
A: This is a rap, not a limerick– because we live in 2019. It’s about the baddest motherf—er to ever enroll in the graduate statistics program at NC State.