Saturday, July 30, 2016

Predicting ARAM Outcome Based on the Champions Selected

[Interactive widget: ARAM Outcome Predictor - current on Patch 6.16. Enter the champions for the blue and red sides, or generate a random set, to see the predicted outcome.]

Anyone with the slightest experience with ARAM knows that there are many good (and bad) champions on this map - in particular, ranged champions with long-range poke and/or sustain tend to be favoured. To quickly show that this is indeed the case, here are the bottom and top 5 champions in ARAM on Patch 6.15 by win rate, after removing mirror matches (a sketch of how these win rates can be computed follows the tables):

Bottom 5:
Champion | Win Rate
Ryze | 36.25%
Evelynn | 36.68%
LeBlanc | 37.80%
Rek'Sai | 38.39%
Kha'Zix | 38.78%

Top 5:
Champion | Win Rate
Swain | 61.22%
Teemo | 61.32%
Galio | 62.86%
Sona | 63.30%
Ziggs | 64.10%
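
Here is the promised sketch of how such a win-rate table could be computed. It assumes a hypothetical long-format data frame picks with one row per champion per game (columns game_id, champion, and a logical win) - none of these names come from the original post.

# Per-champion win rates with mirror matches removed - a sketch,
# assuming a hypothetical data frame `picks` with one row per
# champion per game: columns game_id, champion, win (TRUE/FALSE).
library(dplyr)

winrates <- picks %>%
  group_by(game_id, champion) %>%
  filter(n() == 1) %>%   # drop mirror picks: the same champion on both sides
  ungroup() %>%
  group_by(champion) %>%
  summarise(games = n(), winrate = mean(win)) %>%
  arrange(winrate)

head(winrates, 5)   # bottom 5
tail(winrates, 5)   # top 5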

With such a discrepancy in power between champions, and given that the champions in ARAM are assigned at random, it is natural to ask how much of the game is decided as soon as the champions are selected. In other words, given the champions locked in for both sides and no other information, how well can we predict the outcome of the game?

By constructing a predictive model using machine learning techniques, I found that I can predict the outcome of ARAM games in Patch 6.15 with around 66% accuracy. You can play with my predictive model above: enter the champions and see the predicted outcome.



Warning: technical descriptions of the model ahead.

As far as the methodology is concerned, it is very standard - I collected around 160k ARAM games from the NA server on Patch 6.15, split the data into training and testing sets (in a 3:1 ratio), trained several machine learning models on the training set, and finally computed the prediction error on the testing set.
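
For concreteness, here is a minimal sketch of that 3:1 split, assuming the collected games sit in a hypothetical data frame games with one row per game (the post does not show its actual code):

# A 3:1 train/test split - a sketch; `games` is a hypothetical
# data frame with one row per game.
set.seed(615)
train_idx <- sample(nrow(games), size = floor(0.75 * nrow(games)))
train <- games[train_idx, ]
test  <- games[-train_idx, ]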

Several different models were attempted, including logistic regression, random forests, XGBoost, and a simple multilayer perceptron (MLP). Somewhat surprisingly, the logistic regression model performed remarkably well against the other models. A small amount of regularization was needed for the logistic regression, since the covariate matrix was rank deficient.
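
Here is a minimal sketch of how such a regularized logistic regression could be fit. The +1/-1 champion-indicator encoding, the build_design_matrix helper, and the column names are my assumptions for illustration - the post does not specify its exact feature construction.

# Regularized logistic regression on champion indicators - a sketch.
# Assumes `train` and `test` from the split above, each with champion-ID
# columns blue1..blue5 and red1..red5 (IDs assumed to run 1..n_champions)
# and a class column `win`; also assumes no champion appears on both
# teams in the same game.
library(Matrix)
library(glmnet)

n_champions <- 130   # approximate roster size on Patch 6.15

build_design_matrix <- function(d) {
  X <- Matrix(0, nrow = nrow(d), ncol = n_champions, sparse = TRUE)
  for (i in 1:5) {
    X[cbind(seq_len(nrow(d)), d[[paste0("blue", i)]])] <- 1    # blue picks
    X[cbind(seq_len(nrow(d)), d[[paste0("red", i)]])] <- -1    # red picks
  }
  X
}

# A ridge penalty (alpha = 0) supplies the small amount of regularization
# needed for the rank-deficient covariate matrix.
fit <- cv.glmnet(build_design_matrix(train), train$win,
                 family = "binomial", alpha = 0)
pred <- predict(fit, build_design_matrix(test), s = "lambda.min", type = "class")
mean(as.vector(pred) == test$win)   # test-set accuracy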

The result from the testing set is as follows:

Confusion Matrix and Statistics

          Reference
Prediction  LOSS   WIN
      LOSS 13617  7174
      WIN   7371 14621
                                          
               Accuracy : 0.66            
                 95% CI : (0.6555, 0.6645)
    No Information Rate : 0.5094          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.3197          
 Mcnemar's Test P-Value : 0.1041          
                                          
            Sensitivity : 0.6488          
            Specificity : 0.6708          
         Pos Pred Value : 0.6549          
         Neg Pred Value : 0.6648          
             Prevalence : 0.4906          
         Detection Rate : 0.3183          
   Detection Prevalence : 0.4860          
      Balanced Accuracy : 0.6598          
                                          
       'Positive' Class : LOSS 
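
As an aside, a summary in this format is what the caret package's confusionMatrix() produces; a sketch, assuming pred and the true test labels are factors with levels LOSS and WIN:

# Producing a summary in the above format with caret - a sketch.
library(caret)
confusionMatrix(factor(pred, levels = c("LOSS", "WIN")),
                factor(test$win, levels = c("LOSS", "WIN")),
                positive = "LOSS")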

These results seem quite good. The ROC curve is as follows:

[ROC curve of the logistic regression model on the testing set]

which suggests there may still be room for improvement.




Sunday, December 13, 2015

Legend of the Poroking Statistics (2015)

Poroking mode statistics - with 280k games analyzed from the NA server.
Note that win rate is calculated for non-mirror games only.

Champion | Popularity | Win Rate
Aatrox | 0.94% | 39.15%
Ahri | 8.41% | 47.79%
Akali | 2.42% | 38.33%
Alistar | 10.53% | 58.29%
Amumu | 7.38% | 55.26%
Anivia | 5.68% | 45.73%
Annie | 7.92% | 47.64%
Ashe | 13.86% | 49.70%
Azir | 10.32% | 42.58%
Bard | 9.53% | 34.41%
Blitzcrank | 25.58% | 48.29%
Brand | 15.31% | 53.69%
Braum | 3.87% | 49.17%
Caitlyn | 12.00% | 51.49%
Cassiopeia | 2.17% | 47.34%
Cho'Gath | 7.80% | 45.76%
Corki | 2.43% | 43.62%
Darius | 11.36% | 51.83%
Diana | 2.69% | 49.28%
Draven | 2.82% | 50.23%
Dr. Mundo | 11.00% | 56.51%
Ekko | 9.88% | 44.07%
Elise | 0.68% | 44.60%
Evelynn | 0.41% | 40.59%
Ezreal | 28.83% | 48.35%
Fiddlesticks | 9.06% | 53.60%
Fiora | 2.52% | 47.17%
Fizz | 7.60% | 46.56%
Galio | 7.67% | 62.83%
Gangplank | 8.74% | 53.31%
Garen | 14.82% | 56.18%
Gnar | 13.58% | 48.53%
Gragas | 2.77% | 42.36%
Graves | 4.95% | 51.52%
Hecarim | 2.10% | 45.36%
Heimerdinger | 10.56% | 60.36%
Illaoi | 21.06% | 55.87%
Irelia | 1.80% | 48.34%
Janna | 4.25% | 56.52%
Jarvan IV | 5.64% | 46.41%
Jax | 4.46% | 44.46%
Jayce | 6.99% | 47.30%
Jinx | 20.65% | 50.85%
Kalista | 9.35% | 44.30%
Karma | 5.61% | 50.71%
Karthus | 12.49% | 50.63%
Kassadin | 2.85% | 42.84%
Katarina | 16.89% | 49.22%
Kayle | 2.71% | 45.59%
Kennen | 3.62% | 42.16%
Kha'Zix | 1.45% | 37.37%
Kindred | 6.58% | 41.91%
Kog'Maw | 9.07% | 57.05%
LeBlanc | 6.75% | 37.91%
Lee Sin | 7.05% | 38.08%
Leona | 6.82% | 54.43%
Lissandra | 2.34% | 45.86%
Lucian | 9.25% | 48.62%
Lulu | 5.45% | 43.03%
Lux | 35.75% | 56.37%
Malphite | 13.17% | 48.32%
Malzahar | 8.29% | 55.21%
Maokai | 3.55% | 58.03%
Master Yi | 8.98% | 48.91%
Miss Fortune | 37.12% | 55.76%
Mordekaiser | 1.68% | 48.42%
Morgana | 12.23% | 51.17%
Nami | 3.78% | 49.23%
Nasus | 3.29% | 48.31%
Nautilus | 7.34% | 57.81%
Nidalee | 14.90% | 38.98%
Nocturne | 0.45% | 40.43%
Nunu | 2.33% | 40.25%
Olaf | 3.96% | 48.08%
Orianna | 9.06% | 43.90%
Pantheon | 3.03% | 45.96%
Poppy | 34.57% | 43.81%
Quinn | 2.64% | 43.84%
Rammus | 3.40% | 58.45%
Rek'Sai | 0.59% | 40.32%
Renekton | 2.38% | 48.00%
Rengar | 2.99% | 39.10%
Riven | 6.06% | 41.69%
Rumble | 2.63% | 43.29%
Ryze | 2.93% | 44.55%
Sejuani | 5.07% | 51.43%
Shaco | 7.59% | 45.83%
Shen | 2.65% | 52.08%
Shyvana | 1.26% | 43.65%
Singed | 5.19% | 54.27%
Sion | 6.85% | 62.60%
Sivir | 5.32% | 49.88%
Skarner | 3.24% | 51.31%
Sona | 19.69% | 61.00%
Soraka | 10.09% | 44.29%
Swain | 3.75% | 56.97%
Syndra | 8.00% | 41.85%
Tahm Kench | 7.39% | 47.10%
Talon | 5.35% | 55.41%
Taric | 2.54% | 63.78%
Teemo | 20.85% | 58.54%
Thresh | 7.37% | 42.64%
Tristana | 10.85% | 45.52%
Trundle | 2.23% | 60.21%
Tryndamere | 2.09% | 42.18%
Twisted Fate | 9.91% | 46.72%
Twitch | 3.59% | 46.88%
Udyr | 0.81% | 44.30%
Urgot | 0.94% | 47.44%
Varus | 15.13% | 53.00%
Vayne | 9.09% | 48.50%
Veigar | 15.17% | 48.04%
Vel'Koz | 9.94% | 53.13%
Vi | 1.47% | 47.36%
Viktor | 4.48% | 47.22%
Vladimir | 5.95% | 60.08%
Volibear | 3.40% | 52.53%
Warwick | 1.44% | 50.11%
Wukong | 7.84% | 54.20%
Xerath | 7.08% | 48.71%
Xin Zhao | 2.36% | 46.80%
Yasuo | 20.86% | 46.50%
Yorick | 1.10% | 54.86%
Zac | 8.04% | 44.31%
Zed | 10.74% | 43.13%
Ziggs | 14.60% | 58.31%
Zilean | 5.52% | 48.79%
Zyra | 4.80% | 57.66%

Some other interesting tidbits:
  • Overall, blue side won about 54.5% of the time.
  • Out of a sample of 12179 Katarina games on record, she had at least one pentakill in 563 of them - a pentakill rate of 4.6% per game for the Katarina player. She is followed by Master Yi at 3.3% and Darius at 1.4%. On Summoner's Rift, Katarina and Master Yi usually have around a 1-1.5% pentakill rate per game.
  • The mean and median game lengths are 21.4 and 20.6 minutes, respectively. The longest game in my records is 78 minutes; the shortest is 7 minutes.

See last year's statistics here.

Saturday, November 21, 2015

Learn Statistics and Data Mining with League of Legends Data - Principal Component Analysis


Hi, and welcome to the first post in a series on learning statistics and data mining with League of Legends. In this series, I will try to illustrate how League of Legends data can be analyzed using various statistical and/or data mining methodologies - while keeping mathematical details to a minimum. The data and the code (written in R) are both available, so you can try it yourself!

So let's begin. Suppose you are interested in the average numbers of champion kills and deaths per game for each champion in the game (NA ranked solo queue). A typical starting point is to draw a scatterplot like this:

[Scatterplot of average kills vs. average deaths per game, by champion]
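
A sketch of how this plot could be drawn, assuming the data_summary_cleansed.csv dataset loaded later in this post, with columns named kills and deaths (the column names are my assumption):

# Kills vs. deaths scatterplot - a sketch; assumes the dataset loaded
# in the PCA code below, with columns `kills` and `deaths` (assumed names).
plot(data$kills, data$deaths, pch = 16,
     xlab = "Average kills per game", ylab = "Average deaths per game")
text(data$kills, data$deaths, labels = row.names(data), cex = 0.8, pos = 1)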

This plot is slightly crowded in the middle, but overall it's not too bad: we see supports like Janna with the lowest average kills and deaths per game; on the other hand, assassins like Akali, Talon, and Katarina tend to have high kills and high deaths per game.

However, we have looked at only two variables here: champion kills and deaths. This is easy because two variables fit onto a 2-dimensional scatterplot with two axes - which conveniently fits onto your computer screen, a 2-dimensional surface.

But the Riot API offers data with far more than two variables - the dataset of interest (Google Doc Link, Pastebin Link) has 16 different variables. Obviously we don't have a 16-dimensional surface to make our scatterplot on, and drawing a scatterplot for each pair of variables gets unreadable really quickly. So what can we do at this point?


One way to do it is through a data mining technique called Principal Component Analysis (PCA), which I will be talking about today.

Here's what we want to do: we cannot make a 16-variable scatterplot, since that would require 16 axes and a 16-dimensional plot, but we CAN make a two-variable scatterplot. So, instead of plotting all 16 variables, we will “compress” our data into 2 variables - then the problem becomes easy.

How do we “compress” the data? Our dataset has 16 variables, each of which contains a certain amount of variance (think of variance as “information”). We apply a linear algebra technique called Singular Value Decomposition to construct 16 new variables in a smart way - so that the first two new variables contain most of the variance (again, think of variance as “information”). Then we can plot just those first two new variables on a 2-dimensional scatterplot and ignore the rest - with any luck, they will capture most of the information we care about.

In case you had no idea what I just said, here's an analogy. You have 16 friends, each of whom has (say) 100 dollars. Tracking all 16 people's money is too hard for you, so you redistribute everyone's money such that your best friend has the most money, your second-best friend has the second most, and so on; you can't just give all 1600 dollars to your best friend (which would cause the other 15 to starve to death), but you push the envelope as much as possible so that the majority of the wealth stays with your best and second-best friends. Then you leave with your two best friends and ignore the rest.
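
If you want to peek under the hood, here is a tiny self-contained demonstration - on random data, not the League dataset - that the principal component scores from prcomp() are exactly \(UD\) from the SVD of the centered data matrix, up to the sign of each component:

# PCA via SVD - a small demonstration on random data.
set.seed(1)
M <- scale(matrix(rnorm(200), nrow = 20))   # 20 observations, 10 variables
s <- svd(M)
scores_svd <- s$u %*% diag(s$d)             # principal component scores from the SVD
scores_pca <- prcomp(M)$x
max(abs(abs(scores_svd) - abs(scores_pca))) # ~0 (signs of components may differ)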

While it is not impossible to do principal component analysis by hand (it all comes down to linear algebra), we will use a popular programming language / statistical analysis tool called R which will do the heavy lifting for us.

# First download the data from here: http://pastebin.com/3NXpPRp1 and save it as a csv file (e.g. "data_summary_cleansed.csv")

# Load the data into R
data <- read.table("data_summary_cleansed.csv", header = TRUE, row.names = 1, sep = ",", as.is = TRUE)

# Perform PCA (note that prcomp's scaling argument is spelled `scale.`):
pca <- prcomp(data, scale. = TRUE)
# We chose to normalize the data, since variables such as number of kills and gold earned operate on completely different scales.

# Take the first two principal components for plotting:
x <- pca$x[, 1:2]

# Draw the scatterplot:
plot(x, pch = 16)
text(x, labels = row.names(data), cex = 0.8, pos = 1) # add champion labels to the points

[Scatterplot of the first two principal components, labeled by champion]

What is cool about this plot is that champions with similar roles are automatically grouped together; for example, most support champions sit in the lower left corner, while AP mid laners sit in the lower right corner. Karma and Zyra, which share the playstyle of both support and AP mid lane, sit somewhere in between. It's fairly easy to see that champions with similar playstyles are indeed close together in this scatterplot (click here for the same plot in high resolution).

What is cooler is that our data does not contain any information about each champion's role in the game - we employed a data mining algorithm which digested a fairly complex set of data and illustrated the answer for us.

You may remember that by performing principal component analysis we concentrate the total variance of the dataset onto the first two principal components. We can illustrate the effect as follows:

plot(pca)

[Variances of the principal components (scree plot)]

As you can see, the first two principal components contain a lot of variance and thus a lot of “information”. The principal components further down are less valuable because they don't contain as much information as the first two. Since the total variance is 16 (we have 16 original standardized variables) and the first two principal components have roughly \(6.2 + 3 = 9.2\) of it, we can hand-wavingly argue that our principal component scatterplot captured about \(9.2 / 16 = 57.5\%\) of the total “information” in the original dataset. We still lost a lot of information from the original dataset, but sometimes sacrifices are needed.
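
We can check this bookkeeping directly from the fitted pca object - with standardized variables, each contributes variance 1, so the total is 16:

# Variance captured by the first two principal components.
vars <- pca$sdev^2
sum(vars)                    # total variance: 16
sum(vars[1:2]) / sum(vars)   # proportion captured: ~0.575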

So where did all the original variables go? We can visualize this by using a biplot:

biplot(pca, cex = c(0.5, 1))

[Biplot of the first two principal components]

What we see here is that support champions generally have high numbers of wards killed, wards placed, and assists (arrows to the left); mid laners tend to deal a lot of magic damage to champions (arrow to the bottom); junglers and non-jungle tanks tend to take a lot of damage and kill many neutral monsters (arrows to the top); assassins and ADCs tend to have high champion kills, high minion kills, and high gold earned (arrows to the right). All of the information we've mined from the data conforms to our intuition about the game.

On a closing note, data mining techniques such as Principal Component Analysis reduce the complexity of the data through what is called dimension reduction. When we are faced with too many variables to analyze easily (in our case, 16), we reduce them to 2 so we can more easily comprehend the data. Do keep in mind that there is a whole family of standard dimension reduction techniques, and this post barely scratches the surface. In fact, the way we compute KDA can be seen as a dimension reduction technique, where we reduce 3 variables (kills, deaths, and assists) to 1.
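
As a tiny illustration, using the commonly quoted formula \((K + A) / D\) - the post itself does not define KDA, and the column names are assumed as before - the reduction is a single line:

# KDA as a 3-to-1 dimension reduction - a sketch with assumed column names.
kda <- (data$kills + data$assists) / pmax(data$deaths, 1)  # guard against zero deaths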

Do you want to learn more? Here's a reading list:

  1. If you know little about linear algebra or statistics and would like to learn Principal Component Analysis, try this tutorial, which starts from the very basics.

  2. If you think your mathematical background is sufficiently strong, try this tutorial instead.

  3. If you would like to go beyond Principal Component Analysis (including Sparse PCA and ICA), one source I recommend is Hastie et al.'s book The Elements of Statistical Learning, Sections 14.5-14.7.