In a previous document I explained how to use the CAPM model to create a naive ranking for pokemon. Even though the approach had some logic to it, I wanted to explore whether a naive probabilistic implementation of TrueSkill could do the job as well. This document contains a python implementation of my naive approach as well as an application to a pokemon dataset. The first part explains how it is made and tested. The second part analyses the ranking of all 750 pokemon.

Starting Data

Before writing any application code, it seems sensible to first think about how to test if the code works. It would be a good idea to simulate some random players with predetermined skill levels such that we can test if our ranking algorithm can at least rank simulated data well enough. The code below generates this data.

import uuid
import numpy as np
import pandas as pd
import json 
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn; seaborn.set()
SIZE_ARR = 100

def sim(p1, p2, entropy = 1):
    ''' 
    Given two players (tuples: <name, skill>) simulate 
    a game between them. Returns (winner-name, loser-name)
    '''
    r = np.random.randint(-entropy, entropy + 1)
    if p1[1] + r > p2[1]:
        return p1[0], p2[0]
    else:
        return p2[0], p1[0]  

def gen_data(n_users = 20, n_games = 100, entropy = 1):
    '''
    Generate a lot of games and outputs a list of user tuples
    (<name>, <skill>) as well as a list of games played.
    '''
    data = []
    users = [(str(uuid.uuid4())[:6], i) for i in range(n_users)]
    for _ in range(n_games): 
        i, j = np.random.choice(range(len(users)), 2)
        u1, u2 = users[i], users[j] 
        data.append(sim(u1, u2, entropy))
    return users, pd.DataFrame(data, columns = ['winner', 'loser'])
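
As a quick sanity check (hypothetical usage, not part of the original script), a call like this shows what the generated data looks like:

users, games = gen_data(n_users = 5, n_games = 10, entropy = 1)
print(users[:2])     # list of (name, skill) tuples, e.g. [('3f2a1b', 0), ('9c0d4e', 1)]
print(games.head())  # dataframe with 'winner' and 'loser' columns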

Code that does the math

We'll now concern ourselves with writing code that updates a system of beliefs. The belief will be captured in a numpy array and can be interpreted as a histogram of likelihood values for a certain level of skill. The algorithm is explained in the previous blogpost as well, but schematically, this is what the belief update looks like:

We start out with two arrays of prior skill beliefs for player 1 and player 2. We then combine them into a two dimensional density. This is our prior belief about the skill levels of the two players. Once a player wins, we can update this belief: the losing player most likely has the lower skill level. To translate this to our belief system, we cut away all likelihood in the triangle where the losing player outskills the winning player. Then we marginalise our belief back to the two players. By repeating this for all the games, we hope to achieve a belief of skill that can be updated in real time.
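
To make this concrete, here is a small numerical sketch of a single update with only three skill buckets; it mirrors the functions defined further down but uses plain numpy so it can be read on its own:

import numpy as np

p_win  = np.array([0.2, 0.3, 0.5])   # prior belief over the winner's skill buckets
p_lose = np.array([0.5, 0.3, 0.2])   # prior belief over the loser's skill buckets

prior = np.outer(p_lose, p_win)       # joint prior: rows = loser skill, cols = winner skill
posterior = np.triu(prior)            # cut away the triangle where the loser outskills the winner
posterior = posterior / posterior.sum()

new_winner = posterior.sum(axis = 0)  # marginal over columns -> updated winner belief
new_loser  = posterior.sum(axis = 1)  # marginal over rows    -> updated loser belief

After one observed win, the winner's mass shifts towards the higher buckets and the loser's towards the lower ones.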

Towards Code

Next we'll define all the functions that do the ranking. These functions are meant to be minimal, as we only want to check if this algorithm can work. It makes sense, though, to think about what our app code might look like and how to test individual components before writing it. I'm not particularly religious about testing, but I do see some value in writing a test before writing code here. We're going to be munging numpy matrices, and typos in indices will be unforgiving.

p1 = np.array([1.,2.,3.])
p2 = np.array([3.,2.,1.])
p3 = np.array([1.,1.,1.])
mat1 = gen_prior(p1, p2)
mat2 = gen_prior(p2, p1)
mat3 = gen_prior(p3, p3)
m1, m2 = gen_marginals(mat1)
m3, m4 = gen_marginals(mat2)
np1, np2 = gen_marginals(cut_matrix(mat1))
assert all(m1 == m2[::-1]) # marginals should be mirror images of each other
assert all(m3 == m4[::-1]) # marginals should be mirror images of each other
assert np.sum(mat1) == np.sum(mat2) # matrices should have the same total mass
assert np.sum(mat3) == 9 # all ones should sum to 9 
assert all(np1 == np2[::-1]) # marginals should stay mirror images after the cut

With these tests in place, I could move on to the code.

d = {}  # global dictionary mapping player name -> belief array

def get_player_stats(name):
    '''
    Check if a player exists in the dictionary. If not, add it. 
    Then returns the player belief. 
    '''
    if name not in d.keys(): 
        d[name] = np.ones(SIZE_ARR)/np.sum(np.ones(SIZE_ARR))
    return d[name]

def gen_prior(arr1, arr2):
    '''
    Combines two arrays into a matrix.
    '''
    return np.matrix(arr1).T * np.matrix(arr2)

def cut_matrix(mat): 
    '''
    Cuts the matrix according to likelihood update rule. 
    Also applies normalisation
    '''
    posterior = np.triu(mat) + 0.000001
    posterior = posterior/np.sum(posterior)
    return posterior

def gen_marginals(posterior_mat):
    '''
    From a cut matrix, generate the appropriate marginals back
    '''
    winner = np.squeeze(np.asarray(np.sum(posterior_mat, 0)))
    loser = np.squeeze(np.asarray(np.sum(posterior_mat, 1)))
    return winner, loser

def update(winner, loser):
    '''
    Given a winner and a loser, update our state. 
    '''
    winner_arr = get_player_stats(winner)
    loser_arr = get_player_stats(loser)
    prior_mat = gen_prior(loser_arr, winner_arr)
    posterior_mat = cut_matrix(prior_mat)
    new_winner, new_loser = gen_marginals(posterior_mat)
    d[loser] = new_loser
    d[winner] = new_winner

I wrote the code and the tests passed. For now this is enough. Before running this code against the pokemon dataset, let us first run it against some simulated data.

Running the code a few times

First let's define a few functions that automate plotting.

def plot_scores(plot_idx): 
    plt.subplot(plot_idx)
    plt.scatter(
        np.arange(len(users)), 
        [np.mean(d[usr[0]] * np.arange(SIZE_ARR)) for usr in users])

def plot_dist(plot_idx):
    plt.subplot(plot_idx)
    for usr in users:
        plt.plot(np.arange(SIZE_ARR), d[usr[0]], alpha = 0.5)

Next we'll run some experiments.
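
The exact driver code for these runs is not reproduced here, but each experiment boils down to something like the following sketch (the figure size and subplot indices are my own choices):

d = {}  # reset the global belief dictionary
users, games = gen_data(n_users = 100, n_games = 400, entropy = 1)
for usr in users:
    get_player_stats(usr[0])  # make sure every player has a prior belief
for _, game in games.iterrows():
    update(game['winner'], game['loser'])

plt.figure(figsize = (12, 4))
plot_scores(121)  # estimated score per simulated player (players are ordered by true skill)
plot_dist(122)    # belief histogram per player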

Simulate n_users = 100, n_games = 400, entropy = 1

Simulate n_users = 100, n_games = 400/10000, entropy = 15

Simulate n_users = 100, n_games = 400/10000, entropy = 50

There's still some room for improvement to make the convergence of this system better, but it seems good enough for my purposes. I want to use this metric to rank pokemon, so I won't mind simulating a few extra outcomes between two pokemon if needed. For more real-time purposes, the following tweaks seem important:

  • New matches are allocated randomly, which is suboptimal from a learning perspective. If you want to fine-tune the top 10 players, it makes sense to have them play against each other when you are unsure who is number 1. If instead one of these players competes against somebody mediocre, you can predict the outcome before the game even starts, which means you would not learn anything from that game (see the sketch after this list).
  • I am using buckets to simulate a distribution for a player. It may be better to use Beta distributions and store their parameters instead. This would make the blogpost harder to read for more novice readers, which is the main reason for not doing it, but it should improve stability and remove the need for the hard-coded smoothing parameter in the cut_matrix function.
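
The first point could be addressed with the belief histograms we already have. As a hedged sketch (win_probability and most_informative_pair are my own helpers, not part of the post's code), one could always schedule the pair whose predicted outcome is closest to a coin flip:

import itertools

def win_probability(arr_a, arr_b):
    '''
    P(skill_a > skill_b) under two independent bucket beliefs.
    '''
    joint = np.outer(arr_a, arr_b)   # joint[i, j] = P(a in bucket i) * P(b in bucket j)
    return np.tril(joint, -1).sum()  # mass where a's bucket is strictly higher than b's

def most_informative_pair(names):
    '''
    Pick the pair of players whose predicted outcome is closest to 50/50.
    '''
    pairs = itertools.combinations(names, 2)
    return min(pairs, key = lambda p: abs(win_probability(d[p[0]], d[p[1]]) - 0.5))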

For the pokemon dataset I only have one concern: the rock-paper-scissors effect.

Pokemon Data

I took the liberty of preparing a dataset. It's easy to do yourself with a bit of python and this project. I'll again use the formula from my previous post to create a function that can simulate the outcome of a pokemon battle, based on the number of turns that one pokemon can outlast the other.

$$ T_{ij} = \frac{HP_i}{DMG_{ji}} $$

where $T_{ij}$ is the number of turns the pokemon $i$ would be able to survive the opponent $j$ if it were to be attacked indefinitely,
$HP_i$ is the amount of hitpoints that pokemon $i$ has and $DMG_{ji}$ is the amount of damage a basic attack from pokemon $j$ would deal to pokemon $i$. This damage is defined via:

$$ DMG_{ji} = \frac{2L_j + 10}{250} \times \frac{A_j}{D_i} \times w_{ji} $$

This setup assumes away some core aspects of the game. The hope is that by making this assumption we still leave much of the actual pokemon game intact. For ease of use, you can read the datasets from my hosted blog.

poke_df = pd.read_csv("http://koaning.io/theme/data/full_pokemon.csv")
weak_df = pd.read_csv("http://koaning.io/theme/data/pokemon_weakness.csv")

def weakness_weight(types_j, types_i): 
    """
    Returns w_{ji} 
    """
    res = 1.
    possible_types = [t for t in weak_df['type']]
    for i in filter(lambda t: t in possible_types, types_i):
        for j in filter(lambda t: t in possible_types, types_j):
            res *= float(weak_df[weak_df['type'] == j][i])
    return res

def calc_tij(poke_i, poke_j):
    """
    Calculate the number of turns one pokemon would outlast the other. 
    """
    poke_row_i, poke_row_j = poke_df[poke_df['name'] == poke_i].iloc[0], poke_df[poke_df['name'] == poke_j].iloc[0]
    dmg_ji = (200. + 10)/250  # (2L + 10)/250 with level L = 100
    dmg_ji *= float(poke_row_j['attack'])/float(poke_row_i['defence'])
    dmg_ji *= weakness_weight(eval(poke_row_j['type']), eval(poke_row_i['type']))
    return poke_row_i['hp']/dmg_ji

def sim_poke_battle(poke_i, poke_j):
    """
    Simulate the outcome between two pokemon. Output: winner_name, loser_name.
    """
    if calc_tij(poke_i, poke_j) > calc_tij(poke_j, poke_i):
        return poke_i, poke_j
    return poke_j, poke_i
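
The ranking itself then follows the same recipe as the simulated-player experiments. A sketch of what that loop might look like, including my guess at how the mle column in the table below is computed (the highest-probability bucket rescaled to [0, 1]; the number of battles is hypothetical):

d = {}  # reset beliefs before ranking pokemon
names = list(poke_df['name'])

for _ in range(20000):  # hypothetical number of simulated battles
    poke_i, poke_j = np.random.choice(names, 2, replace = False)
    winner, loser = sim_poke_battle(poke_i, poke_j)
    update(winner, loser)

# guess at the 'mle' score: location of the bucket with the most posterior mass
mle_df = pd.DataFrame({
    'name': list(d.keys()),
    'mle': [np.argmax(d[name]) / float(SIZE_ARR) for name in d.keys()]
}).sort_values('mle')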

I ran some of the code, and the posterior output looked like this:

The maximum likelihood estimate then gives some insight into the best and worst pokemon.

                          Lowest ranked pokemon              Highest ranked pokemon
                    idx        name       mle          idx        name        mle
                    128   Magikarp  0.011007        142     Snorlax  0.877423
                    112    Chansey  0.013735        248       Lugia  0.882689
                    348     Feebas  0.015845        487   Cresselia  0.883819
                    171      Pichu  0.019108        66      Machoke  0.887577
                    291   Shedinja  0.020473        713     Avalugg  0.890049
                    439    Happiny  0.026763        380      Latios  0.893087
                    241    Blissey  0.037009        148   Dragonite  0.895297
                    172     Cleffa  0.042802        611     Haxorus  0.895832
                    234   Smeargle  0.048575        290     Ninjask  0.897319
                    49     Diglett  0.054010        388    Torterra  0.902103
                    400  Kricketot  0.057252        717     Yveltal  0.904122
                    297    Azurill  0.058177        646      Kyurem  0.914460
                    235    Tyrogue  0.060337        485   Regigigas  0.916059
                    659   Bunnelby  0.060847        305      Aggron  0.916220
                    62        Abra  0.064720        597  Ferrothorn  0.916377
                    173  Igglybuff  0.068975        425    Drifblim  0.917240
                    190    Sunkern  0.071101        644      Zekrom  0.919065
                    279      Ralts  0.081876        537       Throh  0.921175
                    160    Sentret  0.082374        482      Dialga  0.928502
                    51      Meowth  0.083082        149      Mewtwo  0.951472
                    174     Togepi  0.087797        483      Palkia  0.953823
                    581  Vanillite  0.091351        288     Slaking  0.956629
                    212    Shuckle  0.095570        375   Metagross  0.957383
                    18     Rattata  0.100470        492      Arceus  0.957729
                    330     Cacnea  0.100798        716     Xerneas  0.960467

Conclusion

I have run this algorithm a few times and it doesn't seem to converge very consistently. It does do some things very well; the fact that Magikarp generally performs poorly is good news. I've also noticed Mew/Mewtwo in the higher skilled area. I'm still concerned that this algorithm fails to grasp the rock-paper-scissors element of the game. The fact that I assume every pokemon fights with only basic attacks does not help either.

As a final push one might consider making an ensemble of these pokemon rankings to see if the convergence can be made faster. Another tweak to consider is to not have the matching of these pokemon be random during learning. I like the results of the exercise, but it seems that we are not done with ranking pokemon.