In this document I'll explain why it makes sense to make a mental battle plan for an analysis before you start writing any code. As a demonstration, I'll demonstrate this wil calculating the amount of time that is required for people to get to level 60 in world of warcraft. In particular, I'll be using this dataset from kaggle:

The Dataset

If you'd download the dataset you can confirm that it looks something like:

    char level  race charclass                      zone guild           timestamp
   <int> <int> <chr>     <chr>                     <chr> <int>              <dttm>
 1     2    18   Orc    Shaman               The Barrens     6 2008-12-03 10:41:47
 2     7    54   Orc    Hunter                   Feralas    -1 2008-01-15 21:47:09
 3     7    54   Orc    Hunter            Un'Goro Crater    -1 2008-01-15 21:56:54
 4     7    54   Orc    Hunter               The Barrens    -1 2008-01-15 22:07:23
 5     7    54   Orc    Hunter                  Badlands    -1 2008-01-15 22:17:08
 6     7    54   Orc    Hunter                  Badlands    -1 2008-01-15 22:26:52
 7     7    54   Orc    Hunter                  Badlands    -1 2008-01-15 22:37:25
 8     7    54   Orc    Hunter          Swamp of Sorrows   282 2008-01-15 22:47:10
 9     7    54   Orc    Hunter The Temple of Atal'Hakkar   282 2008-01-15 22:56:53
10     7    54   Orc    Hunter The Temple of Atal'Hakkar   282 2008-01-15 23:07:25

World of Warcraft is a video game with many players and this dataset contains data from the horde faction on a single server during 2008. We have a character id, we know where the player was in the world and we know if the player was part of a guild. We also have a very nice timestamp and a level. To familiarize ourselves with the dataset we could make a plot over time to check how many players log in per day.

There is a spike in october (this is when an expansion pack was announced). During this expansion levelling was very different so I've removed dates from october and later to ensure a uniform dataset.

The Task

We could use this dataset to figure out how long it takes before you gain a level or indeed get to level 60. You'd need to do a bit of data wrangling though. Let's look at a different slice of the dataset.

    char level  race charclass              zone guild           timestamp
1      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:02:20
2      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:12:07
3      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:22:40
4      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:32:29
5      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:42:18
6      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:52:47
7      9    70   Orc    Hunter         Ashenvale    79 2008-01-01 13:02:29
8      9    70   Orc    Hunter         Ashenvale    79 2008-01-01 13:12:18
9      9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 13:22:44
10     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 13:32:32
11     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:02:31
12     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:12:18
13     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:22:44
14     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:32:32
15     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:42:20
16     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:52:08
17     9    70   Orc    Hunter    Shattrath City    79 2008-01-01 17:02:43

These are logs from a single player. If you look at row 10-11 you'll notice a gap in the timestamps. Usually we see a timestamp every 10 minutes while here there is a 2.5 hour gap. If we were to take just the min and max-timestamp per user, we are at risk of counting these 2.5 hours of game-time while the player wasn't online. We need to accomodate for this. Before we can report our numbers we need to do some proper data wrangling here.

Battle Plan

This might be a good time to stop reading this document to see if you can solve this analysis yourself.

In this situation it really helps if you can take a step back and come up with a battle plan before writing any code. I've noticed a lot of junior analysts not doing this, which is very risky. If you're just learning a new tool it really helps to be consious of what you're doing and there's nothing wrong with planning your next step.

In this particular case, we need to create sessions such that all the timestamps that are within 10 minutes of eachother are bundled together in a session. We can then calculate the total time per session and then allocate that to the user. In effect, this means that we want to have an extra column that looks like this:

    char level  race charclass              zone guild           timestamp  session
1      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:02:20        1
2      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:12:07        1
3      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:22:40        1
4      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:32:29        1
5      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:42:18        1
6      9    70   Orc    Hunter       The Barrens    79 2008-01-01 12:52:47        1
7      9    70   Orc    Hunter         Ashenvale    79 2008-01-01 13:02:29        1
8      9    70   Orc    Hunter         Ashenvale    79 2008-01-01 13:12:18        1
9      9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 13:22:44        1
10     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 13:32:32        1
11     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:02:31        2
12     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:12:18        2
13     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:22:44        2
14     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:32:32        2
15     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:42:20        2
16     9    70   Orc    Hunter Blackfathom Deeps    79 2008-01-01 16:52:08        2
17     9    70   Orc    Hunter    Shattrath City    79 2008-01-01 17:02:43        2

This session column needs to have unique values for every user. Knowing the end goal is helpful when you want to conjure the query that needs to be written. Even at this stage though, I'd still prefer not to write any code yet as I still need to figure out a battle plan on how to get to the session.

Seriously, this might be a good time to stop reading this document to see if you can solve this analysis yourself.

After somethinking, it dawned on me.

  1. If I use a group_by mechanic I could isolate the sessionizing per user.
  2. What I could then do is check the time between two timestamps (assuming they are ordered correctly). I could calculate the difference in minutes between the timestamp now and the previous one.
  3. If that time is larger than say, 20 minutes, then this could indicate if that row should be the start of a new session. I'd also need to accomodate the first row per user which should always cause a new session.
  4. With a column of all true/false values like this, I'd merely need a cumsum and then I'd have the session column I'd want per user.
              timestamp   diff_mins     new_session        session
1   2008-01-01 12:02:20          NA            TRUE              1
2   2008-01-01 12:12:07          10           FALSE              1
3   2008-01-01 12:22:40          10           FALSE              1
4   2008-01-01 12:32:29          10           FALSE              1
5   2008-01-01 12:42:18          10           FALSE              1
6   2008-01-01 12:52:47          10           FALSE              1
7   2008-01-01 13:02:29          10           FALSE              1
8   2008-01-01 13:12:18          10           FALSE              1
9   2008-01-01 13:22:44          10           FALSE              1
10  2008-01-01 13:32:32          10           FALSE              1
11  2008-01-01 16:02:31          10           FALSE              1
12  2008-01-01 16:12:18         120            TRUE              2
13  2008-01-01 16:22:44          10           FALSE              2
14  2008-01-01 16:32:32          10           FALSE              2
15  2008-01-01 16:42:20          10           FALSE              2
16  2008-01-01 16:52:08          10           FALSE              2
17  2008-01-01 17:02:43          10           FALSE              2

I tried this approach on a small dataset first (otherwise I'd have to wait a long time for the query to run) and I wrote a few small tests to confirm that indeed my method worked. The end result was the following bit of dplyr code.

wowdf %>% 
  arrange(char, timestamp) %>% 
  group_by(char) %>% 
  mutate(unix_ts = as.integer(timestamp),
         diff_mins = unix_ts - lag(unix_ts),
         new_sess = ifelse(is.na(unix_diff), TRUE, unix_diff > 1000),
         session = cumsum(session))

I imagine something similar can be done with pandas, the dplyr code looks very clean to me. In a mere 7 lines of code a lot of logic is being taken care of.

Bonus Tidyverse Features

You might wonder why I took the extra effort to cast the timestamp into an integer. The reason is that this way, I am a bit more flexible with backends. Currently wowdf is a dataframe that I've read in base R, but it could just as well be a dataframe in spark or a table in bigquery. You can still talk dplyr against though. Internally it translates the query into SQL before running it. This can be made explicit by running something like:

q <- wowddf %>% 
  arrange(char, timestamp) %>% 
  group_by(char) %>% 
  mutate(unix_ts = unix_timestamp(timestamp),
         unix_diff = unix_ts - lag(unix_ts)) %>% 
  mutate(new_sess = ifelse(is.na(unix_diff), TRUE, unix_diff > 1000))

explain(q)
Translation towards SparkSql

This is the generated query if wowddf was living in Spark.

SELECT 
  `char`, `level`, `race`, `charclass`, `zone`, 
  `guild`, `timestamp`, `unix_ts`, `unix_diff`, 
  CASE WHEN (((`unix_diff`) IS NULL)) THEN (TRUE) ELSE (`unix_diff` > 1000.0) END AS `new_sess`
FROM (
  SELECT 
  `char`, `level`, `race`, `charclass`, 
  `zone`, `guild`, `timestamp`, `unix_ts`, 
  `unix_ts` - LAG(`unix_ts`, 1, NULL) 
    OVER (PARTITION BY `char` ORDER BY `char`, `timestamp`) AS `unix_diff`
  FROM (
    SELECT `char`, `level`, `race`, `charclass`, 
    `zone`, `guild`, `timestamp`, 
    UNIX_TIMESTAMP(`timestamp`) AS `unix_ts`
    FROM (
    SELECT * FROM `df`
    ORDER BY `char`, `timestamp`) `rkdvulqmdg`) `efzujnwfsu`) `dbcscslutz`
Translation towards BigQuery

This is the generated query if wowddf was living in BigQuery.

SELECT
  [char],[level],[race],[charclass],
  [zone],[guild],[timestamp],[unix_ts],[unix_diff],
  IF((([unix_diff]) IS NULL), TRUE, [unix_diff] > 1000.0) AS [new_sess]
FROM (
  SELECT
  [char],[level],[race],[charclass],
  [zone],[guild],[timestamp],[unix_ts],
  [unix_ts] - LAG([unix_ts], 1, NULL) 
    OVER (PARTITION BY [char] ORDER BY [char], [timestamp]) AS [unix_diff]
  FROM (
    SELECT
      [char],[level],[race],[charclass],
      [zone],[guild],[timestamp],
      UNIX_TIMESTAMP([timestamp]) AS [unix_ts]
    FROM (
      SELECT * FROM [wow.cleansess]
      ORDER BY [char], [timestamp])))

I'd still recommend starting with a very small dataset that you can iterate on, but it is nice to know that you can use dplyr on a large dataset as well.

Conclusion

I can advice anyone to make a battle plan before writing code. Sure, it is fine to make some plots to familiarize yourself with a dataset but this is behavior that I've seen too many people get stuck in. Taking a break from the computer to hang out with the whiteboard is a great productivity boost (for me at least).

With the sessions in place, all that needs to happend is calculate the (max - min) time per session and then summing over all this time per user per level. When you do that, you could make a chart that looks something like this one.

It makes sense that it takes less time to level up when you're a lower level and that it takes more time to reach the higher levels. There's a few things to say about it though:

  • World of Warcraft was released in 2004 and this dataset was from 2008. It is somewhat plausible to state that levelling was harder in 2004 compared to 2008.
  • In this dataset we already had the 1st expansion that came out but not every player might have been on this expansion so the dataset might still not be uniform.
  • If you want to level up, it might really help if you have a friend with a high level. This is also reason to assume there might be some bias in the dataset.

With this in mind, my estimate for the total time it takes on average to reach level 60 is about 132.6 weeks (which is about 3.3 weeks if you're assuming a 40 hour work week).