R may just have become preferable for simple webscraping jobs with the release of rvest. Before, this was something I'd rather do in Python, but the new R syntax seems to prevail. In this document I will give a small example of its syntax.

Dependencies

library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)
library(GGally)

Heroes of the Storm

We will be scraping information on video game characters from Heroes of the Storm, a popular brawler game made by Blizzard. We'll be scraping a fan website to get the information we want.

We retrieve the website simply via the read_html function.

# read the page into an xml document we can query
heroes <- read_html("http://www.heroesnexus.com/heroes")

This page contains many html nodes, which have classes. These classes are very useful for scraping and can be accessed via a css-selector string in html_nodes. The text of these nodes can then be retrieved via html_text. If you need a reminder of useful selectors, this source might help.
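
Before building the full data frame, it can help to try a single selector on its own. A minimal sketch (head merely truncates the printed output):

# check that the selector for hero names actually matches something
heroes %>% 
  html_nodes("a.hero-champion") %>% 
  html_text() %>% 
  head()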

# each selector grabs one column's worth of nodes; html_text strips the tags
df <- data.frame(
  name = heroes %>% html_nodes("a.hero-champion") %>% html_text, 
  hp_txt = heroes %>% html_nodes(".visual-quickinfo-cell .hero-hp") %>% html_text,
  attack_txt = heroes %>% html_nodes(".visual-quickinfo-cell .hero-atk") %>% html_text,
  role = heroes %>% html_nodes(".role") %>% html_text,
  attack_type = heroes %>% html_nodes(".hero-type :not(.role)") %>% html_text,
  stringsAsFactors = FALSE
)

The nodes that we've retrieved now only need their numerical values extracted. Through mutate we perform some ETL together with some functions from stringr. If \\d* confuses you, don't worry: it's a thing called a regex; go here if you want to know more.
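
If you want to see the regex in action before running the full mutate, here's a toy example (the "HP: 1025" string is made up):

# \\d* matches a run of digits; str_extract returns the first match
str_extract("HP: 1025", "HP: \\d*")   # "HP: 1025"
str_replace("HP: 1025", "HP: ", "")   # "1025"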

df <- df %>% 
  mutate(# strip the "HP: " label and keep the digits
         hp = hp_txt %>% str_extract("HP: \\d+") %>% 
           str_replace("HP: ", "") %>% as.numeric,
         # note the escaped dot: damage can be a decimal number
         attack = attack_txt %>% str_extract("Damage: \\d+\\.?\\d*") %>% 
           str_replace("Damage: ", "") %>% as.numeric,
         attack_spd = attack_txt %>% str_extract("Speed: \\d+\\.?\\d*") %>% 
           str_replace("Speed: ", "") %>% as.numeric)

Now that the dataframe is done, let's go for some visual exploration with GGally.

df %>% 
  select(hp, attack, attack_spd, attack_type) %>% 
  ggpairs(mapping = aes(colour = attack_type), title = "stats by melee/ranged")

df %>% 
  select(hp, attack, attack_spd, role) %>% 
  ggpairs(mapping = aes(colour = role), title = "stats by role")

Specialist and support heroes don't deal as much damage as assassins or warriors. Melee characters also seem to pack more of a punch. All these numbers make sense from a game balance perspective.
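
If you'd rather see the numbers than the plots, a quick dplyr summary along these lines should confirm the pattern (a sketch; the exact values depend on the scraped data):

# average damage per role, as a numeric check on the plots
df %>% 
  group_by(role) %>% 
  summarise(mean_attack = mean(attack, na.rm = TRUE)) %>% 
  arrange(desc(mean_attack))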

Conclusion

So in about 20 lines of (very straightforward) code we have:

  • retrieved the html of a website
  • parsed through the html to find nodes of interest
  • parsed the nodes of interest
  • visualised the result with two plots


In Python, I would need to use Beautiful Soup as well as requests and matplotlib to get to a similar result, and all of these feel like different styles of API. This is something R is becoming very good at: the entire language feels like one and the same API.

The html_nodes function feels lovely if you just want to quickly select a few things based on css-selectors. It plays very nicely with the %>% operator too. It even has support for rare css-selectors like :not. Very cool.
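
To see :not at work on a toy document, a small sketch using rvest's minimal_html helper (the list items here are made up):

# select every <li> that does not carry the .role class
doc <- minimal_html('<ul><li class="role">Warrior</li><li>Melee</li></ul>')
doc %>% html_nodes("li:not(.role)") %>% html_text()   # "Melee"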

R and Python: always contending.

Source

You can download the .Rmd file for this post here.

You should also be able to find this blog on: r-bloggers