R may just have become preferable for simple webscraping jobs with the release of rvest. Before, this was something I'd rather do in python, but the new R syntax seems to prevail. In this document I will give a small example of its syntax.
Dependencies
library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)
library(GGally)
Heroes of the storm
We will be scraping information on video game characters from heroes of the storm, a popular brawler game made by Blizzard. We'll be scraping a fan website to get the information we want.
We retrieve the website simply via the read_html method.
heroes <- read_html("http://www.heroesnexus.com/heroes")
This page contains many HTML nodes, which have classes. These are very useful for scraping and can be accessed via a CSS-selector string in html_nodes. The text of these nodes can then be accessed via html_text. If you need a reminder of useful selectors, this source might help.
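To see what html_nodes and html_text do without hitting a live site, here is a minimal sketch on a toy snippet: the class names mimic the fan site's, but the markup and the hero values are made up.

```r
library(rvest)

# a tiny in-memory page; rvest::minimal_html wraps a fragment in a valid document
page <- minimal_html('
  <div><a class="hero-champion">Raynor</a><span class="role">Assassin</span></div>
  <div><a class="hero-champion">Muradin</a><span class="role">Warrior</span></div>')

page %>% html_nodes("a.hero-champion") %>% html_text()  # "Raynor" "Muradin"
page %>% html_nodes(".role") %>% html_text()            # "Assassin" "Warrior"
```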
df <- data.frame(
  name        = heroes %>% html_nodes("a.hero-champion") %>% html_text(),
  hp_txt      = heroes %>% html_nodes(".visual-quickinfo-cell .hero-hp") %>% html_text(),
  attack_txt  = heroes %>% html_nodes(".visual-quickinfo-cell .hero-atk") %>% html_text(),
  role        = heroes %>% html_nodes(".role") %>% html_text(),
  attack_type = heroes %>% html_nodes(".hero-type :not(.role)") %>% html_text(),
  stringsAsFactors = FALSE
)
The nodes that we've retrieved now only need their numerical values extracted. Through mutate we perform some ETL with help from a few stringr functions. If \\d* confuses you, don't worry: it's a thing called a regex; go here if you want to know more.
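As a quick sketch of what such a regex does, here it is applied to a single made-up value (the real strings come from the scrape):

```r
library(stringr)

txt <- "HP: 740"                      # made-up sample value
m   <- str_extract(txt, "HP: \\d*")   # "\\d" matches one digit, "*" means zero or more
hp  <- as.numeric(str_replace(m, "HP: ", ""))
hp                                    # 740
```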
df <- df %>%
  mutate(hp = hp_txt %>% str_extract("(HP: \\d*)") %>%
           str_replace("HP: ", "") %>% as.numeric,
         attack = attack_txt %>% str_extract("(Damage: \\d+\\.?\\d*)") %>%
           str_replace("Damage: ", "") %>% as.numeric,
         attack_spd = attack_txt %>% str_extract("(Speed: \\d\\.?\\d*)") %>%
           str_replace("Speed: ", "") %>% as.numeric)
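As a sanity check, the same idea run on a single made-up row: the column names mirror the ones above, but the text values are invented for illustration.

```r
library(dplyr)
library(stringr)

# one hypothetical row standing in for the scraped text
sample_df <- data.frame(hp_txt     = "HP: 740",
                        attack_txt = "Damage: 41 Speed: 1.25",
                        stringsAsFactors = FALSE)

sample_df <- sample_df %>%
  mutate(hp = hp_txt %>% str_extract("(HP: \\d*)") %>%
           str_replace("HP: ", "") %>% as.numeric,
         attack = attack_txt %>% str_extract("(Damage: \\d+\\.?\\d*)") %>%
           str_replace("Damage: ", "") %>% as.numeric,
         attack_spd = attack_txt %>% str_extract("(Speed: \\d\\.?\\d*)") %>%
           str_replace("Speed: ", "") %>% as.numeric)

sample_df$hp          # 740
sample_df$attack_spd  # 1.25
```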
Now that the dataframe is done, let's do some visual exploration with GGally.
df %>%
select(hp, attack, attack_spd, attack_type) %>%
ggpairs(data=., color = "attack_type", title="stats by melee/ranged")
df %>%
select(hp, attack, attack_spd, role) %>%
ggpairs(data=., color = "role", title="stats by role")
Specialist and support heroes don't deal as much damage as assassins or warriors. Melee characters also seem to pack more of a punch. All these numbers make sense from a game balance perspective.
Conclusion
So in about 20 lines of (very straightforward) code we have:
- retrieved the html of a website
- parsed through the html to find nodes of interest
- parsed the nodes of interest
- visualised the result with two plots
In python, I would need to use beautiful soup as well as requests and matplotlib to get to a similar result, all of which feel like different styles of API. This is something R is becoming very good at: the entire language feels like one consistent API.
The html_nodes function feels lovely if you just want to quickly select a few things based on CSS selectors. It plays very nicely with the %>% operator too. It even has support for rarer CSS selectors like :not. Very cool.
R and python: always contending.
Source
You can download the .Rmd file for this post here.
You should also be able to find this blog on: r-bloggers