

This document discusses some of the hardest topics to understand in R. If you're an analyst you won't need to worry too much about these topics, but advanced R users should be aware of them. The content is heavily influenced by Hadley Wickham's Advanced R book and the tidyverse documentation.

Pretty Freakin' Weird

The following few lines of code will make many programmers feel uneasy, even though it is easy to guess what they are supposed to do.

library(tidyverse)
library(stringr)

ChickWeight %>% 
  group_by(Time, Diet) %>% 
  summarise(m = mean(weight), v = var(weight))

It's not so much what is happening, but rather how it is happening. Even if you understand what %>% does you should notice something strange. The variables Time and Diet are not declared anywhere, yet we are able to use them. Within the context of the ChickWeight dataframe they are perfectly reasonable things to refer to, but it feels a bit mysterious. If I ran Diet or Time outside of this group_by call I'd get an error.
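To make the "names only exist inside the data" idea concrete, here is a minimal base R sketch. It does not show how dplyr is implemented; it only assumes the built-in ChickWeight dataset and uses quote() and eval() to show that an expression can be captured first and then run with a dataframe supplying the names.

```r
# Outside any dataframe there is no object called Diet.
found_globally <- exists("Diet")

# quote() captures the expression without running it ...
diet_expr <- quote(levels(Diet))

# ... and eval() runs it with ChickWeight supplying the column names.
diet_levels <- eval(diet_expr, ChickWeight)

print(found_globally)
print(diet_levels)
```

The same expression that has no meaning on its own becomes valid once a dataframe is supplied as the place to look names up.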

In other languages and frameworks you would expect strings to be used as parameters, or strings wrapped by a control mechanism (like sf.col("<str>") in spark). Here we are able to save some characters because apparently R allows you to do things like this.

If you're used to python or C/C++ you should feel rather uneasy at this stage. The goal of this blogpost is to build more intuition about what is going on here, both for myself and for anybody reading this.

API for Graphs

Let's make a small object that will learn a graph structure out of a dataframe. It will be a separate R object. If you've never seen s3 classes before this blogpost or this resource might help. The graaf object will represent a bi-directional graph such that each relation between two people results in two directed arcs.

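# Each row is one relationship between two people.
# tibble() replaces the deprecated data_frame().
df <- tibble(n1 = c("Alice", "Bob", "Claire", "Desmond", "Desmond"), 
             n2 = c("Bob", "Claire", "Desmond", "Alice", "Claire"), 
             relationship = c("friends", "family", "friends", "enemies", "family"))
# A mirrored copy of every edge, stacked on top of the original rows.
df2 <- tibble(n1=df$n2, n2=df$n1, relationship=df$relationship) %>% rbind(df)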
graaf <- function(x, ...) UseMethod("graaf")

graaf.default <- function(df, ...){
  res <- list()
  # Mirror every edge so each relationship yields two directed arcs.
  res$df <- tibble(n1=df$n2, n2=df$n1, relationship=df$relationship) %>% 
    rbind(df) %>% 
    arrange(n1)
  class(res) <- "graaf"
  res
}

print.graaf <- function(graaf, ...){
  df <- graaf$df
  cat("graph of type graaf.")
  n_nodes <- c(df$n1, df$n2) %>% unique() %>% length()
  cat(paste0("\n  - number of nodes:", n_nodes))
  n_edges <- df %>% nrow()
  cat(paste0("\n  - number of edges:", n_edges))
  cat("\nhere are the top examples:\n")
  for(i in seq_len(nrow(head(df, 10)))){
    cat(paste0("  - ", df$n1[i], " -[", df$relationship[i], "]-> ", df$n2[i], "\n"))
  }
}

Let's not spend time on the internals of this graph, which could be much better. We can confirm that this represents a new object that we can write an api for.

graaf(df) %>% print()

We will not concern ourselves with the data structure but instead we will focus on the way we might want to communicate with the graph. Suppose we would like to construct queries such as:

graaf(df) %>% select(Alice) %>% grab(family)
graaf(df) %>% grab(enemies)

In this case the select verb will subset nodes and the grab verb will select relationships. It would be cool if we could pass in the name of the nodes or the types of edges without having to communicate strings.

Node Selection

Let's first implement select.

select.graaf <- function(graaf, ...){
  quoted <- quos(...) 
  node_names <- quoted %>% 
    as.character() %>% 
    str_replace("~", "")

  graaf$df %>% 
    filter(n1 %in% node_names) %>% 
    graaf()
}

Note that I am using quos(...) to enquote whatever is being passed into the function and that I am casting these captured quotes as strings which I can then use in dplyr.
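Turning quosures into strings with as.character() and then stripping the ~ depends on how quosures happen to deparse, which may differ between rlang versions. As a hedged alternative, the sketch below assumes a reasonably recent rlang and uses ensyms() and as_string() to get the names directly. The helper node_names_of is a hypothetical name for illustration, not part of the graaf code above.

```r
library(rlang)

# node_names_of is a hypothetical helper: it captures bare names passed
# through ... as symbols and converts each symbol to a string.
node_names_of <- function(...) {
  symbols <- ensyms(...)
  vapply(symbols, as_string, character(1), USE.NAMES = FALSE)
}

selected <- node_names_of(Alice, Bob)
print(selected)
```

Because ensyms() only accepts bare names or strings, something like node_names_of(Alice + 1) fails early instead of silently producing an odd string.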

> graaf(df) %>% select(Alice)
graph of type graaf.
  - number of nodes:3
  - number of edges:4
here are the top examples:
  - Alice -[enemies]-> Desmond
  - Alice -[friends]-> Bob
  - Bob -[friends]-> Alice
  - Desmond -[enemies]-> Alice

> graaf(df) %>% select(Bob, Desmond)
graph of type graaf.
  - number of nodes:4
  - number of edges:10
here are the top examples:
  - Alice -[friends]-> Bob
  - Alice -[enemies]-> Desmond
  - Bob -[friends]-> Alice
  - Bob -[family]-> Claire
  - Claire -[family]-> Bob
  - Claire -[friends]-> Desmond
  - Claire -[family]-> Desmond
  - Desmond -[friends]-> Claire
  - Desmond -[enemies]-> Alice
  - Desmond -[family]-> Claire

That seems to be working fine.

Edge Selection

Let's now implement grab.

grab <- function(x, ...) UseMethod("grab")
grab.graaf <- function(graaf, ...){
  quoted <- quos(...) 
  rel_names <- quoted %>% 
    as.character() %>% 
    str_replace("~", "")

  graaf$df %>% 
    filter(relationship %in% rel_names) %>% 
    graaf()
}

Basically we have just repeated the exact same trick but on a different column.

graaf(df) %>% grab(enemies)
graaf(df) %>% select(Alice, Bob) %>% grab(family)

This works! We see that this form of quoting is useful because it lets us communicate with dplyr (which manages our underlying data structure).

The implementation of this is fine and all, but it may still feel like this quoting stuff is only syntactic sugar for turning variable names into strings. Let's consider a different application in dplyr.

Grouping

Let's make our own custom grouping function that just shows how often a certain subset occurs.

show_size <- function(df, ...){
  df %>% 
    group_by(!!!quos(...)) %>% 
    summarise(n = n())
}

ChickWeight %>% show_size(Diet)
ChickWeight %>% show_size(Chick)
ChickWeight %>% show_size(Diet, Time)

You may have noticed the triple !!!. This is the tidyverse way of telling dplyr to unpack the quoted inputs of the function and evaluate each of them as if it had been typed directly into group_by. The idea is that this pattern can replace the group_by_ pattern we used to have with strings.
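To see what !!! does without running any dplyr code, here is a small sketch that only builds a call and never evaluates it. It assumes the rlang package; exprs() and expr() are rlang functions, and the variable names are just for illustration.

```r
library(rlang)

# Capture two bare names as a list of unevaluated expressions.
grouping_vars <- exprs(Diet, Time)

# !!! splices the list into the call, one argument per element.
# expr() only builds the call; nothing is evaluated here.
spliced_call <- expr(group_by(ChickWeight, !!!grouping_vars))

print(spliced_call)
```

The printed call reads as if Diet and Time had been typed by hand, which is exactly what group_by needs in order to treat them as column references.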

Note that we are not merely passing in names that get converted to strings. The group_by function can now actually evaluate what is passed in without any translation to strings.

At this point you probably appreciate quoting a bit more but probably still feel like all this quoting business is simply tied to dplyr or other things in the tidyverse. Is quoting just another way to communicate variables as if they are strings?

Another Reason to Quote

That's not the goal of all this quoting business. The goal is not to parse variables as if they were mere strings so that we can save on typing "". The idea is that we can use quoting to delay evaluation. To demonstrate this, suppose that I have the following bit of code.

z <- x + y

There is nothing wrong with this expression, but it will give an error if there is no known x or y. So we could wrap it into an expression.

expr <- quo(z <- x + y)

This expression can be evaluated whenever we wish via the eval_tidy function from the rlang package. If we run it now, it will give an appropriate error.

rlang::eval_tidy(quo(x+y)) 
# Error in overscope_eval_next(overscope, expr) : object 'x' not found

The current evaluation gives an error because x does not exist. Let's add it.

x <- 1 
rlang::eval_tidy(quo(x+y)) # Error in overscope_eval_next(overscope, expr) : object 'y' not found

The error message is different. Let's add y.

y <- 2 
rlang::eval_tidy(quo(x+y)) # [1] 3

It evaluates. If we now remove the variables from our environment we will see the same error again.

rm(x,y)
rlang::eval_tidy(quo(x+y)) # Error in overscope_eval_next(overscope, expr) : object 'x' not found

When we run rlang::eval_tidy it searches for assigned values of x and y and it throws errors if it cannot find them. We could also pass it along explicitly if we don't want to rely on an environment.

rlang::eval_tidy(quo(x+y), list(x=1, y=3)) # [1] 4

Because the expression is quoted, it is not run until we want it to be run.
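One detail worth spelling out is where the values come from when the expression finally runs. The sketch below assumes rlang and shows that a quosure remembers the environment it was created in, and that data passed explicitly to eval_tidy takes precedence over that environment. The function make_quosure is a hypothetical helper for illustration.

```r
library(rlang)

# make_quosure is a hypothetical helper for this sketch.
make_quosure <- function() {
  x <- 10       # local to this function
  quo(x + 1)    # the quosure remembers this environment
}

x <- 1          # a different x where the quosure is later evaluated
q <- make_quosure()

from_env  <- eval_tidy(q)                 # uses the x captured with the quosure
from_data <- eval_tidy(q, list(x = 100))  # explicitly passed data wins

print(from_env)   # 11
print(from_data)  # 101
```

This is also why a quosure is more than a string: a string would have no way of knowing which x was meant.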

When this is useful

You can capture not just a single expression, but also a bulk of code. When you use {}-brackets you can ensure that an entire block of code is passed as a single expression.

expr <- quo({
  a <- rnorm(n = n)
  b <- rnorm(n = n)
  plot(a,b)
})
rlang::eval_tidy(expr, list(n=100)) # this plots something

But why only evaluate one expression when you can evaluate many?

on_update <- function(...){
  expr <- quos(...)
  function(){
    expr %>% map(rlang::eval_tidy)
  }
}

This function outputs a function that will execute blocks of code whenever we call it. This will sort of work as a callback. Whenever it is called we can interpret it as an update mechanism receiving an event.

update <- on_update({
  ggplot(data=dataf, aes(x1, x2)) + 
    geom_point() + 
    geom_smooth()
},{
  ggplot(data=dataf, aes(x1, x2)) + 
    geom_density2d()
})

Every time that update() is called it will search for a dataframe to plot.

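# 100 independent points; tibble() replaces the deprecated data_frame().
dataf <- tibble(
  x1 = rnorm(100), 
  x2 = rnorm(100)
)
update()
# Same shape, more points.
dataf <- tibble(
  x1 = rnorm(1000), 
  x2 = rnorm(1000)
)
update()
# Now x2 depends on x1, so the plots show a correlation.
dataf <- tibble(
  x1 = rnorm(1000), 
  x2 = x1 + rnorm(1000)
)
update()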
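The same callback idea can be checked without plotting anything. This is a minimal sketch assuming rlang; make_callback is a hypothetical name that mirrors the on_update pattern, and counter stands in for the dataframe.

```r
library(rlang)

# make_callback is a hypothetical name for this sketch.
make_callback <- function(...) {
  captured <- quos(...)   # capture the expressions, do not run them yet
  function() {
    # Evaluate every captured expression at the moment of the call.
    lapply(seq_along(captured), function(i) eval_tidy(captured[[i]]))
  }
}

# counter does not exist yet; that is fine because nothing runs here.
report <- make_callback(counter * 2, counter + 1)

counter <- 5
first <- report()

counter <- 20
second <- report()

print(unlist(first))   # 10 6
print(unlist(second))  # 40 21
```

Each call to report() looks up the current value of counter, which is the same mechanism that lets update() pick up whatever dataf is at that moment.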

You may wonder in what edge case something like this may be useful. And it is fair to point out: when you're doing day to day analysis work you're probably not going to see an obvious opportunity for delayed execution. Instead you're probably going to be using tools like dplyr, ggplot2 and shiny to get on with your day to day work.

Conclusion

To understand and appreciate the internals of tools like dplyr, ggplot2 and shiny, it helps to understand quoting. You don't need to understand all of it to work with data, but I find it enlightening (albeit a little bit confusing too).

The easiest way for me to get my head around this material is to regard these quotes as mere expressions that are not evaluated yet, similar to how code in a function is also code that is not evaluated yet.
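That analogy can be put side by side in a few lines. This is a base R sketch using quote() and eval() rather than rlang, so it only illustrates the analogy, not the tidyverse machinery itself.

```r
# Both of these store code without running it; x and y do not exist yet.
delayed_fn   <- function() x + y
delayed_expr <- quote(x + y)

x <- 1
y <- 2

fn_result   <- delayed_fn()       # runs the function body now
expr_result <- eval(delayed_expr) # runs the quoted expression now

# Unlike a function call, a quoted expression is also data we can inspect.
used_names <- all.vars(delayed_expr)

print(fn_result)
print(expr_result)
print(used_names)
```

The last line hints at the difference: a quoted expression can be taken apart and rewritten before it is run, which is what tools like dplyr rely on.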

This is a post I wrote while teaching myself something new and I must admit that this quoting stuff does not immediately feel natural. So if this post feels intimidating: don't panic, just carry on. I'm glossing over some of the newer tidyverse stuff which is very advanced and very new. Odds are that 99% of the time, you'll not need to be mindful of any of this. A lot of the things that quotes and expressions solve can also be solved with functions and proper programming.

There are these naughty edge cases though where you may want things to be just that extra bit expressive and magic. Like how shiny is able to reactively update charts when something changes, or how dplyr is able to parse the expression that is passed to a mutate function to create a new column.

For the more edgy things you'd like to make I can imagine that this quoting business may be a step forward. I can also imagine that quoting will lead to better alternatives to functions like dplyr::summarise_ which did start feeling like a compromise. If you're used to other languages though, it is likely to be tricky to really get your head around this mechanic.

This blogpost is also viewable on: r-bloggers