When exploring a new dataset you want to be able to visualise the dataset for patterns (usually with something like ggplot2). This usually starts by selecting two variables that seem reasonable to plot against eachother and then iterating over possible visualisations until you find a pattern.
Unfortunately, this is a very time consuming process. Instead of doing all this work manually, why not just plot all variables against eachother to get a glimpse of the data?
Enter GGally
Let's do a quick hello world.
library(GGally)
library(dplyr)
ChickWeight %>%
select(Time, weight) %>%
ggpairs
The ggpairs function takes a dataframe (in this case we first reduce the dataframe to only have two columns) and makes plots between all variables. On the diagonal it lists a histogram each variable.
Discrete Values
Note that all plotted variables were continous variables, if we add a discrete variable it will automatically detect and adjust it's plotting.
ChickWeight %>%
select(Time, weight, Diet) %>%
ggpairs
If this variable is of special interest, we can assign a color to it across all subplots.
ChickWeight %>%
select(Time, weight, Diet) %>%
ggpairs(colour = "Diet")
You might notice that we are plotting a lot of charts and that this might take a lot of time. If we want to prevent this we can choose to only plot the lower triangle half.
ChickWeight %>%
select(Time, weight, Diet) %>%
ggpairs(colour = "Diet",
upper="blank")
Extra settings
There are three different situations for ggally when it is considering making a subplot between two variables:
- continous variable x continous variable
- continous variable x discrete variable
- discrete variable x discrete variable
If we now take a different dataset with all these different crossplots we will see what the function will do for each of these situations.
mtcars %>%
mutate(cyl = as.factor(cyl), am = as.factor(am)) %>%
select(hp, wt, cyl, am) %>%
ggpairs(title = "a matrix of characteristics")
Notice that I've purposefully casted the cyl and am variables for this purpose.
The upper and lower settings can adapt for different combinations of values. The following settings are available according to the documentation:
- continuous : "points", "smooth", "density", "cor", "blank"
- combo : "box", "dot", "facethist", "facetdensity", "denstrip", "blank"
- discrete: "facetbar", "ratio", "blank"
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
select(mpg, disp, hp, wt, cyl) %>%
ggpairs(title = "a matrix of characteristics",
upper = list(continuous = "density", combo = "box"),
lower = list(continuous = "smooth"),
color = "cyl")
That's a lot of exploratory plotting for just one command.
Conclusion
All in all ggally feels like a huge timesaver, albeit a bit slow. You can always prevent the long plotting time by either reducing the number of plots you are drawing or by selecting only a sample of your data (via sample_n for example).
You should also be able to find this blog on: r-bloggers