Custom ML in R — koaning.io

Before writing this blog post I had done very little with object-oriented code in R. I never really saw it as a useful feature because I mostly use R as an analysis tool. The power of R comes partly from the fact that other people have done this work for you. Recently though, I've been tasked with writing a custom machine learning library that needed to support the predict function.

This document describes the simplest machine learning method ever and some quick details on how to implement it in R via object-oriented coding. This post is heavily inspired by this pdf and this tutorial. My goal is to have a similar document but a bit shorter, to make it easier for my own reference. I will try to compare Python to R wherever possible so that other people may find it useful too.

S3 Classes

Where Python has dictionaries, R has lists.

obj <- list(a = 1, b = 2)

Instead of having methods within these objects, R uses functions that can accept many different types of objects. Notice below that I am using the same function on a list as on a data frame.

names(obj) 
# "a" "b"
names(ChickWeight)
# "weight" "Time"   "Chick"  "Diet"

Where Python binds methods to classes, R (and Julia, by the way) dispatches through generic functions. S3 picks a method based on the class of the first argument only; full multiple dispatch needs S4. Methods do not belong to objects, they belong to functions. Instead of binding a method to an object, R lets you write many functions that share a name but apply to different classes. For example, to see which classes have a method for summary:

> methods(summary)
 [1] summary.aov                    summary.aovlist*               summary.aspell*               
 [4] summary.check_packages_in_dir* summary.connection             summary.data.frame            
 [7] summary.Date                   summary.default                summary.ecdf*             
 ...

You could also check for all methods that belong to a certain class.

> methods(class = "Date")
 [1] -             [             [[            [<-           +             as.character 
 [7] as.data.frame as.list       as.POSIXct    as.POSIXlt    Axis          c            
[13] coerce        cut           diff          format        hist          initialize   
[19] is.numeric    julian        Math          mean          months
...

If mean() is called on a Date object, R internally calls mean.Date(). To decide which method to use, R looks at function names: a method is named generic.class, so mean.Date() is the mean method for objects of class Date. A silly example: a function named foo.bar() tells R that foo can be used on a bar object.
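To make the naming rule concrete, here is a minimal sketch using only base R. The dates and variable names are made up for illustration.

```r
# Two dates, invented for this example.
d <- as.Date(c("2016-01-01", "2016-01-03"))

class(d)  # the S3 class that drives dispatch

via_generic <- mean(d)       # the generic finds mean.Date through the class
via_method  <- mean.Date(d)  # calling the method by name gives the same result

# Without the class attribute R falls back to mean.default
# and returns a plain number instead of a Date.
plain <- mean(unclass(d))
```

Calling the method by name works, but going through the generic is the normal style, since the caller then does not need to know the class.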

This should feel very odd if you are a Python programmer because it puts a lot of functions in the global namespace.

Let's create an object of class foo, just to be explicit.

obj <- list(a = 1, b = 2)
class(obj) <- 'foo' 

> class(obj) 
"foo"

We will now create a method mean.foo which will be called by the generic function mean if it is passed an object of class foo.

# accept ... so the signature matches the generic mean(x, ...)
mean.foo <- function(x, ...){
  (x$a + x$b)/2
}

> mean(obj) 
1.5

The only thing missing right now is a way to generate our own generic function. Just like mean was a generic function here, we might want to create a generic function that can be used on multiple objects.

f <- function(x) UseMethod("f")
f.foo <- function(x){
  paste(x$a, "and",  x$b, "are in this foo obj")
}
f.numeric <- function(x){
  paste("this numeric has value", x)
}

> f(obj)
[1] "1 and 2 are in this foo obj"
> f(2)
[1] "this numeric has value 2"
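Two details are not shown above: what happens when no method matches, and what happens when an object carries more than one class. The sketch below is my own illustration and uses a hypothetical generic called describe, so that it does not clash with f. The classes bar and foo are invented too.

```r
# Hypothetical generic, named describe() to avoid clashing with f().
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists.
describe.default <- function(x, ...) {
  paste("no specific method for class", class(x)[1])
}

describe.foo <- function(x, ...) {
  paste("a foo with fields:", paste(names(x), collapse = ", "))
}

# An object may carry several classes; R tries them from left to right.
child <- structure(list(a = 1, b = 2), class = c("bar", "foo"))

res_child <- describe(child)    # no describe.bar, so describe.foo is used
res_chr   <- describe("hello")  # no describe.character, so the default is used
```

The class vector is how S3 does inheritance: a bar object that is also a foo picks up every foo method for free.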

The model

I want to make a model that assumes a continuous variable $y$ and a discrete input $X$. It will average $y$ over all the combinations of values in $X$.
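As a rough idea of what that averaging looks like, here is a small base R sketch on the built-in ChickWeight data, with weight as $y$ and Diet as the only input. It only illustrates the idea and is not the model itself.

```r
# Mean chick weight for every Diet level.
# Each row of the result is the "prediction" for that level.
agg <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)

# The first row equals the plain mean of the weights with Diet == "1".
manual <- mean(ChickWeight$weight[ChickWeight$Diet == "1"])
all.equal(agg$weight[1], manual)
```

With more than one input, the same call produces one row per combination of input values that occurs in the data.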

To create this machine learning model I want a generic function that returns an object with a class. As long as I create a function via UseMethod that returns a list with an assigned class, this should work.

library(dplyr)

aggmod <- function(x, ...) UseMethod("aggmod")

aggmod.default <- function(form, data, ...){
  res <- list()

  # first variable in the formula is the response, the rest are the inputs
  vars <- all.vars(form)

  # mean of the response for every combination of the inputs
  agg <- aggregate(form, data = data, FUN = mean)
  colnames(agg) <- c(vars[-1], "pred")
  res$agg <- agg

  res$call <- match.call()
  res$formula <- form
  res$fitted.values <- data %>% left_join(res$agg, by = vars[-1]) %>% .$pred
  # keep the response as a plain vector so the errors are averaged per observation
  res$y <- data[[vars[1]]]
  res$residuals <- res$y - res$fitted.values
  res$mae <- mean(abs(res$residuals))
  res$mse <- mean(res$residuals^2)

  class(res) <- "aggmod"
  res
}

The .default method can be seen as a constructor. When we call aggmod(), it dispatches to aggmod.default and returns a list of class aggmod. This object still needs methods for a few existing generics: right now it has no pretty print representation and cannot be passed to the predict function.

print.aggmod <- function(x, ...){
  cat("Call:\n")
  print(x$call)
  cat("\nMSE:", format(x$mse, digits = 4), "\n")
  cat("MAE:", format(x$mae, digits = 4), "\n")
  invisible(x)
}

# first argument is named object to match the generic predict(object, ...)
predict.aggmod <- function(object, newdata = NULL, ...){
  if(is.null(newdata)) return(fitted(object))
  by <- all.vars(object$formula)[-1]
  newdata %>% left_join(object$agg, by = by) %>% .$pred
}

With this in place, it starts to feel like using the lm function.

modl <- aggmod(weight ~ Time + Diet, ChickWeight) 
> modl %>% print
Call:
aggmod.default(form = weight ~ Time + Diet, data = ChickWeight)

MSE: 1092 
MAE: 20.23 

> predict(modl, newdata = ChickWeight %>% sample_n(5))
[1] 187.70000  47.25000  79.68421  64.50000  66.78947
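One consequence of predicting through a join is worth spelling out: a combination of inputs that never occurred in the training data has no row in the lookup table, so its prediction is NA. The sketch below shows this with a base R lookup via match() rather than the left_join used above. The Diet level "5" is invented for the example.

```r
# Lookup table of mean weight per Diet, built from the built-in ChickWeight data.
agg <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)
names(agg)[names(agg) == "weight"] <- "pred"

# Diet "5" does not exist in ChickWeight (hypothetical unseen level).
newdata <- data.frame(Diet = c("1", "4", "5"))

# match() keeps the row order of newdata and yields NA when there is no match.
idx  <- match(newdata$Diet, as.character(agg$Diet))
pred <- agg$pred[idx]

is.na(pred)  # only the unseen level is missing
```

Whether NA is the right answer for unseen inputs depends on the use case; falling back to the overall mean would be another option.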

Conclusion

I found this exercise very helpful in understanding the R way of dealing with objects. If you come from a different programming language this may feel like a very strange way of doing things, but it has its benefits. By allowing our code to be written this way, we can do the following:

formulas <- list(weight ~ Time, weight ~ Time + Diet, weight ~ Diet)
# a named list, so the model name does not have to be recovered from the call
ml_methods <- list(lm = lm, aggmod = aggmod)
df <- data.frame(variables = character(), model = character(), mse = numeric())

# mean squared error between predictions and observed values
mse <- function(x, y){
  mean((x - y)^2)
}

for(f in formulas){
  for(name in names(ml_methods)){
    mod <- ml_methods[[name]](f, ChickWeight)
    df <- df %>% rbind(data.frame(
      variables = f %>% all.vars %>% tail(-1) %>% paste(collapse=' '),
      model = name, 
      mse = mse(mod %>% predict, ChickWeight$weight)
    ))
  }
}
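The property this loop relies on can be isolated in a small helper that works for any fitted object with a predict method. The sketch below is my own illustration: the helper name rmse is hypothetical, and it uses lm and glm from base R so that it runs on its own.

```r
# Root mean squared error for any model that has a predict() method.
# `target` is the name of the response column in `data`.
rmse <- function(model, data, target) {
  pred <- predict(model, newdata = data)
  sqrt(mean((data[[target]] - pred)^2))
}

fit_lm  <- lm(weight ~ Time, data = ChickWeight)
fit_glm <- glm(weight ~ Time + Diet, data = ChickWeight)

# The helper never checks the class; dispatch picks the right predict method.
err_lm  <- rmse(fit_lm, ChickWeight, "weight")
err_glm <- rmse(fit_glm, ChickWeight, "weight")
```

Any class that gets its own predict method can be passed to the same helper without changing it.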

Having a generic predict lets an R user focus on the statistics, because there is a common expectation of how a model object should interact with it. Not all programmers will like this style; some may say that it offers too much sugar. Another example of something that feels trippy to programmers:

`%+%` <- function(a, b){
  paste(a,b, sep ='')
}

> 'a' %+% 'b' %+% 'c'
[1] "abc"
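Built-in operators can be extended in the same S3 style: a function named +.money is used whenever + is applied to objects of class money. A minimal sketch, with a money class invented purely for illustration:

```r
# Hypothetical constructor for a class called "money".
money <- function(amount) structure(list(amount = amount), class = "money")

# S3 method for the + operator; e1 and e2 are the left and right operands.
"+.money" <- function(e1, e2) money(e1$amount + e2$amount)

total <- money(5) + money(7)
total$amount
```

This sketch assumes both operands are money objects; mixing in plain numbers would need extra handling.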

Where Python uses polymorphism for its operators, R imposes different rules but allows you to write your own operators. Most statisticians will enjoy this because the syntax lets them worry only about doing statistics, with code that feels natural to them.