In this document I'll give a quick review of some exciting new features for SparkR from the new 1.5 version, which had its official release today.

RStudio provisioned

Starting a full SparkR service on AWS just got a whole lot easier. It takes about 15 minutes and only requires three simple steps.

Step 1: Command line

If you have your Amazon .pem files in order on your machine, as well as your AWS credentials in your .bash_profile, you can use the command line app in the spark-ec2 folder to start up a cluster on Amazon.

From this folder, run this command:

./spark-ec2 \
    --key-pair=pems \
    --identity-file=/path/pems.pem \
    --region=eu-west-1 \
    -s 4 \
    --instance-type c3.xlarge \
    --copy-aws-credentials \
    launch my-spark-cluster

This example starts a cluster with 4 slave nodes of type c3.xlarge in the eu-west-1 region. The name of the cluster is 'my-spark-cluster'. If you want a bigger cluster, change the number of slaves and the instance type here.

Step 2: SSH + set password

The same command line app allows you to ssh into the master machine of the spark cluster.

./spark-ec2 -k pems -i /path/pems.pem --region=eu-west-1 login my-spark-cluster

You will be logged in as the root user. For security reasons it is preferable to use a separate user for RStudio. This user has already been added to the system; the only thing you need to do is assign a password to it.

passwd rstudio

Step 3: Login

First, check the public IP address of the master server.

curl icanhazip.com

Then point your favorite browser to that address on port 8787 and log in with the 'rstudio' user and the password that you've set.

You can use the startSpark.R script to quickly get started. It creates a sparkContext and a sqlContext which you can then use to create distributed dataframes on your cluster.
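
As a rough idea of what such a setup does, here is a minimal sketch assuming the standard SparkR 1.5 API; the master URL is a placeholder, and magrittr is loaded so that the %>% pipes used throughout this document work.

# Hypothetical sketch of a startSpark.R-style setup script; the actual script
# on the cluster may differ in details such as the master URL and Spark home.
library(SparkR)
library(magrittr)   # supplies the %>% pipe used in the examples below

# Connect to the cluster and create the contexts used in the rest of this post.
sc         <- sparkR.init(master = "spark://<master-ip>:7077", appName = "rstudio-sparkr")
sqlContext <- sparkRSQL.init(sc)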

When you are done

If you are done with the cluster and have saved all your files back to S3, you don't need to keep paying for the EC2 machines. The cluster can then safely be destroyed via the same command line app as before.

./spark-ec2 -k pems -i /path/pems.pem --region=eu-west-1 destroy my-spark-cluster

Linear Models

Regression Example

ddf <- createDataFrame(sqlContext, ChickWeight)
ddf %>% 
  summary %>% 
  collect
Result
  summary             weight               Time              Chick 
1   count                578                578                578 
2    mean 121.81833910034602 10.717993079584774 25.750865051903116 
3  stddev  71.01045205007217  6.752550828025898   14.5561866041844 
4     min               35.0                0.0                  1 
5     max              373.0               21.0                  9 
                Diet
1                578
2  2.235294117647059
3 1.1616716269489116
4                  1
5                  4

In this new SparkR shell you will notice that the glm method now comes from multiple packages.

?glm
Help on topic ‘glm’ was found in the following packages:

  Package               Library
  SparkR                /Users/code/Downloads/spark-1.5.0-bin-hadoop2.6/R/lib
  stats                 /Library/Frameworks/R.framework/Versions/3.2/Resources/library

You can run the model and save its parameters in a variable, just like in normal R.

dist_mod <- glm(weight ~ Time + Diet, data = ddf, family = "gaussian")

To view the characteristics of the regression, you only need to run it through the summary function (a pattern common to many things in R).

dist_mod %>% summary
Result
$coefficients
              Estimate
(Intercept)  41.157845
Time          8.750492
Diet__1     -30.233456
Diet__2     -14.067382
Diet__3       6.265952

This model can be applied to new data to create predictions, very much in the same style as you would with normal R models and dataframes.

dist_mod %>% 
  predict(newData = ddf) %>% 
  showDF
Result
+------+----+-----+----+------------------+-----+------------------+
|weight|Time|Chick|Diet|          features|label|        prediction|
+------+----+-----+----+------------------+-----+------------------+
|  42.0| 0.0|    1|   1|     (4,[1],[1.0])| 42.0|10.924388954283739|
|  51.0| 2.0|    1|   1| [2.0,1.0,0.0,0.0]| 51.0| 28.42537280265673|
|  59.0| 4.0|    1|   1| [4.0,1.0,0.0,0.0]| 59.0| 45.92635665102972|
|  64.0| 6.0|    1|   1| [6.0,1.0,0.0,0.0]| 64.0| 63.42734049940272|
|  76.0| 8.0|    1|   1| [8.0,1.0,0.0,0.0]| 76.0| 80.92832434777571|
|  93.0|10.0|    1|   1|[10.0,1.0,0.0,0.0]| 93.0| 98.42930819614871|
| 106.0|12.0|    1|   1|[12.0,1.0,0.0,0.0]|106.0|115.93029204452169|
| 125.0|14.0|    1|   1|[14.0,1.0,0.0,0.0]|125.0| 133.4312758928947|
| 149.0|16.0|    1|   1|[16.0,1.0,0.0,0.0]|149.0|150.93225974126767|
| 171.0|18.0|    1|   1|[18.0,1.0,0.0,0.0]|171.0|168.43324358964065|
| 199.0|20.0|    1|   1|[20.0,1.0,0.0,0.0]|199.0|185.93422743801366|
| 205.0|21.0|    1|   1|[21.0,1.0,0.0,0.0]|205.0|194.68471936220016|
|  40.0| 0.0|    2|   1|     (4,[1],[1.0])| 40.0|10.924388954283739|
|  49.0| 2.0|    2|   1| [2.0,1.0,0.0,0.0]| 49.0| 28.42537280265673|
|  58.0| 4.0|    2|   1| [4.0,1.0,0.0,0.0]| 58.0| 45.92635665102972|
|  72.0| 6.0|    2|   1| [6.0,1.0,0.0,0.0]| 72.0| 63.42734049940272|
|  84.0| 8.0|    2|   1| [8.0,1.0,0.0,0.0]| 84.0| 80.92832434777571|
| 103.0|10.0|    2|   1|[10.0,1.0,0.0,0.0]|103.0| 98.42930819614871|
| 122.0|12.0|    2|   1|[12.0,1.0,0.0,0.0]|122.0|115.93029204452169|
| 138.0|14.0|    2|   1|[14.0,1.0,0.0,0.0]|138.0| 133.4312758928947|
+------+----+-----+----+------------------+-----+------------------+

A small gotcha

You could compare the SparkR model results from the ChickWeight regression with the results from base R's own glm method. What you see might scare you a bit.

df <- ChickWeight
loc_mod <- glm(weight ~ Time + Diet, data = df)
loc_mod %>% summary
Results
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.9244     3.3607   3.251  0.00122 ** 
Time          8.7505     0.2218  39.451  < 2e-16 ***
Diet2        16.1661     4.0858   3.957 8.56e-05 ***
Diet3        36.4994     4.0858   8.933  < 2e-16 ***
Diet4        30.2335     4.1075   7.361 6.39e-13 ***

Compare this output with the distributed glm output above. You might be slightly frightened at this point: the two models give very different output!

Don't be scared just yet. The way that Spark has implemented machine learning is different, but it is still doing proper regression.

The main difference is that R's glm uses Diet1 as the reference category and absorbs it into the intercept, whereas SparkR's glm uses Diet4. The only difference, therefore, is a linear reparameterization of the model, and the predictions should still be the same. Another way to confirm this is to notice that the gap between the Diet2 and Diet3 coefficients is the same in both models, and the parameter for Time is identical.
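
You can verify this directly from the two coefficient tables. A quick sanity check, assuming summary(dist_mod)$coefficients is the named matrix printed above:

# Local model: Diet1 is the reference level.
coef(loc_mod)["(Intercept)"] + coef(loc_mod)["Diet4"]   # 10.92 + 30.23 = 41.16, the SparkR intercept
coef(loc_mod)["Diet3"] - coef(loc_mod)["Diet2"]         # 20.33

# Distributed model: Diet4 is the reference level.
sp_coef <- summary(dist_mod)$coefficients
sp_coef["Diet__3", "Estimate"] - sp_coef["Diet__2", "Estimate"]   # also 20.33

Running predict on the local model also gives exactly the same predictions as the prediction column of the distributed model shown earlier.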

loc_mod %>% predict(df) %>% head(20)
Result
        1         2         3         4         5         6         7         8
 10.92439  28.42537  45.92636  63.42734  80.92833  98.42931 115.93029 133.43128
        9        10        11        12        13        14        15        16
150.93226 168.43324 185.93423 194.68472  10.92439  28.42537  45.92636  63.42734
       17        18        19        20
 80.92833  98.42931 115.93029 133.43128

Classification Example

By changing the family parameter from "gaussian" to "binomial", we turn the linear regression into a logistic regression.

df <- data.frame(a = c(1,1,1,0,0,0), b = c(6,7,8,1,2,3))
ddf <- createDataFrame(sqlContext, df)
mod <- glm(a ~ b, data=ddf, family="binomial")
mod %>% 
  predict(newData = ddf) %>% 
  showDF
Result
+---+---+--------+-----+--------------------+--------------------+----------+
|  a|  b|features|label|       rawPrediction|         probability|prediction|
+---+---+--------+-----+--------------------+--------------------+----------+
|1.0|6.0|   [6.0]|  1.0|[-16.405749578003...|[7.50021053099107...|       1.0|
|1.0|7.0|   [7.0]|  1.0|[-27.201563029171...|[1.53642468901809...|       1.0|
|1.0|8.0|   [8.0]|  1.0|[-37.997376480340...|[3.14737918119432...|       1.0|
|0.0|1.0|   [1.0]|  0.0|[37.5733176778398...|[1.0,4.8096720602...|       0.0|
|0.0|2.0|   [2.0]|  0.0|[26.7775042266712...|[0.99999999999765...|       0.0|
|0.0|3.0|   [3.0]|  0.0|[15.9816907755026...|[0.99999988538542...|       0.0|
+---+---+--------+-----+--------------------+--------------------+----------+

The logistic version of glm unfortunately doesn't give us a nice summary output at the moment. This is a known missing feature and should become available as of Spark 1.6. Another problem is that SparkR currently does not seem to support string labels, which means that classification targets need to be cast to numeric or boolean values manually, as in the iris example below.

names(iris) <- c("sep_len", "sep_wid", "pet_len", "pet_wid", "species")

ddf <- sqlContext %>% 
  createDataFrame(iris) %>% 
  withColumn("to_pred", .$species == "setosa")

glm(to_pred ~ sep_len + sep_wid + pet_len + pet_wid, family = "binomial", data = ddf) %>%
  predict(newData = ddf) %>% 
  showDF

Note that Spark doesn't like dots in column names, which is why the iris columns are manually renamed here.

Result
+-------+-------+-------+-------+-------+-------+ ... +----------+
|sep_len|sep_wid|pet_len|pet_wid|species|to_pred| ... |prediction|
+-------+-------+-------+-------+-------+-------+ ... +----------+
|    5.1|    3.5|    1.4|    0.2| setosa|   true| ... |       1.0|
|    4.9|    3.0|    1.4|    0.2| setosa|   true| ... |       1.0|
|    4.7|    3.2|    1.3|    0.2| setosa|   true| ... |       1.0|
|    4.6|    3.1|    1.5|    0.2| setosa|   true| ... |       1.0|
|    5.0|    3.6|    1.4|    0.2| setosa|   true| ... |       1.0|
|    5.4|    3.9|    1.7|    0.4| setosa|   true| ... |       1.0|
|    4.6|    3.4|    1.4|    0.3| setosa|   true| ... |       1.0|
|    5.0|    3.4|    1.5|    0.2| setosa|   true| ... |       1.0|
|    4.4|    2.9|    1.4|    0.2| setosa|   true| ... |       1.0|
|    4.9|    3.1|    1.5|    0.1| setosa|   true| ... |       1.0|
|    5.4|    3.7|    1.5|    0.2| setosa|   true| ... |       1.0|
|    4.8|    3.4|    1.6|    0.2| setosa|   true| ... |       1.0|
|    4.8|    3.0|    1.4|    0.1| setosa|   true| ... |       1.0|
|    4.3|    3.0|    1.1|    0.1| setosa|   true| ... |       1.0|
|    5.8|    4.0|    1.2|    0.2| setosa|   true| ... |       1.0|
|    5.7|    4.4|    1.5|    0.4| setosa|   true| ... |       1.0|
|    5.4|    3.9|    1.3|    0.4| setosa|   true| ... |       1.0|
|    5.1|    3.5|    1.4|    0.3| setosa|   true| ... |       1.0|
|    5.7|    3.8|    1.7|    0.3| setosa|   true| ... |       1.0|
|    5.1|    3.8|    1.5|    0.3| setosa|   true| ... |       1.0|
+-------+-------+-------+-------+-------+-------+ ... +----------+
only showing top 20 rows

More Advanced Types and Queries

A few nice features were added in Spark 1.5 with regard to types and operations on distributed dataframes.

The %in% operator

This was definitely missing in Spark 1.4, but now you can do more complex queries via the %in% operator.

ddf <- createDataFrame(sqlContext, ChickWeight)
ddf %>% 
  filter(.$Diet %in% c("3","4")) %>% 
  sample(TRUE, 0.05) %>% 
  collect
Result
   weight Time Chick Diet
1      87    6    35    3
2      48    2    37    3
3     109   10    38    3
4     232   18    38    3
5      66    4    40    3
6     215   16    40    3
7     155   12    41    4
8     204   16    42    4
9     198   18    43    4
10    101    8    46    4

Date types

You can now start playing around with dates in SparkR.

df <- data.frame(
  date = as.Date("2015-04-01") + 0:99, 
  r = runif(100)
)

ddf <- createDataFrame(sqlContext, df) 
ddf %>% printSchema
Result
root
 |-- date: date (nullable = true)
 |-- r: double (nullable = true)

You can filter on these dates using normal date comparisons.

ddf %>% 
  filter(.$date > as.Date("2015-04-03")) %>% 
  filter(.$date < as.Date("2015-04-08")) %>% 
  collect
Result
        date         r
1 2015-04-04 0.5821896
2 2015-04-05 0.9939826
3 2015-04-06 0.4792869
4 2015-04-07 0.3411329

SparkR even has some lubridate-ish support for date manipulation.

ddf %>% 
    withColumn("dom", .$date %>% dayofmonth) %>% 
    withColumn("doy", .$date %>% dayofyear) %>% 
    withColumn("woy", .$date %>% weekofyear) %>% 
    head
Result
        date         r dom doy woy
1 2015-04-01 0.2736077   1  91  14
2 2015-04-02 0.2015249   2  92  14
3 2015-04-03 0.3586754   3  93  14
4 2015-04-04 0.6162447   4  94  14
5 2015-04-05 0.5220081   5  95  14
6 2015-04-06 0.4814839   6  96  15

You can do similar things with datetimes.

df <- data.frame(
    t = as.POSIXct("2015-01-01 00:00:00") + runif(1000)*10000, 
    r = runif(1000)
)

ddf <- createDataFrame(sqlContext, df) 
ddf %>% printSchema
Result
root
 |-- t: timestamp (nullable = true)
 |-- r: double (nullable = true)

Besides dates, we've also got some lubri-time methods in SparkR now.

ddf %>% 
    withColumn("hour", .$t %>% hour) %>% 
    withColumn("minute", .$t %>% minute) %>% 
    withColumn("unix_t", .$t %>% unix_timestamp) %>% 
    head
Result
                    t         r hour minute     unix_t
1 2015-01-01 01:08:09 0.1484953    1      8 1420070889
2 2015-01-01 02:27:22 0.6907954    2     27 1420075642
3 2015-01-01 01:11:24 0.6616176    1     11 1420071084
4 2015-01-01 00:11:06 0.9897747    0     11 1420067466
5 2015-01-01 02:07:13 0.2923660    2      7 1420074433
6 2015-01-01 01:06:16 0.6781178    1      6 1420070776

For most dataframe operations this is nice to start playing with, but don't expect the full flexibility of base R just yet. As of right now, dates and datetimes cannot be summarized and they cannot be used in models. This is a known issue in Jira that people are working on.
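
Until that lands, one possible workaround (just a sketch, and only useful if a numeric stand-in is acceptable for your case) is to convert the timestamp to a number first, for example with unix_timestamp as shown above, and summarize or model that column instead.

# Workaround sketch: summarize a numeric version of the timestamp
# instead of the timestamp column itself.
ddf %>% 
    withColumn("unix_t", .$t %>% unix_timestamp) %>% 
    select("unix_t", "r") %>% 
    summary %>% 
    collect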

Levenshtein

This might fall into the special use case category, but it can be surprisingly handy when trying to find similar names in a large databank.

df <- data.frame(
    a = c('saark', 'spork', 'siaru', 'soobk'),
    b = c('spark', 'spark', 'spark', 'spark')
)

ddf <- createDataFrame(sqlContext, df) 
ddf$c <- levenshtein(ddf$a, ddf$b)
ddf %>% head
Results
      a     b c
1 saark spark 1
2 spork spark 1
3 siaru spark 2
4 soobk spark 3
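
To actually use this for fuzzy matching you can filter on the computed distance; a small sketch on the same dataframe:

# Keep only the rows where 'a' is at most one edit away from 'b'.
ddf %>% 
    filter(.$c <= 1) %>% 
    head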

Regex

Again, regexes are a bit of a special use case for a lot of R users, but they are invaluable when analysing log files.

df <- data.frame(
    s = c('100-202', '2a4ta24', '300-200', 't2t2t2')
)

ddf <- createDataFrame(sqlContext, df)

ddf %>% 
    withColumn('regex1', regexp_extract(.$s, "(\\d+)-(\\d+)", 1)) %>% 
    withColumn('regex2', regexp_extract(.$s, "(\\d+)-(\\d+)", 2)) %>% 
    withColumn('regex3', regexp_replace(.$s, "(\\d+)-(\\d+)", "HERE!")) %>% 
    head
Results
        s regex1 regex2  regex3
1 100-202    100    202   HERE!
2 2a4ta24               2a4ta24
3 300-200    300    200   HERE!
4  t2t2t2                t2t2t2

The future

Spark could use some more features, but the recent additions already enable many use cases. Spark is a project with enormous traction and a new release every three months, so you can expect more to come.

Being able to work with a distributed dataframe in a dplyr-like syntax really opens up doors for R users who want to handle larger datasets. The fact that all of this can run cheaply on Amazon adds to the benefit.