Spark is cool. It lets you write code without ever having to worry about parallelism. This is nice if you have a cluster, but also if your local machine has multiple cores and sufficient memory. There is no need for extra packages or workarounds; the API handles the scaling automatically.

Show of force

Suppose I run the following parallel Spark code:

val count = sc.parallelize(1 to 100000000).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
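This snippet is the classic Monte Carlo π estimate: it samples points in the unit square and counts how many land inside the quarter circle. Purely as an aside (the benchmark doesn't need it), the count can be turned into an estimate of π:

println("Pi is roughly " + 4.0 * count / 100000000)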

The moment I run this code, the top command in my terminal shows that I'm almost at the full 800% of CPU that my 8-core Mac has to offer.

That's 8x the computing power available to you as a programmer, just by following the standard Spark API. No multicore library needed.
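For completeness: the degree of local parallelism is still under your control. A minimal sketch, assuming you build your own context (instead of using the sc that spark-shell hands you) and an arbitrary app name:

import org.apache.spark.{SparkConf, SparkContext}

// local[*] uses every core on the machine; local[4] would cap it at four
val conf = new SparkConf().setAppName("local-benchmark").setMaster("local[*]")
val sc = new SparkContext(conf)

// the second argument sets the number of partitions the work is split into
sc.parallelize(1 to 100000000, 8).count()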

More numbers

To show the potential gain in performance I've set up two benchmarks comparing local Spark to local Python. One counts bins of draws from a uniform distribution and the other aggregates a large data file. I will use the IPython notebook %%time magic and the following Scala function to measure performance:

def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: "+(System.nanoTime-s)/1e6+"ms")
  ret
}
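Usage is simply a matter of wrapping an expression in a block; the Spark job inside is just an arbitrary example:

time { sc.parallelize(1 to 10000000).filter(_ % 2 == 0).count() }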

Binning random numbers

Python Code 3min 23s

The quickest way to do this in Python without extra hassle is to use vectorized numpy and pandas.

pd.Series(pd.cut(np.random.uniform(0, 1, 10000000), 100)).value_counts()
Scala Spark Code 25.2s

Notice that the Scala Spark code is hardly verbose; all it takes is an appreciation of functional method chaining.

sc.parallelize(1 to 100000000)
  .map(_ => scala.util.Random.nextDouble())
  .map(x => (x - x % 0.01, 1))
  .reduceByKey( (a,b) => a + b )
  .collect()
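As an aside, the manual bin-key trick can also be written with the histogram method that Spark offers on RDDs of doubles; it returns the bucket boundaries together with the counts per bucket. I haven't timed this variant, it's just a sketch of the alternative:

sc.parallelize(1 to 100000000)
  .map(_ => scala.util.Random.nextDouble())
  .histogram(100)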
Fast Scala Spark Code

Spark can be even quicker. The documentation is not entirely up to par, but there is support for generating random numbers natively instead of creating them in an intermediate .map step. It's like using numpy instead of plain Python, but in Spark with Scala.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

// 100 million uniform random numbers, spread over 10 partitions
uniformRDD(sc, 100000000, 10).sum()
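To keep the comparison with the binning benchmark fair, the natively generated numbers can be fed into the same binning logic. Again a sketch rather than a timed result:

uniformRDD(sc, 100000000, 10)
  .map(x => (x - x % 0.01, 1))
  .reduceByKey(_ + _)
  .collect()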

Handling large files

I took a 1.1GB .csv file from my local drive to do some benchmarks as well. Here's a small hint at performance for just opening the file and doing a line count:

Python 13.3 s
df = pd.read_csv("/some/path/largefile.csv")
df.shape
Scala Spark 1.8s
val txtfile = sc.textFile("/some/path/largefile.csv")
txtfile.count()
Bash 1.0s
$ time wc -l /some/path/largefile.csv
real    0m1.002s
user    0m0.786s
sys     0m0.213s
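One caveat when reading these numbers: Spark reads the file from disk on every action, while pandas pays its loading cost once and then holds everything in memory. If the same file is used repeatedly, Spark can be asked to keep it in memory too; a minimal sketch:

val txtfile = sc.textFile("/some/path/largefile.csv").cache()
txtfile.count()   // the first action reads from disk and fills the cache
txtfile.count()   // later actions run against the in-memory copy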

Aggregation

Aggregation gives a better indication of performance in a real-life scenario. Here I count the occurrences of each value in one column of the file and show the top 100 counts.

Python 11.1s

For the Python code I excluded the 13.3s needed to load the file.

df['col3'].value_counts().head(100)
Scala Spark 4.6s
val txtfile = sc.textFile("/some/path/largefile.csv")
txtfile.map(line => line.split(" ")(3))
  .map(x => (x, 1))
  .reduceByKey((a, b) => a + b)
  .map(item => item.swap)
  .sortByKey(false, 1)
  .map(item => item.swap)
  .take(100)
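The swap / sortByKey / swap dance is only there because sortByKey sorts on the key; if all you need is the 100 largest counts, the same result can be had with top and an ordering on the count. A sketch, not a benchmark:

txtfile.map(line => line.split(" ")(3))
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .top(100)(Ordering.by(_._2))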

Conclusion

Benchmarks are always stenchmarks; there are certainly ways to get more juice out of pandas. Still, the numpy/pandas stack is one of the most performant parts of Python, and in these basic examples Spark seems to outperform it without any trouble.

The numpy and pandas APIs still seem more convenient for medium-sized data. For bigger datasets, Spark provides an easy enough API that is fast and flexible. I've only scratched the surface here, as Spark also offers scalable machine learning algorithms as well as graph analysis tools.

Did I mention you can also run all of this on a Hadoop cluster? Profit.