Spark is cool. It's like writing code and never having to worry about parallelism. This is nice if you have a cluster, but also if your local machine has multiple cores and sufficient memory. No need for extra packages or workarounds; the API handles the scaling automatically.
Show of force
Suppose I run the following parallel Spark code:
// Monte Carlo: sample points in the unit square and count how many
// land inside the quarter circle.
val count = sc.parallelize(1 to 100000000).map { _ =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
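For what it's worth, this is the textbook Monte Carlo estimate of pi: the fraction of points inside the quarter circle approaches pi/4, so the count above turns into an estimate like so (sticking with the 100000000 samples from the snippet):
// Roughly 3.14: four times the fraction of points inside the quarter circle.
println("Pi is roughly " + 4.0 * count / 100000000)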
The moment I run this code, the top command in my terminal shows that I'm almost reaching the full 800% of CPU power on my 8-core Mac.
That's 8x more computing power available to you as a programmer just by following the standard Spark API. No multiprocessing library needed.
More numbers
To show the potential gain in performance I've set up two benchmarks comparing local Spark to local Python. One bins samples from a uniform distribution and the other aggregates a large data file. I will use the IPython notebook %%time magic and the following Scala function to measure performance:
// Small helper: run a block, print its wall-clock time in ms, return its result.
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}
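As a quick usage example, the helper wraps any expression; the small job below is just an arbitrary illustration, not one of the benchmarks:
// Prints the elapsed time and binds the sum to `total`.
val total = time { sc.parallelize(1 to 1000000).sum() }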
Binning random numbers
Python Code 3min 23s
The quickest way for Python without extra hassle involves vectorized numpy (with a pandas pd.cut for the binning).
[len(i) for i in pd.cut(np.random.uniform(0,1,10000000),100)]
Scala Spark Code 25.2s
Notice that the Scala Spark code is hardly verbose. You only need to appreciate the functional method chaining.
sc.parallelize(1 to 100000000)
  .map(_ => scala.util.Random.nextDouble())   // draw a uniform sample per element
  .map(x => (x - x % 0.01, 1))                // key each sample by the lower edge of its 0.01-wide bin
  .reduceByKey((a, b) => a + b)               // count per bin
  .collect()
Fast Scala Spark Code
Spark can be even quicker. The documentation is not entirely up to par, but there is support for generating random numbers directly instead of casting random numbers in an intermediate .map step. It's like using numpy instead of plain Python, but in Spark with Scala.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

// Generate 100 million uniform doubles across 10 partitions and sum them.
uniformRDD(sc, 100000000, 10).sum()
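To keep this comparable with the binning benchmark above, the same generator could be combined with the bin-and-count steps from the earlier snippet. This is just a sketch and not something I timed separately:
uniformRDD(sc, 100000000, 10)
  .map(x => (x - x % 0.01, 1))   // same 0.01-wide bins as before
  .reduceByKey(_ + _)
  .collect()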
Handling large files
I took a 1.1GB .csv file from my local drive to do some benchmarks as well. Here's a small hint at the performance of just opening the file and doing a line count:
Python 13.3s
df = pd.read_csv("/some/path/largefile.csv")
df.shape
Scala Spark 1.8s
val txtfile = sc.textFile("/some/path/largefile.csv")
txtfile.count()
Bash 1.0s
$ time wc -l /some/path/largefile.csv
real 0m1.002s
user 0m0.786s
sys 0m0.213s
Aggregation
An aggregation gives a better indication of performance in a real-life scenario. Here I count the occurrences of each value in the third column of the file and show the top 100.
Python 11.1s
For the python code I excluded the 13.3s needed to load in the file.
df.groupby(['col3']).count().sort("col3").head(100)['col3']
Scala Spark 4.6s
val txtfile = sc.textFile("/some/path/largefile.csv")
txtfile.map(line => line.split(" ")(3))   // grab the column of interest
  .map(x => (x, 1))
  .reduceByKey((a, b) => a + b)           // count per value
  .map(item => item.swap)                 // (value, count) -> (count, value)
  .sortByKey(false, 1)                    // sort by count, descending
  .map(item => item.swap)                 // back to (value, count)
  .take(100)
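As a side note on the design: the swap / sortByKey / swap dance is only there to order by count. A sketch of an alternative, not timed here, lets Spark pick the largest counts directly:
txtfile.map(line => line.split(" ")(3))
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .top(100)(Ordering.by(_._2))   // 100 pairs with the highest count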
Conclusion
Benchmarks are always stenchmarks, because there are ways to get more juice out of pandas. Still, the numpy/pandas stack is one of the most performant parts of Python, and in these basic examples Spark seems to outperform it without any trouble.
The API of numpy and pandas still seems more convenient for medium-sized data. For bigger datasets, Spark provides an easy enough API that is fast and flexible. I've only scratched the surface here, as Spark also offers scalable machine learning algorithms as well as graph analysis tools.
Did I mention you can also run all of this on a Hadoop cluster? Profit.