Install spark on a hadoop cluster


Bhushan Pathak

I have a 3-node Hadoop cluster - one master & 2 slaves. I want to integrate Spark with this Hadoop setup so that Spark uses YARN for job scheduling & execution.

Hadoop version : 2.7.3
Spark version : 2.1.0

I have read various documentation & blog posts & my understanding so far is that -
1. I need to install spark [download the tar.gz file & extract] on all 3 nodes
2. On the master node, update the configuration as follows -

3. On the master node, update the slaves file to list the IPs of the 2 slave nodes
4. On the master node, update spark-defaults.conf as follows -
spark.master                    spark://
spark.serializer                org.apache.spark.serializer.KryoSerializer

5. Repeat steps 2 - 4 on the slave nodes as well
6. HDFS & Yarn services are already running
7. Directly use spark-submit to submit jobs to YARN, command to use is -
$ ./spark-submit --master yarn --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=file:///tmp/spark-events --class org.sparkexample.WordCountTask /home/hadoop/first-example-1.0-SNAPSHOT.jar a /user/hadoop 
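One note on the steps above: when submitting with `--master yarn`, Spark does not use the `spark://host:port` standalone master at all - the key setting is `HADOOP_CONF_DIR`, which lets spark-submit locate yarn-site.xml and the ResourceManager. A minimal sketch of the two conf files, assuming a hypothetical install dir `/tmp/spark-demo` and a Hadoop config path that you would replace with your own:

```shell
# Sketch only: paths below are assumptions, adjust to your actual layout.
SPARK_HOME=/tmp/spark-demo
mkdir -p "$SPARK_HOME/conf"

# spark-env.sh -- HADOOP_CONF_DIR is what makes "--master yarn" work;
# the Hadoop path here is an example, point it at your real etc/hadoop.
cat > "$SPARK_HOME/conf/spark-env.sh" <<'EOF'
export HADOOP_CONF_DIR=/home/hadoop/hadoop-2.7.3/etc/hadoop
EOF

# spark-defaults.conf -- in YARN mode the master is "yarn",
# not a spark:// URL; note the class name is KryoSerializer.
cat > "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.master        yarn
spark.serializer    org.apache.spark.serializer.KryoSerializer
EOF

echo "wrote $SPARK_HOME/conf"
```

With HADOOP_CONF_DIR set this way, steps 2-4 on the slaves matter less for YARN mode: YARN's NodeManagers launch the executors, so the slaves file is only needed if you also want to run the Spark standalone master/workers.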

Please let me know whether this is correct. Am I missing something?

Bhushan Pathak