Install spark on a hadoop cluster

Bhushan Pathak
Hello,

I have a 3-node Hadoop cluster - one master & 2 slaves. I want to integrate Spark with this Hadoop setup so that Spark uses YARN for job scheduling & execution.

Hadoop version : 2.7.3
Spark version : 2.1.0

I have read various documentation & blog posts, & my understanding so far is as follows -
1. I need to install Spark [download the tar.gz file & extract it] on all 3 nodes (package name & extract command shown after the list)
2. On master node, update the spark-env.sh as follows -
SPARK_DAEMON_JAVA_OPTS=-Dspark.driver.port=53411
HADOOP_CONF_DIR=/home/hadoop/hadoop-2.7.3/etc/hadoop
SPARK_MASTER_HOST=192.168.10.44

3. On master node, update the slaves file to list the IPs of the 2 slave nodes (sample contents shown after the list)
4. On master node, update spark-defaults.conf as follows -
spark.master                    spark://192.168.10.44:7077
spark.serializer                org.apache.spark.serializer.KryoSerializer

5. Repeat steps 2 - 4 on the slave nodes as well
6. HDFS & YARN services are already running (I verify this as shown after the list)
7. Directly use spark-submit to submit jobs to YARN, the command to use is -
$ ./spark-submit --master yarn --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=file:///tmp/spark-events --class org.sparkexample.WordCountTask /home/hadoop/first-example-1.0-SNAPSHOT.jar a /user/hadoop 
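
For step 1, I am assuming the pre-built package for Hadoop 2.7 (spark-2.1.0-bin-hadoop2.7.tgz) is the right one to use, extracted under /home/hadoop on each node -
$ tar -xzf spark-2.1.0-bin-hadoop2.7.tgz -C /home/hadoop/

For step 3, my understanding is that the slaves file simply lists the slave IPs, one per line, e.g. (placeholder addresses) -
192.168.10.45
192.168.10.46

For step 6, I check the running daemons with jps on each node & confirm that both NodeManagers have registered with the ResourceManager -
$ jps
$ yarn node -list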

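As a sanity check before submitting my own jar, I assume I can first run the bundled SparkPi example on YARN (jar path relative to the bin directory, jar name as shipped with Spark 2.1.0) -
$ ./spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.11-2.1.0.jar 10
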
Please let me know whether this is correct. Am I missing something?

Thanks
Bhushan Pathak