Apache Spark is a distributed computing framework with built-in support for batch and stream processing of big data. Most of this processing happens in memory, which gives it better performance. It also ships with built-in modules for SQL, machine learning, graph processing, and more.
Apache Spark can be deployed in two different modes: local mode and cluster mode.
Local mode is mainly for testing purposes. In this mode, all the main components are created inside a single process. In cluster mode, the application runs as a set of processes coordinated by the driver (SparkContext). The main components of cluster mode are:
- Driver (SparkContext)
- Cluster Manager (Resource Manager)
- Worker nodes
- Executors
The official Spark documentation's cluster mode overview has more details.
Currently, Apache Spark supports Standalone, Apache Mesos, YARN, and Kubernetes as resource managers. Standalone is Spark's built-in resource manager; it is easy to set up and a quick way to get started.
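For orientation, the cluster manager an application runs on is selected through the `--master` option of `spark-submit`. A sketch of the URL form for each manager (the host addresses and application names below are hypothetical):

```shell
REM Illustrative --master URL forms (hosts/ports are placeholders):
REM   local[*]                 - local mode, using all cores
REM   spark://host:7077        - standalone cluster manager
REM   yarn                     - YARN (uses HADOOP_CONF_DIR for discovery)
REM   mesos://host:5050        - Apache Mesos
REM   k8s://https://host:443   - Kubernetes

REM Example submission to a standalone master (hypothetical IP and jar):
bin\spark-submit --master spark://192.168.1.10:7077 --class MyApp myapp.jar
```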
There are many articles and plenty of information about how to start a standalone cluster in a Linux environment, but not much about doing so on Windows.
In this article, we will see how to start Apache Spark as a standalone cluster on the Windows platform.
A few key things before we start with the setup:
- Avoid having spaces in the installation folder of Hadoop or Spark.
- Always start Command Prompt with administrator rights, i.e., with the Run as administrator option.
- Download the JDK and add JAVA_HOME=<path_to_jdk> as an environment variable.
- Download Spark and add SPARK_HOME=<path_to_spark> as an environment variable. If you choose a Spark package pre-built for a particular version of Hadoop, there is no need to download Hadoop explicitly in the next step.
- Download Hadoop and add HADOOP_HOME=<path_to_hadoop> and add %HADOOP_HOME%\bin to PATH variable.
- Download winutils.exe (for the same Hadoop version as above) and place it under %HADOOP_HOME%\bin.
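As a sketch, the environment variables above can be set from an elevated Command Prompt with `setx` (the install paths below are hypothetical; replace them with your own locations):

```shell
REM Hypothetical install paths - adjust to where you extracted each package.
setx JAVA_HOME "C:\Java\jdk1.8.0_201"
setx SPARK_HOME "C:\spark-2.4.0-bin-hadoop2.7"
setx HADOOP_HOME "C:\hadoop"

REM Append %HADOOP_HOME%\bin to PATH. Note: setx truncates values longer
REM than 1024 characters, so editing PATH via the System Properties dialog
REM is safer on machines with a long existing PATH.
setx PATH "%PATH%;%HADOOP_HOME%\bin"
```

Open a new Command Prompt afterwards, since `setx` changes take effect only in freshly started sessions.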
Set up Master Node
Go to the Spark installation folder, open Command Prompt as administrator, and run the following command to start the master node.
The host flag (--host) is optional. It is useful for binding to a specific network interface when a machine has multiple network interfaces.
bin\spark-class org.apache.spark.deploy.master.Master --host <IP_Addr>
Set up Worker Node
Follow the above steps, then run the following command to start a worker node:
bin\spark-class org.apache.spark.deploy.worker.Worker spark://<master_ip>:<port> --host <IP_ADDR>
Your standalone cluster is now up, with the master and one worker node. You can access it from your program by pointing it at the master URL, which by default has the form spark://<master_ip>:7077 (7077 is the standalone master's default port).
These two instances can run on the same or different machines.
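As a quick sanity check, an interactive shell can be attached to the cluster; a sketch assuming a hypothetical master at 192.168.1.10 and the default port 7077:

```shell
REM Run from the Spark installation folder; the IP address is hypothetical.
bin\spark-shell --master spark://192.168.1.10:7077
```

Once the shell connects, the application should appear under Running Applications in the master's web UI.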
You can access the master's web UI at http://<master_ip>:8080 by default; each worker serves its own UI on port 8081.
If you like this article, check out similar articles here https://www.bugdbug.com