Apache Spark standalone cluster on Windows

Apache Spark is a distributed computing framework with built-in support for batch and stream processing of big data. Most of that processing happens in memory, which improves performance. It also has built-in modules for SQL, machine learning, graph processing, and more.

Apache Spark can be deployed in two modes: local mode and cluster mode.

Local mode is mainly for testing purposes. In this mode, all the main components are created inside a single process. In cluster mode, the application runs as a set of processes coordinated by the driver (SparkContext). The following are the main components of cluster mode:

  1. Master
  2. Worker
  3. Resource Manager

For more details about cluster mode, see the Cluster Mode Overview page in the Spark documentation.

Currently, Apache Spark supports Standalone, Apache Mesos, YARN, and Kubernetes as resource managers. Standalone is Spark's own resource manager. It is easy to set up, which makes it a good way to get started quickly.

There are many articles about starting a standalone cluster in a Linux environment, but not much information about doing the same on Windows.

In this article, we will see how to start Apache Spark as a standalone cluster on the Windows platform.

A few key things before we start with the setup:

  1. Avoid having spaces in the installation folder of Hadoop or Spark.
  2. Always start Command Prompt with administrator rights, i.e. with the Run as administrator option.

Prerequisites

  1. Download the JDK and add JAVA_HOME=<path_to_jdk> as an environment variable.
  2. Download Spark and add SPARK_HOME=<path_to_spark> as an environment variable. If you download Spark pre-built for a particular version of Hadoop, you do not need to download Hadoop separately in step 3.
  3. Download Hadoop, add HADOOP_HOME=<path_to_hadoop> as an environment variable, and add %HADOOP_HOME%\bin to the PATH variable.
  4. Download winutils.exe (for the same Hadoop version as above) and place it under %HADOOP_HOME%\bin.
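Before starting the cluster, it can help to confirm that the prerequisites above are actually in place. The following is a minimal sketch in Python, not part of the Spark distribution. The function name check_prerequisites and its messages are made up for illustration. It only checks the three environment variables, the "no spaces" advice for Spark and Hadoop, and the location of winutils.exe.

```python
import os
from pathlib import Path

# Environment variables named in the prerequisites list.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")
# The article advises against spaces in the Spark and Hadoop folders.
NO_SPACE_VARS = ("SPARK_HOME", "HADOOP_HOME")


def check_prerequisites(env=None):
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif name in NO_SPACE_VARS and " " in value:
            problems.append(f"{name} contains a space: {value}")
        elif not Path(value).is_dir():
            problems.append(f"{name} does not point to a folder: {value}")

    # winutils.exe is expected under %HADOOP_HOME%\bin.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
        problems.append("winutils.exe not found under HADOOP_HOME\\bin")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the script on the target machine prints one warning per failed check and nothing when everything looks fine.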

Set up Master Node

Go to the Spark installation folder, open Command Prompt as administrator, and run the following command to start the master node.

The host flag (--host) is optional. It is useful for binding to the address of a particular network interface when the machine has more than one.
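On Windows, the master is typically started through the spark-class launcher in Spark's bin folder, with org.apache.spark.deploy.master.Master as the class to run. As a rough sketch, the helper below assembles that command line, with --host added only when an address is given. The function name master_command and the example values are assumptions for illustration. The script only prints the command, so you can review it before running it in the administrator Command Prompt.

```python
import os
from pathlib import PureWindowsPath

# Entry point of the standalone master process.
MASTER_CLASS = "org.apache.spark.deploy.master.Master"


def master_command(spark_home, host=None):
    """Build the argument list that launches a standalone master on Windows."""
    # spark-class.cmd is the Windows launcher shipped in Spark's bin folder.
    launcher = str(PureWindowsPath(spark_home) / "bin" / "spark-class.cmd")
    command = [launcher, MASTER_CLASS]
    if host:
        # Optional: bind the master to one specific network interface.
        command += ["--host", host]
    return command


if __name__ == "__main__":
    # C:\spark is a placeholder used when SPARK_HOME is not set.
    spark_home = os.environ.get("SPARK_HOME", r"C:\spark")
    print(" ".join(master_command(spark_home)))
```

With SPARK_HOME set to C:\spark and a host of 192.168.1.10 (both placeholders), the resulting command is C:\spark\bin\spark-class.cmd org.apache.spark.deploy.master.Master --host 192.168.1.10.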

Set up Worker Node

Follow the same steps as for the master, then run the following command to start a worker node.
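A worker is started through the same spark-class launcher, this time with org.apache.spark.deploy.worker.Worker and the master's spark:// URL as an argument. The sketch below builds that command. The function name worker_command and the addresses are illustrative assumptions. The optional --host flag is assumed to behave for the worker as it does for the master.

```python
import os
from pathlib import PureWindowsPath

# Entry point of the standalone worker process.
WORKER_CLASS = "org.apache.spark.deploy.worker.Worker"


def worker_command(spark_home, master_url, host=None):
    """Build the argument list that launches a worker and registers it with a master."""
    if not master_url.startswith("spark://"):
        raise ValueError("master_url must look like spark://<master_ip>:<port>")
    # spark-class.cmd is the Windows launcher shipped in Spark's bin folder.
    launcher = str(PureWindowsPath(spark_home) / "bin" / "spark-class.cmd")
    command = [launcher, WORKER_CLASS, master_url]
    if host:
        # Optional: bind the worker to one specific network interface.
        command += ["--host", host]
    return command


if __name__ == "__main__":
    # Both values below are placeholders; replace them with your own.
    spark_home = os.environ.get("SPARK_HOME", r"C:\spark")
    print(" ".join(worker_command(spark_home, "spark://192.168.1.10:7077")))
```

Port 7077 in the example is the standalone master's default. Use whatever port your master reports in its start-up log.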

Your standalone cluster is now up with the master and one worker node. You can access it from your program by setting the master to spark://<master_ip>:<port>.

These two instances can run on the same or different machines.
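To illustrate how a program points at the cluster, here is a small PySpark sketch. It assumes PySpark is installed and that the master listens on the default port 7077. The helper names master_url and create_session and the application name are made up for this example.

```python
def master_url(host, port=7077):
    """Return the spark:// URL for a standalone master (7077 is the default port)."""
    return f"spark://{host}:{port}"


def create_session(host, port=7077, app_name="standalone-demo"):
    """Create a SparkSession attached to the standalone cluster."""
    # Imported here so the URL helper can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master_url(host, port))
        .appName(app_name)
        .getOrCreate()
    )
```

A typical use would be spark = create_session("<master_ip>"), followed by a small job such as spark.range(5).count(), and finally spark.stop(). While the session is alive, the application should appear in the master's list of running applications.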

Spark UI

You can access the Spark master web UI in a browser. By default it listens on port 8080 of the master machine: http://<master_ip>:8080
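If you want to check from a script that the UI is up, something along these lines should work. It is a sketch using only the Python standard library. The function names are illustrative, and port 8080 is assumed to be unchanged from the default.

```python
from urllib.error import URLError
from urllib.request import urlopen


def master_ui_url(host, port=8080):
    """Return the address of the master web UI (8080 is the default port)."""
    return f"http://{host}:{port}"


def ui_is_reachable(url, timeout=3.0):
    """Return True if the UI answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    url = master_ui_url("localhost")
    print(url, "reachable" if ui_is_reachable(url) else "not reachable")
```

If the master started on a different UI port, for example because 8080 was already taken, pass that port to master_ui_url instead.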

If you like this article, check out similar articles here: https://www.bugdbug.com

Feel free to share your thoughts and comments.

If you find this article helpful, share it with a friend!

Amar Gajbhiye

Technology Enthusiast | Big Data Developer | Amateur Cricketer | Technical Lead Engineer @ eQ Technologic | https://www.bugdbug.com