Apache Spark and In-memory Hadoop File System (IGFS)

IGFS

Apache Ignite is a distributed in-memory caching and data processing framework.

One of Apache Ignite's distinctive capabilities is IGFS, its distributed in-memory file system. IGFS implements the Hadoop FileSystem API, so it can be plugged into existing Hadoop and Spark deployments.

In IGFS, when a file is stored, it is divided into blocks and stored in memory using a distributed cache.
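To make the block idea concrete, here is a minimal Java sketch of cutting a file into fixed-size blocks and assigning each block to a node. It is an illustration only, not Ignite code: the class name IgfsBlockSketch, the block size, and the round-robin placement are all assumptions made for this example. In a real cluster, placement is decided by Ignite's cache affinity, not by a simple modulo.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits a file into fixed-size blocks and spreads them over nodes. */
final class IgfsBlockSketch {

    /** One block of a file: its index, byte range, and the (hypothetical) node holding it. */
    static final class Block {
        final long index;
        final long offset;
        final int length;
        final int node;

        Block(long index, long offset, int length, int node) {
            this.index = index;
            this.offset = offset;
            this.length = length;
            this.node = node;
        }

        @Override
        public String toString() {
            return "block " + index + " [offset=" + offset + ", length=" + length + "] -> node " + node;
        }
    }

    /** Splits fileSize bytes into blocks of at most blockSize bytes. */
    static List<Block> split(long fileSize, int blockSize, int nodeCount) {
        if (fileSize < 0 || blockSize <= 0 || nodeCount <= 0) {
            throw new IllegalArgumentException("sizes and node count must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        long index = 0;
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            // The last block may be shorter than blockSize.
            int length = (int) Math.min(blockSize, fileSize - offset);
            // Round-robin placement is an assumption made for this sketch.
            int node = (int) (index % nodeCount);
            blocks.add(new Block(index, offset, length, node));
            index++;
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 10-byte "file" with 4-byte blocks on a 2-node cluster.
        for (Block block : split(10, 4, 2)) {
            System.out.println(block);
        }
    }
}
```

Because every block lives in the memory of some node, a reader can fetch blocks from several nodes in parallel instead of going to disk.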

In Apache Spark, a SparkContext is created for each Spark application. Each SparkContext manages its own set of executors, which execute Spark jobs. Data cached by a job in an executor of one SparkContext cannot be accessed by a job running in an executor of a different SparkContext. As a result, jobs in different Spark applications cannot share data without writing it to external storage such as HDFS.

IGFS offers an alternative to HDFS for writing and sharing intermediate Spark job data and state. Because IGFS writes data into a distributed in-memory cache rather than to disk, it can be considerably faster than HDFS for this purpose.

Ignite also provides its own persistence layer, called Ignite Native Persistence, with synchronous and asynchronous write-through modes. IGFS can also be used as a caching layer over an existing Hadoop file system. See the Apache Ignite documentation for more details.

Now, let’s see an example of how we can integrate IGFS with Apache Spark.

1. Start the Ignite Node with IGFS

Configure the file system and start Apache Ignite with IGFS. A sample configuration is given below:

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    . . . . . . . . .
    <property name="fileSystemConfiguration">
        <list>
            <bean class="org.apache.ignite.configuration.FileSystemConfiguration">
                <!-- Distinguished file system name. -->
                <property name="name" value="myFileSystem" />
                <property name="ipcEndpointConfiguration">
                    <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
                        <property name="type" value="TCP"/>
                        <property name="host" value="192.168.0.106"/>
                        <!--<property name="port" value="12345"/>-->
                    </bean>
                </property>
            </bean>
        </list>
    </property>
    . . . . . . . . .
</bean>

An IGFS file system can be accessed with a URI of the form “igfs://myFileSystem@<IP>/<folder_path>/”, e.g. “igfs://myFileSystem@192.168.0.106/<folder_path>/”.
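The parts of such a URI can be checked with the standard java.net.URI class. The sketch below assembles the address from the file system name, host, and path used in this article, then reads the parts back. The class name IgfsUriSketch and the helper igfsUri are hypothetical names made up for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative only: builds and inspects an igfs:// URI with the JDK's URI class. */
final class IgfsUriSketch {

    /** Assembles igfs://<fileSystemName>@<host><path>; the path must start with '/'. */
    static URI igfsUri(String fileSystemName, String host, String path) {
        try {
            // The file system name sits in the "user info" slot of the URI.
            // A port of -1 means "no port specified".
            return new URI("igfs", fileSystemName, host, -1, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid IGFS URI parts", e);
        }
    }

    public static void main(String[] args) {
        URI uri = igfsUri("myFileSystem", "192.168.0.106", "/file/file.txt");
        System.out.println("uri         = " + uri);
        System.out.println("file system = " + uri.getUserInfo());
        System.out.println("host        = " + uri.getHost());
        System.out.println("path        = " + uri.getPath());
    }
}
```

The name before the “@” must match the name property of the FileSystemConfiguration bean, and the host must match the IPC endpoint host.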

2. Start Apache Spark node with IGFS connector

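Add the following dependency to your Spark project. The IgniteHadoopFileSystem class configured in the next step is provided by Ignite's Hadoop integration module (ignite-hadoop), so that artifact may also be needed on the classpath: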

<dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-spring</artifactId>
    <version>2.7.5</version>
</dependency>

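Also, set the following parameter in the Spark configuration so that the igfs:// scheme is recognized as a valid Hadoop file system. Spark passes only properties prefixed with spark.hadoop. on to the Hadoop configuration, so the key carries that prefix here:

// Resolves to the Hadoop property fs.igfs.impl
sparkConf.set("spark.hadoop.fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");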
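For readers wondering how a property set on the Spark side reaches Hadoop: Spark's configuration documentation describes entries of the form spark.hadoop.abc.def as adding the Hadoop property abc.def. The sketch below imitates that mapping with plain Java maps. It is not Spark code, and the class name SparkHadoopPrefixSketch and the method toHadoopProperties are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: imitates how spark.hadoop.* entries become Hadoop properties. */
final class SparkHadoopPrefixSketch {

    static final String PREFIX = "spark.hadoop.";

    /** Keeps only entries whose key starts with the prefix, and strips the prefix from the key. */
    static Map<String, String> toHadoopProperties(Map<String, String> sparkProperties) {
        Map<String, String> hadoopProperties = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : sparkProperties.entrySet()) {
            if (entry.getKey().startsWith(PREFIX)) {
                String hadoopKey = entry.getKey().substring(PREFIX.length());
                hadoopProperties.put(hadoopKey, entry.getValue());
            }
        }
        return hadoopProperties;
    }

    public static void main(String[] args) {
        Map<String, String> sparkProperties = new LinkedHashMap<>();
        sparkProperties.put("spark.app.name", "igfs-demo"); // hypothetical application name
        sparkProperties.put("spark.hadoop.fs.igfs.impl",
                "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");

        // Only the prefixed entry is forwarded, under the key fs.igfs.impl.
        System.out.println(toHadoopProperties(sparkProperties));
    }
}
```

An alternative with the same effect is to set fs.igfs.impl directly on the Hadoop configuration object exposed by the Spark context.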

Now data can be read from and written to IGFS through the Spark DataFrame API as follows:

sparkSession.read().text("igfs://myFileSystem@192.168.0.106/file/file.txt");

and

dataFrame.write().text("igfs://myFileSystem@192.168.0.106/file/file.txt");

The complete source code is available at https://www.bugdbug.com/post/apache-spark-and-in-memory-hadoop-file-system-igfs

If you liked this article, you can find similar articles at https://www.bugdbug.com

Feel free to share your thoughts and comments.

If you find this article helpful, share it with a friend!

Amar Gajbhiye

Technology Enthusiast | Big Data Developer | Amateur Cricketer | Technical Lead Engineer @ eQ Technologic | https://www.bugdbug.com