Apache Spark and In-memory Hadoop File System (IGFS)
IGFS is Apache Ignite's distributed in-memory file system. Apache Ignite itself is a distributed in-memory caching and data processing framework, and IGFS is one of its unique capabilities. Because IGFS implements the Hadoop FileSystem API, it can easily be plugged into existing Hadoop and Spark deployments.
When a file is stored in IGFS, it is split into blocks that are kept in memory in a distributed cache.
In Apache Spark, a SparkContext is created for each Spark application. Each SparkContext manages its own set of executors, which execute Spark jobs. Data cached by a Spark job in an executor of one SparkContext cannot be accessed by a job running in an executor of a different SparkContext. This makes sharing data between different Spark applications impossible without writing it to third-party storage such as HDFS.
IGFS offers an alternative to HDFS for writing and sharing intermediate Spark job data and state. Since IGFS writes data to a distributed in-memory cache, it performs much faster than HDFS.
Ignite also provides its own persistence layer, called Ignite Native Persistence, with synchronous and asynchronous write-through modes. IGFS can also be used as a caching layer over an existing Hadoop file system. Please visit this link for more details.
Now, let’s see an example of how we can integrate IGFS with Apache Spark.
1. Start the Ignite Node with IGFS
Configure the file system and start Apache Ignite with IGFS. A sample configuration is given below:
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
  . . . . . . . . .
  <property name="fileSystemConfiguration">
    <list>
      <bean class="org.apache.ignite.configuration.FileSystemConfiguration">
        <!-- Distinguished file system name. -->
        <property name="name" value="myFileSystem" />
        <!-- TCP endpoint through which Hadoop/Spark clients connect. -->
        <property name="ipcEndpointConfiguration">
          <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
            <property name="type" value="TCP"/>
            <property name="host" value="192.168.0.106"/>
            <!--<property name="port" value="12345"/>-->
          </bean>
        </property>
      </bean>
    </list>
  </property>
  . . . . . . . . .
</bean>
The IGFS file system can then be accessed using a URI of the form "igfs://myFileSystem@<IP>/<folder_path>/", e.g. "igfs://myFileSystem@192.168.0.106/<folder_path>/".
2. Start the Apache Spark node with the IGFS connector
Add the following dependency in your Spark project:
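A minimal sketch, assuming a Maven build: the IGFS connector classes ship in Ignite's ignite-hadoop module (the version shown is illustrative; use the one matching your Ignite node):

```xml
<dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-hadoop</artifactId>
    <!-- Example version; match the version of your Ignite cluster. -->
    <version>2.8.1</version>
</dependency>
```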
Also, set the following parameters in spark config to identify IGFS as a valid file format:
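These properties map the igfs:// URI scheme to Ignite's Hadoop FileSystem implementations. Shown here in spark-defaults.conf form (the class names are Ignite's documented v1/v2 Hadoop file system implementations; they can equally be set via SparkConf):

```
spark.hadoop.fs.igfs.impl                      org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.igfs.impl   org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem
```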
Now, IGFS can be accessed to read and write data using the Spark data frame as follows:
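A minimal Scala sketch, assuming the node configuration shown in step 1 (the input and output paths are hypothetical examples):

```scala
import org.apache.spark.sql.SparkSession

object IgfsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("igfs-example")
      .getOrCreate()

    // Hypothetical paths on the IGFS node configured above.
    val input  = "igfs://myFileSystem@192.168.0.106/data/input.json"
    val output = "igfs://myFileSystem@192.168.0.106/data/output.parquet"

    // Read a DataFrame from IGFS exactly as from any Hadoop-compatible FS...
    val df = spark.read.json(input)

    // ...and write intermediate results back, where jobs running under
    // other SparkContexts can read them.
    df.write.mode("overwrite").parquet(output)

    spark.stop()
  }
}
```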
The complete source code can be found here: https://www.bugdbug.com/post/apache-spark-and-in-memory-hadoop-file-system-igfs
If you like this article, check out similar articles at https://www.bugdbug.com
Feel free to share your thoughts and comments.
If you find this article helpful, share it with a friend!