Apache Spark: Custom streaming data source

Amar Gajbhiye
4 min read · Aug 19, 2019

In the last article, we learned how to write an Apache Spark custom data source that can be used to connect to any legacy data store. In this article, we will learn how to write a custom streaming data source.


With the rise in popularity of IoT devices, there are many sensor data streams that can be processed to gather helpful insights. This has driven the development of several stream processing frameworks, and Apache Spark is one of them.

There are two ways to perform stream processing in Apache Spark.

  1. Spark Streaming (DStreams)
  2. Structured Streaming

Spark Streaming is a separate module for working with streaming data. It uses Discretized Streams (a.k.a. DStreams) to represent data streams. Structured Streaming was introduced in Spark 2.0 and has since become the preferred way of doing stream processing.

With Structured Streaming, the same Spark SQL interfaces can be used for stream processing, which lets us reuse the same code for batch and stream processing. With newer releases, Apache Spark has added important features such as static-stream joins, stream-stream joins, and streaming aggregations, among other streaming operations. These feature-rich capabilities make Apache Spark one of the most popular choices as a stream processing engine.
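To illustrate this, here is a minimal sketch (the schema, the directory path, and the column names are assumptions for illustration, not from this article) showing the same aggregation code applied to a batch dataset and a streaming dataset:

StructType schema = new StructType()
        .add("deviceId", DataTypes.StringType)
        .add("reading", DataTypes.DoubleType);

// Batch read and streaming read share the same Dataset API.
Dataset<Row> batchDf = sparkSession.read().schema(schema).json("events/");
Dataset<Row> streamDf = sparkSession.readStream().schema(schema).json("events/");

// The same transformation works on both datasets.
Dataset<Row> batchCounts = batchDf.groupBy("deviceId").count();
Dataset<Row> streamCounts = streamDf.groupBy("deviceId").count();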

Out of the box, Apache Spark supports Kafka, files, and TCP sockets as streaming input sources. As explained in the last article, Apache Spark provides a way to write a custom data source. In this article, I will explain how to write a custom streaming data source for Apache Spark.

For the sake of simplicity, let's build a custom streaming data source that reads from a CSV file.

The first interface we need to implement is DataSourceV2. It is a marker interface that identifies our class as a v2 data source.

ReadSupport

To add read support to our custom streaming data source, we need to implement the MicroBatchReadSupport interface.
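A minimal sketch of the entry-point class could look like the following (targeting the Spark 2.3/2.4 DataSource V2 API; CSVMicroBatchReader is a hypothetical reader class, sketched further below):

import java.util.Optional;

import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.MicroBatchReadSupport;
import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader;
import org.apache.spark.sql.types.StructType;

// Entry point of the custom streaming source. DataSourceV2 is only a marker interface;
// MicroBatchReadSupport is what actually adds micro-batch read support.
public class CSVStreamingSource implements DataSourceV2, MicroBatchReadSupport {

    @Override
    public MicroBatchReader createMicroBatchReader(Optional<StructType> schema,
                                                   String checkpointLocation,
                                                   DataSourceOptions options) {
        // "filepath" is the option passed through .option("filepath", ...) when defining the stream.
        String filePath = options.get("filepath").orElse(null);
        return new CSVMicroBatchReader(filePath); // hypothetical reader implementation
    }
}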

The MicroBatchReader implementation keeps reading data from the source and creates a batch of InputPartitions for a given offset range. The flow is as follows (a sketch follows the list):

  1. getEndOffset() is invoked to determine the latest available end offset at that moment.
  2. That end offset is passed to setOffsetRange(start, end). The offset range is used to create the InputPartitions.
  3. For each InputPartition, an InputPartitionReader is created, which then reads the data for the given offset range.
  4. When data has been read up to a certain offset, that offset is committed, so that we can flush the data up to that offset from our local buffer.
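A possible, much simplified sketch of such a reader is shown below. It buffers CSV lines in memory and exposes them via offsets; the file-tailing logic, error handling, and offset bookkeeping after commits are omitted, and the class and column names are assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;
import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader;
import org.apache.spark.sql.sources.v2.reader.streaming.Offset;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.unsafe.types.UTF8String;

public class CSVMicroBatchReader implements MicroBatchReader {

    private final List<String> buffer = new ArrayList<>(); // lines read so far from the CSV file
    private long startOffset = 0;
    private long endOffset = 0;

    public CSVMicroBatchReader(String filePath) {
        // A background thread would tail filePath and append new lines to the buffer.
    }

    @Override
    public void setOffsetRange(Optional<Offset> start, Optional<Offset> end) {
        this.startOffset = start.map(o -> ((SimpleOffset) o).value()).orElse(0L);
        this.endOffset = end.map(o -> ((SimpleOffset) o).value()).orElse((long) buffer.size());
    }

    @Override
    public Offset getStartOffset() { return new SimpleOffset(startOffset); }

    @Override
    public Offset getEndOffset() { return new SimpleOffset(endOffset); }

    @Override
    public Offset deserializeOffset(String json) { return new SimpleOffset(Long.parseLong(json)); }

    @Override
    public void commit(Offset end) {
        // Data up to `end` has been processed; it could now be flushed from the local buffer.
    }

    @Override
    public StructType readSchema() {
        return new StructType().add("value", DataTypes.StringType);
    }

    @Override
    public List<InputPartition<InternalRow>> planInputPartitions() {
        // One partition covering the lines between the current start and end offsets.
        List<String> slice = new ArrayList<>(buffer.subList((int) startOffset, (int) endOffset));
        List<InputPartition<InternalRow>> partitions = new ArrayList<>();
        partitions.add(new CSVInputPartition(slice));
        return partitions;
    }

    @Override
    public void stop() { }

    // An offset here is simply a position in the buffered list of lines.
    static class SimpleOffset extends Offset {
        private final long value;
        SimpleOffset(long value) { this.value = value; }
        long value() { return value; }
        @Override
        public String json() { return String.valueOf(value); }
    }

    static class CSVInputPartition implements InputPartition<InternalRow> {
        private final List<String> lines;
        CSVInputPartition(List<String> lines) { this.lines = lines; }

        @Override
        public InputPartitionReader<InternalRow> createPartitionReader() {
            return new InputPartitionReader<InternalRow>() {
                private int index = -1;

                @Override
                public boolean next() { return ++index < lines.size(); }

                @Override
                public InternalRow get() {
                    return new GenericInternalRow(new Object[]{UTF8String.fromString(lines.get(index))});
                }

                @Override
                public void close() { }
            };
        }
    }
}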

Now, let’s use this data source to fetch the data.

Dataset<Row> dataset = sparkSession
        .readStream()
        .format("com.bigdataprojects.customstreamingsource.CSVStreamingSource")
        .option("filepath", "path_to_file")
        .load();

So far, we have only defined a dataset that will start receiving data from our custom source. To actually start receiving the data, we have to set up an output stream using one of the available sink formats and a suitable output mode, depending on the aggregations and operations involved.

StreamingQuery is the handle we get to control the execution and to receive execution updates such as the query status.
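For example, a minimal sketch using the built-in console sink (the sink format and output mode here are just illustrative choices; exception handling is omitted):

StreamingQuery query = dataset
        .writeStream()
        .format("console")
        .outputMode("append")
        .start();

// The StreamingQuery handle controls the execution and exposes status/progress updates.
System.out.println(query.status());
query.awaitTermination();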

Static data source as a streaming data source

Any static data source can be converted into a streaming data source. We just need to take care of the termination condition.

A streaming data source is expected to receive a continuous stream of data. So, unless the user stops the query, it keeps receiving data and the transformations keep running. This is not the case for a static data source: once the complete data has been fetched from the source, the user would want to stop running the operations. This can be achieved by having the source send a special terminal value, which can then be used to stop the write stream.
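One possible way to do this, sketched below, assumes the source emits a sentinel row (here, the string "EOF" in a column named "value") once the static data is exhausted. foreachBatch (available since Spark 2.4) runs on the driver, so it can safely flip a flag that the main thread watches before stopping the query (exception handling omitted):

AtomicBoolean finished = new AtomicBoolean(false);

StreamingQuery query = dataset
        .writeStream()
        .outputMode("append")
        .foreachBatch((batch, batchId) -> {
            // Detect the special terminal value sent by the source.
            if (batch.filter(functions.col("value").equalTo("EOF")).count() > 0) {
                finished.set(true);
            }
            // ... regular per-batch processing goes here ...
        })
        .start();

// Stop the query from the driver once the sentinel has been seen.
while (query.isActive() && !finished.get()) {
    Thread.sleep(1000);
}
query.stop();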

There are cases where using a static data source as a streaming source makes sense, for example when fetching from the source is very slow and the visualizations built on top of it can render incremental updates. In such cases, the user does not need to wait until the whole dataset is fetched; aggregated results can be rendered incrementally as more data arrives.

Thus, a custom streaming data source helps us integrate our own stream source with Apache Spark and lets us leverage its stream processing capabilities. It can also be used to turn a static data source into a streaming source.

The complete source code for this example is shared at https://www.bugdbug.com/post/apache-spark-custom-streaming-data-source

If you like this article, check out similar articles here (https://www.bugdbug.com)

Feel free to share your thoughts and comments.

If you find this article helpful, share it with a friend!

