Apache Spark: Custom streaming data source

Photo by Tim Carey on Unsplash
  1. Structured Streaming

ReadSupport

To add read support to our custom streaming data source, we need to implement the MicroBatchReadSupport interface.

  1. The end offset is passed to setOffsetRange(start, end). The offset range is used to create the InputPartitions.
  2. For each InputPartition, an InputPartitionReader is created, which then reads the data in the given offset range.
  3. Once data has been read up to a certain offset, that offset is committed, so that we can flush the data up to that offset out of our local buffer.

Once implemented, the source is loaded by its fully qualified class name:

Dataset<Row> dataset = sparkSession
    .readStream()
    .format("com.bigdataprojects.customstreamingsource.CSVStreamingSource")
    .option("filepath", "path_to_file")
    .load();
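
The offset lifecycle described in the three steps above can be illustrated with a small plain-Java simulation. This is only a sketch: it does not use any Spark classes, and the class BufferedSource and its methods (append, latestOffset, readRange, bufferedRows) are hypothetical names chosen for illustration. It also assumes offsets are simple row counters with an exclusive end, which may differ from how the actual source defines them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the offset lifecycle; no Spark classes are used.
// All names in this class are hypothetical and exist only for illustration.
class BufferedSource {
    private final List<String> buffer = new ArrayList<>();
    private long firstBufferedOffset = 0; // offset of the oldest row still buffered
    private long start = 0;
    private long end = 0;

    // A producer (for example a file-reading thread) would call this as rows arrive.
    synchronized void append(String row) {
        buffer.add(row);
    }

    // One past the newest buffered row; plays the role of the end offset.
    synchronized long latestOffset() {
        return firstBufferedOffset + buffer.size();
    }

    // Step 1: fix the range [start, end) for the next micro-batch.
    synchronized void setOffsetRange(long start, long end) {
        if (start < firstBufferedOffset || end > latestOffset() || start > end) {
            throw new IllegalArgumentException("invalid range " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
    }

    // Step 2: the rows a partition reader would return for the current range.
    synchronized List<String> readRange() {
        int from = (int) (start - firstBufferedOffset);
        int to = (int) (end - firstBufferedOffset);
        return new ArrayList<>(buffer.subList(from, to));
    }

    // Step 3: rows before `offset` are processed, so flush them from the buffer.
    synchronized void commit(long offset) {
        int n = (int) (offset - firstBufferedOffset);
        if (n < 0 || n > buffer.size()) {
            throw new IllegalArgumentException("cannot commit offset " + offset);
        }
        buffer.subList(0, n).clear();
        firstBufferedOffset = offset;
    }

    synchronized int bufferedRows() {
        return buffer.size();
    }

    public static void main(String[] args) {
        BufferedSource source = new BufferedSource();
        for (String row : Arrays.asList("r0", "r1", "r2")) {
            source.append(row);
        }
        long endOffset = source.latestOffset();
        source.setOffsetRange(0, endOffset);
        System.out.println("batch: " + source.readRange());
        source.commit(endOffset);
        System.out.println("rows still buffered: " + source.bufferedRows());
    }
}
```

Committing an offset lower than the end of the last batch leaves the unprocessed rows in the buffer, so they can be served again in the next range.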

Static data source as Spark Streaming data source

Any static data source can be turned into a streaming data source. We just need to handle the termination condition, that is, what the source does once all of its data has been read.
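
One possible way to handle the termination condition is sketched below in plain Java, without any Spark classes. The class StaticSourceStream, its methods, and the fixed batch size are all assumptions made for illustration; the source described in this article may handle termination differently. The idea is to cap the end offset at the size of the static data, so that once every row is committed the offset stops advancing and each further batch is empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation; StaticSourceStream and its methods are hypothetical names.
class StaticSourceStream {
    private final List<String> rows; // the whole static data set, e.g. lines of a file
    private final int batchSize;     // maximum rows handed out per micro-batch
    private int committed = 0;       // rows before this index are already processed

    StaticSourceStream(List<String> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.rows = new ArrayList<>(rows);
        this.batchSize = batchSize;
    }

    // End offset for the next micro-batch, capped at the size of the static data.
    int nextEndOffset() {
        return Math.min(committed + batchSize, rows.size());
    }

    // Termination condition: nothing is left once every row has been committed.
    boolean isExhausted() {
        return committed >= rows.size();
    }

    // Returns the next micro-batch and commits it; returns an empty list once exhausted.
    List<String> nextBatch() {
        int end = nextEndOffset();
        List<String> batch = new ArrayList<>(rows.subList(committed, end));
        committed = end;
        return batch;
    }

    public static void main(String[] args) {
        StaticSourceStream stream =
            new StaticSourceStream(Arrays.asList("a", "b", "c", "d", "e"), 2);
        while (!stream.isExhausted()) {
            System.out.println("batch: " + stream.nextBatch());
        }
        System.out.println("source exhausted, no further batches");
    }
}
```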

Amar Gajbhiye

Technology Enthusiast | Big Data Developer | Amateur Cricketer | Technical Lead Engineer @ eQ Technologic | https://www.bugdbug.com