Apache Spark: Custom streaming data source

Spark offers two APIs for stream processing:
  1. Spark Streaming (DStreams)
  2. Structured Streaming

ReadSupport

  1. getEndOffset() is invoked to determine the latest offset available at that moment.
  2. That end offset is passed to setOffsetRange(start, end); the offset range is then used to create the InputPartitions.
  3. For each InputPartition, an InputPartitionReader is created, which reads the data for the given offset range.
  4. Once data has been read up to a certain offset, that offset is committed, so that the data up to that offset can be flushed from our local buffer.
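The four steps above boil down to simple offset bookkeeping on the source's side. The sketch below models that cycle in plain Java; note that LineBuffer and all of its method names are hypothetical, not Spark's DataSourceV2 interfaces. It only illustrates how reporting an end offset, reading an offset range, and committing an offset fit together, and why a commit lets the source drop buffered data:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffer a custom streaming source might keep between micro-batches.
class LineBuffer {
    private final List<String> lines = new ArrayList<>();
    private long baseOffset = 0; // stream offset of lines.get(0)

    void append(String line) { lines.add(line); }          // new data arrives

    long endOffset() { return baseOffset + lines.size(); } // step 1

    // Steps 2-3: a reader iterates the records in [start, end)
    List<String> read(long start, long end) {
        return new ArrayList<>(lines.subList((int) (start - baseOffset),
                                             (int) (end - baseOffset)));
    }

    // Step 4: committing an offset flushes everything before it
    void commit(long offset) {
        lines.subList(0, (int) (offset - baseOffset)).clear();
        baseOffset = offset;
    }
}

public class MicroBatchCycle {
    public static void main(String[] args) {
        LineBuffer buffer = new LineBuffer();
        buffer.append("row-1");
        buffer.append("row-2");

        long end = buffer.endOffset();            // step 1: end = 2
        List<String> batch = buffer.read(0, end); // steps 2-3: one micro-batch
        buffer.commit(end);                       // step 4: buffer is emptied

        System.out.println(batch);                // [row-1, row-2]
        System.out.println(buffer.endOffset());   // offsets keep growing: 2
    }
}
```

The key point the model captures: offsets are monotonically increasing positions in the stream, while the buffer only has to hold the uncommitted tail.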
A custom source is plugged in like any built-in one, by passing its fully qualified class name to format():

Dataset<Row> dataset = sparkSession
    .readStream()
    .format("com.bigdataprojects.customstreamingsource.CSVStreamingSource")
    .option("filepath", "path_to_file")
    .load();

Static data source as Spark Streaming data source


Amar Gajbhiye | Technology Enthusiast | Big Data Developer | Amateur Cricketer | Technical Lead Engineer @ eQ Technologic | https://www.bugdbug.com
