Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It allows you to express your streaming computation similarly to how you would express a batch computation on static data. It is a key feature of interest for professionals preparing for the DP-203 Data Engineering on Microsoft Azure exam.
Let’s go through the essentials of Spark Structured Streaming, how to process data using it, and the features making it superior to its older counterpart, Spark Streaming.
Structured Streaming
Structured Streaming is a high-level API that treats a live data stream as an unbounded table that keeps growing and, by default, processes it as a series of small micro-batches. It is built on Spark’s DataFrame/Dataset APIs, so it can be used from several languages. It offers unified processing for both streaming and batch data, simplifies streaming applications, and provides end-to-end exactly-once fault-tolerance guarantees when the source is replayable and the sink is idempotent.
Built-in sources include Kafka, files (for example on HDFS or cloud object storage), and, for testing, socket and rate. Built-in sinks include files, Kafka, console, and memory, while the foreach and foreachBatch sinks let you write to other systems.
Processing Data with Structured Streaming
Processing data using Structured Streaming follows these steps:
- Define Input Sources: Use spark.readStream to define an input source such as Kafka, a socket, or a directory of files. For example, to connect to a Kafka source:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
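- Define Transformations: Once the data is read, apply transformations according to your application logic. For example, splitting each Kafka message into words:
import org.apache.spark.sql.functions.{col, explode, split}
// Kafka delivers the value as binary, so cast it to a string first
val words = df.select(
  explode(
    split(col("value").cast("string"), " ")
  ).alias("word")
)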
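- Define Output Sinks: Define the destination of the processed data, such as a file system, a database, or a live dashboard. You also choose the output mode, which controls what is written on each trigger: the whole result, only new rows, or only updated rows.
// Complete mode requires an aggregation, so count the occurrences of each word
val query = words
  .groupBy("word")
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()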
All the operations above are lazy; no computation starts until start() is called.
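The three steps can be tried without a Kafka cluster. The following is a minimal sketch, assuming a local Spark installation; it swaps Kafka for Spark's built-in rate test source, and the object name RateToConsole and the bucket column are illustrative names that do not come from the text above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RateToConsole {
  // Steps 1 and 2: define the source and a transformation. Nothing runs yet.
  def bucketCounts(spark: SparkSession): DataFrame =
    spark.readStream
      .format("rate")                  // built-in test source with columns (timestamp, value)
      .option("rowsPerSecond", "5")
      .load()
      .withColumn("bucket", col("value") % 3)
      .groupBy("bucket")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RateToConsole")
      .master("local[2]")              // assumption: local run for experimentation
      .getOrCreate()

    // Step 3: define the sink. Calling start() is what launches the computation.
    val query = bucketCounts(spark)
      .writeStream
      .outputMode("complete")          // allowed because the query aggregates
      .format("console")
      .start()

    query.awaitTermination(10000)      // let the query run for about ten seconds
    query.stop()
    spark.stop()
  }
}
```

Each trigger prints the full, updated table of counts per bucket to the console.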
Structured Streaming vs Spark Streaming
Feature | Structured Streaming | Spark Streaming |
---|---|---|
API Ease | High (DataFrame/Dataset APIs) | Low (DStream APIs) |
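Fault Tolerance | End-to-end exactly-once (with replayable sources and idempotent sinks) | At-least-once output unless the application makes writes idempotent or transactional |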
Event-time processing | Yes | No |
Windowing ability | Event-time and processing-time windows | Processing-time windows only |
Late data management | Yes | No |
Stream-batch integration | Seamless | Requires extra effort |
Structured Streaming has quite a few benefits over Spark Streaming. It’s easier to use thanks to DataFrame/Dataset APIs, it provides an end-to-end exactly-once fault tolerance guarantee, and it has better handling for late data and windowing, among other things.
With the rise of big data, stream processing is gaining momentum and it’s crucial to understand and be able to apply these concepts in real-world scenarios, especially for those aiming to take the DP-203 Data Engineering on Microsoft Azure exam. The exam not only focuses on Microsoft Azure-specific data engineering considerations but also on general big data processing techniques and architectures, one key player being Apache Spark’s Structured Streaming.
Practice Test
True or False: Spark Structured Streaming can process data in real time.
- Answer: True
Explanation: Spark Structured Streaming is a scalable and fault-tolerant stream processing engine. It processes data as it arrives, by default in micro-batches, with end-to-end latencies that can be as low as about 100 milliseconds.
Which of the following can be handled by Spark Structured Streaming?
- A. Files
- B. Kafka
- C. Sockets
- D. All of the above
Answer: D. All of the above
Explanation: Spark Structured Streaming supports various types of sources including files, Kafka, and sockets.
True or False: Structured streaming does not support Machine Learning and Graph Processing.
- Answer: False
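Explanation: Structured Streaming has no machine learning or graph API of its own, but because a streaming query works on DataFrames, a model trained with Spark MLlib can be applied to a stream, and other Spark libraries can be run on each micro-batch through foreachBatch.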
What type of storage should be used to improve the speed of Spark Structured Streaming?
- A. Blob storage
- B. Data Lake Storage
- C. Cosmos DB
- D. SQL Data Warehouse
Answer: B. Data Lake Storage
Explanation: Azure Data Lake Storage is optimised for big data analytics workloads, which makes it the usual choice for the files that Spark Structured Streaming reads and writes on Azure.
Which of the following file formats is not supported by Spark Structured Streaming?
- A. Parquet
- B. JSON
- C. CSV
- D. PDF
Answer: D. PDF
Explanation: Spark Structured Streaming supports formats such as Parquet, JSON, and CSV, but does not support PDF.
True or False: Spark Structured Streaming cannot handle late data.
- Answer: False
Explanation: With watermarking, Spark Structured Streaming can handle late data.
Which SQL operation is not supported by Spark Structured Streaming?
- A. Aggregations
- B. Updates
- C. Deletes
- D. None of the above
Answer: D. None of the above
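Explanation: Spark Structured Streaming supports aggregations directly, and updates and deletes can be applied to the target table at the sink, for example inside foreachBatch. Not every DataFrame operation is available on a stream, though: sorting, for instance, is only allowed after an aggregation in Complete mode.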
Is it possible to define sliding event time windows with Spark Structured Streaming?
- A. Yes
- B. No
Answer: A. Yes
Explanation: With Spark Structured Streaming, it’s possible to define sliding event time windows to perform computations on windowed data.
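As a rough illustration of a sliding event-time window, the sketch below counts events in 10-minute windows that advance every 5 minutes. It assumes the input DataFrame has an event-time column named timestamp; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

object SlidingWindowSketch {
  // 10-minute windows sliding every 5 minutes: each event is counted
  // in the two overlapping windows that contain its timestamp.
  def windowedCounts(events: DataFrame): DataFrame =
    events
      .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"))
      .count()
}
```

Omitting the third argument of window gives a tumbling (non-overlapping) window instead.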
True or False: Structured Streaming requires manual configuration of partitioning to achieve efficient data processing.
- Answer: False
Explanation: Spark plans and manages data partitioning automatically, although settings such as spark.sql.shuffle.partitions can still be tuned for performance.
True or False: The output of a Spark Structured Streaming query is always a Dataframe.
- Answer: True
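Explanation: The result of a streaming query is modeled as a DataFrame, an unbounded result table that is continuously updated as new data arrives. Calling start() on the writer returns a StreamingQuery handle that is used to manage the running query.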
Which of the following output modes are valid in Spark Structured Streaming?
- A. Append mode
- B. Complete mode
- C. Update mode
- D. All of the above
Answer: D. All of the above
Explanation: In Spark Structured Streaming, the three output modes Append, Complete, and Update are all valid, although which ones a given query may use depends on its operations. Complete mode, for example, requires an aggregation.
Which stream processing engine does Spark Structured Streaming use?
- A. Flink
- B. Kafka Streams
- C. Spark Engine
- D. Storm
Answer: C. Spark Engine
Explanation: Spark Structured Streaming uses the Spark engine for its stream processing.
True or False: Structured Streaming can only process data in batch mode.
- Answer: False
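Explanation: Structured Streaming processes data incrementally as it arrives, by default in small micro-batches. The same DataFrame code can also run as an ordinary batch job, but the engine is not limited to batch mode.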
Transformations in Spark Structured Streaming are ________.
- A. Lazy
- B. Eager
Answer: A. Lazy
Explanation: Transformations in Spark Structured Streaming are lazy, meaning they only build up a query plan. Nothing is executed until the query is started by calling start() on the output writer.
Which one is not an output sink in Spark Structured Streaming?
- A. File systems
- B. Databases
- C. Message queues
- D. Power BI
Answer: D. Power BI
Explanation: Power BI is not an output sink in Spark Structured Streaming, but file systems, databases, and message queues are supported sinks.
Interview Questions
What is Spark Structured Streaming?
Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.
How does Spark Structured Streaming integrate with Azure Data Lake Storage?
Spark Structured Streaming can read and write data in Azure Data Lake Storage. Azure Databricks includes Azure Data Lake Storage libraries that allow Spark to work with Data Lake Storage, providing efficient I/O connectivity.
Is it possible to use SQL to process data in Spark Structured Streaming?
Yes. A streaming DataFrame can be registered as a temporary view and queried with spark.sql, and the result is again a streaming DataFrame. This lets you mix SQL with the DataFrame API in the same query.
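A minimal sketch of SQL over a stream, assuming the built-in rate test source as input; the view name events and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamingSqlSketch {
  def evenValues(spark: SparkSession): DataFrame = {
    val stream = spark.readStream.format("rate").load()
    stream.createOrReplaceTempView("events")   // hypothetical view name
    // Querying a streaming view with SQL yields another streaming DataFrame.
    spark.sql("SELECT timestamp, value FROM events WHERE value % 2 = 0")
  }
}
```

The returned DataFrame still has to be written with writeStream and started before any rows are processed.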
In what formats can Spark Structured Streaming receive data?
Spark Structured Streaming can read data in many formats, including JSON, Parquet, CSV, ORC, and text files, as well as key-value records from Kafka whose payloads may be encoded as JSON or Avro.
How can you handle late data in Spark Structured Streaming?
You can handle late data by using watermarks. A watermark lets the streaming engine track the current event time in the data and specifies how long to wait for late events. State older than the watermark is cleaned up, and events that arrive later than that threshold may be dropped.
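One way to declare a watermark is sketched below, assuming an event-time column named timestamp and a 10-minute lateness threshold chosen purely for illustration; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

object WatermarkSketch {
  def countsWithLateData(events: DataFrame): DataFrame =
    events
      .withWatermark("timestamp", "10 minutes")        // accept events up to 10 minutes late
      .groupBy(window(col("timestamp"), "5 minutes"))  // 5-minute tumbling windows
      .count()
}
```

With a watermark in place, the aggregation can be written in Append mode, because Spark knows when a window can no longer change.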
Can we use machine learning with Spark Structured Streaming?
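Yes. A model trained with Spark MLlib, for example for regression, classification, or clustering, can in most cases be applied to a streaming DataFrame to score events as they arrive. Training is usually done separately on batch data.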
How can you output data from Spark Structured Streaming to Cosmos DB?
You can output data to Cosmos DB by using the Azure Cosmos DB Spark connector.
How do we check if our Spark Structured Streaming application is working correctly?
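You can monitor a query in the Spark Web UI, which shows the application’s operations and performance and, in recent Spark versions, includes a Structured Streaming tab with input rate, processing rate, and batch duration. Programmatically, the StreamingQuery object exposes status and lastProgress.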
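For programmatic checks, a small helper along these lines can summarise a running query. It is only a sketch; the object name, method name, and output format are hypothetical.

```scala
import org.apache.spark.sql.streaming.StreamingQuery

object MonitoringSketch {
  // Summarises the current state of a query as a short string.
  def describe(query: StreamingQuery): String = {
    val progress = Option(query.lastProgress)   // null until the first batch completes
      .map(p => s"batch=${p.batchId} inputRows=${p.numInputRows}")
      .getOrElse("no progress yet")
    s"active=${query.isActive} status=${query.status.message} $progress"
  }
}
```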
What is an event time in Spark Structured Streaming?
Event time is the time that each individual event occurred on its originating device. This time is typically embedded within the events during data production.
In which programming languages can you write Spark Structured Streaming applications?
You can write Spark Structured Streaming applications in Scala, Java, Python, and R.
What kind of transformations does Spark Structured Streaming support?
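Spark Structured Streaming supports two types of transformations: stateless ones such as select and filter, which handle each row independently, and stateful ones such as aggregations, stream-stream joins, and deduplication, which keep state across micro-batches.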
What is the role of checkpoints in Structured Streaming?
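Checkpoints record the progress of a query, such as the source offsets that have been processed and any intermediate state, in a fault-tolerant storage system such as HDFS or Azure Data Lake Storage. After a failure or restart, the query resumes from the checkpoint, whose location is set with the checkpointLocation option.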
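A minimal sketch of enabling checkpointing on a file sink; the object name, method name, and directory parameters are hypothetical and would normally point at durable storage.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

object CheckpointSketch {
  def startWithCheckpoint(stream: DataFrame, outputDir: String, checkpointDir: String): StreamingQuery =
    stream.writeStream
      .format("parquet")
      .option("path", outputDir)
      .option("checkpointLocation", checkpointDir)  // offsets and state are tracked here
      .start()
}
```

Restarting the same query with the same checkpoint directory lets it continue from where it stopped instead of reprocessing the source from the beginning.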
What do you mean by Trigger intervals in Spark Structured Streaming?
Trigger intervals decide when the next micro-batch will start. This allows you to control the latency and throughput of your job.
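As an illustration, a fixed processing-time trigger might be set as follows. The 10-second interval is an arbitrary choice, and the object and method names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

object TriggerSketch {
  def everyTenSeconds(stream: DataFrame): StreamingQuery =
    stream.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))  // start a micro-batch every 10 seconds
      .start()
}
```

If no trigger is set, Spark starts the next micro-batch as soon as the previous one finishes.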
How do we specify the schema of the input data in Spark Structured Streaming?
You can explicitly specify the schema of the input data by defining a schema with StructType and filling it with StructField.
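A sketch of an explicit schema for a JSON file source follows. The field names describe an imagined sensor feed and are not taken from the text; the input directory is a parameter you would supply.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType, TimestampType}

object SchemaSketch {
  // Hypothetical schema for JSON sensor readings.
  val sensorSchema: StructType = StructType(Seq(
    StructField("deviceId", StringType, nullable = false),
    StructField("temperature", DoubleType, nullable = true),
    StructField("eventTime", TimestampType, nullable = true)
  ))

  def readSensors(spark: SparkSession, inputDir: String): DataFrame =
    spark.readStream
      .schema(sensorSchema)   // streaming file sources expect a schema up front by default
      .json(inputDir)
}
```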
Can we use Spark Structured Streaming with SQL Server in Azure?
Yes. There is no built-in streaming JDBC sink, but you can use foreachBatch to write each micro-batch to SQL Server in Azure with Spark’s batch JDBC writer.
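The foreachBatch approach could look roughly like this. It is a sketch under assumptions: the SQL Server JDBC driver is on the classpath, the URL, table name, and credentials are supplied by you, and the object and method names are hypothetical.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.StreamingQuery

object JdbcSinkSketch {
  // Builds a writer that appends one micro-batch to a JDBC table.
  def jdbcWriter(url: String, table: String, props: Properties): DataFrame => Unit =
    batch => batch.write.mode(SaveMode.Append).jdbc(url, table, props)

  // Runs the given writer once per micro-batch.
  def startPerBatch(stream: DataFrame, write: DataFrame => Unit): StreamingQuery = {
    // An explicitly typed function value avoids overload ambiguity in foreachBatch.
    val handler: (DataFrame, Long) => Unit = (batch, _) => write(batch)
    stream.writeStream.foreachBatch(handler).start()
  }
}
```

A call would then pass JdbcSinkSketch.jdbcWriter(url, table, props) as the writer. Because a micro-batch can be retried after a failure, the target table may receive duplicates unless the write is made idempotent.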