Latency goes subsecond in Apache Spark Structured Streaming: Improving Offset Management in Project Lightspeed

by Jerry Peng, Pranav Anand, Sourav Gulati, Karthik Ramasamy, Michael Armbrust and Matei Zaharia, Databricks
May 15, 2023 in Engineering Blog

Apache Spark Structured Streaming is the leading open source stream processing platform. It is also the core technology that powers streaming on the Databricks Lakehouse Platform and provides a unified API for batch and stream processing. As adoption of streaming grows rapidly, diverse applications want to use it for real-time decision making. Some of these applications, especially those that are operational in nature, demand lower latency. While Spark's design enables high throughput and ease of use at a lower cost, it has not been optimized for sub-second latency.

In this blog, we focus on the improvements we have made to offset management to lower the inherent processing latency of Structured Streaming. These improvements primarily target simple, stateless operational use cases such as real-time monitoring and alerting.

Extensive evaluation of these enhancements indicates that latency has improved by 68-75%, or as much as 3X, from 700-900 ms to 150-250 ms at throughputs of 100K events/sec, 500K events/sec and 1M events/sec. Structured Streaming can now achieve latencies lower than 250 ms, satisfying SLA requirements for a large percentage of operational workloads.

Read more: https://www.databricks.com/blog/latency-goes-subsecond-apache-spark-structured-streaming
