Batches have a known size and are not infinite.
Say you are receiving records from users at different intervals and you want to eventually store them in a different format in a database.
Streaming to me means you're "pushing" to the database according to some rule. For example, wait and accumulate 10 records, then push. This could happen in 1 minute or in 10 hours. You know the size of the dataset (exactly 10 records). (You could also add a max wait time, and then you'd be combining batching with streaming.)
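To make the "accumulate 10 records, then push" rule concrete, here is a minimal sketch. The `RecordBuffer` class, its parameter names, and the `pushed` list standing in for the database are all made up for illustration. The optional `max_age_seconds` is the "max time" variant mentioned in the parenthetical.

```python
import time
from typing import Callable, List, Optional


class RecordBuffer:
    """Hypothetical helper: hold records and push them as one group."""

    def __init__(self, flush: Callable[[List[dict]], None],
                 max_records: int = 10,
                 max_age_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._pending: List[dict] = []
        self._oldest: Optional[float] = None

    def add(self, record: dict) -> None:
        # Remember when the first record of the current group arrived.
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(record)
        # Size rule: push as soon as the group is full.
        if len(self._pending) >= self._max_records:
            self.flush()

    def tick(self) -> None:
        # Time rule (optional): push a partial group that has waited too long.
        if (self._pending and self._max_age is not None
                and self._clock() - self._oldest >= self._max_age):
            self.flush()

    def flush(self) -> None:
        if self._pending:
            group, self._pending = self._pending, []
            self._oldest = None
            self._flush(group)


pushed = []  # stands in for the database
buf = RecordBuffer(flush=pushed.append, max_records=10)
for i in range(25):
    buf.add({"id": i})
# Two full groups of 10 have been pushed; 5 records are still waiting.
```

Without `max_age_seconds`, those last 5 records wait until 5 more arrive, which is the "1 minute or 10 hours" case.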
Batching to me means you're pulling from the database. For example, you pull once every hour. In that hour, you get 0 records or 1000 records. You don't know the size and it's potentially infinite.
From the perspective of the data source, in a streaming context, the size is finite — it’s whatever you’re sending. From the data sink’s perspective, it’s unknown how many records are going to get sent in total.
Vice versa, in a batch context, the data source has no idea how many records will eventually be requested, but the data sink knows exactly the size of the request.
That is, whoever is initiating the job knows what’s up, and whoever is targeted just has to deal with it.
But generally I believe the norm is to discuss from the sink’s perspective, because the main interesting problem is when the sink has to deal with infinity (streaming). When the source deals with infinity (batch), it’s fairly straightforward to manage: refuse requests that are too large and move on. The data isn’t going anywhere, so the sink can fix itself and re-request. Do that with streaming and data starts getting lost.
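A rough sketch of that "refuse and re-request" pattern, under made-up assumptions: the `Source` class, the `MAX_PAGE` limit, and the halving retry strategy are all hypothetical, not taken from any particular system.

```python
MAX_PAGE = 100  # hypothetical per-request limit enforced by the source


class RequestTooLarge(Exception):
    pass


class Source:
    """Hypothetical data source that holds its records until asked."""

    def __init__(self, records):
        self._records = list(records)

    def fetch(self, offset, limit):
        # The source protects itself by refusing oversized requests.
        if limit > MAX_PAGE:
            raise RequestTooLarge(f"limit {limit} exceeds {MAX_PAGE}")
        return self._records[offset:offset + limit]


def pull_all(source, limit):
    """Sink-side loop: on refusal, halve the page size and ask again."""
    out, offset = [], 0
    while True:
        try:
            page = source.fetch(offset, limit)
        except RequestTooLarge:
            limit = max(1, limit // 2)
            continue
        if not page:
            return out
        out.extend(page)
        offset += len(page)


source = Source(range(250))
rows = pull_all(source, limit=1000)  # refused at 1000, 500, 250, 125; works at 62
```

Nothing is lost during the refusals because the records stay in the source. A push-based sender has no equivalent: if the receiver refuses, the event has nowhere to wait unless someone buffers it.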
Batches -> optimized for efficiency
The article rightly says that it's not a question of streaming OR batching, because you can stream batches.
His problem is one of data transfer, and a better fit for what he's looking for is probably "polling" versus "interrupt" driven.
I've had the following view from the beginning:
- Batches are groups of data with a finite size, delivered at whatever interval you desire (this can be seconds, minutes, hours, days, or years between batches).
- Streaming is when you deliver the data "live", meaning immediately upon generation of that data. There is no defined start or end. There is no buffering or grouping at the transmitter of that data. It's constant. What you do with that data after you receive it (buffering, batching it up, ...) is irrelevant.
JMHO.
Batching is, by definition, the gathering of data records into a collection before you send it. Streaming does not do that, which is the entire point. What happens after transmission occurs, on reception, is entirely irrelevant to whether the data transfer mode is "streaming."
If I try to service a snapshot with a push system, I'll have to either buffer an unbounded number of events, discard events, or back-pressure up the source to prevent events from being created. And with push alone, my snapshot would still only be ephemeral; once I open the floodgates and start processing more events, the snapshot is gone.
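Two of those three options (discard, or push back on the source) can be sketched with a bounded inbox. The `BoundedInbox` class and its policy names are invented for illustration. The third option, buffering without bound, is just this with no capacity check.

```python
from collections import deque


class BoundedInbox:
    """Hypothetical receiver-side buffer for pushed events."""

    def __init__(self, capacity, policy):
        assert policy in ("drop_oldest", "reject")
        self._capacity = capacity
        self._policy = policy
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        """Return True if the event was accepted, False if refused."""
        if len(self.items) < self._capacity:
            self.items.append(event)
            return True
        if self._policy == "drop_oldest":
            # Discard: make room by throwing away the oldest event.
            self.items.popleft()
            self.dropped += 1
            self.items.append(event)
            return True
        # Back-pressure: refuse, so the sender has to slow down or hold it.
        return False


lossy = BoundedInbox(capacity=3, policy="drop_oldest")
for e in range(5):
    lossy.offer(e)
# lossy.items now holds 2, 3, 4 and lossy.dropped is 2

strict = BoundedInbox(capacity=3, policy="reject")
accepted = [strict.offer(e) for e in range(5)]
# accepted is [True, True, True, False, False]
```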
If I try to service a live view with a pull system, I'll have to either pull infrequently and sacrifice freshness, or pull more frequently and waste time and bandwidth reprocessing unchanged data. And with pull alone, I would still only be chasing freshness; every halving of refresh interval doubles the resource cost, until the system can't keep up.
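The doubling is plain arithmetic, assuming every pull costs roughly the same regardless of how much has changed (an assumption for this sketch; change detection or caching would bend the curve):

```python
SECONDS_PER_DAY = 86_400


def polls_per_day(interval_seconds: float) -> float:
    """Number of pulls per day at a fixed refresh interval."""
    return SECONDS_PER_DAY / interval_seconds


for interval in (3600, 1800, 900, 450):
    print(interval, polls_per_day(interval))
# 3600 24.0
# 1800 48.0
# 900 96.0
# 450 192.0
```

Worst-case staleness only shrinks linearly with the interval, while the number of pulls grows without bound as the interval approaches zero.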
The complicating real-world factor that this article alludes to is that, historically, push systems lacked the expressiveness to model complex data transformations. (And to be fair, they're up against physical limitations: many transformations simply require storing the full intermediate dataset in order to compute an incremental update.) So the solution was either to switch wholesale to pull at some point in the pipeline (and try to use caching, change detection, etc. to reduce the resource cost and enable more frequent pulling), or to introduce a pulling segment in the pipeline ("windowing" joins, aggregations, etc.) and switch back to push after.
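A join is probably the simplest example of a transformation that needs the full intermediate dataset. In the sketch below (the `IncrementalJoin` class is hypothetical and is not how any particular engine implements it), each pushed event emits only the new output rows, but both sides have to be retained forever to do that. Windowing is one way to cap that state.

```python
from collections import defaultdict


class IncrementalJoin:
    """Hypothetical push-style equi-join that emits only new matches."""

    def __init__(self):
        # Everything seen so far on each side, indexed by join key.
        # This state grows with the input; it is the stored
        # intermediate dataset the incremental update depends on.
        self._left = defaultdict(list)
        self._right = defaultdict(list)

    def on_left(self, key, value):
        self._left[key].append(value)
        return [(key, value, r) for r in self._right.get(key, [])]

    def on_right(self, key, value):
        self._right[key].append(value)
        return [(key, l, value) for l in self._left.get(key, [])]


join = IncrementalJoin()
first = join.on_left("u1", "a")    # no right-side rows yet, so no output
second = join.on_right("u1", "x")  # matches the stored left row "a"
third = join.on_left("u1", "b")    # matches the stored right row "x"
```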
It's pretty recent that push systems are attempting to match the expressiveness of pull systems (e.g. Materialize, Readyset), but people are still so used to assuming pull-based compromises that they ask questions like "How fresh does this data feed really _need_ to be?". It's analogous to asking "How long does this snapshot really _need_ to last?" - a relevant question to be sure, but maybe not one that needs to be the basis for massive architectural lifts.
For example, I’ve worked with “batch” systems that periodically go do fetches from databases or S3 buckets and then do lots of crunching, before storing the results.
Sometimes batch systems have separate fetchers and only operate against a local store; they’re still batch.
Streaming systems may have local aggregation or clumping in the arriving information; that doesn’t make it a “batch” system. Likewise streaming systems may process more than one work item simultaneously; still not a “batch”.
I associate “batch” more with “schedule” or “periodic” and “fetch”; I associate “stream” with “continuous” and “receiver”.
In the old days batch was not realtime and took a while. Imagine printing bank statements, or calculating interest on your accounts at the end of the day. You literally process them all later.
Streaming is processing the records as they arrive, continuously.
IRL you can stream then batch...but normally batch runs at a specific time and chows everything.
It does seem to me that push vs pull are slightly more standardized in usage, which might be what the author is getting at. But even then, depending on what level of abstraction in the system you are concerned with, the concepts can flip.
Just seems like a flawed premise to me, since the lambda architecture is the context in which streaming for data processing is frequently introduced. The batch vs stream discussion is more about the implementation side: tools or techniques that work best for one aren’t well suited to the other, since batch processing is usually optimized for throughput and streaming is usually optimized for latency. For example, vectorization is useful for the former and code generation is useful for the latter.
- 'Streaming' means the consumer determines server utilization rates
- 'Batch' means the server determines server utilization rates
I much prefer Batch as the processing can be performed when the server has appropriate resources, helping products run on lower spec servers.
My experience is the opposite.
You think you need streaming, so you "try it out" and build something incredibly complex with Kafka, that needs 24h maintenance to monitor congestion in every pipeline.
And 10x more expensive because your servers are always up.
And some clever (expensive) engineers who figure out how watermarks, out-of-orderness and streaming joins really work and how you can implement them in a parallel way without SQL.
And of course a Renovate bot to upgrade your fancy (but half-baked) framework (Flink) to the latest version.
When all your business case really needed was a monthly report. And that you can achieve with pub/sub and an SQL query.
In my experience the need for live data rarely comes from a business case, but from a desire to see your data live.
And if it indeed comes from a business case, you are still better off prototyping with something simple and see if it really flies before you "try it out".
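For scale, the "monthly report" version can be as small as the sketch below. It uses Python's built-in `sqlite3` with an in-memory database. The `events` table, its columns, and the sample rows are invented for illustration; a real setup would have a pub/sub consumer doing the inserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, amount REAL, created_at TEXT)"
)
# In a real system a pub/sub subscriber would append rows as they arrive.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("alice", 10.0, "2024-01-03"),
        ("bob", 5.0, "2024-01-17"),
        ("alice", 7.5, "2024-02-02"),
    ],
)

# The whole "pipeline": one aggregate query, run once a month.
report = conn.execute(
    """
    SELECT strftime('%Y-%m', created_at) AS month,
           COUNT(*) AS n_events,
           SUM(amount) AS total
    FROM events
    GROUP BY month
    ORDER BY month
    """
).fetchall()
# report is [('2024-01', 2, 15.0), ('2024-02', 1, 7.5)]
```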
aeonik•8h ago
And once you push the boundaries—high-frequency trading, deep space comms, even global-scale latency—you run into the brick wall of physics. At certain speeds and distances, simultaneity stops being objective. Two observers won’t agree on what "just happened." Causality gets slippery. Streams bifurcate.
At that point, "live" isn’t just fuzzy—it’s frame-dependent. You’re not streaming reality anymore. You’re curating a perspective.