frontpage.

Novo Nordisk's Canadian Mistake

https://www.science.org/content/blog-post/novo-nordisk-s-canadian-mistake
157•jbm•2h ago•67 comments

Original C64 Lode Runner Source Code

https://github.com/Piddewitt/Loderunner
23•indigodaddy•1h ago•7 comments

Doing well in your courses: Andrej's advice for success (2013)

https://cs.stanford.edu/people/karpathy/advice.html
327•peterkshultz•6h ago•119 comments

Duke Nukem: Zero Hour N64 ROM Reverse-Engineering Project Hits 100%

https://github.com/Gillou68310/DukeNukemZeroHour
41•birdculture•2h ago•12 comments

Dosbian: Boot to DOSBox on Raspberry Pi

https://cmaiolino.wordpress.com/dosbian/
84•indigodaddy•4h ago•31 comments

Airliner hit by possible space debris

https://avbrief.com/united-max-hit-by-falling-object-at-36000-feet/
146•d_silin•5h ago•70 comments

Compare Single Board Computers

https://sbc.compare/
93•todsacerdoti•5h ago•39 comments

Deterministic multithreading is hard (2024)

https://www.factorio.com/blog/post/fff-415
24•adtac•14h ago•3 comments

GNU Octave Meets JupyterLite: Compute Anywhere, Anytime

https://blog.jupyter.org/gnu-octave-meets-jupyterlite-compute-anywhere-anytime-8b033afbbcdc
96•bauta-steen•7h ago•17 comments

Could the XZ backdoor have been detected with better Git/Deb packaging practices?

https://optimizedbyotto.com/post/xz-backdoor-debian-git-detection/
55•ottoke•5h ago•43 comments

The working-class hero of Bletchley Park you didn't see in the movies

https://www.theguardian.com/world/2025/oct/12/move-over-alan-turing-meet-the-working-class-hero-o...
70•hansmayer•1w ago•18 comments

The Spilhaus Projection: A world map according to fish

https://southernwoodenboatsailing.com/news/the-spilhaus-projection-a-world-map-according-to-fish
81•zynovex•1w ago•10 comments

Comparing the power consumption of a 30 year old refrigerator to a new one

https://ounapuu.ee/posts/2025/10/14/fridge-power-consumption/
87•furkansahin•5d ago•126 comments

The Cancer Imaging Archive (TCIA)

https://www.cancerimagingarchive.net/
12•1970-01-01•6d ago•0 comments

Abandoned land drives dangerous heat in Houston, study finds

https://stories.tamu.edu/news/2025/10/07/abandoned-land-drives-dangerous-heat-in-houston-texas-am...
113•PaulHoule•9h ago•115 comments

The Trinary Dream Endures

https://www.robinsloan.com/lab/trinary-dream/
36•FromTheArchives•6h ago•48 comments

Show HN: Duck-UI – Browser-Based SQL IDE for DuckDB

https://demo.duckui.com
172•caioricciuti•12h ago•54 comments

The macOS LC_COLLATE hunt: Or why does sort order differently on macOS and Linux (2020)

https://blog.zhimingwang.org/macos-lc_collate-hunt
68•g0xA52A2A•10h ago•15 comments

Infisical (YC W23) Is Hiring Full Stack Engineers

https://www.ycombinator.com/companies/infisical/jobs/0gY2Da1-full-stack-engineer-global
1•vmatsiiako•6h ago

Show HN: Pyversity – Fast Result Diversification for Retrieval and RAG

https://github.com/Pringled/pyversity
64•Tananon•9h ago•6 comments

How to Assemble an Electric Heating Element from Scratch

https://solar.lowtechmagazine.com/2025/10/how-to-build-an-electric-heating-element-from-scratch/
81•surprisetalk•10h ago•51 comments

Redis Backplane for Hubots

https://github.com/hubot-friends/hubot-redis-backplane
8•gijoeyguerra•5d ago•3 comments

Ask HN: What are people doing to get off of VMware?

103•jwithington•6h ago•73 comments

Enchanting Imposters

https://daily.jstor.org/enchanting-imposters/
3•Petiver•2d ago•0 comments

The case for the return of fine-tuning

https://welovesota.com/article/the-case-for-the-return-of-fine-tuning
127•nanark•13h ago•70 comments

Scheme Reports at Fifty

https://crumbles.blog/posts/2025-10-18-scheme-reports-at-fifty.html
48•djwatson24•8h ago•18 comments

Designing EventQL, an Event Query Language

https://docs.eventsourcingdb.io/blog/2025/10/20/designing-eventql-an-event-query-language/
7•goloroden•3h ago•1 comment

Improving PixelMelt's Kindle Web Deobfuscator

https://shkspr.mobi/blog/2025/10/improving-pixelmelts-kindle-web-deobfuscator/
85•ColinWright•11h ago•14 comments

Xubuntu.org Might Be Compromised

https://old.reddit.com/r/Ubuntu/comments/1oa4549/xubuntuorg_might_be_compromised/
301•kekqqq•9h ago•126 comments

Show HN: Open-Source Voice AI Badge Powered by ESP32+WebRTC

https://github.com/VapiAI/vapicon-2025-hardware-workshop
40•Sean-Der•1w ago•4 comments

“Streaming vs. Batch” Is a Wrong Dichotomy, and I Think It's Confusing

https://www.morling.dev/blog/streaming-vs-batch-wrong-dichotomy/
70•ingve•5mo ago

Comments

pestatije•5mo ago
streaming should be used exclusively for live data...if you keep things that way everything falls smoothly in place
aeonik•5mo ago
This gets messy though because the deeper you dig the more the word "live" loses meaning. Is "live" data something emitted within milliseconds? Seconds? Is it "live" if it’s replayed from a buffer with minimal delay? Real-time systems aren’t always real-time. Some "live" streams are just batched updates in disguise. You blink and suddenly you’re in temporal quantum soup.

And once you push the boundaries—high-frequency trading, deep space comms, even global-scale latency—you run into the brick wall of physics. At certain speeds and distances, simultaneity stops being objective. Two observers won’t agree on what "just happened." Causality gets slippery. Streams bifurcate.

At that point, "live" isn’t just fuzzy—it’s frame-dependent. You’re not streaming reality anymore. You’re curating a perspective.

brudgers•5mo ago
Streams have unknown size and may be infinite.

Batches have a known size and are not infinite.

fjdjshsh•5mo ago
Maybe I'm using the wrong definitions, but I think that's backwards.

Say you are receiving records from users at different intervals and you want to eventually store them in a different format on a database.

Streaming to me means you're "pushing" to the database according to some rule. For example, wait and accumulate 10 records to push. This could happen in 1 minute or in 10 hours. You know the size of the dataset (exactly 10 records). (You could also add some max time too and then you'd be combining batching with streaming)

Batching to me means you're pulling from the database. For example, you pull once every hour. In that hour, you get 0 records or 1000 records. You don't know the size and it's potentially infinite
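The two rules described above can be sketched as follows (a toy illustration with made-up names, not code from any real system):

```python
import time

class StreamPusher:
    """Push: accumulate records and flush every N, however long that takes."""
    def __init__(self, sink, batch_size=10):
        self.sink = sink              # callable that receives a list of records
        self.batch_size = batch_size
        self.buffer = []

    def on_record(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.sink(self.buffer)    # always exactly batch_size records
            self.buffer = []

def pull_batch(fetch_since, last_pull_ts):
    """Pull: grab whatever arrived since the last pull -- 0 or 1000 records."""
    records = fetch_since(last_pull_ts)
    return records, time.time()
```

Note how the pusher always knows the size of what it sends, while the puller only finds out after asking, which is exactly the asymmetry being debated here.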

numbsafari•5mo ago
I work with batch oriented store and forward systems and they definitely push data in batches.
setr•5mo ago
It’s because you’re looking at it from opposing ends.

From the perspective of the data source, in a streaming context, the size is finite — it’s whatever you’re sending. From the data sink’s perspective, it’s unknown how many records are going to get sent in total.

Vice versa, in a batch context, the data source has no idea how many records will eventually be requested, but the data sink knows exactly the size of the request.

That is, whoever is initiating the job knows what’s up, and whoever is targeted just has to deal with it.

But generally I believe the norm is to discuss from the sink's perspective, because the main interesting problem is when the sink has to deal with infinity (streaming). When the source deals with infinity (batch), it's fairly straightforward to manage — refuse requests of too large a size and move on. The data isn't going anywhere, so the sink can fix itself and re-request. You do that with streaming and data starts getting lost

brudgers•5mo ago
In part I think that is because the sink can run out of memory, whereas the source has already allocated enough memory.
oatmeal_croc•5mo ago
Streams -> optimized for latency

Batches -> optimized for efficiency

alok-g•5mo ago
Provided of course that both cannot be achieved together, e.g., a low-latency solution is not itself high efficiency for the XYZ reasons specified.
gregw2•5mo ago
If streaming is 5x as expensive as batch, that might be a factor worth considering.
iLoveOncall•5mo ago
It's quite amazing how none of the commenters have bothered to read the article, and are commenting about something completely unrelated to its title.

The article rightfully says that it's not a question of streaming OR batching, because you can stream batches.

stefan_•5mo ago
Because the author is so lost in the peculiarities of the systems he happens to work with that he is redefining the terms (see all the discussion around "push" and "pull"). That's gonna run into this problem.

His problem is one of data transfer, and a better fit for what he's looking for is probably "polling" versus "interrupt" driven.

gugagore•5mo ago
Interrupts are a hardware feature on CPUs. You could have software that is effectively checking for events on each tick (clock cycle), and emulates interrupts. But that's what polling is.
smilliken•5mo ago
The operating system provides abstractions for blocking and asynchronous IO, which are the higher abstraction version of the same concept.
floating-io•5mo ago
From skimming the article, it seems that this is a munging of the terms in directions that just aren't meaningful.

I've had the following view from the beginning:

- Batches are groups of data with a finite size, delivered at whatever interval you desire (this can be seconds, minutes, hours, days, or years between batches).

- Streaming is when you deliver the data "live", meaning immediately upon generation of that data. There is no defined start or end. There is no buffering or grouping at the transmitter of that data. It's constant. What you do with that data after you receive it (buffering, batching it up, ...) is irrelevant.

JMHO.

bjornsing•5mo ago
The lines blur though when you start keeping state between batches, and a lot of batch processing ends up requiring that (joins, deduplication, etc).
floating-io•5mo ago
No, it really doesn't. The definition of "streaming", to me, can be boiled down to "you send individual data as soon as it's available, without collecting into groups."

Batching is, by definition, the gathering of data records into a collection before you send it. Streaming does not do that, which is the entire point. What happens after transmission occurs, on reception, is entirely irrelevant to whether the data transfer mode is "streaming."

leni536•5mo ago
Most streaming does some batching. If you stream audio from a live source, you batch at least into "frames", and you batch into network packets. On top of that you might batch further depending on your requirements, yet I would still count most of it as "streaming".
floating-io•5mo ago
Only if you ignore that streaming streams data in records. The creation of a record (or struct, or whatever term you want to use) is not "batching". Otherwise any 32-bit word is nothing more than a batch of four bytes, and the entire distinction instantly becomes meaningless.

An audio stream can easily be defined as a series of records, where each record is a sample spanning N seconds, probably as provided by the hardware. Similarly, a video frame can also be considered a record. As soon as a record becomes available, it is sent. Thus, streaming.

Optimizing to fully utilize network frames can generally be considered a low level transport optimization, and thus not relevant to the discussion.
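The record-vs-batch distinction drawn above can be made concrete with a short sketch (illustrative only; the function names are mine, not from the thread):

```python
def stream_records(source):
    """Streaming: forward each record the moment the source yields it.
    No grouping happens here; any packetization is a transport detail."""
    for record in source:
        yield record

def batch_records(source, size):
    """Batching: collect records into fixed-size groups before sending."""
    group = []
    for record in source:
        group.append(record)
        if len(group) == size:
            yield group
            group = []
    if group:            # flush the final partial group
        yield group
```

Under this framing, an audio "frame" is just the record type of the stream, while `batch_records` is the thing floating-io says streaming explicitly does not do.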

bjornsing•5mo ago
Isn’t that pretty much exactly what the OP is saying? He just calls it ”push” and ”pull” instead. Different words, same concepts.
davery22•5mo ago
I think the article was getting at this at the end - different use cases naturally call for either a point-in-time snapshot (optimally serviced by pull) or a live-updating view (optimally serviced by push). If I am gauging the health of a system, I'll probably want a live view. If I am comparing historical financial reports, snapshot. Note that these are both "read-only" use cases. If I am preparing updates to a dataset, I may well want to work off a snapshot (and when it comes time to commit the changes, compare-and-swap if possible, else pull the latest snapshot and reconcile conflicts). If I am adjusting my trades for market changes, live view again.

If I try to service a snapshot with a push system, I'll have to either buffer an unbounded number of events, discard events, or back-pressure up the source to prevent events from being created. And with push alone, my snapshot would still only be ephemeral; once I open the floodgates and start processing more events, the snapshot is gone.

If I try to service a live view with a pull system, I'll have to either pull infrequently and sacrifice freshness, or pull more frequently and waste time and bandwidth reprocessing unchanged data. And with pull alone, I would still only be chasing freshness; every halving of refresh interval doubles the resource cost, until the system can't keep up.

The complicating real-world factor that this article alludes to is that, historically, push systems lacked the expressiveness to model complex data transformations. (And to be fair, they're up against physical limitations: Many transformations simply require storing the full intermediate dataset in order to compute an incremental update.) So the solution was to either switch wholesale to pull at some point in the pipeline (and try to use caching, change detection, etc to reduce the resource cost and enable more frequent pulling), or, introduce a pulling segment in the pipeline ("windowing" joins, aggregations, etc) and switch back to push after.

It's pretty recent that push systems are attempting to match the expressiveness of pull systems (e.g. Materialize, Readyset), but people are still so used to assuming pull-based compromises, asking questions like "How fresh does this data feed really _need_ to be?". It's analogous to asking "How long does this snapshot really _need_ to last?" - a relevant question to be sure, but maybe doesn't need to be the basis for massive architectural lifts.
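The refresh-interval arithmetic in the comment above (halving the interval doubles the cost) can be sketched with a hypothetical cost model:

```python
# Illustrative cost model for servicing a live view by polling:
# each pull reprocesses the whole dataset, so the hourly cost
# scales inversely with the refresh interval.
def polling_cost_per_hour(cost_per_pull, refresh_interval_s):
    pulls_per_hour = 3600 / refresh_interval_s
    return cost_per_pull * pulls_per_hour

# Halving the refresh interval doubles the resource cost:
assert polling_cost_per_hour(1.0, 30) == 2 * polling_cost_per_hour(1.0, 60)
```

This is the compromise that caching and change detection try to soften, and that incremental push systems try to remove entirely.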

efitz•5mo ago
Batch processes IRL tend to “fetch” data, which is nothing like streaming.

For example, I’ve worked with “batch” systems that periodically go do fetches from databases or S3 buckets and then do lots of crunching, before storing the results.

Sometimes batch systems have separate fetchers and only operate vs a local store; they’re still batch.

Streaming systems may have local aggregation or clumping in the arriving information; that doesn’t make it a “batch” system. Likewise streaming systems may process more than one work item simultaneously; still not a “batch”.

I associate “batch” more with “schedule” or “periodic” and “fetch”; I associate “stream” with “continuous” and “receiver”.

simlevesque•5mo ago
Isn't everything batched? I've built live streaming video and IoT, and it's batches all the way down.
fragmede•5mo ago
technically yes, at the lowest levels (polled interrupts anybody) but there's a material difference (or not, as this blog argues) depending on how they're processed. At one end of the spectrum you have bank records being reconciled at the end of the day. At the other extreme, reading individual chunks of video data off disk, not saving it, and chucking it into the Internet via udp as fast as the client can handle, but could be dropped on the floor as necessary; that doesn't really require the same kind of assurances as a day's worth of bank records.
mannyv•5mo ago
'Pull' and 'push' make even less sense than 'stream' and 'batch.'

In the old days batch was not realtime and took a while. Imagine printing bank statements, or calculating interest on your accounts at the end of the day. You literally process them all later.

Streaming is processing the records as they arrive, continuously.

IRL you can stream then batch...but normally batch runs at a specific time and chows everything.

binoct•5mo ago
The comments here are really interesting to read since there are so many strongly stated, different definitions. It's obvious "streaming" and "batch" have different implications and even meanings in different contexts. Depending on the type of work being done and the system it's being done with, batch and streaming can be interpreted differently, so it feels like really a semantic argument going on, lacking specificity. It's important to have common and clear terminology, and across the industry these words (like so many in computer science) are not always as clear as we might assume. Part of what makes naming things so difficult.

It does seem to me that push vs pull are slightly more standardized in usage, which might be what the author is getting at. But even then depending on what level of abstraction in the system you are concerned with the concepts can flip.

briankelly•5mo ago
> Often times, "Stream vs. Batch" is discussed as if it’s one or the other, but to me this does not make that much sense really.

Just seems like a flawed premise to me since lambda architecture is the context in which streaming for data processing is frequently introduced. The batch vs stream discussion is more about the implementation side - tools or techniques best used for one aren’t best suited for the other since batch processing is usually optimized for throughput and streaming is usually optimized for latency. For example vectorization is useful for the former and code generation is useful for the latter.

calrain•5mo ago
For me, a key difference is:

- 'Streaming' means the consumer determines server utilization rates

- 'Batch' means the server determines server utilization rates

I much prefer Batch as the processing can be performed when the server has appropriate resources, helping products run on lower spec servers.

fifilura•5mo ago
"Try it yourself" "very quickly wanted to get real-time streaming for more"

My experience is the opposite.

You think you need streaming, so you "try it out" and build something incredibly complex with Kafka, that needs 24h maintenance to monitor congestion in every pipeline.

And 10x more expensive because your servers are always up.

And some clever (expensive) engineers that figure out how watermarks, out of orderness and streaming joins really work and how you can implement them in a parallel way without SQL.

And of course a renovate bot to upgrade your fancy (but half baked) framework (flink) to the latest version.

And you want to tune your logic? Luckily that last 3 hours of data is stored in Kafka so all you have to do is reset all consumer offsets, clean your pipelines and restart your job and the in data will hopefully be almost the same as last time you run it. (Compared to changing a parameter and re-running that SQL query).

When all your business case really needed was a monthly report. And that you can achieve with pub/sub and an SQL query.

In my experience the need for live data rarely comes from a business case, but for a want to see your data live.

And if it indeed comes from a business case, you are still better off prototyping with something simple and see if it really flies before you "try it out".

monksy•5mo ago
Kafka integrates with AWS Lambdas very easily
franktankbank•5mo ago
I'm curious if its web scale.
fifilura•5mo ago
You can't do groupbys or joins with lambdas.

You can only really take one garden gnome, put some paint on it and forward the same gnome.

perrygeo•5mo ago
I think I get the analogy, something like: you can append to the record but everything is still record-based? And what do garden gnomes have to do with it :-)
fifilura•5mo ago
Yeah, a garden gnome is an element in a garden gnome factory. Just an example.
layer8•5mo ago
Is that a pro or a con? ;)
gopher_space•5mo ago
Or, "How I saved millions a year by introducing one 500ms animation".
cgio•5mo ago
I know many push batch systems, e.g. all the csv type pushed onto s3 and processed in an event based pipeline. Even for non event based, the fact that I schedule a batch does not make a pipeline pull. Pull is when I control the timing AND query. In my view the dichotomy stream vs batch is meaningful. The fact that there are also combinations where a stream is supported by batch does not invalidate the differences.
10000truths•5mo ago
"latency" and "throughput" are only mentioned in passing in the article, but that is really the crux of the whole "streaming vs. batch" thing. You can implement a stream-like thing with small, frequent batches of data, and you can implement a batch-like thing with a large-buffered stream that is infrequently flushed. What matters is how much you prioritize latency over throughput, or vice versa. More importantly, this can be quantified - multiply latency and throughput, and you get buffer/batch size. Congratulations, you've stumbled across Little's Law, one of the fundamental tenets of queuing theory!
sxv•5mo ago
Further reading: https://en.wikipedia.org/wiki/Double-slit_experiment
alfiedotwtf•5mo ago
With the absence of stream ciphers vs. block ciphers, I'm guessing the cryptographers are biting their tongues in the comments
kazinator•5mo ago
The opposite of "batch" is "interactive".

A classic "batch job" is one that can be executed without input from a keyboard or output to a display, and therefore can be queued in a batch with other such jobs (perhaps from other programmers).

There is a connection with scripting; batch job control was done with command languages. This is where DOS/Windows "batch files" get their name, and the .BAT suffix.

Grouping transmitted items together (such as bytes into a datagram) is better called aggregation, not to confuse it with "batch job" batching.

Nearly all streaming uses aggregation, other than at the lowest data link and physical layers.

matjazk•5mo ago
This is basically a discussion of the advantages of lambda architecture (batch+streaming) for data processing as opposed to kappa (pure streaming). What the author neglects to mention is that in the first case you have to maintain two data processing pipelines, whereas in the second case batch data is treated as a special case of streaming data.