
End of Japanese community

https://support.mozilla.org/en-US/forums/contributors/717446
263•phantomathkg•3h ago•156 comments

Solarpunk is happening in Africa

https://climatedrift.substack.com/p/why-solarpunk-is-already-happening
661•JoiDegn•9h ago•319 comments

Ratatui – App Showcase

https://ratatui.rs/showcase/apps/
104•AbuAssar•2h ago•43 comments

Dillo, a multi-platform graphical web browser

https://github.com/dillo-browser/dillo
272•nazgulsenpai•11h ago•94 comments

ChatGPT terms disallow its use in providing legal and medical advice to others

https://www.ctvnews.ca/sci-tech/article/openai-updates-policies-so-chatgpt-wont-provide-medical-o...
263•randycupertino•11h ago•250 comments

Recursive macros in C, demystified (once the ugly crying stops)

https://h4x0r.org/big-mac-ro-attack/
50•eatonphil•4h ago•29 comments

I may have found a way to spot U.S. at-sea strikes before they're announced

https://old.reddit.com/r/OSINT/comments/1opjjyv/i_may_have_found_a_way_to_spot_us_atsea_strikes/
50•hentrep•1h ago•22 comments

Firefox profiles: Private, focused spaces for all the ways you browse

https://blog.mozilla.org/en/firefox/profile-management/
181•darkwater•1w ago•82 comments

The state of SIMD in Rust in 2025

https://shnatsel.medium.com/the-state-of-simd-in-rust-in-2025-32c263e5f53d
177•ashvardanian•10h ago•94 comments

Why aren't smart people happier?

https://www.theseedsofscience.pub/p/why-arent-smart-people-happier
255•zdw•13h ago•354 comments

New gel restores dental enamel and could revolutionise tooth repair

https://www.nottingham.ac.uk/news/new-gel-restores-dental-enamel-and-could-revolutionise-tooth-re...
409•CGMthrowaway•10h ago•167 comments

Ruby and Its Neighbors: Smalltalk

https://noelrappin.com/blog/2025/11/ruby-and-its-neighbors-smalltalk/
179•jrochkind1•14h ago•100 comments

NY school phone ban has made lunch loud again

https://gothamist.com/news/ny-smartphone-ban-has-made-lunch-loud-again
257•hrldcpr•16h ago•185 comments

Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer

https://AmitZalcher.github.io/Brain-IT/
17•SerCe•2h ago•0 comments

The MDL ("Muddle") Programming Language (1979) [pdf]

http://bitsavers.informatik.uni-stuttgart.de/pdf/mit/lcs/tr/MIT-LCS-TR-0293_MDL_Pgmg_Lang.pdf
7•twoodfin•6d ago•1 comment

The Basic Laws of Human Stupidity (1987) [pdf]

https://gandalf.fee.urv.cat/professors/AntonioQuesada/Curs1920/Cipolla_laws.pdf
58•bookofjoe•6h ago•24 comments

Carice TC2 – A non-digital electric car

https://www.caricecars.com/
201•RubenvanE•15h ago•154 comments

Scientists Growing Colour Without Chemicals

https://www.forbes.com/sites/maevecampbell/2025/06/20/dyeing-for-fashion-meet-the-scientists-grow...
4•caiobegotti•4d ago•0 comments

Vacuum bricked after user blocks data collection – user mods it to run anyway

https://www.tomshardware.com/tech-industry/big-tech/manufacturer-issues-remote-kill-command-to-nu...
212•toomanyrichies•4d ago•66 comments

The shadows lurking in the equations

https://gods.art/articles/equation_shadows.html
264•calebm•15h ago•83 comments

I want a good parallel language [video]

https://www.youtube.com/watch?v=0-eViUyPwso
62•raphlinus•1d ago•36 comments

Radiant Computer

https://radiant.computer
187•beardicus•16h ago•134 comments

I was right about dishwasher pods and now I can prove it [video]

https://www.youtube.com/watch?v=DAX2_mPr9W8
354•hnaccount_rng•1d ago•241 comments

A Lost IBM PC/AT Model? Analyzing a Newfound Old BIOS

https://int10h.org/blog/2025/11/lost-ibm-at-model-bios-analysis/
73•TMWNN•9h ago•13 comments

An eBPF Loophole: Using XDP for Egress Traffic

https://loopholelabs.io/blog/xdp-for-egress-traffic
216•loopholelabs•1d ago•69 comments

Gloomth

https://www.lrb.co.uk/the-paper/v47/n20/jon-day/gloomth
7•prismatic•6d ago•0 comments

App Store web has exposed all its source code

https://www.reddit.com/r/webdev/comments/1onnzlj/app_store_web_has_exposed_all_its_source_code/
205•redbell•2d ago•68 comments

Absurd Workflows: Durable Execution with Just Postgres

https://lucumr.pocoo.org/2025/11/3/absurd-workflows/
114•ingve•2d ago•23 comments

Timing Wheels

https://pncnmnp.github.io/blogs/timing-wheels.html
42•pncnmnp•4d ago•1 comment

SPy: An interpreter and compiler for a fast statically typed variant of Python

https://antocuni.eu/2025/10/29/inside-spy-part-1-motivations-and-goals/
246•og_kalu•6d ago•110 comments

The Speed of ViTs and CNNs

https://lucasb.eyer.be/articles/vit_cnn_speed.html
74•jxmorris12•6mo ago

Comments

GaggiX•6mo ago
> text in photos, phone screens, diagrams and charts, 448px² is enough

Not in the graph you provided as an example.

yorwba•6mo ago
It has this note at the bottom:

"Note that I chose an unusually long chart to exemplify an extreme case of aspect ratio stretching. Still, 512px² is enough.

This is two_col_40643 from ChartQA validation set. Original resolution: 800x1556."

But yeah, ultimately which resolution you need depends on the image content, and if you need to squeeze out every bit of accuracy, processing at the original resolution is unavoidable.
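
For illustration, the kind of aspect-stretching resize being discussed looks like this (a minimal sketch assuming PyTorch/torchvision; the 800x1556 size is the ChartQA chart mentioned above):

    import torch
    import torchvision.transforms.functional as TF

    # Stand-in for the 800x1556 (WxH) ChartQA chart; tensors are (C, H, W).
    chart = torch.rand(3, 1556, 800)

    # Aspect-stretching resize to a fixed square.
    square = TF.resize(chart, [512, 512], antialias=True)
    print(square.shape)  # torch.Size([3, 512, 512])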

zamadatix•6mo ago
It's enough, especially if you select one of the sharper resampling options like Lanczos, but 512px² is sure a lot easier for a human.

ninamoss•6mo ago
Really appreciated the post, very insightful. We also use ViTs for some of our models and find that between model compilation and hyperparameter tuning we are able to get sub-second evaluation of images on commodity hardware while maintaining high precision and recall.

John7878781•6mo ago
In the Twitter thread the article mentions, LeCun makes his claim only for "high-resolution" images, and the article assumes 1024x1024 falls under that category. To me, 1024x1024 is not "high-resolution." That assumption is flawed, imo.

I currently use ConvNeXt for image classification at a size of 4096x2048 (which definitely counts as "high-resolution"). For my use case, it would never be practical to use ViTs for this. I can't downscale the resolution because extremely fine details need to be preserved.

I don't think LeCun's comment was a "knee-jerk reaction" as the article claims.

hedgehog•6mo ago
LeCun's technical assessments have been borne out over many years. The likely next step in scaling vision transformers is to treat the image as a MIP pyramid and use the transformer to adaptively sample out of that. It requires RL to train (tricky), but it would decouple compute footprint from input size.
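
To be concrete about the pyramid half of that (the adaptive sampling policy is the speculative part), a minimal sketch of a MIP-style image pyramid, assuming PyTorch:

    import torch
    import torch.nn.functional as F

    def image_pyramid(x, levels=4):
        # Successively halve the image, mipmap style: full res, 1/2, 1/4, ...
        pyramid = [x]
        for _ in range(levels - 1):
            x = F.avg_pool2d(x, kernel_size=2)
            pyramid.append(x)
        return pyramid

    pyramid = image_pyramid(torch.randn(1, 3, 1024, 1024))
    print([tuple(t.shape[-2:]) for t in pyramid])
    # [(1024, 1024), (512, 512), (256, 256), (128, 128)]
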
tbalsam•6mo ago
As someone who has worked in computer vision ML for nearly a decade, this sounds like a terrible idea.

You don't remotely need RL for this use case. Image-resolution pyramids are pretty normal, though, and handling them well/efficiently is the big thing. Using RL for this would be like using graphene to make a computer screen because it's new and flashy and everyone's talking about it. RL is inherently very sample-inefficient; it exists to approximate a learning signal when you don't have well-defined, informative components, which we do have in computer vision in spades. Crossentropy losses (and the like) are (generally, IME/IMO) what RL losses try to approximate, only on a much larger (and more poorly-defined) scale.

Please mark speculation as such -- I've seen people read confident statements like this and spend a lot of time/man-hours on them (because they seem plausible). It's not a bad idea from a creativity standpoint, but practically it is most certainly not the way to go about it.

(That being said, you can try dynamic sparsity stuff; it has some painful tradeoffs that generally don't scale, but no way in Illinois do you need RL for that.)

hedgehog•6mo ago
SPECULATION ALERT! I think there's reasonable motivation though. In the last few years there has been a steady drip of papers in the general area, at least insofar as they use vision transformers and image pyramids, and work on applying RL to object detection goes back before that. IoU and the general way SSD and YOLO descendants are set up is kind of wacky, so I don't think it's much of a stretch to try to both 1) avoid attending to or materializing most of the pyramid, and 2) go directly to feature proposals without worrying about box anchors or grid cells or any of that. Now, with that context, if you still think it's a terrible idea, well, you're probably more current than I am.

tbalsam•6mo ago
Not bad frustrations at all. That said -- IoU is how the final box scores are calculated; it doesn't change how you do feature aggregation, which will happen in basically any technique you use.

Modern SSD/YOLO-style detectors use efficient feature pyramids; you need them to know where to propose detections in the image.

This sounds a lot like going back to old-school object detection techniques, which end up being less efficient in general, and usually very compute-inefficient.

dimatura•6mo ago
There's been a huge amount of work on image transformers since the original ViT. A lot of it has explored different schemes to slice the image into tokens, and I've definitely seen some using a multiresolution pyramid. Not sure about the RL part -- after all, the upper, low-res levels of the pyramid add fewer tokens than the high-res base level, so it doesn't seem that necessary. But given the sheer volume of work out there, I can bet someone has already explored this idea or something pretty close to it.

djoldman•6mo ago
Interesting. Can you run your images through a segmentation model first and then classify only the interesting boxes?

lairv•6mo ago
Curious what kind of classification problem requires full 4096x2048 images; couldn't you feed multiple 512x512 overlapping crops instead?
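
Something like this, say (a minimal sketch assuming PyTorch; the crop size and stride are arbitrary):

    import torch

    image = torch.randn(1, 3, 2048, 4096)  # (N, C, H, W)
    size, stride = 512, 256  # 50% overlap between neighboring crops

    # unfold twice to get a grid of overlapping crops, then flatten the grid
    crops = image.unfold(2, size, stride).unfold(3, size, stride)
    crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, size, size)
    print(crops.shape)  # torch.Size([105, 3, 512, 512]), a 7x15 grid
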
threeducks•6mo ago
ConvNeXt's architecture contains an AdaptiveAvgPool2d layer: https://github.com/pytorch/vision/blob/5f03dc524bdb7529bb4f2...

This means that you can split your image into tiles, process each tile individually, average the results, apply the final classification layer to the average, and get exactly the same result. For reference, see the demonstration below.

You could of course do exactly the same thing with a vision transformer instead of a convolutional neural network.

That being said, architecture is wildly overemphasized in my opinion. Data is everything.

    import torch, torchvision.models

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torchvision.models.convnext_small()
    model.to(device)
    model.eval()  # disable stochastic depth; in train mode the two passes would not match
    tile_size, image_size = 32, 224  # note that 32 divides 224 evenly
    image = torch.randn((1, 3, image_size, image_size), device=device)

    with torch.no_grad():
        # Process image as usual
        x_expected = model(image)

        # Process image as tiles (using for-loops for educational purposes;
        # use .view and .permute instead for performance)
        features = [
            model.features(image[:, :, y:y + tile_size, x:x + tile_size])
            for y in range(0, image_size, tile_size)
            for x in range(0, image_size, tile_size)]
        x = model.classifier(sum(features) / len(features))

    print(f"Mean squared error: {(x - x_expected).pow(2).mean().item():.20f}")

tbalsam•6mo ago
As someone who's done a fair bit of architecture work -- both are important! Making it either/or is a very silly thing; each is the limiting factor for the other, and there are no two ways about it.

Also, for classification, MaxPooling is often far superior; you can learn an average smoothing filter in your convolutions beforehand in a data-dependent manner so that Nyquist sampling stuff is properly preserved.

Also, please do smoothed crossentropy for image class stuff (generally speaking, unless maybe data is hilariously large); MSE won't nearly cut it!

But that being said, adaptive stuff certainly is great when doing classification. Something to note is that batching does become an issue at a certain point -- as well as certain other fine-grained details if you're simply going to average it all down to one single vector (IIUC).
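
For what it's worth, trying the max-pooling variant in torchvision's ConvNeXt is a one-line swap (a minimal sketch; as noted above, it only pays off if you actually train the network with it):

    import torch.nn as nn
    import torchvision.models

    model = torchvision.models.convnext_small()
    # Replace the global average pool with a global max pool; the
    # (N, 768, 7, 7) feature map is still reduced to (N, 768, 1, 1).
    model.avgpool = nn.AdaptiveMaxPool2d(1)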

threeducks•6mo ago
> Also, please do smoothed crossentropy for image class stuff (generally speaking, unless maybe data is hilariously large); MSE won't nearly cut it!

Of course. The MSE here is not intended as a training loss, but as a means to demonstrate that both approaches lead to almost the same result except for some rounding error. The MSE is somewhere on the order of 10^-9.

> Also, for classification, MaxPooling is often far superior; you can learn an average smoothing filter in your convolutions beforehand in a data-dependent manner so that Nyquist sampling stuff is properly preserved.

I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98 % of the gradients and training would take much longer. (The shape of the input feature layer is (1, 768, 7, 7), pooled to (1, 768, 1, 1).)

> Something to note is that batching does become an issue at a certain point

Could you elaborate on that?

tbalsam•6mo ago
> The MSE here is not intended as a training loss, but as a means to demonstrate that both approaches lead to almost the same result except for some rounding error.

Ah, gotcha

> I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98 % of the gradients and training would take much longer. (The shape of the input feature layer is (1, 768, 7, 7), pooled to (1, 768, 1, 1).)

MaxPooling is generally only useful if you're training your network with it, but in most cases it ends up performing better. That sparsity actually ends up being a good thing -- you generally need to suppress all of those unused activations! It ends up being quite a wide gap in practice (and, if you have convolutions beforehand, using AvgPool2d is a bit of wasted extra computation blurring the input).

> Could you elaborate on that?

Variable-sized inputs don't batch easily, as the input dims need to match; you can go down the padding route, but that has its own particularly hellacious costs that end up taking away compute you could be using for other useful things.

dimatura•6mo ago
Slicing up images to analyze them is definitely something people do -- in many cases, such as satellite imagery, there is not much alternative. But it should be done mindfully, especially if there are differences between the training and testing steps. Depending on the architecture and the application, it's not the same as processing the whole image at once. Some differences are more or less obvious (for example, you might have border artifacts), but others are more subtle. For example, contrary to the expected positional equivariance of convolutional nets, they can implicitly encode positional information based on where they see border padding during training. And for some types of normalization, such as instance normalization, the statistics may vary significantly when computed over patches versus whole images.
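
The instance-normalization point is easy to see directly (a minimal sketch assuming PyTorch):

    import torch
    import torch.nn as nn

    norm = nn.InstanceNorm2d(3)
    image = torch.randn(1, 3, 256, 256)
    patch = image[:, :, :64, :64]

    # The same pixels come out differently depending on whether the
    # normalization statistics come from the whole image or just the patch.
    whole = norm(image)[:, :, :64, :64]
    alone = norm(patch)
    print((whole - alone).abs().max())  # clearly nonzero
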
kookamamie•6mo ago
> You don't need very high resolution

Yes, you do. Also, 1024x1024 is not high resolution.

An example is segmenting basic 1920x1080 (FHD) 60 Hz video.

CHY872•6mo ago
The article basically argues that you would expect to get similarly good results with subsampling in practice, e.g. there is no need to process at 1920x1080 when you can do 960x540. Separately, you can break many problems down into smaller tiles and get similar-quality results without the compute overhead of a high-res ViT.
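
In PyTorch terms the subsampling step is just this (a minimal sketch):

    import torch
    import torch.nn.functional as F

    frame = torch.randn(1, 3, 1080, 1920)  # a dummy FHD frame
    # Antialiased 2x downsample before feeding the model.
    small = F.interpolate(frame, size=(540, 960), mode="bilinear",
                          align_corners=False, antialias=True)
    print(small.shape)  # torch.Size([1, 3, 540, 960])
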
dimatura•6mo ago
Yeah, the article was painting with a bit too broad a stroke IMO, though they did briefly acknowledge "special exceptions" such as satellite or medical imagery. It's very application-dependent.

That said, in my experience beginners do often overestimate, for some reason, how much image resolution is needed for a given task. I often find myself asking them to retry their experiments at a lower resolution. There's a surprising amount of information in 128x128 or even smaller images.

magicalhippo•6mo ago
I have a vivid memory of playing Rise of the Triad[1] against my buddy over a serial cable. Like most PC games from back then, it used mode 13h[2], so 320x200 resolution with a 256-color palette.

I distinctly remember firing a rocket at him from far away because I thought one pixel had the wrong color, and killing him, to his great frustration. Good times.

You can play the shareware portion of the game here[3] to get an idea.

[1]: https://en.wikipedia.org/wiki/Rise_of_the_Triad

[2]: https://en.wikipedia.org/wiki/Mode_13h

[3]: https://www.dosgames.com/game/rise-of-the-triad/

jacobgorm•6mo ago
A nice feature of CNNs is that you can change the resolution at inference time without retraining -- for instance, when the user plugs in a camera with a different aspect ratio or changes the orientation of their phone from landscape to portrait. It is not clear to me whether ViTs can support aspect or resolution changes without any retraining.
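
For CNNs with a global-pooling head this is easy to check (a minimal sketch assuming torchvision's ConvNeXt, whose AdaptiveAvgPool2d makes the classifier independent of input size):

    import torch
    import torchvision.models

    model = torchvision.models.convnext_small().eval()
    with torch.no_grad():
        for h, w in [(224, 224), (480, 320), (540, 960)]:  # portrait, landscape, ...
            x = torch.randn(1, 3, h, w)
            print((h, w), tuple(model(x).shape))  # always (1, 1000)
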
lava_pidgeon•6mo ago
Can you elaborate? In my experience it is the opposite: CNNs are highly dependent on the input tensor shapes, so a resolution change can even require an architectural change, while a ViT can handle the extra tokens that come with a resolution change (for image classification, e.g., you always take the CLS token; segmentation maps and similar tasks have the same output shape as the input).