I currently use ConvNeXt for image classification at a size of 4096x2048 (which definitely counts as "high-resolution"). For my use case, it would never be practical to use ViTs for this. I can't downscale the resolution because extremely fine details need to be preserved.
I don't think LeCun's comment was a "knee-jerk reaction" as the article claims.
You don't remotely need RL for this use case. Image resolution pyramids are pretty standard, though, and handling them well/efficiently is the big thing. Using RL for this would be like trying to use graphene to make a computer screen because it's new and flashy and everyone's talking about it. RL is inherently very sample-inefficient; it's there to approximate a learning signal when you don't have well-defined, informative ones, which we do have in computer vision in spades. Cross-entropy losses (and the like) are (generally, IME/IMO) what RL losses try to approximate, only on a much larger (and more poorly-defined) scale.
Please mark speculation as such -- I've seen people read confident statements like this and sink a lot of time/man-hours into them (because they seem plausible). It's not a bad idea from a creativity standpoint, but practically it is most certainly not the way to go about it.
(That being said, you can try for dynamic sparsity stuff, it has some painful tradeoffs that generally don't scale but no way in Illinois do you need RL for that)
Modern SSD/YOLO-style detectors use efficient feature pyramids; you need them to propose where things are in the image.
This sounds a lot like going back to old-school object detection techniques, which generally end up being very compute-inefficient.
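For concreteness, here is a minimal sketch (purely illustrative, not from the comment) of the feature-pyramid fusion mentioned above, using torchvision's FeaturePyramidNetwork; the channel counts and level names are invented.
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork
# Three backbone feature maps at strides 8/16/32 (shapes chosen purely for illustration)
feats = OrderedDict([
    ("p3", torch.randn(1, 64, 64, 64)),
    ("p4", torch.randn(1, 128, 32, 32)),
    ("p5", torch.randn(1, 256, 16, 16))])
# Fuse the levels to a common channel width; a detection head then attaches to each level,
# so small objects get proposed on the higher-resolution levels
fpn = FeaturePyramidNetwork(in_channels_list=[64, 128, 256], out_channels=256)
outputs = fpn(feats)
print({name: tuple(t.shape) for name, t in outputs.items()})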
This means that you can split your image into tiles, process each tile individually, average the results, apply a final classification layer to the average and get exactly the same result. For reference, see the demonstration below.
You could of course do exactly the same thing with a vision transformer instead of a convolutional neural network.
That being said, architecture is wildly overemphasized in my opinion. Data is everything.
import torch, torchvision.models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.convnext_small()
model.to(device)
model.eval()  # disable stochastic depth so both code paths are deterministic and comparable
tile_size, image_size = 32, 224 # note that 32 divides 224 evenly
image = torch.randn((1, 3, image_size, image_size), device=device)
# Process image as usual
x_expected = model(image)
# Process image as tiles (using for-loops for educational purposes; should use .view and .permute instead for performance)
features = [
    model.features(image[:, :, y:y + tile_size, x:x + tile_size])
    for y in range(0, image_size, tile_size)
    for x in range(0, image_size, tile_size)]
x = model.classifier(sum(features) / len(features))
print(f"Mean squared error: {(x - x_expected).pow(2).mean().item():.20f}")
Also, for classification, MaxPooling is often far superior; you can learn an average smoothing filter in your convolutions beforehand in a data-dependent manner so that Nyquist sampling stuff is properly preserved.
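As a rough illustration of the suggested swap (a sketch only; the tensor shape mirrors the ConvNeXt feature map discussed later in the thread):
import torch
import torch.nn as nn
features = torch.randn(1, 768, 7, 7)  # final feature map before the classifier head
avg_vec = nn.AdaptiveAvgPool2d(1)(features)  # (1, 768, 1, 1), the default global average pooling
max_vec = nn.AdaptiveMaxPool2d(1)(features)  # (1, 768, 1, 1), the max-pooled alternative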
Also, please use label-smoothed cross-entropy for image classification stuff (generally speaking, unless maybe your data is hilariously large); MSE won't nearly cut it!
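In PyTorch that's a one-liner; the 0.1 smoothing value below is just a common default, not something from the comment.
import torch
import torch.nn as nn
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label-smoothed cross-entropy
logits = torch.randn(8, 1000)  # batch of 8 samples, 1000 classes
targets = torch.randint(0, 1000, (8,))  # integer class labels
loss = criterion(logits, targets)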
But that being said, adaptive stuff certainly is great when doing classification. Something to note is that batching does become an issue at a certain point, as do certain other fine-grained details if you're simply going to average it all down to one single vector (IIUC).
Of course. The MSE here is not intended as a training loss, but as a means to demonstrate that both approaches lead to almost the same result, up to some rounding error. The MSE is on the order of 10^-9.
> Also, for classification, MaxPooling is often far superior; you can learn an average smoothing filter in your convolutions beforehand in a data-dependent manner so that Nyquist sampling stuff is properly preserved.
I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98% of the gradients and training would take much longer. (The feature map going into the pooling has shape (1, 768, 7, 7), pooled to (1, 768, 1, 1).)
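A quick way to see where that ~98% figure comes from (a throwaway check, not part of any training code): max pooling a 7x7 map routes the gradient into only 1 of the 49 spatial positions per channel.
import torch
import torch.nn as nn
feat = torch.randn(1, 768, 7, 7, requires_grad=True)
nn.AdaptiveMaxPool2d(1)(feat).sum().backward()
print((feat.grad != 0).float().mean().item())  # ~1/49 = 0.02, i.e. ~98% of positions get zero gradient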
> Something to note is that batching does become an issue at a certain point
Could you elaborate on that?
Ah, gotcha
> I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98% of the gradients and training would take much longer. (The feature map going into the pooling has shape (1, 768, 7, 7), pooled to (1, 768, 1, 1).)
MaxPooling is generally only useful if you're training your network with it, but in most cases it ends up performing better. That sparsity actually ends up being a good thing -- you generally need to suppress all of those unused activations! It ends up being quite a wide gap in practice (and, if you have convolutions beforehand, using AvgPool2d is a bit of wasted extra computation blurring the input).
> Could you elaborate on that?
Variable-sized inputs don't batch easily, since the input dims need to match. You can go down the padding route, but that has its own particularly hellacious costs that end up taking compute away from other useful things.
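A bare-bones sketch of that padding route (shapes invented for illustration), showing where the wasted compute comes from:
import torch
import torch.nn.functional as F
images = [torch.randn(3, 480, 640), torch.randn(3, 720, 1280), torch.randn(3, 1080, 1920)]
max_h = max(img.shape[1] for img in images)
max_w = max(img.shape[2] for img in images)
# Pad every image on the right/bottom up to the largest size so the tensors stack into one batch
batch = torch.stack([F.pad(img, (0, max_w - img.shape[2], 0, max_h - img.shape[1])) for img in images])
print(batch.shape)  # (3, 3, 1080, 1920): the smaller images now carry large zero-padded regions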
Yes, you do. Also, 1024x1024 is not high resolution.
An example is segmenting basic 1920x1080 (FHD) video at 60 Hz.
That said, in my experience beginners often overestimate how much image resolution is needed for a given task, for some reason. I often find myself asking them to retry their experiments at a lower resolution. There's a surprising amount of information in 128x128 or even smaller images.
I have the distinct memory of firing a rocket at him from far away because I thought that one pixel had the wrong color, and killing him to his great frustration. Good times.
You can play the shareware portion of the game here[3] to get an idea.
[1]: https://en.wikipedia.org/wiki/Rise_of_the_Triad
GaggiX•2mo ago
Not in the graph you provided as an example.
yorwba•2mo ago
"Note that I chose an unusually long chart to exemplify an extreme case of aspect ratio stretching. Still, 512px² is enough.
This is two_col_40643 from ChartQA validation set. Original resolution: 800x1556."
But yeah, ultimately which resolution you need depends on the image content, and if you need to squeeze out every bit of accuracy, processing at the original resolution is unavoidable.