Gemini is really impressive at these kinds of object detection tasks.
https://www.sergey.fyi/articles/using-gemini-for-precise-cit...
Are you using that approach in production for grounding when PDFs don't include embedded text, like in the case of scanned documents? I did some experiments for that use case, and it wasn't really reaching the bar I was hoping for.
It really feels like we're maybe half a model generation away from this being a solved problem.
Given that the author is using the specific `box_2d` format, it suggests that he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.
It has been tuned so heavily on this specific format that even a tiny change, like switching the order in the `box_2d` format from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`, causes performance to tank.
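For anyone consuming that output: the format itself is simple once you know the `(ymin, xmin, ymax, xmax)` order and the 0-1000 normalization. A minimal parsing sketch, assuming the model returns a clean JSON list of `{"box_2d": ..., "label": ...}` objects (you may still need to strip markdown fences from the response first):

```python
import json

def parse_gemini_boxes(response_text: str, img_w: int, img_h: int) -> list[dict]:
    """Convert Gemini box_2d output into pixel-space (xmin, ymin, xmax, ymax) boxes."""
    # The model returns something like:
    #   [{"box_2d": [ymin, xmin, ymax, xmax], "label": "car"}, ...]
    # with every coordinate normalized to 0-1000, regardless of image size.
    boxes = []
    for item in json.loads(response_text):
        ymin, xmin, ymax, xmax = item["box_2d"]
        boxes.append({
            "label": item.get("label", ""),
            "xyxy": (xmin / 1000 * img_w, ymin / 1000 * img_h,
                     xmax / 1000 * img_w, ymax / 1000 * img_h),
        })
    return boxes
```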
- Do you ask the multimodal LLM to return the image with boxes drawn on it (and then somehow extract coordinates), or simply ask it to return the coordinates? (Is the former even possible?)
- Does it do better or worse when you ask it for [xmin, xmax, ymin, ymax] or [x, y, width, height] (or various permutations thereof)?
- Do you ask for these coordinates as integer pixels (whose meaning can vary with dimensions of the original image), or normalized between 0.0 and 1.0 (or 0–1000 as in this post)?
- Is it worth doing it in two rounds: send it back its initial response with the boxes drawn on it, to give it another opportunity to "see" its previous answer and adjust its coordinates?
I ought to look at these things, but I'm wondering: as you (or others) work on something like this, how do you keep track of which prompts seem to be working better? Do you log all requests and responses / scores as you go? I didn't do that for my initial attempts, and it felt a bit like shooting in the dark / trying random things until something works.
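For what it's worth, even something as crude as appending every trial to a JSONL file helps a lot. A rough sketch of what I mean, scoring against a handful of hand-labeled boxes (the IoU-based scoring is just one way to do it):

```python
import json
import time

def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def log_trial(path, prompt, raw_response, predicted_boxes, ground_truth_boxes):
    """Append one prompt experiment to a JSONL log so runs stay comparable later."""
    # Score each predicted box by its best match against the hand-labeled boxes.
    scores = [max((iou(p, g) for g in ground_truth_boxes), default=0.0)
              for p in predicted_boxes]
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "response": raw_response,
        "mean_iou": sum(scores) / len(scores) if scores else 0.0,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```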
Recommend the Gemini docs here, they are explicit on some of these points.
Prompts matter too, less is more.
And you need to submit images to get good bounding boxes. You can somewhat infer this from the token counts, but the Gemini APIs do something to PDFs (OCR, I assume) that causes them to completely lose location context on the page. If you send the page in as an image, that context isn't lost and the boxes are great.
As an example of this, you can send a PDF page with text on the top half and the bottom half empty. If you ask it to draw a bounding box around the last paragraph, it tends to return a result that is a much higher number on the normalized scale (i.e., lower down the page) than it should be. In one experiment I did, it thought a footer text that was actually about 2/3 of the way down the page was all the way at the bottom. When I sent the page as an image, it landed around the 660 mark on the normalized 1000 scale, exactly where you would expect it.
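If it helps anyone reproduce this, the workaround is just to rasterize the page yourself before calling the API. A sketch using pdf2image and the google-generativeai client (the model name, file name, and prompt here are placeholders, not what I actually used):

```python
import google.generativeai as genai
from pdf2image import convert_from_path  # needs poppler installed

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Rasterize page 1 ourselves instead of uploading the PDF, so the model
# sees the full page layout rather than whatever the PDF pipeline extracts.
page = convert_from_path("statement.pdf", dpi=200, first_page=1, last_page=1)[0]

response = model.generate_content([
    page,  # a PIL image can be passed directly
    "Draw a bounding box around the last paragraph. Respond as "
    '[{"box_2d": [ymin, xmin, ymax, xmax], "label": "last paragraph"}] '
    "with coordinates normalized to 0-1000.",
])
print(response.text)
```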
And hopefully with diffusion-based LLMs, we might even see real-time applications?
> The allure of skipping dataset collection, annotation, and training is too enticing not to waste a few evenings testing.
How does annotation work? Do you actually have to mark every pixel of “the thing,” or does the training process just accept images with “a thing” inside them, and learn to ignore all the “not the thing” stuff that tends to show up? If it is the latter, maybe Gemini with its mediocre bounding boxes could be used as an infinitely un-bore-able annotator instead.
That way it would be both efficient and cost-effective.
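It's mostly the latter for standard detectors: training data is images plus box coordinates and class labels, not per-pixel masks (those are only needed for segmentation). So in principle you could let Gemini pre-annotate and only hand-correct. A hypothetical sketch of dumping its boxes into the YOLO txt label format (one .txt per image, each line `class x_center y_center width height`, all normalized to 0-1):

```python
import os

def to_yolo_label(box_2d, class_id):
    """Turn a Gemini-style (ymin, xmin, ymax, xmax) 0-1000 box into a YOLO label line."""
    ymin, xmin, ymax, xmax = (v / 1000 for v in box_2d)
    x_center, y_center = (xmin + xmax) / 2, (ymin + ymax) / 2
    width, height = xmax - xmin, ymax - ymin
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# One .txt per image, one line per box, e.g. a single detection of class 0:
os.makedirs("labels", exist_ok=True)
with open("labels/img_0001.txt", "w") as f:
    f.write(to_yolo_label([120, 80, 560, 430], class_id=0) + "\n")
```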
I've been absurdly surprised at how good it is at some things, and how bad it is at others, and notably the things it seems worst at are the easy-pickings parts.
Let me give an example. I was checking with it the payslips of my employees for the last few months, various wire transfers related to their salaries and the various taxes, and my social declaration papers for labor taxes (which in France are very numerous and complex to follow). I had found a discrepancy in a couple of declarations that ultimately led to a few dozen euros lost over some back and forth. Figuring it out by myself took me a while and was not fun; I had the right accounting total and almost everything was okay, and ultimately it was a case of a credit being applied while an unrelated malus was also applied, both to some employees but not others, and the collision made it a pain to find.
Providing all the papers to Gemini and asking it to check if everything was fine, it found me a bazillion "weird things", mostly correct and worth checking, but not the real problem.
Giving it the same papers, telling it the problem I had and where to look without being sure, it found it for me with decent detail, making me confident that next time I can use it not to solve the problem, but to be put on the right track much, much faster than without Gemini.
Giving it the same papers, the problem, and also the solution I had, but asking it for more details, again gave great results and actually helped me clarify which lines collided in which order. Again, not a replacement, but a great add-on. Definitely felt like the price I'm paying for it is worth it.
But here is the funny part: in all of those great analyses, it kept trying to tally totals for me, and there was always one wrong. We're not talking impressive stuff here, but a quite literal case of: here is a 2-column, 5-row table of data, and here is the total, and the total is wrong. I needed to ask it three or four times in a row to fix its total until it agreed / found its issue.
I was a bit amused (and intrigued) by the "show thinking" detail of that, where I saw it do the same calculation in half a dozen different ways to try to find how I came up with my number, but it really showed me how weirdly differently from us these things work (or "think", some would say).
If it's not thinking but just emergent behavior from text assimilation, which it's supposed to be, then it figuring out something like that with such detail and clarity was impressive in a way I can't quite grasp. But if it's not that, but a genuine thought process of some sort, how could it miss the simplest thing so many times despite being told?
I don't really have a point here, other than that I used to know where I stood on "are the models thinking or not", and the waters have really been muddied for me lately.
There has been a lot of talk about whether these things will replace employees, and I don't see how they could, but I also don't see how an employee without one could compete with one helped by one as an assistant: "throw ideas at me" or "here is the result I already know, but help me figure out why". That's where they shine very brightly for me.
> Sometimes Gemini is better than the ground truth
That ain’t ground truth, that’s just what MS-COCO has.
Our ground truth should reflect the "correct" output expected of the model with regard to its training. So while in many cases "truth" and "correct" should align, there are many, many cases where "truth" is subjective, and so we must settle for "correct".
Case in point: we've trained a model to parse out addresses from a wide-array of forms. Here is an example address as it would appear on the form.
Address: J Smith 123 Example St
City: LA State: CA Zip: 85001
Our ground truth says it should be rendered as such:
Address Line 1: J Smith
Address Line 2: 123 Example St
City: LA
State: CA
ZipCode: 85001
However our model outputs it thusly:
Address Line 1: J Smith 123 Example St
Address Line 2:
City: LA
State: CA
ZipCode: 85001
That may be true, as there is only 1 address line and we have a field for "Address Line 1", but it is not correct. Sure, there may be a problem with our taxonomy, training data, or any other number of other things, but as far as ground truth goes it is not correct.
Are you trying to tell me that the COCO labelling of the cars is what you call correct?
If, as it seems in the article, they are using COCO to establish ground truth, i.e. what COCO says is correct, then whatever COCO comes up with is, by definition "correct". It is, in effect, the answer, the measuring stick, the scoring card. Now what you're hinting at is that, in this instance, that's a really bad way to establish ground truth. I agree. But that doesn't change what is and how we use ground truth.
Think of it another way:
- Your job is to pass a test.
- To pass a test you must answer a question correctly.
- The answer to that question has already been written down somewhere.
To pass the test does your answer need to be true, or does it need to match what is already written down?
When we do model evaluation the answer needs to match what is already written down.
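In other words, evaluation is string matching against the answer key, not against reality. A toy sketch of what that scoring looks like for the address example above (the field names are made up for illustration):

```python
def field_accuracy(prediction: dict, ground_truth: dict) -> float:
    """Score a parsed address against the answer key, field by field."""
    fields = ground_truth.keys()
    correct = sum(prediction.get(f, "") == ground_truth[f] for f in fields)
    return correct / len(fields)

gt   = {"address_line_1": "J Smith", "address_line_2": "123 Example St",
        "city": "LA", "state": "CA", "zip": "85001"}
pred = {"address_line_1": "J Smith 123 Example St", "address_line_2": "",
        "city": "LA", "state": "CA", "zip": "85001"}

# "True" in some sense, but it only matches 3 of the 5 answer-key fields.
print(field_accuracy(pred, gt))  # 0.6
```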
You're both right. Perfection isn't possible or practical. But their "ground truth" (in that example) is obviously shite; nobody should be using it for training or any sort of metric, since it will make their models worse. You're also right that you can name a dataset "ground truth", but names don't mean much when they're in opposition to the intent.
Notably, performance on out-of-distribution data like that in RF100VL is severely degraded.
It worked really well zero-shot (compared to the rest of the foundation-model field), achieving 13.3 average mAP, but counterintuitively performance degraded when it was provided visual examples to ground its detections on, and when it was provided textual instructions on how to find objects as additional context. So it seems it has some amount of zero-shot object-detection training, probably on a few standard datasets, but isn't smart enough to incorporate additional context or its general world knowledge into those detection abilities.
Different models have different encoders; they are not shared, and the datasets vary across models and even model sizes. So performance between models will vary.
What you seem to be thinking is that the text model is simply calling out to a vision model via an API, similar to tool use. That is not what's happening; it is much more built in: the forward pass goes through the vision architecture into the language architecture. Robotics research has been doing this for a while.
Most vision LLMs don't actually use a separate vision model. https://huggingface.co/blog/vlms is a decent explanation of what's going on.
Most of the big LLMs these days are vision LLMs - the Claude models, the OpenAI models, Grok and most of the Gemini models all accept images in addition to text. To my knowledge none of them are using tool calling to a separate vision model for this.
Some of the local models can do this too - Mistral Small and Gemma 3 are two examples. You can tell they're not tool calling to anything because they run directly out of a single model weights file.
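A loose mental model of how those single-weights-file vision LLMs hang together (an illustrative PyTorch sketch, not any particular model's real architecture): a vision encoder turns the image into patch embeddings, a small projection maps them into the LLM's token-embedding space, and they're processed in the same forward pass as the text.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Illustrative only: image features projected into the language model's token space."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, text_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a ViT yielding patch embeddings
        self.projector = nn.Linear(vision_dim, text_dim)  # the small "glue" layer
        self.language_model = language_model              # a decoder-only transformer

    def forward(self, pixel_values, text_embeddings):
        patches = self.vision_encoder(pixel_values)       # (batch, num_patches, vision_dim)
        image_tokens = self.projector(patches)            # (batch, num_patches, text_dim)
        # Image tokens are prepended to the text tokens: one forward pass, no tool call.
        return self.language_model(torch.cat([image_tokens, text_embeddings], dim=1))
```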
For instance, I asked it to compute the symmetry group of a pattern I found on a wallpaper in a Lebanese restaurant this weekend. It realised it was unsure of the symmetries and used a python script to rotate and mirror the pattern and compare to the original to check the symmetries it suspected. Pretty awesome!
Since classical computer vision still has strengths, I wonder why someone hasn't made an "über vision-language service" that just exposes the old CV APIs as MCP tools or something, and has both systems work in conjunction to increase accuracy and understanding.
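For what it's worth, wiring that up is not much code. A rough sketch of one such tool, assuming the official `mcp` Python SDK and an off-the-shelf Ultralytics detector (both are assumptions on my part):

```python
from mcp.server.fastmcp import FastMCP
from ultralytics import YOLO

mcp = FastMCP("classic-cv-tools")
detector = YOLO("yolov8n.pt")  # any pretrained detector would do here

@mcp.tool()
def detect_objects(image_path: str) -> list[dict]:
    """Run a classical object detector and return labeled pixel-space boxes."""
    result = detector(image_path)[0]
    return [
        {
            "label": result.names[int(cls)],
            "confidence": float(conf),
            "xyxy": [float(v) for v in box],
        }
        for box, cls, conf in zip(
            result.boxes.xyxy.tolist(),
            result.boxes.cls.tolist(),
            result.boxes.conf.tolist(),
        )
    ]

if __name__ == "__main__":
    mcp.run()  # the LLM can now call detect_objects as a tool
```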
In either case, your assertion that one _understands_, and the other doesn't, seems like motivated reasoning, rather than identifying something fundamental about the situation.
But at least with a child I can quickly teach it to follow simple orders, while this AI requires hours of annotating + training, even for simple changes in instructions.
The other problem with LLMs today is that they don't persist any learning from their everyday inference and interaction with users, at least not in real time. So it makes them harder to instruct in a useful way.
But it seems inevitable that both their pre-training, and ability to seamlessly continue to learn afterward, should improve over the coming years.
We have a planner phase followed by a "finder" phase where vision models are used. Below is a summary of our findings for the planner and finder. Some models are "work in progress" as they do not support tool calling (or are extremely bad at it).
+------------------------+------------------+------------------+
| Models | Planner | Finder |
+------------------------+------------------+------------------+
| Gemini 1.5 Pro | recommended | recommended |
| Gemini 1.5 Flash | can use | recommended |
| OpenAI GPT-4o          | recommended      | work in progress |
| OpenAI GPT-4o mini     | recommended      | work in progress |
| Llama 3.2 latest       | work in progress | work in progress |
| Llama 3.2 vision       | work in progress | work in progress |
| Molmo 7B-D-4bit | work in progress | recommended |
+------------------------+------------------+------------------+
1. https://github.com/BandarLabs/clickclickclick

One insight that the author calls out is the inconsistency in coordinate systems used in post-training these models - you can't just swap models and get similar results. Gemini uses (ymin, xmin, ymax, xmax) integers between 0 and 1000. Qwen uses (xmin, ymin, xmax, ymax) floats between 0 and 1. We've been evaluating most of the frontier models for bounding boxes / segmentation masks, and this is quite a footgun for new users.
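A thin adapter layer goes a long way here; something like this minimal sketch, which normalizes both conventions described above to absolute-pixel xyxy:

```python
def gemini_to_xyxy(box, img_w, img_h):
    """Gemini convention: [ymin, xmin, ymax, xmax], integers normalized to 0-1000."""
    ymin, xmin, ymax, xmax = box
    return (xmin / 1000 * img_w, ymin / 1000 * img_h,
            xmax / 1000 * img_w, ymax / 1000 * img_h)

def qwen_to_xyxy(box, img_w, img_h):
    """Qwen convention: [xmin, ymin, xmax, ymax], floats normalized to 0-1."""
    xmin, ymin, xmax, ymax = box
    return (xmin * img_w, ymin * img_h, xmax * img_w, ymax * img_h)
```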
One of the reasons we chose to delegate object detection to specialized tools is essentially the poor performance (~0.34 mAP with Gemini vs ~0.6 mAP with DETR-like architectures). Check out this cookbook [1] we recently released: we use any LLM to delegate tasks like object detection, face detection, and other classical CV tasks to a specialized model while still giving the user the dev-ex of a VLM.
[1] https://colab.research.google.com/github/vlm-run/vlmrun-cook...
There's a neat table here: https://dragoneye.ai/blog/a-guide-to-bounding-box-formats
Picking yxyx was certainly a decision.
Disclosure: I work for https://aryn.ai/
Has anyone here found good ways to handle bounding box quality in noisy datasets? Do you rely more on human annotation or clever augmentation?
In some cases, running a model like SAM 2 on a loose bounding box can help refine the results. I usually add about 10% padding in each direction to the bounding box, just in case the original was too tight. Then, if you don't actually need the mask, you just convert it back to a bounding box.
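Roughly what that refinement step looks like, assuming the `sam2` package and its Hugging Face checkpoints (the exact API details are from memory, so treat this as a sketch):

```python
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

def pad_box(box, img_w, img_h, pad=0.10):
    """Grow an (xmin, ymin, xmax, ymax) box by ~10% per side, clamped to the image."""
    xmin, ymin, xmax, ymax = box
    dx, dy = (xmax - xmin) * pad, (ymax - ymin) * pad
    return (max(0, xmin - dx), max(0, ymin - dy),
            min(img_w, xmax + dx), min(img_h, ymax + dy))

image = np.array(Image.open("page.png").convert("RGB"))
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(image)

loose_box = pad_box((120, 340, 410, 520), image.shape[1], image.shape[0])
masks, scores, _ = predictor.predict(box=np.array(loose_box))

# Collapse the best mask back into a tight bounding box.
ys, xs = np.where(masks[np.argmax(scores)])
tight_box = (xs.min(), ys.min(), xs.max(), ys.max())
```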