What's the current state of the art for API generation of an image from a reference plus modifier prompt?
Say, in the 1¢ per HD (1920×1080) image range?
The interesting part of this GPT-4o API is that it doesn't need to learn the reference first. But given the cost of `high` quality image generation, it's much cheaper to train a LoRA for Flux 1.1 Pro and generate from that.
Anyone know of an AI model for generating SVG images? Please share.
The first paper linked in the prior comment is the latest one from the SVGRender group, but I'm not sure if any runnable model weights are out yet for it (SVGFusion).
One note with these: most of the production ones are actually diffusion models that get run through an image->SVG model afterwards. The issue with this is that the layers aren't set up semantically like you'd expect if you were crafting these by hand, or if you were directly generating SVGs. The results work, but they aren't perfect.
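If anyone wants to play with that second stage, here's a minimal sketch of the raster-to-SVG tracing step, assuming the vtracer Python bindings (file names are placeholders; the layering caveat above still applies, since the output paths follow color regions rather than meaning):

```python
# Minimal sketch: trace a diffusion model's raster output into an SVG.
# pip install vtracer
import vtracer

vtracer.convert_image_to_svg_py(
    "diffusion_output.png",  # raster image from the diffusion model
    "traced.svg",            # vector output; paths are grouped by color
    colormode="color",       # region, not by semantic object/layer
)
```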
PermissionDeniedError: Error code: 403 - {'error': {'message': 'To access gpt-image-1, please complete organization verification
So the options are: 1) nerf the model so it can't produce images like that, or 2) use some type of KYC verification.
Upload a picture of a friend -> OK. Upload my own picture -> I can't generate anything involving real people.
Also, after they enabled global chat memory, I started seeing my other chats leaking into the images as literal text. I've disabled it since.
EDIT: Oh, yes, that's what it appears to be. Is it better? Why would I switch?
Enhance headshots for putting on LinkedIn.
- specificity (a diagram that perfectly encapsulates the exact set of concepts you're asking about)
A pro/con of the multimodal image generation approach (with an actually good text encoder) is that it rewards intense prompt engineering more so than other approaches, and if there is a use case that can generate more than $0.17/image in revenue, that's positive marginal profit.
Today I'm discovering there is a tier of API access with virtually no content moderation available to companies working in that space. I have no idea how to go about requesting that tier of access, but have spoken to 4 different defense contractors in the last day who seem to already be using it.
(FWIW, for anyone curious how to implement it: it's the 'moderation' parameter in the JSON request you'll send. I missed it for a few hours because it wasn't in DALL·E 3.)
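A minimal sketch with the Python SDK; the prompt and output filename are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",
    prompt="a watercolor lighthouse at dawn",  # placeholder prompt
    moderation="low",  # the parameter in question; "auto" is the default
    size="1024x1024",
)

# gpt-image-1 returns base64-encoded image data rather than a URL
with open("out.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```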
I just took any indication that the parent post meant absolutely zero moderation as them being a bit loose with their words and excitable in how they understand things. There were some signs:
1. It's unlikely they completed an API integration quickly enough to have an opinion on military/defense image generation moderation yesterday, so they're almost certainly speaking about ChatGPT. (This is additionally confirmed by image generation requiring tier 5 anyway, which they would have been aware of had they integrated.)
2. The military/defense use cases for image generation are not provided (and the steelmanned versions in other comments are nonsensical, i.e. we can quickly validate that you can still generate kanban boards or wireframes of ships).
3. The poster passively disclaims being in military/defense themselves (grep "in that space").
4. It is hard to envision cases of #2 that do not require universal moderation for OpenAI's sake. I.e., let's say their thought process is along the lines of: defense/military ~= what I think of as CIA ~= black ops ~= image manipulation on social media; thus, the time I said "please edit this photo of the ayatollah to have him eating pig and say I hate allah" means it's overmoderated for defense use cases.
5. It's unlikely OpenAI wants to be anywhere near the PR resulting from #4. Assuming there is a super secret defense tier that allows this, it's at the very least unlikely that the poster's defense contractor friends were blabbing about the exclusive completely unmoderated access they had, to the poster, within hours of release. They're pretty serious about that secrecy stuff!
6. It is unlikely the lack of ability to generate images using GPT Image 1 would drive the military to Chinese models (there aren't Chinese LLMs that do this! And even if there were, there are plenty of good ol' American diffusion models!)
OP was clearly implying there is some greater ability only granted to extra special organizations like the military.
With all possible respect to OP, I find this all very hard to believe without additional evidence. If nothing else, I don't really see a military application of this API (specifically, not AI in general). I'm sure it would help them create slide decks and such, but you don't need extra special zero moderation for that.
I can't provide additional evidence (it's defense, duh), but the #1 use I've seen is generating images for computer vision training, mostly to feed GOFAI algorithms that have already been validated for target acquisition. Image gen algorithms have a pretty good idea of what a T72 tank and different camouflage look like, and they're much better at generating unique photos combining the two. It's actually a great use of the technology, because hallucinations help improve the training data (i.e. the final targeting should be invariant to a T72 tank with a machine gun on the wrong side or with too many turrets, etc.)
That said, due to compartmentalization, I don't know the extent to which image gen is used in defense, just my little sliver of it.
I kid; more real-world use cases would be concept images for a new product or for marketing campaigns.
Then include this image in the dataset of another net, labeled "civilian". Train that new neural net so that it has a lower false-positive rate when asked "is this target military?"
The real answer is probably way, way more mundane - generating images for marketing, etc.
Now I'm just wondering what the hell defense contractors need image generation for that isn't obviously horrifying...
“Please move them to some desert, not the empire state building.”
“The civilians are supposed to have turbans, not ballcaps.”
The bias leans towards overfitting the data, which is fine in some use cases - such as missile or drone design, which doesn't need broad comparisons like 747s or artillery to complete its training.
Kind of like neural net backpropagation, but in terms of the model/weights.
GCCH is typically 6-12 months behind in feature set.
I can't see a way to do this currently; you just get a prompt.
This, I think, is the most powerful way to use the new image model since it actually understands the input image and can make a new one based on it.
E.g. you can give it a person sitting at a desk and it can make one of them standing up. Or from another angle. Or on the moon.
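Via the API, this goes through the edits endpoint, which takes one or more reference images plus an instruction. A rough sketch; the file names and prompt are illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()

# The edits endpoint conditions generation on an uploaded reference image.
with open("person_at_desk.png", "rb") as ref:
    result = client.images.edit(
        model="gpt-image-1",
        image=ref,
        prompt="the same person, standing up, seen from another angle",
    )

with open("person_standing.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```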
Text input tokens (prompt text): $5 per 1M tokens
Image input tokens (input images): $10 per 1M tokens
Image output tokens (generated images): $40 per 1M tokens
In practice, this translates to roughly $0.02, $0.07, and $0.19 per generated image for low, medium, and high-quality square images, respectively.
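Back-of-envelope math, using per-image output-token counts that have been floated elsewhere (treat the counts as approximate; prompt and input-image tokens add a bit on top):

```python
# Rough output-token cost per 1024x1024 image at $40 per 1M image output tokens.
PRICE_PER_OUTPUT_TOKEN = 40 / 1_000_000  # dollars

# Approximate output-token counts per quality tier (assumed values).
tokens_by_quality = {"low": 272, "medium": 1056, "high": 4160}

for quality, tokens in tokens_by_quality.items():
    print(f"{quality}: ~${tokens * PRICE_PER_OUTPUT_TOKEN:.3f} per image")
# "high" comes out around $0.166, consistent with the ~16.7 cents
# mentioned downthread.
```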
That's a bit pricey for a startup.
I remember meeting someone on Discord 1-2 years ago (?) who was working on a GoDaddy effort to offer customer-generated icons using bespoke foundation image gen models. Suppose that kind of bespoke model at that scale is ripe for replacement by gpt-image-1, given the instruction-following ability / steerability?
As an example, some users (myself included) of a generative image app were trying to make a picture of a person in the pouch of a kangaroo.
No matter what we prompted, we couldn’t get it to work.
GPT-4o did it in one shot!
And you seem to be right, though the only reference I can find is in one of the example images of a whiteboard posted on the announcement[0].
It shows: tokens -> [transformer] -> [diffusion] -> pixels
hjups22 on Reddit[1] describes it as:
> It's a hybrid model. The AR component generates control embeddings that then get decoded by a diffusion model. But the control embeddings are accurate enough to edit and reconstruct the images surprisingly well.
[0]https://openai.com/index/introducing-4o-image-generation/
[1]https://www.reddit.com/r/MachineLearning/comments/1jkt42w/co...
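In pseudocode, that hybrid might look something like the sketch below. This is pure speculation extrapolated from the quote; every name in it is invented:

```python
# Speculative sketch of the hybrid AR + diffusion design described above:
# a transformer autoregressively emits "control embeddings", and a
# diffusion decoder renders the embedding sequence into pixels.
def generate_image(prompt_tokens, transformer, diffusion_decoder, n_image_tokens):
    state = transformer.encode(prompt_tokens)
    embeddings = []
    for _ in range(n_image_tokens):
        emb, state = transformer.step(state)  # one AR step per image token
        embeddings.append(emb)
    return diffusion_decoder.decode(embeddings)  # embeddings -> pixels
```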
> Unlike DALL·E, which operates as a diffusion model, 4o image generation is an autoregressive model natively embedded within ChatGPT.
[0]https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/...
Why would that be more likely? It seems like some implementation of ByteDance's VAR.
Prompt: “a cute dog hugs a cute cat”
https://x.com/terrylurie/status/1915161141489136095
I also then showed a couple of DALL·E 3 images for comparison in a comment.
“Auto” is just whatever the best quality is for a model. So in this case it’s the same as “high”.
This prompt is best served by Midjourney, Flux, Stable Diffusion. It'll be far cheaper, and chances are it'll also look a lot better.
The place where gpt-image-1 shines is if you want to do a prompt like:
"a cute dog hugs a cute cat, they're both standing on top of an algebra equation (y=\(2x^{2}-3x-2\)). Use the first reference image I uploaded as a source for the style of the dog. Same breed, same markings. The cat can contrast in fur color. Use the second reference image I uploaded as a guide for the background, but change the lighting to sunset. Also, solve the equation for x."
gpt-image-1 doesn't make the best images, and it isn't cheap, and it isn't fast, but it's incredibly -- almost insanely -- powerful. It feels like ComfyUI got packed up into an LLM and provided as a natural language service.
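As a sanity check on that last instruction: assuming "solve" means find the roots at y = 0, the quadratic factors cleanly, so a correct render should show:

```latex
2x^2 - 3x - 2 = (2x + 1)(x - 2) = 0
\quad\Longrightarrow\quad x = 2 \ \text{or}\ x = -\tfrac{1}{2}
```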
Does this mean this also does video in some manner?
You might say, "but ChatGPT is already as dead simple an interface as you can imagine." And the answer to that is: for specific tasks, no general interface is ever specific enough. So imagine you want to use this to create "headshots" or "LinkedIn bio photos" from random pictures of yourself. A bespoke interface, with options you haven't even considered already thought through for you, and some quality control/revisions baked into the process, is something someone might pay for.
minimaxir•6h ago
Prompting the model is also substantially different and more difficult than with traditional models, unsurprisingly given the way the model works. The traditional image tricks don't work out of the box, and I'm struggling to get something that works without significant prompt augmentation (which is what I suspect was used for the ChatGPT image generations).
tough•5h ago
Maybe OpenAI thinks the model business is over and they need to start sherlocking all the way from the top down to final apps (thus their interest in buying out Cursor, finally ending up with Windsurf).
Idk, this feels like a new offering between a full raw API and a final product, where you abstract some of it away for a few cents, and they're basically bundling their SOTA LLM models with their image models for extra margin.
vineyardmike•5h ago
In case you didn't know, it's not just an LLM wrapped around a separate image model. The image model they're referencing is directly integrated into the LLM. It's not possible to extract, because the LLM outputs tokens which are part of the image itself.
That said, they’re definitely trying to focus on building products over raw models now. They want to be a consumer subscription instead of commodity model provider.
tough•5h ago
Waiting for some FOSS multi-modal model to come out eventually too.
Great to see OpenAI expanding into making actual usable products, I guess.
echelon•4h ago
- OpenAI is notorious for blocking copyrighted characters. They do prompt keyword scanning, but also run a VLM on the results so you can't "trick" the model.
- Lots of providers block public figures and celebrities.
- Various providers block LGBT imagery, even safe for work prompts. Kling is notorious for this.
- I was on a sales call with someone today who runs a father's advocacy group. I don't know what system he was using, but he said he found it impossible to generate an adult male with a child. In a totally safe for work context.
- Some systems block "PG-13" images of characters that are in bathing suits or scantily clad.
None of this is porn, mind you.
echelon•5h ago
This is the god model in images right now.
I don't think open source diffusion models can catch up with this. From what I've heard, this model took a huge amount of money to train - money that not even Black Forest Labs has access to.
thot_experiment•4h ago
As for LoRAs and fine tuning and open source in general; if you've ever been to civit.ai it should be immediately obvious why those things aren't going away.
simonw•5h ago
Here's my dog in a pelican costume: https://bsky.app/profile/simonwillison.net/post/3lneuquczzs2...
furyofantares•5h ago
I suspect what I'll do with the API is iterate at medium quality and then generate a high quality image when I'm done.
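That workflow is just a matter of flipping the quality parameter between calls; a minimal sketch (the prompt is a placeholder):

```python
from openai import OpenAI

client = OpenAI()
prompt = "an isometric illustration of a tiny workshop"  # placeholder

# Iterate cheaply at medium quality, then re-render the final pick at high.
draft = client.images.generate(model="gpt-image-1", prompt=prompt, quality="medium")
final = client.images.generate(model="gpt-image-1", prompt=prompt, quality="high")
```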
Sohcahtoa82•4h ago
It's actually more than that. It's about 16.7 cents per image.
$0.04/image is the pricing for DALL-E 3.
Sohcahtoa82•3h ago
They didn't show low/med/high quality; they just said an image was a certain number of tokens, with a price per token that led to $0.16/image.
raincole•4h ago
I think in terms of image generation, ChatGPT is the biggest leap since Stable Diffusion's release. LoRA/ControlNet/Flux are forgettable in comparison.
adamhowell•4h ago
I started Accomplice v1 back in 2021 with this goal in mind and raised some VC money but it was too early.
Now, with these latest imagen-3.0-generate-002 (Gemini) and gpt-image-1 (OpenAI) models – especially this API release from OpenAI – I've been able to resurrect Accomplice as a little side project.
Accomplice v2 (https://accomplice.ai) is just getting started back up again - I honestly decided to rebuild it only a couple of weeks ago, in preparation for today, once I saw ChatGPT's new image model - but so far there are thousands of free-to-download PNGs, and any SVGs that have already been vectorized are free too (it costs a credit to vectorize).
I generate new icons every few minutes from a huge list of "useful icons" I've built. It will be 100% pay-as-you-go. And for a credit, paid users can vectorize any PNGs they like, tweak them using AI, upload their own images to vectorize and download, or create their own icons (with my prompt injections baked in to get you good icon results).
Do multi-modal models make something like this obsolete? I honestly am not sure. In my experience with Accomplice v1, a lot of users didn't know what to do with a blank textarea, so the thinking here is there's value in doing some of the work for them upfront with a large searchable archive. Would love to hear others' thoughts.
But I'm having fun again either way.
vunderba•2h ago
https://imgur.com/a/BTzbsfh
It definitely captures the style - but any reasonably complicated prompt was beyond it.
vunderba•4h ago
Can you elaborate? This was not my experience - retesting the prompts that I used for my GenAI image shootout against gpt-image-1 API proved largely similar.
https://genai-showdown.specr.net