What's the current state of the art for API generation of an image from a reference plus modifier prompt?
Say, in the 1c per HD (1920*1080) image range?
The interesting part of this GPT-4o API is that it doesn't need to learn them. But given the cost of `high` quality image generation, it's much cheaper to train a LoRA for Flux 1.1 Pro and generate from that.
Anyone know of an AI model for generating SVG images? Please share.
The first paper linked in the prior comment is the latest one from the SVGRender group (SVGFusion), but I'm not sure if any runnable model weights are out for it yet.
One note with these: most of the production ones are actually diffusion models whose output gets run through an image-to-SVG model afterwards. The issue with this is that the layers aren't set up semantically the way you'd expect if you were crafting these by hand, or if you were directly generating SVGs. The results work, but they aren't perfect.
PermissionDeniedError: Error code: 403 - {'error': {'message': 'To access gpt-image-1, please complete organization verification
So the options are: 1) nerf the model so it can't produce images like that, or 2) use some type of KYC verification.
Upload a picture of a friend -> OK. Upload my own picture -> I can't generate anything involving real people.
Also, after they enabled global chat memory I started seeing my other chats leaking into the images as literal text. I've disabled it since.
EDIT: Oh, yes, that's what it appears to be. Is it better? Why would I switch?
Enhance headshots for putting on LinkedIn.
- specificity (a diagram that perfectly encapsulates the exact set of concepts you're asking about)
A pro/con of the multimodal image generation approach (with an actually good text encoder) is that it rewards intense prompt engineering more so than others do, and if there is a use case that can generate more than $0.17/image in revenue, that's positive marginal profit.
Today I'm discovering there is a tier of API access with virtually no content moderation available to companies working in that space. I have no idea how to go about requesting that tier of access, but have spoken to 4 different defense contractors in the last day who seem to already be using it.
(FWIW, for anyone curious how to implement it: it's the 'moderation' parameter in the JSON request you'll send. I missed it for a few hours because it didn't exist for DALL·E 3.)
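A minimal sketch of what that looks like with the Python SDK, in case it saves someone the same hunt (the prompt and size are just placeholders, and as far as I can tell `moderation` accepts "low" or "auto"):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-1",
    prompt="concept art of a delivery drone over a warehouse",
    size="1024x1024",
    quality="medium",
    moderation="low",  # "auto" is the default; this parameter didn't exist for DALL-E 3
)

b64_png = result.data[0].b64_json  # gpt-image-1 returns base64-encoded image data
```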
I just took any indication that the parent post meant absolutely zero moderation as them being a bit loose with their words and excitable in how they understand things. There were some signs:
1. It's unlikely they completed an API integration quickly enough to have an opinion on military / defense image generation moderation yesterday, so they're almost certainly speaking about ChatGPT. (This is additionally confirmed by image generation requiring tier 5 anyway, which they would have been aware of if they had integrated.)
2. The military / defense use cases for image generation are not provided (and the steelman'd version in other comments is nonsensical, i.e. we can quickly validate you can still generate kanban boards or wireframes of ships)
3. The poster passively disclaims being in military / defense themself (grep "in that space")
4. It is hard to envision cases of #2 that do not require universal moderation for OpenAI's sake, i.e. let's say their thought process is along the lines of: defense/military ~= what I think of as CIA ~= black ops ~= image manipulation on social media; thus, the time I said "please edit this photo of the ayatollah to have him eating pig and say I hate allah" means it's over-moderated for defense use cases
5. It's unlikely OpenAI wants to be anywhere near the PR resulting from #4. Assuming there is a super secret defense tier that allows this, it's, at the very least, unlikely that the poster's defense contractor friends were blabbing about the exclusive completely unmoderated access they had, to the poster, within hours of release. They're pretty serious about that secrecy stuff!
6. It is unlikely the lack of ability to generate images using GPT Image 1 would drive the military to Chinese models (there aren't Chinese LLMs that do this! And even if there were, there are plenty of good ol' American diffusion models!)
OP was clearly implying there is some greater ability only granted to extra special organizations like the military.
With all possible respect to OP, I find this all very hard to believe without additional evidence. If nothing else, I don't really see a military application of this API (specifically, not AI in general). I'm sure it would help them create slide decks and such, but you don't need extra special zero moderation for that.
I kid; more real-world use cases would be concept images for a new product or marketing campaigns.
Then include this image in the training dataset of another net, labeled "civilian". Train that new neural net so that it has a lower false-positive rate when asked "is this target military?"
Now I'm just wondering what the hell defense contractors need image generation for that isn't obviously horrifying...
“Please move them to some desert, not the empire state building.”
“The civilians are supposed to have turbans, not ballcaps.”
I can’t see a way to do this currently, you just get a prompt.
This, I think, is the most powerful way to use the new image model since it actually understands the input image and can make a new one based on it.
E.g. you can give it a person sitting at a desk and it can make one of them standing up. Or from another angle. Or on the moon.
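Through the API, my understanding is this goes through the edits endpoint rather than plain generation; roughly something like the following (the file name and prompt are made up):

```python
from openai import OpenAI

client = OpenAI()

# Pass the reference photo plus an instruction describing the change.
result = client.images.edit(
    model="gpt-image-1",
    image=open("person_at_desk.png", "rb"),
    prompt="Same person and style, but standing up next to the desk",
)

b64_png = result.data[0].b64_json
```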
- Text input tokens (prompt text): $5 per 1M tokens
- Image input tokens (input images): $10 per 1M tokens
- Image output tokens (generated images): $40 per 1M tokens
In practice, this translates to roughly $0.02, $0.07, and $0.19 per generated image for low, medium, and high-quality square images, respectively.
That's a bit pricey for a startup.
I remember meeting someone on Discord 1-2 years ago (?) who was working on a GoDaddy effort to offer customer-generated icons using bespoke foundation image-gen models. I suppose that kind of bespoke model at that scale is ripe for replacement by gpt-image-1, given the instruction-following ability / steerability?
As an example, some users (myself included) of a generative image app were trying to make a picture of a person in the pouch of a kangaroo.
No matter what we prompted, we couldn’t get it to work.
GPT-4o did it in one shot!
> Unlike DALL·E, which operates as a diffusion model, 4o image generation is an autoregressive model natively embedded within ChatGPT.
[0] https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/...
Prompt: “a cute dog hugs a cute cat”
https://x.com/terrylurie/status/1915161141489136095
I also then showed a couple of DALL·E 3 images for comparison in a comment
“Auto” is just whatever the best quality is for a model. So in this case it’s the same as “high”.
Does this mean this also does video in some manner?
minimaxir•4h ago
Prompting the model is also substantially different from, and more difficult than, traditional models, unsurprisingly given the way the model works. The traditional image tricks don't work out of the box, and I'm struggling to get something that works without significant prompt augmentation (which is what I suspect was used for the ChatGPT image generations)
tough•4h ago
maybe OpenAI thinks the model business is over and they need to start sherlocking all the way from the top down to final apps (thus their interest in buying out Cursor, finally ending up with Windsurf)
Idk, this feels like a new offering between a full raw API and a final product, where they abstract some of it away for a few cents, and they're basically bundling their SOTA LLM models with their image models for extra margin
vineyardmike•4h ago
In case you didn't know, it's not just wrapping an LLM. The image model they're referencing is directly integrated into the LLM. It's not possible to extract, because the LLM itself outputs the tokens that make up the image.
That said, they're definitely trying to focus on building products over raw models now. They want to be a consumer subscription instead of a commodity model provider.
tough•4h ago
waiting for some FOSS multi-modal model to come out eventually too
great to see OpenAI expanding into making actual usable products, I guess
echelon•3h ago
- OpenAI is notorious for blocking copyrighted characters. They do prompt keyword scanning, but also run a VLM on the results so you can't "trick" the model.
- Lots of providers block public figures and celebrities.
- Various providers block LGBT imagery, even safe for work prompts. Kling is notorious for this.
- I was on a sales call with someone today who runs a father's advocacy group. I don't know what system he was using, but he said he found it impossible to generate an adult male with a child. In a totally safe for work context.
- Some systems block "PG-13" images of characters that are in bathing suits or scantily clad.
None of this is porn, mind you.
echelon•3h ago
This is the god model in images right now.
I don't think open source diffusion models can catch up with this. From what I've heard, this model took a huge amount of money to train, more than even Black Forest Labs has access to.
thot_experiment•3h ago
As for LoRAs, fine tuning, and open source in general: if you've ever been to civit.ai it should be immediately obvious why those things aren't going away.
simonw•3h ago
Here's my dog in a pelican costume: https://bsky.app/profile/simonwillison.net/post/3lneuquczzs2...
furyofantares•3h ago
I suspect what I'll do with the API is iterate at medium quality and then generate a high quality image when I'm done.
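Roughly this, assuming the `quality` parameter maps to the low/medium/high pricing tiers (the prompt is a placeholder):

```python
from openai import OpenAI

client = OpenAI()
prompt = "a pelican in a dog costume, watercolor"

# Cheaper drafts while iterating on the prompt...
draft = client.images.generate(
    model="gpt-image-1", prompt=prompt, quality="medium", size="1024x1024"
)

# ...then one final render at high quality once the prompt is dialed in.
final = client.images.generate(
    model="gpt-image-1", prompt=prompt, quality="high", size="1024x1024"
)
```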
Sohcahtoa82•3h ago
It's actually more than that. It's about 16.7 cents per image.
$0.04/image is the pricing for DALL-E 3.
Sohcahtoa82•1h ago
They didn't show low/med/high quality, they just said an image was a certain number of tokens with a price per token that led to $0.16/image.
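The back-of-the-envelope math does work out if a high-quality 1024x1024 image is around 4,160 output tokens (that token count is from memory of the docs, so treat it as approximate):

```python
price_per_output_token = 40 / 1_000_000  # $40 per 1M output tokens

high_quality_tokens = 4160  # approximate, for one high-quality 1024x1024 image

print(round(price_per_output_token * high_quality_tokens, 3))  # ~0.166, i.e. about 16.6 cents
```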
raincole•3h ago
I think in terms of image generation, ChatGPT is the biggest leap since Stable Diffusion's release. LoRA/ControlNet/Flux are forgettable in comparison.
adamhowell•3h ago
I started Accomplice v1 back in 2021 with this goal in mind and raised some VC money but it was too early.
Now, with these latest imagen-3.0-generate-002 (Gemini) and gpt-image-1 (OpenAI) models – especially this API release from OpenAI – I've been able to resurrect Accomplice as a little side project.
Accomplice v2 (https://accomplice.ai) is just getting started back up again – I honestly decided to rebuild it only a couple of weeks ago in preparation for today, once I saw ChatGPT's new image model – but so far there are 1,000s of free-to-download PNGs (and any SVGs that have already been vectorized are free too; it costs a credit to vectorize).
I generate new icons every few minutes from a huge list of "useful icons" I've built. It will be 100% pay-as-you-go. And for a credit, paid users can vectorize any PNGs they like, tweak them using AI, upload their own images to vectorize and download, or create their own icons (with my prompt injections baked in to get you good icon results).
Do multi-modal models make something like this obsolete? I honestly am not sure. In my experience with Accomplice v1, a lot of users didn't know what to do with a blank textarea, so the thinking here is there's value in doing some of the work for them upfront with a large searchable archive. Would love to hear others' thoughts.
But I'm having fun again either way.
vunderba•24m ago
https://imgur.com/a/BTzbsfh
It definitely captures the style - but any reasonably complicated prompt was beyond it.
vunderba•2h ago
Can you elaborate? This was not my experience - retesting the prompts that I used for my GenAI image shootout against gpt-image-1 API proved largely similar.
https://genai-showdown.specr.net