What's the current state of the art for API generation of an image from a reference plus modifier prompt?
Say, in the 1c per HD (1920*1080) image range?
The interesting part of this GPT-4o API is that it doesn't need to learn them. But given the cost of `high` quality image generation, it's much cheaper to train a LoRA for Flux 1.1 Pro and generate from that.
Anyone know of an AI model for generating SVG images? Please share.
The first paper linked in the prior comment is the latest one from the SVGRender group, but I'm not sure if any runnable model weights are out yet for it (SVGFusion).
One note with these is that most of the production ones are actually diffusion models that get run through an image->svg model after. The issue with this is that the layers aren't set up semantically like you'd expect if you were crafting these by hand, or if you were directly generating SVGs. The results work, but they aren't perfect.
PermissionDeniedError: Error code: 403 - {'error': {'message': 'To access gpt-image-1, please complete organization verification
So the options are: 1) nerf the model so it can't produce images like that, or 2) use some type of KYC verification.
Upload a picture of a friend -> OK. Upload my own picture -> I can't generate anything involving real people.
Also after they enabled global chat memory I started seeing my other chats leaking into the images as literal text. Disabled it since.
EDIT: Oh, yes, that's what it appears to be. Is it better? Why would I switch?
Enhance headshots for putting on LinkedIn.
Hard to find such news.
> Additionally, developers can also control moderation sensitivity with the `moderation` parameter, which can be set to auto (default) for standard filtering, or low for less restrictive filtering.
I played around with this last night and although it still sometimes refused to create images, it seemed to be significantly more lenient.
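For anyone wiring this up, the parameter just rides along in the generation request body. A minimal sketch, assuming the documented Images API field names (the prompt here is invented):

```python
import json

# Hedged sketch of a gpt-image-1 request with `moderation` set to "low"
# for less restrictive filtering ("auto", the default, applies standard
# filtering). Field names follow OpenAI's Images API docs.
payload = {
    "model": "gpt-image-1",
    "prompt": "a film noir detective office, rain on the window",
    "size": "1024x1024",
    "quality": "medium",
    "moderation": "low",
}

body = json.dumps(payload)
print(body)
```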
- specificity (a diagram that perfectly encapsulates the exact set of concepts you're asking about)
AI companies are still in their "burning money" phase.
Enshittification is not on the horizon yet, but it's inevitable.
I believe, overall, development will go forward and things will get better. A rising tide lifts all ships, even if some of them decide to be shitty leaking vessels. If nothing else we always have open source software to fall back on when the enshittification of the proprietary models starts.
For a practical example: The cars we drive today are a lot better than 100 years ago. A bad future isn't always inevitable.
etc
A pro/con of the multimodal image generation approach (with an actually good text encoder) is that it rewards intense prompt engineering more so than others, and if there is a use case that can generate more than $0.17/image in revenue, that's positive marginal profit.
That said, it'll be 10-20x cheaper in a year, at which point I don't think you care about price for this workflow in 2D games.
Anyone using image gen for real work, not just for fun?
Although you're way better off finding your own workflows with local models at that scale.
Today I'm discovering there is a tier of API access with virtually no content moderation available to companies working in that space. I have no idea how to go about requesting that tier of access, but have spoken to 4 different defense contractors in the last day who seem to already be using it.
(fwiw for anyone curious how to implement it, it's the 'moderation' parameter in the JSON request you'll send, I missed it for a few hours because it wasn't in Dalle-3)
I just took any indication that the parent post meant absolutely zero moderation as them being a bit loose with their words and excitable in how they understand things. There were some signs:
1. it's unlikely they completed an API integration quickly enough to have an opinion on military / defense image generation moderation yesterday, so they're almost certainly speaking about ChatGPT. (this is additionally confirmed by image generation requiring tier 5 anyway, which they would have been aware of if they had integrated)
2. The military / defense use cases for image generation are not provided (and the steelman'd version in other comments is nonsensical, i.e. we can quickly validate you can still generate kanban boards or wireframes of ships)
3. The poster passively disclaims being in military / defense themself (grep "in that space")
4. it is hard to envision cases of #2 that do not require universal moderation for OpenAI's sake, i.e. let's say their thought process is along the lines of: defense/military ~= what I think of as CIA ~= black ops ~= image manipulation on social media; thus, the time I said "please edit this photo of the ayatollah to have him eating pig and say I hate allah" means it's overmoderated for defense use cases
5. It's unlikely openai wants to be anywhere near PR resulting from #4. Assuming there is a super secret defense tier that allows this, it's at the very least, unlikely that the poster's defense contractor friends were blabbing about the exclusive completely unmoderated access they had, to the poster, within hours of release. They're pretty serious about that secrecy stuff!
6. It is unlikely the lack of ability to generate images using GPT Image 1 would drive the military to Chinese models (there aren't Chinese LLMs that do this! and even if there were, there's plenty of good ol' American diffusion models!)
OP was clearly implying there is some greater ability only granted to extra special organizations like the military.
With all possible respect to OP, I find this all very hard to believe without additional evidence. If nothing else, I don't really see a military application of this API (specifically, not AI in general). I'm sure it would help them create slide decks and such, but you don't need extra special zero moderation for that.
I can't provide additional evidence (it's defense, duh), but the #1 use I've seen is generating images for computer vision training, mostly to feed GOFAI algorithms that have already been validated for target acquisition. Image gen algorithms have a pretty good idea of what a T72 tank and different camouflage looks like, and they're much better at generating unique photos combining the two. It's actually a great use of the technology because hallucinations help improve the training data (i.e. the final targeting should be invariant to a T72 tank with a machine gun on the wrong side or with too many turrets, etc.)
That said, due to compartmentalization, I don't know the extent to which image gen is used in defense, just my little sliver of it.
I'm not aware of the moderation parameter here but these contractors have special API keys that unlock unmoderated access for them, they've apparently had it for weeks.
I kid, more real world use cases would be for concept images for a new product or marketing campaigns.
What an impossibly weird thing to "need" an LLM for.
Then include this image in the dataset of another net, labeled "civilian". Train that new neural net so that it has a lower false-positive rate when asked "is this target military?"
Definitely a premium.
Videos on wikileaks tell a different story.
The real answer is probably way, way more mundane - generating images for marketing, etc.
[1] https://media.wired.com/photos/5933e578714b881cb296c6ef/mast...
With this, you could create a dataset that will by definition have that. You should still corroborate the data, but it's a step ahead without having to take 1000 photos and adding enough noise and variations to get to 30k.
You take your images and crop, shift, etc. them so that your model doesn't learn "all x are in the middle of the image". For text you might auto-replace days of the week with others; there's a lot of work there.
Broadly the intent is to keep the key information and generate realistic but irrelevant noise so that you train a model that correctly ignores the noise.
You don't want a model identifying some class of ship to base its decision on how choppy the water is, just because that was the simple signal that correlated well. There was a case of radiology results that detected cancer well but was actually detecting rulers in the image, because images with tumors often included a ruler so the tumor could be sized. (I think it was cancer; the broad point applies if it was something else.)
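A minimal sketch of that kind of augmentation with numpy (the crop size and noise scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray, crop: int = 56) -> np.ndarray:
    """Random crop + flip + noise: keep the key content, vary the rest."""
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop + 1)      # random shift via random crop
    x = rng.integers(0, w - crop + 1)
    out = img[y:y + crop, x:x + crop].astype(np.float32)
    if rng.random() < 0.5:                 # random horizontal flip
        out = out[:, ::-1]
    out += rng.normal(0, 5.0, out.shape)   # realistic but irrelevant noise
    return np.clip(out, 0, 255).astype(np.uint8)

img = rng.integers(0, 255, size=(64, 64, 3), dtype=np.uint8)
batch = np.stack([augment(img) for _ in range(8)])
print(batch.shape)  # (8, 56, 56, 3)
```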
I can get an AI to generate an image of a bear wearing a sombrero. There are no images of this in its training data, but there are bears, and there are images of sombreros, and other things wearing sombreros. It can combine the distributions in a plausible way.
If I am trying to train a small model to fit into the optical sensor of a warhead to target bears wearing sombreros, this synthetic training set would be very useful.
Same thing with artillery in bushes. Or artillery in different lighting conditions. This stuff is useful to saturate the input space with synthetic examples.
See bifrost.ai and their fun videos of training naval drones to avoid whales in an ethical manner
Now I'm just wondering what the hell defense contractors need image generation for that isn't obviously horrifying...
“Please move them to some desert, not the empire state building.”
“The civilians are supposed to have turbans, not ballcaps.”
I'm confused.
The bias leans towards overfitting the data, which is acceptable in some use cases, such as missile or drone design, which doesn't need broad comparisons like 747s or artillery to complete its training.
Kind of like neural net backpropagation, but in terms of model/weights
GCCH is typically 6-12 months behind in feature set.
I personally just warn customers that it cannot technically handle CUI or higher, can't say that it stops them
"GPT-4o is now available as part of Azure OpenAI Service for Azure Government and included as part of this latest FedRAMP High and DoD IL4/IL5 Authorization."
...we have everything set up in Azure but are wary to start using it with CUI. Our DoD contacts think it's good to go, but nobody wants to go on record as giving the go-ahead.
https://devblogs.microsoft.com/azuregov/azure-openai-fedramp...
https://learn.microsoft.com/en-us/azure/azure-government/com...
* organization required to perform a Risk Assessment (is this standardized?)
* organization must issue an Authority to Operate (ATO) (example? to whom?) to use it for CUI as the data owner.
* organization must ensure data is encrypted properly both at rest and in transit (is plain text typed into a chat window encrypted at rest?).
* organization must ensure the system is documented in a System Security Plan (SSP) (example?).
* organization must get approval from government sponsor of each project to use CUI with AI tools
I am the one pushing for adoption, but don't have the time or FedRAMP/DISA expertise, and our FSO/CISO would rather we just not.
They also have a deployment on SIPR rated for secret.
Anything higher, you need a special key but AWS Bedrock has Claude up on C2S.
That being said both Azure OpenAI and AWS Bedrock suck for many reasons and they will by default extend your system boundary (meaning you need to extend your ATO). Also, for CUI, it has the P-ATO from JAB, not many agency specific ATOs, which means you will probably need to submit it thru your agency sponsor.
I’m sure access to military grade tech is only one small slice in the set of advantages the masters get over the mastered in any human society.
>> How'd you do it?
> I don't know the details. ChatGPT did it for me, this thing's amazing. Our bonuses are gonna be huge this year, I might even be able to afford a lift kit for my truck.
Trying to align OpenAI etc. with the rest of humanity is a completely different problem.
Additionally, we don't tax unrealized capital gains.
It is generally accepted that business profit is taxed. Meanwhile, there are entire industries and tax havens set up to help corporations and their executives avoid paying taxes.[0]
However, the crux of my comment was not about the vagaries of corporate taxation, it was simply about "AI alignment" being more about the creators, than the entire species.
[0] https://en.wikipedia.org/wiki/Category:Corporate_tax_avoidan...
("There was an old lady who swallowed a fly, …")
Each of those proxies can have an alignment failure with the adjacent level(s).
And RLHF involves training one AI to learn human preferences, as a proxy for what "good" is, in order to be the reward function that trains the actual LLM (or other model, but I've only heard of RLHF being used to train LLMs)
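The reward-model step can be sketched as a pairwise (Bradley-Terry) preference loss. A toy numpy version with made-up scores, not OpenAI's actual implementation:

```python
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Push the reward of the human-preferred response above the rejected
    one: mean of -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))

# Toy reward-model scores for three preference pairs (invented numbers):
chosen = np.array([2.0, 1.5, 0.3])
rejected = np.array([0.5, 1.0, 0.8])
print(preference_loss(chosen, rejected))
```

The trained reward model then scores LLM outputs during RL fine-tuning, standing in for the human raters.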
Do people actually fall for these lol? Yes they do and it works to raise interest and get additional funding.
Morals and ethics are different and I would not want the US to be "protecting the world" with their ridiculous ethics and morals.
I would imagine defense contractors can cut deals for similar preferential treatment with OAI and the like to be exempt from potentially copyright-infringing uses of their API.
[0]https://fortune.com/asia/2024/07/03/pentagon-huawei-ban-nati...
I can’t see a way to do this currently, you just get a prompt.
This, I think, is the most powerful way to use the new image model since it actually understands the input image and can make a new one based on it.
Eg you can give it a person sitting at a desk and it can make one of them standing up. Or from another angle. Or on the moon.
Text input tokens (prompt text): $5 per 1M tokens
Image input tokens (input images): $10 per 1M tokens
Image output tokens (generated images): $40 per 1M tokens
In practice, this translates to roughly $0.02, $0.07, and $0.19 per generated image for low, medium, and high-quality square images, respectively.
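Plugging the posted rates into a quick back-of-the-envelope calculator (the per-tier output token counts below are assumptions for illustration, not official figures):

```python
# Rates from the pricing above, in dollars per token.
RATE_OUT = 40 / 1_000_000   # image output tokens
RATE_TXT = 5 / 1_000_000    # text input (prompt) tokens

def image_cost(output_tokens: int, prompt_tokens: int = 100) -> float:
    """Approximate cost of one generated image."""
    return output_tokens * RATE_OUT + prompt_tokens * RATE_TXT

# Hypothetical output-token counts per quality tier:
for tier, toks in {"low": 400, "medium": 1600, "high": 4600}.items():
    print(tier, round(image_cost(toks), 3))
```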
that's a bit pricey for a startup.
I remember meeting someone on Discord 1-2 years ago (?) working on a GoDaddy effort to have customer-generated icons using bespoke foundation image-gen models. Suppose that kind of bespoke model at that scale is ripe for replacement by gpt-image-1, given the instruction-following ability / steerability?
As an example, some users (myself included) of a generative image app were trying to make a picture of a person in the pouch of a kangaroo.
No matter what we prompted, we couldn’t get it to work.
GPT-4o did it in one shot!
And you seem to be right, though the only reference I can find is in one of the example images of a whiteboard posted on the announcement[0].
It shows: tokens -> [transformer] -> [diffusion] -> pixels
hjups22 on Reddit[1] describes it as:
> It's a hybrid model. The AR component generates control embeddings that then get decoded by a diffusion model. But the control embeddings are accurate enough to edit and reconstruct the images surprisingly well.
[0]https://openai.com/index/introducing-4o-image-generation/
[1]https://www.reddit.com/r/MachineLearning/comments/1jkt42w/co...
Still, very exciting, and promising for the future. It's still pretty expensive and slow, but moving in the right direction.
> Unlike DALL·E, which operates as a diffusion model, 4o image generation is an autoregressive model natively embedded within ChatGPT.
[0]https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/...
Why would that be more likely? It seems like some implementation of ByteDance's VAR.
Prompt: “a cute dog hugs a cute cat”
https://x.com/terrylurie/status/1915161141489136095
I also then showed a couple of DALL·E 3 images for comparison in a comment
“Auto” is just whatever the best quality is for a model. So in this case it’s the same as “high”.
This prompt is best served by Midjourney, Flux, Stable Diffusion. It'll be far cheaper, and chances are it'll also look a lot better.
The place where gpt-image-1 shines is if you want to do a prompt like:
"a cute dog hugs a cute cat, they're both standing on top of an algebra equation (y=\(2x^{2}-3x-2\)). Use the first reference image I uploaded as a source for the style of the dog. Same breed, same markings. The cat can contrast in fur color. Use the second reference image I uploaded as a guide for the background, but change the lighting to sunset. Also, solve the equation for x."
gpt-image-1 doesn't make the best images, and it isn't cheap, and it isn't fast, but it's incredibly -- almost insanely -- powerful. It feels like ComfyUI got packed up into an LLM and provided as a natural language service.
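For the curious, a multi-reference prompt like that maps onto the images/edits endpoint roughly as follows. A sketch of the multipart form fields; the file names are placeholders, and the `image[]` field follows OpenAI's documented pattern for passing multiple reference images to gpt-image-1:

```python
# Multipart form fields for an edit request with two reference images.
fields = {
    "model": "gpt-image-1",
    "prompt": (
        "A cute dog hugs a cute cat on top of y = 2x^2 - 3x - 2. "
        "First image: style/breed reference for the dog. "
        "Second image: background reference, relit to sunset."
    ),
    "size": "1024x1024",
}
# Each reference image is attached as its own `image[]` part:
files = [
    ("image[]", "dog_style_ref.png"),
    ("image[]", "background_ref.png"),
]
print(fields["model"], len(files))
```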
Does this mean this also does video in some manner?
You might say, "but chatGPT is already as dead simple an interface as you can imagine". And the answer to that is, for specific tasks, no general interface is ever specific enough. So imagine you want to use this to create "headshots" or "linkedin bio photos" from random pictures of yourself. A bespoke interface, with options you haven't even considered already thought through for you, and some quality control/revisions baked into the process, is something someone might pay for.
Is the printer just drop shipping? Do you use a single printer, or is there a printing service that contracts the closer physical shop?
https://github.com/Alasano/gpt-image-1-playground
OpenAI's Playground doesn't expose all the API options.
Mine covers all options, has built in mask creation and cost tracking as well.
Can we talk in discord, please?
You need to verify your identity with a driver's license or passport with OpenAI to have access to certain things like chain of thought summaries in the API and image generation with the new model.
Nothing I can do there, you gotta verify unfortunately.
PS: Does anyone know a good LLM/service to turn images into videos?
However, while being better than my other models, it is not perfect. The image edit API will make a similar-looking picture (even with masking), rather than exactly the same picture with some modifications.
BTW, if you can help me: I've been struggling with WhatsApp Business API for some days to make my app receive webhooks. It receives the GET verification request but when I send a message to the number I never get the POST. Have you had this problem?
But just gave a try to Gupshup and it's looking good, thanks for the recommendation!
- N by N sprite sheets
- Isometric sprite sheets
Basically anything that I can directly drop into my little game engine.
Maybe this one.
What is the game?
https://wiki.ultimacodex.com/images/0/0d/Ultima_4_-_Tiles.pn...
To pick an example, we have a model parameter and a response_format parameter. The response_format parameter selects whether image data should be returned as a URL (old method) or directly, base64-encoded. The new model only supports base64, whereas the old models default to a URL return, which is fine and understandable.
But the endpoint refuses to accept any value for response_format including b64_json with the new model, so you can't set-and-forget the new behaviour and allow the model to be parameterised without worrying about it. Instead, you have to request the new behaviour with the older models, and not request it (but still get it) with the new one. sigh
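A defensive workaround is to gate the parameter on the model name. A sketch in Python, using the model names above; treat it as illustrative, not official guidance:

```python
# Only send `response_format` for the DALL·E models: gpt-image-1 rejects
# the parameter and always returns base64 anyway.
def build_request(model: str, prompt: str) -> dict:
    req = {"model": model, "prompt": prompt}
    if model.startswith("dall-e"):
        req["response_format"] = "b64_json"  # older models default to URLs
    return req

print(build_request("gpt-image-1", "a red square"))
print(build_request("dall-e-3", "a red square"))
```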
The website is barely responding today, and the Desktop client always has massively degraded performance. Really annoying having their desire for user growth killing the experience for those of us who are financing it.
    let imageId = api.generateImage(prompt)
    let {url, isFinished} = api.imageInfo(imageId)

But instead it's:

    let bytes = api.generateImage(prompt)
It's interesting to me how AI APIs let you hold such a persistent, active connection. I'm so used to anything that takes more than a second becoming an async background process where you notify the recipient when it's ready.

With Netflix, it makes sense that you can open a connection to some static content and receive gigabytes over it.
But streaming tokens from a GPU is a much more active process. Especially in this case where you're waiting tens of seconds for an image to generate.
That’s why when you generate an image in chatgpt nowadays, it will start displaying in full resolution from the top pixel row and start loading towards the bottom.
Wouldn't have expected that from an honest player.
It looks like I will not be able to get any prepaid money back [0] so I will be careful not to put any further money on it.
I guess I better start using some of the more expensive APIs to make it worth the $20 I prepaid.
[0] https://openai.com/policies/service-credit-terms/
4. "All sales of Services, including sales of prepaid Services, are final. Service Credits are not refundable and expire one year after the date of purchase or issuance if not used, unless otherwise specified at the time of purchase."
minimaxir•2mo ago
Prompting the model is also substantially different and more difficult than with traditional models, unsurprisingly given the way the model works. The traditional image tricks don't work out of the box, and I'm struggling to get something that works without significant prompt augmentation (which is what I suspect was used for the ChatGPT image generations)
tough•2mo ago
maybe OpenAI thinks model business is over and they need to start sherlocking all the way from the top to final apps (thus their interest in buying out Cursor, finally ending up with Windsurf)
Idk, this feels like a new offering between a full raw API and a final product, where you abstract some of it for a few cents, and they're basically bundling their SOTA LLMs with their image models for extra margin
vineyardmike•2mo ago
In case you didn’t know, it’s not just wrapping in an LLM. The image model they’re referencing is a model that’s directly integrated into the LLM for functionality. It’s not possible to extract, because the LLM outputs tokens which are part of the image itself.
That said, they’re definitely trying to focus on building products over raw models now. They want to be a consumer subscription instead of commodity model provider.
tough•2mo ago
waiting for some FOSS multi-modal model to come out eventually too
great to see OpenAI expanding into making actual usable products i guess
echelon•2mo ago
- OpenAI is notorious for blocking copyrighted characters. They do prompt keyword scanning, but also run a VLM on the results so you can't "trick" the model.
- Lots of providers block public figures and celebrities.
- Various providers block LGBT imagery, even safe for work prompts. Kling is notorious for this.
- I was on a sales call with someone today who runs a father's advocacy group. I don't know what system he was using, but he said he found it impossible to generate an adult male with a child. In a totally safe for work context.
- Some systems block "PG-13" images of characters that are in bathing suits or scantily clad.
None of this is porn, mind you.
echelon•2mo ago
This is the god model in images right now.
I don't think open source diffusion models can catch up with this. From what I've heard, this model took a huge amount of money to train that not even Black Forest Labs has access to.
thot_experiment•2mo ago
As for LoRAs and fine tuning and open source in general; if you've ever been to civit.ai it should be immediately obvious why those things aren't going away.
AuryGlenz•2mo ago
Sure, I can ghiblify specific images of them on this model, but anything approaching realistic changes their looks. I've also done specific LoRAs for things that may or may not be in their training data, such as specific movies.
simonw•2mo ago
Here's my dog in a pelican costume: https://bsky.app/profile/simonwillison.net/post/3lneuquczzs2...
furyofantares•2mo ago
I suspect what I'll do with the API is iterate at medium quality and then generate a high quality image when I'm done.
Sohcahtoa82•2mo ago
It's actually more than that. It's about 16.7 cents per image.
$0.04/image is the pricing for DALL-E 3.
Sohcahtoa82•2mo ago
They didn't show low/med/high quality, they just said an image was a certain number of tokens with a price per token that led to $0.16/image.
raincole•2mo ago
I think in terms of image generation, ChatGPT is the biggest leap since Stable Diffusion's release. LoRA/ControlNet/Flux are forgettable in comparison.
swyx•2mo ago
what is that?
thegeomaster•2mo ago
Of course, nobody really knows what 4o image generation really is under the hood, but it looks to be like some kind of hybrid system like Transfusion to me. It is much better at prompt adherence than diffusion models, but its output can be clunkier/stylistically incoherent. At times, it also exhibits similar failure modes as diffusion (such as weirdly rotated body parts).
Given how it behaves, I think Gemini 2.0 Flash image generation is probably the same approach but with a smaller parameter count. It's... eerie... how close together these two were released and how similar they appear to be.
raincole•2mo ago
The AI field looks awfully like {OpenAI, Google, The Irrelevant}.
echelon•2mo ago
Now for the bad part: I don't think Black Forest Labs, StabilityAI, MidJourney, or any of the others can compete with this. They probably don't have the money to train something this large and sophisticated. We might be stuck with OpenAI and Google (soon) for providing advanced multimodal image models.
Maybe we'll get lucky and one of the large Chinese tech companies will drop a model with this power. But I doubt it.
This might be the first OpenAI product with an extreme moat.
raincole•2mo ago
Yeah. I'm a tad sad about it. I once thought the SD ecosystem proved open source won when it comes to image gen (a naive idea, I know). It turns out big corps won hard in this regard.
adamhowell•2mo ago
I started Accomplice v1 back in 2021 with this goal in mind and raised some VC money but it was too early.
Now, with these latest imagen-3.0-generate-002 (Gemini) and gpt-image-1 (OpenAI) models – especially this API release from OpenAI – I've been able to resurrect Accomplice as a little side project.
Accomplice v2 (https://accomplice.ai) is just getting started back up again – I honestly decided to rebuild it only a couple weeks ago in preparation for today, once I saw ChatGPT's new image model – but so far there are 1,000s of free-to-download PNGs (and any SVGs that have already been vectorized are free too; it costs a credit to vectorize).
I generate new icons every few minutes from a huge list of "useful icons" I've built. It will be 100% pay-as-you-go. And for a credit, paid users can vectorize any PNGs they like, tweak them using AI, upload their own images to vectorize and download, or create their own icons (with my prompt injections baked in to get you good icon results).
Do multi-modal models make something like this obsolete? I honestly am not sure. In my experience with Accomplice v1, a lot of users didn't know what to do with a blank textarea, so the thinking here is there's value in doing some of the work for them upfront with a large searchable archive. Would love to hear others' thoughts.
But I'm having fun again either way.
vunderba•2mo ago
https://imgur.com/a/BTzbsfh
It definitely captures the style - but any reasonably complicated prompt was beyond it.
echelon•2mo ago
While there's a market need for fast diffusion, that's already been filled and is now a race to the bottom. There's nobody else that can do what OpenAI does with gpt-image-1. This model is a truly programmable graphics workflow engine. And this type of model has so much more value than mere "image generation".
gpt-image-1 replaces ComfyUI, inpainting/outpainting, LoRAs, and in time one could imagine it replaces Adobe Photoshop and nearly all the things people use it for. It's an image manipulation engine, not just a diffusion model. It understands what you want on the first try, and it does a remarkably good job at it.
gpt-image-1 is a graphics design department in a box.
Please don't think of this as a model where you prompt things like "a dog and a cat hugging". This is so much more than that.
vunderba•2mo ago
Can you elaborate? This was not my experience - retesting the prompts that I used for my GenAI image shootout against gpt-image-1 API proved largely similar.
https://genai-showdown.specr.net
Wowfunhappy•2mo ago