What's the current state of the art for API generation of an image from a reference plus modifier prompt?
Say, in the 1¢ per HD (1920×1080) image range?
The interesting part of this GPT-4o API is that it doesn't need to learn the reference first. But given the cost of `high` quality image generation, it's much cheaper to train a LoRA for Flux 1.1 Pro and generate from that.
Anyone know of an AI model for generating SVG images? Please share.
The first paper linked in the prior comment is the latest one from the SVGRender group, but I'm not sure if any runnable model weights are out yet for it (SVGFusion).
One note with these: most of the production ones are actually diffusion models that get run through an image->SVG model afterwards. The issue with this is that the layers aren't set up semantically like you'd expect if you were crafting these by hand, or if you were directly generating SVGs. The results work, but they aren't perfect.
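If anyone wants to play with that second stage, here's a minimal sketch of the raster-to-SVG tracing step, assuming the vtracer Python bindings (file names are placeholders; the layering caveat above still applies, since the output paths follow color regions rather than meaning):

```python
# Minimal sketch: trace a diffusion model's raster output into an SVG.
# pip install vtracer
import vtracer

vtracer.convert_image_to_svg_py(
    "diffusion_output.png",  # raster image from the diffusion model
    "traced.svg",            # vector output; paths are grouped by color
    colormode="color",       # region, not by semantic object/layer
)
```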
PermissionDeniedError: Error code: 403 - {'error': {'message': 'To access gpt-image-1, please complete organization verification
So the options are: 1) nerf the model so it can't produce images like that, or 2) use some type of KYC verification.
Upload a picture of a friend -> OK. Upload my own picture -> I can't generate anything involving real people.
Also, after they enabled global chat memory, I started seeing my other chats leaking into the images as literal text. I've disabled it since.
EDIT: Oh, yes, that's what it appears to be. Is it better? Why would I switch?
Enhance headshots for putting on LinkedIn.
- specificity (a diagram that perfectly encapsulates the exact set of concepts you're asking about)
A pro/con of the multimodal image generation approach (with an actually good text encoder) is that it rewards intense prompt engineering more so than other approaches, and if there is a use case that can generate more than $0.17/image in revenue, that's positive marginal profit.
Today I'm discovering there is a tier of API access with virtually no content moderation available to companies working in that space. I have no idea how to go about requesting that tier of access, but have spoken to 4 different defense contractors in the last day who seem to already be using it.
(FWIW, for anyone curious how to implement it: it's the 'moderation' parameter in the JSON request you'll send. I missed it for a few hours because it wasn't in DALL·E 3.)
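A minimal sketch with the Python SDK; the prompt and output filename are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",
    prompt="a watercolor lighthouse at dawn",  # placeholder prompt
    moderation="low",  # the parameter in question; "auto" is the default
    size="1024x1024",
)

# gpt-image-1 returns base64-encoded image data rather than a URL
with open("out.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```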
I just took any indication that the parent post meant absolutely zero moderation as them being a bit loose with their words and excitable in how they understand things. There were some signs:
1. It's unlikely they completed an API integration quickly enough to have an opinion on military/defense image generation moderation yesterday, so they're almost certainly speaking about ChatGPT. (This is additionally confirmed by image generation requiring tier 5 anyway, which they would have been aware of had they integrated.)
2. The military/defense use cases for image generation are not provided (and the steelmanned versions in other comments are nonsensical, i.e. we can quickly validate that you can still generate kanban boards or wireframes of ships).
3. The poster passively disclaims being in military/defense themselves (grep "in that space").
4. It is hard to envision cases of #2 that do not require universal moderation for OpenAI's sake. I.e., let's say their thought process is along the lines of: defense/military ~= what I think of as CIA ~= black ops ~= image manipulation on social media; thus, the time I said "please edit this photo of the ayatollah to have him eating pig and say I hate allah" means it's overmoderated for defense use cases.
5. It's unlikely OpenAI wants to be anywhere near the PR resulting from #4. Assuming there is a super secret defense tier that allows this, it's at the very least unlikely that the poster's defense contractor friends were blabbing about the exclusive completely unmoderated access they had, to the poster, within hours of release. They're pretty serious about that secrecy stuff!
6. It is unlikely the lack of ability to generate images using GPT Image 1 would drive the military to Chinese models (there aren't Chinese LLMs that do this! And even if there were, there are plenty of good ol' American diffusion models!)
OP was clearly implying there is some greater ability only granted to extra special organizations like the military.
With all possible respect to OP, I find this all very hard to believe without additional evidence. If nothing else, I don't really see a military application of this API (specifically, not AI in general). I'm sure it would help them create slide decks and such, but you don't need extra special zero moderation for that.
I can't provide additional evidence (it's defense, duh), but the #1 use I've seen is generating images for computer vision training, mostly to feed GOFAI algorithms that have already been validated for target acquisition. Image gen algorithms have a pretty good idea of what a T72 tank and different camouflage look like, and they're much better at generating unique photos combining the two. It's actually a great use of the technology, because hallucinations help improve the training data (i.e. the final targeting should be invariant to a T72 tank with a machine gun on the wrong side or with too many turrets, etc.)
That said, due to compartmentalization, I don't know the extent to which image gen is used in defense, just my little sliver of it.
I kid; more real-world use cases would be concept images for a new product or for marketing campaigns.
Then include this image in the dataset of another net, labeled "civilian". Train that new neural net so that it has a lower false-positive rate when asked "is this target military?"
The real answer is probably way, way more mundane - generating images for marketing, etc.
Now I'm just wondering what the hell defense contractors need image generation for that isn't obviously horrifying...
“Please move them to some desert, not the empire state building.”
“The civilians are supposed to have turbans, not ballcaps.”
The bias leans towards overfitting the data, which is fine in some use cases - such as missile or drone design, which doesn't need broad comparisons like 747s or artillery to complete its training.
Kind of like neural net backpropagation, but in terms of the model/weights.
GCCH is typically 6-12 months behind in feature set.
I can't see a way to do this currently; you just get a prompt.
This, I think, is the most powerful way to use the new image model since it actually understands the input image and can make a new one based on it.
E.g. you can give it a person sitting at a desk and it can make one of them standing up. Or from another angle. Or on the moon.
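Via the API, this goes through the edits endpoint, which takes one or more reference images plus an instruction. A rough sketch; the file names and prompt are illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()

# The edits endpoint conditions generation on an uploaded reference image.
with open("person_at_desk.png", "rb") as ref:
    result = client.images.edit(
        model="gpt-image-1",
        image=ref,
        prompt="the same person, standing up, seen from another angle",
    )

with open("person_standing.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```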
Text input tokens (prompt text): $5 per 1M tokens
Image input tokens (input images): $10 per 1M tokens
Image output tokens (generated images): $40 per 1M tokens
In practice, this translates to roughly $0.02, $0.07, and $0.19 per generated image for low, medium, and high-quality square images, respectively.
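Back-of-envelope math, using per-image output-token counts that have been floated elsewhere (treat the counts as approximate; prompt and input-image tokens add a bit on top):

```python
# Rough output-token cost per 1024x1024 image at $40 per 1M image output tokens.
PRICE_PER_OUTPUT_TOKEN = 40 / 1_000_000  # dollars

# Approximate output-token counts per quality tier (assumed values).
tokens_by_quality = {"low": 272, "medium": 1056, "high": 4160}

for quality, tokens in tokens_by_quality.items():
    print(f"{quality}: ~${tokens * PRICE_PER_OUTPUT_TOKEN:.3f} per image")
# "high" comes out around $0.166, consistent with the ~16.7 cents
# mentioned downthread.
```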
That's a bit pricey for a startup.
I remember meeting someone on Discord 1-2 years ago (?) who was working on a GoDaddy effort to offer customer-generated icons using bespoke foundation image gen models. Suppose that kind of bespoke model at that scale is ripe for replacement by gpt-image-1, given the instruction-following ability / steerability?
As an example, some users (myself included) of a generative image app were trying to make a picture of a person in the pouch of a kangaroo.
No matter what we prompted, we couldn’t get it to work.
GPT-4o did it in one shot!
And you seem to be right, though the only reference I can find is in one of the example images of a whiteboard posted on the announcement[0].
It shows: tokens -> [transformer] -> [diffusion] -> pixels
hjups22 on Reddit[1] describes it as:
> It's a hybrid model. The AR component generates control embeddings that then get decoded by a diffusion model. But the control embeddings are accurate enough to edit and reconstruct the images surprisingly well.
[0]https://openai.com/index/introducing-4o-image-generation/
[1]https://www.reddit.com/r/MachineLearning/comments/1jkt42w/co...
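In pseudocode, that hybrid might look something like the sketch below. This is pure speculation extrapolated from the quote; every name in it is invented:

```python
# Speculative sketch of the hybrid AR + diffusion design described above:
# a transformer autoregressively emits "control embeddings", and a
# diffusion decoder renders the embedding sequence into pixels.
def generate_image(prompt_tokens, transformer, diffusion_decoder, n_image_tokens):
    state = transformer.encode(prompt_tokens)
    embeddings = []
    for _ in range(n_image_tokens):
        emb, state = transformer.step(state)  # one AR step per image token
        embeddings.append(emb)
    return diffusion_decoder.decode(embeddings)  # embeddings -> pixels
```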
> Unlike DALL·E, which operates as a diffusion model, 4o image generation is an autoregressive model natively embedded within ChatGPT.
[0]https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/...
Why would that be more likely? It seems like some implementation of ByteDance's VAR.
Prompt: “a cute dog hugs a cute cat”
https://x.com/terrylurie/status/1915161141489136095
I also then showed a couple of DALL·E 3 images for comparison in a comment.
“Auto” is just whatever the best quality is for a model. So in this case it’s the same as “high”.
This prompt is best served by Midjourney, Flux, Stable Diffusion. It'll be far cheaper, and chances are it'll also look a lot better.
The place where gpt-image-1 shines is if you want to do a prompt like:
"a cute dog hugs a cute cat, they're both standing on top of an algebra equation (y=\(2x^{2}-3x-2\)). Use the first reference image I uploaded as a source for the style of the dog. Same breed, same markings. The cat can contrast in fur color. Use the second reference image I uploaded as a guide for the background, but change the lighting to sunset. Also, solve the equation for x."
gpt-image-1 doesn't make the best images, and it isn't cheap, and it isn't fast, but it's incredibly -- almost insanely -- powerful. It feels like ComfyUI got packed up into an LLM and provided as a natural language service.
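As a sanity check on that last instruction: assuming "solve" means find the roots at y = 0, the quadratic factors cleanly, so a correct render should show:

```latex
2x^2 - 3x - 2 = (2x + 1)(x - 2) = 0
\quad\Longrightarrow\quad x = 2 \ \text{or}\ x = -\tfrac{1}{2}
```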
Does this mean this also does video in some manner?
You might say, "but ChatGPT is already as dead simple an interface as you can imagine." And the answer to that is: for specific tasks, no general interface is ever specific enough. So imagine you want to use this to create "headshots" or "LinkedIn bio photos" from random pictures of yourself. A bespoke interface, with options you haven't even considered already thought through for you, and some quality control/revisions baked into the process, is something someone might pay for.
minimaxir•6h ago
Prompting the model is also substantially different and more difficult than with traditional models, unsurprisingly given the way the model works. The traditional image tricks don't work out of the box, and I'm struggling to get something that works without significant prompt augmentation (which is what I suspect was used for the ChatGPT image generations).
tough•5h ago
Maybe OpenAI thinks the model business is over and they need to start sherlocking all the way from the top down to final apps (thus their interest in buying out Cursor, finally ending up with Windsurf).
Idk, this feels like a new offering between a full raw API and a final product, where you abstract some of it away for a few cents, and they're basically bundling their SOTA LLM models with their image models for extra margin.
vineyardmike•5h ago
In case you didn't know, it's not just an LLM wrapped around a separate image model. The image model they're referencing is directly integrated into the LLM. It's not possible to extract, because the LLM outputs tokens which are part of the image itself.
That said, they’re definitely trying to focus on building products over raw models now. They want to be a consumer subscription instead of commodity model provider.
tough•5h ago
Waiting for some FOSS multi-modal model to come out eventually too.
Great to see OpenAI expanding into making actual usable products, I guess.
echelon•4h ago
- OpenAI is notorious for blocking copyrighted characters. They do prompt keyword scanning, but also run a VLM on the results so you can't "trick" the model.
- Lots of providers block public figures and celebrities.
- Various providers block LGBT imagery, even safe for work prompts. Kling is notorious for this.
- I was on a sales call with someone today who runs a father's advocacy group. I don't know what system he was using, but he said he found it impossible to generate an adult male with a child. In a totally safe for work context.
- Some systems block "PG-13" images of characters that are in bathing suits or scantily clad.
None of this is porn, mind you.
echelon•5h ago
This is the god model in images right now.
I don't think open source diffusion models can catch up with this. From what I've heard, this model took a huge amount of money to train - money that not even Black Forest Labs has access to.
thot_experiment•4h ago
As for LoRAs and fine tuning and open source in general; if you've ever been to civit.ai it should be immediately obvious why those things aren't going away.
simonw•5h ago
Here's my dog in a pelican costume: https://bsky.app/profile/simonwillison.net/post/3lneuquczzs2...
furyofantares•5h ago
I suspect what I'll do with the API is iterate at medium quality and then generate a high quality image when I'm done.
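That workflow is just a matter of flipping the quality parameter between calls; a minimal sketch (the prompt is a placeholder):

```python
from openai import OpenAI

client = OpenAI()
prompt = "an isometric illustration of a tiny workshop"  # placeholder

# Iterate cheaply at medium quality, then re-render the final pick at high.
draft = client.images.generate(model="gpt-image-1", prompt=prompt, quality="medium")
final = client.images.generate(model="gpt-image-1", prompt=prompt, quality="high")
```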
Sohcahtoa82•4h ago
It's actually more than that. It's about 16.7 cents per image.
$0.04/image is the pricing for DALL-E 3.
Sohcahtoa82•3h ago
They didn't show low/med/high quality; they just said an image was a certain number of tokens, with a price per token that led to $0.16/image.
raincole•4h ago
I think in terms of image generation, ChatGPT is the biggest leap since Stable Diffusion's release. LoRA/ControlNet/Flux are forgettable in comparison.
adamhowell•4h ago
I started Accomplice v1 back in 2021 with this goal in mind and raised some VC money but it was too early.
Now, with these latest imagen-3.0-generate-002 (Gemini) and gpt-image-1 (OpenAI) models – especially this API release from OpenAI – I've been able to resurrect Accomplice as a little side project.
Accomplice v2 (https://accomplice.ai) is just getting started back up again - I honestly decided to rebuild it only a couple of weeks ago, in preparation for today, once I saw ChatGPT's new image model - but so far there are thousands of free-to-download PNGs, and any SVGs that have already been vectorized are free too (it costs a credit to vectorize).
I generate new icons every few minutes from a huge list of "useful icons" I've built. It will be 100% pay-as-you-go. And for a credit, paid users can vectorize any PNGs they like, tweak them using AI, upload their own images to vectorize and download, or create their own icons (with my prompt injections baked in to get you good icon results).
Do multi-modal models make something like this obsolete? I honestly am not sure. In my experience with Accomplice v1, a lot of users didn't know what to do with a blank textarea, so the thinking here is there's value in doing some of the work for them upfront with a large searchable archive. Would love to hear others' thoughts.
But I'm having fun again either way.
vunderba•2h ago
https://imgur.com/a/BTzbsfh
It definitely captures the style - but any reasonably complicated prompt was beyond it.
vunderba•4h ago
Can you elaborate? This was not my experience - retesting the prompts that I used for my GenAI image shootout against gpt-image-1 API proved largely similar.
https://genai-showdown.specr.net