You have to be pretty darn sure that your job is going to do exactly what you want if you're going to wait 24 hours for a response. It's like going back to the punched-card era. If I could get even 1% of the batch back quickly and then the rest more slowly, that would make a big difference.
I wonder why they do that and who is actually getting value out of these batch APIs.
Thanks for sharing your experience!
I’m a developer working on a product that lets users upload content. The upload is not time sensitive. We pass the content through a review pipeline, where we do moderation, analysis, and some business-specific checks that the user uploaded relevant content. We’re migrating some of that to an LLM-based approach because (in testing) the results are just as good, and tweaking a prompt is easier than updating code. We’ll probably use a batch API for this and accept that content can take 24 hours to be audited.
The other part that I think makes batch LLM inference unique is that delivery time is not deterministic. That's where I think the parent's point applies: at least some of the data should be available earlier, even if the rest only arrives within 24h.
Here's an example:
If you are a TV broadcaster and you want to summarize and annotate the content generated in the past 12 hours you most probably need to have access to the summaries of the previous 12 hours too.
Now if you submit a batch job for the first 12 hours of content, you might end up in a situation where you want to process the next batch but the previous one is not delivered yet.
And imo that's fine as long as you somehow know up front that it will take more than 12h to complete; but you don't: it might be delivered to you in 1h or in 23h.
That's the part of these batch APIs that I find hard to use in a production environment outside of one-off jobs.
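To make that concrete, here's a rough sketch of the scheduling headache (the batch client calls are hypothetical placeholders, not any particular vendor's API; the 12h window is the broadcaster example above):

```python
import time

WINDOW_SECONDS = 12 * 60 * 60  # the 12h content window from the example
POLL_SECONDS = 5 * 60

def submit_batch(requests):
    """Hypothetical: submit a batch job to some batch API, return a job id."""
    ...

def get_batch_state(job_id):
    """Hypothetical: return 'running', 'succeeded' or 'failed'."""
    ...

def fetch_results(job_id):
    """Hypothetical: download the finished batch output."""
    ...

def process_window(requests, previous_job_id):
    # Each window's prompts depend on the previous window's summaries, but the
    # batch API only promises completion "within 24h", so this may block for
    # anywhere from minutes to nearly a full day before the next job can go out.
    while previous_job_id is not None and get_batch_state(previous_job_id) == "running":
        time.sleep(POLL_SECONDS)

    context = fetch_results(previous_job_id) if previous_job_id else None
    enriched = [{"prompt": r, "context": context} for r in requests]
    return submit_batch(enriched)
```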
I used the batch API extensively for my side project, where I wanted to ingest a large amount of images, extract descriptions, and create tags for searching. After you get the right prompt, and the output is good, you can just use the Batch API for your pipeline. For any non-time-sensitive operations, it is excellent.
Maybe I'm just thinking too much in data engineering terms here.
You do have that guarantee: it will complete within 24 hours. So don't submit requests you need in 10 hours.
You can do this, just send 1% using the regular API.
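A rough sketch of that split, assuming the google-genai Python SDK and Gemini's batch flow (the model name and workload here are just placeholders):

```python
import random
from google import genai  # assumes the google-genai SDK (pip install google-genai)

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

prompts = [f"Summarize document {i}" for i in range(1000)]  # stand-in workload
random.shuffle(prompts)
cut = max(1, len(prompts) // 100)  # roughly 1% of the work

# The urgent slice goes through the normal synchronous API at full price.
for prompt in prompts[:cut]:
    resp = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
    print(resp.text)

# Everything else gets written out for the batch API at the 50% discount
# (the JSONL format and job submission are covered further down the thread).
deferred = prompts[cut:]
```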
So, if anybody else is frustrated and not finding anything online about this, here are a few things I learned, specifically for structured output generation (which is a main use case for batching) - the individual request JSON should resolve to this:
```json { "request": { "contents": [ { "parts": [ { "text": "Give me the main output please" } ] } ], "system_instruction": { "parts": [ { "text": "You are a main output maker." } ] }, "generation_config": { "response_mime_type": "application/json", "response_json_schema": { "type": "object", "properties": { "output1": { "type": "string" }, "output2": { "type": "string" } }, "required": [ "output1", "output2" ] } } }, "metadata": { "key": "my_id" } } ```
To get actual structured output, don't just set `generation_config.response_schema`: you need to include the mime type, and the key has to be `response_json_schema`. Any other combination will either throw opaque errors or won't trigger Structured Output (and the responses will contain the usual LLM intros "I'm happy to do this for you...").
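For reference, a small sketch that writes lines in exactly that shape (the schema and ids are just the placeholders from the example above):

```python
import json

def make_request_line(request_id: str, user_text: str, schema: dict) -> str:
    """Build one JSONL line matching the shape above; structured output only
    triggers with both response_mime_type and response_json_schema set."""
    body = {
        "request": {
            "contents": [{"parts": [{"text": user_text}]}],
            "system_instruction": {"parts": [{"text": "You are a main output maker."}]},
            "generation_config": {
                "response_mime_type": "application/json",
                "response_json_schema": schema,
            },
        },
        "metadata": {"key": request_id},
    }
    return json.dumps(body)

schema = {
    "type": "object",
    "properties": {"output1": {"type": "string"}, "output2": {"type": "string"}},
    "required": ["output1", "output2"],
}

with open("batch-input.jsonl", "w") as f:
    f.write(make_request_line("my_id", "Give me the main output please", schema) + "\n")
```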
So you upload a .jsonl file with the above JSON, and then you try to submit it for a batch job. If something is wrong with your file, you'll get a "400" and no other info. If something is wrong with the request submission you'll get a 400 with "Invalid JSON payload received. Unknown name \"file_name\" at 'batch.input_config.requests': Cannot find field."
I got the above error endless times when trying their exact sample code:

```
BATCH_INPUT_FILE='files/123456' # File ID

curl https://generativelanguage.googleapis.com/v1beta/models/gemi... \
  -X POST \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type:application/json" \
  -d "{
        'batch': {
          'display_name': 'my-batch-requests',
          'input_config': {
            'requests': {
              'file_name': ${BATCH_INPUT_FILE}
            }
          }
        }
      }"
```
Finally got the job submission working via the python api (`file_batch_job = client.batches.create()`), but remember, if something is wrong with the file you're submitting, they won't tell you what, or how.
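Roughly, the flow via the Python SDK looks like the sketch below (assuming google-genai; the upload config fields are my best guess from Google's samples, so double-check them):

```python
from google import genai

client = genai.Client()  # GEMINI_API_KEY from the environment

# Upload the JSONL first; the File API hands back a name like "files/123456".
uploaded = client.files.upload(
    file="batch-input.jsonl",
    config={"display_name": "my-batch-requests", "mime_type": "jsonl"},
)

# Then create the batch job from the uploaded file.
file_batch_job = client.batches.create(
    model="gemini-2.5-flash",
    src=uploaded.name,
    config={"display_name": "my-batch-job"},
)
print(file_batch_job.name)  # poll this job name later to fetch the results
```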
Thanks for your post, I've stumbled upon the same issue as you.
So I should interpret the "Unknown name \"file_name\" at 'batch.input_config.requests'" as an error with the jsonl file and not the payload itself?
I'm trying to submit a batch with a .jsonl file, but I'm always getting the "Unknown name \"file_name\" at 'batch.input_config.requests'" error.
In a way this is saying that there are some GPUs just sitting around so they would rather get 50% than nothing for their use.
To handle peak daily load you need capacity that goes unused in offpeak hours.
It's cheaper because it's a different market with different needs, which can be served by systems optimizing for throughput instead of latency. Feels like you're looking for something that's not there.
[1]: http://web.archive.org/web/20240517173258/https://cloud.goog..., "By default Google caches a customer's inputs and outputs for Gemini models to accelerate responses to subsequent prompts from the customer. Cached contents are stored for up to 24 hours."
Sounds like a great option to have available? Not every task I use LLMs for needs immediate responses, and if I weren't using local models for those things, getting a 50% discount and having to wait a day sounds like a fine tradeoff.
Reading your comment history: are you an LLM?
https://discuss.ai.google.dev/t/gemini-2-5-pro-with-empty-re...
Edit: Anthropic also stacks batching and caching discounts
tripplyons•7mo ago
It's nice to see competition in this space. AI is getting cheaper and cheaper!
fantispug•7mo ago
I guess it lets them better utilise their hardware in quiet times throughout the day. It's interesting that they all picked a 50% discount.
bayesianbot•7mo ago
https://api-docs.deepseek.com/quick_start/pricing
Workaccount2•7mo ago
2.5 flash non-thinking doesn't exist anymore. People call it a price increase but it's just confusion about what Google did.