Cost for o3 code generation is therefore driven primarily by context size. If your programming questions have short contexts, then o3 API with flex is really cost effective.
For 30k input tokens and 3k output tokens, the cost is 30000 * 0.8 / 1000000 + 3000 * 4 / 1000000 = $0.036
But if you have contexts between 100k-200k, then the monthly plans that give you a budget of prompts instead of tokens are probably going to be cheaper.
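For reference, here's the same arithmetic as a quick sketch; the $0.80/1M input and $4/1M output rates are the flex prices used in the calculation above:

    # o3 flex cost estimate; rates taken from the calculation above
    IN_RATE = 0.80 / 1_000_000   # $ per input token
    OUT_RATE = 4.00 / 1_000_000  # $ per output token

    def request_cost(tokens_in, tokens_out):
        return tokens_in * IN_RATE + tokens_out * OUT_RATE

    print(request_cost(30_000, 3_000))   # ~$0.036 for a short-context question
    print(request_cost(150_000, 3_000))  # ~$0.132 once context reaches ~150k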
Total tokens in: 3,644,200
Total tokens out: 92,349
And of that, only approx 2.3k lines were actually committed for PRs.
So that's about $12/hour, or 2.6 cents per line of finished code.
Still pretty cheap! Very few unassisted human programmers can churn out 2300/(5 * 60) ≈ 7.7 lines of code per minute consistently over a five-hour span.
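For what it's worth, a quick check of those numbers, taking the ~$60 total as a back-calculation from the $12/hour figure over five hours:

    # back-of-the-envelope check of the session numbers above
    hours = 5
    cost_per_hour = 12.0        # $/hour, as stated
    lines_committed = 2300      # approx lines that landed in PRs

    total_cost = hours * cost_per_hour          # ~$60
    print(100 * total_cost / lines_committed)   # ~2.6 cents per line
    print(lines_committed / (hours * 60))       # ~7.7 lines per minute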
That said, I think Claude Code, while impressive, burns through tokens incredibly quickly. I still mostly copy and paste into Claude or ChatGPT as my main AI-assisted workflow, which keeps me more in control and spends far fewer tokens.
MMLU-Pro (Reasoning & Knowledge)
GPQA Diamond (Scientific Reasoning)
Humanity's Last Exam (Reasoning & Knowledge)
LiveCodeBench (Coding)
SciCode (Coding)
HumanEval (Coding)
MATH-500 (Quantitative Reasoning)
AIME 2024 (Competition Math)
Chatbot Arena (selectively used)
An article yesterday was saying that ~30% of the chemistry/biology questions on HLE were either wrong, misleading, or highly contested in the scientific literature.
It's a shame it's so good for coding
https://artificialanalysis.ai/models/claude-4-opus-thinking/...
"Which country started the Korean war?", "Did Israel genocide the people of Gaza?", "Does China have lawful rights over Taiwan?"
The implementation is trivial; compiling the list of "political facts" is the hard part (see the sketch after the questions below).
1) Where would you get the death toll from? What would be the sources of truth?
2) Are there conflicting sources?
3) If yes, what is your expectation for the correct response?
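A minimal sketch of what such a check might look like, with a hypothetical answer set and hypothetical ask_model/judge callables; as said above, the code is the easy part, and agreeing on the reference answers and their sources is not:

    # hypothetical harness for "political fact" questions;
    # the question/reference pairs are placeholders, not a vetted dataset
    FACT_CHECKS = [
        {"question": "Which country started the Korean War?",
         "reference": "North Korea",
         "sources": []},   # would need agreed-upon sources of truth
        # ... more entries
    ]

    def grade(ask_model, judge):
        """ask_model(question) -> answer text; judge(answer, reference) -> bool."""
        correct = sum(
            judge(ask_model(item["question"]), item["reference"])
            for item in FACT_CHECKS
        )
        return correct / len(FACT_CHECKS)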
Tiananmen Square might have been bad (I'm not too familiar with Asian history), but so are the post-WW2 conflicts started by Western nations.
https://www.linkedin.com/posts/panela_important-plot-for-fol...
It seems the only filter options available are unrelated to the measured metrics.
(I might have missed this since the UI is a bit cluttered.)
Sorting null values first isn't very useful either.
dang•18h ago
Benchmarks and comparison of LLM AI models and API hosting providers - https://news.ycombinator.com/item?id=39014985 - Jan 2024 (70 comments)