However, checking the results, my personal overall winner, if I had to pick only ONE, would probably be
deepseek/deepseek-chat-v3-0324
which is a good compromise between fast, cheap and good :-) Only for specific tasks (write a poem...) would I prefer a thinking model.

I tried signing up for OpenAI; way too much friction. They start asking for payment before you've even used any free credits. Guess what, that's one sure way to lose business.
Same for Claude: I couldn't even get Claude through Vertex, as it's available only in limited regions and I'm in Asia Pacific right now.
While this is true, you can download the OpenAI open source model and run it in Ollama.
The thinking is a little slow, but the results have been exceptional vs other local models.
My current favorite to run on my machine is OpenAI's gpt-oss-20b because it only uses 11GB of RAM and it's designed to run at that quantization size.
I also really like playing with the Qwen 3 family at various sizes and I'm fond of Mistral Small 3.2 as a vision LLM that works well.
If I have an internet connection I'll use GPT-5 or Claude 4 or Gemini 2.5 instead - they're better and they don't need me to dedicate a quarter of my RAM or run down my battery.
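For anyone who wants to try it, a minimal sketch assuming the ollama Python package and that the model has already been pulled with `ollama pull gpt-oss:20b`:

    # Minimal sketch; assumes the Ollama server is up and the
    # gpt-oss:20b model has been pulled.
    import ollama

    resp = ollama.chat(
        model="gpt-oss:20b",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp["message"]["content"])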
I find this the most surprising. I have yet to cross the 50% threshold from bullshit to possible truth, in any kind of topic I use LLMs for.
Once you've done that, your success rate goes way up.
Although that changed this year with o3 (and now GPT-5) getting really good at using Bing for search: https://simonwillison.net/2025/Apr/21/ai-assisted-search/
I’m making a tool to analyse financial transactions for accountants and identify things like misallocated expenses. Initially I was getting an LLM to try to analyse hundreds of transactions in one go. It was correct roughly 40-50% of the time, inconsistent, and it hallucinated frequently.
I changed the method to a simple yes/no question and to analysing each transaction individually. Now it is correct 85% of the time and very consistent.
Same model, same question essentially but a different way of asking it.
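A rough sketch of that per-transaction framing, assuming the openai Python client; the model, prompt wording, and helper name are illustrative, not the actual tool:

    # One narrow yes/no question per transaction instead of hundreds
    # at once; model and prompt are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()

    def is_misallocated(transaction: str, category: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": (f"Transaction: {transaction}\n"
                                   f"Allocated to: {category}\n"
                                   "Is this expense misallocated? "
                                   "Answer yes or no.")}],
            temperature=0,  # favour consistency over variety
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")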
I’m just not so sure it’s black and white. At least in my experience it hasn’t been.
But have it create a script or program in any language you want to do the same, and I'm 99% sure it'll get it right the first time.
People use LLMs like graphing calculators; they're not. But you can have one MAKE a calculator, and it'll get it right.
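The point, in code: asked "what's 1000 at 5% compounded for 10 years?" a model may flub the arithmetic, but asked to write the calculator it usually nails it. A hypothetical example of the kind of program it produces:

    # Hypothetical output: the model writes the calculator instead of
    # doing the arithmetic itself.
    def compound_interest(principal: float, rate: float, years: int) -> float:
        return principal * (1 + rate) ** years

    print(round(compound_interest(1000, 0.05, 10), 2))  # 1628.89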
Can you put your intuition into words so we can learn from you?
The most recent technical example I can remember (and now would be a good time to have the actual prompt) was when I asked whether MySQL has a way to run UPDATE without waiting for locks, basically ignoring rows that are locked. It (Sonnet 4, IIRC) answered "of course" and gave me an invalid query of the form `UPDATE ... SKIP LOCKED;`
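For the record, MySQL 8.0 does support SKIP LOCKED, but only on locking reads, not on UPDATE itself. The usual valid pattern is a locking SELECT followed by an UPDATE on the rows it returned; a sketch with hypothetical table and column names:

    -- SKIP LOCKED is only valid on SELECT ... FOR UPDATE / FOR SHARE.
    START TRANSACTION;
    SELECT id FROM jobs
        WHERE status = 'pending'
        LIMIT 10
        FOR UPDATE SKIP LOCKED;
    -- then update only the ids that SELECT returned:
    UPDATE jobs SET status = 'claimed' WHERE id IN (/* ids from above */);
    COMMIT;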
I can't imagine what damage this does if people are using it for questions they don't/can't verify. Programming is relatively safe in this regard.
But as I noted in my other reply, there will be a bias on my side, as I probably disregard questions that I know how to easily find answers to. That's not something I'd applaud AI for.
This is surely the greatest weakness of current LLMs for any task needing a spark of creativity.
Surely it is a question of prompting with some context (in UI mode), or with the additional kicker of temperature (if using the API)?
At the very least, some setup prompt such as "Give me 5 scenarios for a text adventure game" would break the sameness?
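For the API route, a minimal sketch of the temperature kicker, assuming the openai Python client (the model name is just a placeholder):

    # Higher temperature -> more varied sampling; run it a few times
    # and compare the outputs.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    for _ in range(3):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": "Give me 5 scenarios for a text adventure game"}],
            temperature=1.2,
        )
        print(resp.choices[0].message.content)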
There have always been theories that OpenAI and other LLM providers cache some responses - this could be one hypothesis.
There must be something strange going on (most likely training on each others' wrong outputs, but I dunno)
Oh, and an interesting finding: Kagi's selector indicates that they're offering Deepseek Chat v3.1's non-reasoning version, but when I ran it without web search it appeared to mess up and output some of its chain of thought, so it clearly is thinking.
Add in multimodality and 1M context, and it is such a Swiss army knife.
It is cheap and performant enough to run 100k queries (it took a bit over a day and cost around 30 euros for a major document classification task). Yes, in theory this could have been done with a fine-tuned BERT or maybe even with some older methods, but it saved way too much time.
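A rough sketch of that kind of job, assuming the google-genai Python client; the labels, prompt, and model tag are illustrative, not the actual task:

    # Illustrative per-document classification loop; labels, prompt,
    # and model tag are assumptions, not the commenter's actual job.
    from google import genai

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    LABELS = ["invoice", "contract", "report", "other"]

    def classify(doc_text: str) -> str:
        prompt = (f"Classify this document as one of {LABELS}. "
                  f"Answer with the label only.\n\n{doc_text}")
        resp = client.models.generate_content(
            model="gemini-2.5-flash", contents=prompt)
        return resp.text.strip()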
There is another factor that may explain why Flash is #1 in most categories on OpenRouter - Flash has gotten reasonably decent at less common human languages.
Most cheap models (including Flash Lite) and local models have English-focused training.
> Grok I forgot about until it was too late.
I was surprised by how much I prefer Grok to others. Even its persona is how I prefer it: detailed without volunteering unwanted information or sycophancy. In general I'd use Grok 3 more than 4; it's good enough for common uses.
I suspect that Claude would be best only if I gave it a long, complex task with enough instructions up front, so it could grind away on it while I was doing something else and not waiting on it.
The job was set on Friday and ready on Monday. On average it was about 5k tokens in (documents ranging from 1k to 200k tokens in size) and only about 10 tokens out.
Average response was about 1.5 seconds, so ~40 hours for the full set.
I really did some heavy prompt testing to limit output.
Even then, every few thousand queries you'd get some doubled responses. That is, Gemini would respond in duplicate, e.g. "Daisy Daisy".
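A hypothetical guard for that quirk: detect when the output is just itself repeated, and retry.

    # Hypothetical retry guard for the doubled-output quirk ("Daisy
    # Daisy"); ask_model() is a stand-in for the actual query helper.
    def ask_model(query: str) -> str:
        ...  # one Gemini Flash call, ~10 output tokens

    def looks_duplicated(answer: str) -> bool:
        words = answer.split()
        half = len(words) // 2
        return (len(words) >= 2 and len(words) % 2 == 0
                and words[:half] == words[half:])

    def ask_with_retry(query: str, max_tries: int = 3) -> str:
        answer = ask_model(query)
        for _ in range(max_tries - 1):
            if not looks_duplicated(answer):
                break
            answer = ask_model(query)
        return answer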
I have always juggled a multitude of API keys, and this article has almost convinced me to try OpenRouter. It is easy enough to purchase compute credit from Chinese companies, but I could save myself some time.
The thing I liked most in the analysis was the emphasis on speed and cost.
The speed and cost issue is important. I recently read that many AI startups in the USA are favoring faster and cheaper models from China (China is quietly overtaking America with open AI models - iX Broker: https://ixbroker.com/blog/china-is-quietly-overtaking-americ...). The Economist has a similar article but it is paywalled.
This is not entirely true. It was entirely true for o3, but for GPT-5 it is only true for streaming and for reasoning summaries. I use GPT-5 with reasoning through the API without verifying my identity, I just don’t get the summary of the reasoning step. I don’t miss it, either. I never read them anymore now that the surprise has worn off.
I cannot use it directly in OpenAI's API, even today:
> $ llm -m gpt-5-mini Hello
> Error: Error code: 400 - {'error': {'message': 'Your organization must be verified to stream this model. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.', 'type': 'invalid_request_error', 'param': 'stream', 'code': 'unsupported_value'}}
Until two days ago, OpenAI on OpenRouter was Bring-Your-Own-Key (BYOK), so it didn't work there either.
Since they dropped BYOK, I can indeed use it through OpenRouter:
> $ llm -m openrouter/openai/gpt-5-mini Hello
> Hi — hello! How can I help you today?
This is for both `gpt-5` and `gpt-5-mini`. The `-nano` has always worked. I'm going to try some of my evals on gpt-5-mini, but it doesn't feel like I can depend on it.
We use Glowroot (an open source Java APM). I was trying to compile it on my Mac, and some of the protobuf Maven plugins threw an error. I gave Copilot the entire pom.xml, the specific error, and the versions being used. It sent me on a complete wild goose chase and hallucinated like crazy, even suggesting upgrades to versions that do not exist or recommending parameters that have no use in the plugin.
Long story short, I just went to the GitHub issues page of the Maven plugin and searched, and someone had posted a solution. Again, the solution wasn't new; it was suggested around the time Apple started using ARM for their laptops. It was there on GitHub, and yet Copilot hallucinated.
So I don't feel too confident about coding assistants. Yes, they do a decent enough job getting your boilerplate done, but they're hopeless at resolving specific issues.
giancarlostoro•5mo ago
There are other sites similar to Perplexity that host multiple models as well. I have not tried the plethora of others; I feel like Perplexity does the most to make sure whatever model you pick works right for you, and all its output is usefully catalogued.
mark_l_watson•5mo ago
That said, I have been having too much fun running Meilisearch to build a local search index of many web sites that I use for reference, combined with a small Python app that also uses local models running on Ollama. I will probably wrap this into an example to add to one of my books: not that practical, but fun.
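The shape of that combination, as a minimal sketch; the index name, query, and model tag are assumptions, and it presumes Meilisearch and Ollama are already running locally:

    # Minimal sketch: a local Meilisearch index feeding a local Ollama
    # model. Index name, query, and model tag are assumptions.
    import meilisearch
    import ollama

    search = meilisearch.Client("http://localhost:7700")
    hits = search.index("reference").search("vector embeddings")["hits"]
    context = "\n\n".join(h.get("content", "") for h in hits[:3])

    resp = ollama.chat(
        model="qwen3:4b",  # any small local model works here
        messages=[{"role": "user",
                   "content": f"Using these notes:\n{context}\n\n"
                              "Summarize what they say about vector embeddings."}],
    )
    print(resp["message"]["content"])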
giancarlostoro•5mo ago
I've had this idea revolving around how I would make a search engine that's more useful to developers, but not enough time to work on it.