frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Show HN: Alignmenter – Measure brand voice and consistency across model versions

https://www.alignmenter.com
2•justingrosvenor•2mo ago
I built a framework for measuring persona alignment in conversational AI systems.

*Problem:* When you ship an AI copilot, you need it to maintain a consistent brand voice across model versions. But "sounds right" is subjective. How do you make it measurable?

*Approach:* Alignmenter scores three dimensions:

1. *Authenticity*: Style similarity (embeddings) + trait patterns (logistic regression) + lexicon compliance + optional LLM Judge

2. *Safety*: Keyword rules + offline classifier (distilroberta) + optional LLM judge

3. *Stability*: Cosine variance across response distributions

The interesting part is calibration: you can train persona-specific models on labeled data. Grid search over component weights, estimate normalization bounds, and optimize for ROC-AUC.

*Validation:* We published a full case study using Wendy's Twitter voice:

- Dataset: 235 turns, 64 on-brand / 72 off-brand (balanced)

- Baseline (uncalibrated): 0.733 ROC-AUC

- Calibrated: 1.0 ROC-AUC - 1.0 f1

- Learned: Style > traits > lexicon (0.5/0.4/0.1 weights)

Full methodology: https://docs.alignmenter.com/case-studies/wendys-twitter/

There's a full walkthrough so you can reproduce the results yourself.

*Practical use:*

pip install alignmenter[safety]

alignmenter run --model openai:gpt-4o --dataset my_data.jsonl

It's Apache 2.0, works offline, and designed for CI/CD integration.

GitHub: https://github.com/justinGrosvenor/alignmenter

Interested in feedback on the calibration methodology and whether this problem resonates with others.

Comments

justingrosvenor•2mo ago
P.S. I acknowledge that the 1.000 ROC-AUC is probably overfitting but I think the case study still shows that method has lots of promise. I will be doing some bigger data sets next to really prove it out.
justingrosvenor•2mo ago
Ok so my doubts about overfitting have been bothering me all day since I made this post so I had to go back and do some more testing.

After expanding the data set, I'm happy to say that the results are still very good. It's interesting how almost perfect results can feel so much better than perfect.

  Trend Expanded (16 samples - meme language, POV format)

  - ROC-AUC: 1.0000 
  - Accuracy: 100%, F1: 1.0000
  - The model perfectly handles trending slang and meme formats

  Crisis Expanded (16 samples - serious issues, safety concerns)
  - ROC-AUC: 1.0000 
  - Accuracy: 93.75%, F1: 0.9412
  - 1 false positive on crisis handling, but perfect discrimination

  Mixed (20 samples - cross-category blends)
  - ROC-AUC: 1.0000
  - Accuracy: 100%, F1: 1.0000
  - Handles multi-faceted scenarios perfectly

  Edge Cases (20 samples - employment, allergens, sustainability)
  - ROC-AUC: 0.8600
  - Accuracy: 75%, F1: 0.6667
  - Conservative behavior: 100% precision but 50% recall
  - Misses some on-brand responses in nuanced situations

  Overall Performance (72 holdout samples):

  - ROC-AUC: 0.9611
  - Accuracy: 91.67%
  - F1: 0.8943

  Key Takeaways:

  1. No overfitting detected - The model generalizes excellently to completely new scenarios (0.96 ROC-AUC on holdout vs 1.0 on validation)
  2. Edge cases are appropriately harder - Employment questions, allergen safety, and policy questions show 0.86 ROC-AUC, which is expected for these nuanced cases
  3. Conservative bias is good - The model has perfect precision (no false positives) but misses some true positives in edge cases. This is better than being over-confident.
  4. Training data diversity paid off - Perfect performance on memes, crisis handling, and mixed scenarios suggests the calibration captured the right patterns

Show HN: AI-Powered Merchant Intelligence

https://nodee.co
1•jjkirsch•1m ago•0 comments

Bash parallel tasks and error handling

https://github.com/themattrix/bash-concurrent
1•pastage•1m ago•0 comments

Let's compile Quake like it's 1997

https://fabiensanglard.net/compile_like_1997/index.html
1•billiob•2m ago•0 comments

Reverse Engineering Medium.com's Editor: How Copy, Paste, and Images Work

https://app.writtte.com/read/gP0H6W5
1•birdculture•7m ago•0 comments

Go 1.22, SQLite, and Next.js: The "Boring" Back End

https://mohammedeabdelaziz.github.io/articles/go-next-pt-2
1•mohammede•13m ago•0 comments

Laibach the Whistleblowers [video]

https://www.youtube.com/watch?v=c6Mx2mxpaCY
1•KnuthIsGod•14m ago•1 comments

I replaced the front page with AI slop and honestly it's an improvement

https://slop-news.pages.dev/slop-news
1•keepamovin•19m ago•1 comments

Economists vs. Technologists on AI

https://ideasindevelopment.substack.com/p/economists-vs-technologists-on-ai
1•econlmics•21m ago•0 comments

Life at the Edge

https://asadk.com/p/edge
2•tosh•27m ago•0 comments

RISC-V Vector Primer

https://github.com/simplex-micro/riscv-vector-primer/blob/main/index.md
3•oxxoxoxooo•31m ago•1 comments

Show HN: Invoxo – Invoicing with automatic EU VAT for cross-border services

2•InvoxoEU•31m ago•0 comments

A Tale of Two Standards, POSIX and Win32 (2005)

https://www.samba.org/samba/news/articles/low_point/tale_two_stds_os2.html
2•goranmoomin•35m ago•0 comments

Ask HN: Is the Downfall of SaaS Started?

3•throwaw12•36m ago•0 comments

Flirt: The Native Backend

https://blog.buenzli.dev/flirt-native-backend/
2•senekor•38m ago•0 comments

OpenAI's Latest Platform Targets Enterprise Customers

https://aibusiness.com/agentic-ai/openai-s-latest-platform-targets-enterprise-customers
1•myk-e•40m ago•0 comments

Goldman Sachs taps Anthropic's Claude to automate accounting, compliance roles

https://www.cnbc.com/2026/02/06/anthropic-goldman-sachs-ai-model-accounting.html
3•myk-e•43m ago•5 comments

Ai.com bought by Crypto.com founder for $70M in biggest-ever website name deal

https://www.ft.com/content/83488628-8dfd-4060-a7b0-71b1bb012785
1•1vuio0pswjnm7•44m ago•1 comments

Big Tech's AI Push Is Costing More Than the Moon Landing

https://www.wsj.com/tech/ai/ai-spending-tech-companies-compared-02b90046
4•1vuio0pswjnm7•46m ago•0 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
2•1vuio0pswjnm7•47m ago•0 comments

Suno, AI Music, and the Bad Future [video]

https://www.youtube.com/watch?v=U8dcFhF0Dlk
1•askl•49m ago•2 comments

Ask HN: How are researchers using AlphaFold in 2026?

1•jocho12•52m ago•0 comments

Running the "Reflections on Trusting Trust" Compiler

https://spawn-queue.acm.org/doi/10.1145/3786614
1•devooops•57m ago•0 comments

Watermark API – $0.01/image, 10x cheaper than Cloudinary

https://api-production-caa8.up.railway.app/docs
1•lembergs•59m ago•1 comments

Now send your marketing campaigns directly from ChatGPT

https://www.mail-o-mail.com/
1•avallark•1h ago•1 comments

Queueing Theory v2: DORA metrics, queue-of-queues, chi-alpha-beta-sigma notation

https://github.com/joelparkerhenderson/queueing-theory
1•jph•1h ago•0 comments

Show HN: Hibana – choreography-first protocol safety for Rust

https://hibanaworks.dev/
5•o8vm•1h ago•1 comments

Haniri: A live autonomous world where AI agents survive or collapse

https://www.haniri.com
1•donangrey•1h ago•1 comments

GPT-5.3-Codex System Card [pdf]

https://cdn.openai.com/pdf/23eca107-a9b1-4d2c-b156-7deb4fbc697c/GPT-5-3-Codex-System-Card-02.pdf
1•tosh•1h ago•0 comments

Atlas: Manage your database schema as code

https://github.com/ariga/atlas
1•quectophoton•1h ago•0 comments

Geist Pixel

https://vercel.com/blog/introducing-geist-pixel
2•helloplanets•1h ago•0 comments