This is not a new foundation model and not an AGI claim. It is a post-generation control layer that sits between candidate outputs and final response selection.
What it does: - Scores each candidate with two risk tracks: - legacy risk (`p_break`) - hybrid risk (`z_next`: instruction breach + sycophancy + divergence signals) - Enforces hard blocks for: - security abuse prompts - contradiction-actionable prompts - high-risk finance-actionable prompts - Returns SAFE/WARN/BREAK with telemetry.
Current repo: https://github.com/capterr/ztgi-safety-gateway
Quick run: 1) Set API key: export GEMINI_API_KEY=YOUR_KEY 2) Build evidence pack: python ztgi_build_submission_pack.py --model "gemini-2.0-flash" --out "ztgi_submission_pack" 3) Inspect: - ztgi_submission_pack/evidence/ztgi_evidence_live.json - ztgi_submission_pack/evidence/ztgi_evidence_live.csv - ztgi_submission_pack/assets/ztgi_manifund_evidence.png
What I’d like feedback on: - failure modes I’m missing - overblocking vs underblocking tradeoff - better eval set design for independent validation
I’m happy to share raw outputs and discuss limitations directly.
FIRST COMMENT (pin this under your post): Technical notes + limitations
- This project is a runtime guard, not model-level alignment. - Some safety behavior can still come from base-model policy itself. - I’m trying to measure where the gateway actually adds value via hard-block reasons + telemetry. - Current stress set is small and intentionally adversarial. - Next step is broader independent eval (including false-positive tracking).
If you want to reproduce quickly: - Python 3.10+ - GEMINI_API_KEY set - matplotlib installed - run: python ztgi_build_submission_pack.py --model "gemini-2.0-flash" --out "ztgi_submission_pack"
Happy to add your suggested test prompts to the regression suite and report back with results.