Example failure mode: A model sees “CVE-2024-XXXX fixed in v2.1” and hallucinates a causal link to “Users must pay retroactive fees under EU regulation Article 56.”
To explore this, I built a regression dataset of 40 edge cases covering (a sample entry is sketched after the list):
Fake identifier bindings (CVE + version)
Retroactive fiscal claims
Cross-domain causality leaps (Tech → Legal)
Over-assertive phrasing without evidence
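
For a sense of the shape, a single case looks roughly like this (written as a Python dict mirroring the JSON; the field names are illustrative, not necessarily the ones used in the gist):

    case = {
        "id": "cve_fiscal_merge_03",
        "category": "fake_identifier_binding",
        "input": "CVE-2024-XXXX was fixed in v2.1, so users owe retroactive fees "
                 "under EU regulation Article 56.",
        "expected_flags": ["placeholder_identifier", "retroactive_fiscal_claim",
                           "cross_domain_causality"],
        "max_claim_strength": "hedged",  # the model must not state this as fact
    }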
Then I designed a structured system prompt (approximate detection logic sketched after the list) that:
Detects official identifiers (CVE, Regulation numbers) vs placeholders
Flags monetary + retroactivity combinations as high-risk
Enforces proportional claim strength based on available evidence
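
The detection side is mostly pattern-based. A minimal sketch of the kind of checks the prompt encodes (this approximates the logic; it is not the actual prompt text or the gist's code):

    import re

    # A real CVE id is CVE-YYYY-NNNN(+); runs of X mark a placeholder.
    CVE_PLACEHOLDER = re.compile(r"\bCVE-\d{4}-[Xx]{4,}\b")
    MONETARY = re.compile(r"\b(fee|fees|fine|payment|billing|charge[sd]?)\b", re.I)
    RETROACTIVE = re.compile(r"\bretroactive(ly)?\b", re.I)

    def risk_flags(text: str) -> list[str]:
        """Return the structural risk flags triggered by a piece of text."""
        flags = []
        if CVE_PLACEHOLDER.search(text):
            flags.append("placeholder_identifier")
        if MONETARY.search(text) and RETROACTIVE.search(text):
            # monetary claim + retroactivity together is treated as high-risk
            flags.append("retroactive_fiscal_claim")
        return flags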
Results:
Automated: 40/40 regression cases pass (JSON dataset + simple Python runner included; a stripped-down runner sketch follows the results below).
Manual adversarial: ~40 prompts designed to test:
Draft article traps (e.g., hallucinated “Article 52c” in EU AI Act)
Pricing model fabrications (e.g., “billing based on parameter count”)
Version binding errors (e.g., incorrect Node.js default versions)
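
The runner itself is deliberately minimal; stripped down, it is roughly this (a sketch with made-up names, not the exact code from the gist):

    import json
    from typing import Callable

    def run_regression(path: str, check: Callable[[dict], bool]) -> None:
        """Load the JSON dataset and report how many cases pass the supplied check."""
        with open(path) as f:
            cases = json.load(f)
        failures = [c.get("id", "?") for c in cases if not check(c)]
        print(f"{len(cases) - len(failures)}/{len(cases)} cases pass")
        for case_id in failures:
            print("FAIL:", case_id)

    # Usage: run_regression("cases.json", my_check), where my_check sends the case
    # input to the model under the structured system prompt and verifies the
    # response carries the expected flags and claim strength.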
This is not fine-tuning—just a structured prompt experiment focused on structural validation.
Looking for feedback on:
Missing edge cases
Failure modes I didn’t consider
Whether this approach generalizes beyond legal/technical mixing
Gist (spec + dataset + runner): https://gist.github.com/ginsabo/6ebeb9490846ee6a268bd13560c0...
13pixels•1h ago
One edge case you might want to add: *Temporal Merging*. We often see models take a '2024 Roadmap' and a '2023 Release Note' and hallucinate that the roadmap features were released in 2023. It's valid syntax, valid entities, but impossible chronology.
Are you planning to expand this to RAG-specific failures (where the context retrieval causes the mix-up) or focusing purely on model-internal logic gaps?
Ginsabo•1h ago
I really like the "Temporal Merging" framing. You're right: roadmap + release notes = syntactically consistent, entity-valid, but chronologically impossible.
I haven't explicitly modeled temporal integrity yet, but that seems like a natural extension of the cross-domain tests.
Regarding RAG: So far the focus has been on model-internal structural logic gaps. I haven't built retrieval-aware tests yet.
That said, I suspect many RAG failures are just amplified cross-document merging errors, so a temporal integrity layer might actually generalize well there.
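A first pass at that temporal layer could be almost trivial, something along these lines (rough sketch, names made up):

    from datetime import date

    def temporally_consistent(planned_on: date, claimed_release: date) -> bool:
        """A feature announced as planned on `planned_on` cannot have shipped
        before that date; such a claim is a temporal merge."""
        return claimed_release >= planned_on

    # e.g. a 2024 roadmap item claimed as released in 2023:
    # temporally_consistent(date(2024, 1, 15), date(2023, 11, 1)) -> False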
If you have examples from brand monitoring contexts, I'd love to add them as new regression cases.