1) Ground with retrieval: convert docs into semantic chunks, retrieve the top-k most relevant ones, and pass that explicit context to the LLM. When the system couldn't find an answer, the bot asked a clarifying question instead of hedging or hallucinating (first sketch after this list).
2) Prompt templates and response shaping: enforce tone, brevity, and a banned-phrase list in the prompt. A strict template removed lead-ins like "As an AI" and capped answers at ~120 words (second sketch below).
3) Context management and guardrails: retrieve broadly, rerank with a cross-encoder, then truncate to stay within the token limit. Add a similarity threshold that triggers escalation to a human or a clarifying question when nothing scores well enough (third sketch below).
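A minimal sketch of the retrieval step, assuming sentence-transformers for embeddings; `call_llm()` is a placeholder for whatever chat-completion client you actually use:

```python
# Grounding sketch. Assumptions: sentence-transformers for embeddings,
# call_llm() stands in for your chat-completion client.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> np.ndarray:
    # Embed each chunk once; normalized vectors make dot product == cosine sim.
    return np.asarray(model.encode(chunks, normalize_embeddings=True))

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 5):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]

def answer(query: str, chunks: list[str], index: np.ndarray) -> str:
    context = "\n\n".join(text for text, _ in retrieve(query, chunks, index))
    prompt = (
        "Answer using ONLY the context below. If the context does not contain "
        "the answer, ask one short clarifying question instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # placeholder, not a real client
```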
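For step 2, something like the template-plus-post-check below. The "As an AI" lead-in and the ~120-word cap are from the post; the template wording, banned-phrase list, and truncation strategy are just illustrative:

```python
# Response-shaping sketch: a strict system prompt plus a post-check backstop.
BANNED_LEADINS = ("as an ai", "as a language model", "i'm sorry, but")

SYSTEM_TEMPLATE = (
    "You are a concise support assistant.\n"
    "Rules:\n"
    "- Plain, friendly tone. No filler, no self-reference.\n"
    "- Answer in at most 120 words.\n"
    "- If you are unsure, ask one clarifying question instead of guessing.\n"
)

def shape(response: str, max_words: int = 120) -> str:
    # Backstop for the prompt rules: strip a banned lead-in, enforce the cap.
    lowered = response.lower()
    for phrase in BANNED_LEADINS:
        if lowered.startswith(phrase):
            parts = response.split(",", 1)
            if len(parts) > 1:
                response = parts[1].strip()
            break
    return " ".join(response.split()[:max_words])
```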
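And for step 3, a rough rerank-and-route sketch using sentence-transformers' CrossEncoder. The checkpoint is a common public reranker and the 0.3 threshold is a placeholder to calibrate on labeled traffic, since score scales are model-dependent:

```python
# Rerank-and-route sketch: cross-encoder rerank plus an escalation threshold.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], budget: int = 3):
    # candidates = a broad first-pass retrieval (e.g. top 20-50 chunks).
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return ranked[:budget]

def route(query: str, candidates: list[str], threshold: float = 0.3):
    top = rerank(query, candidates)
    if not top or top[0][1] < threshold:
        # Nothing scores well enough: clarify with the user or escalate.
        return {"action": "clarify_or_escalate", "context": []}
    # Keeping only `budget` chunks is also what keeps us inside token limits.
    return {"action": "answer", "context": [c for c, _ in top]}
```

The latency cost scales with how many candidate pairs the cross-encoder scores, so the size of the first-pass retrieval is the main knob.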
Results: on the flows we optimized, follow-up clarification rate dropped by roughly 30% and helpfulness ratings improved. Trade-offs included ~200–350 ms of additional latency from reranking and slightly higher infra cost for the vector DB and cross-encoder inference.
Limitations: multi-hop reasoning across multiple documents remains hard; tables and scanned PDFs require special parsing; quality depends on chunking strategy and retrieval coverage.
If you're instrumenting a bot, start with one high-traffic flow (billing, returns, or account management), implement retrieval plus a strict prompt, and measure follow-up clarification rate, escalation rate, and user-rated helpfulness; a rough sketch of those metrics is below. Curious if others have a simple heuristic for choosing max_k and reranker budget.
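For the measurement piece, I just compute the three numbers per session from an event log. The event schema here (session_id, type, score) is made up, so map it onto whatever your logging pipeline already emits:

```python
# Measurement sketch: per-session rates from a flat event log.
from collections import defaultdict

def flow_metrics(events: list[dict]) -> dict:
    sessions = defaultdict(list)
    for e in events:
        sessions[e["session_id"]].append(e)
    total = len(sessions) or 1

    def rate(event_type: str) -> float:
        hit = sum(1 for evs in sessions.values()
                  if any(e["type"] == event_type for e in evs))
        return hit / total

    ratings = [e["score"] for evs in sessions.values()
               for e in evs if e["type"] == "helpfulness_rating"]
    return {
        "followup_clarification_rate": rate("user_clarification"),
        "escalation_rate": rate("escalated_to_human"),
        "avg_helpfulness": sum(ratings) / len(ratings) if ratings else None,
    }
```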