The results show that recent frontier LLMs like `gpt-4.1`, as well as the trusty workhorse `gemini-2.0-flash`, reliably and reproducibly generate high-quality Cypher, provided some prompt engineering ensures the graph schema is well formatted in the text2cypher prompt. Across a suite of 10 moderately complex test queries that require retrieving paths from the knowledge graph, both models pass all tests, generating the right answers once a router agent is added to the workflow to enhance vanilla Graph RAG.
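To make the routing idea concrete, here is a minimal Python sketch of a router sitting in front of a text2cypher path. Everything below is illustrative rather than the post's actual code: the route labels, the keyword-based `classify_route` stub (a real router would call an LLM), and the stubbed retrieval functions are all assumptions.

```python
from typing import Callable

def classify_route(question: str) -> str:
    """Stub router: a real version would ask an LLM (e.g. via a BAML
    function) to pick a retrieval strategy for the question."""
    # Illustrative heuristic only: path-style questions go to Graph RAG.
    return "graph_rag" if "path" in question.lower() else "fallback"

def graph_rag(question: str) -> str:
    """Stub: generate Cypher for the question, run it, summarize the rows."""
    cypher = f"// Cypher that would be generated for: {question}"
    return f"Answer synthesized from graph results of: {cypher}"

def fallback(question: str) -> str:
    """Stub: e.g. vector search or a direct LLM answer."""
    return f"Answer from fallback retrieval for: {question}"

# Map each route label to its retrieval strategy.
ROUTES: dict[str, Callable[[str], str]] = {
    "graph_rag": graph_rag,
    "fallback": fallback,
}

def answer(question: str) -> str:
    """Route the question, then delegate to the chosen strategy."""
    return ROUTES[classify_route(question)](question)

if __name__ == "__main__":
    print(answer("What is the shortest path between Alice and Acme Corp?"))
```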
Prompt engineering in all experiments is done using BAML, a programming language that makes it simple to prompt LLMs and get structured outputs from them. In fact, the knowledge graph itself was constructed using BAML prompts that extract entities and relationships from unstructured data upstream.
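As a hedged sketch of what this looks like in practice: BAML functions are defined in `.baml` files and then called from Python through a generated client. The function name `Text2Cypher`, its parameters, and the toy schema below are assumptions for illustration; `from baml_client import b` is BAML's standard generated entry point, which presumes the client has been generated (e.g. with `baml-cli generate`).

```python
# Assumes a BAML function `Text2Cypher(schema: string, question: string)
# -> string` exists in the project's .baml files and that the Python
# client has been generated (e.g. via `baml-cli generate`).
from baml_client import b  # BAML's generated sync client entry point

# Toy schema string; the real prompt formats the actual graph schema.
schema = "(:Person {name: STRING})-[:WORKS_AT]->(:Company {name: STRING})"

cypher = b.Text2Cypher(
    schema=schema,
    question="Which company does Alice work at?",
)
print(cypher)
```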
A logical next step for this workflow is to build more complex agent loops that run multi-step Cypher queries whose results are consolidated to answer harder questions, much as a human would approach them. The same principles of testing and evaluation apply here, too. These methods seem promising and well worth exploring further!
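To show the shape of such a loop, here is a minimal Python sketch under stated assumptions: the planner, executor, and consolidation steps are stubs (`plan_steps`, `run_cypher`, and `consolidate` are hypothetical names), and a real version would back them with an LLM planner and a graph database driver.

```python
from dataclasses import dataclass

@dataclass
class Step:
    purpose: str
    cypher: str

def plan_steps(question: str) -> list[Step]:
    """Stub planner: a real version would have an LLM decompose the
    question into sub-queries against the graph schema."""
    return [
        Step("locate the entity", "MATCH (p:Person {name: 'Alice'}) RETURN p"),
        Step("expand one hop", "MATCH (p:Person {name: 'Alice'})--(n) RETURN n"),
    ]

def run_cypher(cypher: str) -> list[dict]:
    """Stub executor: a real version would call the graph database driver."""
    return [{"query": cypher, "rows": []}]

def consolidate(question: str, results: list[list[dict]]) -> str:
    """Stub: a real version would prompt an LLM with all partial results."""
    return f"Answer to {question!r} consolidated from {len(results)} queries."

def multi_step_answer(question: str) -> str:
    """Plan, execute each step, then consolidate, mirroring how a human
    would chain several graph queries to answer a harder question."""
    steps = plan_steps(question)
    results = [run_cypher(step.cypher) for step in steps]
    return consolidate(question, results)

if __name__ == "__main__":
    print(multi_step_answer("How is Alice connected to Acme Corp?"))
```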