Standard RAG retrieves semantic noise when confronted with the logic of specialist domains like maths. I wanted to see whether we could treat language models more like compilers by anchoring them to a structural ground truth.
I built a neuro-symbolic pipeline that grounds smaller models (<9B parameters, e.g. Gemma 2 and Qwen2.5-Math) in the OpenMath ontology using hybrid retrieval and cross-encoder reranking.
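To make "hybrid retrieval" concrete, here is a minimal, self-contained sketch of the idea: blend a lexical score with a dense-similarity score and keep the top-k candidates for a reranker. Everything here is illustrative, not the actual pipeline — the corpus entries, the toy scoring functions, and the `alpha` blend weight are all stand-ins (a real system would use BM25 and learned embeddings).

```python
import math

# Toy corpus standing in for OpenMath ontology entries (hypothetical content).
DOCS = {
    "arith1.plus": "addition binary operation sum of two numbers",
    "calculus1.defint": "definite integral of a function over an interval",
    "linalg2.det": "determinant of a square matrix",
}

def lexical_score(query: str, doc: str) -> float:
    """Crude BM25 stand-in: fraction of query terms present in the doc."""
    q, d = query.lower().split(), set(doc.lower().split())
    return sum(t in d for t in q) / len(q)

def dense_score(query: str, doc: str) -> float:
    """Toy 'embedding' similarity: cosine over character-bigram counts.
    A real pipeline would use a dense encoder here."""
    def vec(s: str) -> dict:
        s = s.lower()
        v = {}
        for a, b in zip(s, s[1:]):
            v[a + b] = v.get(a + b, 0) + 1
        return v
    vq, vd = vec(query), vec(doc)
    dot = sum(vq[k] * vd.get(k, 0) for k in vq)
    nq = math.sqrt(sum(x * x for x in vq.values()))
    nd = math.sqrt(sum(x * x for x in vd.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_retrieve(query: str, k: int = 2, alpha: float = 0.5) -> list:
    """Blend lexical and dense scores; return top-k symbol names.
    These candidates would then go to a cross-encoder for reranking."""
    scored = [
        (alpha * lexical_score(query, d) + (1 - alpha) * dense_score(query, d), name)
        for name, d in DOCS.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]
```

The cross-encoder step (not shown) would then score each (query, candidate) pair jointly, which is typically much more precise than the bi-encoder-style scores used for first-stage recall.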
Evaluating on the MATH-500 benchmark revealed a severe bottleneck. When retrieval succeeds, reasoning and convergence improve. But the semantic gap between natural language and formal definitions is large, and when retrieval fails, the injected irrelevant ontological context actively degrades performance, hitting a hard "context utilization ceiling" in smaller models.
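One obvious mitigation for the failure mode above is to gate context injection on the reranker's confidence: only prepend ontological context when the top rerank score clears a threshold, and otherwise fall back to the bare question. This is a hedged sketch, not the pipeline's actual behavior — the threshold value and the prompt template are hypothetical, and scores are assumed normalized to [0, 1].

```python
RERANK_THRESHOLD = 0.35  # hypothetical cutoff; would need tuning on a dev split

def build_prompt(question: str, candidates: list) -> str:
    """candidates: list of (rerank_score, ontology_text), sorted descending.
    Inject context only when the top score clears the threshold, since
    irrelevant injected context was observed to hurt smaller models."""
    if candidates and candidates[0][0] >= RERANK_THRESHOLD:
        context = "\n".join(text for _, text in candidates)
        return f"Context:\n{context}\n\nQuestion: {question}"
    return f"Question: {question}"
```

The trade-off is that a badly calibrated threshold reintroduces the original problem on one side or the other: too low and noise leaks through, too high and the model loses the grounding that helps when retrieval does succeed.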
The paper and pipeline code are open. I ran these experiments locally without hyperscaler compute. I would love the community's technical feedback.
I’m now continuing the research toward solving the retrieval-quality bottleneck.
marcelolabre•1h ago