It runs models through 30 test cases covering single-turn, multi-turn, and agentic scenarios, modeled loosely after the Berkeley Function Calling Leaderboard methodology.
Validation uses AST matching rather than string comparison to avoid false positives from formatting variations.
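To make the AST-vs-string point concrete, here is a minimal sketch in Python. It is not fc-eval's actual validator: the helper names `parse_call` and `calls_match` are hypothetical, and it assumes the model's output has already been rendered as a Python-style call with keyword arguments only.

```python
import ast

def parse_call(source: str):
    """Parse one call expression into (function_name, kwargs). Hypothetical helper."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("this sketch only handles keyword arguments")
    name = ast.unparse(node.func)  # e.g. "get_weather" or "math.sqrt" (Python 3.9+)
    # literal_eval turns each argument node into a plain Python value
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def calls_match(expected: str, actual: str) -> bool:
    """Compare two calls structurally, ignoring whitespace, quote style, and argument order."""
    return parse_call(expected) == parse_call(actual)

expected = 'get_weather(city="Paris", unit="c")'
same_call = "get_weather( unit='c',  city='Paris' )"
wrong_arg = 'get_weather(city="Berlin", unit="c")'

print(expected == same_call)             # False: plain string comparison rejects it
print(calls_match(expected, same_call))  # True: same function, same arguments
print(calls_match(expected, wrong_arg))  # False: argument value differs
```

The idea is that quoting, spacing, and keyword order don't change the parsed structure, so only real differences in the function name or argument values count as failures.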
Supports two backends: OpenRouter for cloud models (GPT-5.2, Claude, Qwen 3.5, Mistral, etc.) and Ollama for local models with no API key needed.
Runs each test for best-of-N trials, giving you a reliability score alongside raw accuracy.
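As a rough illustration of how repeated trials can produce more than one number, here is a small Python sketch. The metric definitions are my assumptions, not necessarily the ones fc-eval uses: raw accuracy is the share of all trials that passed, best-of-N counts a test case as passed if any trial passed, and reliability counts it only if every trial passed. The `summarize` function and the sample data are hypothetical.

```python
def summarize(trials: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate per-trial pass/fail results, keyed by test case name."""
    total_trials = sum(len(results) for results in trials.values())
    total_passes = sum(sum(results) for results in trials.values())
    n_cases = len(trials)
    return {
        # fraction of individual trials that passed
        "raw_accuracy": total_passes / total_trials,
        # fraction of test cases with at least one passing trial
        "best_of_n": sum(any(results) for results in trials.values()) / n_cases,
        # fraction of test cases where every trial passed
        "reliability": sum(all(results) for results in trials.values()) / n_cases,
    }

# Made-up results for 4 test cases with N = 2 trials each
results = {
    "single_turn_01": [True, True],
    "multi_turn_01": [True, False],
    "agentic_01": [False, False],
    "agentic_02": [True, True],
}
print(summarize(results))
# {'raw_accuracy': 0.625, 'best_of_n': 0.75, 'reliability': 0.5}
```

The gap between best-of-N and reliability shows how flaky a model is: a model that only sometimes gets a call right scores well on the first and poorly on the second.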
Results export to JSON, TXT, CSV, or Markdown.
Quick start commands:
Via OpenRouter: `fc-eval --provider openrouter --models openai/gpt-5.2 anthropic/claude-sonnet-4.6`
Via Ollama: `fc-eval --provider ollama --models llama3.2`
GitHub repo: https://github.com/gauravvij/function-calling-cli
Happy to answer questions, especially around the test case design or validation logic.