Built for accuracy, speed, and multimodal evaluation, LMEval enables developers to compare models like Gemini, GPT-4, Claude, and more, efficiently and securely.

Google launched LMEval, a powerful open-source framework designed to make large model evaluation faster, easier, and more consistent across providers. This release aligns with our ongoing collaboration with Giskard, which uses LMEval to run the Phare benchmark—an independent test for assessing model safety and security.
With LLMs evolving rapidly, developers and researchers need reliable tools to assess model performance across tasks and providers. Until now, cross-model benchmarking has been complex and resource-heavy. LMEval changes that.
Key Features of LMEval:
- Multi-Provider Compatibility: Powered by the LiteLLM framework, LMEval works seamlessly with major providers including Google, OpenAI, Anthropic, Hugging Face, and Ollama (see the sketch after this list).
- Incremental & Efficient Evaluation: No need to rerun entire test suites; its intelligent, multi-threaded engine evaluates only what’s new, reducing compute time and cost.
- Multimodal & Multi-Metric Support: Beyond text, LMEval supports images and code. It handles formats such as boolean questions, multiple choice, and free-form generation, and it detects safety issues as well as punted (evasive) responses.
- Secure, Scalable Storage: Evaluation results are stored in a self-encrypting SQLite database—keeping them secure while remaining accessible.
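Because provider access is routed through LiteLLM, the same call shape works across vendors, and swapping models is largely a matter of changing the model string. The snippet below is a minimal sketch of that underlying abstraction, not LMEval's own API; it assumes provider API keys (for example GEMINI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY) are set in the environment, and the model identifiers are illustrative.

```python
# Minimal sketch of the LiteLLM layer that LMEval builds on (not LMEval's own API).
# Assumes provider API keys are exported as environment variables, e.g.
# GEMINI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY; model names are illustrative.
from litellm import completion

QUESTION = "Is the Eiffel Tower in Paris? Answer yes or no."

# The same call shape works for every provider; only the model string changes.
for model in ["gemini/gemini-1.5-flash", "gpt-4o-mini", "claude-3-5-sonnet-20240620"]:
    response = completion(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
    )
    print(model, "->", response.choices[0].message.content.strip())
```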
LMEval is user-friendly, with example notebooks available in the GitHub repository. Running evaluations on different model versions, such as Gemini, takes just a few lines of code. LMEval also ships with LMEvalboard, a dashboard for interactive model comparison.
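The following is an illustrative sketch of such a comparison, not LMEval's actual API: it uses LiteLLM for model access and a plain (unencrypted) sqlite3 cache to mimic the incremental behavior described above, so rerunning the script only evaluates model/question pairs it has not seen before. The model names, questions, and contains-text scoring rule are placeholders.

```python
# Illustrative only: a tiny evaluation loop in the spirit of LMEval, not its
# actual API. Uses LiteLLM for provider access and a plain sqlite3 cache so
# reruns only evaluate model/question pairs not seen before (LMEval itself
# uses a self-encrypting SQLite store).
import sqlite3
from litellm import completion

QUESTIONS = [  # (question, expected answer) pairs; placeholders
    ("Is the cat a mammal? Answer yes or no.", "yes"),
    ("Is the sun a planet? Answer yes or no.", "no"),
]
MODELS = ["gemini/gemini-1.5-flash", "gemini/gemini-2.0-flash"]  # versions to compare

db = sqlite3.connect("eval_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS results (model TEXT, question TEXT, "
           "answer TEXT, correct INTEGER, PRIMARY KEY (model, question))")

for model in MODELS:
    for question, expected in QUESTIONS:
        cached = db.execute("SELECT correct FROM results WHERE model=? AND question=?",
                            (model, question)).fetchone()
        if cached is not None:
            continue  # incremental: skip pairs already evaluated
        reply = completion(model=model,
                           messages=[{"role": "user", "content": question}])
        answer = reply.choices[0].message.content.strip().lower()
        correct = int(expected in answer)  # naive contains-text scoring
        db.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                   (model, question, answer, correct))
        db.commit()

for model in MODELS:
    (score,) = db.execute("SELECT AVG(correct) FROM results WHERE model=?",
                          (model,)).fetchone()
    print(f"{model}: accuracy {score:.2f}")
```

In LMEval itself, the analogous bookkeeping is handled by its multi-threaded evaluation engine and encrypted SQLite store, and the resulting scores can then be explored interactively in LMEvalboard.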
LMEvalboard, the companion dashboard, lets users explore and understand model performance in depth. It supports quick comparisons of overall accuracy across benchmarks, showing how different models stack up, and lets users drill into an individual model to uncover its strengths and weaknesses across categories. It also makes head-to-head comparisons easy, highlighting areas of disagreement and revealing where one model outperforms another. Together, these capabilities support responsible, fair, and transparent benchmarking, advancing trust in the evaluation of AI systems.