Use the vitals package with ellmer to evaluate and compare the accuracy of LLMs, including writing evals to test local models.
[2025.08] We have corrected the robustness results on the Aircraft dataset and uploaded an updated version of the paper to arXiv. Our implementation is based on TPT ...