- AUTHOR: The LangChain Team
Pytest and Vitest integrations for evaluations in LangSmith
Evaluations (evals) are crucial for building reliable, high-quality LLM applications. They ensure consistent performance, much like tests in software engineering. With the release of LangSmith v0.3.0, we’re thrilled to introduce Pytest and Vitest/Jest integrations, now available in beta in the Python and TypeScript SDKs.
Why Testing Frameworks for LLM Evals?
If you already use Pytest or Vitest/Jest, these integrations combine familiar developer experience (DX) with LangSmith’s observability and sharing features. Here’s what they offer:
Debugging made easy. LangSmith saves inputs, outputs, and stack traces from your test cases, simplifying the debugging process for non-deterministic LLM behavior.
Metrics beyond Pass/Fail. Log nuanced metrics and track progress over time to ensure continuous improvement, even when hard pass/fail criteria don’t apply.
Effortless collaboration. Share results across your team to streamline collaboration, especially with subject matter experts involved in evals and prompts.
Built-in evaluation functions. Use helpers like expect.edit_distance() to measure string differences, or explore our API reference for more functions (see the sketch below).
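To make this concrete, here is a minimal sketch of what a test can look like with the Python beta integration. It assumes the @pytest.mark.langsmith marker, the langsmith.testing helpers, and the expect utilities described in the developer tutorial; generate_sql is a hypothetical stand-in for your own application code, and the exact API may differ — check the SDK docs.

```python
import pytest
from langsmith import expect
from langsmith import testing as t


# Hypothetical stub standing in for your real LLM chain or agent.
def generate_sql(question: str) -> str:
    return "SELECT * FROM customers;"


@pytest.mark.langsmith  # records inputs, outputs, and results to LangSmith
def test_select_all_customers():
    question = "Return every row in the customers table"
    t.log_inputs({"question": question})

    sql = generate_sql(question)
    t.log_outputs({"sql": sql})

    # Built-in evaluator: logs the string edit distance as feedback
    # instead of forcing a hard pass/fail on an exact string match.
    expect.edit_distance(prediction=sql, reference="SELECT * FROM customers;")

    # Ordinary pytest assertions still work alongside logged metrics.
    assert "customers" in sql.lower()
```

Because this is a normal pytest test, it runs with plain `pytest` locally or in CI, while the logged inputs, outputs, and feedback show up as an experiment in LangSmith.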
Testing Frameworks vs. evaluate()
While libraries like OpenAI Evals and LangSmith’s evaluate() work well for running evaluations over whole datasets, these integrations shine in:
Test-specific evaluation logic: tailor evaluators per test case, ideal for complex, multi-tool agents (see the sketch after this list).
Real-time local feedback: Debug quickly during iteration.
CI pipeline integration: Catch regressions early with automated test runs.
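As a sketch of test-specific evaluation logic, the test below computes a custom metric for a single case and logs it as feedback, so repeated CI runs accumulate scores over time in LangSmith. It again assumes the beta langsmith.testing API (including t.log_feedback) and uses a hypothetical run_agent helper; treat it as illustrative rather than the definitive implementation.

```python
import pytest
from langsmith import testing as t


# Hypothetical agent entry point; replace with your own application code.
def run_agent(question: str) -> dict:
    return {"answer": "Paris", "tool_calls": ["search", "lookup_capital"]}


@pytest.mark.langsmith
def test_agent_uses_expected_tools():
    question = "What is the capital of France?"
    t.log_inputs({"question": question})

    result = run_agent(question)
    t.log_outputs(result)

    # Test-specific metric: fraction of the expected tools the agent called.
    expected_tools = {"search", "lookup_capital"}
    coverage = len(expected_tools & set(result["tool_calls"])) / len(expected_tools)
    t.log_feedback(key="tool_coverage", score=coverage)

    # Hard requirement still enforced as a normal assertion, so CI fails on regressions.
    assert result["answer"] == "Paris"
```

The mix of a logged score and a plain assertion is the point: the assertion catches outright regressions in CI, while the feedback metric tracks softer quality signals across runs.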
What’s Next?
Stay tuned for GitHub Actions support to streamline CI workflows!
Try It Now
Read our blog for more info and check out our developer tutorials (Python, TypeScript) and video walkthroughs.