Pairwise annotation queues for comparing agent outputs

AUTHOR: The LangChain Team

Pairwise Annotation Queues in LangSmith are a fast, structured way to compare two agent outputs side by side and pick a winner. Scoring subjective tasks is hard; with pairwise annotation queues, you can run real A/B evaluations across experiments and see which agent, prompt, or model actually performs better. Pairwise queues give you:

• A/B clarity for subjective tasks:
Tone, correctness, usefulness, or style — if it’s hard to score on a rubric, pairwise judgment makes it simple.

• Real experiment comparisons:
Test baseline vs. candidate systems using your own dataset, instructions, and reviewers to validate improvements.

• Faster iteration loops:
Side-by-side UI + hotkeys make reviewing runs fast and consistent, so you get results and production insight sooner.

What Pairwise Annotation Queues do

With pairwise queues, annotators see two runs presented together and, for each rubric item, mark A is better, B is better, or Equal. LangSmith automatically pairs runs between the two experiments, routes them into a queue, and manages reservations, reviewers, and trace access. It’s ideal for judging prompts, models, multi-agent systems, or any experiment where “better” is easier to judge than “why.”

How to get started

1. Go to Datasets & Experiments in LangSmith.
2. Select exactly two experiments you want to compare.
3. Click Annotate → Add to Pairwise Annotation Queue.
4. Define your rubric and instructions, assign reviewers, and begin scoring.
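
If you don’t yet have two experiments to compare, you can produce a baseline and a candidate run over the same dataset with the LangSmith Python SDK. The sketch below is illustrative rather than the only path: baseline_agent, candidate_agent, the "support-questions" dataset, and its "question" input key are placeholders for your own setup.

```python
# Minimal sketch: run two targets over the same dataset to produce the two
# experiments you will select for the pairwise annotation queue (step 2).
# Assumes LANGSMITH_API_KEY is set and a "support-questions" dataset exists.
from langsmith import Client, evaluate

client = Client()


def baseline_agent(inputs: dict) -> dict:
    # Placeholder for your current production agent.
    return {"answer": f"baseline response to: {inputs['question']}"}


def candidate_agent(inputs: dict) -> dict:
    # Placeholder for the new prompt, model, or agent you want to validate.
    return {"answer": f"candidate response to: {inputs['question']}"}


# Each evaluate() call logs one experiment against the dataset.
evaluate(
    baseline_agent,
    data="support-questions",
    experiment_prefix="baseline",
    client=client,
)

evaluate(
    candidate_agent,
    data="support-questions",
    experiment_prefix="candidate",
    client=client,
)
```

Each evaluate() call shows up as its own experiment on the dataset, so those are the two entries you select in step 2 before clicking Annotate.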

Pairwise annotation queues are available today — giving you a rigorous, human-aligned way to evaluate upgrades, test hypotheses, and ship better agents with confidence.

See the docs: https://docs.langchain.com/langsmith/annotation-queues#pairwise-annotation-queues
