Leveraging LangSmith for Advanced Evaluation in LangChain
Introduction
Evaluating the performance of AI-driven applications is critical for ensuring reliability, accuracy, and alignment with user expectations. LangChain, a powerful framework for building applications powered by language models, integrates seamlessly with LangSmith, a platform designed to enhance the development, evaluation, and monitoring of LangChain applications. LangSmith provides advanced tools for dataset management, automated evaluation, and performance tracking, enabling developers to rigorously assess components like chains, agents, and retrievers. This guide explores how to use LangSmith for evaluation in LangChain, covering setup, core features, best practices, practical applications, and advanced configurations, empowering developers to build high-performing AI systems.
To understand LangChain’s broader evaluation ecosystem, start with LangChain Evaluation Introduction.
What is LangSmith Evaluation?
LangSmith is a platform developed by the LangChain team to streamline the development, testing, and monitoring of language model applications. Its evaluation capabilities allow developers to assess the performance of LangChain components (e.g., chains, agents, retrievers) using curated datasets, automated metrics, and LLM-based judgments. LangSmith supports quantitative metrics (e.g., exact match, BLEU), qualitative assessments (e.g., relevance, coherence), and custom evaluators, providing detailed insights into output quality, tool usage, and task completion. By integrating with LangChain’s langsmith package, it enables scalable, reproducible evaluations, making it ideal for iterative development and production monitoring.
For related concepts, see LangChain Metrics Overview and Evaluate Agent Behavior.
Why Use LangSmith for Evaluation?
LangSmith evaluation is essential for:
- Scalable Testing: Manage and evaluate large datasets with automated workflows.
- Detailed Insights: Access granular feedback on correctness, relevance, and reasoning.
- Iterative Improvement: Refine prompts, tools, or models based on evaluation results.
- Production Monitoring: Track performance in real-world deployments.
Explore LangSmith capabilities at the LangSmith Documentation.
Setting Up LangSmith Evaluation
To use LangSmith for evaluation in LangChain, you need to install the required packages, configure LangSmith credentials, create datasets, and set up evaluators. Below is a setup for evaluating a RetrievalQA chain using LangSmith:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain.evaluation import load_evaluator, EvaluatorType
from langsmith import Client
from langsmith.evaluation import evaluate
import os

# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""  # Your LangSmith API key
os.environ["LANGSMITH_PROJECT"] = "langchain-evaluation"

# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"})
]

# Initialize Chroma vector store
vector_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)

# Set up RetrievalQA chain with a custom prompt
prompt = PromptTemplate.from_template(
    "Use the context to answer the question.\nContext: {context}\nQuestion: {question}\nAnswer:"
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)

# Initialize LangSmith client
client = Client()

# Create the dataset, or load it if it already exists
dataset_name = "qa_dataset"
try:
    dataset = client.create_dataset(dataset_name=dataset_name)
except Exception:
    dataset = client.read_dataset(dataset_name=dataset_name)

# Add examples to the dataset
examples = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "Where is the Eiffel Tower?", "output": "Paris"}
]
for example in examples:
    client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"]},
        dataset_id=dataset.id
    )

# Define evaluation function: an LLM-based QA evaluator scores each prediction
def evaluate_qa(run, example):
    prediction = run.outputs.get("result", "")
    reference = example.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        reference=reference,
        input=question
    )
    return {"key": "qa_score", "score": result["score"], "comment": result.get("reasoning", "")}

# Run evaluation
results = evaluate(
    lambda inputs: qa_chain.invoke({"query": inputs["question"]}),
    data=dataset_name,
    evaluators=[evaluate_qa],
    experiment_prefix="qa_evaluation"
)
print(f"Evaluation Results: {results}")
This setup creates a RetrievalQA chain, uploads a dataset to LangSmith, evaluates the chain’s outputs using a QA evaluator, and logs results. The evaluation is tracked in the LangSmith dashboard for further analysis.
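Before scaling to the full dataset, it can help to spot-check the chain with a single query. The snippet below is a quick sanity check using the qa_chain defined above; the RetrievalQA chain expects its input under the "query" key and returns the answer under "result".
# Quick smoke test before evaluating the whole dataset
response = qa_chain.invoke({"query": "What is the capital of France?"})
print(response["result"])  # should mention Paris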
Installation
Install the core packages for LangChain, LangSmith, and evaluation:
pip install langchain langchain-chroma langchain-openai chromadb langsmith
For specific metrics or tools, install additional dependencies:
- NLP Metrics: pip install nltk rouge-score for BLEU/ROUGE scores (see the custom ROUGE evaluator sketch below).
- Embedding Metrics: Included with langchain-openai.
- Agent and Vector Store Integrations: pip install langchain-community langchain-mongodb pymongo duckduckgo-search for the comprehensive agent example later in this guide.
Example:
pip install nltk rouge-score
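With rouge-score installed, you can also wrap it in a custom evaluator and pass it to evaluate() alongside the LLM-based ones. This is a minimal sketch, assuming the same "result" output key and "answer" reference key used in the setup above; the rouge_l metric name is arbitrary.
from rouge_score import rouge_scorer

def evaluate_rouge_l(run, example):
    # Custom metric: ROUGE-L F1 between the prediction and the reference answer
    prediction = run.outputs.get("result", "")
    reference = example.outputs.get("answer", "")
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    score = scorer.score(reference, prediction)["rougeL"].fmeasure
    return {"key": "rouge_l", "score": score}

# Pass it like any other evaluator, e.g. evaluators=[evaluate_qa, evaluate_rouge_l]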
For detailed installation guidance, see LangSmith Documentation.
Configuration Options
Customize LangSmith evaluation during setup (a combined dataset-management sketch follows this list):
- Dataset Management:
- Create datasets with input-output pairs or load existing ones.
- Example:
dataset = client.create_dataset(dataset_name="custom_dataset")
- Evaluators:
- Use built-in evaluators (QA, CRITERIA, STRING_DISTANCE) or custom functions.
- Example:
from langchain.evaluation import load_evaluator, EvaluatorType
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
- Experiment Settings:
- Define experiment prefixes and metadata for tracking.
- Example:
evaluate(..., experiment_prefix="test_run", metadata={"version": "1.0"})
- LLM for Evaluation:
- Use a reliable LLM (e.g., gpt-3.5-turbo or gpt-4) for judgment-based metrics.
- Example:
llm = ChatOpenAI(model="gpt-4", temperature=0)
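These options can be combined into a small dataset-management routine. The sketch below checks whether a dataset exists before creating it and then inspects its stored examples; it assumes a recent langsmith SDK, where the Client exposes has_dataset(), read_dataset(), and list_examples().
from langsmith import Client

client = Client()
dataset_name = "custom_dataset"

# Reuse the dataset if it already exists, otherwise create it
if client.has_dataset(dataset_name=dataset_name):
    dataset = client.read_dataset(dataset_name=dataset_name)
else:
    dataset = client.create_dataset(dataset_name=dataset_name)

# Inspect the stored examples before launching an experiment
for example in client.list_examples(dataset_name=dataset_name):
    print(example.inputs, example.outputs)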
Core Evaluation Techniques
1. Correctness Evaluation
Assess whether agent outputs are factually accurate compared to references.
- QA Evaluator:
- Compares predicted outputs to ground truth answers.
- Use Case: Validating factual responses from chains or agents.
- Example:
def evaluate_correctness(run, example):
    prediction = run.outputs.get("result", "")
    reference = example.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        reference=reference,
        input=question
    )
    return {"key": "correctness", "score": result["score"], "comment": result.get("reasoning", "")}
2. Relevance Evaluation
Measure how well outputs align with input queries or task objectives.
- Criteria Evaluator (Relevance):
- Uses an LLM to score relevance to the input.
- Use Case: Ensuring responses address user intent.
- Example:
def evaluate_relevance(run, example):
    prediction = run.outputs.get("result", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question
    )
    return {"key": "relevance", "score": result["score"], "comment": result.get("reasoning", "")}
3. Coherence Evaluation
Assess the logical flow and clarity of outputs.
- Criteria Evaluator (Coherence):
- Evaluates whether outputs are logically structured and clear.
- Use Case: Validating multi-step reasoning or conversational outputs.
- Example:
def evaluate_coherence(run, example):
    prediction = run.outputs.get("result", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="coherence", llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question
    )
    return {"key": "coherence", "score": result["score"], "comment": result.get("reasoning", "")}
4. Custom Metrics
Define custom evaluators for project-specific needs, such as tool usage or task completion.
- Custom Evaluator:
- Create a function to assess specific behaviors (e.g., tool selection accuracy).
- Example:
def evaluate_tool_usage(run, example):
    prediction = run.outputs.get("result", "")
    question = example.inputs.get("question", "")
    # Check if the search tool was mentioned (simplified example)
    score = 1.0 if "search" in prediction.lower() else 0.5
    return {
        "key": "tool_usage",
        "score": score,
        "comment": "Checks if the search tool was referenced in the response."
    }
5. Pairwise Comparison
Compare outputs from different configurations or runs to determine the better performer.
- Pairwise String Evaluator:
- Uses an LLM to judge which output is superior.
- Use Case: Comparing agent versions or prompt variations.
- Example:
def evaluate_pairwise(run, example):
    prediction = run.outputs.get("result", "")
    # Assume another run's output is available for comparison
    prediction_b = "Alternative response from another run"
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.PAIRWISE_STRING, llm=llm)
    result = evaluator.evaluate_string_pairs(
        prediction=prediction,
        prediction_b=prediction_b,
        input=question
    )
    return {"key": "pairwise", "score": result["score"], "comment": result.get("reasoning", "")}
Comprehensive Example
Here’s a complete example that evaluates a LangChain agent with LangSmith using multiple metrics (correctness, relevance, and tool usage), integrated with Chroma and MongoDB Atlas vector stores and a LangSmith dataset:
from langchain_chroma import Chroma
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.evaluation import load_evaluator, EvaluatorType
from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool
from langchain_community.tools import DuckDuckGoSearchRun
from langsmith import Client
from langsmith.evaluation import evaluate
from pymongo import MongoClient
import os
import logging
import time

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""  # Your LangSmith API key
os.environ["LANGSMITH_PROJECT"] = "agent-evaluation"

# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"})
]

# Initialize Chroma and MongoDB Atlas vector stores
chroma_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)
client = MongoClient("mongodb+srv://:@.mongodb.net/")  # Replace with your MongoDB Atlas connection string
collection = client["langchain_db"]["example_collection"]
mongo_store = MongoDBAtlasVectorSearch.from_documents(
    documents,
    embedding_function,
    collection=collection,
    index_name="vector_index"
)

# Set up search tool and agent
search = DuckDuckGoSearchRun()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Useful for answering questions about recent events or general knowledge."
    )
]
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Initialize LangSmith client
ls_client = Client()

# Create the dataset, or load it if it already exists
dataset_name = "agent_qa_dataset"
try:
    dataset = ls_client.create_dataset(dataset_name=dataset_name)
except Exception:
    dataset = ls_client.read_dataset(dataset_name=dataset_name)

# Add examples to the dataset
examples = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "Where is the Eiffel Tower?", "output": "Paris"}
]
for example in examples:
    ls_client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"]},
        dataset_id=dataset.id
    )

# Define evaluation functions
def evaluate_correctness(run, example):
    prediction = run.outputs.get("output", "")
    reference = example.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        reference=reference,
        input=question
    )
    return {"key": "correctness", "score": result["score"], "comment": result.get("reasoning", "")}

def evaluate_relevance(run, example):
    prediction = run.outputs.get("output", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="relevance", llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question
    )
    return {"key": "relevance", "score": result["score"], "comment": result.get("reasoning", "")}

def evaluate_tool_usage(run, example):
    prediction = run.outputs.get("output", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(
        EvaluatorType.CRITERIA,
        criteria={"tool_usage": "Did the agent choose and use the correct tool effectively?"},
        llm=llm
    )
    result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question
    )
    return {"key": "tool_usage", "score": result["score"], "comment": result.get("reasoning", "")}

# Run evaluation
start_time = time.time()
results = evaluate(
    lambda inputs: {"output": agent.run(inputs["question"])},
    data=dataset_name,
    evaluators=[evaluate_correctness, evaluate_relevance, evaluate_tool_usage],
    experiment_prefix="agent_evaluation",
    metadata={"version": "1.0"}
)

# Log results
logger.info(f"Evaluation completed in {time.time() - start_time:.2f} seconds")
print(f"Evaluation Results: {results}")
Output:
Evaluation completed in 5.32 seconds
Evaluation Results:
# Detailed results available in LangSmith dashboard
The results, including scores and comments for correctness, relevance, and tool usage, are logged in the LangSmith dashboard under the “agent_evaluation” experiment, accessible via the LangSmith UI.
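To inspect scores outside the dashboard, the object returned by evaluate() can usually be converted to a DataFrame. This is a minimal sketch, assuming a recent langsmith SDK whose ExperimentResults exposes to_pandas() and that pandas is installed.
# Assumes langsmith's ExperimentResults.to_pandas() is available and pandas is installed
df = results.to_pandas()
print(df.columns.tolist())  # shows which input, output, and feedback columns are present
print(df.head())            # per-example rows with evaluator scores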
Best Practices
- Curate High-Quality Datasets: Include diverse inputs, edge cases, and multi-step tasks to ensure comprehensive evaluation.
- Use Multiple Metrics: Combine correctness, relevance, coherence, and custom metrics like tool usage for holistic assessment.
- Leverage LangSmith Dashboard: Analyze results, visualize trends, and compare experiments in the UI.
- Optimize Costs: Use cost-effective LLMs (e.g., gpt-3.5-turbo) and cache results for repeated evaluations (see the caching sketch after this list).
- Iterate on Feedback: Refine agent prompts, tools, or logic based on evaluation comments and scores.
- Monitor Experiments: Track performance over time to detect regressions or improvements.
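For the cost point above, LangChain's LLM cache can avoid paying twice for judge calls on unchanged prompts. A minimal sketch using the SQLite cache from langchain_community (the database path is arbitrary):
from langchain.globals import set_llm_cache
from langchain_community.cache import SQLiteCache

# Cache judge LLM responses on disk so repeated evaluations of unchanged prompts reuse prior results
set_llm_cache(SQLiteCache(database_path=".langchain_eval_cache.db"))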
Error Handling
- API Errors: Handle LangSmith API failures with retries or fallback evaluators (see the retry sketch after this list).
- Dataset Issues: Validate dataset format to avoid parsing errors.
- Tool Failures: Log and skip invalid tool responses during evaluation.
- Resource Limits: Batch evaluations to manage API costs and rate limits.
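For transient API failures, a small retry wrapper around an evaluator function is often enough. This is a generic sketch; the attempt count, delay, and fallback score are arbitrary choices.
import time

def with_retries(evaluator_fn, attempts=3, delay=2.0):
    # Wrap a (run, example) evaluator so transient API errors are retried
    def wrapped(run, example):
        last_error = None
        for _ in range(attempts):
            try:
                return evaluator_fn(run, example)
            except Exception as exc:
                last_error = exc
                time.sleep(delay)
        # Return a neutral score with the error message instead of failing the experiment
        return {"key": "evaluator_error", "score": 0, "comment": str(last_error)}
    return wrapped

# Usage: evaluators=[with_retries(evaluate_correctness), with_retries(evaluate_relevance)]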
See Troubleshooting.
Limitations
- LLM Dependency: Judgment-based metrics may introduce bias or variability.
- Cost: LangSmith and LLM evaluations can be expensive for large datasets.
- Setup Complexity: Requires API key setup and dataset curation.
- Metric Subjectivity: Qualitative metrics like coherence depend on LLM interpretation.
Recent Developments
- 2024 Enhancements: LangSmith introduced advanced dataset versioning and custom evaluator templates.
- Community Feedback: X posts highlight LangSmith’s role in evaluating complex agent workflows, with users sharing custom metrics for enterprise applications.
- UI Improvements: Enhanced dashboard for visualizing experiment results and trends.
Conclusion
LangSmith evaluation in LangChain provides a powerful platform for assessing and optimizing AI application performance, offering scalable dataset management, automated metrics, and detailed insights. By leveraging LangSmith’s tools, developers can ensure their chains, agents, and retrievers deliver reliable, high-quality outputs. Start using LangSmith to enhance your LangChain projects, streamlining evaluation and driving iterative improvements.
For official documentation, visit LangSmith Documentation.