Creating Custom Metrics for LangChain Evaluation to Tailor AI Performance Assessment
Introduction
Evaluating AI-driven applications requires metrics that align with specific use cases and performance goals. LangChain’s langchain.evaluation module provides a robust set of built-in evaluation metrics, such as correctness, relevance, and coherence, but many applications demand tailored metrics to assess unique aspects of output quality or behavior. Custom metrics let developers define project-specific evaluation criteria by extending LangChain’s evaluation framework, enabling precise assessment of chains, agents, or retrievers. This guide covers how to create and use custom metrics in LangChain, including setup, core techniques, best practices, practical applications, and advanced configurations, so you can build performance evaluations tailored to your application.
To understand LangChain’s broader evaluation ecosystem, start with LangChain Evaluation Introduction.
What are Custom Metrics in LangChain?
Custom metrics in LangChain are user-defined evaluation criteria implemented by extending the StringEvaluator class or defining custom evaluation functions within the langchain.evaluation module. These metrics assess the quality of outputs from LangChain components—such as chains, agents, or retrievers—based on project-specific requirements, such as tone, specificity, domain accuracy, or task completion efficiency. Custom metrics can leverage LLMs for judgment-based scoring, incorporate traditional NLP metrics (e.g., BLEU, ROUGE), or use rule-based logic, and are often integrated with LangSmith for dataset-driven evaluation. They are ideal for scenarios where built-in metrics like QA or CRITERIA do not fully capture the desired evaluation dimensions.
For related concepts, see LangChain Metrics Overview and LangSmith Evaluation.
Why Use Custom Metrics?
Custom metrics are essential for:
- Tailored Assessment: Evaluate unique qualities like domain-specific accuracy or user sentiment.
- Flexibility: Adapt evaluation to specific tasks, such as tool usage or multi-step reasoning.
- Precision: Capture nuanced performance aspects that standard metrics miss.
- Scalability: Automate evaluations for complex use cases with LangSmith integration.
Explore evaluation capabilities at the LangChain Evaluation Documentation.
Setting Up Custom Metrics
To create custom metrics in LangChain, you need to install the required packages, define a custom evaluator by extending StringEvaluator or creating a function, and integrate it with your application. Below is a setup for evaluating a RetrievalQA chain with a custom metric for “specificity” using LangSmith:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.evaluation import StringEvaluator
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langsmith import Client
from langsmith.evaluation import evaluate
import os
import logging
from typing import Dict, Any
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""  # Add your LangSmith API key
os.environ["LANGSMITH_PROJECT"] = "custom-metrics-evaluation"
# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"})
]
# Initialize Chroma vector store
vector_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)
# Set up RetrievalQA chain
prompt = PromptTemplate.from_template(
    "Use the context to answer the question.\nContext: {context}\nQuestion: {question}\nAnswer:"
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)
# Define custom evaluator for specificity
class SpecificityEvaluator(StringEvaluator):
    """Evaluates the specificity of a response using an LLM."""

    def __init__(self, llm):
        self.llm = llm

    def _evaluate_strings(self, prediction: str, input: str = None, **kwargs) -> Dict[str, Any]:
        from langchain.evaluation import load_evaluator, EvaluatorType
        evaluator = load_evaluator(
            EvaluatorType.CRITERIA,
            criteria={"specificity": "Is the response detailed and specific to the input?"},
            llm=self.llm
        )
        result = evaluator.evaluate_strings(
            prediction=prediction,
            input=input
        )
        return {
            "key": "specificity",
            "score": result["score"],
            "comment": result.get("reasoning", "")
        }
# Initialize LangSmith client
client = Client()
# Create or load a dataset
dataset_name = "qa_custom_metrics_dataset"
try:
    dataset = client.create_dataset(dataset_name=dataset_name)
except Exception:
    # Dataset already exists; load the existing one
    dataset = client.read_dataset(dataset_name=dataset_name)
# Add examples to dataset
examples = [
{"input": "What is the capital of France?", "output": "Paris"},
{"input": "Describe Paris landmarks.", "output": ""} # Open-ended for specificity
]
for example in examples:
    client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"]},
        dataset_id=dataset.id
    )
# Define evaluation functions
def evaluate_correctness(run, example) -> Dict[str, Any]:
    from langchain.evaluation import load_evaluator, EvaluatorType
    prediction = run.outputs.get("result", "")
    reference = example.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    if not reference:
        return {"key": "correctness", "score": None, "comment": "No reference provided."}
    evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        reference=reference,
        input=question
    )
    return {"key": "correctness", "score": result["score"], "comment": result.get("reasoning", "")}
def evaluate_specificity(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    question = example.inputs.get("question", "")
    evaluator = SpecificityEvaluator(llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        input=question
    )
    return result
# Run evaluation
import time
start_time = time.time()
results = evaluate(
    lambda inputs: {"result": qa_chain.invoke({"query": inputs["question"]})["result"]},
    data=dataset_name,
    evaluators=[evaluate_correctness, evaluate_specificity],
    experiment_prefix="custom_metrics_evaluation",
    metadata={"version": "1.0", "evaluated_at": "2025-05-15T15:25:00Z"}
)
# Log results
logger.info(f"Evaluation completed in {time.time() - start_time:.2f} seconds")
print(f"Evaluation Results: {results}")
print("View detailed results in LangSmith dashboard under 'custom_metrics_evaluation' experiment.")
This setup creates a RetrievalQA chain, defines a custom SpecificityEvaluator to assess response detail, uploads a dataset to LangSmith, and evaluates outputs for correctness and specificity. Results are logged in the LangSmith dashboard for analysis.
Installation
Install the core packages for LangChain, LangSmith, and evaluation:
pip install langchain langchain-chroma langchain-openai chromadb langsmith
For specific metrics, install additional dependencies:
- NLP Metrics: pip install nltk rouge-score for BLEU/ROUGE scores (a ROUGE-L evaluator sketch follows the example below).
- Embedding Metrics: Included with langchain-openai.
Example:
pip install nltk rouge-score
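With rouge-score installed, a traditional overlap metric can be wrapped in the same run/example evaluator shape used throughout this guide. The sketch below is one possible wrapper, not a LangChain built-in; the evaluate_rouge_l name and the "result"/"answer" keys simply mirror the setup above.
from typing import Dict, Any

from rouge_score import rouge_scorer

def evaluate_rouge_l(run, example) -> Dict[str, Any]:
    # Overlap-based metric: ROUGE-L F1 between the prediction and the reference answer
    prediction = run.outputs.get("result", "")
    reference = example.outputs.get("answer", "")
    if not reference:
        return {"key": "rouge_l", "score": None, "comment": "No reference provided."}
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    return {"key": "rouge_l", "score": rouge_l, "comment": "ROUGE-L F1 against the reference answer."}
Such a function can be passed to evaluate() alongside the LLM-based evaluators, for example evaluators=[evaluate_correctness, evaluate_rouge_l].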
For detailed installation guidance, see LangSmith Documentation.
Configuration Options
Customize custom metrics during setup:
- Custom Evaluator Class:
- Extend StringEvaluator to define bespoke logic.
- Example:
class CustomEvaluator(StringEvaluator):
    def _evaluate_strings(self, prediction: str, **kwargs) -> dict:
        score = 1.0 if "detailed" in prediction.lower() else 0.5
        return {"score": score, "reasoning": "Checks for detailed content."}
- LLM-Based Criteria:
- Use LLM judgments for subjective metrics like specificity or tone.
- Example:
evaluator = load_evaluator(
    EvaluatorType.CRITERIA,
    criteria={"specificity": "Is the response detailed and specific?"},
    llm=llm
)
- Dataset Configuration:
- Include input-output pairs or open-ended inputs for flexible evaluation.
- Example:
client.create_example(
    inputs={"question": "Explain AI ethics."},
    outputs={},
    dataset_id=dataset.id
)
- LangSmith Integration:
- Track experiments with metadata for reproducibility.
- Example:
evaluate(..., experiment_prefix="custom_test", metadata={"version": "1.0"})
Core Techniques for Custom Metrics
1. Extending StringEvaluator
Create custom evaluators by subclassing StringEvaluator for reusable metrics.
- Specificity Evaluator:
- Assesses response detail using an LLM.
- Example (as shown above):
class SpecificityEvaluator(StringEvaluator):
    def __init__(self, llm):
        self.llm = llm

    def _evaluate_strings(self, prediction: str, input: str = None, **kwargs) -> Dict[str, Any]:
        evaluator = load_evaluator(
            EvaluatorType.CRITERIA,
            criteria={"specificity": "Is the response detailed and specific to the input?"},
            llm=self.llm
        )
        result = evaluator.evaluate_strings(prediction=prediction, input=input)
        return {"key": "specificity", "score": result["score"], "comment": result.get("reasoning", "")}
- Tone Evaluator:
- Evaluates whether the response matches a desired tone (e.g., formal).
- Example:
class ToneEvaluator(StringEvaluator):
    def _evaluate_strings(self, prediction: str, **kwargs) -> Dict[str, Any]:
        score = 1.0 if any(word in prediction.lower() for word in ["formal", "dear", "respected"]) else 0.5
        return {
            "key": "tone",
            "score": score,
            "reasoning": "Checks for formal tone based on specific keywords."
        }
2. Rule-Based Custom Metrics
Define metrics using deterministic logic for specific patterns or conditions.
- Keyword-Based Metric:
- Scores outputs based on the presence of key terms.
- Example:
def evaluate_keyword_presence(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    keywords = ["Paris", "landmark", "capital"]
    matches = sum(1 for keyword in keywords if keyword.lower() in prediction.lower())
    score = matches / len(keywords)
    return {
        "key": "keyword_presence",
        "score": score,
        "comment": f"Found {matches}/{len(keywords)} keywords."
    }
3. LLM-Based Custom Criteria
Use LLMs to evaluate custom criteria tailored to the application.
- Domain Accuracy:
- Assesses accuracy in a specific domain (e.g., technical terminology).
- Example:
def evaluate_domain_accuracy(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(
        EvaluatorType.CRITERIA,
        criteria={"domain_accuracy": "Does the response use accurate technical terminology?"},
        llm=llm
    )
    result = evaluator.evaluate_strings(prediction=prediction, input=question)
    return {
        "key": "domain_accuracy",
        "score": result["score"],
        "comment": result.get("reasoning", "")
    }
4. Combining Custom and Built-in Metrics
Integrate custom metrics with built-in evaluators for comprehensive assessment.
- Hybrid Evaluation:
- Combine correctness (built-in) with specificity (custom).
- Example:
evaluators = [evaluate_correctness, evaluate_specificity]
results = evaluate(
    lambda inputs: {"result": qa_chain.invoke({"query": inputs["question"]})["result"]},
    data=dataset_name,
    evaluators=evaluators
)
5. Pairwise Custom Metrics
Compare two outputs to assess relative quality.
- Custom Pairwise Evaluator:
- Uses an LLM to judge which output better meets a custom criterion.
- Example:
def evaluate_pairwise_specificity(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("result", "")
    prediction_b = "Alternative response"  # Placeholder
    question = example.inputs.get("question", "")
    evaluator = load_evaluator(
        EvaluatorType.PAIRWISE_STRING,
        llm=llm,
        criteria={"specificity": "Which response is more detailed and specific?"}
    )
    result = evaluator.evaluate_string_pairs(
        prediction=prediction,
        prediction_b=prediction_b,
        input=question
    )
    return {
        "key": "pairwise_specificity",
        "score": result["score"],
        "comment": result.get("reasoning", "")
    }
Comprehensive Example
Here’s a complete system evaluating a LangChain agent with custom and built-in metrics using LangSmith, integrated with Chroma and MongoDB Atlas:
from langchain_chroma import Chroma
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.evaluation import load_evaluator, EvaluatorType, StringEvaluator
from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool
from langchain_community.tools import DuckDuckGoSearchRun
from langsmith import Client
from langsmith.evaluation import evaluate
from pymongo import MongoClient
import os
import logging
import time
from typing import Dict, Any
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Set LangSmith environment variables
os.environ["LANGSMITH_API_KEY"] = ""  # Add your LangSmith API key
os.environ["LANGSMITH_PROJECT"] = "agent-custom-metrics"
# Initialize embeddings and language model
embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# Create sample documents
documents = [
    Document(page_content="The capital of France is Paris.", metadata={"source": "geo"}),
    Document(page_content="The Eiffel Tower is in Paris.", metadata={"source": "landmark"})
]
# Initialize Chroma and MongoDB Atlas vector stores
chroma_store = Chroma.from_documents(
    documents,
    embedding_function,
    collection_name="langchain_example",
    persist_directory="./chroma_db"
)
client = MongoClient("mongodb+srv://:@.mongodb.net/")  # Replace with your MongoDB Atlas connection string
collection = client["langchain_db"]["example_collection"]
mongo_store = MongoDBAtlasVectorSearch.from_documents(
    documents,
    embedding_function,
    collection=collection,
    index_name="vector_index"
)
# Set up search tool and agent
search = DuckDuckGoSearchRun()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Useful for answering questions about recent events or general knowledge."
    )
]
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
# Initialize LangSmith client
ls_client = Client()
# Create or load dataset
dataset_name = "agent_custom_metrics_dataset"
try:
    dataset = ls_client.create_dataset(dataset_name=dataset_name)
except Exception:
    # Dataset already exists; load the existing one
    dataset = ls_client.read_dataset(dataset_name=dataset_name)
# Add examples to dataset
examples = [
{"input": "What is the capital of France?", "output": "Paris"},
{"input": "Describe Paris landmarks.", "output": ""} # Open-ended
]
for example in examples:
    ls_client.create_example(
        inputs={"question": example["input"]},
        outputs={"answer": example["output"]},
        dataset_id=dataset.id
    )
# Define custom evaluator
class SpecificityEvaluator(StringEvaluator):
    def __init__(self, llm):
        self.llm = llm

    def _evaluate_strings(self, prediction: str, input: str = None, **kwargs) -> Dict[str, Any]:
        evaluator = load_evaluator(
            EvaluatorType.CRITERIA,
            criteria={"specificity": "Is the response detailed and specific to the input?"},
            llm=self.llm
        )
        result = evaluator.evaluate_strings(prediction=prediction, input=input)
        return {"key": "specificity", "score": result["score"], "comment": result.get("reasoning", "")}
# Define evaluation functions
def evaluate_correctness(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("output", "")
    reference = example.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    if not reference:
        return {"key": "correctness", "score": None, "comment": "No reference provided."}
    evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
    result = evaluator.evaluate_strings(
        prediction=prediction,
        reference=reference,
        input=question
    )
    return {"key": "correctness", "score": result["score"], "comment": result.get("reasoning", "")}
def evaluate_specificity(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("output", "")
    question = example.inputs.get("question", "")
    evaluator = SpecificityEvaluator(llm=llm)
    result = evaluator.evaluate_strings(prediction=prediction, input=question)
    return result
def evaluate_tool_usage(run, example) -> Dict[str, Any]:
    prediction = run.outputs.get("output", "")
    question = example.inputs.get("question", "")
    score = 1.0 if "search" in prediction.lower() else 0.5
    return {
        "key": "tool_usage",
        "score": score,
        "comment": "Checks if the search tool was referenced in the response."
    }
# Run evaluation
start_time = time.time()
results = evaluate(
    lambda inputs: {"output": agent.run(inputs["question"])},
    data=dataset_name,
    evaluators=[evaluate_correctness, evaluate_specificity, evaluate_tool_usage],
    experiment_prefix="agent_custom_metrics",
    metadata={"version": "1.0", "evaluated_at": "2025-05-15T15:25:00Z"}
)
# Log results
logger.info(f"Evaluation completed in {time.time() - start_time:.2f} seconds")
print(f"Evaluation Results: {results}")
print("View detailed results in LangSmith dashboard under 'agent_custom_metrics' experiment.")
Output:
Evaluation completed in 8.23 seconds
Evaluation Results:
View detailed results in LangSmith dashboard under 'agent_custom_metrics' experiment.
The evaluation runs automated metrics, including a custom SpecificityEvaluator and a rule-based tool usage metric, with results logged in LangSmith for detailed analysis.
Best Practices
- Align Metrics with Goals: Design custom metrics to reflect project-specific requirements (e.g., domain accuracy, tone).
- Combine Metrics: Use custom metrics alongside built-in evaluators for comprehensive assessment.
- Validate Custom Logic: Test custom evaluators on sample data to ensure accuracy and consistency (see the sketch after this list).
- Optimize LLM Usage: Use cost-effective LLMs (e.g., gpt-3.5-turbo) for evaluation to balance cost and quality.
- Integrate with LangSmith: Leverage LangSmith for dataset management, tracking, and visualization.
- Iterate on Feedback: Refine components based on custom metric scores and reasoning.
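As a concrete way to follow the "Validate Custom Logic" point above, an evaluator function can be called directly with lightweight stand-ins for the LangSmith run and example objects before it is wired into evaluate(). A minimal sketch, assuming the evaluate_keyword_presence function from the Core Techniques section is defined; fake_run and fake_example are hypothetical stand-ins, not LangSmith classes:
from types import SimpleNamespace

# Stand-ins exposing only the attributes the evaluator reads (outputs/inputs)
fake_run = SimpleNamespace(outputs={"result": "Paris is the capital of France."})
fake_example = SimpleNamespace(inputs={"question": "What is the capital of France?"}, outputs={})

result = evaluate_keyword_presence(fake_run, fake_example)
assert 0.0 <= result["score"] <= 1.0
assert abs(result["score"] - 2 / 3) < 1e-9  # "Paris" and "capital" match, "landmark" does not
print(result)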
Error Handling
- LLM Failures: Implement retries or fallback models for evaluation errors (see the retry sketch after this list).
- Dataset Issues: Validate dataset format to avoid parsing errors.
- Logic Errors: Test custom metric logic to prevent runtime exceptions.
- Resource Limits: Batch evaluations to manage API costs and rate limits.
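For the LLM-failure case above, one option is to wrap each evaluator function in a small retry helper so a transient API error (rate limit, timeout) degrades to a null score instead of aborting the experiment. A minimal sketch; with_retries is a hypothetical helper, not part of LangChain or LangSmith:
import logging
import time
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)

def with_retries(evaluator_fn: Callable[..., Dict[str, Any]], attempts: int = 3, backoff: float = 2.0):
    # Return a wrapped evaluator that retries on exceptions and reports a null score on final failure
    def wrapped(run, example) -> Dict[str, Any]:
        for attempt in range(1, attempts + 1):
            try:
                return evaluator_fn(run, example)
            except Exception as exc:
                logger.warning("Evaluator %s failed (attempt %d/%d): %s",
                               evaluator_fn.__name__, attempt, attempts, exc)
                if attempt == attempts:
                    return {"key": evaluator_fn.__name__, "score": None,
                            "comment": f"Evaluation failed after {attempts} attempts: {exc}"}
                time.sleep(backoff * attempt)
    return wrapped

# Usage: evaluators=[with_retries(evaluate_correctness), with_retries(evaluate_specificity)]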
See Troubleshooting.
Limitations
- LLM Bias: Judgment-based custom metrics may vary by model or prompt.
- Development Overhead: Creating and testing custom metrics requires additional effort.
- Cost: LLM-based evaluations can be expensive for large datasets.
- Complexity: Complex metrics may require careful tuning to avoid false positives/negatives.
Recent Developments
- 2025 Updates: LangSmith introduced templates for custom evaluator creation, simplifying metric development.
- Community Feedback: X posts highlight custom metrics for evaluating chatbot tone and domain-specific accuracy in healthcare.
- LangSmith Enhancements: Improved support for custom metric visualization in the dashboard.
Conclusion
Custom metrics in LangChain enable developers to tailor AI performance assessment to specific use cases, enhancing evaluation precision and flexibility. By extending StringEvaluator or defining custom functions, and integrating with LangSmith, developers can assess unique qualities like specificity or tool usage, optimizing chains, agents, and retrievers. Start leveraging custom metrics to refine your LangChain projects, ensuring outputs meet precise performance goals.
For official documentation, visit LangSmith Documentation.