Harnessing LangChain’s Weaviate Vector Store for Powerful Similarity Search
Introduction
In the rapidly advancing field of artificial intelligence, efficiently retrieving relevant information from large datasets is essential for applications such as semantic search, question-answering systems, recommendation engines, and conversational AI. LangChain, a robust framework for building AI-driven solutions, integrates the Weaviate vector database to provide a high-performance vector store for similarity search. This comprehensive guide explores the Weaviate vector store’s setup, core features, performance optimization, practical applications, and advanced configurations, offering developers detailed insights to create scalable, context-aware systems.
To understand LangChain’s broader ecosystem, start with LangChain Fundamentals.
What is the Weaviate Vector Store?
LangChain’s Weaviate vector store leverages Weaviate, an open-source, cloud-native vector database designed for high-speed similarity search on high-dimensional vector embeddings. Weaviate combines vector search with structured data storage, enabling semantic queries and advanced filtering. The Weaviate vector store in LangChain, provided via the langchain_weaviate package, simplifies integration while supporting features like hybrid search, generative search, and GraphQL-based querying, making it ideal for tasks requiring semantic understanding.
For a primer on vector stores, see Vector Stores Introduction.
Why Weaviate?
Weaviate excels in scalability, flexibility, and developer experience, handling millions of vectors with low latency. It supports vector-based and keyword-based search, advanced filtering, and integration with generative AI models. LangChain’s implementation abstracts Weaviate’s complexities, offering a seamless interface for AI applications.
Explore Weaviate’s capabilities at the Weaviate Documentation.
Setting Up the Weaviate Vector Store
To use the Weaviate vector store, you need an embedding function to convert text into vectors. LangChain supports providers like OpenAI, HuggingFace, and custom models. Below is a basic setup using OpenAI embeddings with a local Weaviate instance:
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain_openai import OpenAIEmbeddings
import weaviate
import weaviate.classes as wvc
# Connect to a locally running Weaviate instance (see Installation below)
client = weaviate.connect_to_local()
# Embedding model used to encode documents and queries into dense vectors
embedding_function = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = WeaviateVectorStore.from_documents(
documents=[],
embedding=embedding_function,
client=client,
index_name="LangChainExample"
)
This initializes a Weaviate vector store with an empty document set, connecting to a local Weaviate instance. The embedding_function generates dense vectors (e.g., 3072 dimensions for OpenAI’s text-embedding-3-large).
For alternative embedding options, visit Custom Embeddings.
Installation
Install the required packages:
pip install langchain-weaviate langchain-openai weaviate-client
For sparse retrieval (e.g., BM25), no additional dependencies are needed, as Weaviate includes built-in support. Run a local Weaviate instance using Docker:
docker run -d -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:latest
For Weaviate Cloud (WCS), obtain an API key and cluster URL from the Weaviate Console. Set environment variables (WEAVIATE_URL, WEAVIATE_API_KEY) or pass them directly to the client.
For detailed installation guidance, see Weaviate Integration.
Configuration Options
Customize the Weaviate vector store during initialization:
- client: A weaviate.WeaviateClient instance (the v4 client used by langchain_weaviate).
- embedding: Embedding function for dense vectors.
- index_name: Name of the Weaviate class (capitalized, e.g., LangChainExample).
- text_key: Property name for document content (default: text).
- attributes: Additional metadata properties to index (default: None).
- vectorizer: Weaviate vectorizer module (e.g., text2vec-openai; default: none for external embeddings).
- by_text: Boolean to enable text-based search (default: False).
Example with Weaviate Cloud:
import os

client = weaviate.connect_to_wcs(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=wvc.init.Auth.api_key(os.environ["WEAVIATE_API_KEY"])
)
vector_store = WeaviateVectorStore(
client=client,
index_name="LangChainExample",
embedding=embedding_function,
text_key="content",
attributes=["source", "id"]
)
Core Features
1. Indexing Documents
Indexing is the cornerstone of similarity search, enabling Weaviate to store and organize embeddings for rapid retrieval. The Weaviate vector store supports indexing raw texts, pre-computed embeddings, and documents with metadata, offering flexibility for various use cases.
- Key Methods:
- from_documents(documents, embedding, client, index_name, text_key="text", by_text=False, **kwargs): Creates a vector store from a list of Document objects.
- Parameters:
- documents: List of Document objects with page_content and optional metadata.
- embedding: Embedding function for dense vectors.
- client: Weaviate client instance.
- index_name: Weaviate class name.
- text_key: Property for document content.
- by_text: Use Weaviate’s vectorizer instead of external embeddings.
- Returns: A WeaviateVectorStore instance.
- from_texts(texts, embedding, client, index_name, metadatas=None, text_key="text", by_text=False, **kwargs): Creates a vector store from a list of texts.
- add_documents(documents, **kwargs): Adds documents to an existing class.
- Parameters:
- documents: List of Document objects.
- Returns: List of object UUIDs.
- add_texts(texts, metadatas=None, **kwargs): Adds texts to an existing class (see the incremental indexing sketch after the examples below).
- Class and Schema:
- Weaviate organizes data in classes, with each class defining a schema for vectors and properties.
- By default, LangChain creates a class with a text property and metadata fields (e.g., source, id).
- Example Schema (auto-generated):
{ "class": "LangChainExample", "properties": [ {"name": "text", "dataType": ["text"]}, {"name": "source", "dataType": ["text"]}, {"name": "id", "dataType": ["int"]} ], "vectorizer": "none" }
- Vectorization:
- External embeddings (via embedding) are stored directly.
- Weaviate’s vectorizer (e.g., text2vec-openai) can be used with by_text=True:
vector_store = WeaviateVectorStore.from_documents(
    documents=[],
    client=client,
    index_name="LangChainExample",
    by_text=True,
    vectorizer="text2vec-openai"
)
- Example (Dense Indexing):
from langchain_core.documents import Document

documents = [
    Document(page_content="The sky is blue.", metadata={"source": "sky", "id": 1}),
    Document(page_content="The grass is green.", metadata={"source": "grass", "id": 2})
]
vector_store = WeaviateVectorStore.from_documents(
    documents,
    embedding=embedding_function,
    client=client,
    index_name="LangChainExample",
    attributes=["source", "id"]
)
- Example (Hybrid Indexing with BM25):
Weaviate builds a BM25 inverted index automatically alongside the vector index, so the same indexing call used for dense retrieval also enables hybrid (vector plus keyword) search, with no extra configuration:
vector_store = WeaviateVectorStore.from_documents(
documents,
embedding=embedding_function,
client=client,
index_name="LangChainExample"
)
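Once a class exists, new content can be added incrementally instead of rebuilding the store. A minimal sketch, assuming the vector_store created above; both methods return the UUIDs of the objects they create:
new_docs = [
    Document(page_content="Rivers carve deep canyons.", metadata={"source": "river", "id": 3})
]
uuids = vector_store.add_documents(new_docs)  # one UUID per object

ids = vector_store.add_texts(
    ["Mountains rise above the plains."],
    metadatas=[{"source": "mountain", "id": 4}]
)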
For advanced indexing, see Document Indexing.
2. Similarity Search
Similarity search retrieves documents closest to a query based on vector similarity, powering applications like semantic search and question answering.
- Key Methods:
- similarity_search(query, k=4, where_filter=None, **kwargs): Searches for the top k documents using vector similarity.
- Parameters:
- query: Input text.
- k: Number of results (default: 4).
- where_filter: Optional Weaviate filter (GraphQL-style).
- kwargs: Additional query parameters (e.g., alpha for hybrid search).
- Returns: List of Document objects.
- similarity_search_with_score(query, k=4, where_filter=None, **kwargs): Returns tuples of (Document, score), where scores are normalized (0 to 1 for cosine).
- similarity_search_by_vector(embedding, k=4, where_filter=None, **kwargs): Searches using a pre-computed embedding.
- max_marginal_relevance_search(query, k=4, fetch_k=20, lambda_mult=0.5, where_filter=None, **kwargs): Uses Maximal Marginal Relevance (MMR) to balance relevance and diversity (see the sketch after the search examples below).
- Parameters:
- fetch_k: Number of candidates to fetch (default: 20).
- lambda_mult: Diversity weight (0 for maximum diversity, 1 for maximum relevance; default: 0.5).
- Distance Metrics:
- Weaviate uses cosine distance by default, with options for l2-squared, dot, hamming, or manhattan via the index configuration.
- Set in the class schema:
client.collections.create(
    name="LangChainExample",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    vector_index_config=wvc.config.Configure.VectorIndex.hnsw(
        distance_metric=wvc.config.VectorDistances.COSINE
    )
)
- Example (Vector Similarity Search):
query = "What is blue?" results = vector_store.similarity_search_with_score( query, k=2, where_filter={ "path": ["source"], "operator": "Equal", "valueText": "sky" } ) for doc, score in results: print(f"Text: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
- Example (Hybrid Search):
Combine vector and BM25 search with alpha weighting:
results = vector_store.similarity_search(
query,
k=2,
alpha=0.5, # 50% vector, 50% BM25
search_type="hybrid"
)
for doc in results:
print(f"Hybrid Text: {doc.page_content}, Metadata: {doc.metadata}")
- Search Parameters:
- Use additional in kwargs to include raw scores or vector data.
- Example:
results = vector_store.similarity_search(
    query,
    k=2,
    additional=["certainty"]  # Returns cosine similarity score
)
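MMR helps when the nearest neighbors are near-duplicates of each other. A minimal sketch using the same vector_store; lambda_mult trades relevance against diversity:
mmr_results = vector_store.max_marginal_relevance_search(
    "What is blue?",
    k=2,
    fetch_k=10,        # candidates fetched before re-ranking
    lambda_mult=0.5    # 0 = maximize diversity, 1 = maximize relevance
)
for doc in mmr_results:
    print(f"MMR Text: {doc.page_content}")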
For querying strategies, see Querying Vector Stores.
3. Metadata Filtering
Metadata filtering refines search results using Weaviate’s GraphQL-style where clause, supporting complex conditions like equality, ranges, and logical operators.
- Filter Syntax:
- Filters use a dictionary with path, operator, and value fields (e.g., valueText, valueInt, valueNumber).
- Operators: Equal, NotEqual, GreaterThan, LessThan, ContainsAny, Like, etc.
- Example:
where_filter = { "operator": "And", "operands": [ {"path": ["source"], "operator": "Equal", "valueText": "sky"}, {"path": ["id"], "operator": "GreaterThan", "valueInt": 0} ] } results = vector_store.similarity_search(query, k=2, where_filter=where_filter)
- Advanced Filtering:
- Supports nested properties, wildcard matching (Like, using * and ?), and geo-location queries.
- Example (Geo Filter):
where_filter = { "path": ["location"], "operator": "WithinGeoRange", "valueGeoRange": { "geoCoordinates": {"latitude": 40.0, "longitude": -74.0}, "distance": {"max": 1000} } }
For advanced filtering, see Metadata Filtering.
4. Persistence and Serialization
Weaviate provides persistent storage, managed locally or in the cloud.
- Key Methods:
- from_texts(texts, embedding, client, index_name, metadatas=None, **kwargs): Creates a new class or adds to an existing one.
- delete(where_filter=None, **kwargs): Deletes objects matching a filter.
- Parameters:
- where_filter: GraphQL-style filter to select objects.
- delete_by_id(uuid, **kwargs): Deletes a specific object by UUID.
- Example:
vector_store = WeaviateVectorStore.from_texts(
    texts=["The sky is blue."],
    embedding=embedding_function,
    client=client,
    index_name="LangChainExample"
)
vector_store.delete(
    where_filter={"path": ["source"], "operator": "Equal", "valueText": "sky"}
)
- Storage Modes:
- Local: Persistent storage via Docker or embedded Weaviate.
- Weaviate Cloud: Managed storage with cluster URL and API key.
- Embedded: Lightweight option that runs Weaviate in-process, ideal for testing (data still persists to disk).
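For quick experiments without Docker, the v4 client can launch an embedded instance. A minimal sketch (connect_to_embedded() downloads and starts a Weaviate binary inside the Python process):
import weaviate

# Starts an in-process Weaviate; data is persisted to a local directory
client = weaviate.connect_to_embedded()
print(client.is_ready())  # True once the embedded instance is up
client.close()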
5. Document Store Management
Weaviate stores data as objects in a class, with vectors and properties.
- Object Structure:
- Each object includes:
- uuid: Unique identifier (auto-generated).
- vector: Embedding vector.
- properties: Dictionary with text_key (e.g., text) and metadata (e.g., source, id).
- Example Object:
{ "class": "LangChainExample", "id": "", "properties": { "text": "The sky is blue.", "source": "sky", "id": 1 }, "vector": [0.1, 0.2, ...] }
- Custom Properties:
- Specify attributes to index specific metadata fields.
- Example:
vector_store = WeaviateVectorStore(
    client=client,
    index_name="LangChainExample",
    embedding=embedding_function,
    attributes=["source", "id"]
)
- Example:
documents = [
    Document(page_content="The sky is blue.", metadata={"source": "sky", "id": 1})
]
vector_store.add_documents(documents)
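Because add_documents returns object UUIDs, individual objects can be removed without constructing a filter. A minimal sketch, assuming the standard LangChain delete(ids=...) signature:
# Capture UUIDs at insert time, then delete a specific object later
uuids = vector_store.add_documents(documents)
vector_store.delete(ids=[uuids[0]])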
Performance Optimization
Weaviate is optimized for speed and scalability, but performance depends on configuration.
Index Configuration
- Vector Index:
- Use HNSW for high-speed searches or FLAT for exact searches.
- Configure efConstruction and maxConnections for HNSW:
client.collections.create(
    name="LangChainExample",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    vector_index_config=wvc.config.Configure.VectorIndex.hnsw(
        ef_construction=128,
        max_connections=32
    )
)
- BM25 Weighting:
- Adjust bm25 parameters for hybrid search:
results = vector_store.similarity_search(
    query,
    k=2,
    alpha=0.5,
    search_type="hybrid",
    bm25_k1=1.2,
    bm25_b=0.75
)
Search Optimization
- Batch Upserts: Use batch_size in add_texts or add_documents to optimize throughput (see the chunking sketch below).
- Query Limits: Set k to a reasonable value to reduce latency.
- Hybrid Tuning: Tune alpha (0 for BM25, 1 for vector) for optimal relevance.
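For large corpora, inserting in fixed-size chunks bounds memory use and request size. A minimal client-side sketch (the batch_size kwarg mentioned above is an alternative where your langchain-weaviate version supports it):
def index_in_chunks(vector_store, documents, chunk_size=100):
    # Insert documents in chunks so no single request grows unbounded
    for start in range(0, len(documents), chunk_size):
        vector_store.add_documents(documents[start:start + chunk_size])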
For optimization tips, see Vector Store Performance and Weaviate Documentation.
Practical Applications
Weaviate powers diverse AI applications:
- Semantic Search:
- Index documents for natural language queries.
- Example: A knowledge base for technical manuals.
- Question Answering:
- Use in a RAG pipeline to fetch context (see the retriever sketch after this list).
- See RetrievalQA Chain.
- Recommendation Systems:
- Index product descriptions for personalized recommendations.
- Chatbot Context:
- Store conversation history for context-aware responses.
- Explore Chat History Chain.
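For RAG pipelines, the vector store plugs directly into LangChain’s retriever interface via the standard as_retriever method; a minimal sketch:
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
docs = retriever.invoke("What is blue?")  # top-2 matching Documents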
Try the Document Search Engine Tutorial.
Comprehensive Example
Here’s a complete semantic search system with hybrid search and metadata filtering:
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
import weaviate
import weaviate.classes as wvc
# Initialize client and embeddings
client = weaviate.connect_to_local()
embedding_function = OpenAIEmbeddings(model="text-embedding-3-large")
# Create documents
documents = [
Document(page_content="The sky is blue and vast.", metadata={"source": "sky", "id": 1}),
Document(page_content="The grass is green and lush.", metadata={"source": "grass", "id": 2}),
Document(page_content="The sun is bright and warm.", metadata={"source": "sun", "id": 3})
]
# Initialize vector store
vector_store = WeaviateVectorStore.from_documents(
documents,
embedding=embedding_function,
client=client,
index_name="LangChainExample",
attributes=["source", "id"]
)
# Similarity search
query = "What is blue?"
results = vector_store.similarity_search_with_score(
query,
k=2,
where_filter={
"path": ["source"],
"operator": "Equal",
"valueText": "sky"
}
)
for doc, score in results:
print(f"Text: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
# Hybrid search
results = vector_store.similarity_search(
query,
k=2,
alpha=0.5,
search_type="hybrid"
)
for doc in results:
print(f"Hybrid Text: {doc.page_content}, Metadata: {doc.metadata}")
# MMR search
mmr_results = vector_store.max_marginal_relevance_search(
query,
k=2,
fetch_k=10
)
for doc in mmr_results:
print(f"MMR Text: {doc.page_content}, Metadata: {doc.metadata}")
# Delete objects
vector_store.delete(where_filter={"path": ["source"], "operator": "Equal", "valueText": "sky"})
client.close()
Output:
Text: The sky is blue and vast., Metadata: {'source': 'sky', 'id': 1}, Score: 0.8766
Hybrid Text: The sky is blue and vast., Metadata: {'source': 'sky', 'id': 1}
Hybrid Text: The grass is green and lush., Metadata: {'source': 'grass', 'id': 2}
MMR Text: The sky is blue and vast., Metadata: {'source': 'sky', 'id': 1}
MMR Text: The sun is bright and warm., Metadata: {'source': 'sun', 'id': 3}
Error Handling
Common issues include:
- Connection Errors: Verify Weaviate URL, API key, and network settings.
- Dimension Mismatch: Ensure embedding dimensions match the class configuration.
- Class Not Found: Create the class before indexing.
- Invalid Filter: Check where_filter syntax for correct operators and value types.
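A defensive readiness check surfaces most of these issues early. A minimal sketch for a local instance:
import weaviate

try:
    client = weaviate.connect_to_local()
    if not client.is_ready():
        raise RuntimeError("Weaviate is running but not ready")
except Exception as exc:  # e.g., connection refused, wrong URL or API key
    print(f"Failed to reach Weaviate: {exc}")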
See Troubleshooting.
Limitations
- Docker Dependency: Local Weaviate requires Docker setup.
- Hybrid Search Tuning: Requires careful alpha adjustment for optimal results.
- Schema Management: Manual schema updates needed for new metadata fields.
- Cloud Costs: Weaviate Cloud may incur costs for large datasets.
Conclusion
LangChain’s Weaviate vector store is a powerful solution for similarity search, combining Weaviate’s scalability with LangChain’s ease of use. Its support for vector, hybrid, and generative search, along with robust filtering and persistence, makes it ideal for semantic search, question answering, and recommendation systems. Start experimenting with Weaviate to build intelligent, scalable AI applications.
For official documentation, visit LangChain Weaviate.