Claude
Skills
Sign in
Back

vector-database-management

Included with Lifetime
$97 forever

Comprehensive guide for managing vector databases including Pinecone, Weaviate, and Chroma for semantic search, RAG systems, and similarity-based applications

data-engineeringvector-databaseembeddingssemantic-searchragpineconeweaviatechromasimilarity-search

What this skill does


# Vector Database Management

## Table of Contents

1. [Introduction](#introduction)
2. [Vector Embeddings Fundamentals](#vector-embeddings-fundamentals)
3. [Database Setup & Configuration](#database-setup--configuration)
4. [Index Operations](#index-operations)
5. [Vector Operations](#vector-operations)
6. [Similarity Search](#similarity-search)
7. [Metadata Filtering](#metadata-filtering)
8. [Hybrid Search](#hybrid-search)
9. [Namespace & Collection Management](#namespace--collection-management)
10. [Performance & Scaling](#performance--scaling)
11. [Production Best Practices](#production-best-practices)
12. [Cost Optimization](#cost-optimization)

## Introduction

Vector databases are specialized systems designed to store, index, and query high-dimensional vector embeddings efficiently. They power modern AI applications including semantic search, recommendation systems, RAG (Retrieval Augmented Generation), and similarity-based matching.

### Key Concepts

- **Vector Embeddings**: Numerical representations of data (text, images, audio) in high-dimensional space
- **Similarity Search**: Finding vectors that are "close" to a query vector using distance metrics
- **Metadata Filtering**: Combining vector similarity with structured data filtering
- **Indexing**: Optimization structures (HNSW, IVF, etc.) for fast approximate nearest neighbor search

### Database Comparison

| Feature | Pinecone | Weaviate | Chroma |
|---------|----------|----------|--------|
| Deployment | Fully managed | Managed or self-hosted | Self-hosted or cloud |
| Index Types | Serverless, Pods | HNSW | HNSW |
| Metadata Filtering | Advanced | GraphQL-based | Simple |
| Hybrid Search | Sparse-Dense | Built-in | Limited |
| Scale | Massive | Large | Small-Medium |
| Best For | Production RAG | Knowledge graphs | Local development |

## Vector Embeddings Fundamentals

### Understanding Vector Representations

Vector embeddings transform unstructured data into numerical arrays that capture semantic meaning:

```python
# Text to embeddings using OpenAI
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

def generate_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Generate embeddings from text using OpenAI."""
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Example usage
text = "Vector databases enable semantic search capabilities"
embedding = generate_embedding(text)
print(f"Embedding dimension: {len(embedding)}")  # 1536 dimensions
print(f"First 5 values: {embedding[:5]}")
```

### Popular Embedding Models

```python
# 1. OpenAI Embeddings (Production-grade)
from openai import OpenAI

def openai_embeddings(texts: list[str]) -> list[list[float]]:
    """Batch generate OpenAI embeddings."""
    client = OpenAI(api_key="YOUR_API_KEY")
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-large"  # 3072 dimensions
    )
    return [item.embedding for item in response.data]

# 2. Sentence Transformers (Open-source)
from sentence_transformers import SentenceTransformer

def sentence_transformer_embeddings(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using Sentence Transformers."""
    model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions
    embeddings = model.encode(texts)
    return embeddings.tolist()

# 3. Cohere Embeddings
import cohere

def cohere_embeddings(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using Cohere."""
    co = cohere.Client("YOUR_API_KEY")
    response = co.embed(
        texts=texts,
        model="embed-english-v3.0",
        input_type="search_document"
    )
    return response.embeddings
```

### Embedding Dimensions & Trade-offs

```python
# Different embedding models for different use cases
EMBEDDING_CONFIGS = {
    "openai-small": {
        "model": "text-embedding-3-small",
        "dimensions": 1536,
        "cost_per_1m": 0.02,
        "use_case": "General purpose, cost-effective"
    },
    "openai-large": {
        "model": "text-embedding-3-large",
        "dimensions": 3072,
        "cost_per_1m": 0.13,
        "use_case": "High accuracy requirements"
    },
    "sentence-transformers": {
        "model": "all-MiniLM-L6-v2",
        "dimensions": 384,
        "cost_per_1m": 0.00,  # Open-source
        "use_case": "Local development, privacy-sensitive"
    },
    "cohere-multilingual": {
        "model": "embed-multilingual-v3.0",
        "dimensions": 1024,
        "cost_per_1m": 0.10,
        "use_case": "Multi-language applications"
    }
}
```

## Database Setup & Configuration

### Pinecone Setup

```python
# Install Pinecone SDK
# pip install pinecone-client

from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone client
pc = Pinecone(api_key="YOUR_API_KEY")

# List existing indexes
indexes = pc.list_indexes()
print(f"Existing indexes: {[idx.name for idx in indexes]}")

# Create serverless index (recommended for production)
index_name = "production-search"

if index_name not in [idx.name for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=1536,  # Match your embedding model
        metric="cosine",  # cosine, dotproduct, or euclidean
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ),
        deletion_protection="enabled",  # Prevent accidental deletion
        tags={
            "environment": "production",
            "team": "ml",
            "project": "semantic-search"
        }
    )
    print(f"Created index: {index_name}")

# Connect to index
index = pc.Index(index_name)

# Get index stats
stats = index.describe_index_stats()
print(f"Index stats: {stats}")
```

### Selective Metadata Indexing (Pinecone)

```python
# Configure which metadata fields to index for filtering
# This optimizes memory usage and query performance

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create index with metadata configuration
pc.create_index(
    name="optimized-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1",
        schema={
            "fields": {
                # Index these fields for filtering
                "document_id": {"filterable": True},
                "category": {"filterable": True},
                "created_at": {"filterable": True},
                "tags": {"filterable": True},
                # Store but don't index (saves memory)
                "document_title": {"filterable": False},
                "document_url": {"filterable": False},
                "full_content": {"filterable": False}
            }
        }
    )
)

# This configuration allows you to:
# 1. Filter by document_id, category, created_at, tags
# 2. Retrieve document_title, document_url, full_content in results
# 3. Save memory by not indexing non-filterable fields
```

### Weaviate Setup

```python
# Install Weaviate client
# pip install weaviate-client

import weaviate
from weaviate.classes.config import Configure, Property, DataType

# Connect to Weaviate
client = weaviate.connect_to_local()

# Or connect to Weaviate Cloud
# client = weaviate.connect_to_wcs(
#     cluster_url="YOUR_WCS_URL",
#     auth_credentials=weaviate.auth.AuthApiKey("YOUR_API_KEY")
# )

# Create collection (schema)
try:
    collection = client.collections.create(
        name="Documents",
        vectorizer_config=Configure.Vectorizer.text2vec_openai(
            model="text-embedding-3-small"
        ),
        properties=[
            Property(name="title", data_type=DataType.TEXT),
            Property(name="content", data_type=DataType.TEXT),
            Property(name="category", data_type=DataType.TEXT),
            Property(name="created_at", data_type=DataType.DATE),
            Property(name="tags", data_type=DataType.TEXT_ARRAY)
        ]
   

Related in data-engineering