Comprehensive guide for managing vector databases including Pinecone, Weaviate, and Chroma for semantic search, RAG systems, and similarity-based applications
# Vector Database Management
## Table of Contents
1. [Introduction](#introduction)
2. [Vector Embeddings Fundamentals](#vector-embeddings-fundamentals)
3. [Database Setup & Configuration](#database-setup--configuration)
4. [Index Operations](#index-operations)
5. [Vector Operations](#vector-operations)
6. [Similarity Search](#similarity-search)
7. [Metadata Filtering](#metadata-filtering)
8. [Hybrid Search](#hybrid-search)
9. [Namespace & Collection Management](#namespace--collection-management)
10. [Performance & Scaling](#performance--scaling)
11. [Production Best Practices](#production-best-practices)
12. [Cost Optimization](#cost-optimization)
## Introduction
Vector databases are specialized systems designed to store, index, and query high-dimensional vector embeddings efficiently. They power modern AI applications including semantic search, recommendation systems, retrieval-augmented generation (RAG), and similarity-based matching.
### Key Concepts
- **Vector Embeddings**: Numerical representations of data (text, images, audio) in high-dimensional space
- **Similarity Search**: Finding vectors that are "close" to a query vector using distance metrics
- **Metadata Filtering**: Combining vector similarity with structured data filtering
- **Indexing**: Data structures (HNSW, IVF, etc.) that enable fast approximate nearest neighbor (ANN) search
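To make the idea of "close" vectors concrete, here is a minimal pure-Python sketch of the three common distance metrics and a brute-force (exact) nearest neighbor lookup. The function names and the toy three-dimensional vectors are illustrative assumptions, not part of any database SDK; real embeddings have hundreds or thousands of dimensions, and real databases use approximate indexes instead of scanning every vector.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: higher means more similar, but sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance: lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2):
    """Exact search: score every stored vector, return the k best by cosine."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (hypothetical values chosen for illustration)
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

for name, score in top_k(query, vectors):
    print(f"{name}: {score:.3f}")
```

An index such as HNSW trades a small amount of recall for avoiding this full scan, which is what makes search over millions of vectors practical.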
### Database Comparison
| Feature | Pinecone | Weaviate | Chroma |
|---------|----------|----------|--------|
| Deployment | Fully managed | Managed or self-hosted | Self-hosted or cloud |
| Index Types | Serverless, Pods | HNSW, flat, dynamic | HNSW |
| Metadata Filtering | Advanced | GraphQL-based | Simple |
| Hybrid Search | Sparse-Dense | Built-in | Limited |
| Scale | Massive | Large | Small-Medium |
| Best For | Production RAG | Knowledge graphs | Local development |
## Vector Embeddings Fundamentals
### Understanding Vector Representations
Vector embeddings transform unstructured data into numerical arrays that capture semantic meaning:
```python
# Text to embeddings using OpenAI
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")


def generate_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Generate embeddings from text using OpenAI."""
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding


# Example usage
text = "Vector databases enable semantic search capabilities"
embedding = generate_embedding(text)
print(f"Embedding dimension: {len(embedding)}")  # 1536 dimensions
print(f"First 5 values: {embedding[:5]}")
```
### Popular Embedding Models
```python
# 1. OpenAI Embeddings (Production-grade)
from openai import OpenAI


def openai_embeddings(texts: list[str]) -> list[list[float]]:
    """Batch generate OpenAI embeddings."""
    client = OpenAI(api_key="YOUR_API_KEY")
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-large"  # 3072 dimensions
    )
    return [item.embedding for item in response.data]


# 2. Sentence Transformers (Open-source)
from sentence_transformers import SentenceTransformer


def sentence_transformer_embeddings(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using Sentence Transformers."""
    model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions
    embeddings = model.encode(texts)
    return embeddings.tolist()


# 3. Cohere Embeddings
import cohere


def cohere_embeddings(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using Cohere."""
    co = cohere.Client("YOUR_API_KEY")
    response = co.embed(
        texts=texts,
        model="embed-english-v3.0",
        input_type="search_document"
    )
    return response.embeddings
```
### Embedding Dimensions & Trade-offs
```python
# Different embedding models for different use cases
EMBEDDING_CONFIGS = {
    "openai-small": {
        "model": "text-embedding-3-small",
        "dimensions": 1536,
        "cost_per_1m": 0.02,
        "use_case": "General purpose, cost-effective"
    },
    "openai-large": {
        "model": "text-embedding-3-large",
        "dimensions": 3072,
        "cost_per_1m": 0.13,
        "use_case": "High accuracy requirements"
    },
    "sentence-transformers": {
        "model": "all-MiniLM-L6-v2",
        "dimensions": 384,
        "cost_per_1m": 0.00,  # Open-source
        "use_case": "Local development, privacy-sensitive"
    },
    "cohere-multilingual": {
        "model": "embed-multilingual-v3.0",
        "dimensions": 1024,
        "cost_per_1m": 0.10,
        "use_case": "Multi-language applications"
    }
}
```
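The trade-off between dimensions and cost can be estimated with simple arithmetic. The sketch below assumes vectors are stored as 4-byte float32 values with no compression or index overhead, and uses a hypothetical workload of one million chunks at roughly 500 tokens each; the per-token prices are the ones listed in the configuration above and may change over time.

```python
BYTES_PER_FLOAT32 = 4  # Assumption: uncompressed float32 storage


def raw_vector_storage_gb(num_vectors: int, dimensions: int) -> float:
    """Raw vector payload in GB (excludes metadata and index overhead)."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1e9


def embedding_cost_usd(total_tokens: int, cost_per_1m: float) -> float:
    """One-time cost to embed a corpus at a given price per 1M tokens."""
    return total_tokens / 1_000_000 * cost_per_1m


# Hypothetical workload: 1M chunks of ~500 tokens each
num_chunks = 1_000_000
total_tokens = num_chunks * 500

models = [
    ("text-embedding-3-small", 1536, 0.02),
    ("text-embedding-3-large", 3072, 0.13),
    ("all-MiniLM-L6-v2", 384, 0.00),
]

for model, dims, price in models:
    storage = raw_vector_storage_gb(num_chunks, dims)
    cost = embedding_cost_usd(total_tokens, price)
    print(f"{model}: {storage:.2f} GB raw vectors, ${cost:.2f} to embed")
```

Doubling the dimension doubles raw storage, so a larger model should be justified by a measurable gain in retrieval quality on your own data.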
## Database Setup & Configuration
### Pinecone Setup
```python
# Install Pinecone SDK (the package was renamed from pinecone-client to pinecone)
# pip install pinecone
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone client
pc = Pinecone(api_key="YOUR_API_KEY")

# List existing indexes
indexes = pc.list_indexes()
print(f"Existing indexes: {[idx.name for idx in indexes]}")

# Create serverless index (recommended for production)
index_name = "production-search"
if index_name not in [idx.name for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=1536,  # Match your embedding model
        metric="cosine",  # cosine, dotproduct, or euclidean
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ),
        deletion_protection="enabled",  # Prevent accidental deletion
        tags={
            "environment": "production",
            "team": "ml",
            "project": "semantic-search"
        }
    )
    print(f"Created index: {index_name}")

# Connect to index
index = pc.Index(index_name)

# Get index stats
stats = index.describe_index_stats()
print(f"Index stats: {stats}")
```
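Because the index `dimension` must match the embedding model, a dimension mismatch is a common cause of failed upserts. The following is a minimal sketch of a pre-flight check that runs before any data is sent. `validate_vectors` and `MODEL_DIMENSIONS` are hypothetical helpers, not part of the Pinecone SDK, and the `(id, values)` tuple layout is an assumption made for illustration.

```python
# Dimensions for the models discussed in this guide
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "all-MiniLM-L6-v2": 384,
}


def validate_vectors(vectors: list[tuple[str, list[float]]], expected_dim: int) -> None:
    """Raise ValueError if any vector's length differs from the index dimension."""
    for vec_id, values in vectors:
        if len(values) != expected_dim:
            raise ValueError(
                f"Vector '{vec_id}' has {len(values)} dimensions, "
                f"but the index expects {expected_dim}"
            )


# Example: an index created with dimension=1536
index_dim = MODEL_DIMENSIONS["text-embedding-3-small"]

good_batch = [("doc-1", [0.0] * 1536)]
validate_vectors(good_batch, index_dim)  # Passes silently

bad_batch = [("doc-2", [0.0] * 384)]  # e.g. produced by all-MiniLM-L6-v2
try:
    validate_vectors(bad_batch, index_dim)
except ValueError as err:
    print(f"Rejected before upsert: {err}")
```

Checking locally keeps the error message specific to the offending vector ID instead of relying on a server-side rejection of the whole batch.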
### Selective Metadata Indexing (Pinecone)
```python
# Configure which metadata fields to index for filtering
# This optimizes memory usage and query performance
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create index with metadata configuration
pc.create_index(
    name="optimized-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1",
        schema={
            "fields": {
                # Index these fields for filtering
                "document_id": {"filterable": True},
                "category": {"filterable": True},
                "created_at": {"filterable": True},
                "tags": {"filterable": True},
                # Store but don't index (saves memory)
                "document_title": {"filterable": False},
                "document_url": {"filterable": False},
                "full_content": {"filterable": False}
            }
        }
    )
)

# This configuration allows you to:
# 1. Filter by document_id, category, created_at, tags
# 2. Retrieve document_title, document_url, full_content in results
# 3. Save memory by not indexing non-filterable fields
```
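With selective indexing, a query filter that references a non-filterable field will not behave as intended. One way to catch this early is to check filters against the schema on the client side. The sketch below is a hypothetical helper, not part of the Pinecone SDK; it assumes MongoDB-style filter dictionaries with `$and`/`$or` combinators and `$`-prefixed operators, and reuses the field names from the schema above.

```python
# Fields marked filterable in the example schema above
FILTERABLE_FIELDS = {"document_id", "category", "created_at", "tags"}


def fields_in_filter(flt: dict) -> set[str]:
    """Collect every metadata field name referenced by a filter expression."""
    fields: set[str] = set()
    for key, value in flt.items():
        if key in ("$and", "$or"):
            # Combinators hold a list of nested filter expressions
            for sub_filter in value:
                fields |= fields_in_filter(sub_filter)
        elif not key.startswith("$"):
            fields.add(key)
    return fields


def unfilterable_fields(flt: dict, filterable: set[str] = FILTERABLE_FIELDS) -> set[str]:
    """Return the fields used in the filter that are not indexed for filtering."""
    return fields_in_filter(flt) - filterable


valid_filter = {
    "$and": [
        {"category": {"$eq": "tutorial"}},
        {"tags": {"$in": ["python", "rag"]}},
    ]
}
invalid_filter = {"document_title": {"$eq": "Getting Started"}}

print(unfilterable_fields(valid_filter))
print(unfilterable_fields(invalid_filter))
```

An empty result means every referenced field is indexed; a non-empty result names the fields that would need `"filterable": True` in the schema.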
### Weaviate Setup
```python
# Install Weaviate client
# pip install weaviate-client
import weaviate
from weaviate.classes.config import Configure, Property, DataType

# Connect to a local Weaviate instance
client = weaviate.connect_to_local()

# Or connect to Weaviate Cloud
# (connect_to_wcs is deprecated in the v4 client in favor of connect_to_weaviate_cloud)
# from weaviate.classes.init import Auth
# client = weaviate.connect_to_weaviate_cloud(
#     cluster_url="YOUR_WCS_URL",
#     auth_credentials=Auth.api_key("YOUR_API_KEY")
# )

# Create collection (schema)
try:
    collection = client.collections.create(
        name="Documents",
        vectorizer_config=Configure.Vectorizer.text2vec_openai(
            model="text-embedding-3-small"
        ),
        properties=[
            Property(name="title", data_type=DataType.TEXT),
            Property(name="content", data_type=DataType.TEXT),
            Property(name="category", data_type=DataType.TEXT),
            Property(name="created_at", data_type=DataType.DATE),
            Property(name="tags", data_type=DataType.TEXT_ARRAY)
        ]