Complete pgvector Tutorial: Vector Database in PostgreSQL
pgvector is a PostgreSQL extension that allows you to store and perform similarity search on vector embeddings. This is extremely useful for AI applications such as semantic search, recommendation systems, and RAG (Retrieval-Augmented Generation).
What is pgvector?
pgvector adds the vector data type to PostgreSQL and provides:
- Vector storage with dimensions up to 16,000
- Similarity search with various distance metrics
- Indexing for fast search (IVFFlat, HNSW)
- Seamless integration with SQL queries
- Semantic search
- Similarity matching (images, documents, products)
- Recommendation systems
- RAG for LLM applications
- Clustering and classification
Installing pgvector
1. Install on Ubuntu
# Install dependencies
sudo apt update
sudo apt install -y postgresql postgresql-contrib
Install pgvector from source
sudo apt install -y postgresql-server-dev-all git build-essential
Clone and build pgvector
cd /tmp
git clone --branch v0.7.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
2. Install via Docker
# Pull image with pgvector
docker pull pgvector/pgvector:pg16
Run container
docker run -d \
--name pgvector-db \
-e POSTGRESPASSWORD=mysecretpassword \
-e POSTGRESDB=vectordb \
-p 5432:5432 \
pgvector/pgvector:pg16
3. Enable Extension
-- Connect to database
psql -U postgres -d vectordb
-- Create extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Verify installation
SELECT FROM pgextension WHERE extname = 'vector';
Basic Concepts
1. Vector Data Type
-- Create table with vector column
CREATE TABLE items (
id SERIAL PRIMARY KEY,
name TEXT,
embedding VECTOR(3) -- Vector with 3 dimensions
);
-- Insert vector
INSERT INTO items (name, embedding) VALUES
('item1', '[1, 2, 3]'),
('item2', '[4, 5, 6]'),
('item3', '[1, 2, 4]');
-- Query vector
SELECT FROM items;
2. Distance Metrics
pgvector supports several distance metrics:
| Operator | Distance | Use Case |
|----------|----------|----------|
| <-> | L2 (Euclidean) | Default, general purpose |
| <#> | Inner Product (Negative) | Dot product similarity |
| <=> | Cosine Distance | Normalized vectors |
| <+> | L1 (Manhattan) | Sparse vectors |
-- L2 Distance (Euclidean)
SELECT name, embedding <-> '[1, 2, 3]' AS distance
FROM items
ORDER BY distance
LIMIT 5;
-- Cosine Distance
SELECT name, embedding <=> '[1, 2, 3]' AS distance
FROM items
ORDER BY distance
LIMIT 5;
-- Inner Product
SELECT name, embedding <#> '[1, 2, 3]' AS distance
FROM items
ORDER BY distance
LIMIT 5;
3. Vector Dimensions
-- Check vector dimensions
SELECT vectordims(embedding) FROM items LIMIT 1;
-- Normalize vector
SELECT l2normalize(embedding) FROM items;
-- Vector arithmetic
SELECT embedding + '[1, 1, 1]' FROM items WHERE id = 1;
SELECT embedding 2 FROM items WHERE id = 1;
Indexing for Performance
1. IVFFlat Index
IVFFlat (Inverted File with Flat compression) is suitable for large datasets with accuracy trade-offs.
-- Create IVFFlat index
CREATE INDEX ON items USING ivfflat (embedding vectorl2ops)
WITH (lists = 100);
-- For cosine distance
CREATE INDEX ON items USING ivfflat (embedding vectorcosineops)
WITH (lists = 100);
-- For inner product
CREATE INDEX ON items USING ivfflat (embedding vectoripops)
WITH (lists = 100);
lists Parameter:
- Number of clusters for the index
- Rule of thumb:
sqrt(numrows)for < 1M rows - For > 1M rows:
sqrt(numrows)up tonumrows / 1000