Full-Text Search

BM25 Full-Text Search
BM25F Field-Weighted
Snippet Highlighting
GIN Index
Phrase, Fuzzy, Prefix
Trigram Similarity (pg_trgm)
Hybrid BM25 + Vector
Multi-Vector MaxSim (ColBERT)
Use Cases
SQL Reference

BM25 Full-Text Search

BM25 (Best Match 25) is the gold-standard keyword relevance algorithm used by Elasticsearch, Lucene, and Solr. Absolute DB implements BM25 natively in the SQL engine with SIMD-accelerated token matching — no plugins, no extensions, no external search infrastructure.

BM25 scores each document by term frequency (how often the query term appears), inverse document frequency (how rare the term is across the corpus), and document length normalisation. The result is a relevance score that ranks documents by how well they match the query.

sql — BM25 full-text search

-- Create a table with a full-text search index
CREATE TABLE articles (
    id        BIGSERIAL PRIMARY KEY,
    title     TEXT NOT NULL,
    body      TEXT NOT NULL,
    author    TEXT,
    published TIMESTAMPTZ DEFAULT now()
);

-- Create a GIN index for full-text search
CREATE INDEX idx_articles_fts
    ON articles USING GIN (to_tsvector('english', title || ' ' || body));

-- BM25 search using the MATCH operator
SELECT id, title, bm25_score(body, 'distributed consensus') AS score
FROM articles
WHERE body MATCH 'distributed consensus'
ORDER BY score DESC
LIMIT 10;

-- Boolean query operators
SELECT id, title
FROM articles
WHERE body MATCH 'consensus AND (raft OR paxos) NOT viewstamped'
ORDER BY bm25_score(body, 'consensus raft paxos') DESC
LIMIT 10;

BM25 Parameters

Parameter	Default	Description
`k1`	1.2	Term frequency saturation. Higher = more weight to repeated terms
`b`	0.75	Length normalisation. 0 = no normalisation, 1 = full normalisation
Language	`'english'`	Stemming/stop-word language. 20+ languages supported

BM25F Field-Weighted Search

BM25F extends BM25 to support multiple fields with independent weights. A match in a high-weight field (e.g., title) contributes more to the relevance score than a match in a lower-weight field (e.g., body). This is essential for product search, document search, and e-commerce catalogues.

sql — BM25F multi-field search

-- BM25F search with field weights
SELECT id, title,
       bm25f_score(
           fields => ARRAY['title', 'body', 'tags'],
           weights => ARRAY[3.0, 1.0, 2.0],
           query => 'machine learning database'
       ) AS score
FROM articles
WHERE bm25f_match(
    fields  => ARRAY['title', 'body', 'tags'],
    query   => 'machine learning database'
)
ORDER BY score DESC
LIMIT 10;

-- Equivalent using the field-weighted MATCH syntax
SELECT id, title
FROM articles
WHERE MATCH(title * 3, body * 1, tags * 2) AGAINST 'machine learning database'
ORDER BY relevance DESC
LIMIT 10;

Snippet Highlighting

The absdb_snippet() function extracts a relevant excerpt from a document with query terms highlighted. This is the standard "search result snippet" feature found in all major search engines.

sql — snippet highlighting

SELECT
    id,
    title,
    absdb_snippet(
        body,
        'distributed consensus',
        max_words   => 30,
        start_sel   => '<strong>',
        stop_sel    => '</strong>',
        min_words   => 15
    ) AS snippet
FROM articles
WHERE body MATCH 'distributed consensus'
ORDER BY bm25_score(body, 'distributed consensus') DESC
LIMIT 10;

Snippet highlighting is SIMD-accelerated. On a 4-core server with AVX2, it processes millions of tokens per second — fast enough for real-time web search result pages.

GIN Index

GIN (Generalised Inverted Index) is the standard index type for full-text search, JSONB containment, and array operations. A GIN index maps each lexeme (stemmed word token) to the set of documents that contain it, enabling fast keyword lookups regardless of table size.

sql — GIN index creation

-- Single-column GIN index
CREATE INDEX idx_articles_body_gin
    ON articles USING GIN (to_tsvector('english', body));

-- Multi-column GIN index (title + body combined)
CREATE INDEX idx_articles_full_gin
    ON articles USING GIN (
        to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''))
    );

-- GIN index on JSONB for containment queries
CREATE INDEX idx_products_meta_gin
    ON products USING GIN (metadata jsonb_path_ops);

-- GIN index on text array
CREATE INDEX idx_articles_tags_gin
    ON articles USING GIN (tags);

-- Check index usage
EXPLAIN SELECT * FROM articles WHERE body MATCH 'consensus';

Phrase, Fuzzy and Prefix Queries

The MATCH operator supports three enhanced query forms in addition to plain keyword queries. All three are native BM25-indexed search over the same inverted index used by plain matches.

Query form	Syntax	Behaviour
Plain keyword	`body MATCH 'raft consensus'`	Match documents containing both "raft" and "consensus" as tokens
Phrase	`body MATCH '"raft consensus"'`	Match documents where the two tokens appear consecutively in that order
Prefix / wildcard	`body MATCH 'rep*'`	Match any token starting with "rep" (replicate, replicated, replication, replay, …)
Fuzzy (edit-distance)	`body MATCH '~database'`	Match tokens within Levenshtein distance 1 of "database" (typos, variants)

sql — enhanced FTS queries

-- Exact phrase
SELECT id, title FROM articles
WHERE body MATCH '"raft consensus algorithm"'
ORDER BY bm25_score DESC
LIMIT 10;

-- Prefix wildcard
SELECT id, title FROM articles
WHERE body MATCH 'distrib*'
ORDER BY bm25_score DESC
LIMIT 20;

-- Fuzzy, edit-distance <= 1
SELECT id, title FROM articles
WHERE body MATCH '~databse'   -- matches "database", "databases"
ORDER BY bm25_score DESC
LIMIT 20;

-- Mix forms in one query
SELECT id, title FROM articles
WHERE body MATCH '"state machine" OR ~raft OR log*';

Implementation notes: phrase matches boost scoring by 2× BM25; fuzzy lookups use a dictionary-walk with a cached Levenshtein distance function; prefix queries walk a trie-sorted inverted index in O(log n + k) where k is the number of matching tokens.

Trigram Similarity (pg_trgm)

Trigram similarity splits strings into overlapping 3-character sequences and computes a Jaccard-like similarity coefficient. It is the standard algorithm for fuzzy string matching, typo-tolerant search, and "did you mean?" suggestions. Absolute DB includes pg_trgm compatibility built in — no extension needed.

A GIN trigram index enables fast similarity searches on large text columns without scanning every row.

sql — trigram similarity

-- Create GIN trigram index
CREATE INDEX idx_products_name_trgm
    ON products USING GIN (name gin_trgm_ops);

-- Compute similarity score between two strings (0.0 to 1.0)
SELECT similarity('apple iphone', 'apple iPhone 15 Pro');
-- Result: 0.62

-- Find similar product names (fuzzy match)
SELECT name, similarity(name, 'iphone') AS sim
FROM products
WHERE name % 'iphone'         -- % operator: similarity > threshold (default 0.3)
ORDER BY sim DESC
LIMIT 10;

-- Set similarity threshold
SET pg_trgm.similarity_threshold = 0.4;

-- Trigram search with ranking
SELECT name, similarity(name, $1) AS score
FROM products
WHERE similarity(name, $1) > 0.25
ORDER BY score DESC
LIMIT 20;

Function/Operator	Description
`similarity(a, b)`	Trigram similarity score between two strings (0.0–1.0)
`a % b`	True if similarity(a, b) exceeds the threshold
`word_similarity(a, b)`	Similarity of string `a` to any word in string `b`
`strict_word_similarity(a, b)`	Strict word boundary variant
`show_trgm(text)`	Returns array of trigrams for a string (debugging)

Hybrid BM25 + Vector Search

Combining BM25 keyword relevance with dense vector semantic similarity using Reciprocal Rank Fusion (RRF) consistently outperforms either approach alone across recall benchmarks. RRF is rank-based — it takes the reciprocal rank of each document in both result sets and sums them, avoiding the problem of incompatible score distributions.

sql — hybrid search

-- Hybrid search: BM25 keyword + dense vector
SELECT id, title
FROM articles
WHERE body MATCH 'consensus algorithm fault tolerance'
ORDER BY embedding <=> EMBED('minilm', 'consensus algorithm fault tolerance')
LIMIT 10;

-- Explicit hybrid search function with weight control
SELECT * FROM absdb_hybrid_search(
    table       => 'articles',
    query       => 'consensus algorithm fault tolerance',
    vec_query   => EMBED('minilm', 'consensus algorithm fault tolerance'),
    bm25_weight => 0.3,
    vec_weight  => 0.7,
    top_k       => 20
);

Multi-Vector MaxSim (ColBERT)

ColBERT stores a separate embedding vector for each token in a document and computes the maximum similarity between each query token and all document tokens. This late interaction approach is highly effective for long-document retrieval — it captures fine-grained term-level meaning that single-vector pooling loses.

sql — ColBERT MaxSim

-- Store per-token ColBERT embeddings (128-dim per token)
CREATE TABLE colbert_articles (
    id       BIGSERIAL PRIMARY KEY,
    title    TEXT,
    body     TEXT,
    tokens   VECTOR(128)[]
);

-- Auto-populate token embeddings on insert
ALTER TABLE colbert_articles
    SET (auto_embed_tokens = true, embed_model = 'colbert-v2');

-- MaxSim search
SELECT id, title,
       absdb_maxsim(tokens, EMBED_TOKENS('colbert-v2', 'fault tolerant consensus')) AS score
FROM colbert_articles
ORDER BY score DESC
LIMIT 10;

Use Cases

Use Case	Recommended Approach
E-commerce product search	BM25F (title weight 3×, description 1×, tags 2×) + trigram for typo tolerance
Document management / legal	BM25 + ColBERT MaxSim for long-document precision
Knowledge base / FAQ	Hybrid BM25 + vector (RRF) for semantic + keyword coverage
Log analysis	BM25 with GIN index on structured log fields
User name / address autocomplete	Trigram GIN index + `%` operator
Semantic search / RAG	Dense vector HNSW + BM25 hybrid
Multi-language content	BM25 with per-language stemming configuration

SQL Reference

SQL Element	Description
`MATCH` operator	Full-text search predicate (BM25 scored)
`bm25_score(col, query)`	Return BM25 relevance score for a column against a query string
`bm25f_score(fields, weights, query)`	BM25F multi-field weighted relevance score
`absdb_snippet(col, query, ...)`	Extract and highlight a relevant excerpt from a text column
`to_tsvector(lang, text)`	Tokenise and stem text for GIN indexing
`to_tsquery(lang, query)`	Parse a query string into a tsquery expression
`similarity(a, b)`	Trigram similarity score (0.0–1.0)
`a % b`	True if trigram similarity exceeds threshold
`word_similarity(a, b)`	Word-level trigram similarity
`CREATE INDEX USING GIN`	Create inverted index for full-text or JSONB search
`absdb_maxsim(vecs, query_vecs)`	ColBERT MaxSim operator for token-level retrieval

Continue Reading

Vector Search Indexing SQL Reference

Ready to run Absolute DB?

~154 KB binary · zero external dependencies · 2,737 tests passing · SQL:2023 100%

Download Free → View Pricing All Docs

Contents

BM25 Full-Text Search

BM25 Parameters

BM25F Field-Weighted Search

Snippet Highlighting

GIN Index

Phrase, Fuzzy and Prefix Queries

Trigram Similarity (pg_trgm)

Hybrid BM25 + Vector Search

Multi-Vector MaxSim (ColBERT)

Use Cases

SQL Reference

Continue Reading

Ready to run Absolute DB?