BM25, BM25F field-weighted search, trigram similarity, and hybrid BM25+vector retrieval — all built into Absolute DB with no extensions to install.
BM25 (Best Match 25) is the gold-standard keyword relevance algorithm used by Elasticsearch, Lucene, and Solr. Absolute DB implements BM25 natively in the SQL engine with SIMD-accelerated token matching — no plugins, no extensions, no external search infrastructure.
BM25 scores each document by term frequency (how often the query term appears), inverse document frequency (how rare the term is across the corpus), and document length normalisation. The result is a relevance score that ranks documents by how well they match the query.
-- Create a table with a full-text search index
CREATE TABLE articles (
id BIGSERIAL PRIMARY KEY,
title TEXT NOT NULL,
body TEXT NOT NULL,
author TEXT,
published TIMESTAMPTZ DEFAULT now()
);
-- Create a GIN index for full-text search
CREATE INDEX idx_articles_fts
ON articles USING GIN (to_tsvector('english', title || ' ' || body));
-- BM25 search using the MATCH operator
SELECT id, title, bm25_score(body, 'distributed consensus') AS score
FROM articles
WHERE body MATCH 'distributed consensus'
ORDER BY score DESC
LIMIT 10;
-- Boolean query operators
SELECT id, title
FROM articles
WHERE body MATCH 'consensus AND (raft OR paxos) NOT viewstamped'
ORDER BY bm25_score(body, 'consensus raft paxos') DESC
LIMIT 10;
| Parameter | Default | Description |
|---|---|---|
k1 | 1.2 | Term frequency saturation. Higher = more weight to repeated terms |
b | 0.75 | Length normalisation. 0 = no normalisation, 1 = full normalisation |
| Language | 'english' | Stemming/stop-word language. 20+ languages supported |
BM25F extends BM25 to support multiple fields with independent weights. A match in a high-weight field (e.g., title) contributes more to the relevance score than a match in a lower-weight field (e.g., body). This is essential for product search, document search, and e-commerce catalogues.
-- BM25F search with field weights
SELECT id, title,
bm25f_score(
fields => ARRAY['title', 'body', 'tags'],
weights => ARRAY[3.0, 1.0, 2.0],
query => 'machine learning database'
) AS score
FROM articles
WHERE bm25f_match(
fields => ARRAY['title', 'body', 'tags'],
query => 'machine learning database'
)
ORDER BY score DESC
LIMIT 10;
-- Equivalent using the field-weighted MATCH syntax
SELECT id, title
FROM articles
WHERE MATCH(title * 3, body * 1, tags * 2) AGAINST 'machine learning database'
ORDER BY relevance DESC
LIMIT 10;
The absdb_snippet() function extracts a relevant excerpt from a document with query terms highlighted. This is the standard "search result snippet" feature found in all major search engines.
SELECT
id,
title,
absdb_snippet(
body,
'distributed consensus',
max_words => 30,
start_sel => '<strong>',
stop_sel => '</strong>',
min_words => 15
) AS snippet
FROM articles
WHERE body MATCH 'distributed consensus'
ORDER BY bm25_score(body, 'distributed consensus') DESC
LIMIT 10;
GIN (Generalised Inverted Index) is the standard index type for full-text search, JSONB containment, and array operations. A GIN index maps each lexeme (stemmed word token) to the set of documents that contain it, enabling fast keyword lookups regardless of table size.
-- Single-column GIN index
CREATE INDEX idx_articles_body_gin
ON articles USING GIN (to_tsvector('english', body));
-- Multi-column GIN index (title + body combined)
CREATE INDEX idx_articles_full_gin
ON articles USING GIN (
to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''))
);
-- GIN index on JSONB for containment queries
CREATE INDEX idx_products_meta_gin
ON products USING GIN (metadata jsonb_path_ops);
-- GIN index on text array
CREATE INDEX idx_articles_tags_gin
ON articles USING GIN (tags);
-- Check index usage
EXPLAIN SELECT * FROM articles WHERE body MATCH 'consensus';
The MATCH operator supports three enhanced query forms in addition to plain keyword queries. All three are native BM25-indexed search over the same inverted index used by plain matches.
| Query form | Syntax | Behaviour |
|---|---|---|
| Plain keyword | body MATCH 'raft consensus' | Match documents containing both "raft" and "consensus" as tokens |
| Phrase | body MATCH '"raft consensus"' | Match documents where the two tokens appear consecutively in that order |
| Prefix / wildcard | body MATCH 'rep*' | Match any token starting with "rep" (replicate, replicated, replication, replay, …) |
| Fuzzy (edit-distance) | body MATCH '~database' | Match tokens within Levenshtein distance 1 of "database" (typos, variants) |
-- Exact phrase
SELECT id, title FROM articles
WHERE body MATCH '"raft consensus algorithm"'
ORDER BY bm25_score DESC
LIMIT 10;
-- Prefix wildcard
SELECT id, title FROM articles
WHERE body MATCH 'distrib*'
ORDER BY bm25_score DESC
LIMIT 20;
-- Fuzzy, edit-distance <= 1
SELECT id, title FROM articles
WHERE body MATCH '~databse' -- matches "database", "databases"
ORDER BY bm25_score DESC
LIMIT 20;
-- Mix forms in one query
SELECT id, title FROM articles
WHERE body MATCH '"state machine" OR ~raft OR log*';
Implementation notes: phrase matches boost scoring by 2× BM25; fuzzy lookups use a dictionary-walk with a cached Levenshtein distance function; prefix queries walk a trie-sorted inverted index in O(log n + k) where k is the number of matching tokens.
Trigram similarity splits strings into overlapping 3-character sequences and computes a Jaccard-like similarity coefficient. It is the standard algorithm for fuzzy string matching, typo-tolerant search, and "did you mean?" suggestions. Absolute DB includes pg_trgm compatibility built in — no extension needed.
A GIN trigram index enables fast similarity searches on large text columns without scanning every row.
-- Create GIN trigram index
CREATE INDEX idx_products_name_trgm
ON products USING GIN (name gin_trgm_ops);
-- Compute similarity score between two strings (0.0 to 1.0)
SELECT similarity('apple iphone', 'apple iPhone 15 Pro');
-- Result: 0.62
-- Find similar product names (fuzzy match)
SELECT name, similarity(name, 'iphone') AS sim
FROM products
WHERE name % 'iphone' -- % operator: similarity > threshold (default 0.3)
ORDER BY sim DESC
LIMIT 10;
-- Set similarity threshold
SET pg_trgm.similarity_threshold = 0.4;
-- Trigram search with ranking
SELECT name, similarity(name, $1) AS score
FROM products
WHERE similarity(name, $1) > 0.25
ORDER BY score DESC
LIMIT 20;
| Function/Operator | Description |
|---|---|
similarity(a, b) | Trigram similarity score between two strings (0.0–1.0) |
a % b | True if similarity(a, b) exceeds the threshold |
word_similarity(a, b) | Similarity of string a to any word in string b |
strict_word_similarity(a, b) | Strict word boundary variant |
show_trgm(text) | Returns array of trigrams for a string (debugging) |
Combining BM25 keyword relevance with dense vector semantic similarity using Reciprocal Rank Fusion (RRF) consistently outperforms either approach alone across recall benchmarks. RRF is rank-based — it takes the reciprocal rank of each document in both result sets and sums them, avoiding the problem of incompatible score distributions.
-- Hybrid search: BM25 keyword + dense vector
SELECT id, title
FROM articles
WHERE body MATCH 'consensus algorithm fault tolerance'
ORDER BY embedding <=> EMBED('minilm', 'consensus algorithm fault tolerance')
LIMIT 10;
-- Explicit hybrid search function with weight control
SELECT * FROM absdb_hybrid_search(
table => 'articles',
query => 'consensus algorithm fault tolerance',
vec_query => EMBED('minilm', 'consensus algorithm fault tolerance'),
bm25_weight => 0.3,
vec_weight => 0.7,
top_k => 20
);
ColBERT stores a separate embedding vector for each token in a document and computes the maximum similarity between each query token and all document tokens. This late interaction approach is highly effective for long-document retrieval — it captures fine-grained term-level meaning that single-vector pooling loses.
-- Store per-token ColBERT embeddings (128-dim per token)
CREATE TABLE colbert_articles (
id BIGSERIAL PRIMARY KEY,
title TEXT,
body TEXT,
tokens VECTOR(128)[]
);
-- Auto-populate token embeddings on insert
ALTER TABLE colbert_articles
SET (auto_embed_tokens = true, embed_model = 'colbert-v2');
-- MaxSim search
SELECT id, title,
absdb_maxsim(tokens, EMBED_TOKENS('colbert-v2', 'fault tolerant consensus')) AS score
FROM colbert_articles
ORDER BY score DESC
LIMIT 10;
| Use Case | Recommended Approach |
|---|---|
| E-commerce product search | BM25F (title weight 3×, description 1×, tags 2×) + trigram for typo tolerance |
| Document management / legal | BM25 + ColBERT MaxSim for long-document precision |
| Knowledge base / FAQ | Hybrid BM25 + vector (RRF) for semantic + keyword coverage |
| Log analysis | BM25 with GIN index on structured log fields |
| User name / address autocomplete | Trigram GIN index + % operator |
| Semantic search / RAG | Dense vector HNSW + BM25 hybrid |
| Multi-language content | BM25 with per-language stemming configuration |
| SQL Element | Description |
|---|---|
MATCH operator | Full-text search predicate (BM25 scored) |
bm25_score(col, query) | Return BM25 relevance score for a column against a query string |
bm25f_score(fields, weights, query) | BM25F multi-field weighted relevance score |
absdb_snippet(col, query, ...) | Extract and highlight a relevant excerpt from a text column |
to_tsvector(lang, text) | Tokenise and stem text for GIN indexing |
to_tsquery(lang, query) | Parse a query string into a tsquery expression |
similarity(a, b) | Trigram similarity score (0.0–1.0) |
a % b | True if trigram similarity exceeds threshold |
word_similarity(a, b) | Word-level trigram similarity |
CREATE INDEX USING GIN | Create inverted index for full-text or JSONB search |
absdb_maxsim(vecs, query_vecs) | ColBERT MaxSim operator for token-level retrieval |
~154 KB binary · zero external dependencies · 2,737 tests passing · SQL:2023 100%