SQL Full-Text Search: Mastering TSVector

If you have a table of 1 million articles and you want to find the word "Architecture," you might use WHERE title LIKE '%Architecture%'.
This is an architectural failure.
- It is Slow: The database has to look at every single character of every single row (A Sequential Scan). Your SSD will be saturated with meaningless reads.
- It is "Dumb": It won't find "Architectural," "Architects," or even "architecture" (lowercase). It is a literal string match, not a search for "Intent."
Full-Text Search (FTS) solves this by building an Inverted Index. This 1,500+ word guide is your blueprint for Architecting Discovery. To master search is to master the physics of data retrieval.
1. The Anatomy of Search: Vector and Query
FTS works by separating the "Content" from the "Search Intent." This separation is what allows the database to stay fast. We don't search the text; we search a Mathematical Representation of the text.
TSVector: The "Hardware" Mirror of Content
A tsvector is a sorted list of distinct lexemes. Lexemes are not just words; they are the "Roots" of words.
- Tokenization: The engine breaks your text into words (tokens) based on white space and punctuation.
- Normalization: It converts everything to lowercase and eliminates noise.
- Stopword Removal: Words like "the," "is," and "and" are deleted. In the world of Information Entropy, these words carry zero weight. They only bloat the index and waste CPU cycles during intersection.
- Stemming: This is the magic. The engine uses algorithms (like the Porter Stemmer or Snowball) to strip suffixes. "Running" becomes "run." "Fishing" becomes "fish."
TSQuery: The Search Objective
A tsquery is your search intent, parsed into the same normalized lexemes. It supports Boolean logic and proximity operators:
- & (AND): database & performance
- | (OR): postgres | mysql
- ! (NOT): architecture & !monolith
- <-> (PROXIMITY): search <-> engine (find rows where "search" is directly followed by "engine").
The Architect's Detail: When you search for "Searching for Engineers," the query is normalized to 'search' & 'engin'. The database is comparing the normalized query against the normalized vector. This is why "Search" finds "Searching."
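A minimal sketch of this normalization in PostgreSQL (the exact lexemes depend on the dictionary, but the English configuration typically produces the stems shown):

```sql
-- Both sides are normalized to the same lexemes before comparison.
SELECT to_tsvector('english', 'Searching for Engineers');
-- typically: 'engin':3 'search':1  ("for" is a stopword)

SELECT to_tsquery('english', 'Searching & Engineers');
-- typically: 'search' & 'engin'

-- The @@ operator matches a query against a vector:
SELECT to_tsvector('english', 'Searching for Engineers')
       @@ to_tsquery('english', 'search & engineer');  -- true
```

Because both sides pass through the same pipeline, any inflection of a word in the query matches any inflection in the document.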
2. Hardware-Mirror: Inverted Index Physics
How does a database search 100GB of text in 5 milliseconds? The answer is the GIN (Generalized Inverted Index).
The Posting List Geometry
In a standard B-Tree index, you store "Key -> RowID." If the key is 'User123', the value is the address of that user's row. In an Inverted Index (GIN), you store "Lexeme -> Posting List."
- The Physics: Imagine a massive table where the keys are every unique word in your entire database. The "Value" associated with the word "Performance" is a highly compressed bitstream of every single RowID that contains that word.
- The Search Execution: When you search for "Database," the engine doesn't scan the table. It jumps to the "Database" key in the GIN index, grabs the list of IDs, and returns them.
- CPU Alignment: If you search for "Database & Performance," the engine grabs two ID lists (Posting Lists) and performs a Bitwise AND directly in the CPU cache. This is the fastest possible way for silicon to handle multi-term discovery.
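The mechanics above can be sketched with an expression-based GIN index; the articles table and its columns are hypothetical, used only for illustration:

```sql
-- Hypothetical table: articles(id, body).
CREATE INDEX idx_articles_fts
    ON articles
    USING GIN (to_tsvector('english', body));

-- The planner can use the GIN index because the expression in the
-- WHERE clause matches the indexed expression exactly:
SELECT id
FROM articles
WHERE to_tsvector('english', body)
      @@ to_tsquery('english', 'database & performance');
```

The AND query above is exactly the posting-list intersection described: one ID list for 'database', one for 'performance', combined before any table row is touched.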
3. Stemming Physics: The Suffix Strip
Stemming is what makes FTS "Smart" rather than "Literal."
The Algorithmic Suffix Mirror
The database doesn't have a giant dictionary of all words. Instead, it uses a set of rules (e.g., "If it ends in -ing, remove it").
- Guarantee: By reducing words to their "Root," the database ensures a dramatically smaller Index Surface Area.
- Storage Result: Instead of having separate index entries for "Architect," "Architectural," "Architecting," and "Architecture," you have one entry: architect.
- Latency Result: Because the index is smaller, more of it fits in the OS Page Cache (RAM). A search that stays in RAM is 1,000x faster than one that has to hit the SSD.
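You can watch the suffix strip happen directly; a small sketch (the English stemmer collapses these inflections to a single lexeme):

```sql
-- Three inflections, one lexeme, three position markers:
SELECT to_tsvector('english', 'searching searches searched');
-- typically: 'search':1,2,3
```

One index entry now covers every inflection, which is the storage win described above.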
4. Ranking Mathematics: The ts_rank Density
Finding the data is only 50% of the battle. The other 50% is Relevance. Why is Page 1 of Google better than Page 10? Because of Ranking.
ts_rank vs. ts_rank_cd
PostgreSQL provides two primary functions for this:
- ts_rank: Calculates relevance based on Frequency. The more times the word "Cloud" appears in a document, the higher the score. It assumes that repetition equals importance.
- ts_rank_cd (Cover Density): This is the modern standard. It looks at the Proximity of the words. If you search for "Vector Database," ts_rank_cd will score a row higher if those two words are next to each other than if they are at opposite ends of a 5,000-word article.
Weighted Vectors: Designing Importance
In production systems, we treat titles as more important than body text. We use a "Weighting Mirror":
- Label A: Highest priority (Title).
- Label B: High priority (Summary).
- Label C: Medium priority (Tags).
- Label D: Low priority (Body).
This ensures that if a user searches for "SQL Performance," an article with those words in the TITLE will always beat a long article that only mentions them in a footnote.
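A minimal sketch of the weighting mirror using setweight; the articles table and its title/body columns are assumptions for illustration:

```sql
-- Title hits are weighted 'A' (highest), body hits 'D' (lowest),
-- so ts_rank scores a title match far above a footnote match.
SELECT id,
       ts_rank(
         setweight(to_tsvector('english', title), 'A') ||
         setweight(to_tsvector('english', body),  'D'),
         to_tsquery('english', 'sql & performance')
       ) AS score
FROM articles
ORDER BY score DESC
LIMIT 10;
```

In production you would store this concatenated, weighted vector rather than building it per query.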
5. Fuzzy Searching: The Trigram Mirror
FTS is perfect for exact words or stems, but what if the user makes a typo? Searching for "Datbase" instead of "Database"? A standard tsvector will return zero results because "Datbase" isn't a valid lexeme.
The 3-Character Sliding Window (pg_trgm)
To handle typos, we use the pg_trgm extension.
- The Physics: It breaks words into 3-character chunks (Trigrams). "Database" becomes dat, ata, tab, aba, bas, ase (plus padded boundary trigrams for the start and end of the word).
- Similarity Scoring: When a user types a typo, the engine calculates a Jaccard-style similarity: the fraction of 3-character chunks the two strings share.
- The Architecture: In high-fidelity systems, the standard is to use FTS for speed and logic, and the Trigram Index as a fallback for "Did you mean?" suggestions when FTS returns zero results.
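A sketch of the fallback path with pg_trgm; the products table and name column are hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Inspect the chunks and the score for a typo:
SELECT show_trgm('database');
SELECT similarity('datbase', 'database');  -- a value between 0 and 1

-- Trigram GIN index for the "Did you mean?" fallback:
CREATE INDEX idx_products_name_trgm
    ON products USING GIN (name gin_trgm_ops);

SELECT name
FROM products
WHERE name % 'datbase'                        -- % = "similar enough"
ORDER BY similarity(name, 'datbase') DESC
LIMIT 5;
```

The % operator respects the pg_trgm.similarity_threshold setting, so you can tune how forgiving the fallback is.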
6. Performance Profile: Sequential Scan vs GIN
Developers often wonder: "When should I switch to FTS?"
- The 10k Rule: For tables under 10,000 rows, a LIKE search might be "fast enough" because the data fits in the CPU cache.
- The 1M Reality: At 1 million rows, a LIKE '%word%' search will take seconds. A GIN-indexed FTS search will take 10ms.
- Write Amplification: GIN indexes are expensive to update. Every time you insert a row, the engine has to update the Posting List for dozens of words. If your application is 99% writes and 1% search, FTS might be a bottleneck. But for e-commerce, blogs, or documentation hubs, the search speed is worth the write cost.
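You can verify the trade-off yourself with EXPLAIN; this sketch assumes a hypothetical articles table that has a GIN index on to_tsvector('english', body):

```sql
-- Leading-wildcard LIKE: the planner has no usable index,
-- so the plan shows a Seq Scan over the whole table.
EXPLAIN ANALYZE
SELECT id FROM articles WHERE body LIKE '%database%';

-- FTS: the plan shows a Bitmap Index Scan on the GIN index,
-- touching only the posting list for the lexeme 'databas'.
EXPLAIN ANALYZE
SELECT id FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'database');
```

Comparing the two plans on your own data is the most honest answer to "when should I switch?"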
7. Case Study: Eliminating ElasticSearch Overhead
A high-traffic e-commerce store was paying $500/month for a hosted search cluster.
- The Complexity: Every time a product price changed in SQL, they had to "Sync" it to the search cluster via an API. This sync failed occasionally, leading to "Phantom Prices" in search results.
- The Move: They migrated to PostgreSQL FTS.
- The Hardware Result: By using a Generated Column (tsvector GENERATED ALWAYS AS ...), all indexing happened atomically during the SQL transaction. No external sync, no API lag.
- The Business Result: Reduced monthly costs to $0, eliminated data inconsistency bugs, and maintained a sub-20ms search latency across 1 million products on a single database instance.
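A sketch of that generated-column pattern (requires PostgreSQL 12+; the products table and its name/description columns are illustrative):

```sql
-- The vector is recomputed inside the same transaction as the write,
-- so the index can never drift out of sync with the data.
ALTER TABLE products
    ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(name, '')),        'A') ||
        setweight(to_tsvector('english', coalesce(description, '')), 'D')
    ) STORED;

CREATE INDEX idx_products_search
    ON products USING GIN (search_vector);

SELECT id, name
FROM products
WHERE search_vector @@ websearch_to_tsquery('english', 'wireless keyboard');
```

The coalesce calls keep the expression from returning NULL when one column is empty, and websearch_to_tsquery accepts Google-style input directly from a search box.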
8. Summary: The Discovery Checklist
- Never Use to_tsvector in SELECT: This is a production sin. It re-calculates the vector for every row on every search. Use a Generated Column to store the vector on disk.
- Language Context: Always specify the language context (e.g., 'english'). Stemmers work differently across languages (e.g., German compounding vs. English suffixing).
- Use GIN, Not GiST: For most Full-Text use cases, GIN (Inverted Index) is roughly 3x faster than GiST for searching, though GiST is faster for updates.
- Leverage Proximity: Use the <-> operator in your UI to allow "Phrase Search" (e.g., users putting quotes around words).
- Monitor Bloat: GIN indexes can grow very large. Run VACUUM ANALYZE regularly to keep statistics fresh and let the engine flush the GIN pending list.
Full-Text Search turns your database into a Discovery Engine. By mastering the internal map of lexemes and the math of relevance density, you gain the power to build search experiences that are as fast as Google and as reliable as standard SQL. You have graduated from "Fetching data" to "Architecting Relevance."
Phase 26: Discovery Mastery Action Items
- Install pg_trgm and build a "Typo-tolerant" fallback for your main search UI.
- Implement a Generated Column for your search vector to automate indexing during insertions.
- Compare ts_rank and ts_rank_cd results for a multi-word search query.
- Use ts_headline() to generate dynamic "Search Snippets" that highlight keywords in your frontend results.
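For the snippet action item, a minimal ts_headline sketch (articles table hypothetical; StartSel/StopSel/MaxWords/MinWords are standard ts_headline options):

```sql
-- Wrap matched lexemes in <mark> tags and trim the excerpt length:
SELECT ts_headline(
         'english',
         body,
         to_tsquery('english', 'database & performance'),
         'StartSel=<mark>, StopSel=</mark>, MaxWords=25, MinWords=10'
       ) AS snippet
FROM articles
WHERE to_tsvector('english', body)
      @@ to_tsquery('english', 'database & performance');
```

Run ts_headline only on the rows you actually return (after LIMIT), since it re-parses the document text and is far more expensive than the index lookup itself.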
Read next: JSONB & NoSQL: The Hybrid Database Revolution →
Part of the SQL Mastery Course — engineering the search.
