SQL Full-Text Search: Mastering TSVector

If you have a table of 1 million articles and you want to find the word "Architecture," you might use WHERE title LIKE '%Architecture%'.
This is an architectural failure.
- It is Slow: The database has to look at every single character of every single row (A Sequential Scan). Your SSD will be saturated with meaningless reads.
- It is "Dumb": It won't find "Architectural," "Architects," or even "architecture" (lowercase). It is a literal string match, not a search for "Intent."
Full-Text Search (FTS) solves this by building an Inverted Index. This 1,500+ word guide is your blueprint for Architecting Discovery. To master search is to master the physics of data retrieval.
1. The Anatomy of Search: Vector and Query
FTS works by separating the "Content" from the "Search Intent." This separation is what allows the database to stay fast. We don't search the text; we search a Mathematical Representation of the text.
TSVector: The "Hardware" Mirror of Content
A tsvector is a sorted list of distinct lexemes. Lexemes are not just words; they are the "Roots" of words.
- Tokenization: The engine breaks your text into words (tokens) based on white space and punctuation.
- Normalization: It converts everything to lowercase and eliminates noise.
- Stopword Removal: Words like "the," "is," and "and" are deleted. In the world of Information Entropy, these words carry zero weight. They only bloat the index and waste CPU cycles during intersection.
- Stemming: This is the magic. The engine uses algorithms (like the Porter Stemmer or Snowball) to strip suffixes. "Running" becomes "run." "Fishing" becomes "fish."
TSQuery: The Search Objective
A tsquery is your search intent, parsed into the same normalized lexemes. It supports Boolean logic and proximity operators:
- & (AND): database & performance
- | (OR): postgres | mysql
- ! (NOT): architecture & !monolith
- <-> (PROXIMITY): search <-> engine (find rows where "search" is directly followed by "engine").
The Architect's Detail: When you search for "Searching for Engineers," the query is normalized to 'search' & 'engin'. The database is comparing the normalized query against the normalized vector. This is why "Search" finds "Searching."
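A minimal sketch of this normalization in PostgreSQL (the exact lexemes depend on the dictionary, but the English configuration typically produces the stems shown):

```sql
-- Both sides are normalized to the same lexemes before comparison.
SELECT to_tsvector('english', 'Searching for Engineers');
-- typically: 'engin':3 'search':1  ("for" is a stopword)

SELECT to_tsquery('english', 'Searching & Engineers');
-- typically: 'search' & 'engin'

-- The @@ operator matches a query against a vector:
SELECT to_tsvector('english', 'Searching for Engineers')
       @@ to_tsquery('english', 'search & engineer');  -- true
```

Because both sides pass through the same pipeline, any inflection of a word in the query matches any inflection in the document.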
2. Hardware-Mirror: Inverted Index Physics
How does a database search 100GB of text in 5 milliseconds? The answer is the GIN (Generalized Inverted Index).
The Posting List Geometry
In a standard B-Tree index, you store "Key -> RowID." If the key is 'User123', the value is the address of that user's row. In an Inverted Index (GIN), you store "Lexeme -> Posting List."
- The Physics: Imagine a massive table where the keys are every unique word in your entire database. The "Value" associated with the word "Performance" is a highly compressed bitstream of every single RowID that contains that word.
- The Search Execution: When you search for "Database," the engine doesn't scan the table. It jumps to the "Database" key in the GIN index, grabs the list of IDs, and returns them.
- CPU Alignment: If you search for "Database & Performance," the engine grabs two ID lists (Posting Lists) and performs a Bitwise AND directly in the CPU cache. This is the fastest possible way for silicon to handle multi-term discovery.
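The mechanics above can be sketched with an expression-based GIN index; the articles table and its columns are hypothetical, used only for illustration:

```sql
-- Hypothetical table: articles(id, body).
CREATE INDEX idx_articles_fts
    ON articles
    USING GIN (to_tsvector('english', body));

-- The planner can use the GIN index because the expression in the
-- WHERE clause matches the indexed expression exactly:
SELECT id
FROM articles
WHERE to_tsvector('english', body)
      @@ to_tsquery('english', 'database & performance');
```

The AND query above is exactly the posting-list intersection described: one ID list for 'database', one for 'performance', combined before any table row is touched.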
3. Stemming Physics: The Suffix Strip
Stemming is what makes FTS "Smart" rather than "Literal."
The Algorithmic Suffix Mirror
The database doesn't have a giant dictionary of all words. Instead, it uses a set of rules (e.g., "If it ends in -ing, remove it").
- Guarantee: By reducing words to their "Root," the database ensures a dramatically smaller Index Surface Area.
- Storage Result: Instead of having separate index entries for "Architect," "Architectural," "Architecting," and "Architecture," you have one entry: architect.
- Latency Result: Because the index is smaller, more of it fits in the OS Page Cache (RAM). A search that stays in RAM is 1,000x faster than one that has to hit the SSD.
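You can watch the suffix strip happen directly; a small sketch (the English stemmer collapses these inflections to a single lexeme):

```sql
-- Three inflections, one lexeme, three position markers:
SELECT to_tsvector('english', 'searching searches searched');
-- typically: 'search':1,2,3
```

One index entry now covers every inflection, which is the storage win described above.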
4. Ranking Mathematics: The ts_rank Density
Finding the data is only 50% of the battle. The other 50% is Relevance. Why is Page 1 of Google better than Page 10? Because of Ranking.
ts_rank vs. ts_rank_cd
PostgreSQL provides two primary functions for this:
- ts_rank: Calculates relevance based on Frequency. The more times the word "Cloud" appears in a document, the higher the score. It assumes that repetition equals importance.
- ts_rank_cd (Cover Density): This is the modern standard. It looks at the Proximity of the words. If you search for "Vector Database," ts_rank_cd will score a row higher if those two words are next to each other than if they are at opposite ends of a 5,000-word article.
Weighted Vectors: Designing Importance
In production systems, we treat titles as more important than body text. We use a "Weighting Mirror":
- Label A: Highest priority (Title).
- Label B: High priority (Summary).
- Label C: Medium priority (Tags).
- Label D: Low priority (Body).
This ensures that if a user searches for "SQL Performance," an article with those words in the TITLE will always beat a long article that only mentions them in a footnote.
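A minimal sketch of the weighting mirror using setweight; the articles table and its title/body columns are assumptions for illustration:

```sql
-- Title hits are weighted 'A' (highest), body hits 'D' (lowest),
-- so ts_rank scores a title match far above a footnote match.
SELECT id,
       ts_rank(
         setweight(to_tsvector('english', title), 'A') ||
         setweight(to_tsvector('english', body),  'D'),
         to_tsquery('english', 'sql & performance')
       ) AS score
FROM articles
ORDER BY score DESC
LIMIT 10;
```

In production you would store this concatenated, weighted vector rather than building it per query.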
5. Fuzzy Searching: The Trigram Mirror
FTS is perfect for exact words or stems, but what if the user makes a typo? Searching for "Datbase" instead of "Database"? A standard tsvector will return zero results because "Datbase" isn't a valid lexeme.
The 3-Character Sliding Window (pg_trgm)
To handle typos, we use the pg_trgm extension.
- The Physics: It breaks words into 3-character chunks (Trigrams). "Database" becomes dat, ata, tab, aba, bas, ase (plus padded boundary trigrams for the start and end of the word).
- Similarity Scoring: When a user types a typo, the engine calculates a Jaccard-style similarity: the fraction of 3-character chunks the two strings share.
- The Architecture: In high-fidelity systems, the standard is to use FTS for speed and logic, and the Trigram Index as a fallback for "Did you mean?" suggestions when FTS returns zero results.
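A sketch of the fallback path with pg_trgm; the products table and name column are hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Inspect the chunks and the score for a typo:
SELECT show_trgm('database');
SELECT similarity('datbase', 'database');  -- a value between 0 and 1

-- Trigram GIN index for the "Did you mean?" fallback:
CREATE INDEX idx_products_name_trgm
    ON products USING GIN (name gin_trgm_ops);

SELECT name
FROM products
WHERE name % 'datbase'                        -- % = "similar enough"
ORDER BY similarity(name, 'datbase') DESC
LIMIT 5;
```

The % operator respects the pg_trgm.similarity_threshold setting, so you can tune how forgiving the fallback is.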
6. Performance Profile: Sequential Scan vs GIN
Developers often wonder: "When should I switch to FTS?"
- The 10k Rule: For tables under 10,000 rows, a LIKE search might be "fast enough" because the data fits in the CPU cache.
- The 1M Reality: At 1 million rows, a LIKE '%word%' search will take seconds. A GIN-indexed FTS search will take 10ms.
- Write Amplification: GIN indexes are expensive to update. Every time you insert a row, the engine has to update the Posting List for dozens of words. If your application is 99% writes and 1% search, FTS might be a bottleneck. But for e-commerce, blogs, or documentation hubs, the search speed is worth the write cost.
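You can verify the trade-off yourself with EXPLAIN; this sketch assumes a hypothetical articles table that has a GIN index on to_tsvector('english', body):

```sql
-- Leading-wildcard LIKE: the planner has no usable index,
-- so the plan shows a Seq Scan over the whole table.
EXPLAIN ANALYZE
SELECT id FROM articles WHERE body LIKE '%database%';

-- FTS: the plan shows a Bitmap Index Scan on the GIN index,
-- touching only the posting list for the lexeme 'databas'.
EXPLAIN ANALYZE
SELECT id FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'database');
```

Comparing the two plans on your own data is the most honest answer to "when should I switch?"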
7. Case Study: Eliminating ElasticSearch Overhead
A high-traffic e-commerce store was paying $500/month for a hosted search cluster.
- The Complexity: Every time a product price changed in SQL, they had to "Sync" it to the search cluster via an API. This sync failed occasionally, leading to "Phantom Prices" in search results.
- The Move: They migrated to PostgreSQL FTS.
- The Hardware Result: By using a Generated Column (tsvector GENERATED ALWAYS AS ...), all indexing happened atomically during the SQL transaction. No external sync, no API lag.
- The Business Result: Reduced monthly costs to $0, eliminated data inconsistency bugs, and maintained a sub-20ms search latency across 1 million products on a single database instance.
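A sketch of that generated-column pattern (requires PostgreSQL 12+; the products table and its name/description columns are illustrative):

```sql
-- The vector is recomputed inside the same transaction as the write,
-- so the index can never drift out of sync with the data.
ALTER TABLE products
    ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(name, '')),        'A') ||
        setweight(to_tsvector('english', coalesce(description, '')), 'D')
    ) STORED;

CREATE INDEX idx_products_search
    ON products USING GIN (search_vector);

SELECT id, name
FROM products
WHERE search_vector @@ websearch_to_tsquery('english', 'wireless keyboard');
```

The coalesce calls keep the expression from returning NULL when one column is empty, and websearch_to_tsquery accepts Google-style input directly from a search box.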
8. Summary: The Discovery Checklist
- Never Use to_tsvector in SELECT: This is a production sin. It re-calculates the vector for every row on every search. Use a Generated Column to store the vector on disk.
- Language Context: Always specify the language context (e.g., 'english'). Stemmers work differently across languages (e.g., German compounding vs. English suffixing).
- Use GIN, Not GiST: For most Full-Text use cases, GIN (Inverted Index) is roughly 3x faster than GiST for searching, though GiST is faster for updates.
- Leverage Proximity: Use the <-> operator in your UI to allow "Phrase Search" (e.g., users putting quotes around words).
- Monitor Bloat: GIN indexes can grow very large. Run VACUUM ANALYZE regularly to keep statistics fresh and let the engine flush the GIN pending list.
Full-Text Search turns your database into a Discovery Engine. By mastering the internal map of lexemes and the math of relevance density, you gain the power to build search experiences that are as fast as Google and as reliable as standard SQL. You have graduated from "Fetching data" to "Architecting Relevance."
Phase 26: Discovery Mastery Action Items
- Install pg_trgm and build a "Typo-tolerant" fallback for your main search UI.
- Implement a Generated Column for your search vector to automate indexing during insertions.
- Compare ts_rank and ts_rank_cd results for a multi-word search query.
- Use ts_headline() to generate dynamic "Search Snippets" that highlight keywords in your frontend results.
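For the snippet action item, a minimal ts_headline sketch (articles table hypothetical; StartSel/StopSel/MaxWords/MinWords are standard ts_headline options):

```sql
-- Wrap matched lexemes in <mark> tags and trim the excerpt length:
SELECT ts_headline(
         'english',
         body,
         to_tsquery('english', 'database & performance'),
         'StartSel=<mark>, StopSel=</mark>, MaxWords=25, MinWords=10'
       ) AS snippet
FROM articles
WHERE to_tsvector('english', body)
      @@ to_tsquery('english', 'database & performance');
```

Run ts_headline only on the rows you actually return (after LIMIT), since it re-parses the document text and is far more expensive than the index lookup itself.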
Read next: JSONB & NoSQL: The Hybrid Database Revolution →
Part of the SQL Mastery Course — engineering the search.
