SQL Data Types and Normalization: The Database Blueprint

If you choose the wrong data type, you can lose money, literally. If you use a FLOAT for a bank balance, rounding errors accumulate and your ledger stops balancing to the cent. If you don't "Normalize" your data, you will end up with "Update Anomalies": a user changes their name in one place, a second copy elsewhere keeps the old value, and the resulting "Sync Bug" is very hard to repair in application code alone.

This guide is your blueprint for the "Foundational Layer." We will explore how PostgreSQL physically stores rows, the rules of normalization, and the "De-normalization" patterns in common use in 2026.

1. Hardware-Mirror: The Physics of Data Storage

When you define a table, the database isn't just "Saving text." PostgreSQL stores rows in fixed-size pages ($8$ KB by default) on disk, and the order in which you declare your columns changes how many bytes each row occupies inside those pages.

Data Alignment and Padding

PostgreSQL aligns each fixed-width value to a boundary that matches its type: $2$ bytes for SMALLINT, $4$ for INTEGER, $8$ for BIGINT and TIMESTAMPTZ. If you have a SMALLINT (2 bytes) followed by a BIGINT (8 bytes), the engine must insert $6$ "Padding Bytes" of empty space so that the BIGINT starts on an $8$-byte boundary.

The Pro Tip: Arrange your columns from Largest to Smallest. Put your BIGINT and TIMESTAMPTZ at the top, and your BOOLEAN and SMALLINT at the bottom. Across 1 billion rows, every 6-byte gap you remove saves roughly 6 GB of table storage, along with the cache space and I/O needed to read it.
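The padding arithmetic above can be sketched in a few lines of Python. This is a simplified model, not PostgreSQL's actual code: the `TYPES` table and the `data_bytes` helper are illustrative names, and the model ignores the row header and the padding applied at the end of each row.

```python
# (size_bytes, alignment_bytes) for a few fixed-width PostgreSQL types.
TYPES = {
    "bigint": (8, 8),
    "timestamptz": (8, 8),
    "integer": (4, 4),
    "smallint": (2, 2),
    "boolean": (1, 1),
}


def data_bytes(columns):
    """Bytes used by the column data of one row, including alignment padding."""
    offset = 0
    for type_name in columns:
        size, align = TYPES[type_name]
        offset += -offset % align  # pad up to the next multiple of `align`
        offset += size
    return offset


mixed = ["smallint", "bigint", "boolean", "timestamptz", "integer"]
# Sort by alignment, widest first, which is the ordering the tip recommends.
ordered = sorted(mixed, key=lambda t: TYPES[t][1], reverse=True)

print(data_bytes(mixed))    # 36
print(data_bytes(ordered))  # 23
```

In this toy layout the same five columns shrink from 36 to 23 bytes of data per row purely by reordering.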

The TOAST Mirror: Handling Large Values

What happens when you store a 5MB blog post in a TEXT column, but a data page is only 8KB?

The Physical Limit: Postgres cannot split a single tuple across multiple pages.
The TOAST Solution: The Oversized-Attribute Storage Technique (TOAST).
The Mirror Physics: When a row grows past roughly 2 KB, Postgres first tries to compress its large values and, if the row is still too big, moves them in chunks into a secondary "Shadow Table" (the TOAST table). The main table then stores only an 18-byte "Pointer" to the out-of-line value.
Performance Tax: Reading TOASTed data requires extra lookups in the TOAST table and, for compressed values, decompression. This is why you should keep high-frequency columns small and reserve large TEXT or BYTEA fields for data that isn't read on every query.

The NULL Bitmap Mirror

How does a database know a value is NULL without reading the whole column?

The Header Physics: Every row starts with a HeapTupleHeaderData.
The Bitmap: When a row contains at least one NULL, the header carries a bitmask where each bit corresponds to a column. If a bit is "0", the column is NULL.
The Efficiency Mirror: The engine checks this bitmap while walking the row to compute each column's offset, and a NULL column occupies no bytes in the data area. Sparse tables (many NULLs) therefore stay compact.
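A minimal sketch of the bitmap idea, assuming one bit per column with 1 meaning "value present" and 0 meaning NULL. The function names are hypothetical and the sketch models only the bitmap, not the rest of the row header.

```python
def build_null_bitmap(row):
    """Pack one bit per column: 1 = value present, 0 = NULL (None)."""
    bitmap = bytearray((len(row) + 7) // 8)
    for i, value in enumerate(row):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_null(bitmap, column_index):
    """Check a column's bit without touching the column data itself."""
    bit = (bitmap[column_index // 8] >> (column_index % 8)) & 1
    return bit == 0


row = (1, None, "a", None, None, None, None, None, 5)  # 9 columns
bitmap = build_null_bitmap(row)

print(len(bitmap))         # 2 bytes cover 9 columns
print(is_null(bitmap, 1))  # True
print(is_null(bitmap, 8))  # False
```

Nine columns cost two bytes of bitmap, and the six NULLs in this row need no storage beyond their bits.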

The Financial Guardrail: DECIMAL vs. FLOAT

Never use FLOAT or REAL for money.

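The Binary Trap: Computers use binary fractions. They cannot represent $0.1$ exactly: the closest double-precision value is about $0.1000000000000000055$, and $0.1 + 0.2$ evaluates to $0.30000000000000004$. Over 1 million transactions, these tiny errors add up to "Missing cents" that will cause your accounting to fail an audit.
The Standard: Use DECIMAL(19, 4) (NUMERIC in PostgreSQL). It stores decimal digits and performs arithmetic in software rather than in the CPU's binary floating-point unit, so every value within the declared precision and scale is exact. The price is slower arithmetic and more storage than a binary type, which is a good trade for financial calculations.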
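The difference is easy to reproduce outside the database. The sketch below uses Python's `float` as a stand-in for FLOAT and the standard-library `decimal.Decimal` as a stand-in for DECIMAL; that mapping is an assumption made for illustration, and the helper names are hypothetical.

```python
from decimal import Decimal


def total_as_float(n):
    """Add 0.1 to a binary floating-point total n times."""
    total = 0.0
    for _ in range(n):
        total += 0.1
    return total


def total_as_decimal(n):
    """Add 0.10 to a base-10 decimal total n times."""
    total = Decimal("0")
    for _ in range(n):
        total += Decimal("0.10")
    return total


print(total_as_float(10) == 1.0)              # False
print(total_as_decimal(10) == Decimal("1"))   # True
print(0.1 + 0.2 == 0.3)                       # False
print(Decimal(0.1))  # the exact binary value stored for 0.1, slightly above 0.1
```

Ten additions are already enough for the float total to drift away from 1.0, while the decimal total stays exact.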

2. Primary Keys: The 2-Billion Row Time Bomb

One of the most common mistakes in database design is using a standard INT ($32$-bit) for the Primary Key of a table that can grow without bound.

Why this kills companies

A $32$-bit signed integer has a maximum value of $2,147,483,647$.

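If your app is successful, a busy table will eventually hit this limit.
The moment you try to insert the $2,147,483,648$th row, the insert fails with an out-of-range or sequence-exhausted error, and every later insert into that table fails the same way until the key column is widened.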
Example: In 2020, a major financial platform went down for hours because their transaction IDs hit this exact "Integer Overflow" wall.
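A small sketch of the limit, assuming a hypothetical `next_id` allocator that behaves like a 32-bit sequence. The insert rate passed to `years_until_exhausted` is an illustrative figure, not a measurement.

```python
INT4_MAX = 2**31 - 1  # largest signed 32-bit value
INT8_MAX = 2**63 - 1  # largest signed 64-bit value (BIGINT)

SECONDS_PER_YEAR = 365 * 24 * 3600


def next_id(current, max_value=INT4_MAX):
    """Return the next key, or fail once the type's range is used up."""
    if current >= max_value:
        raise OverflowError("integer out of range")
    return current + 1


def years_until_exhausted(rows_per_second, max_value=INT4_MAX):
    """How long a key space lasts at a steady insert rate."""
    return max_value / rows_per_second / SECONDS_PER_YEAR


print(INT4_MAX)                                      # 2147483647
print(years_until_exhausted(100) < 1)                # True: under a year at 100 rows/s
print(years_until_exhausted(100, INT8_MAX) > 1e9)    # True: BIGINT outlives the app
```

At a steady 100 inserts per second a 32-bit key is exhausted in well under a year, which is why the choice has to be made before the table gets busy.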

The Architect's Choice: BIGINT or UUID v7

BIGINT: A 64-bit integer ($9$ quintillion max). It is the standard for performance.
UUID v7: In 2026, we have moved away from UUID v4 (random) to UUID v7 (Time-Ordered), which is standardized in RFC 9562.
- Why?: UUID v4 values are random, so consecutive inserts land on random index pages, which "Shreds" B-tree indexes and slows down writes. UUID v7 begins with a 48-bit millisecond timestamp, meaning values are "Sortable" by creation time. They give you the uniqueness of a UUID with index locality close to that of a sequential number, at the cost of 16 bytes per key instead of 8.
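To make the layout concrete, here is a hand-rolled sketch of a UUID v7 following the RFC 9562 bit layout (48-bit millisecond timestamp, 4-bit version, 12 random bits, 2-bit variant, 62 random bits). The `uuid7` function is written for illustration only; in production, prefer a generator supplied by your database or a maintained library.

```python
import os
import time
import uuid


def uuid7(unix_ms=None):
    """Build a UUID v7: timestamp in the top 48 bits, random bits below."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                           # top 12 bits
    rand_b = rand & ((1 << 62) - 1)               # low 62 bits

    value = (unix_ms & ((1 << 48) - 1)) << 80     # 48-bit timestamp
    value |= 0x7 << 76                            # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                           # RFC variant
    value |= rand_b
    return uuid.UUID(int=value)


earlier = uuid7(1_700_000_000_000)
later = uuid7(1_700_000_000_001)

print(earlier.version)   # 7
print(earlier < later)   # True: a later timestamp sorts after an earlier one
```

Because the timestamp occupies the most significant bits, values created in different milliseconds sort in creation order, which is the property that keeps index inserts clustered.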

3. Normalization: The Mathematical Rigor

Normalization is the process of removing "Redundancy" to prevent data corruption.

1st Normal Form (1NF): Atomicity

Every cell must contain exactly one value.

Failure: Storing a list of IDs in a text column (e.g., "1,2,3").
Result: You can't use an ordinary index to search for ID #2. A pattern match such as LIKE '%2%' requires a "Full Table Scan" of every string, and it also matches 12 and 20.
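The 1NF fix is to give each ID its own row. The sketch below runs through Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the `order_items` table and its columns are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (order, product) pair instead of a '1,2,3' string.
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );
    -- An ordinary index can now answer "which orders contain product 2?".
    CREATE INDEX order_items_product_idx ON order_items (product_id);

    INSERT INTO order_items VALUES (100, 1), (100, 2), (100, 3), (101, 2), (102, 12);
""")

orders_with_product_2 = [
    row[0]
    for row in conn.execute(
        "SELECT order_id FROM order_items WHERE product_id = ? ORDER BY order_id",
        (2,),
    )
]
print(orders_with_product_2)  # [100, 101]
```

Order 102, which contains product 12, is correctly excluded, something a substring match on "1,2,3"-style text could not guarantee.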

2nd Normal Form (2NF): No Partial Dependencies

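Every non-key column must depend on the whole primary key. If you have a table Orders(OrderID, ProductID, CategoryName) whose primary key is the pair (OrderID, ProductID), the CategoryName shouldn't be here. It depends on the ProductID alone, which is only part of the key, so it belongs in a table keyed by ProductID.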

3rd Normal Form (3NF): No Transitive Dependencies

Columns must depend "on the key, the whole key, and nothing but the key."

Failure: Storing ZipCode and City together in a Users table keyed by UserID. City depends on ZipCode, which in turn depends on UserID, so City depends on the key only transitively. This creates a risk where you change the City for one user but forget to change it for another with the same ZipCode.
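A sketch of the 3NF fix, again using `sqlite3` as a stand-in for PostgreSQL. The `users` and `zip_codes` tables, and the sample values in them, are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- City lives in exactly one place, keyed by the column it depends on.
    CREATE TABLE zip_codes (
        zip_code TEXT PRIMARY KEY,
        city     TEXT NOT NULL
    );
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        zip_code TEXT NOT NULL REFERENCES zip_codes (zip_code)
    );

    INSERT INTO zip_codes VALUES ('00001', 'Oldtown');
    INSERT INTO users VALUES (1, 'Ada', '00001'), (2, 'Grace', '00001');
""")

# A single-row update now changes the city for every user in that zip code.
conn.execute("UPDATE zip_codes SET city = 'Newtown' WHERE zip_code = '00001'")

cities = [
    row[0]
    for row in conn.execute(
        "SELECT z.city FROM users u "
        "JOIN zip_codes z ON z.zip_code = u.zip_code "
        "ORDER BY u.user_id"
    )
]
print(cities)  # ['Newtown', 'Newtown']
```

With the city stored once, there is no second copy left to forget.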

Boyce-Codd Normal Form (BCNF)

BCNF is a stronger version of 3NF. It handles cases where a table has multiple overlapping candidate keys. If you have a table where a Student and Subject determine a Teacher, but a Teacher only teaches one Subject, you have a BCNF violation: Teacher determines Subject, yet Teacher is not a candidate key of the table. You must split the table (for example into Teacher-Subject and Student-Teacher) to ensure every determinant is a candidate key.

4. JSONB: The 2026 Breakthrough

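For a decade, architects had to choose between the Structure of SQL or the Flexibility of NoSQL (like MongoDB). JSONB, which PostgreSQL has shipped since version 9.4 (2014), removes most of that tradeoff.

Postgres allows you to store a JSON object in a column, but it stores it in a decomposed binary format rather than as raw text, so it does not have to re-parse the document on every read.

The Power: You can create a GIN Index on the JSONB column. Containment and key-existence queries (the @> and ? operators) can then use the index instead of scanning every document.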
Standard Practice: Keep your "Core Data" (Name, Email, ID) in strict SQL columns, and move your "Volatile Data" (User Settings, API Preferences) into a JSONB column.
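Conceptually, a GIN index is an inverted index: it maps each key/value pair found inside the documents to the rows that contain it. The toy below illustrates that idea in plain Python. It is not how PostgreSQL implements GIN, it handles top-level keys only, and `build_index` and `contains` are hypothetical names.

```python
import json
from collections import defaultdict


def build_index(docs):
    """Map each top-level (key, value) pair to the set of row ids holding it."""
    index = defaultdict(set)
    for row_id, doc in docs.items():
        for key, value in doc.items():
            index[(key, json.dumps(value, sort_keys=True))].add(row_id)
    return index


def contains(index, docs, query):
    """Row ids whose document contains every key/value pair in `query`."""
    row_ids = set(docs)
    for key, value in query.items():
        row_ids &= index.get((key, json.dumps(value, sort_keys=True)), set())
    return row_ids


settings = {
    1: {"theme": "dark", "beta": True},
    2: {"theme": "light", "beta": True},
    3: {"theme": "dark"},
}
index = build_index(settings)

print(sorted(contains(index, settings, {"theme": "dark"})))                # [1, 3]
print(sorted(contains(index, settings, {"theme": "dark", "beta": True})))  # [1]
```

The lookup intersects small sets of row ids rather than opening every document, which is the reason an indexed containment query scales better than a scan.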

5. Case Study: The "Soft Delete" Pattern

In high-availability systems, we never use the DELETE command for critical data. If a customer deletes their account, we don't want to lose their historical financial records.

The Professional Implementation

We add a deleted_at timestamp column to every table.

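The Benefit: It's a "Safety Net." If a bug in your code accidentally deletes $10,000$ users, you can "Restore" them with a single UPDATE that sets the column back to NULL. The price is that every query for live data must filter on deleted_at IS NULL.
The Storage Cost: A non-NULL timestamp takes 8 bytes per row, and a NULL one costs only its bit in the NULL bitmap. In 2026, disk space is cheap; reliability is expensive.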
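A minimal sketch of the pattern, using `sqlite3` as a stand-in for PostgreSQL. The `users` table, the `active_emails` helper, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        email      TEXT NOT NULL,
        deleted_at TEXT            -- NULL means the row is live
    );
    INSERT INTO users (user_id, email)
    VALUES (1, 'a@example.com'), (2, 'b@example.com');
""")


def active_emails():
    """Every read of live data filters out soft-deleted rows."""
    rows = conn.execute(
        "SELECT email FROM users WHERE deleted_at IS NULL ORDER BY user_id"
    )
    return [row[0] for row in rows]


# "Delete": mark the row instead of removing it.
conn.execute("UPDATE users SET deleted_at = CURRENT_TIMESTAMP WHERE user_id = 2")
after_delete = active_emails()

# "Restore": clear the marker again.
conn.execute("UPDATE users SET deleted_at = NULL WHERE user_id = 2")
after_restore = active_emails()

print(after_delete)   # ['a@example.com']
print(after_restore)  # ['a@example.com', 'b@example.com']
```

The row never leaves the table, so restoring it is an ordinary update rather than a recovery from backup.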

6. Summary: The Blueprint Checklist

Padding Optimization: Order your columns by size (Largest to Smallest) to save RAM and Disk space.
Money: Always use DECIMAL, never FLOAT.
Future-Proofing: Use BIGINT or UUID v7 for all Primary Keys to avoid the 2-billion row limit.
Logical Purity: Apply 3NF and BCNF normalization to prevent data desynchronization.
Volatility: Use JSONB for data that changes its structure every month.

Database design is the "Soil" in which your application grows. If the soil is poor (wrong types, bad normalization), your app will be slow and buggy forever. If the soil is rich, your system will scale effortlessly to millions of users. You are no longer "Defining tables"; you are "Architecting the Data Lifecycle."

7. The Normalization Physics: The Performance Tradeoff

Normalization is about Correctness, but it comes with a Latency Tax.

The Join Geometry

3NF Efficiency: You save disk space by not repeating strings (e.g., storing a CategoryID instead of CategoryName).
Execution Cost: To see the Category Name, you must perform a Join. Every Join has a computational cost: the engine must compare two sets of rows and match their keys.
The Golden Rule: Normalize for Writes (Prevent corruption). De-normalize (selectively) for Reads (Cache results). In 2026, we often store a "Normalized Mirror" as the source of truth and a "Denormalized View" for the dashboard.
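One way to sketch the "source of truth plus read copy" arrangement is a pre-joined table that is rebuilt from the normalized tables. The example uses `sqlite3` as a stand-in; in PostgreSQL a materialized view would be a natural fit for the same role. The table names and the `rebuild_listing` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source of truth: the category name is stored once.
    CREATE TABLE categories (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE products (
        product_id  INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        category_id INTEGER NOT NULL REFERENCES categories (category_id)
    );
    INSERT INTO categories VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO products VALUES (10, 'SQL Primer', 1), (11, 'Chess Set', 2);
""")


def rebuild_listing(connection):
    """Recreate the denormalized read table from the normalized tables."""
    connection.executescript("""
        DROP TABLE IF EXISTS product_listing;
        CREATE TABLE product_listing AS
            SELECT p.product_id, p.title, c.name AS category_name
            FROM products p
            JOIN categories c ON c.category_id = p.category_id;
    """)


rebuild_listing(conn)

# The dashboard reads the flat table and performs no join at query time.
listing = conn.execute(
    "SELECT title, category_name FROM product_listing ORDER BY product_id"
).fetchall()
print(listing)  # [('SQL Primer', 'Books'), ('Chess Set', 'Games')]
```

Writes go only to the normalized tables; the flat copy is disposable and can be rebuilt whenever it falls behind.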

Masterclass Alignment Checklist

Audit Column Ordering: Arrange by size (BIGINT -> INT -> SMALLINT) to minimize padding.
Fix Money Columns: Replace all FLOAT money columns with DECIMAL(19,4).
Use UUID v7: Where IDs must be generated on more than one node, prefer sortable UUID v7 keys; otherwise BIGINT remains the simpler choice.
Map JSONB Indexing: Add a GIN index to any JSONB column that you query by key or by containment.

Frequently Asked Questions

Q: What is the difference between VARCHAR and TEXT in PostgreSQL?

In PostgreSQL, VARCHAR(n) and TEXT are stored identically - there is no performance difference. VARCHAR(n) adds a length constraint (an error is raised if you insert a string longer than n characters). TEXT has no length limit. Unlike MySQL or SQL Server where TEXT uses different storage, PostgreSQL's TEXT is the recommended type for variable-length strings with no need for a constraint. Use VARCHAR(n) only when you want the database to enforce a maximum length as a data quality rule, not for performance.

Q: What is database normalisation and what do the normal forms mean?

Normalisation is the process of structuring tables to reduce data redundancy and improve integrity. First Normal Form (1NF) requires each column to hold atomic values - no arrays or comma-separated lists. Second Normal Form (2NF) requires that non-key columns depend on the entire primary key, not just part of it (applies to composite keys). Third Normal Form (3NF) requires that non-key columns depend only on the primary key, not on other non-key columns (no transitive dependencies). In practice, most well-designed databases aim for 3NF; Boyce-Codd Normal Form (BCNF) is a stricter refinement rarely required outside academia.

Q: When should I denormalise a database schema?

Denormalisation - intentionally introducing redundancy - is justified for read performance when: the query requires many expensive joins that a denormalised table eliminates, reporting or analytics queries run on data that changes infrequently, or the application has strict latency requirements that the normalised schema cannot meet. Common denormalisation patterns include storing a computed aggregate column alongside the source data, flattening a lookup table into a fact table, or maintaining a pre-joined summary table. Always denormalise based on measured query performance, not speculation, and document why each denormalisation exists.