SQL Project: E-commerce Analytics Engine

Welcome to the first Capstone Lab of the SQL Masterclass. You have spent the previous modules learning the "Grammar" of SQL, from the basics of SELECT to the advanced physics of MVCC and GIN indexes. Now, you will build a "Machine."
In the modern digital economy, the database is not just a storage box; it is the Engine of Truth. If this engine is slow, the mobile app lags, the Black Friday sales crash, and the company loses millions. As an Architect, your job is to ensure the Hardware Mirror of your code is perfectly aligned with the physical reality of SSDs and CPU cycles.
This project guide is a high-fidelity engineering assignment. You are the Lead Database Architect at "Arcane Commerce," a global marketplace. Your mission is to build the backend logic for their 2026 analytics dashboard, capable of handling millions of orders and real-time inventory checks.
1. Task 1: The "Hardware-Mirror" Infrastructure
Before you can analyze data, you must build a storage engine that doesn't collapse under load. Most developers default to SERIAL integers for Primary Keys. This is a "Legacy Trap."
The Mission: 3NF Relational Foundation
Design a 3rd Normal Form (3NF) relational schema that handles accounts, catalogs, and transactions.
- Schema Requirements:
  - `users`: Must handle geographic data, account age, and registration metadata.
  - `products`: Must support a base structure (SKU, price) plus custom metadata.
  - `orders`: The central fact table linking users and timestamps.
  - `order_items`: The grain of the transaction, storing price snapshots.
The Architect's Standard: UUID v7
Instead of standard integers or random UUID v4, use UUID v7 for all primary keys.
- The Physics: UUID v7 is "Time-Ordered." Unlike random UUID v4, which causes "Index Fragmentation" by inserting rows into random pages of the B-Tree, UUID v7 preserves "Sequential Ordering."
- B-Tree Page Splitting: When you use random IDs, the database has to split pages in the middle of the index to fit new rows. This results in Storage Bloat and higher Write Amplification. UUID v7 ensures new records land at the "Right-most Leaf," keeping the index dense and efficient.
- The Result: Your disk writes stay sequential, and your index lookups stay in the CPU cache.
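A minimal DDL sketch of this 3NF foundation follows. It assumes PostgreSQL 18's built-in `uuidv7()` function; on older versions, generate UUID v7 values in the application layer or via an extension, since core PostgreSQL only ships `gen_random_uuid()` (v4). Column names here are illustrative, not prescribed by the assignment.

```sql
-- Sketch: 3NF core schema with time-ordered UUID v7 primary keys.
-- Assumes PostgreSQL 18's uuidv7(); hedge: on older versions supply
-- UUID v7 values from the application or an extension.
CREATE TABLE users (
    user_id    uuid PRIMARY KEY DEFAULT uuidv7(),
    email      text NOT NULL UNIQUE,
    country    text NOT NULL,                       -- geographic data
    created_at timestamptz NOT NULL DEFAULT now()   -- registration metadata
);

CREATE TABLE products (
    product_id uuid PRIMARY KEY DEFAULT uuidv7(),
    sku        text NOT NULL UNIQUE,
    price      numeric(10,2) NOT NULL CHECK (price >= 0)
);

CREATE TABLE orders (
    order_id   uuid PRIMARY KEY DEFAULT uuidv7(),
    user_id    uuid NOT NULL REFERENCES users(user_id),
    order_date timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE order_items (
    order_id   uuid NOT NULL REFERENCES orders(order_id),
    product_id uuid NOT NULL REFERENCES products(product_id),
    quantity   int  NOT NULL CHECK (quantity > 0),
    unit_price numeric(10,2) NOT NULL,   -- price snapshot at purchase time
    PRIMARY KEY (order_id, product_id)
);
```

Because every new `uuidv7()` value sorts after all previous ones, inserts append to the right-most B-Tree leaf instead of splitting random interior pages.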
2. Task 2: The Hybrid Catalog (JSONB & GIN)
Marketing wants to store custom attributes for every product (e.g., "Fabric" for shirts, "Screen Type" for laptops, "Battery Capacity" for gadgets). If you create a new column for every possible attribute, your table will have 5,000 columns—an architectural failure known as "The Sparse Column Trap."
The Mission: Schema-less Flexibility
- Implementation: Add a `metadata` JSONB column to the `products` table.
- Storage Physics: Unlike standard `JSON` (which is stored as plain text and parsed on every read), `JSONB` is stored in a Decomposed Binary Format. This allows the engine to jump directly to specific keys without reading the entire document.
- Requirement: Create a GIN (Generalized Inverted Index) on the `metadata` column.
- The Challenge: Write a high-performance query using the `@>` containment operator to find all "Organic Cotton" shirts priced between $20 and $60.
The Mirror Detail: A GIN index doesn't index the rows; it indexes the Values inside the JSON. When you search for {"material": "cotton"}, the index tells the database exactly which RowIDs contain that specific key-value pair, bypassing a slow Sequential Scan of millions of products.
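A sketch of the hybrid catalog, building on the Task 1 schema. The JSON key names (`category`, `material`) are assumptions for illustration; use whatever keys your catalog actually stores.

```sql
-- Sketch: hybrid catalog with a GIN-indexed JSONB column.
ALTER TABLE products ADD COLUMN IF NOT EXISTS metadata jsonb NOT NULL DEFAULT '{}';

-- GIN indexes the keys and values INSIDE the documents, not the rows.
CREATE INDEX idx_products_metadata ON products USING gin (metadata);

-- The @> containment operator is GIN-accelerated; the price range
-- is filtered separately (add a B-Tree on price if it's selective).
SELECT product_id, sku, price
FROM   products
WHERE  metadata @> '{"category": "shirt", "material": "organic cotton"}'
AND    price BETWEEN 20 AND 60;
```

Note the operand order: `metadata @> '{...}'` asks "does this document contain this fragment?", which is the direction the default GIN operator class accelerates.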
3. Task 3: Analytical Intelligence (Cohort Analysis)
The CEO doesn't care about single orders. They care about Retention. In e-commerce, the most expensive thing you can do is acquire a new customer. The most profitable thing is keeping one.
The Mission: Retention & Customer Lifetime Value (LTV)
Use Window Functions and CTEs to build a "Customer Lifetime Value" (LTV) report that tracks user behavior over time.
- Step 1: The First Touch: Use `MIN(order_date) OVER (PARTITION BY user_id)` to find the date of every user's first purchase.
- Step 2: The Running Total: Use `SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date)` to calculate the cumulative revenue per user.
- Step 3: Monthly Cohorts: Group users by the month they joined and calculate their average revenue after 3, 6, and 12 months.
The "Work_Mem" Challenge
When you run a Window Function over 10 million rows, the database must sort the data. If your work_mem setting is too low, the database will "Spill to Disk"—writing temporary files to the SSD. This is 100x slower than RAM. As an Architect, you must optimize your CTEs to ensure the sort happens entirely in memory.
SQL Logic Mirror:
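One possible shape for this report, assuming the Task 1 schema (order totals are derived from `order_items` since `orders` carries no amount column):

```sql
-- Sketch: first touch, running LTV, and monthly cohort revenue.
WITH order_totals AS (
    SELECT o.user_id, o.order_date,
           SUM(oi.quantity * oi.unit_price) AS amount
    FROM   orders o
    JOIN   order_items oi USING (order_id)
    GROUP  BY o.order_id, o.user_id, o.order_date
),
enriched AS (
    SELECT user_id, order_date, amount,
           MIN(order_date) OVER (PARTITION BY user_id)  AS first_order,
           SUM(amount) OVER (PARTITION BY user_id
                             ORDER BY order_date)        AS running_ltv
    FROM   order_totals
)
SELECT date_trunc('month', first_order) AS cohort_month,
       SUM(amount) FILTER (WHERE order_date < first_order + interval '3 months')
           / COUNT(DISTINCT user_id) AS revenue_per_user_3m,
       SUM(amount) FILTER (WHERE order_date < first_order + interval '6 months')
           / COUNT(DISTINCT user_id) AS revenue_per_user_6m,
       SUM(amount) FILTER (WHERE order_date < first_order + interval '12 months')
           / COUNT(DISTINCT user_id) AS revenue_per_user_12m
FROM   enriched
GROUP  BY 1
ORDER  BY 1;
```

If `EXPLAIN (ANALYZE)` shows `Sort Method: external merge`, the sort spilled to disk; raising `work_mem` for this session is the usual fix.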
4. Task 4: The Atomic Checkout Engine
Data integrity is the "Foundation of Trust." If an order is created but the inventory wasn't updated, the system is broken. If two people buy the "Last Item" at the same microsecond, and they both get a "Success" message, you have a physical supply chain failure.
The Mission: The process_order Engine
Write a Stored Procedure (PL/pgSQL) that handles a checkout atomically.
- Requirements:
- Isolation: Use `SELECT ... FOR UPDATE` to lock the specific row in the `inventory` table. This prevents "Lost Updates."
- Safety: Check that `inventory.quantity >= order_item.quantity`.
- Atomicity: If the quantity is insufficient, issue a `RAISE EXCEPTION`, which triggers an automatic `ROLLBACK` of the entire transaction.
- Audit: On success, insert an entry into the `audit_logs` table using the `JSONB` representation of the row to record the state before vs. after.
The Lock Contention Mirror
While SELECT FOR UPDATE is safe, it is also a bottleneck. If 5,000 people try to buy the same "Limited Edition" item, they will all queue up for a single row lock. Architects solve this by using Inventory Sharding or Soft-Reservation Logs, but for this project, you must implement the primary "Strict" lock logic.
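A sketch of the strict-lock version. It assumes an `inventory(product_id uuid PRIMARY KEY, quantity int)` table and an `audit_logs(entry jsonb, logged_at timestamptz)` table alongside the Task 1 schema; both names are placeholders.

```sql
-- Sketch: atomic checkout with a strict row lock.
CREATE OR REPLACE PROCEDURE process_order(
    p_order_id   uuid,
    p_product_id uuid,
    p_quantity   int
)
LANGUAGE plpgsql AS $$
DECLARE
    v_before inventory%ROWTYPE;
BEGIN
    -- Isolation: lock the inventory row until COMMIT/ROLLBACK.
    SELECT * INTO v_before
    FROM   inventory
    WHERE  product_id = p_product_id
    FOR UPDATE;

    IF NOT FOUND THEN
        RAISE EXCEPTION 'unknown product %', p_product_id;
    END IF;

    -- Safety: refuse to oversell.
    IF v_before.quantity < p_quantity THEN
        -- RAISE aborts the transaction; every prior write rolls back.
        RAISE EXCEPTION 'insufficient stock for %', p_product_id;
    END IF;

    UPDATE inventory
    SET    quantity = quantity - p_quantity
    WHERE  product_id = p_product_id;

    INSERT INTO order_items (order_id, product_id, quantity, unit_price)
    SELECT p_order_id, p_product_id, p_quantity, price
    FROM   products
    WHERE  product_id = p_product_id;

    -- Audit: record state before vs. after as JSONB.
    INSERT INTO audit_logs (entry, logged_at)
    VALUES (jsonb_build_object(
                'before', to_jsonb(v_before),
                'after',  (SELECT to_jsonb(i) FROM inventory i
                           WHERE i.product_id = p_product_id)),
            now());
END;
$$;
```

Call it with `CALL process_order(...)` inside a transaction; two concurrent callers for the same product serialize on the `FOR UPDATE` lock, so only one can take the last unit.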
5. Task 5: Performance Stress Lab (1 Million Row Benchmark)
Architecture is a theory until it hits 1 million rows. A query that takes 1ms on your laptop might take 30 seconds in production.
The Mission: The 2:00 AM Black Friday Simulation
- Step 1: Data Synthesis: Use `generate_series()` and `random()` to populate your tables with a realistic distribution of data.
- Step 2: The Bottleneck: Run a complex join across `users`, `orders`, and `products` without primary/foreign key indexes.
- Step 3: The Physics of EXPLAIN: Use `EXPLAIN (ANALYZE, BUFFERS)` to read the Hardware Cost Model.
  - Look for "Seq Scan": the engine is reading the entire file from the SSD.
  - Look for "Filter": the CPU is discarding rows after reading them from disk.
- Step 4: The Fix: Add B-Tree indexes on all join columns. Observe how the engine switches to a "Nested Loop" or "Hash Join," and watch the `Buffers: shared hit` count increase. This means your data is now being served from PostgreSQL's shared buffer cache in RAM, not from the slow disk.
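A sketch of the synthesis and measurement steps, assuming the Task 1 schema with defaulted primary keys; the country list and row counts are arbitrary:

```sql
-- Sketch: synthesize a million users, then inspect the plan.
INSERT INTO users (email, country)
SELECT 'user' || g || '@example.com',
       (ARRAY['US','DE','JP','BR'])[1 + floor(random() * 4)::int]
FROM   generate_series(1, 1000000) AS g;

-- Before indexing: expect "Seq Scan" plus a "Filter" line in the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, o.order_date
FROM   users u
JOIN   orders o ON o.user_id = u.user_id
WHERE  u.country = 'DE';

-- The Fix: index the join and filter columns, then re-run EXPLAIN.
CREATE INDEX idx_orders_user_id ON orders (user_id);
CREATE INDEX idx_users_country  ON users (country);
```

Run `ANALYZE` after bulk loading so the planner has fresh statistics; otherwise it may keep choosing a sequential scan regardless of your indexes.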
6. Task 6: Materialized Dashboards
Leadership needs a "One Second Refresh" dashboard for daily sales metrics. Joining 1 million rows and calculating aggregates on every page refresh of a busy internal tool is a waste of $1,000/month in cloud compute.
The Mission: Caching the State
- Implementation: Create a Materialized View named `daily_sales_stats` that calculates yesterday's total revenue, top products, and churn rate.
- The "Staleness" Mirror: Data in a Materialized View is "Stale"; it doesn't update automatically.
- Maintenance: Implement logic that runs `REFRESH MATERIALIZED VIEW CONCURRENTLY`. The `CONCURRENTLY` keyword ensures that users can still read the "Old" data while the "New" data is being calculated in the background, avoiding any downtime. Note that `CONCURRENTLY` requires a unique index on the view.
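A sketch built on the Task 1 schema (churn rate omitted for brevity, since its definition depends on your retention model):

```sql
-- Sketch: cached daily sales metrics.
CREATE MATERIALIZED VIEW daily_sales_stats AS
SELECT date_trunc('day', o.order_date)  AS sales_day,
       SUM(oi.quantity * oi.unit_price) AS total_revenue,
       COUNT(DISTINCT o.user_id)        AS active_buyers
FROM   orders o
JOIN   order_items oi USING (order_id)
GROUP  BY 1;

-- CONCURRENTLY refuses to run without a unique index on the view.
CREATE UNIQUE INDEX idx_daily_sales_day ON daily_sales_stats (sales_day);

-- Schedule this (cron, pg_cron, etc.); readers keep seeing the old
-- snapshot while the new one is computed in the background.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_stats;
```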
7. Task 7: Growth Engineering (Moving Averages)
In e-commerce, raw numbers are deceptive. You need to see the Trend.
The Mission: The 7-Day Revenue Curve
Build a query that calculates the 7-Day Moving Average of revenue.
- The Physics: Use the window frame
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. - The Goal: Smooth out the "Noise" of low-traffic weekdays vs high-traffic weekends to give the marketing team a clear view of marketing campaign impact.
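A sketch of the moving-average query, again deriving revenue from `order_items` under the Task 1 schema:

```sql
-- Sketch: 7-day moving average over daily revenue.
WITH daily AS (
    SELECT date_trunc('day', o.order_date)::date AS sales_day,
           SUM(oi.quantity * oi.unit_price)      AS revenue
    FROM   orders o
    JOIN   order_items oi USING (order_id)
    GROUP  BY 1
)
SELECT sales_day,
       revenue,
       AVG(revenue) OVER (ORDER BY sales_day
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
           AS revenue_7d_ma
FROM   daily
ORDER  BY sales_day;
```

One caveat: `ROWS` counts rows, not calendar days. If some days have zero orders, densify the date axis first (e.g., with `generate_series`) so the frame really spans seven days.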
8. Summary: Assessment & Final Certification
To receive your "Lead Database Architect" certification for this project, your code must satisfy:
- Normalization Fidelity: No PII data should be duplicated across tables.
- Indexing Precision: Every query in your dashboard MUST result in an Index Scan or Index Only Scan. No Sequential Scans are allowed on tables over 10,000 rows.
- Concurrency Safety: Your checkout procedure must be immune to "Race Conditions" where two users buy the same final item.
- Hardware Awareness: You must be able to explain exactly why you chose `JSONB` over `JSON` and UUID v7 over `SERIAL`.
This project is the ultimate proof of your SQL competence. By building this system from scratch, you have proven that you can handle the scale and complexity of a modern digital economy. You have graduated from "Fetching data" to "Architecting Relevance."
9. Deliverables Checklist
- A complete `.sql` schema with UUID v7 primary keys and proper foreign key constraints.
- A stored procedure for atomic checkout with `FOR UPDATE` locking and `ROLLBACK` logic.
- A "Growth Report" SQL script utilizing LTV, retention cohorts, and moving averages.
- An `EXPLAIN (ANALYZE, BUFFERS)` log showing sub-1ms performance on indexed queries.
- A Materialized View implementation for heavy aggregate caching.
Part of the SQL Mastery Course — project-based excellence.
