
SQL CLI Project: Building a Database Explorer

TopicTrick Team

SQL Project: Building a High-Performance Database Explorer

Welcome to the Phase 1 Capstone of the SQL Masterclass. You have normalized schemas, mastered complex joins, and explored aggregations. Now, you will build the "Flashlight."

This 1,500+ word project guide is designed as a technical assignment. You are a Tooling Engineer at "InfraGen," a company that manages 10,000 database clusters. Your mission is to build a CLI utility that helps developers understand their schemas, detect "Bloat," and identify missing indexes without using a slow, heavy GUI.


1. Project Goal: The Observability Architectural Standard

Most developers treat the database as a "Black Box" where they send a query and get a row. As an architect, you must see the Machine Behind the Mirror.

Your goal is to build a tool that answers three fundamental questions for any database:

  1. Structure: What tables exist and how are they linked?
  2. Health: Which tables have too many "Dead Rows" and require maintenance?
  3. Performance: Which queries are consuming the most CPU time across the entire cluster?

The Technology Choice: Go or Zig?

For this project, we recommend building the companion code in Go (for its standard database/sql library and ease of cross-compilation) or Zig (for its bare-metal efficiency and C-level control over memory). Since these tools are used in production, they must be fast, lightweight, and have zero external dependencies.
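If you choose Go, the entry point is the standard database/sql package plus a driver registered for its side effects. The helper below is a minimal sketch: the function name, driver name, and pool settings are assumptions for illustration, not part of the assignment.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// OpenExplorer opens a pooled handle via database/sql. The driver name is
// whatever driver you have registered (e.g. a Postgres driver imported for
// its side effects); nothing in this helper is Postgres-specific.
func OpenExplorer(driver, dsn string) (*sql.DB, error) {
	db, err := sql.Open(driver, dsn)
	if err != nil {
		return nil, fmt.Errorf("open %s: %w", driver, err)
	}
	// A CLI tool runs one query at a time; keep the pool tiny.
	db.SetMaxOpenConns(2)
	db.SetConnMaxIdleTime(30 * time.Second)
	return db, nil
}
```

Note that sql.Open only validates its arguments; the first real network round-trip happens on Ping or the first query.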


2. Task 1: Interrogating the Catalog (Metadata Mirror)

Before your tool can display data, it must know what data exists.

The Information Schema Logic

The information_schema is a standard, ISO-compliant set of views that exists in almost every relational database (Postgres, SQL Server, MySQL). It is portable but can be slow.

Your first task is to write a routine that extracts all user-defined tables and their column counts.

```sql
SELECT 
    t.table_name, 
    COALESCE(tc.column_count, 0) AS column_count,
    t.table_type
FROM information_schema.tables t
LEFT JOIN (
    SELECT table_schema, table_name, count(*) AS column_count 
    FROM information_schema.columns 
    GROUP BY table_schema, table_name
) tc ON t.table_schema = tc.table_schema AND t.table_name = tc.table_name
WHERE t.table_schema = 'public' 
AND t.table_type = 'BASE TABLE'
ORDER BY t.table_name;
```

Architect's Note: We use a LEFT JOIN on a subquery here to ensure we get a single row per table. On huge databases with 1,000+ tables, this is significantly faster than running a correlated subquery in the SELECT list.

2.1 The Physics of Metadata Latency: Schema vs. Catalog

Why is querying information_schema slower than querying the internal catalog?

  • The Abstraction Mirror: information_schema is a set of complex views. To show you a simple table list, the database engine must sometimes join dozens of internal catalogs (pg_class, pg_namespace, pg_authid) just to satisfy the ISO standard.
  • The Hardware Reality: Every extra join in the information_schema views adds planning and execution overhead. When performance is critical, favor direct catalog access (pg_catalog). Your tool should support a --fast mode that hits the catalogs directly, bypassing the ISO abstraction layer.
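As a sketch of what that --fast path might run, the ISO query from Task 1 collapses to a two-table catalog join. One caveat baked into this shortcut: pg_class.relnatts counts every attribute ever added, including dropped columns, so it can slightly overstate the live column count.

```sql
-- Hypothetical --fast mode: query pg_catalog directly instead of the
-- ISO views. relkind = 'r' selects ordinary ("base") tables.
SELECT c.relname  AS table_name,
       c.relnatts AS column_count   -- note: includes dropped columns
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r'
ORDER BY c.relname;
```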

3. Task 2: Surgical Type Extraction (System Catalog Physics)

Standard tools use SELECT * FROM table LIMIT 0 to guess types. Your tool must be smarter. It must read the System Catalog (pg_attribute).

The Task

When a user runs explorer describe <table_name>, your CLI should fetch:

  • Column Name
  • Internal Data Type (including length/precision)
  • Primary Key Status
  • Default Values

The Hardware-Mirror Query

Querying pg_catalog.pg_attribute directly bypasses the "View" layer of the Information Schema, hitting the database's internal C-structs directly.

```sql
SELECT 
    a.attname AS column_name,
    format_type(a.atttypid, a.atttypmod) AS data_type,
    a.attnotnull AS is_required,
    (SELECT pg_get_expr(adbin, adrelid) 
     FROM pg_attrdef WHERE adrelid = a.attrelid AND adnum = a.attnum) AS default_value
FROM pg_attribute a
WHERE a.attrelid = 'your_table_name'::regclass
AND a.attnum > 0 
AND NOT a.attisdropped
ORDER BY a.attnum;
```

Why this is professional: The ::regclass cast resolves the table name to its internal OID up front, so the catalog lookups filter on an indexed integer key instead of comparing strings. This lets your tool "Describe" a table in microseconds.


4. Task 3: The "Bloat Hunter" (Storage Efficiency Lab)

In MVCC (Multi-Version Concurrency Control) databases like Postgres, UPDATE and DELETE commands don't actually remove data on the disk. They mark it as "Dead." If a table has too many dead rows, it is called "Bloated."

The Task

Identify tables where the "Dead Row Ratio" exceeds 20%. These tables are "Vacuum candidates" that waste disk space and slow down sequential scans.

```sql
SELECT 
    relname AS table_name,
    n_live_tup AS live_rows,
    n_dead_tup AS dead_rows,
    ROUND(100 * n_dead_tup::numeric / GREATEST(n_live_tup + n_dead_tup, 1), 2) AS bloat_percentage
FROM pg_stat_user_tables
WHERE (n_live_tup + n_dead_tup) > 1000 -- ignore tiny tables
ORDER BY bloat_percentage DESC;
```

Mission Requirement: Your CLI should output this in a list and highlight any row with >20% bloat in RED to warn the administrator.
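The RED highlighting can live entirely in the client. A minimal Go sketch, using raw ANSI escape codes (the function name and column widths are illustrative assumptions):

```go
package main

import "fmt"

const (
	ansiRed   = "\x1b[31m"
	ansiGreen = "\x1b[32m"
	ansiReset = "\x1b[0m"
)

// FormatBloatRow colors one line of the bloat report: red above the
// 20% threshold from the mission requirement, green otherwise.
func FormatBloatRow(table string, bloatPct float64) string {
	color := ansiGreen
	if bloatPct > 20 {
		color = ansiRed
	}
	return fmt.Sprintf("%s%-30s %6.2f%%%s", color, table, bloatPct, ansiReset)
}
```

In production you would also check whether stdout is a terminal before emitting escape codes, so piped output stays clean.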


5. Task 4: The "Ghost Index" Detector

Indexes are the "Speed of Sound" for data. But unused indexes waste memory, and missing indexes kill production site speed.

The Task

Build a "Missing Index Advisory." You must find tables where the count of Sequential Scans (scanning the whole table) is much higher than the count of Index Scans.

```sql
SELECT 
    relname AS table_name,
    seq_scan AS full_scans,
    COALESCE(idx_scan, 0) AS indexed_scans, -- idx_scan is NULL for tables with no indexes
    n_live_tup AS total_rows
FROM pg_stat_user_tables
WHERE seq_scan > 10000 
AND COALESCE(idx_scan, 0) < seq_scan
AND n_live_tup > 50000;
```
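If you prefer to fetch the raw counters and apply the heuristic client-side (so the thresholds become CLI flags), the rule is a one-liner. This is a sketch; the function name and the thresholds, which mirror the SQL in this task, are tunable guesses rather than hard rules:

```go
package main

// NeedsIndexAdvisory flags a table for the "Missing Index Advisory"
// when it is large, is sequentially scanned very often, and its
// sequential scans outnumber its index scans.
func NeedsIndexAdvisory(seqScan, idxScan, liveRows int64) bool {
	return seqScan > 10000 && idxScan < seqScan && liveRows > 50000
}
```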

5.1 The Index Bloat (B-Tree Entropy) Mirror

It isn't just tables that get bloated; indexes suffer from entropy too.

  • The Physics: When you delete a row, the B-Tree index entry is marked as dead but remains in the page. If you have "Heavy Deletes," your index can double or triple in size while holding the same number of active entries.
  • The Observability Challenge: A bloated index is slower because the CPU must traverse more "Empty" pages in the L2 cache to find a valid record.
  • The Task Upgrade: Add an explorer index-bloat command that flags indexes whose physical size is significantly larger than the theoretical size implied by their item count. (Hint: use the pgstattuple extension if available.)
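With the pgstattuple extension installed, its pgstatindex function reports how densely a B-Tree's leaf pages are packed. A sketch of the check (the index name is a placeholder):

```sql
-- Requires: CREATE EXTENSION pgstattuple;  B-Tree indexes only.
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('your_index_name');
```

An avg_leaf_density far below the index's fillfactor (90 by default for B-Trees) is a reasonable signal that a REINDEX would reclaim space.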

6. Task 5: The "Top 10" Performance Monitor

To finish your tool, you must integrate with the pg_stat_statements extension, which records cumulative statistics for every query the server has executed. This allows you to see the "Heaviest" queries on the server.

The Task

List the top 10 queries sorted by total execution time.

```sql
SELECT 
    query,
    calls,
    total_exec_time / 1000 / 60 AS total_minutes, -- PG 13+; use total_time on older versions
    mean_exec_time AS average_ms,                 -- PG 13+; use mean_time on older versions
    rows / calls AS avg_rows_per_call
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Formatting Challenge: Raw SQL strings can be thousands of characters long. In your CLI (Go/Zig), you must implement a "Truncation Filter" that extracts the first 80 characters of the query and appends ... for display in the terminal.
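The truncation filter described above fits in a few lines of Go. Slicing runes rather than bytes is the one subtlety: it avoids cutting a multi-byte character in half. (The function name is an assumption of this sketch.)

```go
package main

// TruncateQuery keeps the first maxLen runes of a (possibly huge)
// SQL string and appends "..." when anything was cut, so long
// queries stay readable in a terminal table.
func TruncateQuery(query string, maxLen int) string {
	runes := []rune(query)
	if len(runes) <= maxLen {
		return query
	}
	return string(runes[:maxLen]) + "..."
}
```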


7. Go Implementation Snippet: The Connector

To help you get started, here is how you might architect the "Describe Table" functionality in Go.

```go
package main

import (
	"database/sql"
	"fmt"
)

// DescribeTable prints the column layout of a table. The query is the
// Task 2 SQL, rewritten to bind the name: "a.attrelid = $1::regclass".
func DescribeTable(db *sql.DB, tableName string) error {
	query := `...[SQL from Task 2, using $1::regclass]...`
	rows, err := db.Query(query, tableName)
	if err != nil {
		return fmt.Errorf("describing table %s: %w", tableName, err)
	}
	defer rows.Close()

	fmt.Printf("Columns for %s:\n", tableName)
	fmt.Println("-------------------------------------------------")
	for rows.Next() {
		var name, dtype, defaultVal sql.NullString
		var required sql.NullBool // attnotnull is a boolean
		if err := rows.Scan(&name, &dtype, &required, &defaultVal); err != nil {
			return err
		}
		fmt.Printf("%-20s | %-15s | not null: %v\n", name.String, dtype.String, required.Bool)
	}
	return rows.Err() // errors hit mid-iteration surface here
}
```

7.1 Portability Paradox: Supporting MySQL and Postgres

A truly professional CLI explorer should not be a "One-Trick Pony."

  • The Architecture: Use an Interface-based Pattern (in Go) or a Trait-based Pattern (in Zig).
  • The Implementation: Create a DatabaseDriver interface that defines methods like GetTables() and DescribeTable().
  • The Dialect Mirror: Your tool should detect the connection string (e.g., postgres:// vs mysql://) and automatically swap the SQL logic. MySQL users need to hit SHOW FULL COLUMNS, while Postgres users hit pg_catalog. This separation of "Logic" from "Dialect" is the mark of a Senior Infrastructure Architect.
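Connection-string detection is just a prefix check on the URL scheme. A minimal sketch (the function name is an assumption, and the scheme list covers common conventions, not an exhaustive registry):

```go
package main

import "strings"

// DetectDialect picks the SQL dialect from the connection string's
// scheme, so the driver layer can swap in SHOW FULL COLUMNS for
// MySQL or pg_catalog queries for Postgres.
func DetectDialect(dsn string) string {
	switch {
	case strings.HasPrefix(dsn, "postgres://"), strings.HasPrefix(dsn, "postgresql://"):
		return "postgres"
	case strings.HasPrefix(dsn, "mysql://"):
		return "mysql"
	default:
		return "unknown"
	}
}
```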

8. Project Success Criteria: The Architect's Standard

To pass the "Database Explorer" lab, your tool must meet these engineering requirements:

  1. Extreme Performance: The tool must launch, connect, and query in under 100ms. Tooling that feels slow will not be used by developers.
  2. Zero-Write Protocol: Implement a --read-only flag (the default) that prevents the execution of any DROP, DELETE, or TRUNCATE commands.
  3. Connectivity Diagnostics: If the connection fails, don't just "Crash." Provide actionable feedback: "Error: Host unreachable. Check if the database instance is running on port 5432."
  4. UI Fidelity: Use ASCII tables for column outputs and ANSI color codes (Green for Healthy, Red for Bloated) to make the data glanceable.
  5. Streaming Exports: Implement an export <table_name> command that streams data out of the database as CSV. Requirement: It must use a Buffered Writer to avoid loading the entire dataset into RAM, allowing for 10GB exports on a 500MB RAM machine.

This project transitions you from a "Consumer of Data" to an "Infrastructure Architect." By mastering the movement of metadata and the physics of system observability, you gain the power to build tools that define how the rest of your engineering team interacts with the database.


Phase 6: Project Deliverables

  • A functional CLI binary (Go, Zig, or similar) that can "Describe" any table in a cluster.
  • An "Audit" command that outputs the Bloat Ratio for all user tables.
  • An "Advisory" engine that identifies at least one table with high sequential scan volume.
  • Advanced Goal: Implement a stream-export command using buffered I/O for large-scale data migration.
  • Full documentation on how to connect your tool to a production environment.

Read next: SQL Project: Financial Reporting Dashboard (Phase 3) →

Frequently Asked Questions

Q: What are the most useful psql meta-commands for exploring a PostgreSQL database?

\l lists all databases. \c dbname connects to a database. \dt lists tables in the current schema. \d tablename describes a table's columns, types, and constraints. \di lists indexes. \df lists functions. \dv lists views. \dn lists schemas. \timing toggles query execution time display. \e opens the last query in your default editor. \i filename.sql executes a SQL file. These meta-commands are only available in psql — they are shortcuts that psql translates into system catalog queries, equivalent to querying information_schema or pg_catalog tables directly.

Q: How do I export query results to a CSV file from the command line?

In psql: \copy (SELECT * FROM table) TO '/path/to/file.csv' WITH CSV HEADER — \copy runs client-side and works without superuser privileges. The server-side COPY command (without backslash) writes to a server path and requires superuser. From the shell: psql -c "SELECT * FROM table" -A -F ',' -o output.csv using psql's own formatting. For large exports, piping through COPY is faster than fetching through the client. For scheduled exports or ETL, pg_dump handles full tables and psql -f query.sql handles scripted exports.

Q: What is information_schema and how do I use it to explore a database's structure?

information_schema is a standardised, read-only set of views defined by the SQL standard that exposes metadata about the database — tables, columns, constraints, views, and privileges. SELECT * FROM information_schema.tables WHERE table_schema = 'public' lists all user tables. SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'orders' describes a table's columns. SELECT * FROM information_schema.table_constraints lists constraints. It is more portable than database-specific catalogs (pg_catalog in PostgreSQL, INFORMATION_SCHEMA in SQL Server) but may expose less detail. For PostgreSQL-specific metadata, pg_catalog is more comprehensive.

Part of the SQL Mastery Course — engineering the infrastructure.