Database Schema Design

Schema mistakes echo for years. A denormalized table chosen for convenience becomes a bottleneck. A missing index slows queries by 100x. Your schema is your foundation. Build it carefully.

Normalization vs Denormalization

Start normalized. Only denormalize after measuring and confirming performance problems:

-- Normalized (start here)
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  email VARCHAR(255) UNIQUE NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE posts (
  id SERIAL PRIMARY KEY,
  user_id INT NOT NULL REFERENCES users(id),
  title VARCHAR(256) NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Only denormalize if measurement shows a problem
-- For example: SELECT users.*, COUNT(posts.id) FROM users...
-- becomes slow. Then add post_count to users table.
-- But maintain it with a trigger to prevent inconsistency.

Indexing Strategy

Index what you query, not everything. Every index has a cost: slower writes, more storage. Index deliberately:

-- Primary key index (automatic)
CREATE TABLE users (
  id SERIAL PRIMARY KEY
);

-- Unique constraint index
CREATE UNIQUE INDEX idx_users_email ON users(email);

-- Foreign key index (frequent joins)
CREATE INDEX idx_posts_user_id ON posts(user_id);

-- Composite index for common queries
-- "SELECT * FROM orders WHERE user_id = ? AND created_at > ?"
CREATE INDEX idx_orders_user_date 
  ON orders(user_id, created_at DESC);

-- Measure before and after
EXPLAIN ANALYZE SELECT * FROM posts WHERE user_id = 1;

Finding Slow Queries

Enable query logging. Find problems before production:

-- PostgreSQL: Enable slow query log
ALTER SYSTEM SET log_min_duration_statement = 1000; -- 1 second

-- Then restart
SELECT pg_reload_conf();

-- Query the log
SELECT * FROM pg_stat_statements 
ORDER BY total_time DESC 
LIMIT 10;

Common Query Patterns

Optimize for your access patterns. Different patterns need different structures:

-- Pattern 1: Frequent point lookups
-- Solution: Primary key + unique constraints
SELECT * FROM users WHERE id = 1;

-- Pattern 2: Range queries
-- Solution: B-tree index on range column
SELECT * FROM posts WHERE created_at > NOW() - INTERVAL '7 days';

-- Pattern 3: Text search
-- Solution: GIN index for full-text search
CREATE INDEX idx_posts_search ON posts USING GIN(
  to_tsvector('english', title || ' ' || content)
);

-- Pattern 4: Aggregations
-- Solution: Materialized views or denormalization
CREATE MATERIALIZED VIEW user_stats AS
SELECT user_id, COUNT(*) as post_count, MAX(created_at) as last_post
FROM posts
GROUP BY user_id;

Real-World Incident

A system I worked on had a slow reporting feature. Every query took 2+ seconds. Investigation revealed the schema had timestamps stored incorrectly, queries couldn't use indexes, and all 10 million rows were scanned. The fix:

Fix timestamp type: Use TIMESTAMP, not TEXT
Add index: Index the timestamp column used in WHERE clause
Query rewrite: Use date ranges instead of string matching

Result: Query dropped from 2 seconds to 50ms. Index a single column. 2000% improvement. This is why schema design matters.