
Clean Code Python: From git init to Production Traffic

Theory without deployment is fiction. This capstone assembles all 22 prior patterns into a deployed, monitored, incident-ready multi-tenant Python backend — from Docker Compose to runbooks to your first production incident.

Tin Dang April 10, 2026 19 min read

Blueprint-style architectural diagram of a complete system with deployment pipelines connecting development to production

Theory without deployment is fiction. A clean architecture on your laptop is a hypothesis. A clean architecture handling real money in production is engineering.

This is the final post. Over the previous 22 parts, we built ShelfWise from a single main.py into a multi-tenant B2B SaaS with structured error handling, automatic tenant isolation, connection pooling, observability, caching, background tasks, rate limiting, API versioning, and performance optimization. Every pattern was designed in isolation, tested in isolation, explained in isolation.

Now we assemble them. This post is about the gaps that only appear when everything runs together: the Dockerfile that works on your machine but OOMs in CI, the migration that passes tests but deadlocks in production, the monitoring dashboard you forgot to build until the first incident.

The Complete ShelfWise Architecture

Complete Project Structure

Every module, every config file. This is what 22 posts of incremental architecture produces:

shelfwise/
├── src/
│   ├── api/
│   │   ├── v1/                         # Part 21: version translation
│   │   │   ├── __init__.py
│   │   │   ├── catalog.py
│   │   │   ├── orders.py
│   │   │   └── schemas/
│   │   │       ├── catalog.py
│   │   │       └── orders.py
│   │   ├── v2/
│   │   │   ├── __init__.py
│   │   │   ├── catalog.py
│   │   │   ├── orders.py
│   │   │   └── schemas/
│   │   │       ├── catalog.py
│   │   │       └── orders.py
│   │   ├── deps.py                     # Part 5: dependency injection
│   │   ├── router.py                   # Part 21: version mounting
│   │   └── middleware/
│   │       ├── tenant.py               # Part 9: tenant extraction
│   │       ├── rate_limit.py           # Part 15: per-tenant rate limiting
│   │       ├── query_count.py          # Part 22: N+1 detection
│   │       ├── deprecation.py          # Part 21: deprecation headers
│   │       └── request_id.py           # Part 12: correlation IDs
│   ├── core/
│   │   ├── config.py                   # Part 10: tenant-aware config
│   │   ├── context.py                  # Part 9: contextvars tenant propagation
│   │   ├── tenant.py                   # Part 9: tenant dataclass
│   │   ├── security.py                 # Part 18: auth and encryption
│   │   └── serialization.py            # Part 22: orjson response class
│   ├── db/
│   │   ├── base.py                     # Part 3: declarative base + TenantMixin
│   │   ├── session.py                  # Part 11: connection pooling
│   │   ├── query_counter.py            # Part 22: query event listener
│   │   └── events.py                   # Part 9: tenant session events
│   ├── models/
│   │   ├── book.py                     # Part 3: SQLAlchemy models
│   │   ├── author.py
│   │   ├── order.py
│   │   ├── tenant.py                   # Part 9: tenant model
│   │   └── audit_log.py               # Part 18: audit trail
│   ├── repositories/
│   │   ├── base.py                     # Part 3: generic repository protocol
│   │   ├── book_repository.py
│   │   ├── order_repository.py
│   │   └── tenant_repository.py
│   ├── services/
│   │   ├── book_service.py             # Part 4: service layer
│   │   ├── order_service.py
│   │   ├── tenant_service.py           # Part 9: tenant lifecycle
│   │   └── notification_service.py     # Part 20: event-driven notifications
│   ├── schemas/
│   │   ├── book.py                     # Part 2: Pydantic schemas with protocols
│   │   ├── order.py
│   │   └── tenant.py
│   ├── errors/
│   │   ├── base.py                     # Part 6: error hierarchy
│   │   ├── handlers.py                 # Part 6: FastAPI exception handlers
│   │   └── codes.py                    # Part 6: error code registry
│   ├── tasks/
│   │   ├── worker.py                   # Part 14: ARQ worker setup
│   │   ├── email_tasks.py             # Part 14: async email
│   │   ├── webhook_tasks.py           # Part 14: webhook delivery + retry
│   │   └── cleanup_tasks.py           # Part 14: tenant data cleanup
│   ├── cache/
│   │   ├── manager.py                  # Part 13: cache manager
│   │   └── invalidation.py            # Part 13: tenant-scoped invalidation
│   ├── events/
│   │   ├── bus.py                      # Part 20: event bus
│   │   ├── handlers.py                # Part 20: event handlers
│   │   └── schemas.py                 # Part 20: event schemas
│   ├── health/
│   │   └── checks.py                  # Part 19: health check endpoints
│   ├── observability/
│   │   ├── logging.py                  # Part 12: structlog configuration
│   │   ├── tracing.py                 # Part 12: OpenTelemetry setup
│   │   └── metrics.py                 # Part 12: Prometheus metrics
│   └── main.py                        # Application entrypoint
├── alembic/
│   ├── versions/                       # Part 17: migration management
│   │   ├── 001_initial.py
│   │   ├── 002_add_tenant.py
│   │   └── ...
│   └── env.py
├── alembic.ini                         # At the repo root, where the Dockerfile copies it from
├── tests/
│   ├── unit/                           # Part 8: unit tests
│   │   ├── test_book_service.py
│   │   ├── test_order_service.py
│   │   └── test_version_translation.py
│   ├── integration/                    # Part 8: integration tests
│   │   ├── test_tenant_isolation.py
│   │   └── test_order_flow.py
│   ├── contract/                       # Part 21: schemathesis contract tests
│   │   └── test_api_contract.py
│   ├── load/                           # Part 22: locust load tests
│   │   └── locustfile.py
│   ├── performance/                    # Part 22: p99 regression tests
│   │   ├── baseline.json
│   │   └── test_p99_regression.py
│   └── conftest.py
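├── scripts/
│   ├── provision_tenant.py            # Tenant onboarding automation
│   ├── offboard_tenant.py             # Tenant offboarding
│   ├── seed_dev_data.py               # Development data seeding
│   ├── deploy.sh                      # Deployment entrypoint called by CI
│   └── smoke_test.sh                  # Post-deploy smoke tests called by CI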
├── runbooks/
│   ├── db_slow.md                     # "DB is slow" runbook
│   ├── bad_deploy.md                  # "Bad deploy" runbook
│   ├── tenant_data_issue.md           # "Tenant reports data issue" runbook
│   └── tenant_deletion.md            # "Tenant deletion" runbook
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml
├── uv.lock
├── .github/
│   └── workflows/
│       ├── ci.yml                     # Full CI pipeline
│       └── load-test.yml             # Scheduled load tests
└── .env.example

Docker Compose: The Local Stack

Development should mirror production topology. Docker Compose gives every developer a full ShelfWise environment with one command.

services:
  api:
    build:
      context: .
      target: runtime
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql+asyncpg://shelfwise:shelfwise@postgres:5432/shelfwise
      - REDIS_URL=redis://redis:6379/0
      - LOG_LEVEL=debug
      - ENVIRONMENT=development
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    volumes:
      - ./src:/app/src  # Hot reload in development
    command: uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload

  worker:
    build:
      context: .
      target: runtime
    environment:
      - DATABASE_URL=postgresql+asyncpg://shelfwise:shelfwise@postgres:5432/shelfwise
      - REDIS_URL=redis://redis:6379/0
      - LOG_LEVEL=debug
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    command: arq src.tasks.worker.WorkerSettings

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: shelfwise
      POSTGRES_PASSWORD: shelfwise
      POSTGRES_DB: shelfwise
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U shelfwise"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  mailhog:
    image: mailhog/mailhog
    ports:
      - "1025:1025"  # SMTP
      - "8025:8025"  # Web UI

volumes:
  pgdata:

The Dockerfile: Multi-Stage, Minimal, Secure

Every layer of this Dockerfile serves a purpose. The build stage installs dependencies. The runtime stage copies only what is needed. The result is a small, secure image with no build tools, no dev dependencies, and no root access.

# Dockerfile
# === Build stage: install dependencies ===
FROM python:3.12-slim AS builder

RUN pip install --no-cache-dir uv

WORKDIR /app
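COPY pyproject.toml uv.lock ./
# Install only the locked dependencies; the source tree is not in this stage,
# so skip installing the project itself
RUN uv sync --frozen --no-dev --no-install-project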

# === Runtime stage: minimal image ===
FROM python:3.12-slim AS runtime

# Security: non-root user
RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid 1000 --shell /bin/bash app

WORKDIR /app

# Copy only the virtual environment and source code
COPY --from=builder /app/.venv /app/.venv
COPY src/ src/
COPY alembic/ alembic/
COPY alembic.ini .

# Use the virtual environment
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

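# Health check against the Part 19 health endpoint.
# Uses only the standard library, so it works even if httpx is a dev-only dependency.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)"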

USER app

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

The CI/CD Pipeline

The pipeline is the assembly line where every pattern from this series is verified in sequence. A failure at any stage stops the deployment.

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-and-type-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run ruff check .
      - run: uv run ruff format --check .
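      # pyright has no --strict CLI flag; enable strict mode with
      # typeCheckingMode = "strict" under [tool.pyright] in pyproject.toml
      - run: uv run pyright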

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run pytest tests/unit/ --cov=src --cov-fail-under=80

  integration-tests:
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run pytest tests/integration/ -x -v
        env:
          DATABASE_URL: postgresql+asyncpg://test:test@localhost:5432/test
          REDIS_URL: redis://localhost:6379/0

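  contract-tests:
    runs-on: ubuntu-latest
    needs: integration-tests
    # The app cannot start without its required settings and backing services
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
    env:
      DATABASE_URL: postgresql+asyncpg://test:test@localhost:5432/test
      REDIS_URL: redis://localhost:6379/0
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: |
          uv run uvicorn src.main:app --host 0.0.0.0 --port 8000 &
          sleep 5
          uv run schemathesis run http://localhost:8000/openapi.json --checks all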

  deploy-staging:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests, contract-tests]
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t shelfwise:${{ github.sha }} .
      - run: ./scripts/deploy.sh staging ${{ github.sha }}

  smoke-tests:
    runs-on: ubuntu-latest
    needs: deploy-staging
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/smoke_test.sh https://staging.shelfwise.io

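  deploy-production:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    needs: smoke-tests
    environment: production
    steps:
      # Each job starts on a fresh runner, so the deploy script must be checked out again
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh production ${{ github.sha }}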
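
The contract-test job waits a fixed `sleep 5` before running schemathesis, which can be flaky on a slow runner. One alternative is to poll the health endpoint until the server answers. The following is a minimal sketch under that assumption; `wait_for_http` and its parameters are hypothetical names for illustration, not part of ShelfWise.

```python
"""Poll an HTTP endpoint until it answers 2xx or a deadline passes (illustrative sketch)."""
import http.client
import time
import urllib.request


def wait_for_http(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once `url` answers with a 2xx status, False if `timeout` elapses first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval + 1.0) as response:
                if 200 <= response.status < 300:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or an HTTP error status: not ready yet.
            # urllib's URLError and HTTPError are both subclasses of OSError.
            pass
        time.sleep(interval)
    return False


if __name__ == "__main__":
    # Exit non-zero so a CI step fails when the server never becomes ready.
    ready = wait_for_http("http://localhost:8000/health")
    raise SystemExit(0 if ready else 1)
```

A script like this could replace the `sleep 5` line: the job proceeds as soon as the app is up and fails clearly when it never starts.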

Environment Management

Staging must mirror production configuration with reduced resources. The differences should be minimal and explicit.

from pydantic import SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Configuration from environment variables — Part 10."""
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
    )

    # Core
    environment: str = "development"  # development | staging | production
    debug: bool = False
    log_level: str = "info"

    # Database — Part 11 connection pooling
    database_url: SecretStr
    db_pool_min: int = 5
    db_pool_max: int = 20

    # Redis — Part 13 caching, Part 14 task queue
    redis_url: str = "redis://localhost:6379/0"

    # Security — Part 18
    jwt_secret: SecretStr = SecretStr("change-me-in-production")
    encryption_key: SecretStr = SecretStr("change-me-in-production")

    @property
    def is_production(self) -> bool:
        return self.environment == "production"

Setting	Staging	Production
DB pool size	5-10 connections	20-50 connections
API replicas	1	2-4 (auto-scaled)
Worker replicas	1	2-3
Log level	debug	info
Rate limits	Relaxed (10x headroom)	Enforced per plan tier
Cache TTL	60 seconds (catch stale issues)	300 seconds
Health check interval	30 seconds	10 seconds
Alert routing	Slack #staging-alerts	PagerDuty on-call rotation
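
The Settings class above ships placeholder defaults for jwt_secret and encryption_key, which is convenient in development but dangerous if they ever reach production. A startup guard can refuse to boot in that case. This is a minimal standard-library sketch; `SecretCheck`, `find_placeholder_secrets`, and `assert_production_ready` are hypothetical names, and the placeholder string is copied from the Settings defaults.

```python
"""Refuse to boot in production with placeholder secrets (illustrative sketch)."""
from dataclasses import dataclass

PLACEHOLDER = "change-me-in-production"  # Default value used in the Settings class


@dataclass(frozen=True)
class SecretCheck:
    """A secret's setting name paired with its current plain-text value."""

    name: str
    value: str


def find_placeholder_secrets(environment: str, secrets: list[SecretCheck]) -> list[str]:
    """Return names of secrets still set to the placeholder; always empty outside production."""
    if environment != "production":
        return []
    return [secret.name for secret in secrets if secret.value == PLACEHOLDER]


def assert_production_ready(environment: str, secrets: list[SecretCheck]) -> None:
    """Raise RuntimeError naming every placeholder secret found in production."""
    offenders = find_placeholder_secrets(environment, secrets)
    if offenders:
        raise RuntimeError(f"Placeholder secrets in production: {', '.join(offenders)}")
```

With pydantic's SecretStr, the plain value comes from `get_secret_value()`, so the call site would look roughly like `SecretCheck("jwt_secret", settings.jwt_secret.get_secret_value())`.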

Production Readiness Checklist

This is every pattern from every post, verified as a pre-launch gate. Do not ship without greens across the board.

Category	Requirement	Part	Verification
Structure	Domain modules under 400 lines	1	wc -l on every .py file
Types	100% public API type coverage	2	pyright in strict mode, zero errors
Data	Repository pattern, no raw SQL in services	3	Code review
Logic	Service layer with no framework imports	4	grep for fastapi in services/
DI	All dependencies injected, no global state	5	No module-level mutable state
Errors	Structured error hierarchy, no bare except	6	ruff rules E722 and BLE001
Async	No blocking I/O in async functions	7	ruff ASYNC rules, code review
Tests	80%+ coverage, isolation test passes	8-9	pytest --cov-fail-under=80
Tenancy	Automatic tenant filtering on every query	9	Integration test with two tenants
Config	No secrets in code, env-based config	10	bandit, no .env in git
Pooling	Connection pool sized and monitored	11	Pool metrics in Grafana
Observability	Structured logs, traces, metrics	12	Dashboard exists, alerts configured
Cache	Tenant-scoped, invalidation tested	13	Cache hit rate metric
Tasks	Background tasks with retry and DLQ	14	Worker metrics in Grafana
Rate limits	Per-tenant, per-plan enforcement	15	Load test verifies limits
Consistency	Idempotency keys on mutations	16	Contract test with duplicate requests
Migrations	Reversible, zero-downtime	17	Alembic downgrade tested
Security	JWT auth, audit logging, encryption at rest	18	bandit -ll, pentest report
Health	Liveness + readiness + dependency checks	19	k8s/ECS probes configured
Events	Event bus for cross-domain side effects	20	Event handler tests pass
Versioning	Deprecated v1 with sunset headers	21	curl check for Deprecation header
Performance	p99 < 200ms, no N+1, load tested	22	Locust report, query counter < 10
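
Some rows in this checklist can be verified by a script instead of by hand. The sketch below covers two of them, the 400-line module limit and the "no framework imports in services" rule, using only the standard library. The function names are hypothetical, and parsing imports with `ast` is one possible stand-in for the `grep` named in the table.

```python
"""Automate two checklist rows: module size and framework-free services (sketch)."""
import ast
from pathlib import Path


def oversized_modules(root: Path, limit: int = 400) -> list[tuple[str, int]]:
    """Return (relative path, line count) for every .py file longer than `limit` lines."""
    results = []
    for path in sorted(root.rglob("*.py")):
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > limit:
            results.append((path.relative_to(root).as_posix(), line_count))
    return results


def framework_imports(services_dir: Path, banned: str = "fastapi") -> list[str]:
    """Return relative paths of modules that import `banned` or one of its submodules."""
    offenders = []
    for path in sorted(services_dir.rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            if any(name == banned or name.startswith(banned + ".") for name in names):
                offenders.append(path.relative_to(services_dir).as_posix())
                break  # One hit per file is enough
    return offenders
```

Unlike a plain text search, the `ast` version ignores the word "fastapi" in comments and docstrings and reports only real import statements.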

Tenant Provisioning Automation

Onboarding a new tenant is a sequence of steps that must succeed atomically. If step 4 fails, steps 1-3 must be rolled back. Automate the entire flow.

"""Automated tenant provisioning pipeline."""
import asyncio
from uuid import UUID

import structlog

from src.core.context import system_context
from src.db.session import get_session
from src.services.tenant_service import TenantService

logger = structlog.get_logger()


class TenantProvisioningError(Exception):
    """Raised when any transactional provisioning step fails."""


async def provision_tenant(
    slug: str,
    name: str,
    plan: str,
    admin_email: str,
) -> UUID:
    """Provision a new tenant end-to-end.

    Steps:
    1. Create tenant record in database
    2. Seed default data (categories, roles, permissions)
    3. Create the admin user
    4. Commit the transaction
    5. Warm caches for the new tenant
    6. Send welcome email to admin (via background task)

    Steps 1-3 share one transaction, so a failure in any of them rolls
    back the others. Steps 5-6 are side effects outside the database and
    run only after the commit, so a worker never sees an uncommitted tenant.
    Schema-per-tenant migrations and the operations-channel notification
    are not shown here.

    Raises TenantProvisioningError if the transactional steps fail.
    """
    async with system_context(reason="provisioning", operator="system"):
        async with get_session() as session:
            tenant_service = TenantService(session)
            try:
                # Step 1: Create tenant record
                tenant = await tenant_service.create_tenant(
                    slug=slug,
                    name=name,
                    plan=plan,
                )
                logger.info(
                    "tenant_created",
                    tenant_id=str(tenant.id),
                    slug=slug,
                )

                # Step 2: Seed default data
                await tenant_service.seed_defaults(tenant.id)
                logger.info("tenant_defaults_seeded", tenant_id=str(tenant.id))

                # Step 3: Create admin user
                await tenant_service.create_admin_user(
                    tenant_id=tenant.id,
                    email=admin_email,
                )

                # Step 4: Commit steps 1-3 atomically
                await session.commit()
            except Exception as exc:
                # Undo every database change made so far, then surface one error type
                await session.rollback()
                logger.error("tenant_provisioning_failed", slug=slug, error=str(exc))
                raise TenantProvisioningError(
                    f"Provisioning failed for tenant {slug!r}"
                ) from exc

            # Step 5: Warm caches (after commit, so cached data is real)
            await tenant_service.warm_caches(tenant.id)

            # Step 6: Send welcome email (via background task)
            await tenant_service.enqueue_welcome_email(
                tenant_id=tenant.id,
                admin_email=admin_email,
            )

            logger.info(
                "tenant_provisioned",
                tenant_id=str(tenant.id),
                slug=slug,
                plan=plan,
            )
            return tenant.id


if __name__ == "__main__":
    import sys

    tenant_id = asyncio.run(
        provision_tenant(
            slug=sys.argv[1],
            name=sys.argv[2],
            plan=sys.argv[3],
            admin_email=sys.argv[4],
        )
    )
    print(f"Provisioned tenant: {tenant_id}")
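
A database transaction covers rollback for the steps that write to PostgreSQL, but side effects such as cache entries or queued jobs live outside it. One common way to handle those is to pair each step with an undo action and run the undos in reverse order on failure. This is a minimal sketch of that idea, not ShelfWise's implementation; `Step` and `run_with_compensation` are hypothetical names.

```python
"""Run provisioning steps in order; undo completed ones in reverse on failure (sketch)."""
import asyncio
from collections.abc import Awaitable, Callable
from dataclasses import dataclass

AsyncAction = Callable[[], Awaitable[None]]


@dataclass(frozen=True)
class Step:
    """One provisioning step with the action that reverses it."""

    name: str
    run: AsyncAction
    undo: AsyncAction


async def run_with_compensation(steps: list[Step]) -> None:
    """Execute steps in order. If one fails, undo the completed steps in reverse, then re-raise."""
    completed: list[Step] = []
    try:
        for step in steps:
            await step.run()
            completed.append(step)
    except Exception:
        # The failed step itself is not undone: it never completed.
        for step in reversed(completed):
            await step.undo()
        raise


if __name__ == "__main__":
    async def noop() -> None:
        """Placeholder action for the demo."""

    asyncio.run(run_with_compensation([Step("noop", noop, noop)]))
```

Undo actions should themselves be idempotent, because a crash in the middle of compensation means they may be retried.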

Tenant Offboarding

Offboarding is the reverse, with an important addition: you must preserve audit logs for compliance and export tenant data before deletion.

"""Tenant offboarding pipeline: suspend → export → delete → archive."""
from uuid import UUID

import structlog

logger = structlog.get_logger()


async def offboard_tenant(tenant_id: UUID, reason: str) -> None:
    """Offboard a tenant with full data lifecycle.

    This is a multi-day process, not a single operation:
    Day 0:  Suspend tenant (API returns 403, data preserved)
    Day 1:  Export tenant data to S3 (GDPR data portability)
    Day 30: Delete tenant data from database
    Day 30: Archive audit logs to cold storage (7-year retention)
    """
    # Phase 1: Immediate suspension
    await _suspend_tenant(tenant_id, reason)
    logger.info("tenant_suspended", tenant_id=str(tenant_id), reason=reason)

    # Phase 2: Schedule data export (async task)
    await _enqueue_data_export(tenant_id)
    logger.info("tenant_export_scheduled", tenant_id=str(tenant_id))

    # Phase 3: Schedule deletion (30-day grace period)
    await _schedule_deletion(tenant_id, grace_days=30)
    logger.info(
        "tenant_deletion_scheduled",
        tenant_id=str(tenant_id),
        grace_days=30,
    )


async def _suspend_tenant(tenant_id: UUID, reason: str) -> None:
    """Set tenant status to suspended. All API calls return 403."""
    ...


async def _enqueue_data_export(tenant_id: UUID) -> None:
    """Export all tenant data to S3 for data portability."""
    ...


async def _schedule_deletion(tenant_id: UUID, grace_days: int) -> None:
    """Schedule hard deletion after grace period."""
    ...
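
The multi-day timeline in the docstring can be derived from a single suspension timestamp, which keeps the dates consistent across the export and deletion jobs. A minimal sketch, assuming the day offsets listed in the docstring; `offboarding_schedule` and the phase keys are hypothetical names.

```python
"""Compute the offboarding timeline from the suspension timestamp (illustrative sketch)."""
from datetime import datetime, timedelta, timezone


def offboarding_schedule(suspended_at: datetime, grace_days: int = 30) -> dict[str, datetime]:
    """Map each phase to its due time: export on day 1, delete and archive after the grace period."""
    if grace_days < 1:
        raise ValueError("grace_days must leave at least one day for the export")
    deletion_due = suspended_at + timedelta(days=grace_days)
    return {
        "suspend": suspended_at,
        "export": suspended_at + timedelta(days=1),
        "delete": deletion_due,
        "archive_audit_logs": deletion_due,
    }


# Example: a tenant suspended on 2026-04-10 (UTC)
plan = offboarding_schedule(datetime(2026, 4, 10, tzinfo=timezone.utc))
```

Using timezone-aware timestamps avoids off-by-one-day surprises when the scheduler and the database run in different time zones.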

Runbooks

Runbooks are not documentation — they are decision trees for 3am. Each one starts with a symptom, confirms the diagnosis, and prescribes specific commands.

Runbook: “Database Is Slow”

1. Check connection pool saturation. Grafana panel: db_pool_checked_out / db_pool_size. If above 80%, the pool is the bottleneck. Increase db_pool_max and restart (Part 11), after confirming that PostgreSQL's max_connections can absorb the larger pool across all replicas.
2. Check for long-running queries. SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10; Terminate anything over 30 seconds by passing its pid: SELECT pg_terminate_backend(<pid>);
3. Check for lock contention. SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocking.pid AS blocking_pid, blocking.query AS blocking_query FROM pg_stat_activity blocked JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
4. Check for missing indexes. SELECT schemaname, relname, seq_scan, idx_scan FROM pg_stat_user_tables WHERE seq_scan > COALESCE(idx_scan, 0) ORDER BY seq_scan - COALESCE(idx_scan, 0) DESC LIMIT 10; Tables with more sequential scans than index scans are candidates for new or composite indexes (Part 22).

Runbook: “Bad Deploy”

1. Confirm the bad deploy. Check error rate spike in Grafana. If error rate > 5%, proceed to rollback.
2. Rollback immediately. ./scripts/deploy.sh production <previous-sha>. Do not debug in production. Rollback first, investigate second.
3. Verify rollback. Error rate should return to baseline within 2 minutes. If not, the problem is not the deploy: check database, Redis, external dependencies.
4. Post-rollback. Create an incident ticket. Identify the failing commit. Write a regression test that would have caught it. Deploy the fix through the full CI pipeline.

Runbook: “Tenant Reports Data Issue”

1. Identify the tenant. Get tenant_id from the support ticket or subdomain.
2. Pull recent traces. Filter Loki logs by tenant_id and the reported timeframe (Part 12). Look for error logs, failed transactions, unexpected query results.
3. Check tenant isolation. Run the isolation integration test against production read-replica. If it fails, this is a SEV-1: the tenant data boundary has been breached. Escalate immediately.
4. Check recent migrations. Did a migration run in the reported timeframe? Check Alembic history. Look for schema changes that could affect query results (Part 17).
5. Reproduce with tenant data. Use the system context (Part 9) to query the tenant’s data in a staging environment. Never modify production data directly.
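
When a log export is easier to work with than the Loki UI, the same filter (tenant plus timeframe) can be applied to JSON log lines offline. The sketch below assumes each line is a JSON object with `tenant_id` and an ISO 8601 `timestamp` field carrying a UTC offset; those field names and the function name are assumptions, not the confirmed ShelfWise log schema.

```python
"""Filter JSON log lines by tenant and time window (illustrative sketch)."""
import json
from collections.abc import Iterable, Iterator
from datetime import datetime, timezone


def filter_tenant_logs(
    lines: Iterable[str],
    tenant_id: str,
    start: datetime,
    end: datetime,
) -> Iterator[dict]:
    """Yield parsed events for `tenant_id` whose timestamp falls within [start, end].

    `start` and `end` must be timezone-aware, matching the offsets in the logs.
    """
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip non-JSON lines such as stack trace fragments
        if not isinstance(event, dict) or event.get("tenant_id") != tenant_id:
            continue
        try:
            logged_at = datetime.fromisoformat(event["timestamp"])
        except (KeyError, TypeError, ValueError):
            continue  # Missing or malformed timestamp
        if logged_at.tzinfo is not None and start <= logged_at <= end:
            yield event


# Example window: the two hours around a reported problem
window_start = datetime(2026, 4, 10, 9, 0, tzinfo=timezone.utc)
window_end = datetime(2026, 4, 10, 11, 0, tzinfo=timezone.utc)
```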

Incident Response

Severity Levels

Level	Definition	Response Time	Example
SEV-1	Data breach, total outage, data corruption	15 minutes	Cross-tenant data leakage
SEV-2	Major feature broken, one tenant fully impacted	1 hour	Payment processing down for one tenant
SEV-3	Degraded performance, partial impact	4 hours	p99 above SLO, cache miss rate high
SEV-4	Minor issue, workaround available	Next business day	Incorrect sort order on one endpoint
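
Encoding the response times as data lets tooling compute a concrete deadline when an incident is opened. A minimal sketch of the table above; the names are hypothetical, and reading "next business day" as "the same time on the next weekday, ignoring public holidays" is an assumption.

```python
"""Compute the response deadline for an incident from its severity (illustrative sketch)."""
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"
    SEV4 = "SEV-4"


# Fixed response windows from the severity table; SEV-4 is handled separately
RESPONSE_TIME = {
    Severity.SEV1: timedelta(minutes=15),
    Severity.SEV2: timedelta(hours=1),
    Severity.SEV3: timedelta(hours=4),
}


def response_deadline(severity: Severity, detected_at: datetime) -> datetime:
    """Return when a responder must have engaged with the incident."""
    if severity in RESPONSE_TIME:
        return detected_at + RESPONSE_TIME[severity]
    # SEV-4: same time on the next weekday (Saturday=5 and Sunday=6 are skipped)
    deadline = detected_at + timedelta(days=1)
    while deadline.weekday() >= 5:
        deadline += timedelta(days=1)
    return deadline
```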

Postmortem Template

Every SEV-1 and SEV-2 gets a postmortem within 48 hours. The postmortem is blameless — it focuses on system failures, not human failures.

## Incident: [Title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-N
**Impact:** [Number of tenants affected, revenue impact, data impact]

## Timeline
- HH:MM — Alert fired: [alert name]
- HH:MM — On-call acknowledged
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Monitoring confirmed recovery

## Root Cause
[One paragraph: what broke and why]

## Detection
How was this detected? Alert? Tenant report? Manual observation?
Detection gap: [What should have caught this earlier?]

## Resolution
What was done to fix it? Rollback? Hotfix? Configuration change?

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, actionable item] | [Name] | [Date] | Open |

## Lessons Learned
- What went well?
- What went poorly?
- Where did we get lucky?

First-Day Production: Monitoring Dashboard

Before a single tenant sends a request, your Grafana dashboard must have these panels:

Request metrics: Request rate by endpoint, error rate (4xx, 5xx) by endpoint, p50/p95/p99 latency by endpoint. These are Part 12 Prometheus metrics visualized.

Database metrics: Connection pool utilization, query duration p99, queries per request (Part 22 counter), active connections by tenant.

Tenant metrics: Request rate by tenant, error rate by tenant, cache hit rate by tenant (Part 13), rate limit rejections by tenant (Part 15).

Worker metrics: Queue depth, task success/failure rate, task duration p99, dead letter queue size (Part 14).

Alerting rules:

groups:
  - name: shelfwise
    rules:
      - alert: HighErrorRate
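        # Aggregate both sides: without sum(), the division matches series label-by-label,
        # so 5xx series are only ever divided by themselves
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05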
        for: 2m
        labels:
          severity: sev2
        annotations:
          summary: "Error rate above 5% for 2 minutes"

      - alert: HighP99Latency
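        # Sum buckets across endpoints and instances, keeping only the le label the quantile needs
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5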
        for: 5m
        labels:
          severity: sev3
        annotations:
          summary: "p99 latency above 500ms for 5 minutes"

      - alert: QueryCountSpike
        expr: avg(http_request_query_count) > 20
        for: 1m
        labels:
          severity: sev3
        annotations:
          summary: "Average queries per request above 20 — probable N+1"

      - alert: PoolExhaustion
        expr: db_pool_checked_out / db_pool_size > 0.9
        for: 2m
        labels:
          severity: sev2
        annotations:
          summary: "DB connection pool above 90% — requests will queue"

      - alert: WorkerQueueBacklog
        expr: arq_queue_length > 1000
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Background task queue backlog above 1000"
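
The `for:` clause in each rule means a condition must hold continuously for that duration before the alert fires, which filters out single-sample spikes. The sketch below models that behaviour in plain Python as a simplification: it treats `for:` as a number of consecutive rule evaluations rather than wall-clock time, and the function names are hypothetical.

```python
"""Model how a threshold alert with a `for:` duration behaves (illustrative sketch)."""


def error_ratio(errors_5xx: float, total: float) -> float:
    """Share of requests that failed with a 5xx status; 0.0 when there was no traffic."""
    return errors_5xx / total if total else 0.0


def alert_fires(ratios: list[float], threshold: float, for_evaluations: int) -> bool:
    """Return True if the most recent `for_evaluations` ratios all exceed `threshold`.

    `ratios` holds one value per rule evaluation, oldest first.
    """
    if for_evaluations <= 0 or len(ratios) < for_evaluations:
        return False
    return all(ratio > threshold for ratio in ratios[-for_evaluations:])
```

With a 30-second evaluation interval, `for: 2m` corresponds to roughly four consecutive evaluations above the threshold; one dip below it resets the wait.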

The First Incident

ShelfWise launches with 10 beta tenants. Three weeks in, the WorkerQueueBacklog alert fires. The ARQ queue has 2,400 pending tasks and is growing.

Investigation timeline:

T+0 min: Alert fires. On-call checks Grafana. Queue depth graph shows linear growth starting 2 hours ago.
T+2 min: Filter tasks by type. 95% are deliver_webhook tasks. All are retrying with exponential backoff.
T+5 min: Check webhook delivery logs. Tenant “Foyles” configured a webhook endpoint that is returning 500 Internal Server Error. Every failed delivery schedules a retry. 200 orders per hour, each triggering a webhook, each retrying 5 times.
T+8 min: Check the retry policy (Part 14). Max retries: 5. Backoff: exponential with jitter. But the webhook has been failing for 2 hours — the retries from early failures are still in the queue while new orders keep adding more.
T+12 min: Apply the fix. Two actions in parallel:
1. Pause webhook delivery for tenant “Foyles” by setting their webhook status to suspended in the tenant config (Part 10).
2. Drain the retry backlog: move all failed webhook tasks for Foyles to the dead letter queue (DLQ) for manual inspection.
T+15 min: Queue depth stops growing. Within 10 minutes, it drains back to normal.
T+20 min: Contact Foyles. Their webhook endpoint had a deployment that broke the handler. They fix it and confirm. Re-enable webhook delivery.

Postmortem action items:

Add per-tenant queue depth metric. Alert when a single tenant has > 100 pending tasks.
Add automatic webhook suspension after 10 consecutive failures for the same tenant.
Add circuit breaker (Part 14) to the webhook delivery task — stop retrying after the circuit opens.
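
The second action item, automatic suspension after 10 consecutive failures, could be sketched as follows. This is a hypothetical in-memory illustration, not the Part 14 circuit breaker; with several worker processes the counters would have to live in shared storage such as Redis.

```python
"""Suspend webhook delivery per tenant after repeated consecutive failures (sketch)."""
from collections import defaultdict


class WebhookSuspender:
    """Track consecutive delivery failures per tenant and suspend past a threshold."""

    def __init__(self, max_consecutive_failures: int = 10) -> None:
        self._max = max_consecutive_failures
        self._failures: defaultdict[str, int] = defaultdict(int)
        self._suspended: set[str] = set()

    def record_success(self, tenant_id: str) -> None:
        """A successful delivery resets the tenant's failure streak."""
        self._failures[tenant_id] = 0

    def record_failure(self, tenant_id: str) -> bool:
        """Count a failure; return True only on the call that suspends the tenant."""
        self._failures[tenant_id] += 1
        if self._failures[tenant_id] >= self._max and tenant_id not in self._suspended:
            self._suspended.add(tenant_id)
            return True
        return False

    def should_deliver(self, tenant_id: str) -> bool:
        """Delivery tasks check this before calling the tenant's endpoint."""
        return tenant_id not in self._suspended

    def resume(self, tenant_id: str) -> None:
        """Re-enable delivery once the tenant confirms their endpoint is fixed."""
        self._suspended.discard(tenant_id)
        self._failures[tenant_id] = 0
```

In the Foyles incident, a check like `should_deliver` at the top of the delivery task would have stopped new retries from being scheduled once the streak reached the limit, capping queue growth for that tenant.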

This is a small incident. Nobody lost data. No tenants were affected except Foyles, and their orders were processed normally — only the webhook notifications were delayed. But the incident exposed a gap in the system: the retry policy did not account for a persistently failing webhook endpoint generating unbounded queue growth.

Every gap you find in production adds a new item to the checklist. The checklist grows, and the system gets more robust. That is the process.

What This Series Built

Twenty-three posts. One application. Here is the arc:

Parts 1-8 established the foundation. Project structure (Part 1) gave us navigable modules. Protocols and type safety (Part 2) made contracts explicit. The repository pattern (Part 3) isolated data access. The service layer (Part 4) isolated business logic. Dependency injection (Part 5) made it all testable. Error handling (Part 6) made failures informative. Async patterns (Part 7) made it concurrent. Testing strategies (Part 8) proved it all worked.

Parts 9-20 evolved the application into a production multi-tenant SaaS. Tenant isolation (Part 9) made data leakage structurally impossible. Configuration (Part 10) made each tenant customizable. Connection pooling (Part 11) made it scalable. Observability (Part 12) made it debuggable. Caching (Part 13) made it fast. Background tasks (Part 14) made it resilient. Rate limiting (Part 15) made it fair. Data consistency (Part 16) made it correct. Migrations (Part 17) made it evolvable. Security (Part 18) made it trustworthy. Health checks (Part 19) made it operable. Event-driven architecture (Part 20) made it extensible.

Parts 21-23 brought it to production. API versioning (Part 21) made it evolvable without breaking tenants. Performance profiling and load testing (Part 22) proved it could handle real traffic. And this post assembled it all, deployed it, monitored it, and handled the first incident.

The architecture is not finished. It never is. But the patterns are in place, the monitoring is live, the runbooks are written, and the team knows how to respond when something breaks. That is the difference between a codebase and a production system.

Clean code is not about aesthetics. It is about building systems that you can understand at 3am when the pager goes off, that new team members can navigate on their first day, and that can evolve for years without a rewrite. Every pattern in this series exists because the alternative — the quick hack, the missing test, the skipped abstraction — has a cost that compounds until it becomes the only thing the team works on.

Write code for the engineer who will maintain it. That engineer is usually you, six months from now, having forgotten everything.
