Clean Code Python: From git init to Production Traffic

Theory without deployment is fiction. This capstone assembles all 22 prior patterns into a deployed, monitored, incident-ready multi-tenant Python backend — from Docker Compose to runbooks to your first production incident.

Tin Dang

Theory without deployment is fiction. A clean architecture on your laptop is a hypothesis. A clean architecture handling real money in production is engineering.

This is the final post. Over the previous 22 parts, we built ShelfWise from a single main.py into a multi-tenant B2B SaaS with structured error handling, automatic tenant isolation, connection pooling, observability, caching, background tasks, rate limiting, API versioning, and performance optimization. Every pattern was designed in isolation, tested in isolation, explained in isolation.

Now we assemble them. This post is about the gaps that only appear when everything runs together: the Dockerfile that works on your machine but OOMs in CI, the migration that passes tests but deadlocks in production, the monitoring dashboard you forgot to build until the first incident.

The Complete ShelfWise Architecture

Complete Project Structure

Every module, every config file. This is what 22 posts of incremental architecture produces:

shelfwise/
├── src/
│   ├── api/
│   │   ├── v1/  # Part 21: version translation
│   │   │   ├── __init__.py
│   │   │   ├── catalog.py
│   │   │   ├── orders.py
│   │   │   └── schemas/
│   │   │       ├── catalog.py
│   │   │       └── orders.py
│   │   ├── v2/
│   │   │   ├── __init__.py
│   │   │   ├── catalog.py
│   │   │   ├── orders.py
│   │   │   └── schemas/
│   │   │       ├── catalog.py
│   │   │       └── orders.py
│   │   ├── deps.py  # Part 5: dependency injection
│   │   ├── router.py  # Part 21: version mounting
│   │   └── middleware/
│   │       ├── tenant.py  # Part 9: tenant extraction
│   │       ├── rate_limit.py  # Part 15: per-tenant rate limiting
│   │       ├── query_count.py  # Part 22: N+1 detection
│   │       ├── deprecation.py  # Part 21: deprecation headers
│   │       └── request_id.py  # Part 12: correlation IDs
│   ├── core/
│   │   ├── config.py  # Part 10: tenant-aware config
│   │   ├── context.py  # Part 9: contextvars tenant propagation
│   │   ├── tenant.py  # Part 9: tenant dataclass
│   │   ├── security.py  # Part 18: auth and encryption
│   │   └── serialization.py  # Part 22: orjson response class
│   ├── db/
│   │   ├── base.py  # Part 3: declarative base + TenantMixin
│   │   ├── session.py  # Part 11: connection pooling
│   │   ├── query_counter.py  # Part 22: query event listener
│   │   └── events.py  # Part 9: tenant session events
│   ├── models/
│   │   ├── book.py  # Part 3: SQLAlchemy models
│   │   ├── author.py
│   │   ├── order.py
│   │   ├── tenant.py  # Part 9: tenant model
│   │   └── audit_log.py  # Part 18: audit trail
│   ├── repositories/
│   │   ├── base.py  # Part 3: generic repository protocol
│   │   ├── book_repository.py
│   │   ├── order_repository.py
│   │   └── tenant_repository.py
│   ├── services/
│   │   ├── book_service.py  # Part 4: service layer
│   │   ├── order_service.py
│   │   ├── tenant_service.py  # Part 9: tenant lifecycle
│   │   └── notification_service.py  # Part 20: event-driven notifications
│   ├── schemas/
│   │   ├── book.py  # Part 2: Pydantic schemas with protocols
│   │   ├── order.py
│   │   └── tenant.py
│   ├── errors/
│   │   ├── base.py  # Part 6: error hierarchy
│   │   ├── handlers.py  # Part 6: FastAPI exception handlers
│   │   └── codes.py  # Part 6: error code registry
│   ├── tasks/
│   │   ├── worker.py  # Part 14: ARQ worker setup
│   │   ├── email_tasks.py  # Part 14: async email
│   │   ├── webhook_tasks.py  # Part 14: webhook delivery + retry
│   │   └── cleanup_tasks.py  # Part 14: tenant data cleanup
│   ├── cache/
│   │   ├── manager.py  # Part 13: cache manager
│   │   └── invalidation.py  # Part 13: tenant-scoped invalidation
│   ├── events/
│   │   ├── bus.py  # Part 20: event bus
│   │   ├── handlers.py  # Part 20: event handlers
│   │   └── schemas.py  # Part 20: event schemas
│   ├── health/
│   │   └── checks.py  # Part 19: health check endpoints
│   ├── observability/
│   │   ├── logging.py  # Part 12: structlog configuration
│   │   ├── tracing.py  # Part 12: OpenTelemetry setup
│   │   └── metrics.py  # Part 12: Prometheus metrics
│   └── main.py  # Application entrypoint
├── alembic/
│   ├── versions/  # Part 17: migration management
│   │   ├── 001_initial.py
│   │   ├── 002_add_tenant.py
│   │   └── ...
│   ├── env.py
│   └── alembic.ini
├── tests/
│   ├── unit/  # Part 8: unit tests
│   │   ├── test_book_service.py
│   │   ├── test_order_service.py
│   │   └── test_version_translation.py
│   ├── integration/  # Part 8: integration tests
│   │   ├── test_tenant_isolation.py
│   │   └── test_order_flow.py
│   ├── contract/  # Part 21: schemathesis contract tests
│   │   └── test_api_contract.py
│   ├── load/  # Part 22: locust load tests
│   │   └── locustfile.py
│   ├── performance/  # Part 22: p99 regression tests
│   │   ├── baseline.json
│   │   └── test_p99_regression.py
│   └── conftest.py
├── scripts/
│   ├── provision_tenant.py  # Tenant onboarding automation
│   ├── offboard_tenant.py  # Tenant offboarding
│   └── seed_dev_data.py  # Development data seeding
├── runbooks/
│   ├── db_slow.md  # "DB is slow" runbook
│   ├── bad_deploy.md  # "Bad deploy" runbook
│   ├── tenant_data_issue.md  # "Tenant reports data issue" runbook
│   └── tenant_deletion.md  # "Tenant deletion" runbook
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml
├── .github/
│   └── workflows/
│       ├── ci.yml  # Full CI pipeline
│       └── load-test.yml  # Scheduled load tests
└── .env.example

Docker Compose: The Local Stack

Development should mirror production topology. Docker Compose gives every developer a full ShelfWise environment with one command.

docker-compose.yml
services:
  api:
    build:
      context: .
      target: runtime
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql+asyncpg://shelfwise:shelfwise@postgres:5432/shelfwise
      - REDIS_URL=redis://redis:6379/0
      - LOG_LEVEL=debug
      - ENVIRONMENT=development
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    volumes:
      - ./src:/app/src  # Hot reload in development
    command: uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload
  worker:
    build:
      context: .
      target: runtime
    environment:
      - DATABASE_URL=postgresql+asyncpg://shelfwise:shelfwise@postgres:5432/shelfwise
      - REDIS_URL=redis://redis:6379/0
      - LOG_LEVEL=debug
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    command: arq src.tasks.worker.WorkerSettings
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: shelfwise
      POSTGRES_PASSWORD: shelfwise
      POSTGRES_DB: shelfwise
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U shelfwise"]
      interval: 5s
      timeout: 3s
      retries: 5
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
  mailhog:
    image: mailhog/mailhog
    ports:
      - "1025:1025"  # SMTP
      - "8025:8025"  # Web UI
volumes:
  pgdata:

The Dockerfile: Multi-Stage, Minimal, Secure

Every layer of this Dockerfile serves a purpose. The build stage installs dependencies. The runtime stage copies only what is needed. The result is a small, secure image with no build tools, no dev dependencies, and no root access.

# Dockerfile
# === Build stage: install dependencies ===
FROM python:3.12-slim AS builder
RUN pip install --no-cache-dir uv
WORKDIR /app
COPY pyproject.toml uv.lock ./
# Install locked dependencies only; the application source is copied in the runtime stage
RUN uv sync --frozen --no-dev --no-install-project

# === Runtime stage: minimal image ===
FROM python:3.12-slim AS runtime
# Security: non-root user
RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid 1000 --shell /bin/bash app
WORKDIR /app
# Copy only the virtual environment and source code
COPY --from=builder /app/.venv /app/.venv
COPY src/ src/
COPY alembic/ alembic/
COPY alembic.ini .
# Use the virtual environment
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
# Health check against the Part 19 health endpoint.
# Uses only the standard library, so it works even if httpx is not a runtime dependency.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)"
USER app
EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

The CI/CD Pipeline

The pipeline is the assembly line where every pattern from this series is verified in sequence. A failure at any stage stops the deployment.

.github/workflows/ci.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-and-type-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run ruff check .
      - run: uv run ruff format --check .
      # pyright has no --strict CLI flag; enable strict mode with
      # typeCheckingMode = "strict" under [tool.pyright] in pyproject.toml
      - run: uv run pyright

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run pytest tests/unit/ --cov=src --cov-fail-under=80

  integration-tests:
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run pytest tests/integration/ -x -v
        env:
          DATABASE_URL: postgresql+asyncpg://test:test@localhost:5432/test
          REDIS_URL: redis://localhost:6379/0

  contract-tests:
    runs-on: ubuntu-latest
    needs: integration-tests
    # The app cannot start without DATABASE_URL, so this job runs the same backing services
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
    env:
      DATABASE_URL: postgresql+asyncpg://test:test@localhost:5432/test
      REDIS_URL: redis://localhost:6379/0
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: |
          uv run uvicorn src.main:app --host 0.0.0.0 --port 8000 &
          sleep 5
          uv run schemathesis run http://localhost:8000/openapi.json --checks all

  deploy-staging:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests, contract-tests]
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t shelfwise:${{ github.sha }} .
      - run: ./scripts/deploy.sh staging ${{ github.sha }}

  smoke-tests:
    runs-on: ubuntu-latest
    needs: deploy-staging
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/smoke_test.sh https://staging.shelfwise.io

  deploy-production:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    needs: smoke-tests
    environment: production
    steps:
      # Checkout is required here too; without it scripts/deploy.sh does not exist on the runner
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh production ${{ github.sha }}
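
The `sleep 5` in the contract-tests job assumes the server is listening within five seconds, which may not hold on a busy runner. One alternative is to poll the health endpoint until it answers. The sketch below is an illustration, not part of the ShelfWise repository: `http_ok` and `wait_until_ready` are hypothetical helper names, and only the `/health` path is taken from the Dockerfile above.

```python
import time
from typing import Callable
from urllib.error import URLError
from urllib.request import urlopen


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx response (HTTPError).
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it succeeds or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # no sleep after the final failed attempt
    return False
```

A CI step could call `wait_until_ready(lambda: http_ok("http://localhost:8000/health"))` and exit non-zero when it returns `False`, so a server that never starts fails the job quickly instead of producing confusing schemathesis errors.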

Environment Management

Staging must mirror production configuration with reduced resources. The differences should be minimal and explicit.

src/core/config.py
from pydantic import SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Configuration from environment variables (Part 10)."""

    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
    )

    # Core
    environment: str = "development"  # development | staging | production
    debug: bool = False
    log_level: str = "info"

    # Database (Part 11: connection pooling)
    database_url: SecretStr
    db_pool_min: int = 5
    db_pool_max: int = 20

    # Redis (Part 13: caching, Part 14: task queue)
    redis_url: str = "redis://localhost:6379/0"

    # Security (Part 18)
    jwt_secret: SecretStr = SecretStr("change-me-in-production")
    encryption_key: SecretStr = SecretStr("change-me-in-production")

    @property
    def is_production(self) -> bool:
        return self.environment == "production"

| Setting | Staging | Production |
|---------|---------|------------|
| DB pool size | 5-10 connections | 20-50 connections |
| API replicas | 1 | 2-4 (auto-scaled) |
| Worker replicas | 1 | 2-3 |
| Log level | debug | info |
| Rate limits | Relaxed (10x headroom) | Enforced per plan tier |
| Cache TTL | 60 seconds (catch stale issues) | 300 seconds |
| Health check interval | 30 seconds | 10 seconds |
| Alert routing | Slack #staging-alerts | PagerDuty on-call rotation |
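
The `Settings` class above ships `"change-me-in-production"` as the default for `jwt_secret` and `encryption_key`. One way to make sure those placeholders never reach production is a startup check that refuses to boot with them. The following is a minimal standard-library sketch of that idea; `SecuritySettings` and `production_config_errors` are hypothetical names used for illustration, and a real version would more likely live in a validator on `Settings` itself.

```python
from dataclasses import dataclass

# Placeholder value taken from the Settings defaults shown above.
PLACEHOLDER_SECRET = "change-me-in-production"


@dataclass(frozen=True)
class SecuritySettings:
    """Hypothetical stand-in for the security-related fields of Settings."""

    environment: str
    jwt_secret: str
    encryption_key: str


def production_config_errors(settings: SecuritySettings) -> list[str]:
    """Return one message per secret that still holds the placeholder."""
    if settings.environment != "production":
        return []  # placeholders are acceptable outside production
    errors = []
    for name in ("jwt_secret", "encryption_key"):
        if getattr(settings, name) == PLACEHOLDER_SECRET:
            errors.append(f"{name} still uses the placeholder default")
    return errors
```

Calling this during application startup and raising when the list is non-empty turns a silent misconfiguration into a failed deploy, which the pipeline's smoke tests would then catch in staging-like environments.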

Production Readiness Checklist

This is every pattern from every post, verified as a pre-launch gate. Do not ship without greens across the board.

| Category | Requirement | Part | Verification |
|----------|-------------|------|--------------|
| Structure | Domain modules under 400 lines | 1 | wc -l on every .py file |
| Types | 100% public API type coverage | 2 | pyright in strict mode, zero errors |
| Data | Repository pattern, no raw SQL in services | 3 | Code review |
| Logic | Service layer with no framework imports | 4 | grep for fastapi in services/ |
| DI | All dependencies injected, no global state | 5 | No module-level mutable state |
| Errors | Structured error hierarchy, no bare except | 6 | ruff rule BLE001 |
| Async | No blocking I/O in async functions | 7 | ruff ASYNC rules, code review |
| Tests | 80%+ coverage, isolation test passes | 8-9 | pytest --cov-fail-under=80 |
| Tenancy | Automatic tenant filtering on every query | 9 | Integration test with two tenants |
| Config | No secrets in code, env-based config | 10 | bandit, no .env in git |
| Pooling | Connection pool sized and monitored | 11 | Pool metrics in Grafana |
| Observability | Structured logs, traces, metrics | 12 | Dashboard exists, alerts configured |
| Cache | Tenant-scoped, invalidation tested | 13 | Cache hit rate metric |
| Tasks | Background tasks with retry and DLQ | 14 | Worker metrics in Grafana |
| Rate limits | Per-tenant, per-plan enforcement | 15 | Load test verifies limits |
| Consistency | Idempotency keys on mutations | 16 | Contract test with duplicate requests |
| Migrations | Reversible, zero-downtime | 17 | Alembic downgrade tested |
| Security | JWT auth, audit logging, encryption at rest | 18 | bandit -ll, pentest report |
| Health | Liveness + readiness + dependency checks | 19 | k8s/ECS probes configured |
| Events | Event bus for cross-domain side effects | 20 | Event handler tests pass |
| Versioning | Deprecated v1 with sunset headers | 21 | curl check for Deprecation header |
| Performance | p99 < 200ms, no N+1, load tested | 22 | Locust report, query counter < 10 |
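
Two of the checks above (module length and framework imports in the service layer) are easy to automate so they do not depend on someone remembering to run `wc -l` or `grep`. The sketch below is one possible way to do it with the standard library; the function names are hypothetical, and it parses imports with `ast` rather than grepping so that a comment mentioning fastapi does not count as a violation.

```python
import ast
from pathlib import Path


def oversized_modules(root: Path, max_lines: int = 400) -> list[tuple[str, int]]:
    """Return (relative path, line count) for .py files over the limit."""
    offenders = []
    for path in sorted(root.rglob("*.py")):
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > max_lines:
            offenders.append((path.relative_to(root).as_posix(), line_count))
    return offenders


def framework_imports(services_dir: Path, framework: str = "fastapi") -> list[str]:
    """Return service modules that import the web framework."""
    offenders = []
    for path in sorted(services_dir.rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            # Match "fastapi" itself and submodules such as "fastapi.responses".
            if any(n == framework or n.startswith(framework + ".") for n in names):
                offenders.append(path.relative_to(services_dir).as_posix())
                break
    return offenders
```

Wired into a unit test that asserts both lists are empty, these checks would run on every pull request alongside the rest of the pipeline.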

Tenant Provisioning Automation

Onboarding a new tenant is a sequence of steps that must succeed atomically. If step 4 fails, steps 1-3 must be rolled back. Automate the entire flow.

scripts/provision_tenant.py
"""Automated tenant provisioning pipeline."""
import asyncio
import sys
from uuid import UUID

import structlog

from src.core.context import system_context
from src.db.session import get_session
from src.services.tenant_service import TenantService

logger = structlog.get_logger()


async def provision_tenant(
    slug: str,
    name: str,
    plan: str,
    admin_email: str,
) -> UUID:
    """Provision a new tenant end-to-end.

    Steps:
    1. Create tenant record in database
    2. Seed default data (categories, roles, permissions)
    3. Create the admin user
    4. Warm caches for the new tenant
    5. Enqueue welcome email to admin

    All database writes share one transaction. Any failure rolls the
    transaction back and re-raises, so no half-provisioned tenant is committed.
    """
    async with system_context(reason="provisioning", operator="system"):
        async with get_session() as session:
            tenant_service = TenantService(session)
            try:
                # Step 1: Create tenant record
                tenant = await tenant_service.create_tenant(
                    slug=slug,
                    name=name,
                    plan=plan,
                )
                logger.info(
                    "tenant_created",
                    tenant_id=str(tenant.id),
                    slug=slug,
                )
                # Step 2: Seed default data
                await tenant_service.seed_defaults(tenant.id)
                logger.info("tenant_defaults_seeded", tenant_id=str(tenant.id))
                # Step 3: Create admin user
                await tenant_service.create_admin_user(
                    tenant_id=tenant.id,
                    email=admin_email,
                )
                # Step 4: Warm caches
                await tenant_service.warm_caches(tenant.id)
                # Step 5: Send welcome email (via background task)
                await tenant_service.enqueue_welcome_email(
                    tenant_id=tenant.id,
                    admin_email=admin_email,
                )
                await session.commit()
            except Exception:
                # Undo steps 1-3 explicitly before the error propagates
                await session.rollback()
                logger.exception("tenant_provisioning_failed", slug=slug)
                raise
    logger.info(
        "tenant_provisioned",
        tenant_id=str(tenant.id),
        slug=slug,
        plan=plan,
    )
    return tenant.id


if __name__ == "__main__":
    if len(sys.argv) != 5:
        sys.exit("usage: provision_tenant.py <slug> <name> <plan> <admin_email>")
    tenant_id = asyncio.run(
        provision_tenant(
            slug=sys.argv[1],
            name=sys.argv[2],
            plan=sys.argv[3],
            admin_email=sys.argv[4],
        )
    )
    print(f"Provisioned tenant: {tenant_id}")
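
A database transaction covers the steps that write to Postgres, but steps such as cache warming or enqueuing an email have side effects a transaction cannot undo. One common way to handle "if step 4 fails, steps 1-3 must be rolled back" for those cases is to pair each step with a compensating action and run the compensations in reverse order on failure. The sketch below illustrates that pattern in isolation; `Step` and `run_with_compensation` are hypothetical names, not part of `TenantService`.

```python
from collections.abc import Awaitable, Callable
from dataclasses import dataclass

AsyncAction = Callable[[], Awaitable[None]]


@dataclass(frozen=True)
class Step:
    """One provisioning step and the action that undoes it."""

    name: str
    run: AsyncAction
    undo: AsyncAction


async def run_with_compensation(steps: list[Step]) -> None:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: list[Step] = []
    try:
        for step in steps:
            await step.run()
            completed.append(step)
    except Exception:
        # The failing step never completed, so only earlier steps are undone.
        for step in reversed(completed):
            await step.undo()
        raise
```

The original exception is re-raised after compensation, so the caller still sees why provisioning failed. A production version would also need to decide what happens when an `undo` itself fails, which this sketch does not address.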

Tenant Offboarding

Offboarding is the reverse, with an important addition: you must preserve audit logs for compliance and export tenant data before deletion.

scripts/offboard_tenant.py
"""Tenant offboarding pipeline: suspend → export → delete → archive."""
import asyncio
from uuid import UUID

import structlog

logger = structlog.get_logger()


async def offboard_tenant(tenant_id: UUID, reason: str) -> None:
    """Offboard a tenant with full data lifecycle.

    This is a multi-day process, not a single operation:
    Day 0: Suspend tenant (API returns 403, data preserved)
    Day 1: Export tenant data to S3 (GDPR data portability)
    Day 30: Delete tenant data from database
    Day 30: Archive audit logs to cold storage (7-year retention)
    """
    # Phase 1: Immediate suspension
    await _suspend_tenant(tenant_id, reason)
    logger.info("tenant_suspended", tenant_id=str(tenant_id), reason=reason)
    # Phase 2: Schedule data export (async task)
    await _enqueue_data_export(tenant_id)
    logger.info("tenant_export_scheduled", tenant_id=str(tenant_id))
    # Phase 3: Schedule deletion (30-day grace period)
    await _schedule_deletion(tenant_id, grace_days=30)
    logger.info(
        "tenant_deletion_scheduled",
        tenant_id=str(tenant_id),
        grace_days=30,
    )


async def _suspend_tenant(tenant_id: UUID, reason: str) -> None:
    """Set tenant status to suspended. All API calls return 403."""
    ...


async def _enqueue_data_export(tenant_id: UUID) -> None:
    """Export all tenant data to S3 for data portability."""
    ...


async def _schedule_deletion(tenant_id: UUID, grace_days: int) -> None:
    """Schedule hard deletion after grace period."""
    ...
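
The Day 0 / Day 1 / Day 30 timeline in the docstring can be made concrete by computing the dates up front, which is useful for support tickets and for the deletion scheduler. This is a minimal sketch under the assumption that all phases are measured in calendar days from the suspension date; `OffboardingSchedule` and `build_schedule` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class OffboardingSchedule:
    suspend_on: date
    export_on: date
    delete_on: date
    archive_audit_logs_on: date


def build_schedule(start: date, grace_days: int = 30) -> OffboardingSchedule:
    """Derive the offboarding dates from the suspension date."""
    if grace_days <= 1:
        # The export runs on day 1, so deletion must come strictly after it.
        raise ValueError("grace_days must be greater than 1")
    deletion_day = start + timedelta(days=grace_days)
    return OffboardingSchedule(
        suspend_on=start,
        export_on=start + timedelta(days=1),
        delete_on=deletion_day,
        # Audit logs are archived on the same day the tenant data is deleted.
        archive_audit_logs_on=deletion_day,
    )
```

For a tenant suspended on 1 January 2025, the export falls on 2 January and deletion on 31 January.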

Runbooks

Runbooks are not documentation — they are decision trees for 3am. Each one starts with a symptom, confirms the diagnosis, and prescribes specific commands.

Runbook: “Database Is Slow”

  1. Check connection pool saturation. Grafana panel: db_pool_checked_out / db_pool_size. If above 80%, pool is the bottleneck. Increase db_pool_max and restart (Part 11).
  2. Check for long-running queries. SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10; Kill anything over 30 seconds: SELECT pg_terminate_backend(pid);
  3. Check for lock contention. SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocking.pid AS blocking_pid, blocking.query AS blocking_query FROM pg_stat_activity blocked JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid)); Each row pairs a waiting session with the session holding the lock it needs.
  4. Check for missing indexes. SELECT schemaname, relname, seq_scan, idx_scan FROM pg_stat_user_tables WHERE seq_scan > idx_scan ORDER BY seq_scan - idx_scan DESC LIMIT 10; Tables with more sequential scans than index scans need composite indexes (Part 22).

Runbook: “Bad Deploy”

  1. Confirm the bad deploy. Check error rate spike in Grafana. If error rate > 5%, proceed to rollback.
  2. Rollback immediately. ./scripts/deploy.sh production <previous-sha>. Do not debug in production. Rollback first, investigate second.
  3. Verify rollback. Error rate should return to baseline within 2 minutes. If not, the problem is not the deploy — check database, Redis, external dependencies.
  4. Post-rollback. Create an incident ticket. Identify the failing commit. Write a regression test that would have caught it. Deploy the fix through the full CI pipeline.

Runbook: “Tenant Reports Data Issue”

  1. Identify the tenant. Get tenant_id from the support ticket or subdomain.
  2. Pull recent traces. Filter Loki logs by tenant_id and the reported timeframe (Part 12). Look for error logs, failed transactions, unexpected query results.
  3. Check tenant isolation. Run the isolation integration test against production read-replica. If it fails, this is a SEV-1 — tenant data boundary has been breached. Escalate immediately.
  4. Check recent migrations. Did a migration run in the reported timeframe? Check Alembic history. Look for schema changes that could affect query results (Part 17).
  5. Reproduce with tenant data. Use the system context (Part 9) to query the tenant’s data in a staging environment. Never modify production data directly.

Incident Response

Severity Levels

| Level | Definition | Response Time | Example |
|-------|------------|---------------|---------|
| SEV-1 | Data breach, total outage, data corruption | 15 minutes | Cross-tenant data leakage |
| SEV-2 | Major feature broken, one tenant fully impacted | 1 hour | Payment processing down for one tenant |
| SEV-3 | Degraded performance, partial impact | 4 hours | p99 above SLO, cache miss rate high |
| SEV-4 | Minor issue, workaround available | Next business day | Incorrect sort order on one endpoint |
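
The response times in the table can be encoded so that alerting or ticketing tools compute a concrete deadline instead of leaving it to whoever is on call. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "next business day" means the same clock time on the next weekday, ignoring public holidays.

```python
from datetime import datetime, timedelta

# Response windows for SEV-1 to SEV-3, taken from the severity table.
RESPONSE_WINDOWS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=4),
}


def next_business_day(moment: datetime) -> datetime:
    """Return the same clock time on the next weekday (Mon-Fri)."""
    candidate = moment + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the latest acceptable first-response time for an incident."""
    if severity == "SEV-4":
        return next_business_day(detected_at)
    return detected_at + RESPONSE_WINDOWS[severity]
```

An unknown severity raises `KeyError`, which is deliberate here: an incident without a valid level should be fixed at the source rather than given a default deadline.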

Postmortem Template

Every SEV-1 and SEV-2 gets a postmortem within 48 hours. The postmortem is blameless — it focuses on system failures, not human failures.

## Incident: [Title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-N
**Impact:** [Number of tenants affected, revenue impact, data impact]
## Timeline
- HH:MM — Alert fired: [alert name]
- HH:MM — On-call acknowledged
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Monitoring confirmed recovery
## Root Cause
[One paragraph: what broke and why]
## Detection
How was this detected? Alert? Tenant report? Manual observation?
Detection gap: [What should have caught this earlier?]
## Resolution
What was done to fix it? Rollback? Hotfix? Configuration change?
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, actionable item] | [Name] | [Date] | Open |
## Lessons Learned
- What went well?
- What went poorly?
- Where did we get lucky?

First-Day Production: Monitoring Dashboard

Before a single tenant sends a request, your Grafana dashboard must have these panels:

Request metrics: Request rate by endpoint, error rate (4xx, 5xx) by endpoint, p50/p95/p99 latency by endpoint. These are Part 12 Prometheus metrics visualized.

Database metrics: Connection pool utilization, query duration p99, queries per request (Part 22 counter), active connections by tenant.

Tenant metrics: Request rate by tenant, error rate by tenant, cache hit rate by tenant (Part 13), rate limit rejections by tenant (Part 15).

Worker metrics: Queue depth, task success/failure rate, task duration p99, dead letter queue size (Part 14).

Alerting rules:

prometheus/alerts.yml
groups:
  - name: shelfwise
    rules:
      - alert: HighErrorRate
        # sum() on both sides is required: without it each 5xx series is divided by itself and the ratio is always 1
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: sev2
        annotations:
          summary: "Error rate above 5% for 2 minutes"
      - alert: HighP99Latency
        # Aggregate buckets by le so the quantile covers all endpoints and instances
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: sev3
        annotations:
          summary: "p99 latency above 500ms for 5 minutes"
      - alert: QueryCountSpike
        expr: avg(http_request_query_count) > 20
        for: 1m
        labels:
          severity: sev3
        annotations:
          summary: "Average queries per request above 20: probable N+1"
      - alert: PoolExhaustion
        expr: db_pool_checked_out / db_pool_size > 0.9
        for: 2m
        labels:
          severity: sev2
        annotations:
          summary: "DB connection pool above 90%: requests will queue"
      - alert: WorkerQueueBacklog
        expr: arq_queue_length > 1000
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Background task queue backlog above 1000"
The First Incident

ShelfWise launches with 10 beta tenants. Three weeks in, the WorkerQueueBacklog alert fires. The ARQ queue has 2,400 pending tasks and is growing.

Investigation timeline:

  • T+0 min: Alert fires. On-call checks Grafana. Queue depth graph shows linear growth starting 2 hours ago.
  • T+2 min: Filter tasks by type. 95% are deliver_webhook tasks. All are retrying with exponential backoff.
  • T+5 min: Check webhook delivery logs. Tenant “Foyles” configured a webhook endpoint that is returning 500 Internal Server Error. Every failed delivery schedules a retry. 200 orders per hour, each triggering a webhook, each retrying 5 times.
  • T+8 min: Check the retry policy (Part 14). Max retries: 5. Backoff: exponential with jitter. But the webhook has been failing for 2 hours — the retries from early failures are still in the queue while new orders keep adding more.
  • T+12 min: Apply the fix. Two actions in parallel:
    1. Pause webhook delivery for tenant “Foyles” by setting their webhook status to suspended in the tenant config (Part 10).
    2. Drain Foyles' backlog from the main queue: all failed webhook tasks for Foyles are moved to the dead letter queue (DLQ) for manual inspection.
  • T+15 min: Queue depth stops growing. Within 10 minutes, it drains back to normal.
  • T+20 min: Contact Foyles. Their webhook endpoint had a deployment that broke the handler. They fix it and confirm. Re-enable webhook delivery.
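
The figures in the timeline are consistent with a simple count of delivery attempts: every order triggers one webhook, and every webhook is attempted once plus its retries. The function below is a back-of-the-envelope sketch (the name `delivery_attempts` is hypothetical); it counts attempts generated during the outage and does not model backoff timing.

```python
def delivery_attempts(
    orders_per_hour: int,
    hours_failing: float,
    max_retries: int,
) -> int:
    """Total webhook delivery attempts generated while the endpoint is down."""
    webhooks = int(orders_per_hour * hours_failing)
    # One initial attempt plus max_retries retries per webhook.
    return webhooks * (1 + max_retries)


# 200 orders/hour for 2 hours with 5 retries each.
attempts = delivery_attempts(orders_per_hour=200, hours_failing=2, max_retries=5)
```

With the numbers from the incident this gives 400 webhooks and 2,400 attempts, the same order of magnitude as the 2,400 pending tasks that triggered the alert.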

Postmortem action items:

  1. Add per-tenant queue depth metric. Alert when a single tenant has > 100 pending tasks.
  2. Add automatic webhook suspension after 10 consecutive failures for the same tenant.
  3. Add circuit breaker (Part 14) to the webhook delivery task — stop retrying after the circuit opens.
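
The second action item, automatic suspension after 10 consecutive failures, can be sketched as a small counter per tenant. This is an in-memory illustration with hypothetical names (`WebhookFailureTracker` and its methods); with several worker replicas the counters would need to live in shared storage such as Redis, and this sketch makes no claim about how Part 14's circuit breaker is implemented.

```python
from collections import defaultdict


class WebhookFailureTracker:
    """Suspend webhook delivery for a tenant after repeated failures."""

    def __init__(self, threshold: int = 10) -> None:
        self.threshold = threshold
        self._consecutive_failures: dict[str, int] = defaultdict(int)
        self._suspended: set[str] = set()

    def record_success(self, tenant_id: str) -> None:
        # Any success resets the streak; only consecutive failures count.
        self._consecutive_failures[tenant_id] = 0

    def record_failure(self, tenant_id: str) -> bool:
        """Count a failure; return True if the tenant is now suspended."""
        self._consecutive_failures[tenant_id] += 1
        if self._consecutive_failures[tenant_id] >= self.threshold:
            self._suspended.add(tenant_id)
        return tenant_id in self._suspended

    def is_suspended(self, tenant_id: str) -> bool:
        return tenant_id in self._suspended

    def resume(self, tenant_id: str) -> None:
        """Re-enable delivery once the tenant confirms their endpoint is fixed."""
        self._suspended.discard(tenant_id)
        self._consecutive_failures[tenant_id] = 0
```

The delivery task would check `is_suspended` before sending and skip retries for suspended tenants, which bounds the queue growth seen in this incident. Resuming stays a manual step, mirroring the T+20 confirmation from Foyles.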

This is a small incident. Nobody lost data. No tenants were affected except Foyles, and their orders were processed normally — only the webhook notifications were delayed. But the incident exposed a gap in the system: the retry policy did not account for a persistently failing webhook endpoint generating unbounded queue growth.

Every gap you find in production adds a new item to the checklist. The checklist grows, and the system gets more robust. That is the process.

What This Series Built

Twenty-three posts. One application. Here is the arc:

Parts 1-8 established the foundation. Project structure (Part 1) gave us navigable modules. Protocols and type safety (Part 2) made contracts explicit. The repository pattern (Part 3) isolated data access. The service layer (Part 4) isolated business logic. Dependency injection (Part 5) made it all testable. Error handling (Part 6) made failures informative. Async patterns (Part 7) made it concurrent. Testing strategies (Part 8) proved it all worked.

Parts 9-20 evolved the application into a production multi-tenant SaaS. Tenant isolation (Part 9) made data leakage structurally impossible. Configuration (Part 10) made each tenant customizable. Connection pooling (Part 11) made it scalable. Observability (Part 12) made it debuggable. Caching (Part 13) made it fast. Background tasks (Part 14) made it resilient. Rate limiting (Part 15) made it fair. Data consistency (Part 16) made it correct. Migrations (Part 17) made it evolvable. Security (Part 18) made it trustworthy. Health checks (Part 19) made it operable. Event-driven architecture (Part 20) made it extensible.

Parts 21-23 brought it to production. API versioning (Part 21) made it evolvable without breaking tenants. Performance profiling and load testing (Part 22) proved it could handle real traffic. And this post assembled it all, deployed it, monitored it, and handled the first incident.

The architecture is not finished. It never is. But the patterns are in place, the monitoring is live, the runbooks are written, and the team knows how to respond when something breaks. That is the difference between a codebase and a production system.


Clean code is not about aesthetics. It is about building systems that you can understand at 3am when the pager goes off, that new team members can navigate on their first day, and that can evolve for years without a rewrite. Every pattern in this series exists because the alternative — the quick hack, the missing test, the skipped abstraction — has a cost that compounds until it becomes the only thing the team works on.

Write code for the engineer who will maintain it. That engineer is usually you, six months from now, having forgotten everything.
