Theory without deployment is fiction. A clean architecture on your laptop is a hypothesis. A clean architecture handling real money in production is engineering.
This is the final post. Over the previous 22 parts, we built ShelfWise from a single main.py into a multi-tenant B2B SaaS with structured error handling, automatic tenant isolation, connection pooling, observability, caching, background tasks, rate limiting, API versioning, and performance optimization. Every pattern was designed in isolation, tested in isolation, explained in isolation.
Now we assemble them. This post is about the gaps that only appear when everything runs together: the Dockerfile that works on your machine but OOMs in CI, the migration that passes tests but deadlocks in production, the monitoring dashboard you forgot to build until the first incident.
The Complete ShelfWise Architecture
Complete Project Structure
Every module, every config file. This is what 22 posts of incremental architecture produces:
```
shelfwise/
├── src/
│   ├── api/
│   │   ├── v1/                          # Part 21: version translation
│   │   │   ├── __init__.py
│   │   │   ├── catalog.py
│   │   │   ├── orders.py
│   │   │   └── schemas/
│   │   │       ├── catalog.py
│   │   │       └── orders.py
│   │   ├── v2/
│   │   │   ├── __init__.py
│   │   │   ├── catalog.py
│   │   │   ├── orders.py
│   │   │   └── schemas/
│   │   │       ├── catalog.py
│   │   │       └── orders.py
│   │   ├── deps.py                      # Part 5: dependency injection
│   │   ├── router.py                    # Part 21: version mounting
│   │   └── middleware/
│   │       ├── tenant.py                # Part 9: tenant extraction
│   │       ├── rate_limit.py            # Part 15: per-tenant rate limiting
│   │       ├── query_count.py           # Part 22: N+1 detection
│   │       ├── deprecation.py           # Part 21: deprecation headers
│   │       └── request_id.py            # Part 12: correlation IDs
│   ├── core/
│   │   ├── config.py                    # Part 10: tenant-aware config
│   │   ├── context.py                   # Part 9: contextvars tenant propagation
│   │   ├── tenant.py                    # Part 9: tenant dataclass
│   │   ├── security.py                  # Part 18: auth and encryption
│   │   └── serialization.py             # Part 22: orjson response class
│   ├── db/
│   │   ├── base.py                      # Part 3: declarative base + TenantMixin
│   │   ├── session.py                   # Part 11: connection pooling
│   │   ├── query_counter.py             # Part 22: query event listener
│   │   └── events.py                    # Part 9: tenant session events
│   ├── models/
│   │   ├── book.py                      # Part 3: SQLAlchemy models
│   │   ├── author.py
│   │   ├── order.py
│   │   ├── tenant.py                    # Part 9: tenant model
│   │   └── audit_log.py                 # Part 18: audit trail
│   ├── repositories/
│   │   ├── base.py                      # Part 3: generic repository protocol
│   │   ├── book_repository.py
│   │   ├── order_repository.py
│   │   └── tenant_repository.py
│   ├── services/
│   │   ├── book_service.py              # Part 4: service layer
│   │   ├── order_service.py
│   │   ├── tenant_service.py            # Part 9: tenant lifecycle
│   │   └── notification_service.py      # Part 20: event-driven notifications
│   ├── schemas/
│   │   ├── book.py                      # Part 2: Pydantic schemas with protocols
│   │   ├── order.py
│   │   └── tenant.py
│   ├── errors/
│   │   ├── base.py                      # Part 6: error hierarchy
│   │   ├── handlers.py                  # Part 6: FastAPI exception handlers
│   │   └── codes.py                     # Part 6: error code registry
│   ├── tasks/
│   │   ├── worker.py                    # Part 14: ARQ worker setup
│   │   ├── email_tasks.py               # Part 14: async email
│   │   ├── webhook_tasks.py             # Part 14: webhook delivery + retry
│   │   └── cleanup_tasks.py             # Part 14: tenant data cleanup
│   ├── cache/
│   │   ├── manager.py                   # Part 13: cache manager
│   │   └── invalidation.py              # Part 13: tenant-scoped invalidation
│   ├── events/
│   │   ├── bus.py                       # Part 20: event bus
│   │   ├── handlers.py                  # Part 20: event handlers
│   │   └── schemas.py                   # Part 20: event schemas
│   ├── health/
│   │   └── checks.py                    # Part 19: health check endpoints
│   ├── observability/
│   │   ├── logging.py                   # Part 12: structlog configuration
│   │   ├── tracing.py                   # Part 12: OpenTelemetry setup
│   │   └── metrics.py                   # Part 12: Prometheus metrics
│   └── main.py                          # Application entrypoint
├── alembic/
│   ├── versions/                        # Part 17: migration management
│   │   ├── 001_initial.py
│   │   ├── 002_add_tenant.py
│   │   └── ...
│   └── env.py
├── alembic.ini                          # At the project root, where the Dockerfile copies it from
├── tests/
│   ├── unit/                            # Part 8: unit tests
│   │   ├── test_book_service.py
│   │   ├── test_order_service.py
│   │   └── test_version_translation.py
│   ├── integration/                     # Part 8: integration tests
│   │   ├── test_tenant_isolation.py
│   │   └── test_order_flow.py
│   ├── contract/                        # Part 21: schemathesis contract tests
│   │   └── test_api_contract.py
│   ├── load/                            # Part 22: locust load tests
│   │   └── locustfile.py
│   ├── performance/                     # Part 22: p99 regression tests
│   │   ├── baseline.json
│   │   └── test_p99_regression.py
│   └── conftest.py
├── scripts/
│   ├── provision_tenant.py              # Tenant onboarding automation
│   ├── offboard_tenant.py               # Tenant offboarding
│   └── seed_dev_data.py                 # Development data seeding
├── runbooks/
│   ├── db_slow.md                       # "DB is slow" runbook
│   ├── bad_deploy.md                    # "Bad deploy" runbook
│   ├── tenant_data_issue.md             # "Tenant reports data issue" runbook
│   └── tenant_deletion.md               # "Tenant deletion" runbook
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml
├── .github/
│   └── workflows/
│       ├── ci.yml                       # Full CI pipeline
│       └── load-test.yml                # Scheduled load tests
└── .env.example
```

Docker Compose: The Local Stack
Development should mirror production topology. Docker Compose gives every developer a full ShelfWise environment with one command.
```yaml
services:
  api:
    build:
      context: .
      target: runtime
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql+asyncpg://shelfwise:shelfwise@postgres:5432/shelfwise
      - REDIS_URL=redis://redis:6379/0
      - LOG_LEVEL=debug
      - ENVIRONMENT=development
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    volumes:
      - ./src:/app/src  # Hot reload in development
    command: uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload

  worker:
    build:
      context: .
      target: runtime
    environment:
      - DATABASE_URL=postgresql+asyncpg://shelfwise:shelfwise@postgres:5432/shelfwise
      - REDIS_URL=redis://redis:6379/0
      - LOG_LEVEL=debug
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    command: arq src.tasks.worker.WorkerSettings

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: shelfwise
      POSTGRES_PASSWORD: shelfwise
      POSTGRES_DB: shelfwise
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U shelfwise"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  mailhog:
    image: mailhog/mailhog
    ports:
      - "1025:1025"  # SMTP
      - "8025:8025"  # Web UI

volumes:
  pgdata:
```

The Dockerfile: Multi-Stage, Minimal, Secure
Every layer of this Dockerfile serves a purpose. The build stage installs dependencies. The runtime stage copies only what is needed. The result is a small, secure image with no build tools, no dev dependencies, and no root access.
```dockerfile
# Dockerfile
# === Build stage: install dependencies ===
FROM python:3.12-slim AS builder

RUN pip install --no-cache-dir uv

WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev --no-editable

# === Runtime stage: minimal image ===
FROM python:3.12-slim AS runtime

# Security: non-root user
RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid 1000 --shell /bin/bash app

WORKDIR /app

# Copy only the virtual environment and source code
COPY --from=builder /app/.venv /app/.venv
COPY src/ src/
COPY alembic/ alembic/
COPY alembic.ini .

# Use the virtual environment
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

# Health check: Part 19 health endpoint
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import httpx; httpx.get('http://localhost:8000/health').raise_for_status()"

USER app

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```

The CI/CD Pipeline
The pipeline is the assembly line where every pattern from this series is verified in sequence. A failure at any stage stops the deployment.
```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-and-type-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run ruff check .
      - run: uv run ruff format --check .
      # The pyright CLI has no --strict flag; strict mode is set in
      # pyproject.toml: [tool.pyright] typeCheckingMode = "strict"
      - run: uv run pyright

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run pytest tests/unit/ --cov=src --cov-fail-under=80

  integration-tests:
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: uv run pytest tests/integration/ -x -v
        env:
          DATABASE_URL: postgresql+asyncpg://test:test@localhost:5432/test
          REDIS_URL: redis://localhost:6379/0

  contract-tests:
    runs-on: ubuntu-latest
    needs: integration-tests
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv sync --frozen
      - run: |
          uv run uvicorn src.main:app --host 0.0.0.0 --port 8000 &
          sleep 5
          uv run schemathesis run http://localhost:8000/openapi.json --checks all

  deploy-staging:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests, contract-tests]
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t shelfwise:${{ github.sha }} .
      - run: ./scripts/deploy.sh staging ${{ github.sha }}

  smoke-tests:
    runs-on: ubuntu-latest
    needs: deploy-staging
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/smoke_test.sh https://staging.shelfwise.io

  deploy-production:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    needs: smoke-tests
    environment: production
    steps:
      # Checkout is required here too: each job starts on a fresh runner,
      # so ./scripts/deploy.sh does not exist without it.
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh production ${{ github.sha }}
```

Environment Management
Staging must mirror production configuration with reduced resources. The differences should be minimal and explicit.
```python
from pydantic import SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Configuration from environment variables (Part 10)."""

    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
    )

    # Core
    environment: str = "development"  # development | staging | production
    debug: bool = False
    log_level: str = "info"

    # Database (Part 11 connection pooling)
    database_url: SecretStr
    db_pool_min: int = 5
    db_pool_max: int = 20

    # Redis (Part 13 caching, Part 14 task queue)
    redis_url: str = "redis://localhost:6379/0"

    # Security (Part 18)
    jwt_secret: SecretStr = SecretStr("change-me-in-production")
    encryption_key: SecretStr = SecretStr("change-me-in-production")

    @property
    def is_production(self) -> bool:
        return self.environment == "production"
```

| Setting | Staging | Production |
|---|---|---|
| DB pool size | 5-10 connections | 20-50 connections |
| API replicas | 1 | 2-4 (auto-scaled) |
| Worker replicas | 1 | 2-3 |
| Log level | debug | info |
| Rate limits | Relaxed (10x headroom) | Enforced per plan tier |
| Cache TTL | 60 seconds (catch stale issues) | 300 seconds |
| Health check interval | 30 seconds | 10 seconds |
| Alert routing | Slack #staging-alerts | PagerDuty on-call rotation |
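The `Settings` class ships placeholder defaults for `jwt_secret` and `encryption_key`, so a startup guard is worth having. The following is a minimal sketch, not ShelfWise's actual code: `find_config_problems` is a hypothetical helper, it uses only the standard library, and it assumes the environment variable names follow the `Settings` field names (`ENVIRONMENT`, `JWT_SECRET`, `ENCRYPTION_KEY`, `DEBUG`).

```python
from collections.abc import Mapping

# Placeholder value taken from the Settings defaults; it must never reach a deployed environment.
PLACEHOLDER_SECRETS = {"change-me-in-production"}
SECRET_KEYS = ("JWT_SECRET", "ENCRYPTION_KEY")  # assumed env var names
KNOWN_ENVIRONMENTS = {"development", "staging", "production"}


def find_config_problems(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the app may start."""
    problems: list[str] = []
    environment = env.get("ENVIRONMENT", "development").lower()

    if environment not in KNOWN_ENVIRONMENTS:
        problems.append(f"unknown ENVIRONMENT: {environment!r}")

    # Staging mirrors production, so both must carry real secrets.
    if environment in {"staging", "production"}:
        for key in SECRET_KEYS:
            value = env.get(key, "")
            if not value or value in PLACEHOLDER_SECRETS:
                problems.append(f"{key} is unset or still a placeholder")

    if environment == "production" and env.get("DEBUG", "false").lower() in {"1", "true"}:
        problems.append("DEBUG must be off in production")

    return problems
```

Calling it with `os.environ` at startup and refusing to boot on a non-empty result turns a silent misconfiguration into a failed deploy, which the pipeline already knows how to handle.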
Production Readiness Checklist
This is every pattern from every post, verified as a pre-launch gate. Do not ship without greens across the board.
| Category | Requirement | Part | Verification |
|---|---|---|---|
| Structure | Domain modules under 400 lines | 1 | wc -l on every .py file |
| Types | 100% public API type coverage | 2 | pyright in strict mode, zero errors |
| Data | Repository pattern, no raw SQL in services | 3 | Code review |
| Logic | Service layer with no framework imports | 4 | grep for fastapi in services/ |
| DI | All dependencies injected, no global state | 5 | No module-level mutable state |
| Errors | Structured error hierarchy, no bare except | 6 | ruff rule BLE001 |
| Async | No blocking I/O in async functions | 7 | ruff ASYNC rules (flake8-async), code review |
| Tests | 80%+ coverage, isolation test passes | 8-9 | pytest --cov-fail-under=80 |
| Tenancy | Automatic tenant filtering on every query | 9 | Integration test with two tenants |
| Config | No secrets in code, env-based config | 10 | bandit, no .env in git |
| Pooling | Connection pool sized and monitored | 11 | Pool metrics in Grafana |
| Observability | Structured logs, traces, metrics | 12 | Dashboard exists, alerts configured |
| Cache | Tenant-scoped, invalidation tested | 13 | Cache hit rate metric |
| Tasks | Background tasks with retry and DLQ | 14 | Worker metrics in Grafana |
| Rate limits | Per-tenant, per-plan enforcement | 15 | Load test verifies limits |
| Consistency | Idempotency keys on mutations | 16 | Contract test with duplicate requests |
| Migrations | Reversible, zero-downtime | 17 | Alembic downgrade tested |
| Security | JWT auth, audit logging, encryption at rest | 18 | bandit -ll, pentest report |
| Health | Liveness + readiness + dependency checks | 19 | k8s/ECS probes configured |
| Events | Event bus for cross-domain side effects | 20 | Event handler tests pass |
| Versioning | Deprecated v1 with sunset headers | 21 | curl check for Deprecation header |
| Performance | p99 < 200ms, no N+1, load tested | 22 | Locust report, query counter < 10 |
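Several of these verifications are mechanical enough to script. As a rough sketch, the two helpers below cover the "Structure" and "Logic" rows: files over 400 lines, and `fastapi` imports inside the service layer. The function names and the regex are illustrative assumptions, not part of the ShelfWise codebase.

```python
import re
from pathlib import Path

# Matches "import fastapi" and "from fastapi... import ..." at the start of a line,
# but not lookalike packages such as fastapi_utils.
FRAMEWORK_IMPORT = re.compile(r"^\s*(?:from|import)\s+fastapi\b", re.MULTILINE)


def oversized_modules(root: Path, max_lines: int = 400) -> list[Path]:
    """Python files under root with more than max_lines lines."""
    return sorted(
        path
        for path in root.rglob("*.py")
        if len(path.read_text(encoding="utf-8").splitlines()) > max_lines
    )


def framework_imports(services_dir: Path) -> list[Path]:
    """Service modules that import fastapi, which the service layer should not."""
    return sorted(
        path
        for path in services_dir.rglob("*.py")
        if FRAMEWORK_IMPORT.search(path.read_text(encoding="utf-8"))
    )
```

Run as a CI step, `oversized_modules(Path("src"))` and `framework_imports(Path("src/services"))` should both return empty lists before the gate goes green.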
Tenant Provisioning Automation
Onboarding a new tenant is a sequence of steps that must succeed atomically. If step 4 fails, steps 1-3 must be rolled back. Automate the entire flow.

```python
"""Automated tenant provisioning pipeline."""
import asyncio
from uuid import UUID

import structlog

from src.core.context import system_context
from src.db.session import get_session
from src.services.tenant_service import TenantService

logger = structlog.get_logger()


async def provision_tenant(
    slug: str,
    name: str,
    plan: str,
    admin_email: str,
) -> UUID:
    """Provision a new tenant end-to-end.

    Steps:
        1. Create tenant record in database
        2. Seed default data (categories, roles, permissions)
        3. Create the admin user
        4. Warm caches for the new tenant
        5. Enqueue the welcome email to the admin

    Not shown here: tenant-specific migrations (if schema-per-tenant)
    and the notification to the operations channel.

    Raises TenantProvisioningError on any failure.
    """
    async with system_context(reason="provisioning", operator="system"):
        # One session, one commit at the end: a failure in any step
        # leaves the transaction uncommitted.
        async with get_session() as session:
            # Step 1: Create tenant record
            tenant_service = TenantService(session)
            tenant = await tenant_service.create_tenant(
                slug=slug,
                name=name,
                plan=plan,
            )
            logger.info(
                "tenant_created",
                tenant_id=str(tenant.id),
                slug=slug,
            )

            # Step 2: Seed default data
            await tenant_service.seed_defaults(tenant.id)
            logger.info("tenant_defaults_seeded", tenant_id=str(tenant.id))

            # Step 3: Create admin user
            await tenant_service.create_admin_user(
                tenant_id=tenant.id,
                email=admin_email,
            )

            # Step 4: Warm caches
            await tenant_service.warm_caches(tenant.id)

            # Step 5: Send welcome email (via background task)
            await tenant_service.enqueue_welcome_email(
                tenant_id=tenant.id,
                admin_email=admin_email,
            )

            await session.commit()
            logger.info(
                "tenant_provisioned",
                tenant_id=str(tenant.id),
                slug=slug,
                plan=plan,
            )
            return tenant.id


if __name__ == "__main__":
    import sys

    tenant_id = asyncio.run(
        provision_tenant(
            slug=sys.argv[1],
            name=sys.argv[2],
            plan=sys.argv[3],
            admin_email=sys.argv[4],
        )
    )
    print(f"Provisioned tenant: {tenant_id}")
```

Tenant Offboarding
Offboarding is the reverse, with an important addition: you must preserve audit logs for compliance and export tenant data before deletion.

```python
"""Tenant offboarding pipeline: suspend → export → delete → archive."""
import asyncio
from uuid import UUID

import structlog

logger = structlog.get_logger()


async def offboard_tenant(tenant_id: UUID, reason: str) -> None:
    """Offboard a tenant with full data lifecycle.

    This is a multi-day process, not a single operation:
        Day 0: Suspend tenant (API returns 403, data preserved)
        Day 1: Export tenant data to S3 (GDPR data portability)
        Day 30: Delete tenant data from database
        Day 30: Archive audit logs to cold storage (7-year retention)
    """
    # Phase 1: Immediate suspension
    await _suspend_tenant(tenant_id, reason)
    logger.info("tenant_suspended", tenant_id=str(tenant_id), reason=reason)

    # Phase 2: Schedule data export (async task)
    await _enqueue_data_export(tenant_id)
    logger.info("tenant_export_scheduled", tenant_id=str(tenant_id))

    # Phase 3: Schedule deletion (30-day grace period)
    await _schedule_deletion(tenant_id, grace_days=30)
    logger.info(
        "tenant_deletion_scheduled",
        tenant_id=str(tenant_id),
        grace_days=30,
    )


async def _suspend_tenant(tenant_id: UUID, reason: str) -> None:
    """Set tenant status to suspended. All API calls return 403."""
    ...


async def _enqueue_data_export(tenant_id: UUID) -> None:
    """Export all tenant data to S3 for data portability."""
    ...


async def _schedule_deletion(tenant_id: UUID, grace_days: int) -> None:
    """Schedule hard deletion after grace period."""
    ...
```

Runbooks
Runbooks are not documentation — they are decision trees for 3am. Each one starts with a symptom, confirms the diagnosis, and prescribes specific commands.
Runbook: “Database Is Slow”
- Check connection pool saturation. Grafana panel: `db_pool_checked_out / db_pool_size`. If above 80%, the pool is the bottleneck. Increase `db_pool_max` and restart (Part 11).
- Check for long-running queries.

  ```sql
  SELECT pid, now() - pg_stat_activity.query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10;
  ```

  Kill anything over 30 seconds, substituting the pid from the result: `SELECT pg_terminate_backend(<pid>);`
- Check for lock contention. `pg_blocking_pids()` (PostgreSQL 9.6+) returns the sessions holding the locks a blocked session is waiting on.

  ```sql
  SELECT blocked.pid    AS blocked_pid,
         blocked.query  AS blocked_query,
         blocking.pid   AS blocking_pid,
         blocking.query AS blocking_query
  FROM pg_stat_activity AS blocked
  JOIN pg_stat_activity AS blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
  ```
- Check for missing indexes.

  ```sql
  SELECT schemaname, relname, seq_scan, idx_scan
  FROM pg_stat_user_tables
  WHERE seq_scan > idx_scan
  ORDER BY seq_scan - idx_scan DESC
  LIMIT 10;
  ```

  Tables with more sequential scans than index scans are candidates for composite indexes (Part 22).
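The first two checks reduce to thresholds, so they can be encoded instead of remembered at 3am. This is a sketch under stated assumptions: `ActiveQuery` is a hypothetical container for rows you have already fetched from `pg_stat_activity`, and the 80% and 30-second limits are the ones this runbook uses.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ActiveQuery:
    """One row from the long-running-queries check (hypothetical shape)."""
    pid: int
    duration: timedelta
    query: str


def pool_is_saturated(checked_out: int, pool_size: int, limit: float = 0.8) -> bool:
    """True when the checked-out ratio exceeds the runbook's 80% line."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return checked_out / pool_size > limit


def termination_candidates(
    rows: list[ActiveQuery],
    threshold: timedelta = timedelta(seconds=30),
) -> list[ActiveQuery]:
    """Queries running longer than threshold, longest first."""
    slow = [row for row in rows if row.duration > threshold]
    return sorted(slow, key=lambda row: row.duration, reverse=True)
```

The function only selects candidates. Actually terminating a backend stays a human decision, because the longest query is sometimes a migration you do not want to kill.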
Runbook: “Bad Deploy”
- Confirm the bad deploy. Check error rate spike in Grafana. If error rate > 5%, proceed to rollback.
- Rollback immediately. `./scripts/deploy.sh production <previous-sha>`. Do not debug in production. Rollback first, investigate second.
- Verify rollback. Error rate should return to baseline within 2 minutes. If not, the problem is not the deploy: check database, Redis, external dependencies.
- Post-rollback. Create an incident ticket. Identify the failing commit. Write a regression test that would have caught it. Deploy the fix through the full CI pipeline.
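The runbook above is a small decision tree, and it can be written down as one. A minimal sketch follows: the 5% threshold comes from the runbook, while `tolerance` (how close to baseline counts as "recovered") and the returned action names are assumptions made for illustration.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; zero traffic counts as a zero error rate."""
    return errors / total if total else 0.0


def next_step(
    current_rate: float,
    baseline_rate: float,
    rolled_back: bool,
    threshold: float = 0.05,   # runbook: roll back above 5% errors
    tolerance: float = 0.01,   # assumption: within 1 point of baseline is "recovered"
) -> str:
    """Map the observed error rate to the runbook's next action."""
    if not rolled_back:
        return "rollback" if current_rate > threshold else "monitor"
    if current_rate <= baseline_rate + tolerance:
        # The deploy was the cause: ticket, regression test, fix through CI.
        return "open-incident-ticket"
    # Rollback did not help: look at database, Redis, external dependencies.
    return "check-dependencies"
```

For example, `next_step(error_rate(80, 1000), 0.005, rolled_back=False)` returns `"rollback"`, because 8% is above the threshold.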
Runbook: “Tenant Reports Data Issue”
- Identify the tenant. Get `tenant_id` from the support ticket or subdomain.
- Pull recent traces. Filter Loki logs by `tenant_id` and the reported timeframe (Part 12). Look for error logs, failed transactions, unexpected query results.
- Check tenant isolation. Run the isolation integration test against production read-replica. If it fails, this is a SEV-1: the tenant data boundary has been breached. Escalate immediately.
- Check recent migrations. Did a migration run in the reported timeframe? Check Alembic history. Look for schema changes that could affect query results (Part 17).
- Reproduce with tenant data. Use the system context (Part 9) to query the tenant’s data in a staging environment. Never modify production data directly.
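If the logs have been exported as JSON lines, the "pull recent traces" step can be reproduced offline. The sketch below assumes each line is a JSON object with `tenant_id`, `level`, and an ISO 8601 `timestamp` field. Those field names are assumptions about the structlog output, not confirmed by the series, and in Loki itself you would express the same filter as a query rather than in Python.

```python
import json
from collections.abc import Iterable
from datetime import datetime


def tenant_events(
    lines: Iterable[str],
    tenant_id: str,
    start: datetime,
    end: datetime,
    levels: tuple[str, ...] = ("error", "warning"),
) -> list[dict]:
    """Return log events for one tenant, at the given levels, inside [start, end]."""
    matches: list[dict] = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as stack trace fragments
        if not isinstance(event, dict):
            continue
        if event.get("tenant_id") != tenant_id or event.get("level") not in levels:
            continue
        timestamp = event.get("timestamp")
        if timestamp is None:
            continue
        # start and end must be timezone-aware if the logged timestamps are.
        if start <= datetime.fromisoformat(timestamp) <= end:
            matches.append(event)
    return matches
```

Filtering on `tenant_id` first is deliberate: an investigation into one tenant's data should never need to read another tenant's events.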
Incident Response
Severity Levels
| Level | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Data breach, total outage, data corruption | 15 minutes | Cross-tenant data leakage |
| SEV-2 | Major feature broken, one tenant fully impacted | 1 hour | Payment processing down for one tenant |
| SEV-3 | Degraded performance, partial impact | 4 hours | p99 above SLO, cache miss rate high |
| SEV-4 | Minor issue, workaround available | Next business day | Incorrect sort order on one endpoint |
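Response times only help if something computes the deadline when the incident is opened. Here is one way to sketch it, assuming "next business day" means the same clock time on the next weekday (Monday to Friday) and ignoring public holidays. Both are simplifications, and `response_deadline` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Fixed response windows from the severity table.
RESPONSE_WINDOWS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=4),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Latest time by which someone must have responded to the incident."""
    if severity in RESPONSE_WINDOWS:
        return detected_at + RESPONSE_WINDOWS[severity]
    if severity == "SEV-4":
        # Assumption: same clock time on the next weekday, holidays ignored.
        deadline = detected_at + timedelta(days=1)
        while deadline.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            deadline += timedelta(days=1)
        return deadline
    raise ValueError(f"unknown severity: {severity!r}")
```

A SEV-4 detected on a Friday morning therefore lands on Monday morning, while a SEV-1 detected at the same moment is due fifteen minutes later.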
Postmortem Template
Every SEV-1 and SEV-2 gets a postmortem within 48 hours. The postmortem is blameless — it focuses on system failures, not human failures.
```markdown
## Incident: [Title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-N
**Impact:** [Number of tenants affected, revenue impact, data impact]

## Timeline
- HH:MM - Alert fired: [alert name]
- HH:MM - On-call acknowledged
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Monitoring confirmed recovery

## Root Cause
[One paragraph: what broke and why]

## Detection
How was this detected? Alert? Tenant report? Manual observation?
Detection gap: [What should have caught this earlier?]

## Resolution
What was done to fix it? Rollback? Hotfix? Configuration change?

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, actionable item] | [Name] | [Date] | Open |

## Lessons Learned
- What went well?
- What went poorly?
- Where did we get lucky?
```

First-Day Production: Monitoring Dashboard
Before a single tenant sends a request, your Grafana dashboard must have these panels:
Request metrics: Request rate by endpoint, error rate (4xx, 5xx) by endpoint, p50/p95/p99 latency by endpoint. These are Part 12 Prometheus metrics visualized.
Database metrics: Connection pool utilization, query duration p99, queries per request (Part 22 counter), active connections by tenant.
Tenant metrics: Request rate by tenant, error rate by tenant, cache hit rate by tenant (Part 13), rate limit rejections by tenant (Part 15).
Worker metrics: Queue depth, task success/failure rate, task duration p99, dead letter queue size (Part 14).
Alerting rules:
```yaml
groups:
  - name: shelfwise
    rules:
      - alert: HighErrorRate
        # sum() on both sides: without it the division matches series by
        # identical labels, so each 5xx series is divided by itself.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: sev2
        annotations:
          summary: "Error rate above 5% for 2 minutes"

      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: sev3
        annotations:
          summary: "p99 latency above 500ms for 5 minutes"

      - alert: QueryCountSpike
        expr: avg(http_request_query_count) > 20
        for: 1m
        labels:
          severity: sev3
        annotations:
          summary: "Average queries per request above 20: probable N+1"

      - alert: PoolExhaustion
        expr: db_pool_checked_out / db_pool_size > 0.9
        for: 2m
        labels:
          severity: sev2
        annotations:
          summary: "DB connection pool above 90%: requests will queue"

      - alert: WorkerQueueBacklog
        expr: arq_queue_length > 1000
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Background task queue backlog above 1000"
```

The First Incident
ShelfWise launches with 10 beta tenants. Three weeks in, the WorkerQueueBacklog alert fires. The ARQ queue has 2,400 pending tasks and is growing.
Investigation timeline:
- T+0 min: Alert fires. On-call checks Grafana. Queue depth graph shows linear growth starting 2 hours ago.
- T+2 min: Filter tasks by type. 95% are `deliver_webhook` tasks. All are retrying with exponential backoff.
- T+5 min: Check webhook delivery logs. Tenant “Foyles” configured a webhook endpoint that is returning `500 Internal Server Error`. Every failed delivery schedules a retry. 200 orders per hour, each triggering a webhook, each retrying 5 times.
- T+8 min: Check the retry policy (Part 14). Max retries: 5. Backoff: exponential with jitter. But the webhook has been failing for 2 hours, so the retries from early failures are still in the queue while new orders keep adding more.
- T+12 min: Apply the fix. Two actions in parallel:
  - Pause webhook delivery for tenant “Foyles” by setting their webhook status to `suspended` in the tenant config (Part 10).
  - Drain the backlog: move all failed webhook tasks for Foyles out of the main queue and into the dead letter queue (DLQ) for manual inspection.
- T+15 min: Queue depth stops growing. Within 10 minutes, it drains back to normal.
- T+20 min: Contact Foyles. Their webhook endpoint had a deployment that broke the handler. They fix it and confirm. Re-enable webhook delivery.
Postmortem action items:
- Add per-tenant queue depth metric. Alert when a single tenant has > 100 pending tasks.
- Add automatic webhook suspension after 10 consecutive failures for the same tenant.
- Add circuit breaker (Part 14) to the webhook delivery task — stop retrying after the circuit opens.
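The second action item, automatic suspension after 10 consecutive failures, might look like the sketch below. It is deliberately simplified: `WebhookFailureTracker` is a hypothetical class, state lives in process memory rather than in Redis or the tenant config, and resuming is manual. A real version shared across several workers would need shared storage.

```python
class WebhookFailureTracker:
    """Suspend webhook delivery for a tenant after N consecutive failures."""

    def __init__(self, max_consecutive_failures: int = 10) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self._failures: dict[str, int] = {}
        self._suspended: set[str] = set()

    def should_deliver(self, tenant_id: str) -> bool:
        """Check before enqueueing or retrying a webhook task."""
        return tenant_id not in self._suspended

    def record_success(self, tenant_id: str) -> None:
        """Any success resets the streak: only consecutive failures count."""
        self._failures[tenant_id] = 0

    def record_failure(self, tenant_id: str) -> bool:
        """Count a failed delivery; return True if the tenant is now suspended."""
        self._failures[tenant_id] = self._failures.get(tenant_id, 0) + 1
        if self._failures[tenant_id] >= self.max_consecutive_failures:
            self._suspended.add(tenant_id)
        return tenant_id in self._suspended

    def resume(self, tenant_id: str) -> None:
        """Re-enable delivery once the tenant confirms their endpoint is fixed."""
        self._suspended.discard(tenant_id)
        self._failures[tenant_id] = 0
```

Had this been in place, delivery to Foyles would have stopped after the tenth failed attempt instead of feeding the queue for two hours, and every other tenant's webhooks would have been unaffected because the counter is keyed by tenant.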
This is a small incident. Nobody lost data. No tenants were affected except Foyles, and their orders were processed normally — only the webhook notifications were delayed. But the incident exposed a gap in the system: the retry policy did not account for a persistently failing webhook endpoint generating unbounded queue growth.
Every gap you find in production adds a new item to the checklist. The checklist grows, and the system gets more robust. That is the process.
What This Series Built
Twenty-three posts. One application. Here is the arc:
Parts 1-8 established the foundation. Project structure (Part 1) gave us navigable modules. Protocols and type safety (Part 2) made contracts explicit. The repository pattern (Part 3) isolated data access. The service layer (Part 4) isolated business logic. Dependency injection (Part 5) made it all testable. Error handling (Part 6) made failures informative. Async patterns (Part 7) made it concurrent. Testing strategies (Part 8) proved it all worked.
Parts 9-20 evolved the application into a production multi-tenant SaaS. Tenant isolation (Part 9) made data leakage structurally impossible. Configuration (Part 10) made each tenant customizable. Connection pooling (Part 11) made it scalable. Observability (Part 12) made it debuggable. Caching (Part 13) made it fast. Background tasks (Part 14) made it resilient. Rate limiting (Part 15) made it fair. Data consistency (Part 16) made it correct. Migrations (Part 17) made it evolvable. Security (Part 18) made it trustworthy. Health checks (Part 19) made it operable. Event-driven architecture (Part 20) made it extensible.
Parts 21-23 brought it to production. API versioning (Part 21) made it evolvable without breaking tenants. Performance profiling and load testing (Part 22) proved it could handle real traffic. And this post assembled it all, deployed it, monitored it, and handled the first incident.
The architecture is not finished. It never is. But the patterns are in place, the monitoring is live, the runbooks are written, and the team knows how to respond when something breaks. That is the difference between a codebase and a production system.
Clean code is not about aesthetics. It is about building systems that you can understand at 3am when the pager goes off, that new team members can navigate on their first day, and that can evolve for years without a rewrite. Every pattern in this series exists because the alternative — the quick hack, the missing test, the skipped abstraction — has a cost that compounds until it becomes the only thing the team works on.
Write code for the engineer who will maintain it. That engineer is usually you, six months from now, having forgotten everything.