Tutorial

Clean Code Python: Observability — Structured Logging, Tracing, and Metrics

When a tenant reports 'my order failed 20 minutes ago,' you need to find that exact request across API logs, database queries, and background tasks within 30 seconds. Here is how to build observability into a multi-tenant Python backend from day one.

Tin Dang April 10, 2026 9 min read

Layered topographic map with interconnected measurement points showing system-wide visibility across terrain

When a tenant reports “my order failed 20 minutes ago,” you need to find that exact request across API logs, database queries, background tasks, and external API calls within 30 seconds. Without structured logging and tracing, debugging production issues is grep-and-pray. You search for a timestamp, find 400 log lines from 12 tenants, and spend an hour reconstructing what happened. With structured observability, you filter by tenant_id=powells and trace_id=abc123 and see the entire request lifecycle in one query.

This post adds three observability pillars to the ShelfWise architecture: structured logging with structlog, distributed tracing with OpenTelemetry, and Prometheus metrics with per-tenant labels.

The Three Pillars

Pillar	Question It Answers	Tool
Logs	What happened? In what order? With what data?	structlog (JSON)
Traces	How long did each step take? Where is the bottleneck?	OpenTelemetry
Metrics	How many? How fast? What percentage failed?	Prometheus

Logs tell you what happened. Traces tell you how long it took. Metrics tell you how often it happens. You need all three, and they must share the same correlation IDs so you can jump from a metric spike to a trace to the exact log line that explains the failure.

Structured Logging with structlog

The standard library logging module produces unstructured text by default. Parsing "2026-04-10 14:23:01 ERROR: Order failed for tenant powells" at scale requires regex. Structured logging produces JSON where every field is queryable.

Approach	Output	Searchable?
print()	Order failed for tenant powells	No — grep only, no field filtering
logging.error()	ERROR:order_service:Order failed for tenant powells	Barely — level + module, but data baked into message string
structlog	{"event": "order_failed", "tenant_id": "powells", "order_id": 42, "level": "error"}	Yes — filter by any field in any log aggregator

Configuration

import logging
import structlog


def configure_logging(*, json_output: bool = True, log_level: str = "INFO") -> None:
    """Configure structlog for production JSON or development console output."""
    shared_processors: list[structlog.types.Processor] = [
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
    ]

    if json_output:
        renderer = structlog.processors.JSONRenderer()
    else:
        renderer = structlog.dev.ConsoleRenderer()

    structlog.configure(
        processors=[
            *shared_processors,
            structlog.stdlib.ProcessorFormatter.wrap_for_formatter,
        ],
        logger_factory=structlog.stdlib.LoggerFactory(),
        wrapper_class=structlog.stdlib.BoundLogger,
        cache_logger_on_first_use=True,
    )

    formatter = structlog.stdlib.ProcessorFormatter(
        processors=[
            structlog.stdlib.ProcessorFormatter.remove_processors_meta,
            renderer,
        ],
    )

    handler = logging.StreamHandler()
    handler.setFormatter(formatter)

    root_logger = logging.getLogger()
    root_logger.handlers.clear()
    root_logger.addHandler(handler)
    root_logger.setLevel(log_level)

The critical line is structlog.contextvars.merge_contextvars. This processor pulls any values bound via structlog.contextvars.bind_contextvars() and merges them into every log entry automatically. Bind the tenant ID once in middleware, and every log line in that request includes it without passing a logger around.

Middleware: Bind Request Context

import uuid
import time
from collections.abc import Awaitable, Callable

import structlog
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response

from src.core.tenant import TenantContext

logger = structlog.get_logger()


class ObservabilityMiddleware(BaseHTTPMiddleware):
    """Bind correlation IDs and tenant context to every log line."""

    async def dispatch(
        self,
        request: Request,
        call_next: Callable[[Request], Awaitable[Response]],
    ) -> Response:
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        tenant_id = TenantContext.get()

        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            request_id=request_id,
            tenant_id=tenant_id,
            method=request.method,
            path=request.url.path,
        )

        start = time.perf_counter()
        logger.info("request_started")

        try:
            response = await call_next(request)
        except Exception:
            logger.exception("request_failed")
            raise

        duration_ms = round((time.perf_counter() - start) * 1000, 2)
        logger.info(
            "request_completed",
            status_code=response.status_code,
            duration_ms=duration_ms,
        )
        response.headers["X-Request-ID"] = request_id
        return response

Every log line emitted during this request now includes request_id, tenant_id, method, and path without any explicit passing. The service layer, repository, and even third-party libraries that use stdlib logging get these fields automatically.

Distributed Tracing with OpenTelemetry

Logs tell you what happened. Traces tell you where the time went. OpenTelemetry provides auto-instrumentation for FastAPI, SQLAlchemy, httpx, and Redis — meaning you get spans for every HTTP handler, every SQL query, every outbound HTTP call, and every Redis command without writing instrumentation code.

Setup

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def configure_tracing(*, service_name: str, otlp_endpoint: str) -> None:
    """Initialize OpenTelemetry with auto-instrumentation."""
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)

    exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
    provider.add_span_processor(BatchSpanProcessor(exporter))

    trace.set_tracer_provider(provider)

    # Auto-instrument everything — zero manual spans needed for basic coverage
    FastAPIInstrumentor.instrument()
    SQLAlchemyInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()
    RedisInstrumentor().instrument()

Auto-instrumentation gives you a span for every FastAPI route handler, every SQLAlchemy query, every outbound httpx request, and every Redis command. For most requests, this is sufficient to identify the bottleneck without writing a single custom span.

Custom Spans for Business Operations

Auto-instrumentation covers infrastructure. Business operations need explicit spans — you want to see “order processing” and “payment capture” as named spans in your trace, not just a series of SQL queries:

import structlog
from opentelemetry import trace

from src.core.protocols import OrderRepositoryProtocol, PaymentGatewayProtocol
from src.schemas.order import OrderCreate, OrderResponse

logger = structlog.get_logger()
tracer = trace.get_tracer(__name__)


class OrderService:
    def __init__(
        self,
        *,
        order_repo: OrderRepositoryProtocol,
        payment_gateway: PaymentGatewayProtocol,
    ) -> None:
        self._order_repo = order_repo
        self._payment = payment_gateway

    async def create_order(self, data: OrderCreate) -> OrderResponse:
        with tracer.start_as_current_span("order.create") as span:
            span.set_attribute("tenant_id", data.tenant_id)
            span.set_attribute("item_count", len(data.items))

            with tracer.start_as_current_span("order.validate_inventory"):
                await self._validate_inventory(data)

            with tracer.start_as_current_span("order.persist"):
                order = await self._order_repo.create(data)
                span.set_attribute("order_id", str(order.id))

            with tracer.start_as_current_span("order.capture_payment"):
                await self._payment.capture(
                    amount=order.total,
                    tenant_id=data.tenant_id,
                )

            logger.info(
                "order_created",
                order_id=str(order.id),
                total=str(order.total),
                item_count=len(data.items),
            )
            return order

The trace for this request now shows:

The bottleneck is immediately visible: the payment gateway takes 3700ms of a 5200ms request. No guessing, no log parsing.

The ShelfWise Debugging Scenario

Powell’s Books reports that their catalog search is taking 5 seconds. Here is the trace:

The API handler and service layer are fast. The repository call takes 4800ms, and the SQL query itself takes 4795ms. The trace pinpoints the problem to a specific query. You check the query plan, find a missing index on catalog.tenant_id + catalog.title, add it, and the query drops to 12ms.

Without tracing, you would have added caching, increased timeouts, scaled horizontally — all wrong fixes for a missing index.

Prometheus Metrics

Traces are for individual request debugging. Metrics are for aggregate system health: how many requests per second, what is the error rate, what is p99 latency.

Metric Types

from prometheus_client import Counter, Histogram, Gauge

# Request metrics
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    labelnames=["method", "endpoint", "status", "tenant_id"],
)

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "Request duration in seconds",
    labelnames=["method", "endpoint", "tenant_id"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

# Business metrics
ORDERS_CREATED = Counter(
    "orders_created_total",
    "Total orders created",
    labelnames=["tenant_id"],
)

ORDER_REVENUE = Counter(
    "order_revenue_dollars_total",
    "Total revenue in dollars",
    labelnames=["tenant_id"],
)

# Infrastructure metrics
DB_POOL_SIZE = Gauge(
    "db_pool_connections_active",
    "Active database pool connections",
    labelnames=["tenant_id"],
)

Recording Metrics in Middleware

# src/middleware/observability.py (extended)
from src.observability.metrics import REQUEST_COUNT, REQUEST_DURATION


class ObservabilityMiddleware(BaseHTTPMiddleware):
    async def dispatch(
        self,
        request: Request,
        call_next: Callable[[Request], Awaitable[Response]],
    ) -> Response:
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        tenant_id = TenantContext.get()

        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            request_id=request_id,
            tenant_id=tenant_id,
        )

        start = time.perf_counter()
        try:
            response = await call_next(request)
        except Exception:
            REQUEST_COUNT.labels(
                method=request.method,
                endpoint=request.url.path,
                status="500",
                tenant_id=tenant_id,
            ).inc()
            raise

        duration = time.perf_counter() - start

        REQUEST_COUNT.labels(
            method=request.method,
            endpoint=request.url.path,
            status=str(response.status_code),
            tenant_id=tenant_id,
        ).inc()

        REQUEST_DURATION.labels(
            method=request.method,
            endpoint=request.url.path,
            tenant_id=tenant_id,
        ).observe(duration)

        return response

Business Metrics in Services

# src/services/order_service.py (extended)
from src.observability.metrics import ORDERS_CREATED, ORDER_REVENUE


class OrderService:
    async def create_order(self, data: OrderCreate) -> OrderResponse:
        order = await self._persist_order(data)

        ORDERS_CREATED.labels(tenant_id=data.tenant_id).inc()
        ORDER_REVENUE.labels(tenant_id=data.tenant_id).inc(
            float(order.total)
        )

        return order

Business metrics answer questions that infrastructure metrics cannot: “How many orders did Powell’s Books process today?” and “Is revenue per tenant trending down this week?” These are the metrics your product team cares about.

Log Levels as a Production Tool

Log levels are not just severity markers. They are a production control knob. Set the global level to INFO in production and adjust per-tenant when debugging.

# src/observability/logging.py (extended)

# Per-tenant log level overrides stored in config (Part 10)
_tenant_log_overrides: dict[str, str] = {}


def set_tenant_log_level(tenant_id: str, level: str) -> None:
    """Override log level for a specific tenant. Used for live debugging."""
    _tenant_log_overrides[tenant_id] = level.upper()


def clear_tenant_log_level(tenant_id: str) -> None:
    """Remove tenant-specific log level override."""
    _tenant_log_overrides.pop(tenant_id, None)


def tenant_level_filter(
    logger: structlog.types.WrappedLogger,
    method_name: str,
    event_dict: structlog.types.EventDict,
) -> structlog.types.EventDict:
    """Drop log events below the tenant-specific override level."""
    tenant_id = event_dict.get("tenant_id")
    if tenant_id and tenant_id in _tenant_log_overrides:
        override = _tenant_log_overrides[tenant_id]
        level_map = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}
        current_level = level_map.get(event_dict.get("level", "INFO").upper(), 20)
        required_level = level_map.get(override, 20)
        if current_level < required_level:
            raise structlog.DropEvent
    return event_dict

When Powell’s Books reports a problem, you call set_tenant_log_level("powells", "DEBUG") via an admin endpoint. DEBUG logs start flowing for that tenant only — every other tenant stays at INFO. You diagnose the issue and call clear_tenant_log_level("powells"). No restart, no redeployment, no flood of debug logs from 500 other tenants.

Alerting Thresholds

Metrics are useless without alerts. Define thresholds that catch real problems before tenants report them:

Metric	Threshold	Severity	Action
Error rate (5xx)	> 1% over 5 minutes	Critical	Page on-call, check error logs by tenant
p99 latency	> 2 seconds	Warning	Check traces for slow queries or external calls
DB pool utilization	> 80% active connections	Warning	Review pool sizing from Part 11, check for connection leaks
Order creation failures	> 5 per minute per tenant	Critical	Check payment gateway health, inventory service
Background task dead letters	> 0 in 15 minutes	Warning	Review dead letter queue, check task idempotency

Connecting the Three Pillars

The power of observability is the connection between pillars. A metric spike leads to traces, traces lead to logs:

This workflow takes 30 seconds, not 30 minutes. The metric tells you something is wrong. The trace tells you where. The log tells you why. Without all three connected by shared IDs (request_id, trace_id, tenant_id), you are searching blind.

Tying It Together: Application Startup

# src/main.py (extended)
from src.observability.logging import configure_logging
from src.observability.tracing import configure_tracing
from src.middleware.observability import ObservabilityMiddleware

configure_logging(json_output=True, log_level="INFO")
configure_tracing(
    service_name="shelfwise-api",
    otlp_endpoint="http://otel-collector:4317",
)

app = FastAPI(title="ShelfWise API")
app.add_middleware(ObservabilityMiddleware)

Observability is initialized before the application starts accepting requests. Every request from the first one onward has structured logs, distributed traces, and Prometheus metrics.

Key Takeaways

structlog with contextvars binds tenant_id, request_id, and trace_id once in middleware. Every log line in the request includes them automatically — no logger passing required.
OpenTelemetry auto-instrumentation gives you spans for FastAPI, SQLAlchemy, httpx, and Redis with zero manual code. Add custom spans only for business operations.
Prometheus metrics with tenant labels answer “how many” and “how fast” per tenant. Business metrics (orders, revenue) are as important as infrastructure metrics (latency, error rate).
Per-tenant log level overrides let you enable DEBUG logging for one tenant without drowning in logs from 500 others. No restart required.
Connect the three pillars with shared correlation IDs. A metric alert leads to a trace, a trace leads to logs, and the root cause is found in 30 seconds.

Next: Part 13 covers caching architecture — Redis, invalidation strategies, and the critical requirement of tenant-scoped cache keys.

Next in this series

Clean Code Python: Caching Architecture — Redis, Invalidation, and Tenant Isolation