When a tenant reports “my order failed 20 minutes ago,” you need to find that exact request across API logs, database queries, background tasks, and external API calls within 30 seconds. Without structured logging and tracing, debugging production issues is grep-and-pray. You search for a timestamp, find 400 log lines from 12 tenants, and spend an hour reconstructing what happened. With structured observability, you filter by tenant_id=powells and trace_id=abc123 and see the entire request lifecycle in one query.
This post adds three observability pillars to the ShelfWise architecture: structured logging with structlog, distributed tracing with OpenTelemetry, and Prometheus metrics with per-tenant labels.
The Three Pillars
| Pillar | Question It Answers | Tool |
|---|---|---|
| Logs | What happened? In what order? With what data? | structlog (JSON) |
| Traces | How long did each step take? Where is the bottleneck? | OpenTelemetry |
| Metrics | How many? How fast? What percentage failed? | Prometheus |
Logs tell you what happened. Traces tell you how long it took. Metrics tell you how often it happens. You need all three, and they must share the same correlation IDs so you can jump from a metric spike to a trace to the exact log line that explains the failure.
Structured Logging with structlog
The standard library logging module produces unstructured text by default. Parsing "2026-04-10 14:23:01 ERROR: Order failed for tenant powells" at scale requires regex. Structured logging produces JSON where every field is queryable.
| Approach | Output | Searchable? |
|---|---|---|
| print() | Order failed for tenant powells | No — grep only, no field filtering |
| logging.error() | ERROR:order_service:Order failed for tenant powells | Barely — level + module, but data baked into message string |
| structlog | {"event": "order_failed", "tenant_id": "powells", "order_id": 42, "level": "error"} | Yes — filter by any field in any log aggregator |
Configuration
import loggingimport structlog
def configure_logging(*, json_output: bool = True, log_level: str = "INFO") -> None: """Configure structlog for production JSON or development console output.""" shared_processors: list[structlog.types.Processor] = [ structlog.contextvars.merge_contextvars, structlog.processors.add_log_level, structlog.processors.TimeStamper(fmt="iso"), structlog.processors.StackInfoRenderer(), structlog.processors.format_exc_info, ]
if json_output: renderer = structlog.processors.JSONRenderer() else: renderer = structlog.dev.ConsoleRenderer()
structlog.configure( processors=[ *shared_processors, structlog.stdlib.ProcessorFormatter.wrap_for_formatter, ], logger_factory=structlog.stdlib.LoggerFactory(), wrapper_class=structlog.stdlib.BoundLogger, cache_logger_on_first_use=True, )
formatter = structlog.stdlib.ProcessorFormatter( processors=[ structlog.stdlib.ProcessorFormatter.remove_processors_meta, renderer, ], )
handler = logging.StreamHandler() handler.setFormatter(formatter)
root_logger = logging.getLogger() root_logger.handlers.clear() root_logger.addHandler(handler) root_logger.setLevel(log_level)The critical line is structlog.contextvars.merge_contextvars. This processor pulls any values bound via structlog.contextvars.bind_contextvars() and merges them into every log entry automatically. Bind the tenant ID once in middleware, and every log line in that request includes it without passing a logger around.
Middleware: Bind Request Context
import uuidimport timefrom collections.abc import Awaitable, Callable
import structlogfrom starlette.middleware.base import BaseHTTPMiddlewarefrom starlette.requests import Requestfrom starlette.responses import Response
from src.core.tenant import TenantContext
logger = structlog.get_logger()
class ObservabilityMiddleware(BaseHTTPMiddleware): """Bind correlation IDs and tenant context to every log line."""
async def dispatch( self, request: Request, call_next: Callable[[Request], Awaitable[Response]], ) -> Response: request_id = request.headers.get("X-Request-ID", str(uuid.uuid4())) tenant_id = TenantContext.get()
structlog.contextvars.clear_contextvars() structlog.contextvars.bind_contextvars( request_id=request_id, tenant_id=tenant_id, method=request.method, path=request.url.path, )
start = time.perf_counter() logger.info("request_started")
try: response = await call_next(request) except Exception: logger.exception("request_failed") raise
duration_ms = round((time.perf_counter() - start) * 1000, 2) logger.info( "request_completed", status_code=response.status_code, duration_ms=duration_ms, ) response.headers["X-Request-ID"] = request_id return responseEvery log line emitted during this request now includes request_id, tenant_id, method, and path without any explicit passing. The service layer, repository, and even third-party libraries that use stdlib logging get these fields automatically.
Distributed Tracing with OpenTelemetry
Logs tell you what happened. Traces tell you where the time went. OpenTelemetry provides auto-instrumentation for FastAPI, SQLAlchemy, httpx, and Redis — meaning you get spans for every HTTP handler, every SQL query, every outbound HTTP call, and every Redis command without writing instrumentation code.
Setup
from opentelemetry import tracefrom opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporterfrom opentelemetry.instrumentation.fastapi import FastAPIInstrumentorfrom opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentorfrom opentelemetry.instrumentation.httpx import HTTPXClientInstrumentorfrom opentelemetry.instrumentation.redis import RedisInstrumentorfrom opentelemetry.sdk.resources import Resourcefrom opentelemetry.sdk.trace import TracerProviderfrom opentelemetry.sdk.trace.export import BatchSpanProcessor
def configure_tracing(*, service_name: str, otlp_endpoint: str) -> None: """Initialize OpenTelemetry with auto-instrumentation.""" resource = Resource.create({"service.name": service_name}) provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint=otlp_endpoint) provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Auto-instrument everything — zero manual spans needed for basic coverage FastAPIInstrumentor.instrument() SQLAlchemyInstrumentor().instrument() HTTPXClientInstrumentor().instrument() RedisInstrumentor().instrument()Auto-instrumentation gives you a span for every FastAPI route handler, every SQLAlchemy query, every outbound httpx request, and every Redis command. For most requests, this is sufficient to identify the bottleneck without writing a single custom span.
Custom Spans for Business Operations
Auto-instrumentation covers infrastructure. Business operations need explicit spans — you want to see “order processing” and “payment capture” as named spans in your trace, not just a series of SQL queries:
import structlogfrom opentelemetry import trace
from src.core.protocols import OrderRepositoryProtocol, PaymentGatewayProtocolfrom src.schemas.order import OrderCreate, OrderResponse
logger = structlog.get_logger()tracer = trace.get_tracer(__name__)
class OrderService: def __init__( self, *, order_repo: OrderRepositoryProtocol, payment_gateway: PaymentGatewayProtocol, ) -> None: self._order_repo = order_repo self._payment = payment_gateway
async def create_order(self, data: OrderCreate) -> OrderResponse: with tracer.start_as_current_span("order.create") as span: span.set_attribute("tenant_id", data.tenant_id) span.set_attribute("item_count", len(data.items))
with tracer.start_as_current_span("order.validate_inventory"): await self._validate_inventory(data)
with tracer.start_as_current_span("order.persist"): order = await self._order_repo.create(data) span.set_attribute("order_id", str(order.id))
with tracer.start_as_current_span("order.capture_payment"): await self._payment.capture( amount=order.total, tenant_id=data.tenant_id, )
logger.info( "order_created", order_id=str(order.id), total=str(order.total), item_count=len(data.items), ) return orderThe trace for this request now shows:
The bottleneck is immediately visible: the payment gateway takes 3700ms of a 5200ms request. No guessing, no log parsing.
The ShelfWise Debugging Scenario
Powell’s Books reports that their catalog search is taking 5 seconds. Here is the trace:
The API handler and service layer are fast. The repository call takes 4800ms, and the SQL query itself takes 4795ms. The trace pinpoints the problem to a specific query. You check the query plan, find a missing index on catalog.tenant_id + catalog.title, add it, and the query drops to 12ms.
Without tracing, you would have added caching, increased timeouts, scaled horizontally — all wrong fixes for a missing index.
Prometheus Metrics
Traces are for individual request debugging. Metrics are for aggregate system health: how many requests per second, what is the error rate, what is p99 latency.
Metric Types
from prometheus_client import Counter, Histogram, Gauge
# Request metricsREQUEST_COUNT = Counter( "http_requests_total", "Total HTTP requests", labelnames=["method", "endpoint", "status", "tenant_id"],)
REQUEST_DURATION = Histogram( "http_request_duration_seconds", "Request duration in seconds", labelnames=["method", "endpoint", "tenant_id"], buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],)
# Business metricsORDERS_CREATED = Counter( "orders_created_total", "Total orders created", labelnames=["tenant_id"],)
ORDER_REVENUE = Counter( "order_revenue_dollars_total", "Total revenue in dollars", labelnames=["tenant_id"],)
# Infrastructure metricsDB_POOL_SIZE = Gauge( "db_pool_connections_active", "Active database pool connections", labelnames=["tenant_id"],)Recording Metrics in Middleware
# src/middleware/observability.py (extended)from src.observability.metrics import REQUEST_COUNT, REQUEST_DURATION
class ObservabilityMiddleware(BaseHTTPMiddleware): async def dispatch( self, request: Request, call_next: Callable[[Request], Awaitable[Response]], ) -> Response: request_id = request.headers.get("X-Request-ID", str(uuid.uuid4())) tenant_id = TenantContext.get()
structlog.contextvars.clear_contextvars() structlog.contextvars.bind_contextvars( request_id=request_id, tenant_id=tenant_id, )
start = time.perf_counter() try: response = await call_next(request) except Exception: REQUEST_COUNT.labels( method=request.method, endpoint=request.url.path, status="500", tenant_id=tenant_id, ).inc() raise
duration = time.perf_counter() - start
REQUEST_COUNT.labels( method=request.method, endpoint=request.url.path, status=str(response.status_code), tenant_id=tenant_id, ).inc()
REQUEST_DURATION.labels( method=request.method, endpoint=request.url.path, tenant_id=tenant_id, ).observe(duration)
return responseBusiness Metrics in Services
# src/services/order_service.py (extended)from src.observability.metrics import ORDERS_CREATED, ORDER_REVENUE
class OrderService: async def create_order(self, data: OrderCreate) -> OrderResponse: order = await self._persist_order(data)
ORDERS_CREATED.labels(tenant_id=data.tenant_id).inc() ORDER_REVENUE.labels(tenant_id=data.tenant_id).inc( float(order.total) )
return orderBusiness metrics answer questions that infrastructure metrics cannot: “How many orders did Powell’s Books process today?” and “Is revenue per tenant trending down this week?” These are the metrics your product team cares about.
Log Levels as a Production Tool
Log levels are not just severity markers. They are a production control knob. Set the global level to INFO in production and adjust per-tenant when debugging.
# src/observability/logging.py (extended)
# Per-tenant log level overrides stored in config (Part 10)_tenant_log_overrides: dict[str, str] = {}
def set_tenant_log_level(tenant_id: str, level: str) -> None: """Override log level for a specific tenant. Used for live debugging.""" _tenant_log_overrides[tenant_id] = level.upper()
def clear_tenant_log_level(tenant_id: str) -> None: """Remove tenant-specific log level override.""" _tenant_log_overrides.pop(tenant_id, None)
def tenant_level_filter( logger: structlog.types.WrappedLogger, method_name: str, event_dict: structlog.types.EventDict,) -> structlog.types.EventDict: """Drop log events below the tenant-specific override level.""" tenant_id = event_dict.get("tenant_id") if tenant_id and tenant_id in _tenant_log_overrides: override = _tenant_log_overrides[tenant_id] level_map = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40} current_level = level_map.get(event_dict.get("level", "INFO").upper(), 20) required_level = level_map.get(override, 20) if current_level < required_level: raise structlog.DropEvent return event_dictWhen Powell’s Books reports a problem, you call set_tenant_log_level("powells", "DEBUG") via an admin endpoint. DEBUG logs start flowing for that tenant only — every other tenant stays at INFO. You diagnose the issue and call clear_tenant_log_level("powells"). No restart, no redeployment, no flood of debug logs from 500 other tenants.
Alerting Thresholds
Metrics are useless without alerts. Define thresholds that catch real problems before tenants report them:
| Metric | Threshold | Severity | Action |
|---|---|---|---|
| Error rate (5xx) | > 1% over 5 minutes | Critical | Page on-call, check error logs by tenant |
| p99 latency | > 2 seconds | Warning | Check traces for slow queries or external calls |
| DB pool utilization | > 80% active connections | Warning | Review pool sizing from Part 11, check for connection leaks |
| Order creation failures | > 5 per minute per tenant | Critical | Check payment gateway health, inventory service |
| Background task dead letters | > 0 in 15 minutes | Warning | Review dead letter queue, check task idempotency |
Connecting the Three Pillars
The power of observability is the connection between pillars. A metric spike leads to traces, traces lead to logs:
This workflow takes 30 seconds, not 30 minutes. The metric tells you something is wrong. The trace tells you where. The log tells you why. Without all three connected by shared IDs (request_id, trace_id, tenant_id), you are searching blind.
Tying It Together: Application Startup
# src/main.py (extended)from src.observability.logging import configure_loggingfrom src.observability.tracing import configure_tracingfrom src.middleware.observability import ObservabilityMiddleware
configure_logging(json_output=True, log_level="INFO")configure_tracing( service_name="shelfwise-api", otlp_endpoint="http://otel-collector:4317",)
app = FastAPI(title="ShelfWise API")app.add_middleware(ObservabilityMiddleware)Observability is initialized before the application starts accepting requests. Every request from the first one onward has structured logs, distributed traces, and Prometheus metrics.
Key Takeaways
- structlog with contextvars binds
tenant_id,request_id, andtrace_idonce in middleware. Every log line in the request includes them automatically — no logger passing required. - OpenTelemetry auto-instrumentation gives you spans for FastAPI, SQLAlchemy, httpx, and Redis with zero manual code. Add custom spans only for business operations.
- Prometheus metrics with tenant labels answer “how many” and “how fast” per tenant. Business metrics (orders, revenue) are as important as infrastructure metrics (latency, error rate).
- Per-tenant log level overrides let you enable DEBUG logging for one tenant without drowning in logs from 500 others. No restart required.
- Connect the three pillars with shared correlation IDs. A metric alert leads to a trace, a trace leads to logs, and the root cause is found in 30 seconds.
Next: Part 13 covers caching architecture — Redis, invalidation strategies, and the critical requirement of tenant-scoped cache keys.