Your ShelfWise platform has 200 tenants. Tenant A needs a rate limit of 1,000 requests per minute. Tenant B, your largest customer, negotiated 10,000. Tenant C is beta-testing your new AI recommendations feature. Tenant D must never see it — their contract explicitly excludes experimental features.
If your configuration lives in environment variables and your feature flags are if statements, you are about to learn why that does not scale. Every change requires a redeploy. Every tenant-specific behavior is a hardcoded branch. Every secret is a plaintext string that one docker inspect exposes.
This post builds a configuration system that solves all three problems: hierarchical config with per-tenant overrides, a feature flag system with percentage-based rollouts, and encrypted secrets storage — all hot-reloadable without restarting the application.
The Configuration Hierarchy
Configuration in a multi-tenant system is not a flat key-value store. It is a hierarchy with clear precedence rules, from most to least specific: tenant override, environment variable, application default.
A request for “rate_limit” resolves like this: check the tenant override first. If the tenant has a custom rate limit, use it. If not, check the environment variable. If that is not set either, fall back to the application default. The highest-specificity value wins.
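The precedence rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `resolve`, `APP_DEFAULTS`, and the upper-cased environment variable name are hypothetical and are not part of the ShelfWise code.

```python
import os
from typing import Mapping, Optional

# Hypothetical application defaults, used only for this illustration
APP_DEFAULTS = {"rate_limit": "100"}


def resolve(
    key: str,
    tenant_overrides: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
    defaults: Mapping[str, str] = APP_DEFAULTS,
) -> str:
    """Return the most specific value available for `key`."""
    env = os.environ if env is None else env
    # 1. A tenant override always wins
    if key in tenant_overrides:
        return tenant_overrides[key]
    # 2. Otherwise use the environment variable, if set
    env_key = key.upper()
    if env_key in env:
        return env[env_key]
    # 3. Otherwise fall back to the application default
    return defaults[key]


# Tenant B negotiated 10,000; the environment sets 500 for everyone else
print(resolve("rate_limit", {"rate_limit": "10000"}, env={"RATE_LIMIT": "500"}))
print(resolve("rate_limit", {}, env={"RATE_LIMIT": "500"}))
print(resolve("rate_limit", {}, env={}))
```

The three calls print the tenant value, the environment value, and the default, in that order.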
Base Settings with Pydantic
The application defaults and environment variable layer use Pydantic’s BaseSettings. This gives you type-safe configuration with validation, nested models, and automatic environment variable parsing — no os.getenv() calls scattered across the codebase.
```python
from pydantic import Field, SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict


class DatabaseSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="DB_")

    url: SecretStr = Field(
        default=SecretStr("postgresql+asyncpg://localhost:5432/shelfwise"),
    )
    pool_size: int = Field(default=20, ge=1, le=100)
    max_overflow: int = Field(default=30, ge=0, le=200)
    pool_timeout: int = Field(default=30, ge=5)
    echo: bool = False


class RateLimitSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="RATE_LIMIT_")

    requests_per_minute: int = Field(default=100, ge=1)
    burst_size: int = Field(default=20, ge=1)


class AISettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="AI_")

    model_name: str = "gpt-4o-mini"
    max_tokens: int = Field(default=2048, ge=1, le=16384)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    enabled: bool = False


class AppSettings(BaseSettings):
    """Root settings, assembled from nested models.

    Environment variables use double-underscore for nesting, keyed by the
    field name on this class:
    DATABASE__POOL_SIZE=30, RATE_LIMIT__REQUESTS_PER_MINUTE=500
    """

    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        env_nested_delimiter="__",
        case_sensitive=False,
    )

    app_name: str = "ShelfWise"
    debug: bool = False
    database: DatabaseSettings = DatabaseSettings()
    rate_limit: RateLimitSettings = RateLimitSettings()
    ai: AISettings = AISettings()


# Singleton: loaded once at startup
settings = AppSettings()
```

The env_nested_delimiter="__" is the key to avoiding a flat namespace of dozens of SHELFWISE_DB_POOL_SIZE, SHELFWISE_RATE_LIMIT_RPM variables. Instead, the nested model structure maps directly to environment variables. The segment before the delimiter is the field name on AppSettings, so DATABASE__POOL_SIZE=30 sets settings.database.pool_size and AI__ENABLED=true sets settings.ai.enabled.
Per-Tenant Configuration Overrides
The base settings handle application-wide defaults. Per-tenant overrides live in the database, cached in Redis, and resolved at request time.
```python
from uuid import UUID

from sqlalchemy import UniqueConstraint
from sqlalchemy.orm import Mapped, mapped_column

from src.db.base import Base


class TenantConfig(Base):
    """Per-tenant configuration overrides.

    Keys follow dot notation: "rate_limit.requests_per_minute", "ai.enabled".
    Values are stored as strings and coerced to the target type at
    resolution time.
    """

    __tablename__ = "tenant_configs"
    __table_args__ = (
        UniqueConstraint("tenant_id", "key", name="uq_tenant_config"),
    )

    id: Mapped[int] = mapped_column(primary_key=True)
    tenant_id: Mapped[UUID] = mapped_column(nullable=False, index=True)
    key: Mapped[str] = mapped_column(nullable=False)
    value: Mapped[str] = mapped_column(nullable=False)
```

The resolver walks the hierarchy:
```python
from typing import TypeVar
from uuid import UUID

import redis.asyncio as redis

from src.core.config import settings

T = TypeVar("T", str, int, float, bool)

# Redis client, initialized at startup
_redis: redis.Redis | None = None


def init_redis(client: redis.Redis) -> None:
    global _redis
    _redis = client


def _cache_key(tenant_id: UUID, key: str) -> str:
    return f"tenant_config:{tenant_id}:{key}"


async def get_tenant_config(
    tenant_id: UUID,
    key: str,
    target_type: type[T],
    session=None,
) -> T:
    """Resolve a config value through the hierarchy.

    1. Check Redis cache for tenant override
    2. If cache miss, check database
    3. If no tenant override, fall back to app settings

    Step 2 only runs when a session is passed. Without one, a cache
    miss falls straight through to the application defaults.
    """
    # Layer 1: Redis cache
    if _redis is not None:
        cached = await _redis.get(_cache_key(tenant_id, key))
        if cached is not None:
            # Assumes the client returns bytes (decode_responses=False)
            return _coerce(cached.decode(), target_type)

    # Layer 2: Database lookup
    if session is not None:
        from sqlalchemy import select
        from src.models.tenant_config import TenantConfig

        result = await session.execute(
            select(TenantConfig.value)
            .where(TenantConfig.tenant_id == tenant_id)
            .where(TenantConfig.key == key)
        )
        row = result.scalar_one_or_none()
        if row is not None:
            # Populate cache for next time (TTL: 5 minutes)
            if _redis is not None:
                await _redis.set(_cache_key(tenant_id, key), row, ex=300)
            return _coerce(row, target_type)

    # Layer 3: Application defaults
    return _resolve_default(key, target_type)


def _coerce(value: str, target_type: type[T]) -> T:
    """Coerce a string value to the target type."""
    if target_type is bool:
        # bool("false") is True, so compare against known truthy strings
        return target_type(value.lower() in ("true", "1", "yes"))
    return target_type(value)


def _resolve_default(key: str, target_type: type[T]) -> T:
    """Walk dot-notation key through the settings object.

    Raises AttributeError if the key has no application default.
    """
    obj = settings
    for part in key.split("."):
        obj = getattr(obj, part)
    return target_type(obj)
```

Usage in a service:
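```python
from sqlalchemy.ext.asyncio import AsyncSession

from src.core.config_resolver import get_tenant_config
from src.core.context import get_current_tenant_id


async def get_rate_limit(session: AsyncSession | None = None) -> int:
    """Get the rate limit for the current tenant.

    Pass a session so that a cache miss can fall back to the database;
    without one, a cold cache resolves to the application default.
    """
    return await get_tenant_config(
        tenant_id=get_current_tenant_id(),
        key="rate_limit.requests_per_minute",
        target_type=int,
        session=session,
    )
```

Cache Invalidation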
When an admin updates a tenant’s configuration, the cache must be invalidated immediately. Not on the next TTL expiry — immediately. A tenant paying for a higher rate limit should not wait 5 minutes for it to take effect.
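The difference between waiting for the TTL and deleting the key can be seen in a small in-memory sketch. `TTLCache` here is a hypothetical stand-in for Redis with an injectable clock, and `demo` is only an illustration of the read path, not ShelfWise code.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry has outlived its TTL: drop it and report a miss
            del self._entries[key]
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def delete(self, key):
        self._entries.pop(key, None)


def demo():
    now = [0.0]  # fake clock so the example does not need to sleep
    cache = TTLCache(300, clock=lambda: now[0])
    database = {"rate_limit": "1000"}

    def read():
        value = cache.get("rate_limit")
        if value is None:
            value = database["rate_limit"]
            cache.set("rate_limit", value)
        return value

    first = read()                    # miss: loads "1000" and caches it
    database["rate_limit"] = "10000"  # admin upgrades the tenant
    stale = read()                    # still served from the cache
    cache.delete("rate_limit")        # explicit invalidation
    fresh = read()                    # miss again: loads the new value
    return first, stale, fresh


print(demo())
```

Without the `delete` call, the stale value would keep being served until the fake clock advanced past 300 seconds.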
```python
from uuid import UUID

from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.asyncio import AsyncSession

from src.core import config_resolver
from src.models.tenant_config import TenantConfig


async def update_tenant_config(
    session: AsyncSession,
    tenant_id: UUID,
    key: str,
    value: str,
) -> None:
    """Update a tenant config override and invalidate the cache."""
    stmt = (
        insert(TenantConfig)
        .values(tenant_id=tenant_id, key=key, value=value)
        .on_conflict_do_update(
            constraint="uq_tenant_config",
            set_={"value": value},
        )
    )
    await session.execute(stmt)
    await session.commit()

    # Invalidate cache immediately. Read the client from the module at call
    # time: `from ... import _redis` would capture the initial None and never
    # see the client set later by init_redis().
    client = config_resolver._redis
    if client is not None:
        await client.delete(config_resolver._cache_key(tenant_id, key))
```

Feature Flags with Percentage Rollouts
Feature flags are configuration with behavior. A flag is either on or off for a given tenant, and the rollout percentage controls how many tenants see it.
ShelfWise is rolling out “AI book recommendations.” The rollout plan: 5% of tenants this week, 25% next week, 100% after validation. No deploys required.
```python
import hashlib
from enum import StrEnum
from uuid import UUID

from src.core.config_resolver import get_tenant_config


class FeatureFlag(StrEnum):
    """Feature flags for ShelfWise.

    Add new flags here. Remove dead flags aggressively: a flag that is
    100% rolled out is not a flag, it is a feature.
    """

    AI_RECOMMENDATIONS = "ai_recommendations"
    BULK_IMPORT_V2 = "bulk_import_v2"
    ADVANCED_ANALYTICS = "advanced_analytics"
    PUBLISHER_PORTAL = "publisher_portal"


async def is_enabled(flag: FeatureFlag, tenant_id: UUID, session=None) -> bool:
    """Check if a feature flag is enabled for a specific tenant.

    Resolution order:
    1. Explicit tenant override ("enabled" or "disabled"), which always wins
    2. Percentage-based rollout: deterministic hash of tenant_id + flag

    Pass a session so that a cache miss can fall back to the database.
    """
    # Check for explicit override
    override = await _get_flag_override(flag, tenant_id, session)
    if override is not None:
        return override

    # Percentage-based rollout
    rollout_pct = await _get_rollout_percentage(flag, session)
    if rollout_pct <= 0:
        return False
    if rollout_pct >= 100:
        return True

    # Deterministic: same tenant + flag always gets the same result
    # until the percentage changes
    return _hash_into_bucket(tenant_id, flag) < rollout_pct


def _hash_into_bucket(tenant_id: UUID, flag: FeatureFlag) -> int:
    """Hash tenant_id + flag name into a 0-99 bucket.

    Deterministic: the same tenant always lands in the same bucket.
    Uniform: tenants are evenly distributed across buckets.
    """
    raw = f"{tenant_id}:{flag.value}".encode()
    digest = hashlib.sha256(raw).hexdigest()
    return int(digest[:8], 16) % 100


async def _get_flag_override(
    flag: FeatureFlag, tenant_id: UUID, session=None
) -> bool | None:
    """Check if tenant has an explicit flag override."""
    try:
        value = await get_tenant_config(
            tenant_id=tenant_id,
            key=f"feature.{flag.value}.override",
            target_type=str,
            session=session,
        )
        if value == "enabled":
            return True
        if value == "disabled":
            return False
    except (AttributeError, KeyError):
        # No override stored and no application default for this key
        pass
    return None


async def _get_rollout_percentage(flag: FeatureFlag, session=None) -> int:
    """Get the current rollout percentage for a flag (0-100)."""
    try:
        return await get_tenant_config(
            tenant_id=UUID(int=0),  # global config, not tenant-specific
            key=f"feature.{flag.value}.rollout_pct",
            target_type=int,
            session=session,
        )
    except (AttributeError, KeyError):
        return 0  # Not rolled out by default
```

Usage in a route handler:
```python
from fastapi import APIRouter, HTTPException

from src.core.context import get_current_tenant_id
from src.core.feature_flags import FeatureFlag, is_enabled

router = APIRouter()


@router.get("/books/{book_id}/recommendations")
async def get_recommendations(book_id: int):
    tenant_id = get_current_tenant_id()

    if not await is_enabled(FeatureFlag.AI_RECOMMENDATIONS, tenant_id):
        raise HTTPException(
            status_code=404,
            detail="This feature is not available for your plan.",
        )

    # generate_recommendations is defined elsewhere (not shown)
    return await generate_recommendations(book_id, tenant_id)
```

The rollout lifecycle for AI recommendations:
```python
# Week 1: Enable for 5% of tenants
await update_tenant_config(
    session, UUID(int=0), "feature.ai_recommendations.rollout_pct", "5"
)

# Week 2: Ramp to 25%
await update_tenant_config(
    session, UUID(int=0), "feature.ai_recommendations.rollout_pct", "25"
)

# Week 3: Full rollout
await update_tenant_config(
    session, UUID(int=0), "feature.ai_recommendations.rollout_pct", "100"
)

# Force-enable for a specific beta tenant regardless of percentage
await update_tenant_config(
    session, beta_tenant_id, "feature.ai_recommendations.override", "enabled"
)

# Force-disable for a tenant whose contract excludes experimental features
await update_tenant_config(
    session, restricted_tenant_id, "feature.ai_recommendations.override", "disabled"
)
```

Encrypted Secrets Storage
Some tenants bring their own API keys — a publisher who wants ShelfWise to push inventory updates to their existing ERP system. These secrets cannot be stored in plaintext. A database dump, a log leak, or a support engineer with read access should never expose a tenant’s API key.
```python
from cryptography.fernet import Fernet
from pydantic import SecretStr

# The encryption key is the ONE secret that lives in environment variables.
# Everything else is encrypted with it and stored in the database.
_fernet: Fernet | None = None


def init_encryption(key: str) -> None:
    """Initialize encryption with the master key from environment."""
    global _fernet
    _fernet = Fernet(key.encode())


def encrypt_secret(plaintext: str) -> str:
    """Encrypt a secret for database storage."""
    if _fernet is None:
        raise RuntimeError(
            "Encryption not initialized. Call init_encryption() at startup."
        )
    return _fernet.encrypt(plaintext.encode()).decode()


def decrypt_secret(ciphertext: str) -> SecretStr:
    """Decrypt a secret from database storage.

    Returns SecretStr to prevent accidental logging of the plaintext.
    """
    if _fernet is None:
        raise RuntimeError(
            "Encryption not initialized. Call init_encryption() at startup."
        )
    plaintext = _fernet.decrypt(ciphertext.encode()).decode()
    return SecretStr(plaintext)
```

| Approach | Env Vars Only | DB + Fernet Encryption | External Vault (HashiCorp/AWS) |
|---|---|---|---|
| Per-tenant secrets | No — one value per key | Yes — stored per tenant_id | Yes — path-per-tenant |
| Rotation | Redeploy required | Update DB row, invalidate cache | Vault handles rotation |
| Access control | Anyone with shell access | App-level (decrypt requires master key) | Policy-based (IAM, ACL) |
| Audit trail | None | DB audit log | Full audit log |
| Operational cost | Zero | Low (one master key to manage) | High (Vault cluster) |
| Best for | Single-tenant, < 10 secrets | Multi-tenant, 10-500 secrets | Regulated, 500+ secrets |
ShelfWise uses Fernet encryption for tenant secrets. The master encryption key is the only secret in environment variables. When a tenant provides their ERP API key, it is encrypted before it hits the database:
```python
from uuid import UUID

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

from src.core.secrets import decrypt_secret, encrypt_secret
from src.models.tenant_config import TenantConfig


async def store_tenant_api_key(
    session: AsyncSession,
    tenant_id: UUID,
    service_name: str,
    api_key: str,
) -> None:
    """Store an encrypted API key for a tenant's external integration."""
    encrypted = encrypt_secret(api_key)
    # Plain insert: storing a second key for the same tenant and service
    # violates uq_tenant_config, so rotation needs an update or upsert.
    config = TenantConfig(
        tenant_id=tenant_id,
        key=f"secret.{service_name}.api_key",
        value=encrypted,
    )
    session.add(config)
    await session.commit()


async def get_tenant_api_key(
    session: AsyncSession,
    tenant_id: UUID,
    service_name: str,
) -> str:
    """Retrieve and decrypt a tenant's API key."""
    result = await session.execute(
        select(TenantConfig.value)
        .where(TenantConfig.tenant_id == tenant_id)
        .where(TenantConfig.key == f"secret.{service_name}.api_key")
    )
    encrypted = result.scalar_one_or_none()
    if encrypted is None:
        raise ValueError(f"No API key configured for {service_name}")

    return decrypt_secret(encrypted).get_secret_value()
```

Hot-Reloading Configuration Without Restarts
The Redis cache with TTL handles most hot-reload needs — update the database, invalidate the cache, and the next request picks up the new value. But for application-wide settings (like changing the default rate limit), you need a mechanism that does not require cycling every application instance.
```python
import logging

import redis.asyncio as redis

logger = logging.getLogger("shelfwise.config")


async def watch_config_changes(redis_client: redis.Redis) -> None:
    """Subscribe to config change events via Redis Pub/Sub.

    When an admin updates a config value, the admin service publishes
    a message to "config:changed". Every app instance receives it and
    invalidates its local cache.
    """
    pubsub = redis_client.pubsub()
    await pubsub.subscribe("config:changed")

    async for message in pubsub.listen():
        if message["type"] == "message":
            key = message["data"].decode()
            logger.info("Config changed: %s, clearing local cache", key)
            # Clear any in-process caches (e.g., functools.lru_cache)
            _clear_local_caches(key)


async def publish_config_change(redis_client: redis.Redis, key: str) -> None:
    """Notify all app instances that a config key has changed."""
    await redis_client.publish("config:changed", key)


def _clear_local_caches(key: str) -> None:
    """Clear in-process caches for the changed key.

    In practice, you maintain a registry of cached config accessors
    and clear only the affected ones.
    """
    # Implementation depends on your caching strategy.
    # For functools.cache, call the .cache_clear() method on the
    # decorated function.
    pass
```

The publish step is added to the admin config update:
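```python
# Updated update_tenant_config
# Assumes `from src.core import config_resolver` and
# `from src.core.config_watcher import publish_config_change`.
async def update_tenant_config(
    session: AsyncSession,
    tenant_id: UUID,
    key: str,
    value: str,
) -> None:
    # ... insert/update logic ...
    await session.commit()

    # Read the client from the module at call time so that the value set
    # by init_redis() is seen, not the initial None.
    client = config_resolver._redis
    if client is not None:
        await client.delete(config_resolver._cache_key(tenant_id, key))
        await publish_config_change(client, f"{tenant_id}:{key}")
```

Start the watcher as a background task at application startup: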
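```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager, suppress

import redis.asyncio as redis
from fastapi import FastAPI

from src.core.config_resolver import init_redis
from src.core.config_watcher import watch_config_changes

# Placeholder connection (localhost defaults); point this at your Redis deployment.
redis_client = redis.Redis()


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    init_redis(redis_client)
    # Start config watcher in background
    watcher_task = asyncio.create_task(watch_config_changes(redis_client))
    yield
    # Stop the watcher and wait for the cancellation to finish
    watcher_task.cancel()
    with suppress(asyncio.CancelledError):
        await watcher_task


app = FastAPI(lifespan=lifespan)
```

Putting It All Together: The ShelfWise Rollout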
Here is the complete flow for rolling out AI recommendations to ShelfWise tenants:
- Deploy with AI__ENABLED=false. The feature code ships but is dormant.
- Set the rollout percentage to 5% via the admin API: feature.ai_recommendations.rollout_pct = "5". No deploy needed. Roughly 10 of your 200 tenants now see the recommendations endpoint; the exact count depends on how the tenant IDs hash.
- Force-enable for your beta partner: feature.ai_recommendations.override = "enabled" for that specific tenant. They see it regardless of the percentage.
- Monitor error rates and latency for the 5% cohort.
- Ramp to 25%, then 50%, then 100%. Each step is a single config update, zero deploys.
- Remove the feature flag. Once at 100% and stable, delete the flag check from the code. A flag that is always on is dead code.
The entire rollout happens through configuration changes. The application binary does not change. The deployment pipeline does not run. The risk surface is a single database row, not a full release cycle.
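The ramp is safe because raising the percentage only ever adds tenants to the cohort. The sketch below checks that property using the same bucketing scheme as _hash_into_bucket. The tenant IDs are made up for the demo (generated with uuid5), and `bucket` and `cohort` are illustrative helper names.

```python
import hashlib
from uuid import NAMESPACE_DNS, UUID, uuid5

FLAG = "ai_recommendations"


def bucket(tenant_id: UUID, flag: str) -> int:
    """Map tenant + flag to a stable bucket in 0..99."""
    raw = f"{tenant_id}:{flag}".encode()
    return int(hashlib.sha256(raw).hexdigest()[:8], 16) % 100


def cohort(tenants, flag, rollout_pct):
    """Return the set of tenants whose bucket falls under the percentage."""
    return {t for t in tenants if bucket(t, flag) < rollout_pct}


# Hypothetical tenant IDs, generated deterministically for the demo
TENANTS = [uuid5(NAMESPACE_DNS, f"tenant-{i}.example") for i in range(200)]

week1 = cohort(TENANTS, FLAG, 5)
week2 = cohort(TENANTS, FLAG, 25)
week3 = cohort(TENANTS, FLAG, 100)

# Every tenant enabled at 5% is still enabled at 25% and at 100%
assert week1 <= week2 <= week3
print(len(week1), len(week2), len(week3))
```

The printed cohort sizes will be close to 5% and 25% of 200 but not necessarily exact, because bucket assignment depends on the hash of each tenant ID.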
Key Takeaways
- Pydantic BaseSettings with nested models gives you typed, validated configuration with environment variable parsing. No more os.getenv() with manual type coercion and missing-key crashes.
- Three-layer resolution (tenant override, environment, application default) handles both global defaults and per-tenant customization without code branches.
- Redis caching with explicit invalidation prevents database lookups on every request. TTL provides a safety net; explicit deletion provides immediacy.
- Deterministic percentage rollouts use SHA-256 hashing so the same tenant always gets the same result. No flickering, no lost features during ramp-up.
- Fernet encryption for tenant secrets keeps API keys encrypted at rest. The master key is the only secret in environment variables; everything else is encrypted in the database.
- Redis Pub/Sub for hot-reload notifies all application instances when configuration changes. No restarts, no deploy cycles.
Configuration, feature flags, and secrets are the control plane of a multi-tenant SaaS. Get them right and you can ship features to individual tenants, adjust behavior without deploys, and store third-party credentials without a security incident waiting to happen. Next: the database layer needs to handle the load that 200 tenants generate — connection pooling, read replicas, and circuit breakers.