Three months into running ShelfWise in production, a free-tier tenant wrote a script that hammered our search endpoint at 1,000 requests per second. Within ninety seconds, the PostgreSQL connection pool was saturated. Within three minutes, every other tenant — including Barnes & Noble on an enterprise plan — was getting 503s. The incident lasted six hours because we had no rate limiting, no tenant isolation, and no way to shed load from a single bad actor without restarting the entire service.
This is the “noisy neighbor” problem, and it is the number-one operational headache for multi-tenant SaaS. Rate limiting is not a nice-to-have. It is the difference between one tenant having a bad day and two hundred tenants having a bad day.
Algorithm Choice: Sliding Window
Before writing code, you need to pick an algorithm. The three contenders each make different tradeoffs.
| Algorithm | Accuracy | Memory | Burst Handling | Complexity |
|---|---|---|---|---|
| Fixed Window | Low — boundary spike allows 2x burst | Minimal — one counter per window | Poor — resets at window edge | Trivial |
| Sliding Window Log | High — exact request tracking | High — stores every timestamp | Excellent — true rolling window | Moderate |
| Sliding Window Counter | High — weighted approximation | Low — two counters per window | Good — smooths boundary spikes | Low |
| Token Bucket | Medium — allows configured bursts | Minimal — token count + timestamp | Configurable — burst size parameter | Moderate |
Fixed window is the most common and the most dangerous. If your limit is 100 requests per minute and a client sends 100 requests at 0:59 and 100 more at 1:01, they effectively got 200 requests in two seconds. The sliding window counter eliminates this boundary problem with almost no additional cost — two Redis keys instead of one.
We use the sliding window counter for ShelfWise. It provides near-exact accuracy with minimal memory.
The Sliding Window Counter in Redis
The algorithm works by tracking counts in two adjacent fixed windows and weighting them by how far into the current window we are.
from datetime import UTC, datetimefrom dataclasses import dataclassfrom typing import Final
import redis.asyncio as redis
WINDOW_SIZE: Final = 60 # seconds
@dataclass(frozen=True, slots=True)class RateLimitResult: allowed: bool limit: int remaining: int reset_at: int # Unix timestamp retry_after: float | None = None # seconds, only set when denied
class SlidingWindowRateLimiter: """Sliding window counter rate limiter backed by Redis."""
def __init__(self, redis_client: redis.Redis) -> None: self._redis = redis_client
async def check( self, key: str, limit: int, window: int = WINDOW_SIZE, ) -> RateLimitResult: now = datetime.now(UTC).timestamp() current_window = int(now // window) * window previous_window = current_window - window
current_key = f"rl:{key}:{current_window}" previous_key = f"rl:{key}:{previous_window}"
async with self._redis.pipeline(transaction=True) as pipe: pipe.get(previous_key) pipe.incr(current_key) pipe.expire(current_key, window * 2) results = await pipe.execute()
previous_count = int(results[0] or 0) current_count = int(results[1]) # already incremented
# Weight previous window by how much of it overlaps elapsed = now - current_window weight = 1.0 - (elapsed / window) weighted_count = int(previous_count * weight) + current_count
reset_at = current_window + window remaining = max(0, limit - weighted_count) allowed = weighted_count <= limit
retry_after: float | None = None if not allowed: retry_after = reset_at - now
return RateLimitResult( allowed=allowed, limit=limit, remaining=remaining, reset_at=reset_at, retry_after=retry_after, )The INCR + EXPIRE pipeline is atomic. Even under high concurrency, every request gets an accurate count. The weighted calculation means a client cannot exploit the window boundary.
Four Tiers of Rate Limiting
A single rate limit is not enough. ShelfWise enforces four tiers, checked in order:
Each tier protects against a different failure mode:
- Global: prevents infrastructure overload regardless of source
- Tenant: enforces plan-based quotas (ties to the feature flags from Part 10)
- Endpoint: protects expensive operations like search and bulk export
- User: prevents individual account abuse within a tenant
Tenant Tier Configuration
Rate limits come from the tenant’s plan, loaded from the configuration system we built in Part 10.
from dataclasses import dataclassfrom enum import StrEnum
class PlanTier(StrEnum): FREE = "free" STARTER = "starter" BUSINESS = "business" ENTERPRISE = "enterprise"
@dataclass(frozen=True, slots=True)class TenantRateLimits: requests_per_minute: int search_per_minute: int bulk_export_per_hour: int api_cost_budget_per_minute: int # weighted cost units
TIER_LIMITS: dict[PlanTier, TenantRateLimits] = { PlanTier.FREE: TenantRateLimits( requests_per_minute=60, search_per_minute=20, bulk_export_per_hour=5, api_cost_budget_per_minute=100, ), PlanTier.STARTER: TenantRateLimits( requests_per_minute=300, search_per_minute=100, bulk_export_per_hour=50, api_cost_budget_per_minute=500, ), PlanTier.BUSINESS: TenantRateLimits( requests_per_minute=3_000, search_per_minute=500, bulk_export_per_hour=200, api_cost_budget_per_minute=5_000, ), PlanTier.ENTERPRISE: TenantRateLimits( requests_per_minute=10_000, search_per_minute=2_000, bulk_export_per_hour=1_000, api_cost_budget_per_minute=50_000, ),}When that free-tier tenant scripted 1,000 requests per second, they would hit the 60/minute limit on their second request. Barnes & Noble, on their enterprise plan, would never notice.
Cost-Based Limiting
Not all endpoints cost the same. A GET /books/{id} is a primary key lookup. A GET /books/search?q=... runs full-text search across millions of rows. Treating them equally means you either over-restrict cheap endpoints or under-restrict expensive ones.
The solution is cost-weighted limiting. Each endpoint declares its cost, and the rate limiter deducts that many units from the tenant’s budget.
from typing import Final
# Cost units per request — higher = more expensiveENDPOINT_COSTS: dict[str, int] = { "GET /api/v1/books/{id}": 1, "GET /api/v1/books": 3, "GET /api/v1/books/search": 10, "POST /api/v1/orders": 5, "POST /api/v1/bulk/export": 50, "POST /api/v1/bulk/import": 100,}
DEFAULT_COST: Final = 1A free-tier tenant with 100 cost units per minute can run 100 book lookups, or 10 searches, or 1 bulk export. The budget forces rational usage without micromanaging every endpoint.
FastAPI Middleware
The middleware ties everything together. It checks all four tiers, sets standard rate limit headers, and implements graceful degradation.
from starlette.middleware.base import BaseHTTPMiddleware, RequestResponseEndpointfrom starlette.requests import Requestfrom starlette.responses import JSONResponse, Response
from src.core.tenant_context import get_current_tenantfrom src.infra.rate_limiter import SlidingWindowRateLimiter, RateLimitResultfrom src.core.rate_limit_config import TIER_LIMITS, PlanTierfrom src.api.rate_limit_costs import ENDPOINT_COSTS, DEFAULT_COST
class RateLimitMiddleware(BaseHTTPMiddleware): def __init__(self, app, limiter: SlidingWindowRateLimiter) -> None: # noqa: ANN001 super().__init__(app) self._limiter = limiter
async def dispatch( self, request: Request, call_next: RequestResponseEndpoint, ) -> Response: # Step 1: Global limit — protect infrastructure global_result = await self._limiter.check( key="global", limit=10_000, window=1, ) if not global_result.allowed: return self._reject(global_result)
tenant = get_current_tenant() if tenant is None: return await call_next(request)
tier = PlanTier(tenant.plan_tier) limits = TIER_LIMITS[tier]
# Step 2: Tenant limit — enforce plan quota tenant_result = await self._limiter.check( key=f"tenant:{tenant.id}", limit=limits.requests_per_minute, window=60, ) if not tenant_result.allowed: return self._reject(tenant_result)
# Step 3: Endpoint cost limit — protect expensive ops route = f"{request.method} {request.url.path}" cost = ENDPOINT_COSTS.get(route, DEFAULT_COST) cost_result = await self._limiter.check( key=f"cost:{tenant.id}", limit=limits.api_cost_budget_per_minute, window=60, ) # Deduct additional cost units for expensive endpoints if cost > 1: for _ in range(cost - 1): await self._limiter.check( key=f"cost:{tenant.id}", limit=limits.api_cost_budget_per_minute, window=60, ) if not cost_result.allowed: return self._reject(cost_result)
# Step 4: Per-user limit — prevent individual abuse user_id = getattr(request.state, "user_id", None) if user_id: user_result = await self._limiter.check( key=f"user:{tenant.id}:{user_id}", limit=limits.requests_per_minute // 10, # 10% of tenant quota window=60, ) if not user_result.allowed: return self._reject(user_result)
response = await call_next(request) self._add_headers(response, tenant_result) return response
def _reject(self, result: RateLimitResult) -> JSONResponse: headers = { "X-RateLimit-Limit": str(result.limit), "X-RateLimit-Remaining": "0", "X-RateLimit-Reset": str(result.reset_at), } if result.retry_after is not None: headers["Retry-After"] = str(int(result.retry_after) + 1) return JSONResponse( status_code=429, content={"detail": "Rate limit exceeded", "retry_after": result.retry_after}, headers=headers, )
def _add_headers(self, response: Response, result: RateLimitResult) -> None: response.headers["X-RateLimit-Limit"] = str(result.limit) response.headers["X-RateLimit-Remaining"] = str(result.remaining) response.headers["X-RateLimit-Reset"] = str(result.reset_at)Every response includes X-RateLimit-Remaining so clients can self-throttle before hitting the wall. Every rejection includes Retry-After so well-behaved clients know exactly when to retry.
Graceful Degradation
Hard-cutting at the limit is abrupt. A better pattern is to introduce artificial latency as the client approaches the limit, giving them a “soft warning” before the hard rejection.
# src/infra/rate_limiter.py — add to SlidingWindowRateLimiter
async def check_with_backpressure( self, key: str, limit: int, window: int = WINDOW_SIZE,) -> RateLimitResult: """Apply increasing delay as usage approaches the limit.""" result = await self.check(key, limit, window)
if result.allowed and result.remaining < limit * 0.2: # Below 20% remaining — add artificial delay usage_ratio = 1.0 - (result.remaining / limit) # Scale from 0ms at 80% usage to 2000ms at 100% delay = (usage_ratio - 0.8) * 10.0 # 0.0 to 2.0 seconds await asyncio.sleep(delay)
return resultWhen a tenant has used 80% of their quota, requests start taking longer. At 95%, each request adds nearly two seconds of latency. The client notices and backs off — or their users complain and they upgrade their plan. Either outcome is better than a hard 429.
Distributed Rate Limiting
With multiple API instances behind a load balancer, each instance needs the same view of the rate limit state. Redis handles this naturally — every instance reads and writes the same keys.
The concern is clock skew. If instance A thinks the current window starts at t=1000 and instance B thinks it starts at t=1001, their weighted calculations will diverge. The fix is straightforward: use Redis server time instead of local time.
async def _server_time(self) -> float: """Use Redis server time to avoid clock skew across instances.""" seconds, microseconds = await self._redis.time() return seconds + microseconds / 1_000_000Replace datetime.now(UTC).timestamp() with await self._server_time() in the check method. Now every instance computes identical window boundaries regardless of local clock drift.
Rate Limit Key Design
Key design determines isolation granularity. ShelfWise uses a hierarchical naming scheme:
# Key patterns for each tier"rl:global:{window_start}" # Global"rl:tenant:{tenant_id}:{window_start}" # Per-tenant"rl:cost:{tenant_id}:{window_start}" # Cost budget"rl:endpoint:{tenant_id}:{endpoint}:{window_start}" # Per-endpoint"rl:user:{tenant_id}:{user_id}:{window_start}" # Per-userThe tenant ID in every key except global means tenant A’s usage never affects tenant B’s counters. This is the same isolation principle from Part 9 applied to rate limiting.
Testing Rate Limits
Rate limiters are notoriously hard to test because they depend on time. The fix is to inject a clock.
import pytestfrom unittest.mock import AsyncMock
from src.infra.rate_limiter import SlidingWindowRateLimiter
@pytest.fixturedef fake_redis() -> AsyncMock: mock = AsyncMock() pipe = AsyncMock() pipe.execute = AsyncMock(return_value=[b"0", 1, True]) mock.pipeline.return_value.__aenter__ = AsyncMock(return_value=pipe) mock.pipeline.return_value.__aexit__ = AsyncMock(return_value=False) return mock
@pytest.mark.asyncioasync def test_first_request_allowed(fake_redis: AsyncMock) -> None: limiter = SlidingWindowRateLimiter(fake_redis) result = await limiter.check("test-key", limit=10) assert result.allowed is True assert result.remaining == 9
@pytest.mark.asyncioasync def test_limit_exceeded(fake_redis: AsyncMock) -> None: pipe = AsyncMock() pipe.execute = AsyncMock(return_value=[b"9", 2, True]) fake_redis.pipeline.return_value.__aenter__ = AsyncMock(return_value=pipe)
limiter = SlidingWindowRateLimiter(fake_redis) result = await limiter.check("test-key", limit=10) assert result.allowed is False assert result.remaining == 0 assert result.retry_after is not NoneWhat Happens Without This
During ShelfWise’s early days, a single free-tier tenant ran a bulk import script that sent 50,000 requests in under a minute. The PostgreSQL connection pool (sized for normal traffic) was exhausted. Every await session.execute() across all tenants queued behind this one tenant’s flood. Two hundred tenants experienced timeouts and 503 errors for six hours — because our on-call engineer was asleep and the only mitigation was to manually block the tenant’s API key.
With the rate limiting system described here, that tenant would have been throttled after their 60th request in the first minute. The other 199 tenants would never have noticed.
Checklist
Before shipping rate limiting to production:
- Sliding window counter eliminates boundary-spike exploits
- Four tiers protect infrastructure, tenants, endpoints, and users independently
- Cost-weighted limits prevent expensive endpoints from being abused at the same rate as cheap ones
X-RateLimit-*headers let clients self-throttleRetry-Afterheader tells rejected clients exactly when to retry- Graceful degradation adds latency before hard rejection
- Redis server time prevents clock skew across distributed instances
- Rate limit config is driven by tenant tier, not hardcoded
Next in Part 16, we tackle the data integrity problems that rate limiting alone cannot solve: lost updates from concurrent writes, double charges from network retries, and multi-step operations that fail halfway through.