Tutorial

Clean Code Python: Rate Limiting and Noisy Neighbor Prevention

Without rate limiting, one tenant's bulk import consumes all your database connections and API throughput. Here is how to build tiered, sliding-window rate limiting in Python with Redis — protecting every tenant from every other tenant.

Tin Dang April 10, 2026 10 min read

Traffic control tower overlooking multiple lanes of vehicles merging into a regulated flow

Three months into running ShelfWise in production, a free-tier tenant wrote a script that hammered our search endpoint at 1,000 requests per second. Within ninety seconds, the PostgreSQL connection pool was saturated. Within three minutes, every other tenant — including Barnes & Noble on an enterprise plan — was getting 503s. The incident lasted six hours because we had no rate limiting, no tenant isolation, and no way to shed load from a single bad actor without restarting the entire service.

This is the “noisy neighbor” problem, and it is the number-one operational headache for multi-tenant SaaS. Rate limiting is not a nice-to-have. It is the difference between one tenant having a bad day and two hundred tenants having a bad day.

Algorithm Choice: Sliding Window

Before writing code, you need to pick an algorithm. The three contenders each make different tradeoffs.

Algorithm	Accuracy	Memory	Burst Handling	Complexity
Fixed Window	Low — boundary spike allows 2x burst	Minimal — one counter per window	Poor — resets at window edge	Trivial
Sliding Window Log	High — exact request tracking	High — stores every timestamp	Excellent — true rolling window	Moderate
Sliding Window Counter	High — weighted approximation	Low — two counters per window	Good — smooths boundary spikes	Low
Token Bucket	Medium — allows configured bursts	Minimal — token count + timestamp	Configurable — burst size parameter	Moderate

Fixed window is the most common and the most dangerous. If your limit is 100 requests per minute and a client sends 100 requests at 0:59 and 100 more at 1:01, they effectively got 200 requests in two seconds. The sliding window counter eliminates this boundary problem with almost no additional cost — two Redis keys instead of one.

We use the sliding window counter for ShelfWise. It provides near-exact accuracy with minimal memory.

The Sliding Window Counter in Redis

The algorithm works by tracking counts in two adjacent fixed windows and weighting them by how far into the current window we are.

from datetime import UTC, datetime
from dataclasses import dataclass
from typing import Final

import redis.asyncio as redis

WINDOW_SIZE: Final = 60  # seconds


@dataclass(frozen=True, slots=True)
class RateLimitResult:
    allowed: bool
    limit: int
    remaining: int
    reset_at: int  # Unix timestamp
    retry_after: float | None = None  # seconds, only set when denied


class SlidingWindowRateLimiter:
    """Sliding window counter rate limiter backed by Redis."""

    def __init__(self, redis_client: redis.Redis) -> None:
        self._redis = redis_client

    async def check(
        self, key: str, limit: int, window: int = WINDOW_SIZE,
    ) -> RateLimitResult:
        now = datetime.now(UTC).timestamp()
        current_window = int(now // window) * window
        previous_window = current_window - window

        current_key = f"rl:{key}:{current_window}"
        previous_key = f"rl:{key}:{previous_window}"

        async with self._redis.pipeline(transaction=True) as pipe:
            pipe.get(previous_key)
            pipe.incr(current_key)
            pipe.expire(current_key, window * 2)
            results = await pipe.execute()

        previous_count = int(results[0] or 0)
        current_count = int(results[1])  # already incremented

        # Weight previous window by how much of it overlaps
        elapsed = now - current_window
        weight = 1.0 - (elapsed / window)
        weighted_count = int(previous_count * weight) + current_count

        reset_at = current_window + window
        remaining = max(0, limit - weighted_count)
        allowed = weighted_count <= limit

        retry_after: float | None = None
        if not allowed:
            retry_after = reset_at - now

        return RateLimitResult(
            allowed=allowed,
            limit=limit,
            remaining=remaining,
            reset_at=reset_at,
            retry_after=retry_after,
        )

The INCR + EXPIRE pipeline is atomic. Even under high concurrency, every request gets an accurate count. The weighted calculation means a client cannot exploit the window boundary.

Four Tiers of Rate Limiting

A single rate limit is not enough. ShelfWise enforces four tiers, checked in order:

Each tier protects against a different failure mode:

Global: prevents infrastructure overload regardless of source
Tenant: enforces plan-based quotas (ties to the feature flags from Part 10)
Endpoint: protects expensive operations like search and bulk export
User: prevents individual account abuse within a tenant

Tenant Tier Configuration

Rate limits come from the tenant’s plan, loaded from the configuration system we built in Part 10.

from dataclasses import dataclass
from enum import StrEnum


class PlanTier(StrEnum):
    FREE = "free"
    STARTER = "starter"
    BUSINESS = "business"
    ENTERPRISE = "enterprise"


@dataclass(frozen=True, slots=True)
class TenantRateLimits:
    requests_per_minute: int
    search_per_minute: int
    bulk_export_per_hour: int
    api_cost_budget_per_minute: int  # weighted cost units


TIER_LIMITS: dict[PlanTier, TenantRateLimits] = {
    PlanTier.FREE: TenantRateLimits(
        requests_per_minute=60,
        search_per_minute=20,
        bulk_export_per_hour=5,
        api_cost_budget_per_minute=100,
    ),
    PlanTier.STARTER: TenantRateLimits(
        requests_per_minute=300,
        search_per_minute=100,
        bulk_export_per_hour=50,
        api_cost_budget_per_minute=500,
    ),
    PlanTier.BUSINESS: TenantRateLimits(
        requests_per_minute=3_000,
        search_per_minute=500,
        bulk_export_per_hour=200,
        api_cost_budget_per_minute=5_000,
    ),
    PlanTier.ENTERPRISE: TenantRateLimits(
        requests_per_minute=10_000,
        search_per_minute=2_000,
        bulk_export_per_hour=1_000,
        api_cost_budget_per_minute=50_000,
    ),
}

When that free-tier tenant scripted 1,000 requests per second, they would hit the 60/minute limit on their second request. Barnes & Noble, on their enterprise plan, would never notice.

Cost-Based Limiting

Not all endpoints cost the same. A GET /books/{id} is a primary key lookup. A GET /books/search?q=... runs full-text search across millions of rows. Treating them equally means you either over-restrict cheap endpoints or under-restrict expensive ones.

The solution is cost-weighted limiting. Each endpoint declares its cost, and the rate limiter deducts that many units from the tenant’s budget.

from typing import Final

# Cost units per request — higher = more expensive
ENDPOINT_COSTS: dict[str, int] = {
    "GET /api/v1/books/{id}": 1,
    "GET /api/v1/books": 3,
    "GET /api/v1/books/search": 10,
    "POST /api/v1/orders": 5,
    "POST /api/v1/bulk/export": 50,
    "POST /api/v1/bulk/import": 100,
}

DEFAULT_COST: Final = 1

A free-tier tenant with 100 cost units per minute can run 100 book lookups, or 10 searches, or 1 bulk export. The budget forces rational usage without micromanaging every endpoint.

FastAPI Middleware

The middleware ties everything together. It checks all four tiers, sets standard rate limit headers, and implements graceful degradation.

from starlette.middleware.base import BaseHTTPMiddleware, RequestResponseEndpoint
from starlette.requests import Request
from starlette.responses import JSONResponse, Response

from src.core.tenant_context import get_current_tenant
from src.infra.rate_limiter import SlidingWindowRateLimiter, RateLimitResult
from src.core.rate_limit_config import TIER_LIMITS, PlanTier
from src.api.rate_limit_costs import ENDPOINT_COSTS, DEFAULT_COST


class RateLimitMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, limiter: SlidingWindowRateLimiter) -> None:  # noqa: ANN001
        super().__init__(app)
        self._limiter = limiter

    async def dispatch(
        self, request: Request, call_next: RequestResponseEndpoint,
    ) -> Response:
        # Step 1: Global limit — protect infrastructure
        global_result = await self._limiter.check(
            key="global", limit=10_000, window=1,
        )
        if not global_result.allowed:
            return self._reject(global_result)

        tenant = get_current_tenant()
        if tenant is None:
            return await call_next(request)

        tier = PlanTier(tenant.plan_tier)
        limits = TIER_LIMITS[tier]

        # Step 2: Tenant limit — enforce plan quota
        tenant_result = await self._limiter.check(
            key=f"tenant:{tenant.id}",
            limit=limits.requests_per_minute,
            window=60,
        )
        if not tenant_result.allowed:
            return self._reject(tenant_result)

        # Step 3: Endpoint cost limit — protect expensive ops
        route = f"{request.method} {request.url.path}"
        cost = ENDPOINT_COSTS.get(route, DEFAULT_COST)
        cost_result = await self._limiter.check(
            key=f"cost:{tenant.id}",
            limit=limits.api_cost_budget_per_minute,
            window=60,
        )
        # Deduct additional cost units for expensive endpoints
        if cost > 1:
            for _ in range(cost - 1):
                await self._limiter.check(
                    key=f"cost:{tenant.id}",
                    limit=limits.api_cost_budget_per_minute,
                    window=60,
                )
        if not cost_result.allowed:
            return self._reject(cost_result)

        # Step 4: Per-user limit — prevent individual abuse
        user_id = getattr(request.state, "user_id", None)
        if user_id:
            user_result = await self._limiter.check(
                key=f"user:{tenant.id}:{user_id}",
                limit=limits.requests_per_minute // 10,  # 10% of tenant quota
                window=60,
            )
            if not user_result.allowed:
                return self._reject(user_result)

        response = await call_next(request)
        self._add_headers(response, tenant_result)
        return response

    def _reject(self, result: RateLimitResult) -> JSONResponse:
        headers = {
            "X-RateLimit-Limit": str(result.limit),
            "X-RateLimit-Remaining": "0",
            "X-RateLimit-Reset": str(result.reset_at),
        }
        if result.retry_after is not None:
            headers["Retry-After"] = str(int(result.retry_after) + 1)
        return JSONResponse(
            status_code=429,
            content={"detail": "Rate limit exceeded", "retry_after": result.retry_after},
            headers=headers,
        )

    def _add_headers(self, response: Response, result: RateLimitResult) -> None:
        response.headers["X-RateLimit-Limit"] = str(result.limit)
        response.headers["X-RateLimit-Remaining"] = str(result.remaining)
        response.headers["X-RateLimit-Reset"] = str(result.reset_at)

Every response includes X-RateLimit-Remaining so clients can self-throttle before hitting the wall. Every rejection includes Retry-After so well-behaved clients know exactly when to retry.

Graceful Degradation

Hard-cutting at the limit is abrupt. A better pattern is to introduce artificial latency as the client approaches the limit, giving them a “soft warning” before the hard rejection.

# src/infra/rate_limiter.py — add to SlidingWindowRateLimiter

async def check_with_backpressure(
    self, key: str, limit: int, window: int = WINDOW_SIZE,
) -> RateLimitResult:
    """Apply increasing delay as usage approaches the limit."""
    result = await self.check(key, limit, window)

    if result.allowed and result.remaining < limit * 0.2:
        # Below 20% remaining — add artificial delay
        usage_ratio = 1.0 - (result.remaining / limit)
        # Scale from 0ms at 80% usage to 2000ms at 100%
        delay = (usage_ratio - 0.8) * 10.0  # 0.0 to 2.0 seconds
        await asyncio.sleep(delay)

    return result

When a tenant has used 80% of their quota, requests start taking longer. At 95%, each request adds nearly two seconds of latency. The client notices and backs off — or their users complain and they upgrade their plan. Either outcome is better than a hard 429.

Distributed Rate Limiting

With multiple API instances behind a load balancer, each instance needs the same view of the rate limit state. Redis handles this naturally — every instance reads and writes the same keys.

The concern is clock skew. If instance A thinks the current window starts at t=1000 and instance B thinks it starts at t=1001, their weighted calculations will diverge. The fix is straightforward: use Redis server time instead of local time.

async def _server_time(self) -> float:
    """Use Redis server time to avoid clock skew across instances."""
    seconds, microseconds = await self._redis.time()
    return seconds + microseconds / 1_000_000

Replace datetime.now(UTC).timestamp() with await self._server_time() in the check method. Now every instance computes identical window boundaries regardless of local clock drift.

Rate Limit Key Design

Key design determines isolation granularity. ShelfWise uses a hierarchical naming scheme:

# Key patterns for each tier
"rl:global:{window_start}"                          # Global
"rl:tenant:{tenant_id}:{window_start}"              # Per-tenant
"rl:cost:{tenant_id}:{window_start}"                # Cost budget
"rl:endpoint:{tenant_id}:{endpoint}:{window_start}" # Per-endpoint
"rl:user:{tenant_id}:{user_id}:{window_start}"      # Per-user

The tenant ID in every key except global means tenant A’s usage never affects tenant B’s counters. This is the same isolation principle from Part 9 applied to rate limiting.

Testing Rate Limits

Rate limiters are notoriously hard to test because they depend on time. The fix is to inject a clock.

import pytest
from unittest.mock import AsyncMock

from src.infra.rate_limiter import SlidingWindowRateLimiter


@pytest.fixture
def fake_redis() -> AsyncMock:
    mock = AsyncMock()
    pipe = AsyncMock()
    pipe.execute = AsyncMock(return_value=[b"0", 1, True])
    mock.pipeline.return_value.__aenter__ = AsyncMock(return_value=pipe)
    mock.pipeline.return_value.__aexit__ = AsyncMock(return_value=False)
    return mock


@pytest.mark.asyncio
async def test_first_request_allowed(fake_redis: AsyncMock) -> None:
    limiter = SlidingWindowRateLimiter(fake_redis)
    result = await limiter.check("test-key", limit=10)
    assert result.allowed is True
    assert result.remaining == 9


@pytest.mark.asyncio
async def test_limit_exceeded(fake_redis: AsyncMock) -> None:
    pipe = AsyncMock()
    pipe.execute = AsyncMock(return_value=[b"9", 2, True])
    fake_redis.pipeline.return_value.__aenter__ = AsyncMock(return_value=pipe)

    limiter = SlidingWindowRateLimiter(fake_redis)
    result = await limiter.check("test-key", limit=10)
    assert result.allowed is False
    assert result.remaining == 0
    assert result.retry_after is not None

What Happens Without This

During ShelfWise’s early days, a single free-tier tenant ran a bulk import script that sent 50,000 requests in under a minute. The PostgreSQL connection pool (sized for normal traffic) was exhausted. Every await session.execute() across all tenants queued behind this one tenant’s flood. Two hundred tenants experienced timeouts and 503 errors for six hours — because our on-call engineer was asleep and the only mitigation was to manually block the tenant’s API key.

With the rate limiting system described here, that tenant would have been throttled after their 60th request in the first minute. The other 199 tenants would never have noticed.

Checklist

Before shipping rate limiting to production:

Sliding window counter eliminates boundary-spike exploits
Four tiers protect infrastructure, tenants, endpoints, and users independently
Cost-weighted limits prevent expensive endpoints from being abused at the same rate as cheap ones
X-RateLimit-* headers let clients self-throttle
Retry-After header tells rejected clients exactly when to retry
Graceful degradation adds latency before hard rejection
Redis server time prevents clock skew across distributed instances
Rate limit config is driven by tenant tier, not hardcoded

Next in Part 16, we tackle the data integrity problems that rate limiting alone cannot solve: lost updates from concurrent writes, double charges from network retries, and multi-step operations that fail halfway through.

Next in this series

Clean Code Python: Locking, Idempotency, and Sagas