Your API is a shared resource. Every request consumes CPU cycles, database connections, memory, bandwidth. Without limits, one client—malicious or just poorly written—can consume everything, leaving nothing for anyone else.
Rate limiting solves this. It's the bouncer at the door, deciding how many requests each client can make and what happens when they exceed their allowance.
The Problem Rate Limiting Solves
Imagine a developer writes a script to fetch data from your API. They forget to add a delay between requests. The script runs in a tight loop, making thousands of requests per second. Within moments, your database connection pool is exhausted. Your API servers are overwhelmed. Every other user of your API sees timeouts and errors.
This isn't malice—it's a bug. But the effect is indistinguishable from a denial-of-service attack.
Rate limiting stops this before it starts. The first hundred requests succeed. Request 101 gets rejected with a clear message: "Slow down. Try again in 60 seconds."
The misconfigured script fails fast. The developer fixes their bug. Your other users never notice.
Why Every Public API Needs This
Infrastructure protection is the obvious reason. A single bad actor—or bad script—shouldn't be able to take down your service.
Fairness is the deeper reason. Your API has finite capacity. If one user can consume unlimited resources, they're taking from everyone else. Rate limiting ensures the pie gets divided fairly.
Cost control matters when your API calls external services, runs expensive queries, or triggers cloud functions. Without limits, a runaway client could generate a massive bill before anyone notices.
A business model emerges naturally. Free tier: 100 requests per hour. Paid tier: 10,000 requests per hour. Enterprise: let's talk. Rate limiting isn't just protection—it's product.
The Algorithms
The simplest approach has an obvious flaw. The flaw led to better approaches. Understanding this progression helps you choose the right one.
Fixed Window
Allow N requests per time window. When the window resets, the counter resets to zero.
100 requests per minute. At 10:00:00, the counter is 0. By 10:00:45, the user has made 95 requests. They make 5 more. They've hit the limit. They wait until 10:01:00, and the counter resets.
Simple. Easy to understand. Easy to implement.
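A minimal in-memory sketch of the idea (single process; the key names and limits are illustrative):

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60
LIMIT = 100

# One counter per (client, window) pair; fine for a single process.
counters = defaultdict(int)

def allow_request(client_id: str) -> bool:
    """Fixed window: count requests in the current minute-aligned window."""
    window = int(time.time() // WINDOW_SECONDS)  # e.g. 10:00:00-10:00:59 share one window
    key = (client_id, window)
    counters[key] += 1
    return counters[key] <= LIMIT
```

Old window keys linger in memory here; a real implementation would expire them.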
Here's the flaw: A client makes 100 requests at 10:00:59. The window resets at 10:01:00. They make 100 more requests at 10:01:01. That's 200 requests in 2 seconds—despite a "100 per minute" limit.
The boundary problem. Your limit isn't really what you thought it was.
Sliding Window
Instead of fixed boundaries, look back exactly one window from the current moment.
At 10:00:45, count all requests from 09:59:45 to 10:00:45. If fewer than 100, allow the request.
No boundary problem. "100 per minute" truly means 100 in any 60-second period.
The cost: you need to store a timestamp for every request from every client. Memory adds up.
A clever approximation: keep counters for the current and previous windows, then calculate a weighted average based on how far through the current window you are. Less memory, nearly as accurate.
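A sketch of that approximation, assuming one counter per client per window:

```python
import time
from collections import defaultdict

WINDOW = 60   # seconds
LIMIT = 100

# Per-client request counts, keyed by window index.
counters = defaultdict(lambda: defaultdict(int))

def allow_request(client_id: str) -> bool:
    now = time.time()
    window = int(now // WINDOW)
    elapsed_fraction = (now % WINDOW) / WINDOW   # how far into the current window we are

    current = counters[client_id][window]
    previous = counters[client_id][window - 1]

    # Weight the previous window by the portion of it still inside the sliding window.
    estimated = previous * (1 - elapsed_fraction) + current
    if estimated >= LIMIT:
        return False

    counters[client_id][window] += 1
    return True
```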
Token Bucket
Visualize a bucket that holds tokens. The bucket has a maximum capacity and refills at a constant rate. Each request consumes one token. If the bucket is empty, the request is rejected.
Capacity: 100 tokens. Refill rate: 10 tokens per second.
A user makes 50 requests quickly, consuming 50 tokens. They wait 5 seconds. The bucket gains 50 tokens. They can burst again.
This is the algorithm most production APIs use. It allows bursts (up to bucket capacity) while enforcing an average rate over time. Bursty traffic is normal. Token bucket accommodates it gracefully.
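A minimal single-process sketch, with the capacity and refill rate above as illustrative parameters:

```python
import time

class TokenBucket:
    def __init__(self, capacity: float = 100, refill_rate: float = 10):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```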
Leaky Bucket
Requests enter a queue. The queue drains at a fixed rate. If the queue fills, new requests are rejected.
The output rate is perfectly smooth regardless of how bursty the input is. But requests wait in the queue, adding latency. For most APIs, the latency isn't worth the smoothness.
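For completeness, a rough single-process sketch (draining here just discards queue entries; a real server would hand them to workers at that point):

```python
import time
from collections import deque

class LeakyBucket:
    """Pending requests queue up and drain at a fixed rate; overflow is rejected."""
    def __init__(self, capacity: int = 100, drain_rate: float = 10.0):
        self.capacity = capacity        # maximum queue length
        self.drain_rate = drain_rate    # requests processed per second
        self.queue = deque()
        self.last_drain = time.monotonic()

    def submit(self, request) -> bool:
        # Remove the requests the fixed drain rate has had time to process.
        now = time.monotonic()
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()    # these would be dispatched for processing
            self.last_drain = now
        if len(self.queue) >= self.capacity:
            return False                # bucket is full: reject
        self.queue.append(request)
        return True
```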
What Gets Limited
Limits can apply at different scopes, often in combination:
Per API key: Each application has its own limit. One app's bug doesn't affect others.
Per user: When multiple apps access your API on behalf of different users, each user gets their own limit.
Per IP address: Protects against attacks even before authentication happens.
Per endpoint: Expensive operations get tighter limits than cheap ones. A complex search might allow 10 requests per minute while a simple lookup allows 1000.
Global: A ceiling on total system load regardless of individual limits.
A typical configuration: 1000 requests per API key per hour AND 10 requests per second per IP address. Belt and suspenders.
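One way such a combined policy might be expressed (the structure, names, and numbers are illustrative, not from any particular gateway):

```python
from typing import Callable

# Hypothetical combined policy: a request must pass every applicable check.
LIMITS = {
    "per_api_key":  {"limit": 1000, "window_seconds": 3600},
    "per_ip":       {"limit": 10,   "window_seconds": 1},
    "per_endpoint": {
        "/search": {"limit": 10,   "window_seconds": 60},
        "/lookup": {"limit": 1000, "window_seconds": 60},
    },
}

def is_allowed(api_key: str, ip: str, endpoint: str,
               check: Callable[[str, int, int], bool]) -> bool:
    """check(counter_key, limit, window_seconds) is whichever limiter you use."""
    rules = [
        (f"key:{api_key}", LIMITS["per_api_key"]),
        (f"ip:{ip}",       LIMITS["per_ip"]),
    ]
    if endpoint in LIMITS["per_endpoint"]:
        rules.append((f"endpoint:{endpoint}:{api_key}", LIMITS["per_endpoint"][endpoint]))
    return all(check(key, r["limit"], r["window_seconds"]) for key, r in rules)
```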
Telling Clients What's Happening
Clients need to know the limits, how much they've used, and when limits reset. HTTP headers are the standard:
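A common convention (these exact names aren't formally standardized; a newer IETF draft uses RateLimit-* equivalents):

```http
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1718305200
```

Here the limit is 1000 per window, 742 requests remain, and the reset value is a Unix timestamp; some APIs send seconds-until-reset instead.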
When limits are exceeded, return status code 429 (Too Many Requests) with a Retry-After header:
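For example (the status code and Retry-After header are the standard parts; the body format is up to you):

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit of 1000 requests per hour exceeded. Try again in 60 seconds."
}
```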
Good error messages are a feature. The developer debugging at 2 AM will thank you.
Implementation
For a single server, in-memory counters work. Fast and simple. But if you have multiple servers, each has its own counters—a client could exceed limits by hitting different servers.
For multi-server deployments, Redis is the standard answer. All servers share the same counters. Atomic increment operations prevent race conditions. It's fast enough that the rate limiting check adds negligible latency.
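A sketch of what that looks like with the redis-py client, using a shared fixed-window counter (key naming and limits are illustrative):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)

LIMIT = 1000
WINDOW_SECONDS = 3600

def allow_request(api_key: str) -> bool:
    """Shared counter: every app server increments the same Redis key."""
    window = int(time.time() // WINDOW_SECONDS)
    key = f"ratelimit:{api_key}:{window}"

    # INCR is atomic, so concurrent servers can't admit more than LIMIT requests;
    # EXPIRE cleans the key up once the window has passed.
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, WINDOW_SECONDS * 2)
    count, _ = pipe.execute()
    return count <= LIMIT
```

This inherits the fixed-window boundary caveat described earlier; the same shared-counter approach works for the sliding window approximation or a token bucket, with a little more Redis scripting.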
API gateways (Kong, AWS API Gateway, Cloudflare) handle rate limiting before requests reach your application. This is efficient—rejected requests never touch your servers—and centralizes the logic.
Getting the Numbers Right
Too strict: legitimate users hit limits during normal use. Frustration. Support tickets. Churn.
Too lenient: limits don't protect anything. A misbehaving client can still cause problems.
Start by analyzing actual usage. What does the 95th percentile user look like? Set limits comfortably above normal usage but below the point where one user could cause problems.
Then monitor. High rates of 429 errors from many clients suggest limits are too tight. High rates from one client suggest they need to fix their code—or upgrade their plan.
Advanced Patterns
Adaptive limits adjust based on system load. When the system is healthy, be generous. When it's struggling, tighten up. This maximizes throughput while maintaining protection.
Cost-based limits charge different amounts for different operations. A simple lookup costs 1 point. A complex aggregation costs 10. Users get a point budget rather than a request count. This is fairer—why should a lightweight request count the same as an expensive one?
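A sketch of the idea, reusing the token bucket shape but spending a variable cost per call (the point costs are illustrative):

```python
import time

class PointBudget:
    """A token bucket where each operation spends a different number of points."""
    def __init__(self, capacity: float = 100.0, refill_rate: float = 1.0):
        self.capacity = capacity            # maximum points a client can bank
        self.refill_rate = refill_rate      # points restored per second
        self.points = capacity
        self.last_refill = time.monotonic()

    def spend(self, cost: float) -> bool:
        now = time.monotonic()
        self.points = min(self.capacity,
                          self.points + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.points >= cost:
            self.points -= cost
            return True
        return False

# Illustrative costs: a lookup is cheap, an aggregation is expensive.
COSTS = {"/lookup": 1, "/search": 3, "/aggregate": 10}
```

Checking a request becomes budget.spend(COSTS.get(endpoint, 1)) instead of a flat one-token deduction.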
Reputation-based limits give established users more headroom than new accounts. Trust is earned.
Rate Limiting vs. Throttling
Rate limiting rejects requests that exceed limits. The client gets a 429 and decides what to do.
Throttling queues requests and processes them at the allowed rate. The client waits.
Most APIs use rate limiting. Throttling can cause timeouts and is harder for clients to reason about. Better to fail fast with a clear message than to hang indefinitely.