Running DNS at ISP scale: what breaks first

Posted 2026-06-09 · 8 min read · performance

A resolver that handles a few hundred queries a second for an office is a fundamentally different machine than one carrying tens of thousands of queries a second for an ISP, even when the code is identical. Nothing in the DNS protocol changes. A query is still a query: a name, a type, an answer, a TTL. What changes is everything around the query — the logging, the accounting, the dashboards, the kernel that delivers the packets — and those are the parts that fall over first, long before the resolver logic itself runs out of road.

This is the uncomfortable lesson of scaling a DNS service: the resolver is rarely the bottleneck. The bottleneck is the instrumentation you bolted on to understand the resolver. A single SQL aggregation that returns in 18 milliseconds on a quiet network can take 15 minutes — or simply time out — once the table behind it holds a few million rows from the last hour. The query log that was a useful audit trail at 500 qps becomes a firehose writing gigabytes per hour. Below we walk through what breaks, in roughly the order it breaks, and the design moves that keep a high-volume resolver responsive instead of melting under its own telemetry.

The short version. At ISP scale, per-query overhead dominates, live metrics must come from in-memory counters rather than SQL, dashboards must be precomputed rather than calculated on request, and the cache hit-rate becomes the single most important number you can move. Storage stays sane only through aggressive aggregation and retention limits.

What actually changes between 500 and 30,000 qps

The intuitive model — "30,000 qps is just 60× the work of 500 qps" — is wrong in a useful way. Most of the resolver's per-query cost is fixed: parse the packet, check the cache, consult the filter chain, build the response, hand it back to the kernel. At low volume that fixed cost is invisible because the machine is mostly idle. At high volume the same fixed cost, multiplied across every query, is suddenly the entire CPU budget. Work that you could afford to do "per query" at 500 qps — a regex, a string allocation, a map lookup, a log line — becomes a tax you pay 30,000 times a second.
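One way to make that tax concrete is to turn the query rate into a per-query time budget. The sketch below is plain arithmetic; the function name is ours, and it assumes the machine does nothing except answer DNS.

```go
package main

import "time"

// PerQueryBudget returns the CPU time available to each query if the
// whole machine did nothing but answer DNS at the given rate.
// Hypothetical helper for illustration only.
func PerQueryBudget(qps int64, cores int) time.Duration {
	if qps <= 0 || cores <= 0 {
		return 0
	}
	// One core supplies one second of CPU time per wall-clock second.
	return time.Duration(int64(time.Second) * int64(cores) / qps)
}
```

On a single core that works out to 2 ms per query at 500 qps and roughly 33 µs at 30,000 qps. A log line that costs, say, 10 µs (an illustrative figure, not a measurement) is half a percent of the budget in the first case and nearly a third of it in the second.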

Three things break in a predictable sequence:

Stage	Symptom	Root cause
Logging	Disk I/O saturates, log files balloon, flushes stall the hot path	One structured log line per query is a firehose at five figures of qps
Live metrics	The "queries per second" card on the dashboard lags or times out	Counting recent rows with SQL scans a table that is still being written
Aggregates	Top-domains / top-clients panels spin forever	GROUP BY over millions of rows can't finish inside an HTTP timeout

Notice what is not on that list: resolving the names. The DNS engine keeps up fine. It is the accounting layer — the part whose job is to tell you the engine is keeping up — that collapses. That inversion is the whole story of operating DNS at scale, and it dictates the rest of the design.

Logging becomes a firehose

At 30,000 qps, a single line of query log per request is on the order of two and a half billion lines a day. Even at a modest hundred-odd bytes per line that is hundreds of gigabytes of write traffic competing with the resolver for the same disk and the same page cache. The instinct to "just log everything" — perfectly reasonable for a small deployment where the log is your forensic record — becomes an active denial-of-service against your own service.
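The figures above are easy to check. A back-of-the-envelope sketch, where the helper names are ours and the 100-byte line size used below stands in for "hundred-odd bytes":

```go
package main

// secondsPerDay is the number of seconds in a 24-hour day.
const secondsPerDay = 24 * 60 * 60

// LogLinesPerDay returns how many lines one-line-per-query logging
// produces in a day at a steady query rate.
func LogLinesPerDay(qps int64) int64 {
	return qps * secondsPerDay
}

// LogBytesPerDay returns the daily write volume in bytes for the given
// query rate and average line size.
func LogBytesPerDay(qps, bytesPerLine int64) int64 {
	return LogLinesPerDay(qps) * bytesPerLine
}
```

At 30,000 qps this gives 2,592,000,000 lines a day, and at 100 bytes per line about 259 GB of writes a day, or a little under 11 GB an hour.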

The discipline that keeps logging survivable is to treat the on-disk query log as a bounded, rotating buffer, not an archive. A few principles:

Cap the buffer. Keep a fixed number of recent entries in memory and a bounded file on disk that rotates by size, so log volume can never grow without limit no matter how busy the network gets.
Batch the flushes. Flush to disk on an interval, in groups, rather than synchronously per query. The hot path should never block on a write.
Log exceptions, not normal operation. Background workers that process the log should be silent when everything is fine and only emit when something is actually wrong. "Processed 8,000 entries" written once a minute is noise; "failed to parse 12 entries" is signal.
Tail, don't re-read. Consumers of the log (real-time views, security analysis) follow new lines as they arrive and handle rotation, instead of repeatedly scanning the whole file.

The goal is that the act of observing traffic costs a small, constant amount regardless of how much traffic there is. Anything that scales linearly with query volume on the hot path is a future outage.
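A minimal sketch of the "cap the buffer, batch the flushes" idea, assuming a fixed-size in-memory ring that a background worker drains on a timer. The type and method names are hypothetical, and writing the drained batch to a size-rotated file is left to the caller.

```go
package main

import "sync"

// QueryLog is a fixed-capacity ring of log lines. When it is full the
// oldest line is overwritten, so memory use never grows with traffic.
type QueryLog struct {
	mu        sync.Mutex
	ring      []string
	next      int   // slot the next Append writes to
	unflushed int   // lines appended since the last Drain, capped at len(ring)
	dropped   int64 // lines overwritten before they were flushed
}

// NewQueryLog creates a ring holding at most capacity lines.
func NewQueryLog(capacity int) *QueryLog {
	if capacity < 1 {
		capacity = 1
	}
	return &QueryLog{ring: make([]string, capacity)}
}

// Append records a line on the hot path. It never touches the disk.
func (l *QueryLog) Append(line string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.ring[l.next] = line
	l.next = (l.next + 1) % len(l.ring)
	if l.unflushed < len(l.ring) {
		l.unflushed++
	} else {
		l.dropped++ // an unflushed line was just overwritten
	}
}

// Drain returns the unflushed lines oldest-first and marks them flushed.
// A background worker calls this on an interval and writes the batch out.
func (l *QueryLog) Drain() []string {
	l.mu.Lock()
	defer l.mu.Unlock()
	batch := make([]string, 0, l.unflushed)
	start := (l.next - l.unflushed + len(l.ring)) % len(l.ring)
	for i := 0; i < l.unflushed; i++ {
		batch = append(batch, l.ring[(start+i)%len(l.ring)])
	}
	l.unflushed = 0
	return batch
}

// Dropped reports how many lines were lost to overwrites.
func (l *QueryLog) Dropped() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.dropped
}
```

The hot path pays for one short mutex hold and one slice write, whatever the disk is doing. Under a burst the ring loses its oldest lines instead of stalling the resolver, and the dropped counter is exactly the sort of exception worth logging.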

Live metrics: count in memory, not in SQL

The dashboard cards every operator stares at — queries per second, percent served from cache, percent of bandwidth saved, blocked ratio — feel like database questions. They are not, and treating them as database questions is the second thing to break.

Consider "percent of queries served from cache." The naive implementation runs a query like SELECT COUNT(*) ... WHERE cached = 1 against the table holding the current hour's traffic. At low volume that's instant. At high volume that table is the single most contended object in the system — it is being inserted into thousands of times a second — and now you're asking it to scan itself to answer a question whose answer you already knew at insert time.

The fix is to stop asking. As each query is recorded, increment a set of lock-free atomic counters in memory:

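// conceptual: a handful of atomic adds per query, no locks, no SQL
// (counters are assumed to be sync/atomic Int64 values)
totalQueries.Add(1)                    // every query
totalSize.Add(answerBytes)
if servedFromCache { cachedQueries.Add(1) }
if blocked { blockedQueries.Add(1) }   // blocked == served locally
if servedFromCache || blocked {
	savedSize.Add(answerBytes)         // neither went upstream
}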

Reading the dashboard then becomes arithmetic on a handful of integers — effectively free, and instant, no matter the query rate. The counters are seeded from the database once at startup so a service restart doesn't reset the displayed totals to zero, and from then on they live entirely in memory.
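Fleshed out, the counter set might look like the sketch below. It assumes Go 1.19 or later for atomic.Int64; the struct, field, and method names are illustrative rather than taken from any particular resolver.

```go
package main

import "sync/atomic"

// Counters holds the live dashboard totals entirely in memory.
type Counters struct {
	total, cached, blocked atomic.Int64
	totalBytes, savedBytes atomic.Int64
}

// Seed restores totals read from the database once at startup, so a
// restart does not reset the displayed numbers to zero.
func (c *Counters) Seed(total, cached, blocked, totalBytes, savedBytes int64) {
	c.total.Store(total)
	c.cached.Store(cached)
	c.blocked.Store(blocked)
	c.totalBytes.Store(totalBytes)
	c.savedBytes.Store(savedBytes)
}

// Record is called once per answered query on the hot path.
func (c *Counters) Record(fromCache, blocked bool, answerBytes int64) {
	c.total.Add(1)
	c.totalBytes.Add(answerBytes)
	if fromCache {
		c.cached.Add(1)
	}
	if blocked {
		c.blocked.Add(1)
	}
	if fromCache || blocked {
		// Both were answered locally, so both count as bandwidth saved.
		c.savedBytes.Add(answerBytes)
	}
}

// percent returns 100*part/whole, or 0 when whole is 0.
func percent(part, whole int64) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

// CacheHitPercent is the "served from cache" dashboard card.
func (c *Counters) CacheHitPercent() float64 {
	return percent(c.cached.Load(), c.total.Load())
}

// SavedPercent is the "bandwidth saved" dashboard card.
func (c *Counters) SavedPercent() float64 {
	return percent(c.savedBytes.Load(), c.totalBytes.Load())
}
```

Each card reads two integers that are loaded separately, so a value can be off by a query or two while traffic is flowing. For a dashboard that is a reasonable trade for never touching the hot table.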

Some metrics genuinely need recent SQL — "average response time over the last 60 seconds," for example, is a real query. The trick is to make that query touch only the last minute of data, which is small and well-indexed, and never to fall back to scanning the whole hour. A bounded window query is fine; an unbounded aggregate on the hot table is not.

There's a subtle correctness point worth stating: a blocked query and a cache hit are both queries the resolver answered without going upstream. For "bandwidth saved" accounting they belong in the same bucket. Getting these definitions right in the counter increments matters more than it looks, because once you stop running SQL there's no second source of truth to reconcile against.

Dashboards: precompute, don't compute on request

Live counters solve the simple scalar cards. They do nothing for the panels that are inherently aggregations: top blocked domains, most active clients, top destinations. Those need GROUP BY, and GROUP BY over the current hour's table — millions of rows, actively growing — is exactly the query that won't return before the browser gives up.

You cannot make that query fast on demand. So don't run it on demand. The pattern that works is a background worker that computes the expensive aggregates on a fixed cadence and stores the results, so the API serves a precomputed snapshot instead of triggering a live scan:

A background job runs once a minute (or whatever cadence the data freshness allows).
It computes the rolling 24-hour view from already-aggregated hourly summaries plus the current, not-yet-aggregated hour.
It writes the result into a small cache — both on disk for durability and in memory for speed.
The dashboard API reads that cache and returns instantly. No request ever triggers a heavy join.

This turns a 15-minute on-request query into a sub-second cache read, and it bounds cost: the expensive work runs at a known, fixed rate set by you, completely decoupled from how many people have the dashboard open or how fast queries are arriving. A thousand operators refreshing the page cost the same as one.
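The worker-plus-snapshot pattern can be sketched in a few dozen lines. This version keeps only the in-memory half (persisting the snapshot to disk is omitted), assumes Go 1.19 or later for atomic.Pointer, and uses hypothetical type names throughout.

```go
package main

import (
	"context"
	"sync/atomic"
	"time"
)

// DomainCount is one row of a "top domains" panel.
type DomainCount struct {
	Domain string
	Count  int64
}

// Snapshot is the precomputed payload the dashboard API serves.
type Snapshot struct {
	TopBlocked []DomainCount
	ComputedAt time.Time
}

// SnapshotCache holds the most recent snapshot behind an atomic pointer,
// so readers never wait on the worker.
type SnapshotCache struct {
	current atomic.Pointer[Snapshot]
}

// Get returns the latest snapshot, or nil before the first refresh.
func (c *SnapshotCache) Get() *Snapshot {
	return c.current.Load()
}

// Refresh runs the expensive computation once and publishes the result.
// If the computation fails, the previous snapshot stays in place.
func (c *SnapshotCache) Refresh(compute func() (*Snapshot, error)) error {
	snap, err := compute()
	if err != nil {
		return err
	}
	c.current.Store(snap)
	return nil
}

// Run refreshes immediately and then on a fixed interval until ctx is
// cancelled. The cost is set by `every`, not by how many readers exist.
func (c *SnapshotCache) Run(ctx context.Context, every time.Duration, compute func() (*Snapshot, error)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		_ = c.Refresh(compute) // a real worker would log the error
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```

An API handler would call Get and serialize whatever it returns. The heavy GROUP BY lives inside compute and runs only from the worker goroutine.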

Defensive engineering. Even with precomputation, build escape hatches. If the current-hour table crosses a sanity threshold of rows, skip the live portion of an aggregate and serve only the already-summarized data, flagging the response so the UI can say "current hour pending" rather than hanging. Degrading gracefully beats timing out.
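As a sketch, that escape hatch is a single comparison before the live scan. The threshold, the names, and the callback shape here are assumptions for illustration.

```go
package main

// TopDomainsResponse is what the API returns for an aggregate panel.
type TopDomainsResponse struct {
	Counts             map[string]int64
	CurrentHourPending bool // true when the live hour was skipped
}

// BuildTopDomains merges already-summarized counts with the live hour,
// unless the live table has grown past maxLiveRows. In that case it
// returns the summarized data alone and flags the response.
func BuildTopDomains(
	summarized map[string]int64,
	liveRows, maxLiveRows int64,
	scanLive func() map[string]int64, // the expensive current-hour scan
) TopDomainsResponse {
	merged := make(map[string]int64, len(summarized))
	for domain, n := range summarized {
		merged[domain] = n
	}
	if liveRows > maxLiveRows {
		return TopDomainsResponse{Counts: merged, CurrentHourPending: true}
	}
	for domain, n := range scanLive() {
		merged[domain] += n
	}
	return TopDomainsResponse{Counts: merged}
}
```

The row count itself should come from a cheap source such as an in-memory counter, not from a COUNT(*) against the very table being protected.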

The kernel is in the loop too

At five-figure qps, the resolver process is not the only thing under pressure — the operating system's network stack is in the hot path for every single packet, and its defaults are tuned for a general-purpose server, not a DNS firehose. Three areas matter, and the right move is to scale them to the actual hardware rather than copy magic numbers from a blog post.

Socket buffers

DNS over UDP is bursty. When a flood of queries arrives faster than the application drains them, undersized receive buffers overflow and the kernel silently drops packets. Clients see those drops as timeouts and retry, which makes the flood worse. Receive and send buffer ceilings should scale with available memory so the kernel can absorb bursts instead of discarding them.

Backlog and accept queues

The device backlog (how many incoming packets the kernel will queue before the application picks them up) and the listen/accept queues for TCP-based transports need headroom proportional to core count and offered load. Too small, and you drop traffic at the doorstep before the resolver ever sees it.

Connection tracking

If the resolver sits behind any stateful netfilter rules, the connection-tracking table becomes a hard ceiling: once it fills, new flows are dropped. DNS generates an enormous number of short-lived UDP "connections," so the conntrack table size and its UDP timeouts both need attention — or, for a pure resolver, conntrack on DNS traffic is often best avoided entirely.

The principle across all three: detect the hardware — cores, memory, NICs — at startup and size these parameters to it automatically, rather than shipping one set of values that is wrong for both a two-core VM and a 24-core appliance. Where the NIC supports it, spreading interrupts and flow hashing across queues keeps a single core from becoming the chokepoint while the others sit idle. The exact figures are deployment-specific; what's universal is that leaving the defaults in place is a decision, and usually the wrong one.
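The shape of "size it to the hardware" can be sketched as a pure function from core count and memory to suggested values. The sysctl names in the comments are real Linux settings, but every ratio and bound below is a placeholder we chose for illustration, not a recommendation; measure on your own hardware before adopting any of them.

```go
package main

// KernelTuning holds suggested values for a handful of Linux sysctls.
type KernelTuning struct {
	RmemMax          int64 // net.core.rmem_max, bytes
	WmemMax          int64 // net.core.wmem_max, bytes
	NetdevMaxBacklog int64 // net.core.netdev_max_backlog, packets
	Somaxconn        int64 // net.core.somaxconn, connections
	ConntrackMax     int64 // net.netfilter.nf_conntrack_max, entries
}

// clamp limits v to the range [lo, hi].
func clamp(v, lo, hi int64) int64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// SuggestTuning scales the parameters to the detected hardware.
// ASSUMPTION: all divisors, multipliers, and bounds are illustrative.
func SuggestTuning(cores int, memBytes int64) KernelTuning {
	c := int64(cores)
	if c < 1 {
		c = 1
	}
	buf := clamp(memBytes/256, 4<<20, 128<<20) // buffer ceiling grows with RAM
	return KernelTuning{
		RmemMax:          buf,
		WmemMax:          buf,
		NetdevMaxBacklog: clamp(c*2048, 4096, 65536),
		Somaxconn:        clamp(c*1024, 4096, 65535),
		ConntrackMax:     clamp(memBytes/16384, 65536, 4194304),
	}
}
```

With these placeholder ratios a two-core, 2 GiB VM gets 8 MiB buffer ceilings and a 131,072-entry conntrack table, while a 24-core, 64 GiB appliance hits the 128 MiB and 4,194,304 caps. The point is the spread between the two results, not the specific figures.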

The cache hit-rate is the single biggest lever

Every optimization above makes a busy resolver survive. The cache makes it fast and cheap, and it dwarfs everything else. The arithmetic is brutal in your favor: a query answered from cache never crosses the network to an upstream resolver. It skips the slowest, least predictable part of the whole pipeline. Raising the cache hit-rate from, say, the seventies into the high eighties or nineties doesn't shave a few percent off the work — it removes a large fraction of all upstream traffic, all upstream latency, and all upstream-dependent failure modes at once.
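The arithmetic is worth spelling out, because upstream load follows the miss rate, not the hit-rate. A small sketch with hypothetical helper names:

```go
package main

// UpstreamQPS returns how many queries per second still go upstream at
// the given total rate and cache hit-rate (0.0 to 1.0).
func UpstreamQPS(totalQPS, hitRate float64) float64 {
	return totalQPS * (1 - hitRate)
}

// UpstreamReduction returns the fraction of upstream traffic removed by
// moving the hit-rate from before to after.
func UpstreamReduction(before, after float64) float64 {
	missBefore := 1 - before
	if missBefore <= 0 {
		return 0 // nothing was going upstream to begin with
	}
	return 1 - (1-after)/missBefore
}
```

At 30,000 qps, moving from a 75% to a 90% hit-rate takes upstream traffic from 7,500 qps to about 3,000 qps: a 15-point gain in hit-rate removes 60% of the upstream load.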

Lever	What it buys	What it costs
Larger cache	More names resident, higher hit-rate	RAM — scale the cache to a sane fraction of total memory, capped
Optimistic / serve-stale	Answer instantly from a slightly-expired record while refreshing behind it	Briefly serving answers that are stale by up to the refresh window
Prefetch	Popular records re-resolved before they expire, so hits stay warm	A little speculative upstream traffic for hot names

Optimistic caching and prefetch are the two highest-leverage features on a busy resolver because they attack the worst case directly: the cache miss on a popular name, where one expiry forces a slow upstream round-trip that a thousand waiting clients feel. Serve-stale answers those clients immediately from the old record; prefetch ensures the record was refreshed before it ever expired. Between them, the tail of slow responses largely disappears.
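The decision both features make at lookup time fits in one switch. In this sketch times are Unix seconds, the two window sizes are made-up knobs, and every hit inside the prefetch window triggers a refresh, whereas a real resolver would typically limit prefetch to popular names.

```go
package main

// Entry is a cached answer. ExpiresAt is in Unix seconds.
type Entry struct {
	Answer    string
	ExpiresAt int64 // time stored plus the record's TTL
}

// Action tells the caller what to do with a lookup result.
type Action int

const (
	Miss             Action = iota // resolve upstream while the client waits
	ServeFresh                     // answer from cache, nothing else to do
	ServeAndPrefetch               // answer from cache, refresh before expiry
	ServeStale                     // answer from the expired record, refresh behind it
)

// Policy holds the two illustrative knobs, both in seconds.
type Policy struct {
	PrefetchWindow int64 // how long before expiry a hit triggers a refresh
	StaleWindow    int64 // how long past expiry the old answer may be served
}

// Decide classifies a lookup at time now.
func (p Policy) Decide(e *Entry, now int64) Action {
	switch {
	case e == nil:
		return Miss
	case now < e.ExpiresAt-p.PrefetchWindow:
		return ServeFresh
	case now < e.ExpiresAt:
		return ServeAndPrefetch
	case now < e.ExpiresAt+p.StaleWindow:
		return ServeStale
	default:
		return Miss
	}
}
```

Only the two Miss cases make a client wait on an upstream round-trip. For a name that is queried steadily, prefetch means the stale branch is rarely reached at all.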

There is a respect-the-data caveat: caching must honor TTLs and never override a deliberately short one with a long local minimum, or you'll serve stale answers for records that were designed to change quickly. The lever is real, but it's a scalpel, not a hammer; see our deeper treatment in TTL, caching and prefetch.

Keeping storage sane: aggregate early, retain little

Raw per-query rows are the most expensive thing you can keep, and at ISP scale they accumulate faster than any disk can hold them for long. The answer is a lifecycle that turns detailed-but-ephemeral data into compact-but-durable data, then throws the detail away.

The pattern is hourly rotation with rollup:

Write raw queries into short-lived hourly tables. Each hour gets its own table, so the actively-written object stays bounded and indexes stay healthy.
Aggregate completed hours into compact summaries. Once an hour is over, a background job rolls its raw rows up into per-domain, per-client, per-type, and per-response-code hourly summaries, which are the data you actually query for trends.
Drop the raw hourly table once it's summarized. The detail has served its purpose; the summary is two or three orders of magnitude smaller and answers every dashboard question.
Index for the queries you actually run. Composite indexes aligned to the real GROUP BY shapes (blocked-by-domain, client-by-status) turn aggregation from a full scan into an index walk.
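In miniature, the rollup step is a single pass over the finished hour's rows. The sketch below does it in memory with hypothetical row and summary types; in a real deployment the same grouping would run as SQL against the hourly table before that table is dropped.

```go
package main

import "sort"

// RawQuery is one row of an hourly raw table (fields are illustrative).
type RawQuery struct {
	Domain  string
	Client  string
	Blocked bool
}

// HourSummary is the compact form kept after the raw rows are dropped.
type HourSummary struct {
	Total           int64
	BlockedByDomain map[string]int64
	QueriesByClient map[string]int64
}

// Rollup collapses a completed hour of raw rows into a summary.
func Rollup(rows []RawQuery) HourSummary {
	s := HourSummary{
		BlockedByDomain: make(map[string]int64),
		QueriesByClient: make(map[string]int64),
	}
	for _, r := range rows {
		s.Total++
		s.QueriesByClient[r.Client]++
		if r.Blocked {
			s.BlockedByDomain[r.Domain]++
		}
	}
	return s
}

// TopN returns the n names with the largest counts, ties broken by name.
func TopN(counts map[string]int64, n int) []string {
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	if n < 0 {
		n = 0
	}
	if n < len(names) {
		names = names[:n]
	}
	return names
}
```

The summary's size depends on the number of distinct domains and clients, not on the number of queries, which is where the orders-of-magnitude saving comes from.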

One trap deserves a callout because it is invisible until it isn't: hour and day boundaries must be computed in the server's local time, not UTC. Several timezones — UTC+04:30, UTC+05:30 and others — sit on non-whole-hour offsets, and naive "truncate to the hour" logic that operates in UTC will bucket data into the wrong hour for those regions, silently corrupting every chart. The fix is trivial once you know it and maddening to debug if you don't: truncate against local wall-clock components, never against a UTC instant.
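Go makes this trap easy to reproduce, because time.Time.Truncate rounds against absolute time since the zero time, not against the wall clock. A short sketch using a fixed UTC+05:30 zone; the helper names are ours.

```go
package main

import "time"

// TruncateToLocalHour returns the start of the wall-clock hour containing
// t, in t's own location, by rebuilding the time from its components.
func TruncateToLocalHour(t time.Time) time.Time {
	y, m, d := t.Date()
	return time.Date(y, m, d, t.Hour(), 0, 0, 0, t.Location())
}

// HalfHourZoneDemo returns the minute-of-hour produced by the naive and
// the component-based approaches for 14:45 in a UTC+05:30 zone.
func HalfHourZoneDemo() (naiveMinute, localMinute int) {
	zone := time.FixedZone("UTC+05:30", 5*3600+30*60)
	t := time.Date(2026, time.June, 9, 14, 45, 0, 0, zone)

	naive := t.Truncate(time.Hour)  // lands on a UTC hour boundary: 14:30 local
	local := TruncateToLocalHour(t) // lands on the wall-clock hour: 14:00 local
	return naive.Minute(), local.Minute()
}
```

The naive call yields a bucket that starts at 14:30 local time, so every "hour" on the chart straddles two real ones. Rebuilding the timestamp from its local components gives the 14:00 bucket an operator expects.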

The mental model that holds all of this together: detail is a liability, summaries are an asset. Keep raw data only as long as you need it to build the summaries, then delete it without ceremony. A resolver that hoards raw query logs "just in case" will eventually spend more effort managing storage than answering queries.

The throughline

None of the moves here are exotic. Atomic counters, precomputed dashboards, bounded logs, kernel buffers sized to the hardware, a fat warm cache, and ruthless aggregation are all standard systems engineering. What's specific to DNS at scale is the order in which the naive version fails — the resolver stays healthy while the telemetry, the logging, and the reporting layer crumble around it — and the discipline of refusing to do any work on the hot path that grows with query volume.

Scale doesn't ask you to make the fast path faster. It asks you to make sure that everything you bolted on to watch the fast path is itself constant-cost. Get that right and a single, ordinary box carries an ISP's worth of queries without breaking a sweat — and the dashboard still answers in under a second while it does.

