Threat intelligence 101: how domain reputation actually works

Posted 2026-06-09 · 7 min read · threat detection

Open any security dashboard and you'll see a domain flagged "malicious" or an IP marked "high risk," usually with a tidy little number next to it. That number feels authoritative. It is also, almost always, a compressed summary of a dozen noisy, half-correlated signals that disagree with each other — flattened into a single value so a firewall, a mail filter, or a DNS resolver can make a yes/no decision in under a millisecond. The number is real. The certainty it implies is mostly an illusion of presentation.

"Reputation" is one of the most overloaded words in security, and the confusion costs real money: blocked customers, missed phishing, alert fatigue, and the occasional embarrassing incident where a perfectly legitimate site gets nuked because it shared an IP with something nasty. This article unpacks what reputation actually is — what signals feed it, how scores get combined and aged, why false positives are structural rather than accidental, and why a reputation score should be one vote in a decision, never the whole election.

The short version. Domain and IP reputation is a probabilistic estimate derived from observed behavior and metadata — not a verdict and not a binary. Treat it as a weighted input with a confidence band, decay it over time, and never let a single feed make an irreversible call on its own.

Reputation is a score, not a verdict

The first mental model to discard is the idea that a domain is "good" or "bad." A domain is an identifier; its reputation is an estimate of the probability that traffic to it is harmful, conditioned on everything we've observed about it. That estimate moves. A domain registered an hour ago has almost no behavioral history, so its reputation is dominated by weak priors (its registrar, its TLD, its hosting). A domain with ten years of clean traffic that suddenly starts serving malware has a stale-good reputation that lags reality by hours or days.

Because reputation is a probability, it has two properties people routinely forget. It is continuous — there's a meaningful difference between 0.55 and 0.95 — and it is uncertain — a score of 0.7 derived from one weak signal is not the same as 0.7 derived from five strong, independent ones. Collapsing both into "blocked" throws away exactly the information you need to make a good decision.

A reputation score answers "how suspicious does the evidence make this look?" — not "is this malicious?" Those are different questions, and conflating them is the root of most reputation-related outages.

Where the signals come from

No single observation tells you a domain is dangerous. Reputation is an exercise in correlation: stacking many independently-collected, individually-weak signals until a pattern emerges. The sources fall into a handful of broad categories, each with its own strengths and blind spots.

Signal category	What it observes	Strength	Blind spot
Passive DNS	Historical domain↔IP mappings seen by resolvers over time	Reveals infrastructure reuse, fast-flux, sudden re-pointing	Slow to populate for low-traffic domains; privacy-sensitive
Spam & abuse telemetry	Domains/IPs reported across email, web, and network abuse channels	High volume, near-real-time for active campaigns	Reporting bias; shared infrastructure causes collateral hits
Honeypots & sinkholes	Hosts that lure scanners, malware callbacks, and crawlers	Direct evidence of malicious behavior, low false-positive rate	Only sees attackers who reach the trap
Certificate Transparency logs	Public append-only logs of publicly trusted TLS certificates	Catches lookalike/typosquat domains at issuance, often before they go live	Issuance ≠ malice; floods of benign certs to filter
Registration data (WHOIS/RDAP)	Age, registrar, contact privacy, bulk-registration patterns	Newly-registered + privacy-shielded + cheap TLD = strong prior	Redaction and resellers obscure ownership
Hosting / ASN reputation	The autonomous system and provider hosting the content	"Bulletproof" hosts and abuse-heavy ASNs raise the baseline	Major clouds and CDNs host everything, good and bad

The reason analysts cross-reference so many categories is that each one is easy to evade in isolation but hard to evade jointly. An attacker can register a fresh domain (defeating age-based heuristics), but the certificate they request is submitted to public CT logs as part of issuance, and tools like crt.sh make those logs searchable. They can rotate IPs to dodge a static IP blocklist, but passive DNS records the rotation pattern. They can host on a reputable cloud to inherit a clean ASN, but their callback traffic still trips a honeypot. Reputation works because attackers have to win on every axis simultaneously, and defenders only have to catch them on one.

A note on what these feeds are not

Threat-intelligence feeds are observations, not oracles. A domain appearing on an abuse list means someone reported behavior that looked abusive from where they were standing. That's valuable, but it's testimony, not proof — and testimony has bias, latency, and the occasional outright error baked in. Good reputation systems treat each feed as a witness with a known reliability, not as ground truth.
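One way to make "a witness with a known reliability" concrete is a Bayesian update in odds form: each feed that lists a domain multiplies the odds of malice by how much more often it lists bad domains than good ones. The sketch below is illustrative only. The feed names and their hit and false-alarm rates are invented placeholders, and it assumes the feeds are independent of one another.

```python
def update_probability(prior, reports):
    """Update P(malicious) after one or more feeds list a domain.

    reports: (true_positive_rate, false_positive_rate) per listing feed,
    i.e. P(listed | malicious) and P(listed | benign).
    Assumes the feeds are independent witnesses.
    """
    odds = prior / (1.0 - prior)
    for tpr, fpr in reports:
        odds *= tpr / fpr  # likelihood ratio contributed by this witness
    return odds / (1.0 + odds)


# Hypothetical feeds: (P(listed | malicious), P(listed | benign))
CAREFUL_FEED = (0.60, 0.001)  # misses a lot, rarely lists a benign domain
NOISY_FEED = (0.90, 0.05)     # catches most things, lists many benign ones

PRIOR = 0.01  # assumed base rate of malicious domains in this traffic

p_noisy = update_probability(PRIOR, [NOISY_FEED])
p_careful = update_probability(PRIOR, [CAREFUL_FEED])
p_both = update_probability(PRIOR, [NOISY_FEED, CAREFUL_FEED])
```

With these made-up numbers, a listing on the noisy feed alone moves the estimate to about 0.15, while a listing on the careful feed alone moves it to about 0.86. Same event ("it's on a list"), very different weight of testimony.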

How scores get combined

Once you have a pile of signals, you have to fuse them into something decidable. The naive approach — block if any feed says "bad" — maximizes recall and torches precision; you'll catch everything malicious and a painful amount of everything else. Real systems weight and combine.

A simplified weighted model looks like this: each signal contributes points scaled by how predictive that signal type has historically been, and the total is clamped into a bounded range. The illustrative pseudocode below uses placeholder weights — actual weights are tuned against labeled data and are revisited constantly.

score = 0
score += w_pdns      * passive_dns_anomaly      # infrastructure churn
score += w_abuse     * abuse_report_density      # how many, how recent
score += w_honeypot  * honeypot_hits             # direct callbacks
score += w_ct        * lookalike_similarity      # CT-log typosquat match
score += w_age       * registration_recency      # newer = riskier prior
score += w_asn       * hosting_asn_badness       # provider baseline

score = max(0, min(score, 100))                   # clamp into 0..100
confidence = independent_signals_agreeing / total_signals_consulted

Two things matter more than the exact formula. First, independence: ten feeds that all re-publish the same upstream source are one signal wearing ten coats — they should not multiply confidence. Second, the confidence term is separate from the score. A high score backed by one lonely signal and a high score backed by five mutually-independent signals are very different bets, and the system needs to remember which is which all the way to the blocking decision.
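A minimal runnable version of that idea, in Python, might look like the following. Everything specific here is an assumption for illustration: the weights, the Signal fields, and the practice of tagging each feed with the upstream source it republishes. The point it shows is that feeds sharing an upstream are collapsed before they count toward either the score or the confidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    kind: str        # signal category, e.g. "abuse" or "honeypot"
    upstream: str    # original source the feed republishes (assumed known)
    strength: float  # normalized observation in [0, 1]


# Placeholder weights, not tuned against anything.
WEIGHTS = {"pdns": 15, "abuse": 25, "honeypot": 30, "ct": 10, "age": 10, "asn": 10}


def fuse(signals, consulted):
    """Return (score, confidence).

    consulted: number of independent sources that were asked, including
    the ones that had nothing to say.
    """
    # Ten feeds with one upstream are one witness: keep the strongest
    # observation per (kind, upstream) pair.
    strongest = {}
    for s in signals:
        key = (s.kind, s.upstream)
        strongest[key] = max(strongest.get(key, 0.0), s.strength)

    raw = sum(WEIGHTS[kind] * strength for (kind, _), strength in strongest.items())
    score = max(0.0, min(raw, 100.0))

    agreeing = sum(1 for strength in strongest.values() if strength > 0)
    confidence = min(1.0, agreeing / consulted) if consulted else 0.0
    return score, confidence


# Ten abuse feeds that all republish the same upstream list.
echoes = [Signal("abuse", "upstream-a", 0.8) for _ in range(10)]
echo_score, echo_conf = fuse(echoes, consulted=6)

# Two genuinely independent observations.
mixed = [Signal("abuse", "upstream-a", 0.8), Signal("honeypot", "trap-1", 0.8)]
mixed_score, mixed_conf = fuse(mixed, consulted=6)
```

The ten echoing feeds score the same as a single abuse report and count as one agreeing source out of six; the two independent signals score higher and double the confidence.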

Decay: reputation is perishable

Infrastructure turns over. Compromised sites get cleaned. Phishing pages get taken down. IPs get reassigned to entirely new tenants. A reputation score that never forgets is a score that slowly fills with ghosts — yesterday's malicious IP is today's small business on a recycled address. So scores decay: the weight of a signal diminishes as it ages, on a half-life appropriate to how fast that signal type goes stale.

Signal type	Typical freshness window	Why
Active honeypot callback	Hours to a few days	C2 infrastructure is short-lived and disposable
Abuse report	Days to weeks	Campaigns burn out; takedowns happen
Domain age / registration	Weeks to months	Risk genuinely falls as a domain proves benign over time
ASN / hosting reputation	Months	Provider behavior shifts slowly

Decay is also what lets a freshly-compromised legitimate site recover. Without it, one bad week would condemn a domain forever, and operators would (correctly) stop trusting your feed.
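Half-life decay is simple to express. The sketch below assumes exponential decay and uses invented half-lives loosely matching the freshness windows in the table; real values would be tuned per feed.

```python
# Placeholder half-lives in hours, one per signal type (assumed values).
HALF_LIFE_HOURS = {
    "honeypot": 24,       # about a day
    "abuse": 24 * 7,      # about a week
    "age": 24 * 45,       # about six weeks
    "asn": 24 * 120,      # about four months
}


def decayed_weight(kind, age_hours):
    """Fraction of a signal's original weight that remains after age_hours."""
    return 0.5 ** (age_hours / HALF_LIFE_HOURS[kind])


def decayed_contribution(kind, base_points, age_hours):
    """Points a signal still contributes to the score at its current age."""
    return base_points * decayed_weight(kind, age_hours)
```

A honeypot hit that is one day old keeps half its weight under these assumptions, while an ASN signal of the same age keeps nearly all of it. A score built this way drifts back toward neutral on its own once the bad behavior stops.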

The precision/recall trade-off, and why false positives are structural

Every blocking system sits somewhere on a curve between two failure modes. Crank the threshold low and you catch almost everything malicious (high recall) while blocking a lot of innocents (low precision). Crank it high and your blocks are almost always right (high precision) while a lot of bad traffic sails through (low recall). You cannot maximize both with the same threshold; you choose where to sit based on the cost of each kind of mistake.
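The trade-off is easy to see on a toy labeled set. The data below is invented purely for illustration; the function just counts what a "block at or above threshold" policy would have done.

```python
def precision_recall(scored, threshold):
    """scored: list of (score, is_malicious). Block when score >= threshold."""
    blocked = [(score, bad) for score, bad in scored if score >= threshold]
    true_positives = sum(1 for _, bad in blocked if bad)
    total_malicious = sum(1 for _, bad in scored if bad)
    precision = true_positives / len(blocked) if blocked else 1.0
    recall = true_positives / total_malicious if total_malicious else 1.0
    return precision, recall


# Invented sample: (reputation score, ground-truth malicious?)
SAMPLE = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]

low_threshold = precision_recall(SAMPLE, 0.30)   # catches everything bad
high_threshold = precision_recall(SAMPLE, 0.85)  # blocks only sure things
```

On this sample the low threshold reaches recall 1.0 at precision 4/7, and the high threshold reaches precision 1.0 at recall 0.5. Neither is "correct"; which one you want depends on what each kind of mistake costs you.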

What makes false positives structural rather than fixable is that the modern internet shares infrastructure aggressively:

Shared hosting. Thousands of unrelated sites can live behind one IP. One sends spam; the IP earns a bad reputation; every other site on that IP inherits it. IP-level blocking punishes the neighbors of the guilty.
CDNs and clouds. A handful of large providers front a huge fraction of the web. The same IP ranges serve your bank, a hospital portal, and a phishing kit someone spun up an hour ago. ASN-level signals are nearly useless here, and IP-level signals are dangerous.
Freshly-compromised legitimate sites. A WordPress install gets popped and starts serving malware. For a window of hours, a domain with years of pristine history is genuinely malicious, and the moment it's cleaned, it's genuinely fine again. Any static label is wrong on one side of that window or the other.
Typosquat collateral. Similarity heuristics that catch paypa1.com will occasionally snag a real business whose name happens to resemble a popular brand.
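The typosquat case is worth seeing in miniature, because the same few lines that catch paypa1.com also produce the collateral. This is a deliberately crude sketch: the homoglyph table is a tiny assumed sample, only the first DNS label is inspected, and "paypad.com" is a hypothetical innocent business used to show the false positive.

```python
# Small assumed sample of digit-for-letter swaps; real tables are far larger.
HOMOGLYPHS = str.maketrans({"1": "l", "0": "o", "3": "e", "5": "s"})


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lookalike_of(domain, brands, max_distance=1):
    """Return the brand a domain appears to imitate, or None.

    Simplification: only the first label is compared, so subdomains
    and multi-label tricks are ignored.
    """
    label = domain.lower().split(".")[0]
    if label in brands:
        return None  # it is the brand itself
    folded = label.translate(HOMOGLYPHS)
    for brand in brands:
        if folded == brand or edit_distance(label, brand) <= max_distance:
            return brand
    return None


BRANDS = ["paypal"]
caught = lookalike_of("paypa1.com", BRANDS)      # real typosquat pattern
collateral = lookalike_of("paypad.com", BRANDS)  # hypothetical legitimate name
clean = lookalike_of("example.com", BRANDS)
```

Both paypa1.com and the hypothetical paypad.com come back flagged as lookalikes of "paypal". The heuristic cannot tell intent from spelling, which is why a similarity hit belongs in the score as one weak signal rather than as a block on its own.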

The shared-IP tax. The single most common cause of a "we blocked a legitimate service" incident is an IP-level block landing on shared or CDN infrastructure. Prefer domain-level and behavior-level signals; reserve IP blocking for cases where the IP itself is the unit of badness (a scanner, a C2 host on dedicated infrastructure).

Confidence bands: blocking should respect them

If a score carries a confidence estimate, the worst thing you can do is ignore it and apply one global threshold. The better pattern is banded action: the same score triggers different responses depending on how much evidence stands behind it and how reversible the consequence is.

Band	Evidence	Reasonable action
High confidence, high score	Multiple independent strong signals agree	Block outright; log the evidence
Medium confidence	Some signals agree, others silent or weak	Block but make it appealable; flag for review; soft-fail for trusted clients
Low confidence	One weak signal, or a fresh domain with only priors	Monitor, rate-limit, or warn — do not hard-block

The asymmetry that should drive band design is the cost of being wrong. Blocking a malware C2 callback that turns out benign costs almost nothing — nobody legitimately needs that lookup. Blocking a payroll provider, a payment gateway, or a hospital's patient portal during business hours can be a genuine emergency. Reversible, low-cost blocks can sit on a hair trigger; high-cost blocks should demand high confidence and a fast path to undo.
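As a sketch, banded action is a small decision function over three inputs rather than one threshold. The cut-offs below (60 on a 0 to 100 score, 0.7 and 0.4 on confidence) and the action names are placeholders chosen for illustration.

```python
def choose_action(score, confidence, high_cost_if_wrong):
    """Pick a response from score (0-100), confidence (0-1), and the
    cost of blocking something that turns out to be legitimate."""
    if score < 60:
        return "allow"
    if confidence >= 0.7:
        # Multiple independent signals agree: block and keep the evidence.
        return "block"
    if confidence >= 0.4:
        # Medium evidence: only block when the block is cheap to be wrong
        # about, and keep it appealable.
        return "monitor" if high_cost_if_wrong else "block_appealable"
    # One weak signal or priors only: never hard-block.
    return "monitor"
```

For example, a score of 90 with confidence 0.5 yields "block_appealable" for a disposable-looking callback domain but only "monitor" for something like a payment gateway, which is the asymmetry described above expressed as code.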

Reputation is one input, not the decision

Here's the contrarian part, and the part that matters most operationally: a reputation score should rarely be the sole reason you block something. It's a prior — a starting estimate you update with everything else you know in the moment.

Consider how much context a reputation feed simply doesn't have. It doesn't know that this client is a kiosk that should only ever talk to three domains, or that the lookup volume to a freshly-seen domain just spiked 50× in ninety seconds (a tell for algorithmically-generated domains), or that the query pattern looks like data being smuggled out one label at a time (the signature of tunneling tools like iodine or dnscat2). Reputation is backward-looking and global; your local behavioral signals are real-time and specific. Fused together they're far stronger than either alone.

A middling reputation score (say, 0.6) plus anomalous local behavior is a confident block. The same score on a quiet, well-behaved client is a reason to watch, not to break things.
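A rough sketch of that fusion follows. The two local checks (a per-client allowlist, as for the kiosk, and a 50× volume spike over a 90-second window) come from the examples above; the function names, the 0.5 reputation cut-off, and the action labels are assumptions, and a fuller policy would also weigh the confidence behind the reputation score.

```python
def local_anomalies(client_allowlist, domain, lookups_last_90s, baseline_per_90s):
    """Return the local red flags for one lookup.

    client_allowlist: set of domains this client is expected to use,
    or None if the client has no fixed profile.
    """
    flags = []
    if client_allowlist is not None and domain not in client_allowlist:
        flags.append("off_allowlist")
    # Treat a zero baseline as 1 so a brand-new domain needs real volume.
    if lookups_last_90s >= 50 * max(baseline_per_90s, 1):
        flags.append("volume_spike")
    return flags


def decide(reputation, flags):
    """Combine a global reputation score in [0, 1] with local flags."""
    suspicious = reputation >= 0.5  # placeholder cut-off
    if suspicious and flags:
        return "block"
    if suspicious or flags:
        return "watch"
    return "allow"


kiosk_flags = local_anomalies({"a.example"}, "b.example", 500, 5)
quiet_flags = local_anomalies(None, "b.example", 10, 5)
```

Here decide(0.6, kiosk_flags) returns "block" and decide(0.6, quiet_flags) returns "watch": the reputation score is identical, and the local context is what changes the outcome.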

This is also why heavier reputation checks belong out of band, beside the resolution path rather than inside it. A DNS lookup has a sub-millisecond budget; you cannot stall it on a network round-trip to a scoring service. The pragmatic architecture answers the query immediately using fast local signals and consults heavier reputation sources asynchronously, feeding what it learns back into the blocklists that govern the next lookup. Reputation shapes policy over time; it doesn't gate individual packets in real time.
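The shape of that architecture can be sketched with asyncio. This is a toy, not a resolver: slow_reputation_lookup is a hypothetical stand-in for a remote scoring service, the 0.9 cut-off is a placeholder, and the blocklist is a plain in-memory set.

```python
import asyncio

blocklist = set()  # fast local state consulted on every lookup


async def slow_reputation_lookup(domain):
    """Hypothetical stand-in for a network call to a scoring service."""
    await asyncio.sleep(0.01)
    return 0.95 if domain.endswith(".bad.example") else 0.05


async def reputation_worker(queue):
    """Score queued domains in the background and update the blocklist."""
    while True:
        domain = await queue.get()
        if await slow_reputation_lookup(domain) >= 0.9:
            blocklist.add(domain)
        queue.task_done()


def resolve(domain, queue):
    """Fast path: decide from local state only, then enqueue for scoring."""
    verdict = "blocked" if domain in blocklist else "answered"
    queue.put_nowait(domain)  # never waits on the scoring service
    return verdict


async def demo():
    queue = asyncio.Queue()
    worker = asyncio.create_task(reputation_worker(queue))
    first = resolve("c2.bad.example", queue)   # nothing known yet
    await queue.join()                         # let the background check finish
    second = resolve("c2.bad.example", queue)  # policy has caught up
    await queue.join()
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return first, second
```

Running asyncio.run(demo()) shows the first lookup being answered and the second being blocked. The first query is the price of not stalling the resolution path; the design accepts it in exchange for never putting a network round-trip in front of a DNS answer.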

Putting it together

If you remember one thing, make it this: reputation is a weather forecast, not a verdict. A 70% chance of rain doesn't mean it rained; it means a sensible person carries an umbrella. A 70% reputation score doesn't mean a domain is malicious; it means a sensible resolver treats it with proportionate suspicion — and keeps watching, because by tomorrow the forecast will have changed.

The systems that get this right share a few habits. They prefer many independent weak signals over one loud feed. They track confidence separately from score. They decay aggressively so the past doesn't haunt the present. They block by band, matching the severity of the action to the strength of the evidence and the cost of being wrong. And they never let reputation stand alone — it's the prior, your own real-time behavioral telemetry is the update, and the decision lives in the combination. Do that, and reputation becomes what it should be: a powerful, honest input. Treat the number as gospel, and sooner or later it will block your own payroll on a Tuesday morning and tell you, with total confidence, that it was right to.

