Typosquatting and IDN homographs: when gооgle.com isn't Google
Look at the title of this article again. The domain name in it, the one that looks like "google.com", contains two Cyrillic characters where you expect Latin letters. Your eyes read it as Google. A browser address bar, a phishing email, an SMS link: all of them can render that string so it is visually indistinguishable from the real thing. The domain it points to is not Google's, has never been Google's, and could be hosting a credential-harvesting page registered an hour ago.
This is the uncomfortable core of two related attack families. Typosquatting exploits the gap between what a user intends to type and what they actually type — fat fingers, autocomplete, misremembered spellings. IDN homograph attacks go further: they exploit the gap between what a string is and what it looks like, using the full breadth of Unicode to manufacture domains that are byte-for-byte different but pixel-for-pixel identical to a brand you trust. Both end in the same place — a victim who believes they are somewhere safe. This article walks through how the impostors are built, why the defences that exist are incomplete, and how to detect the lookalikes that target you.
The mutation families
Typosquatters do not pick a random misspelling and hope. They generate a permutation set — a programmatic enumeration of every domain that a human could plausibly land on instead of yours — and register the ones worth registering. The families below are the standard taxonomy, and any serious brand-monitoring effort needs to generate all of them. Examples use a fictional brand, acmebank.com.
| Family | Mechanism | Example impostor |
|---|---|---|
| Omission | A character is dropped | acmbank.com |
| Transposition | Two adjacent characters swapped | acmebnak.com |
| Replacement | A character replaced by a keyboard-adjacent one | scmebank.com (a→s, QWERTY neighbour) |
| Insertion | An extra character added | acmerbank.com |
| Repetition | A character doubled | acmeebank.com |
| Bitsquatting | A single-bit flip in the stored/transmitted name | agmebank.com (c→g, one bit flipped: 0x63 → 0x67) |
| Wrong TLD | Same label, different suffix | acmebank.net, acmebank.co |
| Combosquatting | Brand plus an appended keyword | acmebank-login.com, secure-acmebank.com |
| Homoglyph | A character replaced by a confusable one (see below) | acmеbank.com (Cyrillic е) |
A few of these deserve a second look. Bitsquatting is the most counter-intuitive: it does not rely on a human mistyping anything. It relies on hardware. A single bit flip in a device's memory — from cosmic rays, heat, or failing DRAM — can silently change one character of a cached or in-flight domain name before the resolver ever sees it. Attackers register the one-bit-off variants of high-traffic domains and simply wait for the universe to deliver traffic. The volumes are small per domain but, at the scale of a popular brand, non-zero and entirely automatic.
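As a rough illustration of how the one-bit-off variants can be enumerated, the sketch below flips each bit of each character and keeps only results that are still legal hostname characters. The function name `bitsquats` is hypothetical, and the sketch assumes a lowercase ASCII label.

```python
import string

# Characters allowed in a hostname label (letters, digits, hyphen).
VALID = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(label: str) -> set[str]:
    """Return every label that differs from `label` by exactly one flipped bit."""
    out = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            # DNS is case-insensitive, so a flip that only changes case is not a new name.
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped not in VALID or flipped == ch:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # A label may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            out.add(candidate)
    return out

variants = bitsquats("acmebank")
print(len(variants), "agmebank" in variants)
```

Most flips produce characters that cannot appear in a hostname at all, which is why the surviving set is small enough to monitor or register.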
Combosquatting is the family that defeats most casual inspection, because the brand name appears intact. There is no typo to spot. acmebank-security-alert.com contains the real string "acmebank", which is exactly why it reads as legitimate to a stressed user who just got a "suspicious login" email. Combosquats are also the hardest to defensively register, because the keyword space is effectively infinite.
IDN homographs: when the bytes lie
Internationalized Domain Names (IDNs) exist for a good reason: the world does not type in ASCII. A user in Athens, Moscow, or Seoul should be able to register and read a domain in their own script. The DNS protocol itself only speaks a restricted ASCII subset, so IDNs are encoded into ASCII using Punycode (RFC 3492), wrapped in the IDNA framework (RFC 5890 and related). A label that contains non-ASCII characters is transformed into an ASCII string prefixed with xn--.
The security problem is that Unicode contains many characters that are visually identical or near-identical across scripts but have entirely different code points. These are confusables. The Latin "a" (U+0061) and the Cyrillic "а" (U+0430) render the same in almost every font. So do Latin "e" and Cyrillic "е", Latin "o" and Cyrillic "о", Latin "p" and Cyrillic "р", and dozens more. The Unicode Consortium publishes a confusables table precisely because this ambiguity is a known, structural property of the character set — not a bug to be fixed.
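To make the "same glyph, different code point" point concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` is illustrative only.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str]]:
    """List the code point and official Unicode name of each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, renders like a Latin "a" in most fonts

for code_point, name in describe(latin_a + cyrillic_a):
    print(code_point, name)

# The two strings look alike but never compare equal.
print(latin_a == cyrillic_a)
```

Any comparison, allowlist, or log search that works on raw strings treats these as unrelated characters, which is exactly what the attacker relies on.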
A worked example
Suppose an attacker wants to impersonate apple.com. They take the legitimate label and substitute the Latin "a" with the Cyrillic "а" (U+0430). The resulting string looks like "apple" but is a different sequence of code points. When that label is encoded for the DNS, Punycode turns it into something like:
Displayed to the user: аpple.com (first 'a' is Cyrillic U+0430)
Actual DNS label: xn--pple-43d.com
What the user believes: apple.com
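The encoding above can be reproduced with Python's built-in `punycode` codec. This is a minimal sketch: the helper names are hypothetical, and it deliberately skips the mapping and validation steps that a full IDNA implementation performs before encoding.

```python
def to_dns_label(label: str) -> str:
    """Encode one label into the ASCII form that appears in DNS."""
    label = label.lower()
    if label.isascii():
        return label
    # Punycode-encode the label and add the IDNA prefix.
    return "xn--" + label.encode("punycode").decode("ascii")

def from_dns_label(label: str) -> str:
    """Decode an xn-- label back to the form a user would see."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

spoof = "\u0430pple"  # first letter is Cyrillic U+0430, the rest is Latin
print(to_dns_label(spoof))             # xn--pple-43d
print(from_dns_label("xn--pple-43d") == spoof)
```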
The registrant controls xn--pple-43d.com, a domain that has nothing to do with Apple. They can obtain a valid TLS certificate for it — certificate authorities validate control of the encoded name, and a green padlock says nothing about whether the name is a lookalike. The padlock means "the connection is encrypted to the domain shown", and the domain shown is the trap. Worse: an attacker can build entire labels from a single non-Latin script (a "whole-script" homograph), which sidesteps some browser defences that only flag mixed-script labels.
The padlock was never a trust signal about who you are talking to. It is a confidentiality signal about how. Homograph attacks live in exactly that gap.
The defences that exist — and where they leak
Browser vendors and registries did not ignore this. Over the years a layered set of mitigations appeared, and they genuinely help. They also leave gaps that an attacker who understands the rules can route around.
- Punycode display fallback. Modern browsers refuse to render certain IDN labels in their Unicode form and instead show the raw xn-- string in the address bar. The heuristics are script-mixing-aware: a label that mixes Latin and Cyrillic, or that uses a script inconsistent with the TLD's expected language, is shown as Punycode so the user sees the ugly truth. This is the single most effective control, and it only works in the address bar, after the click.
- Registry-level script restrictions. Many registries enforce IDN tables that forbid mixing scripts within a single label, or restrict a label to scripts associated with that TLD. This kills a large class of mixed-script attacks at registration time.
- Mail and messaging rendering. Some clients apply similar Punycode fallbacks to links in email and chat.
The leaks are systematic. First, the address bar is the last surface a victim sees; the homograph already did its work in the email, the QR code, the SMS, the chat message, the document — none of which reliably apply browser-grade heuristics. Second, whole-script homographs (an entire label rendered in one non-Latin script) can evade mixed-script detection because there is no mixing to detect. Third, registry restrictions are uneven: a brand built from Latin letters has plenty of single-script confusable cousins available under registries with looser policies. Fourth, none of this addresses plain ASCII typosquatting and combosquatting at all — there is no Unicode trick to flag, just a domain that is one keystroke or one keyword away from yours.
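The second leak is easy to demonstrate. The sketch below is a deliberately naive mixed-script check, not any browser's actual algorithm. Python's standard library does not expose the Unicode Script property, so as an assumption it approximates the script from the first word of each character's Unicode name.

```python
import unicodedata

def scripts(label: str) -> set[str]:
    """Approximate the scripts used in a label (letters only)."""
    found = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

def is_mixed_script(label: str) -> bool:
    return len(scripts(label)) > 1

mixed = "\u0430pple"                       # Cyrillic first letter, Latin rest
whole = "\u0430\u0440\u0440\u04cf\u0435"   # every letter is Cyrillic

print(is_mixed_script(mixed))  # flagged: two scripts in one label
print(is_mixed_script(whole))  # not flagged: one script, still reads like a Latin word
```

A check of this shape catches the mixed label and passes the whole-script one, which is why mixed-script detection alone cannot be the only line of defence.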
Detection: generate, compare, watch
Defence that scales is built on the same primitive the attacker uses — the permutation set — turned around. Instead of registering variants, you enumerate and monitor them. There are three pillars.
1. Generate the permutation set
Start from your brand labels and mechanically produce every variant across all the families above: omissions, transpositions, keyboard-adjacent replacements, insertions, repetitions, bit flips, alternate TLDs, combosquat keyword combinations, and — critically — the homoglyph substitutions derived from the Unicode confusables data. Open-source typo-permutation tooling (for example, the widely used dnstwist family of generators) does most of this; the homoglyph step needs a confusables map so that "o → Cyrillic о", "a → Cyrillic а", and the rest are expanded into their xn-- encodings.
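A stripped-down generator for a few of these families might look like the sketch below. It is not how dnstwist or any particular tool is implemented. The `HOMOGLYPHS` map and `KEYWORDS` list are tiny illustrative stand-ins for the Unicode confusables data and a real keyword list, and all function names are hypothetical.

```python
# Illustrative subset only; a real map would be built from the Unicode confusables data.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440"}
KEYWORDS = ("login", "secure")

def omissions(s: str) -> set[str]:
    return {s[:i] + s[i + 1:] for i in range(len(s))}

def transpositions(s: str) -> set[str]:
    swapped = {s[:i] + s[i + 1] + s[i] + s[i + 2:] for i in range(len(s) - 1)}
    return swapped - {s}  # swapping two equal letters changes nothing

def repetitions(s: str) -> set[str]:
    return {s[:i] + s[i] + s[i:] for i in range(len(s))}

def homoglyphs(s: str) -> set[str]:
    return {s[:i] + HOMOGLYPHS[c] + s[i + 1:] for i, c in enumerate(s) if c in HOMOGLYPHS}

def combosquats(s: str) -> set[str]:
    return {f"{s}-{k}" for k in KEYWORDS} | {f"{k}-{s}" for k in KEYWORDS}

def to_ascii(label: str) -> str:
    """Non-ASCII labels are expanded into their xn-- encoding."""
    return label if label.isascii() else "xn--" + label.encode("punycode").decode("ascii")

def permutation_set(label: str, tld: str = "com") -> dict[str, set[str]]:
    families = {
        "omission": omissions(label),
        "transposition": transpositions(label),
        "repetition": repetitions(label),
        "homoglyph": homoglyphs(label),
        "combosquat": combosquats(label),
    }
    return {name: {f"{to_ascii(v)}.{tld}" for v in vs} for name, vs in families.items()}

for family, names in permutation_set("acmebank").items():
    print(family, len(names))
```

Keeping the result grouped by family makes it easier to prioritise later, since a live homoglyph or combosquat hit usually deserves more attention than a parked omission.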
2. Compare against reality
A generated permutation is only interesting once it exists. Resolve each candidate and check for signs of life: does it have an A record, an MX record (a strong phishing signal — someone is preparing to send or receive mail as you), an active web server, a TLS certificate? Score the hits. A registered, mail-enabled, recently-created lookalike of your login domain is a near-certain attack in preparation. A parked variant owned by a domain squatter is a nuisance. Prioritize accordingly.
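One way to turn those signs of life into a priority order is a simple additive score. The sketch below assumes the DNS, web, and registration lookups have already been done elsewhere and only shows the scoring step. The `Observation` fields and every weight are hypothetical and would need tuning against real data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    domain: str
    has_a: bool = False
    has_mx: bool = False
    has_web: bool = False
    has_cert: bool = False
    age_days: Optional[int] = None  # days since registration, if known
    parked: bool = False            # looks like a squatter's parking page

def score(obs: Observation) -> int:
    """Higher means more likely an attack in preparation. Weights are illustrative."""
    s = 0
    if obs.has_a:
        s += 1
    if obs.has_mx:
        s += 3  # mail-enabled lookalikes are a strong phishing signal
    if obs.has_web:
        s += 2
    if obs.has_cert:
        s += 2
    if obs.age_days is not None and obs.age_days <= 30:
        s += 3  # recently created
    if obs.parked:
        s -= 4  # nuisance rather than threat
    return max(s, 0)

hits = [
    Observation("acmbank.com", has_a=True, has_web=True, age_days=900, parked=True),
    Observation("acmebank-login.com", has_a=True, has_mx=True, has_web=True,
                has_cert=True, age_days=2),
]
for obs in sorted(hits, key=score, reverse=True):
    print(score(obs), obs.domain)
```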
3. Watch the Certificate Transparency logs
Certificate Transparency (RFC 6962) defines public, append-only logs of issued certificates, and major browsers such as Chrome and Safari refuse publicly trusted certificates that have not been logged, so in practice CAs publish what they issue. This is a gift to defenders. The moment an attacker provisions TLS for xn--pple-43d.com or acmebank-login.net, a record of that name appears in the CT logs, often before the phishing campaign launches, because the certificate is set up first. Monitoring CT (via public front-ends such as crt.sh, or by streaming the logs directly) for substrings and confusable variants of your brand gives you the earliest possible warning. CT-watching catches the homograph at the exact step where the encoded name becomes public.
| Signal | What it tells you | Catches it how early |
|---|---|---|
| CT-log appearance | A cert was issued for a lookalike name | Before campaign launch |
| New DNS records (A/MX) | The lookalike is being stood up | Setup phase |
| Live web server / login form | The trap is armed | Just before / during campaign |
| Resolution requests from your users | Someone already clicked | During the attack |
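As a sketch of the CT-matching step, the function below classifies hostnames that have already been pulled from a CT source. How the names are fetched is out of scope here. The brand, the allowlist, and the assumption that the TLD is a single label are all illustrative, and ASCII-only matching like this would still need xn-- decoding and confusable folding to catch homographs.

```python
from typing import Optional

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (a transposition counts as two edits)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def classify(name: str, brand: str = "acmebank",
             legit: frozenset = frozenset({"acmebank.com"})) -> Optional[str]:
    """Return a lookalike family for a certificate name, or None if it is not a match."""
    host = name.lower().rstrip(".")
    if host.startswith("*."):
        host = host[2:]  # wildcard certificates
    labels = host.split(".")
    if len(labels) < 2:
        return None
    if ".".join(labels[-2:]) in legit:
        return None      # our own domain or a subdomain of it
    label = labels[-2]   # assumes a single-label TLD
    if label == brand:
        return "wrong-tld"
    if brand in label:
        return "combosquat"
    if edit_distance(label, brand) <= 1:
        return "typo"
    return None

for seen in ["www.acmebank.com", "acmebank-login.net", "acmbank.com", "example.org"]:
    print(seen, classify(seen))
```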
The DNS vantage point
There is a fourth signal the table above hints at, and it is the one a resolver is uniquely positioned to see: your own users resolving a lookalike. CT logs and permutation scans tell you a trap exists somewhere on the internet. Your DNS tells you a device on your network just asked for it. That is the difference between threat intelligence and an active incident. A resolver that decodes xn-- labels, expands them back to their displayed form, and compares the result against your protected brands can flag a homograph query the instant it happens — and block the answer before the browser ever renders the page. The same logic, applied to the typo and combosquat permutation set, catches the ASCII attacks that Unicode heuristics never see. This is precisely where UnveilDNS does its work: at the resolution boundary, where the lookalike stops being a possibility and becomes a request.
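The decode, expand, and compare logic described above can be sketched in a few lines. This is not UnveilDNS's implementation. The `FOLD` map is a small illustrative subset of the confusables data, and `PROTECTED` is a hypothetical list of brand labels.

```python
# Illustrative subset: Cyrillic letters mapped to the Latin letters they resemble.
FOLD = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
        "\u0441": "c", "\u0445": "x", "\u0443": "y"}
PROTECTED = {"apple", "acmebank"}  # hypothetical protected brand labels

def decode_label(label: str) -> str:
    """Expand an xn-- label back to its displayed form; leave others unchanged."""
    if label.lower().startswith("xn--"):
        try:
            return label[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            return label  # malformed Punycode: treat as opaque
    return label

def skeleton(label: str) -> str:
    """Fold confusable characters to the letters they imitate."""
    return "".join(FOLD.get(ch, ch) for ch in label.lower())

def is_homograph_query(qname: str) -> bool:
    """True if any label is non-ASCII but folds to a protected brand."""
    for raw in qname.rstrip(".").split("."):
        shown = decode_label(raw)
        if not shown.isascii() and skeleton(shown) in PROTECTED:
            return True
    return False

print(is_homograph_query("xn--pple-43d.com"))  # True
print(is_homograph_query("apple.com"))         # False
```

The genuine ASCII name never matches because only non-ASCII labels are folded, so the real apple.com resolves normally while its Cyrillic twin is flagged.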
Response: block, report, register selectively
Detection without a playbook is just anxiety. Once a lookalike is confirmed, three responses run in parallel, and they are not mutually exclusive.
- Block at resolution. The fastest, cheapest, and most reliable control you fully own. Add the confirmed lookalike to your resolver's blocklist so that no device on your network can reach it, regardless of what the user clicks. This protects your staff and managed users immediately, without waiting on anyone else's process. Homograph variants should be blocked by their xn-- encoding so there is no ambiguity.
- Report for takedown. File abuse reports with the registrar, the hosting provider, and, for credential phishing, the impersonated brand's own abuse channels and the relevant CERT. CT-log evidence and resolution data make these reports concrete and actionable. Takedown is slow and outside your control, but it removes the threat for users outside your perimeter, which blocking cannot.
- Defensively register, selectively. You cannot register every variant; the combosquat and TLD space is unbounded. But the highest-risk handful (the exact homograph of your login domain, the one-character omission, the obvious -login / -secure combosquats on the major TLDs) are worth owning outright so an attacker never can. Treat this as a small, prioritized list informed by your detection scoring, not a blanket policy.
The honest framing is that no single response closes the gap. Registration is bounded by cost; takedown is bounded by other people's timelines; blocking is bounded by your perimeter. Run all three, and let detection — the permutation set, the CT watch, the resolver's view of what your users actually ask for — decide where each one is worth spending.
The attacker's whole advantage is that a domain can look like the truth while being something else entirely. Take that advantage away by deciding, ahead of time and in code, what the truth looks like — then watching, at the one place every click passes through, for everything that merely resembles it.