Correctly match URL against domain name without killing yourself with regular expressions

Date: Tue Oct 26 2021

Tags: Node.JS, JavaScript

The Internet relies on domain names as a friendlier, more humane addressing mechanism than IP addresses. That means we often write code that checks whether a URL is "within" domain A or domain B and acts accordingly. You might think a regular expression is the way to go, but that approach has failings: while a URL looks like a text string, it is actually a data structure. A domain name comparison has to recognize that it is dealing with a data structure and compare accordingly. Otherwise a URL with the domain name "loveamazon.com" might match the regular expression /amazon.com$/i and do the wrong thing.

What do I mean when I say a domain name is a data structure? A domain name like techsparx.com sure looks like a string, and we store it as a string in software. But notice that a domain name like ae-9.r24.snjsca04.us.bb.gin.ntt.net is deeply nested. In production systems, hardware elements (routers, etc.) are often given deeply nested domain names reflecting geographical location and other identifiers. Or consider that many sites, like google.com, have lots of subdomains, like drive.google.com or mail.google.com.

The domain name a.b.c.example.com looks like a string, but it is a nested data structure. At the top is .com, then example.com, then c.example.com, and so forth. Domain names are also case insensitive. Testing whether one domain name is a subdomain of another is therefore not a simple case-insensitive string comparison. By convention, www.example.com is often treated as equivalent to example.com. The natural way to describe any subdomain of example.com is the pattern *.example.com, but where do we get a suitable matching algorithm? These are the reasons I insist that you treat a domain name as a data structure rather than as a simple string.
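To make the nesting concrete, here is a minimal sketch that lists every level of a domain name, from the top-level domain down to the full name. The function name domainHierarchy is made up for illustration and does not come from any library.

```javascript
// Hypothetical helper (not part of any library): list every level of a
// domain name, starting at the top-level domain.
function domainHierarchy(name) {
    // Domain names are case insensitive, so normalize before splitting.
    const labels = name.toLowerCase().split('.');
    const levels = [];
    // Walk the labels from right to left, growing the name one label at a time.
    for (let i = labels.length - 1; i >= 0; i--) {
        levels.push(labels.slice(i).join('.'));
    }
    return levels;
}

console.log(domainHierarchy('a.b.c.Example.COM').join(', '));
// prints: com, example.com, c.example.com, b.c.example.com, a.b.c.example.com
```

Each entry in the result is a parent of the entries after it, which is the structure a plain string comparison ignores.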

A common task is checking whether a domain name or a URL is associated with a specific domain, then taking whatever action is appropriate for that domain.

For example, AkashaCMS has two plugins which must do this. One, the External Links plugin, looks for <a href=...> tags containing outbound links and adds rel=nofollow or other attributes depending on the domain. The other, Affiliate Links, checks whether the outbound link matches a domain for which there is an affiliate relationship; if so, it adds rel=nofollow and also adds the affiliate tag if it is missing.

These modules must do things like these:

  • Not match a domain like loveamazon.com as if it is amazon.com
  • Correctly match a subdomain like images.amazon.com as being associated with amazon.com
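Both requirements can be met by comparing whole labels from the right-hand end of the name. The following is only a sketch under simple assumptions (ASCII host names, no trailing dot), and isWithinDomain is a hypothetical name, not the code used in AkashaCMS.

```javascript
// Hypothetical helper: true when `host` is `domain` itself or a subdomain of it.
function isWithinDomain(host, domain) {
    const hostLabels = host.toLowerCase().split('.');
    const domainLabels = domain.toLowerCase().split('.');
    // A host with fewer labels than the domain cannot be inside it.
    if (hostLabels.length < domainLabels.length) return false;
    // Compare whole labels from the right-hand end, never partial strings.
    const tail = hostLabels.slice(hostLabels.length - domainLabels.length);
    return tail.every((label, i) => label === domainLabels[i]);
}

console.log(isWithinDomain('loveamazon.com', 'amazon.com'));    // false
console.log(isWithinDomain('images.amazon.com', 'amazon.com')); // true
console.log(isWithinDomain('Amazon.COM', 'amazon.com'));        // true
```

Because the comparison works on labels, "loveamazon" and "amazon" are simply two different labels, and the partial overlap never comes into play.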

Example code

In AkashaCMS there was the following loop:

const url = require('url');

let href = ... // the href= attribute of the link to modify
let urlP = url.parse(href, true, true);
[
    { country: "com", domain: /amazon\.com$/i },
    { country: "ca",  domain: /amazon\.ca$/i },
    { country: "co-jp",  domain: /amazon\.co\.jp$/i },
    { country: "co-uk",  domain: /amazon\.co\.uk$/i },
    { country: "de",  domain: /amazon\.de$/i },
    { country: "es",  domain: /amazon\.es$/i },
    { country: "fr",  domain: /amazon\.fr$/i },
    { country: "it",  domain: /amazon\.it$/i }
].forEach(amazonSite => {
    let amazonCode = getAmazonAffiliateCodeForCountry(amazonSite.country);
    if (amazonSite.domain.test(urlP.hostname) && amazonCode) {
        // ... operate on the link
    }
});

The code as it stands "works" to a degree. It knows a set of Amazon domains, and uses the regular expression to match against the hostname portion of the URL.

But as I noted in the introduction, this doesn't match the domain name properly. Yes, I made sure to use the case-insensitive modifier (i) and to escape the . characters, so the pattern matches those literal characters. But did I prevent it from matching a domain like loveamazon.com? Nope.
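The failure is easy to reproduce. This short sketch tries the escaped pattern from the loop above against two made-up URLs; the hostnameOf helper is hypothetical and uses the WHATWG URL class built into Node.js.

```javascript
// The escaped, case-insensitive pattern from the loop above.
const amazonPattern = /amazon\.com$/i;

// Hypothetical helper: extract the hostname with the built-in URL class.
function hostnameOf(href) {
    return new URL(href).hostname;
}

// Made-up URLs for illustration only.
const samples = [
    'https://www.amazon.com/some/page',
    'https://loveamazon.com/some/page'
];

for (const href of samples) {
    const host = hostnameOf(href);
    console.log(host, amazonPattern.test(host));
}
// prints:
// www.amazon.com true
// loveamazon.com true
```

The pattern only requires the hostname to end with the characters amazon.com, so an unrelated domain that happens to share that ending passes the test.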

What's desired is for the match to behave the way a domain name match should. I suspect regular expressions are the predominant technique for matching domain names, but they aren't a good mechanism for the job.

Instead, as we noted earlier, the natural way to describe a subdomain match is the pattern *.amazon.com.

For example, you want to match amazon.com, www.amazon.com, and any other subdomain of amazon.com. One could encode a more complete match in a more comprehensive regular expression, e.g. /^amazon\.com$|.*\.amazon\.com$/i, which handles these particular cases. But as you account for more corner cases, the regular expression grows more and more complex. You're on a slippery slope into regular expression hell, and perhaps it's time to take a step back and reconsider the situation.
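To show how much care the regular expression route takes, here is a sketch that builds an anchored expression from a plain domain name. Both function names are hypothetical; the escaping step is the commonly used idiom for quoting regular expression metacharacters.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a pattern matching `domain` and its subdomains.
function domainRegExp(domain) {
    // (^|\.) forces the match to begin at the start of the hostname
    // or immediately after a dot, so partial labels cannot match.
    return new RegExp('(^|\\.)' + escapeRegExp(domain) + '$', 'i');
}

const re = domainRegExp('amazon.com');
console.log(re.test('amazon.com'));              // true
console.log(re.test('www.amazon.com'));          // true
console.log(re.test('loveamazon.com'));          // false
console.log(re.test('amazon.com.evil.example')); // false
```

It works for these cases, but every detail (escaping, anchoring, the dot boundary) has to be right for every domain in the list, and none of it is visible in a pattern as readable as *.amazon.com.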

Wouldn't a match expression like *.amazon.com make more sense? In other words, doesn't it make more sense to rewrite the above loop like so?

let href = ... // the href= attribute of the link to modify
// No url.parse() step here: domainMatch is given the href directly.
[
    { country: "com", domain: '*.amazon.com' },
    { country: "ca",  domain: '*.amazon.ca' },
    { country: "co-jp",  domain: '*.amazon.co.jp' },
    { country: "co-uk",  domain: '*.amazon.co.uk' },
    { country: "de",  domain: '*.amazon.de' },
    { country: "es",  domain: '*.amazon.es' },
    { country: "fr",  domain: '*.amazon.fr' },
    { country: "it",  domain: '*.amazon.it' }
].forEach(amazonSite => {
    let amazonCode = getAmazonAffiliateCodeForCountry(amazonSite.country);
    if (domainMatch(amazonSite.domain, href) && amazonCode) {
        // ... operate on the link
    }
});

The question is where to get the domainMatch function.

Introducing the Domain Match package

Try: https://www.npmjs.com/package/domain-match

Usage is as above, or:

var domainMatch = require('domain-match');
var matched = domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix/path/filename.ext');
// matched == true

In other words, you don't even have to parse the URL; the domainMatch function does it for you. More importantly, it does domain name matching the way it's supposed to be done. The matching expression is simple, straightforward, and natural to the task of matching domain names.

$ node
> const domainMatch = require('domain-match');
undefined
> domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix/path/filename.ext');
true
> domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix2/path/filename.ext');
false

Even more interesting, it matches not just the domain name but other parts of the URL as well. In this case, changing prefix to prefix2 caused the comparison to fail.
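For readers curious what such a combined host and path comparison might involve, here is a rough sketch. It is not the domain-match implementation, and matchesPattern is a hypothetical name. In particular, it assumes that *.abc.com covers both the bare domain and any subdomain, and that the path is compared on whole segments.

```javascript
// Hypothetical sketch, NOT the domain-match package's code.
function matchesPattern(pattern, href) {
    // Split the pattern into a host part and an optional path prefix.
    const slash = pattern.indexOf('/');
    const hostPattern = (slash === -1 ? pattern : pattern.slice(0, slash)).toLowerCase();
    const pathPrefix = slash === -1 ? '/' : pattern.slice(slash);

    const target = new URL(href);
    const host = target.hostname; // lowercased by the URL parser for http(s)

    let hostOk;
    if (hostPattern.startsWith('*.')) {
        const base = hostPattern.slice(2);
        // Assumption: the wildcard covers the bare domain and every subdomain.
        hostOk = host === base || host.endsWith('.' + base);
    } else {
        hostOk = host === hostPattern;
    }

    // Compare whole path segments so /prefix does not match /prefix2.
    const path = target.pathname;
    const withSlash = pathPrefix.endsWith('/') ? pathPrefix : pathPrefix + '/';
    const pathOk = path === pathPrefix || path.startsWith(withSlash);

    return hostOk && pathOk;
}

console.log(matchesPattern('*.abc.com/prefix/path',
    'http://www.abc.com/prefix/path/filename.ext'));  // true
console.log(matchesPattern('*.abc.com/prefix/path',
    'http://www.abc.com/prefix2/path/filename.ext')); // false
```

The leading dot in the endsWith check is what keeps a host like loveabc.com from matching *.abc.com.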

A related package

The domain-match package is what came up first in my search on npmjs.com. Another package popped up in a broader search:

It's curious that domain-match is so thinly used, and that there aren't more packages of this sort. Or does everyone just use regular expressions or, even worse, simple string comparison?

About the Author(s)

David Herron: David Herron is a writer and software engineer focusing on the wise use of technology. He is especially interested in clean energy technologies like solar power, wind power, and electric cars. David worked for nearly 30 years in Silicon Valley on software ranging from electronic mail systems, to video streaming, to the Java programming language, and has published several books on Node.js programming and electric vehicles.
