Correctly match URL against domain name without killing yourself with regular expressions

Date: 2017-07-05 21:41

Tags: Node.JS, JavaScript

The Internet relies on domain names as a friendlier, more humane addressing mechanism than IP addresses. That means we often write code that checks whether a URL is "within" domain A or domain B and acts accordingly. You might think a regular expression is the way to go, but it has failings: a URL looks like a text string, yet it is actually a data structure. A domain name comparison has to treat it as one and compare accordingly. Otherwise a URL with the domain name "iloveamazon.com" might match the regular expression /amazon.com$/i and do the wrong thing.

The task at hand is scanning website content for links to affiliate partners and making sure the links have rel=nofollow, affiliate tags, and so on. The work is being done for the AkashaCMS Affiliate Links plugin (akashacms.com), which simplifies making affiliate links in an AkashaCMS website.

I had the following loop:

const url = require('url');

let href = ... the href= attribute of the link to modify
let urlP = url.parse(href, true, true);
[
    { country: "com", domain: /amazon\.com$/i },
    { country: "ca",  domain: /amazon\.ca$/i },
    { country: "co-jp",  domain: /amazon\.co\.jp$/i },
    { country: "co-uk",  domain: /amazon\.co\.uk$/i },
    { country: "de",  domain: /amazon\.de$/i },
    { country: "es",  domain: /amazon\.es$/i },
    { country: "fr",  domain: /amazon\.fr$/i },
    { country: "it",  domain: /amazon\.it$/i }
].forEach(amazonSite => {
    let amazonCode = getAmazonAffiliateCodeForCountry(amazonSite.country);
    if (amazonSite.domain.test(urlP.hostname) && amazonCode) {
        ... operate on the link
    }
});

The code as it stands "works" to a degree. It knows a set of Amazon domains, and uses the regular expression to match against the hostname portion of the URL.

But as I noted in the introduction, this doesn't match the domain name properly. Yes, I made sure to use the case-insensitive modifier (i) and to escape the . characters, so the expression matches the literal text of the domain name. But did I prevent it from matching a domain like iloveamazon.com? Nope.
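To make the false positive concrete, here is a small check that runs the same style of expression against a few hostnames. The hostnames are made up for illustration; only the first two actually belong to amazon.com.

```javascript
// The same style of pattern used in the loop above.
const amazonCom = /amazon\.com$/i;

// Illustrative hostnames; only the first two are really "within" amazon.com.
const hostnames = [
    'amazon.com',
    'www.amazon.com',
    'iloveamazon.com',
    'notamazon.com'
];

// Record which hostnames the regular expression accepts.
const regexResults = hostnames.map(hostname => ({
    hostname,
    matched: amazonCom.test(hostname)
}));

// Every entry reports a match, including the two unrelated domains.
regexResults.forEach(r => console.log(`${r.hostname}: ${r.matched}`));
```

The expression only anchors at the end of the string, so anything that merely ends in "amazon.com" is accepted.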

What's desired is for the match to work the way a domain name match should. I'm sure regular expressions are the predominant technique for matching domain names, but they aren't a good mechanism for it.

For example, you want to match amazon.com, www.amazon.com, and any other subdomain of amazon.com. One could encode a more complete match in a more comprehensive regular expression, e.g. /^amazon\.com$|.*\.amazon\.com$/i, and that expression, or something like it, should handle this case. But as you account for more corner cases, the regular expression grows more and more complex. You're on a slippery slope into regular expression hell, and it's worth taking a step back to consider the situation.
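One way to step back is to drop the regular expression and compare on label boundaries: a hostname is within a domain if it equals that domain, or ends with a dot followed by that domain. The following is a minimal sketch of that idea, not a complete solution. The helper name hostnameWithin and the example URL are hypothetical, and the sketch ignores details such as trailing dots and internationalized domain names.

```javascript
const { URL } = require('url');

// Hypothetical helper (not from any library): true when hostname is the
// domain itself or one of its subdomains. Comparison is case-insensitive.
function hostnameWithin(hostname, domain) {
    const host = hostname.toLowerCase();
    const dom = domain.toLowerCase();
    // Requiring the leading '.' is what keeps iloveamazon.com from matching.
    return host === dom || host.endsWith('.' + dom);
}

// Let the URL parser extract the hostname instead of slicing the string.
// The URL below is a made-up example.
const linkHost = new URL('https://www.amazon.com/dp/EXAMPLE').hostname;

console.log(hostnameWithin(linkHost, 'amazon.com'));          // subdomain
console.log(hostnameWithin('amazon.com', 'amazon.com'));      // bare domain
console.log(hostnameWithin('iloveamazon.com', 'amazon.com')); // unrelated domain
```

The first two calls report a match and the third does not, which is the behavior the suffix-only regular expression fails to give.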

Wouldn't a match expression like *.amazon.com make more sense? In other words, wouldn't it make more sense to rewrite the above loop like so?

let href = ... the href= attribute of the link to modify
// No url.parse call here: domainMatch is given the URL string directly
[
    { country: "com", domain: '*.amazon.com' },
    { country: "ca",  domain: '*.amazon.ca' },
    { country: "co-jp",  domain: '*.amazon.co.jp' },
    { country: "co-uk",  domain: '*.amazon.co.uk' },
    { country: "de",  domain: '*.amazon.de' },
    { country: "es",  domain: '*.amazon.es' },
    { country: "fr",  domain: '*.amazon.fr' },
    { country: "it",  domain: '*.amazon.it' }
].forEach(amazonSite => {
    let amazonCode = getAmazonAffiliateCodeForCountry(amazonSite.country);
    if (domainMatch(amazonSite.domain, href) && amazonCode) {
        ... operate on the link
    }
});

The question is where to get the domainMatch function.

Try: https://www.npmjs.com/package/domain-match

Usage is as above, or:

var domainMatch = require('domain-match');
var matched = domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix/path/filename.ext');
// matched == true

In other words, you don't even have to parse the URL; the domainMatch function does it for you. More importantly, it does domain name matching the way it's supposed to be done. The matching expression in this case is simple, straightforward, and natural to the task of matching domain names.

$ node
> const domainMatch = require('domain-match');
undefined
> domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix/path/filename.ext');
true
> domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix2/path/filename.ext');
false

Even more interesting, it matches not just the domain name but other parts of the URL as well. In this case, changing prefix to prefix2 caused the URL comparison to fail.
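To picture how a pattern with both a host part and a path part might be evaluated, here is a rough sketch. This is not the domain-match source, and the package's actual rules may differ. The function name sketchMatch is hypothetical, and the sketch makes two assumptions: that a leading "*." also covers the bare domain, and that the path part is a prefix compared on segment boundaries.

```javascript
const { URL } = require('url');

// Hypothetical sketch, NOT the domain-match implementation. Matches a pattern
// such as '*.abc.com/prefix/path' against a full URL string.
function sketchMatch(pattern, href) {
    // Split the pattern into a host part and an optional path prefix.
    const slash = pattern.indexOf('/');
    const hostPattern = (slash === -1 ? pattern : pattern.slice(0, slash)).toLowerCase();
    const pathPrefix = slash === -1 ? '' : pattern.slice(slash).replace(/\/+$/, '');

    // The URL parser lowercases the hostname for http(s) URLs.
    const target = new URL(href);
    const host = target.hostname;

    let hostOk;
    if (hostPattern.startsWith('*.')) {
        // Assumption: '*.abc.com' covers abc.com and any subdomain of it.
        const base = hostPattern.slice(2);
        hostOk = host === base || host.endsWith('.' + base);
    } else {
        hostOk = host === hostPattern;
    }

    // Assumption: the path must equal the prefix or continue it at a '/'.
    const pathOk = pathPrefix === '' ||
        target.pathname === pathPrefix ||
        target.pathname.startsWith(pathPrefix + '/');

    return hostOk && pathOk;
}

console.log(sketchMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix/path/filename.ext'));
console.log(sketchMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix2/path/filename.ext'));
```

With these assumptions the first call matches and the second does not, mirroring the session above. Treating the host and the path as separate pieces is what keeps each comparison simple.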

A related package

The domain-match package is what came up first in my search on npmjs.com. Another package popped up in a broader search:

It's curious that domain-match is so thinly used. Why aren't there more packages of this sort? Or does everyone just use regular expressions or, even worse, simple string comparison?