The Internet relies on domain names as a friendlier, more humane addressing mechanism than IP addresses. That means we must often write code that checks whether a URL is "within" domain A or domain B, and acts accordingly. You might think a regular expression is the way to go, but it falls short: while a URL looks like a text string, it's actually a data structure. A domain name comparison has to recognize that it's dealing with a data structure, and compare accordingly. Otherwise a URL with the domain name "loveamazon.com" will match the regular expression /amazon.com$/i and do the wrong thing.
What do I mean when I say that a domain name is a data structure? A domain name like techsparx.com sure looks like a string, and we store it as a string in software. The first thing to notice is that a domain name like ae-9.r24.snjsca04.us.bb.gin.ntt.net is deeply nested. In production systems, hardware elements (routers, etc.) are often given deeply nested domain names reflecting geographical location and other identifiers. Or consider that many sites, like google.com, have lots of subdomains, like drive.google.com or mail.google.com.
The domain name a.b.c.example.com looks like a string, but is a nested data structure. At the top is .com, then example.com, then c.example.com, and so forth. Domain names are also case insensitive. The test of whether a domain name is a subdomain of another is not a simple case-insensitive string comparison. By convention the domain www.example.com is often treated as equivalent to example.com. The natural way to describe any subdomain of example.com is with the pattern match *.example.com, but where do we get a suitable matching algorithm? These are examples of why I'm insisting you must treat a domain name as a data structure, rather than as a simple string.
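To make the "nested data structure" idea concrete, here is a minimal sketch of a label-by-label comparison. The helper names hostLabels and isWithinDomain are hypothetical, invented for this illustration, and are not taken from any package.

```javascript
// Hypothetical helpers for illustration only; not from any published package.

// Split a hostname into its labels, top-level label first.
function hostLabels(hostname) {
  // Lowercase because domain names are case insensitive; drop a trailing root dot.
  return hostname.toLowerCase().replace(/\.$/, '').split('.').reverse();
}

// True when `hostname` equals `domain` or is nested anywhere beneath it.
function isWithinDomain(hostname, domain) {
  const host = hostLabels(hostname);
  const base = hostLabels(domain);
  if (host.length < base.length) return false;
  // Walk from the top-level label downward: com, then example, and so on.
  return base.every((label, i) => host[i] === label);
}

console.log(isWithinDomain('a.b.c.example.com', 'example.com')); // true
console.log(isWithinDomain('WWW.Example.COM', 'example.com'));   // true
console.log(isWithinDomain('loveexample.com', 'example.com'));   // false
console.log(isWithinDomain('example.com', 'c.example.com'));     // false
```

Because whole labels are compared, loveexample.com can never be mistaken for example.com, which is exactly the mistake a plain suffix comparison makes.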
A common task is matching a domain name or a URL to see if it's associated with a specific domain name, and then handling it in whatever way is appropriate for that domain.
For example, AkashaCMS has two plugins which must do this. One, the External Links plugin, looks for <a href=...> tags for outbound links, and will add rel=nofollow or other attributes depending on the domain. Another, Affiliate Links, checks whether the outbound link matches a domain for which there is an affiliate relationship; if so, it adds rel=nofollow, and additionally adds the affiliate tag if it is missing.
These modules must do things like these:
- Not match a domain like loveamazon.com as if it is amazon.com
- Correctly match a subdomain like images.amazon.com as being associated with amazon.com
Example code
In AkashaCMS there was the following loop:
const url = require('url');

let href = ... the href= attribute of the link to modify
let urlP = url.parse(href, true, true);
[
    { country: "com", domain: /amazon\.com$/i },
    { country: "ca", domain: /amazon\.ca$/i },
    { country: "co-jp", domain: /amazon\.co\.jp$/i },
    { country: "co-uk", domain: /amazon\.co\.uk$/i },
    { country: "de", domain: /amazon\.de$/i },
    { country: "es", domain: /amazon\.es$/i },
    { country: "fr", domain: /amazon\.fr$/i },
    { country: "it", domain: /amazon\.it$/i }
].forEach(amazonSite => {
    let amazonCode = getAmazonAffiliateCodeForCountry(amazonSite.country);
    if (amazonSite.domain.test(urlP.hostname) && amazonCode) {
        ... operate on the link
    }
});
The code as it stands "works" to a degree. It knows a set of Amazon domains, and uses the regular expression to match against the hostname portion of the URL.
But as I noted in the introduction, this doesn't match the domain name properly. Yes, I've made sure to use the case-insensitive modifier (i) and to escape the . characters, so it looks like I'm matching the domain name correctly. But did I prevent it from matching a domain like iloveamazon.com? Nope.
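The failure is easy to reproduce in isolation. This short check runs the same kind of suffix-anchored expression against three hostnames; the variable names are just for this illustration.

```javascript
// The same style of suffix-anchored pattern used in the loop above.
const pattern = /amazon\.com$/i;

const hostnames = ['amazon.com', 'www.amazon.com', 'iloveamazon.com'];
for (const hostname of hostnames) {
  console.log(hostname, pattern.test(hostname));
}
// amazon.com true
// www.amazon.com true
// iloveamazon.com true   <- a different domain entirely, wrongly accepted
```

The pattern only requires that the string end in "amazon.com". It says nothing about what comes before, so any hostname with that suffix passes.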
What's desired is for the match to work like a domain name match should work. While I'm sure the predominant technique for matching domain names is regular expressions, they aren't a good mechanism for matching domain names.
Instead, as we noted earlier, the natural way to describe a subdomain match is the pattern *.amazon.com.
For example, you want to match amazon.com and www.amazon.com and any other subdomain of amazon.com. One could encode a more complete match in a more comprehensive regular expression, e.g. /^amazon\.com$|.*\.amazon\.com$/i might work. But as you start accounting for more corner cases, the regular expression becomes more and more complex. You're on a slippery slope into regular expression hell, and perhaps it's necessary to take a step back and consider the situation.
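As a rough illustration of where that slope leads, here is a sketch of what it takes to generate such an expression for an arbitrary domain. The helper domainRegex is hypothetical, written only for this example. Notice that the domain has to be escaped, anchored at both ends, and given an optional subdomain prefix before it behaves.

```javascript
// Hypothetical helper: build an anchored regex for a domain and its subdomains.
function domainRegex(domain) {
  // Escape regex metacharacters so "." only matches a literal dot.
  const escaped = domain.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Optional "something followed by a dot" prefix, then the exact domain.
  return new RegExp(`^(?:.+\\.)?${escaped}$`, 'i');
}

const amazon = domainRegex('amazon.com');
console.log(amazon.test('amazon.com'));              // true
console.log(amazon.test('www.amazon.com'));          // true
console.log(amazon.test('iloveamazon.com'));         // false
console.log(amazon.test('amazon.com.evil.example')); // false
```

This handles the cases listed here, but it still treats the hostname as flat text, and every new requirement means another round of pattern surgery.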
Wouldn't a match expression like *.amazon.com make more sense? In other words, doesn't rewriting the above loop like so make more sense?
let href = ... the href= attribute of the link to modify
let urlP = url.parse(href, true, true);
[
    { country: "com", domain: '*.amazon.com' },
    { country: "ca", domain: '*.amazon.ca' },
    { country: "co-jp", domain: '*.amazon.co.jp' },
    { country: "co-uk", domain: '*.amazon.co.uk' },
    { country: "de", domain: '*.amazon.de' },
    { country: "es", domain: '*.amazon.es' },
    { country: "fr", domain: '*.amazon.fr' },
    { country: "it", domain: '*.amazon.it' }
].forEach(amazonSite => {
    let amazonCode = getAmazonAffiliateCodeForCountry(amazonSite.country);
    if (domainMatch(amazonSite.domain, href) && amazonCode) {
        ... operate on the link
    }
});
The question is where to get the domainMatch function.
Introducing the Domain Match package
Try: https://www.npmjs.com/package/domain-match
USAGE is as above, or:
var domainMatch = require('domain-match');
var matched = domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix/path/filename.ext');
// matched == true
In other words, you don't even have to parse the URL; the domainMatch function does it for you. More importantly, it does domain name matching the way it's supposed to be done. The matching expression in this case is simple, straightforward, and natural to the task of matching domain names.
$ node
> const domainMatch = require('domain-match');
undefined
> domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix/path/filename.ext');
true
> domainMatch('*.abc.com/prefix/path', 'http://www.abc.com/prefix2/path/filename.ext');
false
Even more interesting, it matches not just the domain name but other parts of the URL as well. In this case, changing prefix to prefix2 caused the URL comparison to not match.
Related packages
The domain-match package is what came up first in my search on npmjs.com. Other packages popped up in a broader search:
- https://www.npmjs.com/package/url-pattern Does full-fledged URL pattern matching, hence it does a superset of what domain-match does
- https://www.npmjs.com/package/domain-matcher Has a similar focus to domain-match
- https://www.npmjs.com/package/wildcard-domain-matcher Also has a similar focus
It's curious that domain-match is so thinly used. Why aren't there more packages of this sort? Or does everyone just use regular expressions or, even worse, simple string comparison?