Correct character encoding with DOMDocument when implementing a Wordpress content filter

Date: July 6, 2017

Tags: Wordpress, PHP

Using DOMDocument in a Wordpress content filter lets you manipulate the content as HTML rather than as raw text. Many Wordpress filters rely on regular expressions or text search/replace functions. Those can be fast and powerful, but correctly changing HTML elements requires an HTML-oriented API, because HTML syntax has too many variations (attribute order, quoting, letter case, nesting) for pattern matching to handle reliably. With DOMDocument you load the HTML into the library, use DOM functions to manipulate it, then serialize the DOM back to HTML text, and voilà, powerful HTML manipulations are easily performed. Unfortunately that method comes with its own pitfalls that you must be careful of.

There's an old saying that if you solve a software coding problem by adding a regexp (regular expression), you now have two problems. Regular expressions are a cool idea that's really hard to get right, and then really hard to maintain, because it's easy to forget why you concocted that specific regular expression. It's better to avoid regexps, for code maintainability if nothing else, and find other ways to manipulate text. That's especially true for changing HTML or URL strings: both have such strict formatting rules that it's better to use an HTML or URL parser to construct a data object.

I've created a Wordpress plugin for manipulating external links in content, such as to add rel=nofollow or icons to a link (see https://github.com/robogeek/wp-nofollow).

That means I've been reviewing both Wordpress plugins and Drupal modules with similar functionality to see how others have solved these same problems. Most use regular expressions (PHP's preg_* functions) to match text, and PHP's str_replace to make changes.
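
To illustrate why pattern matching is fragile here, the sketch below compares a naive regular expression against DOMDocument on three links that a browser would treat the same way. The sample markup and the pattern are made up for illustration and are not taken from any particular plugin.

```php
<?php
// Three equivalent links written three different ways.
$content = '<p><a href="https://example.com/a">one</a> '
         . "<a class='x' href='https://example.com/b'>two</a> "
         . '<A HREF="https://example.com/c">three</A></p>';

// Naive pattern (hypothetical): expects a lowercase tag with href as the
// first attribute, wrapped in double quotes.
preg_match_all('/<a href="([^"]+)"/', $content, $matches);
$regexHrefs = $matches[1];

// DOM approach: the parser normalizes letter case, quoting and attribute order.
$doc = new DOMDocument();
@$doc->loadHTML($content);
$domHrefs = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $domHrefs[] = $a->getAttribute('href');
}

printf("regex found %d link(s), DOM found %d link(s)\n", count($regexHrefs), count($domHrefs));
```

The pattern only matches the first link, while the DOM loop sees all three. A more elaborate pattern could cover these cases, but each new variation makes it harder to read and maintain.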

The improved technique I'm recommending is to use PHP's DOMDocument class instead. You use that class to parse the $content variable, and then you have all the DOM API calls you'd want for manipulating the markup. Kudos to the https://github.com/whyte624/wordpress-favicon-links/ plugin for teaching me this trick.

The outline of your processing filter goes like so:

function xyzzy_links_the_content($content)
{
    try {
        // The version argument is a string; passing null is deprecated as of PHP 8.1.
        $html = new DOMDocument('1.0', 'UTF-8');
        @$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);
        // ... process the $html DOM object
        return $html->saveHTML();
    } catch (Exception $e) {
        return $content;
    }
}
add_filter('the_content', 'xyzzy_links_the_content');

With this you have a properly parsed DOM object and you don't have to worry about the encoding of anything. You're manipulating objects, and then when you're done it's serialized back to HTML.

If your processing needs to inspect all <a> tags:

foreach ($html->getElementsByTagName('a') as $a) {
    // ... process each link
}

If you want to add an attribute to a specific link, like target=_blank:

$a->setAttribute('target', '_blank');

Basically with a DOM object you're free to make any HTML manipulation you want.
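
Putting the loop and setAttribute together, here is a minimal sketch that marks only external links. The helper xyzzy_is_external, the example.com site host, and the sample markup are assumptions made for this illustration. They are not the code of the wp-nofollow plugin.

```php
<?php
// Hypothetical helper: true when $href points at a host other than $siteHost.
function xyzzy_is_external($href, $siteHost)
{
    $host = parse_url($href, PHP_URL_HOST);
    // Relative URLs have no host component, so they count as internal.
    return is_string($host) && strcasecmp($host, $siteHost) !== 0;
}

$content = '<p><a href="/about">About</a> and <a href="https://example.org/">elsewhere</a></p>';

$html = new DOMDocument('1.0', 'UTF-8');
@$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);

foreach ($html->getElementsByTagName('a') as $a) {
    if (xyzzy_is_external($a->getAttribute('href'), 'example.com')) {
        $a->setAttribute('rel', 'nofollow');
        $a->setAttribute('target', '_blank');
    }
}

// Keep a handle on the links so the result can be inspected.
$links = $html->getElementsByTagName('a');
```

Setting attributes while iterating is safe because no nodes are added to or removed from the list. A real plugin would take the site host from the Wordpress configuration rather than hard-coding it.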

When you're done manipulating the HTML, the saveHTML function turns it back into HTML text and you hand it back to Wordpress.

QED? Not quite. Let's go over two issues.

Charset encoding

Up at the loadHTML call (documented on php.net) I had this:

@$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);

What's that <meta> tag? The issue is fairly well known if you read the documentation linked above. Unless the markup declares a charset, loadHTML assumes the text is ISO-8859-1 (Latin-1), so UTF-8 content gets misread. If that <meta> tag is missing, multi-byte characters are wrongly decoded, and the users of your plugin start complaining about text in their language being mangled.

The other day I removed that tag from my plugin thinking the tag had caused a different issue (that we'll discuss in the next section). The very next day some users of the plugin complained about mangled text. See: wordpress.org support topic charset-changing-since-last-update

As soon as I added the <meta> tag back, they reported the text was no longer mangled. The loadHTML documentation page on php.net has more discussion if you want the details.
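
The effect is easy to reproduce at the command line. This is a minimal sketch, assuming a one-paragraph sample and a hypothetical helper named xyzzy_first_paragraph_text. It parses the same UTF-8 text with and without the <meta> tag and checks whether the text survives the round trip.

```php
<?php
// "café" written with explicit UTF-8 bytes, so the example does not depend
// on the encoding of this source file.
$original = "caf\xC3\xA9";
$content  = '<p>' . $original . '</p>';
$meta     = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

// Hypothetical helper: parse $markup and return the text of the first <p>.
function xyzzy_first_paragraph_text($markup)
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    @$doc->loadHTML($markup);
    return $doc->getElementsByTagName('p')->item(0)->textContent;
}

$withoutMeta = xyzzy_first_paragraph_text($content);
$withMeta    = xyzzy_first_paragraph_text($meta . $content);

var_dump($withoutMeta === $original); // false: the bytes were read as a single-byte charset
var_dump($withMeta === $original);    // true: the parser was told the input is UTF-8
```

Note that the 'UTF-8' argument to the constructor does not help here, since the comparison fails without the <meta> tag even though both documents were constructed with it.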

Spurious DOCTYPE/HTML/HEAD/BODY tags

Another thing DOMDocument "helpfully" does is add enough tags to turn its input into a full HTML document. This caused a serious problem for some users of the plugin (curiously, not on my site), and it took me a while to determine the cause.

The Wordpress the_content filter receives an HTML snippet corresponding to the content area, not a full HTML document. There are lots and lots of cases where we want to use DOMDocument to process an HTML snippet. But DOMDocument adds these extra tags, and they cause a problem.

What problem? The resulting page on your Wordpress site has HTML/HEAD/BODY tags wrapping the content area, and then another set of HTML/HEAD/BODY tags wrapping the whole page. If you validate the result e.g. with the W3C Validator, it throws lots of errors at you.

To demonstrate the problem let's consider this bit of PHP code:

<?php

$html = new DOMDocument('1.0', 'UTF-8');

$content = '<meta http-equiv="content-type" content="text/html; charset=utf-8"><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ullamcorper congue risus congue viverra. Integer ex ipsum, cursus vel lectus sit amet, sollicitudin eleifend ante. Fusce eget nulla dictum, varius libero vel, sagittis tortor. Donec neque felis, faucibus eget diam vitae, hendrerit fermentum lacus. Nunc in rhoncus metus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis molestie lorem efficitur placerat convallis.</p><p>Paragraph 2</p>';

@$html->loadHTML($content);

print $html->saveHTML();

It's the same outline but in a short simple PHP script we can run at the command line.

$ php test.php
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ullamcorper congue risus congue viverra. Integer ex ipsum, cursus vel lectus sit amet, sollicitudin eleifend ante. Fusce eget nulla dictum, varius libero vel, sagittis tortor. Donec neque felis, faucibus eget diam vitae, hendrerit fermentum lacus. Nunc in rhoncus metus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis molestie lorem efficitur placerat convallis.</p><p>Paragraph 2</p></body></html>

This took the HTML we provided and wrapped the whole thing with a DOCTYPE plus <html>, <head>, and <body> tags. Imagine this being put in the center portion of a Wordpress page, encased within the wrappings of header area, sidebars, and so forth.

The issue is discussed in the saveHTML documentation: php.net manual domdocument.savehtml.php

What we want is to still use DOMDocument to manipulate the HTML as a DOM, but somehow extract what's inside the <body> tag.

The saveHTML function takes an optional argument to select a subsection of the DOM to serialize. That means the last line of the script can be replaced with this:

print $html->saveHTML($html->getElementsByTagName('body')->item(0));

That's better:

$ php test.php
<body>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ullamcorper congue risus congue viverra. Integer ex ipsum, cursus vel lectus sit amet, sollicitudin eleifend ante. Fusce eget nulla dictum, varius libero vel, sagittis tortor. Donec neque felis, faucibus eget diam vitae, hendrerit fermentum lacus. Nunc in rhoncus metus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis molestie lorem efficitur placerat convallis.</p>
<p>Paragraph 2</p>
</body>

We got rid of some of the tags, but the <body> tag is still there. A str_replace call strips it:

print str_replace(array('<body>', '</body>'), '', $html->saveHTML($html->getElementsByTagName('body')->item(0)));

This removes the <body> tag:

$ php test.php

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ullamcorper congue risus congue viverra. Integer ex ipsum, cursus vel lectus sit amet, sollicitudin eleifend ante. Fusce eget nulla dictum, varius libero vel, sagittis tortor. Donec neque felis, faucibus eget diam vitae, hendrerit fermentum lacus. Nunc in rhoncus metus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis molestie lorem efficitur placerat convallis.</p>
<p>Paragraph 2</p>
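
As an alternative to str_replace, one could serialize each child of <body> separately. This is a different technique from the one shown above, sketched here with shortened sample content. The exact whitespace between elements in the output may vary.

```php
<?php
$content = '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
         . '<p>Paragraph 1</p><p>Paragraph 2</p>';

$html = new DOMDocument('1.0', 'UTF-8');
@$html->loadHTML($content);

// Serialize the children of <body> one by one, so the <body> wrapper
// itself never appears in the output.
$body  = $html->getElementsByTagName('body')->item(0);
$inner = '';
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $inner .= $html->saveHTML($child);
    }
}

print $inner . "\n";
```

This avoids any string matching on the serialized result, at the cost of a few more lines.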

In a Wordpress filter like the snippet I showed earlier, simply use that line substituting return for print.
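
Assembled, the filter might look like the sketch below. The rel=nofollow step is only a placeholder for whatever processing you need. The function_exists guard and the early-return checks are my additions so the script also runs outside Wordpress.

```php
<?php
function xyzzy_links_the_content($content)
{
    // Nothing to process, so skip the parser entirely.
    if (trim($content) === '') {
        return $content;
    }

    $html = new DOMDocument('1.0', 'UTF-8');
    @$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);

    foreach ($html->getElementsByTagName('a') as $a) {
        // ... process each link; as a placeholder, mark every link nofollow
        $a->setAttribute('rel', 'nofollow');
    }

    $body = $html->getElementsByTagName('body')->item(0);
    if ($body === null) {
        // Parsing produced no <body>, so hand back the original content.
        return $content;
    }

    // Serialize only <body>, then strip the wrapper tags themselves.
    return str_replace(array('<body>', '</body>'), '', $html->saveHTML($body));
}

// add_filter() only exists inside Wordpress; the guard lets this run standalone.
if (function_exists('add_filter')) {
    add_filter('the_content', 'xyzzy_links_the_content');
}

$out = xyzzy_links_the_content("<p>caf\xC3\xA9 <a href=\"https://example.org/\">link</a></p>");
print $out . "\n";
```

Run at the command line, this prints the paragraph with the added rel attribute and without any DOCTYPE, <html>, <head>, or <body> wrapper.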

I still can't quite say QED, but the filter now works and the fix is straightforward.