Correct character encoding with DOMDocument when implementing a Wordpress content filter

By: David Herron; Date: July 6, 2017

Tags: Wordpress » PHP

Using DOMDocument in a Wordpress content filter lets you manipulate the content as HTML rather than as raw text. Many Wordpress filters rely on regular expressions or text search/replace functions. Those can be fast and powerful, but HTML has enough syntactic detail (attribute order, quoting, letter case, nesting) that correctly changing elements calls for an HTML-oriented API. With DOMDocument you load the HTML into the library, use DOM functions to manipulate it, then serialize the DOM back to HTML text. Unfortunately that method comes with its own pitfalls you must be careful of.

It's been said that when you solve a coding problem by adding a regexp (regular expression), you now have two problems. Regular expressions are a cool idea that is hard to get right and then hard to maintain, because it's easy to forget why you concocted a specific expression. For code maintainability if nothing else, it's better to avoid regexps and find other ways to manipulate text. That's especially true for HTML and URL strings: both have such strict formatting rules that it's better to run them through an HTML or URL parser and work with the resulting data object.
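To make the pitfall concrete, here is a minimal sketch comparing a naive pattern with DOMDocument on two links that a browser treats the same way. The sample markup and the pattern are illustrative assumptions, not taken from any particular plugin.

```php
<?php
// Two links that mean the same thing to a browser but differ textually:
// the second uses single quotes, an uppercase attribute name, and an
// extra attribute before the href.
$content = '<p><a href="https://example.com/a">one</a> '
         . "<a class='ext' HREF='https://example.com/b'>two</a></p>";

// Naive pattern: assumes a lowercase href in double quotes as the only attribute.
preg_match_all('/<a href="([^"]+)">/', $content, $matches);
$regexHrefs = $matches[1];

// DOM approach: the parser deals with case, quoting, and attribute order.
$doc = new DOMDocument();
@$doc->loadHTML($content);
$domHrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $domHrefs[] = $a->getAttribute('href');
}

print count($regexHrefs) . " link(s) found by the regex\n";
print count($domHrefs) . " link(s) found by DOMDocument\n";
```

The pattern could be extended to cover these cases, but each extension makes it harder to read, which is the maintainability problem described above.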

I've created a Wordpress plugin for manipulating external links in content, such as to add rel=nofollow or icons to a link (see https://github.com/robogeek/wp-nofollow).

That means I've been reviewing Wordpress plugins and Drupal modules with similar functionality to see how others have solved the same problems. Most use regular expressions (PHP's preg_* functions) to match text, and PHP's str_replace to make changes.

The improved technique I'm recommending is to use PHP's DOMDocument class instead. You use that class to parse the $content variable, and then you have all the DOM API calls you'd want for manipulating the markup. Kudos to the https://github.com/whyte624/wordpress-favicon-links/ plugin for teaching me this trick.

The outline of your processing filter goes like so:

function xyzzy_links_the_content($content)
{
    try {
        // '1.0' is the default version; passing null is deprecated in PHP 8.1+
        $html = new DOMDocument('1.0', 'UTF-8');
        @$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);
        // ... process the $html DOM object
        return $html->saveHTML();
    } catch (Exception $e) {
        return $content;
    }
}
add_filter('the_content', 'xyzzy_links_the_content');

With this you have a properly parsed DOM object and you don't have to worry about the encoding of anything. You're manipulating objects, and then when you're done it's serialized back to HTML.

If your processing needs to inspect all <a> tags:

foreach ($html->getElementsByTagName('a') as $a) {
    // ... process each link
}

If you want to add an attribute to a specific link, like target=_blank:

$a->setAttribute('target', '_blank');

Basically with a DOM object you're free to make any HTML manipulation you want.
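Putting the loop and setAttribute together, a rough sketch of an external-link pass might look like the following. The function name example_mark_external_links and the $siteHost parameter are hypothetical, chosen only for illustration; this is not the code of the plugin mentioned above.

```php
<?php
// Hypothetical helper: adds rel="nofollow" and target="_blank" to links
// whose host differs from $siteHost. All names here are illustrative.
function example_mark_external_links($content, $siteHost)
{
    $html = new DOMDocument('1.0', 'UTF-8');
    @$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);

    foreach ($html->getElementsByTagName('a') as $a) {
        $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        // Relative or unparsable URLs have no host; treat them as internal.
        if (!is_string($host) || strcasecmp($host, $siteHost) === 0) {
            continue;
        }
        $a->setAttribute('rel', 'nofollow');
        $a->setAttribute('target', '_blank');
    }
    return $html->saveHTML();
}

$out = example_mark_external_links(
    '<p><a href="/about">About</a> <a href="https://other.example/page">Other</a></p>',
    'my.example'
);
print $out;
```

Note that this sketch returns whatever saveHTML() produces for the whole document, so it is only a starting point for a real filter.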

When you're done manipulating the HTML, the saveHTML function turns it back into HTML text and you hand it back to Wordpress.

QED? Not quite. Let's go over two issues.

Charset encoding

Up at the loadHTML call (documented on php.net) I had this:

@$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);

What's that <meta> tag? The issue is fairly well known if you read the documentation linked above. When the markup declares no charset, loadHTML does not assume UTF-8; it has traditionally fallen back to ISO-8859-1. If that <meta> tag is missing, multi-byte UTF-8 characters are decoded wrongly, and the users of your plugin start complaining about text in their language being mangled.

The other day I removed that tag from my plugin, thinking it had caused a different issue (which we'll discuss in the next section). The very next day some users of the plugin complained about mangled text. See: https://wordpress.org/support/topic/charset-changing-since-last-update/

As soon as I added the <meta> tag back, they reported the text was no longer mangled. The loadHTML documentation page on php.net has more discussion if you want the details.
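A small command-line sketch can show the difference. It reads the text of a paragraph back out of the DOM with and without the <meta> prefix. The helper name example_first_paragraph_text is made up for this illustration, and the exact mangling in the first case may vary with the libxml2 version PHP is built against.

```php
<?php
// Hypothetical helper: returns the text of the first <p> after loadHTML has parsed $markup.
function example_first_paragraph_text($markup)
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    @$doc->loadHTML($markup);
    return $doc->getElementsByTagName('p')->item(0)->textContent;
}

// "\xC3\xA9" is the UTF-8 byte sequence for e-acute, written out explicitly
// so the example does not depend on how this file is saved.
$content = "<p>caf\xC3\xA9</p>";
$meta    = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

$without = example_first_paragraph_text($content);
$with    = example_first_paragraph_text($meta . $content);

print "without meta: " . $without . "\n";
print "with meta:    " . $with . "\n";
```

With the <meta> prefix the paragraph text should come back as the same UTF-8 bytes that went in; without it, the accented character is the part to watch.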

Spurious DOCTYPE/HTML/HEAD/BODY tags

Another thing DOMDocument "helpfully" does is add enough tags to turn its input into a full HTML document. This caused a serious problem for some users of the plugin (curiously, not on my site), and it took me a while to determine the cause.

The Wordpress the_content filter receives an HTML snippet corresponding to the content area, not a full HTML document. There are lots of cases where we want to use DOMDocument to process an HTML snippet, but the extra tags DOMDocument adds then cause a problem.

What problem? The resulting page on your Wordpress site has HTML/HEAD/BODY tags wrapping the content area, and then another set of HTML/HEAD/BODY tags wrapping the whole page. If you validate the result e.g. with the W3C Validator, it throws lots of errors at you.

To demonstrate the problem let's consider this bit of PHP code:

<?php

$html = new DOMDocument('1.0', 'UTF-8');

$content = '<meta http-equiv="content-type" content="text/html; charset=utf-8"><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ullamcorper congue risus congue viverra. Integer ex ipsum, cursus vel lectus sit amet, sollicitudin eleifend ante. Fusce eget nulla dictum, varius libero vel, sagittis tortor. Donec neque felis, faucibus eget diam vitae, hendrerit fermentum lacus. Nunc in rhoncus metus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis molestie lorem efficitur placerat convallis.</p><p>Paragraph 2</p>';

@$html->loadHTML($content);

print $html->saveHTML();

It's the same outline but in a short simple PHP script we can run at the command line.

$ php test.php
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ullamcorper congue risus congue viverra. Integer ex ipsum, cursus vel lectus sit amet, sollicitudin eleifend ante. Fusce eget nulla dictum, varius libero vel, sagittis tortor. Donec neque felis, faucibus eget diam vitae, hendrerit fermentum lacus. Nunc in rhoncus metus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis molestie lorem efficitur placerat convallis.</p><p>Paragraph 2</p></body></html>

This took the HTML we provided and wrapped the whole thing in DOCTYPE, <html>, <head>, and <body> tags. Imagine this being placed in the center portion of a Wordpress page, encased within the header area, sidebars, and so forth.

The issue is discussed in the saveHTML documentation: http://php.net/manual/en/domdocument.savehtml.php

What we want is to still use DOMDocument to manipulate the HTML as a DOM, but somehow extract what's inside the <body> tag.

The saveHTML function takes an optional argument to select a subsection of the DOM to serialize. That means the last line of the script can be replaced with this:

print $html->saveHTML($html->getElementsByTagName('body')->item(0));

That's better:

$ php test.php
<body>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ullamcorper congue risus congue viverra. Integer ex ipsum, cursus vel lectus sit amet, sollicitudin eleifend ante. Fusce eget nulla dictum, varius libero vel, sagittis tortor. Donec neque felis, faucibus eget diam vitae, hendrerit fermentum lacus. Nunc in rhoncus metus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis molestie lorem efficitur placerat convallis.</p>
<p>Paragraph 2</p>
</body>

We got rid of some of the tags, but the <body> tag is still there. A str_replace call takes care of it:

print str_replace(array('<body>', '</body>'), '', $html->saveHTML($html->getElementsByTagName('body')->item(0)));

This removes the <body> tag:

$ php test.php

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ullamcorper congue risus congue viverra. Integer ex ipsum, cursus vel lectus sit amet, sollicitudin eleifend ante. Fusce eget nulla dictum, varius libero vel, sagittis tortor. Donec neque felis, faucibus eget diam vitae, hendrerit fermentum lacus. Nunc in rhoncus metus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis molestie lorem efficitur placerat convallis.</p>
<p>Paragraph 2</p>

In a Wordpress filter like the snippet I showed earlier, simply use that line substituting return for print.
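As an alternative to the str_replace step, one could serialize each child of <body> separately, so the wrapper tag never appears in the output. This is a swapped-in variation rather than what the plugin does; the function name example_filter_snippet is hypothetical, and the Wordpress add_filter registration is left out so the sketch runs at the command line.

```php
<?php
// Hypothetical helper: parses an HTML snippet and returns it without the
// DOCTYPE/html/head/body wrappers that DOMDocument adds.
function example_filter_snippet($content)
{
    $html = new DOMDocument('1.0', 'UTF-8');
    @$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);

    // ... manipulate the DOM here ...

    $body = $html->getElementsByTagName('body')->item(0);
    if ($body === null) {
        // Nothing was parsed into a body; hand back the input untouched.
        return $content;
    }

    // Serialize the children of <body> one by one instead of <body> itself,
    // so there is no wrapper tag to strip afterwards.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $html->saveHTML($child);
    }
    return $out;
}

$result = example_filter_snippet("<p>caf\xC3\xA9</p><p>Paragraph 2</p>");
print $result;
```

Because the <meta> tag ends up in <head>, it is dropped from the result along with the other wrapper tags.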

I still can't quite say QED, but this is finished and the approach is straightforward.
