Using HTMLParser2 and DOMUtils to process HTML and XML in Node.js

; Date: Tue Dec 07 2021

Tags: Node.JS » HTML

HTMLParser2 is part of a cluster of Node.js packages (domhandler, domutils, css-select, dom-serializer) that enable powerful manipulation of both HTML and XML DOM object trees. These packages can be used not just for web scraping, but for server-side DOM manipulation, and they form most of the underpinning of Cheerio, the package for jQuery-like DOM manipulation on Node.js.

It seems that most people use HTMLParser2 and its related packages for web scraping: downloading a page and parsing out data from the HTML on that page. While that can create immensely useful information resources, these packages can be used for many other tasks involving server-side DOM manipulation of both HTML and XML data. Each of these packages focuses on a specific purpose, and when used together they give programmers powerful capabilities for ingesting XML or HTML documents, extracting data, and transforming those documents.

Unfortunately the documentation for these packages is unclear. My goal with this article is to create a useful resource for understanding how to use them.

I'm approaching this as the author of a static website generator platform, AkashaCMS. A core feature of AkashaCMS is using Cheerio to implement server-side DOM manipulation to create the final HTML that's displayed to visitors. In some cases that means fixing HTML, rewriting URLs, or converting custom tags like <embed-video> into a YouTube video player.

In using Cheerio I hadn't paid attention to the implementation. I became interested when diagnosing a problem with the latest Cheerio version. In response, I want to take a deeper look at XML and HTML processing in Node.js. The first stop was to explore HTMLParser2, DOMHandler, DOMUtils, CSS-Select, and DOM-Serializer.

I'm also interested in the range of options for server-side DOM manipulation in Node.js. I understand that in front-end engineering, many are phasing out jQuery because improvements in the DOM API have made jQuery less necessary. I'm curious whether any Node.js packages implement DOM manipulation with the sort of conciseness of the jQuery API.

My goal for this article includes evaluating DOM manipulation with DOMHandler and DOMUtils.

What is the DOM if there is no browser?

We need to have a little chat about The DOM. The DOM is not just the thing that web browsers generate based on web page content. In the normal case, for every web page displayed in a web browser, the browser converts the page into a DOM, then we use CSS to style the DOM and JavaScript to manipulate it. It's possible to implement quite advanced applications inside web browsers through browser-side DOM manipulation.

DOM, in this case, means Document Object Model, which is a cross-platform and language-independent interface that treats an XML or HTML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document with a logical tree.

In other words, XML and HTML look like text, but they're actually serialized data structures. The HTML or XML text is the serialization format. Software that reads those files, such as web browsers, deserializes the textual representation into a DOM structure. It's therefore possible to deserialize XML/HTML into a DOM, manipulate the DOM, then serialize it back to XML or HTML.

The DOM is a standard that has been under continuous development since the 1990s. Implementations exist in multiple programming languages. The standard DOM model involves Node objects of various kinds, which have attributes and contain zero or more child Node objects. One type, the Element object, represents the familiar <tag> that we use in XML or HTML.

That's where we're heading: using packages that implement an API similar to the DOM standard on Node.js. The DOM API has never been limited to browsers; implementations exist in multiple languages.

Setting up a Node.js project for HTMLParser2, DOMHandler, DOMUtils, CSS-Select, and DOM-Serializer

To explore these packages let's set up a simple project with a few example scripts. Prior to doing this, you must of course have Node.js installed on your computer. I am currently running Node.js 16.13.0, but I believe this will work on 14.x. To get started, create a directory and then run the following commands:

$ npm init -y
Wrote to /Volumes/Extra/ws/techsparx.com/projects/node.js/htmlparser2/package.json:

{
  "name": "htmlparser2",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "David Herron <david@davidherron.com>",
  "license": "ISC"
}

$ npm install htmlparser2 domhandler domutils css-select dom-serializer --save

added 11 packages, and audited 12 packages in 3s

10 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

I named the directory htmlparser2, hence npm init -y caused package.json to name the project htmlparser2. Otherwise, this has given us a blank project with this cluster of packages.

Another task is to get one or more HTML files you can work with. The file I'm using is from one of the AkashaRender test suites, and it therefore has some custom tags.

<!doctype html>
<!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ -->
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9" lang="en"> <![endif]-->
<!-- Consider adding a manifest.appcache: h5bp.com/d/Offline -->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8" />
<!-- Use the .htaccess and remove these lines to avoid edge case issues. More info: h5bp.com/i/378 -->
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<title>Show Content</title>
<meta name="foo" content="bar"/>
<funky-bump></funky-bump>
<ak-stylesheets></ak-stylesheets>
<ak-headerJavaScript></ak-headerJavaScript>
<rss-header-meta href="/rss-for-header.xml"></rss-header-meta>
<external-stylesheet href="http://external.site/foo.css"></external-stylesheet>
<dns-prefetch
control="we must have control"
dnslist="foo1.com,foo2.com,foo3.com"></dns-prefetch>
<site-verification google="We are good"></site-verification>
<xml-sitemap></xml-sitemap>
<xml-sitemap href="/foo-bar-sitemap.xml" title="Foo Bar Sitemap"></xml-sitemap>
</head>
<body>
<h1>Show Content</h1>
<section id="teaser"><ak-teaser></ak-teaser></section>
<article id="original">
    <div class="article-head"><h2>Article title</h2></div>
    <p><show-content id="simple" href="/shown-content.html"></show-content></p>
    <p><show-content id="dest" dest="http://dest.url" href="/shown-content.html"></show-content></p>
    <p><show-content id="template" 
            template="ak_show-content-card.html.ejs" 
            href="/shown-content.html"
            content-image="/imgz/shown-content-image.jpg"
            >
    Caption text
    </show-content></p>
    <p><show-content id="template2" 
            template="ak_show-content-card.html.ejs" 
            href="/shown-content.html"
            dest="http://dest.url"
            content-image="/imgz/shown-content-image.jpg">
    Caption text
    </show-content></p>

</article>
<article id="duplicate">
    <ak-insert-body-content></ak-insert-body-content>
</article>
<ak-footerJavaScript></ak-footerJavaScript>
</body>
</html>

Feel free to use this file, or any other HTML file you're interested in. I myself am of course interested in the ability to convert these custom tags to the underlying HTML. But that's not the only possible application, since the possibilities for server-side DOM manipulation are endless. For example, one could generate SVG files on the server for display in a browser.
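As a taste of that idea, here is a hedged sketch that builds a tiny SVG document as a DOM tree, using the domhandler Element class and the dom-serializer render function (the tag and attribute values are purely illustrative):

const { Element } = require('domhandler');
const render = require('dom-serializer').default;
const domutils = require('domutils');

// Build the SVG as a DOM tree rather than by pasting strings together
const svg = new Element('svg', {
    xmlns: 'http://www.w3.org/2000/svg',
    width: '100', height: '100'
});
domutils.appendChild(svg,
    new Element('circle', { cx: '50', cy: '50', r: '40', fill: 'blue' }));

// SVG is XML, so serialize in XML mode
console.log(render(svg, { xmlMode: true }));

The expected output is a well-formed <svg> element containing a self-closed <circle>. We'll cover these classes and functions in the sections that follow.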

Parse an HTML file, serializing it to HTML

Let's start with a simple example, namely to read HTML to a DOM tree, then immediately serialize it to HTML. In other words, before we start running we need to learn to crawl.

import { default as htmlparser2, Parser } from "htmlparser2";
import { default as render } from "dom-serializer";
import { default as fs, promises as fsp } from 'fs';

const rawHtml = await fsp.readFile(process.argv[2], 'utf8');

const dom = htmlparser2.parseDocument(rawHtml);

console.log(dom);

const serilzd = render(dom);

console.log(serilzd);

This is written with the ES6 module format. You may be confused by the await keyword here, but starting in Node.js 14.x it became possible to use await at the top level of ES6 modules. To learn more, see: Node.js Script writers: Top-level async/await now available

We read the file into memory, then use the parseDocument method to parse it directly into a DOM structure.

The htmlparser2 package is a SAX-style parser, meaning it emits events noting the syntax elements it found in the incoming text. Those events are not a DOM object tree. Instead, the domhandler package uses those events to produce a DOM object tree. The parseDocument method therefore instantiates domhandler behind the scenes to do so.
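To make that division of labor concrete, here is a minimal sketch (written with CommonJS requires, for reasons discussed shortly) showing first the raw SAX-style events, then a DomHandler assembling those same events into a DOM tree:

const { Parser } = require('htmlparser2');
const { DomHandler } = require('domhandler');

// Raw SAX-style events: no tree is built, we just receive callbacks
const saxParser = new Parser({
    onopentag(name, attribs) { console.log(`open: ${name}`); },
    ontext(text) { console.log(`text: ${text}`); },
    onclosetag(name) { console.log(`close: ${name}`); }
});
saxParser.write('<p>Hello <b>world</b></p>');
saxParser.end();

// The same events fed to a DomHandler, which assembles the DOM tree
const handler = new DomHandler((error, dom) => {
    if (error) console.error(error);
    else console.log(dom); // an array of top-level DOM nodes
});
const domParser = new Parser(handler);
domParser.write('<p>Hello <b>world</b></p>');
domParser.end();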

Next is the render function. This takes a DOM tree as produced by domhandler and serializes it to HTML.

Unfortunately, when we run this script we get an error: TypeError: render is not a function. While dom-serializer is implemented in TypeScript, it explicitly targets the CommonJS environment, and its default export does not interoperate cleanly with ES6 modules.

Therefore we must rewrite this example as so:

const htmlparser2 = require('htmlparser2');
const render = require('dom-serializer').default;
const fs = require('fs');
const fsp = require('fs').promises;

(async () => {

    const rawHtml = await fsp.readFile(process.argv[2], 'utf8');

    const dom = htmlparser2.parseDocument(rawHtml);
    
    console.log(dom);
    
    const serilzd = render(dom);
    
    console.log(serilzd);
    
})().catch(err => {
    console.error(err);
});

This is the same script, but using CommonJS module syntax. Because we cannot use top-level await, we have to implement an async function for running the script.

Let's make sure you understand the code structure being used in these examples. It starts with an anonymous arrow function:

async () => { ... }

Wrapped around that is a function invocation:

(async () => { ... })()

The anonymous function is instantiated inside the parentheses, and then immediately invoked. If you wanted to pass in parameters it would look like this:

(async (fileName) => {
    const rawHtml = await fsp.readFile(fileName, 'utf8');
    ...
})(process.argv[2])

Because this is an async function, it returns a Promise. To ensure you see any errors, the .catch method is required. If any data were returned from the function, you'd use a .then method to receive it.
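For instance, here is a sketch of the same script reworked so the async function returns the serialized HTML, which we then receive via .then:

(async () => {
    const rawHtml = await fsp.readFile(process.argv[2], 'utf8');
    return render(htmlparser2.parseDocument(rawHtml));
})().then(serilzd => {
    console.log(serilzd);
}).catch(err => {
    console.error(err);
});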

This program itself simply parses the HTML to a DOM, then immediately prints it out. Your output should be equivalent to the input file. There may be slight differences, but the nature of HTML allows the same data structure to be represented multiple ways. What's important is whether the output is semantically the same as the input.

Producing XHTML output from an HTML input

What if your use-case is converting HTML to XHTML? With a tiny tweak to the script we produce XHTML.

const serilzd = render(dom, {
    xmlMode: true
});

This changes the serialization to XML mode, a.k.a. XHTML.

You'll find a number of changes in this case. For example, <meta> tags now have a closing slash, making them <meta/>, and empty elements that had separate closing tags (<tag></tag>) are now a single self-closing tag (<tag/>).
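Here is a tiny sketch of the difference, using a throwaway snippet rather than the full example file (the output shown in comments is what I expect from dom-serializer):

const tiny = htmlparser2.parseDocument('<meta charset="utf-8"><div></div>');

console.log(render(tiny));
// <meta charset="utf-8"><div></div>

console.log(render(tiny, { xmlMode: true }));
// <meta charset="utf-8"/><div/>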

Using CSS-Select to find DOM elements

A typical task is using a "selector" to search for an item in the DOM to either extract some data, or to manipulate the DOM. In the HTMLParser2 world, the css-select package implements a selector syntax derived from both CSS4 and jQuery.

It doesn't provide any DOM manipulation, only the ability to select DOM nodes based on the selector.

// Demonstrate CSS selectors to extract data from XML or HTML

const htmlparser2 = require('htmlparser2');
const render = require('dom-serializer').default;
const CSSselect = require("css-select");
const fs = require('fs');
const fsp = require('fs').promises;
const util = require('util');

(async () => {

    const rawHtml = await fsp.readFile(process.argv[2], 'utf8');

    const dom = htmlparser2.parseDocument(rawHtml);

    for (let h1 of CSSselect.selectAll('h1', dom)) {
        console.log(`h1 ${render(h1)}`);
    }

    for (let articleHead of CSSselect.selectAll('article .article-head', dom)) {
        console.log(`articleHead ${render(articleHead)}`);
    }

    // Note: in a comma-separated selector list, only the first selector
    // is scoped to 'article .article-head'; h2 through h6 match anywhere
    // in the document. Here the only matching heading is the <h2> inside
    // .article-head, so the result is the same.
    for (let articleHead of CSSselect.selectAll('article .article-head h1,h2,h3,h4,h5,h6', dom)) {
        console.log(`articleHead Hn ${render(articleHead)}`);
    }

    console.log(CSSselect.selectAll('article .article-head', dom));

})().catch(err => {
    console.error(err);
});

This example shows using CSSselect.selectAll to select all elements matching the selector, then printing the HTML for the selected element. The last usage prints the raw DOM data structure, so we can familiarize ourselves with the DOM data structure generated by domhandler.

$ node css-selector.js example1.html 
h1 <h1>Show Content</h1>
articleHead <div class="article-head"><h2>Article title</h2></div>
articleHead Hn <h2>Article title</h2>
[
  <ref *1> Element {
    type: 'tag',
    parent: Element {
      type: 'tag',
      parent: [Element],
      prev: [Text],
      next: [Text],
      startIndex: null,
      endIndex: null,
      children: [Array],
      name: 'article',
      attribs: [Object]
    },
    prev: Text {
      type: 'text',
      parent: [Element],
      prev: null,
      next: [Circular *1],
      startIndex: null,
      endIndex: null,
      data: '\n    '
    },
    next: Text {
      type: 'text',
      parent: [Element],
      prev: [Circular *1],
      next: [Element],
      startIndex: null,
      endIndex: null,
      data: '\n    '
    },
    startIndex: null,
    endIndex: null,
    children: [ [Element] ],
    name: 'div',
    attribs: { class: 'article-head' }
  }
]

The first three lines in the output show the HTML found by the selectors. We got this by running the selected DOM subtree through the render function to give us the HTML snippet for that subtree.

Using render on an Element selected from the DOM serializes that Element and all the DOM nodes below it.

The last shows the actual DOM data returned from this method. Because it is selectAll, it returns an array of the matches. This array has one object, an Element instance. It has type tag, its tag name is div, and its attribs object includes a class attribute of article-head, all of which matches the HTML in the document. The children property is an array of the DOM nodes directly below this Element. There are also parent, prev, and next references, so that anybody receiving this DOM object can traverse the tree.

The technique to generate this printed output (console.log) does not print the entire object tree. Instead we see markers like [Element] indicating there is an object at that location of type Element.
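Because every node carries these standard properties, we can traverse the tree directly, with no library help. A small sketch, assuming dom was parsed as in the earlier examples:

// Recursively print the tag structure using only the node properties
function walk(node, depth = 0) {
    if (node.type === 'tag') {
        console.log(`${' '.repeat(depth * 2)}<${node.name}>`, node.attribs);
    }
    // Text and comment nodes have no children property
    for (const child of (node.children || [])) {
        walk(child, depth + 1);
    }
}
walk(dom);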

Refer back to the code we implemented. While the loop structure is fairly straightforward, it's not as succinct as the equivalent jQuery code. But notice that the selectAll method returns an array. That means the inner portion of that example can be implemented this way:

CSSselect.selectAll('h1', dom).forEach(h1 => {
    console.log(`h1 ${render(h1)}`);
});

CSSselect.selectAll('article .article-head', dom)
.forEach(articleHead => {
    console.log(`articleHead ${render(articleHead)}`);
});

CSSselect.selectAll('article .article-head h1,h2,h3,h4,h5,h6', dom)
.forEach(articleHead => {
    console.log(`articleHead Hn ${render(articleHead)}`);
});

This uses the Array.forEach method, and is much closer to the equivalent jQuery code. It means we can use other operations such as the Array.map or Array.filter methods.
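For example, Array.map lets us convert matched elements into rendered snippets in a single expression. A sketch, using the same dom and render as above:

const headings = CSSselect.selectAll('h1, h2', dom)
    .map(el => render(el));
console.log(headings);
// [ '<h1>Show Content</h1>', '<h2>Article title</h2>' ]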

Using DOMUtils to manipulate the DOM

Next, let's do a little DOM manipulation. In this HTML document you see a couple of custom HTML tags. Let's implement code to convert the custom tags into the correct HTML.

To do this, we'll use CSS-Select to select the DOM elements to work on, then use functions in the DOMUtils package to act on those elements.

const htmlparser2 = require('htmlparser2');
const domhandler = require('domhandler');
const domutils = require('domutils');
const render = require('dom-serializer').default;
const CSSselect = require("css-select");
const fs = require('fs');
const fsp = require('fs').promises;
const util = require('util');

(async () => {

    const rawHtml = await fsp.readFile(process.argv[2], 'utf8');

    const dom = htmlparser2.parseDocument(rawHtml);

    for (let fb of CSSselect.selectAll('funky-bump', dom)) {
        domutils.removeElement(fb);
    }

    for (let sm of CSSselect.selectAll('xml-sitemap', dom)) {
        // console.log(sm);
        if (sm.attribs.href) {
            const template = '<link rel="sitemap" type="application/xml" href=""/>';
            const link = htmlparser2.parseDocument(template);
            const links = CSSselect.selectAll('link', link);
            links[0].attribs.href = sm.attribs.href;
            // console.log(`sitemap link ${render(link)}`);
            domutils.replaceElement(sm, link);
        } else {
            domutils.removeElement(sm);
        }
    }
    
    const serilzd = render(dom);
    
    console.log(serilzd);
    
})().catch(err => {
    console.error(err);
});

The first loop looks for <funky-bump> elements and simply removes the tag. I added this tag for debugging purposes, and it does not need to be displayed in a production website.

The next loop looks for xml-sitemap tags. This is meant to be short-hand for the <link rel="sitemap"> tag. In other words, our task is replacing a custom tag with the actual tag. AkashaCMS supports a number of custom tags like this. The <show-content> tag you see here looks up another document on the site and shows some of its data. In AkashaCMS these tags are implemented using Cheerio and its jQuery-like functions, but this example uses CSS-Select and DOMUtils.

Ask yourself: what's the safest way to insert a URL into the href attribute of a DOM element that will then be inserted into the page's DOM? Remember that HTML is not a text format, but a data structure that's represented as text. My belief is that it's best to manipulate HTML as a data structure rather than by text substitution.

Aren't there a range of possible script injection attacks if you were to use a JavaScript template string? What I mean is this:

const replacement = `<link rel="sitemap" type="application/xml" href="${sm.attribs.href}"/>`;

While this is much simpler, isn't it open to injecting a malicious URL?

What the above example does is generate a small DOM tree from the inline template. It then sets a value into the attribs object so that the link's href attribute has the correct URL value. The domutils.replaceElement function takes a DOM tree, and we provide the DOM tree we just created.

In the other branch of this code, we use removeElement if there is no href provided.
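As an aside, instead of parsing an inline template we could construct the replacement element directly using domhandler's Element class. This is not what the example above does, just a hedged alternative sketch:

const { Element } = require('domhandler');

for (let sm of CSSselect.selectAll('xml-sitemap', dom)) {
    if (sm.attribs.href) {
        // Build the <link> element directly, no template parsing required
        const link = new Element('link', {
            rel: 'sitemap',
            type: 'application/xml',
            href: sm.attribs.href
        });
        domutils.replaceElement(sm, link);
    } else {
        domutils.removeElement(sm);
    }
}

Either way, the result serializes to the same <link> tag.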

$ node manipulate.js example1.html
<!doctype html>
    ...
<head>
<meta charset="utf-8">
<!-- Use the .htaccess and remove these lines to avoid edge case issues. More info: h5bp.com/i/378 -->
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Show Content</title>
<meta name="foo" content="bar">

<ak-stylesheets></ak-stylesheets>
<ak-headerjavascript></ak-headerjavascript>
<rss-header-meta href="/rss-for-header.xml"></rss-header-meta>
<external-stylesheet href="http://external.site/foo.css"></external-stylesheet>
<dns-prefetch control="we must have control" dnslist="foo1.com,foo2.com,foo3.com"></dns-prefetch>
<site-verification google="We are good"></site-verification>

<link rel="sitemap" type="application/xml" href="/foo-bar-sitemap.xml">
</head>
<body>
    ...
</html>

And this is the result. The <funky-bump> element disappeared, as does any speed bump. The xml-sitemap element has been replaced with the correct <link> tag.

Using Cheerio for DOM manipulation on Node.js

DOMUtils lets us manipulate the DOM. But, let's see how it stacks up against Cheerio.

Installation:

$ npm install cheerio --save

As of this writing, that installs cheerio@1.0.0-rc.10. As the name implies, the Cheerio team still feels it is not worthy of being called 1.0.

const cheerio = require('cheerio');
const fs = require('fs');
const fsp = require('fs').promises;
const util = require('util');

(async () => {

    const rawHtml = await fsp.readFile(process.argv[2], 'utf8');

    // const dom = htmlparser2.parseDocument(rawHtml);
    const $ = cheerio.load(rawHtml, {
        _useHtmlParser2: true
    });
    $('funky-bump').remove();
    for (let sm of $('xml-sitemap')) {
        if (!$(sm).attr('href')) $(sm).remove();
        else {
            let $template = cheerio.load(
                    '<link rel="sitemap" type="application/xml" href=""/>',
                    null, false);
            $template('link').attr('href', $(sm).attr('href'));
            $(sm).replaceWith($template.html());
        }
    }

    console.log($.html());
    // console.log($.root().html());
})().catch(err => {
    console.error(err);
});

This implements the same manipulations as in the previous example. Because of the jQuery-like API, the code is more succinct.

Because my example HTML uses non-standard custom HTML tags, there is an issue with using the default settings for Cheerio. By default it uses the parse5 package to parse HTML. In that mode, the default behavior moves the custom tags found in the <head> into the <body>. You can see this by commenting out the _useHtmlParser2 option. That undocumented option, as the name implies, forces the use of htmlparser2.

The documented method for using htmlparser2 is the line of code that's commented out. However, that version fails with inscrutable error messages. Using the _useHtmlParser2 option, which was found by perusing the source code, works, producing the same output as in the previous section.
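A quick sketch demonstrating the default parse5 behavior, with no _useHtmlParser2 option (the comment reflects my understanding of how parse5 treats unknown tags in <head>):

const $5 = cheerio.load('<html><head><xml-sitemap></xml-sitemap></head><body></body></html>');
console.log($5.html());
// The <xml-sitemap> element ends up inside <body> rather than <head>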

When creating the $template variable, we used cheerio.load again, just as we did at the top of the script. This method parses HTML and produces a DOM tree suitable for use with Cheerio. The null and false arguments are necessary to ensure the string is treated as an HTML fragment. Otherwise the generated DOM is wrapped with <html><body></body></html>, which then causes unwanted behavior.

In Cheerio, the .html() method serializes the DOM to text. The root method is documented as accessing the root of the document. It therefore seems more correct to use the second (commented-out) method for serializing the DOM to text, but in this example both $.html() and $.root().html() produced the same results.

Processing RSS feeds

The HTMLParser2 package includes a built-in mode for parsing RSS or Atom feeds. Let's give this a spin with an RSS feed.

Let's start by fetching a feed.

$ wget https://akashacms.com/news/rss.xml

There's a whole world of RSS feeds out there; you don't have to pick this one.

const htmlparser2 = require('htmlparser2');
const fs = require('fs');
const fsp = require('fs').promises;

(async () => {

    const rawRSS = await fsp.readFile(process.argv[2], 'utf8');
    const feed = htmlparser2.parseFeed(rawRSS, {});

    console.log(feed);

})().catch(err => {
    console.error(err);
});

There are a large number of RSS and Atom parsing packages available for Node.js. This one, for what it's worth, is easy to use.

$ node feed1.js rss.xml 
{
  type: 'rss',
  id: '',
  items: [
    {
      media: [],
      id: 'https://akashacms.com/news/2021/06/stacked-dirs.html',
      title: '<![CDATA[Stacked Directories - A directory/file watcher for static website generators]]>'
    },
    {
      media: [],
      id: 'https://akashacms.com/news/2021/05/gridjs.html',
      title: '<![CDATA[Using GridJS for fancy searchable HTML tables on statically generated websites]]>'
    },
    ...
}

This is simple, but you're probably asking what the <![CDATA[...]]> thing is. This is an XML construct called a CDATA section, which marks text that should not be parsed as markup. It happens to appear in the RSS feed generated by AkashaCMS:

<title><![CDATA[Stacked Directories - A directory/file watcher for static website generators]]></title>

If we add some options, like so:

const feed = htmlparser2.parseFeed(rawRSS, {
    recognizeCDATA: true,
    decodeEntities: true,
    recognizeSelfClosing: true
});

Then the output improves:

  items: [
    {
      media: [],
      id: 'https://akashacms.com/news/2021/06/stacked-dirs.html',
      title: '<![CDATA[Stacked Directories - A directory/file watcher for static website generators]]>',
      description: "It's very convenient when a ...."
    },
    {
      media: [],
      id: 'https://akashacms.com/news/2021/05/gridjs.html',
      title: '<![CDATA[Using GridJS for fancy searchable HTML tables on statically generated websites]]>',
      description: "There's a wide variety of ...."
    }
    ...
  ]

But the CDATA wrapper is still present in the output. One option is to manually edit the XML to remove the CDATA construct from the RSS feed.
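Alternatively, since each title is an ordinary JavaScript string at this point, we could strip the wrapper after parsing. A small sketch:

// Remove a leading/trailing CDATA wrapper from each item title
for (const item of feed.items) {
    if (item.title) {
        item.title = item.title.replace(/^<!\[CDATA\[([\s\S]*)\]\]>$/, '$1');
    }
}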

The HTMLParser2 documentation actually suggests using other RSS feed processor packages. But this example gives us a taste of using HTMLParser2 to manipulate not just HTML but XML files.

XML file processing

In the previous section we processed an XML file, specifically an RSS feed, using a specialized function in htmlparser2. But, the package can be used for general XML manipulation.

Let's start with a simple XML file:

<data>
    <hello>World</hello>
    <table>
        <row id="1" brand="Tesla" mode="electric"/>
        <row id="2" brand="Dodge" mode="gas guzzler"/>
        <row id="3" brand="Ford" mode="multiple, possibly moving to electric"/>
        <row id="4" brand="GM" mode="multiple, possibly moving to electric"/>
        <row id="5" brand="VW" 
            mode="forced move to electric after emissions fraud conviction"/>
    </table>
</data>

FWIW, some of my work in the world involves writing news articles about electric vehicles.

To start, let's replicate the first example of reading the file and immediately serializing it:

const htmlparser2 = require('htmlparser2');
const render = require('dom-serializer').default;
const fs = require('fs');
const fsp = require('fs').promises;
const util = require('util');

(async () => {

    const rawHtml = await fsp.readFile(process.argv[2], 'utf8');

    const dom = htmlparser2.parseDocument(rawHtml, {
        xml: true,
        recognizeSelfClosing: true
    });
    
    const serilzd = render(dom);
    
    console.log(serilzd);
    
})().catch(err => {
    console.error(err);
});

This example is run like so:

$ node xml-simple.js data.xml 
<data>
    <hello>World</hello>
    <table>
        <row id="1" brand="Tesla" mode="electric"></row>
        <row id="2" brand="Dodge" mode="gas guzzler"></row>
        <row id="3" brand="Ford" mode="multiple, possibly moving to electric"></row>
        <row id="4" brand="GM" mode="multiple, possibly moving to electric"></row>
        <row id="5" brand="VW" mode="forced move to electric after emissions fraud conviction"></row>
    </table>
</data>

As you see, the output is equivalent to the input. The self-closing tags were converted into the <row></row> form, for example. But it's semantically the same.

The difference from the earlier HTML examples is the two options passed to parseDocument, which enable XML mode and recognition of self-closing tags.

Next, let's use the CSS-Select package with XML:

const htmlparser2 = require('htmlparser2');
const render = require('dom-serializer').default;
const CSSselect = require("css-select");
const fs = require('fs');
const fsp = require('fs').promises;
const util = require('util');

(async () => {

    const rawHtml = await fsp.readFile(process.argv[2], 'utf8');

    const dom = htmlparser2.parseDocument(rawHtml, {
        xml: true,
        recognizeSelfClosing: true
    });
    
    for (let row of CSSselect.selectAll('row', dom)) {
        console.log(`${row.attribs.brand} - ${row.attribs.mode}`);
    }
    
})().catch(err => {
    console.error(err);
});

This is run as follows:

$ node xml-table.js data.xml 
Tesla - electric
Dodge - gas guzzler
Ford - multiple, possibly moving to electric
GM - multiple, possibly moving to electric
VW - forced move to electric after emissions fraud conviction

We're able to easily extract data from an XML file.
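As a follow-on sketch, the extracted attributes can just as easily be collected into plain JavaScript objects:

const rows = CSSselect.selectAll('row', dom).map(row => ({
    id: row.attribs.id,
    brand: row.attribs.brand,
    mode: row.attribs.mode
}));
console.log(rows);
// [ { id: '1', brand: 'Tesla', mode: 'electric' }, ... ]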

To wrap this up, let's try a little bit of DOM manipulation. We'll remove the <hello> tag, then add an attribute to every <row>.

const htmlparser2 = require('htmlparser2');
const render = require('dom-serializer').default;
const domutils = require('domutils');
const CSSselect = require("css-select");
const fs = require('fs');
const fsp = require('fs').promises;
const util = require('util');

(async () => {

    const rawHtml = await fsp.readFile(process.argv[2], 'utf8');

    const dom = htmlparser2.parseDocument(rawHtml, {
        xml: true,
        recognizeSelfClosing: true
    });
    
    for (let row of CSSselect.selectAll('row', dom)) {
        // console.log(`${row.attribs.brand} - ${row.attribs.mode}`);
        row.attribs.seen = "yes";
    }

    for (let hello of CSSselect.selectAll('hello', dom)) {
        domutils.removeElement(hello);
    }

    const serilzd = render(dom);
    
    console.log(serilzd);
    
})().catch(err => {
    console.error(err);
});

And, this program is run like so:

$ node xml-manipulate.js data.xml 
<data>
    
    <table>
        <row id="1" brand="Tesla" mode="electric" seen="yes"></row>
        <row id="2" brand="Dodge" mode="gas guzzler" seen="yes"></row>
        <row id="3" brand="Ford" mode="multiple, possibly moving to electric" seen="yes"></row>
        <row id="4" brand="GM" mode="multiple, possibly moving to electric" seen="yes"></row>
        <row id="5" brand="VW" mode="forced move to electric after emissions fraud conviction" seen="yes"></row>
    </table>
</data>

This example uses the same DOMUtils functions we discussed earlier. And we see that the same tools we used for modifying HTML DOMs also work with XML DOMs.

Summary

There are many Node.js packages for dealing with HTML, XML, and even RSS feeds. What we've done here is to explore one cluster of those packages. That gave us a good starting point, a grounding, in using these packages to read XML/HTML files, extract data, or manipulate their structure.

But what about the question of whether the APIs offered by these packages are as easy to use as jQuery and Cheerio?

Using DOMUtils and CSS-Select is not quite as succinct as using the jQuery API. But it's close, especially when coupled with normal JavaScript programming features. Arguably, accessing and changing attributes is easier this way, because it uses normal JavaScript object access and assignment, rather than working through the attr function as one does in jQuery.
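To make that concrete, here are the two styles side by side (a sketch, where row is an element selected from the DOM and $ is a Cheerio instance):

// DOMUtils / domhandler style: plain object access and assignment
row.attribs.seen = "yes";
const brand = row.attribs.brand;

// Cheerio / jQuery style: working through the attr function
$(row).attr('seen', 'yes');
const brand2 = $(row).attr('brand');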

We did not do a comprehensive comparison to see whether Cheerio/jQuery or DOMUtils et al offer more API functions. We can see that these packages offer more-or-less the same functionality, with the advantage of being closer to the DOM API standard, and the ability to directly manipulate DOM objects with normal JavaScript code.

Links

HTMLParser2

DOMHandler

DOMUtils

CSS Select

DOM Serializer

About the Author(s)

(davidherron.com) David Herron : David Herron is a writer and software engineer focusing on the wise use of technology. He is especially interested in clean energy technologies like solar power, wind power, and electric cars. David worked for nearly 30 years in Silicon Valley on software ranging from electronic mail systems, to video streaming, to the Java programming language, and has published several books on Node.js programming and electric vehicles.
