Avoid killing performance with asynchronous ES2017 JavaScript async/await functions

By: (plus.google.com) +David Herron; Date: May 12, 2018

Tags: Node.js »»»» Asynchronous Programming

While async functions in JavaScript are a gift from heaven, application performance can suffer if they're used badly. The straightforward approach in using async functions in some cases makes performance go down the drain. Some smart recoding can correct the problem, unfortunately at the cost of code clarity.  

Consider the case of a sequence of async function invocations. You want to handle all invocations before proceeding to the next step. The question is which method gives the cleanest code while still performing well.

Prototype problem & solution

A naive approach might look like this:

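const fs = require('fs').promises;
const path = require('path');

// Number of copies to make, and the directory to put them in
const num2copy = Number(process.env.PERF_NUM_COPY);
const copy2dir = process.env.PERF_DIRNM;

// Stand-in for any asynchronous operation
async function copyFile(src, dest) {
    await fs.copyFile(src, dest);
}

(async () => {

    let count = 0;
    while (count < num2copy) {
        let destfile = path.join(copy2dir, `datafile${count}.txt`);
        console.log(`COPYING ${destfile}`);
        // Wait for this copy to finish before starting the next one
        await copyFile('srcfile.txt', destfile);
        count++;
    }

})().catch(err => { console.error(err); process.exitCode = 1; });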

We have an async function, copyFile, standing in for any kind of asynchronous operation. The main action is the loop at the bottom, which copies the files one by one, waiting for each copy operation to finish before starting the next.

The code is extremely clean and easy to read, and the programmer's intention shines through clearly. But the files are copied .. One .. At .. A .. Time. That loses the opportunity to interleave execution by running the operations concurrently. There is no dependency between copyFile invocations, and they would not interfere with each other.
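To see the cost of waiting on each operation in turn, here is a minimal sketch that swaps the file copy for a timer. The names sleep, runSequential and runConcurrent are made up for this illustration, and a timer is an idealized stand-in: real file copies compete for the same disk, so the gain will be smaller than what the timer shows.

```javascript
// Stand-in for any async operation: resolves after `ms` milliseconds.
// (sleep, runSequential and runConcurrent are hypothetical names for this sketch.)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Await each operation before starting the next: total time is the sum.
async function runSequential(n, ms) {
    const start = Date.now();
    for (let i = 0; i < n; i++) {
        await sleep(ms);
    }
    return Date.now() - start;
}

// Start every operation first, then wait for all of them: total time is
// roughly that of the slowest single operation.
async function runConcurrent(n, ms) {
    const start = Date.now();
    const pending = [];
    for (let i = 0; i < n; i++) {
        pending.push(sleep(ms));
    }
    await Promise.all(pending);
    return Date.now() - start;
}

(async () => {
    console.log(`sequential: ${await runSequential(5, 50)} ms`);
    console.log(`concurrent: ${await runConcurrent(5, 50)} ms`);
})().catch((err) => { console.error(err); process.exitCode = 1; });
```

The only difference between the two functions is where the await sits: inside the loop, or once after the loop has started everything.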

If it's not clear, the file-copy example requires Node.js 10 or later because it uses the new fs.promises API. If you want to run on a previous release, substitute the third-party fs-extra module.

Consider a different implementation:

const parallelLimit = require('run-parallel-limit');

const fs = require('fs').promises;
const path = require('path');

const num2copy = Number(process.env.PERF_NUM_COPY);
const copy2dir = process.env.PERF_DIRNM;
// Maximum number of copies to run at once
const numparallel = Math.floor(process.env.PERF_PARALLEL);

async function copyFile(src, dest) {
    await fs.copyFile(src, dest);
}

(async () => {

    const tasks = [];

    // Queue one callback-style task per copy; nothing runs yet
    let count = 0;
    while (count < num2copy) {
        let destfile = path.join(copy2dir, `datafile${count}.txt`);
        tasks.push((cb) => {
            console.log(`COPYING ${destfile}`);
            copyFile('srcfile.txt', destfile)
            .then(results => { cb(undefined, results); })
            .catch(err => { cb(err); });
        });
        count++;
    }

    console.log(`num2copy ${num2copy} tasks ${tasks.length}`);

    // Wrap the callback API in a Promise so it can be awaited
    await new Promise((resolve, reject) => {
        parallelLimit(tasks, numparallel, function(err, results) {
            // gets here on final results
            if (err) reject(err);
            else resolve(results);
        });
    });

})().catch(err => { console.error(err); process.exitCode = 1; });

This creates an array of functions that will call copyFile - and it then uses parallelLimit to run the tasks in parallel. By using this function we can specify the degree of concurrency.
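If you would rather not mix callbacks into async code, the same idea can be sketched with promises alone. This is not how run-parallel-limit is implemented; runLimited is a hypothetical helper of my own, and it assumes each task is a function that returns a promise rather than one that takes a callback.

```javascript
// Run promise-returning task functions with at most `limit` in flight.
// Results come back in task order. (runLimited is a hypothetical helper,
// not part of run-parallel-limit.)
async function runLimited(tasks, limit) {
    const results = new Array(tasks.length);
    let next = 0;   // index of the next task to start

    // Each worker repeatedly claims the next unstarted task until none remain.
    async function worker() {
        while (next < tasks.length) {
            const index = next++;
            results[index] = await tasks[index]();
        }
    }

    const workerCount = Math.max(1, Math.min(limit, tasks.length));
    const workers = [];
    for (let i = 0; i < workerCount; i++) {
        workers.push(worker());
    }
    // Rejects on the first failed task; tasks already started are not cancelled.
    await Promise.all(workers);
    return results;
}

(async () => {
    let active = 0;
    let peak = 0;
    const tasks = [];
    for (let i = 0; i < 10; i++) {
        tasks.push(async () => {
            active++;
            peak = Math.max(peak, active);
            await new Promise((resolve) => setTimeout(resolve, 10));
            active--;
            return i;
        });
    }
    const results = await runLimited(tasks, 3);
    console.log(`results ${results.join(',')} peak concurrency ${peak}`);
})().catch((err) => { console.error(err); process.exitCode = 1; });
```

The demo tracks how many tasks are active at once, which is a handy check that the limit is really being honored.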

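It is tempting to use Promise.all instead. Note that Promise.all takes an array of promises, not the callback-style task functions built above, so the loop would have to collect the promises returned by copyFile into an array (called promises here) and then wait on that:

    await Promise.all(promises);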

What would happen in this case is that every task would start at the same time. What if your tasks array has thousands of entries? Would your application survive if all the thousands of tasks were to start at once? Using parallelLimit keeps the simultaneity in check (I hope that's a word) so your application doesn't blow up.
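A middle ground, if you want to stay with Promise.all, is to feed it fixed-size batches. This is a different technique from parallelLimit and a coarser one: each batch has to wait for its slowest member before the next batch starts. The helper name runInBatches is hypothetical, and it assumes tasks are functions that return promises.

```javascript
// Run promise-returning task functions in batches of `batchSize`.
// (runInBatches is a hypothetical helper for this sketch.)
async function runInBatches(tasks, batchSize) {
    if (!Number.isInteger(batchSize) || batchSize < 1) {
        throw new RangeError('batchSize must be a positive integer');
    }
    const results = [];
    for (let start = 0; start < tasks.length; start += batchSize) {
        const batch = tasks.slice(start, start + batchSize);
        // Calling task() starts the work; Promise.all waits for the whole batch.
        const batchResults = await Promise.all(batch.map((task) => task()));
        results.push(...batchResults);
    }
    return results;
}

(async () => {
    const tasks = [];
    for (let i = 0; i < 7; i++) {
        tasks.push(async () => {
            await new Promise((resolve) => setTimeout(resolve, 10));
            return `datafile${i}.txt`;
        });
    }
    const names = await runInBatches(tasks, 3);
    console.log(`finished ${names.length} tasks in batches of 3`);
})().catch((err) => { console.error(err); process.exitCode = 1; });
```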

Inspiration -- solving performance problems in AkashaCMS

Rendering techsparx.com using (akashacms.com) AkashaCMS was taking well over 1 hour 20 minutes. AkashaCMS is a static website generator system that I've developed over the last few years. This website has several hundred pages many of which include YouTube videos, and it seems that retrieving metadata from YouTube bogs down the rendering process.

My first stage of improvement was to buy a faster computer for rendering my websites and other tasks. I had an older Celeron-based Intel NUC with 8GB memory that was used to run Gogs (for Github-like service) and Jenkins (the continuous integration system). I push new website content to a repository on the NUC, and Jenkins is configured to automatically wake up and render the website. It was nice to just write content and have the system automatically take care of things. But as I added content, the time to render grew and grew.

The new NUC has a Core i5 processor, 18GB memory, and is therefore a much faster computer. Rendering techsparx.com dropped to 40-45 minutes, much better but still slow.

Then, I had an inspired thought ... which is explained in the previous section.

In AkashaCMS the processing is far more complex than that simple file copy. There are templates to render, custom tags to process, lots of YouTube URLs to retrieve metadata for, and so on.

But - as complex as the rendering process is, it fell to one function in render.js to sequence the rendering. Turns out I'd written that function as a really nice and easy to read async loop, which processed the files one-at-a-time.

I thought, what if I were to rewrite that loop to render N files at a time? The result was a new rendering loop somewhat like the second example above. And, a massive performance gain.

Rendering techsparx.com involves processing many hundreds of files and it didn't take much thought to realize the problem I named earlier. Rendering the entire site simultaneously using Promise.all would have blown something up. Instead this required constrained concurrency.

For the code difference see: (github.com) github.com akashacms akasharender compare

Nice hand-waving, let's see some numbers

It's one thing to wave your hands and make a claim. It's another thing to back it up with numbers. I have two sets of numbers to report, one using the above two applications, and one using AkashaCMS to render techsparx.com.

First, we need an input file to copy. Read the code above and you'll see it makes N copies of an input file. It's an artificial benchmark, chosen to be somewhat similar to AkashaCMS's rendering loop.

Make a dummy file of 1GB size -- this was executed on a Linux box (my rendering NUC). For macOS the invocation is slightly different.
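If you'd rather not worry about dd differences between platforms, the same kind of file can be generated from Node.js itself. This is a sketch and not what I ran for the numbers below; makeRandomFile and the command-line size argument are made up for this illustration.

```javascript
const crypto = require('crypto');
const fs = require('fs').promises;

// Write `totalBytes` of random data to `filename`, one chunk at a time so the
// whole file never has to sit in memory. (makeRandomFile is a hypothetical helper.)
async function makeRandomFile(filename, totalBytes, chunkBytes = 1024 * 1024) {
    const handle = await fs.open(filename, 'w');
    try {
        let written = 0;
        while (written < totalBytes) {
            const size = Math.min(chunkBytes, totalBytes - written);
            const { bytesWritten } = await handle.write(crypto.randomBytes(size));
            written += bytesWritten;
        }
    } finally {
        await handle.close();
    }
}

// Only runs when a size in MiB is given, e.g.: node makefile.js 1024
const sizeMiB = Number(process.argv[2]);
if (sizeMiB > 0) {
    makeRandomFile('srcfile.txt', sizeMiB * 1024 * 1024)
        .then(() => console.log(`wrote srcfile.txt (${sizeMiB} MiB)`))
        .catch((err) => { console.error(err); process.exitCode = 1; });
}
```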

$ dd if=/dev/urandom of=srcfile.txt bs=64M count=16 iflag=fullblock
16+0 records in
16+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 28.7961 s, 37.3 MB/s

Then we run the first application to establish a baseline:

$ mkdir -p d
$ rm -rf d/* ; time PERF_DIRNM=`pwd`/d PERF_NUM_COPY=20 node perftest.js 
COPYING /home/david/perftest/d/datafile0.txt
(node:20165) ExperimentalWarning: The fs.promises API is experimental
COPYING /home/david/perftest/d/datafile1.txt
COPYING /home/david/perftest/d/datafile2.txt
COPYING /home/david/perftest/d/datafile3.txt
COPYING /home/david/perftest/d/datafile4.txt
COPYING /home/david/perftest/d/datafile5.txt
COPYING /home/david/perftest/d/datafile6.txt
COPYING /home/david/perftest/d/datafile7.txt
COPYING /home/david/perftest/d/datafile8.txt
COPYING /home/david/perftest/d/datafile9.txt
COPYING /home/david/perftest/d/datafile10.txt
COPYING /home/david/perftest/d/datafile11.txt
COPYING /home/david/perftest/d/datafile12.txt
COPYING /home/david/perftest/d/datafile13.txt
COPYING /home/david/perftest/d/datafile14.txt
COPYING /home/david/perftest/d/datafile15.txt
COPYING /home/david/perftest/d/datafile16.txt
COPYING /home/david/perftest/d/datafile17.txt
COPYING /home/david/perftest/d/datafile18.txt
COPYING /home/david/perftest/d/datafile19.txt

real	9m2.562s
user	0m2.498s
sys	0m26.417s

We're making 20 copies of the file and it takes about 9 minutes to execute sequentially.

For the subsequent runs the command-line is:

$ rm -rf d/* ; time PERF_DIRNM=`pwd`/d PERF_NUM_COPY=20 PERF_PARALLEL=2 node perftest2.js 

The table of results:

Concurrency  real        user      sys
2            5m52.841s   0m1.004s  0m28.086s
3            6m21.449s   0m1.052s  0m27.989s
4            8m9.749s    0m0.878s  0m30.790s
5            7m50.680s   0m0.776s  0m29.847s
6            5m35.472s   0m0.661s  0m27.840s
7            5m54.364s   0m0.714s  0m28.648s
8            5m38.746s   0m0.720s  0m29.185s
9            5m51.299s   0m0.689s  0m28.813s
10           7m19.212s   0m1.252s  0m29.907s
11           9m3.156s    0m0.618s  0m32.067s
11           8m1.227s    0m1.043s  0m30.161s
11           5m34.230s   0m1.059s  0m28.115s
11           5m40.371s   0m0.766s  0m29.142s

I wouldn't take the specific numbers as god-given, perfectly accurate performance measurements. There's some variation if you run the same test multiple times. I ran with concurrency=11 four times to demonstrate that behavior.

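However, the trend is that concurrency helps: the faster runs cut execution time from about 9 minutes to between 5.5 and 6 minutes, a saving of roughly 35 to 40 percent rather than a full halving. In this case there isn't much difference between the levels of concurrency, and a few runs were no faster than the sequential baseline.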
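For anyone who wants to check the arithmetic, the "real" times reported by the time command can be converted to seconds and compared against the sequential baseline. A small sketch, using a handful of rows from the table above; toSeconds is a helper name I made up for this.

```javascript
// Convert the `time` output format, e.g. "9m2.562s", into seconds.
// (toSeconds is a hypothetical helper for this sketch.)
function toSeconds(text) {
    const match = /^(\d+)m([\d.]+)s$/.exec(text);
    if (!match) throw new Error(`unrecognized time: ${text}`);
    return Number(match[1]) * 60 + Number(match[2]);
}

// Sequential baseline and a few concurrent runs, copied from the results above.
const baseline = toSeconds('9m2.562s');
const runs = [
    { concurrency: 2, real: '5m52.841s' },
    { concurrency: 6, real: '5m35.472s' },
    { concurrency: 8, real: '5m38.746s' },
    { concurrency: 11, real: '9m3.156s' },
];

for (const run of runs) {
    const seconds = toSeconds(run.real);
    const speedup = baseline / seconds;
    console.log(`concurrency ${run.concurrency}: ${seconds.toFixed(1)} s, speedup ${speedup.toFixed(2)}x`);
}
```

A speedup of 2.00x would be a true halving; the copy benchmark stays below that, while the techsparx.com numbers further down go well past it.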

The next example was to render techsparx.com with different concurrency settings. You, the patient reader, will not be able to replicate this since you don't have the source code for techsparx.com so you'll have to take my word for the following table.

Concurrency  real         user        sys
1            45m57.670s   13m50.561s  0m41.197s
2            20m55.243s   13m20.353s  0m35.821s
3            16m1.700s    12m25.354s  0m31.939s
4            16m39.358s   13m6.293s   0m30.883s
5            15m3.506s    12m35.197s  0m29.757s
6            14m17.362s   12m42.524s  0m28.053s
7            13m8.556s    12m17.924s  0m27.122s
8            14m1.350s    12m46.572s  0m26.166s
9            12m48.809s   12m37.150s  0m25.969s
10           12m27.071s   12m26.288s  0m25.814s
11           11m36.765s   12m3.887s   0m25.071s
12           11m29.905s   12m1.079s   0m24.928s
13           12m3.213s    12m22.917s  0m24.878s
14           11m38.163s   12m22.178s  0m23.863s
15           11m40.029s   12m15.376s  0m24.972s
16           11m45.688s   12m2.554s   0m25.401s

Clearly a huge improvement - from 45+ minutes down to about 12 minutes. I'll take that any day of the week.

Further gains to come

Another easy-to-imagine performance gain is to cache YouTube results in a local database rather than querying over and over for the same data. I'm sure YouTube's servers are getting tired of telling me about the same videos every few hours.
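I haven't built this yet, but the shape of it is simple enough to sketch. In the sketch an in-memory Map stands in for the local database, and makeCachedLookup, fakeYouTubeLookup and the example URLs are all hypothetical names for illustration; none of this is AkashaCMS code.

```javascript
// Wrap a slow async lookup so each key is only fetched once.
// A Map stands in for the local database; a real version would persist
// entries between runs. (makeCachedLookup is a hypothetical helper.)
function makeCachedLookup(fetchFn) {
    const cache = new Map();
    return function lookup(key) {
        if (!cache.has(key)) {
            // Store the promise itself so concurrent callers share one request.
            const pending = fetchFn(key).catch((err) => {
                cache.delete(key);   // do not cache failures
                throw err;
            });
            cache.set(key, pending);
        }
        return cache.get(key);
    };
}

// Hypothetical stand-in for the slow metadata request.
let requests = 0;
async function fakeYouTubeLookup(url) {
    requests++;
    await new Promise((resolve) => setTimeout(resolve, 20));
    return { url, title: `title for ${url}` };
}

(async () => {
    const metadataFor = makeCachedLookup(fakeYouTubeLookup);
    await Promise.all([
        metadataFor('https://example.com/video-a'),
        metadataFor('https://example.com/video-a'),
        metadataFor('https://example.com/video-b'),
    ]);
    console.log(`3 lookups, ${requests} actual requests`);
})().catch((err) => { console.error(err); process.exitCode = 1; });
```

Caching the promise rather than the finished result matters once rendering is concurrent, because several pages may ask about the same video before the first answer arrives.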