Memory-efficient CSV transformation into CSV or text reports in Node.js

; Date: Fri Feb 26 2021

Tags: Node.JS

Those of us who consume/edit/modify/publish CSV files must from time to time transform a CSV file. Maybe you need to delete columns, rearrange columns, add columns, rename columns, or compute some values, taking one CSV file and producing another. Or maybe you need to convert the CSV to a simple text report. You might have a raw CSV file with no column headers and need to supply column names, or your CSV might instead have a header row.

CSV and TSV files belong to a class of simple textual data files that resemble an array. Each text line of the file contains one entry in the array, and each entry contains one or more columns of data. What differentiates CSV and TSV files is the column separator. CSV means comma-separated values, so the column separator is a comma, while TSV means tab-separated values, where a tab character separates the columns. Other separator characters are possible; for example, the traditional Unix /etc/passwd file uses colons to separate fields.
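The separator idea can be seen with plain string splitting. The sketch below (which deliberately ignores quoting, and is why a real parser is preferred) breaks a colon-separated /etc/passwd line and a comma-separated line into fields the same way:

```javascript
// A line from /etc/passwd uses ':' as the field separator.
const passwdLine = 'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin';
const passwdFields = passwdLine.split(':');
console.log(passwdFields[0]); // daemon

// The same idea with a comma separator for CSV data.
// Note: naive splitting breaks on quoted fields like "Smith, John",
// which is why a real parser such as csv-parse is preferred.
const csvLine = 'John Smith,jon@smith.xyzzy,Member';
const csvFields = csvLine.split(',');
console.log(csvFields[1]); // jon@smith.xyzzy
```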

In other words, there is quite a bit of variance in the format of CSV or TSV files. It's therefore important to use a framework to help with reading and writing CSV data. In this tutorial we'll use CSV for Node.js, a comprehensive suite of Node.js modules maintained by Adaltas.

CSV for Node.js documentation: https://csv.js.org/parse/

There are many Node.js packages for CSV manipulation. The chosen module is a fully featured mature package that handles most (or all) variations of CSV files.

We'll use "CSV" as a short-hand for all variations of this type of file. CSV files are widely used in all kinds of industries. In many cases we write ad-hoc scripts to process CSV files for custom purposes.

There is a companion article dealing with importing CSV to MySQL: Easily export or import data between CSV files and SQLite3 or MySQL databases with Node.js

This tutorial is updated and expanded from one written in 2016.

Adding the CSV for Node.js suite to a Node.js project

The CSV for Node.js packages are of course distributed through the npm registry. To get started with these packages you must have a Node.js project directory:

$ mkdir project
$ cd project
$ npm init -y
$ npm install csv --save

This creates a blank project directory, and installs the csv package. There are four individual packages in the CSV project, and you may want to use just one rather than install all four.
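If you only need part of the suite, the individual packages can be installed separately. As published on npm, they are:

```shell
npm install csv-parse         # parse CSV text into records
npm install csv-stringify     # generate CSV text from records
npm install stream-transform  # transform records flowing through a stream
npm install csv-generate      # generate sample CSV data for testing
```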

Transforming CSV to a text report in Node.js

Let's start with a simple example, converting a CSV file containing names and e-mail addresses into a simple text report. My inspiration for this came a couple months ago when Yahoo Groups was being shut down. I had two groups that needed to transition to another mailing list solution. That meant exporting the CSV file containing the membership list, and producing a text file in the format required by the new mailing list service.

The membership list started with this line:

"Display Name",Email,Status,Delivery,Posting,Joined

CSV files often start with a line like this, where these values are the name for each column. The purpose is similar to a field name in an SQL database schema. These names are to be treated as field names for data in the CSV file.
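What the parser does with that header row can be sketched in plain JavaScript: pair each header name with the corresponding value in a data row (a simplified sketch that ignores quoting):

```javascript
// The first line supplies the field names...
const header = ['Display Name', 'Email', 'Status', 'Delivery', 'Posting', 'Joined'];
// ...and each subsequent line supplies the values for one record.
const row = ['John Smith', 'jon@smith.xyzzy', 'Member', 'Yes', 'Yes', 'Yes'];

// Pair each column name with its value to build a record object,
// similar to what csv-parse produces when `columns: true` is set.
const record = Object.fromEntries(header.map((name, i) => [name, row[i]]));
console.log(record['Display Name']); // John Smith
```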

For the purpose of testing, create a file named list-members.csv containing this:

"Display Name",Email,Status,Delivery,Posting,Joined
John Smith,jon@smith.xyzzy,Member,Yes,Yes,Yes
IN Cognito,(Not disclosed),Lapsed,No,No,Yes
,frank@brown.xyzzy,Member,Yes,Yes,Yes

In this case I only needed the Display Name and Email fields, but I also needed to exclude entries where the Email field was (Not disclosed).

import parse from 'csv-parse';
import fs from 'fs';

const parser = parse({
    delimiter: ',',
    columns: true,
    relax_column_count: true
});
parser.on('error', function(err) {
    console.error(err.message)
});
parser.on('end', function() {
    // console.log(output);
});
parser.on('readable', function() {
    let record;
    while ((record = parser.read()) !== null) {
        if (record['Email'] === '(Not disclosed)') continue;
        console.log(`${record['Display Name']} <${record['Email']}>`);
    }
})

fs.createReadStream('list-members.csv').pipe(parser);

This script is written as an ES6 module, so save it with a .mjs extension.

The parser module can be used in several modes; in this case we're using the stream mode. At the bottom we use fs.createReadStream to open our input file, piping it into the parser stream handler. The parser is configured to use commas as the field separator, to look for column names in the first row of the CSV file, and to be relaxed about the number of columns.

In stream mode, the paradigm is to emit events for different conditions in the data being streamed. The error event is emitted when there's an error, and the end event is emitted when everything is finished. The readable event is emitted when there is data which can be read. We read that data in a loop using the parser.read function. Because we set columns to true, each record arrives as an object whose keys are the column names.

To run this script:

$ node ./convert-list.mjs

Output will be text like this:

John Smith <jon@smith.xyzzy>
 <frank@brown.xyzzy>

Suppose your input file does not have column names in its first row. We set columns to true, which says to look for column names in the first line. The columns option can instead be set to an array supplying the column names:

const parser = parse({
    delimiter: ',',
    // from_line: 2,  Use this to skip the first line
    columns: [
        "Name", "Email", "Status", "Delivery", "Posting", "Joined"
    ],
    relax_column_count: true
});
...
        console.log(`${record['Name']} <${record['Email']}>`);
...

Here the columns serve the same purpose as earlier, but instead of Display Name we've named the first column Name.

We can supply our own CSV column names even if the file has them. Doing so tells csv-parse not to look for column names in the first line; instead, the first line will be treated as CSV data. Therefore, if that first line does contain column names, we need to skip it. Setting the from_line option tells csv-parse to start from the second line of the file.

Using an Async Iterator to parse CSV file in Node.js

We don't have to use the stream mode parser. It's also possible to use csv-parse as an asynchronous iterator, reading the CSV file with a simple for loop.

import parse from 'csv-parse';
import fs from 'fs';

const parser = fs.createReadStream(`./list-members.csv`)
  .pipe(parse({
    delimiter: ',',
    columns: true,
    relax_column_count: true
  }));

for await (const record of parser) {
    if (record['Email'] === '(Not disclosed)') continue;
    console.log(`${record['Display Name']} <${record['Email']}>`);
}

One of the JavaScript changes ushered in by ES6 is the iteration protocol, which gave us this nifty for ... of construct. It means we can iterate over a variety of things using a simple for loop. In this case csv-parse lets us iterate over the lines of a CSV file using a for loop.
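The iteration protocol is not specific to csv-parse; any object implementing it works with for ... of. A generator function is the simplest way to see it, sketched here with made-up row data:

```javascript
// A generator implements the iteration protocol, so for...of works on it.
function* rows() {
    yield { Name: 'John Smith', Email: 'jon@smith.xyzzy' };
    yield { Name: 'IN Cognito', Email: '(Not disclosed)' };
}

// Iterate the generated records exactly as we iterate parsed CSV records.
const names = [];
for (const record of rows()) {
    names.push(record.Name);
}
console.log(names.length); // 2
```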

Adding the await clause here requires Node.js 14.8 or later. That release introduced the ability to use the await keyword at the top level of a script, as outlined here: Node.js Script writers: Top-level async/await now available

The for await (... of ...) loop construct means the iterator can produce values asynchronously. In this case the data becomes available from parser as it is read, and parsed, from disk.

To run this script:

$ node ./convert-list-async.mjs

Transforming a CSV file to another CSV file in Node.js

For our last example, let's pull in two more packages from the CSV suite. The first, stream-transform, is a general purpose tool for transforming data being sent through a stream. The second, csv-stringify, is about generating a CSV file from a stream that is emitting data objects.

We're commonly required to produce a CSV file by processing another CSV file. For example in a job a few years ago, I dealt with a huge number of CSV files produced by a software system maintained by another team in the company. I needed to modify their CSV files for the specific needs of my team. A script somewhat like the following did the trick.

import parse from 'csv-parse';
import transform from 'stream-transform';
import stringify from 'csv-stringify';
import fs from 'fs';

fs.createReadStream('list-members.csv')
    .pipe(parse({
        delimiter: ',',
        from_line: 2,
        columns: [
            "Name", "Email", "Status", "Delivery", "Posting", "Joined"
        ],
        relax_column_count: true
    }))
    .pipe(transform(function (data) {
        // Skip data we don't want
        if (data['Email'] === '(Not disclosed)') return null;
        // Transform data to a new structure
        return { Name: data['Name'], Email: data['Email'] };
    }))
    .pipe(stringify({
        delimiter: ',',
        relax_column_count: true,
        skip_empty_lines: true,
        header: true,
        // This names the resulting columns for the output file.
        columns: [ "Name", "Email" ]
    }))
    .pipe(process.stdout);

This is structured into multiple stages. The first parses the CSV data into a stream of objects. The transform stage operates on those objects, skipping some of them, and converting the rest into a different structure. The last, stringify, converts the objects into a new CSV file.

In this case we found it useful to rename the Display Name column to just Name, to simplify accessing that field in the record objects.

The stream-transform package takes a function. Because the parser produces a stream of objects, that's what we receive in our function. This function is invoked once for each record in the CSV file.

To skip a record, simply return null as shown here.

To transform a record, create a new object and return it. In this case we create an object, assigning into it two fields from the input object. Use your imagination to come up with your desired transformation.
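The skip-and-reshape logic is the same shape as a filter plus a map over an array. A plain-JavaScript equivalent of the transform function, using sample records for illustration:

```javascript
const input = [
    { Name: 'John Smith', Email: 'jon@smith.xyzzy', Status: 'Member' },
    { Name: 'IN Cognito', Email: '(Not disclosed)', Status: 'Lapsed' }
];

// Returning null from the transform function corresponds to filtering
// a record out; returning a new object corresponds to mapping it
// into a different structure.
const output = input
    .filter(data => data['Email'] !== '(Not disclosed)')
    .map(data => ({ Name: data['Name'], Email: data['Email'] }));
console.log(output.length); // 1
```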

The next-to-last stage is configured to produce a comma-separated CSV. The first line will be the column headers passed in the columns attribute.

The last stage is simply piping the new CSV file to the standard output. This is where you would instead pipe it to a file.
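What the stringify stage produces can be sketched by hand: a header line, then each record's values joined with the delimiter. This is a simplification; a real stringifier such as csv-stringify also handles quoting and escaping:

```javascript
const columns = ['Name', 'Email'];
const records = [
    { Name: 'John Smith', Email: 'jon@smith.xyzzy' }
];

// Header line first, then one comma-joined line per record.
const csvText = [
    columns.join(','),
    ...records.map(r => columns.map(c => r[c]).join(','))
].join('\n');
console.log(csvText);
// Name,Email
// John Smith,jon@smith.xyzzy
```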

To better understand what's going on, let's split that script up a bit:

import parse from 'csv-parse';
import transform from 'stream-transform';
import stringify from 'csv-stringify';
import fs from 'fs';

const parser = fs.createReadStream('list-members.csv')
    .pipe(parse({
        delimiter: ',',
        from_line: 2,
        columns: [
            "Name", "Email", "Status", "Delivery", "Posting", "Joined"
        ],
        relax_column_count: true
    }));

const transformer = parser.pipe(transform(function (data) {
        // For inspecting the data we receive from the parser stage
        // console.log(data);
        // Skip data we don't want
        if (data['Email'] === '(Not disclosed)') return null;
        // Transform data to a new structure
        return { Name: data['Name'], Email: data['Email'] };
    }));


for await (const record of transformer) {
    // For inspecting the data we receive from the transformer stage
    // console.log(record);
    console.log(`${record['Name']} <${record['Email']}>`);
}

What we've done is split that stream pipeline into discrete units. We can examine the data received from the transformer stage, and see that we receive objects containing Name and Email fields. Producing the desired text file is just a matter of a different console.log statement.

One question: can the transformer stage handle asynchronous functions?

const transformer = parser.pipe(transform(async function (data) {
        // Skip data we don't want
        if (data['Email'] === '(Not disclosed)') return null;
        // Transform data to a new structure
        return { Name: data['Name'], Email: data['Email'] };
    }));

This gives us a series of undefined values: the transformer does not know what to do with the Promise an async function returns. (stream-transform does support asynchronous operation, but through a callback argument to the handler rather than a returned Promise.) For example we might want to look up information from a database to add to the data record. We cannot use an async transform function to do so. Instead, our choice for such a need is a for await loop.

Let's introduce an asynchronous operation, and then send the data to the stringify phase:


const stringifier = stringify({
    delimiter: ',',
    relax_column_count: true,
    skip_empty_lines: true,
    header: true,
    // This names the resulting columns for the output file.
    columns: [ "Name", "Email", 'result' ]
});

for await (const record of transformer) {
    // Mimic an asynchronous operation such as looking up data in a database
    record.result = await new Promise((resolve, reject) => {
        if (false) reject('Error message');
        else resolve('Asynchronous result');
    });
    stringifier.write(record);
    // console.log(`${record['Name']} <${record['Email']}>`);
}

stringifier.pipe(process.stdout);

stringifier.end();

The stringifier object is the same stage we used earlier to convert the stream to a new CSV file, except we've added a new column, result, to the output.

At the bottom we ran stringifier.pipe to send its output to the standard output, and used stringifier.end to close the stream.

In between is a for await loop that handles the output from the transformer. For each record it performs an asynchronous operation. We don't have such an operation handy, so we've instantiated a Promise to mimic calling an async operation. The result is added to the object received from the transformer, which will then have fields named Name, Email, and result. Calling stringifier.write inserts the new object into the stringifier stage.

$ node ./process-transform-async.mjs
Name,Email,result
John Smith,jon@smith.xyzzy,Asynchronous result
,frank@brown.xyzzy,Asynchronous result

The output now is something like that. There are three columns matching what was in the stringify options, and the third column has the expected text from the asynchronous operation.

What does "Memory efficient" mean?

Since the process pipes data through Node.js streams, it is extremely memory efficient. The only memory in use is the current data item plus some data in the stream buffers. Some CSV libraries instead read the entire CSV into memory before processing begins. For a small CSV that is fine, but for large CSVs it is a bad idea.
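The constant-memory principle can be demonstrated with nothing but Node.js built-ins: read the file in fixed-size chunks, carrying any partial line over to the next chunk, so memory use stays bounded regardless of file size. A minimal sketch (ASCII data assumed, since a chunk boundary could otherwise split a multi-byte character):

```javascript
import fs from 'fs';

// Create a small sample file to process.
fs.writeFileSync('sample.csv', 'a,1\nb,2\nc,3\n');

// Read the file in small fixed-size chunks instead of all at once.
// At any moment only `chunkSize` bytes plus a partial-line carry
// are held in memory, no matter how large the file is.
const fd = fs.openSync('sample.csv', 'r');
const chunkSize = 4; // tiny on purpose, to show the principle
const buf = Buffer.alloc(chunkSize);
let carry = '';
let lines = 0;
let bytesRead;
while ((bytesRead = fs.readSync(fd, buf, 0, chunkSize, null)) > 0) {
    carry += buf.toString('utf8', 0, bytesRead);
    let nl;
    while ((nl = carry.indexOf('\n')) >= 0) {
        lines++; // one complete line is now available in `carry`
        carry = carry.slice(nl + 1);
    }
}
fs.closeSync(fd);
fs.unlinkSync('sample.csv');
console.log(lines); // 3
```

This is essentially what a streaming parser like csv-parse does for you, with real CSV semantics layered on top.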

For example, the CSV files at the company I mentioned earlier were data captured from a solar array. Each row represented one minute of data and could contain a couple hundred columns, depending on the configuration of the data sensors. It's easy to find CSV files containing gigabytes of data, which would tend to cause out-of-memory errors simply because the file is too large to be practically processed in memory.

You're better off using a stream mode CSV processor like csv-parse.