Memory-efficient CSV transformation in Node.js

By: David Herron; Date: 2016-11-14 17:06

Tags: Node.JS

Those of us who consume, edit, modify, or publish CSV files must from time to time transform a CSV file. Maybe you need to delete columns, rearrange columns, add columns, rename columns, or compute some values, taking one CSV file and producing another. In my case, I have a raw CSV file with no column headers that's organized in a way which makes sense for one team in our company, but we need that same data organized a different way, with different column names and containing only selected fields. The following script came from that need, and I managed to write it in a fairly generic way. It not only extracts and renames columns, but with a bit of coding could perform other transformations.

As such this script performs a map operation, meaning it takes an input CSV and produces an output CSV with the same number of rows. The row contents are of course different, but the number of records is the same for input and output. A reduce or filter operation, both of which decrease the number of rows, would be difficult with this script as it is written.

The script relies on the CSV Suite for Node.js: http://csv.adaltas.com/

/*
 * This script demonstrates a simple CSV transformation that's
 * formulated to use minimal memory.  The processing is done via
 * piping using the Node.js Streams interface.
 *
 * This transformation is to extract selected columns from the
 * input file, then write to another file using different column names.
 *
 * The `transform` section could make other changes such as adding
 * columns together.
 */
'use strict';

const parse     = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');
const fs        = require('fs-extra-promise');

const infname   = process.argv[2];
const outfname  = process.argv[3];

const inputFields = [
    // List field names for input file
];

const extractFields = [
    // List field names to extract from input
];

const outputFields = [
    // List field names in the output file
];

fs.createReadStream(infname)
.pipe(parse({
    delimiter: ',',
    // Use columns: true if the input has column headers
    // Otherwise list the input field names in the array above.
    columns: inputFields
}))
.pipe(transform(function(data) {
    // This sample transformation selects out fields
    // that will make it through to the output.  Simply
    // list the field names in the array above.
    return extractFields
    .map(nm => { return data[nm]; });
}))
.pipe(stringify({
    delimiter: ',',
    relax_column_count: true,
    skip_empty_lines: true,
    header: true,
    // This names the resulting columns for the output file.
    columns: outputFields
}))
.pipe(fs.createWriteStream(outfname));

The input file name and output file name are given on the command line. It helps if the input file has CSV headers, but as written the script does not require them. Column headers are a feature not used in all CSV files: in some files the first row gives a name for each column. Such a file is more useful since documentation of the fields is in the file itself. But obviously not everyone does this, and perhaps some software would choke on the column names.

In this script, if your input file has column names then make a change in the first stage:

.pipe(parse({
    delimiter: ',',
    columns: true
}))

Otherwise, list the column names in the inputFields array.

The second stage is the transformation. The algorithm shown here simply extracts the fields named in the extractFields array. You can rename columns, reorder columns, and eliminate columns this way.
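To see what the extraction does to a single row, the callback can be tried on its own without any CSV file. This is only a sketch: the field names and the sample row are made up for illustration, and extractRow is a hypothetical name for the function passed to transform.

```javascript
'use strict';

// Hypothetical field names, for illustration only.
const extractFields = [ 'email', 'lastName', 'firstName' ];

// Same logic as the transform stage in the script: pick the
// listed fields out of the parsed row, in the order listed.
function extractRow(data) {
    return extractFields.map(nm => data[nm]);
}

// A made-up row, shaped like what the parser emits when
// `columns` is set: an object keyed by column name.
const sample = {
    firstName: 'Ada',
    lastName:  'Lovelace',
    dept:      'R&D',
    email:     'ada@example.com'
};

const row = extractRow(sample);
console.log(row);
```

The dept field is dropped because it is not listed, and the remaining values come out in the order given by extractFields rather than the order in the input.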

Other transformations can be performed. This function is called once per row, and its return value becomes the new value for that row. Hence, as written, the transformation neither adds nor deletes rows, meaning the output file has the same number of rows as the input.
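As one example of another transformation, here is a rough sketch of a callback that adds two columns together, as mentioned in the comment at the top of the script. The column names name, q1 and q2 and the function name withTotal are assumptions made up for this example.

```javascript
'use strict';

// Hypothetical input columns: 'name', 'q1', 'q2'.
// Returns one output row per input row: the name plus a computed total.
function withTotal(data) {
    // The parser hands over field values as strings by default,
    // so convert them to numbers before adding.
    const total = Number(data.q1) + Number(data.q2);
    return [ data.name, total ];
}

const result = withTotal({ name: 'widget', q1: '3', q2: '4' });
console.log(result);
```

To use it, pass withTotal to transform in place of the extraction function, and list two matching names in outputFields. It is still a map operation: one row in, one row out.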

The last stage outputs the CSV using the column names you specify in outputFields.
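Because the transform stage returns each row as an array, my understanding is that the names in outputFields line up with the entries of extractFields by position, so the two arrays should have the same length and order. The following sketch is a small sanity check that could be run before starting the pipeline. The helper name pairFields and the field names are hypothetical.

```javascript
'use strict';

// Hypothetical field lists, for illustration only.
const extractFields = [ 'email', 'lastName' ];
const outputFields  = [ 'Email Address', 'Surname' ];

// Pair each extracted input field with the output name at the
// same index. Throws if the two lists cannot line up one-to-one.
function pairFields(extract, output) {
    if (extract.length !== output.length) {
        throw new Error(
            `extractFields has ${extract.length} entries, ` +
            `outputFields has ${output.length}`);
    }
    return extract.map((nm, i) => [ nm, output[i] ]);
}

const pairs = pairFields(extractFields, outputFields);
console.log(pairs);
```

Printing the pairs makes it easy to spot a renamed column that ended up next to the wrong input field.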

Since the process uses pipes it is extremely memory efficient. In an earlier version of this script I used a variant of the CSV parser which read the entire CSV into an array before processing could occur. For a large CSV file the Node.js process ran out of memory, and I had to learn how to adjust the Node.js heap size. With pipes the memory footprint at any one time is minimal.
