Memory-efficient CSV transformation in Node.js

By: (plus.google.com) +David Herron; Date: 2016-11-14 17:06

Tags: Node.JS

Those of us who consume/edit/modify/publish CSV files must from time to time transform a CSV file. Maybe you need to delete columns, rearrange columns, add columns, rename volumes, or compute some values taking one CSV file and producing another. In my case, I have a raw CSV file with no column headers that's organized in a way which makes sense for one team in our company, but we need that same data organized a different way, with different column names and containing selected fields. The following is what came from that need, which I managed to write in a fairly generic way. It not only extracts and renames columns, but with a bit of coding could perform other transformations.

As such this script performs a map operation, meaning it takes an input CSV and produces an output CSV with the same number of rows. The row contents are of course different, but the count of datums in the CSV is the same for input and output. With this script it would be difficult to perform a reduce or filter operation, because both decrease the number of rows, which would be difficult with this script as it is written.

The script relies on the CSV Suite for Node.js: (csv.adaltas.com) http://csv.adaltas.com/

/*
 * This script demonstrates a simple CSV transformation that's
 * formulated to use minimal memory.  The processing is done via
 * piping using the Node.js Streams interface.
 *
 * This transformation is to extract selected columns from the
 * input file, then write to another file using different column names.
 *
 * The `transform` section could make other changes such as adding
 * columns together.
 */
'use strict';

const parse     = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');
const fs        = require('fs-extra-promise');

const infname   = process.argv[2];
const outfname  = process.argv[3];

const inputFields = [
    // List field names for input file
];

const extractFields = [
    // List field names to extract from input
];

const outputFields = [
    // List field names in the output file
]

fs.createReadStream(infname)
.pipe(parse({
    delimiter: ',',
    // Use columns: true if the input has column headers
    // Otherwise list the input field names in the array above.
    columns: inputFields
}))
.pipe(transform(function(data) {
    // This sample transformation selects out fields
    // that will make it through to the output.  Simply
    // list the field names in the array above.
    return extractFields
    .map(nm => { return data[nm]; });
}))
.pipe(stringify({
    delimiter: ',',
    relax_column_count: true,
    skip_empty_lines: true,
    header: true,
    // This names the resulting columns for the output file.
    columns: outputFields
}))
.pipe(fs.createWriteStream(outfname));

The input file name and output file name are given on the command line. It's a good idea if the input file has CSV headers, but as written the script does not require column headers. What we mean by that is a feature not used in all CSV files. In some cases the first row of a CSV file gives a name for each column. Such a file is more useful since documentation of the fields are in the file. But obviously not everyone does this, and perhaps some software would choke on the column names.

In this script, if your input file has column names then name a change in the first stage:

.pipe(parse({
    delimiter: ',',
    columns: true
}))

Otherwise, list the column names in the inputFields array.

The second stage is the transformation. The algorithm shown here simply extracts the fields named in the extractFields array. You can rename columns, reorder columns, and eliminate columns this way.

Other transformations can be performed. This function will be called once per row, and the return value from the function constitutes the new value for the row. Hence, the transformation cannot add nor delete rows, meaning the transformed file has the same number of rows on output as for input.

The last stage outputs the CSV using the column names you specify in outputFields.

Since the process uses pipes it is extremely memory efficient. In an earlier version of this script I used a variant of the CSV parser which read the entire CSV into an array before processing could occur. For a large CSV file the Node.js process ran out of memory, and I had to learn how to adjust the Node.js heap size. With pipes the memory footprint at any one time is minimal.

« Useful reading to understand the Promises, Generators and the async/await feature for Node.js/JavaScript The advent of async/await for Node.js - Node.js v7 has now arrived »
2016 Election 2018 Elections Acer C720 Ad block Affiliate marketing Air Filters Air Quality Air Quality Monitoring AkashaCMS Amazon Amazon Kindle Amazon Web Services America Amiga and Jon Pertwee Android Anti-Fascism AntiVirus Software Apple Apple Flexgate Apple Hardware History Apple Hardware Mistakes Apple iPhone Apple iPhone Hardware April 1st Arduino ARM Compilation Artificial Intelligence Astronomy Astrophotography Asynchronous Programming Authoritarianism Automated Social Posting AWS DynamoDB AWS Lambda Ayo.JS Bells Law Big Brother Big Data Big Finish Big Science Bitcoin Mining Black Holes Blade Runner Blockchain Blogger Blogging Books Botnets Cassette Tapes Cellphones China China Manufacturing Christopher Eccleston Chrome Chrome Apps Chromebook Chromebox ChromeOS CIA CitiCards Citizen Journalism Civil Liberties Climate Change Clinton Cluster Computing Command Line Tools Comment Systems Computer Accessories Computer Hardware Computer Repair Computers Conservatives Cross Compilation Crouton Cryptocurrency Curiosity Rover Currencies Cyber Security Cybermen Cybersecurity Daleks Darth Vader Data backup Data Formats Data Storage Database Database Backup Databases David Tenant DDoS Botnet Department of Defense Department of Justice Detect Adblocker Developers Editors Digital audio Digital Nomad Digital Photography Direct Attach Storage Diskless Booting Disqus DIY DIY Repair DNP3 Do it yourself Docker Docker MAMP Docker Swarm Doctor Who Doctor Who Paradox Doctor Who Review Drobo Drupal Drupal Themes DuckDuckGo DVD E-Books E-Readers Early Computers eGPU Election Hacks Electric Bicycles Electric Vehicles Electron Eliminating Jobs for Human Emdebian Encabulators Energy Efficiency Enterprise Node EPUB ESP8266 Ethical Curation Eurovision Event Driven Asynchronous Express Face Recognition Facebook Fake Advertising Fake News Fedora VirtualBox Fifth Doctor File transfer without iTunes FireFly Flash Flickr Fraud Freedom of Speech Front-end Development G Suite Gallifrey Gig Economy git Github GitKraken Gitlab GMAIL Google Google Adsense Google Chrome Google Gnome Google+ Government Spying Great Britain Green Transportation Hate Speech Heat Loss Hibernate High Technology Hoax Science Home Automation HTTP Security HTTPS Human ID I2C Protocol Image Analysis Image Conversion Image Processing ImageMagick In-memory Computing Incognito Mode InfluxDB Infrared Thermometers Insulation Internet Internet Advertising Internet Law Internet of Things Internet Policy Internet Privacy iOS iOS Devices iPad iPhone iPhone hacking Iron Man iShowU Audio Capture iTunes Janet Fielding Java JavaFX JavaScript JavaScript Injection JDBC John Simms Journalism Joyent jQuery Kaspersky Labs Kext Kindle Kindle Marketplace Large Hadron Collider Lets Encrypt LibreOffice Linux Linux Hints Linux Single Board Computers Logging Mac Mini Mac OS Mac OS X Mac Pro MacBook Pro Machine Learning Machine Readable ID Macintosh macOS macOS High Sierra macOS Kext MacOS X setup Make Money Online Make Money with Gigs March For Our Lives MariaDB Mars Mass Violence Matt Lucas MEADS Anti-Missile Mercurial MERN Stack Michele Gomez Micro Apartments Microsoft Military AI Military Hardware Minification Minimized CSS Minimized HTML Minimized JavaScript Missy Mobile Applications Mobile Computers MODBUS Mondas Monetary System MongoDB Mongoose Monty Python MQTT Music Player Music Streaming MySQL NanoPi Nardole NASA Net Neutrality Network Attached Storage Node Web Development Node.js Node.js Database Node.js Performance Node.js Testing Node.JS Web Development Node.x North Korea npm NSA NVIDIA NY Times Online advertising Online Community Online Fraud Online Journalism Online News Online Photography Online Video Open Media Vault Open Source Open Source and Patents Open Source Governance Open Source Licenses Open Source Software OpenAPI OpenJDK OpenVPN Palmtop PDA Patrick Troughton PayPal Paywalls Personal Flight Peter Capaldi Peter Davison Phishing Photography PHP Plex Plex Media Server Political Protest Politics Postal Service Power Control President Trump Privacy Private E-mail server Production use Public Violence Raspberry Pi Raspberry Pi 3 Raspberry Pi Zero ReactJS Recaptcha Recycling Refurbished Computers Remote Desktop Removable Storage Renewable Energy Republicans Retro Computing Retro-Technology Reviews RFID Rich Internet Applications Right to Repair River Song Robotics Robots Rocket Ships RSS News Readers rsync Russia Russia Troll Factory Russian Hacking Rust SCADA Scheme Science Fiction SD Cards Search Engine Ranking Search Engines Season 1 Season 10 Season 11 Security Security Cameras Server-side JavaScript Serverless Framework Servers Shell Scripts Silence Simsimi Skype SmugMug Social Media Social Media Networks Social Media Warfare Social Network Management Social Networks Software Development Software Patents Space Flight Space Ship Reuse Space Ships SpaceX Spear Phishing Spring Spring Boot Spy Satellites SQLite3 SSD Drives SSD upgrade SSH SSH Key SSL Stand For Truth Strange Parts Swagger Synchronizing Files Tegan Jovanka Telescopes Terrorism The Cybermen The Daleks The Master Time-Series Database Tom Baker Torchwood Total Information Awareness Trump Trump Administration Trump Campaign Twitter Ubuntu Udemy UDOO US Department of Defense Video editing Virtual Private Networks VirtualBox VLC VNC VOIP Vue.js Walmart Weapons Systems Web Applications Web Developer Resources Web Development Web Development Tools Web Marketing Webpack Website Advertising Website Business Models Weeping Angels WhatsApp William Hartnell Window Insulation Windows Windows Alternatives Wordpress World Wide Web Yahoo YouTube YouTube Adpocalypse YouTube Monetization