Tags: Node.JS
Protocol Buffers is a Google-developed toolchain for binary encoding of data and objects that works between programming languages. It is the basis of gRPC, a cross-language remote procedure call system, but can be used separately.
Software engineering has always faced the challenge of sharing data between applications. A number of techniques exist; for example, it is common to use a text-based data format like JSON, TOML, XML, or YAML. These formats can be read or written by programs in any programming language thanks to suitable libraries. But storing data in these formats requires more disk space, more network bandwidth when transmitting over the Internet, and more CPU time to encode and decode.
Another choice is to use a binary data format. For example, before the Internet became popular, the ISO protocol suite was expected to be the basis of wide-scale networking. The ISO protocols were built on a binary format, ASN.1, which was tightly specified to the umpteenth degree and could support any sort of binary data encoding need. ASN.1 is a largely forgotten thing, except to someone like me who was working to bring an ISO protocol stack implementation to Unix systems thirty years ago. Today, it is a historical example of a binary data format with guaranteed portability between applications written in different programming languages.
Google has this to say about Protocol Buffers:
Protocol buffers provide a language-neutral, platform-neutral, extensible mechanism for serializing structured data in a forward-compatible and backward-compatible way. It’s like JSON, except it's smaller and faster, and it generates native language bindings.
The phrase language neutral means that protocol buffers bindings are available for most popular programming languages, and platform neutral means the bindings are available for multiple chip architectures. The phrase structured data means protocol buffers encode data according to a schema declaration, and that a schema can nest definitions within other definitions.
Using protocol buffers starts with describing the schema in .proto files. These files describe the format of a data block which is to be encoded into a binary message buffer. The .proto files can be compiled into code in any of almost a dozen programming languages. The compiled module gives you code for encoding data to binary format, and decoding from binary format. What your application does with the encoded data is up to you.
It is intended that data encoded as a protocol buffer be used in a network protocol to communicate data over the Internet, hence the name of the project. Google uses protocol buffers widely in internal applications, for example. But, don't let that intended usage limit your imagination.
The official documentation is at: https://developers.google.com/protocol-buffers
The code shown in this article is available at: https://github.com/robogeek/nodejs-protocol-buffers
Getting started with Protocol Buffers on Node.js
We'll start this by using the Google-developed protocol buffers tools.
The first requirement is the protoc compiler for converting protocol buffer definitions into code. You might be lucky, and the package manager for your computer has a package containing the protocol buffers tools.
For example, on my macOS laptop I use MacPorts to supply open source tools. The protobuf3-cpp package is relatively up-to-date and claims to include the compiler.
On Ubuntu, I found the protobuf-compiler package to contain the correct tools. Installation on Ubuntu is therefore this simple:
$ sudo apt-get install protobuf-compiler
Failing that, you can head to https://github.com/protocolbuffers/protobuf/releases to retrieve prebuilt packages provided by the Protocol Buffers team. Or you can download the source code and compile it yourself.
The goal of this exercise is a command-line tool named protoc, which is the protocol buffers compiler.
$ protoc
Usage: protoc [OPTION] PROTO_FILES
Parse PROTO_FILES and generate output based on the options given:
The usage message is printed when the command is run with no arguments.
There are two versions of the Protocol Buffers specification language. For this tutorial we'll be using version 3 of this language, and the documentation is on-line at: https://developers.google.com/protocol-buffers/docs/proto3
The website, unfortunately, does not contain a tutorial for usage on Node.js. You can peruse official tutorials for other languages.
Google does provide a Node.js package at: https://www.npmjs.com/package/google-protobuf
And the source repository for that package contains documentation that's roughly like a tutorial: https://github.com/protocolbuffers/protobuf-javascript/blob/main/docs/index.md
Defining a simple data format using Protocol Buffers
For the rest of this tutorial we'll go over a simple data format, specified with protocol buffers, and a pair of Node.js scripts to encode and decode data using that data format. It will be a simple Todo object similar to what I implemented in a sample application written to explore the then newly released Bootstrap v5.
See: https://techsparx.com/nodejs/examples/todo-bootstrap/
Create a project directory:
$ mkdir protobuf
$ cd protobuf
$ npm init -y
$ npm install google-protobuf --save
This requires installing an external dependency, the protoc compiler, described in the previous section.
In the directory create a file named todo.proto, and put this at the top of the file:
syntax = "proto3";
This tells protoc to use version three. If you don't do this, the compiler will print a warning containing the text No syntax specified for the proto file, and tell you to add the above text to your file.
The TODO object in my sample application contains four fields:
- An id which is a number identifying the object
- A title string that is presented in TODO lists
- A body string that is presented when the user is looking at the one TODO item
- A precedence enum giving high/medium/low priorities
In protocol buffers the definition looks like this:
message Todo {
    int64 id = 1;
    string title = 2;
    string body = 3;
    Precedence precedence = 4;
}
The word message starts the definition of an object. The message block has a name, in this case Todo, giving the class name. Inside the body are one or more field definitions. Each field has a data type and a name. The number assignment is the field number, not anything like a default value. The field number determines where the data is encoded within a record.
The last field is defined with the type Precedence. This isn't built into the protocol buffers language, but is defined by this application:
enum Precedence {
    PRECEDENCE_NONE = 0;
    PRECEDENCE_LOW = 1;
    PRECEDENCE_MEDIUM = 2;
    PRECEDENCE_HIGH = 3;
}
We chose to use an enum type to describe the permissible values for the precedence field. In my sample application, the values for LOW, MEDIUM, and HIGH were specified as shown here. Initially this enum was specified with just those three values, and PRECEDENCE_NONE was not there. But the compiler gave this error:
The first enum value must be zero in proto3.
The first field in an enum must have the value 0. To maintain compatibility with the values (1, 2, 3) used in the prior Todo application, I wanted to keep the same values. That meant introducing PRECEDENCE_NONE with a value of 0.
It appears from the protocol buffers documentation that enum declarations can be placed within the body of message Todo {...}, but in the sample implementation it is kept separate.
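For illustration, a hedged sketch of what that nesting might look like (this variant is not used in the sample code):

message Todo {
    // The enum can be declared inside the message that uses it
    enum Precedence {
        PRECEDENCE_NONE = 0;
        PRECEDENCE_LOW = 1;
        PRECEDENCE_MEDIUM = 2;
        PRECEDENCE_HIGH = 3;
    }
    int64 id = 1;
    string title = 2;
    string body = 3;
    Precedence precedence = 4;
}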
The Todo message defines a single object. The field types used in it are what the documentation calls Scalar Value Types. There are a dozen or so of them, boiling down to integers, floats, booleans, byte sequences, and strings, and each has a precisely defined encoding so that values can be correctly written to, and read back from, binary form.
Another issue to consider is the field numbers. Over the evolution of an application you may need to change the object definition. You could reuse an existing field number to store a value of a different type, but that would break applications you've already deployed in the field. Instead, it is a best practice to leave older field definitions in place, or to use the reserved keyword to block out the older field numbers. The purpose is maintaining backwards compatibility with older releases of your application.
For example, we might want to use Markdown in the body field. The existing body field is for simple text, not Markdown. We might change the message definition to this:
message Todo {
    int64 id = 1;
    string title = 2;
    string bodyMD = 5;
    Precedence precedence = 4;
    reserved 3;
}
This change renames the body field to bodyMD to make it clear the field stores Markdown. The new field is given field number 5, and field number 3 is marked as reserved. An alternative is to leave the old body field definition in place, and have your application ignore that field.
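That alternative might look like this sketch, where the old field keeps its number and the new field is added alongside it:

message Todo {
    int64 id = 1;
    string title = 2;
    string body = 3;      // no longer used, kept for compatibility
    Precedence precedence = 4;
    string bodyMD = 5;
}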
A message like Todo describes only a single object. You may want to send a list of Todo items, and therefore need a way to specify an array of objects.
In protocol buffers you describe that as so:
message Todos {
    repeated Todo todos = 1;
}
We've defined a new object type, Todos. The repeated keyword is how we define what is essentially an array. It means the message contains zero or more instances of the Todo object, positioned at field number 1. Roughly speaking, that makes it an array. This object could easily have other fields in it, if that's needed for your application, as sketched below.
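For example, a hedged sketch of an extended container; the owner field here is hypothetical and not part of the sample code:

message Todos {
    repeated Todo todos = 1;
    string owner = 2;     // hypothetical extra field alongside the list
}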
Generating Node.js source for the TODO protobuf schema
We've created a complete schema. The next step is converting it into source code we can use in an application. This is why we installed protoc earlier.
This compiler can generate source code for multiple languages. BTW, if you remember back to your computer science classes, a compiler translates source code from one programming language to a completely different language. Hence protoc translates protocol buffers source code into any of several other programming languages.
Reading the Protocol Buffers documentation website might leave you scratching your head: we're aiming to use it with Node.js, but there's no documentation of Node.js usage there. That's a question for the Protocol Buffers team to answer: why don't they publish Node.js documentation on their site?
Go to: https://www.npmjs.com/package/google-protobuf
Then follow a few links to end up at: https://github.com/protocolbuffers/protobuf-javascript/blob/main/docs/index.md
Those two pages contain Google-authored documentation on using the Google implementation of protocol buffers, including usage of protoc to generate Node.js (JavaScript) code. It's puzzling why Google doesn't expand on this to include documentation on the main website. It may be because of the warnings in the package documentation: as of July 2022 the project status is that the JavaScript implementation is somewhat broken, which the team is trying to rectify.
In package.json add this scripts entry:
"scripts": {
"protoc": "protoc --js_out=import_style=commonjs,binary:. todo.proto"
},
It is a best practice to include commands like this in this file so that you don't have to spend precious brain cells remembering trivia.
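With that entry in place, the compiler is one command away:

$ npm run protoc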
For JavaScript output, there are two import styles. One supports CommonJS imports with the require function, the traditional usage in Node.js; the other is the Google Closure style of import. Since we're targeting Node.js, we specify commonjs. The binary option causes generation of functions to serialize to binary, and deserialize from binary. The :. portion of the option specifies the output directory, in this case the current directory. Using :build/gen instead would make the output directory ./build/gen. The last part of the command specifies the input file, or files, in this case todo.proto.
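For example, to send the generated module into ./build/gen instead (the directory must already exist), the command would look like this:

$ mkdir -p build/gen
$ protoc --js_out=import_style=commonjs,binary:build/gen todo.proto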
The compiler does a good job with error messages. A couple of them were shown earlier, and it was easy to determine what to do.
This command creates a file, todo_pb.js, in the current directory. It is instructive to read this file. You'll see that JavaScript object definitions are created for Todo, Todos, and Precedence.
At the top is this:
var jspb = require('google-protobuf');
Our code does not use google-protobuf directly; it is the generated code which does so. You'll see liberal use of functions from that package all through the generated code. To ensure the package is available, we installed it earlier.
Encoding data for protocol buffers
In a real application we might have a request handler function collecting some data, and needing to format it using protocol buffers to send a reply. In our case we want to demonstrate the core step of generating a protocol buffers object and then serializing it to a binary file. In the following script we'll demonstrate deserializing that binary file to read the data.
Create a file named encode.mjs (since ES6 modules are the future of JavaScript, we use them wherever possible). Start with this:
import { default as Schema } from './todo_pb.js';
import { promises as fsp } from 'fs';
// console.log(Schema);
const todos = new Schema.Todos();
let todo = new Schema.Todo();
todo.setId(1);
todo.setTitle("Buy cheese");
todo.setBody("PIZZA NIGHT");
todo.setPrecedence(Schema.Precedence.PRECEDENCE_HIGH);
todos.addTodos(todo);
The generated code is in CommonJS format, and there does not seem to be an option to generate an ES6 module. This import pattern was deemed to be the most useful. We also import fs/promises as fsp so we have async filesystem functions.
The generated code lets us use new Schema.Todos() and new Schema.Todo() to create the corresponding objects.

For the Todo object I was unable to find a nifty way to set field values. Instead we have to call the set methods as shown here. Once the Todo object is created, add it to the Todos object using todos.addTodos.
Repeat the last bit of code as many times as you like, adjusting the values as desired. End the script with this:
console.log(todos.toObject());
await fsp.writeFile('todos.bin', todos.serializeBinary());
The first gives you visual feedback of the object you've created. The toObject method converts the protocol buffers object into a normal JavaScript object.

The last line writes the data to a file, todos.bin. The serializeBinary method converts the object into a binary blob, which is then written to the file.
The output will look something like this:
{
  todosList: [
    { id: 1, title: 'Buy cheese', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 2, title: 'Buy sauce', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 3, title: 'Buy Spinach', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 4, title: 'Buy ham', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 5, title: 'Buy olives', body: 'PIZZA NIGHT', precedence: 3 }
  ]
}
In our house we make pizza from scratch every Saturday night.
We can also inspect the binary file:
$ od -c todos.bin
0000000 \n 035 \b 001 022 \n B u y c h e e s e
0000020 032 \v P I Z Z A N I G H T 003 \n
0000040 034 \b 002 022 \t B u y s a u c e 032 \v
0000060 P I Z Z A N I G H T 003 \n 036 \b
0000100 003 022 \v B u y S p i n a c h 032 \v
0000120 P I Z Z A N I G H T 003 \n 032 \b
0000140 004 022 \a B u y h a m 032 \v P I Z Z
0000160 A N I G H T 003 \n 035 \b 005 022 \n B
0000200 u y o l i v e s 032 \v P I Z Z A
0000220 N I G H T 003
0000230
Look carefully at the bytes. There are what appear to be field numbers just before the text of each title, and what appears to be a length for each string, and so forth. If you like, the documentation includes a detailed description of this format. The important thing to notice at this point is that our data is in this file.
Another thing to notice is the relative size difference. The text form takes many more bytes than the binary form.
Deserializing a protocol buffers message using Node.js
Because protocol buffers is language-neutral, we could deserialize the data using code written in a different language. But this is about doing so in Node.js, so let's focus on that.
Create a file named decode.mjs containing:
import { default as Schema } from './todo_pb.js';
import { promises as fsp } from 'fs';
const todosBin = await fsp.readFile('todos.bin');
const todos = Schema.Todos.deserializeBinary(todosBin);
console.log(todos);
console.log(todos.toObject());
This simply reads todos.bin and deserializes the data into an object. We then print the object itself, and the toObject form as well.
{
  wrappers_: { '1': [ [Object], [Object], [Object], [Object], [Object] ] },
  messageId_: undefined,
  arrayIndexOffset_: -1,
  array: [ [ [Array], [Array], [Array], [Array], [Array] ] ],
  pivot_: 1.7976931348623157e+308,
  convertedPrimitiveFields_: {}
}
{
  todosList: [
    { id: 1, title: 'Buy cheese', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 2, title: 'Buy sauce', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 3, title: 'Buy Spinach', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 4, title: 'Buy ham', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 5, title: 'Buy olives', body: 'PIZZA NIGHT', precedence: 3 }
  ]
}
The first gives you a glimpse under the covers of protocol buffers objects. We don't need to delve into the details, but it's interesting to see this. The second is the same data we saw above, indicating we successfully transferred the data from one application to another.
Transferring data between applications using protocol buffers
At a high level, the process is:
- Define the schema in .proto files, then generate code for all languages of interest
- Transferring data starts by generating a protocol buffer message object, then calling its serializeBinary method
- Receiving the data is done by calling the deserializeBinary method, converting the message buffer into a protocol buffers object; you then use that data in your application (a sketch of sending a buffer over HTTP follows below)
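For example, a minimal sketch of sending the encoded buffer to another application over HTTP, assuming the generated todo_pb.js module from above and Node.js 18 or later for the built-in fetch; the URL and route are hypothetical:

// send-todos.mjs -- a sketch, not part of the sample repository
import { default as Schema } from './todo_pb.js';

const todos = new Schema.Todos();
const todo = new Schema.Todo();
todo.setId(1);
todo.setTitle("Buy cheese");
todo.setBody("PIZZA NIGHT");
todo.setPrecedence(Schema.Precedence.PRECEDENCE_HIGH);
todos.addTodos(todo);

// Send the binary encoding as the request body; the receiver would call
// Schema.Todos.deserializeBinary on the bytes it reads.
await fetch('http://localhost:3000/todos', {
    method: 'POST',
    headers: { 'Content-Type': 'application/octet-stream' },
    body: todos.serializeBinary()
});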
Using an alternate protocol buffers implementation for Node.js/JavaScript
The official Google protocol buffers implementation leaves some things to be desired when using it with Node.js. Browsing the npm repository, we find several other packages covering the same space. One is protobuf.js, which bills itself as a pure JavaScript implementation, with TypeScript support, that runs on both Node.js and browsers.
There are two packages:
- protobufjs contains runtime support for using protocol buffers objects, parsing and using the schemas, and more
- protobufjs-cli is a command-line tool that can be seen as roughly equivalent to protoc
The documentation (https://www.npmjs.com/package/protobufjs) talks about two usage modes:
- Load the .proto files directly without compilation, and right away start calling methods on objects (a sketch of this mode appears below)
- Compile the .proto files to static classes, similarly to what's shown above
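The first mode, for reference, looks roughly like this sketch based on the protobufjs documentation; the schema is parsed at runtime and no generated module is involved:

import protobuf from 'protobufjs';

// Parse todo.proto at runtime instead of compiling it ahead of time
const root = await protobuf.load('todo.proto');
const Todo = root.lookupType('Todo');

const payload = { id: 1, title: 'Buy cheese', body: 'PIZZA NIGHT', precedence: 3 };
const err = Todo.verify(payload);   // returns null, or an error message string
if (err) throw new Error(err);

const buffer = Todo.encode(Todo.create(payload)).finish();
console.log(Todo.decode(buffer));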
For this tutorial we'll use the second mode so that it's easier to contrast with the code we just walked through.
Install the packages this way:
$ npm install protobufjs protobufjs-cli --save
The latter installs two commands, pbjs and pbts, which support JavaScript and TypeScript usage, respectively.
To compile the .proto file to usable code, run this:
$ npx pbjs -t static-module -w commonjs -o dist-pbjs/todo.js todo.proto
# OR, for ES6 code generation
$ npx pbjs -t static-module -w es6 -o dist-pbjs/todo-es6.mjs todo.proto
This targets generating a "static module", meaning it creates source code. The module format will be either CommonJS or ES6, depending on your preference. Note that for the ES6 module we used the .mjs extension for Node.js compatibility.
It's useful to examine the generated code, if only because that's the most practical way to learn the generated API. I found the project documentation to be lacking, and the generated code was clean enough to understand directly how to use the package. Here is the encoder rewritten to use the generated module:
import { default as Schema } from './dist-pbjs/todo.js';
import { promises as fsp } from 'fs';

const todos = new Schema.Todos();
todos.todos.push(new Schema.Todo({
    id: 1,
    title: "Buy Cheese",
    body: "PIZZA NIGHT",
    precedence: Schema.Precedence.PRECEDENCE_HIGH
}));
// ...
console.log(Schema.Todos.toObject(todos));
await fsp.writeFile('todos-protobufjs.bin', Schema.Todos.encode(todos).finish());
Usage is a little different with this package. For example, we can instantiate an object instance using a properties object. Studying the generated source code, we see that properties are generated to which we can directly assign values.
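Based on reading the generated code, direct assignment appears to work as well; a sketch:

// Equivalent to the constructor-with-properties form shown above
const todo = new Schema.Todo();
todo.id = 2;
todo.title = "Buy sauce";
todo.body = "PIZZA NIGHT";
todo.precedence = Schema.Precedence.PRECEDENCE_HIGH;
todos.todos.push(todo);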
The toObject and encode methods are not attached to the object instance, but are instead static methods of the class. Hence, we call Schema.Todos.toObject rather than todos.toObject.
Run the application and we see this output for the toObject representation:
{
  todos: [
    { id: 1, title: 'Buy Cheese', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 2, title: 'Buy sauce', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 3, title: 'Buy Spinach', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 4, title: 'Buy ham', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 5, title: 'Buy olives', body: 'PIZZA NIGHT', precedence: 3 }
  ]
}
It's roughly the same as before. The output file is exactly the same size as with the previous example:
$ ls -l todos*
-rw-rw-r-- 1 david david 152 Aug 22 11:38 todos.bin
-rw-rw-r-- 1 david david 152 Aug 22 11:37 todos-protobufjs.bin
This validates the idea that protocol buffers is language-neutral, because we used two different protocol buffers implementations to generate the same file.
With a small change to decode.mjs we can name the data file on the command line.
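The only change is reading the file name from process.argv rather than hard-coding it:

// decode.mjs, adjusted to take the file name as an argument
import { default as Schema } from './todo_pb.js';
import { promises as fsp } from 'fs';

const todosBin = await fsp.readFile(process.argv[2]);
const todos = Schema.Todos.deserializeBinary(todosBin);
console.log(todos);
console.log(todos.toObject());

With that in place, we can decode the file generated by protobuf.js: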
$ node decode.mjs todos-protobufjs.bin
{
  wrappers_: { '1': [ [Object], [Object], [Object], [Object], [Object] ] },
  messageId_: undefined,
  arrayIndexOffset_: -1,
  array: [ [ [Array], [Array], [Array], [Array], [Array] ] ],
  pivot_: 1.7976931348623157e+308,
  convertedPrimitiveFields_: {}
}
{
  todosList: [
    { id: 1, title: 'Buy Cheese', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 2, title: 'Buy sauce', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 3, title: 'Buy Spinach', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 4, title: 'Buy ham', body: 'PIZZA NIGHT', precedence: 3 },
    { id: 5, title: 'Buy olives', body: 'PIZZA NIGHT', precedence: 3 }
  ]
}
This demonstrates generating a protocol buffers file with one implementation and decoding it with another.
A decoder written with protobufjs looks like this:
import { default as Schema } from './dist-pbjs/todo.js';
import { promises as fsp } from 'fs';
const todosBin = await fsp.readFile(process.argv[2]);
const todos = Schema.Todos.decode(todosBin);
console.log(Schema.Todos.toObject(todos).todos);
Execution (node decode-protobufjs.mjs todos-protobufjs.bin) looks like this:
[
  {
    id: Long { low: 1, high: 0, unsigned: false },
    title: 'Buy Cheese',
    body: 'PIZZA NIGHT',
    precedence: 3
  },
  {
    id: Long { low: 2, high: 0, unsigned: false },
    title: 'Buy sauce',
    body: 'PIZZA NIGHT',
    precedence: 3
  },
  {
    id: Long { low: 3, high: 0, unsigned: false },
    title: 'Buy Spinach',
    body: 'PIZZA NIGHT',
    precedence: 3
  },
  {
    id: Long { low: 4, high: 0, unsigned: false },
    title: 'Buy ham',
    body: 'PIZZA NIGHT',
    precedence: 3
  },
  {
    id: Long { low: 5, high: 0, unsigned: false },
    title: 'Buy olives',
    body: 'PIZZA NIGHT',
    precedence: 3
  }
]
Curiously, the id field is represented differently. It is a Long object, used to represent 64-bit integer values that JavaScript's built-in number type cannot safely hold. Otherwise the output is the same as with the previous implementation.
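If a plain number is sufficient, toObject can be asked to convert the longs; a hedged sketch, assuming the id values stay within JavaScript's safe integer range:

// Ask toObject to convert 64-bit fields to plain JavaScript numbers
const plain = Schema.Todos.toObject(todos, { longs: Number });
console.log(plain.todos[0].id);   // 1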
Benchmarking encoding and decoding using JSON, Protocol Buffers, and Protobuf.JS
I created three pairs of benchmark functions that call only the encode or decode function on pre-created objects. For JSON, the pair uses JSON.stringify and JSON.parse, and the others use the functions shown above. The object array in this case has 1000 elements.
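The benchmark script is in the repository linked above. As a rough sketch of its shape, assuming the mitata benchmark package (an assumption on my part; the repository is the authoritative source) and the module generated earlier:

// bench.mjs -- sketch only; see the repository for the actual script
import { bench, run } from 'mitata';
import { default as Schema } from './todo_pb.js';

// Pre-create the data outside of the measured functions
const todos = new Schema.Todos();
// ... add 1000 Todo items as shown earlier ...
const asObject = todos.toObject();
const asJSON = JSON.stringify(asObject);
const asBinary = todos.serializeBinary();

bench('encode-JSON', () => { JSON.stringify(asObject); });
bench('decode-JSON', () => { JSON.parse(asJSON); });
bench('encode-PB', () => { todos.serializeBinary(); });
bench('decode-PB', () => { Schema.Todos.deserializeBinary(asBinary); });
// the encode-PBJS / decode-PBJS pair follows the same pattern using the protobuf.js module

await run();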
$ node bench.mjs
cpu: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
runtime: node v18.6.0 (x64-linux)
benchmark time (avg) (min … max)
---------------------------------------------------
encode-JSON 342.37 µs/iter (311.93 µs … 1.19 ms)
decode-JSON 435.9 µs/iter (384.44 µs … 1.41 ms)
encode-PB 946.43 µs/iter (777.38 µs … 3.13 ms)
decode-PB 770.79 µs/iter (688.99 µs … 1.78 ms)
encode-PBJS 696.75 µs/iter (618.43 µs … 2.43 ms)
decode-PBJS 455.36 µs/iter (413.66 µs … 1.09 ms)
It's interesting that JSON encoding and decoding are significantly faster than the protocol buffers equivalents.
Another measure is the size for each representation:
-rw-rw-r-- 1 david david 328 Aug 22 16:46 todos.json
-rw-rw-r-- 1 david david 152 Aug 22 11:38 todos.bin
-rw-rw-r-- 1 david david 152 Aug 22 12:18 todos-protobufjs.bin
The equivalent JSON is over twice the size. Obviously a text-based data format is going to be larger than a binary data format.
Summary
Google's protocol buffers are a powerful way to exchange data between applications, or for an application to store its data. It encodes data structures as compact binary data blobs.
The primary advantage for protocol buffers is the size of the encoded data. Because it is well specified, and available in multiple programming languages, it is a good fit for many applications.
Text-based data formats are also well specified and available in multiple programming languages. Obviously the huge number of applications using JSON, YAML, XML, and the like for data transfer is testament to their usefulness. But what about cases where it is important to preserve network bandwidth? Whether it's a server farm housing a few hundred thousand busy servers, a remotely installed IoT device connecting over 5G cellular data, or a smartphone app, there are many scenarios where compact binary data will improve performance or lower network data costs.