Tags: MongoDB
The task of generating a sequence of integers, such as identifiers for documents in a MongoDB collection, is conceptually. But, like many seemingly simple coding tasks, it's easy to end up confronting an obtuse error message like "RangeError: Maximum call stack size exceeded".
MongoDB automatically generates an excellent document identifier, the _id
field. It's automatically there, is automatically unique, and is automatically increasing in value. But, my application is implementing an OpenAPI specification where many of the objects use an id
field as an object identifier. To implement that field, my application needed to generate a sequence of integers to put in that field.
If you'll pardon the SQL, this application needed a field defined like this SQL id MEDIUMINT NOT NULL AUTO_INCREMENT
(MySQL), but implemented on MongoDB. Unfortunately, MongoDB doesn't have this feature, because it has the _id
field instead.
The requirement boils down to reliably generating a value for id
that's incremented from the current maximum value for id
within the collection. That means retrieving the current maximum value for id
, incrementing it by one, and using that value in the id
field of a new document for the collection.
While it's simple in principle to generate a sequence of integers, putting it into practice has a couple tricky considerations.
The initial implementation and the JavaScript RangeError
The initial simplistic implementation, in Node.js, is as follows:
const maxIDs = await db.reports.distinct('id');
if (maxIDs.length <= 0) data.id = '0';
else {
const _maxIDs = maxIDs.map(id => {
let ret = Number.parseInt(id);
if (isNaN(ret)) {
ret = id;
}
return ret;
});
data.id = (Math.max(..._maxIDs) + 1).toString();
}
It first uses the distinct
query to ask MongoDB for an array of distinct values for the id
field. A special case is to use '0'
when this returns an empty array. In the else
section we're dealing with the usual case of generating the next value for id
.
Because this particular API specifies id
as a string
, it is stored in MongoDB as a string
, and distinct
gives us an array of strings. The final line is to use Math.max
to determine the maximum value. But, that means we need to convert the array of strings into an array of numbers using Number.parseInt
. The last step is converting the number back to a string using Number.toString
.
This simple code is straight-forward, easy to understand, and ran excellently for several months. Then, it started throwing an error: RangeError: Maximum call stack size exceeded. The stack trace said pointed to the Math.max
call.
To iterate is human, to recurse is divine, unless recursion causes cursing
If you paid attention while taking CS101, you'll remember that, in any programming language, the call stack records the parameters to each function call as well as local variables. As one function calls another, a new entry is added to the call stack, and when the function exits its entry is popped from that stack. The call stack always has a maximum possible size, and exceeding this size is a serious error.
Hence, Maximum call stack size exceeded means the above code exceeded the maximum size of the JavaScript/Node.js call stack.
The most common cause of exceeding the call stack size is a recursive function. A badly coded recursive function can quickly blow past any maximum size for the call stack triggering that error.
But... There's nothing recursive here, right? Not so fast, because looks can be deceiving. Turns out that Math.max
can throw this error for an array that is big enough.
It's not that Math.max
uses a recursive algorithm, because it doesn't.
The parameters to a function call are part of the call stack. So, what happens if there are a lot of parameters? With enough function parameters, enough to physically overwhelm the call stack, the JavaScript/Node.js call stack size can be exceeded.
In the application containing the code above, at least one data item is added to the collection every 15 minutes. After a few months there were over 130,000 documents in the collection, each with a distinct value for id
. That meant the _maxIDs
array was over 130,000 items.
The call, Math.max(..._maxIDs)
, means to expand the _maxIDs
array to parameters on the call stack. A call stack entry with over 130,000 items is going to stress the JavaScript runtime engine.
As they say in Washington DC, a few million here, a few million there, and pretty soon you're talking real money. In other words, that 130,000 floats can easily become a call stack entry consuming over 1 megabyte of memory. In typical cases there's no problem because a few parameters at a few bytes each is in the range of a few hundred bytes of memory. But, on an 8 megabyte VPS that's also running a MongoDB instance, such a call stack entry is infeasible.
No wonder the call stack blew up on a function that had no recursion.
Computing the integer sequence with iteration?
The distinct
query gives us an array, and somewhere in that array is the maximum value. We can find that maximum value by iterating over the array (a for
loop) rather than using Math.max
. But, that consumes more CPU time as the array size grows big. A solution relying on iteration, however, will not blow up the call stack.
A solution using sort
rather than a for
loop is this:
const maxIDs = await Reports.distinct('id');
if (maxIDs.length <= 0) data.id = '0';
else {
const compare = (a, b) => {
const nA = Number.parseInt(a);
const nB = Number.parseInt(b);
if (nA < nB) {
return -1;
} else if (nA > nB) {
return 1;
} else { return 0; }
};
// let lastIDs = maxIDs.sort(compare).slice(-10);
// logger.info(`addReport last IDs ${util.inspect(lastIDs)}`);
const maxID = maxIDs.sort(compare).slice(-1)[0];
data.id = (Number.parseInt(maxID) + 1).toString();
}
This is a simple drop-in replacement, using Array.sort
instead of Math.max
. The idea is to sort the distinct values for id
then retrieve the last (or first) item. The bit with .slice(-1)[0]
is a tricky way to retrieve the last element of the sorted array. With this compare function, that last element will be the maximum value in the array. This could be simplified by reversing the sort order (changing the compare
function), in which case the maximum value is the first entry in the array.
A variant is to code a for
loop inspecting every maxIDs
entry. That would be faster than the sort
approach by not having to sort the array.
The problem here is that as the number of items grows our application is slower. Whether it's a sort
function, or a for
loop, the more items in the array the more CPU time it consumes.
Computing the maximum value of an integer series in MongoDB
If we push this to MongoDB, it can theoretically optimize the operation better than the application code. For example, MongoDB could use the index for the id
field to aid in determining the maximum value, or in sorting the elements?
A StackOverflow has these suggested MongoDB commands:
db.collection.find().sort({ id: -1}).limit(1) // for MAX
db.collection.find().sort({ id: +1}).limit(1) // for MIN
db.collection.aggregate({ $group : { _id: null, max: { $max: "$id" }}});
db.collection.aggregate({ $group : { _id: null, min: { $min: "$id" }}});
The first two use an empty find
to retrieve all elements, sort them on the chosen field, then return one document containing either the maximum or minimum value. Since this retrieves the entire document, it may be useful to use project
to limit the fields which are returned. There should also be a .toArray
to convert this into a useful form.
The second two use an aggregation pipeline. The $group
stage, along with the $max
or $min
operator, will cause the output to be:
[
{
"_id": null,
"max": MAXIMUM VALUE // or min: MINUMUM VALUE
}
]
MongoDB can use the index to improve performance, and it can also recognize the .sort(..).limit(..)
combination to optimize memory usage by keeping only the maximum (or minimum) value(s) in memory.
But, this did not work directly in my case. It always gave a maximum value of '99999'
, even though there were many more documents in that collection.
Turns out that a sequence of integers does not sort in numerical order when the integers are stored as strings.
That's another thing we were taught in CS101, but can easily forget. The string '101234'
is not greater than the string'99999'
. Therefore the '99999'
value always came up as the maximum value. The solution is to somehow numerically sort these numerical strings. Those commands work very well if your application uses numbers, rather than strings. Since my application uses numerical strings, let's look a little further.
The compare
function shown earlier takes care of that problem by calling Number.parseInt
before doing the comparison. But the goal is to do this in MongoDB rather than in the application.
Numerically sorting strings in MongoDB
Another StackOverflow posting discusses telling MongoDB to sort in numerical order. After reading that the final code became:
const last = await db.reports
.find({}, { id: 1, _id: 0 })
.sort({ id: -1 })
.collation({
locale: "en_US",
numericOrdering: true
})
.limit(1)
.toArray();
if (!Array.isArray(last)
|| last.length <= 0) {
data.id = '0';
} else {
let lastID = last[0].id;
data.id = (Number.parseInt(lastID) + 1).toString();
}
// The computed ID value is stored into the
// object for which the ID is being computed.
// The next step should be to use
// COLLECTION.insertOne or COLLECTION.replaceOne
// to send it to the database.
First, we've modified the find
call. The empty query, {}
, will return every item in the collection. But, we're only interested in the id
field, and the sort
call will have extra work to do if every field is present. Adding { id: 1 }
is the projection parameter, and says it should only include the id
field. Adding { _id: 0 }
means the _id
is not included.
As a result, the sort
step receives documents containing only id
values.
The collation document lets us tell MongoDB how to sort values (
documentation). The numericOrdering
setting works for integers, but not floating point, and it does not consider +
or -
signs. As long as your application fits within those constraints, this is an excellent solution.
The StackOverflow posting also has a suggestion to use the $toInt
operator to convert the strings into actual integers, but I could not get that to work. In the MongoDB Compass client, this project parameter worked: { id: { $toInt: "$id" }, _id: 0 }
and produced documents where the id
fields were integers. I have not tested this in running code, however.
In other words, this untested code might be correct:
const last = await db.reports
.find({}, {
id: { $toInt: "$id" },
_id: 0
})
.sort({ id: -1 })
// .collation({
// locale: "en_US",
// numericOrdering: true
// })
.limit(1)
.toArray();
if (!Array.isArray(last)
|| last.length <= 0) {
data.id = '0';
} else {
let lastID = last[0].id;
data.id = (Number.parseInt(lastID) + 1).toString();
}
// The computed ID value is stored into the
// object for which the ID is being computed.
// The next step should be to use
// COLLECTION.insertOne or COLLECTION.replaceOne
// to send it to the database.
By converting the id
fields to integers, the collation stage should not be required.
The value for last
is determining by sorting the array of id
values, and returning only the first item, converting it to an array. MongoDB is supposed to be able to recognize find(...).sort(...).limit(1)
as a special case.
One downside of the final implementation is that the programmers intent is obscured. To understand that the intention is to generate the next item in the sequence of integers is not obvious. The code is wrapped around the mechanics of numerical sorting, retrieving one item, and incrementing that item, while maintaining the result as a string, that the quest for the next integer in a sequence is hard to see.
WARNING: Potential race conditions
There is a timing issue not addressed in this article. We discussed computating the next sequence number for documents in a MongoDB collection. This relies on values of an id
field in the collection. The value just generated by your application must be stored into the collection in order to be recognized. There are two points in time: A) Computing the next sequence number, B) Storing that sequence number. Another thread might try to generate another sequence number at about the same time, and if it does so between time A and time B it will generate the a duplicate sequence number.
Therefore, your application must minimize the time between generating the sequence number, and storing it into the collection.
In my application, the code above is immediately followed by db.reports.insertOne(report)
. That leaves a small window of time during which a sequence number has been generated and the generated number is stored in the database.
Conclusion
As with many software development tasks, what seems like a simple task comes with many considerations. Even the tiniest and simplest of functions can generate errors.
Another learning is that MongoDB has a rich set of features and operations worth exploring. It's not just a repository for data, but it can be used for transmogrifying data in many ways.
Functionality that was first implemented almost solely in the application code is now almost solely executing on MongoDB. The MongoDB documentation claims many advantages from moving implementation into MongoDB. While in this case the final implementation obscured the programmers intent, aggregation pipelines often make the programmers intent in a large scale data transmogrification task more obvious.