How to get the maximum value of a field in MongoDB

; Date: Thu Oct 26 2023

Tags: MongoDB

The task of generating a sequence of integers, such as identifiers for documents in a MongoDB collection, is conceptually. But, like many seemingly simple coding tasks, it's easy to end up confronting an obtuse error message like "RangeError: Maximum call stack size exceeded".

MongoDB automatically generates an excellent document identifier, the _id field. It's automatically there, is automatically unique, and is automatically increasing in value. But, my application is implementing an OpenAPI specification where many of the objects use an id field as an object identifier. To implement that field, my application needed to generate a sequence of integers to put in that field.

If you'll pardon the SQL, this application needed a field defined like this SQL id MEDIUMINT NOT NULL AUTO_INCREMENT (MySQL), but implemented on MongoDB. Unfortunately, MongoDB doesn't have this feature, because it has the _id field instead.

The requirement boils down to reliably generating a value for id that's incremented from the current maximum value for id within the collection. That means retrieving the current maximum value for id, incrementing it by one, and using that value in the id field of a new document for the collection.

While it's simple in principle to generate a sequence of integers, putting it into practice has a couple tricky considerations.

The initial implementation and the JavaScript RangeError

The initial simplistic implementation, in Node.js, is as follows:

const maxIDs = await db.reports.distinct('id');
if (maxIDs.length <= 0) data.id = '0';
else {
    const _maxIDs = maxIDs.map(id => {
        let ret = Number.parseInt(id);
        if (isNaN(ret)) {
            ret = id;
        }
        return ret;
    });
    data.id = (Math.max(..._maxIDs) + 1).toString();
}

It first uses the distinct query to ask MongoDB for an array of distinct values for the id field. A special case is to use '0' when this returns an empty array. In the else section we're dealing with the usual case of generating the next value for id.

Because this particular API specifies id as a string, it is stored in MongoDB as a string, and distinct gives us an array of strings. The final line is to use Math.max to determine the maximum value. But, that means we need to convert the array of strings into an array of numbers using Number.parseInt. The last step is converting the number back to a string using Number.toString.

This simple code is straight-forward, easy to understand, and ran excellently for several months. Then, it started throwing an error: RangeError: Maximum call stack size exceeded. The stack trace said pointed to the Math.max call.

To iterate is human, to recurse is divine, unless recursion causes cursing

If you paid attention while taking CS101, you'll remember that, in any programming language, the call stack records the parameters to each function call as well as local variables. As one function calls another, a new entry is added to the call stack, and when the function exits its entry is popped from that stack. The call stack always has a maximum possible size, and exceeding this size is a serious error.

Hence, Maximum call stack size exceeded means the above code exceeded the maximum size of the JavaScript/Node.js call stack.

The most common cause of exceeding the call stack size is a recursive function. A badly coded recursive function can quickly blow past any maximum size for the call stack triggering that error.

But... There's nothing recursive here, right? Not so fast, because looks can be deceiving. Turns out that Math.max can throw this error for an array that is big enough.

It's not that Math.max uses a recursive algorithm, because it doesn't.

The parameters to a function call are part of the call stack. So, what happens if there are a lot of parameters? With enough function parameters, enough to physically overwhelm the call stack, the JavaScript/Node.js call stack size can be exceeded.

In the application containing the code above, at least one data item is added to the collection every 15 minutes. After a few months there were over 130,000 documents in the collection, each with a distinct value for id. That meant the _maxIDs array was over 130,000 items.

The call, Math.max(..._maxIDs), means to expand the _maxIDs array to parameters on the call stack. A call stack entry with over 130,000 items is going to stress the JavaScript runtime engine.

As they say in Washington DC, a few million here, a few million there, and pretty soon you're talking real money. In other words, that 130,000 floats can easily become a call stack entry consuming over 1 megabyte of memory. In typical cases there's no problem because a few parameters at a few bytes each is in the range of a few hundred bytes of memory. But, on an 8 megabyte VPS that's also running a MongoDB instance, such a call stack entry is infeasible.

No wonder the call stack blew up on a function that had no recursion.

Computing the integer sequence with iteration?

The distinct query gives us an array, and somewhere in that array is the maximum value. We can find that maximum value by iterating over the array (a for loop) rather than using Math.max. But, that consumes more CPU time as the array size grows big. A solution relying on iteration, however, will not blow up the call stack.

A solution using sort rather than a for loop is this:

const maxIDs = await Reports.distinct('id');
if (maxIDs.length <= 0) data.id = '0';
else {
    const compare = (a, b) => {
        const nA = Number.parseInt(a);
        const nB = Number.parseInt(b);
        if (nA < nB) {
            return -1;
        } else if (nA > nB) {
            return 1;
        } else { return 0; }
    };
    // let lastIDs =  maxIDs.sort(compare).slice(-10);
    // logger.info(`addReport last IDs ${util.inspect(lastIDs)}`);
    const maxID = maxIDs.sort(compare).slice(-1)[0];
    data.id = (Number.parseInt(maxID) + 1).toString();
}

This is a simple drop-in replacement, using Array.sort instead of Math.max. The idea is to sort the distinct values for id then retrieve the last (or first) item. The bit with .slice(-1)[0] is a tricky way to retrieve the last element of the sorted array. With this compare function, that last element will be the maximum value in the array. This could be simplified by reversing the sort order (changing the compare function), in which case the maximum value is the first entry in the array.

A variant is to code a for loop inspecting every maxIDs entry. That would be faster than the sort approach by not having to sort the array.

The problem here is that as the number of items grows our application is slower. Whether it's a sort function, or a for loop, the more items in the array the more CPU time it consumes.

Computing the maximum value of an integer series in MongoDB

If we push this to MongoDB, it can theoretically optimize the operation better than the application code. For example, MongoDB could use the index for the id field to aid in determining the maximum value, or in sorting the elements?

A (stackoverflow.com) StackOverflow has these suggested MongoDB commands:

db.collection.find().sort({ id: -1}).limit(1) // for MAX
db.collection.find().sort({ id: +1}).limit(1) // for MIN
db.collection.aggregate({ $group : { _id: null, max: { $max: "$id" }}});
db.collection.aggregate({ $group : { _id: null, min: { $min: "$id" }}});

The first two use an empty find to retrieve all elements, sort them on the chosen field, then return one document containing either the maximum or minimum value. Since this retrieves the entire document, it may be useful to use project to limit the fields which are returned. There should also be a .toArray to convert this into a useful form.

The second two use an aggregation pipeline. The $group stage, along with the $max or $min operator, will cause the output to be:

[
    {
        "_id": null,
        "max": MAXIMUM VALUE // or min: MINUMUM VALUE
    }
]

MongoDB can use the index to improve performance, and it can also recognize the .sort(..).limit(..) combination to optimize memory usage by keeping only the maximum (or minimum) value(s) in memory.

But, this did not work directly in my case. It always gave a maximum value of '99999', even though there were many more documents in that collection.

Turns out that a sequence of integers does not sort in numerical order when the integers are stored as strings.

That's another thing we were taught in CS101, but can easily forget. The string '101234' is not greater than the string'99999'. Therefore the '99999' value always came up as the maximum value. The solution is to somehow numerically sort these numerical strings. Those commands work very well if your application uses numbers, rather than strings. Since my application uses numerical strings, let's look a little further.

The compare function shown earlier takes care of that problem by calling Number.parseInt before doing the comparison. But the goal is to do this in MongoDB rather than in the application.

Numerically sorting strings in MongoDB

Another (stackoverflow.com) StackOverflow posting discusses telling MongoDB to sort in numerical order. After reading that the final code became:

const last = await db.reports
    .find({}, { id: 1, _id: 0 })
    .sort({ id: -1 })
    .collation({
        locale: "en_US",
        numericOrdering: true
    })
    .limit(1)
    .toArray();
if (!Array.isArray(last)
 || last.length <= 0) {
    data.id = '0';
} else {
    let lastID = last[0].id;
    data.id = (Number.parseInt(lastID) + 1).toString();
}
// The computed ID value is stored into the
// object for which the ID is being computed.
// The next step should be to use
// COLLECTION.insertOne or COLLECTION.replaceOne
// to send it to the database.

First, we've modified the find call. The empty query, {}, will return every item in the collection. But, we're only interested in the id field, and the sort call will have extra work to do if every field is present. Adding { id: 1 } is the projection parameter, and says it should only include the id field. Adding { _id: 0 } means the _id is not included.

As a result, the sort step receives documents containing only id values.

The collation document lets us tell MongoDB how to sort values ( (www.mongodb.com) documentation). The numericOrdering setting works for integers, but not floating point, and it does not consider + or - signs. As long as your application fits within those constraints, this is an excellent solution.

The StackOverflow posting also has a suggestion to use the $toInt operator to convert the strings into actual integers, but I could not get that to work. In the MongoDB Compass client, this project parameter worked: { id: { $toInt: "$id" }, _id: 0 } and produced documents where the id fields were integers. I have not tested this in running code, however.

In other words, this untested code might be correct:

const last = await db.reports
    .find({}, {
        id: { $toInt: "$id" },
        _id: 0
    })
    .sort({ id: -1 })
    // .collation({
    //     locale: "en_US",
    //     numericOrdering: true
    // })
    .limit(1)
    .toArray();
if (!Array.isArray(last)
 || last.length <= 0) {
    data.id = '0';
} else {
    let lastID = last[0].id;
    data.id = (Number.parseInt(lastID) + 1).toString();
}
// The computed ID value is stored into the
// object for which the ID is being computed.
// The next step should be to use
// COLLECTION.insertOne or COLLECTION.replaceOne
// to send it to the database.

By converting the id fields to integers, the collation stage should not be required.

The value for last is determining by sorting the array of id values, and returning only the first item, converting it to an array. MongoDB is supposed to be able to recognize find(...).sort(...).limit(1) as a special case.

One downside of the final implementation is that the programmers intent is obscured. To understand that the intention is to generate the next item in the sequence of integers is not obvious. The code is wrapped around the mechanics of numerical sorting, retrieving one item, and incrementing that item, while maintaining the result as a string, that the quest for the next integer in a sequence is hard to see.

WARNING: Potential race conditions

There is a timing issue not addressed in this article. We discussed computating the next sequence number for documents in a MongoDB collection. This relies on values of an id field in the collection. The value just generated by your application must be stored into the collection in order to be recognized. There are two points in time: A) Computing the next sequence number, B) Storing that sequence number. Another thread might try to generate another sequence number at about the same time, and if it does so between time A and time B it will generate the a duplicate sequence number.

Therefore, your application must minimize the time between generating the sequence number, and storing it into the collection.

In my application, the code above is immediately followed by db.reports.insertOne(report). That leaves a small window of time during which a sequence number has been generated and the generated number is stored in the database.

Conclusion

As with many software development tasks, what seems like a simple task comes with many considerations. Even the tiniest and simplest of functions can generate errors.

Another learning is that MongoDB has a rich set of features and operations worth exploring. It's not just a repository for data, but it can be used for transmogrifying data in many ways.

Functionality that was first implemented almost solely in the application code is now almost solely executing on MongoDB. The MongoDB documentation claims many advantages from moving implementation into MongoDB. While in this case the final implementation obscured the programmers intent, aggregation pipelines often make the programmers intent in a large scale data transmogrification task more obvious.

About the Author(s)

(davidherron.com) David Herron : David Herron is a writer and software engineer focusing on the wise use of technology. He is especially interested in clean energy technologies like solar power, wind power, and electric cars. David worked for nearly 30 years in Silicon Valley on software ranging from electronic mail systems, to video streaming, to the Java programming language, and has published several books on Node.js programming and electric vehicles.