How to get the maximum value of a field in MongoDB

; Date: Thu Oct 26 2023

Tags: MongoDB

The task of generating a sequence of integers, such as identifiers for documents in a MongoDB collection, is conceptually. But, like many seemingly simple coding tasks, it's easy to end up confronting an obtuse error message like "RangeError: Maximum call stack size exceeded".

MongoDB automatically generates an excellent document identifier, the _id field. It's automatically there, is automatically unique, and is automatically increasing in value. But, my application is implementing an OpenAPI specification where many of the objects use an id field as an object identifier. To implement that field, my application needed to generate a sequence of integers to put in that field.

If you'll pardon the SQL, this application needed a field defined like id MEDIUMINT NOT NULL AUTO_INCREMENT (MySQL), but implemented on MongoDB. Unfortunately, MongoDB doesn't have this feature, because it has the _id field instead.

The requirement boils down to reliably generating a value for id that's incremented from the current maximum value for id within the collection. That means retrieving the current maximum value for id, incrementing it by one, and using that value in the id field of a new document for the collection.

While it's simple in principle to generate a sequence of integers, putting it into practice has a couple tricky considerations.

The initial implementation and the JavaScript RangeError

The initial simplistic implmentation, in Node.js, is as follows:

const maxIDs = await db.reports.distinct('id');
if (maxIDs.length <= 0) data.id = '0';
else {
    const _maxIDs = maxIDs.map(id => {
        let ret = Number.parseInt(id);
        if (isNaN(ret)) {
            ret = id;
        }
        return ret;
    });
    data.id = (Math.max(..._maxIDs) + 1).toString();
}

It first uses the distinct query to get an array of distinct values for the id field. A special case is when this returns an empty array, to use '0'. In the else section we're dealing with the usual case of generating the next value for id.

Because this particular API specifies id as a string, it is stored in the database as a string, and distinct gives us an array of strings. The final line is to use Math.max to determine the maximum value. But, that means we need to convert the array of strings into an array of numbers. The last step is converting the number back to a string.

This simple code is straight-forward, easy to understnad, and ran excelently for several months. Then, it started throwing an error: RangeError: Maximum call stack size exceeded. The stack trace said pointed to the Math.max call.

To iterate is human, to recurse is divine, unless recursion causes cursing

If you paid attention while taking CS101, you'll remember that the call stack records the parameters to each function call as well as local variables. As one function calls another, a new entry is added to the call stack, and when the function exits its entry is popped from that stack. In any programming language, the call stack has a maximum possible size, and exceeding this size is a serious error.

The most common cause of exceeding the call stack size is a recursive function. A badly coded recursive function can quickly blow past any maximum size for the call stack triggering that error.

But... There's nothing recursive here, right? Not so fast, because looks can be decieving. Turns out that Math.max can throw this error for an array that is big enough. Its implementation must require some recursion. After a few months, the document collection had grown to over 100,000 documents, making for over 100,000 distinct values for id. That was enough to throw the error in Math.max.

Computing the integer sequence with iteration?

The distinct query gives us an array, and somewhere in that array is the maximum value. We can find that maximum value by iterating over the array (a for loop) rather than using Math.max. But, that consumes more CPU time as the array size grows big. A solution relying on iteration, however, will certainly not blow up the call stack.

A solution using sort rather than a for loop is this:

const maxIDs = await Reports.distinct('id');
if (maxIDs.length <= 0) data.id = '0';
else {
    const compare = (a, b) => {
        const nA = Number.parseInt(a);
        const nB = Number.parseInt(b);
        if (nA < nB) {
            return -1;
        } else if (nA > nB) {
            return 1;
        } else { return 0; }
    };
    // let lastIDs =  maxIDs.sort(compare).slice(-10);
    // logger.info(`addReport last IDs ${util.inspect(lastIDs)}`);
    const maxID = maxIDs.sort(compare).slice(-1)[0];
    data.id = (Number.parseInt(maxID) + 1).toString();
}

This is a simple drop-in replacement, using Array.sort instead of Math.max. The idea is to sort the distinct values for id then retrieve the last (or first) item. The bit with .slice(-1)[0] is a tricky way to retrieve the last element of the sorted array. With this compare function, that last element will be the maximum value in the array. This could be simplified by reversing the sort order (changing the compare function), in which case the maximum value is the first entry in the array.

The problem here is that as the number of items grows our application is slower. No matter how good the sort function is, the more items it's sorting the more CPU time it consumes.

Computing the maximum value of an integer series in MongoDB

If we push this to MongoDB, it can theoretically optimize the operation better than the application code. For example, MongoDB could use the index for the id field to aid in determining the maximum value, or in sorting the elements?

A (stackoverflow.com) StackOverflow has these suggested MongoDB commands:

db.collection.find().sort({ id: -1}).limit(1) // for MAX
db.collection.find().sort({ id: +1}).limit(1) // for MIN
db.collection.aggregate({ $group : { _id: null, max: { $max: "$id" }}});
db.collection.aggregate({ $group : { _id: null, min: { $min: "$id" }}});

The first two use an empty find to retrieve all elements, sort them on the chosen field, then return one document containing either the maximum or minimum value. Since this retrieves the entire document, it may be useful to use project to limit the fields which are returned. There should also be a .toArray to convert this into a useful form.

The second two use an aggregation pipeline. The $group stage, along with the $max or $min operator, will cause the output to be:

[
    {
        "_id": null,
        "max": MAXIMUM VALUE // or min: MINUMUM VALUE
    }
]

MongoDB can use the index to improve performance, and it can also recognize the .sort(..).limit(..) combination to optimize memory usage by keeping only the maximum (or minimum) value(s) in memory.

But, this did not work directly in my case. It always gave a maximum value of '99999', even though there were many more documents in that collection.

Turns out that a sequence of integers does not sort in numerical order when the integers are stored as strings.

That's another thing we were taught in CS101, but can easily forget. The string '101234' is not greater than the string'99999'. Therefore the '99999' value always came up as the maximum value. The solution is to somehow numerically sort these numerical strings. Those command work very well if your application uses numbers, rather than strings. Since my application uses numerical strings, let's look a little further.

The compare function shown earlier takes care of that problem by calling Number.parseInt before doing the comparison. But the goal is to do this in MongoDB rather than in the application.

Numerically sorting strings in MongoDB

Another (stackoverflow.com) StackOverflow posting discusses telling MongoDB to sort in numerical order. After reading that the final code became:

const last = await db.reports.find()
    .sort({ id: -1 })
    .collation({
        locale: "en_US",
        numericOrdering: true
    })
    .limit(1)
    .toArray();
if (!Array.isArray(last)
 || last.length <= 0) {
    data.id = '0';
} else {
    let lastID = last[0].id;
    data.id = (Number.parseInt(lastID) + 1).toString();
}

The collation document lets us tell MongoDB how to sort values ( (www.mongodb.com) documentation). The numericOrdering setting works for integers, but not floating point, and it does not consider + or - signs. As long as your application fits within those constraints, this is an excellent solution.

The StackOverflow posting also has a suggestion to use the $toInt operator to convert the strings into actual integers, but I could not get that to work.

One downside of the final implementation is that the programmers intent is obscured. To understand that the intention is to generate the next item in the sequence of integers is not obvious. The code is wrapped around the mechanics of numerical sorting, retrieving one item, and incrementing that item, while maintaining the result as a string, that the quest for the next integer in a sequence is hard to see.

Warning: Potential race conditions

There is a timing issue not addressed in this article. We discussed computating the next sequence number for documents in a MongoDB collection. This relies on values of an id field in the collection. The value just generated by your application must be stored into the collection in order to be recognized. There are two points in time: A) Computing the next sequence number, B) Storing that sequence number. Another thread might try to generate another sequence number at about the same time, and if it does so between time A and time B it will generate the a duplicate sequence number.

Therefore, your application must minimize the time between generating the sequence number, and storing it into the collection.

In my application, the code above is immediately followed by db.reports.insertOne(report). That leaves a small window of time during which a sequence number has been generated and the generated number is stored in the database.

Conclusion

As with many software development tasks, what seems like a simple task comes with many considerations. Even the tiniest and simplest of functions can generate errors.

Another learning is that MongoDB has a rich set of features and operations worth exploring. It's not just a repository for data, but it can be used for transmogrifying data in many ways.

Functionality that was first implemented almost solely in the application code is now almost solely executing on MongoDB. The MongoDB documentation claims many advantages from moving implementation into MongoDB. While in this case the final implementation obscured the programmers intent, aggregation pipelines often make the programmers intent in a large scale data transmogrification task more obvious.

About the Author(s)

(davidherron.com) David Herron : David Herron is a writer and software engineer focusing on the wise use of technology. He is especially interested in clean energy technologies like solar power, wind power, and electric cars. David worked for nearly 30 years in Silicon Valley on software ranging from electronic mail systems, to video streaming, to the Java programming language, and has published several books on Node.js programming and electric vehicles.