Learn Powerful MongoDB Aggregation Pipelines From Practical Examples

by Lynn Kwong · Mar 2022

Get insight into your data in MongoDB with aggregation pipelines

Image from Pixabay

MongoDB is not just a NoSQL document database. You can perform complex analyses of the documents in collections using aggregation pipelines. A common task is to group the results and get the total or average data for each group. You can even filter and convert or clean the data before grouping.

An aggregation pipeline consists of one or more stages that process documents sequentially. Each stage performs some operation on the input documents and passes the processed documents to the next stage. For example, the first stage can filter documents based on some conditions, the second can group the filtered documents and perform some aggregations, and the third can output the results.

We'll use the data introduced in the article on advanced MongoDB queries for the demo in this post. If you want to follow along, it's helpful to check the system setup steps in that post to get a better understanding of the data. However, if you are in a hurry, you can simply download this JSON file and use the following commands to import the data:
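A minimal sketch of the import, assuming the downloaded file is saved locally as laptops.json (a JSON array) and that you use the Docker container and credentials shown later in this post (mongo-server, admin/pass):

# Copy the JSON file into the container, then import it into the laptops collection
# of the products database. The file name laptops.json is an assumption here.
docker cp laptops.json mongo-server:/tmp/laptops.json
docker exec -it mongo-server mongoimport \
  --username admin --password pass --authenticationDatabase admin \
  --db products --collection laptops \
  --file /tmp/laptops.json --jsonArray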

We use Docker to start a MongoDB server locally in this post, but you can also use a hosted server such as MongoDB Atlas, which has a free tier and is a good choice for learning purposes.

When the code above is run, we will have a laptops collection in the products database containing 200 documents of laptop data. The documents have content like the following:
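The exact contents come from the JSON file, but a representative document looks roughly like this (apart from brand, attributes, and memory values such as "8GB", the field names and values below are illustrative assumptions):

{
  _id: ObjectId("..."),
  name: "HP Laptop Model 1",                   // illustrative
  brand: "HP",
  price: 1000,                                 // illustrative
  stock: 12,                                   // field name assumed; quantity in stock
  attributes: [
    { name: "memory", value: "8GB" },          // sub-field names assumed
    { name: "storage", value: "256GB" }        // illustrative
  ]
}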

Now that the data is ready, we can start to write aggregation pipelines to analyze it. It's recommended to use a MongoDB IDE to write the pipelines, which typically span multiple lines. With an IDE, you get autocompletion for the code and can conveniently write queries or pipelines spanning multiple lines. However, for this post, you can simply copy the code, run it inside mongosh, and check the result there.

$group stage

Let's first check how many documents we have in the collection.

$ docker exec -it mongo-server bash
$ mongosh "mongodb://admin:pass@localhost:27017"
test> use products
products> db.laptops.countDocuments()
200

We can also use an aggregation pipeline to count the number of documents. This seems like overkill now, but it can serve as an introduction to the more complex pipelines later:
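A minimal version of that pipeline, matching the fields described below (_id set to null and a total field produced by the $count accumulator):

db.laptops.aggregate([
  // group all documents together and count them
  { $group: { _id: null, total: { $count: {} } } }
])
// [ { _id: null, total: 200 } ]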

In our first pipeline, there is only one $group stage, which, as the name suggests, groups the documents by the specified _id. Normally we would specify a field for _id to group documents by that field. However, as a special case, when _id is null, all documents are grouped together. Besides, total is a new field that holds the value returned by the $count accumulator operator, which simply counts the number of documents in each group.

Let's now group by brand and check the count for each brand:
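The same $group stage, now keyed by the brand field:

db.laptops.aggregate([
  // one output document per brand, with the number of laptops for that brand
  { $group: { _id: "$brand", total: { $count: {} } } }
])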

As mentioned above, in the $group stage, _id specifies the field to group the input documents by. _id accepts an aggregation expression as its value, which can be a field path, a system variable, an expression object, and so on.

"$model" is a area path and should be put in quotes. You could marvel why we should put a greenback signal $ earlier than the sector identify. Sure, that is very complicated whenever you first see it. You could perceive that _id accepts an expression, fairly than a plain area identify. And within the expression, if we need to specify a area, we should prefix it with a greenback signal, in any other case, it is going to be handled as a literal string.

Technically, "$brand" is a shortcut for "$$CURRENT.brand", where CURRENT is a system variable that defaults to the current document. Once you know this, "$brand" isn't that mysterious anymore, is it?

$sort stage

Let's sort the results by the laptop count for each brand in descending order:
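A sketch of the pipeline with the new stage appended (-1 means descending order):

db.laptops.aggregate([
  { $group: { _id: "$brand", total: { $count: {} } } },
  { $sort: { total: -1 } }   // sort the grouped results by count, largest first
])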

A new $sort stage is added after the $group stage. As we already know, the filtered or aggregated documents of one stage are passed to the next one. In this case, the grouping results are passed to the $sort stage. The fields of the original documents are no longer available; only those produced by the previous $group stage are. Therefore, we can sort by the total field, which was generated in the preceding $group stage. Also, $sort takes plain field names as keys, so you don't need to prefix them with a dollar sign.

$match stage

We can add a $match stage to filter the input documents and pass only those that match the specified conditions to the next pipeline stage. Let's filter out the laptops that are not in stock and only count the ones that are still available.
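A sketch of this, assuming each document stores its quantity in stock in a numeric stock field (the actual field name in your data may differ):

db.laptops.aggregate([
  { $match: { stock: { $gt: 0 } } },   // keep only laptops that are in stock (field name assumed)
  { $group: { _id: "$brand", total: { $count: {} } } },
  { $sort: { total: -1 } }
])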

We can use the $match stage multiple times in the same pipeline. Let's only output the brands that have more than 10 laptops in stock:
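Extending the previous sketch, the second $match runs after $group, so it filters on the total field created by the grouping rather than on a field of the original documents:

db.laptops.aggregate([
  { $match: { stock: { $gt: 0 } } },     // field name assumed, as above
  { $group: { _id: "$brand", total: { $count: {} } } },
  { $match: { total: { $gt: 10 } } },    // only brands with more than 10 laptops in stock
  { $sort: { total: -1 } }
])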

$unwind stage

Now let's introduce a more complex stage, $unwind, which deconstructs an array field from the input documents and outputs one document for each element of the original array.

For a document like this:

"_id": 1, "identify": "John", "scores": [70, 90, 80]

The $unwind stage on the scores field will generate the following documents:

"_id": 1, "identify": "John", "scores": 70
"_id": 1, "identify": "John", "scores": 90
"_id": 1, "identify": "John", "scores": 80

The documents produced by $unwind from the same array share the same primary key _id. Let's now unwind the attributes field of the laptops documents, which is an array of attribute documents:
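The stage itself is a one-liner; the array field is passed as a field path string:

db.laptops.aggregate([
  { $unwind: "$attributes" }   // one output document per element of the attributes array
])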

Yes, it's unwound as expected, and each new document contains a single attribute of the original array. Note that the $unwind stage expects a field path as its input, which should be specified with a dollar sign prefix, just like the _id field in the $group stage.

Let's now find the maximum memory for each brand of laptop:
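A sketch of one way to do this with the stages covered so far, assuming each attribute document has name and value sub-fields (those sub-field names are an assumption):

db.laptops.aggregate([
  { $unwind: "$attributes" },
  { $match: { "attributes.name": "memory" } },   // keep only the memory attribute
  { $sort: { "attributes.value": 1 } },          // sort the memory values in ascending order
  { $group: { _id: "$brand", max_memory: { $last: "$attributes.value" } } }
])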

[
  { _id: 'HP', max_memory: '8GB' },
  { _id: 'Dell', max_memory: '8GB' },
  { _id: 'Asus', max_memory: '8GB' },
  { _id: 'Apple', max_memory: '8GB' },
  { _id: 'Lenovo', max_memory: '8GB' }
]

The $last operator gets the last member of each group. When used together with a $sort stage that orders the documents by the specified fields, it can give us the maximum value of a field that is sorted in ascending order.

Interestingly, all the brands have "8GB" as the maximum value. This is because the memory value is a string, and the string "8GB" compares as greater than "16GB" or "32GB". We need to convert it to a numeric value for sorting, which can be achieved with the $project stage.

$project stage

The $project stage passes along the documents with the requested fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed ones. In this example, we want to pass the brand as it is, but convert the memory string value to a numeric value (an integer in this case). The memory strings all contain a "GB" suffix, which we need to remove before the value can be converted to an integer; this can be done with the $trim operator. The conversion from a string to an integer can then be done with the $toInt operator.

The final pipeline is shown below. Note that the $sort stage is also used twice, just like the $match stage introduced above.
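A sketch of the full pipeline, under the same assumption about the attribute sub-field names (name and value):

db.laptops.aggregate([
  { $unwind: "$attributes" },
  { $match: { "attributes.name": "memory" } },
  {
    $project: {
      brand: 1,
      // strip the "GB" suffix and convert the remaining string to an integer
      memory: { $toInt: { $trim: { input: "$attributes.value", chars: "GB" } } }
    }
  },
  { $sort: { memory: 1 } },       // ascending, so $last picks the maximum per group
  { $group: { _id: "$brand", max_memory: { $last: "$memory" } } },
  { $sort: { max_memory: -1 } }   // order the final results by maximum memory
])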

And now the memory is sorted properly:

[
  { _id: 'HP', max_memory: 32 },
  { _id: 'Lenovo', max_memory: 32 },
  { _id: 'Dell', max_memory: 16 },
  { _id: 'Asus', max_memory: 16 },
  { _id: 'Apple', max_memory: 16 }
]

We can continue to analyze the data with different combinations of stages, but it would be similar to what we have introduced. Once you have mastered the basic and commonly used stages of aggregation pipelines, you will be able to perform all kinds of analyses you want. It's recommended that you write some pipelines yourself and check the output of each stage to gain a better understanding of the logic of the different stages.
