Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Aggregation Framework
The aggregation framework allows you to process data records and return computed results. It is similar to theGROUP BY clause in SQL but far more powerful — think of it as building a data processing assembly line.
The factory analogy: Imagine a factory where raw materials (your documents) move along a conveyor belt. At each station (stage), a worker performs one specific task — one station filters out defective items ($match), the next groups items by color ($group), the next reshapes them for packaging ($project). Each station only sees what the previous station passed forward. That is exactly how the aggregation pipeline works: each stage receives the output of the previous stage and passes its result to the next.
The Pipeline
Aggregations work as a pipeline. Documents pass through a series of stages, where each stage transforms the documents. The pipeline is expressed as an array of stage objects.Common Stages
$match
Filters the documents — equivalent to find() or SQL WHERE. This stage should appear as early as possible in the pipeline to reduce the number of documents subsequent stages process.
$group
Groups documents by a specified identifier key and applies accumulator expressions to each group. This is MongoDB’s equivalent of SQL’s GROUP BY combined with aggregate functions.
| Accumulator | SQL Equivalent | What It Does |
|---|---|---|
$sum | SUM() / COUNT() | Adds values; use $sum: 1 to count |
$avg | AVG() | Calculates the mean |
$min / $max | MIN() / MAX() | Finds extremes |
$first / $last | N/A | First/last value in each group (depends on sort order) |
$push | N/A | Collects all values into an array |
$addToSet | N/A | Collects unique values into an array |
$project
Reshapes each document — include, exclude, rename fields, or create computed fields. Think of it as choosing which columns to show and adding calculated columns.
$sort
Sorts all input documents and returns them to the pipeline in sorted order.
$limit and $skip
Used for pagination.
$unwind
Deconstructs an array field, creating one document per array element. Essential when you need to aggregate across array values.
$lookup
Performs a left outer join to another collection — MongoDB’s equivalent of a SQL JOIN.
Example Pipelines
Example 1: Total Sales by Product (Basic)
Calculate the total sales per product for the year 2023, sorted by highest sales.Example 2: E-Commerce Dashboard (Realistic)
Generate a dashboard showing average order value by month, with the top product per month.Example 3: Tagging Analysis with $unwind (Array Processing)
Find the most popular tags across all blog posts and count how many posts use each tag.Aggregation Performance Tips
-
$matchfirst, always. A$matchat the start of the pipeline can use indexes and eliminates documents before expensive stages like$groupand$sort. -
Create indexes for
$matchand$sortstages. The first$matchand first$sortin a pipeline can leverage indexes. Subsequent ones cannot because the document structure has changed. -
Use
allowDiskUsefor large datasets. By default, each pipeline stage has a 100 MB RAM limit. For large aggregations:
-
Prefer
$projectearly to drop unused fields. If your documents have 50 fields but you only need 3, projecting early reduces the memory footprint for every subsequent stage. -
Avoid
$lookupon unindexed foreign fields. Always ensure theforeignFieldin a$lookuphas an index, or the join will trigger a collection scan on the target collection for every input document.
Summary
- Aggregation Pipeline processes data in stages — each stage transforms documents and passes them forward.
- Use
$matchto filter early (for performance — it can use indexes). - Use
$groupto calculate statistics (sum, avg, min, max, count). - Use
$projectto reshape output and drop unused fields. - Use
$unwindto work with array fields,$lookupto join collections. - Always place
$matchfirst and create indexes on fields used in$matchand$sortstages. - Pass
{ allowDiskUse: true }when aggregating large datasets that exceed the 100 MB stage memory limit.
Interview Deep-Dive
Why does MongoDB's documentation recommend placing $match as early as possible in an aggregation pipeline? Walk me through what happens internally when you reorder stages.
Why does MongoDB's documentation recommend placing $match as early as possible in an aggregation pipeline? Walk me through what happens internally when you reorder stages.
- Placing
$matchearly is the single most impactful optimization in aggregation pipeline design. The reason is simple: every subsequent stage only processes the documents that pass through the previous stage. If you filter 10 million documents down to 50,000 with a$matchfirst, every subsequent$group,$sort,$lookup, and$projectoperates on 50,000 documents instead of 10 million. The difference in execution time can be three orders of magnitude. - More importantly,
$matchat the beginning of a pipeline can use indexes. MongoDB’s query planner optimizes the first$matchstage the same way it optimizes afind()query — it can use compound indexes, covered index scans, and all the usual index strategies. A$matchthat appears after a$groupor$projectcannot use indexes because the data has been transformed and no longer maps to the original collection’s index structure. - What happens internally: MongoDB’s aggregation framework applies automatic optimizations. One key optimization is “match coalescence” — if you have
$matchstages separated by non-transformative stages, the optimizer merges them or moves them earlier. For example,$sortfollowed by$matchcan be reordered to$matchfollowed by$sortbecause filtering first, then sorting a smaller set is always faster. - However, the optimizer cannot move
$matchbefore stages that transform the data. If you have$groupfollowed by$match, the$matchfilters on the grouped output, which has a completely different schema from the input. The optimizer cannot push this$matchbefore the$groupbecause the filter fields do not exist yet. - Real-world example: I worked on a reporting pipeline that calculated monthly revenue by product category. The original pipeline was
$group(all 50M orders), then$match(filter to 2024). Execution time: 45 seconds. Reordered to$match(filter to 2024, roughly 12M orders) then$group— execution time: 8 seconds. Added a compound index on{ date: 1 }that the initial$matchcould use — execution time: 1.2 seconds. Same result, 37x faster.
- Run the pipeline with
.explain("executionStats")(or.explain("queryPlanner")for just the plan). The explain output shows the “optimizedPipeline” — the actual stages MongoDB will execute after optimization. Compare this to your original pipeline to see what the optimizer moved, merged, or eliminated. - You cannot directly override the optimizer’s decisions, but you can structure your pipeline to guide it. For instance, the optimizer will merge consecutive
$matchstages into a single$matchwith$and. If you do not want this (rare, but sometimes useful for readability), you can insert a no-op$addFieldsbetween them to prevent coalescence. - A practical tip: if
explainshows your initial$matchis not using an index, check whether you have a$addFieldsor$setbefore the$matchthat prevents index usage. Even a seemingly harmless$addFields: { fullName: { $concat: ["$firstName", " ", "$lastName"] } }before$match: { status: "active" }prevents the$matchfrom using an index onstatusbecause$addFieldstransforms the document stream.
Explain the $lookup stage. How does it compare to a SQL JOIN, and what are the performance implications that most developers underestimate?
Explain the $lookup stage. How does it compare to a SQL JOIN, and what are the performance implications that most developers underestimate?
$lookupperforms a left outer join between the current collection and a foreign collection. For each document in the pipeline, it finds all matching documents in the foreign collection and adds them as an array field. The basic syntax is:{ $lookup: { from: "orders", localField: "_id", foreignField: "userId", as: "userOrders" } }.- Compared to a SQL JOIN, there are critical differences. SQL joins are typically optimized by the query planner with hash joins, merge joins, and nested loop joins depending on data size. MongoDB’s
$lookupis essentially a nested loop join by default — for each document in the left side, it queries the right collection. This means performance is O(n * m) in the worst case, where n is the number of left-side documents and m is the average cost of each right-side lookup. - The performance implication most developers miss:
$lookupdoes use indexes on the foreign collection’sforeignField, but only if theforeignFieldhas an appropriate index. Without an index onorders.userId, each$lookupiteration does a collection scan of the orders collection. For a pipeline processing 10,000 users against 1 million orders, that is 10,000 collection scans — catastrophically slow. With an index onorders.userId, each lookup is an index seek, and the total cost is manageable. - Another underestimated cost: memory. The
$lookupresult for each document includes all matching foreign documents as an embedded array. If a user has 50,000 orders, the result document now contains a 50,000-element array. This can exceed the 16 MB document size limit and crash the pipeline. The mitigation is to add a$matchinside the$lookuppipeline (using the uncorrelated subquery syntax) to filter the right-side documents before they are assembled into the array. $lookupalso does not work across sharded collections as the foreign collection (thefromcollection). The local collection can be sharded, but the foreign collection must be unsharded or on the same shard. This is a significant limitation for large-scale systems where both collections need horizontal scaling.- My guidance: if you find yourself writing
$lookupin a query that runs frequently (more than a few times per second), it is a signal that your data model might benefit from embedding the referenced data instead of joining at query time.$lookupis better suited for analytics pipelines, admin dashboards, and infrequent reports — not hot-path user-facing queries.
- Denormalize. The MongoDB data modeling philosophy for hot-path queries is: do the work at write time so reads are fast. Instead of joining users and orders at read time, embed the user’s name and email in each order document when the order is created. The read query then fetches a single order document with all the data it needs — no join, no second collection lookup.
- The cost is write complexity: when a user updates their email, you need to propagate the change to all their orders. Use Change Streams to listen for user updates and asynchronously update the denormalized copies in orders. Accept that there is a brief window of staleness (usually seconds) after a user update.
- For cases where denormalization is impractical (the referenced data is too large or changes too frequently), use application-level caching. Fetch the order, then fetch the user from a Redis cache keyed by user ID. Cache hit rate for user profiles in an order-display context is typically 95%+, so you eliminate most of the “join” work.
- The anti-pattern is treating MongoDB like a relational database and using
$lookupeverywhere. If your data model requires 5$lookupstages to render a single page, you have designed a relational schema in a document database, and you should either restructure your data model or switch to PostgreSQL where those joins are optimized natively.
You need to process 500 million documents in an aggregation pipeline for a monthly report. The pipeline includes $match, $group, $sort, and $project. How do you prevent this from killing your production database?
You need to process 500 million documents in an aggregation pipeline for a monthly report. The pipeline includes $match, $group, $sort, and $project. How do you prevent this from killing your production database?
- A pipeline processing 500 million documents is a resource-intensive operation that can saturate CPU, consume all available RAM, and compete with production traffic for I/O bandwidth. The strategy is to isolate the workload and control resource consumption.
- First, run the pipeline on a secondary replica set member, not the primary. Set
readPreference: "secondary"in the aggregation options. This ensures the heavy read workload does not compete with production writes on the primary. The data might be slightly stale (replication lag), but for a monthly report, a few seconds of lag is irrelevant. - Second, use
allowDiskUse: true. By default, each aggregation pipeline stage is limited to 100 MB of RAM. For a$groupor$sorton 500 million documents, you will almost certainly exceed this limit.allowDiskUselets MongoDB spill intermediate results to disk when RAM is exhausted. It is slower than in-memory processing but prevents out-of-memory errors. - Third, optimize the pipeline to reduce the data volume at each stage. The
$matchshould be as selective as possible and use an index. If the report is for January 2024, filter by date range first — this might reduce 500 million to 40 million before any grouping happens. The$projectshould be placed early to drop unnecessary fields so that less data flows through subsequent stages. Every byte you eliminate early saves CPU and I/O at every subsequent stage. - Fourth, consider a dedicated analytics replica. MongoDB allows you to add a replica set member with a lower priority (so it never becomes primary) and route all analytics workloads to it. This member can have different hardware (more RAM, faster disks) optimized for heavy read operations. Atlas supports this natively with “Analytics Nodes.”
- Fifth, for recurring reports, pre-aggregate. Instead of crunching 500 million raw documents monthly, maintain a
daily_summariescollection that aggregates each day’s data into summary documents. The monthly report then aggregates 30 daily summaries instead of 500 million raw documents. This is the “Computed Pattern” from MongoDB’s data modeling playbook. The daily aggregation runs on a smaller dataset and can be scheduled during off-peak hours.
- When
allowDiskUseis enabled, MongoDB uses external merge sort for datasets that exceed memory. It sorts chunks of data in memory (up to the 100 MB limit per stage), writes the sorted chunks to temporary files on disk, then merges the sorted chunks. This is the same algorithm used by Unixsorton large files. The performance depends heavily on disk I/O speed — SSDs make a massive difference over spinning disks. - To speed it up: ensure the sort can be index-supported. If your pipeline’s
$matchfeeds directly into$sortand both use the same index, MongoDB can read the data in sorted order from the index, eliminating the sort step entirely. For example, with index{ date: 1, amount: -1 }, a pipeline of$match: { date: { range } }followed by$sort: { amount: -1 }can use the index’s ordering. - If index-supported sort is not possible (common after
$group, which produces new documents), reduce the data volume before sorting. If you are sorting grouped results, there are typically far fewer groups than raw documents — 500 million documents might produce 10,000 groups, which sorts instantly. - Another approach: partition the workload. Instead of one pipeline for all 500 million documents, run 12 smaller pipelines (one per month of the trailing year) in parallel, then combine the results in application code. Each smaller pipeline sorts a manageable subset.
Write an aggregation pipeline that finds the top 5 customers by total spending in the last 90 days, including their most recent order date and average order value. Explain each stage.
Write an aggregation pipeline that finds the top 5 customers by total spending in the last 90 days, including their most recent order date and average order value. Explain each stage.
- The key design decisions:
$matchfirst for index usage and data reduction.$limitbefore$lookupso we only join 5 documents instead of every customer.$roundin$projectfor clean financial output. Excluding cancelled orders in the initial match to avoid inflating totals. - The ideal index for this pipeline is
{ orderDate: 1, status: 1, customerId: 1, totalAmount: 1 }. This covers the$matchfilter and provides all the fields the$groupneeds without fetching full documents.
- A real-time dashboard that recomputes this aggregation on every page load is not viable for a large orders collection. Instead, I would use a materialized view pattern.
- MongoDB supports
$mergeas the final aggregation stage, which writes the pipeline results to a target collection. Schedule this pipeline to run every 5 minutes, writing results to atop_customers_90dcollection. The dashboard reads from this small, pre-computed collection (5 documents) — instant response time. - For even lower latency, use Change Streams on the orders collection. When a new order is placed, a consumer updates a running total in a
customer_statscollection. The dashboard reads fromcustomer_stats, which is always up to date without running a full aggregation. This is more complex to implement (you need to handle order cancellations, refunds, and the 90-day rolling window), but it provides true real-time data. - The
$mergeapproach is the pragmatic middle ground: near-real-time (5-minute lag), simple to implement, and the aggregation runs on a secondary node so it does not impact production.