Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Aggregation Pipeline

Aggregation Framework

The aggregation framework allows you to process data records and return computed results. It is similar to the GROUP BY clause in SQL but far more powerful — think of it as building a data processing assembly line. The factory analogy: Imagine a factory where raw materials (your documents) move along a conveyor belt. At each station (stage), a worker performs one specific task — one station filters out defective items ($match), the next groups items by color ($group), the next reshapes them for packaging ($project). Each station only sees what the previous station passed forward. That is exactly how the aggregation pipeline works: each stage receives the output of the previous stage and passes its result to the next.

The Pipeline

Aggregations work as a pipeline. Documents pass through a series of stages, where each stage transforms the documents. The pipeline is expressed as an array of stage objects.
db.orders.aggregate([
  // Stage 1: Filter -- only orders with status "A" pass through
  { $match: { status: "A" } },

  // Stage 2: Group -- combine filtered orders by customer, summing amounts
  { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])
The $ prefix on field names inside aggregation stages (e.g., "$cust_id") means “the value of this field in the current document.” It is how you reference document fields inside expressions. Without the $, MongoDB treats it as a literal string.

Common Stages

$match

Filters the documents — equivalent to find() or SQL WHERE. This stage should appear as early as possible in the pipeline to reduce the number of documents subsequent stages process.
// Simple equality filter
{ $match: { status: "active" } }

// Complex filter with operators (same syntax as find queries)
{ $match: { age: { $gte: 18 }, status: { $in: ["active", "pending"] } } }
Performance rule of thumb: Always place $match as the first stage when possible. A $match at the start of a pipeline can use indexes, just like find(). A $match after a $group or $project cannot use indexes because the documents have already been transformed.

$group

Groups documents by a specified identifier key and applies accumulator expressions to each group. This is MongoDB’s equivalent of SQL’s GROUP BY combined with aggregate functions.
{
  $group: {
    _id: "$category",       // Group by category field (null groups everything)
    count: { $sum: 1 },     // Count documents per group
    avgPrice: { $avg: "$price" },   // Average price per group
    maxPrice: { $max: "$price" },   // Highest price per group
    allTags: { $addToSet: "$tag" }  // Unique tags per group (no duplicates)
  }
}
Common accumulators:
AccumulatorSQL EquivalentWhat It Does
$sumSUM() / COUNT()Adds values; use $sum: 1 to count
$avgAVG()Calculates the mean
$min / $maxMIN() / MAX()Finds extremes
$first / $lastN/AFirst/last value in each group (depends on sort order)
$pushN/ACollects all values into an array
$addToSetN/ACollects unique values into an array

$project

Reshapes each document — include, exclude, rename fields, or create computed fields. Think of it as choosing which columns to show and adding calculated columns.
// Include only name and total, exclude _id
{ $project: { name: 1, total: 1, _id: 0 } }

// Create computed fields
{
  $project: {
    fullName: { $concat: ["$firstName", " ", "$lastName"] },
    totalWithTax: { $multiply: ["$total", 1.08] },
    year: { $year: "$createdAt" }
  }
}

$sort

Sorts all input documents and returns them to the pipeline in sorted order.
{ $sort: { total: -1 } }          // Descending (highest first)
{ $sort: { lastName: 1, age: -1 } } // Multi-field: lastName ascending, then age descending
$sort in an aggregation pipeline has a 100 MB memory limit by default. For large datasets, either place $match before $sort to reduce document count, or pass { allowDiskUse: true } as an option to aggregate().

$limit and $skip

Used for pagination.
{ $skip: 10 },  // Skip the first 10 results
{ $limit: 5 }   // Return only 5 results

$unwind

Deconstructs an array field, creating one document per array element. Essential when you need to aggregate across array values.
// Before $unwind: { name: "Alice", hobbies: ["reading", "coding"] }
{ $unwind: "$hobbies" }
// After $unwind: two documents:
//   { name: "Alice", hobbies: "reading" }
//   { name: "Alice", hobbies: "coding" }
$unwind on an empty array removes the document entirely by default. Use { $unwind: { path: "$hobbies", preserveNullAndEmptyArrays: true } } to keep documents that have empty or missing arrays.

$lookup

Performs a left outer join to another collection — MongoDB’s equivalent of a SQL JOIN.
{
  $lookup: {
    from: "authors",         // The "other" collection to join
    localField: "authorId",  // Field in the current collection
    foreignField: "_id",     // Field in the "authors" collection
    as: "authorDetails"      // Output array field name
  }
}
// Result: each document gains an "authorDetails" array with matching author docs
$lookup can be expensive on large collections. Unlike SQL joins that rely on mature query planners, $lookup essentially runs a query per input document (though MongoDB optimizes this internally). If you find yourself doing heavy $lookup operations, consider whether your data model should embed the related data instead.

Example Pipelines

Example 1: Total Sales by Product (Basic)

Calculate the total sales per product for the year 2023, sorted by highest sales.
db.sales.aggregate([
  // Stage 1: Filter for 2023 only -- reduces dataset before grouping
  // TIP: This $match can use an index on the "date" field
  { $match: {
    date: { $gte: new Date('2023-01-01'), $lt: new Date('2024-01-01') }
  }},

  // Stage 2: Group by product, summing the amount field
  { $group: {
    _id: "$product",
    totalSales: { $sum: "$amount" },
    orderCount: { $sum: 1 }
  }},

  // Stage 3: Sort by totalSales descending (highest first)
  { $sort: { totalSales: -1 } }
])

Example 2: E-Commerce Dashboard (Realistic)

Generate a dashboard showing average order value by month, with the top product per month.
db.orders.aggregate([
  // Filter to completed orders in the last year
  { $match: {
    status: "completed",
    createdAt: { $gte: new Date('2023-01-01') }
  }},

  // Group by year-month, calculating stats
  { $group: {
    _id: {
      year: { $year: "$createdAt" },
      month: { $month: "$createdAt" }
    },
    avgOrderValue: { $avg: "$total" },
    totalRevenue: { $sum: "$total" },
    orderCount: { $sum: 1 }
  }},

  // Reshape output for readability
  { $project: {
    _id: 0,
    period: { $concat: [
      { $toString: "$_id.year" }, "-",
      { $toString: "$_id.month" }
    ]},
    avgOrderValue: { $round: ["$avgOrderValue", 2] },
    totalRevenue: { $round: ["$totalRevenue", 2] },
    orderCount: 1
  }},

  // Sort chronologically
  { $sort: { period: 1 } }
])

Example 3: Tagging Analysis with $unwind (Array Processing)

Find the most popular tags across all blog posts and count how many posts use each tag.
db.posts.aggregate([
  // Only published posts
  { $match: { status: "published" } },

  // Deconstruct the tags array -- one document per tag
  { $unwind: "$tags" },

  // Group by tag, count occurrences
  { $group: {
    _id: "$tags",
    postCount: { $sum: 1 }
  }},

  // Sort by popularity descending
  { $sort: { postCount: -1 } },

  // Top 10 tags only
  { $limit: 10 }
])

Aggregation Performance Tips

  1. $match first, always. A $match at the start of the pipeline can use indexes and eliminates documents before expensive stages like $group and $sort.
  2. Create indexes for $match and $sort stages. The first $match and first $sort in a pipeline can leverage indexes. Subsequent ones cannot because the document structure has changed.
  3. Use allowDiskUse for large datasets. By default, each pipeline stage has a 100 MB RAM limit. For large aggregations:
db.orders.aggregate([...stages...], { allowDiskUse: true })
  1. Prefer $project early to drop unused fields. If your documents have 50 fields but you only need 3, projecting early reduces the memory footprint for every subsequent stage.
  2. Avoid $lookup on unindexed foreign fields. Always ensure the foreignField in a $lookup has an index, or the join will trigger a collection scan on the target collection for every input document.

Summary

  • Aggregation Pipeline processes data in stages — each stage transforms documents and passes them forward.
  • Use $match to filter early (for performance — it can use indexes).
  • Use $group to calculate statistics (sum, avg, min, max, count).
  • Use $project to reshape output and drop unused fields.
  • Use $unwind to work with array fields, $lookup to join collections.
  • Always place $match first and create indexes on fields used in $match and $sort stages.
  • Pass { allowDiskUse: true } when aggregating large datasets that exceed the 100 MB stage memory limit.

Interview Deep-Dive

Strong Answer:
  • Placing $match early is the single most impactful optimization in aggregation pipeline design. The reason is simple: every subsequent stage only processes the documents that pass through the previous stage. If you filter 10 million documents down to 50,000 with a $match first, every subsequent $group, $sort, $lookup, and $project operates on 50,000 documents instead of 10 million. The difference in execution time can be three orders of magnitude.
  • More importantly, $match at the beginning of a pipeline can use indexes. MongoDB’s query planner optimizes the first $match stage the same way it optimizes a find() query — it can use compound indexes, covered index scans, and all the usual index strategies. A $match that appears after a $group or $project cannot use indexes because the data has been transformed and no longer maps to the original collection’s index structure.
  • What happens internally: MongoDB’s aggregation framework applies automatic optimizations. One key optimization is “match coalescence” — if you have $match stages separated by non-transformative stages, the optimizer merges them or moves them earlier. For example, $sort followed by $match can be reordered to $match followed by $sort because filtering first, then sorting a smaller set is always faster.
  • However, the optimizer cannot move $match before stages that transform the data. If you have $group followed by $match, the $match filters on the grouped output, which has a completely different schema from the input. The optimizer cannot push this $match before the $group because the filter fields do not exist yet.
  • Real-world example: I worked on a reporting pipeline that calculated monthly revenue by product category. The original pipeline was $group (all 50M orders), then $match (filter to 2024). Execution time: 45 seconds. Reordered to $match (filter to 2024, roughly 12M orders) then $group — execution time: 8 seconds. Added a compound index on { date: 1 } that the initial $match could use — execution time: 1.2 seconds. Same result, 37x faster.
Follow-up: The pipeline optimizer sometimes fails to optimize. How do you verify what the optimizer actually did, and can you manually override it?
  • Run the pipeline with .explain("executionStats") (or .explain("queryPlanner") for just the plan). The explain output shows the “optimizedPipeline” — the actual stages MongoDB will execute after optimization. Compare this to your original pipeline to see what the optimizer moved, merged, or eliminated.
  • You cannot directly override the optimizer’s decisions, but you can structure your pipeline to guide it. For instance, the optimizer will merge consecutive $match stages into a single $match with $and. If you do not want this (rare, but sometimes useful for readability), you can insert a no-op $addFields between them to prevent coalescence.
  • A practical tip: if explain shows your initial $match is not using an index, check whether you have a $addFields or $set before the $match that prevents index usage. Even a seemingly harmless $addFields: { fullName: { $concat: ["$firstName", " ", "$lastName"] } } before $match: { status: "active" } prevents the $match from using an index on status because $addFields transforms the document stream.
Strong Answer:
  • $lookup performs a left outer join between the current collection and a foreign collection. For each document in the pipeline, it finds all matching documents in the foreign collection and adds them as an array field. The basic syntax is: { $lookup: { from: "orders", localField: "_id", foreignField: "userId", as: "userOrders" } }.
  • Compared to a SQL JOIN, there are critical differences. SQL joins are typically optimized by the query planner with hash joins, merge joins, and nested loop joins depending on data size. MongoDB’s $lookup is essentially a nested loop join by default — for each document in the left side, it queries the right collection. This means performance is O(n * m) in the worst case, where n is the number of left-side documents and m is the average cost of each right-side lookup.
  • The performance implication most developers miss: $lookup does use indexes on the foreign collection’s foreignField, but only if the foreignField has an appropriate index. Without an index on orders.userId, each $lookup iteration does a collection scan of the orders collection. For a pipeline processing 10,000 users against 1 million orders, that is 10,000 collection scans — catastrophically slow. With an index on orders.userId, each lookup is an index seek, and the total cost is manageable.
  • Another underestimated cost: memory. The $lookup result for each document includes all matching foreign documents as an embedded array. If a user has 50,000 orders, the result document now contains a 50,000-element array. This can exceed the 16 MB document size limit and crash the pipeline. The mitigation is to add a $match inside the $lookup pipeline (using the uncorrelated subquery syntax) to filter the right-side documents before they are assembled into the array.
  • $lookup also does not work across sharded collections as the foreign collection (the from collection). The local collection can be sharded, but the foreign collection must be unsharded or on the same shard. This is a significant limitation for large-scale systems where both collections need horizontal scaling.
  • My guidance: if you find yourself writing $lookup in a query that runs frequently (more than a few times per second), it is a signal that your data model might benefit from embedding the referenced data instead of joining at query time. $lookup is better suited for analytics pipelines, admin dashboards, and infrequent reports — not hot-path user-facing queries.
Follow-up: How would you handle a situation where you need JOIN-like functionality in a hot-path query that runs thousands of times per second?
  • Denormalize. The MongoDB data modeling philosophy for hot-path queries is: do the work at write time so reads are fast. Instead of joining users and orders at read time, embed the user’s name and email in each order document when the order is created. The read query then fetches a single order document with all the data it needs — no join, no second collection lookup.
  • The cost is write complexity: when a user updates their email, you need to propagate the change to all their orders. Use Change Streams to listen for user updates and asynchronously update the denormalized copies in orders. Accept that there is a brief window of staleness (usually seconds) after a user update.
  • For cases where denormalization is impractical (the referenced data is too large or changes too frequently), use application-level caching. Fetch the order, then fetch the user from a Redis cache keyed by user ID. Cache hit rate for user profiles in an order-display context is typically 95%+, so you eliminate most of the “join” work.
  • The anti-pattern is treating MongoDB like a relational database and using $lookup everywhere. If your data model requires 5 $lookup stages to render a single page, you have designed a relational schema in a document database, and you should either restructure your data model or switch to PostgreSQL where those joins are optimized natively.
Strong Answer:
  • A pipeline processing 500 million documents is a resource-intensive operation that can saturate CPU, consume all available RAM, and compete with production traffic for I/O bandwidth. The strategy is to isolate the workload and control resource consumption.
  • First, run the pipeline on a secondary replica set member, not the primary. Set readPreference: "secondary" in the aggregation options. This ensures the heavy read workload does not compete with production writes on the primary. The data might be slightly stale (replication lag), but for a monthly report, a few seconds of lag is irrelevant.
  • Second, use allowDiskUse: true. By default, each aggregation pipeline stage is limited to 100 MB of RAM. For a $group or $sort on 500 million documents, you will almost certainly exceed this limit. allowDiskUse lets MongoDB spill intermediate results to disk when RAM is exhausted. It is slower than in-memory processing but prevents out-of-memory errors.
  • Third, optimize the pipeline to reduce the data volume at each stage. The $match should be as selective as possible and use an index. If the report is for January 2024, filter by date range first — this might reduce 500 million to 40 million before any grouping happens. The $project should be placed early to drop unnecessary fields so that less data flows through subsequent stages. Every byte you eliminate early saves CPU and I/O at every subsequent stage.
  • Fourth, consider a dedicated analytics replica. MongoDB allows you to add a replica set member with a lower priority (so it never becomes primary) and route all analytics workloads to it. This member can have different hardware (more RAM, faster disks) optimized for heavy read operations. Atlas supports this natively with “Analytics Nodes.”
  • Fifth, for recurring reports, pre-aggregate. Instead of crunching 500 million raw documents monthly, maintain a daily_summaries collection that aggregates each day’s data into summary documents. The monthly report then aggregates 30 daily summaries instead of 500 million raw documents. This is the “Computed Pattern” from MongoDB’s data modeling playbook. The daily aggregation runs on a smaller dataset and can be scheduled during off-peak hours.
Follow-up: Your $sort stage on 500 million documents is the bottleneck even with allowDiskUse. How does MongoDB sort that much data, and how can you speed it up?
  • When allowDiskUse is enabled, MongoDB uses external merge sort for datasets that exceed memory. It sorts chunks of data in memory (up to the 100 MB limit per stage), writes the sorted chunks to temporary files on disk, then merges the sorted chunks. This is the same algorithm used by Unix sort on large files. The performance depends heavily on disk I/O speed — SSDs make a massive difference over spinning disks.
  • To speed it up: ensure the sort can be index-supported. If your pipeline’s $match feeds directly into $sort and both use the same index, MongoDB can read the data in sorted order from the index, eliminating the sort step entirely. For example, with index { date: 1, amount: -1 }, a pipeline of $match: { date: { range } } followed by $sort: { amount: -1 } can use the index’s ordering.
  • If index-supported sort is not possible (common after $group, which produces new documents), reduce the data volume before sorting. If you are sorting grouped results, there are typically far fewer groups than raw documents — 500 million documents might produce 10,000 groups, which sorts instantly.
  • Another approach: partition the workload. Instead of one pipeline for all 500 million documents, run 12 smaller pipelines (one per month of the trailing year) in parallel, then combine the results in application code. Each smaller pipeline sorts a manageable subset.
Strong Answer:
db.orders.aggregate([
  // Stage 1: Filter to last 90 days. This is the most critical stage -- it
  // must be first to use the index on 'orderDate' and reduce the working set.
  {
    $match: {
      orderDate: {
        $gte: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
      },
      status: { $ne: "cancelled" }
    }
  },

  // Stage 2: Group by customer. For each customer, calculate total spending,
  // count orders, and track the most recent order date.
  {
    $group: {
      _id: "$customerId",
      totalSpent: { $sum: "$totalAmount" },
      orderCount: { $sum: 1 },
      lastOrderDate: { $max: "$orderDate" },
      avgOrderValue: { $avg: "$totalAmount" }
    }
  },

  // Stage 3: Sort by total spending descending.
  { $sort: { totalSpent: -1 } },

  // Stage 4: Take only top 5. This dramatically reduces the data flowing
  // to the next stage -- from potentially millions of customers to 5.
  { $limit: 5 },

  // Stage 5: Enrich with customer details. $lookup is acceptable here
  // because we are only joining 5 documents, not the full dataset.
  {
    $lookup: {
      from: "customers",
      localField: "_id",
      foreignField: "_id",
      as: "customerInfo"
    }
  },

  // Stage 6: Flatten the customerInfo array.
  { $unwind: "$customerInfo" },

  // Stage 7: Shape the final output.
  {
    $project: {
      _id: 0,
      customerId: "$_id",
      customerName: "$customerInfo.name",
      customerEmail: "$customerInfo.email",
      totalSpent: { $round: ["$totalSpent", 2] },
      orderCount: 1,
      avgOrderValue: { $round: ["$avgOrderValue", 2] },
      lastOrderDate: 1
    }
  }
])
  • The key design decisions: $match first for index usage and data reduction. $limit before $lookup so we only join 5 documents instead of every customer. $round in $project for clean financial output. Excluding cancelled orders in the initial match to avoid inflating totals.
  • The ideal index for this pipeline is { orderDate: 1, status: 1, customerId: 1, totalAmount: 1 }. This covers the $match filter and provides all the fields the $group needs without fetching full documents.
Follow-up: What if you need this report to update in real-time on a dashboard, not just run once? How would you architect that?
  • A real-time dashboard that recomputes this aggregation on every page load is not viable for a large orders collection. Instead, I would use a materialized view pattern.
  • MongoDB supports $merge as the final aggregation stage, which writes the pipeline results to a target collection. Schedule this pipeline to run every 5 minutes, writing results to a top_customers_90d collection. The dashboard reads from this small, pre-computed collection (5 documents) — instant response time.
  • For even lower latency, use Change Streams on the orders collection. When a new order is placed, a consumer updates a running total in a customer_stats collection. The dashboard reads from customer_stats, which is always up to date without running a full aggregation. This is more complex to implement (you need to handle order cancellations, refunds, and the 90-day rolling window), but it provides true real-time data.
  • The $merge approach is the pragmatic middle ground: near-real-time (5-minute lag), simple to implement, and the aggregation runs on a secondary node so it does not impact production.