Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

CRUD: Read

Reading data is the most common database operation—and the one that can make or break your application’s performance. MongoDB provides a powerful query language that lets you find exactly the data you need, efficiently.

Understanding MongoDB Queries

Unlike SQL’s string-based queries, MongoDB uses JSON-like query documents. This approach is:
  • Type-safe: No SQL injection vulnerabilities from string concatenation
  • Composable: Build complex queries programmatically
  • Intuitive: If you know JSON, you can write queries

Query Performance Basics

Every query should be intentional. Poorly written queries can:
  • Scan millions of documents instead of using indexes
  • Consume excessive memory on large result sets
  • Block other operations due to collection-level locks (in older versions)
Always use projection to return only the fields you need, and consider indexes for frequently queried fields. We’ll cover indexing in a later chapter.

Basic Query Methods

Read operations retrieve documents from a collection; i.e., query a collection for documents.

find() and findOne()

  • find(filter): Returns a cursor to the matching documents.
  • findOne(filter): Returns the first matching document.
// Find all documents -- equivalent to SELECT * FROM users
// Returns a cursor (lazy iterator), not all documents at once
db.users.find({})

// Find users with name "Alice" -- equivalent to SELECT * FROM users WHERE name = 'Alice'
// This is an equality match -- the simplest query pattern
db.users.find({ name: "Alice" })

// findOne returns the first matching document directly (not a cursor)
// Useful when you expect exactly one result, like looking up by _id
db.users.findOne({ _id: ObjectId("507f1f77bcf86cd799439011") })
find() returns a cursor, not an array. The cursor fetches documents in batches (default 101 documents in the first batch, then 16 MB batches). This means you can iterate over millions of documents without loading them all into memory at once. In the shell, the cursor auto-prints the first 20 documents — type it to see more.

Comparison Operators

  • $eq: Equal to
  • $ne: Not equal to
  • $gt: Greater than
  • $gte: Greater than or equal to
  • $lt: Less than
  • $lte: Less than or equal to
  • $in: In an array
  • $nin: Not in an array
Examples:
// Age > 25
db.users.find({ age: { $gt: 25 } })

// Status is "active" OR "pending"
db.users.find({ status: { $in: ["active", "pending"] } })

Logical Operators

  • $or: Joins query clauses with a logical OR
  • $and: Joins query clauses with a logical AND
  • $not: Inverts the effect of a query expression
  • $nor: Joins query clauses with a logical NOR
Example:
// Age < 30 AND status = "active"
db.users.find({
  $and: [
    { age: { $lt: 30 } },
    { status: "active" }
  ]
})

// Implicit AND (shorter syntax)
db.users.find({ age: { $lt: 30 }, status: "active" })

Element Operators

  • $exists: Matches documents that have the specified field.
  • $type: Selects documents if a field is of the specified type.
// Find users who have a "phone" field
db.users.find({ phone: { $exists: true } })

Projection

Projection determines which fields are returned in the matching documents.
// Return only name and age (exclude _id)
db.users.find({}, { name: 1, age: 1, _id: 0 })

Summary

  • Use find() to query data.
  • Use comparison operators like $gt, $lt, $in for filtering.
  • Use logical operators like $or, $and for complex logic.
  • Use Projection to limit the fields returned.

Interview Deep-Dive

Strong Answer:
  • Step one: run the query with .explain("executionStats") to get the execution plan. This tells you everything: which index was used (or if none was), how many documents were scanned, how many were returned, and the execution time breakdown.
  • The key fields I look at in the explain output are winningPlan.stage (COLLSCAN is bad, IXSCAN is good), totalDocsExamined versus nReturned (if you examined 10 million documents to return 50, your query is inefficient), and executionTimeMillis broken down by stage.
  • If I see a COLLSCAN, the fix is straightforward: create an index that covers the query’s filter fields. For a query like db.orders.find({ customerId: "abc", status: "shipped" }).sort({ createdAt: -1 }), the ideal compound index is { customerId: 1, status: 1, createdAt: -1 }. The order matters — filter fields first (equality matches), then sort fields.
  • If an index exists but MongoDB is not using it, check for a few common culprits. First, the query planner might have chosen a different plan because it estimates a collection scan is cheaper (this happens with low-selectivity fields). Use .hint() to force index usage and compare performance. Second, the query might use operators that cannot use indexes efficiently, like $ne, $nin, or $regex without an anchor (/^prefix/ uses indexes, /substring/ does not).
  • If the index is being used but the query is still slow, check whether the index fits in RAM. Run db.orders.stats() and compare totalIndexSize to your available memory. If the index is 20 GB but your WiredTiger cache is 4 GB, every index lookup potentially hits disk. The fix is either more RAM, a more selective index (fewer fields, partial index), or rethinking the query.
  • Another common issue: returning too much data. If the query returns 100,000 documents with 50 fields each, the bottleneck might be network transfer and deserialization, not the database itself. Add a projection to return only the fields you need, and use .limit() if you do not need all results.
Follow-up: You add a compound index and the query drops from 12 seconds to 5 milliseconds. But now your write throughput has dropped by 15%. What is happening?
  • Every index adds overhead to write operations. When you insert or update a document, MongoDB must update every index that includes a modified field. A collection with 10 indexes means every insert triggers 10 index B-tree updates, each requiring disk I/O and potentially splitting B-tree pages.
  • The 15% write throughput drop is the cost of your new index. The question is whether the read improvement justifies it. If this query runs 10,000 times per second and was previously causing cascading timeouts, the trade-off is clearly worth it. If it runs once a day in a reporting script, consider dropping the index and accepting the slow report.
  • To mitigate write overhead from indexes: audit all indexes with db.orders.aggregate([{ $indexStats: {} }]). This shows which indexes are actually being used and how often. I regularly find “zombie indexes” — indexes created months ago for a feature that was later removed. Every unused index is pure write overhead with zero benefit. Drop them.
  • Also consider partial indexes. If your query only needs shipped orders, db.orders.createIndex({ customerId: 1 }, { partialFilterExpression: { status: "shipped" } }) creates a smaller index that only includes documents matching the filter. Smaller index means less write overhead and better cache utilization.
Strong Answer:
  • find() returns a cursor object, not the actual documents. The cursor is a pointer to the result set that the server maintains. Documents are fetched in batches (default first batch is 101 documents or 16 MB, whichever is smaller; subsequent batches are 16 MB each) as you iterate over the cursor. findOne() is syntactic sugar for find().limit(1) — it fetches a single document and closes the cursor immediately.
  • This distinction matters enormously for memory management. If your query matches 5 million documents and you call find(), the server does not load all 5 million into memory at once. It streams them in batches via the cursor. Your application processes one batch, requests the next, processes it, and so on. Memory usage stays constant regardless of result set size.
  • If your application code accidentally forces all documents into memory (for example, await collection.find({}).toArray() in Node.js), you lose the streaming benefit. For a 5-million-document result, .toArray() loads everything into a single JavaScript array, potentially consuming gigabytes of memory and causing your Node.js process to crash with an out-of-memory error.
  • Cursors also have a server-side timeout. By default, a cursor expires after 10 minutes of inactivity (the noCursorTimeout option overrides this, but it is dangerous because orphaned cursors consume server resources). If your application takes too long between batch fetches — maybe it is doing heavy processing on each document — the cursor expires and the next fetch fails with a “cursor not found” error.
  • The practical advice: for small result sets (under a few thousand documents), .toArray() is fine and more convenient. For large result sets, always iterate the cursor with a for await...of loop or .forEach(), process documents in batches, and ensure your processing is fast enough to avoid cursor timeouts.
Follow-up: You are iterating over a cursor that returns 10 million documents, and midway through the iteration, another process updates some of the documents you have not yet read. Do you see the updated or original versions?
  • This depends on the cursor’s snapshot behavior, and the answer is “it is complicated.” By default, MongoDB cursors do not guarantee a point-in-time snapshot. You might see the original version of some documents and the updated version of others, depending on whether the cursor has already fetched the batch containing those documents.
  • It gets worse: if an update changes a document’s size and it moves to a different location on disk (this used to happen with MMAPv1 but is rare with WiredTiger), the cursor could potentially skip the document entirely or return it twice. WiredTiger largely mitigates this because documents have stable record IDs, but it is not guaranteed.
  • If you need consistent point-in-time reads, use a snapshot read concern. With readConcern: "snapshot" (available inside a transaction), the cursor returns data as of the transaction’s start time, regardless of concurrent modifications. Outside transactions, the readConcern: "majority" provides a weaker but useful guarantee: you only see data that has been committed by a majority of replica set members, but it is not a frozen snapshot.
  • For the 10-million-document iteration use case, the pragmatic solution is often to not worry about perfect consistency. If you are generating a report or doing a data migration, small inconsistencies during the iteration are acceptable. If you need exact consistency, use a transaction with snapshot read concern — but be aware that long-running transactions (especially ones reading 10 million documents) hold resources and can time out.
Strong Answer:
  • For querying multiple values on the same field, $in and $or produce the same logical result, but they differ in how MongoDB executes them. $in is always the better choice for matching multiple values on a single field.
  • $in executes as a single index scan. When you query { status: { $in: ["active", "pending", "review"] } }, MongoDB performs one IXSCAN operation that seeks to each value in the index B-tree. The query planner treats this as a single, efficient operation.
  • $or with the same logic — { $or: [{ status: "active" }, { status: "pending" }, { status: "review" }] } — may execute as multiple index scans that are then merged. Each clause in the $or is evaluated independently, and the results are deduplicated. This has more overhead: multiple index lookups, a merge step, and deduplication.
  • Where $or is necessary is when you are matching across different fields. For example, “find users where name is Alice OR age is 30” — { $or: [{ name: "Alice" }, { age: 30 }] }. This cannot be expressed with $in because the conditions are on different fields. For $or to use indexes in this case, each clause must have its own index. If only name is indexed, the age clause triggers a COLLSCAN for that branch.
  • A subtle performance trap: $or at the top level can use indexes, but $or nested inside other expressions might not. For example, { active: true, $or: [{ name: "Alice" }, { age: 30 }] } can use an index on active plus indexes on name and age. But complex nesting can cause the query planner to give up on index optimization and fall back to a COLLSCAN.
  • My rule of thumb: use $in whenever you are checking multiple values on the same field. Reserve $or for conditions across different fields. And always verify with explain() — assumptions about query planning are often wrong.
Follow-up: You have a query using $in with 10,000 values. Does MongoDB handle this efficiently, or is there a practical limit?
  • MongoDB does not have a hard limit on the size of an $in array, but there are practical performance considerations. An $in with 10,000 values is on the edge of reasonable. Each value requires a seek in the index B-tree, so 10,000 values means 10,000 seeks. If the index is fully in memory, each seek is sub-microsecond and the total is a few milliseconds. If the index is on disk, each seek could be a disk I/O, and 10,000 disk I/Os could take seconds.
  • The query document itself must fit in the 16 MB BSON limit. With 10,000 ObjectId values (12 bytes each plus BSON overhead), you are at roughly 200 KB — well within the limit. But with 10,000 long strings, you could approach the limit.
  • At very high counts (100K+ values), consider restructuring the query. Instead of { userId: { $in: [huge list] } }, stage the list into a temporary collection and use $lookup in an aggregation pipeline to join against it. Or redesign the data model so the relationship is expressed differently — for example, tag users with a group ID and query { groupId: "batch-42" } instead of listing every user ID.
  • I have seen production incidents where a buggy application dynamically built an $in array that grew unboundedly — eventually hitting millions of values and causing timeouts. Defensive coding: always validate and cap the size of arrays used in $in clauses.
Strong Answer:
  • A covered query is one where all the fields in the query filter, projection, and sort are present in a single index. When a query is covered, MongoDB can satisfy it entirely from the index without ever loading the actual document from disk. The explain output shows totalDocsExamined: 0 — zero documents touched, only index entries.
  • Why this matters so much: indexes are typically much smaller than the documents they index. A compound index on { userId: 1, status: 1, createdAt: -1 } might be 2 GB for a collection where the full data is 100 GB. If that 2 GB index fits in the WiredTiger cache (RAM), every covered query is served entirely from memory with no disk I/O. This is the difference between a 1ms response and a 50ms response.
  • To create a covered query, your projection must exclude _id (unless _id is part of the index) and include only fields that are in the index. For example, with the index { userId: 1, email: 1 }, the query db.users.find({ userId: 123 }, { email: 1, _id: 0 }) is covered. But db.users.find({ userId: 123 }, { email: 1, name: 1, _id: 0 }) is not covered because name is not in the index.
  • The _id exclusion catches many developers. By default, MongoDB includes _id in query results. If _id is not part of your index and you do not explicitly exclude it with _id: 0, the query cannot be covered because MongoDB must fetch the document to get the _id value.
  • Covered queries are particularly powerful for count operations and existence checks. db.orders.countDocuments({ status: "shipped" }) with an index on { status: 1 } is covered — it counts index entries without touching documents.
  • The limitation: multikey indexes (indexes on array fields) cannot cover queries. Even if all projected fields are in the index, if the index is multikey, MongoDB must fetch the document to correctly reconstruct the array. This is a documented limitation of the query engine.
Follow-up: Can you have a covered query that involves a sort operation? What are the conditions?
  • Yes, and it is one of the most impactful optimizations. If the sort field is part of the index and the index ordering matches the requested sort direction, MongoDB reads index entries in order without a separate sort step. This is called an “index-provided sort.” Combined with projection on only indexed fields, you get a covered query with sorted results and no in-memory sort.
  • For example, with index { status: 1, createdAt: -1 }, the query db.orders.find({ status: "shipped" }, { status: 1, createdAt: 1, _id: 0 }).sort({ createdAt: -1 }) is fully covered with an index-provided sort. The explain output shows IXSCAN with no SORT stage.
  • The sort direction matters. If your index is { createdAt: 1 } (ascending) and your query sorts { createdAt: -1 } (descending), MongoDB can still use the index by traversing it in reverse. But with compound indexes, the reversal must be consistent across all fields — you cannot reverse just one field in a compound sort.
  • The in-memory sort limit is another reason to aim for index-provided sorts. MongoDB refuses to sort more than 100 MB of data in memory (the allowDiskUse: true option relaxes this for aggregation, but not for find().sort()). If your query matches 500,000 documents and needs a sort that is not index-supported, it fails with a “sort exceeded memory limit” error.