Databases & Collections

In MongoDB, data is organized in a three-level hierarchy. Think of it like a physical office building:
  • A Database is an entire filing room — a dedicated space for one application or domain.
  • A Collection is a filing cabinet drawer in that room — it holds documents of a similar type (users, orders, products).
  • A Document is a single folder in the drawer — one self-contained record with all its details inside.
| MongoDB Term | SQL Equivalent | Analogy |
| --- | --- | --- |
| Database | Database | A filing room |
| Collection | Table | A cabinet drawer |
| Document | Row | A folder in the drawer |
| Field | Column | A labeled sheet inside the folder |
The key difference from SQL: in a relational table, every row must have the same columns. In a MongoDB collection, each document can have a completely different shape. One user document might have a phone field; another might not. The collection does not enforce uniformity unless you explicitly add validation rules.

Creating a Database

In MongoDB, you don’t explicitly “create” a database. You switch to a name and start inserting data, and MongoDB creates the database lazily when the first document is written. This is like labeling an empty room: the room exists the moment you put something in it, not when you write the name on the door.
Using mongosh (MongoDB Shell):
// Switch to a database (creates it lazily on first write)
use myNewDatabase

// Verify the current database
db.getName()  // returns "myNewDatabase"

// List all databases (only shows databases that contain data)
show dbs
show dbs only lists databases that contain data. If you run use myNewDatabase but never insert a document or create a collection, the database will not appear in the list. This surprises many beginners.

Creating a Collection

Similarly, collections are created when you first insert data into them — another example of MongoDB’s lazy creation philosophy.
// Implicit creation: collection "users" is created automatically
db.users.insertOne({ name: "Alice", role: "admin" })
You can also explicitly create a collection when you need to set options like validation rules, capped collections, or collation settings.
// Explicit creation with schema validation
db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "email"],
      properties: {
        name: { bsonType: "string", description: "must be a string" },
        email: { bsonType: "string", pattern: "^.+@.+$" }
      }
    }
  }
})

// Capped collection: fixed-size, FIFO behavior (great for logs)
db.createCollection("logs", {
  capped: true,
  size: 5242880,  // 5 MB max size
  max: 5000       // 5000 documents max
})
Capped collections are powerful for log-style data, but they come with constraints: you cannot delete individual documents, you cannot shard them, and updates that increase document size will fail. Use them only when FIFO (first-in, first-out) behavior is what you actually need.
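To make the FIFO behavior concrete, here is a toy in-memory model of a capped collection limited by document count. It is not MongoDB code: the CappedBuffer class and its methods are hypothetical names used only for illustration, and a real capped collection also enforces the byte size limit.

```javascript
// Toy model of a capped collection's document-count limit (`max`).
// Hypothetical illustration only; real capped collections also enforce `size` in bytes.
class CappedBuffer {
  constructor(max) {
    this.max = max;
    this.docs = [];
  }

  insertOne(doc) {
    this.docs.push(doc);
    // When the cap is exceeded, the oldest document is evicted first (FIFO).
    if (this.docs.length > this.max) {
      this.docs.shift();
    }
  }

  // Documents come back in insertion order, oldest first.
  find() {
    return [...this.docs];
  }
}

const logs = new CappedBuffer(3);
for (const msg of ["a", "b", "c", "d", "e"]) {
  logs.insertOne({ msg });
}
const remaining = logs.find().map((d) => d.msg);
console.log(remaining); // [ 'c', 'd', 'e' ]
```

With a cap of 3, the first two inserts are gone after five writes, and nothing in the calling code had to delete them.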

Naming Conventions

MongoDB has rules for database and collection names that are worth knowing before you run into cryptic errors.
Database names:
  • Cannot be empty or contain /\. "$ or the null character; on Windows, *<>:|? are also forbidden
  • Must be referenced with consistent capitalization, and MongoDB rejects a second database whose name differs only by case (MyApp and myapp cannot coexist)
  • Must be shorter than 64 characters
Collection names:
  • Cannot contain the null character or start with system. (reserved)
  • Cannot contain $ (reserved for internal use)
  • Should be lowercase with underscores or camelCase — pick one convention and stick with it
// Good naming patterns
db.user_profiles.find()      // snake_case
db.userProfiles.find()       // camelCase

// Avoid these
db["my collection"].find()   // spaces force bracket notation
db.System_Users.find()       // confusable with system prefix
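The naming rules above can be checked before a name ever reaches the server. The sketch below is one way to do that; isValidDatabaseName and isValidCollectionName are hypothetical helpers, not MongoDB APIs. It assumes the stricter Windows character set and a limit of fewer than 64 bytes for database names, which is how the MongoDB manual states the restriction.

```javascript
// Sketch of the naming rules as plain predicates (hypothetical helpers).
// Character set: the Windows restrictions, which are a superset of the Unix ones.
const FORBIDDEN_DB_CHARS = /[\/\\. "$*<>:|?\0]/;

function isValidDatabaseName(name) {
  if (typeof name !== "string" || name.length === 0) return false;
  // Assumption: the length limit is measured in UTF-8 bytes and must be below 64.
  if (new TextEncoder().encode(name).length >= 64) return false;
  return !FORBIDDEN_DB_CHARS.test(name);
}

// Collection names: non-empty, no "$", no null character, no "system." prefix.
function isValidCollectionName(name) {
  if (typeof name !== "string" || name.length === 0) return false;
  if (name.includes("$") || name.includes("\0")) return false;
  return !name.startsWith("system.");
}

console.log(isValidDatabaseName("myNewDatabase")); // true
console.log(isValidDatabaseName("my.db"));         // false
console.log(isValidCollectionName("system.users")); // false
```

These checks only cover syntax. A name like System_Users passes, so the style advice above still needs a human reviewer.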

Dropping

Dropping is permanent. There is no “undo” in MongoDB for drop operations. Always verify you are connected to the correct database and cluster before dropping anything.

Drop Database

// DANGER: this is irreversible without a backup
use myDatabase
db.dropDatabase()

Drop Collection

// Removes the collection AND all its indexes
db.myCollection.drop()
Dropping a collection also drops all indexes on that collection. If you just want to remove all documents but keep the collection and its indexes, use db.myCollection.deleteMany({}) instead. Rebuilding indexes on large collections can take minutes or hours.

Document Structure

Documents are BSON objects. Unlike SQL rows, a document can contain nested objects (embedded documents) and arrays — this lets you store related data together rather than spreading it across multiple tables.
{
  // _id: auto-generated 12-byte ObjectId (4-byte timestamp + 5-byte random value + 3-byte counter)
  "_id": ObjectId("507f1f77bcf86cd799439011"),

  // Scalar fields -- like columns in SQL
  "name": "Alice",
  "age": 25,

  // Embedded document -- like a JOIN to an "addresses" table, but stored inline
  "address": {
    "street": "123 Main St",
    "city": "Wonderland"
  },

  // Array field -- like a one-to-many relationship, but self-contained
  "hobbies": ["reading", "coding"]
}

The _id Field

Every document must have a unique _id field.
  • If you don’t provide one, MongoDB generates a unique ObjectId automatically.
  • It is the primary key for the document.
  • It is immutable (cannot be changed after insertion).
  • MongoDB automatically creates a unique index on _id — you never need to create one yourself.

Practical Patterns: Embedding vs. Referencing

One of the first design decisions in MongoDB is whether to embed related data inside a document or reference it by _id from another collection.
// Pattern 1: Embedding (denormalized) -- good when data is read together
// Analogy: stapling the receipt inside the order folder
{
  "_id": ObjectId("..."),
  "orderNumber": "ORD-001",
  "customer": { "name": "Alice", "email": "alice@example.com" },
  "items": [
    { "product": "Widget", "qty": 2, "price": 9.99 },
    { "product": "Gadget", "qty": 1, "price": 24.99 }
  ]
}

// Pattern 2: Referencing (normalized) -- good when data is shared or large
// Analogy: writing "See customer file #C-42" on the order folder
{
  "_id": ObjectId("..."),
  "orderNumber": "ORD-001",
  "customerId": ObjectId("..."),  // reference to customers collection
  "items": [
    { "productId": ObjectId("..."), "qty": 2, "price": 9.99 }
  ]
}
A good rule of thumb: embed when the data is read together and does not grow without bound. Reference when the related data is large, changes independently, or is shared across many documents. A user’s shipping address? Embed. A user’s full order history? Reference.
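The cost of referencing is the extra lookup needed to reassemble the data. The sketch below shows that step with in-memory arrays standing in for two collections; the sample data and the withCustomer helper are made up for illustration. Against a real database the same join would be a second query or a $lookup stage.

```javascript
// In-memory stand-ins for two collections; all data here is made up.
const customers = [
  { _id: "C-42", name: "Alice", email: "alice@example.com" },
];
const orders = [
  { _id: "O-1", orderNumber: "ORD-001", customerId: "C-42" },
];

// Referencing costs a second lookup: index customers by _id, then resolve.
const customersById = new Map(customers.map((c) => [c._id, c]));

function withCustomer(order) {
  // A dangling reference resolves to null instead of throwing.
  return { ...order, customer: customersById.get(order.customerId) ?? null };
}

const resolved = withCustomer(orders[0]);
console.log(resolved.customer.name); // Alice
```

With embedding, the customer object would already be inside the order and this step would disappear, at the price of keeping the copies up to date.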

Document Size Limits

MongoDB enforces a 16 MB maximum document size. This is generous for most use cases, but it means you should not embed unbounded arrays (e.g., every comment on a viral post). If an array can grow without limit, use a separate collection and reference by _id.
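A quick back-of-the-envelope calculation shows how soon an unbounded array can hit the limit. The helper below is a rough sketch: maxEmbeddedItems is a hypothetical name, and the 200-byte average comment size is an assumed figure, not a measurement.

```javascript
// 16 MB BSON document limit, expressed in bytes (16 * 1024 * 1024).
const MAX_BSON_BYTES = 16 * 1024 * 1024;

// Rough upper bound on how many embedded items fit in one document.
// Ignores the parent document's other fields and per-element BSON overhead,
// so the real ceiling is lower.
function maxEmbeddedItems(avgItemBytes) {
  return Math.floor(MAX_BSON_BYTES / avgItemBytes);
}

console.log(maxEmbeddedItems(200)); // 83886
```

Under that assumption a post tops out at roughly 84,000 embedded comments, which a viral post can exceed.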

Indexing Tip for Document Design

Your document structure directly affects what you can index efficiently. If you embed data, you can create indexes on nested fields:
// Index on a nested field -- dot notation
db.orders.createIndex({ "customer.email": 1 })

// Index on an array field -- MongoDB creates a "multikey" index
// that indexes each element in the array
db.users.createIndex({ "hobbies": 1 })
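To see why an index on an array field is called multikey, it can help to picture the entries it holds. The following toy model is only a conceptual sketch: multikeyEntries is a hypothetical helper and does not reflect how MongoDB physically stores indexes.

```javascript
// Toy model: a multikey index holds one entry per array element.
// `multikeyEntries` is a hypothetical helper, not how MongoDB stores indexes.
function multikeyEntries(docs, field) {
  const entries = [];
  for (const doc of docs) {
    const value = doc[field];
    const keys = Array.isArray(value) ? value : [value];
    for (const key of keys) {
      entries.push({ key, _id: doc._id });
    }
  }
  // Index entries are kept sorted by key.
  return entries.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const users = [
  { _id: 1, hobbies: ["reading", "coding"] },
  { _id: 2, hobbies: ["coding"] },
];
const entries = multikeyEntries(users, "hobbies");
console.log(entries.length); // 3
```

Two documents produce three entries, which is why indexing large arrays makes the index grow faster than the collection.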

Summary

  • Databases hold collections. Collections hold documents. Documents are the data records (JSON/BSON).
  • Databases and collections are created lazily — they spring into existence when the first document is inserted.
  • Every document has a unique _id with an automatic index.
  • Embed related data when it is read together and bounded in size; reference when it is shared or unbounded.
  • Document size is capped at 16 MB — design your schema to respect this limit.
  • You can index nested fields using dot notation, and indexing an array field creates a multikey index with an entry per element.

Interview Deep-Dive

Strong Answer:
  • I would evaluate each relationship independently using three criteria: access pattern (is this data read together?), cardinality (is this bounded or unbounded?), and update frequency (does it change independently?).
  • Profile data (name, bio, avatar URL, settings): Embed. This data is always read together with the user, changes infrequently, and is bounded in size. One document fetch returns everything the app needs to render a profile page. No join, no second query.
  • Posts: Reference. A user’s posts are unbounded — a power user might have 50,000 posts. Embedding them would push the document toward the 16 MB limit. Posts also have their own lifecycle: they are liked, commented on, and queried independently (trending posts, feed algorithms). Separate posts collection with a userId field and an index on { userId: 1, createdAt: -1 } for efficient “get user’s recent posts” queries.
  • Followers: Reference, but with a design nuance. A naive approach is an array of follower IDs in the user document, but celebrities can have millions of followers. Instead, use a separate follows collection with documents like { followerId, followedId, createdAt }. Index on { followedId: 1 } for “who follows this user” and { followerId: 1 } for “who does this user follow.” This edge-collection approach is a common way to model large social graphs.
  • Message inbox: Reference. Messages are unbounded, change independently (read/unread status), and are queried by both sender and receiver. A separate messages collection with compound indexes on { recipientId: 1, createdAt: -1 } and { conversationId: 1, createdAt: -1 } handles the common access patterns.
  • The one exception: I would embed the most recent 3-5 posts as a denormalized cache in the user document for the profile page “preview.” This avoids a second query for the most common use case. When a new post is created, you write to the posts collection and update the embedded cache; because these are two documents in two collections, wrap both writes in a multi-document transaction if they must stay in sync, or accept a brief window of staleness. This is the “Subset Pattern” from MongoDB’s data modeling playbook.
Follow-up: What happens when your embedded profile data needs to appear in other documents, like in the author field of a post? How do you handle updates?
  • This is the denormalization update problem, and it is the primary cost of embedding in MongoDB. If I embed { authorName: "Alice", authorAvatar: "url" } in every post document, and Alice changes her avatar, I now have 10,000 post documents with stale data.
  • There are three strategies. First, accept eventual consistency: update the user document immediately and run a background job to update the denormalized copies in posts. For avatar URLs, staleness for a few minutes is acceptable. Second, use the “Extended Reference Pattern”: only embed fields that rarely change (username) and reference fields that change often (avatar). Third, use Change Streams: set up a listener that watches the users collection and propagates changes to all documents that embed a copy of the updated user’s fields.
  • The worst mistake is embedding data that changes frequently across thousands of documents. If a field changes more than once a week and is denormalized in more than 100 documents, it should probably be a reference instead.
Strong Answer:
  • An ObjectId is a 12-byte (24 hex character) unique identifier. Its structure is intentional and encodes useful metadata. The layout is: 4 bytes for the Unix timestamp (seconds since epoch), 5 bytes for a random value unique to the machine and process, and 3 bytes for an incrementing counter initialized to a random value.
  • This design solves a hard distributed systems problem: generating globally unique IDs without a central coordinator. ObjectIds are normally generated by the driver inside each application process, so even with 50 application servers inserting concurrently, a collision is extremely unlikely: the random component differentiates processes and the counter handles same-second inserts within one process.
  • What you can extract without querying: the creation timestamp. ObjectId("507f1f77bcf86cd799439011").getTimestamp() returns the second in which the ObjectId was generated, which for driver-generated IDs is effectively when the document was created. This gives you an approximate createdAt on every document without explicitly storing one. Sorting by _id gives roughly chronological order (precise only to the second, and subject to clock differences between clients), and you can query by time range using ObjectId comparisons, which reuse the default _id index instead of requiring a separate index on a date field.
  • There is a practical trick: to find all documents created on or after January 1, 2024 (UTC), construct a minimum ObjectId whose first 8 hex characters are the Unix timestamp in hex and whose remaining 16 characters are zeros, then query { _id: { $gte: ObjectId("659200800000000000000000") } }. This uses the _id index directly, avoiding a separate date index.
  • One important change: older MongoDB versions and drivers filled the middle bytes with a machine identifier and a process ID, which could leak infrastructure information. Current versions use a 5-byte random value generated once per process, which is more secure but means you can no longer determine which server generated a particular ObjectId.
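The timestamp layout described above can be worked through by hand with plain string and number operations. The sketch below does not use the MongoDB driver; objectIdHexToDate and minObjectIdHexForDate are hypothetical helper names, and the orders collection in the closing comment is an assumption.

```javascript
// Read the creation time from the first 4 bytes (8 hex chars) of an ObjectId.
function objectIdHexToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// Build the smallest ObjectId hex string for a given instant:
// 8 hex chars of Unix seconds followed by 16 zeros.
function minObjectIdHexForDate(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  return seconds.toString(16).padStart(8, "0") + "0".repeat(16);
}

const created = objectIdHexToDate("507f1f77bcf86cd799439011");
console.log(created.toISOString()); // 2012-10-17T21:13:27.000Z

const lowerBound = minObjectIdHexForDate(new Date("2024-01-01T00:00:00Z"));
console.log(lowerBound); // 659200800000000000000000

// In mongosh, the bound could then be used as (collection name assumed):
//   db.orders.find({ _id: { $gte: ObjectId(lowerBound) } })
```

Because the comparison is on the leading timestamp bytes, the zero-filled suffix guarantees that every ObjectId generated in or after that second sorts at or above the bound.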
Follow-up: When would you use a custom _id instead of the default ObjectId? What are the risks?
  • Custom IDs make sense when you have a natural unique identifier that your application already uses. For example, a users collection where every user has a unique email address — you could use { _id: "alice@example.com" }. This eliminates the need for a separate unique index on email.
  • Another common case: importing data from another system that already has unique IDs. If you are migrating from PostgreSQL and your orders have sequential integer IDs referenced by external systems (payment processors, shipping APIs), it is pragmatic to use { _id: existingOrderId } to avoid maintaining a mapping table.
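  • The risks are real though. If _id is also a ranged shard key and the custom ID is monotonically increasing (like an auto-incrementing integer), all inserts go to the chunk holding the highest key range, so one shard takes every write. This is the “hot shard” problem. Default ObjectIds do not escape it, because they begin with a timestamp and are also roughly increasing; the usual fix in both cases is a hashed shard key or a shard key with better write distribution.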
  • Another risk: if your custom ID can change (like an email address), you have a problem because _id is immutable in MongoDB. You would need to delete and reinsert the document, which breaks any references to that _id from other collections.
  • My rule: use custom IDs when the identifier is truly immutable, naturally unique, and your primary lookup key. Otherwise, stick with ObjectId and create a secondary unique index on your business identifier.
Strong Answer:
  • A capped collection is a fixed-size collection that works like a circular buffer. Once it reaches its size limit (specified in bytes) or document count, the oldest documents are automatically overwritten by new inserts. The insertion order is preserved, and documents cannot be deleted individually or updated in a way that increases their size.
  • The production use case where capped collections genuinely shine is high-throughput logging with strict insertion-order guarantees and predictable performance. Consider an application-level log pipeline that writes 10,000 events per second. A capped collection of 5 GB gives you the most recent few hours of logs, automatically purged, with no background deletion process, no index overhead for deletion, and guaranteed insertion-order reads.
  • Why a TTL index is not a suitable alternative in this specific case: TTL indexes rely on a background thread (TTLMonitor) that runs every 60 seconds by default and deletes expired documents. Under heavy write load, the TTL monitor can fall behind, causing the collection to grow unboundedly until the monitor catches up. I have seen TTL-indexed collections balloon to 50 GB when the expected steady state was 5 GB, because the deletion rate could not keep up with the insertion rate during a traffic spike.
  • Capped collections avoid this entirely because the deletion is part of the write path itself — the old document is overwritten atomically, so the collection size is guaranteed to never exceed the cap. There is no background process that can fall behind.
  • The trade-offs of capped collections are significant though. You cannot shard them. You cannot delete individual documents (only the collection as a whole). Updates that increase document size fail. You can add secondary indexes for read queries, but each one adds work to every insert and erodes the write-throughput advantage. And if you need to keep specific documents permanently, a capped collection is fundamentally wrong because everything eventually gets overwritten.
  • Capped collections also power MongoDB’s internal replication mechanism — the oplog (local.oplog.rs) is a capped collection. This is the canonical example of where the circular buffer behavior is exactly what you want: a finite window of recent operations that secondaries replay to stay in sync.
Follow-up: Your team is debating between a capped collection and Kafka for application event logging. What are the trade-offs?
  • This is a common architectural decision. The answer depends on whether you need consumers to process events or just store them for querying.
  • Capped collection wins when: you primarily need queryable recent logs (find all errors in the last hour, search by request ID), you do not need multiple consumers processing the events, and you want zero additional infrastructure. It is just another MongoDB collection.
  • Kafka wins when: you have multiple consumers that need to process each event independently (alerting service, analytics pipeline, audit log archiver), you need event replay (a new consumer can read from the beginning of a topic), or you need ordering guarantees per partition at massive scale (millions of events per second).
  • The pragmatic middle ground: use MongoDB Change Streams as a lightweight event stream. Write to a regular collection, and Change Streams give you a real-time feed of changes that multiple consumers can subscribe to. You get queryable storage and event-driven processing without adding Kafka to your infrastructure. The limitation is that Change Streams require a replica set and have a finite oplog window — if a consumer falls too far behind, it loses its position.
Strong Answer:
  • Immediate response, in order. First, assess the blast radius: which collection was dropped, how large was it, and which services depend on it. Alert the team immediately — do not try to fix it silently.
  • Second, restore from backup. If you are on Atlas with continuous backup enabled, use point-in-time restore to recover the data as it was just before the drop; a common approach is to restore into a temporary cluster and copy the collection back. The restore window depends on your backup policy and cluster tier (often around 7 days of point-in-time history, plus daily snapshots). If you are self-hosted, restore from your most recent mongodump or filesystem snapshot. The gap between the last backup and the drop is your data loss window.
  • Third, if you have a replica set and the oplog still contains the insert operations for the lost data, you can theoretically replay the oplog to reconstruct the collection. This is a complex, manual process and should only be attempted by someone experienced with oplog replay.
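  • Prevention measures, in order of impact. First, implement role-based access control (RBAC). Production database users for your application should have only the permissions they need. Note that the built-in readWrite role includes the dropCollection action, so an application that must not be able to drop collections needs a custom role limited to find, insert, update, and remove. Never grant dbAdmin or dbOwner to application users. The only users with drop permissions should be DBAs operating through a controlled runbook.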
  • Second, require multi-factor approval for destructive operations. MongoDB Atlas supports audit logging and you can configure alerts on dropCollection and dropDatabase events. Better yet, use a database proxy or policy engine that intercepts destructive commands and requires approval from a second engineer.
  • Third, enforce a “no direct shell access to production” policy. All schema changes (including drops) go through a migration framework (like migrate-mongo) that is version-controlled, code-reviewed, and tested in staging before production.
  • Fourth, test your backup restoration process regularly. A backup you have never tested restoring is not a backup — it is a hope.
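The RBAC point above can be made concrete with a custom role. The built-in readWrite role includes the dropCollection action, so a least-privilege application user needs a role that lists its actions explicitly. The following is a minimal sketch: the role name appDataOnly, the database name shop, and the user name app are all assumptions for illustration.

```javascript
// Hypothetical role definition: data access without drop privileges.
// The role name "appDataOnly" and database name "shop" are assumptions.
const appDataOnlyRole = {
  role: "appDataOnly",
  privileges: [
    {
      // An empty collection name means every non-system collection in this database.
      resource: { db: "shop", collection: "" },
      actions: ["find", "insert", "update", "remove"],
    },
  ],
  roles: [], // inherit nothing, so no built-in role can add dropCollection
};

// In mongosh, connected as a user administrator:
//   use shop
//   db.createRole(appDataOnlyRole)
//   db.createUser({ user: "app", pwd: passwordPrompt(), roles: ["appDataOnly"] })

const grantedActions = appDataOnlyRole.privileges.flatMap((p) => p.actions);
console.log(grantedActions.includes("dropCollection")); // false
```

A user with only this role can still empty a collection with deleteMany({}), so the role limits drops rather than all data loss; backups remain necessary.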
Follow-up: The dropped collection had 500 million documents. Your most recent backup is 6 hours old. How do you estimate the data loss and communicate it to stakeholders?
  • The data loss is every write to that collection in the 6-hour window between the backup and the drop. To estimate: check your application’s write throughput metrics. If the collection receives an average of 1,000 inserts per minute, the loss is approximately 360,000 documents. If write throughput varies by time of day, use the actual metrics from your monitoring system (Datadog, Grafana, Atlas monitoring) for that specific 6-hour window.
  • Communication to stakeholders should be factual and structured: what happened, what data was lost, what data was recovered, what is the customer impact, and what is the remediation timeline. Avoid vague language. “We lost approximately 360,000 order records from the last 6 hours” is better than “we lost some data.”
  • If the lost data was generated by user actions (orders, messages), check if the application has other data sources that captured the same information — API gateway logs, payment processor records, message queue dead letter queues. You may be able to reconstruct a significant portion of the lost data from these secondary sources.
  • The post-mortem should focus on reducing the Recovery Point Objective (RPO). If 6 hours of data loss is unacceptable, you need more frequent backups (Atlas supports continuous backup with 1-second granularity on M10+ tiers) or a real-time replication strategy to a secondary data store.
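The data loss estimate above is simple arithmetic, sketched below with hypothetical helper names. The constant-rate version reproduces the 360,000 figure; the hourly version is for when monitoring shows the rate varied across the window. The sample hourly rates are made up.

```javascript
// Estimate writes lost between the last backup and the drop.
// Assumes a constant write rate; real traffic usually varies by time of day.
function estimateLostWrites(writesPerMinute, hoursSinceBackup) {
  return writesPerMinute * hoursSinceBackup * 60;
}

// Variant for uneven traffic: one average writes-per-minute figure per hour.
function estimateFromHourlyRates(writesPerMinuteByHour) {
  return writesPerMinuteByHour.reduce((sum, rate) => sum + rate * 60, 0);
}

console.log(estimateLostWrites(1000, 6)); // 360000

// Made-up hourly averages for a 6-hour window with an evening peak.
console.log(estimateFromHourlyRates([400, 600, 1000, 1500, 1500, 1000])); // 360000
```

Both calls give the same total here only because the sample rates average out to 1,000 per minute; with real metrics the two numbers will usually differ, and the hourly figure is the one to report.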