CRUD: Create

CRUD stands for Create, Read, Update, Delete—the four fundamental operations for working with data in any database. In this chapter, we focus on Create: adding new documents to your MongoDB collections.

Understanding Document Insertion

Unlike relational databases where you must define a table schema before inserting data, MongoDB is schema-flexible. You can insert documents with different structures into the same collection. While this offers flexibility, it also means you should think carefully about your data model.

What Happens During Insert

When you insert a document:
  1. An _id field is generated automatically if you did not provide one (the driver does this before the document is sent)
  2. If the collection doesn’t exist, MongoDB creates it automatically
  3. MongoDB validates the document (if validation rules exist)
  4. The document is written to the collection

The _id Field

Every document in MongoDB must have a unique _id field that acts as the primary key:
  • If you don’t provide one, MongoDB generates an ObjectId
  • ObjectIds are 12-byte unique identifiers (4-byte timestamp + 5-byte per-process random value + 3-byte counter)
  • You can provide your own _id (string, number, etc.), but it must be unique
Letting MongoDB generate _id values is usually the best choice—ObjectIds are designed to be unique across distributed systems without coordination.

Insert Methods

Create operations add new documents to a collection. If the collection does not currently exist, insert operations will create the collection.

insertOne()

Inserts a single document into a collection.
// Insert a single user document
// MongoDB will auto-generate an _id (ObjectId) since we did not provide one
db.users.insertOne({
  name: "Alice",
  age: 25,
  status: "active"
})
Output:
{
  "acknowledged": true,          // The server confirmed the write
  "insertedId": ObjectId("60d5ec...")  // The auto-generated _id
}
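Indexing tip: If you plan to query users by status frequently, create an index: db.users.createIndex({ status: 1 }). Creating an index on an empty collection is nearly instant, whereas building one over millions of existing documents takes noticeably longer. The trade-off is that every insert must then also update the index, so for a very large one-off load it can be faster to insert first and build secondary indexes afterwards.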

insertMany()

Inserts multiple documents into a collection. Pass an array of documents.
// Insert multiple documents at once -- much faster than calling insertOne() in a loop
// MongoDB sends all documents to the server in a single network round-trip
db.users.insertMany([
  { name: "Bob", age: 30, status: "inactive" },
  { name: "Charlie", age: 35, status: "active" }
])
Output:
{
  "acknowledged": true,
  "insertedIds": {
    "0": ObjectId("..."),   // Bob's auto-generated _id
    "1": ObjectId("...")    // Charlie's auto-generated _id
  }
}
Think of insertMany like mailing a batch of letters in one trip to the post office rather than driving there separately for each letter. The network round-trip is often the bottleneck, so batching dramatically improves throughput — often 10-50x faster for bulk loads.

Ordered vs Unordered Inserts

By default, insertMany is ordered. If an error occurs during the insertion of one of the documents (e.g., duplicate key error), MongoDB stops inserting the remaining documents. To allow the remaining documents to be inserted even if one fails, set ordered: false.
db.users.insertMany([
  { _id: 1, name: "A" },
  { _id: 1, name: "B" }, // Duplicate _id -- this document fails
  { _id: 2, name: "C" }  // With ordered: false, this still gets inserted
], { ordered: false })
In this case, “A” and “C” will be inserted, and “B” will fail. With the default ordered: true, MongoDB would stop at “B” and “C” would never be attempted.
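The difference can be seen in a toy model. The sketch below is plain JavaScript that only mimics which documents survive a duplicate _id; simulateInsertMany is a made-up name and this is not how MongoDB implements inserts.

```javascript
// Toy model in plain JavaScript, NOT the MongoDB implementation.
function simulateInsertMany(docs, { ordered = true } = {}) {
  const seenIds = new Set();
  const inserted = [];
  const failed = [];
  for (const doc of docs) {
    if (seenIds.has(doc._id)) {
      failed.push(doc.name);
      if (ordered) break; // ordered: stop at the first error
      continue;           // unordered: skip the failure and keep going
    }
    seenIds.add(doc._id);
    inserted.push(doc.name);
  }
  return { inserted, failed };
}

const sample = [
  { _id: 1, name: "A" },
  { _id: 1, name: "B" },
  { _id: 2, name: "C" },
];
console.log(simulateInsertMany(sample));                     // inserted: A; failed: B
console.log(simulateInsertMany(sample, { ordered: false })); // inserted: A, C; failed: B
```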

Practical Pattern: Bulk Importing with Error Tolerance

When importing data from an external source, you often expect some duplicates. Use ordered: false combined with a try/catch to handle this gracefully:
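// Import products from a CSV -- some may already exist
// csvProducts is assumed to be an array of documents parsed earlier
try {
  const result = db.products.insertMany(csvProducts, {
    ordered: false  // Continue past duplicate key errors
  })
  // insertedIds is an object keyed by array index, not an array
  print(`Inserted ${Object.keys(result.insertedIds).length} new products`)
} catch (e) {
  // Only bulk write errors carry writeErrors; rethrow anything else
  if (!e.writeErrors) throw e
  // Code 11000 is a duplicate key; other codes are genuine failures
  const duplicates = e.writeErrors.filter(err => err.code === 11000)
  // Older shells report this count as e.result.nInserted
  print(`Inserted ${e.result.insertedCount} products, ${duplicates.length} duplicates skipped`)
}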

Summary

  • Use insertOne() for single documents.
  • Use insertMany() for multiple documents.
  • MongoDB automatically adds an _id if you don’t provide one.
  • insertMany is ordered by default; use { ordered: false } to continue on error.

Interview Deep-Dive

Strong Answer:
  • The naive approach, calling insertMany with 10 million documents in a single array, is a poor fit. The call is not rejected outright, because the driver splits the array into batches that respect the server's limits, but you must first hold all 10 million documents in client memory, and you get little control over progress tracking, error handling, or resuming after a failure. The correct approach involves batching, configuration tuning, and understanding the write path.
  • First, batch your inserts. Use insertMany with batches of 1,000 to 10,000 documents. The optimal batch size depends on your average document size. For small documents (under 1 KB), batches of 5,000-10,000 work well. For larger documents, reduce the batch size to stay under the 48 MB wire protocol message limit. The MongoDB driver actually auto-splits batches that exceed this limit, but explicit batching gives you better error handling and progress tracking.
  • Second, use { ordered: false } on every batch. Ordered inserts are applied one at a time and stop at the first error. Unordered inserts continue past errors, and the server is free to apply them in any order and may execute them in parallel, although parallel execution is not guaranteed. For a bulk load, unordered inserts can be 2-3x faster depending on the deployment, so measure the difference on your own cluster.
  • Third, consider your write concern. The default in MongoDB 5.0 and later, w: "majority", means each batch waits for acknowledgment from a majority of replica set members. For a bulk load where you can re-run the import if something goes wrong, dropping to w: 1 (primary acknowledgment only) significantly reduces latency per batch. Some teams even use w: 0 (fire-and-forget) for initial loads, though I would not recommend this because you lose all error feedback.
  • Fourth, temporarily drop secondary indexes. If the collection has 5 indexes, every insert must update all 5 index B-trees. For a bulk load, it is faster to drop all indexes except _id, insert all data, then rebuild indexes with createIndex. Index builds on 10 million documents take minutes; maintaining indexes during 10 million inserts can take hours.
  • Fifth, parallelize across multiple client connections. A single connection to MongoDB is limited by the round-trip latency of each batch. Running 4-8 parallel workers, each inserting batches, saturates the server’s write throughput more effectively. Use a thread pool or worker processes in your import script.
  • At MongoDB scale, I have seen teams achieve 100,000+ inserts per second using this approach on a modest replica set with SSDs.
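The batching and unordered-insert steps above can be sketched as follows. This is a minimal illustration, not a tuned loader: insertInBatches is a hypothetical helper, collection is assumed to be a Node.js driver Collection (any object with an async insertMany(docs, options) method works), and the count after a failed batch assumes the only problems were per-document write errors.

```javascript
// Sketch only: `collection` is assumed to be a Node.js driver Collection.
async function insertInBatches(collection, docs, batchSize = 5000) {
  let inserted = 0;
  const failures = [];
  for (let start = 0; start < docs.length; start += batchSize) {
    const batch = docs.slice(start, start + batchSize);
    try {
      const result = await collection.insertMany(batch, { ordered: false });
      inserted += result.insertedCount;
    } catch (err) {
      // Bulk write errors expose writeErrors; anything else is rethrown.
      if (err.writeErrors === undefined) throw err;
      const writeErrors = [].concat(err.writeErrors); // one error or an array
      // An unordered batch attempts every document, so the rest succeeded.
      inserted += batch.length - writeErrors.length;
      for (const writeError of writeErrors) {
        // Translate the index within the batch to an index within `docs`.
        failures.push({ index: start + writeError.index, code: writeError.code });
      }
    }
  }
  return { inserted, failures };
}
```

Running several of these loops concurrently over disjoint slices of the data is one way to get the parallel workers described above; the failures array gives each worker a record of which source documents to inspect afterwards.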
Follow-up: You used ordered: false and the import completes, but your monitoring shows that 50,000 out of 10 million documents failed with duplicate key errors. How do you handle this?
  • With { ordered: false }, MongoDB continues past errors and reports them all once the batch finishes. Most drivers surface this as a bulk write exception whose writeErrors array holds the index and error for each failed document. My first step is to log every error, not just the count.
  • For duplicate key errors specifically, I need to decide what the correct behavior is. There are three possibilities. First, the duplicates are exact duplicates from a retry (e.g., the import script crashed and was restarted) — in this case, the errors are expected and can be safely ignored because the data is already in the collection. Second, the duplicates have different data for the same _id — this means an upsert is needed, not an insert. I would re-process the failed documents using bulkWrite with updateOne + upsert: true operations. Third, the duplicates indicate a bug in the data generation — two different records were assigned the same ID. This requires investigation before proceeding.
  • A more robust import design uses bulkWrite from the start instead of insertMany. bulkWrite lets you mix insert, update, and upsert operations in a single batch, so you can handle the “insert if new, update if exists” pattern without a second pass.
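As a rough sketch of the "insert if new, update if exists" pattern, the helper below converts documents into updateOne operations with upsert: true. toUpsertOps is a hypothetical name, and the sketch assumes every document has an _id plus at least one other field.

```javascript
// Sketch: turn documents into updateOne + upsert operations for bulkWrite.
function toUpsertOps(docs) {
  return docs.map(({ _id, ...fields }) => ({
    updateOne: {
      filter: { _id },          // match on the primary key
      update: { $set: fields }, // _id is left out of $set because it is immutable
      upsert: true,             // insert when no document matches
    },
  }));
}

const upsertOps = toUpsertOps([{ _id: 1, name: "Widget", price: 10 }]);
console.log(JSON.stringify(upsertOps));
// With a driver collection (assumed): await collection.bulkWrite(upsertOps, { ordered: false });
```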
Strong Answer:
  • insertMany is a convenience method that does exactly one thing: insert multiple documents. bulkWrite is the Swiss army knife — it accepts an array of mixed operations: insertOne, updateOne, updateMany, replaceOne, deleteOne, and deleteMany. Under the hood, insertMany is actually implemented as a bulkWrite with all insert operations.
  • Choose insertMany when you are purely inserting new documents and want clean, readable code. The API is simpler: pass an array of documents, get back the inserted IDs keyed by each document's position in the array. Error handling is straightforward.
  • Choose bulkWrite when your batch operation involves mixed operations. The classic example is a sync job that processes a list of records from an external system: some records are new (insert), some have been updated (update), and some have been deleted (delete). Without bulkWrite, you would make three separate calls. With bulkWrite, it is one network round trip.
  • bulkWrite also gives you explicit control over operation ordering and write concern at the batch level. You can set { ordered: true } to ensure operations execute in sequence (important when later operations depend on earlier ones) or { ordered: false } for maximum throughput.
  • Performance-wise, bulkWrite with 1,000 operations is dramatically faster than 1,000 individual updateOne calls because it batches operations into a single wire protocol message. I have seen sync jobs go from 45 minutes to 90 seconds just by switching from individual operations to bulkWrite.
  • One gotcha: bulkWrite has a limit of 100,000 operations per call (the driver auto-splits beyond this, but the behavior varies by driver version). For very large batch operations, explicitly chunk your operations array.
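The sync-job example might look like the sketch below. The shape of changes (a type plus a record) is invented for illustration; only the operation documents (insertOne, updateOne, deleteOne) follow the bulkWrite format.

```javascript
// Sketch: `changes` is a hypothetical change feed from an external system.
function buildSyncOps(changes) {
  return changes.map((change) => {
    switch (change.type) {
      case "created":
        return { insertOne: { document: change.record } };
      case "updated": {
        const { _id, ...fields } = change.record;
        return { updateOne: { filter: { _id }, update: { $set: fields } } };
      }
      case "deleted":
        return { deleteOne: { filter: { _id: change.record._id } } };
      default:
        throw new Error(`Unknown change type: ${change.type}`);
    }
  });
}

const syncOps = buildSyncOps([
  { type: "created", record: { _id: 1, name: "New" } },
  { type: "updated", record: { _id: 2, name: "Renamed" } },
  { type: "deleted", record: { _id: 3 } },
]);
console.log(syncOps.length);
// With a driver collection (assumed): await collection.bulkWrite(syncOps);
```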
Follow-up: Your bulkWrite includes 5,000 operations with ordered: true, and operation number 3,000 fails. What happens to operations 3,001 through 5,000?
  • With { ordered: true }, MongoDB stops processing at the first error. Operations 1 through 2,999 stay committed (outside a transaction there is no rollback), operation 3,000 fails, and operations 3,001 through 5,000 are never executed. The driver raises a bulk write error whose writeErrors array holds the error for operation 3,000 and whose counts (insertedCount/modifiedCount in current drivers, nInserted/nModified in older ones) reflect only the successful operations.
  • This behavior is by design for cases where operation ordering matters — for example, if operation 3,001 updates a document that operation 3,000 was supposed to insert. Without the earlier insert, the update would silently do nothing.
  • The recovery strategy: log the error, fix the root cause (e.g., a duplicate key, a validation error), and re-run the remaining operations starting from 3,000. Your import script should track which operations have been processed so you can resume from the failure point rather than restarting from the beginning.
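One way to resume from the failure point is sketched below. remainingOps is a hypothetical helper, and it assumes the thrown error carries writeErrors whose index refers to the position in the original operations array, which is how the Node.js driver reports it.

```javascript
// Sketch: after an ordered bulkWrite fails, work out what still needs to run.
function remainingOps(ops, bulkError) {
  const [firstError] = [].concat(bulkError.writeErrors ?? []);
  // No per-operation error means this was some other failure; let it propagate.
  if (firstError === undefined) throw bulkError;
  // The failed operation and everything after it were not applied.
  return ops.slice(firstError.index);
}
```

After fixing the root cause, the returned slice can be passed to a new bulkWrite call; the zero-based index is also worth logging so a restarted script knows where to begin.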
Strong Answer:
  • This is a beautiful example of distributed systems design. ObjectId uniqueness does not rely on the database at all — the IDs are generated client-side by the MongoDB driver, and the algorithm is designed to make collisions statistically impossible without any coordination between clients.
  • The ObjectId structure is 12 bytes: 4 bytes for the Unix timestamp (second-level granularity), 5 bytes for a random value (generated once per process at startup, unique to the machine and process), and 3 bytes for an incrementing counter (initialized to a random starting value at process startup).
  • The uniqueness guarantee works across dimensions. Two different machines have different random values in bytes 5-9, so even if they insert at the exact same second, their ObjectIds differ. Two processes on the same machine have different random values because each generates its own at startup. Two inserts in the same process at the same second differ because the 3-byte counter increments.
  • The counter can handle up to 16,777,216 (2^24) inserts per second per process before wrapping. At that point, collisions become theoretically possible, but you would need a single process inserting 16 million documents per second — which is far beyond what a single connection can handle.
  • The trade-off compared to UUID v4 (16 bytes, 122 of its 128 bits random) is that ObjectId is smaller (12 bytes vs 16) and encodes a timestamp, but carries less randomness and so has a higher theoretical collision probability. In practice, neither collides: two ObjectIds can only clash if two processes draw the same 5-byte random value (a 1 in 2^40 chance for any given pair) and also emit the same counter value within the same second.
  • An important implementation detail: because the driver generates ObjectIds client-side, the _id is available in your application code immediately after constructing the document, before the insert even reaches the database. This is useful for logging or building references to the document before the write is confirmed.
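The 4 + 5 + 3 byte layout can be checked by slicing the hex form of an ObjectId. The sketch below is plain JavaScript with no driver involved; parseObjectId is a made-up helper and the hex string is just an example value.

```javascript
// Split a 24-character hex ObjectId string into its three parts.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("Expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16); // 4-byte Unix timestamp
  return {
    seconds,
    createdAt: new Date(seconds * 1000),
    randomHex: hex.slice(8, 18).toLowerCase(),   // 5-byte per-process random value
    counter: parseInt(hex.slice(18, 24), 16),    // 3-byte incrementing counter
  };
}

const parts = parseObjectId("507f1f77bcf86cd799439011"); // example value only
console.log(parts.createdAt.toISOString(), parts.randomHex, parts.counter);
```

Because the timestamp occupies the leading bytes, ObjectIds generated later generally sort after earlier ones, which is why sorting by _id approximates sorting by creation time.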
Follow-up: You mentioned the _id is generated client-side. What happens if two clients somehow generate the same ObjectId and both try to insert?
  • The _id field has a mandatory unique index. If a duplicate ObjectId is generated (which, as discussed, is astronomically unlikely), the second insert fails with a duplicate key error: E11000 duplicate key error collection.
  • The unique index on _id is the safety net. The client-side algorithm makes collisions nearly impossible, and the server-side unique index makes them actually impossible by rejecting duplicates at write time. This is defense in depth.
  • If you are in a scenario where you somehow see duplicate key errors on _id (I have never seen this with ObjectId), it usually means your application is explicitly setting _id values rather than letting the driver generate them, and your ID generation logic has a bug.
Strong Answer:
  • This is one of the most treacherous edge cases in distributed systems. The answer is: you do not know. The insert may have succeeded (the server processed it but the acknowledgment was lost on the network), or it may have failed (the connection dropped before the server received the full message). You are in an ambiguous state.
  • MongoDB handles this with retryable writes, introduced in MongoDB 3.6 and enabled by default in drivers compatible with MongoDB 4.2 and later. When a write fails due to a network error, the driver automatically retries the operation once. The key innovation is that each retryable write carries a logical session ID and a transaction number (lsid + txnNumber). If the original write actually succeeded, the server recognizes the retry as a duplicate and returns the original result without writing again. This is idempotency built into the protocol.
  • However, retryable writes only cover certain operations: insertOne, insertMany, updateOne, replaceOne, deleteOne, the findOneAnd* family, and bulkWrite calls made up solely of single-document operations. They do not cover updateMany or deleteMany, or a bulkWrite that contains them. For these multi-document operations, a lost connection can leave some documents modified and others not, and you do not know which.
  • The application-level strategy for handling this: design your operations to be idempotent. If your insert is part of a business process (like creating an order), include a unique idempotency key in the document (like orderId) with a unique index. If the retry inserts a duplicate, it fails with a duplicate key error, which you catch and treat as success — the original insert must have gone through.
  • For critical operations, use MongoDB transactions with writeConcern: { w: "majority", j: true }. The j: true ensures the write is committed to the journal on disk before acknowledgment. This does not eliminate the ambiguous state from network failures, but it ensures that if the write was acknowledged, it survives a server crash.
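The idempotency-key strategy can be sketched as follows. createOrderOnce is a hypothetical helper; orders is assumed to be a driver Collection that already has a unique index on orderId, and treating a duplicate as success is only safe when a given orderId always describes the same order.

```javascript
const DUPLICATE_KEY = 11000; // server error code for a unique index violation

// Sketch: `orders` is assumed to have a unique index on orderId, created once
// with something like createIndex({ orderId: 1 }, { unique: true }).
async function createOrderOnce(orders, order) {
  try {
    await orders.insertOne(order);
    return "inserted";
  } catch (err) {
    // A duplicate means an earlier attempt already wrote this order.
    if (err.code === DUPLICATE_KEY) return "already-exists";
    throw err; // anything else is a genuine failure
  }
}
```

Calling this twice with the same orderId is harmless: the second call reports "already-exists" instead of creating a second order, so the application can retry after an ambiguous network failure.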
Follow-up: How do retryable writes interact with MongoDB transactions? Can a multi-document transaction be safely retried?
  • Yes, but with important nuances. MongoDB transactions are retryable at the commit level. If commitTransaction fails with a transient error (like a network blip or a primary election), the driver can retry the commit. The transaction itself has a unique identifier, so the server recognizes a retried commit and does not apply the changes twice.
  • However, the entire transaction has a maximum lifetime (60 seconds by default). If the retry happens after the transaction has expired, it fails permanently. A write conflict with another transaction is labeled as transient, but recovering from it means re-running the whole transaction, not just the commit. Errors that are not transient (such as a duplicate key or a validation failure) will fail the same way on every retry.
  • The recommended pattern from MongoDB’s documentation is the “Transaction with Retry” callback API, where you wrap the entire transaction in a withTransaction callback. This handles both transient transaction errors (by retrying the entire transaction) and unknown commit results (by retrying just the commit). The driver manages the complexity for you, but understanding what it does under the hood matters for debugging production issues.
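A minimal sketch of the callback pattern with the Node.js driver is shown below. placeOrder, the shop database, and the orders and audit collections are made up for illustration, and client is assumed to be a connected MongoClient on a deployment that supports transactions.

```javascript
// Sketch: `client` is assumed to be a connected MongoClient; all names are illustrative.
async function placeOrder(client, order) {
  const db = client.db("shop");
  const session = client.startSession();
  try {
    // The driver may invoke this callback more than once, so it must be safe to repeat.
    await session.withTransaction(async () => {
      await db.collection("orders").insertOne(order, { session });
      await db.collection("audit").insertOne(
        { orderId: order.orderId, event: "created" },
        { session }
      );
    });
  } finally {
    await session.endSession(); // always release the session
  }
}
```

Passing { session } to every operation is what ties it to the transaction; an operation issued without it runs outside the transaction and is not rolled back with it.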