Chapter 4: Commits & History
A commit is a snapshot of your project at a point in time. In this chapter, we’ll implement the commit and log commands, learning how Git builds and traverses the commit graph.
A useful analogy: think of each commit as a photograph of every file in your project. The photo is labeled with the date, the photographer’s name (author), and a sticky note (commit message). The parent pointer is like writing “this is the photo that came right before me” on the back. By following those back-references, you can reconstruct the entire timeline of your project — that chain of pointers is the commit history.
Prerequisites: Completed Chapter 3: Staging & Index
Time: 2-3 hours
Outcome: Working commit and log commands
How Commits Work
The Commit Graph (DAG)
Commits form a Directed Acyclic Graph (DAG).
Why DAG, not just a linked list?
Merge commits have multiple parents, allowing branches to be combined while preserving all history. A linked list can only represent one straight line of work. A DAG captures the reality of software development: multiple people working in parallel, their work eventually converging. The “acyclic” constraint means you can never create a loop — commit A cannot be both an ancestor and a descendant of commit B — which guarantees that history always flows in one direction.
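To make the parent-pointer idea concrete, here is a minimal sketch of a commit graph held in memory. The commit ids (`a`, `b`, `c`, `m`) and the helper names `ancestors` and `isAcyclic` are made up for illustration; real Git stores parents inside commit objects rather than in a JavaScript object.

```javascript
// Hypothetical in-memory commit graph: each commit lists its parents' ids.
const commits = {
  a: { parents: [] },          // root commit
  b: { parents: ['a'] },
  c: { parents: ['a'] },       // parallel work that also starts from a
  m: { parents: ['b', 'c'] },  // merge commit with two parents
};

// Collect every commit reachable from `start` by following parent pointers.
function ancestors(graph, start) {
  const seen = new Set();
  const stack = [...graph[start].parents];
  while (stack.length > 0) {
    const id = stack.pop();
    if (seen.has(id)) continue; // shared history is visited only once
    seen.add(id);
    stack.push(...graph[id].parents);
  }
  return seen;
}

// "Acyclic" means no commit can reach itself through its parents.
function isAcyclic(graph) {
  return Object.keys(graph).every((id) => !ancestors(graph, id).has(id));
}

console.log([...ancestors(commits, 'm')].sort()); // [ 'a', 'b', 'c' ]
console.log(isAcyclic(commits));                  // true
```

A linked list could not express `m`, because each node would have room for only one parent.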
Implementation
Step 1: Build Tree from Index
Before creating a commit, we need to convert the flat index into a tree structure:
src/utils/tree.js
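As a rough sketch of this conversion (not necessarily how your `tree.js` is organized), the code below groups flat index entries by directory and writes tree objects bottom-up. The entry shape `{ path, mode, hash }`, the in-memory `Map` standing in for `.git/objects`, and the function names are assumptions made for this example.

```javascript
const crypto = require('crypto');

// Hash an object as Git does (header + body) and keep it in an in-memory Map.
function writeObject(store, type, body) {
  const data = Buffer.concat([Buffer.from(`${type} ${body.length}\0`), body]);
  const hash = crypto.createHash('sha1').update(data).digest('hex');
  store.set(hash, data);
  return hash;
}

// Group flat entries such as "src/utils/tree.js" into nested Maps.
// A Map value is a directory; a plain object is a file entry.
function nestEntries(entries) {
  const root = new Map();
  for (const { path, mode, hash } of entries) {
    const parts = path.split('/');
    const fileName = parts.pop();
    let dir = root;
    for (const name of parts) {
      if (!dir.has(name)) dir.set(name, new Map());
      dir = dir.get(name);
    }
    dir.set(fileName, { mode, hash });
  }
  return root;
}

// Write subtrees first so each parent can reference its children's hashes.
function writeTree(store, dir) {
  const rows = [];
  for (const [name, child] of dir) {
    const isDir = child instanceof Map;
    rows.push({
      name,
      mode: isDir ? '40000' : child.mode,
      hash: isDir ? writeTree(store, child) : child.hash,
      sortKey: isDir ? `${name}/` : name, // Git sorts directories as if named "name/"
    });
  }
  rows.sort((x, y) => Buffer.compare(Buffer.from(x.sortKey), Buffer.from(y.sortKey)));
  const body = Buffer.concat(rows.map((row) => Buffer.concat([
    Buffer.from(`${row.mode} ${row.name}\0`),
    Buffer.from(row.hash, 'hex'), // 20 raw bytes, not 40 hex characters
  ])));
  return writeObject(store, 'tree', body);
}
```

Because entries are sorted before hashing, the same set of files always produces the same root tree hash, regardless of the order in which they were staged.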
Step 2: Implement the Commit Command
src/commands/commit.js
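The heart of a commit command is serializing the commit object. The sketch below shows one way to do that, assuming the root tree hash and parent hashes are already known; `formatCommit` and `hashCommit` are illustrative names, and reading the index, writing to disk, and updating the branch ref are left out.

```javascript
const crypto = require('crypto');

// Serialize a commit body in Git's text format.
function formatCommit({ tree, parents, author, timestamp, timezone, message }) {
  const signature = `${author.name} <${author.email}> ${timestamp} ${timezone}`;
  const lines = [`tree ${tree}`];
  for (const parent of parents) {
    lines.push(`parent ${parent}`); // zero lines for the first commit, two or more for a merge
  }
  lines.push(`author ${signature}`, `committer ${signature}`, '', message);
  return `${lines.join('\n')}\n`;
}

// Hash the commit like any other Git object: "commit <size>\0" + body.
function hashCommit(body) {
  const content = Buffer.from(body);
  const header = Buffer.from(`commit ${content.length}\0`);
  return crypto.createHash('sha1').update(Buffer.concat([header, content])).digest('hex');
}

const exampleBody = formatCommit({
  tree: '4b825dc642cb6eb9a060e54bf8d69288fbee4904', // the well-known empty tree
  parents: [],
  author: { name: 'Ada', email: 'ada@example.com' }, // placeholder identity
  timestamp: 1700000000,
  timezone: '+0000',
  message: 'Initial commit',
});
console.log(exampleBody);
console.log(hashCommit(exampleBody));
```

Since the timestamp is part of the hashed content, two otherwise identical commits made at different seconds get different hashes.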
Step 3: Implement the Log Command
src/commands/log.js
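A log command boils down to parsing a commit and following its parent. Here is a minimal sketch under a few assumptions: `readCommitBody` is a hypothetical callback that returns the text of a commit object (without the `commit <size>\0` header), only the first parent is followed, and multi-line headers such as signatures are not handled.

```javascript
// Split a commit body into headers and message; a blank line separates them.
function parseCommit(body) {
  const split = body.indexOf('\n\n');
  const commit = { parents: [], message: body.slice(split + 2).trim() };
  for (const line of body.slice(0, split).split('\n')) {
    const space = line.indexOf(' ');
    const key = line.slice(0, space);
    const value = line.slice(space + 1);
    if (key === 'parent') commit.parents.push(value);
    else commit[key] = value; // tree, author, committer
  }
  return commit;
}

// Walk first parents from `head` and format one entry per commit.
function log(readCommitBody, head) {
  const entries = [];
  let hash = head;
  while (hash) {
    const commit = parseCommit(readCommitBody(hash));
    // `author` still includes the raw timestamp and timezone in this sketch.
    entries.push(`commit ${hash}\nAuthor: ${commit.author}\n\n    ${commit.message}`);
    hash = commit.parents[0]; // undefined at the root commit, which ends the loop
  }
  return entries;
}

// Demo with made-up ids instead of real 40-character hashes.
const demoBodies = new Map([
  ['c2', 'tree t2\nparent c1\nauthor Ada <ada@example.com> 2 +0000\ncommitter Ada <ada@example.com> 2 +0000\n\nSecond commit\n'],
  ['c1', 'tree t1\nauthor Ada <ada@example.com> 1 +0000\ncommitter Ada <ada@example.com> 1 +0000\n\nFirst commit\n'],
]);
console.log(log((hash) => demoBodies.get(hash), 'c2').join('\n\n'));
```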
Step 4: Update CLI Entry Point
src/mygit.js
Testing Your Implementation
How Git Optimizes
Deduplication
If you have the same file in multiple commits, Git stores it only once. The tree just points to the same blob hash.
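The effect is easy to see in a small sketch. Here a `Map` stands in for `.git/objects`, and `writeBlob` is an illustrative helper rather than part of any real API.

```javascript
const crypto = require('crypto');

const objects = new Map(); // stand-in for .git/objects, keyed by hash

function writeBlob(content) {
  const body = Buffer.from(content);
  const data = Buffer.concat([Buffer.from(`blob ${body.length}\0`), body]);
  const hash = crypto.createHash('sha1').update(data).digest('hex');
  if (!objects.has(hash)) objects.set(hash, data); // identical content is stored once
  return hash;
}

const first = writeBlob('hello\n');  // the file as of commit 1
const second = writeBlob('hello\n'); // the same file, unchanged in commit 2
console.log(first === second, objects.size); // true 1
```

Both commits' trees would record the same blob hash, so the second commit adds no new blob.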
Pack Files
Git periodically packs objects into .pack files with delta compression. Similar files are stored as deltas from a base object. We won’t implement this, but it’s why Git is so space-efficient.
Commit Caching
Real Git caches parsed commits in memory. For our implementation, we re-read each commit, which is fine for learning.
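If you want to experiment with caching in our implementation, a small wrapper is enough. This is only a sketch of the idea, not how real Git does it; `withCache` and the `readCommit` function it wraps are hypothetical names.

```javascript
// Wrap any readCommit(hash) function with an in-memory cache.
// This is safe because objects are content-addressed: a hash never maps to new content.
function withCache(readCommit) {
  const cache = new Map();
  return (hash) => {
    if (!cache.has(hash)) cache.set(hash, readCommit(hash));
    return cache.get(hash);
  };
}

// Demo: count how often the underlying (slow) reader is actually called.
let reads = 0;
const cachedRead = withCache((hash) => {
  reads += 1;
  return { hash, parents: [] };
});
cachedRead('abc');
cachedRead('abc');
console.log(reads); // 1
```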
Exercises
Exercise 1: Add --graph flag to log
Show branch structure visually:
Exercise 2: Implement diff between commits
Show what changed between two commits:
Exercise 3: Handle merge commits
Extend log to handle commits with multiple parents:
Key Takeaways
Commits are Snapshots
Each commit points to a complete tree (directory snapshot), not diffs
History is a DAG
Commits form a directed acyclic graph via parent pointers
Branches are Pointers
A branch is just a file containing a commit hash
Trees are Nested
Tree objects contain blobs and other trees, creating directory structure
Further Reading
DSA: Graph Algorithms
Understand graph traversal for commit history
DSA: Trees
Learn tree traversal for directory structures
What’s Next?
In Chapter 5: Branches & Checkout, we’ll implement:
- The branch command
- The checkout command
- Switching between branches
- Detached HEAD state
Next: Branches & Checkout
Learn how Git manages and switches between branches
Interview Deep-Dive
Git stores commits as snapshots, not diffs. Why is this design choice better for a version control system?
Strong Answer:
- Snapshot-based storage makes the cost of checkout independent of history length. To materialize any commit, Git reads the root tree and recursively reads the trees and blobs it references. There is no need to replay a chain of diffs from the beginning of history, so for a project of similar size, checking out commit 1 costs about the same as checking out commit 10,000.
- Diff-based systems (like older VCS tools) store the initial version and then a chain of deltas. To reconstruct version N, you must apply N-1 diffs sequentially. This gets slower as history grows and makes certain operations (like blame or bisect) expensive because they require reconstructing arbitrary versions.
- Snapshots also make branching and merging simpler. A merge is a comparison of two tree snapshots against a common ancestor. With diffs, you would need to reconcile overlapping delta chains, which is more error-prone.
- The apparent waste of storing full snapshots is mitigated by Git’s deduplication. Unchanged files share the same blob hash across commits, unchanged directories share tree hashes, and pack files apply delta compression after the fact. So Git gets the performance of snapshots with the storage efficiency of deltas — the best of both worlds.
git log --graph to visualize parallel development, git merge-base to find common ancestors, and three-way merge to reconcile divergent changes.
Walk me through exactly what happens when you run git commit -m 'Fix bug'. What objects are created and what files are updated?
Strong Answer:
- First, Git reads the index (.git/index) to get the list of staged files with their blob hashes. It converts this flat list into a nested tree structure by grouping entries by directory.
- It writes tree objects bottom-up: leaf directories first, then their parents, up to the root tree. Each tree object is hashed and stored in .git/objects/. If a subtree is identical to one that already exists (unchanged directory), the same hash is reused and no new object is written.
- Git reads HEAD to find the current branch (ref: refs/heads/main), then reads the branch file to get the current commit hash (the parent).
- It constructs the commit object: the root tree hash, the parent commit hash, author/committer lines with timestamps, and the commit message. This is hashed and written to the object store.
- Finally, Git updates the branch file (.git/refs/heads/main) with the new commit’s hash. HEAD is not changed: it still says ref: refs/heads/main. Only the branch pointer advances.
- Total new objects: one commit and one or more trees (only for changed directory paths). The blobs were already written during git add. Total file updates: the branch ref file and possibly the index (to update stat cache entries).
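The last two steps above (resolve HEAD, then advance only the branch file) can be sketched as follows. This is a simplified illustration that assumes loose ref files and a symbolic HEAD; `currentRefPath`, `readParent`, and `advanceBranch` are made-up helper names, and packed refs, reflogs, and a detached HEAD are not handled.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Resolve HEAD to the branch file it names; returns null for a detached HEAD.
function currentRefPath(gitDir) {
  const head = fs.readFileSync(path.join(gitDir, 'HEAD'), 'utf8').trim();
  return head.startsWith('ref: ') ? path.join(gitDir, head.slice(5)) : null;
}

// The parent of the next commit, or null before the first commit on the branch.
function readParent(gitDir) {
  const refPath = currentRefPath(gitDir);
  if (refPath === null || !fs.existsSync(refPath)) return null;
  return fs.readFileSync(refPath, 'utf8').trim();
}

// Point the current branch at `newHash`. HEAD itself is never rewritten.
function advanceBranch(gitDir, newHash) {
  const refPath = currentRefPath(gitDir);
  if (refPath === null) throw new Error('detached HEAD is not handled in this sketch');
  fs.mkdirSync(path.dirname(refPath), { recursive: true });
  // Write a temporary file and rename it so readers never see a half-written ref.
  const lockPath = `${refPath}.lock`;
  fs.writeFileSync(lockPath, `${newHash}\n`);
  fs.renameSync(lockPath, refPath);
}

// Demo in a throwaway directory that plays the role of .git.
const demoGitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-'));
fs.writeFileSync(path.join(demoGitDir, 'HEAD'), 'ref: refs/heads/main\n');
console.log(readParent(demoGitDir)); // null (no commits yet)
advanceBranch(demoGitDir, 'a'.repeat(40)); // placeholder hash
console.log(readParent(demoGitDir));
```

If the process stops before the rename, the branch still points at the old commit, which is why the ref update works as the commit point.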
git fsck --unreachable will find it. The branch pointer still points to the old commit, so from the user’s perspective, the commit never happened. This is safe: the repository is in a consistent state, and the dangling object will be cleaned up by git gc once it is older than the prune grace period (gc.pruneExpire, two weeks by default). Git’s design ensures that the branch ref update is the atomic “commit point”: everything before that is preparation, and the repository is consistent whether the update happens or not.
How does git log traverse history, and what is the performance characteristic for a repository with millions of commits?
Strong Answer:
- git log starts at HEAD, reads the commit object, prints it, then follows the parent pointer to the next commit. For a linear history, this is a simple linked-list traversal: O(n), where n is the number of commits displayed.
- For histories with branches and merges, git log walks the DAG with a priority queue. It maintains a queue of commits sorted by commit timestamp, dequeues the newest, prints it, and enqueues its parents. This displays commits in roughly reverse chronological order; a strict guarantee that no parent appears before all of its children requires --topo-order.
- For repositories with millions of commits (like the Linux kernel), the performance bottleneck is I/O: reading commit objects from pack files. Git mitigates this with commit-graph files (.git/objects/info/commit-graph), which pre-compute and cache commit metadata (parents, tree hash, commit date, generation number) in a memory-mappable binary format. With a commit graph, git log can walk the graph structure without parsing individual commit objects; it only needs the objects themselves for the commits whose author and message it prints.
- Generation numbers in the commit graph enable even faster reachability queries: “is commit A an ancestor of commit B?” can be answered with “no” in O(1) when A’s generation number is greater than or equal to B’s, because an ancestor always has a smaller generation number. This accelerates merge-base computation, fetch negotiation, and branch filtering.
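The queue-based walk and the generation-number shortcut can both be sketched in a few lines. The graph below and the names `dateOrderedWalk`, `generation`, and `cannotBeAncestor` are invented for illustration; a sorted array stands in for a real heap, and generation numbers are recomputed on each call, whereas the commit-graph file stores them precomputed.

```javascript
// Hypothetical commit metadata: parents plus a commit timestamp.
const graph = {
  a: { parents: [], time: 1 },
  b: { parents: ['a'], time: 2 },
  c: { parents: ['a'], time: 3 },
  m: { parents: ['b', 'c'], time: 4 },
};

// Visit commits newest-first, starting from `head`.
function dateOrderedWalk(commits, head) {
  const queue = [head];
  const seen = new Set([head]);
  const order = [];
  while (queue.length > 0) {
    queue.sort((x, y) => commits[y].time - commits[x].time); // newest at the front
    const id = queue.shift();
    order.push(id);
    for (const parent of commits[id].parents) {
      if (!seen.has(parent)) { // a commit reachable via two branches is queued once
        seen.add(parent);
        queue.push(parent);
      }
    }
  }
  return order;
}

// Generation number: 1 for a root, otherwise 1 + the largest parent generation.
function generation(commits, id) {
  const parents = commits[id].parents;
  if (parents.length === 0) return 1;
  return 1 + Math.max(...parents.map((parent) => generation(commits, parent)));
}

// Cheap negative answer: an ancestor always has a strictly smaller generation.
// A `false` result means "maybe", and a real walk is still needed.
function cannotBeAncestor(commits, a, b) {
  return generation(commits, a) >= generation(commits, b);
}

console.log(dateOrderedWalk(graph, 'm'));       // [ 'm', 'c', 'b', 'a' ]
console.log(cannotBeAncestor(graph, 'm', 'b')); // true
```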
git maintenance (enabled by default in modern Git) and updated incrementally during git gc. You should regenerate it after large imports or history rewrites with git commit-graph write --reachable. Without it, operations like git log --ancestry-path or git merge-base on large repositories can be orders of magnitude slower because they must open and parse individual commit objects from pack files instead of scanning the pre-computed graph.