Chapter 2: The Object Model
Git’s object model is its most elegant design. In this chapter, we’ll implement the foundation of Git: content-addressable storage using SHA-1 hashing. Here is the key insight to carry with you: in most filesystems, the name of a file is arbitrary: you choose it. In Git’s object store, the name is the content. The file’s address is a fingerprint derived from what is inside it. This is like a library where every book’s call number is computed from the text itself. If two copies of the same book arrive, they get the exact same call number and are stored only once. This single idea gives Git deduplication, integrity checking, and immutability for free.
Prerequisites: Completed Chapter 1: Setup & Init
Time: 2-3 hours
Outcome: Working hash-object and cat-file commands
The Core Insight
Git stores everything as objects in a content-addressable store. The “address” (filename) is derived from the content itself using SHA-1 hashing.
Why is this brilliant?
- Same content = same hash = stored only once (deduplication!)
- Hash acts as a checksum (data integrity)
- Immutable objects (can’t change content without changing address)
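The Three Object Types
- Blob: raw file content, with no filename or permissions
- Tree: a directory listing; each entry has a mode, a name, and the hash of a blob or sub-tree
- Commit: a pointer to the root tree plus parent commit(s), author, committer, timestamp, and message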
Implementation
Step 1: SHA-1 Hashing Utility
Create the core hashing function that all objects will use:
src/utils/objects.js
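The original listing for this file is not reproduced here, so what follows is only a minimal sketch of what such a utility could look like. The function names buildObject and hashObject are illustrative assumptions; the object layout (type, space, size, null byte, content) is the one described in this chapter.

```javascript
// Sketch of src/utils/objects.js. Function names are assumptions for illustration.
const crypto = require('crypto');

// Build the bytes that Git hashes: "<type> <size>\0" followed by the raw content.
function buildObject(type, content) {
  const body = Buffer.isBuffer(content) ? content : Buffer.from(content, 'utf8');
  const header = Buffer.from(`${type} ${body.length}\0`, 'utf8');
  return Buffer.concat([header, body]);
}

// SHA-1 over header + content, returned as 40 hex characters.
function hashObject(type, content) {
  return crypto.createHash('sha1').update(buildObject(type, content)).digest('hex');
}

console.log(hashObject('blob', '')); // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The size in the header is the byte length of the content, not the character count, which is why the sketch converts strings to a Buffer before measuring.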
Step 2: Implement hash-object Command
The hash-object command computes the hash of a file and optionally stores it:
src/commands/hashObject.js
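As a rough sketch of this step, assuming a function called hashObjectCommand with a write option and an objectsDir parameter (all three are hypothetical names, and the default directory of .git/objects may differ from what Chapter 1 set up), the command could be written like this:

```javascript
// Sketch of src/commands/hashObject.js. The function name, the { write } option
// and the objectsDir parameter are illustrative assumptions.
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');

function hashObjectCommand(filePath, { write = false, objectsDir = path.join('.git', 'objects') } = {}) {
  const content = fs.readFileSync(filePath); // Buffer, so binary files hash correctly
  const store = Buffer.concat([Buffer.from(`blob ${content.length}\0`), content]);
  const hash = crypto.createHash('sha1').update(store).digest('hex');

  if (write) {
    // Fan-out layout: the first 2 hex chars name the directory, the other 38 the file.
    const dir = path.join(objectsDir, hash.slice(0, 2));
    const file = path.join(dir, hash.slice(2));
    if (!fs.existsSync(file)) {
      fs.mkdirSync(dir, { recursive: true });
      fs.writeFileSync(file, zlib.deflateSync(store));
    }
  }
  return hash;
}
```

Skipping the write when the file already exists is safe because identical content always maps to the same path.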
Step 3: Implement cat-file Command
The cat-file command reads objects from the store:
src/commands/catFile.js
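One possible shape for this command is sketched below. The names readObject and catFile, and the objectsDir parameter, are assumptions made for illustration; the flags mirror the -t, -s and -p options of the real git cat-file.

```javascript
// Sketch of src/commands/catFile.js. Function and parameter names are assumptions.
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');

// Read an object from disk and split it into header fields and content.
function readObject(hash, objectsDir = path.join('.git', 'objects')) {
  const file = path.join(objectsDir, hash.slice(0, 2), hash.slice(2));
  const raw = zlib.inflateSync(fs.readFileSync(file));

  const nul = raw.indexOf(0); // the header ends at the first null byte
  if (nul === -1) throw new Error(`object ${hash} has no header`);
  const [type, sizeText] = raw.subarray(0, nul).toString('utf8').split(' ');
  const content = raw.subarray(nul + 1);
  const size = Number(sizeText);

  // Corruption check: the recorded size must match the actual content length.
  if (size !== content.length) {
    throw new Error(`object ${hash}: header says ${size} bytes, found ${content.length}`);
  }
  return { type, size, content };
}

function catFile(flag, hash, objectsDir) {
  const { type, size, content } = readObject(hash, objectsDir);
  if (flag === '-t') return type;
  if (flag === '-s') return String(size);
  if (flag === '-p') return content.toString('utf8');
  throw new Error(`unknown flag: ${flag}`);
}
```

Returning strings instead of printing keeps the function easy to test; the CLI layer can pass the result to console.log.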
Step 4: Update CLI Entry Point
src/mygit.js
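How the entry point is wired depends on what Chapter 1 produced, so the following is only a sketch of the dispatch idea. The helpers parseArgs and run are hypothetical, and the handlers object stands in for whatever the command modules actually export.

```javascript
// Sketch of the dispatch logic for src/mygit.js. parseArgs and run are
// hypothetical helpers; real handlers would come from src/commands/*.
function parseArgs(argv) {
  const [command, ...rest] = argv;
  const flags = rest.filter((arg) => arg.startsWith('-'));
  const positional = rest.filter((arg) => !arg.startsWith('-'));
  return { command, flags, positional };
}

function run(argv, handlers) {
  const { command, flags, positional } = parseArgs(argv);
  if (!Object.prototype.hasOwnProperty.call(handlers, command)) {
    throw new Error(`mygit: '${command}' is not a mygit command`);
  }
  return handlers[command](flags, positional);
}

// Typical wiring (process.argv.slice(2) drops "node" and the script path):
//   run(process.argv.slice(2), { 'hash-object': ..., 'cat-file': ... });
```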
Testing Your Implementation
Test hash-object
Test cat-file
Compare with Real Git
Understanding Blob Storage
Let’s trace through what happens when you store “Hello”:
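The trace can be reproduced in a few lines of Node.js. This is a sketch that prints each intermediate value rather than the tutorial's own listing; the objects/ prefix in the printed path is a placeholder for wherever your object store lives.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

const content = Buffer.from('Hello', 'utf8');            // 5 bytes
const header = Buffer.from(`blob ${content.length}\0`);   // "blob 5\0", 7 bytes
const store = Buffer.concat([header, content]);           // 12 bytes in total

const hash = crypto.createHash('sha1').update(store).digest('hex');
const compressed = zlib.deflateSync(store);

console.log(store.toString('hex'));  // 626c6f6220350048656c6c6f
console.log(hash);                   // the object's 40-character address
console.log(`objects/${hash.slice(0, 2)}/${hash.slice(2)}`); // where it is stored
console.log(compressed.length);      // size of the file written to disk
```

In the hex dump, 00 is the null byte that separates the header from the content.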
Why include type and size in the hash?
Including type and size in the hashed data:
- Prevents collisions: A blob “tree 100” won’t have the same hash as an actual tree
- Enables verification: We can check the stored size matches actual content
- Self-describing: The object header tells us what it is
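A short experiment makes the first point concrete. The helper hashAs is a hypothetical name, and hashing arbitrary text as a tree is done here purely to compare headers, not to produce a valid tree object.

```javascript
const crypto = require('crypto');

// Hash content under a given object type, header included.
function hashAs(type, content) {
  const body = Buffer.from(content);
  const header = Buffer.from(`${type} ${body.length}\0`);
  return crypto.createHash('sha1').update(header).update(body).digest('hex');
}

const payload = 'tree 100';
console.log(hashAs('blob', payload) === hashAs('tree', payload)); // false

// The header also means Git's hash differs from a plain SHA-1 of the content:
console.log(crypto.createHash('sha1').update('').digest('hex')); // da39a3ee5e6b4b0d3255bfef95601890afd80709
console.log(hashAs('blob', ''));                                  // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```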
Exercises
Exercise 1: Add stdin support
Implement reading from stdin for hash-object:
Hint: Use fs.readFileSync(0, 'utf8') to read from stdin (file descriptor 0).
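One way to start on this exercise is a small helper that falls back to file descriptor 0 when no path is given. The name readInput is an assumption. Note that this sketch deliberately omits the 'utf8' argument from the hint and returns a Buffer, on the assumption that you want binary input to hash byte-for-byte.

```javascript
const fs = require('fs');

// Read input from a path, or from stdin (file descriptor 0) when no path is given.
// Returns a Buffer; call .toString('utf8') if you need text.
function readInput(filePath) {
  const source = filePath === undefined ? 0 : filePath;
  return fs.readFileSync(source);
}
```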
Exercise 2: Implement tree objects
Create a utility to build tree objects:
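As a starting point, here is a sketch of a tree builder. It assumes the on-disk tree format used by Git: each entry is the mode, a space, the name, a null byte, and the 20 raw bytes of the hash, with entries sorted by name and directories compared as if their name ended in a slash. The function names and the shape of the entries array are illustrative assumptions.

```javascript
const crypto = require('crypto');

// Directories (mode 40000) sort as if their name had a trailing "/".
function sortKey({ mode, name }) {
  return Buffer.from(mode === '40000' ? `${name}/` : name);
}

// entries: [{ mode: '100644', name: 'README.md', hash: '<40 hex chars>' }, ...]
function buildTree(entries) {
  const sorted = [...entries].sort((a, b) => Buffer.compare(sortKey(a), sortKey(b)));
  const body = Buffer.concat(
    sorted.map(({ mode, name, hash }) =>
      // The hash is stored as 20 raw bytes, not as 40 hex characters.
      Buffer.concat([Buffer.from(`${mode} ${name}\0`), Buffer.from(hash, 'hex')])
    )
  );
  const store = Buffer.concat([Buffer.from(`tree ${body.length}\0`), body]);
  const hash = crypto.createHash('sha1').update(store).digest('hex');
  return { hash, store };
}

console.log(buildTree([]).hash); // 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
```

To store the tree, compress the returned store buffer and write it the same way hash-object writes blobs.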
Exercise 3: Pretty-print trees
Enhance cat-file -p to nicely format tree objects:
Hint: Parse the binary tree format and detect type from mode.
Key Concepts Review
Content-Addressable
Objects are stored by their SHA-1 hash. Same content = same hash = same storage location.
Immutable Objects
Once written, objects never change. Changing content would change the hash.
Compression
All objects are zlib-compressed. Git is surprisingly space-efficient.
Object Format
Every object: {type} {size}\0{content} - simple and consistent.
Further Reading
DSA: Hash Maps
Understand the data structure behind content-addressable storage
Cryptography Basics
Learn more about SHA-1 and cryptographic hashing
What’s Next?
In Chapter 3: Staging & Index, we’ll implement:
- The index (staging area) file format
- The add command
- The status command
Next: Staging & Index
Learn how Git’s staging area works
Interview Deep-Dive
How does Git actually store objects internally? Walk me through the exact bytes on disk for a simple file.
Strong Answer:
- When you store the string “Hello” as a blob, Git constructs the full object as: the header blob 5\0 (type, space, size in bytes, null byte) concatenated with the content Hello. The complete byte sequence is blob 5\0Hello.
- Git computes SHA-1 over this entire byte sequence to get the 40-character hex hash. This hash becomes the object’s address.
- The full object is then zlib-deflated (compressed) and written to disk at .git/objects/<first-2-hex-chars>/<remaining-38-hex-chars>. The file on disk contains only the compressed bytes, with no additional metadata.
- To read the object back, Git reads the file, zlib-inflates it, parses the header to extract the type and size, verifies the size matches the actual content length (corruption check), and returns the content.
- The key design insight is that the hash covers the header too, not just the content. This means a blob containing the text “tree 100” will never collide with an actual tree object of size 100, because the headers differ (blob 8\0tree 100 vs. tree 100\0...). The header provides type safety within the content-addressable store.
.git directories are surprisingly small despite storing every version of every file. The compression also reduces I/O bandwidth when reading objects, which matters for large repositories. The trade-off is CPU cost for compression/decompression, but for the typical object sizes in Git (kilobytes to low megabytes), this is negligible. For very large binary files, the compression ratio drops and the CPU cost becomes noticeable, which is one reason large binaries are better handled by Git LFS.
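The difference in compression ratio is easy to observe. In this sketch, repeated source text stands in for typical repository content and random bytes stand in for an already-compressed binary; both inputs are assumptions chosen for illustration, and exact ratios will vary.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

const source = Buffer.from('function add(a, b) { return a + b; }\n'.repeat(200));
const random = crypto.randomBytes(source.length); // high-entropy stand-in for binary media

// Compressed size divided by original size: lower is better.
const ratio = (buf) => zlib.deflateSync(buf).length / buf.length;

console.log(ratio(source).toFixed(3)); // far below 1: repetitive text compresses well
console.log(ratio(random).toFixed(3)); // close to 1: nothing to gain, CPU spent anyway
```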
What is the difference between a blob, a tree, and a commit object? Why does Git need three separate types?
Strong Answer:
- Blobs store raw file content with no metadata — no filename, no permissions, no directory structure. This enables deduplication: if two files in different directories have the same content, they share one blob.
- Trees store directory structure: each entry has a mode (permissions), a name, and a hash pointing to a blob or sub-tree. Trees answer “what files exist in this directory and what are their names?” Separating this from content means renaming a file creates a new tree but reuses the existing blob.
- Commits store snapshot metadata: a pointer to the root tree (the complete project state), parent commit(s), author, committer, timestamp, and message. Commits answer “who changed what, when, and why?”
- The three types correspond to three distinct concerns: content (blob), structure (tree), and history (commit). Keeping them separate enables Git’s most powerful optimizations. Two commits that share a subtree (because a directory was unchanged) literally share the same tree object — Git does not store it twice. This cascades down: shared trees point to shared blobs. A 10,000-file repository where one file changes between commits creates approximately one new blob, a chain of new tree objects from the changed file to the root (maybe 3-5 objects), and one new commit. Everything else is shared.
git log --follow detects renames heuristically by comparing files across commits: if an identical blob hash disappears from one path and appears at another, Git treats it as an exact rename; otherwise it compares the content of removed and added files and infers a rename when they are at least 50% similar (the default threshold). This design choice is intentional: Linus Torvalds argued that tracking renames explicitly creates complexity and edge cases (what if you rename and modify simultaneously?), while heuristic detection handles the common case well and degrades gracefully for ambiguous cases. The practical consequence is that git log <file> stops at renames unless you add --follow, which is a common source of confusion.
You run git hash-object on the same file twice and get the same hash. Then you add a single space to the file and get a completely different hash. Why is this property valuable?
Strong Answer:
- This is the avalanche property of cryptographic hash functions: even a tiny change in input produces a dramatically different output. Combined with content addressing, this gives Git three guarantees.
- First, integrity: if an object’s content is corrupted (bit flip on disk, bad network transfer), the recomputed hash will no longer match the object’s name, so the corruption is detectable. git fsck verifies every object in the repository by re-hashing and comparing. This catches silent data corruption that would go unnoticed in traditional filesystems.
- Second, deduplication: identical content always produces the same hash, so Git stores it exactly once. Across a repository’s entire history, every unique version of every file exists as exactly one blob. This is why Git is space-efficient despite storing full snapshots, not diffs.
- Third, immutability: you cannot change an object’s content without changing its address. This means any commit hash you record (in a branch, a tag, or a deployment log) permanently identifies exactly that state of the code. No one can retroactively alter history without changing all subsequent hashes, which is immediately detectable.
- This trio of properties (integrity, deduplication, and immutability) is why content-addressable storage is used not just in Git but in systems like Docker image layers, IPFS, and blockchains. The principle is universal.
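The points above can be checked with a small experiment. The helpers gitHash, differingChars and verify are hypothetical names, and verify is only a simplified stand-in for the kind of re-hash-and-compare check described for git fsck, not its actual implementation.

```javascript
const crypto = require('crypto');

// Git-style blob hash: SHA-1 over "blob <size>\0" plus the content.
const gitHash = (text) => {
  const body = Buffer.from(text);
  return crypto.createHash('sha1').update(`blob ${body.length}\0`).update(body).digest('hex');
};

// Count the hex positions at which two hashes differ.
const differingChars = (a, b) => [...a].filter((ch, i) => ch !== b[i]).length;

// Integrity check: re-hash the content and compare with the recorded address.
const verify = (expectedHash, text) => gitHash(text) === expectedHash;

const before = gitHash('hello world\n');
const same = gitHash('hello world\n');
const after = gitHash('hello world \n'); // one extra space

console.log(before === same);               // true: same content, same address
console.log(differingChars(before, after)); // most of the 40 positions change
console.log(verify(before, 'hello world \n')); // false: altered content is detected
```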