Git’s object model is its most elegant design. In this chapter, we’ll implement the foundation of Git: content-addressable storage using SHA-1 hashing.

Here is the key insight to carry with you: in most filesystems, the name of a file is arbitrary; you choose it. In Git’s object store, the name is determined by the content: the file’s address is a fingerprint derived from what is inside it. This is like a library where every book’s call number is computed from the text itself. If two copies of the same book arrive, they get the exact same call number and are stored only once. This single idea gives Git deduplication, integrity checking, and immutability for free.
Prerequisites: Completed Chapter 1: Setup & Init
Time: 2-3 hours
Outcome: Working hash-object and cat-file commands
Create the core hashing function that all objects will use:
src/utils/objects.js
const crypto = require('crypto');
const zlib = require('zlib');
const fs = require('fs');
const path = require('path');

/**
 * Compute SHA-1 hash for a Git object.
 *
 * Git object format: "{type} {size}\0{content}"
 * - type: blob, tree, commit, or tag
 * - size: content length in bytes (must be the byte length, not character count)
 * - \0: null byte separator -- this prevents ambiguity between the header and body
 * - content: the actual data
 *
 * Why include type and size in the hash? Two reasons:
 * 1. Security: a blob containing the text "tree 100" won't collide
 *    with an actual tree object, because the headers differ.
 * 2. Verification: when reading back, we can check that the stored
 *    size matches the actual content length -- catching corruption early.
 *
 * @param {Buffer|string} content - The object content
 * @param {string} type - Object type (blob, tree, commit)
 * @returns {{hash: string, data: Buffer}} - Hash and full object data
 */
function hashObject(content, type = 'blob') {
  // Convert string to buffer if needed.
  // We use Buffer because we need the byte length, which can differ
  // from string length for multi-byte characters (e.g., UTF-8 emoji).
  const contentBuffer = Buffer.isBuffer(content) ? content : Buffer.from(content);

  // Create header: "{type} {size}\0"
  const header = Buffer.from(`${type} ${contentBuffer.length}\0`);

  // Full object = header + content
  const fullObject = Buffer.concat([header, contentBuffer]);

  // Compute SHA-1 hash.
  // The resulting 40-character hex string becomes the object's "address".
  const hash = crypto
    .createHash('sha1')
    .update(fullObject)
    .digest('hex');

  return { hash, data: fullObject };
}

/**
 * Write an object to the Git object store
 *
 * @param {string} gitDir - Path to .git directory
 * @param {string} hash - 40-character hex hash
 * @param {Buffer} data - Full object data (header + content)
 * @returns {string} - The hash
 */
function writeObject(gitDir, hash, data) {
  // Objects stored at: .git/objects/{first 2 chars}/{remaining 38 chars}
  // The 2-char prefix creates up to 256 subdirectories -- a fan-out
  // strategy that keeps directory listings small and fast.
  const objectDir = path.join(gitDir, 'objects', hash.slice(0, 2));
  const objectPath = path.join(objectDir, hash.slice(2));

  // Don't write if already exists. Because the hash is derived from the
  // content, an existing file with the same hash holds the same content
  // (barring a SHA-1 collision). This is the deduplication magic of
  // content-addressable storage -- identical files are never stored twice.
  if (fs.existsSync(objectPath)) {
    return hash;
  }

  // Create subdirectory if needed. With { recursive: true } this is a
  // no-op when the directory already exists.
  fs.mkdirSync(objectDir, { recursive: true });

  // Compress with zlib and write.
  // Real Git also uses zlib deflate. For typical source text this often
  // shrinks the object by more than half; the exact ratio depends on the content.
  //
  // Debugging tip: if you suspect a corrupt object, you can manually inflate
  // it with: zlib.inflateSync(fs.readFileSync(objectPath)).toString()
  const compressed = zlib.deflateSync(data);
  fs.writeFileSync(objectPath, compressed);

  return hash;
}

/**
 * Read an object from the Git object store
 *
 * @param {string} gitDir - Path to .git directory
 * @param {string} hash - 40-character hex hash
 * @returns {{type: string, size: number, content: Buffer}}
 */
function readObject(gitDir, hash) {
  const objectPath = path.join(
    gitDir,
    'objects',
    hash.slice(0, 2),
    hash.slice(2)
  );

  if (!fs.existsSync(objectPath)) {
    throw new Error(`fatal: Not a valid object name ${hash}`);
  }

  // Read and decompress. Every object on disk is zlib-deflated.
  const compressed = fs.readFileSync(objectPath);
  const data = zlib.inflateSync(compressed);

  // Parse header: find the null byte that separates header from content.
  // The header tells us the object type and expected size, which lets us
  // sanity-check the data before trusting it.
  const nullIndex = data.indexOf(0);
  if (nullIndex === -1) {
    throw new Error(`Object ${hash} is corrupted: missing header terminator`);
  }
  const header = data.subarray(0, nullIndex).toString();
  const content = data.subarray(nullIndex + 1);

  // Parse header: "{type} {size}"
  const [type, sizeStr] = header.split(' ');
  const size = parseInt(sizeStr, 10);

  // Verify size matches. If a disk error or interrupted write truncated
  // the file, this check catches it rather than letting bad data propagate
  // silently through your commit history. It does not catch same-length
  // changes; a full integrity check would re-hash `data` and compare
  // the result with `hash`.
  if (content.length !== size) {
    throw new Error(`Object ${hash} is corrupted`);
  }

  return { type, size, content };
}

module.exports = {
  hashObject,
  writeObject,
  readObject
};
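The size check in readObject notices an interrupted write only after the fact. A complementary precaution, sketched below, is to avoid leaving half-written objects in the first place: write to a temporary file, then rename it into place. The helper name writeFileAtomic and the temp-file naming scheme are assumptions for illustration, not part of this chapter's module.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not part of the chapter's module): write `data` to a
// temporary file in the target directory, then rename it to `targetPath`.
// A rename within one directory swaps the name in a single step on most
// filesystems, so readers see either no file or the complete file.
function writeFileAtomic(targetPath, data) {
  const dir = path.dirname(targetPath);
  fs.mkdirSync(dir, { recursive: true });
  const tmpPath = path.join(dir, `tmp_obj_${process.pid}_${Date.now()}`);
  try {
    fs.writeFileSync(tmpPath, data);
    fs.renameSync(tmpPath, targetPath);
  } catch (err) {
    // Best-effort cleanup of the partial temp file, then re-throw.
    fs.rmSync(tmpPath, { force: true });
    throw err;
  }
}

// Usage: write into a throwaway directory and read the bytes back.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-atomic-'));
const demoPath = path.join(demoDir, 'ab', 'cdef');
writeFileAtomic(demoPath, Buffer.from('payload'));
console.log(fs.readFileSync(demoPath).toString()); // payload
```

If you adopt something like this, writeObject would call it in place of fs.writeFileSync; the rest of the function stays the same.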
# Create a test file
echo "test content" > test.txt

# Hash without storing
mygit hash-object test.txt
# Expected: d670460b4b4aece5915caf5c68d12f560a9fe3e4

# Hash and store
mygit hash-object -w test.txt

# Verify it was stored
ls .git/objects/d6/
# Should show: 70460b4b4aece5915caf5c68d12f560a9fe3e4
# Use real git
git hash-object test.txt
# Should match your implementation!
Debugging tip: If your hash does NOT match real Git’s, the most common causes are:
Newline differences. echo appends a newline on most shells, so the test file ends in \n and is one byte longer than the text you typed (for example, echo "abc" writes 4 bytes, not 3). Make sure you’re hashing exactly the same bytes.
Encoding issues. If your file contains non-ASCII characters, ensure both your implementation and Git are reading the raw bytes, not a re-encoded string.
Wrong header size. The size in the header must be the byte length of the content only, not including the header itself.
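The three causes above can be checked in isolation from the rest of your code. The sketch below recomputes a blob hash from raw bytes and compares it against two values that real Git produces: the empty blob, and the string test content followed by a newline. The helper name blobHash is an assumption for illustration.

```javascript
const crypto = require('crypto');

// Hypothetical helper: hash `bytes` the way Git hashes a blob.
function blobHash(bytes) {
  const body = Buffer.isBuffer(bytes) ? bytes : Buffer.from(bytes, 'utf8');
  // Size is the byte length of the content only, not including the header.
  const header = Buffer.from(`blob ${body.length}\0`);
  return crypto.createHash('sha1').update(header).update(body).digest('hex');
}

// Known values from real Git.
console.log(blobHash(''));               // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
console.log(blobHash('test content\n')); // d670460b4b4aece5915caf5c68d12f560a9fe3e4

// A missing trailing newline changes the bytes, so the hash differs.
console.log(blobHash('test content') === blobHash('test content\n')); // false

// Byte length and character count differ for non-ASCII text.
const word = 'caf\u00e9';
console.log(word.length, Buffer.byteLength(word, 'utf8')); // 4 5
```

If these two known hashes come out right but your command's output is wrong, the problem is likely in how the file is read rather than in the hashing itself.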
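Hint: Use fs.readFileSync(0) to read stdin (file descriptor 0) as a Buffer. Passing 'utf8' decodes the input into a string, which can alter the bytes of binary or non-UTF-8 input and therefore change the hash.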
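To see why the encoding matters when reading input, the sketch below pushes two bytes that are not valid UTF-8 through a string and back. It does not read stdin itself; the byte values are made up for the demonstration.

```javascript
// Two bytes that are not valid UTF-8 (for example, from a binary file).
const original = Buffer.from([0xff, 0xfe]);

// Decoding to a string replaces each invalid byte with U+FFFD ...
const asString = original.toString('utf8');

// ... and encoding that string again yields different, longer bytes.
const roundTripped = Buffer.from(asString, 'utf8');

console.log(original.toString('hex'));      // fffe
console.log(roundTripped.toString('hex'));  // efbfbdefbfbd
console.log(original.equals(roundTripped)); // false
```

Hashing the round-tripped bytes would give a different object ID than hashing the original bytes, which is why keeping the data in a Buffer end to end is the safer default.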
Exercise 2: Implement tree objects
Create a utility to build tree objects:
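// A tree entry is: "{mode} {name}\0{20-byte-binary-sha}"
// Modes: 100644 (regular file), 100755 (executable), 40000 (directory).
// Note: the directory mode is stored without a leading zero ("40000");
// `git cat-file -p` pads it to "040000" for display.
// Entries must be sorted by name before hashing (Git compares directory
// names as if they ended in "/"), or the hash won't match real Git.
function createTree(entries) {
  // entries: [{mode, name, hash}, ...]
  // Return the hash of the tree object
}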
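As a starting point for this exercise, here is a hedged sketch of how a single entry could be serialized. The helper name encodeTreeEntry and the leading-zero normalization are assumptions for illustration; sorting the entries and assembling the full tree are left to you. The last lines check the header logic against the well-known ID of the empty tree.

```javascript
const crypto = require('crypto');

// Hypothetical helper: serialize one entry as "{mode} {name}\0" + 20 raw hash bytes.
function encodeTreeEntry({ mode, name, hash }) {
  // Git stores the directory mode as "40000", so drop any leading zeros.
  const storedMode = String(mode).replace(/^0+/, '');
  const sha = Buffer.from(hash, 'hex'); // 40 hex chars -> 20 bytes
  if (sha.length !== 20) throw new Error(`bad hash for ${name}`);
  return Buffer.concat([Buffer.from(`${storedMode} ${name}\0`, 'utf8'), sha]);
}

const entry = encodeTreeEntry({
  mode: '040000',
  name: 'src',
  hash: 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391', // placeholder hash for the demo
});
console.log(entry.length); // 30 = "40000 src\0" (10 bytes) + 20 hash bytes

// Sanity check: a tree with no entries hashes to Git's empty-tree ID.
const emptyTree = Buffer.from('tree 0\0');
console.log(crypto.createHash('sha1').update(emptyTree).digest('hex'));
// 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```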
Exercise 3: Pretty-print trees
Enhance cat-file -p to nicely format tree objects:
100644 blob a1b2c3d4... file.txt
040000 tree e5f6a7b8... src
Hint: Parse the binary tree format and detect type from mode.
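One possible shape for that parser is sketched below. The function names parseTree and formatTree are assumptions for illustration, and the type detection distinguishes only directories from everything else.

```javascript
// Hypothetical helper: split raw tree content into {mode, type, hash, name} entries.
function parseTree(content) {
  const entries = [];
  let pos = 0;
  while (pos < content.length) {
    const space = content.indexOf(0x20, pos); // end of mode
    const nul = space === -1 ? -1 : content.indexOf(0x00, space); // end of name
    if (space === -1 || nul === -1 || nul + 21 > content.length) {
      throw new Error('malformed tree object');
    }
    const mode = content.toString('utf8', pos, space);
    const name = content.toString('utf8', space + 1, nul);
    const hash = content.toString('hex', nul + 1, nul + 21); // 20 raw bytes
    // Directories use mode 40000; treat every other mode as a blob here.
    const type = mode === '40000' ? 'tree' : 'blob';
    entries.push({ mode, type, hash, name });
    pos = nul + 21;
  }
  return entries;
}

// Pad the mode to six digits and put a tab before the name, as git ls-tree does.
function formatTree(entries) {
  return entries
    .map((e) => `${e.mode.padStart(6, '0')} ${e.type} ${e.hash}\t${e.name}`)
    .join('\n');
}

// Demo input: two hand-built entries with placeholder hashes.
const raw = Buffer.concat([
  Buffer.from('100644 file.txt\0'), Buffer.alloc(20, 0xa1),
  Buffer.from('40000 src\0'), Buffer.alloc(20, 0xe5),
]);
console.log(formatTree(parseTree(raw)));
```

Submodule entries (mode 160000) would need a third type; that case is left out here to keep the sketch short.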
How does Git actually store objects internally? Walk me through the exact bytes on disk for a simple file.
Strong Answer:
When you store the string “Hello” as a blob, Git constructs the full object as: the header blob 5\0 (type, space, size in bytes, null byte) concatenated with the content Hello. The complete byte sequence is blob 5\0Hello.
Git computes SHA-1 over this entire byte sequence to get the 40-character hex hash. This hash becomes the object’s address.
The full object is then zlib-deflated (compressed) and written to disk at .git/objects/<first-2-hex-chars>/<remaining-38-hex-chars>. The file on disk contains only the compressed bytes — no additional metadata.
To read the object back, Git reads the file, zlib-inflates it, parses the header to extract the type and size, verifies the size matches the actual content length (corruption check), and returns the content.
The key design insight is that the hash covers the header too, not just the content. This means a blob containing the text “tree 100” will never collide with an actual tree object of size 100, because the headers differ (blob 8\0tree 100 vs. tree 100\0...). The header provides type safety within the content-addressable store.
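The walk-through above can be reproduced in a few lines. This is a minimal sketch that builds the bytes for the "Hello" blob in memory, derives the on-disk path from the hash, and round-trips the data through zlib. It does not touch a real repository.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

// Header + content: "blob 5\0Hello"
const content = Buffer.from('Hello');
const fullObject = Buffer.concat([Buffer.from(`blob ${content.length}\0`), content]);
console.log(fullObject.toString('hex')); // 626c6f6220350048656c6c6f (12 bytes)

// The SHA-1 of the full object determines where it lives on disk.
const hash = crypto.createHash('sha1').update(fullObject).digest('hex');
const relPath = `.git/objects/${hash.slice(0, 2)}/${hash.slice(2)}`;
console.log(relPath);

// What gets written is the deflated object; inflating restores it exactly.
const onDisk = zlib.deflateSync(fullObject);
const restored = zlib.inflateSync(onDisk);
console.log(restored.equals(fullObject)); // true
```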
Follow-up: Why does Git use zlib compression, and what is the typical compression ratio?
Zlib (the deflate algorithm) provides a good balance between compression ratio and speed. Source code is highly compressible text, and zlib often achieves roughly 60-70% reduction on it, so a 10KB source file typically becomes a ~3-4KB object on disk. This, together with the delta compression Git applies when it packs objects, is why .git directories are surprisingly small despite storing every version of every file. The compression also reduces I/O bandwidth when reading objects, which matters for large repositories. The trade-off is CPU cost for compression and decompression, but for the typical object sizes in Git (kilobytes to low megabytes), this is negligible. For very large binary files, the compression ratio drops and the CPU cost becomes noticeable, which is one reason large binaries are better handled by Git LFS.
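The ratio depends heavily on the input, which is easy to measure yourself. The sketch below compares a made-up, highly repetitive source-like string with random bytes; the helper name savings is an assumption, and the repetitive sample compresses far better than real files usually do.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

// Hypothetical helper: fraction of bytes saved by zlib deflate (can be negative).
function savings(buf) {
  return 1 - zlib.deflateSync(buf).length / buf.length;
}

// Repetitive text stands in for source code; random bytes stand in for
// already-compressed or encrypted binaries.
const sourceLike = Buffer.from(
  'function add(a, b) {\n  return a + b;\n}\n\nmodule.exports = { add };\n'.repeat(50)
);
const randomLike = crypto.randomBytes(sourceLike.length);

console.log(`repetitive text saved: ${(savings(sourceLike) * 100).toFixed(1)}%`);
console.log(`random bytes saved:    ${(savings(randomLike) * 100).toFixed(1)}%`);
```

Running savings over the files in one of your own projects gives a more honest picture than either extreme.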
What is the difference between a blob, a tree, and a commit object? Why does Git need three separate types?
Strong Answer:
Blobs store raw file content with no metadata — no filename, no permissions, no directory structure. This enables deduplication: if two files in different directories have the same content, they share one blob.
Trees store directory structure: each entry has a mode (permissions), a name, and a hash pointing to a blob or sub-tree. Trees answer “what files exist in this directory and what are their names?” Separating this from content means renaming a file creates a new tree but reuses the existing blob.
Commits store snapshot metadata: a pointer to the root tree (the complete project state), parent commit(s), author, committer, timestamp, and message. Commits answer “who changed what, when, and why?”
The three types correspond to three distinct concerns: content (blob), structure (tree), and history (commit). Keeping them separate enables Git’s most powerful optimizations. Two commits that share a subtree (because a directory was unchanged) literally share the same tree object — Git does not store it twice. This cascades down: shared trees point to shared blobs. A 10,000-file repository where one file changes between commits creates approximately one new blob, a chain of new tree objects from the changed file to the root (maybe 3-5 objects), and one new commit. Everything else is shared.
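The sharing described above can be observed with a toy in-memory store. This is a sketch under simplifying assumptions: trees are serialized as text lines, not Git's binary entry format, and the names put and storeNode are hypothetical. It counts how many new objects a one-file change adds.

```javascript
const crypto = require('crypto');

// Content-addressed store: hash -> bytes. Identical objects collapse to one key.
function put(store, type, body) {
  const buf = Buffer.from(body);
  const full = Buffer.concat([Buffer.from(`${type} ${buf.length}\0`), buf]);
  const hash = crypto.createHash('sha1').update(full).digest('hex');
  store.set(hash, full);
  return hash;
}

// node: a string is file content; an object is a directory (name -> node).
// Simplification: a tree is a sorted list of "type hash name" text lines.
function storeNode(store, node) {
  if (typeof node === 'string') {
    return { type: 'blob', hash: put(store, 'blob', node) };
  }
  const lines = Object.keys(node).sort().map((name) => {
    const child = storeNode(store, node[name]);
    return `${child.type} ${child.hash} ${name}`;
  });
  return { type: 'tree', hash: put(store, 'tree', lines.join('\n')) };
}

const v1 = {
  'README.md': 'hi\n',
  src: { 'a.js': 'A\n', lib: { 'b.js': 'B\n' } },
  docs: { 'guide.md': 'G\n' },
};
// Second snapshot: identical except for one deeply nested file.
const v2 = JSON.parse(JSON.stringify(v1));
v2.src.lib['b.js'] = 'B2\n';

const store = new Map();
storeNode(store, v1);
const sizeAfterV1 = store.size; // 4 blobs + 4 trees
storeNode(store, v2);
const added = store.size - sizeAfterV1;
console.log(sizeAfterV1, added); // 8 4
```

The four new objects are the changed blob plus the lib, src, and root trees on the path to it; the docs tree and the other blobs are reused untouched.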
Follow-up: How does Git handle file renames? There is no “rename” object type.
Git does not track renames explicitly. When you rename a file, Git creates a new tree where the entry for the old name is gone and an entry for the new name points to the same blob hash. Rename detection happens later, at diff time, and is heuristic. If a blob hash disappears from one path and appears at another, Git infers an exact rename. If the file was also modified, Git compares the contents of removed and added files and treats a pair as a rename when they are similar enough; the default similarity threshold is 50%. git log --follow relies on this detection. This design choice is intentional: Linus Torvalds argued that tracking renames explicitly creates complexity and edge cases (what if you rename and modify simultaneously?), while heuristic detection handles the common case well and degrades gracefully for ambiguous cases. The practical consequence is that git log <file> stops at renames unless you add --follow, which is a common source of confusion.
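The exact-match half of that heuristic is simple enough to sketch. The example below assumes each tree has already been flattened into a map from path to blob hash; the function name detectExactRenames and the shortened hashes are made up for illustration, and similarity-based detection for modified files is not covered.

```javascript
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Hypothetical helper: pair each removed path with an added path that
// points at the same blob hash.
function detectExactRenames(oldTree, newTree) {
  const removed = Object.keys(oldTree).filter((p) => !has(newTree, p));
  const added = Object.keys(newTree).filter((p) => !has(oldTree, p));
  const renames = [];
  for (const from of removed) {
    const idx = added.findIndex((to) => newTree[to] === oldTree[from]);
    if (idx !== -1) {
      renames.push({ from, to: added[idx] });
      added.splice(idx, 1); // each added path can match only one removed path
    }
  }
  return renames;
}

const treeV1 = { 'README.md': 'aaa111', 'src/util.js': 'bbb222' };
const treeV2 = { 'README.md': 'aaa111', 'src/helpers.js': 'bbb222' };
console.log(detectExactRenames(treeV1, treeV2));
// [ { from: 'src/util.js', to: 'src/helpers.js' } ]
```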
You run git hash-object on the same file twice and get the same hash. Then you add a single space to the file and get a completely different hash. Why is this property valuable?
Strong Answer:
This is the avalanche property of cryptographic hash functions: even a tiny change in input produces a dramatically different output. Combined with content addressing, this gives Git three guarantees.
First, integrity: if an object’s content is corrupted (bit flip on disk, bad network transfer), its recomputed hash no longer matches its name, so the damage is detectable as soon as the object is re-hashed. git fsck verifies every object in the repository by re-hashing and comparing. This catches silent data corruption that would go unnoticed in many traditional filesystems.
Second, deduplication: identical content always produces the same hash, so Git stores it exactly once. Across a repository’s entire history, every unique version of every file exists as exactly one blob. This is why Git is space-efficient despite storing full snapshots, not diffs.
Third, immutability: you cannot change an object’s content without changing its address. This means any commit hash you record (in a branch, a tag, or a deployment log) permanently identifies exactly that state of the code. No one can retroactively alter history without changing all subsequent hashes, which is immediately detectable.
This trio of properties — integrity, deduplication, and immutability — is why content-addressable storage is used not just in Git but in systems like Docker image layers, IPFS, and blockchain. The principle is universal.
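Both the avalanche effect and the integrity check are easy to observe. The sketch below hashes two blobs that differ by one trailing space, counts the differing bits, and then shows a re-hash catching a simulated bit flip. The helper names are assumptions for illustration.

```javascript
const crypto = require('crypto');

function gitObject(type, text) {
  const body = Buffer.from(text, 'utf8');
  return Buffer.concat([Buffer.from(`${type} ${body.length}\0`), body]);
}

function sha1Hex(buf) {
  return crypto.createHash('sha1').update(buf).digest('hex');
}

// Count differing bits between two equal-length hex digests.
function bitDistance(hexA, hexB) {
  const a = Buffer.from(hexA, 'hex');
  const b = Buffer.from(hexB, 'hex');
  let bits = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) { bits += x & 1; x >>= 1; }
  }
  return bits;
}

// Re-hash and compare: the idea behind verifying an object against its name.
function verifyObject(expectedHash, fullObject) {
  return sha1Hex(fullObject) === expectedHash;
}

const h1 = sha1Hex(gitObject('blob', 'hello'));
const h2 = sha1Hex(gitObject('blob', 'hello '));
console.log(`${bitDistance(h1, h2)} of 160 bits differ`);

const stored = gitObject('blob', 'hello');
console.log(verifyObject(h1, stored)); // true
stored[stored.length - 1] ^= 0x01;     // simulate a single bit flip
console.log(verifyObject(h1, stored)); // false
```

For a well-behaved hash function you would expect roughly half of the 160 bits to differ, though the exact count varies from input to input.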
Follow-up: Can two different files theoretically produce the same SHA-1 hash in Git? What would happen?
Theoretically yes: SHA-1 collisions exist (demonstrated by Google’s SHAttered attack in 2017). If two different blobs produced the same hash, Git would store the first one and silently return it when either blob was requested. The second blob’s content would be lost. In practice, the probability of an accidental collision is astronomically small: any given pair of objects collides with probability about 1 in 2^160, and by the birthday bound a repository would need on the order of 2^80 objects before a collision became likely. Targeted attacks are possible but extremely expensive computationally. Git mitigated the known SHAttered-style attack with a detection mechanism (hardened SHA-1) and supports SHA-256 as an alternative object format for new repositories. For the vast majority of repositories, SHA-1 collision is not a practical concern, but it is a theoretically important limitation of the design.
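To put a number on "astronomically small", the standard birthday approximation can be evaluated directly. The sketch below works in log2 to avoid underflow; the function name is hypothetical, and the approximation only holds while the resulting probability is small.

```javascript
// Birthday approximation for n random objects and a `hashBits`-bit hash:
//   p ~= n(n-1)/2 / 2^hashBits ~= n^2 / 2^(hashBits + 1)
// Returned as log2(p) so tiny probabilities stay representable.
function log2CollisionProbability(objectCount, hashBits = 160) {
  return 2 * Math.log2(objectCount) - (hashBits + 1);
}

// One billion objects: p is about 2^-101.
console.log(log2CollisionProbability(1e9).toFixed(1)); // -101.2

// Around 2^80 objects the estimate reaches 2^-1, where a collision is
// no longer unlikely (and the approximation stops being accurate).
console.log(log2CollisionProbability(2 ** 80));
```

This covers accidental collisions only; a deliberate attack such as SHAttered does not rely on chance in the same way.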