Chapter 2: The Object Model

Git’s object model is its most elegant design. In this chapter, we’ll implement the foundation of Git: content-addressable storage using SHA-1 hashing. Here is the key insight to carry with you: in most filesystems, the name of a file is arbitrary — you choose it. In Git’s object store, the name is the content. The file’s address is a fingerprint derived from what is inside it. This is like a library where every book’s call number is computed from the text itself — if two copies of the same book arrive, they get the exact same call number and are stored only once. This single idea gives Git deduplication, integrity checking, and immutability for free.
Prerequisites: Completed Chapter 1: Setup & Init
Time: 2-3 hours
Outcome: Working hash-object and cat-file commands

The Core Insight

Git stores everything as objects in a content-addressable store. The “address” (filename) is derived from the content itself using SHA-1 hashing.
Content: "Hello, World!"

SHA-1("blob 13\0Hello, World!")

Hash: 5dd01c177f5d7d1be5346a5bc18a569a7410c2ef

Stored at: .git/objects/5d/d01c177f5d7d1be5346a5bc18a569a7410c2ef
Why is this brilliant?
  • Same content = same hash = stored only once (deduplication!)
  • Hash acts as a checksum (data integrity)
  • Immutable objects (can’t change content without changing address)
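
A minimal sketch of these three properties, using only Node's built-in crypto and path modules. The helper names objectId and objectPath are illustrative and not part of the chapter's code; the ID printed for "test content\n" is the one real Git reports for that content.

```javascript
const crypto = require('crypto');
const path = require('path');

// Hypothetical helper: compute the object ID of a blob from its raw content.
function objectId(content) {
    const body = Buffer.from(content);
    const header = Buffer.from(`blob ${body.length}\0`);
    return crypto.createHash('sha1')
        .update(Buffer.concat([header, body]))
        .digest('hex');
}

// Hypothetical helper: derive the on-disk location from the ID alone.
function objectPath(gitDir, id) {
    return path.join(gitDir, 'objects', id.slice(0, 2), id.slice(2));
}

const first = objectId('test content\n');
const second = objectId('test content\n');  // same bytes from "another file"
const changed = objectId('test content!\n'); // one character added

console.log(first);             // d670460b4b4aece5915caf5c68d12f560a9fe3e4
console.log(first === second);  // true: one address, stored once
console.log(first === changed); // false: new content, new address
console.log(objectPath('.git', first));
```

Because the path is computed from the ID, two files with identical bytes necessarily map to the same location on disk.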

The Three Object Types

┌─────────────────────────────────────────────────────────────────────────────┐
│                          GIT OBJECT TYPES                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   BLOB                    TREE                      COMMIT                   │
│   ────                    ────                      ──────                   │
│   Raw file content        Directory snapshot        Project snapshot         │
│                                                                              │
│   ┌─────────────┐        ┌─────────────────┐       ┌──────────────────┐     │
│   │ Hello World │        │ 100644 file.txt │       │ tree d8329fc...  │     │
│   │             │        │ → blob abc123   │       │ parent 5a6f32... │     │
│   └─────────────┘        │ 040000 src/     │       │ author Jane      │     │
│         ↓                │ → tree def456   │       │ committer Jane   │     │
│   SHA-1 of content       └─────────────────┘       │                  │     │
│                                ↓                   │ Initial commit   │     │
│                          SHA-1 of entries          └──────────────────┘     │
│                                                            ↓                │
│                                                    SHA-1 of metadata        │
│                                                                              │
│   ALL OBJECTS: header + content → SHA-1 → stored compressed                 │
└─────────────────────────────────────────────────────────────────────────────┘
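
As a rough illustration of the last line of the diagram, the sketch below frames all three object types with the same header and hashes them. The function name gitObjectId and the sample commit text (including the author line) are assumptions made for this example; the two empty-object IDs are the values real Git produces for an empty blob and an empty tree.

```javascript
const crypto = require('crypto');

// Hypothetical helper: every object type uses the same "{type} {size}\0" framing.
function gitObjectId(type, body) {
    const content = Buffer.isBuffer(body) ? body : Buffer.from(body);
    const header = Buffer.from(`${type} ${content.length}\0`);
    return crypto.createHash('sha1')
        .update(header)
        .update(content)
        .digest('hex');
}

const emptyBlob = gitObjectId('blob', '');
const emptyTree = gitObjectId('tree', '');

// A commit body is plain text: header lines, a blank line, then the message.
// The identity and timestamp below are made up for illustration.
const commitBody = [
    `tree ${emptyTree}`,
    'author Jane <jane@example.com> 0 +0000',
    'committer Jane <jane@example.com> 0 +0000',
    '',
    'Initial commit',
    ''
].join('\n');

console.log(emptyBlob); // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
console.log(emptyTree); // 4b825dc642cb6eb9a060e54bf8d69288fbee4904
console.log(gitObjectId('commit', commitBody).length); // 40
```

Note that the empty blob and the empty tree have identical (empty) bodies yet different IDs, because the type is part of the hashed header.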

Implementation

Step 1: SHA-1 Hashing Utility

Create the core hashing function that all objects will use:
src/utils/objects.js
const crypto = require('crypto');
const zlib = require('zlib');
const fs = require('fs');
const path = require('path');

/**
 * Compute SHA-1 hash for a Git object.
 * 
 * Git object format: "{type} {size}\0{content}"
 * - type: blob, tree, commit, or tag
 * - size: content length in bytes (must be the byte length, not character count)
 * - \0: null byte separator -- this prevents ambiguity between the header and body
 * - content: the actual data
 *
 * Why include type and size in the hash? Two reasons:
 *   1. Security: a blob containing the text "tree 100" won't collide
 *      with an actual tree object, because the headers differ.
 *   2. Verification: when reading back, we can check that the stored
 *      size matches the actual content length -- catching corruption early.
 * 
 * @param {Buffer|string} content - The object content
 * @param {string} type - Object type (blob, tree, commit)
 * @returns {{hash: string, data: Buffer}} - Hash and full object data
 */
function hashObject(content, type = 'blob') {
    // Convert string to buffer if needed.
    // We use Buffer because we need the byte length, which can differ
    // from string length for multi-byte characters (e.g., UTF-8 emoji).
    const contentBuffer = Buffer.isBuffer(content) 
        ? content 
        : Buffer.from(content);
    
    // Create header: "{type} {size}\0"
    const header = Buffer.from(`${type} ${contentBuffer.length}\0`);
    
    // Full object = header + content
    const fullObject = Buffer.concat([header, contentBuffer]);
    
    // Compute SHA-1 hash.
    // The resulting 40-character hex string becomes the object's "address".
    const hash = crypto
        .createHash('sha1')
        .update(fullObject)
        .digest('hex');
    
    return { hash, data: fullObject };
}

/**
 * Write an object to the Git object store
 * 
 * @param {string} gitDir - Path to .git directory
 * @param {string} hash - 40-character hex hash
 * @param {Buffer} data - Full object data (header + content)
 * @returns {string} - The hash
 */
function writeObject(gitDir, hash, data) {
    // Objects stored at: .git/objects/{first 2 chars}/{remaining 38 chars}
    // The 2-char prefix creates up to 256 subdirectories -- a fan-out
    // strategy that keeps directory listings small and fast.
    const objectDir = path.join(gitDir, 'objects', hash.slice(0, 2));
    const objectPath = path.join(objectDir, hash.slice(2));
    
    // Don't write if already exists. Because the hash is derived from the
    // content, an existing file with the same hash is guaranteed to have
    // the same content. This is the deduplication magic of content-addressable
    // storage -- identical files are never stored twice.
    if (fs.existsSync(objectPath)) {
        return hash;
    }
    
    // Create subdirectory if needed
    if (!fs.existsSync(objectDir)) {
        fs.mkdirSync(objectDir, { recursive: true });
    }
    
    // Compress with zlib and write.
    // Real Git also uses zlib deflate. For typical source text this often
    // reduces object size by roughly 60-70%; already-compressed binary
    // content shrinks far less.
    //
    // Debugging tip: if you suspect a corrupt object, you can manually inflate
    // it with: zlib.inflateSync(fs.readFileSync(objectPath)).toString()
    const compressed = zlib.deflateSync(data);
    fs.writeFileSync(objectPath, compressed);
    
    return hash;
}

/**
 * Read an object from the Git object store
 * 
 * @param {string} gitDir - Path to .git directory
 * @param {string} hash - 40-character hex hash
 * @returns {{type: string, size: number, content: Buffer}}
 */
function readObject(gitDir, hash) {
    const objectPath = path.join(
        gitDir, 
        'objects', 
        hash.slice(0, 2), 
        hash.slice(2)
    );
    
    if (!fs.existsSync(objectPath)) {
        throw new Error(`fatal: Not a valid object name ${hash}`);
    }
    
    // Read and decompress. Every object on disk is zlib-deflated.
    const compressed = fs.readFileSync(objectPath);
    const data = zlib.inflateSync(compressed);
    
    // Parse header: find the null byte that separates header from content.
    // The header tells us the object type and expected size, which lets us
    // verify integrity before trusting the data.
    const nullIndex = data.indexOf(0);
    if (nullIndex === -1) {
        // Without a null byte there is no header to parse.
        throw new Error(`Object ${hash} is corrupted`);
    }
    const header = data.subarray(0, nullIndex).toString();
    const content = data.subarray(nullIndex + 1);
    
    // Parse header: "{type} {size}"
    const [type, sizeStr] = header.split(' ');
    const size = parseInt(sizeStr, 10);
    
    // Verify size matches. If a disk error or interrupted write corrupted
    // the file, this check catches it immediately rather than letting bad
    // data propagate silently through your commit history.
    if (content.length !== size) {
        throw new Error(`Object ${hash} is corrupted`);
    }
    
    return { type, size, content };
}

module.exports = {
    hashObject,
    writeObject,
    readObject
};
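
To see the write and read paths meet, here is a self-contained sketch that performs the same steps in memory, with no repository on disk. The function names encodeObject and decodeObject are illustrative stand-ins, not the chapter's writeObject and readObject.

```javascript
const zlib = require('zlib');

// Illustrative stand-in for the write path: frame the content, then deflate it.
function encodeObject(type, content) {
    const body = Buffer.from(content);
    const header = Buffer.from(`${type} ${body.length}\0`);
    return zlib.deflateSync(Buffer.concat([header, body]));
}

// Illustrative stand-in for the read path: inflate, split at the first
// null byte, and check the declared size against the actual body length.
function decodeObject(compressed) {
    const data = zlib.inflateSync(compressed);
    const nullIndex = data.indexOf(0);
    if (nullIndex === -1) {
        throw new Error('missing header terminator');
    }
    const [type, sizeStr] = data.subarray(0, nullIndex).toString().split(' ');
    const content = data.subarray(nullIndex + 1);
    if (content.length !== Number(sizeStr)) {
        throw new Error('size mismatch');
    }
    return { type, size: content.length, content };
}

const stored = encodeObject('blob', 'Hello, Git!\n');
const loaded = decodeObject(stored);
console.log(loaded.type, loaded.size, JSON.stringify(loaded.content.toString()));
// blob 12 "Hello, Git!\n"
```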

Step 2: Implement hash-object Command

The hash-object command computes the hash of a file and optionally stores it:
src/commands/hashObject.js
const fs = require('fs');
const path = require('path');
const { hashObject, writeObject } = require('../utils/objects');
const { findGitDir } = require('../utils/paths');

/**
 * hash-object - Compute object ID and optionally store it
 * 
 * Usage:
 *   mygit hash-object <file>           # Just compute hash
 *   mygit hash-object -w <file>        # Compute and write to object store
 *   mygit hash-object -t <type> <file> # Specify type (default: blob)
 *   mygit hash-object --stdin          # Read from stdin
 */
function execute(args) {
    // Parse options
    let write = false;
    let type = 'blob';
    let useStdin = false;
    const files = [];
    
    for (let i = 0; i < args.length; i++) {
        const arg = args[i];
        
        if (arg === '-w') {
            write = true;
        } else if (arg === '-t') {
            type = args[++i];
            if (!['blob', 'tree', 'commit', 'tag'].includes(type)) {
                throw new Error(`invalid object type "${type}"`);
            }
        } else if (arg === '--stdin') {
            useStdin = true;
        } else if (!arg.startsWith('-')) {
            files.push(arg);
        }
    }
    
    // Handle stdin
    if (useStdin) {
        // Read all of stdin synchronously as raw bytes. File descriptor 0 is
        // stdin; readFileSync accepts a descriptor and reads until EOF.
        // No encoding is passed, so the result is a Buffer.
        const content = fs.readFileSync(0);
        processContent(content, type, write);
        return;
    }
    
    // Handle files
    if (files.length === 0) {
        throw new Error('no file specified');
    }
    
    for (const file of files) {
        if (!fs.existsSync(file)) {
            throw new Error(`fatal: could not open '${file}' for reading`);
        }
        
        const content = fs.readFileSync(file);
        processContent(content, type, write);
    }
}

function processContent(content, type, write) {
    const { hash, data } = hashObject(content, type);
    
    if (write) {
        const gitDir = findGitDir();
        if (!gitDir) {
            throw new Error('fatal: not a git repository');
        }
        writeObject(gitDir, hash, data);
    }
    
    console.log(hash);
}

module.exports = { execute };

Step 3: Implement cat-file Command

The cat-file command reads objects from the store:
src/commands/catFile.js
const { readObject } = require('../utils/objects');
const { requireGitDir } = require('../utils/paths');

/**
 * cat-file - Provide content or type of repository objects
 * 
 * Usage:
 *   mygit cat-file -t <hash>   # Show object type
 *   mygit cat-file -s <hash>   # Show object size
 *   mygit cat-file -p <hash>   # Pretty-print object content
 *   mygit cat-file <type> <hash>  # Show content, expecting type
 */
function execute(args) {
    if (args.length < 2) {
        throw new Error('usage: mygit cat-file (-t | -s | -p | <type>) <object>');
    }
    
    const gitDir = requireGitDir();
    const option = args[0];
    const hash = resolveHash(gitDir, args[1]);
    
    const { type, size, content } = readObject(gitDir, hash);
    
    switch (option) {
        case '-t':
            // Show type
            console.log(type);
            break;
            
        case '-s':
            // Show size
            console.log(size);
            break;
            
        case '-p':
            // Pretty-print based on type
            prettyPrint(type, content);
            break;
            
        default:
            // Expect specific type
            if (type !== option) {
                throw new Error(`fatal: git cat-file: expected ${option}, got ${type}`);
            }
            process.stdout.write(content);
    }
}

/**
 * Resolve a potentially abbreviated hash to full hash
 */
function resolveHash(gitDir, partialHash) {
    // If it's already 40 chars, return as-is
    if (partialHash.length === 40) {
        return partialHash;
    }
    
    // For abbreviated hashes, we need at least 4 chars
    if (partialHash.length < 4) {
        throw new Error('fatal: too short object hash');
    }
    
    const fs = require('fs');
    const path = require('path');
    
    const prefix = partialHash.slice(0, 2);
    const rest = partialHash.slice(2);
    const objectDir = path.join(gitDir, 'objects', prefix);
    
    if (!fs.existsSync(objectDir)) {
        throw new Error(`fatal: Not a valid object name ${partialHash}`);
    }
    
    const matches = fs.readdirSync(objectDir)
        .filter(name => name.startsWith(rest));
    
    if (matches.length === 0) {
        throw new Error(`fatal: Not a valid object name ${partialHash}`);
    }
    
    if (matches.length > 1) {
        throw new Error(`fatal: ambiguous argument '${partialHash}'`);
    }
    
    return prefix + matches[0];
}

/**
 * Pretty-print object content based on type
 */
function prettyPrint(type, content) {
    switch (type) {
        case 'blob':
            // Blob: just print content
            process.stdout.write(content);
            break;
            
        case 'tree':
            // Tree: formatted entries
            printTree(content);
            break;
            
        case 'commit':
            // Commit: print as-is (it's already text)
            process.stdout.write(content);
            break;
            
        default:
            process.stdout.write(content);
    }
}

/**
 * Parse and print tree entries
 */
function printTree(content) {
    let offset = 0;
    
    while (offset < content.length) {
        // Find space (separates mode from name)
        const spaceIndex = content.indexOf(0x20, offset);
        const mode = content.slice(offset, spaceIndex).toString();
        
        // Find null byte (separates name from hash)
        const nullIndex = content.indexOf(0, spaceIndex);
        const name = content.slice(spaceIndex + 1, nullIndex).toString();
        
        // Next 20 bytes are the SHA-1 hash
        const hashBytes = content.slice(nullIndex + 1, nullIndex + 21);
        const hash = hashBytes.toString('hex');
        
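        // Determine type from mode. Directories are stored as '40000';
        // '160000' marks a submodule entry, which points at a commit.
        const typeStr = mode === '40000' ? 'tree'
            : mode === '160000' ? 'commit'
            : 'blob';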
        
        // Print in git's format: {mode} {type} {hash}\t{name}
        console.log(`${mode.padStart(6, '0')} ${typeStr} ${hash}\t${name}`);
        
        offset = nullIndex + 21;
    }
}

module.exports = { execute };
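
printTree assumes a specific binary layout. The sketch below builds two entries by hand and parses them back, which can be handy for testing before tree objects exist in your repository. encodeEntry and decodeEntries are hypothetical names, and the file names are made up; the two IDs are Git's well-known IDs for an empty blob and an empty tree.

```javascript
// Hypothetical helper: one tree entry is "{mode} {name}\0" + 20 raw hash bytes.
function encodeEntry(mode, name, hexHash) {
    return Buffer.concat([
        Buffer.from(`${mode} ${name}\0`),
        Buffer.from(hexHash, 'hex') // 40 hex chars -> 20 bytes
    ]);
}

// Hypothetical helper: walk the buffer the same way printTree does.
function decodeEntries(content) {
    const entries = [];
    let offset = 0;
    while (offset < content.length) {
        const spaceIndex = content.indexOf(0x20, offset);
        const nullIndex = content.indexOf(0, spaceIndex);
        entries.push({
            mode: content.subarray(offset, spaceIndex).toString(),
            name: content.subarray(spaceIndex + 1, nullIndex).toString(),
            hash: content.subarray(nullIndex + 1, nullIndex + 21).toString('hex')
        });
        // Skip the 20 raw hash bytes; they may contain 0x20 or 0x00 themselves.
        offset = nullIndex + 21;
    }
    return entries;
}

const EMPTY_BLOB = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391';
const EMPTY_TREE = '4b825dc642cb6eb9a060e54bf8d69288fbee4904';
const tree = Buffer.concat([
    encodeEntry('100644', 'README.md', EMPTY_BLOB),
    encodeEntry('40000', 'src', EMPTY_TREE)
]);

console.log(tree.length); // 67
console.log(decodeEntries(tree));
```

The hash is stored as 20 raw bytes rather than 40 hex characters, which is why the parser advances by a fixed 21 bytes past the null separator instead of searching for a delimiter.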

Step 4: Update CLI Entry Point

src/mygit.js
#!/usr/bin/env node

const commands = {
    init: require('./commands/init'),
    'hash-object': require('./commands/hashObject'),
    'cat-file': require('./commands/catFile'),
};

function main() {
    const args = process.argv.slice(2);
    
    if (args.length === 0) {
        console.log('usage: mygit <command> [<args>]');
        console.log('\nAvailable commands:');
        console.log('   init          Initialize a new repository');
        console.log('   hash-object   Compute object ID and optionally create a blob');
        console.log('   cat-file      Provide content or type of repository objects');
        process.exit(1);
    }
    
    const command = args[0];
    const commandArgs = args.slice(1);
    
    if (!commands[command]) {
        console.error(`mygit: '${command}' is not a mygit command.`);
        process.exit(1);
    }
    
    try {
        commands[command].execute(commandArgs);
    } catch (error) {
        console.error(`${error.message}`);
        process.exit(1);
    }
}

main();

Testing Your Implementation

Test hash-object

# Create a test file
echo "Hello, Git!" > test.txt

# Hash without storing
mygit hash-object test.txt
# Expected: a 40-character hex hash, identical to `git hash-object test.txt`

# Hash and store, capturing the hash for later use
HASH=$(mygit hash-object -w test.txt)

# Verify it was stored: the directory is the first 2 characters of the hash
ls ".git/objects/$(echo "$HASH" | cut -c1-2)/"
# Should show the remaining 38 characters of the hash

Test cat-file

# Store an object first and keep its hash
HASH=$(mygit hash-object -w test.txt)

# Read back the content
mygit cat-file -p "$HASH"
# Expected: Hello, Git!

# Check the type
mygit cat-file -t "$HASH"
# Expected: blob

# Check the size
mygit cat-file -s "$HASH"
# Expected: 12

Compare with Real Git

# Use real git
git hash-object test.txt
# Should match your implementation!
Debugging tip: If your hash does NOT match real Git’s, the most common causes are:
  1. Newline differences. echo "Hello, Git!" appends a newline on most shells, so the content is actually Hello, Git!\n (12 bytes, not 11). Make sure you’re hashing exactly the same bytes.
  2. Encoding issues. If your file contains non-ASCII characters, ensure both your implementation and Git are reading the raw bytes, not a re-encoded string.
  3. Wrong header size. The size in the header must be the byte length of the content only, not including the header itself.
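
The three causes above can be checked quickly in Node. This is a small sketch with made-up sample strings, not part of the chapter's code:

```javascript
// Character count and byte count diverge for non-ASCII text.
const text = 'h\u00e9llo'; // "héllo"
console.log(text.length);             // 5 characters
console.log(Buffer.byteLength(text)); // 6 bytes in UTF-8

// echo adds a trailing newline; printf does not.
const withNewline = Buffer.from('Hello, Git!\n');
const withoutNewline = Buffer.from('Hello, Git!');
console.log(withNewline.length, withoutNewline.length); // 12 11

// The header carries the byte length of the content only.
const header = `blob ${withNewline.length}\0`;
console.log(JSON.stringify(header)); // "blob 12\u0000"
```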

Understanding Blob Storage

Let’s trace through what happens when you store “Hello”:
// Input
content = "Hello"

// Step 1: Create header
header = "blob 5\0"  // type + space + size + null byte

// Step 2: Concatenate
fullObject = "blob 5\0Hello"

// Step 3: SHA-1 hash
hash = sha1("blob 5\0Hello") = "5ab2f8a4323abafb10abb68657d46c..."

// Step 4: Compress with zlib
compressed = zlib.deflate("blob 5\0Hello")

// Step 5: Store at path
path = ".git/objects/5a/b2f8a4323abafb10abb68657d46c..."
Including type and size in the hashed data:
  1. Prevents collisions: A blob “tree 100” won’t have the same hash as an actual tree
  2. Enables verification: We can check the stored size matches actual content
  3. Self-describing: The object header tells us what it is
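
For a byte-level view of the same trace, the sketch below prints the exact bytes that get hashed and compares a blob containing the text "tree 100" with a tree-framed object. The 100 zero bytes used as the tree body are a placeholder chosen for illustration and do not form a valid tree.

```javascript
const crypto = require('crypto');

const sha1 = (buf) => crypto.createHash('sha1').update(buf).digest('hex');

// Steps 1 and 2 from the trace: header plus content, as raw bytes.
const fullObject = Buffer.concat([Buffer.from('blob 5\0'), Buffer.from('Hello')]);
console.log(fullObject.toString('hex'));
// 626c6f6220350048656c6c6f

// A blob whose content is the text "tree 100" ...
const blobBytes = Buffer.from('blob 8\0tree 100');
// ... versus an object framed as a tree with a 100-byte placeholder body.
const treeBytes = Buffer.concat([Buffer.from('tree 100\0'), Buffer.alloc(100)]);

console.log(sha1(blobBytes) === sha1(treeBytes)); // false
```

In the hex dump, 00 is the null separator between the header (blob 5) and the content (Hello).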

Exercises

Implement reading from stdin for hash-object:
echo "Hello" | mygit hash-object --stdin
Hint: Use fs.readFileSync(0) to read from stdin (file descriptor 0). Omitting the encoding returns a Buffer, so the raw bytes are hashed unchanged.
Create a utility to build tree objects:
// A tree entry is: "{mode} {name}\0{20-byte-binary-sha}"
// Modes: 100644 (regular file), 100755 (executable), 40000 (directory; printed as 040000)

function createTree(entries) {
    // entries: [{mode, name, hash}, ...]
    // Return the hash of the tree object
}
Enhance cat-file -p to nicely format tree objects:
100644 blob a1b2c3d4... file.txt
040000 tree e5f6a7b8... src
Hint: Parse the binary tree format and detect type from mode.

Key Concepts Review

Content-Addressable

Objects are stored by their SHA-1 hash. Same content = same hash = same storage location.

Immutable Objects

Once written, objects never change. Changing content would change the hash.

Compression

All objects are zlib-compressed. Git is surprisingly space-efficient.

Object Format

Every object: {type} {size}\0{content} - simple and consistent.

Further Reading

DSA: Hash Maps

Understand the data structure behind content-addressable storage

Cryptography Basics

Learn more about SHA-1 and cryptographic hashing

What’s Next?

In Chapter 3: Staging & Index, we’ll implement:
  • The index (staging area) file format
  • The add command
  • The status command

Interview Deep-Dive

Strong Answer:
  • When you store the string “Hello” as a blob, Git constructs the full object as: the header blob 5\0 (type, space, size in bytes, null byte) concatenated with the content Hello. The complete byte sequence is blob 5\0Hello.
  • Git computes SHA-1 over this entire byte sequence to get the 40-character hex hash. This hash becomes the object’s address.
  • The full object is then zlib-deflated (compressed) and written to disk at .git/objects/<first-2-hex-chars>/<remaining-38-hex-chars>. The file on disk contains only the compressed bytes — no additional metadata.
  • To read the object back, Git reads the file, zlib-inflates it, parses the header to extract the type and size, verifies the size matches the actual content length (corruption check), and returns the content.
  • The key design insight is that the hash covers the header too, not just the content. This means a blob containing the text “tree 100” will never collide with an actual tree object of size 100, because the headers differ (blob 8\0tree 100 vs. tree 100\0...). The header provides type safety within the content-addressable store.
Follow-up: Why does Git use zlib compression, and what is the typical compression ratio?
Zlib (deflate algorithm) provides a good balance between compression ratio and speed. Source code is highly compressible text, and zlib often achieves roughly 60-70% reduction on it. For a 10KB source file, the on-disk object is then ~3-4KB. This is one reason .git directories are surprisingly small despite storing every version of every file. The compression also reduces I/O bandwidth when reading objects, which matters for large repositories. The trade-off is CPU cost for compression/decompression, but for the typical object sizes in Git (kilobytes to low megabytes), this is negligible. For very large binary files, the compression ratio drops and the CPU cost becomes noticeable, which is one reason large binaries are better handled by Git LFS.
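
The ratio depends heavily on the input. As a rough demonstration, the sketch below deflates a deliberately repetitive, made-up snippet and an equal amount of random bytes (standing in for already-compressed binary data). The repetitive text compresses far better than real source code would, so treat the printed percentages as the two extremes rather than as typical figures.

```javascript
const zlib = require('zlib');
const crypto = require('crypto');

// Repetitive, source-like text (made up for this demonstration).
const sourceLike = Buffer.from(
    'function add(a, b) {\n    return a + b;\n}\n'.repeat(200)
);
// Random bytes stand in for already-compressed binary data.
const randomBytes = crypto.randomBytes(sourceLike.length);

for (const [label, input] of [['text', sourceLike], ['random', randomBytes]]) {
    const deflated = zlib.deflateSync(input);
    const saved = 100 * (1 - deflated.length / input.length);
    console.log(
        `${label}: ${input.length} -> ${deflated.length} bytes (${saved.toFixed(1)}% saved)`
    );
}
```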
Strong Answer:
  • Blobs store raw file content with no metadata — no filename, no permissions, no directory structure. This enables deduplication: if two files in different directories have the same content, they share one blob.
  • Trees store directory structure: each entry has a mode (file type and permissions), a name, and a hash pointing to a blob or sub-tree. Trees answer “what files exist in this directory and what are their names?” Separating this from content means renaming a file creates a new tree but reuses the existing blob.
  • Commits store snapshot metadata: a pointer to the root tree (the complete project state), parent commit(s), author, committer, timestamp, and message. Commits answer “who changed what, when, and why?”
  • The three types correspond to three distinct concerns: content (blob), structure (tree), and history (commit). Keeping them separate enables Git’s most powerful optimizations. Two commits that share a subtree (because a directory was unchanged) literally share the same tree object — Git does not store it twice. This cascades down: shared trees point to shared blobs. A 10,000-file repository where one file changes between commits creates approximately one new blob, a chain of new tree objects from the changed file to the root (maybe 3-5 objects), and one new commit. Everything else is shared.
Follow-up: How does Git handle file renames? There is no “rename” object type.
Git does not track renames explicitly. When you rename a file, Git creates a new tree where the entry for the old name is gone and an entry for the new name points to the same blob hash. Commands such as git log --follow detect renames heuristically when comparing commits: if a blob hash disappears from one path and appears at another, Git infers an exact rename; if the file was also edited, Git compares the contents and reports a rename when they are at least 50% similar (the default threshold). This design choice is intentional: Linus Torvalds argued that tracking renames explicitly creates complexity and edge cases (what if you rename and modify simultaneously?), while heuristic detection handles the common case well and degrades gracefully for ambiguous cases. The practical consequence is that git log <file> stops at renames unless you add --follow, which is a common source of confusion.
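
The exact-match case can be sketched in a few lines. This is a simplified illustration, not Git's actual algorithm: it handles only renames where the blob hash is unchanged, ignores similarity scoring, and uses hypothetical in-memory trees with shortened, made-up hashes.

```javascript
// Hypothetical in-memory trees: file name -> blob hash (hashes shortened).
const before = new Map([
    ['README.md', 'aaa111'],
    ['old-name.js', 'bbb222']
]);
const after = new Map([
    ['README.md', 'aaa111'],
    ['new-name.js', 'bbb222']
]);

// Pair each removed path with an added path that has the same blob hash.
function detectExactRenames(oldTree, newTree) {
    const added = new Map(); // blob hash -> newly added path
    for (const [name, hash] of newTree) {
        if (!oldTree.has(name)) added.set(hash, name);
    }
    const renames = [];
    for (const [name, hash] of oldTree) {
        if (!newTree.has(name) && added.has(hash)) {
            renames.push({ from: name, to: added.get(hash) });
        }
    }
    return renames;
}

console.log(detectExactRenames(before, after));
// [ { from: 'old-name.js', to: 'new-name.js' } ]
```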
Strong Answer:
  • This is the avalanche property of cryptographic hash functions: even a tiny change in input produces a dramatically different output. Combined with content addressing, this gives Git three guarantees.
  • First, integrity: if an object’s content is corrupted (bit flip on disk, bad network transfer), the hash will not match and Git will detect it immediately. git fsck verifies every object in the repository by re-hashing and comparing. This catches silent data corruption that would go unnoticed in traditional filesystems.
  • Second, deduplication: identical content always produces the same hash, so Git stores it exactly once. Across a repository’s entire history, every unique version of every file exists as exactly one blob. This is why Git is space-efficient despite storing full snapshots, not diffs.
  • Third, immutability: you cannot change an object’s content without changing its address. This means any commit hash you record (in a branch, a tag, or a deployment log) permanently identifies exactly that state of the code. No one can retroactively alter history without changing all subsequent hashes, which is immediately detectable.
  • This trio of properties — integrity, deduplication, and immutability — is why content-addressable storage is used not just in Git but in systems like Docker image layers, IPFS, and blockchain. The principle is universal.
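
The integrity guarantee can be sketched as a re-hash on read. The helper name verifyObject is an assumption for this example, and the corruption is simulated by altering one content byte before compression; this is an illustration of the idea, not how git fsck is implemented.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

// Hypothetical helper: re-hash the inflated bytes and compare the result
// with the name the object was stored under.
function verifyObject(expectedHash, compressed) {
    const data = zlib.inflateSync(compressed);
    const actual = crypto.createHash('sha1').update(data).digest('hex');
    return actual === expectedHash;
}

const original = Buffer.from('blob 13\0test content\n');
const hash = crypto.createHash('sha1').update(original).digest('hex');

// Simulate corruption by flipping one bit in a copy of the content.
const corrupted = Buffer.from(original);
corrupted[corrupted.length - 2] ^= 0x01;

console.log(verifyObject(hash, zlib.deflateSync(original)));  // true
console.log(verifyObject(hash, zlib.deflateSync(corrupted))); // false
```

Note that the size check in readObject would not catch this case, because the corrupted content has the same length; only re-hashing does.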
Follow-up: Can two different files theoretically produce the same SHA-1 hash in Git? What would happen?
Theoretically yes: SHA-1 collisions exist (demonstrated by the SHAttered attack from CWI Amsterdam and Google in 2017). If two different blobs produced the same hash, Git would store the first one and silently return it when either blob was requested. The second blob’s content would be lost. In practice, the probability of an accidental collision is astronomically small (on the order of 1 in 2^160 for any given pair of objects). Targeted attacks are possible but extremely expensive computationally. Git mitigated the known SHAttered-style attack with a detection mechanism (hardened SHA-1) and is transitioning to SHA-256 for new repositories. For the vast majority of repositories, SHA-1 collision is not a practical concern, but it is a theoretically important limitation of the design.