Build Your Own Git

Target Audience: Students and Junior Engineers
Language: JavaScript (with Java & Go alternatives)
Duration: 2-3 weeks
Difficulty: ⭐⭐⭐☆☆

Why Build Git?

Git is the most ubiquitous tool in software development, yet most developers treat it as magic. They memorize commands like incantations without understanding the machinery underneath. By building your own Git, you’ll:
  • Understand content-addressable storage — the same idea that powers IPFS, Nix, and blockchain Merkle trees
  • Master hashing and cryptography basics — SHA-1 in practice, and why “same content = same hash” is so powerful
  • Learn tree data structures — directories as trees, commits as a DAG (Directed Acyclic Graph)
  • Build a real CLI tool — practical software engineering, argument parsing, and error handling
  • Debug Git problems from first principles — when git rebase goes wrong, you’ll understand why at the object level
This is NOT a tutorial on using Git. This is about understanding Git’s internals deeply enough to reimplement them. You should already be comfortable with basic Git usage (commit, branch, merge) before starting.

Git’s Beautiful Architecture

Git Object Model
┌─────────────────────────────────────────────────────────────────────────────┐
│                           GIT OBJECT MODEL                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   BLOB                    TREE                      COMMIT                   │
│   ─────                   ─────                     ──────                   │
│   File contents           Directory listing         Snapshot + metadata      │
│   SHA-1 of content        Points to blobs/trees     Points to tree + parent  │
│                                                                              │
│   ┌───────────┐          ┌───────────────────┐     ┌───────────────────┐    │
│   │ Hello     │          │ 100644 hello.txt  │     │ tree abc123       │    │
│   │ World     │          │ 040000 src/       │     │ parent def456     │    │
│   └───────────┘          └───────────────────┘     │ author John       │    │
│        │                        │                   │ message: Initial  │    │
│        └────────────────────────┴───────────────────┘                        │
│                                                                              │
│   ALL OBJECTS ARE CONTENT-ADDRESSED:                                        │
│   SHA-1(type + size + content) → 40-character hex string                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
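
The formula at the bottom of the diagram can be checked in a few lines of Node.js. This is a minimal sketch: the helper name `objectId` is hypothetical, and the exact framing (a space between type and size, then a NUL byte before the content) is the one used by the core implementation later on this page. The two ids in the comments are the ones real Git reports for the same content (for example via `git hash-object --stdin`).

```javascript
const crypto = require('crypto');

// Hypothetical helper: compute the object id for a given type and content.
function objectId(type, content) {
    const body = Buffer.isBuffer(content) ? content : Buffer.from(content, 'utf8');
    // The size in the header is the byte length of the content, in decimal.
    const header = Buffer.from(`${type} ${body.length}\0`);
    return crypto.createHash('sha1').update(Buffer.concat([header, body])).digest('hex');
}

console.log(objectId('blob', ''));              // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
console.log(objectId('blob', 'hello world\n')); // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the type is part of the hashed bytes, a blob and a tree with identical content would still get different ids.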

What You’ll Build

Core Commands

Command     | Description              | Concepts Learned
init        | Initialize repository    | File structure, .git directory
hash-object | Hash and store a file    | SHA-1, zlib compression
cat-file    | Read object contents     | Object types, decompression
add         | Stage files              | Index file format
commit      | Create commit object     | Tree building, commit linking
log         | Show commit history      | DAG traversal
status      | Show working tree status | Diff against index
branch      | Manage branches          | Refs, symbolic refs
checkout    | Switch branches          | Updating HEAD, working tree
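
As a starting point for the first command in the table, here is a minimal sketch of what init might do. The function name `initRepository` and the `main` default branch are assumptions for illustration; real Git also creates further files (config, description, hooks) that this sketch leaves out.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: create the minimal layout the other commands rely on.
function initRepository(workDir, defaultBranch = 'main') {
    const gitDir = path.join(workDir, '.git');
    for (const dir of ['objects', path.join('refs', 'heads'), path.join('refs', 'tags')]) {
        fs.mkdirSync(path.join(gitDir, dir), { recursive: true });
    }
    // HEAD starts as a symbolic ref to a branch that has no commits yet.
    fs.writeFileSync(path.join(gitDir, 'HEAD'), `ref: refs/heads/${defaultBranch}\n`);
    return gitDir;
}

// Usage: initialize a throwaway repository in a temporary directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-init-'));
const gitDir = initRepository(demoDir);
console.log(fs.readdirSync(gitDir).sort()); // [ 'HEAD', 'objects', 'refs' ]
```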

Implementation: JavaScript

Project Structure

mygit/
├── src/
│   ├── commands/
│   │   ├── init.js
│   │   ├── hashObject.js
│   │   ├── catFile.js
│   │   ├── add.js
│   │   ├── commit.js
│   │   ├── log.js
│   │   ├── status.js
│   │   ├── branch.js
│   │   └── checkout.js
│   ├── objects/
│   │   ├── blob.js
│   │   ├── tree.js
│   │   └── commit.js
│   ├── utils/
│   │   ├── hash.js
│   │   ├── compression.js
│   │   ├── index.js
│   │   └── refs.js
│   └── mygit.js
├── package.json
└── README.md

Core Implementation

const crypto = require('crypto');
const zlib = require('zlib');
const fs = require('fs');
const path = require('path');

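/**
 * Compute SHA-1 hash of content with Git's format
 * Git format: "{type} {size}\0{content}", where size is the content length in bytes
 */
function hashObject(content, type = 'blob') {
    // Normalize to a Buffer first: string.length counts UTF-16 code units,
    // not bytes, so it gives the wrong size for non-ASCII content.
    const body = Buffer.isBuffer(content) ? content : Buffer.from(content);
    const header = `${type} ${body.length}\0`;
    const store = Buffer.concat([Buffer.from(header), body]);
    const hash = crypto.createHash('sha1').update(store).digest('hex');
    return { hash, store };
}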

/**
 * Write object to .git/objects/{hash[0:2]}/{hash[2:40]}
 */
function writeObject(gitDir, hash, store) {
    const objectDir = path.join(gitDir, 'objects', hash.slice(0, 2));
    const objectPath = path.join(objectDir, hash.slice(2));
    
    if (!fs.existsSync(objectDir)) {
        fs.mkdirSync(objectDir, { recursive: true });
    }
    
    // Git stores objects compressed with zlib
    const compressed = zlib.deflateSync(store);
    fs.writeFileSync(objectPath, compressed);
    
    return hash;
}

/**
 * Read object from .git/objects
 */
function readObject(gitDir, hash) {
    const objectPath = path.join(
        gitDir, 'objects', 
        hash.slice(0, 2), 
        hash.slice(2)
    );
    
    if (!fs.existsSync(objectPath)) {
        throw new Error(`Object ${hash} not found`);
    }
    
    const compressed = fs.readFileSync(objectPath);
    const store = zlib.inflateSync(compressed);
    
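    // Parse header: "{type} {size}\0{content}"
    const nullIndex = store.indexOf(0);
    if (nullIndex === -1) {
        throw new Error(`Object ${hash} has a malformed header`);
    }
    // subarray() returns views without copying (Buffer's slice() is deprecated)
    const header = store.subarray(0, nullIndex).toString();
    const [type, size] = header.split(' ');
    const content = store.subarray(nullIndex + 1);
    
    return { type, size: parseInt(size, 10), content };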
}

module.exports = { hashObject, writeObject, readObject };
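
One detail of the header is easy to get wrong in JavaScript: the size field counts bytes, while a string's length counts UTF-16 code units. The sketch below shows the difference on an arbitrary sample string; the helper `idWithSize` is hypothetical and exists only to make the comparison visible.

```javascript
const crypto = require('crypto');

// Hypothetical helper: hash a blob using an explicitly supplied size field.
function idWithSize(text, size) {
    const store = Buffer.concat([Buffer.from(`blob ${size}\0`), Buffer.from(text, 'utf8')]);
    return crypto.createHash('sha1').update(store).digest('hex');
}

const text = 'na\u00efve caf\u00e9\n';         // contains two non-ASCII characters
const chars = text.length;                      // 11 UTF-16 code units
const bytes = Buffer.byteLength(text, 'utf8');  // 13 bytes once encoded

console.log(chars, bytes);                      // 11 13
// The two size fields produce different ids; only the byte count follows Git's format.
console.log(idWithSize(text, chars) === idWithSize(text, bytes)); // false
```

For pure ASCII content the two counts agree, which is why this kind of mistake tends to surface only once a file with accented or non-Latin text is hashed.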

Exercises

Level 1: Basic Understanding

  1. Initialize a repository and create a blob manually
  2. Understand how SHA-1 hashing creates content-addressable storage
  3. Create a commit and inspect its structure with cat-file

Level 2: Core Implementation

  1. Implement the status command (compare index to working tree)
  2. Add support for .gitignore patterns
  3. Implement diff to show changes between commits

Level 3: Advanced Features

  1. Implement merge (fast-forward and three-way)
  2. Add remote repository support (fetch, push)
  3. Implement pack files for efficient storage

What You’ve Learned

  • Content-addressable storage using SHA-1 hashing
  • Tree data structures for representing directories
  • DAG (Directed Acyclic Graph) for commit history
  • Binary file formats (index file)
  • CLI tool development in JavaScript

Next Steps

Go Implementation

See how the same concepts translate to Go

Java Implementation

Enterprise-grade implementation with strong typing

Build Redis

Ready for the next challenge? Build a Redis clone

Interview Deep-Dive

Strong Answer:
  • In a traditional filesystem, you choose a filename and the OS stores the content at that location. The name is arbitrary and independent of the content. In Git’s object store, the “name” (address) is derived from the content itself via SHA-1 hashing. The file is stored at .git/objects/<first-2-chars>/<remaining-38-chars> of the hash.
  • This gives Git three properties for free: deduplication (identical content always produces the same hash, so it is stored once regardless of how many files or commits reference it), integrity verification (if a single bit flips on disk, the hash will not match and Git will detect corruption), and immutability (you cannot change an object’s content without changing its address, which means any object you have retrieved is guaranteed to be the same object that was originally stored).
  • This design is the same principle behind IPFS, Nix, and blockchain Merkle trees. Content-addressable storage is one of the most powerful ideas in computer science, and Git is among the most widely deployed examples of it.
  • The practical implication for developers: Git repositories are much smaller than you would expect because of deduplication. Renaming a file costs almost nothing (the blob is the same, only the tree changes). And git fsck can verify the entire history’s integrity by re-hashing every object and comparing it to its stored address.
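
Deduplication and integrity checking both fall out of the addressing scheme, which a small in-memory sketch can show. The `ObjectStore` class is hypothetical and stores uncompressed bytes in a Map rather than files under .git/objects; its `verify` method mirrors the idea behind git fsck, not its actual implementation.

```javascript
const crypto = require('crypto');

// Hypothetical in-memory object store keyed by content hash.
class ObjectStore {
    constructor() {
        this.objects = new Map();
    }

    put(content) {
        const body = Buffer.from(content);
        const store = Buffer.concat([Buffer.from(`blob ${body.length}\0`), body]);
        const id = crypto.createHash('sha1').update(store).digest('hex');
        // Identical content maps to the same key, so it is stored only once.
        if (!this.objects.has(id)) this.objects.set(id, store);
        return id;
    }

    // Re-hash the stored bytes and compare the result with the address.
    verify(id) {
        const store = this.objects.get(id);
        if (!store) return false;
        return crypto.createHash('sha1').update(store).digest('hex') === id;
    }
}

const db = new ObjectStore();
const a = db.put('same bytes\n');
const b = db.put('same bytes\n');      // e.g. a copied or renamed file
console.log(a === b, db.objects.size); // true 1

console.log(db.verify(a));             // true
const stored = db.objects.get(a);
stored[stored.length - 1] ^= 0x01;     // simulate a single flipped bit on disk
console.log(db.verify(a));             // false
```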
Follow-up: If SHA-1 has known collision vulnerabilities, why does Git still use it, and what is being done about it?
SHA-1 collision attacks (like Google’s SHAttered in 2017) require computing two distinct inputs that produce the same hash. Git responded by adopting a collision-detecting SHA-1 implementation (a “hardened SHA-1”) that rejects inputs matching known collision patterns. More fundamentally, Git is transitioning to SHA-256 via the extensions.objectFormat config, which has been supported, initially as an experimental feature, since Git 2.29. A repository uses a single object format, chosen when it is created, and interoperability between SHA-1 and SHA-256 repositories has been limited, so check the documentation for your Git version before relying on conversion. In practice, SHA-1 collisions in Git require a targeted attack against a specific repository, which is operationally much harder than the academic attack. But the industry consensus is that SHA-256 is the correct long-term direction, and Git’s content-addressable design makes the hash algorithm swappable because the architecture depends on the property of content addressing, not on SHA-1 specifically.
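
To illustrate why the hash function is swappable, the sketch below makes the algorithm a parameter while keeping the header framing unchanged. The helper name `objectId` is hypothetical, and the assumption that a SHA-256 blob id is the SHA-256 of the same header plus content should be checked against Git's hash function transition documentation before you rely on it.

```javascript
const crypto = require('crypto');

// Hypothetical helper: the algorithm is a parameter, the framing stays the same.
function objectId(content, algorithm = 'sha1', type = 'blob') {
    const body = Buffer.from(content);
    const header = Buffer.from(`${type} ${body.length}\0`);
    return crypto.createHash(algorithm).update(header).update(body).digest('hex');
}

console.log(objectId('hello world\n', 'sha1').length);   // 40
console.log(objectId('hello world\n', 'sha256').length); // 64
```

Anything in your implementation that hard-codes 40 hex characters or 20 raw bytes (object paths, tree entries, ref files) would need to follow the chosen algorithm as well.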
Strong Answer:
  • A blob stores raw file content with no metadata (no filename, no permissions). It is just the bytes of the file, prefixed with a header (blob <size>\0), then SHA-1 hashed. Two files with identical content across the entire repository (or across different commits) share the same blob object.
  • A tree represents a directory. It contains entries, each with a mode (permissions), a name (filename), and a pointer (SHA-1 hash) to either a blob (file) or another tree (subdirectory). Trees are recursive — a root tree can point to sub-trees, which point to sub-sub-trees, mirroring the directory hierarchy.
  • A commit is a metadata object that points to a root tree (the complete directory snapshot at that point in time), zero or more parent commits (previous commits in history), an author, a committer, a timestamp, and a message. The chain of parent pointers forms the commit DAG (directed acyclic graph) — the history.
  • The elegance is that commits are snapshots, not diffs. Each commit has a complete tree that represents the entire project state. Git does not store what changed between commits; it stores the full state. Diffs are computed on-the-fly by comparing two trees. This makes checkout extremely fast (just materialize one tree) and makes operations like blame, log, and bisect possible without reconstructing incremental changes.
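
The three object types described above can be sketched as serialization functions. This is a simplified illustration, not Git's implementation: the function names and the author identity are hypothetical, and entries are sorted by plain name (real Git compares directory names as if they ended in a slash). In the raw tree format a subdirectory entry uses mode 40000, without the leading zero that git ls-tree prints.

```javascript
const crypto = require('crypto');

function hashObject(type, body) {
    const store = Buffer.concat([Buffer.from(`${type} ${body.length}\0`), body]);
    return crypto.createHash('sha1').update(store).digest('hex');
}

// Tree entry: "<mode> <name>\0" followed by the 20 raw bytes of the target's hash.
function serializeTree(entries) {
    const sorted = [...entries].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
    return Buffer.concat(sorted.map(({ mode, name, hash }) =>
        Buffer.concat([Buffer.from(`${mode} ${name}\0`), Buffer.from(hash, 'hex')])));
}

// Commit: text headers, a blank line, then the message.
function serializeCommit({ tree, parents = [], author, committer = author, message }) {
    const lines = [
        `tree ${tree}`,
        ...parents.map((parent) => `parent ${parent}`),
        `author ${author}`,
        `committer ${committer}`,
        '',
        message,
    ];
    return Buffer.from(lines.join('\n') + '\n');
}

const blob = hashObject('blob', Buffer.from('hello world\n'));
const treeBody = serializeTree([{ mode: '100644', name: 'hello.txt', hash: blob }]);
const tree = hashObject('tree', treeBody);
const who = 'Jane Doe <jane@example.com> 1700000000 +0000'; // hypothetical identity
const commitBody = serializeCommit({ tree, author: who, message: 'Initial' });
const commit = hashObject('commit', commitBody);

console.log({ blob, tree, commit });
console.log(hashObject('tree', serializeTree([]))); // 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (empty tree)
```

Note that the filename lives in the tree entry, not in the blob, which is why renaming a file leaves the blob untouched.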
Follow-up: If every commit stores a full snapshot, why are Git repositories not enormous?
Two reasons: deduplication and pack files. Deduplication at the blob level means unchanged files across commits point to the same blob, so nothing is stored twice. Deduplication at the tree level means unchanged directories share tree objects. For a commit that changes one file in a 10,000-file project, only one new blob and a chain of new tree objects (from the changed file up to the root) are created. The other 9,999 blobs and the trees for untouched directories are shared. On top of this, git gc packs loose objects into .pack files using delta compression: similar objects are stored as a base object plus a binary delta. This is why a Git repository with years of history can be comparable in size to, and sometimes smaller than, a single checkout of the working directory.
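
The sharing described in this answer can be counted directly. The sketch below writes two snapshots of a small hypothetical project into an in-memory Map and reports how many objects the second snapshot adds. The `put` and `writeTree` helpers are illustrative assumptions, entries are sorted by plain name, and no compression or pack files are involved.

```javascript
const crypto = require('crypto');

// Hypothetical store: id -> serialized object. A Map makes sharing visible.
const store = new Map();

function put(type, body) {
    const full = Buffer.concat([Buffer.from(`${type} ${body.length}\0`), body]);
    const id = crypto.createHash('sha1').update(full).digest('hex');
    store.set(id, full); // writing the same id again is a no-op in effect
    return id;
}

// Write a nested { name: string | object } structure as blobs and trees.
function writeTree(dir) {
    const entries = Object.keys(dir).sort().map((name) => {
        const value = dir[name];
        const isDir = typeof value === 'object';
        const id = isDir ? writeTree(value) : put('blob', Buffer.from(value));
        const mode = isDir ? '40000' : '100644';
        return Buffer.concat([Buffer.from(`${mode} ${name}\0`), Buffer.from(id, 'hex')]);
    });
    return put('tree', Buffer.concat(entries));
}

const v1 = {
    'README.md': 'docs\n',
    lib: { 'util.js': 'util\n' },
    src: { 'a.js': 'a1\n', 'b.js': 'b\n' },
};
const root1 = writeTree(v1);
const before = store.size; // 4 blobs + 3 trees

// Second snapshot: only src/a.js changes.
const v2 = { ...v1, src: { ...v1.src, 'a.js': 'a2\n' } };
const root2 = writeTree(v2);

console.log(before, store.size - before); // 7 3
```

The three new objects are the changed blob, the new src tree, and the new root tree; README.md, the lib tree, and everything under it are reused.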
Strong Answer:
  • A branch in Git is a 41-byte text file stored at .git/refs/heads/<branchname> containing a 40-character commit hash and a newline (in a SHA-1 repository; Git may later consolidate refs into .git/packed-refs). Creating a branch literally writes 41 bytes to disk. There is no copying of code, no duplication of files, no additional storage proportional to repository size.
  • When you run git branch feature, Git creates the file .git/refs/heads/feature containing the same commit hash that HEAD currently points to. Both branches now point to the same commit object, which points to the same tree, which points to the same blobs. Nothing is duplicated.
  • As you make commits on the feature branch, the branch pointer advances to new commits. The old commits are still shared with main. Only new blobs (for changed files), new trees (for changed directories), and new commit objects are created. The cost is proportional to what changed, not to the size of the repository.
  • This is fundamentally different from systems like Subversion, where a branch was a physical copy of the directory tree (even if it was a “cheap copy” using copy-on-write at the filesystem level, it was still conceptually a copy). Git’s model makes branching a constant-time, constant-space operation, which is why Git workflows can use dozens or hundreds of short-lived branches without any performance impact.
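
Creating a branch, as described above, amounts to writing one small file. A minimal sketch, assuming loose refs only: the `createBranch` helper is hypothetical, the hash is a placeholder rather than a real commit, and packed refs and branch-name validation are ignored.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: a loose branch ref is a file holding a commit hash and a newline.
function createBranch(gitDir, name, commitHash) {
    const refPath = path.join(gitDir, 'refs', 'heads', name);
    fs.mkdirSync(path.dirname(refPath), { recursive: true });
    fs.writeFileSync(refPath, `${commitHash}\n`);
    return refPath;
}

const gitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-branch-'));
const commit = 'a'.repeat(40); // placeholder hash, not a real commit
const mainRef = createBranch(gitDir, 'main', commit);
const featureRef = createBranch(gitDir, 'feature', commit);

console.log(fs.statSync(featureRef).size); // 41
console.log(fs.readFileSync(mainRef, 'utf8') === fs.readFileSync(featureRef, 'utf8')); // true
```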
Follow-up: What is HEAD, and why is the distinction between HEAD pointing to a branch vs. pointing to a commit important?
HEAD is a reference stored in .git/HEAD. Normally it is a symbolic reference containing ref: refs/heads/main, meaning “I am on branch main.” When you commit, Git reads HEAD, follows the ref to the branch file, reads the current commit hash, creates the new commit with that as the parent, and updates the branch file with the new commit hash. HEAD itself does not change. When HEAD contains a raw commit hash (detached HEAD), commits still work, but no branch pointer advances. If you switch away without creating a branch, those commits become “orphaned”: they are reachable only through the reflog and will eventually be garbage collected. This is why git checkout <commit> prints a warning: it is not dangerous, but it creates a workflow where work can be silently lost.
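
The two HEAD states can be sketched as a pair of helpers. Both function names are hypothetical, the hashes are placeholders, and the sketch reads loose ref files only (no packed-refs, no reflog).

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Resolve HEAD to { ref, hash }; ref is null when HEAD is detached.
function resolveHead(gitDir) {
    const head = fs.readFileSync(path.join(gitDir, 'HEAD'), 'utf8').trim();
    if (!head.startsWith('ref: ')) return { ref: null, hash: head };
    const ref = head.slice('ref: '.length);
    const refPath = path.join(gitDir, ref);
    // A branch with no commits yet has no ref file.
    const hash = fs.existsSync(refPath) ? fs.readFileSync(refPath, 'utf8').trim() : null;
    return { ref, hash };
}

// After creating a commit: move the branch if attached, otherwise rewrite HEAD itself.
function advanceHead(gitDir, newHash) {
    const { ref } = resolveHead(gitDir);
    const target = ref ? path.join(gitDir, ref) : path.join(gitDir, 'HEAD');
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.writeFileSync(target, `${newHash}\n`);
}

const gitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-head-'));
fs.writeFileSync(path.join(gitDir, 'HEAD'), 'ref: refs/heads/main\n');
advanceHead(gitDir, '1'.repeat(40));  // attached: the branch file moves
console.log(resolveHead(gitDir).ref); // refs/heads/main

fs.writeFileSync(path.join(gitDir, 'HEAD'), `${'1'.repeat(40)}\n`); // detach HEAD
advanceHead(gitDir, '2'.repeat(40));  // detached: only HEAD moves
console.log(resolveHead(gitDir).ref); // null
```

After the second call, refs/heads/main still holds the first hash, which is the situation the detached HEAD warning is about.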