Build Your Own Git
Target Audience: Students and Junior Engineers
Language: JavaScript (with Java & Go alternatives)
Duration: 2-3 weeks
Difficulty: ⭐⭐⭐☆☆
Why Build Git?
Git is the most ubiquitous tool in software development, yet most developers treat it as magic. They memorize commands like incantations without understanding the machinery underneath. By building your own Git, you’ll:
- Understand content-addressable storage: the same idea that powers IPFS, Nix, and blockchain Merkle trees
- Master hashing and cryptography basics: SHA-1 in practice, and why “same content = same hash” is so powerful
- Learn tree data structures: directories as trees, commits as a DAG (Directed Acyclic Graph)
- Build a real CLI tool: practical software engineering, argument parsing, and error handling
- Debug Git problems from first principles: when `git rebase` goes wrong, you’ll understand why at the object level
Git’s Beautiful Architecture
What You’ll Build
Core Commands
| Command | Description | Concepts Learned |
|---|---|---|
| `init` | Initialize repository | File structure, `.git` directory |
| `hash-object` | Hash and store a file | SHA-1, zlib compression |
| `cat-file` | Read object contents | Object types, decompression |
| `add` | Stage files | Index file format |
| `commit` | Create commit object | Tree building, commit linking |
| `log` | Show commit history | DAG traversal |
| `status` | Show working tree status | Diff against index |
| `branch` | Manage branches | Refs, symbolic refs |
| `checkout` | Switch branches | Updating HEAD, working tree |
Implementation: JavaScript
Project Structure
Core Implementation
Exercises
Level 1: Basic Understanding
- Initialize a repository and create a blob manually
- Understand how SHA-1 hashing creates content-addressable storage
- Create a commit and inspect its structure with `cat-file`
Level 2: Core Implementation
- Implement the `status` command (compare index to working tree)
- Add support for `.gitignore` patterns
- Implement `diff` to show changes between commits
Level 3: Advanced Features
- Implement merge (fast-forward and three-way)
- Add remote repository support (fetch, push)
- Implement pack files for efficient storage
What You’ve Learned
Content-addressable storage using SHA-1 hashing
Tree data structures for representing directories
DAG (Directed Acyclic Graph) for commit history
Binary file formats (index file)
CLI tool development in JavaScript
Next Steps
Go Implementation
See how the same concepts translate to Go
Java Implementation
Enterprise-grade implementation with strong typing
Build Redis
Ready for the next challenge? Build a Redis clone
Interview Deep-Dive
Git is often called a 'content-addressable filesystem.' What does that mean, and how does it differ from a traditional filesystem?
Strong Answer:
- In a traditional filesystem, you choose a filename and the OS stores the content at that location. The name is arbitrary and independent of the content. In Git’s object store, the “name” (address) is derived from the content itself via SHA-1 hashing. The object is stored at `.git/objects/<first-2-chars>/<remaining-38-chars>`, where the two parts are the first 2 and the remaining 38 hex characters of the hash.
- This gives Git three properties for free: deduplication (identical content always produces the same hash, so it is stored once regardless of how many files or commits reference it), integrity verification (if a single bit flips on disk, the hash will not match and Git will detect corruption), and immutability (you cannot change an object’s content without changing its address, which means any object you have retrieved is guaranteed to be the same object that was originally stored).
- This design is the same principle behind IPFS, Nix, and blockchain Merkle trees. Content-addressable storage is one of the most powerful ideas in computer science, and Git is arguably the most widely deployed example of it.
- The practical implication for developers: Git repositories are much smaller than you would expect because of deduplication. Renaming a file costs almost nothing (the blob is the same, only the tree changes). And `git fsck` can verify the entire history’s integrity by re-hashing every object and comparing it to its stored address.
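The addressing scheme described above can be sketched in a few lines of Node.js. This is a minimal illustration, not the project's reference implementation: the function names `encodeObject`, `objectId`, `objectPath`, and `verify` are hypothetical, and the sketch assumes the SHA-1 object format with the `<type> <size>\0` header and zlib-compressed storage.

```javascript
const crypto = require('node:crypto');
const zlib = require('node:zlib');

// Build the bytes Git hashes: "<type> <size>\0" followed by the raw content.
function encodeObject(type, content) {
  const body = Buffer.isBuffer(content) ? content : Buffer.from(content, 'utf8');
  const header = Buffer.from(`${type} ${body.length}\0`, 'utf8');
  return Buffer.concat([header, body]);
}

// The object ID is the SHA-1 of header + content, as 40 hex characters.
function objectId(type, content) {
  return crypto.createHash('sha1').update(encodeObject(type, content)).digest('hex');
}

// Loose objects live at objects/<first 2 chars>/<remaining 38 chars>.
function objectPath(id) {
  return `.git/objects/${id.slice(0, 2)}/${id.slice(2)}`;
}

// Integrity check in the spirit of fsck: decompress, re-hash, compare to the address.
function verify(id, compressed) {
  const raw = zlib.inflateSync(compressed);
  return crypto.createHash('sha1').update(raw).digest('hex') === id;
}

const id = objectId('blob', 'hello world\n');
const stored = zlib.deflateSync(encodeObject('blob', 'hello world\n'));
console.log(id);               // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
console.log(objectPath(id));   // .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
console.log(verify(id, stored)); // true
```

Because the ID depends only on the bytes, hashing the same content twice always yields the same path, which is where deduplication comes from.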
Git supports SHA-256 as an alternative hash through the `extensions.objectFormat` config, which has been supported since Git 2.29. The transition plan is designed with backward compatibility in mind, but interoperability between SHA-1 and SHA-256 repositories is still limited, so check the documentation for your Git version before converting a repository. In practice, SHA-1 collisions in Git require a targeted attack against a specific repository, which is operationally much harder than the academic attack. But the industry consensus is that SHA-256 is the correct long-term direction, and Git’s content-addressable design makes the hash algorithm swappable precisely because the architecture does not depend on SHA-1 specifically; it depends on the property of content addressing.
Explain Git's object model: blobs, trees, and commits. How do they compose to represent a repository's history?
Strong Answer:
- A blob stores raw file content with no metadata (no filename, no permissions). It is just the bytes of the file, prefixed with a header (`blob <size>\0`), then SHA-1 hashed. Two files with identical content across the entire repository (or across different commits) share the same blob object.
- A tree represents a directory. It contains entries, each with a mode (permissions), a name (filename), and a pointer (SHA-1 hash) to either a blob (file) or another tree (subdirectory). Trees are recursive: a root tree can point to sub-trees, which point to sub-sub-trees, mirroring the directory hierarchy.
- A commit is a metadata object that points to a root tree (the complete directory snapshot at that point in time), zero or more parent commits (previous commits in history), an author, a committer, a timestamp, and a message. The chain of parent pointers forms the commit DAG (directed acyclic graph) — the history.
- The elegance is that commits are snapshots, not diffs. Each commit has a complete tree that represents the entire project state. Git does not store what changed between commits; it stores the full state. Diffs are computed on-the-fly by comparing two trees. This makes checkout extremely fast (just materialize one tree) and makes operations like blame, log, and bisect possible without reconstructing incremental changes.
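How the three object types compose can be sketched as follows. This is a simplified illustration under stated assumptions: `hashObject`, `encodeTree`, and `encodeCommit` are hypothetical helper names, the author line and file name are made-up sample data, and the entry sort is a plain name comparison (real Git also treats directory names as if they ended in `/` when sorting).

```javascript
const crypto = require('node:crypto');

// Hash "<type> <size>\0" + body, the same scheme used for every object type.
function hashObject(type, body) {
  const header = Buffer.from(`${type} ${body.length}\0`);
  return crypto.createHash('sha1').update(Buffer.concat([header, body])).digest('hex');
}

// Each tree entry is "<mode> <name>\0" followed by the 20 raw bytes of the target's hash.
// Simplification: entries are sorted by plain name comparison (fine for ASCII file names).
function encodeTree(entries) {
  const sorted = [...entries].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  return Buffer.concat(sorted.map((e) => Buffer.concat([
    Buffer.from(`${e.mode} ${e.name}\0`),
    Buffer.from(e.id, 'hex'),
  ])));
}

// A commit is plain text: a tree pointer, zero or more parents, identities, then the message.
function encodeCommit({ tree, parents, author, committer, message }) {
  const lines = [`tree ${tree}`];
  for (const p of parents) lines.push(`parent ${p}`);
  lines.push(`author ${author}`, `committer ${committer}`, '', message);
  return Buffer.from(lines.join('\n') + '\n');
}

// Compose: blob -> tree -> commit. Every pointer is just the hash of the thing pointed to.
const blobId = hashObject('blob', Buffer.from('hello world\n'));
const treeBody = encodeTree([{ mode: '100644', name: 'hello.txt', id: blobId }]);
const treeId = hashObject('tree', treeBody);
const who = 'Ada Example <ada@example.com> 1700000000 +0000'; // sample identity
const commitBody = encodeCommit({
  tree: treeId, parents: [], author: who, committer: who, message: 'Initial commit',
});
const commitId = hashObject('commit', commitBody);
console.log({ blobId, treeId, commitId });
```

Changing one byte of `hello.txt` changes the blob ID, which changes the tree ID, which changes the commit ID. That chain is why a commit hash identifies the complete snapshot.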
`git gc` periodically packs loose objects into `.pack` files using delta compression: similar files are stored as a base object plus a binary diff. This is why a Git repo with years of history can be comparable in size to, or even smaller than, a single checkout of the working directory.
A junior developer says 'branches in Git are expensive because they copy the codebase.' Correct their misunderstanding and explain what a branch actually is.
Strong Answer:
- A branch in Git is a 41-byte text file stored at `.git/refs/heads/<branchname>` containing a 40-character commit hash and a newline. Creating a branch literally writes 41 bytes to disk. There is no copying of code, no duplication of files, no additional storage proportional to repository size.
- When you run `git branch feature`, Git creates the file `.git/refs/heads/feature` containing the same commit hash that HEAD currently points to. Both branches now point to the same commit object, which points to the same tree, which points to the same blobs. Nothing is duplicated.
- As you make commits on the feature branch, the branch pointer advances to new commits. The old commits are still shared with main. Only new blobs (for changed files), new trees (for changed directories), and new commit objects are created. The cost is proportional to what changed, not to the size of the repository.
- This is fundamentally different from systems like Subversion, where a branch was a physical copy of the directory tree (even if it was a “cheap copy” using copy-on-write at the filesystem level, it was still conceptually a copy). Git’s model makes branching a constant-time, constant-space operation, which is why Git workflows can use dozens or hundreds of short-lived branches without any performance impact.
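The "41 bytes" claim is easy to check with a small sketch. It assumes loose refs stored as individual files and a SHA-1 repository; `createBranch` is a hypothetical helper, the commit hash is a placeholder, and the "repository" is just a temporary directory rather than a real `.git` folder.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Create a branch by writing "<40 hex chars>\n" to refs/heads/<name>.
function createBranch(gitDir, name, commitId) {
  const refPath = path.join(gitDir, 'refs', 'heads', name);
  fs.mkdirSync(path.dirname(refPath), { recursive: true });
  fs.writeFileSync(refPath, `${commitId}\n`);
  return refPath;
}

// Stand-in for a .git directory, plus a placeholder commit hash.
const gitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-'));
const commitId = 'a'.repeat(40);

const mainRef = createBranch(gitDir, 'main', commitId);
const featureRef = createBranch(gitDir, 'feature', commitId);

// Both branches are independent 41-byte files naming the same commit.
console.log(fs.statSync(featureRef).size); // 41
console.log(fs.readFileSync(mainRef, 'utf8') === fs.readFileSync(featureRef, 'utf8')); // true
```

The cost of `createBranch` does not depend on how many files or commits the repository holds, which is the point of the answer above.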
HEAD is stored in `.git/HEAD`. Normally it contains `ref: refs/heads/main`, meaning “I am on branch main.” When you commit, Git reads HEAD, follows the ref to the branch file, reads the current commit hash, creates the new commit with that as the parent, and updates the branch file with the new commit hash. HEAD itself does not change. When HEAD contains a raw commit hash (detached HEAD), commits still work, but no branch pointer advances. Those commits become “orphaned” (reachable only through the reflog) and will eventually be garbage collected if you switch away without creating a branch. This is why `git checkout <commit>` prints a warning: it is not dangerous, but it creates a workflow where work can be silently lost.
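The attached and detached cases can be sketched as two small functions. This is an illustrative sketch only: `readHead` and `advance` are hypothetical names, the commit hashes are placeholders, the repository is a temporary directory, and packed refs and the reflog are ignored.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Read HEAD: either "ref: refs/heads/<branch>" (attached) or a raw hash (detached).
function readHead(gitDir) {
  const head = fs.readFileSync(path.join(gitDir, 'HEAD'), 'utf8').trim();
  if (head.startsWith('ref: ')) {
    const ref = head.slice('ref: '.length);
    const refPath = path.join(gitDir, ref);
    // A branch with no commits yet has no ref file.
    const commit = fs.existsSync(refPath) ? fs.readFileSync(refPath, 'utf8').trim() : null;
    return { detached: false, ref, commit };
  }
  return { detached: true, ref: null, commit: head };
}

// Record a new commit: update the branch file when attached, HEAD itself when detached.
// Returns the previous commit, which the new commit would list as its parent.
function advance(gitDir, newCommitId) {
  const head = readHead(gitDir);
  const target = head.detached ? path.join(gitDir, 'HEAD') : path.join(gitDir, head.ref);
  fs.mkdirSync(path.dirname(target), { recursive: true });
  fs.writeFileSync(target, `${newCommitId}\n`);
  return head.commit;
}

const gitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-'));
const headFile = path.join(gitDir, 'HEAD');
fs.writeFileSync(headFile, 'ref: refs/heads/main\n');

const c1 = '1'.repeat(40);
const c2 = '2'.repeat(40);
advance(gitDir, c1);                 // first commit on main, no parent
const parent = advance(gitDir, c2);  // second commit, parent is c1
console.log(parent === c1, fs.readFileSync(headFile, 'utf8').trim()); // true ref: refs/heads/main
```

In the attached case only `refs/heads/main` changes; if `HEAD` held a raw hash instead, `advance` would overwrite `HEAD` and leave every branch file untouched, which is how detached commits end up unreferenced.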