Chapter 1: Project Setup & Init

In this chapter, we’ll set up our project and implement the first Git command: init. By the end, you’ll understand exactly what happens when you run git init. Why start here? Because git init is deceptively simple: it looks like it just creates a folder. But that folder is the skeleton of a content-addressable filesystem, the same idea that systems such as IPFS use to store data under a hash of its content. By building init yourself, you are laying the foundation that every subsequent command depends on, just like pouring the foundation of a building before raising walls.
Prerequisites: Basic JavaScript knowledge, Node.js installed
Time: 1-2 hours
Outcome: A working mygit init command

What You’ll Learn

  • How Git’s .git directory is structured
  • What each file and folder in .git means
  • How to build a CLI tool in Node.js
  • Content-addressable storage concepts

Understanding Git’s Directory Structure

When you run git init, Git creates a .git directory with this structure:
.git/
├── HEAD              # Points to current branch (ref: refs/heads/master)
├── config            # Repository-specific configuration
├── description       # Used by GitWeb (we'll skip this)
├── objects/          # Object database (blobs, trees, commits)
│   ├── info/
│   └── pack/
├── refs/             # Branch and tag pointers
│   ├── heads/        # Branch refs (e.g., refs/heads/master)
│   └── tags/         # Tag refs
├── hooks/            # Scripts triggered by Git events
└── info/
    └── exclude       # Local gitignore (not committed)
The most important parts are HEAD, objects/, and refs/. Everything else is optional for a minimal implementation. Think of HEAD as “you are here” on a map, objects/ as the warehouse that stores every version of every file, and refs/ as the labelled bookmarks pointing to specific moments in history.
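To make the "three essential entries" idea concrete, here is a minimal sketch that checks whether a directory contains HEAD, objects/, and refs/. The helper name missingEssentials is hypothetical and not part of the chapter's code; it only illustrates the claim above and is not how Git itself validates a repository.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not part of the chapter's code): reports which of the
// three essential entries are missing from a candidate .git directory.
function missingEssentials(gitDir) {
    const essentials = [
        { name: 'HEAD', isDir: false },
        { name: 'objects', isDir: true },
        { name: 'refs', isDir: true },
    ];
    return essentials
        .filter(({ name, isDir }) => {
            const full = path.join(gitDir, name);
            if (!fs.existsSync(full)) return true;
            // Present, but the wrong kind of entry (file vs. directory).
            return fs.statSync(full).isDirectory() !== isDir;
        })
        .map(({ name }) => name);
}

// Usage: build a throwaway layout in the OS temp directory and check it.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-demo-'));
fs.mkdirSync(path.join(demoDir, 'objects'));
fs.mkdirSync(path.join(demoDir, 'refs'));
console.log(missingEssentials(demoDir)); // HEAD has not been written yet

fs.writeFileSync(path.join(demoDir, 'HEAD'), 'ref: refs/heads/master\n');
console.log(missingEssentials(demoDir)); // nothing missing now
```

A check like this could also back a quick sanity test once init is implemented later in the chapter.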

Project Setup

1. Initialize Your Project

mkdir mygit
cd mygit
npm init -y

2. Create the Project Structure

mygit/
├── src/
│   ├── commands/         # Individual command implementations
│   │   └── init.js
│   ├── utils/            # Shared utilities
│   │   └── paths.js
│   └── mygit.js          # Main CLI entry point
├── package.json
└── README.md

3. Set Up package.json

package.json
{
  "name": "mygit",
  "version": "1.0.0",
  "description": "A Git implementation for learning",
  "main": "src/mygit.js",
  "bin": {
    "mygit": "./src/mygit.js"
  },
  "scripts": {
    "test": "node test/test.js"
  },
  "keywords": ["git", "vcs", "learning"],
  "license": "MIT"
}

Implementation

Step 1: Create the CLI Entry Point

src/mygit.js
#!/usr/bin/env node

/**
 * mygit - A Git implementation for learning
 * 
 * Usage: mygit <command> [options]
 * 
 * This file is the "front door" of our CLI tool. Its only job is to
 * figure out which command the user wants and hand off to the right module.
 * Keeping this file thin means each command can evolve independently --
 * the same separation-of-concerns principle behind real Git's source layout.
 */

const commands = {
    init: require('./commands/init'),
    // We'll add more commands later
};

function main() {
    // process.argv: [node, script, ...userArgs]
    // We slice off the first two to get the user-supplied arguments.
    const args = process.argv.slice(2);
    
    if (args.length === 0) {
        console.log('usage: mygit <command> [<args>]');
        console.log('\nAvailable commands:');
        console.log('   init       Initialize a new repository');
        process.exit(1);
    }
    
    const command = args[0];
    const commandArgs = args.slice(1);
    
    if (!commands[command]) {
        console.error(`mygit: '${command}' is not a mygit command.`);
        process.exit(1);
    }
    
    try {
        commands[command].execute(commandArgs);
    } catch (error) {
        console.error(`error: ${error.message}`);
        process.exit(1);
    }
}

main();
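
Because main() calls process.exit directly, it is awkward to exercise from a test. One possible variation, sketched here under that assumption, returns an exit code instead. The names dispatch and stubCommands are hypothetical; the chapter's own entry point stays as shown above.

```javascript
// Hypothetical variation on main(): returns an exit code instead of calling
// process.exit, so the dispatch logic can be exercised from a test.
function dispatch(commands, argv, log = console) {
    const [command, ...commandArgs] = argv;

    if (!command) {
        log.log('usage: mygit <command> [<args>]');
        return 1;
    }
    // Own-property check so inherited keys like "toString" are not
    // mistaken for commands.
    if (!Object.prototype.hasOwnProperty.call(commands, command)) {
        log.error(`mygit: '${command}' is not a mygit command.`);
        return 1;
    }
    try {
        commands[command].execute(commandArgs);
        return 0;
    } catch (error) {
        log.error(`error: ${error.message}`);
        return 1;
    }
}

// Usage with a stub command table (the real one would require ./commands/init).
const seen = [];
const stubCommands = { init: { execute: (args) => seen.push(args) } };
const exitCode = dispatch(stubCommands, ['init', 'my-repo']);
console.log(exitCode, seen);
```

With this shape, the real entry point would reduce to process.exit(dispatch(commands, process.argv.slice(2))).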

Step 2: Create Path Utilities

src/utils/paths.js
const path = require('path');
const fs = require('fs');

/**
 * Find the .git directory by walking up from the current directory.
 *
 * Why walk upward? Because Git commands work from any subdirectory
 * inside a repository. When you run `git status` from `src/utils/`,
 * Git doesn't give up -- it climbs the directory tree until it finds
 * the `.git` folder. This is the same behavior we're implementing here.
 *
 * Debugging tip: if this function returns null when you expect a repo,
 * check that you're calling it from a directory that actually lives
 * inside an initialized repository. A common mistake is running
 * `mygit init` in one terminal directory and testing in another.
 */
function findGitDir(startDir = process.cwd()) {
    let currentDir = path.resolve(startDir);
    
    // Check each directory on the way up, including the filesystem root
    // itself (e.g., "/" or "C:\"), then stop.
    while (true) {
        const gitDir = path.join(currentDir, '.git');
        if (fs.existsSync(gitDir) && fs.statSync(gitDir).isDirectory()) {
            return gitDir;
        }
        const parentDir = path.dirname(currentDir);
        if (parentDir === currentDir) {
            // dirname() of the root is the root: nowhere left to climb.
            return null;
        }
        currentDir = parentDir;
    }
}

/**
 * Get the repository root (parent of .git)
 */
function getRepoRoot(gitDir) {
    return path.dirname(gitDir);
}

/**
 * Ensure we're in a Git repository
 */
function requireGitDir() {
    const gitDir = findGitDir();
    if (!gitDir) {
        throw new Error('not a git repository (or any of the parent directories): .git');
    }
    return gitDir;
}

module.exports = {
    findGitDir,
    getRepoRoot,
    requireGitDir
};
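
To see the upward walk in action, the sketch below builds a throwaway repository layout in the OS temp directory and searches from a nested folder. It carries a condensed copy of the walk so that it runs on its own; the temp-directory names are arbitrary.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Condensed copy of the upward walk so this snippet runs on its own.
function findGitDir(startDir) {
    let dir = path.resolve(startDir);
    for (;;) {
        const candidate = path.join(dir, '.git');
        if (fs.existsSync(candidate) && fs.statSync(candidate).isDirectory()) {
            return candidate;
        }
        const parent = path.dirname(dir);
        if (parent === dir) return null; // reached the filesystem root
        dir = parent;
    }
}

// Build <tmp>/repo/.git and <tmp>/repo/src/utils, then search from the leaf.
const repoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-walk-'));
const nested = path.join(repoRoot, 'src', 'utils');
fs.mkdirSync(path.join(repoRoot, '.git'));
fs.mkdirSync(nested, { recursive: true });

const found = findGitDir(nested);
console.log(found === path.join(repoRoot, '.git')); // true
console.log(path.dirname(found) === repoRoot);      // true: what getRepoRoot returns
```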

Step 3: Implement the Init Command

src/commands/init.js
const fs = require('fs');
const path = require('path');

/**
 * Initialize a new Git repository
 * 
 * Creates the .git directory structure:
 * - HEAD: Points to refs/heads/master
 * - config: Repository configuration
 * - objects/: Object database
 * - refs/heads/: Branch refs
 * - refs/tags/: Tag refs
 */
function execute(args) {
    // Parse arguments
    const directory = args[0] || '.';
    const repoPath = path.resolve(directory);
    const gitDir = path.join(repoPath, '.git');
    
    // Check if already a repository
    if (fs.existsSync(gitDir)) {
        console.log(`Reinitialized existing Git repository in ${gitDir}`);
        return;
    }
    
    // Create the directory structure.
    // Each of these directories serves a specific purpose:
    //   objects/      -> the database of all file contents, trees, and commits
    //   objects/info/ -> auxiliary info for dumb-protocol transfers (rarely used)
    //   objects/pack/ -> packed objects for storage efficiency (we'll skip packing)
    //   refs/heads/   -> one file per branch, each containing a commit hash
    //   refs/tags/    -> one file per tag, pointing to a tagged object
    //   info/         -> repo-level metadata like local exclude patterns
    const directories = [
        gitDir,
        path.join(gitDir, 'objects'),
        path.join(gitDir, 'objects', 'info'),
        path.join(gitDir, 'objects', 'pack'),
        path.join(gitDir, 'refs'),
        path.join(gitDir, 'refs', 'heads'),
        path.join(gitDir, 'refs', 'tags'),
        path.join(gitDir, 'info'),
    ];
    
    directories.forEach(dir => {
        fs.mkdirSync(dir, { recursive: true });
    });
    
    // Create HEAD file - points to the master branch.
    // This is a "symbolic reference" (symref). HEAD is the answer to
    // "where am I right now?" -- it almost always points to a branch name,
    // which in turn points to a commit hash. This level of indirection is
    // what lets commits advance a branch automatically.
    const headContent = 'ref: refs/heads/master\n';
    fs.writeFileSync(path.join(gitDir, 'HEAD'), headContent);
    
    // Create config file with minimal settings
    const configContent = `[core]
\trepositoryformatversion = 0
\tfilemode = false
\tbare = false
`;
    fs.writeFileSync(path.join(gitDir, 'config'), configContent);
    
    // Create description file (used by GitWeb)
    const descContent = 'Unnamed repository; edit this file to name the repository.\n';
    fs.writeFileSync(path.join(gitDir, 'description'), descContent);
    
    // Create info/exclude (local gitignore)
    const excludeContent = '# git ls-files --others --exclude-from=.git/info/exclude\n';
    fs.writeFileSync(path.join(gitDir, 'info', 'exclude'), excludeContent);
    
    console.log(`Initialized empty Git repository in ${gitDir}`);
}

module.exports = { execute };

Testing Your Implementation

Make It Executable

# Make the script executable (on Unix)
chmod +x src/mygit.js

# Link it globally for testing
npm link

Test It!

# Create a test directory
mkdir test-repo
cd test-repo

# Initialize with your implementation
mygit init

# Verify the structure
ls -la .git/
# Should show: HEAD, config, description, info/, objects/, refs/

Compare with Real Git

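# Go back up so both repos sit side by side, then create another directory
cd ..
mkdir git-repo
cd git-repo
git init

# Compare the structures (expect some differences, e.g. hooks/ and config)
diff -r ../test-repo/.git .git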

Deep Dive: Understanding HEAD

The HEAD file is crucial to Git. Let’s understand it:
ref: refs/heads/master
This is a symbolic reference (symref). It says “I’m pointing to whatever refs/heads/master contains.” When you make a commit while HEAD is a symref:
  1. Git creates a new commit object
  2. Reads HEAD to find current branch: refs/heads/master
  3. Updates refs/heads/master to point to new commit
  4. HEAD still points to refs/heads/master (unchanged)
When HEAD contains a commit SHA instead of a ref:
abc123def456...  (not "ref: refs/heads/...")
This is called a detached HEAD: you’re not on any branch, so a new commit updates HEAD itself rather than a branch file.
A branch is just a file in refs/heads/ containing a commit SHA:
$ cat .git/refs/heads/master
abc123def456789...
That’s it! Branches are just pointers to commits.
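The two HEAD states described above can be told apart with a few lines of code. This is a minimal sketch, assuming loose ref files only (it ignores packed-refs); resolveHead is a hypothetical helper name, and the SHA used is a placeholder rather than a real commit.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: describes what HEAD currently refers to.
// Only loose ref files are handled; packed-refs is ignored in this sketch.
function resolveHead(gitDir) {
    const head = fs.readFileSync(path.join(gitDir, 'HEAD'), 'utf8').trim();

    if (!head.startsWith('ref: ')) {
        // Detached HEAD: the file holds a commit SHA directly.
        return { detached: true, ref: null, sha: head };
    }

    const ref = head.slice('ref: '.length); // e.g. "refs/heads/master"
    const refFile = path.join(gitDir, ...ref.split('/'));
    // The branch file does not exist until the first commit ("unborn" branch).
    const sha = fs.existsSync(refFile)
        ? fs.readFileSync(refFile, 'utf8').trim()
        : null;
    return { detached: false, ref, sha };
}

// Usage against a throwaway directory.
const gitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-head-'));
fs.mkdirSync(path.join(gitDir, 'refs', 'heads'), { recursive: true });
fs.writeFileSync(path.join(gitDir, 'HEAD'), 'ref: refs/heads/master\n');
console.log(resolveHead(gitDir)); // sha is null: no commit yet

const fakeSha = 'a'.repeat(40); // placeholder, not a real commit
fs.writeFileSync(path.join(gitDir, 'refs', 'heads', 'master'), fakeSha + '\n');
console.log(resolveHead(gitDir)); // sha now comes from the branch file
```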

Deep Dive: The Objects Directory

The objects/ directory is Git’s content-addressable storage:
objects/
├── ab/                    # First 2 characters of SHA
│   └── cdef123456...     # Remaining 38 characters
├── pack/                  # Packed objects (for efficiency)
└── info/                  # Additional info
Why split the hash? A directory with millions of files is slow on most filesystems because lookups degrade as directory entries grow. By using the first 2 hex characters as subdirectory names, Git creates up to 256 “buckets” (00-ff), each holding far fewer files. This is essentially the same trick a hash table uses — distribute entries across buckets to keep each bucket small. Real-world repos with hundreds of thousands of objects stay fast thanks to this simple fan-out.
We’ll implement the object storage in the next chapter!
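As a preview, the 2 + 38 split can be expressed as a small path helper. This is only a sketch: objectPath is a hypothetical name, and the object ID below is made up for illustration.

```javascript
const path = require('path');

// Hypothetical helper: maps a 40-character hex object ID to its loose-object
// location under objects/, using the 2 + 38 split described above.
function objectPath(gitDir, sha) {
    if (!/^[0-9a-f]{40}$/.test(sha)) {
        throw new Error(`not a 40-character hex object ID: ${sha}`);
    }
    return path.join(gitDir, 'objects', sha.slice(0, 2), sha.slice(2));
}

const sha = 'abcdef0123456789abcdef0123456789abcdef01'; // made-up ID
console.log(objectPath('.git', sha));
```

The first two characters become the bucket directory, so at most 256 subdirectories ever appear under objects/ for loose objects.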

Common Pitfalls

  • Forgetting the trailing newline in HEAD. The HEAD file must end with \n. Without it, some Git tools (and your own future commands) may misparse the ref. Always write 'ref: refs/heads/master\n', not 'ref: refs/heads/master'.
  • Path separator issues on Windows. If you’re developing on Windows, path.join produces backslashes. Git expects forward slashes inside its own metadata. For internal Git paths (like ref names), normalize with .split(path.sep).join('/').
  • Re-initializing an existing repo by accident. Real Git gracefully handles git init in an existing repo (it prints “Reinitialized…”). Make sure your implementation checks for an existing .git directory before overwriting it.
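
The path separator pitfall can be reproduced on any platform, because Node exposes Windows-style path handling as path.win32. The sketch below uses a hypothetical helper, toGitPath, to apply the normalization mentioned above; the sep parameter exists only to make the behavior easy to demonstrate.

```javascript
const path = require('path');

// Hypothetical helper: converts an OS-specific relative path into the
// forward-slash form used inside Git metadata. `sep` defaults to the
// current platform's separator and is a parameter only to ease testing.
function toGitPath(osPath, sep = path.sep) {
    return osPath.split(sep).join('/');
}

// path.win32 behaves like Windows path handling on any platform.
const windowsStyle = path.win32.join('refs', 'heads', 'master');
console.log(windowsStyle);                            // refs\heads\master
console.log(toGitPath(windowsStyle, path.win32.sep)); // refs/heads/master
```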

Exercises

Implement mygit init --bare which creates a bare repository (no working directory):
// Bare repos:
// - Have no working directory
// - Still have a HEAD that points to a branch (ref: refs/heads/master)
// - Keep HEAD, objects/, refs/ directly in the target directory (no .git subfolder)

// Hint: Check for --bare in args, then:
// 1. Don't create .git subdirectory, use current directory
// 2. Set bare = true in config
Solution outline:
const isBare = args.includes('--bare');
const repoDir = isBare ? repoPath : path.join(repoPath, '.git');
// ... rest of implementation
Implement mygit init --initial-branch=main to set a custom default branch:
// Many teams and hosting platforms now use 'main' instead of 'master'
// Parse the --initial-branch=NAME argument

// Hint: Update HEAD content:
// ref: refs/heads/main
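One possible starting point for this exercise is sketched below. It handles only the --initial-branch=NAME form and applies a deliberately loose name check; real Git also accepts -b NAME and validates branch names more strictly. The helper names parseInitialBranch and headContentFor are hypothetical.

```javascript
// Hypothetical helper: extracts NAME from --initial-branch=NAME, falling back
// to 'master'. Only the "=" form is handled, and the name check is a loose
// approximation of Git's real ref-name rules.
function parseInitialBranch(args, fallback = 'master') {
    const prefix = '--initial-branch=';
    const flag = args.find((arg) => arg.startsWith(prefix));
    if (flag === undefined) return fallback;

    const name = flag.slice(prefix.length);
    if (name === '' || /\s/.test(name) || name.includes('..')) {
        throw new Error(`invalid branch name: '${name}'`);
    }
    return name;
}

// Builds the HEAD file content, including the required trailing newline.
function headContentFor(branch) {
    return `ref: refs/heads/${branch}\n`;
}

console.log(JSON.stringify(headContentFor(parseInitialBranch(['--initial-branch=main']))));
console.log(JSON.stringify(headContentFor(parseInitialBranch([]))));
```

Note that execute currently treats args[0] as the target directory, so flags would need to be filtered out of args before the directory is read.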
Add validation to check if the target directory is writable:
// Before creating .git, check:
// 1. Parent directory exists (or can be created)
// 2. We have write permissions
// 3. Not trying to init inside another .git

// Use: fs.accessSync(dir, fs.constants.W_OK)
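
The three checks listed in the hints might be combined as in the following sketch. checkInitTarget is a hypothetical helper that returns null when init may proceed and a message otherwise; the ".git path segment" test is a simple heuristic, not the logic real Git uses.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical pre-flight check for `mygit init <target>`.
// Returns null when init may proceed, or a string explaining why not.
function checkInitTarget(target) {
    const resolved = path.resolve(target);

    // Check 3: refuse to init inside an existing .git directory.
    if (resolved.split(path.sep).includes('.git')) {
        return `${resolved} is inside a .git directory`;
    }

    // Check 1: find the nearest ancestor that exists (the target itself may
    // not exist yet). The loop ends at the root, which always exists.
    let existing = resolved;
    while (!fs.existsSync(existing)) {
        existing = path.dirname(existing);
    }
    if (!fs.statSync(existing).isDirectory()) {
        return `${existing} is not a directory`;
    }

    // Check 2: write permission is needed there to create the target or .git.
    try {
        fs.accessSync(existing, fs.constants.W_OK);
    } catch {
        return `no write permission for ${existing}`;
    }
    return null;
}

// Usage against a throwaway directory.
const base = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-check-'));
console.log(checkInitTarget(path.join(base, 'new', 'project')));
console.log(checkInitTarget(path.join(base, '.git', 'objects')));
```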

Key Takeaways

Simple Structure

Git’s .git directory is surprisingly simple: just files and folders

HEAD is King

HEAD always tells you where you are: which branch or which commit

Branches are Files

A branch is just a file containing 40 hex characters (a SHA-1 hash)

Content-Addressable

The objects/ directory stores everything by its content hash

What’s Next?

In Chapter 2: Object Model, we’ll implement Git’s object storage:
  • Create and store blob objects (file content)
  • Implement hash-object and cat-file commands
  • Understand SHA-1 hashing and zlib compression

Next: Object Model

Learn how Git stores files as content-addressed blobs

Interview Deep-Dive

Strong Answer:
  • This is a fan-out strategy borrowed from hash table design. Most filesystems degrade when a single directory contains hundreds of thousands of entries because directory lookups become linear scans (or at best, B-tree lookups that grow with entry count). By using the first two hex characters as a subdirectory, Git creates up to 256 buckets, keeping each directory small.
  • A repository with 100,000 objects has ~390 files per subdirectory on average. Without fan-out, that is 100,000 entries in a single objects/ directory, which causes readdir() and stat() to slow down dramatically on ext4 and other common filesystems.
  • The choice of 2 characters (256 buckets) is a pragmatic balance. One character (16 buckets) would still put thousands of files per directory. Three characters (4096 buckets) would waste directory entries on small repos. Two characters work well for repositories up to millions of objects.
  • This same pattern appears everywhere in systems design: Redis uses hash slots for cluster sharding, Cassandra uses consistent hashing for partition distribution, and CDNs use URL-based hashing for cache distribution. The underlying principle is always the same: distribute entries across buckets to avoid hotspots.
Follow-up: What are pack files, and how do they change this storage model?
Pack files consolidate many loose objects into a single .pack file with an accompanying .idx index. Instead of 100,000 individual files in the fan-out directory structure, Git compresses them into one (or a few) pack files. The index gives O(log n) lookup by hash: a 256-entry fan-out table keyed on the first byte narrows the range, and a binary search over the sorted entries finds the object. git gc triggers packing, and git repack can be run manually. Pack files also use delta compression: similar objects are stored as a base plus a binary diff, which is why pack files are usually much smaller than the sum of loose objects. The fan-out directories are still used for newly created (loose) objects; packing happens periodically as maintenance.
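The lookup idea behind a pack index (narrow the range by the first byte, then binary-search the sorted IDs) can be modeled in memory as follows. This is a simplified illustration and not the real .idx binary format; buildIndex and lookup are hypothetical names, and the object IDs are made up and assumed to be lowercase hex.

```javascript
// Simplified in-memory model of a pack index, NOT the real .idx binary format.
// fanout[b] holds how many object IDs have a first byte <= b, so the IDs
// starting with byte b occupy sorted[fanout[b - 1] .. fanout[b]).
function buildIndex(ids) {
    const sorted = [...ids].sort(); // lowercase hex sorts in byte order
    const fanout = new Array(256).fill(0);
    for (const id of sorted) {
        fanout[parseInt(id.slice(0, 2), 16)] += 1;
    }
    for (let b = 1; b < 256; b++) {
        fanout[b] += fanout[b - 1]; // turn counts into cumulative totals
    }
    return { sorted, fanout };
}

// Returns the position of `id` in the sorted list, or -1 if absent.
function lookup(index, id) {
    const first = parseInt(id.slice(0, 2), 16);
    let lo = first === 0 ? 0 : index.fanout[first - 1];
    let hi = index.fanout[first]; // exclusive upper bound
    while (lo < hi) {
        const mid = (lo + hi) >>> 1;
        if (index.sorted[mid] === id) return mid;
        if (index.sorted[mid] < id) lo = mid + 1;
        else hi = mid;
    }
    return -1;
}

// Made-up 40-character IDs for illustration.
const ids = ['ff', '00', 'ab', 'ab', '7c'].map((p, i) => p + String(i).repeat(38));
const index = buildIndex(ids);
console.log(lookup(index, ids[2]), lookup(index, 'ab' + '9'.repeat(38)));
```

The fan-out table plays the same role as the two-character directories for loose objects: it cuts the search space to one bucket before any comparison of full IDs happens.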
Strong Answer:
  • A symbolic reference (symref) is a reference that points to another reference rather than directly to an object. HEAD normally contains ref: refs/heads/main, which means “follow refs/heads/main to find my value.” This level of indirection is what makes branch advancement automatic.
  • When you commit, Git reads HEAD, sees it is a symref pointing to refs/heads/main, reads the current commit hash from that file, creates a new commit with that hash as the parent, then writes the new commit’s hash back to refs/heads/main. HEAD does not change — only the branch file changes. This is how branches “grow”: the tip pointer moves forward one commit at a time.
  • Without this indirection, Git would have to update HEAD on every commit, and there would be no concept of “being on a branch.” Every state would be a detached HEAD. The symref is what connects the concept of “current branch” to the commit graph.
  • The practical consequence is that operations like git status can resolve HEAD to a branch name and display “On branch main” rather than just a raw hash. It also lets Git keep a reflog for each branch in addition to the reflog for HEAD, which is essential for recovery with git reflog after an accidental git reset --hard.
Follow-up: What happens to HEAD during a rebase or merge? Does it stay as a symref?
During a rebase, it does not. Git detaches HEAD at the new base and replays each commit there; the branch ref is moved to the rebased tip, and HEAD is re-attached to the branch, only when the rebase finishes. If a conflict occurs and the rebase pauses, the in-progress state is tracked by files in .git/rebase-merge/ or .git/rebase-apply/. During a merge, HEAD stays a symref: the merge commit is created with two (or more) parents, and the branch pointer advances to the merge commit. Outside of a rebase, HEAD becomes a raw hash (detached) when you check out something that is not a local branch, for example git checkout <commit> or a tag.
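The commit flow from this answer (read HEAD, follow the symref, write the new hash into the branch file) can be sketched as below. advanceHead is a hypothetical helper; real Git also takes lock files and records reflog entries, which this sketch leaves out, and the SHA is a placeholder.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: records `newSha` as the latest commit.
// If HEAD is a symref, only the branch file changes; if HEAD is detached,
// HEAD itself is overwritten. Locking and reflogs are omitted.
function advanceHead(gitDir, newSha) {
    const headFile = path.join(gitDir, 'HEAD');
    const head = fs.readFileSync(headFile, 'utf8').trim();

    if (head.startsWith('ref: ')) {
        const ref = head.slice('ref: '.length);
        const refFile = path.join(gitDir, ...ref.split('/'));
        fs.mkdirSync(path.dirname(refFile), { recursive: true });
        fs.writeFileSync(refFile, newSha + '\n');
        return refFile;
    }
    fs.writeFileSync(headFile, newSha + '\n');
    return headFile;
}

// Usage: HEAD stays untouched while the branch file receives the hash.
const gitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'mygit-commit-'));
fs.writeFileSync(path.join(gitDir, 'HEAD'), 'ref: refs/heads/main\n');
const newSha = 'b'.repeat(40); // placeholder, not a real commit
const written = advanceHead(gitDir, newSha);
console.log(path.relative(gitDir, written));
console.log(JSON.stringify(fs.readFileSync(path.join(gitDir, 'HEAD'), 'utf8')));
```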
Strong Answer:
  • It does not exist yet. HEAD contains ref: refs/heads/master, but the file .git/refs/heads/master is not created until the first commit. This is an intentional design: a branch file only exists when it has a commit to point to. Before the first commit, you are on an “unborn branch.”
  • This is why git status on a fresh repository says “No commits yet” and why git log fails with “does not have any commits yet.” The branch is conceptually created by HEAD’s symref, but it is not materialized until there is a commit hash to write into the branch file.
  • This design avoids a special “null commit” sentinel value. Instead of storing a special value meaning “no commits,” Git simply does not create the file. Code that reads branch refs handles the “file not found” case as “unborn branch,” which is cleaner than checking for a magic value.
  • A practical implication: if you try to create a branch with git branch feature before the first commit, Git fails because it cannot resolve HEAD to a commit hash. You must make at least one commit before branching. This trips up new Git users who try to create branches immediately after git init.
Follow-up: How does git init --initial-branch=main differ from git init followed by git branch -m main?
git init --initial-branch=main writes ref: refs/heads/main\n to HEAD instead of the default ref: refs/heads/master\n. No branch file is created in either case; the flag only affects the symref target in HEAD. Running git branch -m main right after git init reaches the same state by a different route: there is no branch file to rename yet, so recent Git versions simply repoint HEAD at refs/heads/main (the hint that git init prints about the default branch name suggests this very command). The difference is mostly one of convenience: --initial-branch does it in one step, and the init.defaultBranch configuration option sets the name for every future git init so that the flag does not have to be passed each time.