Git Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to use Git and commit code, feel free to skip ahead. No judgment.

This chapter reveals Git’s elegant internal design. We will explore the content-addressable object database, understand how commits form a directed acyclic graph, and demystify the staging area. This knowledge transforms you from a Git user into someone who truly understands version control.

Why Internals Matter

Understanding Git internals helps you:

Recover from disasters when git reflog is your last hope
Debug merge conflicts by understanding the three-way merge algorithm
Optimize repositories with pack files and garbage collection
Ace interviews where Git internals are surprisingly common
Never fear Git again because you know exactly what is happening

The Fundamental Truth: Git is a Content-Addressable Filesystem

At its core, Git is a simple key-value store. You give it content, it gives you back a unique key (SHA-1 hash). This design decision is what makes Git fast, reliable, and elegant.

# Git stores content by its hash
$ echo "hello" | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a

# Same content = same hash (always)
$ echo "hello" | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a

This has profound implications:

Data integrity: If content changes, hash changes - corruption is detectable
Deduplication: Same content stored once, referenced everywhere
Fast comparisons: Compare 40-character hashes instead of file contents
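
To make these three properties concrete, here is a minimal sketch of a content-addressable store in Python. It is a toy, not Git's implementation: the ContentStore class is a name invented for illustration, and it hashes the raw bytes directly, whereas Git prepends a small header before hashing.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content  # identical content lands on the same key
        return key

    def get(self, key: str) -> bytes:
        content = self._objects[key]
        # Integrity check: re-hash on read to detect corruption.
        if hashlib.sha1(content).hexdigest() != key:
            raise ValueError(f"object {key} is corrupt")
        return content

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"hello\n")
second = store.put(b"hello\n")
print(first == second, len(store))  # True 1
```

Storing the same bytes twice yields one key and one stored object, which is the deduplication property in miniature.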

The Four Git Objects

Git stores everything as one of four object types. Understanding these is understanding Git.

1. Blobs - The Content

A blob (binary large object) stores file content. Just content - no filename, no permissions, no metadata.

# Create a blob manually (-w writes it to the object database)
$ echo "hello" | git hash-object -w --stdin
ce013625030ba8dba906f756967f9e9ca394464a

# View blob content
$ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello

# View blob type
$ git cat-file -t ce013625030ba8dba906f756967f9e9ca394464a
blob

Key insight: Two files with identical content = one blob. Rename a file? Same blob, different tree entry.

2. Trees - The Directories

A tree is like a directory listing. It contains:

Pointers to blobs (files)
Pointers to other trees (subdirectories)
Mode (permissions), type, hash, and filename for each entry

$ git cat-file -p main^{tree}
100644 blob 8b137891791fe96927ad78e64b0aad7bded08bdc    README.md
100644 blob a5c19667710254f835085b99726e523457150e03    package.json
040000 tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904    src

Mode breakdown:

100644 - Regular file
100755 - Executable file
040000 - Directory (tree)
120000 - Symbolic link
160000 - Gitlink (submodule)
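
As a rough sketch of how a tree gets its own hash, the Python below serializes entries the way Git stores them on disk (mode, space, name, NUL byte, then the 20 raw bytes of the object ID) and hashes the result. The hash_tree function is a name invented for this example. Note one difference from the listing above: inside the object, a directory's mode is stored as 40000, without the leading zero that git cat-file -p prints.

```python
import hashlib


def hash_tree(entries):
    """Hash a tree given a list of (mode, name, hex_object_id) tuples."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts directory names as if they ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, oid in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)

    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries always hashes to the well-known "empty tree" ID.
print(hash_tree([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

readme = ("100644", "README.md", "8b137891791fe96927ad78e64b0aad7bded08bdc")
package = ("100644", "package.json", "a5c19667710254f835085b99726e523457150e03")
# Entry order in the input does not matter, because entries are sorted first.
print(hash_tree([readme, package]) == hash_tree([package, readme]))  # True
```

Because a tree's hash covers the hashes of everything beneath it, changing one file changes the hash of every tree on the path up to the root.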

3. Commits - The Snapshots

A commit is a snapshot in time. It contains:

Pointer to a tree (the project state)
Pointer to parent commit(s)
Author (who wrote the code), with a timestamp
Committer (who made the commit), with a timestamp
Commit message

$ git cat-file -p HEAD
tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent a1b2c3d4e5f6789012345678901234567890abcd
author John Doe <john@example.com> 1701590400 +0000
committer John Doe <john@example.com> 1701590400 +0000

Add user authentication feature

Implements login/logout with session management.

Why author and committer?

Author: Original code writer
Committer: Person who applied/committed (different in cherry-pick, rebase, patches)
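
The sketch below shows why a commit ID depends on everything in the commit. It rebuilds the text shown by git cat-file -p and hashes it with a commit header. build_commit is a hypothetical helper, and it assumes a plain commit with no extra headers (such as a GPG signature), so treat it as an illustration rather than a full serializer.

```python
import hashlib


def build_commit(tree, parents, author, committer, message):
    """Serialize a simple commit and return its SHA-1 object ID."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


who = "John Doe <john@example.com> 1701590400 +0000"
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
parent = "a1b2c3d4e5f6789012345678901234567890abcd"

first = build_commit(tree, [parent], who, who, "Add user authentication feature\n")
second = build_commit(tree, [parent], who, who, "Add user authentication feature!\n")
print(first != second)  # True: one extra character gives a different commit ID
```

The same holds for the parent line, which is why rewriting one commit changes the ID of every commit built on top of it.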

4. Tags - The Bookmarks

Annotated tags are objects containing:

Pointer to the tagged object (usually a commit)
Tag name
Tagger information
Tag message

$ git cat-file -p v1.0.0
object a1b2c3d4e5f6789012345678901234567890abcd
type commit
tag v1.0.0
tagger Jane Smith <jane@example.com> 1701590400 +0000

Release version 1.0.0
- Added authentication
- Fixed critical bugs

Lightweight tags are just refs that point directly at a commit. No tag object is created, so they carry no tagger, date, or message.

The Object Database Structure

All objects live in .git/objects/, organized by hash:

.git/objects/
├── 8b/
│   └── 137891791fe96927ad78e64b0aad7bded08bdc  # First 2 chars = dir
├── a1/
│   └── b2c3d4e5f6789012345678901234567890abcd
├── info/
│   └── packs
└── pack/
    ├── pack-abc123.idx
    └── pack-abc123.pack

Objects are compressed with zlib. The 2-character directory split prevents filesystem issues with too many files in one directory.

How SHA-1 Hashing Works

Git computes hashes by prepending a header to content:

Header: "<type> <size>\0"
Content: <raw bytes>
Hash: SHA-1(Header + Content)

Example for a blob:

# Manual hash calculation ("hello" plus the newline that echo adds is 6 bytes)
$ printf 'blob 6\0hello\n' | sha1sum
ce013625030ba8dba906f756967f9e9ca394464a  -

# Same as git hash-object
$ echo "hello" | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a
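
The same header-plus-content rule is easy to check in Python. This is a small sketch using only hashlib. The function name git_hash_object is chosen here for illustration and covers SHA-1 repositories only.

```python
import hashlib


def git_hash_object(content: bytes, obj_type: str = "blob") -> str:
    """Compute the object ID Git assigns to content in a SHA-1 repository."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# echo appends a newline, so the hashed bytes include "\n".
print(git_hash_object(b"hello\n"))         # ce013625030ba8dba906f756967f9e9ca394464a
print(git_hash_object(b"test content\n"))  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

The size in the header counts bytes, not characters, so the trailing newline matters: hashing "hello" without it gives a different ID.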

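SHA-1 collision concerns: Yes, SHA-1 has known weaknesses, and a practical collision was demonstrated in 2017 (SHAttered). Git now uses a hardened SHA-1 implementation that detects that class of collision, and it supports SHA-256 repositories (git init --object-format=sha256), although SHA-1 remains the default. For now, attacks against Git specifically remain theoretical.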

The Index (Staging Area) Demystified

The index (.git/index) is a binary file that tracks:

Which files are staged
Their blob hashes
Timestamps, permissions, sizes

# View index contents
$ git ls-files --stage
100644 8b137891791fe96927ad78e64b0aad7bded08bdc 0       README.md
100644 a5c19667710254f835085b99726e523457150e03 0       package.json

The third column is the stage number. It is 0 for normal entries and becomes important during merge conflicts:

Stage 0: Normal, no conflict
Stage 1: Common ancestor version
Stage 2: Our version (HEAD)
Stage 3: Their version (merging branch)

# During a merge conflict
$ git ls-files --stage
100644 abc123... 1       file.txt  # Ancestor
100644 def456... 2       file.txt  # Ours
100644 789abc... 3       file.txt  # Theirs
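
As a small illustration of how the stage numbers can be used, the sketch below parses text in the format printed by git ls-files --stage (mode, object ID, stage, then a tab and the path) and reports which paths are in conflict. The function name and the shortened object IDs are placeholders invented for the example.

```python
from collections import defaultdict

STAGE_NAMES = {0: "merged", 1: "base", 2: "ours", 3: "theirs"}


def conflicted_paths(ls_files_output: str) -> dict:
    """Group index entries by path and keep the paths with no stage-0 entry."""
    stages = defaultdict(dict)
    for line in ls_files_output.splitlines():
        meta, path = line.split("\t", 1)  # the path follows a tab
        _mode, oid, stage = meta.split()
        stages[path][STAGE_NAMES[int(stage)]] = oid
    return {path: s for path, s in stages.items() if "merged" not in s}


sample = (
    "100644 aaa111 0\tREADME.md\n"
    "100644 abc123 1\tfile.txt\n"
    "100644 def456 2\tfile.txt\n"
    "100644 789abc 3\tfile.txt\n"
)
print(conflicted_paths(sample))
# {'file.txt': {'base': 'abc123', 'ours': 'def456', 'theirs': '789abc'}}
```

A path that was added on both sides has no stage 1 entry, so code like this should not assume all three stages are present.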

Why the Index is Brilliant

Speed: Comparing mtimes/sizes is faster than hashing all files
Granularity: Stage parts of a file with git add -p
Atomic commits: Build up your commit before finalizing
Three-way merge: All versions available for conflict resolution

Refs - The Human-Readable Pointers

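Refs are files containing SHA-1 hashes. They make Git usable. Git may also consolidate refs into a single .git/packed-refs file, in which case the individual file under .git/refs/ is missing.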

.git/refs/
├── heads/          # Local branches
│   ├── main        # Contains: a1b2c3d4...
│   └── feature-x   # Contains: d5e6f7a8...
├── remotes/        # Remote-tracking branches
│   └── origin/
│       ├── main
│       └── feature-y
└── tags/           # Tags
    └── v1.0.0

# View what a ref points to (if the ref is packed, use: git rev-parse main)
$ cat .git/refs/heads/main
a1b2c3d4e5f6789012345678901234567890abcd

# HEAD is special - it's usually a symbolic ref
$ cat .git/HEAD
ref: refs/heads/main

# Detached HEAD points directly to a commit
$ git checkout a1b2c3d
$ cat .git/HEAD
a1b2c3d4e5f6789012345678901234567890abcd
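
The lookup Git performs here can be sketched in a few lines of Python. This toy version works on a dictionary that stands in for the files under .git/, so it ignores packed refs. resolve_ref and its max_depth parameter are illustrative names, not Git's API.

```python
def resolve_ref(files: dict, name: str = "HEAD", max_depth: int = 5) -> str:
    """Follow "ref: ..." indirections until a commit hash is reached."""
    for _ in range(max_depth):
        value = files[name].strip()
        if not value.startswith("ref: "):
            return value  # a plain hash: resolution is done
        name = value[len("ref: "):]  # symbolic ref: follow it
    raise ValueError("too many levels of symbolic refs")


commit = "a1b2c3d4e5f6789012345678901234567890abcd"

attached = {"HEAD": "ref: refs/heads/main\n", "refs/heads/main": commit + "\n"}
detached = {"HEAD": commit + "\n"}

print(resolve_ref(attached) == resolve_ref(detached))  # True
```

Both states resolve to the same commit. The difference is only whether a branch file sits in between and gets updated when you commit.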

The Reflog - Your Safety Net

Every time HEAD moves, Git logs it in the reflog:

$ git reflog
a1b2c3d HEAD@{0}: commit: Add authentication
d5e6f7a HEAD@{1}: checkout: moving from feature-x to main
b8c9d0e HEAD@{2}: commit: WIP feature
...

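# Recover a "lost" commit by giving it a branch name (recovered is an arbitrary name)
$ git branch recovered HEAD@{2}

# Reflog is local only; by default entries expire after 90 days
# (30 days for entries no longer reachable from the current tip)
$ git reflog expire --expire=now --all  # Don't do this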

Packfiles - Compression and Efficiency

As repositories grow, storing every object separately is wasteful. Packfiles solve this.

Delta Compression

Git stores similar objects as deltas (differences):

Object A: "Hello, World!"
Object B: "Hello, Git!"

Stored as:
- Object A: Full content
- Object B: "Use Object A, replace 'World' with 'Git'"
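
To show the idea in code, here is a toy delta encoder built on Python's difflib. It expresses the target as "copy this range from the base" and "insert these new bytes" instructions. Git's pack deltas are also built from copy and insert instructions, but the binary encoding and the way Git picks a base are different, so this is an analogy rather than the real format. make_delta and apply_delta are names invented here.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))  # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes not found in base
        # "delete" needs no instruction: those base bytes are simply skipped
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"Hello, World!"
target = b"Hello, Git!"
delta = make_delta(base, target)
print(apply_delta(base, delta) == target)  # True
```

Reading a deltified object means first reading its base, which may itself be a delta. That chain length is the depth column in the pack listing.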

Pack Structure

# View pack contents
$ git verify-pack -v .git/objects/pack/pack-abc123.idx

SHA-1           type    size    size-in-pack    offset    depth    base-SHA
a1b2c3d4...     commit  234     180             12        -        -
d5e6f7a8...     tree    89      78              192       -        -
8b137891...     blob    2048    156             270       2        f0e1d2c3

When Packing Happens

git gc - Manual garbage collection
git push - Objects packed for transfer
git fetch - Receive packfiles
Automatically when the number of loose objects exceeds the gc.auto threshold (6700 by default)

# Force repacking
$ git gc --aggressive

# Repack with delta depth optimization
$ git repack -a -d -f --depth=250 --window=250

The Directed Acyclic Graph (DAG)

Commits form a DAG - a graph with no cycles where edges point backwards (to parents).

Initial:     A

Linear:      A---B---C

Branch:      A---B---C
                  \
                   D---E

Merge:       A---B---C---F
                  \     /
                   D---E

Octopus:           C
                  / \
             A---B-D-G
                  \ /
                   E

(G is a single merge commit with three parents: C, D, and E)

Why DAG Matters

Reachability: “Is commit X an ancestor of Y?” is fast
Common ancestor: Three-way merge needs merge base
History traversal: git log walks the DAG
Garbage collection: Unreachable commits are pruned

# Find merge base (common ancestor)
$ git merge-base main feature-x
a1b2c3d4e5f6789012345678901234567890abcd

# Check if commit is ancestor
$ git merge-base --is-ancestor a1b2c3d main && echo "Yes"
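
Both questions reduce to walking parent pointers. The sketch below models the "Merge" diagram above as a dictionary from commit to parents and answers them naively. Git's real implementation is far more optimized, and the function names are made up for illustration.

```python
from collections import deque

# commit -> list of parents, matching the "Merge" diagram (F merges C and E)
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["D"],
    "F": ["C", "E"],
}


def ancestors(commit: str, parents: dict) -> set:
    """Return every commit reachable from commit, including itself."""
    seen = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents[current])
    return seen


def is_ancestor(maybe_ancestor: str, commit: str, parents: dict) -> bool:
    return maybe_ancestor in ancestors(commit, parents)


def merge_bases(a: str, b: str, parents: dict) -> set:
    """Common ancestors that are not themselves ancestors of another common one."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {
        c for c in common
        if not any(c != other and c in ancestors(other, parents) for other in common)
    }


print(is_ancestor("B", "E", PARENTS))  # True
print(is_ancestor("C", "E", PARENTS))  # False
print(merge_bases("C", "E", PARENTS))  # {'B'}
```

The result is a set because some histories (criss-cross merges, for example) have more than one best common ancestor.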

How Merge Actually Works

Understanding the three-way merge algorithm:

Setup

         Base (B)
        /        \
    Ours (O)    Theirs (T)

The Algorithm

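For each file, compare the base, ours, and theirs versions. In the table, A, B, and C stand for distinct file contents, and - means the file does not exist on that side:

Base	Ours	Theirs	Result
A	A	A	A (unchanged)
A	A	B	B (they changed)
A	B	A	B (we changed)
A	B	B	B (both same change)
A	B	C	Line-level merge; CONFLICT only if the changes overlap
-	A	-	A (we added)
-	-	A	A (they added)
-	A	A	A (both added the same content)
A	-	-	DELETE (both deleted)
A	-	A	DELETE (we deleted)
A	A	-	DELETE (they deleted)
-	A	B	CONFLICT (both added different)
A	B	-	CONFLICT (we changed, they deleted)
A	-	B	CONFLICT (we deleted, they changed)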
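
The whole table collapses into three comparisons. Here is a sketch in Python, where None stands for "file absent" and the CONFLICT sentinel marks the cases the file-level rule cannot settle. merge_file is an illustrative name. When both sides changed a file differently, real Git goes one step further and tries to merge the file's lines, which this sketch does not attempt.

```python
CONFLICT = object()  # sentinel: the file-level rule cannot pick a winner


def merge_file(base, ours, theirs):
    """File-level three-way merge. None means the file does not exist."""
    if ours == theirs:   # unchanged, same change on both sides, or both deleted
        return ours
    if ours == base:     # only they changed (or deleted) the file
        return theirs
    if theirs == base:   # only we changed (or deleted) the file
        return ours
    return CONFLICT      # both sides diverged from base in different ways


print(merge_file("A", "A", "B"))               # B
print(merge_file("A", None, "A"))              # None (we deleted)
print(merge_file("A", "B", "C") is CONFLICT)   # True
```

The base is what makes this work. With only ours and theirs there would be no way to tell "they changed it" from "we changed it".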

Inside a Merge Conflict

# The three versions during conflict
$ git show :1:file.txt  # Base (stage 1)
$ git show :2:file.txt  # Ours (stage 2)
$ git show :3:file.txt  # Theirs (stage 3)

# Conflict markers in file
<<<<<<< HEAD
our changes
=======
their changes
>>>>>>> feature-branch

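If you ever need to process conflict markers yourself, the structure is simple enough to parse. The sketch below pulls the two sides out of each conflicted region. It assumes the default two-way marker style shown above (not the diff3 style, which adds a base section), and split_conflicts is a name invented for the example.

```python
def split_conflicts(text: str) -> list:
    """Return a list of (ours, theirs) pairs, one per conflicted region."""
    hunks = []
    ours, theirs, state = [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            state, ours, theirs = "ours", [], []
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>> ") and state == "theirs":
            hunks.append(("\n".join(ours), "\n".join(theirs)))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
        # lines outside any conflicted region are ignored
    return hunks


conflicted = """line before
<<<<<<< HEAD
our changes
=======
their changes
>>>>>>> feature-branch
line after
"""
print(split_conflicts(conflicted))  # [('our changes', 'their changes')]
```
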
Interview Deep Dive Questions

What is the Git object model?

Answer: Git uses four object types: blobs (file content), trees (directories mapping names to blobs/trees), commits (snapshots pointing to a tree plus metadata), and annotated tags (named pointers with metadata). Objects are identified by SHA-1 hash of their content, making Git a content-addressable filesystem.

What is the difference between git merge and git rebase?

Answer: Merge creates a new commit with two parents (unless the merge can fast-forward), preserving full history. Rebase replays commits on top of another branch, which creates new commits with new hashes and a linear history. Merge is safer (no rewritten history), rebase is cleaner (linear log). Never rebase public/shared branches.

How does Git detect file renames?

Answer: Git does not track renames explicitly. It uses heuristics during diff/log to detect renames by comparing blob content. If files are >50% similar (configurable with -M), Git considers it a rename. This is why renaming and modifying in the same commit can confuse detection.

What happens during git checkout?

Answer: Checkout updates three things: 1) HEAD (point to new commit/branch), 2) Index (update staged files to match commit), 3) Working directory (update files to match index). If switching branches with uncommitted changes, Git refuses if changes would be overwritten.

How does git gc work?

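Answer: Garbage collection: 1) Packs loose objects into packfiles with delta compression, 2) Removes objects unreachable from any ref or reflog once they are older than the prune grace period (two weeks by default), 3) Expires old reflog entries (by default 90 days, or 30 days for entries unreachable from the current tip), 4) Prunes empty directories in .git/objects. Git runs it automatically when the number of loose objects exceeds the gc.auto threshold (6700 by default).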

Explain detached HEAD state

Answer: Normally HEAD points to a branch name (symbolic ref), which points to a commit. Detached HEAD means HEAD points directly to a commit hash. Commits made in this state are not on any branch. When you check out something else, those commits are no longer reachable from any branch. They stay recoverable through the reflog for a while and are eventually garbage collected, unless you create a branch or tag that points at them.

Exploring Internals Yourself

# Create a new repo and explore
$ git init internals-demo && cd internals-demo

# Create and hash a file manually
$ echo "test content" > test.txt
$ git hash-object -w test.txt
d670460b4b4aece5915caf5c68d12f560a9fe3e4

# See where it's stored
$ ls .git/objects/d6/
70460b4b4aece5915caf5c68d12f560a9fe3e4

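# Decompress and view (loose objects are zlib streams; Python's zlib module is one way to read them)
$ cat .git/objects/d6/70460... | python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))"
blob 13test content
# (a NUL byte sits between "blob 13" and the content; terminals do not display it)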

# Make a commit and explore its structure
$ git add test.txt
$ git commit -m "Initial commit"
$ git cat-file -p HEAD
$ git cat-file -p HEAD^{tree}
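
The store-and-decompress steps above can also be reproduced entirely in memory, which is handy for experimenting without a repository. This is a sketch under the assumptions already described in this chapter (header, NUL byte, content, zlib compression, two-character directory split). The function names are invented for the example.

```python
import hashlib
import zlib


def encode_loose_object(content: bytes, obj_type: str = "blob"):
    """Return the path and compressed bytes Git would use for a loose object."""
    store = f"{obj_type} {len(content)}".encode() + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()
    path = f".git/objects/{oid[:2]}/{oid[2:]}"  # first 2 hex chars = directory
    return path, zlib.compress(store)


def decode_loose_object(compressed: bytes):
    """Inverse of encode_loose_object: return (type, content)."""
    store = zlib.decompress(compressed)
    header, _, content = store.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


path, data = encode_loose_object(b"test content\n")
print(path)  # .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
print(decode_loose_object(data))  # ('blob', b'test content\n')
```

The compressed bytes may differ from what Git writes, since zlib output depends on the compression level, but the decompressed payload and the object ID are the same.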

Key Takeaways

Git is a content-addressable filesystem - content hashes are keys
Four object types: blobs, trees, commits, annotated tags
SHA-1 hashes ensure integrity - any change = different hash
The index is the staging area - binary file tracking staged state
Refs make hashes human-readable - branches and tags are just files
Packfiles optimize storage - delta compression for similar objects
History is a DAG - commits point to parents, forming a graph
Three-way merge uses common ancestor - compares base, ours, theirs

Ready to master branching strategies? Next up: Git Branching where we will explore GitFlow, trunk-based development, and merge vs rebase.
