Git Internals Deep Dive
If you love understanding how things actually work, this chapter is for you. If you just want to use Git and commit code, feel free to skip ahead. No judgment.
This chapter reveals Git's elegant internal design. We will explore the content-addressable object database, understand how commits form a directed acyclic graph, and demystify the staging area. This knowledge transforms you from a Git user into someone who truly understands version control.
Why Internals Matter
Understanding Git internals helps you:
- Recover from disasters when `git reflog` is your last hope
- Debug merge conflicts by understanding the three-way merge algorithm
- Optimize repositories with pack files and garbage collection
- Ace interviews where Git internals are surprisingly common
- Never fear Git again because you know exactly what is happening
The Fundamental Truth: Git is a Content-Addressable Filesystem
At its core, Git is a simple key-value store. You give it content, it gives you back a unique key (SHA-1 hash). This design decision is what makes Git fast, reliable, and elegant.
- Data integrity: If content changes, hash changes - corruption is detectable
- Deduplication: Same content stored once, referenced everywhere
- Fast comparisons: Compare 40-character hashes instead of file contents
The Four Git Objects
Git stores everything as one of four object types. Understanding these is understanding Git.
1. Blobs - The Content
A blob (binary large object) stores file content. Just content - no filename, no permissions, no metadata.
2. Trees - The Directories
A tree is like a directory listing. It contains:
- Pointers to blobs (files)
- Pointers to other trees (subdirectories)
- Mode (permissions), type, hash, and filename for each entry
Common entry modes:
- `100644` - Regular file
- `100755` - Executable file
- `040000` - Directory (tree)
- `120000` - Symbolic link
- `160000` - Gitlink (submodule)
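To make the tree format concrete, here is a minimal Python sketch of how a tree object could be serialized: each entry is the mode, a space, the filename, a NUL byte, and the raw 20-byte hash. The helper names `hash_object` and `build_tree` are illustrative, not Git APIs, and the sketch assumes the SHA-1 object format. Note that the stored mode for a directory is written without the leading zero (`40000`).

```python
import hashlib

def hash_object(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given type."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def build_tree(entries):
    """Serialize (mode, name, hex_sha) tuples into a tree object body."""
    def sort_key(entry):
        mode, name, _ = entry
        # Directory names are compared as if they ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, hex_sha in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_sha)
    return body

empty_blob = hash_object("blob", b"")
# README.md is a hypothetical filename used only for illustration.
tree_body = build_tree([("100644", "README.md", empty_blob)])
print(hash_object("tree", tree_body))
print(hash_object("tree", build_tree([])))  # ID of a tree with no entries
```

Because the tree stores only names and hashes, two directories with identical contents produce the same tree ID, no matter where they sit in the project.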
3. Commits - The Snapshots
A commit is a snapshot in time. It contains:
- Pointer to a tree (the project state)
- Pointer to parent commit(s)
- Author (who wrote the code)
- Committer (who made the commit)
- Commit message
- Timestamp
The author and committer are usually the same person, but they can differ:
- Author: Original code writer
- Committer: Person who applied/committed (different in cherry-pick, rebase, patches)
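As a rough illustration of how these fields fit together, the sketch below assembles the text body of a commit object and hashes it. The function name `build_commit`, the identities, and the timestamps are hypothetical. The layout (a `tree` line, zero or more `parent` lines, `author` and `committer` lines that each carry their own timestamp, a blank line, then the message) follows the commit object format.

```python
import hashlib

def build_commit(tree, parents, author, committer, message):
    """Assemble the text body of a commit object."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # none for a root commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()

# Hypothetical identities and timestamps, purely for illustration.
author = "Alice Example <alice@example.com> 1700000000 +0000"
committer = "Bob Example <bob@example.com> 1700003600 +0000"
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

body = build_commit(empty_tree, [], author, committer, "Initial commit")
commit_id = hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()
print(body.decode())
print(commit_id)
```

Since the parent hashes are part of the hashed content, changing any ancestor changes the ID of every commit that descends from it.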
4. Tags - The Bookmarks
Annotated tags are objects containing:
- Pointer to a commit
- Tag name
- Tagger information
- Tag message
The Object Database Structure
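All objects live in `.git/objects/`, organized by hash: the first two hex characters of the hash name a subdirectory, and the remaining 38 name the file inside it.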
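The layout can be sketched in a few lines of Python. This is a simplified model of loose-object storage that assumes SHA-1 and zlib compression. The function names are made up for this example, and a temporary directory stands in for `.git/objects`.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

def write_loose_object(objects_dir: Path, obj_type: str, body: bytes) -> str:
    """Store an object under <first 2 hex chars>/<remaining 38> and return its ID."""
    store = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    oid = hashlib.sha1(store).hexdigest()
    path = objects_dir / oid[:2] / oid[2:]
    if not path.exists():  # identical content is stored only once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(store))
    return oid

def read_loose_object(objects_dir: Path, oid: str):
    """Return (type, body) for a stored object."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    assert int(size) == len(body)
    return obj_type, body

with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"  # stand-in for .git/objects
    oid = write_loose_object(objects, "blob", b"hello world\n")
    print(oid[:2], oid[2:])
    print(read_loose_object(objects, oid))
```

Writing the same content twice returns the same ID and touches no new file, which is the deduplication property described earlier.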
How SHA-1 Hashing Works
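Git computes an object's hash by prepending a header (the object type, the content length in bytes, and a NUL byte) to the content and hashing the result. SHA-1 has known collision weaknesses, so Git can also create repositories that use SHA-256 (`git init --object-format=sha256`). For now, practical attacks against Git specifically remain theoretical.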
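A minimal Python sketch of the blob case, assuming the SHA-1 object format (the function name `git_blob_hash` is ours, not a Git API):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Hash content the way Git hashes a blob: header first, then the bytes."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_hash(b"hello world\n"))
# → 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# Changing a single byte yields a completely different ID.
print(git_blob_hash(b"hello world!\n"))
```

The first value should match what `echo 'hello world' | git hash-object --stdin` prints. It differs from a plain SHA-1 of the same bytes because of the header.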
The Index (Staging Area) Demystified
The index (.git/index) is a binary file that tracks:
- Which files are staged
- Their blob hashes
- Timestamps, permissions, sizes
Each index entry also has a stage number, which matters during merge conflicts:
- Stage 0: Normal, no conflict
- Stage 1: Common ancestor version
- Stage 2: Our version (HEAD)
- Stage 3: Their version (merging branch)
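For a feel of the binary layout, the sketch below parses the 12-byte index header (the signature `DIRC`, a version, and an entry count, all big-endian) and extracts the 2-bit stage number from an entry's 16-bit flags field. It runs on synthetic bytes rather than a real `.git/index`, and the function names are illustrative. A full parser would also need to handle entry bodies, padding, and extensions.

```python
import struct

def parse_index_header(data: bytes):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count

def stage_of(flags: int) -> int:
    """Extract the stage (0-3) stored in bits 12-13 of the flags field."""
    return (flags >> 12) & 0b11

sample = struct.pack(">4sII", b"DIRC", 2, 3)  # synthetic header: version 2, 3 entries
print(parse_index_header(sample))  # (2, 3)
print(stage_of(0x2005))            # 2, i.e. "our version"
```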
Why the Index is Brilliant
- Speed: Comparing mtimes/sizes is faster than hashing all files
- Granularity: Stage parts of a file with `git add -p`
- Atomic commits: Build up your commit before finalizing
- Three-way merge: All versions available for conflict resolution
Refs - The Human-Readable Pointers
Refs are files containing SHA-1 hashes. They make Git usable.
The Reflog - Your Safety Net
Every time HEAD moves, Git records the old and new position in the reflog, which you can inspect with `git reflog`.
Packfiles - Compression and Efficiency
As repositories grow, storing every object separately is wasteful. Packfiles solve this.
Delta Compression
Git stores similar objects as deltas (differences) against a base object instead of as full copies.
Pack Structure
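A packfile starts with a 12-byte header (the signature `PACK`, a version number, and the object count, all big-endian) and, in SHA-1 repositories, ends with a 20-byte SHA-1 checksum of everything before it. The sketch below checks both on a synthetic empty pack. The function names are illustrative, and decoding the object entries between header and trailer is deliberately left out.

```python
import hashlib
import struct

def parse_pack_header(data: bytes):
    """Return (version, object_count) from the first 12 bytes of a packfile."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile")
    return version, count

def trailer_is_valid(data: bytes) -> bool:
    """Check that the last 20 bytes are the SHA-1 of all preceding bytes."""
    return hashlib.sha1(data[:-20]).digest() == data[-20:]

# Synthetic pack with zero objects: header followed directly by the checksum.
header = struct.pack(">4sII", b"PACK", 2, 0)
pack = header + hashlib.sha1(header).digest()

print(parse_pack_header(pack))  # (2, 0)
print(trailer_is_valid(pack))   # True
```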
When Packing Happens
- `git gc` - Manual garbage collection
- `git push` - Objects packed for transfer
- `git fetch` - Receive packfiles
- Automatically when loose objects exceed a threshold (`gc.auto`, 6700 by default)
The Directed Acyclic Graph (DAG)
Commits form a DAG - a graph with no cycles where edges point backwards (to parents).
Why DAG Matters
- Reachability: "Is commit X an ancestor of Y?" is fast
- Common ancestor: Three-way merge needs merge base
- History traversal: `git log` walks the DAG
- Garbage collection: Unreachable commits are pruned
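Reachability and merge-base queries are ordinary graph walks. The following naive sketch runs on a small hypothetical history stored as a dict of parent lists. Real Git uses far more optimized traversal, so treat this only as a model of the idea.

```python
from collections import deque

# Hypothetical history: each commit maps to its list of parents.
#   A - B - C - E   (main)
#        \
#         D - F     (feature)
parents = {"A": [], "B": ["A"], "C": ["B"], "E": ["C"], "D": ["B"], "F": ["D"]}

def ancestors(commit):
    """Return the commit plus every commit reachable through parent edges."""
    seen, queue = {commit}, deque([commit])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def is_ancestor(x, y):
    """True if x is reachable from y by following parent pointers."""
    return x in ancestors(y)

def merge_bases(a, b):
    """Common ancestors that are not themselves ancestors of another common ancestor."""
    common = ancestors(a) & ancestors(b)
    return {c for c in common
            if not any(c != d and c in ancestors(d) for d in common)}

print(is_ancestor("B", "F"))  # True
print(merge_bases("E", "F"))  # {'B'}
```

Garbage collection uses the same walk in reverse logic: anything not in the ancestor set of some ref or reflog entry is a candidate for pruning.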
How Merge Actually Works
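Understanding the three-way merge algorithm:
Setup
A merge works from three versions of each file: the base (B, the common ancestor), ours (O, the version at HEAD), and theirs (T, the version on the branch being merged).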
The Algorithm
For each file, compare B, O, T:
| Base | Ours | Theirs | Result |
|---|---|---|---|
| A | A | A | A (unchanged) |
| A | A | B | B (they changed) |
| A | B | A | B (we changed) |
| A | B | B | B (both same change) |
| A | B | C | CONFLICT |
| - | A | - | A (we added) |
| - | - | A | A (they added) |
| A | - | - | DELETE (both deleted) |
| A | - | A | DELETE (we deleted) |
| A | A | - | DELETE (they deleted) |
| - | A | B | CONFLICT (both added different) |
| A | B | - | CONFLICT (we changed, they deleted) |
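The decision table above collapses into three comparisons. Here is a small sketch at whole-file granularity, where `None` stands for "file absent" and `CONFLICT` is a sentinel defined for this example. When both sides changed a file, real Git goes further and attempts a line-level merge, which this sketch does not model.

```python
CONFLICT = object()  # sentinel: the two sides disagree and neither matches the base

def merge_file(base, ours, theirs):
    """Resolve one file from its base, ours, and theirs versions (None = absent)."""
    if ours == theirs:
        return ours      # unchanged, same change on both sides, or both deleted
    if ours == base:
        return theirs    # only they changed, added, or deleted it
    if theirs == base:
        return ours      # only we changed, added, or deleted it
    return CONFLICT

print(merge_file("A", "A", "B"))                 # B (they changed)
print(merge_file("A", None, "A"))                # None (we deleted)
print(merge_file("A", "B", None) is CONFLICT)    # True (we changed, they deleted)
```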
Inside a Merge Conflict
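When a file conflicts, Git writes both versions into the working copy, separated by marker lines: `<<<<<<<` opens our side, `=======` divides the two sides, and `>>>>>>>` closes their side. The sketch below splits such a region back into its two halves. The sample file content and the branch label `feature` are invented for illustration, and the parser assumes the default two-way marker style.

```python
def split_conflict(text):
    """Return (ours_lines, theirs_lines) collected from conflict regions."""
    ours, theirs, side = [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            side = ours          # lines that follow belong to our version
        elif line == "=======":
            side = theirs        # lines that follow belong to their version
        elif line.startswith(">>>>>>> "):
            side = None          # conflict region ends
        elif side is not None:
            side.append(line)
    return ours, theirs

sample = """\
<<<<<<< HEAD
color = "blue"
=======
color = "green"
>>>>>>> feature
"""

print(split_conflict(sample))
```

Resolving the conflict means replacing the whole marked region with the content you want and staging the file, which collapses index stages 1 to 3 back into a single stage 0 entry.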
Interview Deep Dive Questions
What is the Git object model?
Answer: Git uses four object types: blobs (file content), trees (directories mapping names to blobs/trees), commits (snapshots pointing to a tree plus metadata), and annotated tags (named pointers with metadata). Objects are identified by SHA-1 hash of their content, making Git a content-addressable filesystem.
What is the difference between git merge and git rebase?
Answer: Merge creates a new commit with two parents, preserving full history. Rebase replays commits on top of another branch, rewriting commit hashes and creating linear history. Merge is safer (no rewritten history), rebase is cleaner (linear log). Never rebase public/shared branches.
How does Git detect file renames?
Answer: Git does not track renames explicitly. It uses heuristics during diff/log to detect renames by comparing blob content. If a deleted file and an added file are similar enough (50% by default, configurable with `-M`), Git considers it a rename. This is why renaming and heavily modifying a file in the same commit can confuse detection.
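The idea of a similarity threshold can be illustrated with Python's `difflib`. To be clear, this is a stand-in: Git computes its own similarity score, and `SequenceMatcher.ratio()` is not that algorithm. The function names and the handling of scores exactly at the threshold are this sketch's own choices.

```python
from difflib import SequenceMatcher

def similarity(old_lines, new_lines):
    """Return a 0.0-1.0 score for how much two line lists have in common."""
    return SequenceMatcher(None, old_lines, new_lines).ratio()

def looks_like_rename(old_lines, new_lines, threshold=0.5):
    """Treat a deleted/added pair as a rename if the score reaches the threshold."""
    return similarity(old_lines, new_lines) >= threshold

old = ["a", "b", "c", "d"]
lightly_edited = ["a", "b", "c", "x"]   # one of four lines changed
rewritten = ["w", "x", "y", "z"]        # nothing in common

print(similarity(old, lightly_edited))       # 0.75
print(looks_like_rename(old, lightly_edited))  # True
print(looks_like_rename(old, rewritten))       # False
```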
What happens during git checkout?
Answer: Checkout updates three things: 1) HEAD (point to new commit/branch), 2) Index (update staged files to match commit), 3) Working directory (update files to match index). If switching branches with uncommitted changes, Git refuses if changes would be overwritten.
How does git gc work?
Answer: Garbage collection: 1) Packs loose objects into packfiles with delta compression, 2) Removes objects unreachable from any ref or reflog once they are older than a grace period (two weeks by default), 3) Expires old reflog entries (90 days by default, 30 days for entries that are no longer reachable), 4) Prunes empty directories in .git/objects. It runs automatically when loose objects exceed a threshold.
Explain detached HEAD state
Answer: Normally HEAD points to a branch name (symbolic ref), which points to a commit. Detached HEAD means HEAD points directly to a commit hash. Commits made in this state are not on any branch. When you check out something else, those commits are no longer reachable from any branch and will eventually be garbage collected once their reflog entries expire (unless you create a branch for them).
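The difference is visible in the `HEAD` file itself. This sketch resolves HEAD against a fake repository directory built in a temp folder. It assumes loose ref files only (no `packed-refs`), and the hash used is a placeholder, not a real commit.

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir: Path):
    """Return (ref_name_or_None, commit_hash) by reading HEAD."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        return ref, (git_dir / ref).read_text().strip()
    return None, head  # detached: HEAD holds the hash itself

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)  # stand-in for a .git directory
    fake = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # placeholder hash
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "main").write_text(fake + "\n")

    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    attached = resolve_head(git_dir)

    (git_dir / "HEAD").write_text(fake + "\n")
    detached = resolve_head(git_dir)

print(attached)  # ('refs/heads/main', '3b18e512...')
print(detached)  # (None, '3b18e512...')
```

In the attached case a new commit updates the branch file, and in the detached case only `HEAD` changes, which is why nothing else remembers those commits.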
Exploring Internals Yourself
Key Takeaways
- Git is a content-addressable filesystem - content hashes are keys
- Four object types: blobs, trees, commits, annotated tags
- SHA-1 hashes ensure integrity - any change = different hash
- The index is the staging area - binary file tracking staged state
- Refs make hashes human-readable - branches and tags are just files
- Packfiles optimize storage - delta compression for similar objects
- History is a DAG - commits point to parents, forming a graph
- Three-way merge uses common ancestor - compares base, ours, theirs
Ready to master branching strategies? Next up: Git Branching, where we will explore GitFlow, trunk-based development, and merge vs rebase.