# Git Internals

> How Git actually works - the object model, SHA-1 hashing, packfiles, and refs

# Git Internals Deep Dive

> **If you love understanding how things actually work, this chapter is for you. If you just want to use Git and commit code, feel free to skip ahead. No judgment.**

This chapter reveals Git's elegant internal design. We will explore the content-addressable object database, understand how commits form a directed acyclic graph, and demystify the staging area. This knowledge transforms you from a Git user into someone who truly understands version control.

***

## Why Internals Matter

Understanding Git internals helps you:

* **Recover from disasters** when `git reflog` is your last hope
* **Debug merge conflicts** by understanding the three-way merge algorithm
* **Optimize repositories** with pack files and garbage collection
* **Ace interviews** where Git internals are surprisingly common
* **Never fear Git again** because you know exactly what is happening

***

## The Fundamental Truth: Git is a Content-Addressable Filesystem

At its core, Git is a simple **key-value store**. You give it content, it gives you back a unique key (SHA-1 hash). Think of it like a library where every book is shelved by a fingerprint of its contents rather than by title or author. If even one character changes, the fingerprint changes, and it goes on a different shelf. This design decision is what makes Git fast, reliable, and elegant.

```bash theme={null}
# Git stores content by its hash
$ echo "hello" | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a

# Same content = same hash (always)
$ echo "hello" | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a
```

This has profound implications:

* **Data integrity**: If content changes, hash changes - corruption is detectable
* **Deduplication**: Same content stored once, referenced everywhere
* **Fast comparisons**: Compare 40-character hashes instead of file contents

***

## The Four Git Objects

Git stores everything as one of four object types. Understanding these is understanding Git.

### 1. Blobs - The Content

A **blob** (binary large object) stores file content. Just content - no filename, no permissions, no metadata.

```bash theme={null}
# Create a blob manually
$ echo "test content" | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4

# View blob content
$ git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4
test content

# View blob type
$ git cat-file -t d670460b4b4aece5915caf5c68d12f560a9fe3e4
blob
```

**Key insight**: Two files with identical content = one blob. Rename a file? Same blob, different tree entry.
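
The deduplication is easy to observe in a scratch repository. The sketch below assumes `git` is on the `PATH`; the temporary directory and the file names `a.txt` and `docs/copy.txt` are made up for illustration.

```bash
# Sketch: identical content under two paths is stored as one blob.
repo=$(mktemp -d)                 # throwaway repository (hypothetical location)
cd "$repo" || exit 1
git init -q .

printf 'same bytes\n' > a.txt
mkdir docs
cp a.txt docs/copy.txt            # different name and directory, same content
git add a.txt docs/copy.txt

# Field 2 of `git ls-files --stage` is the blob hash recorded for a path.
hash_a=$(git ls-files --stage a.txt | awk '{print $2}')
hash_copy=$(git ls-files --stage docs/copy.txt | awk '{print $2}')
echo "a.txt         -> $hash_a"
echo "docs/copy.txt -> $hash_copy"

# Only one loose object was written, even though two paths were staged.
loose=$(git count-objects | awk '{print $1}')
echo "loose objects: $loose"
```

Both paths report the same hash, and the object count stays at one because the second `git add` found the blob already present.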

### 2. Trees - The Directories

A **tree** is like a directory listing. It contains:

* Pointers to blobs (files)
* Pointers to other trees (subdirectories)
* Mode (permissions), type, hash, and filename for each entry

```bash theme={null}
$ git cat-file -p main^{tree}
100644 blob 8b137891791fe96927ad78e64b0aad7bded08bdc    README.md
100644 blob a5c19667710254f835085b99726e523457150e03    package.json
040000 tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904    src
```

Mode breakdown:

* `100644` - Regular file
* `100755` - Executable file
* `040000` - Directory (tree)
* `120000` - Symbolic link
* `160000` - Gitlink (submodule)
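
Trees can also be assembled by hand with plumbing commands, which makes the mode column concrete. This is a sketch in a scratch repository, assuming `git` is installed; `notes.txt` and `bin/run.sh` are hypothetical paths chosen for the demo.

```bash
# Sketch: build a tree with plumbing commands and inspect the entry modes.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Write one blob, then register it in the index under two paths and two modes.
blob=$(printf 'echo hi\n' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644,"$blob",notes.txt
git update-index --add --cacheinfo 100755,"$blob",bin/run.sh

# write-tree turns the index into tree objects and prints the root tree hash.
tree=$(git write-tree)

# The root tree holds a blob entry and a subtree entry (mode 040000).
git ls-tree "$tree"

# Recursing into subtrees shows the executable mode on bin/run.sh.
git ls-tree -r "$tree"
```

Note that both entries point at the same blob: the mode and the name live in the tree, not in the blob.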

### 3. Commits - The Snapshots

A **commit** is a snapshot in time. It contains:

* Pointer to a tree (the project state)
* Pointer to parent commit(s)
* Author (who wrote the change), with a timestamp
* Committer (who created the commit object), with its own timestamp
* Commit message

```bash theme={null}
$ git cat-file -p HEAD
tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent a1b2c3d4e5f6789012345678901234567890abcd
author John Doe <john@example.com> 1701590400 +0000
committer John Doe <john@example.com> 1701590400 +0000

Add user authentication feature

Implements login/logout with session management.
```

**Why author and committer?**

* Author: Original code writer
* Committer: Person who applied/committed (different in cherry-pick, rebase, patches)
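
One way to see the two fields diverge is to set them explicitly when committing. The sketch below relies on the `GIT_AUTHOR_*` and `GIT_COMMITTER_*` environment variables that Git reads; the names, e-mail addresses, and file are invented for the example.

```bash
# Sketch: record different author and committer identities on one commit.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
echo "fix" > patch.txt
git add patch.txt

# Git takes both identities from these environment variables when they are set.
GIT_AUTHOR_NAME="Alice Author"       GIT_AUTHOR_EMAIL="alice@example.com" \
GIT_COMMITTER_NAME="Carol Committer" GIT_COMMITTER_EMAIL="carol@example.com" \
  git commit -q -m "Apply Alice's patch"

# %an is the author name, %cn is the committer name.
who=$(git log -1 --format='author=%an committer=%cn')
echo "$who"
# prints: author=Alice Author committer=Carol Committer
```

`git cat-file -p HEAD` in the same repository shows the two separate `author` and `committer` header lines.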

### 4. Tags - The Bookmarks

**Annotated tags** are objects containing:

* Pointer to the tagged object (usually a commit) and its type
* Tag name
* Tagger information
* Tag message

```bash theme={null}
$ git cat-file -p v1.0.0
object a1b2c3d4e5f6789012345678901234567890abcd
type commit
tag v1.0.0
tagger Jane Smith <jane@example.com> 1701590400 +0000

Release version 1.0.0
- Added authentication
- Fixed critical bugs
```

**Lightweight tags** are just refs that point straight at a commit. No tag object is created, so there is no tagger, date, or message to inspect later.
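
The difference between the two kinds of tag shows up in the object type each name resolves to. This is a minimal sketch in a scratch repository, assuming `git` is available; the identity, tag names, and file are illustrative only.

```bash
# Sketch: compare what a lightweight and an annotated tag resolve to.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
echo "release" > notes.txt
git add notes.txt
git commit -q -m "First release"

git tag v1.0.0-light                            # lightweight: a ref only
git tag -a v1.0.0 -m "Release version 1.0.0"    # annotated: a tag object

light_type=$(git cat-file -t v1.0.0-light)      # ref points at the commit
annotated_type=$(git cat-file -t v1.0.0)        # ref points at a tag object
echo "lightweight -> $light_type, annotated -> $annotated_type"

# Peeling the annotated tag with ^{commit} reaches the same commit.
peeled=$(git rev-parse 'v1.0.0^{commit}')
[ "$peeled" = "$(git rev-parse v1.0.0-light)" ] && echo "both tags lead to $peeled"
```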

***

## The Object Database Structure

All objects live in `.git/objects/`, organized by hash:

```
.git/objects/
├── 8b/
│   └── 137891791fe96927ad78e64b0aad7bded08bdc  # First 2 chars = dir
├── a1/
│   └── b2c3d4e5f6789012345678901234567890abcd
├── info/
│   └── packs
└── pack/
    ├── pack-abc123.idx
    └── pack-abc123.pack
```

Objects are compressed with zlib. The 2-character directory split prevents filesystem issues with too many files in one directory.
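
The directory split is a plain string operation on the hash. The helper below is a hypothetical function, not a Git command, that maps an object ID to where its loose object file would live (it does not check whether the object is loose or already packed).

```bash
# Sketch: map an object hash to the path of its loose object file.
object_path() {
  local hash=$1
  # The first two hex characters name the directory, the rest name the file.
  printf '.git/objects/%s/%s\n' "${hash:0:2}" "${hash:2}"
}

object_path ce013625030ba8dba906f756967f9e9ca394464a
# prints: .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```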

***

## How SHA-1 Hashing Works

Git computes hashes by prepending a header to content:

```
Header: "<type> <size>\0"
Content: <raw bytes>
Hash: SHA-1(Header + Content)
```

Example for a blob:

```bash theme={null}
# Manual hash calculation: "hello\n" is 6 bytes, \0 separates header and content
$ printf 'blob 6\0hello\n' | sha1sum
ce013625030ba8dba906f756967f9e9ca394464a  -

# Same as git hash-object
$ echo "hello" | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a
```
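
The header-plus-content recipe can be wrapped in a small function that works for any file. This is a sketch that assumes bash and GNU `sha1sum` (on macOS, `shasum` would be the likely substitute); `git_blob_hash` is a name invented here.

```bash
# Sketch: recompute a blob's object ID without calling git.
git_blob_hash() {
  local file=$1 size
  size=$(wc -c < "$file" | tr -d '[:space:]')   # content length in bytes
  # Header "blob <size>\0" followed by the raw content, hashed as one stream.
  { printf 'blob %s\0' "$size"; cat "$file"; } | sha1sum | cut -d' ' -f1
}

tmp=$(mktemp)
echo "hello" > "$tmp"
git_blob_hash "$tmp"
# prints: ce013625030ba8dba906f756967f9e9ca394464a
```

The size in the header is a byte count, so the trailing newline that `echo` appends is part of what gets hashed.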

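**SHA-1 collision concerns**: Yes, SHA-1 has known weaknesses. The 2017 SHAttered attack produced two different files with the same SHA-1 from carefully crafted input. Git has since adopted a collision-detecting SHA-1 implementation and supports SHA-256 repositories (`git init --object-format=sha256`), although SHA-1 remains the default. For now, attacks against Git repositories specifically remain theoretical.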

***

## The Index (Staging Area) Demystified

The index (`.git/index`) is a binary file that tracks:

* Which files are staged
* Their blob hashes
* Timestamps, permissions, sizes

```bash theme={null}
# View index contents
$ git ls-files --stage
100644 8b137891791fe96927ad78e64b0aad7bded08bdc 0       README.md
100644 a5c19667710254f835085b99726e523457150e03 0       package.json
```

The stage number (0) matters for merge conflicts:

* **Stage 0**: Normal, no conflict
* **Stage 1**: Common ancestor version
* **Stage 2**: Our version (HEAD)
* **Stage 3**: Their version (merging branch)

```bash theme={null}
# During a merge conflict
$ git ls-files --stage
100644 abc123... 1       file.txt  # Ancestor
100644 def456... 2       file.txt  # Ours
100644 789abc... 3       file.txt  # Theirs
```

### Why the Index is Brilliant

The index is one of Git's most under-appreciated design decisions. Most version control systems go straight from "changed files" to "committed." Git's staging area gives you an editing step in between.

1. **Speed**: Comparing file modification times and sizes against the index is much faster than hashing every file's contents on each `git status`
2. **Granularity**: `git add -p` lets you stage individual hunks within a file -- commit the bugfix on line 42 but not the debug logging you added on line 100
3. **Atomic commits**: Build up your commit piece by piece before finalizing. Changed 10 files but only 3 are related? Stage those 3, commit, then handle the rest separately.
4. **Three-way merge**: During conflicts, the index stores all three versions (ancestor, ours, theirs), giving merge tools everything they need for resolution
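
The index's role as a snapshot separate from the working tree shows up when a file is edited again after staging. A minimal sketch in a scratch repository, assuming `git` is installed; `app.txt` is a hypothetical file.

```bash
# Sketch: the index keeps the staged blob even after the file changes again.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "version 1" > app.txt
git add app.txt                      # writes a blob and records it in the index
staged=$(git ls-files --stage app.txt | awk '{print $2}')

echo "version 2" > app.txt           # edit the working tree without staging
worktree=$(git hash-object app.txt)  # hash of the current file content

echo "index:        $staged"
echo "working tree: $worktree"
git cat-file -p "$staged"            # prints: version 1
```

A commit made at this point would contain "version 1", because commits are built from the index rather than from the working tree.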

***

## Refs - The Human-Readable Pointers

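Refs are names that map to object hashes. They make Git usable. By default each ref is a small file under `.git/refs/`; after `git gc` or `git pack-refs`, many of them are consolidated into the single `.git/packed-refs` file, so `git rev-parse <ref>` is the reliable way to read one.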

```
.git/refs/
├── heads/          # Local branches
│   ├── main        # Contains: a1b2c3d4...
│   └── feature-x   # Contains: d5e6f7a8...
├── remotes/        # Remote-tracking branches
│   └── origin/
│       ├── main
│       └── feature-y
└── tags/           # Tags
    └── v1.0.0
```

```bash theme={null}
# View what a ref points to
$ cat .git/refs/heads/main
a1b2c3d4e5f6789012345678901234567890abcd

# HEAD is special - it's usually a symbolic ref
$ cat .git/HEAD
ref: refs/heads/main

# Detached HEAD points directly to a commit
$ git checkout a1b2c3d
$ cat .git/HEAD
a1b2c3d4e5f6789012345678901234567890abcd
```
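
Following a ref by hand is a short loop: read the file, and if it starts with `ref: `, follow it. The function below is a simplified sketch, not how Git itself is implemented; it handles loose refs and a plain `packed-refs` file only, and `resolve_ref` is a name made up for this example.

```bash
# Sketch: resolve a ref name to a hash by reading files under a git directory.
resolve_ref() {
  local git_dir=$1 name=$2 content
  if [ -f "$git_dir/$name" ]; then
    content=$(cat "$git_dir/$name")
    case $content in
      "ref: "*) resolve_ref "$git_dir" "${content#ref: }" ;;  # symbolic ref
      *)        printf '%s\n' "$content" ;;                   # direct hash
    esac
  elif [ -f "$git_dir/packed-refs" ]; then
    # packed-refs lines look like "<hash> <refname>"
    awk -v ref="$name" '$2 == ref { print $1; found = 1; exit }
                        END { exit !found }' "$git_dir/packed-refs"
  else
    return 1
  fi
}

# Usage inside a repository: resolve_ref .git HEAD
```

In real repositories `git rev-parse HEAD` is the supported way to do this, since it also covers worktrees and alternative ref storage.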

### The Reflog - Your Safety Net

Every time HEAD moves, Git logs it in the reflog:

```bash theme={null}
$ git reflog
a1b2c3d HEAD@{0}: commit: Add authentication
d5e6f7a HEAD@{1}: checkout: moving from feature-x to main
b8c9d0e HEAD@{2}: commit: WIP feature
...

# Recover a "lost" commit
$ git checkout HEAD@{2}

# Reflog is local only; entries expire after 90 days by default (30 days if unreachable)
$ git reflog expire --expire=now --all  # Don't do this
```
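
A recovery can be rehearsed safely in a scratch repository. The sketch below assumes `git` is installed and uses an invented identity and file; it "loses" a commit with a hard reset and then gives it a branch name again through the reflog.

```bash
# Sketch: lose a commit with a hard reset, then recover it via the reflog.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "one" > notes.txt
git add notes.txt
git commit -q -m "First"
echo "two" > notes.txt
git commit -q -am "Second"
lost=$(git rev-parse HEAD)

git reset -q --hard HEAD~1     # "Second" is no longer on any branch

# HEAD@{0} is the reset itself; HEAD@{1} is where HEAD pointed before it.
git reflog -2
git branch recovered 'HEAD@{1}'
[ "$(git rev-parse recovered)" = "$lost" ] && echo "recovered $lost"
```

Creating a branch, rather than checking out the reflog entry directly, avoids ending up in a detached HEAD state.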

***

## Packfiles - Compression and Efficiency

As repositories grow, storing every object as a separate compressed file becomes wasteful -- the Linux kernel repo has millions of objects, and individual file I/O at that scale is slow. **Packfiles** solve this by bundling objects together with delta compression.

### Delta Compression

Git stores similar objects as deltas (differences):

```
Object A: "Hello, World!"
Object B: "Hello, Git!"

Stored as:
- Object A: Full content
- Object B: "Use Object A, replace 'World' with 'Git'"
```

### Pack Structure

```bash theme={null}
# View pack contents
$ git verify-pack -v .git/objects/pack/pack-abc123.idx

SHA-1           type    size    size-in-pack    offset    depth    base-SHA
a1b2c3d4...     commit  234     180             12        -        -
d5e6f7a8...     tree    89      78              192       -        -
8b137891...     blob    2048    156             270       2        f0e1d2c3
```

### When Packing Happens

* `git gc` - Manual garbage collection
* `git push` - Objects packed for transfer
* `git fetch` - Receive packfiles
* Automatically (`git gc --auto`) when the number of loose objects exceeds the `gc.auto` threshold (6700 by default)

```bash theme={null}
# Force repacking
$ git gc --aggressive

# Repack with delta depth optimization
$ git repack -a -d -f --depth=250 --window=250
```
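
The move from loose objects to a packfile can be watched with `git count-objects -v`. This is a sketch in a scratch repository with an invented identity and file; `count_field` is a small helper defined here, not a Git command.

```bash
# Sketch: count loose and packed objects before and after `git gc`.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Three commits, each adding one line: 3 blobs + 3 trees + 3 commits.
for i in 1 2 3; do
  echo "line $i" >> log.txt
  git add log.txt
  git commit -q -m "Add line $i"
done

# Pull one numeric field (e.g. "count" or "in-pack") out of count-objects -v.
count_field() { git count-objects -v | awk -v key="$1:" '$1 == key {print $2}'; }

loose_before=$(count_field count)
git gc -q
loose_after=$(count_field count)
packed=$(count_field in-pack)

echo "loose before gc: $loose_before"
echo "loose after gc:  $loose_after"
echo "objects in pack: $packed"
```

After `git gc`, the loose objects are gone and the same number of objects is reported as packed.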

***

## The Directed Acyclic Graph (DAG)

Commits form a DAG - a graph with no cycles where edges point backwards (to parents).

```
Initial:     A

Linear:      A---B---C

Branch:      A---B---C
                  \
                   D---E

Merge:       A---B---C---F
                  \     /
                   D---E

Octopus:           C
                  / \
             A---B-D-G    (G has three parents: C, D, E)
                  \ /
                   E
```

### Why DAG Matters

1. **Reachability**: "Is commit X an ancestor of Y?" is answered by walking parent pointers back from Y
2. **Common ancestor**: Three-way merge needs merge base
3. **History traversal**: `git log` walks the DAG
4. **Garbage collection**: Unreachable commits are pruned

```bash theme={null}
# Find merge base (common ancestor)
$ git merge-base main feature-x
a1b2c3d4e5f6789012345678901234567890abcd

# Check if commit is ancestor
$ git merge-base --is-ancestor a1b2c3d main && echo "Yes"
```
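
These queries are easier to trust after running them on a graph whose shape is known. The sketch below builds a tiny history (A on both branches, B only on `main`, D only on `feature`) in a scratch repository; the identity, branch names, and the `make_commit` helper are all invented for the demo.

```bash
# Sketch: build a small DAG and query it with merge-base.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main   # fix the branch name for the demo

# Helper: append a line to a file and commit it with that line as the message.
make_commit() { echo "$1" >> history.txt; git add history.txt; git commit -q -m "$1"; }

make_commit A
a=$(git rev-parse HEAD)
git branch feature                      # feature starts at A
make_commit B                           # main:    A---B
git checkout -q feature
make_commit D                           # feature: A---D

base=$(git merge-base main feature)
[ "$base" = "$a" ] && echo "merge base is A"

# --is-ancestor answers through its exit status.
if git merge-base --is-ancestor "$a" main; then
  echo "A is an ancestor of main"
fi
if ! git merge-base --is-ancestor feature main; then
  echo "feature has not been merged into main"
fi
```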

***

## How Merge Actually Works

Understanding the three-way merge algorithm:

### Setup

```
         Base (B)
        /        \
    Ours (O)    Theirs (T)
```

### The Algorithm

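For each file, compare Base (B), Ours (O), and Theirs (T). In the table, the letters A, B, and C stand for distinct file contents (unrelated to the B for Base above), and `-` means the file does not exist on that side:

| Base | Ours | Theirs | Result                                               |
| ---- | ---- | ------ | ---------------------------------------------------- |
| A    | A    | A      | A (unchanged)                                        |
| A    | A    | B      | B (they changed)                                     |
| A    | B    | A      | B (we changed)                                       |
| A    | B    | B      | B (both same change)                                 |
| A    | B    | C      | Line-level merge; CONFLICT only if changes overlap   |
| -    | A    | -      | A (we added)                                         |
| -    | -    | A      | A (they added)                                       |
| -    | A    | A      | A (both added the same content)                      |
| A    | -    | -      | DELETE (both deleted)                                |
| A    | -    | A      | DELETE (we deleted)                                  |
| A    | A    | -      | DELETE (they deleted)                                |
| -    | A    | B      | CONFLICT (both added different)                      |
| A    | B    | -      | CONFLICT (we changed, they deleted)                  |
| A    | -    | B      | CONFLICT (we deleted, they changed)                  |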
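
The file-level decision can be written as a few comparisons. This is a simplified sketch of the table's logic, not Git's actual merge code. One deliberate difference: when both sides modified an existing file in different ways, Git goes on to merge the contents line by line and only conflicts where the changes overlap, so the sketch reports that case as `content-merge`. The function name and the `-` convention for an absent file are assumptions made here.

```bash
# Sketch of the file-level three-way decision. Each argument is a label for a
# file's content (for example a blob hash); "-" means the file is absent.
merge_decision() {
  local base=$1 ours=$2 theirs=$3 result
  if [ "$ours" = "$theirs" ]; then
    result=$ours                  # both sides agree (including "both deleted")
  elif [ "$ours" = "$base" ]; then
    result=$theirs                # only they changed it
  elif [ "$theirs" = "$base" ]; then
    result=$ours                  # only we changed it
  elif [ "$base" = "-" ] || [ "$ours" = "-" ] || [ "$theirs" = "-" ]; then
    echo "conflict"; return       # add/add or modify/delete
  else
    echo "content-merge"; return  # both modified: merge line by line
  fi
  if [ "$result" = "-" ]; then echo "delete"; else echo "take $result"; fi
}

merge_decision A A B   # prints: take B
merge_decision A B C   # prints: content-merge
merge_decision A B -   # prints: conflict
```

Because the comparison is on hashes, the common cases (one side changed, or both made the same change) are settled without reading any file content.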

### Inside a Merge Conflict

```bash theme={null}
# The three versions during conflict
$ git show :1:file.txt  # Base (stage 1)
$ git show :2:file.txt  # Ours (stage 2)
$ git show :3:file.txt  # Theirs (stage 3)

# Conflict markers in file
<<<<<<< HEAD
our changes
=======
their changes
>>>>>>> feature-branch
```
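
To see the three stages with real data, a conflict can be produced on purpose in a scratch repository. The sketch assumes `git` is installed; the identity, branch names, and file contents are invented so that each stage is easy to recognize.

```bash
# Sketch: create a conflict on purpose and read the three index stages.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main    # fix the branch name for the demo

echo "base" > file.txt
git add file.txt
git commit -q -m "Base version"

git checkout -q -b feature-branch
echo "their changes" > file.txt
git commit -q -am "Change on feature-branch"

git checkout -q main
echo "our changes" > file.txt
git commit -q -am "Change on main"

# The merge stops with a conflict, so a non-zero exit status is expected here.
git merge feature-branch > /dev/null 2>&1 || echo "merge stopped with conflicts"

git ls-files --unmerged                  # one line per stage (1, 2, 3)
for stage in 1 2 3; do
  printf 'stage %s: %s\n' "$stage" "$(git show ":$stage:file.txt")"
done
```

The loop prints `base`, `our changes`, and `their changes` for stages 1, 2, and 3, while `file.txt` in the working tree holds the version with conflict markers.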

***

## Interview Deep Dive Questions

<AccordionGroup>
  <Accordion title="What is the Git object model?" icon="circle-question">
    **Answer**: Git uses four object types: blobs (file content), trees (directories mapping names to blobs/trees), commits (snapshots pointing to a tree plus metadata), and annotated tags (named pointers with metadata). Objects are identified by SHA-1 hash of their content, making Git a content-addressable filesystem.
  </Accordion>

  <Accordion title="What is the difference between git merge and git rebase?" icon="circle-question">
    **Answer**: Merge creates a new commit with two parents (unless the branch can simply be fast-forwarded), preserving full history. Rebase replays commits on top of another branch, rewriting commit hashes and creating linear history. Merge is safer (no rewritten history), rebase is cleaner (linear log). Never rebase public/shared branches.
  </Accordion>

  <Accordion title="How does Git detect file renames?" icon="circle-question">
    **Answer**: Git does not track renames explicitly. It uses heuristics during diff/log to detect renames by comparing blob content. If files are >50% similar (configurable with `-M`), Git considers it a rename. This is why renaming and modifying in the same commit can confuse detection.
  </Accordion>

  <Accordion title="What happens during git checkout?" icon="circle-question">
    **Answer**: Checkout updates three things: 1) HEAD (point to new commit/branch), 2) Index (update staged files to match commit), 3) Working directory (update files to match index). If switching branches with uncommitted changes, Git refuses if changes would be overwritten.
  </Accordion>

  <Accordion title="How does git gc work?" icon="circle-question">
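    **Answer**: Garbage collection: 1) Packs loose objects into packfiles with delta compression, 2) Expires old reflog entries (90 days by default, 30 days for entries that are no longer reachable), 3) Removes objects unreachable from any ref or reflog once they are older than the grace period (`gc.pruneExpire`, two weeks by default), 4) Packs refs into `.git/packed-refs` and cleans up emptied directories in `.git/objects`. It runs automatically after some commands when the number of loose objects exceeds the `gc.auto` threshold.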
  </Accordion>

  <Accordion title="Explain detached HEAD state" icon="circle-question">
    **Answer**: Normally HEAD points to a branch name (symbolic ref), which points to a commit. Detached HEAD means HEAD points directly to a commit hash. Commits made in this state are not on any branch. When you checkout something else, those commits are reachable only through the reflog, and they will eventually be garbage collected once those entries expire (unless you create a branch or tag that points at them).
  </Accordion>
</AccordionGroup>
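
The rename heuristic mentioned above can be observed directly, since the same commit is reported differently depending on whether detection is enabled. A sketch in a scratch repository with invented file names and identity:

```bash
# Sketch: the same commit seen with and without rename detection.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

printf 'line one\nline two\nline three\n' > old-name.txt
git add old-name.txt
git commit -q -m "Add file"

git mv old-name.txt new-name.txt
git commit -q -m "Rename file"

# -M enables rename detection; R100 means "renamed, 100% similar".
with_detection=$(git diff -M --name-status HEAD~1 HEAD)
echo "$with_detection"

# Without detection the same change is one deletion plus one addition.
git diff --no-renames --name-status HEAD~1 HEAD
```

Nothing about the rename is stored in the commit itself; both views are computed from the same two trees.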

***

## Exploring Internals Yourself

```bash theme={null}
# Create a new repo and explore
$ git init internals-demo && cd internals-demo

# Create and hash a file manually
$ echo "test content" > test.txt
$ git hash-object -w test.txt
d670460b4b4aece5915caf5c68d12f560a9fe3e4

# See where it's stored
$ ls .git/objects/d6/
70460b4b4aece5915caf5c68d12f560a9fe3e4

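# Decompress and view (assumes python3; the NUL byte after the size is invisible)
$ cat .git/objects/d6/70460* | python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))"
blob 13test content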

# Make a commit and explore its structure
$ git add test.txt
$ git commit -m "Initial commit"
$ git cat-file -p HEAD
$ git cat-file -p HEAD^{tree}
```

***

## Key Takeaways

1. **Git is a content-addressable filesystem** - content hashes are keys
2. **Four object types**: blobs, trees, commits, annotated tags
3. **SHA-1 hashes ensure integrity** - any change = different hash
4. **The index is the staging area** - binary file tracking staged state
5. **Refs make hashes human-readable** - branches and tags are just files
6. **Packfiles optimize storage** - delta compression for similar objects
7. **History is a DAG** - commits point to parents, forming a graph
8. **Three-way merge uses common ancestor** - compares base, ours, theirs

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="Explain how Git's content-addressable storage enables deduplication, and give me a practical example of where this saves significant disk space in a real repository." icon="circle-question">
    **Strong Answer:**

    * Git hashes every piece of content (blobs, trees, commits) with SHA-1. The hash is derived purely from the content itself, which means identical content always produces the same hash, regardless of filename, location, or when it was created.
    * At the blob level, if two files have identical content, Git stores a single blob and both tree entries point to the same hash. Rename a file? Same blob, different tree entry. Copy a file to a new directory? Same blob. This is automatic and invisible.
    * Practical example: a monorepo with 50 microservices that all share a common `LICENSE` file, `.editorconfig`, and a `Makefile` template. Without deduplication, you would store 50 copies of each file across the commit history. With Git, there is exactly one blob per unique file version, regardless of how many directories reference it.
    * At the packfile level, Git goes further with delta compression. When you run `git gc`, Git identifies similar objects and stores only the differences. If you have a 1MB configuration file and make a 10-byte change, the packfile stores one full version plus a small delta (the changed bytes and a few copy instructions), not two 1MB copies. For large repositories with many similar files (like generated code, documentation with minor variations), this can shrink the on-disk size substantially, though the exact ratio depends heavily on the content.
    * The Linux kernel repository demonstrates this beautifully. It has 1M+ commits and millions of file versions, but the repository is only \~4GB because of aggressive deduplication and delta compression. Without these techniques, it would be hundreds of gigabytes.

    **Follow-up: If Git deduplicates by content, what happens if two developers create different files with identical content? Is there any risk of collision?**

    No risk from identical content -- that is by design. If two developers independently create a file with the same content, Git creates one blob and both tree entries point to it. This is correct behavior. SHA-1 collision (two different contents producing the same hash) is theoretically possible but practically infeasible for organic content. The known SHA-1 collision (SHAttered attack) requires specifically crafted input and is not a realistic threat to Git repositories. Git is also transitioning to SHA-256 for defense in depth.
  </Accordion>

  <Accordion title="A colleague says 'Git stores diffs between files.' Correct this misconception and explain why Git's actual approach is superior for the operations Git prioritizes." icon="circle-question">
    **Strong Answer:**

    * Git stores complete snapshots, not diffs. Every commit points to a tree object that represents the full state of every file at that point in time. If you have 100 files and change 1, the commit's tree references the same 99 unchanged blobs (by hash, so no duplication) and one new blob for the changed file.
    * This suits the operations Git prioritizes better than a model built on chains of diffs. First, branching and checkout: to check out any commit, Git just reads its tree and blobs, whereas a purely delta-based store has to reconstruct each file by applying a series of deltas to a stored base version. Second, diffing between arbitrary commits: Git compares two trees directly and can skip any subtree whose hash is unchanged, instead of combining the deltas recorded between two revisions. Third, merge: Git's three-way merge compares three complete snapshots (base, ours, theirs), so it works on complete file states rather than serially replaying patches.
    * The concern about disk space is addressed by packfiles. When you run `git gc`, Git compresses objects using delta encoding -- but this is a storage optimization, not the core data model. The delta compression in packfiles is chosen for storage efficiency (often based on similar content, not chronological order), which can actually be more space-efficient than chronological diffs.
    * The key insight: Git optimizes for speed of operations (checkout, branch, merge, diff) at the cost of naive storage size, then recovers the storage cost through smart compression. This is the right trade-off because developers run those operations constantly, while repacking happens only occasionally in the background.

    **Follow-up: When Git does delta compression in packfiles, how does it choose which objects to delta against? It is not just the previous version, right?**

    Correct. Git's delta compression is content-based, not history-based. During packing, Git sorts candidate objects roughly by type, file name, and size, then tries to delta each object against nearby objects in that ordering (how many is controlled by the `--window` parameter). A file might be delta'd against a completely different file that happens to have similar content, or against a version from a different branch that is closer in content than the chronological predecessor. This window-based approach often produces better compression than chronological diffs because the most similar content is not always the previous version. It might be a file from a parallel branch or a similar file in a different directory.
  </Accordion>

  <Accordion title="You are investigating why a Git repository has grown to 5GB despite only having 200MB of source code. Walk me through how you would diagnose this and reduce the size." icon="circle-question">
    **Strong Answer:**

    * The most common cause is large binary files committed to the repository. Even if they were deleted in a later commit, they still exist in Git history. Videos, database dumps, compiled binaries, and node\_modules accidentally committed are the usual culprits.
    * Diagnosis: I would run `git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -rnk3 | head -20` to find the 20 largest objects in the repository. This reveals the specific files consuming space. I would also run `git verify-pack -v .git/objects/pack/*.idx | sort -rnk3 | head -20` to check packfile contents.
    * If the culprits are historical large files that are no longer needed, I would use `git filter-repo --strip-blobs-bigger-than 10M` to remove all objects larger than 10MB from the entire history. Alternatively, `git filter-repo --path data-dump.sql --invert-paths` removes a specific file. This rewrites history, so all team members must re-clone.
    * For ongoing prevention: add large file patterns to `.gitignore`, and implement a pre-commit hook that rejects files above a size threshold. For large files that genuinely need versioning (design assets, test fixtures), use Git LFS, which stores large files in a separate server and keeps only lightweight pointers in the repository.
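    * After cleanup, `git gc --aggressive --prune=now` repacks the repository. By default `git filter-repo` already expires reflogs and prunes old objects as its final step, so this mainly matters if you used a different tool, in which case run `git reflog expire --expire=now --all` first. Then force-push all branches and tags. On the hosting platform (GitHub, GitLab), the old objects may persist until garbage collection runs on the server side, which on some platforms you have to request or trigger yourself.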

    **Follow-up: After removing the large files with filter-repo, some team members still have the 5GB repository. What is the fastest way to get everyone onto the clean version?**

    The cleanest approach is a fresh clone. Tell the team to delete their local repository and clone again from the remote (after you have force-pushed the rewritten history). Developers with unpushed work should not simply push their old branches, because those branches are built on the old history and would re-upload the large objects. Instead, they can export their commits as patches with `git format-patch`, re-clone, and apply them with `git am`. Running `git pull` in an existing clone tries to merge the old and rewritten histories, which brings the old commits back and typically produces many conflicts because every commit hash changed. A fresh clone is simpler and safer.
  </Accordion>
</AccordionGroup>
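
The diagnosis pipeline from the repository-size question can be kept around as a small function. This is a sketch built from the same `git rev-list` and `git cat-file --batch-check` commands quoted above; the name `largest_blobs` and the default of 20 results are choices made here, and it must be run from inside a repository.

```bash
# Sketch: list the N largest blobs anywhere in history (default 20).
largest_blobs() {
  local n=${1:-20}
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |   # ignore commits and trees
    sort -k3,3nr |         # largest size first
    head -n "$n"
}

# Usage inside a repository: largest_blobs 10
```

Each output line has the form `blob <hash> <size-in-bytes> <path>`, so files that were deleted long ago still show up with the path they had when they were committed.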

***

Ready to master branching strategies? Next up: [Git Branching](/courses/devops-tools/git-branching) where we will explore GitFlow, trunk-based development, and merge vs rebase.
