File Systems
The file system is the OS component that organizes persistent data on storage devices. Understanding file systems is crucial for performance tuning, debugging, and system design interviews.
Interview Frequency: High
Key Topics: Inodes, journaling, ext4, VFS layer
Time to Master: 10-12 hours
File System Basics
What a File System Provides
Abstraction
Files and directories instead of raw disk blocks
Naming
Human-readable names with hierarchical paths
Persistence
Data survives power cycles and reboots
Protection
Access control via permissions (rwx)
File System Layout
Inodes
The inode (index node) contains all file metadata except the name: type and permission bits, owner, size, timestamps, link count, and block pointers.
Block Pointers
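A minimal sketch of an ext2-style inode follows; field names and sizes are illustrative (this is not the real on-disk struct), but it shows where the block pointers live and that the filename is absent:

```c
/* Illustrative sketch of an ext2-style inode; not the real struct ext2_inode. */
#include <stdint.h>
#include <stdio.h>

struct inode {
    uint16_t mode;                 /* file type + rwx permission bits */
    uint16_t uid, gid;             /* owner and group */
    uint32_t size;                 /* file size in bytes */
    uint32_t atime, mtime, ctime;  /* access / modify / change timestamps */
    uint16_t nlink;                /* hard link count */
    uint32_t block[15];            /* 12 direct + single/double/triple indirect */
    /* note: the file NAME is not here; names live in directory entries */
};

int main(void) {
    /* with 4 KiB blocks, the 12 direct pointers alone cover 48 KiB of data */
    printf("direct pointers cover %d KiB\n", 12 * 4);
    printf("sketch inode is %zu bytes\n", sizeof(struct inode));
    return 0;
}
```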
Hard Links vs Soft Links
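A quick way to see the difference is to create both kinds of link and compare inode numbers and link counts; the /tmp file names below are just for illustration:

```c
/* Hard link: a second directory entry for the SAME inode.
 * Symlink: a separate inode whose contents are a path string. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

static void show(const char *path) {
    struct stat st;
    if (lstat(path, &st) == 0)   /* lstat: do not follow symlinks */
        printf("%-16s inode=%lu nlink=%lu %s\n", path,
               (unsigned long)st.st_ino, (unsigned long)st.st_nlink,
               S_ISLNK(st.st_mode) ? "(symlink)" : "");
}

int main(void) {
    unlink("/tmp/hard.txt");                    /* clean up from prior runs */
    unlink("/tmp/soft.txt");
    system("echo hello > /tmp/orig.txt");       /* create a file to link to */

    link("/tmp/orig.txt", "/tmp/hard.txt");     /* same inode, nlink becomes 2 */
    symlink("/tmp/orig.txt", "/tmp/soft.txt");  /* new inode, stores the path */

    show("/tmp/orig.txt");
    show("/tmp/hard.txt");   /* same inode number as orig */
    show("/tmp/soft.txt");   /* different inode */

    unlink("/tmp/orig.txt"); /* hard link still works; symlink now dangles */
    return 0;
}
```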
Directory Structure
A directory is a special file mapping names to inode numbers.
Path Resolution
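A hedged sketch of what path resolution does, walking /usr/bin/ls one component at a time and printing each name-to-inode mapping with readdir(). The kernel uses its own lookup machinery for this; the path is assumed to exist on the test machine:

```c
/* Manual path resolution for illustration: resolve each component of a path
 * by scanning the parent directory's entries for a matching name. */
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <dirent.h>

int main(void) {
    char path[] = "/usr/bin/ls";        /* example path, assumed to exist */
    char prefix[PATH_MAX] = "/";        /* directory we are currently in */

    for (char *comp = strtok(path, "/"); comp; comp = strtok(NULL, "/")) {
        DIR *dir = opendir(prefix);     /* open the current directory */
        if (!dir) { perror(prefix); return 1; }

        struct dirent *ent;
        while ((ent = readdir(dir)) != NULL) {
            if (strcmp(ent->d_name, comp) == 0) {
                /* the directory entry maps this name to an inode number */
                printf("%-8s -> inode %lu\n", comp, (unsigned long)ent->d_ino);
                break;
            }
        }
        closedir(dir);

        /* descend: append "comp/" to the prefix for the next iteration */
        strncat(prefix, comp, sizeof(prefix) - strlen(prefix) - 2);
        strcat(prefix, "/");
    }
    return 0;
}
```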
File Allocation Strategies
Contiguous Allocation
Linked Allocation
Indexed Allocation (Unix/ext)
Extents (ext4, XFS)
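A minimal sketch contrasting the two on-disk representations; the field names and layouts are invented for illustration, not ext4's actual formats:

```c
/* Indexed allocation vs extents: one pointer per block vs one entry per run. */
#include <stdint.h>
#include <stdio.h>

/* Indexed (classic Unix/ext2): one pointer for every block the file uses. */
struct indexed_file {
    uint32_t direct[12];        /* first 12 block numbers */
    uint32_t single_indirect;   /* block full of more block numbers */
    uint32_t double_indirect;   /* block of pointers to pointer blocks */
};

/* Extent (ext4/XFS): one entry describes a whole contiguous run of blocks. */
struct extent {
    uint64_t logical_start;     /* first file block this extent covers */
    uint64_t physical_start;    /* first disk block of the contiguous run */
    uint32_t length;            /* number of contiguous blocks in the run */
};

int main(void) {
    /* with 4 KiB blocks: 12 direct pointers cover 48 KiB of file data,
       while one ext4 extent can cover up to 32768 blocks = 128 MiB */
    printf("indexed: %d KiB from the direct pointers\n", 12 * 4);
    printf("extent:  %d MiB from a single three-field entry\n", 32768 * 4 / 1024);
    return 0;
}
```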
Journaling
Journaling ensures crash consistency: the file system can always be recovered to a consistent state.
The Problem
A single operation such as appending to a file updates several blocks (the data block, the inode, and the block bitmap). A crash between those writes leaves the structures inconsistent with each other.
Journaling Modes
- Data Journaling (data=journal): write ALL changes (metadata + data) to the journal first, then copy them to their final location (see the sketch after this list).
  - Safest: all data guaranteed consistent
  - Slowest: data written twice (double-write)
- Ordered Journaling (data=ordered, the default): journal metadata only, but force data blocks to disk before the metadata commits, so metadata never points at stale data.
- Writeback (data=writeback): journal metadata only, with no ordering between data and metadata writes; fastest, but files written just before a crash may contain stale data.
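A minimal sketch of the data-journaling write order, assuming an invented record format and file names (this is not ext4/jbd2's real journal layout); the point is the fsync barriers between journal write, commit, and checkpoint:

```c
/* Write-ahead ordering: journal the update, commit it, then write in place. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static void write_and_sync(int fd, const void *buf, size_t len) {
    if (write(fd, buf, len) < 0) perror("write");
    fsync(fd);                          /* barrier: this write is now durable */
}

int main(void) {
    int journal = open("/tmp/journal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
    int disk    = open("/tmp/data.img",    O_WRONLY | O_CREAT, 0600);
    if (journal < 0 || disk < 0) { perror("open"); return 1; }

    const char block[] = "new contents of block 42";

    /* 1. journal the pending update (metadata + data in data-journaling mode) */
    write_and_sync(journal, "TxBegin 42 ", 11);
    write_and_sync(journal, block, sizeof block);

    /* 2. commit record: only once this is durable does the update "count" */
    write_and_sync(journal, " TxCommit\n", 10);

    /* 3. checkpoint: copy the update to its final on-disk location */
    lseek(disk, 42 * 4096, SEEK_SET);
    write_and_sync(disk, block, sizeof block);

    /* 4. the journal space can now be reused; recovery replays any committed
       transaction whose checkpoint may not have finished */
    close(journal);
    close(disk);
    return 0;
}
```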
Recovery Process
On the first mount after a crash, the file system scans the journal, replays every transaction that has a commit record, and discards incomplete transactions. Recovery time depends on journal size, not file system size.
Virtual File System (VFS)
The VFS layer provides a uniform interface to different file systems, so the same system calls work on ext4, XFS, NFS, tmpfs, and FUSE mounts.
Key VFS Structures
superblock (a mounted file system), inode (a file's metadata), dentry (a cached name-to-inode mapping), and file (an open file description).
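A hypothetical sketch of VFS-style dispatch (not the real kernel API): each file system fills in a table of function pointers, and the upper layer calls through the table without knowing which implementation is underneath:

```c
/* Function-pointer dispatch in the style of VFS file operations. */
#include <stdio.h>
#include <stddef.h>

struct file_ops {
    size_t (*read)(const char *path, char *buf, size_t len);
    size_t (*write)(const char *path, const char *buf, size_t len);
};

/* One implementation, standing in for a concrete file system like ext4 */
static size_t fake_read(const char *path, char *buf, size_t len) {
    (void)path; (void)buf;
    printf("fakefs: read %zu bytes\n", len);
    return len;
}
static size_t fake_write(const char *path, const char *buf, size_t len) {
    (void)path; (void)buf;
    printf("fakefs: write %zu bytes\n", len);
    return len;
}
static const struct file_ops fakefs_ops = { fake_read, fake_write };

/* The "VFS" layer: callers never see fakefs directly */
static size_t vfs_read(const struct file_ops *ops, const char *path,
                       char *buf, size_t len) {
    return ops->read(path, buf, len);
}

int main(void) {
    char buf[64];
    vfs_read(&fakefs_ops, "/tmp/example", buf, sizeof buf);
    return 0;
}
```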
Modern File Systems
ext4
| Feature | Description |
|---|---|
| Extents | Up to 128MB contiguous allocation |
| Delayed allocation | Batch writes for better layout |
| Journal checksums | Detect journal corruption |
| Flexible block groups | Better large FS management |
| Max file size | 16 TiB (with 4 KiB blocks) |
| Max FS size | 1 EiB |
XFS
| Feature | Description |
|---|---|
| Allocation Groups | Parallel allocation |
| B+ trees | For directories, free space, extents |
| Delayed allocation | Like ext4 |
| Online defrag/resize | Live maintenance |
| Best for | Large files, parallel I/O |
Btrfs
| Feature | Description |
|---|---|
| Copy-on-Write | Snapshots are nearly free (no data is copied) |
| Built-in RAID | 0, 1, 5, 6, 10 |
| Checksums | All data + metadata |
| Compression | zlib, lzo, zstd |
| Snapshots | Instant, writable |
| Subvolumes | Logical partitions |
ZFS
| Feature | Description |
|---|---|
| Copy-on-Write | Never overwrites |
| End-to-end checksums | Detects silent corruption |
| RAID-Z | Like RAID-5/6 but better |
| Deduplication | Block-level |
| ARC cache | Adaptive replacement cache |
| Self-healing | Auto-repair with redundancy |
File System Performance
Buffer Cache and Page Cache
I/O Scheduling
Interview Deep Dive Questions
Q1: Explain what happens when you delete a file
Answer:
1. Path resolution: traverse to the parent directory (/home/user).
2. Remove the directory entry:
   - Find the entry for “file.txt” in the directory data block
   - Mark the entry as deleted (or zero it out)
3. Decrement the inode link count (inode->nlink--). If nlink > 0, stop here: other hard links still exist.
4. If nlink == 0 AND the open count == 0:
   - Mark the inode as free in the inode bitmap
   - Mark the data blocks as free in the data bitmap
   - Blocks are NOT zeroed (just marked free)
5. If nlink == 0 BUT the file is still open:
   - The file becomes “unlinked but open”
   - Deletion is deferred until the last close()
   - Common pattern: temp files are unlinked immediately after open (see the sketch below)
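The sketch below illustrates the “unlinked but open” pattern. The use of mkstemp() and the /tmp name are just one way to do it; O_TMPFILE achieves the same effect on modern Linux:

```c
/* Unlink-while-open: nlink drops to 0, but the fd keeps the inode alive. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    char path[] = "/tmp/scratch-XXXXXX";
    int fd = mkstemp(path);            /* create and open a unique temp file */
    if (fd < 0) { perror("mkstemp"); return 1; }

    unlink(path);                      /* name is gone; inode is still allocated */

    const char msg[] = "still readable after unlink\n";
    write(fd, msg, sizeof msg - 1);    /* data blocks remain allocated */

    lseek(fd, 0, SEEK_SET);
    char buf[64] = {0};
    read(fd, buf, sizeof buf - 1);
    printf("%s", buf);

    close(fd);                         /* last close: inode and blocks are freed */
    return 0;
}
```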
Q2: How does journaling prevent data loss?
Answer: Without journaling, a crash in the middle of a multi-block update can leave the structures inconsistent: a bitmap marks a block allocated that no inode points to, or an inode points at a data block that was never written. With journaling, the update is first written and committed in the journal, and only then copied to its final location; after a crash, recovery replays committed transactions and discards incomplete ones. Key insight: the journal creates atomicity, so multi-block updates become all-or-nothing.
Q3: Compare ext4, XFS, and Btrfs for a database server
Answer:
Database requirements:
- Reliable fsync/O_DIRECT
- Good random I/O
- Crash recovery
- Large file support
ext4:
- ✅ Mature, well-tested
- ✅ Reliable fsync
- ✅ Good all-around performance
- ❌ Limited snapshot support
- Best for: General purpose; often recommended for PostgreSQL
XFS:
- ✅ Excellent parallel I/O
- ✅ Handles large files well
- ✅ Consistent performance
- ❌ Can’t shrink the filesystem
- Best for: Large databases, data warehouses
Btrfs:
- ✅ Snapshots (great for backup)
- ✅ Compression saves space
- ✅ Checksums detect corruption
- ⚠️ RAID 5/6 not production-ready
- ❌ Some workloads show performance issues
- Best for: Workloads that need snapshots; use with RAID 1
Q4: Why is checking if a path exists racy?
Answer:
The problem: checking a path (stat, access) and then using it (open) are two separate system calls. Between them, another process can change what the path refers to. This is a TOCTOU (time-of-check-to-time-of-use) race.
Security implications:
- Attacker replaces the file with a symlink to /etc/shadow
- The check passes on the original file
- The open follows the symlink to the sensitive file
Mitigations:
- Don’t check, just do: call open() and handle the error instead of checking for existence first
- Use O_NOFOLLOW: refuse to follow a symlink at the final path component
- Use a directory fd: open the parent directory once, then use openat() relative to it
- Check after open: fstat() the open descriptor; the descriptor cannot be retargeted
A sketch of the safe pattern follows below.
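The sketch below combines two of the mitigations: open with O_NOFOLLOW, then check the already-open descriptor with fstat() rather than the path. The file name is illustrative:

```c
/* Check after open: inspect the fd, not the path name an attacker can swap. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void) {
    /* O_NOFOLLOW: fail instead of following a symlink at the last component */
    int fd = open("/tmp/report.txt", O_RDONLY | O_NOFOLLOW);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode)) {
        fprintf(stderr, "not a regular file, refusing to read\n");
        close(fd);
        return 1;
    }

    /* safe to read here: the checks above apply to this exact open file */
    char buf[256];
    ssize_t n = read(fd, buf, sizeof buf);
    printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
```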
Q5: Design a distributed file system
Answer:
Requirements analysis:
- Scale to petabytes
- Handle node failures
- Consistency:
  - Single master for serialization
  - Leases for concurrent appends
  - Relaxed consistency (eventual for appends)
- Failure handling:
  - Heartbeats detect chunk server failure
  - Master re-replicates under-replicated chunks
  - Checksums detect silent corruption
- Client flow (sketched below):
  - Ask master for chunk locations
  - Read/write directly to chunk servers
  - Cache metadata, not data
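A hypothetical sketch of the client read flow described above; GFS-style names, 64 MB chunks, and both stubbed calls are assumptions for illustration, not a real API:

```c
/* Client flow: metadata RPC to the master, data RPC straight to a chunk server. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (64ULL * 1024 * 1024)   /* assumed fixed chunk size */

typedef struct {
    uint64_t handle;            /* chunk id assigned by the master */
    const char *replicas[3];    /* chunk servers holding this chunk */
} chunk_location;

/* Step 1: metadata lookup at the master (stubbed; result is cacheable) */
static chunk_location master_lookup(const char *path, uint64_t offset) {
    chunk_location loc = { offset / CHUNK_SIZE,
                           { "cs-a:7100", "cs-b:7100", "cs-c:7100" } };
    printf("master: %s offset %llu -> chunk %llu\n",
           path, (unsigned long long)offset, (unsigned long long)loc.handle);
    return loc;
}

/* Step 2: the data read goes directly to one chunk server (stubbed) */
static long chunkserver_read(const char *server, uint64_t handle,
                             uint64_t off, char *buf, long len) {
    printf("read chunk %llu [%llu, +%ld) from %s\n",
           (unsigned long long)handle, (unsigned long long)off, len, server);
    memset(buf, 'x', (size_t)len);          /* pretend data */
    return len;
}

int main(void) {
    char buf[16];
    chunk_location loc = master_lookup("/logs/app.log", 3 * CHUNK_SIZE + 42);
    chunkserver_read(loc.replicas[0], loc.handle, 42, buf, sizeof buf);
    return 0;
}
```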
Practice Exercises
1. Implement Simple FS: create a FUSE-based file system with inodes, directories, and basic operations.
2. Journaling Simulator: simulate a journal with crash injection. Verify recovery correctness.
3. Benchmark File Systems: compare ext4 and XFS on random vs sequential I/O with fio. Analyze the results.
4. VFS Exploration: write a kernel module that registers a simple file system with VFS.
Key Takeaways
Inodes Store Metadata
Everything except the filename. Hard links share inodes.
Journaling = Crash Safety
Ordered mode balances safety and performance. Recovery is fast.
VFS Enables Flexibility
Same API works for local disk, NFS, FUSE, and more.
Page Cache is Key
Most reads/writes hit cache. Dirty pages flushed in background.
Next: I/O Systems →