
File Systems

The file system is the OS component that organizes persistent data on storage devices. Understanding file systems is crucial for performance tuning, debugging, and system design interviews.
Interview Frequency: High
Key Topics: Inodes, journaling, ext4, VFS layer
Time to Master: 10-12 hours

File System Basics

What a File System Provides

Abstraction

Files and directories instead of raw disk blocks

Naming

Human-readable names with hierarchical paths

Persistence

Data survives power cycles and reboots

Protection

Access control via permissions (rwx)

File System Layout

Inodes

The inode (index node) contains all file metadata except the name:
struct inode {
    // File type and permissions
    mode_t      mode;       // rwxrwxrwx, file type
    
    // Ownership
    uid_t       uid;        // User ID
    gid_t       gid;        // Group ID
    
    // Size and timestamps
    off_t       size;       // File size in bytes
    time_t      atime;      // Last access time
    time_t      mtime;      // Last modification time
    time_t      ctime;      // Last status change time
    
    // Link count
    nlink_t     nlink;      // Number of hard links
    
    // Block pointers
    block_t     direct[12];      // Direct block pointers
    block_t     indirect;        // Single indirect
    block_t     double_indirect; // Double indirect
    block_t     triple_indirect; // Triple indirect
};

Block Pointers

Small files are reached through the 12 direct pointers alone; larger files add single, double, and triple indirect blocks. Note the link distinction: a hard link is another directory entry for the same inode, while a soft (symbolic) link is a separate file whose contents are a path.

Directory Structure

A directory is a special file mapping names to inode numbers.

Path Resolution

Resolving /home/user/file.txt:

1. Start at root inode (usually 2)
2. Read root directory data block
3. Find "home" → inode 15
4. Read inode 15's data block (home directory)
5. Find "user" → inode 100
6. Read inode 100's data block (user directory)
7. Find "file.txt" → inode 200
8. Return inode 200

Each "/" component means a directory lookup, and each lookup is a potential disk read (the kernel's dentry cache makes repeated resolutions cheap).

File Allocation Strategies

Contiguous Allocation

Each file occupies one contiguous run of disk blocks. Sequential access is fast and the metadata is tiny (start block + length), but growing a file is hard and the disk suffers external fragmentation.

Linked Allocation

Each block stores a pointer to the file's next block (FAT moves these pointers into a central table). Files grow easily and there is no external fragmentation, but random access means walking the chain: O(n).

Indexed Allocation (Unix/ext)

The inode acts as an index block holding pointers to the file's data blocks, so random access needs only a few pointer lookups; very large files pay extra reads through the indirect blocks.

Extents (ext4, XFS)

Instead of one pointer per block, the inode stores extents: (start, length) pairs describing contiguous runs. Large files need far less metadata and tend to stay contiguous on disk.

Journaling

Journaling ensures crash consistency — the file system is always recoverable:

The Problem

Writing a file requires multiple disk operations:
1. Update inode (size, block pointers)
2. Update data bitmap (mark blocks used)
3. Write data blocks
4. Update directory entry

Crash in the middle → inconsistent state!

Journaling Modes

Data journaling (ext4's data=journal) writes ALL changes (metadata + data) to the journal first:
1. Write data to journal
2. Write metadata to journal
3. Commit journal entry
4. Write data to final location
5. Write metadata to final location
6. Free journal entry
  • Safest: All data guaranteed consistent
  • Slowest: Data written twice (double-write)

Ordered mode (data=ordered, the ext4 default) journals only metadata, but flushes data blocks to their final location before committing the metadata that references them. Writeback mode (data=writeback) journals metadata with no ordering guarantee for data: fastest, but a file can contain stale blocks after a crash.

Recovery Process

On mount after crash:

1. Scan journal for committed entries
2. For each committed entry:
   - Replay writes to final locations
3. For uncommitted entries:
   - Discard (operation was incomplete)
4. Clear journal
5. Mount complete — consistent state

Recovery is O(journal_size), not O(filesystem_size)

Virtual File System (VFS)

The VFS layer provides a uniform interface to different file systems.

Key VFS Structures

struct inode {
    umode_t                 i_mode;     // Permissions + type
    uid_t                   i_uid;      // Owner
    gid_t                   i_gid;      // Group
    loff_t                  i_size;     // File size
    struct timespec         i_atime;    // Access time
    struct timespec         i_mtime;    // Modify time
    const struct inode_operations *i_op;
    const struct file_operations  *i_fop;
    struct address_space    *i_mapping; // Page cache
};

struct dentry {
    struct inode            *d_inode;   // Associated inode
    struct dentry           *d_parent;  // Parent directory
    struct qstr              d_name;    // Name component
    struct list_head         d_subdirs; // Child dentries
    // Used for path caching
};

struct file {
    struct path              f_path;    // Dentry + mount
    const struct file_operations *f_op;
    loff_t                   f_pos;     // Current position
    unsigned int             f_flags;   // Open flags
};

Modern File Systems

ext4

  • Extents: up to 128 MB of contiguous allocation
  • Delayed allocation: batches writes for better layout
  • Journal checksums: detect journal corruption
  • Flexible block groups: better large-FS management
  • Max file size: 16 TB
  • Max FS size: 1 EB

XFS

  • Allocation groups: parallel allocation
  • B+ trees: for directories, free space, extents
  • Delayed allocation: like ext4
  • Online defrag/resize: live maintenance
  • Best for: large files, parallel I/O

Btrfs

  • Copy-on-write: snapshots are free
  • Built-in RAID: 0, 1, 5, 6, 10
  • Checksums: all data + metadata
  • Compression: zlib, lzo, zstd
  • Snapshots: instant, writable
  • Subvolumes: logical partitions

ZFS

  • Copy-on-write: never overwrites in place
  • End-to-end checksums: detect silent corruption
  • RAID-Z: like RAID-5/6 but avoids the write hole
  • Deduplication: block-level
  • ARC: adaptive replacement cache
  • Self-healing: auto-repair with redundancy

File System Performance

Buffer Cache and Page Cache

Linux serves most file I/O from the page cache: reads hit memory when the data is already cached, and writes dirty pages that writeback threads flush to disk in the background (or immediately on fsync). Modern kernels unify the old block-buffer cache with the page cache.

I/O Scheduling

Application → VFS → File System → Block Layer → Device

Block Layer I/O Schedulers:
┌─────────────────────────────────────────────────────────────────┐
│                                                                  │
│  NONE (noop):     Pass requests directly to device             │
│                   Best for: NVMe, VMs with host scheduling     │
│                                                                  │
│  mq-deadline:     Deadline-based, prevents starvation          │
│                   Best for: HDDs, predictable latency needs    │
│                                                                  │
│  BFQ:             Budget Fair Queuing                          │
│                   Best for: Desktop, interactive workloads     │
│                                                                  │
│  kyber:           Multi-queue, low-latency target              │
│                   Best for: Fast SSDs                          │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Check/set: /sys/block/sda/queue/scheduler

Interview Deep Dive Questions

Q: What happens, step by step, when you unlink() a file?

Answer:
unlink("/home/user/file.txt")
  1. Path resolution: Traverse to parent directory (/home/user)
  2. Remove directory entry:
    • Find entry for “file.txt” in directory data block
    • Mark entry as deleted (or zero out)
  3. Decrement inode link count:
    • inode->nlink--
    • If nlink > 0, stop here (other hard links exist)
  4. If nlink == 0 AND open count == 0:
    • Mark inode as free in inode bitmap
    • Mark data blocks as free in data bitmap
    • Blocks are NOT zeroed (just marked free)
  5. If nlink == 0 BUT file is still open:
    • File becomes “unlinked but open”
    • Deletion deferred until last close()
    • Common pattern: temp files delete immediately after open
Space recovery is lazy — blocks reused when needed, not zeroed immediately. This is why “secure delete” tools overwrite before unlinking.
Q: Why can a crash in the middle of a file operation corrupt the file system, and how does journaling prevent it?

Answer: Without journaling (crash scenarios):
Operation: Create file with data

Crash after:         Result:
1. Write data        - Data written, but no inode points to it (leaked)
2. Update bitmap     - Bitmap says block used, no file uses it (leaked)
3. Update inode      - Inode points to garbage (corruption)
4. Update directory  - Directory entry points to incomplete inode
With journaling:
1. Write to journal:
   - Transaction start
   - Data blocks (if data journaling)
   - Inode update
   - Bitmap update
   - Directory update
   - Commit record

2. Checkpoint: Write journal contents to final locations

3. Free journal space

Crash recovery:
- Scan journal for committed transactions
- Replay any not checkpointed
- Result: Either operation fully applied or not at all
Key insight: Journal creates atomicity — multi-block updates become all-or-nothing.
Q: Which file system would you choose for a database server, and why?

Answer: Database requirements:
  • Reliable fsync/O_DIRECT
  • Good random I/O
  • Crash recovery
  • Large file support
ext4:
  • ✅ Mature, well-tested
  • ✅ Reliable fsync
  • ✅ Good all-around performance
  • ❌ Limited snapshot support
  • Best for: General purpose, PostgreSQL often recommends
XFS:
  • ✅ Excellent parallel I/O
  • ✅ Handles large files well
  • ✅ Consistent performance
  • ❌ Can’t shrink filesystem
  • Best for: Large databases, data warehouses
Btrfs:
  • ✅ Snapshots (great for backup)
  • ✅ Compression saves space
  • ✅ Checksums detect corruption
  • ⚠️ RAID5/6 not production-ready
  • ❌ Some workloads show performance issues
  • Best for: If you need snapshots, with RAID1
Recommendation: ext4 or XFS for production databases. Use LVM for snapshots with ext4/XFS if needed.
Q: What is a TOCTOU race, and how do you avoid it?

Answer: The problem:
// TOCTOU: Time-of-check to time-of-use race
if (access("/tmp/file", R_OK) == 0) {  // Check
    int fd = open("/tmp/file", O_RDONLY);  // Use
    // Between check and use:
    // - File could be deleted
    // - File could be replaced
    // - Permissions could change
}
Security implications:
  • Attacker replaces file with symlink to /etc/shadow
  • Check passes on original file
  • Open follows symlink to sensitive file
Solutions:
  1. Don’t check, just do:
    int fd = open("/tmp/file", O_RDONLY);
    if (fd < 0) handle_error();
    
  2. Use O_NOFOLLOW:
    int fd = open("/tmp/file", O_RDONLY | O_NOFOLLOW);
    
  3. Use directory fd:
    int dirfd = open("/tmp", O_DIRECTORY);
    int fd = openat(dirfd, "file", O_RDONLY);
    // Operate relative to dirfd, not path
    
  4. Check after open:
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    // Verify inode, permissions on *opened* file
    
Q: How would you design a distributed file system (GFS-style)?

Answer: Requirements analysis:
  • Scale to petabytes
  • Handle node failures

Architecture: a single metadata master plus many chunk servers; files are split into large chunks replicated across servers (GFS uses 64 MB chunks with 3 replicas by default).
  1. Consistency:
    • Single master for serialization
    • Leases for concurrent appends
    • Relaxed consistency (eventual for appends)
  2. Failure handling:
    • Heartbeats detect chunk server failure
    • Master re-replicates under-replicated chunks
    • Checksums detect silent corruption
  3. Client flow:
    • Ask master for chunk locations
    • Read/write directly to chunk servers
    • Cache metadata, not data

Practice Exercises

1. Implement Simple FS: Create a FUSE-based file system with inodes, directories, and basic operations.

2. Journaling Simulator: Simulate a journal with crash injection. Verify recovery correctness.

3. Benchmark File Systems: Compare ext4 and XFS on random vs sequential I/O with fio. Analyze the results.

4. VFS Exploration: Write a kernel module that registers a simple file system with VFS.

Key Takeaways

Inodes Store Metadata

Everything except the filename. Hard links share inodes.

Journaling = Crash Safety

Ordered mode balances safety and performance. Recovery is fast.

VFS Enables Flexibility

Same API works for local disk, NFS, FUSE, and more.

Page Cache is Key

Most reads/writes hit cache. Dirty pages flushed in background.

Next: I/O Systems