Virtual Filesystem Layer - Unified abstraction for all filesystem operations

Filesystem & VFS

The Virtual File System (VFS) is Linux’s abstraction layer that provides a unified interface to all filesystem types. Understanding VFS is crucial for debugging I/O issues and designing storage-aware systems.
Prerequisites: System calls, process fundamentals
Interview Focus: File descriptors, VFS architecture, I/O paths, page cache
Time to Master: 4-5 hours

VFS Architecture

VFS Layer Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                       VFS ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Application                                                     ││
│  │  open(), read(), write(), close(), stat(), mmap()               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│                              │ System Call Interface                 │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  Kernel Space                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                    Virtual File System (VFS)                     ││
│  │                                                                  ││
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐             ││
│  │  │ File Objects │ │Dentry Cache  │ │ Inode Cache  │             ││
│  │  │(open files)  │ │(path lookup) │ │(metadata)    │             ││
│  │  └──────────────┘ └──────────────┘ └──────────────┘             ││
│  │                                                                  ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │              VFS Operations (file_operations)               │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────┬─────────────┼─────────────┬─────────────┐          │
│  │             │             │             │             │          │
│  ▼             ▼             ▼             ▼             ▼          │
│  ┌─────┐    ┌─────┐    ┌──────┐    ┌──────┐    ┌───────┐           │
│  │ext4 │    │ XFS │    │ NFS  │    │ proc │    │ sysfs │           │
│  └─────┘    └─────┘    └──────┘    └──────┘    └───────┘           │
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │              Block Layer / Network Stack                        ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Core VFS Data Structures

The Four Pillars

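The four core VFS objects are the superblock, inode, dentry, and file. The superblock (struct super_block) represents a mounted filesystem: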
struct super_block {
    struct list_head    s_list;       // List of all superblocks
    dev_t               s_dev;        // Device identifier
    unsigned long       s_blocksize;  // Block size in bytes
    unsigned char       s_blocksize_bits;
    loff_t              s_maxbytes;   // Max file size
    struct file_system_type *s_type;  // Filesystem type
    const struct super_operations *s_op;  // Operations
    struct dentry       *s_root;      // Root dentry
    struct list_head    s_inodes;     // All inodes
    struct list_head    s_dirty;      // Dirty inodes (older kernels only; modern kernels track these per backing device)
    // ... more fields
};
Key operations:
  • alloc_inode(): Allocate inode
  • destroy_inode(): Free inode
  • write_inode(): Write inode to disk
  • sync_fs(): Sync filesystem

File Descriptors

File Descriptor Table

┌─────────────────────────────────────────────────────────────────────┐
│                    FILE DESCRIPTOR ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Process A                    Process B                             │
│  ┌────────────────────┐      ┌────────────────────┐                │
│  │ task_struct        │      │ task_struct        │                │
│  │ └─files ──────────────────└─files ─────┐      │                │
│  └────────────────────┘      └─────────────│──────┘                │
│            │                              │                         │
│            ▼                              ▼                         │
│  ┌─────────────────────┐      ┌─────────────────────┐              │
│  │ files_struct (A)    │      │ files_struct (B)    │              │
│  │ ┌───────────────┐   │      │ ┌───────────────┐   │              │
│  │ │ fd_array      │   │      │ │ fd_array      │   │              │
│  │ │ [0]────────────────┐     │ │ [0]──────────────┐│              │
│  │ │ [1]─────────────┐  │     │ │ [1]──────────┐ │ ││              │
│  │ │ [2]──────┐    │ │  │     │ │ [2]────┐   │ │ ││              │
│  │ │ [3]────┐ │    │ │  │     │ │ [3]──┐ │   │ │ ││              │
│  │ └────────│─│────│─│──│     │ └──────│─│───│─│─││              │
│  └──────────│─│────│─│──│     └────────│─│───│─│─││              │
│             │ │    │ │  │              │ │   │ │ ││              │
│             ▼ ▼    ▼ ▼  ▼              ▼ ▼   ▼ ▼ ▼│              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    struct file instances                    │   │
│  │                                                             │   │
│  │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐            │   │
│  │  │file A  │  │file B  │  │file C  │  │file D  │            │   │
│  │  │f_pos=0 │  │f_pos=100│ │f_pos=50│  │f_pos=0 │            │   │
│  │  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘            │   │
│  │      │           │           │           │                  │   │
│  └──────│───────────│───────────│───────────│──────────────────┘   │
│         └───────────┴───────────┴───────────┘                      │
│                              │                                      │
│                              ▼                                      │
│              ┌───────────────────────────────┐                     │
│              │          struct inode         │                     │
│              │       (single instance)       │                     │
│              └───────────────────────────────┘                     │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

File Descriptor Limits

# Per-process soft/hard limits
ulimit -n      # Current limit (soft)
ulimit -Hn     # Hard limit

# System-wide limit
cat /proc/sys/fs/file-max

# Current system-wide usage
cat /proc/sys/fs/file-nr
# allocated  free  maximum
# 9024       0     9223372036854775807

# Per-process FD usage
ls /proc/<pid>/fd | wc -l

# Which files does a process have open?
lsof -p <pid>
ls -la /proc/<pid>/fd/
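
The same information is available programmatically. This sketch assumes Linux with /proc mounted; fd_limits, count_open_fds, and print_fd_usage are illustrative names, not an existing API.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/resource.h>

/* Fill in the soft and hard RLIMIT_NOFILE values; returns 0 on success. */
int fd_limits(unsigned long long *soft, unsigned long long *hard) {
    assert(soft != NULL && hard != NULL);
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = (unsigned long long)rl.rlim_cur;
    *hard = (unsigned long long)rl.rlim_max;
    return 0;
}

/* Count this process's open descriptors by listing /proc/self/fd.
   Returns -1 if /proc is unavailable. */
int count_open_fds(void) {
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;
    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                 /* skip "." and ".." */
        count++;
    }
    closedir(dir);
    return count - 1;                 /* exclude the fd opendir() itself held */
}

/* Print current usage next to the limits. */
void print_fd_usage(void) {
    unsigned long long soft, hard;
    if (fd_limits(&soft, &hard) == 0)
        printf("open fds: %d, soft limit: %llu, hard limit: %llu\n",
               count_open_fds(), soft, hard);
}
```

Logging the count periodically from inside a long-running service is a cheap way to spot a leak before the soft limit is reached.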

File Descriptor Inheritance

// Fork: child inherits FDs
pid_t pid = fork();
// Both parent and child share the same struct file
// Changes to f_pos are visible to both!

// After exec: FDs preserved unless O_CLOEXEC
int fd = open("/tmp/file", O_RDONLY | O_CLOEXEC);
// fd will be closed on exec()

// Dup: creates new FD pointing to same file
int fd2 = dup(fd);    // fd2 shares f_pos with fd
int fd3 = dup2(fd, 5); // fd3 = 5, points to same file
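
The sharing rules above can be checked directly. This is a small sketch, assuming a writable /tmp; the struct and function names are invented for the example. It compares the offset seen through a dup()ed descriptor, through a second open() of the same path, and after a forked child reads through the inherited descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Offsets observed at each step of the demo. */
struct offset_report {
    off_t via_dup;      /* seen through a dup()ed descriptor */
    off_t via_reopen;   /* seen through a second open() of the same path */
    off_t after_child;  /* parent's offset after a forked child read 5 bytes */
};

/* Returns 0 on success, -1 if any setup step failed. */
int shared_offset_demo(struct offset_report *r) {
    assert(r != NULL);
    char path[] = "/tmp/vfs_offset_demo_XXXXXX";  /* hypothetical temp name */
    char buf[16];
    int rc = -1, fd_dup = -1, fd_new = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (write(fd, "0123456789abcdefghijklmnopqrstuv", 32) != 32 ||
        lseek(fd, 0, SEEK_SET) != 0)
        goto out;

    fd_dup = dup(fd);                  /* new fd, same struct file */
    fd_new = open(path, O_RDONLY);     /* new struct file, same inode */
    if (fd_dup < 0 || fd_new < 0 || read(fd, buf, 10) != 10)
        goto out;
    r->via_dup = lseek(fd_dup, 0, SEEK_CUR);     /* shares f_pos with fd */
    r->via_reopen = lseek(fd_new, 0, SEEK_CUR);  /* has its own f_pos */

    pid_t pid = fork();
    if (pid == 0)                      /* child reads through the inherited fd */
        _exit(read(fd, buf, 5) == 5 ? 0 : 1);
    if (pid < 0 || waitpid(pid, NULL, 0) < 0)
        goto out;
    r->after_child = lseek(fd, 0, SEEK_CUR);     /* moved by the child's read */
    rc = 0;
out:
    if (fd_dup >= 0) close(fd_dup);
    if (fd_new >= 0) close(fd_new);
    close(fd);
    unlink(path);
    return rc;
}
```

After a 10-byte read in the parent and a 5-byte read in the child, the dup()ed descriptor and the parent both see the moved offset, while the separately opened descriptor stays at 0.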

Path Lookup

The namei() Journey

┌─────────────────────────────────────────────────────────────────────┐
│                    PATH RESOLUTION: /home/user/file.txt             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. Start at root dentry (or cwd for relative paths)                │
│     Current: /                                                       │
│     └─ Check dcache for "home"                                      │
│        └─ MISS: Read directory from disk                            │
│        └─ Find inode number for "home"                              │
│        └─ Create dentry, add to dcache                              │
│                                                                      │
│  2. Current: /home                                                   │
│     └─ Check dcache for "user"                                      │
│        └─ HIT: dentry found in cache                                │
│        └─ Check permissions: can we enter?                          │
│        └─ Handle mount points (if /home/user is mounted)            │
│                                                                      │
│  3. Current: /home/user                                              │
│     └─ Check dcache for "file.txt"                                  │
│        └─ MISS: Read directory from disk                            │
│        └─ Find inode number for "file.txt"                          │
│        └─ Create dentry, add to dcache                              │
│                                                                      │
│  4. Final: /home/user/file.txt                                       │
│     └─ Return dentry pointing to file's inode                       │
│                                                                      │
│  Permissions checked at EVERY step!                                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                    SYMLINK RESOLUTION                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  /usr/local/bin/python → /usr/bin/python3.11                        │
│                                                                      │
│  1. Resolve /usr/local/bin/python                                   │
│  2. Find it's a symlink (inode type = S_IFLNK)                     │
│  3. Read link target: /usr/bin/python3.11                           │
│  4. Start new resolution from / (absolute) or current (relative)   │
│  5. Resolve /usr/bin/python3.11                                     │
│  6. Return final inode                                               │
│                                                                      │
│  Limits:                                                             │
│  - Max 40 symlinks per lookup (MAXSYMLINKS), else ELOOP             │
│  - Prevents infinite loops: A → B → A                               │
│                                                                      │
│  O_NOFOLLOW: Don't follow final symlink                             │
│  O_PATH: Return fd without opening file                             │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Page Cache

Page Cache Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        PAGE CACHE                                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Application                                                         │
│      │ read(fd, buf, 4096)                                          │
│      ▼                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                      Page Cache                                  ││
│  │                                                                  ││
│  │  struct address_space (per inode)                               ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │  XArray (radix tree before 4.20) of pages                  │ ││
│  │  │                                                            │ ││
│  │  │   Offset:  0    4K   8K   12K  16K  20K  24K  28K         │ ││
│  │  │           ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐         │ ││
│  │  │   Pages:  │██│ │██│ │░░│ │██│ │░░│ │██│ │██│ │░░│         │ ││
│  │  │           └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘         │ ││
│  │  │                                                            │ ││
│  │  │   ██ = Page in cache    ░░ = Not cached                   │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│                              │ Cache miss (page not resident)       │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                        Block Layer                               ││
│  │                     Read from disk                               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Page Cache Operations

# View page cache usage
cat /proc/meminfo | grep -E "^(Cached|Buffers|Dirty|Writeback)"

# Drop caches (for testing only!)
echo 1 > /proc/sys/vm/drop_caches  # Page cache
echo 2 > /proc/sys/vm/drop_caches  # Dentries + inodes
echo 3 > /proc/sys/vm/drop_caches  # Both

# Check if file is cached
vmtouch /path/to/file
# Files: 1
# Directories: 0
# Resident Pages: 256/256  100%
# Elapsed: 0.000123 seconds

# Cache a file
vmtouch -t /path/to/file

# Evict file from cache
vmtouch -e /path/to/file

# Check page cache hit ratio (BCC tool). Note: perf's cache-references and
# cache-misses events count CPU cache activity, not page cache hits.
sudo cachestat-bpfcc 1
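
Residency can also be queried from C with mmap() plus mincore(), which is broadly the approach tools like vmtouch take. A minimal sketch, assuming Linux; resident_fraction is an illustrative name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fraction (0.0 to 1.0) of `path`'s pages currently resident in memory,
   or -1.0 on error or for an empty file. */
double resident_fraction(const char *path) {
    assert(path != NULL);
    double result = -1.0;
    struct stat st;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return result;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return result;
    }

    size_t len = (size_t)st.st_size;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t pages = (len + page - 1) / page;

    /* Mapping the file does not read it; pages are faulted in on access. */
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return result;

    unsigned char *vec = malloc(pages);
    if (vec != NULL && mincore(map, len, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;   /* low bit set means page is resident */
        result = (double)resident / (double)pages;
    }
    free(vec);
    munmap(map, len);
    return result;
}
```

The value is a snapshot: pages can be evicted or read in between the mincore() call and whatever the caller does next.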

Read-Ahead

# Check read-ahead setting (in KiB; blockdev --getra reports 512-byte sectors instead)
cat /sys/block/sda/queue/read_ahead_kb
# Default: 128 (128KB)

# Increase for sequential workloads
echo 256 > /sys/block/sda/queue/read_ahead_kb

// Application-level hints (C, not shell)
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // Larger read-ahead window
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);      // Disable read-ahead
posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);    // Pre-fetch into the page cache
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);    // Evict clean pages from the cache

Write Paths

Buffered vs Direct I/O

┌─────────────────────────────────────────────────────────────────────┐
│                    WRITE PATHS COMPARISON                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  BUFFERED WRITE                    DIRECT WRITE (O_DIRECT)          │
│  ───────────────                   ─────────────────────            │
│                                                                      │
│  Application                       Application                       │
│      │ write(fd, buf, 4096)           │ write(fd, buf, 4096)        │
│      ▼                                ▼                             │
│  ┌────────────┐                   (no page cache)                   │
│  │ Page Cache │                       │                             │
│  │ (dirty)    │                       │                             │
│  └────────────┘                       │                             │
│      │                                │                             │
│      │ (background flush)             │ (immediate)                 │
│      ▼                                ▼                             │
│  ┌────────────┐                   ┌────────────┐                    │
│  │ Block      │                   │ Block      │                    │
│  │ Layer      │                   │ Layer      │                    │
│  └────────────┘                   └────────────┘                    │
│      │                                │                             │
│      ▼                                ▼                             │
│  ┌────────────┐                   ┌────────────┐                    │
│  │ Disk       │                   │ Disk       │                    │
│  └────────────┘                   └────────────┘                    │
│                                                                      │
│  Pros:                            Pros:                              │
│  - Fast returns                   - Predictable latency             │
│  - Coalescing                     - App controls caching            │
│  - Read cache                     - No double buffering             │
│                                                                      │
│  Cons:                            Cons:                              │
│  - Data loss on crash             - Slower for small I/O            │
│  - Memory overhead                - Alignment requirements          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
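
The alignment requirement is the part of O_DIRECT that most often trips people up. The sketch below assumes Linux and a 4096-byte alignment (DIO_ALIGN is an assumption; real code should query the device or filesystem), and direct_write_block is an illustrative name. Some filesystems reject O_DIRECT outright, typically with EINVAL.

```c
#define _GNU_SOURCE          /* O_DIRECT is a Linux extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096       /* assumed alignment, not queried from the device */

/* Write one aligned block to `path`, bypassing the page cache.
   Returns bytes written, or -1 with errno set. */
ssize_t direct_write_block(const char *path) {
    assert(path != NULL);
    void *buf = NULL;

    /* Buffer address, transfer length, and file offset must all be aligned. */
    if (posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN) != 0) {
        errno = ENOMEM;
        return -1;
    }
    memset(buf, 'x', DIO_ALIGN);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0) {
        int saved = errno;
        free(buf);
        errno = saved;
        return -1;
    }

    ssize_t n = write(fd, buf, DIO_ALIGN);   /* goes to the block layer directly */
    int saved = errno;
    close(fd);
    free(buf);
    errno = saved;
    return n;
}
```

O_DIRECT skips the cache but does not by itself guarantee durability; metadata and device write caches still need fsync() or O_DSYNC.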

Write-Back and Dirty Pages

# View dirty pages and write-back status
cat /proc/meminfo | grep -E "^(Dirty|Writeback):"

# Tune write-back behavior
# Start write-back when dirty pages exceed this %
cat /proc/sys/vm/dirty_background_ratio   # Default: 10%

# Process blocks when dirty pages exceed this %
cat /proc/sys/vm/dirty_ratio              # Default: 20%

# Age at which data is old enough to be written (centiseconds)
cat /proc/sys/vm/dirty_expire_centisecs   # Default: 3000 (30 seconds)

# Interval between write-back daemon wake-ups (centiseconds)
cat /proc/sys/vm/dirty_writeback_centisecs # Default: 500 (5 seconds)

# Force sync (shell)
sync           # Sync all filesystems

// Force sync (C)
syncfs(fd);     // Sync the filesystem containing fd
fsync(fd);      // Sync one file: data + metadata
fdatasync(fd);  // Sync one file: data + only the metadata needed to read it back

Filesystem Types

Virtual Filesystems

# procfs - Process information
/proc/
├── <pid>/          # Per-process info
│   ├── cmdline     # Command line
│   ├── environ     # Environment
│   ├── fd/         # File descriptors
│   ├── maps        # Memory mappings
│   ├── stat        # Process status
│   └── status      # Readable status
├── cpuinfo         # CPU information
├── meminfo         # Memory information
└── sys/            # Kernel parameters (sysctl)

# sysfs - Device/driver information
/sys/
├── block/          # Block devices
├── bus/            # Bus types
├── class/          # Device classes
├── devices/        # Device hierarchy
├── fs/             # Filesystem info
│   └── cgroup/     # Cgroup controllers
└── kernel/         # Kernel settings

# tmpfs - RAM-based filesystem
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk
# Used for /tmp, /run, /dev/shm

Disk Filesystems

Filesystem  Use Case           Max File Size  Features
ext4        General purpose    16 TiB         Journaling, extents
XFS         Large files        8 EiB          Parallel I/O, reflinks
Btrfs       Advanced features  16 EiB         Snapshots, checksums
ZFS         Enterprise         16 EiB         Pools, RAID, snapshots

Mount Namespaces and Bind Mounts

# View mount points
mount
cat /proc/mounts
findmnt

# Bind mount: Mount a directory in another location
mount --bind /source /target

# Recursive bind mount
mount --rbind /source /target

# Make mount private (don't propagate)
mount --make-private /target

# Make mount shared (mount events propagate to peers and slaves)
mount --make-shared /target

# Mount namespace: Isolated mount table
unshare --mount /bin/bash
# Mounts in this shell don't affect parent

# Containers use mount namespaces extensively
# Each container has its own root filesystem

Interview Deep Dives

What happens when you run cat on a file?
Complete flow:
  1. Process creation: Shell forks, child execs /bin/cat
  2. open() syscall:
    • Path lookup via VFS (dcache, namei)
    • Permission check (DAC + MAC)
    • Allocate struct file
    • Allocate file descriptor
    • Return fd
  3. read() syscall:
    • fd → struct file → inode → address_space
    • Check page cache for requested pages
    • Cache miss: Submit I/O to block layer
    • I/O completion: Copy to page cache
    • Copy from page cache to user buffer
    • Return bytes read
  4. write() to stdout:
    • fd 1 → terminal device
    • TTY layer processes output
    • Display on screen
  5. close() + exit:
    • Decrement file refcount
    • Free fd in table
    • Process exits
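
Steps 2 through 5 map onto a very small program. The following is a simplified sketch of what cat does, not its actual source; cat_to_fd is an illustrative name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy the contents of `path` to `out_fd`. Returns bytes copied, or -1. */
long long cat_to_fd(const char *path, int out_fd) {
    assert(path != NULL);
    char buf[4096];
    long long total = 0;

    int fd = open(path, O_RDONLY);     /* path lookup, struct file, new fd */
    if (fd < 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);   /* page cache -> user buffer */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                               /* EOF */
        for (ssize_t off = 0; off < n; ) {       /* handle short writes */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                close(fd);
                return -1;
            }
            off += w;
        }
        total += n;
    }
    close(fd);                         /* drop the struct file reference */
    return total;
}
```

Passing STDOUT_FILENO as out_fd reproduces the terminal case from step 4; passing a pipe or socket exercises a different f_op table with the same code.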
What is the difference between fsync() and fdatasync()?
fsync():
  • Flushes file data AND metadata to disk
  • Metadata: size, mtime, block pointers
  • Two writes: data blocks + inode
  • Required for crash consistency
fdatasync():
  • Flushes file data to disk
  • Only flushes metadata if needed for data access
  • Skips non-essential metadata (atime, mtime)
  • Faster for append-only patterns
When to use which:
// fdatasync: appending to an already-open file
write(fd, data, len);
fdatasync(fd);  // Data + size update

// fsync: atomic replace via write-to-temp + rename
int fd = open("file.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, data, len);
fsync(fd);              // Ensure data is on disk before the rename
close(fd);
rename("file.tmp", "file");  // Atomic replace
int dir_fd = open(".", O_RDONLY | O_DIRECTORY);
fsync(dir_fd);          // Ensure the directory entry is on disk
close(dir_fd);
How do you prevent and detect file descriptor leaks?
Kernel mechanisms:
  1. RLIMIT_NOFILE: Per-process limit on open fds
    ulimit -n  # Check limit
    
  2. O_CLOEXEC: Close fd on exec
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    
  3. Process exit: All fds automatically closed
User-space practices:
  1. Use RAII (C++/Rust): Fd closed when object destroyed
  2. Close fds in error paths
  3. Use valgrind --track-fds=yes or lsof to detect leaks
# Find fd leaks
watch -n1 'ls /proc/$(pidof myapp)/fd | wc -l'
lsof -p $(pidof myapp) | tail -20
Why is ls slow in a directory with millions of files?
Causes:
  1. Directory reading: getdents() syscall reads directory entries
    • Linear scan of directory file
    • ext4 uses htree for lookup, but listing is still O(n)
  2. stat() per file: ls -l stats every file
    • Each stat is a separate syscall
    • May require inode read from disk
  3. Sorting: ls sorts output
    • O(n log n) in memory
Solutions:
# Use ls -f (no sorting, includes . and ..)
ls -f

# Use ls -U (no sorting)
ls -U

# Avoid -l if not needed (no stat per file)
ls

# For really large directories, stream unsorted entries and stop early
find . -maxdepth 1 | head -n 100
Architectural solutions:
  • Don’t put millions of files in one directory
  • Use directory sharding: files/ab/cd/file.txt
  • Use filesystem with better large directory support (XFS)
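
For programs that only need names or a count, reading the directory directly avoids both the sort and the per-file stat(). A minimal sketch; count_dir_entries is an illustrative name.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Count entries in `path` without sorting and without calling stat() on
   each one. Returns -1 if the directory cannot be opened. */
long count_dir_entries(const char *path) {
    assert(path != NULL);
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *ent;
    /* readdir() hands out entries from a buffer filled by the getdents
       family of syscalls, so one syscall covers many names. */
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        count++;
    }
    closedir(dir);
    return count;
}
```

This is still O(n) in the number of entries, but memory use stays constant and no inode has to be read from disk.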

Performance Monitoring

# VFS cache statistics
sudo grep -E "dentry|inode" /proc/slabinfo   # /proc/slabinfo is readable by root only

# File system operations tracing
sudo bpftrace -e '
kprobe:vfs_read { @reads = count(); }
kprobe:vfs_write { @writes = count(); }
interval:s:1 { print(@reads); print(@writes); clear(@reads); clear(@writes); }
'

# Page cache hit ratio
sudo cachestat-bpfcc 1

# File I/O latency
sudo fileslower-bpfcc 10  # Show I/O > 10ms

# Open files by process
sudo opensnoop-bpfcc

# Watch for slow path lookups
sudo bpftrace -e '
kprobe:path_lookupat {
    @start[tid] = nsecs;
}
kretprobe:path_lookupat /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000;
    if ($lat > 1000) {
        printf("slow lookup: %d us\n", $lat);
    }
    delete(@start[tid]);
}
'

Interview Deep-Dive

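How do file descriptors, struct file, and the inode relate when the same file is opened more than once?
Strong Answer: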
  • Each open() call creates a new struct file instance, even if the same file is opened multiple times by the same or different processes. The struct file contains the file position (f_pos), access mode (f_mode), flags (f_flags like O_NONBLOCK), and a pointer to the inode. Each struct file has its own independent file position, so two processes reading the same file track their positions separately.
  • The file descriptor is just an integer index into the process’s file descriptor table (task_struct->files->fd_array[]). Different fd numbers can refer to the same underlying struct file: within one process after dup() or dup2(), and across processes after fork() or after passing an fd over a Unix domain socket (SCM_RIGHTS).
  • The inode (struct inode) is shared: there is exactly one inode per file in the VFS inode cache, regardless of how many times the file is opened. The inode holds metadata (size, permissions, timestamps, block pointers) and the address_space for the page cache. This means the page cache is shared: if process A reads a file into cache, process B reading the same file will hit the cache.
  • After fork(), the situation is special: the child inherits the parent’s file descriptor table, and both parent and child fd entries point to the same struct file objects. This means they share file positions — if the parent reads 100 bytes, the child’s position also advances by 100. This is a common source of bugs in forked programs that do not close inherited fds.
Follow-up: What happens to these structures when the last fd referencing a file is closed but the file has been deleted (unlinked)?
Follow-up Answer:
  • When unlink() is called on a file, the directory entry is removed and the inode’s link count (i_nlink) is decremented. But if any process still has the file open (the inode’s reference count i_count is non-zero), the inode and its data blocks are not freed. The file becomes “anonymous” — it exists on disk, consuming space, but has no name in any directory. You can see these files via /proc/<pid>/fd/ and they show as “(deleted)” in ls -la /proc/<pid>/fd/. Only when the last struct file referencing the inode is closed (via close() or process exit) does the kernel call iput() on the inode, and if i_nlink == 0 and i_count == 0, the filesystem frees the blocks. This is why disk space is not reclaimed until you restart a process that has deleted files open — a common operational issue.
The page cache is using 90% of memory. Is that a problem?
Strong Answer:
  • A page cache using 90% of memory is normal and desirable. Linux uses all available free memory as page cache because unused memory is wasted memory. The page cache accelerates file reads from milliseconds (disk) to microseconds (memory). The key metric is not cache size but cache hit ratio. If the working set fits in cache and the hit ratio is 95%+, the system is performing optimally.
  • It becomes a problem only when memory pressure causes the cache to evict pages that the application will need again soon, leading to major page faults (disk reads). The symptom is high pgmajfault counts in /proc/vmstat or high kswapd CPU usage.
  • The kernel uses a two-list LRU (Least Recently Used) algorithm to decide what to cache. Each page is on either the active or inactive list, and the lists are split by type (anonymous vs file-backed). When a file page is first read, it goes on the inactive list. If accessed again while on the inactive list, it is promoted to the active list. When memory reclaim is needed, kswapd scans the inactive list from the tail, evicting pages that have not been recently accessed. Active list pages are periodically demoted to the inactive list if not recently accessed.
  • The vm.vfs_cache_pressure sysctl tunes how aggressively the kernel reclaims dentry and inode caches relative to page cache. The vm.swappiness tunes the preference for reclaiming anonymous pages (swap out) versus file pages (drop from cache). For a database server, I would set swappiness=10 to preserve anonymous memory (heap, buffer pool) and prefer dropping file cache.
Follow-up: How does posix_fadvise() allow applications to influence page cache behavior, and when would you use it?
Follow-up Answer:
  • posix_fadvise() lets applications give hints to the kernel about their access patterns. POSIX_FADV_SEQUENTIAL tells the kernel to increase read-ahead (prefetch more pages ahead of the current read position). POSIX_FADV_RANDOM disables read-ahead, which is better for random-access patterns like database index lookups. POSIX_FADV_WILLNEED asks the kernel to prefetch the specified range into cache (asynchronous read-ahead). POSIX_FADV_DONTNEED tells the kernel that the specified range is no longer needed and can be evicted from cache. I would use DONTNEED after processing a large file in a streaming fashion (like log processing) to avoid polluting the cache with data that will never be re-read. I would use WILLNEED when a database knows ahead of time which pages an upcoming query will need.
Would you choose ext4 or XFS for a write-heavy workload with many small files?
Strong Answer:
  • For write-heavy workloads with many small files, XFS generally outperforms ext4 due to fundamental architectural differences in allocation and journaling.
  • ext4 uses a bitmap-based block allocator: it searches free space bitmaps for available blocks. For many small files, this creates contention on the bitmap locks and can lead to fragmentation as the allocator struggles to find contiguous free space among many small allocations. ext4’s journaling (jbd2) writes metadata changes to a journal before committing them to their final locations, and the journal is a single sequential log shared by all operations.
  • XFS uses B+ tree-based allocation groups. The filesystem is divided into independent allocation groups (AGs), each with its own B+ trees for free space, inodes, and extents. This parallelism means multiple threads can allocate blocks in different AGs simultaneously without contending on a single lock. XFS also uses delayed allocation aggressively: it defers block allocation until writeback time, which allows the allocator to make better decisions about contiguity.
  • For journaling, XFS uses delayed logging: metadata changes are accumulated in memory and flushed in batches, reducing journal I/O. ext4’s jbd2 commits more frequently by default (every 5 seconds).
  • The practical impact: on a server creating millions of small files per day, XFS’s allocation group parallelism and B+ tree-based free space tracking significantly reduce lock contention and provide more predictable latency. ext4 is better for simpler workloads and has wider tooling support (fsck is more mature).
Follow-up: How does the VFS abstraction layer allow both filesystems to coexist, and what is the performance cost of that abstraction?
Follow-up Answer:
  • VFS defines a set of operation tables (super_operations, inode_operations, file_operations, address_space_operations) that each filesystem implements. When user space calls write(), VFS dispatches through file->f_op->write_iter(), a function pointer set by the filesystem during open(). The overhead is a single indirect function call, roughly 1-2 nanoseconds on modern CPUs (somewhat more when Spectre mitigations such as retpolines are enabled). For I/O that reaches the device, which takes microseconds to milliseconds, the VFS abstraction cost is negligible (well under 0.1% of total I/O time). The real benefit is enormous: applications, system calls, and kernel subsystems (page cache, memory mapping) work identically regardless of the underlying filesystem. This is why you can switch from ext4 to XFS by reformatting, restoring the data, and remounting, without changing any application code.
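
The dispatch mechanism can be modeled in plain user-space C. This is a toy analogue, not kernel code: every name here (toy_file, toy_file_ops, the two toy "filesystems") is invented for illustration, and only the shape (an ops table chosen at open time, one indirect call per operation) mirrors the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Analogue of struct file_operations: one table per "filesystem". */
struct toy_file_ops {
    size_t (*read)(struct toy_file *f, char *buf, size_t len);
};

/* Analogue of struct file: ops pointer, position, and private data. */
struct toy_file {
    const struct toy_file_ops *f_op;
    size_t f_pos;
    const char *data;                 /* backing store, used by memfs only */
};

/* "memfs": reads come from a string held in memory. */
static size_t memfs_read(struct toy_file *f, char *buf, size_t len) {
    size_t avail = strlen(f->data) - f->f_pos;
    if (len > avail)
        len = avail;
    memcpy(buf, f->data + f->f_pos, len);
    f->f_pos += len;
    return len;
}

/* "zerofs": every read returns zero bytes' worth of NULs, like /dev/zero. */
static size_t zerofs_read(struct toy_file *f, char *buf, size_t len) {
    memset(buf, 0, len);
    f->f_pos += len;
    return len;
}

static const struct toy_file_ops memfs_ops = { .read = memfs_read };
static const struct toy_file_ops zerofs_ops = { .read = zerofs_read };

/* "open": the filesystem installs its ops table in the file object. */
struct toy_file toy_open_mem(const char *data) {
    struct toy_file f = { &memfs_ops, 0, data };
    return f;
}

struct toy_file toy_open_zero(void) {
    struct toy_file f = { &zerofs_ops, 0, NULL };
    return f;
}

/* Generic entry point: knows nothing about either filesystem. */
size_t toy_vfs_read(struct toy_file *f, char *buf, size_t len) {
    assert(f != NULL && f->f_op != NULL && f->f_op->read != NULL);
    return f->f_op->read(f, buf, len);   /* the single indirect call */
}
```

Adding a third "filesystem" means writing one more read function and ops table; toy_vfs_read and its callers stay untouched, which is the property that lets real filesystems be swapped under unchanged applications.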
