Virtual Filesystem Layer - Unified abstraction for all filesystem operations

Filesystem & VFS

The Virtual File System (VFS) is Linux’s abstraction layer that provides a unified interface to all filesystem types. Understanding VFS is crucial for debugging I/O issues and designing storage-aware systems.
Prerequisites: System calls, process fundamentals
Interview Focus: File descriptors, VFS architecture, I/O paths, page cache
Time to Master: 4-5 hours

VFS Architecture

VFS Layer Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                       VFS ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Application                                                     ││
│  │  open(), read(), write(), close(), stat(), mmap()               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│                              │ System Call Interface                 │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  Kernel Space                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                    Virtual File System (VFS)                     ││
│  │                                                                  ││
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐             ││
│  │  │ File Objects │ │Dentry Cache  │ │ Inode Cache  │             ││
│  │(open files)  │ │(path lookup) │ │(metadata)    │             ││
│  │  └──────────────┘ └──────────────┘ └──────────────┘             ││
│  │                                                                  ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │              VFS Operations (file_operations)               │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────┬─────────────┼─────────────┬─────────────┐          │
│  │             │             │             │             │          │
│  ▼             ▼             ▼             ▼             ▼          │
│  ┌─────┐    ┌─────┐    ┌──────┐    ┌──────┐    ┌───────┐           │
│  │ext4 │    │ XFS │    │ NFS  │    │ proc │    │ sysfs │           │
│  └─────┘    └─────┘    └──────┘    └──────┘    └───────┘           │
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │              Block Layer / Network Stack                        ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Core VFS Data Structures

The Four Pillars

VFS is built around four core objects: the superblock, the inode, the dentry, and the file. The superblock represents a mounted filesystem:
struct super_block {
    struct list_head    s_list;       // List of all superblocks
    dev_t               s_dev;        // Device identifier
    unsigned long       s_blocksize;  // Block size in bytes
    unsigned char       s_blocksize_bits;
    loff_t              s_maxbytes;   // Max file size
    struct file_system_type *s_type;  // Filesystem type
    const struct super_operations *s_op;  // Operations
    struct dentry       *s_root;      // Root dentry
    struct list_head    s_inodes;     // All inodes
    struct list_head    s_dirty;      // Dirty inodes
    // ... more fields
};
Key operations:
  • alloc_inode(): Allocate inode
  • destroy_inode(): Free inode
  • write_inode(): Write inode to disk
  • sync_fs(): Sync filesystem
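A few superblock-level parameters are visible from user space via statvfs(). A minimal sketch (querying "/" is just an example):

```c
#include <sys/statvfs.h>

/* Query superblock-level parameters for the filesystem backing a path.
 * f_bsize mirrors super_block.s_blocksize. Returns 0 on error. */
unsigned long fs_block_size(const char *path) {
    struct statvfs sv;
    if (statvfs(path, &sv) != 0)
        return 0;
    return sv.f_bsize;
}
```

statvfs() also reports free/total block counts and the maximum filename length (f_namemax) for the same mount.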

File Descriptors

File Descriptor Table

┌─────────────────────────────────────────────────────────────────────┐
│                    FILE DESCRIPTOR ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Process A                    Process B                             │
│  ┌────────────────────┐      ┌────────────────────┐                │
│  │ task_struct        │      │ task_struct        │                │
│  │ └─files─┐          │      │ └─files────┐       │                │
│  └─────────│──────────┘      └────────────│───────┘                │
│            │                              │                         │
│            ▼                              ▼                         │
│  ┌─────────────────────┐      ┌─────────────────────┐              │
│  │ files_struct (A)    │      │ files_struct (B)    │              │
│  │ ┌───────────────┐   │      │ ┌───────────────┐   │              │
│  │ │ fd_array      │   │      │ │ fd_array      │   │              │
│  │ │ [0]────────────────┐     │ │ [0]──────────────┐│              │
│  │ │ [1]─────────────┐  │     │ │ [1]──────────┐ │ ││              │
│  │ │ [2]──────┐    │ │  │     │ │ [2]────┐   │ │ ││              │
│  │ │ [3]────┐ │    │ │  │     │ │ [3]──┐ │   │ │ ││              │
│  │ └────────│─│────│─│──│     │ └──────│─│───│─│─││              │
│  └──────────│─│────│─│──│     └────────│─│───│─│─││              │
│             │ │    │ │  │              │ │   │ │ ││              │
│             ▼ ▼    ▼ ▼  ▼              ▼ ▼   ▼ ▼ ▼│              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    struct file instances                    │   │
│  │                                                             │   │
│  │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐            │   │
│  │  │file A  │  │file B  │  │file C  │  │file D  │            │   │
│  │  │f_pos=0 │  │f_pos=100│ │f_pos=50│  │f_pos=0 │            │   │
│  │  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘            │   │
│  │      │           │           │           │                  │   │
│  └──────│───────────│───────────│───────────│──────────────────┘   │
│         └───────────┴───────────┴───────────┘                      │
│                              │                                      │
│                              ▼                                      │
│              ┌───────────────────────────────┐                     │
│              │          struct inode         │                     │
│              │       (single instance)       │                     │
│              └───────────────────────────────┘                     │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

File Descriptor Limits

# Per-process soft/hard limits
ulimit -n      # Current limit (soft)
ulimit -Hn     # Hard limit

# System-wide limit
cat /proc/sys/fs/file-max

# Current system-wide usage
cat /proc/sys/fs/file-nr
# allocated  free  maximum
# 9024       0     9223372036854775807

# Per-process FD usage
ls /proc/<pid>/fd | wc -l

# Which files does a process have open?
lsof -p <pid>
ls -la /proc/<pid>/fd/
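The soft limit that `ulimit -n` reports can also be read programmatically via getrlimit(); a small sketch:

```c
#include <sys/resource.h>

/* Read the per-process open-file limit (RLIMIT_NOFILE), the same
 * value `ulimit -n` prints. Returns the soft limit, or 0 on error. */
unsigned long fd_soft_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    return (unsigned long)rl.rlim_cur;
}
```

setrlimit() with the same struct raises the soft limit, up to the hard limit, without restarting the process.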

File Descriptor Inheritance

// Fork: child inherits FDs
pid_t pid = fork();
// Both parent and child share the same struct file
// Changes to f_pos are visible to both!

// After exec: FDs preserved unless O_CLOEXEC
int fd = open("/tmp/file", O_RDONLY | O_CLOEXEC);
// fd will be closed on exec()

// Dup: creates new FD pointing to same file
int fd2 = dup(fd);    // fd2 shares f_pos with fd
int fd3 = dup2(fd, 5); // fd3 = 5, points to same file

Path Lookup

The namei() Journey

┌─────────────────────────────────────────────────────────────────────┐
│                    PATH RESOLUTION: /home/user/file.txt             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. Start at root dentry (or cwd for relative paths)                │
│     Current: /                                                       │
│     └─ Check dcache for "home"                                      │
│        └─ MISS: Read directory from disk                            │
│        └─ Find inode number for "home"                              │
│        └─ Create dentry, add to dcache                              │
│                                                                      │
│  2. Current: /home                                                   │
│     └─ Check dcache for "user"                                      │
│        └─ HIT: dentry found in cache                                │
│        └─ Check permissions: can we enter?                          │
│        └─ Handle mount points (if /home/user is mounted)            │
│                                                                      │
│  3. Current: /home/user                                              │
│     └─ Check dcache for "file.txt"                                  │
│        └─ MISS: Read directory from disk                            │
│        └─ Find inode number for "file.txt"                          │
│        └─ Create dentry, add to dcache                              │
│                                                                      │
│  4. Final: /home/user/file.txt                                       │
│     └─ Return dentry pointing to file's inode                       │
│                                                                      │
│  Permissions checked at EVERY step!                                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                    SYMLINK RESOLUTION                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  /usr/local/bin/python → /usr/bin/python3.11                        │
│                                                                      │
│  1. Resolve /usr/local/bin/python                                   │
│  2. Find it's a symlink (inode type = S_IFLNK)                     │
│  3. Read link target: /usr/bin/python3.11                           │
│  4. Start new resolution from / (absolute) or current (relative)   │
│  5. Resolve /usr/bin/python3.11                                     │
│  6. Return final inode                                               │
│                                                                      │
│  Limits:                                                             │
│  - Max 40 symlink levels (MAXSYMLINKS)                              │
│  - Prevents infinite loops: A → B → A                               │
│                                                                      │
│  O_NOFOLLOW: Don't follow final symlink                             │
│  O_PATH: Return fd without opening file                             │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
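Step 3 of the resolution above, reading the link target, maps to the readlink() syscall, which notably does not NUL-terminate its output. A minimal sketch:

```c
#include <unistd.h>

/* Read one level of a symlink's target, as step 3 above does.
 * readlink() does NOT NUL-terminate, so terminate it ourselves.
 * Returns 0 on success, -1 on error. len must be > 0. */
int read_link_target(const char *link, char *buf, size_t len) {
    ssize_t n = readlink(link, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

For the full recursive resolution (all 40 levels, relative targets included), realpath(3) does the whole walk and returns the canonical path.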

Page Cache

Page Cache Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        PAGE CACHE                                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Application                                                         │
│      │ read(fd, buf, 4096)                                          │
│      ▼                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                      Page Cache                                  ││
│  │                                                                  ││
│  │  struct address_space (per inode)                               ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │  Radix tree of pages                                       │ ││
│  │  │                                                            │ ││
│  │  │   Offset:  0    4K   8K   12K  16K  20K  24K  28K         │ ││
│  │  │           ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐         │ ││
│  │  │   Pages:  │██│ │██│ │░░│ │██│ │░░│ │██│ │██│ │░░│         │ ││
│  │  │           └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘         │ ││
│  │  │                                                            │ ││
│  │  │   ██ = Page in cache    ░░ = Not cached                   │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│                              │ Cache miss                           │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                        Block Layer                               ││
│  │                     Read from disk                               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Page Cache Operations

# View page cache usage
cat /proc/meminfo | grep -E "^(Cached|Buffers|Dirty|Writeback)"

# Drop caches (for testing only!)
echo 1 > /proc/sys/vm/drop_caches  # Page cache
echo 2 > /proc/sys/vm/drop_caches  # Dentries + inodes
echo 3 > /proc/sys/vm/drop_caches  # Both

# Check if file is cached
vmtouch /path/to/file
# Files: 1
# Directories: 0
# Resident Pages: 256/256  100%
# Elapsed: 0.000123 seconds

# Cache a file
vmtouch -t /path/to/file

# Evict file from cache
vmtouch -e /path/to/file

# Check page cache hit ratio
perf stat -e cache-references,cache-misses ./my_program
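vmtouch's residency check can be approximated with mincore(): map the file and ask the kernel which of its pages are currently in the page cache. A sketch with minimal error handling:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count how many of a file's pages are resident in the page cache,
 * roughly what `vmtouch file` reports. Returns -1 on error. */
long resident_pages(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    fstat(fd, &st);
    if (st.st_size == 0) { close(fd); return 0; }
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = ((size_t)st.st_size + pagesz - 1) / pagesz;
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED) return -1;
    unsigned char vec[npages];           /* one status byte per page */
    long resident = 0;
    if (mincore(map, (size_t)st.st_size, vec) == 0)
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;      /* low bit = page resident */
    munmap(map, (size_t)st.st_size);
    return resident;
}
```

Note mincore() reports residency at the moment of the call; the kernel may evict or fault pages in immediately afterward.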

Read-Ahead

# Check read-ahead setting (value is in KB)
cat /sys/block/sda/queue/read_ahead_kb
# Default: 128 (128 KB)

# Increase for sequential workloads
echo 256 > /sys/block/sda/queue/read_ahead_kb

# Application-level hints (C, via posix_fadvise(2)):
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // More aggressive read-ahead
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);      // Disable read-ahead
posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);    // Prefetch into page cache
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);    // Drop from page cache

Write Paths

Buffered vs Direct I/O

┌─────────────────────────────────────────────────────────────────────┐
│                    WRITE PATHS COMPARISON                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  BUFFERED WRITE                    DIRECT WRITE (O_DIRECT)          │
│  ───────────────                   ─────────────────────            │
│                                                                      │
│  Application                       Application                       │
│      │ write(fd, buf, 4096)           │ write(fd, buf, 4096)        │
│      ▼                                ▼                             │
│  ┌────────────┐                   (no page cache)                   │
│  │ Page Cache │                       │                             │
│  │ (dirty)    │                       │                             │
│  └────────────┘                       │                             │
│      │                                │                             │
│      │ (background flush)             │ (immediate)                 │
│      ▼                                ▼                             │
│  ┌────────────┐                   ┌────────────┐                    │
│  │ Block      │                   │ Block      │                    │
│  │ Layer      │                   │ Layer      │                    │
│  └────────────┘                   └────────────┘                    │
│      │                                │                             │
│      ▼                                ▼                             │
│  ┌────────────┐                   ┌────────────┐                    │
│  │ Disk       │                   │ Disk       │                    │
│  └────────────┘                   └────────────┘                    │
│                                                                      │
│  Pros:                            Pros:                              │
│  - Fast returns                   - Predictable latency             │
│  - Coalescing                     - App controls caching            │
│  - Read cache                     - No double buffering             │
│                                                                      │
│  Cons:                            Cons:                              │
│  - Data loss on crash             - Slower for small I/O            │
│  - Memory overhead                - Alignment requirements          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
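The alignment requirement listed under O_DIRECT's cons is worth showing concretely. A sketch assuming 4096-byte alignment (the real requirement is the device's logical block size, often 512) and a hypothetical /tmp path; note that some filesystems reject O_DIRECT outright:

```c
#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

/* O_DIRECT requires the buffer, file offset, and transfer size to be
 * aligned. posix_memalign() returns a suitably aligned buffer;
 * 4096 bytes is assumed here as a common safe alignment. */
void *alloc_direct_buf(size_t size) {
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, size) != 0)
        return NULL;
    memset(buf, 0, size);
    return buf;
}

/* Open for direct writes, falling back to buffered I/O where the
 * filesystem rejects O_DIRECT (tmpfs historically returned EINVAL). */
int open_for_write(const char *path) {
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        fd = open(path, O_WRONLY | O_CREAT, 0644);
    return fd;
}
```

A misaligned buffer or length makes the write() itself fail with EINVAL, which is the classic O_DIRECT debugging surprise.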

Write-Back and Dirty Pages

# View dirty pages and write-back status
cat /proc/meminfo | grep -E "^(Dirty|Writeback):"

# Tune write-back behavior
# Start write-back when dirty pages exceed this %
cat /proc/sys/vm/dirty_background_ratio   # Default: 10%

# Process blocks when dirty pages exceed this %
cat /proc/sys/vm/dirty_ratio              # Default: 20%

# Age at which data is old enough to be written (centiseconds)
cat /proc/sys/vm/dirty_expire_centisecs   # Default: 3000 (30 seconds)

# Interval between write-back daemon wake-ups (centiseconds)
cat /proc/sys/vm/dirty_writeback_centisecs # Default: 500 (5 seconds)

# Force write-back
sync           # Shell command: sync all filesystems
# From C (unistd.h):
syncfs(fd)     # Sync the filesystem containing fd
fsync(fd)      # Sync one file's data + metadata
fdatasync(fd)  # Sync one file's data (metadata only if needed to read it)

Filesystem Types

Virtual Filesystems

# procfs - Process information
/proc/
├── <pid>/          # Per-process info
│   ├── cmdline     # Command line
│   ├── environ     # Environment
│   ├── fd/         # File descriptors
│   ├── maps        # Memory mappings
│   ├── stat        # Process status
│   └── status      # Human-readable status
├── cpuinfo         # CPU information
├── meminfo         # Memory information
└── sys/            # Kernel parameters (sysctl)

# sysfs - Device/driver information
/sys/
├── block/          # Block devices
├── bus/            # Bus types
├── class/          # Device classes
├── devices/        # Device hierarchy
├── fs/             # Filesystem info
│   └── cgroup/     # Cgroup controllers
└── kernel/         # Kernel settings

# tmpfs - RAM-based filesystem
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk
# Used for /tmp, /run, /dev/shm

Disk Filesystems

Filesystem   Use Case           Max File Size   Features
ext4         General purpose    16 TiB          Journaling, extents
XFS          Large files        8 EiB           Parallel I/O, reflinks
Btrfs        Advanced features  16 EiB          Snapshots, checksums
ZFS          Enterprise         16 EiB          Pools, RAID, snapshots
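Which of these filesystems backs a given path can be checked at runtime from the superblock magic number that statfs() exposes; a sketch:

```c
#include <sys/statfs.h>
#include <linux/magic.h>   /* EXT4_SUPER_MAGIC, PROC_SUPER_MAGIC, ... */

/* Return the filesystem magic number for the mount behind a path,
 * or -1 on error. Compare against the constants in linux/magic.h. */
long fs_magic(const char *path) {
    struct statfs sf;
    if (statfs(path, &sf) != 0)
        return -1;
    return (long)sf.f_type;
}
```

For example, fs_magic("/proc") yields PROC_SUPER_MAGIC (0x9fa0), and an ext4 mount yields EXT4_SUPER_MAGIC.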

Mount Namespaces and Bind Mounts

# View mount points
mount
cat /proc/mounts
findmnt

# Bind mount: Mount a directory in another location
mount --bind /source /target

# Recursive bind mount
mount --rbind /source /target

# Make mount private (don't propagate)
mount --make-private /target

# Make mount shared (propagate mount/unmount events to peers)
mount --make-shared /target

# Mount namespace: Isolated mount table
unshare --mount /bin/bash
# Mounts in this shell don't affect parent

# Containers use mount namespaces extensively
# Each container has its own root filesystem

Interview Deep Dives

What happens when you run cat file.txt?

Complete flow:
  1. Process creation: Shell forks, child execs /bin/cat
  2. open() syscall:
    • Path lookup via VFS (dcache, namei)
    • Permission check (DAC + MAC)
    • Allocate struct file
    • Allocate file descriptor
    • Return fd
  3. read() syscall:
    • fd → struct file → inode → address_space
    • Check page cache for requested pages
    • Cache miss: Submit I/O to block layer
    • I/O completion: Copy to page cache
    • Copy from page cache to user buffer
    • Return bytes read
  4. write() to stdout:
    • fd 1 → terminal device
    • TTY layer processes output
    • Display on screen
  5. close() + exit:
    • Decrement file refcount
    • Free fd in table
    • Process exits
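The open/read/write/close flow above is essentially what a minimal cat does; a sketch that copies a file to an already-open output descriptor:

```c
#include <fcntl.h>
#include <unistd.h>

/* Minimal cat: open, read through the page cache into a small buffer,
 * write to out_fd, close. Returns bytes copied, or -1 on error. */
long cat_fd(const char *path, int out_fd) {
    int fd = open(path, O_RDONLY);   /* path lookup + struct file + fd */
    if (fd < 0) return -1;
    char buf[4096];
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {  /* page cache -> buf */
        if (write(out_fd, buf, (size_t)n) != n) {
            close(fd);
            return -1;
        }
        total += n;
    }
    close(fd);                       /* drop the struct file refcount */
    return n < 0 ? -1 : total;
}
```

The real cat passes STDOUT_FILENO as out_fd, which is where the TTY layer takes over.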
What is the difference between fsync() and fdatasync()?

fsync():
  • Flushes file data AND metadata to disk
  • Metadata: size, mtime, block pointers
  • Two writes: data blocks + inode
  • Required for crash consistency
fdatasync():
  • Flushes file data to disk
  • Only flushes metadata if needed for data access
  • Skips non-essential metadata (atime, mtime)
  • Faster for append-only patterns
When to use which:
// fdatasync: Appending to file
write(fd, data, len);
fdatasync(fd);  // Data + size update

// fsync: After rename
int fd = open("file.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, data, len);
fsync(fd);              // Ensure data is on disk
close(fd);
rename("file.tmp", "file");  // Atomic replace
int dir_fd = open(".", O_RDONLY | O_DIRECTORY);
fsync(dir_fd);          // Ensure directory is updated
How do you find and prevent file descriptor leaks?

Kernel mechanisms:
  1. RLIMIT_NOFILE: Per-process limit on open fds
    ulimit -n  # Check limit
    
  2. O_CLOEXEC: Close fd on exec
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    
  3. Process exit: All fds automatically closed
User-space practices:
  1. Use RAII (C++/Rust): Fd closed when object destroyed
  2. Close fds in error paths
  3. Use valgrind/lsof to detect leaks
# Find fd leaks
watch -n1 'ls /proc/$(pidof myapp)/fd | wc -l'
lsof -p $(pidof myapp) | tail -20
Why is ls slow in a directory with millions of files?

Causes:
  1. Directory reading: getdents() syscall reads directory entries
    • Linear scan of directory file
    • ext4 uses htree for lookup, but listing is still O(n)
  2. stat() per file: ls -l stats every file
    • Each stat is a separate syscall
    • May require inode read from disk
  3. Sorting: ls sorts output
    • O(n log n) in memory
Solutions:
# Use ls -f (no sorting, includes . and ..)
ls -f

# Use ls -U (no sorting)
ls -U

# Avoid -l if not needed (no stat per file)
ls

# For really large directories
find . -maxdepth 1 -print0 | head -c 10000
Architectural solutions:
  • Don’t put millions of files in one directory
  • Use directory sharding: files/ab/cd/file.txt
  • Use filesystem with better large directory support (XFS)
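The sharding scheme above can be driven by any stable hash of the filename; a sketch using FNV-1a (an arbitrary choice) to build a files/ab/cd/ path:

```c
#include <stdio.h>

/* FNV-1a 64-bit hash: stable across runs, cheap, adequate for
 * spreading names over shard directories (not cryptographic). */
unsigned long long fnv1a(const char *s) {
    unsigned long long h = 14695981039346656037ULL;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Two-level sharding: 256 x 256 subdirectories keeps any single
 * directory small even with tens of millions of files. */
void shard_path(const char *name, char *out, size_t len) {
    unsigned long long h = fnv1a(name);
    snprintf(out, len, "files/%02llx/%02llx/%s",
             (h >> 8) & 0xff, h & 0xff, name);
}
```

Because the hash is deterministic, lookups recompute the same shard path; no index of file locations is needed.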

Performance Monitoring

# VFS cache statistics
cat /proc/slabinfo | grep -E "dentry|inode"

# File system operations tracing
sudo bpftrace -e '
kprobe:vfs_read { @reads = count(); }
kprobe:vfs_write { @writes = count(); }
interval:s:1 { print(@reads); print(@writes); clear(@reads); clear(@writes); }
'

# Page cache hit ratio
sudo cachestat-bpfcc 1

# File I/O latency
sudo fileslower-bpfcc 10  # Show I/O > 10ms

# Open files by process
sudo opensnoop-bpfcc

# Watch for slow path lookups
sudo bpftrace -e '
kprobe:path_lookupat {
    @start[tid] = nsecs;
}
kretprobe:path_lookupat /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000;
    if ($lat > 1000) {
        printf("slow lookup: %d us\n", $lat);
    }
    delete(@start[tid]);
}
'

Next: I/O Subsystem →