Virtual Filesystem Layer - Unified abstraction for all filesystem operations

Filesystem & VFS

The Virtual File System (VFS) is Linux’s abstraction layer that provides a unified interface to all filesystem types. Understanding VFS is crucial for debugging I/O issues and designing storage-aware systems.

Prerequisites: System calls, process fundamentals
Interview Focus: File descriptors, VFS architecture, I/O paths, page cache
Time to Master: 4-5 hours

VFS Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                       VFS ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Application                                                     ││
│  │  open(), read(), write(), close(), stat(), mmap()               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│                              │ System Call Interface                 │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  Kernel Space                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                    Virtual File System (VFS)                     ││
│  │                                                                  ││
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐             ││
│  │  │ File Objects │ │Dentry Cache  │ │ Inode Cache  │             ││
│  │  │(open files)  │ │(path lookup) │ │(metadata)    │             ││
│  │  └──────────────┘ └──────────────┘ └──────────────┘             ││
│  │                                                                  ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │              VFS Operations (file_operations)               │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────┬─────────────┼─────────────┬─────────────┐          │
│  │             │             │             │             │          │
│  ▼             ▼             ▼             ▼             ▼          │
│  ┌──────┐   ┌─────┐    ┌──────┐    ┌──────┐    ┌───────┐           │
│  │ ext4 │   │ XFS │    │ NFS  │    │ proc │    │ sysfs │           │
│  └──────┘   └─────┘    └──────┘    └──────┘    └───────┘           │
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │              Block Layer / Network Stack                        ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Core VFS Data Structures

The Four Pillars

The four core objects are the superblock, inode, dentry, and file.

superblock

Represents a mounted filesystem:

struct super_block {
    struct list_head    s_list;       // List of all superblocks
    dev_t               s_dev;        // Device identifier
    unsigned long       s_blocksize;  // Block size in bytes
    unsigned char       s_blocksize_bits;
    loff_t              s_maxbytes;   // Max file size
    struct file_system_type *s_type;  // Filesystem type
    const struct super_operations *s_op;  // Operations
    struct dentry       *s_root;      // Root dentry
    struct list_head    s_inodes;     // All inodes
    struct list_head    s_dirty;      // Dirty inodes (older kernels; now tracked per backing device)
    // ... more fields
};

Key operations:

alloc_inode(): Allocate inode
destroy_inode(): Free inode
write_inode(): Write inode to disk
sync_fs(): Sync filesystem

inode

Represents a file’s metadata:

struct inode {
    umode_t             i_mode;       // Permissions + type
    kuid_t              i_uid;        // Owner UID (namespace-aware kuid_t)
    kgid_t              i_gid;        // Owner GID (namespace-aware kgid_t)
    unsigned long       i_ino;        // Inode number
    loff_t              i_size;       // File size
    struct timespec64   i_atime;      // Access time
    struct timespec64   i_mtime;      // Modify time  
    struct timespec64   i_ctime;      // Change time
    unsigned int        i_nlink;      // Hard link count
    blkcnt_t            i_blocks;     // Blocks allocated
    union {
        struct block_device *i_bdev;  // Block device (older kernels only)
        struct cdev *i_cdev;          // Char device
    };
    const struct inode_operations *i_op;
    const struct file_operations  *i_fop;
    struct address_space *i_mapping; // Page cache
    // ... more fields
};

Key insight: An inode exists once per file, regardless of how many processes have it open.

dentry

Represents a directory entry (path component):

struct dentry {
    struct qstr         d_name;       // Filename
    struct inode        *d_inode;     // Associated inode
    struct dentry       *d_parent;    // Parent directory
    struct hlist_node   d_hash;       // Lookup hash
    struct list_head    d_lru;        // LRU list
    struct list_head    d_child;      // Parent's children
    struct list_head    d_subdirs;    // Our children
    const struct dentry_operations *d_op;
    struct super_block  *d_sb;        // Superblock
    // ... more fields
};

dentry cache (dcache):

Caches pathname lookups
Hugely important for performance
Negative dentries cache “file not found”

file

Represents an open file:

struct file {
    struct path         f_path;       // dentry + vfsmount
    struct inode        *f_inode;     // Cached inode
    const struct file_operations *f_op;  // Operations
    spinlock_t          f_lock;       // Lock
    atomic_long_t       f_count;      // Reference count
    unsigned int        f_flags;      // O_RDONLY, O_NONBLOCK, etc.
    fmode_t             f_mode;       // FMODE_READ, FMODE_WRITE
    loff_t              f_pos;        // Current position
    struct address_space *f_mapping;  // Page cache
    void                *private_data; // Filesystem private
    // ... more fields
};

Key insight: Each open() creates a new struct file, even for the same inode.
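
The two key insights above can be observed from user space. The following is a minimal sketch, not kernel code: `two_opens_demo` is a hypothetical helper name, and it assumes `path` names a readable regular file with at least `n` bytes. It opens the file twice, reads through only one descriptor, and reports that the offsets diverge while `fstat()` shows the same device and inode number.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: opens `path` twice and reads up to `n` bytes through
 * the first descriptor only. On success returns 0 and stores each
 * descriptor's offset plus whether both refer to the same inode.
 * Returns -1 on any error. */
int two_opens_demo(const char *path, size_t n,
                   off_t *pos_a, off_t *pos_b, int *same_inode)
{
    char buf[64];
    struct stat sa, sb;
    int rc = -1;

    if (n > sizeof buf)
        n = sizeof buf;

    int a = open(path, O_RDONLY);
    int b = open(path, O_RDONLY);   /* second struct file, same inode */
    if (a < 0 || b < 0)
        goto out;

    if (read(a, buf, n) < 0)        /* advances only a's file position */
        goto out;

    *pos_a = lseek(a, 0, SEEK_CUR);
    *pos_b = lseek(b, 0, SEEK_CUR);

    if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
        goto out;
    *same_inode = (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);
    rc = 0;
out:
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return rc;
}
```

After a 4-byte read through the first descriptor, the first offset is 4 and the second is still 0, because each `open()` produced its own `struct file` with its own `f_pos`.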

File Descriptors

File Descriptor Table

┌─────────────────────────────────────────────────────────────────────┐
│                    FILE DESCRIPTOR ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Process A                    Process B                             │
│  ┌────────────────────┐      ┌────────────────────┐                │
│  │ task_struct        │      │ task_struct        │                │
│  │ └─files ┐          │      │ └─files ────┐      │                │
│  └─────────│──────────┘      └─────────────│──────┘                │
│            │                              │                         │
│            ▼                              ▼                         │
│  ┌─────────────────────┐      ┌─────────────────────┐              │
│  │ files_struct (A)    │      │ files_struct (B)    │              │
│  │ ┌───────────────┐   │      │ ┌───────────────┐   │              │
│  │ │ fd_array      │   │      │ │ fd_array      │   │              │
│  │ │ [0]────────────────┐     │ │ [0]──────────────┐│              │
│  │ │ [1]─────────────┐  │     │ │ [1]──────────┐ │ ││              │
│  │ │ [2]──────┐    │ │  │     │ │ [2]────┐   │ │ ││              │
│  │ │ [3]────┐ │    │ │  │     │ │ [3]──┐ │   │ │ ││              │
│  │ └────────│─│────│─│──│     │ └──────│─│───│─│─││              │
│  └──────────│─│────│─│──│     └────────│─│───│─│─││              │
│             │ │    │ │  │              │ │   │ │ ││              │
│             ▼ ▼    ▼ ▼  ▼              ▼ ▼   ▼ ▼ ▼│              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    struct file instances                    │   │
│  │                                                             │   │
│  │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐            │   │
│  │  │file A  │  │file B  │  │file C  │  │file D  │            │   │
│  │  │f_pos=0 │  │f_pos=100│ │f_pos=50│  │f_pos=0 │            │   │
│  │  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘            │   │
│  │      │           │           │           │                  │   │
│  └──────│───────────│───────────│───────────│──────────────────┘   │
│         └───────────┴───────────┴───────────┘                      │
│                              │                                      │
│                              ▼                                      │
│              ┌───────────────────────────────┐                     │
│              │          struct inode         │                     │
│              │       (single instance)       │                     │
│              └───────────────────────────────┘                     │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

File Descriptor Limits

# Per-process soft/hard limits
ulimit -n      # Current limit (soft)
ulimit -Hn     # Hard limit

# System-wide limit
cat /proc/sys/fs/file-max

# Current system-wide usage
cat /proc/sys/fs/file-nr
# allocated  free  maximum
# 9024       0     9223372036854775807

# Per-process FD usage
ls /proc/<pid>/fd | wc -l

# Which files does a process have open?
lsof -p <pid>
ls -la /proc/<pid>/fd/
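
The same limits and counts can be read from inside a program. This is a small sketch under the assumption that `/proc` is mounted; `fd_soft_limit` and `count_open_fds` are illustrative names, not library functions.

```c
#include <dirent.h>
#include <limits.h>
#include <stddef.h>
#include <sys/resource.h>

/* Returns the soft RLIMIT_NOFILE limit (the value `ulimit -n` prints),
 * LONG_MAX if the limit is unlimited, or -1 on error. */
long fd_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t)LONG_MAX)
        return LONG_MAX;
    return (long)rl.rlim_cur;
}

/* Counts this process's open descriptors by listing /proc/self/fd.
 * The descriptor opendir() uses for the scan itself is excluded.
 * Returns -1 if /proc/self/fd cannot be opened. */
int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *e;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((e = readdir(dir)) != NULL) {
        if (e->d_name[0] == '.')   /* skip "." and ".." */
            continue;
        count++;
    }
    closedir(dir);
    return count - 1;              /* minus the directory fd used for the scan */
}
```

Comparing `count_open_fds()` against `fd_soft_limit()` at intervals is one simple way to notice a process drifting toward EMFILE.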

File Descriptor Inheritance

// Fork: child inherits FDs
pid_t pid = fork();
// Both parent and child share the same struct file
// Changes to f_pos are visible to both!

// After exec: FDs preserved unless O_CLOEXEC
int fd = open("/tmp/file", O_RDONLY | O_CLOEXEC);
// fd will be closed on exec()

// Dup: creates new FD pointing to same file
int fd2 = dup(fd);    // fd2 shares f_pos with fd
int fd3 = dup2(fd, 5); // fd3 = 5, points to same file

Path Lookup

The namei() Journey

┌─────────────────────────────────────────────────────────────────────┐
│                    PATH RESOLUTION: /home/user/file.txt             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. Start at root dentry (or cwd for relative paths)                │
│     Current: /                                                       │
│     └─ Check dcache for "home"                                      │
│        └─ MISS: Read directory from disk                            │
│        └─ Find inode number for "home"                              │
│        └─ Create dentry, add to dcache                              │
│                                                                      │
│  2. Current: /home                                                   │
│     └─ Check dcache for "user"                                      │
│        └─ HIT: dentry found in cache                                │
│        └─ Check permissions: can we enter?                          │
│        └─ Handle mount points (if /home/user is mounted)            │
│                                                                      │
│  3. Current: /home/user                                              │
│     └─ Check dcache for "file.txt"                                  │
│        └─ MISS: Read directory from disk                            │
│        └─ Find inode number for "file.txt"                          │
│        └─ Create dentry, add to dcache                              │
│                                                                      │
│  4. Final: /home/user/file.txt                                       │
│     └─ Return dentry pointing to file's inode                       │
│                                                                      │
│  Permissions checked at EVERY step!                                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Symlink Resolution

┌─────────────────────────────────────────────────────────────────────┐
│                    SYMLINK RESOLUTION                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  /usr/local/bin/python → /usr/bin/python3.11                        │
│                                                                      │
│  1. Resolve /usr/local/bin/python                                   │
│  2. Find it's a symlink (inode type = S_IFLNK)                     │
│  3. Read link target: /usr/bin/python3.11                           │
│  4. Start new resolution from / (absolute) or current (relative)   │
│  5. Resolve /usr/bin/python3.11                                     │
│  6. Return final inode                                               │
│                                                                      │
│  Limits:                                                             │
│  - Max 40 symlink levels (MAXSYMLINKS)                              │
│  - Prevents infinite loops: A → B → A                               │
│                                                                      │
│  O_NOFOLLOW: Don't follow final symlink                             │
│  O_PATH: Return fd without opening file                             │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Page Cache

Page Cache Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        PAGE CACHE                                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Application                                                         │
│      │ read(fd, buf, 4096)                                          │
│      ▼                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                      Page Cache                                  ││
│  │                                                                  ││
│  │  struct address_space (per inode)                               ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │  XArray of pages (a radix tree before Linux 4.20)          │ ││
│  │  │                                                            │ ││
│  │  │   Offset:  0    4K   8K   12K  16K  20K  24K  28K         │ ││
│  │  │           ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐         │ ││
│  │  │   Pages:  │██│ │██│ │░░│ │██│ │░░│ │██│ │██│ │░░│         │ ││
│  │  │           └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘         │ ││
│  │  │                                                            │ ││
│  │  │   ██ = Page in cache    ░░ = Not cached                   │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│                              │ Cache miss (page not resident)       │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                        Block Layer                               ││
│  │                     Read from disk                               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Page Cache Operations

# View page cache usage
cat /proc/meminfo | grep -E "^(Cached|Buffers|Dirty|Writeback)"

# Drop caches (for testing only!)
echo 1 > /proc/sys/vm/drop_caches  # Page cache
echo 2 > /proc/sys/vm/drop_caches  # Dentries + inodes
echo 3 > /proc/sys/vm/drop_caches  # Both

# Check if file is cached
vmtouch /path/to/file
# Files: 1
# Directories: 0
# Resident Pages: 256/256  100%
# Elapsed: 0.000123 seconds

# Cache a file
vmtouch -t /path/to/file

# Evict file from cache
vmtouch -e /path/to/file

# Check page cache hit ratio
# (perf's cache-references/cache-misses count CPU cache events, not page cache)
sudo cachestat-bpfcc 1
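
Residency checks like the one vmtouch reports can be done with `mmap()` plus `mincore()`. The following is a minimal sketch under that assumption; `cached_pages` is a hypothetical name, and newer kernels only report true residency for files the caller owns or can write.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Counts how many pages of `path` are resident in the page cache.
 * Stores the resident and total page counts; returns 0 on success, -1 on error. */
int cached_pages(const char *path, size_t *resident, size_t *total)
{
    struct stat st;
    void *map = MAP_FAILED;
    unsigned char *vec = NULL;
    size_t len = 0, pages = 0;
    long page = sysconf(_SC_PAGESIZE);
    int rc = -1;

    *resident = 0;
    *total = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0 || page <= 0)
        goto out;
    if (fstat(fd, &st) < 0)
        goto out;
    if (st.st_size == 0) {          /* empty file: nothing to map */
        rc = 0;
        goto out;
    }

    len = (size_t)st.st_size;
    pages = (len + (size_t)page - 1) / (size_t)page;

    /* Mapping the file does not read it; pages are only faulted in on access. */
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    vec = malloc(pages);
    if (vec == NULL)
        goto out;
    if (mincore(map, len, vec) < 0) /* one status byte per page */
        goto out;

    for (size_t i = 0; i < pages; i++)
        if (vec[i] & 1)             /* low bit set: page is resident */
            (*resident)++;
    *total = pages;
    rc = 0;
out:
    free(vec);
    if (map != MAP_FAILED)
        munmap(map, len);
    if (fd >= 0)
        close(fd);
    return rc;
}
```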

Read-Ahead

# Check read-ahead setting (in KiB)
cat /sys/block/sda/queue/read_ahead_kb
# Default: 128 (128KB)

# Increase for sequential workloads
echo 256 > /sys/block/sda/queue/read_ahead_kb

# Application-level hints (C calls, not shell commands)
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // Expect sequential access: larger read-ahead
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);      // Expect random access: disable read-ahead
posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);    // Pre-fetch into the page cache
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);    // Drop clean cached pages for this range

Write Paths

Buffered vs Direct I/O

┌─────────────────────────────────────────────────────────────────────┐
│                    WRITE PATHS COMPARISON                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  BUFFERED WRITE                    DIRECT WRITE (O_DIRECT)          │
│  ───────────────                   ─────────────────────            │
│                                                                      │
│  Application                       Application                       │
│      │ write(fd, buf, 4096)           │ write(fd, buf, 4096)        │
│      ▼                                ▼                             │
│  ┌────────────┐                   (no page cache)                   │
│  │ Page Cache │                       │                             │
│  │ (dirty)    │                       │                             │
│  └────────────┘                       │                             │
│      │                                │                             │
│      │ (background flush)             │ (immediate)                 │
│      ▼                                ▼                             │
│  ┌────────────┐                   ┌────────────┐                    │
│  │ Block      │                   │ Block      │                    │
│  │ Layer      │                   │ Layer      │                    │
│  └────────────┘                   └────────────┘                    │
│      │                                │                             │
│      ▼                                ▼                             │
│  ┌────────────┐                   ┌────────────┐                    │
│  │ Disk       │                   │ Disk       │                    │
│  └────────────┘                   └────────────┘                    │
│                                                                      │
│  Pros:                            Pros:                              │
│  - Fast returns                   - Predictable latency             │
│  - Coalescing                     - App controls caching            │
│  - Read cache                     - No double buffering             │
│                                                                      │
│  Cons:                            Cons:                              │
│  - Data loss on crash             - Slower for small I/O            │
│  - Memory overhead                - Alignment requirements          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Write-Back and Dirty Pages

# View dirty pages and write-back status
cat /proc/meminfo | grep -E "^(Dirty|Writeback):"

# Tune write-back behavior
# Start write-back when dirty pages exceed this %
cat /proc/sys/vm/dirty_background_ratio   # Default: 10%

# Process blocks when dirty pages exceed this %
cat /proc/sys/vm/dirty_ratio              # Default: 20%

# Age at which data is old enough to be written (centiseconds)
cat /proc/sys/vm/dirty_expire_centisecs   # Default: 3000 (30 seconds)

# Interval between write-back daemon wake-ups (centiseconds)
cat /proc/sys/vm/dirty_writeback_centisecs # Default: 500 (5 seconds)

# Force sync
sync           # Shell command: sync all filesystems
# The following are C calls:
syncfs(fd)     # Sync the filesystem that contains fd
fsync(fd)      # Sync one file's data + metadata
fdatasync(fd)  # Sync one file's data + only the metadata needed to read it back

Filesystem Types

Virtual Filesystems

# procfs - Process information
/proc/
├── <pid>/          # Per-process info
│   ├── cmdline     # Command line
│   ├── environ     # Environment
│   ├── fd/         # File descriptors
│   ├── maps        # Memory mappings
│   ├── stat        # Process status
│   └── status      # Readable status
├── cpuinfo         # CPU information
├── meminfo         # Memory information
└── sys/            # Kernel parameters (sysctl)

# sysfs - Device/driver information
/sys/
├── block/          # Block devices
├── bus/            # Bus types
├── class/          # Device classes
├── devices/        # Device hierarchy
├── fs/             # Filesystem info
│   └── cgroup/     # Cgroup controllers
└── kernel/         # Kernel settings

# tmpfs - RAM-based filesystem
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk
# Used for /tmp, /run, /dev/shm

Disk Filesystems

Filesystem	Use Case	Max File Size	Features
ext4	General purpose	16 TiB	Journaling, extents
XFS	Large files	8 EiB	Parallel I/O, reflinks
Btrfs	Advanced features	16 EiB	Snapshots, checksums
ZFS	Enterprise	16 EiB	Pools, RAID, snapshots

Mount Namespaces and Bind Mounts

# View mount points
mount
cat /proc/mounts
findmnt

# Bind mount: Mount a directory in another location
mount --bind /source /target

# Recursive bind mount
mount --rbind /source /target

# Make mount private (don't propagate)
mount --make-private /target

# Make mount shared (propagate mount events to its peer group)
mount --make-shared /target

# Mount namespace: Isolated mount table
unshare --mount /bin/bash
# Mounts in this shell don't affect parent

# Containers use mount namespaces extensively
# Each container has its own root filesystem

Interview Deep Dives

Q: Explain what happens when you run 'cat file.txt'

Complete flow:

1. Process creation: Shell forks, child execs /bin/cat
2. open() syscall:
   - Path lookup via VFS (dcache, namei)
   - Permission check (DAC + MAC)
   - Allocate struct file
   - Allocate file descriptor
   - Return fd
3. read() syscall:
   - fd → struct file → inode → address_space
   - Check page cache for requested pages
   - Cache miss: Submit I/O to block layer
   - I/O completion: Copy to page cache
   - Copy from page cache to user buffer
   - Return bytes read
4. write() to stdout:
   - fd 1 → terminal device
   - TTY layer processes output
   - Display on screen
5. close() + exit:
   - Decrement file refcount
   - Free fd in table
   - Process exits
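
The open/read/write/close sequence in this flow is small enough to write out. Below is a minimal sketch of the core loop, not the actual coreutils implementation; `cat_file` is a hypothetical name, and retrying on `EINTR` is left out for brevity.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copies the file at `path` to the descriptor `out_fd` using only
 * open(), read(), write(), and close(). Returns the number of bytes
 * copied, or -1 on error. */
long cat_file(const char *path, int out_fd)
{
    char buf[4096];
    long total = 0;

    int fd = open(path, O_RDONLY);          /* path lookup + new struct file */
    if (fd < 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);  /* served from the page cache */
        if (n == 0)                             /* end of file */
            break;
        if (n < 0) {
            close(fd);
            return -1;
        }
        /* write() may accept fewer bytes than requested, so loop. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                close(fd);
                return -1;
            }
            off += w;
        }
        total += n;
    }
    close(fd);                              /* drops the struct file reference */
    return total;
}
```

Calling `cat_file("file.txt", STDOUT_FILENO)` mirrors what the shell's child process does after exec.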

Q: What's the difference between fsync and fdatasync?

fsync():

Flushes file data AND metadata to disk
Metadata: size, mtime, block pointers
Two writes: data blocks + inode
Required for crash consistency

fdatasync():

Flushes file data to disk
Only flushes metadata if needed for data access
Skips non-essential metadata (atime, mtime)
Faster for append-only patterns

When to use which:

// fdatasync: Appending to file
write(fd, data, len);
fdatasync(fd);  // Data + size update

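// fsync: Atomic file replacement (write temp file, fsync, rename, fsync directory)
int fd = open("file.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, data, len);
fsync(fd);              // Ensure data is on disk before the rename
close(fd);
rename("file.tmp", "file");  // Atomic replace
int dir_fd = open(".", O_RDONLY | O_DIRECTORY);
fsync(dir_fd);          // Ensure the new directory entry is durable
close(dir_fd);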

Q: How does the kernel prevent file descriptor leaks?

Kernel mechanisms:

RLIMIT_NOFILE: Per-process limit on open fds
```
ulimit -n  # Check limit
```

O_CLOEXEC: Close fd on exec

int fd = open(path, O_RDONLY | O_CLOEXEC);

Process exit: All fds automatically closed

User-space practices:

Use RAII (C++/Rust): Fd closed when object destroyed
Close fds in error paths
Use valgrind/lsof to detect leaks

# Find fd leaks
watch -n1 'ls /proc/$(pidof myapp)/fd | wc -l'
lsof -p $(pidof myapp) | tail -20

Q: Why is 'ls' slow on directories with many files?

Causes:

Directory reading: getdents() syscall reads directory entries
- Linear scan of directory file
- ext4 uses htree for lookup, but listing is still O(n)
stat() per file: ls -l stats every file
- Each stat is a separate syscall
- May require inode read from disk
Sorting: ls sorts output
- O(n log n) in memory

Solutions:

# Use ls -f (no sorting, includes . and ..)
ls -f

# Use ls -U (no sorting)
ls -U

# Avoid -l if not needed (no stat per file)
ls

# For really large directories: stream entries unsorted and stop early
find . -maxdepth 1 | head -n 100

Architectural solutions:

Don’t put millions of files in one directory
Use directory sharding: files/ab/cd/file.txt
Use filesystem with better large directory support (XFS)

Performance Monitoring

# VFS cache statistics
sudo grep -E "dentry|inode" /proc/slabinfo   # /proc/slabinfo is root-readable only

# File system operations tracing
sudo bpftrace -e '
kprobe:vfs_read { @reads = count(); }
kprobe:vfs_write { @writes = count(); }
interval:s:1 { print(@reads); print(@writes); clear(@reads); clear(@writes); }
'

# Page cache hit ratio
sudo cachestat-bpfcc 1

# File I/O latency
sudo fileslower-bpfcc 10  # Show I/O > 10ms

# Open files by process
sudo opensnoop-bpfcc

# Watch for slow path lookups
sudo bpftrace -e '
kprobe:path_lookupat {
    @start[tid] = nsecs;
}
kretprobe:path_lookupat /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000;
    if ($lat > 1000) {
        printf("slow lookup: %d us\n", $lat);
    }
    delete(@start[tid]);
}
'

Next: I/O Subsystem →
