
Virtual File System (VFS)

The File System is the primary abstraction for long-term storage. A “Senior” understanding requires knowing how the kernel provides a unified interface across dozens of on-disk and network filesystems and how it optimizes the expensive process of finding a file on disk.

0. Files, Inodes, and Paths: The Basics

Before diving into VFS internals, make sure these fundamentals are crystal clear.

What is a File?

A file is an abstraction the OS provides over raw disk blocks. To the user, a file is:
  • A name (like report.txt).
  • A sequence of bytes (the content).
  • Some metadata (size, permissions, timestamps).
To the kernel, a file is not its name; the name is just a pointer to the real object (the inode, described next).

What is an Inode?

An Inode (Index Node) is the kernel’s internal representation of a file. It contains:
  • Metadata: size, owner (UID/GID), permissions, timestamps (atime = last access, mtime = last content change, ctime = last inode change, not creation time).
  • Block pointers: Where on the disk the file’s data actually lives.
  • No filename: The inode does not store the file’s name.
Think of an inode as a library catalog card: it tells you everything about the book (author, subject, location) but not what the book is called on the shelf.
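
You can see the split between name and inode on any Linux system (the filename is just an example):

# Show the inode number next to the filename
ls -i report.txt

# Dump the metadata stored in the inode itself: size, owner, permissions, timestamps
stat report.txt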

What is a Directory?

A directory is a special file that maps names → inode numbers. When you ls a directory, the kernel:
  1. Reads the directory’s data blocks.
  2. Lists all (name, inode) pairs inside.
This is why renaming a file within the same filesystem is instant: only the directory entry changes, not the file’s data or inode.
  • Hard link: Another name pointing to the same inode. Both names are equal; deleting one doesn’t delete the data until the link count reaches zero.
  • Soft link (symlink): A file whose content is the path to another file. It can break if the target is deleted.
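
A quick demonstration of the difference, assuming a scratch directory on a local filesystem:

# Two names, one inode: note the identical inode numbers and the link count of 2
echo "hello" > original.txt
ln original.txt hardlink.txt
ls -li original.txt hardlink.txt

# A symlink stores a path; delete the target and the symlink dangles
ln -s original.txt symlink.txt
rm original.txt
cat hardlink.txt    # still works: the inode and its data survive
cat symlink.txt     # fails: the path it points to no longer exists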

Path Resolution

When you open("/home/user/file.txt", ...):
  1. Kernel starts at the root inode (/).
  2. Looks up home in /’s directory → gets inode for /home.
  3. Looks up user in /home → gets inode for /home/user.
  4. Looks up file.txt in /home/user → gets inode for the file.
  5. Returns a file descriptor referencing that inode.
This is the path walk, and optimizing it (via the dentry cache) is one of the kernel’s most critical jobs.
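
You can make the walk visible by printing the inode reached at each step (reusing the example path above):

# Inode of each directory visited while resolving /home/user/file.txt
ls -id /
ls -id /home
ls -id /home/user
ls -i  /home/user/file.txt    # the inode the returned file descriptor will reference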

1. The VFS Architecture

The Virtual File System (VFS) is a software layer in the kernel that provides the standard file-related system calls (open, read, write) to user-space, regardless of the underlying filesystem (EXT4, XFS, NFS, ProcFS).
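
Two quick ways to see this layer from user-space (output varies by machine):

# Filesystem types the running kernel has registered with the VFS
cat /proc/filesystems

# Which filesystem actually backs a given path; open/read/write behave identically either way
findmnt -T /home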

The Four Primary Objects

  1. Superblock: Represents an entire mounted filesystem. It contains metadata like the block size, total number of inodes, and the “magic number.”
  2. Inode (Index Node): Represents a specific file (or directory) on disk. It contains all metadata (size, permissions, timestamps, block pointers) but not the filename.
  3. Dentry (Directory Entry): Connects an Inode to a filename. Dentries are transient objects created in memory to speed up path lookups.
  4. File: Represents an open file in a specific process. It stores the current “cursor” position (f_pos) and the access mode (Read/Write).
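
The File object is easy to observe: every open descriptor of a process exposes its cursor and flags under /proc. A small bash illustration (the file being opened is arbitrary):

# Open a file on descriptor 3, inspect the kernel's file object, then close it
exec 3< /etc/hostname
cat /proc/$$/fdinfo/3    # "pos:" is f_pos, "flags:" encodes the access mode
exec 3<&-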

2. Path Lookup: The “Walk”

Converting a string like /home/user/code/main.c into an Inode is one of the most performance-critical paths in the kernel.

The Lookup Process

  1. Split the path: Divide by /.
  2. Dentry Cache (dcache) Lookup: The kernel first checks the memory-resident Dcache. If the dentry for home is found, it moves to the next part.
  3. Directory Traversal: If not in cache, the kernel must read the directory’s data blocks from disk to find the mapping from the filename to an Inode number.
  4. Inode Cache: Once the Inode number is found, the kernel checks the Inode cache or reads it from the Inode table on disk.
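
A rough way to feel the dcache at work (dropping caches needs root; timings vary by machine):

# How many dentries the kernel is currently caching
cat /proc/sys/fs/dentry-state

# Cold walk: drop cached dentries and inodes, then traverse a large tree
echo 2 | sudo tee /proc/sys/vm/drop_caches
time ls -R /usr/share > /dev/null

# Warm walk: the same traversal now hits the dcache instead of the disk
time ls -R /usr/share > /dev/null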

Optimization: RCU-Walk

To handle thousands of concurrent lookups (e.g., on a web server), modern Linux uses RCU-walk. It performs the path walk without taking any locks, assuming the directory structure won’t change. If a change is detected, it falls back to the slower “Ref-walk” with proper locking.

3. On-Disk Layout & Allocation

How data is actually stored on the physical platters or NAND cells.

Block Allocation

  • Extents: Instead of a list of block numbers (which is inefficient for large files), modern filesystems (EXT4, XFS) use Extents—a starting block number and a length (e.g., “Blocks 1000 to 2000”).
  • Delayed Allocation: The kernel buffers writes in memory and waits as long as possible before choosing physical blocks on disk. This allows the allocator to find a single contiguous extent for the entire file, reducing fragmentation.
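
You can inspect both ideas with filefrag, assuming the current directory sits on ext4 or XFS (not tmpfs):

# Write a 256 MB file, flush it so delayed allocation actually picks blocks, then list its extents
dd if=/dev/zero of=bigfile bs=1M count=256
sync
filefrag -v bigfile    # delayed allocation usually yields one (or very few) large extents
rm bigfile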

Journaling: Atomic Operations

Writing to a file involves multiple steps: updating the allocation bitmaps, the Inode, and the data blocks. If power is lost midway, the filesystem is left inconsistent.
  • The Journal: A dedicated circular buffer on disk. The kernel first writes the intended changes to the journal. Once the journal write is “committed,” it updates the actual filesystem. On reboot, the kernel simply “replays” the journal to restore consistency.
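
Two quick checks on an EXT4 volume (the device name is illustrative; reading the superblock needs root):

# Confirm the filesystem has a journal and see its parameters
dumpe2fs -h /dev/sda1 | grep -iE 'features|journal'

# Inspect the mount options of the root filesystem; data=ordered is the EXT4 default
findmnt -o OPTIONS /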

4. Log-Structured File Systems (LFS)

Used in Flash/SSDs (e.g., F2FS).
  • Philosophy: Never overwrite data. Treat the entire disk as a circular log.
  • Benefit: Converts random writes into sequential writes, which is significantly faster for NAND and reduces wear.
  • Trade-off: Requires a Garbage Collector to reclaim space from “dead” versions of files.
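
A minimal setup sketch, assuming /dev/nvme0n1p2 is an empty partition and f2fs-tools is installed:

# Format and mount a log-structured filesystem on flash
mkfs.f2fs -f /dev/nvme0n1p2
mkdir -p /mnt/flash
mount -t f2fs /dev/nvme0n1p2 /mnt/flash

# With debugfs mounted, F2FS exposes segment usage and GC statistics here
cat /sys/kernel/debug/f2fs/status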

5. Page Cache & Writeback

The filesystem does not talk directly to the disk for every write().
  1. Write to Page Cache: The write() syscall simply copies data into kernel memory (Pages). The page is marked as Dirty.
  2. Flushing (Writeback): Per-device background writeback workers (the bdi_writeback machinery, running on kworker threads) periodically write dirty pages to disk.
  3. Direct I/O: Applications (such as databases) can use O_DIRECT to bypass the page cache and manage their own buffers, avoiding “double buffering.”
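
Some handy observability commands (the tunable names are standard; values vary by distro):

# How much dirty (not-yet-written) data is currently in the page cache
grep -E 'Dirty|Writeback' /proc/meminfo

# The thresholds that trigger background vs. blocking writeback
sysctl vm.dirty_background_ratio vm.dirty_ratio

# Force all dirty pages out to disk now
sync

# Bypass the page cache for a single write (O_DIRECT), as a database would
dd if=/dev/zero of=testfile bs=1M count=64 oflag=direct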

6. Virtual Filesystems (procfs, sysfs)

Not all filesystems represent disks.
  • ProcFS (/proc): Exposes kernel data structures (processes, memory stats) as files.
  • SysFS (/sys): Exposes the hardware device tree and driver configurations.
  • TmpFS: A filesystem that lives entirely in memory (backed by the page cache, with overflow to swap); its contents vanish on unmount or reboot.
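
Quick examples (paths and sizes are illustrative):

# procfs: per-process and kernel state generated on demand
head /proc/self/status
head /proc/meminfo

# sysfs: the device model; e.g., block devices the kernel knows about
ls /sys/block

# tmpfs: a RAM-backed mount whose contents vanish on unmount or reboot
mkdir -p /mnt/scratch
mount -t tmpfs -o size=256m tmpfs /mnt/scratch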

7. Choosing a Filesystem in Production

Selecting the right filesystem for your workload is a key architectural decision. Here’s a practical guide:

Decision Flowchart

┌─────────────────────────────────────────────────────────────────────┐
│                     FILESYSTEM SELECTION GUIDE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  What’s your primary concern?                                       │
│                                                                     │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │
│  │ DATA INTEGRITY  │    │   PERFORMANCE   │    │   FLEXIBILITY   │  │
│  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘  │
│           │                      │                      │           │
│           ▼                      ▼                      ▼           │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │
│  │       ZFS       │    │   XFS or EXT4   │    │      Btrfs      │  │
│  │  • Checksums    │    │  • Low latency  │    │  • Snapshots    │  │
│  │  • Scrubbing    │    │  • High IOPS    │    │  • Compression  │  │
│  │  • RAID-Z       │    │  • Scalability  │    │  • Subvolumes   │  │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Comparison Table

Filesystem  Best For                        Avoid When                              Journaling        Max File Size
EXT4        General Linux, boot partitions  Very large files (>16 TB)               Metadata or Data  16 TB
XFS         Large files, high parallelism   Lots of small files, shrinking volumes  Metadata          8 EB
Btrfs       Snapshots, compression, dev     Production databases (still maturing)   Copy-on-Write     16 EB
ZFS         Data integrity, NAS, backups    Low-memory systems (less than 8 GB)     Copy-on-Write     16 EB
F2FS        SSDs, flash storage             HDDs                                    Log-structured    16 TB

Production Checklist

  1. Database workloads (MySQL, PostgreSQL): XFS or EXT4 mounted with noatime; on EXT4 keep the default data=ordered journaling mode (see the fstab sketch after this list).
  2. Container hosts (Docker, Kubernetes): XFS for overlay2 driver, or Btrfs for native driver.
  3. Backup servers: ZFS (checksums detect silent corruption) or Btrfs (snapshots).
  4. High-throughput analytics: XFS (better large file and parallel write performance).
  5. Embedded/Flash: F2FS (reduces write amplification).
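
For example, a database data volume might be pinned in /etc/fstab like this (the UUID and mount point are placeholders):

# EXT4 data volume for PostgreSQL: skip atime updates, keep the default ordered journaling
UUID=<your-volume-uuid>  /var/lib/postgresql  ext4  noatime,data=ordered  0 2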

Quick Commands

# Check current filesystem
df -T /

# Create XFS filesystem (recommended for servers)
mkfs.xfs -f /dev/sdb1

# Mount with performance options
mount -o noatime,nodiratime,discard /dev/sdb1 /data

# Check filesystem-specific features
tune2fs -l /dev/sda1 | grep features   # EXT4
xfs_info /dev/sdb1                      # XFS
btrfs filesystem show /data             # Btrfs
zpool status                            # ZFS

Summary for Senior Engineers

  • Filenames are just pointers: A file can have multiple names (Hard Links) pointing to the same Inode.
  • A cold Dentry Cache is the bottleneck for cold-start, lookup-heavy workloads (e.g., the first npm install after boot).
  • Extents and Delayed Allocation are the primary defenses against disk fragmentation.
  • Journaling protects the filesystem metadata, but not necessarily your data (unless data=journal mode is used).
Next: I/O Systems & Modern I/O