Virtual File System (VFS)
The File System is the primary abstraction for long-term storage. A “Senior” understanding requires knowing how the kernel provides a unified interface across hundreds of different disk formats and how it optimizes the expensive process of finding a file on disk.
0. Files, Inodes, and Paths: The Basics
Before diving into VFS internals, make sure these fundamentals are crystal clear.
What is a File?
A file is an abstraction the OS provides over raw disk blocks. To the user, a file is:
- A name (like `report.txt`).
- A sequence of bytes (the content).
- Some metadata (size, permissions, timestamps).
What is an Inode?
An Inode (Index Node) is the kernel’s internal representation of a file. It contains:
- Metadata: size, owner (UID/GID), permissions, timestamps (atime, mtime, ctime).
- Block pointers: Where on the disk the file’s data actually lives.
- No filename: The inode does not store the file’s name.
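The inode fields listed above are what `stat()` reports. A minimal sketch, assuming a POSIX system; the helper name `describe_inode` and the file name `report.txt` are only illustrative:

```python
import os
import stat
import tempfile

def describe_inode(path):
    """Return a few inode fields for path, as reported by stat()."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "size": st.st_size,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),  # permission string such as -rw-r--r--
        "links": st.st_nlink,
        "mtime": st.st_mtime,
    }

# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.txt")
    with open(path, "w") as f:
        f.write("hello")
    info = describe_inode(path)
    print(info)
```

Note that nothing in the returned dictionary is the filename: the name was only used to find the inode.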
What is a Directory?
A directory is a special file that maps names → inode numbers. When you `ls` a directory, the kernel:
- Reads the directory’s data blocks.
- Lists all `(name, inode)` pairs inside.
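The name → inode mapping can be observed from user space. A small sketch using `os.scandir`; the helper name `list_directory` and the file names are made up for the demo:

```python
import os
import tempfile

def list_directory(path):
    """Return {name: inode_number} for the entries of a directory."""
    with os.scandir(path) as entries:
        return {entry.name: entry.inode() for entry in entries}

with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt"):
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "sub"))
    listing = list_directory(tmp)
    print(listing)
```

Subdirectories appear in the mapping like any other entry, since a directory is itself a file with its own inode.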
Hard Links and Soft Links
- Hard link: Another name pointing to the same inode. Both names are equal; deleting one doesn’t delete the data until the link count reaches zero.
- Soft link (symlink): A file whose content is the path to another file. It can break if the target is deleted.
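A short experiment makes the difference concrete. This sketch assumes a POSIX system where unprivileged symlinks are allowed; the file names are illustrative:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.txt")
    hard = os.path.join(tmp, "hard.txt")
    soft = os.path.join(tmp, "soft.txt")
    with open(original, "w") as f:
        f.write("data")

    os.link(original, hard)      # second name for the same inode
    os.symlink(original, soft)   # separate inode whose content is a path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    links_before = os.stat(original).st_nlink

    os.unlink(original)          # remove one of the two names
    with open(hard) as f:
        hard_content = f.read()  # the data is still reachable via the other name
    links_after = os.stat(hard).st_nlink
    symlink_broken = os.path.islink(soft) and not os.path.exists(soft)
```

After the `unlink`, the hard link still reads the data while the symlink dangles, because it stored only the path of the removed name.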
Path Resolution
When you `open("/home/user/file.txt", ...)`:
- Kernel starts at the root inode (`/`).
- Looks up `home` in `/`’s directory → gets inode for `/home`.
- Looks up `user` in `/home` → gets inode for `/home/user`.
- Looks up `file.txt` in `/home/user` → gets inode for the file.
- Returns a file descriptor referencing that inode.
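The steps above can be sketched as a loop over path components. This is a toy model, not kernel code: the `DIRECTORIES` table and the inode numbers are invented, and symlinks, mount points, and permissions are ignored.

```python
# Toy "disk": inode number -> {name: child inode number} for directories.
ROOT_INO = 2  # hypothetical root inode number
DIRECTORIES = {
    2: {"home": 11},
    11: {"user": 12},
    12: {"file.txt": 13},
}

def resolve(path):
    """Walk an absolute path one component at a time; return the final inode number."""
    ino = ROOT_INO
    for name in path.split("/"):
        if not name:  # skip empty parts from leading or doubled slashes
            continue
        entries = DIRECTORIES.get(ino)
        if entries is None:
            raise NotADirectoryError(path)
        if name not in entries:
            raise FileNotFoundError(path)
        ino = entries[name]
    return ino

print(resolve("/home/user/file.txt"))
```

Each iteration corresponds to one directory lookup, which is why deep paths are more expensive to resolve than shallow ones.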
1. The VFS Architecture
The Virtual File System (VFS) is a software layer in the kernel that provides the standard file-related system calls (open, read, write) to user-space, regardless of the underlying filesystem (EXT4, XFS, NFS, ProcFS).
The Four Primary Objects
- Superblock: Represents an entire mounted filesystem. It contains metadata like the block size, total number of inodes, and the “magic number.”
- Inode (Index Node): Represents a specific file (or directory) on disk. It contains all metadata (size, permissions, timestamps, block pointers) but not the filename.
- Dentry (Directory Entry): Connects an Inode to a filename. Dentries are transient objects created in memory to speed up path lookups.
- File: Represents one open instance of a file, created by `open()` and shared by any descriptors duplicated from it via `dup()` or `fork()`. It stores the current “cursor” position (`f_pos`) and the access mode (Read/Write).
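The split between the Inode and the File object is visible from user space: the cursor belongs to the open file, not to the inode. A minimal sketch, assuming POSIX semantics for `os.open` and `os.dup`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"abcdefgh")

    fd_a = os.open(path, os.O_RDONLY)  # first open file object
    fd_b = os.open(path, os.O_RDONLY)  # second, independent open file object
    fd_c = os.dup(fd_a)                # new descriptor for the same object as fd_a

    first = os.read(fd_a, 3)           # advances the cursor shared by fd_a and fd_c
    independent = os.read(fd_b, 3)     # fd_b has its own cursor, still at offset 0
    shared = os.read(fd_c, 3)          # continues where fd_a stopped

    for fd in (fd_a, fd_b, fd_c):
        os.close(fd)
```

Two separate `open()` calls on the same inode get independent cursors, while a duplicated descriptor shares one.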
2. Path Lookup: The “Walk”
Converting a string like `/home/user/code/main.c` into an Inode is one of the most performance-critical paths in the kernel.
The Lookup Process
- Split the path: Divide by `/`.
- Dentry Cache (dcache) Lookup: The kernel first checks the memory-resident Dcache. If the dentry for `home` is found, it moves to the next part.
- Directory Traversal: If not in cache, the kernel must read the directory’s data blocks from disk to find the mapping from the filename to an Inode number.
- Inode Cache: Once the Inode number is found, the kernel checks the Inode cache or reads it from the Inode table on disk.
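The effect of the dcache can be illustrated with a toy model that counts how often the walk has to fall through to "disk". The class `ToyLookup`, its fields, and the inode numbers are all hypothetical; the real dcache is a hash table of `struct dentry` objects and also caches failed lookups, which this sketch omits.

```python
class ToyLookup:
    """Path walk with a dentry cache keyed by (parent inode, name)."""

    def __init__(self, directories, root_ino):
        self.directories = directories  # stand-in for on-disk directory blocks
        self.root_ino = root_ino
        self.dcache = {}                # (parent_ino, name) -> child_ino
        self.disk_reads = 0

    def lookup(self, path):
        ino = self.root_ino
        for name in filter(None, path.split("/")):
            key = (ino, name)
            if key not in self.dcache:  # cache miss: read the directory from disk
                self.disk_reads += 1
                self.dcache[key] = self.directories[ino][name]
            ino = self.dcache[key]
        return ino

fs = ToyLookup({1: {"home": 2}, 2: {"user": 3}, 3: {"code": 4}, 4: {"main.c": 5}}, root_ino=1)
cold = fs.lookup("/home/user/code/main.c")
reads_after_cold = fs.disk_reads
warm = fs.lookup("/home/user/code/main.c")
reads_after_warm = fs.disk_reads
```

The first walk pays one directory read per component; the repeated walk is served entirely from the cache.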
Optimization: RCU-Walk
To handle thousands of concurrent lookups (e.g., on a web server), modern Linux uses RCU-walk. It performs the path walk without taking locks or reference counts on the dentries it passes through, and uses sequence counters to detect a concurrent change such as a rename. If a change is detected, it falls back to the slower “Ref-walk” with proper locking.
3. On-Disk Layout & Allocation
How data is actually stored on the physical platters or NAND cells.
Block Allocation
- Extents: Instead of a list of block numbers (which is inefficient for large files), modern filesystems (EXT4, XFS) use extents: a starting block number and a length (e.g., “1001 blocks starting at block 1000,” which covers blocks 1000 through 2000).
- Delayed Allocation: The kernel buffers writes in memory and waits as long as possible before choosing physical blocks on disk. This allows the allocator to find a single contiguous extent for the entire file, reducing fragmentation.
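To see why extents are compact, consider how a file block number is translated to a disk block. The sketch below is a simplified model: the `Extent` tuple and its field names are assumptions for illustration, not the EXT4 or XFS on-disk format.

```python
from bisect import bisect_right
from typing import NamedTuple

class Extent(NamedTuple):
    logical_start: int   # first file block covered by this extent
    physical_start: int  # first disk block of the contiguous run
    length: int          # number of contiguous blocks

def logical_to_physical(extents, logical_block):
    """Map a file block number to a disk block; extents must be sorted by logical_start."""
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i >= 0:
        e = extents[i]
        if logical_block < e.logical_start + e.length:
            return e.physical_start + (logical_block - e.logical_start)
    return None  # hole in the file, or past its end

# Two extents describe 1011 blocks; a block list would need 1011 entries.
extents = [Extent(0, 1000, 1001), Extent(1001, 5000, 10)]
print(logical_to_physical(extents, 1000))
```

The fewer extents a file has, the cheaper this lookup is, which is what delayed allocation tries to achieve by picking one large contiguous run.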
Journaling: Atomic Operations
Writing to a file involves multiple steps: updating the bitmap, the Inode, and the data blocks. If power is lost mid-way, the filesystem can be left inconsistent.
- The Journal: A dedicated circular buffer on disk. The kernel first writes the intended changes to the journal. Once the journal write is “committed,” it updates the actual filesystem. On reboot after a crash, the kernel replays the committed journal entries to restore consistency.
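The log-then-commit-then-replay sequence can be sketched with an in-memory model. Everything here is illustrative: `ToyJournal`, its record layout, and the change strings are invented, and a real journal also has to order the writes on the device.

```python
class ToyJournal:
    """Key/value "disk" protected by a write-ahead journal."""

    def __init__(self):
        self.disk = {}     # the "real" filesystem state
        self.journal = []  # pending records, in write order

    def log(self, txn_id, changes):
        """Step 1: record the intended changes in the journal."""
        self.journal.append({"id": txn_id, "changes": dict(changes), "committed": False})

    def commit(self, txn_id):
        """Step 2: mark the journal record as committed."""
        for record in self.journal:
            if record["id"] == txn_id:
                record["committed"] = True

    def recover(self):
        """After a crash: replay committed records, discard the rest."""
        for record in self.journal:
            if record["committed"]:
                self.disk.update(record["changes"])
        self.journal.clear()

store = ToyJournal()
store.log(1, {"bitmap": "block 7 used", "inode 12": "size=4096"})
store.commit(1)
store.log(2, {"inode 13": "size=100"})  # simulated crash before this commit
store.recover()
```

Transaction 1 is applied as a whole and transaction 2 disappears as a whole, so the "disk" never holds a half-finished update.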
4. Log-Structured File Systems (LFS)
Used on flash storage and SSDs (e.g., F2FS).
- Philosophy: Never overwrite data. Treat the entire disk as a circular log.
- Benefit: Converts random writes into sequential writes, which is significantly faster for NAND and reduces wear.
- Trade-off: Requires a Garbage Collector to reclaim space from “dead” versions of files.
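A toy append-only store shows both the benefit and the cost. `ToyLog` and its methods are hypothetical names; real log-structured filesystems work on segments of blocks, not key/value records.

```python
class ToyLog:
    """Append-only store: updates never overwrite, so old versions become dead."""

    def __init__(self):
        self.log = []    # (key, value) records in write order
        self.index = {}  # key -> position of the live record in the log

    def write(self, key, value):
        self.log.append((key, value))        # always a sequential append
        self.index[key] = len(self.log) - 1  # any older record for key is now dead

    def read(self, key):
        return self.log[self.index[key]][1]

    def garbage_collect(self):
        """Copy live records into a fresh log; return how many dead ones were dropped."""
        by_position = sorted(self.index.items(), key=lambda item: item[1])
        live = [(key, self.log[pos][1]) for key, pos in by_position]
        dropped = len(self.log) - len(live)
        self.log = live
        self.index = {key: i for i, (key, _) in enumerate(live)}
        return dropped

t = ToyLog()
t.write("a", 1)
t.write("b", 2)
t.write("a", 3)  # the first record for "a" is now dead
dropped = t.garbage_collect()
```

Writes stay sequential no matter which key is updated, but space is only reclaimed when the collector copies the live records forward.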
5. Page Cache & Writeback
The filesystem does not talk directly to the disk for every `write()`.
- Write to Page Cache: The `write()` syscall simply copies data into kernel memory (Pages). The page is marked as Dirty.
- Flushing (Writeback): Background kernel writeback workers (tracked per device by `struct bdi_writeback`) periodically write dirty pages to disk.
- Direct I/O: Applications (like Databases) can use `O_DIRECT` to bypass the page cache and manage their own buffers to avoid “Double Buffering.”
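Because a successful `write()` only means the data reached the page cache, an application that needs durability has to ask for writeback explicitly. A minimal sketch; the helper name `durable_write` is an assumption, while `os.fsync` is the standard call:

```python
import os
import tempfile

def durable_write(path, data):
    """Write bytes and block until the kernel reports them flushed to the device."""
    with open(path, "wb") as f:
        f.write(data)         # lands in Python's user-space buffer
        f.flush()             # hands the data to the kernel; its pages are now dirty
        os.fsync(f.fileno())  # forces writeback of this file's dirty pages

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "wal.log")
    durable_write(target, b"commit 42\n")
    with open(target, "rb") as f:
        read_back = f.read()
```

This sketch syncs only the file itself. Making a newly created name durable may also require an `fsync` on the containing directory.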
6. Virtual Filesystems (procfs, sysfs)
Not all filesystems represent disks.
- ProcFS (`/proc`): Exposes kernel data structures (processes, memory stats) as files.
- SysFS (`/sys`): Exposes the hardware device tree and driver configurations.
- TmpFS: A filesystem whose contents live in memory (page cache, and swap under memory pressure) rather than on a disk.
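Since these filesystems go through the same VFS calls, kernel state can be read with ordinary file I/O. The sketch below parses the `Key: value kB` layout used by `/proc/meminfo`; the function names are illustrative and the numbers in `SAMPLE` are made up so the parser can be tried without a Linux host.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style lines into {key: integer value}."""
    result = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            result[key.strip()] = int(fields[0])
    return result

def read_meminfo(path="/proc/meminfo"):
    """Read the live values; only works where procfs is mounted."""
    with open(path) as f:
        return parse_meminfo(f.read())

# Made-up sample in the same layout as the real file.
SAMPLE = "MemTotal:       16384000 kB\nMemFree:         1024000 kB\nHugePages_Total:       0\n"
parsed = parse_meminfo(SAMPLE)
```

On a Linux machine, `read_meminfo()` returns the current values, and no disk I/O is involved because the "file" is generated by the kernel on each read.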
7. Choosing a Filesystem in Production
Selecting the right filesystem for your workload is a key architectural decision. Here’s a practical guide.
Comparison Table
| Filesystem | Best For | Avoid When | Journaling | Max File Size |
|---|---|---|---|---|
| EXT4 | General Linux, boot partitions | Very large files (>16TB) | Metadata or Data | 16 TB |
| XFS | Large files, high parallelism | Lots of small files, shrinking | Metadata | 8 EB |
| Btrfs | Snapshots, compression, dev | Production databases (still maturing) | Copy-on-Write | 16 EB |
| ZFS | Data integrity, NAS, backups | Low-memory systems (less than 8GB) | Copy-on-Write | 16 EB |
| F2FS | SSDs, flash storage | HDDs | Log-structured | 3.94 TB (16 TB max volume) |
Production Checklist
- Database workloads (MySQL, PostgreSQL): XFS or EXT4 mounted with `noatime` (plus `data=ordered` on EXT4, which is its default journaling mode).
- Container hosts (Docker, Kubernetes): XFS for overlay2 driver, or Btrfs for native driver.
- Backup servers: ZFS (checksums detect silent corruption) or Btrfs (snapshots).
- High-throughput analytics: XFS (better large file and parallel write performance).
- Embedded/Flash: F2FS (reduces write amplification).
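To check a host against this checklist, the filesystem type and mount options for a path can be read from `/proc/mounts`. A rough sketch, assuming Linux path conventions: the helper names and the `SAMPLE` entries are invented, and mount points containing spaces (escaped as `\040` in the real file) are not handled.

```python
import os

def parse_mounts(text):
    """Parse /proc/mounts-style lines into (device, mount_point, fs_type, options) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            device, mount_point, fs_type, options = fields[:4]
            mounts.append((device, mount_point, fs_type, options.split(",")))
    return mounts

def filesystem_for(path, mounts):
    """Return the entry whose mount point is the longest prefix of path."""
    path = os.path.abspath(path)
    best = None
    for entry in mounts:
        mount_point = entry[1]
        if path == mount_point or path.startswith(mount_point.rstrip("/") + "/"):
            if best is None or len(mount_point) > len(best[1]):
                best = entry
    return best

# Invented sample in the same layout as the real file.
SAMPLE = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /var/lib/mysql xfs rw,noatime 0 0\n"
)
mounts = parse_mounts(SAMPLE)
db_entry = filesystem_for("/var/lib/mysql/ibdata1", mounts)
```

On a real host, pass the contents of `/proc/mounts` to `parse_mounts` instead of `SAMPLE`.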
Summary for Senior Engineers
- Filenames are just pointers: A file can have multiple names (Hard Links) pointing to the same Inode.
- A cold Dentry Cache is a common bottleneck for cold-start performance on workloads that touch many small files (e.g., `npm install`), because every uncached path component costs a directory read from disk.
- Extents and Delayed Allocation are the primary defenses against disk fragmentation.
- Journaling protects the filesystem metadata, but not necessarily your data (unless `data=journal` mode is used).