Filesystem & VFS
The Virtual File System (VFS) is Linux's abstraction layer that provides a unified interface to all filesystem types. Understanding VFS is crucial for debugging I/O issues and designing storage-aware systems.

Prerequisites: System calls, process fundamentals
Interview Focus: File descriptors, VFS architecture, I/O paths, page cache
Time to Master: 4-5 hours
VFS Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ VFS ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ User Space │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Application ││
│ │ open(), read(), write(), close(), stat(), mmap() ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │ │
│ ════════════════════════════│═══════════════════════════════════ │
│ │ System Call Interface │
│ ════════════════════════════│═══════════════════════════════════ │
│ ▼ │
│ Kernel Space │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Virtual File System (VFS) ││
│ │ ││
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ││
│ │ │ File Objects │ │Dentry Cache │ │ Inode Cache │ ││
│ │ │(open files) │ │(path lookup) │ │(file metadata)│ ││
│ │ └──────────────┘ └──────────────┘ └──────────────┘ ││
│ │ ││
│ │ ┌────────────────────────────────────────────────────────────┐ ││
│ │ │ VFS Operations (file_operations) │ ││
│ │ └────────────────────────────────────────────────────────────┘ ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │ │
│ ┌─────────────┬─────────────┼─────────────┬─────────────┐ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌─────┐ ┌─────┐ ┌──────┐ ┌──────┐ ┌───────┐ │
│ │ ext4 │ │ XFS │ │ NFS │ │ proc │ │ sysfs │ │
│ └─────┘ └─────┘ └──────┘ └──────┘ └───────┘ │
│ │ │
│ ════════════════════════════│═══════════════════════════════════ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Block Layer / Network Stack ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────────┘
Core VFS Data Structures
The Four Pillars
- superblock
- inode
- dentry
- file
superblock: Represents a mounted filesystem.
struct super_block {
    struct list_head s_list;               // List of all superblocks
    dev_t s_dev;                           // Device identifier
    unsigned long s_blocksize;             // Block size in bytes
    unsigned char s_blocksize_bits;
    loff_t s_maxbytes;                     // Max file size
    struct file_system_type *s_type;       // Filesystem type
    const struct super_operations *s_op;   // Operations
    struct dentry *s_root;                 // Root dentry
    struct list_head s_inodes;             // All inodes
    struct list_head s_dirty;              // Dirty inodes
    // ... more fields
};
Key operations:
- alloc_inode(): Allocate inode
- destroy_inode(): Free inode
- write_inode(): Write inode to disk
- sync_fs(): Sync filesystem
inode: Represents a file's metadata.

Key insight: An inode exists once per file, regardless of how many processes have it open.
struct inode {
    umode_t i_mode;                        // Permissions + type
    uid_t i_uid;                           // Owner UID
    gid_t i_gid;                           // Owner GID
    unsigned long i_ino;                   // Inode number
    loff_t i_size;                         // File size
    struct timespec64 i_atime;             // Access time
    struct timespec64 i_mtime;             // Modify time
    struct timespec64 i_ctime;             // Change time
    unsigned int i_nlink;                  // Hard link count
    blkcnt_t i_blocks;                     // Blocks allocated
    union {
        struct block_device *i_bdev;       // Block device
        struct cdev *i_cdev;               // Char device
    };
    const struct inode_operations *i_op;
    const struct file_operations *i_fop;
    struct address_space *i_mapping;       // Page cache
    // ... more fields
};
dentry: Represents a directory entry (a single path component).
struct dentry {
    struct qstr d_name;                    // Filename
    struct inode *d_inode;                 // Associated inode
    struct dentry *d_parent;               // Parent directory
    struct hlist_node d_hash;              // Lookup hash
    struct list_head d_lru;                // LRU list
    struct list_head d_child;              // Parent's children
    struct list_head d_subdirs;            // Our children
    const struct dentry_operations *d_op;
    struct super_block *d_sb;              // Superblock
    // ... more fields
};
The dentry cache (dcache):
- Caches pathname lookups
- Hugely important for performance
- Negative dentries cache "file not found" results
file: Represents an open file:

struct file {
    struct path f_path;                    // dentry + vfsmount
    struct inode *f_inode;                 // Cached inode
    const struct file_operations *f_op;    // Operations
    spinlock_t f_lock;                     // Lock
    atomic_long_t f_count;                 // Reference count
    unsigned int f_flags;                  // O_RDONLY, O_NONBLOCK, etc.
    fmode_t f_mode;                        // FMODE_READ, FMODE_WRITE
    loff_t f_pos;                          // Current position
    struct address_space *f_mapping;       // Page cache
    void *private_data;                    // Filesystem private
    // ... more fields
};

Key insight: Each open() creates a new struct file, even for the same inode.

File Descriptors
File Descriptor Table
┌─────────────────────────────────────────────────────────────────────┐
│ FILE DESCRIPTOR ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Process A Process B │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ task_struct │ │ task_struct │ │
│ │ └─files ──────────────────└─files ─────┐ │ │
│ └────────────────────┘ └─────────────│──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ files_struct (A) │ │ files_struct (B) │ │
│ │ ┌───────────────┐ │ │ ┌───────────────┐ │ │
│ │ │ fd_array │ │ │ │ fd_array │ │ │
│ │ │ [0]────────────────┐ │ │ [0]──────────────┐│ │
│ │ │ [1]─────────────┐ │ │ │ [1]──────────┐ │ ││ │
│ │ │ [2]──────┐ │ │ │ │ │ [2]────┐ │ │ ││ │
│ │ │ [3]────┐ │ │ │ │ │ │ [3]──┐ │ │ │ ││ │
│ │ └────────│─│────│─│──│ │ └──────│─│───│─│─││ │
│ └──────────│─│────│─│──│ └────────│─│───│─│─││ │
│ │ │ │ │ │ │ │ │ │ ││ │
│ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ struct file instances │ │
│ │ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │file A │ │file B │ │file C │ │file D │ │ │
│ │ │f_pos=0 │ │f_pos=100│ │f_pos=50│ │f_pos=0 │ │ │
│ │ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │ │
│ │ │ │ │ │ │ │
│ └──────│───────────│───────────│───────────│──────────────────┘ │
│ └───────────┴───────────┴───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────┐ │
│ │ struct inode │ │
│ │ (single instance) │ │
│ └───────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
File Descriptor Limits
# Per-process soft/hard limits
ulimit -n # Current limit (soft)
ulimit -Hn # Hard limit
# System-wide limit
cat /proc/sys/fs/file-max
# Current system-wide usage
cat /proc/sys/fs/file-nr
# allocated free maximum
# 9024 0 9223372036854775807
# Per-process FD usage
ls /proc/<pid>/fd | wc -l
# Which files does a process have open?
lsof -p <pid>
ls -la /proc/<pid>/fd/
File Descriptor Inheritance
// Fork: child inherits FDs
pid_t pid = fork();
// Both parent and child share the same struct file
// Changes to f_pos are visible to both!
// After exec: FDs preserved unless O_CLOEXEC
int fd = open("/tmp/file", O_RDONLY | O_CLOEXEC);
// fd will be closed on exec()
// Dup: creates new FD pointing to same file
int fd2 = dup(fd); // fd2 shares f_pos with fd
int fd3 = dup2(fd, 5); // fd3 = 5, points to same file
Path Lookup
The namei() Journey
┌─────────────────────────────────────────────────────────────────────┐
│ PATH RESOLUTION: /home/user/file.txt │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Start at root dentry (or cwd for relative paths) │
│ Current: / │
│ └─ Check dcache for "home" │
│ └─ MISS: Read directory from disk │
│ └─ Find inode number for "home" │
│ └─ Create dentry, add to dcache │
│ │
│ 2. Current: /home │
│ └─ Check dcache for "user" │
│ └─ HIT: dentry found in cache │
│ └─ Check permissions: can we enter? │
│ └─ Handle mount points (if /home/user is mounted) │
│ │
│ 3. Current: /home/user │
│ └─ Check dcache for "file.txt" │
│ └─ MISS: Read directory from disk │
│ └─ Find inode number for "file.txt" │
│ └─ Create dentry, add to dcache │
│ │
│ 4. Final: /home/user/file.txt │
│ └─ Return dentry pointing to file's inode │
│ │
│ Permissions checked at EVERY step! │
│ │
└─────────────────────────────────────────────────────────────────────┘
Symlink Resolution
┌─────────────────────────────────────────────────────────────────────┐
│ SYMLINK RESOLUTION │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ /usr/local/bin/python → /usr/bin/python3.11 │
│ │
│ 1. Resolve /usr/local/bin/python │
│ 2. Find it's a symlink (inode type = S_IFLNK) │
│ 3. Read link target: /usr/bin/python3.11 │
│ 4. Start new resolution from / (absolute) or current (relative) │
│ 5. Resolve /usr/bin/python3.11 │
│ 6. Return final inode │
│ │
│ Limits: │
│ - Max 40 symlink levels (MAXSYMLINKS) │
│ - Prevents infinite loops: A → B → A │
│ │
│ O_NOFOLLOW: Don't follow final symlink │
│ O_PATH: Return fd without opening file │
│ │
└─────────────────────────────────────────────────────────────────────┘
Page Cache
Page Cache Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ PAGE CACHE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Application │
│ │ read(fd, buf, 4096) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Page Cache ││
│ │ ││
│ │ struct address_space (per inode) ││
│ │ ┌────────────────────────────────────────────────────────────┐ ││
│ │ │ Radix tree of pages │ ││
│ │ │ │ ││
│ │ │ Offset: 0 4K 8K 12K 16K 20K 24K 28K │ ││
│ │ │ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ │ ││
│ │ │ Pages: │██│ │██│ │░░│ │██│ │░░│ │██│ │██│ │░░│ │ ││
│ │ │ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ │ ││
│ │ │ │ ││
│ │ │ ██ = Page in cache ░░ = Not cached │ ││
│ │ └────────────────────────────────────────────────────────────┘ ││
│ │ ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │ │
│ │ Page fault (cache miss) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Block Layer ││
│ │ Read from disk ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────────┘
Page Cache Operations
# View page cache usage
cat /proc/meminfo | grep -E "^(Cached|Buffers|Dirty|Writeback)"
# Drop caches (for testing only!)
echo 1 > /proc/sys/vm/drop_caches # Page cache
echo 2 > /proc/sys/vm/drop_caches # Dentries + inodes
echo 3 > /proc/sys/vm/drop_caches # Both
# Check if file is cached
vmtouch /path/to/file
# Files: 1
# Directories: 0
# Resident Pages: 256/256 100%
# Elapsed: 0.000123 seconds
# Cache a file
vmtouch -t /path/to/file
# Evict file from cache
vmtouch -e /path/to/file
# Check page cache hit ratio
perf stat -e cache-references,cache-misses ./my_program
Read-Ahead
# Check read-ahead setting (in KB)
cat /sys/block/sda/queue/read_ahead_kb
# Default: 128 (128 KB)
# Increase for sequential workloads
echo 256 > /sys/block/sda/queue/read_ahead_kb
# Application-level hints (these are C calls, see posix_fadvise(2)):
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); # Enable read-ahead
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); # Disable read-ahead
posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED); # Pre-fetch
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); # Evict from cache
Write Paths
Buffered vs Direct I/O
┌─────────────────────────────────────────────────────────────────────┐
│ WRITE PATHS COMPARISON │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ BUFFERED WRITE DIRECT WRITE (O_DIRECT) │
│ ─────────────── ───────────────────── │
│ │
│ Application Application │
│ │ write(fd, buf, 4096) │ write(fd, buf, 4096) │
│ ▼ ▼ │
│ ┌────────────┐ (no page cache) │
│ │ Page Cache │ │ │
│ │ (dirty) │ │ │
│ └────────────┘ │ │
│ │ │ │
│ │ (background flush) │ (immediate) │
│ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Block │ │ Block │ │
│ │ Layer │ │ Layer │ │
│ └────────────┘ └────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Disk │ │ Disk │ │
│ └────────────┘ └────────────┘ │
│ │
│ Pros: Pros: │
│ - Fast returns - Predictable latency │
│ - Coalescing - App controls caching │
│ - Read cache - No double buffering │
│ │
│ Cons: Cons: │
│ - Data loss on crash - Slower for small I/O │
│ - Memory overhead - Alignment requirements │
│ │
└─────────────────────────────────────────────────────────────────────┘
Write-Back and Dirty Pages
# View dirty pages and write-back status
cat /proc/meminfo | grep -E "^(Dirty|Writeback):"
# Tune write-back behavior
# Start write-back when dirty pages exceed this %
cat /proc/sys/vm/dirty_background_ratio # Default: 10%
# Process blocks when dirty pages exceed this %
cat /proc/sys/vm/dirty_ratio # Default: 20%
# Age at which data is old enough to be written (centiseconds)
cat /proc/sys/vm/dirty_expire_centisecs # Default: 3000 (30 seconds)
# Interval between write-back daemon wake-ups (centiseconds)
cat /proc/sys/vm/dirty_writeback_centisecs # Default: 500 (5 seconds)
# Force sync
sync                  # Shell: sync all filesystems
# From C:
#   syncfs(fd)      - sync the filesystem containing fd
#   fsync(fd)       - sync file data + metadata
#   fdatasync(fd)   - sync file data (+ only the metadata needed to read it back)
Filesystem Types
Virtual Filesystems
# procfs - Process information
/proc/
├── <pid>/ # Per-process info
│ ├── cmdline # Command line
│ ├── environ # Environment
│ ├── fd/ # File descriptors
│ ├── maps # Memory mappings
│ ├── stat # Process status
│ └── status # Readable status
├── cpuinfo # CPU information
├── meminfo # Memory information
└── sys/ # Kernel parameters (sysctl)
# sysfs - Device/driver information
/sys/
├── block/ # Block devices
├── bus/ # Bus types
├── class/ # Device classes
├── devices/ # Device hierarchy
├── fs/ # Filesystem info
│ └── cgroup/ # Cgroup controllers
└── kernel/ # Kernel settings
# tmpfs - RAM-based filesystem
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk
# Used for /tmp, /run, /dev/shm
Disk Filesystems
| Filesystem | Use Case | Max File Size | Features |
|---|---|---|---|
| ext4 | General purpose | 16 TiB | Journaling, extents |
| XFS | Large files | 8 EiB | Parallel I/O, reflinks |
| Btrfs | Advanced features | 16 EiB | Snapshots, checksums |
| ZFS | Enterprise | 16 EiB | Pools, RAID, snapshots |
Mount Namespaces and Bind Mounts
# View mount points
mount
cat /proc/mounts
findmnt
# Bind mount: Mount a directory in another location
mount --bind /source /target
# Recursive bind mount
mount --rbind /source /target
# Make mount private (don't propagate mount events)
mount --make-private /target
# Make mount shared (propagate mount events to peers)
mount --make-shared /target
# Mount namespace: Isolated mount table
unshare --mount /bin/bash
# Mounts in this shell don't affect parent
# Containers use mount namespaces extensively
# Each container has its own root filesystem
Interview Deep Dives
Q: Explain what happens when you run 'cat file.txt'
Complete flow:

1. Process creation: Shell forks, child execs /bin/cat
2. open() syscall:
   - Path lookup via VFS (dcache, namei)
   - Permission check (DAC + MAC)
   - Allocate struct file
   - Allocate file descriptor
   - Return fd
3. read() syscall:
   - fd → struct file → inode → address_space
   - Check page cache for requested pages
   - Cache miss: Submit I/O to block layer
   - I/O completion: Copy to page cache
   - Copy from page cache to user buffer
   - Return bytes read
4. write() to stdout:
   - fd 1 → terminal device
   - TTY layer processes output
   - Display on screen
5. close() + exit:
   - Decrement file refcount
   - Free fd in table
   - Process exits
Q: What's the difference between fsync and fdatasync?
fsync():
- Flushes file data AND metadata to disk
- Metadata: size, mtime, block pointers
- Two writes: data blocks + inode
- Required for crash consistency

fdatasync():
- Flushes file data to disk
- Only flushes metadata if needed for data access (e.g., the new file size)
- Skips non-essential metadata (atime, mtime)
- Faster for append-only patterns
// fdatasync: Appending to file
write(fd, data, len);
fdatasync(fd); // Data + size update
// fsync: After rename
int fd = open("file.tmp", O_WRONLY);
write(fd, data, len);
fsync(fd); // Ensure data is on disk
close(fd);
rename("file.tmp", "file"); // Atomic replace
int dir_fd = open(".", O_DIRECTORY);
fsync(dir_fd); // Ensure directory is updated
Q: How does the kernel prevent file descriptor leaks?
Kernel mechanisms:
1. RLIMIT_NOFILE: Per-process limit on open fds (check with ulimit -n)
2. O_CLOEXEC: Close the fd automatically on exec
   int fd = open(path, O_RDONLY | O_CLOEXEC);
3. Process exit: All fds automatically closed

Application-side practices:
- Use RAII (C++/Rust): fd closed when object destroyed
- Close fds in error paths
- Use valgrind/lsof to detect leaks
# Find fd leaks
watch -n1 'ls /proc/$(pidof myapp)/fd | wc -l'
lsof -p $(pidof myapp) | tail -20
Q: Why is 'ls' slow on directories with many files?
Causes:
- Directory reading: getdents() syscall reads directory entries
  - Linear scan of directory file
  - ext4 uses htree for lookup, but listing is still O(n)
- stat() per file: ls -l stats every file
  - Each stat is a separate syscall
  - May require inode read from disk
- Sorting: ls sorts output
  - O(n log n) in memory
# Use ls -f (no sorting, includes . and ..)
ls -f
# Use ls -U (no sorting)
ls -U
# Avoid -l if not needed (no stat per file)
ls
# For really large directories
find . -maxdepth 1 -print0 | head -c 10000
Architectural solutions:
- Don't put millions of files in one directory
- Use directory sharding: files/ab/cd/file.txt
- Use a filesystem with better large-directory support (XFS)
Performance Monitoring
# VFS cache statistics
cat /proc/slabinfo | grep -E "dentry|inode"
# File system operations tracing
sudo bpftrace -e '
kprobe:vfs_read { @reads = count(); }
kprobe:vfs_write { @writes = count(); }
interval:s:1 { print(@reads); print(@writes); clear(@reads); clear(@writes); }
'
# Page cache hit ratio
sudo cachestat-bpfcc 1
# File I/O latency
sudo fileslower-bpfcc 10 # Show I/O > 10ms
# Open files by process
sudo opensnoop-bpfcc
# Watch for slow path lookups
sudo bpftrace -e '
kprobe:path_lookupat {
@start[tid] = nsecs;
}
kretprobe:path_lookupat /@start[tid]/ {
$lat = (nsecs - @start[tid]) / 1000;
if ($lat > 1000) {
printf("slow lookup: %d us\n", $lat);
}
delete(@start[tid]);
}
'
Next: I/O Subsystem →