Virtual File System (VFS)
The File System is the primary abstraction for long-term storage. Think of VFS as a universal power adapter: your laptop (user-space application) has one plug (open, read, write), and the adapter (VFS) lets it work with any outlet in the world (EXT4, XFS, NFS, procfs, tmpfs). Without VFS, every application would need to know the specifics of every filesystem format — an impossibility.
A senior understanding requires knowing how the kernel provides this unified interface across hundreds of different disk formats and how it optimizes the expensive process of finding a file on disk.
0. Files, Inodes, and Paths: The Basics
Before diving into VFS internals, make sure these fundamentals are crystal clear.
What is a File?
A file is an abstraction the OS provides over raw disk blocks. To the user, a file is:
- A name (like report.txt).
- A sequence of bytes (the content).
- Some metadata (size, permissions, timestamps).
What is an Inode?
An Inode (Index Node) is the kernel’s internal representation of a file. It contains:
- Metadata: size, owner (UID/GID), permissions, timestamps (atime, mtime, ctime).
- Block pointers: Where on the disk the file’s data actually lives.
- No filename: The inode does not store the file’s name.
What is a Directory?
A directory is a special file that maps names → inode numbers. When you ls a directory, the kernel:
- Reads the directory’s data blocks.
- Lists all (name, inode) pairs inside.
Hard Links and Soft Links
- Hard link: Another name pointing to the same inode. Both names are equal; deleting one doesn’t delete the data until the link count reaches zero.
- Soft link (symlink): A file whose content is the path to another file. It can break if the target is deleted.
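The difference between the two link types can be observed from user space. The following is a minimal sketch, assuming a POSIX system where the caller may create hard links and symlinks; the function name link_demo and the file names are illustrative only.

```python
import os


def link_demo(workdir):
    """Create a file, a hard link, and a symlink, then delete the first name."""
    target = os.path.join(workdir, "report.txt")
    hard = os.path.join(workdir, "report-hard.txt")
    soft = os.path.join(workdir, "report-soft.txt")

    with open(target, "w") as f:
        f.write("quarterly numbers\n")

    os.link(target, hard)     # a second name for the same inode
    os.symlink(target, soft)  # a separate tiny file whose content is a path

    same_inode = os.stat(target).st_ino == os.stat(hard).st_ino
    link_count = os.stat(target).st_nlink  # two names point at one inode
    symlink_has_own_inode = os.lstat(soft).st_ino != os.stat(target).st_ino

    os.unlink(target)  # remove one name; the inode still has a link left

    with open(hard) as f:
        hard_still_readable = f.read() == "quarterly numbers\n"
    # The symlink still exists, but the path stored inside it no longer resolves.
    symlink_dangling = os.path.islink(soft) and not os.path.exists(soft)

    return {
        "same_inode": same_inode,
        "link_count": link_count,
        "symlink_has_own_inode": symlink_has_own_inode,
        "hard_still_readable": hard_still_readable,
        "symlink_dangling": symlink_dangling,
    }
```

Calling link_demo on an empty scratch directory should show the hard link surviving the deletion while the symlink is left dangling.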
Path Resolution
When you open("/home/user/file.txt", ...):
- Kernel starts at the root inode (/).
- Looks up home in /’s directory → gets inode for /home.
- Looks up user in /home → gets inode for /home/user.
- Looks up file.txt in /home/user → gets inode for the file.
- Returns a file descriptor referencing that inode.
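The steps above can be approximated from user space by asking for the inode number of each path prefix in turn. This is only a sketch: the kernel performs the walk internally on dentries, not through repeated system calls, and the helper name walk_components is hypothetical.

```python
import os


def walk_components(path):
    """Return [(prefix, inode_number)] for each step of resolving an absolute path."""
    if not os.path.isabs(path):
        raise ValueError("expected an absolute path")
    current = os.sep
    steps = [(current, os.lstat(current).st_ino)]  # start at the root inode
    for name in [part for part in path.split(os.sep) if part]:
        current = os.path.join(current, name)
        # lstat reports the inode of this component without following a final symlink
        steps.append((current, os.lstat(current).st_ino))
    return steps
```

For a path such as /home/user/file.txt this yields four entries: the root, /home, /home/user, and the file itself.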
1. The VFS Architecture
The Virtual File System (VFS) is a software layer in the kernel that provides the standard file-related system calls (open, read, write) to user-space, regardless of the underlying filesystem (EXT4, XFS, NFS, ProcFS).
The Four Primary Objects
- Superblock: Represents an entire mounted filesystem. It contains metadata like the block size, total number of inodes, and the “magic number.”
- Inode (Index Node): Represents a specific file (or directory) on disk. It contains all metadata (size, permissions, timestamps, block pointers) but not the filename.
- Dentry (Directory Entry): Connects an Inode to a filename. Dentries are transient objects created in memory to speed up path lookups.
- File: Represents an open file in a specific process. It stores the current “cursor” position (f_pos) and the access mode (Read/Write).
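The split between the Inode (one per file) and the File object (one per open) is visible through file offsets. The sketch below assumes a POSIX system and a file of at least five bytes; offsets_demo is an illustrative name. Two separate open() calls get independent cursors, while a duplicated descriptor shares the cursor of the descriptor it was copied from.

```python
import os


def offsets_demo(path):
    """Show that each open() has its own cursor, while dup() shares one."""
    fd_a = os.open(path, os.O_RDONLY)  # first open file object
    fd_b = os.open(path, os.O_RDONLY)  # second, independent open file object
    fd_c = os.dup(fd_a)                # new descriptor, same open file object as fd_a
    try:
        os.read(fd_a, 5)               # advances the cursor behind fd_a (and fd_c)
        return {
            "a": os.lseek(fd_a, 0, os.SEEK_CUR),
            "b": os.lseek(fd_b, 0, os.SEEK_CUR),
            "c": os.lseek(fd_c, 0, os.SEEK_CUR),
        }
    finally:
        for fd in (fd_a, fd_b, fd_c):
            os.close(fd)
```

All three descriptors reference the same inode; only the per-open state differs.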
2. Path Lookup: The “Walk”
Converting a string like /home/user/code/main.c into an Inode is one of the most performance-critical paths in the kernel.
The Lookup Process
- Split the path: Divide by /.
- Dentry Cache (dcache) Lookup: The kernel first checks the memory-resident Dcache. If the dentry for home is found, it moves to the next part.
- Directory Traversal: If not in cache, the kernel must read the directory’s data blocks from disk to find the mapping from the filename to an Inode number.
- Inode Cache: Once the Inode number is found, the kernel checks the Inode cache or reads it from the Inode table on disk.
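A toy model makes the effect of the dcache concrete. This sketch is not how the kernel implements it: the class ToyDcache, the dictionary standing in for on-disk directory blocks, and the choice of 2 as the root inode number are all assumptions made for illustration.

```python
class ToyDcache:
    """Toy lookup cache keyed by (parent inode, name)."""

    def __init__(self, directories):
        # directories: {dir_inode: {name: child_inode}} stands in for directory blocks on disk
        self.directories = directories
        self.cache = {}       # (parent_inode, name) -> child_inode
        self.disk_reads = 0   # number of simulated directory reads

    def lookup(self, parent, name):
        key = (parent, name)
        if key in self.cache:
            return self.cache[key]               # cache hit: no directory read
        self.disk_reads += 1                     # cache miss: "read" the directory
        child = self.directories[parent][name]   # raises KeyError if the name is absent
        self.cache[key] = child
        return child

    def resolve(self, path, root=2):
        inode = root
        for name in filter(None, path.split("/")):
            inode = self.lookup(inode, name)
        return inode
```

Resolving the same path twice costs one simulated directory read per component the first time and none the second time, which is the behaviour the cold-cache versus warm-cache distinction refers to.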
Optimization: RCU-Walk
To handle thousands of concurrent lookups (e.g., on a web server serving static files), modern Linux uses RCU-walk. It performs the path walk without taking any locks, assuming the directory structure will not change. If a change is detected mid-walk, it falls back to the slower “Ref-walk” with proper locking.
Think of it like walking through a building with all doors open (RCU-walk). If you find a door closed mid-walk (directory structure changed), you go back and get the key ring (take locks). The optimistic path is dramatically faster because lock acquisition and contention are the primary bottlenecks in high-throughput filesystem workloads.
Practical tip: If your application opens thousands of files per second (e.g., npm install, container startup), the dentry cache hit rate is your most important filesystem metric. cat /proc/sys/fs/dentry-state shows how many dentries are allocated and how many of them are unused; it reports counts, not a hit rate, so treat it as a view of cache size and warmth. A cold dentry cache after a reboot is often the real reason “the first request after deploy is slow.”
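The numbers in /proc/sys/fs/dentry-state are easier to read with field names attached. The sketch below assumes the six-field layout documented for recent Linux kernels (older kernels report the fifth field as an unused placeholder); the function names are illustrative.

```python
# Field order as documented for /proc/sys/fs/dentry-state on recent kernels.
DENTRY_STATE_FIELDS = (
    "nr_dentry",    # total dentries allocated
    "nr_unused",    # dentries not actively in use (reclaimable)
    "age_limit",
    "want_pages",
    "nr_negative",  # cached "this name does not exist" entries
    "dummy",
)


def parse_dentry_state(text):
    """Map the whitespace-separated integers to their field names."""
    values = [int(value) for value in text.split()]
    return dict(zip(DENTRY_STATE_FIELDS, values))


def read_dentry_state(path="/proc/sys/fs/dentry-state"):
    """Read and parse the live values (Linux only)."""
    with open(path) as f:
        return parse_dentry_state(f.read())
```

Sampling read_dentry_state() before and after a workload gives a rough picture of how many dentries the workload pulled into the cache.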
3. On-Disk Layout & Allocation
How data is actually stored on the physical platters or NAND cells.
Block Allocation
- Extents: Instead of a list of block numbers (which is inefficient for large files), modern filesystems (EXT4, XFS) use Extents—a starting block number and a length (e.g., “Blocks 1000 to 2000”).
- Delayed Allocation: The kernel buffers writes in memory and waits as long as possible before choosing physical blocks on disk. This allows the allocator to find a single contiguous extent for the entire file, reducing fragmentation.
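To see why extents are more compact than block lists, the following sketch collapses a list of physical block numbers (in file order) into (start, length) pairs. It illustrates the representation only, not any filesystem's actual on-disk extent format; blocks_to_extents is a hypothetical helper.

```python
def blocks_to_extents(blocks):
    """Collapse block numbers, given in file order, into (start, length) extents."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)  # block continues the current run
        else:
            extents.append((block, 1))         # block starts a new extent
    return extents
```

A fully contiguous file of 1001 blocks needs a single extent, while a fragmented file needs one extent per run, which is why delayed allocation's preference for contiguous space keeps the extent list short.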
Journaling: Atomic Operations
Writing to a file involves multiple steps: updating the bitmap, the Inode, and the data blocks. If power is lost mid-way, the filesystem becomes inconsistent, like a bank transfer that debited one account but crashed before crediting the other.
- The Journal: A dedicated circular buffer on disk that acts as a “transaction log.” The kernel first writes the intended changes to the journal (the “intent”). Once the journal write is “committed,” it updates the actual filesystem. On reboot, the kernel simply “replays” the journal to restore consistency: it completes any transactions that were committed but not yet applied and discards any that were not committed.
By default, EXT4 journals only metadata (data=ordered), which means file content can be lost on a crash but the filesystem structure stays consistent. If you need both metadata and data journaled, mount with data=journal, but this roughly halves write throughput because every byte is written twice (once to the journal, once to its final location). Databases bypass this tradeoff entirely by managing their own WAL (Write-Ahead Log) and using O_DIRECT + fsync.
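The commit-then-replay rule can be captured in a few lines. This is a toy model under simplifying assumptions (an in-memory list as the journal, a dictionary as the "disk", whole transactions logged at once); ToyJournaledStore and its method names are invented for illustration and do not mirror any real journal format.

```python
class ToyJournaledStore:
    """Toy model: 'disk' holds the real structures, 'journal' is the intent log."""

    def __init__(self):
        self.disk = {}
        self.journal = []

    def log(self, txid, updates, commit=True):
        """Append a transaction to the journal.

        commit=False models a crash that happened before the commit record was written.
        """
        for key, value in updates.items():
            self.journal.append(("write", txid, key, value))
        if commit:
            self.journal.append(("commit", txid))

    def replay(self):
        """Recovery: apply writes of committed transactions in log order, drop the rest."""
        committed = {record[1] for record in self.journal if record[0] == "commit"}
        for record in self.journal:
            if record[0] == "write" and record[1] in committed:
                _, _, key, value = record
                self.disk[key] = value
        self.journal.clear()
```

After a simulated crash, replay() leaves the store with every committed transaction fully applied and no trace of the uncommitted one, which is the consistency guarantee described above.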
4. Log-Structured File Systems (LFS)
Used in Flash/SSDs (e.g., F2FS, and conceptually similar to how SSDs work internally with their Flash Translation Layer).
- Philosophy: Never overwrite data. Treat the entire disk as a circular log. New writes always go to the end, and old versions of data become garbage.
- Benefit: Converts random writes into sequential writes, which is significantly faster for NAND flash (where erasing a block before overwriting is expensive) and reduces wear leveling overhead.
- Trade-off: Requires a Garbage Collector to reclaim space from “dead” versions of files. To free a region of the log, the GC must first copy its still-live data elsewhere, so the device performs more writes than the application issued; this is called write amplification. Under heavy write load, that extra traffic competes with application I/O for bandwidth. This is the same tradeoff that SSDs face internally with their FTL.
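The append-only philosophy and its garbage-collection cost can be sketched with a toy key-value log. Real log-structured filesystems clean segment by segment; this simplified model, with the invented class ToyLog, rewrites the whole log in one pass and counts how many records were physically written compared with how many the application asked for.

```python
class ToyLog:
    """Toy append-only store that never overwrites a record in place."""

    def __init__(self):
        self.log = []          # append-only list of (key, value) records
        self.index = {}        # key -> position of the newest record for that key
        self.app_writes = 0    # records written on behalf of the application
        self.total_writes = 0  # records physically appended (application + GC copies)

    def _append(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))
        self.total_writes += 1

    def write(self, key, value):
        self._append(key, value)  # older versions of key become garbage
        self.app_writes += 1

    def read(self, key):
        return self.log[self.index[key]][1]

    def garbage_collect(self):
        """Copy only live records into a fresh log, dropping dead versions."""
        old_log, old_index = self.log, self.index
        self.log, self.index = [], {}
        for position, (key, value) in enumerate(old_log):
            if old_index.get(key) == position:  # newest version of this key
                self._append(key, value)

    def write_amplification(self):
        if self.app_writes == 0:
            return 0.0
        return self.total_writes / self.app_writes
```

Every record the collector copies counts toward total_writes but not app_writes, so the ratio rises above 1.0 as soon as live data has to be moved.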
5. Page Cache & Writeback
The filesystem does not talk directly to the disk for every write(). This would be devastatingly slow: a disk write takes milliseconds, but a memory copy takes microseconds. Instead, the kernel absorbs writes into RAM and flushes them lazily.
- Write to Page Cache: The write() syscall simply copies data into kernel memory (Pages). The page is marked as Dirty. From the application’s perspective, the write “completed” in microseconds.
- Flushing (Writeback): Background kernel flusher threads (bdi_writeback) periodically write dirty pages to disk. Background flushing starts once dirty pages exceed dirty_background_ratio (default 10% of available memory); once they exceed dirty_ratio (default 20%), the writing processes themselves are throttled until enough pages have been flushed. This is why a sudden burst of writes can feel fast initially but then stall: you have filled the dirty page budget and your writes now proceed at disk speed.
- Direct I/O: Applications (like Databases) can use O_DIRECT to bypass the page cache and manage their own buffers. This avoids “Double Buffering” (data sitting in both the application’s buffer pool and the kernel’s page cache) and gives the application precise control over when data hits disk via fsync().
If you see high I/O wait (wa in top) during write bursts, check dirty_ratio and dirty_background_ratio in /proc/sys/vm/. Tuning these values is one of the highest-impact knobs for write-heavy workloads. Lower the background ratio to start flushing earlier and avoid the “cliff” where synchronous writeback kicks in.
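Because write() returns once the data is in the page cache, an application that needs durability has to ask for it explicitly. The following is a minimal sketch of a durable write, assuming a POSIX system; durable_write is an illustrative helper, and it uses ordinary buffered I/O plus fsync rather than O_DIRECT.

```python
import os


def durable_write(path, data):
    """Write bytes and return only after the kernel reports them flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # Each write() only copies into the page cache and marks pages dirty.
            written += os.write(fd, data[written:])
        os.fsync(fd)  # block until the dirty pages for this file reach the device
    finally:
        os.close(fd)

    # Also flush the containing directory so the new directory entry is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The fsync calls are where the milliseconds are spent; without them the function would return as soon as the memory copy finished.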
6. Virtual Filesystems (procfs, sysfs)
Not all filesystems represent disks.
- ProcFS (/proc): Exposes kernel data structures (processes, memory stats) as files.
- SysFS (/sys): Exposes the hardware device tree and driver configurations.
- TmpFS: A filesystem that exists entirely in RAM.
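Since procfs exposes kernel state as plain text, the mount table itself can be read like any other file. The sketch below parses text in the format of /proc/self/mounts and flags a few memory-backed or kernel-backed filesystem types; the set PSEUDO_FS is a small illustrative sample, not a complete list, and parse_mounts is a hypothetical helper.

```python
# A small, non-exhaustive sample of filesystem types that are not backed by a disk.
PSEUDO_FS = {"proc", "sysfs", "tmpfs", "devtmpfs", "devpts", "cgroup2"}


def parse_mounts(text):
    """Parse /proc/self/mounts-style text into (mount_point, fstype, is_pseudo) tuples."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype = fields[:3]
        result.append((mount_point, fstype, fstype in PSEUDO_FS))
    return result
```

On a Linux machine, passing the contents of /proc/self/mounts to parse_mounts shows how many of the mounted filesystems never touch a block device.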
7. Choosing a Filesystem in Production
Selecting the right filesystem for your workload is a key architectural decision. Here’s a practical guide:
Decision Flowchart
Comparison Table
| Filesystem | Best For | Avoid When | Journaling | Max File Size |
|---|---|---|---|---|
| EXT4 | General Linux, boot partitions | Very large files (>16TB) | Metadata or Data | 16 TB |
| XFS | Large files, high parallelism | Lots of small files, shrinking | Metadata | 8 EB |
| Btrfs | Snapshots, compression, dev | Production databases (still maturing) | Copy-on-Write | 16 EB |
| ZFS | Data integrity, NAS, backups | Low-memory systems (less than 8GB) | Copy-on-Write | 16 EB |
| F2FS | SSDs, flash storage | HDDs | Log-structured | 16 TB |
Production Checklist
- Database workloads (MySQL, PostgreSQL): XFS with noatime, or EXT4 with noatime,data=ordered (data=ordered is an EXT4 mount option and does not apply to XFS).
- Container hosts (Docker, Kubernetes): XFS for overlay2 driver, or Btrfs for native driver.
- Backup servers: ZFS (checksums detect silent corruption) or Btrfs (snapshots).
- High-throughput analytics: XFS (better large file and parallel write performance).
- Embedded/Flash: F2FS (reduces write amplification).
Quick Commands
7.5 Caveats and Common Pitfalls
Summary for Senior Engineers
- Filenames are just pointers: A file can have multiple names (Hard Links) pointing to the same Inode.
- Dentry Cache is the bottleneck for cold-start performance (e.g., npm install).
- Extents and Delayed Allocation are the primary defenses against disk fragmentation.
- Journaling protects the filesystem metadata, but not necessarily your data (unless data=journal mode is used).
Interview Deep-Dive
ext4 vs XFS vs btrfs vs ZFS -- trade-offs
Strong Answer Framework:
- Frame each on the trinity: integrity, performance, flexibility. No filesystem wins all three; the choice is workload-driven.
- ext4 is the boring, bulletproof default. Metadata journaling, extents, delayed allocation. Great for boot disks, mail spools, general Linux servers. Weak spots: no data checksums (silent corruption is invisible), max file size 16TB, scaling beyond ~50TB starts to feel sluggish.
- XFS is the high-throughput parallel-IO workhorse. Allocation groups give you parallel writers without lock contention. Great for large files (analytics, video, scientific data), backup repositories, and database data files. Weak spots: shrinking is unsupported (you can grow but never shrink), small-file workloads are competitive but not dominant, no native snapshots.
- btrfs is the Swiss army knife: copy-on-write, native snapshots, send/receive replication, integrated RAID, transparent compression. Excellent for development machines, container hosts, snapshot-based backups. Weak spots: RAID 5/6 is still buggy in 2025 (“do not use” in the kernel docs), performance can degrade with heavy fsync (Postgres), and large filesystems need careful balance management.
- ZFS is the gold standard for data integrity: end-to-end checksums, RAIDZ (no write hole), compression, encryption, dedup, send/receive. Used by Netflix, iXsystems, dozens of NAS vendors. Weak spots: not in mainline Linux (CDDL license), eats RAM (rule of thumb: 1GB per TB plus extra for ARC), L2ARC and SLOG add operational complexity.
- Make a recommendation per workload. Postgres on bare metal: XFS with noatime, or ext4 with noatime,data=ordered. Backup target with corruption detection: ZFS RAIDZ2. Container host with snapshot rollback: btrfs subvolumes (or ZFS, on FreeBSD). Petabyte-scale media archive: XFS with allocation groups tuned to NUMA.
Senior Follow-up 1: “Why is btrfs RAID 5/6 still considered unstable?”
The “write hole”: if power is lost between writing data and parity blocks, parity becomes inconsistent. Combined with btrfs’s CoW data layout and metadata pinning bugs surfacing under recovery, you can lose data on reconstruct. ZFS RAIDZ avoids this with variable-stripe writes; mdadm RAID5 papers over it with a write-intent bitmap or a journal device. btrfs has had patches in flight for years; the kernel docs explicitly warn against RAID5/6 for important data.
Senior Follow-up 2: “Why does Postgres performance suffer on btrfs?”
Postgres calls fsync frequently (every WAL commit). On CoW filesystems, fsync forces a metadata transaction even when data did not change semantically, because the block locations of the data changed. This amplifies write traffic and journal commits. Workarounds: nodatacow on the data directory (loses btrfs’s data checksums), or use ext4/XFS for Postgres data and btrfs for everything else.
Senior Follow-up 3: “How do you decide between native ZFS RAID and hardware RAID?”
Hardware RAID hides the disks behind a single block device, defeating ZFS’s ability to detect and repair corruption. Always give ZFS raw access (HBA mode / IT mode), let it manage redundancy. The argument “but hardware RAID has battery-backed cache” is solved by SLOG (a small fast SSD as ZFS’s intent log device). Same for mdadm: if you are running ZFS, do not stack it on top of md.
Common Wrong Answers:
- “ZFS is always best because of checksums.” True only when you can pay the RAM, complexity, and licensing cost. For a stateless container host with ephemeral storage, ext4 is fine.
- “btrfs is unstable; never use it.” The non-RAID-5/6 features are stable and used at major-vendor scale (SUSE, Synology, Facebook). Learn what is stable vs experimental.
- “XFS does not support journaling.” XFS has metadata journaling. It does not journal data, but neither does ext4 by default.
Further Reading:
- “ZFS: The Last Word in Filesystems”, Jeff Bonwick’s original Sun design papers
- “Btrfs: The Linux B-tree Filesystem”, Mason et al., ACM TOS
- Dave Chinner’s XFS articles on lwn.net
How does a journaling filesystem recover after a crash?
Strong Answer Framework:
- Define the problem journaling solves. A filesystem operation typically updates multiple on-disk structures: inode, block bitmap, directory entry, data blocks. A crash mid-update leaves the filesystem inconsistent. Pre-journal recovery (fsck on ext2) was a full scan: minutes to hours on multi-TB volumes.
- Describe the journal as a write-ahead log. Before mutating real on-disk structures, the filesystem writes the intended changes (the “transaction”) to a circular journal area. Once the journal write completes, the transaction is marked committed; only then are the actual structures updated, after which the journal space can be reused.
- Walk through crash recovery. On mount, the FS examines the journal. Entries marked committed but not fully applied are replayed (re-applied to the main structures). Entries not committed are discarded (the in-progress transaction never happened from the FS’s view). Either way, the FS is consistent within seconds rather than hours.
- Distinguish modes. ext4 has data=writeback (journal metadata only, no ordering between data and metadata; fastest, can show stale data after crash), data=ordered (default; journal metadata, write data first then commit metadata; consistent but data is not journaled), data=journal (journal both; 2x write traffic but full crash protection). Most production picks ordered.
- State the limit. Journaling protects the filesystem’s internal consistency. It does NOT protect application data unless the app uses fsync correctly. A database that writes to a file without fsync, then crashes, can lose data even on a journaled filesystem.
- Note the alternatives. Copy-on-write filesystems (btrfs, ZFS) do not need a separate journal: they update by writing new blocks and atomically swinging a pointer. Crash recovery is implicit: either the new pointer is durable (new state) or it is not (old state).
Senior Follow-up 1: “What does data=ordered actually order?”
Data writes for a transaction must complete before the transaction’s metadata commit hits the journal. Without that ordering, the journal could record “block 12345 belongs to inode 99” before block 12345’s content was written, and a crash would leave inode 99 pointing at stale or random data. Ordered mode prevents this by issuing data writes first and waiting for them, then committing metadata.
Senior Follow-up 2: “What is the journal commit interval and why does it matter?”
ext4 commits the journal every N seconds, set with the commit=N mount option (default 5). A crash within that window can lose recently written metadata even though the FS will be consistent. For latency-sensitive workloads, lower commit intervals reduce the loss window at the cost of more journal writes. For throughput-bound batch jobs, larger intervals amortize journal overhead.
Senior Follow-up 3: “How does a journal interact with barriers and write caches?”
The journal entry must be durable on disk before the corresponding metadata update is durable, otherwise crash recovery sees a half-written transaction. ext4 issues REQ_PREFLUSH | REQ_FUA around journal commits to force the device to flush volatile cache. If you mount with barrier=0 (you should not, in production), you trade durability for throughput; on a power failure you can corrupt the FS.
Common Wrong Answers:
- “Journaling protects my data.” It protects the filesystem’s metadata consistency. For data, you still need fsync.
- “Replay just runs every transaction.” It runs only committed transactions. Non-committed (in-flight at crash) entries are discarded.
- “CoW filesystems still use a journal.” btrfs has a small log tree for fast fsync, but the main mechanism is CoW + atomic superblock pointer swing. ZFS uses a ZIL (intent log) for synchronous writes but relies on uberblock atomic update for consistency.
Further Reading:
- Stephen Tweedie, “EXT3, Journaling Filesystem”, the canonical paper
- “The Linux ext3 filesystem: Recovery and journaling”, LWN
- Dave Chinner, “XFS Algorithms and Data Structures” (xfs.org)
Hard link vs symlink -- semantic differences
Strong Answer Framework:
- Define each precisely. A hard link is a directory entry mapping a name to an inode number. Two hard links pointing at the same inode are equal; there is no “original” and “link.” A symlink (symbolic link, soft link) is a tiny file whose contents are a path string. The OS resolves the path string at access time.
- State the consequences of each design.
  - Hard links share the inode. Same data, same metadata, same permissions, same link count (incremented).
  - Symlinks point at a path. They can be relative or absolute. They can dangle (target deleted). They can cross filesystems. They can point at directories.
- Walk through cross-cutting concerns.
  - Filesystem boundary: Hard links must stay within one filesystem (the inode number is filesystem-local). Symlinks cross freely.
  - Directories: Cannot hard link directories (cycles). Symlinks can point at directories.
  - Deletion: unlink on a hard link decrements the link count; data freed only when count reaches 0 AND no fd is open. unlink on a symlink removes only the symlink; the target is untouched. unlink on a target leaves dangling symlinks.
  - Permissions: Hard links share the inode, so chmod/chown affects all names. Symlinks have their own metadata but most ops follow them; lchown, lstat, and O_NOFOLLOW operate on the link itself.
  - Size: Hard links cost one directory entry (~32 bytes). Symlinks cost a tiny file (often inlined in the inode if path is short).
  - Race conditions: Symlinks enable TOCTOU attacks (/tmp symlink races). Hard links also have attacks (linking to files you cannot read to bypass permission checks on copy), which is why most distros enable the protected_hardlinks sysctl.
- Provide a decision rubric. Use hard links when you want true equivalence (e.g., backup tools deduplicating), within-filesystem aliases that should never break. Use symlinks when crossing filesystems, pointing at directories, exposing version aliases (/usr/lib/libfoo.so -> libfoo.so.1.2.3), or when the target may legitimately move.
git clone --reference requires the source repo to be on the same filesystem.
Senior Follow-up 1: “Why does the kernel limit symlink resolution depth?”
To prevent infinite loops (a -> b -> a) and absurdly deep chains. Linux follows at most 40 symlinks while resolving one pathname (MAXSYMLINKS); exceeding that returns ELOOP. Kernels before 4.2 additionally capped nested (recursive) symlink depth at 8, which protected path resolution from blowing the kernel stack; since 4.2 resolution is iterative and only the limit of 40 remains.
Senior Follow-up 2: “What is protected_hardlinks and what attack does it prevent?”
The sysctl fs.protected_hardlinks=1 prevents users from hard-linking files they do not own and cannot both read and write. The attack: a privileged process creates a temporary file with mode 0600. An unprivileged user hard-links it. Later, the temp file is unlinked but the user retains a hard link to its (now stable) inode. With race conditions on suid binaries, this can leak credentials.
Senior Follow-up 3: “What changes if I rename across hard links?”
rename(a, b) where a and b are existing hard links to the same inode does nothing and returns success; both names remain. Tools like git rely on the atomicity of rename for safe updates. A subtle gotcha: a single rename is atomic, but a rename followed by another rename is not, so backup tools that cycle through generations using rename trees need careful ordering.
Common Wrong Answers:
- “A symlink is a faster hard link.” Different mechanism, different semantics. Symlinks are slower (extra path resolution) and weaker (target can vanish).
- “Hard links work across filesystems.” They do not. Inode numbers are filesystem-scoped.
- “Deleting the original breaks hard links.” There is no original. Both names are equal.
Further Reading:
- link(2), symlink(2), unlink(2) man pages (read them carefully)
- “The TOCTOU problem and how Linux solves it”, LWN article on openat2
- “Apple File System Reference”, the APFS chapters on snapshots illustrate why hard-linked directories needed to die
You delete large log files on a production server but disk usage does not decrease. What happened?
Strong Answer Framework:
- Diagnose the symptom precisely. df shows used space high, du on the directory shows nothing. Classic signature of unlinked-but-open files.
- Explain the mechanism. When you unlink a file, the dentry is removed. The inode and data blocks remain allocated as long as any fd references them. Disk space returns only when the last fd is closed AND the link count is zero.
- Show how to confirm. lsof +L1 lists open files with link count zero. The output points at the holding process(es). On modern systems, lsof -nP +L1 | grep deleted is your friend.
- Explain the common culprit. Logging daemons (rsyslogd, journald, application loggers) hold the file open for writes. A naive rm of /var/log/big.log is followed by waiting, forever. The process keeps writing to the now-orphan inode.
- Provide multiple fixes.
  - Restart the process holding the file. Kills the fd, releases the inode.
  - Truncate via /proc/<PID>/fd/<FD>: running : > /proc/12345/fd/3 zeros the file in place without the process noticing.
  - Send SIGHUP to a daemon that handles log reopen on signal.
  - Use logrotate’s copytruncate for processes that cannot be signaled.
- Explain prevention. Configure logrotate properly. Use truncate(2) rather than unlink+create if you need to keep the daemon writing. Monitor for unlinked-but-open files in your alerting (Datadog has a check, Prometheus exporters expose it).
rm -rf’d a wrong directory and disk usage looked like it had not changed. It was actually fine (the inodes were freed), but during incident response, several engineers mistook still-open log files for “the rm did not work” and tried to redo the operation on a different host. The post-mortem explicitly called out misunderstanding of unlink semantics under fd pressure as a cognitive failure during the incident.
Senior Follow-up 1: “Why does logrotate’s copytruncate exist?”
For daemons that cannot be signaled to reopen log files (or for which reopening is unsafe). copytruncate copies the existing log to a backup, then truncates the original to zero in place. The daemon’s fd still points at the same inode, now empty, and continues writing without disruption. Trade-off: a brief race window during which writes between the copy and truncate are lost.
Senior Follow-up 2: “What is the dentry cache, and why is it critical for performance?”
The dcache is the kernel’s cache of recent name-to-inode lookups. Without it, every path component requires reading directory blocks. For /var/log/nginx/access.log, that is 5 lookups (root, var, log, nginx, access.log). A cold dcache after reboot is the real reason “the first request after deploy is slow” on many systems. Hit rate is typically 99 percent in steady state. Inspect with /proc/sys/fs/dentry-state.
Senior Follow-up 3: “How would you find unlinked-but-open files at scale across a fleet?”
Run lsof +L1 periodically and ship to your observability stack. Or use eBPF: hook do_unlinkat and track inodes with non-zero open counts. The Datadog/Prometheus node-exporter node_filesystem_files_free versus _files ratio also surfaces inode pressure. Alerts on disk-vs-inode divergence catch this class of bug early.
Common Wrong Answers:
- “rm did not actually delete it; try sudo rm -f.” The unlink succeeded; understanding why disk did not free is the actual lesson.
- “Reboot to fix it.” Works but masks the underlying logging configuration issue.
- “Use shred to scrub the file first.” Solves nothing about open fds. shred is for secure deletion of contents, not for freeing inodes.
Further Reading:
- “How Linux handles file deletion”, LWN
- The lsof man page (especially +L)
- GitLab post-mortem of the 2017 database incident (a long read, worth it)
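The unlinked-but-open behaviour discussed in the answer above is easy to reproduce. This is a minimal sketch, assuming a Linux or other POSIX system that allows unlinking an open file; unlinked_but_open_demo is an illustrative name and the path is supplied by the caller.

```python
import os


def unlinked_but_open_demo(path):
    """Delete a file while holding it open and report what the fd still sees."""
    with open(path, "wb") as f:
        f.write(b"x" * 4096)

    fd = os.open(path, os.O_RDONLY)
    try:
        os.unlink(path)                        # the name is gone immediately
        name_gone = not os.path.exists(path)
        info = os.fstat(fd)                    # the inode is still reachable via the fd
        data = os.read(fd, 4096)
        return {
            "name_gone": name_gone,
            "link_count": info.st_nlink,
            "size": info.st_size,
            "bytes_read": len(data),
        }
    finally:
        os.close(fd)  # last reference dropped: only now can the blocks be reclaimed
```

Between the unlink and the close, du no longer sees the file but df still counts its blocks, which is exactly the divergence described in the diagnosis step.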