In an operating system, processes are isolated by the Virtual Memory Manager to prevent one process from corrupting another. However, complex systems (like Chrome, Nginx, or a Database) require these isolated units to cooperate. IPC is the set of mechanisms provided by the kernel to bridge this isolation.
Caveat 1: Pipes are unidirectional, and this trips up almost every newcomer. A pipe() call gives you two file descriptors: fd[0] for reading and fd[1] for writing. Data flows in one direction only. If you want a parent and child to talk back and forth, you need two pipes (one parent-to-child, one child-to-parent), or you can call socketpair(AF_UNIX, SOCK_STREAM, 0, fds), which gives you a single bidirectional channel. This is why much modern code uses socketpair for parent-child IPC: half the file descriptors and no risk of crossing the streams. The pitfall: once every write end of a pipe is closed, the reader gets EOF; once every read end is closed, the next write() raises SIGPIPE, which by default kills the writer. A common defense is to ignore SIGPIPE (SIG_IGN) and handle the resulting EPIPE error from write(). On sockets, passing MSG_NOSIGNAL to send() has the same effect per call, but that flag does not apply to pipes.
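As a minimal sketch of the SIGPIPE defense described above: the helper names `ignore_sigpipe` and `write_all` are hypothetical, chosen for illustration. The idea is to ignore the signal once at startup and then treat EPIPE as an ordinary error return.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Ignore SIGPIPE process-wide so that writing to a pipe with no reader
 * fails with EPIPE instead of killing the process. Returns 0 or -1. */
int ignore_sigpipe(void) {
    struct sigaction sa;
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGPIPE, &sa, NULL);
}

/* Hypothetical helper: write the whole buffer, retrying on EINTR and on
 * short writes. Returns 0 on success or the errno value on failure. */
int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno;           /* EPIPE means every read end is closed */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would run `ignore_sigpipe()` once during startup and then treat a return value of `EPIPE` from `write_all()` as "the peer went away", closing its own end and cleaning up.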
Pattern: prefer socketpair for new bidirectional IPC. It avoids the four-FD bookkeeping of dual pipes, supports recvmsg/sendmsg (so you can pass file descriptors via SCM_RIGHTS), and gives you MSG_PEEK and MSG_NOSIGNAL. Reserve raw pipe() for the unidirectional cases where it is the obvious primitive: shell pipelines, popen-style streaming, log forwarding.
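The pattern can be sketched as follows. This is an illustration, not a production design: the function name `uppercase_via_child` is hypothetical, and it assumes the message is short enough (under 256 bytes) to arrive in a single read().

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent sends `msg` to a forked child and reads the upper-cased reply back
 * over the SAME socket. Returns the number of reply bytes, or -1 on error. */
ssize_t uppercase_via_child(const char *msg, size_t len, char *reply, size_t cap) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (pid == 0) {                        /* child: uses sv[1] only */
        close(sv[0]);
        char buf[256];
        ssize_t n = read(sv[1], buf, sizeof buf);
        for (ssize_t i = 0; i < n; i++)
            buf[i] = (char)toupper((unsigned char)buf[i]);
        if (n > 0 && write(sv[1], buf, (size_t)n) != n)
            _exit(1);                      /* reply goes out on the same fd */
        close(sv[1]);
        _exit(0);
    }

    close(sv[1]);                          /* parent: uses sv[0] only */
    ssize_t n = -1;
    if (write(sv[0], msg, len) == (ssize_t)len)
        n = read(sv[0], reply, cap);
    close(sv[0]);
    waitpid(pid, NULL, 0);                 /* reap the child */
    return n;
}
```

Each side holds exactly one descriptor, compared with two per side in the dual-pipe version, and the same descriptor could later carry SCM_RIGHTS messages if needed.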
Caveat 2: Shared memory races are among the most subtle bug classes in systems programming. The kernel does zero synchronization for shared memory; it just maps the same physical pages into two virtual address spaces. If both processes write to the same location without coordination, the result is undefined (literally: torn writes, lost updates, silent corruption). Worse, on weakly-ordered architectures (ARM, POWER), a reader can observe a “ready” flag before the data it guards unless the writer and reader insert appropriate barriers, and accesses that are unaligned or wider than the native word can tear. “It works on x86” is not a correctness proof; x86’s TSO model papers over many missing barriers.
Pattern: pair shared memory with explicit synchronization. The standard recipe is shm_open + mmap + a POSIX semaphore (sem_open) or a process-shared mutex (pthread_mutexattr_setpshared(PTHREAD_PROCESS_SHARED)). For producer/consumer, use a shared-memory ring buffer with atomic head and tail indices using memory_order_release/memory_order_acquire. Test on ARM (a Raspberry Pi works) — if your code passes there, it almost certainly passes everywhere. Crash safety: if a process dies holding a shared mutex, use PTHREAD_MUTEX_ROBUST so the next acquirer gets EOWNERDEAD and can recover instead of deadlocking forever.
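A minimal sketch of the ring-buffer recipe, assuming exactly one producer and one consumer. The struct and function names are illustrative, and the capacity of 8 is arbitrary. The struct is meant to be placed inside a MAP_SHARED mapping, which additionally assumes that atomic `size_t` is lock-free on the target so it works across processes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 8u   /* illustrative size; must be a power of two */

/* Single-producer / single-consumer ring. */
struct spsc_ring {
    _Atomic size_t head;      /* next slot to write; only the producer stores */
    _Atomic size_t tail;      /* next slot to read; only the consumer stores */
    int slots[RING_CAP];
};

void ring_init(struct spsc_ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false when the ring is full. */
bool ring_push(struct spsc_ring *r, int v) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAP)
        return false;
    r->slots[head & (RING_CAP - 1)] = v;
    /* release: the slot write is visible before the new head is */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool ring_pop(struct spsc_ring *r, int *out) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* acquire: pairs with the producer's release store to head */
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_CAP - 1)];
    /* release: the slot read completes before the slot is handed back */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The indices grow without wrapping and are masked on access, so "full" and "empty" are distinguishable without sacrificing a slot. Multiple producers or consumers would need a different design (or a lock).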
Caveat 3: Unix domain sockets vs. TCP loopback: UDS is often 2x-4x faster but local-only. Engineers often default to 127.0.0.1:PORT because it is portable. On the same host, this still routes through the full TCP stack: socket buffers, sequence numbers and acknowledgments, congestion control state, and checksum handling (typically skipped on Linux loopback, though the payload is still copied into and out of kernel buffers). A Unix domain socket bypasses the protocol machinery: the kernel just queues the bytes from one process’s socket onto the other’s. Benchmarks routinely show UDS at 2x-4x the throughput of TCP loopback for the same workload, with substantially lower CPU. The tradeoff: UDS only works for processes on the same kernel; you cannot transparently move them to different hosts.
Pattern: use UDS for sidecar / colocated IPC. When you have a service mesh sidecar (Envoy, Linkerd) on the same host as your application, talk to it over UDS, not loopback TCP. Same for daemon-style architectures (docker.sock, systemd socket activation, redis-cli to a local Redis). For cross-host fallback, abstract the connection behind an interface that picks UDS for unix:// URIs and TCP for tcp:// URIs — changing one config flag should be the only difference.
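One way to sketch the URI-based abstraction is shown below. The `endpoint` struct, `parse_endpoint`, and `connect_endpoint` are hypothetical names, the accepted URI shapes are an assumption (`unix:///absolute/path` and `tcp://host:port`), and bracketed IPv6 literals are not handled.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

enum ep_kind { EP_INVALID = 0, EP_UNIX, EP_TCP };

struct endpoint {
    enum ep_kind kind;
    char path[108];   /* UDS path; sun_path is typically 108 bytes on Linux */
    char host[64];
    char port[6];
};

/* Parse "unix:///run/app.sock" or "tcp://host:port". */
struct endpoint parse_endpoint(const char *uri) {
    struct endpoint ep;
    memset(&ep, 0, sizeof ep);
    if (strncmp(uri, "unix://", 7) == 0) {
        const char *p = uri + 7;
        if (*p == '/' && strlen(p) < sizeof ep.path) {
            strcpy(ep.path, p);
            ep.kind = EP_UNIX;
        }
    } else if (strncmp(uri, "tcp://", 6) == 0) {
        const char *p = uri + 6;
        const char *colon = strrchr(p, ':');
        if (colon && colon != p && (size_t)(colon - p) < sizeof ep.host &&
            colon[1] != '\0' && strlen(colon + 1) < sizeof ep.port) {
            memcpy(ep.host, p, (size_t)(colon - p));
            strcpy(ep.port, colon + 1);
            ep.kind = EP_TCP;
        }
    }
    return ep;
}

/* Connect a stream socket to the endpoint. Returns an fd, or -1 on failure. */
int connect_endpoint(const struct endpoint *ep) {
    if (ep->kind == EP_UNIX) {
        struct sockaddr_un sa;
        memset(&sa, 0, sizeof sa);
        sa.sun_family = AF_UNIX;
        strncpy(sa.sun_path, ep->path, sizeof sa.sun_path - 1);
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd == -1)
            return -1;
        if (connect(fd, (struct sockaddr *)&sa, sizeof sa) == -1) {
            close(fd);
            return -1;
        }
        return fd;
    }
    if (ep->kind == EP_TCP) {
        struct addrinfo hints, *res, *ai;
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(ep->host, ep->port, &hints, &res) != 0)
            return -1;
        int fd = -1;
        for (ai = res; ai; ai = ai->ai_next) {   /* try each resolved address */
            fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (fd == -1)
                continue;
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
                break;
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;
    }
    return -1;
}
```

With this shape, the rest of the application only ever sees a connected stream descriptor, so switching a deployment from a colocated sidecar to a remote one is a matter of changing the configured URI.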
Caveat 4: POSIX message queues have hard size limits and can block silently. On Linux the default mq_msgsize is 8192 bytes and the default mq_maxmsg is 10. A mq_send on a full queue blocks the sender by default, and if the receiver has crashed, the sender hangs until something drains the queue (use mq_timedsend or O_NONBLOCK to bound the wait). The system-wide limits in /proc/sys/fs/mqueue/ are also low (256 queues by default), and each user's total queue memory is capped by RLIMIT_MSGQUEUE (819200 bytes by default), so you cannot scale message queues like you would scale a Kafka topic.
Pattern: open mqueues with O_NONBLOCK and handle EAGAIN explicitly. In non-blocking mode, mq_send fails with errno set to EAGAIN when the queue is full, and mq_receive does the same when it is empty; treat this as backpressure, not as an error. Pair with mq_notify plus a signalfd (or, on Linux, where a message queue descriptor is a file descriptor, poll the queue directly) so you can wait for queue events in your event loop. If you genuinely need durable, large-volume messaging across processes on one host, prefer a real broker (Redis Streams, NATS, Kafka) over POSIX mqueues; mqueues are best for low-volume control plane messages, not data plane.
Mastery Level: Senior Systems Engineer
Key Internals: Kernel Ring Buffers, Page Table Aliasing, Signal Frames, rt_sigreturn
Prerequisites: Virtual Memory, Process Internals
Pipes are the oldest and most fundamental IPC mechanism in Unix. While they appear as simple file descriptors to user space, their internal implementation reveals sophisticated kernel buffer management.
A pipe is a unidirectional communication channel with:
Write end: One process writes data
Read end: Another process reads data
FIFO ordering: First In, First Out
Byte stream: No message boundaries
```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int pipefd[2];  // pipefd[0] = read end, pipefd[1] = write end
    char buffer[128];

    // Create pipe
    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        // Child process - writer
        close(pipefd[0]);  // Close unused read end
        const char *msg = "Hello from child!";
        if (write(pipefd[1], msg, strlen(msg) + 1) == -1)
            perror("write");
        close(pipefd[1]);
        _exit(0);
    }

    // Parent process - reader
    close(pipefd[1]);  // Close unused write end
    ssize_t n = read(pipefd[0], buffer, sizeof(buffer));
    if (n > 0)
        printf("Parent received: %s (%zd bytes)\n", buffer, n);
    close(pipefd[0]);
    waitpid(pid, NULL, 0);  // Reap the child
    return 0;
}
```
Critical Design Pattern: Always close unused pipe ends. If the parent keeps pipefd[1] open, the read() call will never return 0 (EOF) because the kernel sees there’s still a potential writer.
In the Linux kernel, a pipe is implemented using a circular buffer structure (struct pipe_inode_info):
// Simplified kernel structure (from fs/pipe.c)struct pipe_buffer { struct page *page; // Points to a physical page unsigned int offset; // Offset within the page unsigned int len; // Length of data in this buffer const struct pipe_buf_operations *ops; unsigned int flags;};struct pipe_inode_info { struct mutex mutex; // Protects the pipe wait_queue_head_t rd_wait; // Reader wait queue wait_queue_head_t wr_wait; // Writer wait queue unsigned int head; // Write position unsigned int tail; // Read position unsigned int max_usage; // Max buffers (usually 16) unsigned int ring_size; // Number of buffer slots struct pipe_buffer *bufs; // Array of pipe buffers struct user_struct *user; // Owner};
When a process calls write(pipefd[1], data, size):
```
1. System Call Entry
   ├─> User space → Kernel space transition
   └─> syscall handler: sys_write() → vfs_write() → pipe_write()

2. Acquire Pipe Mutex
   ├─> mutex_lock(&pipe->mutex)
   └─> Prevents concurrent access

3. Check Capacity
   ├─> if ((head - tail) >= ring_size)   // Pipe is full
   │     ├─> if (O_NONBLOCK) return -EAGAIN
   │     └─> else: wait_event_interruptible(pipe->wr_wait)
   └─> Process sleeps in TASK_INTERRUPTIBLE state

4. Write Data to Buffer
   ├─> Allocate new pipe_buffer if needed
   ├─> Get page: alloc_page(GFP_HIGHUSER)
   ├─> Copy from user space: copy_from_user(page, data, size)
   └─> Update head pointer: pipe->head++

5. Wake Readers
   ├─> wake_up_interruptible(&pipe->rd_wait)
   └─> Reader process moves to run queue

6. Release Mutex & Return
   ├─> mutex_unlock(&pipe->mutex)
   └─> Return bytes written
```
Key Kernel Functions:
// Simplified from fs/pipe.cstatic ssize_t pipe_write(struct kiocb *iocb, struct iov_iter *from) { struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; ssize_t ret = 0; size_t total_len = iov_iter_count(from); mutex_lock(&pipe->mutex); for (;;) { unsigned int head = pipe->head; unsigned int tail = pipe->tail; unsigned int mask = pipe->ring_size - 1; if (!pipe_full(head, tail, pipe->max_usage)) { struct pipe_buffer *buf = &pipe->bufs[head & mask]; struct page *page = alloc_page(GFP_HIGHUSER); if (!page) { ret = -ENOMEM; break; } // Copy data from user space size_t chunk = min_t(size_t, total_len, PAGE_SIZE); if (copy_from_iter(page_address(page), chunk, from) != chunk) { __free_page(page); ret = -EFAULT; break; } buf->page = page; buf->offset = 0; buf->len = chunk; pipe->head = head + 1; ret += chunk; // Wake up readers wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN); if (ret >= total_len) break; } else { // Pipe is full if (filp->f_flags & O_NONBLOCK) { ret = -EAGAIN; break; } // Sleep until space available mutex_unlock(&pipe->mutex); wait_event_interruptible(pipe->wr_wait, !pipe_full(pipe->head, pipe->tail, pipe->max_usage)); mutex_lock(&pipe->mutex); } } mutex_unlock(&pipe->mutex); return ret;}
Regular pipes only work between related processes (parent-child via fork()). Named pipes (FIFOs) allow unrelated processes to communicate.
```c
// writer.c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *fifo_path = "/tmp/my_fifo";

    // Create FIFO (like a file in the filesystem); tolerate a leftover one
    if (mkfifo(fifo_path, 0666) == -1 && errno != EEXIST) {
        perror("mkfifo");
        return 1;
    }

    int fd = open(fifo_path, O_WRONLY);  // Blocks until reader opens
    if (fd == -1) {
        perror("open");
        return 1;
    }

    const char *msg = "Hello via FIFO!";
    if (write(fd, msg, strlen(msg) + 1) == -1)
        perror("write");
    close(fd);

    unlink(fifo_path);  // Remove FIFO file
    return 0;
}
```

```c
// reader.c (separate program; start it after the writer has created the FIFO)
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const char *fifo_path = "/tmp/my_fifo";
    char buffer[128];

    int fd = open(fifo_path, O_RDONLY);  // Blocks until writer opens
    if (fd == -1) {
        perror("open");
        return 1;
    }

    ssize_t n = read(fd, buffer, sizeof(buffer));
    if (n > 0)
        printf("Received: %s\n", buffer);
    close(fd);
    return 0;
}
```
Kernel Implementation: A FIFO is represented by an inode with type S_IFIFO. The inode’s i_pipe field points to a struct pipe_inode_info, the same structure that backs anonymous pipes.
Problem: Signal handlers are asynchronous and severely limited in what they can do (async-signal-safe functions only). How do you integrate signals with an event loop?
Solution: The self-pipe trick.
```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int signal_pipe[2];

void signal_handler(int sig) {
    // Only async-signal-safe operations allowed here
    int saved_errno = errno;  // write() may clobber errno
    char byte = (char)sig;
    (void)write(signal_pipe[1], &byte, 1);  // write() is async-signal-safe
    errno = saved_errno;
}

int main(void) {
    if (pipe(signal_pipe) == -1) {
        perror("pipe");
        return 1;
    }
    // Non-blocking write end: a full pipe must never block the handler
    fcntl(signal_pipe[1], F_SETFL, O_NONBLOCK);

    // Set up signal handler
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = signal_handler;
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    // Event loop using poll()
    struct pollfd fds[2];
    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;
    fds[1].fd = signal_pipe[0];
    fds[1].events = POLLIN;

    printf("Event loop running. Press Ctrl+C to test...\n");

    while (1) {
        int ret = poll(fds, 2, -1);
        if (ret <= 0)
            continue;  // e.g. EINTR after a signal; the pipe is readable next time

        if (fds[0].revents & POLLIN) {
            // Handle stdin
            char buf[128];
            ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
            if (n <= 0)
                break;  // EOF or error on stdin
            printf("Got input: %.*s", (int)n, buf);
        }

        if (fds[1].revents & POLLIN) {
            // Handle signal
            char sig;
            if (read(signal_pipe[0], &sig, 1) == 1) {
                printf("Received signal %d in main loop!\n", sig);
                if (sig == SIGINT || sig == SIGTERM) {
                    printf("Cleaning up and exiting...\n");
                    break;
                }
            }
        }
    }

    close(signal_pipe[0]);
    close(signal_pipe[1]);
    return 0;
}
```
Why it works:
Signal handler executes in async context (can’t safely do much)
Handler writes 1 byte to pipe (write is async-signal-safe)
Main event loop wakes up from poll()
Main loop reads signal number and handles it safely
Signal handling is now integrated with other I/O events
Shared Memory is the fastest IPC mechanism because it completely eliminates kernel involvement in data transfer. Once set up, processes communicate at memory speed.
Pipe data flow:

```
Process A                   Kernel                  Process B
┌─────────┐                ┌──────┐                ┌─────────┐
│  User   │    write()     │      │    read()      │  User   │
│ Buffer  │ ─────────────> │ Pipe │ ─────────────> │ Buffer  │
│ [DATA]  │    copy 1      │Buffer│    copy 2      │ [DATA]  │
└─────────┘                │[DATA]│                └─────────┘
                           └──────┘
Total: 2 memory copies + 2 syscalls + 2 context switches
```
Shared Memory data flow:
```
Process A                                          Process B
┌─────────┐                                       ┌─────────┐
│  User   │       No kernel involvement!          │  User   │
│ Buffer  │ ────────────────────────────────────> │ Buffer  │
│ [DATA]  │        Direct memory access           │ [DATA]  │
└─────────┘                                       └─────────┘
     ↓                                                 ↓
     └────────────> Same Physical Memory <─────────────┘
Total: 0 copies (after initial setup)
```
#include <sys/ipc.h>#include <sys/shm.h>#include <stdio.h>#include <string.h>#define SHM_KEY 1234#define SHM_SIZE 4096// Creator processint main() { // Create shared memory segment int shmid = shmget(SHM_KEY, SHM_SIZE, IPC_CREAT | 0666); if (shmid == -1) { perror("shmget"); return 1; } // Attach to address space char *shm = shmat(shmid, NULL, 0); if (shm == (char*)-1) { perror("shmat"); return 1; } // Write data strcpy(shm, "System V shared memory!"); printf("Data written, shmid: %d\n", shmid); // Detach shmdt(shm); // Don't delete yet - let reader access it sleep(5); // Mark for deletion (actual deletion happens when all detach) shmctl(shmid, IPC_RMID, NULL); return 0;}// Reader processint main() { int shmid = shmget(SHM_KEY, SHM_SIZE, 0666); if (shmid == -1) { perror("shmget"); return 1; } char *shm = shmat(shmid, NULL, SHM_RDONLY); printf("Read: %s\n", shm); shmdt(shm); return 0;}
Persistence: System V shared memory persists until explicitly deleted with IPC_RMID or system reboot. Use ipcs -m to list and ipcrm -m <shmid> to delete orphaned segments.
Problem: Shared memory provides NO synchronization. Multiple processes accessing the same memory simultaneously will corrupt data.
Race Condition Example
```c
// BROKEN CODE - Race condition
struct shared_data {
    int counter;  // Shared counter
};

// Both Process A and B execute this:
shm->counter++;  // NOT ATOMIC!

// Assembly (what actually happens):
// 1. mov eax, [counter]  ; Read
// 2. inc eax             ; Increment
// 3. mov [counter], eax  ; Write

// If both processes interleave:
// A: Read (0)
// B: Read (0)
// A: Inc (1)
// B: Inc (1)
// A: Write (1)
// B: Write (1)
// Result: 1 (should be 2!)
```
Correct Solution
```c
// CORRECT - Using a semaphore
struct shared_data {
    sem_t mutex;
    int counter;
};

// Initialize (once); pshared = 1 makes the semaphore usable across processes
sem_init(&shm->mutex, 1, 1);

// Each process does:
sem_wait(&shm->mutex);  // Lock
shm->counter++;
sem_post(&shm->mutex);  // Unlock

// Or using atomic operations (GCC/Clang builtin):
__sync_fetch_and_add(&shm->counter, 1);

// or C11 atomics (requires counter to be declared atomic_int):
atomic_fetch_add(&shm->counter, 1);
```
Producer-Consumer with Shared Memory:
#include <sys/mman.h>#include <fcntl.h>#include <semaphore.h>#include <stdio.h>#include <string.h>#include <unistd.h>#define SHM_NAME "/pc_shm"#define BUFFER_SIZE 10struct shared_buffer { sem_t mutex; // Mutual exclusion sem_t empty; // Count of empty slots sem_t full; // Count of full slots int buffer[BUFFER_SIZE]; int in; // Producer index int out; // Consumer index};// Producerint main() { int shm_fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666); ftruncate(shm_fd, sizeof(struct shared_buffer)); struct shared_buffer *shm = mmap(NULL, sizeof(struct shared_buffer), PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0); // Initialize semaphores sem_init(&shm->mutex, 1, 1); sem_init(&shm->empty, 1, BUFFER_SIZE); // Initially all empty sem_init(&shm->full, 1, 0); // Initially none full shm->in = 0; shm->out = 0; // Produce items for (int item = 0; item < 20; item++) { sem_wait(&shm->empty); // Wait for empty slot sem_wait(&shm->mutex); // Lock shm->buffer[shm->in] = item; printf("Produced: %d\n", item); shm->in = (shm->in + 1) % BUFFER_SIZE; sem_post(&shm->mutex); // Unlock sem_post(&shm->full); // Signal new full slot usleep(100000); // Simulate work } munmap(shm, sizeof(struct shared_buffer)); return 0;}// Consumerint main() { int shm_fd = shm_open(SHM_NAME, O_RDWR, 0666); struct shared_buffer *shm = mmap(NULL, sizeof(struct shared_buffer), PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0); // Consume items for (int i = 0; i < 20; i++) { sem_wait(&shm->full); // Wait for full slot sem_wait(&shm->mutex); // Lock int item = shm->buffer[shm->out]; printf("Consumed: %d\n", item); shm->out = (shm->out + 1) % BUFFER_SIZE; sem_post(&shm->mutex); // Unlock sem_post(&shm->empty); // Signal new empty slot usleep(150000); // Simulate work } munmap(shm, sizeof(struct shared_buffer)); shm_unlink(SHM_NAME); return 0;}
POSIX message queues are implemented in the kernel as a priority-sorted list:
// Simplified from ipc/mqueue.cstruct mqueue_inode_info { spinlock_t lock; struct inode vfs_inode; wait_queue_head_t wait_q; struct msg_msg **messages; // Array of messages struct mq_attr attr; struct list_head e_wait_q[2]; // 0=recv waiters, 1=send waiters};struct msg_msg { struct list_head m_list; long m_type; // Priority size_t m_ts; // Message size void *m_data; // Message data // Followed by actual message data};// Send operationstatic int do_mq_timedsend(mqd_t mqdes, const char *msg_ptr, size_t msg_len, unsigned int msg_prio) { struct mqueue_inode_info *info = get_mqueue_info(mqdes); spin_lock(&info->lock); if (info->attr.mq_curmsgs >= info->attr.mq_maxmsg) { // Queue full - block or return error if (mqdes->f_flags & O_NONBLOCK) { spin_unlock(&info->lock); return -EAGAIN; } // Wait for space... } // Allocate message struct msg_msg *msg = alloc_msg(msg_len); copy_from_user(msg->m_data, msg_ptr, msg_len); msg->m_type = msg_prio; // Insert in priority order insert_message_sorted(info->messages, msg, msg_prio); info->attr.mq_curmsgs++; // Wake up receivers wake_up(&info->wait_q); spin_unlock(&info->lock); return 0;}
Priority Queue Implementation: Messages are kept ordered by priority (recent Linux kernels use a red-black tree keyed on priority, with FIFO order within each priority level). When receiving, the kernel always returns the oldest message of the highest priority.
What happens when Process B sends a signal to Process A?
1. Process B calls kill(pid_A, SIGINT) ├─> Syscall entry: sys_kill() └─> Kernel validates permission (same user or root)2. Kernel sets pending signal bit ├─> task_struct *task_A = find_task_by_pid(pid_A) ├─> sigaddset(&task_A->pending.signal, SIGINT) └─> If task_A is sleeping, wake it up3. Context switch to Process A ├─> Before returning to user space, kernel checks pending signals └─> do_signal() is called4. Signal frame construction ├─> Save current context (registers, stack pointer, instruction pointer) ├─> Allocate signal frame on user stack: │ ┌─────────────────┐ ← Original stack pointer │ │ Return address │ (to __restore_rt) │ ├─────────────────┤ │ │ Signal number │ (SIGINT = 2) │ ├─────────────────┤ │ │ siginfo_t │ (signal info) │ ├─────────────────┤ │ │ ucontext_t │ (saved registers) │ │ - RIP (PC) │ │ │ - RSP (SP) │ │ │ - RAX, RBX... │ │ └─────────────────┘ ← New stack pointer │ ├─> Modify saved user RIP to point to signal handler └─> Return to user space5. Signal handler executes ├─> Process A "resumes" at handler address ├─> Handler runs: sigint_handler(2) └─> Handler returns6. Signal return trampoline ├─> Return address points to __restore_rt (kernel-provided code) ├─> __restore_rt calls rt_sigreturn() syscall └─> Kernel restores original context from signal frame7. Resume normal execution └─> Process A continues where it was interrupted
Kernel Code (simplified from kernel/signal.c):
// Step 2: Send signalint kill_something_info(int sig, struct siginfo *info, pid_t pid) { struct task_struct *p = find_task_by_vpid(pid); // Check permission if (!kill_ok_by_cred(p)) return -EPERM; // Add to pending signals sigaddset(&p->pending.signal, sig); // Wake up if sleeping signal_wake_up(p, sig == SIGKILL || sig == SIGSTOP); return 0;}// Step 3-4: Deliver signalstatic void handle_signal(struct ksignal *ksig, struct pt_regs *regs) { struct task_struct *task = current; sigset_t *oldset = sigmask_to_save(); // Build signal frame on user stack if (setup_rt_frame(ksig, oldset, regs) < 0) { // Failed - force SIGSEGV force_sig(SIGSEGV); return; } // Clear handled signal from pending sigdelset(&task->pending.signal, ksig->sig);}// Build signal framestatic int setup_rt_frame(struct ksignal *ksig, sigset_t *set, struct pt_regs *regs) { struct rt_sigframe __user *frame; // Allocate frame on user stack frame = get_sigframe(&ksig->ka, regs, sizeof(*frame)); // Fill in signal frame put_user(ksig->sig, &frame->sig); copy_siginfo_to_user(&frame->info, &ksig->info); // Save register context frame->uc.uc_mcontext.rip = regs->ip; frame->uc.uc_mcontext.rsp = regs->sp; frame->uc.uc_mcontext.rax = regs->ax; // ... save all registers ... // Set return address to restorer put_user(__NR_rt_sigreturn, &frame->retcode); // Modify user-space RIP to point to handler regs->ip = (unsigned long)ksig->ka.sa.sa_handler; regs->sp = (unsigned long)frame; return 0;}// Step 6: Return from signal handlerSYSCALL_DEFINE0(rt_sigreturn) { struct pt_regs *regs = current_pt_regs(); struct rt_sigframe __user *frame; frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long)); // Restore register context regs->ip = frame->uc.uc_mcontext.rip; regs->sp = frame->uc.uc_mcontext.rsp; regs->ax = frame->uc.uc_mcontext.rax; // ... restore all registers ... return regs->ax; // Return value of interrupted syscall}
Problem: A signal can interrupt a process anywhere, including inside non-reentrant functions.
```c
// DANGEROUS CODE
char *global_buffer = NULL;

void signal_handler(int sig) {
    // WRONG: malloc is NOT async-signal-safe
    global_buffer = malloc(100);
    sprintf(global_buffer, "Signal %d", sig);  // Also wrong
    printf("Handled: %s\n", global_buffer);    // Also wrong
    free(global_buffer);                       // Also wrong
}

int main() {
    signal(SIGINT, signal_handler);

    // What if SIGINT arrives HERE?
    global_buffer = malloc(200);  // Holding malloc's internal lock
    // Signal handler interrupts, tries to call malloc()
    // → DEADLOCK (waiting for lock it already holds)

    free(global_buffer);
    return 0;
}
```
Why this deadlocks:
Thread execution timeline:Time Main Thread Signal Handler──────────────────────────────────────────────── 1 malloc(200) 2 acquire(heap_lock) 3 [allocating memory...] 4 ← SIGINT arrives! → signal_handler() 5 malloc(100) 6 try acquire(heap_lock) 7 ⏸️ BLOCKED (waiting for lock) 8 ⏸️ Cannot continue ⏸️ Still blocked 9 (handler must finish)DEADLOCK: Main thread waiting for handler to finish Handler waiting for main thread to release lock
Async-Signal-Safe Functions (partial list from POSIX):
```c
// Safe to call from signal handlers:
_exit()      // NOT exit()
write()
read()
open()
close()
signal()     // Signal manipulation
sigaction()
kill()
getpid()
alarm()
pause()
// ... about 100 total
```
Real-time signals (SIGRTMIN through SIGRTMAX) differ from standard signals in three ways:
Queued: Multiple instances of the same signal are queued (standard signals are not)
Ordered: Delivered in priority order (lower signal numbers first)
Data: Can send an integer or pointer with the signal
```c
// Sender
union sigval value;
value.sival_int = 12345;
sigqueue(target_pid, SIGRTMIN, value);

value.sival_int = 67890;
sigqueue(target_pid, SIGRTMIN, value);  // Both will be delivered

// Receiver gets both signals with their respective values
```
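The receiving side can be sketched as follows. To keep the example in one piece, this hypothetical function `queue_two_and_collect` sends to its own process and collects the signals synchronously with sigwaitinfo() rather than installing an SA_SIGINFO handler; a real receiver would normally be a separate process.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

/* Queue two SIGRTMIN instances to ourselves while the signal is blocked,
 * then dequeue them one at a time. Returns 0 on success, -1 on error.
 * out[0] and out[1] receive the payloads in delivery order. */
int queue_two_and_collect(int first, int second, int out[2]) {
    sigset_t set, old;
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);

    /* Block the signal so both instances stay pending instead of
     * triggering the default action (process termination). */
    if (sigprocmask(SIG_BLOCK, &set, &old) == -1)
        return -1;

    int rc = -1;
    union sigval v;

    v.sival_int = first;
    if (sigqueue(getpid(), SIGRTMIN, v) == -1)
        goto done;
    v.sival_int = second;
    if (sigqueue(getpid(), SIGRTMIN, v) == -1)
        goto done;

    for (int i = 0; i < 2; i++) {
        siginfo_t info;
        if (sigwaitinfo(&set, &info) == -1)   /* dequeues one pending instance */
            goto done;
        out[i] = info.si_value.sival_int;     /* payload set by the sender */
    }
    rc = 0;

done:
    sigprocmask(SIG_SETMASK, &old, NULL);     /* restore the previous mask */
    return rc;
}
```

Calling `queue_two_and_collect(12345, 67890, got)` should leave both values in `got` in the order they were queued, which is the behavior that standard signals such as SIGUSR1 cannot provide.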
```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void handler(int sig) {
    // printf() is not async-signal-safe; tolerated here only because this
    // demo does nothing else that could hold stdio's internal locks
    printf("Handler started for signal %d\n", sig);
    sleep(3);  // Simulate long-running handler
    printf("Handler finished for signal %d\n", sig);
}

int main(void) {
    signal(SIGUSR1, handler);
    signal(SIGUSR2, handler);

    sigset_t mask, oldmask;

    // Block SIGUSR2
    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR2);
    sigprocmask(SIG_BLOCK, &mask, &oldmask);

    printf("SIGUSR2 is now blocked\n");
    printf("PID: %d\n", getpid());

    // SIGUSR1 will be handled, SIGUSR2 will be pending
    sleep(10);

    printf("Unblocking SIGUSR2...\n");
    sigprocmask(SIG_SETMASK, &oldmask, NULL);
    // Pending SIGUSR2 signals now delivered

    sleep(5);
    return 0;
}
```
Test:
```bash
# Terminal 1
./signal_mask
# Note the PID

# Terminal 2
kill -USR1 <pid>  # Handled immediately
kill -USR2 <pid>  # Blocked, becomes pending
kill -USR2 <pid>  # Another one (but standard signals don't queue)
# After program unblocks, only ONE SIGUSR2 is delivered
```
Q1: Explain the 'self-pipe trick' and why it's necessary.
Problem: Signal handlers can interrupt a process anywhere, including inside non-reentrant functions like malloc(). This severely limits what you can do in a signal handler.
Solution: The self-pipe trick:
Create a pipe: pipe(signal_pipe)
In signal handler: write(signal_pipe[1], &sig, 1) (write is async-signal-safe)
Add signal_pipe[0] to your event loop (epoll, select, poll)
When pipe becomes readable, main loop reads signal number and handles it safely
Why it works: The signal handler only does minimal work (one async-signal-safe write). The actual signal handling happens in the main event loop where all functions are safe to call.
Used in: many event-driven servers and libraries; libuv, the event loop underneath Node.js, is one example.
Q2: How does the kernel implement shared memory? Explain page table aliasing.
Concept: Two different processes map the same physical memory pages into their virtual address spaces.
Mechanism:
Process A calls mmap(MAP_SHARED) on shared memory object
Kernel creates VMA (Virtual Memory Area) in Process A’s address space
Kernel allocates physical pages (or uses existing ones for the shared memory object)
Process B calls mmap(MAP_SHARED) on the SAME shared memory object
Kernel creates VMA in Process B’s address space (different virtual address)
Kernel updates Process B’s page tables to point to the SAME physical frames
Result: Two different virtual addresses resolve to the same physical memory. This is page table aliasing.
Example:
```
Process A: Virtual 0x7000 → Physical Frame 0x5000
Process B: Virtual 0x9000 → Physical Frame 0x5000

Write by A to 0x7000 is immediately visible to B at 0x9000.
```
Key insight: The kernel doesn’t copy data. It just manipulates page table entries to create multiple mappings to the same physical pages.
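Aliasing can be observed without a second process. The sketch below (the function name `mappings_alias` is hypothetical) maps the same file-backed object twice in one process and checks that a write through the first mapping shows up through the second. It uses a tmpfile() as the shared object purely for portability; the mechanism is the same one shm_open + mmap relies on.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if two MAP_SHARED mappings of one object alias the same memory,
 * 0 if they do not, and -1 on setup failure. */
int mappings_alias(void) {
    FILE *f = tmpfile();               /* unnamed temp file as the shared object */
    if (!f)
        return -1;
    int fd = fileno(f);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, page) == -1) {
        fclose(f);
        return -1;
    }

    /* Two independent mappings: the kernel picks two different virtual
     * addresses but points both at the same page of the object. */
    char *a = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    int result = -1;
    if (a != MAP_FAILED && b != MAP_FAILED) {
        strcpy(a, "written via A");
        result = (a != b) && strcmp(b, "written via A") == 0;
    }

    if (a != MAP_FAILED)
        munmap(a, (size_t)page);
    if (b != MAP_FAILED)
        munmap(b, (size_t)page);
    fclose(f);
    return result;
}
```

No read() or write() call moves the string between the two addresses; the data is visible through `b` because both page table entries refer to the same physical frame.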
Q3: What is SCM_RIGHTS and how does it enable privilege separation?
SCM_RIGHTS: A Unix Domain Socket control message type that allows passing open file descriptors between processes.
Kernel Operation:
Sender has FD 3 → struct file* (kernel object)
Sender calls sendmsg() with SCM_RIGHTS control message containing FD 3
Kernel increments reference count of the struct file object
When the receiver calls recvmsg(), the kernel installs the same struct file* pointer in the receiver’s FD table at the lowest free slot (say, slot 5)
Receiver now has FD 5 pointing to the same kernel file object
Privilege Separation Pattern:
```
Privileged Broker Process:
- Runs as root or with capabilities
- Opens protected resources (files, sockets, devices)
- Validates requests from workers
- Passes FDs to workers via SCM_RIGHTS

Unprivileged Worker Process:
- Runs with minimal privileges (nobody user, chroot jail)
- Cannot open files directly
- Requests resources from broker
- Receives FDs and can use them (read/write)
```
Example: Chrome browser process (privileged) opens files and passes FDs to renderer processes (sandboxed, no filesystem access).
Security benefit: Capability-based security. Worker gets access to specific resource instances, not broad permissions.
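The sendmsg/recvmsg plumbing for SCM_RIGHTS is easy to get wrong, so here is a minimal sketch. The helper names `send_fd` and `recv_fd` are illustrative; the sketch passes exactly one descriptor per message and carries a single dummy data byte alongside it.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open file descriptor over a Unix domain socket. Returns 0 or -1. */
int send_fd(int sock, int fd) {
    char byte = 'F';                 /* one byte of ordinary data accompanies the FD */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {                          /* union guarantees cmsghdr alignment */
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    memset(&ctl, 0, sizeof ctl);

    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new descriptor, or -1. */
int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;

    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS ||
        c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;                   /* no descriptor attached */

    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;                       /* new number, same underlying struct file */
}
```

In the broker/worker pattern, the broker would call `send_fd()` after validating a request, and the sandboxed worker would call `recv_fd()` and use the result like any descriptor it had opened itself.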
Q4: Why are writes ≤ PIPE_BUF atomic, and what happens for larger writes?
PIPE_BUF: Linux defines it as 4096 bytes (one page); POSIX only guarantees at least 512.
Atomicity Guarantee: If two processes each write ≤ PIPE_BUF bytes to the same pipe simultaneously, the kernel guarantees the data won’t interleave.
Implementation:
```c
// Kernel holds pipe mutex for entire write when size <= PIPE_BUF
if (len <= PIPE_BUF) {
    mutex_lock(&pipe->mutex);
    // Copy entire write to pipe buffer
    // No other process can write during this time
    mutex_unlock(&pipe->mutex);
}
```
For writes > PIPE_BUF:
```c
// Kernel may release mutex between chunks
while (bytes_remaining > 0) {
    mutex_lock(&pipe->mutex);
    copy_chunk_to_pipe_buffer();
    mutex_unlock(&pipe->mutex);
    // Another process can write here!
}
```
Example:
```
Process A writes 8000 bytes "AAA..."
Process B writes 8000 bytes "BBB..."

Possible result in pipe:
AAAA (4096) BBBB (4096) AAAA (remainder) BBBB (remainder)

Data is interleaved!
```
Solution: If you need atomicity for large messages:
Use message queues (message-oriented)
Use Unix sockets with framing protocol
Use shared memory with proper locking
Break into ≤4096 byte messages with sequence numbers
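As an illustration of the framing option, here is a minimal length-prefixed protocol sketch. The names `write_frame` and `read_frame` and the 4-byte big-endian header are assumptions made for this example. It also assumes one writer per stream (for instance one Unix socket connection per producer), since two writers sharing a single pipe could still interleave a header with another writer's payload.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on EINTR and short writes. 0 or -1. */
static int write_full(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes. Returns 0, or -1 on error or premature EOF. */
static int read_full(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;               /* stream ended mid-frame */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Frame layout: 4-byte big-endian payload length, then the payload. */
int write_frame(int fd, const void *payload, uint32_t len) {
    uint32_t be = htonl(len);
    if (write_full(fd, &be, sizeof be) != 0)
        return -1;
    return write_full(fd, payload, len);
}

/* Read one frame into buf (capacity cap). Returns payload length or -1. */
long read_frame(int fd, void *buf, uint32_t cap) {
    uint32_t be;
    if (read_full(fd, &be, sizeof be) != 0)
        return -1;
    uint32_t len = ntohl(be);
    if (len > cap)
        return -1;                   /* refuse frames larger than the buffer */
    if (read_full(fd, buf, len) != 0)
        return -1;
    return (long)len;
}
```

The reader always recovers whole messages regardless of how the kernel split the underlying writes, which is the property a raw byte stream does not give you on its own.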
Q5: Compare System V vs POSIX IPC. When would you use each?
System V IPC (shmget, semget, msgget):
Pros:
Kernel-persistent (survives process death until reboot or manual deletion)
Well-established, available on all Unix systems
Atomic operations (semop with multiple sem ops)
Cons:
Awkward API (ftok for key generation, numeric IDs)
No integration with file descriptors (can’t poll/select)
Treat shared memory as untrusted input (even from “trusted” processes)
Use capabilities/SELinux to limit which processes can access shared memory
Monitor for orphaned segments (ipcs -m, /dev/shm)
Consider using Unix sockets instead (better isolation, kernel-mediated)
Q7: How does the kernel deliver signals? Explain the signal frame and rt_sigreturn.
Signal Delivery Process:
1. Signal Generation: - kill(pid, SIGINT) syscall - Hardware exception (SIGSEGV) - Timer expiration (SIGALRM)2. Kernel marks signal pending: sigaddset(&task->pending.signal, SIGINT)3. Before returning to user space: - Kernel checks for pending signals - Calls do_signal()4. Signal Frame Construction: ┌─── User Stack ──┐ │ ... │ ← Original SP ├─────────────────┤ │ Return addr │ → Points to __restore_rt ├─────────────────┤ │ sig (int) │ Signal number ├─────────────────┤ │ siginfo_t │ Signal info ├─────────────────┤ │ ucontext_t │ Saved context: │ uc_mcontext: │ │ rip = 0x... │ Instruction pointer │ rsp = 0x... │ Stack pointer │ rax = 0x... │ All registers │ rbx = 0x... │ │ ... │ └─────────────────┘ ← New SP5. Modify user context: regs->ip = handler_address; regs->sp = &signal_frame;6. Return to user space: - Process "resumes" at handler - Handler executes - Handler returns7. Trampoline (__restore_rt): - Automatically called when handler returns - Calls rt_sigreturn() syscall8. rt_sigreturn syscall: - Restores registers from signal frame - Restores original RIP, RSP - Process continues where interrupted
Code:
// Kernel builds frame (simplified)struct rt_sigframe *frame = (void*)(user_sp - sizeof(*frame));frame->uc.uc_mcontext.rip = regs->ip; // Save current PCframe->uc.uc_mcontext.rsp = regs->sp; // Save current SP// ... save all registers ...regs->ip = (unsigned long)handler; // Jump to handlerregs->sp = (unsigned long)frame; // New stack// Handler finishes, returns to __restore_rt:__asm__("mov $15, %rax"); // __NR_rt_sigreturn__asm__("syscall");// Kernel restores:regs->ip = frame->uc.uc_mcontext.rip; // Restore original PCregs->sp = frame->uc.uc_mcontext.rsp; // Restore original SP
Key Insight: The signal frame is a “snapshot” of the process state that allows the kernel to resume execution exactly where it was interrupted.
Q8: What is the performance difference between pipes, Unix sockets, and shared memory?
Benchmark Results (100 MB data transfer):
| Mechanism | Throughput | Latency (RTT) | Copies | Syscalls |
|---|---|---|---|---|
| Shared Memory | 28,000 MB/s | 0.5 µs | 0 | 2 (sem) |
| Unix Socket | 3,200 MB/s | 2 µs | 1 | 2 |
| Pipe | 2,800 MB/s | 2 µs | 1 | 2 |
| TCP Localhost | 1,500 MB/s | 8 µs | 2 | 2 |
Why Shared Memory is Fastest:
```
Pipe/Socket:
User A Buffer → [Kernel Copy] → User B Buffer
              ↑               ↑
           write()          read()

Shared Memory:
User A Buffer ← Same Physical Memory → User B Buffer
              (No copies!)
```
When NOT to use shared memory:
Small messages (less than 4KB): Synchronization overhead dominates
Infrequent communication: Setup cost not amortized
Simple protocols: Complexity not worth it
Need message boundaries: Datagram or SOCK_SEQPACKET Unix sockets and message queues preserve them for you (pipes and stream sockets do not)
Optimization Tips:
For Pipes/Sockets:
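```c
// Use larger socket buffers (SO_SNDBUF applies to sockets;
// on Linux, pipes are resized with fcntl(fd, F_SETPIPE_SZ, size) instead)
int sndbuf = 1024 * 1024;  // 1 MB
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

// Batch writes
struct iovec iov[10];
writev(fd, iov, 10);  // One syscall for multiple buffers
```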
For Shared Memory:
```c
// Use lock-free data structures
atomic_int head, tail;

// Batch operations to amortize synchronization
while (count < BATCH_SIZE) {
    shm->buffer[count++] = data;
}
sem_post(&full);  // One signal for batch
```
Pick an IPC mechanism for a high-throughput producer/consumer on the same machine. Walk through your decision and tradeoffs.
Strong Answer Framework:
Define the workload first. Throughput target (msgs/sec, bytes/sec), message size distribution, latency budget (p99), durability requirement, number of producers/consumers, crash semantics.
Eliminate clearly-wrong choices. If you need cross-host: UDS is out. If you need durability: shared memory and pipes are out (data is gone if either side crashes). If you need backpressure: signals are out (no flow control).
Compare the survivors on the metric that dominates:
Latency-sensitive, small messages, single producer/consumer: shared memory ring buffer with futex notification. ~50-100 nanoseconds per message. Used by HFT systems, LMAX Disruptor.
High throughput, multiple producers, small/medium messages: Unix domain sockets, SOCK_DGRAM. Each sendmsg is one atomic message. Kernel manages buffering and backpressure. ~1-3 microseconds per message. Used by journald, rsyslog.
Bulk transfer, large payloads, infrequent: shared memory (mmap) with semaphore. Zero copy beats everything for big payloads. Used by databases (PostgreSQL shared_buffers), Wayland (pixel buffers).
Address the specific question — a high-throughput producer/consumer on one machine usually wants Unix domain sockets unless you are above ~1M msgs/sec, at which point shared memory + ring buffer becomes worth the complexity.
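Sketch the chosen design: socket(AF_UNIX, SOCK_DGRAM, 0), bind to an abstract namespace path (\0prodcons, so no filesystem cleanup is needed), producers connect with connect(), and the consumer reads every producer's datagrams from its one bound socket (adding it to epoll if it also watches other FDs). Size the buffers for bursts: per unix(7), SO_SNDBUF on the producers bounds the datagram size while SO_RCVBUF has no effect on Unix domain sockets, and the depth of the receive queue is governed by net.unix.max_dgram_qlen.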
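Mention the failure modes you have to handle: consumer slow → the receive queue fills → producers block in send(), or get EAGAIN on non-blocking sockets → application-level backpressure. Producer crashes → the kernel closes its socket and frees its resources, so nothing is orphaned; with SOCK_DGRAM the consumer sees no EOF (datagrams from that producer simply stop), so detect dead producers with heartbeats, or use SOCK_SEQPACKET if you need a per-connection EOF.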
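The design sketched in the steps above might look roughly like this. It is a Linux-only sketch: the helper names (abstract_addr, consumer_open, producer_open) are hypothetical, and error handling is reduced to returning -1.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill addr with an abstract-namespace name (Linux-only): sun_path[0] is '\0'.
 * Returns the address length to pass to bind()/connect(). */
static socklen_t abstract_addr(struct sockaddr_un *addr, const char *name) {
    size_t n = strlen(name);
    if (n > sizeof(addr->sun_path) - 1)
        n = sizeof(addr->sun_path) - 1;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path + 1, name, n);
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}

/* Consumer: one bound datagram socket receives from every producer. */
static int consumer_open(const char *name) {
    struct sockaddr_un addr;
    socklen_t len = abstract_addr(&addr, name);
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&addr, len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Producer: connect() fixes the destination so plain send() works.
 * sndbuf_bytes bounds the largest datagram this producer can send. */
static int producer_open(const char *name, int sndbuf_bytes) {
    struct sockaddr_un addr;
    socklen_t len = abstract_addr(&addr, name);
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    /* Best effort: the kernel caps the request at net.core.wmem_max. */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_bytes, sizeof(sndbuf_bytes));
    if (connect(fd, (struct sockaddr *)&addr, len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Each send() on a producer socket arrives at the consumer as exactly one recv(), so a log line never needs a length prefix or delimiter.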
Real-World Example:
Facebook’s LogDevice uses a shared-memory ring buffer for its inter-thread queue (their MPMCQueue) and then ships data to remote nodes over TCP: they pick the right primitive at each layer. journald uses Unix domain sockets for log collection, with kernel buffering as the backpressure mechanism; forwarding to other machines is handled by the separate systemd-journal-remote component precisely because UDS does not cross hosts. The Aeron messaging framework (real-time finance) goes all-in on shared memory + busy-spin readers and hits ~10 million messages/second on a single core; the cost is dedicating that core to nothing else.
Senior Follow-up 1: When does shared memory beat UDS by enough to justify the complexity?
As a rule of thumb, somewhere between ~500K and ~1M msgs/sec, or when you need p99 latency below 1 µs. Below that, UDS is “fast enough” and the synchronization complexity of shared memory (correct memory ordering, robust mutex handling, head/tail wraparound) is not worth it. Always benchmark before choosing: measured numbers, not vibes.
Senior Follow-up 2: Why use SOCK_DGRAM over SOCK_STREAM for log shipping?
SOCK_DGRAM gives message boundaries: one sendmsg = one recvmsg for the consumer, no framing protocol needed. For SOCK_STREAM you would have to add a length prefix or delimiter, which adds CPU on both sides. The cost: on Linux, SOCK_DGRAM on UDS is reliable and ordered (unlike UDP), but a datagram larger than the sender's socket send buffer is rejected with EMSGSIZE rather than fragmented. Raise SO_SNDBUF (capped by net.core.wmem_max) if you have large messages.
Senior Follow-up 3: How does io_uring change this picture?
io_uring (Linux 5.1+) lets you batch syscalls and avoid the syscall overhead per message. For UDS, the win is moderate (~20-30%) because the syscall itself is cheap. For TCP loopback or shared-memory + futex_wake patterns, io_uring + IORING_OP_SEND with multi-shot recv can dramatically reduce CPU. As of 2024, frameworks like ScyllaDB’s Seastar use io_uring extensively to cut IPC CPU overhead.
Common Wrong Answers:
“Always use shared memory because it is fastest” — ignores synchronization complexity and crash semantics.
“Use Kafka / Redis / RabbitMQ” — adds a network hop and a separate process to manage; usually overkill for same-host IPC.
“Pipes are the standard Unix way” — pipes do not work for many-to-one without per-producer FIFOs and have PIPE_BUF atomicity limits.
Further Reading:
LMAX Disruptor whitepaper — the canonical shared-memory ring buffer design.
Aeron architecture docs — how a high-performance messaging system layers shared memory and UDP.
Brendan Gregg, “Linux Performance” page — has benchmarks of UDS vs. loopback TCP.
Implement a request-response protocol over a Unix domain socket. Walk through the API calls and the framing.
Strong Answer Framework:
Pick stream vs. datagram. SOCK_STREAM if requests/responses can exceed a single datagram or need ordering across messages. SOCK_DGRAM if every message is small (under ~64KB) and self-contained — you get message boundaries for free. For most RPC use cases, SOCK_STREAM with explicit framing is the standard.
Framing — the part candidates botch. Streams have no message boundaries, so you MUST frame. Two standard options:
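Length prefix: 4-byte big-endian length, then length bytes of payload. Read exactly 4 bytes, then exactly N bytes. Simple, standard. gRPC frames messages this way (a 1-byte compression flag, then a 4-byte big-endian length); Redis RESP applies the same idea with a textual length prefix.
Delimiter: e.g., newline-terminated JSON. Easy to debug with socat, slower because the reader must scan every byte for the delimiter. HTTP/1.1 delimits its header section this way (CRLF).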
Request-response pairing. For single-threaded clients, the next response on the wire matches the next request — order is implicit. For pipelined clients (multiple in-flight requests), assign an integer request ID; server echoes it in the response.
Timeouts and cancellation. Set SO_RCVTIMEO so reads do not hang forever. For clean cancellation across both endpoints, signal via a separate “cancel” message or close the socket (the peer gets EPIPE/EOF).
FD passing if needed. Use sendmsg with SCM_RIGHTS cmsg to ship file descriptors — this is what makes UDS uniquely powerful for local IPC. The receiving process gets a brand new FD pointing to the same kernel object.
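The length-prefix option from the framing step can be sketched as follows. The function names (read_exact, write_all, send_frame, recv_frame) are hypothetical helpers, and the sketch assumes blocking descriptors.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly n bytes. Returns 0 on success, -1 on error or early EOF. */
static int read_exact(int fd, void *buf, size_t n) {
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r == 0)
            return -1;            /* peer closed mid-message */
        if (r < 0) {
            if (errno == EINTR)
                continue;         /* interrupted by a signal: retry */
            return -1;
        }
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Write exactly n bytes, looping over short writes. */
static int write_all(int fd, const void *buf, size_t n) {
    const char *p = buf;
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Frame = 4-byte big-endian length, then payload. */
static int send_frame(int fd, const void *payload, uint32_t len) {
    uint32_t be = htonl(len);
    if (write_all(fd, &be, sizeof(be)) < 0)
        return -1;
    return write_all(fd, payload, len);
}

/* Returns payload length, or -1 on error, EOF, or a frame larger than cap. */
static ssize_t recv_frame(int fd, void *buf, uint32_t cap) {
    uint32_t be;
    if (read_exact(fd, &be, sizeof(be)) < 0)
        return -1;
    uint32_t len = ntohl(be);
    if (len > cap)
        return -1;                /* never trust a peer-supplied length */
    if (read_exact(fd, buf, len) < 0)
        return -1;
    return (ssize_t)len;
}
```

Rejecting oversized lengths before reading the payload matters: without the cap check, a buggy or hostile peer can make the receiver overflow its buffer.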
Real-World Example:
The Docker daemon serves its REST API on /var/run/docker.sock (UDS, SOCK_STREAM) using HTTP/1.1 framing. docker ps is just an HTTP GET /containers/json over UDS. systemd’s D-Bus broker also uses UDS with length-prefixed messages, and uses SCM_CREDENTIALS to authenticate the calling process by UID. Visual Studio Code’s language server protocol (LSP) uses UDS or pipes with JSON-RPC framing (Content-Length: N\r\n\r\n{...}).
Senior Follow-up 1: How do you handle partial reads on SOCK_STREAM?
read(fd, buf, n) can return any value from 1 to n, as well as 0 at end-of-file and -1 on error. You must loop until all n bytes have arrived, and the loop has to check every return value: a bare while (got < n) got += read(fd, buf+got, n-got); spins forever on EOF and corrupts got on error. Wrappers like recv_all() or read_exact() (Rust) encode this. On non-blocking sockets, EAGAIN means “no more data available right now; come back later”; combine with epoll edge-triggered mode for efficient event loops.
Senior Follow-up 2: How do you authenticate the peer on a Unix socket?
getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) returns struct ucred with the peer’s PID, UID, GID. The kernel populated this at connect time, so it is unforgeable: the peer cannot lie about who they are. This is how Polkit, D-Bus, and systemd authenticate clients without passwords. Linux-specific; BSD has LOCAL_PEERCRED or getpeereid.
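A minimal sketch of that check, assuming Linux and glibc (struct ucred needs _GNU_SOURCE). The helper name peer_uid_allowed and its return convention are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* struct ucred is a GNU/Linux extension */
#endif
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the peer's UID equals allowed_uid, 0 if not, -1 on error.
 * Linux-only: relies on SO_PEERCRED and struct ucred. */
static int peer_uid_allowed(int sock, uid_t allowed_uid) {
    struct ucred cred;
    socklen_t len = sizeof(cred);
    if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return -1;
    if (len != sizeof(cred))
        return -1;
    /* cred.pid and cred.gid are filled in as well; only the UID is checked. */
    return cred.uid == allowed_uid ? 1 : 0;
}
```

A server would typically call this right after accept() and close the connection when the result is not 1.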
Senior Follow-up 3: Why do some systems use abstract namespace sockets (path starts with \0)?
Abstract sockets (sun_path[0] = '\0', then a name) live in a kernel namespace, not the filesystem. Advantages: no unlink needed at startup or shutdown (no stale socket files), no dependence on filesystem paths or directory permissions, automatically cleaned up when all FDs close. Disadvantages: Linux-only, visible only within a network namespace (so containers each see their own), and not protected by file permissions, so any process in the same network namespace can connect; authenticate peers with SO_PEERCRED.
Common Wrong Answers:
“Just read() once and process the buffer” — ignores partial reads on streams; will randomly fail under load.
“Use port 0 and let the kernel pick” — that is a TCP concept; UDS uses paths.
“TLS over Unix sockets for security” — normally unnecessary; UDS is local-only and SO_PEERCRED is more useful than TLS for local auth. TLS adds CPU for no benefit on UDS.
Further Reading:
Beej’s Guide to Unix IPC — the practical reference for socket programming with UDS.
unix(7) man page on Linux — definitive reference for UDS semantics.
Shared memory + semaphore vs. POSIX message queue -- when do you reach for each, and what are the tradeoffs?
Strong Answer Framework:
State the core difference. Shared memory is “raw bytes you both can see, you handle synchronization.” Message queues are “kernel-managed mailbox: structured messages, built-in priority, kernel handles synchronization.”
Shared memory wins on:
Throughput: zero copy. The kernel only does the page-table aliasing once at setup; all subsequent reads/writes are in-process.
Bulk transfers: a 1GB shared region costs the same as a 1KB region.
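Tight latency: spin on a flag for wakeups around 50 ns; blocking in futex_wait instead saves CPU but adds a syscall and a scheduler wakeup, typically a few microseconds.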
Message queues win on:
Simplicity: kernel handles queueing, blocking, priority. No memory ordering bugs.
Discrete messages: mq_send is atomic at the message level, no framing protocol needed.
Priority delivery: messages are delivered in priority order (high priority first), useful for control-plane messages.
Crash safety: if a sender dies, queued messages survive until consumed or until queue is unlinked.
The honest tradeoff matrix:
| Concern | Shared Mem | Message Queue |
|---|---|---|
| Throughput | Best | Limited by per-message syscall overhead |
| Programming complexity | High (manual sync, ordering) | Low (just mq_send/mq_receive) |
| Crash recovery | Hard (dangling locks, robust mutexes) | Easy (queue persists) |
| Size limits | Limited only by RAM | Tight kernel defaults (/proc/sys/fs/mqueue/) |
| Cross-process? | Yes | Yes |
| Cross-host? | No | No |
Pick by workload: small structured messages, low frequency, need priority -> message queue. High-volume bulk data, single-digit microsecond latency required -> shared memory.
Real-World Example:
PostgreSQL uses shared memory for its buffer pool (shared_buffers) — every backend process maps the same region, and access is coordinated by spinlocks and lwlocks in shared memory. This is why PostgreSQL can serve thousands of concurrent reads from cache without copying pages. Conversely, the Linux kernel’s audit subsystem uses a netlink socket (similar to a message queue) to send audit events to userspace — audit events are small, structured, and the kernel needs reliable delivery semantics that a shared ring buffer would not provide without complex coordination.
Senior Follow-up 1: Can you implement a “message queue” on top of shared memory?
Yes — a shared-memory ring buffer with head/tail indices and a fixed message size is exactly that. Aeron and the LMAX Disruptor are essentially user-space message queues built on shared memory. They get higher throughput than POSIX mqueues because they avoid the per-message syscall, but require careful memory ordering and offer no priority semantics.
Senior Follow-up 2: What is mq_notify and when is it useful?
mq_notify registers a one-shot notification: when a message arrives on a previously-empty queue, the kernel either sends a signal or runs a callback in a new thread (caller’s choice). Useful for waking an idle reader without polling. The catch: it is one-shot per registration, so you must re-arm after every wakeup. Modern Linux code usually skips it and registers the queue descriptor with select/poll/epoll instead; this works because an mqd_t is a file descriptor on Linux, but it is not portable.
Senior Follow-up 3: Why do most modern systems use neither, and prefer io_uring or eventfd-based protocols?
POSIX mqueues have unfriendly limits, no batching, and no zero-copy. Shared memory has no kernel semantics for delivery. io_uring (Linux 5.1+) gives you submission queues + completion queues that ARE shared memory rings, but with kernel cooperation — syscalls only when you choose to enter the kernel. eventfd is a simpler primitive when you just need a counter to wake a waiter. The 2020s default for high-performance Linux IPC is “shared memory ring + eventfd or io_uring for notification.”
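As a small illustration of eventfd as the "counter to wake a waiter", here is a Linux-only sketch. The helper names notify and wait_for_work are hypothetical; the shared-memory ring that would carry the actual data is assumed to exist elsewhere.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Producer side: add n to the counter; wakes any reader blocked on efd. */
static int notify(int efd, uint64_t n) {
    return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Consumer side: block until the counter is nonzero, then return its value.
 * The read resets the counter to zero, so several notifications that arrived
 * in the meantime are collapsed into a single wakeup. */
static int64_t wait_for_work(int efd) {
    uint64_t count;
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (int64_t)count;
}
```

Create the descriptor with eventfd(0, EFD_CLOEXEC) and hand it to the other process via fork inheritance or SCM_RIGHTS; because it is an ordinary FD, it also slots into an epoll loop.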
Common Wrong Answers:
“Shared memory is always better because it is faster” — ignores complexity cost and that mqueues handle priority and crash safety automatically.
“Message queues are deprecated” — they are legacy but still useful for simple structured IPC; many embedded and POSIX-compliant systems still rely on them.
“Use a database for IPC” — adds disk I/O, transactions, and a separate process; vastly slower for short-lived inter-process signaling.
Further Reading:
mq_overview(7) man page — definitive reference for POSIX message queues.
Martin Thompson, “Mechanical Sympathy” blog — shared-memory design lessons from LMAX.
Compare pipes, shared memory, and Unix domain sockets for IPC. Your team is building a high-throughput logging pipeline where multiple producer processes send log lines to a single aggregator. Which mechanism do you choose and why?
Strong Answer:
For a logging pipeline with multiple producers and one consumer, Unix domain sockets are the right choice. Here is why.
Pipes are limited: a regular pipe only works between related processes (parent-child), and named pipes (FIFOs) are unidirectional. With multiple producers writing to a single FIFO, you get interleaving problems — writes larger than PIPE_BUF (4096 bytes on Linux) are NOT atomic, so log lines can get mixed together. You would need one FIFO per producer, which complicates the aggregator.
Shared memory is the fastest (zero-copy), but you have to build your own synchronization. You need a ring buffer or bounded queue in shared memory, protected by futexes or semaphores. You also need to handle producer crashes gracefully (what if a producer dies while holding a lock on the shared buffer?). For a logging pipeline, this is over-engineering unless you need extreme throughput (millions of log lines per second).
Unix domain sockets (SOCK_STREAM or SOCK_DGRAM) are the sweet spot. Multiple producers can connect to a single socket path. SOCK_DGRAM gives you message boundaries (each sendmsg is one complete log line, no framing needed), and each datagram is delivered whole or not at all, up to the send buffer size. With SOCK_DGRAM the aggregator reads all producers from one bound socket; with SOCK_STREAM it accepts one connection per producer and uses epoll to multiplex them. You get kernel-managed buffering, backpressure (a slow consumer causes producers to block on send), and clean handling of producer crashes (the kernel closes the socket on process exit).
In production, this is exactly what rsyslog and journald use — Unix domain sockets for local log collection. If throughput demands exceed what Unix sockets can handle, I would move to shared memory with a ring buffer (like the LMAX Disruptor pattern), but that is only justified at millions of messages per second.
Follow-up: What is SCM_RIGHTS, and how does Chrome use it for sandboxing?
SCM_RIGHTS is a mechanism for passing file descriptors between processes over a Unix domain socket using ancillary messages (sendmsg/recvmsg with cmsg). The sending process puts an fd number in the control message, and the kernel creates a new fd in the receiving process’s fd table pointing to the same underlying file/socket/device. Chrome uses this to implement its sandbox: the renderer process (which has no filesystem access via seccomp) cannot open files directly. Instead, the browser process opens the file and passes the fd to the renderer over a Unix socket. The renderer can read/write through the fd without ever having the capability to open arbitrary files. This is capability-based security at the OS level.
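The sendmsg/recvmsg mechanics described above can be sketched like this. The helper names send_fd and recv_fd are hypothetical, and the sketch passes exactly one descriptor per message; it is not how Chrome's code is written, only the same kernel interface.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one file descriptor over a Unix domain socket. */
static int send_fd(int sock, int fd_to_send) {
    char byte = 'F';                       /* carry one byte of ordinary data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {                                /* union guarantees cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd in this process, or -1. */
static int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The integer the receiver gets is generally different from the one the sender passed; what is shared is the underlying open file description, including its offset and flags.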
Signals are often called 'software interrupts.' Explain how signal delivery works at the kernel level, and what is the self-pipe trick?
Strong Answer:
When a signal is sent to a process (via kill() or the kernel generating SIGSEGV), the kernel sets a bit in the target process’s pending signal mask. The signal is not delivered immediately — it is delivered when the process returns to user space (e.g., returning from a syscall, returning from an interrupt, or when the scheduler runs the process).
At the point of delivery, the kernel examines the process’s signal disposition. If the handler is SIG_DFL, the kernel performs the default action (terminate, ignore, stop). If a custom handler is installed, the kernel builds a “signal frame” on the process’s user-space stack: it saves the current registers (including the instruction pointer), pushes the signal number and siginfo, and modifies the process’s instruction pointer to point to the signal handler. When the handler returns (via sigreturn() syscall), the kernel restores the saved registers and the process resumes where it was interrupted.
The gotcha: signal handlers interrupt the process at arbitrary points. If the main code is in the middle of updating a data structure, the signal handler runs with that structure in an inconsistent state. This is why only “async-signal-safe” functions (a small subset — write, _exit, signal, etc.) can be safely called in a signal handler. Calling malloc, printf, or acquiring a mutex in a signal handler is undefined behavior.
The self-pipe trick: to integrate signal handling with an event loop (epoll/select), you create a pipe, and the signal handler writes a single byte to the pipe. The event loop monitors the pipe’s read end alongside other file descriptors. When a signal arrives, the pipe becomes readable, and the event loop handles it in its normal, non-interrupted context where it is safe to call any function. Modern Linux provides signalfd() which provides the same functionality without needing a pipe — the kernel delivers signal information as readable data on a file descriptor.
Follow-up: What happens if a signal arrives while the process is blocked in a syscall like read()?
It depends on the SA_RESTART flag. By default (without SA_RESTART), the syscall is interrupted and returns -1 with errno set to EINTR. The application must check for EINTR and retry the syscall. With SA_RESTART set on the signal handler, the kernel automatically restarts the interrupted syscall after the signal handler returns. Not all syscalls are restartable: some (like select, poll, nanosleep) are never automatically restarted because their timeout semantics make restart ambiguous. This EINTR handling is one of the most common sources of bugs in systems programming.
How does shared memory IPC work at the page table level? If Process A and Process B share a memory region, what does the kernel actually do?
Strong Answer:
When Process A creates a shared memory segment (via shmget + shmat, or mmap with MAP_SHARED on a file or memfd), the kernel allocates physical frames and creates a VMA (virtual memory area) in A’s address space. The page table entries for that VMA point to those physical frames.
When Process B attaches to the same shared memory segment, the kernel creates a VMA in B’s address space and sets up B’s page table entries to point to the SAME physical frames. This is “page table aliasing” — two different virtual addresses (in different processes) map to the same physical pages. The MMU translates both to the same location in DRAM.
Writes by A are immediately visible to B (and vice versa) because they are reading/writing the same physical memory. However, “immediately visible” has caveats: on x86 (TSO), stores by one core are visible to other cores in order after the store buffer drains (which happens relatively quickly). On ARM (weak model), you need explicit memory barriers (or atomic operations) to ensure visibility.
The kernel does NOT provide any synchronization for shared memory. If A and B both write to the same location without coordination, you get a data race. The processes must use their own synchronization: POSIX semaphores (sem_init with pshared=1), futexes (the kernel-assisted wait/wake primitive that underlies pthread_mutex; a mutex placed in shared memory must be initialized with the PTHREAD_PROCESS_SHARED attribute), or atomic operations on shared variables.
Performance: shared memory is the fastest IPC because there is zero data copying — both processes access the same physical memory. The only overhead is the synchronization mechanism. This is why databases (PostgreSQL’s shared_buffers), display servers (Wayland’s buffer sharing), and high-frequency trading systems use shared memory for their hot paths.
Follow-up: What is memfd_create, and why was it added when we already had shm_open?
memfd_create() creates an anonymous file in RAM (backed by tmpfs) that is not linked to any filesystem path. It returns a file descriptor that can be passed to other processes via SCM_RIGHTS over a Unix socket or inherited via fork. The advantage over shm_open is that it has no namespace collision risk (no path to conflict with), it can be sealed (using F_SEAL flags to prevent resizing or writing, providing security guarantees for zero-copy data sharing), and it works naturally with mmap and fd passing. Wayland compositors use memfd_create + fd passing + sealing to share pixel buffers between the client and compositor without any risk of the client resizing the buffer while the compositor reads it.
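A minimal sketch of creating and sealing such a buffer, assuming Linux with glibc 2.27 or later. The helper name make_fixed_size_memfd is hypothetical, and only the size seals are applied here; a real compositor protocol may add others.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* memfd_create() and F_ADD_SEALS are Linux/GNU */
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an anonymous in-RAM file of `size` bytes whose size can no longer
 * change. Returns the fd (ready for mmap or SCM_RIGHTS), or -1 on error. */
static int make_fixed_size_memfd(const char *debug_name, off_t size) {
    /* The name only shows up in /proc for debugging; it is not a path. */
    int fd = memfd_create(debug_name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0 ||
        /* F_SEAL_SEAL also forbids adding or changing seals afterwards. */
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The receiver can call fcntl(fd, F_GET_SEALS) before mapping the buffer to confirm that the sender cannot resize it underneath a reader.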