Linux System Monitoring

As a DevOps engineer, you need to know what is happening on your servers at all times. When a page loads slowly, when an API starts timing out, when a deploy goes sideways — the first question is always “what is the server doing right now?” Monitoring tools answer that question. Think of them as the dashboard gauges in a car: you can drive without looking, but you will not see the engine overheating until it is too late.

1. Real-Time Monitoring

top / htop

The task manager of Linux. Shows CPU, memory, and running processes in real time.
# htop is the version you want -- color-coded, scrollable, and mouse-friendly
sudo apt install htop
htop
Key metrics to watch in htop:
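  • Load Average: Three numbers showing system load over 1, 5, and 15 minutes. The rule of thumb: if load average stays above the number of CPU cores, the system is overloaded. A 4-core machine with load 6.0 has, on average, about two tasks waiting in line. On Linux the figure also counts tasks in uninterruptible sleep (usually blocked on disk I/O), so load can be high while the CPU is mostly idle.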
  • CPU bars: Green is user-space processes (your apps). Red is kernel work (system calls, I/O). Blue is low-priority (nice) processes. If you see solid red bars, your system is spending too much time in the kernel, often a sign of heavy I/O or too many context switches.
  • Memory: Green is actively used. Yellow/blue is cache and buffers — this is memory that Linux is using intelligently to speed up disk reads, and it will be freed immediately when needed. Do not panic if “used” looks high — look at the “available” number instead.
Practical interpretation: High load + low CPU usage = I/O bottleneck (processes are blocked waiting on storage, such as a local disk or a network filesystem). High load + high CPU usage = CPU bottleneck (you need more compute). High memory usage with low “available” = you are actually running out of RAM and may start swapping.
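The load-versus-cores rule of thumb can be checked in a few lines of shell. This is a minimal sketch, assuming a Linux host that exposes /proc/loadavg and has nproc installed; the helper name load_status and its "ok"/"overloaded" labels are illustrative, not part of any standard tool.

```shell
# load_status LOAD CORES: print "overloaded" when load exceeds the core count.
# (Hypothetical helper; awk does the floating-point comparison.)
load_status() {
    awk -v l="$1" -v c="$2" 'BEGIN {
        if (l > c) print "overloaded"
        else       print "ok"
    }'
}

# Live check: the first field of /proc/loadavg is the 1-minute load average
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    echo "load1=$load1 cores=$(nproc) status=$(load_status "$load1" "$(nproc)")"
fi
```

A single reading above the core count is not an incident on its own; the rule is about sustained load, so treat this as a quick spot check.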

free

Check memory usage at a glance.
free -h
#               total        used        free      shared  buff/cache   available
# Mem:          15Gi        4.2Gi       5.1Gi       200Mi       6.1Gi       10Gi
# Swap:         2.0Gi          0B       2.0Gi
The most important column is available (not “free”). Available = free + reclaimable cache. That is the real amount of memory your applications can use. If “available” is low and “Swap used” is non-zero, your system is under memory pressure.
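The same "available" figure can be read directly from /proc/meminfo, which is handy in scripts. The sketch below assumes a kernel that reports MemAvailable (3.14 or later); the function name mem_available_pct is made up for this example.

```shell
# mem_available_pct [FILE]: print MemAvailable as a whole-number percentage
# of MemTotal. FILE defaults to /proc/meminfo (hypothetical helper).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "available memory: $(mem_available_pct)%"
fi
```

What counts as "low" depends on the workload, so any alert threshold built on this number is a judgment call rather than a fixed rule.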

df and du - Disk Space

# Check disk space on all mounted filesystems
df -h
# Look for Use% approaching 100% -- services crash when disks fill up
# Common culprits: logs, temp files, docker images

# Find what is consuming space in a specific directory
sudo du -sh /var/log/* | sort -rh | head -n 10
# Shows the 10 largest items under /var/log, sorted by size
A full disk is a silent killer. Databases refuse to write, logs stop recording, services crash with cryptic errors. Set up alerts at 80% disk usage so you have time to act. The most common cause is log files growing unbounded — always configure log rotation.
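An 80% alert can start life as a small filter over df output. This is a sketch, not a replacement for a real monitoring system: disks_over is a hypothetical helper, it relies on the fixed column layout of df -P (POSIX format), and it assumes mount points contain no spaces.

```shell
# disks_over [LIMIT]: read `df -P` output on stdin and print every mount
# point whose usage is at or above LIMIT percent (default 80).
disks_over() {
    awk -v limit="${1:-80}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                  # "83%" -> "83"
        if (pct + 0 >= limit + 0) print $6, pct "%"
    }'
}

df -P | disks_over 80
```

Run from cron or a systemd timer, the output could be mailed or posted to a chat channel whenever it is non-empty.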

2. Resource Analysis

vmstat (Virtual Memory Statistics)

vmstat gives you a one-line summary of CPU, memory, I/O, and scheduling, updated at whatever interval you choose. It is the fastest way to determine whether your bottleneck is CPU, memory, or disk. The first line of output shows averages since boot, so base your judgment on the later samples.
vmstat 1 5  # Update every 1 second, show 5 samples
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0 5242880 163840 4194304  0    0     8    24  150  312 15  3 80  2  0
The columns that tell the story:
  • r (runnable): Processes running or waiting for CPU time. If consistently higher than your CPU core count, you are CPU-bound.
  • b (blocked): Processes in uninterruptible sleep (usually waiting for I/O). If this is consistently above zero, you have an I/O bottleneck.
  • si/so (swap in/out): Any non-zero value means the system is swapping — actively moving memory pages between RAM and disk. This is an emergency for performance-sensitive workloads.
  • wa (wait): Percentage of CPU time spent idle while waiting for I/O to complete. Above 10-20% usually indicates a storage bottleneck (local disk or a network filesystem such as NFS).
  • us/sy/id: User CPU, system CPU, and idle. High sy (system) can indicate too many context switches or heavy I/O.
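The column rules above can be condensed into a rough classifier for a single vmstat sample. This is a sketch under stated assumptions: classify_vmstat is a hypothetical helper, the field positions match the header shown earlier (r=1, b=2, si=7, so=8, wa=16), and the 10% wa cut-off is simply the lower end of the range quoted above.

```shell
# classify_vmstat CORES: read one vmstat data line on stdin and print a
# rough verdict: swapping, io-bound, cpu-bound, or ok.
classify_vmstat() {
    awk -v cores="$1" '{
        if ($7 + $8 > 0)              print "swapping"
        else if ($16 >= 10 || $2 > 0) print "io-bound"
        else if ($1 > cores)          print "cpu-bound"
        else                          print "ok"
    }'
}

# The sample line from the output above, on a 4-core machine
echo ' 2  0      0 5242880 163840 4194304  0    0     8    24  150  312 15  3 80  2  0' | classify_vmstat 4
```

On a live system, something like `vmstat 1 2 | tail -n 1 | classify_vmstat "$(nproc)"` feeds it the second sample and skips the since-boot averages. One sample is only a hint; the bullets above all say "consistently" for a reason.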

iostat (Input/Output Statistics)

When vmstat suggests an I/O bottleneck, iostat tells you which disk and how severe.
# Install if not present
sudo apt install sysstat

# Show extended stats, skip idle devices, update every 1 second
iostat -xz 1
# Device  r/s    w/s    rkB/s  wkB/s  await  %util
# sda     150.0  80.0   4800   3200   12.5   95.2
Key columns:
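  • await: Average time (ms) an I/O request takes, including time spent waiting in the queue. Recent sysstat versions report it separately for reads and writes as r_await and w_await. If this is climbing into the hundreds, your disk cannot keep up.
  • %util: Percentage of time the device had at least one request in flight. Sustained values above 80% suggest the disk is nearing saturation; close to 100%, new I/O requests queue up and wait times spike. On SSDs, NVMe drives, and RAID arrays, which serve requests in parallel, %util can read high well before the device is truly saturated, so check await as well.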
  • r/s and w/s: Reads and writes per second — helps you understand the I/O pattern (is it read-heavy or write-heavy?).
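To pick the busy device out of iostat output in a script, the %util column can be located by its header name rather than by position, since the column order differs between sysstat versions. A minimal sketch; busy_disks is a hypothetical helper and the default 80% limit is taken from the rule of thumb above.

```shell
# busy_disks [LIMIT]: read `iostat -x` output on stdin and print
# "<device> <%util>" for every device at or above LIMIT (default 80).
busy_disks() {
    awk -v limit="${1:-80}" '
        NF == 0        { col = 0; next }   # blank line ends a report section
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}

# The sample output from above, plus a second, idle device
printf 'Device  r/s    w/s    rkB/s  wkB/s  await  %%util\nsda     150.0  80.0   4800   3200   12.5   95.2\nsdb     1.0    1.0    10     10     0.5    3.1\n' | busy_disks 80
```

On a live system this could be fed with `iostat -xz 1 2 | busy_disks`, keeping in mind that the first report covers the time since boot.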

sar - Historical Data

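Unlike top and vmstat, which show the present, sar displays historical performance data. The sysstat package collects samples in the background at a fixed interval (10 minutes by default), which lets you look back at what happened during last night’s incident. On Debian and Ubuntu, collection may need to be switched on first (for example, ENABLED="true" in /etc/default/sysstat).
# CPU usage for today, one row per collection interval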
sar -u

# Memory usage history
sar -r

# Disk I/O history
sar -d

# Network traffic history
sar -n DEV

# Look at data from a specific date
sar -u -f /var/log/sysstat/sa15  # Data from the 15th of the month (RHEL/CentOS keeps these files in /var/log/sa/)
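For "last night's incident" it helps to build the data file name from a date instead of typing the day number by hand. This sketch assumes GNU date and the saDD naming shown above (some configurations use a longer date in the file name); sa_file_for and SA_DIR are names invented for this example.

```shell
# SA_DIR is an assumption: /var/log/sysstat on Debian/Ubuntu, /var/log/sa on RHEL
SA_DIR="${SA_DIR:-/var/log/sysstat}"

# sa_file_for DATE: print the sysstat data file for DATE (any string GNU date accepts)
sa_file_for() {
    printf '%s/sa%s\n' "$SA_DIR" "$(date -d "$1" +%d)"
}

sa_file_for 2024-03-15
# Typical use, narrowing to a time window with sar's -s (start) and -e (end) options:
#   sar -u -f "$(sa_file_for yesterday)" -s 02:00:00 -e 04:00:00
```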

3. Network Monitoring

ss - Socket Statistics

# List all listening ports with the owning process
ss -tulpn
# -t = TCP, -u = UDP, -l = listening, -p = show process, -n = numeric

# Count connections by state (useful for detecting connection leaks)
ss -s

# Show all established connections to a specific port
ss -tn state established dport = :80

# Watch for TIME_WAIT buildup (a sign of short-lived connection churn)
# ss prints the state as TIME-WAIT, with a hyphen
ss -tan | grep -c TIME-WAIT
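Counting every TCP state at once gives more context than a single TIME-WAIT number. A small sketch, assuming the state is the first column of ss -tan output (as it is when no state filter is applied); count_tcp_states is a hypothetical helper.

```shell
# count_tcp_states: read `ss -tan` output on stdin and print "<count> <state>",
# most frequent state first.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

if command -v ss >/dev/null 2>&1; then
    ss -tan | count_tcp_states
fi
```

A count that keeps growing in one state between runs (for example CLOSE-WAIT) is usually more telling than any single snapshot.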

iftop - Bandwidth by Connection

# Install and run (requires root)
sudo apt install iftop
sudo iftop -i eth0   # Replace eth0 with your interface name (list them with: ip -br link)
# Shows real-time bandwidth usage per connection
# Useful for spotting which host or service is consuming your network bandwidth

Quick Network Diagnostics

# Check packet loss and latency to a host
ping -c 10 api.example.com
# Watch for: packet loss > 0% (network congestion or routing issues),
# high variance in round-trip time (jitter, common with saturated links)

# Measure actual TCP connection time (not just ICMP like ping)
# This is far more useful than ping because it tests the real path your app takes
curl -o /dev/null -s -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" https://api.example.com
# Interpretation (each value is cumulative, measured from the start of the request):
# - High DNS time? Your nameserver is slow or unreachable. Check /etc/resolv.conf.
# - Large gap between DNS and Connect? Network path is congested or the server is overloaded.
# - Large gap between Connect and TLS? Certificate chain is large, or the server CPU is maxed on handshakes.
# - Total much higher than TLS? The server is slow to respond (application bottleneck).
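curl's time_* variables are all measured from the start of the request, so the duration of each phase comes from subtracting neighbouring values. The sketch below does that arithmetic; phase_times is a hypothetical helper, and it assumes an HTTPS request (for plain HTTP, time_appconnect is 0 and the TLS figure is meaningless).

```shell
# phase_times NAMELOOKUP CONNECT APPCONNECT TOTAL
# Turn curl's cumulative timings (seconds) into per-phase durations.
phase_times() {
    awk -v dns="$1" -v conn="$2" -v tls="$3" -v total="$4" 'BEGIN {
        printf "dns=%.3f tcp=%.3f tls=%.3f rest=%.3f\n",
               dns, conn - dns, tls - conn, total - tls
    }'
}

# Worked example with made-up timings
phase_times 0.010 0.030 0.090 0.250
# Live use (api.example.com is the placeholder host from the example above):
#   phase_times $(curl -o /dev/null -s -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_total}' https://api.example.com)
```

Here "rest" is everything after the TLS handshake: sending the request, server processing, and downloading the response.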
Production gotcha: A large number of connections in TIME_WAIT state (thousands or more) is a common problem on high-traffic servers. TIME_WAIT itself is normal: it is TCP’s way of ensuring late-arriving packets are not confused with a new connection. But too many can exhaust ephemeral ports. If you see this, consider setting net.ipv4.tcp_tw_reuse=1 (it only affects outbound connections) or using connection pooling in your application. Do not reach for tcp_tw_recycle: it broke connections from clients behind NAT and was removed in Linux 4.12.

4. Log Management

Logs are the source of truth for what happened and when. When monitoring tells you something is wrong, logs tell you what is wrong.

journalctl (Systemd Logs)

Systemd captures stdout and stderr from every service it manages. journalctl is how you search and filter those logs.
# View all logs (latest at the bottom)
journalctl

# View logs for a specific service
journalctl -u nginx

# Follow logs in real time (essential during deploys and incident response)
journalctl -u nginx -f

# View logs from today only (filters out noise from previous days)
journalctl --since "today"

# View logs from a specific time range (great for post-incident investigation)
journalctl --since "2024-03-15 14:00" --until "2024-03-15 15:00"

# Show only error-level messages across all services
journalctl -p err

# Show logs from the current boot only (use -b -1 for the previous boot)
journalctl -b

# Show how much disk space logs are consuming
journalctl --disk-usage

/var/log - Traditional Log Files

Not everything uses journald. Many applications write to files in /var/log directly.
# Key log files to know:
/var/log/syslog        # General system logs (Ubuntu/Debian)
/var/log/messages      # General system logs (RHEL/CentOS)
/var/log/auth.log      # Authentication: SSH logins, sudo usage, failed attempts
/var/log/kern.log      # Kernel messages (hardware errors, driver issues)
/var/log/nginx/access.log  # Web server access logs
/var/log/nginx/error.log   # Web server errors

# Follow a log file in real time
tail -f /var/log/syslog

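# Search for errors in the last 1000 lines
tail -n 1000 /var/log/syslog | grep -i error

# Count error lines by originating program (field 5 in the traditional
# "Mon DD HH:MM:SS host program[pid]:" syslog format; adjust the field for other timestamp formats)
grep -i error /var/log/syslog | awk '{print $5}' | sort | uniq -c | sort -rn | head -n 10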

Log Rotation

Without rotation, log files grow until they fill the disk. logrotate handles this automatically.
# View the logrotate configuration
cat /etc/logrotate.conf

# Example rotation config for an application
# /etc/logrotate.d/myapp
# (comments sit on their own lines: logrotate only documents "#" as a comment
# marker when it is the first non-whitespace character on a line)
/var/log/myapp/*.log {
    # Rotate daily
    daily
    # Keep 14 days of history
    rotate 14
    # Gzip old logs to save space
    compress
    # Do not error if log file is missing
    missingok
    # Do not rotate if the file is empty
    notifempty
    # Run this command after rotating
    postrotate
        systemctl reload myapp
    endscript
}

5. The Performance Investigation Workflow

When something is slow, follow this systematic approach instead of guessing:
# Step 1: What is the load? (Is the system under stress?)
uptime
# 14:30  up 45 days, load average: 12.50, 8.30, 4.20
# Values are 1, 5, and 15 minute averages, so read right to left:
# load is climbing (4 -> 8 -> 12) -- something is getting worse right now

# Step 2: CPU or I/O? (Where is the bottleneck?)
vmstat 1 5
# High 'r' + low 'wa' = CPU bound
# Low 'r' + high 'wa' or 'b' = I/O bound
# High 'si/so' = Memory pressure, swapping

# Step 3: What process is responsible?
htop   # Sort by CPU (press P) or memory (press M) to find the culprit

# Step 4: Dig deeper into the offending process
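sudo strace -c -p PID   # Which system calls dominate? (I/O heavy? Lock contention?) Summary prints on Ctrl-C
                        # strace slows the traced process noticeably, so attach only briefly in production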
lsof -p PID        # What files/sockets does it have open?

# Step 5: Check disk I/O if indicated
iostat -xz 1       # Which disk is saturated?

# Step 6: Check network if indicated
ss -s              # Connection count and states
sudo iftop         # Bandwidth by connection (requires root)
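Step 1 reads the trend by eye; the same comparison can be scripted. A minimal sketch: load_trend is a hypothetical helper, and the 20% tolerance band around the 15-minute average is an arbitrary assumption, not an established threshold.

```shell
# load_trend ONE FIVE FIFTEEN: compare the 1-minute load average with the
# 15-minute one and print rising, falling, or steady.
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one > fifteen * 1.2)      print "rising"
        else if (one < fifteen * 0.8) print "falling"
        else                          print "steady"
    }'
}

# The values from the uptime example in Step 1
load_trend 12.50 8.30 4.20
# Live use on Linux: load_trend $(cut -d ' ' -f 1-3 /proc/loadavg)
```

"rising" suggests the problem is still developing, which is a reason to move straight on to Step 2; "falling" suggests the spike may already be over and the logs are the better place to look.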

Key Takeaways

  • Use htop for a quick interactive overview of CPU, memory, and processes
  • Use vmstat to determine whether your bottleneck is CPU, memory, or I/O — it is the fastest single-command diagnosis
  • Use iostat when vmstat points to an I/O problem, to identify the saturated disk
  • Use free -h and look at the “available” column, not “free” — cached memory is available memory
  • Use journalctl to search and filter systemd service logs, especially during incidents
  • A full disk silently breaks everything — monitor disk usage and configure log rotation
  • Follow a systematic workflow (load, CPU/IO split, process, deep dive) instead of guessing
