# System Monitoring

> Monitoring system performance and logs

# Linux System Monitoring

As a DevOps engineer, you need to know what is happening on your servers at all times. When a page loads slowly, when an API starts timing out, when a deploy goes sideways -- the first question is always "what is the server doing right now?" Monitoring tools answer that question. Think of them as the dashboard gauges in a car: you can drive without looking, but you will not see the engine overheating until it is too late.

***

## 1. Real-Time Monitoring

### `top` / `htop`

The task manager of Linux. Shows CPU, memory, and running processes in real time.

```bash theme={null}
# htop is the version you want -- color-coded, scrollable, and mouse-friendly
sudo apt install htop
htop
```

**Key metrics to watch in htop**:

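* **Load Average**: Three numbers showing system load over 1, 5, and 15 minutes. The rule of thumb: if load average stays above the number of CPU cores, the system is overloaded. A 4-core machine with load 6.0 has tasks waiting in line. On Linux the figure counts tasks in uninterruptible sleep (usually blocked on disk I/O) as well as tasks running or waiting for CPU, so a high load does not always mean the CPU is the problem.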
* **CPU bars**: Green is user-space processes (your apps). Red is kernel work (system calls, I/O). Blue is low-priority (nice) processes. If you see solid red bars, your system is spending too much time in the kernel, often a sign of heavy I/O or too many context switches.
* **Memory**: Green is actively used. Yellow/blue is cache and buffers -- this is memory that Linux is using intelligently to speed up disk reads, and most of it can be reclaimed when applications need it. Do not panic if "used" looks high -- look at the "available" figure reported by `free -h` instead.

<Tip>
  **Practical interpretation**: High load + low CPU usage = I/O bottleneck (processes are blocked waiting on disk or network storage). High load + high CPU usage = CPU bottleneck (you need more compute). High memory use with low "available" = you are actually running out of RAM and may start swapping.
</Tip>
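The load-versus-cores rule of thumb can be checked from a script. The following is a minimal sketch, assuming a Linux host that exposes `/proc/loadavg` and has `nproc` installed; `load_per_core` is a hypothetical helper name, not a standard tool.

```bash
#!/usr/bin/env bash
# Sketch: normalize a load average by the number of CPU cores.
# load_per_core is a hypothetical helper name used for illustration.

load_per_core() {
    local load="$1" cores="$2"
    # awk does the floating-point division that bash arithmetic cannot
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# On a live Linux host, the first field of /proc/loadavg is the 1-minute load
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min _ < /proc/loadavg
    echo "load per core: $(load_per_core "$one_min" "$(nproc)")"
fi
```

A result that stays above 1.00 matches the "load exceeds core count" condition described above; whether that points at CPU or I/O still needs the follow-up checks in this guide.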

### `free`

Check memory usage at a glance.

```bash theme={null}
free -h
#               total        used        free      shared  buff/cache   available
# Mem:          15Gi        4.2Gi       5.1Gi       200Mi       6.1Gi       10Gi
# Swap:         2.0Gi          0B       2.0Gi
```

The most important column is **available** (not "free"). Available is the kernel's estimate of how much memory new applications can claim without swapping: roughly free memory plus reclaimable cache. If "available" is low and swap "used" is non-zero, your system is under memory pressure.
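To track the "available" figure from a script, one option is to read it straight from `/proc/meminfo`, which is where `free` gets its numbers. This is a sketch under the assumption that the kernel exposes a `MemAvailable` line (present on modern Linux kernels); `mem_available_pct` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: report available memory as a percentage of total memory.
# mem_available_pct is a hypothetical helper; it reads a meminfo-format file.

mem_available_pct() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) { printf "%d\n", avail * 100 / total }
            else           { exit 1 }
        }
    ' "$meminfo"
}

# On a live Linux host
if [ -r /proc/meminfo ]; then
    echo "available memory: $(mem_available_pct)%"
fi
```

The function takes an optional file path so it can be pointed at a saved copy of `/proc/meminfo` from an incident.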

### `df` and `du` - Disk Space

```bash theme={null}
# Check disk space on all mounted filesystems
df -h
# Look for Use% approaching 100% -- services crash when disks fill up
# Common culprits: logs, temp files, docker images

# Find what is consuming space in a specific directory
sudo du -sh /var/log/* | sort -rh | head -n 10
# Shows the 10 largest items under /var/log, sorted by size
```

<Warning>
  **A full disk is a silent killer.** Databases refuse to write, logs stop recording, services crash with cryptic errors. Set up alerts at 80% disk usage so you have time to act. The most common cause is log files growing unbounded -- always configure log rotation.
</Warning>
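The 80% alert threshold can be prototyped with a few lines of shell before wiring it into a real alerting system. This is a minimal sketch; `disks_over` is a hypothetical helper name, and it assumes mount points without spaces, since it splits `df -P` output on whitespace.

```bash
#!/usr/bin/env bash
# Sketch: list filesystems at or above a usage threshold.
# disks_over is a hypothetical helper; it reads `df -P` output on stdin.

disks_over() {
    local threshold="${1:-80}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header row
            use = $5                  # Capacity column, e.g. "91%"
            sub(/%$/, "", use)
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}

# Usage on a live host:
#   df -P | disks_over 80
```

Run from cron, a non-empty result could trigger a notification; the `-P` flag keeps each filesystem on a single line so the column positions stay predictable.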

***

## 2. Resource Analysis

### `vmstat` (Virtual Memory Statistics)

vmstat gives you a one-line summary of CPU, memory, I/O, and scheduling, updated at whatever interval you choose. It is the fastest way to determine whether your bottleneck is CPU, memory, or disk.

```bash theme={null}
vmstat 1 5  # Update every 1 second, show 5 samples (the first sample is the average since boot)
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0 5242880 163840 4194304  0    0     8    24  150  312 15  3 80  2  0
```

**The columns that tell the story**:

* **r** (runnable): Processes running or waiting for CPU time. If consistently higher than your CPU core count, you are CPU-bound.
* **b** (blocked): Processes in uninterruptible sleep (usually waiting for I/O). If this is consistently above zero, you have an I/O bottleneck.
* **si/so** (swap in/out): Any non-zero value means the system is swapping -- actively moving memory pages between RAM and disk. This is an emergency for performance-sensitive workloads.
* **wa** (wait): Percentage of time the CPU sat idle while I/O was outstanding. Above 10-20% usually indicates a bottleneck on disk or on network storage such as NFS. Ordinary socket traffic does not show up here.
* **us/sy/id**: User CPU, system CPU, and idle. High `sy` (system) can indicate too many context switches or heavy I/O.
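The column rules above can be turned into a rough first-pass classifier. The sketch below is an illustration, not a definitive diagnostic: `classify_vmstat` is a hypothetical helper, and the cut-offs (average `b` of at least 1, average `wa` of at least 20) are assumptions taken from the rules of thumb in the list. It looks columns up by header name and discards the first sample, which reports averages since boot.

```bash
#!/usr/bin/env bash
# Sketch: classify a bottleneck from `vmstat 1 N` output read on stdin.
# classify_vmstat is a hypothetical helper; thresholds are rule-of-thumb assumptions.

classify_vmstat() {
    local cores="${1:-1}"
    awk -v cores="$cores" '
        # Column header row: map each column name to its field index
        $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Skip the banner row and anything else that is not a numeric sample
        $1 !~ /^[0-9]+$/ { next }
        # The first sample is the average since boot, so discard it
        ++seen == 1 { next }
        {
            n++
            r    += $(col["r"])
            b    += $(col["b"])
            swap += $(col["si"]) + $(col["so"])
            wa   += $(col["wa"])
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            if (swap > 0)                        { print "memory pressure (swapping)" }
            else if (b / n >= 1 || wa / n >= 20) { print "I/O bound" }
            else if (r / n > cores)              { print "CPU bound" }
            else                                 { print "no obvious bottleneck" }
        }
    '
}

# Usage on a live host:
#   vmstat 1 5 | classify_vmstat "$(nproc)"
```

Swapping is checked first because it tends to inflate both `b` and `wa`, which would otherwise be misread as a plain disk problem.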

### `iostat` (Input/Output Statistics)

When vmstat suggests an I/O bottleneck, iostat tells you which disk and how severe.

```bash theme={null}
# Install if not present
sudo apt install sysstat

# Show extended stats, skip idle devices, update every 1 second
iostat -xz 1
# Device  r/s    w/s    rkB/s  wkB/s  await  %util
# sda     150.0  80.0   4800   3200   12.5   95.2
```

**Key columns**:

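* **await**: Average time (ms) an I/O request takes, including time spent waiting in the queue. Newer sysstat versions split this into `r_await` and `w_await`. If this is climbing into the hundreds, your disk cannot keep up.
* **%util**: Percentage of time the device had at least one request in flight. On a single spinning disk, values above roughly 80% suggest saturation, and near 100% new requests queue up and wait times spike. SSDs, NVMe drives, and RAID arrays serve requests in parallel, so they can show high %util while still having headroom; rely on await and queue size for those.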
* **r/s and w/s**: Reads and writes per second -- helps you understand the I/O pattern (is it read-heavy or write-heavy?).
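Because the column layout of `iostat -x` differs between sysstat versions, a script that watches `%util` is safer looking the column up by name than by position. A minimal sketch, where `busy_disks` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: print devices whose %util is at or above a threshold.
# busy_disks is a hypothetical helper; it reads `iostat -x` output on stdin.

busy_disks() {
    local threshold="${1:-80}"
    awk -v limit="$threshold" '
        # A blank line ends a report, so forget the column position
        NF == 0 { util = 0; next }
        # Device header row: locate the %util column by name
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") util = i
            next
        }
        # Device rows: print the ones at or above the threshold
        util && $util + 0 >= limit + 0 { print $1, $util }
    '
}

# Usage on a live host (LC_ALL=C keeps the decimal separator a dot):
#   LC_ALL=C iostat -xz 1 3 | busy_disks 80
```

Resetting the column index on blank lines keeps the `avg-cpu` rows, which precede each device table, from being mistaken for devices.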

### `sar` - Historical Data

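Unlike top and vmstat, which show the present, `sar` displays historical performance data. The sysstat package collects samples in the background on a schedule (every 10 minutes by default), which lets you look back at what happened during last night's incident. Collection may need to be enabled first; on Debian and Ubuntu this is typically done by setting `ENABLED="true"` in `/etc/default/sysstat`.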

```bash theme={null}
# CPU usage for today, one row per collection interval
sar -u

# Memory usage history
sar -r

# Disk I/O history
sar -d

# Network traffic history
sar -n DEV

# Look at data from a specific date
sar -u -f /var/log/sysstat/sa15  # Data from the 15th of the month (RHEL/CentOS keep these files in /var/log/sa/)
```

***

## 3. Network Monitoring

### `ss` - Socket Statistics

```bash theme={null}
# List all listening ports with the owning process
ss -tulpn
# -t = TCP, -u = UDP, -l = listening, -p = show process, -n = numeric

# Count connections by state (useful for detecting connection leaks)
ss -s

# Show established connections whose remote (destination) port is 80
# Use sport instead of dport to see clients connected to a local listener
ss -tn state established dport = :80

# Watch for TIME_WAIT buildup (a sign of short-lived connection churn)
ss -tan | grep -c TIME-WAIT
```

### `iftop` - Bandwidth by Connection

```bash theme={null}
# Install and run (requires root)
sudo apt install iftop
sudo iftop -i eth0   # Replace eth0 with your interface name (list them with: ip -br link)
# Shows real-time bandwidth usage per connection
# Useful for spotting which host or service is consuming your network bandwidth
```

### Quick Network Diagnostics

```bash theme={null}
# Check packet loss and latency to a host
ping -c 10 api.example.com
# Watch for: packet loss > 0% (network congestion or routing issues),
# high variance in round-trip time (jitter, common with saturated links)

# Measure actual TCP connection time (not just ICMP like ping)
# This is far more useful than ping because it tests the real path your app takes
curl -o /dev/null -s -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" https://api.example.com
# Interpretation (each value is cumulative, measured from the start of the request):
# - High DNS time? Your nameserver is slow or unreachable. Check /etc/resolv.conf.
# - Big gap between DNS and Connect? Network path is congested or the server is overloaded.
# - Big gap between Connect and TLS? Certificate chain is large, or the server CPU is maxed on handshakes.
# - Total much higher than TLS? The server is slow to respond (application bottleneck) or the response is large.
```
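curl's `-w` timers are cumulative: each one counts from the start of the request, so the time spent in a single phase is the difference between two neighbouring values. The sketch below does that subtraction. `phase_breakdown` is a hypothetical helper name, and it assumes an HTTPS request, because `time_appconnect` is zero for plain HTTP.

```bash
#!/usr/bin/env bash
# Sketch: turn curl's cumulative timers into per-phase durations.
# phase_breakdown is a hypothetical helper. Arguments, in seconds:
#   time_namelookup time_connect time_appconnect time_starttransfer

phase_breakdown() {
    awk -v dns="$1" -v tcp="$2" -v tls="$3" -v ttfb="$4" 'BEGIN {
        printf "dns=%.3f tcp=%.3f tls=%.3f server=%.3f\n",
               dns, tcp - dns, tls - tcp, ttfb - tls
    }'
}

# Usage on a live host (word splitting of the curl output is intentional):
#   phase_breakdown $(curl -o /dev/null -s \
#       -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer}' \
#       https://api.example.com)
```

Here `server` is the gap between the end of the TLS handshake and the first response byte, which is the closest of these numbers to pure application latency.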

<Tip>
  **Production gotcha**: A large number of connections in TIME\_WAIT state (thousands or more) is a common problem on high-traffic servers. TIME\_WAIT is normal -- it is TCP's way of ensuring late-arriving packets are not confused with a new connection. But too many can exhaust ephemeral ports on a host that opens many outbound connections (a proxy or an API client, for example). If you see this, consider setting `net.ipv4.tcp_tw_reuse=1`, which only affects outgoing connections, or using connection pooling in your application. Do not rely on `tcp_tw_recycle` -- it broke connections behind NAT and was removed in Linux 4.12.
</Tip>
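To see how TIME\_WAIT compares with the other states, the connection list can be grouped by state. A minimal sketch, where `count_tcp_states` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: count TCP connections per state, most common first.
# count_tcp_states is a hypothetical helper; it reads `ss -tan` output on stdin.

count_tcp_states() {
    # Column 1 of `ss -tan` is the state; NR > 1 skips the header row
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' |
        sort -rn
}

# Usage on a live host:
#   ss -tan | count_tcp_states
```

Comparing the TIME-WAIT count against the range in `/proc/sys/net/ipv4/ip_local_port_range` gives a rough sense of how close a client-heavy host is to running out of ephemeral ports.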

***

## 4. Log Management

Logs are the source of truth for what happened and when. When monitoring tells you *something* is wrong, logs tell you *what* is wrong.

### `journalctl` (Systemd Logs)

Systemd captures stdout and stderr from every service it manages. journalctl is how you search and filter those logs.

```bash theme={null}
# View all logs (latest at the bottom)
journalctl

# View logs for a specific service
journalctl -u nginx

# Follow logs in real time (essential during deploys and incident response)
journalctl -u nginx -f

# View logs from today only (filters out noise from previous days)
journalctl --since "today"

# View logs from a specific time range (great for post-incident investigation)
journalctl --since "2024-03-15 14:00" --until "2024-03-15 15:00"

# Show only messages at error priority or worse, across all services
journalctl -p err

# Show logs from the current boot only (use -b -1 for the previous boot)
journalctl -b

# Show how much disk space logs are consuming
journalctl --disk-usage
```

### `/var/log` - Traditional Log Files

Not everything uses journald. Many applications write to files in `/var/log` directly.

```bash theme={null}
# Key log files to know:
/var/log/syslog        # General system logs (Ubuntu/Debian)
/var/log/messages      # General system logs (RHEL/CentOS)
/var/log/auth.log      # Authentication: SSH logins, sudo usage, failed attempts (/var/log/secure on RHEL/CentOS)
/var/log/kern.log      # Kernel messages (hardware errors, driver issues)
/var/log/nginx/access.log  # Web server access logs
/var/log/nginx/error.log   # Web server errors

# Follow a log file in real time
tail -f /var/log/syslog

# Search for errors in the last 1000 lines
tail -n 1000 /var/log/syslog | grep -i error

# Count error occurrences by program (field 5 in the traditional syslog line format)
grep -i error /var/log/syslog | awk '{print $5}' | sort | uniq -c | sort -rn | head -n 10
```
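Field 5 of a traditional syslog line usually carries a process ID, as in `sshd[1234]:`, so counting it raw splits one program across many rows. The sketch below strips the PID before counting. `errors_by_program` is a hypothetical helper name, and it assumes the classic `Mon DD HH:MM:SS host program[pid]: message` layout; hosts configured for ISO timestamps put the program tag in a different field.

```bash
#!/usr/bin/env bash
# Sketch: count error lines per program, ignoring PIDs.
# errors_by_program is a hypothetical helper; it reads syslog lines on stdin
# and assumes the traditional "Mon DD HH:MM:SS host program[pid]: msg" layout.

errors_by_program() {
    grep -i 'error' |
        awk '{ tag = $5; sub(/(\[[0-9]+\])?:$/, "", tag); print tag }' |
        sort | uniq -c | sort -rn
}

# Usage on a live host:
#   errors_by_program < /var/log/syslog | head -n 10
```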

### Log Rotation

Without rotation, log files grow until they fill the disk. `logrotate` handles this automatically.

```bash theme={null}
# View the logrotate configuration
cat /etc/logrotate.conf

# Example rotation config for an application
# /etc/logrotate.d/myapp
# Comments are kept on their own lines, the only placement the logrotate man page documents
/var/log/myapp/*.log {
    # Rotate daily and keep 14 days of history
    daily
    rotate 14
    # Gzip old logs to save space
    compress
    # Do not error if the log file is missing
    missingok
    # Do not rotate if the file is empty
    notifempty
    # Run this command after rotating
    postrotate
        systemctl reload myapp
    endscript
}
```

***

## 5. The Performance Investigation Workflow

When something is slow, follow this systematic approach instead of guessing:

```bash theme={null}
# Step 1: What is the load? (Is the system under stress?)
uptime
# 14:30  up 45 days, load average: 12.50, 8.30, 4.20
# Load is climbing (4 -> 8 -> 12) -- something is getting worse right now

# Step 2: CPU or I/O? (Where is the bottleneck?)
vmstat 1 5
# High 'r' + low 'wa' = CPU bound
# Low 'r' + high 'wa' or 'b' = I/O bound
# High 'si/so' = Memory pressure, swapping

# Step 3: What process is responsible?
htop   # Sort by CPU (press P) or memory (press M) to find the culprit

# Step 4: Dig deeper into the offending process (replace PID with the real process ID)
sudo strace -p PID -c   # Syscall summary, printed when you press Ctrl-C (I/O heavy? Lock contention?); slows the process while attached
sudo lsof -p PID        # What files/sockets does it have open?

# Step 5: Check disk I/O if indicated
iostat -xz 1       # Which disk is saturated?

# Step 6: Check network if indicated
ss -s              # Connection count and states
sudo iftop -i eth0 # Bandwidth by connection (substitute your interface name)
```
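During an incident it helps to capture the first few steps in one pass so the output can be attached to a ticket. The following is a minimal sketch, not a replacement for proper monitoring: `run_step` and `triage_snapshot` are hypothetical helper names, and any tool that is not installed is skipped instead of aborting the run.

```bash
#!/usr/bin/env bash
# Sketch: capture the non-interactive steps of the workflow in one pass.
# run_step and triage_snapshot are hypothetical helper names.

run_step() {
    # run_step LABEL COMMAND [ARGS...]: run the command if it exists, else note the gap
    local label="$1"; shift
    echo "=== $label ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(skipped: $1 not installed)"
    fi
    echo
}

triage_snapshot() {
    run_step "load"       uptime
    run_step "cpu vs io"  vmstat 1 5
    run_step "memory"     free -h
    run_step "disk space" df -h
    run_step "disk io"    iostat -xz 1 3
    run_step "sockets"    ss -s
}

# Usage on a live host:
#   triage_snapshot > "/tmp/triage-$(date +%Y%m%d-%H%M%S).txt"
```

Interactive tools such as htop and iftop are left out on purpose, since they do not produce useful output when redirected to a file.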

***

## Key Takeaways

* Use **htop** for a quick interactive overview of CPU, memory, and processes
* Use **vmstat** to determine whether your bottleneck is CPU, memory, or I/O -- it is the fastest single-command diagnosis
* Use **iostat** when vmstat points to an I/O problem, to identify the saturated disk
* Use **free -h** and look at the "available" column, not "free" -- cached memory is available memory
* Use **journalctl** to search and filter systemd service logs, especially during incidents
* A full disk silently breaks everything -- monitor disk usage and configure log rotation
* Follow a systematic workflow (load, CPU/IO split, process, deep dive) instead of guessing

***

Next: [Security Hardening →](/courses/devops-tools/linux-security)
