Linux Troubleshooting and Debugging Guide

29th Mar 2026
Tags: linux, sre, devops, debugging, troubleshooting, sysadmin

Most Linux problems are not mysterious. They are misconfigurations, resource exhaustion, or broken dependencies -- and they leave evidence in logs, metrics, and system state. This guide gives you the tools and mental model to find that evidence systematically.

The troubleshooting loop:

OBSERVE   -- what is the current state?
EVENTS    -- what changed? (logs, dmesg, journal)
DIAGNOSE  -- form a hypothesis
TEST      -- verify it (commands, isolation)
RESOLVE   -- apply fix
VERIFY    -- confirm it worked
DOCUMENT  -- write it down

System Monitoring Commands

These are your first tools every time something feels wrong.

top and htop

# Real-time process monitor
top
# Inside top:
# P = sort by CPU
# M = sort by memory
# 1 = toggle per-core CPU view
# k = kill process by PID
# q = quit

# Better version with colors and mouse support
htop
# F6 = sort by column
# F9 = kill process
# F5 = tree view

# Sort by CPU from CLI
ps aux --sort=-%cpu | head -20

# Sort by memory
ps aux --sort=-%mem | head -20

Load Average

uptime
# 15:23:42 up 12 days, 3:41, 2 users, load average: 1.23, 0.98, 0.87
#                                                     1min  5min  15min
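# Rule of thumb: sustained load above the number of CPU cores = overloaded
# (Linux load also counts tasks in uninterruptible I/O wait, not just CPU demand)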
nproc              # how many CPU cores you have
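
The load-versus-cores rule is simple arithmetic, so it can be scripted. A minimal sketch, assuming the thresholds used in this guide (above 1 per core means queuing, above 2 means significant overload); `load_per_core` and `load_verdict` are hypothetical helper names, not standard tools.

```bash
# Print load divided by core count, to two decimals.
load_per_core() {
    awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f\n", l / c }'
}

# Classify a per-core ratio using this guide's rule-of-thumb thresholds.
load_verdict() {
    awk -v r="$1" 'BEGIN {
        if (r > 2)      print "overloaded"
        else if (r > 1) print "busy"
        else            print "ok"
    }'
}

# Usage on a live system (1-minute load from /proc/loadavg):
#   ratio=$(load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)")
#   echo "$ratio $(load_verdict "$ratio")"
```

With the sample `uptime` output above and 4 cores, the ratio is 1.23 / 4, well under 1.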

vmstat -- virtual memory, CPU, I/O overview

vmstat 1           # refresh every 1 second
vmstat 1 10        # 10 samples then exit
# Key columns:
# r  = processes waiting for CPU (runqueue)
# b  = processes blocked on I/O
# si = swap in (KB/s) -- sustained nonzero values mean memory pressure
# so = swap out (KB/s) -- sustained nonzero values mean memory pressure
# wa = CPU time waiting on I/O (%)
# us = user CPU %
# sy = kernel CPU %
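
Reading those columns by eye gets tedious over many samples. The sketch below is one way to flag the interesting rows; `vmstat_flags` is a hypothetical helper, and the default 30% I/O-wait threshold is an assumption you should tune. It looks columns up by header name because the column set varies between procps versions.

```bash
# vmstat_flags [WA_MAX]: read `vmstat` output on stdin, print a warning for
# every sample that shows swap activity or I/O wait above WA_MAX percent.
vmstat_flags() {
    awk -v wa_max="${1:-30}" '
        # The header row is the one that starts with the "r" and "b" columns.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data rows start with a number; the banner row does not.
        ("si" in col) && $1 ~ /^[0-9]+$/ {
            if ($(col["si"]) > 0 || $(col["so"]) > 0)
                printf "swap activity: si=%s so=%s\n", $(col["si"]), $(col["so"])
            if ($(col["wa"]) > wa_max)
                printf "high I/O wait: wa=%s%%\n", $(col["wa"])
        }
    '
}

# Usage:
#   vmstat 1 10 | vmstat_flags
# Note: the first vmstat sample is an average since boot, not a live reading.
```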

mpstat -- per-CPU statistics

mpstat -P ALL 1    # all CPUs, 1 second interval
# Shows per-core breakdown -- useful to spot single-threaded bottlenecks

CPU Issues

Diagnosing High CPU

# Step 1 -- find the offending process
top        # press P to sort by CPU
ps aux --sort=-%cpu | head -10

# Step 2 -- check if it is real CPU work or I/O wait
# In top header line:
# %Cpu(s): 12.5 us, 3.2 sy, 0.0 ni, 5.3 id, 78.9 wa
#                                              ↑ wa = I/O wait
# High wa means disk is the bottleneck, not CPU

# Step 3 -- investigate what a process is doing
strace -p <PID>            # system calls (adds overhead, use briefly)
strace -p <PID> -c         # summary of syscalls (less overhead)
lsof -p <PID>              # files the process has open
cat /proc/<PID>/status     # detailed process info
cat /proc/<PID>/cmdline | tr '\0' ' '   # exact command including args (NUL-separated)

# Step 4 -- check load vs CPU count
uptime
nproc
# If load/nproc > 2, system is significantly overloaded

Fixing High CPU

# Lower priority of a running process (nice: -20 = highest, 19 = lowest)
renice +10 <PID>                    # make less aggressive
renice -n 19 -p <PID>               # minimum priority

# Start a new command at low priority
nice -n 19 command                   # lowest priority
ionice -c 3 nice -n 19 command       # low CPU and low I/O

# Limit CPU usage to a percentage
cpulimit -p <PID> -l 50             # limit to 50% of one core

# Kill gracefully, then force only if it is still running
kill <PID>
sleep 10
kill -0 <PID> 2>/dev/null && kill -9 <PID>

# Kill all instances by name
killall process-name
pkill -f "pattern in command"

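# Check for cryptominers or malware
# (real kworker threads have PPID 2 and show in [brackets];
#  a "kworker" with a binary path or another parent is suspicious)
ps -eo pid,ppid,user,%cpu,cmd | grep -E '(xmrig|miner|kworker)' | grep -v grep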
crontab -l
sudo crontab -l
ls -la /etc/cron.*
systemctl list-unit-files --state=enabled | grep -v systemd
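
The graceful-then-force kill sequence above can be wrapped so that SIGKILL is sent only when the process ignored SIGTERM. This is a sketch, not a hardened tool; `graceful_kill` is a hypothetical name and the 10-second default grace period is an assumption.

```bash
# graceful_kill PID [GRACE_SECONDS]
# Sends SIGTERM, polls once per second for up to GRACE_SECONDS, then sends
# SIGKILL only if the process still exists.
graceful_kill() {
    pid=$1
    grace=${2:-10}

    kill "$pid" 2>/dev/null || return 0          # already gone, or not ours to signal

    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        waited=$((waited + 1))
    done

    # Still present after the grace period: force it.
    # kill -0 also succeeds for an unreaped zombie, so this may be a no-op.
    kill -9 "$pid" 2>/dev/null
    return 0
}

# Usage:
#   graceful_kill 1234 15
```

Polling with `kill -0` avoids the fixed `sleep 10` when the process exits quickly.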

Memory Issues

Diagnosing Memory Problems

# Overview
free -h
#               total   used    free   shared  buff/cache  available
# Mem:           15Gi   12Gi   500Mi     1Gi         3Gi       2.5Gi
# Swap:         8.0Gi  7.0Gi   1.0Gi
# "available" is what matters -- not "free"
# High swap usage = memory pressure = performance problem

# Top memory consumers
ps aux --sort=-%mem | head -15

# Check if OOM killer fired
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"
journalctl -xe | grep -i oom

# Watch a process for memory leaks
watch -n 2 'ps -p <PID> -o pid,rss,vsz,cmd'
# RSS (resident set size) growing steadily = likely leak

# Detailed process memory map
pmap -x <PID>
cat /proc/<PID>/smaps_rollup   # summed Rss/Pss/Swap for the whole process (kernel 4.14+)

# Check swap devices and usage
swapon --show
cat /proc/sys/vm/swappiness   # 60 = default, 10 = prefer RAM
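
To turn "RSS growing steadily" into something you can log, collect samples and summarise them. A minimal sketch, assuming Linux `/proc`; `rss_kb` and `rss_trend` are hypothetical helpers, and "growing" here only means no sample dropped below the previous one, which is a hint and not proof of a leak.

```bash
# rss_kb PID: resident set size in KiB, read from /proc (Linux only).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# rss_trend: read one RSS sample per line on stdin.
# Prints "first last delta verdict".
rss_trend() {
    awk '
        NR == 1 { first = $1 + 0; prev = first }
        ($1 + 0) < prev { dipped = 1 }
        { prev = $1 + 0; last = prev }
        END {
            if (NR == 0) exit 1
            verdict = (!dipped && last > first) ? "growing" : "stable-or-fluctuating"
            print first, last, last - first, verdict
        }
    '
}

# Usage (PID 1234 is a placeholder): five samples, two seconds apart
#   for i in 1 2 3 4 5; do rss_kb 1234; sleep 2; done | rss_trend
```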

Fixing Memory Issues

# Drop page cache (safe -- kernel rebuilds it)
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
# 1 = page cache, 2 = dentries/inodes, 3 = all

# Refresh swap (pulls swapped pages back into RAM -- needs enough free RAM to hold them, brief freeze)
sudo swapoff -a && sudo swapon -a

# Reduce swappiness -- prefer RAM over swap
sudo sysctl vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf

# Add a swap file if none exists
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Protect a critical process from OOM killer
echo -1000 | sudo tee /proc/<PID>/oom_score_adj
# -1000 = never kill, 0 = default, 1000 = kill first

# In a systemd service unit file
sudo systemctl edit service-name

[Service]
OOMScoreAdjust=-1000
MemoryMax=2G

Disk I/O Issues

Diagnosing Disk Bottlenecks

# Is it actually I/O wait?
top
# Look for high "wa" in CPU line -- anything above 30% is significant

# Which disk is saturated?
iostat -x 1 5
# Key columns:
# %util  -- time disk was busy (>80% = saturated)
# await  -- average request latency in ms, shown as r_await / w_await in newer sysstat (>20ms for SSD, >50ms HDD = slow)
# r/s    -- reads per second
# w/s    -- writes per second
# rkB/s  -- read throughput KB/s
# wkB/s  -- write throughput KB/s

# Which processes are doing the most I/O?
sudo iotop -o                  # -o = only show active processes
sudo pidstat -d 1              # per-process disk I/O stats every second

# Which files are being accessed?
sudo lsof /dev/sda1            # open files on the filesystem mounted from /dev/sda1
sudo inotifywait -m -r /path/  # watch for file system events

# Check disk health
sudo smartctl -H /dev/sda      # quick health check
sudo smartctl -a /dev/sda      # full SMART report
# Watch for: Reallocated_Sector_Ct > 0 = bad sectors
# Watch for: Current_Pending_Sector > 0 = failing sectors

# Check kernel for disk errors
dmesg | grep -i "i/o error"
dmesg | grep -i "ata.*error"
dmesg | grep -i "blk_update_request"
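
The `%util` check can be automated when you watch several disks. A sketch under stated assumptions: `iostat_busy` is a hypothetical helper, the 80% default mirrors the threshold above, and the parser finds `%util` by header name since the column layout differs across sysstat versions.

```bash
# iostat_busy [MAX]: read `iostat -x` output on stdin and print "device util"
# for every device row whose %util exceeds MAX.
iostat_busy() {
    awk -v max="${1:-80}" '
        $1 == "Device" || $1 == "Device:" {
            for (i = 1; i <= NF; i++) if ($i == "%util") ucol = i
            next
        }
        NF == 0 { ucol = 0; next }       # a blank line ends the device table
        ucol && $ucol + 0 > max + 0 { print $1, $ucol }
    '
}

# Usage:
#   iostat -x 1 5 | iostat_busy 80
# Assumes a locale with "." as the decimal separator (e.g. LC_ALL=C).
```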

Fixing Disk I/O

# Lower I/O priority of a process
sudo ionice -c 3 -p <PID>           # idle class
sudo ionice -c 2 -n 7 -p <PID>      # best effort, lowest

# In systemd service
sudo systemctl edit service-name

[Service]
IOSchedulingClass=idle
Nice=19

# Use tmpfs for temp files to avoid disk I/O
sudo mount -t tmpfs -o size=2G tmpfs /var/cache/app
# Permanent via /etc/fstab:
# tmpfs /var/cache/app tmpfs size=2G 0 0

# Tune I/O scheduler
cat /sys/block/sda/queue/scheduler

# For SSD (no scheduling needed)
echo none | sudo tee /sys/block/sda/queue/scheduler

# For HDD with mixed workloads
echo bfq | sudo tee /sys/block/sda/queue/scheduler

# Make scheduler change permanent via udev rule
echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"' \
  | sudo tee /etc/udev/rules.d/60-io-scheduler.rules

# Flush write cache (safe before disk work)
sync

Disk Space Issues

# Check disk usage by filesystem
df -h
df -i              # check inode usage -- can be "full" even with space left

# Find where space is going
sudo du -sh /* 2>/dev/null | sort -rh | head -15
sudo du -sh /var/log/* 2>/dev/null | sort -rh | head -10
sudo du -sh /var/cache/* 2>/dev/null | sort -rh | head -10
sudo du -sh /home/* 2>/dev/null | sort -rh | head -10

# Find files larger than 100MB
sudo find / -type f -size +100M -exec du -h {} + 2>/dev/null | sort -rh | head -20

# Find files deleted but still held open by processes (still consuming space)
sudo lsof +L1
sudo lsof | grep deleted | sort -nrk 7 | head -15

# Quick cleanup
sudo apt clean && sudo apt autoremove --purge     # Debian/Ubuntu
sudo dnf clean all                                # RHEL/Fedora/AlmaLinux
sudo journalctl --vacuum-size=500M                # trim systemd journal
sudo journalctl --vacuum-time=7d

# Truncate a log file without breaking the running process
sudo truncate -s 0 /var/log/large-file.log

# Remove unused Docker objects (-a deletes ALL unused images, not only dangling ones)
docker system prune -a

# Remove old snap revisions
snap list --all | awk '/disabled/{print $1, $3}' | \
  while read name rev; do sudo snap remove "$name" --revision="$rev"; done
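
For a quick alert on full filesystems, the `df` output can be filtered by threshold. A minimal sketch; `df_over` is a hypothetical helper and the 90% default is an assumption. It expects the POSIX layout from `df -P`, and the same column position works for inode usage with `df -Pi`.

```bash
# df_over [THRESHOLD]: read `df -P` (or `df -Pi`) output on stdin and print
# "mountpoint percent" for every filesystem at or above THRESHOLD percent.
df_over() {
    awk -v max="${1:-90}" '
        NR == 1 { next }                 # skip the header row
        {
            pct = $5
            sub(/%$/, "", pct)           # "91%" -> "91"
            if (pct ~ /^[0-9]+$/ && pct + 0 >= max + 0)
                print $6, pct
        }
    '
}

# Usage:
#   df -P  | df_over 90     # space
#   df -Pi | df_over 90     # inodes
# Limitation: mount points containing spaces are truncated at the first space.
```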

Log Rotation

# Check existing config
cat /etc/logrotate.conf
ls /etc/logrotate.d/

# Create custom rotation
sudo nano /etc/logrotate.d/myapp

/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0640 www-data adm
}

sudo logrotate -d /etc/logrotate.d/myapp   # dry run
sudo logrotate -f /etc/logrotate.d/myapp   # force immediately

Networking

Interface and IP Diagnostics

# Show all interfaces and their state
ip link show
# Look for UP vs DOWN vs NO-CARRIER

# Show IP addresses
ip addr show
ip a                    # short form
ip a show eth0          # specific interface

# Show routing table
ip route show
ip r                    # short form
# Need: "default via <gateway-ip> dev <interface>"

# Test gateway
ping -c 3 $(ip r | awk '/default/{print $3}')

# Physical link check (wired)
ethtool eth0 | grep "Link detected"

# WiFi status
iwconfig wlan0
nmcli device wifi list
rfkill list             # check if radio is blocked
sudo rfkill unblock wifi

Connectivity Testing

# ICMP connectivity
ping -c 4 8.8.8.8                    # test internet
ping -c 4 192.168.1.1                # test gateway

# DNS test
nslookup google.com
dig google.com
dig google.com @8.8.8.8              # test against specific DNS server
host google.com

# Port test
telnet server-ip 80
nc -zv server-ip 80                   # netcat port test
nc -zvu server-ip 53                  # UDP port test (unreliable: no handshake, "open" only means no ICMP rejection)
curl -v http://server-ip:80           # HTTP test with details

# Trace the route
traceroute google.com
mtr google.com                        # continuous traceroute (better)

# Test multiple ports at once
nmap -p 22,80,443 server-ip
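
When `nc`, `telnet`, and `nmap` are not installed, bash itself can open a TCP connection. The sketch below assumes bash and coreutils `timeout` are available; `check_tcp` and `port_report` are hypothetical helper names, and the 3-second default timeout is an assumption.

```bash
# check_tcp HOST PORT [TIMEOUT]: succeed if a TCP connection can be opened.
# Uses bash's /dev/tcp redirection inside a child bash.
check_tcp() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# port_report HOST PORT...: print one status line per port.
port_report() {
    host=$1
    shift
    for port in "$@"; do
        if check_tcp "$host" "$port"; then
            echo "$port open"
        else
            echo "$port closed or filtered"
        fi
    done
}

# Usage:
#   port_report server-ip 22 80 443
```

A failure cannot distinguish a closed port from a firewall drop; use `tcpdump` on the target to tell them apart.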

Ports and Listening Services

# Show all listening TCP/UDP ports
sudo ss -tulnp
# -t = TCP, -u = UDP, -l = listening, -n = numeric, -p = show process

sudo netstat -tulnp                   # older alternative to ss

# What is using port 80?
sudo ss -tlnp | grep :80
sudo lsof -i :80
sudo fuser 80/tcp

# All connections including established
sudo ss -tap
sudo netstat -tap

# Check if port is open from remote
nc -zv remote-host 22

DNS Diagnostics

# Check current DNS config
cat /etc/resolv.conf
resolvectl status                     # systemd-resolved

# Test DNS resolution
nslookup google.com
nslookup google.com 8.8.8.8          # against specific server
dig google.com
dig @1.1.1.1 google.com              # against Cloudflare

# Flush DNS cache
sudo resolvectl flush-caches          # systemd-resolved
sudo systemd-resolve --flush-caches  # older systemd releases (replaced by resolvectl)
sudo service nscd restart             # if using nscd

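# Fix DNS (temporary if resolv.conf is managed by systemd-resolved, NetworkManager, or DHCP -- it gets overwritten)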
sudo nano /etc/resolv.conf
# nameserver 8.8.8.8
# nameserver 1.1.1.1

# Persistent via NetworkManager
sudo nmcli con mod "connection-name" ipv4.dns "8.8.8.8 1.1.1.1"
sudo nmcli con up "connection-name"

# Persistent via systemd-resolved
sudo nano /etc/systemd/resolved.conf

[Resolve]
DNS=8.8.8.8 1.1.1.1
FallbackDNS=1.0.0.1

sudo systemctl restart systemd-resolved

Packet Capture

# Capture all traffic on eth0
sudo tcpdump -i eth0

# Capture HTTP traffic
sudo tcpdump -i eth0 port 80

# Capture DNS queries
sudo tcpdump -i eth0 port 53

# Save to file for analysis in Wireshark
sudo tcpdump -i eth0 -w capture.pcap

# Read saved file
tcpdump -r capture.pcap

# Print packet payloads as ASCII (-A)
sudo tcpdump -i eth0 -A port 80

# Filter by host
sudo tcpdump -i eth0 host 192.168.1.100

Firewall Diagnostics

# Which firewall is active?
sudo iptables -L -n -v
sudo systemctl status firewalld
sudo ufw status verbose
sudo nft list ruleset

# Allow port 80 -- iptables
# (-A appends; if an earlier rule already drops the traffic, use -I INPUT 1 to insert at the top)
sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT
sudo iptables-save | sudo tee /etc/iptables/rules.v4

# Allow service -- firewalld
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --list-all

# Allow port -- UFW
sudo ufw allow 80/tcp
sudo ufw allow from 192.168.1.0/24 to any port 22
sudo ufw enable
sudo ufw status numbered

# Temporarily test without firewall (caution on remote servers)
# Set the policy to ACCEPT first: flushing rules while the policy is DROP cuts off SSH
sudo iptables -P INPUT ACCEPT && sudo iptables -F

Boot and Startup Issues

Boot Diagnostics

# After boot -- analyze startup time
systemd-analyze
systemd-analyze blame           # slowest services first
systemd-analyze critical-chain  # critical path
systemd-analyze plot > boot.svg # visual graph

# Check what failed
systemctl --failed
systemctl list-jobs             # pending jobs

# Boot messages
dmesg | tail -50
dmesg -T | grep -i error        # errors with timestamps
dmesg -T | grep -i "fail"
journalctl -b                   # all logs from current boot
journalctl -b -p err            # errors from current boot
journalctl -b -1                # previous boot
journalctl -b -2                # two boots ago

GRUB Recovery

From the GRUB rescue prompt:

ls                               # list partitions
ls (hd0,gpt2)/                   # check partition contents
set root=(hd0,gpt2)
set prefix=(hd0,gpt2)/boot/grub
insmod normal
normal

After booting into the system:

sudo grub-install /dev/sda       # reinstall GRUB (whole disk, not partition)
sudo update-grub                 # regenerate grub.cfg (RHEL family: grub2-mkconfig -o /boot/grub2/grub.cfg)

# For UEFI systems
sudo grub-install --target=x86_64-efi \
  --efi-directory=/boot/efi \
  --bootloader-id=ubuntu
sudo update-grub

# Check EFI boot entries
efibootmgr -v

fstab Issues

# Test all fstab mounts without rebooting
sudo mount -a
# If this fails, the next boot is likely to stop in emergency mode

# Find device UUIDs (use these in fstab, not /dev/sda names)
blkid
lsblk -f

# Verify fstab syntax
sudo findmnt --verify
cat /etc/fstab

# Network mounts should include _netdev so they wait for the network instead of stalling boot
# //server/share /mnt/share cifs credentials=/etc/.creds,_netdev 0 0
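
The `_netdev` rule is easy to lint. A sketch that applies it literally: `fstab_missing_netdev` is a hypothetical helper, and the list of filesystem types it treats as network mounts (cifs, smb3, nfs, nfs4) is an assumption you may need to extend.

```bash
# fstab_missing_netdev: read fstab-format lines on stdin and print the mount
# point of every network filesystem whose options lack _netdev.
fstab_missing_netdev() {
    awk '
        /^[ \t]*#/ || NF < 4 { next }          # comments, blanks, short lines
        $3 ~ /^(cifs|smb3|nfs|nfs4)$/ {
            n = split($4, opts, ",")
            found = 0
            for (i = 1; i <= n; i++) if (opts[i] == "_netdev") found = 1
            if (!found) print $2
        }
    '
}

# Usage:
#   fstab_missing_netdev < /etc/fstab
```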

Services and Processes

Service Diagnostics

# Check status
systemctl status service-name
systemctl is-active service-name
systemctl is-enabled service-name

# Full logs for a service
journalctl -u service-name
journalctl -u service-name -f           # follow
journalctl -u service-name -n 100       # last 100 lines
journalctl -u service-name --since "1 hour ago"
journalctl -u service-name -p err       # errors only

# Check service unit file
systemctl cat service-name

# Check dependencies
systemctl list-dependencies service-name

# Check what is blocking boot
systemd-analyze critical-chain service-name.service

# Test configs before restarting
nginx -t
apache2ctl configtest
sshd -t
mysqld --validate-config

Service Management

# Start / stop / restart
sudo systemctl start service-name
sudo systemctl stop service-name
sudo systemctl restart service-name
sudo systemctl reload service-name      # reload config without restart

# Enable / disable on boot
sudo systemctl enable service-name
sudo systemctl disable service-name
sudo systemctl enable --now service-name  # enable and start immediately

# Override service settings (creates /etc/systemd/system/service-name.service.d/override.conf)
sudo systemctl edit service-name

# After editing unit files
sudo systemctl daemon-reload

# Emergency: mask a broken service (can't be started even manually)
sudo systemctl mask service-name
sudo systemctl unmask service-name

Process Investigation

# Find process by name
pgrep nginx                             # returns PIDs
pgrep -a nginx                          # PIDs and full command
pidof nginx

# Process tree
pstree -p                               # all processes with PIDs
pstree -p <PID>                         # subtree from PID

# Detailed process info
cat /proc/<PID>/status
cat /proc/<PID>/cmdline | tr '\0' ' '   # full command with args
cat /proc/<PID>/environ | tr '\0' '\n'  # environment variables
ls /proc/<PID>/fd | wc -l               # number of open file descriptors

# What files does a process have open?
lsof -p <PID>

# What processes have a file open?
lsof /var/log/syslog
fuser /var/log/syslog

# What is holding a port?
sudo lsof -i :80
sudo fuser 80/tcp
sudo ss -tlnp | grep :80

Resource Limits

# Check limits for a process
cat /proc/<PID>/limits

# View systemd limits for a service
systemctl show service-name | grep -i limit

# Set per-user limits (applied by PAM at login; services started by systemd ignore this file)
sudo nano /etc/security/limits.conf

# username type item value
nginx      soft nofile 65536
nginx      hard nofile 65536
*          soft nproc  4096

# Set limits in systemd service
sudo systemctl edit service-name

[Service]
LimitNOFILE=65536
LimitNPROC=4096
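
"Too many open files" errors usually mean a process is close to its soft `nofile` limit. The sketch below compares the two numbers; `nofile_limits` and `fd_usage_pct` are hypothetical helpers, and the parser assumes the Linux `/proc/<PID>/limits` layout.

```bash
# nofile_limits: read the contents of /proc/<PID>/limits on stdin and print
# "soft hard" for the "Max open files" row.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }'
}

# fd_usage_pct USED SOFT: percentage of the soft limit in use (truncated).
fd_usage_pct() {
    awk -v used="$1" -v soft="$2" 'BEGIN { printf "%d\n", used * 100 / soft }'
}

# Usage (PID 1234 is a placeholder):
#   set -- $(nofile_limits < /proc/1234/limits)
#   fd_usage_pct "$(ls /proc/1234/fd | wc -l)" "$1"
```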

Storage and Filesystems

Disk and Partition Info

# List block devices
lsblk
lsblk -f                    # include filesystem type and UUID

# List partitions
sudo fdisk -l
sudo parted -l

# Filesystem usage
df -h                       # human readable
df -i                       # inode usage
df -Th                      # with filesystem type

# Show UUIDs
blkid
blkid /dev/sda1

# Check physical device info
sudo hdparm -I /dev/sda     # ATA device info
sudo smartctl -i /dev/sda   # SMART device info

Filesystem Operations

# Mount / unmount
sudo mount /dev/sda1 /mnt
sudo mount -t ext4 /dev/sda1 /mnt
sudo umount /mnt
sudo umount -l /mnt         # lazy unmount (if device busy)

# Bind mount (mount directory to another location)
sudo mount --bind /source /destination

# Check if filesystem is mounted
mount | grep sda1
findmnt /mnt

# Show open files on a filesystem
lsof +D /mnt                # useful before unmounting
fuser -vm /mnt

Filesystem Check and Repair

# Check BEFORE mounting or on unmounted filesystem
sudo fsck /dev/sda1
sudo fsck -y /dev/sda1      # answer yes to all repairs automatically
sudo fsck -n /dev/sda1      # dry run (no changes)

# ext4 specific
sudo e2fsck -f /dev/sda1    # force check even if clean
sudo tune2fs -l /dev/sda1   # show filesystem parameters

# XFS (filesystem must be unmounted for repair)
sudo xfs_repair /dev/sda1      # repair
sudo xfs_repair -n /dev/sda1   # check only, no changes (xfs_check has been removed from xfsprogs)

# Check SMART health
sudo smartctl -H /dev/sda
sudo smartctl -t short /dev/sda && sleep 120 && sudo smartctl -a /dev/sda

LVM (Logical Volume Manager)

# List volumes
sudo pvs                    # physical volumes
sudo vgs                    # volume groups
sudo lvs                    # logical volumes

# Extend a logical volume and filesystem
sudo lvextend -L +10G /dev/vg/lv-name
sudo resize2fs /dev/vg/lv-name       # ext4
sudo xfs_growfs /mount/point         # xfs

# Create snapshot
sudo lvcreate -L 5G -s -n snapshot /dev/vg/lv-name

# Remove old snapshot
sudo lvremove /dev/vg/snapshot

Permissions and Access

Reading Permissions

ls -la /path/to/file
# -rw-r--r-- 1 alice developers 1234 Jan 1 12:00 file.txt
# │└┬┘└┬┘└┬┘
# │ │  │  └─ others: r-- (read only)
# │ │  └──── group:  r-- (read only)
# │ └─────── owner:  rw- (read + write)
# └───────── type: - = file, d = dir, l = symlink

# Permission bits: r=4, w=2, x=1
# 755 = rwxr-xr-x (owner full, group and others read+execute)
# 644 = rw-r--r-- (owner read+write, group and others read)
# 600 = rw------- (owner only)
# 700 = rwx------ (owner only, executable)
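
The r=4, w=2, x=1 arithmetic can be checked mechanically. A minimal sketch; `perm_to_octal` is a hypothetical helper that handles only the nine basic permission characters, not setuid, setgid, or sticky bits.

```bash
# perm_to_octal RWXSTRING: convert a 9-character mode such as rwxr-xr-x
# to its octal form by summing r=4, w=2, x=1 per triplet.
perm_to_octal() {
    mode=$1
    [ "${#mode}" -eq 9 ] || { echo "expected 9 characters, got: $mode" >&2; return 1; }
    out=""
    for triplet in "$(printf %s "$mode" | cut -c1-3)" \
                   "$(printf %s "$mode" | cut -c4-6)" \
                   "$(printf %s "$mode" | cut -c7-9)"; do
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac
        case $triplet in ??x) digit=$((digit + 1)) ;; esac
        out="$out$digit"
    done
    echo "$out"
}

# Usage:
#   perm_to_octal rw-r--r--
```

On systems with GNU coreutils, `stat -c '%a' file` prints the octal mode of an existing file directly.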

Fixing Permissions

# Change permissions
chmod 644 file.txt
chmod 755 directory/
chmod -R 755 /path/dir/       # recursive (be careful)
chmod u+x script.sh           # add execute for owner
chmod go-w file.txt           # remove write from group and others
chmod a+r file.txt            # add read for all

# Change ownership
sudo chown user file.txt
sudo chown user:group file.txt
sudo chown -R user:group /path/dir/

# Change group only
sudo chgrp group file.txt

Special Attributes

# Check for special flags (immutable, append-only, etc.)
lsattr file.txt
lsattr -d /path/directory/

# Common flags:
# i = immutable (cannot modify or delete)
# a = append-only (can only add data)

# Remove immutable flag
sudo chattr -i file.txt

# Set append-only (useful for log files)
sudo chattr +a /var/log/app.log

ACLs (Access Control Lists)

# View ACLs
getfacl file.txt

# Add user ACL
setfacl -m u:username:rw file.txt

# Add group ACL
setfacl -m g:groupname:r file.txt

# Set default ACL on directory (inherits to new files)
setfacl -d -m u:username:rw /path/dir/

# Remove all ACLs
setfacl -b file.txt

SELinux

# Check SELinux mode
getenforce
sestatus

# Check file security context
ls -Z /path/to/file

# Check recent SELinux denials
sudo ausearch -m avc -ts recent
sudo sealert -a /var/log/audit/audit.log

# Temporarily switch to permissive mode for testing (denials are logged, not enforced)
sudo setenforce 0     # permissive mode
sudo setenforce 1     # back to enforcing

# Fix file context
sudo restorecon -v /path/to/file
sudo restorecon -Rv /path/dir/   # recursive

# Allow a port for a service
sudo semanage port -a -t http_port_t -p tcp 8080

# Apply a suggested policy from audit log
sudo ausearch -m avc -ts recent | audit2allow -M mymodule
sudo semodule -i mymodule.pp

Log Analysis

systemd Journal

# Most useful commands
journalctl -xe                          # recent errors with explanations
journalctl -f                           # follow live
journalctl -b                           # current boot
journalctl -b -1                        # previous boot
journalctl -p err                       # errors only
journalctl -p err -b                    # errors from current boot

# Filter by service
journalctl -u nginx
journalctl -u nginx -f
journalctl -u nginx --since "1 hour ago"

# Filter by time
journalctl --since "2026-01-01 10:00" --until "2026-01-01 11:00"
journalctl --since "30 minutes ago"

# Filter by PID
journalctl _PID=1234

# Show kernel messages
journalctl -k                           # kernel messages only

# Disk usage of journal
journalctl --disk-usage

# Clean old journal entries
sudo journalctl --vacuum-size=500M
sudo journalctl --vacuum-time=30d

Traditional Log Files

# Real-time monitoring
tail -f /var/log/syslog
tail -f /var/log/nginx/error.log

# Search in log files
grep "ERROR" /var/log/app.log
grep -i "error\|fail\|warn" /var/log/syslog | tail -50
grep -B 5 -A 5 "fatal" /var/log/app.log    # context before/after

# Search in compressed logs
zgrep "error" /var/log/syslog.2.gz
zcat /var/log/syslog.1.gz | grep error

# Kernel messages
dmesg
dmesg -T                                    # with human-readable timestamps
dmesg | tail -30
dmesg -T | grep -i error
dmesg -T | grep -i "fail\|error\|warn" | tail -20
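
Before reading a large log line by line, a per-keyword count shows where to look first. A sketch using the same three keywords as the `grep` examples above; `severity_counts` is a hypothetical helper, and plain substring matching is an assumption that will over-count words like "failover".

```bash
# severity_counts: read log lines on stdin and count case-insensitive matches
# per keyword. A line containing several keywords counts once per keyword.
severity_counts() {
    awk '
        { line = tolower($0) }
        line ~ /error/ { n["error"]++ }
        line ~ /fail/  { n["fail"]++ }
        line ~ /warn/  { n["warn"]++ }
        END {
            printf "error=%d fail=%d warn=%d\n", n["error"], n["fail"], n["warn"]
        }
    '
}

# Usage:
#   severity_counts < /var/log/syslog
#   journalctl -b --no-pager | severity_counts
```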

Log Locations Reference

Log	Location (Debian/Ubuntu)	Location (RHEL/AlmaLinux)
General system	`/var/log/syslog`	`/var/log/messages`
Authentication	`/var/log/auth.log`	`/var/log/secure`
Kernel	`/var/log/kern.log`	via dmesg / journalctl -k
Boot	`/var/log/boot.log`	`/var/log/boot.log`
Cron	`/var/log/syslog`	`/var/log/cron`
Nginx	`/var/log/nginx/`	`/var/log/nginx/`
Apache	`/var/log/apache2/`	`/var/log/httpd/`
MySQL	`/var/log/mysql/`	`/var/log/mariadb/`
PostgreSQL	`/var/log/postgresql/`	`/var/lib/pgsql/data/log/`
SSH	`/var/log/auth.log`	`/var/log/secure`

Package Management

Debian / Ubuntu (apt)

# Update and install
sudo apt update
sudo apt install package-name

# Fix broken dependencies
sudo apt --fix-broken install
sudo dpkg --configure -a

# Remove package cleanly
sudo apt remove package-name
sudo apt purge package-name          # also removes config files
sudo apt autoremove                  # remove unused deps

# If apt is locked
sudo lsof /var/lib/dpkg/lock-frontend
# Kill the locking process or wait
# If stuck after a crash (only after confirming no apt or dpkg process is running):
sudo rm /var/lib/dpkg/lock-frontend
sudo rm /var/lib/apt/lists/lock
sudo dpkg --configure -a

# Find what package owns a file
dpkg -S /usr/bin/nginx

# List files in a package
dpkg -L nginx

# Check package info
apt show nginx
apt-cache policy nginx               # shows installed vs available version

RHEL / AlmaLinux / Fedora (dnf/yum)

sudo dnf install package-name
sudo dnf remove package-name
sudo dnf update
sudo dnf clean all

# Find what package owns a file
rpm -qf /usr/bin/nginx
dnf provides /usr/bin/nginx

# List files in package
rpm -ql nginx

Network Interfaces -- Full Setup Reference

netplan (Ubuntu 18.04+)

sudo nano /etc/netplan/01-netcfg.yaml

network:
  version: 2
  renderer: networkd
  ethernets:
    eth0:
      dhcp4: true
      nameservers:
        addresses: [8.8.8.8, 1.1.1.1]

sudo netplan try      # apply, then roll back automatically unless confirmed within 120s
sudo netplan apply    # apply permanently once the config is known to work

NetworkManager

# List connections
nmcli con show

# Connect to WiFi
nmcli device wifi connect "SSID" password "password"

# Edit DNS on a connection
sudo nmcli con mod "eth0" ipv4.dns "8.8.8.8 1.1.1.1"
sudo nmcli con up "eth0"

# Set static IP
sudo nmcli con mod "eth0" ipv4.method manual \
  ipv4.addresses "192.168.1.100/24" \
  ipv4.gateway "192.168.1.1"
sudo nmcli con up "eth0"

Quick Reference: Symptom to Command

System is Slow

top                          # 1. check CPU and load
free -h                      # 2. check memory -- is available near zero?
vmstat 1 5                   # 3. check si/so (swap) and wa (I/O wait)
iostat -x 1 3                # 4. check disk -- %util column
df -h                        # 5. check disk space

Cannot Reach a Service

ping server-ip               # 1. basic connectivity
ping hostname                # 2. if this fails but step 1 works = DNS issue
nslookup hostname            # 3. DNS debug
sudo ss -tlnp | grep :port   # 4. is service listening?
curl -v http://server:port   # 5. detailed HTTP test
sudo iptables -L -n | head   # 6. firewall check

Service Won't Start

systemctl status service     # 1. status and recent log lines
journalctl -u service -n 50  # 2. more logs
systemctl cat service        # 3. unit file
journalctl -b -p err         # 4. all boot errors

Cannot Write File

df -h .                      # 1. disk full?
df -i .                      # 2. inodes exhausted?
ls -la file                  # 3. check permissions and owner
lsattr file                  # 4. immutable flag?
id                           # 5. who am I?

Something Used All Disk Space

df -h                                      # which filesystem?
sudo du -sh /var/log/* | sort -rh | head   # logs?
sudo lsof +L1                              # deleted files still open?
sudo find / -size +500M -type f 2>/dev/null # big files?

Essential Command Reference

System Monitoring

Command	What it shows
`top`	Processes, CPU, memory live
`htop`	Same but better
`uptime`	Load averages
`vmstat 1`	CPU, memory, swap, I/O overview
`mpstat -P ALL 1`	Per-core CPU usage
`free -h`	Memory and swap
`iostat -x 1`	Disk I/O per device
`iotop -o`	Disk I/O per process
`dstat`	Everything at once

Process

Command	What it does
`ps aux`	All processes
`pgrep -a name`	Find PID by name
`lsof -p <PID>`	Files open by process
`strace -p <PID>`	System calls made by process
`cat /proc/<PID>/status`	Detailed process info
`kill <PID>`	Graceful stop
`kill -9 <PID>`	Force kill
`renice +10 <PID>`	Lower process priority
`ionice -c 3 -p <PID>`	Lower I/O priority

Disk and Filesystem

Command	What it does
`df -h`	Filesystem usage
`df -i`	Inode usage
`du -sh /path`	Directory size
`lsblk -f`	Block devices with filesystems
`blkid`	UUIDs and filesystem types
`mount / umount`	Mount/unmount
`fsck /dev/sda1`	Filesystem check
`smartctl -H /dev/sda`	Disk health check
`lsof +L1`	Deleted files still open

Network

Command	What it does
`ip a`	IP addresses
`ip r`	Routes
`ip link`	Interface status
`ss -tulnp`	Listening ports
`ping / traceroute`	Connectivity
`dig / nslookup`	DNS lookup
`tcpdump -i eth0`	Packet capture
`nc -zv host port`	Port test
`mtr host`	Continuous traceroute

Logs

Command	What it does
`journalctl -xe`	Recent errors with context
`journalctl -u svc`	Service logs
`journalctl -b -p err`	Boot errors
`dmesg -T`	Kernel messages with timestamp
`tail -f /var/log/syslog`	Follow syslog
`grep -i error /var/log/app.log`	Search logs

Services

Command	What it does
`systemctl status svc`	Status + recent logs
`systemctl start/stop/restart svc`	Control service
`systemctl enable/disable svc`	Boot behavior
`systemctl --failed`	All failed units
`systemd-analyze blame`	Slowest startup services

Production Debugging Tips

Always check the obvious first: is the service running? is the disk full? is the port already in use?

Change one thing at a time. Multiple simultaneous changes make it impossible to know what fixed the problem.

Read the error message carefully. Linux error messages are precise. "No space left on device" and "Permission denied" mean exactly what they say.

Test your fix before declaring success. Restart the service. Reproduce the original failure scenario. Confirm it no longer happens.

Write down what you did. Incidents repeat. The runbook you write today saves you hours at 3am next month.

Resources

The Linux Command Line (free book)
Brendan Gregg's Linux Performance Tools
man pages -- read them
ArchWiki -- excellent reference even if you don't use Arch
Red Hat Sysadmin Articles