Troubleshooting a Unix System

Troubleshooting a Unix system is about systematically narrowing down the root cause of a problem. The steps depend on whether the issue is performance-related, a crash, a service outage, or a hardware failure. Here’s a structured approach you can use:

1. Identify the Problem

2. Check System Health

Run basic commands:

uptime          # load average, uptime, logged-in users
top             # or htop: CPU, memory, running processes
free -m         # memory usage
df -h           # disk usage
du -sh /path/*  # find large directories
iostat -x 1 5   # per-device I/O utilization (sysstat package)
vmstat 1 5      # run queue, swap activity, I/O wait

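Look for a load average well above the number of CPU cores, exhausted memory or heavy swap use, filesystems close to 100% full, and high I/O wait.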
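The health checks above can be scripted so they flag trouble instead of needing to be read by eye. This is only a rough sketch: the function names full_filesystems and load_is_high are made up for illustration, the 90% threshold is an arbitrary assumption, and mount points containing spaces are not handled.

```shell
# Print "mountpoint usage%" for filesystems at or above a threshold (percent).
# Reads `df -P` style output on stdin, so it also works on saved output.
full_filesystems() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # strip the percent sign
        if (use + 0 >= t) print $6 " " $5
    }'
}

# Succeed (exit 0) when the load average exceeds the number of CPU cores.
load_is_high() {
    awk -v l="$1" -v c="$2" 'BEGIN { exit !(l + 0 > c + 0) }'
}

# Example usage (Linux-specific paths and tools):
#   df -P | full_filesystems 90
#   load_is_high "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)" && echo "load is high"
```

A load average above the core count is a hint rather than proof of a problem, so treat the result as a prompt to look closer with top or vmstat.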

3. Check Processes & Services

Find misbehaving processes:

ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head

Restart or kill problematic processes:

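kill <pid>       # sends SIGTERM so the process can clean up
kill -9 <pid>    # SIGKILL, only if the process ignores SIGTERM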
systemctl restart <service>
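When a process has to be stopped by hand, it is usually kinder to ask it to exit before forcing it. A minimal sketch of that idea follows; stop_gracefully is a hypothetical helper name and the 10-second default is an arbitrary choice.

```shell
# Send SIGTERM, wait up to N seconds, then fall back to SIGKILL.
stop_gracefully() {
    pid="$1"
    timeout="${2:-10}"
    kill -TERM "$pid" 2>/dev/null || return 0    # process is already gone
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # process no longer exists
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$pid" 2>/dev/null                # last resort
}

# Example usage:
#   stop_gracefully 12345 15
```

If the target is a child of the current shell, follow up with wait to reap it. For anything managed by systemd, systemctl restart remains the better tool because it applies the unit's own stop logic.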

Check open ports and network services:

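netstat -tulnp   # legacy net-tools; may not be installed on newer systems
ss -ltnp         # TCP listeners only; add -u for UDP, run as root to see all process names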
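To compare what is actually listening against what should be, it can help to reduce the socket list to bare port numbers. The sketch below assumes the column layout of `ss -ltn` (state in the first column, local address in the fourth); listening_ports is an illustrative name, not a standard tool.

```shell
# Extract unique listening TCP port numbers from `ss -ltn` output on stdin.
listening_ports() {
    awk '$1 == "LISTEN" {
        n = split($4, parts, ":")    # the port follows the last colon (covers IPv6)
        print parts[n]
    }' | sort -nu
}

# Example usage:
#   ss -ltn | listening_ports
```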

4. Check Logs

System logs:

less /var/log/syslog    # Debian/Ubuntu
less /var/log/messages  # RHEL/CentOS
journalctl -xe

Application/service logs (e.g., Apache: /var/log/httpd/ on RHEL/CentOS, /var/log/apache2/ on Debian/Ubuntu).

Look for errors, warnings, crashes.
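Scanning a long log by eye is slow, so a quick filter is often the first step. The following is one possible sketch; recent_problems is a hypothetical helper, and the keyword list is an assumption that will need tuning for each application.

```shell
# Show the most recent lines in a log file that mention common problem keywords.
# $1 = log file, $2 = number of lines to show (default 20).
recent_problems() {
    grep -Ei 'error|warn|fail|panic|segfault' "$1" | tail -n "${2:-20}"
}

# Example usage:
#   recent_problems /var/log/syslog 50
```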

5. Check Network

Verify connectivity:

ping -c 4 8.8.8.8    # -c 4 stops after four packets
curl -I http://example.com

Check interface status:

ip addr
ip route

DNS issues:

dig example.com
nslookup example.com
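The network checks above work best in layers: raw IP reachability first, then name resolution, then the service itself. Here is a hedged sketch of that ordering. Both function names are invented for illustration, the hosts and timeouts are placeholders, and `getent hosts` is swapped in for dig/nslookup because its exit status reliably reflects whether the system resolver found the name.

```shell
# Map three check results (0 = success, non-zero = failure) to a likely cause.
classify_network() {
    if [ "$1" -ne 0 ]; then
        echo "no IP connectivity: check interface, route, gateway"
    elif [ "$2" -ne 0 ]; then
        echo "IP works but DNS fails: check resolver configuration"
    elif [ "$3" -ne 0 ]; then
        echo "network and DNS work: check the remote service or a firewall"
    else
        echo "connectivity looks fine"
    fi
}

# Run the three checks and classify the outcome (Linux ping flags assumed).
check_network() {
    ip_rc=0;   ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1 || ip_rc=$?
    dns_rc=0;  getent hosts example.com >/dev/null 2>&1 || dns_rc=$?
    http_rc=0; curl -sI --max-time 5 http://example.com >/dev/null 2>&1 || http_rc=$?
    classify_network "$ip_rc" "$dns_rc" "$http_rc"
}

# Example usage:
#   check_network
```

Keeping the classification separate from the probes makes it easy to swap in different hosts or tools without touching the decision logic.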

6. Hardware Checks

Disk health:

smartctl -a /dev/sda    # needs root and the smartmontools package
dmesg | grep -i error
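For a quick pass/fail answer on disk health, `smartctl -H` prints a one-line verdict that can be pulled out with a small filter. This sketch assumes the usual smartmontools wording for ATA/NVMe ("overall-health self-assessment test result") and SCSI ("SMART Health Status") devices; smart_health is an illustrative name.

```shell
# Reduce `smartctl -H` output on stdin to a single verdict word.
smart_health() {
    awk '
        /overall-health self-assessment test result:/ { print $NF; found = 1 }
        /SMART Health Status:/                        { print $NF; found = 1 }
        END { if (!found) print "UNKNOWN" }
    '
}

# Example usage (as root):
#   smartctl -H /dev/sda | smart_health
```

A passing verdict does not rule out a failing disk, so the full attribute list from smartctl -a and any I/O errors in dmesg still deserve a look.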

Memory test (requires reboot):

memtest86+      # not a shell command: boot it from the boot menu or rescue media

CPU temperature / sensors:

sensors

7. Boot & Filesystem Problems

fsck /dev/sda1    # run only on an unmounted filesystem (e.g., from rescue media)
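Running fsck on a mounted filesystem can corrupt it, so a guard in front of the command is worthwhile. A minimal sketch, assuming a Linux-style mount table: is_mounted is a hypothetical helper, and it only matches the exact device name in the first column, so devices mounted under another name (UUID, /dev/mapper/...) would be missed.

```shell
# Succeed (exit 0) if the device appears in the mount table.
# $1 = device, $2 = mount table file (defaults to /proc/mounts).
is_mounted() {
    awk -v d="$1" '$1 == d { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Example usage:
#   if is_mounted /dev/sda1; then
#       echo "/dev/sda1 is mounted; unmount it or boot rescue media first"
#   else
#       fsck /dev/sda1
#   fi
```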

Rebuild initramfs or grub if necessary.

8. Security Checks

Look for suspicious logins:

last
w
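The output of last can be long on a busy host, and a per-user tally makes unexpected accounts stand out. The sketch below assumes the default `last` layout with the user name in the first column; logins_per_user is an invented name.

```shell
# Count login records per user from `last` output on stdin, busiest first.
logins_per_user() {
    awk 'NF > 0 && $1 != "reboot" && $1 != "wtmp" && $1 != "btmp" {
        count[$1]++                  # skip blank lines, reboot records, trailer line
    }
    END { for (u in count) print count[u], u }' | sort -rn
}

# Example usage:
#   last | logins_per_user
```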

Check for unusual processes:

ps aux | awk '$1 != "root"'    # processes not owned by root (matches the USER column only)
lsof -i

Verify permissions and firewall rules:

iptables -L -n -v    # on nftables-based systems: nft list ruleset

9. If Still Stuck

Rule of Thumb

Start broad (system load, health, logs), then drill down (processes, services, hardware).