- What process is using the CPU?
- Use `top` or `htop` to see what process is using the CPU
- Optimize the code or limit the number of processes running
This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.
- Check the output of `uptime` or `top`
- Is the system overloaded?
- Are there too many processes running simultaneously?
- Or is it one process that is causing the backlog?
- Which query should be optimized if it is a database process?
- Maybe offload some tasks to another server
- Maybe replace the CPU with one with better specs
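
The overload check above can be sketched as a comparison between the load average and the number of cores. This is a minimal illustration, not a definitive rule: the function names are hypothetical, and the 1.0 per-core cutoff is a common rule of thumb assumed here rather than something stated in the checklist.

```python
import os


def load_per_core(load_avg, cores):
    """Return the load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_overloaded(load_avg, cores):
    """Assumed rule of thumb: a per-core load above 1.0 means a backlog."""
    return load_per_core(load_avg, cores) > 1.0


def current_load_per_core():
    """Read live values; os.getloadavg() is only available on Unix-like systems."""
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)


# Hypothetical reading: a 1-minute load average of 6.0 on a 4-core machine.
print(is_overloaded(6.0, 4))  # True
```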
A context switch occurs when the CPU switches between different processes to allocate resources. A context is the saved state of a process (for example its registers and program counter) that allows the CPU to resume it later, whatever scheduling state it was in (Running, Waiting, or Stopped).
Too many context switches lead to inefficiency and higher CPU usage.
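- Check context switching with `vmstat` (the `cs` column) or `pidstat -w`. Check the number of context switches per second.
- How many context switches per second do we have?
- More than 500-1000 context switches per second may be considered high and may indicate that the system is overwhelmed, though busy multi-core servers often run well above this, so compare against the system's normal baseline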
- Maybe reduce the number of running processes
- Optimize applications to use fewer threads
- Adjust system limits
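
As a rough sketch of how a per-second rate can be derived, the example below takes two snapshots of `/proc/stat` content and divides the change in the cumulative `ctxt` counter by the sampling interval. The helper names and the sample text are hypothetical; on a live Linux system the two strings would come from reading `/proc/stat` a few seconds apart.

```python
def parse_ctxt(proc_stat_text):
    """Extract the cumulative context-switch counter from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def switches_per_second(first, second, interval_s):
    """Rate of context switches between two snapshots taken interval_s apart."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (parse_ctxt(second) - parse_ctxt(first)) / interval_s


# Hypothetical snapshots taken 2 seconds apart.
before = "cpu  100 0 50 1000 5 0 0 0 0 0\nctxt 120000\nbtime 1700000000\n"
after = "cpu  110 0 55 1090 5 0 0 0 0 0\nctxt 123000\nbtime 1700000000\n"
print(switches_per_second(before, after, 2.0))  # 1500.0
```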
This issue occurs when the CPU is the limiting factor in system performance.
- The CPU usage is consistently high (above 80%)
- Are tasks taking too long to process?
- Load average exceeds the number of available CPU cores
- Use `top` and `htop` to identify processes that are using an inordinate amount of CPU time
- Optimize the processes using the CPU
- Maybe a hardware upgrade is needed
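
To make the 80% figure concrete, here is a hedged sketch that estimates CPU utilization from two snapshots of the aggregate `cpu` line in `/proc/stat`. It assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal) and treats idle plus iowait as non-busy time; the function names and sample lines are illustrative only.

```python
def cpu_times(proc_stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user, nice, system, idle, iowait, irq, softirq, steal
            values = [int(v) for v in fields[1:9]]
            idle_all = values[3] + values[4]  # idle + iowait
            total = sum(values)
            return total - idle_all, total
    raise ValueError("no aggregate cpu line found")


def cpu_utilization_percent(first, second):
    """Percentage of non-idle time between two snapshots."""
    busy1, total1 = cpu_times(first)
    busy2, total2 = cpu_times(second)
    elapsed = total2 - total1
    if elapsed <= 0:
        return 0.0
    return 100.0 * (busy2 - busy1) / elapsed


# Hypothetical snapshots of the first line of /proc/stat.
snap1 = "cpu  100 0 50 800 50 0 0 0 0 0\n"
snap2 = "cpu  190 0 60 900 50 0 0 0 0 0\n"
print(cpu_utilization_percent(snap1, snap2))  # 50.0
```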
This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory
- Is the system running out of swap space?
- The swap file is typically located at `/swapfile`; swap can also live on a dedicated swap partition
- Identify swap space with `swapon -s`
- Has the system performance degraded?
- What do `top` and `htop` say about swap usage?
- Is the usage more than 10%? That might be high
- Monitor swap usage with `free -h` or `vmstat`
- Maybe more physical RAM is needed
- Or adjust the `vm.swappiness` kernel parameter, e.g. `sudo sysctl vm.swappiness=10`
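
The 10% check can be illustrated with a small parser for `/proc/meminfo` content. This is a sketch under the assumption that the `SwapTotal` and `SwapFree` fields are present, as they are on typical Linux systems; the function name and sample values are hypothetical.

```python
def swap_used_percent(meminfo_text):
    """Compute swap usage as a percentage from /proc/meminfo content."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # values are reported in kB
    total = values["SwapTotal"]
    free = values["SwapFree"]
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - free) / total


# Hypothetical excerpt of /proc/meminfo.
sample = "MemTotal: 8000000 kB\nSwapTotal: 2000000 kB\nSwapFree: 1700000 kB\n"
print(swap_used_percent(sample))  # 15.0
```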
This issue happens when the system runs out of both physical and virtual memory
- Are critical processes being unexpectedly terminated?
- Check the system logs for OOM (out-of-memory) killer messages
- Adjust application configurations to optimize memory usage
- Increase RAM
- Adjust swap space
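
One way to scan logs for OOM kills is sketched below. It assumes the kernel message contains the phrase `Killed process <pid> (<name>)`, which is common but varies between kernel versions, so treat the pattern as an assumption. The sample log lines are hypothetical.

```python
import re

# Assumed message shape; the exact wording differs across kernel versions.
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, process_name) for each OOM kill found."""
    kills = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Hypothetical log lines, e.g. as produced by `dmesg` or `journalctl -k`.
sample_lines = [
    "kernel: Out of memory: Killed process 4321 (java) total-vm:8123456kB",
    "sshd[100]: Accepted publickey for deploy",
]
print(find_oom_kills(sample_lines))  # [(4321, 'java')]
```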
This issue occurs when the system slows down due to delays in reading from or writing to a storage device.
This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, causing delays in executing processes.
- Monitor disk load with `iotop` or `dstat`. `top` shows I/O wait as the `wa` value (e.g. `x.x wa`), and `iostat` reports `%iowait`
- Optimize or spread out disk operations
- Improve performance or upgrade to SSDs if needed
Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or an inefficient disk configuration.
- What does the `iostat` output say about the disk latency?
- Is the latency above the normal 10 ms? Anything higher than 20 ms may indicate a latency issue
- Is the disk operating at its maximum throughput?
- Maybe some drivers need to be upgraded
- Maybe adjusting the disk (RAID) config can help reduce the latency
- Maybe upgrade to faster disks
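
For illustration, average request latency over an interval can be approximated from two snapshots of `/proc/diskstats`: the change in time spent on reads and writes divided by the change in completed reads and writes. The sketch below assumes the standard Linux field layout; the helper names, device name, and sample numbers are hypothetical.

```python
def parse_diskstats(text, device):
    """Return (completed I/Os, ms spent on I/O) for one device."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 11 and f[2] == device:
            ios = int(f[3]) + int(f[7])       # reads + writes completed
            millis = int(f[6]) + int(f[10])   # ms reading + ms writing
            return ios, millis
    raise ValueError("device not found: " + device)


def average_latency_ms(before, after, device):
    """Average time per completed request between two snapshots."""
    ios1, ms1 = parse_diskstats(before, device)
    ios2, ms2 = parse_diskstats(after, device)
    completed = ios2 - ios1
    if completed <= 0:
        return 0.0
    return (ms2 - ms1) / completed


# Hypothetical /proc/diskstats lines for a device named sda.
snap_a = "   8       0 sda 1000 0 8000 5000 500 0 4000 10000 0 9000 15000\n"
snap_b = "   8       0 sda 1100 0 8800 6500 600 0 4800 13500 0 12000 20000\n"
print(average_latency_ms(snap_a, snap_b, "sda"))  # 25.0
```

A result of 25 ms in this made-up sample would sit above the 20 ms mark mentioned in the checklist.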
This issue occurs when access to data on remote storage systems such as NFS, SAN, or cloud storage is slow
- Long response time when accessing files?
- Use `ping` or `netperf` to check network performance
- Optimize network settings or upgrade network hardware
This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues
- `ping -c 100 <DESTINATION>` reports the percentage of packets dropped. Over 1% packet loss may be unacceptable
- Packet loss can lead to slow performance, timeouts, and application errors
- Check routers and switches for hardware issues and faulty cables
- Check for NIC errors using `ifconfig` and `ethtool`
- Adjust QoS settings or increase bandwidth
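
The packet-loss threshold can be checked programmatically by parsing the summary line that `ping` prints. The sketch below assumes the Linux iputils wording (`N% packet loss`); other implementations may phrase it differently. The function names and the sample summary are hypothetical.

```python
import re

LOSS_PATTERN = re.compile(r"([\d.]+)% packet loss")


def packet_loss_percent(ping_output):
    """Extract the packet loss percentage from ping's summary output."""
    match = LOSS_PATTERN.search(ping_output)
    if match is None:
        raise ValueError("no packet loss summary found")
    return float(match.group(1))


def loss_is_unacceptable(ping_output, threshold_percent=1.0):
    """Apply the 1% guideline from the checklist (threshold is adjustable)."""
    return packet_loss_percent(ping_output) > threshold_percent


# Hypothetical summary line from `ping -c 100 <DESTINATION>`.
summary = "100 packets transmitted, 98 received, 2% packet loss, time 99123ms"
print(packet_loss_percent(summary))   # 2.0
print(loss_is_unacceptable(summary))  # True
```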
This issue occurs when a network connection is unexpectedly terminated
- Are users or services suddenly losing network access?
- Look for `connection reset` or `connection closed` messages in logs (for example with `dmesg` or `journalctl`), and check interface error counters with `ifconfig`
- Check network stack configuration
- Maybe a cable or some hardware in the path is faulty
- Maybe a firewall is closing the connection
- Check and adjust TCP settings if necessary
This issue occurs when a connection fails to receive a response within the expected time frame.
- Errors can be seen in logs
- Getting a `connection timed out` error when using `curl` to connect to a service?
- Maybe the network is congested
- Maybe there is a DNS issue
- Maybe the server is overloaded
- A timeout threshold of 5-10 seconds is typically acceptable
- Use `ping` or `traceroute` to check for network congestion
- Make sure DNS servers are correctly configured
- Check server performance
- Adjust TCP timeout if necessary
High latency refers to the delay in the time it takes for data to travel from one point to another
- Measure latency using `ping` or `traceroute`
- Latency of 100 ms on a local network is considered high, as is 300 ms for remote communication
- Check for network congestion
- Identify hardware issues
- Maybe the routing is misconfigured. Optimize the network path
- Maybe upgrading the network infrastructure could help
Jitter is the variation of latency over time, which can cause problems in real-time applications.
- A value above 30ms of fluctuation can cause noticeable issues
- Detect jitter issues with `ping -i 0.2 <DESTINATION>`
- Check for network congestion or hardware issues
- Implement QoS to prioritize relevant traffic
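
As a rough sketch, jitter can be estimated from `ping` output as the mean absolute difference between consecutive round-trip times. This is one simple definition chosen for illustration (other definitions exist, such as the smoothed estimator used for RTP), and it assumes reply lines contain `time=<value> ms`. The sample output is hypothetical.

```python
import re

RTT_PATTERN = re.compile(r"time=([\d.]+) ms")


def parse_rtts(ping_output):
    """Collect round-trip times in ms from ping reply lines."""
    return [float(v) for v in RTT_PATTERN.findall(ping_output)]


def mean_jitter_ms(rtts):
    """Mean absolute difference between consecutive RTT samples."""
    if len(rtts) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical reply lines from `ping -i 0.2 <DESTINATION>`.
output = """64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=20.0 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=22.0 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=19.0 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=60.0 ms
64 bytes from 192.0.2.1: icmp_seq=5 ttl=64 time=21.0 ms
"""
print(mean_jitter_ms(parse_rtts(output)))  # 21.25
```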
This issue occurs when the network takes too long to respond to requests.
This could be due to:
This issue occurs when the network is unable to transmit data at a high enough rate
- Identify low throughput using `iperf`; anything below 80% of the expected bandwidth is considered low throughput
- Check for network congestion
- Check for faulty cables
- Check for incorrect settings
- Maybe switch to a high bandwidth network
- Maybe reduce unnecessary traffic
- Optimize network routes
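
The 80% guideline reduces to a simple ratio. The sketch below assumes the measured figure has already been obtained (for example from an `iperf` run) and that the expected bandwidth is known; the function names and numbers are hypothetical.

```python
def throughput_ratio(measured_mbps, expected_mbps):
    """Fraction of the expected bandwidth that was actually achieved."""
    if expected_mbps <= 0:
        raise ValueError("expected_mbps must be positive")
    return measured_mbps / expected_mbps


def is_low_throughput(measured_mbps, expected_mbps, threshold=0.8):
    """Apply the 80% guideline from the checklist (threshold is adjustable)."""
    return throughput_ratio(measured_mbps, expected_mbps) < threshold


# Hypothetical measurement: 700 Mbit/s on a link expected to carry 1000 Mbit/s.
print(throughput_ratio(700, 1000))   # 0.7
print(is_low_throughput(700, 1000))  # True
```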
This issue occurs when an application takes longer than expected to react to user inputs
- Use `top` and `htop` to identify application resource consumption
- Maybe the application code needs to be optimized
- Increase system resources
- Check disk I/O for overload
- Check for unnecessary background processes
This issue occurs when commands in the terminal are delayed: the system takes an unusually long time to execute them.
- Use `top` or `iotop` to check for system resource usage
- Optimize processes running on the system
- Cleanup system resources
- Add more RAM or CPU cores
The system is taking an unusually long time to boot up
- See which services take longer than expected using `systemd-analyze` (for example `systemd-analyze blame`)
- Maybe too many services are configured to start at the same time
- Maybe one of the startup services is misconfigured?
- Delay or disable non-essential services at boot time using `systemctl`
- Optimize the boot sequence
This issue occurs when the system becomes completely unresponsive
- Is the system not accepting new input?
- Applications are no longer responding?
- Use `dmesg` or `journalctl` to identify what caused a kernel panic
- Identify runaway processes using `top` and `htop`
- Maybe upgrade the RAM or add more CPU cores
This issue occurs when a process is unable to proceed due to waiting on resources or system locks
- Are commands or applications stuck?
- `ps` and `top` show such processes in the `D` (uninterruptible sleep) state
- Use `lsof` to check which file a process is waiting for
- Use `strace` to trace system calls and signals
- Is a process repeatedly stuck or blocked? Maybe due to resource contention
- Optimize disk I/O
- Maybe add more memory
- Investigate dependency issues between processes
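
Spotting processes in the `D` state can be sketched as a filter over `ps` output. The example assumes output in the shape produced by `ps -eo pid,stat,comm` (a header line, then PID, state, and command columns); the function name and sample text are hypothetical.

```python
def uninterruptible_processes(ps_output):
    """Return (pid, command) for every process whose state starts with 'D'."""
    stuck = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1].startswith("D"):
            stuck.append((int(parts[0]), parts[2].strip()))
    return stuck


# Hypothetical output of `ps -eo pid,stat,comm`.
sample_ps = """  PID STAT COMMAND
    1 Ss   systemd
  412 D    rsync
  500 R+   ps
  733 D+   dd
"""
print(uninterruptible_processes(sample_ps))  # [(412, 'rsync'), (733, 'dd')]
```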
This happens when processes consume more resources than expected
- Notice high CPU usage
- Unusual memory consumption
- Excessive disk activity
- Use `top`, `htop`, or `pidstat` to identify this issue
- Optimize application resource usage
- Maybe configure system resource limits with `ulimit`
This issue often signals attempted unauthorized access or brute force attacks
- Maybe a brute force attack?
- Unauthorized access attempts?
- System compromise?
- What are the logs in `/var/log/auth.log` saying?
- Check `journalctl` to identify login attempts
- 5-10 login attempts from a single IP address within a short time may be a red flag for a brute force attack
- Implement `fail2ban` to block abusive IPs
- Enforce strong password policy
- Use MFA
- Limit access with firewall or IP allowlist
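
To illustrate the per-IP check, the sketch below counts failed SSH password attempts per source address in auth log lines. It assumes the common OpenSSH wording `Failed password for <user> from <ip> port ...` and ignores the time window for brevity; the function name, threshold default, and sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed OpenSSH log wording; other services log failures differently.
FAILED_PATTERN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\S+) port"
)


def suspicious_ips(log_lines, threshold=5):
    """Return {ip: attempts} for addresses with at least `threshold` failures."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


# Hypothetical auth.log lines using documentation-reserved IP addresses.
lines = ["sshd[900]: Failed password for root from 203.0.113.7 port 4242 ssh2"] * 6
lines.append(
    "sshd[901]: Failed password for invalid user admin from 198.51.100.2 port 5151 ssh2"
)
print(suspicious_ips(lines))  # {'203.0.113.7': 6}
```

A fuller version would also compare timestamps so that only attempts within a short window count toward the threshold.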