Linux: Troubleshooting Performance Issues

CPU Issues

High CPU usage

What process is using the CPU?
Use top or htop to see what process is using the CPU
Optimize the code or limit the number of processes running

High load average

This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.

Check output of uptime or top
Is the system overloaded?
Are there too many processes running simultaneously?
Or is it one process that is causing the backlog?
If it is a database process, which query should be optimized?
Maybe offload some tasks to another server
Maybe replace the CPU with one with better specs
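
The "is the system overloaded?" question above can be sketched as a comparison between the load average and the number of CPU cores. This is a minimal illustration, not a definitive check: the function names and the threshold of 1.0 per core are assumptions made for the example.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return load_avg / cores


def is_overloaded(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when the per-core load exceeds the threshold (1.0 is an assumption)."""
    return load_per_core(load_avg, cores) > threshold


def current_load_per_core() -> float:
    """Read the live 1-minute load average (Unix only) and normalize it."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)


# Example with fixed numbers: a load of 8.0 on a 4-core machine.
print(load_per_core(8.0, 4), is_overloaded(8.0, 4))
```

On a live host, current_load_per_core() returns the same ratio for the 1-minute average that uptime reports.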

High context switching

A context switch occurs when the CPU stops running one process or thread and starts running another so that CPU time can be shared among them. The context is the saved state of a process, which allows the CPU to resume it later; the process itself moves between states such as Running, Waiting, and Stopped.

Too many context switches lead to inefficiency and higher CPU usage.

Check context switching with vmstat (the cs column) or pidstat -w
How many context switches per second do we have?
More than 500-1000 context switches per second is sometimes quoted as high, but busy multi-core systems routinely exceed it, so compare against the system's normal baseline; a sustained rate well above that baseline may indicate that the system is overwhelmed
Maybe reduce the number of running processes
Optimize applications to use fewer threads
Adjust system limits
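
A rate of context switches per second can also be derived directly from the cumulative ctxt counter in /proc/stat. The sketch below assumes a Linux host; the function names are hypothetical and chosen only for illustration.

```python
import time


def parse_ctxt(proc_stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def switches_per_second(before: int, after: int, interval_s: float) -> float:
    """Convert two cumulative counts into a per-second rate."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (after - before) / interval_s


def sample_rate(interval_s: float = 1.0, path: str = "/proc/stat") -> float:
    """Sample the live counter twice (Linux only) and return the rate."""
    with open(path) as handle:
        before = parse_ctxt(handle.read())
    time.sleep(interval_s)
    with open(path) as handle:
        after = parse_ctxt(handle.read())
    return switches_per_second(before, after, interval_s)
```

Calling sample_rate() a few times during normal operation gives a baseline to compare against when the system feels slow.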

CPU bottleneck

This issue occurs when the CPU is the limiting factor in system performance.

The CPU usage is consistently high (above 80%)
Are tasks taking too long to process?
Load average exceeds the number of available CPU cores
Use top and htop to identify processes that are using an inordinate amount of CPU time
Optimize the processes using the CPU
Maybe a hardware upgrade is needed

Memory Issues

Swapping Issues

This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory

Is the system running out of swap space?
The swap file is typically located at /swapfile or on a dedicated swap partition
Identify swap space with swapon -s
Has the system performance degraded?
What do top and htop say about swap usage?
Is the usage more than 10%? That might be high
Monitor swap usage with free -h or vmstat
Maybe more physical RAM is needed
Or lower the vm.swappiness kernel parameter so the kernel swaps less eagerly, e.g. sudo sysctl vm.swappiness=10
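
The swap usage percentage discussed above can be computed from the SwapTotal and SwapFree fields of /proc/meminfo. The following is a minimal sketch under that assumption; the function names are hypothetical.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into a {field: value in kB} mapping."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def swap_used_percent(meminfo: dict) -> float:
    """Percentage of swap in use; 0.0 when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    free = meminfo.get("SwapFree", 0)
    return 100.0 * (total - free) / total


# Illustrative sample, not taken from a real host.
sample = "MemTotal: 8000000 kB\nSwapTotal: 2000000 kB\nSwapFree: 1500000 kB\n"
print(swap_used_percent(parse_meminfo(sample)))
```

On a live system the same functions can be fed the text of /proc/meminfo and the result compared with the 10% rule of thumb.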

Out of Memory (OOM) Errors

This issue happens when the system runs out of both physical and virtual memory

Are critical processes being unexpectedly terminated?
Check the system logs for OOM killer messages using dmesg or journalctl
Adjust application configurations to optimize memory usage
Increase RAM
Adjust swap space
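
Searching the logs for OOM kills can be automated with a small filter. The sketch below assumes the kernel message begins with "Out of memory: Killed process" (or "Kill process" on older kernels); the exact wording varies between kernel versions, so the pattern is an assumption, and the sample line is illustrative rather than a real log entry.

```python
import re

# Assumed message shape; adjust the pattern to match the kernel in use.
OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]*)\)")


def find_oom_kills(log_lines):
    """Return (pid, process name) for each OOM kill found in the lines."""
    kills = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Illustrative input only.
sample_lines = [
    "[1234.5] Out of memory: Killed process 4321 (java) total-vm:8000000kB",
    "[1240.0] eth0: link up",
]
print(find_oom_kills(sample_lines))
```

The function accepts any iterable of lines, so it can be fed the saved output of dmesg or journalctl.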

Disk I/O Issues

This issue occurs when the system slows down due to delays in reading from or writing to a storage device.

High input/output wait time

This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, causing delays in executing processes.

Monitor disk load with iotop or dstat. top shows I/O wait as the wa value in its CPU summary line, and iostat reports it as %iowait
Optimize or spread out disk operations
Improve performance or upgrade to SSDs if needed
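
The I/O wait percentage that top and iostat report can be approximated from two samples of the aggregate cpu line in /proc/stat. This is a sketch under the assumption that the fields follow the usual order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice); the function names are hypothetical.

```python
def parse_cpu_times(proc_stat_text: str):
    """Return the counters of the aggregate 'cpu' line as a list of ints."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after) -> float:
    """Share of elapsed CPU time spent in iowait between two samples."""
    deltas = [a - b for a, b in zip(after, before)]
    if len(deltas) < 5:
        raise ValueError("expected at least five CPU counters")
    # Only the first eight counters are summed: guest time is assumed to be
    # already included in user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


# Illustrative samples, not taken from a real host.
first = parse_cpu_times("cpu  100 0 50 800 50 0 0 0 0 0\n")
second = parse_cpu_times("cpu  150 0 70 900 80 0 0 0 0 0\n")
print(iowait_percent(first, second))
```

Reading /proc/stat twice a few seconds apart and passing both samples to iowait_percent gives a figure comparable to the wa value in top.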

High disk latency

Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or an inefficient disk configuration.

What does the iostat -x output (the await columns) say about the disk latency?
Is the latency above the normal 10ms? Higher than 20ms may indicate a latency issue
Is the disk operating at its maximum throughput?
Maybe some drivers need to be upgraded
Maybe adjusting the disk (RAID) config can help reduce the latency
Maybe upgrade to faster disks
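
An average per-request latency can be estimated from two samples of /proc/diskstats by dividing the time spent on I/O by the number of completed requests. The sketch below assumes the standard column layout (reads completed and milliseconds reading in columns 4 and 7, writes completed and milliseconds writing in columns 8 and 11); the function names are hypothetical and the sample lines are illustrative.

```python
def parse_diskstats(text: str, device: str) -> dict:
    """Extract request counts and I/O times (ms) for one device."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 11 and fields[2] == device:
            return {
                "reads": int(fields[3]),
                "read_ms": int(fields[6]),
                "writes": int(fields[7]),
                "write_ms": int(fields[10]),
            }
    raise ValueError(f"device {device!r} not found")


def average_latency_ms(before: dict, after: dict) -> float:
    """Mean time per completed request between two samples."""
    requests = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    if requests <= 0:
        return 0.0
    busy_ms = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    return busy_ms / requests


# Illustrative samples for a device named sda.
t0 = parse_diskstats("8 0 sda 1000 0 8000 5000 500 0 4000 2500 0 3000 7500\n", "sda")
t1 = parse_diskstats("8 0 sda 1100 0 8800 6500 600 0 4800 4000 0 4000 10500\n", "sda")
print(average_latency_ms(t0, t1))
```

The result can be compared with the 10ms and 20ms figures above; iostat remains the more complete tool, since it separates read and write latency.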

Slow remote storage response

This issue occurs when accessing data from remote storage systems such as NFS, SAN, or cloud storage

Long response time when accessing files?
Use ping or netperf to check for network issues between the client and the storage server
Optimize network settings or upgrade network hardware

Network Stability Issues

Packet drops

This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues

ping -c 100 <DESTINATION> reports the percentage of packets lost. Over 1% packet loss may be unacceptable
Packet loss can lead to slow performance, timeouts, and application errors
Check routers and switches for hardware issues and faulty cables
Check for NIC errors using ifconfig and ethtool
Adjust QoS settings or increase bandwidth
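
The packet loss check can be scripted by parsing the summary line that ping prints at the end of a run. This sketch assumes the common "N packets transmitted, M received" wording; other ping implementations may phrase it differently, and the function name and the sample line are illustrative.

```python
import re

# Assumed summary format; "packets received" is accepted as a variant.
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(ping_output: str) -> float:
    """Compute the loss percentage from the transmitted/received counts."""
    match = SUMMARY.search(ping_output)
    if not match:
        raise ValueError("ping summary line not found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


# Illustrative summary line.
sample = "100 packets transmitted, 98 received, 2% packet loss, time 99150ms"
print(packet_loss_percent(sample))
```

The returned value can be compared directly with the 1% guideline above.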

Random disconnects

This issue occurs when a network connection is unexpectedly terminated

Are users or services suddenly losing network access?
Look for connection reset or connection closed messages in the logs using dmesg or journalctl, and check interface error counters with ifconfig
Check network stack configuration
Maybe a cable or some other hardware in the path is faulty
Maybe a firewall is closing the connection
Check and adjust TCP settings if necessary

Random timeouts

This issue occurs when a connection fails to receive a response within the expected time frame.

Errors can be seen in logs
Getting a connection timed out error when using curl to connect to a service?
Maybe the network is congested
Maybe there is a DNS issue
Maybe the server is overloaded
A timeout threshold of 5-10 seconds is typically acceptable
Use ping or traceroute to check for network congestion
Make sure DNS servers are correctly configured
Check server performance
Adjust TCP timeout if necessary

Network Performance Issues

High latency

High latency refers to the delay in the time it takes for data to travel from one point to another

Measure latency using ping or traceroute
A 100ms latency in a local network is considered high and 300ms over remote communication is also high
Check for network congestion
Identify hardware issues
Maybe the routing is misconfigured
Optimize the network path
Maybe upgrading the network infrastructure could help

Jitter

Jitter is the variation of latency over time, which can cause problems in real-time applications.

A value above 30ms of fluctuation can cause noticeable issues
Detect jitter issues with ping -i 0.2 <DESTINATION> and watch how much the reported round-trip times vary
Check for network congestion or hardware issues
Implement QoS to prioritize relevant traffic
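
Jitter can be quantified from a series of ping round-trip times. The sketch below uses one simple definition, the mean absolute difference between consecutive samples; other definitions exist (for example smoothed estimators), so treat this choice as an assumption. The function names are hypothetical.

```python
import re

# Matches the "time=12.3 ms" part of a ping reply line.
RTT = re.compile(r"time=([\d.]+) ms")


def parse_rtts(ping_output: str):
    """Extract round-trip times in milliseconds from ping output."""
    return [float(value) for value in RTT.findall(ping_output)]


def mean_jitter(rtts) -> float:
    """Mean absolute difference between consecutive round-trip times."""
    if len(rtts) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return sum(diffs) / len(diffs)


# Illustrative round-trip times in milliseconds.
print(mean_jitter([10.0, 12.0, 11.0, 15.0]))
```

Feeding the captured output of the ping command above through parse_rtts and mean_jitter gives a single number to compare with the 30ms guideline.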

Slow response time

This issue occurs when the network takes too long to respond to requests.

This could be due to:

High latency
Congestion
Overloaded servers
Misconfigured applications
Use curl or wget to measure response time and identify bottlenecks in the network or a server
Check server load
Optimize application code
Check server resources (CPU, memory, disk)
Review network configurations

Low throughput

This issue occurs when the network is unable to transmit data at a high enough rate

Identify low throughput using iperf. Anything below 80% of the expected bandwidth is considered low throughput
Check for network congestion
Check for faulty cables
Check for incorrect settings
Maybe switch to a higher-bandwidth network
Maybe reduce unnecessary traffic
Optimize network routes

System Responsiveness Issues

Slow application response

This issue occurs when an application takes longer than expected to react to user inputs

Use top and htop to identify application resource consumption
Maybe the application code needs to be optimized
Increase system resources
Check disk I/O for overload
Check for unnecessary background processes

Sluggish terminal behavior

It happens when commands in the terminal are delayed. The system takes an unusually long time to execute commands.

Use top or iotop to check for system resource usage
Optimize processes running on the system
Cleanup system resources
Add more RAM or CPU cores

Slow startup

The system is taking an unusually long time to boot up

See which services take longer than expected using systemd-analyze blame
Maybe too many services are configured to start at the same time
Maybe one of the startup services is misconfigured?
Delay or disable non-essential services from starting at boot time using systemctl
Optimize the boot sequence

System unresponsiveness

This issue occurs when the system becomes completely unresponsive

Is the system not accepting new input?
Applications are no longer responding?
Use dmesg or journalctl to identify what caused a kernel panic
Identify runaway processes using top and htop
Maybe upgrade the RAM or add more CPU cores

Process Management Issues

Blocked processes

This issue occurs when a process is unable to proceed due to waiting on resources or system locks

Are commands or applications stuck?
ps and top show such processes in the D (uninterruptible sleep) state
Use lsof to check which file a process is waiting for
Use strace to trace system calls and signals
Is a process repeatedly stuck or blocked? Maybe due to resource contention
Optimize disk I/O
Maybe add more memory
Investigate dependency issues between processes
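
Processes in the D state can also be listed by reading /proc/<pid>/stat directly, which is where ps and top get the state letter. The sketch below assumes a Linux host; the function names are hypothetical and the sample line is illustrative.

```python
import os


def parse_stat(stat_text: str):
    """Split /proc/<pid>/stat content into (pid, command name, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so use the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def list_blocked(proc_root: str = "/proc"):
    """Return (pid, command) for every process currently in the D state."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or its stat file is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


# Illustrative stat line for a process whose name contains parentheses.
print(parse_stat("4321 (worker (io)) D 2 0 0"))
```

A process that appears in the list_blocked() result across several consecutive runs is a good candidate for closer inspection with lsof or strace.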

Exceeding baselines

This happens when processes consume more resources than expected

Notice high CPU usage
Unusual memory consumption
Excessive disk activity
Use top, htop, or pidstat to identify this issue
Optimize application resource usage
Maybe configure system resource limits with ulimit

High failed log-in attempts

This issue often signals attempted unauthorized access or brute force attacks

Maybe a brute force attack?
Unauthorized access attempts?
System compromise?
What are the logs in /var/log/auth.log saying?
Check journalctl to identify login attempts
5-10 failed login attempts from a single IP address within a short time may be a red flag for a brute force attack
Implement fail2ban to block abusive IPs
Enforce strong password policy
Use MFA
Limit access with firewall or IP allowlist
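
Counting failed logins per source IP can be sketched as a small log filter. This example assumes OpenSSH-style "Failed password for ... from <IP> port <N>" lines and, as a simplification, ignores the time window; the function names, the threshold of 5, and the sample lines are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed sshd message shape; other services log failures differently.
FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def failed_logins_by_ip(log_lines) -> Counter:
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_ips(counts: Counter, threshold: int = 5):
    """Addresses with at least `threshold` failures (threshold is an assumption)."""
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Illustrative log lines using documentation-range addresses.
sample = ["sshd[1]: Failed password for invalid user admin from 203.0.113.5 port 51234 ssh2"] * 6
sample.append("sshd[2]: Failed password for root from 198.51.100.7 port 40000 ssh2")
print(suspicious_ips(failed_logins_by_ip(sample)))
```

Tools such as fail2ban apply the same idea with a time window and automatic blocking, so this filter is only useful for a quick manual review.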