Linux: Troubleshooting Performance Issues
CPU Issues
High CPU usage
- What process is using the CPU?
- Use `top` or `htop` to see which process is using the CPU
- Optimize the code or limit the number of processes running
High load average
This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.
- Check the output of `uptime` or `top`
- Is the system overloaded?
- Are there too many processes running simultaneously?
- Or is it one process that is causing the backlog?
- If it is a database process, which query should be optimized?
- Maybe offload some tasks to another server
- Maybe swap CPU with one with better specs
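
To judge whether the load is too high for the machine, it helps to divide the load average by the number of CPU cores. The sketch below assumes text in the format of `/proc/loadavg`; the function names and the sample values are made up for illustration, and using the 5-minute average for the verdict is an assumption.

```python
def load_per_core(loadavg_text, cores):
    """Return the 1-, 5-, and 15-minute load averages divided by the core count."""
    one, five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    return tuple(round(v / cores, 2) for v in (one, five, fifteen))


def is_overloaded(loadavg_text, cores):
    """True when the 5-minute load average exceeds the number of cores (assumed rule)."""
    return load_per_core(loadavg_text, cores)[1] > 1.0


# Sample text in the format of /proc/loadavg (illustrative values only).
sample = "6.40 3.20 1.20 3/512 12345"
print(load_per_core(sample, cores=4))   # (1.6, 0.8, 0.3)
print(is_overloaded(sample, cores=2))   # True
```

On a real system the inputs could come from `open("/proc/loadavg").read()` and `os.cpu_count()`.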
High context switching
A context switch is when the CPU switches from one process to another to share resources between them. A context refers to the saved state of a process that allows the CPU to resume it later; the process itself may be `Running`, `Waiting`, or `Stopped`.
Too many context switches lead to inefficiency and higher CPU usage.
- Check context switching with `vmstat` or `pidstat`. Check the number of context switches per second.
- How many context switches per second do we have?
- More than 500-1000 context switches per second may be considered high and may indicate that the system is overwhelmed, although busy systems routinely exceed this, so compare against the system's own baseline
- Maybe reduce the number of running processes
- Optimize applications to use fewer threads
- Adjust system limits
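
The context switch rate can also be derived by hand from the cumulative `ctxt` counter in `/proc/stat`, sampled twice. This is a minimal sketch with hypothetical function names and made-up sample values; it is not how `vmstat` is implemented, only the same arithmetic.

```python
def read_ctxt(stat_text):
    """Extract the cumulative context-switch counter from /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def ctxt_per_second(before_text, after_text, interval_s):
    """Context switches per second between two samples taken interval_s apart."""
    return (read_ctxt(after_text) - read_ctxt(before_text)) / interval_s


# Two illustrative samples taken 5 seconds apart (values are made up).
before = "cpu  1000 0 500 8000 500 0 0 0 0 0\nctxt 1000000\nbtime 1700000000"
after = "cpu  1850 0 650 8100 500 0 0 0 0 0\nctxt 1004500\nbtime 1700000000"
print(ctxt_per_second(before, after, 5))   # 900.0
```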
CPU bottleneck
This issue occurs when the CPU is the limiting factor in system performance.
- The CPU usage is consistently high (above 80%)
- Are tasks taking too long to process?
- Load average exceeds the number of available CPU cores
- Use `top` and `htop` to identify processes that are using an inordinate amount of CPU time
- Optimize the processes using the CPU
- Maybe a hardware upgrade is needed
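
To check the 80% figure without an interactive tool, overall CPU usage can be computed from two samples of the aggregate `cpu` line in `/proc/stat`. A rough sketch, assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice); the function names and sample values are hypothetical.

```python
def cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def cpu_busy_percent(before_text, after_text):
    """Percentage of time the CPUs were not idle between the two samples."""
    before, after = cpu_times(before_text), cpu_times(after_text)
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas[:8])          # guest time is already counted in user/nice
    idle = deltas[3] + deltas[4]     # idle + iowait
    if total == 0:
        return 0.0
    return round(100.0 * (total - idle) / total, 1)


# Illustrative samples (values are made up).
before = "cpu  1000 0 500 8000 500 0 0 0 0 0"
after = "cpu  1850 0 650 8100 500 0 0 0 0 0"
print(cpu_busy_percent(before, after))   # 90.9
```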
Memory Issues
Swapping Issues
This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory
- Is the system running out of swap space?
- The swap file is typically located at `/swapfile` or on a dedicated swap partition
- Identify swap space with `swapon -s`
- Has the system performance degraded?
- What do `top` and `htop` say about swap usage?
- Is the usage more than 10%? That might be high
- Monitor swap usage with `free -h` or `vmstat`
- Maybe more physical RAM is needed
- Or adjust the swappiness kernel value, e.g. `sudo sysctl vm.swappiness=10`
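
The swap usage percentage can be computed from the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`. A minimal sketch; the function name and the sample numbers are made up, and the 10% comparison follows the rule of thumb above.

```python
def swap_used_percent(meminfo_text):
    """Percentage of swap in use, from text in the format of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])   # values are reported in kB
    total = values.get("SwapTotal", 0)
    if total == 0:
        return 0.0                               # no swap configured
    return round(100.0 * (total - values["SwapFree"]) / total, 1)


# Illustrative excerpt (values are made up).
sample = "MemTotal:  8000000 kB\nSwapTotal: 2000000 kB\nSwapFree:  1700000 kB"
used = swap_used_percent(sample)
print(used, "high" if used > 10 else "ok")   # 15.0 high
```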
Out of Memory (OOM) Errors
This issue happens when the system runs out of both physical and virtual memory
- Are critical processes being unexpectedly terminated?
- Check the system logs for OOM messages
- Adjust application configurations to optimize memory usage
- Increase RAM
- Adjust swap space
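
Searching the kernel log for OOM killer messages can be scripted. The sketch below assumes the common `Out of memory: Killed process <pid> (<name>)` wording; the exact text varies between kernel versions, and the sample line and function name are illustrative only.

```python
import re

# Matches both "Kill process" (older kernels) and "Killed process" wordings.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a (pid, process name) pair for each OOM killer message found."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_PATTERN.finditer(log_text)]


# Illustrative log excerpt (not taken from a real system).
sample = (
    "[100.000001] usb 1-1: new high-speed USB device\n"
    "[12345.678901] Out of memory: Killed process 4321 (java) "
    "total-vm:8123456kB, anon-rss:4012345kB\n"
)
print(find_oom_kills(sample))   # [(4321, 'java')]
```

The input could be the output of `dmesg` or `journalctl -k`.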
Disk I/O Issues
This issue occurs when the system slows down due to delays in reading from or writing to a storage device.
High input/output wait time
This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, causing delays in executing processes.
- Monitor disk load with `iotop` or `dstat`. `top` shows I/O wait as the `x.x wa` value. `iostat` gives `%iowait`
- Optimize or spread out disk operations
- Improve performance or upgrade to SSDs if needed
High disk latency
Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or an inefficient disk configuration.
- What does the `iostat` output say about the disk latency?
- Is the latency above the normal 10ms? Higher than 20ms may indicate a latency issue
- Is the disk operating at its maximum throughput?
- Maybe some drivers need to be upgraded
- Maybe adjusting the disk (RAID) config can help reduce the latency
- Maybe upgrade to faster disks
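
Average latency per request can be estimated from two samples of `/proc/diskstats`: time spent on reads and writes divided by the number of completed requests. This is roughly the arithmetic behind the await figures that `iostat` reports, sketched here with hypothetical function names and made-up counters.

```python
def disk_counters(diskstats_text, device):
    """Return (completed I/Os, ms spent on I/O) for one device in /proc/diskstats."""
    for line in diskstats_text.splitlines():
        f = line.split()
        if len(f) >= 11 and f[2] == device:
            reads, read_ms = int(f[3]), int(f[6])
            writes, write_ms = int(f[7]), int(f[10])
            return reads + writes, read_ms + write_ms
    raise ValueError("device not found: " + device)


def average_latency_ms(before_text, after_text, device):
    """Mean milliseconds per completed request between the two samples."""
    ios_before, ms_before = disk_counters(before_text, device)
    ios_after, ms_after = disk_counters(after_text, device)
    ios = ios_after - ios_before
    if ios == 0:
        return 0.0
    return round((ms_after - ms_before) / ios, 2)


# Illustrative samples for a device named "sda" (values are made up).
before = "   8       0 sda 1000 0 80000 4000 2000 0 160000 16000 0 9000 20000"
after = "   8       0 sda 1100 0 88000 5500 2400 0 192000 26000 0 12000 31500"
print(average_latency_ms(before, after, "sda"))   # 23.0
```

In this made-up sample the result is above the 20ms mark mentioned above.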
Slow remote storage response
This issue occurs when access to data on remote storage systems such as NFS, SAN, or cloud storage is slow
- Long response time when accessing files?
- Use `ping` or `netperf` to check for network issues
- Optimize network settings or upgrade network hardware
Network Stability Issues
Packet drops
This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues
- `ping -c 100 <DESTINATION>` will give the percentage of packets dropped. Over 1% packet loss may be unacceptable
- Packet loss can lead to slow performance, timeouts, and application errors
- Check routers and switches for hardware issues and faulty cables
- Check for NIC errors using `ifconfig` and `ethtool`
- Adjust QoS settings or increase bandwidth
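
The packet loss figure can be pulled out of the `ping` summary line for use in a script or alert. A small sketch assuming the summary format printed by the common Linux `ping`; the function name and the sample output are illustrative.

```python
import re


def packet_loss_percent(ping_output):
    """Extract the packet loss percentage from ping's summary line."""
    match = re.search(r"([\d.]+)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet loss summary found")
    return float(match.group(1))


# Illustrative summary (not from a real run).
sample = "100 packets transmitted, 97 received, 3% packet loss, time 99142ms"
loss = packet_loss_percent(sample)
print(loss, "investigate" if loss > 1 else "ok")   # 3.0 investigate
```

The real output could be captured with `subprocess.run(["ping", "-c", "100", host], capture_output=True, text=True).stdout`.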
Random disconnects
This issue occurs when a network connection is unexpectedly terminated
- Are users or services suddenly losing network access?
- Look for `connection reset` or `connection closed` messages in logs using `dmesg`, and check interface error counters with `ifconfig`
- Check network stack configuration
- Maybe the cable or some hardware in the path is faulty
- Maybe a firewall is closing the connection
- Check and adjust TCP settings if necessary
Random timeouts
This issue occurs when a connection fails to receive a response within the expected time frame.
- Errors can be seen in logs
- Getting a `connection timed out` error when using `curl` to connect to a service?
- Maybe the network is congested
- Maybe there is a DNS issue
- Maybe the server is overloaded
- A timeout threshold of 5-10 seconds is typically acceptable
- Use `ping` or `traceroute` to check for network congestion
- Make sure DNS servers are correctly configured
- Check server performance
- Adjust TCP timeout if necessary
Network Performance Issues
High latency
Latency is the time it takes for data to travel from one point to another; high latency means that delay is long enough to hurt performance
- Measure latency using `ping` or `traceroute`
- A 100ms latency in a local network is considered high and 300ms over remote communication is also high
- Check for network congestion
- Identify hardware issues
- Maybe the routing is misconfigured. Optimize the network path
- Maybe upgrading the network infrastructure could help
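
The round-trip summary that `ping` prints can be parsed and compared against the 100ms and 300ms figures above. A sketch assuming the `rtt min/avg/max/mdev = ...` line of the common Linux `ping`; the function names and sample values are made up.

```python
import re


def rtt_summary(ping_output):
    """Return min/avg/max/mdev in milliseconds from ping's final summary line."""
    m = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", ping_output)
    if m is None:
        raise ValueError("no rtt summary found")
    return dict(zip(("min", "avg", "max", "mdev"), map(float, m.groups())))


def latency_is_high(avg_ms, local_network):
    """Apply the 100ms (local) or 300ms (remote) rule of thumb."""
    limit = 100.0 if local_network else 300.0
    return avg_ms >= limit


# Illustrative summary line (not from a real run).
sample = "rtt min/avg/max/mdev = 98.120/142.500/210.300/35.100 ms"
stats = rtt_summary(sample)
print(stats["avg"], latency_is_high(stats["avg"], local_network=True))   # 142.5 True
```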
Jitter
Jitter is the variation of latency over time, which can cause problems in real-time applications.
- A value above 30ms of fluctuation can cause noticeable issues
- Detect jitter issues with `ping -i 0.2 <DESTINATION>`
- Check for network congestion or hardware issues
- Implement QoS to prioritize relevant traffic
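
One simple way to put a number on jitter is the mean absolute difference between consecutive `ping` reply times. This is only one possible definition (it is not the smoothed estimator used by RTP, for example); the function names and sample replies are hypothetical.

```python
import re


def reply_times_ms(ping_output):
    """Collect the time=... values from individual ping replies."""
    return [float(t) for t in re.findall(r"time=([\d.]+) ms", ping_output)]


def jitter_ms(times):
    """Mean absolute difference between consecutive round-trip times."""
    if len(times) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(times, times[1:])]
    return round(sum(diffs) / len(diffs), 2)


# Illustrative replies (values are made up).
sample = (
    "64 bytes from 192.0.2.10: icmp_seq=1 ttl=57 time=20.0 ms\n"
    "64 bytes from 192.0.2.10: icmp_seq=2 ttl=57 time=55.0 ms\n"
    "64 bytes from 192.0.2.10: icmp_seq=3 ttl=57 time=22.0 ms\n"
    "64 bytes from 192.0.2.10: icmp_seq=4 ttl=57 time=60.0 ms\n"
)
print(jitter_ms(reply_times_ms(sample)))   # 35.33
```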
Slow response time
This issue occurs when the network takes too long to respond to requests.
This could be due to:
- High latency
- Congestion
- Overloaded servers
- Misconfigured applications
- Use `curl` or `wget` to measure response time and identify bottlenecks in the network or a server
- Check server load
- Optimize application code
- Check for server resources
- Review network configurations
Low throughput
This issue occurs when the network is unable to transmit data at a high enough rate
- Identify low throughput by using `iperf`; anything below 80% of the expected bandwidth is considered low throughput
- Check for network congestion
- Check for faulty cables
- Check for incorrect settings
- Maybe switch to a high bandwidth network
- Maybe reduce unnecessary traffic
- Optimize network routes
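
The 80% rule is a simple ratio of measured to expected bandwidth. A minimal sketch with hypothetical function names; the figures are made up and would normally come from an `iperf` run and the link's rated speed.

```python
def throughput_ratio(measured_mbps, expected_mbps):
    """Fraction of the expected bandwidth that was actually achieved."""
    return measured_mbps / expected_mbps


def is_low_throughput(measured_mbps, expected_mbps, threshold=0.8):
    """True when the measured rate falls below the threshold share of the expected rate."""
    return throughput_ratio(measured_mbps, expected_mbps) < threshold


# Illustrative figures: a 1000 Mbit/s link measured at 640 Mbit/s.
print(throughput_ratio(640, 1000))    # 0.64
print(is_low_throughput(640, 1000))   # True
```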
System Responsiveness Issues
Slow application response
This issue occurs when an application takes longer than expected to react to user inputs
- Use `top` and `htop` to identify application resource consumption
- Maybe the application code needs to be optimized
- Increase system resources
- Check disk I/O for overload
- Check for unnecessary background processes
Sluggish terminal behavior
This issue occurs when commands in the terminal are delayed: the system takes an unusually long time to execute them.
- Use `top` or `iotop` to check system resource usage
- Optimize processes running on the system
- Clean up system resources
- Add more RAM or CPU cores
Slow startup
The system is taking an unusually long time to boot up
- See which services take longer than expected using `systemd-analyze`
- Maybe too many services are configured to start at the same time
- Maybe one of the startup services is misconfigured?
- Delay or disable non-essential services from starting at boot time using `systemctl`
- Optimize the boot sequence
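
The per-unit timings from `systemd-analyze blame` can be parsed to list the slowest services. This sketch assumes the usual layout of that subcommand's output (a duration such as `5.123s`, `234ms`, or `1min 2.345s` followed by the unit name); the function names, sample output, and the 5-second cutoff are made up for illustration.

```python
import re

UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "min": 60.0, "h": 3600.0}


def parse_blame(blame_output):
    """Return (unit name, startup seconds) pairs from systemd-analyze blame text."""
    results = []
    for line in blame_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        *spans, unit = parts
        seconds = 0.0
        for span in spans:
            m = re.fullmatch(r"([\d.]+)(ms|s|min|h)", span)
            if m is None:
                break                      # not a duration, skip this line
            seconds += float(m.group(1)) * UNIT_SECONDS[m.group(2)]
        else:
            results.append((unit, round(seconds, 3)))
    return results


def slow_units(blame_output, threshold_s):
    """Names of units that took at least threshold_s seconds to start."""
    return [unit for unit, secs in parse_blame(blame_output) if secs >= threshold_s]


# Illustrative output (not from a real system).
sample = (
    "1min 2.345s NetworkManager-wait-online.service\n"
    "     5.123s docker.service\n"
    "      234ms ssh.service\n"
)
print(slow_units(sample, threshold_s=5))
```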
System unresponsiveness
This issue occurs when the system becomes completely unresponsive
- Is the system not accepting new input?
- Are applications no longer responding?
- Use `dmesg` or `journalctl` to identify what caused a kernel panic
- Identify runaway processes using `top` and `htop`
- Maybe upgrade the RAM or add more CPU cores
Process Management Issues
Blocked processes
This issue occurs when a process is unable to proceed due to waiting on resources or system locks
- Are commands or applications stuck?
- `ps` and `top` show blocked processes in the `D` (uninterruptible sleep) state
- Use `lsof` to check which file a process is waiting for
- Use `strace` to trace system calls and signals
- Is a process repeatedly stuck or blocked? Maybe due to resource contention
- Optimize disk I/O
- Maybe add more memory
- Investigate dependency issues between processes
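
Listing processes in the `D` state can be automated by filtering `ps` output. A minimal sketch, assuming output in the format of `ps -eo pid,stat,comm`; the function name and the sample process list are illustrative.

```python
def uninterruptible(ps_output):
    """Return (pid, command) for processes whose state starts with D."""
    blocked = []
    for line in ps_output.splitlines()[1:]:      # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1].startswith("D"):
            blocked.append((int(parts[0]), parts[2]))
    return blocked


# Illustrative output of `ps -eo pid,stat,comm` (made up).
sample = (
    "    PID STAT COMMAND\n"
    "      1 Ss   systemd\n"
    "   2345 D    rsync\n"
    "   2400 D+   dd\n"
    "   3000 R+   ps\n"
)
print(uninterruptible(sample))   # [(2345, 'rsync'), (2400, 'dd')]
```

The PIDs returned could then be passed to `lsof -p` or `strace -p` for a closer look.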
Exceeding baselines
This happens when processes consume more resources than expected
- Notice high CPU usage
- Unusual memory consumption
- Excessive disk activity
- Use `top`, `htop`, or `pidstat` to identify this issue
- Optimize application resource usage
- Maybe configure system resource limits with `ulimit`
Security-related Performance Issues
High failed log-in attempts
This issue often signals attempted unauthorized access or brute force attacks
- Maybe a brute force attack?
- Unauthorized access attempts?
- System compromise?
- What are the logs in `/var/log/auth.log` saying?
- Check `journalctl` to identify login attempts
- 5-10 failed login attempts from a single IP address within a short time may be a red flag for a brute force attack
- Implement `fail2ban` to block abusive IPs
- Enforce a strong password policy
- Use MFA
- Limit access with firewall or IP allowlist
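
Counting failed logins per source address gives a quick view of who is hammering the system. The sketch below assumes the common sshd `Failed password for ... from <ip> port ...` wording in the auth log; it ignores timestamps, so it does not enforce the "short time" part of the rule, and the function names and sample lines are made up.

```python
import re
from collections import Counter

FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port")


def failed_logins_by_ip(log_text):
    """Count failed password attempts per source IP address."""
    return Counter(FAILED.findall(log_text))


def suspicious_ips(log_text, threshold=5):
    """IPs with at least `threshold` failed attempts in the given log text."""
    counts = failed_logins_by_ip(log_text)
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Illustrative log lines using documentation-only IP ranges.
lines = [
    "Jan 10 03:14:%02d host sshd[811]: Failed password for invalid user admin "
    "from 203.0.113.7 port 40022 ssh2" % i
    for i in range(6)
]
lines.append(
    "Jan 10 03:20:00 host sshd[902]: Failed password for root "
    "from 198.51.100.4 port 51514 ssh2"
)
print(suspicious_ips("\n".join(lines)))   # ['203.0.113.7']
```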