Linux: Troubleshooting Performance Issues

CPU Issues

High CPU usage

  • What process is using the CPU?
  • Use top or htop to see what process is using the CPU
  • Optimize the code or limit the number of processes running

High load average

This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.

  • Check output of uptime or top
  • Is the system overloaded?
  • Are there too many processes running simultaneously?
  • Or is it one process that is causing the backlog?
  • Which query should be optimized if it is a database process?
  • Maybe offload some tasks to another server
  • Maybe swap CPU with one with better specs
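The "is the system overloaded?" question above can be sketched as a comparison between load average and core count. This is a minimal Python sketch, not a definitive check: the function names are illustrative, and treating a ratio above 1.0 as a backlog is a rule of thumb.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def current_load_per_core() -> float:
    """Read the 1-minute load average of this machine (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)


if __name__ == "__main__":
    # Hypothetical sample: a load average of 6.0 on a 4-core machine.
    ratio = load_per_core(6.0, 4)
    print(f"load per core: {ratio:.2f}")
```

Call current_load_per_core() on the affected host to get the live value. A ratio that stays well above 1.0 suggests more runnable work than the CPUs can absorb.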

High context switching

A context switch occurs when the CPU stops running one process or thread and starts running another. The context is the saved state of a process (registers, program counter, and memory mappings) that allows the CPU to resume it later as it moves between states such as Running, Waiting, and Stopped.

Too many context switches lead to inefficiency and higher CPU usage.

  • Check context switching with vmstat (the cs column) or pidstat -w
  • How many context switches per second does the system perform?
  • More than 500-1000 context switches per second is sometimes cited as high, but busy multi-core servers routinely sustain far more, so compare against the system's normal baseline before concluding that it is overwhelmed
  • Maybe reduce the number of running processes
  • Optimize applications to use fewer threads
  • Adjust system limits
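The per-second rate mentioned above can also be derived from the cumulative ctxt counter in /proc/stat. The following Python sketch assumes two snapshots of that file taken a known interval apart; the function names and sample text are illustrative.

```python
def parse_ctxt(proc_stat_text: str) -> int:
    """Extract the cumulative context-switch counter from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def switches_per_second(before: str, after: str, interval_s: float) -> float:
    """Average context switches per second between two /proc/stat snapshots."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (parse_ctxt(after) - parse_ctxt(before)) / interval_s


# Hypothetical snapshots taken 2 seconds apart (most lines omitted).
SAMPLE_BEFORE = "cpu  100 0 100 800 0 0 0 0 0 0\nctxt 1000\nbtime 1700000000"
SAMPLE_AFTER = "cpu  200 0 150 1600 0 0 0 0 0 0\nctxt 4000\nbtime 1700000000"

if __name__ == "__main__":
    rate = switches_per_second(SAMPLE_BEFORE, SAMPLE_AFTER, 2.0)
    print(f"{rate:.0f} context switches per second")
```

On a live system, read /proc/stat twice with a sleep in between and pass both strings to switches_per_second.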

CPU bottleneck

This issue occurs when the CPU is the limiting factor in system performance.

  • The CPU usage is consistently high (above 80%)
  • Are tasks taking too long to process?
  • Load average exceeds the number of available CPU cores
  • Use top and htop to identify processes that are using an inordinate amount of CPU time
  • Optimize the processes using the CPU
  • Maybe a hardware upgrade is needed
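To check the "consistently above 80%" symptom without top, CPU usage can be approximated from two samples of the aggregate cpu line in /proc/stat. This Python sketch assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal) and counts idle plus iowait as not busy; the names are illustrative.

```python
def cpu_times(cpu_line: str) -> tuple:
    """Return (total_ticks, idle_ticks) from the aggregate 'cpu' line."""
    fields = [int(value) for value in cpu_line.split()[1:9]]  # user..steal
    idle = fields[3] + fields[4]  # idle + iowait
    return sum(fields), idle


def busy_percent(prev_line: str, curr_line: str) -> float:
    """Percentage of CPU time spent busy between two samples."""
    total_prev, idle_prev = cpu_times(prev_line)
    total_curr, idle_curr = cpu_times(curr_line)
    delta_total = total_curr - total_prev
    if delta_total <= 0:
        return 0.0
    delta_idle = idle_curr - idle_prev
    return 100.0 * (delta_total - delta_idle) / delta_total


# Hypothetical samples of the first line of /proc/stat.
SAMPLE_PREV = "cpu  100 0 100 800 0 0 0 0 0 0"
SAMPLE_CURR = "cpu  500 0 300 1000 200 0 0 0 0 0"

if __name__ == "__main__":
    print(f"busy: {busy_percent(SAMPLE_PREV, SAMPLE_CURR):.1f}%")
```

A single high reading is not a bottleneck; sampling repeatedly over several minutes shows whether the usage is sustained.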

Memory Issues

Swapping Issues

This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory

  • Is the system running out of swap space?
  • The swap file is typically located at /swapfile or on a dedicated swap partition
  • Identify swap space with swapon -s (or swapon --show)
  • Has the system performance degraded?
  • What do top and htop say about swap usage?
  • Is the usage more than 10%? That might be high
  • Monitor swap usage with free -h or vmstat
  • Maybe more physical RAM is needed
  • Or adjust the vm.swappiness kernel parameter, for example: sudo sysctl vm.swappiness=10
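The 10% swap usage check above can be computed from the SwapTotal and SwapFree lines of /proc/meminfo. This is a small Python sketch; the function name and sample text are illustrative, and the values in that file are assumed to be in kB.

```python
def swap_used_percent(meminfo_text: str) -> float:
    """Return the percentage of swap in use, given /proc/meminfo content."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])  # value in kB
    total = values["SwapTotal"]
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - values["SwapFree"]) / total


# Hypothetical excerpt of /proc/meminfo.
SAMPLE_MEMINFO = (
    "MemTotal:        8000000 kB\n"
    "SwapTotal:       2000000 kB\n"
    "SwapFree:        1500000 kB\n"
)

if __name__ == "__main__":
    print(f"swap used: {swap_used_percent(SAMPLE_MEMINFO):.1f}%")
```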

Out of Memory (OOM) Errors

This issue happens when the system runs out of both physical and virtual memory

  • Are critical processes being unexpectedly terminated?
  • Check the system logs for OOM killer messages, for example with dmesg or journalctl
  • Adjust application configurations to optimize memory usage
  • Increase RAM
  • Adjust swap space
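Searching the kernel log for OOM kills can be automated with a small filter. The Python sketch below assumes the common kernel message wording "Out of memory: Killed process PID (name)" (older kernels print "Kill process"); the function name and the sample log lines are made up for illustration.

```python
import re

# Matches both "Killed process" (newer kernels) and "Kill process" (older).
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, process_name) for each OOM kill found."""
    kills = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Hypothetical kernel log excerpt.
SAMPLE_LOG = [
    "[1234.5] usb 1-1: new high-speed USB device number 2",
    "[2345.6] Out of memory: Killed process 4321 (mysqld) total-vm:4194304kB",
]

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE_LOG):
        print(f"OOM killer terminated {name} (pid {pid})")
```

The input could be the lines of dmesg or journalctl -k output captured to a file.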

Disk I/O Issues

These issues occur when the system slows down due to delays in reading from or writing to a storage device.

High input/output wait time

This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, which delays the execution of processes.

  • Monitor disk load with iotop or dstat. top shows I/O wait as the wa value on its CPU line, and iostat reports %iowait
  • Optimize or spread out disk operations
  • Improve performance or upgrade to SSDs if needed

High disk latency

Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or an inefficient disk configuration.

  • What does iostat output say about the disk latency?
  • Is the latency above a typical 10 ms? Anything higher than 20 ms may indicate a latency issue
  • Is the disk operating at its maximum throughput?
  • Maybe some drivers need to be upgraded
  • Maybe adjusting the disk (RAID) config can help reduce the latency
  • Maybe upgrade to faster disks
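The latency figure that iostat reports can be approximated from two snapshots of /proc/diskstats: time spent on I/O divided by the number of I/Os completed in the interval. This Python sketch assumes the standard field layout (reads completed, ms reading, writes completed, ms writing at positions 4, 7, 8, and 11); the names and sample lines are illustrative.

```python
def parse_diskstats(text: str) -> dict:
    """Map device name -> (completed I/Os, milliseconds spent on I/O)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue
        ios = int(fields[3]) + int(fields[7])        # reads + writes completed
        millis = int(fields[6]) + int(fields[10])    # ms reading + ms writing
        stats[fields[2]] = (ios, millis)
    return stats


def average_latency_ms(before: str, after: str, device: str) -> float:
    """Average milliseconds per completed I/O between two snapshots."""
    ios_before, ms_before = parse_diskstats(before)[device]
    ios_after, ms_after = parse_diskstats(after)[device]
    completed = ios_after - ios_before
    if completed == 0:
        return 0.0  # the device was idle during the interval
    return (ms_after - ms_before) / completed


# Hypothetical /proc/diskstats lines for one device.
SAMPLE_BEFORE = "   8       0 sda 1000 0 8000 5000 500 0 4000 2500 0 3000 7500"
SAMPLE_AFTER = "   8       0 sda 1100 0 8800 6500 600 0 4800 5000 0 4000 11500"

if __name__ == "__main__":
    latency = average_latency_ms(SAMPLE_BEFORE, SAMPLE_AFTER, "sda")
    print(f"sda average latency: {latency:.1f} ms")
```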

Slow remote storage response

This issue occurs when accessing data from remote storage systems such as NFS, SAN, or cloud storage

  • Long response time when accessing files?
  • Check for network issues using ping or netperf
  • Optimize network settings or upgrade network hardware

Network Stability Issues

Packet drops

This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues

  • ping -c 100 <DESTINATION> reports the percentage of packets dropped. Over 1% packet loss may be unacceptable
  • Packet loss can lead to slow performance, timeouts, and application errors
  • Check routers and switches for hardware issues and faulty cables
  • Check for NIC errors using ifconfig and ethtool
  • Adjust QoS settings or increase bandwidth
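The packet loss figure can be pulled out of the ping summary for scripting or alerting. This Python sketch assumes the summary line format printed by the common Linux ping ("N% packet loss"); the function name and sample line are illustrative, and the 1% threshold is the one mentioned above.

```python
import re


def packet_loss_percent(ping_output: str) -> float:
    """Extract the packet loss percentage from ping's summary output."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet loss summary found")
    return float(match.group(1))


# Hypothetical summary line from: ping -c 100 <DESTINATION>
SAMPLE_SUMMARY = "100 packets transmitted, 98 received, 2% packet loss, time 99123ms"

if __name__ == "__main__":
    loss = packet_loss_percent(SAMPLE_SUMMARY)
    status = "investigate" if loss > 1.0 else "ok"
    print(f"packet loss {loss}% ({status})")
```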

Random disconnects

This issue occurs when a network connection is unexpectedly terminated

  • Are users or services suddenly losing network access?
  • Look for connection reset or connection closed messages in the logs using dmesg or journalctl, and check interface error counters with ifconfig
  • Check network stack configuration
  • Maybe the cable or some hardware in the path is faulty
  • Maybe a firewall is closing the connection
  • Check and adjust TCP settings if necessary

Random timeouts

This issue occurs when a connection fails to receive a response within the expected time frame.

  • Errors can be seen in logs
  • Does curl report a connection timed out error when connecting to a service?
  • Maybe the network is congested
  • Maybe there is a DNS issue
  • Maybe the server is overloaded
  • A timeout threshold of 5-10 seconds is typically acceptable
  • Use ping or traceroute to check for network congestion
  • Make sure DNS servers are correctly configured
  • Check server performance
  • Adjust TCP timeout if necessary

Network Performance Issues

High latency

High latency refers to the delay in the time it takes for data to travel from one point to another

  • Measure latency using ping or traceroute
  • A 100ms latency in a local network is considered high and 300ms over remote communication is also high
  • Check for network congestion
  • Identify hardware issues
  • Maybe the routing is misconfigured. Optimize the network path
  • Maybe upgrading the network infrastructure could help

Jitter

Jitter is the variation of latency over time, which can cause problems in real-time applications.

  • A value above 30ms of fluctuation can cause noticeable issues
  • Detect jitter issues with ping -i 0.2 <DESTINATION> and watch how much the reported round-trip times vary
  • Check for network congestion or hardware issues
  • Implement QoS to prioritize relevant traffic
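Jitter can be estimated from ping output as the average difference between consecutive round-trip times. This Python sketch uses that simple definition (one of several in use) and assumes the "time=X ms" format of the common Linux ping; the names and sample lines are illustrative.

```python
import re


def round_trip_times_ms(ping_output: str) -> list:
    """Extract the per-packet round-trip times from ping output."""
    return [float(value) for value in re.findall(r"time=(\d+(?:\.\d+)?) ms", ping_output)]


def mean_jitter_ms(rtts: list) -> float:
    """Mean absolute difference between consecutive round-trip times."""
    if len(rtts) < 2:
        return 0.0
    differences = [abs(later - earlier) for earlier, later in zip(rtts, rtts[1:])]
    return sum(differences) / len(differences)


# Hypothetical output lines from: ping -i 0.2 <DESTINATION>
SAMPLE_PING = (
    "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=10.0 ms\n"
    "64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=14.0 ms\n"
    "64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=12.0 ms\n"
)

if __name__ == "__main__":
    jitter = mean_jitter_ms(round_trip_times_ms(SAMPLE_PING))
    print(f"mean jitter: {jitter:.1f} ms")
```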

Slow response time

This issue occurs when the network takes too long to respond to requests.

This could be due to:

  • High latency
  • Congestion
  • Overloaded servers
  • Misconfigured applications

  • Use curl or wget to measure response time and identify bottlenecks in the network or a server

  • Check server load
  • Optimize application code
  • Check server resource usage
  • Review network configurations
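As an alternative to timing curl or wget by hand, response time can be measured from a script. This Python sketch wraps any call in a timer; timed and fetch_status are illustrative helper names, and the URL passed to fetch_status is whatever service is under investigation.

```python
import time
import urllib.request


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Fetch a URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status
```

For example, status, seconds = timed(fetch_status, url) gives the total time for one request. Repeating the call against the server's own address and against a remote client path helps separate server slowness from network slowness.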

Low throughput

This issue occurs when the network is unable to transmit data at a high enough rate

  • Identify low throughput using iperf. Anything below 80% of the expected bandwidth is considered low throughput
  • Check for network congestion
  • Check for faulty cables
  • Check for incorrect settings
  • Maybe switch to a higher-bandwidth network
  • Maybe reduce unnecessary traffic
  • Optimize network routes
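The 80% rule above is a simple ratio of measured to expected bandwidth. This Python sketch assumes the measured value comes from an iperf run and the expected value from the link's rated speed, both in the same unit; the function names are illustrative.

```python
def throughput_ratio(measured_mbps: float, expected_mbps: float) -> float:
    """Return measured throughput as a fraction of the expected bandwidth."""
    if expected_mbps <= 0:
        raise ValueError("expected bandwidth must be positive")
    return measured_mbps / expected_mbps


def is_low_throughput(measured_mbps: float, expected_mbps: float,
                      threshold: float = 0.8) -> bool:
    """True when throughput falls below the threshold (80% by default)."""
    return throughput_ratio(measured_mbps, expected_mbps) < threshold


if __name__ == "__main__":
    # Hypothetical measurement: 700 Mbit/s on a 1000 Mbit/s link.
    print("low throughput" if is_low_throughput(700, 1000) else "ok")
```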

System Responsiveness Issues

Slow application response

This issue occurs when an application takes longer than expected to react to user inputs

  • Use top and htop to identify application resource consumption
  • Maybe the application code needs to be optimized
  • Increase system resources
  • Check disk I/O for overload
  • Check for unnecessary background processes

Sluggish terminal behavior

This happens when commands in the terminal are delayed and the system takes an unusually long time to execute them.

  • Use top or iotop to check for system resource usage
  • Optimize processes running on the system
  • Cleanup system resources
  • Add more RAM or CPU cores

Slow startup

The system is taking an unusually long time to boot up

  • See which services take longer than expected using systemd-analyze
  • Maybe too many services are configured to start at the same time
  • Maybe one of the startup services is misconfigured?
  • Delay or disable non-essential services from starting at boot time using systemctl
  • Optimize the boot sequence
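The per-service timings printed by systemd-analyze blame can be parsed and ranked. This Python sketch assumes lines of the form "duration unit-name", with durations built from min, s, and ms parts; the function name and the sample output are illustrative.

```python
import re

SECONDS_PER_UNIT = {"min": 60.0, "s": 1.0, "ms": 0.001}


def parse_blame(blame_output: str) -> list:
    """Return (unit_name, seconds) pairs sorted from slowest to fastest."""
    results = []
    for line in blame_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        unit_name = parts[-1]
        duration_text = " ".join(parts[:-1])
        seconds = 0.0
        for amount, suffix in re.findall(r"(\d+(?:\.\d+)?)(min|ms|s)\b", duration_text):
            seconds += float(amount) * SECONDS_PER_UNIT[suffix]
        results.append((unit_name, seconds))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical output of: systemd-analyze blame
SAMPLE_BLAME = (
    "     12.345s NetworkManager-wait-online.service\n"
    " 1min 2.500s apt-daily.service\n"
    "       850ms ssh.service\n"
)

if __name__ == "__main__":
    for unit_name, seconds in parse_blame(SAMPLE_BLAME):
        print(f"{seconds:8.3f}s  {unit_name}")
```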

System unresponsiveness

This issue occurs when the system becomes completely unresponsive

  • Is the system not accepting new input?
  • Applications are no longer responding?
  • Use dmesg or journalctl to identify what caused a kernel panic
  • Identify runaway processes using top and htop
  • Maybe upgrade the RAM or add more CPU cores

Process Management Issues

Blocked processes

This issue occurs when a process is unable to proceed due to waiting on resources or system locks

  • Are commands or applications stuck?
  • ps and top show processes in the D (uninterruptible sleep) state
  • Use lsof to check which file a process is waiting for
  • Use strace to trace system calls and signals
  • Is a process repeatedly stuck or blocked? Maybe due to resource contention
  • Optimize disk I/O
  • Maybe add more memory
  • Investigate dependency issues between processes

Exceeding baselines

This happens when processes consume more resources than expected

  • Notice high CPU usage
  • Unusual memory consumption
  • Excessive disk activity
  • Use top, htop, or pidstat to identify this issue
  • Optimize application resource usage
  • Maybe configure system resource limits with ulimit

High failed log-in attempts

This issue often signals attempted unauthorized access or brute force attacks

  • Maybe a brute force attack?
  • Unauthorized access attempts?
  • System compromise?
  • What are logs in /var/log/auth.log saying?
  • Check journalctl to identify login attempts
  • 5-10 login attempts from a single IP address within a short time may be a red flag for brute force attack
  • Implement fail2ban to block abusive IPs
  • Enforce strong password policy
  • Use MFA
  • Limit access with firewall or IP allowlist
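Counting failed logins per source address can be sketched as follows. This Python example assumes the usual sshd "Failed password for ... from IP port N" message wording and ignores the time window for simplicity; the function names, the threshold default, and the sample log lines are illustrative.

```python
import re
from collections import Counter

FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def failed_logins_by_ip(log_lines) -> Counter:
    """Count failed SSH password attempts per source IP address."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_ips(log_lines, threshold: int = 5) -> list:
    """Return the IPs with at least `threshold` failed attempts, sorted."""
    counts = failed_logins_by_ip(log_lines)
    return sorted(ip for ip, attempts in counts.items() if attempts >= threshold)


# Hypothetical auth.log excerpt using documentation IP ranges.
SAMPLE_AUTH_LOG = [
    "sshd[100]: Failed password for invalid user admin from 203.0.113.5 port 40001 ssh2"
] * 5 + [
    "sshd[101]: Failed password for root from 198.51.100.7 port 40002 ssh2",
    "sshd[102]: Accepted password for alice from 192.0.2.10 port 40003 ssh2",
]

if __name__ == "__main__":
    print(suspicious_ips(SAMPLE_AUTH_LOG))
```

In practice fail2ban performs this kind of matching with a time window and applies the block automatically, so a script like this is mainly useful for a quick look at an existing log.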