Troubleshooting

Linux: Troubleshooting Performance Issues

CPU Issues

High CPU usage

  • What process is using the CPU?
  • Use top or htop to see what process is using the CPU
  • Optimize the code or limit the number of processes running

High load average

This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.

  • Check output of uptime or top
  • Is the system overloaded?
  • Are there too many processes running simultaneously?
  • Or is it one process that is causing the backlog?
  • Which query should be optimized if it is a database process?
  • Maybe offload some tasks to another server
  • Maybe swap CPU with one with better specs
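
A minimal sketch of the overload check above: divide the load average by the number of CPU cores. A sustained ratio above 1.00 suggests more runnable work than the cores can absorb. The helper name load_per_core is hypothetical.

```shell
# Hypothetical helper: divide a load average by a core count.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a live Linux system the inputs could come from /proc/loadavg and nproc:
#   load_per_core "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```

For example, a 1-minute load average of 8.0 on a 4-core machine gives 2.00, roughly twice the work the CPUs can run at once.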

High context switching

A context switch occurs when the CPU switches from one process to another in order to share CPU time among them. The context is the saved state of a process that allows the CPU to resume it later. A process is typically in one of three states: Running, Waiting, or Stopped.

Too many context switches lead to inefficiency and higher CPU usage.

  • Check the number of context switches per second with vmstat or pidstat
  • How many context switches per second do we have?
  • As a rough rule of thumb, more than 500-1000 context switches per second may be considered high and may indicate that the system is overwhelmed. Compare against the system's normal baseline, since busy multi-core servers routinely run higher
  • Maybe reduce the number of running processes
  • Optimize applications to use fewer threads
  • Adjust system limits
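
One way to get a context-switch rate from vmstat is to average its cs column over a few samples. The sketch below assumes the usual vmstat layout (two header rows, with a column literally named cs); the function name avg_cs is hypothetical.

```shell
# Hypothetical helper: average the "cs" column of vmstat output read from stdin.
avg_cs() {
  awk '
    # Header rows: look for the column named "cs" and remember its index.
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "cs") col = i; next }
    # The first data row reports averages since boot, so skip it.
    !skipped { skipped = 1; next }
    { sum += $col; n++ }
    END { if (n) printf "%d\n", sum / n }
  '
}

# Usage on a live system (six 1-second samples, the first is discarded):
#   vmstat 1 6 | avg_cs
```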

CPU bottleneck

This issue occurs when the CPU is the limiting factor in system performance.

  • The CPU usage is consistently high (above 80%)
  • Are tasks taking too long to process?
  • Load average exceeds the number of available CPU cores
  • Use top and htop to identify processes that are using an inordinate amount of CPU time
  • Optimize the processes using the CPU
  • Maybe a hardware upgrade is needed

Memory Issues

Swapping Issues

This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory

  • Is the system running out of swap space?
  • The swap file is typically located at /swapfile or on a dedicated swap partition
  • Identify swap space with swapon -s
  • Has the system performance degraded?
  • What do top and htop say about swap usage?
  • Is the usage more than 10%? That might be high
  • Monitor swap usage with free -h or vmstat
  • Maybe more physical RAM is needed
  • Or adjust the swappiness kernel value. ex: sudo sysctl vm.swappiness=10
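
To compare swap usage against the 10% guideline above, the percentage can be derived from the SwapTotal and SwapFree fields of /proc/meminfo. This is a sketch only; the function name swap_used_pct is hypothetical.

```shell
# Hypothetical helper: print the percentage of swap in use,
# reading /proc/meminfo-style text from stdin.
swap_used_pct() {
  awk '
    $1 == "SwapTotal:" { total = $2 }
    $1 == "SwapFree:"  { unused = $2 }
    END {
      if (total > 0) printf "%.1f\n", (total - unused) * 100 / total
      else print "0.0"   # no swap configured
    }
  '
}

# Usage on a live system:
#   swap_used_pct < /proc/meminfo
```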

Out of Memory (OOM) Errors

This issue happens when the system runs out of both physical and virtual memory

  • Are critical processes being unexpectedly terminated?
  • Check the system logs for OOM messages
  • Adjust application configurations to optimize memory usage
  • Increase RAM
  • Adjust swap space
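
When the kernel's OOM killer terminates a process, it logs a line containing "Killed process <PID> (<NAME>)". The sketch below pulls the PID and name out of such lines; the function name oom_victims is hypothetical, and the exact log wording can vary between kernel versions.

```shell
# Hypothetical helper: print "PID NAME" for each OOM kill found on stdin.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage on a live system (either source of kernel messages works):
#   journalctl -k | oom_victims
#   dmesg | oom_victims
```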

Disk I/O Issues

This issue occurs when the system slows down due to delays in reading from or writing to a storage device.

High input/output wait time

This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, causing delays in executing processes.

  • Monitor disk load with iotop or dstat. top shows I/O wait as the x.x wa value, and iostat gives %iowait
  • Optimize or spread out disk operations
  • Improve performance or upgrade to SSDs if needed

High disk latency

Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or an inefficient disk configuration.

  • What does iostat output say about the disk latency?
  • Is the latency above the normal 10ms? Latency higher than 20ms may indicate a problem
  • Is the disk operating at its maximum throughput?
  • Maybe some drivers need to be upgraded
  • Maybe adjusting the disk (RAID) config can help reduce the latency
  • Maybe upgrade to faster disks

Slow remote storage response

This issue occurs when accessing data from remote storage systems such as NFS, SAN, or cloud storage

  • Long response time when accessing files?
  • Use ping or netperf to check for network issues
  • Optimize network settings or upgrade network hardware

Network Stability Issues

Packet drops

This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues

  • ping -c 100 <DESTINATION> will give the percentage of packets dropped. Over 1% packet loss may be unacceptable
  • Packet loss can lead to slow performance, timeouts, and application errors
  • Check routers and switches for hardware issues and faulty cables
  • Check for NIC errors using ifconfig and ethtool
  • Adjust QoS settings or increase bandwidth
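
The loss figure can be extracted from the ping summary line for use in scripts or alerts. A minimal sketch, assuming the usual "N% packet loss" summary wording; the function name ping_loss_pct is hypothetical.

```shell
# Hypothetical helper: print the packet-loss percentage from ping output on stdin.
ping_loss_pct() {
  sed -n 's/.*[ ,]\([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Usage on a live system:
#   ping -c 100 <DESTINATION> | ping_loss_pct
```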

Random disconnects

This issue occurs when a network connection is unexpectedly terminated

  • Are users or services suddenly losing network access?
  • Look for connection reset or connection closed messages in the logs using dmesg or journalctl, and check interface error counters with ifconfig
  • Check network stack configuration
  • Maybe the cable is faulty or some hardware in the path is faulty
  • Maybe a firewall is closing the connection
  • Check and adjust TCP settings if necessary

Random timeouts

This issue occurs when a connection fails to receive a response within the expected time frame.

  • Errors can be seen in logs
  • Getting connection timed out error when using curl to connect to a service?
  • Maybe the network is congested
  • Maybe there is a DNS issue
  • Maybe the server is overloaded
  • A timeout threshold of 5-10 seconds is typically acceptable
  • Use ping or traceroute to check for network congestion
  • Make sure DNS servers are correctly configured
  • Check server performance
  • Adjust TCP timeout if necessary

Network Performance Issues

High latency

High latency refers to the delay in the time it takes for data to travel from one point to another

  • Measure latency using ping or traceroute
  • A latency of 100ms on a local network is considered high, and 300ms over remote communication is also high
  • Check for network congestion
  • Identify hardware issues
  • Maybe the routing is misconfigured. Optimize the network path
  • Maybe upgrading the network infrastructure could help

Jitter

Jitter is the variation of latency over time, which can cause problems in real-time applications.

  • A value above 30ms of fluctuation can cause noticeable issues
  • Detect jitter issues with ping -i 0.2 <DESTINATION>
  • Check for network congestion or hardware issues
  • Implement QoS to prioritize relevant traffic
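
Linux ping reports an mdev value (the deviation of round-trip times) in its final summary line, which can serve as a rough proxy for jitter. The sketch below extracts it; the function name ping_mdev is hypothetical, and mdev is an approximation rather than a formal jitter measurement.

```shell
# Hypothetical helper: print the last value of the "min/avg/max/mdev" summary line.
ping_mdev() {
  awk -F'/' '/min\/avg\/max/ { split($NF, last, " "); print last[1] }'
}

# Usage on a live system (50 probes, 0.2 seconds apart):
#   ping -i 0.2 -c 50 <DESTINATION> | ping_mdev
```

A result above the 30ms guideline mentioned above would be worth investigating.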

Slow response time

This issue occurs when the network takes too long to respond to requests.

This could be due to:

  • High latency
  • Congestion
  • Overloaded servers
  • Misconfigured applications

  • Use curl or wget to measure response time and identify bottlenecks in the network or a server

  • Check server load
  • Optimize application code
  • Check for server resources
  • Review network configurations

Low throughput

This issue occurs when the network is unable to transmit data at a high enough rate

  • Identify low throughput using iperf. Anything below 80% of the expected bandwidth is considered low throughput
  • Check for network congestion
  • Check for faulty cables
  • Check for incorrect settings
  • Maybe switch to a high bandwidth network
  • Maybe reduce unnecessary traffic
  • Optimize network routes

System Responsiveness Issues

Slow application response

This issue occurs when an application takes longer than expected to react to user inputs

  • Use top and htop to identify application resource consumption
  • Maybe the application code needs to be optimized
  • Increase system resources
  • Check disk I/O for overload
  • Check for unnecessary background processes

Sluggish terminal behavior

This happens when commands in the terminal are delayed. The system takes an unusually long time to execute commands.

  • Use top or iotop to check for system resource usage
  • Optimize processes running on the system
  • Clean up system resources
  • Add more RAM or CPU cores

Slow startup

The system is taking an unusually long time to boot up.

  • See which services take longer than expected using systemd-analyze
  • Maybe too many services are configured to start at the same time
  • Maybe one of the startup services is misconfigured?
  • Delay or disable non-essential services from starting at boot time using systemctl
  • Optimize the boot sequence

System unresponsiveness

This issue occurs when the system becomes completely unresponsive

  • Is the system not accepting new input?
  • Applications are no longer responding?
  • Use dmesg or journalctl to identify what caused a kernel panic
  • Identify runaway processes using top and htop
  • Maybe upgrade the RAM or add more CPU cores

Process Management Issues

Blocked processes

This issue occurs when a process is unable to proceed due to waiting on resources or system locks

  • Are commands or applications stuck?
  • ps and top show processes in the D (uninterruptible sleep) state
  • Use lsof to check which file a process is waiting for
  • Use strace to trace system calls and signals
  • Is a process repeatedly stuck or blocked? Maybe due to resource contention
  • Optimize disk I/O
  • Maybe add more memory
  • Investigate dependency issues between processes
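
A short sketch for listing processes in the D state from ps output. It assumes the column order produced by ps -eo pid,stat,comm; the function name d_state_procs is hypothetical.

```shell
# Hypothetical helper: print "PID COMMAND" for processes whose state starts with D.
# Expects "ps -eo pid,stat,comm" style input on stdin (header row first).
d_state_procs() {
  awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Usage on a live system:
#   ps -eo pid,stat,comm | d_state_procs
```

Processes that show up repeatedly across several runs are the ones worth inspecting with lsof or strace.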

Exceeding baselines

This happens when processes consume more resources than expected

  • Notice high CPU usage
  • Unusual memory consumption
  • Excessive disk activity
  • Use top, htop, or pidstat to identify this issue
  • Optimize application resource usage
  • Maybe configure system resource limits with ulimit

High failed log-in attempts

This issue often signals attempted unauthorized access or brute force attacks

  • Maybe a brute force attack?
  • Unauthorized access attempts?
  • System compromise?
  • What are logs in /var/log/auth.log saying?
  • Check journalctl to identify login attempts
  • 5-10 login attempts from a single IP address within a short time may be a red flag for brute force attack
  • Implement fail2ban to block abusive IPs
  • Enforce strong password policy
  • Use MFA
  • Limit access with firewall or IP allowlist
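
To see whether a single address is responsible for many failures, failed SSH password attempts can be counted per source IP. This sketch assumes the common sshd log wording "Failed password for ... from <IP> port ..."; the function name failed_logins_by_ip is hypothetical.

```shell
# Hypothetical helper: count "Failed password" lines per source IP, highest first.
failed_logins_by_ip() {
  awk '
    /Failed password/ {
      # The source address is the field that follows the word "from".
      for (i = 1; i < NF; i++) if ($i == "from") { count[$(i + 1)]++; break }
    }
    END { for (ip in count) print count[ip], ip }
  ' | sort -rn
}

# Usage on a Debian-based system:
#   sudo cat /var/log/auth.log | failed_logins_by_ip | head
```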

Linux: Troubleshooting Security Issues

SELinux Issues

SELinux policy issues

SELinux Policy defines what actions users and applications can perform on a system based on security rules.

An overly restrictive or misconfigured policy can prevent the system from working properly.

avc: denied is a typical error message found in logs when dealing with SELinux policy issues.

  • Review logs with ausearch or sealert
  • Modify rules if necessary
  • Test policy in a safe environment before applying

SELinux context issues

SELinux uses context to label every file, process, and resource on the system, determining what access is allowed.

An incorrect or misconfigured label can prevent applications from accessing the resources they need to function

  • Use ls -Z for files and ps -Z for processes to look for SELinux context issues
  • Does the file or process have incorrect context?
  • Restore the context with sudo restorecon -v <FILE PATH>
  • Running restorecon regularly on key directories helps avoid repeated context mislabeling issues

SELinux boolean issues

SELinux booleans allow adjustment of certain security settings without modifying the underlying policy.

An incorrectly set boolean can cause certain services or applications to malfunction

  • Check booleans with getsebool
  • Are certain booleans incorrectly set?
  • Toggle booleans with setsebool. ex: setsebool -P httpd_can_sendmail 1
  • Test modification and document changes

File and Directory Permission Issues

File attributes

File attributes control certain behaviors and restrictions on files and directories, which go beyond the regular rwx permissions.

  • Check file attributes with lsattr. i=immutable, a=append-only
  • Remove incorrect attribute with chattr. ex: chattr -i <FILE PATH>
  • Verify file access and document changes

Access Control Lists (ACLs)

ACLs provide more fine-grained control over who can access a file or directory and what actions can be performed.

  • Check if a file is using ACLs with getfacl
  • Adjust the ACLs with setfacl. ex: give read-only access to user tom with setfacl -m u:tom:r <FILE PATH>
  • Verify proper access and document changes

Access Issues

Account access issues

Most common issue

  • Are the credentials incorrect?
  • Maybe the account is locked or disabled
  • Check system logs for messages
  • Check if account is locked with sudo passwd -S tom
  • Unlock account with sudo passwd -u tom
  • Reset the user password with sudo passwd tom
  • Re-enable a disabled account with sudo usermod -e '' tom. '' means no account expiration date

Remote access issues

Issues with VPN or SSH

  • Is the issue caused by network issues, misconfigurations, or firewall?
  • Is the SSH service running? Check with sudo systemctl status sshd
  • Start and enable the SSH service with sudo systemctl start sshd && sudo systemctl enable sshd
  • Check firewall with sudo ufw status or sudo iptables -L
  • Does the problem still persist? Check routing and the validity of public keys

Certificate issues

Common messages: SSL certificate expired, SSL handshake failure

  • Is the certificate expired?
  • Maybe the certificate chains are misconfigured
  • Maybe it is a CA issue
  • Check certificate issues with openssl s_client -connect mysite.com:443
  • Renew the certificate if necessary
  • Ensure the full certificate chain is correctly installed

Configuration Issues

Exposed or misconfigured services

This issue occurs when system services are either left open to the public or configured incorrectly.

  • Does the service have proper security settings? The database should not be accessible from the internet
  • Review security logs
  • Use tools like nmap to scan open ports
  • Configure the firewall to restrict access to trusted IPs
  • Disable unused services
  • Ensure critical services are only accessible when necessary

Misconfigured package repositories

This issue prevents the system from accessing the correct software sources. It prevents software updates and installations.

  • What errors show when running sudo apt update or sudo dnf update
  • Check repository configuration files: /etc/apt/sources.list on Debian-based systems or /etc/yum.repos.d/ on RHEL-based systems
  • Edit the repository URL if necessary

Vulnerabilities

Vulnerabilities are weaknesses or flaws in the system that can be exploited by attackers to compromise security.

Unpatched vulnerable system

  • Do I have the latest security patches?
  • Use vulnerability scanners to detect security issues
  • Regularly apply updates with sudo apt update && sudo apt upgrade on Debian or sudo dnf update on RHEL.

The use of obsolete or insecure protocols and ciphers

  • Is the system using secure ciphers for data and communication protection?
  • Are insecure protocols and ciphers like SSLv3 and RC4 disabled on the system? SSLv3 is vulnerable to the POODLE attack, and RC4 is vulnerable to the RC4 bias attack
  • Check used protocols in sshd_config for SSH and apache2.conf for Apache.
  • Disable outdated protocols
  • Remove weak ciphers in the configuration files
  • Use strong ciphers like AES and protocols like TLS 1.2 and 1.3

Cipher negotiation issues

This issue occurs when the client and the server fail to negotiate a common encryption method.

Review connection logs to confirm both server and client are using strong encryption methods

Linux: Troubleshooting Networking Issues

Firewall Issues

Misconfigured Firewall

Typo in firewall rule

A simple typo in a firewall rule can block traffic.

Use firewall-cmd --list-ports to see open ports

Remove the bad rule with firewall-cmd --remove-port=<PORT>/<PROTOCOL> --permanent, re-issue the correct command, and reload the firewall with firewall-cmd --reload.

Incorrect Rule Ordering

This happens when a DROP or REJECT rule is placed above an ACCEPT rule, causing legitimate traffic to be blocked.

Forgetting to persist firewall changes across reboot

If a rule is added without --permanent the rule disappears after reboot.

Addressing Issues

DHCP issues

This issue occurs when servers or workstations fail to obtain an IP address automatically.

  • Is the DHCP service running at all?
  • Does the server have free IP addresses to allocate? Check for DHCP scope exhaustion by reviewing logs on the DHCP server.
  • Do I need to expand the pool?
  • Force the client to request an IP address again
  • Confirm connectivity
  • Update network documentation to reflect the change

IP conflicts

IP conflicts occur when two devices claim the same address, leading to intermittent connectivity or "duplicate address" warnings in syslog.

  • Common signs are random disconnects, slow network performance, or ARP conflict messages.
  • Identify all devices using the conflicting IP by checking the DHCP lease files and DNS records
  • Assign a unique address to one of the devices
  • Update any static configurations
  • Clear the ARP cache to ensure no stale entries remain
  • Monitor the network to confirm the conflict is gone

Dual stack issues

This issue occurs when a server configured for both IPv4 and IPv6 fails to handle traffic properly.

  • Ping test may fail for either IPv4 or IPv6
  • Do DNS records include both A and AAAA entries?
  • Adjust service configuration files to listen on both IPv4 and IPv6
  • Test connectivity over both protocols and ensure firewalls allow the appropriate traffic on each address family

Routing Issues

DNS issues

ping my.server.com returns unknown host

  • Confirm the DNS server in /etc/resolv.conf
  • Make changes if necessary
  • Is the DNS server reachable?
  • Test DNS resolution

Wrong gateway

  • Why are the packets not leaving the local network?
  • Can devices in different subnets communicate?
  • Can devices in other subnets communicate with external resources?
  • Check the default route with ip route show (or route -n)
  • Update default route if necessary
  • Ping external resources to confirm connectivity

Server unreachable

When a server is unreachable, neither the hostname nor the IP address responds to ping.

  • Use ip link to check if the network interface is up and running.
  • Check switch port and VLAN settings
  • Is the firewall blocking ICMP or SSH?
  • Adjust port, VLAN, and firewall rules if necessary
  • Confirm connectivity using ping or SSH

Interface Misconfiguration

Subnet misconfiguration

This issue occurs when an interface is assigned to the wrong network or network mask. That prevents the server from communicating with other devices in the network.

  • Confirm address settings with ip addr
  • Edit the interface's configuration so the IP address and netmask align with the correct network segment
  • Apply changes with netplan apply or systemctl restart networking
  • Ping a known host on the subnet and confirm that traffic works as it should

MTU mismatch

This happens when one endpoint sends packets larger than the receiving interface can handle.

  • A ping with a large payload and the don't-fragment flag, ex: ping -M do -s 1500 <DESTINATION>, fails with an error such as Frag needed and DF set
  • Check MTU on each interface with ip link show
  • Pick a consistent MTU value, which is often 1500 for standard networks, and update the interface configuration.
  • Retry the transfer or ping test to confirm connectivity
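
When testing a path MTU with ping over IPv4, the payload size passed to -s must leave room for the 20-byte IP header and the 8-byte ICMP header. A minimal sketch of that arithmetic follows; the function name icmp_payload_for_mtu is hypothetical.

```shell
# Hypothetical helper: largest ICMP payload that fits in one IPv4 packet of the given MTU.
# 28 = 20 bytes of IPv4 header + 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Usage on a live system (Linux iputils ping, don't-fragment flag set):
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 1500)" <DESTINATION>
```

If the largest payload for MTU 1500 fails but a smaller one succeeds, some hop on the path is likely using a smaller MTU.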

Cannot ping server

This often indicates a deeper interface misconfiguration, such as disabled interface, missing address, or firewall blocking ICMP

  • Is the interface up with a valid IP address?
  • Bring up the interface with ip link set <INTERFACE> up and assign the correct IP address
  • Is the firewall blocking ICMP? Use sudo ufw status or iptables -L to ensure that ICMP is not blocked
  • Ping again to confirm connectivity

Interface bonding issues

Interface bonding combines two or more physical NICs into a single virtual interface to increase bandwidth and provide redundancy.

  • Is any interface in /proc/net/bonding/bond0 marked down even though it is plugged in?
  • Common modes: Mode 0 (balance-rr), Mode 1 (active-backup), Mode 4 (802.3ad/LACP)
  • Is the bonding driver loaded?
  • Check the bonding configuration in either /etc/netplan...yaml on Ubuntu or /etc/sysconfig/network-scripts/ifcfg-bond0 on RHEL
  • Check switch setting to confirm matching valid configuration

MAC spoofing issues

This issue occurs when two NICs present the same MAC address.

  • arping <IP ADDRESS> returns multiple MAC addresses
  • Does ip neigh show frequent MAC flapping?
  • Look for duplicate MAC addresses with ip link show
  • Correct MAC settings
  • Restart the network service to apply changes, then confirm with the command ip neigh show

Link status issues

This issue occurs when devices are unable to communicate effectively due to a problem with the network interface.

  • The interface is failing to establish or maintain a connection
  • Maybe a faulty cable
  • The port is disconnected?
  • Maybe the hardware is faulty
  • ip addr and ifconfig show the interface as down
  • Logs are found with dmesg and journalctl
  • Maybe the driver is bad
  • Maybe the interface is misconfigured
  • Maybe the interface is administratively down. Use ip link show <INTERFACE> to confirm. Bring it up if necessary
  • Restart networking with systemctl restart network

Link negotiation issues

This issue involves problems in the automatic process where devices agree on the speed and duplex settings for their connection.

Common signs are poor performance, slow speeds, connectivity dropouts.

  • Check link status with ethtool <INTERFACE>
  • Is autonegotiation enabled?
  • Maybe the hardware has issues. Review system logs for related issues
  • Does the network driver have bugs and need to be updated?

Linux: Troubleshooting Hardware, Storage, and Linux OS

Troubleshooting steps:

  1. Identify the problem
  2. Establish a theory of probable cause
  3. Test the theory to confirm or refute it
  4. Establish a plan of action, implement the solution or escalate if needed, and then verify full system functionality
  5. Implement preventive measures to avoid recurrence and perform a root cause analysis

Boot Issues

Server Not Turning On

  • No power lights?
  • No fans?
  • No console output?
  • Do similar systems have the same issues?
  • Maybe the PDU is down?
  • Maybe the PSU has failed?
  • Check the power in the PDU
  • Swap in a known-good power cable
  • Plug another device into the same outlet
  • Still failing?
  • Inspect the PSU
  • Reseat connectors
  • Swap in a spare PSU
  • Verify the system powers on
  • Label cables
  • Schedule PSU health checks
  • Perform a root cause analysis

GRUB Misconfigurations

  • The server drops to a GRUB rescue prompt?
  • The server shows an error like "file not found"
  • Are multiple kernels failing?
  • Maybe /etc/default/grub was edited?
  • Maybe an initrd entry was deleted?
  • Use the GRUB CLI to probe available partitions
  • Verify the kernel and initramfs files are where GRUB expects them to be
  • Boot from rescue ISO or live environment
  • Mount the root filesystem
  • Correct the UUID or kernel path in /etc/default/grub
  • Regenerate the GRUB configuration: grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-based systems, or update-grub on Debian-based systems
  • Reboot and verify the kernel loads properly
  • Back up grub.cfg before modifications
  • Why did the issue occur in the first place?
  • A rushed update?
  • A lack of peer review?

Kernel Corruption Issues

  • Observing errors such as "bad magic number" or "kernel image corrupt" during boot
  • Check whether only the latest kernel version is affected or if other versions work from the GRUB menu
  • Maybe a package update failed mid-install
  • Maybe the /boot partition has disk errors
  • Boot into an older, working kernel
  • Mount /boot and verify file checksums. If checksums fail, the corruption is real
  • Reinstall the corrupted kernel package
  • Reboot to verify that the new kernel loads
  • Monitor disk health
  • Ensure updates are completed successfully
  • See if disk failure or an interrupted update was at fault

Missing or Disabled Drivers

  • Boot hangs or drops to an initramfs shell with errors like "VFS: Cannot open root device"
  • Check if only certain hardware (for example, a RAID controller) is missing in /dev or /sys
  • Maybe the initramfs was rebuilt without a necessary driver module
  • Maybe someone blacklisted a driver
  • Examine the initramfs contents with lsinitrd or dracut --list-modules to confirm if the driver is absent
  • Rebuild the initramfs including the required modules
  • Reboot to verify that the driver loads and the root filesystem is detected
  • Document driver dependencies in the build scripts
  • Automate initramfs rebuilds when kernel updates occur
  • Did a kernel package change or a manual configuration error cause the driver omission?

Kernel Panic Events

  • Read the panic message on the console
  • Does the panic happen on every boot or only after certain changes?
  • Maybe a newly added module is incompatible
  • Maybe the memory has gone bad
  • Try booting with a previous kernel
  • Run memtest86+
  • Disable suspect modules via the kernel boot line
  • Remove or update the offending module
  • Roll back to a known-good kernel
  • Replace faulty RAM
  • Reboot and verify full functionality
  • Maintain a reliable kernel testing process
  • Monitor hardware health
  • Keep a cross-tested module database
  • What was the root cause? Was the panic caused by a faulty driver, a hardware failure, or human error?

Filesystem Issues

Filesystem not mounting

  • The usual mount command returns errors
  • Scheduled backups and applications suddenly cannot access certain directories
  • Errors like unknown filesystem type, mount: wrong fs type, superblock corrupt in system logs
  • Boot into rescue mode or unmount any stale references, run fsck against the affected device, and inspect or repair the superblock if needed.
  • If the issue arises from /etc/fstab, work out the correct UUID or device path and test the mount manually before updating the fstab.
  • Does the filesystem now mount cleanly?
  • Confirm read/write access and update any monitoring dashboards to reflect that the volume is back online

Partition not writable

  • Processes failing with permission denied messages
  • Applications unable to save files even though the directories appear to exist
  • Maybe the filesystem is mounted read-only. Examine /proc/mounts to confirm the ro flag
  • Unmount the partition, run fsck to repair any underlying errors, and then remount it with the correct read-write permissions
  • Does the issue persist?
  • Inspect ownership and ACLs, then apply chmod or chown to grant the correct user or service write access
  • Update any configuration management scripts

OS filesystem is full

  • Applications and users are unable to write logs and files
  • Check partition usage to confirm issue
  • Truncate or rotate logs, clean up old core dumps, purge orphaned Docker images, or archive older data to a secondary storage
  • Extend the LVM volume or resize the partition, then resize the filesystem
  • Implement proactive monitoring for storage space
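
As a starting point for proactive monitoring, the sketch below flags filesystems at or above a usage threshold. It assumes the POSIX output format of df -P (capacity in column 5, mount point in column 6) and mount points without spaces; the function name fs_over_threshold is hypothetical.

```shell
# Hypothetical helper: print "MOUNTPOINT USE%" for filesystems at or above the limit.
# Expects "df -P" output on stdin; $1 is the threshold in percent.
fs_over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)             # strip the percent sign
      if (use + 0 >= limit) print $6, use "%"
    }
  '
}

# Usage on a live system:
#   df -P | fs_over_threshold 90
```

A check like this could run from cron and raise an alert whenever it prints anything.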

Inode exhaustion

  • df -h may show that space is available
  • Typical message: Cannot create file: No space left on device
  • Check df -i and see if inode count is at 100%
  • Identify directories with excessive file counts and then clean up old or stale files
  • Create a new file system with a higher inode ratio and then migrate the data if necessary
  • Update cleanup policies or add scripts to remove temporary files automatically, preventing a repeat of the issue
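
One way to find directories with excessive file counts is to count regular files per directory and sort the result. This is a sketch that assumes file names contain no newlines; the function name files_per_dir is hypothetical.

```shell
# Hypothetical helper: list the directories under $1 that hold the most regular files.
# $2 is the number of rows to show (default 10). -xdev stays on one filesystem.
files_per_dir() {
  find "$1" -xdev -type f 2>/dev/null |
    sed 's|/[^/]*$||' |        # strip the file name, keep the directory
    sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Usage on a live system:
#   sudo sh -c "$(declare -f files_per_dir 2>/dev/null); files_per_dir /var 10"
#   or simply, as root: files_per_dir /var 10
```

Each output line is a file count followed by the directory path, largest first.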

Quota issues

  • Individual user or group cannot write files despite free space in the partition
  • Typical message is Disk quota exceeded when creating or writing to a file
  • Use repquota -a and quota -u <USERNAME> to view group or user quotas
  • Adjust soft and hard limits if necessary
  • Identify and remove unnecessary data from the user's home or project directories

Process Issues

Unresponsive Processes

This occurs when a running program stops responding to inputs or system scheduling, causing tasks to hang indefinitely.

  • Does the process not respond to user input or system events?
  • Is it consuming more resources than usual?
  • Spot this with top and ps
  • Use strace to watch the process
  • Send SIGTERM to let the process shut down cleanly, and if that fails, escalate to SIGKILL to free resources by force
  • Examine journalctl to determine what caused the process to become unresponsive
  • Implement preventive measures

Killed Processes

This happens when a process is forcibly terminated by a signal.

  • Check journalctl and dmesg for the reason the process was killed
  • Logs may show Killed process <PID> or oom_reaper to indicate a killed process
  • Go through logs to determine whether the system or a person killed the process

Segmentation Fault

A crash that happens when a program tries to access memory it shouldn't, leading to an abrupt termination with an error message.

  • Configure system to generate and retain core dump
  • Use GNU Debugger to analyze the core file and pinpoint the faulty code path
  • Is the issue from a package? Reinstall a version of the package without that bug

Memory Leaks

Memory leaks occur when a program continuously allocates memory without freeing it, gradually exhausting available RAM and degrading system performance

  • Watch the RES memory rise steadily with no drop
  • Which process is reserving the memory? Review logs and output
  • Schedule periodic restarts of the service or allocate more RAM to reduce impact
  • Continue monitoring RES
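
The "rises steadily with no drop" pattern can be checked mechanically from a series of resident memory samples. The sketch below reads one sample per line and reports whether the series only ever grows; the function name rss_trend and the variable pid are hypothetical, and a rising trend over a short window is a hint rather than proof of a leak.

```shell
# Hypothetical helper: read one memory sample (in kB) per line from stdin and
# print "rising" if no sample is lower than the previous one and the last
# sample is higher than the first, otherwise "not rising".
rss_trend() {
  awk '
    NR == 1 { first = $1 }
    NR > 1 && $1 < prev { dropped = 1 }
    { prev = $1 }
    END {
      if (NR > 1 && !dropped && prev > first) print "rising"
      else print "not rising"
    }
  '
}

# Usage on a live system (five samples, one minute apart, for process $pid):
#   for i in 1 2 3 4 5; do ps -o rss= -p "$pid"; sleep 60; done | rss_trend
```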

System Issues

Device Failure

The server suddenly cannot read from or write to a critical piece of hardware, which is often a disk or network interface.

  • Identify the faulty device
  • Reseat or replace the device
  • If it is a RAID disk, mark the bad disk as failed and rebuild the array with a spare
  • Check disks and network to confirm full functionality of the system

Data corruption issues

Signs include files that refuse to open, application crashes, or filesystem errors in system logs.

  • Run fsck to detect corrupted data
  • Is there a known-good backup? Restore from backup
  • Use fsck with repair options to attempt recovery on the live server
  • What was the root cause? A failing disk? A power outage?
  • Verify full system functionality before it returns to production

Systemd unit failures

They occur when a service that should be running won't start or crashes immediately

  • Inspect service with systemctl status <SERVICE> or journalctl
  • Maybe edit the unit config in /etc/systemd/system/
  • Run systemctl daemon-reload to apply changes
  • Start service with systemctl start <SERVICE>
  • Set up alerts to catch unit failures

Server inaccessible

Users cannot remotely access the server.

  • Does ping time out?
  • Does SSH hang?
  • Out-of-band tool does not respond?
  • Are other servers in the network reachable?
  • Try physical access
  • Maybe reboot the machine, or restore network configs from backups, or repair corrupt network service files
  • Validate server is reachable again

Dependency Issues

Package Dependency Issues

Occurs when software cannot find or install the components it needs

  • Are the necessary repositories enabled?
  • Run dnf deplist <PACKAGE> or apt-cache depends <PACKAGE> to find missing dependencies
  • Upgrade or downgrade package if necessary
  • Rerun installation and verify software loads without issues

Path Misconfiguration Issues

Occurs when the system cannot locate a program despite it being installed

Typical error message is Command not found

  • Examine echo $PATH to check current search directories
  • Add missing directory by editing /etc/profile or similar
  • Reload shell or re-login to apply changes
  • Run command again to confirm the program is found
  • Document changes for future deployments
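
A small sketch for checking whether a directory is already part of the search path before editing /etc/profile. The function name dir_in_path and the directory /opt/tool/bin are hypothetical.

```shell
# Hypothetical helper: succeed if directory $1 is an entry in the
# colon-separated list $2 (defaults to the current $PATH).
dir_in_path() {
  case ":${2:-$PATH}:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Usage: only suggest the change when the directory is missing.
#   dir_in_path /opt/tool/bin || echo 'add: export PATH="$PATH:/opt/tool/bin"'
```

Wrapping the list in colons means only whole entries match, so /opt/tool is not mistaken for /opt/tool/bin.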