Linux: Troubleshooting Hardware, Storage, and Linux OS

Troubleshooting steps:

  1. Identify the problem
  2. Establish a theory of probable cause
  3. Test the theory to confirm or refute it
  4. Establish a plan of action, implement the solution or escalate if needed, and then verify full system functionality
  5. Implement preventive measures to avoid recurrence and perform a root cause analysis

Boot Issues

Server Not Turning On

  • No power lights?
  • No fans?
  • No console output?
  • Do similar systems have the same issues?
  • Maybe the PDU is down?
  • Maybe the PSU has failed?
  • Check the power in the PDU
  • Swap in a known-good power cable
  • Plug another device into the same outlet
  • Still failing?
  • Inspect the PSU
  • Reseat connectors
  • Swap in a spare PSU
  • Verify the system powers on
  • Label cables
  • Schedule PSU health checks
  • Perform a root cause analysis

GRUB Misconfigurations

  • The server drops to a GRUB rescue prompt?
  • The server shows an error like "file not found"?
  • Are multiple kernels failing?
  • Maybe /etc/default/grub was edited?
  • Maybe an initrd entry was deleted?
  • Use the GRUB CLI to probe available partitions
  • Verify the kernel and initramfs files are where GRUB expects them to be
  • Boot from a rescue ISO or live environment
  • Mount the root filesystem
  • Correct the UUID or kernel path in /etc/default/grub
  • Regenerate the GRUB configuration: grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-based systems, or update-grub on Debian-based systems
  • Reboot and verify the kernel loads properly
  • Back up grub.cfg before modifications
  • Why did the issue occur in the first place?
  • A rushed update?
  • A lack of peer review?

Kernel Corruption Issues

  • Observing errors such as "bad magic number" or "kernel image corrupt" during boot
  • Check whether only the latest kernel version is affected or if other versions work from the GRUB menu
  • Maybe a package update failed mid-install
  • Maybe the /boot partition has disk errors
  • Boot into an older, working kernel
  • Mount /boot and verify file checksums. If checksums fail, the corruption is real
  • Reinstall the corrupted kernel package
  • Reboot to verify that the new kernel loads
  • Monitor disk health
  • Ensure updates are completed successfully
  • See if disk failure or an interrupted update was at fault
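
The checksum step above can be sketched in Python. This is a minimal illustration, assuming you already have a trusted SHA-256 digest for the kernel file (for example, one taken from a known-good host). The function names sha256_of and is_corrupt are hypothetical helpers, not part of any tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_corrupt(path, expected_sha256):
    """True when the file's digest differs from the trusted value."""
    return sha256_of(path) != expected_sha256.strip().lower()
```

A mismatch only says the file differs from the reference. Whether the disk or an interrupted update is to blame still has to come from the logs.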

Missing or Disabled Drivers

  • Boot hangs or drops to an initramfs shell with errors like "VFS: Cannot open root device"
  • Check if only certain hardware (for example, a RAID controller) is missing in /dev or /sys
  • Maybe the initramfs was rebuilt without a necessary driver module
  • Maybe someone blacklisted a driver
  • Examine the initramfs contents with lsinitrd or dracut --list-modules to confirm whether the driver is absent
  • Rebuild the initramfs including the required modules
  • Reboot to verify that the driver loads and the root filesystem is detected
  • Document driver dependencies in the build scripts
  • Automate initramfs rebuilds when kernel updates occur
  • Did a kernel package change or a manual configuration error cause the driver omission?

Kernel Panic Events

  • Read the panic message on the console
  • Does the panic happen on every boot or only after certain changes?
  • Maybe a newly added module is incompatible
  • Maybe the memory has gone bad
  • Try booting with a previous kernel
  • Run memtest86+
  • Disable suspect modules via the kernel boot line
  • Remove or update the offending module
  • Roll back to a known-good kernel
  • Replace faulty RAM
  • Reboot and verify full functionality
  • Maintain a reliable kernel testing process
  • Monitor hardware health
  • Keep a cross-tested module database
  • What was the root cause? Did a faulty driver, hardware failure, or human error cause the panic?

Filesystem Issues

Filesystem not mounting

  • The usual mount command returns errors
  • Scheduled backups and applications suddenly cannot access certain directories
  • Errors like "unknown filesystem type", "mount: wrong fs type", or "superblock corrupt" appear in system logs
  • Boot into rescue mode or unmount any stale references, run fsck against the affected device, and inspect or repair the superblock if needed
  • If the issue arises from /etc/fstab, correct the UUID or device path and then test the mount manually before updating the fstab
  • Does the filesystem now mount cleanly?
  • Confirm read/write access and update any monitoring dashboards to reflect that the volume is back online
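
The fstab check above can be sketched as a comparison between the UUIDs that /etc/fstab references and the UUIDs the system actually sees. This is a minimal sketch, assuming the known UUIDs were collected separately (for example from blkid output or the names under /dev/disk/by-uuid). The helper names fstab_uuids and stale_entries are hypothetical.

```python
def fstab_uuids(fstab_text):
    """Yield (uuid, mount_point) for every UUID= entry in fstab text."""
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("UUID="):
            yield fields[0][len("UUID="):].strip('"'), fields[1]


def stale_entries(fstab_text, known_uuids):
    """Return fstab entries whose UUID is not among the known ones."""
    known = {uuid.lower() for uuid in known_uuids}
    return [
        (uuid, mount_point)
        for uuid, mount_point in fstab_uuids(fstab_text)
        if uuid.lower() not in known
    ]
```

Any entry the second function returns points at a device the system cannot currently find, which is a likely candidate for the failing mount.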

Partition not writable

  • Processes failing with "permission denied" messages
  • Applications unable to save files even though the directories appear to exist
  • Maybe the filesystem is mounted read-only. Examine /proc/mounts to confirm the ro flag
  • Unmount the partition, run fsck to repair any underlying errors, and then remount it with the correct read-write permissions
  • Does the issue persist?
  • Inspect ownership and ACLs, then apply chmod or chown to grant the correct user or service write access
  • Update any configuration management scripts
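
Checking /proc/mounts for the ro flag can be sketched as follows. The function name read_only_mounts is a hypothetical helper. It takes the file's text as input so it can be tried on a saved copy as well as on the live file.

```python
def read_only_mounts(mounts_text):
    """Return mount points whose option list contains the 'ro' flag."""
    result = []
    for line in mounts_text.splitlines():
        # Fields: device, mount point, fs type, options, dump, pass
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, options = fields[1], fields[3].split(",")
        if "ro" in options:  # exact match, so errors=remount-ro is ignored
            result.append(mount_point)
    return result
```

On a live system the input would come from reading /proc/mounts, for example with open("/proc/mounts").read().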

OS filesystem is full

  • Applications and users are unable to write logs and files
  • Check partition usage to confirm issue
  • Truncate or rotate logs, clean up old core dumps, purge orphaned Docker images, or archive older data to secondary storage
  • Extend the LVM volume or resize the partition, then resize the filesystem
  • Implement proactive monitoring for storage space
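
Proactive monitoring can start as small as a threshold check. The sketch below uses Python's shutil.disk_usage. The 90% default and the function names usage_percent and over_threshold are assumptions chosen for illustration.

```python
import shutil


def usage_percent(path):
    """Return used space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some pseudo-filesystems report no size
        return 0.0
    return 100.0 * usage.used / usage.total


def over_threshold(paths, threshold=90.0):
    """Return {path: percent} for filesystems at or above the threshold."""
    report = {}
    for path in paths:
        percent = usage_percent(path)
        if percent >= threshold:
            report[path] = round(percent, 1)
    return report
```

Calling over_threshold(["/", "/var"]) from a scheduled job and alerting on a non-empty result would be one way to wire it up.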

Inode exhaustion

  • df -h may show that space is available
  • Typical message: Cannot create file: No space left on device
  • Check df -i and see if inode usage is at 100%
  • Identify directories with excessive file counts and then clean up old or stale files
  • Create a new filesystem with a higher inode ratio and then migrate the data if necessary
  • Update cleanup policies or add scripts to remove temporary files automatically, preventing a repeat of the issue
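
The two checks above (inode usage and directories with excessive file counts) can be sketched with os.statvfs and os.walk. The function names inode_usage_percent and file_counts are hypothetical helpers.

```python
import os


def inode_usage_percent(path):
    """Return the percentage of inodes in use on the filesystem at `path`."""
    stats = os.statvfs(path)
    if stats.f_files == 0:  # some filesystems do not report inode totals
        return 0.0
    return 100.0 * (stats.f_files - stats.f_ffree) / stats.f_files


def file_counts(root):
    """Return (directory, file_count) pairs, largest count first."""
    counts = []
    for directory, _subdirs, files in os.walk(root):
        counts.append((directory, len(files)))
    return sorted(counts, key=lambda item: item[1], reverse=True)
```

The first few entries of file_counts("/var") would show where the small files are piling up. Walking a large tree can take a while.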

Quota issues

  • Individual user or group cannot write files despite free space in the partition
  • Typical message is Disk quota exceeded when creating or writing to a file
  • Use repquota -a and quota -u <USERNAME> to view group or user quotas
  • Adjust soft and hard limits if necessary
  • Identify and remove unnecessary data from the user's home or project directories

Process Issues

Unresponsive Processes

They occur when a running program stops responding to input or to system scheduling, causing tasks to hang indefinitely.

  • Doesn't respond to user input or system events?
  • Consumes more resources than usual?
  • Spot this with top and ps
  • Use strace to watch the process's system calls
  • Send SIGTERM to let the process shut down cleanly, and if that fails, escalate to SIGKILL to free resources by force
  • Examine journalctl to determine what caused the process to become unresponsive
  • Implement preventive measures
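
The SIGTERM-then-SIGKILL escalation can be sketched with Python's subprocess module. This is a minimal sketch, assuming the hung program was started as a child through subprocess.Popen. For an arbitrary PID, os.kill with signal.SIGTERM and signal.SIGKILL plays the same role. The function name stop_process and the five-second grace period are illustrative choices.

```python
import subprocess


def stop_process(proc, grace_seconds=5.0):
    """Ask a child process to exit with SIGTERM, then SIGKILL if it does not.

    Returns the name of the signal that was needed.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()  # reap the process so it does not linger as a zombie
        return "SIGKILL"
```

The grace period matters: SIGKILL cannot be caught, so the process gets no chance to flush files or release locks.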

Killed Processes

They happen when a process is forcibly terminated by a signal.

  • Check journalctl and dmesg for the reason the process was killed
  • Logs may show Killed process <PID> or oom_reaper entries when the out-of-memory (OOM) killer terminated the process
  • Go through logs to determine whether the system or a person killed the process
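
Scanning saved log text for the "Killed process" marker can be sketched with a regular expression. The function name oom_kills is hypothetical, and the pattern assumes the PID is followed by the process name in parentheses, so adjust it if your kernel's log lines differ.

```python
import re

# Matches e.g. "Killed process 4242 (java)" and captures PID and name.
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_kills(log_text):
    """Return (pid, name) for each 'Killed process' line in log text."""
    return [(int(pid), name) for pid, name in KILLED.findall(log_text)]
```

An empty result from the kernel log suggests the process was killed by something other than the OOM killer, such as an administrator or a supervising service.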

Segmentation Fault

A crash that happens when a program tries to access memory it shouldn't, leading to an abrupt termination with an error message.

  • Configure the system to generate and retain core dumps
  • Use the GNU Debugger (gdb) to analyze the core file and pinpoint the faulty code path
  • Is the issue from a package? Install a version of the package without that bug
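
One part of enabling core dumps is the per-process core size limit. The sketch below raises the soft limit to the hard limit for the current process and anything it launches afterwards. The function name enable_core_dumps is hypothetical, and where the dump is actually written still depends on the system's kernel.core_pattern setting.

```python
import resource


def enable_core_dumps():
    """Raise this process's soft core-size limit to its hard limit.

    Returns the (soft, hard) limits in effect after the change.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

Once a core file exists, gdb <PROGRAM> <COREFILE> opens it for inspection.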

Memory Leaks

Memory leaks occur when a program continuously allocates memory without freeing it, gradually exhausting available RAM and degrading system performance.

  • Watch the RES memory rise steadily with no drop
  • Which process is holding the memory? Review logs and monitoring output
  • Schedule periodic restarts of the service or allocate more RAM to reduce impact
  • Continue monitoring RES
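
Watching RES over time can be sketched by sampling the VmRSS value from /proc/<PID>/status and checking whether the samples only ever go up. This is a Linux-only sketch. The function names read_rss_kib and rises_steadily are hypothetical, and how many samples count as "steady" is a judgment call.

```python
def read_rss_kib(pid):
    """Return VmRSS for a PID in KiB, read from /proc/<pid>/status (Linux)."""
    with open(f"/proc/{pid}/status") as handle:
        for line in handle:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def rises_steadily(samples, min_growth_kib=0):
    """True when each sample exceeds the previous by more than min_growth_kib."""
    if len(samples) < 2:
        return False
    return all(b - a > min_growth_kib for a, b in zip(samples, samples[1:]))
```

Samples taken every few minutes over several hours give a more trustworthy signal than a short burst, since many services grow during warm-up and then level off.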

System Issues

Device Failure

The server suddenly cannot read from or write to a critical piece of hardware, which is often a disk or network interface.

  • Identify the faulty device
  • Reseat or replace the device
  • If it is a RAID disk, mark the bad disk as failed and rebuild the array with a spare
  • Check disks and network to confirm full functionality of the system

Data corruption issues

They show up as files that refuse to open, applications that crash, or filesystem errors in system logs.

  • Run fsck to detect corrupted data
  • Is there a known-good backup? Restore from it
  • If not, use fsck with repair options to attempt recovery, with the affected filesystem unmounted
  • What was the root cause? A failing disk? A power outage?
  • Verify full system functionality before it returns to production

Systemd unit failures

They occur when a service that should be running won't start or crashes immediately

  • Inspect service with systemctl status <SERVICE> or journalctl
  • Maybe edit the unit config in /etc/systemd/system/
  • Run systemctl daemon-reload to apply changes
  • Start service with systemctl start <SERVICE>
  • Set up alerts to catch unit failures

Server inaccessible

User cannot remotely access the server.

  • Does ping time out?
  • Does SSH hang?
  • Out-of-band tool does not respond?
  • Are other servers in the network reachable?
  • Try physical access
  • Maybe reboot the machine, or restore network configs from backups, or repair corrupt network service files
  • Validate server is reachable again

Dependency Issues

Package Dependency Issues

Occur when software cannot find or install the components it needs

  • Are the necessary repositories enabled?
  • Run dnf deplist <PACKAGE> or apt-cache depends <PACKAGE> to find missing dependencies
  • Upgrade or downgrade package if necessary
  • Rerun installation and verify software loads without issues

Path Misconfiguration Issues

Occur when the system cannot locate a program despite it being installed

Typical error message is Command not found

  • Examine echo $PATH to check current search directories
  • Add missing directory by editing /etc/profile or similar
  • Reload shell or re-login to apply changes
  • Run command again to confirm the program is found
  • Document changes for future deployments
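
The PATH checks above can be sketched with shutil.which, which searches a PATH string the same way the shell does. The function names missing_path_dirs and locate are hypothetical helpers. Passing the PATH value in explicitly makes it possible to test a proposed change before editing /etc/profile.

```python
import os
import shutil


def missing_path_dirs(path_value):
    """Return PATH entries that do not exist as directories."""
    return [
        entry
        for entry in path_value.split(os.pathsep)
        if entry and not os.path.isdir(entry)
    ]


def locate(command, path_value):
    """Return the full path of `command` under the given PATH, or None."""
    return shutil.which(command, path=path_value)
```

For the current shell environment, the PATH value would come from os.environ["PATH"].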