Linux: Troubleshooting Hardware, Storage, and Linux OS

Troubleshooting steps:

Identify the problem
Establish a theory of probable cause
Test the theory to confirm or refute it
Establish a plan of action, implement the solution or escalate if needed, and then verify full system functionality
Implement preventive measures to avoid recurrence and perform a root cause analysis

Boot Issues

Server Not Turning On

No power lights?
No fans?
No console output?
Do similar systems have the same issues?
Maybe the PDU is down?
Maybe the PSU has failed?
Check the power in the PDU
Swap in a known-good power cable
Plug another device into the same outlet
Still failing?
Inspect the PSU
Reseat connectors
Swap in a spare PSU
Verify the system powers on
Label cables
Schedule PSU health checks
Perform a root cause analysis

GRUB Misconfigurations

The server drops to a GRUB rescue prompt?
The server shows an error like "file not found"?
Are multiple kernels failing?
Maybe /etc/default/grub was edited?
Maybe an entry in initrd was deleted?
Use the GRUB CLI to probe available partitions
Verify the kernel and initramfs files are where GRUB expects them to be
Boot from rescue ISO or live environment
Mount the root filesystem
Correct the UUID or kernel path in /etc/default/grub
Regenerate the GRUB configuration: grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-based systems, or update-grub on Debian-based systems
Reboot and verify the kernel loads properly
Back up grub.cfg before modifications
Why did the issue occur in the first place?
A rushed update?
A lack of peer review?

Kernel Corruption Issues

Observing errors such as "bad magic number" or "kernel image corrupt" during boot
Check whether only the latest kernel version is affected or if other versions work from the GRUB menu
Maybe a package update failed mid-install
Maybe the /boot partition has disk errors
Boot into an older, working kernel
Mount /boot and verify file checksums. If checksums fail, the corruption is real
Reinstall the corrupted kernel package
Reboot to verify that the new kernel loads
Monitor disk health
Ensure updates are completed successfully
See if disk failure or an interrupted update was at fault
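
The checksum step above can be sketched in Python. This is a minimal illustration, not a definitive procedure: it assumes you already have a known-good SHA-256 digest for the kernel image (for example from an identical healthy host), and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large kernel images are not loaded at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """True if the file's digest equals the known-good digest (assumed input)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A mismatch for a file under /boot supports the theory that the image on disk is corrupt. On package-managed systems, rpm -V <PACKAGE> or debsums can perform a similar comparison against the package database.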

Missing or Disabled Drivers

Boot hangs or drops to an initramfs shell with errors like "VFS: Cannot open root device"
Check if only certain hardware (for example, a RAID controller) is missing in /dev or /sys
Maybe the initramfs was rebuilt without a necessary driver module
Maybe someone blacklisted a driver
Examine the initramfs contents with lsinitrd or dracut --list-modules to confirm if the driver is absent
Rebuild the initramfs including the required modules
Reboot to verify that the driver loads and the root filesystem is detected
Document driver dependencies in the build scripts
Automate initramfs rebuilds when kernel updates occur
Did a kernel package change or a manual configuration error cause the driver omission?

Kernel Panic Events

Read the panic message on the console
Does the panic happen on every boot or only after certain changes?
Maybe a newly added module is incompatible
Maybe the memory has gone bad
Try booting with a previous kernel
Run memtest86+
Disable suspect modules via the kernel boot line
Remove or update the offending module
Roll back to a known-good kernel
Replace faulty RAM
Reboot and verify full functionality
Maintain a reliable kernel testing process
Monitor hardware health
Keep a cross-tested module database
What was the root cause? Did a faulty driver, hardware failure, or human error cause the panic?

Filesystem Issues

Filesystem not mounting

The usual mount command returns errors
Scheduled backups and applications suddenly cannot access certain directories
Errors like "unknown filesystem type", "mount: wrong fs type", or "superblock corrupt" appear in system logs
Boot into rescue mode or unmount any stale references, run fsck against the affected device, and inspect or repair the superblock if needed.
If the issue arises from /etc/fstab, correct the UUID or device path and then test the mount manually before updating the fstab.
Does the system now mount cleanly?
Confirm read/write access and update any monitoring dashboards to reflect that the volume is back online
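
When /etc/fstab is the suspect, a quick cross-check of its UUID= entries against the UUIDs the system actually sees can narrow things down. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it only handles entries written as UUID=..., not LABEL= or device paths.

```python
def fstab_uuids(fstab_text):
    """Return (uuid, mount_point) pairs for every UUID= entry in fstab text."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("UUID="):
            uuid = fields[0][len("UUID="):].strip('"')
            entries.append((uuid, fields[1]))
    return entries


def missing_uuids(fstab_text, known_uuids):
    """Return fstab entries whose UUID is not among the UUIDs present on the system."""
    known = set(known_uuids)
    return [(u, mp) for u, mp in fstab_uuids(fstab_text) if u not in known]
```

On a live system, the text would come from /etc/fstab and the known UUIDs from os.listdir("/dev/disk/by-uuid") or the output of blkid. Any entry returned points at a stale or mistyped UUID.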

Partition not writable

Processes failing with a "permission denied" message
Applications unable to save files even though the directories appear to exist
Maybe the filesystem is mounted read-only. Examine /proc/mounts to confirm the ro flag
Unmount the partition, run fsck to repair any underlying errors, and then remount it with the correct read-write permissions
Does the issue persist?
Inspect ownership and ACLs, then apply chmod or chown to grant the correct user or service write access
Update any configuration management scripts
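
Checking /proc/mounts for the ro flag can be scripted. The following is a small sketch with a hypothetical function name. It assumes the usual /proc/mounts layout, where the fourth whitespace-separated field holds the comma-separated mount options.

```python
def read_only_mounts(mounts_text):
    """Return (mount_point, fs_type) for every entry whose options include 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mounts line
        _device, mount_point, fs_type, options = fields[:4]
        # Split on commas so 'errors=remount-ro' is not mistaken for 'ro'.
        if "ro" in options.split(","):
            result.append((mount_point, fs_type))
    return result
```

On a live system the input would be the contents of /proc/mounts. Some filesystems, such as squashfs images, are legitimately read-only, so the output needs a human look before any remount.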

OS filesystem is full

Applications and users are unable to write logs and files
Check partition usage to confirm issue
Truncate or rotate logs, clean up old core dumps, purge orphaned Docker images, or archive older data to secondary storage
Extend the LVM volume or resize the partition, then resize the filesystem
Implement proactive monitoring for storage space
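
Proactive monitoring can start as a simple threshold check. This is a minimal sketch with hypothetical names and an assumed 90% default threshold. Production monitoring would normally live in a dedicated tool.

```python
import shutil


def percent_used(total, used):
    """Used space as a percentage of total; 0.0 for an empty or unknown total."""
    return 0.0 if total == 0 else used / total * 100


def over_threshold(path, threshold=90.0):
    """Return (is_over, percent) for the filesystem containing `path`.

    Note: df computes Use% from used and available blocks, so its figure can
    differ slightly from used/total when blocks are reserved for root.
    """
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct >= threshold, pct
```

Calling over_threshold("/var") from a scheduled job and alerting when the first element is True would catch a filling partition before writes start to fail.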

Inode exhaustion

df -h may show that space is available
Typical message: Cannot create file: No space left on device
Check df -i to see whether inode usage is at 100%
Identify directories with excessive file counts and then clean up old or stale files
Create a new filesystem with a higher inode ratio and then migrate the data if necessary
Update cleanup policies or add scripts to remove temporary files automatically, preventing a repeat of the issue
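
The inode check and the hunt for directories with excessive file counts can be sketched as follows. The function names are hypothetical. The sketch counts only non-directory entries per directory, which is usually enough to spot the directory that is hoarding small files.

```python
import os


def inode_usage_percent(path):
    """Percentage of inodes in use on the filesystem containing `path`.

    Returns None when the filesystem reports no inode total.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    return (st.f_files - st.f_ffree) / st.f_files * 100


def file_counts(root):
    """Return (directory, file_count) pairs under `root`, largest count first."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[dirpath] = len(filenames)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

Printing the first few entries of file_counts("/var") typically surfaces session, cache, or mail spool directories that need a cleanup policy.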

Quota issues

Individual user or group cannot write files despite free space in the partition
Typical message is Disk quota exceeded when creating or writing to a file
Use repquota -a and quota -u <USERNAME> to view group or user quotas
Adjust soft and hard limits if necessary
Identify and remove unnecessary data from the user's home or project directories

Process Issues

Unresponsive Processes

This occurs when a running program stops responding to input or system scheduling, causing tasks to hang indefinitely.

Does not respond to user input or system events?
Consumes more resources?
Spot this with top and ps
Use strace to watch the process
Send SIGTERM to let the process shut down cleanly, and if that fails, escalate to SIGKILL to free resources by force
Examine journalctl to determine what caused the process to become unresponsive
Implement preventive measures
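
The SIGTERM-then-SIGKILL escalation can be sketched for a child process started from Python. The function name and the five-second grace period are assumptions. For an unrelated process the same idea applies with kill -TERM <PID> followed, if needed, by kill -KILL <PID>.

```python
import subprocess


def stop_process(proc, grace_seconds=5.0):
    """Ask a subprocess.Popen child to exit, escalating to SIGKILL if it does not.

    Returns the name of the last signal sent: 'SIGTERM' or 'SIGKILL'.
    """
    proc.terminate()  # sends SIGTERM so the process can clean up
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return "SIGKILL"
```

Giving the process a grace period first matters because SIGKILL skips any cleanup the program would otherwise perform, such as flushing buffers or removing lock files.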

Killed Processes

This happens when a process is forcibly terminated by a signal.

Check journalctl and dmesg for the reason the process was killed
Logs may show Killed process <PID> or oom_reaper to indicate a killed process
Go through logs to determine whether the system or a person killed the process
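
Scanning saved log output for the Killed process <PID> pattern can be automated. This is a rough sketch: the function name is hypothetical, and the pattern assumes kernel messages of the form "Killed process <PID> (<NAME>)", which may vary between kernel versions.

```python
import re

# Matches e.g. "Killed process 4242 (java)" and captures the PID and name.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def oom_kills(log_text):
    """Return (pid, process_name) for each kill message found in the log text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL.finditer(log_text)]
```

The input could be the captured output of dmesg or journalctl -k. An empty result suggests the kernel did not log a kill, so the next step is checking who sent the signal.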

Segmentation Fault

A crash that happens when a program tries to access memory it shouldn't, leading to an abrupt termination with an error message.

Configure the system to generate and retain core dumps
Use GNU Debugger to analyze the core file and pinpoint the faulty code path
Is the issue from a package? Reinstall a version of the package without that bug

Memory Leaks

Memory leaks occur when a program continuously allocates memory without freeing it, gradually exhausting available RAM and degrading system performance.

Watch the RES memory rise steadily with no drop
Who is reserving the memory? Review logs and output
Schedule periodic restarts of the service or allocate more RAM to reduce impact
Continue monitoring RES
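
Watching RES by hand works, but a trend check is easy to script. The sketch below uses hypothetical function names and assumes samples are taken from the VmRSS line of /proc/<PID>/status at regular intervals. The five-sample minimum is an arbitrary choice for illustration.

```python
def parse_vmrss_kb(status_text):
    """Extract the VmRSS value in kB from /proc/<PID>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def is_steadily_rising(samples, min_samples=5):
    """True if the samples never drop and end higher than they started."""
    if len(samples) < min_samples:
        return False  # too few points to call it a trend
    no_drop = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return no_drop and samples[-1] > samples[0]
```

A True result is only a hint: caches that warm up also grow at first, so sampling over a longer window helps separate a leak from normal startup behavior.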

System Issues

Device Failure

The server suddenly cannot read from or write to a critical piece of hardware, which is often a disk or network interface.

Identify the faulty device
Reseat or replace the device
If it is a RAID disk, mark the bad disk as failed and rebuild the array with a spare
Check disks and network to confirm full functionality of the system

Data corruption issues

These show up as files that refuse to open, applications that crash, or filesystem errors in system logs.

Run fsck to detect corrupted data
Is there a known-good backup? Restore from backup
Use fsck with repair options to attempt recovery on the live server
What was the root cause? A failing disk? A power outage? ...
Verify full system functionality before it returns to production

Systemd unit failures

They occur when a service that should be running won't start or crashes immediately

Inspect service with systemctl status <SERVICE> or journalctl
Maybe edit the unit config in /etc/systemd/system/
Run systemctl daemon-reload to apply changes
Start service with systemctl start <SERVICE>
Set up alerts to catch unit failures

Server inaccessible

User cannot remotely access the server.

Does ping time out?
Does SSH hang?
Out-of-band tool does not respond?
Are other servers in the network reachable?
Try physical access
Maybe reboot the machine, or restore network configs from backups, or repair corrupt network service files
Validate server is reachable again

Dependency Issues

Package Dependency Issues

Occur when software cannot find or install the components it needs

Are the necessary repositories enabled?
Run dnf deplist <PACKAGE> or apt-cache depends <PACKAGE> to find missing dependencies
Upgrade or downgrade package if necessary
Rerun installation and verify software loads without issues

Path Misconfiguration Issues

Occur when the system cannot locate a program despite it being installed

Typical error message is Command not found

Examine echo $PATH to check current search directories
Add missing directory by editing /etc/profile or similar
Reload shell or re-login to apply changes
Run command again to confirm the program is found
Document changes for future deployments
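
The PATH checks above can be sketched in Python using shutil.which, which performs the same directory-by-directory lookup a shell does. The helper names are hypothetical.

```python
import os
import shutil


def locate(command, path_value):
    """Return the full path of `command` found via `path_value`, or None."""
    return shutil.which(command, path=path_value)


def missing_path_dirs(path_value):
    """Return PATH entries that do not exist as directories on this system."""
    return [d for d in path_value.split(os.pathsep) if d and not os.path.isdir(d)]
```

Calling locate("mytool", os.environ.get("PATH", "")) with the name of the failing command shows whether the lookup would succeed, and missing_path_dirs highlights stale entries left behind by earlier edits to /etc/profile or similar files.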