Linux: Troubleshooting Hardware, Storage, and Linux OS
Troubleshooting steps:
- Identify the problem
- Establish a theory of probable cause
- Test the theory to confirm or refute it
- Establish a plan of action, implement the solution or escalate if needed, and then verify full system functionality
- Implement preventive measures to avoid recurrence and perform a root cause analysis
Boot Issues
Server Not Turning On
- No power lights?
- No fans?
- No console output?
- Do similar systems have the same issues?
- Maybe the PDU is down?
- Maybe the PSU has failed?
- Check the power in the PDU
- Swap in a known-good power cable
- Plug another device into the same outlet
- Still failing?
- Inspect the PSU
- Reseat connectors
- Swap in a spare PSU
- Verify the system powers on
- Label cables
- Schedule PSU health checks
- Perform a root cause analysis
GRUB Misconfigurations
- The server drops to a GRUB rescue prompt?
- The server shows an error like "file not found"?
- Are multiple kernels failing?
- Maybe `/etc/default/grub` was edited?
- Maybe an `initrd` entry was deleted?
- Use the GRUB CLI to probe available partitions
- Verify the kernel and initramfs files are where GRUB expects them to be
- Boot from rescue ISO or live environment
- Mount the root filesystem
- Correct the UUID or kernel path in `/etc/default/grub`
- Regenerate the GRUB configuration: `grub2-mkconfig -o /boot/grub2/grub.cfg` on RHEL-based systems, or `update-grub` on Debian
- Reboot and verify the kernel loads properly
- Back up `grub.cfg` before modifications
- Why did the issue occur in the first place?
- A rushed update?
- A lack of peer review?
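The backup and regeneration steps above could be scripted roughly as follows. This is a minimal sketch, not a definitive procedure: the function names `backup_file` and `grub_regen_cmd` are made up for illustration, and the config path is the RHEL-style one quoted above, which may differ on UEFI systems.

```shell
# Copy a file to a timestamped backup next to it and print the backup path.
backup_file() {
    local src=$1
    local dest
    dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
    cp -p -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Print the regeneration command that matches the tools installed on this
# host. Prints nothing and returns 1 if neither tool is found.
grub_regen_cmd() {
    if command -v grub2-mkconfig >/dev/null 2>&1; then
        echo "grub2-mkconfig -o /boot/grub2/grub.cfg"
    elif command -v update-grub >/dev/null 2>&1; then
        echo "update-grub"
    else
        return 1
    fi
}
```

A typical session might run `backup_file /boot/grub2/grub.cfg` first, then review the output of `grub_regen_cmd` before running it as root. Printing the command rather than executing it keeps a human in the loop, which suits a change that can leave a server unbootable.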
Kernel Corruption Issues
- Observing errors such as "bad magic number" or "kernel image corrupt" during boot
- Check whether only the latest kernel version is affected or whether other versions work from the GRUB menu
- Maybe a package update failed mid-install
- Maybe the `/boot` partition has disk errors
- Boot into an older, working kernel
- Mount `/boot` and verify file checksums. If the checksums fail, the corruption is real
- Reinstall the corrupted kernel package
- Reboot to verify that the new kernel loads
- Monitor disk health
- Ensure updates are completed successfully
- See if disk failure or an interrupted update was at fault
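The checksum verification step can be sketched with a manifest recorded while the system is healthy. This assumes GNU coreutils `sha256sum`; the function names and the idea of keeping a manifest outside `/boot` are illustrative assumptions, not something the notes above prescribe.

```shell
# Record SHA-256 checksums for every regular file directly under a directory.
# Paths in the manifest are relative to that directory, so keep the manifest
# itself somewhere else.
record_checksums() {
    local dir=$1 manifest=$2
    ( cd "$dir" && find . -maxdepth 1 -type f -exec sha256sum {} + ) > "$manifest"
}

# Verify files against a manifest. A non-zero exit status means at least one
# file is missing or no longer matches its recorded checksum.
verify_checksums() {
    local dir=$1 manifest=$2
    ( cd "$dir" && sha256sum -c --quiet ) < "$manifest"
}
```

For example, `record_checksums /boot /root/boot.sha256` after a clean kernel install, and `verify_checksums /boot /root/boot.sha256` when corruption is suspected. On RPM-based systems, `rpm -V` with the kernel package name offers a comparable check against the package database.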
Missing or Disabled Drivers
- Boot hangs or drops to an initramfs shell with errors like "VFS: Cannot open root device"
- Check if only certain hardware (for example, a RAID controller) is missing in `/dev` or `/sys`
- Maybe the initramfs was rebuilt without a necessary driver module
- Maybe someone blacklisted a driver
- Examine the initramfs contents with `lsinitrd`, or list dracut modules with `dracut --list-modules`, to confirm whether the driver is absent
- Rebuild the initramfs including the required modules
- Reboot to verify that the driver loads and the root filesystem is detected
- Document driver dependencies in the build scripts
- Automate initramfs rebuilds when kernel updates occur
- Did a kernel package change or a manual configuration error cause the driver omission?
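To test the "someone blacklisted a driver" theory, a small helper can search modprobe-style configuration files. This is a sketch under assumptions: `find_blacklisted` is a made-up name, `/etc/modprobe.d` is assumed as the default location, and `megaraid_sas` below is only an example module.

```shell
# Print every "blacklist <module>" line found in config files under the given
# directory (default: /etc/modprobe.d), prefixed with the file name.
# Returns non-zero when no such line exists.
find_blacklisted() {
    local module=$1 dir=${2:-/etc/modprobe.d}
    grep -rHE "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*(#.*)?\$" "$dir" 2>/dev/null
}
```

For instance, `find_blacklisted megaraid_sas` shows which file disables that driver, if any. Once the cause is removed, the image might be rebuilt with something like `dracut --force --add-drivers megaraid_sas` on dracut-based systems or `update-initramfs -u` on Debian-based ones.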
Kernel Panic Events
- Read the panic message on the console
- Does the panic happen on every boot or only after certain changes?
- Maybe a newly added module is incompatible
- Maybe the memory has gone bad
- Try booting with a previous kernel
- Run `memtest86+`
- Disable suspect modules via the kernel boot line
- Remove or update the offending module
- Roll back to a known-good kernel
- Replace faulty RAM
- Reboot and verify full functionality
- Maintain a reliable kernel testing process
- Monitor hardware health
- Keep a cross-tested module database
- What was the root cause? Was it a faulty driver, a hardware failure, or human error that caused the panic?
Filesystem Issues
Filesystem not mounting
- The usual mount command returns errors
- Scheduled backups and applications suddenly cannot access certain directories
- Errors like `unknown filesystem type`, `mount: wrong fs type`, or `superblock corrupt` in system logs
- Boot into rescue mode or unmount any stale references, run `fsck` against the affected device, and inspect or repair the superblock if needed
- If the issue arises from `/etc/fstab`, work out the correct UUID or device path, test the mount manually, and then update the fstab
- Does the system now mount cleanly?
- Confirm read/write access and update any monitoring dashboards to reflect that the volume is back online
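When `/etc/fstab` is the suspect, one way to narrow it down is to list the UUIDs it references and check each against the devices actually present. The sketch below assumes `blkid` from util-linux is available; both function names are illustrative.

```shell
# List the UUIDs referenced by UUID= entries in an fstab-style file.
# Comment lines and blank lines never match and are skipped.
fstab_uuids() {
    local fstab=${1:-/etc/fstab}
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$fstab"
}

# Print each fstab UUID for which blkid finds no matching block device.
missing_fstab_uuids() {
    local uuid
    fstab_uuids "$1" | while read -r uuid; do
        blkid -U "$uuid" >/dev/null 2>&1 || echo "$uuid"
    done
}
```

Running `missing_fstab_uuids` with no argument checks the live `/etc/fstab`. Any UUID it prints is a candidate for the stale entry. Where the installed util-linux supports it, `findmnt --verify` performs a broader sanity check of the same file.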
Partition not writable
- Processes fail with a `permission denied` message
- Applications are unable to save files even though the directories appear to exist
- Maybe the filesystem is mounted read-only. Examine `/proc/mounts` to confirm the `ro` flag
- Unmount the partition, run `fsck` to repair any underlying errors, and then remount it read-write
- Does the issue persist?
- Inspect ownership and ACLs, then apply `chmod` or `chown` to grant the correct user or service write access
- Update any configuration management scripts
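The `ro` check against `/proc/mounts` can be sketched as a small function. The name `is_readonly` and the `/data` mount point in the usage are hypothetical, and mount points containing spaces are not handled.

```shell
# Succeed (exit 0) if the given mount point is mounted read-only.
# Exit 1 if it is mounted read-write, exit 2 if it is not mounted at all.
# Reads /proc/mounts by default; a second argument can name another file
# in the same format.
is_readonly() {
    local target=$1 mounts=${2:-/proc/mounts}
    awk -v t="$target" '
        $2 == t { opts = $4; found = 1 }   # keep the last matching entry
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }' "$mounts"
}
```

For example, `is_readonly /data && echo "/data is read-only"`. Matching the option exactly avoids a false positive on entries such as `errors=remount-ro`.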
OS filesystem is full
- Applications and users are unable to write logs and files
- Check partition usage to confirm the issue
- Truncate or rotate logs, clean up old core dumps, purge orphaned Docker images, or archive older data to secondary storage
- Extend the LVM volume or resize the partition, then resize the filesystem
- Implement proactive monitoring for storage space
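A proactive check can be as small as a filter over `df -P` output. This sketch assumes the POSIX output format of `df -P`, where the fifth column is the usage percentage and the sixth is the mount point. The function name and the 90% default are arbitrary choices for illustration.

```shell
# Read `df -P` output on stdin and print the mount points whose usage is at
# or above the given percentage (default 90), one "mountpoint usage%" per line.
over_threshold() {
    local limit=${1:-90}
    awk -v limit="$limit" 'NR > 1 {
        use = $5; sub(/%$/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

Usage could look like `df -P | over_threshold 85`, run from cron or a monitoring agent. Reading from stdin rather than calling `df` inside the function makes the filter easy to exercise with captured output.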
Inode exhaustion
- `df -h` may show that space is available
- Typical message: `Cannot create file: No space left on device`
- Check `df -i` and see if inode usage is at 100%
- Identify directories with excessive file counts and then clean up old or stale files
- Create a new filesystem with a higher inode ratio and then migrate the data if necessary
- Update cleanup policies or add scripts to remove temporary files automatically, preventing a repeat of the issue
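Finding the directories with excessive file counts can be sketched as follows. This assumes GNU `find` (for `-printf`); the function name is illustrative.

```shell
# Print the directories under a root that hold the most regular files,
# as "count path" lines, busiest first. -xdev keeps the search on one
# filesystem so other mounts do not skew the result.
top_file_counts() {
    local root=$1 n=${2:-10}
    find "$root" -xdev -type f -printf '%h\n' | sort | uniq -c | sort -rn | head -n "$n"
}
```

For example, `top_file_counts /var 5` lists the five busiest directories under `/var`. Session caches, mail spools, and temporary directories are common candidates, though what shows up depends entirely on the workload.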
Quota issues
- An individual user or group cannot write files despite free space in the partition
- Typical message is `Disk quota exceeded` when creating or writing to a file
- Use `repquota -a` and `quota -u <USERNAME>` to view group or user quotas
- Adjust soft and hard limits if necessary
- Identify and remove unnecessary data from the user's home or project directories
Process Issues
Unresponsive Processes
Occur when a running program stops responding to input or system scheduling, causing tasks to hang indefinitely.
- Does the process fail to respond to user input or system events?
- Is it consuming more resources than usual?
- Spot this with `top` and `ps`
- Use `strace` to watch the process
- Send `SIGTERM` to let the process shut down cleanly, and if that fails, escalate to `SIGKILL` to free resources by force
- Examine `journalctl` to determine what caused the process to become unresponsive
- Implement preventive measures
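The SIGTERM-then-SIGKILL escalation might look like the sketch below. The function name and the 10-second default grace period are assumptions chosen for illustration.

```shell
# Ask a process to exit with SIGTERM, wait up to <timeout> seconds for it to
# disappear, and only then fall back to SIGKILL.
graceful_kill() {
    local pid=$1 timeout=${2:-10} waited=0
    kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
    while kill -0 "$pid" 2>/dev/null; do        # still alive?
        if [ "$waited" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null       # grace period is over
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

Before calling something like `graceful_kill 12345 30`, attaching with `strace -p 12345` can show whether the process is stuck in a system call. A process in uninterruptible sleep may not react even to `SIGKILL` until the blocking I/O completes.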
Killed Processes
They happen when a process is forcibly terminated by a signal.
- Check `journalctl` and `dmesg` for the reason the process was killed
- Logs may show `Killed process <PID>` or `oom_reaper` to indicate a killed process
- Go through the logs to determine whether the system or a person killed the process
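Searching the kernel log for the two markers mentioned above can be wrapped in a small filter. The function name is illustrative.

```shell
# Filter kernel log text on stdin down to lines that point at an OOM kill.
find_oom_kills() {
    grep -E 'Killed process [0-9]+|oom_reaper'
}
```

Usage might be `journalctl -k --no-pager | find_oom_kills` or `dmesg | find_oom_kills`. An empty result suggests the kernel's out-of-memory killer was not involved, which shifts suspicion toward a signal sent by a user or another program.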
Segmentation Fault
A crash that happens when a program tries to access memory it shouldn't, leading to an abrupt termination with an error message.
- Configure the system to generate and retain core dumps
- Use the GNU Debugger to analyze the core file and pinpoint the faulty code path
- Is the issue from a package? Reinstall a version of the package without that bug
Memory Leaks
Memory leaks occur when a program continuously allocates memory without freeing it, gradually exhausting available RAM and degrading system performance.
- Watch the RES memory rise steadily with no drop
- Which process is holding the memory? Review logs and output
- Schedule periodic restarts of the service or allocate more RAM to reduce impact
- Continue monitoring RES
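Watching resident memory over time can be sketched by sampling `VmRSS` from `/proc`, which should correspond to the RES column in `top`. The function names, sample count, and interval below are illustrative assumptions.

```shell
# Print the resident set size (VmRSS, in kB) of a process from /proc.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Sample a process's RSS <count> times, <interval> seconds apart, and print
# one "epoch-seconds kB" line per sample.
sample_rss() {
    local pid=$1 count=${2:-5} interval=${3:-60} i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s %s\n' "$(date +%s)" "$(rss_kb "$pid")"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `sample_rss 12345 60 60 >> /var/tmp/rss.log` records an hour of samples for a hypothetical PID 12345. A series that only ever grows, even when the service is idle, is consistent with a leak but does not prove one, since caches can show the same pattern.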
System Issues
Device Failure
The server suddenly cannot read from or write to a critical piece of hardware, which is often a disk or network interface.
- Identify the faulty device
- Reseat or replace the device
- If it is a RAID disk, mark the bad disk as failed and rebuild the array with a spare
- Check disks and network to confirm full functionality of the system
Data corruption issues
They show up as files that refuse to open, applications that crash, or filesystem errors in system logs.
- Run `fsck` to detect corrupted data
- Is there a known-good backup? Restore from backup
- Otherwise, use `fsck` with repair options to attempt recovery, unmounting the affected filesystem first where possible
- What was the root cause? A failing disk? A power outage?
- Verify full system functionality before it returns to production
Systemd unit failures
They occur when a service that should be running won't start or crashes immediately
- Inspect the service with `systemctl status <SERVICE>` or `journalctl`
- Maybe edit the unit config in `/etc/systemd/system/`
- Run `systemctl daemon-reload` to apply changes
- Start the service with `systemctl start <SERVICE>`
- Set up alerts to catch unit failures
Server inaccessible
Users cannot access the server remotely.
- Does ping time out?
- Does SSH hang?
- Does the out-of-band management tool fail to respond?
- Are other servers in the network reachable?
- Try physical access
- Maybe reboot the machine, or restore network configs from backups, or repair corrupt network service files
- Validate that the server is reachable again
Dependency Issues
Package Dependency Issues
Occur when software cannot find or install the components it needs
- Are the necessary repositories enabled?
- Run `dnf deplist <PACKAGE>` or `apt-cache depends <PACKAGE>` to find missing dependencies
- Upgrade or downgrade the package if necessary
- Rerun installation and verify software loads without issues
Path Misconfiguration Issues
Occur when the system cannot locate a program even though it is installed
Typical error message is `command not found`
- Run `echo $PATH` to check the current search directories
- Add the missing directory by editing `/etc/profile` or similar
- Reload the shell or log in again to apply changes
- Run command again to confirm the program is found
- Document changes for future deployments
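The PATH check and fix can be sketched as two small functions. The names are illustrative, and `/opt/tools/bin` in the usage is a hypothetical directory.

```shell
# Succeed if a directory is already listed in PATH.
dir_in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Append a directory to PATH for the current shell only, if it is missing.
add_to_path() {
    dir_in_path "$1" || PATH="$PATH:$1"
}
```

For example, `add_to_path /opt/tools/bin` followed by `command -v <PROGRAM>` confirms whether the program is now found. Wrapping PATH in colons before matching avoids false positives on partial directory names. To make the change permanent, the equivalent line would go in `/etc/profile` or a similar startup file, as noted above.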