Ansible: Ad-Hoc Commands

An ansible ad-hoc command is a single command sent to an ansible client. For example:

ansible servers -m ping

This ad-hoc command pings all clients in the servers group of the configured inventory.

Run Shell Commands on Ansible Clients

ansible servers -m shell -a "ip addr show"

ansible servers -m shell -a "uptime"

These commands send a shell command to all nodes in the group. -a specifies the argument for the ad-hoc command, which here is the command we want to run on each Ansible client.

Copy Files from Ansible Control Node to Clients

ansible servers -m copy -a "src=/home/me/my-file.txt dest=/etc/ansible/data/my-file.txt"
ansible servers -m copy -a "content='My Text file content' dest=/etc/ansible/data/my-text-file.txt"

This command copies a file from the Ansible control node to all clients. We choose whether to copy an existing file or inline text by using the src or content parameter. Note that the destination folder must already exist on the clients. When uploading content, we must specify the complete destination file name (for example dest=/data/upload/my-file.conf).

If the file already exists at the destination, Ansible compares checksums to determine whether the task was already done. If the file has not changed, Ansible will not re-upload it. Otherwise, whether the file changed on the control node or on the clients, Ansible will upload it again to the selected clients.

Create and Delete File and Folders in Ansible Clients

To create a new file in ansible clients:

ansible servers -m file -a "dest=FILE OR DIRECTORY DESTINATION state=touch"

To delete a file in ansible clients:

ansible servers -m file -a "dest=FILE OR DIRECTORY DESTINATION state=absent"

To create a directory, change the state to directory:

ansible servers -m file -a "dest=/my/directory/data state=directory"

A directory deletion is performed like a file deletion. Just specify the directory name in dest with state=absent and Ansible will delete the directory.

Install and Uninstall Packages on Ansible Clients

We can use the shell module or the dnf/apt Ansible module.

ansible servers -m shell -a "sudo dnf install -y nginx"
ansible servers -m dnf -a "name=nginx state=present" -b

If the operation requires root privileges, we can pass sudo to the shell command. But if we are using the dnf/apt module, the Ansible user must be allowed to escalate to root and we also need to add the -b (become) option to the command.

Use the latest state to update an already installed package.

ansible servers -m dnf -a "name=nginx state=latest" -b

The state can be one of absent, installed, present, removed, and latest.

Understanding Ansible ad-hoc commands is important for understanding Ansible playbooks. From here, we are going to move slowly towards efficient ways to automate tasks using Ansible.

Ansible: Control Node Reasonable Setup

This post will focus on coming up with a reasonable Ansible control node setup for a homelab. By reasonable setup I mean a setup that will allow me to properly send tasks to managed nodes with a lower likelihood of failure. From this point I would like to focus on learning the important parts of Ansible instead of juggling left and right to fix basic setup errors.

Create or select a working folder

To keep things simple, I am going to have my inventory in /etc/ansible-admin/, owned by the ansible-admin group.

Where to keep ansible.cfg

The default ansible.cfg can be left where it is. For managing our nodes, I am going to keep my own ansible configuration inside /etc/ansible-admin/ansible.cfg

Where to keep the inventories

The lab inventories can be kept in /etc/ansible-admin/inventory/

Disable the host key verification

From ansible.cfg:

[defaults]
host_key_checking = False

or from an environment variable:

export ANSIBLE_HOST_KEY_CHECKING=False

or from the command line:

ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook my_playbook.yml
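
The choices above can be pulled together in one place. This is only a minimal sketch, assuming the /etc/ansible-admin layout described in this post. The write_ansible_cfg helper is a hypothetical name, and the base directory is a parameter so it can be tried in a scratch folder first:

```shell
# Write an ansible.cfg that points at the lab inventory folder and
# disables host key checking. $1 is the base directory.
# (write_ansible_cfg is a hypothetical helper, not an Ansible command.)
write_ansible_cfg() {
    base_dir="$1"
    mkdir -p "$base_dir/inventory"
    cat > "$base_dir/ansible.cfg" <<EOF
[defaults]
inventory = $base_dir/inventory/
host_key_checking = False
EOF
}

# Example (needs write access to /etc):
#   write_ansible_cfg /etc/ansible-admin
#   export ANSIBLE_CONFIG=/etc/ansible-admin/ansible.cfg
```

Setting ANSIBLE_CONFIG tells Ansible to read this file instead of the default one, so the default ansible.cfg can stay untouched.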

Ansible: More about the Inventory File

Ansible default inventory is located at /etc/ansible/hosts. But we can have it elsewhere. For example at /home/me/ansible/hosts.ini. Then point ansible to it using the -i flag.

ansible web -m ping -i ./hosts.ini

Or I can just configure ansible.cfg to point to the path of my inventory file.

[defaults]

inventory = /etc/ansible/inventory/hosts.ini

Hosts can be organized in groups inside the inventory file. A group name must be unique and follow the rules for a valid variable name.

Here are examples of groups: web and db

[web]
192.168.10.15
192.168.10.16

[db]
192.168.12.15
192.168.12.16
192.168.12.17

Here is the same inventory in YAML format

web:
  hosts:
    192.168.10.15:
    192.168.10.16:
db:
  hosts:
    192.168.12.15:
    192.168.12.16:
    192.168.12.17:

Ansible automatically creates the all and ungrouped groups behind the scenes. The all group contains every host, and the ungrouped group contains every host that does not belong to any group other than all.

So, ansible -m ping all will ping all hosts listed in the inventory file, and ansible -m ping ungrouped will ping all hosts not listed in any group.

Do more in your inventory

  • A host can be part of multiple groups

  • Groups can also be grouped

In YAML:

prod:
  children:
    web:
    db:
test:
  children:
    web_test:

In INI:

[prod:children]
web
db

[test:children]
web_test

  • Add a range of hosts

In INI:

[servers]
192.168.11.[15:35]

In YAML:

servers:
  hosts:
    192.168.11.[15:35]:

  • Add variables to hosts or groups

[prod]
192.168.10.15:4422

prod1 ansible_port=4422 ansible_host=192.168.10.22
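
To make the range syntax concrete: 192.168.11.[15:35] stands for every address from .15 through .35, both ends included. Here is a small sketch that prints the same list with seq. The expand_host_range helper is hypothetical and not part of Ansible:

```shell
# Print the hosts covered by an inventory range such as 192.168.11.[15:35].
# $1 = address prefix, $2 = first number, $3 = last number (both included).
# (expand_host_range is a hypothetical helper for illustration only.)
expand_host_range() {
    for n in $(seq "$2" "$3"); do
        printf '%s%s\n' "$1" "$n"
    done
}

expand_host_range 192.168.11. 15 35
```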

You can do way more than what I have listed above. I am not going to bore you with everything about Ansible inventory here because I don't need it at this stage of my learning. But if you feel like you want to learn more about this topic, go here

Good bye for now

Ansible: Initial Setup

In my previous post, I went quickly through Ansible installation and initial setup. I did not really set up anything. I just showed you where to find the things that Ansible ships by default.

In this post I will go deeper into the setup process. But I am still not going to try to impress you here. Let's keep that for future posts.

Ansible Control Node

The Ansible config file is located at /etc/ansible/ansible.cfg by default. We are going to use this file later to customize our installation of Ansible.

If you have just a few nodes, you can SSH into each one of them to make sure you can connect correctly. That also means that if you have just a few nodes, Ansible might not be the right tool.

Use ssh-copy-id -i key.pub node-user@192.168.10.10 to add the control node's public SSH key to the authorized keys on a node, so the control node can connect to it without a password.

Ansible Inventory

The inventory contains the nodes you want ansible to manage. The default inventory file is located at /etc/ansible/hosts. The nodes are put into groups for ease of management. The group names must be unique and they are case sensitive. The inventory file contains the IP addresses or FQDN of the managed hosts.

If we want to use the default inventory file we can just run:

# to ping all nodes in the web group
ansible -m ping web

But if we are working on a dedicated inventory file, like my_nodes.ini, we should tell Ansible that we are providing an inventory file by adding -i [INVENTORY FILE]. For example, ansible web -i my_nodes.ini -m ping

The inventory in the ini format looks like:

[web]
192.168.12.13
192.168.12.14

[db]
192.168.13.13
192.168.13.15

But the inventory file can also be written in the YAML format:

my_nodes:
  hosts:
    node_01:
      ansible_host: 192.168.10.12
    node_02:
      ansible_host: 192.168.10.13

[web] is a group name. It is unique across the inventory file. We can have multiple groups in an inventory file.

To run an Ansible command on multiple groups, we separate the group names with colons. For example:

ansible web:db -m ping -i my_nodes.ini --ask-pass

This command will ping nodes in the web and db groups. --ask-pass prompts for the SSH password if the SSH daemon on the managed nodes asks for the user password.

If our command requires an input to function, maybe we are doing it the wrong way. Ansible is supposed to facilitate automation. A command should be able to run to completion without additional user input. In my initial Ansible setup, I provided input twice when I was running the ping command: the first was the host key verification, the second was the node password because the SSH keys were not set up properly. We are going to fix this in our next posts.

How to Manage Nodes with Ansible

Until now we only learned how to ping our nodes using the Ansible ping module. ansible web -m ping is how we tell Ansible to use the ping module to ping the web group.

Key Points to Remember

  • Ansible is used to automate repetitive tasks we perform on network devices

  • Ansible inventory contains grouped list of nodes we want to manage

  • The inventory can be written in the ini or YAML format

  • Ansible comes with prebuilt modules like ping to facilitate node management.

In my next posts, I will be going deeper on each important part of Ansible such as inventory and playbook.

So, read me soon.

Ansible: Installation and Initial Setup

What is Ansible?

Let's cut to the chase. Ansible is a tool for system and network admins to automate repetitive tasks, for example installing and configuring multiple servers, or configuring routers, switches, firewalls, and WAPs at once. Ansible can talk to any device that talks the SSH language. Other connection types are supported but SSH is the default connection type. Visit the Ansible Documentation page to learn more.

This is not going to be a step-by-step tutorial on how to use Ansible nor an in-depth overview of Ansible. A lot of important basic topics will be missing from this post but they might appear in future posts. So if something is missing here, you can always look at the other posts in the same category. If something I say does not feel right, you can reach out to me with questions or suggestions via LinkedIn or Email.

How to Install Ansible on Linux

Ansible is agentless. That means that you do not need to install Ansible on the managed nodes to have Ansible push tasks to them. So only the control node needs to have Ansible installed on it. But you will need Python and SSH installed and configured on the managed nodes.

You have multiple ways to install Ansible on your Linux workstation but I will be using the method via the Linux package manager.

How to locate python?

which python3 

# /usr/bin/python3

How to locate SSH?

sudo systemctl status sshd

The SSH daemon must be enabled, active, and running.

Update your system:

sudo dnf update -y

Install Ansible using the package manager

sudo dnf install ansible -y

Ansible keeps its main configuration files in /etc/ansible. There you should find the files ansible.cfg and hosts.

Run ansible --version to get details about your Ansible installation. The command will also tell you the location of your Ansible default configuration file.

If you install Ansible using the Linux package manager, you should have the config file generated and set in Ansible. In case you are missing the ansible.cfg file in your installation, you can create the file /etc/ansible/ansible.cfg yourself. There are many ways to set the Ansible configuration file but I am going to stick with the one generated by default during the installation.

Ansible Inventory

Ansible inventory contains the list of hosts you want to manage. By default the hosts file contains the list of nodes, but you can point Ansible to a different inventory file inside ansible.cfg. In the hosts file, you can put the nodes into groups like:

[web]
172.16.10.10
172.16.10.12

[db]
172.16.20.22

If you're able to run ansible --version without issue and locate the Ansible installation folder (/etc/ansible), you are good to do awesome things with Ansible. In the next posts, we are going to go deeper into the basics of Ansible.

So, stay tuned.

Linux: Troubleshooting Performance Issues

CPU Issues

High CPU usage

  • What process is using the CPU?
  • Use top or htop to see what process is using the CPU
  • Optimize the code or limit the number of processes running

High load average

This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.

  • Check output of uptime or top
  • Is the system overloaded?
  • Are there too many processes running simultaneously?
  • Or is it one process that is causing the backlog?
  • Which query to optimize if it is a db process?
  • Maybe offload some tasks to another server
  • Maybe swap CPU with one with better specs
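
One way to answer "is the system overloaded?" is to compare the load average with the number of cores. Below is a rough sketch. The load_per_core helper is a hypothetical name, and treating a ratio above 1.00 as a sign of overload is a rule of thumb rather than a hard limit:

```shell
# Divide a load average by the number of CPU cores.
# $1 = load average, $2 = core count.
# A result above 1.00 suggests more queued work than the cores can absorb.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Example on a live system (Linux only):
#   load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```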

High context switching

A context switch is when the CPU switches between different processes to allocate resources. A context refers to the state of a process (Running, Waiting, or Stopped) that allows the CPU to resume that process later.

Too many context switches lead to inefficiency and higher CPU usage.

  • Check context switching with vmstat or pidstat. Check the number of context switches per second.
  • How many context switches per second do we have
  • More than 500-1000 context switches per second is considered high and may indicate that the system is overwhelmed
  • Maybe reduce the number of running processes
  • Optimize applications to use fewer threads
  • Adjust system limits
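
Besides vmstat and pidstat, the kernel exposes a cumulative context switch counter on the ctxt line of /proc/stat. The sketch below samples it twice and works out a per-second rate. Both helper names are hypothetical:

```shell
# Read the cumulative context switch counter from a /proc/stat style file.
# $1 is optional and defaults to /proc/stat.
read_ctxt() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# Context switches per second between two counter readings.
# $1 = first reading, $2 = second reading, $3 = seconds between them.
ctxt_per_second() {
    echo $(( ($2 - $1) / $3 ))
}

# Example on a live system (Linux only):
#   first="$(read_ctxt)"; sleep 5; second="$(read_ctxt)"
#   ctxt_per_second "$first" "$second" 5
```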

CPU bottleneck

This issue occurs when the CPU is the limiting factor in system performance.

  • The CPU usage is consistently high (above 80%)
  • Are tasks taking too long to process?
  • Load average exceeds the number of available CPU cores
  • Use top and htop to identify processes that are using an inordinate amount of CPU time
  • Optimize the processes using the CPU
  • Maybe a hardware upgrade is needed

Memory Issues

Swapping Issues

This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory

  • Is the system running out of swap space?
  • The swap file is typically located at /swapfile or on a dedicated swap partition
  • Identify swap space with swapon -s
  • Has the system performance degraded?
  • What do top and htop say about swap usage?
  • Is the usage more than 10%? That might be high
  • Monitor swap usage with free -h or vmstat
  • Maybe more physical RAM is needed
  • Or adjust the swappiness kernel value. ex: sudo sysctl vm.swappiness=10
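
To check the 10% figure without reading it off a screen, swap usage can be computed from the SwapTotal and SwapFree lines of /proc/meminfo. This is a minimal sketch, and swap_used_percent is a hypothetical helper name:

```shell
# Print the percentage of swap in use.
# Reads /proc/meminfo style text on stdin.
swap_used_percent() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total > 0) printf "%.1f\n", (total - free) * 100 / total
            else print "no swap configured"
        }
    '
}

# Example on a live system (Linux only):
#   swap_used_percent < /proc/meminfo
```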

Out of Memory (OOM) Errors

This issue happens when the system runs out of both physical and virtual memory

  • Are critical processes being unexpectedly terminated?
  • Check system logs for OOM messages
  • Adjust application configurations to optimize memory usage
  • Increase RAM
  • Adjust swap space
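
The log check can be narrowed to the lines that matter with a small filter. The patterns below are an assumption based on the usual kernel wording ("Out of memory", "oom-killer") and may need adjusting for your kernel version. oom_events is a hypothetical helper name:

```shell
# Keep only the lines that look like kernel OOM killer messages.
# Reads log text on stdin.
oom_events() {
    grep -i -E 'out of memory|oom-killer|oom_kill'
}

# Example on a live system:
#   sudo dmesg | oom_events
#   sudo journalctl -k | oom_events
```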

Disk I/O Issues

This issue occurs when the system slows down due to delays in reading from or writing to a storage device.

High input/output wait time

This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, causing delays in executing processes.

  • Monitor disk load with iotop or dstat. top shows I/O wait as the x.x wa value. iostat gives %iowait
  • Optimize or spread out disk operations
  • Improve performance or upgrade to SSDs if needed

High disk latency

Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or inefficient disk configurations.

  • What does iostat output say about the disk latency?
  • Is the latency above the normal 10ms? Higher than 20ms may indicate a latency issue
  • Is the disk operating at its maximum throughput?
  • Maybe some drivers need to be upgraded
  • Maybe adjusting the disk (RAID) config can help reduce the latency
  • Maybe upgrade to faster disks

Slow remote storage response

This issue occurs when accessing data from remote storage systems such as NFS, SAN, or cloud storage

  • Long response time when accessing files?
  • Check network performance using ping or netperf to check for network issues
  • Optimize network settings or upgrade network hardware

Network Stability Issues

Packet drops

This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues

  • ping -c 100 <DESTINATION> will give the percentage of packets dropped. Over 1% packet loss may be unacceptable
  • Packet loss can lead to slow performance, timeouts, and application errors
  • Check routers and switches for hardware issues and faulty cables
  • Check for NIC errors using ifconfig and ethtool
  • Adjust QoS settings or increase bandwidth
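
For scripted checks, the loss figure can be pulled out of the ping summary line. This sketch assumes the summary format printed by the common Linux ping ("..., 2% packet loss, ..."). packet_loss_percent is a hypothetical helper name:

```shell
# Extract the packet loss percentage from ping output on stdin.
packet_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example:
#   ping -c 100 <DESTINATION> | packet_loss_percent
```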

Random disconnects

This issue occurs when a network connection is unexpectedly terminated

  • Are users or services suddenly losing network access?
  • Look for connection reset or connection closed messages in logs using dmesg or journalctl
  • Check network stack configuration
  • Maybe the cable is faulty or some hardware in the path is faulty
  • Maybe a firewall is closing the connection
  • Check and adjust TCP settings if necessary

Random timeouts

This issue occurs when a connection fails to receive a response within the expected time frame.

  • Errors can be seen in logs
  • Getting connection timed out error when using curl to connect to a service?
  • Maybe the network is congested
  • Maybe there is a DNS issue
  • Maybe the server is overloaded
  • A timeout threshold of 5-10 seconds is typically acceptable
  • Use ping or traceroute to check for network congestion
  • Make sure DNS servers are correctly configured
  • Check server performance
  • Adjust TCP timeout if necessary

Network Performance Issues

High latency

High latency refers to the delay in the time it takes for data to travel from one point to another

  • Measure latency using ping or traceroute
  • A 100ms latency in a local network is considered high and 300ms over remote communication is also high
  • Check for network congestion
  • Identify hardware issues
  • Maybe the routing is misconfigured. Optimize the network path
  • Maybe upgrading the network infrastructure could help

Jitter

Jitter is the variation of latency over time, which can cause problems in real-time applications.

  • A value above 30ms of fluctuation can cause noticeable issues
  • Detect Jitter issues with ping -i 0.2 <DESTINATION>
  • Check for network congestion or hardware issues
  • Implement QoS to prioritize relevant traffic
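
ping does not print a jitter figure by itself, but one can be estimated from the individual round-trip times. The sketch below uses one simple definition, the average absolute difference between consecutive replies. That definition and the ping_jitter_ms name are assumptions for illustration, and other tools may compute jitter differently:

```shell
# Average absolute difference between consecutive round-trip times.
# Reads ping output on stdin and prints milliseconds.
ping_jitter_ms() {
    sed -n 's/.*time=\([0-9.]*\) ms.*/\1/p' |
        awk '
            NR > 1 { d = $1 - prev; if (d < 0) d = -d; sum += d; n++ }
            { prev = $1 }
            END { if (n > 0) printf "%.2f\n", sum / n }
        '
}

# Example:
#   ping -c 50 -i 0.2 <DESTINATION> | ping_jitter_ms
```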

Slow response time

This issue occurs when the network takes too long to respond to requests.

This could be due to:

  • High latency
  • Congestion
  • Overloaded servers
  • Misconfigured applications

  • Use curl or wget to measure response time and identify bottlenecks in the network or a server

  • Check server load
  • Optimize application code
  • Check for server resources
  • Review network configurations

Low throughput

This issue occurs when the network is unable to transmit data at a high enough rate

  • Identify low throughput by using iperf. Anything below 80% of the expected bandwidth is considered low throughput
  • Check for network congestion
  • Check for faulty cables
  • Check for incorrect settings
  • Maybe switch to a high bandwidth network
  • Maybe reduce unnecessary traffic
  • Optimize network routes

System Responsiveness Issues

Slow application response

This issue occurs when an application takes longer than expected to react to user inputs

  • Use top and htop to identify application resource consumption
  • Maybe the application code needs to be optimized
  • Increase system resources
  • Check disk I/O for overload
  • Check for unnecessary background processes

Sluggish terminal behavior

It happens when commands in the terminal are delayed. The system takes an unusually long time to execute commands.

  • Use top or iotop to check for system resource usage
  • Optimize processes running on the system
  • Cleanup system resources
  • Add more RAM or CPU cores

Slow startup

The system is taking an unusually long time to boot up

  • See which services take longer than expected using systemd-analyze
  • Maybe too many services are configured to start at the same time
  • Maybe one of the startup services is misconfigured?
  • Delay or disable non essential services from starting at boot time using systemctl
  • Optimize the boot sequence

System unresponsiveness

This issue occurs when the system becomes completely unresponsive

  • Is the system not accepting new input?
  • Applications are no longer responding?
  • Use dmesg or journalctl to identify what caused kernel panic
  • Identify runaway processes using top and htop
  • Maybe upgrade the RAM or add more CPU cores

Process Management Issues

Blocked processes

This issue occurs when a process is unable to proceed due to waiting on resources or system locks

  • Are commands or applications stuck?
  • ps and top show processes in the D (uninterruptible sleep) state
  • Use lsof to check which file a process is waiting for
  • Use strace to trace system calls and signals
  • Is a process repeatedly stuck or blocked? Maybe due to resource contention
  • Optimize disk I/O
  • Maybe add more memory
  • Investigate dependency issues between processes
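
Processes in the D state can be listed directly instead of scanning the full ps output by eye. Here is a minimal sketch. blocked_processes is a hypothetical helper name, and it prints only the first word of each command name:

```shell
# List processes in uninterruptible sleep (state D).
# Reads "pid stat command" lines on stdin, as produced by ps -eo pid,stat,comm.
blocked_processes() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Example on a live system:
#   ps -eo pid,stat,comm | blocked_processes
```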

Exceeding baselines

This happens when processes consume more resources than expected

  • Notice high CPU usage
  • Unusual memory consumption
  • Excessive disk activity
  • Use top, htop, or pidstat to identify this issue
  • Optimize application resource usage
  • Maybe configure system resource limits with ulimit

High failed log-in attempts

This issue often signals attempted unauthorized access or brute force attacks

  • Maybe a brute force attack?
  • Unauthorized access attempts?
  • System compromise?
  • What are logs in /var/log/auth.log saying?
  • Check journalctl to identify login attempts
  • 5-10 login attempts from a single IP address within a short time may be a red flag for brute force attack
  • Implement fail2ban to block abusive IPs
  • Enforce strong password policy
  • Use MFA
  • Limit access with firewall or IP allowlist
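
To see whether a single IP crosses the 5-10 attempts mark, the failed attempts can be counted per source address. This sketch assumes the usual sshd wording ("Failed password for ... from <IP> port ..."). failed_logins_by_ip is a hypothetical helper name:

```shell
# Count "Failed password" lines per source IP, busiest first.
# Reads sshd log lines on stdin and prints "count ip" pairs.
failed_logins_by_ip() {
    awk '
        /Failed password/ {
            for (i = 1; i < NF; i++)
                if ($i == "from") { count[$(i + 1)]++; break }
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn
}

# Example on a Debian-based system:
#   sudo cat /var/log/auth.log | failed_logins_by_ip
```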

Linux: Troubleshooting Security Issues

SELinux Issues

SELinux policy issues

SELinux Policy defines what actions users and applications can perform on a system based on security rules.

A too restricted or misconfigured policy can prevent the system from working properly.

avc: denied is a typical error message found in logs when dealing with SELinux policy issues.

  • Review logs with ausearch or sealert
  • Modify rules if necessary
  • Test policy in a safe environment before applying

SELinux context issues

SELinux uses context to label every file, process, and resource on the system, determining what access is allowed.

An incorrect or misconfigured label can prevent applications from accessing the resources they need to function

  • Use ls -Z for files and ps -Z for processes to look for SELinux context issues
  • Does the file or process have incorrect context?
  • Restore the context with sudo restorecon -v <FILE PATH>
  • Running restorecon regularly on key directories helps avoid repeated context mislabeling issues

SELinux boolean issues

SELinux booleans allow adjustment of certain security settings without modifying the underlying policy.

An incorrectly set boolean can cause certain services or applications to malfunction

  • Check booleans with getsebool
  • Are certain booleans incorrectly set?
  • Toggle booleans with setsebool. ex: setsebool -P httpd_can_sendmail 1
  • Test modification and document changes

File and Directory Permission Issues

File attributes

File attributes control certain behaviors and restrictions on files and directories, which go beyond the regular rwx permissions.

  • Check file attributes with lsattr. i=immutable, a=append-only
  • Remove incorrect attribute with chattr. ex: chattr -i <FILE PATH>
  • Verify file access and document changes

Access Control Lists (ACLs)

ACLs provide more fine-grained control over who can access a file or directory and what actions can be performed.

  • Check if a file is using ACLs with getfacl
  • Adjust the ACLs with setfacl. ex: give read-only access to user tom with setfacl -m u:tom:r <FILE PATH>
  • Verify proper access and document changes

Access Issues

Account access issues

Most common issue

  • Are the credentials incorrect?
  • Maybe the account is locked or disabled
  • Check system logs for messages
  • Check if account is locked with sudo passwd -S tom
  • Unlock account with sudo passwd -u tom
  • Reset the user password with sudo passwd tom
  • Re-enable a disabled account with sudo usermod -e '' tom. '' means no account expiration date

Remote access issues

Issues with VPN or SSH

  • Is the issue caused by network issues, misconfigurations, or firewall?
  • Is the SSH service running? check with sudo systemctl status sshd
  • Start and enable the SSH service with sudo systemctl start sshd && sudo systemctl enable sshd
  • Check firewall with sudo ufw status or sudo iptables -L
  • The problem still persists? Check routing and public key validity

Certificate issues

Common messages: SSL certificate expired, SSL handshake failure

  • Is the certificate expired?
  • Maybe the certificate chains are misconfigured
  • Maybe it is a CA issue
  • Check certificate issues with openssl s_client -connect mysite.com:443
  • Renew the certificate if necessary
  • Ensure the full certificate chain is correctly installed

Configuration Issues

Exposed or misconfigured services

This issue occurs when system services are either left open to the public or configured incorrectly.

  • Does the service have proper security settings? The db should not be accessible from the internet
  • Review security logs
  • Use tools like nmap to scan open ports
  • Configure the firewall to restrict access to trusted IPs
  • Disable unused services
  • Ensure critical services are only accessible when necessary

Misconfigured package repositories

This issue prevents the system from accessing the correct software sources. It prevents software updates and installations.

  • What errors show when running sudo apt update or sudo dnf update
  • Check repository configuration files: /etc/apt/sources.list on Debian-based systems or /etc/yum.repos.d/ on RHEL-based systems
  • Edit repository url if necessary

Vulnerabilities

Vulnerabilities are weaknesses or flaws in the system that can be exploited by attackers to compromise security.

Unpatched vulnerable system

  • Do I have the latest security patches?
  • Use vulnerability scanners to detect security issues
  • Regularly apply updates with sudo apt update && sudo apt upgrade on Debian or sudo dnf update on RHEL.

The use of obsolete or insecure protocols and ciphers

  • Is the system using secure ciphers for data and communication protection?
  • Are insecure protocols and ciphers like SSLv3 and RC4 disabled on the system? SSLv3 is vulnerable to the POODLE attack, RC4 is vulnerable to RC4 bias attacks
  • Check used protocols in sshd_config for SSH and apache2.conf for Apache.
  • Disable outdated protocols
  • Remove weak ciphers in the configuration files
  • Use strong ciphers like AES and protocols like TLS1.2, 1.3

Cipher negotiation issues

This issue occurs when there is a failure in the negotiation of encryption methods between a client and a server.

Review connection logs to confirm both server and client are using strong encryption methods

Linux: Troubleshooting Networking Issues

Firewall Issues

Misconfigured Firewall

Typo in firewall rule

A simple typo in a firewall rule can block traffic.

Use firewall-cmd --list-ports to see open ports

Remove the bad rule with firewall-cmd --remove-port=<PORT>/<PROTOCOL> --permanent, re-issue the correct command, and reload the firewall with firewall-cmd --reload.

Incorrect Rule Ordering

This happens when a DROP or REJECT rule is placed above an ACCEPT rule, causing legitimate traffic to be blocked.

Forgetting to persist firewall changes across reboot

If a rule is added without --permanent the rule disappears after reboot.

Addressing Issues

DHCP issues

This issue occurs when servers or workstations fail to obtain an IP address automatically.

  • Is the DHCP service running at all?
  • Does the server have free IP addresses to allocate? Check the DHCP scope for exhaustion by reviewing logs on the DHCP server.
  • Do I need to expand the pool?
  • Force client to request an ip again
  • Confirm connectivity
  • Update network documentation to reflect the change

IP conflicts

IP conflicts occur when two devices claim the same address, leading to intermittent connectivity or "duplicate address" warnings in syslog.

  • Common signs are random disconnects, slow network performance, or ARP conflict messages.
  • Identify all devices using the conflicting IP by checking the DHCP lease files and DNS records
  • Assign a unique address to one of the devices
  • Update any static configurations
  • Clear the ARP cache to ensure no stale entries remain
  • Monitor the network to confirm the conflict is gone

Dual stack issues

This issue occurs when a server configured for both IPv4 and IPv6 fails to handle traffic properly.

  • Ping test may fail for either IPv4 or IPv6
  • Do the DNS records include both A and AAAA entries?
  • Adjust service configuration files to listen on both IPv4 and IPv6
  • Test connectivity over both protocols and ensure firewalls allow the appropriate traffic on each address family

Routing Issues

DNS issues

ping my.server.com returns unknown host

  • Confirm the DNS server in /etc/resolv.conf
  • Make changes if necessary
  • Is the DNS server reachable?
  • Test DNS resolution

Wrong gateway

  • Why are the packets not leaving the local network?
  • Can devices in different subnets communicate?
  • Can devices in other subnets communicate with external resources?
  • Check the default route with ip route
  • Update default route if necessary
  • Ping external resources to confirm connectivity

Server unreachable

When a server is unreachable, neither the hostname nor the IP address responds to ping.

  • Use ip link to check if the network interface is up and running.
  • Check switch port and VLAN settings
  • Is the firewall blocking ICMP or SSH?
  • Adjust port, VLAN, and firewall rule if necessary
  • Confirm connectivity using ping or SSH

Interface Misconfiguration

Subnet misconfiguration

This issue occurs when an interface is assigned to the wrong network or network mask. That prevents the server from communicating with other devices in the network.

  • Confirm address settings with ip addr
  • Edit the interface's configuration so the IP address and netmask align with the correct network segment
  • Apply changes with netplan apply or systemctl restart networking
  • Ping a known host on the subnet and confirm that traffic works as it should

MTU mismatch

This happens when one endpoint sends packets larger than the receiving interface can handle.

  • ping -M do -s 1472 <DESTINATION> => Frag needed and DF set (1472 bytes of payload plus 28 bytes of headers fills a 1500-byte MTU)
  • Check MTU on each interface with ip link show
  • Pick a consistent MTU value, which is often 1500 for standard networks, and update the interface configuration.
  • Retry transfer or ping test to see correct connectivity
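
When testing an MTU with ping, the payload size passed to -s has to leave room for the headers: 20 bytes for IPv4 (without options) and 8 bytes for ICMP. Here is a small sketch of that arithmetic. ping_payload_for_mtu is a hypothetical helper name:

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet for a given MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Example (Linux ping; -M do forbids fragmentation):
#   ping -M do -c 3 -s "$(ping_payload_for_mtu 1500)" <DESTINATION>
```

If the ping with this payload succeeds but a payload one byte larger fails, the path MTU matches the value that was tested.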

Cannot ping server

This often indicates a deeper interface misconfiguration, such as disabled interface, missing address, or firewall blocking ICMP

  • Is the interface up with a valid ip address?
  • Bring up the interface with ip link set <INTERFACE> up and assign the correct IP address
  • Is the firewall blocking ICMP? use sudo ufw status or iptables -L to ensure that ICMP is not blocked
  • Ping again to confirm connectivity

Interface bonding issues

Interface bonding combines two or more physical NICs into a single virtual interface to increase bandwidth and provide redundancy.

  • Is any interface in /proc/net/bonding/bond0 marked down even though it is plugged?
  • Mode 0 (balance-rr), Mode 1 (active-backup), Mode 4 (802.3ad/LACP)
  • Is the bonding driver loaded
  • Check the bonding configuration in either /etc/netplan...yaml on Ubuntu or /etc/sysconfig/network-scripts/ifcfg-bond0 on RHEL
  • Check switch setting to confirm matching valid configuration
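
The first check can be scripted. The sketch below assumes the usual layout of /proc/net/bonding/bond0, where each "Slave Interface:" line is followed by its own "MII Status:" line. The function name bond_down_slaves is just an illustration.

```shell
# bond_down_slaves [FILE]
# List slave interfaces whose MII status is not "up".
# FILE defaults to /proc/net/bonding/bond0.
bond_down_slaves() {
    awk '
        /^Slave Interface:/ { slave = $3 }
        # The bond itself also has an "MII Status" line; skip it by
        # requiring that a slave section has already started.
        /^MII Status:/ && slave != "" && $3 != "up" { print slave }
    ' "${1:-/proc/net/bonding/bond0}"
}

# Example: bond_down_slaves /proc/net/bonding/bond0
```

An empty result means every slave reports its link as up, so the next suspects are the bonding mode and the switch configuration.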

MAC spoofing issues

This issue occurs when two NICs present the same MAC address.

  • arping <IP ADDRESS> returns multiple MAC addresses
  • Does ip neigh show frequent MAC flapping?
  • Look for duplicate MAC addresses with ip link show
  • Correct MAC settings
  • Restart the network service to apply changes, then confirm with ip neigh show
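
On a host with many interfaces, spotting a duplicate by eye is tedious. Here is a small sketch that reads the one-line-per-interface output of ip -o link show and reports any MAC address used more than once. The function name duplicate_macs is hypothetical.

```shell
# Read `ip -o link show` output on stdin and print each MAC address
# that appears on more than one interface, followed by those interfaces.
duplicate_macs() {
    awk '
        {
            iface = $2
            sub(/:$/, "", iface)          # "eth0:" -> "eth0"
            for (i = 1; i < NF; i++)
                if ($i == "link/ether") {
                    mac = $(i + 1)
                    count[mac]++
                    names[mac] = names[mac] " " iface
                }
        }
        END {
            for (mac in count)
                if (count[mac] > 1) print mac names[mac]
        }
    '
}

# Example: ip -o link show | duplicate_macs
```

Keep in mind that this only sees interfaces on the local host, and that members of a bond normally share one MAC address, so a match there is expected.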

Network interface issues occur when devices are unable to communicate effectively due to a problem with the network interface.

A link is down when the interface is failing to establish or maintain a connection.

  • Maybe the cable is faulty?
  • Maybe the port is disconnected?
  • Maybe the hardware is faulty?
  • ip addr and ifconfig show the interface as down
  • Check the logs with dmesg and journalctl
  • Maybe the driver is bad?
  • Maybe the interface is misconfigured?
  • Maybe the interface is administratively down. Use ip link show <INTERFACE> to confirm, and bring it up if necessary
  • Restart networking with systemctl restart network

Link negotiation issues involve problems in the automatic process where devices agree on the speed and duplex settings for their connection.

Common signs are poor performance, slow speeds, and connectivity dropouts.

  • Check link status with ethtool <INTERFACE>
  • Is autonegotiation enabled?
  • Maybe the hardware has issues. Review system logs for related errors
  • Does the network driver have bugs and need to be updated?
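
The ethtool output is long, and only a few fields matter for this check. Here is a sketch that condenses it to one line. The function name link_summary is invented, and the field labels are the ones ethtool normally prints (Speed, Duplex, Auto-negotiation, Link detected).

```shell
# Read `ethtool <INTERFACE>` output on stdin and print the negotiated
# speed, duplex, autonegotiation state, and link status on one line.
link_summary() {
    awk -F': ' '
        /^[ \t]*Speed:/            { speed = $2 }
        /^[ \t]*Duplex:/           { duplex = $2 }
        /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
        /^[ \t]*Link detected:/    { link = $2 }
        END {
            printf "speed=%s duplex=%s autoneg=%s link=%s\n",
                   speed, duplex, autoneg, link
        }
    '
}

# Example: ethtool eth0 | link_summary
```

A result such as speed=100Mb/s duplex=Half on a gigabit port is a typical sign of a failed negotiation. Autonegotiation can usually be turned back on with ethtool -s <INTERFACE> autoneg on.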

Linux: Troubleshooting Hardware, Storage, and Linux OS

Troubleshooting steps:

  1. Identify the problem
  2. Establish a theory of probable cause
  3. Test the theory to confirm or refute it
  4. Establish a plan of action, implement the solution or escalate if needed, and then verify full system functionality
  5. Implement preventive measures to avoid recurrence and perform a root cause analysis

Boot Issues

Server Not Turning On

  • No power lights?
  • No fans?
  • No console output?
  • Do similar systems have the same issues?
  • Maybe the PDU is down?
  • Maybe the PSU has failed?
  • Check the power in the PDU
  • Swap in a known-good power cable
  • Plug another device into the same outlet
  • Still failing?
  • Inspect the PSU
  • Reseat connectors
  • Swap in a spare PSU
  • Verify the system powers on
  • Label cables
  • Schedule PSU health checks
  • Perform a root cause analysis

GRUB Misconfigurations

  • The server drops to a GRUB rescue prompt?
  • The server shows an error like "file not found"?
  • Are multiple kernels failing?
  • Maybe /etc/default/grub was edited?
  • Maybe an initrd entry was deleted?
  • Use the GRUB CLI to probe available partitions
  • Verify the kernel and initramfs files are where GRUB expects them to be
  • Boot from rescue ISO or live environment
  • Mount the root filesystem
  • Correct the UUID or kernel path in /etc/default/grub
  • Regenerate the GRUB configuration with grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-based systems, or update-grub on Debian-based systems
  • Reboot and verify the kernel loads properly
  • Back up grub.cfg before modifications
  • Why did the issue occur in the first place?
  • A rushed update?
  • A lack of peer review?

Kernel Corruption Issues

  • Observing errors such as "bad magic number" or "kernel image corrupt" during boot
  • Check whether only the latest kernel version is affected or if other versions work from the GRUB menu
  • Maybe a package update failed mid-install
  • Maybe the /boot partition has disk errors
  • Boot into an older, working kernel
  • Mount /boot and verify file checksums. If the checksums fail, the corruption is real
  • Reinstall the corrupted kernel package
  • Reboot to verify that the new kernel loads
  • Monitor disk health
  • Ensure updates are completed successfully
  • See if disk failure or an interrupted update was at fault

Missing or Disabled Drivers

  • Boot hangs or drops to an initramfs shell with errors like "VFS: Cannot open root device"
  • Check if only certain hardware (for example, a RAID controller) is missing in /dev or /sys
  • Maybe the initramfs was rebuilt without a necessary driver module
  • Maybe someone blacklisted a driver
  • Examine the initramfs contents with lsinitrd, or list the available dracut modules with dracut --list-modules, to confirm whether the driver is absent
  • Rebuild the initramfs including the required modules
  • Reboot to verify that the driver loads and the root filesystem is detected
  • Document driver dependencies in the build scripts
  • Automate initramfs rebuilds when kernel updates occur
  • Did a kernel package change or a manual configuration error cause the driver omission?

Kernel Panic Events

  • Read the panic message on the console
  • Does the panic happen on every boot or only after certain changes?
  • Maybe a newly added module is incompatible
  • Maybe the memory has gone bad
  • Try booting with a previous kernel
  • Run memtest86+
  • Disable suspect modules via the kernel boot line
  • Remove or update the offending module
  • Roll back to a known-good kernel
  • Replace faulty RAM
  • Reboot and verify full functionality
  • Maintain a reliable kernel testing process
  • Monitor hardware health
  • Keep a cross-tested module database
  • What was the root cause? Was the panic caused by a faulty driver, a hardware failure, or human error?

Filesystem Issues

Filesystem not mounting

  • The usual mount command returns errors
  • Scheduled backups and applications suddenly cannot access certain directories
  • Errors like unknown filesystem type, mount: wrong fs type, superblock corrupt in system logs
  • Boot into rescue mode or unmount any stale references, run fsck against the affected device, and inspect or repair the superblock if needed.
  • If the issue arises from /etc/fstab, correct the UUID or device path and then test the mount manually before updating the fstab.
  • Does the filesystem now mount cleanly?
  • Confirm read/write access and update any monitoring dashboards to reflect that the volume is back online

Partition not writable

  • Processes failing with permission denied messages
  • Applications unable to save files even though the directories appear to exist
  • Maybe the filesystem is mounted read-only. Examine /proc/mounts to confirm the ro flag
  • Unmount the partition, run fsck to repair any underlying errors, and then remount it with the correct read-write permissions
  • Does the issue persist?
  • Inspect ownership and ACLs, then apply chmod or chown to grant the correct user or service write access
  • Update any configuration management scripts
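
The read-only check can be automated. This sketch looks up a mount point in /proc/mounts and succeeds only if its options include ro. The function name mount_is_readonly is hypothetical, and the second argument exists mainly so the function can be tried against a saved copy of the file.

```shell
# mount_is_readonly MOUNTPOINT [MOUNTS_FILE]
# Succeeds when MOUNTPOINT is listed with the "ro" option.
# MOUNTS_FILE defaults to /proc/mounts.
mount_is_readonly() {
    awk -v mp="$1" '
        $2 == mp {
            # Options are comma-separated in the fourth field.
            n = split($4, opts, ",")
            ro = 0
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") ro = 1
        }
        END { exit !ro }
    ' "${2:-/proc/mounts}"
}

# Example: mount_is_readonly /data && echo "/data is mounted read-only"
```

Note that an option like errors=remount-ro does not count as read-only by itself. It only says what the kernel will do after a filesystem error. Once the underlying errors are repaired, mount -o remount,rw <MOUNTPOINT> restores write access.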

OS filesystem is full

  • Applications and users are unable to write logs and files
  • Check partition usage to confirm issue
  • Truncate or rotate logs, clean up old core dumps, purge orphaned Docker images, or archive older data to secondary storage
  • Extend the LVM volume or resize the partition, then resize the filesystem
  • Implement proactive monitoring for storage space
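
As a starting point for proactive monitoring, the sketch below filters df output down to the filesystems above a usage threshold. The function name full_filesystems and the default of 90 percent are assumptions for this example. It reads the POSIX format produced by df -P and does not handle mount points that contain spaces.

```shell
# full_filesystems [THRESHOLD]
# Read `df -P` output on stdin and print the mount point and usage of
# every filesystem at or above THRESHOLD percent used (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                      # skip the header line
            used = $5
            sub(/%$/, "", used)
            if (used + 0 >= limit + 0) print $6, $5
        }
    '
}

# Example: df -P | full_filesystems 85
```

Run from cron or a systemd timer, a non-empty result could be mailed to the admin or pushed to the monitoring system.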

Inode exhaustion

  • df -h may show that space is available
  • Typical message: Cannot create file: No space left on device
  • Check df -i and see if inode usage is at 100%
  • Identify directories with excessive file counts and then clean up old or stale files
  • Create a new filesystem with a higher inode count and then migrate the data if necessary
  • Update cleanup policies or add scripts to remove temporary files automatically, preventing a repeat of the issue
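
Finding the directories that hold the most files is the slow part. Here is a possible helper, assuming GNU find (for -printf). The name inode_hogs is made up for this example.

```shell
# inode_hogs DIR [COUNT]
# Print the COUNT directories (default 10) under DIR that contain the
# most entries, without crossing into other filesystems (-xdev).
inode_hogs() {
    # %h prints the parent directory of each entry, so counting
    # identical lines gives the number of entries per directory.
    find "${1:-.}" -xdev -mindepth 1 -printf '%h\n' 2>/dev/null |
        sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Example: inode_hogs /var 5
```

Each output line is a count followed by a directory. Session caches, mail spools, and temporary directories are common culprits.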

Quota issues

  • Individual user or group cannot write files despite free space in the partition
  • Typical message is Disk quota exceeded when creating or writing to a file
  • Use repquota -a and quota -u <USERNAME> to view group or user quotas
  • Adjust soft and hard limits if necessary
  • Identify and remove unnecessary data from the user's home or project directories

Process Issues

Unresponsive Processes

Unresponsive processes occur when a running program stops responding to inputs or system scheduling, causing tasks to hang indefinitely.

  • Does the process fail to respond to user input or system events?
  • Does it consume more resources than usual?
  • Spot this with top and ps
  • Use strace to watch the process's system calls
  • Send SIGTERM to let the process shut down cleanly, and if that fails, escalate to SIGKILL to free resources by force
  • Examine journalctl to determine what caused the process to become unresponsive
  • Implement preventive measures
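
The SIGTERM-then-SIGKILL escalation can be wrapped in a small function. This is only a sketch: the name stop_process and the 10 second default grace period are assumptions, not a standard tool.

```shell
# stop_process PID [GRACE_SECONDS]
# Ask PID to exit with SIGTERM, wait up to GRACE_SECONDS (default 10),
# then fall back to SIGKILL if it is still running.
stop_process() {
    pid=$1
    grace=${2:-10}

    # Nothing to do if the process is gone or cannot be signalled.
    kill -TERM "$pid" 2>/dev/null || return 0

    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited cleanly
        sleep 1
        grace=$((grace - 1))
    done

    # Still alive after the grace period: force it.
    kill -0 "$pid" 2>/dev/null || return 0
    kill -KILL "$pid" 2>/dev/null
}

# Example: stop_process 4242 15
```

Giving the process time to handle SIGTERM lets it flush buffers and release locks. SIGKILL cannot be caught, so it should stay the last resort.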

Killed Processes

Killed processes are processes that were forcibly terminated by a signal.

  • Check journalctl and dmesg for the reason the process was killed
  • Logs may show Killed process <PID> or oom_reaper entries when the kernel's out-of-memory killer terminated the process
  • Go through the logs to determine whether the system or a person killed the process
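
To pull the victims out of a long kernel log, something like the following could work. It assumes the common kernel message form Killed process <PID> (<NAME>). The function name killed_processes is hypothetical.

```shell
# Read kernel log text on stdin and print "PID NAME" for each process
# the kernel reports as killed.
killed_processes() {
    grep -oE 'Killed process [0-9]+ \([^)]*\)' |
        sed -E 's/^Killed process ([0-9]+) \(([^)]*)\)$/\1 \2/'
}

# Examples:
#   journalctl -k | killed_processes
#   dmesg | killed_processes
```

If the process died but nothing shows up here, the signal probably came from a user or another program rather than from the out-of-memory killer.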

Segmentation Fault

A crash that happens when a program tries to access memory it shouldn't, leading to an abrupt termination with an error message.

  • Configure the system to generate and retain core dumps
  • Use the GNU Debugger (gdb) to analyze the core file and pinpoint the faulty code path
  • Is the issue caused by a package? Install a version of the package without that bug

Memory Leaks

Memory leaks occur when a program continuously allocates memory without freeing it, gradually exhausting available RAM and degrading system performance.

  • Watch the RES memory rise steadily with no drop
  • Which process is holding the memory? Review logs and command output
  • Schedule periodic restarts of the service or allocate more RAM to reduce impact
  • Continue monitoring RES
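
Instead of staring at top, we can sample the resident memory of one process at a fixed interval and compare the numbers later. The sketch reads VmRSS from /proc/<PID>/status, which should be close to what top reports as RES. The function names and the default interval are assumptions for this example.

```shell
# rss_kb PID
# Print the resident set size of PID in kB, read from /proc.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# watch_rss PID [INTERVAL] [SAMPLES]
# Print a timestamped RSS sample every INTERVAL seconds (default 60),
# SAMPLES times (default 10).
watch_rss() {
    n=${3:-10}
    while [ "$n" -gt 0 ]; do
        printf '%s %s\n' "$(date +%T)" "$(rss_kb "$1")"
        n=$((n - 1))
        if [ "$n" -gt 0 ]; then
            sleep "${2:-60}"
        fi
    done
}

# Example: watch_rss 4242 300 12 >> /tmp/rss-4242.log
```

A value that keeps climbing across samples taken under similar load, and never drops back, is the pattern to look for.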

System Issues

Device Failure

The server suddenly cannot read from or write to a critical piece of hardware, which is often a disk or network interface.

  • Identify the faulty device
  • Reseat or replace the device
  • If it is a RAID disk, mark the bad disk as failed and rebuild the array with a spare
  • Check disks and network to confirm full functionality of the system

Data corruption issues

They show up as files that refuse to open, applications that crash, or filesystem errors in system logs.

  • Run fsck to detect corrupted data
  • Is there a known-good backup? Restore from backup
  • Otherwise, use fsck with repair options to attempt recovery on the live server (unmount the affected filesystem first)
  • What was the root cause? A failing disk? A power outage?
  • Verify full system functionality before it returns to production

Systemd unit failures

They occur when a service that should be running won't start or crashes immediately.

  • Inspect service with systemctl status <SERVICE> or journalctl
  • Fix the unit configuration in /etc/systemd/system/ if needed
  • Run systemctl daemon-reload to apply changes
  • Start service with systemctl start <SERVICE>
  • Set up alerts to catch unit failures

Server inaccessible

Users cannot remotely access the server.

  • Does ping time out?
  • Does SSH hang?
  • Does the out-of-band management tool fail to respond?
  • Are other servers in the network reachable?
  • Try physical access
  • Maybe reboot the machine, or restore network configs from backups, or repair corrupt network service files
  • Validate that the server is reachable again

Dependency Issues

Package Dependency Issues

Package dependency issues occur when software cannot find or install the components it needs.

  • Are the necessary repositories enabled?
  • Run dnf deplist <PACKAGE> or apt-cache depends <PACKAGE> to find missing dependencies
  • Upgrade or downgrade package if necessary
  • Rerun installation and verify software loads without issues

Path Misconfiguration Issues

Path misconfiguration issues occur when the system cannot locate a program even though it is installed.

The typical error message is command not found.

  • Run echo $PATH to check the current search directories
  • Add missing directory by editing /etc/profile or similar
  • Reload shell or re-login to apply changes
  • Run command again to confirm the program is found
  • Document changes for future deployments
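
When adding a directory to PATH from a profile script, it helps to avoid appending it again on every login. Here is a minimal sketch. The function names are invented, and /opt/tools/bin is only a placeholder directory.

```shell
# path_contains DIR
# Succeeds if DIR is already one of the PATH entries.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# path_append DIR
# Add DIR to the end of PATH unless it is already there.
path_append() {
    path_contains "$1" || PATH="$PATH:$1"
}

# Example (in /etc/profile.d/ or ~/.profile):
#   path_append /opt/tools/bin
#   export PATH
```

Wrapping PATH in colons before matching means /bin is not mistaken for /usr/bin, and the first and last entries are handled like any other.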

Linux: Monitoring Concepts and Configurations

Service Monitoring

Service Level Indicators (SLIs) are specific metrics such as uptime, response time, or error rates. They are used to measure the performance of a service.

Service Level Objectives (SLOs) are targets set for those measurements, such as maintaining 99.9 percent uptime.

Service Level Agreements (SLAs) are formal promises to customers or stakeholders outlining the expected level of service and the consequences if expectations are not met.
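
An uptime SLO is easier to reason about as a downtime budget. The sketch below converts a percentage into allowed minutes of downtime. The function name and the 30-day default window are assumptions for this example.

```shell
# downtime_budget_minutes SLO_PERCENT [DAYS]
# Minutes of downtime allowed over DAYS (default 30) by an uptime SLO.
downtime_budget_minutes() {
    awk -v slo="$1" -v days="${2:-30}" \
        'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

# Example: downtime_budget_minutes 99.9
```

For 99.9 percent over 30 days the result is 43.2 minutes, so a single one-hour outage already breaks the objective for that month.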

Network Monitoring

Network monitoring is the process of keeping track of devices like routers, switches, and servers to make sure everything is running properly.

SNMP - Simple Network Management Protocol

SNMP allows devices to report performance data using a structure called MIB, or Management Information Base. The MIB acts as a built-in database that defines everything that can be monitored on a device, including CPU load, memory usage, and network interface status.

The MIB contains Object Identifiers (OIDs). An OID is a unique sequence of numbers used to locate and retrieve specific information. For example, 1.3.6.1.2.1.1.3.0 is the OID for system uptime (sysUpTime).

SNMP Traps are automatic alerts triggered by specific events like hardware failure or dropped network connections.

Agent-based vs Agentless Monitoring

Agent-based monitoring uses software installed on the monitored device to collect the monitored information. SNMP is agent-based: an SNMP agent runs on each monitored device.

Agentless monitoring collects data using existing remote access protocols, such as SSH on Linux, without requiring any additional software installation on the monitored devices. On Windows systems, protocols like Windows Management Instrumentation allow similar agentless access.

Event-driven Data Collection

Health Checks

Health checks allow systems to automatically test whether a service is running and responding as expected.

# checks if a web service returns a success response
curl -I http://localhost

# check if a systemd service is up and running
systemctl is-active ssh

Webhooks

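Webhooks are HTTP callbacks: when an event occurs, one service sends an HTTP request to a URL registered by another service. They are often used for real-time integrations between services.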

Log Aggregation

Log aggregation is the collection of logs from across the network and their storage in a central location.

Event Management

Logging

Logging provides the raw data needed to understand what is happening across a system. Logs are typically stored in the directory /var/log/ and include files like syslog, auth.log, dmesg, and more.

A SIEM (Security Information and Event Management) system collects and analyzes logs from across the network to help identify security threats, system issues, and unusual activity in real time.

Events

Events are generated when specific patterns or conditions are detected in the log data that indicate something noteworthy has happened.

Alerting and Notifications

Notifications

Notifications are how a Linux admin is informed when the system detects that something may require attention. They can be sent via email, text messages, desktop pop-ups, ticketing systems, or collaboration platforms.

Alerts

Alerts are the system's internal triggers that cause the notifications to be sent.