
Ansible: More about the Inventory File

Ansible's default inventory is located at /etc/ansible/hosts, but we can keep it elsewhere, for example at /home/me/ansible/hosts.ini, and then point Ansible to it using the -i flag.

ansible web -m ping -i ./hosts.ini

Or I can just configure ansible.cfg to point to the path of my inventory file.

[defaults]
inventory = /etc/ansible/inventory/hosts.ini

Hosts can be organized into groups inside the inventory file. A group name must be unique and follow the rules for a valid variable name.

Here is an example with two groups, web and db:

[web]
192.168.10.15
192.168.10.16

[db]
192.168.12.15
192.168.12.16
192.168.12.17

Here is the same inventory in YAML format

web:
  hosts:
    192.168.10.15:
    192.168.10.16:
db:
  hosts:
    192.168.12.15:
    192.168.12.16:
    192.168.12.17:

Ansible automatically creates the all and ungrouped groups behind the scenes. The all group contains all hosts, and the ungrouped group contains all hosts that are not in any group.

So, ansible -m ping all will ping all hosts listed in the inventory file, and ansible -m ping ungrouped will ping all hosts not listed in any group.

Do more in your inventory

  • A host can be part of multiple groups

  • Groups can also be grouped

In YAML:

prod:
  children:
    web:
    db:
test:
  children:
    web_test:

In INI:

[prod:children]
web
db

[test:children]
web_test
  • Add a range of hosts

In INI:

[servers]
192.168.11.[15:35]

In YAML:

servers:
  hosts:
    192.168.11.[15:35]:
  • Add variables to hosts or groups

[prod]
192.168.10.15:4422
prod1 ansible_port=4422 ansible_host=192.168.10.22
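
Variables can also apply to a whole group. A minimal sketch; the group and the values are made up:

[prod:vars]
ansible_user=deploy
ansible_port=4422

Every host in prod then inherits these values unless a host-level variable overrides them.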

You can do way more than what I have listed above. I am not going to bore you with everything about the Ansible inventory here because I don't need more than this at this stage of my learning. But if you feel like learning more about this topic, check the official Ansible inventory documentation.

Goodbye for now

Ansible: Initial Setup

In my previous post, I went quickly through the Ansible installation and initial setup. I did not really set anything up; I just showed you where to find the things that Ansible provides by default.

In this post I will go deeper into the setup process. But I am still not going to try to impress you here. Let's keep that for future posts.

Ansible Control Node

The Ansible config file is located at /etc/ansible/ansible.cfg by default. We are going to use this file later to customize our installation of Ansible.

If you have just a few nodes, you can SSH into each one of them to make sure you can connect correctly. That also means that if you have just a few nodes, Ansible might not be the right tool.

Use ssh-copy-id -i key.pub node-user@192.168.10.10 to add the controller's SSH key to the node's authorized keys so the controller can connect to it.
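
A minimal end-to-end sketch; the key type, user, and IP address are placeholders:

# generate a key pair on the control node (skip if you already have one)
ssh-keygen -t ed25519

# copy the public key to each managed node
ssh-copy-id -i ~/.ssh/id_ed25519.pub node-user@192.168.10.10

# confirm password-less login works
ssh node-user@192.168.10.10 hostname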

Ansible Inventory

The inventory contains the nodes you want ansible to manage. The default inventory file is located at /etc/ansible/hosts. The nodes are put into groups for ease of management. The group names must be unique and they are case sensitive. The inventory file contains the IP addresses or FQDN of the managed hosts.

If we want to use the default inventory file we can just run:

# to ping all nodes in the web group
ansible -m ping web

But if we are working with a dedicated inventory file, like my_nodes.ini, we should tell Ansible that we are providing an inventory file by adding -i [INVENTORY FILE]. For example, ansible web -i my_nodes.ini -m ping

The inventory in the ini format looks like:

[web]
192.168.12.13
192.168.12.14

[db]
192.168.13.13
192.168.13.15

But the inventory file can also be written in the YAML format:

my_nodes:
  hosts:
    node_01:
      ansible_host: 192.168.10.12
    node_02:
      ansible_host: 192.168.10.13

[web] is a group name. It is unique across the inventory file. We can have multiple groups in an inventory file.

To run an Ansible command on multiple groups, separate the group names with colons. For example:

ansible web:db -m ping -i my_nodes.ini --ask-pass

This command will ping the nodes in the web and db groups. --ask-pass prompts for a password if the SSH daemon on the managed nodes asks for the user's password.

If our command requires input to function, maybe we are doing it the wrong way. Ansible is supposed to facilitate automation: a command should be able to run to completion without additional user input. In my initial Ansible setup, I provided input twice when running the ping command: the first was the host key verification, the second was the node password because the SSH keys were not set up properly. We are going to fix this in the next posts.

How to Manage Nodes with Ansible

Until now we have only learned how to ping our nodes using the Ansible ping module. ansible web -m ping is how we tell Ansible to use the ping module against the web group.

Key Points to Remember

  • Ansible is used to automate repetitive tasks we perform on network devices

  • Ansible inventory contains grouped list of nodes we want to manage

  • The inventory can be written in the ini or YAML format

  • Ansible comes with prebuilt modules like ping to facilitate node management.

In my next posts, I will go deeper into each important part of Ansible, such as the inventory and playbooks.

So, read me soon.

Ansible: Installation and Initial Setup

What is Ansible?

Let's cut to the chase. Ansible is a tool for system and network admins to automate repetitive tasks, for example installing and configuring multiple servers, or configuring routers, switches, firewalls, and WAPs at once. Ansible can talk to any device that speaks SSH. Other connection types are supported, but SSH is the default. Visit the Ansible documentation page to learn more.

This is not going to be a step-by-step tutorial on how to use Ansible, nor an in-depth overview of Ansible. A lot of important basic topics will be missing from this post, but they might appear in future posts. So if something is missing here, you can always look at the other posts in the same category. If something I say does not feel right, you can reach out to me with questions or suggestions via LinkedIn or email.

How to Install Ansible on Linux

Ansible is agentless. That means you do not need to install Ansible on the managed nodes for Ansible to push tasks to them. Only the control node needs Ansible installed. But you will need Python and SSH installed and configured on the managed nodes.

There are multiple ways to install Ansible on your Linux workstation, but I will be using the Linux package manager.

How to locate python?

which python3 

# /usr/bin/python3

How to locate SSH?

sudo systemctl status sshd

The SSH daemon must be enabled, active, and running.
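
If it is not running yet, a quick way to bring it up on a systemd-based distro:

# start the SSH daemon now and enable it at boot
sudo systemctl enable --now sshd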

Update your system:

sudo dnf update -y

Install Ansible using the package manager

sudo dnf install ansible -y

Ansible keeps its main configuration files in /etc/ansible. There you should find the files ansible.cfg and hosts.

Run ansible --version to get details about your Ansible installation. The command will also tell you the location of your Ansible default configuration file.

If you install Ansible using the Linux package manager, the config file should be generated and ready to use. If the ansible.cfg file is missing from your installation, you can create it at /etc/ansible/ansible.cfg. There are many ways to set up the Ansible configuration file, but I am going to stick with the one generated by default during the installation.

Ansible Inventory

The Ansible inventory contains the list of hosts you want to manage. By default the hosts file contains the list of nodes, but you can point Ansible to a different inventory file inside ansible.cfg. In the hosts file, you can put the nodes into groups like:

[web]
172.16.10.10
172.16.10.12

[db]
172.16.20.22

If you are able to run ansible --version without issue and locate the Ansible installation folder (/etc/ansible), you are ready to do awesome things with Ansible. In the next posts, we are going to go deeper into the basics of Ansible.

So, stay tuned.

Linux: Troubleshooting Performance Issues

CPU Issues

High CPU usage

  • What process is using the CPU?
  • Use top or htop to see what process is using the CPU
  • Optimize the code or limit the number of processes running

High load average

This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.

  • Check output of uptime or top
  • Is the system overloaded?
  • Are there too many processes running simultaneously?
  • Or is it one process that is causing the backlog?
  • Which query should be optimized if it is a DB process?
  • Maybe offload some tasks to another server
  • Maybe swap the CPU for one with better specs

High context switching

A context switch is when the CPU switches between different processes to allocate resources. A context refers to the state of a running process (Running, Waiting, or Stopped) that allows the CPU to resume the process later.

Too many context switches lead to inefficiency and higher CPU usage.

  • Check context switching with vmstat or pidstat (see the sketch after this list). Check the number of context switches per second.
  • How many context switches per second do we have?
  • More than 500-1000 context switches per second is considered high and may indicate that the system is overwhelmed
  • Maybe reduce the number of running processes
  • Optimize applications to use fewer threads
  • Adjust system limits
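
A minimal way to collect these numbers; the sampling intervals are arbitrary:

# system-wide stats every second, 5 samples; the "cs" column is context switches per second
vmstat 1 5

# per-process voluntary/involuntary context switches, refreshed every second
pidstat -w 1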

CPU bottleneck

This issue occurs when the CPU is the limiting factor in system performance.

  • The CPU usage is consistently high (above 80%)
  • Are tasks taking too long to process?
  • Load average exceeds the number of available CPU cores
  • Use top and htop to identify processes that are using an inordinate amount of CPU time
  • Optimize the processes using the CPU
  • Maybe a hardware upgrade is needed

Memory Issues

Swapping Issues

This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory

  • Is the system running out of swap space?
  • The swap file is typically located at /swapfile or on a dedicated swap partition
  • Identify swap space with swapon -s
  • Has the system performance degraded?
  • What do top and htop say about swap usage?
  • Is the usage more than 10%? That might be high
  • Monitor swap usage with free -h or vmstat
  • Maybe more physical RAM is needed
  • Or adjust the swappiness kernel value. ex: sudo sysctl vm.swappiness=10

Out of Memory (OOM) Errors

This issue happens when the system runs out of both physical and virtual memory

  • Are critical processes being unexpectedly terminated?
  • Check system logs for OOM messages
  • Adjust application configurations to optimize memory usage
  • Increase RAM
  • Adjust swap space

Disk I/O Issues

This issue occurs when the system slows down due to delays in reading from or writing to a storage device.

High input/output wait time

This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, causing delays in executing processes.

  • Monitor disk load with iotop or dstat (see the sketch after this list). top shows I/O wait as x.x wa; iostat gives %iowait
  • Optimize or spread out disk operations
  • Improve performance or upgrade to SSDs if needed
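
A quick check, assuming the sysstat package is installed:

# extended per-device stats every 2 seconds; watch %iowait and %util
iostat -x 2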

High disk latency

Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or inefficient disk configuration.

  • What does iostat output say about the disk latency?
  • Is the latency above the normal 10ms? Higher than 20ms may indicate a latency issue
  • Is the disk operating at its maximum throughput?
  • Maybe some drivers need to be upgraded
  • Maybe adjusting the disk (RAID) config can help reduce the latency
  • Maybe upgrade to faster disks

Slow remote storage response

This issue occurs when accessing data from remote storage systems such as NFS, SAN, or cloud storage

  • Long response time when accessing files?
  • Check network performance using ping or netperf to check for network issues
  • Optimize network settings or upgrade network hardware

Network Stability Issues

Packet drops

This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues

  • ping -c 100 <DESTINATION> will give the percentage of packets dropped. Over 1% packet loss may be unacceptable
  • Packet loss can lead to slow performance, timeouts, and application errors
  • Check routers and switches for hardware issues and faulty cables
  • Check for NIC errors using ifconfig and ethtool
  • Adjust QoS settings or increase bandwidth

Random disconnects

This issue occurs when a network connection is unexpectedly terminated

  • Are users or services suddenly losing network access?
  • Look for connection reset or connection closed messages in logs using dmesg or ifconfig
  • Check network stack configuration
  • Maybe the cable or some hardware in the path is faulty
  • Maybe a firewall is closing the connection
  • Check and adjust TCP settings if necessary

Random timeouts

This issue occurs when a connection fails to receive a response within the expected time frame.

  • Errors can be seen in logs
  • Getting connection timed out error when using curl to connect to a service?
  • Maybe the network is congested
  • Maybe there is a DNS issue
  • Maybe the server is overloaded
  • A timeout threshold of 5-10 seconds is typically acceptable
  • Use ping or traceroute to check for network congestion
  • Make sure DNS servers are correctly configured
  • Check server performance
  • Adjust TCP timeout if necessary

Network Performance Issues

High latency

High latency refers to the delay in the time it takes for data to travel from one point to another

  • Measure latency using ping or traceroute
  • A 100ms latency in a local network is considered high, and 300ms over remote communication is also high
  • Check for network congestion
  • Identify hardware issues
  • Maybe the routing is misconfigured; optimize the network path
  • Maybe upgrading the network infrastructure could help

Jitter

Jitter is the variation of latency over time, which can cause problems in real-time applications.

  • A value above 30ms of fluctuation can cause noticeable issues
  • Detect Jitter issues with ping -i 0.2 <DESTINATION>
  • Check for network congestion or hardware issues
  • Implement QoS to prioritize relevant traffic

Slow response time

This issue occurs when the network takes too long to respond to requests.

This could be due to:

  • High latency
  • Congestion
  • Overloaded servers
  • Misconfigured applications

  • Use curl or wget to measure response time and identify bottlenecks in the network or a server

  • Check server load
  • Optimize application code
  • Check for server resources
  • Review network configurations

Low throughput

This issue occurs when the network is unable to transmit data at a high enough rate

  • Identify low throughput using iperf; anything below 80% of the expected bandwidth is considered low throughput (see the sketch after this list)
  • Check for network congestion
  • Check for faulty cables
  • Check for incorrect settings
  • Maybe switch to a high bandwidth network
  • Maybe reduce unnecessary traffic
  • Optimize network routes
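
A minimal iperf3 run; the server IP and duration are placeholders:

# on the server side
iperf3 -s

# on the client side: a 10-second throughput test against the server
iperf3 -c 192.168.10.50 -t 10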

System Responsiveness Issues

Slow application response

This issue occurs when an application takes longer than expected to react to user inputs

  • Use top and htop to identify application resource consumption
  • Maybe the application code needs to be optimized
  • Increase system resources
  • Check disk I/O for overload
  • Check for unnecessary background processes

Sluggish terminal behavior

It happens when commands in the terminal are delayed and the system takes an unusually long time to execute them.

  • Use top or iotop to check for system resource usage
  • Optimize processes running on the system
  • Cleanup system resources
  • Add more RAM or CPU cores

Slow startup

The system is taking an unusually long time to boot up

  • See which services take longer than expected using systemd-analyze (see the sketch after this list)
  • Maybe too many services are configured to start at the same time
  • Maybe one of the startup services is misconfigured?
  • Delay or disable non essential services from starting at boot time using systemctl
  • Optimize the boot sequence
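
A quick sketch; the service name is a placeholder:

# overall boot time
systemd-analyze

# services ordered by startup time
systemd-analyze blame

# keep a non-essential service from starting at boot
sudo systemctl disable myservice.service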

System unresponsiveness

This issue occurs when the system becomes completely unresponsive

  • Is the system not accepting new input?
  • Applications are no longer responding?
  • Use dmesg or journalctl to identify what caused a kernel panic
  • Identify runaway processes using top and htop
  • Maybe upgrade the RAM or add more CPU cores

Process Management Issues

Blocked processes

This issue occurs when a process is unable to proceed due to waiting on resources or system locks

  • Are commands or applications stuck?
  • ps and top show processes in the D (uninterruptible sleep) state
  • Use lsof to check which files a process is waiting on (see the sketch after this list)
  • Use strace to trace system calls and signals
  • Is a process repeatedly stuck or blocked? Maybe due to resource contention
  • Optimize disk I/O
  • Maybe add more memory
  • Investigate dependency issues between processes
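
A minimal sketch; the PID is a placeholder:

# list the files a process has open
lsof -p 1234

# attach to the process and watch its system calls to see where it is stuck
strace -p 1234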

Exceeding baselines

This happens when processes consume more resources than expected

  • Notice high CPU usage
  • Unusual memory consumption
  • Excessive disk activity
  • Use top, htop, or pidstat to identify this issue
  • Optimize application resource usage
  • Maybe configure system resource limits with ulimit

High failed log-in attempts

This issue often signals attempted unauthorized access or brute force attacks

  • Maybe a brute force attack?
  • Unauthorized access attempts?
  • System compromise?
  • What are the logs in /var/log/auth.log saying? (see the sketch after this list)
  • Check journalctl to identify login attempts
  • 5-10 login attempts from a single IP address within a short time may be a red flag for a brute-force attack
  • Implement fail2ban to block abusive IPs
  • Enforce strong password policy
  • Use MFA
  • Limit access with firewall or IP allowlist
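
A quick way to count failed SSH logins per source IP on a Debian-style system; the log path and field position assume the default OpenSSH log format:

# top offending IPs by failed password attempts
grep 'Failed password' /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head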

Linux: Troubleshooting Security Issues

SELinux Issues

SELinux policy issues

SELinux Policy defines what actions users and applications can perform on a system based on security rules.

An overly restrictive or misconfigured policy can prevent the system from working properly.

avc: denied is the typical error message found in logs when dealing with SELinux policy issues.

  • Review logs with ausearch or sealert (see the sketch after this list)
  • Modify rules if necessary
  • Test policy in a safe environment before applying
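
A minimal check, assuming auditd is running:

# show recent AVC denials from the audit log
sudo ausearch -m avc -ts recent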

SELinux context issues

SELinux uses contexts to label every file, process, and resource on the system, determining what access is allowed.

An incorrect or misconfigured label can prevent applications from accessing the resources they need to function

  • Use ls -Z for files and ps -Z for processes to look for SELinux context issues
  • Does the file or process have incorrect context?
  • Restore the context with sudo restorecon -v <FILE PATH>
  • Running restorecon regularly on key directories helps avoid repeated context mislabeling issues

SELinux boolean issues

SELinux booleans allow adjusting certain security settings without modifying the underlying policy.

An incorrectly set boolean can cause certain services or applications to malfunction

  • Check booleans with getsebool
  • Are certain booleans incorrectly set?
  • Toggle booleans with setsebool. ex: setsebool -P httpd_can_sendmail 1
  • Test modification and document changes

File and Directory Permission Issues

File attributes

File attributes control certain behaviors and restrictions on files and directories, which go beyond the regular rwx permissions.

  • Check file attributes with lsattr. i=immutable, a=append-only
  • Remove incorrect attribute with chattr. ex: chattr -i <FILE PATH>
  • Verify file access and document changes

Access Control Lists (ACLs)

ACLs provide more fine-grained control over who can access a file or directory and what actions can be performed.

  • Check if a file is using ACLs with getfacl
  • Adjust the ACLs with setfacl. ex: give read-only access to user tom setfacl -m u:tom:r <FILE PATH>
  • Verify proper access and document changes

Access Issues

Account access issues

Most common issue

  • Are the credentials incorrect?
  • Maybe the account is locked or disabled
  • Check system logs for messages
  • Check if account is locked with sudo passwd -S tom
  • Unlock account with sudo passwd -u tom
  • Reset the user password with sudo passwd tom
  • Re-enable a disabled account with sudo usermod -e '' tom. '' means no account expiration date

Remote access issues

Issues with VPN or SSH

  • Is the issue caused by network issues, misconfigurations, or firewall?
  • Is the SSH service running? check with sudo systemctl status sshd
  • Enable the SSH service with sudo systemctl start sshd && sudo systemctl enable sshd
  • Check firewall with sudo ufw status or sudo iptables -L
  • Does the problem still persist? Check routing and public key validity

Certificate issues

Common messages: SSL certificate expired, SSL handshake failure

  • Is the certificate expired?
  • Maybe the certificate chains are misconfigured
  • Maybe it is a CA issue
  • Check certificate issues with openssl s_client -connect mysite.com:443 (see the sketch after this list)
  • Renew the certificate if necessary
  • Ensure the full certificate chain is correctly installed
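
A quick way to read the validity window; mysite.com is a placeholder:

# print the certificate's notBefore/notAfter dates
echo | openssl s_client -connect mysite.com:443 -servername mysite.com 2>/dev/null | openssl x509 -noout -dates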

Configuration Issues

Exposed or misconfigured services

This issue occurs when system services are either left open to the public or configured incorrectly.

  • Does the service have proper security settings? The DB should not be accessible from the internet
  • Review security logs
  • Use tools like nmap to scan open ports
  • Configure the firewall to restrict access to trusted IPs
  • Disable unused services
  • Ensure critical services are only accessible when necessary

Misconfigured package repositories

This issue prevents the system from accessing the correct software sources. It prevents software updates and installations.

  • What errors show when running sudo apt update or sudo dnf update
  • Check the repository configuration files: /etc/apt/sources.list on Debian-based systems or /etc/yum.repos.d/ on RHEL-based systems
  • Edit the repository URLs if necessary

Vulnerabilities

Vulnerabilities are weaknesses or flaws in the system that can be exploited by attackers to compromise security.

Unpatched vulnerable system

  • Do I have the latest security patches?
  • Use vulnerability scanners to detect security issues
  • Regularly apply updates with sudo apt update && sudo apt upgrade on Debian or sudo dnf update on RHEL.

The use of obsolete or insecure protocols and ciphers

  • Is the system using secure ciphers for data and communication protection?
  • Are insecure ciphers and protocols disabled in the system? SSLv3 is vulnerable to the POODLE attack; RC4 is vulnerable to bias attacks
  • Check the protocols in use in sshd_config for SSH and apache2.conf for Apache (see the sketch after this list)
  • Disable outdated protocols
  • Remove weak ciphers in the configuration files
  • Use strong ciphers like AES and protocols like TLS 1.2 or 1.3
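
Two quick SSH-side checks, assuming OpenSSH 6.8 or later:

# list the ciphers the local OpenSSH build supports
ssh -Q cipher

# dump the settings sshd is actually running with
sudo sshd -T | grep -Ei 'ciphers|macs|kexalgorithms'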

Cipher negotiation issues

This issue occurs when there is a failure in the negotiation or encryption methods between a client and a server.

Review connection logs to confirm both server and client are using strong encryption methods

Linux: Troubleshooting Networking Issues

Firewall Issues

Misconfigured Firewall

Typo in firewall rule

A simple typo in a firewall rule can block traffic.

Use firewall-cmd --list-ports to see open ports

Remove bad rule with firewall-cmd --remove-port=<PORT>/PROTOCOL --permanent, re-issue the correct command, and reload the firewall with firewall-cmd --reload.

Incorrect Rule Ordering

This happens when a DROP or REJECT rule is placed above an ACCEPT rule, causing legitimate traffic to be blocked.

Forgetting to persist firewall changes across reboot

If a rule is added without --permanent the rule disappears after reboot.
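
A minimal firewalld sketch; the port number is arbitrary:

# open a port in the running configuration only
sudo firewall-cmd --add-port=8080/tcp

# make the rule survive reboots, then reload
sudo firewall-cmd --add-port=8080/tcp --permanent
sudo firewall-cmd --reload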

Addressing Issues

DHCP issues

This issue occurs when servers or workstations fail to obtain an IP address automatically.

  • Is the DHCP service running at all?
  • Does the server have free IP addresses to allocate? Check the DHCP scope for exhaustion by reviewing logs on the DHCP server.
  • Do I need to expand the pool?
  • Force the client to request an IP again (see the sketch after this list)
  • Confirm connectivity
  • Update network documentation to reflect the change
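
A minimal client-side renewal sketch; the interface name is a placeholder:

# release the current lease, then request a new one
sudo dhclient -r eth0
sudo dhclient eth0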

IP conflicts

IP conflicts occur when two devices claim the same address, leading to intermittent connectivity or "duplicate address" warnings in syslog.

  • Common signs are random disconnect, slow network performance, or ARP conflict messages.
  • Identify all devices using the conflicting IP by checking the DHCP lease files and DNS records
  • Assign a unique address to one of the devices
  • Update any static configurations
  • Clear the ARP cache to ensure no stale entries remain (see the sketch after this list)
  • Monitor the network to confirm the conflict is gone
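
A quick sketch; the interface and IP address are placeholders:

# flush stale ARP entries
sudo ip neigh flush all

# probe the address to see whether more than one MAC answers
arping -I eth0 192.168.10.20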

Dual stack issues

This issue occurs when a server configured for both IPv4 and IPv6 fails to handle traffic properly.

  • Ping tests may fail for either IPv4 or IPv6
  • Do DNS records include both A and AAAA entries?
  • Adjust service configuration files to listen on both IPv4 and IPv6
  • Test connectivity over both protocols and ensure firewalls allow the appropriate traffic on each address family

Routing Issues

DNS issues

ping my.server.com returns unknown host

  • Confirm the DNS server in /etc/resolv.conf
  • Make changes if necessary
  • Is the DNS server reachable?
  • Test DNS resolution (see the sketch after this list)
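
A minimal resolution test; the hostname and the alternate resolver are placeholders:

# query using the system resolver
dig my.server.com

# query a specific DNS server to compare answers
dig @8.8.8.8 my.server.com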

Wrong gateway

  • Why are the packets not leaving the local network?
  • Can devices in different subnets communicate?
  • Can devices in other subnets communicate with external resources?
  • Check the default route with ip route show
  • Update the default route if necessary (see the sketch after this list)
  • Ping external resources to confirm connectivity
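
A minimal sketch; the gateway IP is a placeholder:

# show the current default route
ip route show default

# replace the default gateway
sudo ip route replace default via 192.168.10.1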

Server unreachable

When a server is unreachable, neither the hostname nor the IP address responds to ping.

  • Use ip link to check if the network interface is up and running.
  • Check switch port and VLAN settings
  • Is the firewall blocking ICMP or SSH?
  • Adjust port, VLAN, and firewall rule if necessary
  • Confirm connectivity using ping or SSH

Interface Misconfiguration

Subnet misconfiguration

This issue occurs when an interface is assigned to the wrong network or network mask. That prevents the server from communicating with other devices in the network.

  • Confirm address settings with ip addr
  • Edit the interface's configuration so the IP address and netmask align with the correct network segment
  • Apply changes with netplan apply or systemctl restart networking
  • Ping a known host on the subnet and confirm that traffic works as it should

MTU mismatch

This happens when one endpoint sends packets sized differently from what the receiving interface can handle.

  • ping -M do -s 1472 <DESTINATION> returns "Frag needed but DF set" (see the sketch after this list)
  • Check the MTU on each interface with ip link show
  • Pick a consistent MTU value, which is often 1500 for standard networks, and update the interface configuration.
  • Retry transfer or ping test to see correct connectivity
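
A minimal sketch; the destination, interface name, and MTU value are placeholders:

# probe the path with Don't Fragment set; 1472 bytes of payload + 28 bytes of headers = 1500
ping -M do -s 1472 192.168.10.50

# set a consistent MTU on the interface
sudo ip link set dev eth0 mtu 1500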

Cannot ping server

This often indicates a deeper interface misconfiguration, such as disabled interface, missing address, or firewall blocking ICMP

  • Is the interface up with a valid ip address?
  • Bring up the interface with ip link set <INTERFACE> up and assign the correct IP address
  • Is the firewall blocking ICMP? use sudo ufw status or iptables -L to ensure that ICMP is not blocked
  • Ping again to confirm connectivity

Interface bonding issues

Interface bonding combines two or more physical NICs into a single virtual interface to increase bandwidth and provide redundancy.

  • Is any interface in /proc/net/bonding/bond0 marked down even though it is plugged in?
  • Mode 0 (balance-rr), Mode 1 (active-backup), Mode 4 (802.3ad/LACP)
  • Is the bonding driver loaded?
  • Check the bonding configuration in either /etc/netplan...yaml on Ubuntu or /etc/sysconfig/network-scripts/ifcfg-bond0 on RHEL
  • Check switch setting to confirm matching valid configuration

MAC spoofing issues

This issue occurs when two NICs present the same MAC address.

  • arping <IP ADDRESS> returns multiple MAC addresses
  • Does ip neigh show frequent MAC flapping?
  • Look for duplicate MAC addresses with ip link show
  • Correct the MAC settings
  • Restart the network service to apply the changes and confirm with ip neigh show

Network interface issues

This issue occurs when devices are unable to communicate effectively due to problems with the network interface.

  • The interface is failing to establish or maintain a connection
  • Maybe a faulty cable? Is the port disconnected? Maybe the hardware is faulty
  • ip addr and ifconfig show the interface as down
  • Logs can be found with dmesg and journalctl
  • Maybe the driver is bad
  • Maybe the interface is misconfigured
  • Maybe the interface is administratively down; use ip link show <INTERFACE> to confirm and bring it up if necessary
  • Restart networking with systemctl restart network

Autonegotiation issues

This involves problems in the automatic process where devices agree on the speed and duplex settings for their connection.

Common signs are poor performance, slow speeds, connectivity dropouts.

  • Check link status with ethtool <INTERFACE>
  • Is autonegotiation enabled?
  • Maybe the hardware has issues; review system logs for related errors
  • Does the network driver have bugs and need to be updated?

Linux: Troubleshooting Hardware, Storage, and Linux OS

Troubleshooting steps:

  1. Identify the problem
  2. Establish a theory of probable cause
  3. Test the theory to confirm or refute it
  4. Establish a plan of action, implement the solution or escalate if needed, and then verify full system functionality
  5. Implement preventive measures to avoid recurrence and perform a root cause analysis

Boot Issues

Server Not Turning On

  • No power lights?
  • No fans?
  • No console output?
  • Do similar systems have the same issues?
  • Maybe the PDU is down?
  • Maybe the PSU has failed?
  • Check the power in the PDU
  • Swap in a known-good power cable
  • Plug another device into the same outlet
  • Still failing?
  • Inspect the PSU
  • Reseat connectors
  • Swap in a spare PSU
  • Verify the system powers on
  • Label cables
  • Schedule PSU health checks
  • Perform a root cause analysis

GRUB Misconfigurations

  • The server drops to a GRUB rescue prompt?
  • The server shows an error like "file not found"
  • Are multiple kernels failing?
  • Maybe /etc/default/grub was edited?
  • Maybe an initrd entry was deleted?
  • Use the GRUB cli to probe available partitions
  • Verify the kernel and initramfs files are where GRUB expects them to be
  • Boot from rescue ISO or live environment
  • Mount the root filesystem
  • Correct the UUID or kernel path in /etc/default/grub
  • Regenerate the GRUB configuration: grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-based systems, or update-grub on Debian
  • Reboot and verify the kernel loads properly
  • Back up grub.cfg before modifications
  • Why did the issue occur in the first place?
  • A rushed update?
  • A lack of peer review?

Kernel Corruption Issues

  • Observing errors such as "bad magic number" or "kernel image corrupt" during boot
  • Check whether only the latest kernel version is affected or if other versions work from the GRUB menu
  • Maybe a package update failed mid-install
  • Maybe the /boot partition has disk errors
  • Boot into an older, working kernel
  • Mount /boot and verify file checksums. If checksums fail, the corruption is real
  • Reinstall the corrupted kernel package
  • Reboot to verify that the new kernel loads
  • Monitor disk health
  • Ensure updates are completed successfully
  • See if disk failure or an interrupted update was at fault

Missing or Disabled Drivers

  • Boot hangs or drops to an initramfs shell with errors like "VFS: Cannot open root device"
  • Check if only certain hardware (for example, a RAID controller) is missing in /dev or /sys
  • Maybe the initramfs was rebuilt without the necessary driver modules
  • Maybe someone blacklisted a driver
  • Examine the initramfs contents with lsinitrd or dracut --list-modules to confirm whether the driver is absent
  • Rebuild the initramfs including the required modules
  • Reboot to verify that the driver loads and the root filesystem is detected
  • Document driver dependencies in the build scripts
  • Automate initramfs rebuilds when kernel updates occur
  • Did a kernel package change or a manual configuration error cause the driver omission?

Kernel Panic Events

  • Read the panic message on the console
  • Does the panic happen on every boot or only after certain changes?
  • Maybe a newly added module is incompatible
  • Maybe the memory has gone bad
  • Let's try booting with a previous kernel
  • Run memtest86+
  • Disable suspect modules via the kernel boot line
  • Remove or update the offending module
  • Roll back to a known-good kernel
  • Replace faulty RAM
  • Reboot and verify full functionality
  • Maintain a reliable kernel testing process
  • Monitor hardware health
  • Keep a cross-tested module database
  • What was the root cause? Was it a faulty driver, a hardware failure, or human error that caused the panic?

Filesystem Issues

Filesystem not mounting

  • The usual mount command returns errors
  • Scheduled backups and applications suddenly cannot access certain directories
  • Errors like unknown filesystem type, mount: wrong fs type, superblock corrupt in system logs
  • Boot into rescue mode or unmount any stale references, run fsck against the affected device, and inspect or repair the superblock if needed (see the sketch after this list).
  • If the issue arises from /etc/fstab, correct the UUID or device path and then test the mount manually before updating the fstab.
  • Does the system now mount cleanly?
  • Confirm read/write access and update any monitoring dashboards to reflect that the volume is back online
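
A minimal sketch; the device is a placeholder, and fsck should only run on an unmounted filesystem:

# check and repair the filesystem, answering yes to fixes
sudo fsck -y /dev/sdb1

# after correcting /etc/fstab, test all entries without rebooting
sudo mount -a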

Partition not writable

  • Processes failing with a permission denied message
  • Applications unable to save files even though the directories appear to exist
  • Maybe the filesystem is mounted read-only; examine /proc/mounts to confirm the ro flag
  • Unmount the partition, run fsck to repair any underlying errors, and then remount it with the correct read-write permissions
  • Does the issue persist?
  • Inspect ownership and ACLs, then apply chmod or chown to grant the correct user or service write access
  • Update any configuration management scripts

OS filesystem is full

  • Applications and users are unable to write logs and files
  • Check partition usage to confirm issue
  • Truncate or rotate logs, clean up old core dumps, purge orphaned Docker images, or archive older data to secondary storage (see the sketch after this list)
  • Extend the LVM volume or resize the partition, then resize the filesystem
  • Implement proactive monitoring for storage space
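
A quick way to find what is eating the space; /var is just a common culprit:

# largest directories under /var, staying on this filesystem
sudo du -xh /var | sort -rh | head -n 10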

Inode exhaustion

  • df -h may show that space is available
  • Typical message: Cannot create file: No space left on device
  • Check df -i and see if the inode count is at 100% (see the sketch after this list)
  • Identify directories with excessive file counts and then clean up old or stale files
  • Create a new filesystem with a higher inode ratio and then migrate the data if necessary
  • Update cleanup policies or add scripts to remove temporary files automatically, preventing a repeat of the issue
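
A minimal sketch; the path is a placeholder:

# inode usage per filesystem
df -i

# directories holding the most files, staying on this filesystem
sudo find /var/spool -xdev -type f | cut -d/ -f1-4 | sort | uniq -c | sort -rn | head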

Quota issues

  • Individual user or group cannot write files despite free space in the partition
  • Typical message is Disk quota exceeded when creating or writing to a file
  • Use repquota -a and quota -u <USERNAME> to view group or user quotas
  • Adjust soft and hard limits if necessary
  • Identify and remove unnecessary data from the user's home or project directories

Process Issues

Unresponsive Processes

This issue occurs when a running program stops responding to inputs or system scheduling, causing tasks to hang indefinitely.

  • Does the process no longer respond to user input or system events?
  • Is it consuming more resources than usual?
  • Spot this with top and ps
  • Use strace to watch the process
  • Send SIGTERM to let the process shut down cleanly, and if that fails, escalate to SIGKILL to free resources by force (see the sketch after this list)
  • Examine journalctl to determine what caused the process to become unresponsive
  • Implement preventive measures
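
A minimal sketch; the PID is a placeholder:

# ask the process to exit cleanly
kill -TERM 1234

# force-kill it only if it does not respond
kill -KILL 1234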

Killed Processes

They happen when a process is forcibly terminated by a signal.

  • Check journalctl and dmesg for the reason the process was killed (see the sketch after this list)
  • Logs may show Killed process <PID> or oom_reaper entries to indicate a killed process
  • Go through the logs to determine whether the system or a person killed the process
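
A quick check for OOM killer activity:

# kernel messages about killed processes
dmesg | grep -i 'killed process'

# the same from the journal
journalctl -k | grep -i oom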

Segmentation Fault

A crash that happens when a program tries to access memory it shouldn't, leading to an abrupt termination with an error message.

  • Configure the system to generate and retain core dumps (see the sketch after this list)
  • Use the GNU Debugger to analyze the core file and pinpoint the faulty code path
  • Is the issue from a package? Reinstall a version of the package without that bug
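
A minimal sketch; the binary and core file names are placeholders:

# allow core dumps in the current shell
ulimit -c unlimited

# after the crash, load the core into gdb and print a backtrace with the bt command
gdb ./myapp core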

Memory Leaks

Memory leaks occur when a program continuously allocates memory without freeing it, gradually exhausting available RAM and degrading system performance

  • Watch the RES memory rise steadily with no drop
  • Who is reserving the memory? Review logs and command output
  • Schedule periodic restarts of the service or allocate more RAM to reduce impact
  • Continue monitoring RES

System Issues

Device Failure

The server suddenly cannot read from or write to a critical piece of hardware, which is often a disk or network interface.

  • Identify the faulty device
  • Reseat or replace the device
  • If it is a RAID disk, mark the bad disk as failed and rebuild the array with a spare
  • Check disks and network to confirm full functionality of the system

Data corruption issues

They show up when files refuse to open, applications crash, or filesystem errors appear in system logs.

  • Run fsck to detect corrupted data
  • Is there a known-good backup? restore from backup
  • Use fsck with repair options to attempt recovery on the live server
  • What was the root cause? A failing disk? A power outage?
  • Verify full system functionality before it returns to production

Systemd unit failures

They occur when a service that should be running won't start or crashes immediately

  • Inspect service with systemctl status <SERVICE> or journalctl
  • Maybe edit the unit config in /etc/systemd/system/
  • Run systemctl daemon-reload to apply changes
  • Start service with systemctl start <SERVICE>
  • Set up alerts to catch unit failures

Server inaccessible

Users cannot remotely access the server.

  • Does ping timeout?
  • Does SSH hang?
  • Out-of-band tool does not respond?
  • Are other servers in the network reachable?
  • Try physical access
  • Maybe reboot the machine, or restore network configs from backups, or repair corrupt network service files
  • Validate server is reachable again

Dependency Issues

Package Dependency Issues

Occur when software cannot find or install the components it needs

  • Are the necessary repositories enabled?
  • Run dnf deplist <PACKAGE> or apt-cache depends <PACKAGE> to find missing dependencies
  • Upgrade or downgrade package if necessary
  • Rerun installation and verify software loads without issues

Path Misconfiguration Issues

Occur when the system cannot locate a program even though it is installed

Typical error message is Command not found

  • Examine echo $PATH to check current search directories
  • Add the missing directory by editing /etc/profile or similar (see the sketch after this list)
  • Reload shell or re-login to apply changes
  • Run command again to confirm the program is found
  • Document changes for future deployments
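
A minimal sketch; the directory is a placeholder:

# check the current search path
echo $PATH

# add a directory for the current session
export PATH="$PATH:/opt/myapp/bin"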

Linux: Monitoring Concepts and Configurations

Service Monitoring

Service Level Indicators (SLIs) are specific metrics, such as uptime, response time, or error rates, used to measure the performance of a service.

Service Level Objectives (SLOs) are targets to meet based on measurements such as maintaining 99.9 percent uptime.

Service Level Agreements (SLAs) are formal promises to customers or stakeholders outlining the expected level of service and the consequences if expectations are not met.

Network Monitoring

Network monitoring is the process of keeping track of devices like routers, switches, and servers to make sure everything is running properly.

SNMP - Simple Network Management Protocol

SNMP allows devices to report performance data using a structure called MIB, or Management Information Base. The MIB acts as a built-in database that defines everything that can be monitored on a device, including CPU load, memory usage, and network interface status.

The MIB contains Object Identifiers (OIDs). An OID is a unique number used to locate and retrieve specific information.

SNMP Traps are automatic alerts triggered by specific events like hardware failure or dropped network connections.

Agent-based vs Agentless Monitoring

Agent-based monitoring uses software installed on the monitored device to collect monitoring information. SNMP is an agent-based monitoring tool.

Agentless monitoring collects data using existing remote access protocols without requiring any additional software installation on the monitored devices. On Windows systems, protocols like Windows Management Instrumentation allow similar agentless access.

Event-driven Data Collection

Health Checks

Health checks allow systems to automatically test whether a service is running and responding as expected.

# checks if a web service returns a success response
curl -I http://localhost

# check if a systemd service is up and running
systemctl is-active ssh

Webhooks

Webhooks are often used for realtime integrations between services.

Log Aggregation

Log aggregation is the collection of logs from across the network into a central storage location.

Event Management

Logging

Logging provides the raw data needed to understand what is happening across a system. Logs are typically stored in the /var/log/ directory and include files like syslog, auth.log, dmesg, and more.

SIEM stands for Security Information and Event Management. A SIEM system collects and analyzes logs from across the network to help identify security threats, system issues, and unusual activity in real time.

Events

Events are generated when specific patterns or conditions are detected in the log data that indicate something noteworthy has happened.

Alerting and Notifications

Notifications

Notifications are how a Linux admin is informed when the system detects that something may require attention. They can be sent via email, text message, desktop pop-ups, ticketing systems, or collaboration platforms.

Alerts

Alerts are the system's internal triggers that cause the notifications to be sent.

Linux: Automated Tasks with Shell Scripting

Parameter Expansion

Parameter expansion is a way to substitute the value of a variable into a command or script so that the instructions become dynamic and flexible instead of static. ex: ${var}

${var}

${var} is used in shell environments to insert the value of a variable into a command. var is the name of the variable we want to expand.

ex:

location="/var/log"

cd ${location}

Command Substitution

Command Substitution inserts the result of a command directly into another command or script.

'bar' - Single-Quoted String

Everything in the single quote is treated as literal text. There will be no variable expansion and no command substitution. The text will be printed as it is written.

ex:

echo 'Warning: $PATH cannot be found'

Warning: $PATH cannot be found

$(bar) - Substituting a Command

This is how command substitution is done. This will run the command inside the parentheses by replacing the $(...) with the command's output.

# /backup/YYYY-MM-DD
mkdir /backup/$(date +%F) # mkdir /backup/2025-11-15

Subshell Execution

A subshell is a separate child process created by the shell to execute a command or group of commands in isolation without affecting the current shell environment. Whatever happens inside the subshell will not carry over to the main shell session.

(bar) - Creating a Subshell

The syntax is (cmd1; cmd2;...). All commands inside the parentheses are executed in a child shell.

ex:

# execute the command in a new shell
(bar)
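
A more concrete sketch of the isolation:

# the cd happens in a child shell only
(cd /tmp && pwd)   # prints /tmp
pwd                # the parent shell is still in the original directory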

Functions

A function is a set of commands packaged under a single name to allow repeated use without rewriting the commands each time.

ex:

function hello {
  echo "Hello, $1"
}

hello() {
  echo "Hello, $1"
}
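
Both definitions are equivalent. Calling the function passes arguments the same way as to a script; $1 inside the function is the first argument:

hello "world"   # prints: Hello, world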

Bash functions can only return numeric exit codes.

Variables by default are global. Use local to define a local variable in functions.

ex:

# to define a local variable in a function
function hello {
  local my_var="Hello"
}

Internal Field Separator / Output Field Separator

IFS tells the shell where to split input into distinct words.

OFS (the output field separator) is used by tools like awk to re-assemble data for output.

Avoiding Word Splitting

Word splitting is the shell's habit of treating spaces, tabs, and newlines inside a variable as natural break points. To fix this, we wrap the variable in double quotes or pass it through printf. ex: printf '%s\n' "$variable".

With

file_path="My project/file.txt
cat file_path

the shell will attempt to open 2 files: My and project/file.txt.

But printf '%s\n' "$file_path" will produce the exact string, on one line, with no splits.

Controlling Input Splitting

IFS=<DELIMITER> read -r VAR1 VAR2 ... <<< "$TEXT"

ex:

IFS=',' read -r name city role <<< "tom,New York,Developer"
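
Printing the variables confirms the split:

echo "$name | $city | $role"   # tom | New York | Developer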

Output Formatting

A common pattern is awk 'BEGIN{OFS="<DELIMITER>"} {print $1,$2,...}' <FILE>

ex:

# converting a portion of the /etc/passwd file into a CSV (-F: because the fields are colon-delimited)
awk -F: 'BEGIN{OFS=","} {print $1,$3,$4}' /etc/passwd | head -n 3

BEGIN{OFS=","} tells awk that commas should go between fields.

$1, $3, and $4 refer to the username, UID, and GID columns respectively.

Conditional Statements

if

It is used for running a single yes or no task like:

  • Verifying a service is running

  • Checking free disk space

  • Making sure a variable isn't empty

if condition; then
    commands
elif another_condition; then
    commands
else
    commands
fi

# to check for a file
location="/var/log/auth.log"

if [[ -f $location ]]; then
    echo "$location exists"
elif [[ -d $location ]]; then
    echo "$location is a directory"
else
    echo "$location does not exist"
fi

Options include:

  • -f for a file

  • -d for a directory

  • -z for checking whether a string is empty

  • -eq numeric equal

  • -ne numeric not equal

  • -lt numeric less than

  • -gt numeric greater than

  • = string equal

  • != string not equal

case

A case statement is used when a variable can take several acceptable values, or answers, and a different action is needed for each.

case expression in
    pattern1)
        commands ;;
    pattern2|pattern3)
        commands ;;
    *)
        commands ;;   # default case
esac
echo "Select an option: start | stop | restart"
read action

case $action in
    start)
        echo "Starting service..." ;;
    stop)
        echo "Stopping service..." ;;
    restart|reload)
        echo "Restarting service..." ;;
    *)
        echo "Unknown option: $action" ;;
esac

$1 is a positional parameter. It means it automatically holds the first command-line argument that was supplied when the script was launched.

Looping Statements

Loops allow a program to repeat actions automatically without rewriting the same instructions repeatedly.

for

A for loop repeats a task a specific number of times or for each item in a list.

ex:

for fruit in orange apple banana
  do
    echo "fruit: $fruit"
  done

while

while loop continues running as long as a condition remains true. A while loop is great when you do not know how many times something should repeat.

counter=1
while [ $counter -le 5 ]
  do
    echo "count is $counter"
    ((counter++))
  done

until

until runs until a condition becomes true.

counter=1
until [ $counter -ge 5 ]
  do
    echo "count is $counter"
    ((counter++))
  done

Interpreter Directive

An interpreter directive is a special line at the very top of the file that tells the system which program should be used to interpret the commands that follow.

It starts with #!, called a shebang, followed by the path of the interpreter, like /bin/bash.

For bash scripts, we typically use #!/bin/bash

ex:

hello.sh

#!/bin/bash

echo "hello world"

Numerical Comparisons

  • -eq equal to

  • -ne not equal to

  • -lt less than

  • -le less than or equal to

  • -gt greater than

  • -ge greater than or equal to

They are always used in [] when making comparisons.

result=8
if [ "$result" -lt 5 ]; then
    echo "Less than 5"
elif [ "$result" -eq 5 ]; then
    echo "Equal to 5"
else
    echo "Greater than 5"
fi

Redirection String Operators

> redirection operator

> redirects output to a file. It creates the file automatically if it does not exist, or overwrites its content if it does.

echo "Operation completed with code 0" > result.txt

< redirection operator

< takes input from a file.

read value < input.txt

Comparison String Operators

String comparison operators check whether two pieces of text are the same, different, match a pattern, or follow a certain alphabetical order.

  • == and = for comparing if two strings are equal
  • != for checking if two strings are not equal
  • =~ for matching patterns using regular expressions
  • < and > for comparing string alphabetical order (bash has no <= or >= operators for strings)

==, =, and =~

== is typically used inside double square brackets ([[ ]]) and is read as is equal to.

= is used inside single square brackets ([]) and is read simply as equals.

=~ is used for more advanced comparison.

#!/bin/bash

text="Hello"

if [[ $text == "Hello" ]]; then
  echo "Text is exactly Hello"
fi

if [[ $text =~ ^H ]]; then
  echo "The text starts with H"
fi

!=

#!/bin/bash

result="completed"

if [ "$result" != "completed" ]; then
  echo "The task did not complete"
fi

< and >

This is a lexicographical comparison: bash checks which string would come first or last in alphabetical order. Note that bash has no <= or >= string operators, so < and > are used inside [[ ]].

< is read as comes before (is less than)

> is read as comes after (is greater than)

#!/bin/bash

fruit="papaya"

if [[ $fruit > "mango" ]]; then
  echo "$fruit comes after mango"
fi

if [[ $fruit < "melon" ]]; then
  echo "$fruit comes before melon"
fi

Regular Expressions

A regex is a special sequence of characters that defines a search pattern.

Bash uses =~ inside [[ ]] to match patterns with regular expressions ([[ $variable =~ pattern ]]).

#!/bin/bash

data="234567"

if [[ $data =~ ^[0-9]+$ ]]; then
  echo "The data contains only numbers"
fi

Test Operators

Test operators are special symbols used to evaluate things like file existence, string content, and logical conditions. They return either true or false.

-d and -f

-d and -f are operators used in scripts to check whether something exists on the filesystem and whether it is a directory or a regular file.

#!/bin/bash

if [ -d "project" ]; then
  echo "the project folder is a directory"
fi

if [ -f "app.conf" ]; then
  echo "app.conf is a file"
fi

-n and -z

-n and -z are string test operators. They help check whether a string has a value or is empty, which is especially useful when dealing with user input.

#!/bin/bash

input=""

if [ -z "$input" ]; then 
  echo "The input is empty"
fi

input="hello"

if [ -n "$input" ]; then
  echo "the input is not empty"
fi

!

! is the logical negation operator.

#!/bin/bash

if [ ! -f "config.txt" ]; then
  echo "config file does not exist"
fi

Variables

Variables are used to store and work with information like text, numbers, or user input.

Positional Arguments

Positional arguments are values passed to a script when running it, allowing the script to respond to user input. The first argument is $1, the second is $2 and so on.

#!/bin/bash

if [ "$1" -gt 5 ]; then
  echo "The number is greater than 5"
else
  echo "The number is less than or equal to 5"
fi
# then run the script
./script.sh 10

Environment Variable

Environment variables are built-in variables provided by the system or user that store important information.

Built-in variables:

  • $USER: username
  • $HOME: home directory
  • $SHELL: current shell
#!/bin/bash

if [ $USER = 'root' ]; then
  echo "You are logged in as the root user"
else
  echo "You are logged as regular user $USER"
fi

Alias and Command Management

alias

The alias command creates shortcuts for longer commands. The generic syntax is alias name='command'.

Aliases set in the terminal are only temporary and only last for that session.

# create a shortcut called ckdsk
alias ckdsk='df -h'

unalias

The unalias command removes a shortcut that was previously created.

# to remove a previously created alias
unalias ckdsk

set

set modifies the behavior of the shell.

#!/bin/bash
# to stop script from running if any command inside it fails
set -e

echo "running system update..."

sudo dnf update

echo "update completed"

Other options with set (combined in the sketch after this list):

  • -x prints each command before it is executed
  • -u exits script when attempting to use an undefined variable
  • -o pipefail makes a pipeline fail if any command in the pipeline fails
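
A common combination at the top of scripts, sometimes called strict mode:

#!/bin/bash
# exit on error, on undefined variables, and on any failure inside a pipeline
set -euo pipefail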

Variable Management

  • export allows a variable to be passed to child processes
  • local restricts a variable's scope to within a function
  • unset deletes a variable

export

export is used to make a variable available to child processes, such as a subshell or another script that is launched from the current shell. The syntax is export VARIABLE=value

export LOG_LEVEL=debug

./myscript.sh # runs in a separate shell process but still has access to LOG_LEVEL because of 'export'

local

The local command is used to restrict a variable to within a function. The syntax is local VARIABLE=value

unset

unset is used to remove a variable. The syntax is unset VARIABLE

log_file="log.txt"

echo "processing file"

unset log_file

Return Codes

A return code or exit status is a number a command or program leaves behind when it finishes, indicating success or failure.

$? is used to see the exit code of the last command (see the sketch after this list).

  • 0 means success
  • Non-zero means error
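
A quick demonstration:

ls /etc/passwd
echo $?          # 0: the command succeeded

ls /nonexistent
echo $?          # non-zero: the command failed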

Linux: Automation and Orchestration

Ansible IaC Core Concepts

Ansible lets users automate system configuration and management using clear, repeatable commands. Ansible is agentless. It uses SSH on Linux and WinRM on Windows.

Installing Ansible

# on RHEL-based system
dnf install -y ansible-core

# on Debian-based system
apt install -y ansible

# to test install
ansible --version
ansible localhost -m ping

Inventory

The inventory is a list of all the servers or devices. It can be stored as a simple text file using the INI format, as a structured YAML file, or built dynamically from cloud platforms or CMDBs.

Groups like [web] make servers easy to manage.

Example inventory file:

# ./hosts
[local]
localhost ansible_connection=local

# to install htop on a RHEL-based system
ansible -i hosts local -m dnf -a "name=htop state=present update_cache=yes" --become

# to create a new user
ansible -i hosts local -m user -a "name=bob state=present" --become

# to copy a file from the control node to the managed nodes
ansible -i hosts local -m copy -a "src=my_config.conf dest=/apps/myapp.conf" --become

Ad Hoc Mode

Ad Hoc Mode is used to run one-time commands to test settings or apply changes across systems.

ex:

# to ping all hosts listed in the inventory
ansible all -m ping

# to restart the nginx service in the web group
ansible web -m service -a "name=nginx state=restarted"

Module

A module is a built-in tool that handles specific tasks like installing software, restarting services, or managing users.

ex of modules:

  • yum: used to install, update, remove packages on RHEL-based systems
  • apt: used to install, update, remove packages on Debian-based systems
  • user: used to manage user accounts on the system
  • service: used to start, stop, restart, or enable services
  • copy: used to transfer files from the control node to remote machines
  • file: used to create directories, change permissions, or delete files

Playbook

Playbooks handle complex, repeatable, structured automation. A playbook is a structured YAML file that defines a set of tasks for Ansible to carry out on managed systems.
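
A minimal playbook sketch; the group name, file name, and package are made up:

# site.yml
- name: Ensure nginx is installed and running
  hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present
    - name: Start and enable nginx
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

# run it with: ansible-playbook -i hosts site.yml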

Facts

Facts allow Ansible to automatically gather information about each machine and make decisions based on that data. Collected data can include IP addresses, operating system, available memory, and disk space. Facts are gathered at the beginning of playbook execution so Ansible can decide what action to take based on the current setup of the machine. Ansible collects facts only when users run a task, using a direct connection like SSH.

Collections

Collections help manage and reuse tools, making it easier to scale and maintain an automation environment over time.

Puppet Core Usage

Puppet helps automate system configuration by letting admins describe what the system should look like. Puppet is agent-based.

The Puppet Agent is responsible for communicating with the Puppet server and applying configurations. It is also responsible for collecting facts.

The Puppet server is called Puppet Master.

Facts

Facts are information Puppet collects on the managed devices such as operating system, hostname, IP addresses, memory, and more. The Puppet Agent collects facts on a regular schedule during each check-in with the Puppet server. Puppet is well suited for large-scale enterprise environments because it enforces regular automated configuration.

Classes

Classes group related configuration tasks together into one logical unit. They help apply consistent settings to many systems with minimal duplication of effort.
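
ex of a minimal class sketch written in Puppet's own language (the class and resource names are just examples):

# make sure the ntp package is installed and its service is running
class ntp {
  package { 'ntp':
    ensure => installed,
  }

  service { 'ntp':
    ensure => running,
    enable => true,
  }
}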

Modules

A module is a package that includes everything needed to manage a specific task or part of a system. It can include one or many classes, files, templates, or custom facts.

Certificates

Certificates ensure that only authorized machines are allowed to talk to the server and receive configurations. The certificates must be approved and signed by the server before configurations are exchanged.
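
On recent Puppet versions, pending requests are listed and signed on the server; a sketch (the certname is just an example):

# to list pending certificate requests
puppetserver ca list

# to sign the request from a specific node
puppetserver ca sign --certname node01.example.com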

OpenTofu Core Usage

OpenTofu is an open-source tool designed to help manage and automate cloud infrastructure with code.

Provider

A provider connects the configuration code to the actual cloud platform or service the user is trying to manage. OpenTofu talks to services like AWS, Azure, and GCP using APIs.
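
ex of a minimal AWS provider sketch in HCL (OpenTofu keeps the terraform block name for compatibility; the region is just an example):

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}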

Resources

Resources are the specific pieces of infrastructure a user wants to create or manage, such as a virtual machine, a firewall rule, or a storage bucket. OpenTofu resources focus on provisioning and configuring cloud services from the ground up.
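
ex of a minimal resource sketch (the AMI id is hypothetical):

# a single EC2 virtual machine
resource "aws_instance" "web" {
  ami           = "ami-0abc1234example"  # hypothetical AMI id
  instance_type = "t2.micro"
}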

State

The state is how OpenTofu keeps track of what has already been created in the environment.
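
The usual workflow reads and updates the state automatically (by default it lives in a local terraform.tfstate file):

tofu init    # download the providers the configuration needs
tofu plan    # compare the configuration against the recorded state
tofu apply   # create or update resources and record them in the state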

Unattended Deployment

It is the automation of installation and initial configuration of systems to avoid manual step-by-step administration.

Kickstart

Kickstart is commonly used in traditional data center environments with RHEL-based systems. You automate the RHEL-based installation by specifying things like language, disk setup, network settings, and package selection in a configuration file. The general syntax to start a kickstart install from a boot prompt is linux ks=<LOCATION OF KICKSTART FILE> inst.repo=<INSTALLATION SOURCE>

# to start a kickstart install
linux ks=http://192.168.10.10/kickstart/ks.cfg inst.repo=http://192.168.10.10/rhel8
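
The kickstart file itself is a plain-text list of installation directives; a minimal sketch (all values are just examples):

# ks.cfg
lang en_US.UTF-8
keyboard us
timezone America/New_York
rootpw --plaintext changeme
autopart
reboot

%packages
@core
%end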

Cloud-init

Cloud-init is the standard for automating deployments in cloud platforms like AWS, Azure, or OpenStack. It reads a YAML configuration file and applies the changes during the first boot of a cloud instance.

ex:

# to create an install and configure using cloud-init
aws ec2 run-instances --image-id ami-0adfads185141422356 --instance-type t2.micro --user-data file://init-script.yaml
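
The user-data file (init-script.yaml above) is a cloud-config document; a minimal sketch (the user and package names are just examples):

#cloud-config
package_update: true
packages:
  - nginx

users:
  - name: bob
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL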

CI/CD Concepts

CI/CD is a system of tools and practices that brings order and automation to modern software development.

Version Control

Version control is a system that tracks changes to files over time, allowing developers to collaborate, review history, and roll back if something goes wrong. Git is the most common version control tool used today.

Pipelines

A pipeline is a sequence of automated steps that take code from commit to deployment. It might include testing, security scanning, building the software, and deploying to production.
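
ex of a minimal pipeline sketch in GitLab CI syntax (the stage names and scripts are hypothetical):

# .gitlab-ci.yml
stages:
  - test
  - build
  - deploy

test:
  stage: test
  script:
    - ./run_tests.sh

build:
  stage: build
  script:
    - ./build.sh

deploy:
  stage: deploy
  script:
    - ./deploy.sh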

Modern CI/CD Approaches

Shift Left Testing

Shift left testing moves testing earlier in the development cycle, right alongside coding. The common tools used are Jenkins and GitLab CI.

DevSecOps

DevSecOps = Development, Security, and Operations. It is an approach that builds on CI/CD by embedding security practices throughout the software lifecycle.

GitOps

GitOps is a way of managing infrastructure and deployments using Git as the single source of truth. Common tools used are Argo CD and Flux.

Kubernetes Core Workloads for Deployment Orchestration

Kubernetes is an open-source platform that automates the deployment, scaling and management of containerized applications.

Pods

Pods are where the applications run. They allow users to tightly couple containers that need to work together. Containers that run in the same pod can talk to each other as if they were running on the same machine.

Deployments

Deployments make sure the right number of Pods are up and kept up to date. A Deployment acts like a controller: it keeps track of the application and replaces Pods that fail or fall out of date.
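
ex of a minimal Deployment sketch that keeps 3 nginx Pods running (the names are just examples):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80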

Services

Services ensure the application is reachable by other apps or users. They provide a stable endpoint so other applications or users can reliably connect to the app regardless of which Pod is currently running.
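
ex of a minimal Service sketch exposing the Pods from the Deployment sketch above:

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80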

Kubernetes Configuration

Variables

Variables are the simplest way to pass configuration settings into the containers.

ex:

# to tell the pod to use production settings
ENVIRONMENT=production
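
In a manifest, the variable is set in the container spec; a minimal Pod sketch (the image name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: my-web-app:latest
      env:
        - name: ENVIRONMENT
          value: "production"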

ConfigMaps

ConfigMaps store larger sets of configuration data in a Kubernetes object.
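
ex of a minimal ConfigMap sketch (the name and keys are just examples):

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  ENVIRONMENT: production
  LOG_LEVEL: debug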

Secrets

Secrets work similarly to ConfigMaps, but are specifically designed to store sensitive data such as passwords, API tokens, SSH keys, or SSL/TLS certificates. By default, secrets are encoded in base64. Kubernetes uses RBAC to control access to these secrets.
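
A secret can be created directly from the command line, for example (the name and value are just examples):

# to create a secret from a literal value
kubectl create secret generic db-password --from-literal=password=changeme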

Volumes

Volumes provide a way for containers to store and access data that needs to persist beyond the life of a single container.
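
ex of a minimal PersistentVolumeClaim sketch requesting 1 GiB of storage (the name is just an example):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi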

Docker Swarm Core Workloads for Deployment Orchestration

Docker Swarm is a tool that helps orchestrate container deployments, making sure everything runs reliably.

Nodes

A node is a physical or virtual machine that is part of the swarm cluster. A node runs the Docker engine and is classified as either a Manager Node, which makes decisions and assigns tasks, or a Worker Node, which carries out the tasks.
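
A swarm is bootstrapped on the manager, and workers join with the token it prints (the IP address is just an example):

# on the manager node
docker swarm init

# on each worker node, paste the join command printed by 'swarm init'
docker swarm join --token <TOKEN> 192.168.10.10:2377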

Tasks

A task is the actual instance of a container running on a node. Each task maps to exactly one container, and Swarm monitors them all continuously. Pods can host multiple containers, while a swarm task maps one-to-one. Tasks help ensure that the application stays running as expected.

Service

A service is a top-level object in Docker Swarm that defines how the application runs.
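
ex (the service name and ports are just examples):

# to run a service of 3 nginx replicas, published on port 8080
docker service create --name web --replicas 3 -p 8080:80 nginx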

Docker Swarm Configuration

Networks

Networks define how containers communicate within the swarm.

Overlay Networks

Overlay Networks are virtual networks that span across all nodes in the swarm. They enable secure, seamless communication between containers on different nodes.

ex docker-compose.yaml

# define an overlay network called frontend
networks:
  frontend:
    driver: overlay

Scaling

Scaling refers to how many replicas of a service are running at any given time.

ex docker-compose.yaml:

services:
  web:
    image: nginx
    deploy:
      replicas: 3
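
A running service can also be rescaled on the fly:

# to change the number of replicas of the web service
docker service scale web=5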

Docker/Podman Compose for Deployment

Compose file

docker-compose.yaml

version: "3.8"

services:
  web:
    image: nginx
    ports:
      - "8080:80"

  app:
    image: my-web-app:latest
    environment:
      - ENV=production
    depends_on:
      - db

  db:
    image: postgres
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data:

Podman Compose offers the same workflow on top of Podman, a more security-focused container engine that runs daemonless and supports rootless containers.

up and down commands

# to start or bring down containers with Docker Compose
docker-compose up
docker-compose down

# the same with Podman Compose
podman-compose up
podman-compose down

Viewing Logs

Viewing logs is essential for understanding what's happening in the application.

# to view logs
docker-compose logs
podman-compose logs
# to tail logs
docker-compose logs --follow web