
Ansible: More about the Inventory File

Ansible's default inventory is located at /etc/ansible/hosts, but we can keep it elsewhere, for example at /home/me/ansible/hosts.ini, and then point Ansible to it using the -i flag.

ansible web -m ping -i ./hosts.ini

Or I can just configure ansible.cfg to point to the path of my inventory file.

[defaults]
inventory = /etc/ansible/inventory/hosts.ini

Hosts can be organized into groups inside the inventory file. A group name must be unique and follow the rules for a valid variable name.

Here is an example with two groups, web and db:

[web]
192.168.10.15
192.168.10.16

[db]
192.168.12.15
192.168.12.16
192.168.12.17

Here is the same inventory in YAML format

web:
  hosts:
    192.168.10.15:
    192.168.10.16:
db:
  hosts:
    192.168.12.15:
    192.168.12.16:
    192.168.12.17:

Ansible automatically creates the all and ungrouped groups behind the scenes. The all group contains all hosts, and the ungrouped group contains all hosts that are not in any group.

So, ansible -m ping all will ping all hosts listed in the inventory file, and ansible -m ping ungrouped will ping all hosts not listed in any group.

Do more in your inventory

  • A host can be part of multiple groups

  • Groups can also be grouped

In YAML:

prod:
  children:
    web:
    db:
test:
  children:
    web_test:

In INI:

[prod:children]
web
db

[test:children]
web_test
  • Add a range of hosts

In INI:

[servers]
192.168.11.[15:35]

In YAML:

servers:
  hosts:
    192.168.11.[15:35]:
  • Add variables to hosts or groups

[prod]
192.168.10.15:4422
prod1 ansible_port=4422 ansible_host=192.168.10.22
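
Variables can also apply to a whole group. A minimal sketch; the group and the values are made up:

[prod:vars]
ansible_user=deploy
ansible_port=4422

Every host in prod then inherits these values unless a host-level variable overrides them.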

You can do way more than what I have listed above. I am not going to bore you with everything about the Ansible inventory here because I don't need more than this at this stage of my learning. But if you feel like learning more about this topic, check the official Ansible inventory documentation.

Goodbye for now

Ansible: Initial Setup

In my previous post, I went quickly through the Ansible installation and initial setup. I did not really set anything up; I just showed you where to find the things that Ansible provides by default.

In this post I will go deeper into the setup process. But I am still not going to try to impress you here. Let's keep that for future posts.

Ansible Control Node

The Ansible config file is located at /etc/ansible/ansible.cfg by default. We are going to use this file later to customize our installation of Ansible.

If you have just a few nodes, you can SSH into each one of them to make sure you can connect correctly. That also means that if you have just a few nodes, Ansible might not be the right tool.

Use ssh-copy-id -i key.pub node-user@192.168.10.10 to add the controller's SSH key to the node's authorized keys so the controller can connect to it.
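
A minimal end-to-end sketch; the key type, user, and IP address are placeholders:

# generate a key pair on the control node (skip if you already have one)
ssh-keygen -t ed25519

# copy the public key to each managed node
ssh-copy-id -i ~/.ssh/id_ed25519.pub node-user@192.168.10.10

# confirm password-less login works
ssh node-user@192.168.10.10 hostname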

Ansible Inventory

The inventory contains the nodes you want ansible to manage. The default inventory file is located at /etc/ansible/hosts. The nodes are put into groups for ease of management. The group names must be unique and they are case sensitive. The inventory file contains the IP addresses or FQDN of the managed hosts.

If we want to use the default inventory file we can just run:

# to ping all nodes in the web group
ansible -m ping web

But if we are working with a dedicated inventory file, like my_nodes.ini, we should tell Ansible that we are providing an inventory file by adding -i [INVENTORY FILE]. For example, ansible web -i my_nodes.ini -m ping

The inventory in the ini format looks like:

[web]
192.168.12.13
192.168.12.14

[db]
192.168.13.13
192.168.13.15

But the inventory file can also be written in the YAML format:

my_nodes:
  hosts:
    node_01:
      ansible_host: 192.168.10.12
    node_02:
      ansible_host: 192.168.10.13

[web] is a group name. It is unique across the inventory file. We can have multiple groups in an inventory file.

To run an Ansible command on multiple groups, separate the group names with colons. For example:

ansible web:db -m ping -i my_nodes.ini --ask-pass

This command will ping the nodes in the web and db groups. --ask-pass prompts for a password if the SSH daemon on the managed nodes asks for the user's password.

If our command requires input to function, maybe we are doing it the wrong way. Ansible is supposed to facilitate automation: a command should be able to run to completion without additional user input. In my initial Ansible setup, I provided input twice when running the ping command: the first was the host key verification, the second was the node password because the SSH keys were not set up properly. We are going to fix this in the next posts.

How to Manage Nodes with Ansible

Until now we have only learned how to ping our nodes using the Ansible ping module. ansible web -m ping is how we tell Ansible to use the ping module against the web group.

Key Points to Remember

  • Ansible is used to automate repetitive tasks we perform on network devices

  • Ansible inventory contains grouped list of nodes we want to manage

  • The inventory can be written in the ini or YAML format

  • Ansible comes with prebuilt modules like ping to facilitate node management.

In my next posts, I will go deeper into each important part of Ansible, such as the inventory and playbooks.

So, read me soon.

Ansible: Installation and Initial Setup

What is Ansible?

Let's cut to the chase. Ansible is a tool for system and network admins to automate repetitive tasks, for example installing and configuring multiple servers, or configuring routers, switches, firewalls, and WAPs at once. Ansible can talk to any device that speaks SSH. Other connection types are supported, but SSH is the default. Visit the Ansible documentation page to learn more.

This is not going to be a step-by-step tutorial on how to use Ansible, nor an in-depth overview of Ansible. A lot of important basic topics will be missing from this post, but they might appear in future posts. So if something is missing here, you can always look at the other posts in the same category. If something I say does not feel right, you can reach out to me with questions or suggestions via LinkedIn or email.

How to Install Ansible on Linux

Ansible is agentless. That means you do not need to install Ansible on the managed nodes for Ansible to push tasks to them. Only the control node needs Ansible installed. But you will need Python and SSH installed and configured on the managed nodes.

There are multiple ways to install Ansible on your Linux workstation, but I will be using the Linux package manager.

How to locate python?

which python3 

# /usr/bin/python3

How to locate SSH?

sudo systemctl status sshd

The SSH daemon must be enabled, active, and running.
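
If it is not running yet, a quick way to bring it up on a systemd-based distro:

# start the SSH daemon now and enable it at boot
sudo systemctl enable --now sshd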

Update your system:

sudo dnf update -y

Install Ansible using the package manager

sudo dnf install ansible -y

Ansible keeps its main configuration files in /etc/ansible. There you should find the files ansible.cfg and hosts.

Run ansible --version to get details about your Ansible installation. The command will also tell you the location of your Ansible default configuration file.

If you install Ansible using the Linux package manager, the config file should be generated and ready to use. If the ansible.cfg file is missing from your installation, you can create it at /etc/ansible/ansible.cfg. There are many ways to set up the Ansible configuration file, but I am going to stick with the one generated by default during the installation.

Ansible Inventory

The Ansible inventory contains the list of hosts you want to manage. By default the hosts file contains the list of nodes, but you can point Ansible to a different inventory file inside ansible.cfg. In the hosts file, you can put the nodes into groups like:

[web]
172.16.10.10
172.16.10.12

[db]
172.16.20.22

If you are able to run ansible --version without issue and locate the Ansible installation folder (/etc/ansible), you are ready to do awesome things with Ansible. In the next posts, we are going to go deeper into the basics of Ansible.

So, stay tuned.

Linux: Troubleshooting Performance Issues

CPU Issues

High CPU usage

  • What process is using the CPU?
  • Use top or htop to see what process is using the CPU
  • Optimize the code or limit the number of processes running

High load average

This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.

  • Check output of uptime or top
  • Is the system overloaded?
  • Are there too many processes running simultaneously?
  • Or is it one process that is causing the backlog?
  • Which query should be optimized if it is a DB process?
  • Maybe offload some tasks to another server
  • Maybe swap the CPU for one with better specs

High context switching

A context switch is when the CPU switches between different processes to allocate resources. A context refers to the state of a running process (Running, Waiting, or Stopped) that allows the CPU to resume the process later.

Too many context switches lead to inefficiency and higher CPU usage.

  • Check context switching with vmstat or pidstat (see the sketch after this list). Check the number of context switches per second.
  • How many context switches per second do we have?
  • More than 500-1000 context switches per second is considered high and may indicate that the system is overwhelmed
  • Maybe reduce the number of running processes
  • Optimize applications to use fewer threads
  • Adjust system limits
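
A minimal way to collect these numbers; the sampling intervals are arbitrary:

# system-wide stats every second, 5 samples; the "cs" column is context switches per second
vmstat 1 5

# per-process voluntary/involuntary context switches, refreshed every second
pidstat -w 1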

CPU bottleneck

This issue occurs when the CPU is the limiting factor in system performance.

  • The CPU usage is consistently high (above 80%)
  • Are tasks taking too long to process?
  • Load average exceeds the number of available CPU cores
  • Use top and htop to identify processes that are using an inordinate amount of CPU time
  • Optimize the processes using the CPU
  • Maybe a hardware upgrade is needed

Memory Issues

Swapping Issues

This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory

  • Is the system running out of swap space?
  • The swap file is typically located at /swapfile or on a dedicated swap partition
  • Identify swap space with swapon -s
  • Has the system performance degraded?
  • What do top and htop say about swap usage?
  • Is the usage more than 10%? That might be high
  • Monitor swap usage with free -h or vmstat
  • Maybe more physical RAM is needed
  • Or adjust the swappiness kernel value. ex: sudo sysctl vm.swappiness=10

Out of Memory (OOM) Errors

This issue happens when the system runs out of both physical and virtual memory

  • Are critical processes being unexpectedly terminated?
  • Check system logs for OOM messages
  • Adjust application configurations to optimize memory usage
  • Increase RAM
  • Adjust swap space

Disk I/O Issues

This issue occurs when the system slows down due to delays in reading from or writing to a storage device.

High input/output wait time

This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, causing delays in executing processes.

  • Monitor disk load with iotop or dstat (see the sketch after this list). top shows I/O wait as x.x wa; iostat gives %iowait
  • Optimize or spread out disk operations
  • Improve performance or upgrade to SSDs if needed
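
A quick check, assuming the sysstat package is installed:

# extended per-device stats every 2 seconds; watch %iowait and %util
iostat -x 2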

High disk latency

Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or inefficient disk configuration.

  • What does iostat output say about the disk latency?
  • Is the latency above the normal 10ms? Higher than 20ms may indicate a latency issue
  • Is the disk operating at its maximum throughput?
  • Maybe some drivers need to be upgraded
  • Maybe adjusting the disk (RAID) config can help reduce the latency
  • Maybe upgrade to faster disks

Slow remote storage response

This issue occurs when accessing data from remote storage systems such as NFS, SAN, or cloud storage

  • Long response time when accessing files?
  • Check network performance using ping or netperf to check for network issues
  • Optimize network settings or upgrade network hardware

Network Stability Issues

Packet drops

This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues

  • ping -c 100 <DESTINATION> will give the percentage of packets dropped. Over 1% packet loss may be unacceptable
  • Packet loss can lead to slow performance, timeouts, and application errors
  • Check routers and switches for hardware issues and faulty cables
  • Check for NIC errors using ifconfig and ethtool
  • Adjust QoS settings or increase bandwidth

Random disconnects

This issue occurs when a network connection is unexpectedly terminated

  • Are users or services suddenly losing network access?
  • Look for connection reset or connection closed messages in logs using dmesg or ifconfig
  • Check network stack configuration
  • Maybe the cable or some hardware in the path is faulty
  • Maybe a firewall is closing the connection
  • Check and adjust TCP settings if necessary

Random timeouts

This issue occurs when a connection fails to receive a response within the expected time frame.

  • Errors can be seen in logs
  • Getting connection timed out error when using curl to connect to a service?
  • Maybe the network is congested
  • Maybe there is a DNS issue
  • Maybe the server is overloaded
  • A timeout threshold of 5-10 seconds is typically acceptable
  • Use ping or traceroute to check for network congestion
  • Make sure DNS servers are correctly configured
  • Check server performance
  • Adjust TCP timeout if necessary

Network Performance Issues

High latency

High latency refers to the delay in the time it takes for data to travel from one point to another

  • Measure latency using ping or traceroute
  • A 100ms latency in a local network is considered high, and 300ms over remote communication is also high
  • Check for network congestion
  • Identify hardware issues
  • Maybe the routing is misconfigured; optimize the network path
  • Maybe upgrading the network infrastructure could help

Jitter

Jitter is the variation of latency over time, which can cause problems in real-time applications.

  • A value above 30ms of fluctuation can cause noticeable issues
  • Detect Jitter issues with ping -i 0.2 <DESTINATION>
  • Check for network congestion or hardware issues
  • Implement QoS to prioritize relevant traffic

Slow response time

This issue occurs when the network takes too long to respond to requests.

This could be due to:

  • High latency
  • Congestion
  • Overloaded servers
  • Misconfigured applications

  • Use curl or wget to measure response time and identify bottlenecks in the network or a server

  • Check server load
  • Optimize application code
  • Check for server resources
  • Review network configurations

Low throughput

This issue occurs when the network is unable to transmit data at a high enough rate

  • Identify low throughput using iperf; anything below 80% of the expected bandwidth is considered low throughput (see the sketch after this list)
  • Check for network congestion
  • Check for faulty cables
  • Check for incorrect settings
  • Maybe switch to a high bandwidth network
  • Maybe reduce unnecessary traffic
  • Optimize network routes
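
A minimal iperf3 run; the server IP and duration are placeholders:

# on the server side
iperf3 -s

# on the client side: a 10-second throughput test against the server
iperf3 -c 192.168.10.50 -t 10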

System Responsiveness Issues

Slow application response

This issue occurs when an application takes longer than expected to react to user inputs

  • Use top and htop to identify application resource consumption
  • Maybe the application code needs to be optimized
  • Increase system resources
  • Check disk I/O for overload
  • Check for unnecessary background processes

Sluggish terminal behavior

It happens when commands in the terminal are delayed and the system takes an unusually long time to execute them.

  • Use top or iotop to check for system resource usage
  • Optimize processes running on the system
  • Cleanup system resources
  • Add more RAM or CPU cores

Slow startup

The system is taking an unusually long time to boot up

  • See which services take longer than expected using systemd-analyze (see the sketch after this list)
  • Maybe too many services are configured to start at the same time
  • Maybe one of the startup services is misconfigured?
  • Delay or disable non essential services from starting at boot time using systemctl
  • Optimize the boot sequence
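
A quick sketch; the service name is a placeholder:

# overall boot time
systemd-analyze

# services ordered by startup time
systemd-analyze blame

# keep a non-essential service from starting at boot
sudo systemctl disable myservice.service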

System unresponsiveness

This issue occurs when the system becomes completely unresponsive

  • Is the system not accepting new input?
  • Applications are no longer responding?
  • Use dmesg or journalctl to identify what caused a kernel panic
  • Identify runaway processes using top and htop
  • Maybe upgrade the RAM or add more CPU cores

Process Management Issues

Blocked processes

This issue occurs when a process is unable to proceed due to waiting on resources or system locks

  • Are commands or applications stuck?
  • ps and top show processes in the D (uninterruptible sleep) state
  • Use lsof to check which files a process is waiting on (see the sketch after this list)
  • Use strace to trace system calls and signals
  • Is a process repeatedly stuck or blocked? Maybe due to resource contention
  • Optimize disk I/O
  • Maybe add more memory
  • Investigate dependency issues between processes
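
A minimal sketch; the PID is a placeholder:

# list the files a process has open
lsof -p 1234

# attach to the process and watch its system calls to see where it is stuck
strace -p 1234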

Exceeding baselines

This happens when processes consume more resources than expected

  • Notice high CPU usage
  • Unusual memory consumption
  • Excessive disk activity
  • Use top, htop, or pidstat to identify this issue
  • Optimize application resource usage
  • Maybe configure system resource limits with ulimit

High failed log-in attempts

This issue often signals attempted unauthorized access or brute force attacks

  • Maybe a brute force attack?
  • Unauthorized access attempts?
  • System compromise?
  • What are the logs in /var/log/auth.log saying? (see the sketch after this list)
  • Check journalctl to identify login attempts
  • 5-10 login attempts from a single IP address within a short time may be a red flag for a brute-force attack
  • Implement fail2ban to block abusive IPs
  • Enforce strong password policy
  • Use MFA
  • Limit access with firewall or IP allowlist
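
A quick way to count failed SSH logins per source IP on a Debian-style system; the log path and field position assume the default OpenSSH log format:

# top offending IPs by failed password attempts
grep 'Failed password' /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head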

Linux: Troubleshooting Security Issues

SELinux Issues

SELinux policy issues

SELinux Policy defines what actions users and applications can perform on a system based on security rules.

An overly restrictive or misconfigured policy can prevent the system from working properly.

avc: denied is the typical error message found in logs when dealing with SELinux policy issues.

  • Review logs with ausearch or sealert (see the sketch after this list)
  • Modify rules if necessary
  • Test policy in a safe environment before applying
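
A minimal check, assuming auditd is running:

# show recent AVC denials from the audit log
sudo ausearch -m avc -ts recent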

SELinux context issues

SELinux uses contexts to label every file, process, and resource on the system, determining what access is allowed.

An incorrect or misconfigured label can prevent applications from accessing the resources they need to function

  • Use ls -Z for files and ps -Z for processes to look for SELinux context issues
  • Does the file or process have incorrect context?
  • Restore the context with sudo restorecon -v <FILE PATH>
  • Running restorecon regularly on key directories helps avoid repeated context mislabeling issues

SELinux boolean issues

SELinux booleans allow adjusting certain security settings without modifying the underlying policy.

An incorrectly set boolean can cause certain services or applications to malfunction

  • Check booleans with getsebool
  • Are certain booleans incorrectly set?
  • Toggle booleans with setsebool. ex: setsebool -P httpd_can_sendmail 1
  • Test modification and document changes

File and Directory Permission Issues

File attributes

File attributes control certain behaviors and restrictions on files and directories, which go beyond the regular rwx permissions.

  • Check file attributes with lsattr. i=immutable, a=append-only
  • Remove incorrect attribute with chattr. ex: chattr -i <FILE PATH>
  • Verify file access and document changes

Access Control Lists (ACLs)

ACLs provide more fine-grained control over who can access a file or directory and what actions can be performed.

  • Check if a file is using ACLs with getfacl
  • Adjust the ACLs with setfacl. ex: give read-only access to user tom setfacl -m u:tom:r <FILE PATH>
  • Verify proper access and document changes

Access Issues

Account access issues

Most common issue

  • Are the credentials incorrect?
  • Maybe the account is locked or disabled
  • Check system logs for messages
  • Check if account is locked with sudo passwd -S tom
  • Unlock account with sudo passwd -u tom
  • Reset the user password with sudo passwd tom
  • Re-enable a disabled account with sudo usermod -e '' tom. '' means no account expiration date

Remote access issues

Issues with VPN or SSH

  • Is the issue caused by network issues, misconfigurations, or firewall?
  • Is the SSH service running? check with sudo systemctl status sshd
  • Enable the SSH service with sudo systemctl start sshd && sudo systemctl enable sshd
  • Check firewall with sudo ufw status or sudo iptables -L
  • Does the problem still persist? Check routing and public key validity

Certificate issues

Common messages: SSL certificate expired, SSL handshake failure

  • Is the certificate expired?
  • Maybe the certificate chains are misconfigured
  • Maybe it is a CA issue
  • Check certificate issues with openssl s_client -connect mysite.com:443 (see the sketch after this list)
  • Renew the certificate if necessary
  • Ensure the full certificate chain is correctly installed
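
A quick way to read the validity window; mysite.com is a placeholder:

# print the certificate's notBefore/notAfter dates
echo | openssl s_client -connect mysite.com:443 -servername mysite.com 2>/dev/null | openssl x509 -noout -dates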

Configuration Issues

Exposed or misconfigured services

This issue occurs when system services are either left open to the public or configured incorrectly.

  • Does the service have proper security settings? The DB should not be accessible from the internet
  • Review security logs
  • Use tools like nmap to scan open ports
  • Configure the firewall to restrict access to trusted IPs
  • Disable unused services
  • Ensure critical services are only accessible when necessary

Misconfigured package repositories

This issue prevents the system from accessing the correct software sources. It prevents software updates and installations.

  • What errors show when running sudo apt update or sudo dnf update
  • Check the repository configuration files: /etc/apt/sources.list on Debian-based systems or /etc/yum.repos.d/ on RHEL-based systems
  • Edit the repository URLs if necessary

Vulnerabilities

Vulnerabilities are weaknesses or flaws in the system that can be exploited by attackers to compromise security.

Unpatched vulnerable system

  • Do I have the latest security patches?
  • Use vulnerability scanners to detect security issues
  • Regularly apply updates with sudo apt update && sudo apt upgrade on Debian or sudo dnf update on RHEL.

The use of obsolete or insecure protocols and ciphers

  • Is the system using secure ciphers for data and communication protection?
  • Are insecure ciphers and protocols disabled in the system? SSLv3 is vulnerable to the POODLE attack; RC4 is vulnerable to bias attacks
  • Check the protocols in use in sshd_config for SSH and apache2.conf for Apache (see the sketch after this list)
  • Disable outdated protocols
  • Remove weak ciphers in the configuration files
  • Use strong ciphers like AES and protocols like TLS 1.2 or 1.3
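
Two quick SSH-side checks, assuming OpenSSH 6.8 or later:

# list the ciphers the local OpenSSH build supports
ssh -Q cipher

# dump the settings sshd is actually running with
sudo sshd -T | grep -Ei 'ciphers|macs|kexalgorithms'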

Cipher negotiation issues

This issue occurs when there is a failure in the negotiation or encryption methods between a client and a server.

Review connection logs to confirm both server and client are using strong encryption methods

Linux: Troubleshooting Networking Issues

Firewall Issues

Misconfigured Firewall

Typo in firewall rule

A simple typo in a firewall rule can block traffic.

Use firewall-cmd --list-ports to see open ports

Remove bad rule with firewall-cmd --remove-port=<PORT>/PROTOCOL --permanent, re-issue the correct command, and reload the firewall with firewall-cmd --reload.

Incorrect Rule Ordering

This happens when a DROP or REJECT rule is placed above an ACCEPT rule, causing legitimate traffic to be blocked.

Forgetting to persist firewall changes across reboot

If a rule is added without --permanent the rule disappears after reboot.
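
A minimal firewalld sketch; the port number is arbitrary:

# open a port in the running configuration only
sudo firewall-cmd --add-port=8080/tcp

# make the rule survive reboots, then reload
sudo firewall-cmd --add-port=8080/tcp --permanent
sudo firewall-cmd --reload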

Addressing Issues

DHCP issues

This issue occurs when servers or workstations fail to obtain an IP address automatically.

  • Is the DHCP service running at all?
  • Does the server have free IP addresses to allocate? Check the DHCP scope for exhaustion by reviewing logs on the DHCP server.
  • Do I need to expand the pool?
  • Force the client to request an IP again (see the sketch after this list)
  • Confirm connectivity
  • Update network documentation to reflect the change
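
A minimal client-side renewal sketch; the interface name is a placeholder:

# release the current lease, then request a new one
sudo dhclient -r eth0
sudo dhclient eth0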

IP conflicts

IP conflicts occur when two devices claim the same address, leading to intermittent connectivity or "duplicate address" warnings in syslog.

  • Common signs are random disconnect, slow network performance, or ARP conflict messages.
  • Identify all devices using the conflicting IP by checking the DHCP lease files and DNS records
  • Assign a unique address to one of the devices
  • Update any static configurations
  • Clear the ARP cache to ensure no stale entries remain (see the sketch after this list)
  • Monitor the network to confirm the conflict is gone
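
A quick sketch; the interface and IP address are placeholders:

# flush stale ARP entries
sudo ip neigh flush all

# probe the address to see whether more than one MAC answers
arping -I eth0 192.168.10.20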

Dual stack issues

This issue occurs when a server configured for both IPv4 and IPv6 fails to handle traffic properly.

  • Ping tests may fail for either IPv4 or IPv6
  • Do DNS records include both A and AAAA entries?
  • Adjust service configuration files to listen on both IPv4 and IPv6
  • Test connectivity over both protocols and ensure firewalls allow the appropriate traffic on each address family

Routing Issues

DNS issues

ping my.server.com returns unknown host

  • Confirm the DNS server in /etc/resolv.conf
  • Make changes if necessary
  • Is the DNS server reachable?
  • Test DNS resolution (see the sketch after this list)
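
A minimal resolution test; the hostname and the alternate resolver are placeholders:

# query using the system resolver
dig my.server.com

# query a specific DNS server to compare answers
dig @8.8.8.8 my.server.com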

Wrong gateway

  • Why are the packets not leaving the local network?
  • Can devices in different subnets communicate?
  • Can devices in other subnets communicate with external resources?
  • Check the default route with ip route show
  • Update the default route if necessary (see the sketch after this list)
  • Ping external resources to confirm connectivity
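
A minimal sketch; the gateway IP is a placeholder:

# show the current default route
ip route show default

# replace the default gateway
sudo ip route replace default via 192.168.10.1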

Server unreachable

When a server is unreachable, neither the hostname nor the IP address responds to ping.

  • Use ip link to check if the network interface is up and running.
  • Check switch port and VLAN settings
  • Is the firewall blocking ICMP or SSH?
  • Adjust port, VLAN, and firewall rule if necessary
  • Confirm connectivity using ping or SSH

Interface Misconfiguration

Subnet misconfiguration

This issue occurs when an interface is assigned to the wrong network or network mask. That prevents the server from communicating with other devices in the network.

  • Confirm address settings with ip addr
  • Edit the interface's configuration so the IP address and netmask align with the correct network segment
  • Apply changes with netplan apply or systemctl restart networking
  • Ping a known host on the subnet and confirm that traffic works as it should

MTU mismatch

This happens when one endpoint sends packets sized differently from what the receiving interface can handle.

  • ping -M do -s 1472 <DESTINATION> returns "Frag needed but DF set" (see the sketch after this list)
  • Check the MTU on each interface with ip link show
  • Pick a consistent MTU value, which is often 1500 for standard networks, and update the interface configuration.
  • Retry transfer or ping test to see correct connectivity
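
A minimal sketch; the destination, interface name, and MTU value are placeholders:

# probe the path with Don't Fragment set; 1472 bytes of payload + 28 bytes of headers = 1500
ping -M do -s 1472 192.168.10.50

# set a consistent MTU on the interface
sudo ip link set dev eth0 mtu 1500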

Cannot ping server

This often indicates a deeper interface misconfiguration, such as disabled interface, missing address, or firewall blocking ICMP

  • Is the interface up with a valid ip address?
  • Bring up the interface with ip link set <INTERFACE> up and assign the correct IP address
  • Is the firewall blocking ICMP? use sudo ufw status or iptables -L to ensure that ICMP is not blocked
  • Ping again to confirm connectivity

Interface bonding issues

Interface bonding combines two or more physical NICs into a single virtual interface to increase bandwidth and provide redundancy.

  • Is any interface in /proc/net/bonding/bond0 marked down even though it is plugged in?
  • Mode 0 (balance-rr), Mode 1 (active-backup), Mode 4 (802.3ad/LACP)
  • Is the bonding driver loaded?
  • Check the bonding configuration in either /etc/netplan...yaml on Ubuntu or /etc/sysconfig/network-scripts/ifcfg-bond0 on RHEL
  • Check switch setting to confirm matching valid configuration

MAC spoofing issues

This issue occurs when two NICs present the same MAC address.

  • arping <IP ADDRESS> returns multiple MAC addresses
  • Does ip neigh show frequent MAC flapping?
  • Look for duplicate MAC addresses with ip link show
  • Correct the MAC settings
  • Restart the network service to apply the changes and confirm with ip neigh show

Network interface issues

This issue occurs when devices are unable to communicate effectively due to problems with the network interface.

  • The interface is failing to establish or maintain a connection
  • Maybe a faulty cable? Is the port disconnected? Maybe the hardware is faulty
  • ip addr and ifconfig show the interface as down
  • Logs can be found with dmesg and journalctl
  • Maybe the driver is bad
  • Maybe the interface is misconfigured
  • Maybe the interface is administratively down; use ip link show <INTERFACE> to confirm and bring it up if necessary
  • Restart networking with systemctl restart network

Autonegotiation issues

This involves problems in the automatic process where devices agree on the speed and duplex settings for their connection.

Common signs are poor performance, slow speeds, connectivity dropouts.

  • Check link status with ethtool <INTERFACE>
  • Is autonegotiation enabled?
  • Maybe the hardware has issues; review system logs for related errors
  • Does the network driver have bugs and need to be updated?

Linux: Troubleshooting Hardware, Storage, and Linux OS

Troubleshooting steps:

  1. Identify the problem
  2. Establish a theory of probable cause
  3. Test the theory to confirm or refute it
  4. Establish a plan of action, implement the solution or escalate if needed, and then verify full system functionality
  5. Implement preventive measures to avoid recurrence and perform a root cause analysis

Boot Issues

Server Not Turning On

  • No power lights?
  • No fans?
  • No console output?
  • Do similar systems have the same issues?
  • Maybe the PDU is down?
  • Maybe the PSU has failed?
  • Check the power in the PDU
  • Swap in a known-good power cable
  • Plug another device into the same outlet
  • Still failing?
  • Inspect the PSU
  • Reseat connectors
  • Swap in a spare PSU
  • Verify the system powers on
  • Label cables
  • Schedule PSU health checks
  • Perform a root cause analysis

GRUB Misconfigurations

  • The server drops to a GRUB rescue prompt?
  • The server shows an error like "file not found"
  • Are multiple kernels failing?
  • Maybe /etc/default/grub was edited?
  • Maybe an initrd entry was deleted?
  • Use the GRUB cli to probe available partitions
  • Verify the kernel and initramfs files are where GRUB expects them to be
  • Boot from rescue ISO or live environment
  • Mount the root filesystem
  • Correct the UUID or kernel path in /etc/default/grub
  • Regenerate the GRUB configuration: grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-based systems, or update-grub on Debian
  • Reboot and verify the kernel loads properly
  • Back up grub.cfg before modifications
  • Why did the issue occur in the first place?
  • A rushed update?
  • A lack of peer review?

Kernel Corruption Issues

  • Observing errors such as "bad magic number" or "kernel image corrupt" during boot
  • Check whether only the latest kernel version is affected or if other versions work from the GRUB menu
  • Maybe a package update failed mid-install
  • Maybe the /boot partition has disk errors
  • Boot into an older, working kernel
  • Mount /boot and verify file checksums. If checksums fail, the corruption is real
  • Reinstall the corrupted kernel package
  • Reboot to verify that the new kernel loads
  • Monitor disk health
  • Ensure updates are completed successfully
  • See if disk failure or an interrupted update was at fault

Missing or Disabled Drivers

  • Boot hangs or drops to an initramfs shell with errors like "VFS: Cannot open root device"
  • Check if only certain hardware (for example, a RAID controller) is missing in /dev or /sys
  • Maybe the initramfs was rebuilt without the necessary driver modules
  • Maybe someone blacklisted a driver
  • Examine the initramfs contents with lsinitrd or dracut --list-modules to confirm whether the driver is absent
  • Rebuild the initramfs including the required modules
  • Reboot to verify that the driver loads and the root filesystem is detected
  • Document driver dependencies in the build scripts
  • Automate initramfs rebuilds when kernel updates occur
  • Did a kernel package change or a manual configuration error cause the driver omission?

Kernel Panic Events

  • Read the panic message on the console
  • Does the panic happen on every boot or only after certain changes?
  • Maybe a newly added module is incompatible
  • Maybe the memory has gone bad
  • Let's try booting with a previous kernel
  • Run memtest86+
  • Disable suspect modules via the kernel boot line
  • Remove or update the offending module
  • Roll back to a known-good kernel
  • Replace faulty RAM
  • Reboot and verify full functionality
  • Maintain a reliable kernel testing process
  • Monitor hardware health
  • Keep a cross-tested module database
  • What was the root cause? Was it a faulty driver, a hardware failure, or human error that caused the panic?

Filesystem Issues

Filesystem not mounting

  • The usual mount command returns errors
  • Scheduled backups and applications suddenly cannot access certain directories
  • Errors like unknown filesystem type, mount: wrong fs type, superblock corrupt in system logs
  • Boot into rescue mode or unmount any stale references, run fsck against the affected device, and inspect or repair the superblock if needed (see the sketch after this list).
  • If the issue arises from /etc/fstab, correct the UUID or device path and then test the mount manually before updating the fstab.
  • Does the system now mount cleanly?
  • Confirm read/write access and update any monitoring dashboards to reflect that the volume is back online
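
A minimal sketch; the device is a placeholder, and fsck should only run on an unmounted filesystem:

# check and repair the filesystem, answering yes to fixes
sudo fsck -y /dev/sdb1

# after correcting /etc/fstab, test all entries without rebooting
sudo mount -a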

Partition not writable

  • Processes failing with a permission denied message
  • Applications unable to save files even though the directories appear to exist
  • Maybe the filesystem is mounted read-only; examine /proc/mounts to confirm the ro flag
  • Unmount the partition, run fsck to repair any underlying errors, and then remount it with the correct read-write permissions
  • Does the issue persist?
  • Inspect ownership and ACLs, then apply chmod or chown to grant the correct user or service write access
  • Update any configuration management scripts

OS filesystem is full

  • Applications and users are unable to write logs and files
  • Check partition usage to confirm issue
  • Truncate or rotate logs, clean up old core dumps, purge orphaned Docker images, or archive older data to secondary storage (see the sketch after this list)
  • Extend the LVM volume or resize the partition, then resize the filesystem
  • Implement proactive monitoring for storage space
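
A quick way to find what is eating the space; /var is just a common culprit:

# largest directories under /var, staying on this filesystem
sudo du -xh /var | sort -rh | head -n 10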

Inode exhaustion

  • df -h may show that space is available
  • Typical message: Cannot create file: No space left on device
  • Check df -i and see if the inode count is at 100% (see the sketch after this list)
  • Identify directories with excessive file counts and then clean up old or stale files
  • Create a new filesystem with a higher inode ratio and then migrate the data if necessary
  • Update cleanup policies or add scripts to remove temporary files automatically, preventing a repeat of the issue
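
A minimal sketch; the path is a placeholder:

# inode usage per filesystem
df -i

# directories holding the most files, staying on this filesystem
sudo find /var/spool -xdev -type f | cut -d/ -f1-4 | sort | uniq -c | sort -rn | head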

Quota issues

  • Individual user or group cannot write files despite free space in the partition
  • Typical message is Disk quota exceeded when creating or writing to a file
  • Use repquota -a and quota -u <USERNAME> to view group or user quotas
  • Adjust soft and hard limits if necessary
  • Identify and remove unnecessary data from the user's home or project directories

Process Issues

Unresponsive Processes

This issue occurs when a running program stops responding to inputs or system scheduling, causing tasks to hang indefinitely.

  • Does the process no longer respond to user input or system events?
  • Is it consuming more resources than usual?
  • Spot this with top and ps
  • Use strace to watch the process
  • Send SIGTERM to let the process shut down cleanly, and if that fails, escalate to SIGKILL to free resources by force (see the sketch after this list)
  • Examine journalctl to determine what caused the process to become unresponsive
  • Implement preventive measures
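
A minimal sketch; the PID is a placeholder:

# ask the process to exit cleanly
kill -TERM 1234

# force-kill it only if it does not respond
kill -KILL 1234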

Killed Processes

They happen when a process is forcibly terminated by a signal.

  • Check journalctl and dmesg for the reason the process was killed (see the sketch after this list)
  • Logs may show Killed process <PID> or oom_reaper entries to indicate a killed process
  • Go through the logs to determine whether the system or a person killed the process
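
A quick check for OOM killer activity:

# kernel messages about killed processes
dmesg | grep -i 'killed process'

# the same from the journal
journalctl -k | grep -i oom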

Segmentation Fault

A crash that happens when a program tries to access memory it shouldn't, leading to an abrupt termination with an error message.

  • Configure the system to generate and retain core dumps (see the sketch after this list)
  • Use the GNU Debugger to analyze the core file and pinpoint the faulty code path
  • Is the issue from a package? Reinstall a version of the package without that bug
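
A minimal sketch; the binary and core file names are placeholders:

# allow core dumps in the current shell
ulimit -c unlimited

# after the crash, load the core into gdb and print a backtrace with the bt command
gdb ./myapp core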

Memory Leaks

Memory leaks occur when a program continuously allocates memory without freeing it, gradually exhausting available RAM and degrading system performance

  • Watch the RES memory rise steadily with no drop
  • Who is reserving the memory? Review logs and command output
  • Schedule periodic restarts of the service or allocate more RAM to reduce impact
  • Continue monitoring RES

System Issues

Device Failure

The server suddenly cannot read from or write to a critical piece of hardware, which is often a disk or network interface.

  • Identify the faulty device
  • Reseat or replace the device
  • If it is a RAID disk, mark the bad disk as failed and rebuild the array with a spare
  • Check disks and network to confirm full functionality of the system

Data corruption issues

They show up when files refuse to open, applications crash, or filesystem errors appear in system logs.

  • Run fsck to detect corrupted data
  • Is there a known-good backup? restore from backup
  • Use fsck with repair options to attempt recovery on the live server
  • What was the root cause? A failing disk? A power outage?
  • Verify full system functionality before it returns to production

Systemd unit failures

They occur when a service that should be running won't start or crashes immediately

  • Inspect service with systemctl status <SERVICE> or journalctl
  • Maybe edit the unit config in /etc/systemd/system/
  • Run systemctl daemon-reload to apply changes
  • Start service with systemctl start <SERVICE>
  • Set up alerts to catch unit failures

Server inaccessible

Users cannot remotely access the server.

  • Does ping timeout?
  • Does SSH hang?
  • Out-of-band tool does not respond?
  • Are other servers in the network reachable?
  • Try physical access
  • Maybe reboot the machine, or restore network configs from backups, or repair corrupt network service files
  • Validate server is reachable again

Dependency Issues

Package Dependency Issues

Occur when software cannot find or install the components it needs

  • Are the necessary repositories enabled?
  • Run dnf deplist <PACKAGE> or apt-cache depends <PACKAGE> to find missing dependencies
  • Upgrade or downgrade package if necessary
  • Rerun installation and verify software loads without issues

Path Misconfiguration Issues

Occur when the system cannot locate a program even though it is installed

Typical error message is Command not found

  • Examine echo $PATH to check current search directories
  • Add the missing directory by editing /etc/profile or similar (see the sketch after this list)
  • Reload shell or re-login to apply changes
  • Run command again to confirm the program is found
  • Document changes for future deployments
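
A minimal sketch; the directory is a placeholder:

# check the current search path
echo $PATH

# add a directory for the current session
export PATH="$PATH:/opt/myapp/bin"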

Linux: Monitoring Concepts and Configurations

Service Monitoring

Service Level Indicators (SLIs) are specific metrics, such as uptime, response time, or error rates, used to measure the performance of a service.

Service Level Objectives (SLOs) are targets to meet based on measurements such as maintaining 99.9 percent uptime.

Service Level Agreements (SLAs) are formal promises to customers or stakeholders outlining the expected level of service and the consequences if expectations are not met.

Network Monitoring

Network monitoring is the process of keeping track of devices like routers, switches, and servers to make sure everything is running properly.

SNMP - Simple Network Management Protocol

SNMP allows devices to report performance data using a structure called MIB, or Management Information Base. The MIB acts as a built-in database that defines everything that can be monitored on a device, including CPU load, memory usage, and network interface status.

The MIB contains Object Identifiers (OIDs). An OID is a unique number used to locate and retrieve specific information.

SNMP Traps are automatic alerts triggered by specific events like hardware failure or dropped network connections.

Agent-based vs Agentless Monitoring

Agent-based monitoring uses software installed on the monitored device to collect monitoring information. SNMP is an agent-based monitoring tool.

Agentless monitoring collects data using existing remote access protocols without requiring any additional software installation on the monitored devices. On Windows systems, protocols like Windows Management Instrumentation allow similar agentless access.

Event-driven Data Collection

Health Checks

Health checks allow systems to automatically test whether a service is running and responding as expected.

# checks if a web service returns a success response
curl -I http://localhost

# check if a systemd service is up and running
systemctl is-active ssh

Webhooks

Webhooks are often used for realtime integrations between services.

Log Aggregation

Log aggregation is the collection of logs from across the network into a central storage location.

Event Management

Logging

Logging provides the raw data needed to understand what is happening across a system. Logs are typically stored in the /var/log/ directory and include files like syslog, auth.log, dmesg, and more.

SIEM stands for Security Information and Event Management. A SIEM system collects and analyzes logs from across the network to help identify security threats, system issues, and unusual activity in real time.

Events

Events are generated when specific patterns or conditions are detected in the log data that indicate something noteworthy has happened.

Alerting and Notifications

Notifications

Notifications are how a Linux admin is informed when the system detects that something may require attention. They can be sent via email, text message, desktop pop-ups, ticketing systems, or collaboration platforms.

Alerts

Alerts are the system's internal triggers that cause the notifications to be sent.

Linux: Automated Tasks with Shell Scripting

Parameter Expansion

Parameter expansion is a way to substitute the value of a variable into a command or script so that the instructions become dynamic and flexible instead of static. ex: ${var}

${var}

${var} is used in shell environments to insert the value of a variable into a command. var is the name of the variable we want to expand.

ex:

location="/var/log"

cd ${location}

Command Substitution

Command Substitution inserts the result of a command directly into another command or script.

'bar' - Single-Quoted String

Everything in the single quote is treated as literal text. There will be no variable expansion and no command substitution. The text will be printed as it is written.

ex:

echo 'Warning: $PATH cannot be found'

Warning: $PATH cannot be found

$(bar) - Substituting a Command

This is how command substitution is done. This will run the command inside the parentheses by replacing the $(...) with the command's output.

# /backup/YYYY-MM-DD
mkdir /backup/$(date +%F) # mkdir /backup/2025-11-15

Subshell Execution

A subshell is a separate child process created by the shell to execute a command or group of commands in isolation without affecting the current shell environment. Whatever happens inside the subshell will not carry over to the main shell session.

(bar) - Creating a Subshell

The syntax is (cmd1; cmd2;...). All commands inside the parentheses are executed in a child shell.

ex:

# execute the command in a new shell
(bar)
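
A more concrete sketch of the isolation:

# the cd happens in a child shell only
(cd /tmp && pwd)   # prints /tmp
pwd                # the parent shell is still in the original directory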

Functions

A function is a set of commands packaged under a single name to allow repeated use without rewriting the commands each time.

ex:

function hello {
  echo "Hello, $1"
}

hello() {
  echo "Hello, $1"
}
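
Both definitions are equivalent. Calling the function passes arguments the same way as to a script; $1 inside the function is the first argument:

hello "world"   # prints: Hello, world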

Bash functions can only return numeric exit codes.

Variables by default are global. Use local to define a local variable in functions.

ex:

# to define a local variable in a function
function hello {
  local my_var="Hello"
}

Internal Field Separator / Output Field Separator

IFS tells the shell where to split input into distinct words.

OFS (the output field separator) is used by tools like awk to re-assemble data for output.

Avoiding Word Splitting

Word splitting is the shell's habit of treating spaces, tabs, and newlines inside a variable as natural break points. To fix this, we wrap the variable in double quotes or pass it through printf. ex: printf '%s\n' "$variable".

With

file_path="My project/file.txt
cat file_path

the shell will attempt to open 2 files: My and project/file.txt.

But printf '%s\n' "$file_path" will produce the exact string, on one line, with no splits.

Controlling Input Splitting

IFS=<DELIMITER> read -r VAR1 VAR2 ... <<< "$TEXT"

ex:

IFS=',' read -r name city role <<< "tom,New York,Developer"
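
Printing the variables confirms the split:

echo "$name | $city | $role"   # tom | New York | Developer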

Output Formatting

A common pattern is awk 'BEGIN{OFS="<DELIMITER>"} {print $1,$2,...}' <FILE>

ex:

# converting a portion of the /etc/passwd file into a CSV (-F: because the fields are colon-delimited)
awk -F: 'BEGIN{OFS=","} {print $1,$3,$4}' /etc/passwd | head -n 3

BEGIN{OFS=","} tells awk that commas should go between fields.

$1, $3, and $4 refer to the username, UID, and GID columns respectively.

Conditional Statements

if

It is used for running a single yes or no task like:

  • Verifying a service is running

  • Checking free disk space

  • Making sure a variable isn't empty

if condition; then
    commands
elif another_condition; then
    commands
else
    commands
fi

# to check for a file
location="/var/log/auth.log"

if [[ -f $location ]]; then
    echo "$location exists"
elif [[ -d $location ]]; then
    echo "$location is a directory"
else
    echo "$location does not exist"
fi

Options include:

  • -f for a file

  • -d for a directory

  • -z for checking whether a string is empty

  • -eq numeric equal

  • -ne numeric not equal

  • -lt numeric less than

  • -gt numeric greater than

  • = string equal

  • != string not equal

case

A case statement is used when a variable can take several acceptable values, or answers, and a different action is needed for each.

case expression in
    pattern1)
        commands ;;
    pattern2|pattern3)
        commands ;;
    *)
        commands ;;   # default case
esac
echo "Select an option: start | stop | restart"
read action

case $action in
    start)
        echo "Starting service..." ;;
    stop)
        echo "Stopping service..." ;;
    restart|reload)
        echo "Restarting service..." ;;
    *)
        echo "Unknown option: $action" ;;
esac

$1 is a positional parameter. It means it automatically holds the first command-line argument that was supplied when the script was launched.

Looping Statements

Loops allow a program to repeat actions automatically without rewriting the same instructions repeatedly.

for

A for loop repeats a task a specific number of times or for each item in a list.

ex:

for fruit in orange apple banana
  do
    echo "fruit: $fruit"
  done

while

while loop continues running as long as a condition remains true. A while loop is great when you do not know how many times something should repeat.

counter=1
while [ $counter -le 5 ]
  do
    echo "count is $counter"
    ((counter++))
  done

until

until runs until a condition becomes true.

counter=1
until [ $counter -ge 5 ]
  do
    echo "count is $counter"
    ((counter++))
  done

Interpreter Directive

An interpreter directive is a special line at the very top of the file that tells the system which program should be used to interpret the commands that follow.

It starts with #!, called a shebang, followed by the path of the interpreter, like /bin/bash.

For bash scripts, we typically use #!/bin/bash

ex:

hello.sh

#!/bin/bash

echo "hello world"

Numerical Comparisons

  • -eq equal to

  • -ne not equal to

  • -lt less than

  • -le less than or equal to

  • -gt greater than

  • -ge greater than or equal to

They are always used in [] when making comparisons.

result=8
if [ "$result" -lt 5 ]; then
    echo "Less than 5"
elif [ "$result" -eq 5 ]; then
    echo "Equal to 5"
else
    echo "Greater than 5"
fi

Redirection String Operators

> redirection operator

> redirects output to a file. It creates the file automatically if it does not exist, or overwrites its content if it does.

echo "Operation completed with code 0" > result.txt

< redirection operator

< takes input from a file.

read value < input.txt

Comparison String Operators

String comparison operators check whether two pieces of text are the same, different, match a pattern, or follow a certain alphabetical order.

  • == and = for comparing if two strings are equal
  • != for checking if two strings are not equal
  • =~ for matching patterns using regular expressions
  • < and > for comparing string alphabetical order (bash has no <= or >= operators for strings)

==, =, and =~

== is typically used inside double square brackets ([[ ]]) and is read as is equal to.

= is used inside single square brackets ([]) and is read simply as equals.

=~ is used for more advanced comparison.

#!/bin/bash

text="Hello"

if [[ $text == "Hello" ]]; then
  echo "Text is exactly Hello"
fi

if [[ $text =~ ^H ]]; then
  echo "The text starts with H"
fi

!=

#!/bin/bash

result="completed"

if [ "$result" != "completed" ]; then
  echo "The task did not complete"
fi

< and >

This is a lexicographical comparison: bash checks which string would come first or last in alphabetical order. Note that bash has no <= or >= string operators, so < and > are used inside [[ ]].

< is read as comes before (is less than)

> is read as comes after (is greater than)

#!/bin/bash

fruit="papaya"

if [[ $fruit > "mango" ]]; then
  echo "$fruit comes after mango"
fi

if [[ $fruit < "melon" ]]; then
  echo "$fruit comes before melon"
fi

Regular Expressions

A regex is a special sequence of characters that defines a search pattern.

Bash uses =~ inside [[ ]] to match patterns with regular expressions ([[ $variable =~ pattern ]]).

#!/bin/bash

data="234567"

if [[ $data =~ ^[0-9]+$ ]]; then
  echo "The data contains only numbers"
fi

Test Operators

Test operators are special symbols used to evaluate things like file existence, string content, and logical conditions. They return either true or false.

-d and -f

-d and -f are operators used in scripts to check whether something exists on the filesystem and whether it is a directory or a regular file.

#!/bin/bash

if [ -d "project" ]; then
  echo "the project folder is a directory"
fi

if [ -f "app.conf" ]; then
  echo "app.conf is a file"
fi

-n and -z

-n and -z are string test operators. They help check whether a string has a value or is empty, which is especially useful when dealing with user input.

#!/bin/bash

input=""

if [ -z "$input" ]; then 
  echo "The input is empty"
fi

input="hello"

if [ -n "$input" ]; then
  echo "the input is not empty"
fi

!

! is the logical negation operator.

#!/bin/bash

if [ ! -f "config.txt" ]; then
  echo "config file does not exist"
fi

Variables

Variables are used to store and work with information like text, numbers, or user input.

Positional Arguments

Positional arguments are values passed to a script when running it, allowing the script to respond to user input. The first argument is $1, the second is $2 and so on.

#!/bin/bash

if [ "$1" -gt 5 ]; then
  echo "The number is greater than 5"
else
  echo "The number is less than or equal to 5"
fi
# then run the script
./script.sh 10

Environment Variable

Environment variables are built-in variables provided by the system or user that store important information.

Built-in variables:

  • $USER: username
  • $HOME: home directory
  • $SHELL: current shell
#!/bin/bash

if [ $USER = 'root' ]; then
  echo "You are logged in as the root user"
else
  echo "You are logged as regular user $USER"
fi

Alias and Command Management

alias

The alias command creates shortcuts for longer commands. The generic syntax is alias name='command'.

Aliases set in the terminal are only temporary and only last for that session.

# create a shortcut called ckdsk
alias ckdsk='df -h'

unalias

The unalias command removes a shortcut that was previously created.

# to remove a previously created alias
unalias ckdsk

set

set modifies the behavior of the shell.

#!/bin/bash
# to stop script from running if any command inside it fails
set -e

echo "running system update..."

sudo dnf update

echo "update completed"

Other options with set (combined in the sketch after this list):

  • -x prints each command before it is executed
  • -u exits script when attempting to use an undefined variable
  • -o pipefail makes a pipeline fail if any command in the pipeline fails
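
A common combination at the top of scripts, sometimes called strict mode:

#!/bin/bash
# exit on error, on undefined variables, and on any failure inside a pipeline
set -euo pipefail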

Variable Management

  • export allows a variable to be passed to child processes
  • local restricts a variable's scope to within a function
  • unset deletes a variable

export

export is used to make a variable available to child processes, such as a subshell or another script that is launched from the current shell. The syntax is export VARIABLE=value

export LOG_LEVEL=debug

./myscript.sh # runs in a separate shell process but still has access to LOG_LEVEL because of 'export'

local

The local command is used to restrict a variable to within a function. The syntax is local VARIABLE=value

unset

unset is used to remove a variable. The syntax is unset VARIABLE

log_file="log.txt"

echo "processing file"

unset log_file

Return Codes

A return code or exit status is a number a command or program leaves behind when it finishes, indicating success or failure.

$? is used to see the exit code of the last command (see the sketch after this list).

  • 0 means success
  • Non-zero means error
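
A quick demonstration:

ls /etc/passwd
echo $?          # 0: the command succeeded

ls /nonexistent
echo $?          # non-zero: the command failed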

Linux: Automation and Orchestration

Ansible IaC Core Concepts

Ansible lets users automate system configuration and management using clear, repeatable commands. Ansible is agentless. It uses SSH on Linux and WinRM on Windows.

Installing Ansible

# on RHEL-based system
dnf install -y ansible-core

# on Debian-based system
apt install -y ansible

# to test install
ansible --version
ansible localhost -m ping

Inventory

The inventory is a list of all the servers or devices. It can be stored as a simple text file using the INI format, as a structured YAML file, or built dynamically from cloud platforms or CMDBs.

Groups like [web] make servers easy to manage.

Example inventory file:

# ./hosts
[local]
localhost ansible_connection=local

# to install htop on a RHEL-based system
ansible -i hosts local -m dnf -a "name=htop state=present update_cache=yes" --become

# to create a new user
ansible -i hosts local -m user -a "name=bob state=present" --become

# to copy a file from the control node to the managed nodes
ansible -i hosts local -m copy -a "src=my_config.conf dest=/apps/myapp.conf" --become

Ad Hoc Mode

Ad Hoc Mode is used to run one-time commands to test settings or apply changes across systems.

ex:

# to ping all hosts listed in the inventory
ansible all -m ping

# to restart the nginx service in the web group
ansible web -m service -a "name=nginx state=restarted"

Module

A module is a built-in tool that handles specific tasks like installing software, restarting services, or managing users.

ex of modules:

  • yum: used to install, update, remove packages on RHEL-based systems
  • apt: used to install, update, remove packages on Debian-based systems
  • user: used to manage user accounts on the system
  • service: used to start, stop, restart, or enable services
  • copy: used to transfer files from the control node to remote machines
  • file: used to create directories, change permissions, or delete files

Playbook

Playbooks handle complex, repeatable, structured automation. A playbook is a structured YAML file that defines a set of tasks for Ansible to carry out on managed systems.
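
A minimal playbook sketch; the group name, file name, and package are made up:

# site.yml
- name: Ensure nginx is installed and running
  hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present
    - name: Start and enable nginx
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

# run it with: ansible-playbook -i hosts site.yml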

Facts

Facts allow Ansible to automatically gather information about each machine and make decisions based on that data. Collected data can include IP addresses, operating system, available memory, and disk space. Facts are gathered at the beginning of playbook execution so Ansible can decide what action to take based on the current setup of the machine. Ansible collects facts only when users run a task, using a direct connection like SSH.

Collections

Collections help manage and reuse tools, making it easier to scale and maintain an automation environment over time.

Puppet Core Usage

Puppet helps automate system configuration by letting admins describe what the system should look like. Puppet is agent-based.

The Puppet Agent is responsible for communicating with the Puppet server and applying configurations. It is also responsible for collecting facts.

The Puppet server is called Puppet Master.

Facts

Facts are information Puppet collects on the managed devices such as operating system, hostname, IP addresses, memory, and more. The Puppet Agent collects facts on a regular schedule during each check-in with the Puppet server. Puppet is well suited for large-scale enterprise environments because it enforces regular automated configuration.

Classes

Classes group related configuration tasks together into one logical unit. They help apply consistent settings to many systems with minimal duplication of effort.
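
ex of a minimal class sketch written in Puppet's own language (the class and resource names are just examples):

# make sure the ntp package is installed and its service is running
class ntp {
  package { 'ntp':
    ensure => installed,
  }

  service { 'ntp':
    ensure => running,
    enable => true,
  }
}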

Modules

A module is a package that includes everything needed to manage a specific task or part of a system. It can include one or many classes, files, templates, or custom facts.

Certificates

Certificates ensure that only authorized machines are allowed to talk to the server and receive configurations. The certificates must be approved and signed by the server before configurations are exchanged.
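
On recent Puppet versions, pending requests are listed and signed on the server; a sketch (the certname is just an example):

# to list pending certificate requests
puppetserver ca list

# to sign the request from a specific node
puppetserver ca sign --certname node01.example.com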

OpenTofu Core Usage

OpenTofu is an open-source tool designed to help manage and automate cloud infrastructure with code.

Provider

A provider connects the configuration code to the actual cloud platform or service the user is trying to manage. OpenTofu talks to services like AWS, Azure, and GCP using APIs.
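
ex of a minimal AWS provider sketch in HCL (OpenTofu keeps the terraform block name for compatibility; the region is just an example):

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}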

Resources

Resources are the specific pieces of infrastructure a user wants to create or manage, such as a virtual machine, a firewall rule, or a storage bucket. OpenTofu resources focus on provisioning and configuring cloud services from the ground up.
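
ex of a minimal resource sketch (the AMI id is hypothetical):

# a single EC2 virtual machine
resource "aws_instance" "web" {
  ami           = "ami-0abc1234example"  # hypothetical AMI id
  instance_type = "t2.micro"
}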

State

The state is how OpenTofu keeps track of what has already been created in the environment.
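
The usual workflow reads and updates the state automatically (by default it lives in a local terraform.tfstate file):

tofu init    # download the providers the configuration needs
tofu plan    # compare the configuration against the recorded state
tofu apply   # create or update resources and record them in the state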

Unattended Deployment

It is the automation of installation and initial configuration of systems to avoid manual step-by-step administration.

Kickstart

Kickstart is commonly used in traditional data center environments with RHEL-based systems. You automate the RHEL-based installation by specifying things like language, disk setup, network settings, and package selection in a configuration file. The general syntax to start a kickstart install from a boot prompt is linux ks=<LOCATION OF KICKSTART FILE> inst.repo=<INSTALLATION SOURCE>

# to start a kickstart install
linux ks=http://192.168.10.10/kickstart/ks.cfg inst.repo=http://192.168.10.10/rhel8
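
The kickstart file itself is a plain-text list of installation directives; a minimal sketch (all values are just examples):

# ks.cfg
lang en_US.UTF-8
keyboard us
timezone America/New_York
rootpw --plaintext changeme
autopart
reboot

%packages
@core
%end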

Cloud-init

Cloud-init is the standard for automating deployments in cloud platforms like AWS, Azure, or OpenStack. It reads a YAML configuration file and applies the changes during the first boot of a cloud instance.

ex:

# to create an install and configure using cloud-init
aws ec2 run-instances --image-id ami-0adfads185141422356 --instance-type t2.micro --user-data file://init-script.yaml
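
The user-data file (init-script.yaml above) is a cloud-config document; a minimal sketch (the user and package names are just examples):

#cloud-config
package_update: true
packages:
  - nginx

users:
  - name: bob
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL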

CI/CD Concepts

CI/CD is a system of tools and practices that brings order and automation to modern software development.

Version Control

Version control is a system that tracks changes to files over time, allowing developers to collaborate, review history, and roll back if something goes wrong. Git is the most common version control tool used today.

Pipelines

A pipeline is a sequence of automated steps that take code from commit to deployment. It might include testing, security scanning, building the software, and deploying to production.
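
ex of a minimal pipeline sketch in GitLab CI syntax (the stage names and scripts are hypothetical):

# .gitlab-ci.yml
stages:
  - test
  - build
  - deploy

test:
  stage: test
  script:
    - ./run_tests.sh

build:
  stage: build
  script:
    - ./build.sh

deploy:
  stage: deploy
  script:
    - ./deploy.sh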

Modern CI/CD Approaches

Shift Left Testing

Shift left testing moves testing earlier in the development cycle, right alongside coding. The common tools used are Jenkins and GitLab CI.

DevSecOps

DevSecOps = Development, Security, and Operations. It is an approach that builds on CI/CD by embedding security practices throughout the software lifecycle.

GitOps

GitOps is a way of managing infrastructure and deployments using Git as the single source of truth. Common tools used are Argo CD and Flux.

Kubernetes Core Workloads for Deployment Orchestration

Kubernetes is an open-source platform that automates the deployment, scaling and management of containerized applications.

Pods

Pods are where the applications run. They allow users to tightly couple containers that need to work together. Containers that run in the same pod can talk to each other as if they were running on the same machine.

Deployments

Deployments make sure the right number of Pods are up and kept up to date. A Deployment acts like a controller: it keeps track of the application and replaces Pods that fail or fall out of date.
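
ex of a minimal Deployment sketch that keeps 3 nginx Pods running (the names are just examples):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80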

Services

Services ensure the application is reachable by other apps or users. They provide a stable endpoint so other applications or users can reliably connect to the app regardless of which Pod is currently running.
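
ex of a minimal Service sketch exposing the Pods from the Deployment sketch above:

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80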

Kubernetes Configuration

Variables

Variables are the simplest way to pass configuration settings into the containers.

ex:

# to tell the pod to use production settings
ENVIRONMENT=production
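
In a manifest, the variable is set in the container spec; a minimal Pod sketch (the image name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: my-web-app:latest
      env:
        - name: ENVIRONMENT
          value: "production"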

ConfigMaps

ConfigMaps store larger sets of configuration data in a Kubernetes object.
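
ex of a minimal ConfigMap sketch (the name and keys are just examples):

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  ENVIRONMENT: production
  LOG_LEVEL: debug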

Secrets

Secrets work similarly to ConfigMaps, but are specifically designed to store sensitive data such as passwords, API tokens, SSH keys, or SSL/TLS certificates. By default, secrets are encoded in base64. Kubernetes uses RBAC to control access to these secrets.
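
A secret can be created directly from the command line, for example (the name and value are just examples):

# to create a secret from a literal value
kubectl create secret generic db-password --from-literal=password=changeme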

Volumes

Volumes provide a way for containers to store and access data that needs to persist beyond the life of a single container.
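
ex of a minimal PersistentVolumeClaim sketch requesting 1 GiB of storage (the name is just an example):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi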

Docker Swarm Core Workloads for Deployment Orchestration

Docker Swarm is a tool that helps orchestrate container deployments, making sure everything runs reliably.

Nodes

A node is a physical or virtual machine that is part of the swarm cluster. A node runs the Docker engine and is classified as either a Manager Node, which makes decisions and assigns tasks, or a Worker Node, which carries out the tasks.
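
A swarm is bootstrapped on the manager, and workers join with the token it prints (the IP address is just an example):

# on the manager node
docker swarm init

# on each worker node, paste the join command printed by 'swarm init'
docker swarm join --token <TOKEN> 192.168.10.10:2377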

Tasks

A task is the actual instance of a container running on a node. Each task maps to exactly one container, and Swarm monitors them all continuously. Pods can host multiple containers, while a swarm task maps one-to-one. Tasks help ensure that the application stays running as expected.

Service

A service is a top-level object in Docker Swarm that defines how the application runs.
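
ex (the service name and ports are just examples):

# to run a service of 3 nginx replicas, published on port 8080
docker service create --name web --replicas 3 -p 8080:80 nginx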

Docker Swarm Configuration

Networks

Networks define how containers communicate within the swarm.

Overlay Networks

Overlay Networks are virtual networks that span across all nodes in the swarm. They enable secure, seamless communication between containers on different nodes.

ex docker-compose.yaml

# define an overlay network called frontend
networks:
  frontend:
    driver: overlay

Scaling

Scaling refers to how many replicas of a service are running at any given time.

ex docker-compose.yaml:

services:
  web:
    image: nginx
    deploy:
      replicas: 3
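
A running service can also be rescaled on the fly:

# to change the number of replicas of the web service
docker service scale web=5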

Docker/Podman Compose for Deployment

Compose file

docker-compose.yaml

version: "3.8"

services:
  web:
    image: nginx
    ports:
      - "8080:80"

  app:
    image: my-web-app:latest
    environment:
      - ENV=production
    depends_on:
      - db

  db:
    image: postgres
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data:

Podman Compose offers the same workflow on top of Podman, a more security-focused container engine that runs daemonless and supports rootless containers.

up and down commands

# to start or bring down containers with Docker Compose
docker-compose up
docker-compose down

# the same with Podman Compose
podman-compose up
podman-compose down

Viewing Logs

Viewing logs is essential for understanding what's happening in the application.

# to view logs
docker-compose logs
podman-compose logs
# to tail logs
docker-compose logs --follow web