
Ansible: More on Playbooks

In my previous post in this series, I talked about what an Ansible playbook is and how to get started. In this post, I am going to talk about a few things that make working with playbooks more fun. At first, an Ansible playbook sounds basic and cannot do much besides pinging hosts, installing packages, copying files, and checking services (at least the way I presented it previously). But Ansible can do way more than that using the debug module, variables, and more.

Ansible Debug Module

Here is a simple example of a playbook that displays debug messages. It does not perform any particular task but shows how debug messages can be used in a playbook.

---
- name: A playbook with example debug messages
  hosts: servers
  become: 'yes'

  tasks:
  - name: Simple message
    ansible.builtin.debug:
      msg: This is a simple message

  - name: Showing a multi-line message
    ansible.builtin.debug:
      msg: 
      - This is the first message
      - This is the second message

  - name: Showing host facts
    ansible.builtin.debug:
      msg: 
      - The node's hostname is {{ inventory_hostname }}

ansible.builtin.debug has three parameters:

  • msg: The debug message we want to show

  • var: The variable we want to debug and show in the logs when the playbook is run. It cannot be used simultaneously with msg.

  • verbosity: An integer that represents the debug level at which the message is shown when the playbook is run. It can have a value between 1 and 5 (-v to -vvvvv). The default value is 0, meaning the message is displayed even without any verbosity flag.

---
- name: A playbook with example debug messages
  hosts: servers
  become: 'yes'

  tasks:
  - name: Debug a variable
    ansible.builtin.debug:
      var: inventory_hostname

  - name: Debug a variable with verbosity of 3
    ansible.builtin.debug:
      msg: This is a message with a verbosity of 3
      verbosity: 3

When we run the playbook without a verbosity flag, the messages that require verbosity will not be logged. So, if we want to show all messages, we should run:

ansible-playbook my-playbook.yml -vvv

-vvv designates verbosity level 3.

Defining variables in a playbook

We can define variables to store data we want to use in multiple places in a playbook. We define variables in the following way:


  more code...

  vars:
    var1: Hello world
    var2: 15
    var3: true
    var4:
    - Apples
    - Green
    - 1.5

  more code...


  more code...

  vars:
    grouped:
      var5: Hi there
      var6: 30
      var7: false

  more code...

Debugging multiple variables

 more code...

  tasks:
  - name: Display multiple variables
    ansible.builtin.debug:
      msg: |
        var1: {{ var1 }}
        var2: {{ var2 }}
        var3: {{ var3 }}
        var4: {{ var4 }}

 more code...    

 more code...

  tasks:
  - name: Display multiple variables
    ansible.builtin.debug:
      var: grouped

 more code...    

Storing Outputs with Registers

Most ansible modules run and return a success or failure output. But sometimes we want to keep the resulting output of a task for later use. We can use a register to store that output. Here is an example:

---

- name: This is a playbook showcasing the use of registers
  become: 'yes'
  hosts: servers

  tasks:
  - name: Using a register to store output
    ansible.builtin.shell: ssh -V
    register: ssh_version

  - name: Showing the ssh version
    ansible.builtin.debug:
      var: ssh_version

We store the output in the variable named in the register key, and then we can use var or msg from the debug module to display it.

Storing Data with Set_Fact Module

set_fact is used to store data associated with a node. It takes key: value pairs, where the key is the name of the variable and the value is its content. For example:

---

- name: This is a playbook showcasing the use of set_fact
  become: 'yes'
  hosts: servers

  tasks:
  - name: Using a register to store output
    ansible.builtin.shell: ssh -V
    register: ssh_version

  - ansible.builtin.set_fact:
      ssh_version_number: "{{ ssh_version.stderr }}"

  - ansible.builtin.debug:
      var: ssh_version_number

Are you wondering why I used stderr instead of stdout or stdout_lines? That is the normal behavior of ssh -V: it writes its version string to stderr.

Reading Variables at Runtime

For data we cannot hard code in the playbook, we can pass values to the playbook at runtime using the vars_prompt keyword.

---

- name: This is a playbook showcasing the use of vars_prompt
  become: 'yes'
  hosts: localhost

  vars_prompt:
  - name: description
    prompt: Please provide the description
    private: no

  tasks:
  - ansible.builtin.debug:
      var: description
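When we run this playbook, Ansible prompts for the description before the tasks start. If we do not want an interactive prompt, the same variable can be supplied on the command line with -e; a variable already defined this way will not be prompted for. A quick sketch (the playbook file name is just a placeholder):

# prompts interactively for the description
ansible-playbook vars-prompt-demo.yml

# skips the prompt because the variable is already defined
ansible-playbook vars-prompt-demo.yml -e description=provided-at-runtime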

Date, Time, and Timestamp

ansible_date_time

ansible_date_time comes from the gathered facts. The playbook needs to gather facts from the nodes; otherwise the variable will be undefined.

---

- name: This is a playbook showcasing ansible_date_time
  become: 'yes'
  hosts: localhost
  gather_facts: true

  tasks:
  - ansible.builtin.debug:
      msg: "Datetime data {{ ansible_date_time }}"

  - ansible.builtin.debug:
      msg: "Date {{ ansible_date_time.date }}"

  - ansible.builtin.debug:
      msg: "Time {{ ansible_date_time.time }}"

  - ansible.builtin.debug:
      msg: "Timestamp {{ ansible_date_time.iso8601 }}"

Conditional Statements

when

A task with a when conditional statement will only execute if the statement evaluates to true. For example:

---

- name: This is a playbook showcasing the use of `when` conditional statement
  become: 'yes'
  hosts: localhost
  gather_facts: true

  tasks:

  - ansible.builtin.debug:
      msg: "Date {{ ansible_date_time.date }}"
    when: ansible_date_time is defined

The debug task will only run if ansible_date_time is defined.

failed_when

---

- name: This is a playbook showcasing the use of `failed_when` conditional statement
  become: 'yes'
  hosts: localhost
  gather_facts: false

  tasks:

  - name: Check connection
    command: ping -c 4 mywebapp.local
    register: ping_result
    failed_when: false # never fail

In the above example, the task never fails. But when failed_when is given an expression that evaluates to true or false, the task will be marked as failed if that expression evaluates to true. Otherwise it is considered successful.
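Here is a minimal sketch of failed_when with an actual expression; the matched string is purely illustrative and depends on what ping prints on your system:

---

- name: A playbook showcasing `failed_when` with an expression
  hosts: localhost
  gather_facts: false

  tasks:

  - name: Check connection
    ansible.builtin.command: ping -c 4 mywebapp.local
    register: ping_result
    # mark the task as failed only when every packet was lost (illustrative condition)
    failed_when: "'100% packet loss' in ping_result.stdout"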

changed_when

When ansible runs on a host, it may change something on that host. Sometimes we want to define ourselves when to consider the system as changed. That is what changed_when is for.

---

- name: This is a playbook showcasing the use of `changed_when` conditional statement
  become: 'yes'
  hosts: localhost
  gather_facts: false

  tasks:

  - name: Check connection
    command: ping -c 4 mywebapp.local
    register: ping_result
    failed_when: false # never fail
    changed_when: false # never change anything
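As with failed_when, changed_when also accepts an expression. A minimal sketch, with a purely illustrative condition that only reports a change when the command produced output:

---

- name: A playbook showcasing `changed_when` with an expression
  hosts: localhost
  gather_facts: false

  tasks:

  - name: Run a read-only check
    ansible.builtin.command: cat /etc/hostname
    register: check_result
    # report "changed" only if the command printed something (illustrative condition)
    changed_when: check_result.stdout | length > 0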

Handlers

Handlers are used to manage task dependencies. When we want to run a task only after another one has completed with changed=true, we use a handler. In the example below, we only enable the nginx service after nginx is installed successfully.

---

- name: This is a playbook showcasing the use of handlers
  become: 'yes'
  hosts: servers
  gather_facts: true

  tasks:

    - name: Install nginx
      ansible.builtin.dnf:
        name: nginx
        state: present
      notify:
        - Enable nginx service

  handlers:

    - name: Enable nginx service
      ansible.builtin.service:
        name: nginx
        enabled: true
        state: restarted

Ansible Vault

The vault is where we keep our secrets secret. When we have confidential information that we want to keep secure, we use ansible vault. It allows seamless encryption and decryption of sensitive data with smooth integration with other ansible features such as ansible-playbook.

Encrypt a variable

ansible-vault encrypt_string "secret token string" --name "api_key"

Encrypt a file

ansible-vault encrypt myfile.txt

Decrypt a file

ansible-vault decrypt myfile.txt

View content of encrypted file

ansible-vault view myfile.txt

Edit content of an encrypted file

ansible-vault edit myfile.txt

Change encrypted file encryption key

ansible-vault rekey myfile.txt

Use an encrypted variable in a playbook

---

- name: This is a playbook showcasing the use of an encrypted variable
  become: 'yes'
  hosts: servers
  gather_facts: true

  vars:
    my_secret: !vault |
                  $ANSIBLE_VAULT;1.1;AES256
                  15396363646563646365353331396364333839346632333964353531386132323034353163346432
                  6365313938653033613538366132353631626430373032620a653030326634376663613964366164
                  33373965656433346466326266363438376330386561386563353764646237643061613337323733
                  3633383934636236620a353132306539343363326437316539633432363436653437333866353534
                  3738

  tasks:

    - ansible.builtin.debug:
        var: my_secret # never print secrets
      no_log: true

The playbook will not run until we provide the vault password to decrypt the encrypted variable.
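For example, assuming the playbook file is named my-playbook.yml and the password file path is a placeholder, we can provide the vault password interactively or from a file:

# prompt for the vault password
ansible-playbook my-playbook.yml --ask-vault-pass

# read the vault password from a file
ansible-playbook my-playbook.yml --vault-password-file ~/.vault_pass.txt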

Conclusion

I am going to stop here for now but will come back later in other posts to talk more about Ansible playbooks. Stay warm, everyone.

Virtualization Technologies

Traditional System Configuration

In traditional server systems, applications run directly on top of the host operating system. If we have applications that need isolation, we need a dedicated physical server for each of them, even though a single server would have enough resources to run multiple applications.

This scenario is inefficient and very expensive. To solve the problem, we use virtualization, which allows running multiple operating systems on the same physical server.

Virtualization

With virtualization, instead of installing applications directly on top of the physical server's host operating system, we install a hypervisor on top of the physical server, then install guest operating systems in virtual machines. Those virtual machines are managed by the hypervisor, which is also called a virtual machine manager (VMM).

A hypervisor is software that sits on top of a bare-metal server to allow the sharing of resources between multiple operating systems (also called guest operating systems). There are two types of hypervisors: a Type 1 hypervisor runs directly on the bare-metal server, and a Type 2 hypervisor is a software package installed on a host operating system.

Running multiple operating systems on a single host saves money, time, and space since VM management is easier and we have less physical equipment to purchase, install, and manage. Common virtualization platforms include:

  • VMWare ESXi

  • Oracle OVM/OLVM

  • Microsoft Hyper-V

  • Citrix Xen Server

  • RedHat KVM

  • Proxmox VE

  • XCP-ng

  • Incus

VMWare ESXi

I have some lab experience with VirtualBox, Proxmox, and Incus but not VMWare yet. That is about to change. Since I am also looking into opportunities in data centers, I think it is a good time to start learning about VMWare technologies alongside the system and network automation journey I recently started.

Ansible: Introduction to Playbooks

What is a Playbook

A Playbook in ansible is a file containing a set of instructions for automating system configuration. It is like a bash script but in ansible "language". An ad-hoc command is suitable for a basic single-line task, but if we want to perform a complex and repeatable deployment, we certainly should use a playbook.

Playbooks are written in YAML following a structured syntax. A playbook contains an ordered list of plays that run from top to bottom by default.

---
- name: Update servers
  hosts: servers
  remote_user: ans-user

  tasks:
  - name: Update nginx
    ansible.builtin.dnf:
      name: nginx
      state: latest

To ping all hosts:

---
- name: Ping all hosts
  hosts: servers

  tasks:
  - name: Ping servers
    ansible.builtin.ping:

This is how you run a playbook:

ansible-playbook my_playbook.yml

or in dry run mode:

ansible-playbook --check my_playbook.yml

or to check the syntax of our playbook:

ansible-playbook --syntax-check my_playbook.yml

or to list all hosts:

ansible-playbook --list-hosts my_playbook.yml

or to list all tasks:

ansible-playbook --list-tasks my_playbook.yml

We can add a lot more configuration to a playbook to perform advanced automation tasks. We are going to leave that for a future post.

By default, ansible gathers facts about the nodes before executing the playbook. To disable this feature, we can add gather_facts: false to our playbook:

---
- name: Ping all hosts
  hosts: servers
  gather_facts: false

  tasks:
  - name: Ping servers
    ansible.builtin.ping:

Examples of Simple Playbooks

Ping hosts

We've already seen how to do that. This is a simple way to ping nodes using an ansible playbook:

---
- name: Ping Linux hosts
  hosts: servers
  gather_facts: false

  tasks:
  - name: Ping servers
    ansible.builtin.ping:

Install/Uninstall packages

---
- name: Install/uninstall packages
  hosts: servers
  become: 'yes'

  tasks:
  - name: Install OpenSSH on Linux servers
    ansible.builtin.dnf:
      name: openssh
      state: present

  - name: Uninstall Apache
    ansible.builtin.dnf:
      name: httpd
      state: absent

Update packages

---
- name: Update packages
  hosts: servers
  become: 'yes'

  tasks:
  - name: Update OpenSSH on Linux servers
    ansible.builtin.dnf:
      name: openssh
      state: latest

  - name: Update nginx
    ansible.builtin.dnf:
      name: nginx
      state: latest

Enable/Disable services

---
- name: Enable nginx service
  hosts: servers
  become: 'yes'

  tasks:
  - name: Install nginx
    ansible.builtin.dnf:
      name: nginx
      state: present

  - name: Enable nginx service
    ansible.builtin.service:
      name: nginx
      state: started
      enabled: yes

  - name: Disable cups service
    ansible.builtin.service:
      name: cups
      state: stopped
      enabled: no

Ansible: Ad-Hoc Commands

An ansible ad-hoc command is a single command sent to an ansible client. For example:

ansible servers -m setup

setup is an ansible module; a module is a tool that handles a specific operation. The setup module gathers information about the selected ansible clients.

We can pass a filter argument to setup to gather only the information we are interested in from the managed nodes. For example:

ansible servers -m setup -a "filter=ansible_all_ip_addresses"
ansible servers -m ping

The second command pings all clients from the servers group in the configured inventory.

Run Shell Commands on Ansible Clients

ansible servers -m shell -a "ip addr show"
ansible servers -m shell -a "uptime"

These commands send a shell command to all nodes in the group. -a is used to specify the argument for the ad-hoc command, which here is the command we want to run on each ansible client.

Copy Files from the Ansible Control Node to Clients

ansible servers -m copy -a "src=/home/me/my-file.txt dest=/etc/ansible/data/my-file.txt"
ansible servers -m copy -a "content='My Text file content' dest=/etc/ansible/data/my-text-file.txt"

These commands copy a file from the ansible control node to all clients. We can choose whether we want to copy the content of the file or the file itself by using the content or src parameter. Note that the destination folder the file is going into must exist on the clients. When uploading file content, we must specify the complete destination file name (dest=/data/upload/my-file.conf).

If the file exists at the destination, ansible will use the file's checksum to determine if that task was previously done. If the file has not been modified, ansible will not re-upload it. Otherwise, whether the file was updated on the control node or on the clients, ansible will upload the file again to the selected clients.

Create and Delete Files and Folders on Ansible Clients

To create a new file in ansible clients:

ansible servers -m file -a "dest=FILE OR DIRECTORY DESTINATION state=touch"

To delete a file in ansible clients:

ansible servers -m file -a "dest=FILE OR DIRECTORY DESTINATION state=absent"

To create a directory, change the state to directory:

ansible servers -m file -a "dest=/my/directory/data state=directory"

A directory deletion is performed like a file deletion: specify the directory name in dest with state=absent and ansible will delete the directory.
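For example, reusing the directory path from above:

ansible servers -m file -a "dest=/my/directory/data state=absent"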

Install and Uninstall Packages on Ansible Clients

We can use the shell module or the dnf/apt ansible module.

ansible servers -m shell -a "sudo dnf install nginx"
ansible servers -m dnf -a "name=nginx state=present" -b

If the operation requires root privileges, we can pass sudo to the shell command. But if we are using the dnf/apt module, the ansible user must have root privileges and we also need to add the -b option to the command.

Use the latest state to update an already installed package.

ansible servers -m dnf -a "name=nginx state=latest" -b

The state can be one of absent, installed, present, removed, and latest.
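For completeness, a hedged example of removing a package with the absent state (equivalent to removed), here uninstalling Apache:

ansible servers -m dnf -a "name=httpd state=absent" -b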

Understanding ansible ad-hoc commands is important for understanding ansible playbooks. From here, we are going to move slowly towards efficient ways to automate tasks using ansible.

Ansible: Control Node Reasonable Setup

This post will focus on coming up with a reasonable Ansible control node setup for a homelab. By reasonable setup I mean a setup that will allow me to properly send tasks to managed nodes with a lower likelihood of failure. From this point I would like to focus on learning the important parts of ansible instead of juggling left and right to fix basic setup errors.

Create or select a working folder

To keep things simple, I am going to have my inventory in /etc/ansible-admin/ owned by the ansible-admin group.

Where to keep ansible.cfg

The default ansible.cfg can be left where it is. For managing our nodes, I am going to keep my own ansible configuration inside /etc/ansible-admin/ansible.cfg.

Where to keep the inventories

The lab inventories can be kept in /etc/ansible-admin/inventory/
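Here is a minimal sketch of what /etc/ansible-admin/ansible.cfg could look like with this layout; the remote_user value is just a placeholder for a lab user and can be changed or dropped:

[defaults]
# dedicated inventory directory for the homelab
inventory = /etc/ansible-admin/inventory/
# placeholder user used to connect to the managed nodes
remote_user = ansible-admin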

Disable the host key verification

From ansible.cfg:

[defaults]
host_key_checking = False

or from an environment variable:

export ANSIBLE_HOST_KEY_CHECKING=False

or from the command line:

ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook my_playbook.yml

Ansible: More about the Inventory File

Ansible's default inventory is located at /etc/ansible/hosts, but we can keep it elsewhere, for example at /home/me/ansible/hosts.ini, and point ansible to it using the -i flag.

ansible web -m ping -i ./hosts.ini

Or I can just configure ansible.cfg to point to the path of my inventory file.

[defaults]

inventory = /etc/ansible/inventory/hosts.ini

Hosts can be organized in groups inside the inventory file. A group name must be unique and follow the rules of a valid variable name.

Here is an example with two groups, web and db:

[web]
192.168.10.15
192.168.10.16

[db]
192.168.12.15
192.168.12.16
192.168.12.17

Here is the same inventory in YAML format

web:
  hosts:
    192.168.10.15:
    192.168.10.16:
db:
  hosts:
    192.168.12.15:
    192.168.12.16:
    192.168.12.17:

Ansible automatically creates the all and ungrouped groups behind the scenes. The all group contains all hosts, and the ungrouped group contains all hosts that are not in any group.

So, ansible -m ping all will ping all hosts listed in the inventory file, and ansible -m ping ungrouped will ping all hosts not listed in any group.

Do more in your inventory

  • A host can be part of multiple groups

  • Groups can also be grouped. In the INI format:

[prod:children]
web
db

[test:children]
web_test

The same groups of groups in YAML format:

prod:
  children:
    web:
    db:
test:
  children:
    web_test:

  • Add a range of hosts. In the INI format:

[servers]
192.168.11.[15:35]

And in YAML format:

servers:
  hosts:
    192.168.11.[15:35]:

  • Add variables to hosts or groups, for example a custom SSH port:

[prod]
192.168.10.15:4422
prod1 ansible_port=4422 ansible_host=192.168.10.22

You can do way more than what I have listed above. I am not going to bore you with everything about the Ansible inventory here because I don't need it at this stage of my learning. But if you feel like you want to learn more about this topic, check the official Ansible inventory documentation.

Goodbye for now

Ansible: Initial Setup

In my previous post, I went quickly through ansible installation and initial setup. I did not really set up anything; I just showed you where to find the things that ansible brings by default.

In this post I will go deeper into the setup process. But I am still not going to try to impress you here. Let's keep that for future posts.

Ansible Control Node

The Ansible config file is located at /etc/ansible/ansible.cfg by default. We are going to use this file later to customize our installation of Ansible.

If you have just a few nodes, you can SSH into each one of them to make sure you can correctly connect. That also means that if you have just a few nodes, Ansible might not be the right tool.

Use ssh-copy-id -i key.pub node-user@192.168.10.10 to add the controller's public SSH key to the authorized keys on each node so the controller can connect.

Ansible Inventory

The inventory contains the nodes you want ansible to manage. The default inventory file is located at /etc/ansible/hosts. The nodes are put into groups for ease of management. The group names must be unique and they are case sensitive. The inventory file contains the IP addresses or FQDN of the managed hosts.

If we want to use the default inventory file we can just run:

# to ping all nodes in the web group
ansible -m ping web

But if we are working with a dedicated inventory file, like my_nodes.ini, we should tell ansible that we are providing an inventory file by adding -i [INVENTORY FILE]. For example, ansible web -i my_nodes.ini -m ping

The inventory in the ini format looks like:

[web]
192.168.12.13
192.168.12.14

[db]
192.168.13.13
192.168.13.15

But the inventory file can also be written in the YAML format:

my_nodes:
  hosts:
    node_01:
      ansible_host: 192.168.10.12
    node_02:
      ansible_host: 192.168.10.13

[web] is a group name. It is unique across the inventory file. We can have multiple groups in an inventory file.

To run an ansible command on multiple groups, we separate the group names with colons. For example:

ansible web:db -m ping -i my_nodes.ini --ask-pass

This command will ping nodes in the web and db groups. --ask-pass allows prompting for a password if somehow the SSH daemon on the managed nodes is asking for the user password.

If our command requires an input to function, maybe we are doing it the wrong way. Ansible is supposed to facilitate automation. A command should be able to run until completion without additional user input. In my initial ansible setup, I provided input twice when I was running the ping command: the first was the host key verification, the second was to provide the node password because the SSH keys were not set up properly. We are going to fix this in the next posts.

How to Manage Nodes with Ansible

Until now we only learned how to ping our nodes using the ansible ping module. ansible web -m ping is how we tell ansible to use the ping module against the web group.

Key Points to Remember

  • Ansible is used to automate repetitive tasks we perform on network devices

  • Ansible inventory contains a grouped list of the nodes we want to manage

  • The inventory can be written in the ini or YAML format

  • Ansible comes with prebuilt modules like ping to facilitate node management.

In my next posts, I will be going deeper into each important part of Ansible, such as inventories and playbooks.

So, read me soon.

Ansible: Installation and Initial Setup

What is Ansible?

Let's cut to the chase. Ansible is a tool for system and network admins to automate repetitive tasks, for example installing and configuring multiple servers, and configuring routers, switches, firewalls, and WAPs at once. Ansible can talk to any device that talks the SSH language. Other connection types are supported but SSH is the default connection type. Visit the Ansible Documentation page to learn more.

This is not going to be a step-by-step tutorial on how to use Ansible nor an in-depth overview of Ansible. A lot of important basic topics will be missing from this post but they might appear in future posts. So if something is missing here, you can always look at the other posts in the same category. If something I say does not feel right, you can reach out to me with questions or suggestions via LinkedIn or Email.

How to Install Ansible on Linux

Ansible is agentless. That means you do not need to install Ansible on the managed nodes to have Ansible push tasks to them. So only the control node needs to have Ansible installed on it. But you will need Python and SSH installed and configured on the managed nodes.

You have multiple ways to install Ansible on your Linux workstation but I will be using the method via the Linux package manager.

How to locate python?

which python3 

# /usr/bin/python3

How to locate SSH?

sudo systemctl status sshd

The ssh daemon must be enabled, active, and running.
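If it is not, here is a hedged one-liner to start and enable it in one step:

sudo systemctl enable --now sshd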

Update your system:

sudo dnf update -y

Install Ansible using the package manager

sudo dnf install ansible -y

Ansible keeps its main configuration files in /etc/ansible. There you should find the files ansible.cfg and hosts.

Run ansible --version to get details about your Ansible installation. The command will also tell you the location of your Ansible default configuration file.

If you install Ansible using the Linux package manager, you should have the config file generated and set in Ansible. In case you are missing the ansible.cfg file in your installation, you can create the file /etc/ansible/ansible.cfg. There are many ways to set the Ansible configuration file but I am going to stick with the one generated by default during the installation.

Ansible Inventory

Ansible inventory contains the list of hosts you want to manage. By default the hosts file contains the list of nodes, but you can customize the inventory location inside ansible.cfg. In the hosts file, you can put the nodes into groups like:

[web]
172.16.10.10
172.16.10.12

[db]
172.16.20.22

If you're able to run ansible --version without issue and locate the ansible configuration folder (/etc/ansible), you are good to do awesome things with ansible. In the next posts, we are going to go deeper into the basics of ansible.

So, stay tuned.

Linux: Troubleshooting Performance Issues

CPU Issues

High CPU usage

  • What process is using the CPU?
  • Use top or htop to see what process is using the CPU
  • Optimize the code or limit the number of processes running

High load average

This issue occurs when the number of processes waiting to be executed exceeds the system's processing capacity.

  • Check output of uptime or top
  • Is the system overloaded?
  • Are there too many processes running simultaneously?
  • Or is it one process that is causing the backlog?
  • Which query to optimize if it is a DB process?
  • Maybe offload some tasks to another server
  • Maybe swap CPU with one with better specs

High context switching

A context switch is when the CPU switches between different processes to allocate resources. A context refers to the state of a running process (Running, Waiting, or Stopped) that allows the CPU to resume the process later.

Too many context switches lead to inefficiency and higher CPU usage.

  • Check context switching with vmstat or pidstat (see the example after this list). Check the number of context switches per second.
  • How many context switches per second do we have
  • More than 500-1000 context switches per second is considered high and may indicate that the system is overwhelmed
  • Maybe reduce the number of running processes
  • Optimize applications to use fewer threads
  • Adjust system limits
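A hedged example of watching context switches with vmstat, sampling once per second for five samples:

# the "cs" column reports context switches per second
vmstat 1 5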

CPU bottleneck

This issue occurs when the CPU is the limiting factor in system performance.

  • The CPU usage is consistently high (above 80%)
  • Are tasks taking too long to process?
  • Load average exceeds the number of available CPU cores
  • Use top and htop to identify processes that are using an inordinate amount of CPU time
  • Optimize the processes using the CPU
  • Maybe a hardware upgrade is needed

Memory Issues

Swapping Issues

This issue occurs when the system runs out of physical memory (RAM) and starts using the hard drive as virtual memory

  • Is the system running out of swap space?
  • The swap file is typically located at /swapfile or on a dedicated swap partition
  • Identify swap space with swapon -s
  • Has the system performance degraded?
  • What does top and htop say about swap usage?
  • Is the usage more than 10%? That might be high
  • Monitor swap usage with free -h or vmstat
  • Maybe more physical RAM is needed
  • Or adjust the swappiness kernel value. ex: sudo sysctl vm.swappiness=10

Out of Memory (OOM) Errors

This issue happens when the system runs out of both physical and virtual memory

  • Are critical processes being unexpectedly terminated?
  • Check logs for OOM in system logs
  • Adjust application configurations to optimize memory usage
  • Increase RAM
  • Adjust swap space

Disk I/O Issues

This issue occurs when the system slows down due to delays in reading from or writing to storage devices.

High input/output wait time

This issue occurs when processes are waiting for data to be read from or written to the disk. It indicates that the disk is struggling to handle requests, causing delays in executing processes.

  • Monitor disk load with iotop or dstat; top shows I/O wait as x.x wa, and iostat gives %iowait (see the example after this list)
  • Optimize or spread out disk operations
  • Improve performance or upgrade to SSDs if needed
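A hedged example using iostat (from the sysstat package) with extended statistics, sampled every second, three times:

# -x adds extended statistics, including %iowait and per-device utilization
iostat -x 1 3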

High disk latency

Disk latency refers to the time it takes for the disk to respond to read or write requests. It can be caused by a high number of concurrent requests, a disk hardware issue, or inefficient disk configuration.

  • What does iostat output say about the disk latency?
  • Is the latency above the normal 10ms? Higher than 20ms may indicate a latency issue
  • Is the disk operating at its maximum throughput?
  • Maybe some drivers need to be upgraded
  • Maybe adjusting the disk (RAID) config can help reduce the latency
  • Maybe upgrade to faster disks

Slow remote storage response

This issue occurs when accessing data from remote storage systems such as NFS, SAN, or cloud storage

  • Long response time when accessing files?
  • Check network performance using ping or netperf to check for network issues
  • Optimize network settings or upgrade network hardware

Network Stability Issues

Packet drops

This issue occurs when data packets fail to reach their destination due to network congestion or hardware issues

  • ping -c 100 <DESTINATION> will give the percentage of packets dropped. Over 1% packet loss may be unacceptable
  • Packet loss can lead to slow performance, timeouts, and application errors
  • Check routers and switches for hardware issues and faulty cables
  • Check for NIC errors using ifconfig and ethtool
  • Adjust QoS settings or increase bandwidth

Random disconnects

This issue occurs when a network connection is unexpectedly terminated

  • Are users or services suddenly losing network access?
  • Look for connection reset or connection closed messages in logs using dmesg or journalctl; check interface errors with ifconfig
  • Check network stack configuration
  • Maybe the cable or some hardware in the path is faulty
  • Maybe a firewall is closing the connection
  • Check and adjust TCP settings if necessary

Random timeouts

This issue occurs when a connection fails to receive a response within the expected time frame.

  • Errors can be seen in logs
  • Getting connection timed out error when using curl to connect to a service?
  • Maybe the network is congested
  • Maybe there is a DNS issue
  • Maybe the server is overloaded
  • A timeout threshold of 5-10 seconds is typically acceptable
  • Use ping or traceroute to check for network congestion
  • Make sure DNS servers are correctly configured
  • Check server performance
  • Adjust TCP timeout if necessary

Network Performance Issues

High latency

High latency refers to the delay in the time it takes for data to travel from one point to another

  • Measure latency using ping or traceroute
  • A 100ms latency in a local network is considered high; 300ms over remote communication is also high
  • Check for network congestion
  • Identify hardware issues
  • Maybe the routing is misconfigured; optimize the network path
  • Maybe upgrading the network infrastructure could help

Jitter

Jitter is the variation of latency over time, which can cause problems in real-time applications.

  • A value above 30ms of fluctuation can cause noticeable issues
  • Detect Jitter issues with ping -i 0.2 <DESTINATION>
  • Check for network congestion or hardware issues
  • Implement QoS to prioritize relevant traffic

Slow response time

This issue occurs when the network takes too long to respond to requests.

This could be due to:

  • High latency
  • Congestion
  • Overloaded servers
  • Misconfigured applications

  • Use curl or wget to measure response time and identify bottlenecks in the network or a server

  • Check server load
  • Optimize application code
  • Check for server resources
  • Review network configurations

Low throughput

This issue occurs when the network is unable to transmit data at a high enough rate

  • Identify low throughput using iperf (see the example after this list); anything below 80% of the expected bandwidth is considered low throughput
  • Check for network congestion
  • Check for faulty cables
  • Check for incorrect settings
  • Maybe switch to a high bandwidth network
  • Maybe reduce unnecessary traffic
  • Optimize network routes
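A hedged throughput check with iperf3, assuming we control both ends of the path; the server address is a placeholder:

# on the server
iperf3 -s

# on the client (replace with your server's address)
iperf3 -c 192.168.10.50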

System Responsiveness Issues

Slow application response

This issue occurs when an application takes longer than expected to react to user inputs

  • Use top and htop to identify application resource consumption
  • Maybe the application code needs to be optimized
  • Increase system resources
  • Check disk I/O for overload
  • Check for unnecessary background processes

Sluggish terminal behavior

It happens when commands in the terminal are delayed. The system takes an unusually long time to execute commands.

  • Use top or iotop to check for system resource usage
  • Optimize processes running on the system
  • Cleanup system resources
  • Add more RAM or CPU cores

Slow startup

The system is taking an unusually long time to boot up

  • See which services take longer than expected using systemd-analyze
  • Maybe too many services are configured to start at the same time
  • Maybe one of the startup services is misconfigured?
  • Delay or disable non essential services from starting at boot time using systemctl
  • Optimize the boot sequence

System unresponsiveness

This issue occurs when the system becomes completely unresponsive

  • Is the system not accepting new input?
  • Are applications no longer responding?
  • Use dmesg or journalctl to identify what caused a kernel panic
  • Identify runaway processes using top and htop
  • Maybe upgrade the RAM or add more CPU cores

Process Management Issues

Blocked processes

This issue occurs when a process is unable to proceed due to waiting on resources or system locks

  • Are commands or applications stuck?
  • ps and top show processes in the D (uninterruptible sleep) state
  • Use lsof to check which files a process is waiting on
  • Use strace to trace system calls and signals
  • Is a process repeatedly stuck or blocked? Maybe due to resource contention
  • Optimize disk I/O
  • Maybe add more memory
  • Investigate dependency issues between processes

Exceeding baselines

This happens when processes consume more resources than expected

  • Notice high CPU usage
  • Unusual memory consumption
  • Excessive disk activity
  • Use top, htop, or pidstat to identify this issue
  • Optimize application resource usage
  • Maybe configure system resource limits with ulimit

High failed log-in attempts

This issue often signals attempted unauthorized access or brute force attacks

  • Maybe a brute force attack?
  • Unauthorized access attempts?
  • System compromise?
  • What are logs in /var/log/auth.log saying?
  • Check journalctl to identify failed login attempts (see the example after this list)
  • 5-10 login attempts from a single IP address within a short time may be a red flag for a brute-force attack
  • Implement fail2ban to block abusive IPs
  • Enforce strong password policy
  • Use MFA
  • Limit access with firewall or IP allowlist
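A hedged example of checking for failed SSH logins; the systemd unit may be named ssh or sshd depending on the distribution, and the exact log wording can vary:

# failed password attempts recorded by the SSH daemon
sudo journalctl -u sshd | grep -i "failed password"

# recent bad login attempts (reads /var/log/btmp)
sudo lastb | head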

Linux: Troubleshooting Security Issues

SELinux Issues

SELinux policy issues

SELinux Policy defines what actions users and applications can perform on a system based on security rules.

An overly restrictive or misconfigured policy can prevent the system from working properly.

avc: denied is a typical error message found in logs when dealing with SELinux policy issues.

  • Review logs with ausearch or sealert, as shown in the example after this list
  • Modify rules if necessary
  • Test policy in a safe environment before applying
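A hedged example of reviewing recent denials with ausearch, from the audit package:

# show recent AVC (access vector cache) denial records
sudo ausearch -m AVC -ts recent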

SELinux context issues

SELinux uses context to label every file, process, and resource on the system, determining what access is allowed.

Incorrect or misconfigured labels can prevent applications from accessing the resources they need to function

  • Use ls -Z for files and ps -Z for processes to look for SELinux context issues
  • Does the file or process have incorrect context?
  • Restore the context with sudo restorecon -v <FILE PATH>
  • Running restorecon regularly on key directories helps avoid repeated context mislabeling issues

SELinux boolean issues

SELinux booleans allow adjustment of certain security settings without modifying the underlying policy.

An incorrectly set boolean can cause certain services or applications to malfunction

  • Check booleans with getsebool
  • Are certain booleans incorrectly set?
  • Toggle booleans with setsebool. ex: setsebool -P httpd_can_sendmail 1
  • Test modification and document changes

File and Directory Permission Issues

File attributes

File attributes control certain behaviors and restrictions on files and directories, which go beyond the regular rwx permissions.

  • Check file attributes with lsattr. i=immutable, a=append-only
  • Remove incorrect attribute with chattr. ex: chattr -i <FILE PATH>
  • Verify file access and document changes

Access Control Lists (ACLs)

ACLs provide more fine-grained control over who can access a file or directory and what actions can be performed.

  • Check if a file is using ACLs with getfacl
  • Adjust the ACLs with setfacl. ex: give read-only access to user tom with setfacl -m u:tom:r <FILE PATH>
  • Verify proper access and document changes

Access Issues

Account access issues

Most common issue

  • Are the credentials incorrect?
  • Maybe the account is locked or disabled
  • Check system logs for messages
  • Check if account is locked with sudo passwd -S tom
  • Unlock account with sudo passwd -u tom
  • Reset the user password with sudo passwd tom
  • Re-enable a disabled account with sudo usermod -e '' tom ('' means no account expiration date)

Remote access issues

Issues with VPN or SSH

  • Is the issue caused by network issues, misconfigurations, or firewall?
  • Is the SSH service running? check with sudo systemctl status sshd
  • Enable the SSH service with sudo systemctl start sshd && sudo systemctl enable sshd
  • Check firewall with sudo ufw status or sudo iptables -L
  • Problem still persists? Check routing and public key validity

Certificate issues

Common messages: SSL certificate expired, SSL handshake failure

  • Is the certificate expired?
  • Maybe the certificate chains are misconfigured
  • Maybe it is a CA issue
  • Check certificate issues with openssl s_client -connect mysite.com:443 (see the example after this list)
  • Renew the certificate if necessary
  • Ensure the full certificate chain is correctly installed
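A hedged sketch for extracting just the validity dates of the certificate a server presents; mysite.com is a placeholder:

# print the notBefore/notAfter dates of the served certificate
openssl s_client -connect mysite.com:443 -servername mysite.com </dev/null 2>/dev/null | openssl x509 -noout -dates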

Configuration Issues

Exposed or misconfigured services

This issue occurs when system services are either left open to the public or configured incorrectly.

  • Does the service have proper security settings? The DB should not be accessible from the internet
  • Review security logs
  • Use tools like nmap to scan open ports
  • Configure the firewall to restrict access to trusted IPs
  • Disable unused services
  • Ensure critical services are only accessible when necessary

Misconfigured package repositories

This issue prevents the system from accessing the correct software sources. It prevents software updates and installations.

  • What errors show up when running sudo apt update or sudo dnf update?
  • Check repository configuration files: /etc/apt/sources.list on Debian-based systems or /etc/yum.repos.d/ on RHEL-based systems
  • Edit the repository URLs if necessary

Vulnerabilities

Vulnerabilities are weaknesses or flaws in the system that can be exploited by attackers to compromise security.

Unpatched vulnerable system

  • Do I have the latest security patches?
  • Use vulnerability scanners to detect security issues
  • Regularly apply updates with sudo apt update && sudo apt upgrade on Debian or sudo dnf update on RHEL.

The use of obsolete or insecure protocols and ciphers

  • Is the system using secure ciphers for data and communication protection?
  • Are insecure ciphers and protocols disabled on the system? SSLv3 is vulnerable to the POODLE attack; RC4 is vulnerable to RC4 bias attacks
  • Check the protocols and ciphers in use in sshd_config for SSH and apache2.conf for Apache (see the example after this list)
  • Disable outdated protocols
  • Remove weak ciphers from the configuration files
  • Use strong ciphers like AES and protocols like TLS 1.2 or 1.3
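A hedged way to review the effective SSH crypto settings on a server (requires root):

# print the effective sshd configuration and keep the crypto-related lines
sudo sshd -T | grep -Ei '^(ciphers|macs|kexalgorithms)'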

Cipher negotiation issues

This issue occurs when there is a failure in the negotiation or encryption methods between a client and a server.

Review connection logs to confirm both server and client are using strong encryption methods.