Part 7. High availability

Clustering under Linux

High availability is a term often used in IT, in connection with a system architecture or a service, to designate the fact that this architecture or service has a suitable rate of availability. ~ Wikipedia

This availability is a measure of performance, expressed as a percentage, obtained from the ratio Operating time / Total desired operating time.

Rate        Annual downtime
90%         876 hours
95%         438 hours
99%         87 hours, 36 minutes
99.9%       8 hours, 45 minutes, 36 seconds
99.99%      52 minutes, 33 seconds
99.999%     5 minutes, 15 seconds
99.9999%    31.68 seconds
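
For example, the 99.9% row can be checked quickly, assuming a year of 365 days (8,760 hours): (1 - 0.999) × 8760 = 8.76 hours, that is 8 hours, 45 minutes and 36 seconds.

$ echo "(1 - 0.999) * 8760" | bc -l
8.760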

"High Availability" (HA) refers to all the measures taken to guarantee the highest possible availability of a service. In other words: its correct operation 24 hours a day.

Overview

A cluster is a "computer cluster": a group of two or more machines called nodes.

A cluster allows:

  • distributed computing by using the computing power of all the nodes
  • high availability: service continuity and automatic service failover in the event of a node failure

Types of services

  • Active/passive services

    Installing a cluster with two active/passive nodes by using Pacemaker and DRBD is a low-cost solution for many situations requiring a high-availability system.

  • N+1 services

    With multiple nodes, Pacemaker can reduce hardware costs by allowing several active/passive clusters to combine and share a backup node.

  • N TO N services

    With shared storage, every node can potentially be used for fault tolerance. Pacemaker can also run multiple copies of services to spread the workload.

  • Remote site services

    Pacemaker includes enhancements to simplify the creation of multisite clusters.

VIP

The VIP is a virtual IP address assigned to an Active/Passive cluster. Assign it to the cluster node that is active. If a service failure occurs, the VIP is deactivated on the failed node and activated on the node taking over. This is known as failover.

Clients always address the cluster using the VIP, making failovers of the active server transparent to them.

Split-brain

Split-brain is the main risk a cluster may encounter. This condition occurs when several nodes in a cluster believe their neighbor is inactive. Each node then tries to start the redundant service, and several nodes end up providing the same service, which can lead to annoying side-effects (duplicate VIPs on the network, competing data access, and so on).

Possible technical solutions to avoid this problem are:

  • Separate public network traffic from cluster network traffic
  • Use network bonding for the cluster link (see the sketch below)
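
As an illustration of the second point, a dedicated cluster link can be built with network bonding. A minimal sketch using NetworkManager follows; the interface names (enp0s8, enp0s9) and the 10.0.0.0/24 subnet are assumptions to adapt to your environment:

# Active-backup bond built from two spare interfaces (names are examples)
sudo nmcli connection add type bond con-name bond-cluster ifname bond0 bond.options "mode=active-backup,miimon=100"
sudo nmcli connection add type ethernet con-name bond-cluster-port1 ifname enp0s8 master bond0
sudo nmcli connection add type ethernet con-name bond-cluster-port2 ifname enp0s9 master bond0
# Dedicated address for the cluster (corosync) traffic
sudo nmcli connection modify bond-cluster ipv4.method manual ipv4.addresses 10.0.0.10/24
sudo nmcli connection up bond-cluster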

Pacemaker (PCS)

In this chapter, you will learn about Pacemaker, a clustering solution.


Objectives: In this chapter, you will learn how to:

✔ install and configure a Pacemaker cluster;
✔ administer a Pacemaker cluster.

🏁 clustering, ha, high availability, pcs, pacemaker

Knowledge: ⭐ ⭐ ⭐ Complexity: ⭐ ⭐

Reading time: 20 minutes


Generalities

Pacemaker is the software part of the cluster that manages its resources (VIPs, services, data). It is responsible for starting, stopping and supervising cluster resources, thereby guaranteeing the high availability of the services it hosts.

Pacemaker uses the message layer provided by corosync (default) or Heartbeat.

Pacemaker consists of 5 key components:

  • Cluster Information Base (CIB)
  • Cluster Resource Management daemon (CRMd)
  • Local Resource Management daemon (LRMd)
  • Policy Engine (PEngine or PE)
  • Fencing daemon (STONITHd)

The CIB represents the cluster configuration and the current state of all cluster resources. The contents of the CIB are automatically synchronized across the entire cluster and used by the PEngine to calculate how to achieve the ideal cluster state.

The list of instructions is then provided to the Designated Controller (DC). Pacemaker centralizes all cluster decisions by electing one of the CRMd instances as master.

The DC executes the PEngine's instructions in the required order, transmitting them either to the local LRMd or to the CRMd of the other nodes via Corosync or Heartbeat.

In some cases, it may be necessary to stop nodes to protect shared data or enable their recovery. Pacemaker comes with STONITHd for this purpose.
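
Once a cluster is running (the installation is covered later in this chapter), you can dump the CIB at any time to see the configuration and state that Pacemaker works with:

# Print the raw CIB (XML) of the running cluster
sudo pcs cluster cib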

Stonith

Stonith is a component of Pacemaker. It stands for Shoot-The-Other-Node-In-The-Head, a recommended practice for ensuring the isolation of the malfunctioning node as quickly as possible (shut down or at least disconnected from shared resources), thus avoiding data corruption.

An unresponsive node does not mean that it can no longer access data. The only way to ensure that a node is no longer accessing data before handing over to another node is to use STONITH, which will either shut down or restart the failed server.

STONITH also has a role to play if a clustered service is failing to shut down. In this case, Pacemaker uses STONITH to force the entire node to stop.

Quorum management

The quorum represents the minimum number of nodes in operation to validate a decision, such as deciding which backup node should take over when one of the nodes is in error. By default, Pacemaker requires more than half the nodes to be online.

When communication problems split a cluster into several groups of nodes, quorum prevents resources from starting up on more nodes than expected. A cluster is quorate when more than half of all nodes known to be online are in its group (active_nodes_group > active_total_nodes / 2).

The default decision when quorum is not reached is to disable all resources.

Case study:

  • On a two-node cluster, quorum can never be reached after a node failure; the loss of quorum must therefore be ignored, or the entire cluster will shut down.
  • If a 5-node cluster is split into 2 groups of 3 and 2 nodes, the 3-node group will have quorum and continue to manage resources.
  • If a 6-node cluster is split into 2 groups of 3 nodes, no group will have quorum. In this case, Pacemaker's default behavior is to stop all resources to avoid data corruption.
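
On a running cluster, you can check the quorum computation yourself. Both of the following commands report the expected votes, the votes currently present, and whether the partition is quorate:

sudo pcs quorum status
sudo corosync-quorumtool -s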

Cluster communication

Pacemaker uses either Corosync or Heartbeat (from the linux-ha project) for node-to-node communication and cluster management.

Corosync

The Corosync Cluster Engine is a messaging layer between cluster members that integrates additional functionalities for implementing high availability within applications. Corosync is derived from the OpenAIS project.

Nodes communicate in Client/Server mode over the UDP protocol.

It can manage clusters of more than 16 nodes in Active/Passive or Active/Active mode.

Heartbeat

Heartbeat technology is more limited than Corosync. It is not possible to create a cluster of more than 2 nodes, and its management rules are less sophisticated than those of its competitor.

Note

The choice of Pacemaker/Corosync seems more appropriate today, as it is the default choice for the Red Hat, Debian, and Ubuntu distributions.

Data management

The DRBD network RAID

DRBD is a block-device driver that enables the implementation of RAID 1 (mirroring) over the network.

DRBD can be useful when NAS or SAN technologies are not available, but data still needs to be synchronized.
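
As a sketch only, a DRBD resource replicating a spare partition between two nodes (the server1 and server2 used later in this chapter) could be declared as follows; the resource name, backing disk and port are assumptions to adapt:

# /etc/drbd.d/r0.res -- minimal example, identical on both nodes
resource r0 {
  device    /dev/drbd0;
  disk      /dev/sdb1;
  meta-disk internal;
  on server1 {
    address 192.168.1.10:7788;
  }
  on server2 {
    address 192.168.1.11:7788;
  }
}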

Installation

To install Pacemaker, first enable the highavailability repository:

sudo dnf config-manager --set-enabled highavailability

Some information about the pacemaker package:

$ dnf info pacemaker
Rocky Linux 9 - High Availability                                                                                                                                     289 kB/s | 250 kB     00:00
Available Packages
Name         : pacemaker
Version      : 2.1.7
Release      : 5.el9_4
Architecture : x86_64
Size         : 465 k
Source       : pacemaker-2.1.7-5.el9_4.src.rpm
Repository   : highavailability
Summary      : Scalable High-Availability cluster resource manager
URL          : https://www.clusterlabs.org/
License      : GPL-2.0-or-later AND LGPL-2.1-or-later
Description  : Pacemaker is an advanced, scalable High-Availability cluster resource
             : manager.
             :
             : It supports more than 16 node clusters with significant capabilities
             : for managing resources and dependencies.
             :
             : It will run scripts at initialization, when machines go up or down,
             : when related resources fail and can be configured to periodically check
             : resource health.
             :
             : Available rpmbuild rebuild options:
             :   --with(out) : cibsecrets hardening nls pre_release profiling
             :                 stonithd

Using the repoquery command, you can find out the dependencies of the pacemaker package:

$ repoquery --requires pacemaker
corosync >= 3.1.1
pacemaker-cli = 2.1.7-5.el9_4
resource-agents
systemd
...

The pacemaker installation will therefore automatically install corosync and a CLI interface for pacemaker.

Some information about the corosync package:

$ dnf info corosync
Available Packages
Name         : corosync
Version      : 3.1.8
Release      : 1.el9
Architecture : x86_64
Size         : 262 k
Source       : corosync-3.1.8-1.el9.src.rpm
Repository   : highavailability
Summary      : The Corosync Cluster Engine and Application Programming Interfaces
URL          : http://corosync.github.io/corosync/
License      : BSD
Description  : This package contains the Corosync Cluster Engine Executive, several default
             : APIs and libraries, default configuration files, and an init script.

Now install the required packages:

sudo dnf install pacemaker

Open your firewall if you have one:

sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --reload

Note

Do not start the services now, as they are not configured and will not work.

Cluster management

The pcs package provides cluster management tools. The pcs command is a command-line interface for managing the Pacemaker high-availability stack.

Cluster configuration could possibly be done by hand, but the pcs package makes managing (creating, configuring and troubleshooting) a cluster much easier!

Note

There are alternatives to pcs.

Install the package on all nodes and activate the daemon:

sudo dnf install pcs
sudo systemctl enable pcsd --now

The package installation creates a hacluster user with an empty password. To perform tasks such as synchronizing the corosync configuration files or rebooting remote nodes, assigning a password to this user is necessary.

hacluster:x:189:189:cluster user:/var/lib/pacemaker:/sbin/nologin

On all nodes, assign an identical password to the hacluster user:

echo "pwdhacluster" | sudo passwd --stdin hacluster

Note

Please replace "pwdhacluster" with a more secure password.

From any node, it is possible to authenticate as a hacluster user on all nodes, then use the pcs commands on them:

$ sudo pcs host auth server1 server2
Username: hacluster
Password:
server1: Authorized
server2: Authorized

From the node on which pcs authentication occurs, launch the cluster configuration:

$ sudo pcs cluster setup mycluster server1 server2
No addresses specified for host 'server1', using 'server1'
No addresses specified for host 'server2', using 'server2'
Destroying cluster on hosts: 'server1', 'server2'...
server2: Successfully destroyed cluster
server1: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'server1', 'server2'
server1: successful removal of the file 'pcsd settings'
server2: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'server1', 'server2'
server1: successful distribution of the file 'corosync authkey'
server1: successful distribution of the file 'pacemaker authkey'
server2: successful distribution of the file 'corosync authkey'
server2: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'server1', 'server2'
server1: successful distribution of the file 'corosync.conf'
server2: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.

Note

The pcs cluster setup command takes care of the quorum problem for two-node clusters. Such a cluster will therefore function correctly in the event of the failure of one of the two nodes. If you are manually configuring corosync or using another cluster management shell, you will need to configure corosync correctly yourself.

You can now start the cluster:

$ sudo pcs cluster start --all
server1: Starting Cluster...
server2: Starting Cluster...

Enable the cluster service to start on boot:

sudo pcs cluster enable --all

Check the service status:

$ sudo pcs status
Cluster name: mycluster

WARNINGS:
No stonith devices and stonith-enabled is not false

Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: server1 (version 2.1.7-5.el9_4-0f7f88312) - partition with quorum
  * Last updated: Mon Jul  8 17:50:14 2024 on server1
  * Last change:  Mon Jul  8 17:50:00 2024 by hacluster via hacluster on server1
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ server1 server2 ]

Full List of Resources:
  * No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Adding resources

Before you can configure the resources, you will need to deal with the warning message:

WARNINGS:
No stonith devices and stonith-enabled is not false

In this state, Pacemaker will refuse to start your new resources.

You have two choices:

  • disable stonith
  • configure it

First, you will disable stonith until you learn how to configure it:

sudo pcs property set stonith-enabled=false

Warning

Be careful not to leave stonith disabled in a production environment!
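
For reference, fencing for KVM virtual machines could look like the following sketch. It assumes the fence_xvm agent is installed on the nodes and that fence_virtd is already configured on the hypervisor, with its key distributed to /etc/cluster/fence_xvm.key:

# Example only -- requires a working fence_virtd setup on the KVM host
sudo pcs stonith create fence_vm fence_xvm key_file=/etc/cluster/fence_xvm.key
sudo pcs property set stonith-enabled=true
sudo pcs stonith status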

VIP configuration

The first resource you are going to create on your cluster is a VIP.

List the standard resources available with the pcs resource standards command:

$ pcs resource standards
lsb
ocf
service
systemd

This VIP corresponds to the IP address that clients will use to reach the future cluster services. You must assign it to one of the nodes. Then, if a failure occurs, the cluster will switch this resource from one node to another to ensure continuity of service.

pcs resource create myclusterVIP ocf:heartbeat:IPaddr2 ip=192.168.1.12 cidr_netmask=24 op monitor interval=30s

The ocf:heartbeat:IPaddr2 argument contains three fields that provide Pacemaker with:

  • the standard (here ocf)
  • the script namespace (here heartbeat)
  • the resource script name
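
Before choosing a script, you can list the agents available in a namespace and display the parameters that a given script accepts:

# Agents provided by the heartbeat provider of the ocf standard
sudo pcs resource agents ocf:heartbeat
# Parameters accepted by the IPaddr2 script (ip, cidr_netmask, nic, and so on)
sudo pcs resource describe ocf:heartbeat:IPaddr2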

The result is the addition of a virtual IP address to the list of managed resources:

$ sudo pcs status
Cluster name: mycluster

...
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  ...
  * 2 nodes configured
  * 1 resource instance configured

Full List of Resources:
  * myclusterVIP        (ocf:heartbeat:IPaddr2):         Started server1
...

In this case, the VIP is active on server1. You can verify this with the ip command:

$ ip add show dev enp0s3
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:df:29:09 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.10/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.12/24 brd 192.168.1.255 scope global secondary enp0s3
       valid_lft forever preferred_lft forever

Failover tests

From anywhere on the network, run the ping command against the VIP:

ping 192.168.1.12

Put the active node on standby:

sudo pcs node standby server1

Check that all pings succeed during the operation (no missing icmp_seq):

64 bytes from 192.168.1.12: icmp_seq=39 ttl=64 time=0.419 ms
64 bytes from 192.168.1.12: icmp_seq=40 ttl=64 time=0.043 ms
64 bytes from 192.168.1.12: icmp_seq=41 ttl=64 time=0.129 ms
64 bytes from 192.168.1.12: icmp_seq=42 ttl=64 time=0.074 ms
64 bytes from 192.168.1.12: icmp_seq=43 ttl=64 time=0.099 ms
64 bytes from 192.168.1.12: icmp_seq=44 ttl=64 time=0.044 ms
64 bytes from 192.168.1.12: icmp_seq=45 ttl=64 time=0.021 ms
64 bytes from 192.168.1.12: icmp_seq=46 ttl=64 time=0.058 ms

Check the cluster status:

$ sudo pcs status
Cluster name: mycluster
Cluster Summary:
...
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Node server1: standby
  * Online: [ server2 ]

Full List of Resources:
  * myclusterVIP        (ocf:heartbeat:IPaddr2):         Started server2

The VIP has moved to server2. Check with the ip add command as before.

Return server1 to the pool:

sudo pcs node unstandby server1

Note that once server1 is taken out of standby mode, the cluster returns to its normal state, but the resource is not transferred back to server1: it remains on server2.
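
If you want the resource to run on server1 again, you can move it back explicitly. Depending on the pcs version, the move may leave a location constraint behind, which pcs resource clear removes:

# Move the VIP back to server1, then drop any constraint created by the move
sudo pcs resource move myclusterVIP server1
sudo pcs resource clear myclusterVIP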

Service configuration

You will install the Apache service on both nodes of your cluster. This service is only started on the active node, and will switch nodes at the same time as the VIP if a failure of the active node occurs.

Refer to the apache chapter for detailed installation instructions.

You must install httpd on both nodes:

sudo dnf install -y httpd
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --reload

Warning

Do not start or enable the service yourself. Pacemaker will take care of it.

Create a default HTML page containing the server name:

echo "<html><body>Node $(hostname -f)</body></html>" | sudo tee "/var/www/html/index.html"

The Pacemaker resource agent will use the /server-status page (see apache chapter) to determine its health status. You must activate it by creating the file /etc/httpd/conf.d/status.conf on both servers:

sudo vim /etc/httpd/conf.d/status.conf
<Location /server-status>
    SetHandler server-status
    Require local
</Location>

To create a resource called "WebSite", use the apache script of the OCF standard in the heartbeat namespace.

sudo pcs resource create WebSite ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl="http://localhost/server-status" op monitor interval=1min

The cluster will check Apache's health every minute (op monitor interval=1min).

Finally, to ensure that the Apache service starts on the same node as the VIP address, you must add a constraint to the cluster:

sudo pcs constraint colocation add WebSite with myclusterVIP INFINITY

Configuring the Apache service to start after the VIP is also possible. This can be useful if Apache has VHost configurations to listen to the VIP address (Listen 192.168.1.12):

$ sudo pcs constraint order myclusterVIP then WebSite
Adding myclusterVIP WebSite (kind: Mandatory) (Options: first-action=start then-action=start)
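
You can review the constraints now defined on the cluster:

# Show the colocation and ordering constraints created above
sudo pcs constraint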

Testing the failover

You will perform a failover and test that your webserver is still available:

$ sudo pcs status
Cluster name: mycluster
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: server1 (version 2.1.7-5.el9_4-0f7f88312) - partition with quorum
  ...

Node List:
  * Online: [ server1 server2 ]

Full List of Resources:
  * myclusterVIP        (ocf:heartbeat:IPaddr2):         Started server1
  * WebSite     (ocf:heartbeat:apache):  Started server1

You are currently working on server1.

$ curl http://192.168.1.12/
<html><body>Node server1</body></html>

Simulate a failure on server1:

sudo pcs node standby server1
$ curl http://192.168.1.12/
<html><body>Node server2</body></html>

As you can see, your webservice is still working but on server2 now.

sudo pcs node unstandby server1

Note that the service was only interrupted for a few seconds while the VIP switched over and the services restarted.

Cluster troubleshooting

The pcs status command

The pcs status command provides information about the overall status of the cluster:

$ sudo pcs status
Cluster name: mycluster
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: server1 (version 2.1.7-5.el9_4-0f7f88312) - partition with quorum
  * Last updated: Tue Jul  9 12:25:42 2024 on server1
  * Last change:  Tue Jul  9 12:10:55 2024 by root via root on server1
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ server1 ]
  * OFFLINE: [ server2 ]

Full List of Resources:
  * myclusterVIP        (ocf:heartbeat:IPaddr2):         Started server1
  * WebSite     (ocf:heartbeat:apache):  Started server1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

As you can see, one of the two servers is offline.

The pcs status corosync command

The pcs status corosync command provides information about the status of corosync nodes:

$ sudo pcs status corosync

Membership information
----------------------
    Nodeid      Votes Name
         1          1 server1 (local)

and once server2 is back:

$ sudo pcs status corosync

Membership information
----------------------
    Nodeid      Votes Name
         1          1 server1 (local)
         2          1 server2

The crm_mon command

The crm_mon command returns cluster status information. Use the -1 option to display the cluster status once and exit.

$ sudo crm_mon -1
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: server1 (version 2.1.7-5.el9_4-0f7f88312) - partition with quorum
  * Last updated: Tue Jul  9 12:30:21 2024 on server1
  * Last change:  Tue Jul  9 12:10:55 2024 by root via root on server1
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ server1 server2 ]

Active Resources:
  * myclusterVIP        (ocf:heartbeat:IPaddr2):         Started server1
  * WebSite     (ocf:heartbeat:apache):  Started server1

The corosync-* commands

The corosync-cfgtool command checks that the configuration is correct and that communication with the cluster is working properly:

$ sudo corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
        addr    = 192.168.1.10
        status:
                nodeid:          1:     localhost
                nodeid:          2:     connected

The corosync-cmapctl command is a tool for accessing the object database. For example, you can use it to check the status of cluster member nodes:

$ sudo corosync-cmapctl  | grep members
runtime.members.1.config_version (u64) = 0
runtime.members.1.ip (str) = r(0) ip(192.168.1.10)
runtime.members.1.join_count (u32) = 1
runtime.members.1.status (str) = joined
runtime.members.2.config_version (u64) = 0
runtime.members.2.ip (str) = r(0) ip(192.168.1.11)
runtime.members.2.join_count (u32) = 2
runtime.members.2.status (str) = joined
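
When the commands above are not enough, the logs are the next place to look. On a systemd-based system such as Rocky Linux, a quick way to read them is:

# Recent pacemaker and corosync messages on the local node
sudo journalctl -u pacemaker -u corosync --since "10 minutes ago"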

Workshop

For this workshop, you will need two servers with Pacemaker services installed, configured, and secured, as described in the previous chapters.

You will configure a highly available Apache cluster.

Your two servers have the following IP addresses:

  • server1: 192.168.1.10
  • server2: 192.168.1.11

If you do not have a name resolution service, fill the /etc/hosts file with content like the following:

$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.1.10 server1 server1.rockylinux.lan
192.168.1.11 server2 server2.rockylinux.lan

You will use the VIP address of 192.168.1.12.

Task 1: Installation and configuration

Install Pacemaker. Remember to enable the highavailability repository.

On both nodes:

sudo dnf config-manager --set-enabled highavailability
sudo dnf install pacemaker pcs
sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --reload
sudo systemctl enable pcsd --now
echo "pwdhacluster" | sudo passwd --stdin hacluster

On server1:

$ sudo pcs host auth server1 server2
Username: hacluster
Password:
server1: Authorized
server2: Authorized
sudo pcs cluster setup mycluster server1 server2
sudo pcs cluster start --all
sudo pcs cluster enable --all
sudo pcs property set stonith-enabled=false

Task 2: Adding a VIP

The first resource you are going to create on your cluster is a VIP.

pcs resource create myclusterVIP ocf:heartbeat:IPaddr2 ip=192.168.1.12 cidr_netmask=24 op monitor interval=30s

Check the cluster status:

$ sudo pcs status
Cluster name: mycluster
Cluster Summary:
...
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Node server1: standby
  * Online: [ server2 ]

Full List of Resources:
  * myclusterVIP        (ocf:heartbeat:IPaddr2):         Started server2

Task 3: Installing the Apache server

Perform this installation on both nodes:

sudo dnf install -y httpd
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --reload
echo "<html><body>Node $(hostname -f)</body></html>" | sudo tee "/var/www/html/index.html"
sudo vim /etc/httpd/conf.d/status.conf
<Location /server-status>
    SetHandler server-status
    Require local
</Location>

Task 4: Adding the httpd resource

Only on server1, add the new resource to the cluster with the needed constraints:

sudo pcs resource create WebSite ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl="http://localhost/server-status" op monitor interval=1min
sudo pcs constraint colocation add WebSite with myclusterVIP INFINITY
sudo pcs constraint order myclusterVIP then WebSite

Task 5: Test your cluster

You will perform a failover and test that your webserver is still available:

$ sudo pcs status
Cluster name: mycluster
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: server1 (version 2.1.7-5.el9_4-0f7f88312) - partition with quorum
  ...

Node List:
  * Online: [ server1 server2 ]

Full List of Resources:
  * myclusterVIP        (ocf:heartbeat:IPaddr2):         Started server1
  * WebSite     (ocf:heartbeat:apache):  Started server1

You are currently working on server1.

$ curl http://192.168.1.12/
<html><body>Node server1</body></html>

Simulate a failure on server1:

sudo pcs node standby server1
$ curl http://192.168.1.12/
<html><body>Node server2</body></html>

As you can see, your webservice is still working but on server2 now.

sudo pcs node unstandby server1

Note that the service was only interrupted for a few seconds while the VIP switched over and the services restarted.

Check your knowledge

✔ Is the pcs command the only command to control a Pacemaker cluster?

✔ Which command returns the cluster state?

  • sudo pcs status
  • systemctl status pcs
  • sudo crm_mon -1
  • sudo pacemaker -t

Author: Antoine Le Morvan

Contributors: Steven Spencer