Creating a Highly-Available Virtualization Cluster in CentOS 5.5

This guide covers basic installation and configuration of the Red Hat Cluster Suite (RHCS), as well as creating shared storage with GFS2 and configuring your virtual machines as highly-available services.

Contents:
Why use HA?
Preparation
KVM Installation
Installing the Red Hat Cluster Suite
Using Conga for Initial Cluster Creation
Core Cluster Configuration: Fencing
Creating the GFS2 Shared Volume
Mounting the Filesystem
Creating a Virtual Machine
Configuring Your VM as a Clustered Service
Reference

Why use HA?

Here are some of the benefits of running your virtual machines as HA clustered services.

Adds convenience and robustness to VM hosting
Automation leaves less room for human error, and allows you to do more with your time.

Automatic VM relocation & recovery
No human intervention required. If your VM host machine goes down, every VM previously running on that machine is automatically relocated to a new node. This allows you to run more VMs, more reliably.

Keeps VM definitions in sync
If you add more RAM to your VM’s configuration on one node, it will stay consistent across every other node too. You won’t have to worry about updating VM settings on the other machines.

Protects data by only allowing a VM to start up in one location at a time
Accidentally starting a VM on more than one machine would almost definitely result in data loss and corruption. As a clustered resource, you won’t have to worry about this happening to your VM. This allows you to safely run many VMs without having to keep track of what’s running where.

Offers true high-availability
The heart of HA is ensuring that your applications remain available at all times. With HA, every VM is recovered automatically, moments after a failure is detected.

Now that you understand the benefits of using an HA cluster, let’s get building!

Preparation
In order to speed things up, I suggest you use a parallel SSH program like clusterssh to interact with all your nodes at once. This will allow you to configure each node identically, and is particularly useful for larger installations (3 to 16 nodes).

Clusterssh is available for CentOS 5.x and 6.x using the RPMforge repo. See this article for details on setting up the RPMforge repo.

yum install clusterssh
cssh node1 node2 node3 node4

You should also set aside a dedicated network interface for cluster communications. There are two reasons to do this:
1. Your cluster communicates continuously throughout normal operation. It is vital that this traffic is not interrupted or delayed by heavy network load or other network settings, otherwise your nodes might drop out of the cluster.
2. It’s an important security measure to keep cluster communication away from a public network interface, since this interface can be used to fence (shut off, reboot, or cut off) your nodes from the rest of the cluster. One common naming convention for the private interface is sketched below.
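As a sketch of that convention, give each node a private hostname bound to its cluster interface, listed identically in /etc/hosts on every node. The addresses here are hypothetical; the ‘-p’ suffix matches the node names you’ll see in the clustat output later in this guide.

# /etc/hosts -- identical on every node; cluster traffic stays on the private subnet
10.10.10.1   node01-p
10.10.10.2   node02-p
10.10.10.3   node03-p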

KVM Installation
SSH into every node and complete this section across all nodes.

Check for CPU virtualization support. This should return some highlighted output; if it returns nothing, your CPU lacks hardware virtualization support or it’s disabled in the BIOS.

egrep '(vmx|svm)' --color=always /proc/cpuinfo

If the above command returned successfully, continue with package installation.

yum groupinstall Clustering
yum install kvm kmod-kvm qemu libvirt python-virtinst qspice-libs virt-viewer virt-manager

Configure a network bridge for KVM. This is required if you want your virtual machines to be reachable from hosts other than the VM host itself, which any real virtual server will need.

cd /etc/sysconfig/network-scripts/
cp ifcfg-eth0 ifcfg-br0

Edit ifcfg-br0, the bridge interface. Change DEVICE= to br0, and set TYPE=Bridge (with an uppercase ‘B’!)

Editing /etc/sysconfig/network-scripts/ifcfg-br0:

DEVICE=br0
TYPE=Bridge
# IP information stays the same
# remove HWADDR

Bridge the two interfaces together

echo "BRIDGE=br0" >> ifcfg-eth0

Edit ifcfg-eth0 and remove the IP address, if any.
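For reference, here’s roughly what the finished pair might look like, assuming a hypothetical static address of 192.168.1.10 (your addresses and MAC will differ):

# /etc/sysconfig/network-scripts/ifcfg-eth0 -- physical NIC, enslaved to the bridge
DEVICE=eth0
HWADDR=00:16:3E:XX:XX:XX
ONBOOT=yes
BRIDGE=br0

# /etc/sysconfig/network-scripts/ifcfg-br0 -- the bridge carries the IP
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
ONBOOT=yes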

Enable forwarding of bridged traffic in iptables
Edit /etc/sysconfig/iptables and add this near the top (before the majority of the INPUT statements).

-A FORWARD -m physdev --physdev-is-bridged -j ACCEPT

Disable iptables on the bridge. This will keep the host machine from applying its iptables rules to virtual machines.

cat >> /etc/sysctl.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
EOF
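To apply these settings without rebooting, you can load them immediately (the bridge module must already be loaded, which it is once br0 is up):

sysctl -p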

Otherwise, these settings take effect at the next boot (the next section includes a reboot anyway).

Installing the Red Hat Cluster Suite
This section is to be done across all nodes, unless specified otherwise.

Configure IPtables to allow access to all cluster-related ports. You can either copy the config below, or allow all traffic on your private cluster interfaces.

vim /etc/sysconfig/iptables

# Allow all on cluster interface... assuming yours is eth1
-A RH-Firewall-1-INPUT -i eth1 -j ACCEPT

Or, specify the ports individually. These ports are documented in Red Hat’s RHCS Cluster Administration Guide.

vim /etc/sysconfig/iptables

#------------------------------------------------------------------------------------#
# RHCS ports                                                                          #
#------------------------------------------------------------------------------------#
# cman
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 5404:5405 -j ACCEPT
# ricci
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 11111 -j ACCEPT
# luci web interface
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 8084 -j ACCEPT
# modclusterd
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 16851 -j ACCEPT
# dlm (Distributed Lock Manager)
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 21064 -j ACCEPT
# ccsd
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 50006 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 50008:50009 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 50007 -j ACCEPT
#------------------------------------------------------------------------------------#

1. Start ricci on all machines

service ricci start
chkconfig ricci on

2. Select a computer to host luci.
This is the web interface that runs on the machine of your choice. The machine doesn’t need to be a part of the cluster, as long as it’s on the same network.

yum install luci
chkconfig luci on

3. Initialize the luci server.
This will prompt you to create an admin password for the web interface.

luci_admin init

4. Start the luci service

service luci restart
Shutting down luci:                                        [  OK  ]
Starting luci: generating https SSL certificates...  done
                                                           [  OK  ]

Please, point your web browser to https://localhost:8084 to access luci

Using Conga for Initial Cluster Creation

This is my favorite way to start a Red Hat cluster, since it always generates a valid XML config. Creating the config from plain XML is perfectly fine too, if you’re familiar with the syntax. The problem I’ve had with that in the past is that I’d reuse XML that worked on one system, only to find the syntax was invalid on the new one.

This is why I start an RHCS cluster using Conga: to create a reliably valid base config that is ready for customization further down the line.

Also, pushing buttons is easy.

The Conga web interface is one of the few areas where I actually enjoy using a GUI: it’s a simple, straightforward, and powerful tool.

1. Log into Conga using the username/password you created previously.
https://localhost:8084
2. Under the Cluster tab, click Create a New Cluster.
3. Enter your nodes’ information and check the box for Enable Shared Storage if not already checked.
4. Check the box to Reboot nodes before joining cluster, then submit.

At this point, all necessary packages will be downloaded and installed on your new nodes. Then the cluster configuration will be propagated and your cluster should come online.

Core Cluster Configuration: Fencing
Fencing is a vital part of clustering which helps maintain data integrity by ensuring that out-of-sync, misbehaving nodes are removed from the cluster before they can do damage.

This is one of the first things you’ll want to configure if you want to avoid trouble. A node without a configured fence device can sometimes hang the entire cluster while the other nodes wait for it to be fenced (which will be a very, very long time if you haven’t configured fencing at all).

Fencing generally requires specialized hardware, like an IPMI port on a server, or a switched PDU (basically a power strip that you can control via the network). Whichever fencing hardware you choose, it has to be able to physically cut the node off from network/disk/power, to prevent the rogue node from doing damage.

In this guide, we’ll be using IPMI and SAN-based fencing. (Yes, you can use more than one fence device.)

How fencing works in RHCS
Fencing is configured either as a shared resource, or on a per-node basis. An example of a shared fence device would be a single Qlogic SanBox switch to which all nodes are attached.

The SanBox2 fence agent would then control which nodes have access to the shared storage.

In contrast, an example of a per-node fence device would be IPMI, which uses the LOM (lights-out management port) of individual servers.

If you decide to use IPMI fencing in your cluster, be sure to shut off the ACPI Soft-Off feature. This ensures that a fenced node powers off immediately, rather than attempting a more polite ACPI shutdown.

chkconfig --level 2345 acpid off
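Note that chkconfig only affects future boots; if acpid is already running, stop it now as well:

service acpid stop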

Creating a Shared Fence Device in Conga
If you’re not using a shared fence device, skip to the next section for per-node configuration.

1. Log into the Conga web interface. Click your cluster’s name -> Shared Fence Devices -> Add a Fence Device.
2. After clicking Add a Fence Device, you’ll be given the option to Add a Sharable Fence Device.
3. On the Add a Sharable Fence Device page, click the drop-down box under Fencing Type and select the type of fence device to configure.

Adding a fence device to cluster nodes
1. In the Conga interface, go to Clusters -> Cluster List -> Choose a cluster to administer.
2. Click a link for a node.
3. At the bottom of the page, under Main Fencing Method, click Add a fence device to this level.
4. Select the shared SanBox2 entry created earlier. Enter the switch port to which this node is connected. The authentication information should already be filled out.

If using IPMI instead, select ‘IPMI’ from the menu, and enter the IP, username and password.
5. Click Update main fence properties and wait for the change to take effect.
6. Repeat on each node.

Additional configuration for IPMI fencing
If you chose IPMI rather than SanBox2 as your fencing method, additional configuration is required.

Install the IPMI software on all nodes.

yum install ipmitool
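Before relying on IPMI fencing, it’s worth confirming by hand that each node’s management interface responds. A hypothetical check (substitute your own BMC address and credentials):

# query the power state over the network; -I lanplus selects IPMI v2.0
ipmitool -I lanplus -H 10.0.1.101 -U admin -P secret chassis power status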

Testing
Whichever fencing type you use, test it. Run this command on each node, one at a time, allowing the node to recover and rejoin the cluster before proceeding.

fence_node <nodename>

Newbie SAN admin section…
Personally, my first cluster was also my first experience with SAN administration. This is likely not the case for you. But if you’re new to SAN administration, and are using a SAN in your cluster, this section may pertain to you.

Each node must have equal, full permission to the SAN volume. Here is one way to do that:

1. Find your nodes’ WWN. This is the identifier that will be used by the fibre switch.

cat /sys/class/fc_host/host6/port_name
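The hostN number varies from system to system; to print the WWN of every FC HBA at once:

for f in /sys/class/fc_host/host*/port_name; do
    echo "$f: $(cat $f)"
done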

2. Look for the above number in your fibre switch, and add that WWN to a zone containing your hardware RAID.
3. In your hardware RAID’s console, grant write permission for those WWNs to the necessary volumes. Each node must have equal access.

Start cluster services
After you have fencing configured, it’s safe to start up the cluster. Start cman first and rgmanager last:

service cman start
service clvmd start
service rgmanager start
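If you want the cluster stack to come up on its own at boot as well (matching what we did for ricci earlier), enable the init scripts:

chkconfig cman on
chkconfig clvmd on
chkconfig rgmanager on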

Creating the GFS2 Shared Volume
This section is done from a single node, unless otherwise specified.

1. Locate the partition or SAN volume that you want to use. This can be done by looking through dmesg for the device name associated with your raid volume name. (If you’ve just created the SAN volume, and haven’t rebooted since, you’ll need to run /usr/bin/rescan-scsi-bus.sh first. Some larger volumes may still require a reboot before the nodes will see them.)

2. Format the partition/volume.
If the volume is less than 2.2TB in size, you can use fdisk to create the partition.

fdisk /dev/sdj

If the volume is larger than 2.2TB, you’ll have to use parted and a GPT label. The volume in the example below is 4110GB (the parted sizes here are in megabytes):

parted /dev/sdj
(parted) mklabel gpt
(parted) mkpart primary ext2 0 4100000
(parted) set 1 lvm on
(parted) quit

3. Create a clustered LVM Volume Group for your nodes to share.

pvcreate /dev/sdj1
vgcreate --clustered y shared_test_vol /dev/sdj1

4. Sync each node’s filesystem tables and scan for the new volume group. Run this on every node:

partprobe; vgscan

5. Create the Logical Volume.

lvcreate -L 200G -n vmspace shared_test_vol

6. Create the GFS2 filesystem. The syntax for this is:
mkfs.gfs2 -p lock_dlm -t ClusterName:FSName -j NumberJournals BlockDevice

ClusterName must match the cluster name defined in cluster.conf, and NumberJournals should be at least one per node that will mount the filesystem.

mkfs.gfs2 -p lock_dlm -t mytestcluster:my_neato_shared_fs -j 3 /dev/shared_test_vol/vmspace
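The journal count can be grown later without reformatting: if you add a node, gfs2_jadd can add journals to the mounted filesystem. For example, once it’s mounted (we mount it at /var/lib/libvirt/images in the next section):

gfs2_jadd -j 1 /var/lib/libvirt/images    # add one more journal to the mounted GFS2 fs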

Special Considerations when Mounting GFS2 File Systems
(paraphrased/condensed from Redhat documentation)

GFS2 performs most efficiently when written to by a single node at a time. To prevent clustered reads from turning into unnecessary writes, always mount the file system with the noatime option. Doing so will speed up performance since the access times of each read will not be written to disk.

An entry in /etc/fstab is also very important. Without an entry in fstab, the GFS volume will not be known to the system when file systems are unmounted at system shutdown. As a result, the GFS2 script will not unmount the GFS2 file system, and the cluster node will hang.

Mount the filesystem on each cluster node

mount /dev/shared_test_vol/vmspace /var/lib/libvirt/images

vim /etc/fstab

/dev/shared_test_vol/vmspace /var/lib/libvirt/images gfs2 noatime 0 0
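At boot, GFS2 entries in fstab are mounted by the gfs2 init script (the same script mentioned above), which runs after the cluster services start; make sure it’s enabled:

chkconfig gfs2 on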

Creating a virtual machine to use in the cluster

1. Type virt-manager and use the GUI to create a virtual machine of your choice.
Make sure to create the disk image on the shared volume, now mounted at /var/lib/libvirt/images/ (this is the default image location).
2. Once you have the VM running, save the config in the shared storage.

mkdir /var/lib/libvirt/images/xml_defs
cd /var/lib/libvirt/images/xml_defs
virsh dumpxml my_vm_name > my_vm_name.xml

3. SSH into each machine and define the VM.

virsh define /var/lib/libvirt/images/xml_defs/my_vm_name.xml
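With only a few nodes this is quick by hand (or with cssh from the Preparation section), but a simple loop from one node works too; the node names here are hypothetical:

for n in node1 node2 node3; do
    ssh $n virsh define /var/lib/libvirt/images/xml_defs/my_vm_name.xml
done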

The VM is now able to migrate freely between nodes, though it isn’t managed by the cluster infrastructure yet.

Configuring Your VM as a Clustered Service

Making your virtual machines into clustered services is the best way to ensure they will always be accessible. This way, if the node running your VM shuts off unexpectedly, your VM will be recovered and relocated to another machine.

Creating a VM service also means you can control your VM through the cluster’s service manager. After this point, you won’t be able to use virt-manager to power on/off or migrate your VM. But, there will be similar commands to control your VM through rgmanager.

1. Take note of your Virtual Machine’s name, as seen by virsh list.

[root@node01 ~]# virsh list --all
 Id Name                 State
----------------------------------
 1  mytestVM             running

2. Dump the XML definition of the new VM into the xml_defs directory. Give this file the exact name seen in ‘virsh list’, appending .xml to the name.

virsh dumpxml mytestVM > /var/lib/libvirt/images/xml_defs/mytestVM.xml

3. Make a copy of the running cluster config for backup.

cd /etc/cluster 
mkdir backups
cp cluster.conf ./backups/cluster.conf-`date +%Y_%m_%d`

4. Edit the cluster.conf.
Add a line like this in the section with the other services. It should be placed between the <rm> </rm> tags.

<vm name="mytestVM" path="/var/lib/libvirt/images/xml_defs/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>

The meaning of that XML line is as follows:

name="mytestVM"             # virsh domain name
path="..."                  # path where the XML definition can be found. Shared between all nodes.
autostart="0"               # don't autostart this VM on any machine... this saves boot time
exclusive="0"               # this service is allowed to run alongside other services (VMs)
recovery="restart"          # restart the VM when it fails
max_restarts="2"            # allow two restarts before giving up and relocating the VM
restart_expire_time="600"   # a restart is forgotten after 600 seconds (10 minutes), so more than
                            # two failures in any 10-minute window triggers relocation

5. Increment the ‘config_version’ number at the top of cluster.conf.

<cluster alias="MyClusterName" config_version="5" name="Kog">

6. Validate and propagate the config to the other nodes:

xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng cluster.conf
ccs_tool update /etc/cluster/cluster.conf

You should now be able to view your VM service with clustat.


[root@node02-p ~]# clustat
Cluster Status for MyNeatoTestCluster @ Sat Aug  6 14:32:20 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node01-p                                   1 Online, rgmanager
 node02-p                                   2 Online, Local, rgmanager
 node03-p                                   3 Online, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 vm:testbox1                    node01-p                       started
 vm:mytestVM                    node02-p                       started
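Because the service was added with autostart="0", it won’t start on its own. From this point forward, control the VM through rgmanager instead of virt-manager:

clusvcadm -e vm:mytestVM                # enable (start) the VM service
clusvcadm -M vm:mytestVM -m node03-p    # optionally, live-migrate it to another member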

Reference
Here are some commands that may be useful in your cluster administration.

--- RHCS commands ---
ccs_tool update /etc/cluster/cluster.conf      # (update/propagate cluster.conf to all nodes)
clustat                                        # (see cluster status)
clusvcadm -e <service>                         # (enable/start a service)
clusvcadm -e <group> -m <member>               # (enable group/service on member)
clusvcadm -d <service>                         # (disable/stop service)
clusvcadm -M <vm:service> -m <member>          # (Migrate a VM service to another member)

--- LVM commands ---
vgs                                            # show all LVM/cLVM Volume Groups
lvs                                            # show all LVM/cLVM Logical Volumes
pvs                                            # show all LVM/cLVM Physical Volumes (disk/partitions) 

--- KVM commands ---
virsh list --all
virsh dumpxml <domain>
virsh define <domain>
virsh migrate --live <domain> qemu+ssh://root@<target-host>/system
