Creating a Highly-Available Virtualization Cluster in CentOS 5.5
This guide covers basic installation and configuration of the Red Hat Cluster Suite (RHCS), creating shared storage with GFS2, and configuring your virtual machines as highly-available services.
Contents:
Why use HA?
Preparation
KVM Installation
Installing the Red Hat Cluster Suite
Using Conga for Initial Cluster Creation
Core Cluster Configuration: Fencing
Creating the GFS2 Shared Volume
Mounting the Filesystem
Creating a Virtual Machine
Configuring Your VM as a Clustered Service
Reference
Why use HA?
Here are some of the benefits of running your virtual machines as HA clustered services.
Adds convenience and robustness to VM hosting
Automation leaves less room for human error, and allows you to do more with your time.
Automatic VM relocation & recovery
No human intervention required. If your VM host machine goes down, every VM previously running on that machine is automatically relocated to a new node. This allows you to run more VMs, more reliably.
Keeps VM definitions in sync
If you add more RAM to your VM’s configuration on one node, it will stay consistent across every other node too. You won’t have to worry about updating VM settings on the other machines.
Protects data by only allowing a VM to start up in one location at a time
Accidentally starting a VM on more than one machine would almost certainly result in data loss and corruption. As a clustered resource, you won’t have to worry about this happening to your VM. This allows you to safely run many VMs without having to keep track of what’s running where.
Offers true high-availability
The heart of HA is ensuring that your applications remain available at all times. With HA, failed VMs are recovered automatically, within moments of the failure being detected.
Now that you understand the benefits of using an HA cluster, let’s get building!
Preparation
In order to speed things up, I suggest you use a parallel SSH program like clusterssh to interact with all your nodes at once. This will allow you to configure each node identically, and is particularly useful for larger installations (3 to 16 nodes).
Clusterssh is available for CentOS 5.x and 6.x from the RPMforge repo. See the RPMforge project’s instructions for details on setting up the repo.
yum install clusterssh
cssh node1 node2 node3 node4
You should also set aside a dedicated network interface for cluster communications (a configuration sketch follows this list). There are two reasons to do this:
1. Your cluster will communicate continuously throughout normal operation. It is vital that these communications are not interrupted or delayed by heavy network traffic or other network settings; otherwise your nodes might drop out of the cluster.
2. It’s an important security measure to keep cluster communication off a public-facing interface, since this traffic can be used to fence (shut off, reboot, or cut off) your nodes from the rest of the cluster.
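As a rough sketch, the dedicated interface’s config might look like this (eth1 and the 10.10.10.0/24 subnet are assumptions; substitute your own private interface and addressing):
# /etc/sysconfig/network-scripts/ifcfg-eth1 -- hypothetical private cluster interface
DEVICE=eth1
BOOTPROTO=static
IPADDR=10.10.10.11
NETMASK=255.255.255.0
ONBOOT=yes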
KVM Installation
SSH into every node and complete this section across all nodes.
Check for CPU virtualization support. This should return some output.
egrep '(vmx|svm)' --color=always /proc/cpuinfo
If the above command returned successfully, continue with package installation.
yum groupinstall Clustering
yum install kvm kmod-kvm qemu libvirt python-virtinst qspice-libs virt-viewer virt-manager
Configure a network bridge for KVM. This is required if you want your machines to be accessible from hosts other than the VM host. (Required for running virtual servers)
cd /etc/sysconfig/network-scripts/
cp ifcfg-eth0 ifcfg-br0
Edit ifcfg-br0, the bridge interface. Change DEVICE= to br0, and set TYPE=Bridge (with an uppercase ‘B’!).
Editing /etc/sysconfig/network-scripts/ifcfg-br0:
DEVICE=br0
TYPE=Bridge
# IP information stays the same
# remove HWADDR
Bridge the two interfaces together
echo "BRIDGE=br0" >> ifcfg-eth0
Edit ifcfg-eth0 and remove the IP address, if any.
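For reference, here’s roughly how the two files might end up (the IP address is an example; yours will differ):
# ifcfg-eth0 -- the physical interface, now just a bridge member
DEVICE=eth0
ONBOOT=yes
BRIDGE=br0
# ifcfg-br0 -- the bridge now carries the IP configuration
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes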
Enable forwarding of bridged traffic in iptables
Edit /etc/sysconfig/iptables and add this near the top (before the majority of the INPUT statements).
-A FORWARD -m physdev --physdev-is-bridged -j ACCEPT
Disable iptables on the bridge. This will keep the host machine from applying its iptables rules to virtual machines.
cat >> /etc/sysctl.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
EOF
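If the bridge module is already loaded, you can apply the new settings immediately rather than waiting for a reboot:
sysctl -p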
Otherwise, reboot for these changes to take effect – or just wait, since the nodes will be rebooted during cluster creation in a later section anyway.
Installing the Red Hat Cluster Suite
This section is to be done across all nodes, unless specified otherwise.
Configure iptables to allow access to all cluster-related ports. You can either copy the config below, or allow all traffic on your private cluster interfaces.
vim /etc/sysconfig/iptables
# Allow all on cluster interface... assuming yours is eth1
-A RH-Firewall-1-INPUT -i eth1 -j ACCEPT
Or, specify the ports individually. These ports are documented in Red Hat’s RHCS Cluster Administration Guide.
vim /etc/sysconfig/iptables
#-------------------------#
#        RHCS ports       #
#-------------------------#
# cman
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 5404:5405 -j ACCEPT
# ricci
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 11111 -j ACCEPT
# luci web interface
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 8084 -j ACCEPT
# modclusterd
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 16851 -j ACCEPT
# dlm (Distributed Lock Manager)
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 21064 -j ACCEPT
# ccsd
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 50006 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 50008:50009 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 50007 -j ACCEPT
#-------------------------#
1. Start ricci on all machines
service ricci start
chkconfig ricci on
2. Select a computer to host luci.
This is the web interface that runs on the machine of your choice. The machine doesn’t need to be a part of the cluster, as long as it’s on the same network.
yum install luci
chkconfig luci on
3. Initialize the luci server.
This will prompt you to create an admin password for the web interface.
luci_admin init
4. Start the luci service
service luci restart
Shutting down luci:                                   [  OK  ]
Starting luci: generating https SSL certificates...  done
                                                      [  OK  ]
Please, point your web browser to https://localhost:8084 to access luci
Using Conga for Initial Cluster Creation
This is my favorite way to start a Red Hat cluster, since it will always generate a valid XML config. Creating the config from plain XML is also perfectly fine if you are familiar with the syntax, but the problem I’ve had with that in the past is reusing XML that worked on one system, only to find that the syntax was invalid on the new one.
This is why I start an RHCS cluster using Conga: to create a reliably valid base config that is ready for customization further down the line.
Also, pushing buttons is easy.
The Conga web interface is one of the few areas where I actually enjoy using a GUI – it’s a simple, straightforward, and powerful tool.
1. Log into Conga using the username/password you created previously.
https://localhost:8084
2. Under the Cluster tab, click Create a New Cluster.
3. Enter your nodes’ information and check the box for Enable Shared Storage if not already checked.
4. Check the box to Reboot nodes before joining cluster, then submit.
At this point, all necessary packages will be downloaded and installed on your new nodes. Then the cluster configuration will be propagated and your cluster should come online.
Core Cluster Configuration: Fencing
Fencing is a vital part of clustering which helps maintain data integrity by ensuring that out-of-sync, misbehaving nodes are removed from the cluster before they can do damage.
This is one of the first things you’ll want to configure if you want to avoid trouble. A node without a configured fence device can sometimes hang the entire cluster, as the other nodes wait for it to be fenced (which will be a very, very long time if you haven’t configured fencing at all).
Fencing generally requires specialized hardware, like an IPMI port on a server, or a switched PDU (basically a power strip that you can control via the network). Whichever fencing hardware you choose, it has to be able to physically cut the node off from network/disk/power, to prevent the rogue node from doing damage.
In this guide, we’ll be using IPMI and SAN-based fencing. (Yes, you can use more than one fence device.)
How fencing works in RHCS
Fencing is configured either as a shared resource, or on a per-node basis. An example of a shared fence device would be a single Qlogic SanBox switch to which all nodes are attached.
The SanBox2 fence agent would then control which nodes have access to the shared storage.
In contrast, an example of a per-node fence device would be IPMI, which uses the LOM (lights-out management port) of individual servers.
If you decide to use IPMI fencing in your cluster, be sure to shut off the ACPI Soft-Off feature. This ensures a fenced node turns off immediately, rather than attempting a more polite ACPI shutdown (which a hung node may never complete).
chkconfig --level 2345 acpid off
Creating a Shared Fence Device in Conga
If you’re not using a shared fence device, skip to the next section for per-node configuration.
1. Log into the Conga web interface. Click your cluster’s name -> Shared Fence Devices -> Add a Fence Device.
2. After clicking Add a Fence Device, you’ll be given the option to Add a Sharable Fence Device.
3. On the Add a Sharable Fence Device page, click the drop-down box under Fencing Type and select the type of fence device to configure.
Adding a fence device to cluster nodes
1. In the Conga interface, go to Clusters -> Cluster List -> Choose a cluster to administer.
2. Click a link for a node.
3. At the bottom of the page, under Main Fencing Method, click Add a fence device to this level.
4. Select the shared SanBox2 entry created earlier. Enter the switch port to which this node is connected. The authentication information should already be filled out.
If using IPMI instead, select ‘IPMI’ from the menu, and enter the IP, username and password.
5. Click Update main fence properties and wait for the change to take effect.
6. Repeat on each node.
Additional configuration for IPMI fencing
If you chose IPMI rather than SanBox2 as your fencing method, additional configuration is required.
Install the IPMI software on all nodes.
yum install ipmitool
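Before relying on IPMI for fencing, it’s worth confirming each node’s IPMI interface is reachable. For example (the IP and credentials here are placeholders):
ipmitool -I lan -H 10.0.0.50 -U admin -P secret chassis power status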
Testing
Regardless of the fencing type used, it should be tested. Run this command from one node to fence another, one node at a time, allowing the fenced node to recover and rejoin the cluster before proceeding to the next.
fence_node <nodename>
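From one of the surviving nodes, you can watch the fenced node drop out and rejoin:
cman_tool nodes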
Newbie SAN admin section…
Personally, my first cluster was also my first experience with SAN administration. This is likely not the case for you. But if you’re new to SAN administration, and are using a SAN in your cluster, this section may pertain to you.
Each node must have equal, full permission to the SAN volume. Here is one way to do that:
1. Find your nodes’ WWN. This is the identifier that will be used by the fibre switch. (The host6 number below varies from system to system; see the loop after this list if you’re unsure which entry is yours.)
cat /sys/class/fc_host/host6/port_name
2. Look for the above number in your fibre switch, and add that WWN to a zone containing your hardware RAID.
3. In your hardware RAID’s console, grant those WWNs write permission to the necessary volumes. Each node must have equal access.
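If you’re unsure which hostN entry belongs to your FC HBA in step 1, this loop prints the WWN of every FC host port on the system:
for h in /sys/class/fc_host/host*; do echo "$h: $(cat $h/port_name)"; done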
Start cluster services
After you have fencing configured, it’s safe to start up the cluster.
service cman start
service rgmanager start
service clvmd start
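You’ll also want these services to start automatically at boot:
chkconfig cman on
chkconfig rgmanager on
chkconfig clvmd on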
Creating the GFS2 Shared Volume
This section is done from a single node, unless otherwise specified.
1. Locate the partition or SAN volume that you want to use. This can be done by looking through dmesg for the device name associated with your RAID volume name. (If you’ve just created the SAN volume and haven’t rebooted since, you’ll need to run /usr/bin/rescan-scsi-bus.sh first. Some larger volumes may still require a reboot before the nodes will see them.)
2. Format the partition/volume.
If the volume is less than 2.2TB in size, you can use fdisk to create the partition.
fdisk /dev/sdj
If the volume is larger than 2.2TB, you’ll have to use parted. The volume in the example below is 4110GB:
parted /dev/sdj
(parted) mklabel gpt
(parted) mkpart primary ext2 0 4100000
(parted) set 1 lvm on
(parted) quit
3. Create a clustered LVM Volume Group for your nodes to share.
pvcreate /dev/sdj1
vgcreate --clustered y shared_test_vol /dev/sdj1
4. Re-read the partition tables and scan for the new volume group. Run this on every node.
partprobe; vgscan
5. Create the Logical Volume.
lvcreate -L 200G -n vmspace shared_test_vol
6. Create the GFS2 filesystem. You need one journal for every node that will mount the file system (hence -j 3 in the example below, for a three-node cluster). The syntax for this is:
mkfs.gfs2 -p lock_dlm -t ClusterName:FSName -j NumberJournals BlockDevice
mkfs.gfs2 -p lock_dlm -t mytestcluster:my_neato_shared_fs -j 3 /dev/shared_test_vol/vmspace
Special Considerations when Mounting GFS2 File Systems
(paraphrased/condensed from Red Hat documentation)
GFS2 performs most efficiently when written to by a single node at a time. To prevent clustered reads from turning into unnecessary writes, always mount the file system with the noatime option. Doing so speeds up performance, since the access time of each read is not written back to disk.
An entry in /etc/fstab is also very important. Without an entry in fstab, the GFS2 volume will not be known to the system when file systems are unmounted at system shutdown. As a result, the GFS2 init script will not unmount the GFS2 file system, and the cluster node will hang.
Mount the filesystem on each cluster node
mount /dev/shared_test_vol/vmspace /var/lib/libvirt/images
vim /etc/fstab
/dev/shared_test_vol/vmspace /var/lib/libvirt/images gfs2 noatime 0 0
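Also make sure the GFS2 init script is enabled, so the fstab entry is mounted at boot and cleanly unmounted at shutdown:
chkconfig gfs2 on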
Creating a virtual machine to use in the cluster
1. Type virt-manager and use the GUI to create a virtual machine of your choice.
Make sure to create the disk image on the shared volume, now mounted at /var/lib/libvirt/images/ (this is the default image location).
2. Once you have the VM running, save the config in the shared storage.
mkdir /var/lib/libvirt/images/xml_defs
cd /var/lib/libvirt/images/xml_defs
virsh dumpxml my_vm_name > my_vm_name.xml
3. SSH into each machine and define the VM.
virsh define /var/lib/libvirt/images/xml_defs/my_vm_name.xml
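Since this must happen on every node, a small loop from one machine can save some typing (a sketch, assuming passwordless root SSH and the hypothetical hostnames from the clusterssh example earlier):
for n in node1 node2 node3; do ssh $n virsh define /var/lib/libvirt/images/xml_defs/my_vm_name.xml; done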
The VM is now able to migrate freely between nodes, though it’s not yet managed by the cluster infrastructure.
Configuring Your VM as a Clustered Service
Making your virtual machines into clustered services is the best way to ensure they will always be accessible. This way, if the node running your VM shuts off unexpectedly, your VM will be recovered and relocated to another machine.
Creating a VM service also means you can control your VM through the cluster’s service manager. After this point, you won’t be able to use virt-manager to power on/off or migrate your VM. But there are equivalent commands to control your VM through rgmanager, shown below.
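For example, once the service is defined (the VM and member names here match this guide’s examples):
clusvcadm -e vm:mytestVM              # start the VM service on the cluster
clusvcadm -d vm:mytestVM              # stop it
clusvcadm -M vm:mytestVM -m node02-p  # live-migrate it to a specific member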
1. Take note of your Virtual Machine’s name, as seen by virsh list.
[root@node01-p ~]# virsh list --all
 Id Name                 State
----------------------------------
  1 mytestVM             running
2. Dump the XML definition of the new VM into the xml_defs directory. Give this file the exact name seen in ‘virsh list’, appending .xml to the name.
virsh dumpxml mytestVM > /var/lib/libvirt/images/xml_defs/mytestVM.xml
3. Make a copy of the running cluster config for backup.
cd /etc/cluster
mkdir backups
cp cluster.conf ./backups/cluster.conf-`date +%Y_%m_%d`
4. Edit cluster.conf.
Add a line like this in the section with the other services. It should be placed between the <rm> </rm> tags.
<vm name="mytestVM" path="/var/lib/libvirt/images/xml_defs/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
The meaning of that XML line is as follows:
name="mytestVM" # virsh domain name path="..." # path where the XML definition can be found. Shared between all nodes. autostart="0" # dont autostart this VM on any machine... this saves boot time exclusive="0" # this service is allowed to run with other services (VMs) recovery="restart" # restart the VM when it fails max_restarts="2" # only restart it twice within 10 minutes... if it fails more than that, relocate it. restart_expire_time="600" # 600 seconds (10 minutes) is as frequently as we'll allow this to restart
5. Increment the ‘config_version’ number at the top of cluster.conf.
<cluster alias="MyClusterName" config_version="5" name="Kog">
6. Validate and propagate the config to the other nodes:
xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng cluster.conf
ccs_tool update /etc/cluster/cluster.conf
You should now be able to view your VM service with clustat.
[root@node02-p ~]# clustat
Cluster Status for MyNeatoTestCluster @ Sat Aug 6 14:32:20 2011
Member Status: Quorate

 Member Name                 ID   Status
 ------ ----                 ---- ------
 node01-p                    1    Online, rgmanager
 node02-p                    2    Online, Local, rgmanager
 node03-p                    3    Online, rgmanager

 Service Name                Owner (Last)                State
 ------- ----                ----- ------                -----
 vm:testbox1                 node01-p                    started
 vm:mytestVM                 node02-p                    started
Reference
Here are some commands that may be useful in your cluster administration.
--- RHCS commands ---
ccs_tool update /etc/cluster/cluster.conf   # update/propagate cluster.conf to all nodes
clustat                                     # see cluster status
clusvcadm -e <service>                      # enable/start a service
clusvcadm -e <group> -m <member>            # enable group/service on member
clusvcadm -d <service>                      # disable/stop service
clusvcadm -M <vm:service> -m <member>       # migrate a VM service to another member

--- LVM commands ---
vgs   # show all LVM/cLVM Volume Groups
lvs   # show all LVM/cLVM Logical Volumes
pvs   # show all LVM/cLVM Physical Volumes (disks/partitions)

--- KVM commands ---
virsh list --all
virsh dumpxml <domain>
virsh define <domain>
virsh migrate --live <domain> qemu+ssh://root@<destination-host>/system