AdministeringHeartBeat


Best Practice for adding/modifying HeartBeat (HB) resources

This is based on my experience.

  1. Connect to the Designated Coordinator (DC) and run hb_gui from there (to find the DC, launch hb_gui or run crm_mon -i2). A command sketch of the whole sequence follows the list.
  2. Turn OFF stonith (check box in hb_gui)
  3. chkconfig heartbeat off (so that if a node reboots, it doesn't get into a reboot cycle)
  4. Put the other nodes in standby mode (every node except the DC)
  5. Gracefully take the resources offline
  6. Back up /var/lib/heartbeat/crm/cib.xml on the DC node.
  7. Ensure nodes are communicating via HB broadcasts (use ethereal/wireshark to sniff the network and look for HB packets). Also check the /etc/ha.d/ha.cf file and ensure the private HB LAN is listed first.
  8. Do your hb_gui operations
  9. Change from standby mode to online mode (start with one node and make sure it comes back online fine, then go to the second, and so on).
  10. Test migrating resources
  11. Once everything looks good, re-enable stonith
  12. chkconfig heartbeat on
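
Here is a rough command sketch of the sequence above, run as root. It assumes a two-node cluster (using the gwmail1/gwmail2 names from the Rolf procedure below) and that the private HB LAN is on eth1 - the interface name and node names are assumptions, so adjust them to your setup.

  # step 1: find the DC (it is named in the crm_mon status header), then work from that node
  crm_mon -i2

  # step 3: keep heartbeat from starting on its own if a node reboots
  chkconfig heartbeat off

  # step 4: put the non-DC node(s) in standby, run from the DC
  crm_standby -U gwmail2 -v 1

  # step 6: back up the CIB on the DC
  cp /var/lib/heartbeat/crm/cib.xml /root/cib.xml.backup

  # step 7: confirm HB traffic on the private LAN (heartbeat defaults to UDP port 694)
  tcpdump -i eth1 udp port 694

  # steps 9-12: bring nodes back online one at a time, re-enable stonith in hb_gui, then
  crm_standby -U gwmail2 -v 0
  chkconfig heartbeat on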

The Rolf Procedure

What to do if your cluster keeps choking/abending/rebooting.

  1. chkconfig heartbeat off (so that on the next reboot, heartbeat does not start and therefore does not start any of the resources). You may need to watch the server as it boots and simply pick the safe mode entry at the GRUB boot menu, so heartbeat doesn't come up until you are ready for it.
  2. Modify the cib.xml file and disable stonith - /var/lib/heartbeat/crm/cib.xml (see the sketch after this list)
  3. After you modify the cib.xml file, you'll need to delete the check file (cib.xml.sig) - you are welcome to create a directory and copy it there for safekeeping.
  4. If you suspect a particular resource is causing the fault, then change the state from started to stopped in the xml file.
  5. Look at each node's cib.xml (date/size) and determine which one is the most recent/correct, then go to the other nodes and delete their cib.xml and cib.xml.sig files; that way you can ensure that only ONE version of the cib.xml is going to load (the one that has stonith disabled and the resources stopped).
  6. This will stabilize the cluster. You'll need to determine what is causing the faults; then you can proceed to put the cluster back together.
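
A minimal sketch of steps 2-5 above, run on the node whose cib.xml you are keeping, with heartbeat stopped. The property is normally called stonith-enabled in the crm_config section; the nvpair shown is only illustrative and the ids in your cib.xml will differ.

  # step 3 (the safekeeping part): keep a copy of the current CIB
  mkdir -p /root/cib-backup
  cp /var/lib/heartbeat/crm/cib.xml* /root/cib-backup/

  # step 2: edit the CIB by hand; in the crm_config section set stonith-enabled to false, e.g.
  #   <nvpair id="..." name="stonith-enabled" value="false"/>
  # step 4: if a particular resource is suspect, change its state from started to stopped here too
  vi /var/lib/heartbeat/crm/cib.xml

  # step 3: the signature no longer matches the edited file, so remove it
  rm /var/lib/heartbeat/crm/cib.xml.sig

  # step 5: on the OTHER nodes, remove their copies so only this version loads
  rm /var/lib/heartbeat/crm/cib.xml /var/lib/heartbeat/crm/cib.xml.sig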

    1. Finding the fault
      1. tail -f /var/log/messages - it is handy to view this while doing operations, especially if the operations are failing; you can see errors in real time and correlate them with what you just did (see the example below).
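
For example (a sketch - adjust the patterns to taste), keep this running in a second terminal while you work:

  tail -f /var/log/messages | grep -iE 'heartbeat|crmd|stonith|error'
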
  1. In order to make modifications while only one node is up, you'll need to put the other nodes into standby mode. You can do this from the DC (or from the only node you have up at the moment) BEFORE you bring the other nodes up.
    1. crm_standby -U gwmail1 -v 1
    2. crm_standby -U gwmail2 -v 1
  2. Once you have confirmed the problem has been fixed, you can re-enable things (see the sketch at the end of this list).
  3. Go from standby mode to online mode
    1. crm_standby -U gwmail1 -v 0
    2. crm_standby -U gwmail2 -v 0
  4. crm_mon -i2 - ensure the nodes join and the resources are running
  5. cibadmin -Q > /root/cibadmin.txt and ensure the other nodes see the correct cib.xml in memory
  6. migrate the resources to where you want them to be
    1. crm_resource -M -r RESOURCEGROUPNAME -H NODENAME -f
  7. Use /usr/bin/hb_gui to re-enable stonith
  8. chkconfig heartbeat on
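
Putting steps 3-8 together, a rough sketch using the gwmail1/gwmail2 names from above (RESOURCEGROUPNAME is a placeholder for your own resource group):

  # step 3: bring the nodes out of standby
  crm_standby -U gwmail1 -v 0
  crm_standby -U gwmail2 -v 0

  # steps 4-5: watch them rejoin, then confirm the in-memory CIB is the one you expect
  crm_mon -i2
  cibadmin -Q > /root/cibadmin.txt

  # step 6: move the resource group where it belongs
  crm_resource -M -r RESOURCEGROUPNAME -H gwmail1 -f

  # steps 7-8: re-enable stonith in hb_gui, then let heartbeat start at boot again
  chkconfig heartbeat on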