Dec8ops

-HA TODO


go to http://wiki.novell.com/index.php/Maintwindows

  1. Dec 8 Maint. window checklist
    1. Primary objective - Move two MTAs off NW cluster and retire that 4 node cluster.
      1. Check list
        1. Bring down the MTA agents and back up the GW DB files
        2. Bring the MTA agents online, change paths, and ensure they sync (connect to each Domain)
        3. Copy DB files to Linux servers
        4. Configure/connect Linux GW agents to DB files
        5. Configure HA load scripts for new MTAs
        6. Add HA monitor to GWIA and ensure each important HA sub-module has monitor enabled
        7. Verify each HA resource can go on each node
          1. chkconfig heartbeat off for tests, then turn back on after tests
    2. X Secondary Objective - Install and configure GW Monitor and alerting
    3. Third Objective - Move Webaccess and IM to gwmail4 (non-cluster enabled)
    4. MISC
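
The "chkconfig heartbeat off for tests, then turn back on after tests" step could be scripted along these lines. This is only a sketch: it assumes the three cluster nodes are gwmail1, gwmail2 and gwmail3, that root ssh access between them works, and the function name set_heartbeat_boot is made up for illustration.

```shell
# Assumed node list (taken from the node names used on this page); adjust as needed.
NODES="gwmail1 gwmail2 gwmail3"

# Run "chkconfig heartbeat <on|off>" on every node over ssh.
# Returns non-zero if any node could not be updated.
set_heartbeat_boot() {
    state=$1
    case "$state" in
        on|off) ;;
        *) echo "usage: set_heartbeat_boot on|off" >&2; return 2 ;;
    esac
    rc=0
    for node in $NODES; do
        if ssh "$node" chkconfig heartbeat "$state"; then
            echo "$node: heartbeat boot start set to $state"
        else
            echo "$node: FAILED to set heartbeat $state" >&2
            rc=1
        fi
    done
    return $rc
}
```

Typical use would be set_heartbeat_boot off before the failover tests and set_heartbeat_boot on once every resource has been verified on every node.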



Maint. window 1

COMPLETED DEC 22

correct small size cib.xml tasklist

  1. ensure customer has full backup
  2. correct the /etc/ha.d/ha.cf file so the private NIC is listed first - we suspect a comms problem, and heartbeat might be trying the public LAN first :(
    1. Backout plan: cp ha.cf ha.cf.back1 BEFORE modifying. To back out: rcheartbeat stop, cp ha.cf.back1 ha.cf, rcheartbeat start
    2. Verification: sniff the wire and ensure HB traffic is on the HB LAN
  3. bring the other two nodes into the cluster
    1. ensure gwmail1 and gwmail2 servers are in standby mode
    2. ensure all 3 nodes have chkconfig heartbeat off - to avoid any chance of multiple reboots.
    3. ensure stonith is OFF
    4. consider stopping all HB resources and then running rcheartbeat stop, since hb_gui requests currently seem to be ignored
    5. back up the cib.xml file on gwmail3 and snip the check file
    6. rcheartbeat start on gwmail3
    7. crm_mon -i2 to ensure it joins and takes on resources
    8. gwmail1 and gwmail2 - rcheartbeat start
      1. use crm_mon -i2 to ensure they join and show standby
      2. verification: cibadmin -Q to ensure they are getting the proper cib.xml
    9. bring gwmail1 and gwmail2 out of standby and into ONLINE mode
    10. verify migration of resources to gwmail1 and gwmail2 is successful
  4. correct pridom's need for manual start (Troy)
    1. ensure cib.xml for this resource is correct
    2. thing1
    3. thing2
  5. ensure pridom resource is able to move to each node
  6. enable HB monitor for appropriate resources (like GWIA)
    1. put non-DC nodes into standby mode
    2. use hb_gui to enable HB Monitor
    3. take nodes out of standby mode once hb_gui modifications are done.
  7. re-enable stonith
  8. Once everything checks out ok, then make sure all 3 nodes have the same GW version.
    1. gwmail1 has the hp1b/DST patch on it - so use rpm -Uivh to back-rev it to the same version as gwmail2 and gwmail3 (rpm normally also needs --oldpackage to install an older version over a newer one)
    2. ensure the GroupWise startup script is HB aware and has the correct 64-bit paths
  9. VERIFY each resource can go successfully to each node.
  10. run chkconfig heartbeat on on each node
  11. Time permitting, move the secondary GW domain over to the Linux HA cluster
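
The ha.cf backup and backout plan from step 2 could be wrapped in two small helpers, roughly as below. This is a sketch, not a tested procedure: the function names are made up, HA_CF simply defaults to the path named in the task list, and rcheartbeat is assumed to be on the PATH of the node where this runs.

```shell
# Path from the task list; can be overridden by exporting HA_CF beforehand.
HA_CF=${HA_CF:-/etc/ha.d/ha.cf}

# Save a copy of ha.cf BEFORE editing it. Refuses to overwrite an existing
# backup so a second run cannot clobber the known-good copy.
backup_ha_cf() {
    if [ -e "$HA_CF.back1" ]; then
        echo "backup $HA_CF.back1 already exists" >&2
        return 1
    fi
    cp -p "$HA_CF" "$HA_CF.back1"
}

# Backout: stop heartbeat, restore the saved ha.cf, start heartbeat again.
backout_ha_cf() {
    if [ ! -f "$HA_CF.back1" ]; then
        echo "no backup at $HA_CF.back1" >&2
        return 1
    fi
    rcheartbeat stop
    cp -p "$HA_CF.back1" "$HA_CF" || return 1
    rcheartbeat start
}
```

For the verification step, something like tcpdump -n -i eth1 udp port 694 should show the heartbeat packets, assuming eth1 is the private NIC and ha.cf does not change udpport from the heartbeat default of 694.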

Customer requirements

  1. A list of names and expertise of who will be on-line during the maintenance windows.
    1. A list of names and expertise of who will be on-site for the maintenance windows.
      1. Thomas E., PSE comms, NOS(NW/Linux), eDir, GW, ZENworks
      2. Cameron C., PSE Linux/IDM
      3. Troy W., PSE former GW resources for PSEs
      4. Jason R., PSE resource for Linux NOS
  2. Trigger times for initiating a roll back for each major task.
    1. To be determined by customer
  3. The roll back steps for each major task.
  4. The testing verification steps that indicate a completed task.

Maint. window 2 - move secondary MTA to HB Cluster, bind former webaccess and IM ip addr. to gwmail4

Move secondary domain

COMPLETED DEC 29

  1. ensure chkconfig heartbeat off
  2. ensure stonith is disabled
  3. ssh into HB DC
  4. put other two nodes into standby mode
  5. offline all resources gracefully
  6. Create EVMS container for secondary Domain
  7. Create EVMS volume for secondary Domain
  8. Create hb_gui resource group and add the necessary hb_gui resources to mount the disk
  9. mount the new secondary domain partition and copy the DB files to it
  10. ensure /gwise/dom2 domain directory is samba enabled
  11. update the domain paths to the new location
  12. repair the domain so the paths get sync'd to the DB
  13. ensure the agent has the GW startup files - AKA install and configure the GW MTA agent with the correct domain name and paths
  14. add additional hb_gui group resource sub-modules to enable the agent to start properly.
  15. verify the agent is online and communicating
    1. ps aux | grep gwmta
    2. gw http monitor - check links
    3. send mail to a different Post Office and to/from the Internet
  16. re-enable stonith
  17. Verify resources can migrate to each node
  18. chkconfig heartbeat on
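
The "ps aux | grep gwmta" check in step 15 can be turned into a pass/fail test, for example as sketched here. The function names are illustrative only, and the check says nothing about whether the MTA's links are actually open, so the HTTP monitor and test mail steps are still needed.

```shell
# Succeeds when ps lists a process whose command line mentions gwmta.
# The [g] in the pattern keeps grep from matching its own command line.
mta_running() {
    ps aux | grep '[g]wmta' > /dev/null
}

# Print a one-line status and return non-zero when the agent is missing.
report_mta() {
    if mta_running; then
        echo "gwmta: running"
    else
        echo "gwmta: NOT running"
        return 1
    fi
}
```

Running report_mta on whichever node currently holds the resource group gives a quick first check after each migration test.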


Bind former webaccess and IM ip addr. to gwmail4

COMPLETED DEC 22

  1. Customer to provide Public SSL Certificate
  2. unbind/offline former IP addr. holder
  3. add these two ip addresses to gwmail4
  4. restart IM/webaccess services and see if they bind to the new alias
    1. If not, simply re-run the config script for IM and enter the new IP addr
    2. Webaccess should bind to all addresses including alias addresses, but if not, we can add it to the Apache conf file
  5. verify IM clients can connect non-ssl and ssl
  6. verify webaccess users can connect
  7. ensure chkconfig grpwise on and IM services are on
  8. put in place scripts to auto-restart service if they go offline
  9. retire former Webaccess and IM servers
    1. check timesync
    2. check report sync
    3. check for obits - dsrepair -a | adv | ext ref
    4. load config.nlm /all, then put sys:\system\config.txt in a safe place
    5. dsrepair -rc, then put sys:system\dsr_dib in a safe place
    6. nwconfig | directory options | remove DS from this server
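
For step 8 (scripts to auto-restart services that go offline), one possible shape is a small check that cron runs every few minutes. This is a hedged sketch: the function names are made up, pgrep is assumed to be installed on gwmail4, and the process name and restart command must be filled in from what ps actually shows for the WebAccess and IM agents.

```shell
# True when a process with exactly this name exists.
# Note: pgrep -x compares against the process name, which the kernel
# truncates to 15 characters.
is_running() {
    pgrep -x "$1" > /dev/null
}

# restart_if_down <process-name> <restart command...>
# Does nothing when the process is up; otherwise logs a line and runs
# the given restart command, returning that command's exit status.
restart_if_down() {
    name=$1
    shift
    if is_running "$name"; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $name is down, running: $*"
    "$@"
}
```

A call might look like restart_if_down some_agent /etc/init.d/grpwise start, where some_agent is a placeholder for the real agent process name and the init script path is an assumption based on the grpwise service named in step 7.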

Remove old NW Cluster

COMPLETED DEC 29

  1. Monitor period before removing NW Cluster (time between maint. window 1 and maint. window 2 should be long enough to know)
    1. Remove NW cluster
      1. Time in sync?
      2. Report sync clean?
      3. no obits with dsrepair -a | adv | check ext ref
      4. document all load/unload scripts
      5. offline each resource
      6. uldncs.ncf
      7. delete sub NCS objects - or just wait
      8. nwconfig | dir | remove directory services (CHECK other services like slpda/time provider)

Misc tasks

  1. snip old GW library reference from DB
    1. COMPLETED DEC 29