SUSE Manager/Osad and jabberd troubleshooting

From MicroFocusInternationalWiki
Revision as of 07:12, 15 December 2017 by Kwk (Talk | contribs) (Split cure and last resort)

Typical issues

Open file count exceeded

Symptoms

OSAD clients cannot contact the SUSE Manager Server, and jabberd takes a long time to respond on port 5222.

Cause

The maximum number of files that the jabber user can open is lower than the number of connected clients. Every client needs one always-open TCP connection, and each connection consumes one file handle, so jabberd starts queuing and refusing connections.

Cure

Add lines like the following to /etc/security/limits.conf:

jabber soft nofile <#clients + 100>
jabber hard nofile <#clients + 1000>

Substitute <#clients + 100> and <#clients + 1000> according to your setup. For example, for 5000 clients:

jabber soft nofile 5100
jabber hard nofile 6000
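
The arithmetic above can be sketched as a small shell function. The function name jabber_limits is hypothetical; the margins of 100 and 1000 are the ones used in this section, and the output is only a suggestion to review before copying it into /etc/security/limits.conf.

```shell
# Hypothetical helper: print limits.conf entries for the jabber user.
# Margins follow this page: soft = clients + 100, hard = clients + 1000.
jabber_limits() {
    clients=$1
    printf 'jabber soft nofile %d\n' "$((clients + 100))"
    printf 'jabber hard nofile %d\n' "$((clients + 1000))"
}

# Example: suggested entries for 5000 clients
jabber_limits 5000
```

The hard value printed here is also the figure used for max_fds in the example that follows.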


Also update /etc/jabberd/c2s.xml's max_fds parameter accordingly, for example:

<max_fds>6000</max_fds>


Explanation: nofile limits the number of files a single process may have open. In a SUSE Manager setup the process that consumes the most is c2s, which holds one connection per client. The 100 additional files in the soft limit cover the non-connection files that c2s needs to work correctly. The hard limit is the ceiling up to which a process running as the jabber user may raise its own soft limit, and the larger margin leaves headroom for that. The router, s2s and sm processes run as the jabber user too, and each of them gets the same limits, counted separately per process.
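
To check how close a running process is to its limit, the values can be read from /proc. This is a minimal sketch for Linux only; the function name fd_usage is hypothetical, and the pgrep call in the comment assumes the process is named c2s and runs as the jabber user.

```shell
# Hypothetical helper: print "<open file descriptors> <soft nofile limit>"
# for the given PID, read from /proc (Linux only, needs permission to read
# the target process' /proc entries, so run it as root for c2s).
fd_usage() {
    pid=$1
    open=$(ls "/proc/$pid/fd" | wc -l)
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$open $soft"
}

# Usage (assumption: a single c2s process owned by the jabber user):
#   fd_usage "$(pgrep -u jabber c2s)"
```

If the first number is close to the second, the limit is likely the cause of the refused connections.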

jabberd database corruption

Symptoms

After a disk-full error or a disk crash, the jabberd database might be corrupted, in which case jabberd fails to start during spacewalk-service start:

   Starting spacewalk services...
   Initializing jabberd processes...
       Starting router                                                                   done
       Starting sm startproc:  exit status of parent of /usr/bin/sm: 2                   failed
   Terminating jabberd processes...

/var/log/messages shows more details:

   jabberd/sm[31445]: starting up
   jabberd/sm[31445]: process id is 31445, written to /var/lib/jabberd/pid/sm.pid
   jabberd/sm[31445]: loading 'db' storage module
   jabberd/sm[31445]: db: corruption detected! close all jabberd processes and run db_recover
   jabberd/router[31437]: shutting down
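
A quick way to test for this symptom is to search the log for the corruption message shown above. This is a sketch only; the function name jabberd_db_corrupt is hypothetical, and the default log path is the one named in this section.

```shell
# Hypothetical helper: succeed (exit status 0) if the log file contains
# the jabberd corruption message quoted on this page.
jabberd_db_corrupt() {
    grep -qF 'db: corruption detected' "${1:-/var/log/messages}"
}

# Usage:
#   jabberd_db_corrupt && echo "jabberd database corruption reported"
```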

Cure

Database corruption should no longer occur, because jabberd now uses SQLite in place of Berkeley DB to improve stability and performance. SQLite is the preferred database option for jabberd.

Fresh installations of SUSE Manager 3.0 Server use SQLite by default.

Existing installations need to be switched to this database manually, as follows:

     # systemctl stop jabberd
     # sed -i.bak "s#<driver>db</driver>#<driver>sqlite</driver>#g" /etc/jabberd/sm.xml
     # sed -i.bak "s#<module>db</module>#<module>sqlite</module>#g" /etc/jabberd/c2s.xml
     # systemctl start jabberd
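
To confirm which backend the two files name, before or after the switch, the element contents can be printed with sed. This is a sketch, not an XML parser: the function name xml_tag_value is hypothetical, and it only handles elements whose opening and closing tags are on the same line, as in the sed commands above.

```shell
# Hypothetical helper: print the content of every <tag>...</tag> element
# that sits on a single line of the given file.
xml_tag_value() {
    tag=$1
    file=$2
    sed -n "s#.*<$tag>\(.*\)</$tag>.*#\1#p" "$file"
}

# Usage:
#   xml_tag_value driver /etc/jabberd/sm.xml
#   xml_tag_value module /etc/jabberd/c2s.xml
```

Each call prints one line per matching element, so more than one line of output means the file contains several such elements.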

Note: Running spacewalk-setup-jabberd is NOT recommended as it will reset the jabberd password, breaking existing installations.

Also, a word of advice: NEVER use osad without rhnsd! rhnsd is your fallback if osad stops working. You can schedule a job that restarts osad and, if needed, removes its auth file.
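
Such a job could look like the following sketch. The function name restart_osad is hypothetical, and the auth file path is an assumption that should be checked against your osad version. rcosad is the service command used elsewhere on this page.

```shell
# ASSUMPTION: location of the osad auth file; verify on your client.
AUTH_FILE=${AUTH_FILE:-/etc/sysconfig/rhn/osad-auth.conf}

# Hypothetical helper: restart osad; pass "reset" to remove the auth file
# while osad is stopped.
restart_osad() {
    rcosad stop
    if [ "${1:-}" = "reset" ]; then
        rm -f "$AUTH_FILE"
    fi
    rcosad start
}

# Usage (as root on the client):
#   restart_osad          # plain restart
#   restart_osad reset    # restart and remove the auth file
```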

Last resort

As a last resort, you can remove the jabberd database, re-init it, and restart.

These steps assume that jabberd uses a SQLite database!

   spacewalk-service stop
   rm -Rf /var/lib/jabberd/db/*
   /usr/share/spacewalk/setup/jabberd/create_sqlite3_database
   spacewalk-service start
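
Because the rm step is irreversible, it may be worth archiving the directory first, after spacewalk-service stop and before rm. This is a minimal sketch; the function name backup_jabberd_db and the default archive path are hypothetical.

```shell
# Hypothetical helper: archive the jabberd database directory.
# $1 = source directory, $2 = archive file (both defaults are assumptions).
backup_jabberd_db() {
    src=${1:-/var/lib/jabberd/db}
    dest=${2:-/root/jabberd-db-backup.tar.gz}
    tar -czf "$dest" -C "$src" .
}

# Usage (as root, with the services stopped):
#   backup_jabberd_db
```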

Dumping XMPP network data for debugging purposes

When investigating bugs in OSAD, it can be useful to dump network messages to help with debugging. The procedure differs slightly depending on whether you capture on the client side or on the server side.

Server side:

  • install the tcpdump package on the SUSE Manager Server
  • stop the OSA dispatcher and Jabber processes
  • start data capture on port 5222
  • start the OSA dispatcher and Jabber processes
  • operate the SUSE Manager server and clients so that the bug is reproduced
  • stop the capture

Commands to implement the above instructions are:

# in terminal 1, type:
su
zypper in tcpdump
rcosa-dispatcher stop
rcjabberd stop
tcpdump -s 0 -w server_dump.pcap port 5222

# open a different terminal (terminal 2) and, as root, type:
rcosa-dispatcher start
rcjabberd start

# now operate on the SUSE Manager server and clients to reproduce the bug
# after that, reopen terminal 1 and use CTRL+C to terminate the capture


Client side:

  • install the tcpdump package on the client
  • stop the OSA process
  • start data capture on port 5222
  • start the OSA process
  • operate the SUSE Manager server and clients so that the bug is reproduced
  • stop the capture

Commands to implement the above instructions are:

# in terminal 1, type:
su
zypper in tcpdump
rcosad stop
tcpdump -s 0 -w client_dump.pcap port 5222

# open a different terminal (terminal 2) and, as root, type:
rcosad start

# now operate on the SUSE Manager server and client to reproduce the bug
# after that, reopen terminal 1 and use CTRL+C to terminate the capture
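
Before sending a dump to anyone, it can help to confirm that the file really is a capture and not an empty or truncated file. The sketch below checks the first four bytes against the classic pcap magic numbers (either byte order, microsecond or nanosecond timestamps). The function name is_pcap is hypothetical.

```shell
# Hypothetical helper: succeed if the file starts with a classic pcap
# magic number. An empty or missing file fails the check.
is_pcap() {
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    case $magic in
        d4c3b2a1|a1b2c3d4|4d3cb2a1|a1b23c4d) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   is_pcap server_dump.pcap && echo "server_dump.pcap looks like a capture"
```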

Engineering notes: analyzing dumped data

  • obtain the certificate file from the SUSE Manager server: /etc/pki/spacewalk/jabberd/server.pem
  • edit this file, removing all lines before -----BEGIN RSA PRIVATE KEY-----, and save it as key.pem
  • install Wireshark
  • open the captured file
  • from Edit -> Preferences select Protocols -> SSL from the left pane
  • click RSA keys list: Edit... -> New
    • IP Address: any
    • Port: 5222
    • Protocol: xmpp
    • Key File: open the key.pem file edited above
    • Password: leave blank
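
The key extraction step in the list above can also be done without an editor. This is a sketch that assumes a local copy of server.pem containing the exact marker line named above; the function name extract_key is hypothetical.

```shell
# Hypothetical helper: print the given PEM file from the
# "-----BEGIN RSA PRIVATE KEY-----" line to the end of the file.
extract_key() {
    sed -n '/-----BEGIN RSA PRIVATE KEY-----/,$p' "$1"
}

# Usage (on a copy of the server file):
#   extract_key server.pem > key.pem
#   chmod 600 key.pem
```

If the output is empty, the file does not contain that marker line and the key has to be located by hand.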

Further information:


Upstream guides

Configuring Osad

https://fedorahosted.org/spacewalk/wiki/OsadHowTo

Jabber and OSAD client connection issues

https://fedorahosted.org/spacewalk/wiki/JabberAndOSAD