SUSE CaaS Platform/Troubleshooting


Docker Troubleshooting

Each Docker container keeps a log that you can review with the docker logs command to see what is happening inside that container.

Example 1

If I have a MySQL container running, this is what reviewing its logs looks like:

jsevans@lab:~> docker logs clever_kalam

error: database is uninitialized and password option is not specified 
  You need to specify one of MYSQL_ROOT_PASSWORD, MYSQL_ALLOW_EMPTY_PASSWORD and MYSQL_RANDOM_ROOT_PASSWORD
jsevans@lab:~>

In the example, you can see that this container failed to start because a root password was never set.

CaaSP runs several containers on the admin node:

jse-velum:~ # docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS               NAMES
07ce66a2fb0d        2e581ae1971b         "bash /usr/local/bin/"   2 hours ago         Up 2 hours                              k8s_haproxy_haproxy-127.0.0.1_kube-system_fc86b28e4b267305c6be1c4873664816_0
713107d66b64        sles12/pause:1.0.0   "/usr/share/suse-dock"   2 hours ago         Up 2 hours                              k8s_POD_haproxy-127.0.0.1_kube-system_fc86b28e4b267305c6be1c4873664816_0
90371851a838        0ae495fb075d         "entrypoint.sh salt-m"   3 hours ago         Up 3 hours                              k8s_salt-master_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_1
66a569fe4bb0        82023d5e30f1         "entrypoint.sh bundle"   3 hours ago         Up 3 hours                              k8s_velum-event-processor_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_2
111ebdbc8d89        1b49b518ec09         "/usr/local/bin/entry"   3 hours ago         Up 2 hours                              k8s_openldap_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
0128b51867fe        82023d5e30f1         "entrypoint.sh bin/in"   3 hours ago         Up 2 hours                              k8s_velum-dashboard_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
fb62bc923356        3dbd223dcfac         "salt-minion.sh"         3 hours ago         Up 3 hours                              k8s_salt-minion-ca_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
f328d25e963d        23b3d88e6f1a         "salt-api"               3 hours ago         Up 3 hours                              k8s_salt-api_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
8f795c149af7        82023d5e30f1         "entrypoint.sh bundle"   3 hours ago         Up 3 hours                              k8s_velum-autoyast_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
f409e46f8938        1732313eb42f         "entrypoint.sh /usr/l"   3 hours ago         Up 3 hours                              k8s_velum-mariadb_velum-private-127.0.0.1_default_bf640ab62f9d8d01fa0c2f7e66744787_0
341a7eba9003        sles12/pause:1.0.0   "/usr/share/suse-dock"   3 hours ago         Up 3 hours                              k8s_POD_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
6a26fbdeba6c        sles12/pause:1.0.0   "/usr/share/suse-dock"   3 hours ago         Up 3 hours                              k8s_POD_velum-private-127.0.0.1_default_bf640ab62f9d8d01fa0c2f7e66744787_0

For each of these, you can run docker logs against the CONTAINER ID, or use the bash shortcut shown in Example 2 below to see the current logs for that container.
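For instance, using the salt-master CONTAINER ID from the listing above (the IDs will differ on your system):

docker logs 90371851a838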

Example 2

This is how to see the logs from the salt-master

docker logs `docker ps | grep salt-master | awk '{print $1}'` -f

local:
   - master.pem
   - master.pub
minions:
   - admin
   - ca
minions_denied:
minions_pre:
   - 86c31b34f5694c0f968f7ac4b09ad9fd
   - fc02599431da43b0bef03aa0343efe35
minions_rejected:

salt

You can get direct access to Salt by adding the following aliases to /root/.bashrc (you will need to create this file). Then log out and log back in to use them. Using Salt directly is good for troubleshooting, but it should generally be avoided unless you are very familiar with Salt.

alias salt='docker exec `docker ps -q --filter name=salt-master` salt'
alias salt-key='docker exec `docker ps -q --filter name=salt-master` salt-key'
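Once the aliases are in place, the docker exec wrappers used in Examples 3 and 4 below reduce to, for instance:

salt-key -L
salt '*' grains.get ipv4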

Example 3

Check salt keys:

jse-velum:~ # docker exec `docker ps -q --filter name=salt-master` salt-key -L
Accepted Keys:
0d07bd4cd7e54a4880423fc42f025b88
2dc9b41185c649f1bf01e4a451efb1bf
56b5e2057320402fb06c757ceaebbe04
6ba7e76c7deb482685362a596ae24442
872e3dd4eeb04007ad3ae0aabe4018bf
9a2c6ef0187e4c5faaebc255e074b793
a6dfc9dff43a4b29ae18213ebd743295
admin
ca
de2a71947dc544739e4f46489288984f
Denied Keys:
Unaccepted Keys:
Rejected Keys:

Example 4

Check IP addresses:

jse-velum:~ # docker exec `docker ps -q --filter name=salt-master` salt '*' grains.get ipv4
2dc9b41185c649f1bf01e4a451efb1bf:
   - 127.0.0.1
   - 149.44.138.247
   - 149.44.139.246
   - 172.17.0.1
6ba7e76c7deb482685362a596ae24442:
   - 127.0.0.1
   - 149.44.138.242
   - 149.44.139.211
   - 172.17.0.1
a6dfc9dff43a4b29ae18213ebd743295:
   - 127.0.0.1
   - 149.44.139.162
   - 172.17.0.1
56b5e2057320402fb06c757ceaebbe04:
   - 127.0.0.1
   - 149.44.138.241
   - 149.44.139.247
   - 172.17.0.1
872e3dd4eeb04007ad3ae0aabe4018bf:
   - 127.0.0.1
   - 149.44.138.246
   - 149.44.139.164
   - 172.17.0.1
0d07bd4cd7e54a4880423fc42f025b88:
   - 127.0.0.1
   - 149.44.138.248
   - 149.44.139.223
   - 172.17.0.1
9a2c6ef0187e4c5faaebc255e074b793:
   - 127.0.0.1
   - 149.44.139.224
   - 172.17.0.1
admin:
   - 127.0.0.1
   - 149.44.138.239
   - 172.17.0.1
de2a71947dc544739e4f46489288984f:
   - 127.0.0.1
   - 149.44.138.240
   - 149.44.139.225
   - 172.17.0.1
ca:
   - 127.0.0.1
   - 149.44.138.239
   - 172.17.0.1

supportconfig

The following files are added specifically for CaaSP:

velum-files.txt
velum-migrations.txt
velum-minions.yml
velum-routes.txt
velum-salt-events.yml
velum-salt-pillars.yml
kubernetes.txt

The problem with these logs is that they are in YAML format and not easy to read at a glance. Rather than grepping for single lines as we normally would, a better strategy is to grep with context.

Example 5

grep -C3 fail velum-salt-events.yml 

 fun: grains.get
 id: 0d07bd4cd7e54a4880423fc42f025b88
- fun_args:
 - tx_update_failed
 jid: '20180214102215466369'
 return: 
 retcode: 0

The -C3 flag prints 3 lines of context before and after each matching line, so you get a much better idea of what is happening than from a single line that tells you nothing about what the error is actually about.
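If you need an asymmetric window, grep also accepts separate before/after counts; for example (the pattern is the same illustrative one as in Example 5):

grep -B2 -A8 fail velum-salt-events.yml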

Bootstrapping

If a bootstrap fails, we are not given any direct output about why it failed. This has been a big issue that I have been working on with the CaaSP team, and we now have a couple of new tools.

First, we can manually kick off another bootstrap and record the output to bootstrap.log.

docker exec -it $(docker ps | grep salt-master | awk '{print $1}') salt-run -l debug state.orchestrate orch.kubernetes | tee bootstrap.log

The CaaSP team says, "This is unsupported and may cause issues further down the line but you can run." So we don't want to do this unless it is a recurring issue that the customer can't get past. In CaaSP 3.0, this will be an option built into Velum, though I don't know if it will be logged.

Secondly, if this is recurring, we can put the salt-master into debug mode after installing the admin node and before installing other nodes.

vim /etc/caasp/salt-master-custom.conf (on the admin node)

and add

# Custom Configurations for Salt-Master

log_level: debug

You can then restart the salt-master container with:

docker stop `docker ps | grep salt-master | awk '{print $1}'`

CaaSP will automatically restart the container with debugging turned on.
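A quick way to confirm that the container has come back up (just a convenience check, not an official procedure):

docker ps --filter name=salt-master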

The logs that the salt-master produces are in JSON format. Before starting the bootstrap, run "script" and then run the command from Example 2. Once the bootstrap has failed, press CTRL-C and then type exit. The output that was just on your screen from the log command will be in a file called "typescript".
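Put together, the capture sequence looks roughly like this (the container lookup is the same shortcut as in Example 2):

script
docker logs `docker ps | grep salt-master | awk '{print $1}'` -f
# wait for the bootstrap to fail, then press CTRL-C and type exit
exit
# the captured output is now in ./typescript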

From the dev team:

During bootstrap, thousands of "things" are done and sequenced across the set of cluster machines, and the set of things is constantly changing, so any list would be out of date pretty quickly.

Generally speaking, finding which specific salt step has failed is not *all that difficult* by looking at the velum-salt-events.yml file in the admin node's supportconfig dump.

The process looks like this:

  • Open velum-salt-events.yml in an editor
  • Search for "result: False" (see the grep sketch after this list)
    • If the match you find says something like "Failed as prerequisite bar failed" (I don't have the exact wording handy), then trace "up" the chain of failure logs to the "bar" failure
    • Repeat until you see a failure whose message is not simply a prerequisite failing, and you have found the real failure event.
  • The failure event will, more often than not, have a reasonably descriptive name. The name will indicate which area failed and needs investigation - e.g. if the name includes "etcd", start checking etcd logs on each node, etc.
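A grep with context, as in Example 5, is a quick way to surface these failure events without opening an editor (this assumes the supportconfig dump has been unpacked and you are in the directory containing the file):

grep -n -C5 'result: False' velum-salt-events.yml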

Patching (from the Dev team)

How does the update process of CaaSP work?

1. Cluster nodes check daily whether there are updates pending and whether they can be applied.

In case of success:

2. The cluster node tells Velum via Salt that this machine has pending updates and needs a reboot.

3. Velum shows the admin that the node has pending updates and needs a reboot.

=> If you see the button to start the update, NEVER EVER play with transactional-update on the cluster node, even if something goes wrong during the update. transactional-update was successful, and all you can do at this point is destroy your cluster, not fix the problems Velum ran into!

4. The customer presses the update button in Velum whenever they think the time is right. The world will not end if you don't press the button immediately and instead wait for your update window. But once you start updating one node, you have to do so for all of them.

5. Velum will, via Salt, shut down the services on that cluster node, reboot it to activate the update and apply missing patches, adjust the configuration, and start the services.

In Case of Failure

=> If this goes wrong, Velum will notify you. To get it fixed, you need to find out which part of step 5 went wrong.

Hint: it is not transactional-update that went wrong; we already know that it succeeded.

Hint 2: if you think the "Failed" status is wrong, you cannot fix it with transactional-update, since it is not coming from transactional-update. In case the check from step 1 goes wrong, Velum will tell you so: you should get an error message that something is broken even though you have not pressed the update button. In this case, read the error message in /var/log/transactional-update.log on that cluster node; most likely you have to fix your update channel configuration.
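A quick way to pull the relevant messages out of that log on the affected cluster node (the pattern is only a starting point, not an exhaustive filter):

grep -i -C3 error /var/log/transactional-update.log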

Another hint: register your system during installation and apply all patches. If you register later, you have to "fix" that everywhere and disable all other local installation sources on your own.