{{SUSE_CaaS_Platform_SideBar}}

This page has been decommissioned; its content has been transferred to https://confluence.suse.com/display/SUSECaaSPlatform4/CaaS+Platform+Troubleshooting

__TOC__

= Kubernetes =

== Kubernetes Cluster Information ==

To get information about the cluster and its master nodes, run:

  kubectl cluster-info
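
For a quick look at node and system pod health, the standard kubectl queries (not specific to CaaS Platform) are also useful:

 kubectl get nodes -o wide
 kubectl get pods --namespace kube-system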

= Etcd Troubleshooting =

== Etcd Cluster Status ==

For information about etcd, log in to any node of the cluster and run:

  set -a; source /etc/sysconfig/etcdctl; set +a; etcdctl cluster-health
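
To list the members the cluster knows about, the same etcdctl environment file can be reused (this is the standard etcdctl v2 command, not a CaaS Platform-specific tool):

 set -a; source /etc/sysconfig/etcdctl; set +a; etcdctl member list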

= Docker Troubleshooting =

Each Docker container has a log that you can review with the docker logs command.

== Example 1 ==

If you have a MySQL container running, reviewing its logs looks like this:

 jsevans@lab:~> docker logs clever_kalam
 error: database is uninitialized and password option is not specified
   You need to specify one of MYSQL_ROOT_PASSWORD, MYSQL_ALLOW_EMPTY_PASSWORD and MYSQL_RANDOM_ROOT_PASSWORD
 jsevans@lab:~>

In this example, you can see that the container failed to start because a root password was never set.

CaaS Platform runs several containers on the admin node:

 jse-velum:~ # docker ps
 CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS               NAMES
 07ce66a2fb0d        2e581ae1971b         "bash /usr/local/bin/"   2 hours ago         Up 2 hours                              k8s_haproxy_haproxy-127.0.0.1_kube-system_fc86b28e4b267305c6be1c4873664816_0
 713107d66b64        sles12/pause:1.0.0   "/usr/share/suse-dock"   2 hours ago         Up 2 hours                              k8s_POD_haproxy-127.0.0.1_kube-system_fc86b28e4b267305c6be1c4873664816_0
 90371851a838        0ae495fb075d         "entrypoint.sh salt-m"   3 hours ago         Up 3 hours                              k8s_salt-master_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_1
 66a569fe4bb0        82023d5e30f1         "entrypoint.sh bundle"   3 hours ago         Up 3 hours                              k8s_velum-event-processor_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_2
 111ebdbc8d89        1b49b518ec09         "/usr/local/bin/entry"   3 hours ago         Up 2 hours                              k8s_openldap_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
 0128b51867fe        82023d5e30f1         "entrypoint.sh bin/in"   3 hours ago         Up 2 hours                              k8s_velum-dashboard_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
 fb62bc923356        3dbd223dcfac         "salt-minion.sh"         3 hours ago         Up 3 hours                              k8s_salt-minion-ca_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
 f328d25e963d        23b3d88e6f1a         "salt-api"               3 hours ago         Up 3 hours                              k8s_salt-api_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
 8f795c149af7        82023d5e30f1         "entrypoint.sh bundle"   3 hours ago         Up 3 hours                              k8s_velum-autoyast_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
 f409e46f8938        1732313eb42f         "entrypoint.sh /usr/l"   3 hours ago         Up 3 hours                              k8s_velum-mariadb_velum-private-127.0.0.1_default_bf640ab62f9d8d01fa0c2f7e66744787_0
 341a7eba9003        sles12/pause:1.0.0   "/usr/share/suse-dock"   3 hours ago         Up 3 hours                              k8s_POD_velum-public-127.0.0.1_default_8febe56752d5b78228314baf894d5740_0
 6a26fbdeba6c        sles12/pause:1.0.0   "/usr/share/suse-dock"   3 hours ago         Up 3 hours                              k8s_POD_velum-private-127.0.0.1_default_bf640ab62f9d8d01fa0c2f7e66744787_0

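If you only want to see one of these containers, the name filter used later on this page narrows the list, for example:

 docker ps --filter name=velum-dashboard
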
For each of these, you can run docker logs against the CONTAINER ID, or use the bash shortcut shown in Example 2 to follow the current logs for that container.

== Example 2 ==

This is how to see the logs from the salt-master:

 docker logs `docker ps | grep salt-master | awk '{print $1}'` -f
 local:
     - master.pem
     - master.pub
 minions:
     - admin
     - ca
 minions_denied:
 minions_pre:
     - 86c31b34f5694c0f968f7ac4b09ad9fd
     - fc02599431da43b0bef03aa0343efe35
 minions_rejected:

= salt =

You can add direct access to salt by adding the following aliases to /root/.bashrc (you will need to create this file), then logging out and back in to use them. Using Salt directly is useful for troubleshooting, but it should generally be avoided unless you are very familiar with Salt.

 alias salt='docker exec `docker ps -q --filter name=salt-master` salt'
 alias salt-key='docker exec `docker ps -q --filter name=salt-master` salt-key'

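With these aliases in place, the docker exec prefixes used in the examples below can be dropped, for example:

 salt-key -L
 salt '*' test.ping
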
== Example 3 ==

Check salt keys:

 jse-velum:~ # docker exec `docker ps -q --filter name=salt-master` salt-key -L
 Accepted Keys:
 0d07bd4cd7e54a4880423fc42f025b88
 2dc9b41185c649f1bf01e4a451efb1bf
 56b5e2057320402fb06c757ceaebbe04
 6ba7e76c7deb482685362a596ae24442
 872e3dd4eeb04007ad3ae0aabe4018bf
 9a2c6ef0187e4c5faaebc255e074b793
 a6dfc9dff43a4b29ae18213ebd743295
 admin
 ca
 de2a71947dc544739e4f46489288984f
 Denied Keys:
 Unaccepted Keys:
 Rejected Keys:

== Example 4 ==

Check IP addresses:

 jse-velum:~ # docker exec `docker ps -q --filter name=salt-master` salt '*' grains.get ipv4
 2dc9b41185c649f1bf01e4a451efb1bf:
     - 127.0.0.1
     - 149.44.138.247
     - 149.44.139.246
     - 172.17.0.1
 6ba7e76c7deb482685362a596ae24442:
     - 127.0.0.1
     - 149.44.138.242
     - 149.44.139.211
     - 172.17.0.1
 a6dfc9dff43a4b29ae18213ebd743295:
     - 127.0.0.1
     - 149.44.139.162
     - 172.17.0.1
 56b5e2057320402fb06c757ceaebbe04:
     - 127.0.0.1
     - 149.44.138.241
     - 149.44.139.247
     - 172.17.0.1
 872e3dd4eeb04007ad3ae0aabe4018bf:
     - 127.0.0.1
     - 149.44.138.246
     - 149.44.139.164
     - 172.17.0.1
 0d07bd4cd7e54a4880423fc42f025b88:
     - 127.0.0.1
     - 149.44.138.248
     - 149.44.139.223
     - 172.17.0.1
 9a2c6ef0187e4c5faaebc255e074b793:
     - 127.0.0.1
     - 149.44.139.224
     - 172.17.0.1
 admin:
     - 127.0.0.1
     - 149.44.138.239
     - 172.17.0.1
 de2a71947dc544739e4f46489288984f:
     - 127.0.0.1
     - 149.44.138.240
     - 149.44.139.225
     - 172.17.0.1
 ca:
     - 127.0.0.1
     - 149.44.138.239
     - 172.17.0.1

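The same pattern works with other standard Salt grains if you need more than the IP addresses, for example the hostname of each minion:

 docker exec `docker ps -q --filter name=salt-master` salt '*' grains.get host
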
== Example 5 ==

Check whether all salt-minions are up and responding to salt. This can be useful when the Velum web interface shows "We're sorry, but something went wrong.":

 jse-velum:~ # docker exec `docker ps -q --filter name=salt-master` salt '*' test.ping
 admin:
     True
 ca:
     True
 adbb810767ec43209ec10338d1cfdc27:
     True
 32b55f8a6c6149a6ac095438a215fe22:
     True
 ee1aea918d924391b54d7ef56c8df027:
     True
 ce8b28d0fa1f406cb48bcb1213a7f45e:
     True
 6751dcc7fafe42ffa37932b79f83a240:
     True
 2ce8e42643b44c8b8f72a2292aac8640:
     True
 c10985b20e034e98acdee76822284534:
     True
 37c5bb6101614a5fad3a56fb2cb03442:
     Minion did not return. [No response]

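If a minion does not return, as in the last entry above, a reasonable first check (plain Salt and systemd troubleshooting, not a CaaS Platform-specific procedure) is whether the salt-minion service is running on that node:

 systemctl status salt-minion
 journalctl -u salt-minion --since "1 hour ago"
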
= supportconfig =

The following files are added specifically for CaaS Platform:

 velum-files.txt
 velum-migrations.txt
 velum-minions.yml
 velum-routes.txt
 velum-salt-events.yml
 velum-salt-pillars.yml
 kubernetes.txt

The problem with these logs is that they are in YAML format and not easy to troubleshoot. Rather than grepping for a single line as we normally would, a better strategy is to grep with context.

== Example 6 ==

 grep -C3 fail velum-salt-events.yml
   fun: grains.get
   id: 0d07bd4cd7e54a4880423fc42f025b88
 - fun_args:
   - tx_update_failed
   jid: '20180214102215466369'
   return: ''
   retcode: 0

The -C3 flag prints three lines of context before and after each matching line, so you get a much better idea of what is happening than a lone matching line, which tells you nothing about what the error is actually about.

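If you need asymmetric context, grep's -B and -A options set the number of lines shown before and after each match separately, for example:

 grep -B2 -A6 fail velum-salt-events.yml
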
= Bootstrapping =

If a bootstrap fails, we are not given any direct output about why it failed. We now have a couple of new tools:

First, we can manually kick off another bootstrap and record the output to bootstrap.log:

 docker exec -it $(docker ps | grep salt-master | awk '{print $1}') salt-run -l debug state.orchestrate orch.kubernetes | tee bootstrap.log

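The captured bootstrap.log can then be searched with context, just like the supportconfig files above, for example:

 grep -C3 fail bootstrap.log
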
The CaaS Platform team says, "This is unsupported and may cause issues further down the line but you can run.", so we don't want to do this unless it is a recurring issue that the customer can't get past. In CaaS Platform 3 this will be an option built into Velum, though I don't know if it will be logged.

Secondly, if the failure is recurring, we can put the salt-master into debug mode after installing the admin node and before installing the other nodes.

On the admin node, edit the custom salt-master configuration:

 vim /etc/caasp/salt-master-custom.conf

and add:

 # Custom Configurations for Salt-Master
 log_level: debug

You can then restart the salt-master container with:

 docker stop `docker ps | grep salt-master | awk '{print $1}'`

CaaS Platform will automatically restart the container with debugging turned on.

The logs that the salt-master produces are in JSON format. Before starting the bootstrap, run "script" and then run the docker logs command from Example 2. Once the bootstrap has failed, press CTRL-C and then run "exit". The output that was just on your screen from the log command will be in a file called "typescript".

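As a sketch, the capture sequence looks like this (using the same container-name filter as the salt aliases above):

 script                                                     # start recording the terminal session to ./typescript
 docker logs -f `docker ps -q --filter name=salt-master`    # follow the salt-master logs during the bootstrap
 # ...wait for the bootstrap to fail, then press CTRL-C...
 exit                                                       # stop recording; review the typescript file
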
== From the dev team ==

During bootstrap, thousands of "things" are done and sequenced across the set of cluster machines - and the set of things is constantly changing - so any list would be out of date pretty quickly.

Generally speaking, finding which specific salt step has failed is not *all that difficult* by looking at the velum-salt-events.yml file in the admin node's supportconfig dump.

The process looks like this (a short grep sketch follows the list):

* Open velum-salt-events.yml in an editor
* Search for "result: False"
** If the match you find says something like "Failed as prerequisite bar failed" (I don't have the exact wording handy), trace "up" the chain of failure logs to the "bar" failure
** Repeat until you see a failure whose message is not simply a prerequisite failing; you have then found the real failure event.
* The failure event will, more often than not, have a reasonably descriptive name. The name will indicate which area failed and needs investigation - e.g. if the name includes "etcd", start checking the etcd logs on each node.
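
A minimal command-line version of the first two steps, assuming the file from the supportconfig dump is in the current directory:

 grep -n -B5 'result: False' velum-salt-events.yml

The -n flag prints line numbers so you can jump to the surrounding event in your editor, and -B5 shows a few lines above each failed result.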
[[Category:SUSE CaaS Platform]]