Difference between revisions of "SUSE Manager/SaltTimeoutsForMinions"

Revision as of 13:15, 4 May 2017

A quick introduction to Salt timeouts

Salt features two timeout parameters, timeout and gather_job_timeout, that are relevant during the execution of Salt commands and jobs, regardless of whether they are triggered through the CLI or the API. This article explains how the two parameters work.

This is the normal workflow when all minions are reachable:

  1. A Salt command/job is executed: salt '*' test.ping
  2. Salt master publishes the job, together with the targeted minions, into the Salt PUB channel.
  3. Minions pick up that job and start working on it.
  4. Salt master watches the Salt RET channel to gather responses from the minions.
  5. Once Salt master has received responses from all targeted minions, the job is complete and Salt master returns a response containing all the minion responses.

If some of the minions are down during this process, the workflow continues as follows:

  1. If timeout is reached before all expected responses have arrived, Salt master triggers another job (a Salt find_job job) that targets only the pending minions, to check whether the original job is still running on them.
  2. At this point gather_job_timeout comes into play: a new timer starts.
  3. If the find_job job reports that the original job is actually running on a minion, Salt master keeps waiting for that minion's response.
  4. If gather_job_timeout is reached without any response from a minion (neither for the initial test.ping job nor for find_job), Salt master returns only the responses gathered from the responding minions.


Currently SUMA sets both timeout and gather_job_timeout globally to 120 seconds. In the worst case, a Salt call targeting unreachable minions therefore waits up to 240 seconds before it returns a response.
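The workflow above can be sketched as a small Python model. This is only a simplified illustration, not Salt's actual implementation: the function name simulate_call is hypothetical, a minion that is down is represented by None, and the sketch assumes that a reachable but slow minion always answers find_job and is waited for until it returns.

```python
from typing import Dict, Optional, Tuple


def simulate_call(
    response_times: Dict[str, Optional[float]],
    timeout: float,
    gather_job_timeout: float,
) -> Tuple[Dict[str, float], float]:
    """Return (responses gathered, seconds the master waited).

    response_times maps a minion id to the seconds it needs to answer,
    or to None when the minion is down (assumption of this sketch).
    """
    # Minions that are up eventually answer; the master waits for them,
    # either directly or because find_job reports the job as running.
    alive = {m: t for m, t in response_times.items() if t is not None}
    down = [m for m, t in response_times.items() if t is None]

    waited = max(alive.values(), default=0.0)
    if down:
        # Down minions answer neither the job nor find_job, so the master
        # gives up on them only after timeout + gather_job_timeout.
        waited = max(waited, timeout + gather_job_timeout)
    return alive, waited


if __name__ == "__main__":
    # SUMA's global values from the text: 120 seconds each.
    responses, waited = simulate_call({"web1": 2.0, "db1": None}, 120, 120)
    print(sorted(responses), waited)
```

With one minion down, the model reproduces the 240 second worst case described above; with all minions up, the wait is simply the slowest minion's response time.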

Synchronous calls with unreachable Salt minions: a presence ping mechanism

To avoid waiting until the timeouts are reached when some minions are down, we have introduced a so-called "presence mechanism" for Salt minions.

Starting with SUMA 3.0.5, this presence mechanism checks for unreachable Salt minions whenever SUMA performs synchronous calls to them, and excludes the unreachable minions from the call. Synchronous calls are being phased out in favor of asynchronous calls, but some workflows still use them.

The newly introduced presence mechanism triggers a Salt test.ping with custom, fixed, short Salt timeout values. The default values for the presence ping are timeout = 4 and gather_job_timeout = 1. This way, SUMA quickly learns which targeted minions are unreachable and can exclude them from the synchronous call.
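The filtering step can be sketched as follows. This is a minimal illustration under assumptions, not SUMA's code: split_by_presence and the ping callable are hypothetical names, and ping stands in for whatever actually runs test.ping with the short timeouts and returns the ids of the minions that answered.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Default presence ping values taken from the text above.
PRESENCE_TIMEOUT = 4
PRESENCE_GATHER_JOB_TIMEOUT = 1

# Hypothetical signature: ping(targets, timeout, gather_job_timeout)
# returns the set of minion ids that answered test.ping.
PingFn = Callable[[List[str], int, int], Set[str]]


def split_by_presence(
    targets: Iterable[str], ping: PingFn
) -> Tuple[List[str], List[str]]:
    """Split targets into (reachable, unreachable) using a short presence ping."""
    target_list = list(targets)
    answered = ping(target_list, PRESENCE_TIMEOUT, PRESENCE_GATHER_JOB_TIMEOUT)
    reachable = [m for m in target_list if m in answered]
    unreachable = [m for m in target_list if m not in answered]
    return reachable, unreachable


if __name__ == "__main__":
    # Fake ping for demonstration: only web1 answers.
    def fake_ping(targets: List[str], timeout: int, gather: int) -> Set[str]:
        return {"web1"}

    reachable, unreachable = split_by_presence(["web1", "db1"], fake_ping)
    print("synchronous call targets:", reachable)
    print("excluded:", unreachable)
```

Only the reachable list would then be passed on to the synchronous call, so the long global timeouts are never hit for minions that are known to be down.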


Overriding Salt presence timeouts values (/etc/rhn/rhn.conf)

SUMA administrators can increase or decrease the default presence ping timeout values by uncommenting the `salt_presence_ping_timeout` and `salt_presence_ping_gather_job_timeout` options in `/etc/rhn/rhn.conf` and setting the desired values:

   # SUSE Manager presence timeouts for Salt minions
   # salt_presence_ping_timeout = 4
   # salt_presence_ping_gather_job_timeout = 1
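To illustrate the effect of uncommenting these options, here is a small Python sketch that reads the two values from configuration text and falls back to the defaults when a line is still commented out. It is not how SUMA itself reads rhn.conf; the function name presence_timeouts is hypothetical and the parsing assumes simple "key = value" lines.

```python
from typing import Dict

# Defaults as documented above; they apply while the options stay commented.
DEFAULTS: Dict[str, int] = {
    "salt_presence_ping_timeout": 4,
    "salt_presence_ping_gather_job_timeout": 1,
}


def presence_timeouts(conf_text: str) -> Dict[str, int]:
    """Return the effective presence ping timeouts for the given config text."""
    values = dict(DEFAULTS)
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Blank lines and commented-out options leave the defaults in place.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in values:
            values[key] = int(value.strip())
    return values


if __name__ == "__main__":
    sample = """
    # SUSE Manager presence timeouts for Salt minions
    salt_presence_ping_timeout = 10
    # salt_presence_ping_gather_job_timeout = 1
    """
    print(presence_timeouts(sample))
```

In this sample only salt_presence_ping_timeout is uncommented, so it takes the configured value while salt_presence_ping_gather_job_timeout keeps its default.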


Salt SSH Minions (SSH Push)

Salt SSH minions are slightly different from regular (zeromq) minions. They do not use the Salt PUB/RET channels; instead, a wrapper Salt command runs inside an SSH call. The Salt timeout and gather_job_timeout parameters play no role here.

SUMA defines a timeout for SSH connections in /etc/rhn/rhn.conf:

  # salt_ssh_connect_timeout = 180

The presence ping mechanism also works with SSH minions. In this case, SUMA uses salt_presence_ping_timeout to override the default timeout value for SSH connections.
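The choice of SSH connection timeout described above can be summarized in a short Python sketch. The function name ssh_connect_timeout is hypothetical, and the default arguments simply mirror the values shown in this article (180 for salt_ssh_connect_timeout, 4 for salt_presence_ping_timeout); it is an illustration, not SUMA's implementation.

```python
def ssh_connect_timeout(
    is_presence_ping: bool,
    salt_ssh_connect_timeout: int = 180,
    salt_presence_ping_timeout: int = 4,
) -> int:
    """Pick the SSH connection timeout for a call to a Salt SSH minion."""
    # For presence pings the short presence timeout overrides the
    # regular SSH connection timeout.
    if is_presence_ping:
        return salt_presence_ping_timeout
    return salt_ssh_connect_timeout


if __name__ == "__main__":
    print("regular call:", ssh_connect_timeout(False))
    print("presence ping:", ssh_connect_timeout(True))
```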