failover.job

A test job that demonstrates handling of joblet failover.

Usage

> zos login --user zenuser
Please enter current password for 'zenuser':
 Logged into grid as zenuser

> zos jobinfo --detail failover
Jobname/Parameters    Attributes
------------------    ----------
failover           Desc: This test jobs can be used to demonstrate joblet
                         failover handling.

    sleeptime      Desc: specify the execute length of joblet before failure in
                         seconds
                   Type: Integer
                Default: 7

    numJoblets     Desc: joblets to run
                   Type: Integer
                Default: 1

Description

Schedules one joblet by default. The joblet fails and is re-instantiated in a repeating cycle until the specified retry limit is reached, at which point the Orchestrate Server does not create another instance. This example demonstrates how the Orchestrate Server can be made more robust, as described in Section 3.11, Improving Job and Joblet Robustness.

The files that make up the Failover job include:

failover                                    # Total: 94 lines
|-- failover.jdl                            #   64 lines
`-- failover.policy                         #   30 lines

failover.jdl

 1  # -----------------------------------------------------------------------------
 2  #  Copyright © 2010 Novell, Inc. All Rights Reserved.
 3  #
 4  #  NOVELL PROVIDES THE SOFTWARE "AS IS," WITHOUT ANY EXPRESS OR IMPLIED
 5  #  WARRANTY, INCLUDING WITHOUT THE IMPLIED WARRANTIES OF MERCHANTABILITY,
 6  #  FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGMENT.  NOVELL, THE AUTHORS
 7  #  OF THE SOFTWARE, AND THE OWNERS OF COPYRIGHT IN THE SOFTWARE ARE NOT LIABLE
 8  #  FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
 9  #  TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE
10  #  OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
11  # -----------------------------------------------------------------------------
12  #  $Id: failover.jdl 10344 2009-11-20 21:46:43Z jastin $
13  # -----------------------------------------------------------------------------
14
15  # Test job to illustrate joblet failover and max retry limits
16  #
17  # Job args:
18  #    numJoblets - specify number of Joblets to run
19  #    sleeptime -- specify the execute length of joblet before failure in seconds
20  #
21
22  import sys,os,time
23
24  #
25  # Add to the 'examples' group on deployment
26  #
27  if __mode__ == "deploy":
28      try:
29          jobgroupname = "examples"
30          jobgroup = getMatrix().getGroup(TYPE_JOB, jobgroupname)
31          if jobgroup == None:
32              jobgroup = getMatrix().createGroup(TYPE_JOB, jobgroupname)
33          jobgroup.addMember(__jobname__)
34      except:
35          exc_type, exc_value, exc_traceback = sys.exc_info()
36          print "Error adding %s to %s group: %s %s" % (__jobname__, jobgroupname, exc_type, exc_value)
37
38
39  class failover(Job):
40
41       def job_started_event(self):
42            numJoblets = self.getFact("jobargs.numJoblets")
43            print 'Launching ', numJoblets, ' joblets'
44            self.schedule(failoverjoblet,numJoblets)
45
46
47  class failoverjoblet(Joblet):
48
49       def joblet_started_event(self):
50            print "------------------ joblet_started_event"
51            print "node=%s joblet=%d" % (self.getFact("resource.id"), self.getFact("joblet.number"))
52            print "self.getFact(joblet.retrynumber)=%d" % (self.getFact("joblet.retrynumber"))
53            print "self.getFact(job.joblet.maxretry)=%d" % (self.getFact("job.joblet.maxretry"))
54
55            sleeptime = self.getFact("jobargs.sleeptime")
56            print "sleeping for %d seconds" % (sleeptime)
57            time.sleep(sleeptime)
58
59            # This will cause joblet failure and thus retry
60            raise RuntimeError, "Artifical error in joblet. node=%s" % (self.getFact("resource.id"))
61
62
63
64

failover.policy

 1  <!--
 2   *=============================================================================
 3   * Copyright © 2010 Novell, Inc. All Rights Reserved.
 4   *
 5   * NOVELL PROVIDES THE SOFTWARE "AS IS," WITHOUT ANY EXPRESS OR IMPLIED
 6   * WARRANTY, INCLUDING WITHOUT THE IMPLIED WARRANTIES OF MERCHANTABILITY,
 7   * FITNESS FOR A PARTICULAR PURPOSE, AND NON INFRINGMENT.  NOVELL, THE AUTHORS
 8   * OF THE SOFTWARE, AND THE OWNERS OF COPYRIGHT IN THE SOFTWARE ARE NOT LIABLE
 9   * FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
10   * TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE
11   * OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
12   *=============================================================================
13   * $Id: failover.policy 10344 2009-11-20 21:46:43Z jastin $
14   *=============================================================================
15   -->
16
17  <policy>
18      <jobargs>
19            <fact name="sleeptime" description="specify the execute length of joblet before failure in seconds" value="7" type="Integer"  />
20            <fact name="numJoblets" description="joblets to run" value="1" type="Integer" />
21      </jobargs>
22
23      <job>
24            <fact name="description" value="This test jobs can be used to demonstrate joblet failover handling." type="String" />
25
26           <!-- Number of times to retry joblet on failure -->
27           <fact name="joblet.maxretry" type="Integer" value="3" />
28      </job>
29  </policy>
30

Classes and Methods

Definitions:

Class failover in line 39 of failover.jdl is derived from the Job class, and class failoverjoblet in line 47 of failover.jdl is derived from the Joblet class.

Job

A representation of a running job instance.

Joblet

Defines execution on the resource.

MatrixInfo

A representation of the matrix grid object, which provides operations for retrieving and creating grid objects in the system. MatrixInfo is retrieved using the built-in getMatrix() function. Write capability is dependent on the context in which getMatrix() is called. For example, in a joblet process on a resource, creating new grid objects is not supported.

GroupInfo

A representation of Group grid objects. Operations include retrieving the group member lists and adding/removing from the group member lists, and retrieving and setting facts on the group.

failover

Class failover (line 39 in failover.jdl) is derived from the Job class.

failoverjoblet

Class failoverjoblet (line 47 in failover.jdl) is derived from the Joblet class.

Job Details

The following sections describe the Failover job:

failover.policy

In failover.policy, in addition to describing the jobargs and default settings for sleeptime and numJoblets (lines 18-21), the <job/> section (lines 23-28) describes static facts. Note that the joblet.maxretry fact in line 27 has a default setting of 0 but is set here to 3. This fact can also be set in the failover.jdl file by inserting a line between lines 41 and 42, as shown in the following example:

 41       def job_started_event(self):
 ++            self.setFact("job.joblet.maxretry", 3)
 42            numJoblets = self.getFact("jobargs.numJoblets")

zosadmin deploy

When the Orchestrate Server deploys a job for the first time (see Section 3.5, Deploying Jobs), the job JDL files are executed in a special “deploy” mode. When failover.jdl is executed in this mode (line 27), it attempts to find the examples jobgroup (lines 29-30), creates it if it is missing (lines 31-32), and adds the failover job to the group (line 33).

Jobs can be deployed using either the Orchestrate Development Client or the zosadmin deploy command. If adding the job to the group fails for some reason, the exception is caught (line 34) and a message is printed (line 36) that includes the job name, group name, exception type, and exception value.
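
The get-or-create pattern used in the deploy block can be tried outside the Orchestrate Server with stand-in objects. In the following sketch, FakeMatrix, FakeGroup, the TYPE_JOB value, and the add_to_group helper are all hypothetical stand-ins written for illustration; only the method names getGroup, createGroup, and addMember are taken from failover.jdl. It makes no claim about how the real MatrixInfo or GroupInfo objects behave internally.

```python
TYPE_JOB = "job"  # stand-in for the constant supplied by the JDL environment


class FakeGroup:
    """Stand-in for a GroupInfo object: only tracks member names."""

    def __init__(self, name):
        self.name = name
        self.members = []

    def addMember(self, member):
        self.members.append(member)


class FakeMatrix:
    """Stand-in for the object returned by getMatrix()."""

    def __init__(self):
        self._groups = {}

    def getGroup(self, kind, name):
        # Returns None when the group does not exist, as the JDL code expects.
        return self._groups.get((kind, name))

    def createGroup(self, kind, name):
        group = FakeGroup(name)
        self._groups[(kind, name)] = group
        return group


def add_to_group(matrix, jobname, groupname="examples"):
    """Same steps as lines 29-33 of failover.jdl: find, create if missing, add."""
    group = matrix.getGroup(TYPE_JOB, groupname)
    if group is None:
        group = matrix.createGroup(TYPE_JOB, groupname)
    group.addMember(jobname)
    return group


matrix = FakeMatrix()
group = add_to_group(matrix, "failover")
print(group.name, group.members)
```

Because the group is looked up before it is created, deploying a second job that uses the same block reuses the existing examples group instead of creating a new one.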

job_started Event

In failover.jdl, the failover class (line 39) defines only the required job_started_event (line 41) method. This method runs on the Orchestrate Server when the job is run to launch the joblets.

On execution, the job_started_event simply gets the number of joblets to create (numJoblets in line 42), then schedules that number of instances (line 44) of the failoverjoblet class.

joblet_started Event

The failoverjoblet class (lines 47-60) defines only the required joblet_started_event (line 49) method.

When executed on an agent node, the joblet_started_event prints some helpful information for tracking execution (lines 50-53). The first output is where the joblet is running and which instance is running (line 51). The current joblet retry number (line 52) is displayed, followed by the job’s static joblet.maxretry (line 53) that was specified in the policy file.

The joblet then sleeps for jobargs.sleeptime seconds (lines 55-57) and on waking raises an exception of type RuntimeError (line 60).

This is the point of this example. After a RuntimeError exception is thrown, the Orchestrate Server runs the same joblet instance again, incrementing joblet.retrynumber each time, until the limit set by job.joblet.maxretry (default is 0) is reached.

Configure and Run

You must be logged into the Orchestrate Server before you run zosadmin or zos commands.

  1. Deploy failover.job into the grid:

    > zosadmin deploy failover.job

  2. Display the list of deployed jobs:

    > zos joblist
    

    failover should appear in this list.

  3. Run the job on one or more resources, overriding the default values for numJoblets and sleeptime that are specified in the failover.policy file:

    > zos run failover sleeptime=1 numJoblets=2
    JobID: zenuser.failover.269
    

The job appears to have run successfully. Now take a look at the log to see the joblets failing and being relaunched until the maxretry count is finally exceeded and the job exits with a failure status:

> zos log zenuser.failover.269
Launching  2  joblets
[melt] ------------------ joblet_started_event
[melt] node=melt joblet=1
[melt] self.getFact(joblet.retrynumber)=0
[melt] self.getFact(job.joblet.maxretry)=3
[melt] sleeping for 1 seconds
[melt] Traceback (innermost last):
[melt]   File "failover.jdl", line 60, in joblet_started_event
[melt] RuntimeError: Artifical error in joblet. node=melt
[freeze] ------------------ joblet_started_event
[freeze] node=freeze joblet=0
[freeze] self.getFact(joblet.retrynumber)=0
[freeze] self.getFact(job.joblet.maxretry)=3
[freeze] sleeping for 1 seconds
[freeze] Traceback (innermost last):
[freeze]   File "failover.jdl", line 60, in joblet_started_event
[freeze] RuntimeError: Artifical error in joblet. node=freeze
[melt] ------------------ joblet_started_event
[melt] node=melt joblet=0
[melt] self.getFact(joblet.retrynumber)=1
[melt] self.getFact(job.joblet.maxretry)=3
[melt] sleeping for 1 seconds
[melt] Traceback (innermost last):
[melt]   File "failover.jdl", line 60, in joblet_started_event
[melt] RuntimeError: Artifical error in joblet. node=melt
[freeze] ------------------ joblet_started_event
[freeze] node=freeze joblet=1
[freeze] self.getFact(joblet.retrynumber)=1
[freeze] self.getFact(job.joblet.maxretry)=3
[freeze] sleeping for 1 seconds
[freeze] Traceback (innermost last):
[freeze]   File "failover.jdl", line 60, in joblet_started_event
[freeze] RuntimeError: Artifical error in joblet. node=freeze
[melt] ------------------ joblet_started_event
[melt] node=melt joblet=1
[melt] self.getFact(joblet.retrynumber)=2
[melt] self.getFact(job.joblet.maxretry)=3
[melt] sleeping for 1 seconds
[melt] Traceback (innermost last):
[melt]   File "failover.jdl", line 60, in joblet_started_event
[melt] RuntimeError: Artifical error in joblet. node=melt
[freeze] ------------------ joblet_started_event
[freeze] node=freeze joblet=0
[freeze] self.getFact(joblet.retrynumber)=2
[freeze] self.getFact(job.joblet.maxretry)=3
[freeze] sleeping for 1 seconds
[freeze] Traceback (innermost last):
[freeze]   File "failover.jdl", line 60, in joblet_started_event
[freeze] RuntimeError: Artifical error in joblet. node=freeze
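
Because the log interleaves output from several nodes, it can help to summarize the retry numbers per node. The following is a minimal sketch in plain Python that assumes the log text has been saved to a string and that lines have the "[node] self.getFact(joblet.retrynumber)=N" shape shown above. The retries_by_node helper and the short LOG sample are hypothetical and are not part of the zos tooling.

```python
import re
from collections import defaultdict

# Short sample in the same shape as the log above (not the full log).
LOG = """\
[melt] node=melt joblet=1
[melt] self.getFact(joblet.retrynumber)=0
[freeze] node=freeze joblet=0
[freeze] self.getFact(joblet.retrynumber)=0
[melt] self.getFact(joblet.retrynumber)=1
"""

# Matches only the lines printed by line 52 of failover.jdl.
RETRY_LINE = re.compile(
    r"^\[(?P<node>[^\]]+)\] self\.getFact\(joblet\.retrynumber\)=(?P<retry>\d+)$"
)


def retries_by_node(log_text):
    """Hypothetical helper: list the retry numbers seen on each node, in order."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        match = RETRY_LINE.match(line.strip())
        if match:
            seen[match.group("node")].append(int(match.group("retry")))
    return dict(seen)


for node, retries in sorted(retries_by_node(LOG).items()):
    print(node, retries)
```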

See Also