Glossary

aborted

Is a node status.

When the ECF_JOB_CMD fails, or the job file sends an ecflow_client --abort child command, the task is placed into the aborted state.

active

Is a node status.

If job creation was successful and the job file has started, the ecflow_client --init child command is received by the ecflow_server and the task is placed into the active state.

autocancel

autocancel is a way to automatically delete a node which has completed.

The delete may be delayed by an amount of time in hours and minutes, or expressed in days. Any node may have a single autocancel attribute. If the auto-cancelled node is referenced in the trigger expression of other nodes, it may leave those nodes waiting. This can be solved by making sure the trigger expression also checks for the unknown state, e.g.:

trigger node_to_cancel == complete or node_to_cancel == unknown

This guards against ‘node_to_cancel’ being undefined or deleted.

See also:

Python API

ecflow.Autocancel, ecflow.Node.add_autocancel

Definition file grammar

autocancel

aviso

An aviso is an attribute of a Node (typically a Task), and creates a dependency on an external Aviso server.

A Node with an aviso attribute is held from executing until a notification matching the configured listener is received. When a matching notification is received, the node is then allowed to execute, behaving similarly to a trigger or a time dependency (e.g. cron).

Only one aviso attribute is allowed per node, and each attribute is defined by the following properties:

  • name, an identifier

  • listener, the configuration for the Aviso listener

  • url, the base location of the Aviso server

  • schema, the location of the Aviso schema used to evaluate the notifications

  • polling, the value (in seconds) used to periodically contact the Aviso server

  • auth, the location to the Aviso authentication credentials file

Note

The listener parameter is expected to be a valid single line JSON string, enclosed in single quotes.

The value of the properties url, schema, polling, and auth can be composed of Variables. When these properties are not provided, the following default values are used:

  • %ECF_AVISO_URL%, for url

  • %ECF_AVISO_SCHEMA%, for schema

  • %ECF_AVISO_POLLING%, for polling

  • %ECF_AVISO_AUTH%, for auth

Important

The variables ECF_AVISO_* are not automatically provided at server level, and must be defined at Suite level by the user.

Each aviso attribute implies that a background thread is spawned whenever the associated node is (re)queued. This independent background thread, responsible for polling the Aviso server and periodically processing the latest notifications, uses the configuration available when the associated task is queued.

Note

If any variables providing the configuration are updated, the Aviso configuration can be reloaded (without unqueuing the Task) by issuing an Alter change command with the value reload to the relevant Aviso attribute.

The authentication credentials file is expected to be in JSON format, following the ECMWF Web API (this is conventionally stored in a file located at $HOME/.ecmwfapirc):

{
  "url" : "https://api.ecmwf.int/v1",
  "key" : "<your-api-key>",
  "email" : "<your-email>"
}

Only the fields url, key, and email are required; any additional fields are ignored.

Important

If %ECF_AVISO_AUTH% provides a path to a nonexistent file, or if the provided file is not a valid JSON, the credentials will be ignored and the Aviso notification retrieval will eventually fail due to “UNAUTHORIZED” access.
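
As a minimal sketch of the behaviour described above, the following Python snippet loads a credentials file and keeps only the three required fields. The function name load_credentials and the sample file names are hypothetical, and treating a file with missing fields as unusable is an assumption; this is not ecFlow's own implementation.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("url", "key", "email")


def load_credentials(path):
    """Return the required fields as a dict, or None if the file is unusable."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, ValueError):
        # Nonexistent/unreadable file, or content that is not valid JSON.
        return None
    if not isinstance(data, dict) or not all(f in data for f in REQUIRED_FIELDS):
        # Assumption: a file lacking a required field is treated as unusable.
        return None
    # Any additional fields are ignored.
    return {field: data[field] for field in REQUIRED_FIELDS}


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.json")
    with open(good, "w", encoding="utf-8") as handle:
        json.dump(
            {
                "url": "https://api.ecmwf.int/v1",
                "key": "<your-api-key>",
                "email": "<your-email>",
                "comment": "ignored",
            },
            handle,
        )
    bad = os.path.join(tmp, "bad.json")
    with open(bad, "w", encoding="utf-8") as handle:
        handle.write("{ not json")

    valid = load_credentials(good)
    invalid = load_credentials(bad)
    missing = load_credentials(os.path.join(tmp, "absent.json"))
```

A check like this can be run before loading a suite, so that an unusable credentials file is noticed before notification retrieval fails as unauthorized.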

The Aviso schema file is a JSON file that defines the event listener schema. This is used by both Aviso server and client (thus, by ecFlow) to define the valid event types and request parameters used when polling for notifications. The schema file path must be provided to the schema option (or via the ECF_AVISO_SCHEMA variable).

check point

The check point file is like the suite definition, but includes all the state information.

It is periodically saved by the ecflow_server.

It can be used to recover the state of the node tree should the server die or the machine crash.

By default, when an ecflow_server is started it will try to load the check point file.

The default check point file name is <host>.<port>.ecf.check. This can be overridden by the ECF_CHECK environment variable.

The check point file format is the same as the defs file format (from release 4.7.0 onwards). However, the indentation has been removed to preserve space. To view with indentation use:

ecflow_client --load=<check_point_file> print check_only

child command

Child (or Task) commands are called from within the ecf script files. The table below also lists the default action (from version 4.0.4) taken when the child command is part of a zombie. ‘block’ means the job will be held by the ecflow_client command until time out, or until manual/automatic intervention.

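========================  ====================================  =======================
Child (or Task) Command   Description                           Zombie (default action)
========================  ====================================  =======================
ecflow_client --init      Sets the task to the active status    block
ecflow_client --wait      Wait for an expression to evaluate    block
ecflow_client --queue     Update queue step in server           block
ecflow_client --abort     Sets the task to the aborted status   block
ecflow_client --complete  Sets the task to the complete status  block
ecflow_client --event     Set an event                          fob
ecflow_client --meter     Change a meter                        fob
ecflow_client --label     Change a label                        fob
========================  ====================================  =======================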

The following environment variables must be set for the child commands: ECF_HOST, ECF_NAME, ECF_PASS and ECF_RID. See ecflow_client.

clock

A clock is an attribute of a suite.

A gain can be specified to offset from the given date.

The hybrid and real clocks always run in phase with the system clock (UTC in UNIX) but can have any offset from the system clock.

The clock can be:

  • hybrid

  • real

time, day, date and cron dependencies work a little differently under the two clocks.

The default clock type is hybrid.

If the ecflow_server is shut down or halted, job scheduling is suspended. If this suspension is left for a period of time, it can affect task submission under hybrid and real clocks. In particular it will affect tasks with time, today or cron dependencies.

  • Dependencies with time series can result in missed time slots:

    time 10:00 20:00 00:15    # If server is suspended > 15 minutes, time slots can be missed
    time +00:05 20:00 00:15   # start 5 minutes after the start of the suite, then every 15m until 20:00
    
  • When the server is placed back into the running state, any time dependencies with an expired time slot are submitted straight away. For example, if the ecflow_server is halted at 10:59 and then placed back into the running state at 11:20:

    time 11:00
    

    Then any task with an expired single time slot dependency will be submitted straight away.
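
To make the time series form concrete, the sketch below expands a series such as "time 10:00 20:00 00:15" into its individual slots. The helper names parse_hhmm and time_slots are hypothetical, the finish time is assumed to be a valid slot itself, and only the absolute form (not the relative "+" form) is modelled.

```python
from datetime import timedelta


def parse_hhmm(text):
    """Convert an HH:MM string into a timedelta since midnight."""
    hours, minutes = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def time_slots(start, finish, increment):
    """Expand 'start finish increment' (all HH:MM) into a list of HH:MM slots."""
    current = parse_hhmm(start)
    end = parse_hhmm(finish)
    step = parse_hhmm(increment)
    if step <= timedelta(0):
        raise ValueError("increment must be positive")
    slots = []
    while current <= end:  # assumption: the finish time is included
        total_minutes = int(current.total_seconds()) // 60
        slots.append(f"{total_minutes // 60:02d}:{total_minutes % 60:02d}")
        current += step
    return slots


slots = time_slots("10:00", "20:00", "00:15")
```

Each slot in the resulting list is a moment at which the server needs to be running for the task to be submitted on schedule, which is why a suspension longer than the increment can skip one or more of them.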

See also:

Python API

ecflow.Clock, ecflow.Suite.add_clock

Definition file grammar

clock

complete

Is a node status.

The node can be set to complete:

complete expression

Forces a node to be complete if the expression evaluates to true, without running any of the nodes.

This allows the user to have tasks in the suite which run only in case others fail. In practice the node would need to have a trigger also.

Command line interface (CLI)

--complete

Python API

ecflow.Expression, ecflow.Node.add_complete

Definition file grammar

complete

cron

A cron defines a time dependency for a node, similar to time, but one that will be repeated indefinitely.

See also:

Text Definition

cron

Python API

ecflow.Cron, ecflow.Node.add_cron

Definition file grammar

cron

date

This defines a date dependency for a node.

There can be multiple date dependencies. The European format is used for dates, which is dd.mm.yyyy, as in 31.12.2007. Any of the three number fields can be expressed with a wildcard * to mean any valid value. Thus, 01.*.* means the first day of every month of every year.
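
The wildcard rule can be illustrated with a short Python sketch. The function date_matches is a hypothetical helper written for this example only; it mirrors the matching rule stated above and is not the server's implementation.

```python
from datetime import date


def date_matches(pattern, day):
    """Return True if 'day' satisfies a dd.mm.yyyy pattern; '*' matches any value."""
    fields = pattern.split(".")
    if len(fields) != 3:
        raise ValueError(f"expected dd.mm.yyyy, got {pattern!r}")
    actual = (day.day, day.month, day.year)
    return all(field == "*" or int(field) == value
               for field, value in zip(fields, actual))


first_of_month = date_matches("01.*.*", date(2007, 3, 1))
second_of_month = date_matches("01.*.*", date(2007, 3, 2))
exact_day = date_matches("31.12.2007", date(2007, 12, 31))
```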

If a hybrid clock is defined, any node held by a date dependency will be set to complete at the beginning of the suite, without running the corresponding job. Otherwise under a hybrid clock the suite would never complete.

Python API

ecflow.Date, ecflow.Node.add_date

Definition file grammar

date

day

This defines a day dependency for a node.

There can be multiple day dependencies.

If a hybrid clock is defined, any node held by a day dependency will be set to complete at the beginning of the suite, without running the corresponding job. Otherwise under a hybrid clock the suite would never complete.

Python API

ecflow.Day, ecflow.Node.add_day

Definition file grammar

day

defstatus

Defines the default status for a task/family to be assigned to the node when the begin command is issued.

By default, a node is queued when begin is used on a suite. defstatus is useful for preventing suites from running automatically once begun, or for setting tasks complete so they can be run selectively.

See also:

Python API

ecflow.DState, ecflow.Node.add_defstatus

Definition file grammar

defstatus

dependencies

Dependencies are attributes of a node that can hold a task back from taking part in job creation.

They include trigger, date, day, time, today, cron, complete expression, inlimit and limit.

A task that is dependent cannot be started as long as some dependency is holding it or any of its parent nodes.

The ecflow_server will check the dependencies every minute, during normal scheduling and when any child command causes a state change in the suite definition.

directives

Directives appear in an ecf script (typically a .ecf file, but it could be a .py file). Directives start with a % character, which is referred to as the ECF_MICRO character.

The directives are used in two main contexts:

  • Preprocessing directives. In this case the directive starts as the first character on a line in an ecf script file. See the table below for the allowable values. Only one directive is allowed per line.

  • Variable directives. These use two ECF_MICRO characters, e.g. %VAR%; they can occur anywhere on the line and in any number.

    %CAR% %TYPE% %WISHLIST%
    

    These directives take part in variable substitution.

    If the micro characters are not paired (i.e. their number is uneven), variable substitution cannot take place and an error message is issued.

    port=%ECF_PORT       # error issued since '%' micro character are not paired.
    

    However, an uneven number of micro characters is allowed if the line begins with the ‘#’ comment character.

    # This is a comment line with a single micro character % no error issued
    # port=%ECF_PORT        again no error issued
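
The pairing rule for variable directives can be sketched as a simple parity check. The function below is a hypothetical illustration of the rule as stated here, not the pre-processor itself; it deliberately ignores preprocessing directive lines such as %include, which start with a single micro character and are handled separately.

```python
def micro_characters_paired(line, micro="%"):
    """Return True if variable substitution could proceed on this line."""
    if line.startswith("#"):
        # Comment lines may contain an uneven number of micro characters.
        return True
    return line.count(micro) % 2 == 0


paired = micro_characters_paired("%CAR% %TYPE% %WISHLIST%")
unpaired = micro_characters_paired("port=%ECF_PORT")
commented = micro_characters_paired("# port=%ECF_PORT")
```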
    

Directives are expanded during pre-processing. Examples include:

Symbol

Meaning

%include <filename>

%ECF_INCLUDE% directory is searched for the filename and the contents included into the job file. If that variable is not defined ECF_HOME is used. If the ECF_INCLUDE is defined but the file does not exist, then we look in ECF_HOME. This allows specific files to be placed in ECF_INCLUDE and the more general/common include files to be placed in ECF_HOME. This is the recommended format

%include "filename"

Include the contents of the file: %ECF_HOME%/%SUITE%/%FAMILY%/filename into the job.

%include filename

Include the contents of the file filename into the output. The only form that can be used safely must start with a slash ‘/’

%includenopp filename

Same as %include, but the file is not interpreted at all.

%comment

Starts a comment, which is ended by %end directive. The section enclosed by %comment - %end is removed during pre-processing

%manual

Starts a manual, which is ended by the %end directive. The section enclosed by %manual - %end is removed during pre-processing. The manual directive is used to create the manual page shown in ecflow_ui.

%nopp

Stop pre-processing until a line starting with %end is found. No interpretation of the text will be done (i.e. no variable substitutions)

%end

End processing of %comment or %manual or %nopp

%ecfmicro CHAR

Change the directive character to the character given. If set in an include file, the effect is retained for the rest of the job (or until set again). Note that the ecfmicro directive specified in the ecf script file does not affect the variable substitution for the ECF_JOB_CMD, ECF_KILL_CMD or ECF_STATUS_CMD variables; they still use ECF_MICRO. If no ecfmicro directive exists, ECF_MICRO from the suite definition is used by default.

From ecFlow release 4.4.0, use of %VAR% (variable substitution) can be a part of the include filename. i.e.:

# %file% must be defined, on the task, or on the parent hierarchy
%include <%file%.h>

# use %INCLUDEFILE% if defined (on the task, or on the parent hierarchy,
# and MUST follow one of formats above: ".filename", "../filename", "filename",
# filename>)  otherwise use <file>
%include %INCLUDEFILE:<file>%

Care should be taken to avoid spaces in the variable values.

ecf file location algorithm

The ecflow_server and job creation checking use the following algorithm to locate the ‘.ecf’ file corresponding to a task.

Note

To search for files with a different extension, e.g. to look for python ‘.py’ files, override the ECF_EXTN variable. Its default value is ‘.ecf’.

  • ECF_SCRIPT: First it uses the generated variable ECF_SCRIPT to locate the script. This variable is generated from: ECF_HOME/<path to task>.ecf Hence if the task path is /suite/f1/f2/t1, then ECF_SCRIPT=ECF_HOME/suite/f1/f2/t1.ecf

  • ECF_FETCH (user variable): File is obtained from running the command after some postfix arguments are added. (Output of popen)

  • ECF_SCRIPT_CMD (user variable): File is obtained from running the command. (Output of popen)

  • ECF_FILES: Second it checks for the user defined ECF_FILES variable. If defined the value of this variable must correspond to a directory. This directory is searched in reverse order.

For example, assume we have a task /o/12/fc/model and ECF_FILES is defined as /home/ecmwf/emos/def/o/ECFfiles.

ecFlow will then use the following search pattern:

  1. /home/ecmwf/emos/def/o/ECFfiles/o/12/fc/model.ecf

  2. /home/ecmwf/emos/def/o/ECFfiles/12/fc/model.ecf

  3. /home/ecmwf/emos/def/o/ECFfiles/fc/model.ecf

  4. /home/ecmwf/emos/def/o/ECFfiles/model.ecf

If the directory does not exist, the server will try variable substitution. This allows additional configuration:

edit ECF_FILES /home/ecmwf/emos/def/o/%FILE_DIR:ECFfiles%

The search can be reversed, by adding a variable ECF_FILES_LOOKUP, with a value of “prune_leaf” (from ecFlow 4.12.0). Then ecFlow will use the following search pattern.

  1. /home/ecmwf/emos/def/o/ECFfiles/o/12/fc/model.ecf

  2. /home/ecmwf/emos/def/o/ECFfiles/o/12/model.ecf

  3. /home/ecmwf/emos/def/o/ECFfiles/o/model.ecf

  4. /home/ecmwf/emos/def/o/ECFfiles/model.ecf

However, please be aware that this also affects the search in ECF_HOME.

  • ECF_HOME: Thirdly it searches for the script in reverse order using ECF_HOME (i.e. like ECF_FILES). If this fails, then the task is placed into the aborted state. It is possible to check that the file can be located before loading the suites into the server.

Note: The addition of variable with a name ECF_FILES_LOOKUP and value ‘prune_leaf’, affects the search in BOTH ECF_FILES and ECF_HOME
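
The two search orders listed above can be reproduced with a short Python sketch. The function candidate_scripts is a hypothetical helper that only generates the candidate paths; it does not model ECF_FETCH, ECF_SCRIPT_CMD or the variable substitution fallback.

```python
def candidate_scripts(base_dir, task_path, extn=".ecf", prune_leaf=False):
    """List the script paths tried under base_dir for a task such as /o/12/fc/model."""
    *parents, task = task_path.strip("/").split("/")
    candidates = []
    for drop in range(len(parents) + 1):
        if prune_leaf:
            # ECF_FILES_LOOKUP = prune_leaf: drop the directory nearest the task first.
            kept = parents[:len(parents) - drop]
        else:
            # Default: drop the directory nearest the root first.
            kept = parents[drop:]
        candidates.append("/".join([base_dir.rstrip("/"), *kept, task + extn]))
    return candidates


ECF_FILES = "/home/ecmwf/emos/def/o/ECFfiles"
default_order = candidate_scripts(ECF_FILES, "/o/12/fc/model")
prune_leaf_order = candidate_scripts(ECF_FILES, "/o/12/fc/model", prune_leaf=True)
```

Both orders start with the full task path and end with the bare script name; they differ only in which intermediate directories are tried in between.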

See also:

ecf script

The ecFlow script refers to an ‘.ecf’ file.

The script file is transformed into the job file by the job creation process.

The base name of the script file must match its corresponding task, i.e. t1.ecf corresponds to the task named ‘t1’. The script, if placed in the ECF_FILES directory, may be re-used by multiple tasks belonging to different families, provided the task name matches.

The ecFlow script is similar to a UNIX shell script.

The differences, however, include the addition of “C”-like pre-processing directives and ecFlow variables. Also, the script must include calls to the init and complete child commands so that the ecflow_server is aware when the job starts (i.e. changes state to active) and finishes (i.e. changes state to complete).

ECF_DUMMY_TASK

This is a user variable that can be added to a task to indicate that there is no associated ecf script file.

If this variable is added to a suite or family, then all child tasks are treated as dummy.

This stops the server from reporting an error during job creation.

ECF_EXTN

Defines the extension for the script that will be turned into a job file. The default value is ‘.ecf’, but it could be any extension. This is used by the server as part of the ‘ecf file location algorithm’.

ECF_FETCH

Experimental

This is used to specify a command whose output can be used as a job script. The ecFlow server runs the command with popen, so great care must be taken to avoid commands that can hang, as this could severely affect the server's ability to schedule jobs.

edit ECF_FETCH my_custom_cmd.sh

After variable substitution, the server appends the following arguments:

my_custom_cmd.sh -s <task_name>.<ECF_EXTN>   # to extract the script and create the job
my_custom_cmd.sh -i                          # to extract the includes
my_custom_cmd.sh -m <task_name>.<ECF_EXTN>   # to extract the manual, i.e. for display in the info tab
my_custom_cmd.sh -c <task_name>.<ECF_EXTN>   # to extract the comments

The output of running these commands (-s) is used to create the job.

ECF_HOME

This is a user-defined variable; it has four functions:

  • it is used as a prefix portion of the path of the job files created by ecFlow server; see the description of the ECF_JOB generated variable.

  • it is a default directory where ecFlow server looks for scripts (with the file extension defined by ECF_EXTN; the default is .ecf); overridden by the ECF_FILES user-defined variable. See the “ecf file location algorithm” entry for more detail.

  • it is a default directory where ecFlow server looks for include files; overridden by ECF_INCLUDE user defined variable. See the “directives” entry for more detail.

  • it is used as a default prefix portion of the job output path (the ECF_JOBOUT generated variable); overridden by ECF_OUT user defined variable. See descriptions of ECF_JOBOUT and ECF_OUT variables for more detail.

ECF_INCLUDE

This is a user defined variable. It is used to specify directory locations, that are used to search for include files.

edit ECF_INCLUDE /home/fred/course/include           # a single directory
edit ECF_INCLUDE /home/fred/course/include:/home/fred/course/include2:/home/fred/course/include_me  # set of directories to search
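
As a rough sketch of how a colon-separated ECF_INCLUDE could be searched, with ECF_HOME as the fallback described under directives, consider the Python snippet below. The helper find_include is hypothetical, and searching the directories in the order they are listed is an assumption.

```python
import os
import tempfile


def find_include(filename, ecf_include, ecf_home):
    """Return the first existing match for '%include <filename>', or None."""
    search_dirs = [d for d in (ecf_include or "").split(":") if d]
    search_dirs.append(ecf_home)  # fall back to ECF_HOME
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    include1 = os.path.join(root, "include")
    include2 = os.path.join(root, "include2")
    home = os.path.join(root, "home")
    for directory in (include1, include2, home):
        os.makedirs(directory)
    # head.h lives in the second include directory, tail.h only in ECF_HOME.
    open(os.path.join(include2, "head.h"), "w").close()
    open(os.path.join(home, "tail.h"), "w").close()

    ecf_include = f"{include1}:{include2}"
    head_in_include2 = find_include("head.h", ecf_include, home) == os.path.join(include2, "head.h")
    tail_in_home = find_include("tail.h", ecf_include, home) == os.path.join(home, "tail.h")
    not_found = find_include("other.h", ecf_include, home)
```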

ECF_JOB

This is a generated variable. It defines the path name location of the job file.

The variable is composed as:

ECF_HOME/ECF_NAME.job<ECF_TRYNO>

ECF_JOB_CMD

This variable should point to a script that can submit the job (e.g. to a queuing system such as SLURM or PBS).

The ecFlow server detects abnormal termination of this command. Hence, when an error occurs, the job file should call ‘ecflow_client --abort’ and then exit cleanly. Otherwise the server detects abnormal job termination and sets the abort flag, which prevents the job from being re-queued (due to ECF_TRIES).

If the job also sends an abort, zombies can be created. If ECF_JOB_CMD command fails, and the task is in a submitted state, then the task is set to the aborted state. However if the task was active or complete, then we do NOT abort the task. Instead the zombie flag is set. (since ecFlow 4.17.1)

ECF_JOBOUT

This is a generated variable. It defines the path name for the job output file. The variable is composed as follows.

If ECF_OUT is specified:

ECF_OUT/ECF_NAME.ECF_TRYNO

otherwise:

ECF_HOME/ECF_NAME.ECF_TRYNO
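
The composition rules for ECF_JOB and ECF_JOBOUT can be written out as a small Python sketch. The functions ecf_job and ecf_jobout and the sample paths are hypothetical; the sketch assumes ECF_NAME starts with a slash, as in /suite/family/task.

```python
def ecf_job(ecf_home, ecf_name, ecf_tryno):
    """ECF_HOME/ECF_NAME.job<ECF_TRYNO>"""
    return f"{ecf_home}{ecf_name}.job{ecf_tryno}"


def ecf_jobout(ecf_home, ecf_name, ecf_tryno, ecf_out=None):
    """ECF_OUT/ECF_NAME.ECF_TRYNO if ECF_OUT is specified, else ECF_HOME/ECF_NAME.ECF_TRYNO"""
    base = ecf_out if ecf_out else ecf_home
    return f"{base}{ecf_name}.{ecf_tryno}"


job = ecf_job("/home/fred/course", "/test/f1/t1", 1)
output_default = ecf_jobout("/home/fred/course", "/test/f1/t1", 1)
output_remote = ecf_jobout("/home/fred/course", "/test/f1/t1", 2, ecf_out="/scratch/out")
```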

ECF_LISTS

This is a server variable. It specifies the path to the white list file. This file controls who has read/write access to the server via the user commands.

The user name can be found using the Linux id command and is typically the login name. The file has a very simple format.

The file path specified by the ECF_LISTS environment variable is read by the server on start up. The contents of the white list can be modified and reloaded by the server. (However, the path to the white list file can NOT be modified after the server has started.)

If ECF_LISTS is not set, the server will look for a file named <host>.<port>.ecf.lists (e.g. my_host.3141.ecf.lists) in the same directory where the server was started.

If the file specified by ECF_LISTS or <host>.<port>.ecf.lists, does not exist or exists but is empty, then all users will have read/write access to suites on the server. Special care must be taken, so that user reloading the white list file does not remove write access for the administrator.

Listing 122 Re-load white list file
 ecflow_client --help=reloadwsfile
 ecflow_client --reloadwsfile
Listing 123 Read write access for specific users
 4.4.14   # this is a comment, the first non-comment line must include a version.

 # These users have read and write access to the server
 uid1  # user uid1,uid2,cog have read and write access to the server
 uid2
 cog

 # Read only users
 -fred  # users fred,bill and jake have read only access
 -bill
 -jake
Listing 124 Example where all users have read access
 4.4.14   # this is a comment, the first non-comment line must include a version.

 # These users have read and write access to the server
 uid1  # user uid1,uid2,cog have read and write access to the server
 uid2
 cog

 # User with read access
 -*    # all users have read access
Listing 125 From ecFlow release 4.1.0, users can be restricted via node paths
 4.4.5
 fred             # has read /write access to all suites
 -joe             # has read access to all suites

 *  /x /y    # all users have read/write access to suites /x /y
 -* /w /z    # all users have read access to suites /w /z

 user1 /a,/b,/c  # user1 has read/write access to suite /a /b /c
 user2 /a
 user2 /b
 user2 /c       # user2 has read write access to suite /a /b /c
 user3 /a /b /c # user3 has read write access to suite /a /b /c

 -user4 /a,/b,/c  # user4 has read access to suite /a /b /c
 -user5 /a
 -user5 /b
 -user5 /c    # user5 has read access to suite /a /b /c
 -user6 /a /b /c   # user6 has read access to suite /a /b /c
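
The white list format shown in the listings above can be read with a few lines of Python. The parser below is a hypothetical sketch inferred from those examples only (version on the first non-comment line, a leading '-' for read-only users, optional paths separated by commas or spaces); it is not the server's parser, and an empty path list is taken to mean all suites.

```python
import re


def parse_white_list(text):
    """Return (version, entries) parsed from white list file contents."""
    version = None
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if version is None:
            version = line  # first non-comment line holds the version
            continue
        user, *paths = re.split(r"[,\s]+", line)
        read_only = user.startswith("-")
        entries.append({
            "user": user[1:] if read_only else user,
            "access": "read" if read_only else "read/write",
            "paths": paths,  # empty list: applies to all suites
        })
    return version, entries


SAMPLE = """\
4.4.5
fred             # read/write access to all suites
-joe             # read access to all suites
*  /x /y         # all users: read/write access to /x and /y
user1 /a,/b,/c   # read/write access to /a, /b and /c
-user6 /a /b /c  # read access to /a, /b and /c
"""

version, entries = parse_white_list(SAMPLE)
```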

ECF_MICRO

This is a generated variable. The default value is %. This variable is used in variable substitution during command invocation and as the default directive character during pre-processing. It can be overridden, but must be replaced by a single character.

ECF_NAME

This is a generated variable. It defines the path name of the task. It is typically used inside a script file to refer to the corresponding task.

Listing 126 t1.ecf
 %include <head.h>
 ....
 ecflow_client --alter change variable "fred" "bill" %ECF_NAME% # change variable on corresponding task
 ...
 %include <tail.h>

ECF_NO_SCRIPT

This is a user variable, that can be added to a node (introduced with ecFlow release 4.3.0). It is used to inform the ecflow_server that there is no SCRIPT associated with a task. However unlike ECF_DUMMY_TASK, the task can still be submitted provided the ECF_JOB_CMD is set up.

This is suitable for very lightweight tasks that want to minimize latency. The output can still be seen, if it is redirected to ECF_JOBOUT. Care must be taken to ensure the path to ecflow_client is accessible.

Listing 127 ECF_NO_SCRIPT examples
family no_script
   edit ECF_NO_SCRIPT "1"  # the server will not look for .ecf files
   edit ECFLOW_CLIENT ecflow_client
   edit DIROUT %VERBOSE%
   edit SILENT ""
   edit VERBOSE " > %ECF_JOBOUT% 2>&1"

   task non_script_task
      edit ECF_JOB_CMD "export ECF_PASS=%ECF_PASS%;export ECF_PORT=%ECF_PORT%;export ECF_HOST=%ECF_HOST%;export ECF_NAME=%ECF_NAME%;export ECF_TRYNO=%ECF_TRYNO%; %ECFLOW_CLIENT% --init=$$; echo 'test test_ecf_no_script' %DIROUT% && %ECFLOW_CLIENT% --complete"
      # this command is not expected to fail, hence no error handling (i.e. on failure the task will stay active)

   task ecf_no_script
      edit ECF_JOB_CMD "ecf_no_script --pass %ECF_PASS% --host %ECF_HOST% --port %ECF_PORT% " # %DIROUT%
      # ecf_no_script contains init, complete, call to ecflow_client and trapping to raise abort
      # use this approach for robust error handling

   task ymd2jul
      edit ECF_JOB_CMD "ECF_PASS=%ECF_PASS% ECF_NAME=%ECF_NAME% /usr/local/bin/ymd2jul.sh -p %ECF_PORT% -n %ECF_HOST% -r /%SUITE%/%FAMILY% -y %YMD% > %ECF_JOBOUT% 2>&1 &"
      # /usr/local/bin/ymd2jul.sh can be called on command line or as ecflow_client
endfamily

ECF_OUT

This is a user/suite variable that specifies a directory path. It controls the location of job output (stdout and stderr of the process) on a remote file system, providing an alternative location for the job and cmd output files. If it exists, it is used as the base for ECF_JOBOUT, and ecFlow also uses it to search for the output when asked by ecflow_ui/ecflow_client. If the output is in ECF_OUT/ECF_NAME.ECF_TRYNO it is returned, otherwise ECF_HOME/ECF_NAME.ECF_TRYNO is used.

The user must ensure that all the directories exist, including suite/family. If this is not done, the task may remain stuck in a submitted state. At ECMWF, our submission scripts ensure that the directories exist.

ECF_PASS

This is a generated variable. During the job generation process, the server generates a unique password and stores it in the task. It then replaces %ECF_PASS% in the scripts (.ecf) with the actual value. When the job runs, ecflow_client reads this as an environment variable and passes it to the server. The server then compares this password with the one held on the task. This is used as part of the authentication for child commands, and is used to detect zombies.

The authentication process can be bypassed, allowing the job to proceed (e.g. when the user is sure that there is only a single process trying to communicate with the server), by adding it as a user variable:

ecflow_client --alter add variable ECF_PASS FREE  <path to task>

This functionality is also available in the GUI. Select a task, then RMB > Special > Free password. However, it is important not to leave this in place, as it will always bypass the authentication. Delete the variable once it is no longer needed.

ECF_PASSWD

This is an environment variable, which points to a password file for both client and server. This enables password based authentication for ecFlow user commands. The password file is required for the client and server.

Listing 128 Example client password file. The same file can be used for multiple servers
4.5.0
# <user> <host> <port> <passwd>
user1 machine1 3141 xxxty
user1 machine2 3142 shhert
Listing 129 Example server password file for machine1 and port 3141
4.5.0
user1 machine1 3141 xxxty
user2 machine1 3141 bbsdd7

The server administrator needs to set Unix file permissions, so that this file is only readable by ecFlow server and the administrator.

ECF_SCRIPT

This is a generated variable. It defines the path name for the ecf script.

ECF_SCRIPT_CMD

Experimental

This allows the output of running a command to be treated as a script. The command is run after variable substitution. The output is obtained by running the system function popen in the server. Great care should be taken when running this command, to ensure errors in the command do not crash the server. This approach could be used for short-lived tasks, where extremely low latency is required. Commands that take more than 20s can interfere with job scheduling and should be avoided. It could possibly be used to check out a script from a version control system.

If the output contains %include, %manual or %nopp, they are treated in the same manner as in a normal ‘.ecf’ script.

Listing 130 Here the output of the ‘cat’ command is treated as a script
suite test
   family family
      task check
         edit ECF_SCRIPT_CMD "cat /tmp/ECF_SCRIPT_CMD/family/check.ecf"
      task t1
         trigger check == complete
         edit ECF_SCRIPT_CMD "cat /tmp/ECF_SCRIPT_CMD/family/t1.ecf"
   endfamily
endsuite

ECF_STATUS_CMD

User-defined variable defining the command run by ecflow_client --status. It invokes a user-supplied (shell) command that queries the status of the job.

The command should be written in such a way that the output is written to %ECF_JOB%.stat, and if the script determines that the job is not active, it should abort the task in ecFlow. This command can be particularly useful when nodes on the supercomputer go down, and we don’t know the true state of the jobs.

The status command can be invoked from the Command line interface (CLI) and ecFlowUI. If applied to a family or suite, the command will be run hierarchically. In ecFlowUI use the Status tab in the Info panel or use Special > Status from the node context menu to run it and see the output.

The code below allows the output of the status command to be shown by the --file command on the command line, and automatically via the Status tab in ecFlowUI:

suite s1
   edit ECF_STATUS_CMD /home/ma/emos/bin/ecfstatus  %USER% %HOST% %ECF_RID% %ECF_JOB% > %ECF_JOB%.stat 2>&1
....
endsuite
Listing 131 Invoking status cmd, from the command line
ecflow_client --status=/s1/f1/t1     # ECF_STATUS_CMD should output to %ECF_JOB%.stat
ecflow_client --file=/s1/f1/t1 stat  # Return contents of %ECF_JOB%.stat file

ECF_TRIES

This is a generated variable added at the server level with a default value of 2. It can be overridden by the user and controls the number of times a job is re-run should it abort, provided that:

  • the task/job has NOT been killed (user action)

  • the job process (created from .ecf or .py) exited cleanly, and not with exit 1 or sys.exit(1), as process death is captured by the server. Always ensure your script exits cleanly, i.e. exit(0)

  • the task has NOT been set to aborted by the user (user action)

  • job creation has not failed, i.e. task pre-processing (include file expansion, variable substitution, change of file permission for the job file)

  • the value of the variable ECF_TRIES is convertible to an integer.

Please note this allows your scripts to be aware of the number of times they have been run, e.g.:

Listing 132 task.ecf
 %include <head.h>
 echo "do some work"
 if [ %ECF_TRYNO% -eq 1 ] ; then
    echo "first attempt"
    .....
 fi
 if [ %ECF_TRYNO% -eq 2 ] ; then
    echo "first attempt failed, trying a different approach, clean data, etc"
    .....
 fi
 %include <tail.h>

ECF_TRYNO

This is a generated variable that is used in file name generation. It represents the current try number for the task.

After begin it is set to 1. The number is advanced if the job is re-run. It is reset to 1 after a re-queue. It is used in output and job file numbering (i.e. it avoids overwriting the job file output during multiple re-runs).

ecFlow

ecFlow is the ECMWF workflow manager.

It is a general purpose application designed to schedule a large number of computer processes in a heterogeneous environment.

It helps with the design, submission and monitoring of computer jobs in both the research and operations departments.

ecflow_client

This executable provides the ecFlow Command line interface (CLI); it is used for all communication with the ecflow_server.

To see the full range of commands that can be sent to the ecflow_server type the following in a UNIX shell:

ecflow_client --help

This functionality is also provided by the Python API.

The following variables affect the execution of ecflow_client.

Since the ecf script can call ecflow_client (i.e. a child command), some of these variables are typically set in an include header, e.g. head.h.

Table 7 Environment variables common for User and Task commands

=============  ==================================================  ==========  ================================================
Variable Name  Explanation                                         Compulsory  Example
=============  ==================================================  ==========  ================================================
ECF_PORT       Port number of the ecflow_server. Must match the    Yes/No      ecflow_client --port 3141
               ecflow_server.                                                  (an alternative to setting ECF_PORT)
ECF_HOST       Name of the host running the ecflow_server.         Yes/No      ecflow_client --host machine1
                                                                               (an alternative to setting ECF_HOST)
NO_ECF         If set, ecflow_client exits immediately with        No          export NO_ECF=1
               success. This allows scripts to be tested
               independently of the server.
ECF_DENIED     If the server denies client communication and       No          export ECF_DENIED=1
               this flag is set, exit with an error. Avoids a
               24 hour connection attempt to the ecflow_server.
ECF_SSL        For secure communication between server and         No          # To share a certificate amongst multiple servers
               client; requires a build with SSL enabled.                      export ECF_SSL=1 # or empty value
                                                                               # To use a specific server certificate
                                                                               export ECF_SSL=<any non-empty value, except '1'>
=============  ==================================================  ==========  ================================================

When ECF_SSL=1, ecflow will search for a shared certificate at $HOME/.ecflowrc/ssl/server.crt, and then fall back to the server-specific certificate at $HOME/.ecflowrc/ssl/<host>.<port>.crt.

Secure communication can also be activated using the ecflow_client --ssl ... option. When using the --ssl option, if ECF_SSL is not explicitly specified, ECF_SSL=1 is assumed.
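
The certificate lookup order described above can be sketched as follows. This is only an illustration of the documented search order, not ecFlow's actual implementation: the function name certificate_candidates and the example host and home directory are hypothetical, and only the two certificate paths and the ECF_SSL rules come from this glossary.

```python
from pathlib import PurePosixPath


def certificate_candidates(ecf_ssl, host, port, home):
    """Return certificate paths in the order they would be tried.

    Hypothetical helper: it only mirrors the search order described in the text.
    """
    ssl_dir = PurePosixPath(home) / ".ecflowrc" / "ssl"
    shared = ssl_dir / "server.crt"
    specific = ssl_dir / f"{host}.{port}.crt"
    if ecf_ssl in ("1", ""):
        # Shared certificate first, then fall back to the server-specific one
        return [str(shared), str(specific)]
    # Any other non-empty value: use the server-specific certificate only
    return [str(specific)]


for path in certificate_candidates("1", "machine1", 3141, "/home/user"):
    print(path)
```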

Table 8 Environment variables for Task commands

Variable Name

Explanation

Compulsory

Example

ECF_NAME

Path to the task

Yes

/suite/family/task

ECF_PASS

Job password. Generated by the server; it replaces %ECF_PASS% in the scripts during job generation. Used for authenticating child commands.

Yes

(generated)

ECF_RID

Remote ID. Allows easier job kill, and helps to disambiguate a zombie.

Yes

(generated)

ECF_TRYNO

The number of times the job has run. This is allocated by the server and used in job/output file name generation.

No

(generated)

ECF_HOSTFILE

File that lists alternate hosts to try, if connection to main host fails

No

$HOME/.echostfile

ECF_HOSTFILE_POLICY

The policy, one of “task” or “all”, indicates when to perform a retry based on the ECF_HOSTFILE. The default policy is “task”, meaning that the retry will only be performed for task (i.e. child) commands. If the policy is “all”, the retry will be performed for both task and user commands (including ping).

No

export ECF_HOSTFILE_POLICY=all

ECF_TIMEOUT

Maximum time (in seconds) for the client to deliver message

No

default value: 24 * 60 * 60 # i.e. 24 hours

export ECF_TIMEOUT=36024*3600

ECF_CONNECT_TIMEOUT

Maximum time (in seconds) for the client to establish connection

No

default value: 0

ECF_ZOMBIE_TIMEOUT

Maximum time (in seconds) for the zombie Task client (performing init, abort, complete, etc) to get a reply from the server.

No

12*3600 (default value):

export ECF_ZOMBIE_TIMEOUT=36024*3600
Table 9 Variables specific to user commands

Variable Name

Explanation

Compulsory

Example

ECF_PASSWD

path to the client password file, used for password based authentication

No

export ECF_PASSWD=mymachine.3141.ecf.passwd

ECF_USER

Used when a user needs to pose as another user, i.e. when the user’s ID on the client machine doesn’t match their ID on the remote server. Requires a password file.

No

export ECF_USER=my_user_name

To avoid setting the environment variable we can use:

ecflow_client --user my_user_name ......
ecflow_server

This executable is the server.

It is responsible for scheduling the jobs and responding to ecflow_client requests.

Multiple servers can be run on the same machine/host, provided each is assigned a unique port number.

The server records all requests in the log file.

The server will periodically (see ECF_CHECKINTERVAL) write out a check point file.

The following environment variables control the execution of the server and may be set before the start of the server. ecflow_server will start happily without any of these variables being set, since all of them have default values.

Variable Name

Explanation

Default value

ECF_HOME

Home for all the ecFlow files

Current working directory

ECF_PORT

Server port number. Must be unique

3141

ECF_LOG

History or log file

<host>.<port>.ecf.log

ECF_CHECK

Name of the checkpoint file

<host>.<port>.ecf.check

ECF_CHECKOLD

Name of the backup checkpoint file

<host>.<port>.ecf.check.b

ECF_CHECKINTERVAL

Interval in seconds at which the checkpoint file is saved

120

ECF_LISTS

White list file. Controls read/write access to the server for each user

<host>.<port>.ecf.lists

ECF_TASK_THRESHOLD

Report in the log file all tasks/jobs that take longer than the given threshold. Used to debug/instrument scripts that are very large.

4000 (milliseconds). Before release 4.0.6 default was 2000 ms.

ECF_PASSWD

path to server password file, used to authenticate user commands. Use when ALL should be password authenticated

<host>.<port>.ecf.passwd

ECF_CUSTOM_PASSWD

path to server password file, used to authenticate user commands. Use when a small number of users need to be password authenticated. Typically the client would use either: ecflow_client --user=fred ... or: export ECF_USER=fred; ecflow_client ...

<host>.<port>.ecf.custom_passwd

ECF_PRUNE_NODE_LOG

When the checkpoint file is loaded, node log history older than 30 days is automatically pruned. This variable allows that value to be changed. Setting the variable to zero means there will be no pruning: all history is preserved, at the cost of increased server memory and increased time taken to write the checkpoint file.

export ECF_PRUNE_NODE_LOG=40

Prune node log history older than 40 days, upon reload of check point file.

ECF_SSL

For secure communication between server and client – requires build with SSL enabled.

# To share a certificate amongst multiple servers
export ECF_SSL=1 # or empty value
# To use specific server certificate
export ECF_SSL=<any non-empty value, except '1'>

When ECF_SSL=1, ecflow will search for a shared certificate at $HOME/.ecflowrc/ssl/server.crt, and then fall back to the server-specific certificate at $HOME/.ecflowrc/ssl/<host>.<port>.crt.

Secure communication can also be activated using the ecflow_server --ssl ... option. When using the --ssl option, if ECF_SSL is not explicitly specified, ECF_SSL=1 is assumed.

Consider using ecflow_start.sh -s to start the server with SSL support.

The server can be in several states. The default when first started is halted. See server states.

ecflow_ui

The ecflow_ui executable is the new GUI-based client. It is used to visualise and monitor the hierarchical structure of the suite definition.

event

The purpose of an event is to signal partial completion of a task and to be able to trigger another job which is waiting for this partial completion.

Only tasks can have events and they can be considered as an attribute of a task.

There can be many events and they are displayed as nodes.

The event is updated by placing the --event child command in an ecf script.

An event has a number and possibly a name. If it is only defined as a number, its name is the text representation of the number without leading zeroes.

See also:

Command line interface (CLI)

event

Python API

ecflow.Event, ecflow.Node.add_event

Definition file grammar

event

Events can be referenced in trigger and complete expressions.

extern

This allows an external node to be used in a trigger expression.

All nodes in triggers must be known to ecflow_server by the end of the load command. No cross-suite dependencies are allowed unless the names of tasks outside the suite are declared as external. An external trigger reference is considered unknown if it is not defined when the trigger is evaluated. You are strongly advised to avoid cross-suite dependencies.

Families and suites that depend on one another should be placed in a single suite. If you think you need cross-suite dependencies, you should consider merging the suites together and have each as a top-level family in the merged suite.

For grammar see extern.

family

A family is an organisational entity that is used to provide hierarchy and grouping. It consists of a collection of tasks and families.

Typically you place tasks that are related to each other inside the same family, analogous to the way you create directories to contain related files. For python see ecflow.Family. For BNF see family

It serves as an intermediate node in a suite definition.

generic

A generic attribute associates a name to a set of generic string values, and is used to gracefully indicate the presence of unknown attributes in the suite definition.

This kind of attribute is used to allow the introduction of future attributes without requiring an API change. When an older version of ecflow encounters a new/unknown attribute, the attribute is automatically converted into a generic attribute.

Warning

The user is strongly advised not to include generic attributes in suite definitions.

halted

Is an ecflow_server state. See server states.

hybrid clock

A hybrid clock is a complex notion: the date and time are not connected.

The date has a fixed value during the complete execution of the suite. This will be mainly used in cases where the suite does not complete in less than 24 hours. This guarantees that all tasks of this suite are using the same date. On the other hand, the time follows the time of the machine.

Hence the date never changes unless specifically altered or unless the suite restarts, either automatically or from a begin command.

Under a hybrid clock any node held by a date or day dependency will be set to complete at the beginning of the suite (i.e. without its job ever running). Otherwise the suite would never complete.

inlimit

The inlimit works in conjunction with limit/ecflow.Limit for providing simple load management. inlimit is added to the node that needs to be limited.

Listing 133 Limiting tasks, only allow 5 tasks to run in parallel
suite suite
   limit disk 100
   family anon
      inlimit /suite:disk 5
      task t1
      ...
      task t100
   endfamily
endsuite
Listing 134 Limiting Families, only two families can run in parallel. The tasks are unconstrained
   suite test
      limit fam 2
      family f1
         inlimit -n fam
         task t1
         ....
      endfamily
      family f2
         inlimit -n fam
         task t1
         ....
      endfamily
      family f3
         inlimit -n fam
         task t1
         ....
      endfamily
   endsuite
Listing 135 Limit submission
   # Hence we could have more than 2 active jobs, since we only control the number in the submitted state.
   # If we removed the -s then we could only have two active jobs running at one time
   suite test_limit_on_submission
      limit disk 2
      family anon
         inlimit -s disk   # Inlimit submission
         task t1
         task t2
         ....
      endfamily
   endsuite

See also:

Python API

ecflow.InLimit, ecflow.Node.add_inlimit

Definition file grammar

inlimit

job creation

Job creation or task invocation can be initiated manually via ecflow_ui, but also by the ecflow_server during scheduling when a task (and all of its parent nodes) is free of its dependencies.

The process of job creation includes:

The steps above transform an ecf script into a job file that can be submitted by performing variable substitution on the ECF_JOB_CMD variable and invoking the command.

The running jobs will communicate back to the ecflow_server by calling child commands.

This causes status changes on the nodes in the ecflow_server and flags can be set to indicate various events.

If a task is to be treated as a dummy task (i.e. it is used as a scheduling task) and is not meant to be run, then a variable named ECF_DUMMY_TASK can be added:

task.add_variable("ECF_DUMMY_TASK","")
job file

The job file is created by the ecflow_server during job creation, using the ECF_TRYNO variable.

It is derived from the ecf script after expanding the pre-processing directives.

It has the form <task name>.job<ECF_TRYNO>, e.g. t1.job1.

Note that job creation checking will create a job file with an extension of zero, i.e. ‘.job0’. See ecflow.Defs.check_job_creation.

When the job is run, the output file has the ECF_TRYNO as its extension, e.g. t1.1, where ‘t1’ is the task name and ‘1’ is the ECF_TRYNO.
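
The naming scheme above can be illustrated with a short sketch. The two helper functions are hypothetical and exist only to show how the try number keeps the files of successive runs apart; they are not part of the ecFlow API.

```python
def job_file_name(task_name, try_no):
    """Job file name of the form <task name>.job<ECF_TRYNO>."""
    return f"{task_name}.job{try_no}"


def output_file_name(task_name, try_no):
    """Output file name: the task name with ECF_TRYNO as the extension."""
    return f"{task_name}.{try_no}"


# Each re-run advances the try number, so earlier files are not overwritten
for try_no in (1, 2):
    print(job_file_name("t1", try_no), output_file_name("t1", try_no))
```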

label

A label has a name and a value and is a way of displaying information in ecflow_ui.

By placing label child commands in the ecf script, the user can be informed about progress in ecflow_ui.

Labels can be added to family nodes. To change the labels, scripts should use:

ecflow_client --alter change label <label_name> <new_value> /path/to/family_node/with/label

If the label child command results in a zombie, then the default action is for the server to fob; this allows the ecflow_client command to exit normally (i.e. without any errors). This default can be overridden by using a zombie attribute.

Command line interface (CLI)

label, add, alter

Python API

ecflow.Label, ecflow.Node.add_label

Definition file grammar

label

late

Define a tag for a node to be late. A node can only have one late attribute. The late attribute only applies to a task. You can define it on a suite/family, in which case it will be inherited. Any late defined lower down the hierarchy will override the aspect (submitted, active, complete) defined higher up.

Command options:

  • -s submitted: The time a node can stay submitted (format [+]hh:mm). submitted is always relative, so + is simply ignored, if present. If the node stays submitted longer than the time specified, the late flag is set.

  • -a active: The time of day by which the node must have become active (format hh:mm). If the node is still queued or submitted, the late flag is set.

  • -c complete: The time by which the node must become complete (format [+]hh:mm). If relative, time is taken from the time the node became active, otherwise the node must be complete by the time given.

suite late
   family familyName
      task t1
            late -s +00:15 -a 20:00 -c +02:00
      task t2
            late -a 20:00 -c +02:00 -s +00:15
      task t3
            late -c +02:00 -a 20:00  -s +00:15
      task t4
            late  -s 00:02 -c +00:05
      task t5
            late  -s 00:01 -a 14:30 -c +00:01
   endfamily
endsuite

Suites cannot be late, but you can define a late tag for submitted in a suite, to be inherited by the families and tasks. When a node is classified as being late, the only action ecflow_server takes is to set a flag. ecflow_ui will display these alongside the node name as an icon (and optionally pop up a window).

suite late
   late -s +00:15    # report late for all task taking longer than 15 minutes in submitted state
   family familyName
      late -c +02:00 # all child task that take longer than 2 hours to complete should raise a late flag
      task t1
            # effective late -s +00:05 -c +02:00
            late -s +00:05
      task t2
            # effective late  -s +00:15 -c +02:00
      task t5
            # effective late  -c +03:00 -a 18:00 -s +00:15
            late -c +03:00 -a 18:00
   endfamily
endsuite
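
The inheritance shown in the listing above can be modelled as a simple merge, where each level overrides only the aspects it defines. This is a minimal sketch under that assumption; the function effective_late and the dictionary representation are hypothetical and are not how ecFlow stores the attribute.

```python
def effective_late(*levels):
    """Merge late settings from the top of the hierarchy down to the task.

    Each level is a dict keyed by aspect ("s", "a", "c"); lower levels
    override only the aspects they define.
    """
    merged = {}
    for level in levels:
        merged.update(level)
    return merged


suite = {"s": "+00:15"}
family = {"c": "+02:00"}

print(effective_late(suite, family, {"s": "+00:05"}))                # t1
print(effective_late(suite, family, {}))                             # t2
print(effective_late(suite, family, {"c": "+03:00", "a": "18:00"}))  # t5
```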

The late attribute can be added to, or deleted from, any suite/family/task.

ecflow_client --alter add    late "-s 00:15" <path-to-node>
ecflow_client --alter change late "-s 00:01 -a 14:30 -c +00:01" <path-to-node>
ecflow_client --alter delete late <path-to-node>

See also:

Command line interface (CLI)

add, alter

Python API

ecflow.Late, ecflow.Node.add_late

Definition file grammar

late

limit

Limits provide simple load management by limiting the number of tasks submitted by a specific ecflow_server. Typically you either define limits on suite level or define a separate suite to hold limits so that they can be used by multiple suites.

Setting limits in a separate suite has the benefit that, by setting the limit value to zero, you can control task submission over a number of suites.

Listing 136 Limits
suite suiteName
   limit sg1  10
   limit mars 10
endsuite

The limits are used in conjunction with inlimit.

The limit max value can be changed on the command line:

ecflow_client --alter change limit_max <limit-name> <new-limit-value> <path-to-limit>
ecflow_client --alter change limit_max limit 2 /suite

It can also be changed in Python:

import ecflow

try:
   ci = ecflow.Client()
   # Set the maximum of the limit named "limit" on node /suite to 2
   ci.alter("/suite", "change", "limit_max", "limit", "2")
except RuntimeError as e:
   print("Failed: " + str(e))

See also:

Command line interface (CLI)

add, alter

Python API

ecflow.Limit, ecflow.Node.add_limit

Definition file grammar

limit

manual page

Manual pages are part of the ecf script.

This is to ensure that the manual page is updated when the ecf script is updated. The manual page is a very important operational tool allowing you to view a description of a task, and possibly describing solutions to common problems. The pre-processing can be used to extract the manual page from the script file and is visible in ecflow_ui. The manual page is the text contained within the %manual and %end directives. They can be seen using the Manual tab in the Info panel in ecflow_ui.

The text in the manual page is not included in the job file.

There can be multiple manual sections in the same ecf script file. When viewed they are simply concatenated. It is good practice to modify the manual pages when the script changes.

The manual page may have the %include directives.

Suites and families may also have a manual page. These will also be available in the GUI. ecFlow will look for a file <node_name>.man (where node_name is the name of the suite or family) using a backwards search algorithm, first in the ECF_FILES directory, then in the ECF_HOME directory. Note that errors in variable pre-processing are ignored inside a manual section. It should also be noted that for family and suite manuals, the %manual and %end directives are not strictly necessary, as the whole file is treated as a manual.

If we have family /suite/big/f1, ecFlow will search for “f1.man” in:

<ECF_FILES>/suite/big/f1.man
<ECF_FILES>/suite/f1.man
<ECF_FILES>/f1.man
<ECF_HOME>/suite/big/f1.man
<ECF_HOME>/suite/f1.man
<ECF_HOME>/f1.man
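
The backwards search above can be sketched as a function that lists the candidate paths in order. This is an illustration of the documented search order only; the function manual_search_paths and the example directory names are hypothetical.

```python
from pathlib import PurePosixPath


def manual_search_paths(node_path, ecf_files, ecf_home):
    """List candidate <node_name>.man paths, most specific first.

    ECF_FILES is searched before ECF_HOME; within each root one parent
    directory is dropped at every step.
    """
    parts = PurePosixPath(node_path).parts[1:]  # drop the leading "/"
    name, parents = parts[-1], parts[:-1]
    candidates = []
    for root in (ecf_files, ecf_home):
        for depth in range(len(parents), -1, -1):
            path = PurePosixPath(root, *parents[:depth], f"{name}.man")
            candidates.append(str(path))
    return candidates


for path in manual_search_paths("/suite/big/f1", "/ecf/files", "/ecf/home"):
    print(path)
```
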
meter

The purpose of a meter is to signal proportional completion of a task and to be able to trigger another job which is waiting on this proportional completion.

The meter is updated by placing the --meter child command in an ecf script. Meters can be added to family nodes. To change the meters, scripts should use:

ecflow_client --alter change meter <meter_name> <new_value> /path/to/family_node/with/meter

If the meter child command results in a zombie, then the default action is for the server to fob; this allows the ecflow_client command to exit normally (i.e. without any errors). This default can be overridden by using a zombie attribute.

See also:

Command line interface (CLI)

meter, add, alter

Python API

ecflow.Meter, ecflow.Node.add_meter

Definition file grammar

meter

Meters can be referenced in trigger and complete expressions.

mirror

A mirror is an attribute of a local Node (typically a Task) that allows it to be synchronised with a node on a remote ecFlow server. The node synchronisation includes:

  • Node status

  • Variables (User, Generated and Inherited)

  • Meters

  • Labels

  • Events

Notice that all synchronized Variables, including generated and inherited, become user variables on the local ecFlow server.

A Node with a mirror attribute will have its status periodically synchronized with the status of a node on a remote ecFlow server. The synchronised status and attributes can be used to trigger the execution of local nodes.

Note

Synchronised tasks don’t need to be provided with .ecf files on the local ecFlow server, as the execution of a Task with a mirror attribute does not happen under the responsibility of the local ecFlow server.

Operations to execute synchronised Tasks have been disabled from the ecflow_ui.

Only one mirror attribute is allowed per node, and each attribute is defined by the following properties:

  • name, an identifier

  • remote_path, the path of the node on the remote ecFlow server

  • remote_host, the remote ecFlow server host

  • remote_port, the remote ecFlow server port

  • ssl, to connect to the ecFlow server using SSL

  • polling, the value (in seconds) used to periodically contact the remote ecFlow server

  • auth, the location to the Mirror authentication credentials file

The value of the properties remote_host, remote_port, polling, and auth can be composed of Variables. When these properties are not provided, the following default values are used:

  • %ECF_MIRROR_REMOTE_HOST%, for remote_host

  • %ECF_MIRROR_REMOTE_PORT%, for remote_port

  • %ECF_MIRROR_REMOTE_POLLING%, for polling

  • %ECF_MIRROR_REMOTE_AUTH%, for auth

The following fallback values are considered when the default value is used but the variable is not actually defined:

  • in case %ECF_MIRROR_REMOTE_PORT% is not defined, the fallback value is 3141

  • in case %ECF_MIRROR_REMOTE_POLLING% is not defined, the fallback value is 120 (seconds)

  • in case %ECF_MIRROR_REMOTE_AUTH% is not defined, the fallback value is "" (empty string), which effectively disables Authentication
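
The resolution order described above (explicit property, then the default variable, then the fallback value) can be sketched as follows. This is a simplified model: the function resolve and the dictionary of variables are hypothetical, variable substitution inside explicit values is not modelled, and raising an error for an undefined host is an assumption, since no fallback is listed for it.

```python
# Variable consulted when a property is not given explicitly (from the list above)
DEFAULT_VARIABLE = {
    "remote_host": "ECF_MIRROR_REMOTE_HOST",
    "remote_port": "ECF_MIRROR_REMOTE_PORT",
    "polling": "ECF_MIRROR_REMOTE_POLLING",
    "auth": "ECF_MIRROR_REMOTE_AUTH",
}

# Fallback used when that variable is not defined (none is listed for the host)
FALLBACK = {
    "ECF_MIRROR_REMOTE_PORT": "3141",
    "ECF_MIRROR_REMOTE_POLLING": "120",
    "ECF_MIRROR_REMOTE_AUTH": "",
}


def resolve(prop, explicit, variables):
    """Return the effective value of a mirror property."""
    if explicit is not None:
        return explicit
    name = DEFAULT_VARIABLE[prop]
    if name in variables:
        return variables[name]
    if name in FALLBACK:
        return FALLBACK[name]
    # Assumption: with no fallback available, treat this as an error
    raise LookupError(f"{name} is not defined and has no fallback")


print(resolve("remote_port", None, {}))
print(resolve("polling", None, {"ECF_MIRROR_REMOTE_POLLING": "60"}))
```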

Each mirror attribute implies that a background thread is spawned whenever the ecFlow server is running (i.e. when the server is shutdown or halted the thread is terminated and the mirroring process is completely stopped). This independent background thread, responsible for polling the remote ecFlow server and periodically synchronising node status, uses the configuration available when the server is restarted.

Note

If any variables providing the configuration are updated, the Mirror configuration can be reloaded (without restarting the Server) by issuing an Alter change command with the value reload to the relevant attributes.

The authentication credentials file is expected to be in JSON, according to the following format:

{
  "username" : "<your-username>",
  "password" : "<your-password>"
}

Only the fields username and password are required; any additional fields are ignored.
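
As an illustration of the rule above, the following sketch parses a credentials document and checks only the two required fields. The function load_credentials is hypothetical and uses Python's standard json module; it is not how ecFlow itself reads the file.

```python
import json


def load_credentials(text):
    """Parse a credentials document and return (username, password)."""
    data = json.loads(text)
    missing = [key for key in ("username", "password") if key not in data]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    # Any additional fields are simply ignored
    return data["username"], data["password"]


example = '{"username": "<your-username>", "password": "<your-password>", "comment": "ignored"}'
print(load_credentials(example))
```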

node

suite, family and task form a hierarchy, where the suite serves as the root of the hierarchy, the families provide the intermediate nodes, and the tasks provide the leaves.

Collectively suite, family and task can be referred to as nodes.

For python see ecflow.Node.

pre-processing

Pre-processing takes place during job creation and acts on directives specified in the ecf script file.

This involves:

queue

Queues allow jobs that are identical, varying only in the step, to be run efficiently.

This attribute makes it possible to follow a producer(server)/consumer(tasks) pattern. Note additional task consumers can be added for load balancing.

suite test_queue
family f1
   queue q1 001 002 003 004 005 006 007
   task t
endfamily
family f2
   queue q2 1 2 3 4 5 6 8 9 10
   task a
   task b
      # notice that queue name is accessible to the trigger
      trigger /test_queue/f1:q1 > 5
   task c
      trigger ../f2/a:q2 > 9
endfamily
endsuite

The queue child command will signal when a step is active, complete, or has aborted:

# Note: because --queue is treated like a child command(init,complete,event,label,meter,abort,wait), the task path ECF_NAME is read from the environment

# The --queue command will search up the node hierarchy for the queue name. If not found it fails.

step=$(ecflow_client --queue queue_name  active)                # returns first queued/aborted step from the server and makes it active, Return "NULL" for the last step.
ecflow_client --queue queue_name complete $step                 # Tell the server that step has completed for the given queue
ecflow_client --queue queue_name aborted  $step                 # Tell the server that step has aborted for the given queue
no_of_aborted=$(ecflow_client --queue queue_name no_of_aborted) # returns as a string the number of aborted steps
ecflow_client --queue queue_name reset

The queue values can be strings, however, if they are to be used in trigger expressions, they must be convertible to integers:

suite test_queue
   family f1
      queue q1 red orange yellow green blue indigo violet
      task t
   endfamily
endsuite

See also:

Command line interface (CLI)

queue

Python API

ecflow.Queue, ecflow.Node.add_queue

Definition file grammar

queue

queued

Is a node status.

After the begin command, tasks without a defstatus are placed into the queued state.

real clock

A suite using a real clock will have its clock matching the clock of the machine. Hence the date advances by one day at midnight.

repeat

Repeats provide looping functionality. There can only be a single repeat on a node.

repeat day step [ENDDATE]   # only for suites
repeat integer VARIABLE start end [step]
repeat enumerated VARIABLE first [second [third ...]]
repeat string VARIABLE str1 [str2 ...]
repeat file VARIABLE filename
repeat date VARIABLE yyyymmdd yyyymmdd [delta]
repeat datetime VARIABLE yyyymmddTHHMMSS yyyymmddTHHMMSS [delta]
repeat datelist VARIABLE yyyymmdd(1) yyyymmdd(2) ...

The repeat variable name is available as a generated variable.

The repeat date and repeat datetime define several generated variables, prefixed by variable name:

# Provided for `repeat date` and `repeat datetime`
<variable>           # the default, the value is the current date
<variable>_YYYY      # the year
<variable>_MM        # the month
<variable>_DD        # the day of the month
<variable>_DOW       # day of the week
<variable>_JULIAN    # the julian value for the date
# Provided for `repeat datetime`
<variable>_DATE      # the date formatted as yyyymmdd
<variable>_TIME      # the time formatted as HHMMSS
<variable>_HOURS     # the hours
<variable>_MINUTES   # the minutes
<variable>_SECONDS   # the seconds

For example:

Listing 137 Repeat generated variables, accessible for trigger expressions
repeat date YMD 20090101 20220101
# The following generated variables, are accessible for trigger expressions
# YMD
# YMD_YYYY, YMD_MM, YMD_DD, YMD_DOW, YMD_JULIAN

repeat datetime DT 20090101T000000 20090102T000000 06:00:00
# The following generated variables, are accessible for trigger expressions
# DT
# DT_DATE, DT_YYYY, DT_MM, DT_DD, DT_DOW, DT_JULIAN
# DT_TIME, DT_HOURS, DT_MINUTES, DT_SECONDS

The repeat VARIABLE can be used in trigger and complete expressions.

As the repeat variable changes, so do the generated variables. (See the Repeat section of the tutorial for an example.)

Warning

If a repeat is added to a family/suite, then the repeat will ONLY loop (and automatically re-queue its children) if all the children are complete. Hence additional care needs to be taken, e.g. if the parent node has a repeat and the child has a cron attribute, then the cron will always force a re-queue on the node once it has run, and hence will stop the parent from looping.

If we use a relative time attribute (e.g. time +02:00) under a repeat, then the time is relative to the repeat re-queue.

The repeat VARIABLE can be used in trigger and complete expressions. Depending on the kind of repeat, the value can vary:

RepeatDate       -> value
RepeatDateList   -> value
RepeatString     -> index  (will always return an index)
RepeatInteger    -> value
RepeatEnumerated -> value | index  ( return value at index if cast-able to integer, otherwise return index )
RepeatDay        -> value
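
The value-or-index rule for RepeatEnumerated and RepeatString can be sketched in plain Python. The two functions below are hypothetical stand-ins that only model the rule in the table above; they are not the ecFlow implementation.

```python
def enumerated_trigger_value(values, index):
    """RepeatEnumerated: the value at index if it converts to an integer, else the index."""
    try:
        return int(values[index])
    except ValueError:
        return index


def string_trigger_value(values, index):
    """RepeatString: always the index."""
    return index


print(enumerated_trigger_value(["10", "20", "30"], 1))   # value is integer-like
print(enumerated_trigger_value(["red", "green"], 1))     # falls back to the index
print(string_trigger_value(["a", "b", "c"], 2))
```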

If a “repeat date” VARIABLE is used in a trigger expression then date arithmetic is used, when the expression uses addition and subtraction. i.e.:

import ecflow

defs = ecflow.Defs()
s1 = defs.add_suite("s1")
# t1 repeats over every day of 2009
t1 = s1.add_task("t1").add_repeat(ecflow.RepeatDate("YMD", 20090101, 20091231, 1))
# "YMD - 1" uses date arithmetic: 20090101 - 1 is 20081231, not 20090100
t2 = s1.add_task("t2").add_trigger("t1:YMD - 1 eq 20081231")
assert t2.evaluate_trigger(), "Expected trigger to evaluate. 20090101 - 1  == 20081231"
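
To make the difference between date arithmetic and plain integer arithmetic concrete, here is a standalone sketch that does not require ecFlow. The helper ymd_plus_days is hypothetical and uses Python's datetime module to mimic what the trigger expression above computes.

```python
from datetime import date, timedelta


def ymd_plus_days(ymd, days):
    """Add days to a yyyymmdd integer using calendar arithmetic."""
    year, month, day = ymd // 10000, (ymd // 100) % 100, ymd % 100
    result = date(year, month, day) + timedelta(days=days)
    return result.year * 10000 + result.month * 100 + result.day


print(ymd_plus_days(20090101, -1))  # date arithmetic: 20081231
print(20090101 - 1)                 # integer arithmetic: 20090100
```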

When we use relative time attributes under a repeat, they are automatically reset when the repeat loops. Take for example:

suite s1
   family hc00
      repeat integer HYEAR 1993 2017
      time +00:01                     # when the repeat loops delay starting task a, for 1 minute
      task a
      task b
         trigger a  == complete
   endfamily
endsuite

Now when task ‘a’ and task ‘b’ complete, the repeat is incremented and any relative time attributes are reset, in this case effectively delaying the start of task ‘a’ by 1 minute.

See also:

Command line interface (CLI)

add, alter

Python API

ecflow.Node.add_repeat, ecflow.Repeat, ecflow.RepeatDate, ecflow.RepeatEnumerated, ecflow.RepeatInteger, ecflow.RepeatDay

Definition file grammar

repeat

running

Is an ecflow_server state. See server states.

scheduling

The ecflow_server is responsible for task scheduling.

It will check dependencies in the suite definition every minute. If these dependencies are free, the ecflow_server will submit the task. See job creation.

server states

The following table shows the ecflow_server capabilities in the different states:

State      User Request   Task Request   Job Scheduling   Auto-Check-pointing
running    yes            yes            yes              yes
shutdown   yes            yes            no               yes
halted     yes            no             no               no

shutdown

Is an ecflow_server state. See server states.

status

Each node in the suite definition has a status.

Status reflects the state of the node. In ecflow_ui the background colour of the text reflects the status.

Task statuses are: unknown, queued, submitted, active, complete, aborted and suspended.

ecflow_server statuses are: shutdown, halted and running. This is shown on the root node in ecflow_ui.

submitted

Is a node status.

When the task dependencies are resolved/free, the ecflow_server places the task into the submitted state. However, if the ECF_JOB_CMD fails, the task is placed into the aborted state.

suite

A suite is an organisational entity. It serves as the root node in a suite definition. It should be used to hold a set of jobs that achieve a common function. It can be used to hold user variables that are common to all of its children.

Only a suite node can have a clock.

Suite generated variables:

SUITE        The name of the suite
ECF_TIME     The current suite time, e.g. 23:30
TIME         The time as an integer, e.g. 2330. Can be used in a trigger expression, ideally using <=, <, >=, >
YYYY         The year as an integer
DOW          Day of the week as an integer: Sunday=0, Monday=1, etc.
DOY          Day of the year as an integer
DAY          The day as a string, e.g. monday
DD           Day of the month as an integer
MM           The month as an integer
MONTH        The month as a string
ECF_DATE     YYYYMMDD: year, month and day of the month as an 8-digit integer
ECF_JULIAN   The Julian value of the current date (added in ecFlow 4.7.0)
ECF_CLOCK    <day>:<month>:<day of week>:<day of year>, e.g. Tuesday:December:2:348

A suite is a collection of families, variables, repeat and a single clock definition.

See also:

Python API

ecflow.Suite

Definition file grammar

suite

suite definition

The suite definition is the hierarchical node tree. It describes how your tasks run and interact. It can be built up using:

Once the definition is built, it can be loaded into the ecflow_server and started. It can be monitored by ecflow_ui.

suspended

Is a node state. A node can be placed into the suspended state via a defstatus or via ecflow_ui

A suspended node including any of its children cannot take part in scheduling until the node is resumed.

task

A task represents a job that needs to be carried out. It serves as a leaf node in a suite definition

Only tasks can be submitted.

A job inside a task ecf script should generally be re-entrant so that no harm is done by rerunning it, since a task may be automatically submitted more than once if it aborts.

See also:

Python API

ecflow.Task

Definition file grammar

task

time

This defines a time dependency for a node.

Time is expressed in the format [h]h:mm. Only numeric values are allowed.

There can be multiple time dependencies for a node, but overlapping times may cause unexpected results.

Listing 138 The task is free to run when the time is 10:00 or 11:00
task t
   time 10:00
   time 11:00

To define a series of times, specify the start time, end time and a time increment. If the start time begins with ‘+’, times are relative to the beginning of the suite or, in repeated families, relative to the beginning of the repeated family.

If the time the job takes to complete is longer than the interval a “slot” is missed, e.g.:

time 10:00 20:00 01:00

If the 10:00 run takes more than an hour, the 11:00 run will never occur.
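
The missed-slot behaviour can be illustrated with a small simulation. This is a simplified model, not the server's scheduling code: the helpers parse_hhmm, time_slots, slots_run and fmt are hypothetical, and the model assumes a slot is skipped whenever the previous run has not yet finished.

```python
from datetime import timedelta


def parse_hhmm(text):
    """Convert 'hh:mm' into a timedelta since midnight."""
    hours, minutes = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def time_slots(start, end, increment):
    """Expand 'time <start> <end> <increment>' into the list of slots."""
    slot, stop, step = parse_hhmm(start), parse_hhmm(end), parse_hhmm(increment)
    if step <= timedelta(0):
        raise ValueError("increment must be positive")
    slots = []
    while slot <= stop:
        slots.append(slot)
        slot += step
    return slots


def slots_run(slots, duration):
    """Return the slots that actually run if every job takes `duration`."""
    ran, free_at = [], timedelta(0)
    for slot in slots:
        if slot >= free_at:      # the previous run has finished
            ran.append(slot)
            free_at = slot + duration
    return ran


def fmt(slot):
    minutes = int(slot.total_seconds()) // 60
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


slots = time_slots("10:00", "20:00", "01:00")
print([fmt(s) for s in slots_run(slots, timedelta(minutes=90))])
```

With a 90-minute job, every second slot is missed in this model, starting with 11:00.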

See also:

Python API

ecflow.Time, ecflow.Node.add_time

Definition file grammar

time

time dependencies

This includes time, today, day, date and cron.

When we have multiple time dependencies on the same task, dependencies of the same type are or’ed together, and the results for the different types are then and’ed together.

Listing 139 This task will run on the 17th of February 2017 at 10am
task xx
   time 10:00
   date 17.2.2017

Listing 140 Run task xx at 10am and 8pm, on the 17th and 19th of February 2017, that is four times in all. Notice the task is queued in between and completes only after the last run
task xx
   time 10:00
   time 20:00
   date 17.2.2017
   date 19.2.2017
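
The combination rule (or within a type, and across types) can be sketched in plain Python. This is an illustration only: is_free is a hypothetical helper, and matching on exact string equality is a simplification of how a scheduler compares clock values.

```python
def is_free(now, dependencies):
    """now: type -> current value; dependencies: type -> list of allowed values.

    Values of the same type are or'ed (any), the types are and'ed (all).
    """
    return all(
        any(now[kind] == value for value in values)
        for kind, values in dependencies.items()
    )


# The dependencies of task xx in the second listing above
deps = {"time": ["10:00", "20:00"], "date": ["17.2.2017", "19.2.2017"]}

runs = [
    (date, time)
    for date in ("17.2.2017", "18.2.2017", "19.2.2017")
    for time in ("10:00", "15:00", "20:00")
    if is_free({"time": time, "date": date}, deps)
]
print(len(runs))  # 4
```

Only the four combinations of the two listed times with the two listed dates satisfy both types at once.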

today

Like time, but if the suite's begin time is past the time given for the “today” command, the node is free to run (as far as the time dependency is concerned).

For example:

task x
   today 10:00

If we begin or re-queue the suite at 9.00 am, then the task is held until 10.00 am. However, if we begin or re-queue the suite at 11.00 am, the task is run immediately.

Now let's look at time:

task x
   time 10:00

If we begin or re-queue the suite at 9.00 am, then the task is held until 10.00 am. If we begin or re-queue the suite at 11.00 am, the task is still held.

If the job takes longer to complete than the interval, a “slot” is missed, e.g.:
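
The difference between the two attributes can be sketched in plain Python. This is a single-day model under stated assumptions: the function names are hypothetical, all times are 'hh:mm' strings within the same day, and the begin time is never later than the current time.

```python
def to_minutes(hhmm):
    """Convert an 'hh:mm' string into minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def today_is_free(begin, now, at):
    """'today AT': free at once if the suite began after AT, else held until AT."""
    if to_minutes(begin) > to_minutes(at):
        return True
    return to_minutes(now) >= to_minutes(at)


def time_is_free(begin, now, at):
    """'time AT': free only if AT is reached after the suite began (same day)."""
    return to_minutes(begin) <= to_minutes(at) <= to_minutes(now)


# Begin at 09:00: both hold the task until 10:00
print(today_is_free("09:00", "09:00", "10:00"), time_is_free("09:00", "09:00", "10:00"))
# Begin at 11:00: today runs immediately, time is still held
print(today_is_free("11:00", "11:00", "10:00"), time_is_free("11:00", "11:00", "10:00"))
```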

today 10:00 20:00 01:00

If the 10:00 run takes more than an hour, the 11:00 run will never occur.

See also:

Python API

ecflow.Today

Definition file grammar

today

trigger

A trigger defines a dependency for a task or family.

There can be only one trigger dependency per node, but that can be a complex boolean expression of the status of several nodes. Triggers should be avoided on suites. A node with a trigger can only be activated when its trigger has expired. A trigger holds the node as long as the trigger expression evaluation returns false.

Trigger evaluation occurs whenever a child command communicates with the server, i.e. whenever there is a state change in the suite definition.

The keywords in trigger expressions are: unknown, suspended, complete, queued, submitted, active and aborted for node status, and clear and set for event status.

Triggers can also reference Node attributes like event, meter, variable, repeat and generated variables. Trigger evaluation for node attributes uses integer arithmetic:

  • event: has the integer value 0 (clear) or 1 (set)

  • meter: values are integers hence they are used as is

  • variable: value is converted to an integer, otherwise 0 is used. See example below

  • repeat string: use the index values as integers. See example below

  • repeat enumerated: use the index values as integers. See example below

  • repeat integer: use the implicit integer values

  • repeat date: use the date values as integers. Use of plus/minus on repeat date variable uses date arithmetic

  • repeat datetime: use the date+time instant values as integers. Use of plus/minus on repeat datetime variable uses second arithmetic

  • limit: the limit value is used as an integer. This allows a degree of prioritisation amongst tasks under a limit

  • late: the value is stored in a flag, and is a simple boolean. Used to signify when a task is late.

Here are some examples:

Listing 141 Trigger examples
suite trigger_suite
   task a
      event EVENT
      meter METER 1 100 50
      edit  VAR_INT 12
      edit  VAR_STRING "captain scarlett"         # This is not convertible to an integer, if referenced will use '0'
   family f1
      edit SLEEP 2
      repeat string NAME a b c d e f              # This has values: a(0), b(1), c(2), d(3), e(4), f(5) i.e index
      family f2
         repeat integer VALUE 5 10                # This has values: 5,6,7,8,9,10
         family f3
            repeat enumerated COLOUR red green blue  # COLOUR is an assumed name; red(0), green(1), blue(2)
            task t1
               repeat date DATE 19991230 20000102 # This has values: 19991230,19991231,20000101,20000102
         endfamily
      endfamily
   endfamily
   family f2
      task event_meter
          trigger /trigger_suite/a:EVENT == set and /trigger_suite/a:METER >= 30
      task variable
          trigger /trigger_suite/a:VAR_INT >= 12 and /trigger_suite/a:VAR_STRING == 0
      task repeat_string
          trigger /trigger_suite/f1:NAME >= 4
      task repeat_integer
          trigger /trigger_suite/f1/f2:VALUE >= 7
      task repeat_date
          trigger /trigger_suite/f1/f2/f3/t1:DATE >= 19991231
      task repeat_date2
          # Using plus/minus on a repeat DATE will use date arithmetic
          # Since the starting value of DATE is 19991230, this task will run straight away
          trigger /trigger_suite/f1/f2/f3/t1:DATE - 1 == 19991229
   endfamily
endsuite
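
Two of the conversions in the listing above can be sketched in plain Python: the index value of a repeat string, and date arithmetic on a repeat date. The helper name shift_repeat_date is hypothetical, and the code only illustrates the arithmetic, not how ecFlow implements it.

```python
from datetime import datetime, timedelta


def shift_repeat_date(yyyymmdd, days):
    """Add days to a yyyymmdd integer using calendar (date) arithmetic."""
    day = datetime.strptime(str(yyyymmdd), "%Y%m%d").date()
    return int((day + timedelta(days=days)).strftime("%Y%m%d"))


# repeat string: the trigger sees the index of the current value
index_of = {value: index for index, value in enumerate("a b c d e f".split())}
print(index_of["e"])  # 4, so NAME >= 4 holds once the repeat reaches e

# repeat date: plus/minus crosses month and year boundaries correctly
print(shift_repeat_date(19991230, -1))  # 19991229
print(shift_repeat_date(20000101, -1))  # 19991231
print(20000101 - 1)                     # 20000100, plain integer arithmetic is not a valid date
```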

What happens when we have multiple node attributes of the same name, referenced in trigger expressions?

Listing 142 Trigger priority when name clashes
 task foo
   event blah
   meter blah 0 200 50
   edit  blah 10
 task bar
   trigger foo:blah >= 0

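In this case ecFlow applies a fixed precedence among the attribute kinds, in which an event takes precedence over a meter or a variable of the same name.

Hence in the example above the expression foo:blah >= 0 will reference the event.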

See also:

Python API

ecflow.Expression, ecflow.Node.add_trigger

Definition file grammar

trigger

unknown

Is a node status.

This is the default node status when a suite definition is loaded into the ecflow_server.

user command

User commands are any client to server requests that are not child commands.

variable

ecFlow makes heavy use of several kinds of variables, for example the suite definition variables and the generated variables (see variable inheritance).

Variables can be referenced in trigger and complete expressions. When used as part of a trigger or complete expression, the value of the variable should be convertible to an integer (n.b. floating point values are allowed, but will be truncated). It is important to notice that, when evaluating a trigger or complete expression, the default value 0 will be used for any variable with a non-numerical value.
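
The conversion rule can be sketched in plain Python. The function name trigger_value is hypothetical, and the handling of edge cases (surrounding whitespace, "inf", "nan") follows Python's parsing and is an assumption, not documented ecFlow behaviour.

```python
def trigger_value(raw):
    """Integer used for a variable in a trigger or complete expression (sketch)."""
    try:
        return int(raw)             # plain integers such as "12"
    except (TypeError, ValueError):
        pass
    try:
        return int(float(raw))      # floating point values are truncated
    except (TypeError, ValueError, OverflowError):
        return 0                    # non-numerical values default to 0


print(trigger_value("12"))                # 12
print(trigger_value("3.7"))               # 3
print(trigger_value("captain scarlett"))  # 0
```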

See also:

Python API

ecflow.Node.add_variable

Definition file grammar

variable

variable inheritance

When a variable is needed at job creation time, it is first sought in the task itself.

If it is not found in the task, it is sought from the task’s parent and so on, up through the node levels until found.

For any node, there are two places to look for variables.

Suite definition variables are looked for first, and then any generated variables.
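
The lookup order can be sketched in plain Python. The Node class below is a hypothetical stand-in for illustration, not the ecflow.Node class, and it leaves out repeat variables.

```python
class Node:
    """Minimal stand-in for a suite, family or task (illustration only)."""

    def __init__(self, name, parent=None, user=None, generated=None):
        self.name = name
        self.parent = parent
        self.user = user or {}            # suite definition (user) variables
        self.generated = generated or {}  # generated variables

    def find_variable(self, name):
        """Search this node, then each parent in turn, up to the suite."""
        node = self
        while node is not None:
            if name in node.user:         # user variables are looked for first
                return node.user[name]
            if name in node.generated:    # then the generated variables
                return node.generated[name]
            node = node.parent
        return None


suite = Node("s", user={"SLEEP": 5}, generated={"SUITE": "s"})
family = Node("f", parent=suite, user={"SLEEP": 2})
task = Node("t", parent=family, generated={"TASK": "t"})

print(task.find_variable("SLEEP"))  # 2, the family overrides the suite
print(task.find_variable("SUITE"))  # s, inherited from the suite
```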

variable substitution

Takes place during pre-processing or command invocation (i.e. ECF_JOB_CMD, ECF_KILL_CMD, etc.).

It involves searching each line of the ecf script file or command for the ECF_MICRO character, typically ‘%’.

The text between two % characters defines a variable, e.g. %VAR%

This variable is searched for in the suite definition.

First the suite definition variables (sometimes referred to as user variables) are searched, then the repeat variable name, and finally the generated variables. If the variable is not found, the same search is repeated at each level up the node tree.

The text between the % characters, together with the % characters themselves, is replaced by the value of the variable.

If the micro characters are not paired, an error message is written to the log file and the task is placed into the aborted state.

If the variable is not found in the suite definition during pre-processing, then job creation fails, an error message is written to the log file, and the task is placed into the aborted state. To avoid this, variables in the ecf script can be defined as:

%VAR:replacement%

This is similar to %VAR% but if VAR is not found in the suite definition then ‘replacement’ is used.
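
The substitution rules can be sketched in plain Python. This is a minimal illustration under stated assumptions: substitute and SubstitutionError are hypothetical names, the variables are passed as a flat dictionary (the search up the node tree is left out), and micro-character escapes are not handled.

```python
class SubstitutionError(Exception):
    """Raised where the real job creation would fail and the task would abort."""


def substitute(line, variables, micro="%"):
    """Replace %VAR% and %VAR:replacement% in one line of a script or command."""
    out = []
    pos = 0
    while True:
        start = line.find(micro, pos)
        if start == -1:                       # no more micro characters
            out.append(line[pos:])
            return "".join(out)
        end = line.find(micro, start + 1)
        if end == -1:                         # micro characters are not paired
            raise SubstitutionError(f"unpaired {micro!r} in: {line}")
        out.append(line[pos:start])
        name, has_default, default = line[start + 1:end].partition(":")
        if name in variables:
            out.append(str(variables[name]))
        elif has_default:                     # %VAR:replacement% fallback
            out.append(default)
        else:
            raise SubstitutionError(f"variable {name!r} not found")
        pos = end + 1


print(substitute("sleep %SLEEP%", {"SLEEP": 2}))   # sleep 2
print(substitute("host=%HOST:localhost%", {}))     # host=localhost
```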

virtual clock

Like a real clock, except that when the ecflow_server is suspended (i.e. shutdown or halted), the suite's clock is also suspended.

Hence it will honour relative times in cron, today and time dependencies. It is possible to have a combination of hybrid/real and virtual.

It is more useful when we want complete adherence to time related dependencies, at the expense of being out of sync with system time.

zombie

Zombies are running jobs that fail authentication when communicating with the ecflow_server.

Child commands (init, event, meter, label, abort, complete) are placed in the ecf script file and are used to communicate with the ecflow_server.

The ecflow_server authenticates each connection attempt made by the child command. Authentication can fail for a number of reasons (see zombie type).

When authentication fails, the job is considered to be a zombie. The ecflow_server will keep a note of the zombie for a period of time, before it is automatically removed. However, the removed zombie may well re-appear. This is because each child command will continue attempting to contact the ecflow_server for 24 hours. This is configurable, see ECF_TIMEOUT on ecflow_client.

See also:

Python API

ecflow.ZombieAttr, ecflow.ZombieUserActionType

Definition file grammar

zombie

There are several types of zombies, see zombie type and ecflow.ZombieType.

zombie attribute

The zombie attribute defines how a zombie should be handled in an automated fashion. Very careful consideration should be taken before this attribute is added, as it may hide a genuine problem. It can be added to any node, but is best defined at the suite or family level. If there is no zombie attribute, the default behaviour is to block the child command.

To add a zombie attribute in python, please see: ecflow.ZombieAttr

zombie type

See zombie and class ecflow.ZombieAttr for further information.

Zombies can arise in several ways:

  • Server crashed (or terminated and restarted) and the recovered check point file is out of date.

  • A task is repeatedly re-run, earlier copies will not be remembered.

  • Job sent by another ecflow_server, but which cannot talk to the original ecflow_server

  • Network glitches/network down

  • Errors in script, i.e. multiple calls to init, complete

  • Errors in job submission, i.e. job submitted twice.

There are several types of zombies:

  • path:

    • The task path cannot be found in the server, because the node tree was deleted, replaced or reloaded, the server crashed, or the backup server does not have the node tree.

    • Jobs could have been created, via server scheduling or by user commands

  • user: Job is created by user commands like rerun or re-queue. User zombies are differentiated from server (scheduled) zombies since they are automatically created when the force option is used and we have tasks in an active or submitted state.

  • ecf: Jobs are created as part of the normal scheduling

    • Two init commands, or the task is complete or aborted but receives another child command

    • Server crashed (or terminated and restarted) and the recovered check point file is out of date.

    • A task is repeatedly re-run, earlier copies will not be remembered.

    • Job sent by another ecflow_server, but which cannot talk to the original ecflow_server

    • Network glitches/network down

  • ecf_pid: PID mismatch. Job scheduled twice? Check the submitter.

  • ecf_passwd: Password mismatch while the PID matches. Has the system recycled the PID, or has the job file been hacked?

  • ecf_pid_passwd: Both PID and password mismatch. Re-queue and submit of an active job?

The type of the zombie is not fixed and may change.
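
The three credential-based types can be sketched as a small decision function in plain Python. The function name is hypothetical, and reading ecf_pid as "PID mismatch while the password matches" is an assumption drawn from the list above. The path, user and ecf types depend on other conditions and are not modelled.

```python
def credential_zombie_type(pid_matches, password_matches):
    """Map PID and password checks to a zombie type name (sketch, not server code)."""
    if pid_matches and password_matches:
        return None                 # credentials agree: not one of these types
    if not pid_matches and not password_matches:
        return "ecf_pid_passwd"     # both PID and password mismatch
    if not pid_matches:
        return "ecf_pid"            # PID mismatch only
    return "ecf_passwd"             # password mismatch, PID matches


print(credential_zombie_type(False, True))   # ecf_pid
print(credential_zombie_type(True, False))   # ecf_passwd
print(credential_zombie_type(False, False))  # ecf_pid_passwd
```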