Glossary
- aborted
When the ECF_JOB_CMD fails or the job file sends an ecflow_client --abort child command, the task is placed into the aborted state.
- active
If job creation was successful and the job file has started, then the ecflow_client --init child command is received by the ecflow_server and the task is placed into the active state.
- autocancel
autocancel is a way to automatically delete a node which has completed.
The delete may be delayed by an amount of time in hours and minutes, or expressed in days. Any node may have a single autocancel attribute. If the auto-cancelled node is referenced in the trigger expression of other nodes, it may leave those nodes waiting. This can be solved by making sure the trigger expression also checks for the unknown state, i.e.:
trigger node_to_cancel == complete or node_to_cancel == unknown
This guards against ‘node_to_cancel’ being undefined or deleted.
- aviso
An aviso is an attribute of a Node (typically a Task), and creates a dependency on an external Aviso server.
A Node with an aviso attribute is held from executing until a notification matching the configured listener is received. When a matching notification is received, the node is then allowed to execute, following a behaviour similar to a trigger or time dependency (e.g. cron).
Only one aviso attribute is allowed per node, and each attribute is defined by the following properties:
name, an identifier
listener, the configuration for the Aviso listener
url, the base location of the Aviso server
schema, the location of the Aviso schema used to evaluate the notifications
polling, the value (in seconds) used to periodically contact the Aviso server
auth, the location to the Aviso authentication credentials file
The value of the properties url, schema, polling, and auth can be composed of Variables. When these properties are not provided, the following default values are used:
%ECF_AVISO_URL%, for url
%ECF_AVISO_SCHEMA%, for schema
%ECF_AVISO_POLLING%, for polling
%ECF_AVISO_AUTH%, for auth
Important
The variables ECF_AVISO_* are not automatically provided at server level, and must be defined at Suite level by the user.
Each aviso attribute implies that a background thread is spawned whenever the associated node is (re)queued. This background thread is responsible for polling the Aviso server, and periodically processing the latest notifications.
- check point
The check point file is like the suite definition, but includes all the state information.
It is periodically saved by the ecflow_server.
It can be used to recover the state of the node tree should the server die or the machine crash.
By default, when an ecflow_server is started it will look for the check point file and load it.
The default check point file name is <host>.<port>.ecf.check. This can be overridden by the ECF_CHECK environment variable.
The check point file format is the same as the defs file format (from release 4.7.0 onwards). However, the indentation has been removed to save space. To view it with indentation use:
ecflow_client --load=<check_point_file> print check_only
- child command
Child commands (or task requests) are called from within the ecf script files. The table also includes the default action (from version 4.0.4) taken if the child command is part of a zombie. ‘block’ means the job will be held by the ecflow_client command until it times out, or until manual or automatic intervention.
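Child Command    Description                           Zombie (default action)
init             Mark the task as started (active)     block
wait             Wait for an expression to evaluate    block
queue            Update queue step in server           block
abort            Mark the task as aborted              block
complete         Mark the task as complete             block
event            Set an event                          fob
meter            Change a meter                        fob
label            Change a label                        fob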
The following environment variables must be set for the child commands: ECF_HOST, ECF_NAME, ECF_PASS and ECF_RID. See ecflow_client.
- clock
A clock is an attribute of a suite.
A gain can be specified to offset from the given date.
The hybrid and real clocks always run in phase with the system clock (UTC in UNIX) but can have any offset from the system clock.
The clock can be hybrid or real.
The time, day, date and cron dependencies work a little differently under the two clocks.
The default clock type is hybrid.
If the ecflow_server is shut down or halted, job scheduling is suspended. If this suspension is left in place for a period of time, it can affect task submission under both hybrid and real clocks. In particular it will affect tasks with time, today or cron dependencies.
Dependencies with a time series can result in missed time slots:
time 10:00 20:00 00:15 # If server is suspended > 15 minutes, time slots can be missed
time +00:05 20:00 00:15 # start 5 minutes after the start of the suite, then every 15m until 20:00
When the server is placed back into the running state, any time dependencies with an expired time slot are submitted straight away, i.e. if the ecflow_server is halted at 10:59 and then placed back into the running state at 11:20:
time 11:00
then any task with an expired single time slot dependency will be submitted straight away.
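To see which slots of a time series fall inside a suspension window, the slots can simply be enumerated. The following is a minimal sketch, not ecFlow code: the helpers `time_series` and `slots_elapsed` are hypothetical names, and the sketch assumes that the finish time is itself a valid slot and that a slot counts as elapsed if it falls at or after the halt and before the resume.

```python
def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(minutes: int) -> str:
    """Convert minutes since midnight back to 'HH:MM'."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def time_series(start: str, finish: str, increment: str) -> list:
    """Enumerate the slots of 'time <start> <finish> <increment>' (absolute times only)."""
    step = to_minutes(increment)
    if step <= 0:
        raise ValueError("increment must be greater than zero")
    # Assumption: the finish time is included when it lands exactly on a slot.
    return [to_hhmm(m) for m in range(to_minutes(start), to_minutes(finish) + 1, step)]


def slots_elapsed(slots: list, halted_at: str, resumed_at: str) -> list:
    """Return the slots that fall inside the window in which the server was not scheduling."""
    lo, hi = to_minutes(halted_at), to_minutes(resumed_at)
    return [slot for slot in slots if lo <= to_minutes(slot) < hi]


if __name__ == "__main__":
    series = time_series("10:00", "20:00", "00:15")
    print(slots_elapsed(series, "10:59", "11:20"))
```

For `time 10:00 20:00 00:15` and a server halted from 10:59 to 11:20, two slots (11:00 and 11:15) fall inside the window, which is the situation in which the text above warns that time slots can be missed.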
- complete
The node can be set to complete:
By the complete expression
At job end when the task receives the ecflow_client --complete child command
Manually via the command line or GUI. When this happens any time attributes are expired in order.
- complete expression
Force a node to be complete if the expression evaluates, without running any of the nodes.
This allows the user to have tasks in the suite which run only in case others fail. In practice the node would need to have a trigger also.
- cron
A cron defines a time dependency for a node, similar to time, but one that will be repeated indefinitely.
See also:
Text Definition
- date
This defines a date dependency for a node.
There can be multiple date dependencies. The European format is used for dates, which is dd.mm.yyyy, as in 31.12.2007. Any of the three number fields can be expressed with a wildcard * to mean any valid value. Thus, 01.*.* means the first day of every month of every year.
If a hybrid clock is defined, any node held by a date dependency will be set to complete at the beginning of the suite, without running the corresponding job. Otherwise under a hybrid clock the suite would never complete.
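The wildcard rule for date fields can be illustrated with a short sketch. This is not ecFlow code: `date_matches` is a hypothetical helper that assumes a day.month.year pattern in which any field may be `*`.

```python
from datetime import date


def date_matches(pattern: str, today: date) -> bool:
    """Return True if `today` matches a day.month.year pattern; '*' matches any value."""
    fields = pattern.split(".")
    if len(fields) != 3:
        raise ValueError(f"expected three dot-separated fields, got {pattern!r}")
    actual = (today.day, today.month, today.year)
    # Each field either is the wildcard or must equal the corresponding date part.
    return all(field == "*" or int(field) == value for field, value in zip(fields, actual))


if __name__ == "__main__":
    print(date_matches("01.*.*", date(2007, 3, 1)))
    print(date_matches("31.12.2007", date(2007, 12, 30)))
```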
- day
This defines a day dependency for a node.
There can be multiple day dependencies.
If a hybrid clock is defined, any node held by a day dependency will be set to complete at the beginning of the suite, without running the corresponding job. Otherwise under a hybrid clock the suite would never complete.
- defstatus
Defines the default status for a task/family to be assigned to the node when the begin command is issued.
By default a node is queued when you use begin on a suite. defstatus is useful in preventing suites from running automatically once begun, or in setting tasks complete so they can be run selectively.
- dependencies
Dependencies are attributes of a node that can suppress/hold a task from taking part in job creation.
They include trigger, date, day, time, today, cron, complete expression, inlimit and limit.
A task that is dependent cannot be started as long as some dependency is holding it or any of its parent nodes.
The ecflow_server will check the dependencies every minute, during normal scheduling and when any child command causes a state change in the suite definition.
- directives
Directives appear in an ecf script (typically a .ecf file, but it could be a .py file). Directives start with a % character. This is referred to as the ECF_MICRO character.
The directives are used in two main contexts.
Preprocessing directives. In this case the directive starts as the first character on a line in an ecf script file. See the table below, which shows the allowable values. Only one directive is allowed on the line.
Variable directives. We use two ECF_MICRO characters, i.e. %VAR%. In this case they can occur anywhere on the line and in any number.
%CAR% %TYPE% %WISHLIST%
These directives take part in variable substitution.
If the micro characters are not paired (i.e. there is an uneven number of them) then variable substitution cannot take place, hence an error message is issued.
port=%ECF_PORT # error issued since '%' micro character are not paired.
However, an uneven number of micro characters is allowed if the line begins with the ‘#’ comment character.
# This is a comment line with a single micro character % no error issued
# port=%ECF_PORT again no error issued
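The pairing rule can be expressed as a small check. This is a simplified sketch rather than the server's actual pre-processor: `has_unpaired_micro` is a hypothetical helper, and skipping lines that start with a preprocessing directive keyword is an assumption made so that lines such as `%include <head.h>` are not reported.

```python
# Preprocessing directive keywords, as listed in this glossary entry.
DIRECTIVES = ("include", "includenopp", "comment", "manual", "nopp", "end", "ecfmicro")


def is_preprocessing_directive(line: str, micro: str = "%") -> bool:
    """True if the line starts with the micro character followed by a directive keyword."""
    if not line.startswith(micro):
        return False
    words = line[len(micro):].split()
    return bool(words) and words[0] in DIRECTIVES


def has_unpaired_micro(line: str, micro: str = "%") -> bool:
    """True if variable substitution on this line would hit an uneven number of micro characters."""
    if line.startswith("#"):
        return False  # comment lines may contain an uneven number of micro characters
    if is_preprocessing_directive(line, micro):
        return False  # assumption: directive lines are handled separately and not checked here
    return line.count(micro) % 2 != 0


if __name__ == "__main__":
    for text in ("port=%ECF_PORT", "port=%ECF_PORT%", "# port=%ECF_PORT again no error issued"):
        print(repr(text), has_unpaired_micro(text))
```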
Directives are expanded during pre-processing. Examples include:
Symbol
Meaning
%include <filename>
The %ECF_INCLUDE% directory is searched for the filename and the contents are included into the job file. If that variable is not defined, ECF_HOME is used. If ECF_INCLUDE is defined but the file does not exist, then we look in ECF_HOME. This allows specific files to be placed in ECF_INCLUDE and the more general/common include files to be placed in ECF_HOME. This is the recommended format.
%include "filename"
Include the contents of the file %ECF_HOME%/%SUITE%/%FAMILY%/filename into the job.
%include filename
Include the contents of the file filename into the output. The only form that can be used safely must start with a slash ‘/’.
%includenopp filename
Same as %include, but the file is not interpreted at all.
%comment
Starts a comment, which is ended by %end directive. The section enclosed by %comment - %end is removed during pre-processing
%manual
Starts a manual, which is ended by the %end directive. The section enclosed by %manual - %end is removed during pre-processing. The manual directive is used to create the manual page shown in ecflow_ui.
%nopp
Stop pre-processing until a line starting with %end is found. No interpretation of the text will be done (i.e. no variable substitutions)
%end
End processing of %comment or %manual or %nopp
%ecfmicro CHAR
Change the directive character to the character given. If set in an include file, the effect is retained for the rest of the job (or until set again). It should be noted that the ecfmicro directive specified in the ecf script file does not affect the variable substitution for the ECF_JOB_CMD, ECF_KILL_CMD or ECF_STATUS_CMD variables. They still use ECF_MICRO. If no ecfmicro directive exists, we default to using ECF_MICRO from the suite definition.
From ecFlow release 4.4.0, use of %VAR% (variable substitution) can be a part of the include filename. i.e.:
# %file% must be defined, on the task, or on the parent hierarchy
%include <%file%.h>
# use %INCLUDEFILE% if defined (on the task, or on the parent hierarchy,
# and MUST follow one of formats above: ".filename", "../filename", "filename",
# filename>) otherwise use <file>
%include %INCLUDEFILE:<file>%
Care should be taken to avoid spaces in the variable values.
- ecf file location algorithm
The ecflow_server and job creation checking use the following algorithm to locate the ‘.ecf’ file corresponding to a task.
Note
To search for files with a different extension, e.g. to look for Python ‘.py’ files, override the ECF_EXTN variable. Its default value is ‘.ecf’.
ECF_SCRIPT: First it uses the generated variable ECF_SCRIPT to locate the script. This variable is generated from ECF_HOME/<path to task>.ecf. Hence if the task path is /suite/f1/f2/t1, then ECF_SCRIPT=ECF_HOME/suite/f1/f2/t1.ecf
ECF_FETCH (user variable): File is obtained from running the command after some postfix arguments are added. (Output of popen)
ECF_SCRIPT_CMD (user variable): File is obtained from running the command. (Output of popen)
ECF_FILES: Second it checks for the user defined ECF_FILES variable. If defined the value of this variable must correspond to a directory. This directory is searched in reverse order.
For example, assume we have a task /o/12/fc/model and ECF_FILES is defined as /home/ecmwf/emos/def/o/ECFfiles
ecFlow will use the following search pattern.
/home/ecmwf/emos/def/o/ECFfiles/o/12/fc/model.ecf
/home/ecmwf/emos/def/o/ECFfiles/12/fc/model.ecf
/home/ecmwf/emos/def/o/ECFfiles/fc/model.ecf
/home/ecmwf/emos/def/o/ECFfiles/model.ecf
If the directory does not exist, the server will try variable substitution. This allows additional configuration:
edit ECF_FILES /home/ecmwf/emos/def/o/%FILE_DIR:ECFfiles%
The search can be reversed, by adding a variable ECF_FILES_LOOKUP, with a value of “prune_leaf” (from ecFlow 4.12.0). Then ecFlow will use the following search pattern.
/home/ecmwf/emos/def/o/ECFfiles/o/12/fc/model.ecf
/home/ecmwf/emos/def/o/ECFfiles/o/12/model.ecf
/home/ecmwf/emos/def/o/ECFfiles/o/model.ecf
/home/ecmwf/emos/def/o/ECFfiles/model.ecf
However, please be aware this will also affect the search in ECF_HOME.
ECF_HOME: Thirdly it searches for the script in reverse order using ECF_HOME (i.e. like ECF_FILES). If this fails, then the task is placed into the aborted state. We can check that the file can be located before loading the suites into the server.
Note: The addition of variable with a name ECF_FILES_LOOKUP and value ‘prune_leaf’, affects the search in BOTH ECF_FILES and ECF_HOME
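The two search patterns listed above can be reproduced with a short sketch. This only illustrates the order of candidate paths; it is not how the server is implemented, and `candidate_scripts` is a hypothetical helper name.

```python
def candidate_scripts(task_path: str, root: str, extn: str = ".ecf", prune_leaf: bool = False) -> list:
    """List the script paths tried under `root` for a task, in search order.

    Default: drop leading path components one at a time (prune from the root).
    prune_leaf=True mirrors ECF_FILES_LOOKUP=prune_leaf: drop the components
    nearest to the task one at a time instead.
    """
    *parents, task = task_path.strip("/").split("/")
    candidates = []
    for dropped in range(len(parents) + 1):
        kept = parents[: len(parents) - dropped] if prune_leaf else parents[dropped:]
        candidates.append("/".join([root.rstrip("/"), *kept, task + extn]))
    return candidates


if __name__ == "__main__":
    ecf_files = "/home/ecmwf/emos/def/o/ECFfiles"
    for path in candidate_scripts("/o/12/fc/model", ecf_files):
        print(path)
    for path in candidate_scripts("/o/12/fc/model", ecf_files, prune_leaf=True):
        print(path)
```

The first loop prints the default pattern and the second prints the prune_leaf pattern for the task /o/12/fc/model used in the example above.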
- ecf script
The ecFlow script refers to an ‘.ecf’ file.
The script file is transformed into the job file by the job creation process.
The base name of the script file must match its corresponding task, i.e. t1.ecf corresponds to the task named ‘t1’. If the script is placed in the ECF_FILES directory, it may be re-used by multiple tasks belonging to different families, provided the task name matches.
The ecFlow script is similar to a UNIX shell script.
The differences, however, include the addition of “C” like pre-processing directives and ecFlow variables. Also the script must include calls to the init and complete child commands so that the ecflow_server is aware when the job starts (i.e. changes state to active) and finishes (i.e. changes state to complete).
- ECF_DUMMY_TASK
This is a user variable that can be added to a task to indicate that there is no associated ecf script file.
If this variable is added to a suite or family then all child tasks are treated as dummy.
This stops the server from reporting an error during job creation.
- ECF_EXTN
Defines the extension for the script that will be turned into a job file. This has a default value of ‘.ecf’, but could be any extension. This is used by the server as part of the ‘ecf file location algorithm’.
- ECF_FETCH
Experimental. This is used to specify a command whose output can be used as a job script. The ecFlow server will run the command with popen. Hence great care needs to be taken not to doom the server with a command that can hang, as this could severely affect the server's ability to schedule jobs.
edit ECF_FETCH my_custom_cmd.sh
After variable substitution, the server will add the following.
my_custom_cmd.sh -s <task_name>.<ECF_EXTN> # to extract the script and create the job
my_custom_cmd.sh -i # to extract the includes
my_custom_cmd.sh -m <task_name>.<ECF_EXTN> # to extract the manual, i.e. for display in the info tab
my_custom_cmd.sh -c <task_name>.<ECF_EXTN> # to extract the comments
The output of running these commands (-s) is used to create the job.
- ECF_HOME
This is a user defined variable; it has four functions:
it is used as a prefix portion of the path of the job files created by the ecFlow server; see the description of the ECF_JOB generated variable.
it is a default directory where the ecFlow server looks for scripts (with the file extension defined by ECF_EXTN, default .ecf); overridden by the ECF_FILES user defined variable. See the “ecf file location algorithm” entry for more detail.
it is a default directory where ecFlow server looks for include files; overridden by ECF_INCLUDE user defined variable. See the “directives” entry for more detail.
it is used as a default prefix portion of the job output path (the ECF_JOBOUT generated variable); overridden by ECF_OUT user defined variable. See descriptions of ECF_JOBOUT and ECF_OUT variables for more detail.
- ECF_INCLUDE
This is a user defined variable. It is used to specify directory locations, that are used to search for include files.
edit ECF_INCLUDE /home/fred/course/include # a single directory
edit ECF_INCLUDE /home/fred/course/include:/home/fred/course/include2:/home/fred/course/include_me # set of directories to search
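The lookup of an include file across a colon-separated ECF_INCLUDE, with the fall back to ECF_HOME described under directives, might be sketched as follows. `find_include` is a hypothetical helper used only for illustration; it is not the server's implementation.

```python
import os
from typing import Optional


def find_include(filename: str, ecf_home: str, ecf_include: Optional[str] = None) -> Optional[str]:
    """Return the first existing path for `filename`, or None if it is not found.

    The ECF_INCLUDE directories are tried in the order given, then ECF_HOME.
    """
    directories = ecf_include.split(":") if ecf_include else []
    for directory in [*directories, ecf_home]:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # The paths below are the illustrative ones from the example above.
    print(find_include("head.h", "/home/fred/course",
                       "/home/fred/course/include:/home/fred/course/include2"))
```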
- ECF_JOB
This is a generated variable. It defines the path name location of the job file.
The variable is composed as:
ECF_HOME/ECF_NAME.job<ECF_TRYNO>
- ECF_JOB_CMD
This variable should point to a script that can submit the job (e.g. to a queuing system such as SLURM or PBS).
The ecFlow server will detect abnormal termination of this command. Hence, on errors, the job file should call ‘ecflow_client --abort’ and then exit cleanly. Otherwise the server detects abnormal job termination and the abort flag is set, which will prevent the job being re-queued (due to ECF_TRIES).
If the job also sends an abort, zombies can be created. If the ECF_JOB_CMD command fails and the task is in a submitted state, then the task is set to the aborted state. However, if the task was active or complete, then we do NOT abort the task. Instead the zombie flag is set (since ecFlow 4.17.1).
- ECF_JOBOUT
This is a generated variable. It defines the path name for the job output file. The variable is composed as follows.
If ECF_OUT is specified:
ECF_OUT/ECF_NAME.ECF_TRYNO
otherwise:
ECF_HOME/ECF_NAME.ECF_TRYNO
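As an illustration of how these paths are composed, here is a minimal sketch. The functions `ecf_job` and `ecf_jobout` are hypothetical helpers, and the sketch assumes ECF_NAME is the task path with a leading slash (for example /suite/family/task).

```python
from typing import Optional


def ecf_job(ecf_home: str, ecf_name: str, ecf_tryno: int) -> str:
    """Compose ECF_HOME/ECF_NAME.job<ECF_TRYNO>."""
    return f"{ecf_home.rstrip('/')}{ecf_name}.job{ecf_tryno}"


def ecf_jobout(ecf_home: str, ecf_name: str, ecf_tryno: int, ecf_out: Optional[str] = None) -> str:
    """Compose the job output path, using ECF_OUT as the base when it is specified."""
    base = ecf_out if ecf_out else ecf_home
    return f"{base.rstrip('/')}{ecf_name}.{ecf_tryno}"


if __name__ == "__main__":
    # /home/fred/course and /scratch/out are illustrative directories only.
    print(ecf_job("/home/fred/course", "/suite/family/task", 1))
    print(ecf_jobout("/home/fred/course", "/suite/family/task", 1))
    print(ecf_jobout("/home/fred/course", "/suite/family/task", 1, ecf_out="/scratch/out"))
```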
- ECF_LISTS
This is a server variable. It specifies the path to the white list file. This file controls who has read/write access to the server via the user commands.
The user name can be found using the Linux id command and is typically the login name. The file has a very simple format.
The file path specified by the ECF_LISTS environment variable is read by the server on start up. The contents of the white list can be modified and reloaded by the server. (However, the path to the white list file can NOT be modified after the server has started.)
If ECF_LISTS is not set, the server will look for a file named <host>.<port>.ecf.lists (e.g. my_host.3141.ecf.lists) in the same directory where the server was started.
If the file specified by ECF_LISTS or <host>.<port>.ecf.lists does not exist, or exists but is empty, then all users will have read/write access to suites on the server. Special care must be taken so that a user reloading the white list file does not remove write access for the administrator.
ecflow_client --help=reloadwsfile
ecflow_client --reloadwsfile
4.4.14 # this is a comment, the first non-comment line must include a version.
# These users have read and write access to the server
uid1 # user uid1,uid2,cog have read and write access to the server
uid2
cog
# Read only users
-fred # users fred,bill and jake have read only access
-bill
-jake

4.4.14 # this is a comment, the first non-comment line must include a version.
# These users have read and write access to the server
uid1 # user uid1,uid2,cog have read and write access to the server
uid2
cog
# User with read access
-* # all users have read access

4.4.5
fred # has read /write access to all suites
-joe # has read access to all suites
* /x /y # all users have read/write access to suites /x /y
-* /w /z # all users have read access to suites /w /z
user1 /a,/b,/c # user1 has read/write access to suite /a /b /c
user2 /a
user2 /b
user2 /c # user2 has read write access to suite /a /b /c
user3 /a /b /c # user3 has read write access to suite /a /b /c
-user4 /a,/b,/c # user4 has read access to suite /a /b /c
-user5 /a
-user5 /b
-user5 /c # user5 has read access to suite /a /b /c
-user6 /a /b /c # user6 has read access to suite /a /b /c
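The white list format shown in these examples is simple enough to read with a few lines of code. The sketch below is only an illustration of the format as it appears above (version first, `#` comments, a leading `-` for read-only users, optional suite paths separated by spaces or commas); `parse_white_list` is a hypothetical helper and does not reflect how the server parses the file.

```python
def parse_white_list(text: str):
    """Return (version, entries); each entry is (user, access, suite_paths).

    An empty suite_paths tuple means the entry applies to all suites.
    The user '*' stands for all users.
    """
    version = None
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding space
        if not line:
            continue
        if version is None:
            version = line  # the first non-comment line holds the version
            continue
        user, *paths = line.replace(",", " ").split()
        if user.startswith("-"):
            user, access = user[1:], "read"
        else:
            access = "read/write"
        entries.append((user, access, tuple(paths)))
    return version, entries


if __name__ == "__main__":
    sample = "4.4.5\nfred # read/write on all suites\n-joe\n* /x /y\n-user4 /a,/b,/c\n"
    print(parse_white_list(sample))
```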
- ECF_MICRO
This is a generated variable. The default value is %. This variable is used in variable substitution during command invocation and as the default directive character during pre-processing. It can be overridden, but must be replaced by a single character.
- ECF_NAME
This is a generated variable. It defines the path name of the task. It will typically be used inside script file, referring to the corresponding task.
%include <head.h>
....
ecflow_client --alter change variable "fred" "bill" %ECF_NAME% # change variable on corresponding task
...
%include <tail.h>
- ECF_NO_SCRIPT
This is a user variable, that can be added to a node (introduced with ecFlow release 4.3.0). It is used to inform the ecflow_server that there is no SCRIPT associated with a task. However unlike ECF_DUMMY_TASK, the task can still be submitted provided the ECF_JOB_CMD is set up.
This is suitable for very lightweight tasks that want to minimize latency. The output can still be seen, if it is redirected to ECF_JOBOUT. Care must be taken to ensure the path to ecflow_client is accessible.
family no_script
  edit ECF_NO_SCRIPT "1" # the server will not look for .ecf files
  edit ECFLOW_CLIENT ecflow_client
  edit DIROUT %VERBOSE%
  edit SILENT ""
  edit VERBOSE " > %ECF_JOBOUT% 2>&1"
  task non_script_task
    edit ECF_JOB_CMD "export ECF_PASS=%ECF_PASS%;export ECF_PORT=%ECF_PORT%;export ECF_HOST=%ECF_HOST%;export ECF_NAME=%ECF_NAME%;export ECF_TRYNO=%ECF_TRYNO%; %ECF_CLIENT% --init=$$; echo 'test test_ecf_no_script' %DIROUT% && %ECF_CLIENT% --complete"
    # this command is not expected to fail, hence no error handling (i.e. it will stay active)
  task ecf_no_script
    edit ECF_JOB_CMD "ecf_no_script --pass %ECF_PASS% --host %ECF_HOST% --port %ECF_PORT% " # %DIROUT%
    # ecf_no_script contains init, complete, call to ecflow_client and trapping to raise abort
    # use this approach for robust error handling
  task ymd2jul
    edit ECF_JOB_CMD "ECF_PASS=%ECF_PASS% ECF_NAME=%ECF_NAME% /usr/local/bin/ymd2jul.sh -p %ECF_PORT% -n %ECF_HOST% -r /%SUITE%/%FAMILY% -y %YMD% > %ECF_JOBOUT% 2>&1 &"
    # /usr/local/bin/ymd2jul.sh can be called on command line or as ecflow_client
endfamily
- ECF_OUT
This is a user/suite variable that specifies a directory path. It controls the location of job output (stdout and stderr of the process) on a remote file system. It provides an alternate location for the job and cmd output files. If it exists, it is used as a base for ECF_JOBOUT, but it is also used by ecFlow to search for the output when asked by ecflow_ui/ecflow_client. If the output is in ECF_OUT/ECF_NAME.ECF_TRYNO it is returned, otherwise ECF_HOME/ECF_NAME.ECF_TRYNO is used.
The user must ensure that all the directories exist, including suite/family. If this is not done, you may well find that the task remains stuck in a submitted state. At ECMWF our submission scripts ensure that the directories exist.
- ECF_PASS
This is a generated variable. During the job generation process in the server, a unique password is generated and stored in the task. It then replaces %ECF_PASS% in the scripts (.ecf) with the actual value. When the job runs, ecflow_client reads this as an environment variable and passes it to the server. The server then compares this password with the one held on the task. This is used as a part of the authentication for child commands, and is used to detect zombies.
The authentication process can be bypassed, allowing the job to proceed (e.g. when the user is sure that there is only a single process trying to communicate with the server), by adding it as a user variable, i.e.:
ecflow_client --alter add variable ECF_PASS FREE <path to task>
This functionality is also available in the GUI. Select a task, RMB > Special > Free password. However, it is important not to leave this in place, as it will always bypass the authentication. Just delete the variable afterwards.
- ECF_PASSWD
This is an environment variable, which points to a password file for both client and server. This enables password based authentication for ecFlow user commands. The password file is required for the client and server.
4.5.0
# <user> <host> <port> <passwd>
user1 machine1 3141 xxxty
user1 machine2 3142 shhert

4.5.0
user1 machine1 3141 xxxty
user2 machine1 3141 bbsdd7
The server administrator needs to set Unix file permissions, so that this file is only readable by ecFlow server and the administrator.
- ECF_SCRIPT
This is a generated variable. It defines the path name for the ecf script.
- ECF_SCRIPT_CMD
Experimental
This allows the output of running a command to be treated as a script. The command is run after variable substitution. The output is obtained from running the system function popen in the server. Great care should be taken when running this command, to ensure errors in the command do not crash the server. This approach could be used for short lived tasks, where extremely low latency is required. Commands that take more than 20s can interfere with job scheduling and should be avoided. It could possibly be used to check out a script from a version control system.
If the output contains %include, %manual or %nopp, they are treated in the same manner as in a normal ‘.ecf’ script.
suite test
  family family
    task check
      edit ECF_SCRIPT_CMD "cat /tmp/ECF_SCRIPT_CMD/family/check.ecf"
    task t1
      trigger check == complete
      edit ECF_SCRIPT_CMD "cat /tmp/ECF_SCRIPT_CMD/family/t1.ecf"
  endfamily
endsuite
- ECF_STATUS_CMD
User defined variable used by the ecflow_client --status command. It invokes a user-supplied (shell) command that queries the status of the job.
The command should be written in such a way that the output is written to %ECF_JOB%.stat, and if the script determines that the job is not active, it should abort the task in ecflow. This command can be particularly useful when nodes on the supercomputer go down, and we don’t know the true state of the jobs.
The status command can be invoked from the Command line interface (CLI) and ecFlowUI. If applied to a family or suite, the command will be run hierarchically. In ecFlowUI use the Status tab in the Info panel or use Special > Status from the node context menu to run it and see the output.
The code below allows the output of the status command to be shown by the --file command on the command line, and automatically via the Status tab in ecFlowUI:
suite s1
  edit ECF_STATUS_CMD /home/ma/emos/bin/ecfstatus %USER% %HOST% %ECF_RID% %ECF_JOB% > %ECF_JOB%.stat 2>&1
  ....
endsuite
ecflow_client --status=/s1/f1/t1 # ECF_STATUS_CMD should output to %ECF_JOB%.stat
ecflow_client --file=/s1/f1/t1 stat # Return contents of %ECF_JOB%.stat file
- ECF_TRIES
This is a generated variable added at the server level with a default value of 2. It can be overridden by the user and controls the number of times a job should be re-run should it abort, provided that:
the task/job has NOT been killed (user action)
the job process (created from .ecf or .py) exited cleanly and not with exit 1 or sys.exit(1), as process death is captured by the server. Always ensure your script exits cleanly, i.e. exit(0)
the task has NOT been set to abort by the user (user action)
job creation has not failed, i.e. task pre-processing (include file expansion, variable substitution, change of file permission for the job file)
the value of the variable ECF_TRIES is convertible to an integer.
Please note this allows your scripts to be aware of the number of times they are being run, i.e.:
%include <head.h>
echo "do some work"
if [ %ECF_TRYNO% -eq 1 ] ; then
   echo "first attempt"
   .....
fi
if [ %ECF_TRYNO% -eq 2 ] ; then
   echo "first attempt failed, trying a different approach, clean data, etc"
   .....
fi
%include <tail.h>
- ECF_TRYNO
This is a generated variable that is used in file name generation. It represents the current try number for the task.
After begin it is set to 1. The number is advanced if the job is re-run. It is reset back to 1 after a re-queue. It is used in output and job file numbering (i.e. it avoids overwriting the job file output during multiple re-runs).
- ecFlow
ecFlow is the ECMWF workflow manager.
It is a general purpose application designed to schedule a large number of computer processes in a heterogeneous environment.
It helps with the design, submission and monitoring of computer jobs in both the research and operations departments.
- ecflow_client
This executable provides the ecFlow Command line interface (CLI); it is used for all communication with the ecflow_server.
To see the full range of commands that can be sent to the ecflow_server type the following in a UNIX shell:
ecflow_client --help
This functionality is also provided by the Python API.
The following variables affect the execution of ecflow_client.
Since the ecf script can call ecflow_client (i.e. child commands), typically some of these are set in an include header, e.g. head.h.
Table 7 Environment variables common for user and child commands
Variable Name
Explanation
Compulsory
Example
ECF_PORT
Port number of the ecflow_server. Must match ecflow_server
Yes/No
We can use:
ecflow_client --port 3141
as an alternative to specifying the ECF_PORT.
ECF_HOST
Name of the host running the ecflow_server
Yes/No
We can use:
ecflow_client --host machine1
as an alternative to specifying ECF_HOST
NO_ECF
If set exits ecflow_client immediately with success. This allows the scripts to be tested independent of the server
No
export NO_ECF=1
ECF_DENIED
If the server denies client communication and this flag is set, exit with an error. Avoids a 24 hour connection attempt to ecflow_server.
No
export ECF_DENIED=1
ECF_SSL
For secure socket communication with server. Requires client/server built with openssl libs.
No
# Use same certificate for multiple server
export ECF_SSL=1
# Use server specific certificates
export ECF_SSL=<host>.<port>
Alternatively, to avoid setting environment variables, we can use:
ecflow_client --ssl ....
The client will first look for $HOME/.ecflowrc/ssl/server.crt, then $HOME/.ecflowrc/ssl/<host>.<port>.crt
Table 8 Environment variables for child commands
Variable Name
Explanation
Compulsory
Example
ECF_NAME
Path to the task
Yes
/suite/family/task
ECF_PASS
Job password. Generated by the server; it replaces %ECF_PASS% in the scripts during job generation. Used for authenticating child commands.
Yes
(generated)
ECF_RID
Remote id. Allow easier job kill, and disambiguate a zombie
Yes
(generated)
ECF_TRYNO
The number of times the job has run. This is allocated by the server and used in job/output file name generation.
No
(generated)
ECF_HOSTFILE
File that lists alternate hosts to try, if connection to main host fails
No
$HOME/.echostfile
ECF_TIMEOUT
Maximum time in seconds for the client to deliver a message
No
24*3600 (default value):
export ECF_TIMEOUT=36024*3600
ECF_ZOMBIE_TIMEOUT
Maximum time in seconds for the child(init, abort, complete, etc) zombie client to get a reply from the server.
No
12*3600 (default value):
export ECF_ZOMBIE_TIMEOUT=36024*3600
Table 9 Variables specific to user commands
Variable Name
Explanation
Compulsory
Example
ECF_PASSWD
Path to the client password file, used for password based authentication
No
export ECF_PASSWD=mymachine.3141.ecf.passwd
ECF_USER
Used when a user needs to pose as another user, e.g. when the user's id on the client machine doesn't match their id on the remote server. Requires a password file.
No
export ECF_USER=my_user_name
To avoid setting environment variable we can use:
ecflow_client --user my_user_name ......
- ecflow_server
This executable is the server.
It is responsible for scheduling the jobs and responding to ecflow_client requests
Multiple servers can be run on the same machine/host providing they are assigned a unique port number.
The server records all requests in the log file.
The server will periodically (see ECF_CHECKINTERVAL) write out a check point file.
The following environment variables control the execution of the server and may be set before the start of the server. ecflow_server will start happily without any of these variables being set, since all of them have default values.
Variable Name
Explanation
Default value
ECF_HOME
Home for all the ecFlow files
Current working directory
ECF_PORT
Server port number. Must be unique
3141
ECF_LOG
History or log file
<host>.<port>.ecf.log
ECF_CHECK
Name of the checkpoint file
<host>.<port>.ecf.check
ECF_CHECKOLD
Name of the backup checkpoint file
<host>.<port>.ecf.check.b
ECF_CHECKINTERVAL
Interval in seconds to save the check point file
120
ECF_LISTS
White list file. Controls read/write access to the server for each user
<host>.<port>.ecf.lists
ECF_TASK_THRESHOLD
Report in the log file all tasks/jobs that take longer than the given threshold. Used to debug/instrument scripts that are very large.
4000 (milliseconds). Before release 4.0.6 default was 2000 ms.
ECF_PASSWD
Path to server password file, used to authenticate user commands. Use when ALL users should be password authenticated
<host>.<port>.ecf.passwd
ECF_CUSTOM_PASSWD
path to server password file, used to authenticate user commands. Use when a small number of users need to be password authenticated. Typically client would use:ecflow_client –user=fred ….export ECF_USER=fred; ecflow_client …
<host>.<port>.ecf.custom_passwd
ECF_PRUNE_NODE_LOG
When the checkpoint point file is loaded, node log history older than 30 days is automatically pruned. The variable allows this value to be changed.Setting the variable to zero, means there will be no pruning. All history is preserved at the cost increasing server memory, and time taken to write checkpoint file.
export ECF_PRUNE_NODE_LOG=40
Prune node log history older than 40 days, upon reload of check point file.
ECF_SSL
For secure socket communication with client.Requires client/server built with openssl libs
#Use same certificate for multiple servers export ECF_SSL=1 # Use server specific certificates export ECF_SSL=<host>.<port>
Alternatively, to avoid setting environment variables, use:
ecflow_server --ssl ...
or:
ecflow_start.sh -s
The server will then first look for $HOME/.ecflowrc/ssl/server.crt then $HOME/.ecflowrc/ssl/<host>.<port>.crt
The server can be in several states. The default state when first started is halted. See server states.
- ecflow_ui
The ecflow_ui executable is the new GUI-based client. It is used to visualise and monitor the hierarchical structure of the suite definition.
- event
The purpose of an event is to signal partial completion of a task and to be able to trigger another job which is waiting for this partial completion.
Only tasks can have events and they can be considered as an attribute of a task.
There can be many events and they are displayed as nodes.
The event is updated by placing the --event child command in an ecf script.
An event has a number and possibly a name. If it is only defined as a number, its name is the text representation of the number without leading zeroes.
Events can be referenced in trigger and complete expressions.
- extern
This allows an external node to be used in a trigger expression.
All nodes in triggers must be known to ecflow_server by the end of the load command. No cross-suite dependencies are allowed unless the names of tasks outside the suite are declared as external. An external trigger reference is considered unknown if it is not defined when the trigger is evaluated. You are strongly advised to avoid cross-suite dependencies.
Families and suites that depend on one another should be placed in a single suite. If you think you need cross-suite dependencies, you should consider merging the suites together and have each as a top-level family in the merged suite.
For grammar see extern.
- family
A family is an organisational entity that is used to provide hierarchy and grouping. It consists of a collection of tasks and families.
Typically you place tasks that are related to each other inside the same family, analogous to the way you create directories to contain related files. For python see ecflow.Family. For BNF see family.
It serves as an intermediate node in a suite definition.
- generic
A generic attribute associates a name to a set of generic string values, and is used to gracefully indicate the presence of unknown attributes in the suite definition.
This kind of attribute is used to allow the introduction of future attributes without requiring an API change. When an older version of ecflow encounters a new/unknown attribute, the attribute is automatically converted into a generic attribute.
Warning
The user is strongly advised not to include generic attributes in suite definitions.
- halted
An ecflow_server state. See server states.
- hybrid clock
A hybrid clock is a complex notion: the date and time are not connected.
The date has a fixed value during the complete execution of the suite. This will be mainly used in cases where the suite does not complete in less than 24 hours. This guarantees that all tasks of this suite are using the same date. On the other hand, the time follows the time of the machine.
Hence the date never changes unless specifically altered or unless the suite restarts, either automatically or from a begin command.
Under a hybrid clock any node held by a date or day dependency will be set to complete at the beginning of the suite (i.e. without its job ever running). Otherwise the suite would never complete.
- inlimit
The inlimit works in conjunction with limit/ecflow.Limit for providing simple load management. inlimit is added to the node that needs to be limited.
suite suite
  limit disk 100
  family anon
    inlimit /suite:disk 5
    task t1
    ...
    task t100
  endfamily
endsuite
suite test
  limit fam 2
  family f1
    inlimit -n fam
    task t1
    ....
  endfamily
  family f2
    inlimit -n fam
    task t1
    ....
  endfamily
  family f3
    inlimit -n fam
    task t1
    ....
  endfamily
endsuite
# Hence we could have more than 2 active jobs, since we only control the number in the submitted state.
# If we removed the -s then we can only have two active jobs running at one time
suite test_limit_on_submission
  limit disk 2
  family anon
    inlimit -s disk  # Inlimit submission
    task t1
    task t2
    ....
  endfamily
endsuite
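The first example above can be read as a token budget. The following Python sketch assumes that each task under `inlimit /suite:disk 5` consumes 5 of the 100 tokens of the `disk` limit, so that at most 20 tasks can be submitted at once. The `Limit` class is hypothetical and is not part of the ecflow API; it only illustrates the bookkeeping.

```python
class Limit:
    """Hypothetical model of a limit: a named pool with a maximum number of tokens."""

    def __init__(self, name, max_tokens):
        self.name = name
        self.max_tokens = max_tokens
        self.in_use = 0

    def try_consume(self, tokens=1):
        """Reserve tokens if the pool has room; return True on success."""
        if self.in_use + tokens > self.max_tokens:
            return False
        self.in_use += tokens
        return True

    def release(self, tokens=1):
        """Give tokens back, e.g. when a task completes."""
        self.in_use = max(0, self.in_use - tokens)


disk = Limit("disk", 100)
# Each of the 100 tasks asks for 5 tokens, mirroring "inlimit /suite:disk 5".
submitted = [f"t{i}" for i in range(1, 101) if disk.try_consume(5)]
print(len(submitted))  # 20 tasks fit into 100 tokens
```

Releasing the tokens of a finished task (`disk.release(5)`) makes room for the next waiting task.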
- job creation
Job creation or task invocation can be initiated manually via ecflow_ui, but also by the ecflow_server during scheduling when a task (and all of its parent nodes) is free of its dependencies.
The process of job creation includes:
Generating a unique password ECF_PASS, which is placed in the ecf script during pre-processing. See head.h
Locating the ecf script file corresponding to the task in the suite definition. See ecf file location algorithm
Pre-processing the contents of the ecf script file
The steps above transform an ecf script into a job file, which is then submitted by performing variable substitution on the ECF_JOB_CMD variable and invoking the command.
The running jobs will communicate back to the ecflow_server by calling child commands.
This causes status changes on the nodes in the ecflow_server and flags can be set to indicate various events.
If a task is to be treated as a dummy task (i.e. is used as a scheduling task) and is not meant to be run, then a variable named ECF_DUMMY_TASK can be added:
task.add_variable("ECF_DUMMY_TASK","")
- job file
The job file is created by the ecflow_server during job creation using the ECF_TRYNO variable.
It is derived from the ecf script after expanding the pre-processing directives.
It has the form <task name>.job<ECF_TRYNO>, i.e. t1.job1.
Note that job creation checking will create a job file with an extension of zero, i.e. ‘.job0’. See ecflow.Defs.check_job_creation
When the job is run the output file has the ECF_TRYNO as the extension, i.e. t1.1, where ‘t1’ represents the task name and ‘1’ the ECF_TRYNO.
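The naming rules above can be sketched in a few lines of Python. The helper names `job_file_name` and `job_output_name` are hypothetical and chosen for illustration only; they simply format the names described in the text.

```python
def job_file_name(task_name, try_no):
    """Job file name of the form <task name>.job<ECF_TRYNO>, e.g. t1.job1."""
    return f"{task_name}.job{try_no}"


def job_output_name(task_name, try_no):
    """Job output file name of the form <task name>.<ECF_TRYNO>, e.g. t1.1."""
    return f"{task_name}.{try_no}"


print(job_file_name("t1", 1))    # t1.job1
print(job_file_name("t1", 0))    # t1.job0, as produced by job creation checking
print(job_output_name("t1", 1))  # t1.1
```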
- label
A label has a name and a value and is a way of displaying information in ecflow_ui.
By placing label child commands in the ecf script, the user can be informed about progress in ecflow_ui.
Labels can be added to family nodes. To change the labels, scripts should use:
ecflow_client --alter change label <label_name> <new_value> /path/to/family_node/with/label
If a label child command results in a zombie, then the default action is for the server to fob. This allows the ecflow_client command to exit normally (i.e. without any errors). This default can be overridden by using a zombie attribute.
- late
Define a tag for a node to be late. A node can only have one late attribute. The late attribute only applies to a task. You can define it on a suite/family, in which case it will be inherited. Any late defined lower down the hierarchy will override the aspect (submitted, active, complete) defined higher up.
Command options:
-s submitted: The time a node can stay submitted (format [+]hh:mm). submitted is always relative, so + is simply ignored, if present. If the node stays submitted longer than the time specified, the late flag is set.
-a active: The time of day the node must have become active (format hh:mm). If the node is still queued or submitted, the late flag is set.
-c complete: The time by which the node must become complete (format [+]hh:mm). If relative, time is taken from the time the node became active, otherwise the node must be complete by the time given.
suite late
  family familyName
    task t1
      late -s +00:15 -a 20:00 -c +02:00
    task t2
      late -a 20:00 -c +02:00 -s +00:15
    task t3
      late -c +02:00 -a 20:00 -s +00:15
    task t4
      late -s 00:02 -c +00:05
    task t5
      late -s 00:01 -a 14:30 -c +00:01
  endfamily
endsuite
Suites cannot be late, but you can define a late tag for submitted in a suite, to be inherited by the families and tasks. When a node is classified as being late, the only action ecflow_server takes is to set a flag. ecflow_ui will display these alongside the node name as an icon (and optionally pop up a window).
suite late
  late -s +00:15  # report late for all tasks taking longer than 15 minutes in submitted state
  family familyName
    late -c +02:00  # all child tasks that take longer than 2 hours to complete should raise a late flag
    task t1  # effective late -s +00:05 -c +02:00
      late -s +00:05
    task t2  # effective late -s +00:15 -c +02:00
    task t5  # effective late -c +03:00 -a 18:00 -s +00:15
      late -c +03:00 -a 18:00
  endfamily
endsuite
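The "effective late" comments in the example above follow from merging the late aspects from the suite down to the task, with lower levels overriding higher ones. A minimal Python sketch of that merge, using a hypothetical `effective_late` helper and plain dictionaries rather than any ecflow API:

```python
def effective_late(*levels):
    """Merge late settings from the outermost node (suite) to the innermost (task).

    Each level is a dict keyed by aspect: "s" (submitted), "a" (active), "c" (complete).
    An aspect defined lower down replaces the same aspect defined higher up.
    """
    merged = {}
    for level in levels:
        merged.update(level)
    return merged


suite_late = {"s": "+00:15"}
family_late = {"c": "+02:00"}

t1 = effective_late(suite_late, family_late, {"s": "+00:05"})
t2 = effective_late(suite_late, family_late, {})
t5 = effective_late(suite_late, family_late, {"c": "+03:00", "a": "18:00"})
print(t1)  # {'s': '+00:05', 'c': '+02:00'}
print(t2)  # {'s': '+00:15', 'c': '+02:00'}
print(t5)  # {'s': '+00:15', 'c': '+03:00', 'a': '18:00'}
```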
The late attribute can be added/deleted to any suite/family/task.
ecflow_client --alter add late "-s 00:15" <path-to-node>
ecflow_client --alter change late "-s 00:01 -a 14:30 -c +00:01" <path-to-node>
ecflow_client --alter delete late
- limit
Limits provide simple load management by limiting the number of tasks submitted by a specific ecflow_server. Typically you either define limits on suite level or define a separate suite to hold limits so that they can be used by multiple suites.
Setting limits on a separate suite has the benefit that, by setting the limit value to zero, you can control task submission over a number of suites.
suite suiteName
  limit sg1 10
  limit mars 10
endsuite
The limits are used in conjunction with inlimit.
The limit max value can be changed on the command line:
ecflow_client --alter change limit_max <limit-name> <new-limit-value> <path-to-limit>
ecflow_client --alter change limit_max limit 2 /suite
It can also be changed in python:
import ecflow
try:
    ci = ecflow.Client()
    ci.alter("/suite", "change", "limit_max", "limit", "2")
except RuntimeError as e:
    print("Failed: " + str(e))
- manual page
Manual pages are part of the ecf script.
This is to ensure that the manual page is updated when the ecf script is updated. The manual page is a very important operational tool allowing you to view a description of a task, and possibly describing solutions to common problems. The pre-processing can be used to extract the manual page from the script file and is visible in ecflow_ui. The manual page is the text contained within the %manual and %end directives. They can be seen using the Manual tab in the Info panel in ecflow_ui.
The text in the manual page is not included in the job file.
There can be multiple manual sections in the same ecf script file. When viewed they are simply concatenated. It is good practice to modify the manual pages when the script changes.
The manual page may contain %include directives.
Suites and families may also have a manual page. These will also be available in the GUI. ecFlow will look for a file <node_name>.man (where node_name is the name of the suite or family) using a backwards search algorithm, first in the ECF_FILES directory, then in the ECF_HOME directory. Note that errors in variable pre-processing are ignored inside a manual section. It should also be noted that for family and suite manuals, the %manual and %end directives are not strictly necessary, as the whole file is treated as a manual.
If we have family /suite/big/f1, ecFlow will search for “f1.man” in:
<ECF_FILES>/suite/big/f1.man
<ECF_FILES>/suite/f1.man
<ECF_FILES>/f1.man
<ECF_HOME>/suite/big/f1.man
<ECF_HOME>/suite/f1.man
<ECF_HOME>/f1.man
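The search order listed above can be reproduced with a short Python sketch. The function `manual_search_paths` is hypothetical; it only mirrors the six paths shown for /suite/big/f1 and makes no claim about how ecFlow implements the search internally. The placeholders `<ECF_FILES>` and `<ECF_HOME>` stand for the values of those variables.

```python
def manual_search_paths(node_path, ecf_files="<ECF_FILES>", ecf_home="<ECF_HOME>"):
    """List candidate <node_name>.man locations in the order shown above."""
    parts = node_path.strip("/").split("/")
    man_file = parts[-1] + ".man"
    candidates = []
    for root in (ecf_files, ecf_home):
        dirs = parts[:-1]
        while True:
            candidates.append("/".join([root] + dirs + [man_file]))
            if not dirs:
                break
            dirs = dirs[:-1]  # step one directory level back
    return candidates


for path in manual_search_paths("/suite/big/f1"):
    print(path)
```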
- meter
The purpose of a meter is to signal proportional completion of a task and to be able to trigger another job which is waiting on this proportional completion.
The meter is updated by placing the --meter child command in an ecf script. Meters can be added to family nodes. To change the meters, scripts should use:
ecflow_client --alter change meter <meter_name> <new_value> /path/to/family_node/with/meter
If the meter child command results in a zombie, then the default action is for the server to fob. This allows the ecflow_client command to exit normally (i.e. without any errors). This default can be overridden by using a zombie attribute.
Meters can be referenced in trigger and complete expressions.
- mirror
A mirror is an attribute of a Node (typically a Task) that allows the node to follow the status of a node on a remote ecFlow server.
A Node with a mirror attribute will have its status periodically synchronized with the status of a node on a remote ecFlow server. The synchronised status can be used to trigger the execution of local nodes.
Note
Synchronised tasks don’t need to be provided with .ecf files on the local ecFlow server, as the execution of a Task with a mirror attribute does not happen under the responsibility of the local ecFlow server.
Operations to execute synchronised Tasks have been disabled from the ecflow_ui.
Only one mirror attribute is allowed per node, and each attribute is defined by the following properties:
name, an identifier
remote_path, the path of the node on the remote ecFlow server
remote_host, the remote ecFlow server host
remote_port, the remote ecFlow server port
ssl, to connect to the ecFlow server using SSL
polling, the value (in seconds) used to periodically contact the remote ecFlow server
auth, the location of the Mirror authentication credentials file
The value of the properties remote_host, remote_port, polling, and auth can be composed of Variables. When these properties are not provided, the following default values are used:
%ECF_MIRROR_REMOTE_HOST%, for remote_host
%ECF_MIRROR_REMOTE_PORT%, for remote_port
%ECF_MIRROR_REMOTE_POLLING%, for polling
%ECF_MIRROR_REMOTE_AUTH%, for auth
Each mirror attribute implies that a background thread is spawned whenever the ecFlow server is running. This background thread is responsible for polling the remote ecFlow server and periodically synchronising the node status.
- node
suite, family and task form a hierarchy, where the suite serves as the root of the hierarchy, the families provide the intermediate nodes, and the tasks provide the leaves.
Collectively suite, family and task can be referred to as nodes.
For python see ecflow.Node.
- pre-processing
Pre-processing takes place during job creation and acts on directives specified in the ecf script file.
This involves:
expanding any include file directives, i.e. similar to C language pre-processing
removing comments and manual directives
performing variable substitution
- queue
Queues allow jobs that are identical and vary only in the step to be run efficiently.
This attribute makes it possible to follow a producer (server)/consumer (tasks) pattern. Note that additional task consumers can be added for load balancing.
suite test_queue
  family f1
    queue q1 001 002 003 004 005 006 007
    task t
  endfamily
  family f2
    queue q2 1 2 3 4 5 6 8 9 10
    task a
    task b
      # notice that queue name is accessible to the trigger
      trigger /test_queue/f1:q1 > 5
    task c
      trigger ../f2/a:q2 > 9
  endfamily
endsuite
The queue child command will signal when a step is active, complete, or has aborted:
# Note: because --queue is treated like a child command (init, complete, event, label, meter, abort, wait), the task path ECF_NAME is read from the environment
# The --queue command will search up the node hierarchy for the queue name. If not found it fails.
step=$(ecflow_client --queue queue_name active)  # returns first queued/aborted step from the server and makes it active. Returns "NULL" for the last step.
ecflow_client --queue queue_name complete $step  # Tell the server that step has completed for the given queue
ecflow_client --queue queue_name aborted $step  # Tell the server that step has aborted for the given queue
no_of_aborted=$(ecflow_client --queue queue_name no_of_aborted)  # returns as a string the number of aborted steps
ecflow_client --queue queue_name reset
The queue values can be strings. However, if they are to be used in trigger expressions, they must be convertible to integers:
suite test_queue
  family f1
    queue q1 red orange yellow green blue indigo violet
    task t
  endfamily
endsuite
See also:
queue
- queued
-
After the begin command, tasks without a defstatus are placed into the queued state.
- real clock
A suite using a real clock will have its clock matching the clock of the machine. Hence the date advances by one day at midnight.
- repeat
Repeats provide looping functionality. There can only be a single repeat on a node.
repeat day step [ENDDATE]  # only for suites
repeat integer VARIABLE start end [step]
repeat enumerated VARIABLE first [second [third ...]]
repeat string VARIABLE str1 [str2 ...]
repeat file VARIABLE filename
repeat date VARIABLE yyyymmdd yyyymmdd [delta]
repeat datetime VARIABLE yyyymmddTHHMMSS yyyymmddTHHMMSS [delta]
repeat datelist VARIABLE yyyymmdd(1) yyyymmdd(2) ...
The repeat variable name is available as a generated variable.
The repeat date and repeat datetime define several generated variables, prefixed by the variable name:
# Provided for `repeat date` and `repeat datetime`
<variable>          # the default, the value is the current date
<variable>_YYYY     # the year
<variable>_MM       # the month
<variable>_DD       # the day of the month
<variable>_DOW      # day of the week
<variable>_JULIAN   # the julian value for the date
# Provided for `repeat datetime`
<variable>_DATE     # the date formatted as yyyymmdd
<variable>_TIME     # the time formatted as HHMMSS
<variable>_HOURS    # the hours
<variable>_MINUTES  # the minutes
<variable>_SECONDS  # the seconds
For example:
repeat date YMD 20090101 20220101
# The following generated variables are accessible for trigger expressions
# YMD
# YMD_YYYY, YMD_MM, YMD_DD, YMD_DOW, YMD_JULIAN
repeat datetime DT 20090101T000000 20090102T000000 06:00:00
# The following generated variables are accessible for trigger expressions
# DT
# DT_DATE, DT_YYYY, DT_MM, DT_DD, DT_DOW, DT_JULIAN
# DT_TIME, DT_HOURS, DT_MINUTES, DT_SECONDS
The repeat VARIABLE can be used in trigger and complete expressions.
As the repeat variable changes, so do the generated variables. (See the Repeat tutorial for an example.)
Warning
If a repeat is added to a family/suite, then the repeat will ONLY loop (and automatically re-queue its children) if all the children are complete. Hence additional care needs to be taken, i.e. if the parent node has a repeat and the child has a cron attribute, then the cron will always force a re-queue on the node once it has run, and hence will stop the parent from looping.
If we use a relative time attribute, i.e. time +02:00, under a repeat, then the time is relative to the repeat re-queue.
The repeat VARIABLE can be used in trigger and complete expressions. Depending on the kind of repeat the value can vary:
RepeatDate -> value
RepeatDateList -> value
RepeatString -> index (will always return an index)
RepeatInteger -> value
RepeatEnumerated -> value | index (returns the value at index if castable to integer, otherwise returns the index)
RepeatDay -> value
If a “repeat date” VARIABLE is used in a trigger expression, then date arithmetic is used when the expression uses addition and subtraction, i.e.:
import ecflow
defs = ecflow.Defs()
s1 = defs.add_suite("s1")
t1 = s1.add_task("t1").add_repeat(ecflow.RepeatDate("YMD", 20090101, 20091231, 1))
t2 = s1.add_task("t2").add_trigger("t1:YMD - 1 eq 20081231")
assert t2.evaluate_trigger(), "Expected trigger to evaluate. 20090101 - 1 == 20081231"
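To see why date arithmetic matters here, the following standalone Python sketch (standard library only, no ecflow import) contrasts calendar subtraction with plain integer subtraction on a yyyymmdd value. The helper `date_plus_days` is hypothetical and is only an illustration of the idea, not ecFlow's implementation.

```python
from datetime import date, timedelta


def date_plus_days(yyyymmdd, days):
    """Add a number of days to an integer yyyymmdd date using calendar arithmetic."""
    year, rest = divmod(yyyymmdd, 10000)
    month, day = divmod(rest, 100)
    shifted = date(year, month, day) + timedelta(days=days)
    return shifted.year * 10000 + shifted.month * 100 + shifted.day


print(date_plus_days(20090101, -1))  # 20081231
print(20090101 - 1)                  # 20090100, not a valid date
```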
When we use relative time attributes under a repeat, they are automatically reset when the repeat loops. Take for example:
suite s1
  family hc00
    repeat integer HYEAR 1993 2017
    time +00:01  # when the repeat loops, delay starting task a for 1 minute
    task a
    task b
      trigger a == complete
  endfamily
endsuite
Now when task ‘a’ and task ‘b’ complete, the repeat is incremented, and any relative time attributes are reset, in this case effectively delaying the start of task ‘a’ by 1 minute.
See also:
ecflow.Node.add_repeat, ecflow.Repeat, ecflow.RepeatDate, ecflow.RepeatEnumerated, ecflow.RepeatInteger, ecflow.RepeatDay
- running
An ecflow_server state. See server states.
- scheduling
The ecflow_server is responsible for task scheduling.
It will check dependencies in the suite definition every minute. If these dependencies are free, the ecflow_server will submit the task. See job creation.
- server states
The following table reflects the ecflow_server capabilities in the different states:
State | User Request | Task Request | Job Scheduling | Auto-Check-pointing
running | yes | yes | yes | yes
shutdown | yes | yes | no | yes
halted | yes | no | no | no
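The capability table above can be encoded as a small lookup. This Python sketch assumes that the three rows correspond, in order, to the running, shutdown and halted states; the dictionary and the `server_allows` helper are hypothetical names used for illustration only.

```python
# Capabilities per server state; the row-to-state mapping is an assumption (see above).
SERVER_CAPABILITIES = {
    "running":  {"user_request": True, "task_request": True,  "job_scheduling": True,  "auto_checkpoint": True},
    "shutdown": {"user_request": True, "task_request": True,  "job_scheduling": False, "auto_checkpoint": True},
    "halted":   {"user_request": True, "task_request": False, "job_scheduling": False, "auto_checkpoint": False},
}


def server_allows(state, capability):
    """Return True if a server in the given state offers the capability."""
    return SERVER_CAPABILITIES[state][capability]


print(server_allows("shutdown", "job_scheduling"))  # False
print(server_allows("halted", "user_request"))      # True
```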
- shutdown
An ecflow_server state. See server states.
- status
Each node in the suite definition has a status.
Status reflects the state of the node. In ecflow_ui the background colour of the text reflects the status.
Task statuses are: unknown, queued, submitted, active, complete, aborted and suspended.
ecflow_server statuses are: shutdown, halted and running. This is shown on the root node in ecflow_ui.
- submitted
-
When the task dependencies are resolved/free, the ecflow_server places the task into the submitted state. However, if the ECF_JOB_CMD fails, the task is placed into the aborted state.
- suite
A suite is an organisational entity. It serves as the root node in a suite definition. It should be used to hold a set of jobs that achieve a common function. It can be used to hold user variables that are common to all of its children.
Only a suite node can have a clock.
Suite generated variables:
Variable | Description
SUITE | The name of the suite
ECF_TIME | The current suite time, e.g. 23:30
TIME | The time as an integer, e.g. 2330. Can be used in a trigger expression, ideally using <=, <, >=, >
YYYY | The year as an integer
DOW | Day of the week, as an integer. Sunday=0, Monday=1, etc.
DOY | Day of the year, as an integer
DAY | The day as a string, e.g. monday
DD | Day of the month as an integer
MM | The month as an integer
MONTH | The month as a string
ECF_DATE | Year, month and day of the month as an 8-digit integer (YYYYMMDD)
ECF_JULIAN | The julian value of the current date (added in ecFlow 4.7.0)
ECF_CLOCK | <day>:<month>:<day of week>:<day of year>, e.g. Tuesday:December:2:348
A suite is a collection of families, variables, a repeat and a single clock definition.
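As an illustration of the suite generated variables listed above, the following Python sketch derives most of them from a `datetime`. The function `suite_time_variables` is hypothetical and is not how ecFlow computes them; the lower-case day and month names follow the "monday" example in the list and are an assumption, and ECF_JULIAN and ECF_CLOCK are left out.

```python
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def suite_time_variables(now):
    """Derive clock-related generated variables from a datetime (illustrative only)."""
    return {
        "ECF_TIME": now.strftime("%H:%M"),
        "TIME": now.hour * 100 + now.minute,
        "YYYY": now.year,
        "DOW": (now.weekday() + 1) % 7,  # Python has Monday=0; the list above has Sunday=0
        "DOY": now.timetuple().tm_yday,
        "DAY": DAYS[now.weekday()],
        "DD": now.day,
        "MM": now.month,
        "MONTH": MONTHS[now.month - 1],
        "ECF_DATE": now.year * 10000 + now.month * 100 + now.day,
    }


variables = suite_time_variables(datetime(2010, 12, 14, 23, 30))
print(variables["TIME"], variables["DOW"], variables["DOY"])  # 2330 2 348
```

The sample date, 14 December 2010, is a Tuesday and the 348th day of the year, which matches the day-of-week and day-of-year fields of the ECF_CLOCK example.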
- suite definition
The suite definition is the hierarchical node tree. It describes how your tasks run and interact. It can be built up using:
An ASCII text file that follows the rules defined in the ecFlow Definition file grammar. Hence any language can be used to generate this format.
Once the definition is built, it can be loaded into the ecflow_server and started. It can be monitored by ecflow_ui.
- suspended
A node state. A node can be placed into the suspended state via a defstatus or via ecflow_ui.
A suspended node, including any of its children, cannot take part in scheduling until the node is resumed.
- task
A task represents a job that needs to be carried out. It serves as a leaf node in a suite definition
Only tasks can be submitted.
A job inside a task ecf script should generally be re-entrant so that no harm is done by rerunning it, since a task may be automatically submitted more than once if it aborts.
- time
This defines a time dependency for a node.
Time is expressed in the format [h]h:mm. Only numeric values are allowed.
There can be multiple time dependencies for a node, but overlapping times may cause unexpected results.
task t
  time 10:00
  time 11:00
To define a series of times, specify the start time, end time and a time increment. If the start time begins with ‘+’, times are relative to the beginning of the suite or, in repeated families, relative to the beginning of the repeated family.
If the time the job takes to complete is longer than the interval a “slot” is missed, e.g.:
time 10:00 20:00 01:00
If the 10:00 run takes more than an hour, the 11:00 run will never occur.
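The time series and the missed slot can be illustrated with a short Python sketch. The helpers below are hypothetical and only model the behaviour described above; treating the end time as inclusive is an assumption.

```python
def to_minutes(hhmm):
    """Convert an hh:mm string into minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total):
    """Convert minutes since midnight back into an hh:mm string."""
    return f"{total // 60:02d}:{total % 60:02d}"


def time_slots(start, end, increment):
    """Expand 'time start end increment' into the list of slots (end assumed inclusive)."""
    step = to_minutes(increment)
    return [to_hhmm(m) for m in range(to_minutes(start), to_minutes(end) + 1, step)]


def next_slot(slots, finished_at):
    """First slot that is not earlier than the time the previous run finished."""
    done = to_minutes(finished_at)
    for slot in slots:
        if to_minutes(slot) >= done:
            return slot
    return None


slots = time_slots("10:00", "20:00", "01:00")
print(slots[:3])                  # ['10:00', '11:00', '12:00']
print(next_slot(slots, "11:10"))  # 12:00, the 11:00 slot was missed
```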
- time dependencies
This includes time, today, day, date and cron.
When we have multiple time dependencies on the same task, the dependencies of the same type are OR'ed together, and the results for the different types are then AND'ed.
task xx
  time 10:00
  date 17.2.2017
Listing 140 Run task xx at 10am and 8pm, on the 17th and 19th of February 2017, that is four times in all. Notice the task is queued in between and completes only after the last run
task xx
  time 10:00
  time 20:00
  date 17.2.2017
  date 19.2.2017
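The OR-within-a-type, AND-across-types rule can be checked with a small Python sketch. The function `time_dependencies_free` is hypothetical and simplifies matters by comparing exact time and date strings; it only covers the time and date types used in the listing.

```python
def time_dependencies_free(now_time, now_date, times=(), dates=()):
    """Dependencies of the same type are OR'ed; the different types are AND'ed."""
    time_ok = not times or now_time in times
    date_ok = not dates or now_date in dates
    return time_ok and date_ok


times = {"10:00", "20:00"}
dates = {"17.2.2017", "19.2.2017"}
runs = [
    (d, t)
    for d in ("16.2.2017", "17.2.2017", "18.2.2017", "19.2.2017")
    for t in ("09:00", "10:00", "20:00")
    if time_dependencies_free(t, d, times, dates)
]
print(len(runs))  # 4
```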
- today
Like time, but if the suite's begin time is past the time given for the “today” command, the node is free to run (as far as the time dependency is concerned).
For example:
task x
  today 10:00
If we begin or re-queue the suite at 9.00 am, then the task is held until 10.00 am. However, if we begin or re-queue the suite at 11.00 am, the task is run immediately.
Now let's look at time:
task x
  time 10:00
If we begin or re-queue the suite at 9.00 am, then the task is held until 10.00 am. If we begin or re-queue the suite at 11.00 am, the task is still held.
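The difference between today and time at the moment the suite begins can be sketched as follows. The function `free_at_begin` is hypothetical; it only answers whether the node is free immediately at begin or re-queue, and treating a begin exactly at the given time as free is an assumption.

```python
def free_at_begin(kind, dependency, begin):
    """Return True if the node is free as soon as the suite begins.

    kind is "today" or "time"; dependency and begin are hh:mm strings.
    """
    def minutes(hhmm):
        hours, mins = hhmm.split(":")
        return int(hours) * 60 + int(mins)

    if kind == "today":
        # Free straight away once the begin time has reached or passed the given time.
        return minutes(begin) >= minutes(dependency)
    if kind == "time":
        # Only free at the given time itself; a later begin leaves the node held.
        return minutes(begin) == minutes(dependency)
    raise ValueError(f"unknown dependency kind: {kind}")


print(free_at_begin("today", "10:00", "09:00"))  # False, held until 10:00
print(free_at_begin("today", "10:00", "11:00"))  # True, runs immediately
print(free_at_begin("time", "10:00", "11:00"))   # False, still held
```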
If the time the job takes to complete is longer than the interval a “slot” is missed, e.g.:
today 10:00 20:00 01:00
If the 10:00 run takes more than an hour, the 11:00 run will never occur.
- trigger
A trigger defines a dependency for a task or family.
There can be only one trigger dependency per node, but that can be a complex boolean expression of the status of several nodes. Triggers should be avoided on suites. A node with a trigger can only be activated when its trigger has expired. A trigger holds the node as long as the trigger expression evaluation returns false.
Trigger evaluation occurs whenever a child command communicates with the server, i.e. whenever there is a state change in the suite definition.
The keywords in trigger expressions are: unknown, suspended, complete, queued, submitted, active and aborted, plus clear and set for event status.
Triggers can also reference Node attributes like event, meter, variable, repeat and generated variables. Trigger evaluation for node attributes uses integer arithmetic:
event: has the integer value 0 (clear) or 1 (set)
meter: values are integers hence they are used as is
variable: value is converted to an integer, otherwise 0 is used. See example below
repeat string: use the index values as integers. See example below
repeat enumerated: use the index values as integers. See example below
repeat integer: use the implicit integer values
repeat date: use the date values as integers. Use of plus/minus on repeat date variable uses date arithmetic
repeat datetime: use the date+time instant values as integers. Use of plus/minus on repeat datetime variable uses second arithmetic
limit: the limit value is used as an integer. This allows a degree of prioritisation amongst tasks under a limit
late: the value is stored in a flag, and is a simple boolean. Used to signify when a task is late.
Here are some examples:
suite trigger_suite
  task a
    event EVENT
    meter METER 1 100 50
    edit VAR_INT 12
    edit VAR_STRING "captain scarlett"  # This is not convertible to an integer, if referenced will use '0'
  family f1
    edit SLEEP 2
    repeat string NAME a b c d e f  # This has values: a(0), b(1), c(2), d(3), e(4), f(5) i.e. index
    family f2
      repeat integer VALUE 5 10  # This has values: 5,6,7,8,9,10
      family f3
        repeat enumerated red green blue  # red(0), green(1), blue(2)
        task t1
          repeat date DATE 19991230 20000102  # This has values: 19991230,19991231,20000101,20000102
      endfamily
    endfamily
  endfamily
  family f2
    task event_meter
      trigger /suite/a:EVENT == set and /suite/a:METER >= 30
    task variable
      trigger /suite/a:VAR_INT >= 12 and /suite/a:VAR_STRING == 0
    task repeat_string
      trigger /suite/f1:NAME >= 4
    task repeat_integer
      trigger /suite/f1/f2:VALUE >= 7
    task repeat_date
      trigger /suite/f1/f2/f3/t1:DATE >= 19991231
    task repeat_date2
      # Using plus/minus on a repeat DATE will use date arithmetic
      # Since the starting value of DATE is 19991230, this task will run straight away
      trigger /suite/f1/f2/f3/t1:DATE - 1 == 19991229
  endfamily
endsuite
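The integer conversions used in the example above can be sketched in Python. These helpers are hypothetical and only model the rules listed earlier (a variable falls back to 0, an event is 0 or 1, a repeat string uses its index); note that Python's `int()` is slightly more permissive than a strict numeric cast.

```python
def variable_as_int(value):
    """Convert a variable value to an integer for trigger evaluation, falling back to 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def event_as_int(is_set):
    """An event evaluates to 1 when set and 0 when clear."""
    return 1 if is_set else 0


def repeat_string_as_int(values, current):
    """A repeat string evaluates to the index of its current value."""
    return values.index(current)


print(variable_as_int("12"))                # 12
print(variable_as_int("captain scarlett"))  # 0
print(repeat_string_as_int(["a", "b", "c", "d", "e", "f"], "e"))  # 4
```

With these rules, `VAR_STRING == 0` holds for "captain scarlett", and `NAME >= 4` becomes true once the repeat reaches "e".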
What happens when we have multiple node attributes of the same name, referenced in trigger expressions?
task foo
  event blah
  meter blah 0 200 50
  edit blah 10
task bar
  trigger foo:blah >= 0
In this case ecFlow will use the following precedence:
Hence in the example above the expression foo:blah >= 0 will reference the event.
- unknown
-
This is the default node status when a suite definition is loaded into the ecflow_server
- user command
User commands are any client to server requests that are not child commands.
- variable
ecFlow makes heavy use of variables. There are several kinds:
Environment variables: these are set in the UNIX shell before ecFlow starts. They control ecflow_server and ecflow_client.
Suite definition variables: also referred to as user variables. These control ecflow_server and ecflow_client, and are available for use in the job file.
Generated variables: these are generated within the suite definition node tree during job creation and are available for use in the job file.
Variables can be referenced in trigger and complete expressions. The value part of the variable should be convertible to an integer, otherwise a default value of 0 is used.
- variable inheritance
When a variable is needed at job creation time, it is first sought in the task itself.
If it is not found in the task, it is sought from the task’s parent and so on, up through the node levels until found.
For any node, there are two places to look for variables.
Suite definition variables are looked for first, and then any generated variables.
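The lookup order described above can be sketched with a tiny node tree in Python. The `Node` class here is a hypothetical stand-in and is not the ecflow.Node class; it only models the search from the task up through its parents, checking suite definition (user) variables before generated variables at each level.

```python
class Node:
    """Hypothetical stand-in for a suite, family or task (not the ecflow.Node class)."""

    def __init__(self, name, parent=None, user_vars=None, generated_vars=None):
        self.name = name
        self.parent = parent
        self.user_vars = user_vars or {}
        self.generated_vars = generated_vars or {}

    def find_variable(self, var_name):
        """Search this node, then each parent in turn; user variables come first."""
        node = self
        while node is not None:
            if var_name in node.user_vars:
                return node.user_vars[var_name]
            if var_name in node.generated_vars:
                return node.generated_vars[var_name]
            node = node.parent
        return None


suite = Node("s1", user_vars={"SLEEP": "20"}, generated_vars={"SUITE": "s1"})
family = Node("f1", parent=suite, user_vars={"SLEEP": "5"})
task = Node("t1", parent=family)

print(task.find_variable("SLEEP"))  # 5, the nearest definition wins
print(task.find_variable("SUITE"))  # s1
print(task.find_variable("NOPE"))   # None
```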
- variable substitution
Takes place during pre-processing or command invocation (i.e. ECF_JOB_CMD, ECF_KILL_CMD, etc.).
It involves searching each line of the ecf script file or command for the ECF_MICRO character, typically ‘%’.
The text between two % characters defines a variable, i.e. %VAR%
This variable is searched for in the suite definition.
First the suite definition variables (sometimes referred to as user variables) are searched, then the repeat variable name, and finally the generated variables. If no variable is found then the same search pattern is repeated up the node tree.
The variable reference, including the enclosing % characters, is replaced by the value of the variable.
If the micro characters are not paired, an error message is written to the log file, and the task is placed into the aborted state.
If the variable is not found in the suite definition during pre-processing then job creation fails, and an error message is written to the log file, and the task is placed into the aborted state. To avoid this, variables in the ecf script can be defined as:
%VAR:replacement%
This is similar to %VAR% but if VAR is not found in the suite definition then ‘replacement’ is used.
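A minimal Python sketch of the substitution rules above, covering %VAR%, %VAR:replacement% and the unpaired micro character check. The `substitute` function is hypothetical: it works on an already resolved dictionary of variables, raises exceptions where the text says job creation fails, and does not handle pre-processing directives such as %include or %manual.

```python
import re


def substitute(line, variables, micro="%"):
    """Replace %VAR% and %VAR:replacement% references in a single line."""
    if line.count(micro) % 2 != 0:
        raise ValueError("micro characters are not paired: " + line)

    m = re.escape(micro)
    pattern = re.compile(f"{m}([^{m}:]+)(?::([^{m}]*))?{m}")

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in variables:
            return str(variables[name])
        if default is not None:
            return default  # the 'replacement' part of %VAR:replacement%
        raise KeyError(f"variable {name} not found")

    return pattern.sub(replace, line)


print(substitute("sleep %SLEEP%", {"SLEEP": 2}))    # sleep 2
print(substitute("host=%HOST:localhost%", {}))      # host=localhost
```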
- virtual clock
Like a real clock, except that when the ecflow_server is suspended (i.e. shutdown or halted), the suite's clock is also suspended.
Hence it will honour relative times in cron, today and time dependencies. It is possible to have a combination of hybrid/real and virtual.
It is more useful when we want complete adherence to time-related dependencies at the expense of being out of sync with system time.
- zombie
Zombies are running jobs that fail authentication when communicating with the ecflow_server.
Child commands such as init, event, meter, label, abort and complete are placed in the ecf script file and are used to communicate with the ecflow_server.
The ecflow_server authenticates each connection attempt made by the child command. Authentication can fail for a number of reasons:
The password (ECF_PASS) supplied with the child command does not match the one in the ecflow_server
The path name (ECF_NAME) supplied with the child command does not locate a task in the ecflow_server
The process id (ECF_RID) supplied with the child command does not correspond with the one stored in the ecflow_server
The task is already active, but receives another init child command
The task is already complete, but receives another child command
The task is already aborted, but receives another child command
When authentication fails, the job is considered to be a zombie. The ecflow_server will keep a note of the zombie for a period of time before it is automatically removed. However, the removed zombie may well re-appear, because each child command will continue attempting to contact the ecflow_server for 24 hours. This is configurable, see ECF_TIMEOUT on ecflow_client.
See also:
There are several types of zombies, see zombie type and ecflow.ZombieType
- zombie attribute
The zombie attribute defines how a zombie should be handled in an automated fashion. Very careful consideration should be taken before this attribute is added, as it may hide a genuine problem. It can be added to any node, but is best defined at the suite or family level. If there is no zombie attribute, the default behaviour is to block the child command.
To add a zombie attribute in python, please see: ecflow.ZombieAttr
- zombie type
See zombie and class ecflow.ZombieAttr for further information.
How do zombies arise:
Server crashed (or terminated and restarted) and the recovered check point file is out of date.
A task is repeatedly re-run; earlier copies will not be remembered.
Job sent by another ecflow_server, but which cannot talk to the original ecflow_server
Network glitches/network down
Errors in script, e.g. multiple calls to init, complete
Errors in job submission, e.g. job submitted twice.
There are several types of zombies:
path:
The task path cannot be found in the server, because the node tree was deleted, replaced or reloaded, the server crashed, or the backup server does not have the node tree.
Jobs could have been created via server scheduling or by user commands.
user: Job is created by user commands like rerun or re-queue. User zombies are differentiated from server (scheduled) zombies since they are automatically created when the force option is used and we have tasks in an active or submitted state.
ecf: Jobs are created as part of the normal scheduling
Two init commands, or the task is complete or aborted but receives another child command
Server crashed (or terminated and restarted) and the recovered check point file is out of date.
A task is repeatedly re-run; earlier copies will not be remembered.
Job sent by another ecflow_server, but which cannot talk to the original ecflow_server
Network glitches/network down
ecf_pid: PID mismatch. Job scheduled twice. Check the submitter.
ecf_passwd: Password mismatch, but PID matches. Has the system recycled the PID, or has the job file been hacked?
ecf_pid_passwd: Both PID and password mismatch. Re-queue and submit of an active job?
The type of the zombie is not fixed and may change.
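As a rough illustration of how the authentication checks map onto these types, the following Python sketch classifies an incoming child command against the server's record of a task. It is a simplified model and not the ecflow_server logic: TaskRecord, ChildCommand and classify are hypothetical names, the order of the checks is an assumption, and the user type is left out because it depends on how the job was created rather than on the command itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TaskRecord:
    """What the (hypothetical) server remembers about a task."""
    state: str            # "submitted", "active", "complete" or "aborted"
    password: str         # expected ECF_PASS
    rid: Optional[str]    # stored ECF_RID, None until a job has reported one


@dataclass
class ChildCommand:
    kind: str      # init, event, meter, label, abort or complete
    path: str      # ECF_NAME
    password: str  # ECF_PASS
    rid: str       # ECF_RID


def classify(cmd: ChildCommand, tasks: Dict[str, TaskRecord]) -> Optional[str]:
    """Return a zombie type name, or None if the command is accepted."""
    task = tasks.get(cmd.path)
    if task is None:
        return "path"  # the path does not locate a task
    pid_ok = task.rid is None or cmd.rid == task.rid
    passwd_ok = cmd.password == task.password
    if not pid_ok and not passwd_ok:
        return "ecf_pid_passwd"
    if not pid_ok:
        return "ecf_pid"
    if not passwd_ok:
        return "ecf_passwd"
    # Credentials match, but the command is unexpected for the task state.
    if task.state == "active" and cmd.kind == "init":
        return "ecf"
    if task.state in ("complete", "aborted"):
        return "ecf"
    return None


# Made-up example data.
tasks = {"/s1/f1/t1": TaskRecord(state="active", password="pw1", rid="100")}

print(classify(ChildCommand("meter", "/s1/f1/t1", "pw1", "100"), tasks))
print(classify(ChildCommand("init", "/s1/f1/t1", "pw1", "100"), tasks))
print(classify(ChildCommand("meter", "/s1/gone", "pw1", "100"), tasks))
```

Because the result depends on the server's current record, the same job can be classified differently over time, which is consistent with the type of a zombie not being fixed.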