3. Job Execution Environment
3.1. Job Services
The following job services are available in the ABCI System.
|Service name||Description||Service charge coefficient||Job style|
|Spot||Job service of batch execution||1.0||Batch|
|On-demand||Job service of interactive execution||1.0||Interactive|
|Reserved||Job service of reservation||1.5||Batch/Interactive|
For the Spot and On-demand services, ABCI points are calculated from the elapsed-time limit when a job starts, and that amount is deducted in advance. When the job finishes, the points are recalculated from the actual elapsed time and the difference is refunded. For the Reserved service, ABCI points are calculated from the reservation period and deducted when the reservation is completed. No refund is made unless the reservation is cancelled.
3.2. Job Executing Resource
The ABCI System allocates system resources to jobs by resource type, which is a logical partition of a compute node. To submit or execute a job, specify one of the following resource type names.
|Resource type||Resource type name||Description||Assigned physical CPU core||Number of assigned GPU||Memory (GiB)||Local storage (GB)||Resource type charge coefficient|
To execute a job on multiple nodes, specify the node-exclusive resource type rt_F.
In a node-sharing job, the job's process information is visible to other jobs running on the same node. To hide your job's process information, specify resource type rt_F and execute a node-exclusive job.
The available resource types and number of nodes for each service are as follows.
|Service||Resource type name||Number of nodes|
|Reserved||rt_F||1–number of reserved nodes|
The elapsed-time limits of a job for each service are as follows.
|Service||Resource type name||Limit of elapsed time (upper limit/default)|
|rt_G.large, rt_C.large, rt_M.large, rt_M.small||72:00:00/1:00:00|
|rt_G.large, rt_C.large, rt_M.large||12:00:00/1:00:00|
|rt_G.small, rt_C.small, rt_M.small||12:00:00/1:00:00|
However, if multiple nodes are used in the Spot or On-demand service, a job cannot exceed the node-hour limit below.
|Service||max value of node-hour|
|Spot||2304 nodes · hours|
|On-demand||12 nodes · hours|
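The node-hour limit can be checked before submission with a small calculation. The following is a minimal sketch: the function name within_node_hours is hypothetical, and it assumes the limit is evaluated as the number of nodes multiplied by the requested elapsed time.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if NODES x requested time is within LIMIT node-hours.
# Assumption: the limit applies to nodes x requested elapsed time.
# Usage: within_node_hours NODES H_RT LIMIT   (H_RT in [HH:MM:]SS form)
within_node_hours() {
    awk -v n="$1" -v hrt="$2" -v limit="$3" 'BEGIN {
        k = split(hrt, f, ":")
        # Convert [HH:MM:]SS to seconds.
        sec = 0
        for (i = 1; i <= k; i++) sec = sec * 60 + f[i]
        exit !(n * sec / 3600 <= limit)
    }'
}

# Spot limit from the table above: 2304 node-hours.
within_node_hours 32 72:00:00 2304 && echo "32 nodes x 72 h: within limit"
within_node_hours 64 72:00:00 2304 || echo "64 nodes x 72 h: over limit"
```

With the Spot limit, 32 nodes for 72 hours is exactly 2304 node-hours and passes, while 64 nodes for the same time does not.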
The limits on job submission and execution are as follows.
|The maximum number of tasks within an array job||75000|
|The maximum number of any user's unfinished jobs at the same time||1000|
|The maximum number of any user's running jobs at the same time||200|
In the Spot service, a job can be executed with higher priority by specifying a POSIX priority.
|Service||Description||POSIX priority||POSIX priority coefficient|
ABCI points for the Spot and On-demand services are calculated as follows.
ABCI point = Service charge coefficient × Resource type charge coefficient × Number of resource type × POSIX priority charge coefficient × max(Elapsed time[sec], Minimum Elapsed time[sec]) / 3600
- Digits from the fifth decimal place onward are rounded off.
- If the elapsed time of a job is shorter than the minimum elapsed time, ABCI points are calculated from the minimum elapsed time.
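As a rough illustration, the formula above can be evaluated with awk. This is only a sketch: the function name abci_point and its argument order are hypothetical, the minimum elapsed time is passed as a parameter because its value is not given here, and the result is simply printed to four decimal places, which may differ from how the system handles the fifth decimal place.

```shell
#!/bin/bash
# Hypothetical helper: evaluates the Spot/On-demand ABCI point formula.
# Usage: abci_point SERVICE_COEF RESOURCE_COEF NUM_RESOURCES PRIORITY_COEF ELAPSED_SEC MIN_ELAPSED_SEC
abci_point() {
    awk -v s="$1" -v r="$2" -v n="$3" -v p="$4" -v t="$5" -v m="$6" 'BEGIN {
        # Charge for at least the minimum elapsed time.
        if (t < m) t = m
        # Assumption: four-decimal formatting stands in for the rounding rule.
        printf "%.4f\n", s * r * n * p * t / 3600
    }'
}

# 2 resources for 1800 s, all coefficients 1.0, no minimum applied.
abci_point 1.0 1.0 2 1.0 1800 0   # prints 1.0000
```

Calling the function once with the elapsed-time limit and once with the actual elapsed time gives the amount deducted at job start and the amount finally charged. Their difference is the refund.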
ABCI points for the Reserved service are calculated as follows.
ABCI point = Service charge coefficient × number of reserved nodes × number of reserved days × 24
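For example, the Reserved formula can be worked through as below. The function name reserved_point is hypothetical. The coefficient 1.5 is the Reserved service charge coefficient from the table in 3.1.

```shell
#!/bin/bash
# Hypothetical helper: evaluates the Reserved service ABCI point formula.
# Usage: reserved_point SERVICE_COEF NODES DAYS
reserved_point() {
    awk -v s="$1" -v n="$2" -v d="$3" 'BEGIN { print s * n * d * 24 }'
}

# 4 nodes for 7 days at the Reserved coefficient 1.5
reserved_point 1.5 4 7   # prints 1008
```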
3.3. Job Executing Options
To execute a job in batch mode, use the qsub command.
To execute a job in interactive mode, use the qrsh command.
The major options of the qsub and qrsh commands are as follows.
|-g group||Specify ABCI user group|
|-l resource_type=number||Specify resource type (mandatory)|
|-l h_rt=[HH:MM:]SS||Specify elapsed time limit in [HH:MM:]SS format. If the job's execution time exceeds the specified time, the job is terminated.|
|-N name||Specify job name. Default is the name of the job script.|
|-o stdout_name||Specify standard output stream of job|
|-p priority||Specify POSIX priority for Spot service|
|-e stderr_name||Specify standard error stream of job|
|-j y||Merge standard error stream into standard output stream|
|-m a||Mail is sent when job is aborted|
|-m b||Mail is sent when job is started|
|-m e||Mail is sent when job is finished|
|-t start[-end[:step]]||Specify task IDs of array job. Suboption is start_number[-end_number[:step_size]]|
|-hold_jid job_id||Specify the ID of a job that this job depends on. The submitted job is not executed until that job has finished.|
|-ar ar_id||Specify reservation ID (AR-ID) when using reserved compute nodes|
3.4. Interactive Jobs
To execute an interactive job, use the qrsh command.
If ABCI points are insufficient when an interactive job is executed, execution fails.
$ qrsh -g group -l resource_type=number [option]
Example) Executing an Interactive job
[username@es1 ~]$ qrsh -g grpname -l rt_F=1
[username@g0001 ~]$
To execute an application that uses the X Window System, log in with the X forwarding option (-X or -Y) as follows.
[yourpc ~]$ ssh -XC -p 10022 -l username localhost
Then execute an interactive job, specifying the options -pty yes -display $DISPLAY -v TERM /bin/bash.
[username@es1 ~]$ qrsh -g grpname -l rt_F=1 -pty yes -display $DISPLAY -v TERM /bin/bash
[username@g0001 ~]$ xterm <- execute X application
3.5. Batch Jobs
To run a batch job on the ABCI System, you need to prepare a job script in addition to the program to execute. The job script contains job execution options, such as resource type and elapsed-time limit, followed by the sequence of commands to execute.
#!/bin/bash
#$ -l rt_F=1
#$ -l h_rt=1:23:45
#$ -j y
#$ -cwd
[Initialization of Environment Modules]
[Setting of Environment Modules]
[Executing program]
Example) Sample job script executing program with CUDA
#!/bin/bash
#$-l rt_F=1
#$-j y
#$-cwd
source /etc/profile.d/modules.sh
module load cuda/9.2/220.127.116.11
./a.out
3.5.1. Submit a batch job
To submit a batch job, use the qsub command.
If ABCI points are insufficient when a batch job is submitted, submission fails.
$ qsub -g group [option] job_script
Example) Submitting job script run.sh as a batch job
[username@es1 ~]$ qsub -g grpname run.sh
Your job 12345 ("run.sh") has been submitted
The -g option cannot be specified in a job script.
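When scripting submissions, the job ID in the confirmation message is often needed later, for example with -hold_jid or qdel. The sketch below extracts it with sed. The function name job_id_from_qsub and the script post.sh are hypothetical, and only the message format shown in the example above is handled.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from qsub's confirmation message (read on stdin).
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\).*/\1/p'
}

echo 'Your job 12345 ("run.sh") has been submitted' | job_id_from_qsub   # prints 12345

# On the ABCI System this could chain two jobs (not run here; post.sh is hypothetical):
#   first=$(qsub -g grpname run.sh | job_id_from_qsub)
#   qsub -g grpname -hold_jid "$first" post.sh
```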
3.5.2. Show the status of batch jobs
To show the current status of batch jobs, use the qstat command.
$ qstat [option]
The major options of the qstat command are as follows.
|-r||Display resource information about job|
|-j||Display additional information about job|
[username@es1 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
     12345 0.25586 run.sh     username     r     06/27/2018 21:14:49 gpu@g0001                                                        80
|state||Job status (r: running, qw: waiting, d: delete, E: error)|
|submit/start at||Job submission/start time|
|jclass||Job class name|
|slots||Number of job slots (number of nodes x 80)|
|ja-task-ID||Task ID of array job|
3.5.3. Delete a batch job
To delete a batch job, use the qdel command.
$ qdel job_ID
Example) Delete a batch job
[username@es1 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
     12345 0.25586 run.sh     username     r     06/27/2018 21:14:49 gpu@g0001                                                        80
[username@es1 ~]$ qdel 12345
username has registered the job 12345 for deletion
3.5.4. Stdout and Stderr of Batch Jobs
The standard output file and standard error output file are written to the job execution directory, or to the files specified at job submission. Standard output generated during job execution is written to the standard output file, and error messages generated during job execution are written to the standard error output file. If no standard output and standard error output files are specified at job submission, the following files are generated.
- JOB_NAME.oJOB_ID --- Standard output file
- JOB_NAME.eJOB_ID --- Standard error output file
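The naming pattern above can be sketched as a small shell function. The function name default_output_files is hypothetical. The example uses the job name run.sh, since the job name defaults to the name of the job script.

```shell
#!/bin/bash
# Hypothetical helper: print the default output file names for a job.
# Usage: default_output_files JOB_NAME JOB_ID
default_output_files() {
    local job_name="$1" job_id="$2"
    echo "${job_name}.o${job_id}"   # standard output file
    echo "${job_name}.e${job_id}"   # standard error output file
}

default_output_files run.sh 12345
# prints:
#   run.sh.o12345
#   run.sh.e12345
```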
3.5.5. Report batch job accounting
To report batch job accounting, use the qacct command.
$ qacct [options]
The major options of the qacct command are as follows.
|-g group||Display accounting information of jobs owned by group|
|-j job_id||Display accounting information of job_id|
|-t n[-m[:s]]||Specify task ID of array job. Suboption is start_number[-end_number[:step_size]]. Only available with the -j option.|
Example) Report batch job accounting
[username@es1 ~]$ qacct -j 12345
==============================================================
qname        gpu
hostname     g0001
group        group
owner        username
project      group
department   group
jobname      run.sh
jobnumber    12345
taskid       undefined
account      username
priority     0
cwd          NONE
submit_host  es1.abci.local
submit_cmd   /bb/system/uge/latest/bin/lx-amd64/qsub -P username -l h_rt=600 -l rt_F=1
qsub_time    07/01/2018 11:55:14.706
start_time   07/01/2018 11:55:18.170
end_time     07/01/2018 11:55:18.190
granted_pe   perack17
slots        80
failed       0
deleted_by   NONE
exit_status  0
ru_wallclock 0.020
ru_utime     0.010
ru_stime     0.013
ru_maxrss    6480
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    1407
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   8
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     13
ru_nivcsw    1
wallclock    3.768
cpu          0.022
mem          0.000
io           0.000
iow          0.000
ioops        0
maxvmem      0.000
maxrss       0.000
maxpss       0.000
arid         undefined
jc_name      NONE
The major fields of accounting information are as follows.
For more details, see man sge_accounting.
|taskid||Task ID of array job|
|qsub_time||Job submission time|
|start_time||Job start time|
|end_time||Job end time|
|failed||Job end code managed by job scheduler|
|exit_status||Job end status|
|wallclock||Job running time (including pre/post process)|
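A single field can be pulled out of the accounting report with awk, which is convenient in scripts. The following is a sketch only: the function name qacct_field is hypothetical, and it assumes the "name value" line layout shown in the example output.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one field from `qacct -j` style output on stdin.
# Only the first matching line is reported.
qacct_field() {
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^ +/, ""); print; exit }'
}

# Sample lines in the same layout as the report above.
printf 'jobname      run.sh\nexit_status  0\nwallclock    3.768\n' | qacct_field wallclock   # prints 3.768

# On the ABCI System (not run here):
#   qacct -j 12345 | qacct_field exit_status
```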
3.5.6. Environment Variables
During job execution, the following environment variables are available for the executing job script/binary.
|ENVIRONMENT||Univa Grid Engine fills in BATCH to identify it as a Univa Grid Engine job submitted with qsub.|
|JOB_NAME||Name of the Univa Grid Engine job.|
|JOB_SCRIPT||Name of the script that is currently being executed|
|NHOSTS||The number of hosts on which this parallel job is executed|
|PE_HOSTFILE||The absolute path of a file that lists hosts, slots and queue names|
|RESTARTED||Indicates if the job was restarted (1) or if it is the first run (0)|
|SGE_JOB_HOSTLIST||The absolute path of a file that lists only the hosts assigned by Univa Grid Engine|
|SGE_LOCALDIR||The local storage path assigned by Univa Grid Engine|
|SGE_O_WORKDIR||The working directory path of the job submitter|
|SGE_TASK_ID||Task number of the array job task the job represents (if the job is not an array task, the variable contains undefined)|
|SGE_TASK_FIRST||Task number of the first array job task|
|SGE_TASK_LAST||Task number of the last array job task|
|SGE_TASK_STEPSIZE||Step size of the array job|
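The array job variables are typically used to select a different input for each task. The job script below is a minimal sketch: the helper input_for_task and the file naming scheme are hypothetical. The fallbacks exist only so that the script also runs outside the scheduler.

```shell
#!/bin/bash
#$ -l rt_F=1
#$ -j y
#$ -cwd
#$ -t 1-10:3

# Hypothetical helper: map an array task ID to an input file name.
input_for_task() {
    printf 'input_%03d.dat\n' "$1"
}

# Outside an array job SGE_TASK_ID is "undefined" (or unset), so fall back to task 1.
task_id="${SGE_TASK_ID:-1}"
[ "$task_id" = "undefined" ] && task_id=1

echo "task $task_id of ${SGE_TASK_FIRST:-1}-${SGE_TASK_LAST:-1} step ${SGE_TASK_STEPSIZE:-1}"
echo "processing $(input_for_task "$task_id")"
```

With -t 1-10:3 the scheduler would start tasks 1, 4, 7 and 10, each seeing its own SGE_TASK_ID.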
3.6. Advance Reservation
In the Reserved service, job execution can be scheduled by reserving compute nodes in advance.
|Minimum reservation days||1 day|
|Maximum reservation days||30 days|
|Maximum number of nodes that can be reserved at once per system||442 nodes|
|Maximum reserved nodes per reservation||32|
|Maximum reserved node time per reservation||12,288 nodes x hours|
|Start of reservation acceptance||10:00 a.m. 30 days before the reservation start day|
|Close of reservation acceptance||9:00 a.m. on the day before the reservation start day|
|Deadline for cancelling a reservation||9:00 a.m. on the day before the reservation start day|
|Reservation start time||10:00 a.m. on the reservation start day|
|Reservation end time||9:30 a.m. on the reservation end day|
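A reservation request can be checked against the per-reservation limits above before it is made. The sketch below makes two assumptions: the function name reservation_ok is hypothetical, and node time is taken as nodes x days x 24. It does not check how many nodes are still free on the system.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if a request fits the per-reservation limits above.
# Assumption: node time is counted as NODES x DAYS x 24 hours.
# Usage: reservation_ok NODES DAYS
reservation_ok() {
    local nodes="$1" days="$2"
    (( days >= 1 && days <= 30 )) || return 1       # reservation days
    (( nodes >= 1 && nodes <= 32 )) || return 1     # nodes per reservation
    (( nodes * days * 24 <= 12288 )) || return 1    # node time per reservation
    return 0
}

reservation_ok 4 7   && echo "4 nodes x 7 days: ok"
reservation_ok 32 30 || echo "32 nodes x 30 days: exceeds a limit"
```

Under this assumption, 32 nodes can be reserved for at most 16 days, because 32 x 16 x 24 is 12,288.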
3.6.1. Make a reservation
Only a responsible person or a manager is permitted to reserve compute nodes.
To reserve compute nodes, use the qrsub command or the ABCI User Portal.
$ qrsub options
|-a YYYYMMDD||Specify reservation start date (format: YYYYMMDD)|
|-d days||Specify number of reservation days. Mutually exclusive with the -e option|
|-e YYYYMMDD||Specify reservation end date (format: YYYYMMDD). Mutually exclusive with the -d option|
|-g group||Specify ABCI user group|
|-N name||Specify reservation name. The name may contain only the characters "A-Za-z0-9_" and its maximum length is 64|
|-n nnode||Specify the number of nodes.|
Example) Reserve 4 compute nodes for 1 week (7 days) starting 2018/07/05
[username@es1 ~]$ qrsub -a 20180705 -d 7 -g gxa50001 -n 4 -N "Reserve_for_AI"
Your advance reservation 12345 has been granted
ABCI points are consumed when the reservation is completed.
3.6.2. Show the status of reservations
To show the current status of reservations, use the qrstat command or the ABCI User Portal.
[username@es1 ~]$ qrstat
ar-id      name       owner        state start at             end at               duration    sr
----------------------------------------------------------------------------------------------------
     12345 Reserve_fo root         w     07/05/2018 10:00:00  07/12/2018 09:30:00  167:30:00   false
|ar-id||Reserve ID (AR-ID)|
|state||Status of reservation|
|start at||Reservation start date (start time is always 10:00 a.m.)|
|end at||Reservation end date (end time is always 9:30 a.m.)|
|duration||Reservation term (hhh:mm:ss)|
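Because a reservation always starts at 10:00 a.m. and ends at 9:30 a.m., its duration is the number of reserved days times 24 hours, minus 30 minutes. A small sketch of that arithmetic follows. The function name reservation_duration is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: duration (hhh:mm:ss) of a reservation spanning DAYS days,
# from 10:00 on the start day to 09:30 on the end day.
reservation_duration() {
    local days="$1"
    local minutes=$(( days * 24 * 60 - 30 ))
    printf '%d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

reservation_duration 7   # prints 167:30:00
```

The 7-day case matches the duration column in the qrstat example above.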
To show the number of nodes that can be reserved, access the ABCI User Portal, or use the qrstat command with the --available option.
[username@es1 ~]$ qrstat --available
06/27/2018 441
07/05/2018 432
07/06/2018 434
Days with no reservations are not printed.
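The availability listing is easy to filter in a script. The sketch below reads lines in the "date count" layout shown above and prints the dates on which fewer than a requested number of nodes can be reserved. The function name dates_short_of is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "date count" lines on stdin and print the dates
# on which fewer than NEEDED nodes can be reserved.
# Usage: dates_short_of NEEDED
dates_short_of() {
    awk -v need="$1" '$2 + 0 < need { print $1 }'
}

dates_short_of 433 <<'EOF'
06/27/2018 441
07/05/2018 432
07/06/2018 434
EOF
# prints 07/05/2018

# On the ABCI System (not run here):
#   qrstat --available | dates_short_of 16
```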
3.6.3. Cancel a reservation
Only a responsible person or a manager is permitted to cancel a reservation.
To cancel a reservation, use the qrdel command or the ABCI User Portal.
Example) Cancel a reservation
[username@es1 ~]$ qrdel 12345