Skip to content

3. Job Execution Environment

3.1. Job Services

The following job services are available in the ABCI System.

Service name Description Service charge coefficient Job style
Spot Job service of batch execution 1.0 Batch
On-demand Job service of interactive execution 1.0 Interactive
Reserved Job service of reservation 1.5 Batch/Interactive

In the case of Spot service and On-demand service, when starting a job, the ABCI point scheduled for job is calculated by limited value of elapsed time, and subtract processing is executed. When a job finishes, the ABCI point is calculated again by actual elapsed time, and repayment process is executed. In the case of Reserved service, when completing a reservation, the ABCI point is calculated by a period of reservation, end subtract processing is executed. The repayment process is not executed unless reservation is cancelled.

3.2. Job Executing Resource

The ABCI System allocates system resources to jobs using resource type that means logical partition of compute nodes. To submit or execute a job, specify the following resource type name.

Resource type Resource type name Description Assigned physical CPU core Number of assigned GPU Memory (GiB) Local storage (GB) Resource type charge coefficient
Full rt_F node-exclusive 40 4 360 1440 1.00
G.large rt_G.large node-sharing
with GPU
20 4 240 720 0.90
G.small rt_G.small node-sharing
with GPU
5 1 60 180 0.30
C.large rt_C.large node-sharing
CPU only
20 0 120 720 0.60
C.small rt_C.small node-sharing
CPU only
5 0 30 180 0.20

When you execute a job using multiple nodes, you need to specify resource type rt_F for node-exclusive.

Warning

On node-sharing job, the job process information can be seen from other jobs executed on the same nodes. If you want to hide your job process information, specify resource type rt_F and execute a node-exclusive job.

The available resource type and number for each service are as follows.

Service Resource type name Range of number
Spot rt_F 1-512
rt_G.large 1
rt_G.small 1
rt_C.large 1
rt_C.small 1
On-demand rt_F 1-32
rt_G.large 1
rt_G.small 1
rt_C.large 1
rt_C.small 1
Reserved rt_F 1-number of reserved nodes
rt_G.large 1
rt_G.small 1
rt_C.large 1
rt_C.small 1

The job limit of elapsed time for each service are as follows.

Service Resource type name Limit of elapsed time (upper limit/default)
Spot rt_F 72:00:00/1:00:00
rt_G.large, rt_C.large 72:00:00/1:00:00
rt_G.small, rt_C.small 168:00:00/1:00:00
On-demand rt_F 12:00:00/1:00:00
rt_G.large, rt_C.large 12:00:00/1:00:00
rt_G.small, rt_C.small 12:00:00/1:00:00
Reserved rt_F unlimited
rt_G.large, rt_C.large unlimited
rt_G.small, rt_C.small unlimited

In case of Spot or On-demand service, the job cannot be executed over the node-hour restriction bellow.

Service max value of node-hour
Spot 2304 nodes · hours
On-demand 12 nodes · hours

In the case of Spot service, the job can be executed with priority, by specifying the POSIX priority.

Service Description POSIX priority POSIX priority coefficient
Spot default -500 1.0
priority -400 1.5

The calculation formula of ABCI point for using Spot service and On-demand services is as follows.

ABCI point = Service charge coefficient
           × Resource type charge coefficient
           × Number of resource type
           × POSIX priority charge coefficient
           × max(Elapsed time[sec], Minimum Elapsed time[sec]) / 3600

Note

  • The five and under decimal places is rounding off.
  • If elapsed time of job executing is less than minimum elapsed time, ABCI point calculated based on minimum elapsed time.

The calculation formula of ABCI point for using Reserved service is follows.

ABCI point = Service charge coefficient
          × number of reserved nodes
          × number of reserved days × 24

3.3. Job Executing Options

To execute a job in batch mode, use the qsub command. To execute a job in interactive mode, use the qrsh command.

The major options of the qsub and the qrsh command are follows.

Option Description
-g group Specify ABCI user group
-l resource_type=number Specify resource type (mandatory)
-l h_rt=[HH:MM:]SS Specify elapsed time by [HH:MM:]SS. When execution time of job exceed specified time, job is rejected.
-N name Specify job name. default is name of job script.
-o stdout_name Specify standard output stream of job
-p priority Specify POSIX priority for Spot service
-e stderr_name Specify standard error stream of job
-j y Specify standard error stream is merged into standard output stream
-m a Mail is sent when job is aborted
-m b Mail is sent when job is started
-m e Mail is sent when job is finished
-t start[-end[:step]] Specify task ID of array job. suboption is start_number[-end_number[:step_size]]
-hold_jid job_id Specify job ID having dependency. the submitted job is not executed until dependent job finished.
-ar ar_id Specify reserved ID (AR-ID), when using reserved compute node

3.4. Interactive Jobs

To execute an interactive job, use the qrsh command. If ABCI point is insufficient when executing interactive job, execution is failed.

$ qrsh -g group -l resource_type=number [option]

Example) Executing an Interactive job

[username@es1 ~]$ qrsh -g grpname -l rt_F=1
[username@g0001 ~]$ 

To execute an application using X-Window, you need to login with the X forwading option (-X or -Y option) as follows.

[yourpc ~]$ ssh -XC -p 10022 -l username localhost

To execute an interactive job, specify the -pty yes -display $DISPLAY -v TERM /bin/bash.

[username@es1 ~]$ qrsh -g grpname -l rt_F=1 -pty yes -display $DISPLAY -v TERM /bin/bash
[username@g0001 ~]$ xterm <- execute X application

3.5. Batch Jobs

To run a batch job on the ABCI System, you need to make a job script in addition to execution program. The job script is described job execute option, such as resource type, elapsed time limit, etc., and executing command sequence.

#!/bin/bash

#$ -l rt_F=1
#$ -l h_rt=1:23:45
#$ -j y
#$ -cwd

[Initialization of Environment Modules]
[Setting of Environment Modules]
[Executing program]

Example) Sample job script executing program with CUDA

#!/bin/bash

#$-l rt_F=1
#$-j y
#$-cwd

source /etc/profile.d/modules.sh
module load cuda/9.2/9.2.88.1
./a.out

3.5.1. Submit a batch job

To submit a batch job, use the qsub command. If ABCI point is insufficient when submitting batch job, submission is failed.

$ qsub -g group [option] job_script

Example) Submission job script run.sh as a batch job

[username@es1 ~]$ qsub -g grpname run.sh
Your job 12345 ("run.sh") has been submitted

Warning

The -g option cannot specify in job script.

3.5.2. Show the status of batch jobs

To show the current status of batch jobs, use the qstat command.

$ qstat [option]

The major options of the qstat command are follows.

Option Description
-r Display resource information about job
-j Display additional information about job

Example)

[username@es1 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
     12345 0.25586 run.sh     username     r     06/27/2018 21:14:49 gpu@g0001                                                        80
Field Description
job-ID Job ID
prior Job priority
name Job name
user Job owner
state Job status (r: running, qw: waiting, d: delete, E: error)
submit/start at Job submission/start time
queue Queue name
jclass Job class name
slots Number of job slot (number of node x 80)
ja-task-ID Task ID of array job

3.5.3. Delete a batch job

To delete a batch job, use the qdel command.

$ qdel job_ID

Example) Delete a batch job

[username@es1 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
     12345 0.25586 run.sh     username     r     06/27/2018 21:14:49 gpu@g0001                                                        80
[username@es1 ~]$ qdel 12345
username has registered the job 12345 for deletion

3.5.4. Stdout and Stderr of Batch Jobs

Standard output file and standard error output file are written to job submission directory, or to files specified at job submission. Standard output generated during a job execution is written to a standard output file and error messages generated during the job execution to a standard error output file if no standard output and standard err output files are specified at job submission, the following files are generated for output.

  • Jobname.o.JobID --- Standard output file
  • Jobname.s.JobID --- Standard error output file

3.5.5. Report batch job accounting

To report batch job accounting, use the qacct command.

$ qacct [options]

The major options of the qacct command are follows.

Option Description
-g group Display accounting information of jobs owend by group
-j job_id Display accounting information of job_id
-t n[-m[:s]] Specify task ID of array job. Suboption is start_number[-end_number[:step_size]]. Only available with the -j option.

Example) Report batch job accounting

[username@es1 ~]$ qacct -j 12345
==============================================================
qname        gpu
hostname     g0001
group        group 
owner        username
project      group 
department   group
jobname      run.sh
jobnumber    12345
taskid       undefined
account      username
priority     0
cwd          NONE
submit_host  es1.abci.local
submit_cmd   /bb/system/uge/latest/bin/lx-amd64/qsub -P username -l h_rt=600 -l rt_F=1
qsub_time    07/01/2018 11:55:14.706
start_time   07/01/2018 11:55:18.170
end_time     07/01/2018 11:55:18.190
granted_pe   perack17
slots        80
failed       0
deleted_by   NONE
exit_status  0
ru_wallclock 0.020
ru_utime     0.010
ru_stime     0.013
ru_maxrss    6480
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    1407
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   8
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     13
ru_nivcsw    1
wallclock    3.768
cpu          0.022
mem          0.000
io           0.000
iow          0.000
ioops        0
maxvmem      0.000
maxrss       0.000
maxpss       0.000
arid         undefined
jc_name      NONE

The major fields of accounting information are follows. For more detail, use man sge_accounting command.

Field Description
jobnunmber Job ID
taskid Task ID of array job
qsub_time Job submission time
start_time Job start time
end_time Job end time
failed Job end code managed by job scheduler
exit_status Job end status
wallclock Job running time (including pre/post process)

3.5.6. Environment Variables

During job execution, the following environment variables are available for the executing job script/binary.

Variable Name Description
ENVIRONMENT Univa Grid Engine fills in BATCH to identify it as an Univa Grid Engine job submitted with qsub.
JOB_ID Job ID
JOB_NAME Name of the Univa Grid Engine job.
JOB_SCRIPT Name of the script, which is currently executed
NHOSTS The number of hosts on which this parallel job is executed
PE_HOSTFILE The absolute path includes hosts, slots and queue name
RESTARTED Indicates if the job was restarted (1) or if it is the first run (0)
SGE_JOB_HOSTLIST The absolute path includes only hosts assigned by Univa Grid Engine
SGE_LOCALDIR The local storage path assigned by Univa Grid Engine
SGE_O_WORKDIR The working directory path of the job submitter
SGE_TASK_ID Task number of the array job task the job represents (If is not an array task, the variable contains undefined)
SGE_TASK_FIRST Task number of the first array job task
SGE_TASK_LAST Task number of the last array job task
SGE_TASK_STEPSIZE Step size of the array job

3.6. Reservation

In the case of Reserved service, job execution can be scheduled by reserving compute node in advance.

Item Description
Minimum reservation days 1 day
Maximum reservation days 30 days
Maximum number of nodes can be reserved at once per system 442 nodes
Maximum reserved nodes per reservation 32
Maximum reserved node time per reservtation 12,288 node x hour
Start time of accept reservation 10:00a.m of 30 days ago
Closing time of accept reservation 9:00a.m of Start reservation of the day before
Canceling reservation accept term 9:00a.m of Start reservation of the day before
Reservation start time 10:00am of Reservation start day
Reservation end time 9:30am of Reservation end day

3.6.1. Make a reservation

Warning

Making reservation of compute node is permitted to a responsible person or a manager.

To make a reservation compute node, use qrsub command or the ABCI User Portal .

$ qrsub options
Option Description
-a YYYYMMDD Specify start reservation date (format: YYYYMMDD)
-d days Specify reservation day. exclusive with -e option
-e YYYYMMDD Specify end reservation date (format: YYYYMMDD). exclusive with -d option
-g group Specify ABCI UserGroup
-N name Specify reservation name. the reservation name can be specified following character: "A-Za-z0-9_" and maximum length is 64
-n nnode Specify the number of nodes.

Example) Make a reservation 4 compute nodes from 2018/07/05 to 1 week (7 days)

[username@es1 ~]$ qrsub -a 20180705 -d 7 -g gxa50001 -n 4 -N "Reserve_for_AI"
Your advance reservation 12345 has been granted

The ABCI points are consumed when complete reservation.

3.6.2. Show the status of reservations

To show the current status of reservations, use the qrstat command or the ABCI User Portal.

Example)

[username@es1 ~]$ qrstat
ar-id      name       owner        state start at             end at               duration    sr
----------------------------------------------------------------------------------------------------
     12345 Reserve_fo root         w     07/05/2018 10:00:00  07/12/2018 09:30:00  167:30:00    false
Field Description
ar-id Reserve ID (AR-ID)
name Reserve name
owner root is always displayed
state Status of reservation
start at Start reservation date (start time is 10:00am at all time)
end at End reservation date (end time is 9:30am at all time)
duration Reservation term (hhh:mm:ss)
sr false is always displayed

If you want to show the number of nodes that can be reserved, you need to access User Portal, or useqrstat command with --available option.

[username@es1 ~]$ qrstat --available
06/27/2018  441
07/05/2018  432
07/06/2018  434

Note

The no reservation day is not printed.

3.6.3. Cancel a reservation

Warning

Canceling reservation is permitted to a responsible person or a manager.

To cancel a reservation, use the qrdel command or the ABCI User Portal.

Example) Cancel a reservation

[username@es1 ~]$ qrdel 12345