Q. If I press Ctrl+S during an interactive job, I cannot type anything after that
This is because the standard terminal emulators for macOS, Windows, and Linux enable Ctrl+S/Ctrl+Q flow control by default. To disable it, run the following in the terminal emulator on your local PC:
$ stty -ixon
Running the same command after logging in to an interactive node has the same effect.
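If you want the setting to persist across terminal sessions, one common approach (assuming your local shell is bash; the startup file differs for other shells) is to add the command to your shell startup file:

# Append to the local bash startup file so flow control stays disabled in new terminals
$ echo 'stty -ixon' >> ~/.bashrc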
Q. The group area consumes more capacity than the actual file size
In general, every file system has its own block size, and even the smallest file consumes at least one block.
ABCI sets the block size of the group area to 128 KB and the block size of the home area to 4 KB. For this reason, creating a large number of small files in the group area reduces usage efficiency. For example, a file smaller than 4 KB consumes about 32 times as much capacity in the group area as it would in the home area.
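As an illustration, you can compare a file's apparent size with the space it actually consumes; the group directory and file name below are placeholders:

# Apparent size in bytes vs. space actually consumed (rounded up to the block size)
[username@es1 ~]$ ls -l /groups/gaa50000/example/small.txt
[username@es1 ~]$ du -h /groups/gaa50000/example/small.txt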
Q. Singularity cannot use container registries that require authentication
Singularity version 2.6 and SingularityPRO version 3.5 have a function equivalent to docker login that passes authentication information through environment variables.
[username@es ~]$ export SINGULARITY_DOCKER_USERNAME='username'
[username@es ~]$ export SINGULARITY_DOCKER_PASSWORD='password'
[username@es ~]$ singularity pull docker://myregistry.azurecr.io/namespace/repo_name:repo_tag
For more information on Singularity version 2.6 authentication, see below.
For more information on SingularityPRO version 3.5 authentication, see below.
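As a side note, if you prefer not to leave the registry password in your shell history, you could read it interactively instead of typing it on the export line (a bash-specific sketch):

# Prompt for the password without echoing it, then export it for singularity pull
[username@es ~]$ read -sp 'Registry password: ' SINGULARITY_DOCKER_PASSWORD && export SINGULARITY_DOCKER_PASSWORD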
Q. NGC CLI cannot be executed
When you run the NGC Catalog CLI on ABCI, the following error message appears and the CLI cannot be executed. This is because the NGC CLI is built for Ubuntu 14.04 and later.
ImportError: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /tmp/_MEIxvHq8h/libstdc++.so.6)
Failed to execute script ngc
By preparing a wrapper shell script, the NGC CLI can be executed using Singularity; a sketch is shown below. This technique can be used not only for the NGC CLI but also for other similar cases.
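The wrapper below is a minimal sketch of this technique, not the exact script from the original documentation: it assumes SingularityPRO, an Ubuntu-based image pulled in advance, and an ngc binary placed under $HOME/ngc (all of these paths and names are illustrative).

#!/bin/sh
# Hypothetical wrapper: run the NGC CLI inside an Ubuntu container via Singularity.
# Module version, image path, and ngc binary location are assumptions for illustration.
source /etc/profile.d/modules.sh
module load singularitypro/3.5

NGC_DIR=$HOME/ngc                     # directory containing the ngc binary (assumed)
IMAGE=$NGC_DIR/ubuntu-18.04.simg      # Ubuntu image pulled beforehand with singularity pull

singularity exec "$IMAGE" "$NGC_DIR/ngc" "$@"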
Q. I want to assign multiple compute nodes and have each compute node perform different processing
If you give the -l rt_F=N option to qsub, N compute nodes are allocated. You can also use MPI if you want each allocated compute node to perform different processing.
$ module load openmpi/2.1.6
$ mpirun -hostfile $SGE_JOB_HOSTLIST -np 1 command1 : -np 1 command2 : ... : -np 1 commandN
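For example, a batch job script using this MPMD style might look like the following sketch; the two program names are placeholders for whatever you want to run on each node:

#!/bin/bash
#$ -l rt_F=2
#$ -cwd

source /etc/profile.d/modules.sh
module load openmpi/2.1.6

# Hypothetical MPMD launch: the first node runs preprocess.sh, the second runs train.sh
mpirun -hostfile $SGE_JOB_HOSTLIST -np 1 ./preprocess.sh : -np 1 ./train.sh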
Q. I want to prevent the SSH session from being closed unexpectedly
The SSH session may be closed shortly after you connect to ABCI with SSH. In such a case, you may be able to avoid the disconnection by performing KeepAlive communication between the SSH client and the server.
To enable KeepAlive, set the ServerAliveInterval option to about 60 seconds in the system-wide ssh configuration file (/etc/ssh/ssh_config) or in the per-user configuration file (~/.ssh/config) on your local machine.
[username@userpc ~]$ vi ~/.ssh/config
[username@userpc ~]$ cat ~/.ssh/config
(snip)
Host as.abci.ai
     ServerAliveInterval 60
(snip)
[username@userpc ~]$
The default value of ServerAliveInterval is 0 (no KeepAlive).
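Alternatively, for a single connection you can pass the option directly on the ssh command line instead of editing the configuration file (the identity file path is illustrative):

[username@userpc ~]$ ssh -o ServerAliveInterval=60 -i /path/identity_file username@as.abci.ai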
Q. I want to use a newer version of Open MPI
ABCI offers both CUDA-aware and non-CUDA-aware versions of Open MPI; you can check which versions are provided in Using MPI.
The Environment Modules provided by ABCI attempt to configure a CUDA-aware Open MPI environment when the openmpi module is loaded, but only if a cuda module has been loaded beforehand.
Therefore, for a combination for which CUDA-aware MPI is provided (openmpi/2.1.6), the environment setup succeeds:
$ module load cuda/10.0/10.0.130.1
$ module load openmpi/2.1.6
$ module list
Currently Loaded Modulefiles:
  1) cuda/10.0/10.0.130.1   2) openmpi/2.1.6
For a combination for which CUDA-aware MPI is not provided (openmpi/3.1.3), the environment setup fails and the openmpi module is not loaded:
$ module load cuda/10.0/10.0.130.1
$ module load openmpi/3.1.3
ERROR: loaded cuda module is not supported.
WARINING: openmpi/3.1.3 is supported only host version
$ module list
Currently Loaded Modulefiles:
  1) cuda/10.0/10.0.130.1
On the other hand, there are cases where the CUDA-aware version of Open MPI is not necessary, such as when you want to use Open MPI only for parallelization with Horovod. In this case, you can use a newer version of Open MPI that does not provide the CUDA-aware functions by loading the openmpi module first.
$ module load openmpi/3.1.3
$ module load cuda/10.0/10.0.130.1
$ module list
Currently Loaded Modulefiles:
  1) openmpi/3.1.3   2) cuda/10.0/10.0.130.1
The features of the CUDA-aware version of Open MPI are described on the Open MPI site: FAQ: Running CUDA-aware Open MPI
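As described in that FAQ, you can also check whether the Open MPI you have loaded was built with CUDA support by querying ompi_info; the exact output format may vary by version:

$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true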
Q. I want to know how congested the ABCI job execution environment is
ABCI operates a web service that visualizes job congestion status as well as utilization of compute nodes, power consumption of the whole datacenter, PUE, cooling facility, etc.
The service runs on an internal server named vws1, on port 3000/tcp. You can access it by following the procedure below. First, you need to set up an SSH tunnel.
The following example, written in $HOME/.ssh/config on your PC, sets up an SSH tunnel connection to the ABCI internal servers through as.abci.ai by using ProxyCommand. Please also refer to the procedure in Login using an SSH Client::General method in the ABCI System User Environment.
Host *.abci.local
     User username
     IdentityFile /path/identity_file
     ProxyCommand ssh -W %h:%p -l username -i /path/identity_file as.abci.ai
Then create an SSH tunnel that forwards port 3000/tcp on your PC to port 3000/tcp on vws1.
[username@userpc ~]$ ssh -L 3000:vws1:3000 es.abci.local
You can access the service by opening http://localhost:3000/ in your favorite browser.
Q. Are there any pre-downloaded datasets?
Please see this page.
Q. Image file creation with Singularity pull fails in batch job
When you try to create an image file with Singularity pull in a batch job, the mksquashfs executable file may not be found and the creation may fail.
INFO:    Converting OCI blobs to SIF format
FATAL:   While making image from oci registry: while building SIF from layers: unable to create new build: while searching for mksquashfs: exec: "mksquashfs": executable file not found in $PATH
The problem can be avoided by adding /usr/sbin to PATH like this:
[username@g0001 ~]$ PATH="$PATH:/usr/sbin"
[username@g0001 ~]$ module load singularitypro/3.5
[username@g0001 ~]$ singularity run --nv docker://caffe2ai/caffe2:latest
Q. How can I find the job ID?
When you submit a batch job using the qsub command, the command outputs the job ID.
[username@es1 ~]$ qsub -g grpname test.sh
Your job 1000001 ("test.sh") has been submitted
If you are using qrsh, you can get the job ID from the JOB_ID environment variable. This variable is also available in the qsub (batch job) environment; a sketch of such use follows the example below.
[username@es1 ~]$ qrsh -g grpname -l rt_C.small=1 -l h_rt=1:00:00
[username@g0001 ~]$ echo $JOB_ID
1000002
[username@g0001 ~]$
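As a sketch of the batch-job case, JOB_ID can be used inside a job script, for example to tag output files with the job ID (the resource type and file names are illustrative):

#!/bin/bash
#$ -l rt_C.small=1
#$ -l h_rt=1:00:00
#$ -cwd

# Hypothetical use of JOB_ID: write results to a file named after the job ID
echo "running as job $JOB_ID" > result.$JOB_ID.log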
To find the job ID of a job you have already submitted, use the qstat command.
[username@es1 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
   1000003 0.00000 test.sh    username     qw    08/01/2020 13:05:30
To find the job ID of your completed job, use qacct -j. The -b and -e options are useful for narrowing the search range; see the qacct(1) man page (type man qacct on an interactive node). The following example lists the completed jobs that started on and after September 1st, 2020; jobnumber in the output has the same meaning as the job ID.
[username@es1 ~]$ qacct -j -b 202009010000
==============================================================
qname        gpu
hostname     g0001
group        grpname
owner        username
:
jobname      QRLOGIN
jobnumber    1000010
:
qsub_time    09/01/2020 16:41:37.736
start_time   09/01/2020 16:41:47.094
end_time     09/01/2020 16:45:46.296
:
==============================================================
qname        gpu
hostname     g0001
group        grpname
owner        username
:
jobname      testjob
jobnumber    1000120
:
qsub_time    09/07/2020 15:35:04.088
start_time   09/07/2020 15:43:11.513
end_time     09/07/2020 15:50:11.534
:
Q. I want to run a Linux command on all allocated compute nodes
ABCI provides the ugedsh command to execute Linux commands in parallel on all allocated compute nodes. The command specified as the argument of ugedsh is executed once on each node.
[username@es1 ~]$ qrsh -g grpname -l rt_F=2
[username@g0001 ~]$ ugedsh hostname
g0001: g0001.abci.local
g0002: g0002.abci.local
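For example, in a batch script ugedsh can be used to stage the same input file to the local scratch area (SGE_LOCALDIR) of every allocated node; the group path and file name below are placeholders:

#!/bin/bash
#$ -l rt_F=2
#$ -cwd

# Hypothetical staging step: copy the input archive to local scratch on each node
ugedsh cp /groups/gaa50000/data/input.tar $SGE_LOCALDIR/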