
FAQ

Q. If I press Ctrl+S during an interactive job, I cannot type anything afterwards

This is because the standard terminal emulators on macOS, Windows, and Linux have Ctrl+S/Ctrl+Q flow control enabled by default. To disable it, execute the following in the terminal emulator on your local PC:

$ stty -ixon

Executing the same command while logged in to an interactive node has the same effect.
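
If you want the setting to persist across sessions, a minimal sketch (assuming your login shell is bash) is to add the command to your shell startup file, for example ~/.bashrc:

# Disable Ctrl+S/Ctrl+Q flow control for interactive shells only.
# The -t test skips non-interactive sessions such as scp or sftp.
if [ -t 0 ]; then
    stty -ixon
fi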

Q. The Group Area consumes more capacity than the actual file size

In general, every file system has its own block size, and even the smallest file consumes at least one block of capacity.

ABCI sets the block size of Group Areas 1, 2, and 3 to 128 KB, and the block size of the home area and the Group Area to 4 KB. For this reason, creating a large number of small files in Group Areas 1, 2, and 3 reduces usage efficiency. For example, a file smaller than 4 KB created in Group Area 1, 2, or 3 consumes about 32 times (128 KB / 4 KB) the capacity it would consume in the home area.
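
You can compare the apparent size of your files with the capacity actually allocated by the file system using du. The following is a sketch, where gAA50NNN and small_files are placeholders for your own group and directory:

[username@es1 ~]$ du -sh --apparent-size /groups/gAA50NNN/small_files    # total file size
[username@es1 ~]$ du -sh /groups/gAA50NNN/small_files                    # capacity actually consumed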

Q. Singularity cannot use container registries that require authentication

SingularityPRO provides a function equivalent to docker login: authentication information is passed via environment variables.

[username@es1 ~]$ export SINGULARITY_DOCKER_USERNAME='username'
[username@es1 ~]$ export SINGULARITY_DOCKER_PASSWORD='password'
[username@es1 ~]$ singularity pull docker://myregistry.azurecr.io/namespace/repo_name:repo_tag
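
If you prefer not to leave the password in your shell history, one possible approach (a sketch, not an official ABCI procedure) is to read it interactively before exporting:

[username@es1 ~]$ read -s SINGULARITY_DOCKER_PASSWORD    # the typed password is not echoed
[username@es1 ~]$ export SINGULARITY_DOCKER_USERNAME='username'
[username@es1 ~]$ export SINGULARITY_DOCKER_PASSWORD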

For more information on SingularityPRO authentication, see below.

Q. NGC CLI cannot be executed

When you run the NGC Catalog CLI on ABCI, the following error message appears and the command cannot be executed. This is because the NGC CLI is built for Ubuntu 14.04 and later.

ImportError: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /tmp/_MEIxvHq8h/libstdc++.so.6)
[89261] Failed to execute script ngc

You can run it via Singularity by preparing a shell script like the following. This technique is not limited to the NGC CLI and can be applied to other commands as well.

#!/bin/sh
source /etc/profile.d/modules.sh
module load singularitypro

# Run the ngc binary inside an Ubuntu 18.04 container image.
NGC_HOME=$HOME/ngc
singularity exec "$NGC_HOME/ubuntu-18.04.simg" "$NGC_HOME/ngc" "$@"
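
For example, assuming the script above is saved as $NGC_HOME/ngc_wrapper.sh (the file name is arbitrary, and --version is just used as a simple test), it can be invoked like this:

[username@es1 ~]$ chmod +x $NGC_HOME/ngc_wrapper.sh
[username@es1 ~]$ $NGC_HOME/ngc_wrapper.sh --version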

Q. I want to assign multiple compute nodes and have each compute node perform different processing

If you give the -l rt_F=N or -l rt_AF=N option to qrsh or qsub, N compute nodes are assigned. You can use MPI if you want to run different processing on each assigned compute node.

$ module load openmpi/2.1.6
$ mpirun -hostfile $SGE_JOB_HOSTLIST -np 1 command1 : -np 1 command2 : ... : -np 1 commandN
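
As a concrete illustration, a minimal batch script sketch might look like the following (preprocess.sh and train.sh are hypothetical placeholders for your own programs):

#!/bin/sh
#$ -l rt_F=2
#$ -l h_rt=1:00:00
#$ -cwd

source /etc/profile.d/modules.sh
module load openmpi/2.1.6

# Run a different command on each of the two allocated nodes (MPMD style).
mpirun -hostfile $SGE_JOB_HOSTLIST -np 1 ./preprocess.sh : -np 1 ./train.sh

Submit it as usual, for example with qsub -g grpname jobscript.sh.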

Q. I want to prevent my SSH session from being closed unexpectedly

Your SSH session may be closed shortly after connecting to ABCI. In such cases, you may be able to avoid this by enabling KeepAlive communication between the SSH client and the server.

To enable KeepAlive, set the ServerAliveInterval option to about 60 seconds in the system-wide ssh configuration file (/etc/ssh/ssh_config) or in the per-user configuration file (~/.ssh/config) on your local terminal.

[username@userpc ~]$ vi ~/.ssh/config
[username@userpc ~]$ cat ~/.ssh/config
(snip)
Host as.abci.ai
   ServerAliveInterval 60
(snip)
[username@userpc ~]$

Note

The default value of ServerAliveInterval is 0 (no KeepAlive).

Q. I want to use a newer version of Open MPI

ABCI offers both CUDA-aware and non-CUDA-aware versions of Open MPI; which combinations are provided is listed in Open MPI.

The Environment Modules provided by ABCI will attempt to configure a CUDA-aware Open MPI environment when an openmpi module is loaded, but only if a cuda module has been loaded beforehand.

Therefore, for a combination where CUDA-aware MPI is provided (cuda/10.0/10.0.130.1 and openmpi/2.1.6), the environment setup succeeds:

$ module load cuda/10.0/10.0.130.1
$ module load openmpi/2.1.6
$ module list
Currently Loaded Modulefiles:
  1) cuda/10.0/10.0.130.1   2) openmpi/2.1.6

For a combination where CUDA-aware MPI is not provided (cuda/9.1/9.1.85.3 and openmpi/3.1.6), the environment setup fails and the openmpi module is not loaded:

$ module load cuda/9.1/9.1.85.3
$ module load openmpi/3.1.6
ERROR: loaded cuda module is not supported.
WARNING: openmpi/3.1.6 cannot be loaded due to missing prereq.
HINT: at least one of the following modules must be loaded first: cuda/9.2/9.2.88.1 cuda/9.2/9.2.148.1 cuda/10.0/10.0.130.1 cuda/10.1/10.1.243 cuda/10.2/10.2.89 cuda/11.0/11.0.3 cuda/11.1/11.1.1 cuda/11.2/11.2.2
$ module list
Currently Loaded Modulefiles:
  1) cuda/9.1/9.1.85.3

On the other hand, there are cases where the CUDA-aware version of Open MPI is not necessary, for example when you only need Open MPI for parallelization with Horovod. In this case, you can use a newer version of Open MPI that does not support CUDA-aware functions by loading the openmpi module first.

$ module load openmpi/3.1.6
$ module load cuda/9.1/9.1.85.3
$ module list
Currently Loaded Modulefiles:
  1) openmpi/3.1.6       2) cuda/9.1/9.1.85.3

Note

The features of the CUDA-aware versions of Open MPI are described on the Open MPI site: FAQ: Running CUDA-aware Open MPI

Q. I want to know how congested the ABCI job execution environment is

ABCI operates a web service that visualizes the job congestion status as well as the utilization of compute nodes, the power consumption of the whole datacenter, PUE, cooling facility status, etc. The service runs on an internal server named vws1, on port 3000/tcp. You can access it by following the procedure below.

You need to set up an SSH tunnel. The following example, written in $HOME/.ssh/config on your PC, sets up an SSH tunnel to the ABCI internal servers through as.abci.ai by using ProxyCommand. Please also refer to the procedure in Login using an SSH Client: General method in the ABCI System User Environment.

Host *.abci.local
    User         username
    IdentityFile /path/identity_file
    ProxyCommand ssh -W %h:%p -l username -i /path/identity_file as.abci.ai

You can then create an SSH tunnel that forwards port 3000/tcp on your PC to port 3000/tcp on vws1.

[username@userpc ~]$ ssh -L 3000:vws1:3000 es.abci.local

You can access the service by opening http://localhost:3000/ on your favorite browser.
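
To confirm that the tunnel is up before opening the browser, you can, for example, check the port from another terminal on your PC (a simple sketch; curl is assumed to be installed locally):

[username@userpc ~]$ curl -sI http://localhost:3000/ | head -n 1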

Q. Are there any pre-downloaded datasets?

Please see Datasets.

Q. Image file creation with Singularity pull fails in batch job

When you try to create an image file with singularity pull in a batch job, the mksquashfs executable may not be found and the image creation may fail.

INFO:    Converting OCI blobs to SIF format
FATAL:   While making image from oci registry: while building SIF from layers: unable to create new build: while searching for mksquashfs: exec: "mksquashfs": executable file not found in $PATH

The problem can be avoided by adding /usr/sbin to PATH like this:

Example)

[username@g0001 ~]$ export PATH="$PATH:/usr/sbin" 
[username@g0001 ~]$ module load singularitypro
[username@g0001 ~]$ singularity run --nv docker://caffe2ai/caffe2:latest

Q. I get an error due to insufficient disk space when I run singularity build/pull on a compute node

The singularity build and pull commands use /tmp as the location for temporary files. When you build a large container on a compute node, this may cause an error due to insufficient space in /tmp.

If you get an error due to insufficient space, set the SINGULARITY_TMPDIR environment variable so that the node-local storage is used instead, as shown below:

[username@g0001 ~]$ SINGULARITY_TMPDIR=$SGE_LOCALDIR singularity pull docker://nvcr.io/nvidia/tensorflow:20.12-tf1-py3

Q. How can I find the job ID?

When you submit a batch job using the qsub command, the command outputs the job ID.

[username@es1 ~]$ qsub -g grpname test.sh
Your job 1000001 ("test.sh") has been submitted

If you are using qrsh, you can get the job ID from the JOB_ID environment variable. This variable is also available in qsub (batch job) environments.

[username@es1 ~]$ qrsh -g grpname -l rt_C.small=1 -l h_rt=1:00:00
[username@g0001 ~]$ echo $JOB_ID
1000002
[username@g0001 ~]$
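
The same variable can be referenced inside a batch script as well, for example to name output files (a sketch; the file name is arbitrary):

#!/bin/sh
#$ -cwd
# Write results to a file whose name contains the job ID.
echo "running as job $JOB_ID" > result.$JOB_ID.log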

To find the job ID of your already submitted job, use the qstat command.

[username@es1 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
   1000003 0.00000 test.sh username   qw    08/01/2020 13:05:30

To find the job ID of your completed job, use qacct -j. The -b and -e options are useful for narrowing the search range. See qacct(1) man page (type man qacct on an interactive node). The following example lists the completed jobs that started on and after September 1st, 2020. jobnumber has the same meaning as job-ID.

[username@es1 ~]$ qacct -j -b 202009010000
==============================================================
qname        gpu
hostname     g0001
group        grpname
owner        username

:

jobname      QRLOGIN
jobnumber    1000010

:

qsub_time    09/01/2020 16:41:37.736
start_time   09/01/2020 16:41:47.094
end_time     09/01/2020 16:45:46.296

:

==============================================================
qname        gpu
hostname     g0001
group        grpname
owner        username

:

jobname      testjob
jobnumber    1000120

:

qsub_time    09/07/2020 15:35:04.088
start_time   09/07/2020 15:43:11.513
end_time     09/07/2020 15:50:11.534

:

Q. I want to run a Linux command on all allocated compute nodes

ABCI provides the ugedsh command to execute a Linux command in parallel on all allocated compute nodes. The command given as the argument of ugedsh is executed once on each node.

Example)

[username@es1 ~]$ qrsh -g grpname -l rt_F=2
[username@g0001 ~]$ ugedsh hostname
g0001: g0001.abci.local
g0002: g0002.abci.local
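
This can also be used, for example, to stage the same data set to the local storage of every allocated node (a sketch; dataset.tar is a hypothetical file):

[username@g0001 ~]$ ugedsh cp $HOME/dataset.tar $SGE_LOCALDIR/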

Q. What is the difference between Compute Node (A) and Compute Node (V)

ABCI was upgraded to ABCI 2.0 in May 2021. In addition to the previously provided Compute Nodes (V) with NVIDIA V100, the Compute Nodes (A) with NVIDIA A100 are currently available.

This section describes the differences between the Compute Node (A) and the Compute Node (V), and points to note when using the Compute Node (A).

Resource type name

The Resource type names differ between the Compute Node (A) and the Compute Node (V). Compute Node (A) can be used by specifying one of the following Resource type names when submitting a job.

Resource type   Resource type name   Assigned physical CPU cores   Assigned GPUs   Memory (GiB)
Full            rt_AF                72                            8               480
AG.small        rt_AG.small          9                             1               60

For more detailed Resource types, see Available Resource Types.
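
For example, to start an interactive job on a Compute Node (A) with the smallest resource type (grpname is a placeholder for your ABCI group):

[username@es1 ~]$ qrsh -g grpname -l rt_AG.small=1 -l h_rt=1:00:00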

Accounting

Compute Node (A) and Compute Node (V) have different Resource type charge coefficients, as described in Available Resource Types. Therefore, the number of ABCI points used, which is calculated based on Accounting, also differs.

The number of ABCI points used when using the Compute Node (A) is as follows:

Resource type name   On-demand or Spot Service            On-demand or Spot Service   Reserved Service
                     Execution Priority: -500 (default)   Execution Priority: -400    (point/day)
                     (point/hour)                         (point/hour)
rt_AF                3.0                                  4.5                         108
rt_AG.small          0.5                                  0.75                        N/A
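
For example, a Spot job that uses two rt_AF nodes for three hours at the default execution priority would consume 2 nodes x 3 hours x 3.0 points/hour = 18 ABCI points.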

Operating System

The Compute Node (A) and the Compute Node (V) use different Operating Systems.

Node               Operating System
Compute Node (A)   Red Hat Enterprise Linux 8.2
Compute Node (V)   CentOS Linux 7.5

Since the kernel and library versions (such as glibc) differ, a program built for the Compute Node (V) is not guaranteed to work when run as-is on the Compute Node (A).

Please rebuild the program for the Compute Node (A) using the Compute Node (A) or the Interactive Node (A) described later.

CUDA Version

The NVIDIA A100 GPUs installed in the Compute Node (A) comply with Compute Capability 8.0.

CUDA 10 and earlier do not support Compute Capability 8.0. Therefore, on the Compute Node (A), use CUDA 11 or later, which supports Compute Capability 8.0.

Note

Environment Modules makes CUDA 10 available for testing, but its operation is not guaranteed.
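
For example, one of the CUDA 11 versions provided as a module (such as cuda/11.2/11.2.2, which appears in the module hint shown earlier) can be loaded as follows:

$ module load cuda/11.2/11.2.2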

Interactive Node (A)

ABCI provides the Interactive Node (A), which has the same software configuration as the Compute Node (A), for the convenience of program development for the Compute Node (A). Programs built on the Interactive Node (A) are not guaranteed to work on the Compute Node (V).

Please refer to the following for the proper use of Interactive Nodes:

                                                    Interactive Node (V)   Interactive Node (A)
Can users log in?                                   Yes                    Yes
Can users develop programs for Compute Nodes (V)?   Yes                    No
Can users develop programs for Compute Nodes (A)?   No                     Yes
Can users submit jobs for Compute Nodes (V)?        Yes                    Yes
Can users submit jobs for Compute Nodes (A)?        Yes                    Yes
Can users access the old Group Area?                Yes                    Yes

For more information on Interactive Node (A), see Interactive Node.

Group Area

The Old Area (/groups[1-2]/gAA50NNN) cannot be accessed from the Compute Node (A).

To use files in the Old Area from the Compute Node (A), you need to copy them to the home area or the New Area (/groups/gAA50NNN) in advance. To copy files from the Old Area, please use the Interactive Nodes or the Compute Node (V).
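
For example, on an Interactive Node you could copy a directory from the Old Area to the New Area like this (mydata is a hypothetical directory name):

[username@es1 ~]$ cp -a /groups1/gAA50NNN/mydata /groups/gAA50NNN/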

Since April 2021, we have also been migrating files from the Old Area to the New Area. For information about the Group Area data migration, see the FAQ entry Q. What are the new Group Area and data migration?.

Q. How to use ABCI 1.0 Environment Modules

ABCI was upgraded in May 2021. Due to the upgrade, the Environment Modules as of FY2020 (the ABCI 1.0 Environment Modules) are installed in the /apps/modules-abci-1.0 directory. If you want to use the ABCI 1.0 Environment Modules, set the MODULE_HOME environment variable as follows and load the configuration file.

Please note that the ABCI 1.0 Environment Modules are not eligible for ABCI System support.

sh, bash:

export MODULE_HOME=/apps/modules-abci-1.0
. ${MODULE_HOME}/etc/profile.d/modules.sh

csh, tcsh:

setenv MODULE_HOME /apps/modules-abci-1.0
source ${MODULE_HOME}/etc/profile.d/modules.csh

Q. What are the new Group Area and data migration?

In FY2021, we expanded the storage system. Refer to Storage Systems for details. As the storage system is expanded, the configuration of the Group Area will be changed. All the data in the existing Group Area used in FY2020 are going to be migrated into a new Group Area in FY2021.

The existing Group Area (the Old Area) is not accessible from the computing resources newly introduced in May 2021 (the Compute Node (A)). We have therefore created a new Group Area (the New Area), which is accessible from the Compute Node (A), and are migrating all the data stored in the Old Area to the New Area. The data migration is managed by the operation team, so users do not need to do anything for the migration process.

User groups that have been using the Old Area /groups[1-2]/gAA50NNN/ until FY2020 were allocated the New Area /groups/gAA50NNN/ in April 2021, and some user groups using the Old Area /fs3/ have been allocated the New Area /projects/ since mid-July 2021. Both the Old Area and the New Area are accessible from all the Interactive Nodes and the Compute Nodes (V).

In addition, groups newly created in FY2021 are allocated only the New Area, so they are not subject to the data migration and are not affected by it.

The data migration is described below.

Basic Strategy

  • The ABCI operation team will copy all the files in the Old Area to the New Area in the background. The migration of all group data is scheduled to be completed by the end of FY2021.
  • After the copy is completed, a symlink to the migration destination in the New Area will be created and you can refer to it with the same path as the Old Area.
  • The sources and the destinations of data migration are as follows.

    Source                              Destination                                                      Remarks
    /groups1/gAA50NNN/ (d002 users)     /projects/datarepository/gAA50NNN/migrated_from_SFA_GPFS/ [1]
    /groups1/gAA50NNN/ (other users)    /groups/gAA50NNN/migrated_from_SFA_GPFS/
    /groups2/gAA50NNN/                  /groups/gAA50NNN/migrated_from_SFA_GPFS/                         Completed
    /fs3/d001/gAA50NNN/                 /projects/d001/gAA50NNN/migrated_from_SFA_GPFS/
    /fs3/d002/gAA50NNN/                 /projects/datarepository/gAA50NNN/migrated_from_SFA_GPFS3/ [1]
  • After completion, we will notify the user by email that the migration has been completed.

The following command is executed for data migration.

# rsync -avH /{Old Area}/gAA50NNN/ /{New Area}/gAA50NNN/migrated_from_SFA_GPFS/ 

The following command is executed for verification and confirmation after data migration.

# rsync -avH --delete /{Old Area}/gAA50NNN/ /{New Area}/gAA50NNN/migrated_from_SFA_GPFS/ 

The New Area

  • Users cannot access the migration destination directories in the New Area until the data migration is completed.
  • Areas in the New Area other than the destination directories can be used freely.
  • Disk usage increases as data is copied. For this reason, the limit of the storage usage for the New Area is temporarily set to twice the quota value, which is the group disk amount applied for in the ABCI User Portal. After the migration, the limit of the storage usage will be set back to the same value as the quota value after a certain grace period.

The Old Area /groups[1-2]/gAA50NNN and /fs3/d00[1-2]/gAA50NNN

  • The Old Areas /groups1/gAA50NNN and /fs3/d00[1-2]/gAA50NNN have been read-only since August 11, 2021. Please use the New Area from now on.
  • After the Data Migration is completed, you will not be able to access /groups[1-2]/gAA50NNN or /fs3/d00[1-2]/gAA50NNN/ on the Old Area.
  • These paths will be replaced with symlinks to the destination directory in the New Area after all the data in each Old Area has been migrated, making them accessible with the same path as before.

Q. About the Quota Value and the Limit of the Storage Usage

During the Data Migration

Before the data migration started, the limit of the storage usage (shown as "limit" by the show_quota command) was set to the same value as the quota value, which is the group disk amount applied for in the ABCI User Portal. After the data migration started, the relationship between the quota value and the limit of the storage usage changed. By June 27, 2021, the limit of the storage usage for the New Area was set to twice the quota value. After June 28, 2021, the relationship changed as follows.

Increasing the Quota Value

  • Even if you apply to increase the quota value, the limit of the storage usage of the Old Area will not be increased.
  • The limit of the storage usage of the New Area (/groups/gAA50NNN) is set to "the value set at that time" or "twice the new quota value", whichever is greater.

Decreasing the Quota Value

  • When you apply to decrease the quota value, it can be decreased only if the usage amount of the Old Area (shown as "used" by the show_quota command) is less than the new quota value.
  • After the application, the limit of the storage usage of the Old Area will be decreased to the same value as the quota value.
  • The limit of the storage usage of the New Area will not be decreased.

ABCI points consumed by using Group disks are calculated based on the quota value as before.

After the Data Migration is completed

During the data migration task, the limit of the storage usage of the New Area is set to at least twice the quota value. After the data migration is completed, the limit of the storage usage for the New Area will be set to the same value as the quota value after a grace period.

The grace periods are as follows. After the grace period, if the usage amount of the New Area (shown as "used" by the show_quota command) is larger than the quota value, you will not be able to write to the New Area. Please delete unnecessary files (duplicated files, etc.) or apply to increase the quota value from "User Group Management" on the ABCI User Portal.

Group Area           Grace Period
/groups1/gAA50NNN/   Set after the migration task
/groups2/gAA50NNN/   Until September 30, 2021
/fs3/                Set after the migration task
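
You can check the current usage ("used") and limit of your group areas by running show_quota on an interactive node (its output is omitted here):

[username@es1 ~]$ show_quota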

Q. About the status of the Data Migration Task

With the expansion of the storage system in FY2021, we are migrating data from the Group Area that was used until FY2020 to the new Group Area. As of August 2021, the migration status of each Group Area is as follows.

Group Area           Status
/groups1/gAA50NNN/   In progress
/groups2/gAA50NNN/   Completed on July 1, 2021
/fs3                 Starts in mid-October 2021

Q. Why can't I write data to the Old Area?

The Old Areas /groups1 and /fs3, which were used until FY2020, were changed to read-only on August 11, 2021, to improve the efficiency of the data migration. If you want to write data, please use /groups and /projects in the New Area.

The data migration of the Old Area /groups2 has been completed, and a symlink to the New Area has been set up. Therefore, you can write using the same path as before.

For more information on data migration, see Q. What are the new Group Area and data migration?.

Q. About Access Rights for Each Directory in the Group Area

The Access Rights for each directory in the Group Area during data migration

Directories                                                      Read      Write     Delete    Descriptions
/groups/gAA50NNN/                                                Yes [2]   Yes [2]   Yes [2]   New Area
/groups1/gAA50NNN/                                               Yes       No        No        Old Area
/groups2/gAA50NNN/                                               Yes       Yes       Yes       Symlink to /groups/gAA50NNN/migrated_from_SFA_GPFS/
/fs3/d00[1-2]/gAA50NNN/                                          Yes       No        No        Old Area
/projects/d001/gAA50NNN/                                         Yes [2]   Yes [2]   Yes [2]   New Area for d001 users
/projects/datarepository/gAA50NNN/                               Yes [2]   Yes [2]   Yes [2]   New Area for d002 users
/groups/gAA50NNN/migrated_from_SFA_GPFS/                         No [3]    No [3]    No [3]    Destination from /groups1/gAA50NNN/
/projects/d001/gAA50NNN/migrated_from_SFA_GPFS/                  No [3]    No [3]    No [3]    Destination from /fs3/d001/
/projects/datarepository/gAA50NNN/migrated_from_SFA_GPFS/ [1]    No [3]    No [3]    No [3]    Destination from d002 users' /groups1/gAA50NNN/
/projects/datarepository/gAA50NNN/migrated_from_SFA_GPFS3/ [1]   No [3]    No [3]    No [3]    Destination from /fs3/d002/gAA50NNN/

The Access Rights for each directory in the Group Area after the data migration

  • After the Data Migration is completed, you will not be able to access /groups[1-2]/gAA50NNN or /fs3/d00[1-2]/gAA50NNN/ on the Old Area.
  • These paths will be replaced with symlinks to the destination directory in the New Area after all the data in each Old Area has been migrated, making them accessible with the same path as before.
  • Access rights of the paths to the Old Area after the data migration task is completed:

    Paths                 Read   Write    Delete   Reference to                                                     Remarks
    /groups1/gAA50NNN/    Yes    No [4]   No [4]   /groups/gAA50NNN/migrated_from_SFA_GPFS/
    /groups2/gAA50NNN/    Yes    Yes      Yes      /groups/gAA50NNN/migrated_from_SFA_GPFS/
    /fs3/d001/gAA50NNN/   Yes    No [5]   No [5]   /projects/d001/gAA50NNN/migrated_from_SFA_GPFS/
    /fs3/d002/gAA50NNN/   Yes    No [5]   No [5]   /projects/datarepository/gAA50NNN/migrated_from_SFA_GPFS3/ [1]
    /groups1/gAA50NNN/    Yes    No [4]   No [4]   /projects/datarepository/gAA50NNN/migrated_from_SFA_GPFS/ [1]   for d002 users
  • Access rights of the directories in the New Area after the data migration task is completed:

    Directories                                                      Read   Write   Delete   Descriptions
    /groups/gAA50NNN/                                                Yes    Yes     Yes      New Area
    /groups/gAA50NNN/migrated_from_SFA_GPFS/                         Yes    Yes     Yes      Destination from the Old Area
    /projects/d001/gAA50NNN/                                         Yes    Yes     Yes      New Area for d001 users
    /projects/d001/gAA50NNN/migrated_from_SFA_GPFS/                  Yes    Yes     Yes      Destination from /fs3/d001/gAA50NNN
    /projects/datarepository/gAA50NNN/                               Yes    Yes     Yes      New Area for d002 users
    /projects/datarepository/gAA50NNN/migrated_from_SFA_GPFS/ [1]    Yes    Yes     Yes      Destination from /groups1/gAA50NNN for d002 users
    /projects/datarepository/gAA50NNN/migrated_from_SFA_GPFS3/ [1]   Yes    Yes     Yes      Destination from /fs3/d002/gAA50NNN

  1. Since /fs3/d002 users have multiple migration sources, there are two migration destination directories: migrated_from_SFA_GPFS/ and migrated_from_SFA_GPFS3/.

  2. Except the destination directories.

  3. Until the data migration is completed.

  4. Write/Delete will be available after the migration of all data from /groups1/ is completed.

  5. Write/Delete will be available after the migration of all data from /fs3/ is completed.