1. ABCI System Overview

1.1. System Architecture

This system is the AI Bridging Cloud Infrastructure (ABCI). The ABCI system consists of 1,088 compute nodes, a large-capacity storage system with 22 PB of disk space, a high-performance interconnect, and software that makes the most of this hardware.

The ABCI system provides a total FP16 (half-precision floating point) theoretical peak performance of 550 PFLOPS and a total FP64 (double-precision floating point) theoretical peak performance of 37 PFLOPS. The total memory capacity is 476 TiB, and the total peak memory bandwidth is 4.19 PB/s.
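
As a rough sanity check, these totals can be reproduced from per-device peak figures. The sketch below is a back-of-envelope estimate; the per-GPU and per-CPU numbers in it are assumed vendor specifications, not values taken from this guide.

```python
# Back-of-envelope check of the quoted system-wide peaks.
# Assumed vendor specs (not from this guide):
#   Tesla V100 (NVLink): ~125 TFLOPS FP16 (Tensor Core), ~7.8 TFLOPS FP64
#   Xeon Gold 6148:      ~1.5 TFLOPS FP64 (20 cores x 2.4 GHz x 32 FLOP/cycle)
gpus = 1088 * 4   # 4,352 GPUs
cpus = 1088 * 2   # 2,176 CPUs

fp16_pflops = gpus * 125 / 1000                     # ~544 PFLOPS from the GPUs alone
fp64_pflops = (gpus * 7.8 + cpus * 1.536) / 1000    # ~37.3 PFLOPS from GPUs + CPUs

print(f"FP16 ~ {fp16_pflops:.0f} PFLOPS, FP64 ~ {fp64_pflops:.1f} PFLOPS")
```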

The compute nodes and the storage system are interconnected with InfiniBand EDR (100 Gbps), and the system is connected to the Internet at 100 Gbps via SINET5.

1.2. Compute Node Configuration

The ABCI system comprises 1,088 FUJITSU Server PRIMERGY CX2570 M4 nodes. Each compute node has two Intel Xeon Gold 6148 processors (2.4 GHz, 20 cores), for a total of 43,520 cores. In addition, each compute node has four NVIDIA Tesla V100 GPUs, for a total of 4,352 GPUs.

The specifications of the compute node are as follows.

| Item | Description | # |
|---|---|---|
| CPU | Intel Xeon Gold 6148 Processor, 2.4 GHz, 20 Cores (40 Threads) | 2 |
| GPU | NVIDIA Tesla V100 for NVLink, 16 GiB HBM2 | 4 |
| Memory | 384 GiB DDR4 2666 MHz RDIMM (ECC) | |
| SSD | Intel SSD DC P4600 1.6 TB, U.2 | 1 |
| Interconnects | InfiniBand EDR (12.5 GB/s) | 2 |
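
The following is a minimal illustrative sketch, not part of the ABCI documentation, for confirming from within a job that an allocated node exposes the expected 80 hardware threads (2 CPUs × 20 cores × 2 threads) and 4 GPUs; it assumes the node's `nvidia-smi` utility is on the PATH.

```python
# Minimal sketch: confirm the visible CPU threads and GPUs on a compute node.
# Assumes the NVIDIA driver utility `nvidia-smi` is available on the PATH.
import os
import subprocess

threads = os.cpu_count()  # 2 sockets x 20 cores x 2 threads = 80 on a full node
result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
gpu_count = sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))

print(f"hardware threads: {threads}, visible GPUs: {gpu_count}")  # expect 80 and 4
```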

1.3. Software Configuration

The software available on the ABCI system is shown below.

| Category | Software | Version |
|---|---|---|
| OS | CentOS | 7.5 |
| Job Scheduler | Univa Grid Engine | 8.6.6 |
| Development Environment | Intel Parallel Studio XE Cluster Edition | 2017.8, 2018.2, 2018.3, 2019.3 |
| | PGI Professional Edition | 17.10, 18.10, 19.3 |
| | CUDA Toolkit | 8.0.61.2, 9.0.176.2, 9.0.176.3, 9.0.176.4, 9.1.85.3, 9.2.88.1, 9.2.148.1, 10.0.130, 10.0.130.1, 10.1.243 |
| | GCC | 4.8.5, 7.3.0, 7.4.0 |
| | Python | 2.7.15, 3.4.8, 3.5.5, 3.6.5 |
| | Ruby | 2.0.0.648-33 |
| | R | 3.5.0 |
| | Java | 1.6.0_41, 1.7.0_171, 1.8.0_161 |
| | Scala | 1.27-248 |
| | Lua | 5.1.4 |
| | Perl | 5.16.3 |
| File System | DDN Lustre | 2.10.5_ddn7-1 |
| | DDN GRIDScaler | 4.2.3-17 |
| | BeeOND | 7.1.3 |
| Container | Docker | 17.12.0 |
| | Singularity | 2.6.1 |
| MPI | Open MPI | 1.10.7, 2.1.3, 2.1.5, 2.1.6, 3.0.3, 3.1.0, 3.1.2, 3.1.3 |
| | MVAPICH2 | 2.3rc2, 2.3 |
| | MVAPICH2-GDR | 2.3a, 2.3rc1, 2.3, 2.3.2 |
| | Intel MPI | 2018.2.199 |
| Library | cuDNN | 5.1.10, 6.0.21, 7.0.5, 7.1.3, 7.1.4, 7.2.1, 7.3.1, 7.4.2, 7.5.0, 7.5.1, 7.6.0, 7.6.1, 7.6.2, 7.6.3, 7.6.4 |
| | NCCL | 1.3.5-1, 2.1.15-1, 2.2.13-1, 2.3.4-1, 2.3.5-2, 2.3.7-1, 2.4.2-1, 2.4.7-1, 2.4.8-1 |
| | gdrcopy | 1.2 |
| | Intel MKL | 2017.8, 2018.2, 2018.3, 2019.3 |
| Utility | aws-cli | 1.16.194 |
| | fuse-sshfs | 2.10 |
| | s3fs-fuse | 1.85 |
| | sregistry-cli | 0.2.31 |
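
Because several versions of many of these packages are installed, it can be useful to confirm which versions are active in your current environment. The sketch below is illustrative only; it checks the Python interpreter itself plus `gcc` and `nvcc`, assuming both are on the PATH.

```python
# Minimal sketch: report which versions of a few listed tools are active
# in the current environment (assumes gcc and nvcc are on the PATH).
import subprocess
import sys

print("Python :", sys.version.split()[0])
for tool in ("gcc", "nvcc"):
    try:
        out = subprocess.run([tool, "--version"], capture_output=True, text=True)
        print(f"{tool:7}:", out.stdout.splitlines()[0])
    except FileNotFoundError:
        print(f"{tool:7}: not found in this environment")
```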

1.4. Storage Configuration

The ABCI system has large-capacity storage for storing the results of Artificial Intelligence and Big Data Analytics. In addition, a 1.6 TB SSD is installed in each compute node as a local scratch area. The file systems available on the ABCI system are listed below.

| Usage | Mount point | Capacity | File system | Notes |
|---|---|---|---|---|
| Home area | /home | 1.0 PB | Lustre | |
| Group area 1 | /groups1 | 6.6 PB | GPFS | |
| Group area 2 | /groups2 | 6.6 PB | GPFS | |
| Application area | /apps | 335 TB | GPFS | |
| Local scratch area for interactive node | /local | 12 TB/node | XFS | |
| Local scratch area for compute node | /local | 1.5 TB/node | XFS | |
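
As a minimal illustrative sketch (standard library only, not part of the ABCI documentation), the capacity and free space of each area can be checked from a node using the mount points listed above:

```python
# Minimal sketch: show total and free space of the file system areas listed above.
import shutil

for mount in ("/home", "/groups1", "/groups2", "/apps", "/local"):
    try:
        usage = shutil.disk_usage(mount)
        print(f"{mount:10} total {usage.total / 1e12:9.1f} TB, "
              f"free {usage.free / 1e12:9.1f} TB")
    except FileNotFoundError:
        print(f"{mount:10} not mounted on this node")
```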

1.5. System Use Overview

In the ABCI system, all compute nodes and interactive nodes share files through the parallel file system (DDN GRIDScaler). Users log in to the interactive nodes, which serve as the frontend, with SSH tunneling. After logging in to an interactive node, users can develop, compile, and link programs, submit jobs, display job status, and so on. The interactive nodes are not equipped with GPUs, but users can still develop programs for the compute nodes on them. To run a program on the compute nodes, users submit a batch job or an interactive job to the job management system. An interactive job is typically used for debugging a program, or for running an interactive or visualization application.

Warning

Do not run high-load tasks on the interactive nodes, because their computing resources, such as CPU and memory, are shared by many users. To run high-load pre- or post-processing tasks, use the compute nodes instead. Note that high-load tasks run on the interactive nodes will be forcibly terminated.
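
The following is an illustrative sketch of this workflow only; the host names, user name, port, and script name are placeholders, not actual ABCI endpoints, while `qsub`, `qstat`, and `qrsh` are standard Univa Grid Engine commands. It is driven from a local machine in Python for consistency with the other examples.

```python
# Minimal sketch of the workflow described above, driven from a local machine.
# "as.example", "es.example", "user", and "job_script.sh" are placeholders,
# NOT actual ABCI host or file names; qsub/qstat/qrsh are standard
# Univa Grid Engine commands.
import subprocess

# 1) Open an SSH tunnel: local port 10022 is forwarded to an interactive
#    (frontend) node through the access server (run in a separate terminal):
#    ssh -L 10022:es.example:22 user@as.example

# 2) Log in to the frontend through the tunnel and work there: submit a batch
#    job, display its status, or start an interactive job.
frontend = ["ssh", "-p", "10022", "user@localhost"]
subprocess.run(frontend + ["qsub job_script.sh"])   # submit a batch job
subprocess.run(frontend + ["qstat"])                # display job status
subprocess.run(["ssh", "-t", "-p", "10022", "user@localhost", "qrsh"])  # interactive job
```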