Storage
ABCI provides the following types of storage.
Tips
Storage areas other than Local Storage, such as the Home Area and Group Area, are resources shared by all users. Excessive I/O load or unnecessary access not only inconveniences other users but can also slow down the execution of your own jobs. Please keep the following points in mind when using each storage area.
- For data that does not require persistence, such as intermediate data, we recommend keeping it in memory rather than creating files.
- Proactively use scratch areas, which can be accessed at high speed. Files that will be accessed many times during job execution should be staged (temporarily copied) to the local scratch.
- Avoid creating and accessing large numbers of small files on a shared file system. Use scratch space instead, or combine multiple files into one larger file before accessing them, for example with HDF5 or WebDataset.
- Refrain from repeatedly and unnecessarily opening and closing the same file within a single job.
- Please consult us in advance if you intend to create more than a hundred million files in a short period of time.
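The "combine small files" advice above can be sketched with standard tools alone: many small files are packed into a single tar archive, and individual members are read back from it without scattering thousands of files across the shared file system. This is a generic illustration, not an ABCI-specific feature.

```shell
# Combine many small files into one archive so the shared
# file system sees a single large file instead of thousands.
mkdir -p samples
for i in $(seq 1 100); do
    echo "record $i" > "samples/file_$i.txt"
done
tar -cf samples.tar samples

# Read an individual member directly from the archive to
# stdout (-O) without extracting everything.
tar -xOf samples.tar samples/file_42.txt
```

For heavy workloads, formats such as HDF5 or WebDataset serve the same purpose with random access and metadata support.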
Home Area
Home area is the disk area of the Lustre file system shared by interactive and compute nodes, and is available to all ABCI users by default. The disk quota is limited to 2TiB.
[Advanced Option] File Striping
The home area is provided by the Lustre file system, which distributes and stores file data across multiple disks. In the home area, you can choose between two distribution methods: Round-Robin (the default) and Striping.
Tips
See Configuring Lustre File Striping for an overview of file striping feature.
How to Set Up File Striping
$ lfs setstripe [options] <dirname | filename>
Striping is configured with the lfs setstripe command, which lets you specify the stripe pattern used to distribute the data (stripe size, stripe count, and start OST).
| Option | Description |
|---|---|
| -S | Sets a stripe size. -S #k, -S #m or -S #g option sets the size to KiB, MiB or GiB respectively. |
| -i | Specifies the start OST index to which a file is written. If -1 is set, the start OST is randomly selected. |
| -c | Sets a stripe count. If -1 is set, all available OSTs are written. |
Tips
To display OST indices, use the lfs df or lfs osts command.
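With a stripe size S and a stripe count C, the chunk holding byte offset B lands on stripe index (B / S) mod C. The following sketch of this arithmetic uses the same 1 MiB stripe size and stripe count of 4 as the examples in this section; the actual OST numbers additionally depend on the start index given with -i.

```shell
# Map byte offsets to stripe indices for a 1 MiB stripe size
# and a stripe count of 4.
stripe_size=$((1024 * 1024))
stripe_count=4

for offset in 0 1048576 4194304 5242880; do
    idx=$(( (offset / stripe_size) % stripe_count ))
    echo "offset $offset -> stripe index $idx"
done
```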
Example) Set a stripe pattern #1. (Creating a new file with a specific stripe pattern.)
[username@login1 work]$ lfs setstripe -S 1m -i 4 -c 4 stripe-file
[username@login1 work]$ ls
stripe-file
Example) Set a stripe pattern #2. (Setting up a stripe pattern to a directory.)
[username@login1 work]$ mkdir stripe-dir
[username@login1 work]$ lfs setstripe -S 1m -i 4 -c 4 stripe-dir
How to Display File Striping Settings
To display the stripe pattern of a specified file or directory, use the lfs getstripe command.
$ lfs getstripe <dirname | filename>
Example) Display stripe settings #1. (Displaying the stripe pattern of a file.)
[username@login1 work]$ lfs getstripe stripe-file
stripe-file
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 4
obdidx objid objid group
4 9161985 0x8bcd01 0x500000406
5 9162113 0x8bcd81 0x540000402
6 9161761 0x8bcc21 0x580000407
7 9162113 0x8bcd81 0x5c0000402
Example) Display stripe settings #2. (Displaying the stripe pattern of a directory.)
[username@login1 work]$ lfs getstripe stripe-dir
stripe-dir
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 4
obdidx objid objid group
4 9161986 0x8bcd02 0x500000406
5 9162114 0x8bcd82 0x540000402
6 9161762 0x8bcc22 0x580000407
7 9162114 0x8bcd82 0x5c0000402
Group Area
Group area is the disk area of the Lustre file system shared by interactive and compute nodes. To use the Group area, the "User Administrator" of the group needs to apply for "Add group disk" via the ABCI User Portal. Regarding how to add a group disk, please refer to Disk Addition Request in the ABCI Portal Guide.
To find the path to your group area, use the show_quota command. For details, see Checking Disk Quota.
How to check inode usage
The MDT stores inode information for files, but there is an upper limit on the number of inodes that can be stored per MDT.
You can see how many inodes are currently in use on each MDT with lfs df -i.
The IUse% entry in each /groups[MDT:?] line of the output is the percentage of inodes used on that MDT.
In the following example, the inode utilization for MDT:0 is 12%.
[username@login1 ~]$ lfs df -i /groups
UUID Inodes IUsed IFree IUse% Mounted on
groups-MDT0000_UUID 3110850464 353856623 2756993841 12% /groups[MDT:0]
groups-MDT0001_UUID 3110850464 378826453 2732024011 13% /groups[MDT:1]
groups-MDT0002_UUID 3110850464 440177041 2670673423 15% /groups[MDT:2]
groups-MDT0003_UUID 3110850464 359182112 2751668352 12% /groups[MDT:3]
groups-MDT0004_UUID 3110850464 363397094 2747453370 12% /groups[MDT:4]
groups-MDT0005_UUID 3110850464 393711820 2717138644 13% /groups[MDT:5]
(snip)
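The IUse% column is simply IUsed divided by Inodes, rounded up to a whole percent (the rounding direction is an assumption based on typical df behavior). Since lfs is only available on ABCI nodes, this awk sketch recomputes the percentage from a saved copy of the output above:

```shell
# Save a copy of the first lines of `lfs df -i /groups` output.
cat > lfs_df_i.txt <<'EOF'
UUID Inodes IUsed IFree IUse% Mounted on
groups-MDT0000_UUID 3110850464 353856623 2756993841 12% /groups[MDT:0]
groups-MDT0001_UUID 3110850464 378826453 2732024011 13% /groups[MDT:1]
EOF

# Recompute inode utilization per MDT: ceil(100 * IUsed / Inodes).
awk 'NR > 1 {
    p = 100 * $3 / $2
    c = int(p); if (p > c) c++
    printf "%s %d%%\n", $6, c
}' lfs_df_i.txt
```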
Local Storage
In the ABCI system, each compute node has two 7 TB NVMe SSDs, which are combined and mounted as a single RAID 0 volume. This storage can be used in the following ways:
- As a local scratch area on a node (Local scratch, Persistent local scratch (Reserved only)).
- As a distributed shared file system built from the NVMe storage of multiple compute nodes (BeeOND storage).
- As a cache region for the Lustre client (Hot Nodes).
Local scratch
Local storage on compute nodes is available as a local scratch without specifying any special options at job submission.
Note that the amount of the local storage you can use is determined by "Resource type". For more detail on "Resource type", please refer to Job Execution Resource.
The local storage path is different for each job; you can access it via the environment variable PBS_LOCALDIR.
Example) sample of job script (use_local_storage.sh)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=1
echo test1 > $PBS_LOCALDIR/foo.txt
echo test2 > $PBS_LOCALDIR/bar.txt
cp -rp $PBS_LOCALDIR/foo.txt $HOME/test/foo.txt
Example) Submitting a job
[username@login1 ~]$ qsub -g grpname use_local_storage.sh
Example) Status after execution of use_local_storage.sh
[username@login1 ~]$ ls $HOME/test/
foo.txt <- The file remains only because it was explicitly copied in the script.
Warning
The files stored under the $PBS_LOCALDIR directory are removed when the job finishes. Required files must be copied to the Home area or Group area within the job script, for example using the cp command.
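Because scratch contents vanish when the job ends, a defensive pattern is to copy results out with a shell EXIT trap, so they survive even if a later step fails. The sketch below runs the "job body" in a subshell so it can be tried outside a real job; in an actual job script the trap would simply go near the top, with $PBS_LOCALDIR set by the scheduler.

```shell
# Destination that outlives the scratch area (stands in for
# a Home or Group area path here).
RESULT_DIR=$(mktemp -d)

# Run the job body in a subshell; its EXIT trap copies results
# out of the scratch area however the body ends.
(
    PBS_LOCALDIR=$(mktemp -d)   # stand-in for the real scratch path
    trap 'cp -rp "$PBS_LOCALDIR"/*.txt "$RESULT_DIR"/' EXIT
    echo "partial result" > "$PBS_LOCALDIR/out.txt"
    # ... main processing would go here ...
)

cat "$RESULT_DIR/out.txt"
```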
BeeGFS storage
By using BeeGFS On Demand (BeeOND), you can aggregate the local storage attached to the compute nodes on which your job is running into a temporary distributed shared file system.
To use BeeOND, you need to submit the job with the -v BEEOND_ON=1 option.
You also need to specify the -q rt_HF option in this case, because nodes must be exclusively allocated to the job.
The distributed shared file system region will be created in /beeond.
Example) sample of job script (use_beeond.sh)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=2
#PBS -v BEEOND_ON=1
echo test1 > /beeond/foo.txt
echo test2 > /beeond/bar.txt
cp -rp /beeond/foo.txt $HOME/test/foo.txt
Example) Submitting a job
[username@login1 ~]$ qsub use_beeond.sh
Example) Status after execution of use_beeond.sh
[username@login1 ~]$ ls $HOME/test/
foo.txt <- The file remains only because it was explicitly copied in the script.
Warning
The files stored under the /beeond directory are removed when the job finishes. Required files must be copied to the Home area or Group area within the job script, for example using the cp command.
BeeGFS allows data to be staged into and out of the BeeOND storage in parallel using the beeond-cp command. To use beeond-cp, specify the -v USE_SSH=1 option at job submission to enable SSH login to the compute nodes.
Example) sample of job script (use beeond-cp)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=2
#PBS -v BEEOND_ON=1,USE_SSH=1
export src_dir=$HOME/data
beeond-cp stagein -n ${PBS_NODEFILE} -g ${src_dir} -l /beeond
(main process)
beeond-cp stageout -l /beeond
[Advanced Option] Configure BeeOND Servers
A BeeOND file system partition consists of two kinds of services running on compute nodes: the storage service, which stores file data, and the metadata service, which stores file metadata. We refer to a compute node running the storage service as a storage server, and one running the metadata service as a metadata server. Users can specify the number of storage servers and metadata servers.
The default numbers of metadata and storage servers are as follows.
| Parameter | Default |
|---|---|
| Count of metadata servers | 1 |
| Count of storage servers | Number of nodes requested by a job |
To change these counts, define the following environment variables. They must be defined at job submission; changing them inside the job script has no effect. When the number of servers is less than the number of requested nodes, servers are selected lexicographically by name from the assigned compute nodes.
| Environment Variable | Description |
|---|---|
| BEEOND_METADATA_SERVER | Count of metadata servers in integer |
| BEEOND_STORAGE_SERVER | Count of storage servers in integer |
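The lexicographic selection rule above can be illustrated with a short sketch: sort the assigned node names and take the first N. The node names here are hypothetical, not real ABCI hosts.

```shell
# Hypothetical node list for a 4-node job (order as assigned).
cat > nodes.txt <<'EOF'
hnode007
hnode004
hnode006
hnode005
EOF

# With BEEOND_METADATA_SERVER=2, the two lexicographically
# smallest node names would run the metadata service.
sort nodes.txt | head -n 2
```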
The following example creates a BeeOND partition with two metadata servers and four storage servers. The beegfs node list command is used to check the configuration.
Example) sample of job script (use_beeond.sh)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=4
#PBS -v BEEOND_ON=1,USE_SSH=1,BEEOND_METADATA_SERVER=2,BEEOND_STORAGE_SERVER=4
beegfs node list
Example output
ID TYPE ALIAS
c:1 client c1F4461-69E83A4A-hnode004
c:2 client c1E8E89-69E83A4A-hnode007
c:3 client c1ECE95-69E83A4A-hnode005
c:4 client c1EB2BE-69E83A4A-hnode006
m:1 meta node_meta_1
m:2 meta node_meta_2
s:1 storage node_storage_1
s:2 storage node_storage_2
s:3 storage node_storage_3
s:4 storage node_storage_4
mg:1 management management
[Advanced Option] File Striping
In a BeeOND partition, files are split into small chunks and stored on multiple storage servers. Users can change the file striping configuration of BeeOND.
The default configuration of the file striping is as follows.
| Parameter | Default | Description |
|---|---|---|
| Stripe size | 512 KB | File chunk size |
| Stripe count | 4 | Number of storage servers that store chunks of a file |
Users can configure file striping per file or per directory using the beegfs entry set command.
The following example configures the /beeond/data directory with a stripe count of 6 and a stripe size of 4 MB.
Example) sample of job script (use_beeond.sh)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=4
#PBS -v BEEOND_ON=1
BEEOND_DIR=/beeond/data
mkdir ${BEEOND_DIR}
beegfs entry set --num-targets=6 --chunk-size=4MiB ${BEEOND_DIR}
beegfs entry info --retro --verbose ${BEEOND_DIR}
Output example
Entry type: directory
EntryID: 1-69E83E72-1
ParentID: root
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 4M
+ Number of storage targets: desired: 6
+ Storage Pool: 1 (storage_pool_default)
Inlined inode: no
Dentry info:
+ Path: 38/51/root
+ Metadata node: node_meta_1 [ID: 1]
Inode info:
+ Path: 74/55/1-69E83E72-1
+ Metadata node: node_meta_1 [ID: 1]
Hot Nodes
The ABCI system provides the Hot Nodes functionality of the Lustre file system.
When Hot Nodes is enabled, the local storage (/local) is used as a transparent cache for the group area (/groups).
Note
Currently, only READ cache is provided.
To enable Hot Nodes, please specify the -v HOTNODES_ON=1 option when submitting the job.
Hot Nodes requires node occupancy, so please specify the -q rt_HF option as well.
Also, Hot Nodes and BeeOND are mutually exclusive.
The environment variables available at job submission are the following.
| Environment variable | Description | Default | Available value |
|---|---|---|---|
| HOTNODES_ON | Enable Hot Nodes | 0 | 0 (disable), 1 (enable) |
| HOTNODES_HIGH | Cache eviction will start when the local storage disk usage exceeds this threshold. | 90 | 2-100(%) |
| HOTNODES_LOW | Cache eviction will be halted when the local storage disk usage drops below this threshold. | 75 | 1-99(%) |
| HOTNODES_INTERVAL | This is the interval at which the local storage usage is checked. | 30 | 1-3600(seconds) |
Info
The remaining (100 - HOTNODES_HIGH) percent of the local storage is used as the user scratch region.
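For example, with the roughly 14 TB of raw local storage described earlier (two 7 TB NVMe SSDs in RAID 0; assuming the full capacity is usable) and the default thresholds, the cache and scratch split works out as follows:

```shell
local_tb=14          # two 7 TB NVMe SSDs in RAID 0
high=90              # default HOTNODES_HIGH (%)
low=75               # default HOTNODES_LOW (%)

# Eviction starts above high% usage and stops below low%;
# the remaining (100 - high)% is the user scratch region.
evict_start=$(awk "BEGIN { print $local_tb * $high / 100 }")
evict_stop=$(awk "BEGIN { print $local_tb * $low / 100 }")
scratch=$(awk "BEGIN { print $local_tb * (100 - $high) / 100 }")
echo "eviction starts above ${evict_start} TB"
echo "eviction stops below ${evict_stop} TB"
echo "user scratch region: ${scratch} TB"
```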
The following example runs I/O with IOR into /groups/grpname with Hot Nodes enabled.
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=1:mpiprocs=8
#PBS -v HOTNODES_ON=1
systemctl is-active lpcc
cd /groups/grpname/
module purge
module load hpcx
mpirun ./ior -a MPIIO -k