Storage
ABCI provides the following types of storage.
Tips
Storage areas other than Local Storage, such as the Home Area and Group Area, are resources shared by all users. Excessive I/O load or unnecessary access not only inconveniences other users but can also slow down the execution of your own jobs. Please keep the following points in mind when using each storage area.
- For data that does not require persistence, such as intermediate data, we recommend keeping it in memory rather than creating files.
- Proactively use scratch areas, which can be accessed at high speed. Files that will be accessed many times during job execution should be staged (temporarily copied) to the local scratch.
- Avoid creating and accessing large numbers of small files on a shared file system. Use scratch space instead, or combine multiple files into one larger file before accessing them, for example with HDF5 or WebDataset.
- Refrain from repeatedly and unnecessarily opening and closing the same file within a single job.
- Please consult us in advance if you intend to create more than a hundred million files in a short period of time.
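The "combine small files" advice above can be sketched with standard tools alone: many small files are packed into a single tar archive, and individual members are read back from it without scattering thousands of files across the shared file system. This is a generic illustration, not an ABCI-specific feature.

```shell
# Combine many small files into one archive so the shared
# file system sees a single large file instead of thousands.
mkdir -p samples
for i in $(seq 1 100); do
    echo "record $i" > "samples/file_$i.txt"
done
tar -cf samples.tar samples

# Read an individual member directly from the archive to
# stdout (-O) without extracting everything.
tar -xOf samples.tar samples/file_42.txt
```

For heavy workloads, formats such as HDF5 or WebDataset serve the same purpose with random access and metadata support.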
Home Area
Home area is the disk area of the Lustre file system shared by interactive and compute nodes, and is available to all ABCI users by default. The disk quota is limited to 2TiB.
[Advanced Option] File Striping
The home area is provided by the Lustre file system, which distributes and stores file data across multiple disks. In the home area, you can choose between two distribution methods: Round-Robin (the default) and Striping.
Tips
See Configuring Lustre File Striping for an overview of file striping feature.
How to Set Up File Striping
$ lfs setstripe [options] <dirname | filename>
Striping is configured with the lfs setstripe command, which lets you specify the stripe pattern used to distribute the data (stripe size, stripe count, and start OST).
| Option | Description |
|---|---|
| -S | Sets a stripe size. -S #k, -S #m or -S #g option sets the size to KiB, MiB or GiB respectively. |
| -i | Specifies the start OST index to which a file is written. If -1 is set, the start OST is randomly selected. |
| -c | Sets a stripe count. If -1 is set, all available OSTs are written. |
Tips
To display OST indices, use the lfs df or lfs osts command.
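With a stripe size S and a stripe count C, the chunk holding byte offset B lands on stripe index (B / S) mod C. The following sketch of this arithmetic uses the same 1 MiB stripe size and stripe count of 4 as the examples in this section; the actual OST numbers additionally depend on the start index given with -i.

```shell
# Map byte offsets to stripe indices for a 1 MiB stripe size
# and a stripe count of 4.
stripe_size=$((1024 * 1024))
stripe_count=4

for offset in 0 1048576 4194304 5242880; do
    idx=$(( (offset / stripe_size) % stripe_count ))
    echo "offset $offset -> stripe index $idx"
done
```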
Example) Set a stripe pattern #1. (Creating a new file with a specific stripe pattern.)
[username@login1 work]$ lfs setstripe -S 1m -i 4 -c 4 stripe-file
[username@login1 work]$ ls
stripe-file
Example) Set a stripe pattern #2. (Setting up a stripe pattern to a directory.)
[username@login1 work]$ mkdir stripe-dir
[username@login1 work]$ lfs setstripe -S 1m -i 4 -c 4 stripe-dir
How to Display File Striping Settings
To display the stripe pattern of a specified file or directory, use the lfs getstripe command.
$ lfs getstripe <dirname | filename>
Example) Display stripe settings #1. (Displaying the stripe pattern of a file.)
[username@login1 work]$ lfs getstripe stripe-file
stripe-file
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 4
obdidx objid objid group
4 9161985 0x8bcd01 0x500000406
5 9162113 0x8bcd81 0x540000402
6 9161761 0x8bcc21 0x580000407
7 9162113 0x8bcd81 0x5c0000402
Example) Display stripe settings #2. (Displaying the stripe pattern of a directory.)
[username@login1 work]$ lfs getstripe stripe-dir
stripe-dir
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 4
obdidx objid objid group
4 9161986 0x8bcd02 0x500000406
5 9162114 0x8bcd82 0x540000402
6 9161762 0x8bcc22 0x580000407
7 9162114 0x8bcd82 0x5c0000402
Group Area
Group area is the disk area of the Lustre file system shared by interactive and compute nodes. To use the Group area, the "User Administrator" of the group needs to apply for "Add group disk" via the ABCI User Portal. Regarding how to add a group disk, please refer to Disk Addition Request in the ABCI Portal Guide.
To find the path to your group area, use the show_quota command. For details, see Checking Disk Quota.
How to check inode usage
The MDT stores inode information for files, but there is an upper limit on the number of inodes that can be stored per MDT.
You can see how many inodes are currently in use on each MDT with lfs df -i.
The IUse% entry in each /groups[MDT:?] line of the output is the percentage of inodes used on that MDT.
In the following example, the inode utilization for MDT:0 is 12%.
[username@login1 ~]$ lfs df -i /groups
UUID Inodes IUsed IFree IUse% Mounted on
groups-MDT0000_UUID 3110850464 353856623 2756993841 12% /groups[MDT:0]
groups-MDT0001_UUID 3110850464 378826453 2732024011 13% /groups[MDT:1]
groups-MDT0002_UUID 3110850464 440177041 2670673423 15% /groups[MDT:2]
groups-MDT0003_UUID 3110850464 359182112 2751668352 12% /groups[MDT:3]
groups-MDT0004_UUID 3110850464 363397094 2747453370 12% /groups[MDT:4]
groups-MDT0005_UUID 3110850464 393711820 2717138644 13% /groups[MDT:5]
(snip)
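The IUse% column is simply IUsed divided by Inodes, rounded up to a whole percent (the rounding direction is an assumption based on typical df behavior). Since lfs is only available on ABCI nodes, this awk sketch recomputes the percentage from a saved copy of the output above:

```shell
# Save a copy of the first lines of `lfs df -i /groups` output.
cat > lfs_df_i.txt <<'EOF'
UUID Inodes IUsed IFree IUse% Mounted on
groups-MDT0000_UUID 3110850464 353856623 2756993841 12% /groups[MDT:0]
groups-MDT0001_UUID 3110850464 378826453 2732024011 13% /groups[MDT:1]
EOF

# Recompute inode utilization per MDT: ceil(100 * IUsed / Inodes).
awk 'NR > 1 {
    p = 100 * $3 / $2
    c = int(p); if (p > c) c++
    printf "%s %d%%\n", $6, c
}' lfs_df_i.txt
```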
Local Storage
In the ABCI system, each compute node has two 7 TB NVMe SSDs, which are combined and mounted as a single RAID 0 volume. This storage can be used in the following ways:
- As a local scratch area on a node (Local scratch, Persistent local scratch (Reserved only)).
- As a distributed shared file system built from the NVMe storage of multiple compute nodes (BeeOND storage).
- As a cache region for the Lustre client (Hot Nodes).
Local scratch
Local storage on compute nodes is available as a local scratch without specifying any special options at job submission.
Note that the amount of the local storage you can use is determined by "Resource type". For more detail on "Resource type", please refer to Job Execution Resource.
The local storage path is different for each job; you can access it via the environment variable PBS_LOCALDIR.
Example) sample of job script (use_local_storage.sh)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=1
echo test1 > $PBS_LOCALDIR/foo.txt
echo test2 > $PBS_LOCALDIR/bar.txt
cp -rp $PBS_LOCALDIR/foo.txt $HOME/test/foo.txt
Example) Submitting a job
[username@login1 ~]$ qsub -g grpname use_local_storage.sh
Example) Status after execution of use_local_storage.sh
[username@login1 ~]$ ls $HOME/test/
foo.txt <- The file remains only because it was explicitly copied in the script.
Warning
The files stored under the $PBS_LOCALDIR directory are removed when the job finishes. Required files must be copied to the Home area or Group area within the job script, for example using the cp command.
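Because scratch contents vanish when the job ends, a defensive pattern is to copy results out with a shell EXIT trap, so they survive even if a later step fails. The sketch below runs the "job body" in a subshell so it can be tried outside a real job; in an actual job script the trap would simply go near the top, with $PBS_LOCALDIR set by the scheduler.

```shell
# Destination that outlives the scratch area (stands in for
# a Home or Group area path here).
RESULT_DIR=$(mktemp -d)

# Run the job body in a subshell; its EXIT trap copies results
# out of the scratch area however the body ends.
(
    PBS_LOCALDIR=$(mktemp -d)   # stand-in for the real scratch path
    trap 'cp -rp "$PBS_LOCALDIR"/*.txt "$RESULT_DIR"/' EXIT
    echo "partial result" > "$PBS_LOCALDIR/out.txt"
    # ... main processing would go here ...
)

cat "$RESULT_DIR/out.txt"
```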
BeeGFS storage
By using BeeGFS On Demand (BeeOND), you can aggregate the local storage attached to the compute nodes on which your job is running into a temporary distributed shared file system.
To use BeeOND, you need to submit the job with the -v BEEOND_ON=1 option.
You also need to specify the -q rt_HF option in this case, because nodes must be exclusively allocated to the job.
The distributed shared file system region will be created in /beeond.
Example) sample of job script (use_beeond.sh)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=2
#PBS -v BEEOND_ON=1
echo test1 > /beeond/foo.txt
echo test2 > /beeond/bar.txt
cp -rp /beeond/foo.txt $HOME/test/foo.txt
Example) Submitting a job
[username@login1 ~]$ qsub use_beeond.sh
Example) Status after execution of use_beeond.sh
[username@login1 ~]$ ls $HOME/test/
foo.txt <- The file remains only because it was explicitly copied in the script.
Warning
The files stored under the /beeond directory are removed when the job finishes. Required files must be copied to the Home area or Group area within the job script, for example using the cp command.
BeeGFS allows data to be staged into and out of the BeeOND storage in parallel using the beeond-cp command. To use beeond-cp, specify the -v USE_SSH=1 option at job submission to enable SSH login to the compute nodes.
Example) sample of job script (use beeond-cp)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=2
#PBS -v BEEOND_ON=1,USE_SSH=1
export src_dir=$HOME/data
beeond-cp stagein -n ${PBS_NODEFILE} -g ${src_dir} -l /beeond
(main process)
beeond-cp stageout -l /beeond
[Advanced Option] Configure BeeOND Servers
A BeeOND file system partition consists of two kinds of services running on compute nodes: the storage service, which stores file data, and the metadata service, which stores file metadata. We refer to a compute node running the storage service as a storage server, and one running the metadata service as a metadata server. Users can specify the number of storage servers and metadata servers.
The default numbers of metadata and storage servers are as follows.
| Parameter | Default |
|---|---|
| Count of metadata servers | 1 |
| Count of storage servers | Number of nodes requested by a job |
To change these counts, define the following environment variables. They must be defined at job submission; changing them inside the job script has no effect. When the number of servers is less than the number of requested nodes, servers are selected lexicographically by name from the assigned compute nodes.
| Environment Variable | Description |
|---|---|
| BEEOND_METADATA_SERVER | Count of metadata servers in integer |
| BEEOND_STORAGE_SERVER | Count of storage servers in integer |
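The lexicographic selection rule above can be illustrated with a short sketch: sort the assigned node names and take the first N. The node names here are hypothetical, not real ABCI hosts.

```shell
# Hypothetical node list for a 4-node job (order as assigned).
cat > nodes.txt <<'EOF'
hnode007
hnode004
hnode006
hnode005
EOF

# With BEEOND_METADATA_SERVER=2, the two lexicographically
# smallest node names would run the metadata service.
sort nodes.txt | head -n 2
```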
The following example creates a BeeOND partition with two metadata servers and four storage servers. The beegfs node list command is used to check the configuration.
Example) sample of job script (use_beeond.sh)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=4
#PBS -v BEEOND_ON=1,USE_SSH=1,BEEOND_METADATA_SERVER=2,BEEOND_STORAGE_SERVER=4
beegfs node list
Example output
ID TYPE ALIAS
c:1 client c1F4461-69E83A4A-hnode004
c:2 client c1E8E89-69E83A4A-hnode007
c:3 client c1ECE95-69E83A4A-hnode005
c:4 client c1EB2BE-69E83A4A-hnode006
m:1 meta node_meta_1
m:2 meta node_meta_2
s:1 storage node_storage_1
s:2 storage node_storage_2
s:3 storage node_storage_3
s:4 storage node_storage_4
mg:1 management management
[Advanced Option] File Striping
In a BeeOND partition, files are split into small chunks and stored on multiple storage servers. Users can change the file striping configuration of BeeOND.
The default configuration of the file striping is as follows.
| Parameter | Default | Description |
|---|---|---|
| Stripe size | 512 KB | File chunk size |
| Stripe count | 4 | Number of storage servers that store chunks of a file |
Users can configure file striping per file or per directory using the beegfs entry set command.
The following example configures the /beeond/data directory with a stripe count of 6 and a stripe size of 4 MB.
Example) sample of job script (use_beeond.sh)
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=4
#PBS -v BEEOND_ON=1
BEEOND_DIR=/beeond/data
mkdir ${BEEOND_DIR}
beegfs entry set --num-targets=6 --chunk-size=4MiB ${BEEOND_DIR}
beegfs entry info --retro --verbose ${BEEOND_DIR}
Output example
Entry type: directory
EntryID: 1-69E83E72-1
ParentID: root
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 4M
+ Number of storage targets: desired: 6
+ Storage Pool: 1 (storage_pool_default)
Inlined inode: no
Dentry info:
+ Path: 38/51/root
+ Metadata node: node_meta_1 [ID: 1]
Inode info:
+ Path: 74/55/1-69E83E72-1
+ Metadata node: node_meta_1 [ID: 1]
Hot Nodes
The ABCI system provides the Hot Nodes functionality of the Lustre file system.
When Hot Nodes is enabled, the local storage (/local) is used as a transparent cache for the group area (/groups).
Note
Currently, only READ cache is provided.
To enable Hot Nodes, please specify the -v HOTNODES_ON=1 option when submitting the job.
Hot Nodes requires node occupancy, so please specify the -q rt_HF option as well.
Also, Hot Nodes and BeeOND are mutually exclusive.
The environment variables available at job submission are the following.
| Environment variable | Description | Default | Available value |
|---|---|---|---|
| HOTNODES_ON | Enable Hot Nodes | 0 | 0 (disable), 1 (enable) |
| HOTNODES_HIGH | Cache eviction will start when the local storage disk usage exceeds this threshold. | 90 | 2-100(%) |
| HOTNODES_LOW | Cache eviction will be halted when the local storage disk usage drops below this threshold. | 75 | 1-99(%) |
| HOTNODES_INTERVAL | This is the interval at which the local storage usage is checked. | 30 | 1-3600(seconds) |
Info
The remaining (100 - HOTNODES_HIGH) percent of the local storage is used as the user scratch region.
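For example, with the roughly 14 TB of raw local storage described earlier (two 7 TB NVMe SSDs in RAID 0; assuming the full capacity is usable) and the default thresholds, the cache and scratch split works out as follows:

```shell
local_tb=14          # two 7 TB NVMe SSDs in RAID 0
high=90              # default HOTNODES_HIGH (%)
low=75               # default HOTNODES_LOW (%)

# Eviction starts above high% usage and stops below low%;
# the remaining (100 - high)% is the user scratch region.
evict_start=$(awk "BEGIN { print $local_tb * $high / 100 }")
evict_stop=$(awk "BEGIN { print $local_tb * $low / 100 }")
scratch=$(awk "BEGIN { print $local_tb * (100 - $high) / 100 }")
echo "eviction starts above ${evict_start} TB"
echo "eviction stops below ${evict_stop} TB"
echo "user scratch region: ${scratch} TB"
```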
The following example runs I/O with IOR into /groups/grpname with Hot Nodes enabled.
#!/bin/bash
#PBS -P grpname
#PBS -q rt_HF
#PBS -l select=1:mpiprocs=8
#PBS -v HOTNODES_ON=1
systemctl is-active lpcc
cd /groups/grpname/
module purge
module load hpcx
mpirun ./ior -a MPIIO -k