Using SHARP
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ is available on the ABCI Compute Node (A).
Using SHARP may improve the performance of collective operations in MPI and machine learning workloads, because it offloads collective operations from the CPU or GPU to the network and reduces data transfer between endpoints.
Using SHARP with NVIDIA NCCL
To use SHARP with NVIDIA NCCL, use the NCCL-SHARP plugin.
ABCI provides the NCCL-SHARP plugin as a module for the Compute Node (A). Which plugin module to load depends on the NCCL version. The following table lists the NCCL versions that correspond to each plugin module.
Note
The NCCL-SHARP plugin is provided on a trial basis; its performance and operation are not guaranteed.
NCCL-SHARP plugin module | NCCL versions |
---|---|
nccl-rdma-sharp-plugins/v2.1.x-5f238fb | 2.8, 2.11 |
nccl-rdma-sharp-plugins/v2.2.x-5e6ed3e | 2.12 |
nccl-rdma-sharp-plugins/v2.5.x-4ccb98a | 2.12, 2.13, 2.14, 2.15, 2.16, 2.17, 2.18, 2.19 |
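
The correspondence in the table can be encoded as a small lookup, which may be convenient in scripts that switch between NCCL versions. The following is a minimal sketch, not something ABCI provides: the function name `sharp_plugin_for_nccl` is hypothetical, and for NCCL 2.12, which is listed for two plugin modules, the sketch arbitrarily picks the v2.2.x module.

```shell
# Hypothetical helper (not provided by ABCI): print the NCCL-SHARP plugin
# module that corresponds to a given NCCL version, following the table above.
sharp_plugin_for_nccl() {
    case "$1" in
        2.8|2.11)
            echo "nccl-rdma-sharp-plugins/v2.1.x-5f238fb" ;;
        2.12)
            # 2.12 is listed for both v2.2.x and v2.5.x; this sketch picks v2.2.x.
            echo "nccl-rdma-sharp-plugins/v2.2.x-5e6ed3e" ;;
        2.13|2.14|2.15|2.16|2.17|2.18|2.19)
            echo "nccl-rdma-sharp-plugins/v2.5.x-4ccb98a" ;;
        *)
            echo "no NCCL-SHARP plugin listed for NCCL $1" >&2
            return 1 ;;
    esac
}

# Example: look up the plugin module for NCCL 2.8.
sharp_plugin_for_nccl 2.8
```

The printed module name could then be passed to `module load` together with the matching CUDA and NCCL modules.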
To use SHARP with NCCL, load the CUDA, NCCL, and NCCL-SHARP plugin modules, then set the following environment variables:
[username@es-a1 ~] module load cuda/11.2 nccl/2.8 nccl-rdma-sharp-plugins/v2.1.x-5f238fb
- NCCL_COLLNET_ENABLE=1
- SHARP_COLL_LOCK_ON_COMM_INIT=1
- SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD=0
- (Optional) SHARP_COLL_LOG_LEVEL=3
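
In an interactive shell or a job script, these variables can be exported after the modules are loaded. The following is a minimal sketch that uses only the values listed above; the final `env | grep` line is just an illustrative way to confirm what is set.

```shell
# Enable SHARP for NCCL collectives (values as listed above).
export NCCL_COLLNET_ENABLE=1
export SHARP_COLL_LOCK_ON_COMM_INIT=1
export SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD=0
# Optional: set the SHARP log level.
export SHARP_COLL_LOG_LEVEL=3

# Confirm what is set in the current shell.
env | grep -E '^(NCCL_COLLNET_ENABLE|SHARP_COLL_)' | sort
```

When the program is launched with `mpirun`, variables exported in the local shell may still need to be forwarded to the remote ranks, for example with `-x` as in the nccl-tests example.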
Example using nccl-tests
The following is an example of enabling SHARP on NCCL using nccl-tests.
Warning
We have confirmed an issue in the nccl-rdma-sharp-plugins/v2.5.x-4ccb98a module where nccl-tests does not run with NCCL 2.12 through 2.16.
First, download nccl-tests, enable MPI support, and then build.
[username@es-a1 ~] module load hpcx/2.12 cuda/11.2 nccl/2.8
[username@es-a1 ~] git clone https://github.com/NVIDIA/nccl-tests.git -b v2.11.0
[username@es-a1 ~] cd nccl-tests
[username@es-a1 ~] make MPI=1 MPI_HOME=${OMPI_HOME} CUDA_HOME=${CUDA_HOME} NCCL_HOME=${NCCL_HOME}
After building, the binaries are generated under the build directory. Run them using mpirun.
[username@es-a1 ~] qrsh -g group -l rt_AF=2 -l h_rt=01:00:00
[username@a0000 ~] module load hpcx/2.12 cuda/11.2 nccl/2.8 nccl-rdma-sharp-plugins/v2.1.x-5f238fb
[username@a0000 ~] cd nccl-tests
[username@a0000 ~] mpirun -np 16 -map-by ppr:8:node \
-hostfile ${SGE_JOB_HOSTLIST} \
-x UCX_TLS=dc,shm,self \
-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} \
-x NCCL_COLLNET_ENABLE=1 \
-x SHARP_COLL_LOCK_ON_COMM_INIT=1 \
-x SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD=0 \
-x SHARP_COLL_LOG_LEVEL=3 \
./build/all_reduce_perf -b 8 -e 2G -f 2 -g 1 -w 50 -n 50
# nThread 1 nGpus 1 minBytes 8 maxBytes 2147483648 step: 2(factor) warmup iters: 50 iters: 50 validation: 1
#
# Using devices
# Rank 0 Pid 2916721 on a0000 device 0 [0x27] NVIDIA A100-SXM4-40GB
# Rank 1 Pid 2916722 on a0000 device 1 [0x2a] NVIDIA A100-SXM4-40GB
# Rank 2 Pid 2916723 on a0000 device 2 [0x51] NVIDIA A100-SXM4-40GB
# Rank 3 Pid 2916724 on a0000 device 3 [0x57] NVIDIA A100-SXM4-40GB
# Rank 4 Pid 2916725 on a0000 device 4 [0x9e] NVIDIA A100-SXM4-40GB
# Rank 5 Pid 2916726 on a0000 device 5 [0xa4] NVIDIA A100-SXM4-40GB
# Rank 6 Pid 2916727 on a0000 device 6 [0xc7] NVIDIA A100-SXM4-40GB
# Rank 7 Pid 2916728 on a0000 device 7 [0xca] NVIDIA A100-SXM4-40GB
# Rank 8 Pid 3868300 on a0001 device 0 [0x27] NVIDIA A100-SXM4-40GB
# Rank 9 Pid 3868301 on a0001 device 1 [0x2a] NVIDIA A100-SXM4-40GB
# Rank 10 Pid 3868302 on a0001 device 2 [0x51] NVIDIA A100-SXM4-40GB
# Rank 11 Pid 3868303 on a0001 device 3 [0x57] NVIDIA A100-SXM4-40GB
# Rank 12 Pid 3868304 on a0001 device 4 [0x9e] NVIDIA A100-SXM4-40GB
# Rank 13 Pid 3868305 on a0001 device 5 [0xa4] NVIDIA A100-SXM4-40GB
# Rank 14 Pid 3868306 on a0001 device 6 [0xc7] NVIDIA A100-SXM4-40GB
# Rank 15 Pid 3868307 on a0001 device 7 [0xca] NVIDIA A100-SXM4-40GB
[a0000:0:2916721 - context.c:589] INFO job (ID: 2838387367436317) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[a0000:0:2916721 - context.c:759] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
--(snip)--
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 22.54 0.00 0.00 4e-07 24.94 0.00 0.00 4e-07
16 4 float sum 23.66 0.00 0.00 4e-07 24.72 0.00 0.00 1e-07
32 8 float sum 24.62 0.00 0.00 1e-07 23.51 0.00 0.00 1e-07
64 16 float sum 24.10 0.00 0.00 1e-07 23.54 0.00 0.01 1e-07
128 32 float sum 22.98 0.01 0.01 1e-07 22.58 0.01 0.01 1e-07
256 64 float sum 24.35 0.01 0.02 1e-07 24.08 0.01 0.02 1e-07
512 128 float sum 25.48 0.02 0.04 1e-07 25.99 0.02 0.04 1e-07
1024 256 float sum 34.96 0.03 0.05 4e-07 35.66 0.03 0.05 4e-07
2048 512 float sum 35.83 0.06 0.11 4e-07 34.95 0.06 0.11 4e-07
4096 1024 float sum 35.33 0.12 0.22 5e-07 34.38 0.12 0.22 5e-07
8192 2048 float sum 37.07 0.22 0.41 5e-07 35.50 0.23 0.43 5e-07
16384 4096 float sum 39.64 0.41 0.77 5e-07 39.44 0.42 0.78 5e-07
32768 8192 float sum 45.63 0.72 1.35 5e-07 44.35 0.74 1.39 5e-07
65536 16384 float sum 52.22 1.26 2.35 5e-07 50.17 1.31 2.45 5e-07
131072 32768 float sum 63.21 2.07 3.89 5e-07 59.93 2.19 4.10 5e-07
262144 65536 float sum 78.91 3.32 6.23 5e-07 77.77 3.37 6.32 5e-07
524288 131072 float sum 118.5 4.43 8.30 5e-07 117.8 4.45 8.34 5e-07
1048576 262144 float sum 177.0 5.93 11.11 5e-07 174.8 6.00 11.25 5e-07
2097152 524288 float sum 215.2 9.75 18.28 5e-07 215.7 9.72 18.23 5e-07
4194304 1048576 float sum 275.5 15.22 28.55 5e-07 275.3 15.24 28.57 5e-07
8388608 2097152 float sum 387.0 21.67 40.64 5e-07 382.6 21.92 41.11 5e-07
16777216 4194304 float sum 549.8 30.51 57.21 5e-07 548.9 30.56 57.30 5e-07
33554432 8388608 float sum 870.1 38.56 72.31 5e-07 866.8 38.71 72.58 5e-07
67108864 16777216 float sum 1491.4 45.00 84.37 5e-07 1487.8 45.11 84.58 5e-07
134217728 33554432 float sum 2587.4 51.87 97.26 5e-07 2581.4 51.99 97.49 5e-07
268435456 67108864 float sum 5207.5 51.55 96.65 5e-07 5194.4 51.68 96.90 5e-07
536870912 134217728 float sum 9979.3 53.80 100.87 5e-07 9930.5 54.06 101.37 5e-07
1073741824 268435456 float sum 19340 55.52 104.10 5e-07 19335 55.53 104.13 5e-07
2147483648 536870912 float sum 38180 56.25 105.46 5e-07 38163 56.27 105.51 5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 29.0317
#
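
As a rough sanity check on the output above, the busbw column can be recomputed from algbw. For all-reduce, nccl-tests reports busbw as algbw × 2(n − 1)/n, where n is the number of ranks; this convention comes from the nccl-tests performance notes rather than from this page. The following is a minimal sketch for the last row, assuming n = 16 as in the run above.

```shell
# busbw = algbw * 2 * (n - 1) / n for all-reduce (nccl-tests convention).
# algbw (GB/s) is taken from the last row of the output above; n is the rank count.
busbw=$(awk -v algbw=56.25 -v n=16 'BEGIN { printf "%.2f", algbw * 2 * (n - 1) / n }')
echo "busbw = ${busbw} GB/s"
```

This gives about 105.47 GB/s, in line with the 105.46 GB/s reported; the small difference comes from algbw being printed rounded to two decimals.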