NVIDIA NGC

NVIDIA NGC (hereinafter referred to as "NGC") provides Docker images for GPU-optimized deep learning framework containers and HPC application containers and NGC container registry to distribute them. ABCI allows users to execute NGC-provided Docker images easily by using Singularity.

In this page, we will explain the procedure to use Docker images registered in NGC container registry with ABCI.

Prerequisites

NGC Container Registry

Each Docker image of NGC container registry is specified by the following format:

nvcr.io/<namespace>/<repo_name>:<repo_tag>

When using with Singularity, each image is referenced first with the URL schema docker:// as like:

docker://nvcr.io/<namespace>/<repo_name>:<repo_tag>

NGC Website

NGC Website is the portal for browsing the contents of the NGC container registry, generating NGC API keys, and so on.

Most of the docker images provided by the NGC container registry are freely available, but some are 'locked' and required that you have an NGC account and an API key to access them. Below are examples of both cases.

Freely available image: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
Locked image: https://ngc.nvidia.com/catalog/containers/partners:chainer

If you do not have signed in with an NGC account, you can neither see the information such as pull command to use locked images, nor generate an API key.

In the following instructions, we will use freely available images. To use locked images, we will explain later (Using Locked Images).

See NGC Documentation for more details on NGC Website.

Single-node Run

Using TensorFlow as an example, we will explain how to run Docker images provided by NGC container registry.

Identify Image URL

First, you need to find the URL for TensorFlow image via NGC Website.

Open https://ngc.nvidia.com/ with your browser, and input "tensorflow" to the search form "Search Containers". Then, you'll find: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow

In this page, you will see the pull command for using TensorFlow image on Docker:

docker pull nvcr.io/nvidia/tensorflow:21.06-tf1-py3

As we mentioned at NGC Container Registry, when using with Singularity, this image can be specified by the following URL:

docker://nvcr.io/nvidia/tensorflow:21.06-tf1-py3

Build a Singularity image

Build a Singularity image for TensorFlow on the interactive node.

[username@es1 ~]$ module load singularitypro
[username@es1 ~]$ singularity pull docker://nvcr.io/nvidia/tensorflow:21.06-tf1-py3

An image named tensorflow_21.06-tf1-py3.sif will be generated.

Run a Singularity image

Start an interactive job with one full-node and run a sample program cnn_mnist.py.

[username@es1 ~]$ qrsh -g grpname -l rt_F=1 -l h_rt=1:00:00
[username@g0001 ~]$ module load singularitypro
[username@g0001 ~]$ wget https://raw.githubusercontent.com/tensorflow/tensorflow/v1.15.5/tensorflow/examples/tutorials/layers/cnn_mnist.py
[username@g0001 ~]$ singularity run --nv tensorflow_21.06-tf1-py3 python cnn_mnist.py
:
{'accuracy': 0.9703, 'loss': 0.10137254, 'global_step': 20000}

You can do the same thing with a batch job.

#!/bin/sh
#$ -l rt_F=1
#$ -j y
#$ -cwd

source /etc/profile.d/modules.sh
module load singularitypro
wget https://raw.githubusercontent.com/tensorflow/tensorflow/v1.15.5/tensorflow/examples/tutorials/layers/cnn_mnist.py
singularity run --nv tensorflow_21.06-tf1-py3.sif python cnn_mnist.py

Multiple-node Run

Some of NGC container images support multiple-node run with using MPI. TensorFlow image, which you used for Single-node Run, also supports multi-node run.

Identify MPI version

First, check the version of MPI installed into the TensorFlow image.

[username@es1 ~] $ module load singularitypro
[username@es1 ~] $ singularity exec tensorflow_21.06-tf1-py3.sif mpirun --version
mpirun (Open MPI) 4.1.1rc1

Report bugs to http://www.open-mpi.org/community/help/

Next, check the available versions of Open MPI on the ABCI system.

[username@es1 ~] $ module avail openmpi

-------------------- /apps/modules/modulefiles/centos7/mpi ---------------------
openmpi/4.0.5          openmpi/4.1.3(default)

openmpi/4.1.3 module seems to be suitable to run this image. In general, at least the major versions of both MPIs should be the same.

Run a Singularity image with MPI

Start an interative job with two full-nodes, and load required environment modules.

[username@es1 ~]$ qrsh -g grpname -l rt_F=2 -l h_rt=1:00:00
[username@g0001 ~]$ module load singularitypro openmpi/4.1.3

Each full-node has four GPUs, and you have eight GPUs in total. In this case, you run four processes on each full-node in parallel, that means eight processes in total, so as to execute the sample program tensorflow_mnist.py.

[username@g0001 ~]$ wget https://raw.githubusercontent.com/horovod/horovod/v0.22.1/examples/tensorflow/tensorflow_mnist.py
[username@g0001 ~]$ mpirun -np 8 -npernode 4 singularity run --nv tensorflow_21.06-tf1-py3.sif python tensorflow_mnist.py
:
INFO:tensorflow:loss = 0.13635147, step = 30 (0.236 sec)
INFO:tensorflow:loss = 0.16320482, step = 30 (0.236 sec)
INFO:tensorflow:loss = 0.23524982, step = 30 (0.237 sec)
INFO:tensorflow:loss = 0.1300551, step = 30 (0.236 sec)
INFO:tensorflow:loss = 0.10259462, step = 30 (0.237 sec)
INFO:tensorflow:loss = 0.04606852, step = 30 (0.237 sec)
INFO:tensorflow:loss = 0.10536947, step = 30 (0.236 sec)
INFO:tensorflow:loss = 0.09811305, step = 30 (0.237 sec)
INFO:tensorflow:loss = 0.06823079, step = 40 (0.225 sec)
INFO:tensorflow:loss = 0.0671196, step = 40 (0.225 sec)
INFO:tensorflow:loss = 0.1545426, step = 40 (0.225 sec)
INFO:tensorflow:loss = 0.13310829, step = 40 (0.225 sec)
INFO:tensorflow:loss = 0.084449895, step = 40 (0.225 sec)
INFO:tensorflow:loss = 0.10252285, step = 40 (0.225 sec)
INFO:tensorflow:loss = 0.078794435, step = 40 (0.225 sec)
INFO:tensorflow:loss = 0.17852336, step = 40 (0.225 sec)
:

You can do the same thing with a batch job.

#!/bin/sh
#$ -l rt_F=2
#$ -j y
#$ -cwd

source /etc/profile.d/modules.sh
module load singularitypro openmpi/4.0.5
wget https://raw.githubusercontent.com/horovod/horovod/v0.22.1/examples/tensorflow/tensorflow_mnist.py
mpirun -np 8 -npernode 4 singularity run --nv tensorflow_21.06-tf1-py3.sif python tensorflow_mnist.py

Using Locked Images

Using Chainer as an example, we will explain how to run locked Docker images provided by NGC container registry.

Identify Locked Image URL

First, you need to find the URL for Chainer image via NGC Website.

Open https://ngc.nvidia.com/ with your browser, sign in with an NGC account, and input "chainer" to the search form "Search Containers". Then, you'll find: https://ngc.nvidia.com/catalog/containers/partners:chainer

In this page, you will see the pull command for using Chainer image on Docker (you must sign in with an NGC account):

docker pull nvcr.io/partners/chainer:4.0.0b1

When using with Singularity, this image can be specified by the following URL:

docker://nvcr.io/partners/chainer:4.0.0b1

Build a Singularity image for a locked NGC image

To build an image, an NGC API key is required. Follow the following procedure to generate an API key:

Generating Your NGC API Key

Build a Singularity image for Chainer on the interactive node. In this case, you need to set two environment variables, SINGULARITY_DOCKER_USERNAME and SINGULARITY_DOCKER_PASSWORD for downloading images from NGC container registry.

[username@es1 ~]$ module load singularitypro
[username@es1 ~]$ export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
[username@es1 ~]$ export SINGULARITY_DOCKER_PASSWORD=<NGC API Key>
[username@es1 ~]$ singularity pull docker://nvcr.io/partners/chainer:4.0.0b1

An image named chainer_4.0.0b1.sif will be generated.

You can also specify --docker-login option to download images instead of environment variables.

[username@es1 ~]$ module load singularitypro
[username@es1 ~]$ singularity pull --disable-cache --docker-login docker://nvcr.io/partners/chainer:4.0.0b1
Enter Docker Username: $oauthtoken
Enter Docker Password: <NGC API Key>

Run a Singularity image

You can run the resulted image, just as same as freely available images.

[username@es1 ~]$ qrsh -g grpname -l rt_G.small=1 -l h_rt=1:00:00
[username@g0001 ~]$ module load singularitypro
[username@g0001 ~]$ wget https://raw.githubusercontent.com/chainer/chainer/v4.0.0b1/examples/mnist/train_mnist.py
[username@g0001 ~]$ singularity exec --nv chainer_4.0.0b1.sif python train_mnist.py -g 0
:
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           0.191976    0.0931192             0.942517       0.9712            18.7328
2           0.0755601   0.0837004             0.9761         0.9737            20.6419
3           0.0496073   0.0689045             0.984266       0.9802            22.5383
4           0.0343888   0.0705739             0.988798       0.9796            24.4332
:

NVIDIA NGC

Prerequisites

NGC Container Registry

NGC Website

Single-node Run

Identify Image URL

Build a Singularity image

Run a Singularity image

Multiple-node Run

Identify MPI version

Run a Singularity image with MPI

Using Locked Images

Identify Locked Image URL

Build a Singularity image for a locked NGC image

Run a Singularity image

Reference