Start TensorFlow containers
In a basic bare-metal deployment of TensorFlow and MPI, all software must be installed on each node. MPI then uses SSH to connect to each node to start the TensorFlow application processes.
In the world of Docker containers, the setup becomes somewhat more complex, but dependency management becomes significantly easier. On each server, a single Docker container is launched with an SSH daemon that listens on the custom port 2222. This container also has TensorFlow, Open MPI, and the NVIDIA libraries and tools. You can then use docker exec to run the mpirun command in one of these containers, and MPI connects to the Docker containers on all other servers through SSH on port 2222.
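For orientation, the end-to-end pattern covered step by step in the rest of this section looks like this (the container name tf and the worker hostnames match the examples below):
docker exec -it tf bash                                             # enter the container on the primary server
mpirun --allow-run-as-root -np 2 -H dl-worker-01,dl-worker-02 hostname   # MPI reaches the peers over SSH port 2222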
First, a custom Docker image is created using the following Dockerfile.
# Build with: docker build -t user/tensorflow:20.12-tf2-py3 .
FROM nvcr.io/nvidia/tensorflow:20.12-tf2-py3

# Install SSH and various utilities.
RUN apt-get update && apt-get install -y --no-install-recommends \
        openssh-client \
        openssh-server \
        lsof \
    && \
    rm -rf /var/lib/apt/lists/*

# Configure SSHD for MPI.
RUN mkdir -p /var/run/sshd /root/.ssh && \
    echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config && \
    echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/^#*Port 22/Port 2222/' /etc/ssh/sshd_config && \
    echo "Host *" >> /root/.ssh/config && \
    echo "Port 2222" >> /root/.ssh/config && \
    ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa -N "" && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys && \
    chmod 700 /root/.ssh && \
    chmod 600 /root/.ssh/*

# Install Python libraries.
RUN pip install ConfigArgParse

WORKDIR /scripts
EXPOSE 2222
As you can see, this Dockerfile is based on the NVIDIA GPU Cloud (NGC) TensorFlow image.
Run the following command to build the Docker image. Replace user with your NGC ID or Docker ID, or with the registry host:port if you are using an on-premises container registry.
docker build -t user/tensorflow:20.12-tf2-py3 .
During the build process, a new RSA key pair is randomly generated and stored in the image. This key pair allows containers running this image to SSH into each other. Although this is convenient for a lab environment, a production environment should never store private keys in an image.
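In production, one option is to generate the key pair outside the image and mount it into each container at run time instead. A minimal sketch, assuming the keys are kept on a shared NFS path (the path below is hypothetical):
# Generate the key pair once, outside the image (hypothetical shared path).
mkdir -p /mnt/nfs/ssh
ssh-keygen -t rsa -b 4096 -f /mnt/nfs/ssh/id_rsa -N ""
cp /mnt/nfs/ssh/id_rsa.pub /mnt/nfs/ssh/authorized_keys
chmod 700 /mnt/nfs/ssh && chmod 600 /mnt/nfs/ssh/*
# Then add this flag to the docker run command shown later in this section:
#   -v /mnt/nfs/ssh:/root/.ssh:ro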
Next, push this image to a Docker container registry so that it can be pulled by all other servers. Once logged in to your container registry, run the following command to upload the image.
docker push user/tensorflow:20.12-tf2-py3
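If you are not already logged in, the login command depends on where the image is hosted, for example:
docker login                              # Docker Hub
docker login registry.example.com:5000    # on-premises registry (hypothetical host:port)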
You are now ready to start the containers on all servers. Repeat this command for each server, replacing host with the server name and user with your NGC ID, Docker ID, or registry host:port.
ssh host \
docker \
run \
--rm \
--detach \
--privileged \
--gpus all \
-v /mnt:/mnt \
-v /mnt/nfs/data/tensorflow-benchmarks:/tensorflow-benchmarks \
-v /mnt/nfs/data/imagenet-scratch:/imagenet-scratch \
-v /mnt/nfs/data/ai-benchmark-utils:/scripts \
--network=host \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--name tf \
user/tensorflow:20.12-tf2-py3 \
bash -c \
'"/usr/sbin/sshd ; sleep infinity"'
The final line runs a shell in the container that starts the SSH daemon and then sleeps forever, which keeps the container alive; the outer single quotes ensure that the double-quoted command passes through SSH to the remote shell intact. At this point, MPI can reach the container through the SSH daemon listening on port 2222.
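If the servers follow a predictable naming scheme, the containers can also be started with a simple loop. A sketch, assuming the two worker names used in the examples below:
for host in dl-worker-01 dl-worker-02; do
  ssh $host \
    docker run --rm --detach --privileged --gpus all \
    -v /mnt:/mnt \
    -v /mnt/nfs/data/tensorflow-benchmarks:/tensorflow-benchmarks \
    -v /mnt/nfs/data/imagenet-scratch:/imagenet-scratch \
    -v /mnt/nfs/data/ai-benchmark-utils:/scripts \
    --network=host --shm-size=1g \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    --name tf user/tensorflow:20.12-tf2-py3 \
    bash -c '"/usr/sbin/sshd ; sleep infinity"'
done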
Choose one of the servers as the primary and enter the container on it by running the following command. This gives you a bash prompt inside the container.
docker exec -it tf bash
Confirm that this container can connect to all other containers by password-less SSH on port 2222.
ssh -p 2222 dl-worker-01 hostname
ssh -p 2222 dl-worker-02 hostname
Next, test that MPI can launch processes on all servers.
mpirun --allow-run-as-root -np 2 -H dl-worker-01,dl-worker-02 hostname
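Once this works, the same pattern is used to launch an actual training job. The invocation below is only a sketch: it assumes eight GPUs per server and that the standard tf_cnn_benchmarks script from the TensorFlow benchmarks repository is checked out under /tensorflow-benchmarks; the exact script and flags depend on the benchmark suite you mounted.
mpirun --allow-run-as-root -np 16 \
  -H dl-worker-01:8,dl-worker-02:8 \
  -bind-to none -map-by slot \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  python /tensorflow-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --model=resnet50 --batch_size=256 --variable_update=horovod \
  --data_dir=/imagenet-scratch --data_name=imagenet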
To stop the containers and all processes within them, run the following command on each server.
docker stop tf
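As with starting the containers, stopping them can be scripted. A sketch using the same worker names:
for host in dl-worker-01 dl-worker-02; do
  ssh $host docker stop tf
done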