Start TensorFlow containers
In a basic bare-metal deployment of TensorFlow and MPI, all software must be installed on each node. MPI then uses SSH to connect to each node to start the TensorFlow application processes.
With Docker containers, this setup becomes slightly more complex, but dependency management becomes significantly easier. On each server, a single Docker container is launched with an SSH daemon listening on custom port 2222. This container also includes TensorFlow, Open MPI, and the AMD ROCm libraries and tools. You can then run the docker exec and mpirun commands in one of these containers, and MPI connects to the Docker containers on all other VM instances over SSH on port 2222.
First, a custom Docker image is created using the following Dockerfile.
# Build with: docker build --network=host --rm -t user/tensorflow-amd:rocm4.5.2-tf1.15-dev .
FROM rocm/tensorflow:rocm4.5.2-tf1.15-dev

RUN wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -

# Install SSH and various utilities.
RUN apt-get update && apt-get install -y --no-install-recommends \
        openssh-client \
        openssh-server \
        lsof \
    && \
    rm -rf /var/lib/apt/lists/*

# Configure SSHD for MPI.
RUN mkdir -p /var/run/sshd && \
    mkdir -p /root/.ssh && \
    echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config && \
    echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/^#*Port 22/Port 2222/' /etc/ssh/sshd_config && \
    echo "Host *" >> /root/.ssh/config && \
    echo "  Port 2222" >> /root/.ssh/config && \
    ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa -N "" && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys && \
    chmod 700 /root/.ssh && \
    chmod 600 /root/.ssh/*

# Install Python libraries.
RUN pip install ConfigArgParse

WORKDIR /root
EXPOSE 2222
As shown above, this Dockerfile is based on the AMD ROCm Docker image for TensorFlow (rocm/tensorflow).
Run the following command to build the Docker image. Replace user with your Docker ID, or host:port if you are using an on-premises container registry.
docker build -t user/tensorflow-amd:rocm4.5.2-tf1.15-dev .
Note that during the build process, a new RSA key pair is randomly generated and stored in the image. This key pair allows containers running this image to SSH into each other. Although this is convenient for a lab environment, a production environment should never store private keys in an image.
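One production-friendlier approach, sketched below, is to generate the key pair on an administration host and bind-mount it into every container at run time instead of baking it into the image. The ./mpi-ssh directory is an example path of our choosing, not part of the original procedure.

```shell
# Sketch: generate the MPI SSH key pair outside the image.
# The ./mpi-ssh path is an example; place it wherever you prefer.
mkdir -p ./mpi-ssh
ssh-keygen -t rsa -b 4096 -f ./mpi-ssh/id_rsa -N ""
cp ./mpi-ssh/id_rsa.pub ./mpi-ssh/authorized_keys
chmod 700 ./mpi-ssh
chmod 600 ./mpi-ssh/*
# Then add this option to the docker run command shown later,
# so the keys never live in the image itself:
#   -v "$PWD/mpi-ssh:/root/.ssh"
```

With this variant, the ssh-keygen step in the Dockerfile can be dropped, and rotating keys no longer requires rebuilding and re-pushing the image.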
Next, you must push this image to a Docker container registry so that all other servers can pull it. Once logged in to your container registry, run the following command to upload the image.
docker push user/tensorflow-amd:rocm4.5.2-tf1.15-dev
You are now ready to start the containers on all servers. Repeat this command for each server, replacing host with the server name and user with your Docker ID or host:port.
ssh host \
docker \
run \
--rm \
--detach \
--privileged \
-v /mnt:/mnt \
--network=host \
--device=/dev/kfd \
--device=/dev/dri \
--ipc=host \
--shm-size 16G \
--ulimit memlock=-1 \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--name tf \
user/tensorflow-amd:rocm4.5.2-tf1.15-dev \
bash -c \
"/usr/sbin/sshd ; sleep infinity"
The final line starts the SSH daemon and then sleeps forever, which keeps the container running. At this point, MPI can reach the container through the SSH daemon listening on port 2222.
Choose any one of the VM instances as the primary and enter the container by running the following command. This will give you a bash prompt within the container.
docker exec -it tf bash
Confirm that this container can connect to all other containers using password-less SSH on port 2222.
ssh -p 2222 dl-worker-01 hostname
ssh -p 2222 dl-worker-02 hostname
Next, test that MPI can launch processes across all VM instances.
mpirun --allow-run-as-root -np 2 -H dl-worker-01,dl-worker-02 hostname
To stop the containers and all processes within them, run the following command on each server.
docker stop tf
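To tear down all workers in one step, a loop like the one used for startup works here too. The hostnames are again placeholders, and the leading echo keeps this a dry run.

```shell
# Stop the tf container on every server.
# Hostnames are placeholders; remove "echo" to actually execute.
for host in dl-worker-01 dl-worker-02; do
  echo ssh "$host" docker stop tf
done
```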