Model training is the most computationally intensive part of ML/DL. Kubeflow uses TFJob, a Kubernetes custom resource, to run TensorFlow training jobs in an automated fashion and to let data scientists monitor job progress by viewing the results. Nvidia GPUs are used to accelerate the training of neural network models.
Execution time can also be reduced by running TensorFlow distributed training, which takes advantage of the compute capability of multiple GPUs to train the same neural network. Two types of components enable distributed training: worker nodes, where the computation (model training) takes place, and parameter servers (PS), which store the parameters needed by the individual workers.
Kubeflow provides a YAML representation for TFJobs.
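The manifest below is a minimal sketch of such a TFJob for the distributed training described above, with one parameter server and two GPU-backed workers. The field names follow the kubeflow.org TFJob custom resource; the exact `apiVersion`, container image tags, and the job name and training script path are assumptions and will vary with the Kubeflow release in use.

```yaml
# Illustrative TFJob manifest (apiVersion and image tags depend on the
# installed Kubeflow release; job name and script path are hypothetical).
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-distributed        # hypothetical job name
spec:
  tfReplicaSpecs:
    PS:                          # parameter server: stores model parameters
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:1.13.1        # assumed image tag
            command: ["python", "/opt/model/train.py"] # hypothetical script
    Worker:                      # workers: perform the model training
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:1.13.1-gpu    # assumed GPU image
            command: ["python", "/opt/model/train.py"] # hypothetical script
            resources:
              limits:
                nvidia.com/gpu: 1   # one Nvidia GPU per worker pod
```

Applying this manifest (for example with `oc apply -f tfjob.yaml`) causes the TFJob operator to create the PS and worker pods and to track their status, which is how job progress becomes visible to data scientists.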