Model training is the most computationally intensive part of ML/DL. Kubeflow uses TFJob, a Kubernetes custom resource, to run TensorFlow training jobs in an automated fashion and enable data scientists to monitor job progress by viewing the results. NVIDIA GPUs are used to accelerate the training of neural network models.
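A minimal TFJob manifest might look like the sketch below. The job name, namespace, and container image are placeholders; the `nvidia.com/gpu` resource limit is what requests a GPU from Kubernetes (it assumes the NVIDIA device plugin is installed on the cluster).

```yaml
# Hypothetical single-worker TFJob; name and image are placeholders.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow        # TFJob expects this container name
              image: my-registry/mnist-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1   # schedule onto a node with an NVIDIA GPU
```

Submitting the manifest with `kubectl apply -f` creates the job, and its status can be followed with `kubectl describe tfjob mnist-train`.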
Execution time can also be reduced by running TensorFlow distributed training, which takes advantage of the compute capability of multiple GPUs working on the same neural network training job. Two types of components enable distributed training: worker nodes, where the computation (model training) takes place, and parameter servers (PS), which are responsible for storing the parameters needed by the individual workers.
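In TFJob terms, this maps onto multiple replica types in `tfReplicaSpecs`. The sketch below is illustrative only (replica counts, name, and image are placeholders): the operator sets the `TF_CONFIG` environment variable in each pod so that TensorFlow knows each replica's role (PS or worker) and the addresses of its peers.

```yaml
# Hypothetical distributed TFJob with 2 parameter servers and 4 GPU workers.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist-train
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2                     # parameter servers hold the model variables
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/dist-mnist-train:latest
    Worker:
      replicas: 4                     # workers compute gradients on their data shards
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/dist-mnist-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1   # one GPU per worker
```

Only the workers request GPUs here; parameter servers are typically CPU-bound, since their job is to store and serve variables rather than run the heavy computation.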