Data model training
The ability to train neural network data models with many hidden layers, and to train them on large datasets in a short amount of time and at scale, is critical to ADAS/AD development. To ensure safety and reliability, the neural networks designed for driving operations must be trained across many permutations of parameters, which places more compute-intensive demands on the underlying systems and hardware architecture. In distributed DL platforms, the model must be kept synchronized across all nodes, which requires careful management and coordination of computation and communication across those nodes.
Here are some key considerations for designing scalable neural networks:
Most DL frameworks use data parallelism to partition the workload over multiple devices. The following figure shows how data parallelism distributes the training process across multiple GPU systems and devices.
Data parallelism also requires less communication between nodes because it benefits from a high amount of computation per weight. Assume, for example, that there are n devices; each device receives a copy of the complete model and trains it on 1/nth of the data. The results, such as gradients and the updated model itself, are communicated across these devices.
To ensure efficient training, network bandwidth across the nodes must not become a bottleneck. Also note that storing training data on the local disks of every worker node is inefficient and bad practice, because it forces terabytes of data to be copied to each worker node before training can start.
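For illustration only, the following sketch shows what data-parallel training of this kind can look like with PyTorch DistributedDataParallel. The model, dataset, and training loop are placeholders rather than the workload described in this document, and the sketch assumes workers are launched with torchrun so that RANK, LOCAL_RANK, and WORLD_SIZE are set. In practice, the data would stream from shared storage rather than from local disks.

```python
# Minimal data-parallelism sketch with PyTorch DistributedDataParallel (DDP).
# Every rank holds a full copy of the model and trains on 1/n of the data;
# gradients are averaged across ranks during backward().
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; each rank gets an identical replica.
    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Placeholder in-memory dataset; a real pipeline would read from shared storage.
    dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)            # gives each rank 1/n of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                          # gradients are all-reduced here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A sketch like this would typically be started on each node with a command such as `torchrun --nproc_per_node=8 train_ddp.py`, one process per GPU.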
When models are so large that they do not fit into device memory, an alternative method, called model parallelism, is employed. With model parallelism, as illustrated in the following figure, different devices are assigned the task of learning different parts of the model.
Model parallelism requires more careful consideration of the dependencies between model parameters. It tends to work well for GPUs within a single server that share a high-speed bus, and it can be used with larger models because hardware constraints per node are no longer a limitation.
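As a minimal sketch, the snippet below shows manual model parallelism in PyTorch. The layer sizes and two-GPU placement are assumptions chosen purely for illustration: the first part of the model lives on one GPU, the second part on another, and activations cross the bus between devices during the forward pass.

```python
# Minimal model-parallelism sketch: different layers of one model are placed on
# different GPUs, so activations (not full model replicas) move between devices.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First part of the model on GPU 0, second part on GPU 1 (illustrative split).
        self.part1 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations are transferred across the high-speed bus between the two GPUs.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch of inputs and labels.
x = torch.randn(64, 512)
y = torch.randint(0, 10, (64,))

optimizer.zero_grad()
out = model(x)
loss = loss_fn(out, y.to("cuda:1"))   # labels must live on the same device as the output
loss.backward()
optimizer.step()
```

The trade-off this sketch illustrates is that each forward and backward pass serializes across devices and pays a transfer cost for activations, which is why model parallelism is most attractive when the GPUs share a fast interconnect within a single server.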