Data Model Training
The ability to train neural network models with many hidden layers, and to train them on large datasets quickly and at scale, is critical to ADAS/AD development. To ensure safety and reliability, the neural networks designed for driving operations must be trained across many permutations of parameters, which places increasingly compute-intensive demands on the underlying systems and hardware architecture. In a distributed DL platform, the model must be kept synchronized across all nodes, which requires careful management and coordination of computation and communication among them.
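The model synchronization described above is typically done with a collective operation such as allreduce. The following is a minimal, illustrative sketch of the idea in plain Python; the function name and the gradient values are hypothetical, and real systems use optimized collectives (for example, ring allreduce over NCCL or MPI) rather than a naive loop:

```python
# Naive gradient synchronization across workers (illustrative values only).
# In production, frameworks perform this with optimized allreduce collectives.

def allreduce_mean(per_worker_grads):
    """Average each gradient component across all workers (naive allreduce)."""
    n = len(per_worker_grads)
    dim = len(per_worker_grads[0])
    return [sum(g[i] for g in per_worker_grads) / n for i in range(dim)]

# Four workers each computed a local gradient on its own data shard.
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = allreduce_mean(local_grads)
print(synced)  # [4.0, 5.0] -- every worker applies the same averaged update
```

After this step every replica holds an identical gradient, so all model copies stay in lockstep.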
Here are some key considerations for designing scalable neural networks:
Data Parallelism:
Most DL frameworks use data parallelism to partition the workload over multiple devices. The following figure shows the process of the data-parallelism approach to distributing training across multiple GPU servers and devices. Data parallelism also requires relatively little communication between nodes, because it benefits from a high amount of computation per weight. For example, assume there are n devices: each device receives a copy of the complete model and trains it on 1/nth of the data. The results, such as gradients and the updated model itself, are then communicated across these devices.
To ensure efficient training, network bandwidth between the nodes must not become a bottleneck. Also note that it is inefficient, and bad practice, to store training data on the local disks of every worker node: doing so forces terabytes of data to be copied to each worker node before the actual training can even begin.
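The data-parallel scheme above can be sketched end to end with a toy 1-D linear model. This is a simplified illustration, not any framework's API: the "devices" are simulated in-process, the function names are hypothetical, and the allreduce is a plain average. Because every equal-size shard contributes equally to the averaged gradient, each replica applies the same update it would have computed from the full batch:

```python
# One data-parallel training step for a 1-D linear model y = w*x.
# Each of n simulated "devices" holds a full copy of w and 1/n of the batch;
# gradients are averaged (allreduce), so every replica applies the same update.

def local_grad(w, shard):
    """Mean-squared-error gradient dL/dw over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, data, n_devices, lr=0.1):
    shard_size = len(data) // n_devices
    shards = [data[i*shard_size:(i+1)*shard_size] for i in range(n_devices)]
    grads = [local_grad(w, s) for s in shards]   # computed in parallel on devices
    g = sum(grads) / n_devices                   # allreduce: average the gradients
    return w - lr * g                            # identical update on every replica

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples from y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, data, n_devices=2)
print(round(w, 3))  # converges toward 2.0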
Model Parallelism:
When models are so large that they do not fit into the memory of a single device, an alternative called model parallelism can be used: different devices are assigned the task of learning different parts of the model. Model parallelism requires more careful consideration of the dependencies between model parameters. It may work well for GPUs in a single server that share a high-speed bus, and it can be used with larger models because per-node hardware constraints are no longer a limitation. The following figure shows the model-parallelism approach to distributing the training process:
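The partitioning idea can be illustrated with a minimal sketch in which each simulated "device" owns only its slice of the layers and hands its activations to the next device. The device names, layer functions, and values here are all hypothetical, chosen only to show the structure:

```python
# Model parallelism sketch: each simulated "device" owns only its partition
# of the model's layers; activations flow from one device to the next.

class Device:
    def __init__(self, name, layers):
        self.name = name
        self.layers = layers          # only this device's slice of the model

    def forward(self, x):
        for f in self.layers:
            x = f(x)
        return x                      # activation handed to the next device

# A 4-layer "model" partitioned across two devices; no device holds all layers.
scale = lambda k: (lambda x: x * k)
shift = lambda b: (lambda x: x + b)

gpu0 = Device("gpu0", [scale(2.0), shift(1.0)])   # layers 1-2
gpu1 = Device("gpu1", [scale(3.0), shift(-0.5)])  # layers 3-4

def forward_pipeline(x, devices):
    for d in devices:                 # inter-device transfer happens here
        x = d.forward(x)
    return x

print(forward_pipeline(1.0, [gpu0, gpu1]))  # ((1*2 + 1)*3) - 0.5 = 8.5
```

The hand-off between devices is exactly where the inter-parameter dependencies mentioned above appear: each partition can only proceed once it receives the previous partition's output, which is why model parallelism favors devices connected by a high-speed bus.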