Floating point precision (FP16 vs. FP32)
The NVIDIA V100 GPU contains a new type of processing core called the Tensor Core, which supports mixed precision training. Although many High Performance Computing (HPC) applications require high-precision computation with FP32 (32-bit floating point) or FP64 (64-bit floating point), deep learning researchers have found that FP16 (16-bit floating point) can achieve the same inference accuracy as FP32. In this document, mixed precision training, which combines FP16 and FP32 representations, is denoted "FP16" training.
In experiments where training tests were executed with FP16 precision, the batch size was doubled, since FP16 values consume half the memory of FP32 values. Doubling the batch size for FP16 ensures that GPU memory is utilized equally in both types of tests.
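To make this convention concrete, the following is a minimal sketch of the batch-size adjustment. The FP32 per-GPU batch size of 256 is a hypothetical value for illustration, not a figure from the benchmark configuration in this document.

```python
# Hypothetical per-GPU batch size for FP32 training runs.
FP32_BATCH_SIZE = 256

def batch_size_for(precision: str) -> int:
    """Return the per-GPU batch size for a training run.

    FP16 tensors occupy half the memory of FP32 tensors, so the batch
    size is doubled to keep GPU memory utilization comparable across
    the two types of tests.
    """
    if precision == "fp16":
        return FP32_BATCH_SIZE * 2  # half the bytes per value -> twice the samples
    return FP32_BATCH_SIZE

print(batch_size_for("fp32"))  # 256
print(batch_size_for("fp16"))  # 512
```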
Although FP16 makes training faster, it requires extra work in the neural network model implementation to match the accuracy achieved with FP32. This is because some neural networks produce gradient values too small to be represented in FP16, so the implementation must scale them into the FP16-representable range, typically by scaling the loss before backpropagation and unscaling the gradients before the weight update. For more details, refer to NVIDIA mixed precision training.
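The document does not show how this scaling is implemented, but as one illustration, the sketch below uses PyTorch's torch.cuda.amp API (an assumed framework choice, not necessarily what was used in these tests). The model, optimizer, and random data are placeholders; the point is the loss-scaling pattern itself.

```python
import torch
from torch import nn

# Hypothetical model, optimizer, and loss; only the mixed precision
# loss-scaling pattern is the point of this sketch.
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# GradScaler multiplies the loss before backward() so that small
# gradient values are shifted into FP16-representable range, then
# unscales the gradients before the optimizer step.
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    inputs = torch.randn(512, 128, device="cuda")
    targets = torch.randint(0, 10, (512,), device="cuda")

    optimizer.zero_grad()
    # autocast runs eligible ops (e.g., matrix multiplies) in FP16
    # while keeping numerically sensitive ops in FP32.
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    scaler.scale(loss).backward()  # backpropagate the scaled loss
    scaler.step(optimizer)         # unscale gradients, then update weights
    scaler.update()                # adjust the scale factor dynamically
```

The scale factor is adjusted dynamically: if scaled gradients overflow, the step is skipped and the factor is reduced, which is how such implementations keep FP16 training numerically stable without manual tuning.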