The NVIDIA Tesla T4, based on the NVIDIA Turing™ architecture, is one of the most widely used AI inference accelerators. The Tesla T4 features NVIDIA Turing Tensor cores, which enable it to accelerate all types of neural networks for images, speech, translation, and recommender systems, to name a few. The Tesla T4 supports a wide variety of precisions and accelerates all major DL and ML frameworks, including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2.
Table 5 NVIDIA Tesla T4 technical specifications
| NVIDIA Tesla T4 | |
| --- | --- |
| GPU architecture | NVIDIA Turing |
| NVIDIA Turing Tensor cores | 320 |
| NVIDIA CUDA® cores | 2,560 |
| Single-precision | 8.1 TFLOPS |
| Mixed-precision (FP16/FP32) | 65 TFLOPS |
| INT8 | 130 TOPS |
| INT4 | 260 TOPS |
| GPU memory | 16 GB GDDR6, 300 GB/s |
| ECC | Yes |
| Interconnect bandwidth | 32 GB/s |
| System interface | x16 PCIe Gen3 |
| Form factor | Low-profile PCIe |
| Thermal solution | Passive |
| Compute APIs | CUDA, NVIDIA TensorRT™, ONNX |
| TDP | 70 watts |
For more details on NVIDIA Tesla T4, see https://www.nvidia.com/en-us/data-center/tesla-t4/.
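The key figures in the table above can be cross-checked on a deployed system. The following is a minimal CUDA runtime sketch (not part of the official specification; the exact numbers reported depend on the driver and board variant) that queries device properties such as memory size, memory bus width, and streaming multiprocessor count:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "No CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, i);
        // Name and compute capability (Turing T4 reports 7.5, Ampere A100 reports 8.0).
        std::printf("Device %d: %s (compute capability %d.%d)\n",
                    i, prop.name, prop.major, prop.minor);
        // Total device memory and memory bus width.
        std::printf("  Memory: %.1f GB, bus width: %d-bit\n",
                    prop.totalGlobalMem / 1.0e9, prop.memoryBusWidth);
        // Number of streaming multiprocessors; multiply by the cores per SM
        // for the architecture to get the CUDA core count.
        std::printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        // Whether ECC is enabled on the board.
        std::printf("  ECC enabled: %s\n", prop.ECCEnabled ? "yes" : "no");
    }
    return 0;
}
```

Compiled with nvcc, this prints one block per installed GPU; the nvidia-smi utility reports similar information without any code.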
Table 6 NVIDIA Quadro RTX 8000 technical specifications
| NVIDIA Quadro RTX 8000 | |
| --- | --- |
| GPU architecture | NVIDIA Turing |
| NVIDIA Tensor cores | 576 |
| NVIDIA CUDA cores | 4,608 |
| Single-precision | 16.3 TFLOPS |
| Half-precision | 32.6 TFLOPS |
| INT8 | 206.1 TOPS |
| INT4 | 522 TOPS |
| GPU memory | 48 GB GDDR6 |
| ECC | Yes |
| Memory bandwidth | 672 GB/s |
| System interface | PCI Express 3.0 x16 |
| Form factor | 4.4” H x 10.5” L, dual slot, full height |
| Thermal solution | Passive |
| Compute APIs | CUDA, DirectCompute, OpenCL™ |
| TDP | 260 watts |
For more details on NVIDIA® Quadro® RTX™ 8000, see https://www.nvidia.com/en-us/design-visualization/quadro/rtx-8000/.
The NVIDIA A100 Tensor Core GPU is the flagship product of the NVIDIA data center platform for deep learning, HPC, and data analytics. The platform accelerates over 700 HPC applications and every major deep learning framework. It is available everywhere, from desktops to servers to cloud services, delivering both dramatic performance gains and cost-saving opportunities.
Table 7 NVIDIA A100-PCIE technical specifications
| NVIDIA A100-PCIE | |
| --- | --- |
| GPU architecture | NVIDIA Ampere |
| NVIDIA Tensor cores | 432 |
| NVIDIA CUDA cores | 6,912 |
| Single-precision | 19.5 TFLOPS |
| Double-precision | 9.7 TFLOPS |
| INT8 | 1,248 TOPS |
| INT4 | 2,496 TOPS |
| GPU memory | 40 GB HBM2 |
| ECC | Yes |
| Memory bandwidth | 1,555 GB/s |
| Interconnect interface | PCIe Gen4: 64 GB/s |
| Form factor | PCIe |
| Thermal solution | Passive |
| Compute APIs | CUDA, DirectCompute, OpenCL, OpenACC® |
| TDP | 250 watts |
For more details, see https://www.nvidia.com/en-us/data-center/a100/.
At its core, NVIDIA TensorRT™ is a C++ library designed to optimize deep learning inference performance on systems that use NVIDIA GPUs. It supports models trained in most of the major deep learning frameworks, including, but not limited to, TensorFlow, Caffe, PyTorch, and MXNet. After the neural network is trained, TensorRT™ enables the network to be compressed, optimized, and deployed as a runtime without the overhead of a framework. It supports FP32, FP16, and INT8 precisions.
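To make this concrete, the sketch below shows one common path for producing an optimized engine: importing a trained model that has been exported to ONNX and requesting FP16 precision. It is a minimal illustration assuming a TensorRT 8.x-style C++ API and a hypothetical file name (model.onnx); exact class and method names differ between TensorRT releases, so consult the documentation for the installed version.

```cpp
#include <fstream>
#include <iostream>
#include <NvInfer.h>
#include <NvOnnxParser.h>

// Minimal logger required by the TensorRT builder and runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
    }
} gLogger;

int main() {
    auto builder = nvinfer1::createInferBuilder(gLogger);
    // Explicit-batch network definition, as required for ONNX models.
    auto network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto parser = nvonnxparser::createParser(*network, gLogger);

    // "model.onnx" is a placeholder for a trained model exported from any framework.
    if (!parser->parseFromFile("model.onnx",
            static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
        std::cerr << "Failed to parse ONNX model" << std::endl;
        return 1;
    }

    auto config = builder->createBuilderConfig();
    // Request reduced precision where the GPU supports it (e.g. Tensor cores).
    if (builder->platformHasFastFp16())
        config->setFlag(nvinfer1::BuilderFlag::kFP16);

    // Build and serialize the optimized engine ("plan") for later deployment.
    auto plan = builder->buildSerializedNetwork(*network, *config);
    std::ofstream out("model.plan", std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
    return 0;
}
```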
To optimize the model, TensorRT™ builds an inference engine from the trained model by analyzing its layers, eliminating layers whose output is not used, and combining operations to perform faster calculations. On top of these model-specific optimizations, it also performs platform-specific optimizations for the target GPU. The result of all these optimizations is improved latency, throughput, and efficiency.
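Once the builder has produced an optimized plan, deploying it requires only the lightweight TensorRT runtime rather than the original training framework. The companion sketch below, under the same assumptions as the build sketch above (the placeholder plan file name, plus a single float input and output binding whose sizes are chosen purely for illustration), deserializes the plan and runs inference:

```cpp
#include <fstream>
#include <iterator>
#include <vector>
#include <NvInfer.h>
#include <cuda_runtime.h>

// Reuses a logger like the one in the build sketch above.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {}
} gLogger;

int main() {
    // Load the serialized engine produced by the builder.
    std::ifstream in("model.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());

    auto runtime = nvinfer1::createInferRuntime(gLogger);
    auto engine = runtime->deserializeCudaEngine(blob.data(), blob.size());
    auto context = engine->createExecutionContext();

    // For illustration only: assume one input and one output binding of
    // known float sizes; a real application reads shapes from the engine.
    const size_t inputElems = 3 * 224 * 224, outputElems = 1000;
    void* bindings[2];
    cudaMalloc(&bindings[0], inputElems * sizeof(float));
    cudaMalloc(&bindings[1], outputElems * sizeof(float));

    std::vector<float> input(inputElems, 0.0f), output(outputElems);
    cudaMemcpy(bindings[0], input.data(), inputElems * sizeof(float),
               cudaMemcpyHostToDevice);

    // Synchronous execution of the optimized engine.
    context->executeV2(bindings);

    cudaMemcpy(output.data(), bindings[1], outputElems * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    return 0;
}
```

Because the engine carries its own optimized kernels, the deployed application links only against the TensorRT runtime and the CUDA runtime, which is what keeps the inference path free of framework overhead.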