The NVIDIA Tesla T4, based on the NVIDIA Turing™ architecture, is one of the most widely used AI inference accelerators. The Tesla T4 features NVIDIA Turing Tensor cores, which enable it to accelerate all types of neural networks for images, speech, translation, and recommender systems, to name a few. The Tesla T4 supports a wide variety of precisions and accelerates all major DL and ML frameworks, including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2.
Table 5 NVIDIA Tesla T4 technical specifications
| NVIDIA Tesla T4 | |
| --- | --- |
| GPU architecture | NVIDIA Turing |
| NVIDIA Turing Tensor cores | 320 |
| NVIDIA CUDA® cores | 2,560 |
| Single-precision | 8.1 TFLOPS |
| Mixed-precision (FP16/FP32) | 65 TFLOPS |
| INT8 | 130 TOPS |
| INT4 | 260 TOPS |
| GPU memory | 16 GB GDDR6, 300 GB/s |
| ECC | Yes |
| Interconnect bandwidth | 32 GB/s |
| System interface | x16 PCIe Gen3 |
| Form factor | Low-profile PCIe |
| Thermal solution | Passive |
| Compute APIs | CUDA, NVIDIA TensorRT™, ONNX |
| TDP | 70 watts |
For more details on NVIDIA Tesla T4, see https://www.nvidia.com/en-us/data-center/tesla-t4/.
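The key figures in the table above can be cross-checked on a deployed system. The following is a minimal CUDA runtime sketch (not part of the official specification; the exact numbers reported depend on the driver and board variant) that queries device properties such as memory size, memory bus width, and streaming multiprocessor count:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "No CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, i);
        // Name and compute capability (Turing T4 reports 7.5, Ampere A100 reports 8.0).
        std::printf("Device %d: %s (compute capability %d.%d)\n",
                    i, prop.name, prop.major, prop.minor);
        // Total device memory and memory bus width.
        std::printf("  Memory: %.1f GB, bus width: %d-bit\n",
                    prop.totalGlobalMem / 1.0e9, prop.memoryBusWidth);
        // Number of streaming multiprocessors; multiply by the cores per SM
        // for the architecture to get the CUDA core count.
        std::printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        // Whether ECC is enabled on the board.
        std::printf("  ECC enabled: %s\n", prop.ECCEnabled ? "yes" : "no");
    }
    return 0;
}
```

Compiled with nvcc, this prints one block per installed GPU; the nvidia-smi utility reports similar information without any code.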
Table 6 NVIDIA Quadro RTX 8000 technical specifications
| NVIDIA Quadro RTX 8000 | |
| --- | --- |
| GPU architecture | NVIDIA Turing |
| NVIDIA Tensor cores | 576 |
| NVIDIA CUDA cores | 4,608 |
| Single-precision | 16.3 TFLOPS |
| Half-precision | 32.6 TFLOPS |
| INT8 | 206.1 TOPS |
| INT4 | 522 TOPS |
| GPU memory | 48 GB GDDR6 |
| ECC | Yes |
| Memory bandwidth | 672 GB/s |
| System interface | PCI Express 3.0 x16 |
| Form factor | 4.4” H x 10.5” L, dual slot, full height |
| Thermal solution | Passive |
| Compute APIs | CUDA, DirectCompute, OpenCL™ |
| TDP | 260 watts |
For more details on NVIDIA® Quadro® RTX™ 8000, see https://www.nvidia.com/en-us/design-visualization/quadro/rtx-8000/.
The NVIDIA A100 Tensor Core GPU is the flagship product of the NVIDIA data center platform for deep learning, HPC, and data analytics. The platform accelerates over 700 HPC applications and every major deep learning framework. It is available everywhere, from desktops to servers to cloud services, delivering both dramatic performance gains and cost-saving opportunities.
Table 7 NVIDIA A100-PCIE technical specifications
| NVIDIA A100-PCIE | |
| --- | --- |
| GPU architecture | NVIDIA Ampere |
| NVIDIA Tensor cores | 432 |
| NVIDIA CUDA cores | 6,912 |
| Single-precision | 19.5 TFLOPS |
| Double-precision | 9.7 TFLOPS |
| INT8 | 1,248 TOPS |
| INT4 | 2,496 TOPS |
| GPU memory | 40 GB HBM2 |
| ECC | Yes |
| Memory bandwidth | 1,555 GB/s |
| Interconnect interface | PCIe Gen4: 64 GB/s |
| Form factor | PCIe |
| Thermal solution | Passive |
| Compute APIs | CUDA, DirectCompute, OpenCL, OpenACC® |
| TDP | 250 watts |
For more details, see https://www.nvidia.com/en-us/data-center/a100/.
At its core, NVIDIA TensorRT™ is a C++ library designed to optimize deep learning inference performance on systems that use NVIDIA GPUs. It supports models trained in most of the major deep learning frameworks, including, but not limited to, TensorFlow, Caffe, PyTorch, and MXNet. After the neural network is trained, TensorRT™ enables the network to be compressed, optimized, and deployed as a runtime without the overhead of a framework. It supports FP32, FP16, and INT8 precisions.
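To make this concrete, the sketch below shows one common path for producing an optimized engine: importing a trained model that has been exported to ONNX and requesting FP16 precision. It is a minimal illustration assuming a TensorRT 8.x-style C++ API and a hypothetical file name (model.onnx); exact class and method names differ between TensorRT releases, so consult the documentation for the installed version.

```cpp
#include <fstream>
#include <iostream>
#include <NvInfer.h>
#include <NvOnnxParser.h>

// Minimal logger required by the TensorRT builder and runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
    }
} gLogger;

int main() {
    auto builder = nvinfer1::createInferBuilder(gLogger);
    // Explicit-batch network definition, as required for ONNX models.
    auto network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto parser = nvonnxparser::createParser(*network, gLogger);

    // "model.onnx" is a placeholder for a trained model exported from any framework.
    if (!parser->parseFromFile("model.onnx",
            static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
        std::cerr << "Failed to parse ONNX model" << std::endl;
        return 1;
    }

    auto config = builder->createBuilderConfig();
    // Request reduced precision where the GPU supports it (e.g. Tensor cores).
    if (builder->platformHasFastFp16())
        config->setFlag(nvinfer1::BuilderFlag::kFP16);

    // Build and serialize the optimized engine ("plan") for later deployment.
    auto plan = builder->buildSerializedNetwork(*network, *config);
    std::ofstream out("model.plan", std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
    return 0;
}
```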
To optimize the model, TensorRT™ builds an inference engine from the trained model by analyzing its layers, eliminating layers whose output is not used, and combining operations to perform faster calculations. On top of these model-specific optimizations, it also performs platform-specific optimizations for the target GPU. The result of all these optimizations is improved latency, throughput, and efficiency.
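Once the builder has produced an optimized plan, deploying it requires only the lightweight TensorRT runtime rather than the original training framework. The companion sketch below, under the same assumptions as the build sketch above (the placeholder plan file name, plus a single float input and output binding whose sizes are chosen purely for illustration), deserializes the plan and runs inference:

```cpp
#include <fstream>
#include <iterator>
#include <vector>
#include <NvInfer.h>
#include <cuda_runtime.h>

// Reuses a logger like the one in the build sketch above.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {}
} gLogger;

int main() {
    // Load the serialized engine produced by the builder.
    std::ifstream in("model.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());

    auto runtime = nvinfer1::createInferRuntime(gLogger);
    auto engine = runtime->deserializeCudaEngine(blob.data(), blob.size());
    auto context = engine->createExecutionContext();

    // For illustration only: assume one input and one output binding of
    // known float sizes; a real application reads shapes from the engine.
    const size_t inputElems = 3 * 224 * 224, outputElems = 1000;
    void* bindings[2];
    cudaMalloc(&bindings[0], inputElems * sizeof(float));
    cudaMalloc(&bindings[1], outputElems * sizeof(float));

    std::vector<float> input(inputElems, 0.0f), output(outputElems);
    cudaMemcpy(bindings[0], input.data(), inputElems * sizeof(float),
               cudaMemcpyHostToDevice);

    // Synchronous execution of the optimized engine.
    context->executeV2(bindings);

    cudaMemcpy(output.data(), bindings[1], outputElems * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    return 0;
}
```

Because the engine carries its own optimized kernels, the deployed application links only against the TensorRT runtime and the CUDA runtime, which is what keeps the inference path free of framework overhead.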