In this chapter, we explain the key architectural concepts in this design for generative AI inferencing, including LLM characteristics and examples. We list the Dell PowerEdge servers and NVIDIA GPUs used in the design, including GPU configurations and GPU connectivity and networking methods.
We also describe the primary software components used for inferencing, including NVIDIA AI Enterprise, Triton Inference Server, the NeMo framework for generative AI models, and the NVIDIA cluster manager.