Many enterprises forgo initial training and instead customize a pretrained model as the basis for their solution. With fine-tuning and P-tuning, enterprise-specific data can be used to retrain a portion of an existing model or to build a better prompt interface to it. This method requires significantly less compute power than training a model from scratch, and it can start from a configuration similar to the inference-only configuration. The key difference is the addition of InfiniBand networking between compute systems.
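The idea of retraining only a portion of an existing model can be sketched as follows. This is a minimal illustration, assuming PyTorch; the small `backbone` and `head` networks are hypothetical stand-ins for a large pretrained model and a task-specific layer, not part of any product described here.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained backbone; in practice this would be
# a large language model loaded from a checkpoint.
backbone = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
head = nn.Linear(256, 10)  # task-specific layer trained on enterprise data

# Freeze the pretrained weights so only the head receives gradient updates.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

# One illustrative training step on random data.
x = torch.randn(4, 128)
target = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(head(backbone(x)), target)
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters())
print(trainable, frozen)  # far fewer trainable than frozen parameters
```

Because gradients flow only into the small trainable portion, the optimizer state and gradient traffic are a fraction of what full training would require, which is why a near-inference hardware configuration can suffice.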
Design considerations for large model customization with fine-tuning or P-tuning using pretrained large models include the following:
- Although this task is less compute-intensive than large model training, it still requires a tremendous amount of information exchange (for example, weights) between GPUs on different nodes. InfiniBand is therefore required for optimized performance and throughput with an eight-way GPU configuration and an all-to-all NVLink connection. In some cases, when the model size is less than 40B parameters and depending on the application latency requirements, the InfiniBand module can be optional.
- P-tuning uses a small trainable model ahead of the LLM. The small model encodes the text prompt and generates task-specific virtual tokens. Prompt-tuning and prefix-tuning, which tune only continuous prompts while the language model remains frozen, substantially reduce per-task storage and memory usage during training.
- For models with fewer than 40B parameters, you might be able to use a PowerEdge XE8640 server. For larger models, we recommend the PowerEdge XE9680 server.
- The Data module is optional because there are no snapshot requirements. However, certain prompt-engineering techniques might require a large dataset and therefore a high-performance Data module.
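The P-tuning mechanism described above can be sketched as a small trainable prompt encoder whose output "virtual tokens" are prepended to the frozen model's input embeddings. This is a minimal sketch, assuming PyTorch; the tiny embedding table and single transformer layer are hypothetical stand-ins for a frozen LLM, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

vocab, d_model, n_virtual = 1000, 64, 8  # illustrative sizes

# Frozen "LLM" stand-in: an embedding table plus one transformer layer.
embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
for p in list(embed.parameters()) + list(block.parameters()):
    p.requires_grad = False

# The only trainable weights: a small prompt encoder that maps learned
# seed vectors to task-specific virtual tokens.
virtual_seed = nn.Parameter(torch.randn(n_virtual, d_model))
prompt_encoder = nn.Sequential(
    nn.Linear(d_model, d_model), nn.Tanh(),
    nn.Linear(d_model, d_model),
)

token_ids = torch.randint(0, vocab, (2, 16))          # batch of text prompts
virtual = prompt_encoder(virtual_seed).expand(2, -1, -1)
inputs = torch.cat([virtual, embed(token_ids)], dim=1)  # prepend virtual tokens
out = block(inputs)
print(out.shape)
```

Per-task storage is limited to the prompt encoder and seed vectors, which is why prompt-tuning and prefix-tuning keep training memory and per-task artifacts small relative to full fine-tuning.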