With the model now tuned, an inference server can be configured with it to evaluate the results. The hardware configuration includes two PowerEdge R760 systems, each with four NVIDIA H100 GPUs, which can be used as inference servers: one for the base model and one for the fine-tuned model. As new models are trained, evaluated, and promoted, they could be set up as a development/production pair.
vLLM can be used as the inference serving runtime, leveraging KServe (integrated with Red Hat OpenShift AI) to start up and scale inferencing as necessary. Although the vLLM serving runtime is now part of OpenShift AI, further customization is possible, as shown in the YAML configuration files available here:
An example of such customization is enabling a local node PVC for model storage. While refining the training and the overall process, local storage can be leveraged as a fast medium to reduce load times for large models.
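As an illustrative sketch of this kind of customization (the resource name, runtime name, and PVC name below are assumptions, not values from this configuration), a KServe InferenceService can point its `storageUri` at a PVC so that the model weights load from node-local storage:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: codellama-70b-loraft         # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-runtime          # hypothetical ServingRuntime name
      # pvc:// references a PersistentVolumeClaim and a path within it
      storageUri: pvc://model-storage-pvc/CodeLlama-70b-finetuned
      resources:
        limits:
          nvidia.com/gpu: "4"
```

The `pvc://` scheme tells KServe's storage initializer to mount the claim and copy the model from the given path rather than pulling it over the network on every startup.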
The KServe user guide includes examples of other storage types that can be used with vLLM serving runtime instances.
When the vLLM inference server instance is fully up and running, KServe provides an OpenShift route for simple access to the provided API endpoints (for example, https://codellama-70b-loraft-drd.apps.eclipse.ai.lab).
A simple test to verify that the inference server is working is to query the "/v1/models" endpoint using a web browser, a command-line tool such as cURL, or some basic Python code, which returns text similar to the following:
"object":"list","data":[{"id":"/mnt/models/CodeLlama-70b-finetuned","object":"model","created":1721847460,"owned_by":"vllm","root":"/mnt/models/CodeLlama-70b-finetuned","parent":null,"max_model_len":2048,"permission":[{"id":"modelperm-fae45d29b08d4ce2ad1b6a4f74a5fdc3","object":"model_permission","created":1721847460,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}