Model deployment
This section describes how to deploy an LLM to the Caikit model-serving environment.
There are two options for deploying an LLM model. This procedure demonstrates deploying the example Llama-2-7b-chat-hf model with the Caikit + TGIS Serving runtime.
Note: The Llama-2-7b-chat-hf LLM model has been uploaded to the object storage bucket.
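If you want to confirm that the model artifacts are in place before deploying, you can list the bucket with the MinIO client; the mc alias name (minio) and the bucket path shown here are assumptions to adapt to your environment:
# Optional check: list the model directory in the object storage bucket
# (the 'minio' alias and the bucket path are assumptions)
mc ls minio/modelmesh-example-models/llm/models/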
1. Deploy the model.
a. Create a test namespace and add it to the Service Mesh member roll:
export TEST_NS=kserve-demo
oc new-project ${TEST_NS}
oc patch smmr/default -n istio-system --type='json' -p="[{'op': 'add', 'path': '/spec/members/-', 'value': \"$TEST_NS\"}]"
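To confirm that the namespace was added to the mesh, you can inspect the ServiceMeshMemberRoll (an optional check, not part of the original procedure):
oc get smmr/default -n istio-system -o jsonpath='{.spec.members}'
The output is the list of member namespaces and should include kserve-demo.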
b. Create a caikit ServingRuntime. By default, it requests 4 CPUs and 8 Gi of memory. You can adjust these values as required.
oc apply -f ./custom-manifests/caikit/caikit-servingruntime.yaml -n ${TEST_NS}
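For reference, the following is a minimal sketch of what a Caikit ServingRuntime of this kind may contain; the runtime name, image reference, and port are assumptions, and the manifest shipped in custom-manifests/caikit is authoritative:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: caikit-runtime        # assumed name
spec:
  multiModel: false
  supportedModelFormats:
    - name: caikit
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/opendatahub/caikit-tgis-serving:stable   # assumed image
      ports:
        - containerPort: 8085   # assumed gRPC port
          name: h2c
          protocol: TCP
      resources:
        requests:               # the defaults referenced in step b
          cpu: "4"
          memory: 8Gi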
c. Deploy the MinIO data connection and service account.
oc apply -f ./custom-manifests/caikit/storage-config-secret.yaml -n ${TEST_NS}
oc create -f ./custom-manifests/caikit/serviceaccount.yaml -n ${TEST_NS}
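For reference, a storage-config secret for MinIO typically follows the KServe convention of one JSON entry per data connection; the endpoint, credentials, and bucket below are placeholder assumptions, and the manifest in the repository is authoritative:
apiVersion: v1
kind: Secret
metadata:
  name: storage-config
stringData:
  minio: |
    {
      "type": "s3",
      "access_key_id": "<ACCESS_KEY>",
      "secret_access_key": "<SECRET_KEY>",
      "endpoint_url": "http://minio.minio.svc:9000",
      "bucket": "modelmesh-example-models",
      "region": "us-east-1"
    }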
d. Deploy the inference service. It points to the model located in the modelmesh-example-models/llm/models directory.
oc apply -f ./custom-manifests/caikit/caikit-isvc.yaml -n ${TEST_NS}
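For reference, a minimal sketch of what the inference service manifest may look like; the service account and runtime names are assumptions tied to the manifests applied in the previous steps:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: caikit-example-isvc
  annotations:
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
spec:
  predictor:
    serviceAccountName: sa          # assumed; created by serviceaccount.yaml
    model:
      modelFormat:
        name: caikit
      runtime: caikit-runtime       # assumed; matches the ServingRuntime name
      storageUri: s3://modelmesh-example-models/llm/models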
e. Verify that the inference service's READY state is True.
oc get isvc/caikit-example-isvc -n ${TEST_NS}
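The output should be similar to the following (the URL varies with your cluster domain; additional columns are omitted here):
NAME                  URL                                                                       READY
caikit-example-isvc   https://caikit-example-isvc-predictor-kserve-demo.apps.<cluster-domain>   True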
2. Perform inference with gRPC (remote procedure call) commands.
a. Determine whether the HTTP/2 protocol is enabled in the cluster:
oc get ingresses.config/cluster -ojson | grep ingress.operator.openshift.io/default-enable-http2
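When HTTP/2 is already enabled, the grep output includes a line similar to:
"ingress.operator.openshift.io/default-enable-http2": "true",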
If the annotation is set to true, skip to Step 2c.
b. If the annotation is set to false or is not present, enable it:
oc annotate ingresses.config/cluster ingress.operator.openshift.io/default-enable-http2=true
c. Run the following grpcurl command to return all generated tokens in a single call:
export KSVC_HOSTNAME=$(oc get ksvc caikit-example-isvc-predictor -n ${TEST_NS} -o jsonpath='{.status.url}' | cut -d'/' -f3)
grpcurl -insecure -d '{"text": "At what temperature does liquid Nitrogen boil?"}' -H "mm-model-id: flan-t5-small-caikit" ${KSVC_HOSTNAME}:443 caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
The response should be similar to:
{ "generated_token_count": "5", "text": "74 degrees F", "stop_reason": "EOS_TOKEN", "producer_id": { "name": "Text Generation", "version": "0.1.0" } }
d. Run the following grpcurl command to generate a token stream:
grpcurl -insecure -d '{"text": "At what temperature does liquid Nitrogen boil?"}' -H "mm-model-id: flan-t5-small-caikit" ${KSVC_HOSTNAME}:443 caikit.runtime.Nlp.NlpService/ServerStreamingTextGenerationTaskPredict
The response should be similar to:
{ "details": { } } { "tokens": [ { "text": "▁", "logprob": -1.599083423614502 } ], "details": { "generated_tokens": 1 } } { "generated_text": "74", "tokens": [ { "text": "74", "logprob": -3.3622500896453857 } ], "details": { "generated_tokens": 2 } } ....