Model deployment
This section describes how to deploy an LLM to the Caikit model-serving environment.
There are two options for deploying an LLM model. This procedure demonstrates deploying the example Llama-2-7b-chat-hf model with the Caikit + TGIS Serving runtime.
Note: The Llama-2-7b-chat-hf LLM model has been uploaded to the object storage bucket.
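If you want to confirm that the model artifacts are in place before deploying, you can list the bucket with the MinIO client; the mc alias name (minio) and the bucket path shown here are assumptions to adapt to your environment:
# Optional check: list the model directory in the object storage bucket
# (the 'minio' alias and the bucket path are assumptions)
mc ls minio/modelmesh-example-models/llm/models/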
1. Deploy the model.
a. Create a test namespace and add it to the Service Mesh member roll:
export TEST_NS=kserve-demo
oc new-project ${TEST_NS}
oc patch smmr/default -n istio-system --type='json' -p="[{'op': 'add', 'path': '/spec/members/-', 'value': \"$TEST_NS\"}]"
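To confirm that the namespace was added to the mesh, you can inspect the ServiceMeshMemberRoll (an optional check, not part of the original procedure):
oc get smmr/default -n istio-system -o jsonpath='{.spec.members}'
The output is the list of member namespaces and should include kserve-demo.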
b. Create a caikit ServingRuntime. By default, it requests 4 CPUs and 8 Gi of memory. You can adjust these values as required.
oc apply -f ./custom-manifests/caikit/caikit-servingruntime.yaml -n ${TEST_NS}
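For reference, the following is a minimal sketch of what a Caikit ServingRuntime of this kind may contain; the runtime name, image reference, and port are assumptions, and the manifest shipped in custom-manifests/caikit is authoritative:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: caikit-runtime        # assumed name
spec:
  multiModel: false
  supportedModelFormats:
    - name: caikit
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/opendatahub/caikit-tgis-serving:stable   # assumed image
      ports:
        - containerPort: 8085   # assumed gRPC port
          name: h2c
          protocol: TCP
      resources:
        requests:               # the defaults referenced in step b
          cpu: "4"
          memory: 8Gi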
c. Deploy the MinIO data connection and service account.
oc apply -f ./custom-manifests/caikit/storage-config-secret.yaml -n ${TEST_NS}
oc create -f ./custom-manifests/caikit/serviceaccount.yaml -n ${TEST_NS}
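For reference, a storage-config secret for MinIO typically follows the KServe convention of one JSON entry per data connection; the endpoint, credentials, and bucket below are placeholder assumptions, and the manifest in the repository is authoritative:
apiVersion: v1
kind: Secret
metadata:
  name: storage-config
stringData:
  minio: |
    {
      "type": "s3",
      "access_key_id": "<ACCESS_KEY>",
      "secret_access_key": "<SECRET_KEY>",
      "endpoint_url": "http://minio.minio.svc:9000",
      "bucket": "modelmesh-example-models",
      "region": "us-east-1"
    }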
d. Deploy the inference service. It points to the model located in the modelmesh-example-models/llm/models directory.
oc apply -f ./custom-manifests/caikit/caikit-isvc.yaml -n ${TEST_NS}
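For reference, a minimal sketch of what the inference service manifest may look like; the service account and runtime names are assumptions tied to the manifests applied in the previous steps:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: caikit-example-isvc
  annotations:
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
spec:
  predictor:
    serviceAccountName: sa          # assumed; created by serviceaccount.yaml
    model:
      modelFormat:
        name: caikit
      runtime: caikit-runtime       # assumed; matches the ServingRuntime name
      storageUri: s3://modelmesh-example-models/llm/models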
e. Verify that the inference service's READY state is True.
oc get isvc/caikit-example-isvc -n ${TEST_NS}
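The output should be similar to the following (the URL varies with your cluster domain; additional columns are omitted here):
NAME                  URL                                                                       READY
caikit-example-isvc   https://caikit-example-isvc-predictor-kserve-demo.apps.<cluster-domain>   True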
2. Perform inference with gRPC (remote procedure call) commands.
a. Determine whether the HTTP/2 protocol is enabled in the cluster:
oc get ingresses.config/cluster -ojson | grep ingress.operator.openshift.io/default-enable-http2
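When HTTP/2 is already enabled, the grep output includes a line similar to:
"ingress.operator.openshift.io/default-enable-http2": "true",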
If the annotation is set to true, skip to Step 2c.
b. If the annotation is set to false or is not present, enable it:
oc annotate ingresses.config/cluster ingress.operator.openshift.io/default-enable-http2=true
c. Run the following grpcurl command to return all generated tokens in a single call:
export KSVC_HOSTNAME=$(oc get ksvc caikit-example-isvc-predictor -n ${TEST_NS} -o jsonpath='{.status.url}' | cut -d'/' -f3)
grpcurl -insecure -d '{"text": "At what temperature does liquid Nitrogen boil?"}' -H "mm-model-id: flan-t5-small-caikit" ${KSVC_HOSTNAME}:443 caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
The response should be similar to:
{ "generated_token_count": "5", "text": "74 degrees F", "stop_reason": "EOS_TOKEN", "producer_id": { "name": "Text Generation", "version": "0.1.0" } }
d. Run the following grpcurl command to generate a token stream:
grpcurl -insecure -d '{"text": "At what temperature does liquid Nitrogen boil?"}' -H "mm-model-id: flan-t5-small-caikit" ${KSVC_HOSTNAME}:443 caikit.runtime.Nlp.NlpService/ServerStreamingTextGenerationTaskPredict
The response should be similar to:
{ "details": { } } { "tokens": [ { "text": "▁", "logprob": -1.599083423614502 } ], "details": { "generated_tokens": 1 } } { "generated_text": "74", "tokens": [ { "text": "74", "logprob": -3.3622500896453857 } ], "details": { "generated_tokens": 2 } } ....