Ian Roche

Ian Roche is a lead in the ISG AI Solutions team, working on building award-winning AI solutions that enable customers to rapidly adopt AI.


Since joining Dell in 2007, Ian has operated as a Data Scientist, Enterprise Architect, and Developer, and is a member of the Dell Technical Leadership Community (TLC). He has completed a Master's in Artificial Intelligence and has worked on various technologies including LLMs, neural networks, ML, optimization, and Spark. He comes from an Agile background and is passionate about architecture and DevOps.



Scale your Model Deployments with the Dell Enterprise Hub

Ian Roche, Donagh Keeshan

Thu, 05 Sep 2024 20:46:16 -0000


Overview

The Dell Enterprise Hub (https://dell.huggingface.co/) is a game changer for obtaining and using optimized models that run on some of the latest and greatest Dell hardware. The Dell Enterprise Hub has a curated set of models that have been deployed and validated on Dell Hardware.

This blog shows how to go from the Dell Enterprise Hub portal to a running model in minutes, stepping through the setup from the beginning until the containers are running.

Implementation

The Dell Optimized containers are built on top of the TGI framework (https://huggingface.co/docs/text-generation-inference/index). This allows a user to rely on all the existing benefits of the TGI framework while it is optimized for Dell hardware. In addition, once a Dell container is downloaded, it comes preconfigured with all of the required weights. Bundling the weights makes the containers larger to download, but it keeps things simple for the user: no additional searching is needed to get a running system.

In the past, we showed how to run models from the Dell Enterprise Hub with Docker https://infohub.delltechnologies.com/en-us/p/hugging-face-model-deployments-made-easy-by-dell-enterprise-hub/. In this blog, we look at how to scale up model deployments using Kubernetes.  

Kubernetes setup

In the Dell AI Labs, we leveraged the Dell Validated Design for Generative AI to run models from the Dell Enterprise Hub. This blog focuses on deploying onto a Kubernetes cluster of Dell XE9680 servers that each have 8 x H100 NVIDIA GPUs. Support is also available on the Dell Enterprise Hub for 760XA servers that leverage H100 and L40S GPUs. Therefore, there is a lot of flexibility available when deploying a scalable solution on Dell servers from the Dell Enterprise Hub.

Example of an Enterprise Kubernetes deployment

Note: This example shows an Enterprise Kubernetes deployment. Other Kubernetes distributions should also work fine.

Generating Kubernetes YAML

The first step is to log in to the Dell Enterprise Hub and navigate to a model you want to deploy. A sample URL for deploying Llama 3.1 70B with 3 replicas is:

https://dell.huggingface.co/authenticated/models/meta-llama/meta-llama-3.1-70b-instruct/deploy/kubernetes?gpus=4&replicas=3&sku=xe9680-nvidia-h100  

An example of how that looks is below:

Model Deployment

When you select a model to deploy and choose Kubernetes as the deployment option, a Kubernetes YAML snippet is generated. This snippet should be copied into a file, for example deployment.yaml, on a server where you have access to kubectl.
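For orientation, the generated snippet typically has roughly the following shape. This is a trimmed, illustrative sketch only; the image path, GPU count, and port values are placeholders chosen to match this example (4 GPUs, 3 replicas), so always use the exact YAML produced by the portal:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
        - name: tgi
          # Placeholder image path; the portal generates the exact Dell Enterprise Hub image reference
          image: registry.dell.huggingface.co/<optimized-model-image>
          resources:
            limits:
              nvidia.com/gpu: 4
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: tgi-service
spec:
  type: LoadBalancer
  selector:
    app: tgi
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tgi-ingress
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: tgi-service
                port:
                  number: 80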

Then it is as simple as running the following:

kubectl apply -f deployment.yaml
deployment.apps/tgi-deployment created
service/tgi-service created
ingress.networking.k8s.io/tgi-ingress created

Note: The containers come prebuilt with all the weights so some models can be >100 GB. It takes time to download the models for the first time, depending on the size.
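While the image is being pulled and the model weights are loading, you can follow progress from the pod logs. This uses standard kubectl against the Deployment name created above:

kubectl logs -f deployment/tgi-deployment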

Gated Models

Some models may be private, or they may require you to accept terms and conditions on the Hugging Face portal. The deployment above fails if no token is passed, because a token is required for gated models such as Llama.

To solve this, you can specify your token in two ways:

  1. Set your token as an environment variable in the deployment.yaml:
    env:
      - name: HF_TOKEN
        value: "hf_...your-token"
  2. A good alternative is to use the Kubernetes secret feature. A sample of creating a secret for your token is:
kubectl create secret generic hf-secret --from-literal=hf_token=hf_**** --dry-run=client -o yaml | kubectl apply -f -

To use this token in your deployment.yaml you can insert the following:

    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: hf_token

It is important to secure your token and not to post it in any public repository. For full details on how to use tokens see https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/gated_model_access .

Testing the Deployment

Once the deployment is run, the following happens:

  • A Deployment is created that pulls down the Llama 3.1 70B container.
  • A LoadBalancer service is created.
  • An Ingress component (based on nginx) is created.

It is possible that a LoadBalancer and Ingress already exist in your environment; in that case, the deployment.yaml can be adjusted to match it. This example was performed in an empty Kubernetes namespace without issues.

Validate the deployment

In this example, we have deployed three replicas of the Llama 3.1 70B model. Each replica uses 4 x H100 GPUs, which means that our model will be scaled across multiple XE9680 servers.

After the deployment has completed, and the Dell Enterprise Hub containers are downloaded, you can check on the state of your Kubernetes Pods using the following:

kubectl get pods

This does not show where the pods are scheduled, so the following command can be run to get the nodes that they are deployed to:

kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name

This shows something similar to the following:
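The output below is illustrative (the pod name suffixes are generated by Kubernetes); the NODE column is the part of interest:

NODE      NAME
node002   tgi-deployment-5f6c9b7d8b-7xkqp
node005   tgi-deployment-5f6c9b7d8b-m2ldj
node005   tgi-deployment-5f6c9b7d8b-tr9vw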

Here the pods running Llama 3.1 70B are distributed across 2 different XE9680 servers (node005 and node002). The replicas can be increased as long as resources are available on the cluster.

Invoke the models

The Dell Enterprise Hub containers expose HTTP endpoints that can be used to perform queries in various formats. The full swagger definition of the API is available at https://huggingface.github.io/text-generation-inference/#/ .

Since we have deployed three replicas across multiple servers, it is necessary to access the HTTP endpoints through the Kubernetes LoadBalancer service. When running the models for inference, it is important to note that the calls are stateless and there is no guarantee that subsequent requests go to the same model. To get the IP of the LoadBalancer you can run:

kubectl describe services tgi-service
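In the output, look for the LoadBalancer Ingress field. A trimmed, illustrative excerpt (the IP here matches the placeholder used in the next command) looks like this:

Name:                     tgi-service
Type:                     LoadBalancer
LoadBalancer Ingress:     123.4.5.6
Port:                     <unset>  80/TCP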

For a simple test we can use the “generate” endpoint to POST a simple query to the model that we deployed in the previous step:

 curl 123.4.5.6:80/generate    \
  -X POST   \
  -d '{"inputs":"What is Dell Technologies World?", "parameters":{"max_new_tokens":50}}'   \
  -H 'Content-Type: application/json'

This produces the following output:

{"generated_text":" Dell Technologies World is an annual conference held by Dell Technologies, a multinational technology company. The conference is designed to bring together customers, partners, and industry experts to share knowledge, showcase new products and services, and network with others in the technology industry.\n"}

In the example above, the response is successfully generated and keeps within the maximum limit of 50 tokens that was specified in the query.
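If you want to push several stateless requests through the LoadBalancer in one go, a minimal bash loop such as the one below works. It reuses the placeholder IP from above; the replica that serves each request is chosen by the Service and is not visible in the response:

for i in 1 2 3 4 5; do
  curl 123.4.5.6:80/generate \
    -X POST \
    -d '{"inputs":"What is Dell Technologies World?", "parameters":{"max_new_tokens":50}}' \
    -H 'Content-Type: application/json'
  echo
done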

Conclusion

The Dell Enterprise Hub simplifies the deployment and execution of the latest AI models. The prebuilt optimized containers deploy seamlessly on Dell Hardware.

In this blog we showed how quick and easy it is to run the latest Llama 3.1 model on a scalable Kubernetes cluster built on Dell servers.

For more information, see Simplifying AI: Dell Enterprise Hub Enables Kubernetes Deployment for AI Models.



Introduction to Using the GenAI-Perf Tool on Dell Servers

Fabricio Bronzati, Srinivas Varadharajan, Ian Roche

Mon, 22 Jul 2024 13:16:51 -0000


Overview

Performance analysis is a critical component that ensures the efficiency and reliability of models during the inference phase of a machine learning life cycle. For example, using a concurrency mode or request rate mode to simulate load on a server helps you understand various load conditions that are crucial for capacity planning and resource allocation. Depending on the use case, the analysis helps to replicate real-world scenarios. It can optimize performance to maintain a specific concurrency of incoming requests to the server, ensuring that the server can handle constant load or bursty traffic patterns. Providing a comprehensive view of the models’ performance enables data scientists to build models that are not only accurate but also robust and efficient.

Triton Performance Analyzer is a CLI tool that analyzes and optimizes the performance of Triton-based systems. It provides detailed information about the systems’ performance, including metrics related to GPU, CPU, and memory. It can also collect custom metrics using Triton’s C API. The tool supports various inference load modes and performance measurement modes.

The Triton Performance Analyzer can help identify performance bottlenecks, optimize system performance, troubleshoot issues, and more. In the suite of Triton’s performance analysis tools, the recently released GenAI-Perf uses Perf Analyzer in the backend. The GenAI-Perf tool can be used to gather various LLM metrics.

This blog focuses on the capabilities and use of GenAI-Perf.

GenAI-Perf

GenAI-Perf is a command-line performance measurement tool that is customized to collect metrics that are more useful when analyzing an LLM’s performance. These metrics include output token throughput, time to first token, inter-token latency, and request throughput.

The metrics can:

  • Analyze the system performance
  • Determine how quickly the system starts processing the request
  • Provide the overall time taken by the system to completely process a request
  • Retrieve a granular view of how fast the system can generate individual parts of the response
  • Provide a general view on the system’s efficiency in generating tokens

This blog also describes how to collect these metrics and automatically create plots using GenAI-Perf. 

Implementation

The following steps guide you through the process of using GenAI-Perf. In this example, we collect metrics from a Llama 3 model.

Triton Inference Server

Before running the GenAI-Perf tool, launch Triton Inference Server with your Large Language Model (LLM) of choice.

The following procedure starts Llama 3 70B and runs on Triton Inference Server v24.05. For more information about how to convert Hugging Face weights to run on Triton Inference Server, see the Converting Hugging Face Large Language Models to TensorRT-LLM blog.

The following example shows a sample command to start a Triton container:

docker run --rm -it --net host --shm-size=128g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v $(pwd)/llama3-70b-instruct-ifb:/llama_ifb \
-v $(pwd)/scripts:/opt/scripts \
nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3

Because Llama 3 is a gated model distributed by Hugging Face, you must request access to Llama 3 using Hugging Face and then create a token. For more information about Hugging Face tokens, see https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/gated_model_access

An easy method to use your token with Triton is to log in to Hugging Face, which caches a local token:

 huggingface-cli login --token hf_Enter_Your_Token

The following example shows a sample command to start the inference:

python3 /opt/scripts/launch_triton_server.py --model_repo /llama_ifb/ --world_size 4
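Before starting the benchmark, you can optionally confirm from the host that the server reports ready. This check uses Triton's standard HTTP health endpoint and assumes the default HTTP port 8000, which is reachable directly because the container was started with --net host; a 200 response code indicates the server is ready to accept inference requests:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready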

NIM for LLMs

Another method to deploy the Llama 3 model is to use the NVIDIA Inference Microservices (NIM). For more information about how to deploy NIM on the Dell PowerEdge XE9680 server, see Introduction to NVIDIA Inference Microservices, aka NIM. Also, see NVIDIA NIM for LLMs - Introduction.

The following example shows a sample script to start NIM with Llama 3 70b Instruct:

export NGC_API_KEY=<enter-your-key>
export CONTAINER_NAME=meta-llama3-70b-instruct
export IMG_NAME="nvcr.io/nim/meta/llama3-70b-instruct:1.0.0"
export LOCAL_NIM_CACHE=/aipsf810/project-helix/NIM/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
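Once NIM is up, a quick sanity check is to call its OpenAI-compatible chat completions endpoint on the published port. This is an illustrative request only; the model name matches the one used with GenAI-Perf later in this blog:

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta/llama3-70b-instruct", "messages": [{"role": "user", "content": "What is Dell Technologies World?"}], "max_tokens": 50}'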

Triton SDK container

After the Triton Inference container is launched and the inference is started, run the Triton Server SDK:

docker run -it --net=host --gpus=all \
nvcr.io/nvidia/tritonserver:24.05-py3-sdk 

You can install the GenAI-Perf tool using pip. In our example, we use the NGC container, which is easier to use and manage.
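If you do prefer the pip route, installation is typically a single command in a recent Python environment (shown for reference; check the GenAI-Perf documentation for the currently supported package and Python versions):

pip install genai-perf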

Measure throughput and latency

When the containers are running, log in to the SDK container and run the GenAI-Perf tool.

 The following example shows a sample command:

genai-perf \
-m ensemble \
--service-kind triton \
--backend tensorrtllm \
--num-prompts 100 \
--random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
--tokenizer hf-internal-testing/llama-tokenizer \
--concurrency 1 \
--generate-plots \
--measurement-interval 10000 \
--profile-export-file profile_export.json \
--url localhost:8001

This command produces values similar to the values in the following table: 

Statistic                  | Average       | Minimum       | Maximum       | p99           | p90           | p75
Time to first token (ns)   | 40,375,620    | 37,453,094    | 74,652,113    | 69,046,198    | 39,642,518    | 38,639,988
Inter token latency (ns)   | 17,272,993    | 5,665,738     | 19,084,237    | 19,024,802    | 18,060,240    | 18,023,915
Request latency (ns)       | 1,815,146,838 | 1,811,433,087 | 1,850,664,440 | 1,844,310,335 | 1,814,057,039 | 1,813,603,920
Number of output tokens    | 108           | 100           | 123           | 122           | 116           | 112
Number of input tokens     | 200           | 200           | 200           | 200           | 200           | 200

Output token throughput (per sec): 59.63
Request throughput (per sec): 0.55

See Metrics at  https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html#metrics for more information.

To run the performance tool with NIM, you must change parameters such as the model name, service-kind, endpoint-type, and so on, as shown in the following example: 

genai-perf \
-m meta/llama3-70b-instruct \
--service-kind openai \
--endpoint-type chat \
--backend tensorrtllm \
--num-prompts 100 \
--random-seed 123 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--streaming \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--tokenizer hf-internal-testing/llama-tokenizer \
--concurrency 1 \
--measurement-interval 10000 \
--profile-export-file profile_export.json \
--url localhost:8000 

Results

The GenAI-Perf tool saves the output to the artifacts directory by default. Each run creates an artifacts/<model-name>-<service-kind>-<backend>-concurrency<N> directory.

The following example shows a sample directory:

ll artifacts/ensemble-triton-tensorrtllm-concurrency1/
total 2800
drwxr-xr-x  3 user user     127 Jun 10 13:40 ./
drwxr-xr-x 10 user user    4096 Jun 10 13:34 ../
-rw-r--r--  1 user user   16942 Jun 10 13:40 all_data.gzip
-rw-r--r--  1 user user  126845 Jun 10 13:40 llm_inputs.json
drwxr-xr-x  2 user user    4096 Jun 10 12:24 plots/
-rw-r--r--  1 user user 2703714 Jun 10 13:40 profile_export.json
-rw-r--r--  1 user user     577 Jun 10 13:40 profile_export_genai_perf.csv

The profile_export_genai_perf.csv file provides the same results that are displayed during the test.

You can also plot charts that are based on the data automatically. To enable this feature, include --generate-plots in the command.

The following figure shows the distribution of input tokens to generated tokens. This metric is useful to understand how the model responds to different lengths of input.


Figure 1:  Distribution of input tokens to generated tokens

The following figure shows a scatter plot of how token-to-token latency varies with output token position. These results show how quickly tokens are generated and how consistent the generation is regarding various output token positions.


Figure 2: Token-to-token latency compared to output token position

Conclusion

Performance analysis during the inference phase is crucial as it directly impacts the overall effectiveness of a machine learning solution. Tools such as GenAI-Perf provide comprehensive information that helps anyone looking to deploy optimized LLMs in production. The NVIDIA Triton suite has been extensively tested on Dell servers and can be used to capture important LLM metrics with minimum effort. The GenAI-Perf tool is easy to use and produces extensive data that can be used to tune your LLM for optimal performance.


Hugging Face Model Deployments Made Easy by Dell Enterprise Hub

Ian Roche, Bala Rajendran

Mon, 20 May 2024 14:46:02 -0000


Overview

The Dell Enterprise Hub (https://dell.huggingface.co/) is a game changer for obtaining and using optimized models that run on some of the latest and greatest Dell hardware. The Dell Enterprise Hub has a curated set of models that have been containerized with all the software dependencies optimized to run and validated on Dell Hardware.

This blog shows how a user can go from the Dell Enterprise Hub portal to a running model in minutes. We will step through the setup from the beginning until one or more containers are running.

Implementation

The Dell Optimized containers are built on top of the TGI framework (https://huggingface.co/docs/text-generation-inference/index). This allows a user to rely on all the existing benefits of TGI while it is optimized for Dell hardware. In addition, once a Dell container is downloaded, it comes preconfigured with all the required model weights, so no additional searching is needed to have a running system. The trade-off is larger containers, in exchange for simplicity and a lower risk of accidentally running incorrect model weights.

In this blog we look at the simpler case of deploying a model for inference. There are also containers that can be used for model training and fine-tuning; these will be covered in a future blog.

Server setup

During our testing we worked on different Dell Servers and GPUs. In this example we will focus on the 760xa servers for inference.

Hardware

CPU     | 2 x Intel(R) Xeon(R) Gold 6438M (32 cores each)
Memory  | 512GB (16 x 32GB)
Storage | 2TB local storage + PowerScale F600 mounted share
GPU     | 4 x NVIDIA L40S

This server has the capacity to run multiple inference sessions in parallel and is configured with the maximum number of GPUs that it supports. If more GPUs are required for your model, an XE9680, which hosts up to 8 GPUs, can be used instead.

Software

The software stack along with the versions we used are below:

  • Ubuntu 22.04
  • Docker 24.0.6
  • NVIDIA Container Toolkit 1.14.2

It's likely that other versions also work but this was what was running in our lab.

Top tip: We missed installing the NVIDIA Container Toolkit on one server. To avoid this, you can check whether the toolkit is working by running a test container with the following command:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If the toolkit is missing, follow the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.

Optimized containers from the Dell Enterprise Hub

The optimized containers from Dell Enterprise Hub have three basic requirements - Docker, NVIDIA Container Toolkit, and Linux OS on Dell PowerEdge Platforms.

Select a model

The Dell Enterprise Hub contains an expanding set of models that are optimized to run on Dell hardware. To select a model, go to https://dell.huggingface.co/, log in using your Hugging Face username, and select your model of choice. Check out the Model Selection Made Easy by Dell Enterprise Hub blog. It is also possible to use your own fine-tuned model, but for this test we will use a prebuilt Llama 3 8B model. For more details on how to use the portal see AI Made Easy Unleash the Potential of Dell Enterprise Hub on Hugging Face.

See below for a sample portal screen for deployment of the Llama 3 8B model on a Dell 760xa with L40S GPUs:  

Sample portal screen for Llama 3 8B deployment on a Dell 760xa with L40S GPUs

See the Dell Enterprise Hub Deploy page for Meta Llama 3 8B instructions.

The models on the Dell Enterprise Hub fall under three broad categories of licenses: Apache 2.0, Llama 3, and Gemma. Even though all of these models are permissive for enterprise usage, you will have to accept terms and conditions before accessing the models.

Container deployment

From the portal above the following Docker run command was generated:

docker run \
    -it \
    --gpus 2 \
    --shm-size 1g \
    -p 80:80 \
    -e NUM_SHARD=2 \
    -e MAX_BATCH_PREFILL_TOKENS=16182 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \
    registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3-8b-instruct

This command can be performed as-is on your Dell server and the model will be pulled locally and run.

Note: The containers come prebuilt with all the weights, so some models can be more than 100 GB. This Llama 3 8B model is ~27 GB.

When running a standard Docker command (as above), there is no link between your Hugging Face account and the running model; secure access to gated models requires your token.

To solve this, you can specify your Hugging Face Hub token in two ways:

  1. Set your token as an environment variable "HUGGING_FACE_HUB_TOKEN"
  2. Add it to each Docker container run command: "-e HUGGING_FACE_HUB_TOKEN=$token"

It is important to secure your token and not to post it in any public repo. For full details on how to use tokens see How to use HF tokens and for token generation see Token Generation.

Testing the deployment

The TGI containers expose HTTP endpoints that can be used to perform queries in various formats. The full swagger definition of the API is available at https://huggingface.github.io/text-generation-inference/#/.

For a simple test we can use the “generate” endpoint to POST a simple query to the model that we ran in the previous step:

 curl 127.0.0.1:80/generate    \
  -X POST   \
  -d '{"inputs":"What is Dell Technologies World?", "parameters":{"max_new_tokens":50}}'   \
  -H 'Content-Type: application/json'

This produces the following output:

{"generated_text":" Dell Technologies World is an annual conference held by Dell Technologies, a multinational technology company. The conference is designed to bring together customers, partners, and industry experts to share knowledge, showcase new products and services, and network with others in the technology industry.\n"}

As can be seen, the response is generated and stays within the maximum limit of 50 tokens that was specified in the query.

Conclusion

The Dell Enterprise Hub simplifies the deployment and execution of the latest AI models. The prebuilt containers run seamlessly on Dell Hardware.

In this example we showed how quick and easy it is to run the latest Llama 3 model on a 760xa with L40S GPUs. The Dell Enterprise Hub also supports training and fine-tuning models on Dell Hardware.  

Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.



The Future of AI Using LiDAR

Ian Roche, Philip Hummel

Tue, 30 Jan 2024 14:48:31 -0000


Introduction

Light Detection and Ranging (LiDAR) is a method for determining the distance from a sensor to an object or a surface by sending out a laser beam and measuring the time for the reflected light to return to the receiver. We recently designed a solution to understand how using data from multiple LiDAR sensors monitoring a single space can be combined into a three-dimensional (3D) perceptual understanding of how people and objects flow and function within public and private spaces. Our key partner in this research is Seoul Robotics, a leader in LiDAR 3D perception and analytics tools.

Most people are familiar with the use of LiDAR on moving vehicles to detect nearby objects, which has become popular in transportation applications. Stationary LiDAR is now becoming more widely adopted for 3D imaging in applications where cameras have traditionally been used.

Multiple sensor LiDAR applications can produce a complete 3D grid map with precise depth and location information for objects in the jointly monitored environment. This technology overcomes several limitations of 2D cameras. Using AI, LiDAR systems can improve the quality of analysis results for data collected during harsh weather conditions like rain, snow, and fog. Furthermore, LiDAR is more robust than optical cameras for conditions where the ambient lighting is low or produces reflections and glare.

Another advantage of LiDAR for computer vision is related to privacy protection. The widespread deployment of high-resolution optical cameras has raised concerns regarding the potential violation of individual privacy and misuse of the data. 

LiDAR 3D perception is a promising alternative to traditional camera systems. LiDAR data does not contain biometric data that could be cross-referenced with other sources to identify individuals uniquely. This approach allows operators to track anonymous objects while maintaining individuals' privacy. Therefore, it is essential to consider replacing or augmenting such cameras to reduce the overhead of ensuring that data is secure and used appropriately.

Challenges

Worldwide, organizations use AI-enabled computer vision solutions to create safer, more efficient public and private spaces using only optical, thermal, and infrared cameras. Data scientists have developed many machine learning and deep neural network tools to detect and label objects using data from these different camera types.

As LiDAR becomes vital for the reasons discussed above, organizations are investigating their options for whether LiDAR is best deployed alongside traditional cameras or if there are opportunities to design new systems using LiDAR sensors exclusively. It is rare when existing cameras can be replaced with LiDAR sensors mounted in the exact locations used today.

An example deployment of 2 LiDAR sensors for a medium-sized room is below:

 

Detecting the position of the stationary objects and people moving through this space (flow and function) with LiDAR requires careful placement of the sensors, calibration of the room's geometry, and data processing algorithms that can extract information from both sensors without distortion or duplications. Collecting and processing LiDAR data for 3D perception requires a different toolset and expertise, but companies like Seoul Robotics can help.

Another aspect of LiDAR systems design that needs to be evaluated is data transfer requirements. In most large environments using camera deployments today (e.g., airport/transportation hubs, etc.), camera data is fed back to a centralized hub for real-time processing. 

A typical optical camera in an AI computer vision system would have a resolution and refresh rate of 1080@30FPS. This specification would translate to ~4Mb/s of network traffic per camera. Even with older network technology, thousands of cameras can be deployed and processed. 

There is a significant increase in the density of the data produced and processed for LiDAR systems compared to video systems. A currently available 32-channel LiDAR sensor will produce between 25Mb/s and 50Mb/s of data on the network segment between the device and the AI processing node. Newer high-density 128-channel LiDAR sensors consume up to 256Mb/s of network bandwidth, so something will need to change from the current strategy of centralized data processing. 
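To put that in perspective, under the assumptions above, 100 optical cameras consume roughly 100 x 4 Mb/s = 400 Mb/s, whereas 100 of the high-density 128-channel LiDAR sensors could require up to 100 x 256 Mb/s = 25.6 Gb/s on the same network segment.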

Technical Solution

It is not feasible to design a system that will consume the entire network capacity of a site with LiDAR traffic. In addition, it can also be challenging and expensive to upgrade the site's private network to handle higher speeds. The most efficient solution, therefore, is to design a federated solution for processing LiDAR data closer to the location of the sensors.

Centralized and federated architectures for processing LiDAR data

 

With a switch to the architecture in the right-side panel above, it is possible to process multiple LiDAR sensors closer to where they are mounted at the site and only send any resulting alerts and events back to a central location (primary node) for further processing and triggering corrective actions. This approach avoids the costly transfer of dense LiDAR data across long network segments. 

It is important to note that processing LiDAR data with millions of points per second requires significant computational capability. We also validated that leveraging the massive parallel computing power of GPUs like the NVIDIA A2 greatly enhanced the object detection accuracy in the distributed processing nodes. The Dell XR4000 series of rugged Dell servers should be a good option for remote processing in many environments.

Conclusion

LiDAR is becoming increasingly important in designing AI for computer vision solutions due to its ability to handle challenging lighting situations and enhance user privacy. LiDAR differs from video cameras, so planning the deployment carefully is essential.

LiDAR systems can be designed in either a central or federated manner, or even a mix of both. The rapidly growing network bandwidth requirements of LiDAR may force a rethink of how systems for AI-enabled data processing are deployed, sooner rather than later.

For more details on CV 3D Flow and Function with LiDAR see Computer Vision 3D Flow and Function AI with LiDAR.



Optimizing Computer Vision Workloads: A Guide to Selecting NVIDIA GPUs

Philip Hummel, Ian Roche

Fri, 27 Oct 2023 15:31:21 -0000


Introduction

Long gone are the days when facilities managers and security personnel were required to be in a control room with their attention locked onto walls of video monitors. The development of lower-cost and more capable video cameras, more powerful data science computing platforms, and the need to reduce operations overhead have caused the deployment of video management systems (VMS) and computer vision analytics applications to skyrocket in the last ten years in all sectors of the economy. Modern computer vision applications can detect a wide range of events without constant human supervision, including overcrowding, unauthorized access, smoke detection, vehicle operation infractions, and more. Better situational awareness of their environments can help organizations achieve better outcomes for everyone involved.

Table 1 – Outcomes achievable with better situational awareness

Increased operational efficiencies | Leverage all the data that you capture to deliver high-quality services and improve resource allocation.
Optimized safety and security      | Provide a safer, more real-time aware environment.
Enhanced experience                | Provide a more positive, personalized, and engaging experience for both customers and employees.
Improved sustainability            | Measure and lower your environmental impact.
New revenue opportunities          | Unlock more monetization opportunities from your data with more actionable insights.

 

The technical challenge

Computer vision analytics uses various techniques and algorithms, including object detection, classification, feature extraction, and more. The computation resources that are required for these tasks depend on the resolution of the source video, frame rates, and the complexity of both the scene and the types of analytics being processed. The diagram below shows a simplified  set of steps (pipeline) that is frequently implemented in a computer vision application.

Figure 1: Logical processing pipeline for computer vision

Inference is the step that most people are familiar with. A trained algorithm can distinguish between a passenger automobile and a delivery van, similar to the classic dogs versus cats example often used to explain computer vision. While the other steps are less familiar to the typical user of computer vision applications, they are critical to achieving good results and require dedicated graphics processing units (GPUs). For example, the Decode/Encode steps are tuned to leverage hardware that resides on the GPU to provide optimal performance.

Given the extensive portfolio of NVIDIA GPUs available today, organizations that are getting started with computer vision applications often need help understanding their options. We have tested the performance of computer vision analytics applications with various models of NVIDIA GPUs and collected the results. The remainder of this article provides background on the test results and our choice of model.

Choosing a GPU

The market for GPUs is broadly divided into data center, desktop, and mobility products. The workload that is placed on a GPU when training large image classification and detection models is almost exclusively performed on data center GPUs. Once these models are trained and delivered in a computer vision application, multiple CPU and GPU resource options can be available at run time. Small facilities, such as a small retailer with only a few cameras, can afford to deploy only a desktop computer with a low-power GPU for near real-time video analytics. In contrast, large organizations with hundreds to thousands of cameras need the power of data center-class GPUs.

However, all data center GPUs are not created equal. The table below compares selected characteristics for a sample of NVIDIA data center GPUs. The FP32 floating point calculations per second metric indicates the relative performance that a developer can expect on either model training or the inference stage of the typical pipeline used in a computer vision application, as discussed above.

The capability of the GPU for performing other pipeline elements required for high-performance computer vision tasks, including encoding/decoding, is best reflected by the Media Engines details.

First, consider the Media Engines entry for the A30 GPU. The A30 has 1 JPEG decoder and 4 video decoders, but no video encoders. This configuration makes the A30 incompatible with the needs of many market-leading computer vision application vendors' products, even though it is a data center GPU.

Table 2: NVIDIA Ampere architecture GPU characteristics

 

GPU | FP32 (Tera Flops) | Memory (GB)       | Media Engines                                                 | Power (Watts)
A2  | 4.5               | 16 GDDR6          | 1 video encoder, 2 video decoders (includes AV1 decode)      | 40-60 (Configurable)
A16 | 4x 4.5            | 4x 16 GDDR6       | 4 video encoders, 8 video decoders (includes AV1 decode)     | 250
A30 | 10.3              | 24 GB HBM2        | 1 JPEG decoder, 4 video decoders, 1 optical flow accelerator | 165
A40 | 37.4              | 48 GDDR6 with ECC | 1 video encoder, 2 video decoders (includes AV1 decode)      | 300

 

Comparing the FP32 TFLOPS between the A30 and A40 shows that the A40 is a more capable GPU for training and pure inference tasks. During our testing, the computer vision applications quickly exhausted the available Media Engines on the A40. Selecting a GPU for computer vision requires matching the available resources needed for computer vision including media engines, available memory, and other computing capabilities that can be different across use cases.

Next, examining the Media Engines description for the A2 GPU confirms that the product houses 1 video encoder and 2 video decoders. This card will meet the needs of most computer vision applications and is supported for data center use; however, the low number of encoders and decoders, memory, and floating point processing will limit the number of concurrent streams that can be processed. The low power consumption of the A2 increases the flexibility of choice of server for deployment, which is important for edge and near-edge scenarios.

Still focusing on the table above, compare all the characteristics of the A2 GPU with the A16 GPU. Notice that there are four times the resources on the A16 versus the A2. This can be explained by looking at the diagram below. The A16 was constructed by putting four A2 “engines” on a single PCI card. Each of the boxes labeled GPU0-GPU3 contains all the memory, media engines, and other processing capabilities that you would have available to a server that had a standard A2 GPU card installed. Also notice that the A16 requires approximately 4 times the power of an A2.

 

 

The table below shows the same metric comparison used in the discussion above for the newest NVIDIA GPU products based on the Ada Lovelace architecture. The L4 GPU offers 2 encoders and 4 decoders for a card that consumes just 72 W. Compared with the 1 encoder and 2 decoder configuration on the A2 at 40 to 60 W, the L4 should be capable of processing many more video streams for less power than two A2 cards. The L40 with 3 encoders and 3 decoders is expected to be the new computer vision application workhorse for organizations with hundreds to thousands of video streams. While the L40S has the same number of Media Engines and memory as the L40, it was designed to be an upgrade/replacement for the A100 Ampere architecture training and/or inference computing leader.

 

GPU  | FP32 (Tera Flops) | Memory (GB)      | Media Engines                                                            | Power (Watts)
L4   | 30.3              | 24 GDDR6 w/ ECC  | 2 video encoders, 4 video decoders, 4 JPEG decoders (includes AV1 decode) | 72
L40  | 90.5              | 48 GDDR6 w/ ECC  | 3 video encoders, 3 video decoders                                       | 300
L40S | 91.6              | 48 GDDR6 w/ ECC  | 3 video encoders, 3 video decoders                                       | 350

 

Conclusion

In total, seven different NVIDIA GPU cards that are useful for CV workloads were discussed. From the Ampere family of cards, we found that the A16 performed well for a wide variety of CV inference workloads. The A16 provides a good balance of video decoders/encoders, CUDA cores, and memory for computer vision workloads.

For the newer Ada Lovelace family of cards, the L40 looks like a well-balanced card with great throughput potential. We are currently testing this card in our lab and will provide a future blog on its performance for CV workloads.

References

A2 - https://www.nvidia.com/content/dam/en-zz/solutions/data-center/a2/pdf/a2-datasheet.pdf

A16 - https://images.nvidia.com/content/Solutions/data-center/vgpu-a16-datasheet.pdf

A30 - https://www.nvidia.com/en-us/data-center/products/a30-gpu/

A40 - https://images.nvidia.com/content/Solutions/data-center/a40/nvidia-a40-datasheet.pdf

L4 - https://www.nvidia.com/en-us/data-center/l4/

L40 - https://www.nvidia.com/en-us/data-center/l40/

L40S - https://www.nvidia.com/en-us/data-center/l40s/



Who’s watching your IP cameras?

Ian Roche, Philip Hummel

Thu, 20 Jul 2023 18:05:50 -0000


Introduction

In today’s world, the deployment of security cameras is a common practice. In some public facilities like airports, travelers can be in view of a security camera 100% of the time. The days of security guards watching banks of video panels fed from hundreds of security cameras are quickly being replaced by computer vision systems powered by artificial intelligence (AI). Today’s advanced analytics can be performed on many camera streams in real time without a human in the loop. These systems not only enhance personal safety but also provide other benefits, including better passenger and shopping experiences.

Modern IP cameras are complex devices. In addition to recording video streams at increasingly higher resolutions (4K is now common), they can also encode and send those streams over traditional Internet Protocol (IP) networks to downstream systems for additional analytic processing and eventual archiving. Some cameras on the market today have enough onboard computing power and storage to evaluate AI models and perform analytics right on the camera.

The Problem

The development of IP-connected cameras provided great flexibility in deployment by eliminating the need for specialized cables. IP cameras are so easy to plug into existing IT infrastructure that almost anyone can do it. However, since most camera vendors use a modified version of an open-source Linux operating system, IT and security professionals realize there are hundreds or thousands of customized Linux servers mounted on walls and ceilings all over their facilities. Whether you are responsible for <10 cameras at a small retail outlet or >5000 at an airport facility, the question remains: “How much exposure do all those cameras pose from cyber-attacks?”

The Research

To understand the potential risk posed by IP cameras, we assembled a lab environment with multiple camera models from different vendors. Some cameras were thought to be up to date with the latest firmware, and some were not. 

Working in collaboration with the Secureworks team and their suite of vulnerability and threat management tools, we assessed a strategy for detecting IP camera vulnerabilities. Our first choice was to implement their Secureworks Taegis™ VDR vulnerability scanning software to scan our lab IP network and discover any camera vulnerabilities. VDR provides a risk-based approach to managing vulnerabilities, driven by automated and intelligent machine learning.

We planned to discover the cameras with older firmware and document their vulnerabilities.  Then we would have the engineers upgrade all firmware and software to the latest patches available and rescan to see if all the vulnerabilities were resolved.

Findings

Once the SecureWorks Edge agent was set up in the lab, we could easily add all the IP ranges that might be connected to our cameras. All the cameras on those networks were identified by SecureWorks VDR and automatically added to the VDR AWS cloud-based reporting console. 

Discovering Camera Vulnerabilities

The results of the scans were surprising. Almost all discovered cameras had some Critical issues identified by the VDR scanning. In one case, even after a camera was upgraded to the latest firmware available from the vendor, VDR found Critical software and configuration vulnerabilities, as described below:

One of the remaining critical issues was the result of an insecure FTP username/password that was not changed from the vendor’s default settings before the camera was put into service. These types of procedural lapses should not happen, but inadvertently they are bound to. The password-hardening mistake was easily caught by a VDR scan, so another common cybersecurity risk could be dealt with. This issue is not related to firmware; rather, it is a combination of vendors needing not to ship with a well-known FTP login and users needing to remember to harden that login.

Another example of the types of Critical issues you can expect when dealing with IP cameras relates to discovering an outdated library dependency found on the camera. The library is required by the vendor software but was not updated when the latest camera firmware patches were applied.

Camera Administration Consoles

The VDR tool will also detect if a camera is exposing any HTTP sites/services and look for vulnerabilities there. Most IP cameras ship with an embedded HTTP server so administrators can access the cameras' functionality and perform maintenance.  Again, considering the number of deployed cameras, this represents a huge number of websites that may be susceptible to hacking.  Our testing found some examples of the type of issues that a camera’s web applications can expose:

The scan of this device found an older version of the Apache web server software and outdated SSL libraries in use for this camera’s website, which should be considered a critical vulnerability.

Conclusion

In this article, we have tried to raise awareness of the significant cybersecurity risk that IP cameras pose to organizations, both large and small. Providing effective video recording and analysis capabilities is much more than simply mounting cameras on the wall and walking away. IT and security professionals must ask, “Who’s watching our IP cameras?” Each camera should be continuously patched to the latest version of firmware and software - and scanned with a tool like Secureworks VDR. If vulnerabilities still exist after scanning and patching, it is critical to engage with your camera vendor to remediate the issues that may adversely impact your organization if neglected. Someone will be watching your IP cameras; let’s ensure they don’t conflict with your best interests.

Dell Technologies is at the forefront of delivering enterprise-class computer vision solutions.  Our extensive partner network and key industry stakeholders have allowed us to develop an award-winning process that takes customers from ideation to full-scale implementation faster and with less risk.  Our outcomes-based process for computer vision delivers:

  • Increased operational efficiencies: Leverage all the data you’re capturing to deliver high-quality services and improve resource allocation.
  • Optimized safety and security: Provide a safer, more real-time aware environment
  • Enhanced experience: Provide a more positive, personalized, and engaging experience for customers and employees.
  • Improved sustainability: Measure and lower your environmental impact.
  • New revenue opportunities: Unlock more monetization opportunities from your data with more actionable insights

Where to go next...

Beyond the platform - How Dell Technologies is leading the industry with an outcomes-based process for computer vision

Dell Technologies Workload Solutions for Computer Vision

Secureworks

Virtualized Computer Vision for Smart Transportation with Genetec

Virtualized Computer Vision for Smart Transportation with Milestone