Thu, 05 Sep 2024 20:46:16 -0000
The Dell Enterprise Hub (https://dell.huggingface.co/) is a game changer for obtaining and using optimized models that run on some of the latest and greatest Dell hardware. The Dell Enterprise Hub has a curated set of models that have been deployed and validated on Dell Hardware.
This blog shows how a user can go from the Dell Enterprise Hub portal to a running model in minutes and will step you through the setup from the beginning until the containers are running.
The Dell Optimized containers are built on top of the TGI framework (https://huggingface.co/docs/text-generation-inference/index). This allows a user to rely on all the existing benefits of the TGI framework while it is optimized for Dell hardware. In addition, once a Dell container is downloaded, it comes preconfigured with all of the required model weights. Bundling the weights makes the containers larger to download, but it keeps things simple for the user: no additional searching is needed to have a running system.
In the past, we showed how to run models from the Dell Enterprise Hub with Docker https://infohub.delltechnologies.com/en-us/p/hugging-face-model-deployments-made-easy-by-dell-enterprise-hub/. In this blog, we look at how to scale up model deployments using Kubernetes.
In the Dell AI Labs, we leveraged the Dell Validated Design for Generative AI to run models from the Dell Enterprise Hub. This blog focuses on deploying to a Kubernetes cluster of Dell XE9680 servers that each have 8 x NVIDIA H100 GPUs. The Dell Enterprise Hub also supports 760xa servers that leverage H100 and L40S GPUs, so there is a lot of flexibility available when deploying a scalable solution on Dell servers from the Dell Enterprise Hub.
Note: This example shows an Enterprise Kubernetes deployment. Other Kubernetes distributions should also work fine.
The first step is to log in to the Dell Enterprise Hub and navigate to a model you want to deploy. The example below shows the deployment options for Llama 70B with three replicas:
When you select a model and choose Kubernetes as the deployment method, a Kubernetes YAML snippet is generated. Copy this snippet into a file, for example deployment.yaml, on a server where you have access to kubectl.
Then it is as simple as running the following:
>kubectl apply -f deployment.yaml
deployment.apps/tgi-deployment created
service/tgi-service created
ingress.networking.k8s.io/tgi-ingress created
Note: The containers come prebuilt with all the weights, so some models can be more than 100 GB. Downloading a model for the first time takes time, depending on its size.
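While you wait, progress can be followed with standard kubectl commands; this is a minimal sketch that assumes the deployment name tgi-deployment from the generated snippet:

# Wait for the rollout to complete (the image pull can take a while for large models)
kubectl rollout status deployment/tgi-deployment

# Follow the model server logs from one of the pods in the deployment
kubectl logs -f deployment/tgi-deployment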
Some models may be private, or they may require you to accept terms and conditions on the Hugging Face portal. The gated model deployment above fails if no token is passed, since a token is required for Llama models.
To solve this, you can specify your token in one of two ways.

The first option is to set the token directly as an environment variable in deployment.yaml:

env:
  - name: HF_TOKEN
    value: "hf_...your-token"

The second option is to store the token in a Kubernetes secret:

kubectl create secret generic hf-secret --from-literal=hf_token=hf_**** --dry-run=client -o yaml | kubectl apply -f -

To use this secret in your deployment.yaml, insert the following:

env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-secret
        key: hf_token
It is important to secure your token and not to post it in any public repository. For full details on how to use tokens see https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/gated_model_access .
Once the deployment is applied, Kubernetes creates the deployment, the LoadBalancer service, and the ingress shown in the kubectl apply output above.
In your environment, a LoadBalancer and Ingress may already exist. In that case, the deployment.yaml can be adjusted to match your environment. This example was performed in an empty Kubernetes namespace without issues.
In this example, we have deployed three replicas of the Llama 3.1 70B model. Each replica uses 4 x H100 GPUs, which means the model is scaled across multiple XE9680 servers.
After the deployment has completed and the Dell Enterprise Hub containers are downloaded, you can check the state of your Kubernetes pods with the following command:
kubectl get pods
This does not show the location of the pods, so run the following command to see which nodes they are deployed to:
kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name
This shows something similar to the following:
Here the pods running Llama 3.1 70B are distributed across 2 different XE9680 servers (node005 and node002). The replicas can be increased as long as resources are available on the cluster.
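For example, assuming the deployment keeps the name tgi-deployment from the generated snippet, a minimal sketch for scaling looks like this:

# Scale the deployment to four replicas; each new pod needs four free H100 GPUs in the cluster
kubectl scale deployment/tgi-deployment --replicas=4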
The Dell Enterprise Hub containers expose HTTP endpoints that can be used to perform queries in various formats. The full swagger definition of the API is available at https://huggingface.github.io/text-generation-inference/#/ .
Since we have deployed three replicas across multiple servers, the HTTP endpoints are accessed through the Kubernetes LoadBalancer service. When running the models for inference, note that the calls are stateless and there is no guarantee that subsequent requests go to the same replica. To get the IP of the LoadBalancer, you can run:
kubectl describe services tgi-service
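Alternatively, if you only want the external IP (assuming the service name tgi-service from the generated snippet and that an address has already been assigned), a jsonpath query returns it directly:

# Print only the external IP assigned to the LoadBalancer service
kubectl get service tgi-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'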
For a simple test we can use the “generate” endpoint to POST a simple query to the model that we deployed in the previous step:
curl 123.4.5.6:80/generate \
    -X POST \
    -d '{"inputs":"What is Dell Technologies World?", "parameters":{"max_new_tokens":50}}' \
    -H 'Content-Type: application/json'
This produces the following output:
{"generated_text":" Dell Technologies World is an annual conference held by Dell Technologies, a multinational technology company. The conference is designed to bring together customers, partners, and industry experts to share knowledge, showcase new products and services, and network with others in the technology industry.\n"}
In the example above, the response is generated successfully and stays within the maximum of 50 new tokens specified in the query.
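TGI also exposes a streaming variant of this endpoint (documented in the swagger definition linked above); a minimal sketch against the same LoadBalancer IP looks like this:

# Stream tokens back as they are generated instead of waiting for the full response
curl 123.4.5.6:80/generate_stream \
    -X POST \
    -d '{"inputs":"What is Dell Technologies World?", "parameters":{"max_new_tokens":50}}' \
    -H 'Content-Type: application/json'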
The Dell Enterprise Hub simplifies the deployment and execution of the latest AI models. The prebuilt optimized containers deploy seamlessly on Dell Hardware.
In this blog we showed how quick and easy it is to run the latest Llama 3.1 model on a scalable Kubernetes cluster built on Dell servers.
For more information, see Simplifying AI: Dell Enterprise Hub Enables Kubernetes Deployment for AI Models.
Mon, 22 Jul 2024 13:16:51 -0000
Performance analysis is a critical component that ensures the efficiency and reliability of models during the inference phase of a machine learning life cycle. For example, using a concurrency mode or request rate mode to simulate load on a server helps you understand various load conditions that are crucial for capacity planning and resource allocation. Depending on the use case, the analysis helps to replicate real-world scenarios. It can optimize performance to maintain a specific concurrency of incoming requests to the server, ensuring that the server can handle constant load or bursty traffic patterns. Providing a comprehensive view of the models’ performance enables data scientists to build models that are not only accurate but also robust and efficient.
Triton Performance Analyzer is a CLI tool that analyzes and optimizes the performance of Triton-based systems. It provides detailed information about the systems’ performance, including metrics related to GPU, CPU, and memory. It can also collect custom metrics using Triton’s C API. The tool supports various inference load modes and performance measurement modes.
The Triton Performance Analyzer can help identify performance bottlenecks, optimize system performance, troubleshoot issues, and more. In the suite of Triton's performance analysis tools, the recently released GenAI-Perf uses Perf Analyzer in the backend. The GenAI-Perf tool can be used to gather various LLM metrics.
This blog focuses on the capabilities and use of GenAI-Perf.
GenAI-Perf is a command-line performance measurement tool that is customized to collect metrics that are most useful when analyzing an LLM's performance. These metrics include output token throughput, time to first token, inter-token latency, and request throughput.
These metrics help you understand how an LLM behaves under load and where its performance can be improved.
This blog also describes how to collect these metrics and automatically create plots using GenAI-Perf.
The following steps guide you through the process of using GenAI-Perf. In this example, we collect metrics from a Llama 3 model.
Before running the GenAI-Perf tool, launch Triton Inference Server with your Large Language Model (LLM) of choice.
The following procedure starts Llama 3 70B running on Triton Inference Server v24.05. For more information about how to convert Hugging Face weights to run on Triton Inference Server, see the Converting Hugging Face Large Language Models to TensorRT-LLM blog.
The following example shows a sample command to start a Triton container:
docker run --rm -it --net host --shm-size=128g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v $(pwd)/llama3-70b-instruct-ifb:/llama_ifb \
    -v $(pwd)/scripts:/opt/scripts \
    nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
Because Llama 3 is a gated model distributed by Hugging Face, you must request access to Llama 3 using Hugging Face and then create a token. For more information about Hugging Face tokens, see https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/gated_model_access.
An easy method to use your token with Triton is to log in to Hugging Face, which caches a local token:
huggingface-cli login --token hf_Enter_Your_Token
The following example shows a sample command to start the inference:
python3 /opt/scripts/launch_triton_server.py --model_repo /llama_ifb/ --world_size 4
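Before driving any load, it is worth confirming that the server is up; a quick check (assuming Triton's default HTTP port of 8000) is:

# Returns HTTP 200 once the models are loaded and the server is ready to serve requests
curl -v localhost:8000/v2/health/ready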
Another method to deploy the Llama 3 model is to use the NVIDIA Inference Microservices (NIM). For more information about how to deploy NIM on the Dell PowerEdge XE9680 server, see Introduction to NVIDIA Inference Microservices, aka NIM. Also, see NVIDIA NIM for LLMs - Introduction.
The following example shows a sample script to start NIM with Llama 3 70b Instruct:
export NGC_API_KEY=<enter-your-key>
export CONTAINER_NAME=meta-llama3-70b-instruct
export IMG_NAME="nvcr.io/nim/meta/llama3-70b-instruct:1.0.0"
export LOCAL_NIM_CACHE=/aipsf810/project-helix/NIM/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    $IMG_NAME
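Once the NIM container is up, a simple sanity check (assuming the OpenAI-compatible API on the mapped port 8000) is to list the models it is serving:

# List the models served by the NIM container
curl -s localhost:8000/v1/models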
After the Triton Inference container is launched and the inference is started, run the Triton Server SDK:
docker run -it --net=host --gpus=all \
    nvcr.io/nvidia/tritonserver:24.05-py3-sdk
You can install the GenAI-Perf tool using pip. In our example, we use the NGC container, which is easier to use and manage.
When the containers are running, log in to the SDK container and run the GenAI-Perf tool.
The following example shows a sample command:
genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --num-prompts 100 \
    --random-seed 123 \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 0 \
    --streaming \
    --output-tokens-mean 100 \
    --output-tokens-stddev 0 \
    --output-tokens-mean-deterministic \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --concurrency 1 \
    --generate-plots \
    --measurement-interval 10000 \
    --profile-export-file profile_export.json \
    --url localhost:8001
This command produces values similar to the values in the following table:
Statistic | Average | Minimum | Maximum | p99 | p90 | p75 |
Time to first token (ns) | 40,375,620 | 37,453,094 | 74,652,113 | 69,046,198 | 39,642,518 | 38,639,988 |
Inter token latency (ns) | 17,272,993 | 5,665,738 | 19,084,237 | 19,024,802 | 18,060,240 | 18,023,915 |
Request latency (ns) | 1,815,146,838 | 1,811,433,087 | 1,850,664,440 | 1,844,310,335 | 1,814,057,039 | 1,813,603,920 |
Number of output tokens | 108 | 100 | 123 | 122 | 116 | 112 |
Number of input tokens | 200 | 200 | 200 | 200 | 200 | 200 |
Output token throughput (per sec): 59.63
Request throughput (per sec): 0.55
See Metrics at https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html#metrics for more information.
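The run above uses a single concurrency level. To see how latency and throughput change under heavier load, the same command can be repeated across several concurrency values; the following is a minimal sketch that reuses only options shown above (adjust to your setup):

# Sweep a few concurrency levels; each run writes results to its own artifacts directory
for c in 1 2 4 8; do
  genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --streaming \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --concurrency $c \
    --measurement-interval 10000 \
    --profile-export-file profile_export_c${c}.json \
    --url localhost:8001
done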
To run the performance tool with NIM, you must change parameters such as the model name, service-kind, endpoint-type, and so on, as shown in the following example:
genai-perf \
    -m meta/llama3-70b-instruct \
    --service-kind openai \
    --endpoint-type chat \
    --backend tensorrtllm \
    --num-prompts 100 \
    --random-seed 123 \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 0 \
    --streaming \
    --output-tokens-mean 100 \
    --output-tokens-stddev 0 \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --concurrency 1 \
    --measurement-interval 10000 \
    --profile-export-file profile_export.json \
    --url localhost:8000
The GenAI-Perf tool saves the output to the artifacts directory by default. Each run creates an artifacts/<model-name>-<service-kind>-<backend>-concurrency<N> directory.
The following example shows a sample directory:
ll artifacts/ensemble-triton-tensorrtllm-concurrency1/
total 2800
drwxr-xr-x  3 user user     127 Jun 10 13:40 ./
drwxr-xr-x 10 user user    4096 Jun 10 13:34 ../
-rw-r--r--  1 user user   16942 Jun 10 13:40 all_data.gzip
-rw-r--r--  1 user user  126845 Jun 10 13:40 llm_inputs.json
drwxr-xr-x  2 user user    4096 Jun 10 12:24 plots/
-rw-r--r--  1 user user 2703714 Jun 10 13:40 profile_export.json
-rw-r--r--  1 user user     577 Jun 10 13:40 profile_export_genai_perf.csv
The profile_export_genai_perf.csv file provides the same results that are displayed during the test.
You can also plot charts that are based on the data automatically. To enable this feature, include --generate-plots in the command.
The following figure shows the distribution of input tokens to generated tokens. This metric is useful to understand how the model responds to different lengths of input.
Figure 1: Distribution of input tokens to generated tokens
The following figure shows a scatter plot of how token-to-token latency varies with output token position. These results show how quickly tokens are generated and how consistent the generation is regarding various output token positions.
Figure 2: Token-to-token latency compared to output token position
Performance analysis during the inference phase is crucial as it directly impacts the overall effectiveness of a machine learning solution. Tools such as GenAI-Perf provide comprehensive information that helps anyone looking to deploy optimized LLMs in production. The NVIDIA Triton suite has been extensively tested on Dell servers and can be used to capture important LLM metrics with minimum effort. The GenAI-Perf tool is easy to use and produces extensive data that can be used to tune your LLM for optimal performance.
Mon, 20 May 2024 14:46:02 -0000
The Dell Enterprise Hub (https://dell.huggingface.co/) is a game changer for obtaining and using optimized models that run on some of the latest and greatest Dell hardware. The Dell Enterprise Hub has a curated set of models that have been containerized with all the software dependencies optimized to run and validated on Dell Hardware.
This blog shows how a user can go from the Dell Enterprise Hub portal to a running model in minutes. We will step through the setup from the beginning until one or more containers are running.
The Dell Optimized containers are built on top of the TGI framework (https://huggingface.co/docs/text-generation-inference/index). This allows a user to rely on all the existing benefits of TGI while it is optimized for Dell hardware. In addition, once a Dell container is downloaded, it comes preconfigured with all the required model weights, so no additional searching is needed to have a running system. The trade-off is larger containers, in exchange for simplicity and a lower risk of accidentally running incorrect model weights.
In this blog we look at the simpler case of deploying a model for inference. There are also containers that can be used for model training and fine-tuning and these will be covered in a future blog.
During our testing we worked on different Dell Servers and GPUs. In this example we will focus on the 760xa servers for inference.
CPU | 2 x Intel(R) Xeon(R) Gold 6438M (32 cores each) |
Memory | 512GB (16 x 32GB) |
Storage | 2TB local storage + PowerScale F600 mounted share |
GPU | 4 x NVIDIA L40S |
This server has the capacity to run multiple inference sessions in parallel. The configuration contains the maximum number of GPUs supported by this Dell server; if more GPUs are required for your model, an XE9680, which hosts up to 8 GPUs, can be used.
The software stack and the versions we used are listed below.
It is likely that other versions also work, but this is what was running in our lab.
Top tip: We missed installing the NVIDIA Container Toolkit on one server. To avoid this, you can run a test container to check that the toolkit is working by using the following command:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
If the toolkit is missing, follow the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.
The optimized containers from Dell Enterprise Hub have three basic requirements - Docker, NVIDIA Container Toolkit, and Linux OS on Dell PowerEdge Platforms.
The Dell Enterprise Hub contains an expanding set of models that are optimized to run on Dell hardware. To select a model, go to https://dell.huggingface.co/, log in using your Hugging Face username and select your model of choice. Check out the Model Selection Made Easy by Dell Enterprise Hub blog. It is possible to also use your own fine-tuned model but for this test we will use a prebuilt Llama 3 8B model. For more details on how to use the portal see AI Made Easy Unleash the Potential of Dell Enterprise Hub on Hugging Face.
See below for a sample portal screen for deployment of the Llama 3 8B model on a Dell 760xa with L40S GPUs:
See the Dell Enterprise Hub Deploy page for Meta Llama 3 8B instructions.
The models on the Dell Enterprise Hub fall under three broad categories of licenses: Apache 2.0, Llama 3, and Gemma. Even though all of these models are permissive for enterprise usage, you have to accept the terms and conditions before accessing them.
From the portal above the following Docker run command was generated:
docker run \
    -it \
    --gpus 2 \
    --shm-size 1g \
    -p 80:80 \
    -e NUM_SHARD=2 \
    -e MAX_BATCH_PREFILL_TOKENS=16182 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \
    registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3-8b-instruct
This command can be run as-is on your Dell server; the model will be pulled locally and run.
Note: The containers come prebuilt with all the weights, so some models can be more than 100 GB. This Llama 3 8B model is ~27 GB.
When running a standard Docker command (as above), there is no link between your Hugging Face account and the running model. To provide secure access to gated models, you can specify your Hugging Face Hub token in two ways:
It is important to secure your token and not post it in any public repository. For full details on how to use tokens, see How to use HF tokens; for token generation, see Token Generation.
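As one illustration of the environment-variable approach (assuming the container reads the standard Hugging Face HF_TOKEN environment variable; treat the exact variable name as an assumption for your container version), the token can be supplied directly on the docker run command line:

# Pass the Hugging Face token into the container as an environment variable (placeholder value)
docker run -it --gpus 2 --shm-size 1g -p 80:80 \
    -e HF_TOKEN=hf_your_token_here \
    -e NUM_SHARD=2 \
    registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3-8b-instruct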
The TGI containers expose http endpoints that can be used to perform queries in various formats. The full swagger definition of the API is available at https://huggingface.github.io/text-generation-inference/#/.
For a simple test we can use the “generate” endpoint to POST a simple query to the model that we ran in the previous step:
curl 127.0.0.1:80/generate \
    -X POST \
    -d '{"inputs":"What is Dell Technologies World?", "parameters":{"max_new_tokens":50}}' \
    -H 'Content-Type: application/json'
This produces the following output:
{"generated_text":" Dell Technologies World is an annual conference held by Dell Technologies, a multinational technology company. The conference is designed to bring together customers, partners, and industry experts to share knowledge, showcase new products and services, and network with others in the technology industry.\n"}
As can be seen, the response is generated and stays within the maximum of 50 new tokens specified in the query.
The Dell Enterprise Hub simplifies the deployment and execution of the latest AI models. The prebuilt containers run seamlessly on Dell Hardware.
In this example we showed how quick and easy it is to run the latest Llama 3 model on a 760xa with L40S GPUs. The Dell Enterprise Hub also supports training and fine-tuning models on Dell Hardware.
Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.
Tue, 30 Jan 2024 14:48:31 -0000
Light Detection and Ranging (LiDAR) is a method for determining the distance from a sensor to an object or a surface by sending out a laser beam and measuring the time for the reflected light to return to the receiver. We recently designed a solution to understand how using data from multiple LiDAR sensors monitoring a single space can be combined into a three-dimensional (3D) perceptual understanding of how people and objects flow and function within public and private spaces. Our key partner in this research is Seoul Robotics, a leader in LiDAR 3D perception and analytics tools.
Most people are familiar with the use of LiDAR on moving vehicles to detect nearby objects, which has become popular in transportation applications. Stationary LiDAR is now becoming more widely adopted for 3D imaging in applications where cameras have traditionally been used.
Multiple sensor LiDAR applications can produce a complete 3D grid map with precise depth and location information for objects in the jointly monitored environment. This technology overcomes several limitations of 2D cameras. Using AI, LiDAR systems can improve the quality of analysis results for data collected during harsh weather conditions like rain, snow, and fog. Furthermore, LiDAR is more robust than optical cameras for conditions where the ambient lighting is low or produces reflections and glare.
Another advantage of LiDAR for computer vision is related to privacy protection. The widespread deployment of high-resolution optical cameras has raised concerns regarding the potential violation of individual privacy and misuse of the data.
LiDAR 3D perception is a promising alternative to traditional camera systems. LiDAR data does not contain biometric data that could be cross-referenced with other sources to uniquely identify individuals. This approach allows operators to track anonymous objects while maintaining individuals' privacy. Therefore, it is worth considering replacing or augmenting such cameras to reduce the overhead of ensuring that data is secure and used appropriately.
Worldwide, organizations use AI-enabled computer vision solutions to create safer, more efficient public and private spaces using only optical, thermal, and infrared cameras. Data scientists have developed many machine learning and deep neural network tools to detect and label objects using data from these different camera types.
As LiDAR becomes vital for the reasons discussed above, organizations are investigating whether LiDAR is best deployed alongside traditional cameras or whether there are opportunities to design new systems using LiDAR sensors exclusively. It is rare that existing cameras can be replaced with LiDAR sensors mounted in the exact locations used today.
An example deployment of 2 LiDAR sensors for a medium-sized room is below:
Detecting the position of the stationary objects and people moving through this space (flow and function) with LiDAR requires careful placement of the sensors, calibration of the room's geometry, and data processing algorithms that can extract information from both sensors without distortion or duplications. Collecting and processing LiDAR data for 3D perception requires a different toolset and expertise, but companies like Seoul Robotics can help.
Another aspect of LiDAR systems design that needs to be evaluated is data transfer requirements. In most large environments using camera deployments today (e.g., airport/transportation hubs, etc.), camera data is fed back to a centralized hub for real-time processing.
A typical optical camera in an AI computer vision system has a resolution and refresh rate of 1080p@30FPS. This specification translates to ~4Mb/s of network traffic per camera. Even with older network technology, thousands of cameras can be deployed and processed.
There is a significant increase in the density of the data produced and processed for LiDAR systems compared to video systems. A currently available 32-channel LiDAR sensor will produce between 25Mb/s and 50Mb/s of data on the network segment between the device and the AI processing node. Newer high-density 128-channel LiDAR sensors consume up to 256Mb/s of network bandwidth, so something will need to change from the current strategy of centralized data processing.
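To make the scale of the difference concrete, here is a back-of-the-envelope calculation for a hypothetical site with 1,000 devices, using the per-device rates quoted above:

# Aggregate bandwidth for 1,000 devices (illustrative figures only)
echo "$(( 1000 * 4 )) Mb/s total for 1080p cameras at ~4 Mb/s each"        # ~4 Gb/s
echo "$(( 1000 * 256 )) Mb/s total for 128-channel LiDAR at 256 Mb/s each" # ~256 Gb/s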
It is not feasible to design a system that will consume the entire network capacity of a site with LiDAR traffic. In addition, it can also be challenging and expensive to upgrade the site's private network to handle higher speeds. The most efficient solution, therefore, is to design a federated solution for processing LiDAR data closer to the location of the sensors.
With a switch to the architecture in the right-side panel above, it is possible to process multiple LiDAR sensors closer to where they are mounted at the site and only send any resulting alerts and events back to a central location (primary node) for further processing and triggering corrective actions. This approach avoids the costly transfer of dense LiDAR data across long network segments.
It is important to note that processing LiDAR data with millions of points per second requires significant computational capability. We also validated that leveraging the massive parallel computing power of GPUs like the NVIDIA A2 greatly enhanced the object detection accuracy in the distributed processing nodes. The Dell XR4000 series of rugged Dell servers should be a good option for remote processing in many environments.
LiDAR is becoming increasingly important in designing AI for computer vision solutions due to its ability to handle challenging lighting situations and enhance user privacy. LiDAR differs from video cameras, so planning the deployment carefully is essential.
LiDAR systems can be designed in either a central or federated manner or even a mix of both. The rapidly growing network bandwidth requirements of LiDAR may cause a rethink on how systems for AI-enabled data processes are deployed sooner rather than later.
For more details on CV 3D Flow and Function with LiDAR see Computer Vision 3D Flow and Function AI with LiDAR.
Fri, 27 Oct 2023 15:31:21 -0000
Long gone are the days when facilities managers and security personnel were required to be in a control room with their attention locked onto walls of video monitors. The development of lower-cost and more capable video cameras, more powerful data science computing platforms, and the need to reduce operations overhead have caused the deployment of video management systems (VMS) and computer vision analytics applications to skyrocket in the last ten years in all sectors of the economy. Modern computer vision applications can detect a wide range of events without constant human supervision, including overcrowding, unauthorized access, smoke detection, vehicle operation infractions, and more. Better situational awareness of their environments can help organizations achieve better outcomes for everyone involved.
Table 1 – Outcomes achievable with better situational awareness
Increased operational efficiencies | Leverage all the data that you capture to deliver high-quality services and improve resource allocation. |
Optimized safety and security | Provide a safer, more real-time aware environment. |
Enhanced experience | Provide a more positive, personalized, and engaging experience for both customers and employees. |
Improved sustainability | Measure and lower your environmental impact. |
New revenue opportunities | Unlock more monetization opportunities from your data with more actionable insights. |
Computer vision analytics uses various techniques and algorithms, including object detection, classification, feature extraction, and more. The computation resources that are required for these tasks depend on the resolution of the source video, frame rates, and the complexity of both the scene and the types of analytics being processed. The diagram below shows a simplified set of steps (pipeline) that is frequently implemented in a computer vision application.
Figure 1: Logical processing pipeline for computer vision
Inference is the step that most people are familiar with. A trained algorithm can distinguish between a passenger automobile and a delivery van, similar to the classic dogs versus cats example often used to explain computer vision. While the other steps are less familiar to the typical user of computer vision applications, they are critical to achieving good results and require dedicated graphics processing units (GPUs). For example, the Decode/Encode steps are tuned to leverage hardware that resides on the GPU to provide optimal performance.
Given the extensive portfolio of NVIDIA GPUs available today, organizations that are getting started with computer vision applications often need help understanding their options. We have tested the performance of computer vision analytics applications with various models of NVIDIA GPUs and collected the results. The remainder of this article provides background on the test results and our choice of model.
The market for GPUs is broadly divided into data center, desktop, and mobility products. The workload that is placed on a GPU when training large image classification and detection models is almost exclusively performed on data center GPUs. Once these models are trained and delivered in a computer vision application, multiple CPU and GPU resource options can be available at run time. Small facilities, such as a small retailer with only a few cameras, can afford to deploy only a desktop computer with a low-power GPU for near real-time video analytics. In contrast, large organizations with hundreds to thousands of cameras need the power of data center-class GPUs.
However, all data center GPUs are not created equal. The table below compares selected characteristics for a sample of NVIDIA data center GPUs. The FP32 floating point calculations per second metric indicates the relative performance that a developer can expect on either model training or the inference stage of the typical pipeline used in a computer vision application, as discussed above.
The capability of the GPU for performing other pipeline elements required for high-performance computer vision tasks, including encoding/decoding, is best reflected by the Media Engines details.
First, consider the Media Engines row entry for the A30 GPU column. There is 1 JPEG decoder and 4 video decoders, but no video encoders. This configuration makes the A30 incompatible with the needs of many market-leading computer vision application vendors' products, even though it is a data center GPU.
Table 2: NVIDA Ampere architecture GPU characteristics
| A2 | A16 | A30 | A40 |
FP32 (Tera Flops) | 4.5 | 4x 4.5 | 10.3 | 37.4 |
Memory (GB) | 16 GDDR6 | 4x 16 GDDR6 | 24 GB HBM2 | 48 GDDR6 with ECC |
Media Engines | 1 video encoder, 2 video decoders (includes AV1 decode) | 4 video encoders, 8 video decoders (includes AV1 decode) | 1 JPEG decoder, 4 video decoders, 1 optical flow accelerator | 1 video encoder, 2 video decoders (includes AV1 decode) |
Power (Watts) | 40-60 (Configurable) | 250 | 165 | 300 |
Comparing the FP32 TFLOPS between the A30 and A40 shows that the A40 is a more capable GPU for training and pure inference tasks. During our testing, the computer vision applications quickly exhausted the available Media Engines on the A40. Selecting a GPU for computer vision requires matching the available resources needed for computer vision including media engines, available memory, and other computing capabilities that can be different across use cases.
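One practical way to observe this behavior in your own environment (a generic spot check, not specific to any computer vision application) is to watch the encoder/decoder utilization columns that nvidia-smi reports while the workload is running:

# Print per-GPU utilization once per second; the enc/dec columns reflect Media Engine load
nvidia-smi dmon -s u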
Next, examining the Media Engines description for the A2 GPU column confirms that the product houses 1 video encoder and 2 video decoders. This card will meet the needs of most computer vision applications and is supported for data center use; however, the low number of encoders and decoders, memory, and floating point processing will limit the number of concurrent streams that can be processed. The low power consumption of the A2 increases the flexibility of choice of server for deployment, which is important for edge and near-edge scenarios.
Still focusing on the table above, compare all the characteristics of the A2 GPU column with the A16 GPU. Notice that there are four times the resources on the A16 versus the A2. This can be explained by looking at the diagram below. The A16 was constructed by putting four A2 “engines” on a single PCI card. Each of the boxes labeled GPU0-GPU3 contains all the memory, media engines and other processing capabilities that you would have available to a server that had a standard A2 GPU card installed. Also notice that the A16 requires approximately 4 times the power of an A2.
The table below shows the same metric comparison used in the discussion above for the newest NVIDIA GPU products based on the Ada Lovelace architecture. The L4 GPU offers 2 encoders and 4 decoders for a card that consumes just 72 W. Compared with the 1 encoder and 2 decoder configuration on the A2 at 40 to 60 W, the L4 should be capable of processing many more video streams for less power than two A2 cards. The L40 with 3 encoders and 3 decoders is expected to be the new computer vision application workhorse for organizations with hundreds to thousands of video streams. While the L40S has the same number of Media Engines and memory as the L40, it was designed to be an upgrade/replacement for the A100 Ampere architecture training and/or inference computing leader.
| L4 | L40 | L40S |
FP32 (Tera Flops) | 30.3 | 90.5 | 91.6 |
Memory (GB) | 24 GDDR6 w/ ECC | 48 GDDR6 w/ ECC | 48 GDDR6 w/ ECC |
Media Engines | 2 video encoders, 4 video decoders, 4 JPEG decoders (includes AV1 decode) | 3 video encoders, 3 video decoders | 3 video encoders, 3 video decoders |
Power (Watts) | 72 | 300 | 350 |
In total, seven different NVIDIA GPU cards that are useful for CV workloads were discussed. From the Ampere family, we found that the A16 performed well for a wide variety of CV inference workloads. The A16 provides a good balance of video decoders/encoders, CUDA cores, and memory for computer vision workloads.
For the newer Ada Lovelace family of cards, the L40 looks like a well-balanced card with great throughput potential. We are currently testing this card in our lab and will provide a future blog on its performance for CV workloads.
A2 - https://www.nvidia.com/content/dam/en-zz/solutions/data-center/a2/pdf/a2-datasheet.pdf
A16 - https://images.nvidia.com/content/Solutions/data-center/vgpu-a16-datasheet.pdf
A30 - https://www.nvidia.com/en-us/data-center/products/a30-gpu/
A40 - https://images.nvidia.com/content/Solutions/data-center/a40/nvidia-a40-datasheet.pdf
L4 - https://www.nvidia.com/en-us/data-center/l4/
Thu, 20 Jul 2023 18:05:50 -0000
In today’s world, the deployment of security cameras is a common practice. In some public facilities like airports, travelers can be in view of a security camera 100% of the time. The days of security guards watching banks of video panels being fed from hundreds of security cameras are quickly being replaced by computer vision systems powered by artificial intelligence (AI). Today’s advanced analytics can be performed on many camera streams in real-time without a human in the loop. These systems enhance not only personal safety but also provide other benefits, including better passenger experience and enhanced shopping experiences.
Modern IP cameras are complex devices. In addition to recording video streams at increasingly higher resolutions (4K is now common), they can also encode and send those streams over standard Internet Protocol (IP) networks to downstream systems for additional analytic processing and eventual archiving. Some cameras on the market today have enough onboard computing power and storage to evaluate AI models and perform analytics right on the camera.
The development of IP-connected cameras provided great flexibility in deployment by eliminating the need for specialized cables. IP cameras are so easy to plug into existing IT infrastructure that almost anyone can do it. However, since most camera vendors use a modified version of an open-source Linux operating system, IT and security professionals should realize that there are hundreds or thousands of customized Linux servers mounted on walls and ceilings all over their facilities. Whether you are responsible for fewer than 10 cameras at a small retail outlet or more than 5,000 at an airport facility, the question remains: how much exposure do all those cameras pose from cyber attacks?
To understand the potential risk posed by IP cameras, we assembled a lab environment with multiple camera models from different vendors. Some cameras were thought to be up to date with the latest firmware, and some were not.
Working in collaboration with the Secureworks team and their suite of vulnerability and threat management tools, we assessed a strategy for detecting IP camera vulnerabilities. Our first choice was to implement their Secureworks Taegis™ VDR vulnerability scanning software to scan our lab IP network and discover any camera vulnerabilities. VDR provides a risk-based approach to managing vulnerabilities, driven by automated and intelligent machine learning.
We planned to discover the cameras with older firmware and document their vulnerabilities. Then we would have the engineers upgrade all firmware and software to the latest patches available and rescan to see if all the vulnerabilities were resolved.
Once the Secureworks Edge agent was set up in the lab, we could easily add all the IP ranges that might be connected to our cameras. All the cameras on those networks were identified by Secureworks VDR and automatically added to the VDR AWS cloud-based reporting console.
The results of the scans were surprising. Almost all discovered cameras had some Critical issues identified by the VDR scanning. In one case, even after a camera was upgraded to the latest firmware available from the vendor, VDR found Critical software and configuration vulnerabilities shown below:
One of the remaining critical issues was the result of an insecure FTP username/password that was not changed from the vendor's default settings before the camera was put into service. These types of procedural lapses should not happen, but inevitably they do. The password-hardening mistake was easily caught by a VDR scan, so another common cybersecurity risk could be dealt with. This is an example of an issue that is not related to firmware, but rather a combination of vendors needing to avoid shipping with a well-known FTP login and users needing to remember to harden that login.
Another example of the types of Critical issues you can expect when dealing with IP cameras relates to discovering an outdated library dependency found on the camera. The library is required by the vendor software but was not updated when the latest camera firmware patches were applied.
The VDR tool will also detect whether a camera exposes any HTTP sites or services and look for vulnerabilities there. Most IP cameras ship with an embedded HTTP server so that administrators can access the camera's functionality and perform maintenance. Again, considering the number of deployed cameras, this represents a huge number of websites that may be susceptible to hacking. Our testing found some examples of the types of issues that a camera's web applications can expose:
The scan of this device found an older version of the Apache web server software and outdated SSL libraries in use for this camera's website; this should be considered a critical vulnerability.
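As a quick manual spot check to complement a VDR scan (the address below is a hypothetical camera IP), the headers advertised by a camera's embedded web server can reveal outdated server software:

# Fetch only the HTTP response headers, ignoring certificate errors on self-signed cameras
curl -kI https://192.168.1.50/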
In this article, we have tried to raise awareness of the significant cybersecurity risk that IP cameras pose to organizations both large and small. Providing effective video recording and analysis capabilities is much more than simply mounting cameras on the wall and walking away. IT and security professionals must ask, "Who's watching our IP cameras?" Each camera should be continuously patched to the latest version of firmware and software, and scanned with a tool like Secureworks VDR. If vulnerabilities still exist after scanning and patching, it is critical to engage with your camera vendor to remediate issues that may adversely impact your organization if neglected. Someone will be watching your IP cameras; let's ensure they don't conflict with your best interests.
Dell Technologies is at the forefront of delivering enterprise-class computer vision solutions. Our extensive partner network and key industry stakeholders have allowed us to develop an award-winning process that takes customers from ideation to full-scale implementation faster and with less risk. Our outcomes-based process for computer vision delivers:
Dell Technologies Workload Solutions for Computer Vision
Virtualized Computer Vision for Smart Transportation with Genetec
Virtualized Computer Vision for Smart Transportation with Milestone