Cloud vs Edge: Putting Cutting-Edge AI Voice, Vision, & Language Models to the Test in the Cloud & at the Edge
| DEPLOYING LEADING AI MODELS ON THE EDGE OR IN THE CLOUD
The decision to deploy workloads at the edge or in the cloud is often shaped by four pivotal factors: economics, latency, regulatory requirements, and fault tolerance. Some might distill these considerations into a more colloquial framework: the laws of economics, the laws of the land, the laws of physics, and Murphy's Law. In this multi-part paper, we won't merely discuss these principles in theory. Instead, we'll delve deeper, testing and comparing leading AI models across voice, computer vision, and large language models both at the edge and in the cloud.
In part one we put leading CPUs to the test with 4th Generation Intel® Xeon® Scalable processors, both in the cloud and at the edge. In part two we'll put NVIDIA® GPUs to the test.
| GPU AVAILABILITY IN THE CLOUD & AT THE EDGE
GPU instances are widely available in cloud environments, but only recently have purpose-built edge platforms offered the similarly high-performing GPUs needed for advanced AI workloads at the edge.
In this paper we’ll evaluate leading vision, voice, and language models at the edge and in the cloud on NVIDIA® GPUs.
| AI MODEL SELECTION
To ensure a broad range of AI workloads was tested at the edge and in the cloud, we opted for three leading models in their respective domains:
- Llama-2 7B Chat • OpenAI Whisper base • YOLOv8n Instance Segmentation
- VISION | YOLOv8n Instance Segmentation
YOLOv8n Instance Segmentation is designed for instance segmentation. Unlike basic object detection, instance segmentation identifies the objects in an image as well as the segments of each object, providing outlines and confidence scores.
- LANGUAGE | Llama-2 7B Chat
Llama-2 7B Chat is a member of the Llama family of large language models from Meta; trained on 2 trillion tokens, it is well suited for chat applications.
- VOICE | OpenAI Whisper base 74M
Whisper is a deep learning model developed by OpenAI for speech recognition and transcription. It can transcribe speech in English and multiple other languages, and can translate several non-English languages into English.
| EDGE HARDWARE • DELL™ POWEREDGE™ XR4520C
The system we selected is the Dell™ PowerEdge™ XR4520c, purpose-built for the edge and the shortest-depth server available to date that delivers high-performance compute, with support for NVIDIA® GPUs, specifically the A30 Tensor Core series. It is designed for edge workloads including computer vision, video surveillance, and point-of-sale applications, and offers a rackable chassis that supports up to four separate server nodes in a single 2U chassis, with up to 45 TB of storage per sled. The Dell™ PowerEdge™ XR4520c supports an operating temperature range of -5°C to 55°C and is MIL-STD-810H compliant, making it ideal for harsh edge environments. Its NEBS Level 3 compliance meets rigorous standards for performance, reliability, and environmental resilience, leading to a more stable and reliable network.
| AWS INSTANCE SELECTION
We selected the AWS EC2 G5 instance family, specifically the g5.8xlarge instance powered by NVIDIA® A10G Tensor Core GPUs and built for AWS cloud workloads including rich AI support. This option was selected as the nearest comparison to the NVIDIA® A30 in the Dell™ PowerEdge™ XR4520c, which is not currently available in the AWS portfolio. As of November 2023, pricing for the AWS EC2 g5.8xlarge instance starts at US$2.449 per hour.
| HARDWARE SELECTION CONSIDERATIONS
We selected the closest comparable offerings: the Dell™ PowerEdge™ portfolio offered NVIDIA® A30 GPUs, while AWS offered NVIDIA® A10G Tensor Core GPUs. While cloud instances offer significant choice, the Dell™ PowerEdge™ portfolio likewise offered a wide choice of processors, memory, and networking.
In our analysis we are providing performance as well as cost of compute comparisons. For deployment you will also want to consider the following factors:
- Operational expenditures including power and maintenance costs.
- Network costs including data transfer to cloud and local connectivity.
- Data storage costs including cloud cost versus local storage.
- Network latency requirements, including the lower latency achievable when data is processed locally.
- Security and compliance costs.
*Performance varies by use case, model, application, hardware & software configurations, the quality of the resolution of the input data, and other factors. This performance testing is intended for informational purposes and not intended to be a guarantee of actual performance of an AI application.
| PERFORMANCE INSIGHTS
For YOLOv8n Instance Segmentation we report single-process results, which delivered the best images-per-second performance. Llama-2 7B Chat results are reported at 20 concurrent processes, which achieved our target of sub-100 ms per-token user latency. For OpenAI Whisper we report results at 16 processes, targeting typical user reading speed. On language performance and on voice performance per watt, the edge offering exceeded the cloud instance, while also delivering lower-latency AI performance. The cloud instance led on vision and raw voice throughput. From a computational cost comparison, the on-premises solution offered a payback period of nearly a year, indicating a TCO win for the edge.
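To make the payback arithmetic concrete, a minimal sketch follows; the cloud rate is the g5.8xlarge price cited earlier, while the edge system price is a hypothetical placeholder, not actual Dell™ pricing.

```python
# Rough payback-period estimate for an always-on workload.
# Cloud rate: g5.8xlarge on-demand price cited above (Nov 2023).
# Edge system cost: HYPOTHETICAL placeholder, not actual Dell pricing.
CLOUD_RATE_USD_PER_HOUR = 2.449
HOURS_PER_YEAR = 24 * 365

cloud_cost_per_year = CLOUD_RATE_USD_PER_HOUR * HOURS_PER_YEAR  # ~US$21,453

edge_system_cost = 21_000  # hypothetical up-front edge hardware cost
payback_years = edge_system_cost / cloud_cost_per_year

print(f"Annual cloud cost: ${cloud_cost_per_year:,.0f}")
print(f"Payback period: {payback_years:.2f} years")
```

Under these assumptions, an edge system priced near one year of cloud spend breaks even in roughly a year, consistent with the TCO observation above.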
| RETAIL USE CASE
- Drive-thru Pharmacy Pick-up
To demonstrate the practical application of these models, we designed a solution architecture accompanied by a demo that simulates a drive-through pharmacy scenario. In this use case, the vision model identifies the car upon its arrival, the language model gathers the client's information, and communication is facilitated via the voice model. As you can discern, factors such as latency, privacy, security, and cost play crucial roles in this scenario, emphasizing the importance of the decision to deploy either in the cloud or at the edge.
In our drive-thru pharmacy pick-up scenario, we utilize a comprehensive architecture to optimize the customer experience. The Video AI module employs the YOLOv8n Instance Segmentation model to accurately detect and track cars in the drive-thru zone. The Audio AI segment captures and transcribes human speech using the optimized Whisper-base model. This transcribed text is then processed by our Large Language Model segment, where an application leverages the optimized Llama-2 7B Chat model to generate intuitive, human-like responses.
| RETAIL USE CASE ARCHITECTURE
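As an illustrative sketch of this architecture, the snippet below wires the three models together in the order described above. The model files, endpoint, and handler function are assumptions for illustration only, not the actual solution code (which is available from Scalers AI™, as noted at the end of this paper).

```python
# Illustrative sketch of the drive-thru pipeline: vision -> voice -> language.
# Model/file names and the TGI endpoint are assumptions for illustration.
import requests
import whisper
from ultralytics import YOLO

vision = YOLO("yolov8n-seg.pt")   # detects/tracks cars in the drive-thru zone
asr = whisper.load_model("base")  # transcribes customer speech

def handle_vehicle(frame, audio_path: str) -> str:
    # 1. Vision: check whether a car is present in the frame.
    detections = vision(frame)[0]
    car_present = any(int(b.cls) == 2 for b in detections.boxes)  # COCO class 2 = car
    if not car_present:
        return ""
    # 2. Voice: transcribe the customer's spoken request.
    text = asr.transcribe(audio_path)["text"]
    # 3. Language: generate a response via a TGI server hosting Llama-2 7B Chat.
    resp = requests.post(
        "http://localhost:8080/generate",  # assumed local TGI endpoint
        json={"inputs": text, "parameters": {"max_new_tokens": 128}},
    )
    return resp.json()["generated_text"]
```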
| SUMMARY
In this analysis we put leading voice, language, and vision models to the test on Dell™ PowerEdge™ and AWS instances. The Dell™ PowerEdge™ XR4520c exceeded the cloud instance on LLM performance and on voice performance per watt, while the AWS instance offered superior performance on computer vision. The Dell™ PowerEdge™ XR4520c offers a payback period of nearly one year based on third-party pricing for the Dell™ hardware. The pharmacy drive-thru use case showcased the advantages of an edge deployment in maintaining customer privacy and HIPAA compliance while ensuring fault tolerance and low latency.
APPENDIX | PERFORMANCE TESTING DETAILS
- Performance Insights | YOLOv8n Instance Segmentation
| Test Methodology
The YOLOv8n-seg FP32 model was tested using the ultralytics 8.0.43 library. A 53-second video with a resolution of 1080p and a bitrate of 1906 kb/s was employed for the performance tests. The first 30 inference samples were used as a warm-up phase and were not included in calculating the average inference metrics. The recorded time includes H.264 encode-decode using PyAV 10.0.0 and model inference time; a simplified measurement sketch follows the details below.
Input | Video file: Duration: 53.3 sec, h264, 1920x1080, 1906 kb/s, 30 fps
Output | Video file with h264 encoding (without segmentation post processing)
Base Model: https://docs.ultralytics.com/tasks/segment/#models
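A minimal sketch of the measurement loop is below. It times decode plus inference per frame (the published test also included H.264 re-encode via PyAV), and the video path is a placeholder.

```python
# Minimal sketch of the YOLOv8n-seg throughput measurement described above.
# Video path is a placeholder; per-frame time covers decode + inference.
import time
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")  # FP32 segmentation model
WARMUP = 30                     # warm-up samples excluded, per the methodology

times, prev = [], time.perf_counter()
# stream=True yields one result per decoded video frame
for i, _ in enumerate(model("input_1080p.mp4", stream=True)):
    now = time.perf_counter()
    if i >= WARMUP:
        times.append(now - prev)
    prev = now

fps = len(times) / sum(times)
print(f"Average throughput after warm-up: {fps:.1f} images/sec")
```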
| PERFORMANCE INSIGHTS LANGUAGE • CLOUD VS. EDGE
- Llama 2 7B Chat
| Test Methodology
For tests on the NVIDIA® GPU, the Llama-2-7B-chat-hf BF16 model was served on a Text Generation Inference v1.1.0 (TGI) server. The model was loaded onto the NVIDIA® GPU by the TGI server, and Apache Bench was used for load testing.
The test involved initiating concurrent requests using Apache Bench. For each concurrency level, performance results were collected for ten samples. The first five requests were treated as a warm-up phase and were not included in calculating the average inference time (in seconds) and the average time per token; a simplified single-request sketch follows the details below.
Input | Discuss the history and evolution of artificial intelligence in 80 words.
Output | Discuss the history and evolution of artificial intelligence in 80 words or less. Artificial intelligence (AI) has a long history dating back to the 1950s when computer scientist Alan Turing proposed the Turing Test to measure machine intelligence. Since then, AI has evolved through various stages, including rule-based systems, machine learning, and deep learning, leading to the development of intelligent systems capable of performing tasks that typically require human intelligence, such as visual recognition, natural language processing, and decision-making.
Base Model | https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
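A minimal sketch of one timed request is below, using Python's requests library in place of Apache Bench; the endpoint is the TGI default and is an assumption for illustration.

```python
# Minimal sketch: measure average time per generated token against a
# TGI v1.1.0 server hosting Llama-2-7b-chat-hf. In the published test,
# Apache Bench drove the concurrent load; this sketch times one request.
import time
import requests

PROMPT = "Discuss the history and evolution of artificial intelligence in 80 words."
URL = "http://localhost:8080/generate"  # assumed local TGI endpoint

def timed_request(max_new_tokens: int = 100) -> float:
    start = time.perf_counter()
    resp = requests.post(
        URL,
        json={"inputs": PROMPT,
              "parameters": {"max_new_tokens": max_new_tokens, "details": True}},
    )
    elapsed = time.perf_counter() - start
    n_tokens = resp.json()["details"]["generated_tokens"]
    return elapsed / n_tokens  # seconds per token

# Ten samples; the first five are treated as warm-up, per the methodology.
samples = [timed_request() for _ in range(10)][5:]
avg_ms = 1000 * sum(samples) / len(samples)
print(f"Average time per token: {avg_ms:.1f} ms")
```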
| PERFORMANCE INSIGHTS VOICE • CLOUD VS. EDGE
- OpenAI Whisper-base model
| Test Methodology
The OpenAI Whisper base 74M FP32 multilingual model was tested for inference using the openai-whisper v20231117 Python package. For performance tests, an audio clip of 28.2 seconds with a bitrate of 32 kb/s was employed. Twenty-five iterations were executed for each test scenario, of which the first 5 were treated as warm-up and were not included in calculating the average inference time (in seconds) and tokens per second. The time collected includes tokenizer encode-decode time and the model inference time; a simplified timing sketch follows the details below.
Input | MP3 file with 28.2 sec audio
Output | Generative AI has revolutionized the retail industry by offering a wide array of innovative use cases that enhance customer experiences and streamline operations. One prominent application of Generative AI is personalized product recommendations. Retailers can utilize advanced recommendation algorithms to analyze customer data and generate tailored product suggestions in real time. This not only drives sales but also enhances customer satisfaction by presenting them with items that align with their preferences and purchase history.
| 74 words transcribed
Base Model | https://github.com/openai/whisper#available-models-and-languages
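A minimal sketch of the timing loop is below; the audio path is a placeholder for the 28.2-second clip described above.

```python
# Minimal sketch of the Whisper-base timing loop described above.
# Audio path is a placeholder; first 5 of 25 iterations are warm-up.
import time
import whisper

model = whisper.load_model("base")  # 74M-parameter multilingual model
ITERATIONS, WARMUP = 25, 5

times = []
for i in range(ITERATIONS):
    start = time.perf_counter()
    result = model.transcribe("clip_28s.mp3")
    if i >= WARMUP:
        times.append(time.perf_counter() - start)

avg = sum(times) / len(times)
words = len(result["text"].split())
print(f"Average inference time: {avg:.2f} s ({words} words transcribed)")
```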
| About Scalers AI™
Scalers AI™ specializes in creating end-to-end artificial intelligence (AI) solutions to fast-track industry transformation across a wide range of industries, including retail, smart cities, manufacturing, insurance, finance, legal, and healthcare. Scalers AI™ industry offerings include predictive analytics, custom large language models, and multi-modal offerings across voice, vision, and language. As a full-stack AI solutions company with solutions ranging from the cloud to the edge, our customers often need versatile commercial off-the-shelf (COTS) hardware that works well across a range of workloads.
- Fast-track development & save hundreds of hours with access to the solution code
As part of this effort Scalers AI™ is making the solution code available.
Reach out to your Dell™ representative or contact Scalers AI™ at contact@scalers.ai for access to the GitHub repo.