Distributed inferencing
| Component | Details |
|---|---|
| Workload | Meta Llama-2-7B, Llama-2-13B, and Falcon-40B models at BF16 precision |
| Application | IPEX (intel_extension_for_pytorch==2.1.0) |
| Tools/Compilers | gcc 12.2.1 |
| Middleware, frameworks, runtimes | cmake-3.20.2, findutils-4.6.0, bzip2-1.0.6, gcc-8.5.0, gcc-c++-8.5.0, gcc-toolset-12-12.0, gcc-toolset-12-runtime-12.0, git-2.39.3, gperftools-devel-2.7-9.el8, libatomic-8.5.0, libfabric-1.18.0, procps-ng-3.3.15, python3-distutils-extra-2.39, python39-3.9.18, python39-devel-3.9.18, python39-pip-20.2.4, unzip-6.0, wget-1.19.5, which-2.21, intel-oneapi-openmp-2023.2.1, PSM3 (https://downloadmirror.intel.com/789689/IntelEth-FS.RHEL88-x86_64.11.5.1.1.1.tgz), ninja==1.11.1.1, accelerate==0.25.0, sentencepiece==0.1.99, protobuf==4.25.1, datasets==2.15.0, transformers==4.31.0, wheel==0.42.0, PyTorch 2.1 (torch==2.1.0), IPEX (intel_extension_for_pytorch==2.1.0), neural-compressor==2.3.1, TorchCCL (--branch v2.1.0+cpu, https://github.com/intel/torch-ccl), mpi4py==3.1.4, DeepSpeed (--branch gma/run-opt-branch, https://github.com/delock/DeepSpeedSYCLSupport) |
| Orchestration | Kubernetes v1.27.5 |
| Command line | `mpirun -n 2 -ppn 1 -iface net1 -genv OMP_NUM_THREADS=31 -genv MASTER_ADDR=$MASTER_ADDR -genv MASTER_PORT=$MASTER_PORT -genv LD_PRELOAD=/usr/lib64/libstdc++.so.6:/usr/lib64/libtcmalloc.so:/opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so -genv TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=4294967296 -f /machinefile python /datasets/run_generation_with_deepspeed.xmr.py --benchmark -m $MODEL_NAME --dtype bfloat16 --ipex --deployment-mode --token-latency --batch-size (1,2,4,8) --input-tokens (256,1024,2048) --num-iter 100 --num-warmup 10 --greedy --max-new-tokens 256` |
| Warm-up steps | 10 |
| Steps | 100 |
| Batch size | 1, 2, 4, 8 |
| Beam width | 1 (greedy search) |
| Input token size | 256, 1024, 2048 |
| Output token size | 256 |
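The command line above parameterizes two sweeps: batch size (1, 2, 4, 8) and input token length (256, 1024, 2048), for twelve runs per model. A minimal dry-run sketch of that sweep is below; the wrapper itself, the default `MODEL_NAME`, and the omission of the `LD_PRELOAD`/`TCMALLOC` `-genv` options (dropped here for brevity) are illustrative assumptions, not part of the documented setup:

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: builds one mpirun command per
# (batch size, input length) combination from the table above.
# Nothing is executed; MASTER_ADDR, MASTER_PORT, and /machinefile are
# assumed to be set up per the white paper's environment.
MODEL_NAME=${MODEL_NAME:-meta-llama/Llama-2-7b-hf}  # assumed default

cmds=()
for bs in 1 2 4 8; do
  for in_tok in 256 1024 2048; do
    # Some -genv options from the documented command are omitted for brevity.
    cmds+=("mpirun -n 2 -ppn 1 -iface net1 -genv OMP_NUM_THREADS=31 \
-genv MASTER_ADDR=\$MASTER_ADDR -genv MASTER_PORT=\$MASTER_PORT \
-f /machinefile python /datasets/run_generation_with_deepspeed.xmr.py \
--benchmark -m $MODEL_NAME --dtype bfloat16 --ipex --deployment-mode \
--token-latency --batch-size $bs --input-tokens $in_tok \
--num-iter 100 --num-warmup 10 --greedy --max-new-tokens 256")
  done
done

printf '%s\n' "${cmds[0]}"            # preview the first command
echo "${#cmds[@]} benchmark runs queued"
```

Each command reuses the fixed settings from the table (100 steps, 10 warm-up steps, greedy search, 256 output tokens) and varies only the two swept parameters.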