Driving GenAI Advancements: Dell PowerEdge R760 with the Latest 5th Gen Intel® Xeon® Scalable Processors

Fine-tuning
| Parameter | Details |
| --- | --- |
| Workload | Bio-GPT |
| Application | IPEX 2.2.0+gitad9564f6/llm_feature_branch |
| Tools/Compilers | gcc-12.2.1 |
| Middleware, frameworks, runtimes | cmake-3.20.2, findutils-4.6.0, bzip2-1.0.6, gcc-8.5.0, gcc-c++-8.5.0, gcc-toolset-12-12.0, gcc-toolset-12-runtime-12.0, git-2.39.3, gperftools-devel-2.7-9.el8, libatomic-8.5.0, libfabric-1.17.0, procps-ng-3.3.15, python3-distutils-extra-2.39, python39-3.9.16, python39-devel-3.9.16, python39-pip-20.2.4, unzip-6.0, wget-1.19.5, which-2.21, torch==2.2.0.dev20231006+cpu, torchvision==0.17.0.dev20231006+cpu, torchaudio==2.2.0.dev20231006+cpu, ninja==1.11.1, accelerate==0.23.0, sentencepiece==0.1.99, protobuf==3.20.3, datasets==2.14.5, transformers==4.35.0, sacremoses, scikit-learn, peft, gdown, llvm-16.0.6, mpi4py==3.1.4, IPEX https://github.com/intel/intel-extension-for-pytorch (branch: llm_feature_branch, commit ad9564f61aef5e6be41ff04fbc17f308d43ad300), TorchCCL https://github.com/intel/torch-ccl (tag: ccl_torch_dev_0905), DeepSpeed https://github.com/delock/DeepSpeedSYCLSupport (branch: gma/run-opt-branch, commit f0ef3eaa959617eb5d29d7fc4132fde8e6773cbe) |
| Orchestration | Kubernetes v1.27.5 |
| Command line | `mpirun -n 2 -ppn 1 -iface $FI_TCP_IFACE -genv OMP_NUM_THREADS=$OMP_NUM_THREADS -genv MASTER_ADDR=$MASTER_ADDR -genv MASTER_PORT=$MASTER_PORT -genv LD_PRELOAD=/usr/lib64/libstdc++.so.6:/usr/lib64/libtcmalloc.so:/opt/intel/oneapi/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libiomp5.so -genv TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=4294967296 -f /machinefile python3 /bio-gpt/finetune_multinode_biogpt.py --model_name_or_path "/datasets/biogpt-large" --dataset_path "/bio-gpt/raw/train.tsv,/bio-gpt/raw/valid.tsv,/bio-gpt/raw/ori_pqaa_rrp.json" --dataset_concatenation --gradient_accumulation_steps 1 --do_train --cache_dir "/biogpt_finetuned_model/cache/" --output_dir "/biogpt_finetuned_model/" --max_train_samples 5000 --per_device_train_batch_size 8 --learning_rate 4e-03 --num_train_epochs 3 --lora_alpha 64 --lora_target_modules q_proj, v_proj, k_proj, out_proj --lr_scheduler_type "linear" --use_cpu --use_ipex --ddp_backend ccl --ddp_find_unused_parameters True --bf16 --save_steps 2000` |
| Local batch size | 8 |
| Max train samples | 10500 |
| PEFT LoRA alpha | 64 |
| PEFT LoRA target modules | q_proj, k_proj, v_proj, out_proj |
| Sequence length | 512 |
| Learning rate | 0.004 |
| Epochs | 3 |
| Accumulation steps | 1 |
| Num ranks | 2 per node |
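The fine-tuning hyperparameters in the table can be collected into a small configuration helper that renders the corresponding flags for the fine-tuning script. This is a hypothetical sketch for readability, not part of the published scripts; the class and method names (`BioGptFinetuneConfig`, `to_cli_args`) are illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BioGptFinetuneConfig:
    """Hyperparameters taken from the fine-tuning table above."""
    per_device_train_batch_size: int = 8
    learning_rate: float = 4e-3
    num_train_epochs: int = 3
    gradient_accumulation_steps: int = 1
    lora_alpha: int = 64
    lora_target_modules: List[str] = field(
        default_factory=lambda: ["q_proj", "k_proj", "v_proj", "out_proj"]
    )
    lr_scheduler_type: str = "linear"

    def to_cli_args(self) -> List[str]:
        """Render the values as CLI flags in the style of the command line above."""
        return [
            "--per_device_train_batch_size", str(self.per_device_train_batch_size),
            "--learning_rate", f"{self.learning_rate:g}",
            "--num_train_epochs", str(self.num_train_epochs),
            "--gradient_accumulation_steps", str(self.gradient_accumulation_steps),
            "--lora_alpha", str(self.lora_alpha),
            "--lora_target_modules", ",".join(self.lora_target_modules),
            "--lr_scheduler_type", self.lr_scheduler_type,
            "--bf16", "--use_cpu", "--use_ipex",
        ]

args = BioGptFinetuneConfig().to_cli_args()
```

Keeping the hyperparameters in one dataclass makes it easier to vary a single value (for example, `lora_alpha` or the learning rate) across runs while holding the rest of the launch command fixed.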