Solution Scope
This section delineates the validated scope of the GenAI solution. Any elements not explicitly mentioned in this description are considered out of scope or not validated as part of the solution.
Note: GenAI is validated using the Hugging Face and MosaicML open-source libraries on the same solution architecture.
The following sections describe the systems used to perform the general end-to-end process of LLM fine-tuning and training:
End-to-end fine-tuning requires several major components to produce a valuable result:
As an initial step in fine-tuning a pretrained LLM, users begin with the fine-tuning dataset in its raw form. The data is compiled in a prompt-response format, mirroring the example provided, so that users can prepare their data to suit the requirements of the model.
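As a minimal sketch of this preparation step, the snippet below serializes a raw question/answer pair into a JSON Lines record. The field names "prompt" and "response" and the sample text are illustrative assumptions, not the exact schema of the validated dataset:

```python
import json

def to_prompt_response(question: str, answer: str) -> str:
    """Serialize one raw Q/A pair as a JSON Lines record in a
    prompt-response format (field names are illustrative)."""
    record = {"prompt": question.strip(), "response": answer.strip()}
    return json.dumps(record)

# Example record; the content is hypothetical.
line = to_prompt_response(
    "What storage protocol does the solution use for data access? ",
    " The S3 protocol against Dell APEX File Storage.",
)
print(line)
```

Each line of the resulting file holds one independent JSON object, which keeps the dataset easy to stream and shard during fine-tuning.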
Note: The fine-tuning dataset is out of the scope of this solution; the dataset used is already prepared in fine-tuning format and placed in Dell APEX File Storage.
The team chose foundation models from the Hugging Face Transformers library, specifically OpenAI GPT, Google BERT, and Falcon 1B, for fine-tuning the LLM.
Streaming dataset formats are typically employed to tackle common challenges in training or fine-tuning. In this case, however, the team directly retrieved data from Dell APEX File Storage using the S3 protocol. The Databricks Spark service efficiently manages I/O activity, enabling the team to perform advanced parallel operations.
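As a sketch of how a Spark session might be pointed at an S3-compatible endpoint such as Dell APEX File Storage, the helper below assembles the standard Hadoop S3A settings. The endpoint URL, credentials, and bucket path are placeholders, and the exact properties required in a given Databricks workspace may differ:

```python
def s3a_spark_conf(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Build Hadoop S3A settings for reading from an S3-compatible
    endpoint (values here are placeholders, not real credentials)."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style addressing is commonly needed for non-AWS S3 endpoints.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }

conf = s3a_spark_conf("https://apex-file.example.invalid", "ACCESS", "SECRET")

# The settings would then be applied to the SparkSession builder, e.g.:
#   builder = SparkSession.builder
#   for k, v in conf.items():
#       builder = builder.config(k, v)
#   df = builder.getOrCreate().read.json("s3a://bucket/finetune-data/")
for key in sorted(conf):
    print(key)
```

Keeping the configuration in one dictionary makes it straightforward to apply the same settings to notebooks and jobs alike.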
Note: Streaming dataset libraries are beyond the scope of this endeavor, as they fall under the purview of Data Engineers or Data Scientists.
The Hugging Face Transformers fine-tuning code automatically detects the system's CPU/GPU configuration, ensuring proper distribution of data and model resources.
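The device-placement decision can be illustrated with a minimal sketch. In a real Transformers run the availability flag comes from `torch.cuda.is_available()`; here it is stubbed out as a parameter so the example stays dependency-free:

```python
def select_device(gpu_available: bool) -> str:
    """Mimic the device-placement decision a fine-tuning script makes:
    use the GPU when one is detected, otherwise fall back to CPU."""
    return "cuda" if gpu_available else "cpu"

# In this functional validation no GPUs were used, so the fallback applies.
print(select_device(False))
```

The same check drives where model weights and batches are placed, which is why no code changes were needed to run the validation on CPUs.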
Note: As the solution is a functional validation, the team evaluated only on CPUs.
During the fine-tuning process, you can choose to reserve a portion of the data for validation or perform a separate evaluation task. This could involve various in-context learning (ICL) tasks to gauge the performance of your refined model.
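Reserving a validation slice can be sketched with a simple holdout split. The fraction, seed, and record shape below are illustrative choices, not the values used in the validated solution:

```python
import random

def train_val_split(records, val_fraction=0.1, seed=42):
    """Shuffle the fine-tuning records and hold out a fraction
    for validation (fraction and seed are illustrative)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

data = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(20)]
train, val = train_val_split(data, val_fraction=0.2)
print(len(train), len(val))  # 16 4
```

The held-out records never enter the optimizer, so evaluating on them (or on separate ICL tasks) gives an unbiased view of the refined model.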
Ultimately, you can submit fresh prompts to the fine-tuned model and use its responses to empirically confirm the model's utility for the user.
LLM training requires several major components:
As an initial step in the comprehensive process of training LLM models, the team set up train and test datasets in the fine-tuning format, generated synthetically using the datasets library.
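A minimal sketch of this step is shown below: it fabricates prompt-response pairs and splits them into train and test sets. The arithmetic-style prompts, split fraction, and seed are hypothetical stand-ins for the synthetic generation the team performed with the datasets library:

```python
import random

def make_synthetic_dataset(n: int, test_fraction: float = 0.2, seed: int = 0):
    """Generate n synthetic prompt-response pairs and split them into
    train and test sets (content and fraction are illustrative)."""
    rng = random.Random(seed)
    samples = [
        {"prompt": f"Add {a} and {b}.", "response": str(a + b)}
        for a, b in ((rng.randint(0, 99), rng.randint(0, 99)) for _ in range(n))
    ]
    n_test = int(n * test_fraction)
    return samples[n_test:], samples[:n_test]

train, test = make_synthetic_dataset(100)
print(len(train), len(test))  # 80 20
```

Because the pairs are generated programmatically, the same recipe can produce datasets of any size for functional validation without touching production data.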
The MosaicML training runtime provides a deep learning library built for NLP models and datasets. For the solution validation, the team ran only the "resnet_56" training algorithm using the MosaicML Composer library.
Streaming dataset formats are typically employed to tackle common challenges in training or fine-tuning. In this case, however, the team directly retrieved data from Dell APEX File Storage using the S3 protocol. The Databricks Spark service efficiently manages I/O activity, enabling the team to perform advanced parallel operations.
Note: Streaming dataset libraries are beyond the scope of this endeavor, as they fall under the purview of Data Engineers or Data Scientists.
The MosaicML Composer code automatically detects the system's CPU/GPU configuration, ensuring proper distribution of data and model resources.
Note: As the solution is a functional validation, the team evaluated only on CPUs.