The PDF datastore
Using the PDF datastore
- The PDF files included in the repo are located in the “pdfs-dell-infohub” directory. They are a random assortment of freely available PDF whitepapers, blogs, and technical documents from https://infohub.delltechnologies.com/
- You can use these as-is or delete and replace them with other PDF files.
- The directory name is hardcoded into the notebook. To change the name or location, modify the code references to it.
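Because the directory name is hardcoded, one way to make it easier to change is to define it once and reuse it. The following is a minimal sketch under that assumption, not the notebook's actual code; `PDF_DIR` and `list_pdfs` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Hypothetical constant: the directory name used in the repo.
# Change this one line if you rename or move the PDF directory.
PDF_DIR = Path("pdfs-dell-infohub")


def list_pdfs(pdf_dir: Path) -> list[Path]:
    """Return the PDF files found directly inside pdf_dir, sorted by name."""
    if not pdf_dir.is_dir():
        raise FileNotFoundError(f"PDF directory not found: {pdf_dir}")
    # Compare the extension case-insensitively so "report.PDF" is also included.
    return sorted(
        p for p in pdf_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `list_pdfs(PDF_DIR)` before loading documents gives a quick check that the directory exists and contains the files you expect.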
The vector database
More info on vector databases
- When the notebook is run, the PDF files are encoded and embedded into a ChromaDB vector database stored in the “db” directory, which sits in the same directory as the notebook.
- You can easily change the location of this in the notebook.
- For demo purposes and experimentation, the code deletes the “db” directory every time the notebook is run, and a new chatbot session is created so the content is up to date. You can change this as well if you like.
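The delete-on-every-run behavior described above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the notebook's own code: `DB_DIR` and `reset_db_dir` are hypothetical names, and recreating the empty directory afterwards is a choice made for this sketch.

```python
import shutil
from pathlib import Path

# Hypothetical constant: the "db" directory alongside the notebook.
DB_DIR = Path("db")


def reset_db_dir(db_dir: Path) -> Path:
    """Delete db_dir and everything in it, then recreate it empty."""
    if db_dir.exists():
        # Removes the whole directory tree, including any stored embeddings.
        shutil.rmtree(db_dir)
    db_dir.mkdir(parents=True, exist_ok=True)
    return db_dir
```

To keep the vector database between runs instead, skip the call to `reset_db_dir`.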
Install Miniconda
Creating use-case-based Conda environments
- When working with Python tools and libraries, it is highly recommended to use virtual environments.
- As you experiment with different notebooks, compatibility issues become very evident as different libraries are installed or uninstalled for certain tools.
- Miniconda is used to segregate your work area into use-case-based Conda environments.
- RAG workloads are very similar to one another, so the libraries and tools won’t fluctuate much. What you use for one notebook you’ll probably use for another. Fine-tuning workloads are likewise similar to one another; however, the two kinds of workload differ enough that each should have its own Conda environment. Try to avoid mixing your RAG notebook kernel with your fine-tuning (FT) notebook kernel.
- Installer link: https://docs.conda.io/projects/miniconda/en/latest/index.html
- A Conda environment YAML file is available in the GenAI GitHub repo.
- Create the RAG environment from the included file: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file
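Creating the environment from the YAML file comes down to a single `conda env create -f <file>` command. The sketch below wraps that command in Python so it can be scripted; the function names and the `environment.yml` filename in the usage note are hypothetical, since the actual filename in the repo is not given here.

```python
import subprocess


def conda_env_create_command(env_file: str) -> list[str]:
    """Build the `conda env create` command for the given environment file."""
    return ["conda", "env", "create", "-f", env_file]


def create_env(env_file: str) -> None:
    """Run the command. Requires conda on PATH; raises if conda reports an error."""
    subprocess.run(conda_env_create_command(env_file), check=True)
```

For example, `create_env("environment.yml")` would run `conda env create -f environment.yml`, with the environment name taken from the `name:` field inside the file.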
Install Jupyter Notebooks
Visualizing and sharing Python code with Jupyter Notebooks
Starting the Jupyter Notebook server
In the (base) Conda environment
- Edit the Jupyter config file and replace “localhost” with your Linux host IP address.
- At the command line, start the notebook server from the same directory as the GenAI GitHub repo.
- Jupyter Notebook starts a small web server on the Linux host to serve the UI. You’ll need a Windows jump box or another machine with a browser and network access to your Linux host to reach the server’s URL.
Figure 4. Example of how to start the Jupyter Notebook server
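The config edit in the steps above (replacing “localhost” with the host IP address) is a plain text substitution. A minimal sketch follows, assuming the config file contains a line such as `c.NotebookApp.ip = 'localhost'`. The exact option name depends on your Jupyter version, and both the helper name `bind_to_host_ip` and the address `192.0.2.10` are placeholders.

```python
def bind_to_host_ip(config_text: str, host_ip: str) -> str:
    """Replace every occurrence of "localhost" in the config text with host_ip."""
    return config_text.replace("localhost", host_ip)


# Assumed sample line; check your own config file for the actual option name.
sample = "c.NotebookApp.ip = 'localhost'\n"
print(bind_to_host_ip(sample, "192.0.2.10"), end="")
# prints: c.NotebookApp.ip = '192.0.2.10'
```

In practice you would read the config file, apply the substitution, and write the result back, or make the same one-line change in a text editor.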