2. Retrieval Method
The objective is to embed tabular data, SQL queries, and dataset documentation from DDAE, along with other types of data such as images, audio, video, and text, into a unified vector space that enables simultaneous vector searches across all media types.
Note: For simplicity, we focus on DDL schemas, historical SQL queries, and sample images as the data types to chunk and embed.
For the retrieval method, we use a multimodal embedding model to convert both text and images into a unified vector space. This approach enables seamless retrieval of relevant data, so the chatbot can provide accurate and contextually rich responses. By embedding structured data, such as DDL schemas, historical SQL queries, federated ETL queries, BI report logic, and dataset documentation, alongside unstructured data like images, we create a cohesive vector representation that supports efficient, simultaneous vector searches across all modalities. This method improves data accessibility and strengthens the system's ability to integrate diverse information sources, addressing the limitations of LLMs by supplementing them with up-to-date, contextually relevant data.
First, process the text data from the various source files and convert it into embeddings using a text embedding model.
from sentence_transformers import SentenceTransformer

# Initialize the text embedding model.
# ('text-embedding-ada-002' is an OpenAI API model and cannot be loaded
# by sentence-transformers, so we use the open 'all-MiniLM-L6-v2' model.)
text_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to read text from a file
def read_text_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()

# Load structured text data from files
sql_queries = read_text_file('sql_queries.txt')
dataset_documentation = read_text_file('dataset_documentation.txt')
bi_report_queries = read_text_file('bi_report_queries.txt')
ddl_schemas = read_text_file('ddl_schemas.txt')
etl_queries = read_text_file('etl_queries.txt')

# Combine all structured text data into a single document
combined_text_data = "\n".join([
    sql_queries,
    dataset_documentation,
    bi_report_queries,
    ddl_schemas,
    etl_queries
])

# Generate an embedding for the combined text data
text_embeddings = text_embedding_model.encode(combined_text_data)

# Output the text embedding
print("Text Embeddings:", text_embeddings)
Next, we process image data and generate embeddings using a multimodal embedding model like CLIP, which can handle both text and image inputs.
from PIL import Image
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Load the CLIP model and preprocess function
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Function to load and preprocess an image
def preprocess_image(image_path):
    image = Image.open(image_path)
    return preprocess(image).unsqueeze(0).to(device)

# Load and preprocess image data
image_path = 'example_image.jpg'
image_input = preprocess_image(image_path)

# Generate image embeddings (shape (1, 512) for ViT-B/32)
with torch.no_grad():
    image_embeddings = clip_model.encode_image(image_input).cpu().numpy()

# Output the image embeddings
print("Image Embeddings:", image_embeddings)
Finally, store the embeddings in a vector database and demonstrate a retrieval operation. Because all vectors in a collection must share the same dimensionality, the text is re-encoded with CLIP's text encoder so that text and image vectors live in the same 512-dimensional space.
import chromadb

# Initialize a persistent ChromaDB client
# (the legacy Settings(chroma_db_impl=...) API has been removed)
chroma_client = chromadb.PersistentClient(path=".chroma_db")

# Create (or reuse) a collection, using cosine distance for CLIP vectors.
# All embeddings in one collection must share the same dimensionality, so
# the text is embedded with CLIP's text encoder (512-d) to match the CLIP
# image embeddings and keep a single unified space.
collection = chroma_client.get_or_create_collection(
    name='multimodal_embeddings',
    metadata={"hnsw:space": "cosine"}
)

# Encode the combined text data with CLIP's text encoder.
# CLIP truncates input to 77 tokens; chunk long documents in practice.
with torch.no_grad():
    text_tokens = clip.tokenize([combined_text_data], truncate=True).to(device)
    clip_text_embeddings = clip_model.encode_text(text_tokens).cpu().numpy()

# Insert the embeddings into the collection (ChromaDB expects flat lists)
collection.add(
    ids=['text_data', 'image_data'],
    embeddings=[clip_text_embeddings[0].tolist(), image_embeddings[0].tolist()],
    metadatas=[{'modality': 'text'}, {'modality': 'image'}]
)

# Encode an example query with the same CLIP text encoder
query_text = "Sample query text related to the data"
with torch.no_grad():
    query_tokens = clip.tokenize([query_text]).to(device)
    query_vector = clip_model.encode_text(query_tokens).cpu().numpy()[0]

# Retrieve similar embeddings across both modalities
results = collection.query(
    query_embeddings=[query_vector.tolist()],
    n_results=5  # Number of similar results to retrieve
)

# Output the retrieval results
print("Retrieval Results:", results)