himanshu_negi
Sisense Team Member

Exploring RAG: A Sisense-Based Approach with BigQuery and Gemini

Continuing from our previous article, Automated Machine Learning with Sisense Fusion: A Practical Guide, where we discussed how Sisense and AutoML simplify complex machine learning workflows, I am now excited to introduce an advanced feature: a chatbot interface powered by Retrieval-Augmented Generation (RAG) and a large language model (LLM) like Gemini.

This enhancement allows users to interact with their data in natural language through Sisense's Native Blox widget. Imagine asking questions like "How many products did I sell per category?" and receiving insightful answers instantly. This not only improves data accessibility but also bridges the gap between technical data repositories and business decision-making.

In this article, I’ll cover the first step of enabling this functionality: setting up a seamless integration pipeline between Sisense and Google BigQuery, embedding table and column metadata, and preparing the data for efficient querying.

Important Note:

  1. Experimental Project: This implementation is an experimental project, not an official Sisense feature. It demonstrates how Sisense's functionality can be used to build a custom Retrieval-Augmented Generation (RAG) pipeline; you can reproduce the example by importing the provided notebooks.
  2. Google Ecosystem: The example provided is specific to the Google ecosystem, utilizing services like BigQuery (BQ) and Gemini for query generation and processing.
  3. Associated Costs: Please note that each prompt sent to the chatbot incurs a cost, which will be billed to your Google Cloud Platform (GCP) account. This includes charges for LLM processing and database queries.

Building the Foundation: Data Preparation and Embeddings

To enable natural language interactions with your data, we first need to establish a robust data infrastructure. This includes creating a vector store for embeddings in Google BigQuery (BQ) and preparing metadata about tables and columns. The process is fully automated using a pre-built Sisense custom code notebook, which simplifies embedding generation and management.

  1. Dataset in BigQuery

The journey begins with your dataset in Google BigQuery. The notebook retrieves metadata such as table and column names, descriptions, and schema information from BigQuery's INFORMATION_SCHEMA. If any descriptions are missing, the system automatically generates them, ensuring comprehensive metadata coverage.
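
For illustration, a metadata query of this kind might look as follows (the dataset name `my_dataset` is a placeholder, and the notebook's actual query may differ):

```sql
-- Retrieve column-level metadata for every table in a dataset.
-- `my_dataset` is a placeholder for your BigQuery dataset.
SELECT
  c.table_name,
  c.column_name,
  c.data_type,
  fp.description AS column_description
FROM `my_dataset.INFORMATION_SCHEMA.COLUMNS` AS c
LEFT JOIN `my_dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` AS fp
  ON c.table_name = fp.table_name AND c.column_name = fp.column_name
ORDER BY c.table_name, c.ordinal_position;
```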

  2. Setting Up the Vector Store in BigQuery

The vector store acts as the backbone for similarity searches within the RAG system. Using the custom notebook, the system:

  • Creates the vector store: A structured repository to store embeddings for tables and columns.
  • Organizes metadata: Ensures all table and column descriptions are structured and accessible.
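
A minimal sketch of what such a vector-store table could look like (table name and schema are illustrative, not the notebook's actual DDL):

```sql
-- Illustrative vector-store table. The embedding column is an
-- ARRAY<FLOAT64>, which BigQuery's native vector search can operate on.
CREATE TABLE IF NOT EXISTS `my_dataset.metadata_embeddings` (
  source_type STRING,         -- 'table' or 'column'
  table_name  STRING,
  column_name STRING,
  content     STRING,         -- the description text that was embedded
  embedding   ARRAY<FLOAT64>  -- vector produced by Vertex AI
);
```
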

  3. Generating Table and Column Descriptions

Missing descriptions can hinder data understanding. The notebook includes a Description Agent that automatically generates meaningful text for:

  • Tables: Contextual descriptions based on schema and usage.
  • Columns: Descriptions highlighting their role within the dataset.

These enhancements ensure the metadata is both informative and ready for embedding generation.

  4. Creating Embeddings with Vertex AI

To enable semantic search, metadata descriptions are converted into numerical embeddings using Vertex AI’s TextEmbeddingModel. This is facilitated by the EmbedderAgent, which:

  • Accepts strings or lists of strings (e.g., column descriptions).
  • Generates embeddings through Vertex AI.
  • Handles both single and batch processing for efficiency.
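
As a rough sketch, an embedder agent along these lines could wrap the model. The class name mirrors the notebook's EmbedderAgent, but the implementation here is illustrative; the model object is injected so the sketch is not tied to Vertex AI credentials:

```python
class EmbedderAgent:
    """Turns strings into embedding vectors via an injected model.

    `model` is assumed to expose get_embeddings(list_of_str) and return
    objects with a `.values` vector attribute, matching the shape of
    Vertex AI's TextEmbeddingModel (an assumption, not verified here).
    """

    def __init__(self, model):
        self.model = model

    def embed(self, texts):
        # Accept a single string or a list of strings.
        if isinstance(texts, str):
            texts = [texts]
        results = self.model.get_embeddings(texts)
        return [r.values for r in results]
```

With Vertex AI, the injected model would typically come from `TextEmbeddingModel.from_pretrained(...)`; any other object with the same method shape works for local testing.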

  5. Efficient Embedding with Chunked Processing

For large datasets, embeddings are generated in chunks using the get_embedding_chunked function. This ensures:

  • Scalability: Handles datasets of all sizes without performance issues.
  • Parallel Processing: Processes text chunks simultaneously to speed up the workflow.
  • Structured Outputs: Embeddings are returned in a structured DataFrame for storage or analysis.
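
The chunking logic itself can be sketched in a few lines of plain Python. The function name mirrors the notebook's get_embedding_chunked, but the body is illustrative; `embed_fn` stands in for whatever actually calls the embedding model:

```python
def get_embedding_chunked(texts, embed_fn, chunk_size=100):
    """Generate embeddings in fixed-size chunks.

    `embed_fn` maps a list of strings to a list of vectors; chunking
    keeps each request to the embedding service small.
    """
    rows = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        vectors = embed_fn(chunk)
        rows.extend({"content": t, "embedding": v}
                    for t, v in zip(chunk, vectors))
    return rows  # one dict per input text
```

Wrapping the result with `pandas.DataFrame(rows)` yields the structured DataFrame described above.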

  6. Storing Embeddings in the Vector Store

The final step is storing these embeddings in the BigQuery vector store. This ensures that:

  • Similarity searches are fast and efficient.
  • Metadata is always accessible for chatbot interactions.
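
A similarity search against this store can then use BigQuery's native VECTOR_SEARCH table function. A hedged sketch, in which all table, column, and parameter names are placeholders:

```sql
-- Find the 5 stored descriptions closest to a question embedding.
-- @question_embedding is a placeholder query parameter.
SELECT base.table_name, base.content, distance
FROM VECTOR_SEARCH(
  TABLE `my_dataset.metadata_embeddings`,
  'embedding',
  (SELECT @question_embedding AS embedding),
  top_k => 5,
  distance_type => 'COSINE'
);
```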


How the Chatbot Interface Works

Now that the foundation for embeddings and metadata storage is set, let’s explore the chatbot interface in action.

Imagine opening the chatbot in the Blox widget and asking a question about your dataset. Within moments, the chatbot responds in natural language, providing actionable insights. But what exactly happens under the hood to generate this seamless interaction?

RAG Notebook and the Chatbot Workflow

The chatbot operates using a pre-built RAG custom code transformation notebook, which orchestrates the end-to-end process. With this notebook, the entire pipeline—from understanding the query to generating the response—is automated.

The notebook uses multiple specialized agents, each responsible for a specific task, ensuring precision and efficiency at every step.

  1. SQL Query Builder Agent

BuildSQLAgent

  • This agent specializes in constructing SQL queries for BigQuery.
  • It uses the LLM to analyze the user’s natural language query and matches it with table schemas and column details from the vector store.
  • It outputs a fully formed SQL query tailored to the user’s dataset and question.
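
As an illustration of what such an agent's LLM prompt might contain (the actual BuildSQLAgent prompt is certainly more elaborate, and this helper is hypothetical):

```python
def build_sql_prompt(question, schema_snippets):
    """Assemble an illustrative SQL-generation prompt.

    `schema_snippets` are the table/column descriptions retrieved
    from the vector store for this question.
    """
    schema_block = "\n".join(schema_snippets)
    return (
        "You are a BigQuery SQL expert.\n"
        "Using ONLY the tables and columns described below, write a "
        "single BigQuery SQL query that answers the question.\n\n"
        f"Schema:\n{schema_block}\n\n"
        f"Question: {question}\n"
        "Return only the SQL, with no explanation."
    )
```
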

  2. SQL Validation Agent

ValidateSQLAgent

  • The ValidateSQLAgent validates the SQL query before execution using the LLM.
  • Validation ensures the query adheres to essential rules, including:
    • The presence of all referenced columns and tables.
    • Proper table relationships and join conditions based on the schema.
    • Formatting and compliance with BigQuery-specific SQL standards.
  • Validation occurs during the debugging process, specifically within the DebugSQLAgent, to identify potential errors before attempting a dry run or execution.
  • It provides a detailed JSON response:
    • If valid, the process moves to the next step (dry run or execution).
    • If invalid, the DebugSQLAgent uses the error details to refine the query iteratively.
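
Handling that JSON verdict might look roughly like this (the `{"valid": ..., "errors": ...}` shape is an assumed format for illustration, not the notebook's documented one):

```python
import json

def parse_validation(llm_reply):
    """Interpret an LLM validation verdict returned as JSON.

    Returns (is_valid, errors); unparseable replies are treated
    as failed validation so the debug loop can retry.
    """
    try:
        verdict = json.loads(llm_reply)
    except json.JSONDecodeError:
        return False, ["LLM reply was not valid JSON"]
    return bool(verdict.get("valid")), verdict.get("errors", [])
```
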

  3. SQL Debugging Loop Agent

DebugSQLAgent

  • This agent runs the debugging loop to refine queries that fail validation or execution.
  • The process includes:
    1. Validation:
      • The query is passed to ValidateSQLAgent to check syntax, schema compliance, and structure.
      • If valid, the query is ready for execution.
    2. Dry Run:
      • If validation passes, the query is tested using a dry run via the test_sql_plan_execution function to confirm execution readiness.
    3. Execution:
      • Once validation and dry runs succeed, the final query is executed using the retrieve_df function, which returns results as a DataFrame.
  • Iterative Refinement:
    • If the query fails either validation or the dry run, the DebugSQLAgent uses the LLM to troubleshoot and generate an alternative query.
    • The loop repeats until a valid query is generated or the maximum number of debugging rounds is reached.
  • This agent ensures the final query is:
    • Correctly structured and semantically valid.
    • Optimized for performance and aligned with the original user intent.
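
The loop described above can be sketched as follows; function names and signatures are illustrative, not the notebook's actual API:

```python
def debug_sql_loop(query, validate, dry_run, fix, max_rounds=3):
    """Validate -> dry-run -> fix loop for a candidate SQL query.

    `validate` and `dry_run` each return (ok, error); `fix` asks the
    LLM for a corrected query given the error details.
    """
    for _ in range(max_rounds):
        ok, error = validate(query)
        if ok:
            ok, error = dry_run(query)
            if ok:
                return query          # ready for real execution
        query = fix(query, error)     # iterative refinement via the LLM
    raise RuntimeError("no valid query within max debugging rounds")
```
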

  4. Response Agent

ResponseAgent

  • This agent translates the SQL query results into natural language.
  • It bridges the gap between technical SQL outputs and user-friendly communication.
  • By combining the query results with the user’s original question, it crafts a clear and relevant response.
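
A simple sketch of how results and question might be combined into an LLM prompt (illustrative, not the notebook's actual ResponseAgent; `rows` stands in for a DataFrame converted with `to_dict("records")`):

```python
def build_response_prompt(question, rows):
    """Combine query results with the original question so the LLM
    can answer in plain language. `rows` is a list of dicts."""
    lines = [", ".join(f"{k}: {v}" for k, v in row.items())
             for row in rows]
    return (
        f"Question: {question}\n"
        "Query results:\n" + "\n".join(lines) + "\n"
        "Answer the question in plain language using only these results."
    )
```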

How the Workflow Executes

Here’s the step-by-step process for generating a response:

  1. User Query Embedding
    • The EmbedderAgent converts the user’s natural language question into a numerical embedding.
    • Using BigQuery native vector search, the system retrieves similar embeddings from the vector store created in the first phase.
  2. Schema and Content Retrieval
    • Based on the retrieved embeddings, the system fetches relevant table and column schema details from the vector store.
  3. SQL Query Generation
    • The BuildSQLAgent uses the retrieved schema details to construct an SQL query that aligns with the user’s question.
  4. SQL Validation and Execution
    • The ValidateSQLAgent checks the generated SQL for accuracy and potential errors.
    • If the SQL passes validation, it is executed against the BigQuery database.
  5. Debugging (if needed)
    • If the query fails or generates an error, the DebugSQLAgent refines it iteratively until a valid query is produced.
  6. Response Generation
    • The ResponseAgent uses the query results and the user’s original prompt to generate a natural language response.
    • If the system fails to generate a valid response, it communicates the issue to the user.
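
The six steps above can be sketched end-to-end, with each agent passed in as a plain function; all names here are illustrative, not the notebook's:

```python
def answer_question(question, embed, search, build_sql, run_sql, respond):
    """End-to-end sketch of the RAG workflow described above."""
    vector = embed(question)            # 1. embed the user question
    schema = search(vector)             # 2. retrieve similar metadata
    sql = build_sql(question, schema)   # 3. generate SQL
    rows = run_sql(sql)                 # 4-5. validate, debug, execute
    return respond(question, rows)      # 6. natural-language answer
```
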

 

Conclusion

By combining the foundational embedding process with this RAG-powered workflow, the chatbot transforms how users interact with their data. From seamless SQL query generation to delivering natural language responses, the system exemplifies the power of Sisense Fusion and advanced AI tools to simplify data-driven decision-making.

As always, please reach out to your Customer Success Manager (CSM) if you would like to implement this in your own environment.

Version history
Last update: 12-30-2024 12:56 PM