# Exploring RAG: A Sisense-Based Approach with BigQuery and Gemini
Continuing from our previous article, Automated Machine Learning with Sisense Fusion: A Practical Guide, where we discussed how Sisense and AutoML simplify complex machine learning workflows, I am excited to introduce an advanced feature: a chatbot interface powered by Retrieval-Augmented Generation (RAG) and a large language model (LLM) such as Gemini.

This enhancement allows users to interact with their data in natural language through Sisense's native Blox widget. Imagine asking "How many products did I sell per category?" and receiving an insightful answer instantly. This not only improves data accessibility but also bridges the gap between technical data repositories and business decision-making.

In this article, I'll cover the first step of enabling this functionality: setting up a seamless integration pipeline between Sisense and Google BigQuery, embedding table and column metadata, and preparing the data for efficient querying.

**Important notes:**
- **Experimental project:** This implementation is an experimental project, not an official Sisense feature. It demonstrates how Sisense's functionality can be used to build a custom RAG pipeline, or you can reuse this example by importing the provided notebooks.
- **Google ecosystem:** The example is specific to the Google ecosystem, using services such as BigQuery (BQ) and Gemini for query generation and processing.
- **Associated costs:** Each prompt sent to the chatbot incurs a cost, billed to your Google Cloud Platform (GCP) account. This includes charges for LLM processing and database queries.

## Building the Foundation: Data Preparation and Embeddings

To enable natural language interactions with your data, we first need to establish a robust data infrastructure. This includes creating a vector store for embeddings in Google BigQuery and preparing metadata about tables and columns. The process is fully automated by a pre-built Sisense custom code notebook, which simplifies embedding generation and management.

### Dataset in BigQuery

The journey begins with your dataset in Google BigQuery. The notebook retrieves metadata such as table and column names, descriptions, and schema information from BigQuery's INFORMATION_SCHEMA. If any descriptions are missing, the system generates them automatically, ensuring comprehensive metadata coverage.

### Setting Up the Vector Store in BigQuery

The vector store acts as the backbone for similarity searches within the RAG system. Using the custom notebook, the system:
- **Creates the vector store:** a structured repository that stores embeddings for tables and columns.
- **Organizes metadata:** ensures all table and column descriptions are structured and accessible.

### Generating Table and Column Descriptions

Missing descriptions can hinder data understanding. The notebook includes a Description Agent that automatically generates meaningful text for:
- **Tables:** contextual descriptions based on schema and usage.
- **Columns:** descriptions highlighting their role within the dataset.

These enhancements ensure the metadata is both informative and ready for embedding generation.

### Creating Embeddings with Vertex AI

To enable semantic search, metadata descriptions are converted into numerical embeddings using Vertex AI's TextEmbeddingModel. This is handled by the EmbedderAgent, which:
- Accepts strings or lists of strings (e.g., column descriptions).
- Generates embeddings through Vertex AI.
- Handles both single and batch processing for efficiency.
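To make this concrete, here is a minimal sketch of how the EmbedderAgent's job can be done with Vertex AI's text embedding API. The model name, project ID, and the embed_texts helper are illustrative assumptions, not the notebook's actual code:

```python
# Minimal sketch of EmbedderAgent-style embedding (assumption: not the
# notebook's exact implementation). Requires: pip install google-cloud-aiplatform
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")  # hypothetical project ID
model = TextEmbeddingModel.from_pretrained("text-embedding-004")   # assumed model name

def embed_texts(texts):
    """Embed a single string or a list of strings, as the EmbedderAgent does."""
    if isinstance(texts, str):
        texts = [texts]
    embeddings = model.get_embeddings(texts)
    return [e.values for e in embeddings]  # one float vector per input string

vectors = embed_texts(["Total sales amount per order line", "Product category name"])
```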
### Efficient Embedding with Chunked Processing

For large datasets, embeddings are generated in chunks using the get_embedding_chunked function. This ensures:
- **Scalability:** handles datasets of all sizes without performance issues.
- **Parallel processing:** processes text chunks simultaneously to speed up the workflow.
- **Structured outputs:** embeddings are returned in a structured DataFrame for storage or analysis.

### Storing Embeddings in the Vector Store

The final step is storing these embeddings in the BigQuery vector store. This ensures that:
- Similarity searches are fast and efficient.
- Metadata is always accessible for chatbot interactions.

*[Screenshot: a data table with columns such as "table_schema," "column_name," "data_type," and "source_type," showing sample values and metadata describing the database structure.]*

## How the Chatbot Interface Works

Now that the foundation for embeddings and metadata storage is set, let's explore the chatbot interface in action. Imagine opening the chatbot in the Blox widget and asking a question about your dataset. Within moments, the chatbot responds in natural language, providing actionable insights. But what exactly happens under the hood to generate this seamless interaction?

### RAG Notebook and the Chatbot Workflow

The chatbot operates using a pre-built RAG custom code transformation notebook, which orchestrates the end-to-end process. With this notebook, the entire pipeline, from understanding the query to generating the response, is automated. The notebook uses multiple specialized agents, each responsible for a specific task, ensuring precision and efficiency at every step.

### SQL Query Builder Agent (BuildSQLAgent)

This agent specializes in constructing SQL queries for BigQuery. It uses the LLM to analyze the user's natural language question and match it against the table schemas and column details from the vector store. It outputs a fully formed SQL query tailored to the user's dataset and question.

### SQL Validation Agent (ValidateSQLAgent)

The ValidateSQLAgent validates the SQL query before execution, using the LLM. Validation ensures the query adheres to essential rules, including:
- The presence of all referenced columns and tables.
- Proper table relationships and join conditions based on the schema.
- Formatting and compliance with BigQuery-specific SQL standards.

Validation occurs during the debugging process, specifically within the DebugSQLAgent, to identify potential errors before attempting a dry run or execution. It returns a detailed JSON response:
- If the query is valid, the process moves to the next step (dry run or execution).
- If it is invalid, the DebugSQLAgent uses the error details to refine the query iteratively.

### SQL Debugging Loop Agent (DebugSQLAgent)

This agent runs the debugging loop that refines queries which fail validation or execution. The process, sketched in code after this list, includes:
1. **Validation:** the query is passed to ValidateSQLAgent to check syntax, schema compliance, and structure. If valid, the query is ready for execution.
2. **Dry run:** if validation passes, the query is tested via the test_sql_plan_execution function to confirm execution readiness.
3. **Execution:** once validation and the dry run succeed, the final query is executed using the retrieve_df function, which returns the results as a DataFrame.
4. **Iterative refinement:** if the query fails either validation or the dry run, the DebugSQLAgent uses the LLM to troubleshoot and generate an alternative query. The loop repeats until a valid query is generated or the maximum number of debugging rounds is reached.

This agent ensures the final query is:
- Correctly structured and semantically valid.
- Optimized for performance and aligned with the original user intent.
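For intuition, here is a minimal sketch of such a loop, built around BigQuery's dry-run mode, which parses and plans a query without billing for execution. The refine_with_llm callback, the loop structure, and the project ID are illustrative assumptions, not the notebook's actual implementation:

```python
# Minimal sketch of a validate / dry-run / refine loop (assumption: not the
# notebook's exact code). Requires: pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # hypothetical project ID

def debug_sql_loop(sql, refine_with_llm, max_rounds=3):
    """Try a query; on failure, ask the LLM for a corrected version and retry."""
    for _ in range(max_rounds):
        try:
            # Dry run: BigQuery validates and plans the query without running it.
            client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
            return client.query(sql).to_dataframe()  # real execution
        except Exception as err:
            # Hand the failing SQL and the error text back to the LLM.
            sql = refine_with_llm(sql, str(err))
    raise RuntimeError("No valid query after maximum debugging rounds")
```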
### Response Agent (ResponseAgent)

This agent translates the SQL query results into natural language, bridging the gap between technical SQL output and user-friendly communication. By combining the query results with the user's original question, it crafts a clear and relevant response.

### How the Workflow Executes

Here is the step-by-step process for generating a response:
1. **User query embedding:** the EmbedderAgent converts the user's natural language question into a numerical embedding. Using BigQuery native vector search, the system retrieves similar embeddings from the vector store created in the first phase (see the sketch after this list).
2. **Schema and content retrieval:** based on the retrieved embeddings, the system fetches the relevant table and column schema details from the vector store.
3. **SQL query generation:** the BuildSQLAgent uses the retrieved schema details to construct an SQL query that aligns with the user's question.
4. **SQL validation and execution:** the ValidateSQLAgent checks the generated SQL for accuracy and potential errors. If the SQL passes validation, it is executed against the BigQuery database.
5. **Debugging (if needed):** if the query fails or produces an error, the DebugSQLAgent refines it iteratively until a valid query is produced.
6. **Response generation:** the ResponseAgent uses the query results and the user's original prompt to generate a natural language response. If the system fails to generate a valid response, it communicates the issue to the user.
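Step 1's retrieval can be pictured with BigQuery's native VECTOR_SEARCH function. A minimal sketch, reusing the embed_texts helper from the earlier snippet; the vector store table name and its columns are illustrative assumptions:

```python
# Minimal sketch of similarity search over the metadata vector store
# (assumption: table and column names are illustrative, not the actual schema).
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # hypothetical project ID
question_embedding = embed_texts("How many products did I sell per category?")[0]

sql = """
SELECT base.table_name, base.column_name, base.description, distance
FROM VECTOR_SEARCH(
  TABLE `your-gcp-project.rag_store.metadata_embeddings`,  -- hypothetical vector store
  'embedding',
  (SELECT @question_embedding AS embedding),
  top_k => 10,
  distance_type => 'COSINE')
ORDER BY distance
"""
job = client.query(sql, job_config=bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter(
        "question_embedding", "FLOAT64", question_embedding)]))
schema_context = job.to_dataframe()  # most similar table/column descriptions
```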
## Conclusion

By combining the foundational embedding process with this RAG-powered workflow, the chatbot transforms how users interact with their data. From seamless SQL query generation to natural language responses, the system exemplifies the power of Sisense Fusion and advanced AI tools to simplify data-driven decision-making.

As always, please reach out to your Customer Success Manager (CSM) if you would like to implement this in your own environment.

# Limiting Date Range Filters in Sisense Dashboards

Wide date ranges in Sisense dashboards can lead to performance issues, especially when using live models or querying large datasets. For live data models, large queries increase costs as more data is pulled from the data warehouse. For Elasticubes, they can cause performance bottlenecks. To avoid these issues, here is a quick solution to limit the date range users can select, ensuring both cost efficiency and smooth performance. Read more to find out how!
# Resolving Issues with Updating EC2EC Passwords in Elasticube Connections

This article addresses the issue of updating EC2EC (Elasticube to Elasticube) passwords in the context of employee off-boarding or password changes. Specifically, it covers the scenario where tables do not appear in the source cube after updating the password, preventing the completion of the update.
# Restoring a Missing Connector

Sometimes connectors can go missing, leading to a breakdown in data connectivity. Restoring these connectors is crucial to maintaining a smooth data flow within your Sisense environment. This article describes the process of restoring a missing connector in Sisense; the accompanying tutorial also covers how to restore missing connectors in a Sisense Linux environment.
# BigQuery can't retrieve data but can see schema due to wrong dataset location

I can retrieve all datasets and tables from BigQuery, but I can't actually get the data (preview or executing custom SQL). This happens for both Live connections and Elasticubes. I use a JSON service account file to authenticate, and in Google Cloud logging I can see that Sisense is trying to fetch the data from the US, which is not the location of the dataset. I don't know how to fix it.

Google Cloud log when trying to fetch data:

"Not found: Dataset PROJECT_XX:dataset_xx was not found in location US"
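One way to confirm the mismatch before changing any connection settings is to check where the dataset actually lives, using the same service account JSON. A minimal diagnostic sketch; the key file name is illustrative and the project/dataset IDs are the placeholders from the log:

```python
# Minimal diagnostic sketch: print where the dataset actually lives.
# Requires: pip install google-cloud-bigquery
from google.cloud import bigquery

# Authenticate with the same service account JSON that Sisense uses.
client = bigquery.Client.from_service_account_json("service_account.json")

dataset = client.get_dataset("PROJECT_XX.dataset_xx")  # placeholder IDs from the log
print(dataset.dataset_id, "->", dataset.location)      # e.g. "EU", not "US"
```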
# Querying the Sisense Elasticube with Python

This article covers two methods you can use to connect Python to the Elasticube:
- Method 1: Using ODBC
- Method 2: Using the REST API

## Method 1: Connect to the EC with ODBC

### Download Sisense ODBC

This method requires you to download the ODBC driver. Follow the instructions on the ODBC documentation page before starting to work in Python.

### Connect to the system DSN ODBC with pyODBC

The following libraries are required for this step (both are standard with Anaconda):
- pyodbc
- pandas

Step 1: Import the libraries

```python
import pyodbc
import pandas as pd
```

Step 2: Establish a connection to the EC

```python
cnxn = pyodbc.connect('DSN=Sisense ODBC Driver')
```

Step 3 (optional): Print out the available tables

```python
# Create a cursor that can execute commands against the connection
cursor = cnxn.cursor()

# Print all available tables
for t in cursor.tables():
    print(t[2])
```

Step 4: Put the data into a DataFrame. Substitute <Table Name> with the table you want to query.

```python
# Bring the data into a dataframe called df
df = pd.read_sql('select * from <Table Name>', cnxn)

# Print out the top records from df
df.head()
```

## Method 2: Query the EC with the REST API

This method requires the following Python libraries:
- requests
- urllib.parse
- pandas

Step 1: Import the libraries

```python
import requests
import urllib.parse as parse
import pandas as pd
```

Step 2: Define the data source and your query

```python
# Your Elasticube name is your dataSource
dataSource = 'leapYearUseCase'

# Query the data as if you were writing a custom SQL query
query = 'select Years, max(Amount) as maxAmount from <Your Table> group by Years'
```

Step 3: URL-encode the data source and the query

```python
dataSourceURI = parse.quote(dataSource)
queryURI = parse.quote(query)
```

Step 4: Build the endpoint and headers. To retrieve your Bearer token, follow the instructions under USING THE SISENSE API - Authentication.

```python
# http://localhost:8081 may vary depending on your configuration
# The request asks for the data formatted as JSON
endPoint = 'http://localhost:8081/api/elasticubes/{}/Sql?format=json&query={}'.format(dataSourceURI, queryURI)

# Enter your bearer token in place of <your bearer token>
header = {'Authorization': 'Bearer <your bearer token>'}
```

Step 5: Send the request and read the response

```python
# Send the GET request
response = requests.get(endPoint, headers=header)

# Parse the response body as JSON
data = response.json()

# Put the data into a dataframe
df = pd.DataFrame(data['values'], columns=data['headers'])
```

Cheers!
# Name Type Field Returned as { "Name": "Text" } instead of just Text

All of a sudden, all the name-type fields/columns are returned as { "Name": "John Doe" } instead of John Doe. The data source is all Salesforce tables. Is this a connector issue? It was OK before today. Thanks.
# Connection to BigQuery

Hi all. I am a brand-new Sisense user, never used it before, so sorry if this question is too basic. I am trying to connect to our BigQuery database, and I read that I need to use a service account, so I will need the key .json file. According to the documentation, I need to copy this JSON file onto the Sisense server. Is there a way to do this using the GUI instead of physically copying it into a folder on the Sisense installation path? Best regards.
# Remove Accordion button symbol with a hover over from Widget Title Bar

I am using the Accordion plugin and it works great. My only issue is that it places a symbol with a hover-over in the widget title. Great idea, but it makes the widget title look garbled and hard to read. I would like to remove this from the widget title. Thank you.
# Sisense doesn't update cached table from Redshift

Hi there,

We have a table called jora.historical_email_engagement that was connected to Sisense Periscope but stopped updating from June 28th onwards. The original source table is updated regularly and can be queried via SQL Workbench. However, Sisense reports show that the table was updated as usual and haven't surfaced any errors. Please help me resolve this. Thank you.