    AI & Machine Learning
      • Product Feedback Forum
      Allow Multiple Providers for Sisense Intelligence
      Fred Ortmann
      Posted 1 week ago
      0

      • Product Feedback Forum
      Make Narrative more polished and professional
      Chris Wallingford
      Posted 2 weeks ago
      0

      • Product Feedback Forum
      Support for Amazon Bedrock as LLM Provider in Sisense AI Assistant
      Surya Kant
      Posted 3 weeks ago • Last reply 3 weeks ago
      1

      • News & Updates

      🚀 Scale your Analytics faster: Unveiling the Sisense "Co-Pilot" Stack for Natural Language Ops (NLOps)

               

      The Agentic Shift in Sisense Environment Management

      For years, Sisense has excelled at helping users answer the critical "What" of their data: the insights and analytics. However, as environments scale, the "How" of management (governance, migrations, and schema optimization) has remained a technical, manual task.

      The Sisense Field Engineering team is excited to introduce the experimental Sisense "Co-Pilot" Stack to help you manage the mundane tasks in Sisense.

      Meta-Management: The Operational Distinction

      It is crucial to differentiate this stack from the official Sisense AI Assistant. While the native product AI focuses on data analysis (e.g., “What was our revenue last quarter?”), the “Co-Pilot” Stack is specifically engineered for Meta-Management. It empowers users to manage their Sisense environment with AI-driven autonomy and "as-code" precision. This agent handles the underlying infrastructure, executing tasks like:

      • Auditing unused fields
      • Migrating dashboards
      • Orchestrating complex workflows across different environments

      This shifts the focus from the data in the charts to the operational engine that powers the entire Sisense ecosystem.

      🛠️ The Co-Pilot Stack: Three Modular Solutions

      The "Co-Pilot" Stack is designed for synergy but consists of three independent components. Each module can be used as a standalone solution to address specific environmental challenges, depending on your technical requirements:

      1. 🐍 PySisense SDK: The "As-Code" Co-Pilot
      This is an independent, high-performance Python SDK. It enables developers to tackle management complexities through direct scripting, moving beyond manual REST API calls to embrace standardized, repeatable "Platform-as-Code" automation.
      Best for: Developers and engineers automating Sisense asset management, programmatic governance, and large-scale environment orchestration via code.
      Read the Deep Dive: Mastering Programmable Environment Management ➡️

      2. 🌐 Sisense Meta-Management MCP Server: The Universal Bridge
      The Meta-Management MCP Server is a standalone component that sits in front of your Sisense environment. Under the hood, it uses the PySisense SDK to expose environment operations as “AI-ready tools,” so external AI agents like Claude Desktop can run governance, migration, and admin workflows directly, without you having to build a custom user interface.
      Best for: Teams that use a central AI orchestrator to coordinate Sisense operations alongside other enterprise systems and tools.
      Read the Deep Dive: Connecting Sisense to the Global AI Ecosystem ➡️

      3. 🤖 FES Assistant: The Agentic Sisense Co-Pilot
      A full, turnkey AI application that delivers a chatbot-style experience. It brings the SDK and MCP server together into one conversational UI, so you can manage your Sisense environment (build models, migrate assets, and run admin tasks) just by chatting with your AI sidekick.
      Best for: Admins, Data Designers, and Dashboard Designers who want to move fast and automate workflows without writing code.
      Read the Deep Dive: Chatting with your Infrastructure ➡️

      👥 Empowering Every Persona

      The “Co-Pilot” stack is designed for everyone in the Sisense ecosystem, not just administrators:

      • 📈 For Dashboard Designers: Locate assets instantly, validate filters and formulas, and run quick environment checks without clicking through menus.
      • 🏗️ For Data Designers: Audit models by identifying unused fields, validating joins and M2M relationships, and catching issues early using natural language.
      • 🛡️ For Admins: Execute cross-tenant/environment migrations, bulk governance, and ownership/permission changes with built-in safety checks and approval loops.

      🤝 Support and Contributing

      This is an experimental, community-contributed project maintained by Sisense Field Engineering and provided “as-is” (not a GA Sisense feature).

      Support:
      • Do not open a GSS ticket. This project is not supported through standard Sisense Customer Support.
      • For installation help, usage questions, or issues, contact your Customer Success Manager (CSM). Your CSM will route requests and feedback to the appropriate Field Engineering contact.
      • Community users should report bugs via GitHub Issues and include logs plus clear reproduction steps.

      Contributing:
      • Feature requests and improvements are welcome. Use GitHub Issues to propose ideas and report gaps.
      • Submit Pull Requests (PRs) for fixes, enhancements, or documentation updates.
      • Share feedback and learnings through the community resources linked above to help guide future iterations.

      Appendix: Full Disclaimer and Security Notes

      Community-Contributed Tool from Sisense Field Engineering
      This project is an experimental tool developed by Sisense Field Engineering to facilitate customer learning and exploration of Sisense capabilities. While maintained by Field Engineering, it is shared "as-is" to encourage feedback and experimentation.

      Important Disclaimer: This tool is not part of the core Sisense product release lifecycle and does not undergo the same validation, support, or certification processes as generally available (GA) Sisense features. It is intended to complement, not replace, officially supported Sisense features.

      Technical & Security Considerations

      Deployment & Execution Control:
      • Local SDK Usage (PySisense): All processing logic runs locally on your machine or server. No data is transmitted to Sisense Field Engineering.
      • Self-hosted Components (FES Assistant / MCP Server): These components are designed for deployment within your own environment (on-prem or VPC). You maintain complete control over infrastructure, security configuration, access controls, and logs.

      Data & LLM Handling:
      • LLM Feature Status: The FES Assistant summarization feature is disabled by default.
      • Data Transmission: When the summarization feature is enabled, responses retrieved via the Sisense SDK may be sent to your chosen Large Language Model (LLM) provider for processing.
      • Third-Party Clients: When using the MCP Server with third-party clients (e.g., IDE agents or desktop assistants like Claude Desktop), data retrieved from Sisense is passed directly to the client’s LLM.
      • Customer Responsibility: Customers are responsible for selecting an LLM provider that meets their organization’s data privacy and security requirements.

      Recommended Usage Guidelines

      To ensure secure and effective use of this experimental tool:
      • Environment: Use the tool primarily in sandbox or non-production environments.
      • Access: Use a dedicated Sisense service account with limited privileges.
      • Validation: Thoroughly review and validate the tool's behavior before any broader adoption within your organization.

      Amanda Hammar
      Posted 4 months ago
      0

      • PySisense

      🐍 PySisense: Programmable Sisense Environment Management

                               

      ⚠️ Experimental Project Notice
      PySisense is an experimental, community-contributed Python SDK from Sisense Field Engineering, shared “as-is” for learning and experimentation. It is not a GA Sisense feature and is not supported through standard Sisense Customer Support. PySisense runs locally in your environment. You control where it runs, which credentials it uses, and what permissions it has. Use a sandbox first and validate behavior before broader adoption.

      PySisense is a Python SDK designed for structured, repeatable interaction with the Sisense API. It turns common “environment work” (governance, migrations, lifecycle management, and audits) into clean Python methods so teams can manage Sisense like a platform, not a set of manual clicks and one-off scripts.

      GitHub Repository: PySisense

      PySisense's Purpose

      As Sisense environments scale, operational work becomes harder to manage in three common areas:
      • Efficient Governance: Reduce repetitive audits across large numbers of assets and folders.
      • Smooth and Reliable Migrations: Promote assets across dev, staging, and production with speed and consistency.
      • Straightforward Workflow Development: Avoid stitching together multiple API calls and handling pagination, retries, and permission edge cases manually.

      PySisense was created to remove that friction by packaging real-world Sisense admin workflows into a consistent SDK that is easy to automate, debug, and extend.

      PySisense Capabilities

      1) Environment and asset management at scale
      Use Python to search, inventory, and manage Sisense assets and metadata without writing raw REST calls each time. Examples of the type of work PySisense supports:
      • Finding dashboards and folders quickly by name or metadata patterns
      • Auditing ownership, access, and sharing posture
      • Standardizing repeatable admin workflows across environments

      2) Cross-environment migrations (dev to prod and beyond)
      PySisense supports repeatable migrations where you can move users, dashboards, and models across Sisense environments with predictable behavior and structured outputs. This is particularly useful when teams need:
      • Controlled promotion from dev to prod
      • Tenant-to-tenant migration and consolidation
      • Repeatable rollout patterns across multiple environments

      3) Governance and bulk operations with consistent logging
      Operational actions are only useful if they are traceable. PySisense is built for automation with:
      • Structured logging (debug/info/error)
      • Repeatable method interfaces and return shapes
      • A design that is friendly for CI/CD and scripted execution

      4) WellCheck: Visibility into complexity and risk
      PySisense includes a WellCheck module that helps you evaluate dashboard and model health, focusing on complexity and patterns that create operational risk. This can include checks such as:
      • Dashboard structural complexity (widget density, pivot usage, etc.)
      • Data model patterns (many-to-many relationships, island tables, RLS datatype checks, import query patterns)
      • Unused columns and other common cleanup opportunities

      Intended Audience for PySisense
      • Sisense platform owners who need repeatable governance and operational control
      • Analytics engineering teams managing multiple environments or tenants
      • Developers automating Sisense workflows through scripts and pipelines
      • Advanced Sisense users who prefer “as-code” control over manual UI-heavy operations

      Organization of the SDK

      PySisense follows a class-based structure to keep workflows clear and modular:
      • SisenseClient: Base API wrapper for consistent HTTP operations
      • AccessManagement: Users, groups, roles, and permissions workflows
      • Dashboard: Dashboard lifecycle tasks including shares and ownership patterns
      • DataModel: Dataset/schema/security operations and model-level tasks
      • Migration: Cross-environment workflows (users, dashboards, models)
      • WellCheck: Health checks for dashboards and data models
      • Utils: Helpers for export, formatting, and data operations

      Contributing and Support

      This is an experimental, community-contributed project maintained by Sisense Field Engineering and provided “as-is.”
      • Do not open a GSS ticket (this is not a GA Sisense feature).
      • For usage questions or help getting started, contact your Customer Success Manager (CSM), who will route feedback to the Field Engineering team.
      • For bugs and improvements, use GitHub Issues or submit a Pull Request.
      • For feature requests, comment below or open a GitHub Issue with details.

      Himanshu Negi
      Posted 4 months ago
      0


      Help and How-To

               
      Chris Wallingford
      Posted 5 months ago • Last reply 5 months ago
      Can Explanations be Relevant?
                       

      We are interested in rolling out Explanations to our clients, but I've hit a block and need some guidance. I think that the source of the issue may be our use of null replacement values in our data models. We don't allow data to join out of a query and instead ensure that all fact data joins to a null-replacement row in our dimensional tables.

      For example, given a fact table of product PURCHASES, imagine it was possible to sell those products at a discount in some orders. Assume that the fact table then joins to a dimension table of DISCOUNTS. Most purchases do not have a discount, but some do. Without a null-replacement row in the DISCOUNTS table, adding a field from that dimensional table onto a widget will limit the results in the widget to only fact rows with a discount. In our models, we would replace the null discount key in the fact table with a value that joins to a null-replacement row in the DISCOUNTS table. For example, replace null with -999999 in the fact table and have a row in DISCOUNTS with [id] = -999999 and a [description] = "(none)".

      The problem we're running into with Explanations is that the feature is presenting only the most useless information as being the most probable explanations. For example, when a single discount exists in the data set, then 99.9% of the PURCHASE rows have a "(none)" DISCOUNT, which is deemed the highest-scoring field. More generally, it seems that Explanations just reports as the most probable explanations those fields whose members are most commonly present in the data anyway, so it's just reflecting the raw change in fact data.

      For example, here's an image where the drop in revenue is explained by the drop in revenue from web sales (BRestAPI). But web sales constitute a larger overall proportion of sales after the drop (89%, up from 82% in the prior period).

      And another example, in which an "Initiator Specified Flag" indicates whether or not an alternate order "owner" has been specified on an order. Specifying an alternate owner is allowed ("Y"), but not a common practice ("N"). Again, we're just seeing a reflection of sales in the explanation.

      Is there anything we can do to make the explanations discovered by Sisense relevant? The way it's functioning now, it would be better if Sisense DID NOT recommend possible explanations and just let the user Explore Other Fields immediately without having to wait for Sisense to return the list of completely irrelevant fields. I've tested this across many of our clients' data sets and found the same behavior. Hopefully there are options for us to improve the recommendations or maybe prevent it from recommending.
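
      For readers unfamiliar with the null-replacement pattern described in this post, here is a minimal sketch of it. It uses Python's built-in sqlite3 module rather than a Sisense data model, and the table contents and the "10% off" discount are made up for illustration; only -999999 and "(none)" come from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE discounts (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE purchases (purchase_id INTEGER, discount_id INTEGER, revenue REAL);

    INSERT INTO discounts VALUES (1, '10% off');
    -- Four purchases; only one carries a discount, the rest have a NULL key.
    INSERT INTO purchases VALUES (1, NULL, 100), (2, NULL, 80), (3, 1, 90), (4, NULL, 120);
    """
)

JOIN_QUERY = """
    SELECT d.description, COUNT(*) AS purchase_count
    FROM purchases AS p
    JOIN discounts AS d ON d.id = p.discount_id
    GROUP BY d.description
    ORDER BY d.description
"""

# Without a null-replacement row, the inner join drops undiscounted purchases.
before = conn.execute(JOIN_QUERY).fetchall()

# Null replacement: add a placeholder dimension row and point NULL keys at it.
conn.execute("INSERT INTO discounts VALUES (-999999, '(none)')")
conn.execute("UPDATE purchases SET discount_id = -999999 WHERE discount_id IS NULL")
after = conn.execute(JOIN_QUERY).fetchall()

print("before:", before)
print("after: ", after)
```

      After the replacement every purchase survives the join, and "(none)" becomes by far the most common member of the discount field, which is the situation the post describes.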

                                             
      2
               
      • Use Case Gallery

      FAQ-style chatbot with BloX: use case of AI Assistant

                                                       

      Introduction

      While Sisense AI features (Simply Ask and the newer Dashboard Assistant) support free-text questions, outcomes can vary depending on factors such as data model quality, business terminology, and user familiarity. In practice, this can result in inconsistent questions, ambiguous phrasing, or less predictable results, especially for less technical users or in environments with less than ideal data models.

      This use case focuses on how Sisense BloX was used to create a guided FAQ-style interface that triggers the AI chatbot automatically, providing a more controlled, consistent, and user-friendly experience. This solution was implemented for a financial technology company to support users with a wide range of recurring business questions related to multi-asset trading and order management.

      What the solution does

      This solution uses BloX to create a guided AI chatbot experience. Instead of typing questions manually, users select a question from a dropdown of predefined FAQs and submit it with a button click. BloX then automatically opens the AI chat window (Simply Ask or Dashboard Assistant), populates the question, and submits it to the chatbot.

      Questions can be defined directly in the BloX code or sourced dynamically from a data model, which allows the team to manage and update the list of supported questions over time. From the user’s perspective, the experience feels like interacting with an FAQ. Under the hood, the AI chatbot handles the analysis and response.

      Why it’s useful

      Lower barrier to entry for AI features
      By guiding users through predefined, curated questions, the solution reduces ambiguity and removes the need to worry about phrasing, terminology, or syntax. This results in more consistent, predictable, and accurate answers, making AI insights accessible to a broader audience, including users with varying technical backgrounds and less mature or optimized data models.

      Fewer widgets and dashboards to maintain
      Not every user needs answers to every possible question. By centralizing common questions into a single guided AI experience, the team avoids creating and maintaining excessive widgets and dashboards for individual analysis, improving performance and reducing long-term maintenance effort.

      Attachments
      • FAQswithSimplyAskOrAIAssistant.dash.txt (example dashboard using the Sample ECommerce cube)
      • BloXActionsForAI-FAQs.zip (BloX actions' scripts)
      • BloXTemplatesForAI-FAQs.zip (BloX templates for the FAQ widgets, also included in the .dash file above)

      Note: Remove the .txt extension before importing the dashboard (.dash) file.

      Tri Anthony
      Posted 5 months ago
      0

      • News & Updates

      Have you heard about Sisense Intelligence?

                                       

      AI that builds with you: Meet Sisense Intelligence

      If you’re an app builder or product manager embedding analytics into your products, you know today’s users expect more: intuitive insights, smart visualizations, and fast answers, all without leaving the product experience. That’s where Sisense Intelligence comes in.

      Sisense Intelligence is our new suite of AI-powered capabilities designed to accelerate every stage of the analytics journey, from development to insight delivery, all within the Sisense platform. Whether you're a product leader, developer, or data expert, these features are built to help you create seamless, intelligent analytics experiences at scale.

      What’s inside Sisense Intelligence?

      A unified framework of powerful tools, including:
      • Assistant: A conversational interface for building dashboards and exploring data with natural language.
      • Narrative: Auto-generated summaries that highlight key takeaways from charts and widgets.
      • Forecast & Trend: Tools to spot patterns and predict what’s ahead.
      • Explanation: Pinpoint drivers of change across key metrics.

      These features are connected by a common goal: to help builders move faster, deliver smarter, and create product experiences users love.

      Want to go deeper?

      Join us for a live webinar on June 5 at 11:00 AM ET: Register now → Build with AI: What’s new (and what’s next) in the Sisense platform

      See how AI-powered analytics can accelerate your product strategy: schedule a demo. If you're an existing Sisense customer, reach out to your Customer Success Manager.

      Can't wait? Watch this 90-second video highlighting our newest AI capabilities.

      Visit trust.sisense.com for security details.

      Community_Admin
      Posted 1 year ago
      0

      • News & Updates

      New academy content for administrators!

                                                                                       

      We are pleased to announce the launch of 45 ALL-NEW COURSES that are AVAILABLE NOW for Administrators! These courses will help Admins of all skill levels by expanding their knowledge, providing hands-on opportunities, and covering all of our LATEST features and best practices (approximately 6 to 7 hours of content and hands-on practice).

      Our new courses are based on brand-new data and assets, divided into microlearning units. They are interactive and accessible! (Yes, yes, we finally have captions!)

      To get access to the new Administrators learning path, CLICK HERE and then click on the blue Get Started button to register. If you are new to the Sisense Academy, I encourage you to make an account and sign up for courses based on your role.

      This is just the beginning of new content releases in Sisense Academy. We are working hard behind the scenes on the next set, and we look forward to sharing more with you soon!

      This is a continuation of the Academy Content Refresh. During January 2025, we launched 25 ALL-NEW COURSES that are AVAILABLE NOW for Data Designers, and during March 2025, we launched 15 ALL-NEW COURSES that are AVAILABLE NOW for Dashboard Designers.

      I hope to see you all in the Academy!

      iyyar_sg
      Posted 1 year ago
      0

      • News & Updates

      Data prep essentials for AI-driven analytics - part 3

                               

      This is Part 3 of a multi-part series about Data Preparation for AI-driven Analytics.

      We can agree that it's been said enough that data quality is important. And we don’t need to explain to you why feeding your AI model poor-quality training data, and then validating it with more poor-quality data, is a bad idea. What matters is knowing how to spot common data issues, how to fix them, and how to prevent them from happening again.

      Preparing data for AI teaching and validation processes requires addressing several common issues to ensure data quality and reliability

      In Part 1 and Part 2 of our Data Preparation Essentials series, we covered why clean, transformed, and enriched data is critical for AI success, and how proper training, testing, and refinement lead to better model accuracy. In this post, we’ll look at the most common data quality challenges and share practical steps to address them:
      • Data fields that are empty (NULL), missing, or undefined, leading to gaps in analysis
      • Duplicated data records that can skew results or inflate counts
      • Values stored in the incorrect data types, causing errors in processing
      • Unusually high or low values that may distort trends or averages
      • Variations in how data is presented and formatted, reducing data reliability
      • Whether you build or borrow your datasets, validation is necessary

      We’ll also provide some helpful code samples for both SQL and Python to help get you started. When data quality is a priority, AI-driven analytics perform better and deliver more reliable insights.

      Identify missing values to ensure unbiased analyses, accurate models, and maximum statistical power

      Missing values, often represented as NULLs, are a common issue in datasets. They can result from entry errors, incomplete data collection, or system failures. If a large portion of a dataset contains missing values, it can skew the results of statistical tests or machine learning algorithms, leading to misleading conclusions. Additionally, missing values complicate data preprocessing, often requiring extra steps to address them properly.

      Where to start:
      • Assess the extent and pattern of missingness using visualizations (e.g., heatmaps, missing value matrices) and summary statistics to understand how much data is missing and whether it follows any patterns.
      • Choose an appropriate handling strategy, depending on the context, to remove, impute, or flag values.
      • Document your approach to handling missing data to ensure transparency and reproducibility in analysis or modeling.

      The simplest approach is to run a query to identify where there are missing (NULL) values. You can then replace each identified NULL with a single value. Depending on the type of field, you can replace strings with specific text, numbers with 0 or another single value, and dates with the current date or a copy of the entry's creation date.

      In some cases, when N/A is a product of a faulty data copy/creation process, you can use mechanisms like JOIN or LOOKUP to retrieve missing values from your source systems.

      After handling missing or default values, it’s important to check how common they are in a given column. If a large percentage of the values (say, 30% or more) are missing or set to a default like 0, the column may be too statistically skewed to be useful for AI or analytics. Running a simple query to calculate this percentage can help you decide whether the column is a good candidate for modeling.

      Dedupe datasets to ensure accurate counts, achieve unbiased summaries, and avoid overfitting your models

      Duplicates are unintended multiple entries of the same data point in a dataset. They can arise from data entry errors, merging datasets without proper checks, or system glitches. Datasets that contain multiple identical records can artificially inflate totals, distort summary statistics, and even lead to overfitting in machine learning models. Duplicates also pose problems during data integration, where unique identifiers are essential for accurately merging or joining records.

      Where to start:
      • Identify and review duplicates using tools to detect and investigate repeated records.
      • Remove or consolidate duplicates by dropping exact matches or merging partial ones with grouping and aggregation. Since the duplicate rows are identical, don’t stress about which one to keep: use MIN, ROW_NUMBER, or drop_duplicates() to your advantage!
      • Prevent duplicates at the source through validation rules, unique constraints, or deduplication in data pipelines.

      Correcting data types can reduce errors in data processing, analysis, and visualization

      Incorrect data types occur when data is ingested and stored in a format that does not match its intended use, caused by improper data entry, incorrect data import settings, or lack of data validation. Common examples are dates stored as strings or numerical values stored as text. These mismatches can cause issues, such as errors when performing calculations or unexpected results during analysis. They can also slow down database operations and data processing, leading to inefficiencies and higher computational costs.

      Where to start:
      • Audit and validate data types using tools or code to ensure each column matches expected formats.
      • Convert columns to the correct types with functions like astype() or pd.to_datetime() while handling errors.
      • Standardize data entry and ingestion to prevent incorrect types from entering the dataset in the first place.

      Effectively managing outliers can lead to cleaner data and more reliable outcomes

      Outliers are data points that differ significantly from the rest of a dataset. They can occur naturally due to normal variability, or they may result from errors in data collection or entry, like measurement mistakes or incorrect input. Even a single outlier can have a large impact. It can distort key metrics like the mean and standard deviation, which may lead to misleading conclusions. Outliers can complicate key steps in data preparation, such as normalization, scaling, and feature engineering. If left unaddressed, they can distort the range of values, reduce the effectiveness of algorithms like k-means or linear regression, and lead to biased or unstable model performance.

      Where to start:
      • Detect outliers using statistical methods like IQR, Z-scores, or visual tools such as box plots and scatter plots.
      • Handle outliers based on context by removing, capping, transforming, or treating them as a separate category.
      • Validate and document your approach to ensure transparency and account for whether outliers are errors or meaningful data.

      Create consistent formatting to make aggregation and analysis more accurate

      Inconsistent formatting happens when the same type of data is represented in different ways within a dataset. Common examples include mismatched date formats, inconsistent capitalization, or unexpected special characters. These types of inconsistencies seem trivial but make it harder to analyze or combine data accurately and often lead to errors or extra cleanup work. They can also disrupt data integration processes, where consistent formatting is essential for correctly merging or joining datasets.

      Where to start:
      • Audit your data fields to identify inconsistent formats, so you know exactly what needs to be fixed and where issues exist.
      • Define and document standard formats for each data type, ensuring consistency across your dataset and setting clear rules for future data entries.
      • Create and apply transformation rules to standardize values, enabling reliable sorting, filtering, and AI-ready analysis.
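
      As one illustration of transformation rules that standardize values, here is a minimal sketch using only the Python standard library. It is not one of the downloadable samples the post refers to; the list of date layouts, the function names, and the choice to read "03/05/2024" as month-first are all assumptions made for the example.

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of date layouts found during an audit, tried in order.
# Assumption: slash-separated dates are month-first (US style).
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def standardize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no known layout matches."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review instead of guessing


def standardize_label(raw: str) -> str:
    """Trim, collapse internal whitespace, and apply one capitalization rule."""
    return " ".join(raw.split()).title()


print([standardize_date(d) for d in ["2024-03-05", "03/05/2024", "5 Mar 2024", "soon"]])
print([standardize_label(s) for s in ["  new   york", "NEW YORK", "New York "]])
```

      Returning None for unrecognized values, rather than guessing, keeps the rule set auditable: anything the rules cannot handle shows up as a gap to review.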
      Build or borrow your datasets, but always validate

      If you’re creating your own dataset, everything discussed in this blog post (handling NULLs, deduplication, correcting data types, managing outliers, and ensuring consistency) applies from start to finish. You're responsible not just for the structure of your data but also for its completeness and integrity.

      On the other hand, using a pre-existing dataset can save time and reduce initial effort, especially if it comes from a trusted source or has been curated for similar AI or analytics use cases. That said, whether you build or borrow, validation is essential. Remember to keep your testing datasets and validation datasets separate!

      Depending on your project, sourcing data may get you moving faster, but public datasets can still contain bias, outdated records, or inconsistent formats. If you're sourcing data externally:
      • Look for accompanying documentation, licensing details, and data dictionaries that clarify how the data was collected and maintained.
      • Use profiling tools to understand the shape and distribution of the data.
      • Always run checks for missing values, unexpected types, or anomalies before feeding it into an AI model.

      Even “clean” data deserves a second look: assume nothing, validate everything.

      When it comes to tooling, there’s no single solution, but there are clear patterns. The key is consistency. Pick tools that help you catch problems early, document fixes, and apply the same rules across your data pipeline.
      • Core data cleansing can be achieved using flexible, low-level options like SQL and Python (especially with libraries like Pandas and NumPy).
      • Need to validate? Use tools like Great Expectations, Soda, and Pandera to help you define and enforce data quality rules.
      • Store and govern data using modern platforms like Snowflake, BigQuery, and Delta Lake, which support schema validation and versioning out of the box.

      Download code samples for all the steps above here. Samples are provided in both SQL and Python.

      Ensuring your data is AI-ready means more than just removing NULLs or fixing the occasional typo. It requires a consistent, structured approach to cleaning and validating your data. There’s a reason these are considered “common problems”: any organization working with data will run into them sooner or later, and not always because of manual errors. In many cases, these issues arise from changes in upstream systems, unanticipated dependencies, or well-intentioned updates that have unintended consequences. Someone shuts off a system without realizing it feeds other processes. Someone makes a formatting change that breaks your formulas. It happens.

      Proactively preparing your data helps reduce friction later in the AI development pipeline:
      • Audit your datasets regularly
      • Apply consistent formatting and business logic rules
      • Use validation queries and basic data profiling techniques
      • Standardize inputs and create clear documentation

      While having the right monitoring in place helps, the reality is that many problems are still flagged by humans first: someone noticing that “something feels off.” Even with the best preparation, data issues will still happen. That’s why it’s so important to know where to look and what to look for:
      • Know where issues are most likely to crop up (e.g., manual entry fields, API feeds, third-party integrations)
      • Set up anomaly detection or alerts around critical metrics
      • Build a habit of root cause analysis: don’t just fix symptoms
      • Use automated checks for common failure points

      Data prep might not be the flashiest part of the AI workflow, but it's the foundation on which everything else is built. Your future AI models, and your future self, will thank you.
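
      To make "automated checks for common failure points" concrete, here is a minimal sketch of three checks discussed in this post: the share of missing values per column (against the 30% rule of thumb), exact duplicate rows, and IQR-based outliers. It uses only the Python standard library and is not one of the downloadable samples; the sample rows, function names, and the 1.5 × IQR cutoff are assumptions made for illustration.

```python
from statistics import quantiles
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

MISSING_THRESHOLD = 0.30  # the "30% or more" rule of thumb from the post


def missing_share(rows: List[Row], column: str) -> float:
    """Fraction of rows where the column is missing (None)."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def duplicate_count(rows: List[Row]) -> int:
    """Number of rows that exactly repeat an earlier row."""
    seen = set()
    repeats = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return repeats


def iqr_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


# Made-up sample data: one exact duplicate, one suspicious amount, sparse discounts.
rows: List[Row] = [
    {"order_id": 1, "amount": 20.0, "discount": None},
    {"order_id": 2, "amount": 22.0, "discount": None},
    {"order_id": 3, "amount": 19.0, "discount": 5.0},
    {"order_id": 3, "amount": 19.0, "discount": 5.0},
    {"order_id": 4, "amount": 21.0, "discount": None},
    {"order_id": 5, "amount": 500.0, "discount": 2.0},
    {"order_id": 6, "amount": 23.0, "discount": 1.0},
    {"order_id": 7, "amount": 18.0, "discount": 3.0},
]

for column in ("amount", "discount"):
    share = missing_share(rows, column)
    flag = "review" if share >= MISSING_THRESHOLD else "ok"
    print(f"{column}: {share:.1%} missing ({flag})")

print("duplicate rows:", duplicate_count(rows))

amounts = [row["amount"] for row in rows if row["amount"] is not None]
print("IQR outliers in amount:", iqr_outliers(amounts))
```

      Checks like these are cheap enough to run on every load, so they can flag a column or record for review before a person has to notice that something feels off.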

      Mia Isaacson
      Posted 1 year ago
      0