Sisense Data Pipeline Best Practices

Architecturally,

It is important to understand the role of Sisense in your use case and where it will fit best in your data architecture from a high-level perspective. Sisense is an excellent visualization tool that supports use cases for cached, live, and hybrid data pipelines. The focus of this article is to explain how to manage cached data to optimize Sisense’s performance, quality, and availability.

Sisense allows you to design custom solutions specific to your business requirements with a large library of ETL tools and data source connectors. These are important features that make Sisense an industry leader. With great power comes great responsibility; care must be taken with the power that Sisense is giving you.

Before delivering data to Sisense, it is recommended you carefully examine your data pipeline. Having clean, consistent data transformed at the source will improve your build performance, data quality, and data consistency across endpoints. Complex transformations should be performed at the source or in an intermediary integration layer. An integration layer (between source and endpoints) allows you the capability to persist historical data, implement business rules (transformations), and perform data governance (cleansing).

Large data sets, complex transformations, and large numbers of sources directly impact cache build performance. Sisense’s ability to connect to such a large variety of sources makes it flexible enough to be leveraged for complex data architectures with highly diverse assets. By consolidating your data in an integration layer, it decreases cost and improves performance. You gain the ability to maintain a minimal dataset at your endpoints, minimize the number of data sources directly connected to Sisense, and shape the schema of the data to fit the schema of the consumption endpoint.

By persisting your data in an integration layer you gain the ability to maintain a minimal dataset at your endpoints, audit incoming updates, retain historical data, implement quality controls, and standardize business rule enforcement. An integration layer can consist of multiple data stores based on specific use cases and business requirements. Typical data stores that reside in an integration layer include staging, historical, transforming, and aggregation.

Staging

The ETL process to load data into an integration layer typically uses a temporary staging area where incoming data is checked to assure it matches the expected data types/schema. The staging area may be stored in memory or on a disk depending on process design, source, target, and implementation tools. In data integration processes, the most common failures are related to unexpected changes in the source data types/schema. The staging portion of the ETL process should have robust error handling/notification features to assure that these issues are handled quickly to allow the data flow to continue.

While having the staging area on a disk does make production support easier (you can examine the result easily), you may initially implement an in-memory process and add persistence to the disk as needed (probably for your most challenging sources) in a future sprint. You may also find that for your data source/type the IO incurred by persisting to disk is too high regardless of the easier support opportunities.

Historical

If you have historical data retention or change audit requirements, the data would move from staging into the historical store. The persistent historical store should contain all data received from all data sources based on the data retention requirements of the business for the relative data domain/source. Data types, schemas, and values should be preserved as received in order to assure compliance and simplify troubleshooting for production support issues. Having a persistent historical store also allows consumption endpoints to retain only the minimal data set (preserving resources) for normal operations while still having the ability to reload history or perform advanced queries against a complete historical set.

Transformed

The transformed store is used to implement conformed data models, data governance, and business rules. Implementing transformations to conform and cleanse data for all consumers avoids data inconsistency across consumption endpoints. Development and tracing of data lineages are simplified by supplying a conformed model with standardized naming conventions.

Aggregated

When needed, the aggregated store can be used to pre-aggregate common data sets depending on endpoint granularity requirements.

Interface

Depending on the source database platform, materialized and/or non-materialized views are used to interface to consumption endpoints. Using these views as an interface allows management of the physical tables without system breakage when change happens. Using these interfaces, complex operations (joins, groups, predicates) can be performed in a native database environment. Please note that support for interface mechanisms on sources varies. Please check before you design.

Summary

If you want to decrease costs while increasing performance, quality, and availability, think about implementing an integration layer. By moving processing to its most effective point in your data pipeline you gain the ability to:

Reduce cost
Improve performance
Increase availability
Audit incoming updates
Retain historical data
Implement quality controls
Standardize business rule enforcement
Minimize endpoint datasets size
Minimize Sisense source connections
Shape the source schema to reflect the endpoint schema
Pre-aggregate common data sets

FAQ

Why does Sisense allow me to connect to everything if the best practice is to use an integration layer?

The ElastiCube is a data cache that uses MonetDB as an internal store to enhance dashboard performance. There are many use cases that require direct access to raw data. While Sisense meets this requirement, discretion should be used to be sure your implementation is aligned with your organization’s data management practices.

Are there any platform-specific best practices for supported source database platforms?

Good idea! Comment below to tell us which platforms pique your interest.

What is a “domain-focused” elasticube?

This is an elasticube that is designed to answer a specific set of questions focused on a single business domain. Domain-focused cubes are best implemented against clean consistent data to assure that data shared by multiple elasticubes reflect the same values. Let me know in the comments if you would like to see more information on this topic.

When would you use Sisense to directly query raw data sources for cached data?

If you do not need to save historical data, audit data changes, enforce standard business rules, cleanse the incoming data, or supply the data to multiple consumption endpoints.

What is a data pipeline?

A data pipeline is a series of data processing steps. If the data is not currently loaded into the data platform, then it is ingested at the beginning of the pipeline. Then there are a series of steps in which each step delivers an output that is the input to the next step. This continues until the pipeline is complete. In some cases, independent steps may be run in parallel.