Choosing the Right Data Model
Note: This post has become outdated. You can find current guidance on choosing a data model on our documentation site: https://docs.sisense.com/main/SisenseLinux/choosing-the-right-data-model.htm

Introduction

Customers often ask which data model they should use: an ElastiCube, a Live model, or a Build-to-Destination. This article presents the main aspects to consider when choosing between them. Sisense recommends discussing your needs and requirements with Sisense's technical team during the Jumpstart process, so that the result best meets your business expectations.

Definitions

The ElastiCube Data Model

Importing data into an ElastiCube data model lets you pull data from multiple data sources on demand or on a schedule, and create a single source of truth inside Sisense. The imported data can then be transformed and aggregated to meet your business needs. Once imported, the data snapshot is used to generate analytical information. The import process, known as a "build", includes the following steps:

- Extract the data: query the different data source(s).
- Load the data: write the extracted data into Sisense (the local MonetDB).
- Transform the data: transform the data inside the local MonetDB using SQL queries.

To read more about ElastiCubes, see Introducing ElastiCubes.
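The transform step above is typically written as custom SQL inside the ElastiCube. As a minimal, illustrative sketch only (the Orders table and its columns are hypothetical, and getyear is the same date function used in other articles in this collection):

-- Hypothetical custom table defined during the build's transform step:
-- it cleans and enriches the imported rows before dashboards query them.
SELECT o.OrderID,
       o.CustomerID,
       getyear(o.OrderDate) AS OrderYear,      -- derived attribute for grouping
       o.Quantity * o.UnitPrice AS LineTotal   -- pre-calculated measure
FROM [Orders] o
WHERE o.Quantity > 0                           -- drop invalid rows at build time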
The Live Data Model

Using a Live data model does not require importing data; only the data's schema needs to be defined. Once configured, the analytical information required by the user is queried directly against the backend data source. To read more about Live models, see Introducing Live Models.

Determining Factors

Refresh Rate

One of the most fundamental aspects in determining your data model is your data's refresh rate, that is, the age of the data shown in your dashboards:

- For Live models, the data displayed on your dashboards is near real-time, as every query is passed directly to the backend database. A good example of using a Live model (due to refresh rate requirements) is a dashboard that shows current stock prices.
- For ElastiCubes, the data displayed on your dashboard is current as of the last successful build. Every query is executed against the local database. A good example of using an ElastiCube (due to refresh rate requirements) is a dashboard that shows historical stock prices; in this case, a daily ETL process provides results that are good enough.

To make a choice based on this factor, answer the following questions:

- How frequently do I need to pull new data from the database?
- Do all my widgets require the same data refresh frequency?
- How long does an entire ETL process take?

Data Transformation Options

The ETL process includes a transformation phase, which usually includes:

- Migrating the data tables into a dim-fact schema
- Enriching your data
- Pre-aggregating the data to meet your business needs

The amount of data transformation done in Sisense helps determine the suitable data model:

- For Live models, Sisense allows minimal to no data transformation. Data is not imported before a query is issued from the front end, so it cannot be pre-conditioned or pre-aggregated. Most data sources used by Live models are data warehouses that can perform these preparations themselves.
- For ElastiCubes, data is imported before a query is issued from the front end, so it can be pre-conditioned and pre-aggregated, and you can customize the data model to optimally answer your business questions.

To make a choice based on this factor, answer the following questions:

- Is my data in a fact-dim schema?
- Does my data require enriching or pre-conditioning?
- Can my data be pre-aggregated?

Operational Database Load

Your operational databases do more than serve your analytical system, so any application that loads them should be closely examined:

- For Live models, Sisense constantly queries your operational databases and feeds the results into your dashboard widgets. This occurs every time a user loads a dashboard.
- For ElastiCubes, Sisense puts a heavy load on your operational databases during the ETL process, while it reads all the tables.

To make a choice based on this factor, answer the following questions:

- Does the analytical system stress my operational database(s)?
- Can the query load be avoided by using a database replica?

Operational Database Availability

The availability of your operational database(s) is critical for collecting information for your analytical system:

- For Live models, all queries are redirected to your data sources. If a data source is not available, widgets will display errors and no data.
- For ElastiCubes, data source availability is critical only during the ETL process. If the data source is not available, the data in your widgets will still be available, but not necessarily up to date.

To make a choice based on this factor, answer the following questions:

- How frequently are my analytical data sources offline?
- How critical is my analytical system? Is being offline (or showing out-of-date information) acceptable?

Additional Vendor Costs

Various database vendors use a chargeback model, meaning you are charged by the amount of data you pull from the database or by the computational power required to process it:

- For Live models, every time a user loads a dashboard, each widget triggers (at least) one database query. A combination of a chargeback model and a large user load may result in high costs.
- For ElastiCubes, every time an ETL process is triggered, a large amount of data is queried from the database and loaded into Sisense.

To make a choice based on this factor, answer the following questions:

- How many users use my dashboards, and what is my build frequency?
- Which data model will result in lower costs? What is the tipping point?
- Am I willing to pay more for real-time data?

Database Size

For ElastiCubes, please refer to these documents: Introducing ElastiCubes and Minimum Requirements for Sisense in Linux Environments. For Live models, there is no size limitation, as only the data's schema, not the data itself, is imported into Sisense.

To make a choice based on this factor, answer the following questions:

- How much data do I need in my data model?
- How much history do I need to store?
- Can I reduce the amount of data (for example, by trimming historical data or reducing the number of columns)? See the sketch below for one way to trim history at import time.
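As a minimal sketch of the history-trimming idea (the Orders table and OrderDate column are hypothetical; createdate, getyear, and now are the date functions used elsewhere in this collection), the import query of a fact table can filter out old rows so they never reach the ElastiCube:

-- Hypothetical import query that keeps only recent history, reducing both
-- the ElastiCube size and the build time.
SELECT o.*
FROM [Orders] o
WHERE o.OrderDate >= createdate(getyear(now()) - 2, 1, 1)  -- current plus two previous calendar years

Selecting a narrower column list instead of o.* reduces the size further.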
Query Performance

Query performance depends on the underlying work required to fetch and process the data. Although every widget generates a query, the underlying data model determines the work necessary to execute it.

For ElastiCubes, every query is handled inside Sisense:

1. The client-side widget sends a JAQL query to the Sisense analytical system.
2. The query is translated into SQL syntax and run against the internal database.
3. The query result is transformed back into JAQL syntax and returned to the client side.

For Live models, every query is forwarded to an external database and then processed internally:

1. The client-side widget sends a JAQL query to the Sisense analytical system.
2. The query is translated into SQL syntax and run against the external database.
3. Sisense waits for the query to execute.
4. Once it returns, the query result is transformed back into JAQL syntax and returned to the client side.

To make a choice based on this factor, answer the following questions:

- How sensitive is the client to a delay in the query's result?
- When showing real-time data, is the extra latency acceptable?

Connector Availability

Sisense supports hundreds of data connectors (see Data Connectors). However, not all connectors are available for Live data models, for performance reasons: a slow connector, or one that requires a significant amount of processing, can lead to a bad user experience with Live models (widgets take a long time to load).

- For ElastiCubes, all data connectors are available.
- For Live models, Sisense limits the connectors to a few high-performing ones (including most data warehouses and high-performing databases).

To make a choice based on this factor, answer the following questions:

- Does my data source's connector support both data model types?
- Should I consider moving my data to a different data source to allow live connectivity?

Caching Optimization

Sisense optimizes performance by caching query results: results are stored in memory so they can be retrieved quickly if the same query is re-executed. This provides a great benefit and improves the end-user experience:

- For ElastiCubes, Sisense recycles (caches) query results.
- For Live models, Sisense performs minimal caching to keep the data near real-time. (Caching can be turned off upon request.)

To make a choice based on this factor, answer the following questions:

- Do I want to leverage Sisense's query caching?
- How long do I want to cache data?

Dashboard Design Limitations

Specific formulas (such as Mode and Standard Deviation) and widget types (such as box and whisker plots) may result in heavy database queries:

- For Live models, Sisense limits the use of these functions and visualizations, because they may take a long time to return and cause a bad user experience.
- For ElastiCubes, these functions and visualizations are allowed, as they are processed internally by Sisense.

To make a choice based on this factor, answer the following questions:

- Do I need these functions and visualizations?
- Can I pre-aggregate the data and move these calculations to the data source instead of Sisense? (See the sketch below.)
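As a minimal sketch of that last idea, moving a heavy calculation into the data source before it ever reaches Sisense: the fact_sales table, its columns, and the summary-table name below are hypothetical, and the exact aggregate syntax depends on your source database.

-- Hypothetical summary table built in the source database; dashboards then
-- read this small pre-aggregated table instead of computing a standard
-- deviation over the raw rows at query time.
CREATE TABLE sales_daily_stats AS
SELECT order_date,
       COUNT(*)       AS order_count,
       SUM(amount)    AS total_amount,
       STDDEV(amount) AS amount_stddev
FROM fact_sales
GROUP BY order_date;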
See also: Choosing a Data Strategy for Embedded Self-Service.

How to avoid query caching in live data models

Query caching is a common mechanism used to enhance performance by storing frequently accessed data in memory. However, when data models are continuously updated in real time, caching can lead to outdated data being displayed. This article explains how to bypass query caching in Sisense Live data models so that up-to-date information is always displayed.
Reading multiple Excel files into Sisense data model

Introduction

If you are migrating from Windows to a Sisense Linux-hosted solution: on a Sisense Windows machine, you may have used an SFTP client that moves Excel files from a remote server's folder to a folder on the Sisense Windows server, and then used that folder as the source for importing the Excel files into an ElastiCube. Since this is not possible on a Linux-hosted deployment, you can use the CData SFTP connector to import these files directly.

If your existing connection string returns just the names of the files instead of their contents, check the URI. When connecting to an Excel sheet stored on an SFTP server, the URI must be in the form sftp://<server>:<port>/<path to file>; if the connection string does not contain this, you will only get the file names. Additionally, ConnectionType and AuthScheme must be set to SFTP, and SSHAuthMode must be set to None, Password, or Public_Key, depending on your SFTP server. (An illustrative connection string is sketched below.)

If the issue still persists, a log file is helpful. To generate one, set the log file path in your Excel connection string (such as C:/Logs/log.txt) and set Verbosity to 3, then reproduce the error.
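For illustration only, a full connection string combining the properties mentioned above might look like the following. The server, port, path, and credentials are placeholders, and the exact property names and separators should be verified against the connector's documentation:

ConnectionType=SFTP;AuthScheme=SFTP;SSHAuthMode=Password;
User=myuser;Password=mypassword;
URI=sftp://sftp.example.com:22/reports/Sales.xlsx;
Logfile=C:/Logs/log.txt;Verbosity=3;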
Additional Resources:
Sisense Docs: https://docs.sisense.com/win/SisenseWin/introduction-to-data-sources.htm
Sisense Academy: https://academy.sisense.com/sisense-data-designer-web-application

Linux | ElastiCube Local Files System

The Sisense application uses MonetDB to write raw data into ElastiCubes (commonly referred to as EC or cube). This allows Sisense to query the data from itself rather than sending a query back to the data source every time a dashboard is accessed. The process of importing raw data into a Sisense ElastiCube is called a "build". On Linux, Sisense supports a few different build modes:

- Regular - shared storage
- Build on local
- Build on local with S3 (https://documentation.sisense.com/docs/storing-elasticubes-on-s3)

In each case the build method is different, and so is the way queries are served.

1. Regular - Shared Storage

Regular builds write the data into shared storage under /opt/sisense/storage/farms while building. The name of the specific farm is composed of the ElastiCube name and the timestamp at which the build started. For example, the directory for a Sample ECommerce cube whose build started on September 23rd may look like:

aSampleIAAaECommerce_2022.09.23.07.26.31.585

A new directory (name + timestamp) is created for every build.

Full build: When a full build starts, an empty directory is created and the data is imported into it. If the build fails, this newly created directory is deleted. If it succeeds, a new EC query pod is started and attached to the new location. After a successful startup, the previous query pod of this cube (if one exists) is terminated and the directory attached to it is deleted. These operations are performed by the management service.

Accumulative or schema-changes build: For accumulative builds or schema changes, the original directory is copied before the build starts. Because of this, the first build after accumulative behavior is selected takes more time. After the build finishes successfully and the new cube is loaded, the original cube is patched with the changes, so the full data set does not need to be copied every time. The new cube is activated, and the old cube becomes ready for the next build. Its directory has the same name as the original, with "_next" appended:

aSampleIAAaECommerce_2021.02.23.07.26.31.585_next

When the next accumulative build starts, it uses the _next directory and does not need to copy the entire cube. Note that this doubles the storage size, since both directories remain.

Loading of the cube: When the cube loads, the farm points to /tmp/aSampleIAAaECommerce_2021.02.23.07.26.31.585/dbfarm inside the cube pod, and the files are copied to /opt/sisense/storage/farms/aSampleIAAaECommerce_2021.02.23.07.26.31.585/.

2. Build on Local

Full, accumulative, or schema-changes builds behave the same as in regular mode, with a few changes to the farm locations:

- Builds write the data into local storage under /opt/sisense/local_storage/ while building. Using the Sample ECommerce example: /opt/sisense/local_storage/aSampleIAAaECommerce
- After the build finishes, the cube is copied into shared storage: /opt/sisense/storage/farms/aSampleIAAaECommerce_2021.02.23.07.26.31.585
- The local folder /opt/sisense/local_storage/aSampleIAAaECommerce is then deleted from the server.
- In addition, a compressed file dbfarm.tar.gz is created, containing all files smaller than 65K and all the symbolic links needed to operate the cube. The symbolic links point to the shared storage. This speeds up cube loading.
Creating symbolic links takes time, and fetching many small files from the shared storage can generate high IOPS that overload it; it is more efficient to open one tar.gz file that includes all the links and small files.

When the pod is terminated or killed, the temporary folder /opt/sisense/local_storage/aSampleIAAaECommerce is deleted. If you kill the pod with "kubectl delete pod ec-sample-ECommerce-bld-749dd924-45a0-4-5866b6c47d-nlsq7 --force --grace-period=0", the deletion will not take place and the farm will remain on the server. If the server reboots and /opt/sisense/local_storage is not ephemeral storage, the files will also remain. In AWS, it is recommended to use the local NVMe ephemeral storage that comes with r5d or m4d machines.

Loading of the cube: When the cube loads, the farm points to /opt/sisense/local_storage/aSampleIAAaECommerce_2021.02.23.07.26.31.585-bbe0/. All files from /opt/sisense/storage/farms/aSampleIAAaECommerce_2021.02.23.07.26.31.585/ are copied there. Additionally, the dbfarm.tar.gz generated during the build stage, which contains the small files and the links to the large files, is copied and extracted into the local storage folder. Note that the folders are under /opt/sisense/local_storage, which is a host mapping rather than pod-local storage. The folder name carries a unique identifier (-bbe0) to prevent two EC instances on the same server from stepping on each other.

When the pod is terminated or killed, the temporary folder /opt/sisense/local_storage/aSampleIAAaECommerce_2021.02.23.07.26.31.585-bbe0/ is deleted. If you kill the pod with "kubectl delete pod ec-sample-ECommerce-qry-749dd924-45a0-4-5866b6c47d-nlsq7 --force --grace-period=0", the deletion will not occur and the farm will remain on the server. If the server reboots and /opt/sisense/local_storage is not ephemeral storage, the files will remain. In AWS, it is recommended to use the local NVMe ephemeral storage that comes with r5d or m4d machines.

3. Build on S3

This is the same as building on local, with a few additions. Building happens on the local path /opt/sisense/local_storage/aSampleIAAaECommerce. At the end of the build, all the folders are packed into one big tar.gz file and copied to S3:

s3://sisensebucket/sisense/Default/aSampleIAAaECommerce_2021.02.23.07.26.31.585.tar.gz

The path is determined from s3://<management.S3bucket>/<management.S3path>/<DataGroupName>/<farms>.tar.gz. To set management.S3bucket and management.S3path, use the CLI:

si config set -key management.S3bucket -new-value sisensebucket
si config set -key management.S3path -new-value sisense

After the copy finishes, the local directory is deleted, as in build on local.

Loading of the cube: When the cube loads, the farm points to /opt/sisense/local_storage/aSampleIAAaECommerce_2021.02.23.07.26.31.585-bbe0/. The farm is downloaded from S3 (s3://sisensebucket/sisense/Default/aSampleIAAaECommerce_2021.02.23.07.26.31.585.tar.gz) and extracted on the local file system. Because Sisense does not delete the files from S3 and leaves them to be removed by the S3 lifecycle policy, you may end up with a cube pointing to a file that the lifecycle deletes before you rebuild the cube. The following rule deletes the files after 14 days.
cat <<EOF >expire-s3-rule.json
{
  "Rules": [
    {
      "Expiration": { "Days": 14 },
      "ID": "deleteold",
      "Filter": { "Prefix": "*" },
      "Status": "Enabled",
      "NoncurrentVersionExpiration": { "NoncurrentDays": 14 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration --bucket sisense-shared-s3-storage --lifecycle-configuration file://expire-s3-rule.json

If the archive has already been removed by the lifecycle rule, the download will fail, and you will see it reflected in the log:

download failed: s3://sisensebucket/sisense/Default/aSampleIAAaECommerce_2021.02.23.07.26.31.585.tar.gz to - An error occurred (404) when calling the HeadObject operation: Not Found

In that case the cube will not be loaded. To avoid this, make sure the lifecycle period is longer than the interval between builds of the cube.

When the pod is terminated or killed, the temporary folder /opt/sisense/local_storage/aSampleIAAaECommerce_2021.02.23.07.26.31.585-bbe0/ is deleted. If you kill the pod with "kubectl delete pod ec-sample-ECommerce-qry-749dd924-45a0-4-5866b6c47d-nlsq7 --force --grace-period=0", the deletion will not occur and the farm will remain on the server. If the server reboots and /opt/sisense/local_storage is not ephemeral storage, the files will remain. In AWS, it is recommended to use the local NVMe ephemeral storage that comes with r5d or m4d machines.

Conclusion

This information can be used to understand where raw data is imported to, the purpose of the multiple folders, and how to troubleshoot the build process itself. As always, if you need additional help, please contact Sisense Technical Support for more in-depth assistance.
Alert on a M2M Risk

Introduction

This article describes how to enable an alert* when there is a risk of a many-to-many (M2M) relationship between two or more tables.

* This implementation involves the use of Pulse and is supported in Sisense v6.5 and above.

Motivation

Almost every data model designer who has dealt with more advanced modeling has at some point run into a data or performance issue and, after a long troubleshooting process, discovered that the root cause was an M2M relationship. M2M can occur for the following reasons**:

- Connecting two tables with a pure M2M relationship, such as two fact tables with a non-unique key.
- A unique key becoming non-unique due to changes or data issues in the source system.
- A business need for an M2M relationship between two entities.

As these scenarios show, M2M can appear not only during ElastiCube design but also during its ongoing use. A proactive alert for the ElastiCube designer can therefore prevent users from experiencing poor dashboard performance and wrong result sets, and save many hours of troubleshooting, panicking users, and general frustration.

** There are other, more complex cases of M2M that we won't describe here, as the M2M alert mechanism is less relevant for them. Please refer to this article for a more detailed description of M2M relationships.

Implementation

As a data model designer, both during the design phase and after the final version of my schema is completed, I want to track my connections and make sure that my assumption of uniqueness on one side of each relationship (giving a 1:M relationship) still holds.

Step 1 - Build a M2M Detection Table

We'll create a table that reflects the uniqueness of each table in the final schema that must be unique according to its primary or surrogate key. Basically, we count the total number of rows in each table and the number of unique values of the table's key, and compare them. If they are equal, the key is unique within the table; if not, there is a risk of M2M, and we raise a flag. From the output of this table we can see that, in our case, only one table is not unique (Cost).
Here's the query used to create this table:

SELECT 'DimDates' AS Table, 'Date' AS Key,
       count(t.[Date]) [Total Count],
       DISTINCT_count(t.[Date]) [Distinct Count],
       count(t.[Date])-DISTINCT_count(t.[Date]) AS [Duplicated Records],
       ifint(count(t.[Date])-DISTINCT_count(t.[Date])>0,1,0) AS M2M_Flag
FROM DimDates t

UNION ALL

SELECT 'Customer' AS Table, 'CustomerID' AS Key,
       count(t.CustomerID) [Total Count],
       DISTINCT_count(t.CustomerID) [Distinct Count],
       count(t.CustomerID)-DISTINCT_count(t.CustomerID) AS [Duplicated Records],
       ifint(count(t.CustomerID)-DISTINCT_count(t.CustomerID)>0,1,0) AS M2M_Flag
FROM Customer t

UNION ALL

SELECT 'SalesPerson' AS Table, 'BusinessEntityID' AS Key,
       count(t.BusinessEntityID) [Total Count],
       DISTINCT_count(t.BusinessEntityID) [Distinct Count],
       count(t.BusinessEntityID)-DISTINCT_count(t.BusinessEntityID) AS [Duplicated Records],
       ifint(count(t.BusinessEntityID)-DISTINCT_count(t.BusinessEntityID)>0,1,0) AS M2M_Flag
FROM SalesPerson t

UNION ALL

SELECT 'Product' AS Table, 'ProductID' AS Key,
       count(t.ProductID) [Total Count],
       DISTINCT_count(t.ProductID) [Distinct Count],
       count(t.ProductID)-DISTINCT_count(t.ProductID) AS [Duplicated Records],
       ifint(count(t.ProductID)-DISTINCT_count(t.ProductID)>0,1,0) AS M2M_Flag
FROM Product t

UNION ALL

SELECT 'SalesOrderHeader' AS Table, 'SalesOrderID' AS Key,
       count(t.SalesOrderID) [Total Count],
       DISTINCT_count(t.SalesOrderID) [Distinct Count],
       count(t.SalesOrderID)-DISTINCT_count(t.SalesOrderID) AS [Duplicated Records],
       ifint(count(t.SalesOrderID)-DISTINCT_count(t.SalesOrderID)>0,1,0) AS M2M_Flag
FROM SalesOrderHeader t

--Example for uniqueness by surrogate key
UNION ALL

SELECT 'Cost' AS Table, 'Size + Weight' AS Key,
       count(t2.Key) [Total Count],
       DISTINCT_count(t2.Key) [Distinct Count],
       count(t2.Key)-DISTINCT_count(t2.Key) AS [Duplicated Records],
       ifint(count(t2.Key)-DISTINCT_count(t2.Key)>0,1,0) AS M2M_Flag
FROM (SELECT t.Size + '|' + tostring(t.Weight) AS Key FROM Cost t) t2

--Example for checking uniqueness of a table which is not in the final schema but affects another table (as the source of a lookup custom field)
UNION ALL

SELECT 'Store' AS Table, 'BusinessEntityID' AS Key,
       count(t.BusinessEntityID) [Total Count],
       DISTINCT_count(t.BusinessEntityID) [Distinct Count],
       count(t.BusinessEntityID)-DISTINCT_count(t.BusinessEntityID) AS [Duplicated Records],
       ifint(count(t.BusinessEntityID)-DISTINCT_count(t.BusinessEntityID)>0,1,0) AS M2M_Flag
FROM Store t

Step 2 - Create the relevant KPI within a widget

We'll create an indicator widget with the following formula:

if([Total M2M_Flag]>0,1,0)

This widget can be temporary, as it only serves to alert when the total number of M2M flags is greater than 0.

Step 3 - Add the KPI to Pulse

After adding the widget to Pulse, set the condition to the threshold 'Greater than' 0. Feel free to adjust the notification mode and other preferences according to your requirements.
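Once the alert fires, a natural follow-up is to list which key values are actually duplicated. The following is a minimal sketch for the Customer table above, using the same grouping style as the other queries in this collection; adapt the table and key names to whichever table raised the flag:

--List the specific CustomerID values that appear more than once
SELECT d.CustomerID, d.[Key Count]
FROM (SELECT t.CustomerID, count(t.CustomerID) AS [Key Count]
      FROM Customer t
      GROUP BY t.CustomerID) d
WHERE d.[Key Count] > 1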
Not all dimensions relate to all facts

Analytical Need

In a dashboard, we wish to analyze how our entities (sales, inventory, tickets, etc.) behave and break them down by several categories. However, not all entities share the same categories. We wish to avoid a situation in which selecting a value in a certain category produces illogical or incorrect results due to the random path mechanism (because there is no direct path between each dimension and each fact).

Modeling Challenge

In the image below, getting from DimStatus to ClosedProjects can be achieved via several paths:

Image 1. No direct path between the Dim Status and Closed Projects fact tables. The status dimension and fields are marked in red.

** Random path mechanism (based on Handling Relationship Cycles): When Sisense encounters multiple possible paths (as in the example above), it chooses the shortest path, which has the least performance impact on the query. The logic for choosing the path is as follows:

- Sisense prefers relationships in the following order: 1:1, 1:m, m:m.
- If all possible paths contain many-to-many relationships, the path with the fewest m2m connections is preferred.
- If more than one possible path still exists, the path containing the fewest tables is preferred.
- Finally, if multiple paths are still possible, one is picked at random.

Solution

We will add a fake key to our schema in a three-step solution. This ensures a direct path between the Status and Closed Projects tables.

Step 1: Add a fake key to the Dim Status table.

SQL:

SELECT DISTINCT p.Status FROM [InProgress] p
union
SELECT DISTINCT o.Status FROM [OpenProjects] o
union
SELECT '-1'

The last union (SELECT '-1') adds one more record to this dimension. This is one side of the direct connection to our Closed Projects table. As a rule of thumb, if the key has a description, use a value that makes it clear this is a fake record, such as 'No Status' or 'Fake Status'.

Step 2: Add the fake key to the Closed Projects fact table (either with the example query below or by adding a new custom field with the value -1):

SQL:

SELECT p.*, '-1' AS Status FROM [ClosedProjects] p

Adding this '-1' as a new field provides the other side of the direct connection to the Dim Status table.

Step 3: Connect the dim and fact tables. The schema will look like this:

Image 2. Updated schema. The changes that have been made are marked in red.

Include/exclude options:

- Include: If nothing is selected in the filter (taken from the fake-linked DimStatus), the measure from the connected fact is still calculated. When any value is selected, no results/null appear for the measure. This is the solution described above.
- Exclude: If there is a filter, the measure is never calculated. For this behavior, skip Step 1 in the solution.

Attachments: MultiFactNotAllDims.ecdata (545 KB)
Market Basket Analysis

Analytical Need

Market basket analysis is a data mining technique that is widely used to identify consumer patterns, such as which items are purchased together, or how the purchase of one item affects the likelihood of another item being purchased. One famous example, even though it is almost certainly an urban legend, is the story about beer and diapers: a supermarket chain discovered in its analysis that customers who bought diapers often bought beer as well, so it placed the diapers close to the beer coolers, and sales increased dramatically. Another widely used example of cross-selling on the web with market basket analysis is Amazon.com's "customers who bought book A also bought or took an interest in book B".

Modeling Challenge

In our example, we will analyze the correlation between different products and product categories within a set of orders. (Sample data files, ecdata and dash, are attached below.) We will start by preparing our data model to support this type of analysis and then build a dashboard for our market basket analysis.

Solution

In our example we will use Microsoft's "Northwind" sample database, which holds the sales transactions between the "Northwind Traders" company and its customers, as well as the purchase transactions between Northwind and its suppliers. We will start by demonstrating how to adjust the data model to support market basket analysis, using custom SQL to join a table to a duplicate of itself.

Basic Terminology

- Items are the objects between which we are identifying associations. In our example, an item is a product.
- Transactions are instances of groups of items coexisting together. In our example, each order is a transaction.

Open the Northwind database in the Sisense ElastiCube Manager. The "OrderDetails" table holds the order transactions and the products that were purchased in each order. To analyze whether there are associations between one product or product category and another, we will create a self-join with the following conditions:

- Transaction unique identifier = Transaction unique identifier (in our example, Order ID = Order ID)
- Item <> Item (in our example, Product ID <> Product ID)

Click "Add Data", and then select "Custom SQL Expression". A new window opens; enter the new table name (for example, OrderDetailsForAnalysis) and click the V icon. Enter the self-join script in the table editor window:

SELECT o1.*, o2.ProductID AS Product2ID, o2.Quantity AS Product2Quantity
FROM [Order Details] o1, [Order Details] o2
WHERE o1.OrderID = o2.OrderID
AND o1.ProductID <> o2.ProductID

Preview the results and save the table. A new table is created with no connection to any other table. Connect the table to the same tables that the original table (OrderDetails) was connected to. In our example, join the new "OrderDetailsForAnalysis" table with the "Products" and "Orders" tables.

If the item we are analyzing is part of a hierarchy, as in our case where products can be grouped into a category, and we want to use the hierarchy as part of the analysis, we should do one of the following:

- Bring the hierarchy into our new custom table (add the second product name and category name as part of the SQL), or
- Duplicate the hierarchy dimension table(s) and join them on the second item identifier.

In our example, we choose the second option: duplicate the "Products" and "Category" tables, and join Product2ID from the "OrderDetailsForAnalysis" table with ProductID from the duplicate "Products" table. We can then disconnect the original "OrderDetails" table and make it invisible. Our data model is now ready. Build the cube and open the Sisense dashboard.
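Note that the self-join above produces every product pair twice (A with B and B with A), which is exactly what the pivot in the next section expects. If you ever need each unordered pair counted only once, one illustrative variant is sketched below; it assumes the ElastiCube SQL dialect accepts a less-than comparison in the join condition in the same way it accepts <>:

--Count, per product pair, the number of orders containing both products;
--using "<" instead of "<>" keeps a single row per unordered pair
SELECT o1.ProductID AS ProductA,
       o2.ProductID AS ProductB,
       DISTINCT_count(o1.OrderID) AS OrdersTogether
FROM [Order Details] o1, [Order Details] o2
WHERE o1.OrderID = o2.OrderID
AND o1.ProductID < o2.ProductID
GROUP BY o1.ProductID, o2.ProductID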
Analyze with Reports and Dashboards

You can easily visualize and analyze the correlation between one product and another, or one category and another, by creating a pivot table that displays the number of occurrences per pair. Create a new widget and add the following:

- CategoryName from the first categories table
- CategoryName from the second categories table
- An arithmetic function on a field from the transactions table (for example, the "Count All" function on the "Quantity" field)

Click the advanced configuration at the bottom left, and drag the second category name from the rows area into the columns area.

You can also highlight some of the cells to emphasize important information for the dashboard's consumers, for example, marking the top correlation for each product category. Click the color button of your measure; a new window opens. Click the conditional tab, and set the formula to the maximum of occurrences.

Our product category correlation table is ready. Each number in the table represents the number of individuals who purchased both from the row's product category and from the column's product category. Straight away, you can see that many of the categories are correlated with the beverage category, but beverages are most correlated with confectionery and seafood. Note that since we are analyzing product categories and not the products themselves, an individual can purchase two different products from the same category, which is why we have numbers greater than 0 where the row and column category names are identical.

You can easily create different kinds of reports and charts to support your analysis. Several examples:

- A stacked column chart, which summarizes the contribution of each item (product/category) to another
- A tree map chart per item (product/category)
- A pie chart that displays the correlated product/category distribution and can be affected by product/category filter selections within the dashboard

Attachments (563 KB): Northwind.ecdata, plus an .sdata file (for Sisense Linux) for the Northwind ElastiCube and the CSV files for the source tables.
Data Relationships: Many-to-Many Relationship Resolutions

For general information on types of data relationships, potential complications, and how to resolve them, see the following articles:

- Data Relationships: Introduction to Many-to-Many
- Data Relationships: Check Type of Relationship
- Data Relationships: One-to-Many Relationship
- Data Relationships: The Invisible Many-to-Many
- Data Relationships: Many-to-Many Relationship Resolutions (you are here)

There are several methods to resolve or bypass a many-to-many relationship; the right solution depends on the business model and on the logic of the business questions at hand. The solutions below differ by business logic and by the schema involved, and each can be applied to the matching schema:

Resolutions for two tables, one relationship:
- Two tables, one relationship
- Aggregated table

Possible resolutions for more than two tables, more than one relationship:
- Lookup function
- Concatenate the two tables into one

Resolutions for two tables, one relationship

Two Tables, One Relationship

The direct solution for a many-to-many relationship is to break it into two separate one-to-many relationships, as seen in the image below. The logic behind testing this issue can be visualized in the decision tree at the end of this article. Follow these steps to resolve the many-to-many this way:

Create a custom dimension by using a custom SQL expression in the ElastiCube. In this table's expression, select all the individual values of the identifier column from both sides; the expression should look like this:

SELECT *
FROM (SELECT DISTINCT r.GuestID, r.GuestName FROM [Reservations] r
      UNION
      SELECT DISTINCT p.GuestID, p.GuestName FROM [Payments] p) AS G

This query takes all GuestID values from both tables and, using the UNION statement, keeps only the unique values, making this a complete list of all distinct GuestID values. Merge the GuestID field from the new 'linking' table to the two GuestID fields in the other two tables, thus creating two one-to-many relationships. You can now use this GuestID field as the rows or axes element of a widget, pulling the unique values from the new Guest dimension with measures from the two other tables.

Image 1. Custom Common Dimension

Aggregated Table

When we have more than one fact table in our ElastiCube (a fact table is a primary table containing the measures or fields used for calculations in the dashboard), there are several situations in which an aggregated table can resolve a many-to-many relationship.

Image 2. Incorrect Connections

Assuming we'd like to segment our data according to a few different dimensions, creating relationships directly between these fields can and will create many-to-many relationships in one of two ways, according to the schema:

1. Both tables don't hold unique values, and all values from one table are present in the second table. In this scenario, either a linked dimension (as described in the previous solution) or an aggregated table can be created, holding all the unique values and the desired calculations for one of the tables.
To create the aggregated table, create a custom SQL expression that aggregates values from the table that holds all the values (its own and the subset present in the other table), with an expression like the following:

SELECT i.OrderDateKey, i.ProductKey,
       sum(i.DiscountAmount), sum(i.SalesAmount), avg(i.UnitPriceDiscountPct)
FROM [FactInternetSales] i
GROUP BY i.OrderDateKey, i.ProductKey

This custom SQL expression selects the distinct OrderDateKey values and their corresponding ProductKey from FactInternetSales, grouped by these fields, together with single-value aggregations of the other fields: the discount amount, the sales amount, and the average unit price discount percentage. Now merge the OrderDateKey and ProductKey between this table and the two other tables, and you will be able to pull values from this new table into the rows or axes panel of a widget in Sisense Web, together with measures and additional aggregations.

2. Both fact tables don't hold unique values, and several fields have different values in each table. Resolving this scenario combines the two previous solutions: create an aggregated table as described in the scenario above, and a dimension table as described in the Two Tables, One Relationship solution. The final resolution should look like this:

Image 3. New schema with dim and aggregated tables

Resolutions for more than two tables, more than one relationship

Using the Lookup function

In most scenarios we aggregate values according to a given ID from the unique side of the relationship to the duplicate side; in specific cases, however, it is the other way around. For example, in the following scenario we have three tables with two one-to-many relationships between them. This can potentially create a many-to-many relationship if we query the two leaf tables: the query's result table will contain multiple rows that cannot be distinguished from one another.

Image 4: Two consecutive M-to-M relationships

Using the Lookup function, we can import values from a remote table by matching values in a field. This creates a new field in the table, holding the matching value of the identifying field from the other table. Taking the following example of tables T1, T2, and T3, we'd like to run a query that displays aggregations of the duplicate IDs from T1 with a measure from T3. If we ran the query as is, we'd get multiple values in the query's result set and would not be able to run this aggregation. To resolve this, we use the Lookup function to import the values from T3 into T2, and then re-run the query only on tables T1 and T2.

We can import the values of 'M3' from the 'T3' table into the 'T2' table. Create a new custom column, and use the Lookup function to import the attribute's values. In this case, the Lookup function should look like this:

Lookup([T3],[T3].[M3], [T2].id2,[T3].id2)

Running this statement in table T2 imports the matching values of T3.M3 according to the matching id2 values between the two tables.

Image 5: Two consecutive Many-to-One relationships after the Lookup fix
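As a concrete illustration of the same argument order with hypothetical names (a DimRegion table holding one row per RegionID, and a Payments table carrying a RegionID column), a custom column on Payments could pull in the region name like this:

--Lookup(<remote table>, <remote field to return>, <local match field>, <remote match field>)
Lookup([DimRegion], [DimRegion].[RegionName], [Payments].[RegionID], [DimRegion].[RegionID])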
Concatenate the two tables into one

Assuming we have two separate tables, each with duplicate ID values and each holding different columns per ID, we can create a custom table that holds all the values for every ID and pull the aggregations from this new table. Notice that the two original tables, Table_1 and Table_2, have different columns.

Image 6: Concatenating tables

Using the following SQL statement, we can import the data from both tables, with their IDs and columns respectively:

SELECT s.id AS id, s.m1, s.m2, ToInt(NULL) m3, ToInt(NULL) m4
FROM [Table 1] s
UNION
SELECT t.id, ToInt(NULL), ToInt(NULL), t.m3, t.m4
FROM [Table 2] t

This creates a table with five columns: id, m1 (from Table_1), m2 (from Table_1), m3 (from Table_2), and m4 (from Table_2). The values missing from each table are filled with NULLs, resulting in the following table:

Image 7: Concatenated table; result set

Image 8: Determining a many-to-many relationship; decision tree. This is based on the first example, with the Payments and Reservations tables.
Data Relationships: Introduction to a Many-to-Many

A relationship between tables specifies the field(s) the two tables have in common. In Sisense, relationships between tables are created in the ElastiCube. Poorly modeled data relationships can lead to one or more of the following:

- Complex data sets that return incorrect results
- Excessive use of computing resources
- No results at all

Hence, it is highly important to ensure that the relationships between tables are properly modeled in the ElastiCube.

Types of Data Relationships

- One-to-One relationship: the field used to connect both tables has only unique values in every row.
- One-to-Many relationship: one table holds unique values in every row, but the other table holds duplicate values for any or all of the corresponding values in the first table.
- Many-to-Many relationship: there are duplicated values on both sides of the relationship.

Business Case Example

A many-to-many relationship exists when the value in the field the relationship is based on appears multiple times in each table. For example, a company has two tables, Sales and Purchases, and both tables contain the date on which each transaction was performed. If a relationship between Sales and Purchases were created based on the date, a many-to-many relationship would result, since a date can appear multiple times in each table when several transactions are performed every day. To sum up, when a field from two or more tables contains the same values, and these values are duplicated in both tables, a connection created on this field will result in a many-to-many relationship. (A quick way to check this is sketched at the end of this article.)

Test and Resolve Data Relationship Issues

There are several methods to identify and resolve or bypass a many-to-many relationship. The solution depends on the business model and the logic of the business question at hand. Here are the resources that can help you test, identify, and properly model data relationships in your ElastiCube:

- Data Relationships: Introduction to Many-to-Many (you are here)
- Data Relationships: Check Type of Relationship
- Data Relationships: One-to-Many Relationship
- Data Relationships: The Invisible Many-to-Many
- Data Relationships: Many-to-Many Relationship Resolutions
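As a minimal sketch of that check, using the Sales table and Date field from the business case above (the names are illustrative; count and DISTINCT_count are the same functions used in the related articles): if the two numbers differ, the field is not unique in this table, and connecting it to another non-unique field would create a many-to-many relationship. Run the same check on each side of the intended relationship.

--Compare total rows with distinct dates in the Sales table
SELECT count(s.Date) AS [Total Rows],
       DISTINCT_count(s.Date) AS [Distinct Dates]
FROM [Sales] s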
Calculating YTD With Variable Start And End Dates Of The Year (Academic Calendar)

Question

For a specific industry, like education, each academic year can start and end on different dates, even though the number of days is roughly the same every year. The usual way of calculating YTD doesn't work for this requirement, because a school year can start and end in the middle of a calendar year.

Solution

The idea is to compare years by the number of days that have passed since each year started, e.g., compare the first 135 days of this school year with the first 135 days of last school year, of the school year before that, and so on, regardless of the actual dates. To accomplish this, we need a date dimension table with a day rank and a YTD flag. The day rank is an enumeration of school days within each school year. By comparing each day's rank with the rank of today's date, we can calculate the YTD flag.

We need these two tables for this solution to work:

- A list of school years with start and end dates (the School Year table)
- A list of dates (the Date table; you only need the Date column for this purpose). You can use the date dimension Excel file from this article.

Step 1

Create a new date dimension table using the custom table option:

- Get the list of school dates (only dates between the start and end dates of each school year) and, optionally, remove dates that fall on Saturday and Sunday.
- Rank the dates within each school year.
- Based on the result above, get today's date rank.
- Compare each date's rank with today's rank to calculate the YTD flag.

You can use this SQL as an example:

--Join the list of school dates with today's date rank
--Compare each date's rank with today's date rank to create YTD flag
SELECT *,
       CASE WHEN t1.[Day Rank] <= t2.[Today Day Rank] THEN 1 ELSE 0 END AS isYTD
FROM
(
  --Get the list of only school dates (dates between start and end date)
  --Remove dates that fall on Saturday and Sunday (not school dates)
  --Rank dates within each school year
  SELECT d.Date, s.[School Year], s.[Start Date], s.[End Date],
         rankasc(s.[School Year], d.Date) AS [Day Rank]
  FROM [Dim Dates] d
  JOIN [School Year] s ON d.Date BETWEEN s.[Start Date] AND s.[End Date]
  WHERE DayOfWeek(d.Date) <= 5
) t1
JOIN
(
  --Get today's date rank
  SELECT t.[Day Rank] AS [Today Day Rank]
  FROM
  (
    --Get the list of only school dates (dates between start and end dates of each year)
    --Remove dates that fall on Saturday and Sunday (not school dates)
    --Rank dates within each school year
    SELECT d.Date, s.[School Year], s.[Start Date], s.[End Date],
           rankasc(s.[School Year], d.Date) AS [Day Rank]
    FROM [Dim Dates] d
    JOIN [School Year] s ON d.Date BETWEEN s.[Start Date] AND s.[End Date]
    WHERE DayOfWeek(d.Date) <= 5
  ) t
  WHERE t.Date = createdate(getyear(now()), getmonth(now()), getday(now()))
) t2 ON 1=1

Step 2

Link the fact table(s) to the new date dimension table on the Date field.

Step 3

Create your widget(s) and add the YTD flag as a filter, set to 1.
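If you also want to compare the current school year against prior school years side by side, one illustrative extension is to flag the current school year in a small custom table; this reuses the School Year table and the same date functions as the query above:

--Flag the school year that today's date falls into
SELECT s.[School Year],
       CASE WHEN createdate(getyear(now()), getmonth(now()), getday(now()))
                 BETWEEN s.[Start Date] AND s.[End Date]
            THEN 1 ELSE 0 END AS isCurrentSchoolYear
FROM [School Year] s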