cancel
Showing results for 
Search instead for 
Did you mean: 
Ophir_Buchman
12 - Data Integration
12 - Data Integration

Introduction

The following article discussed "Many-to-Many" relationships - Both expected and unexpected. It focuses on how they occur and the best practices which one can implement to avoid them.

 

Table of Contents

Table Relationships

When modeling data, BI recommends you either use a "Flat" or a "Fact-Dimension" table schema:

  • To implement a "Flat" table schema - The user would denormalize all their data into a single table. Denormalizing data will require extensive data duplication, row duplication, and result in exponential growth of the database size. For example: Placing all "product names" in a "sale transactions" table will cause a duplication of strings in your data model.
  • To implement a "Fact/Dimension" table schema - The user would denormalize some of their data and place it in a two-level hierarchy model:
    • The first level is the "Fact" level (a.k.a. "Measures") and is used to store transactional data that could be aggregated. 
    • The second level is the "Dimension" level (a.k.a. "Filters") and is used to store descriptive information about the transactions. This data would be used towards filtering or segmenting. 

When using multiple data tables in the model, a user has to teach the system how to join them in case of a cross-table query. For example, if the "Fact" table describes sale transactions and the "Dimension" describes the products sold - The user may request to summarize the sales amount of all red products.

The logical connection between the tables is known as a "Table Relationship" and is performed by linking two related fields from the different tables (a.k.a. Keys). For example, a "Fact_Sales" table listing sale transactions holding a "Product ID" column (which represents the product purchased) and a "Dim_Products" table describing products having a "Product ID" column (which represents the product described)

A "Table Relationship" describes one of the following relations:

  • A One-to-One relationship
  • A One-to-Many relationship
  • A Many-to-Many relationship

 

The One-to-One Relationship

The "One-to-One" relationship represents a relationship where each value in the "Key Column" of "Table A" exists either once or zero times in the "Key Column" of "Table B" and vice versa.

For example:

  • A customer may have a store membership; a store membership belongs to a single customer.
  • A city may be the capital of one country; a country may only have one capital.
  • Each employee (identified by "Employee ID") may have one SSN; an SSN may belong to one employee.

A "One-to-One" relationship is usually visualized in the following way:

Ophir_Buchman_0-1644955176043.png

Each entry in "Table A" shows once in "Table B." Each entry in "Table B" shows once in "Table A."

 

The One-to-Many Relationship

The "One-to-Many" relationship represents a relationship where each value in the "Key Column" of "Table A" exists either once or zero times in the "Key Column" of "Table B" and each "Key Column" of "Table B" may appear multiple times in the "Key Column" of "Table A."

For example:

  • A person belongs to one family. However, the family includes multiple people.
  • A person was born in a single hospital. However, many people were born in that hospital.
  • An airplane is parked in a single airport. However, the airport accommodates multiple planes.

A "One-to-Many" relationship is usually visualized in the following way:

Ophir_Buchman_2-1644955319311.png

Each entry in "Table A" shows once in "Table B." Each entry in "Table B" shows once or more in "Table A."

 

The Many-to-Many Relationship

A "Many-to-Many" relationship represents a relationship where each value in the "Key Column" of "Table A" may appear multiple times in the "Key Column" of "Table B" and vice versa.

For example:

  • A person may belong to multiple groups; a group may contain multiple people.
  • A product may belong to multiple categories; each category includes various products.
  • A student participates in multiple classes; a class includes numerous students.

A "Many-to-Many" relationship is usually visualized in the following way:

Ophir_Buchman_3-1644955380874.png

Each entry in "Table A" shows once or more in "Table B." Each entry in "Table B" shows once or more in "Table A."

 

Ophir_Buchman_0-1645470915372.png

 

Should One Avoid "Many-to-Many" In All Costs?

So, why are Many-to-Many relationships so big of a deal?

Let's examine both the "Expected" and "Unexpected" many-to-many relationships.

The "Expected Many-to-Many"

In some cases, Many-to-Many relationships are expected and planned. Let's examine this company that sets up fan groups and stores their data in the following user/group table schema:

 

Ophir_Buchman_0-1645274755911.png

Let's attempt to calculate the following:

  • What is Dan's monthly admission?
    We'll have to join both tables on the "Group ID" field and filter on Dan's name to calculate this. The result would be a SUM aggregation of the "Admission Price" column (Result is 20$+30$+5$ = 55$)

  • How many people participate in the "Boardgame Fans" group?
    We'll have to join both tables on the "Group ID" field and filter on the group's name to calculate this. The result would be a COUNT aggregation of the "User ID" column (Result is 2 People)

  • What is the entire cash flow?
    To calculate this - We'll have to group the user/group table based on the "Group ID" and count the users per group. Then, join both tables on the "Group ID" field. The result would be a SUM of the multiple of the "User Count" and the "Admission Price" (Result is 40$+90$+10$+12$ = 152$)

 

The "Unexpected Many-to-Many"

Many-to-Many relationships are created unexpectedly - This could happen due to an unexpected "wrong" business question or inadequate data modeling. Let's examine this retail use-case where a company stores purchases (made against vendors) and sales (made against customers):

Ophir_Buchman_1-1645302914665.png

Let's attempt to calculate the following:

  • How many "Chocolate Bars" were sold?
    To calculate this - We'll have to join the "Products" and "Sales" tables on the "Product ID" field and filter on "Chocolate Bars." The result would be a SUM aggregation of the "Quantity" column (Result is 6+1 = 7)

  • Which Customer Made the Most Purchases (money-wise)?
    To calculate this - We'll have to group the "Sales" table purchases by the "Customer name" and perform a SUM aggregation of the "Sale Price." Then, find the maximal value (Result is "Dan" - 75$)

  • How many sales were performed by "Aid Inc."?
    This question is an interesting one as the company doesn't sell to vendors - Most likely a "wrong" user question. Even though, Sisesne will attempt to find a way to calculate this. It will do that by Joining the "Vendors" table with the "Purchases" table on "Vendor ID" filtering "Aid Inc.". It will then take the products found and join them against the "Sales" table. The result will include the SUM of Sales Price" (Result is 20$+12$+25$ = 57$) - Which is incorrect.

 

Related Risks

Many-to-Many related mistakes (such as the one presented in the previous section) may lead to the following:

  • Displaying no data in your widget
  • Displaying incorrect data in your widget
  • Generating a huge JOIN operation - Leading to excessive use of memory and/or computing power

In severe cases (where data tables are huge) - A many-to-many relationship may lead to an Elasticube crash or a complete Sisense instance failure

Ophir_Buchman_0-1645470915372.png

 


Dealing with Many-to-Many Relationships

There are two ways of dealing with many-to-many relationships:

  • Avoiding them
  • Bullet-proofing your Data-Model

 

Avoiding Many-to-Many Relationships

The easiest way to resolve many-to-many relationship-related issues is by avoiding them.
Avoiding these relationships isn't necessarily easy. Doing so might be very expensive in terms of computing resources (data transformation) and data storage (data duplication).

For Example:

Before After
Ophir_Buchman_0-1645304354340.png Ophir_Buchman_1-1645304361930.png

Notice the data duplication

 

Bullet-Proofing your Data Model

Another way to deal with many-to-many relationships and avoid related risks (discussed in the next section) is bullet-proofing the data model.

There are a few simple practices you should follow to avoid errors originating from incorrect business questions - These are discussed in the "Sisense Best Practices" section.

 

Other Mitigation Steps

Talking from experience, you should expect the unexpected and plan ahead - Meaning that someone may unintentionally generate a many-to-many relationship and risk the Sisense instance stability.

As a best practice, configure "Data Groups" to limit the resources allocated towards a single Elasticube.
Refer to the following link to read more about Data Groups.

 

Ophir_Buchman_1-1645470968754.png

 

Sisense Best Practices

Each best practice answers a different data modeling mistake or a potential "wrong" business question.

 

Use a "Fact-Dimension" Table Schema

BI best practices recommend using a two-level table hierarchy model including:

  • Facts (a.k.a. Measures) - Containing data that is aggregated (e.g., A list of sale transactions)
  • Dimensions (a.k.a. Filters) - Describing the data inside the fact (e.g., A list of products and their characteristics)

Sisense recommends implementing the "Fact/Dim" schema structure.

Why?

The reasons behind this guideline are:

  • Having a two-level table hierarchy model reduces the amount of JOIN events during a query's execution. This results in a shorter query runtime, less compute operations, less memory consumption, and provides a better user experience.
  • Having a two-level table hierarchy model simplifies the data model making it easier to debug data inconsistencies and follow the query path.
  • There's only one single way to get from every fact to every dimension (and vice versa)

How?

Data modeling is a task that requires much planning and may be very challenging. The rule of thumb when migrating your database structure into a dim/fact structure is to start by finding out what the business' questions are. Then, use the business questions to define what data is used for filtering and segmenting (dimension) and what data is used for aggregations (facts).

Modeling your data correctly is a key to your dashboard's success (in terms of runtime performance and data accuracy).

Ophir_Buchman_0-1645470915372.png

 

Never Link a "Fact" Table to Another "Fact" Table

Two "Fact" tables should never connect to each other - A "Fact" table should only connect to "Dimension" tables

Why?

Connecting two "Fact" tables directly may lead to a JOIN between them - Potentially causing a huge query to take place and crashing the ElastiCube.

Ophir_Buchman_0-1645470915372.png

 

Never Link a "Dimension" Table to Another "Dimensions" Table

Two "Dimension" tables should never connect to each other - A "Dimension" table should only connect to "Fact" tables

Why?

"Dimension" tables are used to describe the data in the "Fact" table. Different "Dimension" tables usually don't relate to each other and if they do, linking them may lead to a "Table Cycle" (see the relevant section)

Ophir_Buchman_0-1645470915372.png

 

Link all "Dimension" Tables to All "Fact" Tables

Every "Dimension" table should be connected to all "Fact" tables.
Every "Fact" table should be connected to all "Dimension" tables.

Why?

As a reminder - Every "Dimension" field may be used towards filtering and/or segmenting data. Every "Fact" field may be used towards aggregating data.

You user can choose to filter and aggregate any sets of fields, leading to a "Random Path".

How?

Some Fact/Dimension relationships are straightforward. However, some are not. Let's check out the previous example:

Ophir_Buchman_1-1645302914665.png

Connecting the "Product Dimension" to the "Fact" tables is straightforward as every sale and purchase event refers to a specific product. However, connecting the "Vendor Dimension" table isn't - A vendor doesn't participate in a sale event.

To resolve this, you'd have to create a Fake Key - Allowing a relationship and making sure the tables either ALWAYS or NEVER join. Read more about fake keys at the following link.

Ophir_Buchman_0-1645470915372.png

 

Consolidate "One-to-One" Dimension Tables

In some cases, you may have dimensions connecting on the same primary key.
If two dimensions holding the same primary key have a One-to-One relationship, Sisense recommends consolidating these to decrease the amount of JOIN operations.

Why?

Take a look at the following example:

Ophir_Buchman_3-1645471558028.png


Assuming each person has one driver's license and each driver's license belongs to one person, the two dimensions should be consolidated. By doing so, a dashboard attempting to filter tickets based on age range and license type would require one less JOIN operation.

How?

Refer to the following link for details on how to create custom tables.

The resolution of our example is as follows:

Ophir_Buchman_4-1645471809418.png

 

Ophir_Buchman_0-1645470915372.png

 

Hide Irrelevant Columns

In many cases, columns used for creating relationships (primary/foreign keys) are also public keys. But in case the end-user actually uses a "User ID" in his filtering - A dashboard designer may use the incorrect field name when creating a widget, leading to incorrect query results (Many-to-Many) and query failure in extreme cases.

Sisense recommends hiding "Key" columns from the user as much as possible.

Why?

Take a look at the following example:

 

Ophir_Buchman_0-1645458518949.png

 

Scenario #1

The Customer builds a widget to count the number of sale purchases containing product #3:

  • The aggregative field is "Purchase ID" (in the "Fact_Purchases" table)
  • The filtering field is "Product ID" (mistakenly chosen from the "Fact_Sales" table)
  • The result may be incorrect (i.e., If product #3 was never sold) and will require joining the two "Fact" tables (a VERY expensive operation).

Scenario #2

The Customer builds a widget to count the agents who sold apples:

  • The filtering field is "Product Name" (in the "Dim_Products" table)
  • The aggregation field is "Agent ID" (mistakenly chosen from the "Dim_Agents" table)
  • The result will be unpredictable and most likely incorrect as there's no way to determine which Fact table will be used for joining these two "Dimensions" - This is known as a "Random Path"

How?

Refer to the following link for details on how to hide a column.

The resolution of our example is as follows (grayed-out fields are "Hidden"):

Ophir_Buchman_2-1645459575152.png

 

  • In scenario #1 - There's only one "Product ID" field
  • In scenario #2 - There's only one "Agent ID" field

Notice that the hidden fields align with the relevant business questions

Ophir_Buchman_0-1645470915372.png

 

Avoid Dimension Table Cycles

The term "Chaining Dimension Tables" refers to a scenario where you connect a fact table to a dimension table and then chain an additional dimension table to that table.

Sisense recommends avoiding table chaining as much as possible.

Why?

Using this type of structure introduces unnecessary JOIN operations - For example, trying to count all "Red" products sold:

Ophir_Buchman_5-1645472289458.png

Using this type of structure may also lead to cases where to data structure has a Table Cycle - Leading to unexpected query results - For example, trying to count all sales with a "Red" filter:

Ophir_Buchman_0-1646595940954.png

 

How?

Resolving table cycles can be done in various methods (depending on the business questions) - Here are three possible solutions for this problem:

  • Denormalize the "Color Name" and place it in both "Product" and "Package" dimension tables. This will reduce on JOIN event per query and allow the user to choose between the two filters
  • Disconnect the color dimension from the packaging dimension. This is a valid solution IF there are no business questions based on packaging color
  • Duplicate the "Color" dimension - Once for the product, and the other for packaging.

Ophir_Buchman_0-1645470915372.png

 

Validate Dimensions' Primary Keys are Not NULL and Unique

The primary keys of your "Dimension" table should be unique and not NULL

Why?

Making sure the primary key is unique will guarantee a "One-to-Many" relationship (rather than a "Many-to-Many" one). Consider this case:

Ophir_Buchman_1-1646420492603.png

Note the error made in the "Products" dimension table - Two items have the same "Product ID". A simple query may result in a JOIN operation that requires twice the amount of work, but more importantly, results with an incorrect result.

How?

To monitor your dimension tables create a dup-check table - Counting the following:

Value Definition Expected Value
Key Count The total number of keys ---
Unique Key Count The number of unique keys Should be equal to "Key Count"
Null Key Count The number of NULL keys Should be equal to zero

To create such a table - Create a "Custom Table" with a code similar to:

SELECT
'Customers' AS Table_Name,
COUNT(c.[CustomerID]) AS Key_Count,
COUNT(DISTINCT(c.[CustomerID])) AS Distinct_Keys,
SUM(CASE WHEN c.[CustomerID] IS NULL THEN 1 ELSE 0 END) AS Null_Keys
FROM
[dim_Customers] c

UNION

SELECT
'Employees' AS Table_Name,
COUNT(c.[EmployeeID]) AS Key_Count,
COUNT(DISTINCT(c.[EmployeeID])) AS Distinct_Keys,
SUM(CASE WHEN c.[EmployeeID] IS NULL THEN 1 ELSE 0 END) AS Null_Keys
FROM
[dim_Employees] e

UNION

...
Rate this article:
Version history
Last update:
‎03-02-2023 10:12 AM
Updated by: