
 Sisense customers are increasingly using large datasets. Creating visualizations that are meaningful and easy to interpret poses a challenge to analysts who have to work with multiple data features. Often, visualizations are prepared based on the analyst’s understanding of the data, or with various permutations to show feature relationships. 

One technique that may be applied to explore and visualize data is Principal Component Analysis (PCA). PCA is useful for extracting key dimensions from high-dimensional datasets and for enhancing visualization and interpretability of data while minimizing information loss.

In Part I of this article, we will explore the theory behind PCA. In Part II we will look at an example using the Notebooks feature in Sisense. 

The technique of PCA was first introduced by Pearson [1] and later, independently, by Hotelling [2]. Today, the technique has become very popular, given the increasingly large datasets in use and the computational resources now available to determine the principal features of a dataset.

Consider the following sample dataset. This dataset was made up solely for this article. In Part II, we will look at a real dataset, along with the exploration and curation techniques that come before applying PCA.

Our goal is to figure out what the principal components of this dataset are so we can visualize the relationships between the various types of discounts. An analyst may be tempted to create a set of 10 two-dimensional charts, one per pair of dimensions ((5 × 4) / 2 = 10), to visualize the relationships between all of the dimensions. This would make the data difficult for the end user to interpret, and it would not scale as the number of dimensions increases.

 

Rebates   Discounts   Cashback   Free Products   Credit
      5          10          3               0        8
      8          22         15               8        3
      8          12          7               4       22
     22          14         19              20       19
     15          16         25               3        7
      8          17          4               7       15

 

When data of this nature is encountered, what is the best way to visualize it? This dataset contains 5 features, or dimensions, and 6 samples. Although the number of samples is otherwise not important here, note that the number of principal components that can be extracted from a dataset is the smaller of the number of samples and the number of dimensions. With large datasets, running out of samples is unlikely to be an issue.
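To make this concrete, here is a minimal sketch in Python using pandas, assuming the made-up table above is simply typed in as a DataFrame. The variable names are illustrative and not produced by Sisense.

```python
import pandas as pd

# The made-up discounts dataset from the table above.
data = pd.DataFrame({
    "Rebates":       [5, 8, 8, 22, 15, 8],
    "Discounts":     [10, 22, 12, 14, 16, 17],
    "Cashback":      [3, 15, 7, 19, 25, 4],
    "Free Products": [0, 8, 4, 20, 3, 7],
    "Credit":        [8, 3, 22, 19, 7, 15],
})

n_samples, n_features = data.shape            # 6 samples, 5 dimensions
max_components = min(n_samples, n_features)   # at most 5 principal components here
print(f"{n_samples} samples, {n_features} dimensions -> "
      f"up to {max_components} principal components")
```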

The mathematics behind how PCA is implemented is beyond the scope of this article. However, the references section provides additional resources [3].

PCA in Action

Let us start by plotting Rebates vs. Discounts as shown in Figure 1 below. For simplicity, we will use two dimensions to start with but will describe how this will work for additional dimensions later.

[Figure 1: Rebates vs. Discounts]

PCA then attempts to fit a line through these data points so that the projections of the data points onto the line have the maximum variance, that is, the greatest possible spread of the projected points about the origin. Because projected points can fall on either side of the origin, the distances are squared, and the line is chosen so that the sum of the squared distances is at a maximum. The line is initially generated randomly; using optimization techniques, the algorithm then arrives at the best fit, which becomes the first component. See Figure 2a and Figure 2b for how this works.

[Figure 2a]

[Figure 2b]
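To make the "maximize the sum of squared distances" idea concrete, here is a minimal sketch in Python with NumPy. Rather than starting from a random line and optimizing, as described above, it centers the Rebates and Discounts columns and sweeps candidate directions through the origin; the variable names are illustrative.

```python
import numpy as np

# Rebates vs. Discounts from the sample table, centered so the fitted line
# passes through the origin of the centered data.
X = np.array([[5, 10], [8, 22], [8, 12], [22, 14], [15, 16], [8, 17]], dtype=float)
X = X - X.mean(axis=0)

# Sweep candidate directions through the origin and keep the one whose
# projections have the largest sum of squared distances from the origin.
best_direction, best_sum_sq = None, -np.inf
for theta in np.linspace(0.0, np.pi, 1800, endpoint=False):
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector
    projections = X @ direction                            # signed distances along the line
    sum_sq = np.sum(projections ** 2)                      # sum of squared distances
    if sum_sq > best_sum_sq:
        best_sum_sq, best_direction = sum_sq, direction

print("Approximate PC1 direction:", best_direction)
print("Sum of squared projections:", best_sum_sq)
```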


Once the line with the best fit is identified, this line is treated as Principal Component 1 (PC1). The line is normalized to a unit vector, and its fit is measured by the sum of squared distances between the origin and the data points projected onto the line. If the distances between the projected data points and the origin are called d₁, d₂, …, dₙ, then the sum of squared distances is d₁² + d₂² + ⋯ + dₙ². See Figure 3 for how this is calculated. Following normalization, an orthogonal unit vector through the origin constitutes Principal Component 2, or PC2. Subsequently, the data points, along with PC1 and PC2, are rotated so that PC1 and PC2 become the new axes. This transformation happens only after all components are extracted.
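As a sketch of these steps in Python (NumPy), the unit vector that maximizes the sum of squared projections of the centered data can be read off the eigendecomposition of the covariance matrix, a closed-form shortcut to the iterative fit described above. The orthogonal eigenvector plays the role of PC2, and the rotation onto the new axes is a matrix product; this is illustrative rather than how any particular library implements it internally.

```python
import numpy as np

# Centered Rebates/Discounts points from the sample table.
X = np.array([[5, 10], [8, 22], [8, 12], [22, 14], [15, 16], [8, 17]], dtype=float)
X = X - X.mean(axis=0)

# The leading eigenvector of the covariance matrix is the unit vector that
# maximizes the sum of squared projections, i.e. PC1; the next eigenvector
# is the orthogonal unit vector, PC2.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending order
pc1 = eigenvectors[:, -1]
pc2 = eigenvectors[:, -2]

d = X @ pc1                 # signed distances of the projected points from the origin
sum_sq = np.sum(d ** 2)     # d1^2 + d2^2 + ... + dn^2

# Rotate the data so that PC1 and PC2 become the new axes.
transformed = X @ np.column_stack([pc1, pc2])
print("PC1:", pc1, " PC2:", pc2)
print("Sum of squared distances along PC1:", sum_sq)
print("Data in the PC1/PC2 coordinate system:\n", transformed)
```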

See Figure 4a and Figure 4b. The proportions of each dimension in the PCs are called the loading scores. In short, loading scores are the weights given to the original dimensions in the linear combination that defines each PC, and it is onto these PCs that the data points are projected.

The new PCs no longer represent the two dimensions we originally started with, Rebates and Discounts. Instead, each PC mixes the original dimensions in the proportions given by its loading scores.
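A quick way to inspect loading scores is with scikit-learn's PCA, applied here to the full five-dimensional sample table. This is only a sketch on the made-up data; the feature names come from the table above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Full five-dimensional sample dataset from the table above.
X = np.array([
    [ 5, 10,  3,  0,  8],
    [ 8, 22, 15,  8,  3],
    [ 8, 12,  7,  4, 22],
    [22, 14, 19, 20, 19],
    [15, 16, 25,  3,  7],
    [ 8, 17,  4,  7, 15],
], dtype=float)
features = ["Rebates", "Discounts", "Cashback", "Free Products", "Credit"]

pca = PCA(n_components=2)
scores = pca.fit_transform(X)      # the data expressed on the new PC1/PC2 axes

# Each row of components_ holds the loading scores: the weight of every
# original dimension in the linear combination that defines that PC.
for i, loadings in enumerate(pca.components_, start=1):
    print(f"PC{i} loadings:", dict(zip(features, loadings.round(3))))
```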

What about the other dimensions, such as Cashback? Now that we understand how PCA works for two dimensions, the same algorithm can be applied to additional dimensions. With three dimensions visualization is still possible; beyond three, additional components can still be computed, but they cannot be plotted directly. Such visualization is not required, because tools such as the Scree Plot help reveal which components are the most relevant, that is, which capture the most variation in the data. Figure 5 depicts a sample Scree Plot. It reveals that PC1 and PC2 capture most of the information, and may therefore be sufficient to visualize the data. If the Scree Plot shows PCs with very similar values, it is an indicator that no small set of components represents the information better than the rest; under such circumstances, other techniques may have to be explored.

[Figure 3]

[Figure 4a]

[Figure 4b]

[Figure 5: Sample Scree Plot]
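A Scree Plot like the one in Figure 5 can be sketched with scikit-learn and matplotlib by looking at the explained variance ratio of each component. Again, this is only an illustration on the made-up data; the real dataset in Part II could be examined the same way.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Same five-dimensional sample data; in practice this would be the real dataset.
X = np.array([
    [ 5, 10,  3,  0,  8],
    [ 8, 22, 15,  8,  3],
    [ 8, 12,  7,  4, 22],
    [22, 14, 19, 20, 19],
    [15, 16, 25,  3,  7],
    [ 8, 17,  4,  7, 15],
], dtype=float)

pca = PCA().fit(X)                                   # keep every component
variation_pct = pca.explained_variance_ratio_ * 100  # % of variation captured per PC

# Scree plot: one bar per component; the tallest bars are the PCs worth keeping.
labels = [f"PC{i}" for i in range(1, len(variation_pct) + 1)]
plt.bar(labels, variation_pct)
plt.ylabel("Explained variation (%)")
plt.title("Scree Plot")
plt.show()
```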

Conclusion

In this article, we briefly looked at how Principal Component Analysis works. We also discussed its uses and benefits. Finding the principal components is an optimization problem. A Scree Plot shows the relevance of each of the components. The principal components are representations of the loading scores of the dimensions that constitute the component. In Part II, we will look at a real example of how to collect and curate data before applying PCA, as well as what interpretations we can make with PCA. 

References: 

  1. Pearson K. 1901 On lines and planes of closest fit to systems of points in space. Phil. Mag. 2, 559–572. 
  2. Hotelling H. 1933 Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441, 498–520.
  3. Gilbert Strang. 2019 Linear Algebra and Learning from Data. Wellesley-Cambridge Press.