pbsisense
Sisense Team Member

From BI to AI with Sisense: Principal Component Analysis (PCA) Part II

In Part I of this article, we explored the theory behind Principal Component Analysis (PCA). In this second and concluding article, we will look at a specific example of performing PCA using Sisense Notebooks.

What is Sisense Notebooks?

Sisense Notebooks is a powerful feature within the Sisense platform that allows data analysts and developers to write SQL to generate charts and then process the results further in Python, for tasks such as advanced data analysis, feature engineering, machine learning, and more.

 

Setting up for PCA in Notebooks

For this example, we used heart disease data. The data was uploaded to Redshift. The following steps were executed:

1. A new Notebook was created and named PCA_HeartDisease.

2. A query was added to fetch the data from the Redshift instance, as shown below:

pbsisense_0-1653402316734.png

[Note: SELECT * is not memory efficient; it is recommended to list only the required column names in the SELECT statement. Readers are welcome to apply their preferred SQL best practices.]

3. Let us open a new code block and explore the data, which has been returned as a pandas DataFrame, using the code df.head() (see below):

pbsisense_1-1653402316773.png

4. We are then going to add two new code blocks to import libraries.

pbsisense_2-1653402316792.png

5. The data looks like the partial view below (from Step 3 above). Note that some of the columns are categorical. PCA operates on numeric values and does not work well on such columns, so we encode them first. The code that follows the partial view does this.

Partial view of the dataset:
pbsisense_3-1653402316744.png



pbsisense_4-1653402316808.png
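The encoding code itself appears only in the screenshot above. As a minimal sketch of one common approach, the block below one-hot encodes every non-numeric column with pandas. The column names and values are hypothetical stand-ins, not the article's dataset, and the original notebook may have used a different encoder.

```python
import pandas as pd

# Hypothetical stand-in for the query result; names and values are illustrative only.
df = pd.DataFrame({
    "age": [63, 41, 57, 52],
    "sex": ["M", "F", "M", "F"],
    "chest_pain_type": ["typical", "atypical", "non_anginal", "typical"],
    "cholesterol": [233, 204, 192, 250],
})

# Pick out the non-numeric (categorical) columns.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# One-hot encode them so that every column is numeric before running PCA.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

print(encoded.head())
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category value.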

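6. Next, we run correlations on the dataset. If many of the features are strongly correlated, a few principal components can capture most of the variance, so the results of PCA will likely be good; if the correlations are low, the opposite holds.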

pbsisense_5-1653402316739.png
Partial view of the correlation information:
pbsisense_6-1653402316809.png
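As a rough sketch of this step, the block below computes a Pearson correlation matrix with pandas and summarizes it as a mean absolute off-diagonal correlation. The data is synthetic and the column names are hypothetical; the summary statistic is an illustrative choice, not something taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic numeric data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(55, 9, n)
df_num = pd.DataFrame({
    "age": age,
    "max_heart_rate": 220 - age + rng.normal(0, 5, n),  # constructed to correlate with age
    "cholesterol": rng.normal(240, 50, n),               # independent of the other columns
})

# Pairwise Pearson correlations between all numeric columns.
corr = df_num.corr()

# Average strength of the off-diagonal correlations, as a single summary number.
off_diagonal = ~np.eye(len(corr), dtype=bool)
mean_abs_corr = np.abs(corr.values[off_diagonal]).mean()

print(corr.round(2))
print(f"Mean absolute off-diagonal correlation: {mean_abs_corr:.2f}")
```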

 

7. Next, we run the PCA as shown below. We want to find the six most important components; we also tried other numbers of components, including 2, 3, and 10. The library being used is scikit-learn. The data is scaled, and the PCA is then fit and used to transform the data. It is that simple using this library!

pbsisense_7-1653402316807.png
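The scale, fit, and transform sequence described in this step can be sketched with scikit-learn as follows. The input matrix here is a synthetic placeholder (300 rows, 13 numeric features, both numbers chosen arbitrarily), so the printed ratios will not match the article's results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder for the encoded, all-numeric dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 13))

# Standardize each feature to zero mean and unit variance so that
# no single feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and project the data onto the six leading components.
pca = PCA(n_components=6)
components = pca.fit_transform(X_scaled)

# Fraction of the total variance explained by each retained component.
explained = pca.explained_variance_ratio_
print(explained.round(3))
```

Changing `n_components` is all it takes to try 2, 3, 10, or any other number of components.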

8. We then plot the results for the six components:

pbsisense_8-1653402316759.png

Results:
pbsisense_9-1653402316809.png
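A plot of this kind is usually a scree plot: one bar per component showing its share of the explained variance. The sketch below uses matplotlib; the helper name `plot_scree` is hypothetical, and the ratios passed in are made-up values for illustration, not the results shown above.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_scree(explained_variance_ratio):
    """Draw a bar chart of explained variance (%) per principal component."""
    pct = np.asarray(explained_variance_ratio) * 100
    labels = [f"PC{i}" for i in range(1, len(pct) + 1)]

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, pct)
    ax.set_xlabel("Principal component")
    ax.set_ylabel("Explained variance (%)")
    ax.set_title("Scree plot")
    return fig, ax


# Made-up ratios for illustration only; in practice pass pca.explained_variance_ratio_.
fig, ax = plot_scree([0.30, 0.20, 0.15, 0.10, 0.08, 0.05])
```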


Notice that PC1 and PC2 together account for approximately 22% of the variance in the data. Adding more components does not reduce the importance of the components PC2-PC6 (or beyond); there is no sharp drop-off after the first few. This is due to the low correlation between the features. All of the factors in the heart disease dataset contribute to determining whether heart disease is present.
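To illustrate why low correlation spreads the variance across many components, the sketch below compares two synthetic datasets: one with independent features and one whose features share a common latent factor. It uses the fact that, for standardized data, the explained variance ratios equal the eigenvalues of the correlation matrix divided by their sum. All sizes and noise levels are arbitrary assumptions.

```python
import numpy as np


def explained_variance_ratio(X):
    """Variance share per principal component of standardized data,
    computed from the eigenvalues of the correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # sort in descending order
    return eigvals / eigvals.sum()


rng = np.random.default_rng(1)
n_rows, n_features = 500, 10

# Case 1: independent features, so pairwise correlations are near zero.
independent = rng.normal(size=(n_rows, n_features))

# Case 2: every feature is one shared latent factor plus a little noise.
latent = rng.normal(size=(n_rows, 1))
correlated = latent + 0.3 * rng.normal(size=(n_rows, n_features))

ratio_ind = explained_variance_ratio(independent)
ratio_cor = explained_variance_ratio(correlated)

print("Independent features, PC1 + PC2 share:", ratio_ind[:2].sum().round(3))
print("Correlated features,  PC1 + PC2 share:", ratio_cor[:2].sum().round(3))
```

With independent features the first two components explain only a modest share, much like the heart disease result; with strongly correlated features the first component alone dominates.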

For an academic example of PCA applied to synthetic data, which shows the Scree Plot and other measures in starker contrast, please refer to Joshua Starmer’s example on his website StatQuest.

 

Conclusion

In this article, we demonstrated how quickly and easily one may perform Principal Component Analysis (PCA) using Sisense Notebooks. PCA is a very useful technique for dimensionality reduction and visualization.