From BI to AI with Sisense: Principal Component Analysis (PCA) Part II

Sisense Employee

05-24-2022

In Part I of this article, we explored the theory behind Principal Component Analysis (PCA). In this second and concluding article, we will look at a specific example of performing PCA using Sisense Notebooks.

What is Sisense Notebooks?

Sisense Notebooks is a feature of the Sisense platform that lets data analysts and developers write SQL queries to generate charts and then process the results further in Python, for tasks such as advanced data analysis, feature engineering, machine learning, and more.

Setting up for PCA in Notebooks

For this example, we used heart disease data. The data was uploaded to Redshift. The following steps were executed:

1. A new Notebook was created and named PCA_HeartDisease.

2. A query was added to fetch data from the Redshift instance, as shown below:

[Note: SELECT * is not memory efficient, and it is recommended that only the required column names be listed in the SELECT statement. The reader is welcome to apply such SQL best practices as they see fit.]

3. Let us open a new code block and explore the data, which has been returned as a pandas DataFrame, using the code df.head() (see below):
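As a rough stand-in for that step, the sketch below builds a small toy DataFrame and previews it. The column names and values are hypothetical and invented for illustration; they are not taken from the heart disease dataset, and in the Notebook the variable df would already hold the query result.

```python
import pandas as pd

# Hypothetical stand-in for the query result; names and values are invented.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "sex": ["M", "M", "F", "M", "F", "M"],
    "chest_pain": ["typical", "atypical", "non_anginal",
                   "atypical", "asymptomatic", "typical"],
    "cholesterol": [233, 250, 204, 236, 354, 263],
    "target": [1, 1, 1, 0, 0, 1],
})

preview = df.head()   # first five rows by default
print(preview)
print(df.dtypes)      # shows which columns are numeric and which are not
```

Checking the dtypes at this point makes it easy to spot the categorical columns that need encoding later.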

4. We are then going to add two new code blocks to import libraries.

5. The data looks like this (from Step 3 above). Note that some of the columns are categorical. PCA operates on numeric values and does not work well on such columns, so we encode them first. The following code does it.
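The exact encoding used in the original Notebook is not reproduced here. One common option is one-hot encoding with pandas, sketched below on a toy DataFrame whose column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; not the actual heart disease columns.
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": ["M", "M", "F", "F"],
    "chest_pain": ["typical", "atypical", "non_anginal", "atypical"],
    "cholesterol": [233, 250, 204, 236],
})

# Treat every non-numeric column as categorical.
categorical_cols = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]

# One-hot encode: each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=float)
print(encoded.head())
```

Label encoding (mapping each category to an integer) is another possibility, but it imposes an artificial ordering on the categories, which can distort a variance-based method such as PCA.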

Partial view of the dataset:

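6. Next, we run correlations on the dataset. If many features are strongly correlated, a few principal components can capture most of the variance and PCA will likely work well; if the correlations are weak, PCA will compress the data poorly.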
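A minimal sketch of a correlation check with pandas follows. The data is synthetic and the column names are hypothetical: one pair of columns is deliberately constructed to be correlated and a third is independent, so the contrast is visible in the matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data for illustration only.
age = rng.normal(55, 9, n)
df_num = pd.DataFrame({
    "age": age,
    # Built to fall as age rises, so it correlates negatively with age.
    "max_heart_rate": 220 - age + rng.normal(0, 5, n),
    # Drawn independently of the other columns.
    "cholesterol": rng.normal(240, 50, n),
})

corr = df_num.corr()   # pairwise Pearson correlation by default
print(corr.round(2))
```

Off-diagonal values near +1 or -1 indicate redundancy that PCA can exploit; values near 0 indicate features that carry largely independent information.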

Partial view of the correlation information:

7. Next, we run the PCA as shown below. We want to find the six most important components, although we also tried other numbers of components, including 2, 3, and 10. The library used is scikit-learn. The data is scaled, and then the PCA is fit to it and used to transform it. It is that simple using this library!
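The scale, fit, and transform sequence can be sketched with scikit-learn as follows. The input here is random synthetic data with an assumed shape (300 rows, 13 numeric features); in the Notebook, the encoded DataFrame would take its place.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 13))   # hypothetical stand-in for the encoded data

# Standardize each feature to zero mean and unit variance so that
# features with large numeric ranges do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and project the data onto the first six components.
pca = PCA(n_components=6)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_.round(3))
```

The attribute explained_variance_ratio_ gives the fraction of total variance captured by each component, in decreasing order.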

8. We shall then plot the results for the six components:
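The original plotting code is not reproduced here. One plausible way to plot the result is a bar chart of explained variance per component (a scree plot), sketched below with matplotlib on synthetic data of an assumed shape.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 13))   # hypothetical stand-in for the encoded data

pca = PCA(n_components=6).fit(StandardScaler().fit_transform(X))

labels = [f"PC{i}" for i in range(1, 7)]
pct = pca.explained_variance_ratio_ * 100   # convert fractions to percentages

fig, ax = plt.subplots()
ax.bar(labels, pct)
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance (%)")
ax.set_title("Scree plot")
```

In a notebook environment the figure typically renders inline when the cell finishes; in a plain script, plt.show() or fig.savefig(...) would be needed.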

Results:

Notice that PC1 and PC2 together explain only approximately 22% of the variance in the data. The explained variance does not drop off sharply after the first components: PC2 through PC6 (and beyond) each remain important. This is due to the low correlation between the features. In other words, nearly all of the factors in the heart disease dataset contribute to determining whether heart disease is present.
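To illustrate why low correlation leads to a flat profile, the sketch below compares the share of variance captured by the first two components on two synthetic datasets: one with independent features and one whose features all share a common latent factor. Both datasets and the helper first_two_share are hypothetical and unrelated to the heart disease data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 500, 10

# Ten independent features: no shared structure for PCA to exploit.
independent = rng.normal(size=(n, p))

# Ten features that are each a shared latent factor plus a little noise.
latent = rng.normal(size=(n, 1))
correlated = latent + 0.3 * rng.normal(size=(n, p))

def first_two_share(X):
    """Fraction of total variance explained by PC1 and PC2 together."""
    pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
    return pca.explained_variance_ratio_.sum()

print(f"independent features: {first_two_share(independent):.0%}")
print(f"correlated features:  {first_two_share(correlated):.0%}")
```

With independent features the first two components capture only a modest share, similar in spirit to the result above, whereas with strongly correlated features they capture most of the variance.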

For an academic example of PCA applied to synthetic data, which shows the scree plot and other measures in starker contrast, please refer to Joshua Starmer’s example on his website StatQuest.

Conclusion

In this article, we demonstrated how quickly and easily one may perform Principal Component Analysis (PCA) using Sisense Notebooks. PCA is a very useful technique for dimensionality reduction and visualization.

Updated 05-24-2022

Version 5.0
