Your Guide To Default Filters
Filters are a great way to slice and dice your data into a variety of views. There are a couple of ways you can set default filter values: at the dashboard level and at the chart (query) level.

Dashboard Level Default Filters

If you want all your users to see the same filter set for a dashboard upon login, select the desired filter configuration on your dashboard and click "Set as Default" in the filter bar. Further details and screenshots can be found in our documentation here. Note that email reports use the default filter configuration. If your users select a different set of filters and want to go back to the default that you set, they can hit "Reset Filters" to restore the default filter setup.

Chart Level Default Filters

You can define the default filtering behavior in your query by using Direct Replacement Filters with pipe notation. The notation looks like this:

[filter_name|default_value]

Note that the default pipe notation cannot be used with standard filter notation. For example, [field_name=filter_name|default_value] will not return a result.

Users can get creative with the default filter notation to define exactly how they want their charts to behave depending on filter selections. Refer to the following community posts for a few examples!

Prompt Users to Enter a Filter Value if nothing is selected
Filter out Selected Filter Values
Filters in Case When Statements (Controlling Chart Behavior between Different Filters)

How else have you creatively used default filters? Comment below!

Leveraging Sisense Notebooks with Compose SDK
Discover how to enhance your analytics integration with Sisense Notebooks and the Compose SDK. This article explores a code-first approach to creating custom datasets and visualizations using SQL, empowering data teams to leverage their existing skills. Learn how to seamlessly embed tailored insights into customer-facing applications, streamlining workflows for both analysts and developers. Unlock the potential of your data with Sisense, delivering impactful, personalized analytics that meet your organization's unique needs.

Display df.describe() Output In A Dataframe
The .describe() function in Python is really handy for exploratory analysis! Sometimes you would want to display this as a table on a dashboard, but how? The output is typically printed out as text, not as a dataframe. Fear not, getting this into a dataframe is fairly painless! Before we go on, note that using Python requires the Python/R Integration.

Let's say this is the first 10 rows of your SQL output / starting dataframe (the data below is the user IDs and the number of gameplays they made for a fictional gaming company).

print(df['gameplay_number'].describe()) gives you the following output.

To get this in a table of its own (rather than a printout), use the code below! Note, the key in the dictionary used as an argument for pd.DataFrame is what we want to name the column of summary values in our final output. I chose to name this 'gameplays.'

df2 = pd.DataFrame({'gameplays': df['gameplay_number'].describe()})
df2 = df2.reset_index()

# Use Sisense for Cloud Data Teams to visualize a dataframe by passing the data to periscope.output()
periscope.output(df2)

And there you have it!
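If you want to try this pattern outside of a dashboard first, here is a minimal self-contained sketch of the same steps. The user_id and gameplay_number values below are made-up stand-ins for the SQL output described above.

# Minimal sketch of wrapping .describe() output in its own dataframe
# (hypothetical data standing in for the SQL output 'df')
import pandas as pd

df = pd.DataFrame({
    'user_id': range(1, 11),
    'gameplay_number': [3, 7, 2, 9, 4, 12, 5, 8, 6, 10],
})

df2 = pd.DataFrame({'gameplays': df['gameplay_number'].describe()})
df2 = df2.reset_index()  # turns the statistic names (count, mean, std, ...) into their own column

print(df2)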
Sankey Diagrams With Plot.Ly In Periscope

Sankey diagrams are a very easy-to-digest way to look at the flow of objects between different points. The example we will walk through here is a basic Sankey overview, leveraging data from a fictional gaming company. The emphasis of this post is on how to get your pandas dataframe data into something that can be used by plot.ly in a Sankey diagram. The Sankey setup itself is modeled after the first example in the documentation here. If you want to look at different Sankey plots from a visualization standpoint, plot.ly has a great library of resources. I've linked some of those resources at the bottom of this post for further reference.

Without any further ado, here is the annotated code! Note that you would need the Periscope Python and R integration to use our Plot.ly integration.

Our SQL output is a dataframe that contains 3 columns:

Source - The source that the user came from
Platform - The user's platform of choice
Count - The number of users for that source-platform combination

The SQL to generate this is noted below for reference.

select source
  , platform
  , count(*)
from users
group by 1
  , 2

And here is the Python 3.6 code, annotated up top where we set up our pandas dataframe df for use in plot.ly.

# SQL output is imported as a dataframe variable called 'df'
import pandas as pd
import plotly.plotly as py

# First, we get a list of all sources, remove duplicates, and make this a list
sources = df["source"].drop_duplicates().tolist()

# Then, we get a list of all platforms (our targets), remove duplicates, and make this a list
platforms = df["platform"].drop_duplicates().tolist()

# Finally, create a list of all our nodes. We will use this for giving an id to each node for plot.ly
all_nodes = sources + platforms

# Keeping the size of our dataframes. This is useful for applying the same color to each "node" and "link" of our Sankey diagram, if we so choose
n = len(all_nodes)
n2 = len(df["source"])

# Create a dataframe that has all of the node ids. We will join this to the original dataframe to reference later
df1 = pd.DataFrame(all_nodes, columns = ['node'])
df1 = df1.reset_index()
df2 = pd.merge(pd.merge(df, df1, how = 'inner', left_on = "source", right_on = "node"), df1, how = 'inner', left_on = "platform", right_on = "node", suffixes = ('_source', '_target'))

# Setting up the data in the plotly "data" argument.
# The nodes are described in the "node" dictionary (these are the vertical rectangles in the diagram).
# The links are described in the "link" dictionary. These have 3 attributes: the "source" (the index of the node they start at), the "target" (the index of the node they end at), and the "value" (the thickness of the band). Additional attributes, such as color, can also be specified.
data = dict(
    type = 'sankey',
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(
            color = "#435951",
            width = 0.5
        ),
        label = all_nodes,
        color = ["#84baa6"] * n
    ),
    link = dict(
        source = df2["index_source"],
        target = df2["index_target"],
        value = df2["count"],
        color = ['#bdf9e5'] * n2
    ))

# Setting up the layout settings in the "layout" argument
layout = dict(
    title = "An Example Sankey Diagram",
    font = dict(
        size = 12
    )
)

fig = dict(data=[data], layout=layout)

# Use Periscope to visualize a dataframe, text, or an image by passing data to periscope.table(), periscope.text(), or periscope.image() respectively.
periscope.plotly(fig)

Plot.ly Sankey Diagrams - 1
Plot.ly Sankey Diagrams - 2
Plot.ly Sankey Diagrams - 3

Your Guide To Multivariate Linear Regressions In R
Once you've gotten familiar with bivariate linear regressions (one explanatory variable and one response variable), you may want to delve into seeing how multiple explanatory variables contribute to your response. Not sure how to get started? Well, you've come to the right place 🙂

NOTE: If you are new to regressions, we highly recommend starting off with the bivariate linear regression post linked earlier for background!

The equation for a multivariate linear regression looks like this:

y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Part 1: Set Up

To begin, we will import the following libraries, which will come in useful in our multiple regression.

library(dplyr)
library(lmtest)
library(car)
library(sandwich)

Now, normally you would be using your SQL output, but for the sake of this post we will use the built-in mtcars dataset and assign it to our dataframe df.

df <- mtcars

First, we need to explore our variables to ensure they're suitable for the regression. Ideally, your variables should be close to normally distributed. The larger your sample size, the more "ok" it is for your sample to deviate from normality. Intuitively though, you want your variables to cover the spectrum from low to high values. In other words, you don't want a bunch of values clustered around one extreme, and only a sparse number on the other extreme.

Note, if you have any indicator variables (ex: a 0 or 1 indicating if a user is male or female), these do not need to have a normal distribution by any means. We do want to make sure we have sufficient sample in each bin though!

If a variable's distribution is skewed, then you can definitely transform it. Note that how you transform your variables will affect the meaning of the slope coefficients generated by our model.

In our example, let's take a look at mpg as our y variable, and hp and drat as our x variables. In Periscope Data, we would need to run these charts one by one, as the UI will display only the last image generated.

mpg_plot <- hist(df$mpg)
hp_plot <- hist(df$hp)
drat_plot <- hist(df$drat)

From the above, both mpg and hp look skewed to the right, so we take the log transform of these variables.

mpg_plot <- hist(log(df$mpg))
hp_plot <- hist(log(df$hp))

That looks much better. Then, split into testing and training data! Note that we always want to do this to ensure we save a subset of our data for additional validation tests on our model.

df$test_data_flag <- rbinom(nrow(df), 1, 0.3)
training <- filter(df, test_data_flag == 0)
test <- filter(df, test_data_flag == 1)

Part 2: Generating The Model

Now we generate the model! This is very similar to generating a bivariate linear regression; the only difference is that we add a + sign to include more variables in our model.

model <- lm(log(mpg) ~ log(hp) + drat, data=df)

To view the slopes, we use the command below (we avoid using the summary() command for multivariable regressions to ensure our error terms are robust to heteroscedasticity. More on this later). The column under Estimate tells us the effect of one variable on the other. Here, a unit change in log(hp) leads to a decrease of 0.43 in log(mpg) (aka, a 43% drop in mpg). The last column, Pr(>|t|), lets us know how confident we are in claiming the explanatory variable has any effect on the response variable. Here, we see that the coefficient for log(hp) is significant at the 0.001 level, and drat is not significant at all.

print(coeftest(model, vcov = vcovHC))

To assess goodness of fit of the whole model, we take a look at the adjusted R squared value.
Note that we look at the adjusted R squared instead of the regular R squared to account for the fact that we have multiple variables. This is because R squared will always increase as you add more variables, even if they do not contribute to the response variable. Adjusted R squared takes this into account and "penalizes" additional variables.

print(summary(model)$adj.r.squared)

This gives us an adjusted R squared value of about 0.76. Remember, the closer this value is to 1, the better the model is at predicting y.

Part 3: Evaluating The Model

Now, to evaluate our model, we check that we have met the following conditions.

Condition 1: Is there linearity? Note that by generating the model itself, we are forcing this condition. Our later conditions will constrain this further to something more tangible!

Condition 2: Is the sample random? This requires background knowledge of your sample. An easy way this gets violated is by clustering. Ex: Rather than randomly sampling 1000 students, I randomly select 100 classes and take 10 students from each. This clustering effect violates the random assumption, so we cannot use the model to make any claims based on inference.

Condition 3: Verify there are no cases of perfect multicollinearity between variables. In other words, if I have weight in pounds and weight in kgs as two variables in a model, this would lead to inaccuracies as the two are essentially the same thing. Even if variables aren't perfectly correlated, it does reduce the precision of our model if they are highly correlated. To test this, run the vif() command on the model as shown below and view the output in the stdout tab. As best practice, values above 4 signify variables that are highly correlated with another variable. To remedy this, consider dropping one of the highly multicollinear variables from your regression model!

print(vif(model))

In our example here, the vif for both variables is under 2, so we meet this condition.

Condition 4: Ensure there is no trend in the residual plot. Residuals are the difference between the predicted and the actual values. In a bivariate regression, we plot the residuals against the explanatory variable x. For a multivariable regression, like the one here, plot the residuals against the predicted y values (this is more efficient and captures interactions between the explanatory variables better than plotting residuals against every explanatory variable in your model!). We can quickly pull this plot in R with the command below. In this plot, we want the red line to be as flat as possible.

plot(model, which = 1)

Given the small sample size, the line is actually pretty rough here. However, you should aim for a flatter line in your models with larger samples!

Condition 5: Verify that the variance in the residuals is consistent. In other words, there should be the same amount of "scatter" moving left to right in the above chart. Another way of assessing this criterion is by checking that the red line in plot(model, which = 3) is as flat as possible. If there is variance in the "thickness" of the band formed by all the residuals, we say the model is heteroscedastic. This doesn't necessarily mean our model is bad, but it means we need to use a different set of standard errors than those presented in the summary(model) command. This is why we used coeftest(model, vcov = vcovHC) earlier.
Note, unless you are absolutely confident your model has equal variance in the residuals (is homoscedastic), we recommend using the heteroscedasticity-robust errors from the coeftest(model, vcov = vcovHC) command.

plot(model, which = 3)

Condition 6: Normality of residuals - we want to ensure the distribution of error terms is normal. We can do this in 2 ways: (1) run the command plot(model, which = 2) and assess whether the points match the dotted line as closely as possible, or (2) run hist(model$residuals) and assess whether the histogram looks like a bell curve.

plot(model, which = 2)
hist(model$residuals)

A note on outliers: Lastly, we want to understand if any outliers are driving the results of the model. We can uncover this using the plot(model, which = 5) command. This gives us a residuals vs leverage plot. You can read more details here if you're interested in learning more, but essentially we are looking for values in the top right and bottom right regions outside of the red boundary. This signifies that the data point is an outlier. Note we don't necessarily want to remove the outlier unless it's an unrealistic value. However, knowing outliers exist is important context in the model generation process.

plot(model, which = 5)

Part 4: Validating The Model Against The Test Data

Note, all of the above steps are done on the same dataset used to train the model. However, it is always a good idea to check how these conditions hold against the testing dataset as well. We show how to do this below.

Assuming you have randomly assigned values into your testing and training datasets, we can assume that Conditions 1, 2, and 3 on the testing dataset are in line with the training dataset. Therefore, let's jump to Conditions 4 and 5. We run the following commands to get a column of predicted values from our model on the test dataset, along with the corresponding residuals (since the model predicts log(mpg), the residuals are log(mpg) minus the prediction).

test$predicted.mpg <- predict(model, test)
test$residuals <- log(test$mpg) - test$predicted.mpg

Pass this table into periscope.table as shown below.

periscope.table(test)

Plotting the predicted values against the residuals as a scatter plot using the Periscope UI and adding a trendline gives the residual plot. Since this dataframe is pretty small, we don't have very many values in the testing dataset, so this particular example is admittedly not that insightful here.

For Condition 6, the distribution of residuals, we comment out the periscope.table(test) line and run the following commands.

residual_distribution <- hist(test$residuals)
periscope.image(residual_distribution)

Remember, we want this distribution to be as bell-curved as possible, but given the small sample size, any such pattern is hard to see in this particular example.

If you made it this far and all of your tests have checked out, congratulations! You have a robust regression model that can be used for inference, with the statistics to back it 🙂

Want something above elaborated further? Post in the comments below, and I may write a follow up post addressing FAQs 😉

Window Functions!
Window functions are a great way to analyze your data from a different angle! Similar to other calculation statements in SQL, window functions allow you to do a calculation across a set of rows or columns. The difference is that with window functions the data is not grouped into a single output row; a value is returned for every row in the window. The returned values are calculated by utilizing the values from the set of rows in that window. The window defines, for each row in the table, a set of rows that is used to compute additional attributes. A window is defined using a window specification (the OVER clause), and is based on three main concepts:

Window Partitioning, which forms groups of rows (PARTITION clause)
Window Ordering, which defines an order or sequence of rows within each partition (ORDER BY clause)
Window Frames, which are defined relative to each row to further restrict the set of rows (ROWS specification)

Two types of functions can be used in window functions: aggregation functions (AVG, COUNT, SUM, etc.) and ranking functions (RANK, ROW_NUMBER, etc.).

1. When do I use a Window Function?

There are many useful cases. Window functions can be leveraged to summarize information at a different level. For example, the query below, without a window function, will group the price column by user_id and platform and give us summarized results.

Select user_id
  , platform
  , [Sum(price):$] total_user_spent
From Purchases
Group By user_id
  , platform

This would give you the following results:

If you wanted to see the individual rows of user ids, platforms, and prices, but also see the total spend by user, you could use a window function! The query below will give you the detailed results of the user id, individual price, and platform, along with the total spend by user.

Select user_id
  , [price:$] Price
  , platform
  , [Sum(price) Over(Partition By user_id):$] As total_user_spent
From Purchases

This would give you the following results:

2. How do I set up a Window Function?

The basic structure of a window function is:

Select < Window_Function > Over(Partition By < Field >) As < Field_Name >
From < Table >

3. What functions are used in Window Functions?

The functions below can be used in window functions. Note that the available functions and syntax may be different between databases. Window functions are available in PostgreSQL, SQL Server, Redshift, and Oracle (which calls them "Analytic Functions").

AVG
COUNT
CUME_DIST
DENSE_RANK
FIRST_VALUE
LAG
LAST_VALUE
LEAD
MAX
MEDIAN
MIN
NTH_VALUE
NTILE
PERCENTILE_CONT
PERCENTILE_DISC
PERCENT_RANK
RANK
RATIO_TO_REPORT
ROW_NUMBER
STDDEV_POP
STDDEV_SAMP
SUM
VAR_POP
VAR_SAMP

And some more! For more examples of using window functions, check out this amazing blog post. For using window functions with date series, check out this great blog post.
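As a side note, if you are working in the Python/R integration rather than pure SQL, pandas can express the same "keep the detail rows, but add a partition-level total" idea with a grouped transform. This is a minimal sketch under assumed column names (user_id, platform, price) and made-up data, not Sisense-specific syntax:

# Minimal pandas sketch of Sum(price) Over(Partition By user_id)
# (hypothetical data; column names are assumptions to mirror the SQL example above)
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'platform': ['iOS', 'web', 'iOS', 'iOS', 'web'],
    'price': [5.0, 10.0, 2.5, 7.5, 4.0],
})

# transform() returns one value per input row, just like a window function
df['total_user_spent'] = df.groupby('user_id')['price'].transform('sum')
print(df)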
Wildcard Filter For Comma Separated List Of Values

You can easily create a wildcard filter that allows you to filter by a word or series of words on your chart using this post on Wildcard Filters. However, let's say you have a list of comma separated filter values that you'd like to filter on, and there are too many to enter one by one. With a find and replace tool (like this one) and a few lines of Python, we can filter by a given list of words easily.

First, create a filter that "Allows users to enter values." Then use the find and replace tool to replace all commas with a dash - (this is because Sisense for Cloud Data Teams removes special characters from filter values, but dashes do not get removed). Then include the filter in the select clause of your SQL query with single quotes around it, like below.

select lower(first_name) ||' '|| lower(last_name) as name
  , '[wildcard_filter]' as filter
from users

Then, in the Python editor, you can use these 3 lines to manipulate the string of filter values, filter the results, and then remove the filter column that we added in the SQL.

# SQL output is imported as a dataframe variable called 'df'
import pandas as pd

# replace the dashes with the regex pipe
all_filters = df['filter'][0].replace('-', '|')

# filter dataframe to only return results where the name matches the wildcard filter
df = df[df['name'].str.contains(all_filters)]

# drop the filter column to return desired output
df.drop('filter', axis=1, inplace = True)

periscope.table(df)

Here is an example of the results:
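One thing to keep in mind: str.contains() is case sensitive by default, while the SQL above lowercases the name column. If your filter values might arrive in mixed case, a minimal variation of the same filtering step (same assumed 'name' and 'filter' columns as above) is to match case-insensitively:

# Same filtering step as above, but ignoring case in the match
# (assumes the same 'name' and 'filter' columns produced by the SQL above)
all_filters = df['filter'][0].replace('-', '|')
df = df[df['name'].str.contains(all_filters, case=False)]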
Waterfall Charts -- Plot.Ly

Waterfall charts are a great way to highlight changes in your data between steps or periods. The following Python script easily plugs in with your SQL output. Simply paste this into your Python 3.6 code box provided with the Python/R Integration, ensuring that your SQL output has 2 columns:

x_data: the values along the x axis
y_data: the y value for each x. Note that this is not the change between x and y, but rather the final value of y at each step

Code is inspired by the examples in plot.ly's documentation. Refer to the documentation for further customization options.

Python 3.6 Code

# SISENSE FOR CLOUD DATA TEAMS WATERFALL TEMPLATE
# SQL output should have 2 columns:
# 1) x_data: the values along the x axis
# 2) y_data: the y value for each x. Note that this is not the change between x and y, but rather the final value of y at each step

# Import libraries
import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go

# Create dummy data
x_data = ['Step 1', 'Step 2', 'Step 3', 'Step 4', 'Step 5', 'Step 6', 'Step 7']
y_data = [20, 30, 15, 10, -10, -5, 15]
data = {'x_data': x_data, 'y_data': y_data}
dummy_df = pd.DataFrame(data)

community_post = 'https://community.periscopedata.com/t/630sck'

# HELPER FUNCTION: For annotation text
def style_link(text, link, **settings):
    style = ';'.join([f'{key.replace("_","-")}:{settings[key]}' for key in settings])
    return f'<a href="{link}" style="{style}">{text}</a>'

# MAIN FUNCTION: input a dataframe
def plot(df, annotation = None):
    # Split our dataframe for easier work
    x_data = df['x_data']
    y_data = df['y_data']

    # Initiate y_bottom array to denote the starting point for all the waterfall bars
    y_bottom = [0]

    # Initiate the y_length array to denote the length of the waterfall "drop"
    y_length = [y_data[0]]

    # Initiate a color array that will show red bars for negative change and green bars for positive change
    if (y_data[0] > 0):
        y_color = ['rgba(33,196,128,1.0)']
    else:
        y_color = ['rgba(214,24,90,1.0)']

    # Calculate remaining bar positioning and appropriate colors, green for a positive change and red for a negative change
    for i in range(1, len(y_data)):
        dist = y_data[i] - y_data[i-1]
        length = abs(dist)
        y_length.append(length)
        if (y_data[i] > y_data[i-1]):
            bottom = y_data[i-1]
            color = 'rgba(33,196,128,1.0)'
        else:
            bottom = y_data[i-1] - length
            color = 'rgba(214,24,90,1.0)'
        y_bottom.append(bottom)
        y_color.append(color)

    # CREATE PLOT.LY GRAPH
    # bottom_of_bar is a transparent series. The length bar stacks on top of the bottom_of_bar
    bottom_of_bar = go.Bar(
        x = x_data,
        y = y_bottom,
        marker=dict(
            color='rgba(0,0,0, 0.0)',
        )
    )

    # length_of_bar stacks on top of the bottom_of_bar
    length_of_bar = go.Bar(
        x = x_data,
        y = y_length,
        marker=dict(
            color = y_color
        )
    )

    # Putting our data together
    data = [bottom_of_bar, length_of_bar]

    # Formatting includes chart titles and margin sizing
    layout = go.Layout(
        title='Waterfall Chart Example',
        barmode='stack',
        showlegend=False,
        margin=dict(
            l=50,
            r=50,
            b=50,
            t=50
        )
    )

    # Add an annotation if the SQL output is in the incorrect format
    if annotation is not None:
        layout['annotations'] = [annotation]

    # Plot the figure
    fig = go.Figure(data=data, layout=layout)
    periscope.plotly(fig)

# We try to plot the SQL output. If it is not in the correct format, the dummy data will display with a watermark. See stdout tab for the error message.
try:
    plot(df)
except Exception as e:
    print(e)
    annotation = {
        'x': 0.5,
        'y': 0.5,
        'ax': 0,
        'ay': 0,
        'xref': 'paper',
        'yref': 'paper',
        'text': style_link('DUMMY<br><br><br><br>DATA<br><br><br><br>EXAMPLE', community_post, font_size='60px', font_weight='bold', color='rgba(0, 0, 0, .25)'),
        'showarrow': False,
        'textangle': -25
    }
    plot(dummy_df, annotation=annotation)

Don't have access to plot.ly? You can create something similar using Sisense for Cloud Data Team's default charting libraries!

Using R To Plot Only The Rolling Average Line
Want to add a rolling average to your data? Using the Python / R integration, we can use a few quick lines in R to accomplish this.

Here are the first few rows of our SQL output - a list of dates with the number of users created on a fictional gaming platform.

We now use the rollmean() function from the zoo library in R, as shown below. Note from the rollmean() documentation that you can just as easily calculate a host of other rolling calculations, such as a rolling median or sum.

library(zoo)

# Calculating the rolling average. Window is set to 10 here, corresponding to an average of the current row and the 9 preceding rows
window <- 10
padding <- rep(NA, window - 1)

df$rollingavg <- c(padding, rollmean(df$number_users, k = window))
periscope.table(df)

Now, we make this chart a line graph to visualize the rolling average line by itself!

Tip: If you want to display both the raw data and the rolling average, Sisense for Cloud Data Teams' built-in visualizations have a quick "Show Rolling Average" check box that you can toggle on.

Prefer Python? Find the Python equivalent to this post here.

Using Python To Plot Only The Rolling Average Line
Want to add a rolling average to your data? Using the Sisense for Cloud Data Teams Python / R integration, we can accomplish this with a single line of Python code.

Here are the first few rows of our SQL output - a list of dates with the number of users created on a fictional gaming platform.

We now use the rolling() function from Python pandas, as shown below. Note that by changing the .mean() to a .sum(), we can make this a rolling sum calculation instead (see the sketch at the end of this post). Neat, huh?

# SQL output is imported as a pandas dataframe variable called "df"
import pandas as pd

df["rolling"] = df["number_users"].rolling(10).mean()

# Use Sisense for Cloud Data Teams to visualize a dataframe or an image by passing data to periscope.output()
periscope.output(df)

Now, we make this chart a line graph to visualize the rolling average line by itself!

Tip: If you want to display both the raw data and the rolling average, Sisense for Cloud Data Teams' built-in visualizations have a quick "Show Rolling Average" check box that you can toggle on.

Prefer R? Find the R equivalent for this post here.
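And here is the rolling-sum variant mentioned above - a minimal sketch that assumes the same "number_users" column and SQL-output dataframe as the example in this post. The min_periods argument is an optional extra that starts the calculation before a full 10-row window is available, so the first rows are not left as NaN.

# Rolling sum instead of rolling average (assumes the same "number_users" column as above)
import pandas as pd

df["rolling_sum"] = df["number_users"].rolling(10).sum()

# Optional: min_periods=1 computes partial sums for the first rows instead of returning NaN
df["rolling_sum_padded"] = df["number_users"].rolling(10, min_periods=1).sum()

periscope.output(df)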