Weighted Vs Unweighted Averages
When summarizing statistics across multiple categories, analysts often have to decide between weighted and unweighted averages. An unweighted average is your familiar method of taking the mean. Let's say 0% of users logged into my site on Day 1, and 100% of users logged in on Day 2. The unweighted average for the 2 days combined would be (0% + 100%)/2 = 50%. Weighted averages take the sample size into consideration. Let's say in the example above, there was only 1 user enrolled on Day 1 and 4 users enrolled on Day 2, making a total of 5 users over the 2 days. The weighted average is 0% * (1/5) + 100% * (4/5) = 80%. Typically, analysts want weighted averages, since they prevent categories with smaller sample sizes from skewing the result.

If we want to add a row with a weighted average, we can accomplish this via SQL as shown in the example below (note that this example leverages Redshift syntax). Let's walk through how to calculate each of these measures. Say our original table, t1, has one row per month with the columns month, perc_purchases_over_10_dollars, and sample_size.

Here is how to calculate the weighted average; to append the extra 'Total' row, I used a SQL union all.

```sql
select month::varchar(12)
  , (round(perc_purchases_over_10_dollars * 100, 2) || '%') as perc_purchases_over_10_dollars
  , sample_size
from t1

union all

select 'weighted avg'
  , (round(sum(perc_purchases_over_10_dollars * sample_size) / sum(sample_size) * 100, 2) || '%')
  , sum(sample_size)
from t1
group by 1
```

The result appends a 'weighted avg' row beneath the monthly rows.

Conversely, if we would prefer an unweighted average, we can simply union an avg() of each of the categories. (The additional round/decimal casting is for formatting purposes.)

```sql
select month::varchar(15)
  , (round(perc_purchases_over_10_dollars * 100, 2) || '%') as perc_purchases_over_10_dollars
  , sample_size
from t1

union all

select 'unweighted avg'
  , (round(avg(perc_purchases_over_10_dollars) * 100, 2)::decimal(6,2) || '%')
  , sum(sample_size) as sample_size
from t1
```

Here, we can see that the result differs from the weighted average example (12.04% as opposed to 12.00%). Note that unweighted and weighted averages are equal if each category has the same sample size.

***Interested in learning more about adding a 'Total' Row or Column? Check out our community post here.***
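If you prefer to sanity-check the two calculations outside the warehouse, here is a minimal pandas sketch of the same logic. The column names mirror t1 above, but the values are placeholders rather than the data from the example.

```python
import numpy as np
import pandas as pd

# Placeholder data shaped like t1: one row per month with the share of
# purchases over $10 and the number of purchases observed that month.
t1 = pd.DataFrame({
    'month': ['2019-01-01', '2019-02-01', '2019-03-01'],
    'perc_purchases_over_10_dollars': [0.10, 0.12, 0.15],
    'sample_size': [50, 400, 120],
})

# Unweighted average: every month counts equally.
unweighted = t1['perc_purchases_over_10_dollars'].mean()

# Weighted average: each month contributes in proportion to its sample size.
weighted = np.average(t1['perc_purchases_over_10_dollars'],
                      weights=t1['sample_size'])

print(f'unweighted: {unweighted:.2%}, weighted: {weighted:.2%}')
```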
Waterfall Charts -- Plot.Ly

Waterfall charts are a great way to highlight changes in your data between steps or periods. The following Python script plugs in easily with your SQL output: simply paste it into the Python 3.6 code box provided with the Python/R Integration, ensuring that your SQL output has 2 columns:

- x_data: the values along the x axis
- y_data: the y value for each x. Note that this is the final value of y at each step, not the change from the previous step.

The code is inspired by the examples in plot.ly's documentation; refer to the documentation for further customization options.

Python 3.6 Code:

```python
# SISENSE FOR CLOUD DATA TEAMS WATERFALL TEMPLATE
# SQL output should have 2 columns:
# 1) x_data: the values along the x axis
# 2) y_data: the y value for each x. This is the final value of y at each step, not the change from the previous step

# Import libraries
import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go

# Create dummy data
x_data = ['Step 1', 'Step 2', 'Step 3', 'Step 4', 'Step 5', 'Step 6', 'Step 7']
y_data = [20, 30, 15, 10, -10, -5, 15]
data = {'x_data': x_data, 'y_data': y_data}
dummy_df = pd.DataFrame(data)
community_post = 'https://community.periscopedata.com/t/630sck'

# HELPER FUNCTION: For annotation text
def style_link(text, link, **settings):
    style = ';'.join([f'{key.replace("_","-")}:{settings[key]}' for key in settings])
    return f'<a href="{link}" style="{style}">{text}</a>'

# MAIN FUNCTION: input a dataframe
def plot(df, annotation=None):
    # Split our dataframe for easier work
    x_data = df['x_data']
    y_data = df['y_data']

    # Initiate y_bottom array to denote the starting point for all the waterfall bars
    y_bottom = [0]

    # Initiate the y_length array to denote the length of the waterfall "drop"
    y_length = [y_data[0]]

    # Initiate a color array that will show red bars for negative change and green bars for positive change
    if (y_data[0] > 0):
        y_color = ['rgba(33,196,128,1.0)']
    else:
        y_color = ['rgba(214,24,90,1.0)']

    # Calculate remaining bar positioning and appropriate colors, green for a positive change and red for a negative change
    for i in range(1, len(y_data)):
        dist = y_data[i] - y_data[i-1]
        length = abs(dist)
        y_length.append(length)
        if (y_data[i] > y_data[i-1]):
            bottom = y_data[i-1]
            color = 'rgba(33,196,128,1.0)'
        else:
            bottom = y_data[i-1] - length
            color = 'rgba(214,24,90,1.0)'
        y_bottom.append(bottom)
        y_color.append(color)

    # CREATE PLOT.LY GRAPH
    # bottom_of_bar is a transparent series. The length bar stacks on top of the bottom_of_bar
    bottom_of_bar = go.Bar(
        x = x_data,
        y = y_bottom,
        marker=dict(
            color='rgba(0,0,0, 0.0)',
        )
    )

    # length_of_bar stacks on top of the bottom_of_bar
    length_of_bar = go.Bar(
        x = x_data,
        y = y_length,
        marker=dict(
            color = y_color
        )
    )

    # Putting our data together
    data = [bottom_of_bar, length_of_bar]

    # Formatting includes chart titles and margin sizing
    layout = go.Layout(
        title='Waterfall Chart Example',
        barmode='stack',
        showlegend=False,
        margin=dict(
            l=50,
            r=50,
            b=50,
            t=50
        )
    )

    # Add an annotation if the SQL output is in the incorrect format
    if annotation is not None:
        layout['annotations'] = [annotation]

    # Plot the figure
    fig = go.Figure(data=data, layout=layout)
    periscope.plotly(fig)

# We try to plot the SQL output. If it is not in the correct format, the dummy data will display with a watermark. See the stdout tab for the error message.
try:
    plot(df)
except Exception as e:
    print(e)
    annotation = {
        'x': 0.5,
        'y': 0.5,
        'ax': 0,
        'ay': 0,
        'xref': 'paper',
        'yref': 'paper',
        'text': style_link('DUMMY<br><br><br><br>DATA<br><br><br><br>EXAMPLE', community_post,
                           font_size='60px', font_weight='bold', color='rgba(0, 0, 0, .25)'),
        'showarrow': False,
        'textangle': -25
    }
    plot(dummy_df, annotation=annotation)
```

Don't have access to plot.ly? You can create something similar using Sisense for Cloud Data Teams' default charting libraries!
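To make the bar positioning logic concrete, here is a small standalone sketch that runs the same bottom/length arithmetic as the plot() function above on the dummy series, with no plotting involved.

```python
# Dummy series from the template above: the *final* y value at each step.
y_data = [20, 30, 15, 10, -10, -5, 15]

y_bottom = [0]           # where each bar starts
y_length = [y_data[0]]   # how tall each bar is

for i in range(1, len(y_data)):
    length = abs(y_data[i] - y_data[i - 1])
    y_length.append(length)
    # A rising step sits on top of the previous value; a falling step
    # starts `length` below the previous value.
    if y_data[i] > y_data[i - 1]:
        y_bottom.append(y_data[i - 1])
    else:
        y_bottom.append(y_data[i - 1] - length)

print(y_bottom)  # [0, 20, 15, 10, -10, -10, -5]
print(y_length)  # [20, 10, 15, 5, 20, 5, 20]
```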
Introduction To Hyperparameter Optimization - Machine Learning

There are lots of knobs (a.k.a. hyperparameters) we can turn when coming up with a machine learning model. In the script below, we take the well-known iris dataset and play around with different hyperparameters.

First, a few notes. In machine learning, we generally split our data into 3 sections:

- A training dataset
- A dev dataset
- A test dataset

We train the model on the training dataset, tune hyperparameters based on the dev dataset, and only run the test dataset when we're evaluating our model. Note that we don't want to tune hyperparameters based on the output of our test dataset, to avoid overfitting to the test set as well as the training set.

If you want to quickly iterate through many different hyperparameters, I recommend using a smaller subset of your data to allow for quick processing. Once some of the hyperparameters have been narrowed down, you can dedicate more time and computational resources to running the full training dataset and creating the model in your production workflow.

Without further ado, here is a hyperparameter optimization on K Nearest Neighbors using the corresponding classifier from the Python sklearn library. The example below uses Python 3.6 code.

First, we import our libraries:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import random
```

I set a random.seed() here to make results reproducible:

```python
random.seed(123)
```

In this example, we are building a dataframe which contains the iris dataset. However, for your own purposes, you can very well use your SQL output, which gets passed into the Sisense for Cloud Data Teams Python/R editor as a dataframe named df. Your final dataframe must have a list of features (the predictor components, which you can think of as your "X") and the corresponding label (which you can think of as your "y"). The dataframe below has 3 features: the sepal length, sepal width, and petal length. We also have a column, 'target,' which identifies the iris type.

```python
iris = load_iris()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
df = df.drop('petal width (cm)', axis=1)
```

Next, we shuffle the dataframe to ensure we are getting a representative sample of our data in the training, dev, and test datasets.

```python
df = shuffle(df, random_state=300)
```

Now, we want to split our target column (y) from all of our features (our x values).

```python
df_features = df.drop('target', axis=1)
features = df_features.values
target = df["target"].values
```

Now we split our data into training, dev, and test datasets.

```python
num_rows = df.shape[0]
train_cutoff = int(num_rows * 0.6)
dev_cutoff = int(num_rows * 0.8)

features_train = features[:train_cutoff, :]
features_dev = features[train_cutoff:dev_cutoff, :]
features_test = features[dev_cutoff:, :]

target_train = target[:train_cutoff]
target_dev = target[train_cutoff:dev_cutoff]
target_test = target[dev_cutoff:]
```

Now, we loop through all our hyperparameters. In this example, we are looping through all values of n_neighbors from 1 to 10. This determines how many "neighbors" we are using to classify a given point outside our training dataset. Additionally, we will compare the effectiveness of using "uniform" versus "distance" weights for our model.
Note that:

- A "uniform" weight takes a vote between the N closest neighbors of a point to classify it.
- A "distance" weight gives more importance to those neighbors that are closest to the point. For example, let's say our KNN is looking at n_neighbors of 5. If, of the 5 closest neighbors, the closest of them all is Category A, that neighbor will be given more weight than the furthest of the 5 neighbors.

To evaluate the model, we will run its .predict() function on the dev dataset and score it using the F1 score (F1 is better at capturing false positives and false negatives; more info on this here). Note that you can achieve this logic with GridSearchCV as well (a sketch is included at the end of this post).

```python
all_k = range(1, 11)
uniform = []
distance = []

# Looping through all values of k
for nbrs in all_k:
    knn_uni = KNeighborsClassifier(n_neighbors=nbrs, weights='uniform')
    knn_dist = KNeighborsClassifier(n_neighbors=nbrs, weights='distance')

    pred_uni = knn_uni.fit(features_train, target_train).predict(features_dev)
    pred_dist = knn_dist.fit(features_train, target_train).predict(features_dev)

    f1_uni = f1_score(target_dev, pred_uni, average='macro')
    f1_dist = f1_score(target_dev, pred_dist, average='macro')

    uniform.append(f1_uni)
    distance.append(f1_dist)
```

Finally, we plot our results.

```python
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

ax1.plot(all_k, uniform)
ax1.set_title('Uniform Weights')
ax1.set_ylabel('F1 Score')

ax2.plot(all_k, distance)
ax2.set_title('Distance Weights')

# Use Sisense for Cloud Data Teams to visualize a dataframe, text, or an image by passing data to periscope.table(), periscope.text(), or periscope.image() respectively.
periscope.image(fig)
```

Now we analyze the output. The number of neighbors used to generate the model is on the x axis, with the F1 score on the y axis. An F1 score closer to 1 is more desirable here. We see more fluctuation in the uniform weights scoring compared to the distance weights. This is expected, as we would anticipate the closest neighbors to be more informative when classifying an iris. Therefore, distance weights look like the better option. Secondly, it looks like distance weights with 6-8 neighbors yield the highest F1 score. We would go with the lower end of that range, as n_neighbors of 6 is less computationally intensive than n_neighbors of 8 (we have fewer neighbors to account for when classifying each point). Of course, we used a very small dataset here, so we can expect the lines above to be smoother for a larger dataset. Any other parameters you like to play around with for KNN?

Now that we have found our desired hyperparameters, let's run this on our test dataset! Let's put this in a Sisense for Cloud Data Teams view so it's easy to leverage this logic multiple times without rewriting code. See the post here for further details!
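As a footnote on the GridSearchCV mention above, here is a minimal sketch of the same sweep using sklearn's GridSearchCV. The cv=5 setting is an assumption on my part: GridSearchCV cross-validates on the data you give it rather than scoring against a fixed dev set, so its numbers won't match the plots exactly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# The same grid we swept manually above.
param_grid = {
    'n_neighbors': list(range(1, 11)),
    'weights': ['uniform', 'distance'],
}

# f1_macro mirrors the f1_score(average='macro') used above.
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring='f1_macro', cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```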
Changing Line Styling -- Plot.ly (Python And R)

Want to alter the thickness of the lines in your series, add markers to your lines, or make them dashed or dotted? The plot.ly library support in Sisense for Cloud Data Teams' Python/R integration allows you to tailor these parameters to your heart's desire!

For reference, our SQL output has three columns (week, source, and count) with data from a hypothetical gaming company; the source values are admob, leadbolt, organic, and tapjoy.

Below is the annotated Python 3.6 code. Note that we first need to pivot the data so that each series gets its own column; more information on that can be found here. Plot.ly also has great documentation covering different examples, and many of the options below are outlined in the documentation itself.

```python
# SQL output is imported as a dataframe variable called 'df'

# Import pandas to use dataframe objects
import pandas as pd

# Import plot.ly viz libraries
import plotly.plotly as py
import plotly.graph_objs as go

# Pivot data. Details on how to do this here: https://community.periscopedata.com/t/q5gk76/pivoting-and-melting-dataframes
df = pd.pivot_table(df, index='week', columns='source', values='count')
df = df.reset_index()

# Create traces
# Refer to https://plot.ly/python/line-charts/ for more options
admob_trace = dict(
    x = df.week,
    y = df['admob'],
    mode = 'lines',
    type = 'scatter',
    name = 'admob',
    line = dict(shape = 'linear', color = 'rgb(205, 12, 24)', width = 4, dash = 'dash'),
    connectgaps = True
)

leadbolt_trace = go.Scatter(
    x = df['week'],
    y = df['leadbolt'],
    mode = 'lines+markers',
    name = 'leadbolt',
    line = dict(shape = 'linear', color = 'rgb(10, 12, 240)', dash = 'dash'),
    marker = dict(symbol = "star-diamond", color = 'rgb(17, 157, 255)', size = 12),
    connectgaps = True
)

organic_trace = go.Scatter(
    x = df.week,
    y = df['organic'],
    mode = 'lines',
    name = 'organic',
    line = dict(shape = 'linear', color = 'rgb(10, 120, 24)', dash = 'dot'),
    connectgaps = True
)

tapjoy_trace = go.Scatter(
    x = df['week'],
    y = df['tapjoy'],
    mode = 'lines',
    name = 'tapjoy',
    line = dict(shape = 'linear', color = 'rgb(100, 10, 100)', width = 2, dash = 'dot'),
    connectgaps = True
)

# Setting up the layout settings in the "layout" argument
layout = dict(
    xaxis = dict(title = 'Week'),
    yaxis = dict(title = 'Source'),
    margin = dict(
        l=70,
        r=10,
        b=50,
        t=10
    )
)

data = [admob_trace, leadbolt_trace, organic_trace, tapjoy_trace]
fig = go.Figure(data=data, layout=layout)

# Use Periscope to visualize a dataframe, text, or an image by passing data to periscope.table(), periscope.text(), or periscope.image() respectively.
periscope.plotly(fig)
```

Prefer R Plot.ly? Here's the code for that!

```r
# SQL output is imported as a dataframe variable called 'df'
# Use Periscope to visualize a dataframe or show text by passing data to periscope.table() or periscope.text() respectively. Show an image by calling periscope.image() after your plot.

# Use tidyr to pivot the data. We want a column per series
library(tidyr)

# Use plot.ly for visualization
library(plotly)

# Pivoting data, see this post for further details: https://community.periscopedata.com/t/q5gk76/pivoting-and-melting-dataframes
df = spread(df, source, count)

admob_trace <- df$admob
leadbolt_trace <- df$leadbolt
organic_trace <- df$organic
tapjoy_trace <- df$tapjoy
x <- df$week

data <- data.frame(x, admob_trace, leadbolt_trace, organic_trace, tapjoy_trace)

# Check out Plot.ly's documentation to find more options!
# https://plot.ly/r/reference/
p <- plot_ly(data, x = ~x) %>%
  add_trace(y = ~admob_trace, name = 'admob', type = 'scatter', mode = 'lines',
            line = list(shape = 'linear', color = 'rgb(205, 12, 24)', width = 4, dash = 'dash'),
            connectgaps = TRUE) %>%
  add_trace(y = ~leadbolt_trace, name = 'leadbolt', type = 'scatter', mode = 'lines+markers',
            line = list(shape = 'linear', color = 'rgb(10, 12, 240)', dash = 'dash'),
            marker = list(symbol = "star-diamond", color = 'rgb(17, 157, 255)', size = 12),
            connectgaps = TRUE) %>%
  add_trace(y = ~organic_trace, name = 'organic', type = 'scatter', mode = 'lines',
            line = list(shape = 'linear', color = 'rgb(10, 120, 24)', dash = 'dot'),
            connectgaps = TRUE) %>%
  add_trace(y = ~tapjoy_trace, name = 'tapjoy', type = 'scatter', mode = 'lines',
            line = list(shape = 'linear', color = 'rgb(100, 10, 100)', width = 2, dash = 'dot'),
            connectgaps = TRUE) %>%
  layout(xaxis = list(title = 'Week'), yaxis = list(title = 'Source'))

periscope.plotly(p)
```
What Pro Sport Should You Play?

[Written by Kyle Demeule]

As a Canadian, I was always amazed at some of the salaries baseball or football players were able to get (Jason Heyward getting 21 million a year? He plays outfield, right?), especially compared to hockey (Sidney Crosby, currently one of the best players, gets only 8.7 million). So after hours one day at work, I wanted to actually compare salaries across the four major professional sports in North America (football, baseball, basketball, and hockey). I grabbed salary (well, cap hit) information for the top 500 players in each league from http://www.spotrac.com/ and graphed it in millions (see attached). So maybe the NHL doesn't have those extreme salaries, but if you're OK with a couple million, it's still alright 🙂

I used Beautiful Soup to extract the information, and can share the code if anyone is interested.
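The scraping code isn't included in the post, but a generic sketch of the technique mentioned (pulling rows out of an HTML salary table with requests and BeautifulSoup) might look something like the following. The URL path and table markup here are placeholders, not spotrac's actual page structure.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page; the real spotrac pages have their own paths and markup.
url = 'http://www.spotrac.com/some-league/rankings/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

rows = []
for row in soup.find_all('tr'):  # walk every table row on the page
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        rows.append(cells)

print(rows[:5])  # first few rows of raw cell text
```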
Chart Type -- Control Chart

Not everyone has to be a Six Sigma black belt to appreciate and use a well-designed control chart. The top and bottom red lines indicate one standard deviation above and below the mean, the green line indicates the average, and the black line indicates the values over time. Here's an easy parameterized snippet to generate the chart type:

Title: control_chart(table, value_field, x_axis)

```sql
select [x_axis]
  , [value_field]
  , average
  , average - stdev as stdev_minus_one
  , average + stdev as stdev_plus_one
from [table]
  , (
    select avg([value_field]) as average
    from [table]
  ) as a
  , (
    select stddev_samp([value_field]) as stdev
    from [table]
  ) as s
```
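If you'd rather compute the same bands in Python instead of SQL, here is a minimal pandas sketch of the idea. The 'day' and 'value' columns are placeholders standing in for your [x_axis] and [value_field].

```python
import pandas as pd

# Placeholder data standing in for [table].
df = pd.DataFrame({'day': pd.date_range('2019-01-01', periods=5),
                   'value': [10.0, 12.0, 9.0, 14.0, 11.0]})

avg = df['value'].mean()
stdev = df['value'].std()  # sample standard deviation, like stddev_samp

df['average'] = avg
df['stdev_minus_one'] = avg - stdev
df['stdev_plus_one'] = avg + stdev

print(df)
```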
Using The Embed Refresh Trigger

The Embed Refresh Trigger PostMessage accepted by embedded dashboards can be used to let public users refresh their view of an embedded dashboard, or to set a refresh rate on those embeds. The data_ts parameter is sufficient for checking data freshness on page load, but some embeds stay open for a long period of time, and the PostMessage can be used to refresh those without requiring a browser refresh.

To set an automatic refresh rate for the embed, the PostMessage call can be put in a setInterval() method. Using JavaScript, that would look like this:

```javascript
setInterval(function() {
  $('iframe')[0].contentWindow.postMessage({event_type: "refresh_charts"}, "*");
}, 60000*10);
```

The above triggers the embed to refresh the charts every 10 minutes (60000 converts minutes to milliseconds). The minimum rate is once per minute, but an interval of 10 minutes or more is recommended.

If a constant refresh is not needed, you can instead opt to add a "refresh charts" button to the parent webpage that triggers the charts to refresh. That could be done using something like this:

```javascript
$("#mybutton").click(function() {
  $('iframe')[0].contentWindow.postMessage({event_type: "refresh_charts"}, "*");
});
```

A button allows users to refresh the charts without setting a refresh rate for the embed, so the number of queries triggered by the embed is not as high.

Please note that since this PostMessage triggers a full refresh of all the charts on the embedded dashboard, it is NOT recommended for sites that already experience heavy query load or query queues. Public embeds can increase the number of queries run against a connection, and increasing the number of refreshes requested may cause further delay and queuing. Embeds that use this API should be restricted to dashboards with few charts and low runtimes.
Graphing Plot.Hist() As A Chart On Your Dashboard

plt.hist() is a very useful function for creating histograms in Python. However, a histogram is ultimately a bar chart that can be graphed by Sisense for Cloud Data Teams as well, which better matches the other charts on the dashboard. The data generated by plt.hist() can be broken out into the underlying data to be graphed like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# 'datatable' stands in for the numeric column you want to bin, e.g. a column of your SQL output dataframe
(n2, bins2, patches) = plt.hist(datatable, 50)
bins2 = bins2[:-1]  # to remove the extra bin label

df = pd.DataFrame({'n2': n2, 'bins2': bins2})
periscope.output(df)
```

For more information on plt.hist() itself, you can find the documentation page here: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html
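For reference, here is a self-contained version of the same idea with imports and dummy data filled in; the random series simply stands in for whatever numeric column your SQL output provides.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy numeric series standing in for a column of your SQL output.
values = np.random.default_rng(0).normal(size=1000)

# plt.hist returns the bar heights (n2) and the bin edges (bins2).
n2, bins2, patches = plt.hist(values, 50)
bins2 = bins2[:-1]  # drop the extra right-hand edge so the lengths match

df = pd.DataFrame({'bins2': bins2, 'n2': n2})
print(df.head())
```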